How Facebook maintains a good speed for instant auto-suggestion

Facebook shows instant auto-suggestion results in various situations, such as searching and sending messages.
I think I am correct in calling this functionality 'auto-suggestion'.
If a user has 1000 friends and wishes to send a message to one of them, Facebook suggests that friend's name after a few characters have been typed.
My question is: when pulling the data out of the database to find friends (or in any similar situation) and then processing it, which technique does FB use to keep auto-suggestion fast?
Is it caching the variable, or something else? I would like to know in detail, as I am planning to build a social networking site. My scripting language is PHP.

I think a good chunk of it is not so much PHP, although Facebook is known to use HipHop to compile its PHP.
A more important factor, IMO, would be the database side of things. The query is probably as optimised as it can be, getting back only what it needs. Caching will probably also come into play, i.e. the user's friends have already been retrieved, quite likely returning the most frequently contacted friends. Facebook also has tons and tons of database servers, which can only help speed.
Hope that helps

Probably a data structure like a Patricia trie or a ternary search tree.
Or a suggest tree, like suggesttree.
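
To make the idea concrete, here is a minimal sketch of prefix lookup with a plain (uncompressed) trie, which is the simpler cousin of the Patricia trie mentioned above. It is written in Python for brevity; the class and method names (`NameTrie`, `suggest`) are made up for illustration and do not come from Facebook or from any particular library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.names = []     # full names that end exactly at this node


class NameTrie:
    """Hypothetical helper: case-insensitive prefix lookup over names."""

    def __init__(self, names=()):
        self.root = TrieNode()
        for name in names:
            self.insert(name)

    def insert(self, name):
        node = self.root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.names.append(name)

    def suggest(self, prefix, limit=10):
        # Walk down to the node that represents the typed prefix.
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first walk of the subtree, visiting children alphabetically.
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.names)
            for ch in sorted(current.children, reverse=True):
                stack.append(current.children[ch])
        return results[:limit]
```

Lookup cost depends on the length of the typed prefix and the number of matches returned, not on the total number of names stored.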

Auto-suggesting with 1000 or even 5000 entries is not that hard. You retrieve the whole friend list and store it in an indexed JavaScript array (for example, we used the first letter as the index, so friends['a'] = [andrey, albert]), and then you are searching in memory within a small subset.
The invite window is built in a similar fashion: you build an index of names -> DOM elements, perform the DOM manipulation offline, and attach only the results for people that match the search term.
The friend list is most likely cached in memcached, and Facebook warms up its caches as early as it can; it does not wait until the friend list is actually needed before putting it in memcache. So it is retrieved from memcached, stored in local storage, and searched with efficient JavaScript. No DB involved here.
P.S. I'm not speaking for Facebook, but about a similar solution we designed to handle a fast auto-suggest / invite dialog with 5000+ entries.
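
The first-letter index described above can be sketched roughly like this. It is shown in Python rather than JavaScript just to keep it short; `build_index` and `suggest` are hypothetical names, and the real thing would live in the browser.

```python
from collections import defaultdict


def build_index(friends):
    """Group names by lowercase first letter, e.g. index['a'] = ['Andrey', 'Albert']."""
    index = defaultdict(list)
    for name in friends:
        if name:
            index[name[0].lower()].append(name)
    return index


def suggest(index, typed, limit=10):
    """Scan only the bucket for the first typed letter, not the whole friend list."""
    typed = typed.strip().lower()
    if not typed:
        return []
    bucket = index.get(typed[0], [])
    return [name for name in bucket if name.lower().startswith(typed)][:limit]
```

With a few thousand friends each bucket holds at most a few hundred names, so a linear scan per keystroke is cheap.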

Related

Facebook/Gmail-like web chatbox - what is a good way for a modern chat app to store text messages?

I'm currently building a Facebook-like chatbox, and I have run into several considerations and problems along the way.
I have been googling for useful resources the whole time, such as simple chatbox examples and online tutorials.
My goal is to build one just like the Facebook/Gmail chatbox and CometChat. I know it's hard and that there is a lot to scale behind the scenes, but all I want is to build it as simply as possible and figure out how the Facebook/Gmail chatboxes implement their chat functionality.
Progress:
I have finished a Facebook-like chatbox structure: a sidebar on the right displays the online friends I can chat with, and a popup chatbox at the bottom can be expanded and minimized.
I have also finished simple chatting backed by a MySQL database.
There's a table with 4 columns, 'sender', 'receiver', 'message' and 'time', for storing the conversation.
My chatbox works this way:
1. The user sends a message; my front-end JavaScript takes the message the user typed and sends it to a PHP file on the server via Ajax.
2. The back-end PHP file stores this message in MySQL.
3. The front end calls the update function every 3 seconds to refresh the chatbox content in case the receiver has sent a message to the sender, and shows it in the front-end chat.
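
The three steps above can be sketched as follows. This is only an illustration in Python, using an in-memory SQLite database as a stand-in for MySQL. The `id` column and the `last_id` parameter are assumptions that are not in the original four-column table; they are added so that each poll returns only messages the client has not seen yet, instead of re-reading the whole conversation every 3 seconds.

```python
import sqlite3
import time


def open_db():
    # In-memory SQLite stands in for the MySQL table described above.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # assumed extra column
        " sender TEXT NOT NULL, receiver TEXT NOT NULL,"
        " message TEXT NOT NULL, time REAL NOT NULL)"
    )
    return conn


def send_message(conn, sender, receiver, message):
    """Steps 1-2: store a message that arrived via Ajax."""
    cur = conn.execute(
        "INSERT INTO messages (sender, receiver, message, time)"
        " VALUES (?, ?, ?, ?)",
        (sender, receiver, message, time.time()),
    )
    conn.commit()
    return cur.lastrowid


def fetch_new(conn, receiver, sender, last_id=0):
    """Step 3: return only rows newer than the last id the client has seen."""
    return conn.execute(
        "SELECT id, message FROM messages"
        " WHERE receiver = ? AND sender = ? AND id > ? ORDER BY id",
        (receiver, sender, last_id),
    ).fetchall()
```

The client would remember the highest `id` it has received and send it back with every poll.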
I'm not sure this is a good, long-term way to do it, and I'm really concerned about it.
If the number of users keeps growing, I have to think of ways to scale it well, or my database and server will explode and front-end users might experience high latency when conversations update.
Is BigTable the right way to do this if you have millions of users online?
How does Facebook store its customers' text messages and chat history in the back end?
How do chat apps like WhatsApp store their text messages?
Is it possible to let users chat directly with one another without storing state on the server?
If I want to implement chat history in my chatbox, what is a good way to do it?
I am thinking the server could create a .txt file for each conversation in its file system, with a database table column storing the file path. Is this a good way to handle chat history? I know it's possible to do it this way, but I'm not sure whether it's the right way.
I know this could be a huge, detailed application.
I'm not asking for a detailed implementation, but for the big picture, the concept of building it!
Thank you!

That's a good question and here's an attempt at answering it.
I believe you are thinking about scalability a bit too early. Your IM app might not reach the projected number of users for it to stop performing well. Consider enhancing your small product and scale as you go as much as is needed.
Disk I/O is one of the issues that you will face when scaling your web application. Storing conversations directly on disk as .txt files might not be a reliable solution.
Push your technology stack to its limits before considering changing it or switching to something else. I assume you are using a relational database for your storage (since you mentioned columns and rows, which is not a definitive indicator, but still). There are other options out there that have good benchmarking results at the expense of various other compromises; NoSQL (which you referred to as BigTable) is one of them. Relational databases are great and have long been the industry standard, but there are now alternative solutions that are quite promising.
Look into NoSQL data storage solutions such as the document-based MongoDB and CouchDB, or even Cassandra; there are many others. There is a considerable amount of information about the performance of each under specific circumstances and situations. Choose what is best for the problem at hand and not what is most fashionable or hyped.
Another option would be to outsource your scalability problems to a third-party provider such as Firebase. In this situation all you have to worry about is your product and not what's happening under the hood.
Store only the data that you need and archive or dismiss what you don't.
With scalability there are generally 2 broad categories: Horizontal and Vertical scaling.
Horizontal: means adding more nodes to your system, i.e. adding more server instances to handle the extra load. There are many cloud solution providers out there that make this kind of scaling very cheap and nearly instantaneous.
Vertical: means adding more resources to the node you are currently running your app on, in addition to using specific technologies that allow you to take full advantage of those resources. This optimization happens at the level of the instance resources (i.e. CPU, RAM, disk space, etc.) and of your data storage, programming language of choice, the algorithms you are using, and so on. You might realize that PHP and MySQL aren't the tools for this job, but that's arguable.
Read More about it here
Distributed systems architects / programmers also take advantage of other (faster) programming languages at runtime (such as C, C++ or even Java) to speed up certain tasks. Look into how you can dissect your application into smaller decoupled modules / components that can run independently. (But I'm not sure you will ever reach this stage with an IM client unless it becomes as popular as WhatsApp or Facebook chat.)
I advise you to grab and read a couple of books about scaling web applications and leveraging cloud computing. Study scalable architectures and design your application depending on your business logic based on them.
This is a very broad and complex topic, I'm sure others might have additional interesting insight on the matter.

Recent Interview Q - Manipulate Objects on Page for Multiple Users?

If this isn't appropriate, I apologize, but I wanted to get some feedback on a question I was recently asked during a phone interview. I'm strong on front end development but not very clear on back end programming, something I am trying to remedy.
After I got off the call, I had a bit of l'esprit de l'escalier, I think...
Here's the scenario: You have a simple page where a user is presented with a random image and allowed to move it around the page. At the same time, that user can see other users of the same page who are also moving around their own random images, but no one is allowed to interact with any other user's images.
So, assuming the LAMP stack is in play and jQuery / JavaScript for your front end, describe how you would implement this and prevent these users from taking control of each other's objects. Assume the users are savvy enough to watch the POST calls in Firebug.
I was able to describe a simple interface and control. I was able to describe streaming coordinates to and from a database.
I struggled a bit to think of a good way to protect the information being retrieved while on the call.
After I was off the call, within moments, I thought of a simple method of preventing others from gaining control of this data: not exposing the actual IDs of the objects in the database from which they are called. But I'm still not certain how to do this exactly. I imagine using a PHP engine to abstract the variable calls, using random IDs on the objects each user cannot interact with.
This is not something that I have ever considered when working with php / MySQL, but of course I'm thinking that I probably should, even when beating an open source CMS or something into submission.
So, my question is if someone could describe their own thoughts on this or point me to a resource to help me grok this, and how I would use AJAX / PHP to make this work? Am I on the right track?
I haven't heard if I got the job yet, but though it seems it was a primarily front end role, I think they wanted a bit more familiarity with the LAMP than I was able to demonstrate.
Thanks in advance for any help you can provide. Yes, I will be following up with this on my own, and I'm already putting together some plans to dig deeper into php and MySQL for my own edification.

I just took this up as a challenge myself, to try out new technology, and I found it a quite fun little thing to work on. The approach I took was in node.js using mongodb as storage.
Using socket.io, the manipulation was set up pretty fast. As for protecting the objects from outside interference, I relied on the session ID, which I linked to the object ID. This way, you can safely expose the ID of the object without it getting compromised.
Do note that the manipulation is limited to following the other cursors on the same page.
http://gist.github.com/ThomasHambach/5168951
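
For readers who do not want to open the gist, the idea of linking the session ID to the object ID can be sketched on its own. The following is a hypothetical Python illustration, not the code from the gist: `ImageRegistry`, `register` and `move` are invented names, and the session ID is assumed to come from the server-side session rather than from anything the client can set.

```python
import secrets


class ImageRegistry:
    """Hypothetical server-side record of which session owns which image."""

    def __init__(self):
        self._owner_by_object = {}  # object_id -> session_id
        self.positions = {}         # object_id -> (x, y)

    def register(self, session_id):
        # The object id may be shown to every client; knowing it is not enough.
        object_id = secrets.token_hex(8)
        self._owner_by_object[object_id] = session_id
        self.positions[object_id] = (0, 0)
        return object_id

    def move(self, session_id, object_id, x, y):
        # Reject the update unless the requesting session owns the object.
        owner = self._owner_by_object.get(object_id)
        if owner is None or owner != session_id:
            return False
        self.positions[object_id] = (x, y)
        return True
```

Even if a user replays someone else's object ID from a captured POST, the move is refused because the ownership check happens on the server.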

PHP - detecting changes in external database-driven site

For a homework project, I'm creating a PHP-driven website whose main function is aggregating news about various university courses.
The main problem is this: (almost) each course has its own website. These are usually just plain HTML or built using some simple free CMS.
As a student participating in 6-7 courses, almost every day you go through 6-7 websites checking if there is any news. The idea behind the project is that you don't have to do that; instead, you just check the aggregation site.
My idea is the following: each time a student logs in, go through his course list. For every course, get its website (recursively, like with wget) and create a hash value of it. If the hash is different from the one stored in the database, we know that the site has changed, and we notify the student.
So, what do you think, is this a reasonable way to achieve the functionality?
And if yes, what is (technically) the best way to go about this? I was checking php_curl, but I don't know if it can get a website recursively.
Furthermore, there's a slight problem: I have somewhat limited resources, only a few MB of quota on a public (university) server. However, if that's a big problem, I could use a separate hosting solution.
Thanks :)

Just use file_get_contents, or cURL if you absolutely have to (in case you need cookies).
You can use your hashing trick to check for modifications, but it's not very elegant. What you want to know is when it was last changed. I doubt this information is on the website, but maybe they offer an RSS feed or some web service or API you can use for this purpose.
Don't worry about doing recursive requests. Just make a new request each time.
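
If you do go with the hashing trick, it comes down to very little code. Here is a rough sketch in Python (the PHP version would use file_get_contents plus a hash function in the same way); `has_changed` and the `stored_hashes` dictionary are hypothetical stand-ins for your own function and database table, and fetching the page body is left out.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Fingerprint of the downloaded page body."""
    return hashlib.sha256(body).hexdigest()


def has_changed(url, body, stored_hashes):
    """Compare against the last stored hash, then remember the new one.

    A URL that has never been seen counts as changed.
    """
    new_hash = content_hash(body)
    changed = stored_hashes.get(url) != new_hash
    stored_hashes[url] = new_hash
    return changed
```

Only the hash is stored per course site, which fits within a small quota. Be aware that anything dynamic in the page (dates, counters, session IDs) will change the hash even when there is no real news.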
"When all else fails, build a scraper"

Real time activity feed - code / platform implementation?

I am defining the specs for a live activity feed on my website. I have the back end of the data model done, but the open area is the actual code development, where my development team is lost on the best way to make the feeds work. Is this done purely by writing custom code, or do we need to use existing frameworks to make the feeds work in real time? Some suggestions thrown at me were to use reverse AJAX for this. Someone mentioned having the client poll the server every x seconds, but I don't like this because it creates unwanted server traffic when there are no updates. A push engine like Lightstreamer, which pushes from server to browser, was also mentioned to me.
So in the end: what is the way to go? Is it code related, purely pushing SQL queries, using frameworks, using platforms, etc.?
My platform is written in PHP with CodeIgniter, and the DB is MySQL.
The activity stream will have lots of activities. There are 42 components on the social networking I am developing, each component has approx 30ish unique activities that can be streamed.

Check out http://www.stream-hub.com/

I have been using superfeedr.com with Rails and I can tell you it works really well. Here are a few facts about it:
Pros
Julien, the lead developer is very helpful when you encounter a problem.
Immediate push of new entries for feeds which support PubSubHubbub.
JSON response, which is perfect for parsing however you'd like.
Retrieve API in case the update callback fails and you need to retrieve the latest entries for a given feed.
Cons
Documentation is not up to the standards I would like, so you'll likely end up searching the web to find obscure implementation details.
You can't control how often superfeedr fetches each feed; they use a secret algorithm to determine that.
The web interface allows you to manage your feeds but becomes difficult to use when you subscribe to a lot of them.
The subscription verification mechanism works synchronously, so you need to make sure the object URL is ready for the superfeedr callback to hit it (they do provide an async option, which does not seem to work well).
Overall I would recommend superfeedr as a good solution for what you need.

Top techniques to avoid 'data scraping' from a website database

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from being able to make a copy of the data in the database yet at the same time wants everything publicly available and even a "view all" link to display every record in the db.
Whilst I have put everything in place to prevent attacks such as SQL injection attacks, there is nothing to prevent anyone from viewing all the records as html and running some sort of script to parse this data back into another database. Even if I was to remove the "view all" link, someone could still, in theory, use an automated process to go through each record one by one and compile these into a new database, essentially pinching all the information.
Does anyone have any good tactics for preventing, or even just deterring, this that they could share?

While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:
Rate limit by user account, IP address, user agent, etc... - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address.
Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider...
RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.
Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
robots.txt - to deny obvious web spiders, known robot user agents.
User-agent: *
Disallow: /
Use robots meta tags. This would stop conforming spiders. For instance, this will prevent Google from indexing you:
<meta name="robots" content="noindex,follow,noarchive">
There are different levels of deterrence and the first option is probably the least intrusive.
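
As an illustration of the first option, a sliding-window rate limiter keyed by IP address (or account) can be sketched as below. This is a minimal single-process Python sketch; the class name, the limits, and the injectable `clock` are all hypothetical, and a real site would keep the counters somewhere shared such as memcached.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so it can be tested
        self._hits = defaultdict(deque)   # key -> timestamps of recent hits

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Forget hits that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call `allow(ip_address)` first and respond with an error page or a CAPTCHA when it returns False.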

If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.
You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.
Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.

There are few ways you can do it, although none are ideal.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This really just adds an extra hoop someone would have to jump through to get the data, but it would prevent simple page scraping.
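
One possible way to build such a short-lived hash is an HMAC over the issue time, keyed with a per-session secret. The sketch below is in Python and is only one of several ways to do it; `issue_token`, `verify_token` and the `timestamp:mac` format are assumptions made for illustration, and `session_secret` is assumed to be a random value stored in the server-side session.

```python
import hashlib
import hmac
import time


def issue_token(session_secret: bytes, now=None) -> str:
    """Create a token when the page shell is rendered."""
    issued_at = int(time.time() if now is None else now)
    mac = hmac.new(session_secret, str(issued_at).encode(), hashlib.sha256)
    return f"{issued_at}:{mac.hexdigest()}"


def verify_token(session_secret: bytes, token: str, max_age=10, now=None) -> bool:
    """Check the token sent back with the AJAX call."""
    current = time.time() if now is None else now
    try:
        issued_text, mac = token.split(":", 1)
        issued_at = int(issued_text)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(session_secret, issued_text.encode(), hashlib.sha256)
    if not hmac.compare_digest(mac.encode(), expected.hexdigest().encode()):
        return False  # forged, or issued for another session
    return 0 <= current - issued_at <= max_age
```

A scraper can still load the shell, read the token and make the AJAX call within the time limit, so this only raises the effort required.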

Try using Flash or Silverlight for your frontend.
While this can't stop someone if they're really determined, it would be more difficult. If you're loading your data through services, you can always use a secure connection to prevent middleman scraping.

Force a reCAPTCHA every 10 page loads for each unique IP.

There is really nothing you can do. You can try to look for an automated process going through your site, but they will win in the end.
Rule of thumb: If you want to keep something to yourself, keep it off the Internet.

Take your hands away from the keyboard and ask your client why he wants the data to be visible but impossible to scrape.
He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.
It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.

I don't know why you'd deter this. The customer's offering the data.
Presumably they create value in some unique way that's not trivially reflected in the data.
Anyway.
You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.
Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.

Using something like Adobe Flex - a Flash application front end - would fix this.
Other than that, if you want it to be easy for users to access, it's easy for users to copy.

There's no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper by making each entry slightly unique by adding/changing the HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions but it's still not a real solution and I would say that anyone determined enough would find a way to deal with it.
I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.

What about creating something akin to the bulletin board's troll protection... If a scrape is detected (perhaps a certain amount of accesses per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can then start to present garbage data, like changing a couple of digits of the phone number or adding silly names to name fields.
Turn this off for Google IPs!

Normally, to screen-scrape a decent amount, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:
How do you stop scripters from slamming your website hundreds of times a second?

Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when say tens of pages are being loaded each minute.
This way normal users will probably never see your CAPTCHA but scrapers will quickly hit the limit that forces them to solve CAPTCHAs.
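
A rough sketch of that policy in Python: given how many pages a client has loaded in the current time window, decide how long to delay the response and whether to show a CAPTCHA. The function name and every threshold here are hypothetical; counting the loads per client is assumed to happen elsewhere.

```python
def throttle_decision(recent_loads, captcha_after=10, free_loads=5,
                      base_delay=0.5, max_delay=30.0):
    """Return (delay_seconds, needs_captcha) for the current page load.

    recent_loads is the number of page loads by this client inside the
    current time window, including the one being served now.
    """
    # Loads beyond the free allowance double the delay each time, up to a cap.
    extra = max(0, recent_loads - free_loads)
    delay = 0.0 if extra == 0 else min(max_delay, base_delay * 2 ** (extra - 1))
    # Every captcha_after-th load in the window triggers a CAPTCHA.
    needs_captcha = recent_loads > 0 and recent_loads % captcha_after == 0
    return delay, needs_captcha
```

With these example numbers a visitor who opens a handful of pages sees no delay, while a client loading its eighth page in the window waits two seconds and is challenged on the tenth.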

My suggestion would be that this is illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.
People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.
So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.
