I am working on a project that is a kind of social network ... studies say that we'll have more than 100,000 users within the first couple of months.
The website is built with PHP and MySQL, and I am searching for the fastest caching engine, since we are talking about caching user data after sign-in.
So we are talking about a huge database, a huge number of records in the same table, a huge number of users and requests, and a huge amount of data to cache.
Please note that, as a first step, the website will be hosted on a shared server before it is moved to a dedicated server (it's the client's decision, not ours).
Any tip, hint or suggestion is appreciated.
Thanks.
1) Put a lot of thought into a sensible database schema, since changing it later will be painful. MySQL tables are good for doing fast SELECT operations, which sounds appropriate for your app.
2) Don't optimize your code prematurely, i.e. don't worry about a caching solution yet; instead focus on writing modular code so you can easily improve bottlenecks with caching later.
3) After 1 and 2, you need to think about caching based on what will be retrieved and how often. I've seen applications that put user information into the session variable - that will reduce database hits (see the sketch below). If that's not sufficient, look into Memcached. If you have bigger data, maybe Varnish.
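A minimal sketch of the session approach mentioned in point 3, assuming an existing PDO connection in $pdo and a hypothetical users table with id, name and email columns:

```php
<?php
// Cache the signed-in user's row in the session so later requests skip the database.
// $pdo and the `users` table are assumptions for the example.
session_start();

function currentUser(PDO $pdo, int $userId): array
{
    // Serve the profile from the session if we already loaded it once.
    if (isset($_SESSION['user']) && (int) $_SESSION['user']['id'] === $userId) {
        return $_SESSION['user'];
    }

    // Otherwise hit the database once and remember the result for later requests.
    $stmt = $pdo->prepare('SELECT id, name, email FROM users WHERE id = ?');
    $stmt->execute([$userId]);
    $user = $stmt->fetch(PDO::FETCH_ASSOC);

    $_SESSION['user'] = $user;
    return $user;
}
```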
Related
I would like to know what you think about storing chat messages in a database.
I need to be able to bind other stuff to them (like files, or contacts) and using a database is the best way I see for now.
The same question applies to files: because they can be bound to chat messages, I would have to store them in the database too.
With thousands of messages and files I wonder about performance drops and database size.
What do you think considering I'm using PHP with MySQL/Doctrine?
I think it is OK to store any textual information in the database (names, message history, etc.) provided that you structure your database properly. I have worked for big websites (many thousands of visits a day) and telecom companies that store information about their users (including their traffic statistics) in databases that have grown to hundreds of gigabytes, and the applications worked fine.
But for binary information like images and files, it is better to store the data on the file system and keep only the paths in the database, because it is cheaper to read files off the disk than to tie up a database process reading a multi-megabyte file.
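The same idea as a rough sketch in code: save the uploaded file on disk and keep only its path in MySQL. The attachments table, its columns and the $pdo connection are assumptions for the example.

```php
<?php
// Store the bytes on disk and only the path (plus metadata) in the database.
// $pdo and the `attachments` table are hypothetical.
function storeAttachment(PDO $pdo, int $messageId, array $uploadedFile): string
{
    $dir = __DIR__ . '/uploads/' . date('Y/m');
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }

    // Collision-free name on disk; the original name is kept in the database.
    $path = $dir . '/' . bin2hex(random_bytes(16));
    move_uploaded_file($uploadedFile['tmp_name'], $path);

    $stmt = $pdo->prepare(
        'INSERT INTO attachments (message_id, path, original_name) VALUES (?, ?, ?)'
    );
    $stmt->execute([$messageId, $path, $uploadedFile['name']]);

    return $path;
}
```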
As I said, it is important that you do several things:
Structure your information properly - it is very important to design your database well, dividing it into tables and tables into fields with your performance goals in mind, because this will form the basis for your application and queries. Get that wrong and your queries will be slow.
Make proper decisions about the storage engine for every table. This is an important step because it will greatly affect the performance of your queries. For example, MyISAM blocks read access to a table while it is being updated. That will be a problem for a web application like a social network or a news site, because in many situations your users will basically have to wait for an update to complete before they see a generated page.
Create proper indexes - very important for performance, especially for applications with rapidly growing big databases.
Measure the performance of your queries as the data grows and look for ways to improve it - you will always find bottlenecks that have to be removed; this is an ongoing, non-stop process that every popular web application has to go through.
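A quick way to do that kind of measurement is to run EXPLAIN against a suspect query and check whether it can use an index or has to scan the whole table. The messages table and the $pdo connection here are assumptions for the example.

```php
<?php
// Run EXPLAIN on a query to see whether MySQL uses an index or scans the table.
// $pdo and the `messages` table are hypothetical.
$sql = 'SELECT id, body FROM messages WHERE chat_id = 42 ORDER BY created_at DESC LIMIT 50';

foreach ($pdo->query('EXPLAIN ' . $sql, PDO::FETCH_ASSOC) as $row) {
    // type=ALL together with a large "rows" estimate usually means a missing index.
    printf("table=%s type=%s key=%s rows=%s\n",
        $row['table'], $row['type'], $row['key'] ?? 'NULL', $row['rows']);
}
```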
I think a NoSQL database like CouchDB or MongoDB is an option. You can also store the files separately and link them via a known filename, but it depends on your system architecture.
This is something I am really curious about, and I do not really understand how it is possible.
So let's say I am the owner of Facebook (haha) and I have millions of people visiting my website every day, plus thousands and thousands of images, videos, logs, etc.
How do I store all this data?
Do I have multiple databases on different servers around the world and then connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example, I know that Facebook has a lot of data centers around the world and hundreds of servers.
How do they connect to these servers? Are the profiles stored in different locations, and when I connect to my profile, am I then using that specific server? Or is there one main server supported by hundreds of other servers around the world?
Is there a way to use PHP so that I can connect to different servers and different MySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since I might one day work on a successful website, I really want to know what I would have to do and what the logic behind it is.
Thank you very much.
I'll try to answer your (big) question, but not from Facebook's point of view, since their architecture is already fairly well known.
The first thing you have to know is that you will have to distribute the workload of your web application. The question is how, so in order to determine what's going to be slow, you have to divide your app into segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have HTTP server software, let's say Apache, and it handles incoming connections. Since Apache creates a process or thread per connected user, it requires a certain amount of memory for that. Eventually it will run out of memory, and then the shit hits the fan: stuff stops working and your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you have successfully scaled Apache and you have a cluster of machines to which you can add new machines in order to scale out. You've solved your first problem.
The next part is the actual layer that does the work: it accepts input from the user and saves it somewhere (MySQL), and that's the biggest problem you'll have. Why?
Because of the database.
Databases store their data on media such as hard drives. Hard drives, be they SSDs or mechanical ones, are limited in how fast they can write or retrieve data. If I'm not mistaken, RAM operates at transfer rates of around 6 GB/s, not to mention that its access time is also much, much lower than a hard drive's.
Therefore, if you have X users asking for a piece of information and you can only deliver it at a certain rate, your app crashes or becomes unresponsive, and the layer handling database queries slows down because the hardware cannot match the speed at which you need the data.
What are the options here? There are many; I won't mention all of them:
Split reads and writes. Set up your database layer so that dedicated machines write the data and completely different ones read it. You have to use replication, and replication has its own quirks - it rarely works without breaking at some point. (A small sketch of this follows after the list.)
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data that your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile - what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively: a system that stores small pieces of data in RAM, is easily scalable, and so on. Most importantly, it's damn quick!
Use different solutions alongside MySQL. MySQL (and some other databases) isn't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk writes (they keep a cached copy of the data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
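A minimal sketch of the read/write split described above, assuming one primary and one replica; the hostnames and credentials are placeholders, and a real setup would also have to account for replication lag.

```php
<?php
// Route writes to the primary and reads to a replica.
// Hostnames and credentials are placeholders for the example.
$primary = new PDO('mysql:host=db-primary.internal;dbname=app', 'app', 'secret');
$replica = new PDO('mysql:host=db-replica.internal;dbname=app', 'app', 'secret');

function dbFor(string $sql, PDO $primary, PDO $replica): PDO
{
    // Very crude routing: SELECTs go to the replica, everything else to the primary.
    // Remember that a fresh write may not be visible on the replica yet.
    return stripos(ltrim($sql), 'SELECT') === 0 ? $replica : $primary;
}

$sql  = 'SELECT id, name FROM users WHERE id = ?';
$stmt = dbFor($sql, $primary, $replica)->prepare($sql);
$stmt->execute([42]);
$user = $stmt->fetch(PDO::FETCH_ASSOC);
```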
The topic of MySQL vs. "insert database or whatever here" is broad and I don't want to go into it, but remember: every single data store out there saves its data to the hard drive eventually. The difference (the physical one, of course) is in how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up - the language gluing the data store (MySQL) and output (HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization at the PHP level is small; even Facebook got modest results (they claim 50%, but that's up to 50%). I tried HipHop extensively, and it is not as fast as it claims to be. Naturally, the Facebook guys have mentioned that already, so it's no wonder. The advantage they get also comes from replacing Apache with their own server built into HipHop. Some people claim "language X is better than language Y", and sometimes they're right, but it's not always that simple. Each language has its own advantages and disadvantages.
For example, PHP is widespread, but it's slow for certain operations (implementing a trie with over a billion entries, for example). It's great for things like echoing some HTML after parsing the output from the db. It's quick at inserting and retrieving data from the database, and that's about 90% of PHP usage: talk to the db, display the data, done.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post reads more like an informative blog post; I know it isn't filled with sources to back up my claims, but anyone who has done any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some of what I said probably won't sit well with everyone, but in a nutshell this is how you'd go about optimizing your site.
Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP websites across a server farm. One of them involves setting up round-robin DNS: you add a DNS record with a number of different IP addresses, and your DNS then hands out a different IP address each time the record is requested, so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.
By the time you have a website with that many users, you'll already have enough experience to know the answer to this question, and you'll also have a lot of money to pay people to find the optimal architecture for your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups, and you'll have a few name servers which know the location of the servers and some rules about the data stored on each one. When data is searched for, the query is sent to a name server which finds the server(s) where the answer to that particular query can be found. I've also upvoted N.B.'s answer; I think he is mostly right.
For lots of users, you should have a server with plenty of memory and CPU. Configure php.ini to allow more memory usage; a server with that many users should have 4-12 GB of RAM available. Also, save resources by not running a desktop environment on the server. If you have this many users, you might want to consider a CDN and a queue for database requests.
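For the php.ini point above, the relevant directive is memory_limit; as a sketch, it can also be raised at runtime for a single script (the value here is just an example, not a recommendation):

```php
<?php
// Raise the per-request memory ceiling for this script only.
// 256M is an arbitrary example value; set it according to what the server can spare.
ini_set('memory_limit', '256M');

echo ini_get('memory_limit'), "\n"; // confirm what is currently in effect
```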
Hi this is more of an information request really.
I'm currently working on a pretty large event listing website and have started thinking about some caching for the data sets being used.
I have been messing with APC this week and have seen some real improvements during testing; however, what I'm struggling to get my head around is the best practices and techniques required when trying to cache data that changes frequently.
Say, for example, the user hits the home page. By default this displays the latest 10 events, and if that user is logged in those events are location-specific. Is it possible to deploy some kind of caching system when dealing with logged-in states and data that changes frequently? The system currently allows the user to "show more events", which is an AJAX request to pull extra results from the db.
I haven't really found anything on this, as I'm not sure what to search for, but I'm really interested to know the techniques used in advanced caching systems that deal with frequently changing data and data specific to users.
I mean, is it even worth it? Are there other performance boosters for dealing with this sort of requirement?
Any articles, tips or info on this will be greatly appreciated! Please let me know if any other info is required!
Your basic solutions are:
file cache
memcached/redis
APC
Each is used for a slightly different goal.
A file cache is usually something you use when you can pre-render files or parts of them. It is used in templating solutions, partial views (MVC), CSS frameworks - that sort of stuff.
Memcached and Redis are more or less equal, except Redis is more of a NoSQL-oriented thing. They are used for a distributed cache (multiple servers, same cached data) and for storing sessions if you have a cluster of web servers.
APC is good for two things: opcode caching and data caching. It is faster than Memcached, but it works on each server separately (a data-cache sketch follows below).
Bottom line: in a huge project you will use all of them, each for a different task.
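A minimal sketch of using APC as a data cache for the event list from the question, assuming the APC extension's apc_fetch/apc_store functions (apcu_fetch/apcu_store in the newer APCu extension) and a hypothetical loadLatestEvents() helper that queries MySQL:

```php
<?php
// Cache the latest events per location in APC for a short time (60 seconds),
// so frequently changing data is never more than a minute stale.
// loadLatestEvents() and the key name are assumptions for the example.
function latestEvents(PDO $pdo, int $locationId): array
{
    $key    = 'latest_events_' . $locationId;
    $events = apc_fetch($key, $success);

    if (!$success) {
        $events = loadLatestEvents($pdo, $locationId); // hits the database
        apc_store($key, $events, 60);                  // expire after 60 seconds
    }

    return $events;
}
```

A short TTL like this is a common compromise for data that changes often: you still absorb most of the read traffic without serving results that are noticeably out of date.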
So you have opcode caching, which speeds things up by keeping the compiled form of your PHP files in memory.
Then you have data caching, where you save variables or objects that take time to produce, like data built from SQL queries.
Then you have output caching, which is where you save entire blocks of your webpages in files, and output those files instead of building that block of your webpage on each request.
I once wrote a blog post about how to do output caching:
http://www.spotlesswebdesign.com/blog.php?id=17
If it's location-specific and there are a billion locations, your best bet is probably output caching, assuming you have a lot of disk space; but you will have to use your own judgement about what is best, as each situation is very different when it comes to how to apply caching.
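A minimal sketch of file-based output caching with output buffering; the cache directory, key and 60-second lifetime are arbitrary example choices, and the cache/ directory is assumed to exist and be writable.

```php
<?php
// File-based output caching: serve a saved copy of the page block if it is fresh enough,
// otherwise build it, save it, and send it. All names and the 60 s lifetime are examples.
$locationId = isset($_GET['location']) ? (int) $_GET['location'] : 0;
$cacheFile  = __DIR__ . '/cache/home_' . $locationId . '.html';

if (is_file($cacheFile) && time() - filemtime($cacheFile) < 60) {
    readfile($cacheFile);   // serve the cached block and stop here
    exit;
}

ob_start();                 // start capturing everything echoed below
// ... build the block normally: query the db, include templates, echo HTML ...
echo '<ul class="events">...</ul>';

file_put_contents($cacheFile, ob_get_contents(), LOCK_EX); // keep a copy for next time
ob_end_flush();             // and send the freshly built block to the browser
```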
If done correctly, using memcached or similar solutions can give huge boosts to site performance. By altering the cached data directly instead of rehydrating it from the database you can bypass the database entirely for data that either doesn't need to be saved or can be trivially rebuilt. Since the database is often the most critical component in web applications, any load you can take off it is a bonus.
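A minimal sketch of that idea with the PHP Memcached extension, updating a cached value in place instead of re-reading it from MySQL; the server address and key name are placeholders.

```php
<?php
// Keep an "unread messages" counter in Memcached and bump it directly on each new
// message, instead of re-counting rows in the database. Names are placeholders.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'unread_count_user_42';

// Seed the counter if it is not cached yet (e.g. after a cold start).
if ($mc->get($key) === false && $mc->getResultCode() === Memcached::RES_NOTFOUND) {
    $mc->set($key, 0, 3600);
}

// New message arrives: alter the cached data directly - no database round trip.
$mc->increment($key);
```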
On the other hand, making sure your database queries are as light and efficient as possible will have a much larger impact on performance than most cache tweaks.
I am implementing a project in PHP with MySQL. Right now I don't have much data, but I was wondering about the future, when I have a large dataset: it will slow down searches in the table. So to decrease that search time, I was thinking about caching techniques. Which kind of caching, i.e. client-side or server-side, will be good for a large dataset?
Thanks, aby
Server, in my opinion.
A client-side caching technique will have one of two negative outcomes, depending on how you do it:
If you cache only what the user has searched for before, the cache won't be of any use unless the user performs exactly the same search again.
If you cache the whole dataset the user will have to download the whole thing, and that will slow your site down and incur bandwidth expenses.
The easiest thing you can do is just add appropriate indexes to the table you're searching. That will be sufficient for 99% of possible applications and should be the first thing you do, before you think about caching at all.
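For example, adding an index on the column you search by is a one-off statement; the products table, its name column and the $pdo connection are hypothetical here.

```php
<?php
// One-off migration step: index the column used in the search's WHERE clause.
// Table, column and $pdo are hypothetical.
$pdo->exec('ALTER TABLE products ADD INDEX idx_products_name (name)');

// Searches that filter on that column can now use the index instead of a full scan.
$stmt = $pdo->prepare('SELECT id, name, price FROM products WHERE name = ?');
$stmt->execute(['widget']);
```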
Apologies if I've pitched this answer below your level; I'm not sure exactly what you're doing, what you're planning to cache, or how much experience you have.
Pay close attention to indexing in your database schemas. If you do this part right, the database should be able to keep up until your data and traffic is large. The right caching scheme will depend significantly on what your usage patterns are like. You should do testing as your site grows to know where the bottlenecks are and what the best caching scheme will be for your system.
I constantly read on the Internet how it's important to correctly architect my PHP applications so that they can scale.
I have built a simple/small CMS that is written in PHP (think of Wordpress, but waaaay simpler).
I essentially have URLs like such: http://example.com/?page_id=X where X is the id in my MySQL database that has the page content.
How can I configure my application to be load balanced when I'm simply performing PHP read activities?
Would something like Nginx set up as the front door, routing traffic to multiple nodes running the same code to handle example.com/?page_id=X, be enough to "load balance" my site?
Obviously, MySQL is not being load balanced in this situation, but for simplicity let's consider that out of scope for this question.
These are some well known techniques for scaling such an app.
Reduce DB hits
Most often the bottleneck will be your DB, so cache recent pages to reduce DB activity, perhaps in something like Memcached.
Design your schema such that it is partition-able.
In the simplest case, separate your data into logical partitions and store each partition in a separate MySQL DB. Craigslist, for example, partitions data by city and, in some cases, by section within that. In your case, you could partition by id quite simply.
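A minimal sketch of partitioning by id: pages are routed to one of N shards by id modulo the shard count. The DSNs, credentials and pages table are placeholders for the example.

```php
<?php
// Route each page to one of N shards based on its id.
// DSNs, credentials and the `pages` table are placeholders.
$shards = [
    new PDO('mysql:host=shard0.internal;dbname=cms', 'cms', 'secret'),
    new PDO('mysql:host=shard1.internal;dbname=cms', 'cms', 'secret'),
];

function shardForPage(int $pageId, array $shards): PDO
{
    return $shards[$pageId % count($shards)];
}

$pageId = isset($_GET['page_id']) ? (int) $_GET['page_id'] : 0;
$stmt   = shardForPage($pageId, $shards)->prepare('SELECT title, body FROM pages WHERE id = ?');
$stmt->execute([$pageId]);
$page = $stmt->fetch(PDO::FETCH_ASSOC);
```

Note that this scheme is hard to change later: adding a shard changes the modulo and moves existing rows, which is one of the quirks sharding brings with it.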
Manage PHP sessions
Putting nginx in front of a PHP website will not work well if you use sessions: load-balancing PHP has issues because sessions are persisted on local storage by default. Therefore you need to handle session management explicitly. The traditional solution is to store sessions in something like Memcached and look them up by the session cookie.
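One way to do that, assuming the memcached PHP extension is installed and a Memcached server is reachable at the placeholder address below, is to point PHP's built-in session handler at it (the same two settings can also go in php.ini):

```php
<?php
// Store sessions in Memcached instead of local files, so any web node can read them.
// Requires the memcached extension; the server address is a placeholder.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'cache1.internal:11211');

session_start();
$_SESSION['user_id'] = 42; // now visible to every node behind the load balancer
```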
Don't optimize prematurely.
Focus on getting your application out so that the next order of magnitude of users gets a good experience.
Note: Your main potential pain points are discussed here on SO
No, it is not at all important to scale your application if you don't need to.
My view on this is:
Make it work
Make sure it works correctly - testability, robustness
Make it work efficiently enough to be cost effective to run
Then, if you have so much traffic that your system cannot handle it, AND you've already thrown all the hardware that (sensible) money can buy at it, then you need to scale. Not sooner.
Yes, it is relatively easy to scale read workloads, because you can simply perform reads against read-only database replicas. The challenge is scaling write workloads.
A lot of sites have few writes, even if they're really busy.
The correct approach is to use some kind of load balancer such as:
http://www.softwareprojects.com/resources/programming/t-how-to-install-and-configure-haproxy-as-an-http-loa-1752.html
What this does is forward a given user's session to the same server every time (sticky sessions), so you don't have to worry about sessions and where they are stored at all. What you do have to worry about is how to share the filesystem if the two servers are running on two different machines, especially if you make heavy use of it. Hope the article above helps...