Optimization: Where to process data? Database, Server or Client?

Optimization: Where to process data? Database, Server or Client? - php

I've been thinking a lot about optimization lately. I'm developing an application that makes me think where I should process data considering balancing server load, memory, client, loading, speed, size, etc..
I want to understand better how experienced programmers optimize their code when thinking about processing. Take the following 3 options:
Do some processing on the database level, when I'm getting the data.
Process the data on PHP
Pass the raw data to the client, and process with javascript.
Which would you guys prefer on which occasions and why? Sorry for the broad question, I'd also be thankful if someone could recommend me good reading sources on this.

Database is heart of any application, so you should keep load on database as light as possible. Here are some suggestions
Get only required fields from database.
Two simple queries are better than a single complex query.
Get data from database, process with PHP and then store this processed data into temporary storage(say cache e.g. Memcache, Couchbase, Redis). This data should be set with an expiry time, expiry time totally depends upon type of data. Caching will reduce your database load to a great extent.
Data is stored in normalized form. But if you know in advance that data is going to be requested and producing this data requires joins from many tables, then processed data, in advance, can be stored in separate table and can be served from this table.
Send as few as possible data on client side. Less HTML size will save bandwidth and browser will be able to render page quickly.
Load data on demand(using ajax, lazy loading etc), e.g a image is not visible on a page until user clicks on a tab, this image should be loaded upon user click.

Two thoughts: Computers should work, people should think. (IBM ad from the 1960s.)
"Premature optimization is the root of all evil (or at least most of it) in programming." --Donald Knuth
Unless you are, or are planning to become, Google or Amazon or Facebook, you should focus on functionality. "Make it work before you make it fast." If you are planning to grow to that size, do what they did: throw hardware at the problem. It is cheaper and more likely to be effective.
Edited to add: Since you control the processing power on the server, but probably not on the client, it is generally better to put intensive tasks on the server, especially if the clients are likely to be mobile devices. However, consider network latency, bandwidth requirements, and response time. If you can improve response time by processing on the client, then consider doing so. So, optimize the user experience, not the CPU cycles; you can buy more CPU cycles when you need them.
Finally, remember that the client cannot be trusted. For that reason, some things must be on the server.

So as a rule of thumb, process as much of the data in the database as possible. The cost of creating a new connection to query is very high, so you want to limit it as much as possible. Even if you have to write some very ugly SQL, performing a JOIN will almost always be quicker than performing 2 SELECT statements.
PHP should really only be used to format and cache data. If you are performing a ton of data operations after every request, you are probably storing your data in a format that's not very practical. You want to cache anything that is not changed often in an almost ready to server state using something like Redis or APCu.
Finally, client should never be performing data operations on more than a few objects. You never know the clients resource availability so always keep the client data lean. Perform pagination and sorting on any data sets larger than a few dozen in the back-end. An AJAX request using AngularJS is usually just as quick as performing a sort on 100+ items on an iPad 2.
If you would like further details on any aspect of this answer please ask and I will do my best to provide examples or additional detail.

Related

Database or file record keeping - php

I am creating a record system for my site which will track users and how they interact with my site's pages. This system will record button clicks, page view times, and the method used to navigate away from a page (among other things.) I an considering one of two options:
create a log file and append a string to it for each action.
create a database table and save entries based on user interaction.
Although I am sure that both methods could easily fill my needs, which would be better in the long run. Other considerations:
General page viewing will never cause this data to be read (only added to it.)
Old Data should be archived, but still accessible.
Data will be viewed and searched via web app

As with most performance questions, the answer is 'It depends.'
I would expect it depends on the file system, media type, and operating system of your server.
I don't believe I've ever experienced performance differences INSERTing data into a large, or a small MySQL database. The performance differences manifest when you retrieve that data. The database will almost always outperform queries to files, especially when you want complex or statistical data.
If you are only concerned with the speed of inserting/appending data, and expect a large amount of traffic, build a mock environment and benchmark each approach. If you want to have any amount of speed retrieving that data in a structured way, go with the database.

If you want performance you should inspect the server log, instead of trying to build your log system...

"Caching" identical MySQL results for all users

I'm hoping to develop a LAMP application that will centre around a small table, probably less than 100 rows, maybe 5 fields per row. This table will need to have the data stored within accessed rapidly, maybe up to once a second per user (though this is the 'ideal', in practice, this could probably drop slightly). There will be a number of updates made to this table, but SELECTs will far outstrip UPDATES.
Available hardware isn't massively powerful (it'll be launched on a VPS with perhaps 512mb RAM) and it needs to be scalable - there may only be 10 concurrent users at launch, but this could raise to the thousands (and, as we all hope with these things, maybe 10,000s, but this level there will be more powerful hardware available).
As such I was wondering if anyone could point me in the right direction for a starting point - all the data retrieved will be the same for all users, so I'm trying to investigate if there is anyway of sharing this data across all users, rather than performing 10,000 identical selects a second. Soooo:
1) Would the mysql_query_cache cache these results and allow access to the data, WITHOUT requiring a re-select for each user?
2) (Apologies for how broad this question is, I'd appreciate even the briefest of reponses greatly!) I've been looking into the APC cache as we already use this for an opcode cache - is there a method of caching the data in the APC cache, and just doing one MYSQL select per second to update this cache - and then just accessing the APC for each user? Or perhaps an alternative cache?
Failing all of this, I may look into having a seperate script which handles the queries and outputs the data, and somehow just piping this one script's data to all users. This isn't a fully formed thought and I'm not sure of the implementation, but perhaps a combo of AJAX to pull the outputted data from... "Somewhere"... :)
Once again, apologies for the breadth of these question - a couple of brief pointers from anyone would be very, very greatly appreciated.
Thanks again in advance

If you're doing something like an AJAX chat which polls the server constantly, you may want to look at node.js instead, which keeps an open connection between server and browser. This way, you can have changes pushed to the user when they happen and you won't need to do all that redundant checking once per second. This can scale very well to thousands of users and is written in javascript on the server-side, so not too difficult.

The problem with using the MySQL cache is that the entire table cache gets invalidated on any write to that table. You're better off using a caching solution like memcached or APC if you're trying to control that behavior more precisely. And yes, APC would be able to cache that information.
One other thing to keep in mind is that you need to know when to invalidate the cache as well, so you don't have stale data.

You can use apc,xcache or memcache for database query caching or you can use vanish or squid for gateway caching...

What happens when my PHP website will start having a LOT of members?

This is something I am really curious about and I do not really understand how is that possible.
So lets say I am the owner of Facebook (ahah) and I have million of people visiting my website every day, thousands and thousands of images, videos, logs etc..
How do I store all this data?
Do I have more databases in different servers around the world and then I connect to them from a single location?
Do I use an internal API system that requests info from other servers where the data is stored?
For example I know that Facebook has a lot of data centers around the world and hundreds of servers..
How do they connect to these servers? Are the profiles stored in different locations and when I connect to my profile, I will then be using that specific server? Or is there one main server that has the support of other hundreds of servers around the world?
Is there a way to use PHP in a way that I will connect to different servers and to different mySQL (???) databases to store and retrieve data whenever I want?
Sorry if this looks like a silly question, but since it could happen a day to work on a successful website, I really want to know what I will have to do, and what is the logic behind.
Thank you very much.

I'll try to answer your (big) question but not from Facebook point of view since their architecture is pretty much known.
First thing you have to know is that you would have to distribute the workload of your web application. Question is how, so in order to determine what's going to be slow, you have to divide your app in segments.
First up is the HTTP server, or the one that accepts all the requests. By going to "www.your-facebook.com", you're contacting a service on an IP. Naturally, you would probably have more than one IP but let's say you have a single entry point.
Now what happens? You have an HTTP server software, let's say Apache and it handles incoming connections. Since Apache creates a thread per connected user, it requires certain amount of memory for that operation. Eventually, it will run out of memory and then shit hits the fan, stuff stops working, your site is unavailable.
Therefore, you have to somehow scale this part of your application that connects your PHP code / MySQL db to people who want to interact with it.
Let's assume you successfully scaled your Apache and you have a cluster of computers which can accept new computers in order to scale-out. You solved your first problem.
Next part is the actual layer that does the work. Accepts input from the user and saves it somewhere (MySQL) and that's the biggest problem you'll have - why?
Due to the database.
Databases store their data on mediums such as hard drives. Hard drives, be it an SSD or mechanical one - are limited by their ability to write or retrieve data. If I'm not mistaken, RAM operates at levels of around 6GB/sec transfer rate. Not to mention that the seek time is also much much lower than HDD's one is.
Therefore, if you have an X amount of users asking for a piece of information and you can only deliver it at a certain rate - your app crashes, or it becomes unresponsive and the layer handling database queries becomes slow since the hardware cannot match the speed at which you need the data.
What are the options here? There are many, I won't mention all of them
Split Reads and Writes. Set your database layer in such a way that you have dedicated machines that write the data and completely different ones that read it. You have to use replication and replication has its own quirks - it never works without breaking.
Optimize handling of your data set by sharding your data. Great for read / write performance, screwed up when you need to query multiple shards and merge the data.
Get better hardware, especially storage (such as FusionIO)
Pay for better storage engine (such as TokuDB)
Alleviate load on the database by using caching. The data that your users request probably doesn't change so often that you have to query the db every single time (say you're viewing someone's profile, what's the chance they'll change it every second?). That's why Facebook uses Memcached extensively - a system that stores small pieces of data in RAM, it's easily scalable and what not. Most important, it's damn quick!
Use different solutions next to MySQL. MySQL (and some other databases) aren't good for every type of data storage or retrieval. Someone mentioned NoSQL before. NoSQL solutions are quick, but still immature. They don't do as much as relational databases do. They use methods of delaying disk write (they keep cached copy of data they need to write in RAM) so that they can achieve fast insert rates. That's why it's not unusual to lose data when using NoSQL.
Topic about MySQL vs "insert database or whatever here" is broad, I don't want to go into that but remember - every single one of data stores out there saves data on the hard drive eventually. The difference (physical of course) is how they optimize their flushing to the disk itself.
I also didn't mention various reports you can run by gathering the data (how many men between 19 and 21 have clicked an advert X between 01:15 and 13:37 CET and such) which is what Facebook is actually gathering (scary stuff!).
Third up - the language gluing the data store (MySQL) and output (HTTP server). PHP.
As you can see, most of the work here is already done by Apache and MySQL. Optimization on PHP level is small, even facebook got small results (they claim 50%, but that's UP TO 50%). I tried HipHop extensively, it is not as fast as it claims to be. Naturally, Facebook guys mentioned that already, so it's no wonder. The advantage they get is because they replaced Apache with their own server built in into HipHop. Some people claim "language X is better than language Y" and they're right, but that's not always the case. Each language has its own advantages and disadvantages.
For example, PHP is widely-spread but it's slow for certain operations (implementing a Trie with over 1 billion entries for example). It's great for things like echo some HTML after parsing the output from the db. It's quick to insert and retrieve data from the database, and that's about 90% of the PHP usage - talk to the db, display the data, end.
Therefore, no matter what language you use (say we used C++ instead of PHP), your bottleneck will be the data storage / retrieval layer.
On the other hand, why is using C++ NOT handy? Because there are more people who know how to use PHP than ones who use C++. It's also MUCH slower to develop web apps in C++. Sure, they will execute faster, but who will notice the difference between 1 millisecond and 1 microsecond?
This post is more like an informative blog post, I know it's not filled with resources to back up my claims but anyone who did any work with larger data sets or websites will know that the P.I.T.A. is always the data storage component. Some things that I said probably won't fit with everyone, but in a NUTSHELL this is how you'd go about optimizing your site.

Unfortunately, your question doesn't have a simple answer. For the MySQL portion of it, you would need to investigate database scale-out. You can start looking at it here: http://www.mysql.com/why-mysql/scaleout/mixi.html. There are a number of different ways to set up Apache/PHP web sites across a server farm. One of them involves setting up round robin DNS. This is adding a DNS record with a number of different IP addresses. Your DNS then hands out a different IP address each time the record is requested so that the load is balanced across a number of servers. You can also set up clustering with MySQL, Apache and Heartbeat, but that is more of a high-availability solution than a scaling solution.

When you have a website with so many users you'll already have enough experience to know the answer of the question, you'll also have a lot of money to pay people to find the optimal architecture of your system.
I'm not saying that what I describe below is the Holy Grail, but it is certainly an option:
You will have a big, fragmented database with lots of backups and you'll have a few name servers which will know the location of servers and some rules about the data stored on each server. When data is searched the query will be sent to a name server which will find the server(s) where the answer can be found for the particular query. I've also upvoted N.B.'s answer, I think he is mostly right.

For lots of users, you should have a server with lots of memory and speed. Configure php.ini to allow more memory usage. A server with lots of users should have 4-12GB available. Also, save resources by closing the desktop environment. If you have this many users, you might want to consider a CDN and also make a database request queue.

Data requests and performance

should i take the new data for my ajax onlinegame worldmap (while dragging/scrolling) from my mysql db or is it better to load the data from a generated (and frequently updated) XML ? (frequently updated -> because of new players joining the game/worldmap)?
in other words:
is mysql capable of dunno a few thousand players scrolling a worldmap (and therefore requesting new data) or should i use a XML sheet?

Personally I hate XML.
For you it might be the right tool for the job, but I'm just going to answer the "is mysql capable of..." part of your question :-)
Yes
But it depends on your SQL skills.
How to speed things up?
Keep the MySQL server on the same machine as the webserver to avoid network traffic.
Use memory tables to avoid disk IO.
Know your way around SQL
MySQL in de default config is tuned to small tables and small memory sizes, this sounds like it fits your case, but experiment and measure to see which config works best.
Fewer selects/inserts/updates with more data per request are faster than more selects/inserts/updates with less data per request.
Also note that if you don't cache the XML file in memory you will hit lock issues on the XML file slowing things down.

A database hit will almost always be more expensive than a file hit (due to crossing a network) - but the quickest option would be to keep an in-memory dataset/cache (be aware of memory consumption though).

i think mysql fits your need.. you also could cluster your data, when running low on system resources…
also could websockets be intressting for you. maybe you should have look at nodejs, with it you can handle new user joins easily (push the new players to the other players instead of pulling the data out of mysql

Is the Ajax response returned as XML or JSON? If the latter, then why bother messing about with XML?
If it were me, I'd maintain the data in the database with smart serverside caching (where you can invalidate cache items selectively)

How to load balance (scale) a simple PHP application?

I constantly read on the Internet how it's important to correctly architect my PHP applications so that they can scale.
I have built a simple/small CMS that is written in PHP (think of Wordpress, but waaaay simpler).
I essentially have URLs like such: http://example.com/?page_id=X where X is the id in my MySQL database that has the page content.
How can I configure my application to be load balanced where I'm simply performing PHP read activities.
Would something like Nginx as the front door setup to route traffic to multi-nodes running my same code to handle example.com/?page_id=X be enough to "load balance" my site?
Obviously, MySQL is not being load balanced in this situation, though for simplicity - that makes that out of scope for this question.

These are some well known techniques for scaling such an app.
Reduce DB hits
Most often the bottle neck will be your DB, so cache recent pages so that you reduce DB activity, perhaps in something like memcached.
Design your schema such that it is partition-able.
In the simplest case, separate your data into logical partitions, and store each partition in a separate mysql DB. Craigslist, for example, partitions data by city, and in some cases, by section within that. In your case, you could partition by Id quite simply.
Manage php sessions
Putting ngnx in front of a php website will not work if you use sessions. Load balancing php does have issues as sessions are persisted on local storage. Therefore you need to do session management explicitly. The traditional solution is to use memcached to store and look up some kind of cookie.
Don't optimize prematurely.
Focus on getting your application out so that the next magnitude of current users gets the optimal experience.
Note: Your main potential pain points are discussed here on SO

No, it is not at all important to scale your application if you don't need to.
My view on this is:
Make it work
Make sure it works correctly - testability, robustness
Make it work efficiently enough to be cost effective to run
Then, if you have to so much traffic that your system cannot handle it, AND you've already thrown all the hardware that (sensible) money can buy at it, then you need to scale. Not sooner.
Yes it is relatively easy to scale read-workloads, because you can simply perform reads against readonly database replicas. The challenge is to scale write-workloads.
A lot of sites have few writes, even if they're really busy.

The correct approach is to use some kind of load balancer such as:
http://www.softwareprojects.com/resources/programming/t-how-to-install-and-configure-haproxy-as-an-http-loa-1752.html
What this does is forward a certain user session only to a certain server, hence you dont have to worry about sessions and where they are stored at all. What you do have to worry is how to distribute the filesystem if the 2 servers are running on two different machines, especially if you make heavy use of the filesystem. Hope this article above helps...

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.