I am implementing a project in PHP with MySQL. Right now I don't have much data, but I was wondering about the future, when I have a large dataset: it will slow down searches in the table. To decrease that search time, I was thinking about caching techniques. Which caching, i.e. client-side or server-side, will be good for a large dataset?
Thanks, aby
Server, in my opinion.
A client-side caching technique will have one of two negative outcomes, depending on how you do it:
If you cache only what the user has searched for before, the cache won't be of any use unless the user performs exactly the same search again.
If you cache the whole dataset the user will have to download the whole thing, and that will slow your site down and incur bandwidth expenses.
The easiest thing you can do is just add appropriate indexes to the table you're searching. That will be sufficient for 99% of possible applications and should be the first thing you do, before you think about caching at all.
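For instance, assuming a hypothetical products table searched by name (the table and column names are made up), a single index turns the search from a full table scan into an index lookup:

```sql
-- Hypothetical table; adjust names to your schema.
ALTER TABLE products ADD INDEX idx_name (name);

-- This search can now use the index instead of scanning every row:
SELECT id, name, price FROM products WHERE name = 'widget';
```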
Apologies if I've pitched this answer below your level, I'm not sure exactly what you're doing, what you're planning to cache or how much experience you have.
Pay close attention to indexing in your database schemas. If you do this part right, the database should be able to keep up until your data and traffic are large. The right caching scheme will depend significantly on what your usage patterns are like. You should do testing as your site grows to learn where the bottlenecks are and what the best caching scheme will be for your system.
Related
I've been given a big project by a big client and I've been working on it for 2 months now. I'm getting closer and closer to a solution but it's just so insanely complex that I can't quite get there, and so I need ideas.
The project is quite simple: there is a database of 1 million+ lat/lng coordinates, with lots of additional data for each record. A user visits a page and enters some search terms, which filter out quite a lot of the records. All the records that match the filter are displayed (often clustered) on a Google Map.
The problem is that the client demands it be fast, lean, and low-bandwidth. Hence, I'm stuck. What I'm currently doing is: present the first clusters, and when the user hovers over a cluster, begin loading in the data for that cluster's children.
However, I've upped it to 30,000 of the millions of listings and it's starting to drag a little. I've made as many optimizations as I possibly can. When the filter is changed, I AJAX a query to the DB and return all the IDs of the matches, then update the map to reflect this.
So, optimization is not an option. I need an entirely new conceptual model for this. Any input at all would be highly appreciated, as this is an incredibly complex project and I can't find anything even remotely close to it. I even looked at MMORPGs, which have a lot of similar problems, and I have made a few, but the concept of having a million players in one room is still something MMORPG makers cringe at. People commonly suspect bottlenecks here, but let me say it's not a case of optimizing that way. I need a new model in which a huge database stays on the server but is displayed fluidly to the user.
I'll be awarding 500 rep as soon as it becomes available for anything that solves this.
Thanks- Daniel.
I think there are a number of possible answers to your question depending on where it is slowing down, so here goes a few thoughts.
A wider table can affect the speed with which a query is returned. Longer records mean more disk is accessed to get the right data, so you might want to think about limiting your initial table to hold only the information that can be filtered on. Having said that, it will also depend on the DB engine you are using; some suffer more than others.
Ensuring that your tables are correctly indexed makes a HUGE difference in performance. You need to make sure that the query is using the indexes to quickly get to the records that it needs.
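As a sketch for the map case (table and column names assumed), a composite index on the coordinate columns lets a bounding-box filter use an index, and EXPLAIN will tell you whether it actually does:

```sql
-- Hypothetical: a composite index covering the bounding-box filter.
ALTER TABLE listings ADD INDEX idx_lat_lng (lat, lng);

-- EXPLAIN shows whether the index is actually used:
EXPLAIN SELECT id, lat, lng FROM listings
WHERE lat BETWEEN 51.0 AND 52.0 AND lng BETWEEN -1.0 AND 0.0;
```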
A friend was working with Google Maps and said that the API really suffered if too much was displayed on the maps. This might just be totally out of your control.
Having worked for Epic Games in the past, I can say that the reason "millions of players in a room" is something to cringe at is more often hardware driven. In a game, that number of players would grind the graphics card to a halt as it tries to render all the polygons of the models. Secondly (and likely more importantly), you have to send each client information about what every item/player is doing, which means your bandwidth use will spike very heavily. Your server might handle the load, but the players' internet connections might not.
I do think that you need to edit your question though with some extra information on WHAT is slowing down. Your database? Your query? Google API? The transfer of data between server and client machine?
Let's be honest here: a DB with 1 million records being accessed by, presumably, a large number of users is not going to run very well unless you put some extremely powerful hardware behind it.
In this type of case, I would suggest using several different database servers and setting up some decent load balancing to keep them running as smoothly as possible. First and foremost, you need to find out the "average" load you can place on a DB server before it starts to lag; let's say, for example, this is 50,000 records. Setting a low MaxClients per server may help with server performance and prevent crashes, but it might aggravate your users when they can't execute any queries due to high load. Still, it's something to keep in mind if your budget doesn't allow for much wiggle room hardware-wise.
On the topic of hardware, however, that's something you really need to take a look at. Databases typically don't use a huge amount of CPU/RAM, but they can be quite taxing on your HDD. I would recommend going for SAS or SSD drives before looking at other components in your setup; these will make a world of difference for you.
As far as load balancing goes, a very common technique among content providers is to cache the result of any query or content item (such as a popular video on YouTube) that is pulling in an above-average amount of traffic. A quick and dirty approach is an if statement in your search handler that serves a static HTML page instead of actually running the query.
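A minimal sketch of that quick-and-dirty approach, assuming a run_search_query() function that stands in for your existing query and rendering code:

```php
<?php
// Sketch of a quick-and-dirty result cache; names are hypothetical.
function cached_search(string $term, int $ttl = 300): string {
    $file = sys_get_temp_dir() . '/search_' . md5($term) . '.html';

    // Serve the cached page if it is still fresh.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise run the real query and store the rendered HTML.
    $html = run_search_query($term); // your existing query + rendering
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}
```

Real implementations also need cache invalidation when the underlying data changes; this sketch only expires by age.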
Another approach is to have a separate, standalone DB server used only for queries that are taking an excessive amount of traffic.
With that, never underestimate your code optimisation. While the differences may seem subtle to you, when run across millions of queries by thousands of users, those tiny differences really do add up.
Best of luck with it - let me know if you need any further assistance.
Eoghan
Google has a service named "BigQuery". It is an SQL server in the cloud. It uses Google's fast servers for SQL and can search millions of data rows quickly. Unfortunately it is not free, but maybe it will help you out:
https://developers.google.com/bigquery/
I am working on a project which is a kind of social network... studies say that we'll have more than 100,000 users in the first couple of months.
The website is built using PHP and MySQL, and I am searching for the fastest caching engine, since we are talking about caching user data after they sign in.
So we are talking about a huge database, huge numbers of records in the same table, a huge number of users and requests, and a huge amount of data to cache.
Please note that, as a first step, the website will be hosted on a shared server before it is moved to a dedicated server (it's the client's decision, not ours).
Any tip, hint or suggestion is appreciated.
Thanks
1) Put a lot of thought into a sensible database schema, since changing it later will be painful. MySQL tables are good for doing fast SELECT operations, which sounds appropriate for your app.
2) Don't optimize your code prematurely, i.e. don't worry about a caching solution yet; instead focus on writing modular code so you can easily improve bottlenecks with caching later.
3) After 1 and 2, think about caching based on what will be retrieved and how often. I've seen applications that put user information into the session variable; that will reduce database hits. If that's not sufficient, look into Memcached. If you have bigger data, maybe Varnish.
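As a sketch of the session approach (the $pdo connection, $userId, and the column names are all assumptions):

```php
<?php
// Sketch: cache the user's row in the session after sign-in, so
// later page loads skip the database. Names are hypothetical.
session_start();

if (!isset($_SESSION['user'])) {
    $stmt = $pdo->prepare('SELECT id, name, email FROM users WHERE id = ?');
    $stmt->execute([$userId]);
    $_SESSION['user'] = $stmt->fetch(PDO::FETCH_ASSOC);
}

$user = $_SESSION['user']; // no DB hit on subsequent requests
```

Remember to refresh or clear the cached row when the user's data changes, or the session copy will go stale.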
If I have a website where users login and logout, and each user has 4 session variables being used, how will this affect my site?
Say I have 100,000 active members; that would effectively be 400,000 session variables in play at the same time. Will this affect the loading of my site? I understand PHP has a memory limit but do not fully understand it.
Thanks
4 variables per user is nothing, but my suggestion would be on a different level: focus on what's causing actual bottlenecks in your web site. This issue is probably not relevant right now, and it is really easy to switch away from in the future if this is what slows you down (and it won't).
I bet you have much more important stuff to work on than worry about another variable, and when you get to that amount of active users, your whole structure will probably change, including servers and solutions. Good luck!
To answer shortly: yes, it will affect the loading of your site if you have 100k users. But it won't be only because of sessions; they will be just one part of the bottleneck.
It's easy to calculate the possible memory consumption, and based on that you should decide how to scale your site.
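For instance, a back-of-the-envelope sketch, where the 100-byte average per value is a pure assumption:

```php
<?php
// Back-of-the-envelope estimate; the 100-byte figure is an assumption.
$users       = 100000;
$varsPerUser = 4;
$bytesPerVar = 100; // assumed average size of one session value

$totalBytes = $users * $varsPerUser * $bytesPerVar;
echo round($totalBytes / (1024 * 1024)) . " MB\n"; // roughly 38 MB
```

Note that with PHP's default file-based handler, sessions live on disk and only the sessions of requests currently being served are read into memory, so in practice the pressure shows up as disk I/O and session locking rather than one big memory pool.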
Scaling options are nearly endless (as a phrase, of course; there's a finite number of ways to scale a program, but there are still many to choose from).
If it happens that you attract that many users, chances are you will be able to afford professional help when it comes to scaling your site.
If you are wondering what those options are, it might be best to ask a question with the specific things in mind that trouble you when determining where the bottlenecks might be.
Also, you don't deploy sites with that many active users on a single server using the default PHP configuration, especially the session settings.
The answer, of course, is "it depends".
STRONG SUGGESTION:
1) Establish a "performance baseline"
2) Consider resources like CPU, memory, network and disk
3) Consider OS, Web server, Application and database
4) Run stress tests, and compare your performance between "normal loads" to "high loads"
5) Identify the bottlenecks, and deal with them appropriately
Assuming you're running Linux/Apache/MySql (a total guess on my part - you didn't say), here's an excellent three-part article that might get you started in the right direction:
Tuning LAMP systems, Part 1: Understanding the LAMP architecture
But trust me: worrying about minutiae like whether you have 5 session variables instead of 4, rather than gathering solid baseline statistics, is NOT going to help you scale to 100,000 users :)!
I constantly read on the Internet how it's important to correctly architect my PHP applications so that they can scale.
I have built a simple/small CMS that is written in PHP (think of Wordpress, but waaaay simpler).
I essentially have URLs like such: http://example.com/?page_id=X where X is the id in my MySQL database that has the page content.
How can I configure my application to be load balanced when I'm simply performing PHP read activities?
Would something like Nginx as the front door setup to route traffic to multi-nodes running my same code to handle example.com/?page_id=X be enough to "load balance" my site?
Obviously, MySQL is not being load balanced in this situation, though for simplicity - that makes that out of scope for this question.
These are some well known techniques for scaling such an app.
Reduce DB hits
Most often the bottleneck will be your DB, so cache recently requested pages to reduce DB activity, perhaps in something like memcached.
Design your schema such that it is partition-able.
In the simplest case, separate your data into logical partitions, and store each partition in a separate mysql DB. Craigslist, for example, partitions data by city, and in some cases, by section within that. In your case, you could partition by Id quite simply.
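A minimal sketch of partitioning by id, using simple modulo routing (the function name and shard count are made up):

```php
<?php
// Sketch: route a record id to one of N partition databases.
function shard_for_id(int $id, int $numShards = 4): int {
    return $id % $numShards; // simple modulo partitioning
}

// e.g. id 10 with 4 shards lands on shard 2
```

Modulo is the simplest scheme; its drawback is that changing the shard count re-maps almost every id, which is why larger systems often use a lookup table or consistent hashing instead.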
Manage php sessions
Putting nginx in front of a PHP website will not work by itself if you use sessions. Load balancing PHP has issues because sessions are persisted in local storage. Therefore you need to do session management explicitly. The traditional solution is to use memcached as the session store, keyed by some kind of cookie.
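A sketch of what that looks like with the memcached extension's session handler (host and port are placeholders; the older memcache extension expects slightly different values):

```ini
; php.ini sketch: store sessions in memcached instead of local files,
; so any web node behind the balancer can serve any user.
session.save_handler = memcached
session.save_path    = "cache1.example.com:11211"
```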
Don't optimize prematurely.
Focus on getting your application out so that the next order of magnitude of users gets the optimal experience.
Note: Your main potential pain points are discussed here on SO
No, it is not at all important to scale your application if you don't need to.
My view on this is:
Make it work
Make sure it works correctly - testability, robustness
Make it work efficiently enough to be cost effective to run
Then, if you have so much traffic that your system cannot handle it, AND you've already thrown all the hardware that (sensible) money can buy at it, then you need to scale. Not sooner.
Yes, it is relatively easy to scale read workloads, because you can simply perform reads against read-only database replicas. The challenge is scaling write workloads.
A lot of sites have few writes, even if they're really busy.
The correct approach is to use some kind of load balancer such as:
http://www.softwareprojects.com/resources/programming/t-how-to-install-and-configure-haproxy-as-an-http-loa-1752.html
What this does is forward a given user session only to a certain server, hence you don't have to worry about sessions and where they are stored at all. What you do have to worry about is how to distribute the filesystem if the two servers run on different machines, especially if you make heavy use of the filesystem. Hope the article above helps...
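For illustration, a cookie-based stickiness snippet along the lines the article describes (backend name and server addresses are placeholders):

```
# HAProxy sketch: cookie-based stickiness so each user keeps hitting
# the same backend node. Names and addresses are placeholders.
backend php_nodes
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server web1 10.0.0.1:80 check cookie web1
    server web2 10.0.0.2:80 check cookie web2
```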
A few minutes ago, I asked whether it was better to perform many queries at once at log in and save the data in sessions, or to query as needed. I was surprised by the answer, (to query as needed). Are there other good rules of thumb to follow when building PHP/MySQL multi-user apps that speed up performance?
I'm looking for specific ways to create the most efficient application possible.
hashing
know your hashes (arrays/tables/ordered maps/whatever you call them). a hash lookup is very fast, and sometimes, if you have O(n^2) loops, you may reduce them to O(n) by organizing them into an array (keyed by primary key) first and then processing them.
an example:
foreach ($results as $result)
    if (in_array($result->id, $other_results))
        $found++;
is slow - in_array loops through the whole of $other_results, resulting in O(n^2).
foreach ($other_results as $other_result)
    $hash[$other_result->id] = true;

foreach ($results as $result)
    if (isset($hash[$result->id]))
        $found++;
the second one is a lot faster (depending on the result sets - the bigger, the faster), because isset() is (almost) constant time. actually, this is not a very good example - you could do this even faster using built-in php functions, but you get the idea.
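for instance, a sketch of that built-in-function version (the sample data is made up; array_column accepts arrays of objects in PHP 7+):

```php
<?php
// Sample data standing in for the query results above.
$results       = [(object)['id' => 1], (object)['id' => 2], (object)['id' => 3]];
$other_results = [(object)['id' => 2], (object)['id' => 3], (object)['id' => 5]];

// Build hashes keyed by id, then count the keys both sets share.
$hash  = array_fill_keys(array_column($other_results, 'id'), true);
$found = count(array_intersect_key(
    array_fill_keys(array_column($results, 'id'), true),
    $hash
));
// $found is 2 (ids 2 and 3 appear in both)
```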
optimizing (My)SQL
mysql.conf: i don't have any idea how much performance you can gain by optimizing your mysql configuration instead of leaving the default. but i've read you can ignore every postgresql benchmark that used the default configuration. afaik configuration matters less with mysql, but why ignore it? rule of thumb: try to fit the whole database into memory :)
explain [query]: an obvious one, a lot of people get wrong. learn about indices. there are rules you can follow, you can benchmark it and you can make a huge difference. if you really want it all, learn about the different types of indices (btrees, hashes, ...) and when to use them.
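for example (table and column names assumed):

```sql
-- Hypothetical: check whether a query can use an index.
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';
-- If the "key" column in the output is NULL, the query scans the
-- whole table; adding an index fixes that:
ALTER TABLE users ADD INDEX idx_email (email);
```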
caching
caching is hard, but if done right it makes the difference (not a difference). in my opinion: if you can live without caching, don't do it. it often adds a lot of complexity and points of failures. google did a bit of proxy caching once (to make the intertubes faster), and some people saw private information of others.
in php, there are 4 different kinds of caching people regularly use:
query caching: almost always translates to memcached (sometimes to APC shared memory). store the result set of a certain query to a fast key/value (=hashing) storage engine. queries (now lookups) become very cheap.
output caching: store your generated html for later use (instead of regenerating it every time). this can result in the biggest speed-ups, but somewhat works against PHP's dynamic nature.
browser caching: what about etags and http responses? if done right you may avoid most of the work right at the beginning! most php programmers ignore this option because they have no idea what HTTP is.
opcode caching: APC, zend optimizer and so on. makes php code load faster. can help with big applications. got nothing to do with (slow) external datasources though, and the potential is somewhat limited.
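of the four, browser caching needs the least infrastructure. a minimal ETag sketch (render_page() stands in for however you build the page):

```php
<?php
// Sketch of browser caching with an ETag. render_page() is a
// hypothetical stand-in for your existing page generation.
$html = render_page();
$etag = '"' . md5($html) . '"';
header('ETag: ' . $etag);

// If the browser already has this exact version, skip the body.
if (($_SERVER['HTTP_IF_NONE_MATCH'] ?? '') === $etag) {
    http_response_code(304); // Not Modified: empty response
    exit;
}
echo $html;
```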
sometimes it's not possible to live without caches, e.g. if it comes to thumbnails. image resizing is very expensive, but fortunately easy to control (most of the time).
profiler
xdebug shows you the bottlenecks of your application. if your app is too slow, it's helpful to know why.
queries in loops
there are (php-)experts who do not know what a join is (and for every one you educate, two new ones without that knowledge will surface - and they will write frameworks, see schnalles law). sometimes, those queries-in-loops are not that obvious, e.g. if they come with libraries. count the queries - if they grow with the results shown, there is something wrong.
inexperienced developers do have a primal, insatiable urge to write frameworks and content management systems
schnalle's law
Optimize your MySQL queries first, then the PHP that handles it, and then lastly cache the results of large queries and searches. MySQL is, by far, the most frequent bottleneck in an application. A poorly designed query can take two to three times longer than a well designed query that only selects needed information.
Therefore, if your queries are optimized before you cache them, you have saved a good deal of processing time.
However, on some shared hosts caching is file-system only thanks to a lack of Memcached. In this instance it may be better to run smaller queries than it is to cache them, as the seek time of the hard drive (and waiting for access due to other sites) can easily take longer than the query when your site is under load.
Cache.
Cache.
Speedy indexed queries.