I have a PHP web application where certain data changes on a weekly basis but is read very frequently.
The SQL queries that retrieve the data and the PHP code for the HTML output are fairly complex. There are multiple table joins and numerous calculations, but they result in a fairly basic HTML table. Users are grouped, and the table is the same for each group each week, but different for different groups. I could potentially have hundreds of tables for thousands of users.
For performance reasons, I'd like to cache this data. Rather than running these queries and calculations every time someone hits the page, I want to run a weekly process to generate the table for each group giving me a simple read when required.
I'd be interested to know what techniques you've used, successfully or unsuccessfully, to achieve something like this.
Options I can see include:
Storing the html result of the calculations in a MySQL table, identified by user group
Storing the resultant data in a MySQL table, identified by user group (difficult as there's no fixed number of data items)
Caching the page output in static files
Any other suggestions would be welcome!
In the function to generate the table, make it store the result to a file on disk:
/cache/groups/1.txt
/cache/groups/2.txt
You don't necessarily have to run a weekly batch job for it. When calling the function to get the data, check whether the cache is out of date (or non-existent). If so, generate and cache the results then. If not, just return the cached file.
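function getGroupTable($groupId) {
    // Path follows the /cache/groups/<id>.txt layout shown above.
    $cacheFile = "/cache/groups/{$groupId}.txt";
    if (cacheIsStale($groupId)) {
        generateCache($groupId); // rebuilds the table and writes it to $cacheFile
    }
    return file_get_contents($cacheFile);
}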
The cacheIsStale() function could just look at the file's timestamps to test for freshness.
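A minimal sketch of such a freshness check, assuming the cache lives in a directory passed as a parameter and that one week is the acceptable age. The extra parameters ($cacheDir, $maxAge, $now) are not in the answer above; they are added here only to make the function easy to try out.

```php
<?php
// Hypothetical helper: the directory layout and the one-week lifetime are assumptions.
function cacheIsStale(int $groupId, string $cacheDir = '/cache/groups', int $maxAge = 604800, ?int $now = null): bool
{
    $file = $cacheDir . '/' . $groupId . '.txt';
    if (!is_file($file)) {
        return true; // nothing cached yet, so it has to be generated
    }
    $now = $now ?? time();
    // filemtime() returns the file's last modification time as a Unix timestamp.
    return ($now - filemtime($file)) > $maxAge;
}
```

Passing $now explicitly makes it possible to check the behaviour without waiting a week.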
There are indeed a few options:
Prerender the pages on a weekly basis and then serve them "statically".
Use a cache (e.g. Squid) to cache such responses on a first-chance basis for a week. For example, you can configure the caching policy so requests that go to a particular page (e.g. very_long.php?...) are cached separately from the rest of the website.
Make sure you turn on DB caching. MySQL has caching of its own and you can fine tune it so that repeated long queries are not recalculated.
First of all, profile. Verify that those queries are really consuming a significant amount of time. Maybe the MySQL query result cache has already done the work for you.
If they really are consuming resources, what I would do is create a table with the computed results, and a procedure that does all the necessary maintenance, to be called when the data changes. The frequent reads should go only to the pre-computed data, without bothering to check whether it's still valid.
Simply add some hooks to the procedures that modify the base data, or database triggers if you can. These would be executed infrequently (weekly?), so it doesn't matter if they take a long time to generate the results.
It seems you already have most of it covered.
One other option, assuming the table data is not huge, is to use memcache to cache the results - this would probably be the faster solution, although you would need to check memory requirements to see if it's a viable option.
Related
I have a program that creates logs, and these logs are used to calculate balances, trends, etc. for each individual client. Currently, I store everything in separate MySQL tables. I link all the logs to a specific client by joining the two tables. When I access a client, it pulls all the logs from the log_table and generates a report. The report varies depending on what filters are in place, mostly date and category specific.
My concern is the performance of my program as we accumulate more logs and clients. My intuition tells me to store the log information in the user_table in the form of a serialized array so only one query is used for the entire session. I can then take that log array and filter it using PHP, whereas before it was filtered in a MySQL query (using multiple methods, such as BETWEEN for dates and other comparisons).
My question is, do you think performance would be improved if I used serialized arrays to store the logs as opposed to using a MySQL table to store each individual log? We are estimating about 500-1000 logs per client, with around 50000 clients (and growing).
It sounds like you don't understand what makes databases powerful. It's not about "storing data", it's about "storing data in a way that can be indexed, optimized, and filtered". You don't store serialized arrays, because the database can't do anything with that. All it sees is a single string without any structure that it can meaningfully work with. Using it that way voids the entire reason to even use a database.
Instead, figure out the schema for your array data, and then insert your data properly, with one field per dedicated table column so that you can actually use the database as a database, allowing it to optimize its storage, retrieval, and database algebra (selecting, joining and filtering).
Are serialized arrays in a db faster than native PHP? No, of course not. You've forced the database to act as a flat file with the extra DBMS overhead.
Is using the database properly faster than native PHP? Usually, yes, by a lot.
Plus, and this part is important, it means that your database can live "anywhere", including on a faster machine next to your webserver, so that your database can return results in 0.1s, rather than PHP jacking 100% cpu to filter your data and preventing users of your website from getting page results because you blocked all the threads. In fact, for that very reason it makes absolutely no sense to keep this task in PHP, even if you're bad at implementing your schema and queries, forget to cache results and do subsequent searches inside of those cached results, forget to index the tables on columns for extremely fast retrieval, etc, etc.
PHP is not for doing all the heavy lifting. It should ask other things for the data it needs, and act as the glue between "a request comes in", "response base data is obtained" and "response is sent back to the client". It should start up, make the calls, generate the result, and die as fast as it can again.
It really depends on how you need to use the data. You might want to look into storing with mongo if you don't need to search that data. If you do, leave it in individual rows and create your indexes in a way that makes them look up fast.
If you have 10 billion rows, and need to look up 100 of them to do a calculation, it should still be fast if you have your indexes done right.
Now if you have 10 billion rows and you want to do a sum on 10,000 of them, it would probably be more efficient to save that total somewhere. Whenever a new row is added, removed or updated that would affect that total, you can change that total as well. Consider a bank, where all items in the ledger are stored in a table, but the balance is stored on the user account and is not calculated based on all the transactions every time the user wants to check his balance.
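The bank example can be sketched in a few lines. This is an in-memory stand-in, not a database schema: the Account class and its method names are hypothetical, and in a real system the ledger would be a table and the balance a column updated in the same transaction as the insert.

```php
<?php
// Hypothetical in-memory stand-in for a ledger table plus a stored balance column.
final class Account
{
    /** @var int[] individual transactions, in cents */
    private array $ledger = [];

    /** Stored total, adjusted on every write instead of being re-summed on every read. */
    private int $balance = 0;

    public function post(int $amountCents): void
    {
        $this->ledger[] = $amountCents;
        $this->balance += $amountCents;
    }

    public function balance(): int
    {
        return $this->balance; // constant-time read, no scan of the ledger
    }

    public function recomputedBalance(): int
    {
        return array_sum($this->ledger); // slow path, useful as a consistency check
    }
}
```

The slow path is kept so the stored total can be audited against the ledger from time to time.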
I'm in the design phase of a website and I have a solution for a feature but I don't know if it will be the good one when the site, hopefully, grows. I want the users to be able to perform searches for other users and the results they find must be ordered: first the "spotlighted" users, then all the rest. The result must be ordered randomly, respecting the previously mentioned order, and with pagination.
One of the solutions I have in mind is to store the query results in a session variable in the server side. For performance, when the user leaves the search this variable is destroyed.
What will happen when the site has thousands of users and thousands of searches are performed every day? Will my solution be viable, or will the server be overloaded?
I have more solutions in mind like an intermediate table where n times by day users are dumped in the mentioned order. This way there is no need to create a big array in the user's session and pagination is done via multiple queries against the database.
Although I appreciate any suggestions, I'm especially interested in hearing opinions from developers experienced with high-traffic sites.
(The technology employed is LAMP, with InnoDb tables)
Premature optimization is bad, but you should be planning ahead. You don't need to implement it yet, but prepare yourself.
If there are thousands of users running this search every day, then caching the query result in the session is not a good idea, because the same result may be cached for some users while others still need to execute the query. For such a case I'd recommend saving the search result in a user-independent data structure (file, memory, etc.).
For each search query save the result, creation date, last access date in your disk or in memory.
If any user runs the same search, show the result from the cache.
Run a cron job that invalidates the cache after some time.
This way frequent searches will be available promptly most of the time. It also reduces the load on your database.
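The steps above could look roughly like this with plain files. Everything here is an assumption made for illustration: the function names, the JSON file format, the one-hour lifetime, and the $runQuery callback that stands in for the real database search. The last access date mentioned above is left out to keep the sketch short.

```php
<?php
// All names are hypothetical; the cache directory and lifetime are assumptions.
function searchCachePath(string $cacheDir, array $params): string
{
    ksort($params); // the same parameters in a different order map to the same file
    return $cacheDir . '/' . sha1(json_encode($params)) . '.json';
}

function cachedSearch(string $cacheDir, array $params, callable $runQuery, int $ttl = 3600): array
{
    $path = searchCachePath($cacheDir, $params);
    if (is_file($path)) {
        $entry = json_decode(file_get_contents($path), true);
        if (is_array($entry) && (time() - $entry['created']) < $ttl) {
            return $entry['result']; // cache hit: the database is not touched
        }
    }
    $result = $runQuery($params); // cache miss: run the real search once
    $entry = ['created' => time(), 'result' => $result];
    file_put_contents($path, json_encode($entry), LOCK_EX);
    return $result;
}

// Intended to be called from cron: removes entries that are at least $ttl seconds old.
function purgeSearchCache(string $cacheDir, int $ttl = 3600): int
{
    $removed = 0;
    foreach (glob($cacheDir . '/*.json') ?: [] as $path) {
        if (time() - filemtime($path) >= $ttl && unlink($path)) {
            $removed++;
        }
    }
    return $removed;
}
```

Because the cache key is built from the search parameters rather than from the user, every user who runs the same search shares one entry.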
This is definitely not the answer you are looking for, but I have to say it.
Premature Optimization is the root of all evil.
Get that site up with a simple implementation of that query and come back and ask if that turns out to be your worst bottleneck.
I'm assuming you want to decrease the hitting on the DB by caching search results so other users searching for the same set of factors don't have to hit the DB again--especially on very loose query strings on non-indexed fields. If so, you can't store it in a session--that's only available to the single user.
I'd use a caching layer like Cache_Lite and cache the result set from the db query based on the query string (not the sql query, but the search parameters from your site). That way identical searches will be cached. Handle the sorting and pagination of the array in PHP, not in the DB.
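Once the full result set is cached as an array, pagination in PHP is a slice of that array. A small sketch, assuming a hypothetical paginate() helper and a default page size of 20; it deliberately does not use the Cache_Lite API.

```php
<?php
// Hypothetical helper: returns one page of an already-sorted result array.
function paginate(array $rows, int $page, int $perPage = 20): array
{
    $page = max(1, $page); // treat page 0 or negative pages as the first page
    return [
        'rows'  => array_slice($rows, ($page - 1) * $perPage, $perPage),
        'page'  => $page,
        'pages' => (int) ceil(count($rows) / $perPage),
    ];
}
```

Since every page is cut from the same cached array, a random ordering stays consistent from page to page for as long as the cache entry lives.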
I have about 10 tables with ~10,000 rows each which need to be pulled very often.
For example, list of countries, list of all schools in the world, etc.
PHP can't persist this stuff in memory (to my knowledge), so I would have to query the server with a SELECT * FROM TABLE every time. Should I use memcached here? At first thought it's a clear, absolute yes, but on second thought, wouldn't MySQL already be caching for me, making this almost redundant?
I don't have too much understanding of how mysql caches data (or if it even does cache entire tables).
You could use MySQL query cache, but then you are still using DB resources to establish the connection and execute the query. Another option is opcode caching if your pages are relatively static. However I think memcached is the most flexible solution. For example if you have a list of countries which need to be accessed from various code-points within your application, you could pull the data from the persistent store (mysql), and store them into memcached. Then the data is available to any part of your application (including batch processes and cronjobs) for any business requirement.
I'd suggest reading up on the MySQL query cache:
http://dev.mysql.com/doc/refman/5.6/en/query-cache.html
You do need some kind of a cache here, certainly; layers of caching within and surrounding the database are considerably less efficient than what memcached can provide.
That said, if you're jumping to the conclusion that the Right Thing is to cache the query itself, rather than to cache the content you're generating based on the query, I think you're jumping to conclusions -- more analysis is needed.
What data, other than the content of these queries, is used during output generation? Would a page cache or page fragment cache (or caching reverse-proxy in front) make more sense? Is it really necessary to run these queries "often"? How frequently does the underlying data change? Do you have any kind of a notification event when that happens?
Also, SELECT * queries without a WHERE clause are a "code smell" (indicating that something probably is being done the Wrong Way), especially if not all of the data pulled is directly displayed to the user.
I have a table with just 3,000 records.
I render these 3000 records in the home page without pagination, my client is not interested in pagination...
So to show page completely it takes around 1 min, 15 sec. What can be done to make the page load more quickly?
My table structure:
customer table: customer id, customer name, guider id, and a few more columns
guider table: guider id, guider name, and a few more columns
Where's the slow down? The query or the serving?
If the former, see the comments above. If the latter:
Enable gzip on the server. Otherwise capture the [HTML?] output to a file, compress it (zip), then serve it as a download. Same for any other format if you think something else can render it better than a browser (CSV and Open Office).
If you're outputting the data into a HTML table then you may have an issue where the browser is waiting for the end of the table before rendering it. You can either break this into multiple table chunks like every 500 records/rows or try CSS "table-layout: fixed;".
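A sketch of the chunking idea, combined with the fixed table layout. The function name and the inline style are assumptions; the chunk size of 500 comes from the suggestion above.

```php
<?php
// Hypothetical renderer: emits one <table> per chunk so the browser can draw
// the first chunk before the rest of the rows have arrived.
function renderChunkedTables(array $rows, int $chunkSize = 500): string
{
    $html = '';
    foreach (array_chunk($rows, $chunkSize) as $chunk) {
        $html .= "<table style=\"table-layout: fixed; width: 100%\">\n";
        foreach ($chunk as $row) {
            $cells = array_map(
                fn($value) => '<td>' . htmlspecialchars((string) $value) . '</td>',
                $row
            );
            $html .= '<tr>' . implode('', $cells) . "</tr>\n";
        }
        $html .= "</table>\n";
    }
    return $html;
}
```

With 3,000 rows this produces six tables. Columns line up across the chunks only if their widths are fixed, which is why the fixed layout is used here.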
Things to check:
Database connection: don't open the connection inside a loop; connect once and reuse that connection for every query.
Queries: analyse them and see whether any complex logic can be replaced with something simpler.
Use a standard class for the SQL connection and queries, such as ezsql.
Follow SQL query best practices.
While you could implement a cache to do this, you don't necessarily need to, and introducing unnecessary cache structures can often cause problems of its own. Depending on where the bottleneck is, it may not even help you much, or at all.
You need to look in two places for your analysis:
1) The query you're using to get your data. Take a look at its plan, or if you're not comfortable doing that, run it in your favorite query tool and see how long it takes to come back. If it doesn't take too long, you've got a pretty good idea that your bottleneck isn't the query. If the query itself takes a long time, that's where you should focus your efforts.
2) How your page is rendering. What is the size of your page, in bytes? It may be too big. Can you cut the size down by formatting? Can you more effectively use CSS to eliminate duplicate styling on the page? Are you using a fixed or dynamic table layout? Dynamic is generally going to be quite a bit slower, especially for large tables. Try to avoid nesting tables. Do everything you can to make the page as small as possible, and keep testing!
"While displaying records I want to display the guider name, so I wrote a function that returns the guider name."
Sounds like you need to use a JOIN. Here's a simple example:
SELECT * FROM customer JOIN guider ON guider.id=customer.guider_id
This will change your page from using N + 1 (3001) queries to just one.
Make sure both guider.id and customer.guider_id are indexed and of appropriate data types (such as integers).
This is a short list of things to think about for improving performance. The importance of each point is relative, so the first is not necessarily the most important for you; that depends on the details of your project.
Check your database structure. If there are just these two tables, there might be little you can do. But keep in mind that there are things like indices, and that with an increasing number of records a second, denormalized table structure will improve the speed of retrieving results.
Use one query to select your data rather than iterating through ids and doing selects repeatedly.
Run a separate Query for the guiders, I assume there are only a few of them. Save all guiders in a data structure, e.g. a dictionary, first and use the foreign key to apply the correct one to the current record - this might save a lot of data which has to be transmitted from the database to your web server.
Get your result set by using something like mysqli_result::fetch_all() which returns a 2-dimensional array with all results. This should be faster than iteration through each row with fetch_row()
Sanitize your HTML Output, use (external) CSS. This will save a lot of output space if you format your stuff with style=" ... a lot of formatting code ..." attributes in each line. If you use one large table, split them up in multiple tables (some browsers wait for the complete table to load before rendering it).
In a lot of languages very important: Use a string builder for concatenating your results into the output string!
Caching: Think about generating the output once a day or once an hour. Write it to a cachefile which is opened instead of querying the database and building the same stuff on every request. Maybe you want to offer this generated file as download, rather than displaying it as plain HTML Site on the web.
Last but not least, check the connections to the web server and database, the server load, and the number of requests. If your servers are running under heavy load, everything else here might help reduce the load, or you may just have to upgrade the hardware.
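The guider lookup suggested in the list above (one separate query for the guiders, then a dictionary keyed by id) can be sketched as follows. The arrays stand in for the rows of two queries, and the column names guider_id and guider_name are assumptions based on the table structure in the question.

```php
<?php
// $customers and $guiders stand in for the rows returned by two separate queries.
function attachGuiderNames(array $customers, array $guiders): array
{
    // Build the lookup once: guider id => guider name.
    $nameById = array_column($guiders, 'guider_name', 'guider_id');
    foreach ($customers as &$customer) {
        // An array lookup replaces one query per customer row.
        $customer['guider_name'] = $nameById[$customer['guider_id']] ?? null;
    }
    unset($customer); // break the reference left behind by foreach
    return $customers;
}
```

Either this or the JOIN shown earlier replaces the per-row queries; which one transfers less data depends on how many distinct guiders there are.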
LOL
Everyone is talking about big boys' toys, like database structure, caching and stuff,
while the problem most likely lies in mere HTML and browsers.
Just splitting the whole HTML table into chunks will let the first chunk show up immediately, while the others will eventually follow.
The only ones who were right are those who said to profile the whole thing first.
Trying to answer without profiling results is shooting in the dark.
First of all, the website I run is hosted and I don't have access to be able to install anything interesting like memcached.
I have several web pages displaying HTML tables. The data for these HTML tables are generated using expensive and complex MySQL queries. I've optimized the queries as far as I can, and put indexes in place to improve performance. The problem is if I have high traffic to my site the MySQL server gets hammered, and struggles.
Interestingly - the data within the MySQL tables doesn't change very often. In fact it changes only after a certain 'event' that takes place every few weeks.
So what I have done now is this:
Save the HTML table once generated to a file
When the URL is accessed check the saved file if it exists
If the file is older than 1hr, run the query and save a new file, if not output the file
This ensures that for the vast majority of requests the page loads very fast, and the data can at most be 1hr old. For my purpose this isn't too bad.
What I would really like is to guarantee that if any data changes in the database, the cache file is deleted. This could be done by finding all scripts that do any change queries on the table and adding code to remove the cache file, but it's flimsy as all future changes need to also take care of this mechanism.
Is there an elegant way to do this?
I don't have anything but vanilla PHP and MySQL (recent versions) - I'd like to play with memcached, but I can't.
Ok - serious answer.
If you have any sort of database abstraction layer (hopefully you will), you could maintain a field in the database for the last time anything was updated, and manage that from a single point in your abstraction layer.
e.g. (pseudocode): On any update set last_updated.value = Time.now()
Then compare this to the time of the cached file at runtime to see if you need to re-query.
If you don't have an abstraction layer, create a wrapper function to any SQL update call that does this, and always use the wrapper function for any future functionality.
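The comparison itself is short. A sketch, assuming the last_updated value has already been read from the database as a Unix timestamp; the function name is hypothetical.

```php
<?php
// Hypothetical check: $lastUpdated is the timestamp stored by the update wrapper.
function cacheIsCurrent(string $cacheFile, int $lastUpdated): bool
{
    clearstatcache(true, $cacheFile); // avoid a stale mtime cached earlier in the request
    return is_file($cacheFile) && filemtime($cacheFile) >= $lastUpdated;
}
```

If this returns false, re-run the query and rewrite the file. Timestamps have one-second resolution, so an update that lands in the same second as a cache write can be missed; a stricter comparison trades that risk for an occasional unnecessary rebuild.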
There are only two hard things in Computer Science: cache invalidation and naming things.
—Phil Karlton
Sorry, doesn't help much, but it is sooooo true.
You have most of the ends covered, but a last_modified field and cron job might help.
There's no way of deleting files from within MySQL. Postgres would give you that facility, but MySQL can't.
You can cache your output to a string using PHP's output buffering functions. Google it and you'll find a nice collection of websites explaining how this is done.
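A minimal sketch of output buffering used this way. The renderToCache() name and the $render callback are assumptions; ob_start(), ob_get_clean() and ob_end_clean() are the standard PHP buffering functions.

```php
<?php
// Hypothetical helper: captures whatever $render echoes and stores it in a file.
function renderToCache(string $cacheFile, callable $render): string
{
    ob_start(); // start capturing everything echoed from here on
    try {
        $render();
        $html = ob_get_clean(); // fetch the captured output and stop buffering
    } catch (Throwable $e) {
        ob_end_clean(); // discard partial output so it is neither sent nor cached
        throw $e;
    }
    file_put_contents($cacheFile, $html, LOCK_EX);
    return $html;
}
```

The existing page code can stay as it is: wrap the part that echoes the table in the callback, then echo the returned string.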
I'm wondering, however: how do you know that the data expires after an hour? Or are you assuming the data won't change dramatically enough in 60 minutes to warrant constant page generation?