When dealing with tables of at least 1 million rows, which performs better:
Selecting the whole result set, e.g. SELECT * FROM tbl, then paginating it in PHP using array_chunk() or array_slice(),
or
Selecting only the rows for the current page, e.g. SELECT * FROM tbl LIMIT x per page?
I think it depends. You can keep the whole result set in memory with memcache if your table is not too big, which avoids disk reads, the most time-consuming part. But since you don't know whether a user will actually look at many pages, it is usually better to limit the result with SQL.
It depends.
Does the data in this table change often?
Yes -> you need to query the DB.
Is the database big and does it change often?
Then use some kind of search engine like Elasticsearch: populate it from the DB and query the search engine instead of the DB.
Is the database small but queries take a long time?
Use some kind of cache like Redis/memcached.
It really depends on your needs.
The best method will depend on your context. If you choose to use the database directly, beware of this issue:
The naive LIMIT method will give you problems on later pages. ORDER BY some_key LIMIT offset, page_size works like this: scan along the key, throw away the first offset records, then return page_size records. That means offset + page_size records are examined, so if offset is high you have a problem.
Better - remember the last key value of the current page. When fetching next page use it like this:
SELECT * FROM tbl WHERE the_key > $last_key ORDER BY the_key ASC LIMIT $page_size
If your key is not unique, make it unique by adding an extra unique ID column at the end.
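One way to sketch that tie-breaker, assuming an id column exists (MySQL supports row-constructor comparisons; whether the optimizer uses the index for them depends on the version):

```sql
-- Hypothetical schema: tbl(the_key, id, ...), with id unique.
-- Resume after the last (the_key, id) pair seen on the previous page:
SELECT *
FROM tbl
WHERE (the_key, id) > ($last_key, $last_id)
ORDER BY the_key ASC, id ASC
LIMIT $page_size;

-- Equivalent expanded form, which older optimizers handle better:
-- WHERE the_key > $last_key
--    OR (the_key = $last_key AND id > $last_id)
```

A composite index on (the_key, id) lets MySQL seek directly to the start of the page instead of scanning past discarded rows.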
It REALLY depends on context.
In general you want to make heavy use of indexes to pull the content you want out of a large dataset quickly. It can also be faster to paginate in the application language than in the database, because the database is often the bottleneck. We had to do it this way for an application that ran hundreds of queries a minute: hits to the database needed to be capped, so we returned result sets we knew would probably not require another query (around 100 rows) and then paginated by 25 in the application.
In general: index, and narrow your results with those indexes. If performance is critical and the DB sees lots of activity, tune both the DB and your code to reduce I/O and DB hits by paginating in the application. You'll understand why when your server is bleeding at a load average of 12 with I/O maxed out. At that point it's time for emergency surgery, stat!
It is better to use LIMIT. Think about it: the first approach fetches everything, even if you have 1,000,000 rows, while LIMIT only fetches your set number of rows each time.
You will then want to make sure your offsets are set correctly to get the next set of items from the table.
I have a table with the following columns:
ItemCode VARCHAR
PriceA DECIMAL(10,4)
PriceB DECIMAL(10,4)
The table has around 1,000 rows.
My requirement is to compute the difference (PriceA - PriceB) for each row and then display the top 50 items with the largest price differences.
There are two ways I can implement this:
1) Trust that the SQL calculation is simple and fast, and run the following query:
SELECT ItemCode, (PriceA - PriceB) AS PDiff FROM testtable ORDER BY PDiff DESC LIMIT 50
or
2) Add one more column (called PriceDiff) that stores the difference (PriceA - PriceB).
This column will have to be maintained on every insert/update and takes extra space, but the top 50 then becomes a simple ORDER BY PriceDiff DESC LIMIT 50 over a stored column.
My question is: in terms of speed and efficiency for a web application (displaying results on a website/app), which of the above methods is better?
I have tried timing each query, but both report similar figures, so I am unable to draw any conclusions.
Any explanation by the experts, or any fine-tuning of the code, will be really appreciated.
Thanks
In general, improving performance is a tradeoff between memory and time: caching results improves speed but takes more memory, while calculating values on the fly saves memory at the expense of speed.
In your case, storing an additional 1,000+ values in the DB costs a few extra KB, and calculating the diff on the fly has a negligible impact on performance. Either option is absolute peanuts to any DB and server.
I would stick with doing calculations on the fly as that is less complex and keeps the db normalized.
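If you ever do want the difference materialized without maintaining it by hand, one middle ground (assuming MySQL 5.7+; the column and index names here are illustrative) is a stored generated column:

```sql
-- PriceDiff is computed by MySQL from PriceA - PriceB on every write;
-- STORED means it is written to disk and can therefore be indexed.
ALTER TABLE testtable
  ADD COLUMN PriceDiff DECIMAL(11,4) AS (PriceA - PriceB) STORED,
  ADD INDEX idx_pricediff (PriceDiff);

-- The top 50 then becomes an index-ordered read:
SELECT ItemCode, PriceDiff
FROM testtable
ORDER BY PriceDiff DESC
LIMIT 50;
```

This keeps the column consistent automatically, so there is no manual bookkeeping and no risk of it drifting out of sync.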
The first method is fastest, but as was mentioned, a manually maintained difference column is prone to getting out of sync.
If you do go with the second method, make sure the table has a primary key, and have the web application set the new column's value whenever PriceA or PriceB changes.
Then, when you want the top 50, select from the table that stores the differences and order by that column.
These links explain primary keys and how to use them:
http://www.mysqltutorial.org/mysql-primary-key/
https://www.w3schools.com/sql/sql_primarykey.asp
I'm building a website using php/mysql where there will be Posts and Comments.
Posts need to show the number of comments they have. I have a count_comments column in the Posts table and update it every time a comment is created or deleted.
Someone recently advised me that denormalizing this way is a bad idea and that I should be using caching instead.
My take is: You are doing the right thing. Here is why:
See the count_comments field as not being part of your data model. This is easily provable: you can delete all contents of this field and it is trivial to recreate it.
Instead, see it as a cache whose storage happens to be co-located with the post. That is perfectly smart, as you get it for free whenever you query for the post(s).
I do not think this is a bad approach.
One thing I do recognize is that it's very easy to introduce side effects as the code base expands, so a more rigid approach helps. The nice part is that at some point the number of rows will have to be counted or kept track of one way or another; there is no real way around that.
I would not advise against this. There are other solutions to getting comment counts. Check out Which is fastest? SELECT SQL_CALC_FOUND_ROWS FROM `table`, or SELECT COUNT(*)
That solution is slower on selects, but requires less code to keep the comment count accurate.
I will say that your approach avoids that per-SELECT cost, which is a plus.
This is an optimization that is almost never needed for two reasons:
1) Proper indexing will make simple counts extremely fast. Ensure that your comments.post_id column has an index.
2) By the time you need to cache this value, you will need to cache much more. If your site has so many posts, comments, users and traffic that you need to cache the comments total, then you will almost definitely need to be employing caching strategies for much of your data/output (saving built pages to static, memcache, etc.). Those strategies will, no doubt, encompass your comments total, making the table field approach moot.
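A minimal sketch of the indexed-count approach (table and column names are taken from the question; the index name is assumed):

```sql
-- One index on the foreign key makes the per-post count
-- an index range scan rather than a full table scan:
CREATE INDEX idx_comments_post_id ON comments (post_id);

-- Count on demand; with the index this stays fast
-- even as the comments table grows large:
SELECT COUNT(*) FROM comments WHERE post_id = 123;
```

With this in place there is no counter column to keep in sync and no extra write on every comment create/delete.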
I have no idea what was meant by "caching" here, and I'd be interested in some answer other than the one I have to offer:
Removing redundant information from your database is important, and, speculatively (I haven't actually benchmarked it), I think counting in the database is the better way to go.
Assuming all your comments have a post_id, all you need is something like:
SELECT COUNT(*) FROM comments WHERE post_id = {post_id_variation_here}
That way you drop the extra write on every comment create/delete that exists only to keep a readable count, and improve performance.
Unless you have hundreds or thousands of hits per second on your application, there's nothing wrong with a SQL statement like this:
select posts_field1, ..., (select count(*) from comments where comments_parent = posts_id) as commentNumber from posts
You can cache the HTML output of your page anyway; then no database query has to be done at all.
Maybe you could relate the post and comment tables to each other and count the comment rows from PHP with mysqli_num_rows(). Like so:
Post table
postid*
postcontent
Comment table
commentid
postid*
comment
And then count the comments like:
$link = mysqli_connect("localhost", "mysql_user", "mysql_password", "database");
$result = mysqli_query($link, "SELECT * FROM commenttable WHERE postid = 1");
$num_rows = mysqli_num_rows($result);
The query I'd like to speed up (or replace with another process):
UPDATE en_pages, keywords
SET en_pages.keyword = keywords.keyword
WHERE en_pages.keyword_id = keywords.id
Table en_pages has the proper structure but contains only non-unique page_ids and keyword_ids. I'm trying to add the actual keywords (strings) to this table where they match keyword_ids. There are 25 million rows in en_pages that need updating.
I'm adding the keywords so that this one table can be queried in real time and return keywords (the join is obviously too slow for "real time").
We apply this query (and some others) to sub-units of our larger dataset. We do this frequently to create custom interfaces for specific sub-units of our data for different user groups (sorry if that's confusing).
This all works fine if you give it an hour to run, but I'm trying to speed it up.
Is there a better way to do this that would be faster using PHP and/or MySQL?
I actually don't think you can speed up the process much.
You can still throw brute force at your database by clustering new servers.
Maybe I'm wrong or misunderstood the question, but...
Couldn't you use TRIGGERS?
Like... when a new INSERT is detected on en_pages, do an UPDATE on that same row afterwards?
(I don't know how frequent INSERTs are in that table.)
This is just an idea.
How often do en_pages.keyword and en_pages.keyword_id change after being inserted?
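A minimal sketch of that idea, assuming the table and column names from the question (a BEFORE INSERT trigger can fill the value in up front, avoiding the follow-up UPDATE entirely):

```sql
DELIMITER //
CREATE TRIGGER en_pages_fill_keyword
BEFORE INSERT ON en_pages
FOR EACH ROW
BEGIN
  -- Look up the keyword string once, at insert time, so the
  -- row never needs to be touched by a bulk UPDATE later.
  SET NEW.keyword = (SELECT keyword
                     FROM keywords
                     WHERE id = NEW.keyword_id);
END//
DELIMITER ;
```

This trades a small per-insert lookup for eliminating the hour-long batch job on rows inserted after the trigger exists; existing rows would still need one final backfill pass.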
I don't know about MySQL, but in SQL Server this sort of thing usually runs faster if you process a limited number of records (say 1000) at a time in a loop.
You might also consider a WHERE clause (the <> operator for "not equal to" works in MySQL as well as SQL Server):
WHERE en_pages.keyword <> keywords.keyword
That way you are only updating records where the field you are updating actually differs, not all of them.
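A sketch of the batching idea in MySQL, assuming en_pages has a numeric page_id you can range over (run the statement in a loop, advancing the range each pass):

```sql
-- Update one slice of 100k page_ids per pass; repeat with the next
-- range until the whole table is covered. Smaller transactions keep
-- locks short and the undo log small.
UPDATE en_pages
JOIN keywords ON en_pages.keyword_id = keywords.id
SET en_pages.keyword = keywords.keyword
WHERE en_pages.page_id BETWEEN 1 AND 100000
  -- NULL-safe "differs" test: <> alone would skip rows where
  -- en_pages.keyword is still NULL.
  AND NOT (en_pages.keyword <=> keywords.keyword);
```

The page_ids in the question are non-unique, but that is fine here: ranging only needs to partition the table, not identify single rows.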
As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
Means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible as that is the weapon of choice at my company.
Cheers.
First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll get the first 20 records; let's say their ids are 5, 8, 9, ..., 55, 64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, only 20 records read. It doesn't let you jump to an arbitrary page, but most of the time people just browse next/previous. An index on "id" keeps this fast no matter how deep into the result set you page.
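In PHP, the pattern might look like this (a sketch using PDO; the table name comes from the question, everything else is assumed):

```php
<?php
// Keyset pagination: the client passes the last id it saw,
// and we seek past it instead of using OFFSET.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
// Use native prepares so the integer bound to LIMIT is not quoted.
$pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);

$lastId   = isset($_GET['id']) ? (int) $_GET['id'] : 0;
$pageSize = 20;

$stmt = $pdo->prepare(
    'SELECT * FROM my_table WHERE id > :last_id ORDER BY id LIMIT :n'
);
$stmt->bindValue(':last_id', $lastId, PDO::PARAM_INT);
$stmt->bindValue(':n', $pageSize, PDO::PARAM_INT);
$stmt->execute();

$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
// Build the "next" link from the id of the last row on this page:
$nextId = $rows ? end($rows)['id'] : null;
```

Each request reads exactly one page's worth of rows, regardless of how far the user has paged.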
A solution might be to avoid the LIMIT clause and use a join instead, joining against a table used as a kind of sequence.
For more information, I found this question/answer on SO, which gives an example that might help you ;-)
There are basically 3 approaches to this, each of which have their own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT and grab only the records you need with each request, completely stateless. Benefit in that it only sends the records for the page currently requested, so requests are small, downsides in that a) it requires a server request for each page, and b) it's slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).
SELECT * FROM my_table LIMIT 10000, 20;
means show 20 records starting from record #10000 in the result. If you're filtering on the primary key in the WHERE clause, there will not be a heavy load on MySQL.
Other pagination methods, such as using a join, can place a really heavy load on the server.
I'm not aware of the performance decrease you've mentioned, and I don't know of any other solution for pagination; however, an ORDER BY clause might help you reduce the load time.
The best way is to define an index field in my_table and increment it for every newly inserted row. Then you can use WHERE your_index_field BETWEEN 10000 AND 10020.
It will be much faster.
Some other options:
Partition the table per page so the LIMIT can be skipped entirely.
Store the results in a session (a good idea would be to hash that data with md5, then use that to share the cached session across multiple users).
I am building a fairly large statistics system, which needs to allow users to requests statistics for a given set of filters (e.g. a date range).
e.g. This is a simple query that returns 10 results, including the player_id and amount of kills each player has made:
SELECT player_id, SUM(kills) as kills
FROM `player_cache`
GROUP BY player_id
ORDER BY kills DESC
LIMIT 10
OFFSET 30
The above query offsets the results by 30 (i.e. the 3rd 'page' of results). When the user then selects the 'next' page, it will use OFFSET 40 instead of 30.
My problem is that nothing is cached: even though LIMIT/OFFSET is being applied to the same dataset, the SUM() is performed all over again just to shift the results by 10 more.
The above example is a simplified version of a much bigger query which just returns more fields, and takes a very long time (20+ seconds, and will only get longer as the system grows).
So I am essentially looking for a solution to speed up the page load, by caching the state before the LIMIT/OFFSET is applied.
You can of course use caching, but I would recommend caching the result, not the query, in MySQL.
But first things first: make sure that a) you have proper indexing on your data, and b) it is actually being used.
If that does not help (GROUP BY tends to be slow on large datasets), you need to put the summary data in a static table/file/database.
There are several techniques/libraries etc that help you perform server side caching of your data. PHP Caching to Speed up Dynamically Generated Sites offers a pretty simple but self explanatory example of this.
Have you considered periodically running your long query and storing all the results in a summary table? The summary table can be quickly queried because there are no JOINs and no GROUPings. The downside is that the summary table is not up-to-the-minute current.
I realize this doesn't address the LIMIT/OFFSET issue, but it does fix the issue of running a difficult query multiple times.
Depending on how often the data is updated, data-warehousing is a straightforward solution to this. Basically you:
Build a second database (the data warehouse) with a similar table structure
Optimise the data warehouse database for getting your data out in the shape you want it
Periodically (e.g. overnight each day) copy the data from your live database to the data warehouse
Make the page get its data from the data warehouse.
There are different optimisation techniques you can use, but it's worth looking into:
Removing fields which you don't need to report on
Adding extra indexes to existing tables
Adding new tables/views which summarise the data in the shape you need it.
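A minimal sketch of the summary-table idea from the answers above (all names are illustrative; refresh on whatever schedule matches your tolerance for stale data):

```sql
-- One row per player, with the expensive SUM precomputed.
CREATE TABLE player_kill_totals (
    player_id INT PRIMARY KEY,
    kills     BIGINT NOT NULL,
    KEY idx_kills (kills)
);

-- Periodic refresh (e.g. overnight from cron):
-- rebuild all totals from the live table in one pass.
REPLACE INTO player_kill_totals
SELECT player_id, SUM(kills)
FROM player_cache
GROUP BY player_id;

-- Page reads are now a cheap index-ordered scan with no GROUP BY:
SELECT player_id, kills
FROM player_kill_totals
ORDER BY kills DESC
LIMIT 10 OFFSET 30;
```

The 20+ second aggregation runs once per refresh instead of once per page view; the trade-off is that the leaderboard is only as current as the last refresh.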