What's the best way to get the 10 newest entries from a MySQL database? At the moment I'm using:
(SELECT whatever FROM whatever ORDER BY id (or whatever) DESC LIMIT 0,10)
But what happens if you have hundreds or thousands of entries? Does MySQL still read only the last ten entries, without losing speed and time crawling through all the others?
For my purposes I'll only ever need the last 10~20 entries from the database; the older ones are mostly there as an archive. Every record has an auto-increment ID, which I use in ORDER BY and SELECT to show my entries (using PHP with PDO and prepared statements), and I love minimal solutions that don't require a lot of resources.
Good enough or are there better ways?
Thanks for your thoughts and explanations! :)
Your solution will always work fast no matter the size of the database, provided you have an index on the relevant columns (in this case id, which I assume is a primary key).
The reason is that indexes are stored as B-trees with low height and are therefore extremely fast to search. I recommend this website as background reading: http://use-the-index-luke.com/sql/anatomy/the-tree
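To make the point concrete, here is a minimal sketch of the "newest N rows by primary key" query. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table name `entries`, its columns, and the helper names are assumptions for illustration only. MySQL's counterpart to the plan check is EXPLAIN.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO entries (title) VALUES (?)",
    [(f"entry {n}",) for n in range(1, 10001)],
)


def latest_entries(conn, count=10):
    """Return the newest `count` rows, newest first, ordered by the primary key."""
    cur = conn.execute(
        "SELECT id, title FROM entries ORDER BY id DESC LIMIT ?", (count,)
    )
    return cur.fetchall()


def query_plan(conn):
    """Return the engine's plan steps for the query as a list of strings."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id, title FROM entries ORDER BY id DESC LIMIT 10"
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]
```

If the plan shows no separate sort step, the engine is walking the primary key backwards and can stop after ten rows, which is the behaviour described above.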
I have a MySQL table with 2,000,000 rows, my website has 40.000 to 50.000 visits per day, PHP running 150 queries per second in total, and the MySQL CPU usage is around 90%. The website is extremely slow.
Dedicated Server: AMD Opteron 8 cores, 16 GB DDR3.
Here are the MYSQL query details:
Search Example: Guns And Roses
Table Storage Engine: MyISAM
Query example:
SELECT SQL_CACHE mp3list.*, likes.* FROM mp3list
LEFT JOIN likes ON mp3list.mp3id = likes.mp3id
WHERE mp3list.active=1 AND mp3list.songname LIKE '%guns%'
AND mp3list.songname LIKE '%and%' AND mp3list.songname LIKE '%roses%'
ORDER BY likes.likes DESC LIMIT 0, 15
Column "songname" is VARCHAR(255).
I want to know what I have to do to implement a lighter MySQL search. If someone could help me I'd be very grateful; I've been looking for a solution for weeks.
Thank you in advance.
Well, one solution would be to stop using a performance killer like LIKE '%something%'.
One way we've done this in the past is to maintain our own lookup tables. By that I mean, put together insert, update and delete triggers which apply any changes to a table like:
word varchar(20)
id int references mp3list(id)
primary key (word,id)
Whenever you make a change to mp3list, it gets reflected to that table, which should be a lot faster to search than your current solution.
This moves the cost of figuring out which MP3s contain which words to the time of update, rather than every time you select, amortising the cost. Since the vast majority of databases are read far more often than written, this can give substantial improvements. Some DBMSs provide this functionality with a full text search index (MySQL is one of these).
And you can even put some smarts in the triggers (and queries) to totally ignore noise words like a, an and the, saving both space and time, giving you more fine-grained control over what you want to store.
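A rough sketch of the lookup-table idea follows, under a few loudly stated assumptions: the table is kept in sync by application code rather than by the triggers suggested above, SQLite (via Python's sqlite3) stands in for MySQL, and the names `mp3words`, `save_song` and `search` are made up for the example. Note that it matches whole words, so it is not a drop-in replacement for substring matching with LIKE '%guns%'.

```python
import re
import sqlite3

NOISE_WORDS = {"a", "an", "the"}  # the noise words named in the answer

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mp3list (mp3id INTEGER PRIMARY KEY, songname TEXT NOT NULL);
    CREATE TABLE mp3words (            -- hypothetical lookup table
        word  TEXT NOT NULL,
        mp3id INTEGER NOT NULL REFERENCES mp3list(mp3id),
        PRIMARY KEY (word, mp3id)
    );
""")


def words_of(text):
    """Split text into lower-case words, dropping noise words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in NOISE_WORDS}


def save_song(conn, mp3id, songname):
    """Insert or update a song and refresh its rows in the lookup table."""
    conn.execute(
        "INSERT OR REPLACE INTO mp3list (mp3id, songname) VALUES (?, ?)",
        (mp3id, songname),
    )
    conn.execute("DELETE FROM mp3words WHERE mp3id = ?", (mp3id,))
    conn.executemany(
        "INSERT INTO mp3words (word, mp3id) VALUES (?, ?)",
        [(w, mp3id) for w in sorted(words_of(songname))],
    )


def search(conn, phrase):
    """Return the ids of songs that contain every word in `phrase`."""
    words = sorted(words_of(phrase))
    if not words:
        return []
    marks = ",".join("?" * len(words))
    # (word, mp3id) is unique, so COUNT(*) is the number of distinct words matched.
    sql = (
        f"SELECT mp3id FROM mp3words WHERE word IN ({marks}) "
        "GROUP BY mp3id HAVING COUNT(*) = ? ORDER BY mp3id"
    )
    return [row[0] for row in conn.execute(sql, (*words, len(words)))]
```

The search only touches the primary key of the lookup table, so the cost of splitting song names into words is paid once per write instead of on every read.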
I have a table to which approximately 100,000 rows are added every day, and I am supposed to generate reports from it using PHP. Recently the script that does this has started taking too long to complete. How can I improve performance by moving from MySQL to something that is scalable in the long run?
MySQL is very scalable, that's for sure.
The key is not to switch from MySQL to another database; instead you should:
Optimize your queries (this may sound silly, but I remember, for instance, that a huge improvement I made some time ago was changing SELECT * to select only the column(s) I needed. It's a frequent issue I see in other people's code too).
Optimize your table(s) design (normalization etc).
Add indexes on the column(s) you are using frequently in the queries.
Similar advice here
For generating reports or file downloads with large chunks of data, you should consider using flush and increasing the time limit and memory limit.
I doubt the problem lies in the number of rows, since MySQL can handle a lot of rows. But you can of course fetch x rows at a time and process them in chunks.
I do assume your MySQL is properly tweaked for performance.
First analyse why (or: whether) your queries are slow: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
You should read the following and learn a little bit about the advantages of a well-designed InnoDB table and how best to use clustered indexes, which are only available with InnoDB!
The example includes a table with 500 million rows with query times of 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
Hope you find this of interest.
Another thought is to move records beyond a certain age to a historical database for archiving, reporting, etc. If you don't need that large volume for transactional processing it might make sense to extract them from the transactional data store.
It's common to separate transactional and reporting databases.
I am going to make some assumptions
Your 100k rows added every day have timestamps which are either real-time, or are offset by a relatively short amount of time (hours at most); your 100k rows are added either throughout the day or in a few big batches.
The data are never updated
You are using InnoDB engine (Frankly you would be insane to use MyISAM for large tables because in the event of a crash, index rebuild takes a prohibitive time)
You haven't explained what kind of reports you're trying to generate, but I'm assuming that your table looks like this:
CREATE TABLE logdata (
dateandtime some_timestamp_type NOT NULL,
property1 some_type_1 NOT NULL,
property2 some_type_2 NOT NULL,
some_quantity some_numerical_type NOT NULL,
... some other columns not required for reports ...
... some indexes ...
);
And that your reports look like
SELECT count(*), SUM(some_quantity), property1 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property1;
SELECT count(*), SUM(some_quantity), property2 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property2;
Now, as we can see, both of these reports are doing a scan of a large amount of the table, because you are reporting on a lot of rows.
The bigger the time range becomes the slower the reports will be. Moreover, if you have a lot of OTHER columns (say some varchars or blobs) which you aren't interested in reporting on, then they slow your report down too (because the server still needs to inspect the rows).
You can use several possible techniques for speeding this up:
Add covering index for each type of report, to support the columns you need and omit columns you don't. This may help a lot but slow inserts down.
Summarise data according to the dimension(s) that you want to report on. In this fictitious case, all your reports are either counting rows or SUM()ing some_quantity.
Build mirror tables (containing the same data) which have appropriate primary keys / indexes/ columns to make the reports faster.
Use a column engine (e.g. Infobright)
Summarisation is usually an attractive option if your use-case supports it.
You may wish to ask a more detailed question with an explanation of your use-case.
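As an illustration of the summarisation option, here is a small sketch that rolls the assumed logdata table up into one row per day and property1 value, and then reports from the roll-up instead of the raw rows. SQLite (Python's sqlite3) stands in for MySQL, timestamps are assumed to be ISO-8601 strings, and the summary table `logdata_daily` and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logdata (
        dateandtime   TEXT NOT NULL,      -- assumed ISO-8601, e.g. 2024-01-01 10:00:00
        property1     TEXT NOT NULL,
        some_quantity INTEGER NOT NULL
    );
    CREATE TABLE logdata_daily (          -- hypothetical summary table
        day          TEXT NOT NULL,
        property1    TEXT NOT NULL,
        row_count    INTEGER NOT NULL,
        quantity_sum INTEGER NOT NULL,
        PRIMARY KEY (day, property1)
    );
""")


def summarise_day(conn, day):
    """Rebuild the summary rows for one day; safe to run more than once."""
    conn.execute("DELETE FROM logdata_daily WHERE day = ?", (day,))
    conn.execute(
        """
        INSERT INTO logdata_daily (day, property1, row_count, quantity_sum)
        SELECT ?, property1, COUNT(*), SUM(some_quantity)
        FROM logdata
        WHERE dateandtime >= ? AND dateandtime < date(?, '+1 day')
        GROUP BY property1
        """,
        (day, day, day),
    )


def report(conn, first_day, last_day):
    """Report per property1 from the summary table only."""
    return conn.execute(
        """
        SELECT property1, SUM(row_count), SUM(quantity_sum)
        FROM logdata_daily
        WHERE day BETWEEN ? AND ?
        GROUP BY property1
        ORDER BY property1
        """,
        (first_day, last_day),
    ).fetchall()
```

Because the data are assumed never to be updated, each finished day only has to be summarised once, and a report over a month reads a few rows per day rather than roughly 100k.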
If you know a particular script may run over the time limit, you can turn the limit off for that script by calling set_time_limit(0); at the start of it.
Other considerations such as indexing or archiving very old data to a different table should also be looked at.
Your best bet is something like MongoDB or CouchDB, both of which are non-relational databases oriented toward storing massive amounts of data. This is assuming that you've already tweaked your MySQL installation for performance and that your situation wouldn't benefit from parallelization.
I have four tables: one for articles, one for comments, one for likes, and one for visits, as in this example schema
**news**
news_id
**comments**
comment_id
news_id
**likes**
like_id
news_id
**hits**
hit_id
news_id
What I want to do is list all the articles in a sortable index, with a box/div for each article showing its counts of hits, comments, and likes. I know how to do all this, so it's not the how I'm after but the best way. I am thinking about these two solutions:
Do it the normal way: a complex SQL query, then cache the result for, let's say, an hour or two.
Write a script that is executed every two or three hours to calculate the data and store it in the same news table, in "news_hits, news_likes, news_comments" numeric fields.
And of course the third way is to run the query each time the page is loaded, without any caching.
I feel that method number one is the one I should go with, but I wanted a professional or experienced opinion. I am not expecting a huge number of visitors, around 500-1000 a day at most, but I still want to be prepared for high traffic.
thank you,
Rami
It would be best to admit redundancy in this case, to improve speed. To the news table, add these fields:
comments_count int not null default 0,
likes_count int not null default 0,
hits_count int not null default 0
When a comment/like/hit is added/deleted, if the database supports triggers, trigger an increment/decrement of the referenced counter, and if not - do it manually on each insert/delete (stored procedure maybe?).
This type of data is read far more often than it is written, so giving up some write speed and storage space to optimize read speed isn't a big deal.
From time to time, it would be OK to run a query that would update these counters if by some reason they become erroneous.
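Here is a minimal sketch of the counter approach for comments only (likes and hits would follow the same pattern). It uses SQLite triggers through Python's sqlite3 as a stand-in; MySQL's trigger syntax differs slightly, and the trigger names and the `repair_counters` helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news (
        news_id        INTEGER PRIMARY KEY,
        comments_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        news_id    INTEGER NOT NULL REFERENCES news(news_id)
    );
    -- Keep the redundant counter in step with the comments table.
    CREATE TRIGGER comments_added AFTER INSERT ON comments
    BEGIN
        UPDATE news SET comments_count = comments_count + 1
        WHERE news_id = NEW.news_id;
    END;
    CREATE TRIGGER comments_removed AFTER DELETE ON comments
    BEGIN
        UPDATE news SET comments_count = comments_count - 1
        WHERE news_id = OLD.news_id;
    END;
""")


def repair_counters(conn):
    """Recompute every counter from the source rows (the occasional fix-up query)."""
    conn.execute(
        """
        UPDATE news SET comments_count =
            (SELECT COUNT(*) FROM comments WHERE comments.news_id = news.news_id)
        """
    )
```

Listing articles then becomes a plain SELECT on news with an ORDER BY on whichever counter the user sorts by, with no joins or grouping.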
Break the complex SQL into several smaller, less complex queries and cache the individual results, so that whenever you want to warm up the cache it won't take too many database resources.
With such a simple model, query and low number of visitors I would go for the straight query. It will execute just fine (milliseconds) with proper indexing.
If I understand the scenario correctly, the query should sort news articles by their popularity, which is determined in some way by the number of likes/hits/comments.
If you are set on fixing a performance problem you may not actually run into, the simplest "solution" would be to use a query cache that expires every 10 seconds. With your current load, each visitor would basically always render the view from the database since the cache expires between page visits. If, one day you suddenly become overrun with say 200,000 visitors, you would only perform the query once every 10 seconds.
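The short-lived cache can be sketched in a few lines. This is an in-process illustration only, with a hypothetical `ExpiringCache` class; a PHP site served by several processes would more likely keep the cached value in a shared store such as memcached, but the expiry logic is the same.

```python
import time


class ExpiringCache:
    """Hold one computed value and recompute it once it is older than ttl_seconds."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the expiry can be tested
        self._value = None
        self._expires_at = None     # None means nothing has been cached yet

    def get(self, compute):
        """Return the cached value, calling compute() only when it has expired."""
        now = self.clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self.ttl
        return self._value
```

With a 10-second lifetime the expensive query runs at most once every 10 seconds however many visitors arrive, and at low traffic almost every visitor still sees fresh data.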
As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
Means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible as that is the weapon of choice at my company.
Cheers.
First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll have the first 20 records, let's say their id's are: 5,8,9,...,55,64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, only 20 records read. It doesn't allow you to jump arbitrarily to any page, but most of the time people just browse the next/prev page. An index on "id" will improve the performance, even with big OFFSET values.
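The same idea as a small runnable sketch, again with SQLite (Python's sqlite3) standing in for MySQL. The `payload` column and the `fetch_page` helper are assumptions; the ids are deliberately given gaps to show that the technique does not depend on consecutive values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")
# 100 rows with ids 1, 4, 7, ..., 298 (gaps are fine for this technique).
conn.executemany(
    "INSERT INTO my_table (id, payload) VALUES (?, ?)",
    [(n, f"row {n}") for n in range(1, 301, 3)],
)


def fetch_page(conn, after_id=None, page_size=20):
    """Return (rows, last_id); pass last_id back as after_id to get the next page."""
    if after_id is None:
        cur = conn.execute(
            "SELECT id, payload FROM my_table ORDER BY id LIMIT ?", (page_size,)
        )
    else:
        # Seek past the last row already shown instead of counting an OFFSET.
        cur = conn.execute(
            "SELECT id, payload FROM my_table WHERE id > ? ORDER BY id LIMIT ?",
            (after_id, page_size),
        )
    rows = cur.fetchall()
    last_id = rows[-1][0] if rows else None
    return rows, last_id
```

The `last_id` value is what would go into the "next page" link in place of a page number.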
A solution might be to not use the limit clause, and use a join instead -- joining on a table used as some kind of sequence.
For more information: I found this question / answer on SO, which gives an example that might help you ;-)
There are basically 3 approaches to this, each of which has its own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT and grab only the records you need with each request, completely stateless. Benefit in that it only sends the records for the page currently requested, so requests are small, downsides in that a) it requires a server request for each page, and b) it's slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).
SELECT * FROM my_table LIMIT 10000, 20;
means show 20 records starting from record #10000 in the result set. If you're using primary keys in the WHERE clause, there will not be a heavy load on MySQL.
Any other method of pagination, such as a join, will put a really heavy load on the server.
I'm not aware of the performance decrease that you've mentioned, and I don't know of any other solution for pagination; however, an ORDER BY clause might help you reduce the load time.
The best way is to define an index field in my_table and increment it for every newly inserted row. Then you can use WHERE YOUR_INDEX_FIELD BETWEEN 10000 AND 10020
It will be much faster.
some other options,
Partition the tables per each page so ignore the limit
Store the results into a session (a good idea would be to create a hash of that data using md5, then using that cache the session per multiple users)
I am building a fairly large statistics system, which needs to allow users to request statistics for a given set of filters (e.g. a date range).
e.g. This is a simple query that returns 10 results, including the player_id and amount of kills each player has made:
SELECT player_id, SUM(kills) as kills
FROM `player_cache`
GROUP BY player_id
ORDER BY kills DESC
LIMIT 10
OFFSET 30
The above query will offset the results by 30 (i.e. the 4th 'page' of results, at 10 per page). When the user then selects the 'next' page, it will use OFFSET 40 instead of 30.
My problem is that nothing is cached: even though the LIMIT/OFFSET pair is applied to the same dataset, it performs the SUM() all over again just to offset the results by 10 more.
The above example is a simplified version of a much bigger query which just returns more fields, and takes a very long time (20+ seconds, and will only get longer as the system grows).
So I am essentially looking for a solution to speed up the page load, by caching the state before the LIMIT/OFFSET is applied.
You can of course use caching, but I would recommend caching the result, not the query in MySQL.
But first things first, make sure that a) you have the proper indexing on your data, b) that it's being used.
If this does not work, as group by tends to be slow with large datasets, you need to put the summary data in a static table/file/database.
There are several techniques/libraries etc that help you perform server side caching of your data. PHP Caching to Speed up Dynamically Generated Sites offers a pretty simple but self explanatory example of this.
Have you considered periodically running your long query and storing all the results in a summary table? The summary table can be quickly queried because there are no JOINs and no GROUPings. The downside is that the summary table is not up-to-the-minute current.
I realize this doesn't address the LIMIT/OFFSET issue, but it does fix the issue of running a difficult query multiple times.
Depending on how often the data is updated, data-warehousing is a straightforward solution to this. Basically you:
Build a second database (the data warehouse) with a similar table structure
Optimise the data warehouse database for getting your data out in the shape you want it
Periodically (e.g. overnight each day) copy the data from your live database to the data warehouse
Make the page get its data from the data warehouse.
There are different optimisation techniques you can use, but it's worth looking into:
Removing fields which you don't need to report on
Adding extra indexes to existing tables
Adding new tables/views which summarise the data in the shape you need it.