I'm running a PHP script that searches a relatively large MySQL table with millions of rows for terms like "diabetes mellitus" in a description column that has a full-text index on it. However, after one day I'm only through a couple hundred queries, so it seems like my approach is never going to work. The entries in the description column are on average 1,000 characters long.
I'm trying to figure out my next move and I have a few questions:
My MySQL table has unnecessary columns in it that aren't being queried. Will removing those affect performance?
I assume running this locally rather than on RDS will dramatically increase performance? I have a decent MacBook, but I chose RDS since cost isn't an issue, and I tried to run on an instance that was better than my MacBook.
Would using a compiled language like Go rather than PHP do more than the 5-10x boost people report in test examples? That is, given my task, is there any reason to think a static language would produce a 100x or greater speed improvement?
Should I put the data in a text or CSV file rather than MySQL? Is using MySQL just causing unnecessary overhead?
This is the query:
SELECT id
FROM text_table
WHERE match(description) against("+diabetes +mellitus" IN BOOLEAN MODE);
Here's the line of output of EXPLAIN for the query, showing the optimizer is utilizing the FULLTEXT index:
id  select_type  table       type      possible_keys  key  key_len  ref   rows  Extra
1   SIMPLE       text_table  fulltext  idx            idx  0        NULL  1     Using where
The RDS instance is db.m4.10xlarge, which has 160GB of RAM. The InnoDB buffer pool is typically about 75% of RAM on an RDS instance, which makes it about 120GB.
The text_table status is:
Name: text_table
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 26000630
Avg_row_length: 2118
Data_length: 55079485440
Max_data_length: 0
Index_length: 247808
Data_free: 6291456
Auto_increment: 29328568
Create_time: 2018-01-12 00:49:44
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
This indicates the table has about 26 million rows, and the size of data and indexes is 51.3GB, but this doesn't include the FT index.
To get the size of the FT index, query:
SELECT stat_value * @@innodb_page_size
FROM mysql.innodb_index_stats
WHERE table_name='text_table'
AND index_name = 'FTS_DOC_ID_INDEX'
AND stat_name='size'
The size of the FT index is 480,247,808 bytes, i.e. about 0.45GB.
Following up on comments above about concurrent queries.
If the query is taking 30 seconds to execute, then the programming language you use for the client app won't make any difference.
I'm a bit skeptical that the query is really taking 1 to 30 seconds to execute. I've tested MySQL fulltext search, and I found a search runs in under 1 second even on my laptop. See my presentation https://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
It's possible that it's not the query that's taking so long, but it's the code you have written that submits the queries. What else is your code doing?
How are you measuring the query performance? Are you using MySQL's query profiler? See https://dev.mysql.com/doc/refman/5.7/en/show-profile.html This will help isolate how long it takes MySQL to execute the query, so you can compare to how long it takes for the rest of your PHP code to run.
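For example, using the profiler from that link, the check might look roughly like this (the search term is the one from your query):
SET profiling = 1;
SELECT id FROM text_table
WHERE MATCH(description) AGAINST('+diabetes +mellitus' IN BOOLEAN MODE);
SHOW PROFILES;               -- lists the recent statements with their total durations
SHOW PROFILE FOR QUERY 1;    -- per-stage timing; use the Query_ID reported by SHOW PROFILES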
Using PHP is going to be single-threaded, so you are running one query at a time, serially. The RDS instance you are using has 40 CPU cores, so you should be able to run many concurrent queries at a time. But each query would need to be run by its own client.
So one idea would be to split your input search terms into at least 40 subsets, and run your PHP search code against each respective subset. MySQL should be able to run the concurrent queries fine. Perhaps there will be a slight overhead, but this will be more than compensated for by the parallel execution.
You can split your search terms manually into separate files, and then run your PHP script with each respective file as the input. That would be a straightforward way of solving this.
But to get really professional, learn to use a tool like GNU parallel to run the 40 concurrent processes and split your input over these processes automatically.
Related
I was optimizing MySQL queries for a project, and replaced many situations where a small query was run inside a loop thousands of times with a single query using WHERE value IN (...), and I got a significant speed boost.
However, for some queries where the row id list got bigger, the performance of value IN (...) got much worse, and it actually takes much longer to perform one query vs thousands of smaller ones.
All the queries are using indexes.
How / why is it faster to run a query 20,000 times compared to making one query with a value list containing 20,000 items? In the end, I would expect MySQL to do the same lookup of 20,000 rows, just without the overhead of sending the request and downloading the response on every loop cycle.
Expecting value IN (...) to be faster feels intuitive, yet it isn't.
Is there something I can optimize or configure on the query/database side?
For context: I'm optimizing queries compiled by the Laravel framework. Previous devs didn't notice that the framework made "n" queries every time a property was accessed. I used eager loading to make single queries, but it made some queries run much slower (which is odd).
An example of a query that got slower: running the query below tens of thousands of times
select * from `products`
where `products`.`id` = ?
and `products`.`deleted_at` is null
limit 1
is faster than running the next query once:
select * from `products`
where `products`.`id`
IN ([list of 20 000 ids])
and `products`.`deleted_at` is null
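(For illustration: one commonly suggested alternative to a very large IN() list, sketched here with a made-up temporary table name, is to load the ids into an indexed temporary table and join against it.)
CREATE TEMPORARY TABLE tmp_product_ids (id INT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO tmp_product_ids (id) VALUES (1), (2), (3);   -- load the 20 000 ids here, in batches
SELECT `products`.*
FROM `products`
JOIN tmp_product_ids ON tmp_product_ids.id = `products`.`id`
WHERE `products`.`deleted_at` IS NULL;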
I have a MySQL table with 2,000,000 rows, my website has 40,000 to 50,000 visits per day, PHP is running 150 queries per second in total, and the MySQL CPU usage is around 90%. The website is extremely slow.
Dedicated Server: AMD Opteron 8 cores, 16 GB DDR3.
Here are the MYSQL query details:
Search Example: Guns And Roses
Table Storage Engine: MyISAM
Query example:
SELECT SQL_CACHE mp3list.*, likes.* FROM mp3list
LEFT JOIN likes ON mp3list.mp3id = likes.mp3id
WHERE mp3list.active=1 AND mp3list.songname LIKE '%guns%'
AND mp3list.songname LIKE '%and%' AND mp3list.songname LIKE '%roses%'
ORDER BY likes.likes DESC LIMIT 0, 15
Column "songname" is VARCHAR(255).
I want to know what I have to do to implement a lighter MySQL search. If someone could help me I'd be very grateful; I've been looking for a solution for weeks.
Thank you in advance.
Well, one solution would be to stop using a performance killer like LIKE '%something%'.
One way we've done this in the past is to maintain our own lookup tables. By that I mean, put together insert, update and delete triggers which apply any changes to a table like:
CREATE TABLE mp3_words (    -- the table name is just an example
    word VARCHAR(20) NOT NULL,
    id   INT NOT NULL REFERENCES mp3list (id),
    PRIMARY KEY (word, id)
);
Whenever you make a change to mp3list, it gets reflected to that table, which should be a lot faster to search than your current solution.
This moves the cost of figuring out which MP3s contain which words to when you update, rather than every time you select, amortising the cost. Since the vast majority of databases are read far more often than written, this can give substantial improvements. Some DBMSs provide this functionality with a full-text search index (MySQL is one of these).
And you can even put some smarts in the triggers (and queries) to totally ignore noise words like a, an and the, saving both space and time, giving you more fine-grained control over what you want to store.
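In MySQL, that built-in full-text route would look roughly like this (a sketch only; the index name is made up, and note that very short or stopword terms such as "and" are ignored by default):
ALTER TABLE mp3list ADD FULLTEXT INDEX ft_songname (songname);

SELECT mp3list.*, likes.*
FROM mp3list
LEFT JOIN likes ON mp3list.mp3id = likes.mp3id
WHERE mp3list.active = 1
  AND MATCH(mp3list.songname) AGAINST('+guns +roses' IN BOOLEAN MODE)
ORDER BY likes.likes DESC LIMIT 0, 15;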
I know this has been asked before at least in this thread:
is php sort better than mysql "order by"?
However, I'm still not sure about the right option here, since doing the sorting on the PHP side is almost 40 times faster.
This MySQL query runs in about 350-400ms
SELECT
keywords as id,
SUM(impressions) as impressions,
SUM(clicks) as clicks,
SUM(conversions) as conversions,
SUM(not_ctr) as not_ctr,
SUM(revenue) as revenue,
SUM(cost) as cost
FROM visits WHERE campaign_id = 104 GROUP BY keywords DESC
The keywords column (it's an integer) and the campaign_id column are both indexed.
It scans about 150k rows and returns around 1,500 rows in total.
The results are then recalculated (we calculate click through rates, conversion rates, ROI etc, as well as the totals for the whole result set). The calculations are done in PHP.
Now my idea was to store the results with PHP APC for quick retrieval. However, we need to be able to order these results by the columns as well as by the calculated values, so if I wanted to order by click-through rate I'd have to use SUM(clicks) / (SUM(impressions) - SUM(not_ctr)) within the query, which makes it around 40ms slower, and the initial 400ms is a really long time already.
In addition we paginate these results, but adding LIMIT 0,200 doesn't really affect the performance.
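For reference, the ordered version would look something like this (just a sketch; ctr is only an alias):
SELECT
    keywords AS id,
    SUM(impressions) AS impressions,
    SUM(clicks) AS clicks,
    SUM(conversions) AS conversions,
    SUM(not_ctr) AS not_ctr,
    SUM(revenue) AS revenue,
    SUM(cost) AS cost,
    SUM(clicks) / (SUM(impressions) - SUM(not_ctr)) AS ctr
FROM visits
WHERE campaign_id = 104
GROUP BY keywords
ORDER BY ctr DESC
LIMIT 0, 200;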
While testing the APC approach I executed the query, did the additional calculations and stored the array in memory so it would only be executed once during the initial request, and that worked like a charm. Fetching and sorting the array from memory only took around 10ms; however, the script's memory usage was about 25MB. Maybe it's worth loading the results into a memory table and then querying that table directly?
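By a memory table I mean something like this (just a sketch; the summary table name is made up):
CREATE TABLE visits_summary ENGINE=MEMORY AS
SELECT keywords AS id,
       SUM(impressions) AS impressions,
       SUM(clicks) AS clicks,
       SUM(not_ctr) AS not_ctr
FROM visits
WHERE campaign_id = 104
GROUP BY keywords;

-- sorting and paginating the small summary table is then cheap
SELECT *, clicks / (impressions - not_ctr) AS ctr
FROM visits_summary
ORDER BY ctr DESC
LIMIT 0, 200;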
This is all done on my local machine (i7, 8GB RAM) which has the default MySQL install, and the production server is a 512MB box on Rackspace which I haven't tested on yet, so if possible ignore the server setup.
So the real question is: Is it worth using memory tables or should I just use the PHP sorting and ignore the RAM usage since I can always upgrade the RAM? What other options would you consider in optimizing the performance?
In general, you want to do sorting on the database server and not in the application. One good reason is that the database should be implementing parallel sorts and it has access to indexes. A general rule may not be applicable in all circumstances.
I'm wondering if your indexes are helping you. I would recommend that you try the query:
With no indexes
With an index only on campaign_id
With both indexes
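One way to run that comparison is on a scratch copy of the table (a rough sketch; the copy and index names here are invented):
CREATE TABLE visits_test LIKE visits;
INSERT INTO visits_test SELECT * FROM visits;
-- variant 1: no indexes (drop whatever secondary indexes the copy inherited; names assumed)
ALTER TABLE visits_test DROP INDEX idx_campaign_id, DROP INDEX idx_keywords;
EXPLAIN SELECT keywords, SUM(clicks) FROM visits_test WHERE campaign_id = 104 GROUP BY keywords;
-- variant 2: an index on campaign_id only, then re-run the EXPLAIN and time the query
ALTER TABLE visits_test ADD INDEX idx_campaign_id (campaign_id);
-- variant 3: both indexes
ALTER TABLE visits_test ADD INDEX idx_keywords (keywords);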
Indexes are not always useful. One particularly important factor is called "selectivity". If you only have two campaigns in the table, then you are probably better off doing a full-table scan rather than indirectly searching through an index. This becomes particularly important when the table does not fit into memory (resulting in a condition where every row requires loading a page into the cache).
Finally, if this is going to be an application that expands beyond your single server, be careful. What is optimal on a single machine may not be optimal in a different environment.
My PHP application sends a SELECT statement to MySQL with HTTPClient.
It takes about 20 seconds or more.
I thought MySQL couldn't return the result immediately, because MySQL Administrator showed a state of "sending data" or "copying to tmp table" while I was waiting for the result.
But when I send the same SELECT statement from another application like phpMyAdmin or JMeter, it takes 2 seconds or less. 10 times faster!
Does anyone know why MySQL performs so differently?
Like #symcbean already said, php's mysql driver caches query results. This is also why you can do another mysql_query() while in a while($row=mysql_fetch_array()) loop.
The reason MySQL Administrator or phpMyAdmin shows results so fast is that they append a LIMIT 10 to your query behind your back.
If you want to get your query results fast, I can offer some tips. They involve selecting only what you need, when you need it:
Select only the columns you need; don't throw SELECT * everywhere. This might bite you later when you want another column but forget to add it to the select statement, so do this where it matters (like tables with 100 columns or a million rows).
Don't throw a 20-by-1000 table in front of your user. She can't find what she's looking for in a giant table anyway. Offer sorting and filtering. As a bonus, find out what she generally looks for and offer a way to show those records with a single click.
With very big tables, select only the primary keys of the records you need. Then retrieve additional details in the while() loop. This might look illogical because you make more queries, but when you deal with queries involving around ~10 tables, hundreds of concurrent users, locks and query caches, things don't always make sense at first :)
These are some tips I learned from my boss and my own experience. As always, YMMV.
Does anyone know why MySQL performs so differently?
Because MySQL caches query results, and the operating system caches disk I/O (see this link for a description of the process in Linux)
I have a table to which approx 100,000 rows are added every day. I am supposed to generate reports from this table, and I am using PHP to generate them. Recently the script which does this has been taking too long to complete. How can I improve performance, perhaps by shifting to something other than MySQL that is scalable in the long run?
MySQL is very scalable, that's for sure.
The key is not changing the DB from MySQL to something else; instead you should:
Optimize your queries (this can sound silly, but for instance a huge improvement I made some time ago was to change SELECT * into selecting only the column(s) I need; it's a frequent issue I see in other people's code too).
Optimize your table(s) design (normalization etc).
Add indexes on the column(s) you are using frequently in the queries.
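For instance (the table and column names here are just placeholders):
-- instead of SELECT * ...
SELECT report_date, amount FROM sales WHERE report_date >= '2018-01-01';
-- ... and index the column the WHERE clause filters on
ALTER TABLE sales ADD INDEX idx_report_date (report_date);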
Similar advice here
For generating reports or file downloads with large chunks of data you should consider using flush and increasing time_limit and memory_limit.
I doubt the problem lies in the number of rows, since MySQL can support a LOT of rows. But you can of course fetch x rows at a time and process them in chunks.
I do assume your MySQL is properly tweaked for performance.
First analyse why (or: whether) your queries are slow: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
You should read the following and learn a little bit about the advantages of a well-designed InnoDB table and how best to use clustered indexes (only available with InnoDB!).
The example includes a table with 500 million rows with query times of 0.02 seconds.
MySQL and NoSQL: Help me to choose the right one
Hope you find this of interest.
Another thought is to move records beyond a certain age to a historical database for archiving, reporting, etc. If you don't need that large volume for transactional processing it might make sense to extract them from the transactional data store.
It's common to separate transactional and reporting databases.
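A minimal sketch of that kind of move (all of the names and the 12-month cutoff are hypothetical):
CREATE TABLE transactions_archive LIKE transactions;
INSERT INTO transactions_archive
SELECT * FROM transactions
WHERE created_at < NOW() - INTERVAL 12 MONTH;
DELETE FROM transactions
WHERE created_at < NOW() - INTERVAL 12 MONTH;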
I am going to make some assumptions:
Your 100k rows added every day have timestamps which are either real-time, or are offset by a relatively short amount of time (hours at most); your 100k rows are added either throughout the day or in a few big batches.
The data are never updated
You are using the InnoDB engine (frankly, you would be insane to use MyISAM for large tables because in the event of a crash, the index rebuild takes a prohibitive amount of time)
You haven't explained what kind of reports you're trying to generate, but I'm assuming that your table looks like this:
CREATE TABLE logdata (
dateandtime some_timestamp_type NOT NULL,
property1 some_type_1 NOT NULL,
property2 some_type_2 NOT NULL,
some_quantity some_numerical_type NOT NULL,
... some other columns not required for reports ...
... some indexes ...
);
And that your reports look like
SELECT count(*), SUM(some_quantity), property1 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property1;
SELECT count(*), SUM(some_quantity), property2 FROM logdata WHERE dateandtime BETWEEN some_time_range GROUP BY property2;
Now, as we can see, both of these reports are doing a scan of a large amount of the table, because you are reporting on a lot of rows.
The bigger the time range becomes the slower the reports will be. Moreover, if you have a lot of OTHER columns (say some varchars or blobs) which you aren't interested in reporting on, then they slow your report down too (because the server still needs to inspect the rows).
You can use several possible techniques for speeding this up:
Add a covering index for each type of report, to support the columns you need and omit columns you don't (a sketch follows this list). This may help a lot but will slow inserts down.
Summarise data according to the dimension(s) that you want to report on. In this fictitious case, all your reports are either counting rows or SUM()ing some_quantity.
Build mirror tables (containing the same data) which have appropriate primary keys / indexes/ columns to make the reports faster.
Use a column engine (e.g. Infobright)
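For the covering-index option, the first report might be served by something like this (a sketch; the index name is made up):
ALTER TABLE logdata ADD INDEX idx_report_p1 (dateandtime, property1, some_quantity);
-- the first report can then be answered entirely from the index, without touching the wide rows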
Summarisation is usually an attractive option if your use-case supports it.
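As an illustration, summarising by day and property1 might look like this (the summary table and the concrete column types are assumptions layered on the fictitious schema above):
CREATE TABLE logdata_summary_p1 (
    report_day     DATE NOT NULL,
    property1      INT NOT NULL,             -- whatever some_type_1 really is
    row_count      BIGINT NOT NULL,
    quantity_total DECIMAL(20,4) NOT NULL,   -- whatever some_numerical_type really is
    PRIMARY KEY (report_day, property1)
);

-- refresh the current day (or maintain it incrementally as rows arrive)
INSERT INTO logdata_summary_p1 (report_day, property1, row_count, quantity_total)
SELECT DATE(dateandtime), property1, COUNT(*), SUM(some_quantity)
FROM logdata
WHERE dateandtime >= CURRENT_DATE
GROUP BY DATE(dateandtime), property1
ON DUPLICATE KEY UPDATE row_count = VALUES(row_count), quantity_total = VALUES(quantity_total);

-- reports then read the much smaller summary table
SELECT SUM(row_count), SUM(quantity_total), property1
FROM logdata_summary_p1
WHERE report_day BETWEEN '2018-01-01' AND '2018-01-31'
GROUP BY property1;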
You may wish to ask a more detailed question with an explanation of your use-case.
The time limit can be temporarily turned off for a particular file, if you know that it might run over the limit, by calling set_time_limit(0); at the start of your script.
Other considerations such as indexing or archiving very old data to a different table should also be looked at.
Your best bet is something like MongoDB or CouchDB, both of which are non-relational databases oriented toward storing massive amounts of data. This is assuming that you've already tweaked your MySQL installation for performance and that your situation wouldn't benefit from parallelization.