I have a MySQL/PHP script running on my Linux machine. It is basically migrating file contents into a MySQL table. There are about 4,400,000 account files, and each file's content is placed in the table as one row. It has been 14 hours and so far it has only done 300,000 accounts.
At first it was very fast, doing about 1000 files a second; now it has slowed down to 50 files per second and the mysql process is consuming 95% of the server CPU.
The machine has multiple cores, and I was wondering whether it is possible to allocate more than one core to the mysql process, which is consuming 95% of the CPU.
Or is there any other way to make the process faster?
Thank you.
Here is the script:
https://paste.ee/p/LZwlH#GHxpgqiUUPsVQFchdKVny2DEJQxaXH9V
Do not use the mysql_* API. Switch to mysqli_* or PDO.
Please provide these:
SHOW CREATE TABLE
SHOW VARIABLES LIKE '%buffer%';
select * from players where p_name=' -- there is no need to select *, simply SELECT 1. Do you have an index on p_name? That is very important.
It smells like index updating, but can't be sure.
One way to speed up inserts is to 'batch' them -- 100 rows at a time will typically run 10 times as fast as 1 at a time.
Even better might be to use LOAD DATA. You may need to load into a temp table, then massage things before doing INSERT .. SELECT .. to put the data into the real table.
Temporarily remove the INSERT from the procedure. See how fast it runs. (You have not 'proven' that INSERT is the villain.)
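To make the batching suggestion concrete, here is a minimal sketch in Python with an in-memory SQLite database standing in for MySQL. The `players` table and `p_name` column come from the question; the `data` column, the `insert_batched` helper, and the batch size of 100 are assumptions for illustration only. In MySQL the same idea is usually expressed as one multi-row `INSERT ... VALUES (...), (...), ...` per batch.

```python
import sqlite3


def insert_batched(conn, rows, batch_size=100):
    """Insert rows in groups of batch_size, committing once per group."""
    batches = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        with conn:  # one commit per batch instead of one per row
            conn.executemany(
                "INSERT INTO players (p_name, data) VALUES (?, ?)", chunk
            )
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
# The primary key on p_name doubles as the index needed for lookups by name.
conn.execute("CREATE TABLE players (p_name TEXT PRIMARY KEY, data TEXT)")

rows = [(f"account{i}", f"contents of file {i}") for i in range(1050)]
batches = insert_batched(conn, rows)
total = conn.execute("SELECT COUNT(*) FROM players").fetchone()[0]
print(batches, total)
```

The gain comes from paying the per-statement and per-commit overhead once per batch rather than once per row; how large the gain is on a real server would have to be measured.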
I'm currently working on a project that is like an e-commerce site. There are hundreds of thousands of records in the database tables. I also have to use join operations on them, because the project has a query builder for selecting data by criteria. It takes too much time to fetch the data, so I limit the number of records per page (e.g. 10). I have now learned about memcached, so I thought of using it for my project, since the slow fetch would then happen only once. But I still have some doubts.
Will too many cache files be a problem? There will be one cache file for each page of each module, so the total will be approximately 10,000 cache files.
Let's assume the number of files is not a problem. What about updating the files using replace() when a row is added to or deleted from the middle of the table? The table is updated about once a week.
So I'm in a dilemma about whether to go for memcached or not. If anyone can advise and answer with an explanation, it will be appreciated.
If your website executes many of the same MySQL queries that frequently return the same data, then yes, there is probably some benefit to running memcached.
Problem:
"There are hundreds of thousands of records...It takes too much time to fetch the data".
This probably indicates a problem with your schema. Properly indexed, even when using JOINs, the queries should be able to execute quickly (< 0.1 seconds). Run an EXPLAIN query on the queries that are taking a long time to run and see if they can be improved.
Answer to Question 1
There won't be an issue with too many cache files. Memcached stores all cached information in memory (hence the name), so no disk files are used. Cached objects are stored in RAM and accessed directly from RAM.
Answer to Question 2
Not exactly sure what you are asking here, but if your application updates or deletes information from the database, then it is critical that the cache items affected by the updates and deletes are deleted. If the application doesn't remove cached items affected by such operations, then the next time the data is queried, cached results which are no longer valid may be returned. Make sure any data cached either has appropriate expiration times set, or the application removes them from cache when the data in the database changes.
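As a rough illustration of that rule, here is a small Python sketch. `TinyCache` is a hypothetical in-process stand-in for a memcached client (it only mimics get, set with an expiry, and delete), and the `database` dict and the `product:` key scheme are likewise assumptions, not anything from the question.

```python
import time


class TinyCache:
    """In-memory stand-in for memcached: get, set with a TTL, and delete."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:  # expired entries count as misses
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (value, time.monotonic() + ttl)

    def delete(self, key):
        self._items.pop(key, None)


database = {1: "red shirt"}  # hypothetical stand-in for a MySQL table
cache = TinyCache()


def read_product(product_id):
    key = f"product:{product_id}"
    value = cache.get(key)
    if value is None:  # cache miss: read from the database, then cache it
        value = database[product_id]
        cache.set(key, value, ttl=300)
    return value


def update_product(product_id, name):
    database[product_id] = name
    cache.delete(f"product:{product_id}")  # drop the now-stale entry


first = read_product(1)
update_product(1, "blue shirt")
second = read_product(1)
print(first, second)
```

Without the `cache.delete` call in `update_product`, the second read would keep returning the old value until the TTL ran out, which is exactly the stale-result problem described above.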
Hope that helps.
I would start not from Memcached but from figuring out what the bottleneck is. Your tables have roughly one million rows. I don't know the size of a row, but my educated guess is that it is less than 1K, based on the fact that a browser window accommodates the information from one record.
So there is probably about 1G of information in your database. Correct me if I'm wrong. If that's true, then the whole database should be automatically cached in RAM by MySQL.
Once the database is entirely in RAM, then with properly organized indexes the cost of a query should be linear in the size of the result set, which is measured in kilobytes because it fits in a browser window.
So my advice is to determine the size of the database and to look at the output of the "top" command in order to know how much memory is consumed by MySQL. If you are sure that your database sits entirely in memory, then run the EXPLAIN command against your most popular queries and add indexes to your database according to the result. Even if your database is bigger than the amount of RAM, I still recommend looking at the results of the EXPLAIN command, because it really helps a lot.
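The workflow of checking a plan, adding an index, and checking again can be sketched as follows. This uses Python with SQLite's `EXPLAIN QUERY PLAN` purely as a stand-in; MySQL's `EXPLAIN` output looks different, and the `orders` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 500, i * 1.5) for i in range(5000)],
)


def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT total FROM orders WHERE customer_id = 42"

before = plan(query)  # no usable index yet, so the table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # the planner can now search via the new index

print(before)
print(after)
```

The exact wording of the plan text varies between SQLite versions, so the point is only the change from a full scan to an index search on the filtered column.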
I have a cron job script, written in PHP, with the following requirements:
Step 1 (DB server 1): Get some data from multiple tables (we have a lot of data here).
Step 2 (Application server): Perform some calculations.
Step 3 (DB server 2): After the calculations, insert that data into another database (MySQL) / table (InnoDB) for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. This is a different physical DB server and has only one DB.
The script worked fine during development, but in production Step 1 returned approximately 50 million records. As a result, the script ran for around 4 days and then failed. (As a rough estimate, at the current rate it would have taken approximately 171 days to finish.)
Just as a note, we were using prepared statements, and Step 1 fetches data in batches of 1000 records at a time.
What we did till now
Optimization Step 1: Multiple values in insert & drop all indexes
Some tests showed that the insert (Step 3 above) takes the most time (more than 95% of the total). To optimize, after some googling, we dropped all indexes from the table, and instead of one insert query per row we now have one insert query per 100 rows. This made inserts a bit faster, but by a rough estimate it will still take 90 days to run the cron job once, and we need to run it every month because new data becomes available every month.
Optimization Step 2: Instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.
This step does not seem to work. Writing 30,000 rows to the CSV file took 16 minutes, and we would still need to import that CSV file into MySQL. We use a single file handle for all write operations.
Current state
I'm now out of ideas about what else can be done. Some key requirements:
The script needs to insert approximately 50,000,000 records (this will increase with time).
There are 97 columns for each record; we can skip some, but 85 columns is the minimum.
Based on the input, we can break the script into three different cron jobs running on three different servers, but the inserts have to be done on one DB server (the master), so I'm not sure whether that will help.
However:
We are open to changing the database/storage engine (including NoSQL).
In production we could have multiple database servers, but the inserts have to be done on the master only. All read operations can be directed to the slaves; reads are minimal and occasional (just to generate reports).
Question
I don't need a detailed answer, but can someone briefly suggest what a possible solution could be? I just need some optimization hints and I'll do the remaining R&D.
We are open to everything: changing the database/storage engine, server optimization, multiple servers (both DB and application), changing the programming language, or whatever configuration is best for the above requirements.
Final expectation: the cron job must finish within 24 hours at most.
Edit in optimization step 2
To further understand why generating the CSV takes so long, I've created a stripped-down replica of my code containing only the necessary parts. That code is on GitHub: https://github.com/kapilsharma/xz
The output file of the experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check the file above, I'm inserting 500,000 records and fetching 5000 records from the database at a time, so the loop runs 100 times. The time taken in the first loop was 0.25982284545898 seconds, but in the 100th loop it was 3.9140808582306. I assume it's because of system resources and/or the size of the CSV file. In that case, it becomes more of a programming question than a DB optimization one. Still, can someone suggest why later loops take more time?
If needed, the whole code is committed, except the CSV files and the SQL file generated to create the dummy DB, as these files are very big. However, they can easily be generated with the code.
Using OFFSET and LIMIT to walk through a table is O(N*N), which is much slower than you want or expect, because each query has to step over all the rows before the offset.
Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for this. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
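A minimal sketch of "remembering where you left off", written in Python against an in-memory SQLite table as a stand-in for the MySQL source. The `source` table, the `payload` column, and the `iter_chunks` helper are hypothetical; the sketch assumes an integer primary key whose values are all greater than zero.

```python
import sqlite3


def iter_chunks(conn, chunk_size):
    """Yield lists of (id, payload) rows, resuming after the last id seen."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM source WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # remember where this chunk ended


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source (payload) VALUES (?)",
    [(f"row {i}",) for i in range(12500)],
)

chunks = list(iter_chunks(conn, 5000))
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Each query seeks directly to `id > last_id` through the primary key, so the cost of a chunk does not grow with how far into the table the walk has progressed, and gaps in the id sequence do not matter.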
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
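One way the audit-table idea could look, sketched in Python with SQLite as a stand-in. Everything named here (`rates`, `rates_audit`, the triggers, `changed_since`) is hypothetical, and on the real system the audit rows might be written by the application rather than by triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rates (id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE rates_audit (
    audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
    rate_id  INTEGER NOT NULL
);
-- Record every insert and update so the batch job can find the deltas.
CREATE TRIGGER rates_ins AFTER INSERT ON rates
BEGIN INSERT INTO rates_audit (rate_id) VALUES (NEW.id); END;
CREATE TRIGGER rates_upd AFTER UPDATE ON rates
BEGIN INSERT INTO rates_audit (rate_id) VALUES (NEW.id); END;
""")


def changed_since(conn, last_audit_id):
    """Return (sorted ids changed after last_audit_id, new high-water mark)."""
    rows = conn.execute(
        "SELECT audit_id, rate_id FROM rates_audit "
        "WHERE audit_id > ? ORDER BY audit_id",
        (last_audit_id,),
    ).fetchall()
    ids = sorted({rate_id for _, rate_id in rows})
    mark = rows[-1][0] if rows else last_audit_id
    return ids, mark


conn.executemany(
    "INSERT INTO rates (id, value) VALUES (?, ?)", [(i, 1.0) for i in range(1, 6)]
)
first_ids, mark = changed_since(conn, 0)       # first run sees every row

conn.execute("UPDATE rates SET value = 2.0 WHERE id IN (2, 4)")
second_ids, mark2 = changed_since(conn, mark)  # next run sees only the deltas
print(first_ids, second_ids)
```

The cron job would persist the returned high-water mark between runs and recompute only the rows listed, instead of all 50M.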
I had a mailer cron job on CakePHP that failed when fetching merely 600 rows and sending email to the registered users. It couldn't even do the job in batches. We finally opted for Mandrill, and since then it has all gone well.
I'd suggest (considering it a bad idea to touch the legacy system in production):
Schedule a micro solution in Go or Node.js, considering performance benchmarks; as database interaction is involved, you'll be fine with either of these. Have this micro solution perform the cron job (fetch + calculate).
Reporting from NoSQL will be challenging, so you should try an available service like Google BigQuery. Have the cron job store its data in Google BigQuery and you should get a huge performance improvement, even in generating reports.
or
With each row inserted into your original DB server 1, set up a messaging mechanism that performs the cron job's operations every time an insert is made (a sort of trigger) and stores the result on your reporting server. Possible services you can use are Google Pub/Sub or Pusher. I think the time consumed per insert will be quite small. (You can also use an async service setup that does the task of storing into the reporting database.)
Hope this helps.
I have a big customer table, t_customer, that has 10.000.000 records.
I run a PHP script that selects data from this table, and I need to execute an action on each customer.
But as I progress through the data, the SQL query runs more and more slowly, and now it terminates with Query execution was interrupted.
My query is:
SELECT id, login FROM t_customer WHERE regdate<1370955715 LIMIT 2600000, 100000;
So LIMIT no longer helps, and I don't know what to do about it.
P.S.
SELECT id, login FROM t_customer WHERE regdate<1370955715 LIMIT 2600000, 10;
The query above takes 30 seconds to execute.
P.S.S.
The same result even without a WHERE clause
So you are selecting 100K records in PHP? That is a bad idea.
Lower your batch size to 1K, paginate through your target set and then see how it goes. Make sure you have an index on the regdate too. 100K arrays in PHP are... complicated.
PHP is a scripting language, it's not really C++ :) That's why I write background heavy-lifting workers in C++.
MySQL offers a really clever feature called partitions. http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
It allows you to automatically split huge data sets into smaller files giving you a huge improvement while doing operations on your data. Just like RAID but for SQL :P There was an excellent post on SO on the best configuration for partitions, but can't find it at the moment.
If you improve the performance of your query, PHP should instantly have more time to smash through your loops and arrays.
I have a PHP script that, on every run, inserts a new row into a MySQL DB (with a relatively small amount of data).
I have more than 20 requests per second, and this is causing my CPU to scream for help.
I'm using the SQL INSERT DELAYED method with a MyISAM engine (although I just noticed that INSERT DELAYED is not working with MyISAM).
My main concern is my CPU load, and I started to look for more CPU-friendly ways to store this data.
My first idea was to write this data to hourly log files and, once an hour, retrieve the data from the logs and insert it into the DB all at once.
Maybe a better idea is to use a NoSQL DB instead of log files and then, once an hour, insert the data from the NoSQL DB into MySQL.
I haven't tested any of these ideas yet, so I don't really know whether they will decrease my CPU load. I wanted to ask if someone can help me find the solution that will have the lowest impact on my CPU.
I recently had a very similar problem and my solution was to simply batch the requests. This sped things up about 50 times because of the reduced overhead of mysql connections and also the greatly decreased amount of reindexing. Storing them to a file then doing one larger (100-300 individual inserts) statement at once probably is a good idea. To speed things up even more turn off indexing for the duration of the insert with
ALTER TABLE tablename DISABLE KEYS;
-- run the batched INSERT statement(s) here
ALTER TABLE tablename ENABLE KEYS;
Doing the batch insert will reduce the number of instances of the PHP script running, it will reduce the number of currently open MySQL handles (a large improvement), and it will decrease the amount of indexing.
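The buffer-to-a-file-then-flush idea might look roughly like this. It is a Python sketch with SQLite standing in for MySQL; the `hits` table, its columns, the JSON-lines log format, and both helper functions are assumptions made for the example.

```python
import json
import os
import sqlite3
import tempfile


def log_event(path, event):
    """Append one event as a JSON line; no database round trip is involved."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def flush_log(path, conn):
    """Load every buffered event in one transaction, then empty the log."""
    with open(path, encoding="utf-8") as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    with conn:  # a single commit for the whole batch
        conn.executemany(
            "INSERT INTO hits (user_id, page) VALUES (:user_id, :page)", events
        )
    open(path, "w").close()  # truncate only after the insert succeeded
    return len(events)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (user_id INTEGER, page TEXT)")

log_path = os.path.join(tempfile.mkdtemp(), "hits.log")
for i in range(250):  # what each web request would do
    log_event(log_path, {"user_id": i, "page": "/home"})

flushed = flush_log(log_path, conn)  # what the periodic job would do
stored = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(flushed, stored)
```

With concurrent writers, a real version would likely rename the log file before reading it, so that requests arriving during the flush land in a fresh file rather than being lost by the truncation.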
OK guys, I managed to lower the CPU load dramatically with APC cache.
I'm doing it like so:
storing the data in memory with APC cache, with a TTL of 70 seconds:
apc_store('prfx_SOME_UNIQUE_STRING', $data, 70);
once a minute I loop over all the records in the cache:
$values = array();
$apc_list = apc_cache_info('user');
foreach ($apc_list['cache_list'] as $apc) {
    // take only our own entries, and only those that can still be fetched
    if ((substr($apc['info'], 0, 5) == 'prfx_') && ($val = apc_fetch($apc['info']))) {
        $values[] = $val;
        apc_delete($apc['info']);
    }
}
inserting the $values into the DB
and the CPU continues to smile..
enjoy
I would insert a sleep(1); call at the top of your loop, before every insert, where 1 = 1 second. This only allows the loop to cycle once per second.
This way it will regulate a bit how much load the CPU is getting; this would be ideal assuming you're only writing a small number of records in each run.
You can read more about the sleep function here : http://php.net/manual/en/function.sleep.php
It's hard to tell without profiling both methods. If you write to a log file first, you could end up just making it worse, as you're turning your operation count from N into N*2. You gain a slight edge by writing it all to a file and doing a batch insert, but bear in mind that as the log file fills up, its load/write time increases.
To reduce database load, look at using memcache for database reads if you're not already.
All in all, though, you're probably best off just trying both and seeing what's faster.
Since you are trying INSERT DELAYED, I assume you don't need up to the second data. If you want to stick with MySQL, you can try using replication and the BLACKHOLE table type. By declaring a table as type BLACKHOLE on one server, then replicating it to a MyISAM or other table type on another server, you can smooth out CPU and io spikes. BLACKHOLE is really just a replication log file, so "inserts" into it are very fast and light on the system.
I do not know your table size or your server's capabilities, but I guess you need to make a lot of inserts into a single table. In such a situation I would recommend looking into vertical partitioning, which will reduce the physical size of each partition and significantly reduce the insertion time.
Let's assume I have the following query:
SELECT address
FROM addresses a, names n
WHERE a.address_id = n.address_id
GROUP BY n.address_id
HAVING COUNT(*) >= 10
If the two tables were large enough (think of having the whole US population in these two tables), then running an EXPLAIN on this SELECT would report Using temporary; Using filesort, which is usually not good.
If we have a DB with many concurrent INSERTs and SELECTs (like this), would delegating the GROUP BY a.address_id HAVING COUNT(*) >= 10 part to PHP be a good plan to minimise DB resources? What would be the most efficient way (in terms of computing power) to code this?
EDIT: It seems the consensus is that offloading to PHP is the wrong move. How, then, could I improve the query (let's assume indexes have been created properly)? More specifically, how do I prevent the DB from creating a temporary table?
So your plan to minimize resources is by sucking all the data out of the database and having PHP process it, causing extreme memory usage?
Don't do client-side processing if at all possible - databases are DESIGNED for this sort of heavy work.
Offloading this to PHP is probably the opposite direction you want to go. If you must do this on a single machine then the database is likely the most efficient place to do it. If you have a bunch of PHP machines and only a single DB server, then offloading might make sense, but more likely you'll just clobber the IO capability of the DB. You'll probably get a bigger win by setting up a replica and doing your read queries there. Depending on your ratio of SELECT to INSERT queries, you might want to consider keeping a tally table (many more SELECTs than INSERTs). The more latency you can allow for your results, the more options you have. If you can allow 5 minutes latency, then you might start considering a distributed batch processing system like hadoop rather than a database.
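For the tally-table option, a minimal sketch in Python with SQLite as a stand-in for MySQL. The `names` table and `address_id` column come from the question; `address_tally`, `residents`, and the two helper functions are hypothetical. In MySQL the two tally statements could be collapsed into a single `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (name TEXT, address_id INTEGER);
CREATE TABLE address_tally (
    address_id INTEGER PRIMARY KEY,
    residents  INTEGER NOT NULL
);
""")


def add_name(conn, name, address_id):
    """Insert a name and bump its address counter in the same transaction."""
    with conn:
        conn.execute(
            "INSERT INTO names (name, address_id) VALUES (?, ?)",
            (name, address_id),
        )
        # Create the counter row on first sight, then increment it.
        conn.execute(
            "INSERT OR IGNORE INTO address_tally (address_id, residents) "
            "VALUES (?, 0)",
            (address_id,),
        )
        conn.execute(
            "UPDATE address_tally SET residents = residents + 1 "
            "WHERE address_id = ?",
            (address_id,),
        )


def crowded_addresses(conn, minimum=10):
    """Answer the HAVING COUNT(*) >= minimum question from the tally alone."""
    rows = conn.execute(
        "SELECT address_id FROM address_tally WHERE residents >= ? "
        "ORDER BY address_id",
        (minimum,),
    )
    return [row[0] for row in rows]


for i in range(12):
    add_name(conn, f"person{i}", 1)
for i in range(3):
    add_name(conn, f"other{i}", 2)

crowded = crowded_addresses(conn)
print(crowded)
```

This moves a little work onto every INSERT so that the SELECT no longer needs a GROUP BY over the large table, which is why it only pays off when reads outnumber writes. Deletes and address changes would need matching decrements, which are left out here.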