I was wondering if there is a free tool for benchmarking MySQL/PHP applications.
In particular, I would like to insert thousands of rows into the MySQL database and then hit the application with concurrent queries to see whether it holds up. That is, I want to test the application in the worst case.
I have seen some paid tools, but none that are free or customizable.
Any suggestions? Or any scripts?
Thanks
Insert one record into the table.
Then do:
INSERT IGNORE INTO table SELECT FLOOR(RAND()*100000) FROM table;
Then run that line several times. Each time you will double the number of rows in the table (and doubling grows VERY fast). This is a LOT faster than generating the data in PHP or other code. You can modify which columns you select RAND() from, and what the range of the numbers is. It's possible to randomly generate text too, but more work.
You can run this code from several terminals at once to test concurrent inserts. The IGNORE will ignore any primary key collisions.
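To get a feel for how many times the statement has to be run, here is a small sketch in Python (chosen only for illustration). It assumes the best case where every run really doubles the table; rows skipped by IGNORE make real tables grow more slowly. The function name and the one-million target are made up for the example.

```python
def runs_needed(target_rows, start_rows=1):
    """Number of doubling runs needed to reach target_rows (best case, no collisions)."""
    runs = 0
    rows = start_rows
    while rows < target_rows:
        rows *= 2   # one INSERT ... SELECT run at most doubles the table
        runs += 1
    return runs

print(runs_needed(1_000_000))
```

Starting from a single row, about twenty runs are enough to pass a million rows.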
Make a loop (probably infinite) that keeps inserting data into the database, and test from there.
// $db is an open mysqli connection (the old mysql_* functions were removed in PHP 7)
for ($i = 1; $i <= 1000; $i++) {
    $db->query("INSERT INTO testing VALUES ('" . $i . "')");
    // do some other testing
}
// $db is an open mysqli connection
for ($i = 1; $i < 5000; $i++) {
    $result = $db->query("INSERT INTO something VALUES ($i)");
}
Replace something with your table name.
If you want to test concurrency, you will have to run your insert/update statements in parallel.
An easy and very simple way (without going into fork/threads and all that jazz) is to do it in bash, as follows:
1. Create an executable PHP script
#!/usr/bin/php -q
<?php
/*your php code to insert/update/whatever you want to test for concurrency*/
?>
2. Call it within a for loop, appending & so that it runs in the background.
#!/bin/bash
for ((i=0; i<100; i++))
do
/path/to/my/php/script.sh &
done
wait
You can always extend this by creating multiple PHP scripts with various insert/update/select queries and running them through the for loop (remember to change i<100 to a higher number if you want more load). Just don't forget to add the & after you call your script. (Of course, you will need to chmod +x myscript.sh.)
Edit: Added the wait statement; below it you can write any other commands you want to run after flooding your MySQL db.
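The same "start many copies in the background, then wait" pattern can be sketched outside bash. This is a minimal Python version, not a full load tester; the worker command here is a do-nothing placeholder and `run_concurrently` is a hypothetical helper name.

```python
import subprocess
import sys

def run_concurrently(command, copies):
    """Start `copies` processes of `command` at once, then wait for all of them."""
    procs = [subprocess.Popen(command) for _ in range(copies)]
    return [p.wait() for p in procs]  # list of exit codes, one per worker

if __name__ == "__main__":
    # Placeholder worker; swap in the command that runs your own test script.
    worker = [sys.executable, "-c", "pass"]
    exit_codes = run_concurrently(worker, 10)
    print(exit_codes.count(0), "of", len(exit_codes), "workers succeeded")
```

Collecting the exit codes makes it easy to see whether any worker failed under load.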
I did a quick search and found the following page at MySQL documentation => http://dev.mysql.com/doc/refman/5.0/en/custom-benchmarks.html. This page contains the following interesting links:
the Open Source Database Benchmark, available at
http://osdb.sourceforge.net/.
For example, you can try benchmarking packages such as SysBench and
DBT2, available at http://sourceforge.net/projects/sysbench/, and
http://osdldbt.sourceforge.net/#dbt2. These packages can bring a
system to its knees, so be sure to use them only on your development
systems.
For MySQL to be fast you should look into Memcached or Redis to cache your queries. I like Redis a lot, and you can get a free (small) instance thanks to http://redistogo.com. Most of the time it is the READS that kill your server, not the WRITES, which are usually less frequent. Where WRITES are frequent, it is often not a big deal to lose some data. Sites with high WRITE rates are, for example, Twitter or Facebook, but I don't think it is the end of the world if a tweet or Facebook wall post gets lost. As I pointed out above, you can fix this easily by using Memcached or Redis.
If the WRITES are killing you, you could look into bulk inserts if possible, transactional inserts, delayed inserts when not using InnoDB, or partitioning. If the data is not really critical you could keep the queries in memory first and then do a bulk insert periodically. The catch is that reads from MySQL would return stale data in the meantime (which could be a problem). Then again, with Redis you could easily keep all your data in memory, but if your server crashes you can lose data, which could be a big problem.
Related
I have a cron job script, written in PHP, with the following requirements:
Step 1 (DB server 1): Get some data from multiple tables (We have lot of data here)
Step 2 (Application server): Perform some calculation
Step 3 (DB server 2): After the calculation, insert that data into another database (MySQL)/table (InnoDB) for reporting purposes. This table contains 97 columns, all different rates, which cannot be normalized further. This is a different physical DB server that hosts only one DB.
The script worked fine during development, but in production Step 1 returned approx. 50 million records. As a result, the script ran for around 4 days and then failed. (As a rough estimate, at that rate it would have taken approx. 171 days to finish.)
Just for the record, we were using prepared statements, and Step 1 fetches the data in batches of 1000 records at a time.
What we did till now
Optimization step 1: Multiple values per insert & drop all indexes
Some tests showed that the insert (Step 3 above) takes the most time (more than 95% of it). To optimize, after some googling, we dropped all indexes from the table, and instead of one insert query per row we now have one insert query per 100 rows. This made the inserts a bit faster, but by a rough estimate it would still take 90 days to run the cron once, and we need to run it every month because new data becomes available every month.
Optimization step 2: Instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.
This step does not seem to work. Writing 30000 rows to the CSV file took 16 minutes, and we would still need to import that CSV file into MySQL. We have a single file handle for all write operations.
Current state
It seems I'm now out of ideas about what else can be done. Some key requirements:
The script needs to insert approx. 50,000,000 records (this will increase with time).
There are 97 columns for each record; we can skip some, but need 85 columns at the minimum.
Based on the input, we can split the script into three different crons running on three different servers, but the inserts have to be done on one DB server (the master), so I'm not sure that will help.
However:
We are open to changing the database/storage engine (including NoSQL).
In production we could have multiple database servers, but the inserts have to be done on the master only. All read operations, which are minimal and occasional (just to generate reports), can be directed to a slave.
Question
I don't need a detailed answer, but can someone briefly suggest a possible solution? I just need some optimization hints and I'll do the remaining R&D.
We are open to everything: changing the database/storage engine, server optimization/multiple servers (both DB and application), changing the programming language, or whatever configuration is best for the above requirements.
Final expectation: the cron must finish within 24 hours at most.
Edit in optimization step 2
To further understand why generating csv is taking time, I've created a replica of my code, with only necessary code. That code is present on git https://github.com/kapilsharma/xz
Output file of experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check the file above, I'm inserting 500000 records and fetching 5000 records from the database at a time, so the loop runs 100 times. The time taken in the first iteration was 0.25982284545898 seconds, but in the 100th iteration it was 3.9140808582306 seconds. I assume that's because of system resources and/or the size of the CSV file. In that case it becomes more of a programming question than a DB optimization one. Still, can someone suggest why later iterations take more time?
If needed, the whole code is committed except the CSV files and the SQL file generated to create the dummy DB, as these files are very big. However, they can easily be generated with the code.
Using OFFSET and LIMIT to walk through a table is O(N*N), because every query has to step over all the rows it skips; that is much slower than you want or expected.
Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for such. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
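As a rough illustration of "remembering where you left off", here is a sketch in Python. It uses an in-memory SQLite database purely as a stand-in for the MySQL source table; the table name `source`, its columns and the helper `iter_chunks` are assumptions made for the example, and the query shape is what matters.

```python
import sqlite3

def iter_chunks(conn, chunk_size):
    """Yield lists of rows, resuming after the last id seen instead of using OFFSET."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM source WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # remember where we left off

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO source (payload) VALUES (?)",
                 [(f"row {i}",) for i in range(12)])

chunks = list(iter_chunks(conn, 5))
print([len(c) for c in chunks])
```

Each query starts from the primary key index at `last_id`, so the cost per chunk stays roughly constant no matter how far into the table the loop has progressed.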
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
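A minimal sketch of the delta idea, in Python with an in-memory SQLite database standing in for the real server. The `audit` table, its columns and the function name are hypothetical; the real schema would depend on how changes are recorded.

```python
import sqlite3

def changed_ids_since(conn, last_run):
    """Return ids of records logged in the audit table after the previous batch run."""
    rows = conn.execute(
        "SELECT DISTINCT record_id FROM audit WHERE changed_at > ? ORDER BY record_id",
        (last_run,),
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("CREATE TABLE audit (record_id INTEGER, changed_at TEXT)")
conn.executemany("INSERT INTO audit VALUES (?, ?)", [
    (1, "2024-01-01 10:00:00"),
    (2, "2024-02-01 09:00:00"),
    (2, "2024-02-03 12:00:00"),
    (3, "2024-02-05 08:30:00"),
])

# Only records touched since the last monthly run need recalculating.
print(changed_ids_since(conn, "2024-01-31 00:00:00"))
```

The batch job then recalculates and rewrites only those ids instead of all 50M rows.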
I had a mailer cron job on CakePHP that failed on fetching a mere 600 rows and sending email to the registered users. It couldn't even do the job in batches. We finally opted for Mandrill, and since then it has all gone well.
I'd suggest (considering it a bad idea to touch the legacy system in production):
Schedule a micro solution in Golang or Node.js. Considering performance benchmarks, and since database interaction is involved, you'll be fine with either of these. Have this micro solution perform the cron job (fetch + calculate).
Reporting from NoSQL will be challenging, so you should try an available service such as Google BigQuery. Have the cron job store its data in Google BigQuery and you should get a huge performance improvement, even in generating reports.
or
With each row inserted into your original DB server 1, set up a messaging mechanism that performs the cron job's operations every time an insert is made (a sort of trigger) and stores the result on your reporting server. Possible services you can use are Google PubSub or Pusher. I think the time consumed per insert will be pretty small. (You can also use an async service setup that does the task of storing into the reporting database.)
Hope this helps.
I would like to run a PHP script as a cron job every night. The PHP script will import an XML file with about 145,000 products. Each product contains a link to an image, which will be downloaded and saved on the server as well. I can imagine that this may cause some overload. So my question is: is it a better idea to split the PHP file? And if so, what would be a better solution? More cron jobs, with several minutes' pause between them? Running another PHP file using exec (I guess not, because I can't imagine that would make much of a difference), or something else...? Or just use one script to import all products at once?
Thanks in advance.
It depends a lot on how you've written it, in particular on whether it leaks open files or database connections. It also depends on which version of PHP you're using. In PHP 5.3 a lot was done to improve garbage collection:
http://www.php.net/manual/en/features.gc.performance-considerations.php
If it's not important that the operation is transactional, i.e. all or nothing (for example, if it fails halfway through), then I would be tempted to tackle this in chunks, where each run of the script processes the next x items, and x can vary depending on how long it takes. What you'll need to do then is keep repeating the script until there is nothing left to do.
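The chunked approach can be sketched as follows, in Python for illustration. Everything here is an assumption for the example: the state file holding the resume position, the function name, and the tiny product list standing in for the parsed XML.

```python
import json
import os
import tempfile

def process_next_chunk(items, state_path, chunk_size, handle):
    """Handle the next chunk_size items; return True while more work remains."""
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["next"]  # resume where the previous run stopped
    chunk = items[start:start + chunk_size]
    for item in chunk:
        handle(item)
    with open(state_path, "w") as f:
        json.dump({"next": start + len(chunk)}, f)
    return start + len(chunk) < len(items)

state = os.path.join(tempfile.mkdtemp(), "import_state.json")
products = [f"product-{i}" for i in range(7)]
done = []
runs = 1
while process_next_chunk(products, state, 3, done.append):
    runs += 1  # in practice each iteration would be a separate run of the script
print(runs, len(done))
```

Because the position is persisted between runs, a crash only loses the chunk in progress, and the return value tells whatever repeats the script when to stop.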
To do this, I'd recommend using a tool called the Fat Controller:
http://fat-controller.sourceforge.net
It can keep on repeating the script and then stop once everything is done. You can tell the Fat Controller that there's more to do, or that everything is done using exit statuses from the php script. There are some use cases on the Fat Controller website, for example: http://fat-controller.sourceforge.net/use-cases.html#generating-newsletters
You can also use the Fat Controller to run processes in parallel to speed things up, just be careful you don't run too many in parallel and slow things down. If you're writing to a database, then ultimately you'll be limited by the hard disc, which unless you have something fancy will mean your optimum concurrency will be 1.
The final question would be how to trigger this - and you're probably best off triggering the Fat Controller from CRON.
There's plenty of documentation and examples on the Fat Controller website, but if you need any specific guidance then I'd be happy to help.
To complete the previous answer, the best solution is to optimize your scripts:
Prefer JSON to XML; parsing JSON is (vastly) faster.
Use one or only a few concurrent connections to the database.
Alter multiple rows at a time (insert 10-30 rows in one query, select 100 rows, delete several at once: not more, so as not to overload memory, and not fewer, so that each transaction is worthwhile).
Minimize the number of queries (this follows from the previous point).
Skip rows that are already up to date; use dates (timestamp, datetime) to detect them.
You can also let the processor breathe with a usleep(30) call (usleep() takes microseconds).
To use multiple PHP processes, use popen().
Well, this is the thing. Let's say that my future PHP CMS needs to handle 500k visitors daily and I need to record them all in a MySQL database (referrer, IP address, time etc.). That means I need to insert 300-500 rows per minute and update 50 more. The main problem is that the script would call the database every time I want to insert a new row, which is every time someone hits a page.
My question: is there any way to cache incoming hits locally first (and what is the best solution for that: APC, CSV...?) and periodically send them to the database, every 10 minutes for example? Is this a good solution, and what is the best practice for this situation?
500k hits daily is just 5-7 queries per second. If each request is served in 0.2 sec, you will have only about one query in flight at any moment, so there is nothing to worry about.
Even if you have 5 times more users, all should work fine.
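The arithmetic behind that estimate, as a small Python sketch. It assumes hits are spread evenly over the day (real traffic peaks will be higher) and uses the 0.2 s service time quoted above; the average number of requests in flight is simply arrival rate times service time.

```python
hits_per_day = 500_000
seconds_per_day = 24 * 60 * 60

avg_qps = hits_per_day / seconds_per_day   # average queries per second
service_time = 0.2                         # seconds per request (figure from the answer)
avg_in_flight = avg_qps * service_time     # average number of simultaneous requests

print(round(avg_qps, 1), round(avg_in_flight, 1))
```

With five times the traffic both numbers scale linearly, which still leaves only a handful of simultaneous queries.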
You can just use INSERT DELAYED and tune your MySQL.
About tuning: http://www.day32.com/MySQL/ - there is a very useful script there (it changes nothing, it just gives you tips on how to optimize the settings).
You can use memcache or APC to write the log there first, but with INSERT DELAYED MySQL will do almost the same work, and will do it better :)
Do not use files for this. The DB handles locks much better than PHP. It's not so trivial to write effective mutexes, so let the DB (or memcache, APC) do this work.
A frequently used solution:
You could implement a counter in memcached which you increment on each visit, and push an update to the database every 100 (or 1000) hits.
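A minimal sketch of the count-then-flush logic, in Python. The in-process counter below only stands in for the memcached counter, and `push` is a hypothetical callback that would issue the actual database update; the class name is made up for the example.

```python
class HitCounter:
    """Counts hits in memory and pushes the total to storage every `flush_every` hits."""

    def __init__(self, flush_every, push):
        self.flush_every = flush_every
        self.push = push      # callable that writes a batch count to the database
        self.pending = 0

    def hit(self):
        self.pending += 1
        if self.pending >= self.flush_every:
            self.push(self.pending)  # one database write per flush_every hits
            self.pending = 0

pushed = []
counter = HitCounter(100, pushed.append)
for _ in range(250):
    counter.hit()
print(pushed, counter.pending)
```

The trade-off is that hits still pending in the counter are lost if the cache is flushed or restarted before the next push.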
We do this by storing locally on each server to CSV, then having a minutely cron job to push the entries into the database. This is to avoid needing a highly available MySQL database more than anything - the database should be able to cope with that volume of inserts without a problem.
Save them to a directory-based database (or a flat file, depending) somewhere and, at a certain time, use PHP code to insert/update them into your MySQL database. Your PHP code can be executed periodically using cron, so check whether your server has cron so that you can set the schedule for that, say every 10 minutes.
Have a look at this page: http://damonparker.org/blog/2006/05/10/php-cron-script-to-run-automated-jobs/. Some code has already been written and published, ready for you to use :)
One way would be to use the Apache access.log. You can get quite fine-grained logging by using the cronolog utility with Apache. Cronolog will handle the storage of a very large number of rows in files, and can rotate them based on volume, day, year, etc. Using this utility will prevent your Apache from suffering from log writes.
Then, as others have said, use a cron-based job to analyse these logs and push whatever summarized or raw data you want into MySQL.
You may think of using a dedicated database (or even database server) for write-intensive jobs, with specific settings. For example, you may not need InnoDB storage and could keep a simple MyISAM. And you could even think of another database storage (as said by @Riccardo Galli).
If you absolutely HAVE to log directly to MySQL, consider using two databases. One optimized for quick inserts, which means no keys other than possibly an auto_increment primary key. And another with keys on everything you'd be querying for, optimized for fast searches. A timed job would copy hits from the insert-only to the read-only database on a regular basis, and you end up with the best of both worlds. The only drawback is that your available statistics will only be as fresh as the previous "copy" run.
I have also previously seen a system which records the data into a flat file on the local disc of each web server (be careful to do only atomic appends if using multiple processes), and periodically writes them into the database asynchronously using a daemon process or cron job.
This appears to be the prevailing optimum solution; your web app remains available if the audit database is down, and users don't suffer poor performance if the database is slow for any reason.
The only thing I can say is: be sure that you have monitoring on these locally generated files - a build-up definitely indicates a problem, and your Ops engineers might not otherwise notice.
For a high number of write operations and this kind of data, you might find MongoDB or CouchDB more suitable.
Because INSERT DELAYED is not supported by InnoDB (only by MyISAM and a few other engines), it is not an option for many users.
We use MySQL Proxy to defer the execution of queries matching a certain signature.
This will require a custom Lua script; example scripts are here, and some tutorials are here.
The script will implement a Queue data structure for storage of query strings, and pattern matching to determine what queries to defer. Once the queue reaches a certain size, or a certain amount of time has elapsed, or whatever event X occurs, the query queue is emptied as each query is sent to the server.
You can use a queue strategy using beanstalk or IronQ.
I've got an application which needs to run a daily script; the daily script consists in downloading a CSV file with 1,000,000 rows, and inserting those rows into a table.
I host my application in Dreamhost. I created a while loop that goes through all the CSV's rows and performs an INSERT query for each one. The thing is that I get a "500 Internal Server Error". Even if I chop it out in 1000 files with 1000 rows each, I can't insert more than 40 or 50 thousand rows in the same loop.
Is there any way that I could optimize the input? I'm also considering going with a dedicated server; what do you think?
Thanks!
Pedro
Most databases have an optimized bulk insertion process - MySQL's is the LOAD DATA INFILE syntax.
To load a CSV file, use:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Insert multiple values, instead of doing
insert into table values(1,2);
do
insert into table values (1,2),(2,3),(4,5);
Up to an appropriate number of rows at a time.
Or do bulk import, which is the most efficient way of loading data, see
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
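To show how the multi-value statement can be assembled from rows read out of a CSV file, here is a sketch in Python. It only builds the SQL text and the flat parameter list; the `%s` placeholder style is the one used by the common Python MySQL drivers, and the function names, table name and column names are hypothetical (they are interpolated directly, so they must come from trusted code, not user input).

```python
def build_multi_insert(table, columns, rows):
    """Return (sql, params) for one INSERT carrying every row in `rows`."""
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_placeholder] * len(rows))
    )
    params = [value for row in rows for value in row]  # flattened in row order
    return sql, params

def batches(rows, size):
    """Split rows into lists of at most `size` items, one list per INSERT."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = [(1, 2), (2, 3), (4, 5)]
sql, params = build_multi_insert("t", ["a", "b"], rows)
print(sql)
print(params)
```

In a real import, each batch produced by `batches()` would be turned into one statement and executed, so a million rows at 1000 rows per batch means about a thousand round trips instead of a million.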
Normally I would say just use LOAD DATA INFILE, but it seems you can't with your shared hosting environment.
I haven't used MySQL in a few years, but they have a very good document which describes how to speed up insertions for bulk insertions:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
A few ideas that can be gleaned from this:
Disable/enable keys around the insertions:
ALTER TABLE tbl_name DISABLE KEYS;
ALTER TABLE tbl_name ENABLE KEYS;
Use many values in your insert statements.
I.e.: INSERT INTO table (col1, col2) VALUES (val1, val2),(.., ..), ...
If I recall correctly, you can have up to 4096 values per insertion statement; either way, the statement must fit within the server's max_allowed_packet setting.
Run a FLUSH TABLES command before you even start, to ensure that there are no pending disk writes that may hurt your insertion performance.
I think this will make things fast. I would suggest using LOCK TABLES, but I think disabling the keys makes that moot.
UPDATE
I realized after reading this that by disabling your keys you may remove consistency checks that are important for your file loading. You can fix this by:
Ensuring that your table has no data that "collides" with the new data being loaded (if you're starting from scratch, a TRUNCATE statement will be useful here).
Writing a script to clean your input data to ensure no duplicates locally. Checking for duplicates is probably costing you a lot of database time anyway.
If you do this, ENABLE KEYS should not fail.
You can create a cron job script which adds x records to the database per request.
The cron job script checks whether the last import has added all the needed rows; if not, it takes another x rows.
That way you can add as many rows as you need.
If you have your own dedicated server it's easier. You just run a loop with all the insert queries.
Of course, you can try to set the time limit to 0 with set_time_limit(0) (if that works on Dreamhost) or make it bigger.
Your PHP script is most likely being terminated because it exceeded the script time limit. Since you're on a shared host, you're pretty much out of luck.
If you do switch to a dedicated server and if you get shell access, the best way would be to use the mysql command-line tool to insert the data.
OMG Ponies' suggestion is great, but I've also 'manually' formatted data into the same format that mysqldump uses, then loaded it that way. Very fast.
Have you tried using transactions? Just send the command BEGIN to MySQL, do all your inserts, then do COMMIT. This would speed it up significantly, but like casablanca said, your script is probably timing out as well.
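The BEGIN / inserts / COMMIT pattern, sketched in Python against an in-memory SQLite database. SQLite is only a stand-in so the example runs anywhere; with MySQL the same idea applies to transactional engines such as InnoDB. The table `t` and the function name are made up for the example.

```python
import sqlite3

def insert_in_one_transaction(conn, values):
    """Insert all values inside a single transaction; roll back everything on error."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO t (v) VALUES (?)", [(v,) for v in values])

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection
conn.execute("CREATE TABLE t (v INTEGER NOT NULL)")

insert_in_one_transaction(conn, range(1000))       # one commit for 1000 rows
try:
    insert_in_one_transaction(conn, [1, None])     # second row violates NOT NULL
except sqlite3.IntegrityError:
    pass                                           # the whole failed batch was rolled back

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)
```

Besides the speed-up from committing once per batch, a failed batch leaves no partial rows behind, which makes retrying simpler.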
I've run into this problem myself before, and nos pretty much hit the nail on the head, but you'll need to do a bit more to get the best performance.
I found that in my situation I couldn't get MySQL to accept one large INSERT statement, but if I split it up into groups of about 10k INSERTs at a time, as nos suggested, it does its job pretty quickly. One thing to note is that when doing multiple INSERTs like this you will most likely hit PHP's timeout limit, but this can be avoided by resetting the timeout with set_time_limit($seconds). I found that doing this after each successful INSERT worked really well.
You have to be careful about doing this, because you could accidentally find yourself in a loop with an unlimited timeout. For that reason I would suggest testing that each INSERT was successful, either by checking for errors reported by MySQL with mysql_errno() or mysql_error(), or by checking the number of rows affected by the INSERT with mysql_affected_rows(). You could then stop after the first error happens.
It would be better if you use SQL*Loader (note that this is an Oracle utility; for MySQL the equivalent is LOAD DATA INFILE).
You would need two things: first, a control file that specifies the actions SQL*Loader should perform, and second, the CSV file that you want to load.
The link below should help you out.
http://www.oracle-dba-online.com/sql_loader.htm
Go to phpMyAdmin and select the table you would like to insert into.
Under the "Operations" tab, in the "Table options" section, change the storage engine from InnoDB to MyISAM.
I once had a similar challenge.
Have a good time.
I am confronted with a new kind of problem which I haven't encountered yet in my very young programming "career" and would like to know your opinion about how to tackle it best.
The situation
A research application (PHP/MySQL) gathers stress-related health data from users. The user gets an analysis after filling in the questionnaire. The value for each parameter is transformed into a percentile value using a benchmark (mean and standard deviation of the existing data set).
The task
Since more and more people are filling in the questionnaire, there is the potential to make the benchmark values (mean/SD) more accurate by recalculating them using the new user data. I would like the database to regularly run a script that updates the benchmark values.
The question
I've never used stored procedures so far and I only have a slight notion of what they are, but somehow I have a feeling they could maybe help me with this? Or should I write the script in PHP and then set up a cron job?
[edit]After the first couple of answers it looks like cron is clearly the way to go.[/edit]
What you're considering could be done in a number of ways.
1. You could set up a trigger in your DB to recalculate the values whenever a record is inserted or updated. You could store the code needed to update the values in a sproc if necessary.
2. You could write a PHP script and run it regularly via cron.
#1 will slow down inserts to your database but will make sure your data is always up to date. #2 may lock the tables while it updates the values, and your data will only be as current as the last update. #2 is much easier to back up, as the script can easily be stored in your versioning system, whereas you'd need to store the trigger and sproc creation scripts in whatever backup you'd make.
Obviously you'll have to weigh up your requirements before you pick a method.
PHP set up as a cron job lets you keep it in your source code management system, and if you're using a database abstraction layer it'll be portable to other databases if you ever decide to switch. For those reasons, I tend to go with scripts over stored procedures.
The easiest way to make this work is probably to write a script in the same language your website is using (sounds like PHP) and call it from cron.
No need to make it more complicated than it needs to be by putting the logic in two places (your existing calculations and a stored procedure).
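What such a cron-driven script has to compute can be sketched in a few lines, shown here in Python. Two assumptions are made that the question does not confirm: the percentile is taken from a normal distribution with the benchmark mean/SD, and the SD is the sample standard deviation. The function names and the sample values are invented for the example.

```python
from statistics import NormalDist, mean, stdev

def recalculate_benchmark(values):
    """Return the (mean, standard deviation) benchmark for one parameter."""
    return mean(values), stdev(values)

def percentile(value, bench_mean, bench_sd):
    """Percentile of `value` under a normal distribution with the benchmark mean/SD."""
    return 100 * NormalDist(bench_mean, bench_sd).cdf(value)

scores = [10, 12, 14, 16, 18]  # made-up questionnaire values for one parameter
m, sd = recalculate_benchmark(scores)
print(m, round(percentile(m, m, sd)))
```

The nightly job would run the recalculation once per parameter and store the two numbers, so the per-user percentile lookup stays a cheap calculation.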
If the volume of data is big enough that calculating it on the fly is too much, then either:
A cron job with a PHP script to denormalise the totals
A trigger on inserts that increments the totals
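The "increment totals" variant works because mean and SD can be derived from three running totals: the count, the sum and the sum of squares. A sketch in Python, where the class stands in for a one-row totals table that a trigger (or the cron script) would update; the class and method names are hypothetical. The sum-of-squares formula can lose precision on very large or very similar values, in which case a more stable running method would be preferable.

```python
import math

class RunningStats:
    """Keeps count, sum and sum of squares so mean/SD can be read without a rescan."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        # What an insert trigger would do for each new row.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def sd(self):
        # Sample standard deviation from the running totals.
        return math.sqrt((self.total_sq - self.total ** 2 / self.n) / (self.n - 1))

stats = RunningStats()
for x in [10, 12, 14, 16, 18]:
    stats.add(x)
print(stats.mean(), stats.sd())
```

Reading the benchmark then costs a single row lookup, at the price of a little extra work on every insert.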
Go with the cron job. Simple, solid, works. In the PHP/MySQL world I would say stored procedures are a no-go.