I am new to working with large amounts of data. I am wondering if there are any best practices when querying a database in batches or if anyone can give any advice.
I have a query that pulls out all the data, and PHP is used to write the data to an XML file. There can be anywhere between 10 and 500,000 rows of data, so I have written the script to pull the data out in batches of 50: write to the file, get the next 50 rows, append those to the file, and so on. Is this OK, or should I be doing something else? Should I increase or decrease the batch size to make the script run faster?
Any advice would be much appreciated.
Yes, for huge result sets it is recommended to use batches, for both performance and memory reasons.
There are benchmarks and example code around for running queries in batches.
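As a minimal sketch of that batching approach (assuming $link is an open mysqli connection and my_table stands in for your table):

$batchSize = 50;
$offset    = 0;

do {
    // Fetch one batch; LIMIT/OFFSET keeps memory use bounded
    $result = mysqli_query($link, "SELECT * FROM my_table LIMIT $batchSize OFFSET $offset");
    $rows   = mysqli_fetch_all($result, MYSQLI_ASSOC);
    mysqli_free_result($result);

    foreach ($rows as $row) {
        // append this row to the XML file here
    }

    $offset += $batchSize;
} while (count($rows) === $batchSize);   // a short batch means we've reached the end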
The best way to do this depends on a couple of different things, most importantly when and why you are creating this XML file.
If you are creating the XML file on demand, and a user is waiting for the file then you'll need to do some fine tuning and testing for performance.
If it's something that's created on a regular basis, maybe a nightly or hourly task, and then the XML file is requested after it's built (something like an RSS feed builder) then if what you have works I would recommend not messing with it.
As far as performance goes, there are different things that can help. Put some simple timers into your scripts and play with the number of records per batch to see if there are any performance differences.
$start = microtime(true);
// process batch
$end = microtime(true);
$runTimeMilliseconds = ($end - $start) * 1000; // microtime(true) returns seconds
If the issue is user feedback, you may consider using AJAX to kick off each batch and report progress to the user. If you give the user feedback, they'll usually be happy to wait longer than if they're just waiting for the whole page to refresh.
Also, check your SQL query to make sure there are no hidden performance penalties there. http://dev.mysql.com/doc/refman/5.0/en/explain.html EXPLAIN can show you how MySQL goes about processing your queries.
At an extreme, I'd imagine the best performance could be accomplished through parallel processing. I haven't worked with it in PHP, but here's the primary reference http://www.php.net/manual/en/refs.fileprocess.process.php
Depending on your hosting environment, you could find the total number of records and split them among sub-processes, each building its own XML fragment; then you could combine the fragments. So process 1 might handle records 0 to 99, process 2 records 100 to 199, and so on.
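For example, here is a rough sketch of that idea using pcntl_fork (CLI only; the worker count and the buildFragment() helper, which would query one slice of records and write an XML fragment, are made up for illustration):

<?php
// Sketch only: buildFragment($offset, $limit, $path) is a hypothetical helper
// that queries its slice of records and writes an XML fragment to $path.
$total    = 500000;                       // total number of records (query this first)
$workers  = 4;                            // number of child processes
$perChild = (int) ceil($total / $workers);

$pids = array();
for ($w = 0; $w < $workers; $w++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork");
    }
    if ($pid === 0) {
        // Child process: build the XML fragment for this slice of records
        buildFragment($w * $perChild, $perChild, "fragment_$w.xml");
        exit(0);
    }
    $pids[] = $pid;                       // parent keeps track of its children
}

// Parent: wait for every child, then combine the fragments in order
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}
for ($w = 0; $w < $workers; $w++) {
    file_put_contents("combined.xml", file_get_contents("fragment_$w.xml"), FILE_APPEND);
}
?>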
You would be surprised: ONE simple select-all without LIMIT is the fastest,
because it only queries the database once;
everything else is processed locally.
$sql = "SELECT all_columns FROM table";
<?php
// Set a very high memory limit, because building the XML consumes the MOST memory
ini_set('memory_limit', '512M');

// $link is an open mysqli connection and $file the output path (assumed set up earlier)
// Query without LIMIT; if you can avoid sorting, that is best
$result = mysqli_query($link, $sql);

// Iterate the MySQL result into an array, then free the result
$results = array();
while ($row = mysqli_fetch_assoc($result)) {
    $results[] = $row;
}
mysqli_free_result($result);

// Write XML out for every one thousand rows
$len    = count($results);
$buffer = '';
for ($i = 0; $i < $len; ++$i) {
    $arr = $results[$i];
    // do any XML preparation here, appending to $buffer

    // don't forget file writes are expensive too, so batch them
    if ($i % 1000 == 0 && $i > 0) {
        file_put_contents($file, $buffer, FILE_APPEND);
        $buffer = '';
    }
}
// write out whatever is left over
if ($buffer !== '') {
    file_put_contents($file, $buffer, FILE_APPEND);
}
?>
The best way to go about this is to schedule it as a CRON job, which I think is the best solution for batch processing in PHP. Check this link for more info: Batch Processing in PHP. Hope this helps.
I have this code that never finishes executing.
Here is what happens:
We make an API call to get a large amount of data, and we need to check it against our database; if there is any difference, we need to update our DB for that specific row. The number of rows will increase as the project grows and could go over 1 billion rows in some cases.
The issue is making this scalable, so that it works even for a 1-billion-row update.
To simulate it, I did a 9000-iteration for loop:
<?php
ini_set("memory_limit", "-1");
ignore_user_abort(true);

for ($i = 0; $i < 9000; $i++) {
    // Complex SQL UPDATE query that requires joining tables,
    // and doing a search and update if it matches several variables
}

// here I have a log function to see whether the for loop has finished
If I loop it 10 times, it still takes time, but it works and the log is recorded; with 9000 it never finishes the loop and nothing is ever recorded.
Note: I added ini_set("memory_limit","-1"); ignore_user_abort(true); to prevent memory errors.
Is there any way to make this scalable?
Details: I do this query 2 times a day
Without knowing the specifics of the API, how often you call it, how much data it's returning at a time, and how much information you actually have to store, it's hard to give you specific answers. In general, though, I'd approach it like this:
Have a "producer" script query the API on whatever basis you need, but instead of doing your complex SQL update, have it simply store the data locally (presumably in a table, let's call it tempTbl). That should ensure it runs relatively fast. Implement some sort of timestamp on this table, so you know when records were inserted. In the ideal world, the next time this "producer" script runs, if it encounters any data from the API that already exists in tempTbl, it will overwrite it with the new data (and update the last updated timestamp). This ensures tempTbl always contains the latest cached updates from the API.
You'll also have a "consumer" script which runs on a regular basis and which processes the data from tempTbl (presumably in LIFO order, but it could be in any order you want). This "consumer" script will process a chunk of, say, 100 records from tempTbl, do your complex SQL UPDATE on them, and delete them from tempTbl.
The idea is that one script ("producer") is constantly filling tempTbl while the other script ("consumer") is constantly processing items in that queue. Presumably "consumer" is faster than "producer", otherwise tempTbl will grow too large. But with an intelligent schema, and careful throttling of how often each script runs, you can hopefully maintain stasis.
I'm also assuming these two scripts will be run as cron jobs, which means you just need to tweak how many records they process at a time, as well as how often they run. Theoretically there's no reason why "consumer" can't simply process all outstanding records, although in practice that may put too heavy a load on your DB so you may want to limit it to a few (dozen, hundred, thousand, or million?) records at a time.
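As a rough sketch of the "consumer" side only (the tempTbl columns, connection details, and batch size here are assumptions for illustration):

<?php
// Consumer cron job: process a small batch from tempTbl, then remove it from the queue.
$mysqli = new mysqli("localhost", "user", "pass", "mydb");   // placeholder credentials
$batch  = 100;

// Newest cached API data first (LIFO), assuming an updated_at timestamp column
$result = $mysqli->query("SELECT id, payload FROM tempTbl ORDER BY updated_at DESC LIMIT $batch");
$rows   = $result->fetch_all(MYSQLI_ASSOC);

foreach ($rows as $row) {
    // your complex UPDATE with joins goes here, driven by $row['payload']
}

if ($rows) {
    $ids = implode(",", array_column($rows, "id"));
    $mysqli->query("DELETE FROM tempTbl WHERE id IN ($ids)");
}
?>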
I've been looking at asynchronous database requests in PHP using mysqlnd. The code is working correctly, but when I compare the performance of pulling data from one reasonably sized table against the same data split across multiple tables queried with asynchronous requests, I'm not getting anything like the performance I would expect, although it does seem to vary quite a bit with the hardware setup.
As I understand it, rather than:
x = a + b + c + d
I should be achieving:
x = max(a, b, c, d)
Where x is the total time taken and a to d are the times for individual requests. What I am actually seeing is a rather minor increase in performance on some setups and on others worse performance as if requests weren't asynchronous at all. Any thoughts or experiences from others who may have worked with this and come across the same are welcome.
EDIT: Measuring the timings here, we are talking about queries spread over 10 tables. Individually, no query takes more than around 8 seconds to complete; adding up the time each individual request takes to complete (not asynchronously) gives a total of around 18 seconds.
Performing the same requests asynchronously total query time is also around 18 seconds. So clearly the requests are not being executed in parallel against the database.
EDIT: Code used is exactly as shown in the documentation here
<?php
$link1 = mysqli_connect();
$link1->query("SELECT 'test'", MYSQLI_ASYNC);
$all_links = array($link1);
$processed = 0;
do {
    $links = $errors = $reject = array();
    foreach ($all_links as $link) {
        $links[] = $errors[] = $reject[] = $link;
    }
    if (!mysqli_poll($links, $errors, $reject, 1)) {
        continue;
    }
    foreach ($links as $link) {
        if ($result = $link->reap_async_query()) {
            print_r($result->fetch_row());
            if (is_object($result)) {
                mysqli_free_result($result);
            }
        } else {
            die(sprintf("MySQLi Error: %s", mysqli_error($link)));
        }
        $processed++;
    }
} while ($processed < count($all_links));
?>
I'll expand on my comments and try to explain why you won't gain any performance with your current setup.
Asynchronous, in your case, means that the process of retrieving the data is asynchronous relative to the rest of your code. The two moving parts, getting the data and working with the data, are separate and are executed one after the other, but only once the data arrives.
This implies that you want to utilize the CPU to its fullest, so you won't invoke PHP code until the data is ready.
In order for that to work, you must seize control of the PHP process and make it use one of the operating system's event interfaces (epoll on Linux, or IOCP on Windows). Since PHP is either embedded into a web server (mod_php) or runs as its own standalone FCGI server (php-fpm), the best use of asynchronous data fetching is in a CLI PHP script, since it's quite difficult to use the event interfaces otherwise.
However, let's focus on your problem and why your code isn't faster.
You assumed that you are CPU bound, and your solution was to retrieve the data in chunks and process it that way. That's fine; however, since nothing you do yields faster execution, it means you are 100% I/O bound.
Retrieving data from a database forces the hard disk to perform seeks. No matter how much you "chunk" that, if the disk is slow and the data is scattered around the disk, that part will be slow, and creating more workers that deal with parts of the data will just make the system slower and slower, since each worker will have the same problem retrieving the data.
I'd conclude that your issue lies in a slow hard disk and a data set that is too big, and perhaps improperly structured, for chunked retrieval. I suggest updating this question, or creating another one, to help you retrieve the data faster and in a more optimal way.
Is there a way to query for a set time from a PHP script?
I want to write a PHP script that takes an id and then queries the MySQL database to see if there is a match. Another user may not have uploaded their match yet, so I am aiming to query until I find a match or until 5 seconds have passed, at which point I will return 0.
In pseudocode this is what I was thinking, but it doesn't seem like a good method, since I've read that looping queries isn't good practice.
$id_in     = 123;
$time_c    = time();
$time_stop = $time_c + 5; // seconds

while ($time_c < $time_stop) {
    $time_c = time();
    $result = mysql_query("SELECT * FROM some_table WHERE id=$id_in"); // placeholder table name
}
It sounds like your requirement is to poll some table until a row with a particular ID shows up. You'll need a query like this to do that:
SELECT some-column, another-column FROM some-table WHERE id=$id_in
(Pro tip: don't use SELECT * in software.)
It seems that you want to poll for five seconds and then give up. So let's work through this.
One choice is to simply sleep(5), then poll the table using your query. The advantage of this is that it's very simple.
Another choice is what you have. This will make your php program hammer away at the table as fast as it can, over and over, until the poll succeeds or until your five seconds run out. The advantage of this approach is that your php program won't be asleep when the other program hits the table. In other words, it will pick up the change to the table with minimum latency. This choice, however, has an enormous disadvantage. By hammering away at the table as fast as you can, you'll tie up resources on the MySQL server. This is generally wasteful. It will prevent your application from scaling up efficiently (what if you have ten thousand users all doing this?). Specifically, it may slow down the other program trying to hit the table, so it can't get the update done in five seconds.
There's middle ground, however. Try doing a half-second wait
usleep(500000);
right before each time you poll the table. That won't waste MySQL resources as badly. Even if your php program is asleep when the other program hits the table, it won't be asleep for long.
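Putting that together, a minimal sketch of the polling loop (using mysqli; $link, the table, and the column names are placeholders):

$id_in    = 123;
$deadline = time() + 5;              // give up after 5 seconds
$match    = null;

while (time() < $deadline) {
    usleep(500000);                  // half-second wait before each poll
    $result = mysqli_query($link, "SELECT some_column FROM some_table WHERE id = $id_in");
    if ($result && ($match = mysqli_fetch_assoc($result))) {
        break;                       // the other user's row showed up
    }
}

echo $match ? json_encode($match) : 0;   // 0 if nothing arrived in time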
There is no need to do simple polling and sleeping. I don't know your exact requirements, but in general your question asks for GET_LOCK() or PHP's Semaphore support.
Assuming your uploading process starts with
SELECT GET_LOCK("id-123", 0);
Your Query thread can then wait on that lock:
SELECT GET_LOCK("id-123", 5) as locked, entities.*
FROM entities WHERE id = 123;
You might eventually find that a TRIGGER is the thing you were looking for.
I have quite a long, memory-intensive loop. I can't run it in one go because my server places a time limit on execution and/or I run out of memory.
I want to split up this loop into smaller chunks.
I had an idea to split the loop into smaller chunks and then set a location header to reload the script with new starting conditions.
MY OLD SCRIPT (Pseudocode. I'm aware of the shortcomings below)
for ($i = 0; $i < 1000; $i++)
{
    //FUNCTION
}
MY NEW SCRIPT
$start = (int) $_GET['start'];
$end   = $start + 10;

for ($i = $start; $i < $end; $i++)
{
    //FUNCTION
}

header("Location: script.php?start=$end");
However, my new script runs successfully for a few iterations and then I get a server error: "Too many redirects".
Is there a way around this? Can someone suggest a better strategy?
I'm on a shared server so I can't increase memory allocation or script execution time.
I'd like a PHP solution.
Thanks.
"Too many redirects" is a browser error, so a PHP solution would be to use cURL or standard streams to load the initial page and let it follow all redirects. You would have to run this from a machine without time-out limitations though (e.g. using CLI)
Another thing to consider is to use AJAX. A piece of JavaScript on your page will run your script, gather the output from your script and determine whether to stop (end of computation) or continue (start from X). This way you can create a nifty progress meter too ;-)
You probably want to look into forking child processes to do the work. These child processes can do the work in smaller chunks in their own memory space, while the parent process fires off multiple children. This is commonly handled by Gearman, but can be done without.
Take a look at Forking PHP on Dealnews' Developers site. It has a library and some sample code to help manage code that needs to spawn child processes.
Generally, if I have to iterate over something many, many times and it involves a decent amount of data, I use a "lazy load" type approach like this:
$data_holder = array();

for ($i = $start; $i < $end; $i++)
{
    $data_holder[] = "adding my big data chunks!";
    if (count($data_holder) >= 5) {
        // function to process data
        process_data($data_holder); // process that data like a boss!
        $data_holder = array();     // this frees up the memory
    }
}

// Now pick up the stragglers of whatever is left in the data chunk
if (count($data_holder) > 0) {
    process_data($data_holder);
}
That way you can continue to iterate through your data without filling up your memory: work on a chunk, clear the data, work on the next chunk, clear the data, and so on. As far as execution time goes, that depends on how much work you have to do and how efficiently your script is written.
The basic premise -- "Process your data in smaller chunks to avoid memory issues. Keep your design simple to keep it fast."
How about you put a conditional inside your loop to sleep every 100 iterations?
for ($i = 0; $i < 1000; $i++)
{
    if ($i % 100 == 0)
        sleep(1800); // Sleep for half an hour
}
First off, without knowing what you're doing inside the loop, it's hard to tell you the best approach to actually solving your issue. However, if you want to execute something that takes a really long time, my suggestion would be to set up a cron job and let it work through small portions at a time. The script would log where it stops, and the next time it starts up it could read the log to know where to start.
Edit: If you are dead set against cron, and you aren't too concerned about user experience, you could do this:
Let the page load and work like the cron job above, except that after so many seconds or iterations you stop the script and output a refresh meta tag or a JavaScript refresh. Repeat this until the task is done.
With the limitations you have, I think the approach you are using could work. It may be that your browser is trying to be smart and not let you redirect back the page you were just on. It might be trying to prevent an endless loop.
You could try
Redirecting back and forth between two scripts that are identical (or aliases).
A different browser.
Having your script output an HTML page with a refresh tag, e.g.
<meta http-equiv="refresh" content="1; url=http://example.com/script.php?start=xxx">
I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is a CRON job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels down into the page, so: the main page www.url.com and the links on that page, www.url.com/post1, www.url.com/post2.
My problem is that as I start to get a larger collection of blogs, they are only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just allow the scripts to process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
curl_multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape and send them out to curl_multi. Wait, then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
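As a rough sketch of that pattern (the blog URLs here are placeholders):

$urls = array("http://blog-one.example/", "http://blog-two.example/");   // placeholder list

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all the requests in parallel
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

// Now process the results serially (parse the HTML, insert into the DB, queue level-two links)
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html and record the links for the second pass ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);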
Pseudo code for running parallel scanners (sketched with mysqli):
function start_a_scan($mysqli) // $mysqli: an open mysqli connection (assumed)
{
    // Start a MySQL transaction (needs InnoDB afaik)
    $mysqli->query("BEGIN");

    // Get the first entry that has timed out and is not being scanned by someone
    // (and acquire an exclusive lock on the affected row)
    $row = $mysqli->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = false AND (scanned_at + 60) < (NOW()+0)
         ORDER BY scanned_at ASC
         LIMIT 1 FOR UPDATE"
    )->fetch_assoc();

    // let everyone know we're scanning this one, so they'll keep out
    $mysqli->query("UPDATE scan_targets SET being_scanned = true WHERE id = {$row['id']}");

    // Commit the transaction
    $mysqli->query("COMMIT");

    // scan
    scan_target($row['url']);

    // update the entry's state to allow it to be scanned again in the future
    $mysqli->query("UPDATE scan_targets SET being_scanned = false, scanned_at = NOW() WHERE id = {$row['id']}");
}
You'd probably also need a 'cleaner' that periodically checks whether any aborted scans are hanging around, and resets their state so they can be scanned again.
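A minimal sketch of such a cleaner, reusing the $mysqli connection and conventions from the sketch above (the 600-second threshold is an arbitrary assumption):

// Reset rows that have been "being scanned" for more than ten minutes (assumed aborted)
$mysqli->query(
    "UPDATE scan_targets SET being_scanned = false
     WHERE being_scanned = true AND (scanned_at + 600) < (NOW()+0)"
);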
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open-source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc. I use it myself.
Due to how PHP works, it seems I cannot just allow the scripts to process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. The script execution time is a security measure, which you can simply disable for your CLI scripts.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g., start a separate script for each site you want to crawl.
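For instance, a hypothetical launcher along those lines (crawl_site.php and the site list are made-up names):

$sites = array("blog-one.example", "blog-two.example");   // placeholder site list

foreach ($sites as $site) {
    // Launch a separate background CLI crawler process for each site
    exec(sprintf("php crawl_site.php %s > /dev/null 2>&1 &", escapeshellarg($site)));
}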
CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)