I can't get into too many specifics as this is a project for work, but anyways..
I'm in the process of writing a SOAP client in PHP that pushes all responses to a MySQL database. My main script makes an initial SOAP request that retrieves a large set of items (approximately 4,000 at the moment, but the list is expected to grow into hundreds of thousands at some point).
Once this list of 4,000 items is returned, I use exec("/usr/bin/php path/to/my/historyScript.php &") to launch a script that sends a history request for each item. The web service API supports up to 30 requests / sec. Below is some pseudo code for what I am currently doing:
$count = 0;
foreach( $items as $item )
{
    if ( $count == 30 )
    {
        sleep(1); // Sleep for one second before calling the next 30 requests
        $count = 0;
    }
    exec('/usr/bin/php path/to/history/script.php &');
    $count++;
}
The problem I'm running into is that I am unsure when the processes finish and my development server is starting to crash. Since data is expected to grow, I know this is a very poor solution to my problem.
Might there be a better approach I should consider using for a task like this? I just feel that this is more of a 'hack'
I am not sure, but I suspect your application crashes because you are keeping a large set of data in a PHP variable. Look into this: depending on how much RAM you have, a data set of this size can bring the system down. My suggestion is to limit the amount of data you pull from the external service per request, rather than the number of requests you send to the service.
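One way to stay under the 30 requests / sec limit without spawning a process per item is to pace the calls from a single script. The following is a minimal sketch, not the asker's code: `runThrottled` and the `$send` callback are hypothetical names, and the callback stands in for whatever sends one history request.

```php
<?php
/**
 * Call $send for every item in batches of $perSecond, starting a new batch
 * no sooner than one second after the previous batch started.
 * Returns the number of calls made.
 * (Hypothetical helper; $send stands in for the real history request.)
 */
function runThrottled(array $items, callable $send, int $perSecond = 30): int
{
    $sent = 0;
    foreach (array_chunk($items, $perSecond) as $batch) {
        $windowStart = microtime(true);
        foreach ($batch as $item) {
            $send($item);
            $sent++;
        }
        // Sleep only for what is left of this batch's one-second window,
        // and skip the sleep entirely after the final batch.
        $remaining = 1.0 - (microtime(true) - $windowStart);
        if ($remaining > 0 && $sent < count($items)) {
            usleep((int) round($remaining * 1000000));
        }
    }
    return $sent;
}
```

Because everything runs in one process, the script knows exactly when each request has finished, and memory use does not grow with the number of items in flight.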
I've been looking at asynchronous database requests in PHP using mysqlnd. The code is working correctly but comparing performance pulling data from one reasonable sized table versus the same data split across multiple tables using asynchronous requests I'm not getting anything like the performance I would expect although it does seem fairly changeable according to hardware setup.
As I understand it I should be achieving, rather than:
x = a + b + c + d
Instead:
x = max(a, b, c, d)
Where x is the total time taken and a to d are the times for individual requests. What I am actually seeing is a rather minor increase in performance on some setups and on others worse performance as if requests weren't asynchronous at all. Any thoughts or experiences from others who may have worked with this and come across the same are welcome.
EDIT: Measuring the timings here, we are talking about queries spread over 10 tables, individually the queries take no more than around 8 seconds to complete, combining the time each individual request takes to complete (not asynchronously) it totals around 18 seconds.
Performing the same requests asynchronously total query time is also around 18 seconds. So clearly the requests are not being executed in parallel against the database.
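Plugging the figures above into that model gives a rough idea of the gap. This assumes the ten queries could run fully in parallel with no contention on the database server, which is an idealization:

```latex
T_{\text{seq}} = \sum_{i=1}^{10} t_i \approx 18\ \text{s},
\qquad
T_{\text{par}} \approx \max_i t_i \le 8\ \text{s},
\qquad
\frac{T_{\text{seq}}}{T_{\text{par}}} \ge \frac{18}{8} = 2.25
```

So under that assumption the asynchronous run should finish in about 8 seconds or less, not 18.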
EDIT: Code used is exactly as shown in the documentation here
<?php
$link1 = mysqli_connect();
$link1->query("SELECT 'test'", MYSQLI_ASYNC);
$all_links = array($link1);
$processed = 0;
do {
    $links = $errors = $reject = array();
    foreach ($all_links as $link) {
        $links[] = $errors[] = $reject[] = $link;
    }
    if (!mysqli_poll($links, $errors, $reject, 1)) {
        continue;
    }
    foreach ($links as $link) {
        if ($result = $link->reap_async_query()) {
            print_r($result->fetch_row());
            if (is_object($result))
                mysqli_free_result($result);
        } else die(sprintf("MySQLi Error: %s", mysqli_error($link)));
        $processed++;
    }
} while ($processed < count($all_links));
?>
I'll expand my comments and I'll try to explain why you won't gain any performance using the setup you have currently.
Asynchronous, in your case, means that the process of retrieving data is asynchronous with respect to the rest of your code. The two moving parts (getting the data and working with the data) are separate and are executed one after another, but only when the data arrives.
This implies that you want to utilize the CPU to its fullest, so you won't invoke PHP code until the data is ready.
In order for that to work, you must take control of the PHP process and make it use one of the operating system's event interfaces (epoll on Linux, or IOCP on Windows). Since PHP is either embedded into a web server (mod_php) or runs as its own standalone FCGI server (php-fpm), the best utilization of asynchronous data fetching is when you run a CLI PHP script, since it's quite difficult to utilize event interfaces otherwise.
However, let's focus on your problem and why your code isn't faster.
You assumed that you are CPU bound and your solution was to retrieve data in chunks and process them that way - that's great, however since nothing you do yields faster execution, that means you are 100% I/O bound.
The process of retrieving data from databases forces the hard disk to perform seeking. No matter how much you "chunk" that, if the disk is slow and if the data is scattered around the disk - that part will be slow and creating more workers that deal with parts of the data will just make the system slower and slower since each worker will have the same problem with retrieving the data.
I'd conclude that your issue lies in a slow hard disk and a data set that is too big and possibly not structured well for chunked retrieval. I suggest updating this question or creating another question that focuses on retrieving the data faster and in a more optimal way.
I have quite a long, memory intensive loop. I can't run it in one go because my server places a time limit for execution and or I run out of memory.
I want to split up this loop into smaller chunks.
I had an idea to split the loop into smaller chunks and then set a location header to reload the script with new starting conditions.
MY OLD SCRIPT (Pseudocode. I'm aware of the shortcomings below)
for($i=0;$i<1000;$i++)
{
//FUNCTION
}
MY NEW SCRIPT
$start = isset($_GET['start']) ? (int) $_GET['start'] : 0;
$end = $start + 10;
for ($i = $start; $i < $end; $i++)
{
    //FUNCTION
}
header("Location: script.php?start=$end");
However, my new script runs successfully for a few iterations and then I get a server error "Too many redirects"
Is there a way around this? Can someone suggest a better strategy?
I'm on a shared server so I can't increase memory allocation or script execution time.
I'd like a PHP solution.
Thanks.
"Too many redirects" is a browser error, so a PHP solution would be to use cURL or standard streams to load the initial page and let it follow all redirects. You would have to run this from a machine without time-out limitations though (e.g. using CLI)
Another thing to consider is to use AJAX. A piece of JavaScript on your page will run your script, gather the output from your script and determine whether to stop (end of computation) or continue (start from X). This way you can create a nifty progress meter too ;-)
You probably want to look into forking child processes to do the work. These child processes can do the work in smaller chunks in their own memory space, while the parent process fires off multiple children. This is commonly handled by Gearman, but can be done without.
Take a look at Forking PHP on Dealnews' Developers site. It has a library and some sample code to help manage code that needs to spawn child processes.
Generally if I have to iterate over something many many times and it has a decent amount of data, I use a "lazy load" type application like:
$data_holder = array();
for ($i = $start; $i < $end; $i++)
{
    $data_holder[] = "adding my big data chunks!";
    if ($i % 5 == 1) {
        //function to process data
        process_data($data_holder); // process that data like a boss!
        $data_holder = array(); // This frees up the memory and keeps the variable defined
    }
}
// Now pick up the stragglers of whatever is left in the data chunk
if (count($data_holder) > 0) {
    process_data($data_holder);
}
That way you can continue to iterate through your data, but you don't fill up your memory. You can work in chunks, then unset the data, work in chunks, unset data, etc., to help prevent memory exhaustion. As for execution time, that depends on how much you have to do and how efficiently your script is written.
The basic premise -- "Process your data in smaller chunks to avoid memory issues. Keep your design simple to keep it fast."
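The same premise can be sketched with a generator, so the chunking logic lives in one place and only one chunk is held in memory at a time. This is an illustrative variant, not the code from the answer above; `inChunks` is a hypothetical helper name.

```php
<?php
/**
 * Yield the items of any iterable in arrays of at most $size elements,
 * so only one chunk is held in memory at a time.
 * (Hypothetical helper for illustration.)
 */
function inChunks(iterable $items, int $size): Generator
{
    $chunk = [];
    foreach ($items as $item) {
        $chunk[] = $item;
        if (count($chunk) === $size) {
            yield $chunk;
            $chunk = []; // drop the processed chunk
        }
    }
    if ($chunk !== []) {
        yield $chunk; // the stragglers
    }
}

// Usage: walk 12 items five at a time.
foreach (inChunks(range(1, 12), 5) as $chunk) {
    echo implode(',', $chunk), PHP_EOL;
}
```

Since `$items` can itself be a generator (for example one that reads rows from a result set), the full data set never has to exist as a single array.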
How about you put a conditional inside your loop to sleep every 100 iterations?
for ($i = 0; $i < 1000; $i++)
{
    if ($i % 100 == 0)
        sleep(1800); //Sleep for half an hour
}
First off, without knowing what you're doing inside the loop, it's hard to tell you the best approach to actually solving your issue. However, if you want to execute something that takes a really long time, my suggestion would be to set up a cron job and let it work through little portions at a time. The script would log where it stops, and the next time it starts up, it could read the log to find out where to start.
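A minimal sketch of that "log where it stops" idea, assuming a plain text file is an acceptable place to keep the position. `runNextBatch`, the state file, and the `$work` callback are all hypothetical names.

```php
<?php
/**
 * Run up to $batchSize iterations of $work, starting from the index saved
 * in $stateFile, then save the new position. Returns true once $total
 * iterations have been completed.
 * (Hypothetical helper; $work stands in for the body of the long loop.)
 */
function runNextBatch(string $stateFile, int $total, int $batchSize, callable $work): bool
{
    // A missing or empty state file means "start from the beginning".
    $start = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;
    $end = min($start + $batchSize, $total);
    for ($i = $start; $i < $end; $i++) {
        $work($i);
    }
    // Record where the next run should pick up.
    file_put_contents($stateFile, (string) $end, LOCK_EX);
    return $end >= $total;
}
```

Each cron invocation would call this once with a batch size small enough to finish inside the host's time and memory limits, and the job can be disabled once it returns true.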
Edit: If you are dead set against cron, and you aren't too concerned about user experience, you could do this:
Let the page load similar to the cron job above. Except after so many seconds or iterations, stop the script. Display a refresh meta tag or javascript refresh. Do this until the task is done.
With the limitations you have, I think the approach you are using could work. It may be that your browser is trying to be smart and not let you redirect back to the page you were just on. It might be trying to prevent an endless loop.
You could try
Redirecting back and forth between two scripts that are identical (or aliases).
A different browser.
Having your script output an HTML page with a refresh tag, e.g.
<meta http-equiv="refresh" content="1; url=http://example.com/script.php?start=xxx">
I am new to working with large amounts of data. I am wondering if there are any best practices when querying a database in batches or if anyone can give any advice.
I have a query that will pull out all data and PHP is used to write the data to an XML file. There can be anywhere between 10 and 500,000 rows of data and I have therefore written the script to pull the data out in batches of 50, write to the file, then get the next 50 rows, append this to the file etc. Is this OK or should I be doing something else? Could I increase the batch size or should I decrease it to make the script run faster?
Any advice would be much appreciated.
Yes, for huge results it is recommended to use batches (performance and memory reasons).
Here is a benchmark and example code for running a query in batches
The best way to do this depends on a couple of different things. Most importantly is when and why you are creating this XML file.
If you are creating the XML file on demand, and a user is waiting for the file then you'll need to do some fine tuning and testing for performance.
If it's something that's created on a regular basis, maybe a nightly or hourly task, and then the XML file is requested after it's built (something like an RSS feed builder) then if what you have works I would recommend not messing with it.
As far as performance, there are different things that can help. Put in some simple timers into your scripts and play with the number of records per batch and see if there is any performance differences.
$start = microtime(true);
//process batch
$end = microtime(true);
$runTimeMilliseconds = ($end - $start) * 1000; // microtime(true) returns seconds
If the issue is user feedback, you may consider using AJAX to kick off each batch and report progress to the user. If you give the user feedback, they'll usually be happy to wait longer than if they're just waiting on the page to refresh in whole.
Also, check your SQL query to make sure there are no hidden performance penalties there. EXPLAIN can show you how MySQL goes about processing your queries: http://dev.mysql.com/doc/refman/5.0/en/explain.html
At an extreme, I'd imagine the best performance could be accomplished through parallel processing. I haven't worked with it in PHP, but here's the primary reference http://www.php.net/manual/en/refs.fileprocess.process.php
Depending on your hosting environment you could find the total number of records and split it among sub processes. Each building their own XML fragments. Then you could combine the fragments. So process 1 may handle records 0 to 99, process 2 100 to 199, etc.
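The splitting step can be sketched as a small function that turns a record count into one offset/limit pair per sub-process. `splitRanges` is a hypothetical name, and how the sub-processes are started is left out on purpose.

```php
<?php
/**
 * Split $total records into at most $workers contiguous ranges whose
 * sizes differ by no more than one. Each range is an offset/limit pair.
 * (Hypothetical helper for illustration.)
 */
function splitRanges(int $total, int $workers): array
{
    $ranges = [];
    $base = intdiv($total, $workers);  // minimum records per worker
    $extra = $total % $workers;        // first $extra workers take one more
    $offset = 0;
    for ($w = 0; $w < $workers; $w++) {
        $limit = $base + ($w < $extra ? 1 : 0);
        if ($limit === 0) {
            break; // fewer records than workers
        }
        $ranges[] = ['offset' => $offset, 'limit' => $limit];
        $offset += $limit;
    }
    return $ranges;
}
```

With 200 records and 2 workers this gives records 0 to 99 and 100 to 199, matching the example above. Each pair would typically feed a `LIMIT ... OFFSET ...` clause on a query with a stable `ORDER BY`, so that the ranges do not overlap.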
You would be surprised: ONE simple select-all without a limit is the fastest, because it queries the database only once and everything else is processed locally.
$sql = "select all_columns from table";
<?php
// set a very high memory limit
// query without limit; avoid sorting if you can
// iterate over the mysql result and copy it into an array
// $results[] = $row
// free mysql_result
// write xml once every thousand rows
// because building xml consumes the MOST memory
for ($i=0; $i<$len; ++$i)
{
    $arr = $results[$i];
    // do any xml preparation
    // don't forget that file writes are expensive too
    if ($i%1000 == 0 && $i > 0)
    {
        // write to file
    }
}
?>
The best way to go about this is to schedule it as a cron job, which I think is the best solution for batch processing in PHP. Check this link for more info: Batch Processing in PHP. Hope this helps.
I have to download 2.5k+ files using curl. I'm using Drupal's built-in Batch API to fire the curl script without it timing out, but it's taking well over 10 minutes to grab and save the files.
Add to this the processing of the actual files, and the potential runtime of this script is around 30 minutes. Server performance isn't an issue as both the dev/staging and live servers are more than powerful enough.
I'm looking for suggestions on how to improve the speed. The overall execution time isn't too big of a deal as this is meant to be run once but it would be nice to know the alternatives.
Let's assume for a second that the problem is end-to-end latency, not bandwidth or CPU. Latency in this case is around making a system call out to curl, building up the HTTP connection, requesting the file and tearing down the connection.
One approach is to shard out your requests and run them in parallel. You mention Drupal so I assume you're talking about PHP here. Let's also assume that the 2.5k files are listed in an array in URL form. You can do something like this:
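<?php
$urls = array(...);
$workers = 4;
// Number of rounds; each round fetches $workers URLs.
$shard_size = count($urls) / $workers;
for ($i = 0; $i < $shard_size; $i++) {
    // Start $workers - 1 downloads in the background. Their output is
    // redirected, otherwise PHP waits for each command to finish.
    for ($j = 0; $j < $workers - 1; $j++) {
        system("curl -s -O " . escapeshellarg($urls[$i * $workers + $j]) . " > /dev/null 2>&1 &");
    }
    // Run the last download of the round in the foreground for pacing.
    system("curl -s -O " . escapeshellarg($urls[$i * $workers + $j]));
}
?>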
This is pretty lame, but you get the idea. It forks off $workers - 1 subprocesses to get files in the background, and runs the last worker in the foreground so that you get some pacing. It should scale roughly linearly with the number of workers. It does not take into account the edge case where the size of the data set doesn't divide evenly by the number of workers. I bet you can take this approach and make something reasonably fast.
Curl also supports requesting multiple files on the same command line, but I don't know if it's smart enough to reuse an existing HTTP connection. It might be.
After playing around with a few different methods I came to conclusion that you just have to bite the bullet and go for it.
The script takes a while to process but it has a lot of data to churn through.
I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB and then a second .php script crawls the collected links.
The scripts only crawl two levels down into the page so.. main page www.url.com and links on that page www.url.com/post1 www.url.com/post2
My problem is that as I start to get a larger collection of blogs, they are only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works it seems I cannot just allow the scripts to process more than one or a limited amount of links due to script execution times. Memory limits. Timeouts etc.
Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts affecting the DB but write them so they do not overwrite each other but queue the results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
USE CURL MULTI!
Curl multi will let you process the pages in parallel.
http://us3.php.net/curl
Most of the time you are waiting on the websites, doing the db insertions and html parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape and send them out to curl multi. Wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
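A minimal sketch of that flow using PHP's curl_multi functions. `fetchAll` is a hypothetical helper name, the 30-second timeout is an arbitrary assumption, and error handling is reduced to the bare minimum.

```php
<?php
/**
 * Fetch all $urls in parallel and return the response bodies keyed like
 * the input array. A failed transfer yields an empty string or null.
 * (Hypothetical helper; timeout and options are assumptions.)
 */
function fetchAll(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body in memory
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for activity instead of spinning
        }
    } while ($running && $status === CURLM_OK);

    // Collect the results serially once the downloads are done.
    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

For a long list of blogs it is probably wiser to pass the URLs in slices (for example with `array_chunk`) rather than opening every connection at once. Parsing and database inserts then happen on the returned array, after the network wait is over.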
pseudo code for running parallel scanners:
start_a_scan(){
    //Start mysql transaction (needs InnoDB afaik)
    BEGIN
    //Get first entry that has timed out and is not being scanned by someone
    //(And acquire an exclusive lock on affected rows)
    $row = SELECT * FROM scan_targets WHERE being_scanned = false AND \
        (scanned_at + 60) < (NOW()+0) ORDER BY scanned_at ASC \
        LIMIT 1 FOR UPDATE
    //let everyone know we're scanning this one, so they'll keep out
    UPDATE scan_targets SET being_scanned = true WHERE id = $row['id']
    //Commit transaction
    COMMIT
    //scan
    scan_target($row['url'])
    //update entry state to allow it to be scanned in the future again
    UPDATE scan_targets SET being_scanned = false, \
        scanned_at = NOW() WHERE id = $row['id']
}
You'd probably also need a 'cleaner' that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yey!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable etc... I use it myself.
"Due to how PHP works it seems I cannot just allow the scripts to process more than one or a limited amount of links due to script execution times. Memory limits. Timeouts etc."
The memory limit is only a problem if your code leaks memory. You should fix that, rather than raising the memory limit. The script execution time limit is a safety measure, which you can simply disable for your CLI scripts.
"Also I cannot run multiple instances of the same script as they will overwrite each other in the DB."
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site, e.g. start a separate script for each site you want to crawl.
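A variation on partitioning per site, for the case where you would rather run a fixed number of workers than one script per site, is to derive the worker number from the host name. This is only a sketch: `partitionFor` is a hypothetical name, and it assumes the stored URLs include a scheme such as http://.

```php
<?php
/**
 * Map a site URL to one of $workers partitions (0 .. $workers - 1).
 * The same host always lands in the same partition, so two workers
 * never crawl the same site.
 * (Hypothetical helper; assumes URLs include a scheme.)
 */
function partitionFor(string $url, int $workers): int
{
    // Fall back to the raw string if no host can be parsed out.
    $host = parse_url($url, PHP_URL_HOST) ?: $url;
    // Mask to 31 bits so the result is non-negative on 32-bit builds too.
    return (crc32(strtolower($host)) & 0x7fffffff) % $workers;
}
```

Worker number N would then only pick up rows where `partitionFor($url, $workers) === N`, so no two instances ever write results for the same blog.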
CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)