Curl download of 2.5k+ files too slow - php

I have to download 2.5k+ files using curl. I'm using Drupal's built-in Batch API to fire the curl script without it timing out, but it's taking well over 10 minutes to grab and save the files.
Add to that the processing of the actual files, and the potential runtime of this script is around 30 minutes. Server performance isn't an issue, as both the dev/staging and live servers are more than powerful enough.
I'm looking for suggestions on how to improve the speed. The overall execution time isn't too big a deal since this is meant to run only once, but it would be nice to know the alternatives.

Let's assume for a second that the problem is end-to-end latency, not bandwidth or CPU. Latency in this case means making a system call out to curl, building up the HTTP connection, requesting the file and tearing down the connection.
One approach is to shard out your requests and run them in parallel. You mention Drupal, so I assume you're talking about PHP here. Let's also assume that the 2.5k files are listed in an array of URLs. You can do something like this:
<?php
$urls = array(...);
$workers = 4;
$shard_size = (int) (count($urls) / $workers);

for ($i = 0; $i < $shard_size; $i++) {
    // Kick off $workers - 1 downloads in the background
    // (-O saves the file under its remote name, -s hides the progress bar).
    for ($j = 0; $j < $workers - 1; $j++) {
        system("curl -sO " . escapeshellarg($urls[$i * $workers + $j]) . " &");
    }
    // Run the last one in the foreground so each round of workers is paced.
    system("curl -sO " . escapeshellarg($urls[$i * $workers + $j]));
}
?>
This is pretty crude, but you get the idea. It forks off $workers - 1 subprocesses to fetch files in the background and runs the last worker in the foreground so that you get some pacing. It should scale roughly linearly with the number of workers. It does not handle the edge case where the size of the data set doesn't divide evenly by the number of workers. I bet you can take this approach and make something reasonably fast.
Curl also supports requesting multiple files on the same command line, but I don't know if it's smart enough to reuse an existing HTTP connection. It might be.
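If you'd rather stay inside PHP than shell out, curl's multi interface can run several transfers in parallel within a single process. Here is a rough sketch, assuming $urls holds the full list and that saving each file under its remote basename is acceptable:
<?php
$urls = array(...);
$batch_size = 10;

foreach (array_chunk($urls, $batch_size) as $batch) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($batch as $url) {
        $ch = curl_init($url);
        $fp = fopen(basename(parse_url($url, PHP_URL_PATH)), 'w');
        curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the body straight to disk
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = array($ch, $fp);
    }

    // Drive all transfers in this batch until they finish.
    $running = 0;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);                         // wait for activity instead of busy-looping
    } while ($running > 0);

    foreach ($handles as $pair) {
        list($ch, $fp) = $pair;
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        fclose($fp);
    }
    curl_multi_close($mh);
}
?>
Tuning $batch_size trades download speed against the load you put on the remote server and your own machine.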

After playing around with a few different methods, I came to the conclusion that you just have to bite the bullet and go for it.
The script takes a while to process, but it has a lot of data to churn through.

Related

MySQL asynchronous database request performance

I've been looking at asynchronous database requests in PHP using mysqlnd. The code works correctly, but comparing the performance of pulling data from one reasonably sized table against the same data split across multiple tables fetched with asynchronous requests, I'm not getting anything like the performance I would expect, although it does seem to vary quite a bit with the hardware setup.
As I understand it, rather than
x = a + b + c + d
I should be achieving
x = max(a, b, c, d)
where x is the total time taken and a to d are the times for the individual requests. What I am actually seeing is a rather minor increase in performance on some setups, and on others worse performance, as if the requests weren't asynchronous at all. Any thoughts or experiences from others who have worked with this and come across the same are welcome.
EDIT: To put numbers on it: the queries are spread over 10 tables, no individual query takes more than around 8 seconds to complete, and adding up the time each request takes when run one after another (not asynchronously) gives a total of around 18 seconds.
Performing the same requests asynchronously, the total query time is also around 18 seconds, so clearly the requests are not being executed in parallel against the database.
EDIT: The code used is exactly as shown in the documentation here:
<?php
$link1 = mysqli_connect();
$link1->query("SELECT 'test'", MYSQLI_ASYNC);
$all_links = array($link1);
$processed = 0;

do {
    $links = $errors = $reject = array();
    foreach ($all_links as $link) {
        $links[] = $errors[] = $reject[] = $link;
    }
    if (!mysqli_poll($links, $errors, $reject, 1)) {
        continue;
    }
    foreach ($links as $link) {
        if ($result = $link->reap_async_query()) {
            print_r($result->fetch_row());
            if (is_object($result)) {
                mysqli_free_result($result);
            }
        } else {
            die(sprintf("MySQLi Error: %s", mysqli_error($link)));
        }
        $processed++;
    }
} while ($processed < count($all_links));
?>
I'll expand on my comments and try to explain why you won't gain any performance with your current setup.
Asynchronous, in your case, means that retrieving the data is asynchronous relative to the rest of your code. The two moving parts, fetching the data and working with the data, are separate and run one after another, but only once the data arrives.
This implies that you want to utilize the CPU to its fullest, so you don't invoke PHP code until the data is ready.
In order for that to work, you must take control of the PHP process and make it use one of the operating system's event interfaces (epoll on Linux, or IOCP on Windows). Since PHP is either embedded into a web server (mod_php) or runs as its own standalone FCGI server (php-fpm), the best use of asynchronous data fetching is from a CLI PHP script, since it's quite difficult to use the event interfaces otherwise.
However, let's focus on your problem and why your code isn't faster.
You assumed that you were CPU bound, and your solution was to retrieve the data in chunks and process it that way. That's fine, but since nothing you do yields faster execution, it means you are 100% I/O bound.
Retrieving data from a database forces the hard disk to seek. No matter how much you "chunk" the work, if the disk is slow and the data is scattered around it, that part stays slow, and creating more workers that each deal with part of the data just makes the system slower and slower, since every worker hits the same problem retrieving its data.
I'd conclude that your issue lies in a slow hard disk and a data set that is too big and perhaps badly structured for chunked retrieval. I suggest updating this question, or creating another one, about how to retrieve the data faster and in a more optimal way.
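One detail worth double-checking: with mysqlnd, each connection can carry only one asynchronous query at a time, so running the ten queries in parallel requires ten separate mysqli links, all passed to mysqli_poll() together. A rough sketch extending the documentation example above (the connection parameters and queries are placeholders):
<?php
// One mysqli link per asynchronous query; connection details are placeholders.
$queries = array(
    "SELECT * FROM table_a",
    "SELECT * FROM table_b",
    "SELECT * FROM table_c",
);

$all_links = array();
foreach ($queries as $sql) {
    $link = mysqli_connect("localhost", "user", "password", "database");
    $link->query($sql, MYSQLI_ASYNC);   // fire the query without waiting for the result
    $all_links[] = $link;
}

$processed = 0;
do {
    $links = $errors = $reject = array();
    foreach ($all_links as $link) {
        $links[] = $errors[] = $reject[] = $link;
    }
    if (!mysqli_poll($links, $errors, $reject, 1)) {
        continue;
    }
    foreach ($links as $link) {
        if ($result = $link->reap_async_query()) {
            // ... work with $result ...
            mysqli_free_result($result);
        }
        $processed++;
        // Stop polling this link once its query has been reaped.
        $key = array_search($link, $all_links, true);
        unset($all_links[$key]);
    }
} while ($processed < count($queries));
?>
If even this shows no overlap in query time, that supports the I/O-bound explanation above.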

Better approach to optimize PHP Cron Job

I can't get into too many specifics as this is a project for work, but anyway...
I'm in the process of writing a SOAP client in PHP that pushes all responses to a MySQL database. My main script makes an initial SOAP request that retrieves a large set of items (approximately 4,000 at the moment, but the list is expected to grow into the hundreds of thousands at some point).
Once this list of 4,000 items is returned, I use exec("/usr/bin/php path/to/my/historyScript.php &") to send a history request for each item. The web service API supports up to 30 requests per second. Below is some pseudocode for what I am currently doing:
$count = 0;
foreach ($items as $item)
{
    if ($count == 30)
    {
        sleep(1); // Sleep for one second before calling the next 30 requests
        $count = 0;
    }
    exec('/usr/bin/php path/to/history/script.php &');
    $count++;
}
The problem I'm running into is that I'm unsure when the processes finish, and my development server is starting to crash. Since the data set is expected to grow, I know this is a very poor solution to my problem.
Is there a better approach I should consider for a task like this? This just feels like a 'hack'.
I'm not sure, but I suspect the reason for your application crashing is that you are keeping a large set of data in a PHP variable. Depending on the available RAM, that data size can bring the system down. My suggestion is to limit the amount of data the external service returns per request, rather than the number of requests to the service.
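The other half of the problem is knowing when the exec'd children finish. One way to keep track of them is to launch the history scripts with proc_open() instead of fire-and-forget exec(), so the parent can poll each child's status. A rough sketch (the script path is the placeholder from the question):
<?php
$running = array();
$count   = 0;

foreach ($items as $item) {
    if ($count == 30) {
        sleep(1);          // stay under 30 requests per second
        $count = 0;
    }

    // proc_open() does not wait for the child, so no trailing '&' is needed.
    $proc = proc_open('/usr/bin/php path/to/history/script.php', array(), $pipes);
    if (is_resource($proc)) {
        $running[] = $proc;
    }
    $count++;

    // Reap children that have already finished so the process table doesn't keep growing.
    foreach ($running as $key => $p) {
        $status = proc_get_status($p);
        if (!$status['running']) {
            proc_close($p);
            unset($running[$key]);
        }
    }
}

// Block until the remaining children are done.
foreach ($running as $p) {
    do {
        $status = proc_get_status($p);
        usleep(100000);    // poll every 0.1 s
    } while ($status['running']);
    proc_close($p);
}
?>
This also gives you a natural place to cap the number of concurrent children if the server starts to struggle.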

How do I split a really long, memory intensive loop into smaller chunks

I have quite a long, memory-intensive loop. I can't run it in one go because my server places a time limit on execution and/or I run out of memory.
I want to split up this loop into smaller chunks.
I had an idea to split the loop into smaller chunks and then set a location header to reload the script with new starting conditions.
MY OLD SCRIPT (Pseudocode. I'm aware of the shortcomings below)
for($i=0;$i<1000;$i++)
{
//FUNCTION
}
MY NEW SCRIPT
$start=$_GET['start'];
$end=$start+10;
for($i=$start;$i<$end;$i++)
{
//FUNCTION
}
header("Location:script.php?start=$end");
However, my new script runs successfully for a few iterations and then I get a server error "Too many redirects"
Is there a way around this? Can someone suggest a better strategy?
I'm on a shared server so I can't increase memory allocation or script execution time.
I'd like a PHP solution.
Thanks.
"Too many redirects" is a browser error, so a PHP solution would be to use cURL or standard streams to load the initial page and let it follow all redirects. You would have to run this from a machine without time-out limitations though (e.g. using CLI)
Another thing to consider is to use AJAX. A piece of JavaScript on your page will run your script, gather the output from your script and determine whether to stop (end of computation) or continue (start from X). This way you can create a nifty progress meter too ;-)
You probably want to look into forking child processes to do the work. These child processes can do the work in smaller chunks in their own memory space, while the parent process fires off multiple children. This is commonly handled by Gearman, but can be done without.
Take a look at Forking PHP on Dealnews' Developers site. It has a library and some sample code to help manage code that needs to spawn child processes.
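If you want a feel for what that looks like without any library, here is a minimal pcntl_fork() sketch (CLI only, assuming the pcntl extension is available); the 1000-iteration range is taken from the question:
<?php
$workers = 4;
$total   = 1000;
$chunk   = (int) ceil($total / $workers);

for ($w = 0; $w < $workers; $w++) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Could not fork\n");
    }
    if ($pid == 0) {                       // child process
        $start = $w * $chunk;
        $end   = min($start + $chunk, $total);
        for ($i = $start; $i < $end; $i++) {
            // FUNCTION: the per-item work from the question
        }
        exit(0);                           // the child must exit, or it will go on to fork its own children
    }
}

// The parent waits for all children so their memory is reclaimed.
while (pcntl_waitpid(0, $status) > 0) {
    // each iteration reaps one finished child
}
?>
Each child gets its own copy of memory, so whatever it allocates is freed when it exits.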
Generally if I have to iterate over something many many times and it has a decent amount of data, I use a "lazy load" type application like:
$data_holder = array();
for ($i = $start; $i < $end; $i++)
{
    $data_holder[] = "adding my big data chunks!";
    if ($i % 5 == 1) {
        // function to process data
        process_data($data_holder); // process that data like a boss!
        unset($data_holder);        // this frees up the memory
        $data_holder = array();     // start the next chunk with a fresh array
    }
}
// Now pick up the stragglers of whatever is left in the data chunk
if (!empty($data_holder)) {
    process_data($data_holder);
}
That way you can continue to iterate through your data without stuffing up your memory: work on a chunk, unset the data, work on the next chunk, unset the data, and so on. As for execution time, that depends on how much you have to do and how efficiently your script is written.
The basic premise -- "Process your data in smaller chunks to avoid memory issues. Keep your design simple to keep it fast."
How about you put a conditional inside your loop to sleep every 100 iterations?
for ($i = 0; $i < 1000; $i++)
{
    if ($i % 100 == 0) {
        sleep(1800); // Sleep for half an hour
    }
}
First off, without knowing what you're doing inside the loop, it's hard to say what the best approach to your issue actually is. However, if you want to execute something that takes a really long time, my suggestion would be to set up a cron job and let it chip away at a small portion at a time. The script would log where it stops, and the next time it starts up it can read the log to find out where to resume.
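A rough sketch of that idea, assuming a plain text file (progress.txt is a made-up name) holds the last completed index and using the 1000-iteration range from the question:
<?php
$progress_file = __DIR__ . '/progress.txt';
$batch_size    = 50;
$total         = 1000;

$start = file_exists($progress_file) ? (int) file_get_contents($progress_file) : 0;
$end   = min($start + $batch_size, $total);

for ($i = $start; $i < $end; $i++) {
    // FUNCTION: the per-item work from the question
}

file_put_contents($progress_file, $end); // the next cron run picks up here

if ($end >= $total) {
    unlink($progress_file);              // done: reset for the next full run
}
?>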
Edit: If you are dead set against cron and you aren't too concerned about user experience, you could do this: let the page load much as in the cron approach above, except that after so many seconds or iterations you stop the script and output a refresh meta tag or a JavaScript refresh. Repeat until the task is done.
With the limitations you have, I think the approach you are using could work. It may be that your browser is trying to be smart and won't let you redirect back to the page you were just on; it might be trying to prevent an endless loop.
You could try:
1. Redirecting back and forth between two scripts that are identical (or aliases).
2. A different browser.
3. Having your script output an HTML page with a refresh tag, e.g.
<meta http-equiv="refresh" content="1; url=http://example.com/script.php?start=xxx">

What is the known or expected impact of using a PHP/QueryPath crawler on a target web server, and how can it be kept to a minimum?

I'm building a PHP + QueryPath crawler to prototype an idea. I'm worried that once I run it, the target site might be affected in some way, since it has a large number of relevant pages I want to scrape -- 1361 pages at the moment.
What are the recommendations to keep the impact to a minimum on the target site?
Since you are building a crawler, the only impact you can have on the target website is using up its bandwidth.
To keep the impact to a minimum, you can do the following:
1. While building your crawler, download a sample page of the target site to your computer and test your script against that copy.
2. Ensure that the loop which scrapes the 1361 pages is functioning properly and downloads each page only once.
3. Ensure that your script downloads only one page at a time, and optionally add an interval between fetches so that there is less load on the target server.
4. Depending on how heavy each page is, you can decide to spread the 1361 pages over hours, days or months.
QueryPath itself will issue vanilla HTTP requests -- nothing fancy at all. 1361 is not necessarily a large number, either.
I would suggest running your crawl in a loop, grabbing some number of pages (say, 10) in a row, sleeping for several seconds, and then grabbing another ten. Assuming $urls is an array of URLs you could try something like this:
$count = count($urls);
$interval = 10; // Every ten requests...
$wait = 2;      // ...wait two seconds.
for ($i = 0; $i < $count; ++$i) {
    // Do whatever you're going to do with QueryPath.
    $qp = qp($urls[$i]);
    if ($i > 0 && $i % $interval == 0) {
        sleep($wait);
    }
}
As the previous poster suggests, test with a smaller number of URLs, then go up from there.
Here are a few other tips:
The robots.txt file of the remote site sometimes states how long a crawler should wait between requests (Crawl-delay). If it is set, it is a good indicator of what your $wait variable should be (see the sketch after these tips).
Hitting the site off-peak (e.g. at 2AM local time) will minimize the chances that the remote site is flooded with requests.
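Here is a rough sketch of reading that Crawl-delay value and falling back to the $wait from the loop above (the robots.txt URL is a placeholder):
<?php
$wait   = 2; // fallback, as in the example above
$robots = @file_get_contents('http://example.com/robots.txt');

if ($robots !== false && preg_match('/^Crawl-delay:\s*(\d+)/mi', $robots, $m)) {
    $wait = (int) $m[1]; // respect the site's requested delay
}
?>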

Improved efficiency when making multiple requests to Google Complete API with PHP

I was playing around with the Google Complete API looking for a quick way to get hold of the top 26 most searched terms for various question prefixes - one for each letter of the alphabet.
I wouldn't count myself a programmer but it seemed like a fun task!
My script works fine locally, but it takes too long on my shared server and times out after 30 seconds -- and since the hosting is shared, I can't access php.ini to raise the max execution time.
It made me wonder if there was a more efficient way of making the requests to the API, here is my code:
<?php
$prep = $_POST['question'];
$letters = range('a', 'z'); // build the alphabet once, outside the loop

foreach ($letters as $letter) {
    $term = $prep . $letter;
    $xml = simplexml_load_file('http://google.com/complete/search?output=toolbar&q=' . urlencode($term));
    if ($xml === false) {
        trigger_error('Error reading XML file', E_USER_ERROR);
    }
    $result  = ucfirst($xml->CompleteSuggestion->suggestion->attributes()->data);
    $queries = number_format((int) $xml->CompleteSuggestion->num_queries->attributes()->int);
    echo '<p><span>' . ucfirst($letter) . ':</span> ' . $result . '?</p>';
    echo '<p class="queries">Number of queries: ' . $queries . '</p><br />';
}
?>
I also wrote a few lines that fed the question into the Yahoo Answers API, which worked pretty well, although it made the results take even longer and I couldn't exact-match on the search term through the API, so I got a few odd answers back!
Basically, is the above code the most efficient way of calling an API multiple times?
Thanks,
Rich
Look at this from the user's perspective and ask yourself:
Would you want to wait 30 seconds for a web page to load?
Obviously not.
How can you make the page load faster?
You are depending on an external resource (the Google API), and you are not calling it once but 26 times, one after another. If you issue those 26 requests in parallel instead, the total time drops from roughly 26 round trips to about 1, at the expense of network bandwidth.
Take a look at http://php.net/manual/en/function.curl-multi-exec.php;
that is the first step of the optimization.
Once you have that done, the time spent on the external resource could drop by up to 95%.
Is that good enough?
Obviously not yet.
Any call to an external resource is unreliable, even Google's:
if the network is down or DNS won't resolve, your page goes down too.
How do you prevent that?
You need a cache. The basic logic is:
look for an existing cache entry; if found, return it from the cache
if not, query the Google API (a to z)
store the result in the cache
return the result
(A minimal sketch of this follows at the end of this answer.)
However, an on-demand cache is still not ideal (the first user to issue a request waits the longest). If you know the range of possible user inputs (hopefully it isn't that big), you can use a scheduler (cron job) to periodically pull results from the Google API and store them locally.
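Here is a minimal file-based sketch of that cache-then-fetch logic. The cache file name and the one-day lifetime are assumptions, not part of the question:
<?php
function get_suggestions($prep) {
    $cache_file = sys_get_temp_dir() . '/suggest_' . md5($prep) . '.json';
    $ttl = 86400; // cache results for one day

    // 1. Return from the cache if it is fresh enough.
    if (file_exists($cache_file) && time() - filemtime($cache_file) < $ttl) {
        return json_decode(file_get_contents($cache_file), true);
    }

    // 2. Otherwise query the API for a-z (this part could also use curl_multi).
    $results = array();
    foreach (range('a', 'z') as $letter) {
        $xml = simplexml_load_file('http://google.com/complete/search?output=toolbar&q=' . urlencode($prep . $letter));
        if ($xml !== false) {
            $results[$letter] = (string) $xml->CompleteSuggestion->suggestion->attributes()->data;
        }
    }

    // 3. Store the result and return it.
    file_put_contents($cache_file, json_encode($results));
    return $results;
}
?>
A cron job can call the same function ahead of time so the cache is already warm when a user asks.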
I recommend using cron jobs for this kind of work. That way you can either change the max execution time with a parameter, or split the work into multiple operations and run the cron job more regularly so it performs one operation after another.
