PHP / HTTP Timeout when using Large set of Curl Request

PHP / HTTP Timeout when using Large set of Curl Request - php

I have a PHP script which basically goes through my database (currently only about 200 rows)
check for XYZ urls and then trys to scrape them to see if an update value is available.
Now I set the timeout on curl to 5 seconds as shouldn't take longer than this for each request.
I have set the timeout on PHP (max timeout) to 0 - which is unlimited.
I can now successfully do sets of 40 or so rows by add LIMIT 40 to my sql cmd.
Currently I have the scrape in part of a loop for 'each row' returned by the SQL command.
This is ending in a HTTP timeout presumably because it isn't returned within a specific time.
Now my question is can I load each 'loop' too the page individually so you can see them appear as they are completed also this would resolve the HTTP timeout?
I dont have access to apache as its using standard hosting package.
Could any one assist me in the right direction to write this code so it doesn't timeout.
Hope this makes sense if you want me to go over anything above please let me know!

Related

Stopping requests server-side

I've got a name lookup box that operates by your typical ajax requests. Here's the general flow of the Javascript that fires every time a letter is pressed:
If ajax request already open, then abort it.
If timeout already created, destroy it.
Set new timeout to run the following in half a second:
Send string to 'nameLookup.php' via ajax
Wait for response
Display results
The issue is that nameLookup.php is very resource heavy. In some cases up to 10,000 names are being pulled from an SQL database, decrypted, and compared against the string. Under normal circumstances requests can take anywhere from 5 to 60 seconds to return.
I completely understand that when you abort a request on the client side the server is still working on things and sends back the result. That it's just the client side that knows to ignore the response. But the server is getting so hung up on working on all of these requests.
So if you do:
Request 1
Abort Request 1
Request 2
Abort Request 2
Request 3
Wait for response from Request 3
The server is either not even working on Request 3 until it's finished with 1 and 2... or it's just so hung up on working on Request 1 and 2 that Request 3 is taking an extra long amount of time.
I need to know how to tell the server to stop working on Request 1 and 2 so I can free up resources for it to work on Request 3.
I'm using Javascript & jQuery on the client side. PHP/Apache and SQL on the server side.

Store a boolean value in the DB in a table, or in the session.
Have your resource intensive script check periodically that value to see if it should continue or not. If the DB says to stop, then your script cancels itself (by calling return; in the current function for example).
When you want to cancel, instead of calling abort();, make an AJAX request to set that value to false.
Next time the resource checks that value it will see that it has to stop.
Potential limitations:
1. Your script does not have a way of checking periodically the DB.
2. Based on how often the script checks the DB, it might take a few seconds to effectively kill the script.

I think there is something missing from this question. What are the triggers for doing the requests? You might be trying to solve the wrong problem.
Let me elaborate on that. If you lookup box is actually doing autocompletion of some kind and is doing a new search everytime the user presses a key, then you are going to have the issue you describe.
The solution in that case is not killing all the process. The solution lies in not starting them. So, you might make some decisions like not trying to search if there is only one character to search with - lets say we go for three. We then might say we want to wait until we can be reasonable sure the user has finished typing before sending off the request. Lets say we wait 1 second.
Now, someone looking for all the Paul's in you list of names will send off one search when they type 'pau' and pause for 1 second, instead of three searches for 'p' then 'pa' then 'pau'... so no need to kill anything.

I've come up with an awesome solution that I've tested and it's working beautifully. It's just a few lines of PHP code to put in whatever files are being resource intensive.
This solution utilizes the Process Identifier (PID) from the server. We can use two PHP function: posix_getpid() to get the current PID and posix_kill() to kill another PID. This also assumes that you already have called session_start() somewhere else.
Here's the code:
//if any existing PIDs to kill, go through each
if ($_SESSION['pid']) foreach ($_SESSION['pid'] as $i => $pid) {
//if posix_kill returns true, unset this PID from the session so we don't waste time killing it again
if(posix_kill($pid,0)) unset($_SESSION['pid'][$i]);
}
//now that all others are killed, we can store the current PID in the session
$_SESSION['pid'][]=posix_getpid();
//close the session now, otherwise the PID we just added won't actually be saved to the session until the process ends.
session_write_close();
Couple things to note:
posix_kill has two values. The first is the pid, and the second is supposed to be one of the signal constants from this list. Nothing there made any sense to me, other people seemed to have success just using 0, and when I use 0 it returns true. So whatever works!
calling session_write_close() before the resource intensive things start happening is crucial. Otherwise the new PID that has been saved to the session won't ACTUALLY be saved to the session until all of the page's processing is done. Which means the next process won't know to cancel the one that's still going on and taking forever.

Optimize PHP CURL for web crawler

I am trying to code a crawler based on PHP with curl. I have database of 20,000-30,000 URLs that I have to crawl. Each call to curl to fetch a webpage takes around 4-5 seconds.
How can I optimize this and reduce the time required to fetch a page?

You can use curl_multi_* for that. The amount of curl resources you append to one multi handle is the amount of parallel requests it will do. I usually start with 20-30 threads, depending on the size of returned content (make sure your script won't terminate on memory limit).
Note, that it will run as long as it takes to run the slowest request. So if a request times out, you might wait for very long. To avoid that, it can be a good idea to set timeout to some acceptable value.
You can see the code example at my answer in another thread here.

comet style application with loops

Do all comet style applications require a loop somewhere in the application on the serverside to detect updates/changes? If no, please could you explain how the logic behind a loopless comet style application would work?

This kind of application will always require a loop, you need to periodically check for new data etc. Of course you can make the "loop" non-blocking by using an even-loop based approach, but in the end there's still a loop somewhere.
Just think about it for a moment, how would you make it work without a loop? I sure can't imagine a way that doesn't utilize a loop somewhere.

Short answer is, no, not all require a loop on the serverside.
Instead you can use long-polling AJAX calls from the browser to request data,
at which the server simply responds with the data and the browser waits until the response is gotten before sending a new request.

The solution could be stream_set_blocking. Use any possible blocking resource to be suspended by OS and wait for appropriate interruption.
Client side:
Ajax call to endpoint script (timeout for ajax e.g. 30 seconds - after 30 seconds initiate another one - after 30 seconds you will get response from server - script execution time reached)
If you will get response during 30 seconds handle response (async) and open new connection (as in comet done - I saw it in cometD client)
Server setup:
Setup apache timeouts (between request and data sent to 30-31 second), this is so apache will allow you to wait so much
set apache to allow lot of child instances (concurrent users * 1.5), but you need to be sure that you have enough memory for this amount of apache instances (+ memory used by php children)
Script one:
execution_time = 28
set shutdown_function in order to send response (timeout, but formatted and understandable for ajax if You need it)
you need to open file, empty one
enable blocking mode using stream_set_blocking for file stream
try read from file and you will get suspended until other process will write to file or timeout be reached.
As soon as script gets content in file written from other process it will get back and will send response. (this will trigger another ajax call and another slept process)
Worst thing is that you need to think how to get multiple reader scripts reading from same bus (file) without disturbing each other.
Also there could be that timeout will be exactly at that time when message will be written into bus.
(hope that this solution is not as bad as my English)

How can I create a progress bar attached to a server-side process?

We just ran into a problem with our cloud host - they've changed their apache settings to force a much shorter page timeout, and now during certain processes (report creation, etc.) that take more than 15 seconds (which the client is fine with; we're processing huge amounts of data) we get an error:
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request POST /administrator/index.php.
Reason: Error reading from remote server
I have confirmed that our code is still running correctly in the background, and double-checked with the host that this is really just a timeout. Their suggestion was to create a progress bar that is associated with the backend code; that way apache knows something is still going on and won't time out.
I've done progress bars associated the page load events (i.e. when all images are loaded, etc.) but have no idea how to go about creating a progress bar associated with backend code. This is a Joomla site, coded in mvc php, and the code that's causing the issue is part of the model - the various pieces that could be involved are all doing humongous queries. The tables are indexed correctly and the queries are optimized; the issue is not how to make the processes take less time - because we're on a cloud server the timeout limit could be changed to 5 seconds tomorrow without any kind of warning. What I need is someone to point me in the right direction of how to create the progress bar so it's actually associated with the function being run in the model.
Any ideas? I'm a complete beginner as far as this goes.

The easiest way I can think of is to use a two-step process:
Have the model write out events to a textfile or something when it gets to a critical point.
Use some ajax method to have the page regularly check that file for updates, and update the progress bar as such.

Whatever the background process does should update something like a file or database entry with a percentage completed every X seconds or at set places in its flow. Then you can call another script from Javascript every X seconds and it returns the percentage completed via the database record.
updateRecord(0);
readLargeFile();
updateRecord(25);
encodeLargeFile();
updateRecord(50);
writeLargeFile();
updateRecord(75);
celebrate();
updateRecord(100);

Faster alternative to file_get_contents()

Currently I'm using file_get_contents() to submit GET data to an array of sites, but upon execution of the page I get this error:
Fatal error: Maximum execution time of 30 seconds exceeded
All I really want the script to do is start loading the webpage, and then leave. Each webpage may take up to 5 minutes to load fully, and I don't need it to load fully.
Here is what I currently have:
foreach($sites as $s) //Create one line to read from a wide array
{
file_get_contents($s['url']); // Send to the shells
}
EDIT: To clear any confusion, this script is being used to start scripts on other servers, that return no data.
EDIT: I'm now attempting to use cURL to do the trick, by setting a timeout of one second to make it send the data and then stop. Here is my code:
$ch = curl_init($s['url']); //load the urls
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); //Only send the data, don't wait.
curl_exec($ch); //Execute
curl_close($ch); //Close it off.
Perhaps I've set the option wrong. I'm looking through some manuals as we speak. Just giving you an update. Thank you all of you that are helping me thus far.
EDIT: Ah, found the problem. I was using CURLOPT_CONNECTTIMEOUT instead of CURLOPT_TIMEOUT. Whoops.
However now, the scripts aren't triggering. They each use ignore_user_abort(TRUE); so I can't understand the problem
Hah, scratch that. Works now. Thanks a lot everyone

There are many ways to solve this.
You could use cURL with its curl_multi_* functions to execute asynchronously the requests. Or use cURL the common way but using 1 as timeout limit, so it will request and return timeout, but the request will be executed.
If you don't have cURL installed, you could continue using file_get_contents but forking processes (not so cool, but works) using something like ZendX_Console_Process_Unix so you avoid the waiting between each request.

As Franco mentioned and I'm not sure was picked up on, you specifically want to use the curl_multi functions, not the regular curl ones. This packs multiple curl objects into a curl_multi object and executes them simultaneously, returning (or not, in your case) the responses as they arrive.
Example at http://php.net/curl_multi_init

Re your update that you only need to trigger the operation:
You could try using file_get_contents with a timeout. This would lead to the remote script being called, but the connection being terminated after n seconds (e.g. 1).
If the remote script is configured so it continues to run even if the connection is aborted (in PHP that would be ignore_user_abort), it should work.
Try it out. If it doesn't work, you won't get around increasing your time_limit or using an external executable. But from what you're saying - you just need to make the request - this should work. You could even try to set the timeout to 0 but I wouldn't trust that.
From here:
<?php
$ctx = stream_context_create(array(
'http' => array(
'timeout' => 1
)
)
);
file_get_contents("http://example.com/", 0, $ctx);
?>
To be fair, Chris's answer already includes this possibility: curl also has a timeout switch.

it is not file_get_contents() who consume that much time but network connection itself.
Consider not to submit GET data to an array of sites, but create an rss and let them get RSS data.

I don't fully understands the meaning behind your script.
But here is what you can do:
In order to avoid the fatal error quickly you can just add set_time_limit(120) at the beginning of the file. This will allow the script to run for 2 minutes. Of course you can use any number that you want and 0 for infinite.
If you just need to call the url and you don't "care" for the result you should use cUrl in asynchronous mode. This case any call to the URL will not wait till it finished. And you can call them all very quickly.
BR.

If the remote pages take up to 5 minutes to load, your file_get_contents will sit and wait for that 5 minutes. Is there any way you could modify the remote scripts to fork into a background process and do the heavy processing there? That way your initial hit will return almost immediately, and not have to wait for the startup period.
Another possibility is to investigate if a HEAD request would do the trick. HEAD does not return any data, just headers, so it may be enough to trigger the remote jobs and not wait for the full output.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.