I have a widget that runs on my homepage and loads XML data from an external source. I want to time out the XML load after x seconds (lately the other site has been having load issues). Here is the function I have so far. I can't figure out how to make the timer interact with simplexml_load_file().
Am I on the right track? Is there a way to make this work? Or is there a better way to do this? If this does timeout, I still need the rest of the page to continue loading, so I can't use set_time_limit(), because that will end all script execution, right?
function timer($end) {
$count = 0;
while($end > $count) {
sleep(1);
$count++;
}
return true;
}
$we = simplexml_load_file('http://forecast.weather.gov/MapClick.php?lat=44.08920&lon=-70.17250&FcstType=xml');
if(timer(3)) return;
So you want to set a timeout for simplexml_load_file(). You can't set it specifically, but you can just set it globally (for all socket based streams) before using the function:
ini_set('default_socket_timeout', 3);
$we = simplexml_load_file($url);
// you can restore the default value after use, if you want
ini_restore('default_socket_timeout');
I would use CURL instead of loading the URL directly...
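function getXml($url, $timeout = 0) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,    // return the body instead of printing it
        CURLOPT_TIMEOUT => (int) $timeout  // total time limit in seconds; 0 means no limit
    ));
    $xml = curl_exec($ch);
    curl_close($ch);
    // curl_exec() returns false on failure or timeout
    if ($xml) {
        return new SimpleXMLElement($xml);
    }
    return null;
}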
//Example
$xmlData = getXml('http://yoururl.com', 2); // 2 second timeout
You could first read the content of the file with a more reliable function that lets you control the timeout (such as fopen, fsockopen, or cURL; choose whichever is available to you) and then pass the content to simplexml_load_string instead of simplexml_load_file.
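One way to sketch this idea, assuming you want a per-request timeout rather than changing the global default_socket_timeout: the hypothetical helper loadXmlWithTimeout() below reads the body with file_get_contents() through a stream context that carries an http timeout, then parses it with simplexml_load_string(). The helper name and the null-on-failure convention are my own assumptions, not something from the answers above.

```php
<?php
// Hypothetical helper: fetch a document with a timeout, then parse it.
// Returns null if the fetch fails, times out, or the XML is invalid.
function loadXmlWithTimeout(string $url, float $timeout): ?SimpleXMLElement
{
    // The 'timeout' option (in seconds) applies to HTTP(S) reads.
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout),
    ));

    // @ suppresses the warning; failure is reported via the false return value.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        return null;
    }

    // Collect parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($body);

    return $xml === false ? null : $xml;
}
```

If the call returns null, the widget can simply be skipped and the rest of the page keeps loading.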
Related
I'm working on a simple app that scans all the URLs on a page and displays the HTTP status code of every URL. When there are more than 50 URLs I get the error "Fatal error: Maximum execution time of 30 seconds exceeded". What I did was add this line of code: ini_set('max_execution_time', 300);. It works, but my problem is that I have to wait until it finishes before the whole result is displayed. Is there a way to optimize it, or to display the results while it is scanning?
Thanks.
CODE
$html = file_get_contents('http://www.example.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url.'-'. statusCode($url) .'<br />';
}
function statusCode($url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return $httpCode;
}
PHP makes HTTP requests synchronously. To achieve the result you expect, you should make them asynchronously. I suggest using the Guzzle library. A simple async request looks like:
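$client = new \GuzzleHttp\Client(); // assumes Guzzle is installed and autoloaded
$request = new \GuzzleHttp\Psr7\Request('GET', 'http://example.com');
$promise = $client->sendAsync($request)->then(function ($response) {
    echo 'I completed! ' . $response->getBody();
});
$promise->wait(); // block until the request has finished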
So when you make more than 50 HTTP requests asynchronously, you get each result as soon as its request finishes.
Yes, because PHP needs to finish the process before it displays anything: only after all processing is done does PHP output the results to the page. As @Romeo Sierra said, if you run it in the PHP shell, it will display the results as it goes.
I think you want to display it in a page, right?
You can do this:
Pass the data for the loop to the page and build a JavaScript loop from it.
In that loop, make an Ajax call that fetches one result, according to the loop index.
On Ajax success, append the data to a list or table.
The results will then be displayed as the process goes along.
Hope this works.
When I run a check on 10 URLs, if I am able to get a connection with the host server, the handle will return a success message (CURLE_OK).
When processing each handle, if a server refuses the connection, the handle will include an error message.
The problem
I assumed that when we get a bad handle, cURL would mark this handle but continue to process the unprocessed handles; however, this is not what seems to happen.
When we come across a bad handle, cURL will mark this handle as bad, but will not process the remaining unprocessed handles.
This can be hard to detect: if I do get a connection with all handles, which is what happens most of the time, then the problem is not visible (cURL only stops on the first bad connection).
For the test, I had to find a suitable site which loads slowly or refuses x number of simultaneous connections.
set_time_limit(0);
$l = array(
'http://smotri.com/video/list/',
'http://smotri.com/video/list/sports/',
'http://smotri.com/video/list/animals/',
'http://smotri.com/video/list/travel/',
'http://smotri.com/video/list/hobby/',
'http://smotri.com/video/list/gaming/',
'http://smotri.com/video/list/mult/',
'http://smotri.com/video/list/erotic/',
'http://smotri.com/video/list/auto/',
'http://smotri.com/video/list/humour/',
'http://smotri.com/video/list/film/'
);
$mh = curl_multi_init();
$s = 0;
$f = 10;
while($s <= $f)
{
$ch = curl_init();
$curlsettings = array(
CURLOPT_URL => $l[$s],
CURLOPT_TIMEOUT => 0,
CURLOPT_CONNECTTIMEOUT => 0,
CURLOPT_RETURNTRANSFER => 1
);
curl_setopt_array($ch, $curlsettings);
curl_multi_add_handle($mh,$ch);
$s++;
}
$active = null;
do
{
curl_multi_exec($mh,$active);
curl_multi_select($mh);
$info = curl_multi_info_read($mh);
echo '<pre>';
var_dump($info);
if($info['result'] === CURLE_OK)
echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' success<br>';
if($info['result'] != 0)
echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' failed<br>';
} while ($active > 0);
curl_multi_close($mh);
I have dumped $info in the script which asks the Multi Handle if there is any new information on any handles whilst running.
When the script has ended we will see some bool(false) - when no new information was available(handles were still processing), along with all handles if all was successful or limited handles if one handle failed.
I have failed to fix this; it's probably something I have overlooked, and I have gone too far down the road of trying to fix things that are not relevant.
Some of my attempts at fixing this were:
Assigning each $ch handle to an array - $ch[1], $ch[2] etc. (instead of adding the current $ch handle to the multi handle and then overwriting it, as in the test)
Removing handles after success/failure with curl_multi_remove_handle
Setting CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT to infinity
Many more (I will update this post, as I have forgotten all of them)
Testing this with PHP version 5.4.14.
Hopefully I have illustrated the points well enough.
Thanks for reading.
I've been mucking around with your script for a while now trying to get it to work. It was only when I read "Repeated calls to this function will return a new result each time, until a FALSE is returned as a signal that there is no more to get at this point." at http://se2.php.net/manual/en/function.curl-multi-info-read.php that I realized a while loop might work.
The extra while loop makes it behave exactly how you'd expect. Here is the output I get:
http://smotri.com/video/list/sports/ failed
http://smotri.com/video/list/travel/ failed
http://smotri.com/video/list/gaming/ failed
http://smotri.com/video/list/erotic/ failed
http://smotri.com/video/list/humour/ failed
http://smotri.com/video/list/animals/ success
http://smotri.com/video/list/film/ success
http://smotri.com/video/list/auto/ success
http://smotri.com/video/list/ failed
http://smotri.com/video/list/hobby/ failed
http://smotri.com/video/list/mult/ failed
Here's the code I used for testing:
<?php
set_time_limit(0);
$l = array(
'http://smotri.com/video/list/',
'http://smotri.com/video/list/sports/',
'http://smotri.com/video/list/animals/',
'http://smotri.com/video/list/travel/',
'http://smotri.com/video/list/hobby/',
'http://smotri.com/video/list/gaming/',
'http://smotri.com/video/list/mult/',
'http://smotri.com/video/list/erotic/',
'http://smotri.com/video/list/auto/',
'http://smotri.com/video/list/humour/',
'http://smotri.com/video/list/film/'
);
$mh = curl_multi_init();
$s = 0;
$f = 10;
while($s <= $f)
{
$ch = curl_init();
if($s%2)
{
$curlsettings = array(
CURLOPT_URL => $l[$s],
CURLOPT_TIMEOUT_MS => 3000,
CURLOPT_RETURNTRANSFER => 1,
);
}
else
{
$curlsettings = array(
CURLOPT_URL => $l[$s],
CURLOPT_TIMEOUT_MS => 4000,
CURLOPT_RETURNTRANSFER => 1,
);
}
curl_setopt_array($ch, $curlsettings);
curl_multi_add_handle($mh,$ch);
$s++;
}
$active = null;
do
{
$mrc = curl_multi_exec($mh,$active);
curl_multi_select($mh);
while($info = curl_multi_info_read($mh))
{
echo '<pre>';
//var_dump($info);
if($info['result'] === 0)
{
echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' success<br>';
}
else
{
echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' failed<br>';
}
}
} while ($active > 0);
curl_multi_close($mh);
Hope that helps. For testing just adjust CURLOPT_TIMEOUT_MS to your internet connection. I made it so it alternates between 3000 and 4000 milliseconds as 3000 will fail and 4000 usually succeeds.
Update
After going through the PHP and libcurl docs I have found how curl_multi_exec works (in libcurl it's curl_multi_perform). When first called, it starts handling transfers for all the added handles (added beforehand via curl_multi_add_handle).
The number it assigns to $active is the number of transfers still running. So if it's less than the total number of handles you have, then you know one or more transfers are complete. So curl_multi_exec acts as a kind of progress indicator as well.
Because all transfers are handled in a non-blocking fashion (several can finish at the same time), one iteration of the while loop that contains curl_multi_exec does not correspond to one completed URL request.
All data is stored in a queue, so as soon as one or more transfers are complete you can call curl_multi_info_read to fetch this data.
In my original answer I had curl_multi_info_read in a while loop. This loop keeps iterating until curl_multi_info_read finds no remaining data in the queue, after which the outer while loop moves on to the next iteration if $active != 0 (meaning curl_multi_exec reported transfers still not complete).
To summarize, the outer loop keeps iterating when there are still transfers not completed and the inner loop iterates only when there's data from a completed transfer.
The PHP documentation is pretty bad for curl multi functions so I hope this cleared a few things up. Below is an alternative way to do the same thing.
do
{
curl_multi_exec($mh,$active);
} while ($active > 0);
// while($info = curl_multi_info_read($mh)) would work also here
for($i = 0; $i <= $f; $i++){
$info = curl_multi_info_read($mh);
if($info['result'] === 0)
{
echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' success<br>';
}
else
{
echo curl_getinfo($info['handle'],CURLINFO_EFFECTIVE_URL) . ' failed<br>';
}
}
From this information you can also see curl_multi_select is not needed as you don't want something that blocks until there is activity.
With the code you provided in your question, it only seemed like cURL wasn't proceeding after a few failed transfers; there was actually still data queued in the buffer. Your code just wasn't calling curl_multi_info_read enough times. The reason all the successful transfers were picked up by your code is that PHP runs on a single thread, so the script hung waiting for the requests. The timeouts for the failed requests didn't make PHP hang/wait that long, so the while loop ran fewer iterations than there were queued results.
(I'm scraping this stuff with the permission of the website in question, by the way).
It's a pretty simple web scraper. It was working fine when I was loading all the links by hand, but when I tried to load them in via JSON and variables (so I can do lots of scraping with the one script and make the process more modular by just adding more links to the JSON), it runs in an infinite loop.
(Page has been loading for about 15 minutes now)
Here is my JSON. Only one store is in there for testing purposes but there is going to be about 15 more.
[
{
"store":"Incu Men",
"cat":"Accessories",
"general_cat":"Accessories",
"spec_cat":"accessories",
"url":"http://www.incuclothing.com/shop-men/accessories/",
"baseurl":"http://www.incuclothing.com",
"next_select":"a.next",
"prod_name_select":".infobox .fn",
"label_name_select":".infobox .brand",
"desc_select":".infobox .description",
"price_select":"#price",
"mainImg_select":"",
"more_imgs":".product-images",
"product_url":".hproduct .photo-link"
}
]
Here is the PHP scraper code:
<?php
//Set infinite time limit
set_time_limit (0);
// Include simple html dom
include('simple_html_dom.php');
// Defining the basic cURL function
function curl($url) {
$ch = curl_init();
// Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url);
// Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Setting cURL's option to return the webpage data
$data = curl_exec($ch);
// Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch);
// Closing cURL
return $data;
// Returning the data from the function
}
function getLinks($catURL, $prodURL, $baseURL, $next_select) {
$urls = array();
while($catURL) {
echo "Indexing: $url" . PHP_EOL;
$html = str_get_html(curl($catURL));
foreach ($html->find($prodURL) as $el) {
$urls[] = $baseURL . $el->href;
}
$next = $html->find($next_select, 0);
$url = $next ? $baseURL . $next->href : null;
echo "Results: $next" . PHP_EOL;
}
return $urls;
}
$string = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string,true);
foreach ($json_array as $value){
$baseURL = $value['baseurl'];
$catURL = $value['url'];
$store = $value['store'];
$general_cat = $value['general_cat'];
$spec_cat = $value['spec_cat'];
$next_select = $value['next_select'];
$prod_name = $value['prod_name_select'];
$label_name = $value['label_name_select'];
$description = $value['desc_select'];
$price = $value['price_select'];
$prodURL = $value['product_url'];
if (!is_null($value['mainImg_select'])){
$mainImg = $value['mainImg_select'];
}
$more_imgs = $value['more_imgs'];
$allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);
}
?>
Any ideas why the script would be running infinitely and not returning anything/stopping/printing anything to screen? I'm just gonna let it run until it stops. When I was doing this by hand it would only take a minute or so, sometimes less, so I'm sure it's a problem with my variables/JSON, but I can't for the life of me see where the issue lies.
Can anyone take a quick look and point me in the right direction?
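There is a problem with your while($catURL) loop: inside it you assign the next page to $url, but the loop condition tests $catURL, which never changes, so the loop never ends (it also echoes $url before it is set). Assign the next page URL to $catURL instead.
Moreover, you can force output to be sent to the browser with the flush() function.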
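To illustrate the loop fix, here is a minimal sketch, not a drop-in replacement for getLinks(). It assumes a hypothetical $fetchPage callback that returns the product links on a page plus the next page URL (or null), standing in for the curl() and simple_html_dom calls in the question. The $maxPages cap is also my own addition, as a safety net against pagination that never ends.

```php
<?php
// $fetchPage is a hypothetical callback: given a page URL it returns
// array('links' => array(...), 'next' => next page URL or null).
function getAllLinks(string $startUrl, callable $fetchPage, int $maxPages = 100): array
{
    $urls = array();
    $catURL = $startUrl;
    $pages = 0;

    while ($catURL !== null && $pages < $maxPages) {
        $page = $fetchPage($catURL);
        foreach ($page['links'] as $link) {
            $urls[] = $link;
        }
        // Advance the same variable that the loop condition tests.
        $catURL = $page['next'];
        $pages++;
    }

    return $urls;
}
```

The important line is the reassignment of $catURL: once the last page reports no next link, the condition becomes false and the loop stops.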
I am connecting to an unreliable API via file_get_contents. Since it's unreliable, I decided to put the api call into a while loop thusly:
$resultJSON = FALSE;
while(!$resultJSON) {
$resultJSON = file_get_contents($apiURL);
set_time_limit(10);
}
Putting it another way: Say the API fails twice before succeeding on the 3rd try. Have I sent 3 requests, or have I sent however many hundreds of requests as will fit into that 3 second window?
file_get_contents(), like basically all functions in PHP, is a blocking call.
Yes, it is a blocking function. You should also check to see if the value is specifically "false". (Note that === is used, not ==.) Lastly, you want to sleep for 10 seconds. set_time_limit() is used to set the max execution time before it is automatically killed.
set_time_limit(300); // Run for up to 5 minutes.
$resultJSON = false;
while ($resultJSON === false)
{
    $resultJSON = file_get_contents($apiURL);
    if ($resultJSON === false) {
        sleep(10); // wait before retrying, only after a failure
    }
}
Expanding on @Sammitch's suggestion to use cURL instead of file_get_contents():
<?php
$apiURL = 'http://stackoverflow.com/';
$curlh = curl_init($apiURL);
// Use === not ==
// if ($curlh === FALSE) handle error;
curl_setopt($curlh, CURLOPT_FOLLOWLOCATION, TRUE); // maybe, up to you
curl_setopt($curlh, CURLOPT_HEADER, FALSE); // or TRUE, according to your needs
curl_setopt($curlh, CURLOPT_RETURNTRANSFER, TRUE);
// set your timeout in seconds here
curl_setopt($curlh, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curlh, CURLOPT_TIMEOUT, 30);
$resultJSON = curl_exec($curlh);
curl_close($curlh);
// if ($resultJSON === FALSE) handle error;
echo "$resultJSON\n"; // Now process $resultJSON
?>
There are a lot more curl_setopt options. You should check them out.
Of course, this assumes you have cURL available.
I am not aware of any function in PHP that does not "block". As an alternative, and if your server permits such things, you can:
Use pcntl_fork() and do other stuff in your script while waiting for the API call to go through.
Use exec() to call another script in the background [using &] to do the API call for you if pcntl_fork() is unavailable.
However, if you literally cannot do anything else in your script without a successful call to that API then it doesn't really matter if the call 'blocks' or not. What you should really be concerned about is spending so much time waiting for this API that you exceed the configured max_execution_time and your script is aborted in the middle without being properly completed.
$max_calls = 5;
for( $i=1; $i<=$max_calls; $i++ ) {
$resultJSON = file_get_contents($apiURL);
if( $resultJSON !== false ) {
break;
} else if( $i === $max_calls ) { // compare, don't assign
throw new Exception("Could not reach API within $max_calls requests.");
}
usleep(250000); //wait 250ms between attempts
}
It's worth noting that file_get_contents() has a default timeout of 60 seconds so you're really in danger of the script being killed. Give serious consideration to using cURL instead since you can set much more reasonable timeout values.
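The bounded-retry idea can also be factored into a small helper. This is only a sketch: the retry() name, its parameters and the use of RuntimeException are my assumptions, and the actual request is passed in as a callable that returns false on failure (as file_get_contents() and curl_exec() do).

```php
<?php
// Hypothetical helper: call $attempt up to $maxCalls times, pausing between
// failed attempts, and return the first result that is not false.
function retry(callable $attempt, int $maxCalls = 5, int $delayMicroseconds = 250000)
{
    for ($i = 1; $i <= $maxCalls; $i++) {
        $result = $attempt();
        if ($result !== false) {
            return $result;
        }
        // Pause only if another attempt will follow.
        if ($i < $maxCalls) {
            usleep($delayMicroseconds);
        }
    }

    throw new RuntimeException("No success within $maxCalls attempts.");
}
```

It could be called as retry(function () use ($apiURL) { return @file_get_contents($apiURL); }), which sends one request per attempt and never more than $maxCalls in total.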
I have an array containing the contents of a MySQL table. I need to put each of these contents into curl_multi handles so that I can execute them all simultaneously.
Here is the code for the array, in case it helps:
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error());
while($resultSet = mysql_fetch_array($SQL)){
$urls[] = $resultSet;
}
So I need to be able to send data to each URL at the same time. I don't need to get any data back, and in fact I'll be having them time out after two seconds. It only needs to send the data and then close.
My code prior to this, was executing them one at a time. here is that code:
$SQL = mysql_query("SELECT url FROM shells") or die(mysql_error());
while($resultSet = mysql_fetch_array($SQL)){
    $ch = curl_init($resultSet['url'] . $fullcurl); //load the urls and send GET data
    curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
    curl_exec($ch);
    curl_close($ch);
}
So my question is: How can I load the contents of the array into curl_multi_handle, execute it, and then remove each handle and close the curl_multi_handle?
You still call curl_init and curl_setopt. Then you load it into a multi_handle, and keep calling execute until it's done. This is based on the documentation at curl_multi_init. Since you're timing out in two seconds, and not processing responses, I think you can just sleep for two seconds at a time. curl_multi_select might be better if you actually need to process the responses.
$SQL = mysql_query("SELECT url FROM shells") ;
$mh = curl_multi_init();
$handles = array();
while($resultSet = mysql_fetch_array($SQL)){
//load the urls and send GET data
$ch = curl_init($resultSet['url'] . $fullcurl);
//Only load it for two seconds (Long enough to send the data)
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_multi_add_handle($mh, $ch);
$handles[] = $ch;
}
// Create a status variable so we know when exec is done.
$running = null;
//execute the handles
do {
// Call exec. This call is non-blocking: it returns without waiting for the transfers to finish.
curl_multi_exec($mh,$running);
// Sleep while it's executing. You could do other work here, if you have any.
sleep(2);
// Keep going until it's done.
} while ($running > 0);
// For loop to remove (close) the regular handles.
foreach($handles as $ch)
{
// Remove the current array handle.
curl_multi_remove_handle($mh, $ch);
}
// Close the multi handle
curl_multi_close($mh);
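As a rough sketch of the curl_multi_select variant mentioned above (this is not the code from the answer): the hypothetical sendAll() helper waits for socket activity instead of sleeping, drains curl_multi_info_read() in an inner loop, and returns the cURL result code for each URL. The function name, the return shape and the 100 ms fallback pause are assumptions made for illustration.

```php
<?php
// Hypothetical helper: run all URLs in parallel and return
// array(url => cURL result code), where CURLE_OK (0) means success.
function sendAll(array $urls, int $timeoutSeconds = 2): array
{
    $urls = array_values($urls);
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep bodies out of the page output
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    $results = array();
    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);

        // Drain every finished transfer that is queued right now.
        while ($info = curl_multi_info_read($mh)) {
            $index = array_search($info['handle'], $handles, true);
            $results[$urls[$index]] = $info['result'];
        }

        // Wait for activity; -1 means there was nothing to wait on, so pause briefly.
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

Because the inner loop runs after every curl_multi_exec() call, failed and successful transfers are both reported, and the script waits no longer than the slowest transfer's timeout.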
If I were you, I would write a mysql class and a curl class.
That approach works very well.
First I would create a method which returns all the URLs from a passed MySQL result.
Something like:
public function getUrls($mysql_fetch_array)
{
    $urls = array();
    foreach($mysql_fetch_array as $result)
    {
        $urls[] = $result["url"];
    }
    return $urls; // without this the method returns nothing
}
then you could write a method like curlSend($url,$param)
//remember you have to edit i dont know your full code so its just
// a way you could do it
public function curlSend($url,$param="")
{
$ch = curl_init($url . $param); //load the url and send GET data
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
curl_exec($ch);
curl_close($ch);
}
public function send()
{
$urls = $this->getUrls($this->mysql->result($sql));
foreach($urls as $url)
{
$this->curlSend($url);
}
}
Now this is how you could do it.