Optimize execution time of getting HTTP status code - php

I'm working on a simple app that scans all URLs on a page and displays the HTTP status code of every URL. When there are more than 50 URLs I get the error Fatal error: Maximum execution time of 30 seconds exceeded, so I added this line of code: ini_set('max_execution_time', 300);. That works, but my problem is that I have to wait until it finishes before the whole result is displayed. Is there a way to optimize this, or to display the results while it is still scanning?
Thanks.
CODE
$html = file_get_contents('http://www.example.com');
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . ' - ' . statusCode($url) . '<br />';
}

function statusCode($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request, skip the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the response
    curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpCode;
}

PHP makes HTTP requests synchronously. To achieve your expected result you should make them asynchronously. I suggest using the Guzzle library. A simple async request looks like:
$client = new \GuzzleHttp\Client();
$request = new \GuzzleHttp\Psr7\Request('GET', 'http://example.com');
$promise = $client->sendAsync($request)->then(function ($response) {
    echo 'I completed! ' . $response->getBody();
});
$promise->wait(); // nothing runs until the promise is resolved
So, by making your 50+ HTTP requests asynchronously, you get each result as soon as it finishes instead of waiting for the whole batch.
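If you want all the links checked concurrently, Guzzle's request pool keeps a fixed number of transfers in flight and hands you each status code as it arrives. Below is a minimal sketch, assuming Guzzle is installed via Composer and that $urls is a numerically indexed array of the hrefs collected by the DOM loop in the question; the concurrency of 10 is an arbitrary choice:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

require 'vendor/autoload.php';

$client = new Client(['timeout' => 10]);

// One HEAD request per URL; HEAD transfers headers only, no body.
$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('HEAD', $url);
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 10, // at most 10 requests in flight at once
    'fulfilled' => function ($response, $index) use ($urls) {
        echo $urls[$index] . ' - ' . $response->getStatusCode() . '<br />';
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo $urls[$index] . ' - request failed<br />';
    },
]);

// Block until every request in the pool has settled.
$pool->promise()->wait();

Some servers refuse HEAD requests; swapping 'HEAD' for 'GET' costs bandwidth but works everywhere.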

Yes, because PHP needs to finish the whole process before it can display anything: only after all the processing is done does PHP render the results on the page. As @Romeo Sierra said, if this ran in a PHP shell it would display output as the process goes.
I think you want to display it in a page, right?
You can do this:
Get the data for the loop into the page and build a JavaScript loop from it.
In that loop, make an AJAX call that fetches one status result for the current loop index (a sketch of such an endpoint follows below).
On AJAX success, append the data to a list or table.
Results will display as the process goes along.
Hope this will work.
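For the AJAX call, here is a minimal sketch of the server-side endpoint the JavaScript loop could hit; the file name check_status.php and its url query parameter are assumptions for illustration, not part of the original answer:

<?php
// check_status.php - checks one URL per request, called as check_status.php?url=...
$url = filter_input(INPUT_GET, 'url', FILTER_VALIDATE_URL);
if (!$url) {
    http_response_code(400); // missing or malformed ?url= parameter
    exit;
}
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the response
curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // don't hang on slow hosts
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
header('Content-Type: application/json');
echo json_encode(['url' => $url, 'status' => $code]);

Each AJAX success callback then appends the returned url/status pair to the list or table.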

Related

Identify specific curl multi response

I use curl_multi_exec() to request several websites in parallel. Say, URL1, URL2, and URL3. As soon as one of these websites returns a result, I can process it and then wait for the next response.
Now I need to know, based on the response of the request, which URL the result comes from. I cannot simply check the URL from the response, as there might be redirections. So what is the best way to identify which URL (URL1, URL2, or URL3) the response came from? Can the information from curl_multi_info_read() or curl_getinfo() somehow be used for that? Is there a cURL option that I can set and query for that?
I also tried storing the cURL handles before requesting the URLs and comparing them with curl_multi_info_read($curlMultiHandle)['handle'], but as this is a resource, it is not really comparable.
Any ideas?
It is possible to attach custom data to a handle:
curl_setopt($handle, \CURLOPT_PRIVATE, json_encode(['id' => $query_id]));
and then fetch it back later:
curl_getinfo($handle, \CURLINFO_PRIVATE);
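Putting those two calls together, here is a minimal sketch of a curl_multi loop that tags every handle with its source URL and reads the tag back when a transfer finishes; the example URLs and the JSON payload layout are illustrative:

$urls = ['http://url1.example', 'http://url2.example', 'http://url3.example'];

$mh = curl_multi_init();
foreach ($urls as $id => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // redirects no longer matter
    // Attach the original URL (and any id you like) to the handle itself.
    curl_setopt($ch, CURLOPT_PRIVATE, json_encode(['id' => $id, 'url' => $url]));
    curl_multi_add_handle($mh, $ch);
}

$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
    while ($info = curl_multi_info_read($mh)) {
        $ch = $info['handle'];
        $meta = json_decode(curl_getinfo($ch, CURLINFO_PRIVATE), true);
        echo 'Response came from ' . $meta['url'] . PHP_EOL;
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
} while ($running > 0);
curl_multi_close($mh);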
Suppose you have multiple Image objects for which you need to load the data. You run your requests in parallel and don't know the order of download completion, so you have to somehow identify your concrete Image object when you receive the data. Instead of using URLs (which might change after redirection) as keys in an associative array of Image objects, I recommend the following simple approach.
class ImageLoader
{
    private $mh;
    private $activeHandles = array();
    private $loadingImages = array();

    public function __construct()
    {
        $this->mh = curl_multi_init();
    }

    public function loadImage(Image $image)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $image->getUrl());
        curl_multi_add_handle($this->mh, $ch);
        // ...
        $this->loadingImages[] = $image;
        $this->activeHandles[] = $ch;
    }

    public function retrieveImages()
    {
        // Somewhere you run curl_multi_exec($this->mh, $running).
        // Here you get the results.
        while ($result = curl_multi_info_read($this->mh)) {
            // How to get the data is out of our scope.
            // We are interested in identifying the image object.
            $ch = $result['handle'];
            $idx = array_search($ch, $this->activeHandles);
            $image = $this->loadingImages[$idx];
            if ($success) { // e.g. $result['result'] === CURLE_OK, elided in the original
                // Don't forget to free resources!
                unset($this->activeHandles[$idx]);
                unset($this->loadingImages[$idx]);
                curl_multi_remove_handle($this->mh, $ch);
                // ...
            }
        }
    }
}

PHP Scraper appears to be in an infinite loop

(I'm scraping this stuff with the permission of the website in question, by the way.)
This is a pretty simple web scraper. It was working fine when I was loading all the links by hand, but when I tried to load them via JSON and variables (so I can do lots of scraping with the one script and make the process more modular by just adding more links to the JSON), it runs in an infinite loop.
(The page has been loading for about 15 minutes now.)
Here is my JSON. Only one store is in there for testing purposes but there is going to be about 15 more.
[
    {
        "store": "Incu Men",
        "cat": "Accessories",
        "general_cat": "Accessories",
        "spec_cat": "accessories",
        "url": "http://www.incuclothing.com/shop-men/accessories/",
        "baseurl": "http://www.incuclothing.com",
        "next_select": "a.next",
        "prod_name_select": ".infobox .fn",
        "label_name_select": ".infobox .brand",
        "desc_select": ".infobox .description",
        "price_select": "#price",
        "mainImg_select": "",
        "more_imgs": ".product-images",
        "product_url": ".hproduct .photo-link"
    }
]
Here is the PHP scraper code:
<?php
// Set infinite time limit
set_time_limit(0);
// Include simple html dom
include('simple_html_dom.php');

// Defining the basic cURL function
function curl($url) {
    // Initialising cURL
    $ch = curl_init();
    // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_URL, $url);
    // Setting cURL's option to return the webpage data
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    // Executing the cURL request and assigning the returned data to the $data variable
    $data = curl_exec($ch);
    // Closing cURL
    curl_close($ch);
    // Returning the data from the function
    return $data;
}

function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();
    while ($catURL) {
        echo "Indexing: $url" . PHP_EOL;
        $html = str_get_html(curl($catURL));
        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }
        $next = $html->find($next_select, 0);
        $url = $next ? $baseURL . $next->href : null;
        echo "Results: $next" . PHP_EOL;
    }
    return $urls;
}

$string = file_get_contents("jsonWorkers/incuMens.json");
$json_array = json_decode($string, true);

foreach ($json_array as $value) {
    $baseURL = $value['baseurl'];
    $catURL = $value['url'];
    $store = $value['store'];
    $general_cat = $value['general_cat'];
    $spec_cat = $value['spec_cat'];
    $next_select = $value['next_select'];
    $prod_name = $value['prod_name_select'];
    $label_name = $value['label_name_select'];
    $description = $value['desc_select'];
    $price = $value['price_select'];
    $prodURL = $value['product_url'];
    if (!is_null($value['mainImg_select'])) {
        $mainImg = $value['mainImg_select'];
    }
    $more_imgs = $value['more_imgs'];
    $allLinks = getLinks($catURL, $prodURL, $baseURL, $next_select);
}
?>
Any ideas why the script would be running infinitely and not returning/stopping/printing anything to the screen? I'm just gonna let it run until it stops. When I was doing this by hand it would only take a minute or so, sometimes less, so I'm sure the problem is with my variables/JSON, but I can't for the life of me see where the issue lies.
Can anyone take a quick look and point me in the right direction?
There is a problem with your while($catURL) loop: the condition tests $catURL, but the body assigns the next page URL to $url, so $catURL never changes and the loop never terminates (the echo also reads $url before it is ever set). What do you want it to do?
Moreover, you can force information to display in your browser while the script runs with the flush() command.
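Here is a corrected sketch of getLinks() based on that diagnosis; the only changes are that the echo uses the defined variable and the next-page URL is assigned back to $catURL, the variable the while() actually tests:

function getLinks($catURL, $prodURL, $baseURL, $next_select) {
    $urls = array();
    while ($catURL) {
        echo "Indexing: $catURL" . PHP_EOL;
        $html = str_get_html(curl($catURL));
        foreach ($html->find($prodURL) as $el) {
            $urls[] = $baseURL . $el->href;
        }
        $next = $html->find($next_select, 0);
        // Assign to the SAME variable the loop condition tests,
        // so the loop ends when there is no "next" link.
        $catURL = $next ? $baseURL . $next->href : null;
    }
    return $urls;
}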

fetching content from a webpage using curl

First of all, have a look here:
www.zedge.net/txts/4519/
This page has many text messages. I want my script to open each message and download it, but I am having some problems.
This is my simple script to open the page:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
// RETURNTRANSFER must be set before curl_exec(), or the page is
// echoed instead of being returned into $contents
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$contents = curl_exec($ch);
curl_close($ch);
?>
The page downloads fine, but how would I open every text message page inside this page, one by one, and save its content to a text file?
I know how to save the content of a webpage to a text file using cURL, but in this case there are many different pages inside the page I've downloaded. How do I open them one by one separately?
I have this idea but don't know if it will work:
Download this page,
www.zedge.net/txts/4519
look for all the links to text message pages inside it and save each link to a text file (one per line), then run another cURL session, read the links one by one, open each, copy the content from the particular DIV, and save it to a new file.
The algorithm is pretty straightforward:
download www.zedge.net/txts/4519 with curl
parse it with DOM (or alternative) for links
either store them all into text file/database or process them on the fly with "subrequest"
// Load main page
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, "http://www.zedge.net/txts/4519");
$contents = curl_exec($ch);
$dom = new DOMDocument();
$dom->loadHTML($contents);
// Filter all the links
$xPath = new DOMXPath($dom);
$items = $xPath->query('//a[@class="myLink"]');
foreach ($items as $link) {
    $url = $link->getAttribute('href');
    if (strncmp($url, 'http', 4) != 0) {
        // Prepend http:// or something
    }
    // Open sub request for the link we just extracted
    curl_setopt($ch, CURLOPT_URL, $url);
    $subContent = curl_exec($ch);
}
See the documentation and examples for DOMXPath::query; note that DOMNodeList implements Traversable, and therefore you can use foreach.
Tips:
Use the CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE options if the site needs cookies (sketch below)
Use sleep(...) so you don't flood the server
Set the PHP time and memory limits
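A short sketch applying those tips, reusing the $ch handle from the code above; the cookie-jar path and the one-second pause are placeholder values:

set_time_limit(0);               // lift PHP's execution time cap for a long scrape
ini_set('memory_limit', '256M'); // raise the memory ceiling if needed

curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // store cookies on close
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); // resend them on each request

// ...between two sub-requests:
sleep(1); // pause so the remote server isn't flooded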
I used DOM for my part of the code. I called the page I wanted and filtered the data using getElementsByTagName('td').
Here I want the status of my relays from the device page, and I want the updated status each time. For that I used the code below.
$keywords = array();
$domain = array('http://USERNAME:PASSWORD@URL/index.htm'); // HTTP basic auth in the URL
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
foreach ($domain as $key => $value) {
    @$doc->loadHTMLFile($value); // @ suppresses warnings from malformed HTML
    //$anchor_tags = $doc->getElementsByTagName('table');
    //$anchor_tags = $doc->getElementsByTagName('tr');
    $anchor_tags = $doc->getElementsByTagName('td');
    foreach ($anchor_tags as $tag) {
        $keywords[] = strtolower($tag->nodeValue);
        //echo $keywords[0];
    }
}
Then I get the relay names and statuses I want in the $keywords[] array.
(A screenshot of the output accompanied the original answer.)
If you want to read all the messages on the main page, first collect the links to the separate message pages, then run each of them through this same process.

Timing out a script portion and allowing the rest to continue

I have a widget on my homepage that loads XML data from an external source. I want to time out the XML load after x seconds (lately the other site has been having load issues). Here is the function I have so far; I can't figure out how to make the timer interact with simplexml_load_file().
Am I on the right track? Is there a way to make this work, or is there a better way to do it? If it does time out, I still need the rest of the page to continue loading, so I can't use set_time_limit(), because that would end all script execution, right?
function timer($end) {
    $count = 0;
    while ($end > $count) {
        sleep(1);
        $count++;
    }
    return true;
}
$we = simplexml_load_file('http://forecast.weather.gov/MapClick.php?lat=44.08920&lon=-70.17250&FcstType=xml');
if (timer(3)) return;
So you want to set a timeout for simplexml_load_file(). You can't set one for that function specifically, but you can set it globally (for all socket-based streams) before calling it:
ini_set('default_socket_timeout', 3);
$we = simplexml_load_file($url);
// you can restore the default value after use, if you want
ini_restore('default_socket_timeout');
I would use cURL instead of loading the URL directly...
function getXml($url, $timeout = 0) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => (int) $timeout
    ));
    $xml = curl_exec($ch);
    curl_close($ch); // free the handle whether or not the request succeeded
    if ($xml) {
        return new SimpleXmlElement($xml);
    }
    return null;
}
//Example
$xmlData = getXml('http://yoururl.com', 2); // 2 second timeout
You could first read the content of the URL with a function that gives you more control over timeouts (like fopen, fsockopen, or cURL, whichever works best for you) and then pass the content to simplexml_load_string instead of simplexml_load_file.
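A minimal sketch of that suggestion, reusing the hard cURL timeout from the answer above so a slow remote host can only delay the widget, never the whole page:

$ch = curl_init('http://forecast.weather.gov/MapClick.php?lat=44.08920&lon=-70.17250&FcstType=xml');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2); // give up if no connection within 2s
curl_setopt($ch, CURLOPT_TIMEOUT, 3);        // hard cap on the whole transfer
$body = curl_exec($ch);
curl_close($ch);

$we = ($body !== false) ? simplexml_load_string($body) : false;
if ($we === false) {
    // Timed out or unparsable: render the page without the weather widget.
}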

How would I automate my array to be used with cURL?

I have an array containing the contents of a MySQL table. I need to put each of these rows into cURL multi handles so that I can execute them all simultaneously.
Here is the code for the array, in case it helps:
$SQL = mysql_query("SELECT url FROM urls") or die(mysql_error());
while($resultSet = mysql_fetch_array($SQL)){
$urls[]=$resultSet
}
So I need to be able to send data to each URL at the same time. I don't need to get any data back, and in fact I'll have them time out after two seconds. It only needs to send the data and then close.
My code prior to this was executing them one at a time. Here is that code:
$SQL = mysql_query("SELECT url FROM shells") or die(mysql_error()); while($resultSet = mysql_fetch_array($SQL)){
$ch = curl_init($resultSet['url'] . $fullcurl); //load the urls and send GET data
curl_setopt($ch, CURLOPT_TIMEOUT, 2); //Only load it for two seconds (Long enough to send the data)
curl_exec($ch);
curl_close($ch);
So my question is: How can I load the contents of the array into curl_multi_handle, execute it, and then remove each handle and close the curl_multi_handle?
You still call curl_init and curl_setopt. Then you load the handles into a multi handle and keep calling exec until it's done. This is based on the documentation at curl_multi_init. Since you're timing out after two seconds and not processing responses, I think you can just sleep for two seconds at a time. curl_multi_select might be better if you actually need to process the responses; a variant using it follows the code.
$SQL = mysql_query("SELECT url FROM shells") ;
$mh = curl_multi_init();
$handles = array();
while($resultSet = mysql_fetch_array($SQL)){
//load the urls and send GET data
$ch = curl_init($resultSet['url'] . $fullcurl);
//Only load it for two seconds (Long enough to send the data)
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_multi_add_handle($mh, $ch);
$handles[] = $ch;
}
// Create a status variable so we know when exec is done.
$running = null;
//execute the handles
do {
// Call exec. This call is non-blocking, meaning it works in the background.
curl_multi_exec($mh,$running);
// Sleep while it's executing. You could do other work here, if you have any.
sleep(2);
// Keep going until it's done.
} while ($running > 0);
// For loop to remove (close) the regular handles.
foreach($handles as $ch)
{
// Remove the current array handle.
curl_multi_remove_handle($mh, $ch);
}
// Close the multi handle
curl_multi_close($mh);
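If you do need to process the responses, the curl_multi_select() variant mentioned above replaces the fixed sleep(2) with a wait that returns as soon as any transfer has activity. A sketch of just the exec loop, as a drop-in replacement for the do/while above:

$running = null;
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh, 1.0); // block up to 1s for socket activity
    }
} while ($running > 0);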
If I were you, I would write a MySQL class and a cURL class; it keeps everything tidy.
First I would create a method that returns all the URLs from a passed MySQL result.
Something like:
public function getUrls($mysql_fetch_array)
{
    foreach ($mysql_fetch_array as $result) {
        $urls[] = $result["url"];
    }
    return $urls; // the original snippet forgot to return the collected URLs
}
Then you could write a method like curlSend($url, $param):
// remember, you will have to adapt this - I don't know your full code,
// so it's just one way you could do it
public function curlSend($url, $param = "")
{
    $ch = curl_init($url . $param); // load the url and send GET data
    curl_setopt($ch, CURLOPT_TIMEOUT, 2); // only run it for two seconds (long enough to send the data)
    curl_exec($ch);
    curl_close($ch);
}
public function send()
{
    $urls = $this->getUrls($this->mysql->result($sql));
    foreach ($urls as $url) {
        $this->curlSend($url);
    }
}
Now this is how you could do it.
