I need to reduce the load time of my script. It uses cURL and Simple HTML DOM.
This is my script; I need help :(
It takes about 2 minutes, and I need to parse many different pages!
require_once("simple_html_dom.php");

function curl($page) {
    ob_start();
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "URL"); // placeholder URL
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, "POSTFIELDS"); // placeholder POST body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    ob_end_clean();
    return $result;
}
$start = microtime(true);
set_time_limit(0);
$text = "here text of last page";
$i = 0;
while (strpos(str_get_html(curl($i)), $text) == null) {
    $html = str_get_html(curl($i));
    foreach ($html->find('div#box-container-inner div.box') as $e) {
        // PRINT etc... (test output only)
    }
    echo "parsed page " . ($i + 1) . "<br>";
    $i++;
}
$time_elapsed_secs = (microtime(true) - $start) / 60; // elapsed time in minutes
echo $time_elapsed_secs;
You appear to be running cURL twice in each loop iteration (once to evaluate the while-loop condition, once to set $html) and converting the resulting string into an object both times. That's four potentially intensive operations per iteration that you can knock down to two.
Instead, assign $html the result of str_get_html(curl($i)) inside the while-loop condition itself:
while(strpos(($html = str_get_html(curl($i))), $text) === false) {
// $html = str_get_html(curl($i));
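For clarity, here is how the whole loop might look with that change applied; this is just a sketch, reusing the question's curl() helper and $text sentinel:

$i = 0;
while (strpos(($html = str_get_html(curl($i))), $text) === false) {
    // $html was already populated by the condition above, so cURL runs once per page
    foreach ($html->find('div#box-container-inner div.box') as $e) {
        // process each box here
    }
    echo "parsed page " . ($i + 1) . "<br>";
    $i++;
}

Note the strict === false comparison: strpos() returns 0 when the needle sits at position 0, and 0 == null would end the loop too early.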
Just add the cURL timeout option, set to 60 seconds or a value of your choice:
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
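In the question's curl() helper that would look like the sketch below; CURLOPT_CONNECTTIMEOUT is added here as a common companion, not something the answer mentioned:

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // give up if connecting takes more than 10s
curl_setopt($ch, CURLOPT_TIMEOUT, 60);        // give up if the whole transfer takes more than 60s

This keeps one slow page from stalling the whole run.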
I am trying to skip an entry when its innertext is empty, but it inserts a default value instead.
This is my code:
if (strip_tags($result[$c]->innertext) == '') {
$c++;
continue;
}
This is the output:
Thanks
EDIT2: I ran the var_dump
var_dump($result[$c]->innertext)
and I got this:
How can I fix it, please?
EDIT3: This is my code. I extract the team names and the results this way, but the last part does not work well when there are postponed matches.
<?php
include('../simple_html_dom.php');
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to connect
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    return @curl_exec($ch); // @ suppresses warnings on failure
}
$response=getHTML("https://www.betexplorer.com/soccer/japan/j3-league/results/",10);
$html = str_get_html($response);
$titles = $html->find("a[class=in-match]"); // name match
$result = $html->find("td[class=h-text-center]/a"); // result match
$c=0; $b=0; $o=0; $z=0; $h=0; // set counters
foreach ($titles as $match) { // get all data
    $match_status = $result[$h++];
    if (strip_tags($result[$c]->innertext) == 'POSTP.') { // bypass postponed match, but it doesn't work anymore
        $c++;
        continue;
    }
    list($num1, $num2) = explode(':', $result[$c++]->innertext); // <- explode
    $num1 = intval($num1);
    $num2 = intval($num2);
    $num3 = ($num1 + $num2);
    $risultato = ($num1 . '-' . $num2);
    list($home, $away) = explode(' - ', $titles[$z++]->innertext); // <- explode
    $home = strip_tags($home);
    $away = strip_tags($away);
    $matchunit = $home . ' - ' . $away;
    echo "<tr><td class='rtitle'>" .
         "<td> " . $matchunit . "</td> / " . // name match
         "<td class='first-cell'>" . $risultato . "</td> " .
         "</td></tr><br/>";
} // close foreach
?>
When you scrape a website's content, you will always be dependent on any changes made to its markup in the future.
That said, I would use PHP's native libxml-based DOM extension, by doing the following:
<?php
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to connect
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    return @curl_exec($ch); // @ suppresses warnings on failure
}
$response=getHTML("https://www.betexplorer.com/soccer/japan/j3-league/results/",10);
// "Create" the new DOM document
$doc = new DomDocument();
// Load HTML from a string, and disable xml error handling
$doc->loadHTML($response, LIBXML_NOERROR);
// Creates a new DOMXPath object
$xpath = new DomXpath($doc);
// Evaluates the given XPath expression and get all tr's without first line from table main
$row = $xpath->query('//table[@class="table-main js-tablebanner-t js-tablebanner-ntb"]/tr[position()>1]');
echo '<table>';
// Parse the values in row
foreach ($row as $tr) {
    // Only get the first 2 td's
    $col = $xpath->query('td[position()<3]', $tr);
    // Do not show POSTP and Round values
    if (!str_contains($tr->nodeValue, 'POSTP') && !str_contains($tr->nodeValue, 'Round')) {
        echo '<tr><td>'.$col->item(0)->nodeValue.'</td><td>'.$col->item(1)->nodeValue.'</td></tr>';
    }
}
echo '</table>';
You obtain:
<tr><td>Nagano - Tegevajaro Miyazaki</td><td>3:2</td></tr>
<tr><td>YSCC - Toyama</td><td>1:2</td></tr>
...
This question already has answers here: How do you parse and process HTML/XML in PHP? (31 answers)
Closed 3 years ago.
<?php
for ($x = 0; $x <= 25; $x++) {
    $ch = curl_init("https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
    //curl_setopt($ch, CURLOPT_POST, true);
    //curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    //curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
    $trustpilot = curl_exec($ch);
    // Check if any error occurred
    if (curl_errno($ch)) {
        die('Fatal error occurred');
    }
}
?>
This code will fetch all 25 pages of reviews for example.com; what I then want to do is put all of the results into a JSON array or something.
I attempted the code below in order to maybe retrieve all of the names:
<?php
$trustpilot = preg_replace('/\s+/', '', $trustpilot); // strip all whitespace
$first = explode('"name":"', $trustpilot);
$second = explode('"', $first[1]);
$result = preg_replace('/[^a-zA-Z0-9-.*_]/', '', $second[0]); // drop special characters
?>
This is clearly a lot harder than I anticipated. Does anyone know how I could get all of the reviews into JSON (or something similar) for however many pages I choose? In this case, 25 pages' worth.
Thanks!
Do not parse HTML with regex.
Use DOMDocument and DOMXPath to parse it instead. Also, you create a new curl handle for each page but never close them, which is a resource/memory leak in your code; it is also a waste of CPU, because you could keep re-using the same curl handle over and over instead of creating a new one for each page. Protip: this HTML compresses rather well, so you should use CURLOPT_ENCODING to download the pages compressed. For example:
<?php
declare(strict_types = 1);
header("Content-Type: text/plain;charset=utf-8");
$ch = curl_init();
curl_setopt($ch, CURLOPT_ENCODING, ''); // enables compression
$reviews = [];
for ($x = 0; $x <= 25; $x++) {
    curl_setopt($ch, CURLOPT_URL, "https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
    // curl_setopt($ch, CURLOPT_POST, true);
    // curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
    $trustpilot = curl_exec($ch);
    // Check if any error occurred
    if (curl_errno($ch)) {
        die('fatal error: curl_exec failed, ' . curl_errno($ch) . ": " . curl_error($ch));
    }
    $domd = @DOMDocument::loadHTML($trustpilot);
    $xp = new DOMXPath($domd);
    foreach ($xp->query("//article[@class='review-card']") as $review) {
        $id = $review->getAttribute("id");
        $reviewer = $xp->query(".//*[@class='content-section__consumer-info']", $review)->item(0)->textContent;
        $stars = $xp->query('.//div[contains(@class,"star-item")]', $review)->length;
        $title = $xp->query('.//*[@class="review-info__body__title"]', $review)->item(0)->textContent;
        $text = $xp->query('.//*[@class="review-info__body__text"]', $review)->item(0)->textContent;
        $reviews[$id] = array(
            'reviewer' => mytrim($reviewer),
            'stars' => $stars,
            'title' => mytrim($title),
            'text' => mytrim($text)
        );
    }
}
curl_close($ch);
echo json_encode($reviews, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | (defined("JSON_UNESCAPED_LINE_TERMINATORS") ? JSON_UNESCAPED_LINE_TERMINATORS : 0) | JSON_NUMERIC_CHECK);

function mytrim(string $text): string
{
    // collapse runs of whitespace into a single space
    return preg_replace("/\s+/", " ", trim($text));
}
output:
{
    "4d6bbf8a0000640002080bc2": {
        "reviewer": "Clement Skau Århus, DK, 3 reviews",
        "stars": 5,
        "title": "Godt fundet på!",
        "text": "Det er rigtig fint gjort at lave et example domain. :)"
    }
}
That's because there is only one review for the URL you listed, and 4d6bbf8a0000640002080bc2 is the website's internal ID (probably a SQL DB ID) for that review.
I'm making a scoreboard and using the Steam API to retrieve avatars for users. At first I was using file_get_contents, but it was so slow! So someone suggested I use cURL.
Old method
$url = 'http://www.com';
$content = file_get_contents($url);
$json = json_decode($content, true);
I then used a foreach loop to grab the items I wanted from the data.
foreach ($json['response']['players'] as $item) {
}
New cURL code:
$url = 'www.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
echo $output = curl_exec($ch);
curl_close($ch);
$json = json_decode($output, true);
I get pretty much the same result from the JSON method, and it is a little faster, but it is still extremely slow. Is there any way to increase the speed of this? Can I load the table first and then load the avatars as they become available?
Scoreboard
http://fyre.site.nfoservers.com/index.php
Consider using for loops, since those can speed things up. If you are talking about the load time (the time until the page displays), consider using output buffering, as in the code below. Unset arrays or values that you don't need anymore. Note that the Steam API accepts 100 IDs at once, so the friends list is split into chunks of 100. The script pushes out each chunk's information as soon as it is done, so the page does not wait for everything to finish. Try it out, I guess.
$totalfriends = count($friends);
$chunkedfriends = array_chunk($friends, 100);
$chunks = ceil($totalfriends / 100);
if (ob_get_length() > 0) {
    ob_end_flush();
    ob_implicit_flush();
}
for ($i = 0; $i < $chunks; $i++) {
    $url = "https://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0001/?key=" . $steamkey . "&steamids=" . implode(',', $chunkedfriends[$i]);
    $friendscountchunk = count($chunkedfriends[$i]);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_PIPEWAIT, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $urlresult = curl_exec($ch);
    curl_close($ch);
    $json_decoded = json_decode($urlresult);
    if (ob_get_length() > 0) {
        ob_end_flush();
        ob_implicit_flush();
    }
    for ($x = 0; $x < $friendscountchunk; $x++) {
        ?>
        <li class="friendsli"><a href="steamuser.php?id=<?=$json_decoded->response->players->player[$x]->steamid?>">
        <img src='<?=$json_decoded->response->players->player[$x]->avatar?>'/><p class="friendname"> <?=$json_decoded->response->players->player[$x]->personaname?> </p>
        </a></li> <?php
    }
}
unset($friends); unset($player); unset($json_decoded);
I don't think this is the best script or method, but it will help for sure. You cannot speed up an external API, but you can improve and adapt your code.
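If the remaining bottleneck is that the chunked requests run one after another, one option (not part of the answer above) is PHP's curl_multi API, which performs the transfers in parallel. A minimal sketch, assuming $urls is an array of chunk URLs built the same way as above:

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
// Drive all transfers concurrently until every one has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $ch) {
    $json_decoded = json_decode(curl_multi_getcontent($ch));
    // ... render this chunk's players as in the inner loop above
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

The total wait then approaches the slowest single request rather than the sum of all of them.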
I'm using the following PHP script to get search results from Google.
include("simple_html_dom.php");
include("random-user-agent.php");
$query = 'facebook';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.google.com/search?q='.$query.'');
@curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT,random_user_agent());
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
$i = 0;
foreach ($html->find('li[class=g]') as $element) {
    foreach ($element->find('h3') as $item) {
        $title[$i] = $item->plaintext;
    }
    $i++;
}
print_r($title);
When this script runs in a cronjob (with a 5-second sleep) I receive a warning from Google and have to fill in a captcha (obviously). I always thought that using cURL and a random user agent could avoid this. What is the correct solution?
A better way to avoid the captcha is to sleep a randomized 3-6 seconds between requests.
The best solution is to use proxies.
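A minimal sketch of both suggestions applied to the script above; the proxy address is a placeholder you would replace with a real one:

// randomized pause between requests, 3 to 6 seconds
sleep(random_int(3, 6));

// route the request through a proxy (placeholder address)
curl_setopt($curl, CURLOPT_PROXY, 'http://127.0.0.1:8080');

Rotating through a pool of such proxies spreads the requests across several IPs, which is what actually defeats the rate limiting.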
Here is my code
$url = "partial_response.php";
$sac_curl = curl_init();
curl_setopt($sac_curl, CURLOPT_HTTPGET, true);
curl_setopt($sac_curl, CURLOPT_URL, $url);
curl_setopt($sac_curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($sac_curl, CURLOPT_HEADER, false);
curl_setopt($sac_curl, CURLOPT_TIMEOUT, 11);
$resp = curl_exec($sac_curl);
curl_close($sac_curl);
echo $resp;
partial_response.php:
header('Content-type: text/html; charset=utf-8');
echo 'Job waiting ...<br />';
for ($i = 0; $i < 10; $i++) {
    echo $i . '<br/>';
    flush();
    ob_flush();
    sleep(1);
}
echo 'End ...<br/>';
With the above code I am trying to get a partial response from partial_response.php. What I want is for cURL to return just "Job waiting ..." instead of waiting for partial_response.php to complete its loop and return the entire output. But when I reduce CURLOPT_TIMEOUT below 11, I don't get any response at all. Kindly clarify my doubt.
Thanks in advance.
I later realized that cURL could not do what I wanted, so I used stream contexts (see stream_context_get_options) to achieve it: http://www.php.net/manual/en/function.stream-context-get-options.php.
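For reference, a minimal sketch of that stream-based approach; the localhost URL is a placeholder for wherever partial_response.php is hosted:

$context = stream_context_create(['http' => ['timeout' => 11]]);
$fp = fopen('http://localhost/partial_response.php', 'r', false, $context);
if ($fp !== false) {
    // read only the first chunk; the rest of the response is never consumed
    $firstChunk = fread($fp, 64);
    fclose($fp);
    echo $firstChunk; // e.g. "Job waiting ...<br />"
}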
No, I'm afraid not, at least not that I know of. This is simply because PHP is a synchronous language, meaning you cannot "skip" tasks: curl_exec() will always, no matter what, run until the request has completed.
I'm not sure about the timeout, but you can get a partial response using cURL via the CURLOPT_WRITEFUNCTION option:
curl_setopt($ch, CURLOPT_WRITEFUNCTION, $callback);
Here $ch is the cURL handle and $callback is the callback function name. This option streams response data from the remote site to your callback as it arrives. The callback function can look something like this:
$result = '';
$callback = function ($ch, $str) {
    global $result;
    // $str holds the chunk of data streamed back.
    $result .= $str;
    // Here you can inspect the streamed data, via either $result or $str,
    // e.g. look for the "Job waiting" string and terminate the response.
    return strlen($str); // must return the number of bytes handled
};
If not interrupted, $result will contain the full response from the remote site at the end.
So, combining everything, it will look something like this:
$result = '';
$callback = function ($ch, $str) {
    global $result;
    $result .= $str; // accumulate the streamed chunks
    return strlen($str);
};

$url = "partial_response.php";
$sac_curl = curl_init();
curl_setopt($sac_curl, CURLOPT_HTTPGET, true);
curl_setopt($sac_curl, CURLOPT_URL, $url);
curl_setopt($sac_curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($sac_curl, CURLOPT_HEADER, false);
curl_setopt($sac_curl, CURLOPT_TIMEOUT, 11);
curl_setopt($sac_curl, CURLOPT_WRITEFUNCTION, $callback); // note: $sac_curl, not $ch as originally posted
curl_exec($sac_curl); // the response is now in $result
curl_close($sac_curl);
echo $result;
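If you want the transfer to stop as soon as the "Job waiting" marker arrives, the callback can return a byte count that differs from strlen($str); cURL then aborts the transfer with a write error. A sketch of that variation (using a by-reference closure instead of global):

$result = '';
$callback = function ($ch, $str) use (&$result) {
    $result .= $str;
    if (strpos($result, 'Job waiting') !== false) {
        return -1; // any value other than strlen($str) aborts the transfer
    }
    return strlen($str);
};

curl_exec() will then return false with error CURLE_WRITE_ERROR (23); that is expected here, and $result still holds everything received up to the abort.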