PHP - Scrape data of all trustpilot reviews [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
<?php
for ($x = 0; $x <= 25; $x++) {
$ch = curl_init("https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
//curl_setopt($ch, CURLOPT_POST, true);
//curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout in seconds
$trustpilot = curl_exec($ch);
// Check if any error occurred
if(curl_errno($ch))
{
die('Fatal Error Occurred');
}
}
?>
This code fetches 25 pages of reviews for example.com. What I then want to do is put all the results into a JSON array or something similar.
I tried the code below to retrieve all of the names:
<?php
$trustpilot = preg_replace('/\s+/', '', $trustpilot); //This replaces any spaces with no spaces
$first = explode( '"name":"' , $trustpilot );
$second = explode('"' , $first[1] );
$result = preg_replace('/[^a-zA-Z0-9-.*_]/', '', $second[0]); //Don't allow special characters
?>
This is clearly a lot harder than I anticipated. Does anyone know how I could get all of the reviews into JSON (or something similar) for however many pages I choose? In this case I chose 25 pages' worth of reviews.
Thanks!

Do not parse HTML with regex; use DOMDocument and DOMXPath to parse it instead.
Also, you create a new curl handle for each page but never close them, which is a resource/memory leak in your code. It is also a waste of CPU, because you could just keep re-using the same curl handle instead of creating a new one for each page. And a protip: this HTML compresses rather well, so you should use CURLOPT_ENCODING to download the pages compressed,
e.g.:
<?php
declare(strict_types = 1);
header("Content-Type: text/plain;charset=utf-8");
$ch = curl_init();
curl_setopt($ch, CURLOPT_ENCODING, ''); // enables compression
$reviews = [];
for ($x = 0; $x <= 25; $x ++) {
curl_setopt($ch, CURLOPT_URL, "https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
// curl_setopt($ch, CURLOPT_POST, true);
// curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
$trustpilot = curl_exec($ch);
// Check if any error occurred
if (curl_errno($ch)) {
die('fatal error: curl_exec failed, ' . curl_errno($ch) . ": " . curl_error($ch));
}
$domd = new DOMDocument();
@$domd->loadHTML($trustpilot); // @ silences libxml warnings about malformed HTML
$xp = new DOMXPath($domd);
foreach ($xp->query("//article[@class='review-card']") as $review) {
$id = $review->getAttribute("id");
$reviewer = $xp->query(".//*[@class='content-section__consumer-info']", $review)->item(0)->textContent;
$stars = $xp->query('.//div[contains(@class,"star-item")]', $review)->length;
$title = $xp->query('.//*[@class="review-info__body__title"]', $review)->item(0)->textContent;
$text = $xp->query('.//*[@class="review-info__body__text"]', $review)->item(0)->textContent;
$reviews[$id] = array(
'reviewer' => mytrim($reviewer),
'stars' => ($stars),
'title' => mytrim($title),
'text' => mytrim($text)
);
}
}
curl_close($ch);
echo json_encode($reviews, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | (defined("JSON_UNESCAPED_LINE_TERMINATORS") ? JSON_UNESCAPED_LINE_TERMINATORS : 0) | JSON_NUMERIC_CHECK);
function mytrim(string $text): string
{
return preg_replace("/\s+/", " ", trim($text));
}
output:
{
"4d6bbf8a0000640002080bc2": {
"reviewer": "Clement Skau Århus, DK, 3 reviews",
"stars": 5,
"title": "Godt fundet på!",
"text": "Det er rigtig fint gjort at lave et example domain. :)"
}
}
The output contains a single entry because there is only 1 review at the URL you listed. 4d6bbf8a0000640002080bc2 is the website's internal id (probably a SQL database id) for that review.
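To try the DOMXPath approach without hitting the live site, here is a minimal sketch that runs the same kind of queries against an inline HTML string. The markup, the ids r1/r2 and the helper name extract_reviews are made up for illustration; only the class names are borrowed from the answer above, and the real page may well have changed since.

```php
<?php
declare(strict_types=1);

// Hypothetical markup, loosely modelled on the class names used in the answer above.
$html = <<<HTML
<html><body>
<article class="review-card" id="r1">
  <div class="content-section__consumer-info"> Alice
     2 reviews</div>
  <h2 class="review-info__body__title">Great</h2>
  <p class="review-info__body__text">Works   well.</p>
</article>
<article class="review-card" id="r2">
  <div class="content-section__consumer-info">Bob</div>
  <h2 class="review-info__body__title">Okay</h2>
  <p class="review-info__body__text">Fine.</p>
</article>
</body></html>
HTML;

// Hypothetical helper: returns one array entry per review card, keyed by its id.
function extract_reviews(string $html): array
{
    $doc = new DOMDocument();
    // libxml may warn about HTML5 tags such as <article>; collect the warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($doc);
    $reviews = [];
    foreach ($xp->query("//article[@class='review-card']") as $review) {
        $field = function (string $class) use ($xp, $review): string {
            $node = $xp->query(".//*[@class='$class']", $review)->item(0);
            // Return an empty string for a missing node instead of dereferencing null.
            return $node === null
                ? ''
                : trim((string) preg_replace('/\s+/', ' ', $node->textContent));
        };
        $reviews[$review->getAttribute('id')] = [
            'reviewer' => $field('content-section__consumer-info'),
            'title'    => $field('review-info__body__title'),
            'text'     => $field('review-info__body__text'),
        ];
    }
    return $reviews;
}

echo json_encode(extract_reviews($html), JSON_PRETTY_PRINT), "\n";
```

Unlike the answer's code, the sketch checks for a missing node before reading textContent, so a card without a title does not abort the whole run.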

Related

How to skip an empty InnerText in an if statement using PHP

I am trying to skip entries whose InnerText is empty, but a default value is output instead.
This is my code:
if (strip_tags($result[$c]->innertext) == '') {
$c++;
continue;
}
This is the output:
Thanks
EDIT2: I did the var_dump
var_dump($result[$c]->innertext)
and I got this:
how can I fix it please?
EDIT3: This is my code. It extracts the team names and the results, but the result extraction does not work properly when there are postponed matches.
<?php
include('../simple_html_dom.php');
function getHTML($url,$timeout)
{
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
return @curl_exec($ch);
}
$response=getHTML("https://www.betexplorer.com/soccer/japan/j3-league/results/",10);
$html = str_get_html($response);
$titles = $html->find("a[class=in-match]"); // name match
$result = $html->find("td[class=h-text-center]/a"); // result match
$c=0; $b=0; $o=0; $z=0; $h=0; // set counters
foreach ($titles as $match) { //get all data
$match_status = $result[$h++];
if (strip_tags($result[$c]->innertext) == 'POSTP.') { //bypass postponed match but it doesn't work anymore
$c++;
continue;
}
list($num1, $num2) = explode(':', $result[$c++]->innertext); // <- explode
$num1 = intval($num1);
$num2 = intval($num2);
$num3 = ($num1 + $num2);
$risultato = ($num1 . '-' . $num2);
list($home, $away) = explode(' - ', $titles[$z++]->innertext); // <- explode
$home = strip_tags($home);
$away = strip_tags($away);
$matchunit = $home . ' - ' . $away;
echo "<tr><td class='rtitle'>".
"<td> ".$matchunit. "</td> / " . // name match
"<td class='first-cell'>" . $risultato . "</td> " .
"</td></tr><br/>";
} //close foreach
?>

When you scrape the content of a website, you will always depend on whatever changes are made to it in the future.
That said, I would use PHP's native libxml DOM extension,
as follows:
<?php
function getHTML($url,$timeout)
{
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
return @curl_exec($ch);
}
$response=getHTML("https://www.betexplorer.com/soccer/japan/j3-league/results/",10);
// "Create" the new DOM document
$doc = new DomDocument();
// Load HTML from a string, and disable xml error handling
$doc->loadHTML($response, LIBXML_NOERROR);
// Creates a new DOMXPath object
$xpath = new DomXpath($doc);
// Evaluates the given XPath expression and get all tr's without first line from table main
$row = $xpath->query('//table[@class="table-main js-tablebanner-t js-tablebanner-ntb"]/tr[position()>1]');
echo '<table>';
// Parse the values in row
foreach ($row as $tr) {
// Only get 2 first td's
$col = $xpath->query('td[position()<3]', $tr);
// Do not show POSTP and Round values
if (!str_contains($tr->nodeValue, 'POSTP') && !str_contains($tr->nodeValue, 'Round')) {
echo '<tr><td>'.$col->item(0)->nodeValue.'</td><td>'.$col->item(1)->nodeValue.'</td></tr>';
}
}
echo '</table>';
You obtain:
<tr><td>Nagano - Tegevajaro Miyazaki</td><td>3:2</td></tr>
<tr><td>YSCC - Toyama</td><td>1:2</td></tr>
...
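As a variation on the filtering step, here is a sketch that accepts a row only when its second cell actually looks like a score, which skips postponed matches and empty cells in one check. It runs on an inline table so it can be tried offline; the table contents, the "Home FC - Away FC" row and the function name played_matches are assumptions, not taken from the real page.

```php
<?php
declare(strict_types=1);

// Hypothetical table, shaped like the rows the answer above iterates over.
$html = <<<HTML
<table class="table-main js-tablebanner-t">
<tr><th colspan="3">1. Round</th></tr>
<tr><td>Nagano - Tegevajaro Miyazaki</td><td>3:2</td><td>-</td></tr>
<tr><td>YSCC - Toyama</td><td>POSTP.</td><td>-</td></tr>
<tr><td>Home FC - Away FC</td><td>1:2</td><td>-</td></tr>
</table>
HTML;

// Hypothetical helper: returns only the rows whose second cell is a plain "n:m" score.
function played_matches(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOERROR);
    $xpath = new DOMXPath($doc);

    $played = [];
    foreach ($xpath->query('//table[contains(@class, "table-main")]//tr') as $tr) {
        $cells = $xpath->query('td', $tr);
        if ($cells->length < 2) {
            continue; // header or "Round" rows have no score cell
        }
        $score = trim($cells->item(1)->textContent);
        if (preg_match('/^(\d+):(\d+)$/', $score, $m) !== 1) {
            continue; // "POSTP.", empty cells, anything that is not a plain score
        }
        $played[] = [
            'match' => trim($cells->item(0)->textContent),
            'home'  => (int) $m[1],
            'away'  => (int) $m[2],
        ];
    }
    return $played;
}

print_r(played_matches($html));
```

The pattern is deliberately strict, so a score with a suffix (extra time, penalties) would also be skipped; loosen it if the site uses such formats.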

Curl and array values in curlopt_url does not work

I have a very weird issue with curl and URLs defined inside an array.
I have an array of URLs and I want to perform an HTTP GET on each of them with curl:
for ($i = 0, $n = count($array_station) ; $i < $n ; $i++)
{
$station= curl_init();
curl_setopt($station, CURLOPT_VERBOSE, true);
curl_setopt($station, CURLOPT_URL, $array_station[$i]);
curl_setopt($station, CURLOPT_RETURNTRANSFER, true);
curl_setopt($station, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($station);
curl_close($station);
}
If I define my $array_station in the way below
$array_station=array("http://www.example.com","http://www.example2.com");
the curl code above works flawlessly. But when my $array_station is built in the way below (I scan a directory searching for a specific filename, then clean up the URL), curl does not work: no error is shown and nothing happens.
$di = new RecursiveDirectoryIterator(__DIR__,RecursiveDirectoryIterator::SKIP_DOTS);
$it = new RecursiveIteratorIterator($di);
$array_station=array();
$i=0;
foreach($it as $file) {
if (pathinfo($file, PATHINFO_FILENAME ) == "db_insert") {
$string = str_replace('/web/htdocs/', 'http://', $file.PHP_EOL);
$string2 = str_replace('/home','', $string);
$array_station[$i]=$string2;
$i++;
}
}
Do you have any ideas? I'm giving up :-(

I'm on mobile right now so I cannot test it, but why are you adding a newline (PHP_EOL) to the URL? Try removing the newline, or trim() the URL at the end.
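A small sketch of that suggestion: the same two replacements as in the question, but with the result passed through trim() so a trailing newline never reaches CURLOPT_URL. The helper name path_to_url and the sample path are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: turn a filesystem path into a URL using the same
// replacements as the question, without leaving a newline at the end.
function path_to_url(string $path): string
{
    $url = str_replace('/web/htdocs/', 'http://', $path);
    $url = str_replace('/home', '', $url);
    return trim($url); // strips stray whitespace, including "\n" or "\r\n"
}

// Sample path with the PHP_EOL that the question's code appends.
$withNewline = '/web/htdocs/www.example.com/home/station1/db_insert.php' . PHP_EOL;
var_dump(path_to_url($withNewline));
```

With the newline still attached, the string handed to curl is not a valid URL, which would explain why the hard-coded array works and the generated one does not.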

Add the lines of code below.
If there is a curl error it will report the error number.
If the request is made, it will show the HTTP request and response headers. The request is in $info and response header is in $head
for ($i = 0, $n = count($array_station) ; $i < $n ; $i++)
{
$station= curl_init();
curl_setopt($station, CURLOPT_VERBOSE, true);
curl_setopt($station, CURLOPT_URL, $array_station[$i]);
curl_setopt($station, CURLOPT_RETURNTRANSFER, true);
curl_setopt($station, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($station, CURLOPT_HEADER, true); // include the response headers in the output
curl_setopt($station, CURLINFO_HEADER_OUT, true); // record the request headers that were sent
$head = $info = ''; // keep both defined even when the request fails
$response = curl_exec($station);
if (curl_errno($station)){
$response .= 'Retrieve Base Page Error: ' . curl_error($station);
}
else {
$skip = intval(curl_getinfo($station, CURLINFO_HEADER_SIZE));
$head = substr($response ,0,$skip);
$response = substr($response ,$skip);
$info = var_export(curl_getinfo($station),true);
}
echo $head;
echo $info;
curl_close($station);
}

curl_exec returns empty string

I'm still a bit new to using curl to pull data and I've recently started using Fiddler to help find what options need to be set.
I'm trying to see if I can pull an image from a site. I first hit a search page - I set the search parameters, then start hitting links in the results. When I attempt to go a link in one of the results for an image, I get an empty string returned from curl_exec().
The weird thing is - at one point, it worked - I got the data back and successfully saved the image locally. But then it stopped, and I have no idea what I was doing to have it working. Naturally, everything works OK in the browser. :(
I'm using Simple HTML DOM to parse through results and cUrl for the actual page requests. curl_error() does not show an error, curl_getinfo() thinks everything is OK too. It's probably something trivial, but I'm not sure how to troubleshoot it beyond where I am.
<?php
include 'includes/simple_html_dom.php';
$url = "http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal/Corrections/InmateInquiry.aspx";
// Get Cookie - ASP.NET_SessionId
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$r = curl_exec($ch);
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $r, $matches);
$cookies = array();
foreach($matches[1] as $item)
{
parse_str($item, $cookie);
$cookies = array_merge($cookies, $cookie);
}
$sessionCookie = "ASP_NET_SessionId=".$cookies['ASP_NET_SessionId'];
// now load up page into Simple HTML DOM and get all inputs - ignore buttons and populate our dates
$startDate = "02%2F01%2F2000";
$endDate = "02%2F07%2F2016";
$getInputs = str_get_html($r);
$inputs = $getInputs->find('input');
$inputs_array = array();
$buttons_array = array();
for ($i=0; $i<count($inputs); $i++)
{
if ($inputs[$i]->type != "submit")
{
$inputs_array[$inputs[$i]->id] = $inputs[$i]->value;
if (stripos($inputs[$i]->id, "FromDate") > 0)
$inputs_array[$inputs[$i]->id] = $startDate;
if (stripos($inputs[$i]->id, "ToDate") > 0)
$inputs_array[$inputs[$i]->id] = $endDate;
}
}
// build up our curl data - includes hidden inputs, our to & from dates, plus the Search button
$curl_data = http_build_query($inputs_array)."&ctl00%24DefaultContent%24uxSearch=Search";
// POST the data, include session cookie
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $curl_data);
curl_setopt($ch, CURLOPT_COOKIE, $sessionCookie);
$response = curl_exec($ch);
// this shows that we can get data
// find the links from the HTML
$htmlDom = str_get_html($response); // load up Simple HTML DOM
// get the table of results
$divTable = $htmlDom->find('div#ctl00_DefaultContent_uxResultsWrapper',0)->find('table',0);
$rows = $divTable->find('tr');
for ($i=1; $i<count($rows);$i++)
{
if ($i>3) break; // limit the length of script for debugging
$link = $rows[$i]->find('td',1)->find('a',0)->href;
// build up query to get inmate details from the link above
$url = "http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal/Corrections/".$link;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_COOKIE, $sessionCookie);
$page = curl_exec($ch);
$pageData = str_get_html($page);
// Now find the Photo, there's a thumb in div.BookingPhotos
// It is linked to a full size image, the link is of the form http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal/GetImage.aspx?ImageKey=17C030IS, but in the href, it has ../GetImage.aspx?ImageKey=xxxx
$photoLink = $pageData->find('div.BookingPhotos',0)->find('a',0)->href;
// get rid of .. and put the base URL on the front
$imgLink = str_replace("..", "http://nwweb.co.bell.tx.us/NewWorld.Aegis.WebPortal", $photoLink);
// now attempt to pull the image
$ch = curl_init($imgLink);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_COOKIE, $sessionCookie);
// here is the PROBLEM - NO DATA RETURNED
$imgData = curl_exec($ch); // I get a header back, but NO data
}
?>

Can PHP cURL retrieve response headers AND body in a single request?

Is there any way to get both headers and body for a cURL request using PHP? I found that this option:
curl_setopt($ch, CURLOPT_HEADER, true);
is going to return the body plus headers, but then I need to parse it to get the body. Is there any way to get both in a more usable (and secure) way?
Note that for "single request" I mean avoiding issuing a HEAD request prior of GET/POST.

One solution to this was posted in the PHP documentation comments: http://www.php.net/manual/en/function.curl-exec.php#80442
Code example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
// ...
$response = curl_exec($ch);
// Then, after your curl_exec call:
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$header = substr($response, 0, $header_size);
$body = substr($response, $header_size);
Warning: As noted in the comments below, this may not be reliable when used with proxy servers or when handling certain types of redirects. @Geoffrey's answer may handle these more reliably.

Many of the other solutions offered in this thread are not doing this correctly.
Splitting on \r\n\r\n is not reliable when CURLOPT_FOLLOWLOCATION is on or when the server responds with a 100 status code (RFC-7231, MDN).
Not all servers are standards compliant: some transmit just a \n for new lines, and a recipient may discard the \r in the line terminator (Q&A).
Detecting the size of the headers via CURLINFO_HEADER_SIZE is also not always reliable, especially when proxies are used (Curl-1204) or in some of the same redirection scenarios.
The most correct method is using CURLOPT_HEADERFUNCTION.
Here is a very clean method of performing this using PHP closures. It also converts all header names to lowercase for consistent handling across servers and HTTP versions.
This version retains duplicated headers.
This complies with RFC822 and RFC2616. Please do not make use of the mb_ (and similar) string functions; doing so is not only incorrect but even a security issue (RFC-7230)!
$ch = curl_init();
$headers = [];
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// this function is called by curl for each header received
curl_setopt($ch, CURLOPT_HEADERFUNCTION,
function($curl, $header) use (&$headers)
{
$len = strlen($header);
$header = explode(':', $header, 2);
if (count($header) < 2) // ignore invalid headers
return $len;
$headers[strtolower(trim($header[0]))][] = trim($header[1]);
return $len;
}
);
$data = curl_exec($ch);
print_r($headers);
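To see what the callback collects without making a request, here is a sketch that moves the same parsing logic into a plain function and feeds it sample header lines. The function name collect_header and the sample lines are hypothetical; the parsing itself mirrors the closure above.

```php
<?php
declare(strict_types=1);

// Same logic as the closure above, pulled out so it can run without a network call.
function collect_header(array &$headers, string $line): int
{
    $len = strlen($line);
    $parts = explode(':', $line, 2);
    if (count($parts) === 2) { // status lines and the blank terminator are skipped
        $headers[strtolower(trim($parts[0]))][] = trim($parts[1]);
    }
    return $len; // curl expects the number of bytes that were handled
}

// Hypothetical header lines, in the form curl hands them to the callback.
$lines = [
    "HTTP/1.1 200 OK\r\n",
    "Content-Type: text/html; charset=utf-8\r\n",
    "Set-Cookie: a=1\r\n",
    "set-cookie: b=2\r\n",
    "\r\n",
];

$headers = [];
foreach ($lines as $line) {
    collect_header($headers, $line);
}
print_r($headers);
```

Both Set-Cookie lines end up under the single key set-cookie, which is the point of lowercasing the names and appending with []. To use it with curl, wrap it in a closure that captures $headers by reference and pass that as CURLOPT_HEADERFUNCTION.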

Curl has a built in option for this, called CURLOPT_HEADERFUNCTION. The value of this option must be the name of a callback function. Curl will pass the header (and the header only!) to this callback function, line-by-line (so the function will be called for each header line, starting from the top of the header section). Your callback function then can do anything with it (and must return the number of bytes of the given line). Here is a tested working code:
function HandleHeaderLine( $curl, $header_line ) {
echo "<br>YEAH: ".$header_line; // or do whatever
return strlen($header_line);
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.google.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, "HandleHeaderLine");
$body = curl_exec($ch);
The above works with everything, different protocols and proxies too, and you don't need to worry about the header size, or set lots of different curl options.
P.S.: To handle the header lines with an object method, do this:
curl_setopt($ch, CURLOPT_HEADERFUNCTION, array($object, 'methodName'))

Is this what you are looking for?
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'));
$response = curl_exec($ch);
list($header, $body) = explode("\r\n\r\n", $response, 2);

If you specifically want the Content-Type, there's a special cURL option to retrieve it:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
$content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

Just set options :
CURLOPT_HEADER, 0
CURLOPT_RETURNTRANSFER, 1
and use curl_getinfo with CURLINFO_HTTP_CODE (or omit the option parameter and you will get an associative array with all the information you want)
More at : http://php.net/manual/fr/function.curl-getinfo.php

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$response = curl_exec($ch); // the response must be fetched before it can be split
$parts = explode("\r\n\r\nHTTP/", $response);
$parts = (count($parts) > 1 ? 'HTTP/' : '').array_pop($parts);
list($headers, $body) = explode("\r\n\r\n", $parts, 2);
Works with HTTP/1.1 100 Continue before other headers.
If you need to work with buggy servers which send only LF instead of CRLF as line breaks, you can use preg_split as follows:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
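$response = curl_exec($ch); // the response must be fetched before it can be split
// No "u" modifier: a body that is not valid UTF-8 (e.g. an image) would make preg_split() fail
$parts = preg_split("#\r?\n\r?\nHTTP/#", $response);
$parts = (count($parts) > 1 ? 'HTTP/' : '').array_pop($parts);
list($headers, $body) = preg_split("#\r?\n\r?\n#", $parts, 2);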
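Here is the same splitting idea as a small function run on a canned response, so the 100 Continue case can be checked offline. The name split_response and the sample response are assumptions. As other answers here point out, this is still a heuristic: a body that itself contains a blank line followed by "HTTP/" would be cut in the wrong place.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns [headers of the final response, body].
function split_response(string $raw): array
{
    // Drop interim header blocks (100 Continue, followed redirects): keep only
    // what follows the last blank line that is directly followed by "HTTP/".
    $blocks = preg_split("#\r?\n\r?\nHTTP/#", $raw);
    $last = (count($blocks) > 1 ? 'HTTP/' : '') . array_pop($blocks);

    // The first blank line of the remaining text separates headers from body.
    $parts = preg_split("#\r?\n\r?\n#", $last, 2);
    return [$parts[0], $parts[1] ?? ''];
}

// Canned response: an interim 100 Continue, then the real response.
$raw = "HTTP/1.1 100 Continue\r\n\r\n"
     . "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n"
     . "hello\r\n\r\nworld";

[$headers, $body] = split_response($raw);
var_dump($headers, $body);
```

Because the second split uses a limit of 2, blank lines inside the body are left alone.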

My way is
$response = curl_exec($ch);
$x = explode("\r\n\r\n", $response, 3);
$header=http_parse_headers($x[0]); // http_parse_headers() is provided by the pecl_http extension
if ($header['Response Code']==100){ //use the other "header"
$header=http_parse_headers($x[1]);
$body=$x[2];
}else{
$body=$x[1];
}
If needed apply a for loop and remove the explode limit.

Here is my contribution to the debate ... This returns a single array with the data separated and the headers listed. This works on the basis that CURL will return a headers chunk [ blank line ] data
curl_setopt($ch, CURLOPT_HEADER, 1); // we need this to get headers back
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, true);
// $output contains the output string
$output = curl_exec($ch);
$lines = explode("\n",$output);
$out = array();
$headers = true;
foreach ($lines as $l){
$l = trim($l);
if ($headers && !empty($l)){
if (strpos($l,'HTTP') !== false){
$p = explode(' ',$l);
$out['Headers']['Status'] = trim($p[1]);
} else {
$p = explode(':',$l,2); // limit of 2 keeps values that contain ':' (e.g. Date) intact
$out['Headers'][$p[0]] = trim($p[1]);
}
} elseif (!empty($l)) {
$out['Data'] = ($out['Data'] ?? '') . $l . "\n"; // append each body line instead of overwriting
}
if (empty($l)){
$headers = false;
}
}

The problem with many answers here is that "\r\n\r\n" can legitimately appear in the body of the html, so you can't be sure that you're splitting headers correctly.
It seems that the only way to store headers separately with one call to curl_exec is to use a callback as is suggested above in https://stackoverflow.com/a/25118032/3326494
And then to (reliably) get just the body of the request, you would need to pass the value of the Content-Length header to substr() as a negative start value.
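A sketch of that last step, assuming the raw response was fetched with CURLOPT_HEADER on and that Content-Length describes exactly the bytes curl returned. That will not hold for chunked responses, or when curl has decompressed the body via CURLOPT_ENCODING. The function name and the sample response are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: take the last $contentLength bytes of the raw response.
function body_by_content_length(string $raw, int $contentLength): string
{
    if ($contentLength <= 0) {
        return ''; // substr($raw, -0) would return the whole string, so guard it
    }
    return substr($raw, -$contentLength);
}

// Sample response whose body contains a blank line, which defeats naive splitting.
$raw = "HTTP/1.1 200 OK\r\nContent-Length: 11\r\n\r\nline1\r\n\r\nok";
var_dump(body_by_content_length($raw, 11));
```

In real use, the length would come from the headers collected by the CURLOPT_HEADERFUNCTION callback.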

Just in case you can't / don't use CURLOPT_HEADERFUNCTION or other solutions;
$nextCheck = function($body) {
return ($body && strpos($body, 'HTTP/') === 0);
};
[$headers, $body] = explode("\r\n\r\n", $result, 2);
if ($nextCheck($body)) {
do {
[$headers, $body] = explode("\r\n\r\n", $body, 2);
} while ($nextCheck($body));
}

A better way is to use the verbose CURL response which can be piped to a temporary stream. Then you can search the response for the header name. This could probably use a few tweaks but it works for me:
class genericCURL {
/**
* NB this is designed for getting data, or for posting JSON data
*/
public function request($url, $method = 'GET', $data = array()) {
$ch = curl_init();
if($method == 'POST') {
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_POSTFIELDS, $string = json_encode($data));
}
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_VERBOSE, true);
//open a temporary stream to capture the curl log, which would normally go to STDERR
$err = fopen("php://temp", "w+");
curl_setopt($ch, CURLOPT_STDERR, $err);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec ($ch);
//rewind the temp stream and put it into a string
rewind($err);
$this->curl_log = stream_get_contents($err);
curl_close($ch);
fclose($err);
return $server_output;
}
/**
* use the curl log to get a header value
*/
public function getReturnHeaderValue($header) {
$log = explode("\n", str_replace("\r\n", "\n", $this->curl_log));
foreach($log as $line) {
//is the requested header there
if(stripos($line, '< ' . $header . ':') !== false) {
$value = trim(substr($line, strlen($header) + 3));
return $value;
}
}
//still here implies not found so return false
return false;
}
}

An improvement on Geoffrey's answer:
I couldn't get the right header length with $headerSize = curl_getinfo($this->curlHandler, CURLINFO_HEADER_SIZE); so I had to calculate the header size on my own.
In addition, there are some changes for better readability.
$headerSize = 0;
curl_setopt_array($this->curlHandler, [
CURLOPT_URL => $yourUrl,
CURLOPT_POST => 0,
CURLOPT_RETURNTRANSFER => 1,
// this function is called by curl for each header received
CURLOPT_HEADERFUNCTION =>
function ($curl, $header) use (&$headers, &$headerSize) {
$lengthCurrentLine = strlen($header);
$headerSize += $lengthCurrentLine;
$header = explode(':', $header, 2);
if (count($header) > 1) { // store only valid headers
$headers[strtolower(trim($header[0]))][] = trim($header[1]);
}
return $lengthCurrentLine;
},
]);
$fullResult = curl_exec($this->curlHandler);
$result = substr($fullResult, $headerSize);

Return response headers with a reference parameter:
<?php
$data=array('device_token'=>'5641c5b10751c49c07ceb4',
'content'=>'测试测试test'
);
$rtn=curl_to_host('POST', 'http://test.com/send_by_device_token', array(), $data, $resp_headers);
echo $rtn;
var_export($resp_headers);
function curl_to_host($method, $url, $headers, $data, &$resp_headers)
{$ch=curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $GLOBALS['POST_TO_HOST.LINE_TIMEOUT']?$GLOBALS['POST_TO_HOST.LINE_TIMEOUT']:5);
curl_setopt($ch, CURLOPT_TIMEOUT, $GLOBALS['POST_TO_HOST.TOTAL_TIMEOUT']?$GLOBALS['POST_TO_HOST.TOTAL_TIMEOUT']:20);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_HEADER, 1);
if ($method=='POST')
{curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
}
foreach ($headers as $k=>$v)
{$headers[$k]=str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', $k)))).': '.$v;
}
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$rtn=curl_exec($ch);
curl_close($ch);
$rtn=explode("\r\n\r\nHTTP/", $rtn, 2); //to deal with "HTTP/1.1 100 Continue\r\n\r\nHTTP/1.1 200 OK...\r\n\r\n..." header
$rtn=(count($rtn)>1 ? 'HTTP/' : '').array_pop($rtn);
list($str_resp_headers, $rtn)=explode("\r\n\r\n", $rtn, 2);
$str_resp_headers=explode("\r\n", $str_resp_headers);
array_shift($str_resp_headers); //get rid of "HTTP/1.1 200 OK"
$resp_headers=array();
foreach ($str_resp_headers as $k=>$v)
{$v=explode(': ', $v, 2);
$resp_headers[$v[0]]=$v[1];
}
return $rtn;
}
?>

Try this if you are using GET:
$curl = curl_init($url);
curl_setopt_array($curl, array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => "GET",
CURLOPT_HTTPHEADER => array(
"Cache-Control: no-cache"
),
));
$response = curl_exec($curl);
curl_close($curl);

If you don't really need to use curl;
$body = file_get_contents('http://example.com');
var_export($http_response_header);
var_export($body);
Which outputs
array (
0 => 'HTTP/1.0 200 OK',
1 => 'Accept-Ranges: bytes',
2 => 'Cache-Control: max-age=604800',
3 => 'Content-Type: text/html',
4 => 'Date: Tue, 24 Feb 2015 20:37:13 GMT',
5 => 'Etag: "359670651"',
6 => 'Expires: Tue, 03 Mar 2015 20:37:13 GMT',
7 => 'Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT',
8 => 'Server: ECS (cpm/F9D5)',
9 => 'X-Cache: HIT',
10 => 'x-ec-custom-error: 1',
11 => 'Content-Length: 1270',
12 => 'Connection: close',
)'<!doctype html>
<html>
<head>
<title>Example Domain</title>...
See http://php.net/manual/en/reserved.variables.httpresponseheader.php
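The $http_response_header array is just a list of raw lines, so you would usually want to turn it into something keyed by header name. A minimal sketch, with a hypothetical function name and a few sample lines copied from the output above; as far as I know the array also contains the header blocks of any redirects that were followed, which is why a new status line resets the result.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: parse an $http_response_header-style list of lines.
function parse_response_headers(array $lines): array
{
    $parsed = ['status' => null, 'headers' => []];
    foreach ($lines as $line) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            // A status line starts a new response (e.g. after a redirect).
            $parsed = [
                'status'  => (int) (explode(' ', $line, 3)[1] ?? 0),
                'headers' => [],
            ];
            continue;
        }
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $parsed['headers'][strtolower(trim($parts[0]))][] = trim($parts[1]);
        }
    }
    return $parsed;
}

// Sample lines in the same shape as the var_export output above.
$sample = [
    'HTTP/1.0 200 OK',
    'Cache-Control: max-age=604800',
    'Content-Type: text/html',
    'Content-Length: 1270',
];
print_r(parse_response_headers($sample));
```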

How to get Google +1 count for current page in PHP?

I want to get the count of Google +1s for the current web page. I want to do this in PHP and then write the number of shares or +1s to a database, which is why I need it. How can I get the +1 count in PHP?
Thanks in advance.

This one works for me and is faster than the CURL one:
function getPlus1($url) {
$html = file_get_contents( "https://plusone.google.com/_/+1/fastbutton?url=".urlencode($url));
$doc = new DOMDocument(); $doc->loadHTML($html);
$counter=$doc->getElementById('aggregateCount');
return $counter->nodeValue;
}
Here are similar functions for Tweets, Pins and Facebook:
function getTweets($url){
$json = file_get_contents( "http://urls.api.twitter.com/1/urls/count.json?url=".urlencode($url) );
$ajsn = json_decode($json, true);
$cont = $ajsn['count'];
return $cont;
}
function getPins($url){
$json = file_get_contents( "http://api.pinterest.com/v1/urls/count.json?callback=receiveCount&url=".urlencode($url) );
$json = substr( $json, 13, -1);
$ajsn = json_decode($json, true);
$cont = $ajsn['count'];
return $cont;
}
function getFacebooks($url) {
$xml = file_get_contents("http://api.facebook.com/restserver.php?method=links.getStats&urls=".urlencode($url));
$xml = simplexml_load_string($xml);
$shares = $xml->link_stat->share_count;
$likes = $xml->link_stat->like_count;
$comments = $xml->link_stat->comment_count;
return $likes + $shares + $comments;
}
Note: the Facebook number is the sum of likes and shares, and some people say comments too (I haven't checked this yet); anyway, use the one you need.
This will work if your PHP settings allow opening external URLs; check your "allow_url_fopen" PHP setting.
Hope this helps.

function get_plusones($url) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://clients6.google.com/rpc");
curl_setopt($curl, CURLOPT_POST, 1);
curl_setopt($curl, CURLOPT_POSTFIELDS, '[{"method":"pos.plusones.get","id":"p","params":{"nolog":true,"id":"' . $url . '","source":"widget","userId":"@viewer","groupId":"@self"},"jsonrpc":"2.0","key":"p","apiVersion":"v1"}]');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-type: application/json'));
$curl_results = curl_exec ($curl);
curl_close ($curl);
$json = json_decode($curl_results, true);
return intval( $json[0]['result']['metadata']['globalCounts']['count'] );
}
echo get_plusones("http://www.stackoverflow.com");
from internoetics.com

The cURL and API way listed in the other posts here no longer works.
There is still at least 1 method, but it's ugly and Google clearly doesn't support it. You just rip the variable out of the JavaScript source code for the official button with a regular expression:
function shinra_gplus_get_count( $url ) {
$contents = file_get_contents(
'https://plusone.google.com/_/+1/fastbutton?url='
. urlencode( $url )
);
preg_match( '/window\.__SSR = {c: ([\d]+)/', $contents, $matches );
if( isset( $matches[0] ) )
return (int) str_replace( 'window.__SSR = {c: ', '', $matches[0] );
return 0;
}
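A slightly tidier sketch of the same extraction: the capture group already holds the digits, so it can be read from $matches[1] directly instead of stripping the prefix with str_replace. The function name and the sample string are hypothetical, and it only shows the regex step; it makes no claim that the endpoint still returns this markup.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: pull the counter out of the button's inline script.
function extract_ssr_count(string $contents): int
{
    // The brace is escaped so the pattern is unambiguous; (\d+) captures the count.
    if (preg_match('/window\.__SSR = \{c: (\d+)/', $contents, $m) === 1) {
        return (int) $m[1];
    }
    return 0; // pattern not found
}

// Sample input in the shape the pattern above expects.
$sample = '<script>window.__SSR = {c: 3166 ,si: 1};</script>';
var_dump(extract_ssr_count($sample));
var_dump(extract_ssr_count('<html>no counter here</html>'));
```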

The next PHP script works great so far for retrieving Google+ count on shares and +1's.
$url = 'http://nike.com';
$gplus_type = true ? 'shares' : '+1s';
/**
* Get Google+ shares or +1's.
* See our post at stackoverflow.com/a/23088544/328272
*/
function get_gplus_count($url, $type = 'shares') {
$curl = curl_init();
// According to stackoverflow.com/a/7321638/328272 we should use certificates
// to connect through SSL, but they also offer the following easier solution.
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
if ($type == 'shares') {
// Use the default developer key AIzaSyCKSbrvQasunBoV16zDH9R33D88CeLr9gQ, see
// tomanthony.co.uk/blog/google_plus_one_button_seo_count_api.
curl_setopt($curl, CURLOPT_URL, 'https://clients6.google.com/rpc?key=AIzaSyCKSbrvQasunBoV16zDH9R33D88CeLr9gQ');
curl_setopt($curl, CURLOPT_POST, 1);
curl_setopt($curl, CURLOPT_POSTFIELDS, '[{"method":"pos.plusones.get","id":"p","params":{"nolog":true,"id":"' . $url . '","source":"widget","userId":"@viewer","groupId":"@self"},"jsonrpc":"2.0","key":"p","apiVersion":"v1"}]');
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-type: application/json'));
}
elseif ($type == '+1s') {
curl_setopt($curl, CURLOPT_URL, 'https://plusone.google.com/_/+1/fastbutton?url='.urlencode($url));
}
else {
throw new Exception('No $type defined, possible values are "shares" and "+1s".');
}
$curl_result = curl_exec($curl);
curl_close($curl);
if ($type == 'shares') {
$json = json_decode($curl_result, true);
return intval($json[0]['result']['metadata']['globalCounts']['count']);
}
elseif ($type == '+1s') {
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($curl_result);
$counter=$doc->getElementById('aggregateCount');
return $counter->nodeValue;
}
}
// Get Google+ count.
$gplus_count = get_gplus_count($url, $gplus_type);

Google does not currently have a public API for getting the +1 count for URLs. You can file a feature request here. You can also use the reverse engineered method mentioned by @DerVo. Keep in mind though that method could change and break at any time.

I've assembled this code to read count directly from the iframe used by social button.
I haven't tested it at bulk scale, so you may have to slow down requests and/or change the user agent :) .
This is my working code:
function get_plusone($url)
{
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://plusone.google.com/_/+1/fastbutton?bsv&size=tall&hl=it&url=".urlencode($url));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$html = curl_exec ($curl);
curl_close ($curl);
$doc = new DOMDocument();
$doc->loadHTML($html);
$counter=$doc->getElementById('aggregateCount');
return $counter->nodeValue;
}
Usage is the following:
echo get_plusone('http://stackoverflow.com/');
Result is: 3166

I had to merge a few ideas from different options and urls to get it to work for me:
function getPlusOnes($url) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "https://plusone.google.com/_/+1/fastbutton?url=".urlencode($url));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$html = curl_exec ($curl);
curl_close ($curl);
$doc = new DOMDocument();
$doc->loadHTML($html);
$counter=$doc->getElementById('aggregateCount');
return $counter->nodeValue;
}
All I had to do was update the url but I wanted to post a complete option for those interested.
echo getPlusOnes('http://stackoverflow.com/');
Thanks to Cardy for using this approach, then I just had to just get a url that worked for me...

I've released a PHP library retrieving count for major social networks. It currently supports Google, Facebook, Twitter and Pinterest.
The techniques used are similar to the ones described here, and the library provides a mechanism to cache retrieved data. This library also has some other nice features: installable through Composer, fully tested, HHVM support.
http://dunglas.fr/2014/01/introducing-the-socialshare-php-library/
