simple_html_dom: 403 Access denied - php

I implemented this function in order to parse HTML pages using two different methods.
As you can see, both use the very handy class called simple_html_dom.
The difference is that the first method also uses curl to load the HTML, while the second does not.
Both methods work fine on a lot of pages, but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another way to avoid this type of denial?
function searchThroughDOM($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);

    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);
            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;
        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }

    $collection = $html->find('h1');
    foreach ($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }
    $html->save('result.htm');
    $html->clear();

    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}

From my point of view, there is nothing wrong with simple_html_dom itself.
You may remove the simple_html_dom part of the code and keep only the curl part,
which I assume is where the problem lies.
There are lots of reasons why curl does not work on a page.
First of all, I can see you add
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try adding CURLOPT_SSL_VERIFYHOST set to false.
Secondly, check your curl version and see if it is too old.
Third, if none of the above works, you may want to enable cookies. It is possible that having cookies disabled causes the website to detect that the request comes from a machine rather than a real person.
Lastly, if all of the above fails, try another library or even file_get_contents().
curl is not your only option, although it is the most powerful one.
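To make those suggestions concrete, here is a minimal sketch of the curl branch with host verification relaxed, a cookie jar enabled, and a couple of extra browser-like headers. The cookie-file path and header values are only examples, and a site with bot protection may still answer 403:

// Sketch of the suggestions above, based on the asker's curl branch.
// The cookie-file path and extra headers are illustrative values only.
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);                       // page URL from the question
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);           // also skip the host-name check
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');   // store cookies between requests
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');  // and send them back
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: fr-FR,fr;q=0.9,en;q=0.8',
));
$str = curl_exec($curl);
curl_close($curl);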

Related

Privacy Crawler

I need your help: can anyone explain to me why my code doesn't find the a-tag "privacy" on the site zoho.com?
My code finds the "privacy" link on other sites just fine, but not on zoho.com.
I use the Symfony Crawler: https://symfony.com/doc/current/components/dom_crawler.html
use Symfony\Component\DomCrawler\Crawler;

// Imprint check
function findPrivacy($domain) {
    $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
    $curl = curl_init($domain);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($curl, CURLOPT_USERAGENT, $ua);
    $data = curl_exec($curl);

    $crawler = new Crawler($data);
    $nodeValues = $crawler->filter('a')->each(function ($node) {
        if (str_contains($node->attr('href'), 'privacy-police') || str_contains($node->attr('href'), 'privacy')) {
            return true;
        } else {
            return false;
        }
    });
    return $nodeValues;
}
If you look at the source code of zoho.com, you will see that the footer is empty. But on the rendered site, the footer isn't empty when you scroll down.
How can I find this "privacy" link now?
Your script cannot find what is not there. If you load the zoho.com page in a browser and look at the source code, you will notice that the word "privacy" is not even present. It's possible that the footer containing the link to the privacy policy is loaded asynchronously, which PHP cannot handle.
EDIT: by "asynchronously loaded" I mean using something like AJAX, which runs client-side only. Since PHP runs server-side only, it cannot perform the operations required to load the footer containing the link to the privacy policy.
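A quick way to verify that is to fetch the raw markup with curl and check whether the word appears at all before the crawler runs; this is only a diagnostic sketch using the example domain from the question:

// Fetch the raw HTML (what PHP actually sees, before any JavaScript runs).
$ch = curl_init('https://www.zoho.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$raw = curl_exec($ch);
curl_close($ch);

// If this prints "not in source", the footer link is injected by JavaScript
// after page load, and no pure-PHP HTTP client will ever see it.
echo stripos($raw, 'privacy') !== false ? 'in source' : 'not in source';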

Failed to load external entity on simplexml_load_file at Openstreetmap

I recently checked one of our websites and realized that the search for postal code wasn't working anymore.
I get the following error:
'Failed to load external entity'
If instead I use simplexml_load_string() I receive
'Start tag expected, '<' not found'.
This is the code I'm using:
libxml_use_internal_errors(true);
$xml = simplexml_load_file('https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode');
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
I read somewhere that it might actually have something to do with HTTP headers, but I did not find any useful info on this.
OSM Nominatim's usage policy states that you need to provide a User-Agent or HTTP-Referer request header to identify the application. As such, using a user agent to masquerade as an end-user browser is really not great etiquette.
You can find the usage policy here. It also says that the default values used by HTTP libraries (like the one simplexml_load_file() uses) are not acceptable.
You say you are using simplexml_load_string(), but you fail to say how you are getting the XML to that function. The most likely scenario is that, whichever method you are using to get the XML file, you are also neglecting to pass the mandatory headers.
As such, I'd create a request using php-curl, provide one of these headers to identify your app, and parse the resulting XML string with simplexml_load_string().
E.g.:
// Setup variables
$nominatim_url = 'https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$user_agent    = 'ID_Identifying_Your_App v100';
$http_referer  = 'http://www.urltoyourapplication.com';
$timeout       = 10;

// curl initialization
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $nominatim_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

// These are the bits you are missing:
// setting curl's user agent...
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
// ...and/or the HTTP referer. Either one or both, it's up to you.
curl_setopt($ch, CURLOPT_REFERER, $http_referer);

// Get the XML
$data = curl_exec($ch);
curl_close($ch);

// Load it in SimpleXML
$xml = simplexml_load_string($data);

// This was your code, left as it was
if (false === $xml) {
    $errors = libxml_get_errors();
    var_dump($errors);
}
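Once $xml loads successfully, the matches can be read straight from the SimpleXML object. The element and attribute names below assume the usual searchresults/place layout of a Nominatim XML response, so treat this as an illustrative sketch:

if (false !== $xml) {
    // Each <place> element carries the result data as attributes.
    foreach ($xml->place as $place) {
        echo (string) $place['display_name'], ' (',
             (string) $place['lat'], ', ', (string) $place['lon'], ')', PHP_EOL;
    }
}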
You can also use curl and add custom headers. I hope this code is useful for you:
<?php
$request_url = 'https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept-Language: en-US,en;q=0.9,fa;q=0.8,und;q=0.7',
    'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'));
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
echo($data);

How to show curl output in html format in php

I set up code in PHP to hit a URL through a list of proxies. The hit is generated successfully and I am getting output in HTML, but this HTML does not display properly in the browser. I want the exact HTML returned through the proxy. Does anybody know how to do this? Please give me some idea. Here is the code I am using:
<?php
$curl = curl_init();
$timeout = 30;
$proxies = file("proxy.txt");
$r = "https://www.abcdefgth.com";
// Not more than 2 at a time
for ($x = 0; $x < 2000; $x++) {
    // setting time limit to zero will ensure the script doesn't get timed out
    set_time_limit(30);
    // now we will separate proxy address from the port
    //$PROXY_URL = $proxies[$getrand[$x]];
    echo $proxies[$x];
    curl_setopt($curl, CURLOPT_URL, $r);
    curl_setopt($curl, CURLOPT_PROXY, preg_replace('/\s+/', '', $proxies[$x]));
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5");
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_REFERER, "http://google.com/");
    $text = curl_exec($curl);
    echo "Hit Generated:";
}
?>
A simple look into the documentation of the function you use would have answered your question:
On http://php.net/manual/en/function.curl-exec.php it clearly states, right in the "Return Values" section, that you receive back a boolean value from that function, unless you have specified the CURLOPT_RETURNTRANSFER flag, which you did not do in your code.
So try adding
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
followed by any attempt to actually output the result you receive in $text, which you also forgot.
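Put together, the relevant part of the loop would look roughly like this (a sketch of the fix described above, not a tested rewrite of the whole script):

// Return the transfer instead of printing a bare boolean, then output it.
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$text = curl_exec($curl);
echo "Hit Generated:";
echo $text; // now contains the HTML fetched through the proxy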

Manipulate dom with php to scrape data

I am currently trying to manipulate the DOM through PHP to extract the view count from a Facebook video page. The code below was working until a while ago, but now it no longer finds the node that contains the view count. This information is inside a div with the id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of a Facebook video page?
private function _callCurl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    $http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return array(
        $http,
        $response,
    );
}

function test()
{
    $url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
    $request = $this->_callCurl($url); // _callCurl() is a method of the same class
    if ($request[0] == 200) {
        $dom = new DOMDocument();
        @$dom->loadHTML($request[1]); // suppress warnings from Facebook's malformed markup
        $elm = $dom->getElementById('fbPhotoPageMediaInfo');
        if (isset($elm->nodeValue)) {
            $views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
        } else {
            $views = null;
        }
    } else {
        echo "Error!";
    }
    return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() on $request you can see that it's giving you a 302 code (redirect) rather than a 200 (ok).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
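For reference, these are the only two options that needed changing in _callCurl() (a sketch of the fix described above; Facebook's bot detection changes frequently, so it may not stay valid):

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);        // follow the 302 redirect
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2'); // user agent Facebook accepted here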
Personally, I prefer to use Simple HTML DOM.
FB, like other high-traffic sites, updates its source to help prevent scraping. You may have to adjust your node search in the future.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid user agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/

function Scrape_FB_Views($url) {
    if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
        // Create DOM from URL
        $html = file_get_html($url);
        if ($html) {
            if (($html->find('span[class=fcg]', 3))) { // 4th instance of span with fcg class
                $text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
                $result = preg_replace('/[^0-9]/', '', $text); // strip all non-numeric characters
            } else {
                $result = "Node is no longer valid.";
            }
        } else {
            $result = "Could not get HTML.";
        }
    } else {
        $result = "URL is invalid.";
    }
    return $result;
}

$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>

php file_get_contents() gets stuck loading an image

As mentioned above, the PHP file_get_contents() function, or even the fopen()/fread() combination, gets stuck and times out when trying to read this simple image URL:
http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png
but the same image is easily loaded by browsers. What's the catch?
EDITED:
As requested in the comments, here is the function I used to get the data:
function customRead($url)
{
    $contents = '';
    $handle = fopen($url, "rb");
    $dex = 0;
    while (!feof($handle)) {
        if ($dex++ > 100) {
            break;
        }
        $contents .= fread($handle, 2048);
    }
    fclose($handle);
    echo "\nbreaking due to too many calls...\n";
    return $contents;
}
I also tried simply this:
echo file_get_contents('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
Both give the same issue
EDITED:
As suggested in the comments, I used curl:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.1 Safari/537.11');
$res = curl_exec($ch);
$rescode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch) ;
echo "\n\n\n[DATA:";
echo $res;
echo "]\n\n\n[CODE:";
print_r($rescode);
echo "]\n\n\n[ERROR:";
echo curl_error($ch);
echo "]\n\n\n";
This is the result:
[DATA:]
[CODE:0]
[ERROR:]
If you can't get the remote data with file_get_contents(), you can try it with cURL, as it can provide error messages via curl_error(). If you get nothing back, not even an error, then something on your server is blocking outgoing connections. You may even want to try curl over SSH. I'm not sure whether that makes any difference, but it's worth a try. If you still don't get anything, consider contacting the server admin (if that isn't you) or the hosting provider.
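One thing worth checking in the snippet above: curl_error() is read after curl_close(), and on PHP versions before 8.0 the handle is already freed at that point, so the error string will always come back empty. A minimal sketch that captures the diagnostics before closing the handle:

// Capture error information before curl_close(); the option values here are
// illustrative, not a tested fix for this particular host.
$ch = curl_init('http://pics.redblue.de/artikelid/GR/1140436/fee_786_587_png');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
$res   = curl_exec($ch);
$errno = curl_errno($ch);
$error = curl_error($ch);
$code  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo "code=$code errno=$errno error=$error\n";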
