php get file contents browser - php

I get page contents with this php code:
but I don't know why server access this page with old browser version
$url = $target_domain . 'http://www.facebook.com';
//Download page
$site = file_get_contents($url);
$dom = DOMDocument::loadHTML($site);
if($dom instanceof DOMDocument) {
// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if($head_tag_list->length !== 1) {
throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);
// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if($head_has_children) {
$head_tag_first_child = $head_tag->firstChild;
}
// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);
// insert new base tag as first child to head tag
if($head_has_children) {
$base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
$base_node = $head_tag->appendChild($base_element);
}
echo $dom->saveHTML();
} else {
// something went wrong in loading HTML to DOM Document
// provide error messaging
}
?>
take a look on photo :
";
please help me how to access with new version of browsers.

Instead of using file_get_contents which is little poor, you can use the Curl which we can add useful parameters, in your case you should indicate a new UserAgent in the request, try this this simple exemple :
function curl_download($Url) {
// is cURL installed yet?
if (!function_exists('curl_init')) {
die('Sorry cURL is not installed!');
}
// OK cool - then let's create a new cURL resource handle
$ch = curl_init();
// Now set some options (most are optional)
// Set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36");
// Include header in result? (0 = yes, 1 = no)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}

Related

PHP - Get final url from a url after all redirections (curl + php)

// From URL to get redirected URL
$url = 'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625';
$ch = curl_init(); // create cURL handle (ch)
if (!$ch) {
die("Couldn't initialize a cURL handle");
}
// set some cURL options
$ret = curl_setopt($ch, CURLOPT_URL, $url);
$ret = curl_setopt($ch, CURLOPT_HEADER, 1);
$ret = curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$ret = curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$ret = curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// execute
$ret = curl_exec($ch);
if (!empty($ret)) {
$info = curl_getinfo($ch);
curl_close($ch); // close cURL handler
if (empty($info['http_code'])) {
die("No HTTP code was returned");
} else {
echo 'REDIRECTED FINAL URL'.$info['url']); // this does not give final url.
}
}
is their any way we can get final url from a url after all re-directions ?
Let me know if any changes needs to be done in this code ?
https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625
This is the url which has lots of redirections, i am testing code with this one but it does not return final url, it return some the url then the url which you see in url bar.
The code you have is working correctly, but it is only part of what you want. When you get to the final URL redirect, your return includes...
<HTML><head></head><body>
<script LANGUAGE="JavaScript1.2">
window.location.replace('https:\/\/loomyhome.com\/collections\/all-products\/products\/blue-my-mind-rug?sscid=71k5_300lf&')
</script>
</body></html>
So you then need to extract the URL from there. You can use a regex (not my best skill) which would be something like...
preg_match('#(https:.*?)\'\)#', $ret, $match);
echo stripslashes($match[1]);
(using stripslashes to unescape the string). Gives...
https://loomyhome.com/collections/all-products/products/blue-my-mind-rug?sscid=71k5_3097f&

simple_html_dom: 403 Access denied

I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is the first method is also using curl to load the HTML while the second is not using curl
Both methods are working fine on a lot of pages but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
$time_start = microtime(true);
switch ($method) {
case 'curl':
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
break;
case 'simple_html_dom':
$html = new simple_html_dom();
$html->load_file($url);
break;
}
$collection = $html->find('h1');
foreach($collection as $x => $x_value) {
echo 'x = '.$x.' => value = '.$x_value.'<br>';
}
$html->save('result.htm');
$html->clear();
$time_end = microtime(true);
echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view , there is nothing wrong with "simple_html_dom"
you may remove the simple html dom "part" of the code , leave only for the CURL
which I assume is the source of the problem.
There are lots of reasons cause the curl Not working on page
first of all I can see you add
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
you should also try to add CURLOPT_SSL_VERIFYHOST , false
Secondly , check your curl version, see if it is too old
third option, if none of above working , you may want to enable cookie , it may possible the cookie disabled cause the website detect it is machine, not real person send the request .
lastly , if all above attempt failed , try other library or even file_get_content ,
Curl is not your only option, of cause it is the most powerful one.

Failed to load external entity on simplexml_load_file at Openstreetmap

I recently checked one of our websites and realized that the search for postal code wasn't working anymore.
I get the following error:
'Failed to load external entity'
If instead I use simplexml_load_string() I receive
'Start tag expected, '<' not found'.
This is the code I'm using:
libxml_use_internal_errors(true);
$xml = simplexml_load_file('https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode');
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
I read somewhere it might actually has something to do with HTTP headers but I did not find any useful info on this.
In OSM Nominatim's usage policy it is stated that you need to provide a User-Agent or HTTP-Referer request header to identify the application. As such, using a user-agent to masquerade as end-user browser is really not great etiquette.
You can find the usage policy here. It also says that the default values used by http libraries (like the one simplexml_load_file() uses) are not acceptable.
You say you are using simplexml_load_string(), but fail to say how are you getting the XML to that function. But the most likely scenario is that whichever method you are using to get the XML file, you are also neglecting to pass the mandatory headers.
As such, I'd create a request using php-curl, provide one of these headers to identify your app; and parse the resulting XML string with simplexml_parse_string().
E.g.:
// setup variables
$nominatim_url = 'https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$user_agent = 'ID_Identifying_Your_App v100';
$http_referer = 'http://www.urltoyourapplication.com';
$timeout = 10;
// curl initialization
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $nominatim_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// this is are the bits you are missing
// Setting curl's user-agent
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
// you an also use this one (http-referer), it's up to you. Either one or both.
curl_setopt($ch, CURLOPT_REFERER, $http_referer);
// get the XML
$data = curl_exec($ch);
curl_close($ch);
// load it in simplexml
$xml = simplexml_load_string($data);
// This was your code, left as it was
if (false === $xml) {
$errors = libxml_get_errors();
var_dump($errors);
}
you can useing curlwith adding custom header , i hope this code useful for you :
<?php
$request_url='https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Accept-Language: en-US,en;q=0.9,fa;q=0.8,und;q=0.7',
'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'));
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
echo($data);

Manipulate dom with php to scrape data

I am currently trying to manipulate dom throuhg php to extract views from an fb video page. The below code was working until a bit ago. However now it doesnt find the node that contains the views count. This information is inside a div with id fbPhotoPageMediaInfo. What would be the best way to manipulate the dom through php to get views of an fb video page?
private function _callCurl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_URL, $url);
$response = curl_exec($ch);
$http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return array(
$http,
$response,
);
}
function test()
{
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
$request = callCurl($url);
if ($request[0] == 200) {
$dom = new DOMDocument();
#$dom->loadHTML($request[1]);
$elm = $dom->getElementById('fbPhotoPageMediaInfo');
if (isset($elm->nodeValue)) {
$views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
} else {
$views = null;
}
} else {
echo "Error!";
}
return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() on $request you can see that it's giving you a 302 code (redirect) rather than a 200 (ok).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
Personally I prefer to use Simplehtmldom.
FB like other high traffic sites do update their source to help prevent scraping. You may in the future have to adjust your node search.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/
Function Scrape_FB_Views($url) {
IF (!filter_var($url, FILTER_VALIDATE_URL) === false) {
// Create DOM from URL
$html = file_get_html($url);
IF ($html) {
IF (($html->find('span[class=fcg]', 3))) { // 4th instance of span with fcg class
$text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
$result = preg_replace('/[^0-9]/', '', $text); // replace all non-numeric characters
}ELSE{
$result = "Node is no longer valid."
}
}ELSE{
$result = "Could not get HTML.";
}
}ELSE{
$result = "URL is invalid.";
}
return $result;
}
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>

My php curl link rotator code needs a look over to fix

Im trying to build a script to send visitors through which will clean up the referring domains along with a url rotator, but visiting my link isnt directing me to the site for some reason. Im missing something in the code I believe. Help would be lovely.
Visiting here, should take me to freegamesarcade.co ---> http://buyerextra.com/folder1/ref.php
Heres the code inside /ref.php, look correct?
------
<?php
$referrer = array(
"https://t.co /hpulBZTTIR",
"https://t.co /G1mE5AQ4c4",
"https://t.co /Bwf1Du4Sbi",
);
echo '<meta name="referrer" content="no-referrer" />';
if (!function_exists('curl_init')){
die('Sorry cURL is not installed!');
}
// OK cool - then let's create a new cURL resource handle
$ch = curl_init();
//$Url = "http://freegamesarcade.co";
$Url = "http://freegamesarcade.co";
$rnd= rand(0, count($referrer));
$ref = $referrer[$rnd];
echo $ref;
// Now set some options (most are optional)
// Set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);
// Set a referer
curl_setopt($ch, CURLOPT_REFERER, $ref);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
// Include header in result? (0 = yes, 1 = no)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
echo "Your referrer (you came from): https://t.co/". $ref;
'<meta http-equiv="refresh" content="0; url=<?=$ref;?>" />';
?>

Categories