Scraping 'next' page issue

I am trying to scrape product data by product section from a Zen-cart store using Simple HTML DOM. I can scrape data from the first page fine but when I try to load the 'next' page of products the site returns the index.php landing page.
If I use the function directly with *http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36&sort=20a&page=2* it scrapes the product information from page 2 fine.
The same thing occurs if I use cURL.
getPrices('http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36');
function getPrices($sectionURL) {
    $opts = array('http' => array('method' => "GET", 'header' => "Accept-language: en\r\n" . "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n" . "Cookie: zenid=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\r\n"));
    $context = stream_context_create($opts);
    $html = file_get_contents($sectionURL, false, $context);
    $dom = new simple_html_dom();
    $dom->load($html);
    //Do cool stuff here with information from page.. product name, image, price and more info URL
    if ($nextPage = $dom->find('a[title= Next Page ]', 0)) {
        $nextPageURL = $nextPage->href;
        echo $nextPageURL;
        $dom->clear();
        unset($dom);
        getPrices($nextPageURL);
    } else {
        echo "\nNo more pages to scrape!!";
        $dom->clear();
        unset($dom);
    }
}
Any ideas on how to fix this problem?

I see lots of potential culprits. You're not keeping track of cookies, or setting referer and there's a good chance simple_html_dom is letting you down.
My recommendation is to proxy your requests through fiddler or charles and make sure they look the way they do coming from a browser.

It turned out that the next-page URLs passed to the function in the loop contained &amp; instead of &, and file_get_contents didn't like it.
$sectionURL = str_replace( "&amp;", "&", urldecode(trim($sectionURL)) );
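
A minimal sketch of the same idea using html_entity_decode(), which turns &amp; back into & and also covers other entities that can show up in an href. The helper name normalize_href is made up for illustration, and unlike the line above it does not call urldecode().

```php
<?php
// Hypothetical helper: turn an href as it appears in the HTML source
// into a URL that can be passed to file_get_contents().
function normalize_href(string $href): string
{
    // Decode entities such as &amp; back to their literal characters.
    return html_entity_decode(trim($href), ENT_QUOTES, 'UTF-8');
}

// Example with the kind of href a "Next Page" link carries in the markup.
echo normalize_href('http://URLxxxxxxxxxx.com/index.php?main_page=index&amp;cPath=36&amp;sort=20a&amp;page=2'), "\n";
```

Inside getPrices() this would be applied to $nextPage->href before the recursive call.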

Related

file_get_contents returns unreadable text for a specific url

When I try to read the rss feeds of the kat.cr using php file_get_contents function, I get some unreadable text but when I open it up with my browser the feed is fine.
I have tried many other hosts but no chance in getting the correct data.
I have even tried setting the user-agent to different browsers, but still no change.
This is the simple code that I've tried:
$options = array('http' => array('user_agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'));
$url = 'https://kat.cr/movies/?rss=1';
$data = file_get_contents($url, FILE_TEXT, stream_context_create($options));
echo $data;
I'm curious how they're doing it and what I can do to overcome the problem.
A part of unreadable text:
‹ي]يrم6–‎?Oپي©™ت,à7{»‌âgw&يؤe;éN¹\S´HK\S¤–¤l+ے÷ِùِIِ”(إژzA5‌ةض؛غ%K4ـ{qtqy½ùوa^ »¬nٍھ|ûٹSِ eه¤Jَrِْصڈ1q^}sü§7uسlدزؤYً¾²yفVu‌•يغWGG·Iس&m>،“j~$ےzؤ(?zï‍ج’²جٹم?!ّ÷¦حغ";‏گ´Yس¢ï³{tر5ز ³َsgYٹْ.ں#
Actually, every time I open the link I get different unreadable text.

As I mentioned in the comment - the contents returned are gzip encoded so you need to un-gzip the data. Depending upon your version of php you may or may not have gzdecode installed, I don't but the function here does the trick.
if( !function_exists('gzdecode') ){
    function gzdecode( $data ){
        $g=tempnam('/tmp','ff');
        @file_put_contents( $g, $data ); // write the compressed bytes to a temp file
        ob_start();
        readgzfile($g); // prints the decompressed contents into the output buffer
        $d=ob_get_clean();
        unlink($g);
        return $d;
    }
}
$data=gzdecode( file_get_contents( $url ) );
echo $data;
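
On PHP 5.4 and later gzdecode() ships with the zlib extension, so the fallback is usually not needed. Here is a small sketch that only decompresses when the body actually starts with the gzip magic bytes, so uncompressed responses pass through unchanged. The helper name maybe_gunzip is hypothetical, and the sketch assumes zlib is enabled.

```php
<?php
// Hypothetical helper: decompress $body only if it looks like gzip data.
// Assumes the zlib extension (gzdecode, gzencode) is available.
function maybe_gunzip(string $body): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, return unchanged
    }
    $decoded = gzdecode($body);
    // gzdecode() returns false on corrupt input; fall back to the raw body.
    return $decoded === false ? $body : $decoded;
}

// Offline demonstration: compress a string, then recover it.
$sample = gzencode('<rss>example feed</rss>');
echo maybe_gunzip($sample), "\n";      // prints the original XML string
echo maybe_gunzip('plain text'), "\n"; // passes through untouched
```

With a live feed the call would be maybe_gunzip(file_get_contents($url)).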

Why file_get_contents returning garbled data?

I am trying to grab the HTML from the below page using some simple php.
URL: https://kat.cr/usearch/architecture%20category%3Abooks/
My code is:
$html = file_get_contents('https://kat.cr/usearch/architecture%20category%3Abooks/');
echo $html;
file_get_contents works, but returns scrambled data.
I have tried using cURL as well as various functions like htmlentities(), mb_convert_encoding(), utf8_encode() and so on, but I just get different variations of the scrambled text.
The source of the page says it is charset=utf-8, but I am not sure what the problem is.
Calling file_get_contents() on the base url kat.cr returns the same mess.
What am I missing here?

It is GZ compressed and when fetched by the browser the browser decompresses this, so you need to decompress. To output it as well you can use readgzfile():
readgzfile('https://kat.cr/usearch/architecture%20category%3Abooks/');

The site's response is compressed, so you have to decompress it to get back the original content.
The quickest way is to use gzinflate() as below:
$html = gzinflate(substr(file_get_contents("https://kat.cr/usearch/architecture%20category%3Abooks/"), 10, -8));
Or, for a more advanced solution, consider the following function (found at this blog):
function get_url($url)
{
    //user agent is very necessary, otherwise some websites like google.com won't give zipped content
    $opts = array(
        'http'=>array(
            'method'=>"GET",
            'header'=>"Accept-Language: en-US,en;q=0.8\r\n" .
                "Accept-Encoding: gzip,deflate,sdch\r\n" .
                "Accept-Charset:UTF-8,*;q=0.5\r\n" .
                "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n"
        )
    );
    $context = stream_context_create($opts);
    $content = file_get_contents($url ,false,$context);
    //If http response header mentions that content is gzipped, then uncompress it
    foreach($http_response_header as $c => $h)
    {
        if(stristr($h, 'content-encoding') and stristr($h, 'gzip'))
        {
            //Now lets uncompress the compressed data
            $content = gzinflate( substr($content,10,-8) );
        }
    }
    return $content;
}
echo get_url('http://www.google.com/');
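
Another option that stays with file_get_contents() is PHP's compress.zlib:// stream wrapper, which gunzips gzip data while reading and passes other data through unchanged. The sketch below demonstrates it on a local temporary file so it runs offline. Prefixing an http:// or https:// URL the same way is the assumed use for the kat.cr case and is not verified against that site. The helper name read_via_zlib is hypothetical.

```php
<?php
// Hypothetical helper: read a URL or path through the compress.zlib:// wrapper.
// Assumes the zlib extension is enabled. Returns false on failure.
function read_via_zlib(string $source)
{
    return file_get_contents('compress.zlib://' . $source);
}

// Offline demonstration with a temporary gzip file instead of a live site.
$tmp = tempnam(sys_get_temp_dir(), 'gz');
file_put_contents($tmp, gzencode('<html>hello</html>'));
echo read_via_zlib($tmp), "\n";
unlink($tmp);
```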

Scraping iframe video from other sites through PHP

I want to scrape video from other sites to my sites (e.g. from a live video site).
How can I scrape the <iframe> video from other websites? Is the process the same as that for scraping images?
$html = file_get_contents('http://website.com/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$iframes = $dom->getElementsByTagName('frame');
foreach ($iframes as $iframe) {
$pic = $iframe->getAttribute('src');
echo '<li><frame src="'.$pic.'"';
}

This post is a little old, but still, here's my answer:
I'd recommend using cURL and XPath to scrape the site and parse the HTML data. file_get_contents has some security issues and some hosts may disable it. You could do something like this:
<?php
function scrape($URL){
    //cURL options
    $options = Array(
        CURLOPT_RETURNTRANSFER => TRUE, //return html data in string instead of printing it out on screen
        CURLOPT_FOLLOWLOCATION => TRUE, //follow header('Location: location');
        CURLOPT_CONNECTTIMEOUT => 60, //max time to try to connect to page
        CURLOPT_HEADER => FALSE, //do not include the response headers in the output
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0", //User Agent
        CURLOPT_URL => $URL //SET THE URL
    );
    $ch = curl_init($URL);//initialize a cURL session
    curl_setopt_array($ch, $options);//set the cURL options
    $data = curl_exec($ch);//execute cURL (the scraping)
    curl_close($ch);//close the cURL session
    return $data;
}
function parse(&$data, $query, &$dom){
    $Xpath = new DOMXpath($dom); //new Xpath object associated to the domDocument
    $result = $Xpath->query($query);//run the Xpath query through the HTML
    var_dump($result);
    return $result;
}
//new domDocument
$dom = new DomDocument("1.0");
//Scrape and parse
$data = scrape('http://stream-tv-series.net/2013/02/22/new-girl-s1-e6-thanksgiving/'); //scrape the website
@$dom->loadHTML($data); //load the html data to the dom, suppressing warnings about malformed markup
$XpathQuery = '//iframe'; //Your Xpath query could look something like this
$iframes = parse($data, $XpathQuery, $dom); //parse the HTML with Xpath
foreach($iframes as $iframe){
    $src = $iframe->getAttribute('src'); //get the src attribute
    echo '<li><iframe src="' . $src . '"></iframe></li>'; //echo the iframes
}
?>
Here are some links that you could find useful:
cURL: http://php.net/manual/fr/book.curl.php
Xpath: http://www.w3schools.com/xpath/
There is also the DomDocument documentation on php.net. I can't post the link, I don't have enough reputation.
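
To try the parsing step without hitting a live site, here is a small offline sketch of the DOMDocument plus DOMXPath part. It uses libxml_use_internal_errors() rather than @ to keep warnings about sloppy markup quiet. The helper name iframe_sources and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: return the src of every <iframe> in an HTML string.
function iframe_sources(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sources = [];
    // Select only iframes that actually carry a src attribute.
    foreach ($xpath->query('//iframe[@src]') as $iframe) {
        $sources[] = $iframe->getAttribute('src');
    }
    return $sources;
}

// Offline demonstration on a made-up page fragment.
$html = '<html><body><iframe src="http://example.com/embed/1"></iframe><p>text</p></body></html>';
print_r(iframe_sources($html));
```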

Setting useragent in PHP file_get_contents?

I am trying to get the content from a page using file_get_contents to get the HTML and regex for further processing.
The site I am getting my content from has a desktop and mobile site so I was wondering is there a way to send a custom useragent to get the mobile site instead of the desktop site?
Using file_get_contents I have tried it with my code shown below but all I get is a blank page:
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"Cookie: foo=bar\r\n" .
"User-Agent: Mozilla/5.0 (iPad; U; CPU iPad OS 5_0_1 like Mac OS X; en-us) AppleWebKit/535.1+ (KHTML like Gecko) Version/7.2.0.0 Safari/6533.18.5\r\n" // i.e. An iPad )
);
$context = stream_context_create($options);
$file = file_get_contents('http://www.example.com/api/'.$atrib,false,$context);
$doc = new DOMDocument();// create new dom document
$doc->loadHTML($file);// load the xmlpage
$tags = $doc->getElementsByTagName('video'); // find the tag we are looking for
foreach ($tags as $tag) { // for ever tag that is the same make it a new tag of its own
if (isset($_GET['key']) && $_GET['key'] == $key) { // if key is in url and matches script key - do or dont
echo $tag->getAttribute('src'); // get out 3 min video from the attribute in page
} else { // if key is not in url or not correct show error
echo "ACCESS DENIED!"; // our out bound error
}
}
I am trying to use the user agent to load the content from the site's mobile page and then use regex to get the src URL from this line of code in the page, in case this is the problem:
<video id="player" src="http://example.com/api/4.m3u8" poster="http://example.com/default.png" autoplay="" autobuffer="" preload="" controls="" height="537" width="935"></video>

As mentioned by Casimir et Hippolyte in the comments, uncomment the closing parenthesis at the end of the "User-Agent: Mozilla..." line. It also helps to turn on error display while debugging:
ini_set('display_errors', 'On');
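
For reference, a sketch of the options array with the parenthesis back in place, wrapped in a hypothetical helper (mobile_context) so the resulting context can be inspected without making a request. The shortened iPad User-Agent string is only an illustrative value.

```php
<?php
// Hypothetical helper: build a stream context that sends a given User-Agent.
function mobile_context(string $userAgent)
{
    $options = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-language: en\r\n" .
                        "User-Agent: " . $userAgent . "\r\n",
        ), // closes the 'http' array; this is the parenthesis the comment swallowed
    );
    return stream_context_create($options);
}

$context = mobile_context('Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X)');
// Inspect what would be sent, without making a request.
print_r(stream_context_get_options($context));
```

The context is then passed as the third argument, as in file_get_contents($url, false, $context).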

PHP file_get_contents and VAST xml

This is what I am trying to do: download a VAST XML from a URL and save it locally in an XML file, in PHP. For that I am using file_get_contents and file_put_contents. This is the script I am using:
<?php
$tid=time();
$xml1 = file_get_contents('http://ad.afy11.net/ad?enc=4&asId=1000009566807&sf=0&ct=256');
file_put_contents("downloads/file1_$tid.xml", $xml1);
echo "<p>file 1 recorded</p>";
?>
The URL in question is a real URL that will deliver VAST XML code. My problem is that when I save the file it writes an empty VAST tag:
<?xml version="1.0" encoding="UTF-8"?> <VAST version="2.0"> </VAST>
But if I run on Firefox it will actually deliver some code:
<VAST version="2.0"><Ad id="Adify"><Wrapper><AdSystem>Eyeblaster</AdSystem><VASTAdTagURI>http://bs.serving-sys.com/BurstingPipe/adServer.bs?cn=is&c=23&pl=VAST&pli=6583370&PluID=0&pos=7070&ord=4288438534]&cim=1</VASTAdTagURI><Impression>http://ad.afy11.net/ad?ipc=NMUsqYdyBUCjh4-i2HwWfK1oILM2AAAAN6-rBkSy8JNMZcuzAlj1XlSySpo6Hi7xEYULS+UgOVN5D3UuhFUVSWbFHoLE-+3su0-QnGgZgMJyiTm-R6O+yQ==</Impression><Creatives/></Wrapper></Ad></VAST>
Not 100% of the time, since they do cap the number of requests, but WAY more often than when I try to save the file using the PHP script.
Is there a way to make the PHP script mimic a browser? I don't know if this is the right question, but that's the only reason I can think of for getting an empty VAST tag with the PHP script and a full tag with the browser.
any ideas???
thanks :)
Update: After doing some extra research, I found some info about stream_context_create function, but I haven't been able to duplicate the browser's results.
here's my new code:
<?php
$tid=time();
$opts = array('http' =>
array(
'method' => 'GET',
//'user_agent ' => "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6",
'header' => array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*\/*;q=0.8
'
),
)
);
$context = stream_context_create($opts);
$xml1 = file_get_contents('http://ad.afy11.net/ad?enc=4&asId=1000009566807&sf=0&ct=256');
file_put_contents("downloads/file1_$tid.xml", $xml1);
echo "<p>file 1 recorded</p>";
echo "<textarea rows='6' cols='80'> $xml1 </textarea> ";
echo "<br><iframe src='http://ad.afy11.net/ad?enc=4&asId=1000009566807&sf=0&ct=256' width='960' height='300'></iframe>";
?>
I also added an iframe to compare the cases where the browser gets the right file and the PHP function does not.

After some research I found a solution to my problem, and I would like to share it here for future reference.
The idea was to pass some HTTP headers with file_get_contents. I accomplished that with this:
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>array("Accept-language: en", "Content-Type: multipart/form-data\r\n"),
'user_agent'=> $_SERVER['HTTP_USER_AGENT']
)
);
$context = stream_context_create($opts);
$xml4 = file_get_contents($url1, true, $context);
That's it, now I can get the same xml as if I was using the browser.
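
One thing worth noting: in the updated code in the question, $context is created but never passed to file_get_contents(), so the extra headers were never sent. Below is a sketch of the working setup as a reusable helper. The name browser_like_context and the fallback User-Agent are assumptions for illustration (for example, for when the script runs from the command line and $_SERVER['HTTP_USER_AGENT'] is not set), and the Content-Type header from the snippet above is left out here on the assumption that a GET request does not need it.

```php
<?php
// Hypothetical helper: build a context that forwards the visitor's User-Agent,
// or a fallback when the script is not run by a browser request.
function browser_like_context(string $fallbackAgent)
{
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : $fallbackAgent;
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'header'     => array('Accept-language: en'),
            'user_agent' => $agent,
        ),
    ));
}

$context = browser_like_context('Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0');
// The context only takes effect when passed as the third argument, e.g.
// $xml = file_get_contents($url, false, $context);
```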
