I am trying to fetch prices from Play.com and Amazon for a personal project, but I have two problems.
First, I have got Play.com to work, but it fetches the wrong price; second, Amazon doesn't return any results.
Here is the code I have been trying to get working:
$playdotcom = file_get_contents('http://www.play.com/Search.html?searchstring=".$getdata[getlist_item]."&searchsource=0&searchtype=r2alldvd');
$amazoncouk = file_get_contents('http://www.amazon.co.uk/gp/search?search-alias=dvd&keywords=".$getdata[getlist_item]."');
preg_match('#<span class="price">(.*)</span>#', $playdotcom, $pmatch);
$newpricep = $pmatch[1];
preg_match('#used</a> from <strong>(.*)</strong>#', $playdotcom, $pmatch);
$usedpricep = $pmatch[1];
preg_match('#<span class="bld lrg red"> (.*)</span>#', $amazoncouk, $amatch);
$newpricea = $amatch[1];
preg_match('#<span class="price bld">(.*)</span> used#', $amazoncouk, $amatch);
$usedpricea = $amatch[1];
Then I echo the results:
echo "Play :: New: $newpricep - Used: $usedpricep";
echo "Amazon :: New: $newpricea - Used: $usedpricea";
Just so you know whats going on
$getdata[getlist_item] = "American Pie 5: The Naked Mile";
which is working fine.
Any idea why these aren't working correctly?
EDIT: I have just realised that $getdata[getlist_item] inside the file_get_contents call is not being replaced by the variable's value; it is sent literally. Why is it doing that?
The quotes you are using aren't consistent! Both your opening and closing quotes need to be the same.
Try this:
$playdotcom = file_get_contents("http://www.play.com/Search.html?searchstring=".urlencode($getdata['getlist_item'])."&searchsource=0&searchtype=r2alldvd");
$amazoncouk = file_get_contents("http://www.amazon.co.uk/gp/search?search-alias=dvd&keywords=".urlencode($getdata['getlist_item']));
As written, ".$getdata[getlist_item]." was treated as part of the string, because the single-quoted string you opened was never closed before it.
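To make the quoting rule concrete, here is a minimal sketch. The helper name buildSearchUrl is made up for illustration and is not part of the original answer; it assumes the search parameters can be passed as an array, and it uses http_build_query(), which also URL-encodes the title.

```php
<?php
// Hypothetical helper (not from the original post): builds a search URL
// from a base address and an array of query parameters.
function buildSearchUrl(string $base, array $params): string
{
    // http_build_query() URL-encodes every value, so the spaces and the
    // colon in the title are safe to send.
    return $base . '?' . http_build_query($params);
}

$getdata = ['getlist_item' => 'American Pie 5: The Naked Mile'];

// Single quotes: no interpolation and no concatenation happens here,
// so the text between the quotes is kept literally.
$literal = 'keywords=".$getdata[getlist_item]."';

// Matching quotes plus real concatenation: the variable's value is used.
$amazonUrl = buildSearchUrl('http://www.amazon.co.uk/gp/search', [
    'search-alias' => 'dvd',
    'keywords'     => $getdata['getlist_item'],
]);

echo $literal, "\n";
echo $amazonUrl, "\n";
```

The first echo prints the variable name as text, which is the symptom described in the question's EDIT; the second prints a URL containing the encoded title.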
Use the cURL functions with proper headers. The code below fetches any web page; then use a proper parser such as DOMDocument or Simple HTML DOM Parser to read the price from the HTML content.
$playdotcom = getPage("http://www.play.com/Search.html?searchstring=".urlencode($getdata['getlist_item'])."&searchsource=0&searchtype=r2alldvd");
$amazoncouk = getPage("http://www.amazon.co.uk/gp/search?search-alias=dvd&keywords=".urlencode($getdata['getlist_item']));
function getPage($url){
$user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
$options = array(
CURLOPT_CUSTOMREQUEST =>"GET",
CURLOPT_POST =>false,
CURLOPT_USERAGENT => $user_agent,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 30, // seconds, not milliseconds
CURLOPT_TIMEOUT => 30, // seconds
CURLOPT_MAXREDIRS => 10,
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
curl_close( $ch );
return $content;
}
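The answer above recommends a proper parser rather than regular expressions. A minimal sketch of that step with DOMDocument and DOMXPath might look like the following. The sample markup and the helper name extractPrices are assumptions for illustration; the real pages' structure is not shown in the thread and may well differ.

```php
<?php
// Sample markup standing in for a fetched results page. This structure is
// an assumption; inspect the real page to find the right elements.
$html = '<html><body>'
      . '<div class="item"><span class="price">&pound;3.99</span></div>'
      . '<div class="item"><span class="price">&pound;5.49</span></div>'
      . '</body></html>';

// Hypothetical helper: returns the text of every <span class="price">.
function extractPrices(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $prices = [];
    // Matches spans whose class attribute is exactly "price".
    foreach ($xpath->query('//span[@class="price"]') as $node) {
        $prices[] = trim($node->textContent);
    }
    return $prices;
}

print_r(extractPrices($html));
```

Unlike a greedy (.*) pattern, the XPath query returns each price element separately, so the first entry is the first price on the page rather than everything up to the last closing tag.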
Related
I need to parse this web page ....
http://monitorps.sardegnasalute.it/monitorps/MonitorServlet?page=carLavoroPresidi&tipoProntoSoccorso=TUTTI&codiceAziendaSanitaria=200102&idPresidio=102MAD02&indirizzo=null&idProntoSoccorso=30
... using PHP to extract the numbers in the table under the columns "ROSSO", "GIALLO", "VERDE" and "BIANCO".
(NOTE: you may see different values on that page if you browse to it; that doesn't matter, since it changes dynamically.)
Those values are the result of a POST request made from within the web page.
This is the PHP code I'm using to send the POST request with cURL and then parse the JSON response (using Skyscanner JSON Path, which works fine in my code), trying to extract the values with an XPath query.
<?php
include "./tmp/vendor/autoload.php";
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => "http://monitorps.sardegnasalute.it/monitorps/MonitorServlet",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => "POST",
CURLOPT_POSTFIELDS => "idMacroArea=null&codiceAziendaSanitaria=200102&idAreaVasta=null&idPresidio=102MAD02&idProntoSoccorso=30&tipoProntoSoccorso=TUTTI&vicini=null&xhr=true",
CURLOPT_HTTPHEADER => array(
"cache-control: no-cache",
"content-type: application/x-www-form-urlencoded"
),
));
$server_output = curl_exec ($ch);
curl_close ($ch);
$jsonObject = new JsonPath\JsonObject($server_output);
$jsonPathExpr = '$..view';
$res = $jsonObject->get($jsonPathExpr);
print $res[0];
$dom = new DOMDocument();
@$dom->loadHTML(json_encode($res[0]));
$xpath = new DOMXPath($dom);
$xpath_for_parsing = '/html/body/div[1]/div/div/div/table/tbody/tr[2]/td[4]';
$colorWaitingNumber = $xpath->query($xpath_for_parsing);
$theValue = 'N.D.';
foreach( $colorWaitingNumber as $node )
{
$theValue = $node->nodeValue;
}
print $theValue;
?>
The result is in the following image
where the table is the result of the command in my code ...
print $res[0];
and
N.D
is the result when I try to parse to extract one of my desired value
I have checked the XPath I'm using against the page source code.
What am I doing wrong?
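I've solved it!
My original code was almost right, except for one mistake.
You have to comment out this line ...
//@$dom->loadHTML(json_encode($res[0]));
and replace it with this one:
@$dom->loadHTML($res[0]);
and everything works. $res[0] is already an HTML string; json_encode() wraps it in double quotes and escapes characters such as "/", so DOMDocument was no longer given the original markup.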
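The difference between the two lines can be seen in a small sketch. The fragment below is a made-up stand-in for the HTML returned in the JSON "view" field, and secondCell is a hypothetical helper, not code from the thread.

```php
<?php
// Made-up stand-in for the HTML fragment held in $res[0].
$fragment = '<table><tr><td>ROSSO</td><td>3</td></tr></table>';

// Hypothetical helper: returns the text of the second cell of the first
// table row, or 'N.D.' when the query matches nothing.
function secondCell(string $markup): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//table//tr[1]/td[2]');
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : 'N.D.';
}

// json_encode() turns the markup into a JSON string literal: it adds
// surrounding double quotes and escapes every "/" as "\/".
echo json_encode($fragment), "\n";

// Parsing the raw fragment gives the expected cell value.
echo secondCell($fragment), "\n";
```

Because the encoded version no longer contains valid closing tags such as </td>, the parser builds a different tree from it, and an absolute XPath written against the real page source stops matching.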
I am trying to fetch a hotel's booking.com page so that I can extract the prices afterwards with a regex. The problem is the following:
I call file_get_contents with parameters such as the check-in and check-out dates (file_get_contents("/hotel/at/myhotel.html?checkin=2017-10-12&checkout=2017-10-13")) so that the prices are shown, as they are to a visitor. If I view the source code in the browser, I see the entry:
b_this_url : '/hotel/at/myhotel.html?label=gen173nr-1FCAsoDkIcbmV1ZS1wb3N0LWhvbHpnYXUtaW0tbGVjaHRhbEgHYgVub3JlZmgOiAEBmAEHuAEHyAEM2AEB6AEB-AEDkgIBeagCAw;sid=58ccf750fc4acb908e20f0f28544c903;checkin=2017-10-12;checkout=2017-10-13;dist=0;sb_price_type=total;type=total&',
If I echo the string from file_get_contents the string looks like:
b_this_url : '/hotel/at/myhotel.html',
So all the parameters that I passed in the URL to file_get_contents are gone, and therefore my regex cannot find any prices on the page ...
Does anyone have a solution for this problem?
The webpage is not generated entirely server-side; it relies heavily on JavaScript after the HTML loads. If you want the page as it appears in the browser, I think you should use PHP cURL instead of file_get_contents() for this kind of web scraping. I generated the code below with Postman (a Google Chrome extension / standalone desktop app) for your given URL. The response contains the full URL with its parameters.
<?php
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => "https://www.booking.com/hotel/at/hilton-innsbruck.de.html?checkin=2017-10-10%3Bcheckout%3D2017-10-11",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => "GET",
CURLOPT_HTTPHEADER => array(
"cache-control: no-cache",
"postman-token: 581a75a7-6600-6ed6-75fd-5fb09c25d927"
),
));
$response = curl_exec($curl);
$err = curl_error($curl);
curl_close($curl);
if ($err) {
echo "cURL Error #:" . $err;
} else {
echo $response;
}
I need to scrape an ASP website using cURL. My hosting does not allow me to turn off safe_mode or open_basedir, so CURLOPT_FOLLOWLOCATION cannot be activated (it throws the error "CURLOPT_FOLLOWLOCATION cannot be activated when an open_basedir is set").
I have tried to implement a workaround, but after several unsuccessful days I am getting desperate. How can I change the code below to follow redirects manually instead of using CURLOPT_FOLLOWLOCATION?
include_once __DIR__.'/simple_html_dom.php';
define('COOKIE_FILE', __DIR__.'/cookie.txt');
@unlink(COOKIE_FILE); //clear cookies before we start
define('CURL_LOG_FILE', __DIR__.'/request.txt');
@unlink(CURL_LOG_FILE);//clear curl log
class ASPBrowser {
public $exclude = array();
public $lastUrl = '';
public $dom = false;
/** Get simplehtmldom object from url
* @param $url
* @param $post
* @return bool|simple_html_dom
*/
public function getDom($url, $post = false) {
$f = fopen(CURL_LOG_FILE, 'a+'); // curl session log file
if($this->lastUrl) $header[] = "Referer: {$this->lastUrl}";
$curlOptions = array(
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_AUTOREFERER => 1,
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_URL => $url,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_SSL_VERIFYHOST => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 9,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_HEADER => 0,
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
CURLOPT_COOKIEFILE => COOKIE_FILE,
CURLOPT_COOKIEJAR => COOKIE_FILE,
CURLOPT_STDERR => $f, // log session
CURLOPT_VERBOSE => true,
);
if($post) { // add post options
$curlOptions[CURLOPT_POSTFIELDS] = $post;
$curlOptions[CURLOPT_POST] = true;
}
$curl = curl_init();
curl_setopt_array($curl, $curlOptions);
$data = curl_exec($curl);
$this->lastUrl = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL); // get url we've been redirected to
curl_close($curl);
if($this->dom) {
$this->dom->clear();
$this->dom = false;
}
$dom = $this->dom = str_get_html($data);
fwrite($f, "{$post}\n\n");
fwrite($f, "-----------------------------------------------------------\n\n");
fclose($f);
return $dom;
}
function createASPPostParams($dom, array $params) {
$postData = $dom->find('input,select,textarea');
$postFields = array();
foreach($postData as $d) {
$name = $d->name;
if(trim($name) == '' || in_array($name, $this->exclude)) continue;
$value = isset($params[$name]) ? $params[$name] : $d->value;
$postFields[] = rawurlencode($name).'='.rawurlencode($value);
}
$postFields = implode('&', $postFields);
return $postFields;
}
function doPostRequest($url, array $params) {
$post = $this->createASPPostParams($this->dom, $params);
return $this->getDom($url, $post);
}
function doPostBack($url, $eventTarget, $eventArgument = '') {
return $this->doPostRequest($url, array(
'__EVENTTARGET' => $eventTarget,
'__EVENTARGUMENT' => $eventArgument
));
}
function doGetRequest($url) {
return $this->getDom($url);
}
}
(Credits: Andrey http://256cats.com/scraping-asp-websites-php-dopostback-ajax-emulation/)
You're probably looking for the CURLINFO_REDIRECT_URL info variable, which returns the URL that cURL would otherwise have redirected to if you had allowed it. It was added in PHP 5.3.7.
Note that the exact 3xx response code also affects whether the HTTP request method is supposed to change when you follow a redirect. See RFC 7231, section 6.4, for details.
See also the libcurl docs for CURLINFO_REDIRECT_URL.
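A manual redirect loop built on CURLINFO_REDIRECT_URL could be sketched as follows. The function names curlFetchOnce and fetchFollowingRedirects are hypothetical, and this simplified version always re-requests with GET and leaves out the cookie, logging and POST options of the ASPBrowser class; those would need to be merged in for the real scraper.

```php
<?php
// Performs ONE request without following redirects.
// Returns [body, redirectUrl]; redirectUrl is null when the server did
// not ask for a redirect.
function curlFetchOnce(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false, // allowed even when open_basedir is set
    ]);
    $body = curl_exec($ch);
    // The URL cURL would have followed, or an empty value if there is none.
    $redirectUrl = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    curl_close($ch);
    return [$body, $redirectUrl ?: null];
}

// Hypothetical helper: repeats single requests until the server stops
// redirecting or the redirect limit is exceeded. $fetchOnce is injectable
// so the loop can be exercised without network access.
function fetchFollowingRedirects(string $url, int $maxRedirects = 9, ?callable $fetchOnce = null): array
{
    $fetchOnce = $fetchOnce ?? 'curlFetchOnce';
    for ($hops = 0; $hops <= $maxRedirects; $hops++) {
        [$body, $next] = $fetchOnce($url);
        if ($next === null) {
            // Final destination reached; 'url' plays the role of
            // CURLINFO_EFFECTIVE_URL in the original class.
            return ['url' => $url, 'body' => $body];
        }
        $url = $next;
    }
    throw new RuntimeException("More than {$maxRedirects} redirects");
}

// Usage (needs network access):
// $result = fetchFollowingRedirects('http://example.com/');
```

Inside getDom(), the returned 'url' could be stored in $this->lastUrl and the returned 'body' passed to str_get_html(), with CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS removed from the options array.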
I am trying to retrieve HTML with file_get_contents in PHP and then save it to a file so that I can include it in my homepage.
Unfortunately, my script isn't saving the data to the file. I also need to overwrite this data daily, as it will be set up as a cron job.
Can anyone tell me where I am going wrong, please? I am just learning PHP :-)
<?php
$richSnippets = file_get_contents('http://website.com/data');
$filename = 'reviews.txt';
$handle = fopen($filename,"x+");
$somecontent = echo $richSnippets;
fwrite($handle,$somecontent);
echo "Success";
fclose($handle);
?>
A couple of things,
http://website.com/data gets a 404 error, it doesn't exist.
Change your code to
$site = 'http://www.google.com';
$homepage = file_get_contents($site);
$filename = 'reviews.txt';
$handle = fopen($filename,"w");
fwrite($handle,$homepage);
echo "Success";
fclose($handle);
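Remove $somecontent = echo $richSnippets; because echo is a language construct that returns nothing, so this line is a syntax error.
Note also that the "x+" mode makes fopen() fail when the file already exists, whereas "w" truncates and overwrites it, which is what a daily cron job needs.
If you have the proper permissions, it should work.
Be sure that you're pointing to an existing web page.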
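As an alternative to the fopen/fwrite/fclose sequence, the same overwrite-on-every-run behaviour can be sketched with file_put_contents(). The helper name saveSnapshot and the temp-directory path are assumptions for illustration; in the real script the content would come from the fetch call.

```php
<?php
// Hypothetical helper: writes $content to $filename, replacing whatever
// the file held before, and reports failure instead of staying silent.
function saveSnapshot(string $filename, string $content): void
{
    // file_put_contents() opens, truncates, writes and closes in one call;
    // LOCK_EX takes an exclusive lock while writing.
    if (file_put_contents($filename, $content, LOCK_EX) === false) {
        throw new RuntimeException("Could not write to {$filename}");
    }
}

// The temp directory is used here only so the sketch runs anywhere.
$filename = sys_get_temp_dir() . '/reviews.txt';

saveSnapshot($filename, "first run\n");
saveSnapshot($filename, "second run\n"); // overwrites instead of failing

echo file_get_contents($filename);
```

Checking the return value matters because both the fetch and the write return false on failure, and printing "Success" unconditionally hides that.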
Edit
When cURL is enabled you can use the following function
function get_web_page( $url ){
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
curl_close( $ch );
return $content;
}
Now change
$homepage = file_get_contents($site);
in to
$homepage = get_web_page($site);
You should use / instead of ****
$homepage = file_get_contents('http://website.com/data');
Also this part
$somecontent = echo $richSnippets;
I don't see $richSnippets above... it's probably not declared?
You probably want to do this:
fwrite($handle,$homepage);
I'm crawling a website with cURL and the PHP DOM extension. The site has a products section where you can page through all the products, plus subsections for narrower searches; each page lists 9 products.
I need to store which subsection each product belongs to. I start with all the subsection URLs, and the program below shows how I try to get the next 9-product page of a subsection.
The problem is that the site redirects using some information that I suppose is stored in a cookie, because there are no POST requests in the network trace.
For example: In the ALL PRODUCTS section the URL of the second page is like:
www.example.com/product/?n=2
The first page of any subsection has a unique URL like:
www.example.com/product/subsection
The problem is that the link to the next subsection page (next 9 products) is
www.example.com/product/?n=2
The URL is THE SAME as in the all-products section, but it shows the subsection's products.
The problem is that I get the ALL PRODUCTS page instead of the SUBSECTION page.
I have tried using cookies, but I get the same results. Any suggestions?
<?php
private $ckfile;
public function main()
{
$this->ckfile = tempnam ("C:/Web/", "CURLCOOKIE");
$copy = $this->get_page();
$next_visit = $this->link_next($copy);
while($next_visit != false){//it's not last page
$copy = $this->get_page($next_visit, $this->get_name($next_visit));
$next_visit = $this->link_next($copy);
}
}
public function get_page($URL = "http://www.example.com" , $nombre = "example" )
{
$ch = curl_init();
$options = array(
CURLOPT_HTTPHEADER => array("Accept-Language: es-es,en"),
CURLOPT_USERAGENT => "Googlebot/2.1 (+http://www.google.com/bot.html)",
CURLOPT_AUTOREFERER => true, // set referer on redirect ,
CURLOPT_ENCODING => "", //allow all encodings
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_HEADER => false,
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_COOKIEFILE => $this->ckfile,
CURLOPT_COOKIEJAR => $this->ckfile,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_URL => $URL
);
curl_setopt_array($ch, $options);
$g = 'C:/Web/'.$nombre.'.html';
if(!is_file($g)){
$fp=fopen ($g, "w");
curl_setopt ($ch,CURLOPT_FILE, $fp);
$trash = curl_exec ($ch); // don't browse them
fclose($fp);
}
curl_close ($ch);
return $g;
}
public function link_next($value)
{
# function that searches the DOM for a link and returns a well formed URL
# or returns false if doesn't find one( last page)
}
?>
To make multiple calls you want to use curl multi:
$ch = curl_multi_init();
Not
$ch = curl_init();
See the post "Multiple PHP cUrl posts to same page" for an example.