Can't use loadHTMLFile or file_get_contents for external URL - PHP

I want to find Groupon's active deals, so I wrote a scraper like this:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
@$dom->loadHTMLFile('https://www.groupon.com/browse/new-york?category=food-and-drink&minPrice=1&maxPrice=999');
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//li[@class='slot']//a/@href");
foreach ($entries as $e) {
    echo $e->textContent . '<br />';
}
But when I run this, the page just keeps loading forever and never shows any output or error. How can I fix it? It's not just Groupon: I've tried other websites too and they don't work either. Why?
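(One thing worth checking before reaching for another tool: loadHTMLFile() and file_get_contents() both go through PHP's URL wrappers, so if allow_url_fopen is disabled, or the site stalls requests that arrive with PHP's default User-Agent, the script can hang without raising an error. A minimal diagnostic sketch; the User-Agent string here is just an example:
<?php
// Check that PHP is allowed to open remote URLs at all.
var_dump(ini_get('allow_url_fopen'));

// Fetch with an explicit timeout and a browser-like User-Agent, since some
// sites stall or block requests that arrive with PHP's default agent.
$context = stream_context_create([
    'http' => [
        'timeout'    => 10,
        'user_agent' => 'Mozilla/5.0 (X11; Linux x86_64)',
    ],
]);
$html = @file_get_contents('https://www.groupon.com', false, $context);
var_dump($html === false ? 'fetch failed' : strlen($html) . ' bytes');
)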

What about using cURL to load the page data?
I think this code will help you, but you should expect surprises on each website you want to scrape.
<?php
$dom = new DOMDocument();
$data = get_url_content('https://www.groupon.com', true);
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//label");
foreach ($entries as $e) {
    echo $e->textContent . '<br />';
}

function get_url_content($url = null, $justBody = true)
{
    /* Init cURL */
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_HTTPHEADER, []);
    $data = curl_exec($ch);
    if ($justBody) {
        // Strip the response headers: the body starts after the first blank line.
        $data = @(explode("\r\n\r\n", $data, 2))[1];
    }
    var_dump($data); // debug output
    return $data;
}
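As a side note, splitting on "\r\n\r\n" can grab the wrong chunk when the server sends a "100 Continue" or redirect headers before the final response. A sketch of a more robust variant of the same idea, using the header size cURL reports (the function name is mine):
<?php
function get_url_body($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        curl_close($ch);
        return false;
    }
    // CURLINFO_HEADER_SIZE is the exact byte length of all headers received,
    // so everything after it is the body, regardless of intermediate responses.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);
    return substr($response, $headerSize);
}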

Related

XPath: scraping further links from subpages?

I finally managed to make a script in PHP for scraping basic elements from other websites. It is super simple. This example shows how to get the title and URL.
ini_set('display_errors', 1);
$url = 'http://test123cxqwq12.000webhostapp.com/mainpage.php';
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$title = $xpath->query('/html/body/a/h1');
$source = $xpath->query('/html/body/a/@href');
for ($i = 0; $i < $source->length; $i++) {
    $new = $source->item($i)->nodeValue;
    $text = $title->item($i)->nodeValue;
    echo "<a href='$new'>$text</a>" . "<br />";
}
Page with results: http://test123cxqwq12.000webhostapp.com/scrap.php
Page with content to scrape: http://test123cxqwq12.000webhostapp.com/mainpage.php
Subpage: http://test123cxqwq12.000webhostapp.com/subpage.php
Now I would like to go a step further and take data from the subpage. So instead of taking the source from the main page as I do right now, I would like to follow that source and take another source (in this example, the google.com link) from the subpage. I'm out of ideas. Could I ask for some tips: is it possible to do this with XPath in a similar way to what I'm doing now?
I think a solution could be to store the URLs in a database, then apply your cURL and XPath functions to them.
<?php
function curlGet($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch);
    curl_close($ch);
    return $results;
}

function returnXPathObject($item) {
    $xmlPageDom = new DOMDocument();
    @$xmlPageDom->loadHTML($item);
    $xmlPageXPath = new DOMXPath($xmlPageDom);
    return $xmlPageXPath;
}

// Assuming $cxn is a PDO connection: FETCH_COLUMN returns the first column
// of each row as a plain string, so $url below is usable directly.
$allUrl = $cxn->query("SELECT * FROM yourDatabaseUrl");
$allUrl = $allUrl->fetchAll(PDO::FETCH_COLUMN);

for ($i = 0; $i < count($allUrl); $i++) {
    $url = $allUrl[$i];
    $getDom = curlGet($url);
    $getDomXpath = returnXPathObject($getDom);
    $title = $getDomXpath->query('/html/body/a/h1');
    $source = $getDomXpath->query('/html/body/a/@href');
}
I'm not sure about this answer; it's just a suggestion.
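A database isn't strictly required, though. A minimal two-level sketch reusing the curlGet() and returnXPathObject() helpers above: collect the hrefs from the main page, then fetch each subpage and run a second query against it. The subpage XPath is an assumption based on the example pages; adjust it to the real markup.
<?php
$mainXpath = returnXPathObject(curlGet('http://test123cxqwq12.000webhostapp.com/mainpage.php'));
$links = $mainXpath->query('/html/body/a/@href');

foreach ($links as $link) {
    // Fetch the subpage behind each main-page link and query it in turn.
    $subXpath = returnXPathObject(curlGet($link->nodeValue));
    foreach ($subXpath->query('/html/body/a/@href') as $deepLink) {
        echo $deepLink->nodeValue . '<br />';
    }
}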

Scraping an HTML page using XPath and PHP

I'm trying to scrape an HTML page using this PHP code:
<?php
ini_set('display_errors', 1);
$url = 'http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372';
// Set cURL parameters: pay attention to the PROXY config!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach ($greenWaitingNumber as $node)
{
    echo "Number first green line: " . $node->nodeValue;
    echo '<br>';
    echo '<br>';
}
?>
All seems to work fine (no errors, and in my browser console I can see 200 as the return code), but nothing is printed on my HTML page.
Probably the problem is the XPath /html/body/div/div/div[4]/div[3]/section/p that refers to the first green line in the source HTML page, but that is what Firefox's Firebug gives me for that page section.
Suggestions / examples?
UPDATE:
As Santosh Sapkota suggests in his reply, the first problem is that the text inside that green box is loaded from an iframe. I've found the URL of the HTML page inside the iframe, so I've tried to use it in my code, which is now:
<?php
ini_set('display_errors', 1);
$url = 'http://listeps.cittadellasalute.to.it/?id=01090101';
// Set cURL parameters: pay attention to the PROXY config!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach ($greenWaitingNumber as $node)
{
    echo "Number first green line: " . $node->nodeValue;
    echo '<br>';
    echo '<br>';
}
?>
but unfortunately nothing is printed on my output HTML page.
Any other suggestions / examples?
It must be a problem with your XPath. Also, check whether the content is loaded from an iframe or not.
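To expand on that: DOMDocument never fetches or inlines iframe content, so the outer page's DOM will not contain the green-box text. A sketch of pulling the iframe URL out of the outer page first, assuming the iframe appears in the fetched HTML rather than being injected by JavaScript:
<?php
// $data holds the outer page fetched with cURL, as in the code above.
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

$frames = $xpath->query('//iframe/@src');
if ($frames->length > 0) {
    $innerUrl = $frames->item(0)->nodeValue;
    // Fetch $innerUrl with the same cURL routine, then run the
    // /html/body/... query against that inner document instead.
}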

Getting site title in unknown format using PHP cURL and DOMDocument

I want to get a site's title from its URL. With most sites it works, but for Japanese and Chinese sites I get unreadable text.
Here is my function:
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
Usage:
$html = $this->file_get_contents_curl($url);
Parsing:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
I am getting this output: "ã¢ã¡ã¼ãIDç»é² ã¡ã¼ã«ã®ç¢ºèªï½Ameba(ã¢ã¡ã¼ã)"
Site URL : https://user.ameba.jp/regist/registerIntro.do?campaignId=0053&frmid=3051
Please help me out and suggest some way to get the exact site title in any language.
// Example:
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$uurl = "http://www.piaohua.com/html/xuannian/index.html";
$html = file_get_contents_curl($uurl);
// parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
// get and display what you need:
if (!empty($nodes->item(0)->nodeValue)) {
    $title = utf8_decode($nodes->item(0)->nodeValue);
} else {
    $title = $uurl;
}
echo $title;
Make sure your script is using UTF-8 encoding by adding the following line at the beginning of the file:
mb_internal_encoding('UTF-8');
After doing so, remove the utf8_decode() call from your code. Everything should work fine without it.
The DOMDocument::loadHTML() function gets the encoding from the HTML page's meta tag, so you could have problems if the page does not explicitly specify its encoding.
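In that case, a common workaround is to hint the encoding to the parser yourself, since loadHTML() otherwise falls back to ISO-8859-1. A sketch of two equivalent options:
<?php
$doc = new DOMDocument();

// Option 1: prepend an XML declaration so the parser treats the bytes as UTF-8.
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// Option 2: convert multibyte characters to entities before parsing.
// @$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

echo $doc->getElementsByTagName('title')->item(0)->nodeValue;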
Simply add this line at the top of your PHP code:
header('Content-Type: text/html;charset=utf-8');
The full code:
<?php
header('Content-Type: text/html;charset=utf-8');

function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$html = file_get_contents_curl('http://www.piaohua.com/html/lianxuju/2013/1108/27730.html');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
echo $title = $nodes->item(0)->nodeValue;

PHP DOM not accepting URL

I am trying to create a program that opens a text file with URLs separated by |. It takes the first line of the text document, crawls that URL, and removes it from the text file. Each URL is scraped by a basic crawler. I know the crawler part works, because if I enter one of the URLs in quotation marks, rather than as a variable from the text file, it works. I am at the point where nothing is returned, because the URL simply is not accepted.
This is a basic version of my code, because I had to break it down a lot to isolate the problem.
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if ($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
The code below works like a champ, tested with your example data:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
$url = $urlarray[0];

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element)
{
    $title = $element->getAttribute('title');
    $class = $element->getAttribute('class');
    if ($class == 'result_link')
    {
        $title = str_replace('Synonyms of ', '', $title);
        echo $title . "<br />";
    }
}
?>
Almost forgot: let's now put it in a loop to run through all the URLs:
<?php
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));

foreach ($urlarray as $url) {
    if (!empty($url)) {
        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL, trim($url));
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);

        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $title = $element->getAttribute('title');
            $class = $element->getAttribute('class');
            if ($class == 'result_link') {
                $title = str_replace('Synonyms of ', '', $title);
                echo $title . "<br />";
            }
        }
        echo '<hr />';
    }
}
?>
So if you put in a URL manually, $url = 'http://www.mywebsite.com';, everything works as expected?
If so, there is a problem here:
$urlarray = explode("|", $contents = file_get_contents('urls.txt'));
Are you sure urls.txt is loading? Are you sure it contains http://a.com|http://b.com etc.?
I would var_dump $contents = file_get_contents('urls.txt') before the explode statement to see if it is loading in.
If yes, then I would explode it into $urlarray and var_dump $urlarray[0].
If that looks right, I would trim it before sending it to DOM with trim($urlarray[0]).
I might even go as far as validating that these URLs are in fact URLs before sending them to DOM; see the sketch below.
Let me know the results and I will try to help further, or post all sample code including urls.txt and I will run it locally.
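For the validation step, filter_var() is simpler than a regex. A quick sketch of cleaning and checking each entry before it reaches DOM, assuming the same urls.txt format:
<?php
$contents = file_get_contents('urls.txt');
var_dump($contents); // confirm the file actually loaded

foreach (array_map('trim', explode('|', $contents)) as $url) {
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        echo "Skipping invalid entry: $url\n";
        continue;
    }
    // safe to hand $url to cURL / DOMDocument here
}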

Problem with cURL, XPath query

I need some help with my XPath query. I can get this code to work with just about every site I need to scrape, except this small part of one particular site... I just get a blank page. Does anyone have an idea how I can do this better?
$target_url = "http://www.teambuy.ca/vancouver/";
$userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the matching nodes on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body/div[@id='pagewrap']/div[@id='content']/div[@id='bottomSection']/div[@id='bottomRight']/div[@id='sideDeal']/div[2]/div/a/center/span");
foreach ($hrefs as $e) {
    $e->nodeValue;
}
$insert = $e->nodeValue;
echo "$insert";
--EDIT--
No luck...
Fatal error: Call to a member function loadHTMLfile() on a non-object in ... Line 4
$xpath_query = $dom->loadHTMLfile("http://www.teambuy.ca/vancouver/");
$hrefs = $xpath_query->evaluate("/html/body/div[7]/div[4]/div[3]/div[2]/div[1]/div[2]/div/a/center/span");
foreach ($hrefs as $e) {
    echo $e->nodeValue;
}
$insert = $e->nodeValue;
echo "$insert";
Don't use cURL; just use:
$dom->loadHTMLFile("http://www.teambuy.ca/calgary/");
Don't use:
$xpath = new DOMXPath($dom);
Just use:
$href = $dom->xpath($xpath_query);
I imagine your XPath query could be simplified as well...
Also,
foreach ($hrefs as $e) {
    $e->nodeValue;
}
does nothing. You might want to try this instead:
foreach ($hrefs as $e) {
    echo $e->nodeValue;
}
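On simplifying the query itself: long positional paths like /html/body/div[7]/... break whenever the page layout shifts by one div. A sketch of anchoring on the id the original query already passes through, assuming the sideDeal container still exists on the page:
<?php
$xpath = new DOMXPath($dom);

// Anchor on the nearest stable id instead of counting divs from <body>.
$hrefs = $xpath->query("//div[@id='sideDeal']//a/center/span");
foreach ($hrefs as $e) {
    echo $e->nodeValue;
}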
