Scraping HTML page using XPath and PHP

I'm trying to scrape an HTML page using this PHP code:
<?php
ini_set('display_errors', 1);
$url = 'http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
Everything seems to work (no errors, and my browser console shows '200' as the return code), but nothing is printed in my HTML page.
The problem is probably the XPath /html/body/div/div/div[4]/div[3]/section/p, which should refer to the first green line in the source HTML page; that is the path Firefox Firebug gives me for that section of the page.
Any suggestions or examples?
UPDATE
As Santosh Sapkota suggests in his reply, the first problem is that the text inside the green box is loaded from an iframe. I found the URL of the HTML page inside the iframe and tried using it in my code, which is now:
<?php
ini_set('display_errors', 1);
$url = 'http://listeps.cittadellasalute.to.it/?id=01090101';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
Unfortunately, still nothing is printed in my output HTML page.
Any other suggestions or examples?

The problem must be your XPath. Also check whether the content is loaded from an iframe.
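Both checks can be sketched on an HTML string. This is only an illustration: the helper names `loadHtml` and `listIframeSources` are hypothetical, and the sample markup is invented to stand in for the fetched page.

```php
<?php
// Hypothetical helper: parse an HTML string without printing libxml warnings.
function loadHtml(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings internally
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Hypothetical helper: return the src attribute of every iframe in the document.
function listIframeSources(DOMDocument $dom): array
{
    $xpath = new DOMXPath($dom);
    $sources = [];
    foreach ($xpath->query('//iframe/@src') as $attr) {
        $sources[] = $attr->nodeValue;
    }
    return $sources;
}

// Invented sample markup standing in for the fetched page.
$sample = '<html><body><section>'
    . '<iframe src="http://listeps.cittadellasalute.to.it/?id=01090101"></iframe>'
    . '</section></body></html>';

$dom = loadHtml($sample);
$xpath = new DOMXPath($dom);

// A short relative query is easier to verify than a long absolute path copied from the browser.
$paragraphs = $xpath->query('//section/p');
echo 'Paragraphs found: ' . $paragraphs->length . "\n";

// If the text lives in an iframe, its URL shows up here and has to be fetched separately.
print_r(listIframeSources($dom));
```

If the paragraph count is zero while the iframe list is not empty, the content is most likely served by the framed page. A framed page may in turn fill its content with JavaScript, which cURL does not execute.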

Related

Getting title of a webpage issue

I want to get the title of a webpage with file_get_contents(),
I tried:
$get=file_get_meta_tags("http://example.com");
echo $get["title"];
but it doesn't match.
What is wrong with it?

The title tag is not among the tags that get_meta_tags() matches, because it is not a meta tag.
Try this:
$get=file_get_contents("http://example.com");
preg_match("#<title>(.*?)</title>#i", $get, $matches);
print_r($matches);
The regex #<title>(.*?)</title>#i captures the title string.

Use the code snippet below to get the webpage title.
<?php
function curl_file_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$targetUrl = "http://google.com/";
$html = curl_file_get_contents($targetUrl);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$page_title = $nodes->item(0)->nodeValue;
echo "Title: $page_title". '<br/><br/>';
?>

Getting site title in unknown format using Php Curl and Dom-Document

I want to get a site's title from its URL. It works for most sites, but for Japanese and Chinese sites I get unreadable text.
Here is my function:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Usage:
$html = $this->file_get_contents_curl($url);
Parsing
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
I am getting this output: "ã¢ã¡ã¼ãIDç»é² ã¡ã¼ã«ã®ç¢ºèªï½Ameba(ã¢ã¡ã¼ã)"
Site URL : https://user.ameba.jp/regist/registerIntro.do?campaignId=0053&frmid=3051
Please suggest a way to get the exact site title in any language.
//example
/* MEthod----------4 */
function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$uurl="http://www.piaohua.com/html/xuannian/index.html";
$html = file_get_contents_curl($uurl);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
if(!empty($nodes->item(0)->nodeValue)){
$title = utf8_decode($nodes->item(0)->nodeValue);
}else{
$title =$uurl;
}
echo $title;

Make sure your script uses UTF-8 encoding by adding the following line to the beginning of the file:
mb_internal_encoding('UTF-8');
After doing so, remove the utf8_decode() call from your code. Everything should work fine without it.
The DOMDocument::loadHTML function takes the encoding from the HTML page's meta tag, so you could have problems if the page does not explicitly specify its encoding.
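When a page carries no charset declaration, one possible workaround is to prepend one before parsing. The sketch below assumes the downloaded bytes really are UTF-8. The helper name `extractTitleUtf8` and the sample markup are hypothetical, and a page served in another encoding (for example Shift_JIS) would need converting first.

```php
<?php
// Hypothetical helper: read the <title> of an HTML string that is assumed to be UTF-8.
function extractTitleUtf8(string $html): ?string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    // The prepended meta tag declares the encoding for pages that do not declare one themselves.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

// Invented sample page without any charset declaration.
$sample = '<html><head><title>日本語のタイトル</title></head><body></body></html>';
echo extractTitleUtf8($sample), "\n";
```

The returned string is UTF-8, so the page that prints it should also be served as UTF-8.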

Simply add this line at the top of your PHP code:
header('Content-Type: text/html;charset=utf-8');
The code:
<?php
header('Content-Type: text/html;charset=utf-8');
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl('http://www.piaohua.com/html/lianxuju/2013/1108/27730.html');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
echo $title = $nodes->item(0)->nodeValue;

How to use cURL and preg_match_all to get div content

I'm trying to practice cURL, but it isn't going well.
Please tell me what's wrong.
Here is my code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://xxxxxxx.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, "Google Bot");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$downloaded_page = curl_exec($ch);
curl_close($ch);
preg_match_all('/<div\s* class =\"abc\">(.*)<\/div>/', $downloaded_page, $title);
echo "<pre>";
print($title[1]);
echo "</pre>";
The warning is: Notice: Array to string conversion
The HTML I want to parse looks like this:
<div class="abc">
<ul> blablabla </ul>
<ul> blablabla </ul>
<ul> blablabla </ul>
</div>

preg_match_all returns an array of arrays.
If your code is:
preg_match_all('/<div\s+class="abc">(.*)<\/div>/', $downloaded_page, $title);
you actually want to do the following:
echo "<pre>";
foreach ($title[1] as $realtitle) {
echo $realtitle . "\n";
}
echo "</pre>";
This is because the pattern finds every div that has the class "abc", so there can be several matches. I also suggest making your regex more robust:
preg_match_all('/<div[^>]+class="abc"[^>]*>(.*)<\/div>/', $downloaded_page, $title);
This will match as well as
BTW: DomDocument is slow as hell, I found out that regexes sometimes (depending on the size of your document) can give 40x speed increase. Just keep it simple.
Best,
Nicolas
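The shape of the result array can be seen in a small sketch. The sample markup is invented after the question's snippet, and the pattern is one possible variant rather than the answer's exact regex. Because the div content spans several lines, the `s` modifier is needed so that `.` also matches newlines.

```php
<?php
// Invented sample markup modelled on the question's snippet.
$html = "<div class=\"abc\">\n<ul> one </ul>\n<ul> two </ul>\n</div>";

// "s" lets "." match newlines; ".*?" stops at the first closing </div>.
preg_match_all('/<div[^>]*\bclass="abc"[^>]*>(.*?)<\/div>/s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
foreach ($matches[1] as $inner) {
    echo trim($inner), "\n";
}
```

A pattern like this stops at the first `</div>`, so it gives wrong results when the target div contains nested divs.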

Don't parse HTML with regex.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.lipsum.com/');
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
# foreach ($xpath->query('//div') as $div) { // all div's in html
foreach ($xpath->query('//div[contains(@class, "abc")]') as $div) { // all div's that have "abc" classname
// $div->nodeValue contains fetched DIV content
}
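A substring test on the class attribute also matches class names that merely contain "abc", such as "abcdef". The usual XPath 1.0 idiom for matching a whole class token is sketched below; the sample markup is invented for illustration.

```php
<?php
// Invented sample markup: only the first and third div carry the class token "abc".
$html = '<div class="abc">first</div><div class="abcdef">second</div><div class="box abc">third</div>';

$previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so that " abc " can only match a complete token.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " abc ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    $texts[] = $div->nodeValue;
}
print_r($texts);
```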

Simple html dom file_get_html not working - is there any workaround?

<?php
// Report all PHP errors (see changelog)
error_reporting(E_ALL);
include('inc/simple_html_dom.php');
//base url
$base = 'https://play.google.com/store/apps';
//home page HTML
$html_base = file_get_html( $base );
//get all category links
foreach($html_base->find('a') as $element) {
echo "<pre>";
print_r( $element->href );
echo "</pre>";
}
$html_base->clear();
unset($html_base);
?>
I have the above code and I'm trying to get certain elements of the Play Store page but it isn't returning anything. Is it possible that certain PHP functions might be disabled on the server to stop that?
The above code works perfectly on other sites.
Is there any workaround?

As I said, your example is working fine for me... But try this way using curl instead:
//base url
$base = 'https://play.google.com/store/apps';
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html_base = new simple_html_dom();
// Load HTML from a string
$html_base->load($str);
//get all category links
foreach($html_base->find('a') as $element) {
echo "<pre>";
print_r( $element->href );
echo "</pre>";
}
$html_base->clear();
unset($html_base);
It gets all the links as expected.
Also make sure you have php_openssl and php_curl installed.

Remove the leading semicolon from the extension line in php.ini, then restart the Apache server to enable the module:
; Windows Extensions
...
;extension=php_openssl.dll
...

You must set "allow_url_fopen" to On in "php.ini" to allow accessing files via HTTP or FTP.
Some hosting vendors disable PHP's "allow_url_fopen" flag for security reasons.
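The settings mentioned in these answers can be checked from a script before any scraping code is blamed. This is a small sketch; the helper name `fetchCapabilities` is hypothetical.

```php
<?php
// Hypothetical helper: report which HTTP fetching options this PHP installation offers.
function fetchCapabilities(): array
{
    return [
        // file_get_contents() and file_get_html() on URLs depend on this flag.
        'allow_url_fopen' => filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN),
        // The curl_* functions need the cURL extension.
        'curl'            => extension_loaded('curl'),
        // https:// stream wrappers need the OpenSSL extension.
        'openssl'         => extension_loaded('openssl'),
    ];
}

foreach (fetchCapabilities() as $name => $enabled) {
    echo $name, ': ', $enabled ? 'available' : 'missing', "\n";
}
```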

$post = curl_init();
curl_setopt($post, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($post, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($post, CURLOPT_HEADER, 0);
curl_setopt($post,CURLOPT_RETURNTRANSFER, true);
curl_setopt($post,CURLOPT_URL,$website);
curl_setopt($post,CURLOPT_POST,1);
curl_setopt($post,CURLOPT_POSTFIELDS,"regno=$Number");
curl_setopt($post, CURLOPT_FOLLOWLOCATION, True);
curl_getinfo($post, CURLINFO_HTTP_CODE);
$curlresponse = curl_exec($post);
curl_close($post);
$dom = new DOMDocument();
$dom->loadHTML($curlresponse);
This produces the warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseStartTag: misplaced
The URL is: http://www.annauniv.edu/cgi-bin/result/cgrade.pl?regno=11210104001
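Warnings of this kind come from libxml when it meets malformed markup; the parser normally still builds a document. One way to deal with them is to collect the messages rather than let them be printed, as in this sketch. The helper name `loadHtmlCollectingErrors` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: load HTML and return the parser messages alongside the document.
function loadHtmlCollectingErrors(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer libxml errors instead of emitting warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $messages];
}

// Invented malformed markup: a second <html> tag in the middle of the body.
[$dom, $messages] = loadHtmlCollectingErrors(
    '<html><body><p>Grade: A</p><html><p>Grade: B</p></body></html>'
);
echo $dom->getElementsByTagName('p')->length, " paragraphs parsed\n";
print_r($messages);
```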

Why does this regex not match the URLs in this Google results page?

I'm having trouble scraping the URLs out of the Google results. This code worked for me for a long time, but it seems Google changed a few things this week, and now I'm getting a lot of extra characters surrounding the actual URL I want.
preg_match_all('#<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>#siU', $results, $matches[$key]);
EDIT
All links come out like this when scraped with the above code:
/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw
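The scraped value is a redirect link whose query string carries the real address in its `url` parameter. A minimal sketch of pulling it out follows. The helper name `extractTargetUrl` is hypothetical, and the fallback to a `q` parameter is an assumption, not something shown in the question.

```php
<?php
// Hypothetical helper: pull the target address out of a "/url?url=..." redirect link.
function extractTargetUrl(string $href): ?string
{
    // Links taken from raw HTML by regex may still contain &amp; entities.
    $href = html_entity_decode($href);

    $pos = strpos($href, '?');
    if ($pos === false) {
        return null; // no query string, so nothing to unwrap
    }

    parse_str(substr($href, $pos + 1), $params);

    // "url" is the parameter seen in the question; "q" is an assumed alternative.
    foreach (['url', 'q'] as $key) {
        if (isset($params[$key]) && is_string($params[$key])
            && preg_match('#^https?://#i', $params[$key])) {
            return $params[$key];
        }
    }
    return null;
}

$scraped = '/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U'
    . '&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22'
    . '&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw';
echo extractTargetUrl($scraped), "\n";
```

The helper can be applied to each captured href after the existing preg_match_all call, so the original pattern does not need to change.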

<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data);
foreach($dom->getElementsByTagName('a') as $link) {
echo $link->getAttribute('href');
echo "<br />";
}
?>
