I'm trying to practice cURL, but it isn't going well.
Please tell me what's wrong.
Here is my code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://xxxxxxx.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, "Google Bot");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$downloaded_page = curl_exec($ch);
curl_close($ch);
preg_match_all('/<div\s* class =\"abc\">(.*)<\/div>/', $downloaded_page, $title);
echo "<pre>";
print($title[1]);
echo "</pre>";
The warning I get is: Notice: Array to string conversion.
The HTML I want to parse looks like this:
<div class="abc">
<ul> blablabla </ul>
<ul> blablabla </ul>
<ul> blablabla </ul>
</div>
preg_match_all returns an array of arrays.
If your code is:
preg_match_all('/<div\s+class="abc">(.*)<\/div>/', $downloaded_page, $title);
you actually want to do the following:
echo "<pre>";
foreach ($title[1] as $realtitle) {
echo $realtitle . "\n";
}
echo "</pre>";
You want the loop because preg_match_all finds every div that has class "abc", not just the first one. I also suggest you harden your regex to make it more robust:
preg_match_all('/<div[^>]+class="abc"[^>]*>(.*?)<\/div>/s', $downloaded_page, $title);
This will match <div class="abc"> as well as divs that carry other attributes around the class. The s modifier lets the dot match the newlines inside your div, and the lazy (.*?) stops at the first closing </div> instead of swallowing everything up to the last one.
BTW: DOMDocument is slow as hell; I've found that regexes can sometimes (depending on the size of your document) give a 40x speed increase. Just keep it simple.
Best,
Nicolas
Don't parse HTML with regex.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.lipsum.com/');
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
// foreach ($xpath->query('//div') as $div) { // all divs in the HTML
foreach ($xpath->query('//div[contains(@class, "abc")]') as $div) { // all divs that have "abc" in their class
    echo $div->nodeValue; // nodeValue holds the div's text content
}
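If you need the div's inner HTML rather than its plain text, here is a minimal sketch along the same lines (it assumes PHP 5.3.6+, where saveHTML() accepts a node argument):
foreach ($xpath->query('//div[contains(@class, "abc")]') as $div) {
    $innerHtml = '';
    foreach ($div->childNodes as $child) {
        $innerHtml .= $dom->saveHTML($child); // serialize each child node back to HTML
    }
    echo $innerHtml;
}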
I'm trying to scrape an HTML page using this PHP code:
<?php
ini_set('display_errors', 1);
$url = 'http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
Everything seems to work (no errors, and in my browser console I can see 200 as the return code), but nothing is printed in my HTML page.
The problem is probably the XPath /html/body/div/div/div[4]/div[3]/section/p, which should point to the first green line in the source HTML page, but that is exactly what Firefox's Firebug reports for that page section.
Suggestions / examples?
UPDATE
As Santosh Sapkota suggests in his reply, the first problem is that the text inside that green box is loaded from an iframe. I found the URL of the HTML page loaded inside the iframe, so I tried using that one in my code, which is now:
<?php
ini_set('display_errors', 1);
$url = 'http://listeps.cittadellasalute.to.it/?id=01090101';
//#Set CURL parameters: pay attention to the PROXY config !!!!
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_PROXY, '');
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
$greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');
foreach( $greenWaitingNumber as $node )
{
echo "Number first green line: " .$node->nodeValue;
echo '<br>';
echo '<br>';
}
?>
but unfortunately nothing is printed in my output HTML page.
Other suggestions / examples?
There must be a problem with your XPath. Also check whether the content is loaded from an iframe or not.
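A quick way to check both points is to dump what DOMDocument actually parsed and to start from a relative XPath instead of the absolute /html/body/... path copied from Firebug. The //p query below is only an illustration, since I don't know the real structure of the iframe page:
$dom = new DOMDocument();
@$dom->loadHTML($data); // suppress warnings from malformed HTML
// echo htmlspecialchars($dom->saveHTML()); // uncomment to inspect what was really fetched
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//p') as $p) { // every <p> in the document, wherever it sits
    echo $p->nodeValue . '<br>';
}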
I want to get the title of a webpage with file_get_contents().
I tried:
$get = get_meta_tags("http://example.com");
echo $get["title"];
but it doesn't match.
What is wrong with it?
The title tag is not part of what the get_meta_tags() function returns, since title is not a meta tag.
Try this:
$get=file_get_contents("http://example.com");
preg_match("#<title>(.*?)</title>#i,$get,$matches);
print_r($matches);
Regex #<title>(.*?)</title>#i matches the title string.
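If you only need the title text itself, the first capture group holds it (assuming the page actually has a title tag and the pattern matched):
$title = isset($matches[1]) ? trim($matches[1]) : '';
echo $title;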
Use the code snippet below to get the webpage title.
<?php
function curl_file_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$targetUrl = "http://google.com/";
$html = curl_file_get_contents($targetUrl);
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from malformed HTML
$nodes = $doc->getElementsByTagName('title');
$page_title = $nodes->length ? $nodes->item(0)->nodeValue : ''; // guard against pages with no <title>
echo "Title: $page_title" . '<br/><br/>';
?>
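One small caveat: curl_exec() returns false on failure, so the helper above can hand back false instead of HTML. A short guard before parsing avoids a confusing empty title (a sketch, reusing the same helper):
$html = curl_file_get_contents($targetUrl);
if ($html === false || $html === '') {
    die('Could not download ' . $targetUrl);
}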
I've been playing with the PHP Simple HTML DOM Parser (manual here: http://simplehtmldom.sourceforge.net/manual.htm) and I've had success with some tests, except for this one.
The page has nested tables and spans, and I would like to parse the outer text of the span with class mynum.
<?php
require_once 'simple_html_dom.php';
$url = 'http://relumastudio.com/test/target.html';
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
$DEBUG = 1;
if($DEBUG){
$html = new simple_html_dom();
$html->load($url);
echo $html->find('span[class=mynum]',0)->outertext; // I should get 123456
}else{
echo $result;
}
curl_close($ch);
I thought I could get away with just one call to echo $html->find('span[class=mynum]',0)->outertext; to get the text 123456, but I can't.
Any ideas? Any help is greatly appreciated. Thank you.
Load the url properly first. Then use ->innertext in this case:
$url = 'http://relumastudio.com/test/target.html';
$html = file_get_html($url);
$num = $html->find('span.mynum', 0)->innertext;
echo $num;
You need innertext.
$html = new simple_html_dom();
$html->load_file($url);
echo $html->find('span[class=mynum]',0)->innertext;
outertext returns <span class="mynum">123456</span>
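If the span ever contains nested markup and you only want the visible text, Simple HTML DOM also exposes plaintext, which strips the tags; a small sketch in the same spirit:
$html = new simple_html_dom();
$html->load_file($url); // fetch and parse the page itself, not the URL string
$span = $html->find('span[class=mynum]', 0);
echo $span ? $span->plaintext : ''; // text content with any nested tags removed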
I am making a PHP scraper and have the following piece of code, which grabs the title from the page by looking inside the span uiButtonText. However, I now want to scan for a hyperlink and capture it with preg_match using (.*).
I want the stars to be wildcards so that I can get the hyperlink from the page even if the href and onclick change for each one.
if (preg_match("/<span class=\"uiButtonText\">(.*)<\/span>/i", $cache, $matches)){print($matches[1] . "\n");}else {}
My Full Code:
<?php
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$url = "http://www.facebook.com/MauiNuiBotanicalGardens/info";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$cache = $html;
if (preg_match("/<span class=\"uiButtonText\">(.*)<\/span>/i", $cache, $matches)) {print($matches[1] . "\n");}else {}
?>
If you want to stick with your regex, try this (the href and onclick values below are just placeholders; the wildcard parts match whatever is actually there):
$html = '<span class="uiButtonText"><a href="http://example.com" class="thelink" onclick="doSomething();">Google!</a></span>';
preg_match("/<span class=\"uiButtonText\"><a href=\"[^\"]*\" class=\"thelink\" onclick=\"[^\"]*\">(.*?)<\/a><\/span>/i", $html, $matches);
print_r($matches[1]);
Output: Google!
A better way would be to use PHP Simple HTML DOM Parser and doing something like this:
$html = file_get_html("http://www.facebook.com/MauiNuiBotanicalGardens/info");
foreach($html->find("a.thelink") as $link){
echo $link->innertext . "<BR>";
}
Above is not tested, but should work
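If it's the href itself you're after rather than the link text, Simple HTML DOM exposes attributes as properties, so something along these lines should work (also untested, and it assumes the links sit inside the uiButtonText spans):
$html = file_get_html("http://www.facebook.com/MauiNuiBotanicalGardens/info");
foreach ($html->find("span.uiButtonText a") as $link) {
    echo $link->href . "<BR>"; // the hyperlink, whatever the href happens to be
    echo $link->onclick . "<BR>"; // the onclick attribute, if you need that too
}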
I'm having trouble scraping the URLs out of Google results. This code worked for me for a long time, but it seems Google changed a few things this week and now I'm getting a ton of extra characters surrounding the actual URL I want.
preg_match_all('#<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>#siU', $results, $matches[$key]);
EDIT
All links come out like this when scraped with the above code:
/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // @ suppresses warnings from malformed HTML
foreach($dom->getElementsByTagName('a') as $link) {
echo $link->getAttribute('href');
echo "<br />";
}
?>
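To strip the /url?url=... wrapper shown above and recover the real destination, you can parse each href's query string; a minimal sketch (it assumes the parameter is called url, as in the example link, with q as a fallback seen on older result pages):
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $query = parse_url($href, PHP_URL_QUERY); // everything after the '?'
    if (!$query) {
        continue; // not a /url?... redirect link
    }
    parse_str($query, $params);
    if (isset($params['url'])) {
        echo $params['url'] . "<br />"; // the real destination URL
    } elseif (isset($params['q'])) {
        echo $params['q'] . "<br />";
    }
}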