PHP Simple HTML DOM parser - parsing nested elements

I've been playing with PHP Simple HTML DOM Parser, following the manual at http://simplehtmldom.sourceforge.net/manual.htm, and most of my tests worked except this one:
The page has nested tables and spans, and I would like to parse the outer text of the span with class mynum.
<?php
require_once 'simple_html_dom.php';
$url = 'http://relumastudio.com/test/target.html';
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
$DEBUG = 1;
if($DEBUG){
$html = new simple_html_dom();
$html->load($url);
echo $html->find('span[class=mynum]',0)->outertext; // I should get 123456
}else{
echo $result;
}
curl_close($ch);
I thought I could get away with just one call to echo $html->find('span[class=mynum]',0)->outertext; to get the text 123456, but I can't.
Any ideas? Any help is greatly appreciated. Thank You.

Load the url properly first. Then use ->innertext in this case:
$url = 'http://relumastudio.com/test/target.html';
$html = file_get_html($url);
$num = $html->find('span.mynum', 0)->innertext;
echo $num;

You need innertext.
$html = new simple_html_dom();
$html->load_file($url);
echo $html->find('span[class=mynum]',0)->innertext;
outertext returns <span class="mynum">123456</span>
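The difference between the inner and outer form of an element can be checked without fetching anything. The following is a minimal sketch that uses PHP's built-in DOMDocument as a stand-in (it is not Simple HTML DOM), and the sample markup is a hypothetical reduction of the nested tables in the question.

```php
<?php
// Hypothetical sample markup standing in for the target page.
$sample = '<table><tr><td><table><tr><td>'
        . '<span class="mynum">123456</span>'
        . '</td></tr></table></td></tr></table>';

$dom = new DOMDocument();
// Suppress warnings that libxml may raise on a bare fragment.
@$dom->loadHTML($sample);

$xpath = new DOMXPath($dom);
$span  = $xpath->query('//span[@class="mynum"]')->item(0);

$inner = $span->textContent;     // only the text inside the element
$outer = $dom->saveHTML($span);  // the element including its own tags

echo $inner, "\n";
echo $outer, "\n";
```

Here $inner plays the role of innertext and $outer the role of outertext, which is why only the former yields the bare number.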

Related

Trying to scrape kickasstorrents with simple html dom

I am trying to scrape kickasstorrents with Simple HTML DOM, but I am getting an error before I have even started. I followed some Simple HTML DOM tutorials, set up my URL, and am using cURL.
Code is as follows:
<?php
require('inc/config.php');
include_once('inc/simple_html_dom.php');
function scrap_kat() {
// initialize curl
$html = 'http://katcr.to/new/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $html);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$ip=rand(0,255).'.'.rand(0,255).'.'.rand(0,255).'.'.rand(0,255);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/".rand(3,5).".".rand(0,3)." (Windows NT ".rand(3,5).".".rand(0,2)."; rv:2.0.1) Gecko/20100101 Firefox/".rand(3,5).".0.1");
$html2 = curl_exec($ch);
if($html2 === false)
{
echo 'Curl error: ' . curl_error($ch);
}
else
{
// create HTML DOM
$kat = file_get_contents($html);
}
curl_close($ch);
// scripting starts
// clean up memory
$kat->clear();
unset($kat);
// return information
return $ret;
}
$ret = scrap_kat();
echo $ret;
?>
I receive this error:
Fatal error: Call to a member function clear() on resource in C:\wamp64\www\index.php on line 36
What am I doing wrong?
Thanks.
simple_html_dom is a class. The clear() method belongs to that class or to the simple_html_dom_node class, so to call it you need an object of the simple_html_dom class.
@Hassaan is correct. file_get_contents is a native PHP function that returns a string; you have to create an object of the simple_html_dom class, like this:
$html = new simple_html_dom();
And use the code below.
function scrap_kat() {
$url = 'http://katcr.to/new/';
// $timeout= 120;
# create object
$html = new simple_html_dom();
#### CURL BLOCK ####
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/".rand(3,5).".".rand(0,3)." (Windows NT ".rand(3,5).".".rand(0,2)."; rv:2.0.1) Gecko/20100101 Firefox/".rand(3,5).".0.1");
//curl_setopt($curl, CURLOPT_TIMEOUT, $timeout);
$ip=rand(0,255).'.'.rand(0,255).'.'.rand(0,255).'.'.rand(0,255);
curl_setopt($curl, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
$content = curl_exec($curl);
curl_close($curl);
# note the variable change.
# load the curl string into the object.
$html->load($content);
//echo $ip;
#### END CURL BLOCK ####
print_r($html->find('a'));
// clean up memory
$html->clear();
unset($html);
}
scrap_kat();
Well, there are a lot of errors in your code, so I am just telling you how you can do this. If you need an explanation, please comment below this answer and I will add one.
file_get_contents is PHP's built-in function. For Simple HTML DOM you can use file_get_html.
Replace
$kat = file_get_contents($html);
with
$kat = file_get_html($html);
Why are you returning $ret, as the code in your question does? There is no variable $ret in your function scrap_kat().
You can return $kat instead of $ret, and don't unset($kat).
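The core of both answers is that a download yields a plain string, while methods such as find() or clear() exist only on a parser object. A minimal sketch of keeping the two apart follows; linkTexts() is a hypothetical helper, DOMDocument stands in for Simple HTML DOM, and the inline markup replaces the string that curl_exec() would return.

```php
<?php
// linkTexts() is a hypothetical helper; DOMDocument stands in for Simple HTML DOM.
function linkTexts(string $rawHtml): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($rawHtml);  // turn the raw string into a parser object

    $texts = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $texts[] = trim($a->textContent);
    }
    // Return plain data, so the caller never depends on the parser object.
    return $texts;
}

// In a real script this string would come from curl_exec().
$rawHtml = '<ul><li><a href="/x">First</a></li><li><a href="/y">Second</a></li></ul>';

$ret = linkTexts($rawHtml);
echo implode(', ', $ret), "\n";
```

Returning an array rather than the parser object also sidesteps the question of when to free the object.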

My xpath query is not returning any results

I am trying to scrape some data from Yahoo, but the XPath query returns a result of length 0 when I var_dump it. Here's a portion of my scraping code.
error_reporting(0);
function curl($url) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)');
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_AUTOREFERER, false);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 200);
return curl_exec($curl);
}
$page = curl('https://www.yahoo.com');
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
$link = $xpath->query('//li[@style="background-color:#fafaff;"]/div/div/div/h3/a');
foreach ($link as $links) {
$get_title[] = $links->nodeValue;
$get_link[] = $links->getAttribute('href');
}
This code has no syntax errors, but there is a logical error.
Your code is working correctly. The problem is that the HTML returned by Yahoo.com simply doesn't contain any li elements that match your selector. You can see this by looking at the contents of $page.
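One way to confirm that the selector, rather than the code, is at fault is to run the query against a small piece of known markup and compare it with a looser path. The sketch below does that; the markup is hypothetical and is not what Yahoo actually returns.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$page = '<ul><li style="background-color:#fafaff;"><div><h3>'
      . '<a href="/story">Story</a></h3></div></li></ul>';

$dom = new DOMDocument();
@$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

// Strict path: requires three nested div elements, which the sample lacks.
$strict = $xpath->query('//li[@style="background-color:#fafaff;"]/div/div/div/h3/a');
// Looser path: any h3 > a anywhere below a li.
$loose = $xpath->query('//li//h3/a');

echo 'strict: ', $strict->length, "\n";
echo 'loose: ', $loose->length, "\n";
```

If the loose query finds nodes and the strict one does not, the document structure differs from what the path assumes.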
I checked each and every thing, but in the end I found another solution. This code is not working, so it's rubbish. The exact way to scrape data from Yahoo is simple: using Ajax you can easily scrape the data. First load the Yahoo page, and then scrape anything you need with the help of Ajax.
Thanks to all who responded to my question.

Curl with Simple HTML DOM using Link Pagination

I want to combine Curl and Simple HTML DOM.
Both are working fine separately.
I want to fetch a site with cURL and then look into the inner data using DOM, following the pagination page numbers.
I am using this code.
<?php
include 'simple_html_dom.php';
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
return $dom;
}
$url = 'http://example.com/';
$data = dlPage($url);
// echo $data;
#######################################################
$startpage = 1;
$endpage = 3;
for ($p=$startpage;$p<=$endpage;$p++) {
$html = file_get_html("http://example.com/page/$p.html"); // double quotes so that $p is interpolated
// connect to main page links
foreach ($html->find('div#link a') as $link) {
$linkHref = $link->href;
//loop through each link
$linkHtml = file_get_html($linkHref);
// parsing inner data
foreach($linkHtml->find('h1') as $title) {
echo $title;
}
foreach ($linkHtml->find('div#data') as $description) {
echo $description;
}
}
}
?>
How can I combine this to make it work as one single script?
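One possible way to combine the two parts is to route every download through a single fetch function and keep the pagination loop separate from the parsing. The sketch below is only an illustration under that assumption: pageUrls(), linksIn(), and crawl() are hypothetical helpers, DOMDocument stands in for Simple HTML DOM, and a fake fetcher replaces cURL so the code runs offline. In the script from the question, the same idea would amount to calling dlPage() wherever file_get_html() is used.

```php
<?php
// pageUrls(), linksIn() and crawl() are hypothetical helpers for illustration.

// Build the list of paginated URLs.
function pageUrls(string $base, int $start, int $end): array
{
    $urls = [];
    for ($p = $start; $p <= $end; $p++) {
        $urls[] = "{$base}/page/{$p}.html";
    }
    return $urls;
}

// Collect the href of every link inside the div with the given id.
function linksIn(string $html, string $containerId): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $hrefs = [];
    foreach ($xpath->query("//div[@id='{$containerId}']//a[@href]") as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Walk all pages; $fetch is any callable that maps a URL to an HTML string.
function crawl(array $urls, callable $fetch): array
{
    $found = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        if (!is_string($html) || $html === '') {
            continue; // skip pages that failed to download
        }
        foreach (linksIn($html, 'link') as $href) {
            $found[] = $href;
        }
    }
    return $found;
}

// Stand-in fetcher so the sketch runs without network access.
$fakeFetch = function (string $url): string {
    return '<div id="link"><a href="' . $url . '#detail">more</a></div>';
};

$found = crawl(pageUrls('http://example.com', 1, 3), $fakeFetch);
print_r($found);
```

Swapping $fakeFetch for a cURL-based function changes nothing else in the loop, which is the point of passing the fetcher in as a parameter.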

how to use curl and preg_match_all div content

I am trying to practice cURL, but it isn't going well.
Please tell me what's wrong.
Here is my code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://xxxxxxx.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, "Google Bot");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$downloaded_page = curl_exec($ch);
curl_close($ch);
preg_match_all('/<div\s* class =\"abc\">(.*)<\/div>/', $downloaded_page, $title);
echo "<pre>";
print($title[1]);
echo "</pre>";
The warning is: Notice: Array to string conversion
The HTML I want to parse looks like this:
<div class="abc">
<ul> blablabla </ul>
<ul> blablabla </ul>
<ul> blablabla </ul>
</div>
preg_match_all returns an array of arrays.
If your code is:
preg_match_all('/<div\s+class="abc">(.*?)<\/div>/s', $downloaded_page, $title); // the s modifier lets . match newlines
you actually want to do the following:
echo "<pre>";
foreach ($title[1] as $realtitle) {
echo $realtitle . "\n";
}
echo "</pre>";
This is because the pattern matches every div that has the class "abc". I also suggest you harden your regex to be more robust:
preg_match_all('/<div[^>]+class="abc"[^>]*>(.*?)<\/div>/s', $downloaded_page, $title);
This also matches div tags that carry other attributes before or after the class attribute.
BTW: DomDocument is slow as hell, I found out that regexes sometimes (depending on the size of your document) can give 40x speed increase. Just keep it simple.
Best,
Nicolas
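To make the "array of arrays" shape concrete, here is a minimal sketch run against inline markup modelled on the question. The sample string and the second pattern for the ul elements are assumptions added for illustration.

```php
<?php
// Sample markup modelled on the question; the list contents are made up.
$page = "<div class=\"abc\">\n<ul> one </ul>\n<ul> two </ul>\n</div>";

// s: let . match newlines; .*? stops at the first closing </div>.
preg_match_all('/<div\s+class="abc">(.*?)<\/div>/s', $page, $title);

// $title[0] holds the full matches, $title[1] the first capture group.
foreach ($title[1] as $block) {
    preg_match_all('/<ul>(.*?)<\/ul>/s', $block, $items);
    foreach ($items[1] as $item) {
        echo trim($item), "\n";
    }
}
```

Echoing $title[1] directly is what triggers the "Array to string conversion" notice, since that entry is itself an array of captured strings.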
Don't parse HTML with regex.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.lipsum.com/');
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
# foreach ($xpath->query('//div') as $div) { // all div's in html
foreach ($xpath->query('//div[contains(@class, "abc")]') as $div) { // all div's that have "abc" classname
// $div->nodeValue contains fetched DIV content
}

Using wildcard in Preg Match

I am making a PHP scraper and have the following piece of code that grabs the title from the page by looking inside the span uiButtonText. However I want to now scan for a hyperlink and have it pregmatch (.*).
I want the stars to be wildcards so that I can get the hyperlink from the page even if the href and onclick change for each one.
if (preg_match("/<span class=\"uiButtonText\">(.*)<\/span>/i", $cache, $matches)){print($matches[1] . "\n");}else {}
My Full Code:
<?php
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$url = "http://www.facebook.com/MauiNuiBotanicalGardens/info";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
$cache = $html;
if (preg_match("/<span class=\"uiButtonText\">(.*)<\/span>/i", $cache, $matches)) {print($matches[1] . "\n");}else {}
?>
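If you want to stick with your regex, try this:
// Sample markup; the href and onclick values here are placeholders.
$html = '<span class="uiButtonText"><a href="http://www.google.com/" class="thelink" onclick="return true;">Google!</a></span>';
preg_match("/<span class=\"uiButtonText\"><a href=\".*?\" class=\"thelink\" onclick=\".*?\">(.*?)<\/a><\/span>/i", $html, $matches);
print_r($matches[1]);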
Output: Google!
A better way would be to use PHP Simple HTML DOM Parser and do something like this:
$html = file_get_html("http://www.facebook.com/MauiNuiBotanicalGardens/info");
foreach($html->find("a.thelink") as $link){
echo $link->innertext . "<BR>";
}
The above is not tested, but it should work.
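If the regex route is kept, the wildcard parts are usually safer when they cannot run past the end of an attribute or tag. The following minimal sketch uses [^"]* and [^>]* for that purpose and captures the href as well as the link text; the sample markup with two buttons is hypothetical.

```php
<?php
// Hypothetical markup with two buttons whose href and onclick values differ.
$html = '<span class="uiButtonText"><a href="/one" onclick="go(1)">First</a></span>'
      . '<span class="uiButtonText"><a href="/two" onclick="go(2)">Second</a></span>';

// [^"]* cannot cross the closing quote and [^>]* cannot cross the end of the tag,
// so each match stays inside one link even when several appear on a line.
$pattern = '/<span class="uiButtonText"><a href="([^"]*)"[^>]*>(.*?)<\/a><\/span>/i';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";  // href => link text
}
```

With PREG_SET_ORDER each entry of $matches is one complete match, which keeps the href and the text of the same link together.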
