using curl with simplehtmldom - php

Recently our host disabled allow_url_fopen, and it seems simplehtmldom needs it turned on. I saw a workaround on this site, simplehtmldom.sourceforge.net...aq.htm#hosting: "Use curl to get the page, then call "str_get_dom" to create DOM object". I tried it, but still no luck. Can you tell me if I did it properly, or am I missing something?
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'www.weather.bm/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
?>
<?php
$element = $html->find("div");
$element[66]->class = "mwraping66";
foreach($html->find('.mwraping66 img') as $e)
$doc = phpQuery::newDocumentHTML( $e ); $containers = pq('.mwraping66', $doc);
foreach ( $containers as $container ) { $div = pq('img', $container);
$div->eq(1)->removeAttr('style')->addClass('thumbnail')->html( pq( 'img', $div->eq(1))->removeAttr('height')->removeAttr('width')->removeAttr('alt') );
} print $doc;
?>
<?php
$element = $html->find("div");
$element[31]->class = "mwraping31";
foreach($html->find('.mwraping31') as $e)
echo $e->plaintext;
?>.................................
compared to:
<?php
include('simple_html_dom.php');
include ('phpQuery.php');
// Create DOM from URL
$html = file_get_html('www.weather.bm/');
?>
<?php
$element = $html->find("div");
$element[66]->class = "mwraping66";
foreach($html->find('.mwraping66 img') as $e).....
Thank you for your help.

I know it is late to answer this, but I have found similar questions and answers on this forum (see "Using simple html dom"). I am not sure whether this will answer your query, because I am also new to DOM. Try this modified simple_html_dom.php file: http://webarto.com/82/php-simple-html-dom-curl. It uses curl instead of file_get_contents. It is working for me, and its usage is the same as the original simple_html_dom.php.
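As a rough sketch of the "fetch with curl, then parse the string" approach described above, the helper below wraps the cURL calls and checks for failure before anything is handed to the parser. The function name fetch_page and its timeout parameter are hypothetical, not part of simple_html_dom. The commented usage assumes simple_html_dom.php has been included, which the cURL snippet in the question does not show.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): fetch a page body with cURL.
// Returns the body as a string, or null if the transfer failed.
function fetch_page(string $url, int $timeout = 10): ?string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);

    $body = curl_exec($curl);
    if ($body === false) {
        // Log the reason instead of passing "false" on to the parser.
        error_log('cURL error: ' . curl_error($curl));
        $body = null;
    }
    curl_close($curl);

    return $body;
}

// Usage, assuming simple_html_dom.php sits next to this script:
// require_once 'simple_html_dom.php';
// $str = fetch_page('http://www.weather.bm/');
// if ($str !== null) {
//     $html = str_get_html($str);
// }
```

Checking the return value of curl_exec() first makes it easier to tell a failed download apart from a parsing problem.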

Related

Is it possible to extract Dom Elements from htmlentities() function in php?

I appreciate the time you take to help me with my question.
I am trying to parse the HTML behind a link. I first use curl to fetch the page, then pass the result through htmlentities() so that it does not render on my page, which gives me a string. Then I use a DOM object to extract tags from it. I read about different parsing methods on Google and then ran my script, but the problem is that the string is getting saved as textContent and not as a real HTML document. How can I convert an htmlentities() string into a real DOM document and extract elements from it?
the image of the var_dump is here
here is my script:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.usatoday.com/story/news/world/2021/02/17/dubai-princess-sheikha-latifa-says-she-hostage-after-flee-attempt/6778014002/?utm_source=feedblitz&utm_medium=FeedBlitzRss&utm_campaign=usatodaycomworld-topstories');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);
curl_close($curl);
$htmlentities = htmlentities($result);
// I added the code here
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
$htmlDom->preserveWhiteSpace = false;
$styles = $htmlDom->getElementsByTagName('style');
foreach ($styles as $style) {
$item = $style->getElementsByTagName('td');
//echo the values
echo '1: '.$item->item(0)->nodeValue.'<br />';
echo '2: '.$item->item(1)->nodeValue.'<br />';
echo '3: '.$item->item(2)->nodeValue;
}
EDIT:
This is what I added to the code next:
$htmlentities = htmlentities($result);
$htmlentities = str_replace("&quot;",'"', $htmlentities);
$htmlentities = str_replace("&#039;","'", $htmlentities);
$htmlentities = str_replace("&lt;","<", $htmlentities);
$htmlentities = str_replace("&gt;",">", $htmlentities);
libxml_use_internal_errors(true);
$htmlDom = new DOMDocument();
$htmlDom->loadHTML($htmlentities);
libxml_clear_errors();
var_dump($htmlDom);
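One way to look at the problem above: htmlentities() turns markup into plain text, so DOMDocument sees no tags at all. A minimal sketch, using a small hard-coded HTML string as a stand-in for the curl result, is to parse the raw string and escape only what is echoed to the page. html_entity_decode() is the built-in inverse of htmlentities(), so the manual str_replace() calls should not be needed.

```php
<?php
// Stand-in for the string returned by curl_exec() (assumed sample markup).
$result = '<html><head><style>td { color: red; }</style></head>'
        . '<body><table><tr><td>A</td><td>B</td></tr></table></body></html>';

// Parse the RAW HTML, not the htmlentities() version.
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($result);
libxml_clear_errors();

// Collect the text of every <td> element.
$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $cells[] = $td->nodeValue;
}
print_r($cells);

// Escape only for display, so the browser shows the source instead of rendering it.
$escaped = htmlentities($result);

// If only the escaped string is available, decode it before parsing.
$roundTrip = html_entity_decode($escaped);
```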

Why do I have an empty Array as result while fetching content?

<?php
$page = file_get_contents("https://www.google.com");
preg_match('#<div id="searchform" class="jhp big">(.*?)</div>#Uis', $page, $matches);
print_r($matches);
?>
I wrote the code above to grab a specific part of another web page (in this case Google). Unfortunately it is not working, and I'm not sure why, since the regular expression itself should grab everything inside the div.
Help would be appreciated!
According to the source of the page you have pasted, no line with that structure exists. This is one of the reasons why parsing HTML with regular expressions is not recommended.
Using getElementById() seems to do what you are after:
<?php
$page = file_get_contents("https://www.google.com");
$doc = new DOMDocument();
$doc->loadHTML($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
EDIT:
You could use the code below:
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://google.com');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
$page = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$doc->loadHTML($page);
echo($page);
$result = $doc->getElementById('searchform');
print_r($result);
?>
You might need to refer to this question though since you might need to change some settings.
DOMXPath would be a better choice for you. Here is an example:
<?php
$content = file_get_contents('https://www.google.com');
//gets rid of a few things that domdocument hates
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DomXPath($doc);
$item = $xpath->query('//div[@id="searchform"]');
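XPath selects on attributes with the @ prefix, for example //div[@id="searchform"]. The sketch below runs that kind of query against a small hard-coded document (the markup is an assumption for illustration, not Google's real page) and loops over the resulting node list.

```php
<?php
// Assumed sample markup; the real page will differ.
$content = '<html><body>'
         . '<div id="searchform" class="jhp">Search box</div>'
         . '<div id="other">Something else</div>'
         . '</body></html>';

libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// @id refers to the id attribute of each div.
$items = $xpath->query('//div[@id="searchform"]');

foreach ($items as $item) {
    echo trim($item->textContent), "\n";
}
```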

Dealing with "PHP HTML DOM Parser", including the same library into two different files

I have two PHP files (same folders) that access the library simple_html_dom.php
The first one, caridefine.php, has this:
include('simple_html_dom.php');
$url = 'http://www.statistics.com/index.php?page=glossary&term_id=209';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHtml($curl_scraped_page);
$xpath = new DomXPath($dom);
print $xpath->evaluate('string(//p[preceding::b]/text())');
The second one, caridefine2.php has this:
include('simple_html_dom.php');
$url = 'http://www.statsoft.com//textbook/statistics-glossary/z/?button=0#Z Distribution (Standard Normal)';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
foreach ($html->find('/p/a [size="4"]') as $font) {
$link = $font->parent();
$paragraph = $link->parent();
$text = str_replace($link->plaintext, '', $paragraph->plaintext);
echo $text;
}
Separately, each of them worked fine, I ran the caridefine.php, it worked well, so did the caridefine2.php.
But when I tried to load these two files in other PHP files:
<div class="examples">
<?php
$this->load->view('definer/caridefine.php');
?>
</div>
<div class="examples">
<?php
$this->load->view('definer/caridefine2.php');
?>
</div>
Neither of them worked. I just got a blank page, and when I pressed CTRL+U it said: Cannot redeclare file_get_html() (previously declared in C:\xampp\htdocs\MSPN\APPLICATION\views\Definer\simple_html_dom.php:70) in C:\xampp\htdocs\MSPN\APPLICATION\views\Definer\simple_html_dom.php on line 85
I googled this problem and found that "if you load many objects without clearing the previous ones, it can be a problem."
I've tried doing $html->clear() and unset($dom). It changed nothing.
What is it that leaves me stuck at a dead end?
Thanks.
I have analysed my own problem. Here is the correction:
Change include('simple_html_dom.php'); in each file into: require_once('simple_html_dom.php');
What happened was that simple_html_dom.php was included twice, so its functions were declared twice, which PHP does not allow.
That should do it.
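To see why require_once helps, here is a small self-contained sketch. It writes a throwaway file that declares one function (a hypothetical stand-in for simple_html_dom.php) and loads it twice. With plain include the second load would fail with "Cannot redeclare", while require_once skips a file that has already been loaded.

```php
<?php
// Hypothetical stand-in for simple_html_dom.php: a temp file declaring one function.
$lib = sys_get_temp_dir() . '/demo_lib_' . uniqid() . '.php';
file_put_contents($lib, "<?php\nfunction demo_helper() { return 'ok'; }\n");

$first  = require_once $lib;   // file is loaded and demo_helper() is declared
$second = require_once $lib;   // already loaded, so the file is skipped

echo demo_helper(), "\n";
var_dump($second);             // true: signals the file was already included

unlink($lib);                  // clean up the temp file
```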

code not parsing through a simple google.com test

<?php
$file = 'http://www.google.com';
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($file));
echo $doc->getElementsByTagName('span')->item(2)->nodeValue;
if (0 != $element->length)
{
$content = trim($element->item(2)->nodeValue);
if (empty($content))
{
$content = trim($element->item(2)->textContent);
}
echo $content . "\n";
}
?>
I'm trying to get the inner content of a span tag from Google's home page. This code should output the first span tag, but it is not outputting any results. Why?
This is not an error. The first span in http://www.google.com is empty, and I am not sure what else you expect:
<span class=gbtcb></span> <---------------- item(0)
<span class=gbtb2></span> <---------------- item(1)
<span class=gbts>Search</span> <----------- item(2)
Try
$element = $doc->getElementsByTagName('span')->item(2);
var_dump($element->nodeValue);
Output
Search
First, bear in mind that HTML is not necessarily valid XML.
That aside, check that you're actually getting some contents to parse; you need to have allow_url_fopen enabled in order to use file_get_contents() with URLs.
In general, avoid using the error suppression operator (@) because it will almost certainly come back to bite you some time (and this time might well be that time); there is a discussion on this elsewhere on SO.
So, as a first step, switch to something like the following and let me know if you're getting any contents at all.
// stop using @ to suppress errors
$contents = file_get_contents($file);
// check that you're getting something to parse
echo $contents;
Try this and tell us what the output is
<?php
echo ini_get('allow_url_fopen');
?>
Try using cURL to get the data and then load it into a DOMDocument:
<?php
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // The @ suppresses warnings about invalid markup
// Fetch the span list once so that $element is defined for the check below
$element = $dom->getElementsByTagName('span');
if (0 != $element->length)
{
$content = trim($element->item(2)->nodeValue);
if (empty($content))
{
$content = trim($element->item(2)->textContent);
}
echo $content . "\n";
}
?>
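As an alternative to the @ operator, libxml_use_internal_errors() collects parser warnings so they can be inspected or discarded. The sketch below uses the three spans quoted in the earlier answer as hard-coded input (not a live request to Google) and reads the third one.

```php
<?php
// Hard-coded sample taken from the spans quoted above; a real page will differ.
$data = '<span class=gbtcb></span>'
      . '<span class=gbtb2></span>'
      . '<span class=gbts>Search</span>';

libxml_use_internal_errors(true);    // collect warnings instead of using @
$dom = new DOMDocument();
$dom->loadHTML($data);
$warnings = libxml_get_errors();     // array of LibXMLError objects, may be empty
libxml_clear_errors();

$spans = $dom->getElementsByTagName('span');

// item() is zero-based, so item(2) is the third span.
$third = $spans->item(2);
if ($third !== null) {
    echo trim($third->nodeValue), "\n";   // prints "Search" for this sample
}
```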

Get div and the correct close tag preg

preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.
I want preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated for TomcatExodus's help
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xp = new domxpath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" is really the id for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attribute called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all. If the pages are written well, there should only be one instance of id="torrent_details" on the given page.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
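A small sketch of the difference, on a hard-coded snippet with one nested div (the markup is an assumption, not the real isohunt page): the non-greedy pattern stops at the first closing tag, while walking the DOM node's children yields the full inner HTML.

```php
<?php
// Assumed sample markup with a nested div inside the target div.
$html = '<div id="torrent_details"><div>inner</div><p>tail</p></div><div>other</div>';

// Non-greedy regex: the capture ends at the FIRST </div>, cutting the content short.
preg_match('%<div id="torrent_details">(.*?)</div>%s', $html, $m);
echo $m[1], "\n";   // prints "<div>inner"

// DOM approach: find the element, then serialize each child node.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="torrent_details"]')->item(0);

$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $dom->saveHTML($child);   // serialize one child, tags included
}
echo $inner, "\n";
```

The DOM version does not depend on how deeply the divs are nested, which is the case the regex cannot handle reliably.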
Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use simple:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);
