PHP Simple HTML DOM counts the wrong number of elements - php

Using this code, I want to count the number of dt elements with class "level3" inside a certain node:
include_once('simple_html_dom.php');
ini_set("memory_limit", "-1");
ini_set('max_execution_time', 1200);
function parseInit($url) {
$ch = curl_init();
$timeout = 0;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$data = parseInit("https://www.smile-dental.de/index.php");
$html = new simple_html_dom();
$html = $html->load($data);
$struct = $html->find("dt.level1", 0)->next_sibling()->find("dt.level2", 0)->next_sibling()->find("dt.level3");
echo count($struct);
$html->clear();
unset($html);
But the result is wrong. The real count should be 2, but I get 53 (the total number of dt elements with class "level3" under the first dt node with class "level1"). Could you help me and explain what the problem is?
Thanks in advance!
---EDIT---
In general, I want to build a hierarchical structure of the links in the left navigation bar. I wrote the function below, but it does not work correctly, possibly because of the problem described above. There may also be other problems in the code besides that one.
function get_links($struct) {
static $iter = 1;
$nav_left_links = $struct->find("dt.level".$iter);
echo "<ul>";
foreach ($nav_left_links as $links) {
echo "<li>".$links->find("a", 0)->href;
echo str_pad('',4096)."\n";
ob_flush();
flush();
usleep(500000);
$iter++;
if ($links->next_sibling() && count($links->next_sibling()->find("dt")) > 0) {
get_links($links->next_sibling());
} else {
$iter--;
if ($key == count($nav_left_links)) {
break;
} else {
continue;
}
}
echo "</li>";
}
echo "</ul>";
$iter--;
}
$data = parseInit("https://www.smile-dental.de/index.php");
$html = new simple_html_dom();
$html = $html->load($data);
$struct = $html->find(".mod_vertical_dropmenu_142_inner", 0);
get_links($struct);
$html->clear();
unset($html);
Or maybe if somebody knows how to rewrite this code without PHP Simple HTML DOM, using classic methods for parsing, I would be very grateful.

Unfortunately, it looks like you have uncovered a bug. I did some experiments, and even after correcting the validation errors, simple-html-dom wasn't able to traverse the dl, dt, and dd elements properly. I did get it to work when I used a regex to convert all the dl elements to ul, and the dd and dt elements to li, though:
Result of $html->find("li.level1", 1)->find("li.level2", 1)->find("li.level3"):
<li class="level3 off-nav-321-8120 notparent first"><span class="outer"> <span class="inner"> <span>Pro-Seal Versiegeler</span> </span> </span></li>
<li class="level3 off-nav-321-8120 notparent first"></li>
<li class="level3 off-nav-321-8122 notparent last"><span class="outer"> <span class="inner"> <span>Pro-Seal L.E.D. Versiegeler</span> </span> </span></li>
<li class="level3 off-nav-321-8122 notparent last"></li>
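Since the question also asks for a version without PHP Simple HTML DOM, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath instead. The inline markup is a made-up stand-in for the real navigation (the actual page structure is an assumption), and hasClass() is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical, simplified stand-in for the site's nested dl/dt/dd navigation.
$markup = <<<HTML
<dl>
  <dt class="level1 first">Category A</dt>
  <dd>
    <dl>
      <dt class="level2 first">Group A1</dt>
      <dd>
        <dl>
          <dt class="level3 first">Item A1a</dt>
          <dt class="level3 last">Item A1b</dt>
        </dl>
      </dd>
      <dt class="level2 last">Group A2</dt>
      <dd>
        <dl>
          <dt class="level3 first last">Item A2a</dt>
        </dl>
      </dd>
    </dl>
  </dd>
</dl>
HTML;

// Hypothetical helper: XPath 1.0 test for "the class attribute contains this token".
function hasClass(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// first dt.level1 -> its dd -> first dt.level2 -> its dd -> every dt.level3 below it
$query = '(//dt[' . hasClass('level1') . '])[1]/following-sibling::dd[1]'
       . '/dl/dt[' . hasClass('level2') . '][1]/following-sibling::dd[1]'
       . '//dt[' . hasClass('level3') . ']';

$level3 = $xpath->query($query);
echo $level3->length, "\n";
```

Because each step is scoped to the dd that directly follows the chosen dt, level3 items from other branches are not counted.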

Related

Simple HTML Dom getting tags values

I have html code like this:
<td class="table-main__odds" data-oid="4ci66xv464x0xbdci3" data-odd="2.18"></td>
<td class="table-main__odds colored" data-oid="4ci66xv498x0x0">
<span>
<span>
<span data-odd="3.68"></span>
</span>
</span>
</td>
<td class="table-main__odds" data-oid="4ci66xv464x0xbdci4" data-odd="3.09"></td>
<td class="table-main__odds" data-oid="4ci60xv464x0xbdchn" data-odd="10.35"></td>
<td class="table-main__odds" data-oid="4ci60xv498x0x0" data-odd="6.12"></td>
<td class="table-main__odds colored" data-oid="4ci60xv464x0xbdcho">
<span>
<span>
<span data-odd="1.26"></span>
</span>
</span>
</td>
I need to get the data-odd values. As you can see, some of them are on the td tag and some are on a nested span tag, but the attribute is always data-odd.
I'm trying this approach:
<?php
include_once('../simple_html_dom.php');
$url = "xyz";
function curl_request($url, $timeout = 30) {
// Initialize curl with given url
$ch = curl_init($url);
// Set user-agent
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]);
// Write the response to a variable
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Follow redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Max seconds to wait while connecting
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// Stop on error
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
return curl_exec($ch);
}
function get_html($url) {
return str_get_html(curl_request($url));
}
$html = get_html($url);
$b = 0;
$search = $html->find('td[class=table-main__odds], td[class=table-main__odds colored] span');
foreach($search as $allOdds){
$quote = array($allOdds->href, $allOdds->innertext);
if (isset($allOdds->attr['data-odd'])) {
$quote['data-odd'] = $allOdds->attr['data-odd'];
}
$quotes[] = $quote;
}
foreach($quotes as $mark) {
echo $mark[0]. " ";
}
?>
but I got the following error:
Fatal error: Call to a member function find() on a non-object in... on line 33 (foreach line)
Any suggestion?
Thanks
EDIT: I put $html = get_html($url);
EDIT2: I added var_dump($quotes); after the foreach loop
If you simply want to get the attribute values of the data-odd attribute in your html, try something simple like this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$odds = $xp->query('//*[@data-odd]/@data-odd');
foreach ($odds as $odd) {
echo($odd->value) . "\r\n";
}
Output:
2.18
3.68
3.09
10.35
6.12
1.26
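For reference, here is a self-contained run of the XPath approach against a shortened copy of the markup from the question. Wrapping the cells in a table and row is an assumption made here so the fragment is valid on its own.

```php
<?php
// Shortened copy of the question's markup, wrapped in <table><tr> (assumption)
$html = <<<HTML
<table><tr>
<td class="table-main__odds" data-oid="4ci66xv464x0xbdci3" data-odd="2.18"></td>
<td class="table-main__odds colored" data-oid="4ci66xv498x0x0">
  <span><span><span data-odd="3.68"></span></span></span>
</td>
<td class="table-main__odds" data-oid="4ci66xv464x0xbdci4" data-odd="3.09"></td>
</tr></table>
HTML;

libxml_use_internal_errors(true);   // silence warnings from imperfect markup
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// '@' selects attributes: every data-odd attribute, whichever element carries it
$odds = [];
foreach ($xp->query('//*[@data-odd]/@data-odd') as $attr) {
    $odds[] = $attr->value;
}

echo implode("\n", $odds), "\n";
```

The query does not care whether the attribute sits on the td or on a nested span, which is what makes it a good fit for this markup.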

PHP DOMDocument getting elements by tag name ignores commented ones [duplicate]

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty xpath answer, but if I have to resort to regular expressions I guess that's ok. Although I'm not so great with regex so if you think that's the way to go, I'd appreciate examples...
Pretty standard starting point:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
$info .= "<br />cURL error number:" .curl_errno($ch);
$info .= "<br />cURL error:" . curl_error($ch);
return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
and extraction of info, for example:
// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
// get iframe attributes
$iframe = $iframes->item($i);
$framesrc = $iframe->getAttribute("src");
$framewidth = $iframe->getAttribute("width");
$frameheight = $iframe->getAttribute("height");
$framealt = $iframe->getAttribute("alt");
$frameclass = $iframe->getAttribute("class");
$info .= $framesrc.' ('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}
Questions/Problems:
How to extract HTML comments?
I can't figure out how to identify the comments – are they considered nodes, or something else entirely?
How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.
Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:
$comments = $xpath->query('//comment()'); // or another path, as you prefer
They are standard nodes: here is the manual entry for the DOMComment class.
To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:
$html = $dom->saveXML($el); // $el should be the element you want to get
// the HTML for
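A small sketch of both suggestions on a made-up page. It uses saveHTML() with a node argument (available since PHP 5.3.6) rather than saveXML(), because saveHTML() keeps HTML-style void tags such as img. The sample markup and the id "box" are assumptions for illustration.

```php
<?php
// Made-up sample page for illustration
$page = <<<HTML
<html><body>
<!-- tracking snippet goes here -->
<div id="box"><img src="logo.png" alt="Logo"><a href="/one">One</a> <a href="/two">Two</a><!-- end of box --></div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// 1. Comments are ordinary nodes (DOMComment); comment() selects them
$comments = [];
foreach ($xpath->query('//comment()') as $comment) {
    $comments[] = trim($comment->nodeValue);
}

// 2. Serialize one element, children included, as a block of HTML
$div = $xpath->query('//div[@id="box"]')->item(0);
$outerHtml = $dom->saveHTML($div);

// Inner HTML only: serialize each child node and join the pieces
$innerHtml = '';
foreach ($div->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

print_r($comments);
echo $outerHtml, "\n", $innerHtml, "\n";
```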
For the HTML comments a fast method is:
function getComments ($html) {
$rcomments = array();
$comments = array();
// PREG_SET_ORDER returns one entry per match, so $c[1] is that comment's text
if (preg_match_all('#<\!--(.*?)-->#is', $html, $rcomments, PREG_SET_ORDER)) {
foreach ($rcomments as $c) {
$comments[] = $c[1];
}
return $comments;
} else {
// No comments matched
return null;
}
}
This regex may help:
\s*<!--[\s\S]+?-->
For comments, you're looking for a recursive regex. For instance, to get rid of HTML comments:
$yourHTML = preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);
to find them:
preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s',$yourHTML,$comments);
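A short sketch of how preg_match_all() lays out its results, which matters when pulling out the captured comment text. The sample string is made up and the pattern is a simplified, non-recursive one. Like any regex approach, it treats the input as plain text, so it will also match comment-like sequences inside scripts or attribute values.

```php
<?php
// Made-up input for illustration
$html = '<p>a</p><!-- first --><div><!--
second, spanning lines
--></div>';

// Default PREG_PATTERN_ORDER: $m[0] holds the full matches, $m[1] the first group
preg_match_all('/<!--(.*?)-->/s', $html, $m);
$bodies = array_map('trim', $m[1]);

// PREG_SET_ORDER: one entry per match, each with [0 => full match, 1 => group 1]
preg_match_all('/<!--(.*?)-->/s', $html, $sets, PREG_SET_ORDER);
$fromSets = [];
foreach ($sets as $set) {
    $fromSets[] = trim($set[1]);
}

// Removing comments: preg_replace() takes pattern, replacement, subject
$stripped = preg_replace('/<!--.*?-->/s', '', $html);

print_r($bodies);
echo $stripped, "\n";
```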

PHP Simple HTML Dom parser returns 0

I use PHP Simple HTML Dom parser to get some elements of a page. Unfortunately, I get as a result 0 or 1... I would like to get the innerHTML instead.
Here is a photo of the dom:
And here is my code:
include('simple_html_dom.php');
// We take the url we want to scrape
$URL = 'https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000033011065&dateTexte=20160821';
// Curl init
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
curl_close($ch);
// We get the html
$html = new simple_html_dom();
$html->load($result);
// Find all article blocks
foreach($html->find('div.data') as $article) {
$item['title'] = $article->find('.titreSection', 0) ->plaintext;
$resultat[] = "<p>" + $item['title']."</p></br>";
}
include 'vue_scrap.php';
?>
Here is the code of my view:
foreach ($resultat as $result){
echo $result;
}
Thank you for your help.
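In fact, I just made a mistake on this line. In PHP, + is the arithmetic operator, so the strings were being added as numbers instead of concatenated: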
$resultat[] = "<p>" + $item['title']."</p></br>";
The correct version is:
$resultat[] = "<p>".$item['title']."</p></br>";

Problems with multiple attributes while using PHP Simple HTML DOM

I use this code to get the elements of the left navigation bar:
function parseInit($url) {
$ch = curl_init();
$timeout = 0;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$data = parseInit("https://www.smile-dental.de/index.php");
$data = preg_replace('/<(d[ldt])( |>)/smi', '<div data-type="$1"$2', $data);
$data = preg_replace('/<\/d[ldt]>/smi', '</div>', $data);
$html = new simple_html_dom();
$html = $html->load($data);
But I ran into the following problem.
For example, if I use $html->find("div[data-type=dd].level2"), I get ALL elements with class level2, whatever their data-type (dt, dd or dl). If I use $html->find("div.level2[data-type=dd]") instead, I get ALL elements with data-type dd, whatever their class (level1, level2, level3, etc.).
Could you explain what the problem is? Thanks in advance!
P.S.: All dt, dl and dd elements were replaced via regex with div elements carrying the corresponding data-type attribute, because this parser miscounts those elements.
Regexes are not made for manipulating HTML; DOM parsers are. And simple_html_dom, which you're already using, can do it easily.
The following code will do what you want just fine (check comments):
$data = parseInit("https://www.smile-dental.de/index.php");
// Create a DOM object
$html = new simple_html_dom();
$html = $html->load($data);
// Find all tags to replace
$nodes = $html->find('dt, dd, dl');
// Loop through every node and make the wanted changes
foreach ($nodes as $key => $node) {
// Get the original tag's name
$originalTag = $node->tag;
// Replace it with the new tag
$node->tag = 'div';
// Set a new attribute with the original tag's name
$node->{'data-type'} = $originalTag;
}
// Clear DOM variable
$html->clear();
unset($html);
Now, for multiple attributes filtering, you can use either of the following methods:
foreach ( $html->find("div.level2") as $key => $node) {
if ( $node->{'data-type'} == 'dt' ) {
# code...
}
}
OR (courtesy of h0tw1r3):
// array containing all the filtered nodes
$dts = array_filter($html->find('div.level2'), function($node){return $node->{'data-type'} == 'dt';});
Please read the MANUAL for more details...
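If the selector engine keeps ignoring one of the two conditions, the same filter can also be expressed with PHP's built-in DOMXPath, where chained predicates must all hold. This is a sketch of an alternative, not simple_html_dom's API, and the markup is a made-up miniature of the converted navigation.

```php
<?php
// Made-up miniature of the navigation after converting dl/dt/dd to div
$markup = <<<HTML
<div data-type="dl">
  <div data-type="dt" class="level1">A</div>
  <div data-type="dd" class="level1">
    <div data-type="dl">
      <div data-type="dt" class="level2 first">A1</div>
      <div data-type="dd" class="level2 first">A1 children</div>
    </div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Both predicates must hold: data-type is "dd" AND the class list contains "level2"
$query = '//div[@data-type="dd"]'
       . '[contains(concat(" ", normalize-space(@class), " "), " level2 ")]';

$nodes = $xpath->query($query);
foreach ($nodes as $node) {
    echo $node->getAttribute('class'), ': ', trim($node->textContent), "\n";
}
```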

php strange looping problem

Sorry for the long code, I'm really losing it.
This code is supposed to receive a list of URLs through POST, from a textarea with one URL per line. The script should download each URL, go through the HTML and collect some links, then follow those links, extract some data and echo it out.
For some reason, it looks as if getDetails() runs only once, because I'm getting only one set of results.
I have checked multiple times that the foreach loop handles each URL separately, and that part is working.
Can anyone spot the problem?
require_once('simple_html_dom.php');
function getDetails($html) {
$dom = new simple_html_dom;
$dom->load($html);
$title = $dom->find('h1', 0)->find('a', 0);
foreach($dom->find('span[style="color:#333333"]') as $element) {
$address = $element->innertext;
}
$address = str_replace("<br>"," ",$address);
$address = str_replace(","," ",$address);
$title->innertext = str_replace(","," ",$title->innertext);
if ($address == "") {
$exp = explode("<strong><strong>",$html);
$exp2 = explode("</strong>",$exp[1]);
$address = $exp2[0];
}
echo $title->innertext . "," . $address . "<br>";
}
function getHtml($Url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Url);
curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$output = curl_exec($ch);
curl_close($ch);
return $output;
}
function getdd($u) {
$html = getHtml($u);
$dom = new simple_html_dom;
$dom->load($html);
foreach($dom->find('a') as $element) {
if (strstr($element->href,"display_one.asp")) {
$durls[] = $element->href;
}
}
return $durls;
}
if (isset($_POST['url'])) {
$urls = explode("\n",$_POST['url']);
foreach ($urls as $u) {
$durls2 = getdd($u);
$durls2 = array_unique($durls2);
foreach ($durls2 as $durl) {
$d = getHtml("http://www.example.co.il/" . $durl);
getDetails($d);
}
}
}
You're only assigning the last element in the loop, it looks like. You'll need to concatenate. Something like $address .= $element->innertext; inside the loop (note the .= instead of =).
edit: unless I'm mistaking what it's supposed to be doing. I think I may've been focusing on the wrong part of the code.
When you use DOMDocument on HTML, loading it with $dom->loadHTMLFile() or $dom->loadHTML(), you should also call libxml_use_internal_errors(true) beforehand so that improperly formatted HTML does not flood the output with warnings.
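A minimal sketch of that advice, assuming a made-up snippet of sloppy markup. Parse errors are collected instead of printed, so they can be inspected or simply discarded.

```php
<?php
// Made-up, deliberately sloppy markup: the <span> is never closed
$html = '<div><h1><a href="/shop">Shop name</a></h1><span>Main St. 1<br>Tel Aviv</div>';

$previous = libxml_use_internal_errors(true);  // collect parse errors silently

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors();                 // array of LibXMLError objects
libxml_clear_errors();                         // free the collected errors
libxml_use_internal_errors($previous);         // restore the earlier setting

$title = $dom->getElementsByTagName('h1')->item(0)->textContent;

echo $title, "\n";
echo count($errors), " parse issue(s) collected\n";
```

Restoring the previous setting keeps the change local, which helps when other code in the same request relies on the default error handling.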
