PHP Simple HTML DOM has a problem with parentheses in href
If you have a sample.php page and it contains:
If you do this:
$html = file_get_html('sample.php');
foreach ($html->find('a[href*=(parentheses)]') as $element) {
    echo $element->href;
}
or like this:
$html = file_get_html('sample.php');
foreach ($html->find('a[href*=this-href]') as $element) {
    echo $element->href;
}
It works.
But if you write something after or before the parentheses it doesn't work:
This:
$html = file_get_html('sample.php');
foreach ($html->find('a[href*=contains-(parentheses)]') as $element) {
    echo $element->href;
}
Or this:
$html = file_get_html('sample.php');
foreach ($html->find('a[href*=(parentheses)-and-more]') as $element) {
    echo $element->href;
}
Doesn't work.
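It does not work because Simple HTML DOM inserts the selector value into a regular expression without escaping it, so the parentheses are treated as a capture group instead of literal characters (one of many errors in the Simple HTML DOM code):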
On line 673 of simple_html_dom.php you will see the line:
return preg_match("/".$pattern."/i", $value);
Change it to:
return preg_match("/".preg_quote($pattern, "/")."/i", $value);
Presto, problem solved.
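To see why escaping matters, here is a minimal standalone sketch that does not use the library. The helper name containsLiteral and the sample href are assumptions made for illustration; only preg_match and preg_quote are real PHP functions.

```php
<?php
// Hypothetical helper mirroring a "contains" (*=) attribute check.
function containsLiteral(string $needle, string $value): bool
{
    // preg_quote escapes regex metacharacters such as ( ) . ? and, through
    // its second argument, the "/" delimiter used around the pattern.
    return preg_match('/' . preg_quote($needle, '/') . '/i', $value) === 1;
}

// Hypothetical href value, chosen to match the selectors in the question.
$href = 'this-href-contains-(parentheses)-and-more';

// Unquoted: "(parentheses)" becomes a capture group, so the regex searches
// for the text "contains-parentheses", which the href does not contain.
$unquoted = preg_match('/contains-(parentheses)/i', $href);

// Quoted: the parentheses are matched as literal characters.
$quoted = containsLiteral('contains-(parentheses)', $href);

var_dump($unquoted, $quoted);
```

This also suggests why the selector with only (parentheses) appeared to work: the group still matches the word "parentheses" inside the href, even though the literal brackets are never compared.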
You can report the error here: https://sourceforge.net/p/simplehtmldom/bugs/ but given how many errors exist in the find method and elsewhere, it has likely been reported already.
I have two files:
mainpage.html and recordinput.php
I need to get a div's inner HTML from mainpage.html in my PHP file.
I have copied the code here:
In my PHP file, I have:
$dochtml = new DOMDocument();
//libxml_use_internal_errors(true);
$dochtml->loadHTMLFile("mainpage.html");
$div = $dochtml->getElementById('div2');
$div2html = get_inner_html($div);
echo "store information as: ".$div2html;

function get_inner_html(DOMNode $elem)
{
    $innerHTML = " ";
    $children = $elem->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $elem->ownerDocument->saveHTML($child);
    }
    echo "function return: ".$innerHTML."<br />";
    return $innerHTML;
}
The return value is just empty. Can anybody help me? I have spent two days on this. I feel like the problem is here:
$dochtml->loadHTMLFile("mainpage.html");
Thanks
PHP's DOMDocument already provides a property for retrieving the text content of the element you select (text only, without nested tags). Here is how you use it:
$div = $dochtml->getElementById('div2')->nodeValue;
So you don't need to make your own function.
If you're looking to get the div contents including all nested tags then you can do it like this:
echo $div->ownerDocument->saveHTML($div);
Example: http://3v4l.org/GCbJk
Note that this includes the div2 tag itself, which you could easily then strip off.
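The difference between the text content and the inner HTML can be sketched as follows. This is an illustration under assumptions: the markup string stands in for mainpage.html, and the helper name innerHtml is hypothetical; the DOMDocument calls themselves are standard PHP.

```php
<?php
// Hypothetical markup standing in for mainpage.html.
$markup = '<html><body><div id="div2"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

// Hypothetical helper: serializes the children of a node, which yields the
// inner HTML without the enclosing tag.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$div = $doc->getElementById('div2');
if ($div === null) {
    // getElementById returns null when no element has that id.
    exit("div2 not found\n");
}

$text  = $div->textContent;   // text only, nested tags dropped
$inner = innerHtml($div);     // nested tags kept, outer <div> omitted
$outer = $doc->saveHTML($div); // includes the <div id="div2"> tag itself

echo $text, "\n", $inner, "\n", $outer, "\n";
```

Checking for null before using the element is worth keeping in real code, since calling a method on a missing element is a common cause of empty or failed output.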
I'm using Simple HTML DOM Parser and everything works fine, but my code outputs the requested markup multiple times.
You can see what I'm talking about here:
http://stijnaerts.be/crawl/
I'm using the following php code:
<?php
include("simple_html_dom.php");
$webpage = "http://www.partyindustries.be/partypics/";
$html = file_get_html($webpage);
$links = $html->find('a');
foreach ($html->find('a') as $element) {
    $div = $element->find('div[.kalenderRow partyPicsRow]');
    $som = count($div);
    if ($som != 0) {
        echo $element;
    }
}
?>
What is causing the multiple entries?
I'm testing a parser built on SIMPLE_HTML_DOM, and while parsing
the HTML DOM returned from this URL: HERE
it does not find the H1 elements...
I tried returning all the divs, and that worked.
I'm using a simple request for diagnosing this problem:
foreach($html->find('H1') as $value) { echo "<br />F: ".htmlspecialchars($value); }
While looking at the source code I noticed that
the h1 tag is upper case (H1), but SIMPLE_HTML_DOM handles that:
//PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
if ($lowercase) {
    $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
} else {
    $check = $this->match($exp, $val, $nodeKeyValue);
}
if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}
Can anybody help me understand what is going on here?
Try this:
$oHtml = str_get_html($html);
foreach ($oHtml->find('h1') as $element) {
    echo $element->innertext;
}
You can also use a regular expression. The following function returns an array holding the inner text of every h1 tag, followed by the number of matches:
function getH1($yourhtml)
{
    $h1tags = preg_match_all("/(<h1.*>)(\w.*)(<\/h1>)/isxmU", $yourhtml, $patterns);
    $res = array();
    array_push($res, $patterns[2]);
    array_push($res, count($patterns[2]));
    return $res;
}
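As a variation on the regular-expression approach, here is a minimal sketch that matches h1 tags regardless of case and returns plain text. The helper name h1Texts and the sample markup are assumptions for illustration. Regular expressions are fragile on real-world HTML, so treat this as a diagnostic aid and not a replacement for a parser.

```php
<?php
// Hypothetical helper: returns the plain text of every h1 element in a string.
function h1Texts(string $html): array
{
    // "i" matches <H1> as well as <h1>; "s" lets "." span line breaks;
    // ".*?" is lazy, so each match stops at the nearest closing tag.
    preg_match_all('/<h1\b[^>]*>(.*?)<\/h1>/is', $html, $matches);

    // Strip nested tags and surrounding whitespace from each heading.
    return array_map(
        function ($inner) {
            return trim(strip_tags($inner));
        },
        $matches[1]
    );
}

// Hypothetical sample markup with one upper-case and one lower-case heading.
$sample = "<H1 class=\"title\">First</H1>\n<p>text</p>\n<h1>Second <em>one</em></h1>";
$headings = h1Texts($sample);

print_r($headings);
```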
Found it...
But I can't explain it!
I tested with other code that includes H1 (uppercase) and it worked.
While playing with the SIMPLE_HTML_DOM code I commented out the "remove_noise" calls below, and now it works
perfectly. I think this website has invalid HTML, so
the noise remover strips too much and does not stop at the closing script tags:
// $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
// $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
Thank you all for your help.
I'm parsing the HTML of a page to get a list of its outgoing links. I want to split them into two groups: the ones with a rel="nofollow" / rel="nofollow me" / rel="me nofollow" attribute and the ones without it.
At the moment I'm using the code below, which relies on PHP Simple HTML DOM Parser:
$html = file_get_html("$url");
foreach($html->find('a') as $element) {
echo $element->href; // THE LINK
}
but I'm not quite sure how to implement the split. Any ideas?
Try something like this:
$html = file_get_html("$url");
// Array for storing the links in two groups
$arrayLinks = array(
    "nofollow" => array(),
    "others" => array()
);
foreach ($html->find('a') as $element) {
    // Search the rel attribute for "nofollow", case-insensitively (i flag)
    if (preg_match('#nofollow#i', $element->rel)) {
        $arrayLinks["nofollow"][] = $element->href;
    } else {
        $arrayLinks["others"][] = $element->href;
    }
}
// Display the array
echo "<pre>";
print_r($arrayLinks);
echo "</pre>";
Do a regexp on $element->rel I guess
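A substring match on "nofollow" would also accept values that merely contain those letters. A slightly stricter check treats rel as a whitespace-separated token list. The sketch below is standalone and uses hypothetical sample data in place of parsed elements; the helper name hasNofollow is an assumption for illustration.

```php
<?php
// Hypothetical helper: true when the rel value lists the "nofollow" token.
function hasNofollow(?string $rel): bool
{
    // Split on whitespace and compare whole tokens, case-insensitively.
    $tokens = preg_split('/\s+/', strtolower(trim((string) $rel)), -1, PREG_SPLIT_NO_EMPTY);
    return in_array('nofollow', $tokens, true);
}

// Hypothetical sample data standing in for the parsed <a> elements.
$links = [
    ['href' => 'http://example.com/a', 'rel' => 'nofollow'],
    ['href' => 'http://example.com/b', 'rel' => 'me nofollow'],
    ['href' => 'http://example.com/c', 'rel' => 'me'],
    ['href' => 'http://example.com/d', 'rel' => null],
];

$grouped = ['nofollow' => [], 'others' => []];
foreach ($links as $link) {
    $key = hasNofollow($link['rel']) ? 'nofollow' : 'others';
    $grouped[$key][] = $link['href'];
}

print_r($grouped);
```

In the loop from the answer above, the same helper could be applied to the rel value of each element in place of the preg_match call.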
I'm using SimpleHTMLDOM Parser to scrape a website and I would like to know if there's any error handling method. For example, if the link is broken, there is no point in continuing and searching the document.
Thank you.
<?php
$html = file_get_html('http://www.google.com/');
foreach ($html->find('a') as $element) {
    if (empty($element->href)) {
        continue; // will skip <a> without href
    }
    echo $element->href . "<br>\n";
}
?>
a loop and continue?
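For the broken-link case itself, one option is to fetch the document first and only parse it if the fetch succeeded. The sketch below uses plain file_get_contents so it runs without the library; the helper name fetchHtmlOrNull and the path are hypothetical, and whether your version of Simple HTML DOM reports failures the same way is an assumption you should verify.

```php
<?php
// Hypothetical helper: returns the document source, or null if it cannot be
// read or is empty.
function fetchHtmlOrNull(string $source): ?string
{
    // "@" suppresses the warning file_get_contents raises on failure;
    // the false return value is checked instead.
    $contents = @file_get_contents($source);
    if ($contents === false || trim($contents) === '') {
        return null;
    }
    return $contents;
}

// Hypothetical path that is assumed not to exist.
$source = __DIR__ . '/missing-page-for-demo.html';
$contents = fetchHtmlOrNull($source);

if ($contents === null) {
    // Nothing to parse, so stop before searching the document.
    $status = 'skipped';
} else {
    // With Simple HTML DOM loaded, str_get_html($contents) could parse it here.
    $status = 'parsed';
}

echo $status, "\n";
```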