php html dom parentheses in href

PHP Simple HTML DOM has a problem with parentheses in href values.
Suppose you have a sample.php page that contains:
If you query it like this:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=(parentheses)]') as $element)
{
echo $element->href;
}
or like this:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=this-href]') as $element)
{
echo $element->href;
}
Both of these work.
But if the selector contains any text before or after the parentheses, it does not work.
This:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=contains-(parentheses)]') as $element)
{
echo $element->href;
}
Or this:
$html = file_get_html('sample.php');
foreach($html->find('a[href*=(parentheses)-and-more]') as $element)
{
echo $element->href;
}
Neither of these works.

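This fails because of an error in the Simple HTML DOM code (one of many): the selector value is used as a regular expression without being escaped. A bare (parentheses) is therefore treated as a capture group that matches the text parentheses, which is why the first example appears to work, while contains-(parentheses) looks for the text contains-parentheses and finds nothing.
On line 673 of simple_html_dom.php (the line number may differ between versions) you will see the line:
return preg_match("/".$pattern."/i", $value);
Change it to:
return preg_match("/".preg_quote($pattern, "/")."/i", $value);
Passing "/" as the second argument also escapes the delimiter, which matters because href values often contain slashes.
Presto, problem solved.
You can report the error here: https://sourceforge.net/p/simplehtmldom/bugs/ but given the number of known problems with the find method and others, it has likely already been reported.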
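
The effect of escaping can be checked in isolation with plain preg_match. The sketch below is only an illustration: the helper names contains_unquoted and contains_quoted are made up and merely mimic the *= check described above; they are not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper: interpolates the selector value into the regex unescaped.
function contains_unquoted(string $pattern, string $value): bool
{
    // "(" and ")" are regex metacharacters here, so they form a capture group.
    return preg_match("/" . $pattern . "/i", $value) === 1;
}

// Hypothetical helper: same check, but with the selector value escaped first.
function contains_quoted(string $pattern, string $value): bool
{
    // preg_quote() escapes regex metacharacters; "/" is passed so the delimiter is escaped too.
    return preg_match("/" . preg_quote($pattern, "/") . "/i", $value) === 1;
}

$href = "contains-(parentheses)";

var_dump(contains_unquoted("(parentheses)", $href));          // bool(true): the group matches "parentheses"
var_dump(contains_unquoted("contains-(parentheses)", $href)); // bool(false): looks for "contains-parentheses"
var_dump(contains_quoted("contains-(parentheses)", $href));   // bool(true): literal parentheses are matched
```

The first call succeeds only by accident, which matches the behaviour reported in the question.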

Related

How to get innerhtml of an element from PHP

I have two files:
mainpage.html and recordinput.php
I need to get a div's inner HTML from mainpage.html in my PHP file.
I have copied the code here:
In my PHP file, I have:
$dochtml = new DOMDocument();
//libxml_use_internal_errors(true);
$dochtml->loadHTMLFile("mainpage.html");
$div = $dochtml->getElementById('div2');
$div2html = get_inner_html($div);
echo "store information as: ".$div2html;
function get_inner_html(DOMNode $elem )
{
$innerHTML = " ";
$children = $elem->childNodes;
foreach ($children as $child)
{
$innerHTML .= $elem->ownerDocument->saveHTML( $child );
}
echo "function return: ".$innerHTML."<br />";
return $innerHTML;
}
The result is just empty. Can anybody help me? I have spent two days on this. I suspect the problem is here:
$dochtml->loadHTMLFile("mainpage.html");
Thanks

PHP's DOMDocument already provides a way to retrieve the content of a selected element. Here is how you do it:
$div = $dochtml->getElementById('div2')->nodeValue;
So you don't need to write your own function. Note that nodeValue returns only the text content; nested tags are dropped.

If you're looking to get the div contents including all nested tags then you can do it like this:
echo $div->ownerDocument->saveHTML($div);
Example: http://3v4l.org/GCbJk
Note that this includes the div2 tag itself, which you could easily then strip off.
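
To compare the approaches on a small input, here is a sketch that uses only DOMDocument. The sample markup is an assumption (the original mainpage.html is not shown), and get_inner_html is the asker's helper, lightly tidied.

```php
<?php
// Concatenate the serialized markup of every child node of $elem.
function get_inner_html(DOMNode $elem): string
{
    $innerHTML = "";
    foreach ($elem->childNodes as $child) {
        $innerHTML .= $elem->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}

// Assumed sample markup standing in for mainpage.html.
$markup = '<html><body><div id="div2"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($markup);
libxml_clear_errors();

$div = $doc->getElementById('div2');
if ($div === null) {
    // getElementById() returns null when no element has that id.
    exit("div2 not found\n");
}

echo $div->nodeValue, "\n";       // text only: Hello world
echo get_inner_html($div), "\n";  // children with their nested tags, without the div itself
echo $doc->saveHTML($div), "\n";  // the div element including its own tag
```

Checking for null before using the element is worth keeping: an empty result can also mean the id was never found in the loaded file.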

Crawler gets the asked code twice

I'm using Simple HTML DOM Parser and everything works fine, except that my code outputs the requested markup multiple times.
You can see what I'm talking about here:
http://stijnaerts.be/crawl/
I'm using the following PHP code:
<?php
include("simple_html_dom.php");
$webpage ="http://www.partyindustries.be/partypics/";
$html = file_get_html($webpage);
$links = $html->find('a');
foreach($html->find('a') as $element){
$div = $element->find('div[.kalenderRow partyPicsRow]');
$som = count($div);
if($som != 0)
{
echo $element;
}
}
?>
What is causing the multiple entries?

simple_html_dom not returning <h1> elements?

I'm testing a parser built on SIMPLE_HTML_DOM, and while parsing
the HTML returned from this URL: HERE
it does not find the H1 elements.
Returning all the divs works.
I'm using a simple query to diagnose the problem:
foreach($html->find('H1') as $value) { echo "<br />F: ".htmlspecialchars($value); }
While looking at the page source I noticed that
the h1 is upper case (H1), but SIMPLE_HTML_DOM handles that:
//PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
if ($lowercase) {
$check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
} else {
$check = $this->match($exp, $val, $nodeKeyValue);
}
if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}
Can anybody help me understand what is going on here?

Try this:
$oHtml = str_get_html($html);
foreach($oHtml->find('h1') as $element)
{
echo $element->innertext;
}
You can also use a regular expression. The following function returns an array whose first element is the list of h1 inner texts and whose second element is their count:
function getH1($yourhtml)
{
$h1tags = preg_match_all("/(<h1.*>)(\w.*)(<\/h1>)/isxmU", $yourhtml, $patterns);
$res = array();
array_push($res, $patterns[2]);
array_push($res, count($patterns[2]));
return $res;
}

Found it...
But I can't explain it!
I tested with another page containing H1 (uppercase) and it worked.
While playing with the SIMPLE_HTML_DOM code I commented out the "remove_noise" calls and now it works
perfectly. I think this website has invalid HTML, so
the noise remover removes too much and does not stop at the closing script tags:
// $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
// $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
Thank you all for your help.
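
One way this can happen is sketched below with plain preg_replace. The markup is made up (the actual page is not shown), so this only illustrates the suspected mechanism: if a script tag is never closed, the pattern keeps consuming text until the next closing script tag, taking the h1 with it.

```php
<?php
// One of the patterns quoted above, copied from the commented-out remove_noise() call.
$pattern = "'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is";

// Made-up invalid markup: the first script tag is never closed.
$broken = '<script type="a">var x = 1;<h1>Title</h1><script type="b">var y = 2;</script>';

// Same content with the first script tag closed properly.
$valid = '<script type="a">var x = 1;</script><h1>Title</h1><script type="b">var y = 2;</script>';

// Remove everything the pattern matches, as a noise remover would.
function strip_scripts(string $pattern, string $html): string
{
    return (string) preg_replace($pattern, '', $html);
}

var_dump(strip_scripts($pattern, $broken)); // string(0) ""
var_dump(strip_scripts($pattern, $valid));  // string(14) "<h1>Title</h1>"
```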

when parsing html, check if element is present

I'm parsing the HTML of a page to get a list of its outgoing links. I want to split them in two groups: the ones with a rel="nofollow" / rel="nofollow me" / rel="me nofollow" attribute and the ones without.
At the moment I'm using the code below, which relies on PHP Simple HTML DOM Parser:
$html = file_get_html("$url");
foreach($html->find('a') as $element) {
echo $element->href; // THE LINK
}
but I'm not quite sure how to implement the split. Any ideas?

Try something like this:
$html = file_get_html("$url");
// Creating array for storing links
$arrayLinks = array(
"nofollow" => array(),
"others" => array()
);
foreach($html->find('a') as $element) {
// Case-insensitive search for "nofollow" (i flag)
if(preg_match('#nofollow#i', $element->rel)) {
$arrayLinks["nofollow"][] = $element->href;
}
else {
$arrayLinks["others"][] = $element->href;
}
}
// Display the array
echo "<pre>";
print_r($arrayLinks);
echo "</pre>";
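
As a variation, the rel value can be split into whitespace-separated tokens and compared as whole words rather than as a substring. The sketch below swaps Simple HTML DOM for PHP's built-in DOMDocument so that it runs on its own; the function name split_links_by_nofollow and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: group href values by whether rel contains the "nofollow" token.
function split_links_by_nofollow(string $html): array
{
    $links = ["nofollow" => [], "others" => []];

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $a) {
        // getAttribute() returns "" when rel is absent, which yields an empty token list.
        $rel = strtolower(trim($a->getAttribute('rel')));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);

        $key = in_array('nofollow', $tokens, true) ? 'nofollow' : 'others';
        $links[$key][] = $a->getAttribute('href');
    }
    return $links;
}

// Made-up sample markup.
$sample = '<html><body>'
    . '<a href="/a" rel="nofollow">a</a>'
    . '<a href="/b" rel="me NoFollow">b</a>'
    . '<a href="/c">c</a>'
    . '<a href="/d" rel="me">d</a>'
    . '</body></html>';

print_r(split_links_by_nofollow($sample));
```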

Do a regexp on $element->rel I guess

Simple HTML DOM Parser error handling

I'm using SimpleHTMLDOM Parser to scrape a website and I would like to know if there is any error handling mechanism. For example, if the link is broken, there is no point in continuing and searching the document.
Thank you.

<?php
$html = file_get_html('http://www.google.com/');
foreach($html->find('a') as $element)
{
if(empty($element->href))
{
continue; //will skip <a> without href
}
echo $element->href . "<br>\n";
}
?>
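
For the broken-link case itself, one option is to check whether the page could be loaded at all before searching it. The sketch below uses PHP's built-in file_get_contents, which returns false on failure; the helper name fetch_html_or_null and the path are made up, and handing the result to str_get_html afterwards is an assumption about how you would continue.

```php
<?php
// Hypothetical helper: return the page contents, or null if nothing could be loaded.
function fetch_html_or_null(string $url): ?string
{
    // @ suppresses the warning emitted on failure; the false return value is checked instead.
    $contents = @file_get_contents($url);
    if ($contents === false || $contents === '') {
        return null;
    }
    return $contents;
}

// Made-up path that does not exist, standing in for a broken link.
$source = '/no/such/dir/page.html';

$contents = fetch_html_or_null($source);
if ($contents === null) {
    echo "Could not load {$source}, skipping.\n";
} else {
    // Only parse when there is something to parse, e.g. with str_get_html($contents).
    echo "Loaded ", strlen($contents), " bytes.\n";
}
```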

a loop and continue?
