Node value in DOMDocument is printed as HTML source and not executed - PHP

Here is my code:
$link = "<a class=\"openevent\" href=\"$finalUrl\" target=\"_blank\">Open Event</a>";
foreach ($spans as $span) {
if ($span->getAttribute('class') == 'category') {
$span->nodeValue .= $link;
}
}
The problem is that $link will contain something like this:
<a class="openevent" href="http://www.domain.com/Free-Live-Streaming-Video-Online-Hockey-NHL-Pre-season-Buffalo-Sabres-Montreal-Canadiens-170647.html" target="_blank">Open Event</a>
With my current code, the HTML above appears in the browser as literal source text instead of being rendered as an "Open Event" link.
What is wrong with my code?

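Use createElement and appendChild to add an element to each span, rather than setting nodeValue. A string assigned to nodeValue is treated as plain text, so any markup in it is escaped when the document is output.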
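As a rough illustration of that advice, here is a minimal, self-contained sketch. The sample markup and the shortened $finalUrl are made up for the example, and $doc stands in for whatever DOMDocument the $spans in the question came from.

```php
<?php
// Hypothetical sample markup and URL, used only for illustration.
$finalUrl = 'http://www.domain.com/event-170647.html';
$markup = '<div><span class="category">Hockey</span><span class="other">Skip</span></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

foreach ($doc->getElementsByTagName('span') as $span) {
    if ($span->getAttribute('class') == 'category') {
        // Build a real <a> element instead of concatenating a string.
        $link = $doc->createElement('a', 'Open Event');
        $link->setAttribute('class', 'openevent');
        $link->setAttribute('href', $finalUrl);
        $link->setAttribute('target', '_blank');
        $span->appendChild($link);
    }
}

$output = $doc->saveHTML();
echo $output;
```

Because the anchor is a DOM element rather than text, it is serialized as markup and the browser renders it as a link.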

Related

PHP web crawler prints every extracted statement twice

I have been working on this web crawler. It works fine, except that it prints every extracted statement twice.
I tried echoing inside every loop, but it seems to need an outside perspective.
My code is as follows:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('https://www.uworld.com/Forum/topics.aspx?ForumID=1&gid=1');
$elementCount=0;
foreach($html->find('h3.h3-forum-title a') as $element) {
$elementCount++;
$element->href = "http://www.studentdoc.com/phpBB2/" . $element->href;
echo '<li target="_blank" class="itemtitle">';
if($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
echo '<span class="item_new">new</span>';
}
echo $element;
echo '</li>';
if($elementCount==12){
break;
}
}
?>
Any help is appreciated.
The issue occurs because the element exists twice on the webpage. You have to narrow down your find parameters like this:
foreach($html->find('div.hidden-lg div div div div h3.h3-forum-title a') as $element) {
// process the elements
}
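If narrowing the selector is not practical, another option is to skip links that have already been seen. The sketch below is only an illustration of that idea: it uses PHP's built-in DOMDocument instead of Simple HTML DOM so that it runs on its own, and the sample markup (including the visible-lg container and the topic URLs) is hypothetical.

```php
<?php
// Hypothetical markup in which the same titles are rendered twice,
// once in each of two containers.
$markup = '<div class="hidden-lg">'
    . '<h3 class="h3-forum-title"><a href="topic.aspx?id=1">First</a></h3>'
    . '<h3 class="h3-forum-title"><a href="topic.aspx?id=2">Second</a></h3>'
    . '</div>'
    . '<div class="visible-lg">'
    . '<h3 class="h3-forum-title"><a href="topic.aspx?id=1">First</a></h3>'
    . '<h3 class="h3-forum-title"><a href="topic.aspx?id=2">Second</a></h3>'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

$seen = [];   // hrefs that have already been processed
$titles = []; // link text of each unique link, in document order

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (isset($seen[$href])) {
        continue; // duplicate of a link handled earlier
    }
    $seen[$href] = true;
    $titles[] = $a->textContent;
}

echo implode("\n", $titles), "\n";
```

The same $seen check could be dropped into the Simple HTML DOM loop from the question, keyed on $element->href.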

php html tags converted to string

I am trying to process a HTML file with php as a DOM document. Processing is okay, but when I save the html document with $html->saveHTMLFile("file_out.html"); all link tags are converted from:
Click here: <a title="editable" href="http://somewhere.net">somewhere.net</a>
to
Click here: &lt;a title="editable" href="http://somewhere.net"&gt; somewhere.net &lt;/a&gt;
I process the links as php scripts, maybe this makes a difference?
I cannot convert the &lt; back to < with html_entity_decode() or similar. Is there any other conversion or encoding I can use?
The php script looks like the following:
<?php
$text = $_POST["textareaX"];
$id = $_GET["id"];
$ref = $_GET["ref"];
$html = new DOMDocument();
$html->preserveWhiteSpace = true;
$html->formatOutput = false;
$html->substituteEntities = false;
$html->loadHTMLFile($ref.".html");
$elem = $html->getElementById($id);
$elem->nodeValue = $innerHTML;
if ($text == "")
{ $text = "--- No details. ---"; }
$newtext = "";
$words = explode(" ",$text);
foreach ($words as $word) {
if (strpos($word, "http://") !== false) {
$newtext .= "<a alt=\"editable\" href=\"".$word."\">".$word."</a>";
}
else {$newtext .= $word." ";}
}
$text = $newtext;
function setInnerHTML($DOM, $element, $innerHTML) {
$node = $DOM->createTextNode($innerHTML);
$children = $element->childNodes;
foreach ($children as $child) {
$element->removeChild($child);
}
$element->appendChild($node);
}
setInnerHTML($html, $elem, $text);
$html->saveHTMLFile($ref.".html");
header('Location: '."tracking.php?ref=$ref&user=unLock");
?>
We get the reference to a file from "id" and "ref" and the input data from array "textareaX". Next I open the file, identify the html element by id and replace its content (a link) with the input data from the textarea. I provide only the href in the textarea and the script builds the hyperlink from that. Next I plug this back into the original file and overwrite the input file.
When I write the new file, though, the link <a href=...> </a> is converted to &lt;a href=...&gt; &lt;/a&gt;, which is a problem.
Here is part of your code with the issue identified:
<?php
function setInnerHTML($DOM, $element, $innerHTML) {
/*********************************
Well, there's your problem:
**********************************/
$node = $DOM->createTextNode($innerHTML);
$children = $element->childNodes;
foreach ($children as $child) {
$element->removeChild($child);
}
$element->appendChild($node);
}
?>
What you are doing is passing your new anchor (a) tag as a string then creating a text node out of it (text is just that - text, not HTML). The createTextNode function automatically encodes any HTML tags so that they will be visible as text when viewed by a browser (this is so you can present HTML as visible code on your page if you choose to).
What you need to do is create the element as HTML (not a text node) then append it:
<?php
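function setInnerHTML($DOM, $element, $innerHTML) {
// Remove the existing children first so the new content replaces them.
while ($element->firstChild) {
$element->removeChild($element->firstChild);
}
// appendXML() parses the string as markup, so it must be well-formed XML.
$f = $DOM->createDocumentFragment();
$f->appendXML($innerHTML);
$element->appendChild($f);
}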
?>
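To make the difference concrete, the following sketch adds the same string to one element as a text node and to another as a document fragment, then prints both. The two div ids are hypothetical; the link string is the one from the question.

```php
<?php
// The link string from the question, added in two different ways.
$snippet = '<a title="editable" href="http://somewhere.net">somewhere.net</a>';

$doc = new DOMDocument();
// Hypothetical container elements for the comparison.
$doc->loadHTML('<div id="asText"></div><div id="asMarkup"></div>');
$divs = $doc->getElementsByTagName('div');

// Text node: the string is literal characters and is escaped on output.
$divs->item(0)->appendChild($doc->createTextNode($snippet));

// Fragment: appendXML() parses the string, so it becomes a real <a> element.
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($snippet);
$divs->item(1)->appendChild($fragment);

$asText = $doc->saveHTML($divs->item(0));
$asMarkup = $doc->saveHTML($divs->item(1));
echo $asText, "\n", $asMarkup, "\n";
```

Note that appendXML() expects well-formed XML, so a URL containing a bare & would need it written as &amp; before being passed in.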

how to get href from within element using php and simple html dom

I have an html page that looks a bit like this
xxxx
google!
<div class="big-div">
<a href="http://www.url.com/123" title="123">
<div class="little-div">xxx</div></a>
<a href="http://www.url.com/456" title="456">
<div class="little-div">xxx</div></a>
</div>
xxxx
I am trying to pull all of the hrefs out of the big-div. I can get all the hrefs out of the whole page by using code like the one below.
$links = $html->find ('a');
foreach ($links as $link)
{
echo $link->href.'<br>';
}
But how do I get only the hrefs within the div "big-div"?
Edit:
I think I got it. For those that care:
foreach ($html->find('div[class=big-div]') as $element) {
$links = $element->find('a');
foreach ($links as $link) {
echo $link->href.'<br>';
}
}
The documentation is useful. To get the links inside the first big-div:
$html->find('.big-div', 0)->find('a');
And then proceed to get the href and whatever other attributes you are interested in.
Edit: The above would be the general idea. I've never used Simple HTML DOM, so perhaps you need to tweak the syntax somewhat. Since find() returns an array of matches unless you pass an index, loop over the links as well:
foreach($html->find('.big-div') as $bigDiv) {
foreach($bigDiv->find('a') as $link) {
echo $link->href . '<br>';
}
}
or perhaps:
$bigDivs = $html->find('.big-div');
foreach($bigDivs as $div) {
foreach($div->find('a') as $link) {
echo $link->href . '<br>';
}
}
Quick fix: use a descendant selector and an index to get the first link inside big-div:
$href = $html->find('.big-div a', 0)->href;
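For comparison, the same extraction can be sketched without Simple HTML DOM, using PHP's built-in DOMDocument and DOMXPath. This is a different technique from the answers above, shown only as an alternative. The markup is modelled on the question, and the link outside the div is a hypothetical stand-in for the rest of the page.

```php
<?php
// Sample markup modelled on the question; the "outside" link is hypothetical.
$markup = '<p><a href="http://www.example.com/outside">outside</a></p>'
    . '<div class="big-div">'
    . '<a href="http://www.url.com/123" title="123"><div class="little-div">xxx</div></a>'
    . '<a href="http://www.url.com/456" title="456"><div class="little-div">xxx</div></a>'
    . '</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match <a href> elements anywhere inside a div whose class list contains "big-div".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " big-div ")]//a[@href]';

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

echo implode("\n", $hrefs), "\n";
```

The concat/normalize-space expression is a common way to test for one class name among several, since XPath 1.0 has no built-in class selector.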

HTML code appears in the page as source and is not executed by the browser

My code looks like this:
$link = "<a class=\"openevent\" href=\"$finalUrl\" target=\"_blank\">Open Event</a>";
foreach ($spans as $span) {
if ($span->getAttribute('class') == 'category') {
$span->nodeValue .= $link;
}
}
The problem is that the $link variable is echoed in the page as HTML source, like this:
<a class="openevent" href="http://www.mysite.com/Free-Live-Streaming-Video-Online-Other-Cycling-Cycling-The-Tour-of-Britain-170638.html" target="_blank">Open Event</a>
instead of appearing as a normal hyperlink.
What is wrong with my code?
You are adding text to the span's node value. To add an anchor node, create it with createElement, set its attributes, and then append it to the span.
foreach ($spans as $span) {
if ($span->getAttribute('class') == 'category') {
$link = $doc->createElement('a', 'Open Event');
$link->setAttribute("class", "openevent");
$link->setAttribute("href", $finalUrl);
$link->setAttribute("target", "_blank");
$span->appendChild($link);
}
}
It looks like you are building some kind of XML in the foreach loop. When the document is serialized, HTML characters in text are encoded ('<' becomes &lt;), so what you print is not actual HTML. Maybe the html_entity_decode function will work for you:
echo html_entity_decode($doc->saveHTML());

Remove HTML element from parsed HTML document on a condition

I've parsed an HTML document using PHP Simple HTML DOM Parser. In the parsed document there's a ul tag with some li tags in it. One of these li tags contains one of those dreaded "Add This" buttons which I want to remove.
To make this worse, the list item has no class or id, and it is not always in the same position in the list. So there is no easy way (correct me if I'm wrong) to remove it with the parser.
What I want to do is to search for the string 'addthis.com' in all li-elements and remove any element that contains that string.
<ul>
<li>Foobar</li>
<li>addthis.com</li><!-- How do I remove this? -->
<li>Foobar</li>
</ul>
FYI: This is purely a hobby project in my quest to learn PHP and not a case of content theft for profit.
All suggestions are welcome!
I couldn't find a method to remove nodes explicitly, but you can remove one by setting its outertext to an empty string.
$html = new simple_html_dom();
$html->load(file_get_contents("test.html"), false, false); // preserve formatting
foreach($html->find('ul li') as $element) {
if (count($element->find('a.addthis_button')) > 0) {
$element->outertext="";
}
}
echo $html;
Well what you can do is use jQuery after the parsing. Something like this:
$('li').each(function(i) {
if($(this).html() == "addthis.com"){
$(this).remove();
}
});
This solution uses the DOMDocument class and the DOMNode::removeChild method:
$str="<ul><li>Foobar</li><li>addthis.com</li><li>Foobar</li></ul>";
$remove='addthis.com';
$doc = new DOMDocument();
$doc->loadHTML($str);
$elements = $doc->getElementsByTagName('li');
$domElemsToRemove = array();
foreach ($elements as $element) {
$pos = strpos($element->textContent, $remove); // or similar $element->nodeValue
if ($pos !== false) {
$domElemsToRemove[] = $element;
}
}
foreach( $domElemsToRemove as $domElement ){
$domElement->parentNode->removeChild($domElement);
}
$str = $doc->saveHTML(); // the list is now <ul><li>Foobar</li><li>Foobar</li></ul>, wrapped in the doctype/html/body that loadHTML adds
