PHP DOM Document not parsing / retrieving HTML


I wrote the following:
<?php
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
//get all H1
$items = $DOM->getElementsByTagName('h1');
//display all H1 text
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "<br/>";
}
?>
I simply want to retrieve all the H1 elements of stackoverflow, but can't get it working. Whenever I fill in the variable $str manually (for example: <h1>hello</h1><div><h1>hello2</h1></div>) it works. But whenever I try to parse content from another webpage, it does nothing at all...
Help would be appreciated!

$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTMLFile($str); // get html
echo $DOM->saveHTML(); // echo html
$DOM->saveHTMLFile(FILE_NAME); // save html to file
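The difference between the two loaders can be checked without any network access. The following is a minimal sketch, not part of the original answer; the helper name h1Texts() is made up for illustration. loadHTML() treats its argument as markup, so a URL string is parsed as one line of plain text and contains no h1 elements:

```php
<?php
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Hypothetical helper: returns the text of every <h1> found in a markup string.
function h1Texts(string $markup): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($markup); // parses the STRING itself, never fetches anything
    $texts = [];
    foreach ($dom->getElementsByTagName('h1') as $h1) {
        $texts[] = trim($h1->textContent);
    }
    return $texts;
}

// The URL is parsed as a short text document, so there is no <h1> in it.
$fromUrlString = h1Texts('http://stackoverflow.com');

// Real markup yields the headings as expected.
$fromMarkup = h1Texts('<h1>hello</h1><div><h1>hello2</h1></div>');

var_dump($fromUrlString, $fromMarkup);
```

To parse a remote page, either call loadHTMLFile($url) or fetch the markup first with file_get_contents($url) and pass the result to loadHTML().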

Related

PHP: Remove a hyperlink from element but retain the text and class

I need to process a DOM and remove all hyperlinks to a particular site while retaining the underlying text. Thus, a hyperlinked "text" should change into plain text. Taking a cue from this thread, I wrote this:
$as = $dom->getElementsByTagName('a');
for ($i = 0; $i < $as->length; $i++) {
$node = $as->item($i);
$link_href = $node->getAttribute('href');
if (strpos($link_href,'offendinglink.com') !== false) {
$cl = $node->getAttribute('class');
$text = new DomText($node->nodeValue);
$node->parentNode->insertBefore($text, $node);
$node->parentNode->removeChild($node);
$i--;
}
}
This works fine except that I also need to retain the class attributed to the offending <a> tag and maybe turn it into a <div> or a <span>. Thus, I need this:
<a href="..." class="nice">text</a>
to turn into this:
<div class="nice">text</div>
How do I access the new element after it's been added (like in my code snippet)?

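Regarding "How do I access the new element after it's been added (like in my code snippet)?": in your snippet the new node is simply the one in $text. In any case, if you need to keep the class and the text content but nothing else, this should work. The node list is copied with iterator_to_array() first, because getElementsByTagName() returns a live list and removing nodes while iterating over it directly would skip elements:
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $url){
if(parse_url($url->getAttribute("href"),PHP_URL_HOST)!=='badsite.com') {
continue;
}
// $ele is the new element; keep this reference if you need to access it later
$ele = $dom->createElement("div");
$ele->textContent = $url->textContent;
$ele->setAttribute("class",$url->getAttribute("class"));
$url->parentNode->insertBefore($ele,$url);
$url->parentNode->removeChild($url);
}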
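A self-contained way to try this approach is sketched below. The sample markup and the example.org link are assumptions made up for the test; badsite.com is the host used in the answer above. It snapshots the node list, replaces each offending link with a div that keeps the class, and prints the body content:

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical sample markup: one offending link and one that must survive.
$html = '<div><a href="http://badsite.com/x" class="nice">text</a> '
      . '<a href="http://example.org/" class="other">kept</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Snapshot the live node list so replacements do not disturb the iteration.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    if (parse_url($a->getAttribute('href'), PHP_URL_HOST) !== 'badsite.com') {
        continue;
    }
    // $div is the new element; it stays accessible after insertion.
    $div = $dom->createElement('div');
    $div->appendChild($dom->createTextNode($a->textContent));
    $div->setAttribute('class', $a->getAttribute('class'));
    $a->parentNode->replaceChild($div, $a);
}

// Serialize only the children of <body>, skipping the wrapper loadHTML() adds.
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}
echo $result, "\n";
```

Building the text with createTextNode() rather than passing it as the second argument of createElement() avoids trouble with characters such as & in the link text.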

Tested solution:
<?php
$str = "<b>Dummy</b> <a href='http://google.com' target='_blank' class='nice' id='nicer'>Google.com</a> <a href='http://yandex.ru' target='_blank' class='nice' id='nicer'>Yandex.ru</a>";
$doc = new DOMDocument();
$doc->loadHTML($str);
$anchors = $doc->getElementsByTagName('a');
$l = $anchors->length;
for ($i = 0; $i < $l; $i++) {
$anchor = $anchors->item(0);
$link = $doc->createElement('div', $anchor->nodeValue);
$link->setAttribute('class', $anchor->getAttribute('class'));
$anchor->parentNode->replaceChild($link, $anchor);
}
echo preg_replace(['/^\<\!DOCTYPE.*?<html><body>/si', '!</body></html>$!si'], '', $doc->saveHTML());

adding div to html code got with File_get_contents

I am using file_get_contents to get the HTML source of a remote page. The code I get consists of many tables.
The code has many <td> elements like the one below:
<td colspan="2">
<b>Video </b>
<span class="section">Sports</span><b>: </b>
<span id="category466" class="category">Motor Sports</span>
</td>
I want to add the div below just before closing </td>
<div style="float: right; padding-right: 2px;"><a class="open_event_tab" target="_blank" href="page123.html" >open event</a></div>
My code now looks like this:
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('td');
?>
I am stuck at getElementsByTagName; I don't know what to do next to add the div as described above.

Read the documentation!
The DOMDocument::getElementsByTagName() method returns an instance of DOMNodeList.
DOMNodeList implements the Traversable interface, which means that it can be used in a foreach loop. You can also loop over it using the DOMNodeList::$length property and the DOMNodeList::item($index) method.
Looping over the DOMNodeList you will be working with instances of DOMNode. The DOMNode class has a method called DOMNode::appendChild(), which, funnily enough, takes a DOMNode as its argument.
Now you just have to create the DOMNode and append it. It may not be intuitive to work with the DOM, but at least it is simple once you get acquainted with the documentation.
Put this page under your pillow.
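The steps described above can be sketched as follows. This is only an illustration: the inline table is a shortened copy of the markup from the question, used instead of the remote page so the snippet runs offline.

```php
<?php
libxml_use_internal_errors(true);

// Shortened sample of the question's markup (stands in for the remote page).
$html = '<table><tr><td colspan="2"><b>Video </b>'
      . '<span id="category466" class="category">Motor Sports</span></td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// DOMNodeList is traversable, so foreach works directly.
foreach ($doc->getElementsByTagName('td') as $td) {
    $div = $doc->createElement('div');
    $div->setAttribute('style', 'float: right; padding-right: 2px;');

    $a = $doc->createElement('a', 'open event');
    $a->setAttribute('class', 'open_event_tab');
    $a->setAttribute('target', '_blank');
    $a->setAttribute('href', 'page123.html');

    $div->appendChild($a);
    $td->appendChild($div); // becomes the last child, just before </td>
}

$out = $doc->saveHTML();
echo $out;
```

Appending inside the loop is safe here because adding a div does not change which td elements are in the list.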

This code now works with the updated HTML (below the code). It inserts the div at the places where you want them to be.
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument('1.0'); // create DOMDocument
libxml_use_internal_errors(false);
$doc->loadXML($html); // parse the markup as XML (it must be well-formed XHTML)
$domxpath = new DOMXPath($doc);
$filtered = $domxpath->query("//td[@colspan='2']");
$nodeList = $doc->getElementsByTagName('td');
$length = $filtered->length;
$nodes = array();
for ($i = $length - 1; $i >= 0; --$i) {
$node = $filtered->item($i);
$lastChildHTML = $doc->saveXML($node->lastChild);
if (strpos($lastChildHTML, 'class="category"') !== false) {
$nodes[] = $node;
}
}
$allTDNodes = $doc->getElementsByTagName('td');
$tdNodes = array();
foreach ($allTDNodes as $tdNode) {
if (in_array($tdNode, $nodes, true)) {
$tdNodes[] = $tdNode;
}
}
$tdNodes = array_reverse($tdNodes);
$length = count($nodes, 0);
for ($i = 0; $i < $length; $i++) {
$replacement = $doc->createDocumentFragment();
$nodeContent = $doc->saveXML($tdNodes[$i]);
$replacement->appendXML($nodeContent);
$divNode = createDivNode($doc);
$replacement->firstChild->appendChild($divNode);
$tdNodes[$i]->appendChild($divNode);
}
echo $doc->saveXML();
function createDivNode($doc) {
$divNode = $doc->createElement('div');
$divNode->setAttribute('style', 'float: right; padding-right: 2px;');
$aNode = $doc->createElement('a', 'openEvent');
$aNode->setAttribute('class', 'open_event_tab');
$aNode->setAttribute('target', '_blank');
$aNode->setAttribute('href', 'page123.html');
$divNode->appendChild($aNode);
return $divNode;
}
I have updated the used HTML to make it XHTML compliant and fixed a style issue (the relevant areas had css property height: 0px attached to them).
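If the page cannot be made XHTML compliant, a shorter variant is possible with loadHTML() and a single XPath predicate. The sketch below is an assumption-laden simplification, not the code above: it selects cells that have a child span with class "category" instead of inspecting the last child, and the sample table is made up for illustration.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical sample: only the first cell should receive the link.
$html = '<table><tr>'
      . '<td colspan="2"><span class="category">Motor Sports</span></td>'
      . '<td colspan="2"><span class="section">Sports</span></td>'
      . '<td>plain</td>'
      . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// <td colspan="2"> cells that contain a span with class "category".
$cells = $xpath->query('//td[@colspan="2"][span[@class="category"]]');

foreach ($cells as $td) {
    $div = $doc->createElement('div');
    $div->setAttribute('style', 'float: right; padding-right: 2px;');
    $a = $doc->createElement('a', 'open event');
    $a->setAttribute('class', 'open_event_tab');
    $a->setAttribute('href', 'page123.html');
    $div->appendChild($a);
    $td->appendChild($div);
}

$out = $doc->saveHTML();
echo $out;
```

The result of DOMXPath::query() is not a live list, so the document can be modified while looping over it.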

Fetch the attributes using PHP crawler

I am trying to fetch the name, address and location by crawling a website. It's a single page and I don't want anything other than these three values. I am using the code below.
<?php
include 'simple_html_dom.php';
$html = "http://www.phunwa.com/phone/0191/2604233";
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="address-tags"]')->item(0);
for($i=0; $i < $div->length; $i++ )
{
print "nodename=".$div->item( $i )->nodeName;
print "\t";
print "nodevalue : ".$div->item( $i )->nodeValue;
print "\r\n";
echo $link->getElementsByTagName("<p>");
}
?>
The website html source code is
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Address:</strong> R/O BARNAI NETARKOTHIAN, P.O.MUTHI TEH.& DISTT.JAMMU,X, 181206</p>
<p><strong>Location:</strong> JAMMU, Jammu & Kashmir, India</p>
<p><strong>Other Numbers:</strong> 01912604233 | +911912604233 | +91-191-2604233</p>
Can someone please help me get the three attributes as output? Nothing is echoed on the page as of now.
Thanks a lot.

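loadHTML() expects a string of markup, but $html holds a URL, so you need $dom->load($html); instead of $dom->loadHtml($html);. After doing this you will find that the page's HTML is not well formed (load() parses it as XML), so $xpath stays empty.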
Maybe try something like:
$html = file_get_contents('http://www.phunwa.com/phone/0191/2604233');
$name = preg_replace('/(.*)(<p><strong>Name:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$address = preg_replace('/(.*)(<p><strong>Address:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$location = preg_replace('/(.*)(<p><strong>Location:<\/strong> )([^<]+)(<\/p>)(.*)/mis','$3',$html);
$othernumbers = preg_replace('/(.*)(<p><strong>Other Numbers:<\/strong> )(.*)/mis','$3',$html);
list($othernumbers,$trash)= preg_split('/<\/p>/mis',$othernumbers,0);
echo 'name: '.$name.'<br>address: '.$address.'<br>location: '.$location.'<br>other numbers: '.$othernumbers;
exit;

You should use the following for your XPath query:
//*[@class='address-tags']/p
so you're retrieving the actual paragraph nodes that are children of the 'address-tags' parent. Then you can use a loop on them:
$nodes = $xpath->query('//*[@class="address-tags"]/p');
for ($i = 0; $i < $nodes->length; $i++) {
echo $nodes->item($i)->nodeValue;
}
// or just
foreach($nodes as $node) {
echo $node->nodeValue;
}
Right now your code is properly fetching the first div that's found, but then you continue treating that div as if it was a DOMNodeList returned from an xpath query, which is incorrect. ->item() returns a DOMNode object, which does NOT have an ->item() method.
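Put together, the query can be tried offline against a copy of the markup. The sketch below uses two of the paragraphs from the question (with the ampersand escaped) and assumes each paragraph has the form "Label: value"; the $fields array is a name chosen for illustration.

```php
<?php
libxml_use_internal_errors(true);

// Two paragraphs copied from the question's markup, ampersand escaped.
$html = <<<'HTML'
<div class="address-tags">
<p><strong>Name:</strong> RAJ GOPAL SINGH</p>
<p><strong>Location:</strong> JAMMU, Jammu &amp; Kashmir, India</p>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html); // markup string, not a URL
$xpath = new DOMXPath($dom);

$fields = [];
foreach ($xpath->query('//*[@class="address-tags"]/p') as $p) {
    // "Name: RAJ GOPAL SINGH" -> label "Name", value "RAJ GOPAL SINGH"
    $parts = explode(':', $p->textContent, 2);
    if (count($parts) === 2) {
        $fields[trim($parts[0])] = trim($parts[1]);
    }
}

print_r($fields);
```

For the live page, the markup would first have to be fetched, for example with file_get_contents(), before being passed to loadHTML().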

Replace Tag in HTML with DOMDocument

I'm trying to edit html tags with DOMDocument::loadHTML in php. The html data is a part of html and not the whole page. I followed what this page (PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one) says.
This should convert pre tags into div tags but it gives "Fatal error: Uncaught exception 'DOMException' with message 'Not Found Error'."
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
foreach( $dom->getElementsByTagName("pre") as $nodePre ) {
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$dom->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>
[Edit]
While I'm trying to iterate the node object backwards, I get this error, 'Notice: Trying to get property of non-object...'
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
For ($i = $length; $i > -1 ; $i--) {
$nodePre = $domPre->item($i);
echo $nodePre->nodeValue . '<br />';
// $nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
// $dom->replaceChild($nodeDiv, $nodePre);
}
// echo $dom->saveHTML();
?>
[Edit]
Okay, solved. Since the code in the answer had an error, I am posting the solution here. Thanks, all.
Solution:
<?php
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$domPre = $dom->getElementsByTagName('pre');
$length = $domPre->length;
for ($i = $length - 1; $i > -1 ; $i--) {
$nodePre = $domPre->item($i);
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
echo $dom->saveHTML();
?>

The problem is the call to replaceChild(). Rather than
$dom->replaceChild($nodeDiv, $nodePre);
use
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
update
Here is working code. getElementsByTagName() returns a live node list, so replacing nodes while iterating forward over it skips elements (more info here: http://php.net/manual/en/domnode.replacechild.php). You therefore have to loop over the list in reverse to replace the elements.
$contents = <<<STR
<pre>hi</pre>
<pre>hello</pre>
<pre>bye</pre>
STR;
$dom = new DOMDocument;
@$dom->loadHTML($contents);
$elements = $dom->getElementsByTagName("pre");
for ($i = $elements->length - 1; $i >= 0; $i --) {
$nodePre = $elements->item($i);
$nodeDiv = $dom->createElement("div", $nodePre->nodeValue);
$nodePre->parentNode->replaceChild($nodeDiv, $nodePre);
}
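The reverse loop can be verified with a short check. This is a sketch under the same sample input as above; it uses libxml_use_internal_errors() instead of the @ operator and createTextNode() for the text, both of which are substitutions of mine rather than part of the answer.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<pre>hi</pre><pre>hello</pre><pre>bye</pre>');

$pres = $dom->getElementsByTagName('pre'); // live list: shrinks as nodes are replaced

// Walk from the end so the indexes still to be visited stay valid.
for ($i = $pres->length - 1; $i >= 0; $i--) {
    $pre = $pres->item($i);
    $div = $dom->createElement('div');
    $div->appendChild($dom->createTextNode($pre->textContent));
    $pre->parentNode->replaceChild($div, $pre);
}

// Inspect the result: no <pre> left, three <div> elements in document order.
$remainingPre = $dom->getElementsByTagName('pre')->length;
$divTexts = [];
foreach ($dom->getElementsByTagName('div') as $node) {
    $divTexts[] = $node->textContent;
}

echo $dom->saveHTML();
```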

Another way, with paquettg/php-html-parser (I didn't find a way to change the tag name directly, so I had to use a hack that re-binds $this):
use PHPHtmlParser\Dom;
use PHPHtmlParser\Dom\HtmlNode;
$dom = new Dom;
$dom->load($text);
/** @var HtmlNode[] $tags */
foreach($dom->find('pre') as $tag) {
$changeTag = function() {
$this->name = 'div';
};
$changeTag->call($tag->tag);
};
echo (string)$dom;

How to return outer html of DOMDocument?

I'm trying to replace video links inside a string - here's my code:
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach ($doc->getElementsByTagName("a") as $link)
{
$url = $link->getAttribute("href");
if(strpos($url, ".flv"))
{
echo $link->outerHTML();
}
}
Unfortunately, outerHTML doesn't work when I'm trying to get the html code for the full hyperlink like <a href='http://www.myurl.com/video.flv'></a>
Any ideas how to achieve this?

As of PHP 5.3.6 you can pass a node to saveHtml, e.g.
$domDocument->saveHtml($nodeToGetTheOuterHtmlFrom);
Previous versions of PHP did not implement that possibility. You'd have to use saveXml(), but that would create XML compliant markup. In the case of an <a> element, that shouldn't be an issue though.
See http://blog.gordon-oheim.biz/2011-03-17-The-DOM-Goodie-in-PHP-5.3.6/
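Applied to the loop from the question, this could look like the sketch below. The sample $content string is made up for illustration, and the strpos() check is compared against false so that a match at position 0 is not missed.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical content: one video link and one ordinary link.
$content = '<p>Watch <a href="http://www.myurl.com/video.flv">clip</a> or '
         . '<a href="http://www.myurl.com/page.html">read</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($content);

$outer = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    if (strpos($link->getAttribute('href'), '.flv') !== false) {
        // Passing the node to saveHTML() serializes just that element.
        $outer[] = $doc->saveHTML($link);
    }
}

echo implode("\n", $outer), "\n";
```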

You can find a couple of propositions in the users notes of the DOM section of the PHP Manual.
For example, here's one posted by xwisdom :
<?php
// code taken from the Raxan PDI framework
// returns the html content of an element
protected function nodeContent($n, $outer=false) {
$d = new DOMDocument('1.0');
$b = $d->importNode($n->cloneNode(true),true);
$d->appendChild($b); $h = $d->saveHTML();
// remove outer tags
if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
return $h;
}
?>

A simple solution is to define your own function that returns the outer HTML:
function outerHTML($e) {
$doc = new DOMDocument();
$doc->appendChild($doc->importNode($e, true));
return $doc->saveHTML();
}
Then you can use it in your code:
echo outerHTML($link);

Save a file that contains the links as links.html, or change links.html in the code to a URL such as google.com/fly.html that has .flv links in it. Change flv to wmv etc. depending on which hrefs you want.
If there are other hrefs containing that extension, it will pick them up as well.
<?php
$contents = file_get_contents("links.html");
$domdoc = new DOMDocument();
$domdoc->preserveWhiteSpace = false;
$domdoc->loadHTML($contents);
$xpath = new DOMXpath($domdoc);
$query = '//@href';
$nodeList = $xpath->query($query);
foreach ($nodeList as $node){
if(strpos($node->nodeValue, ".flv")){
$linksList = $node->nodeValue;
$htmlAnchor = new DOMElement("a", $linksList);
$htmlURL = new DOMAttr("href", $linksList);
$domdoc->appendChild($htmlAnchor);
$htmlAnchor->appendChild($htmlURL);
$domdoc->saveHTML();
echo ("<a href='". $node->nodeValue. "'>". $node->nodeValue. "</a><br />");
}
}
echo("done");
?>
