I am using file_get_contents to get the HTML source of a remote page; the markup I get back consists of many tables.
The markup has many <td> elements like the one below:
<td colspan="2">
<b>Video </b>
<span class="section">Sports</span><b>: </b>
<span id="category466" class="category">Motor Sports</span>
</td>
I want to add the div below just before closing </td>
<div style="float: right; padding-right: 2px;"><a class="open_event_tab" target="_blank" href="page123.html" >open event</a></div>
My code now looks like this:
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('td');
?>
I am stuck at getElementsByTagName: I don't know what to do next to add the div as described above.
Read the documentation!
The DOMDocument::getElementsByTagName() method returns an instance of DOMNodeList.
DOMNodeList implements the Traversable interface, which means that it can be used in a foreach loop. You can also loop over it using the DOMNodeList::$length property and the DOMNodeList::item($index) method.
Looping over the DOMNodeList you will be working with instances of DOMNode. The DOMNode class has a method called DOMNode::appendChild(), which, funnily enough, takes a DOMNode as its argument.
Now you just have to create the DOMNode and append it. It may not be intuitive to work with the DOM, but at least it is simple once you get acquainted with the documentation.
Put this page under your pillow.
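The steps described above can be sketched roughly as follows. This is a minimal illustration, not the asker's final code: the inline $html string is an assumption standing in for the fetched page, and every <td> receives the div, without any filtering.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$html = '<table><tr><td colspan="2"><b>Video </b>'
      . '<span class="section">Sports</span><b>: </b>'
      . '<span id="category466" class="category">Motor Sports</span></td></tr></table>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a DOMNodeList, which works with foreach.
foreach ($doc->getElementsByTagName('td') as $td) {
    $div = $doc->createElement('div');
    $div->setAttribute('style', 'float: right; padding-right: 2px;');

    $a = $doc->createElement('a', 'open event');
    $a->setAttribute('class', 'open_event_tab');
    $a->setAttribute('target', '_blank');
    $a->setAttribute('href', 'page123.html');
    $div->appendChild($a);

    // appendChild() adds the node as the last child, i.e. just before </td>.
    $td->appendChild($div);
}

$result = $doc->saveHTML();
echo $result;
```

Appending inside the foreach is safe here because the new nodes are div and a elements, so the live list of td elements does not change while it is being iterated.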
This code now works with the updated HTML (below the code). It inserts the div at the places where you want it to be.
<?php
//Get the url
$url = "http://remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument('1.0'); // create DOMDocument
libxml_use_internal_errors(false);
$doc->loadXML($html); // load the (XHTML-compliant) markup as XML
$domxpath = new DOMXPath($doc);
$filtered = $domxpath->query("//td[@colspan='2']");
$nodeList = $doc->getElementsByTagName('td');
$length = $filtered->length;
$nodes = array();
for ($i = $length - 1; $i >= 0; --$i) {
$node = $filtered->item($i);
$lastChildHTML = $doc->saveXML($node->lastChild);
if (strpos($lastChildHTML, 'class="category"') !== false) {
$nodes[] = $node;
}
}
$allTDNodes = $doc->getElementsByTagName('td');
$tdNodes = array();
foreach ($allTDNodes as $tdNode) {
if (in_array($tdNode, $nodes, true)) {
$tdNodes[] = $tdNode;
}
}
$tdNodes = array_reverse($tdNodes);
$length = count($nodes, 0);
for ($i = 0; $i < $length; $i++) {
$replacement = $doc->createDocumentFragment();
$nodeContent = $doc->saveXML($tdNodes[$i]);
$replacement->appendXML($nodeContent);
$divNode = createDivNode($doc);
$replacement->firstChild->appendChild($divNode);
$tdNodes[$i]->appendChild($divNode);
}
echo $doc->saveXML();
function createDivNode($doc) {
$divNode = $doc->createElement('div');
$divNode->setAttribute('style', 'float: right; padding-right: 2px;');
$aNode = $doc->createElement('a', 'open event');
$aNode->setAttribute('class', 'open_event_tab');
$aNode->setAttribute('target', '_blank');
$aNode->setAttribute('href', 'page123.html');
$divNode->appendChild($aNode);
return $divNode;
}
I have updated the used HTML to make it XHTML compliant and fixed a style issue (the relevant areas had css property height: 0px attached to them).
Related
How do I change the outerHTML of an element using PHP's DOMDocument class? No third-party library such as Simple HTML DOM should be used.
For example:
I want to do something like this.
$dom = new DOMDocument;
$dom->loadHTML($html);
$tag = $dom->getElementsByTagName('h3');
foreach ($tag as $e) {
$e->outerHTML = '<h5>Hello World</h5>';
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
And the output should be like this:
Old Output: <h3>Hello World</h3>
But I need this new output: <h5>Hello World</h5>
You can create a copy of the element content and attributes in a new node (with the new name you need), and use the function replaceChild().
The code below works only with simple elements (plain text inside a node); if you have nested elements, you will need to write a recursive function.
$dom = new DOMDocument;
$dom->loadHTML($html);
$titles = $dom->getElementsByTagName('h3');
for($i = $titles->length-1 ; $i >= 0 ; $i--)
{
$title = $titles->item($i);
$titleText = $title->textContent ; // get original content of the node
$newTitle = $dom->createElement('h5'); // create a new node with the correct name
$newTitle->textContent = $titleText ; // copy the content of the original node
// copy the attribute (class, style, ...)
$attributes = $title->attributes ;
for($j = $attributes->length-1 ; $j>= 0 ; --$j)
{
$attributeName = $attributes->item($j)->nodeName ;
$attributeValue = $attributes->item($j)->nodeValue ;
$newAttribute = $dom->createAttribute($attributeName);
$newAttribute->nodeValue = $attributeValue ;
$newTitle->appendChild($newAttribute);
}
$title->parentNode->replaceChild($newTitle, $title); // replace the original node with our copy
}
libxml_clear_errors();
$html = $dom->saveHTML();
echo $html;
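For nested elements, one alternative to a recursive copy is to move the child nodes into the new element instead of copying the text. The following is a sketch under that assumption; renameElement() is a hypothetical helper, not part of the DOM extension, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: swaps an element for a new one with a different tag
// name, keeping its attributes and its (possibly nested) children.
function renameElement(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);

    // Copy every attribute (class, style, ...).
    foreach ($old->attributes as $attribute) {
        $new->setAttribute($attribute->nodeName, $attribute->nodeValue);
    }

    // Move the children one by one; moving keeps nested markup intact.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }

    $old->parentNode->replaceChild($new, $old);
    return $new;
}

$html = '<h3 class="title">Hello <em>World</em></h3>';   // sample input (assumption)
$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live list into an array first, because replacing nodes shrinks it.
foreach (iterator_to_array($dom->getElementsByTagName('h3')) as $h3) {
    renameElement($h3, 'h5');
}

$result = $dom->saveHTML();
echo $result;
```

Moving nodes with appendChild() detaches them from the old element automatically, which is why the while loop terminates once the old element is empty.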
I need to process a DOM and remove all hyperlinks to a particular site while retaining the underlying text. Thus, something like <a href="...">text</a> changes into text. Taking a cue from this thread, I wrote this:
$as = $dom->getElementsByTagName('a');
for ($i = 0; $i < $as->length; $i++) {
$node = $as->item($i);
$link_href = $node->getAttribute('href');
if (strpos($link_href,'offendinglink.com') !== false) {
$cl = $node->getAttribute('class');
$text = new DomText($node->nodeValue);
$node->parentNode->insertBefore($text, $node);
$node->parentNode->removeChild($node);
$i--;
}
}
This works fine except that I also need to retain the class attributed to the offending <a> tag and maybe turn it into a <div> or a <span>. Thus, I need this:
<a href="..." class="nice">text</a>
to turn into this:
<div class="nice">text</div>
How do I access the new element after it's been added (like in my code snippet)?
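Regarding "How do I access the new element after it's been added (like in my code snippet)?": your new element is in $text, I think. In any case, this should work if you need to keep the class and the text content but nothing else. The node list is copied into an array first, because removing nodes from the live list during foreach would skip elements.
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $url){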
if(parse_url($url->getAttribute("href"),PHP_URL_HOST)!=='badsite.com') {
continue;
}
$ele = $dom->createElement("div");
$ele->textContent = $url->textContent;
$ele->setAttribute("class",$url->getAttribute("class"));
$url->parentNode->insertBefore($ele,$url);
$url->parentNode->removeChild($url);
}
Tested solution:
<?php
$str = "<b>Dummy</b> <a href='http://google.com' target='_blank' class='nice' id='nicer'>Google.com</a> <a href='http://yandex.ru' target='_blank' class='nice' id='nicer'>Yandex.ru</a>";
$doc = new DOMDocument();
$doc->loadHTML($str);
$anchors = $doc->getElementsByTagName('a');
$l = $anchors->length;
for ($i = 0; $i < $l; $i++) {
$anchor = $anchors->item(0);
$link = $doc->createElement('div', $anchor->nodeValue);
$link->setAttribute('class', $anchor->getAttribute('class'));
$anchor->parentNode->replaceChild($link, $anchor);
}
echo preg_replace(['/^\<\!DOCTYPE.*?<html><body>/si', '!</body></html>$!si'], '', $doc->saveHTML());
Or see runnable.
I wrote the following:
<?php
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
//get all H1
$items = $DOM->getElementsByTagName('h1');
//display all H1 text
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "<br/>";
}
?>
I simply wanted to retrieve all the H1 elements of stackoverflow, but I can't get it working. Whenever I fill in the variable $str manually (for example: <h1>hello</h1><div><h1>hello2</h1></div>) it works. But whenever I try to parse content from another webpage, it does nothing at all...
Help would be appreciated!
$str = 'http://stackoverflow.com';
$DOM = new DOMDocument;
$DOM->loadHTMLFile($str); // get html
echo $DOM->saveHTML(); // echo html
$DOM->saveHTMLFile(FILE_NAME); // save html to file
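The reason the original attempt printed nothing is that loadHTML() parses its argument as markup, so a URL string is treated as plain text. The sketch below illustrates the difference without touching the network; headingTexts() is a hypothetical helper introduced only for this example.

```php
<?php
// Hypothetical helper: returns the text of every <h1> in a markup string.
function headingTexts(string $markup): array
{
    libxml_use_internal_errors(true);   // real-world pages rarely parse cleanly
    $dom = new DOMDocument();
    $dom->loadHTML($markup);
    libxml_clear_errors();

    $texts = [];
    foreach ($dom->getElementsByTagName('h1') as $h1) {
        $texts[] = trim($h1->nodeValue);
    }
    return $texts;
}

// A URL string is parsed as plain text, so no <h1> is found in it.
$fromUrlString = headingTexts('http://stackoverflow.com');

// Actual markup; for a remote page, file_get_contents($url) would supply it.
$fromMarkup = headingTexts('<h1>hello</h1><div><h1>hello2</h1></div>');

print_r($fromMarkup);
```

Either loadHTMLFile($url) or loadHTML(file_get_contents($url)) gets the remote markup into the document; both depend on PHP being allowed to open URLs.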
I'm trying to duplicate a row (with data-id='first') from a template three times and fill the placeholder ({first}) with a value (0, 1, 2 in this case). My code is below. I don't understand why this line finds more than one node: $nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode); (it finds nodes that contain the text 'first'). It finds both rows, the cloned one and the original, so the text is replaced in both, while it should be replaced only in the new one. Note that I'm passing the second parameter to $xpath->query, which should make the search relative to the node I just cloned.
Here's a fiddle: https://eval.in/170941
HTML:
<html>
<head>
<title>test</title>
</head>
<body>
<table>
<tr data-id="first">
<td>{first}</td>
</tr>
</table>
</body>
</html>
PHP:
<?php
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
for ($i = 0; $i < 3; $i++) {
$newNode = $element->cloneNode(true);
$parent->insertBefore($newNode, $element);
$nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
for($j = 0; $j < $nodeList->length; $j++) {
$n = $nodeList->item($j);
$n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
}
}
$parent->removeChild($element);
echo $dom->saveHTML();
As you can see, the result is a three-row table with the values 0,0,0, while the expected values are 0,1,2.
Starting an XPath location path with / means that it starts at the document root. So //* always matches any element node in the document, and the context argument has no effect.
Try:
$nodeList = $xpath->query(".//*[text()[contains(.,'first')]]", $newNode);
HINT: DOMXPath::query() only allows expressions that return a node list; DOMXPath::evaluate() allows all expressions. Example: count(//*).
HINT: DOMNodeList objects are iterable, so you can use foreach to iterate over them.
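A small sketch of the difference between an absolute and a relative location path when a context node is passed, together with the two hints above. The two-row table is an assumption made up for this example.

```php
<?php
$html = '<table><tr id="a"><td>{first}</td></tr><tr id="b"><td>{first}</td></tr></table>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Use the second row as the context node.
$row = $xpath->query("//tr[@id='b']")->item(0);

// Absolute path: searches the whole document, the context node is ignored.
$absolute = $xpath->query("//td[contains(., 'first')]", $row);

// Relative path: the leading dot restricts the search to descendants of $row.
$relative = $xpath->query(".//td[contains(., 'first')]", $row);

// evaluate() also accepts expressions that return a number or a string.
$total = $xpath->evaluate('count(//td)');

foreach ($relative as $td) {          // DOMNodeList works with foreach
    echo $td->nodeValue, "\n";
}
echo $absolute->length, ' ', $relative->length, ' ', $total, "\n";
```

The absolute query matches the cells of both rows, while the relative one matches only the cell inside the context row.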
The problem you are having is that you are cloning the original node, but in your first pass you're altering the original node's content. Every pass after that is copying the already modified node, so there is no {first} to find.
One solution is to make a clone of the source element which you never insert into the document, and use that inside your loop.
Here's my fiddle: https://eval.in/171149
<?php
$html = '<html><head><title>test</title></head><body><table><tr data-id="first"><td>{first}</td></tr></table></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$element = $xpath->query("//*[@data-id='first']")->item(0);
$element->removeAttribute("data-id");
$parent = $element->parentNode;
$clonedNode = $element->cloneNode(true);
for ($i = 0; $i < 3; $i++) {
$newNode = $clonedNode->cloneNode(true);
$parent->insertBefore($newNode, $element);
$nodeList = $xpath->query("//*[text()[contains(.,'first')]]", $newNode);
for($j = 0; $j < $nodeList->length; $j++) {
$n = $nodeList->item($j);
$n->nodeValue = preg_replace("{{first}}", $i, $n->nodeValue);
}
}
$parent->removeChild($element);
echo $dom->saveHTML();
I am trying to write something for the PHP HTML DOM that works with an element path pattern.
It looks as follows. I can have different paths from which I want to extract some text, like:
$elements = 'h1;span;';
$elements = 'div.test;h2;span';
I tried to create a function to handle these inputs, but I am stuck on the
part that chains the 'getElementsByTagName()' calls in the right order and returns the value of
the last element.
What I have so far:
function convertName($html, $elements) {
$elements = explode(';', $elements);
$dom = new DOMDocument;
$dom->loadHTML($html);
$name = null;
foreach ($elements as $element) :
$name. = getElementsByTagName($element)->item(0)->;
endforeach;
$test = $dom->$name.'nodeValue';
print_r($test); // receive value
}
I hope someone can give me some input or examples.
Maybe something like this:
function convertName($html, $elements) {
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$elements = array_filter(explode(';', $elements)); // drop the empty entry left by a trailing ;
$elemValues = array();
foreach ($elements as $element) {
$nodelist = $xpath->query("//$element");
for($i=0; $i < $nodelist->length; $i++)
$elemValues[$element][] = $nodelist->item($i)->nodeValue;
}
return $elemValues;
}
// TESTING
$html = <<< EOF
<span class="bar">Some normal Text</span>
<input type="hidden" name="hf" value="123">
<h1>Heading 1<span> span inside h1</span></h1>
<div class='foo'>Some DIV</div>
<span class="bold">Bold Text</span>
<p/>
EOF;
$elements = 'h1;span;';
// replace all but last ; with / to get valid XPATH
$elements = preg_replace('#;(?=[^;]*;)#', '/', $elements);
// call our function
$elemValues = convertName($html, $elements);
print_r($elemValues);
OUTPUT:
Array
(
[h1/span] => Array
(
[0] => span inside h1
)
)