PHP walking HTML DOM; issue: it is duplicating results

I'm trying to walk the DOM for div elements and indent the output as I go. It works, except that there are duplicates. I could save the results to an array and check for duplicates, but I'm wondering if there is an easier way. Thanks.
function dom_parse_div_tag($htmlfile)
{
    libxml_use_internal_errors(true); // suppresses DOM warnings
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($htmlfile);
    $nodes = $dom->getElementsByTagName("div");
    foreach ($nodes as $ii => $node) {
        echo "<br>";
        $nodeclass = $node->attributes->getNamedItem('class');
        if (isset($nodeclass))
            echo "Class:" . $nodeclass->nodeValue . "<br>";
        dom_child_node_print($node, 0);
    }
}
function dom_child_node_print($node, $level)
{
    echo "<br>";
    if ($node->hasChildNodes()) {
        $nclass = $node->attributes->getNamedItem('class');
        if (isset($nclass))
            echobr("Class:" . $nclass->nodeValue);
        foreach ($node->childNodes as $ochildnode) {
            if ($ochildnode->hasChildNodes()) {
                dom_child_node_print($ochildnode, $level + 1);
            }
            else {
                if (trim($ochildnode->nodeValue) !== "") {
                    echo "Level$level," . strg_remove_linefeed($ochildnode->nodeValue) . "<br>";
                }
            }
        }
    }
}
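
A likely cause of the duplicates, offered as a sketch rather than a confirmed diagnosis: getElementsByTagName("div") returns every div in the document, nested ones included, and dom_child_node_print() then prints the whole subtree of each one. A nested div is therefore printed once as part of its parent and again when the loop reaches it. The example below assumes a small hypothetical HTML string and an illustrative helper named walk_text() (neither is from the question). It starts the recursion only from div elements that have no div ancestor, so every text node is reached once.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = '<div class="outer"><p>one</p><div class="inner"><p>two</p></div></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);

/**
 * Collect one indented line per non-empty text node below $node.
 * walk_text() is an illustrative name, not part of the original code.
 */
function walk_text(DOMNode $node, int $level, array &$lines): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $lines[] = str_repeat('  ', $level) . $text;
            }
        } elseif ($child instanceof DOMElement) {
            walk_text($child, $level + 1, $lines);
        }
    }
}

// Start only from div elements that have no div ancestor, so nested
// div elements are reached exactly once, through the recursion.
$xpath = new DOMXPath($dom);
$lines = [];
foreach ($xpath->query('//div[not(ancestor::div)]') as $div) {
    walk_text($div, 0, $lines);
}

echo implode(PHP_EOL, $lines), PHP_EOL;
```

An alternative with the same effect is to skip getElementsByTagName() entirely and recurse once from the document's root element.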

Related

XML: why is my DOM traversal function yielding only the top-level node?

I thought I would write a simple function to visit all the nodes in a DOM tree. I wrote it, gave it a not-too-complex bit of XML to work on, but when I ran it I got only the top-level (DOMDocument) node.
Note that I am using PHP's Generator syntax:
http://php.net/manual/en/language.generators.syntax.php
Here's my function:
function DOMIterate($node)
{
    yield $node;
    if ($node->hasChildNodes())
    {
        foreach ($node->childNodes as $subnode) {
            // if($subnode != null) {
            DOMIterate($subnode);
            // }
        }
    }
}
And the testcase code that is supposed to print the results:
$doc = new DOMDocument();
$doc->loadXML($input);
foreach (DOMIterate($doc) as $node) {
    $type = $node->nodeType;
    if ($type == XML_ELEMENT_NODE) {
        $tag = $node->tagName;
        echo "$tag\n";
    }
    else if ($type == XML_DOCUMENT_NODE) {
        echo "document\n";
    }
    else if ($type == XML_TEXT_NODE) {
        $text = $node->wholeText;
        echo "text: $text\n";
    } else {
        $linenum = $node->getLineNo();
        echo "unknown node type: $type at input line $linenum\n";
    }
}
The input XML is the first 18 lines of
https://www.w3schools.com/xml/plant_catalog.xml
plus a closing

If you're using PHP 7 or later, you can try this:
<?php
$string = <<<EOS
<div level="1">
    <div level="2">
        <p level="3"></p>
        <p level="3"></p>
    </div>
    <div level="2">
        <span level="3"></span>
    </div>
</div>
EOS;
$document = new DOMDocument();
$document->loadXML($string);
function DOMIterate($node)
{
    yield $node;
    if ($node->childNodes) {
        foreach ($node->childNodes as $childNode) {
            yield from DOMIterate($childNode);
        }
    }
}
foreach (DOMIterate($document) as $node) {
    echo $node->nodeName . PHP_EOL;
}
Here's a working example - http://sandbox.onlinephpfunctions.com/code/ab4781870f8f988207da78b20093b00ea2e8023b
Keep in mind that you'll also get the text nodes that are contained within the tags.

Calling DOMIterate($subnode) recursively only creates a new generator object. Nothing iterates over it, so the values it would yield never reach the caller of the original generator. You need to use yield from to propagate the values back.
function DOMIterate($node)
{
    yield $node;
    if ($node->hasChildNodes())
    {
        foreach ($node->childNodes as $subnode) {
            // if($subnode != null) {
            yield from DOMIterate($subnode);
            // }
        }
    }
}
This requires PHP 7. If you're using an earlier version, see Recursive generators in PHP.
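
For versions before PHP 7, one common workaround is to loop over the inner generator and re-yield each value by hand. The sketch below assumes PHP 5.5 or later (where generators exist); the function name DOMIterateLegacy() and the sample XML are made up for the illustration.

```php
<?php
// Recursive generator without `yield from`.
function DOMIterateLegacy($node)
{
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $subnode) {
            // Consume the inner generator and re-yield each of its values.
            foreach (DOMIterateLegacy($subnode) as $descendant) {
                yield $descendant;
            }
        }
    }
}

// Hypothetical sample input for the illustration.
$doc = new DOMDocument();
$doc->loadXML('<a><b><c/></b><d/></a>');

$names = array();
foreach (DOMIterateLegacy($doc) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), PHP_EOL;
```

Each level of nesting adds one more re-yield per node, so this is slower than yield from on deep trees, but the traversal order is the same.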

Using DomDocuments, finding and returning value of ID

I have jQuery that I can run in the console, and it finds the element.
$.get("http://www.roblox.com/groups/group.aspx?gid=2755722", function(webpage) {
    if ($(webpage).find("#ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds .robux").length) {
        alert("Eureka I found it!")
    } else {
        alert("nope!")
    }
})
<div id="ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds" class="StandardBox" style="padding-right:0">
    <b>Funds:</b>
    <span class="robux" style="margin-left:5px">29</span>
    <span class="tickets" style="margin-left:5px">45</span>
</div>
When I try to run it as PHP, with functions that use DOMDocument to handle it all, it won't return anything when I decode it. (The following is all part of a class.)
protected function xpath($url,$path)
{
    libxml_use_internal_errors(true);
    $dom = new DomDocument;
    $dom->loadHTML($this->file_get_contents_curl($url));
    $xpath = new DomXPath($dom);
    return $xpath->query($path);
}
public function GetGroupStats($id)
{
    $elements = array (
        'Robux' => "//span[@id='ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds .robux']",
        'Tix' => "//span[@id='ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds .tickets']",
    );
    $data = array();
    foreach($elements as $name => $element)
    {
        foreach ($this->xpath('http://www.roblox.com/Groups/group.aspx?gid='.$id,$element) as $i => $node)
            $data[$name] = $node->nodeValue;
    }
    return $data;
}
// File that includes the class and runs the function (ignore the login stuff because it isn't required for this situation)
<?php
$randomstuffdude = include 'RApi.php';
$GetAccessToken = $_GET['token'];
if ($GetAccessToken == "secrettoken6996") {
    $rbxBot = new Roblox();
    $rbxBot->DoLogin();
    $StatsArray = $rbxBot->GetGroupStats(2755722);
    foreach ($StatsArray as $other => $array) {
        echo $other . ' : ' . $array . ' / ';
    }
} else {
    echo "no";
}
?>
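
The XPath expressions in GetGroupStats() put a CSS-style "id .class" string inside a single id test on a span, so they cannot match: the id belongs to the div, and the class belongs to the span inside it. A minimal sketch of an XPath equivalent follows, run against the markup quoted in the question instead of the live page. It assumes the fetched HTML really contains that markup (the page may differ, or may need a login), and the variable names are illustrative.

```php
<?php
// Sample markup copied from the question; a real page would be fetched over HTTP.
$html = <<<'HTML'
<div id="ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds" class="StandardBox" style="padding-right:0">
<b>Funds:</b>
<span class="robux" style="margin-left:5px">29</span>
<span class="tickets" style="margin-left:5px">45</span>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath equivalent of the CSS selector "#id .class": match the container by
// its id first, then a descendant span by its class.
$paths = array(
    'Robux' => "//*[@id='ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds']//span[@class='robux']",
    'Tix'   => "//*[@id='ctl00_cphRoblox_rbxGroupFundsPane_GroupFunds']//span[@class='tickets']",
);

$data = array();
foreach ($paths as $name => $path) {
    $nodes = $xpath->query($path);
    if ($nodes->length > 0) {
        $data[$name] = trim($nodes->item(0)->nodeValue);
    }
}

print_r($data);
```

The test @class='robux' only matches when the attribute is exactly that string. For elements that carry several classes, contains(concat(' ', normalize-space(@class), ' '), ' robux ') is the usual substitute.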

get value using DOMDocument

I am trying to fetch a value from the following html snippet using DOMDocument:
<h3>
<meta itemprop="priceCurrency" content="EUR">€
<meta itemprop="price" content="465.0000">465
</h3>
I need to fetch the value 465 from this code snippet. To do this I am using the following code:
foreach($dom->getElementsByTagName('h3') as $h) {
    foreach($h->getElementsByTagName('meta') as $p) {
        if($h->getAttribute('itemprop') == 'price') {
            foreach($h->childNodes as $child) {
                $name = $child->nodeValue;
                echo $name;
                $name = preg_replace('/[^0-9\,]/', '', $name);
                // $name = number_format($name, 2, ',', ' ');
                if (strpos($name,',') == false)
                {
                    $name = $name .",00";
                }
            }
        }
    }
}
But this code is not fetching the value. Can anyone please help me with this?

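Your snippet is not well-formed XML, because the meta tags are never closed. That is valid HTML (meta is a void element), but it means the markup has to be loaded with loadHTML() rather than loadXML().
To find what you are looking for you can use XPath:
$doc = new \DOMDocument();
$doc->loadHTML($yourHTML);
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//meta[@itemprop='price']");
// The meta element itself is empty; the value is in its content attribute.
echo $elements->item(0)->getAttribute('content');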

Inside your loop, you're pointing at the wrong object:
if($h->getAttribute('itemprop') == 'price') {
// ^ it's not supposed to be `$h`
You should point to $p instead.
After that, just use your current condition; if it is satisfied, loop over all the child nodes of the h3:
$price = '';
foreach($dom->getElementsByTagName('h3') as $h) {
    foreach($h->getElementsByTagName('meta') as $p) {
        if($p->getAttribute('itemprop') === 'price') {
            foreach($h->childNodes as $c) {
                if($c->nodeType == XML_TEXT_NODE) {
                    $price .= trim($c->textContent);
                }
            }
            if(strpos($price, ',') === false) {
                $price .= ',00';
            }
        }
    }
}
Another way is to use XPath queries:
$xpath = new DOMXpath($dom);
$meta = $xpath->query('//h3/meta[@itemprop="price"]');
if($meta->length > 0) { // found
    $price = trim($xpath->evaluate('string(./following-sibling::text()[1])', $meta->item(0)));
    if(strpos($price, ',') === false) { $price .= ',00'; }
    $currency = $xpath->evaluate('string(./preceding-sibling::meta[@itemprop="priceCurrency"]/following-sibling::text()[1])', $meta->item(0));
    $price = "{$currency} {$price}";
    echo $price;
}

Use jQuery, like this:
var priceCurrency = $('meta[itemprop="priceCurrency"]').attr("content");
var price = $('meta[itemprop="price"]').attr("content");
alert(priceCurrency + " " + price);
Outputs:
EUR 465.0000

php dom not able to find any nodes

I'm trying to get the href of all anchor (a) tags using this code:
$obj = json_decode($client->getResponse()->getContent());
$dom = new DOMDocument;
if($dom->loadHTML(htmlentities($obj->data->partial))) {
    foreach ($dom->getElementsByTagName('a') as $node) {
        echo $dom->saveHtml($node), PHP_EOL;
        echo $node->getAttribute('href');
    }
}
The returned JSON is like the sample linked here, but the code doesn't echo anything. The HTML does have a tags, but the foreach body never runs. What am I doing wrong?

Just remove that htmlentities(). It escapes the angle brackets, so the parser sees the whole fragment as plain text and never creates any a elements. Without it, the code works fine.
$contents = file_get_contents('http://jsonblob.com/api/jsonBlob/54a7ff55e4b0c95108d9dfec');
$obj = json_decode($contents);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($obj->data->partial);
libxml_clear_errors();
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHTML($node) . '<br/>';
    echo $node->getAttribute('href') . '<br/>';
}
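
The effect of htmlentities() can be checked in isolation. This is a minimal sketch: $partial is a hypothetical fragment standing in for $obj->data->partial, and count_links() is an illustrative helper, not part of the original code.

```php
<?php
// Hypothetical fragment standing in for $obj->data->partial.
$partial = '<p><a href="/first">First</a> and <a href="/second">Second</a></p>';

// Parse a fragment and report how many a elements the parser created.
function count_links($html)
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    return $dom->getElementsByTagName('a')->length;
}

$escaped = count_links(htmlentities($partial)); // markup was turned into plain text
$raw     = count_links($partial);               // markup is parsed as elements

echo "escaped: $escaped, raw: $raw", PHP_EOL;
```

With the escaped input the parser only sees text such as &lt;a href=...&gt;, which is why the foreach in the question has nothing to iterate over.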

How to replace text in HTML

From this question: What regex pattern do I need for this? I've been using the following code:
function process($node, $replaceRules) {
    if($node->hasChildNodes()) {
        foreach ($node->childNodes as $childNode) {
            if ($childNode instanceof DOMText) {
                $text = preg_replace(
                    array_keys($replaceRules),
                    array_values($replaceRules),
                    $childNode->wholeText
                );
                $node->replaceChild(new DOMText($text),$childNode);
            } else {
                process($childNode, $replaceRules);
            }
        }
    }
}
$replaceRules = array(
    '/\b(c|C)olor\b/' => '$1olour',
    '/\b(kilom|Kilom|M|m)eter/' => '$1etre',
);
$htmlString = "<p><span style='color:red'>The color of the sky is: gray</p>";
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
process($doc, $replaceRules);
$string = $doc->saveHTML();
echo mb_substr($string,119,-15);
It works fine on plain text, but it fails when an element contains both text and other HTML elements (because the child node is replaced on the first instance). So it works on
<div>The distance is four kilometers</div>
but not
<div>The distance is four kilometers<br>1000 meters to a kilometer</div>
or
<div>The distance is four kilometers<div class="guide">1000 meters to a kilometer</div></div>
Any ideas of a method that would work on such examples?

Calling $node->replaceChild will confuse the $node->childNodes iterator. You can get the child nodes first, and then process them:
function process($node, $replaceRules) {
    if($node->hasChildNodes()) {
        $nodes = array();
        foreach ($node->childNodes as $childNode) {
            $nodes[] = $childNode;
        }
        foreach ($nodes as $childNode) {
            if ($childNode instanceof DOMText) {
                $text = preg_replace(
                    array_keys($replaceRules),
                    array_values($replaceRules),
                    $childNode->wholeText);
                $node->replaceChild(new DOMText($text),$childNode);
            }
            else {
                process($childNode, $replaceRules);
            }
        }
    }
}
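
The copy loop can also be written with iterator_to_array(), which takes the same kind of snapshot of the live node list in one call. The sketch below is a variant of the answer's function under that assumption; the name process_snapshot() is illustrative, and the input is the mixed text-and-element example from the question.

```php
<?php
function process_snapshot(DOMNode $node, array $replaceRules)
{
    if (!$node->hasChildNodes()) {
        return;
    }
    // iterator_to_array() copies the live node list into a plain array, so
    // replacing children inside the loop cannot disturb the iteration.
    foreach (iterator_to_array($node->childNodes) as $childNode) {
        if ($childNode instanceof DOMText) {
            $text = preg_replace(
                array_keys($replaceRules),
                array_values($replaceRules),
                $childNode->wholeText
            );
            $node->replaceChild(new DOMText($text), $childNode);
        } else {
            process_snapshot($childNode, $replaceRules);
        }
    }
}

$replaceRules = array(
    '/\b(c|C)olor\b/' => '$1olour',
    '/\b(kilom|Kilom|M|m)eter/' => '$1etre',
);

$doc = new DOMDocument();
$doc->loadHTML('<div>The distance is four kilometers<br>1000 meters to a kilometer</div>');
process_snapshot($doc, $replaceRules);

$div = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML($div), PHP_EOL;
```

Both text nodes around the br are rewritten here, which is the case the original loop missed.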
