firstChild in DOMDocument not working - php

Here is the code snippet from which I have to fetch the firstChild from the DIV named u-Row-6...
<div class="u-Row-6">
<div class='article_details_price2'>
<strong >
855,90 € *
</strong>
<div class="PseudoPrice">
<em>EVP: 999,00 € *</em>
<span>
(14.32 % <span class="frontend_detail_data">gespart</span>)
</span>
</div>
</div>
</div>
For this I have used the following code:
foreach($dom->getElementsByTagName('div') as $p) {
if ($p->getAttribute('class') == 'u-Row-6') {
if ($first) {
$name = $p->firstChild-nodeValue;
$name = str_replace('€', '', $name);
$name = str_replace(chr(194), " ", $name);
$first = false;
}
}
}
But mysteriously this code is not working for me

There are a number of problems with your code:
$first is not initialized to a true value, which prevents the string replacement code from running even once
The $p->firstChild-nodeValue lacks a > before nodeValue
$p->firstChild will actually resolve to a text node (any text between <div class="u-Row-6"> and <div class='article_details_price2'> - currently nothing), not the strong you are looking for and not <div class='article_details_price2'> either, as one might have expected.
You may want to use an XPath query instead, to get all the strong tags within a div of class "u-Row-6", and then loop through the found tags:
$src = <<<EOS
<div class="u-Row-6">
<div class='article_details_price2'>
<strong >
855,90 € *
</strong>
<div class="PseudoPrice">
<em>EVP: 999,00 € *</em>
<span>
(14.32 % <span class="frontend_detail_data">gespart</span>)
</span>
</div>
</div>
</div>
EOS;
$dom = new DOMDocument();
$dom->loadHTML($src);
$xpath = new DOMXPath($dom);
$strongTags = $xpath->query('//div[@class="u-Row-6"]//strong');
foreach ($strongTags as $tag) {
echo "The strong tag contents: " . $tag->nodeValue, PHP_EOL;
// Replacement code goes here ...
}
Output:
The strong tag contents:
855,90 € *
XPaths are actually quite handy. Read more about them here.
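If you would rather walk the tree than write a query, the firstChild pitfall above can be sidestepped by skipping every child that is not an element. The following is only a minimal sketch: firstElementChildOf() is a hypothetical helper name, not part of the DOM extension, and the markup is a shortened copy of the snippet from the question with the euro sign written as EUR so that the example does not depend on the input encoding.

```php
<?php
// Hypothetical helper (not a DOM extension function):
// returns the first child that is an element, or null if there is none.
function firstElementChildOf(DOMNode $node): ?DOMElement
{
    foreach ($node->childNodes as $child) {
        // Text nodes (including whitespace) and comments are skipped here.
        if ($child instanceof DOMElement) {
            return $child;
        }
    }
    return null;
}

// Shortened copy of the markup from the question (assumption: EUR instead of the euro sign).
$src = <<<EOS
<div class="u-Row-6">
<div class="article_details_price2">
<strong>
855,90 EUR *
</strong>
</div>
</div>
EOS;

$dom = new DOMDocument();
$dom->loadHTML($src);
$xpath = new DOMXPath($dom);
$row = $xpath->query('//div[@class="u-Row-6"]')->item(0);

$inner  = firstElementChildOf($row);    // the article_details_price2 div
$strong = firstElementChildOf($inner);  // the strong tag
$price  = trim($strong->nodeValue);

echo $inner->getAttribute('class'), ': ', $price, PHP_EOL;
```

Whether the parser keeps the whitespace between the two tags as a text node or not, the helper returns the same element, so the rest of the code does not need to care.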

Related

Replace certain Child value if doesn't contain certain string? or Rewrite XPATH query? Website scrape

Preface: This is the first XPath and DOM script I have ever worked on.
The following code works, to a point.
If the child->nodevalue, which should be price, is empty it throws off the rest of the elements and it just snowballs from there. I have spent hours reading, rewriting and can't come up with a way to fix it.
I am at the point where I think my XPath query could be the issue because I am out of ideas on how to test that is the right child value.
The content I am scraping looks like this (actually it looks nothing like this; there are 148 lines of HTML for each product, but these are the relevant ones):
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
Here is the code I am using.
$html =file_get_contents('http://localhost:8888/scraper/source.html');
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);
$xpath->preserveWhiteSpace = FALSE;
$nodes= $xpath->query("//a[@class = 'a-link-normal s-no-outline'] | //span[@class = 'a-size-base-plus a-color-base a-text-normal'] | //span[@class = 'a-price']");
$data =[];
foreach ($nodes as $node) {
$url = $node->getAttribute('href');
if(trim($url,"\xc2\xa0 \n \t \r") != ''){
array_push($data,$url);
}
foreach ($node->childNodes as $child) {
if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
array_push($data, $child->nodeValue);
}
}
}
$chunks = (array_chunk($data, 4));
foreach($chunks as $chunk) {
$newarray = [
'url' => $chunk[0],
'title' => $chunk[1],
'todaysprice' => $chunk[2],
'hiddenprice' => $chunk[3]
];
echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
$newarray['todaysprice'] . '</p>';
}
Outputs:
URL
Title
Price
URL
Title
Price
URL
Title
URL. <---- "Price was missing so it used the next child node value and now everything from here down is wrong."
Title
Price
URL
I am aware this code is far from right, but I had to start somewhere.
If I understand you correctly, you are probably looking for something like the below. For the sake of simplicity, I skipped the array building parts and just echoed the target data.
So assume your html looks like the one below:
$html = '
<body>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The other Title I Need
</span>
</a>
</h2>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Final Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$2,000,000
</span>
</div>
</body>
';
Try this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$data = $xpath->query('//h2[@class="second class"]');
foreach($data as $datum){
echo trim($xpath->query('.//a/#href',$datum)[0]->nodeValue),"\r\n";
echo trim($xpath->query('.//a/span',$datum)[0]->nodeValue),"\r\n";
#$price = $xpath->query('./following-sibling::span',$datum);
#EDITED
$price = $xpath->query('./following-sibling::span[@class="a-offscreen"]',$datum);
if ($price->length>0) {
echo trim($price[0]->nodeValue), "\r\n";
} else {
echo("No Price"),"\r\n";
}
echo "\r\n";
};
Output:
TheURLINeed.php
The Title I Need
$1,000,000
TheURLINeed2.php
The other Title I Need
No Price
TheURLINeed3.php
The Final Title I Need
$2,000,000
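If you do need the array from the question, the same per-heading queries can fill it directly. This is a rough sketch under a few assumptions: the markup is a trimmed version of the sample above, the wrapper class "product" is made up for brevity, and a missing price is stored as null so it can never shift the values that follow.

```php
<?php
// Trimmed sample markup; the "product" class is a placeholder for the long class name.
$html = '
<body>
<div class="product">
<h2 class="second class"><a class="a-link-normal s-no-outline" href="TheURLINeed.php"><span>The Title I Need</span></a></h2>
<span class="a-offscreen">$1,000,000</span>
</div>
<div class="product">
<h2 class="second class"><a class="a-link-normal s-no-outline" href="TheURLINeed2.php"><span>The other Title I Need</span></a></h2>
</div>
</body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$products = [];
foreach ($xpath->query('//h2[@class="second class"]') as $heading) {
    // Both sub-queries are evaluated relative to the current h2.
    $link  = $xpath->query('.//a', $heading)->item(0);
    $price = $xpath->query('./following-sibling::span[@class="a-offscreen"]', $heading)->item(0);

    $products[] = [
        'url'   => $link->getAttribute('href'),
        'title' => trim($link->textContent),
        // item(0) is null when no price span follows the heading.
        'price' => $price ? trim($price->textContent) : null,
    ];
}

print_r($products);
```

Because each product gets its own row, there is no need for array_chunk and no way for one missing field to misalign the rest.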

Replace class content using php

I want to replace string from specific classes from HTML.
In HTML there is other content which I don't want to change.
In below code want to change data on class one and three only, class two content should be as it is.
I need to this in dynamic way.
<div class="one"> I want to change this </div>
<div class="two"> I don't want to change this </div>
<div class="three"> I want to change this </div> 
The DOM functions are helpful here (see the PHP manual):
//your html file content
$str = '...<div class="one"> I want to change this </div>
<div class="two"> I don\'t want to change this </div>
<div class="three"> I want to change this </div>... ';
$dom = new DOMDocument();
$dom->loadHtml($str);
$domXpath = new DOMXPath($dom);
//query the nodes matched
$list = $domXpath->query('//div[@class!="two"]');
if ($list->length > 0) {
foreach ($list as $node) {
//change node value
$node->nodeValue = 'Content changed!';
}
}
//get the result
$new_str = $dom->saveHTML();
var_dump($new_str);
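Note that a "not two" condition also matches any other div that happens to carry a class. If the page contains more than these three divs, it may be safer to name the classes you want to change. A small sketch of that variant follows; the $targets list is an assumption for illustration.

```php
<?php
$str = '<div class="one"> I want to change this </div>
<div class="two"> I don\'t want to change this </div>
<div class="three"> I want to change this </div>';

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

// Assumed list of classes whose content should be replaced.
$targets = ['one', 'three'];

// Build a predicate such as: @class="one" or @class="three"
$conditions = [];
foreach ($targets as $class) {
    $conditions[] = '@class="' . $class . '"';
}
$query = '//div[' . implode(' or ', $conditions) . ']';

foreach ($xpath->query($query) as $node) {
    $node->nodeValue = 'Content changed!';
}

// Collect the text of every div, keyed by class, to check the result.
$texts = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    $texts[$div->getAttribute('class')] = trim($div->textContent);
}
print_r($texts);
```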

Slicing HTML based on delimiter

I am converting Word docs on the fly to HTML and needing to parse said HTML based on a delimiter. For example:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
<!-- This continues on... -->
Should be parsed as:
Section 1:
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
Section 2:
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
<p>
<span>More content in section 2</span>
<p></p>
<div>
Section 3:
<div id="div2">
<p>
<b>
</b>
<p>
<p><font>[[delimiter]]Start of section 3</font></p>
<div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
I can't simply "explode"/slice based on the delimiter, because that would break the HTML. Every bit of text content has many parent elements.
I have no control over the HTML structure and it sometimes changes based on the structure of the Word doc. An end user will import their Word doc to be parsed in the application, so the resulting HTML will not be altered before being parsed.
Often the content is at different depths in the HTML.
I cannot rely on element classes or IDs because they are not consistent from doc to doc. #div1, #div2, and #div3 are just for illustration in my example.
My goal is to parse out the content, so if there's empty elements left over that's OK, I can simply run over the markup again and remove empty tags (p, font, b, etc).
My attempts:
I am using the PHP DOM extension to parse the HTML and loop through the nodes. But I cannot come up with a good algorithm to figure this out.
$doc = new \DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
if ($child->hasChildNodes()) {
// Do recursive call...
} else {
// Contains slide identifier?
}
}
In order to solve an issue like this, you first need to work out the steps needed to get a solution, before even starting to code.
1. Find an element that starts with [[delimiter]]
2. Check if its parent has a next sibling
3. No? Repeat 2
4. Yes? This next sibling contains the content.
Now once you put this to work, you are already 90% ready. All you need to do is clean up the unnecessary tags and you're done.
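For reference, the steps above can also be sketched with the built-in DOM extension. This is only an illustration, not the code used further down: nextElementSiblingOf() is a hypothetical helper, and the sample markup is a shortened copy of section 1 from the question.

```php
<?php
// Hypothetical helper: next sibling that is an element, skipping whitespace text nodes.
function nextElementSiblingOf(DOMNode $node): ?DOMElement
{
    for ($sibling = $node->nextSibling; $sibling !== null; $sibling = $sibling->nextSibling) {
        if ($sibling instanceof DOMElement) {
            return $sibling;
        }
    }
    return null;
}

$html = '<body>
<div id="div1">
<p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
<p><span>More content in section 1</span></p>
</div>
</body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$sections = [];
// Step 1: find the text nodes that start with the delimiter.
$query = '//text()[starts-with(normalize-space(.), "[[delimiter]]")]';
foreach ($xpath->query($query) as $text) {
    // Steps 2 and 3: climb the ancestors until one has a next element sibling.
    $node = $text->parentNode;
    while ($node !== null && nextElementSiblingOf($node) === null) {
        $node = $node->parentNode;
    }
    // Step 4: that sibling holds the content that follows (null if nothing follows).
    $next = $node !== null ? nextElementSiblingOf($node) : null;
    $sections[] = [
        'start' => trim($text->nodeValue),
        'more'  => $next !== null ? trim($next->textContent) : null,
    ];
}

print_r($sections);
```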
To get something that you can extend, don't build one major pile of obfuscated code that works, but split all the data you need into something you can work with.
The code below uses two classes that do exactly what you need, and gives you a nice way to go through all the elements once you need them. It uses PHP Simple HTML DOM Parser instead of DOMDocument, because I like it a little better.
<?php
error_reporting(E_ALL);
require_once("simple_html_dom.php");
$html = <<<XML
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
XML;
/*
* CALL
*/
$parser = new HtmlParser($html, '[[delimiter]]');
//dump found
//decode/encode to only show public values
print_r(json_decode(json_encode($parser)));
/*
* ACTUAL CODE
*/
class HtmlParser
{
private $_html;
private $_delimiter;
private $_dom;
public $Elements = array();
final public function __construct($html, $delimiter)
{
$this->_html = $html;
$this->_delimiter = $delimiter;
$this->_dom = str_get_html($this->_html);
$this->getElements();
}
private function getElements()
{
//this will find all elements, including parent elements
//it will also select the actual text as an element, without surrounding tags
$elements = $this->_dom->find("[contains(text(),'".$this->_delimiter."')]");
//find the actual elements that start with the delimiter
foreach($elements as $element) {
//we want the element without tags, so we search for outertext
if (strpos($element->outertext, $this->_delimiter)===0) {
$this->Elements[] = new DelimiterTag($element);
}
}
}
}
class DelimiterTag
{
private $_element;
public $Content;
public $MoreContent;
final public function __construct($element)
{
$this->_element = $element;
$this->Content = $element->outertext;
$this->findMore();
}
private function findMore()
{
//we need to traverse up until we find a parent that has a next sibling
//we need to keep track of the child, to cleanup the last parent
$child = $this->_element;
$parent = $child->parent();
$next = null;
while($parent) {
$next = $parent->next_sibling();
if ($next) {
break;
}
$child = $parent;
$parent = $child->parent();
}
if (!$next) {
//no more content
return;
}
//create empty element, to build the new data
//go up one more element and clean the innertext
$more = $parent->parent();
$more->innertext = "";
//add the parent, because this is where the actual content lies
//but we only want to add the child to the parent, in case there are more delimiters
$parent->innertext = $child->outertext;
$more->innertext .= $parent->outertext;
//add the next sibling, because this is where more content lies
$more->innertext .= $next->outertext;
//set the variables
if ($more->tag=="body") {
//Your section 3 works slightly differently, as it doesn't show the parent tag where the first two do.
//That's why I show the innertext for the root tag and the outertext for others.
$this->MoreContent = $more->innertext;
} else {
$this->MoreContent = $more->outertext;
}
}
}
?>
Cleaned up output:
stdClass Object
(
[Elements] => Array
(
[0] => stdClass Object
(
[Content] => [[delimiter]]Start of content section 1.
[MoreContent] => <div id="div1">
<p><font><b>[[delimiter]]Start of content section 1.</b></font></p>
<p><span>More content in section 1</span></p>
</div>
)
[1] => stdClass Object
(
[Content] => [[delimiter]]Start of section 2
[MoreContent] => <div id="div2">
<p><b><font>[[delimiter]]Start of section 2</font></b></p>
<span>More content in section 2</span>
</div>
)
[2] => stdClass Object
(
[Content] => [[delimiter]]Start of section 3
[MoreContent] => <div id="div2">
<p><font>[[delimiter]]Start of section 3</font></p>
</div>
<div id="div3">
<span><font>More content in section 3</font></span>
</div>
)
)
)
The nearest I've got so far is...
$html = <<<XML
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");
foreach ($div as $child) {
echo "Div=".$doc->saveHTML($child).PHP_EOL;
}
echo "Last bit...".$doc->saveHTML($child).PHP_EOL;
$div = $xp->query("following-sibling::*", $child);
foreach ($div as $remain) {
echo $doc->saveHTML($remain).PHP_EOL;
}
I think I had to tweak the HTML to correct a (hopefully) erroneous missing </div>.
It would be interesting to see how robust this is, but difficult to test.
The 'last bit' attempts to take everything from the element containing the last marker (in this case div2) to the end of the document (using following-sibling::*).
Also note that it assumes that the body tag is the base of the document. So this will need to be adjusted to fit your document. It may be as simple as changing it to //body...
Update
With a bit more flexibility and the ability to cope with multiple sections in the same overall segment...
$html = <<<XML
<html>
<body>
<div id="div1">
<p>
<font>
<b>[[delimiter]]Start of content section 1.</b>
</font>
</p>
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div1a">
<p>
<span>More content in section 1</span>
</p>
</div>
<div id="div2">
<p>
<b>
<font>[[delimiter]]Start of section 2</font>
</b>
</p>
<span>More content in section 2</span>
<p>
<font>[[delimiter]]Start of section 3</font>
</p>
</div>
<div id="div3">
<span>
<font>More content in section 3</font>
</span>
</div>
</body>
</html>
XML;
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$div = $xp->query("//body/node()[descendant::*[contains(text(),'[[delimiter]]')]]");
$partCount = $div->length;
for ( $i = 0; $i < $partCount; $i++ ) {
echo "Div $i...".$doc->saveHTML($div->item($i)).PHP_EOL;
// Check for multiple sections in same element
$count = $xp->evaluate("count(descendant::*[contains(text(),'[[delimiter]]')])",
$div->item($i));
if ( $count > 1 ) {
echo PHP_EOL.PHP_EOL;
for ($j = 0; $j< $count; $j++ ) {
echo "Div $i.$j...".$doc->saveHTML($div->item($i)).PHP_EOL;
}
}
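// Use a separate variable so the list of delimiter sections in $div is not overwritten
$remaining = $xp->query("following-sibling::*", $div->item($i));
foreach ($remaining as $remain) {
// Stop as soon as the next delimiter section starts
if ( $i < $partCount-1 && $remain->isSameNode($div->item($i+1)) ) {
break;
}
echo $doc->saveHTML($remain).PHP_EOL;
}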
echo PHP_EOL.PHP_EOL;
}

Substitute a phrase and characters from a result with simple html dom

My code works and returns the price of an item from an external online store, but the result still contains HTML markup and letters. I want just the number, without "," or "ABC", for example "123".
This is a part of external mobile-store site:
<div class="prod-box-separation" style="padding-left:15px;padding-right:15px;text-align:center;padding-top:7px;">
<div style="color:#cc1515;">
<div class="price-box">
<span class="regular-price" id="product-price-47488">
<span >
<span class="price">2.443,<sup>00</sup> RON</span>
</span>
</span>
</div>
</div>
</div>
<div class="prod-box-separation" style="padding-left:10px;padding-right:10px;">
<style>
.delivery {
display:block;
}
</style>
<p class="availability in-stock">
<div class="stock_info">Produs in stoc</div>
<div class="delivery"><div class="delivery_title">Livrare(in timpul orelor de program):</div>
<div class="delivery_item">Bucuresti - BANEASA : imediat</div>
<div class="delivery_item">Bucuresti - EROILOR : luni dupa ora 13.00.</div>
<div class="delivery_item">CURIER : Marti</div>
</div>
</p>
Garanţie: 12 luni
Here is my actual code:
<?php
include_once('../simple_html_dom.php');
$dom = file_get_html("http://www.site.com/page.html");
// alternatively use str_get_html($html) if you have the html string already...
foreach ($dom->find('span[class=price]') as $node)
{
echo $node->innertext;
}
?>
My result is this: 2.443,<sup>00</sup> RON. But the correct result would be: 2.443 or 2443
You could do something like this:
<?php
include_once('../simple_html_dom.php');
$dom = file_get_html("http://www.site.com/page.html");
// alternatively use str_get_html($html) if you have the html string already...
foreach ($dom->find('span[class=price]') as $node)
{
$result = $node->innertext;
$price = explode(",<sup>", $result);
echo $price[0];
}
?>
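If you want the 2443 form, you can additionally strip everything that is not a digit. Here is a minimal sketch that works on the innertext string alone, so it does not need simple_html_dom to run; wholePriceDigits() is a made-up helper name, and it assumes the comma is always the decimal separator, as in the sample.

```php
<?php
// Hypothetical helper: keeps only the digits of the whole-number part of a price.
function wholePriceDigits(string $innerHtml): string
{
    // Drop markup such as <sup>00</sup>, leaving "2.443,00 RON".
    $text = strip_tags($innerHtml);
    // Keep what precedes the decimal comma (assumption: comma = decimal separator).
    $whole = explode(',', $text)[0];
    // Remove everything that is not a digit, e.g. the thousands dot.
    return preg_replace('/\D/', '', $whole);
}

echo wholePriceDigits('2.443,<sup>00</sup> RON'), PHP_EOL; // prints 2443
```

Inside the foreach loop you would pass $node->innertext to the helper instead of the literal string.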

PHP remove all div with class="myclass" except first one + add another div instead of others

I have a $content with
<div class="slidesWrap">
<div class="slidesСontainer">
<div class="myclass">
...
</div>
<div class="myclass">
...
</div>
...
...
<div class="myclass">
...
</div>
</div>
<div class="nav">
...
</div>
</div>
some other text here, <p></p> bla-bla-bla
I would like to remove via PHP all the divs with class="myclass" except the first one, and add another div instead of others, so that the result is:
<div class="slidesWrap">
<div class="slidesСontainer">
<div class="myclass">
...
</div>
<div>Check all divs here</div>
</div>
<div class="nav">
...
</div>
</div>
some other text here, <p></p> bla-bla-bla
Would be grateful if someone can point me a solution.
UPDATE2:
There is a similar question here;
from that I came up with the following test code:
$content = '<div class="slidesWrap">
<div class="slidesСontainer">
<div class="myclass">
</div>
<div class="myclass">
</div>
<div class="myclass">
</div>
</div>
<div class="nav">
</div>
</div>
some other text here, <p></p> bla-bla-bla';
$dom = new DOMDocument();
$dom->loadHtml($content);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@class="myClass" and position()>1]') as $liNode) {
$liNode->parentNode->removeChild($liNode);
}
echo $dom->saveXml($dom->documentElement);
Any ideas where I can test it?
Here is what you are looking for (similar to your edit, but it removes the added html tags):
$doc = new DOMDocument();
$doc->loadHTML($content);
$xp = new DOMXpath($doc);
$elements = $xp->query("//div[@class='myclass']");
if($elements->length > 1)
{
$newElem = $doc->createElement("div");
$newElem->appendChild($doc->createTextNode("Check all divs "));
$newElemLink = $newElem->appendChild($doc->createElement("a"));
$newElemLink->setAttribute("href", "myurl");
$newElemLink->appendChild($doc->createTextNode("here"));
$elements->item(1)->parentNode->replaceChild($newElem, $elements->item(1));
for($i = $elements->length - 1; $i > 1 ; $i--)
{
$elements->item($i)->parentNode->removeChild($elements->item($i));
}
}
echo $doc->saveXML($doc->getElementsByTagName('div')->item(0));
$var = ':not(.myClass:eq(1))';
$var.removeClass("myClass");
$var.addClass("some_other_Class");
If I got you right, you've got a string called $content with all that content in it.
It's probably not the best solution, but here is my attempt (which works fine for me):
if( substr_count($content, '<div class="myclass') > 1 ) {
$parts = explode('<div class="myclass',$content);
echo '<div class="myclass'.$parts[1];
echo '<div>Check all divs here</div>';
}
else {echo $content;}
