DOMDocument get nodeValue of each matching element - php

I've been hacking at this for a while and just cant seem to get it right.
How can you get get the contents of all script elements, when the number of script elements is variable. My example markup looks like this:
<div></div>
<iframe><iframe>
<script>xxxx</script>
<script>xxxx</script>
<script>xxxx</script>
What I have so far works only if I keep the number of scripts static so clearly Im not iterating over the array correctly, but Im totally thrown by the DOMXPath documentation as how to do it. This is what I have so far:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = true;
#$dom->loadHtml($form_content);
$xpath = new DOMXPath($dom);
$items = $xpath->query('//script');
foreach ($items as $item) {
$scriptContents = $item->previousSibling->previousSibling->nodeValue . "\r\ n\r\n";
$scriptContents .= $item->previousSibling->nodeValue . "\r\n\r\n";
$scriptContents .= $item->nodeValue . "\r\n\r\n";
}
echo $scriptContents;
How should I go about this? I've been search SO for a while now, but can seem to apply a solution that works. Thanks in advance - b

It appears that you are overwriting $scriptContents with each iteration, which is probably not what you are intending. The way the script currently is operating, your output would be limited to the two previous siblings of the last script tag (whether or not they are actually script tags themselves) along with the last script tag.
If you are strictly trying to output the script tags you can do this:
$xpath = new DOMXPath($dom);
$items = $xpath->query('//script');
foreach ($items as $item) {
echo $item->nodeValue . "\r\n\r\n";
}

Related

PHP parse HTML empty input value

I know there are many questions on parsing HTML in PHP, but I can't seem to find the specific problem I'm experiencing. My code works on other elements in the page, and also iterates over the inputs returning the tag name. At the same time their value property is empty, when 2 of them have a value for sure. Here is my code
$html = file_get_contents('http://...sample website...html');
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/input[#type='hidden']");
if(!is_null($elements)){
foreach ($elements as $element) {
echo "<br/>[". $element->nodeName. "]";
echo $element->nodeValue. "\n";
}
}
$xpath->query("//*/input[#type='hidden']/#value");
instead of
$xpath->query("//*/input[#type='hidden']");
also works well.
Same question, same answers
I got it myself, if anyone else has a similar problem it is just that nodeValue returns the "innerHTML" of an element, to get its properties use $element -> getAttribute("value") (for the "value" attribute)

PHP DOMDocument, retrieve just content of a div, without div tag

I'm using DOMDocument to retrieve on a HTML page a special div.
I just want to retrive the content of this div, without the div tag.
For example :
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML()
Here, i have the result :
<div id="inter">
//SOME THINGS IN MY DIV
</div>
And i just want to have :
//SOME THINGS IN MY DIV
Ideas ? Thanks !
I'm going to go with simple does it. You already have:
$dom = new DOMDocument;
$dom->loadHTML($webtext['content']);
$main = $dom->getElementById('inter');
$dom->saveHTML();
Now, DOMDocument::getElementById() returns one DOMElement which extends DOMNode which has the public stringnodeValue. Since you don't specify if you are expecting anything but text within that div, I'm going to assume that you want anything that may be stored in there as plain text. For that, we are going to remove $dom->saveHTML();, and instead replace it with:
$divString = $main->nodeValue;
With that, $divString will contain //SOME THINGS IN MY DIV, which, from your example, is the desired output.
If, however, you want the HTML of the inside of it and not just a String representation - replace it with the following instead:
$divString = "";
foreach($main->childNodes as $c)
$divString .= $c->ownerDocument->saveXML($c);
What that does is takes advantage of the inherited DOMNode::childNodes which contains a DOMNodeList each containing its own DOMNode (for reference, see above), and we loop through each one getting the ownerDocument which is a DOMDocument and we call the DOMDocument::saveXML() function. The reason we pass the current $c node in to the function is to prevent an entire valid document from being outputted, and because the ownerDocument is what we are looping through - we need to get one child at a time, with no children left behind. (sorry, it's late, couldn't resist.)
Now, after either option, you can do with $divString what you will. I hope this has helped explain the process to you and hopefully you walk away with a better understanding of what is going on instead of rote copying of code just because it works. ^^
you can use my custom function to remove extra div from content
$html_string = '<div id="inter">
SOME THINGS IN MY DIV
</div>';
// custom function
function DOMgetinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
your code will like
$dom = new DOMDocument;
$dom->loadHTML($html_string);
$divs = $dom->getElementsByTagName('div');
$innerHTML_contents = DOMgetinnerHTML($divs->item(0));
echo $innerHTML_contents
and your output will be
SOME THINGS IN MY DIV
you can use xpath
$xpath = new DOMXPath($xml);
foreach($xpath->query('//div[#id="inter"]/*') as $node)
{
$node->nodeValue
}
or simplu you can edit your code. see here
$main = $dom->getElementById('inter');
echo $main->nodeValue

PHP XPath Table elements disapearing

I have just learned about XPath and I am wanting to read data from only certain columns in a table.
My current code looks like this:
<?php
$file_contents = file_get_contents('test.html');
$dom_document = new DOMDocument();
$dom_document->loadHTML($file_contents);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
$elements = $dom_xpath->query("//tr[#class='rowstyle']");
if (!is_null($elements)) {
foreach ($elements as $element)
{
echo $element->nodeValue . '<br />';
}
}
else
{
echo 'none';
}
?>
Also a variation in the query because through my research I have seen lots of issues with nest table elements but it produces the same result:
$elements = $dom_xpath->query("//table[#class='tablestyle']/tbody/tr[#class='rowstyle']");
It does grab the row of data but it makes into a single string, combining all of the cells into one string and making the tags disappear.
What I really want to do is separate those cells and grab the certain row number.
I am also curious on how to find out which version of XPath I have... My PHP version is 5.3.5
Its not combining those cells... youre outputting the nodeValue which in this case is behaving like innerHTML. IF you want to work on the cells themselves the either use childNodes or a xpah query using the row as the context, then loop over the cells.
Example:
$dom_xpath = new DOMXpath($dom_document);
$elements = $dom_xpath->query("//tr[#class='rowstyle']");
foreach ($elements as $element)
{
foreach($element->childNodes as $cell) {
echo $cell->nodeValue . '<br />';
}
}

Getting the text portion of a node using php Simple XML

Given the php code:
$xml = <<<EOF
<articles>
<article>
This is a link
<link>Title</link>
with some text following it.
</article>
</articles>
EOF;
function traverse($xml) {
$result = "";
foreach($xml->children() as $x) {
if ($x->count()) {
$result .= traverse($x);
}
else {
$result .= $x;
}
}
return $result;
}
$parser = new SimpleXMLElement($xml);
traverse($parser);
I expected the function traverse() to return:
This is a link Title with some text following it.
However, it returns only:
Title
Is there a way to get the expected result using simpleXML (obviously for the purpose of consuming the data rather than just returning it as in this simple example)?
There might be ways to achieve what you want using only SimpleXML, but in this case, the simplest way to do it is to use DOM. The good news is if you're already using SimpleXML, you don't have to change anything as DOM and SimpleXML are basically interchangeable:
// either
$articles = simplexml_load_string($xml);
echo dom_import_simplexml($articles)->textContent;
// or
$dom = new DOMDocument;
$dom->loadXML($xml);
echo $dom->documentElement->textContent;
Assuming your task is to iterate over each <article/> and get its content, your code will look like
$articles = simplexml_load_string($xml);
foreach ($articles->article as $article)
{
$articleText = dom_import_simplexml($article)->textContent;
}
node->asXML();// It's the simple solution i think !!
So, the simple answer to my question was: Simplexml can't process this kind of XML. Use DomDocument instead.
This example shows how to traverse the entire XML. It seems that DomDocument will work with any XML whereas SimpleXML requires the XML to be simple.
function attrs($list) {
$result = "";
foreach ($list as $attr) {
$result .= " $attr->name='$attr->value'";
}
return $result;
}
function parseTree($xml) {
$result = "";
foreach ($xml->childNodes AS $item) {
if ($item->nodeType == 1) {
$result .= "<$item->nodeName" . attrs($item->attributes) . ">" . parseTree($item) . "</$item->nodeName>";
}
else {
$result .= $item->nodeValue;
}
}
return $result;
}
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);
print parseTree($xmlDoc->documentElement);
You could also load the xml using simpleXML and then convert it to DOM using dom_import_simplexml() as Josh said. This would be useful, if you are using simpleXml to filter nodes for parsing, e.g. using XPath.
However, I don't actually use simpleXML, so for me that would be taking the long way around.
$simpleXml = new SimpleXMLElement($xml);
$xmlDom = dom_import_simplexml($simpleXml);
print parseTree($xmlDom);
Thank you for all the help!
You can get the text node of a DOM element with simplexml just by treating it like a string:
foreach($xml->children() as $x) {
$result .= "$x"
However, this prints out:
This is a link
with some text following it.
TitleTitle
..because the text node is treated as one block and there is no way to tell where the child fits in inside the text node. The child node is also added twice because of the other else {}, but you can just take that out.
Sorry if I didn't help much, but I don't think there's any way to find out where the child node fits in the text node unless the xml is consistent (but then, why not use tags). If you know what element you want to strip the text out of, strip_tags() will work great.
This has already been answered, but CASTING TO STRING ( i.e. $sString = (string) oSimpleXMLNode->TagName) always worked for me.
Try this:
$parser = new SimpleXMLElement($xml);
echo html_entity_decode(strip_tags($parser->asXML()));
That's pretty much equivalent to:
$parser = simplexml_load_string($xml);
echo dom_import_simplexml($parser)->textContent;
Like #tandu said, it's not possible, but if you can modify your XML, this will work:
$xml = <<<EOF
<articles>
<article>
This is a link
</article>
<link>Title</link>
<article>
with some text following it.
</article>
</articles>

Keep new line, when the HTML is on 1 line and new line layout is done with <div>

I need to get content from a site
I need to get
/html/body/div/div[2]/table/tbody/tr/td/div/div[2]/form/fieldset[2]/table[2]
or
<table class='properties'>
For which the code is visible here: http://paste.pocoo.org/show/347881/
contents with all the content formatted just on new lines.
I don't care about paddings, and other formatting, I just want to keep the new lines.
For example a proper output would be
tájékoztató
az eljárás eredményéről
A Közbeszerzések Tanácsa (Szerkesztőbizottsága) tölti ki
A hirdetmény kézhezvételének dátuma____________________
KÉ nyilvántartási szám_________________________________
I. SZAKASZ: AJÁNLATKÉRŐ
I.1) Név, cím és kapcsolattartási pont(ok)
The problem I face that the new lines are introduced with the div's and cannot get it.
Update
This be executed by a PHP cron, so there is no access to JS.
There is a library called phpQuery: http://code.google.com/p/phpquery/
You can walk through DOM object like with jQuery:
phpQuery::newDocument($htmlCode)->find('table.properties');
On a mached element's content fire strip_tags and you will get pure content of that table.
The trick is to fetch the inner divs in an xpath expression, then use their textContent property:
<?php
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("..."));
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$items = $domx->query("/html/body/div/div[2]/table/tr/td/div/div[2]/form/fieldset[2]/table[2]/tr/td/div//div/div[#style='padding-left: 0px;']");
$output = "";
foreach ($items as $item) {
$output .= $item->textContent . "\n";
}
echo $output;

Categories