Can't get the plaintext without using foreach - php

I'm trying to use simple_html_dom to get the plaintext ('THIS TEXT') of one HTML element:
<div class="parent">
<span><i class="fa fa-awesome"></i>THIS TEXT</span>
</div>
I'm getting that text by using:
foreach($html->find('div.parent span.child') as $text){
echo $text->plaintext;
}
But it is just one element and I'm searching for a way to get that plaintext without using foreach loop (since it is just one element).
P.S: I've been trying this:
$html->find('div.parent span.child', 1);
But var_dump-ing that results in a NULL.
I also tried this:
$html->find('div.delivery-status span.status', 1)->plaintext;
But var_dump-ing it results in:
Notice: Trying to get property 'plaintext' of non-object in
C:\xampp\htdocs\curl\index.php on line 19
I also read the documentation, but I can't seem to figure this one out :(. Can somebody please help me or at least point me in the right direction? :-s
Thank you!:D

You're using a pretty ancient library, but it looks like a foreach loop is how the author intended it to work. This is typical of DOM functions, most of which return a node list. What's wrong with the loop? You could do this in plain old PHP as well:
$html = <<< HTML
<div class="parent">
<span><i class="fa fa-awesome"></i>THIS TEXT</span>
</div>
HTML;
$dom = new \DomDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new \DOMXPath($dom);
$data = $xpath->query("//div[@class='parent']/span/text()");
echo $data[0]->textContent;
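If the goal is "first match or nothing" without a loop, the DOMXPath approach can be wrapped in a small helper. This is only a sketch: firstText() is a hypothetical name, not part of DOMXPath, and it returns null when nothing matches instead of raising a "property of non-object" notice.

```php
<?php
// Hypothetical helper (not part of DOMXPath): text of the first match, or null.
function firstText(DOMXPath $xpath, string $expression): ?string
{
    $nodes = $xpath->query($expression);
    // query() returns false for an invalid expression; an empty list means no match.
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = '<div class="parent"><span><i class="fa fa-awesome"></i>THIS TEXT</span></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

echo firstText($xpath, "//div[@class='parent']/span"), "\n";
// A selector that matches nothing (the span has no "child" class) yields null.
var_dump(firstText($xpath, "//div[@class='parent']/span[@class='child']"));
```

The second call mirrors the selector from the question: because the span has no child class, there is nothing to return.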

The <span> in the question does not have a child CSS class, so your selector is not correct. Also, you seem to be missing that the index passed to find is zero-based, so the first match is index 0, not 1. Try this:
$str = '<div class="parent"><span><i class="fa fa-awesome"></i>THIS TEXT</span></div>';
$html = str_get_html($str);
// no .child for the span, and 0 as the index of target child
print $html->find('div.parent span', 0)->plaintext;

Related

Changing a tag <a> to <div> with DOMDocument on WordPress

I'm a beginner in PHP and I would like to set up several functions to replace specific code bits on WordPress (including plugin elements that I can't edit directly).
Below is an example (first line: initial result, second line: desired result):
<span class="fn" itemprop="name">Gael Beyries</span>
<div class="vcard author"><span class="fn" itemprop="name">Gael Beyries</span></div>
PS: I came across this topic: Parsing WordPress post content but the example is too complicated for what I want to do. Could you present me an example code that solves this problem so I can try to modify it to modify other html elements?
Although I'm not sure how this fits into WP, I have basically taken the code from the linked answer and adapted it to your requirements.
I've assumed you want to find the <a> tags with class="vcard author" and this is the basis of the XPath expression. The code in the foreach() loop just copies the data into a new node and replaces the old one...
function replaceAWithDiv($content){
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$aTags = $xpath->query('//a[@class="vcard author"]');
foreach($aTags as $a){
// Create replacement element
$div = $dom->createElement("div");
$div->setAttribute("class", "vcard author");
// Move contents from a tag to div; childNodes is a live list, so take firstChild until none remain
while ($a->firstChild) {
$div->appendChild($a->firstChild);
}
// Replace a tag with div
$a->parentNode->replaceChild($div, $a);
}
return $dom->saveHTML();
}
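One thing to watch when moving children from one node to another: childNodes is a live list, so appending its items elsewhere while iterating it with foreach can skip nodes. Below is a minimal sketch of a safer pattern. moveChildren() is a hypothetical helper name, and the sample markup is loaded as XML only to keep the output short.

```php
<?php
// Hypothetical helper: move every child of $from into $to.
function moveChildren(DOMNode $from, DOMNode $to): void
{
    // appendChild() detaches the node from $from, so firstChild advances by itself.
    while ($from->firstChild !== null) {
        $to->appendChild($from->firstChild);
    }
}

$dom = new DOMDocument();
$dom->loadXML('<root><a class="vcard author"><span>Gael</span> <b>Beyries</b></a></root>');

$a = $dom->getElementsByTagName('a')->item(0);
$div = $dom->createElement('div');
$div->setAttribute('class', $a->getAttribute('class'));

moveChildren($a, $div);                  // span, whitespace text node, b
$a->parentNode->replaceChild($div, $a);  // swap the emptied <a> for the <div>

echo $dom->saveXML($dom->documentElement), "\n";
```

All three children, including the whitespace text node between the two elements, end up inside the new div.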

Error when parsing DOMDocument with PHP

We are upgrading our Software to PHP 7.2.3 and I have the following code snippet which worked fine in previous versions:
$doc = new DOMDocument();
$doc->loadHTML("<html><body>".($_POST['reportForm_structure'])."</body></html>");
$root = $doc->documentElement->firstChild->firstChild->firstChild;
file_put_contents('D:\testoutput.txt', print_r($root ,true));
foreach($root->childNodes as $child) {
if ($child->nodeName == "ul") {
foreach($child->childNodes as $ulChild) {
$this->loadNodes($ulChild, $this->report);
}
}
}
The file_put_contents is just for error research.
I get the following error: Invalid argument supplied for foreach(). The message refers to the line of code containing the first foreach loop, so the data structure is not initialized correctly. I can see that the conversion from HTML to DOMDocument no longer works properly. When I check the output of file_put_contents, I can see that $root is a DOMText object instead of a DOMElement object, but why? When I pass the argument of loadHTML directly to file_put_contents,
file_put_contents('D:\testoutput.txt', print_r("<html><body>".($_POST['reportForm_structure'])."</body></html>", true));
the output looks like proper HTML, so I am confused about why it does not work anymore.
<html><body><ul class="ltr">
<li class="open last" id="root" rel="root">
<ins> </ins>HeaderText
<ul><li class="open last" id="id1" rel="header"><ins> </ins>Test123
<ul><li class="open leaf last" id="id2" rel="header"><a class="clicked" href="#"><ins> </ins>Test456</a></li></ul></li></ul></li>
Does anyone know how to solve this issue? Did I miss something in the configuration here?
I couldn't reproduce the DOMText node with the code you show. But my guess is that you are preserving whitespace and then fetch the whitespace node between the ul element and the li element.
v-------- whitespace node
<html><body><ul class="ltr">
<li class="open last" id="root" rel="root">
In any case, if you want the element with the ID "root", use a more precise query, e.g. use
$root = $doc->getElementById("root");
You can also set $doc->preserveWhiteSpace = false, but it's better to query for the node by ID instead of traversing down three children and assuming it's that node.
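As a rough sketch of that advice, the snippet below finds the node by its id and, where a first child is really needed, skips anything that is not an element. firstElementChildOf() is a hypothetical helper, the markup is a shortened stand-in for the posted form data, and the XPath id test is used here in place of getElementById(), which should find the same node.

```php
<?php
// Hypothetical helper: first child that is an element, skipping text and comment nodes.
function firstElementChildOf(DOMNode $node): ?DOMElement
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            return $child;
        }
    }
    return null;
}

// Shortened stand-in for $_POST['reportForm_structure'], with line breaks between tags.
$markup = "<ul class=\"ltr\">\n<li class=\"open last\" id=\"root\" rel=\"root\">HeaderText</li>\n</ul>";

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML("<html><body>" . $markup . "</body></html>");
$xpath = new DOMXPath($doc);

// Look the node up by its id instead of counting firstChild hops.
$root = $xpath->query("//*[@id='root']")->item(0);
echo $root->nodeName, "\n";

// If traversal is unavoidable, ignore whitespace text nodes explicitly.
$ul = $doc->getElementsByTagName('ul')->item(0);
echo firstElementChildOf($ul)->getAttribute('id'), "\n";
```

Neither lookup depends on whether the parser keeps or drops the whitespace between the tags, which is the part that appears to differ between setups.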
Thanks @Gordon and @DarsVaeda for pointing me in the right direction. DOMDocument interprets carriage returns and tabs as text nodes. I had to remove those to make it work again. Changed
$doc->loadHTML("<html><body>".$_POST['reportForm_structure']."</body></html>");
to
$doc = new DOMDocument();
$string = trim(preg_replace('/\t+/', '', $_POST['reportForm_structure']));
$string = preg_replace( "/\r|\n/", "", $string );
$doc->loadHTML("<html><body>".$string."</body></html>");

Is there an easy way to get subelements with DomDocument and DomXPath?

Suppose I have HTML like this:
<div id="container">
<li class="list">
Test text
</li>
</div>
And I want to get the contents of the li.
I can get the contents of the container div using this code:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo $dom->saveHTML($xpath->query("//div[@id='container']")->item(0));
I was hoping I could get the contents of the subelement by simply adding it to the query (like how you can do it in simpleHtmlDom):
echo $dom->saveHTML($xpath->query("//div[@id='container'] li[@class='list']")->item(0));
But a warning (followed by a fatal error) was thrown, saying:
Warning: DOMXPath::query(): Invalid expression ...
The only way I know of to do what I'm wanting is this:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
$dom2 = new \DomDocument;
$dom2->loadHTML(trim($dom->saveHTML($xpath->query("//div[@id='container']")->item(0))));
$xpath2 = new \DomXPath($dom2);
echo $xpath2->query("//li[@class='list']")->item(0)->nodeValue;
However, that's an awful lot of code just to get the contents of the li, and the problem is that as items are nested deeper (like if I want to get div#container ul.container li.list) I have to continue adding more and more code.
With simpleHtmlDom, all I would have had to do is:
$html->find('div#container li.list', 0);
Am I missing an easier way to do things with DomDocument and DomXPath, or is it really this hard?
You were close in your initial attempt; your syntax was just off by a character. Try the following XPath:
//div[@id='container']/li[@class='list']
You can see you had a space between the div node and the li node where there should be a forward slash.
SimpleHTMLDOM uses CSS selectors, not XPath. Almost anything that can be done with CSS selectors can be done with XPath, too. DOMXPath::query() only supports XPath expressions that return a node list, but XPath can return scalars, too; DOMXPath::evaluate() handles those.
In XPath, a / separates the steps of a location path, not a space. The slash has two additional meanings. A / at the start of a location path makes it absolute (it starts at the document rather than the current context node). A doubled slash (//) is the short syntax for the descendant axis.
Try:
$html = '
<div id="container">
<li class="list">
Test text
</li>
</div>';
$dom = new \DomDocument;
$dom->loadHTML($html);
$xpath = new \DomXPath($dom);
echo trim($xpath->evaluate("string(//div[@id='container']//li[@class='list'])"));
Output:
Test text
In CSS selector sequences the space is a combinator for two selectors.
CSS: foo bar
Xpath short syntax: //foo//bar
Xpath full syntax: /descendant::foo/descendant::bar
Another combinator would be > for a child. This axis is the default one in Xpath.
CSS: foo > bar
Xpath short syntax: //foo/bar
Xpath full syntax: /descendant::foo/child::bar
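To make the difference between the two combinators concrete, here is a small sketch with made-up markup in which one matching element is nested inside a list and another is a direct child of the container.

```php
<?php
// Sample markup (an assumption for illustration): one nested match, one direct child.
$html = '<div id="container"><ul><li class="list">nested</li></ul><p class="list">direct</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// CSS "#container .list": any descendant of the container.
$descendants = $xpath->query("//div[@id='container']//*[@class='list']");
// CSS "#container > .list": direct children of the container only.
$children = $xpath->query("//div[@id='container']/*[@class='list']");

echo $descendants->length, "\n";             // 2
echo $children->length, "\n";                // 1
echo $children->item(0)->textContent, "\n";  // direct
```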

How can I execute XPath queries on DOMElements using PHP?

I'm trying to do Xpath queries on DOMElements but it doesn't seem to work. Here is the code
<html>
<div class="test aaa">
<div></div>
<div class="link">contains a link</div>
<div></div>
</div>
<div class="test bbb">
<div></div>
<div></div>
<div class="link">contains a link</div>
</div>
</html>
What I'm doing is this:
$dom = new DOMDocument();
$html = file_get_contents("file.html");
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$entries = $xpath->query("//div[contains(@class,'test')]");
if (!$entries->length > 0) {
echo "Nothing\n";
} else {
foreach ($entries as $entry) {
$link = $xpath->query('/div[@class=link]',$entry);
echo $link->item(0)->nodeValue;
// => PHP Notice: Trying to get property of non-object
}
}
Everything works fine up to $xpath->query('/div[@class=link]', $entry);. I don't know how to use XPath on a particular DOMElement ($entry).
How can I use xpath queries on DOMElement?
It looks like you're trying to mix CSS selectors with XPath. You want to be using a predicate ([...]) looking at the value of the class attribute.
For example, your //div.link might look like //div[contains(concat(' ',normalize-space(@class),' '),' link ')].
Secondly, within the loop you try to make a query with a context node then ignore that by using an absolute location path (it starts with a slash).
Updated to reflect changes to the question:
Your second XPath expression (/div[@class=link]) is still a) absolute, and b) has an incorrect condition. You want to be asking for matching elements relative to the specified context node ($entry) with the class attribute having a string value of link.
So /div[@class=link] should become something like div[@class="link"], which searches children of the $entry elements (use .//div[...] or descendant::div[...] if you want to search deeper).
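Put together, a context-relative query might look like the sketch below. The markup is a trimmed-down version of the HTML in the question, and the link texts are changed so the two results can be told apart.

```php
<?php
$html = '<html><body>
<div class="test aaa"><div></div><div class="link">first link</div><div></div></div>
<div class="test bbb"><div></div><div></div><div class="link">second link</div></div>
</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$found = [];
foreach ($xpath->query("//div[contains(@class, 'test')]") as $entry) {
    // No leading slash: the path is evaluated relative to $entry (its child divs).
    $link = $xpath->query('div[@class="link"]', $entry)->item(0);
    if ($link !== null) {
        $found[] = $link->nodeValue;
    }
}

print_r($found);
```

Checking for null before reading nodeValue avoids the "property of non-object" notice when an entry has no matching child.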

Keep new line, when the HTML is on 1 line and new line layout is done with <div>

I need to get content from a site
I need to get
/html/body/div/div[2]/table/tbody/tr/td/div/div[2]/form/fieldset[2]/table[2]
or
<table class='properties'>
For which the code is visible here: http://paste.pocoo.org/show/347881/
contents with all the content formatted just on new lines.
I don't care about paddings, and other formatting, I just want to keep the new lines.
For example a proper output would be
tájékoztató
az eljárás eredményéről
A Közbeszerzések Tanácsa (Szerkesztőbizottsága) tölti ki
A hirdetmény kézhezvételének dátuma____________________
KÉ nyilvántartási szám_________________________________
I. SZAKASZ: AJÁNLATKÉRŐ
I.1) Név, cím és kapcsolattartási pont(ok)
The problem I face that the new lines are introduced with the div's and cannot get it.
Update
This will be executed by a PHP cron, so there is no access to JS.
There is a library called phpQuery: http://code.google.com/p/phpquery/
You can walk through DOM object like with jQuery:
phpQuery::newDocument($htmlCode)->find('table.properties');
Run strip_tags on a matched element's content and you will get the plain text of that table.
The trick is to fetch the inner divs in an xpath expression, then use their textContent property:
<?php
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("..."));
libxml_use_internal_errors(false);
$domx = new DOMXPath($domd);
$items = $domx->query("/html/body/div/div[2]/table/tr/td/div/div[2]/form/fieldset[2]/table[2]/tr/td/div//div/div[@style='padding-left: 0px;']");
$output = "";
foreach ($items as $item) {
$output .= $item->textContent . "\n";
}
echo $output;
