DOMXPath union extract with PHP

I'm trying to get an img and the div that follows the div containing that img, all in one query.
So I did this:
$nodes = $xpath->query('//div[starts-with(@id, "someid")]/img |
//div[starts-with(@id, "someid")]/following-sibling::div[@class="spec_class"][1]/text()');
Now, I'm able to get the attributes of img tag, but I can't get the text of the following sibling. If I separate the query (two queries - first for the img and second query for the sibling) it works. But how can I do this with only one query? By the way, there is no error in the syntax. But somehow the union doesn't work or maybe I'm not extracting the sibling content right.
Here's the markup (it repeats many times with different text and id="someid_%randomNumber%"):
<div id="someid_1">
<img src="link_to_image.png" />
...some text...
</div>
<div>...another text...</div>
<div class="spec_class">
...Important text...
</div>
I want to get both link_to_image.png and ...Important text... in one query.

Your query seems correct.
Example XML:
<div>
<div id="someid-1"><img src="foo"/></div>
<div class="spec_class">bar</div>
<div class="spec_class">baz</div>
</div>
Example PHP Code:
$dom = new DOMDocument;
$dom->loadXml($xhtml);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div…') as $node) {
echo $dom->saveXML($node);
}
Output:
<img src="foo"/>bar
Note that you will have to iterate the DOMNodeList returned by the XPath query.
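To make the iteration concrete, here is a minimal sketch that runs the union query from the question against the question's markup. The wrapping <root> element and the $results array are my own additions for illustration, and the instanceof check is just one way to tell the img element apart from the text node:

```php
<?php
// Markup from the question, wrapped in a hypothetical <root> so it is well-formed XML.
$xhtml = <<<XML
<root>
<div id="someid_1"><img src="link_to_image.png"/>...some text...</div>
<div>...another text...</div>
<div class="spec_class">...Important text...</div>
</root>
XML;

$dom = new DOMDocument;
$dom->loadXML($xhtml);
$xpath = new DOMXPath($dom);

// Union of the img element and the text of the first following spec_class div.
$query = '//div[starts-with(@id, "someid")]/img'
       . ' | //div[starts-with(@id, "someid")]/following-sibling::div[@class="spec_class"][1]/text()';

$results = [];
foreach ($xpath->query($query) as $node) {
    if ($node instanceof DOMElement) {
        // The img element: read its src attribute.
        $results[] = $node->getAttribute('src');
    } else {
        // The text node: read its value.
        $results[] = trim($node->nodeValue);
    }
}

print_r($results);
```

The list mixes element and text nodes, so each item has to be handled according to its type. Reading attributes from every item will not give you the sibling text.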

Related

PHP dom parser: How to get element count only if it comes after another element?

I'm trying to get a count of how many images are on an HTML page sprinkled throughout an article but I do not want to count the image if it comes before the text of the article begins. The problem is the classes are exactly the same, so I can't use that to help me, and not every article is even going to start with an image. So the HTML might look like this:
<img class="image-asset" src="image.jpg">
<p>First line</p>
<p>Second line</p>
<img class="image-asset" src="second_image.jpg">
<p>Third line</p>
<img class="image-asset" src="third_image.jpg">
In this instance, I want to only count the second and third images. Here's my code, which is successfully counting every image at the moment:
$photoCount = count($html->find('div.image-asset'));

I believe you are looking for something along these lines:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$target = $xpath->query('//img[preceding-sibling::p]');
echo count($target), PHP_EOL;
// and, just to be on the safe side, print each match:
foreach ($target as $t) {
echo $t->ownerDocument->saveHTML($t), PHP_EOL;
}
Output:
2
<img class="image-asset" src="second_image.jpg">
<img class="image-asset" src="third_image.jpg">

Target element within specific element domdocument

I want to target a tags with class genre within parent div with id test:
<div id="test">
<a class="genre">hello</a>
<a class="genre">hello2</a>
</div>
So far, I can get all the genre a tags:
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//a[@class="genre"]');
... but I want to adjust //a[@class="genre"] so I only target the ones within the test div.

Your expression already uses every XPath element you need, so you only have to prefix it with the parent div (unless I've misunderstood your question):
$elements = $xpath->query('//div[@id="test"]/a[@class="genre"]');
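One caveat worth a quick sketch: @class="genre" compares the whole attribute value, so it misses an element such as class="genre featured". The markup below is a hypothetical variation of the question's (the second class and the outside link are my additions), and the contains/concat/normalize-space expression is a common XPath 1.0 idiom for matching a single class token, not something from the original answer:

```php
<?php
// Hypothetical markup: one link has an extra class, one link sits outside the test div.
$html = <<<HTML
<div id="test">
<a class="genre">hello</a>
<a class="genre featured">hello2</a>
</div>
<a class="genre">outside</a>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact attribute match, scoped to the test div.
$exact = $xpath->query('//div[@id="test"]/a[@class="genre"]');

// Class-token match, scoped to the test div.
$token = $xpath->query(
    '//div[@id="test"]/a[contains(concat(" ", normalize-space(@class), " "), " genre ")]'
);

echo $exact->length, PHP_EOL; // 1
echo $token->length, PHP_EOL; // 2
```

Both queries ignore the link outside the div; only the second also matches the link that carries two classes.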

Extract only first level paragraphs from html

I have the following html:
<div id="myID">
<p>I want this</p>
<p>and I want this</p>
<div>
<p>I don't want this</p>
</div>
</div>
I want to extract only the first level <p>...</p> elements.
I've tried using the excellent simple_html_dom library, e.g. $html->find('#myID p'), but in the case above this finds all three <p>...</p> elements.
Is there a better way to do this?

Instead of having to use some external library why don't you just use the built in classes to handle the dom?
First create a DOMDocument instance using your HTML:
$dom = new DOMDocument();
$dom->loadHtml($yourHtml);
After that use DOMXPath to select your elements:
$xpath = new DOMXpath($dom);
$nodes = $xpath->query("//*[@id='myID']/p");
var_dump($nodes->length); // outputs 2
This selects all p elements which are direct children of the element with the id myID.
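As a rough illustration of why the single slash matters, the sketch below runs both the child step (/p) and the descendant step (//p) against the question's markup. The variable names $direct, $all and $texts are mine:

```php
<?php
// Markup from the question.
$yourHtml = <<<HTML
<div id="myID">
<p>I want this</p>
<p>and I want this</p>
<div>
<p>I don't want this</p>
</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($yourHtml);
$xpath = new DOMXPath($dom);

// Child axis: only p elements directly under #myID.
$direct = $xpath->query("//*[@id='myID']/p");

// Descendant axis: every p element anywhere below #myID.
$all = $xpath->query("//*[@id='myID']//p");

$texts = [];
foreach ($direct as $p) {
    $texts[] = $p->textContent;
}

echo $direct->length, PHP_EOL; // 2
echo $all->length, PHP_EOL;    // 3
print_r($texts);
```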

Simple HTML DOM Parser - Get all plaintext rather than text of certain element

I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this.

Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in the tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - keeps only those divs that have the attribute id="maindiv"
/ - steps from the div element to its child nodes
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text node may depend on
your preserveWhiteSpace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
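Because the [2] index depends on how whitespace is parsed, a position-independent variant may be more robust. The sketch below is my own suggestion rather than part of the answer above: it filters text nodes with normalize-space(), which keeps only those that contain something other than whitespace.

```php
<?php
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

// Keep only the div's direct text nodes that are not pure whitespace.
$nodes = $selector->query('//div[@id="maindiv"]/text()[normalize-space()]');

// Guard against an empty result before reading the value.
$text = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;

echo $text, PHP_EOL; // I want this text
```

The text inside <b> is not selected because text() only looks at the div's own child nodes, not at its descendants.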

remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text

Using PHP X-Path to extract specific parts of a webpage

I am after a specific value from a webpage: the product name that is in the h1 tag:
<div id="extendinfo_container">
<h1><strong>Product Name</strong></h1>
<div style="font-size:0;height:4px;"></div>
<p class="text_breadcrumbs">
<img src="arrow_091.gif" align="absmiddle"/>
Product Name<img src="arrow_091.gif" align="absmiddle"/>
<strong>Product Name</strong>
<div class="dotted_line_blue">
<img src="theme_shim.gif" height="1" width="100%" alt=" " />
</div>
</div>
This is a poorly structured website with more than one h1 so I cannot simply do getElementById('h1').
I want to be as specific as possible in which element I get and this is the code I have:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://url/to/website'));
// locate <div id="extendinfo_container"><a><h1><strong>(.*)</strong></h1></a> as product name
$x = new DOMXPath($doc);
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong');
var_dump($pName->nodeValue);
This returns null. What query do I need to use to get the content I want?

query() returns a DOMNodeList, which doesn't have a nodeValue property. You have to select one element (i.e. the first):
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong')->item(0);
Or iterate over it:
foreach( $pName as $el) {
var_dump( $el->nodeValue);
}
Either one of these will give you access to a DOMNode, which is what you're looking for.
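One thing to watch for: item(0) returns null when the query matches nothing, so it is worth guarding the access. The sketch below assumes a simplified, hypothetical version of the page in which the h1 sits directly inside the container div; the $missing query is included only to show the empty case:

```php
<?php
// Hypothetical, simplified markup: the h1 is a direct child of the container div.
$html = '<div id="extendinfo_container"><h1><strong>Product Name</strong></h1></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$x = new DOMXPath($doc);

// Matching query: item(0) is a DOMNode, so nodeValue is available.
$first = $x->query('//div[@id="extendinfo_container"]/h1/strong')->item(0);
$pName = $first !== null ? $first->nodeValue : null;

// Non-matching query (there is no <a> in this markup): item(0) is null.
$none = $x->query('//div[@id="extendinfo_container"]/a/h1/strong')->item(0);
$missing = $none !== null ? $none->nodeValue : null;

var_dump($pName);   // string(12) "Product Name"
var_dump($missing); // NULL
```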

PHP's DOM extension can be picky about the markup you load into it: loadXML() rejects malformed documents, and loadHTML() raises warnings on invalid markup.
Turn off error suppression (remove the @ from @$doc->loadHTML) and make sure the parser is not choking on the page you're trying to analyze. Otherwise, your XPath query looks fine, and if the document is loaded and parsed properly, it should work.
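If removing the @ floods the output with warnings, one alternative is to collect the parser messages with libxml_use_internal_errors() and inspect them yourself. This is only a sketch: the deliberately sloppy $html string is made up, and which messages (if any) get reported depends on the libxml version installed.

```php
<?php
// Hypothetical, deliberately sloppy markup (unclosed <span> and <p>).
$html = '<div id="x"><p>unclosed<span></div>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Inspect whatever the parser reported, then clear the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}
libxml_clear_errors();

var_dump($loaded);
```

This keeps the diagnostics available for debugging without hiding them entirely, which is what the @ operator does.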

The query works fine. I was accessing the value wrong. Here is the correct way to access the value:
var_dump($pName->item(0)->nodeValue);
