XPath - Get text from parent using php xpath - php

I am trying to get the text from a specific node's parent. For example:
<td colspan="1" rowspan="1">
<span>
<a class="info" shape="rect"
rel="empLinkData" href="/employee.htm?id=8468524">
Jack Johnson
</a>
</span>
(*)
</td>
I am able to successfully process the anchor tag by using:
$xNodes = $xpath->query('//a[@class="info"][@rel="empLinkData"]');
// $xNodes contains employee ids and names
foreach ($xNodes as $xNode)
{
$sLinktext = @$xNode->firstChild->data;
$sLinkurl = 'http://www.company.com' . $xNode->getAttribute('href');
if ($sLinktext != '' && $sLinkurl != '')
{
echo '<li><a href="' . $sLinkurl . '">' .
$sLinktext . '</a></li>';
}
}
Now, I need to retrieve the text from the <td> tag (in this case, the (*) appearing right after the span tag closes), but I can't seem to refer to it properly.
The xpath for this that seems to make the most sense to me is:
$xNodes = $xpath->query('//a[@class="info"]
[@rel="empLinkData"]/ancestor::*');
but it is retrieving the wrong data from elsewhere nested above this code.

It's not necessary to retreat back up the tree. Instead, directly select the td that contains the relevant element:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()
Edit: As @Dimitre rightly pointed out, this selects all text children. Your td has two such nodes: the whitespace-only text node that precedes the span and the text node that follows it. If you only want the second text node, then use:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()[2]
Or:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()[last()]
As you can see, the resulting expressions are essentially the same, but you do need to target the correct text node (if you want only one). Note also that if the target text is truly in a td then it's safer to target that element type directly (without wildcards). As this is HTML, your actual document almost certainly contains several other elements, including multiple other anchors that you may not want to target.
Sample PHP:
$nodes = $xpath->query(
'//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()[last()]');
echo "[". $nodes->item(0)->nodeValue . "]";

Deepest td ancestor:
//a[@class="info"][@rel="empLinkData"]/ancestor::td[1]

Use:
//*[a[@class="info"][@rel="empLinkData"]]/following-sibling::text()[1]
This selects a single text node: exactly the wanted one.
Do note that an XPath expression like:
//td[descendant::a[@class="info"][@rel="empLinkData"]]/text()
selects more than one text node, not only the wanted text node.
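The difference between the two expressions can be checked with a minimal sketch. It assumes the td fragment from the question is loaded on its own with loadXML() (rather than as part of the full HTML page), so that whitespace-only text nodes are kept predictably; the variable names are illustrative only.

```php
<?php
// The td fragment from the question, parsed as XML (an assumption made
// for this sketch so that whitespace-only text nodes are preserved).
$xml = <<<'XML'
<td colspan="1" rowspan="1">
<span>
<a class="info" shape="rect"
rel="empLinkData" href="/employee.htm?id=8468524">
Jack Johnson
</a>
</span>
(*)
</td>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$anchor = 'a[@class="info"][@rel="empLinkData"]';

// Every text child of the td: the whitespace before the span and the "(*)" after it.
$all = $xpath->query('//td[descendant::' . $anchor . ']/text()');

// Only the first text node that follows the element containing the anchor.
$one = $xpath->query('//*[' . $anchor . ']/following-sibling::text()[1]');

$allCount = $all->length;
$marker = trim($one->item(0)->nodeValue);

echo $allCount, PHP_EOL;          // 2
echo '[', $marker, ']', PHP_EOL;  // [(*)]
```

With a real page loaded through loadHTML(), the number of whitespace-only text nodes may differ, which is one more reason to prefer the following-sibling form or text()[last()] over a fixed index.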

Related

How to get xPath nodeValue USD dollar amount

I'm trying to start from the <span> element that has text Value when transacted
Then get its parent <div> and get following sibling which is a <div> and from that <div> get the text of the child <span>.
From what I can tell, the code is correct and should echo $1,034.29.
It echos $0.00 instead.
What am I missing here?
php code:
$a = new DOMXPath($doc);
$dep_val_txt = $a->query("//span[contains(text(), 'Value when transacted')]");
$dep_val_nxt_elem = $a->query("parent::div", $dep_val_txt[0]);
$dep_val_elem = $a->query("following-sibling::*[1]", $dep_val_nxt_elem[0]);
$dep_val = $dep_val_elem->item(0)->childNodes->item(0)->nodeValue;
echo $dep_val;
html code:
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
In case someone else stumbles upon this question in the future, I will summarize the solution which was concluded by conversation with OP in the comments:
The issue here is not with the DOM selectors: the output is $0.00 even though the code does nothing to format the value as a currency. This suggests that the website being scraped serves placeholder values which are then updated on the client side using JavaScript. Selectors cannot fix this, because the DOM that PHP receives is the initial render, which does not contain the values we wish to scrape.
So the solution is to examine the website being scraped to determine where and how the values are being fetched before being added to the DOM on the client side. For example, if the website is using an API call to fetch the values, one can simply use the same API to fetch the intended data without having to scrape the HTML DOM at all.
If you follow the OP's question literally:
start from the <span> element that has text "Value when transacted"
get its parent <div>
get following sibling which is a <div>
get the text of the child <span>
then the xpath expression should be
//span[text()='Value when transacted']/parent::div/following-sibling::div/span
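As a rough check, the expression can be run against the static HTML from the question. This is only a sketch: it assumes the value is actually present in the markup PHP receives, which (per the answer above) may not be the case on the live site.

```php
<?php
// Static sample from the question; a nowdoc keeps "$1,034.29" literal.
$html = <<<'HTML'
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    "//span[text()='Value when transacted']/parent::div/following-sibling::div/span"
);

// Guard against an empty result instead of dereferencing item(0) blindly.
$value = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;

echo $value, PHP_EOL;   // $1,034.29
```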
You might find it easier and faster to process using a regex to match the price, here's a quick example in PHP:
<?php
// Your input HTML (as per your example)
$inputHtml = <<<HTML
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
HTML;
$matches = [];
// Look for any div > span element which contains a string starting with $ and then match a number (allowing for a , or . within the price matched).
if (preg_match_all('#<div.*>\s*<span.*?>\$([0-9.,]+)</span>\s*</div>#mis', $inputHtml, $matches)) {
echo 'Price found: ' . $matches[1][0] . PHP_EOL;
}
Console output from this:
Price found: 1,034.29

Fetch Content from A Div id using PHP Simple HTML DOM Parser

I have the following HTML code:
<div id="b_changetext" class="FL gL_13 PT15"> <span class="gr_15 uparw_pc"><strong>5.80</strong></span> (+2.28%)</div>
I want to extract the content (+2.28%).
I tried the following code:
foreach($html->find('div[n_changetext]') as $e){
echo $e->innertext . '<br>';
echo "wwwwww";
}
When I run it, it does not enter the foreach loop ("wwwwww" is not displayed).
Can anyone please suggest a solution?
div[n_changetext] finds elements with an n_changetext attribute (which is not valid in HTML).
To find an element with a given id you must specify that the name of the attribute is id and specify the value.
The value, in your example, starts with a b not an n:
find('div[id=b_changetext]')
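Note that innertext on the matched div returns the whole inner HTML, not just (+2.28%). As an alternative sketch, here is the same extraction done with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, not what the question uses):

```php
<?php
$html = '<div id="b_changetext" class="FL gL_13 PT15"> '
      . '<span class="gr_15 uparw_pc"><strong>5.80</strong></span> (+2.28%)</div>';

libxml_use_internal_errors(true);   // tolerate the bare HTML fragment
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The last text node that is a direct child of the div, i.e. the text after the span.
$node = $xpath->query('//div[@id="b_changetext"]/text()[last()]')->item(0);
$change = $node !== null ? trim($node->nodeValue) : '';

echo $change, PHP_EOL;   // (+2.28%)
```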

Using filterxpath to get the texts that haven't selectors

<div class="menutitle" onclick="SwitchMenu('sub38');sorter=new table.sorter('sorter');sorter.init('taboastreams38',4);"><meta name="fe38" itemprop="startDate" content="2017-09-19T18:30"><span class="t">19:30</span> <span class="es" style="display: none;">qqqq</span><span class="en">fff</span> **text here** : <b><span itemprop="name">eee- rrr</span></b></div>
I am using PHP and filterXPath. I have tried many times to get the text that has no selector, but I can't. I can get the data from any other selector, but not the text at this location (text here).
The code that I used:
$EventlistNodeValues = $crawler->filterXPath('//div[@class="menutitle"]')->each(function (Crawler $node, $i) {
$event = $node->filterXPath('//'); // i need to change selector here to get text
return json_encode($event,true);
});
//div[@class='menutitle']//b/../text()
With the XPath pattern above, I am able to select the text nodes that sit directly inside the parent of the <b> tag. That is the text without a selector, which is what I needed.

Simple HTML DOM Parser - Get all plaintex rather than text of certain element

I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this:
Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text element may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
Remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text

Retrieving relative DOM nodes in PHP

I want to retrieve the data of the next element tag in a document, for example:
I would like to retrieve <blockquote> Content 1 </blockquote> for every different span only.
<html>
<body>
<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12342></span>
<blockquote>Content 1</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>
<!-- misc html in between including other spans w/ no relative blockquotes-->
<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
</body>
</html>
Now two things I'm wondering:
1.)How can I write an expression that matches and only outputs a blockquote that's followed right after a closed element (<span></span>)?
2.)If I wanted, how could I get Content 2, Content 3, etc if I ever have a need to output them in the future while still applying to the rules of the previous question?
Now two things I'm wondering:
1.)How can I write an expression that matches and only outputs a blockquote
that's followed right after a closed
element (<span></span>)?
Assuming that the provided text is converted to a well-formed XML document (you need to enclose the values of the id attributes in quotes)
Use:
/*/*/span/following-sibling::*[1][self::blockquote]
This means in English: Select all blockquote elements each of which is the first, immediate following sibling of a span element that is a grand-child of the top element of the document.
2.)If I wanted, how could I get Content 2, Content 3, etc if I ever
have a need to output them in the
future while still applying to the
rules of the previous question?
Yes.
You can get all sets of contiguous blockquote elements following a span:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*[not(self::blockquote)][1][self::span]]
You can get the contiguous set of blockquote elements following the (N+1)-st span by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=$vN]
]
where $vN should be substituted by the number N.
Thus, the contiguous set of blockquote elements following the first span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=0]
]
The contiguous set of blockquote elements following the second span is selected by:
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=1]
]
etc. ...
See in the XPath Visualizer the nodes selected by the following expression :
/*/*/span/following-sibling::blockquote
[preceding-sibling::*
[not(self::blockquote)][1]
[self::span and count(preceding-sibling::span)=3]
]
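The expressions above can be tried from PHP with a small sketch. It assumes a shortened, well-formed version of the document (quoted id values, and hypothetical labels A1, A2, B1 in place of the repeated "Content n" text so the groups are easy to tell apart); the helper names textsOf() and blockquotesAfterSpan() are made up for this example.

```php
<?php
// Shortened, well-formed stand-in for the document in the question.
$xml = <<<'XML'
<html>
<body>
<span id="12341"></span>
<blockquote>A1</blockquote>
<blockquote>A2</blockquote>
<p>misc</p>
<span id="12342"></span>
<blockquote>B1</blockquote>
</body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Hypothetical helper: collect the text content of each selected node.
function textsOf(DOMNodeList $nodes): array
{
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->nodeValue;
    }
    return $out;
}

// Hypothetical helper: contiguous blockquotes after the (N+1)-st span,
// with N substituted for $vN in the expression from the answer.
function blockquotesAfterSpan(DOMXPath $xpath, int $n): array
{
    $query = '/*/*/span/following-sibling::blockquote'
           . '[preceding-sibling::*[not(self::blockquote)][1]'
           . '[self::span and count(preceding-sibling::span)=' . $n . ']]';
    return textsOf($xpath->query($query));
}

// Only the blockquote immediately following each span.
$first = textsOf($xpath->query('/*/*/span/following-sibling::*[1][self::blockquote]'));
$groupA = blockquotesAfterSpan($xpath, 0);
$groupB = blockquotesAfterSpan($xpath, 1);

echo implode(', ', $first), PHP_EOL;   // A1, B1
echo implode(', ', $groupA), PHP_EOL;  // A1, A2
echo implode(', ', $groupB), PHP_EOL;  // B1
```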
Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.
http://www.php.net/DOM
Long answer:
$flag = false;
$TEXT = array();
foreach ($body->childNodes as $el) {
if ($el->nodeName === '#text') continue;
if ($el->nodeName === 'span') {
$flag = true;
continue;
}
if ($flag && $el->nodeName === 'blockquote') {
$TEXT[] = $el->firstChild->nodeValue;
$flag = false;
continue;
}
}
Try the following *
/html/body/span/following-sibling::*[1][self::blockquote]
to match any first blockquotes after a span element that are direct children of body or
//span/following-sibling::*[1][self::blockquote]
to match any first blockquotes following a span element anywhere in the document
* edit: fixed Xpath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.
Both of the above would match the "Content 1" blockquotes. If you want to match the other blockquotes following the span (siblings, not descendants), remove the [1].
Example:
$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
echo $dom->saveXml($blockquote), PHP_EOL;
}
If you want to do that without XPath, you can do
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
if($span->nextSibling !== NULL &&
$span->nextSibling->nodeName === 'blockquote')
{
echo $dom->saveXml($span->nextSibling), PHP_EOL;
}
}
If the HTML you scrape is not valid XHTML, use loadHtmlFile() instead to load the markup. You can suppress errors with libxml_use_internal_errors(TRUE) and libxml_clear_errors().
Also see Best methods to parse HTML for alternatives to DOM (though I find DOM a good choice).
Besides @Dimitre's good answer, you could also use:
/html
/body
/blockquote[preceding-sibling::*[not(self::blockquote)][1]
/self::span[@id='12341']]
