In XPath, how can I get the node with the highest value? e.g.
<tr>
<td>$12.00</td>
<td>$24.00</td>
<td>$13.00</td>
</tr>
would return $24.00.
I'm using PHP DOM, so this would be XPath version 1.0.
I spent the last little while trying to come up with the most elegant solution for you. As you know, max isn't available in XPath 1.0. I've tried several different approaches, most of which don't seem very efficient.
<?php
$doc = new DOMDocument;
$doc->loadXml('<table><tr><td>$12.00</td><td>$24.00</td><td>$13.00</td></tr></table>');
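function dom_xpath_max($current, $nodes)
{
// $this is reserved in PHP, so the context node-set is passed as $current.
// Sort descending by text; plain string comparison only works while all values have the same width.
usort($nodes, function ($a, $b) { return strcmp($b->textContent, $a->textContent); });
return $current[0]->textContent == $nodes[0]->textContent;
}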
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('dom_xpath_max');
$result = $xpath->evaluate('//table/tr/td[php:function("dom_xpath_max", ., ../td)]');
echo $result->item(0)->textContent;
?>
Alternatively, you could use a foreach loop to iterate through the result of a simpler XPath expression (one which only selects all of the TD elements) and find the highest number.
<?php
...
$xpath = new DOMXPath($doc);
$result = $xpath->evaluate('//table/tr/td');
$highest = '';
foreach ( $result as $node )
if ( $node->textContent > $highest )
$highest = $node->textContent;
echo $highest;
?>
You could also use the XSLTProcessor class with an XSL document that uses the math:max function from exslt.org, but when I tried that I couldn't get it to work properly because of the dollar signs ($).
I've tested both solutions and they worked well for me.
First of all, it would be extremely difficult, if not impossible, to write a single XPath query that returns the highest value when the values contain non-numeric characters such as the $ in your case. But if you consider the XML fragment without the $, like
<tr>
<td>12.00</td>
<td>24.00</td>
<td>13.00</td>
</tr>
then you can write a single XPath query to retrieve the highest value node.
//tr/td[not(preceding-sibling::td/text() > text() or following-sibling::td/text() > text())]
This query returns the <td> whose value is 24.00.
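For what it's worth, here is a minimal sketch of running that query through DOMXPath. It assumes the dollar signs have already been removed from the cells; the <table> wrapper and the variable names are only there for illustration.

```php
<?php
// Sample markup without the dollar signs; the <table> wrapper is an assumption.
$doc = new DOMDocument;
$doc->loadXML('<table><tr><td>12.00</td><td>24.00</td><td>13.00</td></tr></table>');

$xpath = new DOMXPath($doc);

// Keep a <td> only if no sibling before or after it holds a larger number.
$query = '//tr/td[not(preceding-sibling::td/text() > text()'
       . ' or following-sibling::td/text() > text())]';

$maxNodes = $xpath->query($query);
$maxText  = $maxNodes->item(0)->textContent;

echo $maxText, "\n";
```

Note that if two cells share the highest value, the query selects both of them, so it may be worth checking the length of the result.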
Hope this helps.
A pure XPath 1.0 solution is difficult: at the XSLT level you would use recursion, but that involves writing functions or templates, so it rules out pure XPath. In PHP, I would simply pull all the data back into the host language and compute the max() using PHP code.
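The host-language approach could look roughly like the sketch below. The <table> wrapper, the variable names and the way the currency symbol is stripped (dropping everything except digits and the decimal point) are my own assumptions, not part of the question.

```php
<?php
$doc = new DOMDocument;
$doc->loadXML('<table><tr><td>$12.00</td><td>$24.00</td><td>$13.00</td></tr></table>');

$xpath = new DOMXPath($doc);

$maxNode  = null;
$maxValue = null;

foreach ($xpath->query('//tr/td') as $td) {
    // Strip everything except digits and the decimal point, then compare numerically.
    $value = (float) preg_replace('/[^0-9.]/', '', $td->textContent);
    if ($maxValue === null || $value > $maxValue) {
        $maxValue = $value;
        $maxNode  = $td;
    }
}

echo $maxNode->textContent, "\n";
```

Comparing floats rather than strings means values of different widths, such as $9.00 and $124.00, are ordered correctly.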
I am using explode to manipulate information I am scraping from a website. I am trying to eliminate something specific from the string so that it will return what I want and also add the rest of the items to the array.
$pageArray = explode('<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&week=draft">', $fantasyPros);
I would like to skip the antonio-brown section and use a regular expression or whatever is best to replace it so that it will not look for a specific name but every name on the list and add them to my array. Do you have any suggestions on what I should use here? I appreciate any assistance.
Seems like a parser job to me with appropriate xpath functions, e.g. not().
Consider the following code:
<?php
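// The anchor markup was lost from this sample; the first href comes from the question, the second is a placeholder.
$data = <<<DATA
<td class="player-label">
<a href="/nfl/players/antonio-brown.php?type=overall&amp;week=draft">Some brown link here</a>
<a href="/nfl/players/some-other-player.php?type=overall&amp;week=draft">Some green link here</a>
</td>
DATA;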
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$green_links = $xpath->query("//a[not(contains(@href, 'antonio-brown'))]");
foreach ($green_links as $link) {
// do sth. useful here
}
?>
This selects every link whose href does not contain antonio-brown.
You can easily adjust this to td or any other element.
I am trying to extract text between Multilevel XML tags.
This is the data file
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
.
.
.
</TranslationStack>
</eSearchResult>
I just want to extract the ten ids between the <Id></Id> tags enclosed inside <IdList></IdList>.
Regex gets me just the first value out of the ten.
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids);
The XML data is stored in the $temp_str variable and I am trying to get the values stored in $pids.
Any other suggestions on how to go about this?
Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches on digits within an <Id> tag. The trickiest part (I think), is in the foreach loop, where I iterate $out[1]. This is because, from the URL above,
Orders results so that $matches[0] is an array of full pattern
matches, $matches[1] is an array of strings matched by the first
parenthesized subpattern, and so on.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
"<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
$out,PREG_PATTERN_ORDER);
foreach ($out[1] as $o){
echo $o;
echo "\n";
}
?>
You should use php's xpath capabilities, as explained here:
http://www.w3schools.com/php/func_simplexml_xpath.asp
Example:
<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>
XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
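If you want plain strings rather than SimpleXMLElement objects, you can cast each result. This is only a sketch: it loads a shortened copy of the sample from a string instead of the searchdata.xml file, and the variable names are mine.

```php
<?php
// Shortened copy of the sample; only the parts the query touches are kept.
$xmlString = '<eSearchResult><Count>7117</Count><IdList>'
           . '<Id>24887359</Id><Id>24884828</Id><Id>24884718</Id>'
           . '</IdList></eSearchResult>';

$xml = simplexml_load_string($xmlString);

// xpath() returns SimpleXMLElement objects; cast each one to get its text.
$ids = array_map('strval', $xml->xpath('IdList/Id'));

print_r($ids);
```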
Use the pattern (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\> with the global option (in PHP, that means preg_match_all).
Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath to fetch parts of an XML DOM.
If you want any Id element in the first IdList of the eSearchResult:
/eSearchResult/IdList[1]/Id
As you can see Xpath "knows" about the actual structure of an XML document. PCRE does not.
You need to create an Xpath object for a DOM document
$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXpath($dom);
$result = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
$result[] = trim($id->nodeValue);
}
var_dump($result);
How do I ignore html tags in this preg_replace.
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind and, although like any rule it does not always apply, it's worth thinking about.
XPath allows you to find the search terms within text only, ignoring all XML elements.
Then you only need to wrap those texts in the <span> and you're done.
Edit: Finally some code ;)
First it uses XPath to locate the elements that contain the search text. My query looks like this; it might be written better, I'm not a super XPath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain text nodes which, put together, form a string that contains your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
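On the quoting caveat mentioned above: one common workaround is to build the XPath string with concat() whenever the search text contains both kinds of quote. The helper below is a hypothetical sketch (xpath_literal is my own name, not a built-in), and the small document is made up for the demonstration.

```php
<?php
/**
 * Hypothetical helper: turn an arbitrary PHP string into an XPath 1.0 string expression.
 */
function xpath_literal($value)
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Both quote types present: split on " and glue the pieces back with concat().
    $parts = array();
    foreach (explode('"', $value) as $part) {
        $parts[] = '"' . $part . '"';
    }
    return 'concat(' . implode(", '\"', ", $parts) . ')';
}

// Made-up sample document containing both quote characters.
$doc = new DOMDocument;
$doc->loadXML('<body><p>He said "it\'s fine" twice</p><p>other</p></body>');
$xp = new DOMXPath($doc);

$search = 'said "it\'s';
$hits = $xp->query('//p[contains(., ' . xpath_literal($search) . ')]');

echo $hits->length, "\n";
```

The helper returns a complete string expression, so it replaces the surrounding quotes in the query rather than being placed between them.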
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
}
// wrap each matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that this approach even finds text that is distributed across multiple tags. That's not easily possible with regular expressions.
You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answers example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
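To make the difference concrete, here is a small sketch with a made-up string containing two-byte characters. It assumes the mbstring extension is loaded; the byte escapes are only there so the example does not depend on how the source file is saved.

```php
<?php
// "héllo wörld" written with explicit UTF-8 byte escapes.
$text   = "h\xC3\xA9llo w\xC3\xB6rld";
$search = "w\xC3\xB6rld";

$byteOffset = strpos($text, $search);                 // counts bytes
$charOffset = mb_strpos($text, $search, 0, 'UTF-8');  // counts characters

echo $byteOffset, ' ', $charOffset, "\n";

// splitText() expects the character offset, so only $charOffset splits at the right place.
$doc  = new DOMDocument('1.0', 'UTF-8');
$node = $doc->createTextNode($text);
$tail = $node->splitText($charOffset);

echo $tail->nodeValue, "\n";
```

With pure US-ASCII input both offsets are the same, which is why the original example gets away with strpos.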
I have an XML document from which I want to extract some data:
<tnt:results>
<tnt:result>
<Document id="id1">
<impact _blabla_ for="tree.def" name="Something has changed"
select="moreblabla">true</impact>
<impact _blabla_ for="plant.def" name="Something else has changed"
select="moreblabla">true</impact>
</Document>
</tnt:result>
</tnt:results>
In reality there are no line breaks -- it's one continuous string, and there can be multiple <Document> elements. I want a regular expression that extracts:
id1
tree.def / plant.def
Something has changed / Something else has changed
I was able to come up with this code so far, but it only matches the first impact, rather than both of them:
preg_match_all('/<Document id="(.*)">(<impact.*for="(.*)".*name="(.*)".*<\/impact>)*<\/Document>/U', $response, $matches);
The other way to do it would be to match everything inside the Document element and pass it through a RegEx once more, but I thought I could do this with only one RegEx.
Thanks a lot in advance!
Just use DOM, it's easy enough:
$dom = new DOMDocument;
$dom->loadXML($xml_string);
$documents = $dom->getElementsByTagName('Document');
foreach ($documents as $document) {
echo $document->getAttribute('id'); // id1
$impacts = $document->getElementsByTagName('impact');
foreach ($impacts as $impact) {
echo $impact->getAttribute('for'); // tree.def
echo $impact->getAttribute('name'); // Something has changed
}
}
Don't use RegEx. Use an XML parser.
Really, if you have to worry about multiple Document elements and extracting all sorts of attributes, you're much better off using an XML parser or a query language like XPath.
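As a rough illustration of the XPath route, the sketch below collects the three values per impact. The sample is a well-formed stand-in for the one in the question: the tnt namespace URI is made up, and the _blabla_ placeholder attributes are left out.

```php
<?php
// Well-formed stand-in for the sample; the namespace URI is an assumption.
$xml = '<tnt:results xmlns:tnt="urn:example:tnt"><tnt:result><Document id="id1">'
     . '<impact for="tree.def" name="Something has changed" select="moreblabla">true</impact>'
     . '<impact for="plant.def" name="Something else has changed" select="moreblabla">true</impact>'
     . '</Document></tnt:result></tnt:results>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$rows = array();
foreach ($xpath->query('//Document') as $document) {
    // The second argument makes the query relative to this <Document>.
    foreach ($xpath->query('impact', $document) as $impact) {
        $rows[] = array(
            $document->getAttribute('id'),
            $impact->getAttribute('for'),
            $impact->getAttribute('name'),
        );
    }
}

print_r($rows);
```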
I am not totally new to PHP or XML, but I am 100% new to parsing XML with PHP. I have an XML string that has several nodes, but the only ones I am interested in are the <keyword> nodes. There is an unknown number of them, each containing a phrase like so: <keyword>blue diamond jewelry</keyword>. For example, say the string looked like this:
<xml>
<pointless_node/>
<seq>
<keyword>diamond ring</keyword>
<keyword>ruby necklace</keyword>
<keyword>mens watch</keyword>
</seq>
<some_node/>
</xml>
I would want an array like this:
['diamond ring','ruby necklace','mens watch']
I tried looking at the PHP manual and just get confused and not sure what to do. Can someone please walk me through how to do this? I am using PHP4.
THANKS!
This turns $keywords into an array of
Objects. Is there a way to get the
text from the objects?
Sure, see this.
$dom = domxml_open_mem($str);
$keywords = $dom->get_elements_by_tagname('keyword');
foreach($keywords as $keyword) {
$text = $keyword->get_content();
// Whatever
}
xml_parse_into_struct() might be what you're looking for.
It works in PHP versions >= 4.
http://se.php.net/xml_parse_into_struct
http://www.w3schools.com/PHP/func_xml_parse_into_struct.asp
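A minimal sketch of how that could be used for the keyword case follows. The variable names are mine, and case folding is switched off so the tag names keep the casing used in the document.

```php
<?php
$str = '<xml><pointless_node/><seq>'
     . '<keyword>diamond ring</keyword>'
     . '<keyword>ruby necklace</keyword>'
     . '<keyword>mens watch</keyword>'
     . '</seq><some_node/></xml>';

$parser = xml_parser_create();
// Keep tag names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parse_into_struct($parser, $str, $values, $index);

$keywords = array();
foreach ($values as $entry) {
    // "complete" entries are elements that hold only text.
    if ($entry['tag'] === 'keyword' && $entry['type'] === 'complete') {
        $keywords[] = $entry['value'];
    }
}

print_r($keywords);
```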
I think the easiest is:
$dom = domxml_open_mem($str);
$keywords = $dom->get_elements_by_tagname('keyword');
see: http://www.php.net/simplexml-element-xpath
Try the following xpath and array construction
$string = "<xml>
<pointless_node/>
<seq>
<keyword>diamond ring</keyword>
<keyword>ruby necklace</keyword>
<keyword>mens watch</keyword>
</seq>
<some_node/>
</xml>";
$xml = domxml_open_mem($string);
$xpath = $xml->xpath_new_context();
$result = xpath_eval($xpath, '//keyword');
$keywords = array();
foreach ($result->nodeset as $node)
{
// collect the text of each <keyword> node
$keywords[] = $node->get_content();
}
edit: modified code to reflect php 4 requirements
edit: modified to account for poorly documented behaviour of xpath_new_context (php docs comments point out the error)