Partial matching using Xpath


I'm trying to create a search function allowing partial matching by song title or genre using Xpath.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<playlist>
<item>
<songid>USAT29902236</songid>
<songtitle>I Say a Little Prayer</songtitle>
<artist>Aretha Franklin</artist>
<genre>Soul</genre>
<link>https://www.amazon.com/I-Say-a-Little-Prayer/dp/B001BZD6KO</link>
<releaseyear>1968</releaseyear>
</item>
<item>
<songid>GBAAM8300001</songid>
<songtitle>Every Breath You Take</songtitle>
<artist>The Police</artist>
<genre>Pop/Rock</genre>
<link>https://www.amazon.com/Every-Breath-You-Take-Police/dp/B000008JI6</link>
<releaseyear>1983</releaseyear>
</item>
<item>
<songid>GBBBN7902002</songid>
<songtitle>London Calling</songtitle>
<artist>The Clash</artist>
<genre>Post-punk</genre>
<link>https://www.amazon.com/London-Calling-Remastered/dp/B00EQRJNTM</link>
<releaseyear>1979</releaseyear>
</item>
</playlist>
and this is my search function so far:
function searchSong($words){
global $xml;
if(!empty($words)){
foreach($words as $word){
//$query = "//playlist/item[contains(songtitle/genre, '{$word}')]";
$query = "//playlist/item[(songtitle[contains('{$word}')]) and (genre[contains('{$word}')])]";
$result = $xml->xpath($query);
}
}
print_r($result);
}
Calling searchSong(array("take", "soul")) should return the second and first songs from the XML file, but the result is always empty.

There are a few errors here: you use and where you need or, you assume the search is case-insensitive, and you pass the wrong number of arguments to contains(), which takes two. The last one triggers PHP warnings, which you would have seen with error reporting enabled. Also, $result is overwritten on every iteration, so you only ever get the matches for the last word you search for.
Case-insensitive searches in XPath 1.0 (which is all PHP supports) are a huge pain to do:
$result = $xpath->query(
"//playlist/item[(songtitle[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')]) or (genre[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')])]"
);
This assumes $xpath is a DOMXPath instance and that you have already converted your search terms to lower-case. For example:
<?php
function searchSong($xpath, ...$words)
{
$return = [];
foreach($words as $word) {
$word = strtolower($word);
$q = "//playlist/item[(songtitle[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')]) or (genre[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')])]";
$result = $xpath->query($q);
foreach($result as $node) {
$return[] = $node;
}
}
return $return;
}
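To see the approach run end to end, here is a minimal sketch against a trimmed copy of the playlist. The helper name findSongs and the idea of keying results by songid (so an item matched by several words is kept once) are assumptions added for illustration, not part of the answer above. It also assumes the search terms contain no single quotes, which would break the XPath string literal.

```php
<?php
// Hypothetical helper: case-insensitive partial match on songtitle or genre.
function findSongs(DOMXPath $xpath, string ...$words): array
{
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';
    $found = [];
    foreach ($words as $word) {
        // Assumption: $word contains no single quote.
        $word = strtolower($word);
        $q = "//playlist/item[contains(translate(songtitle, '$upper', '$lower'), '$word')"
           . " or contains(translate(genre, '$upper', '$lower'), '$word')]";
        foreach ($xpath->query($q) as $item) {
            // Key by song id so an item matched by several words appears once.
            $id = $xpath->evaluate('string(songid)', $item);
            $found[$id] = $xpath->evaluate('string(songtitle)', $item);
        }
    }
    return $found;
}

$document = new DOMDocument();
$document->loadXML('<playlist>
  <item><songid>USAT29902236</songid><songtitle>I Say a Little Prayer</songtitle><genre>Soul</genre></item>
  <item><songid>GBAAM8300001</songid><songtitle>Every Breath You Take</songtitle><genre>Pop/Rock</genre></item>
  <item><songid>GBBBN7902002</songid><songtitle>London Calling</songtitle><genre>Post-punk</genre></item>
</playlist>');

$songs = findSongs(new DOMXPath($document), 'take', 'soul');
print_r($songs);
```

With the words take and soul this finds "Every Breath You Take" by title and "I Say a Little Prayer" by genre.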

DOM gives you another option: you can register PHP functions and call them from XPath expressions.
So write a function that does the matching logic:
function contentContains($nodes, ...$needles) {
// ICU's Transliterator (from the intl extension) is really convenient:
// create one that lowercases and replaces umlauts with ASCII letters
$transliterator = \Transliterator::create('Any-Lower; Latin-ASCII');
foreach ($nodes as $node) {
$haystack = $transliterator->transliterate($node->nodeValue);
foreach ($needles as $needle) {
if (FALSE !== strpos($haystack, $needle)) {
return TRUE;
}
}
}
return FALSE;
}
Now you can register it on a DOMXpath instance:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions(['contentContains']);
$expression = "//item[
php:function('contentContains', songtitle, 'take', 'soul') or
php:function('contentContains', genre, 'take', 'soul')
]";
$result = [];
foreach ($xpath->evaluate($expression) as $node) {
// read values as strings
$result[] = [
'title' => $xpath->evaluate('string(songtitle)', $node),
'genre' => $xpath->evaluate('string(genre)', $node),
// ...
];
}
var_dump($result);

Related

How to return full set of child nodes based on search of XML file

I am trying to search an XML file of the following structure:
<Root>
<Record>
<Filenumber>12314123</Filenumber>
<StatusEN>Closed</StatusEN>
<StatusDate>02 Nov 2019</StatusDate>
</Record>
<Record>
<Filenumber>0678672301</Filenumber>
<StatusEN>Closed</StatusEN>
<StatusDate>02 Nov 2019</StatusDate>
</Record>
</Root>
I want to search based on the filenumber, but return all 3 nodes and values for the match.
I am trying
$q = '12314123';
$file = "status.xml";
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->Load($file);
$xpath = new DOMXPath($doc);
$query = "/Root/Record/Filenumber[contains(text(), '$q')]";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo $entry->parentNode->nodeValue ;
}
This seems to return all the values I want but in one single string. How can I return them as separate variables or even better, in an array or JSON?

Neither DOMNodeList nor DOMElement knows how to turn itself into an array, so you have to build the array yourself:
foreach ($entries as $entry) {
$result = [];
foreach ($entry->parentNode->childNodes as $node) {
$result[$node->nodeName] = $node->nodeValue;
}
var_dump($result);
}
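If you want JSON, one possible variation is to select the matching Record element directly and collect one array per record. This is only a sketch: the inline XML stands in for status.xml, and selecting '*' relative to the record is a choice made here so that whitespace text nodes are never picked up.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<Root>
<Record><Filenumber>12314123</Filenumber><StatusEN>Closed</StatusEN><StatusDate>02 Nov 2019</StatusDate></Record>
<Record><Filenumber>0678672301</Filenumber><StatusEN>Closed</StatusEN><StatusDate>02 Nov 2019</StatusDate></Record>
</Root>');
$xpath = new DOMXPath($doc);

$q = '12314123';
$records = [];
// Select the Record whose Filenumber contains the search string.
foreach ($xpath->query("/Root/Record[contains(Filenumber, '$q')]") as $record) {
    $row = [];
    // '*' selects element children only, relative to the current record.
    foreach ($xpath->query('*', $record) as $child) {
        $row[$child->nodeName] = $child->nodeValue;
    }
    $records[] = $row;
}

$json = json_encode($records);
echo $json, "\n";
// prints [{"Filenumber":"12314123","StatusEN":"Closed","StatusDate":"02 Nov 2019"}]
```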

Surrounding bits of text with xml elements

I'm looking for a way to dynamically surround parts of text with XML nodes based on regular expressions.
Consider the following example
<speak>The test number is 123456789, and some further block of text.</speak>
Now let's say I have a regular expression targeting the number to selectively surround it with a new tag so it would become:
<speak>The test number is <say-as interpret-as="characters">123456789</say-as>, and some further block of text.</speak>
I thought about using DomDocument for creating the tags, but not sure about the substitution part. Any advice?

DOM is the correct way. It allows you to find and traverse text nodes. Use RegEx on the content of these nodes and build the new nodes up as a fragment.
function wrapMatches(\DOMNode $node, string $pattern, string $tagName, $tagAttributes = []) {
$document = $node instanceof DOMDocument ? $node : $node->ownerDocument;
$xpath = new DOMXpath($document);
// iterate all descendant text nodes
foreach ($xpath->evaluate('.//text()', $node) as $textNode) {
$content = $textNode->textContent;
$found = preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);
$offset = 0;
if ($found) {
// fragments allow to treat multiple nodes as one
$fragment = $document->createDocumentFragment();
foreach ($matches[0] as $match) {
list($matchContent, $matchStart) = $match;
// add text from last match to current
$fragment->appendChild(
$document->createTextNode(substr($content, $offset, $matchStart - $offset))
);
// add wrapper element, ...
$wrapper = $fragment->appendChild($document->createElement($tagName));
// ... set its attributes ...
foreach ($tagAttributes as $attributeName => $attributeValue) {
$wrapper->setAttribute($attributeName, $attributeValue);
}
// ... and add the text content
$wrapper->textContent = $matchContent;
$offset = $matchStart + strlen($matchContent);
}
// add text after last match
$fragment->appendChild($document->createTextNode(substr($content, $offset)));
// replace the text node with the new fragment
$textNode->parentNode->replaceChild($fragment, $textNode);
}
}
}
$xml = <<<'XML'
<speak>The test number is 123456789, and some further block of text.</speak>
XML;
$document = new DOMDocument();
$document->loadXML($xml);
wrapMatches($document, '(\d+)u', 'say-as', ['interpret-as' => 'characters']);
echo $document->saveXML();

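This is conveniently handled by the xsl:analyze-string instruction in XSLT 2.0. Note that it needs an XSLT 2.0 processor such as Saxon; PHP's XSLTProcessor is built on libxslt, which only implements XSLT 1.0. For example, you can define the rule below (the xsl:non-matching-substring branch copies the text around the numbers, which would otherwise be dropped):
<xsl:template match="speak">
<speak>
<xsl:analyze-string select="." regex="\d+">
<xsl:matching-substring>
<say-as interpret-as="characters">
<xsl:value-of select="."/>
</say-as>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</speak>
</xsl:template>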

You can use preg_replace something like this:
$str = '<speak>The test number is 123456789, and some further block of text.</speak>';
echo preg_replace('/(\d+)/','<say-as interpret-as="characters">$1</say-as>',$str);
and the output would be:
<speak>The test number is <say-as interpret-as="characters">123456789</say-as>, and some further block of text.</speak>

I ended up doing it the simple way, since I don't need to handle nested nodes or other XML-specific cases. I just wrote a simple method that builds the tags as strings, which is good enough here.
protected function createTag($name, $attributes = [], $content = null)
{
$openingTag = '<' . $name;
if ($attributes) {
foreach ($attributes as $attribute => $value) {
$openingTag .= sprintf(' %s="%s"', $attribute, $value);
}
}
$openingTag .= '>';
$closingTag = '</' . $name . '>';
$content = $content ?: '$1';
return $openingTag . $content . $closingTag;
}
$tag = $this->createTag($tagName, $attributes);
$text = preg_replace($regex, $tag, $text);
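One thing the string approach leaves open is escaping of attribute values. A hedged sketch of a standalone variant follows; the function name createTagEscaped is hypothetical, and it assumes attribute values never contain preg_replace back-references such as $1, since the tag is still used as a replacement string.

```php
<?php
// Hypothetical standalone variant of createTag() that escapes attribute values.
function createTagEscaped(string $name, array $attributes = [], ?string $content = null): string
{
    $openingTag = '<' . $name;
    foreach ($attributes as $attribute => $value) {
        // Escape quotes, ampersands and angle brackets for use inside an XML attribute.
        $escaped = htmlspecialchars((string) $value, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $openingTag .= sprintf(' %s="%s"', $attribute, $escaped);
    }
    // Default content is the first capture group of the pattern passed to preg_replace().
    return $openingTag . '>' . ($content ?? '$1') . '</' . $name . '>';
}

$text = 'The test number is 123456789, and some further block of text.';
$tag = createTagEscaped('say-as', ['interpret-as' => 'characters']);
$wrapped = preg_replace('/(\d+)/', $tag, $text);
echo $wrapped, "\n";
```

This still treats the input as plain text, so it shares the limitation mentioned above: it is not safe for input that already contains markup.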

set tags in html using domdocument and preg_replace_callback

I'm trying to replace words that appear in my terminology dictionary with an HTML anchor so that each one gets a tooltip. The replacement itself works, but I can't get the result back into the DOMDocument object.
I've written a recursive function that walks the DOM: it visits every child node, searches for words from my dictionary, and replaces them with an anchor.
I had been doing this with a plain preg_match on the HTML string, but that runs into problems when the HTML gets complex.
The recursive function:
$terms = array(
'example'=>'explanation about example'
);
function iterate_html($doc, $original_doc = null)
{
global $terms;
if(is_null($original_doc)) {
self::iterate_html($doc, $doc);
}
foreach($doc->childNodes as $childnode)
{
$children = $childnode->childNodes;
if($children) {
self::iterate_html($childnode);
} else {
$regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
$new_nodevalue = preg_replace_callback($regexes, function($matches) {
$doc = new DOMDocument();
$anchor = $doc->createElement('a', $matches[0]);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
return $doc->saveXML($anchor);
}, $childnode->nodeValue);
$dom = new DOMDocument();
$template = $dom->createDocumentFragment();
$template->appendXML($new_nodevalue);
$original_doc->importNode($template->childNodes, true);
$childnode->parentNode->replaceChild($template, $childnode);
}
}
}
echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text

I don't think a recursive function that walks the DOM is useful when you can use an XPath query. I'm also not sure that preg_replace_callback is well suited to this case; I prefer preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// keep libxml from displaying HTML parse errors
$libxmlInternalErrors = libxml_use_internal_errors(true);
// check whether the HTML string already has a root html element; if not, add a fake root
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost, so it should run each time the HTML is edited, not each time it is displayed.

How to sort content of an XML file loaded with SimpleXML?

There is an XML file with a content similar to the following:
<FMPDSORESULT xmlns="http://www.filemaker.com">
<ERRORCODE>0</ERRORCODE>
<DATABASE>My_Database</DATABASE>
<LAYOUT/>
<ROW MODID="1" RECORDID="1">
<Name>John</Name>
<Age>19</Age>
</ROW>
<ROW MODID="2" RECORDID="2">
<Name>Steve</Name>
<Age>25</Age>
</ROW>
<ROW MODID="3" RECORDID="3">
<Name>Adam</Name>
<Age>45</Age>
</ROW>
</FMPDSORESULT>
I tried to sort the ROW tags by the values of Name tags using array_multisort function:
$xml = simplexml_load_file( 'xml1.xml');
$xml2 = sort_xml( $xml );
print_r( $xml2 );
function sort_xml( $xml ) {
$sort_temp = array();
foreach ( $xml as $key => $node ) {
$sort_temp[ $key ] = (string) $node->Name;
}
array_multisort( $sort_temp, SORT_DESC, $xml );
return $xml;
}
But the code doesn't work as expected.

I would recommend using the DOM extension, as it is more flexible:
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->load('xml1.xml');
// Get the root node
$root = $doc->getElementsByTagName('FMPDSORESULT');
if (!$root->length)
die('FMPDSORESULT node not found');
$root = $root[0];
// Pull the ROW tags from the document into an array.
$rows = [];
$nodes = $root->getElementsByTagName('ROW');
while ($row = $nodes->item(0)) {
$rows []= $root->removeChild($row);
}
// Sort the array of ROW tags
usort($rows, function ($a, $b) {
$a_name = $a->getElementsByTagName('Name');
$b_name = $b->getElementsByTagName('Name');
return ($a_name->length && $b_name->length) ?
strcmp(trim($a_name[0]->textContent), trim($b_name[0]->textContent)) : 0;
});
// Append ROW tags back into the document
foreach ($rows as $row) {
$root->appendChild($row);
}
// Output the result
echo $doc->saveXML();
Output
<?xml version="1.0"?>
<FMPDSORESULT xmlns="http://www.filemaker.com">
<ERRORCODE>0</ERRORCODE>
<DATABASE>My_Database</DATABASE>
<LAYOUT/>
<ROW MODID="3" RECORDID="3">
<Name>Adam</Name>
<Age>45</Age>
</ROW>
<ROW MODID="1" RECORDID="1">
<Name>John</Name>
<Age>19</Age>
</ROW>
<ROW MODID="2" RECORDID="2">
<Name>Steve</Name>
<Age>25</Age>
</ROW>
</FMPDSORESULT>
Regarding XPath
You can use DOMXPath for even more flexible traversing. However, in this specific problem the use of DOMXPath will not bring significant improvements, in my opinion. Anyway, I'll give examples for completeness.
Fetching the rows:
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('myns', 'http://www.filemaker.com');
$rows = [];
foreach ($xpath->query('//myns:ROW') as $row) {
$rows []= $row->parentNode->removeChild($row);
}
Appending the rows back into the document:
$root = $xpath->evaluate('/myns:FMPDSORESULT')[0];
foreach ($rows as $row) {
$root->appendChild($row);
}

Some SimpleXMLElement methods return arrays, but most return SimpleXMLElement objects, which implement Iterator. A var_dump() will only show part of the data in a simplified representation. However, it is an object structure, not a nested array.
If I understand you correctly you want to sort the ROW elements by the Name child. You can fetch them with the xpath() method, but you need to register a prefix for the namespace. It returns an array of SimpleXMLElement objects. The array can be sorted with usort.
$fResult = new SimpleXMLElement($xml);
$fResult->registerXpathNamespace('fm', 'http://www.filemaker.com');
$rows = $fResult->xpath('//fm:ROW');
usort(
$rows,
function(SimpleXMLElement $one, SimpleXMLElement $two) {
return strcasecmp($one->Name, $two->Name);
}
);
var_dump($rows);
In DOM that will not look much different, but DOMXpath::evaluate() returns a DOMNodeList. You can convert it into an array using iterator_to_array.
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('fm', 'http://www.filemaker.com');
$rows = iterator_to_array($xpath->evaluate('//fm:ROW'));
usort(
$rows,
function(DOMElement $one, DOMElement $two) use ($xpath) {
return strcasecmp(
$xpath->evaluate('normalize-space(fm:Name)', $one),
$xpath->evaluate('normalize-space(fm:Name)', $two)
);
}
);
var_dump($rows);
DOM has no magic methods to access children and values, but XPath can be used to fetch them. Because the elements are in the FileMaker namespace, the child has to be addressed as fm:Name; in XPath 1.0 an unprefixed Name only matches elements without a namespace. The XPath function string() converts the first node of a node list into a string and returns an empty string if the list is empty. normalize-space() does a little more: it also replaces each run of whitespace with a single space and strips whitespace from the start and end of the string.
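As a self-contained check of the DOM variant, the sketch below sorts a trimmed copy of the document and then moves the rows into sorted order. The inline XML and the final re-append loop are additions for illustration; the child is addressed with the registered fm prefix.

```php
<?php
$xml = '<FMPDSORESULT xmlns="http://www.filemaker.com">
  <ROW RECORDID="1"><Name>John</Name></ROW>
  <ROW RECORDID="2"><Name>Steve</Name></ROW>
  <ROW RECORDID="3"><Name>Adam</Name></ROW>
</FMPDSORESULT>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('fm', 'http://www.filemaker.com');

$rows = iterator_to_array($xpath->evaluate('//fm:ROW'));
usort($rows, function (DOMElement $one, DOMElement $two) use ($xpath) {
    return strcasecmp(
        $xpath->evaluate('normalize-space(fm:Name)', $one),
        $xpath->evaluate('normalize-space(fm:Name)', $two)
    );
});

$names = [];
foreach ($rows as $row) {
    // Appending a node that is already in the document moves it,
    // so the ROW elements end up in sorted order.
    $row->parentNode->appendChild($row);
    $names[] = $xpath->evaluate('string(fm:Name)', $row);
}
echo implode(', ', $names), "\n";
// prints Adam, John, Steve
```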

PHP parsing XML file with and without namespaces

I need to import an XML file into a database. That's not the problem: I can read it, parse it, and create objects to map to the DB. The problem is that the XML file sometimes uses namespace prefixes and sometimes doesn't. Furthermore, sometimes no namespace is defined at all.
What I first got was something like this:
<?xml version="1.0" encoding="UTF-8"?>
<struct xmlns:b="http://www.w3schools.com/test/">
<objects>
<object>
<node_1>value1</node_1>
<node_2>value2</node_2>
<node_3 iso_land="AFG"/>
<coords lat="12.00" long="13.00"/>
</object>
</objects>
</struct>
And the parsing:
$obj = new stdClass();
$nodes = array('node_1', 'node_2');
$t = $xml->xpath('/objects/object');
foreach($nodes AS $node) {
if($t[0]->$node) {
$obj->$node = (string) $t[0]->$node;
}
}
That's fine as long as there are no namespaces. Here is the XML file with namespaces:
<?xml version="1.0" encoding="UTF-8"?>
<b:struct xmlns:b="http://www.w3schools.com/test/">
<b:objects>
<b:object>
<b:node_1>value1</b:node_1>
<b:node_2>value2</b:node_2>
<b:node_3 iso_land="AFG"/>
<b:coords lat="12.00" long="13.00"/>
</b:object>
</b:objects>
</b:struct>
I now came up with something like this:
$xml = simplexml_load_file("test.xml");
$namespaces = $xml->getNamespaces(TRUE);
$ns = count($namespaces) ? 'a:' : '';
$xml->registerXPathNamespace("a", "http://www.w3schools.com/test/");
$nodes = array('node_1', 'node_2');
$obj = new stdClass();
foreach($nodes AS $node) {
$t = $xml->xpath('/'.$ns.'objects/'.$ns.'object/'.$ns.$node);
if($t[0]) {
$obj->$node = (string) $t[0];
}
}
$t = $xml->xpath('/'.$ns.'objects/'.$ns.'object/'.$ns.'node_3');
if($t[0]) {
$obj->iso_land = (string) $t[0]->attributes()->iso_land;
}
$t = $xml->xpath('/'.$ns.'objects/'.$ns.'object/'.$ns.'coords');
if($t[0]) {
$obj->lat = (string) $t[0]->attributes()->lat;
$obj->long = (string) $t[0]->attributes()->long;
}
That works with namespaces and without. But I feel that there must be a better way. Before that, I could do something like this:
$t = $xml->xpath('/'.$ns.'objects/'.$ns.'object');
foreach($nodes AS $node) {
if($t[0]->$node) {
$obj->$node = (string) $t[0]->$node;
}
}
But that just won't work with namespaces.

You could make 'http://www.w3schools.com/test/' the default namespace. That way a:objects would match regardless of whether the document says <b:objects> or <objects>.
If memory usage is not an issue, you can even do it with a textual replacement, e.g.
$data = '<?xml version="1.0" encoding="UTF-8"?>
<struct xmlns:b="http://www.w3schools.com/test/">
<objects>
<object>
<node_1>value1</node_1>
<node_2>value2</node_2>
<node_3 iso_land="AFG"/>
<coords lat="12.00" long="13.00"/>
</object>
</objects>
</struct>';
$data = str_replace( // or preg_replace(,,,1) if you want to limit it to only one replacement
'xmlns:b="http://www.w3schools.com/test/"',
'xmlns="http://www.w3schools.com/test/" xmlns:b="http://www.w3schools.com/test/"',
$data
);
$xml = new SimpleXMLElement($data);
$xml->registerXPathNamespace("a", "http://www.w3schools.com/test/");
foreach($xml->xpath('//a:objects/a:object') as $n) {
echo $n->node_1;
}

You can make your XPath statements more generic by matching on any element (*) and using a predicate filter on local-name(), which matches the element name with or without a namespace.
An XPath like this:
/*[local-name()='struct']/*[local-name()='objects']/*[local-name()='object']/*[local-name()='coords']
Applied to the code sample you were using:
$obj = new stdClass();
$nodes = array('node_1', 'node_2');
$base = '/*[local-name()="struct"]/*[local-name()="objects"]/*[local-name()="object"]';
foreach($nodes AS $node) {
// select the child by local name too: property access ($t[0]->$node) does not see prefixed children
$t = $xml->xpath($base . '/*[local-name()="' . $node . '"]');
if($t) {
$obj->$node = (string) $t[0];
}
}
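To check that the local-name() approach treats both kinds of document alike, here is a small self-contained sketch. The helper name readObject and the shortened sample documents are assumptions made for the example.

```php
<?php
// Hypothetical helper: reads child values by local name, so prefixes do not matter.
function readObject(string $data, array $names): stdClass
{
    $xml = new SimpleXMLElement($data);
    $base = "/*[local-name()='struct']/*[local-name()='objects']/*[local-name()='object']";
    $obj = new stdClass();
    foreach ($names as $name) {
        $hits = $xml->xpath($base . "/*[local-name()='$name']");
        if ($hits) {
            $obj->$name = (string) $hits[0];
        }
    }
    return $obj;
}

// Same data, once without and once with the b: prefix.
$plain = '<struct><objects><object>'
    . '<node_1>value1</node_1><node_2>value2</node_2>'
    . '</object></objects></struct>';
$prefixed = '<b:struct xmlns:b="http://www.w3schools.com/test/"><b:objects><b:object>'
    . '<b:node_1>value1</b:node_1><b:node_2>value2</b:node_2>'
    . '</b:object></b:objects></b:struct>';

$fromPlain = readObject($plain, ['node_1', 'node_2']);
$fromPrefixed = readObject($prefixed, ['node_1', 'node_2']);
var_dump($fromPlain == $fromPrefixed);
// prints bool(true)
```

The trade-off is that local-name() ignores the namespace entirely, so elements with the same local name from an unrelated namespace would match as well.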

Take a look at This
http://blog.sherifmansour.com/?p=302
It helped me a lot.
