Making xPath into a variable - php

How do I turn the output into a variable so I can cross-reference it and see whether it matches another variable I have set?
foreach ($nodes as $i => $node) {
echo $node->nodeValue;
}
I know this is incorrect and wouldn't work but:
foreach ($nodes as $i => $node) {
$target = $node->nodeValue;
}
$match = "some text";
if($target == $match) {
// Match - Do Something
} else {
// No Match - Do Nothing
}
Actually this solves my question but maybe not the right way about it:
libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTMLFile("http://www.example.com");
$xpath = new DomXPath($dom);
$nodes = $xpath->query("(//tr/td/a/span[@class='newprodtext' and contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'adidas')])[1]");
foreach ($nodes as $i => $node) {
echo $node->nodeValue, "\n";
$target[0] = $node->nodeValue;
}
$match = "adidas";
if($target == $match) {
// Match
} else {
// No Match
}

Your problem has more to do with loops, array assignment, and if conditionals in PHP than with XPath.
In your foreach loop, you assign each $node's nodeValue to the same index of the $target array, so $target only ever holds one value (the last one).
In your if statement, you compare an array (or null if $nodes is empty, so you probably want to declare $target first) against the string 'adidas'; that comparison will never be true.
You probably want to do something like:
$matched = false;
$match = 'adidas';
foreach ($nodes as $i => $node) {
$nodeValue = trim(strip_tags($node->textContent));
if ($nodeValue === $match) {
$matched = true;
break;
}
}
if ($matched) {
// Match
} else {
// No Match
}
Update
I see that this XPath expression was given to you in another answer and presumably already does the matching, so you just need to check the length property of $nodes:
if ($nodes->length > 0) {
// Match
} else {
// No match
}
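If you do want the values in a variable, a minimal sketch could collect them into an array and test each one. The inline markup, the simplified XPath expression, and the $targets name are assumptions made for illustration; they stand in for the remote page and the longer query from the question.

```php
<?php
// Hypothetical inline markup standing in for the remote page.
$html = '<table><tr><td><a><span class="newprodtext">Adidas Superstar</span></a></td></tr>'
      . '<tr><td><a><span class="newprodtext">Other brand</span></a></td></tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Append every candidate value instead of overwriting a single slot.
$targets = array();
foreach ($xpath->query("//span[@class='newprodtext']") as $node) {
    $targets[] = trim($node->textContent);
}

// Case-insensitive substring test against each collected value.
$match = 'adidas';
$matched = false;
foreach ($targets as $target) {
    if (stripos($target, $match) !== false) {
        $matched = true;
        break;
    }
}

echo $matched ? "Match\n" : "No match\n";
```

Using $targets[] appends a new element on every iteration, which avoids the "last value only" problem described above.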

Related

set tags in html using domdocument and preg_replace_callback

I am trying to replace words that appear in my terminology dictionary with an (HTML) anchor so that each gets a tooltip. The replacement itself works, but I can't get the result back into the DOMDocument object.
I've written a recursive function that walks the DOM: it visits every child node, searches for words from my dictionary, and replaces them with an anchor.
I had been doing this with an ordinary preg_match on the HTML string, but that runs into problems when the HTML gets complex.
The recursive function:
$terms = array(
'example'=>'explanation about example'
);
function iterate_html($doc, $original_doc = null)
{
global $terms;
if(is_null($original_doc)) {
self::iterate_html($doc, $doc);
}
foreach($doc->childNodes as $childnode)
{
$children = $childnode->childNodes;
if($children) {
self::iterate_html($childnode);
} else {
$regexes = '~\b' . implode('\b|\b',array_keys($terms)) . '\b~i';
$new_nodevalue = preg_replace_callback($regexes, function($matches) {
$doc = new DOMDocument();
$anchor = $doc->createElement('a', $matches[0]);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($matches[0])]);
return $doc->saveXML($anchor);
}, $childnode->nodeValue);
$dom = new DOMDocument();
$template = $dom->createDocumentFragment();
$template->appendXML($new_nodevalue);
$original_doc->importNode($template->childNodes, true);
$childnode->parentNode->replaceChild($template, $childnode);
}
}
}
echo iterate_html('this is just some example text.');
I expect the result to be:
this is just some <a class="text-info" data-toggle="tooltip" data-original-title="explanation about example">example</a> text
I don't think building a recursive function to walk the DOM is useful when you can use an XPath query. I'm also not sure that preg_replace_callback is well suited to this case; I prefer preg_split. Here is an example:
$html = 'this is just some example text.';
$terms = array(
'example'=>'explanation about example'
);
// sort by reverse order of key size
// (to be sure that the longest string always wins instead of the first in the pattern)
uksort($terms, function ($a, $b) {
$diff = mb_strlen($b) - mb_strlen($a);
return ($diff) ? $diff : strcmp($a, $b);
});
// build the pattern inside a capture group (to have delimiters in the results with the PREG_SPLIT_DELIM_CAPTURE option)
$pattern = '~\b(' . implode('|', array_map(function($i) { return preg_quote($i, '~'); }, array_keys($terms))) . ')\b~i';
// prevent any HTML parsing errors from being displayed
$libxmlInternalErrors = libxml_use_internal_errors(true);
// determine whether the HTML string already has a root html element; if not, add a fake root.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$fakeRootElement = false;
if ( $dom->documentElement->nodeName !== 'html' ) {
$dom->loadHTML("<div>$html</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$fakeRootElement = true;
}
libxml_use_internal_errors($libxmlInternalErrors);
// find all text nodes (not already included in a link or between other unwanted tags)
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::a)][not(ancestor::style)][not(ancestor::script)]');
// replacement
foreach ($textNodes as $textNode) {
$parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
$fragment = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k&1) {
$anchor = $dom->createElement('a', $part);
$anchor->setAttribute('class', 'text-info');
$anchor->setAttribute('data-toggle', 'tooltip');
$anchor->setAttribute('data-original-title', $terms[strtolower($part)]);
$fragment->appendChild($anchor);
} else {
$fragment->appendChild($dom->createTextNode($part));
}
}
$textNode->parentNode->replaceChild($fragment, $textNode);
}
// building of the result string
$result = '';
if ( $fakeRootElement ) {
foreach ($dom->documentElement->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
} else {
$result = $dom->saveHTML();
}
echo $result;
Feel free to put that into one or more functions/methods, but keep in mind that this kind of editing has a non-negligible cost and should be run each time the HTML is edited (not each time the HTML is displayed).
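To see why the terms are sorted longest-first before the pattern is built, here is a small sketch. The second dictionary entry ('example text') and the buildPattern() helper are hypothetical and exist only for this illustration; plain strlen() is used because the keys are ASCII.

```php
<?php
// Hypothetical dictionary in which one term is a prefix of another.
$terms = array(
    'example'      => 'explanation about example',
    'example text' => 'explanation about example text',
);

// Hypothetical helper: wraps the quoted keys in one capture group.
function buildPattern(array $keys) {
    $quoted = array_map(function ($k) { return preg_quote($k, '~'); }, $keys);
    return '~\b(' . implode('|', $quoted) . ')\b~i';
}

$text = 'this is just some example text.';

// Unsorted: "example" is tried first in the alternation, so it wins.
$unsorted = preg_split(buildPattern(array_keys($terms)), $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Sorted longest-first: "example text" is tried first, so it wins.
$keys = array_keys($terms);
usort($keys, function ($a, $b) { return strlen($b) - strlen($a); });
$sorted = preg_split(buildPattern($keys), $text, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($unsorted);
print_r($sorted);
```

PCRE tries alternatives from left to right and keeps the first one that succeeds, so the order of keys in the pattern decides which term is linked when terms overlap.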

Parse html with regexp

I want to find all <h3> blocks in this example:
<h3>sdf</h3>
sdfsdf
<h3>sdf</h3>
32
<h2>fs</h2>
<h3>23sd</h3>
234
<h1>h1</h1>
(From one h3 to the next h3 or h2.) This regexp finds only the first h3 block:
~\<h3[^>]*\>[^>]+\<\/h3\>.+(?:\<h3|\<h2|\<h1)~is
I use the PHP function preg_match_all. (Quote from the docs: After the first match is found, the subsequent searches are continued on from end of the last match.)
What do I have to modify in my regexp?
ps
<h3>1</h3>
1content
<h3>2</h3>
2content
<h2>h2</h2>
<h3>3</h3>
3content
<h1>h1</h1>
this content has to be parsed as:
[0] => <h3>1</h3>1content
[1] => <h3>2</h3>2content
[2] => <h3>3</h3>3content
with DOMDocument:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
$flag = false;
$results = array();
foreach ($nodes as $node) {
if ( $node->nodeType == XML_ELEMENT_NODE &&
preg_match('~^h(?:[12]|(3))$~i', $node->nodeName, $m) ):
if ($flag)
$results[] = $tmp;
if (isset($m[1])) {
$tmp = $dom->saveXML($node);
$flag = true;
} else
$flag = false;
elseif ($flag):
$tmp .= $dom->saveXML($node);
endif;
}
echo htmlspecialchars(print_r($results, true));
with regex:
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);
echo htmlspecialchars(print_r($matches[0], true));
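One thing to watch with the lookahead version: it requires another heading after each block, so a final h3 block at the very end of the input is skipped. A possible variation, assuming you want that last block too, adds \z (end of string) to the lookahead. The sample input here is made up for the illustration.

```php
<?php
// Made-up input whose last <h3> block is NOT followed by another heading.
$html = "<h3>1</h3>\n1content\n<h2>h2</h2>\n<h3>2</h3>\n2content";

// Lookahead on headings only: the trailing block has nothing to stop at.
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $strict);

// Adding \z lets the final block end at the end of the string.
preg_match_all('~<h3.*?(?=<h[123]|\z)~si', $html, $relaxed);

print_r($strict[0]);
print_r($relaxed[0]);
```

The same caveat applies to the DOMDocument loop above, which only stores a block when it meets the next heading.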
You shouldn't use Regex to parse HTML if there is any nesting involved.
Regex
(<(h\d)>.*?<\/\2>)[\r\n]([^\r\n<]+)
Replacement
\1\3
or
$1$3
http://regex101.com/r/uQ3uC2
preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);

How to remove invalid element from DOM?

We have the following code that lists the xpaths where $value is found.
For a given URL we have detected a non-standard tag, td1, which in addition has no closing tag. The site developers probably put it there intentionally.
This element causes problems when identifying the correct XPath for nodes.
A broken Xpath example :
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/**td1**/td[2]/span/u[1]
(as you see td1 is identified and chained in the Xpath)
We think removing this element would help us build the valid XPath we are after.
A valid example is
/html/body/div[2]/div[2]/table/tr[2]/td/table/tr[1]/td[2]/table/tr[2]/td[2]/table[3]/tr[2]/td[2]/span/u[1]
How can we remove it prior to loading into DOMXPath? Is there some other approach?
We would like to remove all invalid tags, which may be others besides td1, such as h8, diw, etc.
private function extract($url, $value) {
$dom = new DOMDocument();
$file = 'content.txt';
//$current = file_get_contents($url);
$current = CurlTool::downloadFile($url, $file);
//file_put_contents($file, $current);
@$dom->loadHTMLFile($current);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
var_dump($elements);
if (!is_null($elements)) {
foreach ($elements as $element) {
var_dump($element);
echo "\n1.[" . $element->nodeName . "]\n";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
if( ($node->nodeValue != null) && ($node->nodeValue === $value) ) {
echo '2.' . $node->nodeValue . "\n";
$xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
echo '3.' . $xpath . "\n";
}
}
}
}
}
You could use XPath to find the offending nodes and remove them, while promoting their children into their place in the DOM. Then your paths will be correct.
$dom_xpath = new DOMXpath($dom);
$results = $dom_xpath->query('//td1'); // (or any offending element)
foreach ($results as $invalidNode)
{
$parentNode = $invalidNode->parentNode;
while ($invalidNode->firstChild) // childNodes is an object and always truthy, so test firstChild instead
{
$firstChild = $invalidNode->firstChild;
$parentNode->insertBefore($firstChild,$invalidNode);
}
$parentNode->removeChild($invalidNode);
}
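As a self-contained illustration of that unwrapping step, the sketch below loads a small, hypothetical, well-formed snippet as XML (so the parser does not rearrange anything) and compares a node's path before and after the td1 wrapper is removed. Real pages loaded with loadHTMLFile may be restructured by the HTML parser, so treat this only as a demonstration of the technique.

```php
<?php
// Hypothetical markup with a bogus <td1> wrapper around two cells.
$dom = new DOMDocument();
$dom->loadXML('<table><tr><td1><td>a</td><td><span>b</span></td></td1></tr></table>');
$xp = new DOMXPath($dom);

$span = $xp->query('//span')->item(0);
$before = $span->getNodePath();

foreach ($xp->query('//td1') as $invalidNode) {
    $parent = $invalidNode->parentNode;
    // Move the children up one level, preserving their order.
    while ($invalidNode->firstChild) {
        $parent->insertBefore($invalidNode->firstChild, $invalidNode);
    }
    // The wrapper is now empty and can be dropped.
    $parent->removeChild($invalidNode);
}

$after = $span->getNodePath();
echo $before, "\n", $after, "\n";
```

insertBefore() moves a node that is already in the document, so each pass of the while loop takes one child out of the wrapper until none remain.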
EDIT:
You could also build a list of offending elements by using a list of valid elements and negating it.
// Build list manually from the HTML spec:
// See: http://www.w3.org/TR/html5/section-index.html#elements-1
$validTags = array();
// Convert list to XPath:
$validTagsStr = '';
foreach ($validTags as $tag)
{
if ($validTagsStr)
{ $validTagsStr .= ' or '; }
$validTagsStr .= 'self::'.$tag;
}
$results = $dom_xpath->query('//*[not('.$validTagsStr.')]');
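The same expression can be assembled with implode(). The sketch below uses a deliberately tiny whitelist and a hypothetical XML snippet; a real list would have to be filled in from the HTML spec as noted above.

```php
<?php
// Deliberately tiny whitelist, for illustration only.
$validTags = array('table', 'tr', 'td', 'span');

// Turn each tag into a self:: test and join the tests with "or".
$tests = array_map(function ($tag) { return 'self::' . $tag; }, $validTags);
$expr = '//*[not(' . implode(' or ', $tests) . ')]';

// Hypothetical document containing one element that is not on the list.
$dom = new DOMDocument();
$dom->loadXML('<table><tr><td1><td><span>b</span></td></td1></tr></table>');
$xp = new DOMXPath($dom);

$offending = array();
foreach ($xp->query($expr) as $node) {
    $offending[] = $node->nodeName;
}
print_r($offending);
```

Note that an empty $validTags array would produce not(), which is not a valid XPath expression, so the list must contain at least one tag.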
Sooo... perhaps str_replace("<td1 va-laign=\"top\">", "", $current) could do the trick?

Preventing DOMDocument::loadHTML() from converting entities

I have a string value from which I'm trying to extract list items. I'd like to extract the text and any subnodes; however, DOMDocument converts the entities to their characters instead of leaving them in their original state.
I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. It should be noted I'm running on Win7 with PHP 5.2.17.
Example code is:
$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li></ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($example);
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
$value = trim(_get_inner_html($domNodeList->item($idx)));
/* remainder of processing and storing in database */
echo 'Saved '.$value.PHP_EOL;
}
function _get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
&frac12; ends up getting converted to ½ (single character / UTF-8 version, not entity version), which is not the desired format.
Solution for PHP versions older than 5.3.6:
$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;
$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}
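A possible variation, assuming the text may contain characters outside ISO-8859-1: DOMDocument returns node values as UTF-8, and htmlentities() accepts an encoding argument, so the iconv() round trip can be skipped. The sample string is the one from the question, with the entity written as &frac12;.

```php
<?php
$html = '<ul><li>text</li><li>&frac12; of this is <strong>strong</strong></li></ul>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$values = array();
foreach ($doc->getElementsByTagName('li') as $node) {
    // nodeValue is UTF-8; tell htmlentities() so instead of converting first.
    $values[] = htmlentities($node->nodeValue, ENT_QUOTES, 'UTF-8');
}
print_r($values);
```

Like the loop above, this flattens each list item to text, so the strong tags are lost; it only addresses the entity conversion.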
Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.
It's probably not the best answer, since it makes some assumptions about the elements (such as no attributes). But since my particular needs don't require attributes to be carried across (yet.. I'm sure my sample data will throw that one at me later on), this solution works for me.
$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';
echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;
$doc->loadHTML($example);
$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;
for ($idx = 0; $idx < $count; $idx++) {
$value = trim(_get_inner_html($domNodeList->item($idx)));
/* remainder of processing and storing in database */
echo 'Saved '.$value.PHP_EOL;
}
function _get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
echo 'Node type is '.$child->nodeType.PHP_EOL;
switch ($child->nodeType) {
case 3:
$innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue));
break;
default:
echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
echo 'Node name '.$child->nodeName.PHP_EOL;
$innerHTML .= '<'.$child->nodeName.'>';
$innerHTML .= _get_inner_html( $child );
$innerHTML .= '</'.$child->nodeName.'>';
break;
}
}
return $innerHTML;
}
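There is no need to iterate over the child nodes:
function innerHTML($node)
{
$html = $node->ownerDocument->saveXML($node);
// strip the node's own opening and closing tags, keeping only its inner markup
return preg_replace("%^<{$node->nodeName}[^>]*>|</{$node->nodeName}>$%", '', $html);
}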

How to replace text in HTML

From this question: What regex pattern do I need for this? I've been using the following code:
function process($node, $replaceRules) {
if($node->hasChildNodes()) {
foreach ($node->childNodes as $childNode) {
if ($childNode instanceof DOMText) {
$text = preg_replace(
array_keys($replaceRules),
array_values($replaceRules),
$childNode->wholeText
);
$node->replaceChild(new DOMText($text),$childNode);
} else {
process($childNode, $replaceRules);
}
}
}
}
$replaceRules = array(
'/\b(c|C)olor\b/' => '$1olour',
'/\b(kilom|Kilom|M|m)eter/' => '$1etre',
);
$htmlString = "<p><span style='color:red'>The color of the sky is: gray</p>";
$doc = new DOMDocument();
$doc->loadHtml($htmlString);
process($doc, $replaceRules);
$string = $doc->saveHTML();
echo mb_substr($string,119,-15);
It works fine, but it fails (because the child node is replaced at the first instance) if the HTML contains both text and child elements. So it works on
<div>The distance is four kilometers</div>
but not
<div>The distance is four kilometers<br>1000 meters to a kilometer</div>
or
<div>The distance is four kilometers<div class="guide">1000 meters to a kilometer</div></div>
Any ideas of a method that would work on such examples?
Calling $node->replaceChild will confuse the $node->childNodes iterator. You can get the child nodes first, and then process them:
function process($node, $replaceRules) {
if($node->hasChildNodes()) {
$nodes = array();
foreach ($node->childNodes as $childNode) {
$nodes[] = $childNode;
}
foreach ($nodes as $childNode) {
if ($childNode instanceof DOMText) {
$text = preg_replace(
array_keys($replaceRules),
array_values($replaceRules),
$childNode->wholeText);
$node->replaceChild(new DOMText($text),$childNode);
}
else {
process($childNode, $replaceRules);
}
}
}
}
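Another way to sidestep the iterator problem, sketched here as an alternative rather than a drop-in replacement, is to select all text nodes with XPath and edit their data in place, so no node is ever replaced. The input string is one of the failing examples from the question.

```php
<?php
$replaceRules = array(
    '/\b(c|C)olor\b/'           => '$1olour',
    '/\b(kilom|Kilom|M|m)eter/' => '$1etre',
);

$doc = new DOMDocument();
$doc->loadHTML('<div>The distance is four kilometers<br>1000 meters to a kilometer</div>');

// //text() selects every text node in the document, at any depth.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//text()') as $textNode) {
    // Rewrite the text in place; the node itself stays where it is.
    $textNode->data = preg_replace(
        array_keys($replaceRules),
        array_values($replaceRules),
        $textNode->data
    );
}

$result = $doc->saveHTML();
echo $result;
```

Because the tree structure is never changed, there is no need for recursion or for copying the child list first.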
