Use XPath with PHP's SimpleXML to find nodes containing a String - php

I am trying to use SimpleXML in combination with XPath to find nodes that contain a certain string.
<?php
$xhtml = <<<EOC
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Test</title>
</head>
<body>
<p>Find me!</p>
<p>
<br />
Find me!
<br />
</p>
</body>
</html>
EOC;
$xml = simplexml_load_string($xhtml);
$xml->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodes = $xml->xpath("//*[contains(text(), 'Find me')]");
echo count($nodes);
Expected output: 2
Actual output: 1
When I change the xhtml of the second paragraph to
<p>
Find me!
<br />
</p>
then it works as expected. What does my XPath expression have to look like to match all nodes containing 'Find me', no matter where they are?
Using PHP's DOM-XML is an option, but not desired.
Thanks in advance!

It depends on what you want to do. You could select all the <p/> elements that contain "Find me" in any of their descendants with
//xhtml:p[contains(., 'Find me')]
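Because the element name is specified, this returns only <p/> elements. If you don't specify the kind of nodes (e.g. //*[contains(., 'Find me')]), it will return <body/> and <html/> as well, since their text content also contains the string.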
Or perhaps you want any node which has a child (not a descendant) text node that contains "Find me"
//*[text()[contains(., 'Find me')]]
This one will not return <html/> or <body/>.
I forgot to mention that . represents the whole text content of a node, while text() retrieves [a nodeset of] text nodes. The problem with your expression contains(text(), 'Find me') is that contains() only works on strings, not nodesets, so it converts text() to the string value of its first node. In your second paragraph the first text node is just the whitespace before the first <br/>, which is why removing that <br/> makes it work.
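A minimal sketch of the difference between the two predicates, assuming a simplified document without the XHTML namespace (the <root> wrapper and the variable names are illustrative and not taken from the question):

```php
<?php
// Second <p> starts with a whitespace-only text node, like in the question.
$xml = simplexml_load_string(
    "<root><p>Find me!</p><p>\n<br/>Find me!<br/></p></root>"
);

// contains() receives a nodeset and only looks at the first text node.
$firstOnly = $xml->xpath("//*[contains(text(), 'Find me')]");

// The predicate is evaluated for every child text node individually.
$anyTextNode = $xml->xpath("//*[text()[contains(., 'Find me')]]");

echo count($firstOnly), "\n";   // 1
echo count($anyTextNode), "\n"; // 2
```

The second expression tests each child text node on its own, so the leading whitespace node no longer hides the match.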

Err, umm? But thanks @Jordy for the quick answer.
First, that's DOM-XML, which is not desired, since everything else in my script is done with SimpleXML.
Second, why do you translate to uppercase and then search for the unchanged string 'Find me'? Searching for 'FIND ME' would actually give a result.
But you pointed me towards the right direction:
$nodes = $xml->xpath("//text()[contains(., 'Find me')]");
does the trick!

I was looking for a way to find whether a node with exact value "Find Me" exists and this seemed to work.
$node = $xml->xpath("//text()[.='Find Me']");

$doc = new DOMDocument();
$doc->loadHTML($xhtml);
$xPath = new DOMXpath($doc);
$xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 'Find me')]";
$elements = $xPath->query($xPathQuery);
if ($elements->length > 0) {
    foreach ($elements as $element) {
        print "Found: " . $element->nodeValue . "<br />";
    }
}

Related

Match tags inside tag

I want to modify:
<ins><br/> <b>bold</b> <br/><br/> <br/> <br/></ins> <br/> <ins> <br/> </ins>
to:
<ins><br/>NL: <b>bold</b> <br/>NL:<br/>NL: <br/>NL: <br/>NL:</ins> <br/> <ins> <br/>NL: </ins>
(inside every <ins> and </ins> tag find and change <br/> to <br/>NL:. Ignore <br/> outside <ins>. Also, <ins> might contain various other tags)
To do this, I have this piece of code:
$string= preg_replace('~(?:<ins>|(?!^)\G)(.*?)<br\/>~', '$0NL:', $string);
https://regex101.com/r/xI8mW9/4
It would work just fine, but the problem is that matching doesn't end after the </ins> tag: it modifies every <br/> after the first <ins>. How do I replace <br/> with <br/>NL: only within <ins> and </ins> tags?
I have also tried pattern:
~(<ins>.*?)(?<my_br><br/>)(?!NL:)(.*?</ins>)~
https://regex101.com/r/xI8mW9/15
(in this case each match is replaced with $1$2NL:$3). Problem: for <ins><br/></ins><br/><ins><br/></ins> the middle <br/> is affected as well.
Tried doing it with DOMDocument as suggested in comment:
$rendered_diff = "Some<ins>a<br/></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$items = $doc->getElementsByTagName('ins');
for ($i = 0; $i < $items->length; $i++) {
foreach ($items->item($i)->childNodes as $node) {
if ($node->nodeName == 'br') {
$node->appendData('NL:');
}
}
}
$doc->saveHTML();
dd($rendered_diff);
Got an error:
ERROR: Call to undefined method DOMElement::appendData()
Have no idea why this approach is bad.
You can try the following code:
<?php
$rendered_diff = "<br/>Some<ins>a<br/><div>blablaa</div></ins><br/><ins>b<br/></ins>text";
$doc = new \DOMDocument();
$doc->loadHTML($rendered_diff);
$xpath = new \DOMXPath($doc);
// select only <br> elements that are direct children of <ins>
// (use //ins//br to also match <br> nested deeper inside <ins>)
foreach ($xpath->query("//ins/br") as $br) {
    // insert the marker text right after the <br>
    $text = $doc->createTextNode('NS:');
    $br->parentNode->insertBefore($text, $br->nextSibling);
}
echo $doc->saveXML();
It outputs the following:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><br/>Some<ins>a<br/>NS:<div>blablaa</div></ins><br/><ins>b<br/>NS:</ins>text</body></html>
Which seems to solve the problem.
Note that I slightly modified your initial XML to test your "Ignore <br/> outside <ins>" condition; see the first <br/>.
Answering your question, "Have no idea why this approach is bad":
Your approach fails because <br/> is a DOMElement, and DOMElement has no appendData() method (it is defined on DOMCharacterData, i.e. on text nodes), hence the error. Compare it with the code I placed above: doesn't the latter look cleaner? Moreover, it uses XPath, so you can create more complicated queries to match certain elements, not only <br>'s inside <ins>.
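A minimal sketch along the same lines that uses the NL: marker from the question and also reaches <br/> elements nested inside other tags within <ins>. The sample input and the fragment-only output loop are assumptions made for illustration:

```php
<?php
// Hypothetical input: one <br/> outside <ins>, one nested inside <b>.
$html = '<br/>Some<ins>a<br/><b>bold<br/></b></ins><br/>text';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// //ins//br matches <br> at any depth below an <ins>.
foreach ($xpath->query('//ins//br') as $br) {
    // A null nextSibling makes insertBefore() append at the end.
    $br->parentNode->insertBefore($doc->createTextNode('NL:'), $br->nextSibling);
}

// Serialize only the children of <body>, without the html/body wrapper.
$result = '';
$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result;
```

Only the two <br/> elements inside <ins> get the marker; the ones outside are left untouched.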

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bold text, etc. So I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.
Thanks!
Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
    foreach ($p_elements as $p_element)
    {
        // get p element nodes
        $nodes = $p_element->childNodes;
        // check for "a" nodes in these nodes
        foreach ($nodes as $node) {
            // found an a node - check must be defined better!
            if (strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');
                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
                // stop after the first match so the paragraph is wrapped only once
                break;
            }
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
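A minimal sketch of the limit parameter, assuming a simplified pattern and a made-up input string (regular expressions remain fragile on real-world HTML, so treat this only as an illustration of the parameter):

```php
<?php
// Hypothetical input with three paragraphs.
$html = '<p>one</p><p>two</p><p>three</p>';

// The fourth argument (1) stops after the first replacement.
$out = preg_replace(
    '~<p>(.*?)</p>~',
    '<span class="highlighter"><p>$1</p></span>',
    $html,
    1
);

echo $out;
// <span class="highlighter"><p>one</p></span><p>two</p><p>three</p>
```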
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);

PHP DomDocument XPATH does not match to the HTML real structure

I'm trying to validate the following HTML code (please note the text content inside the IMG tag, which is structurally correct as markup, but invalid as HTML):
<html>
<head>
</head>
<body>
<img src="./">
Some Text
</img>
</body>
</html>
Using PHP and DomDocument, I try to read entire tree with XPATH:
$dom = new DOMDocument();
$dom->validateOnParse = 0;
$dom->loadHTML($htmlSource);
$xpath = new DOMXPath($dom);
$allNodes = $xpath->query("//node()");
The result I get:
/html
/html/head
/html/body
/html/body/#text[1]
/html/body/img
/html/body/#text[2]
which obviously does not match the exact HTML structure.
What I expected to see is
....
/html/body/img/#text
....
Why does XPATH interpret the tree this way?
How can I get it to work as I expected?

PHP: html tidy repair string: making it not encase everything in <html>

Using the following code:
$tidy = new tidy();
$clean = $tidy->repairString("<p>Hello</p>");
This encases the string in the whole shenanigans:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<p>Hello</p>
</body>
</html>
Since I'm using it on a "description" field that contains some HTML tags from time to time, I just want to use it to fix anomalies in the string, for example unclosed elements or elements that are closed but never opened, not to encase it in a full HTML document like this.
If the string doesn't contain any HTML at all, it should just return the input. And if it contains HTML like the example above, it should fix whatever there is to fix (which is nothing in this example) and not encase it in a full document.
Does anyone know how to make HTML Tidy not encase it like this?
I was struggling with the same problem, but found the answer in the tidy documentation. If you add 'show-body-only' => true, it outputs only the contents of the body, without the doctype and the <html>/<head> wrapper.
$tidy = new tidy();
$input = "<p>A paragraph with <b>bold<b> text";
$clean = $tidy->repairString($input,array('show-body-only' => true));
echo $clean;
will show:
<p>A paragraph with <b>bold</b> text</p>

find if a img have "alt", if not then add from array ( serverside )

First I need to find all img elements in the site, and then check whether each img has an "alt" attribute. If the image has the attribute, it is skipped; if it has none, or the alt is empty, a string is randomly added to the img from a list or array.
Here is how you do it with JavaScript:
find if a img have alt in jquery if not then add from array
but that did not help me, because according to this:
How do search engines crawl Javascript?
search bots can't read it; if you use JavaScript, you need a server-side language to add the keyword to the img alt.
What next? PHP? Can I do it with simple code?
Well, import it into a DOMDocument object and find all images inside.
Seems rather trivial. See the DOMDocument class
Here's my code for the problem:
<?php
$html = <<<HTML
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<p>
<img src="test.png">
<img src="test.jpg" alt="Testing">
<img src="test.gif">
</p>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$images = $dom->getElementsByTagName("img");
foreach ($images as $image) {
    // only touch images that have no alt attribute at all
    if (!$image->hasAttribute("alt")) {
        $image->setAttribute("alt", "Ready Value!");
    }
}
echo $dom->saveHTML();
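The question also asks to cover empty alt attributes and to pick the text randomly from an array. A minimal sketch of that variant, assuming a hypothetical $altTexts list and a shortened sample document:

```php
<?php
// Hypothetical sample markup: missing alt, filled alt, empty alt.
$html = '<p><img src="a.png"><img src="b.jpg" alt="Testing"><img src="c.gif" alt=""></p>';

// Hypothetical list of replacement texts.
$altTexts = ['Keyword one', 'Keyword two', 'Keyword three'];

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('img') as $image) {
    // getAttribute() returns '' when the attribute is missing,
    // so this covers both "no alt" and "empty alt".
    if (trim($image->getAttribute('alt')) === '') {
        $image->setAttribute('alt', $altTexts[array_rand($altTexts)]);
    }
}

echo $dom->saveHTML();
```

Images that already carry a non-empty alt, like the second one here, are left as they are.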
