XPath contains() Search for Exact Match - php

Is it possible to search a DOMDocument object with fn:contains and return true on only an exact match for a word?
I have a text replacement snippet that I did not write myself that does internal link replacements for keywords. But as written it also replaces partial words instead of only the full word.
Here is the snippet:
$autolinks = $this->config->get('autolinks');
if (isset($autolinks) && (strpos($this->data['description'], 'iframe') == false)
&& (strpos($this->data['description'], 'object') == false)):
$xdescription = mb_convert_encoding(html_entity_decode($this->data['description'], ENT_COMPAT, "UTF-8"), 'HTML-ENTITIES', "UTF-8");
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$xdescription.'</div>');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($autolinks as $autolink):
$keyword = $autolink['keyword'];
$xlink = mb_convert_encoding(html_entity_decode($autolink['link'], ENT_COMPAT, "UTF-8"), 'HTML-ENTITIES', "UTF-8");
$target = $autolink['target'];
$tooltip = isset($autolink['tooltip']);
$pTexts = $xpath->query(
sprintf('///text()[contains(., "%s")]', $keyword)
);
foreach ($pTexts as $pText):
$this->parseText($pText, $keyword, $dom, $xlink, $target, $tooltip);
endforeach;
endforeach;
$this->data['description'] = $dom->saveXML($dom->documentElement);
endif;
In example:
If my keyword is "massage" *massage*r is partially matched and converted to a link, when only the whole word massage should be converted, not massager.

You should use fn:matches instead of fn:contains. This allows you to do matching with regular expressions. Then you can include word boundaries with \b.
sprintf('///text()[matches(., "\b%s\b")]', $keyword)
Note that this does not affect whatever your function parseText is doing. So while <Tagname>This is a sentence containing the word massager.</Tagname> will be unaffected, I make no guarantee what will happen to <Tagname>The massager give the customer a massage.</Tagname>. To make sure that this is handled properly your parsetext function will need to be modified. Possibly in a similar manner as above.
Note also that the modifications you might need to make to parsetext means that the above change becomes unecessary.

Text manipulation in XSLT 1.0 is very limited, but if you can't move to 2.0 (why not?) then translate() often comes to the rescue. Use translate() to replace all common punctuation characters by spaces, use concat() to add a space fore and aft, and then test for contains(' massage ') (note the spaces).

When matches(), ends-with() are not supported, you can use starts-with() and string-length() to get around.
Example:
[starts-with(.,'$var') and string-length(.)=string-length('$var')]
This is equivalent to matches().

This actually turned out to be incredibly simple, I just added a space onto the end of the $keyword variable so now it only returns true when the entire word is found.
foreach ($autolinks as $autolink):
$keyword = trim($autolink['keyword']) . ' ';
$xlink = mb_convert_encoding(html_entity_decode($autolink['link'], ENT_COMPAT, "UTF-8"), 'HTML-ENTITIES', "UTF-8");
$target = $autolink['target'];
$tooltip = isset($autolink['tooltip']);
$pTexts = $xpath->query(
sprintf('///text()[contains(., "%s")]', $keyword)
);
foreach ($pTexts as $pText):
$this->parseText($pText, $keyword, $dom, $xlink, $target, $tooltip);
endforeach;
endforeach;
thank you to everyone who tried to help.

Related

Validation like Strip_tags for HTML Attributes [duplicate]

I have this html code:
<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>
How can I remove attributes from all tags? I'd like it to look like this:
<p>
<strong>hello</strong>
</p>
Adapted from my answer on a similar question
$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';
echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);
// <p><strong>hello</strong></p>
The RegExp broken down:
/ # Start Pattern
< # Match '<' at beginning of tags
( # Start Capture Group $1 - Tag Name
[a-z] # Match 'a' through 'z'
[a-z0-9]* # Match 'a' through 'z' or '0' through '9' zero or more times
) # End Capture Group
[^>]*? # Match anything other than '>', Zero or More times, not-greedy (wont eat the /)
(\/?) # Capture Group $2 - '/' if it is there
> # Match '>'
/is # End Pattern - Case Insensitive & Multi-line ability
Add some quoting, and use the replacement text <$1$2> it should strip any text after the tagname until the end of tag /> or just >.
Please Note This isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp will tell you. There are a few fallbacks, most notably <p style=">"> would end up <p>"> and a few other broken issues... I would recommend looking at Zend_Filter_StripTags as a more full proof tags/attributes filter in PHP
Here is how to do it with native DOM:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html); // load HTML into it
$xpath = new DOMXPath($dom); // create a new XPath
$nodes = $xpath->query('//*[#style]'); // Find elements with a style attribute
foreach ($nodes as $node) { // Iterate over found elements
$node->removeAttribute('style'); // Remove style attribute
}
echo $dom->saveHTML(); // output cleaned HTML
If you want to remove all possible attributes from all possible tags, do
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//#*');
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
I would avoid using regex as HTML is not a regular language and instead use a html parser like Simple HTML DOM
You can get a list of attributes that the object has by using attr. For example:
$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr); /
/*
array(1) {
["id"]=>
string(5) "hello"
}
*/
foreach ( $html->find("div", 0)->attr as &$value ){
$value = null;
}
print $html
//<div>World</div>
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;
// Result
string 'Hello <b>world</b>. Its beautiful day.'
Another way to do it using php's DOMDocument class (without xpath) is to iterate over the attributes on a given node. Please note, due to the way php handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.
Here is a simple example:
function scrubAttributes($html) {
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
for ($els = $dom->getElementsByTagname('*'), $i = $els->length - 1; $i >= 0; $i--) {
for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
$els->item($i)->removeAttribute($attrs->item($ii)->name);
}
}
return $dom->saveHTML();
}
Here's a demo: https://3v4l.org/M2ing
Optimized regular expression from the top rated answer on this issue:
$text = '<div width="5px">a is less than b: a<b, ya know?</div>';
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
// <div>a is less than b: a<b, ya know?</div>
UPDATE:
It works better when allow only some tags with PHP strip_tags() function. Let's say we want to allow only <br>, <b> and <i> tags, then:
$text = '<i style=">">Italic</i>';
$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);
//<i>Italic</i>
As we can see it fixes flaws connected with tag symbols in attribute values.
Regex's are too fragile for HTML parsing. In your example, the following would strip out your attributes:
echo preg_replace(
"|<(\w+)([^>/]+)?|",
"<$1",
"<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);
Update
Make to second capture optional and do not strip '/' from closing tags:
|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|
Demonstrate this regular expression works:
$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello</strong><br/></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
Hope this helps. It may not be the fastest way to do it, especially for large blocks of html.
If anyone has any suggestions as to make this faster, let me know.
function StringEx($str, $start, $end)
{
$str_low = strtolower($str);
$pos_start = strpos($str_low, $start);
$pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
if($pos_end==0) return false;
if ( ($pos_start !== false) && ($pos_end !== false) )
{
$pos1 = $pos_start + strlen($start);
$pos2 = $pos_end - $pos1;
$RData = substr($str, $pos1, $pos2);
if($RData=='') { return true; }
return $RData;
}
return false;
}
$S = '<'; $E = '>'; while($RData=StringEx($DATA, $S, $E)) { if($RData==true) {$RData='';} $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA); } $DATA = str_ireplace('||||||', $S.$E, $DATA);
To do SPECIFICALLY what andufo wants, it's simply:
$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );
That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.
Here's an easy way to get rid of attributes. It handles malformed html pretty well.
<?php
$string = '<p style="padding:0px;">
<strong style="padding:0;margin:0;">hello</strong>
</p>';
//get all html elements on a line by themselves
$string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string);
//find lines starting with a '<' and any letters or numbers upto the first space. throw everything after the space away.
$string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);
echo $string_attribute_free;
?>

php preg_match_all() how to get correct values in match-array

The following situation:
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";
preg_match_all("|<[^>/]*(classname)(.+)>(.*)</[^>]+>|U", $text, $matches, PREG_PATTERN_ORDER);
I need an array ($matches) where in one field is "<span class='classname'>example</span>" and in another "example".
But what i get here is one field with "<span class='classname'>example</span>" and one with "classname".
It also should contain the values for the other matches, of course.
how can i get the right values?
You would be better off with a DOM parser, however this question is more to do with how capturing works in Regexes in general.
The reason you are getting classname as a match is because you are capturing it by putting () around it. They are completely unnecessary so you can just remove them. Similarly, you don't need them around .+ since you don't want to capture that.
If you had some group that you had to enclose in () as grouping rather than capturing, start the group with ?: and it won't be captured.
The safe/easy way:
$text = 'blah blah blah';
$dom = new DOM();
$dom->loadHTML($text);
$xp = new DOMXPath($dom);
$nodes = $xp->query("//span[#class='classname']");
foreach($nodes as $node) {
$innertext = $node->nodeValue;
$html = // see http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument
}

php regular expression to get the specific url

I would like to get the urls from a webpage that starts with "../category/" from these tags below:
PC<br>
Carpet<br>
Any suggestion would be very much appreciated.
Thanks!
No regular expressions is required. A simple XPath query with DOM will suffice:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[starts-with(#href, "../category/")]');
foreach ($nodes as $node) {
echo $node->nodeValue.' = '.$node->getAttribute('href').PHP_EOL;
}
Will print:
PC = ../category/product/pc.html
Carpet = ../category/product/carpet.html
This regex searches for your ../category/ string:
preg_match_all('#......="(\.\./category/.*?)"#', $test, $matches);
All text literals are used for matching. You can replace the ..... to make it more specific. Only the \. need escaping. The .*? looks for a variable length string. And () captures the matched path name, so it appears in $matches. The manual explains the rest of the syntax. http://www.php.net/manual/en/book.pcre.php

Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p>Don't match this text</p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.
Here is an UTF-8 safe solution, which not only works with properly formatted documents, but also with document fragments.
The mb_convert_encoding is needed, because loadHtml() seems to has a bug with UTF-8 encoding (see here and here).
The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is a link <span>with <strong>don\'t match this text</strong> content</span></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).
Thanks for Gordon and stillstanding for commenting on my other answer.
Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.
$a='<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link . It works fine with your example, though it won't work if you happen to use other tags inside your links.
You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (witch prints the exact same output), and the DomDocument is (not surprisingly) way faster (~4ms against ~77ms).
<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
This way works. Hope you want realy case sensitive, so match with small letter.
HTML parsing with regexs is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:
preg_replace('/match this text/i','replacement text');
preg_replace('/(<a[^>]*>[^(<\/a)]*)replacement text(.*?<\/a)/is',"$1match this text$3");
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.
try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

Categories