PHP Search Text Highlight Function - php

I have a PHP highlighting function which makes certain words bold.
Below is the function, and it works great, except when the array: $words contains a single value that is: b
For example someone searches for: jessie j price tag feat b o b
This will have the following entries in the array $words: jessie,j,price,tag,feat,b,o,b
When a 'b' shows up, my whole function goes wrong and it outputs a bunch of broken HTML tags. Of course I could strip any 'b' values out of the array, but that isn't ideal, because the highlighting then doesn't work as it should for certain queries.
This sample script:
function highlightWords2($text, $words)
{
$text = ($text);
foreach ($words as $word)
{
$word = preg_quote($word);
$text = preg_replace("/\b($word)\b/i", '<b>$1</b>', $text);
}
return $text;
}
$string = 'jessie j price tag feat b o b';
$words = array('jessie','tag','b','o','b');
echo highlightWords2($string, $words);
Will output:
<<<b>b</b>><b>b</b></<b>b</b>>>jessie</<<b>b</b>><b>b</b></<b>b</b>>> j price <<<b>b</b>><b>b</b></<b>b</b>>>tag</<<b>b</b>><b>b</b></<b>b</b>>> feat <<b>b</b>><b>b</b></<b>b</b>> <<b>b</b>>o</<b>b</b>> <<b>b</b>><b>b</b></<b>b</b>>
And this only happens because there are "b"'s in the array.
Can you guys see anything that I could change to make it work properly?

Your problem is that when your function goes through and looks for all the b's to bold, it also sees the bold tags it has already inserted and tries to bold the "b" inside them as well.
@symcbean was close but forgot one thing.
$string = 'jessie j price tag feat b o b';
$words = array('jessie','tag','b','o','b');
print hl($string, $words);
function hl($inp, $words)
{
$replace=array_flip(array_flip($words)); // remove duplicates
$pattern=array();
foreach ($replace as $k=>$fword) {
$pattern[]='/\b(' . $fword . ')(?!>)\b/i';
$replace[$k]='<b>$1</b>';
}
return preg_replace($pattern, $replace, $inp);
}
Notice the added "(?!>)". It is a negative lookahead assertion: it only allows a match if the word is not immediately followed by a ">", which is what follows the "b" in both the opening and the closing bold tag. I only check for ">" after the word, because checking for "<" before it would not catch the closing tag, where the "b" is preceded by "/". The above code works exactly as expected.
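Another way to avoid re-matching the inserted tags is to replace every word in a single pass, so the output of one replacement is never scanned again. This is only a minimal sketch, assuming the input is plain text without existing HTML; the function name highlightOnce is made up for illustration:

```php
<?php
// Hypothetical helper: highlight all words in a single preg_replace pass.
function highlightOnce(string $text, array $words): string
{
    // Drop duplicates and empty strings, then escape regex metacharacters.
    $words = array_unique(array_filter($words, 'strlen'));
    if (!$words) {
        return $text;
    }
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);

    // One alternation pattern: the subject is scanned once, so the <b> tags
    // added by the replacement are never matched themselves.
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    return preg_replace($pattern, '<b>$0</b>', $text);
}

$highlighted = highlightOnce('jessie j price tag feat b o b', array('jessie', 'tag', 'b', 'o', 'b'));
echo $highlighted, "\n";
// <b>jessie</b> j price <b>tag</b> feat <b>b</b> <b>o</b> <b>b</b>
```

Because there is only one pattern, duplicates in the word list stop mattering as well.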

Your basic problem is that you replace plain-text strings blindly inside HTML. With short search strings this breaks, because you replace text inside tags and attributes as well.
Instead you need to apply your search and replace only to the text between HTML tags. Additionally, you don't want to highlight inside an existing highlight.
Regular expressions are quite limited for this kind of task. Use an HTML parser instead; in PHP that is, for example, DOMDocument. With an HTML parser it is possible to search only inside the text nodes (and not in other things like tags, attributes and comments).
You can find a highlighter for text in a previous answer of mine, with a detailed description of how it works. The question is "Ignore html tags in preg_replace", and it is quite similar to yours, so this snippet is probably helpful. It uses <span> instead of <b> tags:
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
};
// wrap every each matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
If you adapt it for multiple search terms, I would add an additional class with a number depending on the search term, so you can style each term in a different color with CSS.
Additionally, you should remove duplicate search terms and make the XPath expression skip text that is already inside an element carrying the highlight span.
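For the simpler case of whole-word highlighting with <b> tags, the text-node idea can be sketched without the TextRange class. This is an illustration only, not the code from the linked answer: the function name highlightTextNodes is made up, the input is assumed to be an ASCII HTML fragment, and wrapping it in a <div> is just one way to give the parser a single root.

```php
<?php
// Hypothetical helper: wrap whole-word matches in <b>, touching text nodes only.
function highlightTextNodes(string $html, array $words): string
{
    $words = array_unique(array_filter($words, 'strlen'));
    if (!$words) {
        return $html;
    }
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    // Wrap the fragment in a <div> so the parser sees a single root element.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    // Copy the node list first, because the loop modifies the tree.
    // Text that already sits inside a <b> is skipped.
    $textNodes = iterator_to_array($xp->query('//text()[not(ancestor::b)]'));
    foreach ($textNodes as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold the matched words.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $b = $doc->createElement('b');
                $b->appendChild($doc->createTextNode($part));
                $frag->appendChild($b);
            } elseif ($part !== '') {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($frag, $textNode);
    }

    // Serialize the children of the wrapper <div>, i.e. the original fragment.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightTextNodes('<a title="b o b">b o b</a> jessie', array('b', 'o', 'jessie')), "\n";
```

The title attribute in the usage line stays untouched, because attributes are never text nodes.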

If it were me I'd have used javascript.
But using PHP, since the problem only seems to be duplicate entries in the search, just remove them, also you can run preg_replace just once rather than multiple times....
$string = 'jessie j price tag feat b o b';
$words = array('jessie','tag','b','o','b');
print hl($string, $words);
function hl($inp, $words)
{
$replace=array_flip(array_flip($words)); // remove duplicates
$pattern=array();
foreach ($replace as $k=>$fword) {
$pattern[]='/\b(' . $fword . ')\b/i';
$replace[$k]='<b>$1</b>';
}
return preg_replace($pattern, $replace, $inp);
}

Related

php preg_match excluding text within html tags/attributes to find correct place to cut a string

I am trying to determine the absolute position of certain words within a block of html, but only if they are outside of an actual html tag. For instance, if I wanted to determine the position of the word "join" using preg_match in this text:
<p>There are 14 more days until our <a href="/somepage.html" target="_blank" rel="noreferrer noopener" aria-label="join us">holiday special</a> so come join us!</p>
I could use:
preg_match('/join/', $post_content, $matches, PREG_OFFSET_CAPTURE, $offset);
The problem is that this is matching the word within the aria-label attribute, when what I need is the one just after the link. It would be fine to match between the <a> and </a>, just not inside the brackets themselves.
My actual end goal, most of which (I think) I have working apart from this last piece: I am trimming a block of HTML (not a full document) to cut it off at a specific word count. I am trying to determine the character at which the last word ends, and then join the left side of the HTML block with only the HTML tags from the right side, so that all tags close gracefully. I thought I had it working until I ran into an example like the one shown, where the last word also appears inside an HTML attribute, causing me to split the string at the wrong location. This is my code so far:
$post_content = strip_tags ( $p->post_content, "<a><br><p><ul><li>" );
$post_content_stripped = strip_tags ( $p->post_content );
$post_content_stripped = preg_replace("/[^A-Za-z0-9 ]/", ' ', $post_content_stripped);
$post_content_stripped = preg_replace("/\s+/", ' ', $post_content_stripped);
$post_content_stripped_array = explode ( " " , trim($post_content_stripped) );
$excerpt_wordcount = count( $post_content_stripped_array );
$cutpos = 0;
while($excerpt_wordcount>48){
$thiswordrev = "/" . strrev($post_content_stripped_array[$excerpt_wordcount - 1]) . "/";
preg_match($thiswordrev, strrev($post_content), $matches, PREG_OFFSET_CAPTURE, $cutpos);
$cutpos = $matches[0][1] + (strlen($thiswordrev) - 2);
array_pop($post_content_stripped_array);
$excerpt_wordcount = count( $post_content_stripped_array );
}
if($pwordcount>$excerpt_wordcount){
preg_match_all('/<\/?[^>]*>/', substr( $post_content, strlen($post_content) - $cutpos ), $closetags_result);
$excerpt_closetags = "" . $closetags_result[0][0];
$post_excerpt = substr( $post_content, 0, strlen($post_content) - $cutpos ) . $excerpt_closetags;
}else{
$post_excerpt = $post_content;
}
I am actually searching the string in reverse in this case, since I am walking word by word backwards from the end of the string, so I know that my html brackets are backwards, eg:
>p/<!su nioj emoc os >a/<laiceps yadiloh>"su nioj"=lebal-aira "renepoon rerreferon"=ler "knalb_"=tegrat "lmth.egapemos/"=ferh a< ruo litnu syad erom 41 era erehT>p<
But it's easy enough to flip all of the brackets before doing the preg_match, or I am assuming should be easy enough to have the preg_match account for that.
Do not use regex to parse HTML.
You have a simple objective: limit the text content to a given number of words, ensuring that the HTML remains valid.
To this end, I would suggest looping through text nodes until you count a certain number of words, and then removing everything after that.
$dom = new DOMDocument();
$dom->loadHTML($post_content);
$xpath = new DOMXPath($dom);
$all_text_nodes = $xpath->query("//text()");
$words_left = 48;
foreach( $all_text_nodes as $text_node) {
$text = $text_node->textContent;
$words = explode(" ", $text); // TODO: maybe preg_split on /\s/ to support more whitespace types
$word_count = count($words);
if( $word_count < $words_left) {
$words_left -= $word_count;
continue;
}
// reached the threshold
$words_that_fit = implode(" ", array_slice($words, 0, $words_left));
// If the above TODO is implemented, this will need to be adjusted to keep the specific whitespace characters
$text_node->textContent = $words_that_fit;
$remove_after = $text_node;
while( $remove_after->parentNode) {
while( $remove_after->nextSibling) {
$remove_after->parentNode->removeChild($remove_after->nextSibling);
}
$remove_after = $remove_after->parentNode;
}
break;
}
$output = substr($dom->saveHTML($dom->getElementsByTagName("body")->item(0)), strlen("<body>"), -strlen("</body>"));
Ok, I figured out a workaround. I don't know if this is the most elegant solution, so if someone sees a better one I would still love to hear it, but for now I realized that I don't have to actually have the html in the string I am searching to determine the position to cut, I just need it to be the same length. I grabbed all of the html elements and just created a dummy string replacing all of them with the same number of asterisks:
// create faux string with placeholders instead of html for search purposes
preg_match_all('/<\/?[^>]*>/', $post_content, $alltags_result);
$tagcount = count( $alltags_result );
$post_content_dummy = $post_content;
foreach($alltags_result[0] as $thistag){
$post_content_dummy = str_replace($thistag, str_repeat("*",strlen($thistag)), $post_content_dummy);
}
Then I just use $post_content_dummy in the while loop instead of $post_content, in order to find the cut position, and then $post_content for the actual cut. So far seems to be working fine.
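The masking step can also be sketched in one pass with preg_replace_callback, which keeps every offset identical to the original string. This is only an illustration under the same assumption as above (no ">" inside attribute values); the sample HTML and variable names are made up:

```php
<?php
// Sample fragment where "join" appears in an attribute and in the text.
$html = '<p>Our <a href="/x" aria-label="join us">special</a> so come join us!</p>';

// Replace each tag with asterisks of the same length, so offsets are preserved.
$masked = preg_replace_callback('/<[^>]*>/', function ($m) {
    return str_repeat('*', strlen($m[0]));
}, $html);

// Search the masked copy, then use the offset on the real string.
preg_match('/\bjoin\b/', $masked, $match, PREG_OFFSET_CAPTURE);
$offset = $match[0][1];

echo substr($html, $offset), "\n";
// join us!</p>
```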

Replace enclosing Apostrophes with HTML tags but not inside <code> blocks

Goal: Modify an HTML string that uses apostrophes (backticks) to wrap inline code (like Stack Overflow does). At the same time the string has <code> blocks that can also contain apostrophes, and those should stay unchanged.
Example:
<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>
<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>
This regex I am using for converting all wrapping apostrophs to <code> elements:
// replace apostroph with incorporating <code> tag
$content = preg_replace('/(.+?)\`(.+?)\`/', '$1<code class="inlinecode">$2</code>', $content);
Required:
Change the regex so that it does not convert an apostrophe if it is within a <code> block.
Disclaimer: I tried for several hours to read the HTML string, use PHP's DOM parser, extract all nodes of type code, change their content, write them back, then found out that nodeValue is removing all HTML tags (especially the line breaks). Then tried several solutions found online, still not working... Now I am falling back to regex, even against the odds.
FYI, how I tried it the DOM way:
$code_blocks = $dom->getElementsByTagName('code');
foreach($code_blocks as $codenode) {
// nodeValue strips HTML tags, we need to hack
$nodevalue_html = $codenode->ownerDocument->saveXML($codenode);
// replace, i.e. custom-store each apostroph with '~~~APO~~~' so that they survive
$nodevalue_html = preg_replace('/`/', '~~~APO~~~', $nodevalue_html);
// $codenode->textValue = $nodevalue_html; // fail
// $codenode->nodeValue = $nodevalue_html; // fail
// ...
}
// html to string
$html_new = $dom->saveHTML();
$html_new = preg_replace('/~~~APO~~~/', '`', $html_new);
I wished I could use Markdown like Stackoverflow, but I still need to deal with HTML.
Using an XPath query to avoid text nodes that have a code element as ancestor:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::code)][contains(.,"`")]');
foreach ($textNodes as $textNode) {
$parts = (function($text) { yield from explode('`', $text); })($textNode->nodeValue);
$frag = $dom->createDocumentFragment();
do {
$frag->appendChild($dom->createTextNode($parts->current()));
$parts->next();
if ( $parts->valid() ) {
$codeElt = $dom->createElement('code');
$codeElt->appendChild($dom->createTextNode($parts->current()));
$frag->appendChild($codeElt);
$parts->next();
}
} while ($parts->valid());
$textNode->parentNode->replaceChild($frag, $textNode);
}
echo $dom->saveHTML();
I believe the only way is to explode and reassemble the string:
$html_string = '....................'; // contains apostrophes and <code>...</code> blocks
$delim = "<code>";
$closing_tag = "</code>";
$explode = explode($delim, $html_string);
foreach($explode as &$ex) {
$closing_tag_pos = strpos($ex, $closing_tag);
if ($closing_tag_pos !== false) {
// everything before </code> is code content: protect its apostrophes
$pre_closing_tag = substr($ex, 0, $closing_tag_pos);
$post_closing_tag = substr($ex, $closing_tag_pos);
$ex = preg_replace('/`/', '~~~APO~~~', $pre_closing_tag) . $post_closing_tag;
}
}
unset($ex); // break the reference left over from the loop
$mapped_html_string = implode($delim, $explode);
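If you do stay with a regex, one common trick is to match whole <code> blocks in the same pattern and hand them back unchanged, so only apostrophe pairs outside of them get wrapped. This is a different technique from the explode approach above and only a sketch: it assumes <code> elements are not nested, and the function name wrapInlineCode is made up.

```php
<?php
// Hypothetical helper: wrap `...` in <code>, but skip existing <code> blocks.
function wrapInlineCode(string $html): string
{
    // First alternative consumes a complete <code> block,
    // second alternative captures a backtick-delimited span.
    $pattern = '~<code\b[^>]*>.*?</code>|`([^`]+)`~s';
    return preg_replace_callback($pattern, function ($m) {
        // Group 1 is absent when the <code> alternative matched: keep it as is.
        if (!isset($m[1])) {
            return $m[0];
        }
        return '<code class="inlinecode">' . $m[1] . '</code>';
    }, $html);
}

$in = '<p>This is my `inline code`, ok.</p><p><code>has `ticks` inside</code></p>';
echo wrapInlineCode($in), "\n";
// <p>This is my <code class="inlinecode">inline code</code>, ok.</p><p><code>has `ticks` inside</code></p>
```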

regex to match a specific HTML string with any number of spaces inside it

I have code with several lines like this
<p> <inset></p>
Where there may be any number of spaces or tabs (or none) between the opening <p> tag and the rest of the string. I need to replace these, but I can't get it to work.
I thought this would do it, but it doesn't work:
<p>[ \t]+<inset></p>
Try this:
$html = preg_replace('#(<p>)\s+(<inset></p>)#', '$1$2', $html);
If you want true text-trimming for HTML, covering everything you can encounter like entities, comments, child elements and so on, you can make use of a TextRangeTrimmer and TextRange:
$htmlFragment = '<p> <inset></p>';
$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
throw new Exception('Parent element not found.');
}
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->ltrim();
// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
echo $dom->saveHTML($node);
}
Output:
<p><inset></p>
I've both classes in a gist: https://gist.github.com/1894360/ (codepad viper is down).
See as well the related questions / answers:
Wordwrap / Cut Text in HTML string
Ignore html tags in preg_replace
Try to load your HTML string into a DOM tree instead, and then trim all the text values in the tree.
http://php.net/domdocument.loadhtml
http://php.net/trim

Regular expression check by skipping anchor tags

I have written a regex that searches for a particular keyword, and I am replacing that keyword with a link to a particular URL.
My current regex is: \b$keyword\b
One problem is that if my data contains anchor tags and such a tag contains the keyword, this regex replaces the keyword inside the anchor tag as well.
I want to search the given data while excluding anchor tags. Please help me out. I appreciate your help.
eg. Keyword: Disney
I/p:
This is Disney The disney should be replaceable
Expected O/p:
This is Disney The disney should be replaceable
Invalid o/p:
This is <a href="any-url.php">Disney </a> The disney should be replaceable
I've modified my function that highlights searched phrase on a page, here you go:
$html = 'This is Disney The disney should be replaceable.'.PHP_EOL;
$html .= 'Let\'s test also use of keyword inside other tags, for example as class name:'.PHP_EOL;
$html .= '<b class=disney></b> - this should not be replaced with link, and it isn\'t!'.PHP_EOL;
$result = ReplaceKeywordWithLink($html, "disney", "any-url.php");
echo nl2br(htmlspecialchars($result));
function ReplaceKeywordWithLink($html, $keyword, $link)
{
if (strpos($html, "<") !== false) {
$id = 0;
$unique_array = array();
// Hide existing anchor tags with some unique string.
preg_match_all("#<a[^<>]*>[\s\S]*?</a>#i", $html, $matches);
foreach ($matches[0] as $tag) {
$id++;
$unique_string = "#####$id#####";
$unique_array[$unique_string] = $tag;
$html = str_replace($tag, $unique_string, $html);
}
// Hide all tags by replacing with some unique string.
preg_match_all("#<[^<>]+>#", $html, $matches);
foreach ($matches[0] as $tag) {
$id++;
$unique_string = "#####$id#####";
$unique_array[$unique_string] = $tag;
$html = str_replace($tag, $unique_string, $html);
}
}
// Then we replace the keyword with link.
$keyword = preg_quote($keyword, '#');
assert(strpos($keyword, '$') === false);
$html = preg_replace('#(\b)('.$keyword.')(\b)#i', '$1<a href="'.$link.'">$2</a>$3', $html);
// We get back all the tags by replacing unique strings with their corresponding tag.
if (isset($unique_array)) {
foreach ($unique_array as $unique_string => $tag) {
$html = str_replace($unique_string, $tag, $html);
}
}
return $html;
}
Result:
This is Disney The disney should be replaceable.
Let's test also use of keyword inside other tags, for example as class name:
<b class=disney></b> - this should not be replaced with link, and it isn't!
Add this to the end of your regex:
(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))
This lookahead tries to match either the next opening <a> tag or the end of the input, but only if it doesn't see a closing </a> tag first. Assuming the HTML is minimally well formed, the lookahead will fail whenever the match starts after the beginning of an <a> tag and before the corresponding </a> tag.
To prevent it from matching inside any other tag (e.g. <div class="disney">), you can add this lookahead as well:
(?![^<>]*+>)
With this one I'm assuming there won't be any angle brackets in the attribute values of the tags. Angle brackets there are legal according to the HTML 4 spec, but extremely rare in the real world.
If you're writing the regex as a PHP double-quoted string (which you must be, if you expect the $keyword variable to be interpolated), doubling the backslashes is a safe habit. PHP does not treat \b or \z as escape sequences in double-quoted strings, so both reach the regex engine unchanged, but sequences such as \n, \t or \$ are interpreted by PHP first.
EDIT: On second thought, definitely do add the second lookahead. I mean, why would you not want to prevent matches inside tags? And place it first, because it will tend to evaluate more quickly than the other:
(?![^<>]*+>)(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))
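Put together in PHP, the combined lookahead might be used like this. It is a sketch under the assumptions stated above (minimally well-formed HTML, no angle brackets inside attribute values); the function name linkKeyword is made up, and the URL is assumed not to contain "$" or "\", which preg_replace would treat specially in the replacement string.

```php
<?php
// Hypothetical helper built around the two lookaheads above.
function linkKeyword(string $html, string $keyword, string $url): string
{
    // Single-quoted strings keep \b and \z intact for the regex engine.
    $pattern = '~\b' . preg_quote($keyword, '~') . '\b'
        . '(?![^<>]*+>)'                                 // not inside a tag
        . '(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))~i';  // not inside <a>...</a>
    $replacement = '<a href="' . htmlspecialchars($url) . '">$0</a>';
    return preg_replace($pattern, $replacement, $html);
}

$in = 'This is <a href="x.php">Disney</a> The disney should be replaceable <b class="disney">x</b>';
echo linkKeyword($in, 'disney', 'any-url.php'), "\n";
// This is <a href="x.php">Disney</a> The <a href="any-url.php">disney</a> should be replaceable <b class="disney">x</b>
```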
Strip the tags first, then search on the stripped text.

PHP RegEx (or Alt Method) for Anchor tags

Ok I have to parse out a SOAP request and in the request some of the values are passed with (or inside) a Anchor tag. Looking for a RegEx (or alt method) to strip the tag and just return the value.
// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
// Split on > this should be the end of the right side of the anchor tag
$pieces = explode(">", $sObject->fields->$field);
// Split on < this should be the closing anchor tag
$piece = explode("<", $pieces[1]);
$fields_string .= $piece[0] . "\n";
}
item is a field name but I would like to make this a RegEx to check for the Anchor tag instead of a specific field.
PHP has a strip_tags() function.
Alternatively you can use filter_var() with FILTER_SANITIZE_STRING (note that this filter is deprecated as of PHP 8.1).
Whatever you do don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
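For the simple case in the question, strip_tags() alone already returns the text between the anchor tags. A minimal sketch; the sample value is made up and stands in for one field of the SOAP response:

```php
<?php
// Made-up field value containing an anchor tag.
$value = '<a href="http://example.com/item/42">Widget 42</a>';

// strip_tags() removes the markup and keeps the text content.
$plain = strip_tags($value);
echo $plain, "\n";
// Widget 42
```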
I agree with cletus, using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can vary a tag that unless you know that the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!
I'll split this up into sections, no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:
'#<a></a>#'
Then we add in the text that could be between the tags.
We want to group this in parentheses, so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> that it comes across will be the one it uses to end the RegEx.
'#<a>(.*?)</a>#'
Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.
'#<a href\="([^"]*)">(.*?)</a>#'
Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*.
Allow an attribute multiple times, and with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
The resulting RegEx (PCRE) is as following:
'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'
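Now, in PHP, use the preg_match_all() function to grab all occurrences in the string. Pass PREG_SET_ORDER so that each entry of $result holds one complete match together with its groups.
$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
{
$href = $link[2]; // value of the href attribute
$text = $link[4]; // text between the opening and closing anchor tag
}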
Use SimpleXML and XPath to retrieve the desired nodes.
If you don't have some kind of request<->class mapping you can extract the information with the DOM extension. The property textContent contains all the text of the context node and its descendants.
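A short sketch of the SimpleXML route, assuming the value arrives as well-formed XML; the element names and sample data are made up:

```php
<?php
// Made-up, well-formed fragment standing in for part of the SOAP response.
$xml = '<response><item><a href="/items/42">Widget 42</a></item></response>';

$root = simplexml_load_string($xml);
$values = array();
// xpath() returns an array of SimpleXMLElement objects.
foreach ($root->xpath('//a') as $a) {
    // Casting to string yields the text; array access yields attributes.
    $values[] = (string) $a . ' -> ' . (string) $a['href'];
}
echo implode("\n", $values), "\n";
// Widget 42 -> /items/42
```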
$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
<SOAP:Body>
<foo:bar xmlns:foo="urn:yaddayadda">
<fragment>
Mary had a
little lamb
</fragment>
</foo:bar>
</SOAP:Body>
</SOAP:Envelope>';
$doc = new DOMDocument;
$doc->loadxml($sr);
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
echo $ns->item(0)->nodeValue;
}
prints
Mary had a
little lamb
If you want to strip or extract properties from only specific tags, you should try DOMDocument.
Something like this:
$TagWhiteList = array(
// Example of WhiteList
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getTextFromNode($Node, $Text = "") {
// Text nodes have no tag name, so return their content directly
if ($Node->nodeType === XML_TEXT_NODE)
return $Text.$Node->textContent;
// You may select a tag here
// Like:
// if (in_array($Node->nodeName, $TagWhiteList))
// DoSomethingWithIt($Text, $Node);
// Recurse into the first child, if there is one
$Node = $Node->firstChild;
if ($Node === null)
return $Text;
$Text = getTextFromNode($Node, $Text);
// Recurse into each following sibling
while ($Node->nextSibling !== null) {
$Text = getTextFromNode($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
function getTextFromDocument($DOMDoc) {
return getTextFromNode($DOMDoc->documentElement);
}
To use:
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
$Text = getTextFromDocument($Doc);
echo "Text from HTML: ".$Text."\n";
The above function shows how to strip tags, but you can modify it a bit to manipulate elements. For example, if the tag is 'a' (an anchor), you can extract its target and display it instead of the text inside.
Hope this helps.
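Extracting the target of each anchor can be sketched like this. It is a standalone illustration rather than a change to the function above, and the sample HTML is made up:

```php
<?php
$doc = new DOMDocument();
// Suppress parser warnings for imperfect markup.
libxml_use_internal_errors(true);
$doc->loadHTML('<p>See <a href="/items/42">Widget 42</a> and <a href="/items/7">Widget 7</a>.</p>');
libxml_clear_errors();

$targets = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // getAttribute() returns an empty string if the attribute is missing.
    $targets[] = $a->getAttribute('href');
}
print_r($targets);
```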
