I have a PHP script that runs a search on my database based on terms put into the search box on a website. This returns a block of text. Let's say that my search term for right now is "test block". An example of my result would be this block of text:
This is a test block of text that was gathered from the database using
a search query.
Now, my question is: how can I "highlight" the search term within the block of text so the user can see why this result was pulled in the first place? Using the above example, something like the following would suffice:
This is a <b>test</b> <b>block</b> of text that was gathered from the database
from a search query.
I have tried a few things so far that will change the text, but the real problem I am running into has to do with case sensitivity. For example, if I used the code:
$exploded = explode(' ', $search_terms);
foreach($exploded as $word) {
// I have to use str_ireplace so the word is actually found
$result = str_ireplace($word, '<b>' . $word . '</b>', $result);
}
It would go through my $result and bold any instance of the words. This would look correct, as I wanted in my second example of the search result. But, in the case that the user uses "Test Block" instead of "test block", the search terms would be capitalized and appear as this:
This is a <b>Test</b> <b>Block</b> of text that was gathered from the database
from a search query.
This does not work for me, especially when the user enters lower-case search terms and they happen to fall at the beginning of a sentence.
Essentially, what I need to do is find the word in the string, insert text (<b> in this example) directly in front of the word, and then insert text directly after the word (</b> in this example) while preserving the word itself from being replaced. This rules preg_replace and str_replace out I believe, so I am really stuck on what to do.
Any leads would be greatly appreciated.
$exploded = explode(' ', $search_terms);
foreach($exploded as $word) {
    // Capture the matched text and reuse it as $1 so its original case is kept
    $result = preg_replace("/(" . preg_quote($word, '/') . ")/i", "<b>$1</b>", $result);
}
Pattern matching (http://www.php.net/manual/en/reference.pcre.pattern.syntax.php) gives special meaning to certain characters such as . [ ] / * + and so on, so if these occur in the search term they need to be escaped first with preg_quote().
Patterns start and end with delimiters to identify the pattern: http://www.php.net/manual/en/regexp.reference.delimiters.php
followed by modifiers: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
In this case, i for case-insensitive.
Anything in ( brackets ) is captured for use later, either in the $matches parameter or in the replacement as $1 or \\1 for the first, $2 for the second, etc.
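To tie the points above together, here is a minimal sketch. The helper name highlightTerms() is hypothetical (it is not from the answer above). It assumes the terms are separated by whitespace, escapes each one with preg_quote(), and joins them into a single case-insensitive pattern so the text is only scanned once:

```php
<?php
// Hypothetical helper: wraps every occurrence of each search term in <b> tags.
function highlightTerms(string $text, string $searchTerms): string
{
    // Split on whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($searchTerms), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return $text;
    }
    // Escape regex metacharacters and the "/" delimiter in every term.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    // $1 is the text exactly as it appears in $text, so its case is preserved.
    return preg_replace('/(' . implode('|', $quoted) . ')/i', '<b>$1</b>', $text);
}

echo highlightTerms('This is a Test Block of text.', 'test block');
// prints: This is a <b>Test</b> <b>Block</b> of text.
```

A single combined pattern also means a later term can never match inside the <b> tags inserted for an earlier one, which can happen when looping over the terms one at a time.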
Use preg_replace. In your example
$result = preg_replace("/\\b(" . preg_quote($word, '/') . ")\\b/i", '<b>$1</b>', $result);
You can use preg_replace:
foreach ($exploded as $word) {
    $text = preg_replace("`(" . preg_quote($word, '`') . ")`i", "<b>$1</b>", $text);
}
$string = 'The quick brown fox jumped over the lazy dog.';
$search = "brown";
$pattern = "/".$search."/";
$replacement = "<strong>".$search."</strong>";
echo preg_replace($pattern, $replacement, $string);
The quick <strong>brown</strong> fox jumped over the lazy dog.
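Note that the snippet above is case-sensitive and writes the search term itself into the output. A possible variation, sketched here with a made-up input string, uses the i modifier, preg_quote() and the $0 backreference so that the matched text is kept exactly as it was written:

```php
<?php
$string = 'The quick Brown fox jumped over the lazy dog.';
$search = 'brown';
// preg_quote() escapes regex metacharacters and the "/" delimiter in the term.
$pattern = '/' . preg_quote($search, '/') . '/i';
// $0 refers to the whole match, so "Brown" keeps its capital letter.
$highlighted = preg_replace($pattern, '<strong>$0</strong>', $string);
echo $highlighted;
// prints: The quick <strong>Brown</strong> fox jumped over the lazy dog.
```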
Related
I have the following code to get characters before/after the regex match:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*(.{10}\b' . $searchterm . '\b.{10}).*/si';
echo preg_replace($regex, '$1', $string);
Output: "ing about blue. This se" (expected).
When I change $searchterm = 'red', then I get this:
Output: "Here is a sentence talking about blue. This sentence talks about red."
I am expecting this: "lks about red." The same thing happens if you start at the beginning of the sentence. Is there a way to use a similar regex to not pull back the entire string when it's at the start/end?
Example of what is happening: https://sandbox.onlinephpfunctions.com/code/e500b505860ded429e78869f61dbf4128ff368b3
Converting my comment to answer so that solution is easy to find for future visitors.
Your regex is almost correct, but make sure to use a non-greedy quantifier for the leading .* and a .{0,10} limit for the surrounding substring:
$searchterm = 'blue';
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.*?(.{0,10}\b' . $searchterm . '\b.{0,10}).*/si';
echo preg_replace($regex, '$1', $string);
You'd better use preg_match with .{0,10} quantifiers instead of .{10}:
function truncateString($searchterm){
$string = 'Here is a sentence talking about blue. This sentence talks about red.';
$regex = '/.{0,10}\b' . $searchterm . '\b.{0,10}/si';
if (preg_match($regex, $string, $m)) {
echo $m[0] . "\n";
}
}
truncateString('blue');
// => ing about blue. This se
truncateString('red');
// => lks about red.
preg_match will find and return the first match only. The .{0,10} pattern will match zero to ten occurrences of any char (since the s modifier is used, the . matches even line break chars).
One more thing: if your $searchterm can contain special regex metacharacters, anywhere in the term, you should consider refactoring the code to
$regex = '/.{0,10}(?<!\w)' . preg_quote($searchterm, '/') . '(?!\w).{0,10}/si';
where (?<!\w) / (?!\w) are unambiguous word boundaries and the preg_quote is used to escape all special chars.
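Putting the pieces above together, a reusable version might look like the sketch below. The function name contextAround() and the $radius parameter are hypothetical, and the input strings are made up for illustration. Note that without the u modifier the limit counts bytes rather than characters:

```php
<?php
// Hypothetical helper: returns the first match of $term together with up to
// $radius characters on each side, or null when the term is not found.
function contextAround(string $text, string $term, int $radius = 10): ?string
{
    $regex = '/.{0,' . $radius . '}(?<!\w)' . preg_quote($term, '/')
           . '(?!\w).{0,' . $radius . '}/si';
    return preg_match($regex, $text, $m) ? $m[0] : null;
}

echo contextAround('The sky is blue today', 'blue', 5), "\n";
// prints: y is blue toda
echo contextAround('roses are red', 'red', 4), "\n";
// prints: are red
var_dump(contextAround('roses are red', 'green'));
// prints: NULL
```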
I have the code below, which removes whole words that contain any of the patterns:
$patterns = ["are", "finite", "get", "er"];
$string = "You are definitely getting better today";
$re = '\S*('.implode('|', $patterns).')\S*';
$string = preg_replace('#'.$re.'#', '', $string);
$string = preg_replace('#\h{2,}#', ' ', $string);
echo $string;
the output of the above code is
You today
I want to split this code into two functions: the first should remove only words that exactly match a pattern, and the second should remove only words that contain a pattern without matching it exactly.
I expect the first function, which removes only whole-word matches, to output:
You definitely getting better today (**are** is removed)
The second function, which removes words that contain a pattern, should output:
You are today (**definitely getting better** are removed)
The first part is basic: only match whole keywords (there are dozens of existing Q&As covering this):
\b(?:are|finite|get|er)\b
Which can be applied to your code like this: $re = '\b('.implode('|', $patterns).')\b';
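As a rough sketch of the first function, assuming a hypothetical name removeWholeWords() and keeping the whitespace clean-up from the question:

```php
<?php
// Hypothetical helper: removes only words that are exactly one of the keywords.
function removeWholeWords(string $text, array $keywords): string
{
    // Escape each keyword for use inside a "#"-delimited pattern.
    $alternation = implode('|', array_map(function ($k) {
        return preg_quote($k, '#');
    }, $keywords));
    $text = preg_replace('#\b(?:' . $alternation . ')\b#', '', $text);
    // Collapse the gap left behind by a removed word.
    return trim(preg_replace('#\h{2,}#', ' ', $text));
}

echo removeWholeWords('You are definitely getting better today', ['are', 'finite', 'get', 'er']);
// prints: You definitely getting better today
```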
The second part is a bit more involved: while you keep expanding substring matches to cover the entire word, you want to exclude words that are exactly a keyword.
We can use a negative lookahead to achieve this:
(?!\b(?:are|finite|get|er)\b)\S*(?:are|finite|get|er)\S*
Sample Code:
$patterns = ["are", "finite", "get", "er"];
$string = "You are definitely getting better today";
$alternations = implode('|', $patterns);
$re = '(?!\b(?:'.$alternations.')\b)\S*(?:'.$alternations.')\S*';
$string = preg_replace('#'.$re.'#', '', $string);
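Wrapped up as the second function, this could look something like the sketch below. The name removeWordsContaining() is hypothetical, and the whitespace clean-up is borrowed from the question so the result matches the expected output:

```php
<?php
// Hypothetical helper: removes words that merely contain a keyword,
// but keeps words that are exactly a keyword.
function removeWordsContaining(string $text, array $keywords): string
{
    $alternation = implode('|', array_map(function ($k) {
        return preg_quote($k, '#');
    }, $keywords));
    // The negative lookahead rejects positions where a whole keyword starts.
    $re = '#(?!\b(?:' . $alternation . ')\b)\S*(?:' . $alternation . ')\S*#';
    $text = preg_replace($re, '', $text);
    // Collapse the gaps left behind by removed words.
    return trim(preg_replace('#\h{2,}#', ' ', $text));
}

echo removeWordsContaining('You are definitely getting better today', ['are', 'finite', 'get', 'er']);
// prints: You are today
```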
If \b does not work for you and you'd like to use whitespace as the word boundary, use lookarounds:
(?<=\s)(?:are|finite|get|er)(?=\s)
I'm trying to replace 'he' with 'she' between two given positions, in this case between Maria (0) and Peter (34). So basically I give two positions and replace all the occurrences in the sentence between those boundaries. I have tried the following simple code, but it seems that the substr_replace function doesn't allow me to do it. Any idea how to make it work?
$sentence = "Maria went to London where he met Peter with his family.";
$clean = substr_replace($sentence, "he", "she", 0, 34);
You are using substr_replace() incorrectly, and in this case you don't need it. Instead, try this, using a combo of str_replace() and substr():
$sentence = "Maria went to London where he met Peter with his family.";
$clean = str_replace(' he ', ' she ', substr($sentence, 0, 34));
$clean .= substr($sentence, 34);
Essentially, you are replacing he with she on the specified substr of $sentence, then concatenating the rest of $sentence back on.
Note the spaces around ' he ' and ' she '. This is necessary because otherwise you would replace where with wshere.
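If relying on the surrounding spaces feels fragile (for example when the word is followed by a comma), one possible alternative is to combine the same substr() slicing with a word-boundary regex. This is only a sketch, and the helper name replaceWordInRange() is made up for illustration:

```php
<?php
// Hypothetical helper: replaces whole-word occurrences of $search with $replace,
// but only inside the byte range [$start, $end) of $text.
function replaceWordInRange(string $text, string $search, string $replace, int $start, int $end): string
{
    $before = substr($text, 0, $start);
    $middle = substr($text, $start, $end - $start);
    $after  = substr($text, $end);
    // A callback keeps "$" or "\" in $replace from being read as backreferences.
    $middle = preg_replace_callback(
        '/\b' . preg_quote($search, '/') . '\b/',
        function () use ($replace) {
            return $replace;
        },
        $middle
    );
    return $before . $middle . $after;
}

$sentence = "Maria went to London where he met Peter with his family.";
echo replaceWordInRange($sentence, 'he', 'she', 0, 34);
// prints: Maria went to London where she met Peter with his family.
```

One caveat: if the range cuts a word in half, the cut itself looks like a word boundary to the regex, so choose positions that fall between words.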
Edit: Mark made me realize I misinterpreted your question. However, I feel you might still find this regex useful, so I'm going to leave my answer; for the answer to your exact question, look at MarkM's answer.
If you want to replace all instances of the word he with she only when 'he' occurs as its own word, I would do something like this.
$pattern = '/(\b)he(\b)/';
$replacement = '$1she$2';
$subject = 'Maria went to London where he met Peter with his family';
$newString = preg_replace($pattern, $replacement, $subject);
The regex pattern basically says: find all instances of the letters 'he' surrounded by word boundaries (\b) and replace them with 'she'. Because \b is a zero-width assertion, the two captured groups are always empty, so $1 and $2 add nothing to the replacement.
I've tried to make a tool in which you enter a website and, when you click the submit button, it uses cURL to fetch all the text.
After fetching the page, stripping the tags, and counting the words, I end up with an array named $frequency. If I echo it using <pre> tags it shows me everything just fine! (NOTE: I'm placing the contents in a file, $homepage = file_get_contents($file); and this is what I work with in my code. I don't know if this matters or not.)
However, I don't really care if the word "or" is seen 200 times on a website; I only want the important words. So I have made an array with all the common words, which ends up in the $common_words variable. But I can't seem to find a way to replace the words in $frequency with "" if they are also found in $common_words.
I've found this piece of code after some research:
$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");
foreach ($wordlist as &$word) {
$word = '/\b' . preg_quote($word, '/') . '\b/';
}
$string = preg_replace($wordlist, '', $string);
var_dump($string);
If I copy paste this it works fine, removing the or, and, where from the string.
But replacing $string with $frequency or replacing $wordlist with $common_words will either not work or throw me an error like: Delimiter must not be alphanumeric or backslash
I hope I've formulated my question properly. If not, please tell me!
Thanks in advance
EDIT: Alright, I've narrowed down the problem a lot. First of all, I forgot the & inside the foreach ($wordlist as &$word) {
But as it was counting all the words, the words it has replaced are all still counted. See those 2 screenshots to see what I mean: http://imgur.com/oqqZR3h,xHEZKRz#0
If I understand this correctly, you want to know how many occurrences each word has, ignoring the so-called common words.
Assuming that $url is the page you will be running against and $common_words is your common words array, here is what you can do:
// Get the page contents and strip the HTML tags
$contents = strip_tags( file_get_contents($url) );
// This will split the words from the contents, creating an array with each word in it
preg_match_all("/([\w]+[']?[\w]*)\W/", $contents, $words);
$common_words = array('or', 'and', 'I', 'where');
// $words[1] holds the captured words without the trailing non-word character;
// drop the common words from it (array_diff() compares case-sensitively)
$words = array_diff($words[1], $common_words);
// Count occurrences
$frequency = array_count_values($words);
unset($words); // Release all that memory
var_dump($frequency);
At this point you will have an associative array mapping each non-common word to the number of times it occurs.
UPDATE
A bit more about the RegEx. We need to match words. The easiest way possible is (\w+), but that won't match words like I've or haven't (notice the '). That was my reason for making it more complicated. Also, \w doesn't match the dashes in words like 6-year-old.
So I created a subgroup that matches word characters, including dashes and single quotes inside a word:
(?:\w'\w|\w|-)
The ?: at the beginning makes the group non-capturing, so it is not included in the results; all I am doing is grouping the options for word contents. To match an entire word, the RegEx matches one or more repetitions of the subgroup above:
((?:\w'\w|\w|-)+)
So the RegEx preg_match_all() line should be:
preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);
Hope this helps.
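For what it's worth, here is a small self-contained sketch of the whole idea using the updated RegEx. The sample sentence and the lowercase $common_words list are assumptions made up for this example; the words are lowercased first so that "The" and "the" are treated as the same word:

```php
<?php
// Sample text standing in for the stripped page contents (an assumption for this sketch).
$contents = "I've seen the fox and the 6-year-old dog, and I haven't seen the cat.";
$common_words = array('the', 'and', 'i');

// Capture words, allowing an inner apostrophe and dashes.
preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $matches);

// Lowercase everything so the comparison with $common_words ignores case.
$words = array_map('strtolower', $matches[1]);

// Drop the common words, then count what is left.
$frequency = array_count_values(array_diff($words, $common_words));
arsort($frequency); // most frequent words first
print_r($frequency);
```

With this input, "seen" is counted twice, "i've", "haven't" and "6-year-old" each survive as a single word, and "the", "and" and "i" do not appear in the result at all.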
I replaced $wordlist with $mywordlist, and it still works:
<?php
$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");
$mywordlist=array("sand","band");
foreach ($mywordlist as &$word) {
$word = '/\b' . preg_quote($word, '/') . '\b/';
}
$string = preg_replace($mywordlist, '', $string);
var_dump($string);
?>
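The "Delimiter must not be alphanumeric or backslash" warning typically shows up when bare words such as or reach preg_replace() without the surrounding / delimiters, which is what happens when the & is missing from the loop. One way to sidestep the by-reference loop entirely, sketched here under the assumption that $common_words is a plain array of words, is to build a separate pattern array with array_map():

```php
<?php
$string = 'sand band or nor and where whereabouts foo';
$common_words = array('or', 'and', 'where');

// Build a new array of delimited patterns instead of modifying the list by reference.
$patterns = array_map(function ($word) {
    return '/\b' . preg_quote($word, '/') . '\b/';
}, $common_words);

$cleaned = preg_replace($patterns, '', $string);
// Tidy up the spaces left behind by the removed words.
$cleaned = trim(preg_replace('/\s+/', ' ', $cleaned));
echo $cleaned;
// prints: sand band nor whereabouts foo
```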
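I suppose you can simply do it like this:
$common_words = "foo baq etc etc";
$str = "foo bar baz"; // input
foreach (explode(" ", $common_words) as $word){
    // The array form of strtr() replaces whole substrings; the three-argument form only maps single characters
    $str = strtr($str, array($word => ""));
}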
I am working with some code in PHP that grabs the referrer data from a search engine, giving me the query that the user entered.
I would then like to remove certain stop words from that string if they exist. However, the word may or may not have a space at either end.
For example, I have been using str_replace to remove a word as follows:
$keywords = str_replace("for", "", $keywords);
$keywords = str_replace("sale", "", $keywords);
but if the $keywords value is "baby formula" it will change it to "baby mula" - removing the "for" part.
Without having to create further str_replace's that account for " for" and "for " - is there a preg_replace type command I could use that would remove the given word if it is found with a space at either end?
My idea would be to put all of the stop words into an array and step through them that way and I suspect that a preg_replace is going to be quicker than stepping through multiple str_replace lines.
UPDATE:
Solved thanks to you guys using the following combination:
$keywords = "...";
$stopwords = array("for","each");
foreach($stopwords as $stopWord)
{
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
$keywords = "...";
$stopWords = array("for","sale");
foreach($stopWords as $stopWord){
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
Try it this way
$keywords = preg_replace( '/(?<!\w)(for|sale)(?!\w)/', '', $keywords );
You can use word boundaries for this
$keywords = preg_replace('/\bfor\b/', '', $keywords);
or with multiple words
$keywords = preg_replace('/\b(?:for|sale)\b/', '', $keywords);
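As a quick check against the "baby formula" case from the question, here is a small sketch. The function name removeStopWords() is hypothetical, and the case-insensitive i modifier and the whitespace tidy-up are additions that go beyond the two lines above:

```php
<?php
// Hypothetical helper: removes whole-word stop words and tidies the leftover spaces.
function removeStopWords(string $keywords, array $stopWords): string
{
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $stopWords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    $keywords = preg_replace($pattern, '', $keywords);
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $keywords));
}

$stopWords = array('for', 'sale');
echo removeStopWords('baby formula', $stopWords), "\n";
// prints: baby formula
echo removeStopWords('baby formula for sale', $stopWords), "\n";
// prints: baby formula
```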
While Armel's answer will work, it is not performing optimally. Yes, your desired output will require word boundaries and probably case-insensitive matching, but:
Word boundaries gain nothing from being wrapped in parentheses.
Performing iterated preg_replace() calls for each element in the blacklist array is not efficient. Doing so asks the regex engine to perform wave after wave of individual keyword checks on the full string.
I recommend building a single regex pattern that will check for all keywords during each step of traversing the string, one time only. To generate the single pattern dynamically, you only need to implode your blacklist array of elements with | (pipes), which represent the "OR" command in regex. By wrapping all of the pipe-delimited keywords in a non-capturing group ((?:...)), the word boundaries (\b) serve their purpose for all keywords in the blacklist array.
Code:
$string = "Each person wants peaches for themselves forever";
$blacklist = array("for", "each");
// if you might have non-letter characters that have special meaning to the regex engine
//$blacklist = array_map(function($v){return preg_quote($v, '/');}, $blacklist);
//print_r($blacklist);
echo "Without wordboundaries:\n";
var_export(preg_replace('/' . implode('|', $blacklist) . '/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries:\n";
var_export(preg_replace('/\b(?:' . implode('|', $blacklist) . ')\b/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries and consecutive space mop up:\n";
var_export(trim(preg_replace(array('/\b(?:' . implode('|', $blacklist) . ')\b/i', '/ \K +/'), '', $string)));
Output:
Without wordboundaries:
' person wants pes  themselves ever'
---
With wordboundaries:
' person wants peaches  themselves forever'
---
With wordboundaries and consecutive space mop up:
'person wants peaches themselves forever'
p.s. / \K +/ is the second pattern fed to preg_replace() which means the input string will be read a second time to search for 2 or more consecutive spaces. \K means "restart the fullstring match from here"; effectively it releases the previously matched space. Then one or more spaces to follow are matched and replaced with an empty string.
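If \K is unfamiliar, the following sketch (with a made-up input string) compares it to a more conventional way of squeezing repeated spaces; both calls should give the same result:

```php
<?php
$spaced = 'person  wants   peaches';
// " \K +" matches one space, forgets it via \K, then removes the spaces that follow.
$withK = preg_replace('/ \K +/', '', $spaced);
// Equivalent without \K: replace each run of two or more spaces with a single space.
$withoutK = preg_replace('/ {2,}/', ' ', $spaced);
var_dump($withK === $withoutK);
// prints: bool(true)
echo $withK;
// prints: person wants peaches
```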