I am trying to replace some "common" words from a large block of text, however it's only using the last word from the array, please can you see where I'm going wrong?
Thanks
$glue = strtolower ($glue);//make all lower case
//remove common words
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword)
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $glue);
The extract above only removes 'for' from the text, 'the' and 'to' are still included.
Any help appreciated.
The problem is that the subject of your preg_replace() is always $glue, which itself never changes. Before iterating your list of words, you need to assign the starting contents of $glue into $filtered since that is what you are acting on in order to accumulate all the values into it.
// $filtered is the string you'll be modifying...
$filtered = strtolower ($glue);//make all lower case
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword) {
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $filtered);
}
But we can do better.
A single regular expression can handle all the replacements without a loop by using an (a|b|c) alternation group.
// Stick the words together with pipes
$pattern = implode("|", $Maffwordlist);
// And surround with regex delimiters and ()
// so the whole regex looks like /\s(the|to|for)\s/
$pattern = '/\s(' . $pattern . ')\s/';
// And do the operation in one go:
$filtered = preg_replace($pattern, " ", $filtered);
I'll note you may wish to use \b word boundaries instead of \s delimiting these by whitespace. That way, you would get proper replacements in a sentence like "You should not end a sentence with for." where one of your list words appears but not bound by whitespace.
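As a rough sketch of the \b variant (the sample sentence is made up for illustration, and preg_quote() is an extra precaution that the code above does not use), it could look something like this:

```php
<?php
// Hypothetical sample text; the word list is the one from the question.
$glue = 'The shop is what I came to town for.';
$Maffwordlist = array('the', 'to', 'for');

// Escape each word in case it contains regex metacharacters.
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $Maffwordlist);

// \b matches at the start/end of the string and next to punctuation,
// so "the" at the start and "for." at the end are both handled.
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';
$filtered = preg_replace($pattern, ' ', strtolower($glue));

// Collapse the leftover runs of whitespace and trim the ends.
$filtered = trim(preg_replace('/\s+/', ' ', $filtered));

echo $filtered; // shop is what i came town .
```

Note that the punctuation stays behind, so you may want to strip it in a separate step.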
Finally then, you'll end up with multiple consecutive spaces in some places where replacements have taken place. You can collapse those into single spaces with something like the following.
// Replace multiple spaces with a single space
$filtered = preg_replace('/\s+/', ' ', $filtered);
Related
I have an array of words
$banned_names = array('about','access','account');
The actual array is very long and contains bad words, so to avoid breaking any rules I have only added an example. The issue I'm having is the following:
$title = str_ireplace($filterWords, '****', $dn1['title']);
This works however, one of my filtered words is 'rum' and if I was to post the word 'forum' it will display as 'fo****'
So I need to only replace the word with **** if it matches the exact word from the array, if I was to give an example the phrase "Lets check the forum and see if anyone has rum", would be "Lets check the forum and see if anyone has ****".
Similar to the other answers but this uses \b in regex to match word boundaries (whole words). It also creates the regex-compatible banned list on the fly before passing to preg_replace_callback().
$dn1['title'] = 'access forum';
$banned_names = array('about','access','account','rum');
$banned_list = array_map(function($r) { return '/\b' . preg_quote($r, '/') . '\b/'; }, $banned_names);
$title = preg_replace_callback($banned_list, function($m) {
return $m[0][0].str_repeat('*', strlen($m[0])-1);
}, $dn1['title']);
echo $title; //a***** forum
You can use regex with \W to match a "non-word" character:
var_dump(preg_match('/\Wrum\W/i', 'the forum thing')); // returns 0 i.e. doesn't match
var_dump(preg_match('/\Wrum\W/i', 'the rum thing')); // returns 1 i.e. matches
The preg_replace() method takes an array of filters like str_replace() does, but you'll have to adjust the list to include the pattern delimiters and the \W on both sides. You could store the full patterns statically in your list:
$banlist = ['/\Wabout\W/i','/\Waccess\W/i', ... ];
preg_replace($banlist, '****', $text);
Or adjust the array on the fly to add those bits.
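One caveat worth checking: \W has to match an actual character on each side, so a banned word at the very start or end of the text is missed. A minimal sketch of building the patterns on the fly, assuming \b is acceptable in place of \W (the short ban list here is only sample data):

```php
<?php
// Hypothetical sample data; the real ban list is not shown in the question.
$banlist = array('about', 'access', 'rum');

// \W needs a character on both sides, so a lone word does not match.
var_dump(preg_match('/\Wrum\W/i', 'rum')); // int(0)

// Assumption: \b is used instead, since it also matches at the string edges.
$patterns = array_map(function ($word) {
    return '/\b' . preg_quote($word, '/') . '\b/i';
}, $banlist);

$text = 'Lets check the forum and see if anyone has rum';
$clean = preg_replace($patterns, '****', $text);

echo $clean; // Lets check the forum and see if anyone has ****
```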
You can use preg_replace() to look for your needles with a beginning/end of string tag after converting each string in your haystack to an array of strings, so you'll be matching on full words. Alternatively you can add spaces and continue to use str_ireplace() but that option would fail if your word is the first or last word in the string being checked.
Adding spaces (will miss first/last word, not recommended):
You'll have to modify your filtering array first of course. And yes the foreach could be simpler, but I hope this makes clear what I'm doing/why.
foreach($filterWords as $key => $value){
$filterWords[$key] = " ".$value." ";
}
// str_ireplace() returns the new string, so keep the result
$title = str_ireplace ( $filterWords, "****", $dn1['title'] );
OR
Breaking up long string (recommended):
foreach($filterWords as $key => $value){
$filterWords[$key] = "/^".$value."$/i"; //add regex for beginning/end of string value
}
// preg_replace() returns an array here, so join the words back together
$title = implode(" ", preg_replace ( $filterWords, "****", explode(" ", $dn1['title']) ));
I have an array of words and a string and want to add a hashtag to the words in the string that they have a match inside the array. I use this loop to find and replace the words:
foreach($testArray as $tag){
$str = preg_replace("~\b".$tag."~i","#\$0",$str);
}
Problem: let's say I have the words "is" and "isolated" in my array. I will get ##isolated in the output. This means that the word "isolated" is matched once for "is" and once for "isolated". The pattern ignores the fact that "#isolated" no longer starts with "is"; it now starts with "#".
I give an example below, BUT it is only an example; I don't want to solve just this case but every other possibility:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
Output will be:
this #is ##isolated #is an example of this and that
You may build a regex with an alternation group enclosed with word boundaries on both ends and replace all the matches in one pass:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
echo preg_replace('~\b(?:' . implode('|', $testArray) . ')\b~i', '#$0', $str);
// => this #is #isolated #is an example of this and that
The regex will look like
~\b(?:is|isolated|somethingElse)\b~
If you want to make your approach work, you might add a negative lookbehind after \b: "~\b(?<!#)".$tag."~i","#\$0". The lookbehind will fail all matches that are preceded with #.
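For reference, the loop from the question with that lookbehind dropped in might look like the sketch below. The preg_quote() call is my addition, in case a tag ever contains regex metacharacters:

```php
<?php
$str = "this is isolated is an example of this and that";
$testArray = array('is', 'isolated', 'somethingElse');

foreach ($testArray as $tag) {
    // (?<!#) rejects any position that is already preceded by a hashtag,
    // so "#isolated" is not tagged a second time by the 'isolated' entry.
    $str = preg_replace('~\b(?<!#)' . preg_quote($tag, '~') . '~i', '#$0', $str);
}

echo $str; // this #is #isolated #is an example of this and that
```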
A way to do that is to split your string by words and to build an associative array from your original array of words (to avoid the use of in_array):
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
$hash = array_flip(array_map('strtolower', $testArray));
$parts = preg_split('~\b~', $str);
for ($i=1; $i<count($parts); $i+=2) {
$low = strtolower($parts[$i]);
if (isset($hash[$low])) $parts[$i-1] .= '#';
}
$result = implode('', $parts);
echo $result;
This way, your string is processed only once, whatever the number of words in your array.
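The same single-pass idea can also be sketched with preg_replace_callback() instead of preg_split(). This is a swapped-in variant, not the code above, and it assumes that a "word" is simply a run of \w characters:

```php
<?php
$str = "this is isolated is an example of this and that";
$testArray = array('is', 'isolated', 'somethingElse');

// Lowercased words become array keys, so each lookup is a cheap isset().
$hash = array_flip(array_map('strtolower', $testArray));

// Visit every word once and prefix it with # if it is in the list.
$result = preg_replace_callback('~\w+~', function ($m) use ($hash) {
    return isset($hash[strtolower($m[0])]) ? '#' . $m[0] : $m[0];
}, $str);

echo $result; // this #is #isolated #is an example of this and that
```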
I've tried to make a tool in which you input a website and when you click the submit button it cURLS all the text.
After all the cURLing, stripping the tags, and counting the words, the result ends up in an array named $frequency. If I echo it using <pre> tags it shows me everything just fine! (NOTE: I'm placing the contents in a file, $homepage = file_get_contents($file);, and this is what I work with in my code. I don't know if this matters or not.)
However, I don't really care if the word "or" is seen 200 times on a website; I only want the important words. So I have made an array with all the common words, which eventually ends up in the $common_words variable. But I can't seem to find a way to replace every word in $frequency with "" if it is also found in $common_words.
I've found this piece of code after some research:
$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");
foreach ($wordlist as &$word) {
$word = '/\b' . preg_quote($word, '/') . '\b/';
}
$string = preg_replace($wordlist, '', $string);
var_dump($string);
If I copy paste this it works fine, removing the or, and, where from the string.
But replacing $string with $frequency or replacing $wordlist with $common_words will either not work or throw me an error like: Delimiter must not be alphanumeric or backslash
I hope i've formulated my question properly, if not. Please tell me!
Thanks in advance
EDIT: Alright, I've narrowed down the problem a lot. First of all, I forgot the & inside the foreach ($wordlist as &$word) {
But as it was counting all the words, the words it has replaced are all still counted. See those 2 screenshots to see what I mean: http://imgur.com/oqqZR3h,xHEZKRz#0
If I understand this correctly, you want to know how many occurrences each word has, ignoring the so-called common words.
Assuming that $url is the page you will be running against and $common_words is your common words array, here is what you can do:
// Get the page content's and strip the html tags
$contents = strip_tags( file_get_contents($url) );
// This will split the words from the contents, creating an array with each word in it
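preg_match_all("/([\w]+[']?[\w]*)\W/", $contents, $words);
$common_words = array('or', 'and', 'I', 'where');
// Count occurrences; capture group 1 leaves out the trailing non-word character
$frequency = array_count_values($words[1]);
// Drop the common words from the counts
$frequency = array_diff_key($frequency, array_flip($common_words));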
unset($words); // Release all that memory
var_dump($frequency);
At this point you will have an associative array keyed by each word that is not a common word, with a count showing the number of occurrences of that word.
UPDATE
A bit more about the RegEx. We need to match words. The easiest way possible is: (\w+). But that won't match words like I've or haven't (notice the '). That was my reason for making it more complicated. Also, \w doesn't cover dashes in words like 6-year-old.
So I created a subgroup which should match word characters, including dashes and single quotes inside a word.
(?:\w'\w|\w|-)
The ?: at the beginning makes the group non-capturing, i.e. it is not included in the results on its own. That is fine, since all I am doing is grouping the options for word contents. To match an entire word, the RegEx matches one or more of the subgroup above:
((?:\w'\w|\w|-)+)
So the RegEx preg_match_all() line should be:
preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);
Hope this helps.
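Putting the pieces together, a minimal sketch of the whole count might look like this. The sample text and the short common-word list are made up for illustration, and lowercasing before counting is my assumption about what is wanted:

```php
<?php
// Hypothetical sample text standing in for the stripped page contents.
$contents = "I've seen the 6-year-old and the dog, and I haven't seen the cat";
$common_words = array('the', 'and', 'i');

// Match whole words, allowing an inner apostrophe and dashes.
preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);

// Lowercase so that "The" and "the" are counted as the same word.
$lower = array_map('strtolower', $words[1]);

// Remove the common words, then count what is left.
$frequency = array_count_values(array_diff($lower, $common_words));
arsort($frequency);

print_r($frequency);
```

Here "seen" should come out on top with a count of 2, and "the", "and" and "i" should not appear at all.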
I replaced $wordlist with $mywordlist and it still works:
<?php
$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");
$mywordlist=array("sand","band");
foreach ($mywordlist as &$word) {
$word = '/\b' . preg_quote($word, '/') . '\b/';
}
$string = preg_replace($mywordlist, '', $string);
var_dump($string);
?>
I suppose you can do simply like this:
$common_words = "foo baq etc etc";
$str = "foo bar baz"; // input
foreach (explode(" ", $common_words) as $word){
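// strtr() maps single characters, so use a whole-word regex replacement instead
$str = preg_replace('/\b' . preg_quote($word, '/') . '\b/', '', $str);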
}
I am working with some code in PHP that grabs the referrer data from a search engine, giving me the query that the user entered.
I would then like to remove certain stop words from that string if they exist. However, the word may or may not have a space at either end.
For example, I have been using str_replace to remove a word as follows:
$keywords = str_replace("for", "", $keywords);
$keywords = str_replace("sale", "", $keywords);
but if the $keywords value is "baby formula" it will change it to "baby mula" - removing the "for" part.
Without having to create further str_replace's that account for " for" and "for " - is there a preg_replace type command I could use that would remove the given word if it is found with a space at either end?
My idea would be to put all of the stop words into an array and step through them that way and I suspect that a preg_replace is going to be quicker than stepping through multiple str_replace lines.
UPDATE:
Solved thanks to you guys using the following combination:
$keywords = "...";
$stopwords = array("for","each");
foreach($stopwords as $stopWord)
{
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
$keywords = "...";
$stopWords = array("for","sale");
foreach($stopWords as $stopWord){
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
Try it this way
$keywords = preg_replace( '/(?<!\w)(for|sale)(?!\w)/', '', $keywords );
You can use word boundaries for this
$keywords = preg_replace('/\bfor\b/', '', $keywords);
or with multiple words
$keywords = preg_replace('/\b(?:for|sale)\b/', '', $keywords);
While Armel's answer will work, it is not performing optimally. Yes, your desired output will require wordboundaries and probably case-insensitive matching, but:
Wordboundaries gain nothing from being wrapped in parentheses.
Performing iterated preg_replace() calls for each element in the blacklist array is not efficient. Doing so will ask the regex engine to perform wave after wave of individual keyword checks on the full string.
I recommend building a single regex pattern that will check for all keywords during each step of traversing the string -- one time. To generate the single pattern dynamically, you only need to implode your blacklist array of elements with | (pipes) which represent the "OR" command in regex. By wrapping all of the pipe-delimited keywords in a non-capturing group ((?:...)), the wordboundaries (\b) serve their purpose for all keywords in the blacklist array.
Code: (Demo)
$string = "Each person wants peaches for themselves forever";
$blacklist = array("for", "each");
// if you might have non-letter characters that have special meaning to the regex engine
//$blacklist = array_map(function($v){return preg_quote($v, '/');}, $blacklist);
//print_r($blacklist);
echo "Without wordboundaries:\n";
var_export(preg_replace('/' . implode('|', $blacklist) . '/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries:\n";
var_export(preg_replace('/\b(?:' . implode('|', $blacklist) . ')\b/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries and consecutive space mop up:\n";
var_export(trim(preg_replace(array('/\b(?:' . implode('|', $blacklist) . ')\b/i', '/ \K +/'), '', $string)));
Output:
Without wordboundaries:
' person wants pes  themselves ever'
---
With wordboundaries:
' person wants peaches  themselves forever'
---
With wordboundaries and consecutive space mop up:
'person wants peaches themselves forever'
p.s. / \K +/ is the second pattern fed to preg_replace() which means the input string will be read a second time to search for 2 or more consecutive spaces. \K means "restart the fullstring match from here"; effectively it releases the previously matched space. Then one or more spaces to follow are matched and replaced with an empty string.
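If you need this in more than one place, the single-pattern approach could be wrapped in a small helper along these lines. The function name remove_stopwords() is hypothetical, and it collapses whitespace with /\s+/ rather than the \K pattern, purely for readability:

```php
<?php
// Hypothetical helper: removes whole-word, case-insensitive stop words.
function remove_stopwords($text, array $stopwords)
{
    if (!$stopwords) {
        return $text; // nothing to remove
    }

    // Escape each word, then join them into one alternation.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // Remove the words, then tidy up the leftover spaces.
    $text = preg_replace($pattern, '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo remove_stopwords('baby formula for sale', array('for', 'sale')); // baby formula
```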
Here is my problem:
Using preg_replace('#\b(word)\b#','****',$text);
Where in text I have word\word and word, the preg_replace above replaces both word\word and word so my resulting string is ***\word and ***.
I want my string to look like : word\word and ***.
Is this possible? What am I doing wrong???
LATER EDIT
I have an array with urls, I foreach that array and preg_replace the text where url is found, but it's not working.
For instance, I have http://www.link.com and http://www.link.com/something
If I have http://www.link.com it also replaces http://www.link.com/something.
You are effectively specifying that you don't want certain characters to count as word boundary. Therefore you need to specify the "boundaries" yourself, something like this:
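preg_replace('#(^|[^\w\\\\])(word)([^\w\\\\]|$)#', '$1***$3', $text);
What this does is search for the word surrounded by string boundaries or by non-word characters other than the backslash \. The backslash is written as \\\\ because the PHP string literal and the regex each unescape it once, and $1 and $3 put the surrounding characters back. It will therefore match .word, but not .word\ and not \word. If you need to exclude other characters from matching, just add them inside the brackets.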
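Another way to express "my own boundaries" is with lookarounds, which check the neighbouring characters without consuming them. This is only a sketch of that idea, using the sample text from the question:

```php
<?php
// Sample text from the question; single quotes keep the backslash literal.
$text = 'word\word and word';

// (?<![\w\\]) : not preceded by a word character or a backslash
// (?![\w\\])  : not followed by a word character or a backslash
// Each regex backslash is written as \\\\ in the PHP string literal.
$pattern = '#(?<![\w\\\\])word(?![\w\\\\])#';

$result = preg_replace($pattern, '***', $text);

echo $result; // word\word and ***
```

For the URL case you would add / (and any other characters you care about) to both character classes in the same way.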
You could just use str_replace("word\word", "word\word and"). I don't really see why you would need to use preg_replace in the case given above.
Here is a simple solution that doesn't use a regex. It will ONLY replace occurrences of 'word' where it is a lone word.
<?php
$text = "word\word word cat dog";
$new_text = "";
$words = explode(" ",$text); // split the string into separate 'words'
$inc = 0; // loop counter
foreach($words as $word){
if($word == "word"){ // if the current word in the array of words matches the criteria, replace it
$words[$inc] = "***";
}
$new_text.= $words[$inc]." ";
$inc ++;
}
echo $new_text; // gives 'word\word *** cat dog'
?>