PHP replace common words from my file - php

I've tried to make a tool in which you input a website and, when you click the submit button, it cURLs all the text.
After all the cURLing, stripping the tags, and counting the words, I end up with an array named $frequency. If I echo it using <pre> tags it shows me everything just fine! (NOTE: I'm placing the contents in a file, $homepage = file_get_contents($file); and this is what I work with in my code. I don't know if this matters or not.)
However, I don't really care if the word "or" is seen 200 times on a website; I only want the important words. So I have made an array with all the common words, which eventually ends up in the $common_words variable. But I can't seem to find a way to replace the words in $frequency with "" if they are also found in $common_words.
I've found this piece of code after some research:
$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");
foreach ($wordlist as &$word) {
$word = '/\b' . preg_quote($word, '/') . '\b/';
}
$string = preg_replace($wordlist, '', $string);
var_dump($string);
If I copy paste this it works fine, removing the or, and, where from the string.
But replacing $string with $frequency or replacing $wordlist with $common_words will either not work or throw me an error like: Delimiter must not be alphanumeric or backslash
I hope I've formulated my question properly. If not, please tell me!
Thanks in advance
EDIT: Alright, I've narrowed down the problem a lot. First of all, I forgot the & inside the foreach ($wordlist as &$word) {
But because the words were counted before the replacement, the words it has replaced are all still counted. See these 2 screenshots to see what I mean: http://imgur.com/oqqZR3h,xHEZKRz#0

If I understand this correctly, you want to know how many occurrences each word has, ignoring the so-called common words.
Assuming that $url is the page you will be running against and $common_words is your common words array, here is what you can do:
// Get the page contents and strip the HTML tags
$contents = strip_tags( file_get_contents($url) );
// This will split the words from the contents, creating an array with each word in it
preg_match_all("/([\w]+[']?[\w]*)\W/", $contents, $words);
$common_words = array('or', 'and', 'I', 'where');
// Drop the common words, then count occurrences of the rest.
// $words[1] holds the captured words; $words[0] would include the trailing non-word character.
$frequency = array_count_values(array_diff($words[1], $common_words));
unset($words); // Release all that memory
var_dump($frequency);
At this point you will have an associative array with each non-common word and a count showing the number of occurrences of the given word.
UPDATE
A bit more about the RegEx. We need to match words. The easiest way possible is (\w+). But that won't match words like I've or haven't (notice the '), which is why I made it more complicated. Also, \w doesn't cover dashes in words like 6-year-old.
So I created a subgroup which matches word characters, including dashes and single quotes inside a word.
(?:\w'\w|\w|-)
The ?: at the beginning makes the group non-capturing, i.e. it is not included in the results. That is because all I am doing is grouping the options for word contents. To match an entire word, the RegEx will match one or more of the subgroup above:
((?:\w'\w|\w|-)+)
So the RegEx preg_match_all() line should be:
preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);
Hope this helps.
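Putting the pieces together, a minimal sketch of the whole count might look like the following. The sample text, the lowercasing step, and the $tokens name are assumptions added for illustration; the answer itself only shows the regex and array_count_values().

```php
<?php
// Hypothetical sample text standing in for the stripped page contents.
$contents = "I've seen the fox and the 6-year-old dog, or the fox saw me.";
$common_words = array('or', 'and', 'i', 'the');

// Match words, allowing an inner apostrophe (I've) and dashes (6-year-old).
preg_match_all("/((?:\w'\w|\w|-)+)/", $contents, $words);

// Lowercase so that "The" and "the" count as one word (an added assumption).
$tokens = array_map('strtolower', $words[1]);

// Drop the common words, then count what is left.
$frequency = array_count_values(array_diff($tokens, $common_words));
arsort($frequency); // most frequent first

print_r($frequency);
```

Filtering before counting means the common words never enter $frequency at all, which avoids the "replaced but still counted" effect described in the question's EDIT.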

I changed $wordlist to $mywordlist and it still works!
<?php
$string = 'sand band or nor and where whereabouts foo';
$wordlist = array("or", "and", "where");
$mywordlist=array("sand","band");
foreach ($mywordlist as &$word) {
$word = '/\b' . preg_quote($word, '/') . '\b/';
}
$string = preg_replace($mywordlist, '', $string);
var_dump($string);
?>
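The "Delimiter must not be alphanumeric or backslash" error from the question shows up when the raw words reach preg_replace() without the /.../ wrapping, for example when the & is dropped from the foreach. One way to sidestep the reference altogether is to build a separate pattern array with array_map(). This is only a sketch; $patterns is a name introduced here.

```php
<?php
$string = 'sand band or nor and where whereabouts foo';
$common_words = array('or', 'and', 'where');

// Build a separate array of patterns instead of modifying the word list in place.
$patterns = array_map(function ($word) {
    return '/\b' . preg_quote($word, '/') . '\b/';
}, $common_words);

$string = preg_replace($patterns, '', $string);
var_dump($string);
```

The original $common_words array stays untouched, so it can still be used elsewhere as a plain list of words.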

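I suppose you can do it simply like this:
$common_words = "foo baq etc etc";
$str = "foo bar baz"; // input
foreach (explode(" ", $common_words) as $word){
// str_replace removes every occurrence of $word, including inside longer words;
// strtr($str, $word, "") would leave the string unchanged
$str = str_replace($word, "", $str);
}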

Related

Replace words in a string including plural variations with apostrophes

I want to link matches for specific words in a sentence. Overall this is easy, and sample code could go like this:
$words = array("Facebook", "Apple");
$text = "Is Facebook's vr hardware better than Apple's current prototype?";
foreach($words as $w) {
$pattern = '/' . $w .'\b/i';
$link = '' . $w . '';
$text = preg_replace($pattern, $link, $text);
}
print $text;
However I would like to catch variations of words that have 's (apostrophe-s).
To do that I need to search for the two possible variations (with and without the 's), but the outcome also affects what text is used in the replacement.
I'm drawing a blank on how to proactively use preg_match and then alter preg_replace based on the outcome. Any advice appreciated.
Try using the optional ? quantifier and parentheses.
$pattern = '/' . $w .'(\'s)?\b/i';
This should match either version.
Now, to use the match in your replacement, you can add an extra set of parentheses, like this:
$pattern = '/(' . $w .'(\'s)?)\b/i';
Then insert the matched string into your replacement, like this:
$link = '$1';
The $1 in the replacement string will be replaced with whatever the outer parentheses of the match contain.
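As a rough end-to-end sketch of the capture-group approach: the /tag/... link target below is purely hypothetical (the question does not show the real URL scheme), and the leading \b and the preg_quote() call are additions not present in the answer.

```php
<?php
$words = array("Facebook", "Apple");
$text = "Is Facebook's vr hardware better than Apple's current prototype?";

foreach ($words as $w) {
    // Outer group $1 captures the word plus an optional 's.
    $pattern = '/\b(' . preg_quote($w, '/') . '(\'s)?)\b/i';
    // Hypothetical link target; substitute whatever URL the site really uses.
    $link = '<a href="/tag/' . urlencode(strtolower($w)) . '">$1</a>';
    $text = preg_replace($pattern, $link, $text);
}
print $text;
```

Because the link text comes from $1 rather than from $w, "Facebook's" keeps its 's inside the link and the original casing of the sentence is preserved.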

preg replace would ignore non-letter characters when detecting words

I have an array of words and a string and want to add a hashtag to the words in the string that they have a match inside the array. I use this loop to find and replace the words:
foreach($testArray as $tag){
$str = preg_replace("~\b".$tag."~i","#\$0",$str);
}
Problem: let's say I have the words "is" and "isolated" in my array. I will get ##isolated in the output. This means that the word "isolated" is found once for "is" and once for "isolated". The pattern ignores the fact that "#isolated" no longer starts with "is"; it starts with "#".
I give an example below, BUT it is only an example; I don't want to solve just this one case but every other possibility too:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
Output will be:
this #is ##isolated #is an example of this and that
You may build a regex with an alternation group enclosed with word boundaries on both ends and replace all the matches in one pass:
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
echo preg_replace('~\b(?:' . implode('|', $testArray) . ')\b~i', '#$0', $str);
// => this #is #isolated #is an example of this and that
The regex will look like
~\b(?:is|isolated|somethingElse)\b~i
If you want to make your approach work, you might add a negative lookbehind after \b: "~\b(?<!#)".$tag."~i","#\$0". The lookbehind will fail all matches that are preceded with #.
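If the tags can contain characters that are special in a regex (or the ~ delimiter itself), it is probably worth escaping them before building the alternation. A small sketch of that variant, with $escaped and $result being names introduced here:

```php
<?php
$str = "this is isolated is an example of this and that";
$testArray = array('is', 'isolated', 'somethingElse');

// Escape each word in case it contains regex metacharacters or the ~ delimiter.
$escaped = array_map(function ($tag) {
    return preg_quote($tag, '~');
}, $testArray);

// One pass over the string: a word already prefixed with # is never scanned again.
$result = preg_replace('~\b(?:' . implode('|', $escaped) . ')\b~i', '#$0', $str);
echo $result;
```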
Another way to do that is to split your string by words and to build an associative array from your original array of words (to avoid the use of in_array):
$str = "this is isolated is an example of this and that";
$testArray = array('is','isolated','somethingElse');
$hash = array_flip(array_map('strtolower', $testArray));
$parts = preg_split('~\b~', $str);
for ($i=1; $i<count($parts); $i+=2) {
$low = strtolower($parts[$i]);
if (isset($hash[$low])) $parts[$i-1] .= '#';
}
$result = implode('', $parts);
echo $result;
This way, your string is processed only once, whatever the number of words in your array.

Only last element of array being used when replacing text

I am trying to replace some "common" words from a large block of text, however it's only using the last word from the array, please can you see where I'm going wrong?
Thanks
$glue = strtolower ($glue);//make all lower case
//remove common words
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword)
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $glue);
The extract above only removes 'for' from the text, 'the' and 'to' are still included.
Any help appreciated.
The problem is that the subject of your preg_replace() is always $glue, which itself never changes. Before iterating your list of words, you need to assign the starting contents of $glue into $filtered since that is what you are acting on in order to accumulate all the values into it.
// $filtered is the string you'll be modifying...
$filtered = strtolower ($glue);//make all lower case
$Maffwordlist = array('the','to','for');
foreach($Maffwordlist as $Maffword) {
$filtered = preg_replace("/\s". $Maffword ."\s/", " ", $filtered);
}
But we can do better.
A regular expression can be constructed to handle all the replacements without a loop using a (a|b|c) grouping.
// Stick the words together with pipes
$pattern = implode("|", $Maffwordlist);
// And surround with regex delimiters and ()
// so the whole regex looks like /\s(the|to|for)\s/
$pattern = '/\s(' . $pattern . ')\s/';
// And do the operation in one go:
$filtered = preg_replace($pattern, " ", $filtered);
I'll note you may wish to use \b word boundaries instead of \s delimiting these by whitespace. That way, you would get proper replacements in a sentence like "You should not end a sentence with for." where one of your list words appears but not bound by whitespace.
Finally then, you'll end up with multiple consecutive spaces in some places where replacements have taken place. You can collapse those into single spaces with something like the following.
// Replace multiple spaces with a single space
$filtered = preg_replace('/\s+/', ' ', $filtered);
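For reference, here is how the steps above might fit together with the \b variant. The sample sentence is made up, and the preg_quote() and trim() calls are additions for safety and tidiness rather than part of the answer.

```php
<?php
// Hypothetical sample input.
$glue = "The cat went to the shop for milk. What is it for?";
$Maffwordlist = array('the', 'to', 'for');

$filtered = strtolower($glue);

// Escape each word, then join them into one alternation.
$escaped = array_map(function ($w) {
    return preg_quote($w, '/');
}, $Maffwordlist);

// \b matches at word edges, so "for?" is handled as well as " for ".
$pattern = '/\b(?:' . implode('|', $escaped) . ')\b/';
$filtered = preg_replace($pattern, '', $filtered);

// Collapse the gaps left behind and trim the ends.
$filtered = trim(preg_replace('/\s+/', ' ', $filtered));
echo $filtered;
```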

Dynamically Inserting Text Into String (PHP)

I have a PHP script that runs a search on my database based on terms put into the search box on a website. This returns a block of text. Let's say that my search term for right now is "test block". An example of my result would be this block of text:
This is a test block of text that was gathered from the database using
a search query.
Now, my question is: how can I "highlight" the search term within the block of text so the user can see why this result was pulled in the first place? Using the above example, something like the following would suffice:
This is a <b>test</b> <b>block</b> of text that was gathered from the database
from a search query.
I have tried a few things so far that will change the text, but the real problem I am running into has to do with case sensitivity. For example, if I used the code:
$exploded = explode(' ', $search_terms);
foreach($exploded as $word) {
// I have to use str_ireplace so the word is actually found
$result = str_ireplace($word, '<b>' . $word . '</b>', $result);
}
It would go through my $result and bold any instance of the words. This would look correct, as I wanted in my second example of the search result. But, in the case that the user uses "Test Block" instead of "test block", the search terms would be capitalized and appear as this:
This is a <b>Test</b> <b>Block</b> of text that was gathered from the database
from a search query.
This does not work for me, especially when the user is using lower-case search terms and they happen to fall at the beginning of a sentence.
Essentially, what I need to do is find the word in the string, insert text (<b> in this example) directly in front of the word, and then insert text directly after the word (</b> in this example) while preserving the word itself from being replaced. This rules preg_replace and str_replace out I believe, so I am really stuck on what to do.
Any leads would be greatly appreciated.
$exploded = explode(' ', $search_terms);
foreach($exploded as $word) {
// preg_replace with the i modifier finds the word in any case; $1 re-inserts the text exactly as matched
$result = preg_replace("/(".preg_quote($word, '/').")/i", "<b>$1</b>", $result);
}
Pattern matching http://www.php.net/manual/en/reference.pcre.pattern.syntax.php uses certain characters like . [ ] / * + etc., so if these occur in the pattern they need to be escaped first with preg_quote();
Patterns start and end with delimiters to identify the pattern http://www.php.net/manual/en/regexp.reference.delimiters.php
followed by modifiers http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
In this case i for case-insensitive
anything in ( brackets ) is captured for use later, either in the $matches parameter or in the replacement as $1 or \\1 for the first, $2 second etc.
Use preg_replace. In your example
$result = preg_replace("/\\b(" . preg_quote($word, '/') . ")\\b/i", '<b>$1</b>', $result);
You can use preg_replace:
foreach ($exploded as $word) {
$text = preg_replace("`(" . preg_quote($word, '`') . ")`Ui" , "<b>$1</b>" , $text);
}
$string = 'The quick brown fox jumped over the lazy dog.';
$search = "brown";
$pattern = "/" . preg_quote($search, "/") . "/";
$replacement = "<strong>".$search."</strong>";
echo preg_replace($pattern, $replacement, $string);
The quick <strong>brown</strong> fox jumped over the lazy dog.
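The ideas from these answers can be combined into one helper. This is only a sketch: highlightTerms() is a hypothetical name, and matching on word boundaries is an assumption that may not suit search terms beginning or ending with punctuation.

```php
<?php
// Hypothetical helper: wraps every search term found in $text in <b> tags,
// keeping the casing that appears in $text rather than in the search terms.
function highlightTerms(string $text, string $searchTerms): string
{
    // Split on whitespace and drop empty pieces.
    $terms = preg_split('/\s+/', trim($searchTerms), -1, PREG_SPLIT_NO_EMPTY);
    if (!$terms) {
        return $text;
    }
    $escaped = array_map(function ($t) {
        return preg_quote($t, '/');
    }, $terms);
    // One pass over the text, so inserted <b> tags are never re-scanned.
    return preg_replace('/\b(' . implode('|', $escaped) . ')\b/i', '<b>$1</b>', $text);
}

echo highlightTerms('This is a test block of text.', 'Test Block');
```

Doing all terms in a single preg_replace() call, instead of looping, also avoids a later term such as "b" matching inside the <b> tags inserted by an earlier iteration.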

regex to replace a given word with space at either side or not at all

I am working with some code in PHP that grabs the referrer data from a search engine, giving me the query that the user entered.
I would then like to remove certain stop words from that string if they exist. However, the word may or may not have a space at either end.
For example, I have been using str_replace to remove a word as follows:
$keywords = str_replace("for", "", $keywords);
$keywords = str_replace("sale", "", $keywords);
but if the $keywords value is "baby formula" it will change it to "baby mula" - removing the "for" part.
Without having to create further str_replace's that account for " for" and "for " - is there a preg_replace type command I could use that would remove the given word if it is found with a space at either end?
My idea would be to put all of the stop words into an array and step through them that way and I suspect that a preg_replace is going to be quicker than stepping through multiple str_replace lines.
UPDATE:
Solved thanks to you guys using the following combination:
$keywords = "...";
$stopwords = array("for","each");
foreach($stopwords as $stopWord)
{
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
$keywords = "...";
$stopWords = array("for","sale");
foreach($stopWords as $stopWord){
$keywords = preg_replace("/(\b)$stopWord(\b)/", "", $keywords);
}
Try it this way
$keywords = preg_replace( '/(?<!\w)(for|sale)(?!\w)/', '', $keywords );
You can use word boundaries for this
$keywords = preg_replace('/\bfor\b/', '', $keywords);
or with multiple words
$keywords = preg_replace('/\b(?:for|sale)\b/', '', $keywords);
While Armel's answer will work, it is not performing optimally. Yes, your desired output will require word boundaries and probably case-insensitive matching, but:
Word boundaries gain nothing from being wrapped in parentheses.
Performing iterated preg_replace() calls for each element in the blacklist array is not efficient. Doing so will ask the regex engine to perform wave after wave of individual keyword checks on the full string.
I recommend building a single regex pattern that will check for all keywords during each step of traversing the string -- one time. To generate the single pattern dynamically, you only need to implode your blacklist array of elements with | (pipes), which represent the "OR" command in regex. By wrapping all of the pipe-delimited keywords in a non-capturing group ((?:...)), the word boundaries (\b) serve their purpose for all keywords in the blacklist array.
Code: (Demo)
$string = "Each person wants peaches for themselves forever";
$blacklist = array("for", "each");
// if you might have non-letter characters that have special meaning to the regex engine
//$blacklist = array_map(function($v){return preg_quote($v, '/');}, $blacklist);
//print_r($blacklist);
echo "Without wordboundaries:\n";
var_export(preg_replace('/' . implode('|', $blacklist) . '/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries:\n";
var_export(preg_replace('/\b(?:' . implode('|', $blacklist) . ')\b/i', '', $string));
echo "\n\n---\n";
echo "With wordboundaries and consecutive space mop up:\n";
var_export(trim(preg_replace(array('/\b(?:' . implode('|', $blacklist) . ')\b/i', '/ \K +/'), '', $string)));
Output:
Without wordboundaries:
' person wants pes  themselves ever'
---
With wordboundaries:
' person wants peaches  themselves forever'
---
With wordboundaries and consecutive space mop up:
'person wants peaches themselves forever'
p.s. / \K +/ is the second pattern fed to preg_replace() which means the input string will be read a second time to search for 2 or more consecutive spaces. \K means "restart the fullstring match from here"; effectively it releases the previously matched space. Then one or more spaces to follow are matched and replaced with an empty string.
