Since I can't use preg_match (its UTF-8 support is somehow broken: it works locally but breaks in production), I want to find another way to match a word against a blacklist. The problem is that I want to search a string for an exact whole-word match only, not the first occurrence of the substring.
This is how I do it with preg_match:
preg_match('/\b(badword)\b/', strtolower($string));
Example string:
$string = "This is a string containing badwords and one badword";
I want to match only the "badword" at the end, not "badwords".
strpos($string, 'badword') matches the first one.
Any ideas?
Assuming you can do some pre-processing, you could replace all your punctuation marks with white spaces, put everything in lowercase, and then either:
Use strpos with something like strpos($document, ' badword ') in a while loop to keep iterating through your entire document;
Split the string at white spaces and compare each word with your list of bad words.
So if you were trying the first option, it would look something like this (untested):
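// $text is the body of text to process. The trailing space lets a word at the
// very end match; a leading space does the same for a word at the very start.
$document = ' ' . strtolower($text) . ' ';
// Replace punctuation with spaces (extend the list as needed)
$document = str_replace(str_split('!#$%^&*(),./'), ' ', $document);
$badWords = array(/* ... */);
foreach ($badWords as $word) {
    $needle = ' ' . $word . ' ';
    $offset = 0;
    while (($badWordIndex = strpos($document, $needle, $offset)) !== false) {
        // $badWordIndex is the offset of a whole-word match within $document
        $offset = $badWordIndex + 1;
    }
}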
EDIT: As per @jonhopkins's suggestion, adding a white space at the end should cater for the scenario where the wanted word is at the end of the document and is not followed by a punctuation mark.
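The second option (splitting at white spaces and comparing each word) could be sketched as follows. This is only an illustration: the helper name find_bad_words and the punctuation list are assumptions, not part of the answer above, and strtolower only lowercases ASCII letters.

```php
<?php
// Hypothetical helper (not from the answer above): returns the blacklisted
// words that occur as whole words in $text.
function find_bad_words(string $text, array $badWords): array
{
    // Assumed punctuation list; extend it for your own data.
    $punctuation = str_split('!#$%^&*(),./?;:');
    $clean = str_replace($punctuation, ' ', strtolower($text));
    // explode() yields empty strings for repeated spaces, so drop them.
    $words = array_filter(explode(' ', $clean), 'strlen');
    // Keep only the blacklist entries that appear among the words.
    return array_values(array_intersect($badWords, $words));
}

$string = "This is a string containing badwords and one badword";
var_dump(find_bad_words($string, ['badword', 'worse'])); // only "badword" is found
```

Because whole words are compared, "badwords" never matches "badword", and no regex is involved.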
If you want to mimic the \b modifier of regex you can try something like this:
$offset = 0;
$word = 'badword';
$boundaryChars = array(' ', '-' /* , ... */);
$matched = array();
while (($pos = strpos($string, $word, $offset)) !== false) {
    $end = $pos + strlen($word); // index of the first char after the match
    $leftBoundary = false;
    // If the match starts the string, it has a boundary on the left
    if ($pos === 0) {
        $leftBoundary = true;
    // Else we must check the previous char
    } elseif (in_array($string[$pos - 1], $boundaryChars)) {
        $leftBoundary = true;
    }
    $rightBoundary = false;
    // If the match ends the string, it has a boundary on the right
    if ($end === strlen($string)) {
        $rightBoundary = true;
    // Else we must check the char right after the match
    } elseif (in_array($string[$end], $boundaryChars)) {
        $rightBoundary = true;
    }
    // If it has both boundaries, we add the index to the matched ones...
    if ($leftBoundary && $rightBoundary) {
        $matched[] = $pos;
    }
    $offset = $end;
}
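For reuse, the same boundary check can be wrapped in a function. The sketch below is an assumption-laden variant: the name find_whole_word and the default list of boundary characters are made up for illustration, and it works on bytes, so it only recognises single-byte boundary characters.

```php
<?php
// Hypothetical helper: returns the offsets at which $word occurs in $haystack
// with a boundary (string edge or boundary character) on both sides.
function find_whole_word(string $haystack, string $word, array $boundaryChars = [' ', '-', '.', ',']): array
{
    $matched = [];
    if ($word === '') {
        return $matched; // an empty word would never advance the offset
    }
    $offset = 0;
    $length = strlen($word);
    while (($pos = strpos($haystack, $word, $offset)) !== false) {
        $end = $pos + $length; // index of the first char after the match
        $left = $pos === 0 || in_array($haystack[$pos - 1], $boundaryChars, true);
        $right = $end === strlen($haystack) || in_array($haystack[$end], $boundaryChars, true);
        if ($left && $right) {
            $matched[] = $pos;
        }
        $offset = $end;
    }
    return $matched;
}

$string = "This is a string containing badwords and one badword";
print_r(find_whole_word($string, 'badword')); // only offset 45 is reported
```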
You can use strrpos() instead of strpos:
strrpos — Find the position of the last occurrence of a substring in a string
$string = "This is a string containing badwords and one badword";
var_dump(strrpos($string, 'badword'));
Output:
45
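Note that strrpos only happens to work for this example because the whole word comes last; it does no boundary check. A quick comparison (the second string is made up for illustration):

```php
<?php
$a = "This is a string containing badwords and one badword";
$b = "one badword and badwords";

var_dump(strrpos($a, 'badword')); // int(45): the last occurrence is the whole word
var_dump(strrpos($b, 'badword')); // int(16): the last occurrence is the start of "badwords"
```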
A simple way to use word boundaries with unicode properties:
preg_match('/(?:^|[^\pL\pN_])(badword)(?:[^\pL\pN_]|$)/u', $string);
In fact it's much more complicated than that; have a look here.
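The same idea can be written with lookarounds, so that the neighbouring characters are not part of the match. This is a sketch that assumes a PCRE build with working Unicode support (which is exactly what the question says is broken in production); the helper name contains_whole_word is hypothetical.

```php
<?php
// Hypothetical helper: true if $word occurs in $text with no letter, digit or
// underscore directly before or after it.
function contains_whole_word(string $text, string $word): bool
{
    $pattern = '/(?<![\pL\pN_])' . preg_quote($word, '/') . '(?![\pL\pN_])/u';
    // preg_match returns false on invalid UTF-8, which is treated as "no match" here
    return preg_match($pattern, $text) === 1;
}

var_dump(contains_whole_word("badwords and one badword", 'badword')); // bool(true)
var_dump(contains_whole_word("only badwords", 'badword'));            // bool(false)
```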
Is it possible to skip a strpos/strrpos position?
$string = "This is a cookie 'cookie'.";
$finder = "cookie";
$replacement = "monster";
if (strrpos($string, $finder) !== false)
    $string = str_replace($finder, $replacement, $string);
I want to skip the 'cookie' and replace the plain cookie so it'll result in "This is a monster 'cookie'."
I don't have qualms with it finding 'cookie' first and then checking it (Obviously necessary to determine it shouldn't be replaced), but I want to make sure that while 'cookie' is still there, I can use the same function to find the unquoted cookie.
Alternatively, is there a function I haven't found yet (Through hours of searching) to get all indices of a particular word so I can check them all through a loop without the use of regex?
It's important that it's the index, not the word itself, as there are other checks that have to be done based on where in the string the word's located.
You can try a regex instead:
$string = "This is a cookie 'cookie'.";
var_dump(preg_replace("/(?<!')(cookie)/", 'monster', $string));
This uses preg_replace instead of str_replace; the negative lookbehind (?<!') skips any cookie that directly follows a single quote.
Edit: You can use preg_match to get the position of the matched regex in the string like:
$string = "This is a cookie 'cookie'.";
$finder = "cookie";
preg_match("/(?<!')(" . preg_quote($finder) . ")/", $string, $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);
You can use preg_quote to make sure that preg_match and preg_replace don't treat the $finder variable as a regex. The difference in performance between the preg functions and the other string functions in PHP is usually small; you can run some benchmarks to see how it varies in your case.
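To collect the index of every unquoted occurrence rather than just the first, preg_match_all accepts the same PREG_OFFSET_CAPTURE flag. A small sketch (the example string is extended here for illustration):

```php
<?php
$string = "This is a cookie 'cookie'. Another cookie.";
$finder = "cookie";

// Same lookbehind as above; preg_quote also escapes the "/" delimiter
$pattern = "/(?<!')" . preg_quote($finder, '/') . "/";
preg_match_all($pattern, $string, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is [matched text, byte offset]
$positions = array_column($matches[0], 1);
print_r($positions); // offsets 10 and 35; the quoted one at 18 is skipped
```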
The following gives the required replacement as well as the position of the replaced word.
$string = "This is a cookie 'cookie'.";
$finder = "cookie";
$replacement = "monster";
$p = -1; // helps get position of current word
$position = -1; // the position of the word replaced
$arr = explode(' ',$string);
for ($i = 0; $i < count($arr); $i += 1) {
    // Find the position $p of each word and
    // catch $position when a replacement is made
    if ($i == 0) {
        $p = 0;
    } else {
        $p += strlen($arr[$i - 1]) + 1;
    }
    if ($arr[$i] == $finder) {
        if ($position < 0) {
            $position = $p;
        }
        $arr[$i] = $replacement;
    }
}
$newstring = implode(' ', $arr);
echo $newstring; // gives: This is a monster 'cookie'.
echo '<br/>';
echo $position; // gives 10, the position of replaced element.
For the position, the assumption is that the sentence has only single spaces because spaces are used in the explode and implode functions. Otherwise a case of double or larger spaces would require modifications, possibly by replacing spaces with a unique character or set of characters such as #$# which would be used as the first argument of the explode and implode functions.
The code could be modified to capture more than one replacement, e.g. by storing each replaced position in an array instead of testing $position < 0. This would also require changing the way the position is computed, because its value is affected by the lengths of previous replacements.
As a shorthand, you can also include the preceding word in the str_replace search string:
$string = "This is a cookie 'cookie'.";
echo str_replace('a cookie','a monster',$string);
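The question also asks for a way to get every index of a word without regex. One possible sketch is a strpos loop that advances the offset after each hit; the function name find_all_positions is hypothetical, and the "skip if preceded by a quote" rule is just one way to read the requirement.

```php
<?php
// Hypothetical helper: returns the offsets of all non-overlapping occurrences.
function find_all_positions(string $haystack, string $needle): array
{
    $positions = [];
    if ($needle === '') {
        return $positions; // an empty needle would never advance the offset
    }
    $offset = 0;
    while (($pos = strpos($haystack, $needle, $offset)) !== false) {
        $positions[] = $pos;
        $offset = $pos + strlen($needle);
    }
    return $positions;
}

$string = "This is a cookie 'cookie'.";
$finder = "cookie";
foreach (find_all_positions($string, $finder) as $pos) {
    // Skip occurrences that directly follow a single quote
    if ($pos > 0 && $string[$pos - 1] === "'") {
        continue;
    }
    $string = substr_replace($string, "monster", $pos, strlen($finder));
    break; // replace only the first unquoted occurrence
}
echo $string; // This is a monster 'cookie'.
```

Since each position is available inside the loop, any other position-based checks can be added next to the quote test.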
For sure this has already been asked by someone else, however I've searched here on SO and found nothing https://stackoverflow.com/search?q=php+parse+between+words
I have a string and want to get an array with all the words contained between 2 delimiters (2 words). I am not confident with regex, so I ended up with this solution, but it is not appropriate because I need to get all the substrings that match those requirements, not only the first one.
$start_limiter = 'First';
$end_limiter = 'Second';
$haystack = $string;
# Step 1. Find the start limiter's position
$start_pos = strpos($haystack,$start_limiter);
if ($start_pos === FALSE)
{
die("Starting limiter ".$start_limiter." not found in ".$haystack);
}
# Step 2. Find the ending limiters position, relative to the start position
$end_pos = strpos($haystack,$end_limiter,$start_pos);
if ($end_pos === FALSE)
{
die("Ending limiter ".$end_limiter." not found in ".$haystack);
}
# Step 3. Extract the string between the two limiters.
# The text starts right after the start limiter, and its length is the distance
# from there to the position of the end limiter.
$text_start = $start_pos + strlen($start_limiter);
$needle = substr($haystack, $text_start, $end_pos - $text_start);
echo "Found $needle";
I thought also about using explode() but I think a regex could be better and faster.
I'm not very familiar with PHP, but it seems to me that you can use something like:
if (preg_match("/(?<=First).*?(?=Second)/s", $haystack, $result))
print_r($result[0]);
(?<=First) looks behind for First but doesn't consume it,
.*? lazily matches everything in between First and Second,
(?=Second) looks ahead for Second but doesn't consume it,
The s at the end is to make the dot . match newlines if any.
To get all the text between those delimiters, use preg_match_all and loop over the captured group to get each element:
if (preg_match_all("/(?<=First)(.*?)(?=Second)/s", $haystack, $result)) {
    // $result[1] holds every captured substring
    foreach ($result[1] as $match) {
        echo $match, "\n";
    }
}
Not sure that the result will be faster than your code, but you can do it like this with regex:
$pattern = '~(?<=' . preg_quote($start, '~')
. ').+?(?=' . preg_quote($end, '~') . ')~si';
if (preg_match($pattern, $subject, $match))
print_r($match[0]);
I use preg_quote to escape all characters that have a special meaning in a regex (like +*|()[]{}.? and the pattern delimiter ~).
(?<=..) is a lookbehind assertion that checks for a substring before what you want to find.
(?=..) is a lookahead assertion (the same thing for what comes after).
.+? means any character one or more times, but as few as possible (the question mark makes the quantifier lazy).
s allows the dot to match newlines (not the default behavior).
i makes the search case insensitive (you can remove it if you don't need it).
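Since the question asks for every match rather than the first, the same preg_quote approach can be combined with preg_match_all. This is a sketch with a hypothetical function name (get_all_between); it uses a capture group instead of lookarounds, which gives the same substrings here.

```php
<?php
// Hypothetical helper: returns every substring found between $start and $end.
function get_all_between(string $subject, string $start, string $end): array
{
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~s';
    preg_match_all($pattern, $subject, $matches);
    return $matches[1]; // contents of the capture group for each match
}

print_r(get_all_between("First one Second, First two Second", 'First', 'Second'));
// the two elements are " one " and " two " (surrounding spaces included)
```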
This lets you run the same function with different parameters, so you don't have to rewrite this bit of code every time. It also uses strpos, which you already use. It has been working great for me.
function get_string_between($string, $start, $end){
    $ini = strpos($string, $start);
    // No start delimiter: nothing to return
    if ($ini === false) return "";
    $ini += strlen($start);
    $endPos = strpos($string, $end, $ini);
    // No end delimiter after the start: nothing to return
    if ($endPos === false) return "";
    return substr($string, $ini, $endPos - $ini);
}
$fullstring = 'This is a long set of words that I am going to use.';
$parsed = get_string_between($fullstring, 'This', "use");
echo $parsed;
Will output:
is a long set of words that I am going to
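The strpos approach can also be extended to return every occurrence, which is what the question asks for. A minimal sketch, with a hypothetical function name (get_all_strings_between):

```php
<?php
// Hypothetical helper: collects every substring between $start and $end.
function get_all_strings_between(string $string, string $start, string $end): array
{
    $results = [];
    if ($start === '' || $end === '') {
        return $results; // empty delimiters would never advance the offset
    }
    $offset = 0;
    while (($ini = strpos($string, $start, $offset)) !== false) {
        $ini += strlen($start);
        $endPos = strpos($string, $end, $ini);
        if ($endPos === false) {
            break; // start delimiter without a closing end delimiter
        }
        $results[] = substr($string, $ini, $endPos - $ini);
        $offset = $endPos + strlen($end);
    }
    return $results;
}

print_r(get_all_strings_between('[a][bc]', '[', ']')); // "a" and "bc"
```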
Here's a simple example for finding everything between the words 'mega' and 'yo' for the string $t.
PHP Example
$t = "I am super mega awesome-sauce, yo!";
$arr = [];
preg_match("/mega\ (.*?)\ yo/ims", $t, $arr);
echo $arr[1];
PHP Output
awesome-sauce,
You can also use two explode statements.
For example, say you want to get "z" in y=mx^z+b. To get z:
$formula="y=mx^z+b";
$z=explode("+",explode("^",$formula)[1])[0];
First I get everything after ^: explode("^",$formula)[1]
Then I get everything before +: explode("+",$previousExplode)[0]
I need to remove every double-letter occurrence from a word. (E.g. "attached" has to become "aached".)
I wrote this function:
function strip_doubles($string, $positions) {
for ($i = 0; $i < strlen($string); $i++) {
$stripped_word[] = $string[$i];
}
foreach($positions['word'] as $position) {
unset($stripped_word[$position], $stripped_word[$position + 1]);
}
$returned_string= "";
foreach ($stripped_word as $value) {
$returned_string .= $value;
}
return $returned_string;
}
where $string is the word to be stripped and $positions is an array containing the positions of any first double letter.
It works perfectly, but how would a real programmer write the same function in a more condensed way? I have a feeling it should be possible to do the same thing without three loops and so much code.
Non-regex solution, tested:
$string = 'attached';
$stripped = '';
for ($i=0,$l=strlen($string);$i<$l;$i++) {
$matched = '';
// if current char is the same as the next, skip it
while (substr($string, $i, 1)==substr($string, $i+1, 1)) {
$matched = substr($string, $i, 1);
$i++;
}
// if current char is NOT the same as the matched char, append it
if (substr($string, $i, 1) != $matched) {
$stripped .= substr($string, $i, 1);
}
}
echo $stripped;
You should use a regular expression. It matches on certain characteristics and can replace the matched occurrences with some other string(s).
Something like
$result = preg_replace('#([a-zA-Z]{1})\1#i', '', $string);
should work. It tells the regexp to match one character from a-z followed by the match itself, i.e. two identical characters in a row. The # marks the start and end of the regexp. If you want more characters than just a-z and A-Z, you could use other character classes such as [a-zA-Z0-9], or . for any character, or \p{L}\p{M}* for Unicode letters (including combining marks; add the u flag when the input is UTF-8).
The i flag after the last # means 'case insensitive' and will instruct the regexp to also match combinations with different cases, like 'tT'. If you want only combinations in the same case, so 'tt' and 'TT', then remove the 'i' from the flags.
The '' tells the regexp to replace the matched occurrences (the two identical characters) with an empty string.
See http://php.net/manual/en/function.preg-replace.php and http://www.regular-expressions.info/
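A quick check of the regex approach on a couple of words, as a sketch (the wrapper name strip_double_letters is hypothetical, and the pattern is the case-sensitive ASCII variant):

```php
<?php
// Hypothetical wrapper around the preg_replace call discussed above.
function strip_double_letters(string $word): string
{
    // ([a-zA-Z]) captures one letter, \1 requires the same letter again
    return (string) preg_replace('#([a-zA-Z])\1#', '', $word);
}

echo strip_double_letters('attached'), "\n";   // aached
echo strip_double_letters('bookkeeper'), "\n"; // bper
echo strip_double_letters('aaab'), "\n";       // ab
```

Note the last case: the regex removes letters in pairs, so a run of three leaves one letter behind, whereas the loop-based version removes the whole run.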
$string can contain any text.
How do I remove the word "blank", only if it is the last word of $string?
So, if we have a sentence like "Steve Blank is here", nothing should be removed; but if the sentence is "his name is Granblank", then the trailing "blank" should be removed.
You can easily do it using a regex. The \b ensures it's only removed if it's a separate word.
$str = preg_replace('/\bblank$/', '', $str);
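To see what the \b does in practice, here is a small sketch. The i flag and the helper name remove_trailing_blank are additions for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: removes "blank" only when it is the last, separate word.
function remove_trailing_blank(string $str): string
{
    // The i flag is an addition here, so that "Blank" would be matched as well
    return (string) preg_replace('/\bblank$/i', '', $str);
}

var_dump(remove_trailing_blank("Steve Blank is here"));   // unchanged
var_dump(remove_trailing_blank("his name is blank"));     // "his name is "
var_dump(remove_trailing_blank("his name is Granblank")); // unchanged, because of \b
```

If "Granblank" should lose its ending too, as the question suggests, drop the \b from the pattern.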
As a variation on Teez's answer:
/**
* A slightly more readable, non-regex solution.
*/
function remove_if_trailing($haystack, $needle)
{
// The length of the needle as a negative number is where it would appear in the haystack
$needle_position = strlen($needle) * -1;
// If the last N letters match $needle
if (substr($haystack, $needle_position) == $needle) {
// Then remove the last N letters from the string
$haystack = substr($haystack, 0, $needle_position);
}
return $haystack;
}
echo remove_if_trailing("Steve Blank is here", 'blank'); // OUTPUTS: Steve Blank is here
echo remove_if_trailing("his name is Granblank", 'blank'); // OUTPUTS: his name is Gran
Try the below code:
$str = trim($str);
$strlength = strlen($str);
if (strcasecmp(substr($str, ($strlength-5), $strlength), 'blank') == 0)
    echo $str = substr($str, 0, ($strlength-5));
Don't use preg_match unless it is really required. The PHP manual itself recommends using plain string functions over regex functions when the match is straightforward (see the preg_match manual page).
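ThiefMaster is quite correct. A technique that doesn't involve the end-of-line $ regex character would be to use rtrim.
$trimmed = rtrim($str, "blank");
var_dump($trimmed);
^ Note that rtrim treats its second argument as a list of characters, not as a word: it strips any trailing run of the letters b, l, a, n and k, so it can remove more than "blank" (or only part of a different word). To also strip the space before it, add a space to the list ("\s" is not interpreted by rtrim):
$trimmed = rtrim($str, " blank");
var_dump($trimmed);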
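Because rtrim takes a list of characters rather than a word, its result can be surprising. The sketch below shows that behaviour next to a suffix comparison; the helper name remove_trailing_word is hypothetical.

```php
<?php
// rtrim strips any trailing characters that appear in the list "blank"
var_dump(rtrim("his name is blank", "blank")); // "his name is "
var_dump(rtrim("a plank", "blank"));           // "a p": l, a, n, k are stripped too

// Hypothetical helper: removes $word only if the string really ends with it
// (compared case-insensitively).
function remove_trailing_word(string $str, string $word): string
{
    $len = strlen($word);
    if ($len > 0 && strlen($str) >= $len && strcasecmp(substr($str, -$len), $word) === 0) {
        return substr($str, 0, -$len);
    }
    return $str;
}

var_dump(remove_trailing_word("a plank", "blank"));               // "a plank"
var_dump(remove_trailing_word("his name is Granblank", "blank")); // "his name is Gran"
```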
We have a variable $string, which contains some text like:
About 200 million CAPTCHAs are solved by humans around the world every day.
How can we get the last or first 2-3 letters of each word (whose length is more than 3 letters)?
I will then check them for matching text inside a foreach():
if ('ey' is matched in the end of some word) {
replace 'ey' with 'ei' in this word;
}
Thanks.
First, I'll give you an example of how to loop through a string and work with each word in the string.
Second, I'll explain each part of the code so that you can modify it to your exact needs.
Here is how to switch out the last 2 letters (if they are "ey") of each word that is more than 3 letters long.
<?php
// Example string
$string = 'Hey they ey shay play stay nowhey';
// Create array of words splitting at spaces
$string = explode(" ", $string);
// The search and replace strings
$lookFor = "ey";
$switchTo = "ei";
// Cycle through the words
foreach($string as $key => $word)
{
// If the word has more than 3 letters
if(strlen($word) > 3)
{
// If the last two letters are what we want
if ( substr($word, -2) == $lookFor )
{
// Replace the last 2 letters of the word
$string[$key] = substr_replace($word, $switchTo, -2);
}
}
}
// Recreate string from array
$string = implode(" ", $string);
// See what we got
echo $string;
// The above will print:
// Hey thei ey shay play stay nowhei
?>
I'll explain each function so that you can modify the above to exactly how you want it, since I don't precisely understand all your specifications:
explode() will take a string and split it apart into an array. The first argument is what you use to split it. The second argument is the string, so explode(" ", $string) will split $string by the use of spaces. The spaces will not be included in the array.
foreach() will cycle through each element of an array. foreach($string as $key => $word) will go through each element of $string and for each element it will assign the index number to $key and the value of the element (the word in this case) to $word.
strlen() returns how long a string is.
substr() returns a portion of a string. The first argument is the string, the second argument is where the substring starts, and an optional third argument is the length of the substring. With a negative start, the position is counted back from the end of the string, and without a length the substring runs to the end of the string. In other words, substr($word, -2) returns the last two letters. If you want the first two letters, you would use substr($word, 0, 2), since you're starting at the very beginning and want a length of 2 letters.
substr_replace() will replace a substring within a string. The first argument is the entire string. The second argument is your replacement substring. The third argument is where the replacement starts, and the fourth optional argument is the length of the substring, so substr_replace($word, $switchTo, -2) will take $word and, starting at the penultimate letter, replace everything from there to the end with $switchTo. In this case, we switch out the last two letters. If you want to replace the first two letters, you would use substr_replace($word, $switchTo, 0, 2).
implode() is the opposite of explode. It takes an array and forms it into a string using the separator specified.
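For comparison, the explode/foreach/implode steps described above can be collapsed into one regex replacement. This is only a sketch: the function name replace_word_endings is hypothetical, it assumes $from consists of word characters, and the arrow function needs PHP 7.4 or later. Unlike the explode version, it also handles words followed by punctuation.

```php
<?php
// Hypothetical helper: replaces the ending $from with $to in every word that
// is longer than 3 characters.
function replace_word_endings(string $text, string $from, string $to): string
{
    // A word longer than 3 characters needs at least this many characters before $from
    $minPrefix = max(1, 4 - strlen($from));
    $pattern = '/\b(\w{' . $minPrefix . ',})' . preg_quote($from, '/') . '\b/';
    // A callback avoids any special handling of "$" or "\" inside $to
    return (string) preg_replace_callback($pattern, fn(array $m) => $m[1] . $to, $text);
}

echo replace_word_endings('Hey they ey shay play stay nowhey', 'ey', 'ei');
// Hey thei ey shay play stay nowhei
```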
$string = 'About 200 million CAPTCHAs are solved by humans around the world every day.';
$result = array();
$words = explode(" ",$string);
foreach($words as $word){
if(strlen($word) > 3){
$result[] = substr($word,0,3); // first 3 characters; use -3 as the second parameter if you want the last three
}
}
function get_symbols($str, $reverse = false)
{
$symbols = array();
foreach (explode(' ', $str) as $word)
{
if ($reverse)
$word = strrev($word);
if (strlen($word) > 3)
$word = substr($word, 0, 3);
array_push($symbols, $word);
}
return $symbols;
}
EDIT:
function change_reverse_symbol_in_word($str, $symbol, $replace_to)
{
    $words = array();
    foreach (explode(' ', $str) as $word)
    {
        // Only words longer than 3 characters that end with $symbol are changed
        if (strlen($word) > 3 && substr($word, -strlen($symbol)) === $symbol)
        {
            $word = substr($word, 0, -strlen($symbol)) . $replace_to;
        }
        $words[] = $word;
    }
    return implode(' ', $words);
}
And if you want to use this as in your question, you call it like this:
$string_malformed = change_reverse_symbol_in_word($str, "ey", "ei");