Preg replace utf8 charset issue with à - php

I'm trying to add a special string '|||' after newlines, blankspaces and other characters. I'm doing this because I want to split my text into an array. So I was thinking to do it like this:
$result = preg_replace("/<br>/", "<br>|||", preg_replace("/\s/", " |||", preg_replace("/\r/", "\r|||", preg_replace("/\n/", "\n|||", preg_replace("/’/", "’|||", preg_replace("/'/", "'|||", $text))))));
$result = preg_split("/[|||]+/", $result);
It works with every word but words which contain à char. It is replaced by �.
I'm sure the problem is here because my string $text shows the char à.

Since your pattern deals with a Unicode string, pass the /u modifier.
Also, you do not need so many chained regex replacements, group the first patterns and use a backreference in the replacement.
Use
preg_replace("/(<br>|[\s’'])/u", "$1|||", $text)
Note that \s matches spaces, carriage returns and newlines.
Details:
(<br>|[\s’']) - Group 1 capturing either a
<br> - character sequence
| - or
[\s’'] - a whitespace, ’ or '.
See the PHP demo:
$text = "Voilà. C'est vrai.";
echo preg_replace("/(<br>|[\s’'])/u", "$1|||", $text);

Related

Robustly detect dash in PHP string [duplicate]

preg_replace does not return desired result when I use it on string fetched from database.
$result = DB::connection("connection")->select("my query");
foreach($result as $row){
//prints run-d.m.c.
print($row->artist . "\n");
//should print run.d.m.c
//prints run-d.m.c
print(preg_replace("/-/", ".", $row->artist) . "\n");
}
This occurs only when i try to replace - (dash). I can replace any other character.
However if I try this regex on simple string it works as expected:
$str = "run-d.m.c";
//prints run.d.m.c
print(preg_replace("/-/", ".", $str) . "\n");
What am I missing here?
It turns out you have Unicode dashes in your strings. To match all Unicode dashes, use
/[\p{Pd}\xAD]/u
See the regex demo
The \p{Pd} matches any hyphen in the Unicode Character Category 'Punctuation, Dash' but a soft hyphen, \xAD, hence it should be combined with \p{Pd} in a character class.
The /u modifier makes the pattern Unicode aware and makes the regex engine treat the input string as Unicode code point sequence, not a byte sequence.

PHP regular expression not working with string from database

preg_replace does not return desired result when I use it on string fetched from database.
$result = DB::connection("connection")->select("my query");
foreach($result as $row){
//prints run-d.m.c.
print($row->artist . "\n");
//should print run.d.m.c
//prints run-d.m.c
print(preg_replace("/-/", ".", $row->artist) . "\n");
}
This occurs only when i try to replace - (dash). I can replace any other character.
However if I try this regex on simple string it works as expected:
$str = "run-d.m.c";
//prints run.d.m.c
print(preg_replace("/-/", ".", $str) . "\n");
What am I missing here?
It turns out you have Unicode dashes in your strings. To match all Unicode dashes, use
/[\p{Pd}\xAD]/u
See the regex demo
The \p{Pd} matches any hyphen in the Unicode Character Category 'Punctuation, Dash' but a soft hyphen, \xAD, hence it should be combined with \p{Pd} in a character class.
The /u modifier makes the pattern Unicode aware and makes the regex engine treat the input string as Unicode code point sequence, not a byte sequence.

php regex remove all non-alphanumeric and space characters from a string

I need a regex to remove all non-alphanumeric and space characters, I have this
$page_title = preg_replace("/[^A-Za-z0-9 ]/", "", $page_title);
but it doesn't remove space characters and replaces some non-alphanumeric characters with numbers.
I need the special characters like puntuation and spaces removed.
If all you want to leave all of the alphanumeric bits you would use this:
(\W)+
Here is some test code:
$original = "Match spaces and {!}#";
echo $original ."<br>";
$altered = preg_replace("/(\W)+/", "", $original);
echo $altered;
Here is the output:
Match spaces and {!}#
Matchspacesand
Here is the explanation:
1st Capturing group: (\W) matches any non-word character [^a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
I need the special characters like puntuation and spaces removed.
Then use:
$page_title = preg_replace('/[\p{P}\p{Zs}]+/u', "", $page_title);
\p{P} matches any punctuation character
\p{Zs} matches any space character
/u - To support unicode
Try this
preg_replace('/[^[:alnum:]]/', '', $page_title);
[:alnum:] matches alphanumeric characters
Works good for me on Sublime and PHP Regex Tester
$page_title = preg_replace("/[^A-Za-z0-9]/", "", $page_title);

Find words starting and ending with dollar signs $ in PHP

I am looking to find and replace words in a long string. I want to find words that start looks like this: $test$ and replace it with nothing.
I have tried a lot of things and can't figure out the regular expression. This is the last one I tried:
preg_replace("/\b\\$(.*)\\$\b/im", '', $text);
No matter what I do, I can't get it to replace words that begin and end with a dollar sign.
Use single quotes instead of double quotes and remove the double escape.
$text = preg_replace('/\$(.*?)\$/', '', $text);
Also a word boundary \b does not consume any characters, it asserts that on one side there is a word character, and on the other side there is not. You need to remove the word boundary for this to work and you have nothing containing word characters in your regular expression, so the i modifier is useless here and you have no anchors so remove the m (multi-line) modifier as well.
As well * is a greedy operator. Therefore, .* will match as much as it can and still allow the remainder of the regular expression to match. To be clear on this, it will replace the entire string:
$text = '$fooo$ bar $baz$ quz $foobar$';
var_dump(preg_replace('/\$(.*)\$/', '', $text));
# => string(0) ""
I recommend using a non-greedy operator *? here. Once you specify the question mark, you're stating (don't be greedy.. as soon as you find a ending $... stop, you're done.)
$text = '$fooo$ bar $baz$ quz $foobar$';
var_dump(preg_replace('/\$(.*?)\$/', '', $text));
# => string(10) " bar quz "
Edit
To fix your problem, you can use \S which matches any non-white space character.
$text = '$20.00 is the $total$';
var_dump(preg_replace('/\$\S+\$/', '', $text));
# string(14) "$20.00 is the "
There are three different positions that qualify as word boundaries \b:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
$ is not a word character, so don't use \b or it won't work. Also, there is no need for the double escaping and no need for the im modifiers:
preg_replace('/\$(.*)\$/', '', $text);
I would use:
preg_replace('/\$[^$]+\$/', '', $text);
You can use preg_quote to help you out on 'quoting':
$t = preg_replace('/' . preg_quote('$', '/') . '.*?' . preg_quote('$', '/') . '/', '', $text);
echo $t;
From the docs:
This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
Contrary to your use of word boundary markers (\b), you actually want the inverse effect (\B)-- you want to make sure that there ISN'T a word character next to the non-word character $.
You also don't need to use capturing parentheses because you are not using a backreference in your replacement string.
\S+ means one or more non-whitespace characters -- with greedy/possessive matching.
Code: (Demo)
$text = '$foo$ boo hi$$ mon$k$ey $how thi$ $baz$ bar $foobar$';
var_export(
preg_replace(
'/\B\$\S+\$\B/',
'',
$text
)
);
Output:
' boo hi$$ mon$k$ey $how thi$ bar '

Replace symbol if it is preceded and followed by a word character

I want to change a specific character, only if it's previous and following character is of English characters. In other words, the target character is part of the word and not a start or end character.
For Example...
$string = "I am learn*ing *PHP today*";
I want this string to be converted as following.
$newString = "I am learn'ing *PHP today*";
$string = "I am learn*ing *PHP today*";
$newString = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
// $newString = "I am learn'ing *PHP today* "
This will match an asterisk surrounded by word characters (letters, digits, underscores). If you only want to do alphabet characters you can do:
preg_replace('/([a-zA-Z])\*([a-zA-Z])/', '$1\'$2', 'I am learn*ing *PHP today*');
The most concise way would be to use "word boundary" characters in your pattern -- they represent a zero-width position between a "word" character and a "non-word" characters. Since * is a non-word character, the word boundaries require the both neighboring characters to be word characters.
No capture groups, no references.
Code: (Demo)
$string = "I am learn*ing *PHP today*";
echo preg_replace('~\b\*\b~', "'", $string);
Output:
I am learn'ing *PHP today*
To replace only alphabetical characters, you need to use a [a-z] as a character range, and use the i flag to make the regex case-insensitive. Since the character you want to replace is an asterisk, you also need to escape it with a backslash, because an asterisk means "match zero or more times" in a regular expression.
$newstring = preg_replace('/([a-z])\*([a-z])/i', "$1'$2", $string);
To replace all occurances of asteric surrounded by letter....
$string = preg_replace('/(\w)*(\w)/', '$1\'$2', $string);
AND
To replace all occurances of asteric where asteric is start and end character of the word....
$string = preg_replace('/*(\w+)*/','\'$1\'', $string);

Categories