Robustly detect dash in PHP string [duplicate]


preg_replace does not return the desired result when I use it on a string fetched from a database.
$result = DB::connection("connection")->select("my query");
foreach($result as $row){
//prints run-d.m.c.
print($row->artist . "\n");
//should print run.d.m.c
//prints run-d.m.c
print(preg_replace("/-/", ".", $row->artist) . "\n");
}
This occurs only when I try to replace - (a dash). I can replace any other character.
However, if I try this regex on a simple string, it works as expected:
$str = "run-d.m.c";
//prints run.d.m.c
print(preg_replace("/-/", ".", $str) . "\n");
What am I missing here?

It turns out you have Unicode dashes in your strings. To match all Unicode dashes, use
/[\p{Pd}\xAD]/u
See the regex demo
\p{Pd} matches any character in the Unicode category 'Punctuation, Dash'. The soft hyphen (U+00AD, written \xAD) is not in that category, hence it should be combined with \p{Pd} in a character class.
The /u modifier makes the pattern Unicode-aware and makes the regex engine treat the input string as a sequence of Unicode code points rather than a sequence of bytes.
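To confirm what the database actually returned, it can help to dump the bytes of the value and then normalise every dash-like character to a plain ASCII hyphen before further processing. The following is only a sketch: it assumes UTF-8 input and PHP 7 or later (for the "\u{...}" escapes), and normalize_dashes() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper: replace any Unicode dash (category Pd) or soft hyphen
// (U+00AD) with an ASCII hyphen-minus. Assumes $s is valid UTF-8; if the
// input is not valid UTF-8, preg_replace() returns null and $s is returned.
function normalize_dashes(string $s): string
{
    return preg_replace('/[\p{Pd}\xAD]/u', '-', $s) ?? $s;
}

// U+2010 HYPHEN looks like "-" but is a different code point.
$artist = "run\u{2010}d.m.c";

// bin2hex() exposes the difference: the dash is encoded as e28090, not 2d.
echo bin2hex($artist), "\n";

// After normalisation the original pattern "/-/" works as expected.
echo preg_replace('/-/', '.', normalize_dashes($artist)), "\n"; // run.d.m.c
```

If the goal is only the replacement itself, the character class can of course be used directly in place of "/-/".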

Related

Separate Unicode and ASCII Characters with White Space in PHP

I'm writing a class to handle Sinhala Unicode text in PHP. I want to split a mixed string so that runs of Unicode and ASCII characters become separate words, divided by white space.
Example:
$inputstr = "ලංකාABCDE TEST1දිස්ත්‍රික් වාණිජ්‍යTEMP මණ්ඩලය # MNOPQ";
function separatestring($inputstr)
{
//do some code
return $inputstr;
}
echo separatestring($inputstr);
//OUTPUT String = ලංකා ABCDE TEST1 දිස්ත්‍රික් වාණිජ්‍ය TEMP මණ්ඩලය # MNOPQ
I have tried preg_replace with a regex and several looping methods, but none of them succeeded. Please help me with this. Thanks, all!

This works for me:
$inputstr = "ලංකාABCDE TEST1දිස්ත්‍රික් වාණිජ්‍යTEMP මණ්ඩලය # MNOPQ";
function separatestring($inputstr)
{
$re = '#\s+|(?<=[^\x20-\x7f])(?=[\x20-\x7f])'
. '|(?<=[\x20-\x7f])(?=[^\x20-\x7f])#';
$array = preg_split($re, $inputstr);
return array_filter($array);
}
echo implode(" ", separatestring($inputstr));
//OUTPUT String = ලංකා ABCDE TEST1 දිස්ත්‍රික් වාණිජ්‍ය TEMP මණ්ඩලය # MNOPQ
The regexp for splitting means the following:
# - start of the regexp (delimiter character),
\s+ - split on one or more whitespace characters (the whitespace itself is the separator),
| - or,
(?<=[^\x20-\x7f])(?=[\x20-\x7f]) - split on a border between non-ASCII and ASCII characters (not counting them as separators),
| - or,
(?<=[\x20-\x7f])(?=[^\x20-\x7f]) - split on a border between ASCII and non-ASCII characters (not counting them as separators),
# - end of the regexp (delimiter character).
Unfortunately, my regular expression is not very elegant, so empty strings are sometimes returned (because whitespace is also an ASCII character, a zero-width border match can directly follow a whitespace match). I’ve added array_filter to fix this, but a more elegant solution might exist.
I’ve written separatestring so that it returns an array. If you want a string, replace the return statement with:
return implode(" ", array_filter($array));
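One candidate for the "more elegant solution" mentioned above is to let preg_split() drop the empty pieces itself via the PREG_SPLIT_NO_EMPTY flag. This also sidesteps a quirk of array_filter() without a callback, which removes every falsy value, including the string "0". The sketch below reuses the same regex; separate_scripts() is a hypothetical name chosen for illustration.

```php
<?php
// Sketch: same splitting regex as above, but PREG_SPLIT_NO_EMPTY discards
// empty pieces directly, so array_filter() is not needed and a piece
// consisting of "0" survives.
function separate_scripts(string $input): string
{
    $re = '#\s+|(?<=[^\x20-\x7f])(?=[\x20-\x7f])'
        . '|(?<=[\x20-\x7f])(?=[^\x20-\x7f])#';
    $parts = preg_split($re, $input, -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', $parts);
}

echo separate_scripts("ලංකාABCDE TEST1 මණ්ඩලය # MNOPQ"), "\n";
```

The pattern deliberately works without the /u modifier: in UTF-8 every byte of a non-ASCII character is outside \x20-\x7f, so a border between an ASCII byte and a non-ASCII byte always falls on a character boundary.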

Preg replace utf8 charset issue with à

I'm trying to add a special string '|||' after newlines, blankspaces and other characters. I'm doing this because I want to split my text into an array. So I was thinking to do it like this:
$result = preg_replace("/<br>/", "<br>|||", preg_replace("/\s/", " |||", preg_replace("/\r/", "\r|||", preg_replace("/\n/", "\n|||", preg_replace("/’/", "’|||", preg_replace("/'/", "'|||", $text))))));
$result = preg_split("/[|||]+/", $result);
It works for every word except those containing the character à, which gets replaced by �.
I'm sure the problem is in this code, because my string $text itself shows the character à correctly.

Since your pattern deals with a Unicode string, pass the /u modifier.
Also, you do not need so many chained regex replacements: group the patterns into one alternation and use a backreference in the replacement.
Use
preg_replace("/(<br>|[\s’'])/u", "$1|||", $text)
Note that \s matches spaces, carriage returns and newlines.
Details:
(<br>|[\s’']) - Group 1 capturing either a
<br> - character sequence
| - or
[\s’'] - a whitespace, ’ or '.
See the PHP demo:
$text = "Voilà. C'est vrai.";
echo preg_replace("/(<br>|[\s’'])/u", "$1|||", $text);
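Since the stated goal is to split the text into an array, the '|||' marker step could arguably be skipped: a lookbehind lets preg_split() cut right after each delimiter while keeping the delimiter attached to the preceding piece. This is a sketch under the assumption that $text is valid UTF-8 and that the source file is saved as UTF-8; split_after_delimiters() is a hypothetical name.

```php
<?php
// Hypothetical helper: split after <br>, any whitespace, ’ or ', keeping the
// delimiter at the end of each piece. Assumes valid UTF-8 input.
function split_after_delimiters(string $text): array
{
    // The lookbehind is zero-width, so nothing is removed from the text.
    $parts = preg_split("/(?<=<br>|[\s’'])/u", $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on error (e.g. malformed UTF-8 with /u).
    return $parts === false ? [] : $parts;
}

print_r(split_after_delimiters("Voilà. C'est vrai."));
```

Because nothing is consumed, joining the pieces with implode('', ...) gives back the original text.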

Find last word of a string that has special characters

I am trying to add a span tag to the last word of a string. It works if the string has no special characters. I can't figure out the correct regex for it.
$string = "Onun Mesajı";
echo preg_replace("~\W\w+\s*\S?$~", ' <span>' . '\\0' . '</span>', $string);
Here is the Turkish character set : ÇŞĞÜÖİçşğüöı

You need to use the /u modifier to enable processing of Unicode characters in the pattern and the input string.
preg_replace('~\w+\s*$~u', '<span>$0</span>', $string);
                       ^
Full PHP demo:
$string = "Onun Mesajı";
echo preg_replace("~\w+\s*$~u", '<span>$0</span>', $string);
Also, the regex you need is just \w+\s*$:
\w+ - 1 or more word characters (letters, digits or underscore; with /u this includes Unicode letters)
\s* - 0 or more whitespace (trailing)
$ - end of string
Since I removed the \W from the regex, there is no need to "hardcode" the leading space in the replacement string (removed, too).
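Note that with \w+\s*$ any trailing whitespace ends up inside the span. If that matters, one possible variation (a sketch, not part of the answer above) is to move the whitespace into a lookahead so that it is asserted but not consumed. wrap_last_word() is a hypothetical helper name, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: wrap the last word in <span>, leaving any trailing
// whitespace outside the tag. Assumes $s is valid UTF-8; on a regex error
// the input is returned unchanged.
function wrap_last_word(string $s): string
{
    // (?=\s*$) asserts that only whitespace follows, without consuming it.
    return preg_replace('~\w+(?=\s*$)~u', '<span>$0</span>', $s) ?? $s;
}

echo wrap_last_word("Onun Mesajı"), "\n"; // Onun <span>Mesajı</span>
```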

You should use the u modifier for regular expressions to put the engine into Unicode mode:
<?php
$subject = "Onun äöüß Mesajı";
$pattern = '/\w+\s*?$/u';
echo preg_replace($pattern, '<span>\\0</span>', $subject);
The output is:
Onun äöüß <span>Mesajı</span>

This regex will do the trick for you, and is a lot shorter than the other solutions:
[ ](.*?$)
Here is an example of it:
$string = "Onun Mes*ÇŞĞÜÖİçşğüöıajı";
echo preg_replace('~[ ](.*?$)~', ' <span>' .'${1}'. '</span>', $string);
Will echo out:
Onun <span>Mes*ÇŞĞÜÖİçşğüöıajı</span>
The way this regex works is that it matches a literal space, [ ], and then lazily captures everything up to the end of the string, (.*?$).
Because the match starts at the first space, this only isolates the last word when the string contains exactly one space.

Find words starting and ending with dollar signs $ in PHP

I am looking to find and replace words in a long string. I want to find words that look like this: $test$ and replace them with nothing.
I have tried a lot of things and can't figure out the regular expression. This is the last one I tried:
preg_replace("/\b\\$(.*)\\$\b/im", '', $text);
No matter what I do, I can't get it to replace words that begin and end with a dollar sign.

Use single quotes instead of double quotes and remove the double escape.
$text = preg_replace('/\$(.*?)\$/', '', $text);
Also, a word boundary \b does not consume any characters; it asserts that there is a word character on one side and not on the other. Since $ is not a word character, you need to remove the word boundaries for this to work. Your regular expression contains no letters, so the i modifier is useless here, and it has no anchors, so remove the m (multi-line) modifier as well.
Also, * is a greedy quantifier. Therefore, .* will match as much as it can while still allowing the remainder of the regular expression to match. To be clear, the greedy version replaces the entire string:
$text = '$fooo$ bar $baz$ quz $foobar$';
var_dump(preg_replace('/\$(.*)\$/', '', $text));
# => string(0) ""
I recommend using the non-greedy quantifier *? here. The question mark tells the engine not to be greedy: as soon as it finds a closing $, it stops.
$text = '$fooo$ bar $baz$ quz $foobar$';
var_dump(preg_replace('/\$(.*?)\$/', '', $text));
# => string(10) " bar  quz "
Edit
To fix your problem, you can use \S which matches any non-white space character.
$text = '$20.00 is the $total$';
var_dump(preg_replace('/\$\S+\$/', '', $text));
# string(14) "$20.00 is the "

There are three different positions that qualify as word boundaries \b:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
$ is not a word character, so don't use \b or it won't work. Also, there is no need for the double escaping and no need for the im modifiers:
preg_replace('/\$(.*)\$/', '', $text);
I would use:
preg_replace('/\$[^$]+\$/', '', $text);
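The patterns suggested so far do not all behave the same once the text also contains a lone dollar sign, such as a price. The sketch below runs each of them on one input so the differences are visible; the array keys are just labels chosen for illustration.

```php
<?php
// Compare the patterns discussed above on an input containing a price.
$text = '$20.00 is the $total$';

$patterns = [
    'greedy'  => '/\$(.*)\$/',   // first $ through the last $
    'lazy'    => '/\$(.*?)\$/',  // first $ through the next $
    'negated' => '/\$[^$]+\$/',  // like lazy: the span cannot cross a $
    'nospace' => '/\$\S+\$/',    // $...$ with no whitespace inside
];

$results = [];
foreach ($patterns as $name => $re) {
    $results[$name] = preg_replace($re, '', $text);
}

var_dump($results);
// greedy  => ""
// lazy    => "total$"
// negated => "total$"
// nospace => "$20.00 is the "
```

Only the whitespace-free variant leaves the price intact here, which is why the choice of pattern depends on what else the text may contain.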

You can use preg_quote to help you out on 'quoting':
$t = preg_replace('/' . preg_quote('$', '/') . '.*?' . preg_quote('$', '/') . '/', '', $text);
echo $t;
From the docs:
This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters.
The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
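preg_quote pays off mainly when the delimiter is not known in advance. As a rough sketch, the helper below (strip_delimited() is a hypothetical name, and the delimiter is assumed to be a non-empty run-time string) builds the pattern from whatever marker it is given:

```php
<?php
// Hypothetical helper: remove every shortest span enclosed by $delim.
// $delim is assumed to be a non-empty string that may contain regex
// metacharacters; preg_quote() escapes them and the "/" pattern delimiter.
function strip_delimited(string $text, string $delim): string
{
    $d = preg_quote($delim, '/'); // e.g. '$' becomes '\$'
    return preg_replace('/' . $d . '.*?' . $d . '/', '', $text) ?? $text;
}

echo strip_delimited('$fooo$ bar $baz$', '$'), "\n";
echo strip_delimited('a **b** c', '**'), "\n";
```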

Contrary to your use of word boundary markers (\b), you actually want the inverse effect (\B)-- you want to make sure that there ISN'T a word character next to the non-word character $.
You also don't need to use capturing parentheses because you are not using a backreference in your replacement string.
\S+ means one or more non-whitespace characters, matched greedily.
Code: (Demo)
$text = '$foo$ boo hi$$ mon$k$ey $how thi$ $baz$ bar $foobar$';
var_export(
preg_replace(
'/\B\$\S+\$\B/',
'',
$text
)
);
Output:
' boo hi$$ mon$k$ey $how thi$  bar '
