I have this RegEx for matching whitespace in Unicode:
/^[\pZ\pC]+|[\pZ\pC]+$/u
I'm not even sure of what it does, but it seems to work. Now, in this case, which function applies better and why?
$str = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
or
$str = mb_ereg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
The first one works. The second one doesn't.
I tried it out again: mb_ereg_replace doesn't actually support those Unicode property escapes, and it doesn't use regex delimiters. (See Oniguruma.)
preg_replace uses the PCRE regex engine, which supports both.
Anyway, there is no such thing as a "better" application. It's either functioning, or not.
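To make concrete what the pattern does, here is a minimal sketch that wraps it in a helper. The function name unicode_trim and the sample string are assumptions made for illustration; only the pattern itself comes from the question.

```php
<?php
// Hypothetical helper name; not part of PHP or of the original post.
function unicode_trim(string $str): string
{
    // \pZ = any Unicode separator (space, line and paragraph separators)
    // \pC = any "other" character (control, format, unassigned, ...)
    // ^[...]+ matches a leading run, [...]+$ a trailing run.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_replace returns null on error (e.g. invalid UTF-8), so fall back to the input.
    return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str) ?? $str;
}

// Leading no-break space, em space and space; trailing tab and newline.
$padded  = "\u{00A0}\u{2003} caf\u{00E9} au lait\t\n";
$trimmed = unicode_trim($padded);

var_dump($trimmed === "caf\u{00E9} au lait"); // bool(true)
```

The space inside the string is left alone, because neither alternative can match there: it is not at the start, and it is not followed by the end of the string.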
Related
I have text with unicode soft hyphens (U+00AD) that I wish to remove. I'm attempting to do this with PHP's mb_ereg_replace() function. It's finding the soft hyphens, but the replace procedure is removing both the soft hyphens and the first character that immediately follows them.
My code is:
$text_cleansed = mb_ereg_replace('[\u00AD]', '', $text);
For example, if $text were "en-dur-ance" (with the hyphens shown here being the invisible unicode soft hyphens), then $text_cleansed would be "enurnce"; -d and -a have been removed, when for each soft hyphen only the soft hyphen should be removed. mb_ereg_replace has therefore removed each soft hyphen and the first character that follows it. Surely, I must be feeding incorrect arguments into the function.
What is causing this behavior, and what would be the correct arguments for the function?
PHP regex does not support the \u notation. The characters in your pattern are treated as separate items ('\u', '0', 'A', 'D'), not as a single hex escape.
Use preg_replace with the \x{} notation and the /u modifier (necessary to interpret the pattern and the input string as Unicode strings):
preg_replace('~\x{00AD}~u', '', $s)
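As a quick check of the call above, the following sketch runs it against the "en-dur-ance" example from the question. The variable names are assumptions; the "\u{00AD}" string escape requires PHP 7.0 or later and is used here only to build the test input.

```php
<?php
// Build the sample text with real soft hyphens (U+00AD) between the syllables.
$text = "en\u{00AD}dur\u{00AD}ance";

// \x{00AD} is PCRE's code point escape; /u puts pattern and subject in UTF-8 mode.
$cleaned = preg_replace('~\x{00AD}~u', '', $text);

echo $cleaned, "\n"; // endurance
```

Only the soft hyphens are removed; the letters that follow them are kept.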
What stribizhev has written is correct. It's a matter of syntax that isn't clear in the PHP mb_ereg documentation. I'm answering after already marking his second response as the answer, because his third response has the specific answer to my original question (re: multi-byte strings--mb--as opposed to regular strings).
1) If non-multi-byte strings are used, preg_replace('~\x{00AD}~u', '', $text) is the solution.
2) If multi-byte strings are used, mb_ereg_replace('[\x{00AD}]', '', $text) is the solution.
It's a matter of syntax, obscure to those not experienced in regex work. With luck, this will help someone else with a similar problem.
If your string is UTF-8 encoded, you don't need to take into account that your string is multi-byte to remove a character from the ASCII range (00-7F), since these bytes aren't used to compose other multi-byte characters. In this case, you can use str_replace:
$result = str_replace('-', '', $text);
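The same byte-wise approach should also work for a non-ASCII needle such as the soft hyphen, on the assumption that both the needle and the text are valid UTF-8: a complete UTF-8 sequence cannot begin in the middle of another character. A small sketch (variable names are illustrative, and "\u{00AD}" needs PHP 7.0+):

```php
<?php
// Sample text containing two soft hyphens (U+00AD), encoded as UTF-8.
$text = "en\u{00AD}dur\u{00AD}ance";

// str_replace compares raw bytes; the needle here is the two-byte
// UTF-8 sequence for U+00AD, so no regex engine is involved.
$result = str_replace("\u{00AD}", '', $text);

var_dump($result === 'endurance'); // bool(true)
```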
We have a regex to strip out non-alphanumeric characters except for '#', '&' and '-'. Here is what it looks like:
preg_replace('/[^a-zA-Z0-9#&-*]/', '', strtolower($title));
Now we need to support traditional Chinese strings, and the above function won't work. How can I implement similar functionality for traditional Chinese?
Thanks,
Use u modifier:
preg_replace('/[^a-zA-Z0-9#&-*诶]/u', '', $string);
By the way, don't use strtolower() here: it is not multi-byte aware and can corrupt non-ASCII characters in your string. Use mb_strtolower():
mb_strtolower($string, 'UTF-8');
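Listing individual Chinese characters in the class does not scale, so one possible variation is to allow the whole Han script with \p{Han}. This is a swapped-in technique, not what the answer above wrote; the helper name clean_title and the exact allowed set are assumptions, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper; name and allowed character set are assumptions.
function clean_title(string $title): string
{
    // Lowercase with a multi-byte-aware function first.
    $lower = mb_strtolower($title, 'UTF-8');

    // Keep a-z, 0-9, '#', '&', '-' and any Han ideograph; drop everything else.
    // The hyphen is placed last in the class so it is a literal, not a range.
    return preg_replace('/[^a-z0-9#&\p{Han}-]/u', '', $lower) ?? $lower;
}

echo clean_title('Hello, 繁體中文! R&D #1-A'), "\n"; // hello繁體中文r&d#1-a
```

Note that in the original class the sequence &-* is read as a range from '&' to '*', which also lets through the apostrophe and parentheses; putting the hyphen at the end avoids that.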
Have you tried mb_ereg_replace() instead of preg_replace()? That might do the trick.
http://www.php.net/manual/en/function.mb-ereg-replace.php
I am converting an eregi_replace function I found to preg_replace, but the eregi string has about every character on the keyboard in it. So I tried to use £ as the delimiter, and it is working currently, but I wonder if it might potentially cause problems because it is a non-standard character?
Here is the eregi:
function makeLinks($text) {
$text = eregi_replace('(((f|ht){1}tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+)',
'\\1', $text);
$text = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)',
'\\1\\2', $text);
return $text;
}
and the preg:
function makeLinks($text) {
$text = preg_replace('£(((f|ht){1}tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+)£i',
'\\1', $text);
$text = preg_replace('£([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)£i',
'\\1\\2', $text);
return $text;
}
£ is problematic because it isn't an ASCII character. It's from the Latin-1 charset and will only work if your PHP script also uses the 8bit representation. Should your file be encoded as UTF-8, then £ will be represented as two bytes. And PCRE in PHP will trip over that. (At least my version does.)
You can use parentheses to delimit a regex rather than a single character, for example:
preg_replace('(abc/def#ghi)i', ...);
That would probably be nicer than trying to find an obscure character that's not (yet) part of your expression.
You can use the unicode character, just to be sure.
\u00A3
Watch out for the ereg functions and unicode support.
http://www.regular-expressions.info/php.html
http://www.regular-expressions.info/characters.html
Long live the Queen.
As @Chris pointed out, you can use paired bracket characters as delimiters, but they have to be properly balanced throughout the regex. For example, '<<>' won't work, but '<<>>' will. You can use any of (), [], {} or <>, but I recommend the braces or the square brackets; parentheses are too common in regexes, and angle brackets are used in constructs like (?>...) (atomic group) and (?<=...) (lookbehind).
But I'm with @Brad on this one: why not just escape the delimiter character with a backslash whenever it appears in the regex?
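Both options can be compared side by side. The sketch below uses a simplified URL pattern and a sample string that are assumptions for illustration, not the pattern from the question: once with braces as paired delimiters, once with '/' as the delimiter and every slash in the pattern escaped.

```php
<?php
$text = 'See http://example.com/a?b=1#top now';

// Paired delimiters: '/' and '#' inside the pattern need no escaping.
$withBraces = preg_replace('{https?://[-a-z0-9#/.?=]+}i', '[link]', $text);

// Conventional '/' delimiter: each '/' inside the pattern is escaped.
$withSlashes = preg_replace('/https?:\/\/[-a-z0-9#\/.?=]+/i', '[link]', $text);

echo $withBraces, "\n";                   // See [link] now
var_dump($withBraces === $withSlashes);   // bool(true)
```

Either form stays pure ASCII, so it does not depend on how the script file itself is encoded.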
You would know the data being parsed better than we would. As far as regex is concerned, it's no different than any other ASCII value.
Though I have to ask: what's wrong with using a traditional delimiter and then just escaping it? Or using a class with a character range?
Just migrating from PHP 5.2 to 5.3, lot of hard work! Is the following ok, or would you do something differently?
$cleanstring = ereg_replace("[^A-Za-z0-9]^[,]^[.]^[_]^[:]", "", $critvalue);
to
$cleanstring = preg_replace("[^A-Za-z0-9]^[,]^[.]^[_]^[:]", "", $critvalue);
Thanks all
As a follow-up to cletus's answer:
I'm not familiar with the POSIX regex syntax (ereg_*) either, but based on your criteria the following should do what you want:
$cleanstring = preg_replace('#[^a-zA-Z0-9,._:]#', '', $critvalue);
This removes everything except a-z, A-Z, 0-9, and the listed punctuation characters (comma, period, underscore and colon).
I'm not that familiar with the ereg_* functions but your preg version has a couple of problems:
^ means beginning of string so with it in the middle it won't match anything; and
You need to delimit your regular expression;
An example:
$out = preg_replace('![^0-9a-zA-Z]+!', '', $in);
Note I'm using ! to delimit the regex but you could just as easily use /, ~ or whatever. The above removes everything except numbers and letters.
See Pattern Syntax, specifically Delimiters.
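Putting the two points together (a single negated class, plus delimiters), a converted call might look like the sketch below. The sample value is made up for illustration; the pattern is the one suggested in the follow-up answer.

```php
<?php
// Made-up input containing characters that should be stripped.
$critvalue = 'Crit_1: a.b, c!d? (e)';

// One negated class lists everything to keep; '#' serves as the delimiter.
$cleanstring = preg_replace('#[^a-zA-Z0-9,._:]#', '', $critvalue);

echo $cleanstring, "\n"; // Crit_1:a.b,cde
```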
I can't figure out preg_replace at all, it just looks Chinese to me. Anyway, I just need to remove "&page-X" from a string if it's there.
X being a number of course, if anyone has a link to a useful preg_replace tutorial for beginners that would also be handy!
Actually the basic syntax for regular expressions, as supported by preg_replace and friends, is pretty easy to learn. Think of it as a string describing a pattern with certain characters having special meaning.
In your very simple case, a possible pattern is:
&page-\d+
With \d meaning a digit (numeric characters 0-9) and + meaning: Repeat the expression right before + (here: \d) one or more times. All other characters just represent themselves.
Therefore, the pattern above matches any of the following strings:
&page-0
&page-665
&page-1234567890
Since the preg functions use a Perl-compatible syntax and regular expressions are denoted between slashes (/) in Perl, you have to surround the pattern in slashes:
$after = preg_replace('/&page-\d+/', '', $before);
Actually, you can use other characters as well:
$after = preg_replace('#&page-\d+#', '', $before);
For a full reference of supported syntax, see the PHP manual.
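Here is a short worked example of the call above. The sample query strings are assumptions for illustration.

```php
<?php
// A string that contains the "&page-X" fragment, and one that does not.
$before  = 'index.php?cat=5&page-12&sort=asc';
$without = 'index.php?cat=5&sort=asc';

// \d+ matches one or more digits, so "&page-12" is removed as a whole.
$after = preg_replace('/&page-\d+/', '', $before);

echo $after, "\n"; // index.php?cat=5&sort=asc

// When the fragment is absent, the input comes back unchanged.
var_dump(preg_replace('/&page-\d+/', '', $without) === $without); // bool(true)
```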
preg_replace uses Perl-Compatible Regular Expressions (PCRE) for the search pattern. Try this pattern:
preg_replace('/&page-\d+/', '', $str)
See the pattern syntax for more information.
$outputstring = preg_replace('/&page-\d+/', "", $inputstring);
preg_replace()
preg_replace('/&page-\d+/', '', $string)
Useful information:
Using Regular Expressions with PHP
http://articles.sitepoint.com/article/regular-expressions-php