PHP: Match strange dash with preg_match() - php

I'm having a lot of trouble matching this character: –
It's called an "en dash", U+2013 (according to http://www.fileformat.info/info/unicode/char/search.htm)
It matches - in my test environment (Windows and PHP 5.2.11) but fails on the production servers (Ubuntu and PHP 5.3.2). Even \x2013 fails there.
Any suggestions on how to match this strange character? Or how to configure PHP to make it work?

You can also try the "u" flag on the expression, which makes PCRE treat the pattern and subject as UTF-8 (see the regex pattern modifiers documentation),
so your expression would be "/[somepattern]/u"

if (preg_match ('~\xe2\x80\x93~', $string))
{
echo "En Dash found";
}
I believe you've got a UTF-8 encoding, don't you?
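To put the two suggestions above side by side, here is a minimal sketch. It assumes the subject string is UTF-8; the sample text is made up for illustration. Note that \x2013 without braces is read as \x20 (a space) followed by the literal characters "13", which may be why that attempt never matched.

```php
<?php
// Assumption: $string is UTF-8; the en dash is the three bytes E2 80 93.
$string = "2010\xE2\x80\x932020"; // "2010–2020"

// Byte-level match: works without the u modifier.
$byBytes = preg_match('~\xe2\x80\x93~', $string);

// Code-point match: needs braces around the hex value and the u modifier.
$byCodePoint = preg_match('/\x{2013}/u', $string);

// No braces: PCRE reads \x20 (space) and then a literal "13".
$noBraces = preg_match('/\x2013/', $string);

var_dump($byBytes, $byCodePoint, $noBraces);
```

If the byte-level pattern does not match either, the input is probably not UTF-8 in the first place.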

Related

PHP preg_replace() returns empty string with hidden control chars

I'm trying to strip hidden control chars (especially \x{89} and \x{88}) with preg_replace() from a string. It is "ˆText" (it starts with an "\x{88}" char), mb_detect_encoding says it is UTF-8.
The code used is $result = preg_replace('/\x{88}/u','',$string); but the result is null.
If I use the code without /u modifier I get "�Text", the control char is replaced with a replacement char (U+FFFD).
I'm using PHP 7.1 on Windows. The same search with BBEdit and NotePad++ replaces the chars correctly.
Any ideas?
Thanks,
A.
preg_replace() returns "null" only on error.
Run preg_last_error() right after preg_replace() and check the returned error code.
As a side note: your wording suggests that you want to strip all control characters, not just the two explicitly mentioned. In that case you would be better off matching against "\p{Cc}":
preg_replace('/\p{Cc}/u', '', $string);
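A minimal sketch of that error check, assuming the string really holds a raw 0x88 byte (for example text saved as Windows-1252), which is not valid UTF-8 on its own. In that case the u modifier makes preg_replace() give up with PREG_BAD_UTF8_ERROR, and matching the byte without u is one possible fallback:

```php
<?php
// Assumption: the input starts with a lone 0x88 byte, so it is not valid UTF-8.
$string = "\x88Text";

$result = preg_replace('/\x{88}/u', '', $string);
$code = preg_last_error(); // read it immediately, before any other preg_* call

if ($result === null && $code === PREG_BAD_UTF8_ERROR) {
    // Fall back to a byte-level match without the u modifier.
    $result = preg_replace('/\x88/', '', $string);
}

echo $result, "\n";
```

Converting the input to UTF-8 first would be the cleaner fix if the source encoding is known.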

From Postgres regexp_replace to a matching pattern in PHP

I need some help.
I have a PostgreSQL regexp_replace pattern like this:
regexp_replace(lower(institution_title),'[[:cntrl:]]|[[[:digit:]]|[[:punct:]]|[[:blank:]]|[[:space:]|„|“|“|”"]','','g')
and I need an equivalent in PHP,
because one half of the data comes from the Postgres DB and I have to compare it with strings from PHP as well.
You may use the same POSIX character classes with PHP PCRE regex:
preg_replace('/[[:cntrl:][:digit:][:punct:][:blank:][:space:]„““”"]+/', '', strtolower($institution_title))
Besides, there are Unicode category classes in PCRE. Thus, you may also try
preg_replace('/[\p{Cc}\d\p{P}\s„““”"]+/u', '', mb_strtolower($institution_title, 'UTF-8'))
Where \p{Cc} stands for Control characters, \d for digits, \p{P} for punctuation, and \s for whitespace.
I am adding the /u modifier to handle Unicode strings, too.
Thanks guys, but I bumped into another problem: I cannot match strings if they contain certain special characters.
Here is the output of my Postgres SQL:
SQL:
select regexp_replace(lower(title),'[[:cntrl:]]|[[[:digit:]]|[[:punct:]]|[[:blank:]]|[[:space:]|„|“|“|”"]','','g')
from cls_institutions
Output:
"oxforduniversity"
"šiauliųuniversitetas"
"harwarduniversity"
"internationalbusinessschool"
"vilniuscollege"
"žemaitijoskolegija"
"worldhealthorganization"
But in PHP the output is a little bit different. I build my array of institutions like this:
$institutions[] = "'".preg_replace('/[[:cntrl:][:digit:][:punct:][:blank:][:space:]„““”"]+/', '', strtolower($data[0]))."'";
And PHP outputs like this:
"oxforduniversity",
"Šiauliųuniversitetas",
"harwarduniversity",
"internationalbusinessschool",
"vilniuscollege",
"Žemaitijoskolegija",
"worldhealthorganization"
The first letter is not lowercased, somehow... Am I missing something?
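The likely cause is that strtolower() works byte by byte and leaves multibyte UTF-8 letters such as Š and Ž alone, whereas mb_strtolower() handles them. A minimal sketch, assuming UTF-8 input and the mbstring extension; normalize_title() is a hypothetical helper name, not something from the original code:

```php
<?php
// Hypothetical helper: lowercase with mbstring, then strip control characters,
// digits, punctuation (this covers the „ “ ” quotes) and whitespace.
function normalize_title(string $title): string
{
    $lower = mb_strtolower($title, 'UTF-8');
    return preg_replace('/[\p{Cc}\d\p{P}\s]+/u', '', $lower);
}

echo normalize_title("Šiaulių universitetas"), "\n";
echo normalize_title("„Žemaitijos kolegija“"), "\n";
```

This file has to be saved as UTF-8 for the string literals to be valid input.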

Fixed-length regex lookbehind complains of variable-length lookbehind

Here is the code I am trying to run:
$str = 'a,b,c,d';
return preg_split('/(?<![^\\\\][\\\\]),/', $str);
As you can see, the regexp being used here is:
/(?<![^\\][\\]),/
Which is a simple fixed-length negative lookbehind for "preceded by something that isn't a backslash, then something that is!".
This regex works just fine on http://www.phpliveregex.com
But when I go and actually attempt to run the above code, I am spat back the error:
Warning: preg_split() [function.preg-split]: Compilation failed: lookbehind assertion is not fixed length at offset 13
To make matters worse, a fellow programmer tested the code on his 5.4.24 PHP server, and it worked fine.
This leads me to believe that my issues are related to the configuration of my server, which I have very little control over. I am told that my PHP version is 5.2.*
Are there any workarounds/alternatives to preg_split() that might not have this issue?
The problem is caused by the bug fixed in PCRE 6.7. Quoting the changelog:
A negated single-character class was not being recognized as
fixed-length in lookbehind assertions such as (?<=[^f]), leading to an
incorrect compile error "lookbehind assertion is not fixed length"
PCRE 6.7 was introduced in PHP 5.2.0, in Nov 2006. Since you still hit this bug, your server is apparently running an older PCRE, so for a preg_split()-based workaround you have to use a pattern without a negated character class. For example:
$patt = '/(?<!(?<!\\\\)\\\\),/';
// or...
$patt = '/(?<![\x00-\x5b\x5d-\xFF]\x5c),/';
However, I find the whole approach a bit weird: what if , symbol is preceded by exactly three backslashes? Or five? Or any odd number of them? The comma in this case should be considered 'escaped', but obviously you cannot create a lookbehind expression of variable length to cover these cases.
On second thought, one can use preg_match_all() instead, with a common alternation trick to cover the escaped symbols:
$str = 'e ,a\\,b\\\\,c\\\\\\,d\\\\';
preg_match_all('/(?:[^\\\\,]|\\\\(?:.|$))+/', $str, $matches);
var_dump($matches[0]);
I really think I covered all the issues here; those trailing slashes were a killer.
A way to avoid the negated character class (I write \x5c instead of a lot of backslashes, to be clearer):
$result = preg_split('/(?<!(?!\x5c).\x5c),/s', $str);
About the approach itself:
If you are trying to split on commas that are not escaped, a lookbehind is the wrong tool, since you can't check for an undefined number of backslashes before the comma. You have several possibilities to solve this problem:
$result = preg_split('/(?:[^\x5c]|\A)(?:\x5c.)*\K,/s', $str);
or
$result = preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
or for PHP > 5.2.4
$result = preg_split('/\x5c{2}(*SKIP)(?!)|(?<!\x5c),/s', $str);
I think you are using an older PHP version, since your error arises on PHP 5.1.6 or lower.
On the other hand, it works on PHP 5.2.16 or higher.
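To make the \K variant above concrete, here is a small sketch. It assumes a PCRE build that supports \K, and split_unescaped_commas() is a hypothetical wrapper name. A comma counts as escaped when an odd number of backslashes precedes it:

```php
<?php
// Hypothetical wrapper around the \K pattern shown above.
function split_unescaped_commas(string $str): array
{
    // Skip complete backslash pairs (\x5c followed by any character), then
    // \K drops them from the match so only the comma acts as the separator.
    return preg_split('/(?<!\x5c)(?:\x5c.)*\K,/s', $str);
}

// In single quotes: \, stays as backslash-comma, \\\\ is two backslashes.
$parts = split_unescaped_commas('a,b\,c,d\\\\,e');
var_dump($parts);
```

Here "b\,c" stays in one piece because its comma is escaped, while the comma after the two backslashes in "d\\" still splits.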

Hebrew regex match not working in php

This is my current regex code to validate English letters and numbers:
const CANONICAL_FMT = '[0-9a-z]{1,64}';
public static function isCanonical($str)
{
return preg_match('/^(?:' . self::CANONICAL_FMT . ')$/', $str);
}
Pretty straightforward. Now I want to change that to validate only Hebrew, underscore
and numbers. So I changed the code to:
public static function isCanonical($str)
{
return preg_match('/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i', $str);
}
But it doesn't work. I basically took the Hebrew Unicode range from Wikipedia.
What is wrong here?
I was able to get it to work much more easily, using the /u flag and the \p{Hebrew} Unicode character property:
return preg_match('/^(?:\p{Hebrew}+|\w+)$/iu', $str);
Working example: http://ideone.com/gSlmh
If you want preg_match() to work properly with UTF-8, you might have to enable the u modifier. Quoting the manual:
This modifier turns on additional functionality of PCRE that is
incompatible with Perl. Pattern strings are treated as UTF-8.
In your case, note that PCRE does not support the \uXXXX escape; code points are written as \x{XXXX}. So instead of the following regex:
/^(?:[\u0590-\u05FF\uFB1D-\uFB40]+|[\w]+)$/i
I suppose you'd be using:
/^(?:[\x{0590}-\x{05FF}\x{FB1D}-\x{FB40}]+|[\w]+)$/iu
(Note the additional u at the end.)
You need the /u modifier to add support for UTF-8.
Make sure you convert your hebrew input to UTF-8 if it's in some other codepage/character set.
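One caveat with the patterns above: the \w alternative also accepts plain English letters. If the goal is strictly Hebrew, digits and underscore, a single character class may be closer to what was asked. A minimal sketch, assuming UTF-8 input and PHP 7+ (for the "\u{...}" string escapes and type hints); is_canonical() is a hypothetical stand-alone version of the method in the question:

```php
<?php
// Hypothetical stand-alone variant of isCanonical(): 1 to 64 characters,
// each one a Hebrew-script character, an ASCII digit, or an underscore.
function is_canonical(string $str): bool
{
    return preg_match('/\A[\p{Hebrew}0-9_]{1,64}\z/u', $str) === 1;
}

$hebrew = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}"; // the word "shalom" in Hebrew letters

var_dump(is_canonical($hebrew . '_123'));
var_dump(is_canonical('hello'));
```

The {1,64} length limit is carried over from CANONICAL_FMT in the question.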

Weird error using preg_match and unicode

if (preg_match('(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)', '2010/02/14/this-is-something'))
{
// do stuff
}
The above code works. However this one doesn't.
if (preg_match('/\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+/u', '2010/02/14/this-is-something'))
{
// do stuff
}
Maybe someone could shed some light as to why the one below doesn't work. This is the error that is being produced:
A PHP Error was encountered
Severity: Warning
Message: preg_match() [function.preg-match]: Unknown modifier '\'
Try this (delimit the regex with # instead of /):
if (preg_match('#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#', '2010/02/14/this-is-something'))
{
// do stuff
}
Edited
The modifier u is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32.
Also, as nvl observed, you are using / as the delimiter and you are not escaping the / characters inside the regex. So you'll have to use:
/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u
To avoid this escaping you can use a different delimiter, like:
#\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+#
As a tip, if your delimiter is present in your regex, it's better to choose a different delimiter that does not appear in the regex. This keeps the regex clean and short.
In the second regex you're using / as the regex delimiter, but you're also using it in the regex. The compiler is trying to interpret this part as a complete regex:
/\p{Nd}{4}/
It thinks the next character after the second / should be a modifier like 'u' or 'm', but it sees a backslash instead, so it emits that cryptic warning.
In the first regex you're using parentheses as regex delimiters; if you wanted to add the u modifier, you would put it after the closing paren:
'(\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+)u'
Although it's legal to use parentheses or other bracketing characters ({}, [], <>) as regex delimiters, it's not a good idea IMO. Most people prefer to use one of the less common punctuation characters. For example:
'~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u'
'%\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+%u'
Of course, you could also escape the slashes in the regex with backslashes, but why bother?
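A short sketch pulling the delimiter advice together. The variable names are made up for illustration; preg_quote() with its second argument is the standard way to escape the delimiter when literal text is embedded in a pattern:

```php
<?php
$path = '2010/02/14/this-is-something';

// "~" never appears in the pattern, so the slashes need no escaping.
$withTilde = preg_match('~\p{Nd}{4}/\p{Nd}{2}/\p{Nd}{2}/\p{L}+~u', $path);

// Keeping "/" as the delimiter means every slash must be escaped.
$withSlash = preg_match('/\p{Nd}{4}\/\p{Nd}{2}\/\p{Nd}{2}\/\p{L}+/u', $path);

// For literal text, preg_quote() escapes the delimiter for you.
$quoted = preg_quote('2010/02/14', '/');
$literal = preg_match('/^' . $quoted . '/', $path);

var_dump($withTilde, $withSlash, $quoted, $literal);
```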
