In PHP it is a common practice to treat strings as immutable. Sometimes there's a need to modify a string "in-place".
We go with the additional array creation approach.
This array should contain every single letter from the source string.
There's a function for that in PHP (str_split). One issue: it splits by bytes, so it breaks multibyte characters apart.
There's also a mb_split function which takes a regex as an input parameter for separator sequence. So
mb_split('.', '123')
returns ['', '', '', ''].
BUT:
mb_split('', '123')
returns ['123'].
So I believe there should be a counterpart regex that matches the empty position between any two characters, multibyte or not.
So for '123' it should match
'1~2', '2~3'
where ~ is an actual match. That is just like \b but for anything.
Is there a regex hack to do so?
Use
preg_match_all('~\X~u', $s, $arr)
The $arr[0] will contain all the characters. The \X pattern matches any Unicode grapheme. The /u modifier is necessary to make the regex engine treat the input string as a Unicode string and make the pattern Unicode aware.
See the PHP demo.
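A minimal sketch of the split, edit, rejoin round trip described above. The helper names split_graphemes and split_code_points are made up for illustration. The second helper assumes that an empty pattern with the u modifier is what the question was after: it splits at the empty position between code points, which is not always the same as between graphemes.

```php
<?php
// Hypothetical helper: split a UTF-8 string into grapheme clusters using \X.
function split_graphemes(string $s): array
{
    preg_match_all('~\X~u', $s, $m);
    return $m[0];
}

// Hypothetical helper: split at the empty position between code points.
function split_code_points(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

$chars = split_graphemes("a\u{00F1}b"); // "a", n-with-tilde, "b"
$chars[1] = 'n';                        // edit a single character
$edited = implode('', $chars);          // rejoin into a string
echo $edited, "\n";                     // anb
```

The two helpers differ on combining sequences: "e" followed by U+0301 is one grapheme but two code points, so pick the one that matches what "a character" means for your data.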
I know that PHP PCRE functions treat strings as byte sequences, so many sites suggest using the /u modifier so that both the input and the regex are handled as UTF-8.
But do I really need this always? My tests show that this flag makes no difference when I don't use escape sequences, the dot, or anything similar.
For example
preg_match('/^[\da-f]{40}$/', $string); to check if string has format of a SHA1 hash
preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every char that is non-ASCII letter or number
preg_replace('/^\+\((.*)\)$/', '\1', $string); for getting inner content of +(XYZ)
These regexes contain only single-byte ASCII symbols, so they should work on every input, regardless of encoding, shouldn't they? Note that the third regex uses the dot operator, but as I cut off some ASCII chars at the beginning and end of the string, this should work on UTF-8 also, correct?
Can anyone tell me if I'm overlooking something?
There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.
The second expression may give you more spacers than you expect; for example:
echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
// => 0000
The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).
This is more dangerous:
echo preg_replace('/^(.)/', "0", "💩");
// => 0???
Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe, and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
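To make the byte-versus-character difference concrete, here is a small sketch that runs the two risky patterns above with and without /u. The variable names are only for illustration, and the emoji is written as "\u{1F4A9}" (PHP 7+ syntax) so the example does not depend on the source file's encoding.

```php
<?php
$poo = "\u{1F4A9}"; // one code point, four bytes in UTF-8

$bytes = strlen($poo); // 4

// Without /u the negated class is applied to each byte separately.
$byteWise = preg_replace('/[^a-zA-Z0-9]/', '0', $poo);  // "0000"

// With /u the class is applied to the whole code point.
$charWise = preg_replace('/[^a-zA-Z0-9]/u', '0', $poo); // "0"

// With /u the dot consumes the full character, so no stray bytes remain.
$dotWise = preg_replace('/^(.)/u', '0', $poo);          // "0"

echo $bytes, ' ', $byteWise, ' ', $charWise, ' ', $dotWise, "\n";
```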
The Unicode modifier u allows proper detection of accented characters, which are always multibyte in UTF-8.
preg_match('/([\w ]{2,})/', 'baz báz báž', $match);
// $match[0] = "baz b" ... wrong, accented/multibyte chars silently ignored
preg_match('/([\w ]{2,})/u', 'baz báz báž', $match);
// $match[0] = "baz báz báž" ... correct
Use it also for safe detection of whitespaces:
preg_replace('/\s+/u', ' ', $txt); // works reliably e.g. with EOLs (line endings)
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
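The "invalid subject" case from the manual text above is easy to miss, because the call fails rather than returning "no match". A minimal sketch, assuming a subject that ends in a byte (0xFF) that can never occur in valid UTF-8:

```php
<?php
$invalid = "abc\xFF"; // 0xFF is never valid in UTF-8

// With /u the whole subject is validated first, so the call fails outright.
$withU = preg_match('/b/u', $invalid);
$isUtf8Error = (preg_last_error() === PREG_BAD_UTF8_ERROR);

// Without /u the subject is just bytes, so the ASCII pattern still matches.
$withoutU = preg_match('/b/', $invalid);

var_dump($withU, $isUtf8Error, $withoutU);
```

Checking for a false return (with ===) and calling preg_last_error() is one way to tell a failed call apart from a genuine non-match.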
You will need this when you have to compare Unicode characters, such as Korean or Japanese.
In other words, if you are only comparing strings that contain no non-ASCII characters (such as plain English text), you don't need to use this flag.
I have text with unicode soft hyphens (U+00AD) that I wish to remove. I'm attempting to do this with PHP's mb_ereg_replace() function. It's finding the soft hyphens, but the replace procedure is removing both the soft hyphens and the first character that immediately follows them.
My code is:
$text_cleansed = mb_ereg_replace('[\u00AD]', '', $text);
For example, if $text were "en-dur-ance" (with the hyphens shown here being the invisible unicode soft hyphens), then $text_cleansed would be "enurnce"; -d and -a have been removed, when for each soft hyphen only the soft hyphen should be removed. mb_ereg_replace has therefore removed each soft hyphen and the first character that follows it. Surely, I must be feeding incorrect arguments into the function.
What is causing this behavior, and what would be the correct arguments for the function?
PHP regex does not support \u notation. The symbols in your regex are treated as separate entities, not as a hex notation (as '\u', '0', 'A', 'D').
Use preg_replace with \x{} notation with /u modifier (necessary to interpret the pattern and the input string as Unicode strings):
preg_replace('~\x{00AD}~u', '', $s)
See IDEONE demo
What stribizhev has written is correct. It's a matter of syntax that isn't clear in the PHP mb_ereg documentation. I'm answering after already marking his second response as the answer, because his third response has the specific answer to my original question (re: multi-byte strings--mb--as opposed to regular strings).
1) If non-multi-byte strings are used, preg_replace('~\x{00AD}~u', '', $text) is the solution.
2) If multi-byte strings are used, mb_ereg_replace('[\x{00AD}]', '', $text) is the solution.
It's a matter of syntax, obscure to those not experienced in regex work. With luck, this will help someone else with a similar problem.
If your string is UTF-8 encoded, you don't need to take into account that it is multibyte in order to remove a character from the ASCII range (00-7F), since those bytes are never used as part of a multibyte character. In this case, you can use str_replace:
$result = str_replace('-', '', $text);
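For the soft hyphen itself (U+00AD, which is the two bytes C2 AD in UTF-8), here is a short sketch comparing the preg_replace approach from above with a plain byte-level str_replace. It assumes $text is valid UTF-8; the variable names are illustrative only.

```php
<?php
// "en-dur-ance" with soft hyphens (U+00AD) in place of the dashes.
$text = "en\u{00AD}dur\u{00AD}ance";

// Regex approach: \x{00AD} with /u matches the code point.
$viaRegex = preg_replace('~\x{00AD}~u', '', $text);

// Byte approach: search for the UTF-8 encoding of U+00AD directly.
$viaBytes = str_replace("\xC2\xAD", '', $text);

echo $viaRegex, ' ', $viaBytes, "\n"; // endurance endurance
```

The byte-level search is safe here because a complete UTF-8 sequence cannot match in the middle of another character's encoding.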
I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).
For example, if I have the following string assignment:
$str = '\u304a\u306f\u3088\u3046';
I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.
As per other Stack Overflow posts I saw for similar issues, I first attempted the following:
$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);
However, whenever I attempt to do this, I get the following PHP error:
Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u
I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.
Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.
Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?
From the PHP manual:
Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.
First of all, in your regular expression, you're only using one backslash (\). As explained in the PHP manual, you need to use \\\\ to match a literal backslash (with some exceptions).
Second, you are missing the capturing group in your original expression, so $1 in the replacement has nothing to refer to. preg_replace() searches the given string for matches to the supplied pattern and replaces each whole match with the replacement string, in which $1 stands for the text captured by the first group.
The updated regular expression with proper escaping and correct capturing groups would look like:
$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);
Output:
&#x304a;&#x306f;&#x3088;&#x3046; (which a browser renders as おはよう)
Expression: \\\\u([0-9a-f]+)
\\\\ - matches a literal backslash
u - matches the literal u character
( - beginning of the capturing group
[0-9a-f]+ - character class with the + quantifier -- matches a digit (0 - 9) or a letter (from a - f) one or more times
) - end of capturing group
i modifier - used for case-insensitive matching
Replacement: &#x$1;
& - literal ampersand character (&)
# - literal pound character (#)
x - literal character x
$1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
; - literal semicolon that terminates the entity
RegExr Demo.
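Since the question asked for actual characters or HTML entities, here is a minimal sketch that produces both. It assumes every escape has exactly four hex digits, as in the example string, and uses html_entity_decode to turn the entities into UTF-8 text. The variable names are illustrative.

```php
<?php
$str = '\u304a\u306f\u3088\u3046'; // literal backslash-u escapes (single quotes)

// Step 1: rewrite each \uXXXX escape as a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $str);

// Step 2: decode the entities into real UTF-8 characters.
$chars = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $entities, "\n"; // &#x304a;&#x306f;&#x3088;&#x3046;
echo $chars, "\n";    // the four hiragana characters
```

Characters outside the Basic Multilingual Plane are usually written as two \uXXXX surrogate escapes; this sketch does not combine those pairs.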
This page here—titled Escaping Unicode Characters to HTML Entities in PHP—seems to tackle it with this nice function:
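function unicode_escape_sequences($str){
    // json_encode escapes non-ASCII characters as \uXXXX
    $working = json_encode($str);
    // rewrite each \uXXXX escape as a hexadecimal HTML entity
    $working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
    // json_decode removes the JSON quoting again
    return json_decode($working);
}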
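It round-trips through json_encode and json_decode: json_encode turns the non-ASCII UTF-8 characters into \uXXXX escapes, the regex rewrites those as HTML entities, and json_decode removes the JSON quoting. Very nice technique. But for your example, where the string already contains the \uXXXX escapes, this would work: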
$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $str);
The output is:
&#x304a;&#x306f;&#x3088;&#x3046;
Which is:
おはよう
Which translates to:
Good morning
When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.
The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/
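The anchoring and newline points from these answers can be checked with a few calls. This is a sketch with made-up sample strings. The last case is an extra assumption worth knowing about: by default $ also matches just before a trailing newline.

```php
<?php
$long = 'abcdefgh';

// Unanchored: succeeds by matching only the first five characters.
$unanchored = preg_match('/.{0,5}/', $long, $m); // 1, $m[0] is "abcde"

// Anchored: the whole string must fit, so eight characters fail.
$anchored = preg_match('/^.{0,5}$/', $long);     // 0

// The dot skips newlines by default; the s modifier includes them.
$dotDefault = preg_match('/^.{0,5}$/', "ab\ncd");  // 0
$dotAll     = preg_match('/^.{0,5}$/s', "ab\ncd"); // 1

// $ also matches before a final newline, so six bytes can still pass.
$trailing = preg_match('/^.{0,5}$/', "abcde\n");   // 1

// Plain length check, as suggested above.
$byLength = strlen($long) <= 5;                    // false
```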
Hi all, I have this code that checks for 5 or more consecutive digits:
if (preg_match("/\d{5}/", $input, $matches) > 0)
return true;
It works fine for input that is English, but it's tripping up when the input string contains Arabic/multibyte characters - it returns true sometimes even if there aren't numbers in the input text.
Any ideas ?
You appear to be using PHP.
Do this:
if (preg_match("/\d{5}/u", $input, $matches) > 0)
return true;
Note the 'u' modifier at the end of expression. It tells preg_* to use unicode mode for matching.
You have to set yourself up properly when you want to deal with UTF-8.
You can recompile php with the PCRE UTF-8 flag enabled.
Or, you can add the sequence (*UTF8) to the start of your regex. For example:
/(*UTF8)[[:alnum:]]/, input é, output TRUE
/[[:alnum:]]/, input é, output FALSE.
Check out http://www.pcre.org/pcre.txt, which contains lots of information about UTF-8 support in the PCRE library.
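In PCRE itself, UTF-8 mode alone leaves predefined character classes like \d and [[:digit:]] matching only ASCII characters. Recent PHP versions also turn on PCRE's Unicode-property mode (PCRE_UCP) when /u is set, so there \d can match non-ASCII digits as well. To match potentially non-ASCII digits explicitly, whatever that setting is, use the equivalent Unicode property, \p{Nd}: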
$s = "12345\xD9\xA1\xD9\xA2\xD9\xA3\xD9\xA4\xD9\xA5";
preg_match_all('~\p{Nd}{5}~u', $s, $matches);
See it in action on ideone.com
If you need to match specific characters or ranges, you can either use the \x{HHHH} escape sequence with the appropriate code points:
preg_match_all('~[\x{0661}-\x{0665}]{5}~u', $s, $matches);
...or use the \xHH form to input their UTF-8 encoded byte sequences:
preg_match_all("~[\xD9\xA1-\xD9\xA5]{5}~u", $s, $matches);
Notice that I switched to double-quotes for this last example. The \p{} and \x{} forms were passed through to be processed by the regex compiler, but this time we want the PHP compiler to expand the escape sequences. That doesn't happen in single-quoted strings.
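Putting the pieces together for the original "five consecutive digits" check, here is a sketch of a helper that makes the intent explicit. The function name has_five_digits and its $asciiOnly flag are hypothetical; it uses [0-9] when only ASCII digits should count and \p{Nd} with /u when digits from any script should.

```php
<?php
// Hypothetical helper: true if $input contains five consecutive digits.
function has_five_digits(string $input, bool $asciiOnly = false): bool
{
    // [0-9] is byte-safe in UTF-8; \p{Nd} needs /u to see code points.
    $pattern = $asciiOnly ? '/[0-9]{5}/' : '/\p{Nd}{5}/u';
    return preg_match($pattern, $input) === 1;
}

$arabicDigits = "\u{0661}\u{0662}\u{0663}\u{0664}\u{0665}"; // Arabic-Indic 1 to 5
$arabicWord   = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}"; // letters only

var_dump(has_five_digits('abc12345'));          // bool(true)
var_dump(has_five_digits($arabicDigits));       // bool(true)
var_dump(has_five_digits($arabicDigits, true)); // bool(false)
var_dump(has_five_digits($arabicWord));         // bool(false)
```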