How locale-aware is preg_replace in PHP?

If I do preg_replace('/[^a-zA-Z0-9\s-_]/', '', $val) in a multilingual application, will it handle things like accented or Russian characters? If not, how can I filter user input to allow only the above characters, but with locale awareness?
Thanks!
codecowboy.

The only useful information I can find is from this page of the manual, which states:
A "word" character is any letter or
digit or the underscore character,
that is, any character which can be
part of a Perl "word". The definition
of letters and digits is controlled by
PCRE's character tables, and may vary
if locale-specific matching is taking
place. For example, in the "fr"
(French) locale, some character codes
greater than 128 are used for accented
letters, and these are matched by \w.
Still, I wouldn't bet on it working the way you want; you'll probably have to test to be certain. To be safe, using Unicode matching would probably be better. About Unicode, the manual says this:
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.
So Unicode matching might be slower, but it's probably the safer solution.

No, it will only match the ASCII characters a-z, A-Z, and 0-9. To match any letter or number in any language, you need to use the Unicode properties of the regex engine (note the u modifier, without which the pattern is applied to raw bytes instead of UTF-8 characters):
preg_replace('/[^\p{L}\p{N}]/u', '', $string);
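To see the difference, here is a minimal sketch (the sample string and variable are mine, not from the question; both lines assume UTF-8 input):
$val = 'Grüße_мир-123!';
// ASCII-only class: without the u modifier it works on bytes, so the
// multi-byte ü, ß, and Cyrillic letters are stripped along with the '!'.
echo preg_replace('/[^a-zA-Z0-9\s_-]/', '', $val);   // Gre_-123
// Unicode properties with the u modifier: letters and digits from any
// script survive; only the '!' is removed.
echo preg_replace('/[^\p{L}\p{N}\s_-]/u', '', $val); // Grüße_мир-123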

Related

PHP - Sanitize inputs for hashtags, allowing arabic, hebrew, japanese, etc., and emoji?

I'm looking at sanitizing inputs for a hashtag search engine.
Effectively I want to allow all alphanumeric characters, cyrillic, arabic, hebrew, etc., as well as emoji characters, but strip any symbols other than underscore.
After spending an hour or so looking online I haven't yet found a conclusive answer. Is there a regex that would enable me to sanitize such an input? Basically remove anything that isn't alphanumeric / letters / emojis.
Thanks!
Mark
I would basically enable Unicode and match
/emoji-regex(*SKIP)(?!)|[^\p{L}\p{Nd}_]+/u
and replace with nothing.
There is a negated class [^ ] (meaning "not these") consisting of:
\p{L} - all letters
\p{Nd} - decimal digits
_ - underscore
The actual emoji pattern was omitted here because of its size; emoji-regex in the pattern above is a placeholder for it.
The (*SKIP)(?!) construct makes the regex skip over the emoji by moving the search position past them, so only blocks of one or more characters that are not letters, digits, or underscores get replaced.
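As a minimal sketch of the technique, with a deliberately tiny emoji class standing in for the full emoji-regex (the two ranges below only cover some common emoji blocks, so treat them as a placeholder, not a complete solution):
// Sketch only: [\x{1F300}-\x{1FAFF}\x{2600}-\x{27BF}] is a crude stand-in
// for the real emoji-regex; anything it matches is skipped, and anything
// else that isn't a letter, digit, or underscore is removed.
$hashtag = 'Héllo🌍_мир☃2024!!';
$clean = preg_replace(
    '/[\x{1F300}-\x{1FAFF}\x{2600}-\x{27BF}](*SKIP)(?!)|[^\p{L}\p{Nd}_]+/u',
    '',
    $hashtag
);
echo $clean; // Héllo🌍_мир☃2024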

Special ä ö characters break UTF-8 encoding

A user on my site inputted special characters into a text field: ä ö
These apparently are not the same ä ö characters I can input from my keyboard because when I paste them into Programmer's Notepad, they split into two: a¨ o¨
On my site's server side I have a PHP script that identifies illegal special characters in user input and highlights them in an HTML error message with preg_replace.
The character splitting happens there too so I get a normal letter a and o with a weird lone xCC character that breaks the UTF-8 string encoding and json_encode function fails as a result.
What would be the best way to handle these characters? Should I try to replace the special ä ö chars and replace them with the regular ones or can I somehow catch the broken UTF-8 chars and remove or replace them?
It's not that these characters have broken the encoding, it's just that Unicode is really complicated.
Commonly used accented letters have their own code points in the Unicode standard, in this case:
U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS"
U+00F6 "LATIN SMALL LETTER O WITH DIAERESIS"
However, to avoid encoding every possibility, particularly when multiple diacritics (accents) need to be placed on the same letter, Unicode includes "combining diacritics", such as:
U+0308 "COMBINING DIAERESIS"
When placed after the code point for a normal letter, these code points add a diacritic to it when displaying.
As you've seen, this means there are two different ways to represent the same letter. To help with this, Unicode includes "normalization forms" defined in an annex to the Unicode standard:
Normalization Form D (NFD): Canonical Decomposition
Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
Normalization Form KD (NFKD): Compatibility Decomposition
Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
Ignoring the "Compatibility" forms for now, we have two options:
Decomposition, which uses combining diacritics as often as possible
Composition, which uses specific code points as often as possible
So one possibility is to convert your input into NFC, which in PHP can be achieved with the Normalizer class in the intl extension.
However, not all combinations can be normalised to a form with no separate diacritics, so this doesn't solve all your problems. You'll also need to look at what characters exactly you want to allow, probably by matching Unicode character properties.
You might also want to learn about "grapheme clusters" and use the relevant PHP functions. A "grapheme cluster", or just "grapheme", is what most readers will think of as "a character" - e.g. a letter with all its diacritics, or a full ideogram.
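For example, a minimal sketch with the intl extension (the sample strings are mine; the \u{...} escapes need PHP 7+):
// "a" followed by U+0308 COMBINING DIAERESIS: two code points, one grapheme.
$decomposed  = "a\u{0308} o\u{0308}";  // what the user pasted
$precomposed = "\u{E4} \u{F6}";        // ä ö as single code points
var_dump($decomposed === $precomposed);                        // false
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $precomposed);                               // true
// Grapheme-aware length counts "a + combining mark" as one character:
var_dump(grapheme_strlen($decomposed));      // int(3)
var_dump(mb_strlen($decomposed, 'UTF-8'));   // int(5)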

Hebrew special characters in regexp

This is my code:
preg_replace('/[^{Hebrew}a-zA-Z0-9_ %\[\]\.\(\)%&-]/s', '', $q);
It's supposed to accept only a-z, A-Z, 0-9, any number of single white spaces, and Hebrew characters.
I tried it in many varations and just couldn't get it to work.
Thanks in advance!
In PCRE, \p{xx} and \P{xx} can take in either a Unicode category name or Unicode script name. The list can be found in PHP documentation or in PCRE man page.
For Hebrew script, you need to use \p{Hebrew}.
I also removed the escapes for ., (, and ), since they lose their special meaning inside the character class []. The s flag (DOTALL) is useless here, since there is no dot metacharacter in your regex. The u modifier, on the other hand, is needed so the pattern is matched against UTF-8 characters rather than raw bytes:
preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/u', '', $q);
Appendix
From Unicode FAQs. It explains the difference between blocks and scripts. For your information, PCRE only has support for matching Unicode scripts and Unicode categories (character properties).
Q: If Unicode blocks aren't code pages, what are they?
A: Blocks in the Unicode Standard are named ranges of code points. They are used to help organize the standard into groupings of related kinds of characters, for convenience in reference. And they are used by a charting program to define the ranges of characters printed out together for the code charts seen in the book or posted online.
Q: Do Unicode blocks have defined character properties?
A: No. The character properties are associated with encoded characters themselves, rather than the blocks they are encoded in.
Q: Does that even apply to the script for characters?
A: Yes. For example, the Thai block contains Thai characters that have the Thai script property, but it also contains the character for the baht currency sign, which is used in Thai text, of course, but which is defined to have the Common script property. To find the script property value for any character you need to rely on the Unicode Character Database data file, Scripts.txt, rather than the block value alone.
Q: So block value is not the same as script value?
A: Correct. In some cases, such as Latin, the encoded characters are spread across as many as a dozen different Unicode blocks. That is unfortunate, but is simply the result of the history of the standard. In other instances, a single block may contain characters of more than one script. For example, the Greek and Coptic block contains mostly characters of the Greek script, but also a few historic characters of the Coptic script.
You should also make sure the file itself is saved as UTF-8 (for example, in Notepad++ go to Encoding -> Encode in UTF-8), and it should work:
preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/u', '', $q);
I also added "u" as a modifier.
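Putting both answers together, a minimal sketch (the sample input is mine):
// Keep Hebrew letters, ASCII alphanumerics, and the listed symbols;
// the u modifier makes \p{Hebrew} work on UTF-8 characters, not bytes.
$q = 'שלום world! 123 <script>';
echo preg_replace('/[^\p{Hebrew}a-zA-Z0-9_ %\[\].()&-]/u', '', $q);
// Output: שלום world 123 script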

How can I detect, or correctly identify the length, of strange characters?

I am inserting soft hyphens into long words programmatically, and am having problems with unusual characters, specifically: ■
Any word over 10 characters gets the soft hyphen treatment. Words are defined with a regex: [A-Za-z0-9,.]+ (to include long numbers). If I split a string containing two of the above Unicode characters with that regex, I get a 'word' like this: ■■
My script then goes through each word, measures the length (mb_strlen($word, 'UTF-8')), and if it is over an arbitrary number of characters, loops through the letters and inserts soft hyphens all over the place (every third character, but not in the last five characters).
With the ■■, the word length is coming out as high enough to trigger the replacement (10). So soft hyphens are inserted, but they are inserted within the characters. So what I get out is something like:
�­�■
In the database, these ■ characters are being stored (in a json_encoded block) as "\u2002", so I can see where the string length is coming from. What I need is a way to identify these characters, so I can avoid adding soft hyphens to words that contain them. Any ideas, anyone?
(Either that, or a way to measure the length of a string, counting these as single characters, and then a way to split that string into characters without splitting it part-way through a multi-byte character.)
With the same caveats as listed in the comments about guessing without seeing the code:
mb_strlen($word, 'UTF-8'), and if it is over an arbitrary number of characters, loops through the letters
I suspect you are actually looping through bytes. This is what will happen if you use array-access notation on a string.
When you are using a multibyte encoding like UTF-8, a letter (or more generally ‘character’) may take up more than one byte of storage. If you insert or delete in the middle of a byte sequence you will get mangled results.
This is why you must use mb_strlen and not plain old strlen. Some languages have a native Unicode string type where each item is a character, but in PHP strings are completely byte-based and if you want to interact with them in a character-by-character way you must use the mb_string functions. In particular to read a single character from a string you use mb_substr, and you'd loop your index from 0 to mb_strlen.
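A character-safe loop looks something like this (a minimal sketch; the variable names are mine):
$word = 'naïve■word';
$len = mb_strlen($word, 'UTF-8');
for ($i = 0; $i < $len; $i++) {
    // mb_substr returns one whole character, however many bytes it takes
    $char = mb_substr($word, $i, 1, 'UTF-8');
    echo $i, ': ', $char, "\n";
}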
It would probably be simpler to take the matched word and use a regular expression replacement to insert the soft hyphen between each sequence. You can get multibyte string support for regex by using the u flag. (This only works for UTF-8, but UTF-8 is the only multibyte encoding you'd ever actually want to use.)
const SHY = "\xC2\xAD"; // U+00AD SOFT HYPHEN encoded as UTF-8
// $0 is the whole match (the three characters just consumed)
$wrappableword = preg_replace('/.{3}\B/u', '$0' . SHY, $longword);
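For example (a sketch; the sample word is mine):
$longword = 'internationalization';
$wrappable = preg_replace('/.{3}\B/u', '$0' . SHY, $longword);
// Soft hyphens are invisible, so swap them for '-' to inspect the result:
echo str_replace(SHY, '-', $wrappable); // int-ern-ati-ona-liz-ati-on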

Regexp character filter

In my code, I use a regexp I googled somewhere, but I don't understand it. :)
preg_match("/^[\p{L} 0-9\-]{4,25}$/", $login))
What does that \p{L} mean? I know what it does -- it matches all letters, national letters included.
And my second question: I want to sanitize user input for in-game chat, so I'm starting with the regexp mentioned above, but I want to allow most special characters. What's the shortest way to do it? Has someone already prepared a regexp for this?
For \p, see Unicode character properties: basically, it requires the character to belong to a specific character class (letter, number, ...).
For your filter, it depends on exactly what you want to allow, but looking at Unicode character classes is the right way to go, I think (individually adding any characters that seem useful to you).
The regular expression matches: any string of 4 to 25 characters, each of which is a letter, a space, a digit, or a dash (the ^ and $ anchors tie it to the whole string).
\p{L} means literally: a character that matches the property "L", where "L" stands for "any letter".
To understand how regexps work:
http://en.wikipedia.org/wiki/Regular_expression
http://www.php.net/manual/en/regexp.reference.unicode.php
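Note that for UTF-8 input the pattern also needs the u modifier, otherwise \p{L} is tested against single bytes. A minimal sketch (the sample login is mine):
$login = 'Łukasz-42';
if (preg_match('/^[\p{L} 0-9-]{4,25}$/u', $login)) {
    echo 'valid login';
}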
If you want to allow most characters, why not just exclude the ones you're not allowing? You can do this with the ^ in your character class:
[^characters I don't want]
Disclaimer: blacklisting might not be the best approach depending on what you're trying to do, and it has to be more thorough than whitelisting.
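A blacklist version for chat might look like this (a minimal sketch; which characters to ban is entirely up to you):
// Strip only the banned characters (here: angle brackets, backticks,
// and ASCII control characters) and leave everything else untouched.
$message = "Héllo <b>world</b> ☺ `rm -rf`";
$clean = preg_replace('/[<>`\x{00}-\x{1F}]/u', '', $message);
echo $clean; // Héllo bworld/b ☺ rm -rf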
