I am using utf8 format to store all my data into mysql. Before data is inserted into the database I need to clean the strings with unwanted characters. The strings are in utf8 format. I know how to use regex and string replace but do not know how to work with arabic characters.
Sample string that needs to be cleaned : "████ .. الــقــوانين الجديـــدة في قســـم الـعنايـ";
Thanking you
Ok. As #Jonathan Leffler already said, if you can specify the unicode character ranges for the characters that need to be replaced, you can use a regular expression to replace the characters with an empty string.
A unicode character is specified as \x{FFFF} in an expression (in PHP). In addition, you have to set the u modifier to make PHP treat the pattern as UTF8.
So in the end, you have something like this:
preg_replace('/[\x{FFFF}-\x{FFFF}]+/u','',$string);
where
/.../u are the delimiters plus the modifier
[...]+ is a character class plus quantifier, which means match any of these characters inside one or mor times
\x{FFFF}-\x{FFFF} is a unicode character range (obviously you have to provide the right codepoints/numbers of the characters).
You can also negate the group with a ^ you can specify the range which you want to keep:
preg_replace('/[^\x{FFFF}-\x{FFFF}]+/u','',$string);
More information:
Regular expressions
Regular expressions in PHP
Unicode Charts
Related
I was using the following regex with preg_replace to filter inputs:
/[^A-Za-z0-9[:space:][:blank:]_<>=##£€$!?:;%,.\\'\\\"()&+\\/-]/
However this does not allow accented characters like umlauts so I changed it to:
/[^\w[:space:][:blank:]_<>=##$£€!?:;%,.\\'\\\"()&+\\/-]/u
This however does work with the £ or € characters, nothing is returned, but I need to accept these characters, I have tried escaping them but that doesn't work.
Also I want to create an regex that is similar to just A-Za-z but will allow accented characters, how can I do that?
From http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.
That means that first you have to make sure the input string is proper UTF-8 text.
Secondly, have you heard of unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example you could use \p{S} to match all currency symbols, or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/.
This will though match pretty much nothing, as it allows pretty much all characters to be used - ^ at the start of a regex character class (something between [ and ]) means "everything that is not what is in this class will be matched".
On top of that, your regex will only match input that has a length of exactly one - if you want to match everything, you should begin adding a + after your closing ] to keep matching characters until the pattern fails.
So, for that sake, what exactly are you trying to achieve? Maybe we can suggest you some more regex improvements if we know what you're trying to do.
I'm using PHP, and trying to write a regular expression that matches any alphabet in any language but not numbers.
I've tried /\p{L}+/ But it matches unicode alphabets and numbers too. I'm checking against Arabic and English languages. English numbers doesn't pass which is normal, but Arabic numbers pass which is not normal.
Is there another regular expression that matches only alphabets in any language ?
The regex engine need to know that the target string is an unicode string (to avoid interpretation errors). To do that you can use the u modifier, that has two functions:
it expands classical shorthand character classes like \w \d to unicode characters (and not only ascii characters)
it forces the string to be seen as an unicode string
So you can use: /\pL+/u
Note that in your particular case, the first behavior is not needed, but you can only switch on the second behavior with: /(*UTF8)\pL+/ ((*UTF8) must be placed at the very begining of the pattern)
I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace, and special characters such as ♦◘•♠♥☻, the other known special characters which are %$#&*# are already matched, but somehow the ones I mentioned before are not matched..
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apology for asking two questions on the same subject. I hope this one is clear enough for you.
use this
[\W]+
will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
To match those kinds of special characters you can use the unicode values.
Example:
\u0000-\uFFFF
\x00-\xFF
The top is UTF-16, the bottom is UTF-8.
Refer to a UTF-8/16 character table online to match up your symbols with their unicode values, then create a range to keep your expression short.
You can use the \p{S} character class (or \p{So}) that matches symbol characters (that includes this kind of characters: ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find code of characters here.
This regular expression is supposed to match all non-ASCII characters, 0-128 code points:
/[^x00-x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.
Generating this range maybe simple with range(0, 1114111). Then I should covert each decimal number to hexadecimal with dechex() function.
After that, how can i convert the hexadecimal number to the actual character? And how can exclude characters already in ASCII scheme?
It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier then first of all you must use UTF-8 encoding for both the regular expression and the subject and the regex engine will automatically interpret legal UTF-8 byte sequences as just one character. In this mode the regular expression [^x00-x7F] will match all characters outside the Latin-1 supplement block, including those with code points greater than 255. You will also need to generate the UTF-8 representations of each character (given its code point) manually.
If you do not use the /u modifier then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^x00-x7F] regex (because it's only going to be matching random bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of random characters you will again need to use custom code that depends on the specific encoding.
I think the hex2bin(string) function will convert a hex string into a binary string. To exclude ASCII character codepoints, just begin from the x80 hex codepoint (skipping x00 to x7F).
But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.
I have a set of characters like
., !, ?, ;, (space)
and a string, which may or may not be UTF 8 (any language).
Is there a easy way to find out if the string has one of the character set above?
For example:
这是一个在中国的字符串。
which translates to
This is a string in chinese.
The dot character looks different in the first string. Is that a totally different character, or the dot correspondent in utf 8?
Or maybe there's a list somewhere with Unicode punctuation character codes?
In Unicode there are character propertiesPHP Docs, for example Symbols, Letters and the like. You can search for any string of a specific class with preg_matchDocs and the u modifier.
echo preg_match('/pP$/u', $str);
However, your string needs to be UTF-8 to do that.
You can test this on your own, I created a little script that tests for all properties via preg_match:
Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).
Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.
Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character than . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:
preg_match('/[.!?;。]/u', $str, $match)
This will return either 0 or 1 and $match will contain the matched character. With this it’s important that your string in $str is properly encoded in UTF-8.
If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:
/\p{P}/u
you are not trying to transliterate, you are trying to translate!
UTF-8 is not a language, is a unicode character set that supports (virtually) all languages known in the world
what you are trying to do is something like this:
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "à è ò ù");
that not works with your chinese example