For a long time, any time I've needed to use a regular expression, I've standardized on using the copyright symbol © as the delimiter because it was a symbol that wasn't on the keyboard that I was sure to not use in a regular expression, unlike ! # # \ or / (which are sometimes all in use within in a regex).
Code:
$result=preg_match('©<.*?>©', '<something string>');
However, today I needed to use a regular expression with accented characters which included this:
Code:
[a-zA-ZàáâäãåąćęèéêëìíîïłńòóôöõøùúûüÿýżźñçčšžÀÁÂÄÃÅĄĆĘÈÉÊËÌÍÎÏŁŃÒÓÔÖÕØÙÚÛÜŸÝŻŹÑßÇŒÆČŠŽ∂ð \,\.\'-]+
After including this new regex in the PHP file in my IDE (Eclipse PDT), I was prompted to save the PHP file as UTF-8 instead of the default cp1252.
After saving and running the PHP file, every time I used a regex in a preg_match() or preg_replace() function call, it generated a generic PHP warning (Warning: preg_match in file.php on line x) and the regex was not processed.
So--two questions:
1) Is there another symbol that would be good to use as a delimiter that isn't typically found on a keyboard (`~!##$%^&*()+=[]{};\':",./<>?|\) that I can standardize on and not worry about having to check each and every regex to see if that symbol is actually used somewhere in the expression?
2) Or, is there a I way I can use the copyright symbol as the standard delimiter when the file format is UTF-8?
One thing that needs correcting is that if your regular expression and/or input data is encoded in UTF-8 (which in this case it is, since it comes straight from inside a UTF-8 encoded file) you must use the u modifier for your regular expression.
Another issue is that the copyright character should not be used as a delimiter in UTF-8 because the PCRE functions consider that the first byte of your pattern encodes your delimiter (this could plausibly be called a bug in PHP).
When you attempt to use the copyright sign as a delimiter in UTF-8, what actually gets saved into the file is the byte sequence 0xC2 0xA9. preg_match looks at the first byte 0xC2 and decides that it is an alphanumeric character because in your current locale that byte corresponds to the character Latin capital letter A with circumflex  (see extended ASCII table). Therefore a warning is generated and processing is immediately aborted.
Given these facts, the ideal solution would be to choose an unusual delimiter from inside the ASCII character set because that character would encode to the same byte sequence both in single byte encodings and in UTF-8.
I would not consider printable ASCII characters unusual enough for this purpose, so a good choice would be one of the control characters (ASCII codes 1 to 31). For example, STX (\x02) would fit the bill.
Together with the u regex modifier this means you should write the regex like this:
$result = preg_match("\x02<.*?>\x02u", '<something string>');
Related
anubhava's answer about matching ranges of unicode characters led me to the regex to use for cleaning up a specific range of single code point of characters. With it, now I can match all miscellaneous symbols in this list (includes emoticons) with this simple expression:
preg_replace('/[\x{2600}-\x{26FF}]/u', '', $str);
However, I also want to match those in this list of paired/double surrogates emoji, but as nhahtdh explained in a comment:
There is a range from d800 to dfff to specify surrogates in UTF-16 to allow for more characters to be specified. A single surrogate is not a valid character in UTF-16 (a pair is necessary to specify a valid character).
So, for example, when I try this:
preg_replace('/\x{D83D}\x{DE00}/u', '', $str);
For replacing only the first of the paired surrogates on this list, i.e.: 😀
PHP throws this:
preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
I have tried several different combinations, including the supposed combination of the above code points in UTF8 for 😀 ('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u'), but I was still unable to match it. I also looked into other PCRE pattern modifiers, but it seems u is the only one that allows to point through UTF8.
Am I missing any "escape" alternative here?
revo's comment above was very helpful to find a solution:
If your PHP isn't shipped with a PCRE build for UTF-16 then you can't perform such a match. From PHP 7.0 on, you're able to use Unicode code points following this syntax \u{XXXX} e.g. preg_replace("~\u{1F600}~", '', $str); (Mind the double quotes)
Since I am using PHP 7, echo "\u{1F602}"; outputs 😂 according to this PHP RFC page on unicode escape. This proposal was in essence:
A new escape sequence is added for double-quoted strings and heredocs.
\u{ codepoint-digits } where codepoint-digits is composed of hexadecimal digits.
This implies that the matching string in preg_replace (normally single-quoted for not messing up with double-quoted strings variable expansion), now needs some preg_quote magic. This is the solution I came up with:
preg_replace(
// single point unicode list
"/[\x{2600}-\x{26FF}".
// http://www.fileformat.info/info/unicode/block/miscellaneous_symbols/list.htm
// concatenates with paired surrogates
preg_quote("\u{1F600}", '/')."-".preg_quote("\u{1F64F}", '/').
// https://www.fileformat.info/info/unicode/block/emoticons/list.htm
"]/u",
'',
$str
);
Here's the proof of the above in 3v4l.
EDIT: a simpler solution
In another comment made by revo, it seems that by placing unicode characters directly into the regex character class, single-quoted strings and previous PHP versions (e.g. 4.3.4) are supported:
preg_replace('/[☀-⛿😀-🙏]/u','YOINK',$str);
For using PHP 7's new feature though, you still need double-quotes:
preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u",'YOINK',$str);
Here's revo's proof in 3v4l.
I was using the following regex with preg_replace to filter inputs:
/[^A-Za-z0-9[:space:][:blank:]_<>=##£€$!?:;%,.\\'\\\"()&+\\/-]/
However this does not allow accented characters like umlauts so I changed it to:
/[^\w[:space:][:blank:]_<>=##$£€!?:;%,.\\'\\\"()&+\\/-]/u
This however does work with the £ or € characters, nothing is returned, but I need to accept these characters, I have tried escaping them but that doesn't work.
Also I want to create an regex that is similar to just A-Za-z but will allow accented characters, how can I do that?
From http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8) This modifier turns on additional functionality of PCRE
that is incompatible with Perl. Pattern and subject strings are
treated as UTF-8. An invalid subject will cause the preg_* function to
match nothing; an invalid pattern will trigger an error of level
E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid
since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been
regarded as valid UTF-8.
That means that first you have to make sure the input string is proper UTF-8 text.
Secondly, have you heard of unicode categories? If not, head to http://www.regular-expressions.info/unicode.html and search for Unicode categories. For example you could use \p{S} to match all currency symbols, or \p{L} for all letters. Your regex could (probably) be written as follows: /[^\p{L}\p{P}\p{N}\p{S}\p{M}]/.
This will though match pretty much nothing, as it allows pretty much all characters to be used - ^ at the start of a regex character class (something between [ and ]) means "everything that is not what is in this class will be matched".
On top of that, your regex will only match input that has a length of exactly one - if you want to match everything, you should begin adding a + after your closing ] to keep matching characters until the pattern fails.
So, for that sake, what exactly are you trying to achieve? Maybe we can suggest you some more regex improvements if we know what you're trying to do.
I write a regex that should extracts following patterns;
"çççoookkk gggüüüzzzeeelll" (it means vvveeerrryyy gggoooddd with turkish characters "ç" and "ü")
"ccccoookkk ggguuuzzzeeelll" (it means the same but with english characters "c" and "u")
here is the regular expressions i'm trying;
"\b[çc]+o+k+\sg+[üu]+z+e+l+\b" : this works in english but not in turkish characters
"çok": finds "çok" but when i try "ç+o+k+" doesn't work for "çççoookkk", it finds "çoookkk"
"güzel": finds "güzel" but when i try "g+ü+z+e+l+" doesn't work for "gggüüüzzzeeelll"
"\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly
"[çc]ok\sg[uü]zel": I also tried this to get "çok güzel" pattern but doesn't work neither.
I thing the problem might be using regex operators with turkish characters. I don't know how can i solve this.
I am using http://www.myregextester.com to check if my regular expressions are correct.
I am using Php programming language to get a specific pattern from searched tweets via Twitter Rest Api.
Thanks,
You have not specified what programming language you are using, but in many of them, the \b character class can only be used with plain ASCII encoding.
Internally, \b is processed as a boundary between \w and \W sets.
In turn, \w is equal to [a-zA-Z0-9_].
If you are not using any fancy space marks (you shouldn't), then consider using regular whitespace char classes (\s).
See this table (scroll down to Word Boundaries section) to check if your language supports Unicode for \b. If it says, "ascii", then it does not.
As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
Se also: utf-8 word boundary regex in javascript
Further reading:
An excellent article about using Unicode characters in regular expressions
An article for word boundaries
List of Turkish Unicode code points
This regular expression is supposed to match all non-ASCII characters, 0-128 code points:
/[^x00-x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.
Generating this range maybe simple with range(0, 1114111). Then I should covert each decimal number to hexadecimal with dechex() function.
After that, how can i convert the hexadecimal number to the actual character? And how can exclude characters already in ASCII scheme?
It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier then first of all you must use UTF-8 encoding for both the regular expression and the subject and the regex engine will automatically interpret legal UTF-8 byte sequences as just one character. In this mode the regular expression [^x00-x7F] will match all characters outside the Latin-1 supplement block, including those with code points greater than 255. You will also need to generate the UTF-8 representations of each character (given its code point) manually.
If you do not use the /u modifier then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^x00-x7F] regex (because it's only going to be matching random bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of random characters you will again need to use custom code that depends on the specific encoding.
I think the hex2bin(string) function will convert a hex string into a binary string. To exclude ASCII character codepoints, just begin from the x80 hex codepoint (skipping x00 to x7F).
But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.
I am using utf8 format to store all my data into mysql. Before data is inserted into the database I need to clean the strings with unwanted characters. The strings are in utf8 format. I know how to use regex and string replace but do not know how to work with arabic characters.
Sample string that needs to be cleaned : "████ .. الــقــوانين الجديـــدة في قســـم الـعنايـ";
Thanking you
Ok. As #Jonathan Leffler already said, if you can specify the unicode character ranges for the characters that need to be replaced, you can use a regular expression to replace the characters with an empty string.
A unicode character is specified as \x{FFFF} in an expression (in PHP). In addition, you have to set the u modifier to make PHP treat the pattern as UTF8.
So in the end, you have something like this:
preg_replace('/[\x{FFFF}-\x{FFFF}]+/u','',$string);
where
/.../u are the delimiters plus the modifier
[...]+ is a character class plus quantifier, which means match any of these characters inside one or mor times
\x{FFFF}-\x{FFFF} is a unicode character range (obviously you have to provide the right codepoints/numbers of the characters).
You can also negate the group with a ^ you can specify the range which you want to keep:
preg_replace('/[^\x{FFFF}-\x{FFFF}]+/u','',$string);
More information:
Regular expressions
Regular expressions in PHP
Unicode Charts