I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?
I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):
// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);
Is this true, or did someone simply get [a-z] confused with \w?
Long story short: Maybe, depends on the system the app is deployed to, depends how PHP was compiled, welcome to the CF of localization and internationalization.
The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by [a-z]. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.
However, the way PHP blindly handles strings as collections of bytes rather than collections of Unicode code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.
Update in 2014: Per JimmiTh's answer below, it looks like (despite some "confusing-to-non-pcre-core-developers" documentation) [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said — framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and servers the developers have no control over. So, while the anonymous Drupal developer's comment is incorrect, it wasn't a matter of "getting [a-z] confused with \w", but of a Drupal developer being unclear/unsure about how PCRE handles [a-z] and choosing the more specific form of abcdefghijklmnopqrstuvwxyz to ensure the exact behavior they wanted.
The comment in Drupal's code is WRONG.
It's NOT true that "international characters (e.g. German umlauts)" might match [a-z].
If, e.g., you have the German locale available, you can check it like this:
setlocale(LC_ALL, 'de_DE'); // German locale (not needed, but you never know...)
echo preg_match('/^[a-z]+$/', 'abc') ? "yes\n" : "no\n";
echo preg_match('/^[a-z]+$/', "\xE4bc") ? "yes\n" : "no\n"; // äbc in ISO-8859-1
echo preg_match('/^[a-z]+$/', "\xC3\xA4bc") ? "yes\n" : "no\n"; // äbc in UTF-8
echo preg_match('/^[a-z]+$/u', "\xC3\xA4bc") ? "yes\n" : "no\n"; // w/ PCRE_UTF8
Output (will not change if you replace de_DE with de_DE.UTF-8):
yes
no
no
no
The character class [abcdefghijklmnopqrstuvwxyz] is identical to [a-z] in both encodings PCRE understands: ASCII-derived single-byte and UTF-8 (which is ASCII-derived too). In both of these encodings [a-z] is the same as [\x61-\x7A].
Things may have been different when the question was asked in 2009, but in 2014 there is no "weird configuration" that can make PHP's PCRE regex engine interpret [a-z] as a class of more than 26 characters (as long as [a-z] itself is written as 5 bytes in an ASCII-derived encoding, of course).
Just an addition to both the already excellent, if contradicting, answers.
The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values". Which is somewhat vague, and yet very precise.
It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables. That function builds the tables in order of character value, using the C library's tolower(i)/toupper(i) for the current locale.
In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that makes it appear outside the a-z range in all the common character encodings used for German (ISO-8859-x, unicode encodings etc.) In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than any actual locale-defined sort order.
PHP has mostly copied PCRE's documentation verbatim in their docs. However, they've actually gone to pains changing the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.
In spite of the above, I'm not quite sure it's true, however.
Well, not in all cases, at least.
The one call PHP makes to pcre_maketables... From the PHP source:
#if HAVE_SETLOCALE
if (strcmp(locale, "C"))
tables = pcre_maketables();
#endif
In other words, if the environment for which PHP is compiled has setlocale and the (LC_CTYPE) locale isn't the POSIX/C locale, the runtime environment's POSIX/C locale's character order is used. Otherwise, the default PCRE tables are used - which are generated (by pcre_maketables) when PCRE is compiled - based on the compiler's locale:
This function builds a set of character tables for character values less than 256. These can be passed to pcre_compile() to override PCRE's internal, built-in tables (which were made by pcre_maketables() when PCRE was compiled). You might want to do this if you are using a non-standard locale. The function yields a pointer to the tables.
While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.
Unless PCRE does some magic when using EBCDIC (and it might), while it's highly unlikely you'd be including umlauts in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition), you might, in the case of EBCDIC, include other unintended characters. And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.
ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern:
Another issue is with character classes ranges. You would think that [a-k] and [x-z] are well defined for latin scripts but that's not the case.
They are certainly well defined, being equivalent to [\x61-\x6b] and [\x78-\x7a], that is, related to code order, not cultural sorting order.
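To tie the two points together with code, here's a small sketch contrasting the locale dependence that does exist for \w with the fixed, code-value meaning of [a-z] (assuming a German ISO-8859-1 locale such as de_DE is actually installed and PHP was built with setlocale support; the middle result depends on the build and installed locales):

setlocale(LC_CTYPE, 'de_DE'); // per the PHP source quoted above, a non-"C" locale triggers pcre_maketables()
echo preg_match('/^\w+$/', 'abc') ? "yes\n" : "no\n";       // yes everywhere
echo preg_match('/^\w+$/', "\xE4bc") ? "yes\n" : "no\n";    // ä in ISO-8859-1: "yes" if locale tables are in effect, "no" with the default tables
echo preg_match('/^[a-z]+$/', "\xE4bc") ? "yes\n" : "no\n"; // stays "no": [a-z] is the code-value range \x61-\x7A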
Related
I have function that sanitizes URLs and filenames and it works fine with characters like éáßöäü as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because it's not easy to determine, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-latin" characters. But maybe there is another good solution to that?
Edit/addition
As asked in the comments: What is the purpose of my question? We had a client whose content was in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ascii-characters' and possibly returned 'blank' (invalid) clean-URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to try replacing those characters, which, as stated in the question and confirmed in the comments, is of course not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this is no longer an issue; I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is no single romanization method; those languages usually have several, used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and therefore romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode file names. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to simply filter those out and otherwise support Unicode file names.
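If you go that route, a minimal sketch of such a filter (the character list is the one from the previous paragraph; the function name is just illustrative):

// Keep Unicode file names; strip only the characters that are actually
// invalid on common file systems, plus ASCII control characters.
// All stripped characters are ASCII, so a byte-wise replace is UTF-8 safe.
function sanitize_filename($name) {
    $name = preg_replace('/[*|\\\\\/:"<>?]+/', '', $name);
    $name = preg_replace('/[\x00-\x1F\x7F]+/', '', $name);
    return trim($name);
}

echo sanitize_filename('日本語: ファイル?.txt'); // 日本語 ファイル.txt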
You could run it through your existing sanitizer, and then convert anything that isn't Latin to Punycode.
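A sketch of that idea, assuming the intl extension is available (idn_to_ascii is really meant for domain labels, so this only works for strings that contain no characters disallowed in IDNA labels, and it returns false on failure):

$label = 'bücher';
$ascii = idn_to_ascii($label);            // "xn--bcher-kva" on typical builds
echo $ascii !== false ? $ascii : $label;  // fall back to the original if conversion fails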
So, as I understand it, you need a character mapping table for every language, and you replace characters according to that table.
For example, to transliterate Russian characters into Latin equivalents, we use such tables (or classes that use these tables).
Interesting: I just found this: http://derickrethans.nl/projects.html#translit
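As a rough illustration of such a mapping (only a handful of lowercase letters shown; a real table would cover the whole alphabet, including uppercase):

// Tiny excerpt of a Cyrillic-to-Latin transliteration table; strtr()
// replaces every key that occurs in the string with its value, and is
// safe for UTF-8 because each key is a distinct byte sequence.
$translit = array(
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'и' => 'i', 'о' => 'o',
    'р' => 'r', 'ж' => 'zh', 'щ' => 'shch',
);
echo strtr('щи и борщ', $translit); // shchi i borshch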
In Unicode, a character can exist in different "compositions".
For example, the character à, whose code point is U+00E0, can also be composed of two code points: U+0061 combined with the combining grave accent U+0300.
Which leaves the question:
What determines which composition a character ends up in?
I mean: the keyboard? The encoding? Copy-pasted text?
I know how to deal with both forms using the \X metacharacter, but I would like someone to explain this to me.
It's ultimately up to the operating system which code point(s) it stores when you hit a key, although there is a convention in the form of the normalization forms (specifically NFC):
http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
Copy-and-paste copies code points, not concepts-of-graphemes (Grapheme is a less ambiguous term, since character can mean both grapheme and code point).
If you're converting from some other character set to Unicode, then the conversion mapping will dictate what code points you end up with, and it nearly always matches how the source character set encodes composite characters - where the source character set has a single code point for a LATIN A WITH UMLAUT, the converted Unicode will too.
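A small sketch of the two compositions and NFC normalization in PHP, assuming the intl extension (which provides the Normalizer class) is available; the byte sequences are the UTF-8 encodings of the code points mentioned in the question:

$precomposed = "\xC3\xA0";   // U+00E0 LATIN SMALL LETTER A WITH GRAVE
$decomposed  = "a\xCC\x80";  // U+0061 followed by U+0300 COMBINING GRAVE ACCENT

var_dump($precomposed === $decomposed);                            // bool(false): different code points, same rendered character
var_dump(Normalizer::normalize($precomposed, Normalizer::FORM_C)
     === Normalizer::normalize($decomposed, Normalizer::FORM_C));  // bool(true): identical after normalizing both to NFC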
Is there a way to select in mysql words that are only Chinese, only Japanese and only Korean?
In English it can be done with:
SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'
or even a "dirty" solution like:
SELECT * FROM table WHERE field > "0" AND field <"ZZZZZZZZ"
Is there a similar solution for eastern languages / CJK characters?
I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.
The words are stored in a utf-8 string field.
If this cannot be done in mysql, can it be done in PHP?
Thanks! :)
edit 1: The data does not indicate which language the string is in, so I cannot filter by another field.
edit 2: Using a translator API like Bing's (Google is closing their translator API) is an interesting idea, but I was hoping for a faster regex-style solution.
Searching for a UTF-8 range of characters is not directly supported in MySQL regexp. See the MySQL reference for regexp, where it states:
Warning: The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.
Fortunately in PHP you can build such a regexp e.g. with
/[\x{1234}-\x{5678}]*/u
(note the u at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the unicode code charts will enable you to pick the appropriate script for the language (although not directly the language itself).
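For example, a sketch that tests a UTF-8 string against a few of the common CJK block ranges (the ranges below are the basic Unicode blocks; the extension blocks can be added the same way):

$str = '日本語のテキスト';  // a UTF-8 string, e.g. fetched from the database

$hasHan      = preg_match('/[\x{4E00}-\x{9FFF}]/u', $str);  // CJK Unified Ideographs
$hasHiragana = preg_match('/[\x{3040}-\x{309F}]/u', $str);  // Hiragana
$hasKatakana = preg_match('/[\x{30A0}-\x{30FF}]/u', $str);  // Katakana
$hasHangul   = preg_match('/[\x{AC00}-\x{D7A3}]/u', $str);  // Hangul syllables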
You can't do this from the character set alone, especially in modern times where Asian texts are frequently "romanized", that is, written with the Roman script. That said, if you merely want to select texts that are superficially 'Asian', there are ways of doing that, depending on just how complicated you want to be and how accurate you need to be.
But honestly, I suggest that you add a new "language" field to your database and ensure that it's populated correctly.
That said, here are some useful links you may be interested in:
Detect language from string in PHP
http://en.wikipedia.org/wiki/Hidden_Markov_model
The latter is relatively complex to implement, but yields a much better result.
Alternatively, I believe that google has an (online) API that will allow you to detect, AND translate a language.
An interesting paper that should demonstrate the futility of this exercise is:
http://xldb.lasige.di.fc.ul.pt/xldb/publications/ngram-article.pdf
Finally, you ask:
If this cannot be done in MySQL, can it be done in PHP?
It will likely be much easier to do this in PHP, because you are better able to perform mathematical analysis on the language string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching them for performance reasons.
You may consider another data structure that contains the words and/or characters, and the language you want to associate them with.
The 'normal' eastern ASCII characters will be associated with many more languages than just English, for instance, just as other characters may be associated with more than just Chinese.
Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.
Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.
Japanese and Chinese both use Han characters though which means the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!
There are some heuristics though.
Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.
Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.
But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.
One method that will work with all text is to look at groups of characters. This means n-grams and probably Markov models as Arafangion mentions in their answer. But be aware that even this is not foolproof in the case of very short strings!
And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.
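In PHP, a rough sketch of the block-range heuristics described above (it ignores the simplified/traditional distinction, the extension blocks, and the short-string ambiguity just mentioned):

// Very rough script-based guess: kana implies Japanese, Hangul implies
// Korean, and Han characters alone are treated as (probably) Chinese.
function guess_cjk_language($str) {
    if (preg_match('/[\x{3040}-\x{30FF}]/u', $str)) return 'Japanese';  // Hiragana or Katakana
    if (preg_match('/[\x{AC00}-\x{D7A3}\x{1100}-\x{11FF}]/u', $str)) return 'Korean';  // Hangul syllables or jamo
    if (preg_match('/[\x{4E00}-\x{9FFF}]/u', $str)) return 'Chinese (or Han-only Japanese/Korean)';
    return 'other';
}

echo guess_cjk_language('ひらがなとカタカナ'); // Japanese
echo guess_cjk_language('한국어');             // Korean
echo guess_cjk_language('中文');               // Chinese (or Han-only Japanese/Korean)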
Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.
The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.
Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):
The PHP script source itself is considered to be in UTF‑8 (eg, strings and regexes).
All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).
All functions with Unicode versions use those instead (eg, Collator::sort for sort).
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).
For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).
That full-unicode thing was precisely the idea of PHP 6 -- which was cancelled more than a year ago.
So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.
One thing that might help with your fourth point, though, is the Function Overloading feature of the mbstring extension (quoting):
mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled.
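For reference, turning it on is a php.ini-level setting (it cannot be enabled with ini_set() at runtime), roughly like the lines below; note the feature has since been deprecated and removed in newer PHP versions, so treat this as a sketch of the idea rather than a recommendation:

; php.ini
mbstring.internal_encoding = UTF-8
mbstring.func_overload = 7   ; bitmask: 1 = mail(), 2 = str* functions, 4 = ereg* functions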
All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).
This isn't a good idea.
Unicode strings cannot transparently replace byte strings. Even when you are correctly handling all human-readable text as Unicode, there are still important uses for byte strings in handling file and network data that isn't character-based, and interacting with systems that explicitly use bytes.
For example, spit out a header 'Content-Length: '.strlen($imageblob) and you're going to get brokenness if that's suddenly using codepoint semantics.
You still need to have both mb_strlen and strlen, and you have to know which is the right one to use in each circumstance; there's not a single switch you can throw to automatically do the right thing.
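A minimal sketch of that distinction (assuming the script file itself is saved as UTF-8):

$body = 'naïve';   // 5 characters, but 6 bytes when encoded as UTF-8

header('Content-Length: ' . strlen($body));  // byte count (6) -- what HTTP needs
echo $body;

// mb_strlen($body, 'UTF-8') would return 5 (code points) -- right for text
// handling, wrong for Content-Length.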
This is why IMO the approach of having a single string datatype that can be treated with byte or codepoint semantics is generally a mistake. Languages that provide separate datatypes for byte strings (with byte semantics), and character strings (with Unicode codepoint semantics(*)) tend to be more consistent.
(*: or UTF-16 code unit semantics if unlucky)
There are some letters in different alphabets that look exactly the same.
Like A in Latin and А in Cyrillic.
Do they play the same role when I use one of them in a UTF-8 script?
If not, how can I find out the code of a given letter?
It's not clear what you mean by "play the same role".
They are certainly not the same character, though they may appear to be when rendered.
This is exactly analogous to the confusion between "l" (lowercase L) and "I" (uppercase i) in many fonts.
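A quick way to see that they really are different characters (the second letter below is the Cyrillic one; the strings are UTF-8):

var_dump('A' === 'А');  // bool(false): different characters despite identical rendering
echo bin2hex('A');      // 41    -- U+0041 LATIN CAPITAL LETTER A
echo bin2hex('А');      // d090  -- U+0410 CYRILLIC CAPITAL LETTER A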
If you want to consider A and А to be the same, you have to transliterate the Cyrillic character into a Latin one. Unfortunately, PHP support for transliteration is sketchy. You can use iconv, which is not great -- if you transliterate to ASCII, you'll lose everything that cannot be represented in ASCII.
The Unicode PHP implementation (what was supposed to be PHP 6) had a function called str_transliterate that used the ICU transliteration API. Hopefully, transliteration will be added to the intl extension (the current ICU wrapper) in the future.
You might be interested in the 'spoof detection' API in ICU. I think it is designed to report that your two As are 'visually confusable'.
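That API is exposed by PHP's intl extension (PHP 5.4+) as the Spoofchecker class; a minimal sketch, which should report the Latin/Cyrillic pair from the question as confusable:

$checker = new Spoofchecker();
var_dump($checker->areConfusable('A', 'А'));  // expected bool(true): Latin A vs Cyrillic А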