There are letters in different alphabets that look exactly the same, like A in Latin and А in Cyrillic.
Do they play the same role when I use one of them in a UTF-8 script?
If they don't, how can I find out the code of a given letter?
It's not clear what you mean by "play the same role".
They are certainly not the same character, though they may appear to be when rendered.
This is exactly analogous to the confusion between "l" (lowercase L) and "I" (uppercase i) in many fonts.
If you want to consider A and А to be the same, you have to transliterate the Cyrillic character into a Latin one. Unfortunately, PHP's support for transliteration is sketchy. You can use iconv, which is not great -- if you transliterate to ASCII, you'll lose everything that cannot be represented in ASCII.
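As a rough sketch (assuming PHP 7.2+ for mb_ord() and an iconv build that supports //TRANSLIT), you can confirm that the two letters have different code points and see what iconv's transliteration gives you:

printf("U+%04X\n", mb_ord('A', 'UTF-8'));        // U+0041 LATIN CAPITAL LETTER A
printf("U+%04X\n", mb_ord("\u{0410}", 'UTF-8')); // U+0410 CYRILLIC CAPITAL LETTER A

// Quality depends on the iconv implementation and locale;
// characters with no ASCII mapping may come out as '?'
echo iconv('UTF-8', 'ASCII//TRANSLIT', "\u{0410}БВГ"), "\n";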
The Unicode PHP implementation (what was supposed to be PHP 6) had a function called str_transliterate that used the ICU transliteration API. Hopefully, transliteration will be added to the intl extension (the current ICU wrapper) in the future.
You might be interested in the 'spoof detection' API in ICU. I think it is designed to report that your two As are 'visually confusable'.
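A rough sketch of that, assuming PHP's intl extension (which wraps ICU) is installed and exposes the Spoofchecker class:

$checker = new Spoofchecker();
// Latin "A" (U+0041) vs Cyrillic "А" (U+0410): visually confusable
var_dump($checker->areConfusable('A', "\u{0410}")); // bool(true)
// "A" vs "B": not confusable
var_dump($checker->areConfusable('A', 'B'));        // bool(false)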
From time to time I encounter files that have a strange (wrong?) encoding of umlaut characters in their file names. Maybe the encoding comes from a Mac system, but I'm not sure. I work with Windows.
For example:
Volkszählung instead of Volkszählung (try to use Backspace after the first ä).
When pasting it into an ANSI encoded file with notepad++ it inserts Volksza¨hlung.
I have two questions:
a) Where does that come from and which encoding is it?
b) Using glob() in PHP does not list these files when using the wildcard character *. How is it possible to detect them in PHP?
That's a combining character: specifically, U+0308 COMBINING DIAERESIS. Combining characters are what let you put things like umlauts on any character, not just specific "precomposed" characters with built-in umlauts like U+00E4 LATIN SMALL LETTER A WITH DIAERESIS. Although it's not necessary to use a combining character in this case (since a suitable precomposed character exists), it's not wrong either.
(Note, this isn't an "encoding" at all: in the context of Unicode, an encoding is a method for transforming Unicode codepoint numbers into byte sequences so they can be stored in a file. UTF-8 and UTF-16 are encodings. But combining characters are Unicode codepoints, just like normal characters; they're not something produced by the encoding process.)
If you're working with Unicode text, you should be using PHP's mbstring functions. The built-in string functions aren't Unicode-aware, and see strings only as sequences of bytes rather than sequences of characters. I'm not sure how mbstring treats combining characters, though; the documentation doesn't mention them at all, as far as I can see.
You should also take a look at the grapheme functions, which are specifically meant to cope with combining characters. A "grapheme unit" is the single visual character produced by a base character codepoint plus any combining characters that follow it.
Finally, the PCRE regex functions support a \X escape sequence that matches whole grapheme clusters rather than individual codepoints.
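To make that concrete, here is a rough sketch (assuming the intl extension is available; the directory path is only an example) showing how the decomposed form behaves and how such file names could be detected:

// "ä" written as base letter + combining diaeresis (U+0061 U+0308)
$decomposed = "a\u{0308}";

echo strlen($decomposed), "\n";             // 3 -- bytes
echo mb_strlen($decomposed, 'UTF-8'), "\n"; // 2 -- codepoints
echo grapheme_strlen($decomposed), "\n";    // 1 -- grapheme units

// Normalize to the precomposed form (U+00E4) before comparing names
$precomposed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

// Detect file names that contain combining marks (\p{M})
foreach (scandir('/some/dir') as $name) {
    if (preg_match('/\p{M}/u', $name)) {
        echo "contains combining characters: $name\n";
    }
}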
I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because a mapping is not easy to determine, how can I remove all those characters? Of course I could first sanitize it as above and then remove all "non-Latin" characters. But maybe there is another good solution to this?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client that had content in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ASCII characters' and possibly returned 'blank' (invalid) clean URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to the idea of replacing those characters, which, as stated in the question and confirmed in the comments, is of course not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this isn't an issue anymore. I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard if not impossible to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily transliterable into a Roman representation. Another common problem, though, is that there is not a single romanization method; those languages usually have different ones which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be to detect which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and hence romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be much easier to simply filter those out and otherwise support Unicode file names.
You could run it through your existing sanitizer and then convert anything that isn't Latin to Punycode.
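A rough sketch of that idea, assuming the intl extension's idn_to_ascii() is available (it produces Punycode-encoded labels; the input string is just an example):

$label = '例え'; // a non-Latin URL segment
$ascii = idn_to_ascii($label, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
echo $ascii, "\n"; // something like "xn--r8jz45g"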
So, as I understand it, you need a character relation table for every language, and you replace characters according to that table.
For example, to transliterate Russian characters to Latin equivalents, we use tables like these =) Or classes that use these tables =)
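A minimal sketch of such a table (the mapping below is deliberately incomplete and only illustrates the idea):

$cyrillicToLatin = array(
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd',
    'ж' => 'zh', 'х' => 'kh', 'ц' => 'ts', 'ч' => 'ch', 'ш' => 'sh',
    'щ' => 'shch', 'ю' => 'yu', 'я' => 'ya',
);
echo strtr('чаша', $cyrillicToLatin), "\n"; // chasha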
Interesting, I just found this: http://derickrethans.nl/projects.html#translit
I'm interested in writing a PHP script (I do welcome language-agnostic suggestions) that would transliterate a sentence or word written in English (phonetically) into the script of another language. Since I'm looking at English written phonetically (i.e. by ear), I'd have to deal with variant spellings of the same word.
It is assumed that no standard exists for romanization (for instance, in Chinese, you have the Simplified Wade, etc.)
Does anyone have any advice on where I could start?
EDIT: I'm doing this purely for educational purposes, and I was initially under the impression that in order to figure out the connection between variant spellings (which could be found in a corpus of IM messages, Facebook posts written in the romanized form of the language), you'd need some sort of machine learning tool. However, I'd like to know if I was on the right track, and I'd like some help in figuring out what next I should look into to get this working (for instance: which machine learning tool should I look into?).
Try Transliteration PHP Extension by Derick Rethans:
This extension allows you to transliterate text in non-latin
characters (such as Chinese, Cyrillic, Greek etc) to latin characters.
Besides the transliteration the extension also contains filters to
upper- and lowercase latin, cyrillic and greek, and perform special
forms of transliteration such as converting ligatures such as the
Norwegian "æ" to "ae" and normalizing punctuation and spacing.
It seems he has already started on just what you are looking for! (Unless you want to deal with English -> a Latin-script language, but at least this deals with the scripts of other languages. :) )
I know with Japanese at least, you have a set number of letter combinations.
So you could do something like creating a matching array like this:
$romajiToKana = array(
    'oo' => 'おう',
    'oh' => 'おう',
    'ou' => 'おう',
);
Of course, you'd continue on from there, making sure you don't match 'su' when it should be 'tsu', and so on.
This would only be a starting point, of course.
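As a rough sketch of how such a table could be applied (the mapping is illustrative only): strtr() tries the longest keys first, which helps with the 'tsu' vs. 'su' problem mentioned above.

$romajiToKana = array(
    'tsu' => 'つ',
    'su'  => 'す',
    'oo'  => 'おう',
    'oh'  => 'おう',
    'ou'  => 'おう',
);
echo strtr('tsuou', $romajiToKana), "\n"; // つおう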
Machine learning is probably most practical with Chinese...but here's a rough start to hiragana: https://gist.github.com/1154969
What is better for PHP developers - Unicode or UTF-8?
I am going to create an international CMS. So I am going to have clients all over the world. They will speak all possible languages.
What encoding format is better for browser recognition and for DB data storage?
"Unicode" is not an encoding. You may mean UTF-8 versus UTF-16 (big-endian or little-endian). It really doesn't matter much for browser support. Any modern browser will support all three. You will probably find UTF-8 is the most space-efficient for your database.
UTF-8 is an encoding of Unicode, a way of representing an (abstract) sequence of Unicode characters as a (concrete) sequence of bytes. There are other encodings, such as UTF-16 (which has both big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.
UTF-8 is useful if most of your text is in Western languages since it represents ASCII characters in just one byte, but it needs three bytes each for many characters in "foreign" alphabets such as Chinese. UTF-16, on the other hand, uses exactly two bytes for all characters you're likely to ever encounter (though some very esoteric characters, those outside Unicode's "Basic Multilingual Plane", require four).
I wouldn't recommend using PHP for developing international software, though, because it doesn't really properly support Unicode. It has some add-on functions for working with Unicode encodings (look at the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because that character takes up three bytes in UTF-8, even though it's only one character. Using string-splitting functions like substr() is precarious because if you split in the middle of a multi-byte character you corrupt the string.
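For example (a quick sketch, assuming the mbstring extension is enabled):

$s = '大'; // one character, three bytes in UTF-8
echo strlen($s), "\n";                   // 3 -- counts bytes
echo mb_strlen($s, 'UTF-8'), "\n";       // 1 -- counts characters
echo mb_substr($s, 0, 1, 'UTF-8'), "\n"; // 大 -- safe, unlike substr($s, 0, 1)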
Most other languages used for Web development, such as Java, C#, and Python, have built-in support for Unicode, so that you can put arbitrary Unicode characters into a string and not need to worry about which encoding is used to represent them in memory because from your point of view a string contains characters, not bytes. This is a much safer, less-error-prone way to work with Unicode text. For this and other reasons (PHP isn't really that great a language), I'd recommend using something else.
(I've read that PHP 6 will have proper Unicode support, but that's not available yet.)
UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.
Microsoft recommends that
"Developers should use UTF-8 for all Unicode data that they send to and receive from the browser."
For database storage, use the encoding your RDBMS has better support for. Or, all else being equal, choose based on space efficiency. UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.
Unicode is a standard which defines a bunch of abstract characters (so-called code points) and their properties (is it a digit, is it uppercase etc.). It also defines certain encodings (methods to represent characters with bytes), UTF-8 being one of them. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Spolsky for more details.
I would certainly go with UTF-8, it is the standard everywhere these days, and has some nice properties such as leaving all 7-bit ASCII characters in place, which means that most HTML-related functions such as htmlspecialchars can be used directly on the UTF-8 representation, so you have less chance of leaving encoding-related security holes. Also, a lot of PHP functions explicitly expect UTF-8 strings, and UTF-8 has better text editor support than alternatives like UTF-16, too.
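For example (a small sketch; the input string is made up):

$input = 'Größe < 10 & "ok"';
echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8'), "\n";
// Größe &lt; 10 &amp; &quot;ok&quot; -- the multi-byte "ö" and "ß" pass through untouched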
It is better to use UTF-8, because it covers the accented characters of languages around the world. UTF-8 also has room to encode additional characters as they are assigned. I always prefer and use UTF-8.
I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?
I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):
// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);
Is this true, or did someone simply get [a-z] confused with \w?
Long story short: Maybe, depends on the system the app is deployed to, depends how PHP was compiled, welcome to the CF of localization and internationalization.
The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by [a-z]. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.
However, the way PHP blindly handles strings as collections of bytes rather than collections of Unicode code points means you have a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
I'd also conjecture that the existence of this regular expression is the result of a bug report being filed about German umlauts not being filtered.
Update in 2014: Per JimmiTh's answer below, it looks like (despite some documentation that's confusing to non-PCRE-core developers) [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said, framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and on servers the developers have no control over. While the anonymous Drupal developer's comments are incorrect, it wasn't a matter of "getting [a-z] confused with \w"; rather, the developer was unclear/unsure of how PCRE handled [a-z] and chose the more specific form of abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.
The comment in Drupal's code is WRONG.
It's NOT true that "international characters (e.g. German umlauts)" might match [a-z].
If, e.g., you have the German locale available, you can check it like this:
setlocale(LC_ALL, 'de_DE'); // German locale (not needed, but you never know...)
echo preg_match('/^[a-z]+$/', 'abc') ? "yes\n" : "no\n";
echo preg_match('/^[a-z]+$/', "\xE4bc") ? "yes\n" : "no\n"; // äbc in ISO-8859-1
echo preg_match('/^[a-z]+$/', "\xC3\xA4bc") ? "yes\n" : "no\n"; // äbc in UTF-8
echo preg_match('/^[a-z]+$/u', "\xC3\xA4bc") ? "yes\n" : "no\n"; // w/ PCRE_UTF8
Output (will not change if you replace de_DE with de_DE.UTF-8):
yes
no
no
no
The character class [abcdefghijklmnopqrstuvwxyz] is identical to [a-z] in both encodings that PCRE understands: ASCII-derived single-byte and UTF-8 (which is ASCII-derived too). In both of these encodings, [a-z] is the same as [\x61-\x7A].
Things may have been different when the question was asked in 2009, but in 2014 there is no "weird configuration" that can make PHP's PCRE regex engine interpret [a-z] as a class of more than 26 characters (as long as [a-z] itself is written as 5 bytes in an ASCII-derived encoding, of course).
Just an addition to both of the already excellent, if contradictory, answers.
The documentation for the PCRE library has always stated that "Ranges operate in the collating sequence of character values". Which is somewhat vague, and yet very precise.
It refers to collating by the index of characters in PCRE's internal character tables, which can be set up to match the current locale using pcre_maketables(). That function builds the tables in order of char value (tolower(i)/toupper(i)).
In other words, it doesn't collate by actual cultural sort order (the locale collation info). As an example, while German treats ö the same as o in dictionary collation, ö has a value that makes it appear outside the a-z range in all the common character encodings used for German (ISO-8859-x, unicode encodings etc.) In this case, PCRE would base its determination of whether ö is in the range [a-z] on that code value, rather than any actual locale-defined sort order.
PHP has mostly copied PCRE's documentation verbatim in their docs. However, they've actually gone to pains changing the above statement to "Ranges operate in ASCII collating sequence". That statement has been in the docs at least since 2004.
In spite of the above, I'm not quite sure it's true, however.
Well, not in all cases, at least.
The one call PHP makes to pcre_maketables... From the PHP source:
#if HAVE_SETLOCALE
if (strcmp(locale, "C"))
tables = pcre_maketables();
#endif
In other words, if the environment for which PHP is compiled has setlocale and the (LC_CTYPE) locale isn't the POSIX/C locale, the runtime environment's POSIX/C locale's character order is used. Otherwise, the default PCRE tables are used - which are generated (by pcre_maketables) when PCRE is compiled - based on the compiler's locale:
This function builds a set of character tables for character values less than 256. These can be passed to pcre_compile() to override PCRE's internal, built-in tables (which were made by pcre_maketables() when PCRE was compiled). You might want to do this if you are using a non-standard locale. The function yields a pointer to the tables.
While German wouldn't be different for [a-z] in any common character encoding, if we were dealing with EBCDIC, for example, [a-z] would include ± and ~. Granted, EBCDIC is the one character encoding I can think of that doesn't place a-z and A-Z in uninterrupted sequence.
Unless PCRE does some magic when using EBCDIC (and it might), while it's highly unlikely you'd be including umlauts in anything but the most obscure PHP build or runtime environment (using your very own, very special, custom-made locale definition), you might, in the case of EBCDIC, include other unintended characters. And for other ranges, "collated in ASCII sequence" doesn't seem entirely accurate.
ETA: I could have saved some research by looking for Philip Hazel's own reply to a similar concern:
Another issue is with character classes ranges. You would think that [a-k] and [x-z] are well defined for latin scripts but that's not the case.
They are certainly well defined, being equivalent to [\x61-\x6b] and [\x78-\x7a], that is, related to code order, not cultural sorting order.