I am quite confused about how to validate fields (e.g. business name, company name, address, etc.), because the site has a localization feature. Currently I am validating the fields with jQuery via regular expressions; here is a snippet of one of my regexes:
var regex = /^[a-zA-Z0-9-,.\säöüÄÖÜ]{2,}$/;
This works fine when the site is in English. However, I am not confident that it also works in a German environment.
To test my validation, I use Character Map on Windows. For example, I copy ü from Character Map and paste it into the field. The script says it is an invalid character, even though, as you can see in the regex, I am treating that character as valid.
Most probably the technical problem is in the document’s character encoding. Make sure that your document is UTF-8 encoded and declared as such in HTTP headers or at least in a meta tag.
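As a minimal sketch of that fix in PHP (the header must be sent before any output; the meta tag is a fallback in the markup):
<?php
// Declare UTF-8 in the HTTP Content-Type header (takes precedence)...
header('Content-Type: text/html; charset=utf-8');
// ...and repeat it in the document itself as a fallback.
echo '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';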
There are more difficult and important problems, though. Your regexp will reject the English name Brontë and the German name Strauß, for example. And it will accept 42, which is hardly anyone's first or last name.
What is the purpose of this checking? Can you expect that all the names will be English or German? According to European conventions, a person has the right to have his name spelled correctly in European countries, even if it happens to be in a language other than the majority language.
There's not much checking of personal names that you can do without the risk of rejecting someone's real name in some accepted spelling. If you need to force names into some limited character repertoire or syntax, this needs to be made clear to users, and it should be enforced server-side and, preferably, also as client-side pre-checking.
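For instance, a deliberately permissive server-side pre-check in PHP might only require a couple of characters and at least one letter from any script (a sketch, not a definitive rule set; looksLikeName and $name are hypothetical):
<?php
// Accept any string of two or more characters that contains at least
// one Unicode letter; this rejects "42" but keeps Brontë and Strauß.
function looksLikeName($name) {
    return mb_strlen($name, 'UTF-8') >= 2
        && preg_match('/\p{L}/u', $name) === 1;
}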
Related
When you create web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick to non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no. URIs cannot contain Unicode characters directly, only ASCII; everything else has to be percent-encoded.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the encoded version won't work, because the underlying definition of URLs simply doesn't support it. So for it to work consistently, you need to percent-encode.
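If you want to see that mapping concretely, PHP's intl extension exposes the punycode conversion (a small sketch assuming the intl extension is available):
<?php
// Convert a Unicode host name to its ASCII (punycode) form and back.
echo idn_to_ascii('café.com');       // xn--caf-dma.com
echo idn_to_utf8('xn--caf-dma.com'); // café.com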
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting-plus-character-translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
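The character-translating half of that scheme can be approximated in PHP with iconv transliteration (a rough sketch; //TRANSLIT behaviour depends on the system's iconv implementation and the active locale):
<?php
// Map accented characters to their closest ASCII equivalents, so a
// request for /myresumé.html can be rewritten to /myresume.html.
setlocale(LC_CTYPE, 'en_US.UTF-8');
echo iconv('UTF-8', 'ASCII//TRANSLIT', '/myresumé.html'); // /myresume.html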
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, and not with %-escapes, by the way; this seems to be new since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar). So it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs without accents aren't the best-looking they could be; but, still, people are used to them, and seem to generally understand them quite well.
You should avoid non-ASCII characters in URLs that users may enter manually in the browser. They are fine for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding was used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several parts in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain most Unicode characters.
The path (after the first /), the user name and the password can again contain almost anything. They are escaped (as %XX), but those escapes are just bytes; which encoding the bytes are in is difficult to know (it is up to the HTTP server to interpret them).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that application interprets the bytes is another story.
It is recommended that the path/user/password/parameters be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow non-ASCII (we are not in the 80s anymore), but exactly what you do with it can be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
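For the path part, that usually means percent-encoding the UTF-8 bytes, e.g. in PHP (a minimal sketch; the Wikipedia-style path is just an example):
<?php
// rawurlencode() percent-escapes the UTF-8 bytes of a path segment.
echo '/wiki/' . rawurlencode('Éléphant'); // /wiki/%C3%89l%C3%A9phant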
We all know email address verification is a touchy subject, and there are many opinions on the best way to deal with it short of encoding the entire RFC. But since 2009 it has become even more difficult, and I haven't really seen anyone address the issue of IDNs yet.
Here is what I've been using:
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}\z/i', $email)
This works for most email addresses, but what if I need to match a non-Latin email address? E.g. bob@china.中國 or bob@russia.рф
Look here for the complete list. (Notice all the non-Latin domain extensions at the bottom of the list.)
Information on this subject can be found here, and I think what they are saying is that these new extensions will simply be read as '.xn--fiqz9s' and '.xn--p1ai' at the machine level, but I'm not 100% sure.
If so, does that mean the only change I need to consider making in my code is the following? (For domain extensions like .travelersinsurance and .sandvikcoromant)
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i', $email)
NOTICE: This is not related to the discussion found on this page: Using a regular expression to validate an email address.
Consider: every time you make up your own new regex instead of validating addresses according to the complete RFC spec, you are making the situation for "exotic" email addresses on the web worse. You are inventing some new ad-hoc sub- or superset of the official RFC spec, which means you will have false positives, false negatives, or both: you will deny people the use of their actual addresses because your regex doesn't account for them, or you will accept addresses that are actually invalid.
Add to that that even if an address is syntactically valid, that still doesn't mean (a) the address actually (still) exists, (b) belongs to that user, or (c) can actually receive email. In the grand scheme of things, validating the syntax is an extremely minor concern.
If you're going to validate the syntax at all, either do a very rough general check that is sure not to reject any valid addresses (e.g. /.+@.+/), or validate according to all the RFC rules; don't do some in-between, half-assed, sort-of-strict-but-not-really validation you just came up with.
I'm gonna stick with the tried-and-true suggestion that you should send them a verification email. There's no need for a fancy regex that will have to be updated time and time again. Just assume they know their email address and let them enter it.
That's what I've always done when this situation comes up. If anything I would make them enter their email twice. It'll free you up to spend more time on the important parts of your site/project.
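A bare-bones version of that flow might look like this in PHP (a sketch with hypothetical details: saveToken() stands in for your own storage layer, and the confirmation URL is made up):
<?php
// Generate an opaque token, remember it, and mail a confirmation link;
// the address is only considered valid once the link is visited.
$token = bin2hex(random_bytes(16));
saveToken($email, $token); // hypothetical storage call
mail($email, 'Confirm your address',
     "Click to confirm: https://example.com/confirm?token=$token");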
Here is what I eventually came up with.
preg_match('/^[\pL\pM*+\pN._%+-]+@[\pL\pM*+\pN.-]+\.[\pL\pM*+]{2,20}\z/u', $email)
This uses Unicode regular expression properties like \pL, \pM and \pN to help me deal with letters and numbers from any language.
\pL Any kind of letter from any language, upper or lower case.
\pM*+ Matches zero or more code points that are combining marks, i.e. characters intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\pN Any number.
The expression above will work perfectly for normal email addresses like me@mydomain.com and cacophonous email addresses like a.s中3_yÄhমহাজোটেরoo文%网+d-fελληνικά@πyÄhooαράδειγμα.δοκιμή.
It's not that I don't trust people to be able to type in their own email addresses but people do make mistakes and I may use this code in other situations. For example: I need to double check the integrity of an existing list of 10,000 email addresses. Besides, I was always taught to NOT trust user input and to ALWAYS filter.
UPDATE
I just discovered that although this works perfectly when tested on sites like phpliveregex.com, and locally when parsing a normal string for UTF-8 content, it doesn't work properly with email fields, because browsers convert fields of that content type to plain Latin. So an email address like bob@china.中國 or bob@russia.рф gets converted to bob@china.xn--fiqz9s or bob@russia.xn--p1ai before it reaches the server. The only thing really missing from my original filter was the inclusion of hyphens in the domain extension.
Here is the final version:
preg_match('/^[a-z0-9%+-._]+@[a-z0-9-.]+\.[a-z0-9-]{2,20}\z/i', $email);
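In use, one might normalize an IDN domain to its ASCII form first and then apply the pattern (a sketch assuming PHP's intl extension; $email is the raw input):
<?php
// Punycode the domain part (everything after the last @), then validate.
$at = strrpos($email, '@');
if ($at !== false) {
    $ascii = idn_to_ascii(substr($email, $at + 1)); // false on bad input
    if ($ascii !== false) {
        $email = substr($email, 0, $at + 1) . $ascii;
    }
}
$valid = preg_match('/^[a-z0-9%+-._]+@[a-z0-9-.]+\.[a-z0-9-]{2,20}\z/i', $email) === 1;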
I know this question is a bit vague, and I'm not sure this is even possible. On my web site I want to display a combo box with the maximum possible number of languages (available in Unicode), and when the user selects a language, the respective character map for that language should be loaded. Users can then click characters to complete the given text area with comments in their own language. I am not asking for code, but a general guideline about whether this is possible and how to do it would be really helpful.
My ultimate need is to let users type in any language of their choice. Do users need to install the language on their computer before using it? Thank you.
The Unicode Standard does not divide characters by language, and there is no rigorous definition for the concept “characters used in a language”. For example, is “é” a character used in English? (Think about “fiancé”.) What about “è”? (Think about the spelling “belovèd” used in some forms of writing.)
The Unicode Consortium has created the CLDR database, which contains information about “exemplar characters” in many languages, but these are based on subjective judgement and are often debatable – mostly in the sense of covering too much, which might not be serious here. The data is in an XML format, so it could be automatically fed into an application.
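As a rough sketch of that, assuming a local copy of the CLDR data, the German exemplar set could be pulled out of common/main/de.xml with SimpleXML (the file path is an assumption about where you unpacked CLDR):
<?php
// Read the exemplar character set for German from the CLDR XML data.
$ldml = simplexml_load_file('common/main/de.xml');
echo (string) $ldml->characters->exemplarCharacters;
// e.g. [a ä b c d e f g h i j k l m n o ö p q r s ß t u ü v w x y z]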
There is nothing the user needs to do, or could do, to “install the language” for purposes like this. What matters is whether the user’s computer has fonts containing all the characters needed and whether the browser is able to use them.
I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, replacing them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese and similar languages? And if replacing is not possible because the right replacement is hard to determine, how can I remove all those characters? Of course, I could first sanitize as above and then remove all remaining "non-Latin" characters. But maybe there is a better solution?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all non-ASCII characters and possibly returned blank (invalid) clean URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to the idea of replacing those characters, which is of course, as stated in the question and confirmed in the comments, not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this is no longer an issue. I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything that uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires knowing the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set that should be easily translatable into a Roman representation. Another common problem, though, is that there is no single romanization method; these languages usually have several, used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; even if you can make it work for some languages, another problem is detecting which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and, as such, romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to simply filter those out and otherwise support Unicode file names.
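A minimal sketch of that approach in PHP, removing only the characters that are actually problematic in file systems and keeping the rest of Unicode intact (sanitizeFilename is a hypothetical name):
<?php
// Strip only the characters that common file systems forbid, leaving
// Chinese, Japanese, accented letters, etc. untouched.
function sanitizeFilename($name) {
    return preg_replace('/[*|\\\\\/:"<>?]/u', '', $name);
}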
You could run it through your existing sanitizer, and then convert anything that is still non-Latin to punycode.
So, as I understand it, you need a character relation table for every language, and you replace characters according to that table.
For example, to transliterate Russian characters into Latin equivalents, we use such tables (or classes that use those tables).
Interestingly, I just found this: http://derickrethans.nl/projects.html#translit
Is there a way to select in mysql words that are only Chinese, only Japanese and only Korean?
In english it can be done by:
SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'
or even a "dirty" solution like:
SELECT * FROM table WHERE field > "0" AND field < "ZZZZZZZZ"
Is there a similar solution for eastern languages / CJK characters?
I understand that Chinese and Japanese share characters, so there is a chance that Japanese words using those characters will be mistaken for Chinese words. I guess those words would not be filtered.
The words are stored in a utf-8 string field.
If this cannot be done in mysql, can it be done in PHP?
Thanks! :)
edit 1: The data does not indicate which language the string is in, so I cannot filter by another field.
edit 2: Using a translator API like Bing's (Google is closing their translator API) is an interesting idea, but I was hoping for a faster, regex-style solution.
Searching for a UTF-8 range of characters is not directly supported in MySQL regexps. See the MySQL reference for regexp, where it states:
Warning: The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.
Fortunately, in PHP you can build such a regexp, e.g. with
/[\x{1234}-\x{5678}]*/u
(note the u at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. The Unicode code charts will let you pick the appropriate script for each language (although not directly the language itself).
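For example, a few commonly used block ranges (an illustration only; $s is a hypothetical UTF-8 string, and the exact blocks you need may vary):
<?php
// Test a UTF-8 string for characters from the major CJK script blocks.
$hasKana   = preg_match('/[\x{3040}-\x{30FF}]/u', $s); // Hiragana + Katakana
$hasHan    = preg_match('/[\x{4E00}-\x{9FFF}]/u', $s); // CJK Unified Ideographs
$hasHangul = preg_match('/[\x{AC00}-\x{D7AF}]/u', $s); // Hangul Syllables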
You can't do this from the character set alone, especially in modern times where Asian texts are frequently "romanized", that is, written with the Roman script. That said, if you merely want to select texts that are superficially 'Asian', there are ways of doing it, depending on how complicated you want to be and how accurate you need to be.
But honestly, I suggest that you add a "language" field to your database and ensure that it's populated correctly.
That said, here are some useful links you may be interested in:
Detect language from string in PHP
http://en.wikipedia.org/wiki/Hidden_Markov_model
The latter is relatively complex to implement, but yields a much better result.
Alternatively, I believe that Google has an (online) API that will let you detect AND translate a language.
An interesting paper that should demonstrate the futility of this exercise is:
http://xldb.lasige.di.fc.ul.pt/xldb/publications/ngram-article.pdf
Finally, you ask:
If this cannot be done in mysql, can it be done in PHP?
It will likely be much easier to do this in PHP, because you are better able to perform mathematical analysis on the string in question, although you'll probably want to feed the results back into the database as a kludgy way of caching them for performance reasons.
You may consider another data structure that contains the words and/or characters, and the language you want to associate them with.
The 'normal' ASCII characters will be associated with many more languages than just English, for instance, just as other characters may be associated with more than just Chinese.
Korean mostly uses its own alphabet called Hangul. Occasionally there will be some Han characters thrown in.
Japanese uses three writing systems combined. Of these, Katakana and Hiragana are unique to Japanese and thus are hardly ever used in Korean or Chinese text.
Japanese and Chinese both use Han characters, though, which occupy the same Unicode range(s), so there is no simple way to differentiate them based on character ranges alone!
There are some heuristics though.
Mainland China uses simplified characters, many of which are unique and thus are hardly ever used in Japanese or Korean text.
Japan also simplified a small number of common characters, many of which are unique and thus will hardly ever be used in Chinese or Korean text.
But there are certainly plenty of occasions where the same strings of characters are valid as both Japanese and Chinese, especially in the case of very short strings.
One method that will work with all text is to look at groups of characters: this means n-grams and probably Markov models, as Arafangion mentions in their answer. But be aware that even this is not foolproof for very short strings!
And of course none of this is going to be implemented in any database software so you will have to do it in your programming language.
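A very coarse sketch of those heuristics in PHP (kana implies Japanese, Hangul implies Korean, bare Han stays ambiguous; guessCjkLanguage and $text are hypothetical):
<?php
// Guess the language from script blocks alone; Han-only text cannot
// be attributed reliably, so it stays ambiguous.
function guessCjkLanguage($text) {
    if (preg_match('/[\x{3040}-\x{30FF}]/u', $text)) return 'Japanese';
    if (preg_match('/[\x{AC00}-\x{D7AF}]/u', $text)) return 'Korean';
    if (preg_match('/[\x{4E00}-\x{9FFF}]/u', $text)) return 'Chinese or Japanese';
    return 'unknown';
}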