I know this question is a bit vague, and I am not sure this is even possible. On my web site I want to display a combo box listing as many languages as possible (everything available in Unicode). When the user selects a language, the character map for that language should be loaded, so that users can click characters to fill in the given text area with comments in their own language. I am not asking for code, but a guideline on whether this is possible and how to approach it would be really helpful.
My ultimate need is to let users type in any language of their choice. Do users need to install the language on their computer before using it? Thank you.
The Unicode Standard does not divide characters by language, and there is no rigorous definition for the concept “characters used in a language”. For example, is “é” a character used in English? (Think about “fiancé”.) What about “è”? (Think about the spelling “belovèd” used in some forms of writing.)
The Unicode Consortium has created the CLDR database, which contains information about “exemplar characters” for many languages, but these are based on subjective judgement and are often debatable – mostly in the sense of covering too much, which might not be serious here. The data is in an XML format, so it could be automatically fed into an application.
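As a rough sketch of how the exemplar data could be fed into an application, the snippet below parses an inline sample shaped like a CLDR locale file (an exemplarCharacters element holding a bracketed, space-separated set). The sample data and the function name are made up for illustration, and the parser is deliberately simplified: real CLDR sets also use ranges and braces for multi-character sequences, which this does not handle.

```python
import xml.etree.ElementTree as ET

# Inline sample shaped like a CLDR locale file; the character lists are illustrative only.
SAMPLE = """<ldml>
  <characters>
    <exemplarCharacters>[a b c d e é f]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[à è]</exemplarCharacters>
  </characters>
</ldml>"""


def exemplar_characters(xml_text, kind=None):
    """Return the exemplar characters of the given type (None means the main set)."""
    root = ET.fromstring(xml_text)
    for node in root.iter("exemplarCharacters"):
        if node.get("type") == kind:
            body = (node.text or "").strip()
            # Drop the surrounding [ ] of the set notation.
            if body.startswith("[") and body.endswith("]"):
                body = body[1:-1]
            return body.split()
    return []


print(exemplar_characters(SAMPLE))
print(exemplar_characters(SAMPLE, "auxiliary"))
```

A character picker could then be built from the returned list for whichever language the user selects.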
There is nothing the user needs to do, or could do, to “install the language” for purposes like this. What matters is whether the user’s computer has fonts containing all the characters needed and whether the browser is able to use them.
Human readable, meaning the string is a real word. This is essentially a form validation. Ideally I'd like to test the 'texture' of the form responses to determine if an actual user has filled out the form versus someone looking for form vulnerabilities. Possibly using a dictionary look-up on the POSTed data and then giving a threshold of returned 'real words'.
I don't see anything in the PHP docs and the Google machine isn't offering up anything, at least this specific. I suspect that someone out there has written a PHP class or even a jQuery plugin that can do this. Something like so:
$string = "laiqbqi";
is_this_string_human_readable($string);
Any ideas?
This can be done using something called Markov Chains.
Essentially, they read through a large chunk of text in a given language (English, French, Russian, etc.) and determine the probability of one character being after another.
e.g. a "q" has a much lower probability of occurring after a "z" than a vowel such as "a" does.
At a lower level, this is actually implemented as a state machine.
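To make the idea concrete, here is a minimal sketch of a character-level Markov model. It counts which character follows which in a training text, then scores a candidate string by the average log-probability of its character pairs. The function names, the tiny corpus and the add-one smoothing are assumptions made for the example; a real check would need a much larger corpus and a tuned threshold.

```python
import math
from collections import defaultdict


def train(corpus):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    text = corpus.lower()
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def average_log_probability(counts, candidate, alphabet_size=27):
    """Average log-probability of the candidate's character pairs (add-one smoothing)."""
    text = candidate.lower()
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for current, following in pairs:
        row = counts.get(current, {})
        seen = row.get(following, 0)
        total += math.log((seen + 1) / (sum(row.values()) + alphabet_size))
    return total / len(pairs)


# Tiny demonstration corpus; far too small for real use.
CORPUS = "the quick brown fox jumps over the lazy dog"
model = train(CORPUS)

print(average_log_probability(model, "the lazy dog"))
print(average_log_probability(model, "laiqbqi"))
```

The string with familiar character transitions scores higher than "laiqbqi"; a form check would compare the score against a threshold chosen from real submissions.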
As per Mike's comment, a PHP version of this can be found here.
For flavor, an amusing Daily WTF article on Markov Chains.
In order to make a PHP content management system extensible, language translations are crucial. I was researching programming approaches for a translations system, and I thought that Qt Linguist was a good example.
This is an example usage from the Qt documentation:
int n = messages.count();
showMessage(tr("%n message(s) saved", "", n));
Qt uses known language rules to determine whether "message" has an "s" appended in English.
When I brought that example up with my development team, they discovered an issue that jeopardizes the extensibility of anything modeled on Qt's tr() function.
This is a similar example, except that something is now seriously wrong.
int n = deadBacteria.count();
showMessage(tr("%n bacterium(s) killed", "", n));
The plural of "bacterium" is "bacteria". It is improper to append an "s".
I don't have much experience with Qt Linguist, but I haven't seen how it handles irregular conjugations and forms.
A more complicated phrase could be "%n cactus(s) have grown.". The plural should be "cacti", and "have" needs to become "has" if there is only one cactus.
You might think that the logical correction is to avoid these irregular words because they are not used in programming. Well, this is not helpful in two ways:
Perhaps there is a language that modifies nouns in an irregular way, even though the source string works in English, like "%n message(s) saved". In MyImaginaryLanguage, the proper way to form the translated string could be "1Message saved", "M2essage saved", "Me3ssage saved" for %n values 1, 2, and 3, respectively, and it doesn't look like Qt Linguist has rules to handle this.
To make a CMS extensible like I need mine to be, all types of web applications need to be factored in. Somebody may build a role-playing game that requires sentences to be constructed like "5 cacti have grown." Or maybe a security software wants to say, "ClamAV found 2 viruses." as opposed to "ClamAV found 2 virus(es)."
After searching online to see if other Qt developers have a solution to this problem and not finding any, I came to Stack Overflow.
I want to know:
What extensible and effective programming technique should be used to translate strings with possible irregular rules?
What do Qt programmers and translators do if they encounter this irregularity issue?
You've misunderstood how the pluralisation in Qt works: it's not an automatic translation.
Basically you have a default string e.g. "%n cactus(s) have grown." which is a literal, in your code. You can put whatever the hell you like in it e.g. "dingbat wibble foo %n bar".
You may then define translation languages (including one for the same language you've written the source string in).
Linguist is programmed with the various rules for how languages treat quantities of something. In English it's just singular or plural; but if a language has a specific form for zero or whatever, it presents those in Linguist. It then allows you to put in whatever the correct sentence would be, in the target translation language, and it handles putting the %n in where you decide it should be in the translated form.
So whoever does the translation in Linguist would be provided the source, and has to fill in the singular and plural, for example.
Source text: %n cactus(s) have grown.
English translation (Singular): %n cactus has grown.
English translation (Plural): %n cacti have grown.
If the application can't find an installed translation, it falls back to the source literal. The source literal is also what the translator sees, so they have to infer what you meant from it. Hence "dingbat wibble foo %n bar" might not be a good idea when describing how many cacti have grown.
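The mechanism described above can be modeled in a few lines. This is a toy model, not Qt's API: the table layout, the rule function and the name translate_plural are all invented for illustration. It only shows the idea that the translator supplies one complete sentence per plural form, a per-language rule picks the form, and the source literal is the fallback.

```python
# Per-language rule: maps a count to the index of the plural form to use.
PLURAL_RULES = {
    "en": lambda n: 0 if n == 1 else 1,
}

# What a translator would fill in: one full sentence per plural form.
TRANSLATIONS = {
    "en": {
        "%n cactus(s) have grown.": [
            "%n cactus has grown.",
            "%n cacti have grown.",
        ],
    },
}


def translate_plural(source, n, lang):
    """Pick the translated form for n, or fall back to the source literal."""
    forms = TRANSLATIONS.get(lang, {}).get(source)
    rule = PLURAL_RULES.get(lang)
    if forms is None or rule is None:
        return source.replace("%n", str(n))
    return forms[rule(n)].replace("%n", str(n))


print(translate_plural("%n cactus(s) have grown.", 1, "en"))
print(translate_plural("%n cactus(s) have grown.", 5, "en"))
print(translate_plural("%n cactus(s) have grown.", 5, "xx"))
```

Because each form is a whole sentence, irregular plurals and verb agreement are the translator's decision rather than something the code has to compute.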
Further reading:
The Linguist manual
The Qt Quarterly article on Plural Form(s) in Translation(s)
The Internationalization example or the I18N example
Download the SDK and have a play.
Your best choice is to use the GNU gettext i18n framework. It is nicely integrated into PHP, and gives you tools to precisely define all the quirky grammar rules about plural forms.
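As a small illustration of the gettext approach, here is a sketch using Python's standard gettext module (PHP offers an ngettext() function with the same idea). No message catalog is loaded, so the fallback rule applies: the singular string for n == 1, the plural string otherwise. With a real catalog, the rule comes from its header, for example "Plural-Forms: nplurals=2; plural=(n != 1);" for English. The function name cacti_message is made up for the example.

```python
import gettext

# NullTranslations has no catalog, so ngettext falls back to the
# simple rule: singular when n == 1, plural otherwise.
translations = gettext.NullTranslations()


def cacti_message(n):
    """Return the count sentence with the plural form chosen by gettext."""
    template = translations.ngettext(
        "%d cactus has grown.", "%d cacti have grown.", n
    )
    return template % n


print(cacti_message(1))
print(cacti_message(2))
```

The source code states both English forms explicitly, and a catalog for another language can supply as many forms as that language's plural rule requires.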
Using Qt Linguist you can handle the various grammatical numbers based on the target language. So every time a %n is detected in a tr string the translator will be asked to give all necessary translations for the target language. Check this article for more details:
http://doc.qt.nokia.com/qq/qq19-plurals.html
I have developed a quiz contest website in which the admin can create questions, which are then displayed. Below is one of the questions:
The above is a question that the admin will enter, but as you can see, it contains special characters such as the square root symbol and the equilibrium sign. Please help me with how the admin can enter the above question in the admin site.
I have an HTML editor for entering the question and text boxes for entering its options.
There is MathML, a markup language for mathematical notation that can be embedded in HTML5. See MathML on Wikipedia.
You can also play with Unicode to achieve at least some of the symbols; see this example link.
In general it's best to just upload images that contain those special symbols. You can easily generate this kind of equation with LaTeX.
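For instance, a formula with an equilibrium arrow and a square root could be typeset in LaTeX roughly as follows. The formula itself is made up for illustration, since the formula from the question is not shown here.

```latex
\[
  \mathrm{N_2O_4} \rightleftharpoons 2\,\mathrm{NO_2},
  \qquad
  x = \sqrt{a^2 + b^2}
\]
```

The rendered output can then be exported as an image and uploaded.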
The methods of entering special characters such as ⇌ (U+21CC) and √ (U+221A) depend on the environment: operating system, keyboard settings, installed auxiliary software, etc. You might consider linking to instruction pages such as http://www.fileformat.info/tip/microsoft/enter_unicode.htm but basically this is something that each user has to solve himself, unless you wish to enhance your HTML editor with special functionality.
(The vinculum associated with the square root sign cannot be produced directly at the character level – the combining overline isn’t really suitable for it – and although it can be drawn in various ways, it probably does not pay off in a context like this.)
Your HTML editor of course needs to be Unicode-enabled.
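If the editor were enhanced with a symbol palette, each button would only need to insert the character itself or its numeric character reference. A minimal sketch of looking those up, using Python's standard unicodedata module (the helper name describe is made up):

```python
import unicodedata


def describe(ch):
    """Return the code point, Unicode name and HTML numeric reference of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} &#x{ord(ch):X};"


for symbol in "⇌√":
    print(describe(symbol))
```

Either the literal character (in a UTF-8 page) or the numeric reference can be stored in the question text.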
If I were you, I would write it in MATLAB or the Word equation editor, take a screenshot, and upload it as an image. That would be the easiest way.
I am quite confused about validating the fields (e.g. business name, company name, address), because the site has a localization feature. Currently, I am validating the fields with jQuery using regular expressions; here is a snippet of one of my regexes:
var regex = /^[a-zA-Z0-9-,.\säöüÄÖÜ]{2,}$/;
This works fine when the site is in English. However, I am not confident that it works in a German environment.
To test my validation, I use Character Map on Windows. For example, I copy ü from Character Map and paste it into the field, but the script says it is an invalid character, even though the regex lists that character as valid.
Most probably the technical problem is in the document’s character encoding. Make sure that your document is UTF-8 encoded and declared as such in HTTP headers or at least in a meta tag.
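The effect of an encoding mismatch can be reproduced in a few lines. The sketch below uses a Python rendering of the pattern from the question and simulates UTF-8 bytes being interpreted as ISO-8859-1. Here the input is garbled rather than the script, but the mismatch between pattern and input is the same either way.

```python
import re

# Python rendering of the pattern from the question (hyphen escaped for clarity).
PATTERN = re.compile(r"^[a-zA-Z0-9\-,.\säöüÄÖÜ]{2,}$")


def is_valid(text):
    """True if the whole text consists of characters allowed by the pattern."""
    return PATTERN.match(text) is not None


correct = "Müller"
# Simulate UTF-8 bytes being read as ISO-8859-1 (Latin-1).
garbled = correct.encode("utf-8").decode("latin-1")

print(garbled)            # MÃ¼ller
print(is_valid(correct))  # True
print(is_valid(garbled))  # False
```

Once the page, the script file and the HTTP headers all agree on UTF-8, the pasted ü matches the ü in the pattern.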
There are more difficult and more important problems, though. Your regexp will reject the English name Brontë and the German name Strauß, for example. And it will accept 42, which is hardly anyone’s first or last name.
What is the purpose of this checking? Can you expect that all the names will be English or German? According to European conventions, a person has the right to have his name spelled correctly in European countries, even if it happens to be in a language other than the majority language.
There’s not much checking of personal names that you can do without risk of rejecting someone’s real name in some accepted spelling. If you need to force names to some limited character repertoire or syntax, this needs to be made clear to users and performed server-side and, preferably, additionally as client-side pre-checking.
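The points about Brontë, Strauß and 42 can be checked directly. The sketch below tests them against a Python rendering of the pattern from the question, and contrasts it with a looser check that accepts any Unicode letters plus a few punctuation characters. The looser check is only an assumption about what a site might want, and it can still reject real names, so it is not a recommendation.

```python
import re
import unicodedata

# Python rendering of the pattern from the question.
ORIGINAL = re.compile(r"^[a-zA-Z0-9\-,.\säöüÄÖÜ]{2,}$")

ALLOWED_PUNCTUATION = " -'.,"


def looks_like_name(text):
    """Loose check: letters from any script, combining marks, limited punctuation."""
    text = text.strip()
    if len(text) < 2 or not any(ch.isalpha() for ch in text):
        return False
    return all(
        ch.isalpha()
        or unicodedata.category(ch).startswith("M")
        or ch in ALLOWED_PUNCTUATION
        for ch in text
    )


for sample in ("Brontë", "Strauß", "42"):
    print(sample, ORIGINAL.match(sample) is not None, looks_like_name(sample))
```

Whatever rule is chosen on the client, the same rule has to be enforced on the server as well.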
I'm working on an I18N application which will be localized into Japanese. I don't know a word of Japanese, and I'm first wondering whether UTF-8 is enough for that language.
Usually, for European languages, UTF-8 is enough: I set up my database charset/collation to use utf8_general_ci (in MySQL) and serve my HTML views in UTF-8.
But what about Japanese, is there something else to do?
By the way, my application should handle English, French and Japanese, but later on it may need more languages, let's say Russian.
How could I set up my I18N application to be widely available without having to change much configuration on deployment?
Are there any best practices?
By the way, I'm planning to use gettext. I'm pretty sure it supports such languages without any problems, as it is the de facto standard for almost all GNU software, but any feedback?
A couple of points:
UTF-8 is fine for your app-internal data, but if you need to process user-supplied documents (e.g. uploads), those may use other encodings like Shift-JIS or ISO-2022-JP.
Japanese text does not use whitespace between words. If your app needs to split text into words somewhere, you've got a problem.
Apart from text, date and number formats also differ between locales.
The generic collation may not lead to a useful sort order for Japanese text - if your app involves large lists that people have to find things in, this can be a problem.
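For the first point, one simple (and only heuristic) way to cope with uploads in other encodings is trial decoding: try the expected encodings in order and keep the first one that decodes without error. The function name and the candidate list are assumptions for this sketch. Note that 7-bit encodings such as ISO-2022-JP consist entirely of ASCII bytes, so they would pass the UTF-8 attempt and need separate detection.

```python
def decode_upload(data, candidates=("utf-8", "shift_jis")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


sjis_bytes = "日本語".encode("shift_jis")
utf8_bytes = "日本語".encode("utf-8")

print(decode_upload(sjis_bytes))
print(decode_upload(utf8_bytes))
```

After decoding, the application can keep working with UTF-8 internally, as suggested above.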
Yep, Unicode contains all the code points you need to display English, French, Japanese, Russian, and pretty much any language in the world (including Taiwanese, Cherokee, Esperanto, really anything but Elvish). That's what it's for. Due to the nature of UTF-8, though, text outside the ASCII range takes more than one byte per character to store.
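The storage difference is easy to see by encoding one character from each of the languages mentioned. The sample characters are arbitrary picks for illustration.

```python
samples = {"English": "a", "French": "é", "Russian": "я", "Japanese": "日"}


def utf8_length(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for language, ch in samples.items():
    print(language, ch, utf8_length(ch))
```

ASCII letters take one byte, accented Latin and Cyrillic letters two, and most Japanese characters three.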
Gettext is widely used and your PHP build probably even includes it. See http://php.net/gettext for usage details.
Just to add an interesting website that helps with building I18N applications: http://www.i18nguy.com/
If you store text in text files then it goes like this:
This is the main folder structure for language:
-lang
-en
-fr
-jp
etc
Every subfolder (en, fr, ...) contains the same files, with the same variables but different values.
For example in lang/en/links.txt
You would have
class txtLinks
{
    public static $menu="Menu";
    public static $products="Show products";
    ....
}
class txtErrors
{
    public static $wrongUName="This user does not exist";
    ....
}
Then when a script loads you do
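// $lang is assumed to hold the language code detected for this request
if ($lang == 'en')
    define('__LANG', 'en');
if ($lang == 'fr')
    define('__LANG', 'fr');
...
Then
include('lang/' . __LANG . '/what ever file you want');
Then this is a piece from your php script:
echo txtLinks::$menu; // etc...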
If you go the database way, you do something analogous, with tables instead of files.
This way you have absolute freedom, because you can give the English files to someone who speaks, let's say, French, and they can fill in the values in French without needing to know programming at all.
And you yourself don't care what language is added or removed later.
And if you work with MVC, you can split the language files by controller so you don't end up loading one huge text file.
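The same per-language lookup can be sketched compactly as nested tables. Everything here is illustrative: the in-memory dictionaries stand in for the files under lang/en and lang/fr, the French text is made up, and the fallback to a default language is an added assumption rather than part of the scheme described above.

```python
# In-memory stand-ins for lang/en/links and lang/fr/links.
LANG = {
    "en": {"links": {"menu": "Menu", "products": "Show products"}},
    "fr": {"links": {"products": "Afficher les produits"}},
}


def text(lang, group, key, default_lang="en"):
    """Look up a string, falling back to the default language when it is missing."""
    for code in (lang, default_lang):
        value = LANG.get(code, {}).get(group, {}).get(key)
        if value is not None:
            return value
    # Nothing found anywhere: return the key so the gap is visible.
    return key


print(text("fr", "links", "products"))
print(text("fr", "links", "menu"))
```

Adding a language then means adding one more table (or folder) with the same keys, without touching the calling code.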