Enforce English only on PHP form submission

I would like the contact form on my website to only accept text submitted in English. I've been dealing with a lot of spam recently that has appeared in multiple languages that is slipping right past the CAPTCHA. There is simply no reason for anyone to submit this form in a language other than English since it's not a business and more of a hobby for personal use.
I've been looking through the documentation and was hopeful that something like preg_match('/[\p{Latin}]/u', $input) might work, but I'm not bilingual and don't understand all the nuances of character encoding; while this helps filter out something like Russian, it still lets languages like Vietnamese slip through.
Ideally I would like it to accept:
Any Unicode symbol that might be used. I have frequently come across different styles of dashes, apostrophes, or things related to math, for example.
Common diacritical marks / accented characters found in words like "résumé."
And I would like it to reject:
Anything that appears to be something other than English, or uncommon. I'm not overly concerned with accents such as "naïve" or words borrowed from other languages.
I'm thinking of simply stripping all potentially valid characters as follows:
$input = 'testing for English only!';
// reference: https://en.wikipedia.org/wiki/List_of_Unicode_characters
// allowed punctuation
$basic_latin = '`~!@#$%^&*()-_=+[{]}\\|;:\'",<.>/?';
$input = str_replace(str_split($basic_latin), '', $input);
// allowed symbols and accents
$latin1_supplement = '¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿É×é÷';
// use mb_str_split() (PHP 7.4+) so multibyte characters are not broken into bytes
$input = str_replace(mb_str_split($latin1_supplement), '', $input);
$unicode_symbols = '–—―‗‘’‚‛“”„†‡•…‰′″‹›‼‾⁄⁊';
$input = str_replace(mb_str_split($unicode_symbols), '', $input);
// remove all spaces including tabs and end lines
$input = preg_replace('/\s+/', '', $input);
// check that remaining characters are alpha-numeric
if (strlen($input) > 0 && ctype_alnum($input)) {
echo 'this is English';
} else {
echo 'no bueno señor';
}
However, I'm afraid there might be some perfectly common and valid exceptions that I'm unwittingly leaving out. I'm hoping that someone might be able to offer a more elegant solution or approach?

There are no native PHP features that would provide language recognition. There's an abandoned Pear package and some classes floating around cyberspace (I haven't tested them). If an external API is fine, Google's Translation API Basic can detect language, with 500K free characters per month.
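If you go that route, a call might look roughly like this (the endpoint and the response shape should be verified against Google's current documentation, and the API key is obviously a placeholder):

// Rough sketch: detect the language of $text via Google's Translation API v2.
function detect_language(string $text, string $apiKey): ?string
{
    $url = 'https://translation.googleapis.com/language/translate/v2/detect?key=' . urlencode($apiKey);
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(['q' => $text]),
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    $decoded = json_decode((string) $response, true);
    // e.g. "en", "ru", "vi" - or null if the call failed
    return $decoded['data']['detections'][0][0]['language'] ?? null;
}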
There is however a very simple solution to all this. We don't really need to know what language it is. All we need to know is whether it's reasonably valid English. And not Swahili or Klingon or Russian or Gibberish. Now, there is a convenient PHP extension for this: PSpell.
Here's a sample function you might use:
/**
 * Spell Check Stats.
 * Returns an array with OK, FAIL spell check counts and their ratio.
 * Use the ratio to filter out undesirable (non-English/garbled) content.
 *
 * @updated 2022-12-29 00:00:29 +07:00
 * @author @cmswares
 * @ref https://stackoverflow.com/q/74910421/4630325
 *
 * @param string $text
 *
 * @return array
 */
function spell_check_stats(string $text): array
{
    $stats = [
        'ratio' => null,
        'ok' => 0,
        'fail' => 0
    ];

    // Split into words
    $words = preg_split('~[^\w\']+~', $text, -1, PREG_SPLIT_NO_EMPTY);

    // New PSpell dictionary:
    $pspeller = pspell_new("en");

    // Check spelling and build stats
    foreach($words as $word) {
        if(pspell_check($pspeller, $word)) {
            $stats['ok']++;
        } else {
            $stats['fail']++;
        }
    }

    // Calculate ratio of OK to FAIL
    $stats['ratio'] = match(true) {
        $stats['fail'] === 0 => count($words), // everything passed; avoid division by zero
        $stats['ok'] === 0 => 0,               // nothing passed
        default => $stats['ok'] / $stats['fail'],
    };

    return $stats;
}
Source at BitBucket. Function usage:
$stats = spell_check_stats('This starts in English, esto no se quiere, tätä ei haluta.');
// ratio: 0.7142857142857143, ok: 5, fail: 7
Then simply decide the threshold at which a submission is rejected. For example, if 20 words in 100 fail, i.e. an 80:20 ratio, or "ratio = 4". The higher the ratio, the more (properly-spelled) English it is.
The "ok" and "fail" counts are also returned in case you need to calibrate separately for very short strings. Run some tests on existing valid and spam content to see what sorts of figures you get, and then tune your rejection threshold accordingly.
The PSpell package for PHP may not be installed by default on your server. On CentOS/RedHat, running yum install php-pspell aspell-en installs both the PHP module (which pulls in the ASpell dependency) and an English dictionary. For other platforms, install per your package manager.
For Windows and modern PHP, I can't find the extension dll, or a maintained Aspell port. Please share if you've found a solution. Would like to have this on my dev machine too.

Related

Is there a way to tell whether a font supports a given character in Imagick?

I'm using Imagick to generate simple logos, which are just text on a background.
I'm usually looping through all available fonts, to present the user with a choice of different renderings for every font (one image per font).
The problem is, some fonts don't support the ASCII characters (I think they've been designed for a given language only). And I guess that some of the fonts which support ASCII characters will fail with non-ASCII characters as well.
Anyway, I end up with images such as these:
Is there a programmatic way in Imagick to tell whether a given font supports all the characters in a given string?
That would help me filter out those fonts which do not support the text the user typed in, and avoid displaying any garbage images such as the ones above.
I don't know a way using ImageMagick, but you could use the php-font-parser library from here:
https://github.com/Pomax/PHP-Font-Parser
Specifically, you can parse a font for each letter in your required string and check the return value:
$fonts = array("myfont.ttf");

/**
 * For this test, we'll print the header information for the
 * loaded font, and try to find the letter "g".
 */
$letter = "g";
$json = false;
while($json === false && count($fonts) > 0) {
    $font = new OTTTFont(array_pop($fonts));
    echo "font header data:\n" . $font->toString() . "\n";
    $data = $font->get_glyph($letter);
    if($data !== false) {
        $json = $data->toJSON();
    }
}
if($json === false) {
    die("the letter '$letter' could not be found!");
}
echo "glyph information for '$letter':\n" . $json;
The above code comes from the font parser project's fonttest.php file:
https://github.com/Pomax/PHP-Font-Parser/blob/master/fonttest.php

Gettext() with larger texts

I'm using gettext() to translate some of my texts on my website. Mostly these are short texts/buttons like "Back", "Name",...
// I18N support information here
$language = "en_US";
putenv("LANG=$language");
setlocale(LC_ALL, $language);
// Set the text domain as 'messages'
$domain = 'messages';
bindtextdomain($domain, "/opt/www/abc/web/www/lcl");
textdomain($domain);
echo gettext("Back");
My question is, how 'long' can this text (id) be in the echo gettext("") part?
Is it slowing down the process for long texts? Or does it work just fine too? Like this for example:
echo _("LZ adfadffs is a VVV contributor who writes a weekly column for Cv00m. The former Hechinger Institute Fellow has had his commentary recognized by the Online News Association, the National Association of Black Journalists and the National ");
The official gettext documentation merely has this advice:
Translatable strings should be limited to one paragraph; don't let a single message be longer than ten lines. The reason is that when the translatable string changes, the translator is faced with the task of updating the entire translated string. Maybe only a single word will have changed in the English string, but the translator doesn't see that (with the current translation tools), therefore she has to proofread the entire message.
There's no official limitation on the length of strings, and they can obviously exceed at least "one paragraph/10 lines".
There should be virtually no measurable performance penalty for long strings.
gettext effectively has a limit of 4096 chars on the length of strings.
When you exceed this limit you get a warning:
Warning: gettext(): msgid passed too long in %s on line %d
and it returns bool(false) instead of the text.
Source:
PHP Interpreter repository - The real fix for the gettext overflow bug
function gettext http://www.php.net/manual/en/function.gettext.php
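If your msgids might approach that length, a small guard along these lines (a sketch, not part of gettext itself) avoids ever showing false to users:

// Hypothetical wrapper: fall back to the untranslated string when the msgid
// exceeds the 4096-char limit or gettext() does not return a usable string.
function t(string $msgid): string
{
    if (strlen($msgid) > 4096) {
        return $msgid; // too long for gettext(), use the original text
    }
    $translated = gettext($msgid);
    return is_string($translated) && $translated !== '' ? $translated : $msgid;
}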
It's defined as a string input, so your machine's memory would be the limiting factor.
Try to benchmark it with microtime() or, better, with Xdebug if you have it on your development machine.
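A quick-and-dirty microtime() comparison might look like this (the loop count and sample strings are arbitrary):

$short = "Back";
$long  = str_repeat("Lorem ipsum dolor sit amet. ", 30); // roughly 840 characters

foreach (['short' => $short, 'long' => $long] as $label => $msgid) {
    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        gettext($msgid);
    }
    printf("%s msgid: %.4f seconds for 10k lookups\n", $label, microtime(true) - $start);
}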

PHP gettext reverse translate

My question is quite simple: I use gettext to translate URLs, so I only have the translated version of the URL string.
I would like to know if there was an easy way to get the base string from the translated string?
What I had in mind was to automatically add the translated name to a database and alias it with the base string each time I use my _u($string) function.
What I have currently:
function _u($string)
{
    if (empty($string))
        return '';
    else
        return dgettext('Urls', $string);
}
What I was thinking about (pseudo-code):
function _u($string)
{
    if (empty($string))
        return '';

    $translation = dgettext('Urls', $string);
    MySQL REPLACE INTO ... base = $string, translation = $translation; (translation = primary key)
    return $translation;
}

function url_base($translation)
{
    $row = SELECT ... FROM ... translation = $translation;
    return $base;
}
This doesn't seem like the best possible approach, though, and if I remove the REPLACE part in production, I might miss a link or two that I never happened to visit.
EDIT: What I am mostly looking for is the parsing part of gettext. I can't afford to miss any of the possible URLs, so any other solution you suggest would also need a parser (based on what I'm looking for).
EDIT2: Another difficulty has just been added. We must find the URL in any translation and map it back to the "base" translation so the system can parse the URL in the base language.
Actually, the most straightforward way I can think of would be to decode the .mo files used for the translation, through a call to the msgunfmt utility.
Once you have the plaintext database, you save it in any other kind of database, and will then be able to do reverse searches.
But perhaps better, you could create additional domain(s) ("ReverseUrlsIT") in which to store the translated URL as key, and the base as value (provided the mapping is fully two-way, that is!).
At that point you can use dgettext to recover the base string from the translated string, provided that you know the language of the translated string.
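As a rough sketch of that idea (the 'ReverseUrls' domain name and the locale path are illustrative, and the url_base() helper mirrors the one from the question; it assumes you have compiled reverse .mo files in which each translated URL is the msgid and the base string is the msgstr):

// Assumes a reverse catalogue per language, e.g. ReverseUrlsIT,
// where msgid = translated URL string and msgstr = base URL string.
function url_base(string $translation, string $lang = 'IT'): string
{
    $domain = 'ReverseUrls' . $lang;
    bindtextdomain($domain, '/path/to/locale'); // placeholder path
    // dgettext() returns the msgid untouched when no entry exists,
    // which here conveniently means "already in the base language".
    return dgettext($domain, $translation);
}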
Update
This is the main point of using gettext and I would drop it anytime if
I could find another parser/library/tool that could help with that
The gettext family of functions, after all is said and done, are little more than a keystore database system with (maybe) a parser which is a little more powerful than printf, to handle plurals and adjective/noun inversions (violin virtuoso in English becomes virtuoso di violino in Italian).
At the cost of adding to the database complexity (and load), you can build a keystore leveraging whatever persistency layer you've got handy (gettext is file based, after all):
TABLE LanguageDomain
{
PRIMARY KEY ldId;
varchar(?) ldValue;
}
# e.g.
# 39 it_IT
# 44 en_GB
# 01 en_US
TABLE Shorthand
{
PRIMARY KEY shId;
varchar(?) shValue;
}
# e.g.
# 1 CAMERA
# 2 BED
TABLE Translation
{
KEY t_ldId,
t_shId;
varchar(?) t_Value; // Or one value for singular form, one for plural...
}
# e.g.
# 44 1 Camera
# 39 1 Macchina fotografica
# 01 1 Camera
# 44 2 Bed
# 39 2 Letto
# 01 2 Bed
# 01 137 Behavior
# 44 137 Behaviour # "American and English have many things in common..."
# 01 979 Cookie
# 44 979 Biscuit   # "...except of course the language" (O. Wilde)
function translate($string, $arguments = array())
{
GLOBAL $languageDomain;
// First recover main string
SELECT t_Value FROM Translation AS t
LEFT JOIN LanguageDomain AS l ON (t.ldId = l.ldId AND l.ldValue = :LangDom)
LEFT JOIN Shorthand AS s ON (t.t_shId = s.shId AND s.shValue=:String);
//
if (empty($arguments))
return $Result;
// Now run replacement of arguments - if any
$replacements = array();
foreach($arguments as $n => $argument)
$replacements["\${$n}"] = translate($argument);
// Now replace '$1' with translation of first argument, etc.
return str_replace(array_keys($replacements), array_values($replacements), $Result);
}
This would allow you to easily add one more languageDomain, and even to run queries such as e.g. "What terms in English have not yet been translated into German?" (i.e., have a NULL value when LEFT JOINing the subset of Translation table with English domain Id with the subset with German domain Id).
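As a runnable sketch of that lookup (the column names follow the tables above; the PDO connection, the fallback-to-key behaviour and the argument numbering are assumptions of this example, not a fixed design):

// Sketch only: $db is a PDO connection to a database containing the tables above,
// and $languageDomain is the current locale string, e.g. 'it_IT'.
function translate(PDO $db, string $languageDomain, string $string, array $arguments = []): string
{
    $sql = 'SELECT t.t_Value
              FROM Translation AS t
              JOIN LanguageDomain AS l ON t.t_ldId = l.ldId
              JOIN Shorthand      AS s ON t.t_shId = s.shId
             WHERE l.ldValue = :langDom AND s.shValue = :string';

    $stmt = $db->prepare($sql);
    $stmt->execute([':langDom' => $languageDomain, ':string' => $string]);
    $result = $stmt->fetchColumn();

    if ($result === false) {
        $result = $string; // no translation found, fall back to the shorthand key
    }

    // Replace '$1', '$2', ... with the translations of the arguments
    $replacements = [];
    foreach ($arguments as $n => $argument) {
        $replacements['$' . ($n + 1)] = translate($db, $languageDomain, $argument);
    }

    return strtr($result, $replacements);
}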
This system is inter-operable with POfiles, which is important if you need to outsource the translation to someone using the standard tools of the trade. But you can as easily output a query directly to TMX format, eliminating duplicates (in some cases this might really cut down your translation costs - several services overcharge for input in "strange" formats such as Excel, and will either overcharge for "deduping" or will charge for each duplicate as if it was an original).
<?xml version="1.0" ?>
<tmx version="1.4">
<header
creationtool="MySQLgetText"
creationtoolversion="0.1-20120827"
datatype="PlainText"
segtype="sentence"
adminlang="en-us"
srclang="EN"
o-tmf="ABCTransMem">
</header>
<body>
<tu tuid="BED" datatype="plaintext">
<tuv xml:lang="en">
<seg>bed</seg>
</tuv>
<tuv xml:lang="it">
<seg>letto</seg>
</tuv>
</tu>
<tu tuid="CAMERA" datatype="plaintext">
<tuv xml:lang="en">
<seg>camera</seg>
</tuv>
<tuv xml:lang="it">
<seg>macchina fotografica</seg>
</tuv>
</tu>
</body>
</tmx>

Implementing internationalization (language strings) in a PHP application

I want to build a CMS that can handle fetching locale strings to support internationalization. I plan on storing the strings in a database, and then placing a key/value cache like memcache in between the database and the application to prevent performance drops for hitting the database each page for a translation.
This is more complex than using PHP files with arrays of strings - but that method is incredibly inefficient when you have 2,000 translation lines.
I thought about using gettext, but I'm not sure that users of the CMS will be comfortable working with the gettext files. If the strings are stored in a database, then a nice administration system can be set up to allow them to make changes whenever they want, and the caching in RAM will ensure that fetching those strings is as fast as, or faster than, gettext. I also don't feel safe using the PHP extension, considering that not even the Zend Framework uses it.
Is there anything wrong with this approach?
Update
I thought perhaps I would add more food for thought. One of the problems with string translations is that they don't support dates, money, or conditional statements. However, thanks to intl, PHP now has MessageFormatter, which is what really needs to be used anyway.
// Load string from gettext file
$string = _("{0} resulted in {1,choice,0#no errors|1#single error|1<{1, number} errors}");
// Format using the current locale
msgfmt_format_message(setlocale(LC_ALL, 0), $string, array('Update', 3));
On another note, one of the things I don't like about gettext is that the text is embedded into the application all over the place. That means that the team responsible for the primary translation (usually English) has to have access to the project source code to make changes in all the places the default statements are placed. It's almost as bad as applications that have SQL spaghetti-code all over.
So, it makes sense to use keys like _('error.404_not_found') which then allow the content writers and translators to just worry about the PO/MO files without messing in the code.
However, in the event that a gettext translation doesn't exist for the given key, there is no way to fall back to a default (like you could with a custom handler). This means that you either have the writer mucking around in your code - or have "error.404_not_found" shown to users that don't have a locale translation!
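One hedged workaround (not a gettext feature, just a wrapper) is to fall back to a default-English table when the catalogue returns the key unchanged; the __t() name and the $GLOBALS['default_strings'] array here are purely illustrative:

// Illustrative fallback wrapper around key-style gettext lookups.
$GLOBALS['default_strings'] = [
    'error.404_not_found' => 'Sorry, that page could not be found.',
];

function __t(string $key): string
{
    $translated = gettext($key);
    if ($translated === $key) {
        // No translation in the current locale; fall back to default English
        return $GLOBALS['default_strings'][$key] ?? $key;
    }
    return $translated;
}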
In addition, I am not aware of any large projects which use PHP's gettext. I would appreciate any links to well-used (and therefore tested), systems which actually rely on the native PHP gettext extension.
Gettext uses a binary protocol that is quite quick. Also the gettext implementation is usually simpler as it only requires echo _('Text to translate');. It also has existing tools for translators to use and they're proven to work well.
You can store them in a database but I feel it would be slower and a bit overkill, especially since you'd have to build the system to edit the translations yourself.
If only you could actually cache the lookups in a dedicated memory portion in APC, you'd be golden. Sadly, I don't know how.
For those that are interested, it seems full support for locales and i18n in PHP is finally starting to take place.
// Set the current locale to the one the user agent wants
$locale = Locale::acceptFromHttp(getenv('HTTP_ACCEPT_LANGUAGE'));
// Default Locale
Locale::setDefault($locale);
setlocale(LC_ALL, $locale . '.UTF-8');
// Default timezone of server
date_default_timezone_set('UTC');
// iconv encoding
iconv_set_encoding("internal_encoding", "UTF-8");
// multibyte encoding
mb_internal_encoding('UTF-8');
There are several things that need to be considered, and detecting the timezone/locale and then using it to correctly parse and display input and output is important. There is a PHP I18N library that was just released which contains lookup tables for much of this information.
Processing user input is important to make sure your application has clean, well-formed UTF-8 strings from whatever the user enters. iconv is great for this.
/**
 * Convert a string from one encoding to another encoding
 * and remove invalid bytes sequences.
 *
 * @param string $string to convert
 * @param string $to encoding you want the string in
 * @param string $from encoding that string is in
 * @return string
 */
function encode($string, $to = 'UTF-8', $from = 'UTF-8')
{
    // ASCII is already valid UTF-8
    if($to == 'UTF-8' AND is_ascii($string))
    {
        return $string;
    }

    // Convert the string
    return @iconv($from, $to . '//TRANSLIT//IGNORE', $string);
}
/**
 * Tests whether a string contains only 7bit ASCII characters.
 *
 * @param string $string to check
 * @return bool
 */
function is_ascii($string)
{
    return ! preg_match('/[^\x00-\x7F]/S', $string);
}
Then just run the input through these functions.
$utf8_string = normalizer_normalize(encode($_POST['text']), Normalizer::FORM_C);
Translations
As Andre said, it seems gettext is the smart default choice for writing applications that can be translated.
Gettext uses a binary protocol that is quite quick.
The gettext implementation is usually simpler as it only requires _('Text to translate')
Existing tools for translators to use and they're proven to work well.
When you reach Facebook size, you can work on implementing RAM-cached, alternative methods like the one I mentioned in the question. However, nothing beats "simple, fast, and works" for most projects.
There are also additional things that gettext cannot handle, such as displaying dates, money, and numbers. For those you need the INTL extension.
/**
 * Return an IntlDateFormatter object using the current system locale
 *
 * @param string $locale string
 * @param integer $datetype IntlDateFormatter constant
 * @param integer $timetype IntlDateFormatter constant
 * @param string $timezone Time zone ID, default is system default
 * @return IntlDateFormatter
 */
function __date($locale = NULL, $datetype = IntlDateFormatter::MEDIUM, $timetype = IntlDateFormatter::SHORT, $timezone = NULL)
{
    return new IntlDateFormatter($locale ?: setlocale(LC_ALL, 0), $datetype, $timetype, $timezone);
}
$now = new DateTime();
print __date()->format($now);
$time = __date()->parse($string);
In addition, you can use strftime() to format dates taking the current locale into consideration.
Sometimes you need the values for numbers and dates inserted correctly into locale messages
/**
 * Format the given string using the current system locale
 * Basically, it's sprintf on i18n steroids.
 *
 * @param string $string to parse
 * @param array $params to insert
 * @return string
 */
function __($string, array $params = NULL)
{
    return msgfmt_format_message(setlocale(LC_ALL, 0), $string, $params);
}
// Multiple choices (can also just use ngettext)
print __(_("{0,choice,0#no errors|1#single error|1<{0, number} errors}"), array(4));

// Show time in the correct way
print __(_("It is now {0,time,medium}"), array(time()));
See the ICU format details for more information.
Database
Make sure your connection to the database is using the correct charset so that nothing gets corrupted on storage.
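With PDO, for example, the charset can go straight into the DSN (utf8mb4 is shown as an assumed choice; the credentials are placeholders):

$pdo = new PDO(
    'mysql:host=localhost;dbname=app;charset=utf8mb4',
    'user',
    'password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);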
String Functions
You need to understand the difference between the string, mb_string, and grapheme functions.
// 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_a_ring_nfd = "a\xCC\x8A";
var_dump(grapheme_strlen($char_a_ring_nfd));
var_dump(mb_strlen($char_a_ring_nfd));
var_dump(strlen($char_a_ring_nfd));
// 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
$char_A_ring = "\xC3\x85";
var_dump(grapheme_strlen($char_A_ring));
var_dump(mb_strlen($char_A_ring));
var_dump(strlen($char_A_ring));
Domain name TLDs
The IDN functions from the INTL library are a big help processing non-ASCII domain names.
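For example (idn_to_ascii() and idn_to_utf8() are provided by intl; the domain is just a sample):

// Convert an internationalized domain name to its ASCII (punycode) form and back.
$ascii = idn_to_ascii('müller.example', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
echo $ascii; // xn--mller-kva.example
echo idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46); // müller.example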
There are a number of other SO questions and answers similar to this one. I suggest you search and read them as well.
Advice? Use an existing solution like gettext or XLIFF, as it will save you a lot of grief when you hit all the translation edge cases: right-to-left text, date formats, different text volumes (French is roughly 30% more verbose than English, for example), and other things that break your formatting. Even better advice: don't do it. If users want a translation they will make a clone and translate it, because localisation is more about look and feel and colloquial language, and that is usually what happens. To give an example, Anglo-Saxon culture favours cool web colours and sans-serif typefaces, while Hispanic culture favours bright colours and serif/cursive types; to cater for both you would need different layouts per language.
Zend actually caters for the following adapters for Zend_Translate, and it is a useful list:
Array:- Use PHP arrays for Small pages; simplest usage; only for programmers
Csv:- Use comma separated (*.csv/*.txt) files for Simple text file format; fast; possible problems with unicode characters
Gettext:- Use binary gettext (*.mo) files for GNU standard for linux; thread-safe; needs tools for translation
Ini:- Use simple INI (*.ini) files for Simple text file format; fast; possible problems with unicode characters
Tbx:- Use termbase exchange (*.tbx/*.xml) files for Industry standard for inter application terminology strings; XML format
Tmx:- Use tmx (*.tmx/*.xml) files for Industry standard for inter application translation; XML format; human readable
Qt:- Use qt linguist (*.ts) files for Cross platform application framework; XML format; human readable
Xliff:- Use xliff (*.xliff/*.xml) files for A simpler format than TMX but related to it; XML format; human readable
XmlTm:- Use xmltm (*.xml) files for Industry standard for XML document translation memory; XML format; human readable
Others:- *.sql for Different other adapters may be implemented in the future
I'm using the ICU stuff in my framework and really finding it simple and useful to use. My system is XML-based with XPath queries and not a database as you're suggesting to use. I've not found this approach to be inefficient. I played around with Resource bundles too when researching techniques but found them quite complicated to implement.
The Locale functionality is a godsend. You can do so much more easily:
// Available translations
$languages = array('en', 'fr', 'de');
// The language the user wants
$preference = (isset($_COOKIE['lang'])) ?
$_COOKIE['lang'] : ((isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) ?
Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']) : '');
// Match preferred language to those available, defaulting to generic English
$locale = Locale::lookup($languages, $preference, false, 'en');
// Construct path to dictionary file
$file = $dir . '/' . $locale . '.xsl';
// Check that dictionary file is readable
if (!file_exists($file) || !is_readable($file)) {
throw new RuntimeException('Dictionary could not be loaded');
}
// Load and return dictionary file
$dictionary = simplexml_load_file($file);
I then perform word lookups using a method like this:
$selector = '/i18n/text[@label="' . $word . '"]';
$result = $dictionary->xpath($selector);
$text = array_shift($result);
if ($formatted && isset($text)) {
return new MessageFormatter($locale, $text);
}
The bonus for my system is that the template system is XSL-based which means I can use the same translation XML files directly in my templates for simple messages that don't need any i18n formatting.
Stick with gettext, you won't find a faster alternative in PHP.
Regarding the how, you can use a database to store your catalog and allow other users to translate the strings using a friendly gui. When the new changes are reviewed/approved, hit a button, compile a new .mo file and deploy.
Some resources to get you on track:
http://code.google.com/p/simplepo/
http://www.josscrowcroft.com/2011/code/php-mo-convert-gettext-po-file-to-binary-mo-file-php/
https://launchpad.net/php-gettext/
http://sourceforge.net/projects/tcktranslator/
What about CSV files (which can be easily edited in many apps) and caching to memcache (WinCache, etc.)? This approach works well in Magento. All language phrases in the code are wrapped in the __() function, for example
<?php echo $this->__('Some text') ?>
Then, for example before a new version release, you run a simple script which parses the source files, finds all text wrapped in __(), and puts it into a .csv file. You load the CSV files and cache them in memcache. In the __() function you look up the translation in your memcache cache.
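A rough sketch of that lookup side, assuming the CSV has already been imported into Memcached under a per-locale key (all names here are illustrative):

// Illustrative only: the import script stored the phrase => translation array
// under a key like "translations_fr_FR".
function __(string $text): string
{
    static $translations = null;

    if ($translations === null) {
        $memcached = new Memcached();
        $memcached->addServer('127.0.0.1', 11211);
        $translations = $memcached->get('translations_' . Locale::getDefault()) ?: [];
    }

    return $translations[$text] ?? $text;
}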
In a recent project, we considered using gettext, but it turned out to be easier to just write our own functionality. It really is quite simple: Create a JSON file per locale (e.g. strings.en.json, strings.es.json, etc.), and create a function somewhere called "translate()" or something, and then just call that. That function will determine the current locale (from the URI or a session var or something), and return the localized string.
The only thing to remember is to make sure any HTML you output is encoded in UTF-8, and marked as such in the markup (e.g. in the doctype, etc.)
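A minimal sketch of that approach (the file naming and the session-based locale are assumptions, not a prescribed layout):

// One JSON file per locale, e.g. strings.en.json, strings.es.json.
function translate(string $key): string
{
    static $strings = null;

    if ($strings === null) {
        $locale  = $_SESSION['locale'] ?? 'en';
        $file    = __DIR__ . '/strings.' . basename($locale) . '.json';
        $strings = is_readable($file)
            ? (json_decode(file_get_contents($file), true) ?: [])
            : [];
    }

    return $strings[$key] ?? $key;
}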
Maybe not really an answer to your question, but maybe you can get some ideas from the Symfony translation component? It looks very good to me, although I must confess I haven't used it myself yet.
The documentation for the component can be found at
http://symfony.com/doc/current/book/translation.html
and the code for the component can be found at
https://github.com/symfony/Translation.
It should be easy to use the Translation component, because Symfony components are intended to be able to be used as standalone components.
On another note, one of the things I don't like about gettext is that
the text is embedded into the application all over the place. That
means that the team responsible for the primary translation (usually
English) has to have access to the project source code to make changes
in all the places the default statements are placed. It's almost as
bad as applications that have SQL spaghetti-code all over.
This isn't actually true. You can have a header file (sorry, ex C programmer), such as:
<?php
define('MSG_404_NOT_FOUND', 'error.404_not_found');
?>
Then whenever you want a message, use _(MSG_404_NOT_FOUND). This is much more flexible than requiring developers to remember the exact syntax of the non-localised message every time they want to spit out a localised version.
You could go one step further, and generate the header file in a build step, maybe from CSV or database, and cross-reference with the translation to detect missing strings.
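For instance, a build step along these lines could regenerate the header from a CSV of message keys (messages.csv and the output filename are assumptions):

// Hypothetical build step: messages.csv holds rows like "MSG_404_NOT_FOUND,error.404_not_found".
$out = "<?php\n";
foreach (array_map('str_getcsv', file('messages.csv', FILE_IGNORE_NEW_LINES)) as [$constant, $key]) {
    $out .= sprintf("define('%s', '%s');\n", $constant, addslashes($key));
}
file_put_contents('messages.generated.php', $out);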
I have a Zend plugin that works very well for this.
<?php
/** dependencies **/
require 'Zend/Loader/Autoloader.php';
require 'Zag/Filter/CharConvert.php';
Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);
//filter
$filter = new Zag_Filter_CharConvert(array(
'replaceWhiteSpace' => '-',
'locale' => 'en_US',
'charset'=> 'UTF-8'
));
echo $filter->filter('ééé ááá 90');//eee-aaa-90
echo $filter->filter('óóó 10aáééé');//ooo-10aaeee
If you do not want to use the Zend Framework, you can use just the plugin.
hug!

Sorting Katakana names

If I have a list of Katakana names what is the best way to sort them?
Also, is it more common to sort names based on {first name}{last name} or {last name}{first name}?
Another question: how do we get the Hiragana representation of the first character of a Katakana name, the way the iPhone's contact list is sorted?
Thanks.
In Japan it is common (if not expected) that a person's first name appear after their surname when written: {last} {first}. But this would also depend on the context. In a less formal context it would be acceptable for a name to appear {first} {last}.
http://en.wikipedia.org/wiki/Japanese_name
Not that it matters, but why would the names of individuals be written in Katakana and not in the traditional Kanji?
I think it's
sort($array,SORT_LOCALE_STRING);
Provide more information if it's not your case
This answer talks about using the system locale to sort Unicode strings in PHP. Besides threading issues, it is also dependent on your vendor having supplied you with a correct locale for what you want to use. I’ve had so much trouble with that particular issue that I’ve given up using vendor locales altogether.
If you’re worried about different pronunciations of Unihan ideographs, then you probably need access to the Unihan database — or its moral equivalent. A smaller subset may suffice.
For example, I know that in Perl, the JIS X 0208 standard is used when the Japanese "ja" locale is selected in the constructor for Unicode::Collate::Locale. This doesn't depend on the system locale, so you can rely on it.
I’ve also had good luck in Perl with Lingua::JA::Romanize::Japanese, as that’s somewhat friendlier to use than accessing Unicode::Unihan directly.
Back to PHP. This article observes that you can’t get PHP to sort Japanese correctly.
I’ve taken his set of strings and run it through Perl’s sort, and I indeed get a different answer than he gets. If I use the default or English locale, I get in Perl what he gets in PHP. But if I use the Japanese locale for the collation module — which has nothing to do with the system locale and is completely thread-safe — then I get a rather different result. Watch:
JA Sort                          = EN Sort
------------------------------------------------------------
Java                               Java
NVIDIA                             NVIDIA
Windows ファイウォール             Windows ファイウォール
インターネット オプション          インターネット オプション
キーボード                         キーボード
システム                           システム
タスク                             タスク
フォント                           フォント
プログラムの追加と削除             プログラムの追加と削除
マウス                             マウス
メール                             メール
音声認識                         ! 地域と言語オプション
画面                             ! 日付と時刻
管理ツール                       ! 画面
自動更新                         ! 管理ツール
地域と言語オプション             ! 自動更新
電源オプション                     電源オプション
電話とモデムのオプション           電話とモデムのオプション
日付と時刻                       ! 音声認識
I don’t know whether this will help you at all, because I don’t know how to get at the Perl bits from PHP (can you?), but here is the program that generates that. It uses a couple of non-standard modules installed from CPAN to do its business.
#!/usr/bin/env perl
#
# jsort - demo showing how Perl sorts Japanese in a
# different way than PHP does.
#
# Data taken from http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-—-an-unsolved-problem/
#
# Program by Tom Christiansen <tchrist@perl.com>
# Saturday, April 9th, 2011
use utf8;
use 5.10.1;
use strict;
use autodie;
use warnings;
use open qw[ :std :utf8 ];
use Unicode::Collate::Locale;
use Unicode::GCString;
binmode(DATA, ":utf8");
my @data = <DATA>;
chomp @data;
my $ja_sorter = new Unicode::Collate::Locale locale => "ja";
my $en_sorter = new Unicode::Collate::Locale locale => "en";
my @en_data = $en_sorter->sort(@data);
my @ja_data = $ja_sorter->sort(@data);
my $gap = 8;
my $width = 0;
for my $datum (@data) {
    my $columns = width($datum);
    $width = $columns if $columns > $width;
}
my $bar = "-" x ( 2 + 2 * $width + $gap );
$width = -($width + $gap);
say justify($width => "JA Sort"), "= ", "EN Sort";
say $bar;
for my $i ( 0 .. $#data ) {
    my $same = $ja_data[$i] eq $en_data[$i] ? " " : "!";
    say justify($width => $ja_data[$i]), $same, " ", $en_data[$i];
}
sub justify {
    my($len, $str) = @_;
    my $alen = abs($len);
    my $cols = width($str);
    my $spacing = ($alen > $cols) && " " x ($alen - $cols);
    return ($len < 0)
        ? $str . $spacing
        : $spacing . $str
}
sub width {
    return 0 unless @_;
    my $str = shift();
    return 0 unless length $str;
    return Unicode::GCString->new($str)->columns;
}
__END__
システム
画面
Windows ファイウォール
インターネット オプション
キーボード
メール
音声認識
管理ツール
自動更新
日付と時刻
タスク
プログラムの追加と削除
フォント
電源オプション
マウス
地域と言語オプション
電話とモデムのオプション
Java
NVIDIA
Hope this helps. It shows that it is, at least theoretically, possible.
EDIT
This answer from How can I use Perl libraries from PHP? references this PHP package to do that for you. So if you don’t find a PHP library with the needed Japanese sorting stuff, you should be able to use the Perl module. The only one you need is Unicode::Collate::Locale. It comes standard as of release 5.14 (really 5.13.4, but that’s a devel version), but you can always install it from CPAN if you have an earlier version of Perl.
