I am creating a site where the authenticated user can write messages for the index site.
On the message create site I have a textbox where the user can give the title of the message, and a textbox where he can write the message.
The message will be exported to a .txt file and from the title I'm creating the title of the .txt file and like this:
Title: This is a message (The filename will be: thisisamessage.txt)
The original given text as filename will be stored in a database rekord among with the .txt filename as path.
For converting the title text I am using a function that looks like this:
function filenameconverter($title){
$filename=str_replace(" ","",$title);
$filename=str_replace("ű","u",$filename);
$filename=str_replace("á","a",$filename);
$filename=str_replace("ú","u",$filename);
$filename=str_replace("ö","o",$filename);
$filename=str_replace("ő","o",$filename);
$filename=str_replace("ó","o",$filename);
$filename=str_replace("é","e",$filename);
$filename=str_replace("ü","u",$filename);
$filename=str_replace("í","i",$filename);
$filename=str_replace("Ű","U",$filename);
$filename=str_replace("Á","A",$filename);
$filename=str_replace("Ú","U",$filename);
$filename=str_replace("Ö","O",$filename);
$filename=str_replace("Ő","O",$filename);
$filename=str_replace("Ó","O",$filename);
$filename=str_replace("É","E",$filename);
$filename=str_replace("Ü","U",$filename);
$filename=str_replace("Í","I",$filename);
return $filename;
}
However it works fine at the most of the time, but sometimes it is not doing its work.
For example: "Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló".
It should stand as a .txt as:
pamutkeztorloadagoloeshigieniaikeztorloadagolo.txt, and most of the times it is.
But sometimes when im giving this it will be:
pamutkă©ztă¶rlĺ‘adagolăłă©shigiă©niaikă©ztă¶rlĺ‘adagolăł.txt
I'm hungarian so the title text will be also hungarian, thats why i have to change the characters.
I'm using XAMPP with apache and phpmyadmin.
I would rather use a generated unique ID for each file as its filename and save the real name in a separate column.
This way you can avoid that someone overwrites files by simply uploading them several times. But if that is what you want you will find several approaches on cleaning filenames here on SO and one very good that I used is http://cubiq.org/the-perfect-php-clean-url-generator
intl
I don't think it is advisable to use str_replace manually for this purpose. You can use the bundled intl extension available as of PHP 5.3.0. Make sure the extension is turned on in your XAMPP settings.
Then, use the transliterator_transliterate() function to transform the string. You can also convert them to lowercase along. Credit goes to simonsimcity.
<?php
$input = 'Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló';
$output = transliterator_transliterate('Any-Latin; Latin-ASCII; lower()', $input);
print(str_replace(' ', '', $output)); //pamutkeztorloadagoloeshigieniaikeztorloadagolo
?>
P.S. Unfortunately, the php manual on this function doesn't elaborate the available transliterator strings, but you can take a look at Artefacto's answer here.
iconv
Using iconv still returns some of the diacritics that are probably not expected.
print(iconv("UTF-8","ASCII//TRANSLIT",$input)); //Pamutk'ezt"orl"o adagol'o 'es higi'eniai k'ezt"orl"o adagol'o
mb_convert_encoding
While, using encoding conversion from Hungarian ISO to ASCII or UTF-8 also gives similar problems you have mentioned.
print(mb_convert_encoding($input, "ASCII", "ISO-8859-16")); //Pamutk??zt??rl?? adagol?? ??s higi??niai k??zt??rl?? adagol??
print(mb_convert_encoding($input, "UTF-8", "ISO-8859-16")); //PamutkéztörlŠadagoló és higiéniai kéztörlŠadagoló
P.S. Similar question could also be found here and here.
I want to use PHP's Intl's NumberFormatter class to display prices in a human-readable format. What our project needs:
The CLDR number pattern, and the currency and separator symbols will need to be configured through our code and not default to what Intl/ICU knows.
Our application will take care of the decimals. NumberFormatter should display any decimals that we pass on to it.
However, when playing around with different configurations to find the exact combination that works for our project, I noticed some effects that I can't explain. The three formatters in the following code snippet are almost identical. As opposed to the first one, the second one uses the euro instead of the U.S. dollar, and the third one has a currency sign set. The output of the first formatter is as I expected it to be, but when I change the currency or set a currency sign, the MIN_FRACTION_DIGITS attribute is ignored and the sign is never changed.
<?php
$fmt = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
$fmt->setAttribute(NumberFormatter::MIN_FRACTION_DIGITS, 4);
echo $fmt->formatCurrency(1234567890.891234567890000, "EUR")."\n";
// Outputs 1.234.567.890,8912 €
$fmt = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
$fmt->setAttribute(NumberFormatter::MIN_FRACTION_DIGITS, 4);
echo $fmt->formatCurrency(1234567890.891234567890000, "USD")."\n";
// Ouputs 1.234.567.890,89 $
$fmt = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
$fmt->setAttribute(NumberFormatter::MIN_FRACTION_DIGITS, 4);
$fmt->setSymbol(\NumberFormatter::CURRENCY_SYMBOL, '%');
echo $fmt->formatCurrency(1234567890.891234567890000, "EUR")."\n";
// Outputs 1.234.567.890,89 €
?>
The first table row under General Purpose Numbers of the Unicode CLDR number pattern documentation describes that when parsing currency patterns, the two zeroes in the decimal part of the pattern will need to be replaced by however many digits the application thinks is appropriate. The application here is ICU (the C library that PHP uses for this), and the MIN_FRACTION_DIGITS attribute does its job of letting me override default behavior in the first example, but not in the second or the third.
Can someone please explain this seemingly random change in behavior? Let me know if there is any additional information that you need.
I just found the following:
https://bugs.php.net/bug.php?id=63140
http://bugs.icu-project.org/trac/ticket/7667
[2012-10-05 08:21 UTC] jpauli#email.com
I confirm this is an ICU bug in 4.4.x branch.
Consider upgrading libicu, 4.8.x gives correct result
Originally, I had the problem that, although I had the same path by optical inspection, file_exists() returned true for one and false for the other. After spending hours narrowing down my problem, I wound up with the following code... (paths redacted)
$myCorePath = $modx->getOption('my.core_path', null, $modx->getOption('core_path').'components/my/');
$pkg1 = $myCorePath.'model/';
$pkg2 = MODX_CORE_PATH . 'components/my/model/';
$pkg3 = '/path/to/modx/core/components/my/model/';
var_dump($pkg1, $pkg2, $pkg3);
...and its output:
string '/path/to/modx/core/components/my/model/' (length=37)
string '/path/to/modx/core/components/my/model/' (length=78)
string '/path/to/modx/core/components/my/model/' (length=78)
So two versions, interestingly including simply writing the string down, apparently use wide characters (these worked, file_exists()-wise), while sadly my preferred variant uses narrow characters. I tried to research this but the only thing I wound up with told me that php has no such thing as wide strings. I also verified with a hex editor that all string constants really only take one byte per character in the php file.
phpinfo() tells me I have PHP Version 5.4.9, and I run on a 64 bit linux machine, fwiw. The manual was edited a week ago; is its info not accurate, or what's going on here?
I think it is caused by multibyte coding.
Starting with only the locale identifier name (string) provided by clients, how or where do I look up the default "list separator" character for that locale?
The "list separator" setting is the character many different types of applications and programming languages may use as the default grouping character when joining or splitting strings and arrays. This is especially important for opening CSV files in spreadsheet programs. Though this is often the comma ",", this default character may be different depending on the machine's region settings. It may even differ between OS's.
I'm not interested in my own server environment here. Instead, I need to know more about the client's based off their locale identifier which they've given to me, so my own server settings are irrelevant. Also for this solution, I can not change the locale setting on this server to match a client's for the entire current process as a shortcut to look this value up.
If this is defined in the ICU library, I'm not able to find any way to look this value up using the INTL extension.
Any hints?
I am not sure if my answer will satisfy your requirements but I suggest (especially as you don't want to change the locale on the server) to use a function that will give you the answer:
To my knowledge (and also Wikipedia's it seems) the list separator in a CSV is a comma unless the decimal point of the locale is a comma, in that case the list separator is a semicolon.
So you could get a list of all locales that use a comma (Unicode U+002C) as separator using this command:
cd /usr/share/i18n/locales/
grep decimal_point.*2C *_* -l
and you could then take this list to determine the appropriate list separator:
function get_csv_list_separator($locale) {
$locales_with_comma_separator = "az_AZ be_BY bg_BG bs_BA ca_ES crh_UA cs_CZ da_DK de_AT de_BE de_DE de_LU el_CY el_GR es_AR es_BO es_CL es_CO es_CR es_EC es_ES es_PY es_UY es_VE et_EE eu_ES eu_ES#euro ff_SN fi_FI fr_BE fr_CA fr_FR fr_LU gl_ES hr_HR ht_HT hu_HU id_ID is_IS it_IT ka_GE kk_KZ ky_KG lt_LT lv_LV mg_MG mk_MK mn_MN nb_NO nl_AW nl_NL nn_NO pap_AN pl_PL pt_BR pt_PT ro_RO ru_RU ru_UA rw_RW se_NO sk_SK sl_SI sq_AL sq_MK sr_ME sr_RS sr_RS#latin sv_SE tg_TJ tr_TR tt_RU#iqtelif uk_UA vi_VN wo_SN");
if (stripos($locales_with_comma_separator, $locale) !== false) {
return ";";
}
return ",";
}
(the list of locales is taken from my own Debian machine, I don't know about the completeness of the list)
If you don't want to have this static list of locales (though I assume that this doesn't change that often), you can of course generate the list using the command above and cache it.
As a final note, according to RFC4180 section 2.6 the list separator actually never changes but rather fields containing a comma (so this also means floating numbers, depending on the locale) should be enclosed in double-quotes. Though (as linked above) not many people follow the RFC standard.
There's no such locale setting as "list separator" it might be software specific, but I doubt it's user specific.
However... You can detect user's locale and try to match the settings.
Get browsers locale: $accept_lang = $_SERVER['HTTP_ACCEPT_LANGUAGE']; this might contain a list of comma-separated values. Some browser don't send this though. more here...
Next you can use setlocale(LC_ALL, $accept_lang); and get available locale settings using $locale_info = localeconv(); more here...
I want to build a CMS that can handle fetching locale strings to support internationalization. I plan on storing the strings in a database, and then placing a key/value cache like memcache in between the database and the application to prevent performance drops for hitting the database each page for a translation.
This is more complex than using PHP files with arrays of strings - but that method is incredibly inefficient when you have 2,000 translation lines.
I thought about using gettext, but I'm not sure that users of the CMS will be comfortable working with the gettext files. If the strings are stored in a database, then a nice administration system can be setup to allow them to make changes whenever they want and the caching in RAM will insure that the fetching of those strings is as fast, or faster than gettext. I also don't feel safe using the PHP extension considering not even the zend framework uses it.
Is there anything wrong with this approach?
Update
I thought perhaps I would add more food for thought. One of the problems with string translations it is that they doesn't support dates, money, or conditional statements. However, thanks to intl PHP now has MessageFormatter which is what really needs to be used anyway.
// Load string from gettext file
$string = _("{0} resulted in {1,choice,0#no errors|1#single error|1<{1, number} errors}");
// Format using the current locale
msgfmt_format_message(setlocale(LC_ALL, 0), $string, array('Update', 3));
On another note, one of the things I don't like about gettext is that the text is embedded into the application all over the place. That means that the team responsible for the primary translation (usually English) has to have access to the project source code to make changes in all the places the default statements are placed. It's almost as bad as applications that have SQL spaghetti-code all over.
So, it makes sense to use keys like _('error.404_not_found') which then allow the content writers and translators to just worry about the PO/MO files without messing in the code.
However, in the event that a gettext translation doesn't exist for the given key then there is no way to fall back to a default (like you could with a custom handler). This means that you either have the writter mucking around in your code - or have "error.404_not_found" shown to users that don't have a locale translation!
In addition, I am not aware of any large projects which use PHP's gettext. I would appreciate any links to well-used (and therefore tested), systems which actually rely on the native PHP gettext extension.
Gettext uses a binary protocol that is quite quick. Also the gettext implementation is usually simpler as it only requires echo _('Text to translate');. It also has existing tools for translators to use and they're proven to work well.
You can store them in a database but I feel it would be slower and a bit overkill, especially since you'd have to build the system to edit the translations yourself.
If only you could actually cache the lookups in a dedicated memory portion in APC, you'd be golden. Sadly, I don't know how.
For those that are interested, it seems full support for locales and i18n in PHP is finally starting to take place.
// Set the current locale to the one the user agent wants
$locale = Locale::acceptFromHttp(getenv('HTTP_ACCEPT_LANGUAGE'));
// Default Locale
Locale::setDefault($locale);
setlocale(LC_ALL, $locale . '.UTF-8');
// Default timezone of server
date_default_timezone_set('UTC');
// iconv encoding
iconv_set_encoding("internal_encoding", "UTF-8");
// multibyte encoding
mb_internal_encoding('UTF-8');
There are several things that need to be condered and detecting the timezone/locale and then using it to correctly parse and display input and output is important. There is a PHP I18N library that was just released which contains lookup tables for much of this information.
Processing User input is important to make sure you application has clean, well-formed UTF-8 strings from whatever input the user enters. iconv is great for this.
/**
* Convert a string from one encoding to another encoding
* and remove invalid bytes sequences.
*
* #param string $string to convert
* #param string $to encoding you want the string in
* #param string $from encoding that string is in
* #return string
*/
function encode($string, $to = 'UTF-8', $from = 'UTF-8')
{
// ASCII is already valid UTF-8
if($to == 'UTF-8' AND is_ascii($string))
{
return $string;
}
// Convert the string
return #iconv($from, $to . '//TRANSLIT//IGNORE', $string);
}
/**
* Tests whether a string contains only 7bit ASCII characters.
*
* #param string $string to check
* #return bool
*/
function is_ascii($string)
{
return ! preg_match('/[^\x00-\x7F]/S', $string);
}
Then just run the input through these functions.
$utf8_string = normalizer_normalize(encode($_POST['text']), Normalizer::FORM_C);
Translations
As Andre said, It seems gettext is the smart default choice for writing applications that can be translated.
Gettext uses a binary protocol that is quite quick.
The gettext implementation is usually simpler as it only requires _('Text to translate')
Existing tools for translators to use and they're proven to work well.
When you reach facebook size then you can work on implementing RAM-cached, alternative methods like the one I mentioned in the question. However, nothing beats "simple, fast, and works" for most projects.
However, there are also addition things that gettext cannot handle. Things like displaying dates, money, and numbers. For those you need the INTL extionsion.
/**
* Return an IntlDateFormatter object using the current system locale
*
* #param string $locale string
* #param integer $datetype IntlDateFormatter constant
* #param integer $timetype IntlDateFormatter constant
* #param string $timezone Time zone ID, default is system default
* #return IntlDateFormatter
*/
function __date($locale = NULL, $datetype = IntlDateFormatter::MEDIUM, $timetype = IntlDateFormatter::SHORT, $timezone = NULL)
{
return new IntlDateFormatter($locale ?: setlocale(LC_ALL, 0), $datetype, $timetype, $timezone);
}
$now = new DateTime();
print __date()->format($now);
$time = __date()->parse($string);
In addition you can use strftime to parse dates taking the current locale into consideration.
Sometimes you need the values for numbers and dates inserted correctly into locale messages
/**
* Format the given string using the current system locale
* Basically, it's sprintf on i18n steroids.
*
* #param string $string to parse
* #param array $params to insert
* #return string
*/
function __($string, array $params = NULL)
{
return msgfmt_format_message(setlocale(LC_ALL, 0), $string, $params);
}
// Multiple choices (can also just use ngettext)
print __(_("{1,choice,0#no errors|1#single error|1<{1, number} errors}"), array(4));
// Show time in the correct way
print __(_("It is now {0,time,medium}), time());
See the ICU format details for more information.
Database
Make sure your connection to the database is using the correct charset so that nothing gets currupted on storage.
String Functions
You need to understand the difference between the string, mb_string, and grapheme functions.
// 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_a_ring_nfd = "a\xCC\x8A";
var_dump(grapheme_strlen($char_a_ring_nfd));
var_dump(mb_strlen($char_a_ring_nfd));
var_dump(strlen($char_a_ring_nfd));
// 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
$char_A_ring = "\xC3\x85";
var_dump(grapheme_strlen($char_A_ring));
var_dump(mb_strlen($char_A_ring));
var_dump(strlen($char_A_ring));
Domain name TLD's
The IDN functions from the INTL library are a big help processing non-ascii domain names.
There are a number of other SO questions and answers similar to this one. I suggest you search and read them as well.
Advice? Use an existing solution like gettext or xliff as it will save you lot's of grief when you hit all the translation edge cases such as right to left text, date formats, different text volumes, French is 30% more verbose than English for example that screw up formatting etc. Even better advice Don't do it. If the users want to translate they will make a clone and translate it. Because Localisation is more about look and feel and using colloquial language this is usually what happens. Again giving and example Anglo-Saxon culture likes cool web colours and san-serif type faces. Hispanic culture like bright colours and Serif/Cursive types. Which to cater for you would need different layouts per language.
Zend actually cater for the following adapters for Zend_Translate and it is a useful list.
Array:- Use PHP arrays for Small pages; simplest usage; only for programmers
Csv:- Use comma separated (.csv/.txt) files for Simple text file format; fast; possible problems with unicode characters
Gettext:- Use binary gettext (*.mo) files for GNU standard for linux; thread-safe; needs tools for translation
Ini:- Use simple INI (*.ini) files for Simple text file format; fast; possible problems with unicode characters
Tbx:- Use termbase exchange (.tbx/.xml) files for Industry standard for inter application terminology strings; XML format
Tmx:- Use tmx (.tmx/.xml) files for Industry standard for inter application translation; XML format; human readable
Qt:- Use qt linguist (*.ts) files for Cross platform application framework; XML format; human readable
Xliff:- Use xliff (.xliff/.xml) files for A simpler format as TMX but related to it; XML format; human readable
XmlTm:- Use xmltm (*.xml) files for Industry standard for XML document translation memory; XML format; human readable
Others:- *.sql for Different other adapters may be implemented in the future
I'm using the ICU stuff in my framework and really finding it simple and useful to use. My system is XML-based with XPath queries and not a database as you're suggesting to use. I've not found this approach to be inefficient. I played around with Resource bundles too when researching techniques but found them quite complicated to implement.
The Locale functionality is a god send. You can do so much more easily:
// Available translations
$languages = array('en', 'fr', 'de');
// The language the user wants
$preference = (isset($_COOKIE['lang'])) ?
$_COOKIE['lang'] : ((isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) ?
Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']) : '');
// Match preferred language to those available, defaulting to generic English
$locale = Locale::lookup($languages, $preference, false, 'en');
// Construct path to dictionary file
$file = $dir . '/' . $locale . '.xsl';
// Check that dictionary file is readable
if (!file_exists($file) || !is_readable($file)) {
throw new RuntimeException('Dictionary could not be loaded');
}
// Load and return dictionary file
$dictionary = simplexml_load_file($file);
I then perform word lookups using a method like this:
$selector = '/i18n/text[#label="' . $word . '"]';
$result = $dictionary->xpath($selector);
$text = array_shift($result);
if ($formatted && isset($text)) {
return new MessageFormatter($locale, $text);
}
The bonus for my system is that the template system is XSL-based which means I can use the same translation XML files directly in my templates for simple messages that don't need any i18n formatting.
Stick with gettext, you won't find a faster alternative in PHP.
Regarding the how, you can use a database to store your catalog and allow other users to translate the strings using a friendly gui. When the new changes are reviewed/approved, hit a button, compile a new .mo file and deploy.
Some resources to get you on track:
http://code.google.com/p/simplepo/
http://www.josscrowcroft.com/2011/code/php-mo-convert-gettext-po-file-to-binary-mo-file-php/
https://launchpad.net/php-gettext/
http://sourceforge.net/projects/tcktranslator/
What about csv files (which can be easily edited in many apps) and caching to memcache (wincache, etc.)? This approach works well in magento. All languages phrases in the code are wrapped into __() function, for example
<?php echo $this->__('Some text') ?>
Then, for example before new version release, you run simple script which parses source files, finds all text wrapped into __() and puts into .csv file. You load csv files and cache them to memcache. In __() function you look into your memcache where translations are cached.
In a recent project, we considered using gettext, but it turned out to be easier to just write our own functionality. It really is quite simple: Create a JSON file per locale (e.g. strings.en.json, strings.es.json, etc.), and create a function somewhere called "translate()" or something, and then just call that. That function will determine the current locale (from the URI or a session var or something), and return the localized string.
The only thing to remember is to make sure any HTML you output is encoded in UTF-8, and marked as such in the markup (e.g. in the doctype, etc.)
Maybe not really an answer to your question, but maybe you can get some ideas from the Symfony translation component? It looks very good to me, although I must confess I haven't used it myself yet.
The documentation for the component can be found at
http://symfony.com/doc/current/book/translation.html
and the code for the component can be found at
https://github.com/symfony/Translation.
It should be easy to use the Translation component, because Symfony components are intended to be able to be used as standalone components.
On another note, one of the things I don't like about gettext is that
the text is embedded into the application all over the place. That
means that the team responsible for the primary translation (usually
English) has to have access to the project source code to make changes
in all the places the default statements are placed. It's almost as
bad as applications that have SQL spaghetti-code all over.
This isn't actually true. You can have a header file (sorry, ex C programmer), such as:
<?php
define(MSG_404_NOT_FOUND, 'error.404_not_found')
?>
Then whenever you want a message, use _(MSG_404_NOT_FOUND). This is much more flexible than requiring developers to remember the exact syntax of the non-localised message every time they want to spit out a localised version.
You could go one step further, and generate the header file in a build step, maybe from CSV or database, and cross-reference with the translation to detect missing strings.
have a zend plugin that works very well for this.
<?php
/** dependencies **/
require 'Zend/Loader/Autoloader.php';
require 'Zag/Filter/CharConvert.php';
Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);
//filter
$filter = new Zag_Filter_CharConvert(array(
'replaceWhiteSpace' => '-',
'locale' => 'en_US',
'charset'=> 'UTF-8'
));
echo $filter->filter('ééé ááá 90');//eee-aaa-90
echo $filter->filter('óóó 10aáééé');//ooo-10aaeee
if you do not want to use the zend framework can only use the plugin.
hug!