I want to build a CMS that can handle fetching locale strings to support internationalization. I plan on storing the strings in a database, and then placing a key/value cache like memcache between the database and the application to prevent the performance drop of hitting the database on every page for a translation.
This is more complex than using PHP files with arrays of strings - but that method is incredibly inefficient when you have 2,000 translation lines.
I thought about using gettext, but I'm not sure that users of the CMS will be comfortable working with the gettext files. If the strings are stored in a database, then a nice administration system can be set up to allow them to make changes whenever they want, and the caching in RAM will ensure that fetching those strings is as fast as, or faster than, gettext. I also don't feel safe using the PHP extension, considering not even the Zend Framework uses it.
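For illustration, the lookup layer I have in mind would be roughly like this (the "translations" table, column names, and cache key prefix are just placeholders):

function fetch_string(Memcached $cache, PDO $db, $locale, $key)
{
    $cacheKey = "i18n:$locale:$key";

    // Try the RAM cache first
    $text = $cache->get($cacheKey);
    if ($text !== false) {
        return $text;
    }

    // Cache miss: fall back to the database
    $stmt = $db->prepare('SELECT text FROM translations WHERE locale = ? AND msg_key = ?');
    $stmt->execute(array($locale, $key));
    $text = $stmt->fetchColumn();
    if ($text === false) {
        $text = $key; // no translation yet, show the key rather than nothing
    }

    // Store for an hour so the database is only hit once per string
    $cache->set($cacheKey, $text, 3600);

    return $text;
}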
Is there anything wrong with this approach?
Update
I thought perhaps I would add more food for thought. One of the problems with string translations is that they don't support dates, money, or conditional statements. However, thanks to intl, PHP now has MessageFormatter, which is what really needs to be used anyway.
// Load string from gettext file
$string = _("{0} resulted in {1,choice,0#no errors|1#single error|1<{1, number} errors}");
// Format using the current locale
msgfmt_format_message(setlocale(LC_ALL, 0), $string, array('Update', 3));
On another note, one of the things I don't like about gettext is that the text is embedded into the application all over the place. That means that the team responsible for the primary translation (usually English) has to have access to the project source code to make changes in all the places the default statements are placed. It's almost as bad as applications that have SQL spaghetti-code all over.
So, it makes sense to use keys like _('error.404_not_found') which then allow the content writers and translators to just worry about the PO/MO files without messing in the code.
However, in the event that a gettext translation doesn't exist for the given key, there is no way to fall back to a default (like you could with a custom handler). This means that you either have the writer mucking around in your code - or have "error.404_not_found" shown to users that don't have a locale translation!
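To make that concrete, the kind of fallback I mean would be something like this wrapper (a sketch of a custom handler, not something gettext gives you out of the box):

function t($key, $default = null)
{
    $text = _($key);

    // gettext returns the msgid unchanged when no translation exists
    if ($text === $key && $default !== null) {
        return $default;
    }

    return $text;
}

echo t('error.404_not_found', 'Sorry, that page could not be found.');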
In addition, I am not aware of any large projects which use PHP's gettext. I would appreciate any links to well-used (and therefore tested), systems which actually rely on the native PHP gettext extension.
Gettext uses a binary protocol that is quite quick. Also the gettext implementation is usually simpler as it only requires echo _('Text to translate');. It also has existing tools for translators to use and they're proven to work well.
You can store them in a database but I feel it would be slower and a bit overkill, especially since you'd have to build the system to edit the translations yourself.
If only you could actually cache the lookups in a dedicated memory portion in APC, you'd be golden. Sadly, I don't know how.
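The APCu user cache (apcu_fetch()/apcu_store()) comes close, since it keeps values in shared memory between requests; a rough, untested sketch, with load_string_from_db() standing in for whatever backend lookup you use:

function cached_string($locale, $key)
{
    $cacheKey = "i18n:$locale:$key";

    // Shared-memory hit: no database involved at all
    $text = apcu_fetch($cacheKey, $hit);
    if ($hit) {
        return $text;
    }

    // Miss: load from the backend and cache for an hour
    $text = load_string_from_db($locale, $key);
    apcu_store($cacheKey, $text, 3600);

    return $text;
}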
For those that are interested, it seems full support for locales and i18n in PHP is finally starting to take place.
// Set the current locale to the one the user agent wants
$locale = Locale::acceptFromHttp(getenv('HTTP_ACCEPT_LANGUAGE'));
// Default Locale
Locale::setDefault($locale);
setlocale(LC_ALL, $locale . '.UTF-8');
// Default timezone of server
date_default_timezone_set('UTC');
// iconv encoding
iconv_set_encoding("internal_encoding", "UTF-8");
// multibyte encoding
mb_internal_encoding('UTF-8');
There are several things that need to be considered, and detecting the timezone/locale and then using it to correctly parse and display input and output is important. There is a PHP I18N library that was just released which contains lookup tables for much of this information.
Processing user input is important to make sure your application has clean, well-formed UTF-8 strings from whatever input the user enters. iconv is great for this.
/**
* Convert a string from one encoding to another encoding
* and remove invalid byte sequences.
*
* @param string $string to convert
* @param string $to encoding you want the string in
* @param string $from encoding that string is in
* @return string
*/
function encode($string, $to = 'UTF-8', $from = 'UTF-8')
{
// ASCII is already valid UTF-8
if($to == 'UTF-8' AND is_ascii($string))
{
return $string;
}
// Convert the string
return @iconv($from, $to . '//TRANSLIT//IGNORE', $string);
}
/**
* Tests whether a string contains only 7bit ASCII characters.
*
* @param string $string to check
* @return bool
*/
function is_ascii($string)
{
return ! preg_match('/[^\x00-\x7F]/S', $string);
}
Then just run the input through these functions.
$utf8_string = normalizer_normalize(encode($_POST['text']), Normalizer::FORM_C);
Translations
As Andre said, it seems gettext is the smart default choice for writing applications that can be translated.
Gettext uses a binary protocol that is quite quick.
The gettext implementation is usually simpler as it only requires _('Text to translate')
There are existing tools for translators to use, and they're proven to work well.
When you reach Facebook size you can work on implementing RAM-cached, alternative methods like the one I mentioned in the question. However, nothing beats "simple, fast, and works" for most projects.
However, there are also additional things that gettext cannot handle, like displaying dates, money, and numbers. For those you need the INTL extension.
/**
* Return an IntlDateFormatter object using the current system locale
*
* @param string $locale string
* @param integer $datetype IntlDateFormatter constant
* @param integer $timetype IntlDateFormatter constant
* @param string $timezone Time zone ID, default is system default
* @return IntlDateFormatter
*/
function __date($locale = NULL, $datetype = IntlDateFormatter::MEDIUM, $timetype = IntlDateFormatter::SHORT, $timezone = NULL)
{
return new IntlDateFormatter($locale ?: setlocale(LC_ALL, 0), $datetype, $timetype, $timezone);
}
$now = new DateTime();
print __date()->format($now);
$time = __date()->parse($string);
In addition, you can use strftime to format dates taking the current locale into consideration.
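A minimal illustration (the exact output depends on which locales are installed on the system):

// strftime() honours the locale set via setlocale(LC_TIME, ...)
setlocale(LC_TIME, 'de_DE.UTF-8');
echo strftime('%A, %d. %B %Y'); // e.g. Montag, 05. März 2012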
Sometimes you need the values for numbers and dates inserted correctly into localized messages.
/**
* Format the given string using the current system locale
* Basically, it's sprintf on i18n steroids.
*
* @param string $string to parse
* @param array $params to insert
* @return string
*/
function __($string, array $params = NULL)
{
return msgfmt_format_message(setlocale(LC_ALL, 0), $string, $params);
}
// Multiple choices (can also just use ngettext)
print __(_("{0,choice,0#no errors|1#single error|1<{0, number} errors}"), array(4));
// Show time in the correct way
print __(_("It is now {0,time,medium}"), array(time()));
See the ICU format details for more information.
Database
Make sure your connection to the database is using the correct charset so that nothing gets corrupted in storage.
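For example, with PDO and MySQL the charset can be set right in the DSN (database name and credentials below are placeholders):

$db = new PDO('mysql:host=localhost;dbname=cms;charset=utf8mb4', 'user', 'password');

// PHP versions older than 5.3.6 ignore the DSN charset; SET NAMES achieves the same
$db->exec("SET NAMES 'utf8mb4'");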
String Functions
You need to understand the difference between the string, mb_string, and grapheme functions.
// 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_a_ring_nfd = "a\xCC\x8A";
var_dump(grapheme_strlen($char_a_ring_nfd)); // int(1) - one user-perceived character
var_dump(mb_strlen($char_a_ring_nfd));       // int(2) - two code points
var_dump(strlen($char_a_ring_nfd));          // int(3) - three bytes
// 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
$char_A_ring = "\xC3\x85";
var_dump(grapheme_strlen($char_A_ring)); // int(1)
var_dump(mb_strlen($char_A_ring));       // int(1) - single precomposed code point
var_dump(strlen($char_A_ring));          // int(2) - two bytes
Domain name TLDs
The IDN functions from the INTL library are a big help processing non-ASCII domain names.
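For example (a small sketch; the domain is arbitrary):

// Convert a Unicode domain name to its Punycode (ASCII) form and back
echo idn_to_ascii('köln.de');       // xn--kln-sna.de
echo idn_to_utf8('xn--kln-sna.de'); // köln.de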
There are a number of other SO questions and answers similar to this one. I suggest you search and read them as well.
Advice? Use an existing solution like gettext or XLIFF, as it will save you lots of grief when you hit all the translation edge cases that break your formatting: right-to-left text, date formats, differing text volumes (French is about 30% more verbose than English, for example), and so on. Even better advice: don't do it. If the users want a translation they will make a clone and translate it, because localisation is more about look and feel and using colloquial language, and that is usually what happens. To give an example, Anglo-Saxon culture likes cool web colours and sans-serif typefaces, while Hispanic culture likes bright colours and serif/cursive typefaces. To cater for both you would need different layouts per language.
Zend actually caters for the following adapters for Zend_Translate, and it is a useful list.
Array: PHP arrays; small pages; simplest usage; only for programmers
Csv: Comma-separated (*.csv/*.txt) files; simple text file format; fast; possible problems with Unicode characters
Gettext: Binary gettext (*.mo) files; GNU standard for Linux; thread-safe; needs tools for translation
Ini: Simple INI (*.ini) files; simple text file format; fast; possible problems with Unicode characters
Tbx: Termbase exchange (*.tbx/*.xml) files; industry standard for inter-application terminology strings; XML format
Tmx: TMX (*.tmx/*.xml) files; industry standard for inter-application translation; XML format; human readable
Qt: Qt Linguist (*.ts) files; cross-platform application framework; XML format; human readable
Xliff: XLIFF (*.xliff/*.xml) files; a simpler format than TMX but related to it; XML format; human readable
XmlTm: XmlTm (*.xml) files; industry standard for XML document translation memory; XML format; human readable
Others: *.sql; other adapters may be implemented in the future
I'm using the ICU stuff in my framework and really finding it simple and useful to use. My system is XML-based with XPath queries and not a database as you're suggesting to use. I've not found this approach to be inefficient. I played around with Resource bundles too when researching techniques but found them quite complicated to implement.
The Locale functionality is a godsend. You can do so much more easily:
// Available translations
$languages = array('en', 'fr', 'de');
// The language the user wants
$preference = (isset($_COOKIE['lang'])) ?
$_COOKIE['lang'] : ((isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) ?
Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']) : '');
// Match preferred language to those available, defaulting to generic English
$locale = Locale::lookup($languages, $preference, false, 'en');
// Construct path to dictionary file
$file = $dir . '/' . $locale . '.xsl';
// Check that dictionary file is readable
if (!file_exists($file) || !is_readable($file)) {
throw new RuntimeException('Dictionary could not be loaded');
}
// Load and return dictionary file
$dictionary = simplexml_load_file($file);
I then perform word lookups using a method like this:
$selector = '/i18n/text[@label="' . $word . '"]';
$result = $dictionary->xpath($selector);
$text = array_shift($result);
if ($formatted && isset($text)) {
return new MessageFormatter($locale, $text);
}
The bonus for my system is that the template system is XSL-based which means I can use the same translation XML files directly in my templates for simple messages that don't need any i18n formatting.
Stick with gettext, you won't find a faster alternative in PHP.
Regarding the how, you can use a database to store your catalog and allow other users to translate the strings using a friendly gui. When the new changes are reviewed/approved, hit a button, compile a new .mo file and deploy.
Some resources to get you on track:
http://code.google.com/p/simplepo/
http://www.josscrowcroft.com/2011/code/php-mo-convert-gettext-po-file-to-binary-mo-file-php/
https://launchpad.net/php-gettext/
http://sourceforge.net/projects/tcktranslator/
What about CSV files (which can be easily edited in many apps) and caching to memcache (WinCache, etc.)? This approach works well in Magento. All language phrases in the code are wrapped in the __() function, for example:
<?php echo $this->__('Some text') ?>
Then, for example before a new version release, you run a simple script which parses the source files, finds all text wrapped in __(), and puts it into a .csv file. You load the CSV files and cache them in memcache. In the __() function you then look up the memcache where the translations are cached.
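A rough sketch of such a __() helper (the globals, cache key, and CSV path are made-up placeholders):

function __($text)
{
    global $memcache, $locale; // shared Memcache instance and current locale code

    $translations = $memcache->get('i18n_' . $locale);
    if ($translations === false) {
        // Cache miss: parse the generated CSV once and cache the whole array
        $translations = array();
        foreach (array_map('str_getcsv', file("app/locale/$locale.csv", FILE_IGNORE_NEW_LINES)) as $row) {
            if (isset($row[1])) {
                $translations[$row[0]] = $row[1];
            }
        }
        $memcache->set('i18n_' . $locale, $translations, 0, 3600);
    }

    return isset($translations[$text]) ? $translations[$text] : $text;
}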
In a recent project, we considered using gettext, but it turned out to be easier to just write our own functionality. It really is quite simple: Create a JSON file per locale (e.g. strings.en.json, strings.es.json, etc.), and create a function somewhere called "translate()" or something, and then just call that. That function will determine the current locale (from the URI or a session var or something), and return the localized string.
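A minimal sketch of that approach, assuming strings.<locale>.json files shaped like {"greeting": "Hello"} and the locale kept in the session:

function translate($key)
{
    static $strings = null;

    if ($strings === null) {
        $locale  = isset($_SESSION['locale']) ? $_SESSION['locale'] : 'en';
        $file    = __DIR__ . '/strings.' . $locale . '.json';
        // Missing dictionary file: fall back to an empty array
        $strings = is_readable($file) ? json_decode(file_get_contents($file), true) : array();
    }

    // Fall back to the key itself when a string is missing
    return isset($strings[$key]) ? $strings[$key] : $key;
}

echo translate('greeting');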
The only thing to remember is to make sure any HTML you output is encoded in UTF-8, and marked as such in the markup (e.g. in the doctype, etc.)
Maybe not really an answer to your question, but maybe you can get some ideas from the Symfony translation component? It looks very good to me, although I must confess I haven't used it myself yet.
The documentation for the component can be found at
http://symfony.com/doc/current/book/translation.html
and the code for the component can be found at
https://github.com/symfony/Translation.
It should be easy to use the Translation component, because Symfony components are intended to be able to be used as standalone components.
On another note, one of the things I don't like about gettext is that
the text is embedded into the application all over the place. That
means that the team responsible for the primary translation (usually
English) has to have access to the project source code to make changes
in all the places the default statements are placed. It's almost as
bad as applications that have SQL spaghetti-code all over.
This isn't actually true. You can have a header file (sorry, ex C programmer), such as:
<?php
define('MSG_404_NOT_FOUND', 'error.404_not_found');
?>
Then whenever you want a message, use _(MSG_404_NOT_FOUND). This is much more flexible than requiring developers to remember the exact syntax of the non-localised message every time they want to spit out a localised version.
You could go one step further, and generate the header file in a build step, maybe from CSV or database, and cross-reference with the translation to detect missing strings.
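A sketch of such a build step, assuming a messages.csv with constant,key pairs (both file names are placeholders):

// Turn each "MSG_404_NOT_FOUND,error.404_not_found" row into a define()
$out = "<?php\n";
foreach (array_map('str_getcsv', file('messages.csv', FILE_IGNORE_NEW_LINES)) as $row) {
    list($constant, $key) = $row;
    $out .= sprintf("define('%s', '%s');\n", $constant, addslashes($key));
}
file_put_contents('constants.php', $out);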
There is a Zend plugin that works very well for this.
<?php
/** dependencies **/
require 'Zend/Loader/Autoloader.php';
require 'Zag/Filter/CharConvert.php';
Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);
//filter
$filter = new Zag_Filter_CharConvert(array(
'replaceWhiteSpace' => '-',
'locale' => 'en_US',
'charset'=> 'UTF-8'
));
echo $filter->filter('ééé ááá 90');//eee-aaa-90
echo $filter->filter('óóó 10aáééé');//ooo-10aaeee
If you do not want to use the Zend Framework, you can use just the plugin.
Hug!
Related
I would like the contact form on my website to only accept text submitted in English. I've been dealing with a lot of spam recently that has appeared in multiple languages that is slipping right past the CAPTCHA. There is simply no reason for anyone to submit this form in a language other than English since it's not a business and more of a hobby for personal use.
I've been looking through this documentation and was hopeful that something like preg_match( '/[\p{Latin}]/u', $input) might work, but I'm not bilingual and don't understand all the nuances of character encoding, so while this will help filter out something like Russian it still allows languages like Vietnamese to slip through.
Ideally I would like it to accept:
Any Unicode symbol that might be used. I have frequently come across different styles of dashes, apostrophes, or things related to math, for example.
Common diacritical marks / accented characters found in words like "résumé."
And I would like it to reject:
Anything that appears to be something other than English, or uncommon. I'm not overly concerned with accents such as "naïve" or in words borrowed from other languages.
I'm thinking of simply stripping all potentially valid characters as follows:
$input = 'testing for English only!';
// reference: https://en.wikipedia.org/wiki/List_of_Unicode_characters
// allowed punctuation
$basic_latin = '`~!@#$%^&*()-_=+[{]}\\|;:\'",<.>/?';
$input = str_replace(str_split($basic_latin), '', $input);
// allowed symbols and accents
$latin1_supplement = '¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿É×é÷';
$input = str_replace(str_split($latin1_supplement), '', $input);
$unicode_symbols = '–—―‗‘’‚‛“”„†‡•…‰′″‹›‼‾⁄⁊';
$input = str_replace(str_split($unicode_symbols), '', $input);
// remove all spaces including tabs and end lines
$input = preg_replace('/\s+/', '', $input);
// check that remaining characters are alpha-numeric
if (strlen($input) > 0 && ctype_alnum($input)) {
echo 'this is English';
} else {
echo 'no bueno señor';
}
However, I'm afraid there might be some perfectly common and valid exceptions that I'm unwittingly leaving out. I'm hoping that someone might be able to offer a more elegant solution or approach?
There are no native PHP features that would provide language recognition. There's an abandoned PEAR package and some classes floating around cyberspace (I haven't tested them). If an external API is fine, Google's Translation API Basic can detect language, with 500K free characters per month.
There is however a very simple solution to all this. We don't really need to know what language it is. All we need to know is whether it's reasonably valid English. And not Swahili or Klingon or Russian or Gibberish. Now, there is a convenient PHP extension for this: PSpell.
Here's a sample function you might use:
/**
* Spell Check Stats.
* Returns an array with OK, FAIL spell check counts and their ratio.
* Use the ratio to filter out undesirable (non-English/garbled) content.
*
* @updated 2022-12-29 00:00:29 +07:00
* @author @cmswares
* @ref https://stackoverflow.com/q/74910421/4630325
*
* @param string $text
*
* @return array
*/
function spell_check_stats(string $text): array
{
$stats = [
'ratio' => null,
'ok' => 0,
'fail' => 0
];
// Split into words
$words = preg_split('~[^\w\']+~', $text, -1, PREG_SPLIT_NO_EMPTY);
// New PSpell dictionary handle for English
$pspeller = pspell_new("en");
// Check spelling and build stats
foreach($words as $word) {
if(pspell_check($pspeller, $word)) {
$stats['ok']++;
} else {
$stats['fail']++;
}
}
// Calculate ratio of OK to FAIL
$stats['ratio'] = match(true) {
    $stats['fail'] === 0 => count($words), // every word passed; also avoids division by zero
    $stats['ok'] === 0 => 0,               // nothing passed the spell check
    default => $stats['ok'] / $stats['fail'],
};
return $stats;
}
Source at BitBucket. Function usage:
$stats = spell_check_stats('This starts in English, esto no se quiere, tätä ei haluta.');
// ratio: 0.7142857142857143, ok: 5, fail: 7
Then simply decide the threshold at which a submission is rejected. For example, reject if 20 words in 100 fail, i.e. an 80:20 ratio, or "ratio = 4". The higher the ratio, the more (properly-spelled) English it is.
The "ok" and "fail" counts are also returned in case you need to calibrate separately for very short strings. Run some tests on existing valid and spam content to see what sorts of figures you get, and then tune your rejection threshold accordingly.
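As a usage sketch, the rejection rule itself could be as simple as this (the threshold of 4 is just the 80:20 example above, and the minimum fail count guards very short messages):

$stats = spell_check_stats($_POST['message']);

// Reject anything with too many words that fail the English spell check
if ($stats['fail'] > 2 && $stats['ratio'] < 4) {
    exit('Sorry, this form only accepts messages written in English.');
}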
PSpell package for PHP may not be installed by default on your server. On CentOS / RedHat, yum install php-pspell aspell-en, to install both the PHP module (includes ASpell dependency), along with an English dictionary. For other platforms, install per your package manager.
For Windows and modern PHP, I can't find the extension dll, or a maintained Aspell port. Please share if you've found a solution. Would like to have this on my dev machine too.
I am creating a site where the authenticated user can write messages for the index site.
On the message create site I have a textbox where the user can give the title of the message, and a textbox where he can write the message.
The message will be exported to a .txt file, and from the title I'm creating the filename of the .txt file, like this:
Title: This is a message (The filename will be: thisisamessage.txt)
The original title text will be stored in a database record along with the .txt filename as the path.
For converting the title text I am using a function that looks like this:
function filenameconverter($title){
$filename=str_replace(" ","",$title);
$filename=str_replace("ű","u",$filename);
$filename=str_replace("á","a",$filename);
$filename=str_replace("ú","u",$filename);
$filename=str_replace("ö","o",$filename);
$filename=str_replace("ő","o",$filename);
$filename=str_replace("ó","o",$filename);
$filename=str_replace("é","e",$filename);
$filename=str_replace("ü","u",$filename);
$filename=str_replace("í","i",$filename);
$filename=str_replace("Ű","U",$filename);
$filename=str_replace("Á","A",$filename);
$filename=str_replace("Ú","U",$filename);
$filename=str_replace("Ö","O",$filename);
$filename=str_replace("Ő","O",$filename);
$filename=str_replace("Ó","O",$filename);
$filename=str_replace("É","E",$filename);
$filename=str_replace("Ü","U",$filename);
$filename=str_replace("Í","I",$filename);
return $filename;
}
It works fine most of the time, but sometimes it doesn't do its job.
For example: "Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló".
It should stand as a .txt as:
pamutkeztorloadagoloeshigieniaikeztorloadagolo.txt, and most of the times it is.
But sometimes when I enter this, it comes out as:
pamutkă©ztă¶rlĺ‘adagolăłă©shigiă©niaikă©ztă¶rlĺ‘adagolăł.txt
I'm Hungarian, so the title text will also be Hungarian; that's why I have to replace these characters.
I'm using XAMPP with apache and phpmyadmin.
I would rather use a generated unique ID for each file as its filename and save the real name in a separate column.
This way you avoid someone overwriting files simply by uploading them several times. But if that is what you want, you will find several approaches to cleaning filenames here on SO, and one very good one that I used is http://cubiq.org/the-perfect-php-clean-url-generator
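A short sketch of that idea (random_bytes() needs PHP 7+, uniqid() would do on older versions; the table and column names are placeholders):

// Random, collision-safe name on disk; the real title lives in the database
$storedName = bin2hex(random_bytes(16)) . '.txt';
file_put_contents('/var/www/messages/' . $storedName, $messageText);

$stmt = $db->prepare('INSERT INTO messages (title, filename) VALUES (?, ?)');
$stmt->execute(array($title, $storedName));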
intl
I don't think it is advisable to use str_replace manually for this purpose. You can use the bundled intl extension available as of PHP 5.3.0. Make sure the extension is turned on in your XAMPP settings.
Then, use the transliterator_transliterate() function to transform the string. You can also convert them to lowercase along. Credit goes to simonsimcity.
<?php
$input = 'Pamutkéztörlő adagoló és higiéniai kéztörlő adagoló';
$output = transliterator_transliterate('Any-Latin; Latin-ASCII; lower()', $input);
print(str_replace(' ', '', $output)); //pamutkeztorloadagoloeshigieniaikeztorloadagolo
?>
P.S. Unfortunately, the php manual on this function doesn't elaborate the available transliterator strings, but you can take a look at Artefacto's answer here.
iconv
Using iconv still returns some of the diacritics that are probably not expected.
print(iconv("UTF-8","ASCII//TRANSLIT",$input)); //Pamutk'ezt"orl"o adagol'o 'es higi'eniai k'ezt"orl"o adagol'o
mb_convert_encoding
Meanwhile, converting the encoding from the Hungarian ISO charset to ASCII or UTF-8 also gives problems similar to the ones you mentioned.
print(mb_convert_encoding($input, "ASCII", "ISO-8859-16")); //Pamutk??zt??rl?? adagol?? ??s higi??niai k??zt??rl?? adagol??
print(mb_convert_encoding($input, "UTF-8", "ISO-8859-16")); //PamutkéztörlŠadagoló és higiéniai kéztörlŠadagoló
P.S. Similar question could also be found here and here.
I'm using gettext() to translate some of my texts in my website. Mostly these are short texts/buttons like "Back", "Name",...
// I18N support information here
$language = "en_US";
putenv("LANG=$language");
setlocale(LC_ALL, $language);
// Set the text domain as 'messages'
$domain = 'messages';
bindtextdomain($domain, "/opt/www/abc/web/www/lcl");
textdomain($domain);
echo gettext("Back");
My question is, how 'long' can this text (id) be in the echo gettext("") part ?
Is it slowing down the process for long texts? Or does it work just fine too? Like this for example:
echo _("LZ adfadffs is a VVV contributor who writes a weekly column for Cv00m. The former Hechinger Institute Fellow has had his commentary recognized by the Online News Association, the National Association of Black Journalists and the National ");
The official gettext documentation merely has this advice:
Translatable strings should be limited to one paragraph; don't let a single message be longer than ten lines. The reason is that when the translatable string changes, the translator is faced with the task of updating the entire translated string. Maybe only a single word will have changed in the English string, but the translator doesn't see that (with the current translation tools), therefore she has to proofread the entire message.
There's no official limitation on the length of strings, and they can obviously exceed at least "one paragraph/10 lines".
There should be virtually no measurable performance penalty for long strings.
gettext effectively has a limit of 4096 chars on the length of strings.
When you pass this limit you get a warning:
Warning: gettext(): msgid passed too long in %s on line %d
and gettext() returns bool(false) instead of the text.
Source:
PHP Interpreter repository - The real fix for the gettext overflow bug
function gettext http://www.php.net/manual/en/function.gettext.php
It's defined as a string input, so your machine's memory would be the limiting factor.
Try benchmarking it with microtime(), or better with Xdebug if you have it on your development machine.
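For example, a quick-and-dirty measurement with microtime() could look like this (the 10,000 iterations are an arbitrary number):

$start = microtime(true);

for ($i = 0; $i < 10000; $i++) {
    gettext('Back');
}

printf("%.4f seconds for 10,000 lookups\n", microtime(true) - $start);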
I'm looking for an free php library that can generate code diff HTML. Basically just like GitHub's code diffs pages.
I've been searching all around and can't find anything. Does anyone know of anything out there that does what I'm looking for?
It looks like I found what I'm looking for after doing more Google searches with different wording.
php-diff seems to do exactly what I want: just a PHP function that accepts two strings and generates all the HTML needed to display the diff in a web page.
To add my two cents here...
Unfortunately, there are no really good diff libraries for displaying/generating diffs in PHP. That said, I recently did find a circuitous way to do this using PHP. The solution involved:
A pure JavaScript approach for rendering the Diff
Shelling out to git with PHP to generate the Diff to render
First, there is an excellent JavaScript library for rendering GitHub-style diffs called diff2html. This renders diffs very cleanly and with modern styling. However diff2html requires a true git diff to render as it is intended to literally render git diffs--just like GitHub.
If we let diff2html handle the rendering of the diff, then all we have left to do is create the git diff to have it render.
To do that in PHP, you can shell out to the local git binary running on the server. You can use git to calculate a diff on two arbitrary files using the --no-index option. You can also specify how many lines before/after the found diffs to return with the -U option.
On the server it would look something like this:
// File names to save data to diff in
$leftFile = '/tmp/fileA.txt';
$rightFile = '/tmp/fileB.txt';
file_put_contents($leftFile, $leftData);
file_put_contents($rightFile, $rightData);
// Generate git diff and save shell output
$diff = shell_exec("git diff -U1000 --no-index $leftFile $rightFile");
// Strip off first line of output
$diff = substr($diff, strpos($diff, "\n"));
// Delete the files we just created
unlink($leftFile);
unlink($rightFile);
Then you need to get $diff back to the front-end. You should review the docs for diff2html but the end result will look something like this in JavaScript (assuming you pass $diff as diffString):
function renderDiff(el, diffString) {
var diff2htmlUi = new Diff2HtmlUI({diff: diffString});
diff2htmlUi.draw(el);
}
I think what you're looking for is xdiff.
xdiff extension enables you to create and apply patch files containing differences between different revisions of files.
This extension supports two modes of operation - on strings and on files, as well as two different patch formats - unified and binary. Unified patches are excellent for text files as they are human-readable and easy to review. For binary files like archives or images, binary patches will be adequate choice as they are binary safe and handle non-printable characters well.
I have a PHP script that builds a binary search tree over a rather large CSV file (5MB+). This is nice and all, but it takes about 3 seconds to read/parse/index the file.
Now I thought I could use serialize() and unserialize() to quicken the process. When the CSV file has not changed in the meantime, there is no point in parsing it again.
To my horror I find that calling serialize() on my index object takes 5 seconds and produces a huge (19MB) text file, whereas unserialize() takes unbearable 27 seconds to read it back. Improvements look a bit different. ;-)
So - is there a faster mechanism to store/restore large object graphs to/from disk in PHP?
(To clarify: I'm looking for something that takes significantly less than the aforementioned 3 seconds to do the de-serialization job.)
var_export should be lots faster as PHP won't have to process the string at all:
// export the processed CSV data to export.php
$php_array = read_parse_and_index_csv($csv); // takes 3 seconds
$export = var_export($php_array, true);
file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');
Then include export.php when you need it:
include 'export.php';
Depending on your web server setup, you may have to adjust the permissions on export.php so the web server can read it.
Try igbinary...did wonders for me:
http://pecl.php.net/package/igbinary
First you have to change the way your program works: divide the CSV file into smaller chunks. This is an IP datastore, I assume.
Convert all IP addresses to integers (longs).
Then, when a query comes in, you know which part to look in.
There are ip2long() and long2ip() functions to do this.
So, over the 0 to 2^32 range, split all the IP addresses into, say, 100 smaller files (e.g. 5,000K records at 50K per file), as sketched below.
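For illustration, the bucketing could look roughly like this (the file naming and the bucket count of 100 are arbitrary):

$ip = '203.0.113.25';

// ip2long() maps the address into the 0..2^32-1 range; pick one of 100 buckets
// (sprintf('%u', ...) keeps the value positive on 32-bit builds)
$bucket = (int) floor(sprintf('%u', ip2long($ip)) / (4294967296 / 100));
$file   = sprintf('ipdata_%03d.ser', $bucket);

// Only the relevant chunk has to be read and unserialized
$chunk = unserialize(file_get_contents($file));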
This approach brings you quicker serialization.
Think smart, code tidy ;)
It seems that the answer to your question is no.
Even if you discover a "binary serialization format" option, most likely even that would be too slow for what you envisage.
So, what you may have to look into using (as others have mentioned) is a database, memcached, or on online web service.
I'd like to add the following ideas as well:
caching of requests/responses
your PHP script does not shut down but becomes a network server that answers queries
or, dare I say it, change the data structure and method of query you are currently using
I see two options here:
String serialization, in the simplest form something like:
write => implode("\x01", (array) $node);
read  => $a = explode("\x01", $data); $node->payload = $a[0]; $node->value = $a[1]; // etc.
Binary serialization with pack():
write => pack("fnna*", $node->value, $node->le, $node->ri, $node->payload);
read  => $node = (object) unpack("fvalue/nle/nri/a*payload", $data);
It would be interesting to benchmark both options and compare the results.
If you want speed, writing to or reading from the file system is less than optimal.
In most cases, a database server will be able to store and retrieve data much more efficiently than a PHP script that is reading/writing files.
Another possibility would be something like Memcached.
Object serialization is not known for its performance but for its ease of use, and it's definitely not suited to handling large amounts of data.
SQLite comes with PHP, so you could use that as your database. Otherwise you could try using sessions; then you don't have to serialize anything, you're just saving the raw PHP object.
What about using something like JSON as the format for storing/loading the data? I have no idea how fast the JSON parser is in PHP, but it's usually a fast operation in most languages and it's a lightweight format.
http://php.net/manual/en/book.json.php
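A minimal sketch of that round trip (note that json_decode() gives you back plain arrays or stdClass objects, not instances of your original tree-node classes):

// Store the parsed index once
file_put_contents('index.json', json_encode($php_array));

// ...later, instead of re-parsing the CSV:
$php_array = json_decode(file_get_contents('index.json'), true);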