Charset detection in PHP


Edit: I've added a new take on this; please see "Cheating PHP integers". Any help will be much appreciated. My idea is to hack the storage of the arrays by packing the integers into unsigned bytes (I only need 8- or 16-bit integers, which would reduce memory use considerably).
Hi
I'm currently working on custom charset detection libraries. I created a port of Mozilla's charset detection algorithm, using chardet (the Python port) as a reference. However, it is extremely memory-intensive in PHP (around 30 MB of memory if I load only the Western language detection). I've optimised everything I can short of rewriting it from scratch to load each piece on demand (which would reduce memory use but make it a lot slower).
My question: do you know of any LGPL PHP libraries that do charset detection?
This would be purely for research to give me a slight guiding hand in the right direction.
I already know of mb_detect_encoding, but it's far too limited and produces far too many false positives with the text files I have (yet Python's chardet detects them perfectly).
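The byte-packing idea mentioned in the edit note can be sketched as follows. This is a minimal illustration, assuming the tables only hold values that fit in 8 or 16 bits; the table contents and the helper names packed_get and packed_get16 are hypothetical, not part of any existing library.

```php
<?php
// Hypothetical lookup table; the values are made up for illustration.
$table = [0, 3, 255, 17, 42];

// pack('C*', ...) stores each value as one unsigned byte in a binary string,
// instead of one full array element per integer.
$packed = pack('C*', ...$table);

// Read a single 8-bit entry without unpacking the whole table.
function packed_get(string $packed, int $index): int
{
    return ord($packed[$index]);
}

// For 16-bit values, 'n' means unsigned 16-bit big-endian: two bytes per entry.
$wide = pack('n*', 1000, 65535, 7);

// Read a single 16-bit entry by slicing out its two bytes.
function packed_get16(string $packed, int $index): int
{
    return unpack('n', substr($packed, $index * 2, 2))[1];
}
```

Lookups become a function call plus ord() or unpack() instead of a plain array access, so this trades some speed for memory.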

I created a method that converts content to UTF-8. It was hard to figure out which encoding the content is currently in, so I came up with this solution:
<?php
function _convert($content) {
    // Treat the content as UTF-8 only if it validates and survives a
    // UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {
        // No source encoding is given, so mbstring assumes its internal encoding.
        $content = mb_convert_encoding($content, 'UTF-8');
        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not convert to UTF-8');
        }
    }
    return $content;
}
?>
As you can see, I do a round-trip conversion (UTF-8 to UTF-32 and back) to check whether the content is still the same, and convert it if it is not. Maybe you can use some of this code.
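A related approach is to try a short list of candidate encodings in order and convert from the first one that validates. The sketch below assumes the input is either UTF-8 or Windows-1252. The function name to_utf8 and the candidate list are hypothetical. This is not statistical detection like chardet: single-byte encodings accept almost any byte sequence, so the order of candidates decides the result.

```php
<?php
// Returns the content converted to UTF-8, or null if no candidate encoding
// validates. Candidates are tried in order; the first match wins.
function to_utf8(string $content, array $candidates = ['UTF-8', 'Windows-1252']): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($content, $encoding)) {
            return mb_convert_encoding($content, 'UTF-8', $encoding);
        }
    }
    return null;
}

// "\xF6" is "ö" in Windows-1252 and is not valid UTF-8 on its own.
$fromLegacy = to_utf8("Geh\xF6rt");

// Input that is already valid UTF-8 passes through unchanged.
$fromUtf8 = to_utf8("Geh\xC3\xB6rt");
```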

First of all, that's an interesting project you are working on! I'm curious how the end product will turn out.
Have you taken a look at the ICU project already?

Related

LibSodium functions return unreadable characters

I am following along with a tutorial on encryption: https://php.watch/articles/modern-php-encryption-decryption-sodium. In working with the Sodium extension I'm just baffled by a few things. Googling is returning frustratingly little help. (Most of the results are just duplications of the php.net/manual.)
1. In various articles I'm reading, the result of sodium_crypto_*_encrypt() is something familiar:
// ex. DEx9ATXEg/eRq8GWD3NT5BatB3m31WED
Whenever I echo it out myself I get something like:
// ex. 𫦢�2(*���3�CV��Wu��R~�u���H��
which I'm certain won't store correctly in a database. Nowhere in the articles or documentation is there any mention of charset weirdness. I can throw a header('Content-Type: text/html; charset=ISO-8859-1') in there, but I still get weird characters that I'm not certain are right, since I'm not finding any threads talking about this:
// ex. ÑAÁ5eŠ…n#±'ýÞÃ1è9ÜÈÌ³¬"CžãÚ0ÿÛ
2. I can't find any information about the best practice for storing keys or nonces.
I just figured this obvious-to-security-folks-but-not-to-others bit of information would be a regularly discussed part of articles on keygens and nonces and such. Seeing as both my keygen and nonce functions (at least in the Sodium library) seem to return non-UTF-8 gibberish, what do I do with it? fwrite it out to a file to be referenced later? Pass it directly to my database? Copy/pasting certainly doesn't work right with it being wingdings.
Other than these things, everything else in the encryption/decryption process makes complete sense to me. I'm far from new to PHP development, I just can't figure this out.

I came across https://stackoverflow.com/a/44874239/1128978, which answers "PHP random_bytes returns unreadable characters":
random_bytes generates an arbitrary length string of cryptographic random bytes...
It suggests using bin2hex to get readable characters. So, amending my usage:
// Generate $ciphertext
$message = 'This is a secret message';
$key = sodium_crypto_*_keygen();
$nonce = random_bytes(SODIUM_CRYPTO_*BYTES);
$ciphertext = sodium_crypto_*_encrypt($message, '', $nonce, $key);
// Store hexadecimal versions of binary output
$nonce_hex = bin2hex($nonce);
$key_hex = bin2hex($key);
$ciphertext_hex = bin2hex($ciphertext);
// When ready to decrypt, convert hexadecimal values back to binary
$ciphertext_bin = hex2bin($ciphertext_hex);
$nonce_bin = hex2bin($nonce_hex);
$key_bin = hex2bin($key_hex);
$decrypted = sodium_crypto_*_decrypt($ciphertext_bin, '', $nonce_bin, $key_bin);
// "This is a secret message"
This makes heavy use of bin2hex and hex2bin, but it now makes sense. Effectively solved, though I'm not confident this is the proper way to work with it. I still have no idea why this isn't pointed out anywhere in php.net/manual or in any of the articles and comments I've been perusing.
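For a concrete, runnable version of the pattern above, here is a minimal sketch. It assumes the XChaCha20-Poly1305 AEAD functions stand in for the "*" placeholders; the original post does not say which construction was used. It uses base64_encode rather than bin2hex, which yields the kind of readable string shown in the articles. The variable names are illustrative.

```php
<?php
// Assumption: XChaCha20-Poly1305 AEAD replaces the "*" wildcard used above.
$message = 'This is a secret message';
$key     = sodium_crypto_aead_xchacha20poly1305_ietf_keygen();
$nonce   = random_bytes(SODIUM_CRYPTO_AEAD_XCHACHA20POLY1305_IETF_NPUBBYTES);

// The result is raw binary, which is why echoing it looks like gibberish.
$ciphertext = sodium_crypto_aead_xchacha20poly1305_ietf_encrypt($message, '', $nonce, $key);

// Encode nonce + ciphertext as text for storage in a text column or file.
$stored = base64_encode($nonce . $ciphertext);

// Later: decode, split off the fixed-length nonce, then decrypt.
$raw       = base64_decode($stored, true);
$nonceBin  = substr($raw, 0, SODIUM_CRYPTO_AEAD_XCHACHA20POLY1305_IETF_NPUBBYTES);
$cipherBin = substr($raw, SODIUM_CRYPTO_AEAD_XCHACHA20POLY1305_IETF_NPUBBYTES);

$decrypted = sodium_crypto_aead_xchacha20poly1305_ietf_decrypt($cipherBin, '', $nonceBin, $key);
```

The encoding step only matters when the value has to travel through something that expects text; a binary column or a file can hold the raw bytes directly.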

Rendering complex fonts/scripts in PHP?

I'm looking to render complex fonts (with diacritics, joined glyphs, right-to-left text) in various languages/scripts. The output is an image (not a web page), and ideally I need to use PHP. The graphics libraries commonly built into PHP, Imagick and GD, don't support complex fonts, I believe because the version of FreeType they come with doesn't support them.
I've looked into custom building PHP with the possible support but it looks horribly complex and messy.
Any thoughts on an easier solution for this?
Thanks

This is a problem I have faced a lot too. imagettftext is the most commonly used GD function for writing text on an image, but it can't render complex Unicode characters, so it only works as long as your language does not have them.
To render complex Unicode characters you may need an Imagick installation plus Pango installed on your server. Most hosting providers do not have Pango installed, which means you need a dedicated server for your application.
Most Linux distributions come with Pango preinstalled, so if you manage to install Imagick on your local Linux machine, the following code should work without any problem:
/* complex unicode string */
$text = "වෙබ් මත ඕනෑම තැනක";
$im = new \Imagick();
$background = new \ImagickPixel('none');
$im->setBackgroundColor($background);
$im->setPointSize(30);
$im->setGravity(\Imagick::GRAVITY_EAST);
$im->newPseudoImage(300, 200, "pango:" . $text );
$im->setImageFormat("png");
$image = imagecreatefromstring($im->getImageBlob());
//just for print out to the browser
ob_start();
imagepng($image);
$base64 = base64_encode(ob_get_clean());
$url = "data:image/png;base64,$base64";
echo "<img src='$url' />";
Let me know if you find any difficulties with the code

I cannot comment, but if it is Unicode (AFAIK it is), you could look the characters up in a character map and echo them from the PHP code. You would probably need to insert a <meta> tag so the page declares a Unicode encoding (sorry, I don't know too much about Unicode).

PHP7 UTF-8 filenames on Windows server, new phenomenon caused by ZipArchive

Update:
While preparing a bug report for the great people who make PHP 7 possible, I revised my research once more and tried to boil it down to a few simple lines of code. While doing this I found that PHP itself is not the cause of the problem. I will share my results here when I'm done. Just so you know and don't waste your time :)
Synopsis: PHP7 now seems able to write UTF-8 filenames but is unable to access them?
Preamble: I read about 10-15 articles here touching the subject but they did not help me solve the problem and they all are older than the PHP7 release. It seems to me that this is probably a new issue and I wonder if it might be a bug. I spent a lot of time experimenting with en-/decoding of the strings and trying to figure out a way to make it work - to no avail.
Good day everybody and greetings from Germany (insert shy not-my-native-language-remark here), I hope you can help me out with this new phenomenon I encountered. It seems to be "new" in the sense that it came with PHP 7.
I think most people working with PHP on a Windows system are very familiar with the problem of filenames and the transparent wrapper of PHP that manages access to files that have non-ASCII filenames (or windows-1252 or whatever is the system code page).
I'm not quite sure how to approach the subject and as you can see I'm not very experienced in composing questions so please don't rip my head off instantly. And yes I will strive to keep it short. Here we go:
First symptom: after updating to PHP7 I sometimes encountered problems with accessing files generated by my software. Sometimes it worked as usual, sometimes not. I found out the difference was that PHP7 now seems able to write UTF-8 filenames but is unable to access files with those names.
After generating said files on two separate "identical" systems (differing only in the PHP version) this is how the files are named on the hard drive:
PHP 5.5: Lokaltest_KG_æ¼¢å—_æ±‰å—_KrÃ¼mhold-DEZ1604-140081-complete.zip
PHP 7: Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
Splendid: PHP 7 is capable of writing Unicode filenames to the HDD (UTF-16 is used on Windows, AFAIK). The downside is that when I try to access those files, for example with is_file(), PHP 5.5 works but PHP 7 does not.
Consider this code snippet (note: I "hacked" into this function because it was the simplest way, it was not written for this purpose). This function gets called after a zip-file gets generated taking on the name of the customer and other values to determine a proper name. Those come out of the database. Database and internal encoding of PHP are both UTF-8. clearstatcache is per se not necessary but I included it to make things clearer. Important: Everything that happens is done with PHP7, no other entity is responsible for creating the zip-file. To be precise it is done with class ZipArchive. Actually it does not even matter that it is a zip-archive, the point is that the filename and the content of the file are created by PHP7 - successfully.
public static function downloadFileAsStream( $file )
{
    clearstatcache();
    print $file . "<br/>";
    var_dump(is_file($file));
    die();
}
Output is:
D:/htdocs/otm/.data/_tmp/Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
bool(false)
So PHP7 is able to generate the files (they indeed DO exist on the hard drive and are legit and accessible and all) but is incapable of accessing them. is_file is not the only function that fails; file_exists() does too, for example.
A little experiment with encoding conversion to give you a taste of the things I tried:
public static function downloadFileAsStream( $file )
{
    clearstatcache();
    print $file . "<br/>";
    print mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', false) . "<br/>";
    print mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', true) . "<br/>";
    if (($detectedEncoding = mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', true)) != 'windows-1252')
    {
        $file = mb_convert_encoding($file, 'UTF-16', $detectedEncoding);
    }
    print $file . "<br/>";
    var_dump(is_file($file));
    die();
}
Output is:
D:/htdocs/otm/.data/_tmp/Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
UTF-8
UTF-8
D:/htdocs/otm/.data/_tmp/Lokaltest_KG_o"[W_lI[W_Kr�mhold-DEZ1604-140081-complete.zip
NULL
So converting from UTF-8 (database/internal encoding) to UTF-16 (windows file system) does not seem to work either.
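One likely reason for the NULL result, as opposed to false, is that a UTF-16 string contains NUL bytes for every ASCII character, and PHP 7's filesystem functions reject paths containing NUL bytes with a warning instead of checking the disk. The sketch below only demonstrates the NUL bytes; the filename is a shortened, hypothetical stand-in for the one above.

```php
<?php
// "Krümhold.zip" written with explicit UTF-8 bytes so the example does not
// depend on the encoding of this source file.
$utf8 = "Kr\xC3\xBCmhold.zip";

// Same conversion as in the experiment above.
$utf16 = mb_convert_encoding($utf8, 'UTF-16', 'UTF-8');

// Every ASCII character becomes two bytes, one of which is "\0".
$hasNul = strpos($utf16, "\0") !== false;

// The original UTF-8 string has no NUL bytes at all.
$utf8HasNul = strpos($utf8, "\0") !== false;
```

Windows stores filenames as UTF-16 internally, but PHP's string-based file functions never take a raw UTF-16 path, so this conversion does not help.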
I am at the end of my rope here and sadly the issue is very important to us since we cannot update our systems with this problem looming in the background. I hope somebody can shed a little light on this. Sorry for the long post, I'm not sure how well I could get my point across.
Addition:
$file = utf8_decode($file);
var_dump(is_file($file));
die();
This returns false for the filename with the CJK characters. When I change the input used to create the filename so that it becomes Lokaltest_KG_Krümhold-DEZ1604-140081-complete.zip, the above code returns true. So utf8_decode helps, but only with a small part of Unicode, such as German umlauts?
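The umlaut-only behaviour matches what utf8_decode does: it converts UTF-8 to ISO-8859-1, which has a slot for "ü" but none for CJK characters. A minimal sketch of that difference follows, using mb_convert_encoding for the same UTF-8 to ISO-8859-1 conversion, since utf8_decode is deprecated as of PHP 8.2. The substitute character is set explicitly so the result does not depend on configuration.

```php
<?php
// Use "?" for characters the target encoding cannot represent.
mb_substitute_character(0x3F);

// Explicit UTF-8 bytes, independent of this file's own encoding.
$umlaut = "Kr\xC3\xBCmhold";           // "Krümhold"
$cjk    = "\xE6\xBC\xA2\xE5\xAD\x97";  // "漢字"

// "ü" exists in ISO-8859-1 as the single byte 0xFC.
$latin1Umlaut = mb_convert_encoding($umlaut, 'ISO-8859-1', 'UTF-8');

// The CJK characters have no ISO-8859-1 equivalent and are replaced.
$latin1Cjk = mb_convert_encoding($cjk, 'ISO-8859-1', 'UTF-8');
```

After the conversion the CJK part of the filename is lost, so no file with that name can be found.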

Answering my own question here: The actual bad boy was the component ZipArchive which created files with incorrectly encoded filenames. I have written a hopefully helpful bug report: https://bugs.php.net/bug.php?id=72200
Consider this short script:
print "php default_charset: ".ini_get('default_charset')."\n"; // just 4 info (UTF-8)
$filename = "bugtest_müller-lüdenscheid.zip"; // just an example
$filename = utf8_encode($filename); // simulating my database delivering utf8-string
$zip = new ZipArchive();
if( $zip->open($filename, ZipArchive::CREATE | ZipArchive::OVERWRITE) === true )
{
    $zip->addFile('bugtest.php', 'bugtest.php'); // copy of script file itself
    $zip->close();
}
var_dump( is_file($filename) ); // delivers ?
Output with PHP 5.5.35:
php default_charset: UTF-8
bool(true)
Output with PHP 7.0.6:
php default_charset: UTF-8
bool(false)

objective-c to PHP websafe encoding; uuencode?

I am sending strings from my objective-c app to a PHP script over HTTP. I need to websafe these strings.
I am currently encoding with Google Toolbox for Mac GTMStringEncoding rfc4648Base64WebsafeStringEncoding and decoding with base64_decode() on the PHP end. Works great 99% of the time.
Unfortunately, this encoding is not entirely websafe as it includes some web-interpreted characters ("/" and "-"). The regular GTMStringEncoding rfc4648Base64StringEncoding also includes web-interpreted characters.
Is uuencoding the data the way to go? I see that PHP already has uudecode support; will I have to roll my own on the objective-c side?
If not uuencode, then what?

OK, it seems that PHP does not support Section 5 of RFC 4648, "Base 64 Encoding with URL and Filename Safe Alphabet", by default. This function lets PHP handle the four outlier characters before calling base64_decode:
function base64url_decode($base64url) {
    // Map the URL-safe alphabet ("-", "_") back to the standard one ("+", "/").
    $base64 = strtr($base64url, '-_', '+/');
    $plainText = base64_decode($base64);
    return ($plainText);
}
My thanks to the anonymous "Tom" who posted it on PHP.net 6 years ago.
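For completeness, here is a sketch of the matching encoder together with a decoder that restores padding before decoding. The name base64url_encode and the padding handling are additions for illustration, not part of the function quoted above.

```php
<?php
// Encode to the RFC 4648 Section 5 URL-safe alphabet and drop "=" padding.
function base64url_encode(string $data): string
{
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

// Decode URL-safe Base64, restoring padding so strict decoding can be used.
function base64url_decode(string $base64url)
{
    $base64 = strtr($base64url, '-_', '+/');
    $base64 .= str_repeat('=', (4 - strlen($base64) % 4) % 4);
    return base64_decode($base64, true);
}

// These three bytes encode to "+//+" in standard Base64.
$raw     = "\xFB\xFF\xFE";
$encoded = base64url_encode($raw);
$decoded = base64url_decode($encoded);
```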

PHP: string breaks at special character

I wrote a small PHP script which "brands" an existing PDF file: on every page it puts a string like "belongs to " at a specific position. For this I use Zend_Pdf from the Zend Framework.
Because the script is used in a German-speaking region, one line contains the special character "ö" ("Gehört zu ").
On my local machine (Windows, XAMPP) the script worked fine, but after moving it to my hoster's webspace (some Linux), the string breaks at "ö". That is, only "Geh" appears in my PDF.
The code is this:
if (substr($file, strlen($file) - 4) === '.pdf') {
    $name = $user->GetName;
    $fontSize = 12;
    $xTextPos = 100;
    $yTextPos = 10;
    set_include_path(dirname(__FILE__)); // set include_path for external library Zend Framework
    require_once('Zend' .DS . 'Pdf.php');
    $pdf = Zend_Pdf::load($file);
    $font = Zend_Pdf_Font::fontWithName(Zend_Pdf_Font::FONT_HELVETICA);
    $branding = 'Gehört zu ' . $name; // German for: 'Belongs to ', problem with 'ö'
    foreach ($pdf->pages as &$page) {
        $page->setFont($font, $fontSize);
        $page->drawText($branding, $xTextPos, $yTextPos);
    }
}
I guess the problem is related to some kind of default charset or language setting of the PHP environment. So I searched here and tried out:
$branding = utf8_encode('Gehört zu ') . $name;
...and I experimented with functions like html_entity_decode, but nothing helped, so I decided to stop groping in the dark and open a question of my own.
Looking forward to any hints. Thank you in advance for your help!
EDIT: Meanwhile I found the same (?) problem, solved on a German forum. But if I do it like they say...
$branding = mb_convert_encoding('Gehört zu ', 'ISO-8859-1') . $name;
... the resulting branding in the PDF is "Gehrt zu ". The "ö" is skipped now.
For this I found another hint on the Zend issue tracker.
To sum up: I can drop all the UTF-8 handling and concentrate on Latin-1, a.k.a. ISO 8859-1.
I still don't understand why the code worked on my Windows + XAMPP setup and now breaks on my hoster's Linux.

Your guess is right, the problem is related to encoding. Where exactly the encoding is messed up is hard to say from afar. I'm assuming you work not only with Zend_Pdf, but also have the MVC in place (meaning a complete Zend_Application).
You should check if your application serves pages as UTF-8, by setting:
resources.view.encoding = "UTF-8"
and also placing the appropriate meta-tags in your layout/view.
Depending on what Editor you use, your files may be encoded in a different encoding. You can use Notepad++ on Windows to check your file-encoding and for converting it to UTF-8 (don't just set the encoding to UTF-8, this might mess up your file!) if necessary. I recommend using Eclipse with text file encoding set to "UTF-8" (Preferences > General > Workspace) to make sure your code files are encoded in UTF-8.
Now comes the crucial part:
Zend_Pdf_Page::drawText(string $text, float $x, float $y, string $charEncoding)
See that last argument... set it. If you're lucky, you can skip the previous stuff and just set the encoding there.
edit: I missed something: database connections. You should check the encoding there too. I frequently work with MS SQL Server, which uses Latin-1 internally; not setting driver_options.CharacterSet can mess things up pretty badly too. This might be relevant if you have something like Gehört zu: Günther, where the name Günther is fetched from the database.

The result also depends on the encoding of the source file.
If you save your file as UTF-8, for example, and call utf8_encode("ö"), then you are UTF-8-encoding something that is already UTF-8.
So check what your file encoding is and what your PDF library requires, then apply the right transformation.
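The double-encoding effect can be shown with explicit byte strings, which makes the example independent of how the script file itself is saved. This is a minimal sketch. It uses mb_convert_encoding with ISO-8859-1 as the source encoding, which is the same conversion utf8_encode performs; utf8_encode itself is deprecated as of PHP 8.2.

```php
<?php
// "ö" as it appears in a file saved as ISO-8859-1 (one byte).
$latin1 = "\xF6";

// "ö" as it appears in a file saved as UTF-8 (two bytes).
$utf8 = "\xC3\xB6";

// Converting the Latin-1 byte once gives correct UTF-8.
$once = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Applying the same conversion to text that is already UTF-8 treats each of
// its two bytes as a separate Latin-1 character: the result displays as "Ã¶".
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
```

The same literal 'Gehört zu ' in a script can therefore behave differently on two machines if the file was saved or transferred in different encodings.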
