Php qr code generator works strange with utf-8 phrase - php

I downloaded the library http://phpqrcode.sourceforge.net/ and wrote the simplest code for it:
include('./phpqrcode/qrlib.php');
QRcode::png('иванов иван иванович 11111');
But the resulting QR code contains only half of the string.
Resulting QR code content: 'иванов иван ив'
URL: vologda-oblast.ru/coronavirus/qr/parampng.php
What could be wrong?

The "phpqrcode" library in your case counts the number of characters instead of the number of bytes of a UTF-8 string; that is why the string is truncated. If you QR-encode English-only text, the string will not be truncated. The truncation occurs only with Cyrillic characters, since each Cyrillic character takes 2 bytes in UTF-8 rather than the single byte used for a Latin one.
Interestingly, the demo example of the library on the author's page does encode Cyrillic characters correctly.
The truncation happens in your case because you are using the following options in your php.ini file:
mbstring.func_overload = 2
mbstring.internal_encoding = "UTF-8"
If you remove mbstring.func_overload (deprecated since PHP 7.2.0) from php.ini or set it to 0, the "phpqrcode" library will start working properly. Otherwise, the strlen() function used by the library will return the number of characters rather than the number of bytes in a UTF-8-encoded octet string, while str_split(), another function used by the library, will always split by bytes since it is not affected by mbstring.func_overload. As a result, your QR codes will contain truncated strings.
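You can see the mismatch the library runs into by comparing byte-oriented and character-oriented length functions on your string. A minimal sketch (the counts below assume mbstring.func_overload is not active, so strlen() counts bytes):
$s = 'иванов иван иванович 11111';
echo strlen($s);             // 44: bytes (each Cyrillic letter takes 2 bytes in UTF-8)
echo mb_strlen($s, 'UTF-8'); // 26: characters
echo count(str_split($s));   // 44: single-byte chunks, unaffected by mbstring.func_overload
With mbstring.func_overload = 2, strlen($s) would instead return 26 while str_split() would still produce 44 chunks - exactly the mismatch that truncates the QR code payload.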
Since you are using the Bitrix Site Manager CMS, removing mbstring.func_overload from php.ini may be problematic until you fully update Bitrix to version 20.5.393 (released in September 2020) or later; earlier versions relied on this deprecated feature. You can find more information about Bitrix's reliance on it at https://idea.1c-bitrix.ru/remove-dependency-on-mbstring-settingsfuncoverload/ or https://idea.1c-bitrix.ru/?tag=4799
Since you cannot change the php.ini configuration at run time, you can try to configure your web server to apply PHP options on a per-directory level. Failing that, you can patch the "phpqrcode" library so that, at least in your case, it no longer relies on the strlen() function. To do that, edit the qrencode.php file as follows. First, change the default value of $eightbit in the QREncode class from false to true. Second, in the function encodeString8bit, replace
$ret = $input->append(QR_MODE_8, strlen($string), str_split($string));
with
$arr = str_split($string);
$len = count($arr);
$ret = $input->append(QR_MODE_8, $len, $arr);
In any case, since the "phpqrcode" library does not currently support Extended Channel Interpretation (ECI) mode, you cannot reliably encode Cyrillic characters with it. It uses the 8-bit byte mode of storing text in a QR code, which by default may only contain ISO-8859-1 (Latin-1) characters unless the default character set is changed by an ECI entry. But the library cannot insert an ECI entry into a QR code to indicate that the text is UTF-8-encoded rather than ISO-8859-1. Some decoding applications will auto-detect the charset and show the string correctly, while some (strictly compliant) decoders may not.
In conclusion, since "phpqrcode" does not currently support ECI, you cannot reliably encode Cyrillic characters with it, but you can at least make it stop truncating the string as I have shown above.

Related

Will comparing the binary data of a string with an unknown character encoding validate what its encoding is?

I need to automatically determine the character encoding of strings from email content and headers. For the most part this isn't an issue; however, there is the occasional email with content and/or a header that has an oddball character such as an en dash. I received an answer that technically seems to work if I statically test it on a specific header of a specific email, but that ignores the fact that importing email needs to be a completely automated process, in which case I am unable to automatically determine the string's character encoding.
I've started with the basics, such as detecting common trouble characters that seem to guarantee a character-encoding issue will occur. However, strpos('en dash: –', '–') works fine when I test it intentionally / manually, yet it fails outright when added directly to the automated process. I'm going to guess that the issue is that the string parameters are UTF-8-encoded while the automated process is testing a string that isn't yet UTF-8, so internally the same character isn't represented by the same bytes.
My second attempt relied on the fact that mb_detect_encoding's second parameter can be an array, so I tried the following:
$encodings = array('UTF-8','UCS-4','UCS-4BE','UCS-4LE','UCS-2','UCS-2BE','UCS-2LE','UTF-32','UTF-32BE','UTF-32LE','UTF-16','UTF-16BE','UTF-16LE','UTF-7','UTF7-IMAP','ASCII','EUC-JP','SJIS','eucJP-win','SJIS-win','ISO-2022-JP','ISO-2022-JP-MS','CP932','CP51932','SJIS-mac','SJIS-Mobile#DOCOMO','SJIS-Mobile#KDDI','SJIS-Mobile#SOFTBANK','UTF-8-Mobile#DOCOMO','UTF-8-Mobile#KDDI-A','UTF-8-Mobile#KDDI-B','UTF-8-Mobile#SOFTBANK','ISO-2022-JP-MOBILE#KDDI','JIS','JIS-ms','CP50220','CP50220raw','CP50221','CP50222','ISO-8859-1','ISO-8859-2','ISO-8859-3','ISO-8859-4','ISO-8859-5','ISO-8859-6','ISO-8859-7','ISO-8859-8','ISO-8859-9','ISO-8859-10','ISO-8859-13','ISO-8859-14','ISO-8859-15','ISO-8859-16','byte2be','byte2le','byte4be','byte4le','BASE64','HTML-ENTITIES','7bit','8bit','EUC-CN','CP936','GB18030','HZ','EUC-TW','CP950','BIG-5','EUC-KR','UHC','ISO-2022-KR','Windows-1251','Windows-1252','CP866','KOI8-R','KOI8-U','ArmSCII-8');
$encoding = mb_detect_encoding($s, $encodings, true);
$compare = mb_convert_encoding($s, 'UTF-8', $encoding);
foreach ($encodings as $k1)
{
if (mb_convert_encoding($s, 'UTF-8', $k1) === $s) {$encoding = $k1; break;}
}
Unfortunately that seemed to result in the same failure, presumably from the same underlying issue.
So for my third idea I'm looking for some more experienced validation. I could convert the string down to its binary form (ones and zeroes, not binary data). Then I could convert the string to another encoding, convert that second string to binary as well, and compare the two binary versions; if they === match, then I might have determined the correct character encoding?
I can easily try this with this answer from an unrelated thread; however, I'm not certain whether it is a valid idea or not. This is all intended to answer my question:
How can I determine the actual character encoding of a string in order to convert it to UTF-8 with fully automated validation without corrupting data?
By validation I'm talking about stuff like comparing the binary data though again, I'm not certain if that is a valid approach or not. I do know that I absolutely hate en dashes though.
The answer won't change: it's impossible. You have to rely on external information about which encoding is used for the text.
Guessing an encoding can go horribly wrong:
Depending on the order in which you test, it can turn out to be e.g. ASCII or UTF-8 or Windows-1252, simply because the bytes happen to fit so far. Your list is questionable, because it may match Base64, which is not even a text encoding.
If the source is not properly encoded itself, then guessing its encoding will most likely exclude the correct one and pick a wrong one instead, which makes things worse.
Many encodings share the same byte ranges: the source can fit e.g. both Windows-1252 and Windows-1251, and even analysing the lexical sense of the text cannot guarantee which of the two is correct.
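A minimal sketch of that ambiguity (assuming the mbstring extension is available; the byte sequence below is deliberately chosen so that it is valid in both encodings):
$bytes = "\xD0\x9F\xD0\xBE"; // raw bytes from some unknown source
echo mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');        // "По"   if the source was UTF-8
echo mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251'); // "РџРѕ" if the source was Windows-1251
Both interpretations are valid byte-wise; no function can tell you which one the author meant.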
Also: ones and zeroes are binary. PHP strings are just byte arrays, so they're binary to begin with. How they're interpreted is up to you: if your code is $text = "グリーン"; then it depends on which encoding your PHP source file is saved in and how your PHP defaults are set. There is no "internal ... character", only bytes. Which is also the reason why there are functions that operate on bytes (e.g. strlen()) and functions that operate on a specific text encoding (e.g. mb_strlen()).
Whether you hate single characters or not: they can simply be used as what they are: characters in texts. And – has its own valid meaning, in contrast to — and ‒ and -; don't replace it out of personal opinion, because that could corrupt a context's meaning. It's like ignoring the fact that A and Α and A are all different characters. You might want to look up the difference between homoglyphs and synoglyphs - the latter is your current perspective.
You may ask, "And in which encoding does PHP interpret the scripts?" Luckily, ASCII is the common denominator for most encodings, so interpreting the first bytes of a file as ASCII to search for <?php (all of these are ASCII characters, so for the PHP code itself it doesn't matter whether the file is effectively UTF-8 or ISO-8859-1 or Shift-JIS) will only fail when the document is encoded in e.g. UTF-16 - in that case you must set your PHP defaults to that encoding. Which again proves: text encodings must be communicated outside of the text.

How to convert a Chinese character to UTF-16 code units?

I'm using PHP for this web development project. Right now, I'm working on a user page where the user can add words that he knows. Of course, I'm starting out crude, without adding any special features yet, like a Do you know this Character? suggestion, etc.
I have tackled the challenges of adding UTF-16 collation and setting the charset to UTF-16 in my MySQL database, hosted online at http://freemysqlhosting.net, to support Chinese characters on my website. Now what I'm struggling with is supporting automatic Pinyin generation for my Chinese characters.
I have found this after searching all over SO: https://github.com/reorx/pinyindep/blob/master/Uni2Pinyin. Each line begins with a Chinese character, in UTF-16 code units.
Take, for example, 爱. In UTF-16 it is 7231. I convert this at https://r12a.github.io/apps/conversion/. When I do a lookup in the file, I get the associated pinyin. :D This is the functionality I need, though the lookup code on GitHub is in JS rather than PHP.
In the manual lookup, ai4 is returned, which is the correct intonation. Now, what I'm looking for is either a PHP built-in library or a code snippet to convert this string input, let's say “爱”, into a UTF-16 four-character code unit, such as 7231 here.
So what's the question:
How should I convert a Chinese character, in form of a string, to UTF-16 code units? (Either through built-in library, or through a suggested PHP Code Snippet)
P.S. I don't really like third-party tools unless they are really popular worldwide, or there's no other option.
You need to use PHP's multibyte string module:
$c = "爱";
list(, $d) = unpack('N', mb_convert_encoding($c, 'UCS-4BE', 'UTF-8'));
echo dechex($d);
// => 7231
Change UTF-8 to UTF-16 if your string is coming from the database in that encoding.
mb_convert_encoding changes the string into a four-byte-per-character encoding; unpack then converts those four bytes into an unsigned long; finally, dechex converts that number into a hexadecimal string.
If you are using PHP 7.2+ you can use mb_ord to simplify the conversion.
echo dechex(mb_ord("爱"));
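If you need this for whole strings rather than a single character, a small helper along the same lines might look like this (a sketch assuming PHP 7.4+ for mb_str_split and UTF-8 input; the function name is just for illustration):
// Convert each character of a UTF-8 string to an uppercase hex code point.
// For BMP characters such as 爱 this equals the UTF-16 code unit used in Uni2Pinyin.
function toHexCodePoints(string $s): array
{
    $out = [];
    foreach (mb_str_split($s, 1, 'UTF-8') as $char) {
        $out[] = strtoupper(dechex(mb_ord($char, 'UTF-8')));
    }
    return $out;
}
print_r(toHexCodePoints('爱你')); // ["7231", "4F60"]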

Detect encoding in PHP without multibyte extension?

Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?
If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?
Strings in PHP are just byte sequences, they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding, it tries to make an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first one in which the sequence matches. These functions are very basic and don't even exist for many popular encodings.
There is no way, with or without the mbstring extension, to ascertain the encoding of a string - only to maybe rule some out, which you could only do if the string happens to contain byte sequences that would be invalid in those particular encodings.
You will never know whether "\xC2\xA4" is supposed to be the UTF-8 ¤ or ISO-8859-1 ¤ just by looking at it - because they're the exact same bytes.
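If all you need is the "at least detect UTF-8" part, you can check whether a string is valid UTF-8 without mbstring by using PCRE. This only rules UTF-8 in or out as a possibility; it cannot prove the text was meant as UTF-8. A minimal sketch:
// preg_match() with the /u modifier returns false for byte sequences that are not valid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}
var_dump(is_valid_utf8("\xC3\xBA")); // true  (valid UTF-8 for "ú")
var_dump(is_valid_utf8("\xFA"));     // false (a bare 0xFA byte is not valid UTF-8)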
For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
There's always iconv, which is generally enabled in PHP by default:
<pre>
<?php
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
var_dump(iconv_get_encoding('all'));
?>
</pre>

PHP: parsing ascii string safely when running in multibyte mode

In my PHP config file I have
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
To ensure UTF-8 support. I have read that if you set these options you should also use the multibyte string manipulation functions throughout. I am currently altering a library which parses an Excel file, and I need to split an attribute value of the form N12 to determine the spreadsheet size. I know for a fact that the value cannot contain characters outside the ASCII range. Do I need to use the multibyte string manipulation functions to parse the 12 out of N12, or can I use the normal ones? I am asking because I would like to keep the solution general and maybe submit the solution back to the library. If I need to use a different function depending on whether the current mode is UTF-8 or not, what is the best way to check for this?
UTF-8 is a pure superset of ASCII. If your functions can handle UTF-8, they by definition can also handle ASCII. The core PHP string functions mostly expect single-byte encodings, but that doesn't mean they won't work with other encodings; for example: Multibyte trim in PHP?.
So it depends on what exactly you're trying to do. Possibly the core PHP string functions will already work fine regardless of encoding. If they do not, and your operation would break when used on multi-byte strings, then you can use the appropriate mb_* function instead, which by definition will also handle ASCII just fine when treating the input as UTF-8.
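For the concrete N12 case, a byte-oriented approach is already safe. A minimal sketch (the helper name is just for illustration); since the value is plain ASCII and ASCII is a subset of UTF-8, the same code behaves identically whether or not the UTF-8 settings above are in effect:
// Split a cell reference such as "N12" into column letters and row number.
function splitCellReference(string $ref): ?array
{
    if (preg_match('/^([A-Z]+)([0-9]+)$/', $ref, $m)) {
        return ['column' => $m[1], 'row' => (int) $m[2]];
    }
    return null; // not a recognised reference
}
var_dump(splitCellReference('N12')); // ['column' => 'N', 'row' => 12]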

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to fail when it encounters a character it can't handle, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and so on).
The next step is to use preg_replace() to turn spaces into underscores and to substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
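Putting those steps together, a sketch of a slug helper (the function name is just for illustration; exact //TRANSLIT results can vary between iconv implementations and locale settings):
// Transliterate to ASCII, lowercase, and collapse anything unsafe into underscores.
function slugify(string $str): string
{
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str); // "medúlla" -> "medulla"
    $ascii = strtolower($ascii);
    return trim(preg_replace('/[^a-z0-9]+/', '_', $ascii), '_');
}
echo slugify('medúlla'); // medulla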
As for the database settings, you really should have done all of that before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return, byte for byte, exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8-encode and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
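For example (assuming the source file and the string are both UTF-8):
echo rawurlencode('Medúlla'), "\n"; // Med%C3%BAlla
echo 'http://en.wikipedia.org/wiki/' . rawurlencode('Medúlla'); // http://en.wikipedia.org/wiki/Med%C3%BAlla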
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as the data you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, so that the \x escapes are interpreted) to avoid the problem.
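For example, keying the replacement on the UTF-8 byte sequence works regardless of the source file's own encoding (here $value stands for the UTF-8 string fetched from the database):
// "\xc3\xba" is the UTF-8 byte sequence for ú; double quotes are required so PHP interprets the \x escapes.
$fixed = str_replace("\xc3\xba", 'u', $value);
echo $fixed; // medulla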
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.
