Encoding conversion in PHP (ISO-8859-1, UTF-8, CP1250)

Encoding conversion in PHP (ISO-8859-1, UTF-8, CP1250) - php

I want to work with data from CSV file, but I realized letters are not showing correctly. I tried million ways to convert the encoding but nothing works. Working on MacOS, PHP 7.4.4.
After executing fgets() or fgetcsv() on handle variable, I will get this (2 rows/lines in example).
Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od
1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00
It is more or less correct czech language, but letter č is superseded by è and ř is superseded by ø, neither of them are part of czech alphabet. I am confident, there will be more of the misplaced letters in the file.
Executing file -I path/to/file I receive file: text/plain; charset=iso-8859-1 which is sad, because as far as wiki is concerned, this charset doesn't have a czech alphabet included.
Neither of following commands didn't converted misplaced letters:
mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)
I have noticed that in ISO-8859-1 the ø letter has a code 00F8. Windows-1250 (which includes czech aplhabet) has correct letter ř with code 0159 but both of them are preceded by 00F8. Same with letter č and è which are both preceded by code 00E7. I do not understand encoding very deeply, but it seems that file is encoded in Windows-1250 but the interpreter thinks the encoding is ISO-8859-1 and takes letter that is in place/code of original one.
But neither conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8 or other way around) is working.
Does anyone has any idea how to solve this? Thanks!

The problem with 8-bit character encoding is that it mostly needs human intelligence to interpret the correct codepage.
When you run file on a file, it can work out that the file is mostly made up of printable characters but as it's only looking at the bytes, it can't easily tell the difference between iso-8895-1 and iso-8895-2. To file, 0x80 is the same as 0x80.
file can only tell that the file is text and likely iso-8895-* or windows-*, because of the use of 0x80-0xFF. I.e. not just ASCII.
(Unicode encodings, like UTF-8, and UTF-16 are easier to detect by their byte sequence or Byte Order Mark set at the top of the file)
There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
The likely conversion you need is simply iso-8895-2 -> UTF-8.
What is important for you is that you know the original encoding (interpretation) and then when you validate it, that you know exactly what encoding you're viewing it.
For example, PHP will by default set the HTTP charset to iso-8895-1. That means it's quite possible for you to be converting correctly to iso-8895-2, but your browser will then "interpret" as iso-8895-1.
The best way to validate is to save the file to disk, then use a text editor like VS Code set to your required encoding beforehand before opening the file.
If you need further help, you will need to edit your question to include the exact code you're using.

Related

How do I use Extended ASCII characters in a PHP/PDF document generated by FPDF?

I am trying to create a document that contains Extended ASCII characters. For text coming from the client the following works:
// Convert from UTF-8 to ISO-8859-1 - Deal with Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
foreach ($_POST as $key => $value){
$post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}
$pdf->Cell(0, 0, $post["Name"], 0, 1);
However, I can't get text in the PHP file to work. For example:
$name = "José";
I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
Edit:
I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following: (which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts).
$Name = "José" // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é
Changing the extended characters to UTF-8 solves the problem:
$Name = "José" // Jos\xC3\xA9 - UTF-8
Does anyone know what kind of encoding I am copying from the existing PDFs?
Is there a way to convert it to UTF-8?
Can users enter this stuff into a browser?
When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three character encoded version of the é.
2nd Edit: Unicode equivalence from Wikipedia
Unicode provides two notions, canonical equivalence and
compatibility. Code point sequences that are defined as canonically
equivalent are assumed to have the same appearance and meaning when
printed or displayed. For example, the code point U+006E (the Latin
lowercase "n") followed by U+0303 (the combining tilde "◌̃") is
defined by Unicode to be canonically equivalent to the single code
point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Therefore, those sequences should be displayed in the same manner,
should be treated in the same way by applications such as
alphabetizing names or searching, and may be substituted for each
other.
Which is the long way of paraphrasing #smith's comment that I just need to get TCPDF or something that will properly handle UTF-8. It should be noted that I am getting the error in PHP's iconv, so I not entirely sure that it can be made to go away by switching to TCPDF.

Turns out that to use extended ASCII characters one needs to pick and encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from my mistake in copying text from a PDF document which was encoded in the canonically equivalent format. Once I used UTF-8 encoded characters everywhere my problems went away.

Decoding ISO-8859-1 and Encoding to UTF-8 before MySQL query

I'm kinda stuck if I'm doing it right.
I have a file which is ISO-8859-1 (pretty certain). My MySQL db is in utf-8 encoding. Which is why I want to convert the file to UTF-8 encoded characters before I can send it as a query. For instance, First I rewrite every line of the file.txt into file_new.txt using.
line = line.decode('ISO-8859-1').encode('utf-8')
And then I save it. Next, I create a MySQL connection and create a cursor with the following query so that all the data is received as utf-8.
query = 'SET NAMES "utf8"'
cursor.execute(query)
Following this, I reopen file_new.txt and enter each line into MySQL. Is this the right approach to get the table in MySQL utf-8 encoding? Or Am I missing any crucial part?
Now to receive this data. I use 'SET NAMES "utf8"" as well. But the received data is giving me question marks � when I set the header content type to
header("Content-Type: text/html; charset=utf-8");
On the other hand, when I set
header("Content-Type: text/html; charset=ISO-8859-1");
It works fine, but other utf-8 encoded data from the database is getting scrambled. So I'm guessing the data from file.txt is still NOT getting encoded to utf-8. Can any one explain why?
PS: Before I read everyline, I replace a character and save the file.txt to file.txt.tmp. I then read this file to get file_new.txt. I don't know if it causes any problem to the original file encoding.
f1 = codecs.open(tsvpath, 'rb',encoding='iso-8859-1')
f2 = codecs.open(tsvpath + '.tmp', 'wb',encoding='utf8')
for line in f1:
f2.write(line.replace('\"', '\''))
f1.close()
f2.close()
In the below example, I've utf-8 encoded persian data which is right but the other non-enlgish text is coming out to be in "question marks". This is precisely my problem.
Example : Removed.

Welcome to the wonderful world of unicode and windows. I've found this site very helpful in understanding what is going wrong with my strings http://www.i18nqa.com/debug/utf8-debug.html. The other thing you need is a hex editor like HxD. There are many places where things can go wrong. For example, if you are viewing your files in a text editor - it may be trying to be helpful and is silently changing your encoding.
Start with your original data, view it in HxD and see what the encoding is. View your results in Hxd and see if the changes you expect are being made. Repeat through the steps in your process.
Without your full code and sample data, its hard to say where the problem is. My guess is your replacing the double quote with single quote on binary files is the culprit.
Also check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky

Try this instead:
line = line.decode('ISO-8859-1').encode('utf-8-sig')
From the docs:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF
character in the decoded string (even if it’s the first character) is
treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine
which encoding was used for encoding a string. Each charmap encoding
can decode any random byte sequence. However that’s not possible with
UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow
arbitrary byte sequences. To increase the reliability with which a
UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8
(that Python 2.5 calls "utf-8-sig") for its Notepad program: Before
any of the Unicode characters is written to the file, a UTF-8 encoded
BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is
written. As it’s rather improbable that any charmap encoded file
starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK INVERTED QUESTION MARK in iso-8859-1), this increases
the probability that a utf-8-sig encoding can be correctly guessed
from the byte sequence. So here the BOM is not used to be able to
determine the byte order used for generating the byte sequence, but as
a signature that helps in guessing the encoding. On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file. In UTF-8, the use of the
BOM is discouraged and should generally be avoided.
Source: https://docs.python.org/3.5/library/codecs.html
EDIT:
Sample:
"Hello World".encode('utf-8') yields b'Hello World' while "Hello World".encode('utf-8-sig') yields b'\xef\xbb\xbfHello World' highlighting the docs:
On encoding the
utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes
to the file. On decoding utf-8-sig will skip those three bytes if they
appear as the first three bytes in the file.
Edit:
I have made a similar function before that converts a file to utf-8 encoding. Here is a snippet:
def convert_encoding(src, dst, unicode='utf-8-sig'):
return open(dst, 'w').write(open(src, 'rb').read().decode(unicode, 'ignore'))
Based on your example, try this:
convert_encoding('file.txt.tmp', 'file_new.txt')

Alright guys, so my encoding was right. The file was getting encoding to utf-8 just as needed. All the queries were right. It turns out that the other dataset that was in Arabic was in ISO-8859-1. Therefore, only 1 of them was working. No matter what I did.
The Hexeditors did help. But in the end I just used sublime text to recheck if my encoded data was utf-8. It turns out the python script and the sublime editor did the same. So the code is fine. :)

You should not need to do any explicit encode or decode. SET NAMES ... should match what the client encoding is (for INSERTing) or should become (for SELECTing).
MySQL will convert between the client encoding and the columns's CHARACTER SET.

PHP Uploaded file name: Japanese character encoding

When uploading a file with a japanese name, some characters are creating problem.
On a windows system, I want to save the name of the file as-uploaded. So I have to use
mb_convert_encoding($name, "SJIS", "AUTO");
which works fine most of the cases.
Though, some characters like ① as in 0423図表① totally disappear at the end. It seems that when uploaded the name of the file is already "wrong":
it looks like "0423å³è¡¨â .pptx" in UTF-8 and if I change the header charset with
header('Content-Type: text/html; charset=SJIS');
it looks like
"0423ﾃ･ﾂ崢ｳﾃｨﾂ｡ﾂｨﾃ｢ﾂ堕.pptx"
I am not sure what I can do in this case. I tried to replace the ① character but I cannot even find it with strpos() before or after the encoding conversion.

To qualify my answer (to the downvoter):
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?
A: There is a lot of misinformation floating around about the support
of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard
supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X
0221, or JIS X 0213, for example, and many more. This is true no
matter which encoding form of Unicode is used: UTF-8, UTF-16, or
UTF-32.
Unicode supports over 80,000 CJK characters right now, and work is
underway to encode further additions. The International Standard
ISO/IEC 10646 and the Unicode Standard are completely synchronized in
repertoire and content. And that means that Unicode has the same
repertoire as GB 18030, since that also is synchronized with ISO 10646
— although with a different ordering and byte format.
From: The Unicode Consortium.
My Answer:
Rather than strpos use mb_stripos, from the PHP Multibyte string functions to find and replace characters. This should help your script detect and translate the non-latin characters.
If the uploaded file name ($_FILES['var']['name']) is already incorrect in the PHP script (from output such as print_r($_FILES)) then you need to ensure you are correctly encoding the HTML form with accept-charset='UTF-8' (or SJIS, etc.). I would hope you're already well ahead of me on this.
Also it may be advisable to add a few preconditionals at the top of your code, again using the PHP mb_ functions add at the top of your PHP page:
mb_internal_encoding('UTF-8'); //or whatever character set works for you
mb_http_output('SJIS');
mb_http_input('UTF-8');
mb_regex_encoding('UTF-8');
Out of interest:
http://www.unicode.org/reports/tr37/
and
http://david.latapie.name/blog/shift-jis-utf-8/

Convert the Chinese Characters From ISO-8859-1 To UTF-8

I got a system which previously the html encoding type was set as ISO-8859-1 and it caused all the Chinese characters store in the format of "&\#36830;&\#34915;&\#35033;".
So my question is, how can I convert the format above into Chinese word back in UTF-8?
For your information, I had tried with utf8_decode, iconv, but none of them work. :(
Thank you very much.

The current text encoding of that string is rather insubstantial. What you have there are HTML entities; they have little to do with the underlying "physical" encoding like ISO-8859 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case to UTF-8. Therefore:
echo html_entity_decode('连衣裙', ENT_COMPAT, 'UTF-8');
// 连衣裙

You need to use:
utf8_encode($data);
and not decode,to convert your current ISO-8859-1 to UTF-8.
Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always function correctly with UTF-8 strings. Possible solutions: convert to latin first or add the following line to your code:
setlocale(LC_CTYPE, 'C');
Make sure not to save your PHP files using a BOM (Byte-Order Marker) UTF-8 file marker (your browser might show these BOM characters between PHP pages on your site).
Just for your reference:
ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish
UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian

There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.

PHP Encoding Conversion to Windows-1252 whilst keeping UTF-8 Compatibility

I need to convert uploaded filenames with an unknown encoding to Windows-1252 whilst also keeping UTF-8 compatibility.
As I pass on those files to a controller (on which I don't have any influence), the files have to be Windows-1252 encoded. This controller then again generates a list of valid file(names) that are stored via MySQL into a database - therefore I need UTF-8 compatibility. Filenames passed to the controller and filenames written to the database MUST match. So far so good.
In some rare cases, when converting to "Windows-1252" (like with te character "ï"), the character is converted to something invalid in UTF-8. MySQL then drops those invalid characters - as a result filenames on disk and filenames stored to the database don't match anymore. This conversion, which failes sometimes, is achieved with simple recoding:
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//IGNORE", $sOriginalFilename);
To prevent invalid characters being generated by the conversion, I then again can remove all invalid UTF-8 characters from the recoded string:
ini_set('mbstring.substitute_character', "none");
$sEncoding = mb_detect_encoding($sOriginalFilename);
$sTargetFilename = iconv($sEncoding, "Windows-1252//TRANSLIT", $sOriginalFilename);
$sTargetFilename = mb_convert_encoding($sTargetFilename, 'UTF-8', 'Windows-1252');
But this will completely remove / recode any special characters left in the string. For example I lose all "äöüÄÖÜ" etc., which are quite regular in german language.
If you know a cleaner and simpler way of encoding to Windows-1252 (without losing valid special characters), please let me know.
Any help is very appreciated. Thank you in advance!

I think the main problem is that mb_detect_encoding() does not do exactly what you think it does. It attempts to detect the character encoding but it does it from a fairly limited list of predefined encodings. By default, those encodings are the ones returned by mb_detect_order(). In my computer they are:
ASCII
UTF-8
So this function is completely useless unless you take care of compiling a list of candidate encodings and feeding the function with it.
Additionally, there's basically no reliable way to guess the encoding of an arbitrary input string, even if you restrict yourself to a small subset of encodings. In your case, Windows-1252 is so close to ISO-8859-1 and ISO-8859-15 that you have no way to tell them apart other than visual inspection of key characters like ¤ or €.

You can't have a string be Windows-1252 and UTF-8 at the same time. The character sets are identical for the first 128 characters (they contain e.g. the basic latin alphabet), but when it goes beyond that (like for Umlauts), it's either one or the other. They have different code points in UTF-8 than they have in Windows-1252.

Keep to ASCII in the filesystem - if you need to sustain characters outside ASCII in a filename, there are
schemes you can use to represent unicode characters while keeping to ASCII.
For example, percent encoding:
äöüÄÖÜ.txt <-> %C3%A4%C3%B6%C3%BC%C3%84%C3%96%C3%9C.txt
Of course this will hit the file name limit pretty fast and is not very optimal.
How about punycode?
äöüÄÖÜ.txt <-> xn--4caa7cb2ac.txt

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.