I have to create a CSV (semicolon-separated) file as an export for a Česká pošta DOS-based system. That in itself is not the problem, but the file has to be strictly in CP852 encoding (an extended ASCII code page with the Czech characters ěščřžýáíé etc., also known as Latin 2). So removing the diacritics is not an option.
I tried many approaches and searched through Stack Overflow and Google, but found no working solution. The source is saved in UTF-8. I tried iconv, mb_convert_encoding and some other libraries, but nothing works right.
The closest I got is
iconv("UTF-8", "ISO-8859-2", 'abcěščřžýáíé');
which in the browser looks correctly encoded in ISO:
abc���������
(when I switch the browser's encoding to ISO-8859-2, the characters display right):
abcěščřžýáíé
But when I write that string into a file with fputs and read it back with fgets followed by mb_detect_encoding, it reports UTF-8 instead. Pošta's software is very strict about the encoding, so I have to get it right.
When I simply use, for example,
$text = iconv("UTF-8", "ASCII", 'abcěščřžýáíé');
it removes all the characters with diacritics, but that is not a solution.
Is there anyone able to help with this? I am really stuck ...
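For reference, a minimal sketch of the write step, assuming an iconv build that knows the CP852 codepage (glibc and GNU libiconv both do) and using hypothetical column values. Note also that mb_detect_encoding only guesses from a list of candidate encodings, so its "UTF-8" verdict on the bytes read back proves very little:
$row  = ['Jan Novák', 'Kuřim', '664 34'];   // hypothetical values
$line = implode(';', $row) . "\r\n";

// Convert the raw bytes to CP852 just before writing.
$cp852 = iconv('UTF-8', 'CP852', $line);
if ($cp852 === false) {
    throw new RuntimeException('CP852 conversion failed');
}

// Append the raw bytes; nothing downstream may re-encode this file.
file_put_contents('export.csv', $cp852, FILE_APPEND);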
Related
I want to work with data from a CSV file, but I realized the letters are not displaying correctly. I have tried a million ways to convert the encoding, but nothing works. Working on macOS, PHP 7.4.4.
After executing fgets() or fgetcsv() on the handle variable, I get this (2 rows/lines in the example):
Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od
1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00
It is more or less correct Czech, but the letter č has been replaced by è and ř by ø, neither of which is part of the Czech alphabet. I am confident there will be more misplaced letters in the file.
Executing file -I path/to/file I receive file: text/plain; charset=iso-8859-1, which is sad because, as far as Wikipedia is concerned, this charset does not include the Czech alphabet.
None of the following commands converted the misplaced letters:
mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)
I have noticed that in ISO-8859-1 the letter ø has the code U+00F8. Windows-1250 (which includes the Czech alphabet) has the correct letter ř at U+0159, and both sit at the same byte, 0xF8. The same goes for č and è, which share the byte 0xE8. I do not understand encodings very deeply, but it seems that the file is encoded in Windows-1250 while the interpreter thinks the encoding is ISO-8859-1 and picks whatever letter sits at that byte position.
But neither conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8, or the other way around) works.
Does anyone have any idea how to solve this? Thanks!
The problem with 8-bit character encodings is that they mostly need human intelligence to interpret the correct codepage.
When you run file on a file, it can work out that the file is mostly made up of printable characters, but as it is only looking at the bytes, it cannot easily tell the difference between iso-8859-1 and iso-8859-2. To file, 0x80 is just 0x80.
file can only tell that the file is text and likely iso-8859-* or windows-*, because of the use of 0x80-0xFF, i.e. not just ASCII.
(Unicode encodings like UTF-8 and UTF-16 are easier to detect, by their byte sequences or by a Byte Order Mark at the top of the file.)
There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
The likely conversion you need is simply iso-8859-2 -> UTF-8.
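A minimal sketch of that conversion while reading the CSV, assuming the source really is ISO-8859-2 (swap in 'WINDOWS-1250' if the analysis in the question is right; the two codepages differ in only a handful of byte positions, and the file path is a placeholder):
$handle = fopen('path/to/file.csv', 'rb');
while (($line = fgets($handle)) !== false) {
    // Convert each raw line to UTF-8 before doing any string handling.
    $utf8 = iconv('ISO-8859-2', 'UTF-8', $line);
    $row  = str_getcsv($utf8, ';');
    // ... work with $row ...
}
fclose($handle);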
What is important is that you know the original encoding (interpretation), and that when you validate the result you know exactly which encoding you are viewing it in.
For example, PHP will by default set the HTTP charset to iso-8859-1. That means it is quite possible for you to convert correctly to iso-8859-2, but your browser will then "interpret" the output as iso-8859-1.
The best way to validate is to save the file to disk, then open it in a text editor like VS Code with the required encoding set beforehand.
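You can also do a byte-level check from PHP itself rather than trusting any viewer's guess; a small sketch (the path is a placeholder):
$bytes = file_get_contents('path/to/file.csv');
echo bin2hex(substr($bytes, 0, 32)), "\n";
// In both ISO-8859-2 and Windows-1250, 'č' is the single byte 0xE8;
// in UTF-8 it is the two-byte sequence 0xC4 0x8D.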
If you need further help, you will need to edit your question to include the exact code you're using.
I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g. Ä becomes Ž, which I then alter with preg_replace, and that works fine, like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å, which shows up like this: <U+008F>. I see that it translates to "single shift three", and I can't seem to use preg_replace on it.
You have two major built-in functions to do the job; just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings, yet you aren't using hexadecimal notation or string entities of any kind (see the sketch after this list). It's also unclear what encoding the script file itself is saved in.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
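On the hexadecimal-notation point above: PCRE can address an invisible character by its code point, which is why <U+008F> resisted copy/paste-based patterns. A sketch, assuming the string at that point is valid UTF-8 containing the control character U+008F; the one-liners above make such per-character patching unnecessary:
// Match the invisible code point by number instead of pasting it in.
$file = preg_replace('/\x{008F}/u', 'Å', $file);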
I am trying to create a document that contains extended ASCII characters. For text coming from the client, the following works:
// Convert from UTF-8 to ISO-8859-1 - Deal with Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
foreach ($_POST as $key => $value) {
    $post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}
$pdf->Cell(0, 0, $post["Name"], 0, 1);
However, I can't get text defined in the PHP file itself to work. For example:
$name = "José";
I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
Edit:
I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following (which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts):
$Name = "José"; // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é
Changing the extended characters to UTF-8 solves the problem:
$Name = "José"; // Jos\xC3\xA9 - UTF-8
Does anyone know what kind of encoding I am copying from the existing PDFs?
Is there a way to convert it to UTF-8?
Can users enter this stuff into a browser?
When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three-character encoded version of the é.
2nd Edit: Unicode equivalence from Wikipedia
Unicode provides two notions, canonical equivalence and
compatibility. Code point sequences that are defined as canonically
equivalent are assumed to have the same appearance and meaning when
printed or displayed. For example, the code point U+006E (the Latin
lowercase "n") followed by U+0303 (the combining tilde "◌̃") is
defined by Unicode to be canonically equivalent to the single code
point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Therefore, those sequences should be displayed in the same manner,
should be treated in the same way by applications such as
alphabetizing names or searching, and may be substituted for each
other.
Which is the long way of paraphrasing #smith's comment that I just need to get TCPDF or something else that properly handles UTF-8. It should be noted that I am getting the error in PHP's iconv, so I am not entirely sure it can be made to go away by switching to TCPDF.
Turns out that to use extended ASCII characters one needs to pick an encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from my mistake of copying text from a PDF document that was encoded in the canonically equivalent (decomposed) form. Once I used UTF-8 encoded characters everywhere, my problems went away.
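A minimal sketch of that conclusion in code, assuming the intl extension's Normalizer class is available: recompose decomposed (NFD) input into precomposed (NFC) form before handing it to iconv, because a lone combining acute (U+0301) has no ISO-8859-1 mapping and is exactly what makes iconv choke:
$name   = "Jose\xCC\x81";                       // "José" in decomposed (NFD) UTF-8
$nfc    = Normalizer::normalize($name, Normalizer::FORM_C);
$latin1 = iconv('UTF-8', 'ISO-8859-1', $nfc);   // é is now the single byte 0xE9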
I have a file which contains some Cyrillic characters. When I open this file in Notepad++ I see that it has ANSI encoding. If I manually convert it to UTF-8 using Notepad++, then everything is absolutely fine - I can use this file in my parsers and get results. But I want to do it programmatically, using PHP. This is what I tried after searching through SO and the documentation:
file_put_contents($file, utf8_encode(file_get_contents($file)));
In this case, when my algorithm parses the resulting files, it meets such letters as "è", "í", "â". In other words, I get rubbish. I also tried this:
file_put_contents($file, iconv('WINDOWS-1252', 'UTF-8', file_get_contents($file)));
But it produces the very same rubbish. So I really wonder how I can achieve programmatically what Notepad++ does. Thanks!
Notepad++ may report your encoding as ANSI but this does not necessarily equate to Windows-1252. 1252 is an encoding for the Latin alphabet, whereas 1251 is designed to encode Cyrillic script. So use
file_put_contents($file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($file)));
to convert from 1251 to utf-8 with iconv.
I've a small issue with UTF-8 encoding.
The word I try to encode is "kühl".
So it has a special character in it.
When I encode this string with UTF-8 in the first file I get:
kühl
When I encode this string with UTF-8 in the second file I get:
ku�hl
With PHP's utf8_encode() I always get the first one (kühl) as output, but I would need the second one (ku�hl).
mb_detect_encoding tells me both are "UTF-8", so this does not really help.
Do you have any ideas how to get the second one as output?
Thanks in advance!
There is only one encoding called UTF-8, but there are multiple ways to represent some glyphs in Unicode. U+00FC is the Latin-1-compatibility single-glyph precomposed ü, which displays as kühl in Latin-1, whereas, off the top of my head, kuÌ�hl looks like a fully decomposed expression of the same character, i.e. U+0075 (u) followed by U+0308 (the combining diaeresis). See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
ku�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a ku..hl.
0x88 is not a valid character in Latin-1, so (in my browser) it displays as an "invalid character" placeholder (a black diamond with a white question mark in it), whereas others might see something else, or nothing at all.
Apparently you could use the Normalizer class to convert between these two forms in PHP:
$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
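A quick way to see the two forms side by side (this assumes the intl extension is loaded):
echo bin2hex(Normalizer::normalize('kühl', Normalizer::FORM_C)), "\n";
// 6bc3bc686c   - ü as the single code point U+00FC
echo bin2hex(Normalizer::normalize('kühl', Normalizer::FORM_D)), "\n";
// 6b75cc88686c - u (0x75) followed by the combining diaeresis (0xCC 0x88)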
By the by, viewing UTF-8 as Latin-1 and copy/pasting the representation as if it were actual text is capricious at best. If you have character-encoding questions, the actual bytes (for example, in hex) are the only portable, understandable way to express what you have. How your computer renders them is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to articulate the problem unambiguously.
utf8_encode, despite its name, does not magically encode into UTF-8.
It will only work if your source is ISO-8859-1, also known as Latin-1.
If your source was already UTF-8 or any other encoding, it will output broken data.
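A small sketch of that failure mode (note that utf8_encode() has been deprecated since PHP 8.2):
$latin1 = "\xFC";                    // 'ü' in ISO-8859-1
echo bin2hex(utf8_encode($latin1));  // c3bc - correct UTF-8
$utf8 = "\xC3\xBC";                  // 'ü' already in UTF-8
echo bin2hex(utf8_encode($utf8));    // c383c2bc - double-encoded, displays as "Ã¼"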