Convert the Chinese Characters From ISO-8859-1 To UTF-8 - php

I got a system which previously the html encoding type was set as ISO-8859-1 and it caused all the Chinese characters store in the format of "&\#36830;&\#34915;&\#35033;".
So my question is, how can I convert the format above into Chinese word back in UTF-8?
For your information, I had tried with utf8_decode, iconv, but none of them work. :(
Thank you very much.

The current text encoding of that string is rather insubstantial. What you have there are HTML entities; they have little to do with the underlying "physical" encoding like ISO-8859 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case to UTF-8. Therefore:
echo html_entity_decode('连衣裙', ENT_COMPAT, 'UTF-8');
// 连衣裙

You need to use:
utf8_encode($data);
and not decode,to convert your current ISO-8859-1 to UTF-8.
Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always function correctly with UTF-8 strings. Possible solutions: convert to latin first or add the following line to your code:
setlocale(LC_CTYPE, 'C');
Make sure not to save your PHP files using a BOM (Byte-Order Marker) UTF-8 file marker (your browser might show these BOM characters between PHP pages on your site).
Just for your reference:
ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish
UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian

There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.

Related

Encoding conversion in PHP (ISO-8859-1, UTF-8, CP1250)

I want to work with data from CSV file, but I realized letters are not showing correctly. I tried million ways to convert the encoding but nothing works. Working on MacOS, PHP 7.4.4.
After executing fgets() or fgetcsv() on handle variable, I will get this (2 rows/lines in example).
Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od
1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00
It is more or less correct czech language, but letter č is superseded by è and ř is superseded by ø, neither of them are part of czech alphabet. I am confident, there will be more of the misplaced letters in the file.
Executing file -I path/to/file I receive file: text/plain; charset=iso-8859-1 which is sad, because as far as wiki is concerned, this charset doesn't have a czech alphabet included.
Neither of following commands didn't converted misplaced letters:
mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)
I have noticed that in ISO-8859-1 the ø letter has a code 00F8. Windows-1250 (which includes czech aplhabet) has correct letter ř with code 0159 but both of them are preceded by 00F8. Same with letter č and è which are both preceded by code 00E7. I do not understand encoding very deeply, but it seems that file is encoded in Windows-1250 but the interpreter thinks the encoding is ISO-8859-1 and takes letter that is in place/code of original one.
But neither conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8 or other way around) is working.
Does anyone has any idea how to solve this? Thanks!
The problem with 8-bit character encoding is that it mostly needs human intelligence to interpret the correct codepage.
When you run file on a file, it can work out that the file is mostly made up of printable characters but as it's only looking at the bytes, it can't easily tell the difference between iso-8895-1 and iso-8895-2. To file, 0x80 is the same as 0x80.
file can only tell that the file is text and likely iso-8895-* or windows-*, because of the use of 0x80-0xFF. I.e. not just ASCII.
(Unicode encodings, like UTF-8, and UTF-16 are easier to detect by their byte sequence or Byte Order Mark set at the top of the file)
There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
The likely conversion you need is simply iso-8895-2 -> UTF-8.
What is important for you is that you know the original encoding (interpretation) and then when you validate it, that you know exactly what encoding you're viewing it.
For example, PHP will by default set the HTTP charset to iso-8895-1. That means it's quite possible for you to be converting correctly to iso-8895-2, but your browser will then "interpret" as iso-8895-1.
The best way to validate is to save the file to disk, then use a text editor like VS Code set to your required encoding beforehand before opening the file.
If you need further help, you will need to edit your question to include the exact code you're using.

How do I use Extended ASCII characters in a PHP/PDF document generated by FPDF?

I am trying to create a document that contains Extended ASCII characters. For text coming from the client the following works:
// Convert from UTF-8 to ISO-8859-1 - Deal with Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
foreach ($_POST as $key => $value){
$post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}
$pdf->Cell(0, 0, $post["Name"], 0, 1);
However, I can't get text in the PHP file to work. For example:
$name = "José";
I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
Edit:
I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following: (which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts).
$Name = "José" // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é
Changing the extended characters to UTF-8 solves the problem:
$Name = "José" // Jos\xC3\xA9 - UTF-8
Does anyone know what kind of encoding I am copying from the existing PDFs?
Is there a way to convert it to UTF-8?
Can users enter this stuff into a browser?
When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three character encoded version of the é.
2nd Edit: Unicode equivalence from Wikipedia
Unicode provides two notions, canonical equivalence and
compatibility. Code point sequences that are defined as canonically
equivalent are assumed to have the same appearance and meaning when
printed or displayed. For example, the code point U+006E (the Latin
lowercase "n") followed by U+0303 (the combining tilde "◌̃") is
defined by Unicode to be canonically equivalent to the single code
point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Therefore, those sequences should be displayed in the same manner,
should be treated in the same way by applications such as
alphabetizing names or searching, and may be substituted for each
other.
Which is the long way of paraphrasing #smith's comment that I just need to get TCPDF or something that will properly handle UTF-8. It should be noted that I am getting the error in PHP's iconv, so I not entirely sure that it can be made to go away by switching to TCPDF.
Turns out that to use extended ASCII characters one needs to pick and encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from my mistake in copying text from a PDF document which was encoded in the canonically equivalent format. Once I used UTF-8 encoded characters everywhere my problems went away.

Two byte character in a single byte character encoded (ISO-8859-1) HTML document

I learned that ISO-8859-1 is a single-byte charset.
See the page http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=###&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News. It is using Malayalam language.
The HTTP header and meta tag tell that it is using ISO-8859-1 as character-encoding.
But in this page a two byte character (0x201A) is used (http://unicodelookup.com/#%E2%80%9A).
(copy the character and look up it in http://unicodelookup.com)
<div id="articleTitleMal" style="padding-top:10px;">
<font face= "Manorama" >
¼ÈØOVA¢: ÜÍß‚Äí 1.28 ...
</font>
</div>
How is it possible to use two byte character in the single byte encoding?
Mine is not a curiosity to know that. One of my task is stucked because of not understanding the above issue.
Update: They are using the font www.manoramaonline.com/portal/mmcss/Manorama.ttf and I think some of the character in the Manaorama-font using two byte.
UPDATE2: I tried to convert the document from ISO-8859-1 to UTF-8 using the below code.
<?php
$t = file_get_contents('http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=###&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News');
// Change the charset info in meta-tag
$t = str_replace('ISO-8859-1', 'UTF-8', $t);
file_put_contents('t.html', utf8_encode($t));
That time the above selected character is missing.
Even though the page is declared as ISO-8859-1 encoded in HTTP headers, browsers interpret it as Windows-1252 encoded. This is a longstanding tradition, now being formalized e.g. in the WHATWG Encoding Standard.
Thus, when the data contains the byte 82 (hex), it is not taken as a control character (as per ISO 8859-1) but as U+201A “‚” (as per Windows-1252).
However, the page uses font trickery that maps code positions to Malayalam characters according to a special internal, nonstandard encoding. (You can see this if you disable style sheets on the page. All texts become gibberish.) The page is not really meant to contain U+201A “‚” but the byte 82 to which a Malayalam character is assigned in the font.
So you need to preserve the byte as-is to get the same results. A conversion to UTF-8 would break this.
If you wanted to convert the data to Unicode, you would need to find out the internal encoding of the font being used and perform that mapping at the character level.

simplexml_load_string strange characters

I am having difficulty with non-standard characters using simplexml_load_string.
I have loaded an newspaper xml feed using file_get_contents. If I print to screen the contents I get a title for one of the articles as :
<title>‘If Legault were running in Alberta, he’d be more popular’: How right-wing is the CAQ?</title>
If I then do this:
$feed = #simplexml_load_string($xml);
And print the results of $feed, the title has changed to:
[title] => �If Legault were running in Alberta, he�d be more popular�: How right-wing is the CAQ?
Any advice on how to stop these characters being displayed like this?
This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get  followed by some other nonsense character.
This is a charset issue. it needs to be utf8, you can run utf8_decode on the content, but its better to fix this issue by matching charsets from your input (feed) to your output (html page i presume).

utf-8 to iso-8859-1 encoding problem

I'm trying preview the latest post from an rss feed on another website. The feed is UTF-8 encoded, whilst the website is ISO-8859-1 encoded. When displaying the title, I'm using;
$post_title = 'Blogging – does it pay the bills?';
echo mb_convert_encoding($post_title, 'iso-8859-1','utf-8');
// returns: Blogging ? does it pay the bills?
// expected: Blogging - does it pay the bills?
Note that the hyphen I'm expecting isn't a normal minus sign but some big-ass uber dash. Well, a few pixels longer anyway. :) Not sure how else to describe it as my keyboard can't produce that character...
mb_convert_encoding only converts the internal encoding - it won't actually change the byte sequences for characters from one character set to another. For that you need iconv.
mb_internal_encoding( 'UTF-8' );
ini_set( 'default_charset', 'ISO-8859-1' );
$post_title = 'Blogging — does it pay the bills?'; // I used the actual m-dash here to best mimic your scenario
echo iconv( 'UTF-8', 'ISO-8859-1//TRANSLIT', $post_title );
Or, as others have said, just convert out-of-range characters to html entities.
I suspect you mean an Em Dash (—). ISO-8859-1 doesn't include this character, so you aren't going to have much luck converting it to that encoding.
You could use htmlentities(), but I'd suggest moving off ISO-8859-1 to UTF-8 for publication.
I suppose the following:
Your file is actually encoded with UTF-8
Your editor interprets the file with Windows-1252
The reason for that is that your EM DASH character (U+2014) is represented by –. That’s exactly what you get when you interpret the UTF-8 code word of that character (0xE28094) with Windows-1252 (0xE2=â, 0x80=€, 0x94=”). So you first need to fix your editor encoding.
And the reason for the ? in your output is that ISO 8859-1 doesn’t contain the EM DASH character.
It's probably an em dash (U+2014), and what you're trying to do isn't converting the encoding, because the hyphen is a different character. In other words, you want to search for such characters and replace them manually.
Better yet, just switch the website to UTF-8. It largely coincides with Latin-1 and is more appropriate for a website in 2009.

Categories