utf-8 to iso-8859-1 encoding problem - php

I'm trying preview the latest post from an rss feed on another website. The feed is UTF-8 encoded, whilst the website is ISO-8859-1 encoded. When displaying the title, I'm using;
$post_title = 'Blogging – does it pay the bills?';
echo mb_convert_encoding($post_title, 'iso-8859-1','utf-8');
// returns: Blogging ? does it pay the bills?
// expected: Blogging - does it pay the bills?
Note that the hyphen I'm expecting isn't a normal minus sign but some big-ass uber dash. Well, a few pixels longer anyway. :) Not sure how else to describe it as my keyboard can't produce that character...

mb_convert_encoding only converts the internal encoding - it won't actually change the byte sequences for characters from one character set to another. For that you need iconv.
mb_internal_encoding( 'UTF-8' );
ini_set( 'default_charset', 'ISO-8859-1' );
$post_title = 'Blogging — does it pay the bills?'; // I used the actual m-dash here to best mimic your scenario
echo iconv( 'UTF-8', 'ISO-8859-1//TRANSLIT', $post_title );
Or, as others have said, just convert out-of-range characters to html entities.

I suspect you mean an Em Dash (—). ISO-8859-1 doesn't include this character, so you aren't going to have much luck converting it to that encoding.
You could use htmlentities(), but I'd suggest moving off ISO-8859-1 to UTF-8 for publication.

I suppose the following:
Your file is actually encoded with UTF-8
Your editor interprets the file with Windows-1252
The reason for that is that your EM DASH character (U+2014) is represented by –. That’s exactly what you get when you interpret the UTF-8 code word of that character (0xE28094) with Windows-1252 (0xE2=â, 0x80=€, 0x94=”). So you first need to fix your editor encoding.
And the reason for the ? in your output is that ISO 8859-1 doesn’t contain the EM DASH character.

It's probably an em dash (U+2014), and what you're trying to do isn't converting the encoding, because the hyphen is a different character. In other words, you want to search for such characters and replace them manually.
Better yet, just switch the website to UTF-8. It largely coincides with Latin-1 and is more appropriate for a website in 2009.

Related

Which middot character is this?

$string = 'Single · Female'
I copied it from facebook.
In html source its just that dot, how did they type it?
While echoing in php its A with circumflex (Â) concatenated with that same dot.
How can i explode this string with that dot?
It is U+00B7 MIDDLE DOT, a character used for many purposes, e.g. as a separator between links, alternatives, or other items.
If your code displays it as ·, then the reason is that the UTF-8 encoded form of U+00B7, namely 0xC2 0xB7, is being misinterpreted as being ISO-8859-1 or Windows-1252 encoded. You should fix this basic problem (instead of trying to deal with some of its symptoms). See UTF-8 all the way through.
Regarding the question “how did they type it?”, we cannot really know, and we need not know. There are zillions of ways to type characters, and anyone can invent a few more. (On my keyboard, I use AltGr Shift X. If I needed to type “·” on a Windows computer with vanilla settings, I would use Alt 0183.)
I believe this is an interpunct. It can be used through the HTML entities · or · and in PHP with the unicode value U+00B7.
If you want to echo the unicode character without HTML entities, you can set the character encoding to UTF-8. Splitting is done through explode("·", $textToSplit) given that your PHP file is using UTF-8 as character encoding.

Convert the Chinese Characters From ISO-8859-1 To UTF-8

I got a system which previously the html encoding type was set as ISO-8859-1 and it caused all the Chinese characters store in the format of "&\#36830;&\#34915;&\#35033;".
So my question is, how can I convert the format above into Chinese word back in UTF-8?
For your information, I had tried with utf8_decode, iconv, but none of them work. :(
Thank you very much.
The current text encoding of that string is rather insubstantial. What you have there are HTML entities; they have little to do with the underlying "physical" encoding like ISO-8859 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case to UTF-8. Therefore:
echo html_entity_decode('连衣裙', ENT_COMPAT, 'UTF-8');
// 连衣裙
You need to use:
utf8_encode($data);
and not decode,to convert your current ISO-8859-1 to UTF-8.
Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always function correctly with UTF-8 strings. Possible solutions: convert to latin first or add the following line to your code:
setlocale(LC_CTYPE, 'C');
Make sure not to save your PHP files using a BOM (Byte-Order Marker) UTF-8 file marker (your browser might show these BOM characters between PHP pages on your site).
Just for your reference:
ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish
UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian
There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.

simplexml_load_string strange characters

I am having difficulty with non-standard characters using simplexml_load_string.
I have loaded an newspaper xml feed using file_get_contents. If I print to screen the contents I get a title for one of the articles as :
<title>‘If Legault were running in Alberta, he’d be more popular’: How right-wing is the CAQ?</title>
If I then do this:
$feed = #simplexml_load_string($xml);
And print the results of $feed, the title has changed to:
[title] => �If Legault were running in Alberta, he�d be more popular�: How right-wing is the CAQ?
Any advice on how to stop these characters being displayed like this?
This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get  followed by some other nonsense character.
This is a charset issue. it needs to be utf8, you can run utf8_decode on the content, but its better to fix this issue by matching charsets from your input (feed) to your output (html page i presume).

How to list files with special (norwegian) characters

I'm doing a simple (I thought) directory listing of files, like so:
$files = scandir(DOCROOT.'files');
foreach($files as $file)
{
echo ' <li>'.$file.PHP_EOL;
}
Problem is the files contains norwegian characters (æ,ø,å) and they for some reason come out as question marks. Why is this?
I can apparently fix(?) it by doing this before I echo it out:
$file = mb_convert_encoding($file, 'UTF-8', 'pass');
But it makes little sense to me why this helps, since pass should mean no character encoding conversion is performed, according to the docs... *confused*
Here is an example: http://random.geekality.net/files/index.php
It appears the encoding of the file names is in ISO Latin 1, but the page is interpreted by default using UTF-8. The characters do not come out as "question marks", but as Unicode replacement characters (�). That means the browser, which tries to interpret the byte stream as UTF-8, has encountered a byte invalid in UTF-8 and inserts the character at that point instead. Switch your browser to ISO Latin 1 and see the difference (View > Encoding > ...).
So what you need to do is to convert the strings from ISO Latin 1 to UTF-8, if you designate your page to be UTF-8 encoded. Use mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1') to do so.
Why it works if you specify the $from encoding as pass I can only guess. What you're telling mb_convert_encoding with that is to convert from pass to UTF-8. I guess that makes mb_convert_encoding take the mb_internal_encoding value as the $from encoding, which happens to be ISO Latin 1. I suppose it's equivalent to 'auto' when used as the $from parameter.

PHP character encoding � sign instead of à

Hi this is a very strange error I have on some pages of this joomla website:
http://www.pcsnet.it/news
If you go into the details of a specific news the à character is correctly displayed.
Other accented characters seem not affected.
I've checked that UTF-8 encoding is default both in the MySql db and that the text files are in UTF-8 encoding.
Other ideas?
What is very interesting in your case is that it only affects the letter à! So it cannot be an encoding problem.
Here is my take on your problem. The letter à is encoded in two bytes in utf8. The first byte is xC3, which is à in latin-1, the second byte is... non breaking space! (The other accented letters, such as è are encoded by à followed by an other accented letter in latin-1, and they are not affected).
Therefore, my guess is that you have a script, somewhere, that removes, or replaces, the non-breaking space in latin-1, i.e., character xA0. The resulting lonely byte xC3 cannot be displayed properly, so the general placeholder � is displayed instead. (just load your page in latin-1, you will see that I am right).
Find that script that removes non-breaking spaces, and you'll be fine.
Are you by any chance using htmlentities, htmlspecialchars or html_entity_decode on the text you retrieved from the database ? If so, you need to force UTF8 in the third parameter because it's not the default charset of these functions.
Example: htmlentities('£hello', null, 'utf-8');
A � sign is usually an indication that the character the browser is trying to display is not available in the font used. It's probably not an à, if that works on other pages (using the same font).

Categories