Somehow there are two different € characters in my UTF-8 data: the correct one, U+20AC, and U+0080 from the Latin-1 Supplement block.
Using bin2hex I got hex c280 instead of the correct e282ac. Since the first one is not displayed correctly, I would like to convert it.
Obviously I can't use utf8_decode() or utf8_encode(). I tried iconv('Windows-1252', 'UTF-8', $x), but that gives me "Â€" because in Windows-1252 the byte 0xC2 is Â and the byte 0x80 is €.
What is the correct converter for this?
It looks like it does work if I first apply utf8_decode() and then convert the result from Windows-1252 to UTF-8 using iconv:
iconv('Windows-1252', 'UTF-8', utf8_decode($x));
I guess the string was originally Windows-1252 and was converted with utf8_encode(), which worked for most but not all characters.
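Since utf8_decode() is deprecated as of PHP 8.2, the same two-step repair can be sketched with mbstring instead. This is only an illustration: it assumes the mbstring extension is loaded and that every character in the string went through the faulty Latin-1 round trip (anything outside Latin-1 would be lost in the first step). The variable names are made up for the example.

```php
<?php
// "\xC2\x80" is U+0080 in UTF-8: what utf8_encode() produces
// for the Windows-1252 byte 0x80 (the euro sign).
$broken = "\xC2\x80";

// Step 1: undo the wrong Latin-1 -> UTF-8 step, recovering the raw byte 0x80.
$rawBytes = mb_convert_encoding($broken, 'ISO-8859-1', 'UTF-8');

// Step 2: decode that byte as Windows-1252, where 0x80 is the euro sign.
$fixed = mb_convert_encoding($rawBytes, 'UTF-8', 'Windows-1252');

echo bin2hex($broken), "\n";   // c280
echo bin2hex($rawBytes), "\n"; // 80
echo bin2hex($fixed), "\n";    // e282ac
```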
I have variables containing Chinese words, and their charset is GB2312. I want to convert them to UTF-8 because I want to save them to a MySQL table with UTF-8 encoding. How do I do that in PHP? I'm using PHP 7.
Here is what I have tried:
I tried $myvar = iconv('gb2312', 'utf-8', $myvar); however, some of my variables come back empty if they contain certain characters (invalid UTF-8 characters, maybe?).
I tried $myvar = mb_convert_encoding($myvar, 'UTF-8', 'GB2312'); it works better than iconv, but when $myvar contains the characters mentioned above, they are turned into question marks (?).
Please help me, thanks
Update
Here is an example of my chinese string:
GB2312 (Expected result): 第3章︰林鴻
Using mb_convert_encoding it becomes: 第3章?林?
Using iconv it becomes empty
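One possible cause, offered as an assumption rather than a diagnosis: 鴻 is a traditional character and ︰ (U+FE30) is a vertical-form colon, and neither belongs to GB2312 proper, although both exist in its supersets GBK and GB18030. If the data is really GBK or GB18030, naming that as the source encoding may help. The sketch below assumes mbstring is loaded and that the PHP file itself is saved as UTF-8; it builds the legacy bytes itself so that it is self-contained.

```php
<?php
// The expected text, written here as a UTF-8 literal.
$utf8 = "第3章︰林鴻";

// Simulate the incoming data: the same text as GB18030 bytes.
$gbBytes = mb_convert_encoding($utf8, 'GB18030', 'UTF-8');

// Decode with the superset encoding instead of plain GB2312.
$decoded = mb_convert_encoding($gbBytes, 'UTF-8', 'GB18030');

var_dump($decoded === $utf8);
```

If this round trip succeeds where 'GB2312' fails, the incoming data is probably not strict GB2312.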
I have a string like this:
$str = "\xC4";
According to Wikipedia, C4 is the ISO-8859-1 hex code for Ä. Now I want to lowercase this string to get ä (also in ISO-8859-1).
I tried various solutions using strtolower and mb_strtolower. None of them worked. The output was garbled every time.
You can specify the encoding in mb_strtolower(), so just specify it and it all works fine:
echo mb_strtolower($str, "ISO-8859-1");
//                        ^^^^^^^^^^
output:
ä
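strtolower("\xC4") works just fine, provided the current locale is an ISO-8859-1 one. Before PHP 8.2, strtolower() follows the locale; from PHP 8.2 on it only converts ASCII letters, so there you need mb_strtolower() with an explicit encoding. The thing is that you need to interpret the resulting byte (0xE4) using the ISO-8859-1 encoding, otherwise you'll obviously see garbage. If you're doing this in a browser, set the appropriate header to clue the browser in to the expected encoding: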
header('Content-Type: text/html; charset=iso-8859-1');
echo strtolower("\xC4");
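To check the result without depending on how a terminal or browser renders the byte, one option is to inspect it with bin2hex(). A minimal sketch, assuming the mbstring extension is loaded:

```php
<?php
$str = "\xC4"; // Ä in ISO-8859-1

// Lowercase while telling mbstring how to interpret the byte.
$lower = mb_strtolower($str, 'ISO-8859-1');
echo bin2hex($lower), "\n"; // e4, which is ä in ISO-8859-1

// The same character re-encoded as UTF-8, e.g. for a UTF-8 page.
$lowerUtf8 = mb_convert_encoding($lower, 'UTF-8', 'ISO-8859-1');
echo bin2hex($lowerUtf8), "\n"; // c3a4
```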
How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything, because ASCII is already valid UTF-8.
But if you still want to convert, just to be sure that it's UTF-8, then you can use iconv
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert ASCII to UTF-8.
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string built from code points 0x00 to 0x7F has identical representations (byte sequences) in ASCII and UTF-8. Converting such a string is pointless.
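A quick way to see this is to test whether a string is pure ASCII and to confirm that "converting" it changes nothing. This is a small sketch assuming mbstring is loaded; the sample strings are made up.

```php
<?php
$plain = "plain ASCII text";
$accented = "caf\xC3\xA9"; // "café" as UTF-8 bytes

// Only bytes 0x00-0x7F count as valid ASCII.
$plainIsAscii = mb_check_encoding($plain, 'ASCII');
$accentedIsAscii = mb_check_encoding($accented, 'ASCII');

// For pure ASCII input, the conversion returns the very same bytes.
$converted = mb_convert_encoding($plain, 'UTF-8', 'ASCII');

var_dump($plainIsAscii, $accentedIsAscii, $converted === $plain);
```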
Use utf8_encode()
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation of what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html
On a request URL, I can get the query string ?dir=Documents%20partag%C3%A9s or ?dir=Documents%20partag%E9s. I think the first one is UTF-8 and the second is ASCII.
The real string is : Documents partagés
So, I have a PHP script (in UTF-8), and what I want to do is detect whether the query string is ASCII or UTF-8 and, if it is ASCII, convert it to UTF-8.
I tried with the mb_ functions, but the query string is always detected as ASCII and the urldecoded version of the query string as UTF-8.
How can I achieve this? Note that Wikipedia does something similar: it re-encodes %E9 to %C3%A9 itself.
E9 is 233 in decimal. It is not a valid ASCII byte (0-127 only), but it is é in ISO-8859-1 (Latin1). When using mb_convert_encoding, you can specify multiple encodings (e.g.: UTF-8 and ISO-8859-1).
This should fix it:
mb_convert_encoding($str, 'UTF-8', 'UTF-8,ISO-8859-1');
With the following script:
$str1 = 'Documents%20partag%E9s';
$str2 = 'Documents%20partag%C3%A9s';
var_dump(mb_convert_encoding(urldecode($str1), 'UTF-8', 'UTF-8,ISO-8859-1'));
var_dump(mb_convert_encoding(urldecode($str2), 'UTF-8', 'UTF-8,ISO-8859-1'));
I get:
string(19) "Documents partagés"
string(19) "Documents partagés"
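If you would rather not rely on the encoding-list detection, an alternative sketch is to check explicitly whether the decoded bytes are valid UTF-8 and only fall back to ISO-8859-1 when they are not. The function name toUtf8 is hypothetical, and the ISO-8859-1 fallback is an assumption about where non-UTF-8 input comes from.

```php
<?php
// Hypothetical helper: return the decoded query value as UTF-8.
function toUtf8(string $rawValue): string
{
    $decoded = urldecode($rawValue);

    // Valid UTF-8 is kept as-is.
    if (mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }

    // Assumption: anything else is ISO-8859-1 (Latin-1).
    return mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1 = toUtf8('Documents%20partag%E9s');
$fromUtf8 = toUtf8('Documents%20partag%C3%A9s');

var_dump($fromLatin1 === $fromUtf8);
```

The caveat is the usual one: a Latin-1 string whose bytes happen to form valid UTF-8 would be left unconverted, although that is rare for real text.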
I'm trying to preview the latest post from an RSS feed on another website. The feed is UTF-8 encoded, whilst the website is ISO-8859-1 encoded. When displaying the title, I'm using:
$post_title = 'Blogging – does it pay the bills?';
echo mb_convert_encoding($post_title, 'iso-8859-1','utf-8');
// returns: Blogging ? does it pay the bills?
// expected: Blogging - does it pay the bills?
Note that the hyphen I'm expecting isn't a normal minus sign but some big-ass uber dash. Well, a few pixels longer anyway. :) Not sure how else to describe it as my keyboard can't produce that character...
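mb_convert_encoding does convert the byte sequences, but it cannot transliterate: a character that has no equivalent in the target character set is replaced by ?. To get a closest-match substitute instead, you need iconv with //TRANSLIT (the exact result depends on the iconv implementation).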
mb_internal_encoding( 'UTF-8' );
ini_set( 'default_charset', 'ISO-8859-1' );
$post_title = 'Blogging — does it pay the bills?'; // I used the actual m-dash here to best mimic your scenario
echo iconv( 'UTF-8', 'ISO-8859-1//TRANSLIT', $post_title );
Or, as others have said, just convert out-of-range characters to html entities.
I suspect you mean an Em Dash (—). ISO-8859-1 doesn't include this character, so you aren't going to have much luck converting it to that encoding.
You could use htmlentities(), but I'd suggest moving off ISO-8859-1 to UTF-8 for publication.
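As a rough sketch of the entity approach: htmlentities() is told that the input is UTF-8 and turns the dash into a named entity, which is plain ASCII and therefore safe to emit on an ISO-8859-1 page. The dash is written as an escape sequence so the example does not depend on the file's own encoding; the variable names are illustrative.

```php
<?php
// Title with an em dash (U+2014), given as UTF-8 bytes.
$postTitle = "Blogging \xE2\x80\x94 does it pay the bills?";

// Replace characters that have named HTML entities.
$entities = htmlentities($postTitle, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // Blogging &mdash; does it pay the bills?
```

Characters without a named entity are left as UTF-8 bytes by htmlentities(), so this only covers the common cases.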
I suppose the following:
Your file is actually encoded with UTF-8
Your editor interprets the file with Windows-1252
The reason for that is that your dash character is represented by –. That’s exactly what you get when you interpret the UTF-8 code word of the EN DASH character U+2013 (0xE28093) with Windows-1252 (0xE2=â, 0x80=€, 0x93=“); an EM DASH (U+2014, 0xE28094) would show up as â€” instead. So you first need to fix your editor encoding.
And the reason for the ? in your output is that ISO 8859-1 contains neither the EN DASH nor the EM DASH character.
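This kind of mojibake can be reproduced by deliberately decoding the UTF-8 bytes of a dash as Windows-1252. The sketch below assumes mbstring is loaded and does it for both the en dash and the em dash, since the two differ only in their last byte.

```php
<?php
$enDash = "\xE2\x80\x93"; // U+2013 as UTF-8
$emDash = "\xE2\x80\x94"; // U+2014 as UTF-8

// Misread each byte as a Windows-1252 character, then re-encode as UTF-8.
$enMojibake = mb_convert_encoding($enDash, 'UTF-8', 'Windows-1252');
$emMojibake = mb_convert_encoding($emDash, 'UTF-8', 'Windows-1252');

echo bin2hex($enMojibake), "\n"; // c3a2e282ace2809c  (â, €, left double quote)
echo bin2hex($emMojibake), "\n"; // c3a2e282ace2809d  (â, €, right double quote)
```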
It's probably an em dash (U+2014), and what you're trying to do isn't converting the encoding, because the hyphen is a different character. In other words, you want to search for such characters and replace them manually.
Better yet, just switch the website to UTF-8. It largely coincides with Latin-1 and is more appropriate for a website in 2009.