Charset Problem with youtube Gdata - php

I have a few problems with multi-language support.
My website uses the ISO-8859-1 charset:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
When the fetched title or content is in Chinese, it displays as garbled text:
$doc = new DOMDocument;
if (@$doc->load($url) === false) return;
$title = $doc->getElementsByTagName("title")->item(0)->nodeValue;
$content = $doc->getElementsByTagName("content")->item(0)->nodeValue;
If I change my header to UTF-8 it works, but because of other scripts I can't do that. Any idea how to fix this?
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In your case, utf8_decode() will do:
$title = utf8_decode($title);
$content = utf8_decode($content);
For more complex conversions from one character set to another, one would usually use iconv() or mb_convert_encoding().
e.g.
$title = iconv("UTF-8", "iso-8859-1", $title);
$content = iconv("UTF-8", "iso-8859-1", $content);
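Note that ISO-8859-1 has no Chinese characters, so utf8_decode() replaces each of them with "?". If the page has to stay ISO-8859-1, one possible workaround is to emit everything outside ASCII as numeric character references, which the browser renders regardless of the page charset. A minimal sketch using mb_encode_numericentity(); the helper name to_ascii_entities is made up for illustration:

```php
<?php
// Hypothetical helper: turns every non-ASCII character of a UTF-8 string
// into a decimal numeric character reference such as &#35825;.
function to_ascii_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\u{8BF1}\u{60D1}" are the two Chinese characters from the JSON thread below.
$title = to_ascii_entities("\u{8BF1}\u{60D1} test");
echo $title, "\n"; // &#35825;&#24785; test
```

The result is pure ASCII, so it can be echoed into the ISO-8859-1 page without any further conversion.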

Chinese characters won't display correctly if your web page charset is ISO-8859-1.
Pick UTF-8, GB2312, or Big5, then convert the text using mb_convert_encoding:
mb_detect_order(array('utf-8', 'big5', 'gb2312'));
$in_encoding = mb_detect_encoding($str);
if (!$in_encoding || $in_encoding=='EUC-CN' || $in_encoding=='BIG-5')
{
    $str = mb_convert_encoding($str, 'UTF-8');
}

Related

How to convert euro (€) symbol from Windows-1252 to UTF-8?

A piece of software generates a Windows-1252 XML file for me, and I would like to parse it in PHP and store the data in my database as UTF-8.
I tried many solutions, such as the iconv and utf8_encode functions, without success.
It displays things like Â€, but not just €...
My XML file looks like this:
<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<node>The price is 12 &#128; !</node>
&#128; seems to be the code of € (euro) in Windows-1252.
I tried these functions:
<!doctype html>
<html lang='fr'>
<head>
<meta charset='UTF-8'>
</head>
<body>
<?php
// XML Loading in DOM Document
// Parsing XML Node
/* Not working */
$node = iconv('Windows-1252', 'UTF-8', $nodeValue);
/* Not working */
$node = utf8_encode($nodeValue);
?>
</body>
</html>
As shown in this Stack Overflow question, the euro symbol is converted to the Latin-1 Supplement character (U+0080) and not to the "proper" euro code point (U+20AC). A workaround is to utf8_decode the value and then re-encode it:
$node = iconv('Windows-1252', 'UTF-8', utf8_decode($node));
So some sample code that works:
<?php
$xml = '<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<node>The price is 12 &#128; !</node>';
$doc = new DomDocument();
$doc->loadXML($xml);
$nodes = $doc->getElementsByTagName('node');
$node = iconv('Windows-1252', 'UTF-8', utf8_decode($nodes[0]->nodeValue));
echo $node;
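Since utf8_decode() is deprecated as of PHP 8.2, the same two-step repair can be sketched with mb_convert_encoding() instead. This assumes the only damage is that Windows-1252 byte values arrived as code points of the same number (for example U+0080 where a euro sign was meant); the function name is hypothetical:

```php
<?php
// Hypothetical helper: repairs UTF-8 text in which Windows-1252 byte values
// were stored as code points of the same number (U+0080 instead of the euro).
function cp1252_codepoints_to_utf8(string $utf8)
{
    // Step 1: UTF-8 -> single bytes, so U+0080 becomes the byte 0x80.
    $bytes = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    // Step 2: reinterpret those bytes as Windows-1252, where 0x80 is the euro.
    $fixed = iconv('Windows-1252', 'UTF-8', $bytes);
    // Fall back to the input if iconv cannot convert some byte.
    return $fixed === false ? $utf8 : $fixed;
}

// "\xC2\x80" is U+0080 encoded as UTF-8, which is what DOM returns for &#128;.
$fixed = cp1252_codepoints_to_utf8("The price is 12 \xC2\x80 !");
echo $fixed, "\n";
```

Characters outside Latin-1 would be lost in step 1, so this only suits text known to contain nothing beyond the Windows-1252 range.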

Spanish Characters not Displaying Correctly

I am getting the lovely � box where spanish characters should be displayed. (ie: ñ, á, etc). I have already made sure that my meta http-equiv is set to utf-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I have also made sure that the page header is set for utf-8 also:
header('Content-type: text/html; charset=UTF-8');
Here are the beginning stages of my code thus far:
<?php
setlocale(LC_ALL, 'es_MX');
$datetime = strtotime($event['datetime']);
$date = date("M j, Y", $datetime);
$day = strftime("%A", $datetime);
$time = date("g:i", $datetime);
?>
<?= $day ?> <?= $time ?>
The above code is inside a while loop. I have read that switching the collation in the database can also be a factor, but I already have it set to UTF-8 General CI. Besides, the only thing in that column is a DateTime, which is numeric and cannot be collated anyway.
result: s�bado 8:00
Any help is greatly appreciated as always.
Things to consider in PHP/MySQL/UTF-8
The database tables and text columns should be set to UTF-8
HTML page Content-Type should be set to UTF-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
PHP should send a header informing the browser to expect UTF-8
header('Content-Type: text/html; charset=utf-8' );
The PHP-MySQL connection should be set to UTF-8
mysqli_query($conn, "SET CHARACTER_SET_CLIENT='utf8'");
mysqli_query($conn, "SET CHARACTER_SET_RESULTS='utf8'");
mysqli_query($conn, "SET CHARACTER_SET_CONNECTION='utf8'");
php.ini has a default_charset setting; it should be utf-8.
If you do not have access to it, use ini_set('default_charset', 'utf-8');
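None of the settings in this checklist change what strftime() returns: it produces bytes in the encoding of the active locale, and a plain es_MX locale is commonly ISO-8859-1, which would explain s�bado on a UTF-8 page. A sketch of converting such a value before printing, assuming the locale really is ISO-8859-1 (requesting 'es_MX.UTF-8' in setlocale() may also work if that locale is installed on the server):

```php
<?php
// Assumption: $day holds ISO-8859-1 bytes, as strftime("%A") would return
// under a non-UTF-8 Spanish locale. "\xE1" is "á" in ISO-8859-1.
$day = "s\xE1bado";

// Convert to UTF-8 so the bytes match the charset declared for the page.
$dayUtf8 = mb_convert_encoding($day, 'UTF-8', 'ISO-8859-1');

echo $dayUtf8, "\n";
```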
I have suffered this problem for many years and I can't find any logic and I have tried all the solutions above.
One solution is to use HTML entity codes for all text.
Here is a function I have used when all else has failed.
function span_accent($wordz)
{
    $wordz = str_replace( "Á","&Aacute;",$wordz);
    $wordz = str_replace( "É","&Eacute;",$wordz);
    $wordz = str_replace( "Í","&Iacute;",$wordz);
    $wordz = str_replace( "Ó","&Oacute;",$wordz);
    $wordz = str_replace( "Ú","&Uacute;",$wordz);
    $wordz = str_replace( "Ñ","&Ntilde;",$wordz);
    $wordz = str_replace( "Ü","&Uuml;",$wordz);
    $wordz = str_replace( "á","&aacute;",$wordz);
    $wordz = str_replace( "é","&eacute;",$wordz);
    $wordz = str_replace( "í","&iacute;",$wordz);
    $wordz = str_replace( "ó","&oacute;",$wordz);
    $wordz = str_replace( "ú","&uacute;",$wordz);
    $wordz = str_replace( "ñ","&ntilde;",$wordz);
    $wordz = str_replace( "ü","&uuml;",$wordz);
    $wordz = str_replace( "¿","&iquest;",$wordz);
    $wordz = str_replace( "¡","&iexcl;",$wordz);
    $wordz = str_replace( "€","&euro;",$wordz);
    $wordz = str_replace( "«","&laquo;",$wordz);
    $wordz = str_replace( "»","&raquo;",$wordz);
    $wordz = str_replace( "‹","&lsaquo;",$wordz);
    $wordz = str_replace( "›","&rsaquo;",$wordz);
    return $wordz;
}
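PHP's built-in htmlentities() covers the same ground as a hand-written table, provided the input encoding is passed correctly. A short sketch, assuming the input string is UTF-8:

```php
<?php
// Assumption: $text is UTF-8. htmlentities() replaces every character that
// has a named HTML entity, so Spanish text comes out as plain ASCII.
$text = "s\u{E1}bado, ma\u{F1}ana \u{BF}s\u{ED}?";

// The third argument tells PHP how to interpret the input bytes.
$encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n"; // s&aacute;bado, ma&ntilde;ana &iquest;s&iacute;?
```

Like the function above, this sidesteps the page charset entirely, at the cost of output that is harder to read in the HTML source.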
Kindly check your file encoding. It must be UTF-8 or UTF-8 without BOM.
To change your file encoding, use Notepad++ (or any other editor that lets you change the file encoding). In the menu bar, choose Encoding, then choose UTF-8 or UTF-8 without BOM.
See the link for the difference between UTF-8 and UTF-8 without BOM.
What's different between UTF-8 and UTF-8 without BOM?
Hope it can help. :)
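A BOM at the start of a PHP file is sent to the browser as output before any header() call can run, which is one reason the "without BOM" variant is usually the safer choice. To check text read from a file for a leading BOM, something like the following could be used (the function name is illustrative, not a built-in):

```php
<?php
// Hypothetical helper: removes a leading UTF-8 byte order mark, if present.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // the three bytes of the UTF-8 BOM
    if (strncmp($text, $bom, 3) === 0) {
        return (string) substr($text, 3);
    }
    return $text;
}

$clean = strip_utf8_bom("\xEF\xBB\xBFhola");
echo $clean, "\n"; // hola
```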
Having a similar problem, I found the answer here.
Not Displaying Spanish Characters
The resolution was to change from UTF-8 to windows-1252.
(HTML) <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
(PHP) ini_set('default_charset', 'windows-1252');
My problem was reading Spanish characters from a CSV file. When I opened the file in Excel, the characters appeared fine. In my editor, the odd character was shown regardless of the intended character. This change seems to work for my requirements.
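An alternative to switching the whole page to windows-1252 is to convert the CSV data to UTF-8 as it is read. A sketch, assuming the file really is Windows-1252 (the usual case for CSV files saved by Excel on Windows); csv_line_to_utf8 is a hypothetical helper name:

```php
<?php
// Hypothetical helper: converts one CSV line from a single-byte encoding
// to UTF-8 and then splits it into fields.
function csv_line_to_utf8(string $line, string $sourceEncoding = 'Windows-1252'): array
{
    $utf8 = mb_convert_encoding($line, 'UTF-8', $sourceEncoding);
    // Delimiter, enclosure and escape are passed explicitly.
    return str_getcsv($utf8, ',', '"', '\\');
}

// "\xF1" is ñ, "\xE1" is á and "\x80" is the euro sign in Windows-1252.
$row = csv_line_to_utf8("Espa\xF1a,\"M\xE1laga\",12 \x80");
print_r($row);
```

With the data converted at the boundary, the page itself can keep declaring UTF-8.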
It's important to check that your source file is also encoded as UTF-8 (many text and code editors show this property).
Because there is only one symbol (the black square), it's probable that you are using ISO-8859-1 or ISO-8859-15.
Can you see that the content is correct in the database table? Look at it with phpMyAdmin, for example. If it is, make sure your PHP files are UTF-8 encoded; check your IDE/editor configuration.
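Use UTF-8 (or Windows-1252) for the page. Note that utf8mb4 is the name of a MySQL character set, not a charset label that PHP or browsers recognize, so use utf8mb4 for the database connection and UTF-8 everywhere else:
ini_set('default_charset', 'UTF-8');
or
header('Content-Type: text/html; charset=UTF-8');
then use the tag:
<meta charset="UTF-8">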

Php/json: decode utf8?

I store a JSON string that contains some (Chinese?) characters in a MySQL database.
Example of what's in the database:
normal.text.\u8bf1\u60d1.rest.of.text
On my PHP page I just do a json_decode of what I receive from mysql, but it doesn't display right, it shows things like "½±è§�"
I've tried to execute the "SET NAMES 'utf8'" query at the beginning of my file, didn't change anything.
I already have the following header on my webpage:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
And of course all my php files are encoded in UTF-8.
Do you have any idea how to display these "\uXXXX" characters nicely?
This seems to work fine for me, with PHP 5.3.5 on Ubuntu 11.04:
<?php
header('Content-Type: text/plain; charset="UTF-8"');
$json = '[ "normal.text.\u8bf1\u60d1.rest.of.text" ]';
$decoded = json_decode($json, true);
var_dump($decoded);
Outputs this:
array(1) {
[0]=>
string(31) "normal.text.诱惑.rest.of.text"
}
Unicode is not UTF-8!
$ echo -en '\x8b\xf1\x60\xd1\x00\n' | iconv -f unicodebig -t utf-8
诱惑
This is a strange "encoding" you have. I guess each character of the normal text is one byte long (US-ASCII)? Then you have to extract the \u.... sequences, convert each sequence into a two-byte character, and convert that character to UTF-8 with iconv("unicodebig", "utf-8", $character) (see iconv in the PHP documentation). This worked on my side:
$in = "normal.text.\u8bf1\u60d1.rest.of.text";
function ewchar_to_utf8($matches) {
    // $matches[1] holds the four hex digits of one \uXXXX escape
    $ewchar = $matches[1];
    $binwchar = hexdec($ewchar);
    // Build the two-byte big-endian representation of the code unit
    $wchar = chr(($binwchar >> 8) & 0xFF) . chr(($binwchar) & 0xFF);
    return iconv("unicodebig", "utf-8", $wchar);
}
function special_unicode_to_utf8($str) {
    return preg_replace_callback("/\\\u([[:xdigit:]]{4})/i", "ewchar_to_utf8", $str);
}
echo special_unicode_to_utf8($in);
Otherwise we need more information on how your string in the database is encoded.
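A variant of the same idea that also copes with characters outside the Basic Multilingual Plane, which JSON writes as two consecutive \uXXXX escapes (a surrogate pair). This is only a sketch: it assumes the escapes are well formed, uses mbstring rather than iconv, and the function name is made up:

```php
<?php
// Hypothetical helper: decodes runs of \uXXXX escapes into UTF-8.
// Whole runs are converted at once so surrogate pairs stay together.
function decode_unicode_escapes(string $text): string
{
    $result = preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            $utf16 = '';
            // Each escape is exactly six characters: backslash, "u", four hex digits.
            foreach (str_split($m[0], 6) as $escape) {
                $utf16 .= pack('n', hexdec(substr($escape, 2))); // big-endian 16-bit unit
            }
            return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
        },
        $text
    );
    return $result ?? $text; // preg_replace_callback returns null on error
}

$decoded = decode_unicode_escapes('normal.text.\u8bf1\u60d1.rest.of.text');
echo $decoded, "\n";
```

If the stored value is a complete JSON document, json_decode() as shown in the first answer remains the simpler route.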
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
That's a red herring. If you serve your page over HTTP and the response contains a Content-Type header, the meta tag will be ignored. By default, PHP sends such a header if you don't set one explicitly, and its charset may be ISO-8859-1 rather than UTF-8, depending on the default_charset setting.
Try with this line:
<?php
header("Content-Type: text/html; charset=UTF-8");

Character Encoding UTF8 Issue when using mb_detect_encoding() with PHP

I am reading an RSS feed: http://beersandbeans.com/feed/
The feed says it is in UTF-8 format, and I am using SimplePie to import the content. When I grab the content and store it in $content, I perform the following:
<?php
header ('Content-type: text/html; charset=utf-8');
?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head><body>
<?php
echo $content;
echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true);
echo $content = mb_convert_encoding($content, "UTF-8", $enc);
echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true);
?>
</body></html>
This then produces:
..... Camping: 2,000isk/day for 5 days) = $89 .....
ISO-8859-1
..... Camping: Â Â 2,000isk/day for 5 days) = $89 .....
UTF-8
Why is it outputting the Â?
Try not specifying "UTF-8,ISO-8859-1" and see what encoding it gives you. It might be detecting ISO-8859-1 because it's the last one in that list, rather than the actual encoding of the string.
Set strict-mode to true in mb_detect_encoding(), see http://www.php.net/manual/de/function.mb-detect-encoding.php#102510
Also try http://www.php.net/manual/de/function.mb-convert-encoding.php instead of iconv()
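The Â is what double encoding looks like: a non-breaking space is the bytes C2 A0 in UTF-8, and converting those bytes again as if they were ISO-8859-1 turns C2 into Â. A sketch of guarding the conversion with mb_check_encoding(), assuming the only two candidate encodings are UTF-8 and ISO-8859-1 (ensure_utf8 is a made-up name):

```php
<?php
// Hypothetical helper: converts to UTF-8 only when the input is not
// already valid UTF-8, which avoids encoding the same text twice.
function ensure_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$nbsp = "\xC2\xA0"; // non-breaking space, already UTF-8
$original = "Camping:{$nbsp}2,000isk";

// Unconditional conversion reproduces the stray "Â" from the question.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The guarded version leaves the valid UTF-8 input alone.
$guarded = ensure_utf8($original);

var_dump($doubleEncoded === $original); // bool(false)
var_dump($guarded === $original);       // bool(true)
```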

Problem with function removing accents and other characters in PHP

I found a simple function to remove some undesired characters from a string.
function strClean($input){
$input = strtolower($input);
$b = array("á","é","í","ó","ú", "ñ", " "); //etc...
$c = array("a","e","i","o","u","n", "-"); //etc...
$input = str_replace($b, $c, $input);
return $input;
}
When I use it on accents or other characters, like this word 'á é ñ í' it prints out those question marks or weird characters, like:
output http://img217.imageshack.us/img217/6794/59472278.jpg
Note: I'm using strclean.php (which contains this function) and index.php, both in UTF-8. index.php looks as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include('strclean.php');
echo 'óóóáà';
echo strClean('óóóáà');
?>
</body>
</html>
What am I doing wrong?
Use
iconv('UTF-8', 'ASCII//TRANSLIT', $input);
You may want to try iconv.
Does a replacement happen at all, i.e. do you get the same weird characters when you print $input beforehand? If so, the character sets of your PHP source code file and the input do not match and you might need to use iconv() on the input before replacing.
edit: I took both of your files, uploaded them to my webserver and printing and cleaning works fine (see http://www.tag-am-meer.com/test1/). This is on PHP 4.4.9 and Firefox 3.0.6. More potential problems that come to my mind:
Does it work for you on Firefox? I remember vaguely that IE6 (and probably later versions as well) expects the charset in the HTML head section to be written in lowercase ("utf-8").
Does your editor include byte order marks (BOM) in the code files? Mine does not, maybe PHP chokes on those.
Can you look at the HTTP headers to see if there's something unusual going on, like a bad MIME type? The Tamper Data add-on for Firefox can help with this.
I have tested your code, and the error is in the strtolower function.
Replace it with mb_strtolower, as below:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<?php
function strClean($input) {
    // mb_strtolower is multibyte-aware, so UTF-8 sequences stay intact
    $input = mb_strtolower($input, 'UTF-8');
    $b = array("á","é","í","ó","ú", "ñ", " ");
    $c = array("a","e","i","o","u","n", "-");
    return str_replace($b, $c, $input);
}
$string = 'á é í ó ú ñ abcdef ghij';
echo $string ."<br />". strClean($string);
?>
</body>
</html>
Why do you want to remove accents? Is it possible that you just want to ignore them? If so, this answer has a Perl solution that demonstrates how to do that. Note that the Perl is in a foreign language. :)
I ran into this trouble before. I tried to follow the leads in this post and others I found along the way, and there was no simple solution, because you have to know the charset that your system uses (in my case ISO-8859-1). This is what I did:
function quit_accenture($str){
    // Character classes list the letters directly; a "|" inside [] would be
    // matched literally. These byte-wise patterns assume this source file
    // is saved as ISO-8859-1, like the converted input below.
    $pattern = array();
    $pattern[0] = '/[ÁÂÀÅÄ]/';
    $pattern[1] = '/[ÉÊÈ]/';
    $pattern[2] = '/[ÍÎÌÏ]/';
    $pattern[3] = '/[ÓÔÒÖ]/';
    $pattern[4] = '/[ÚÛÙÜ]/';
    $pattern[5] = '/[áâàåä]/';
    $pattern[6] = '/[ðéêèë]/';
    $pattern[7] = '/[íîìï]/';
    $pattern[8] = '/[óôòøõö]/';
    $pattern[9] = '/[úûùü]/';
    $replacement = array();
    $replacement[0] = 'A';
    $replacement[1] = 'E';
    $replacement[2] = 'I';
    $replacement[3] = 'O';
    $replacement[4] = 'U';
    $replacement[5] = 'a';
    $replacement[6] = 'e';
    $replacement[7] = 'i';
    $replacement[8] = 'o';
    $replacement[9] = 'u';
    return preg_replace($pattern, $replacement, $str);
}
$txt = $_POST['your_htmled_text'];
// Convert to your system's charset. I checked this in php.ini
$txt = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $txt);
// Apply your function
$txt = quit_accenture($txt);
// Output
print_r($txt);
This worked for me, but I also think is the right way :)
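If the input and the source file are both UTF-8, the round trip through ISO-8859-1 may not be needed: with the /u modifier PCRE treats pattern and subject as UTF-8, so each accented letter is matched as one character. A sketch under that assumption (the function name and the exact letter lists are illustrative):

```php
<?php
// Hypothetical UTF-8 variant of the function above. This file must be
// saved as UTF-8 for the literal accented letters in the patterns to work.
function strip_accents_utf8(string $str): string
{
    $map = [
        '/[ÁÂÀÅÄ]/u' => 'A',
        '/[ÉÊÈË]/u'  => 'E',
        '/[ÍÎÌÏ]/u'  => 'I',
        '/[ÓÔÒÖÕ]/u' => 'O',
        '/[ÚÛÙÜ]/u'  => 'U',
        '/[áâàåä]/u' => 'a',
        '/[éêèë]/u'  => 'e',
        '/[íîìï]/u'  => 'i',
        '/[óôòöõ]/u' => 'o',
        '/[úûùü]/u'  => 'u',
        '/ñ/u'       => 'n',
        '/Ñ/u'       => 'N',
    ];
    // preg_replace applies each pattern in turn to the running result.
    return (string) preg_replace(array_keys($map), array_values($map), $str);
}

echo strip_accents_utf8("Árbol | café über"), "\n"; // Arbol | cafe uber
```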
