I'm calling the Google Translate API and I need to send UTF-8 as input.
I have a piece of code to convert a string to UTF-8, but no matter what I try, when I check the encoding right after the conversion I get ASCII as the encoding of the string.
Here is the most popular answer I could find:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
The other way I tried was like this:
$text = utf8_encode($text);
As soon as I check the encoding again (in both cases) I get ASCII as the result:
echo mb_detect_encoding($text);
What am I missing here?
Thanks for any tips.
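One likely explanation, offered as a sketch rather than a diagnosis: ASCII is a subset of UTF-8, so a string that contains only ASCII characters is valid in both encodings, and mb_detect_encoding reports the first match in its detection order. The sample strings and the explicit detection order below are assumptions made for illustration.

```php
<?php
// Assumed detection order for this demo; ASCII comes before UTF-8.
$order = array('ASCII', 'UTF-8');

$plain    = 'Hello world';      // only ASCII bytes
$accented = "Od\xC3\xB3metro";  // "Odómetro" written as explicit UTF-8 bytes

// A pure-ASCII string is reported as ASCII, yet it is also valid UTF-8.
$plainDetected = mb_detect_encoding($plain, $order, true);
$plainIsUtf8   = mb_check_encoding($plain, 'UTF-8');

// Once a non-ASCII character is present, the same call reports UTF-8.
$accentedDetected = mb_detect_encoding($accented, $order, true);

echo $plainDetected, "\n";
var_dump($plainIsUtf8);
echo $accentedDetected, "\n";
```

If that is what is happening, nothing needs converting: the ASCII-only text can be sent as UTF-8 unchanged.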
I have some json I need to decode, alter and then encode without messing up any characters.
If I have a Unicode character in a JSON string, it will not decode. I'm not sure why, since json.org says a string can contain any-Unicode-character-except-"-or-\-or-control-character. But it doesn't work in Python either.
{"Tag":"Odómetro"}
I can use utf8_encode which will allow the string to be decoded with json_decode, however the character gets mangled into something else. This is the result from a print_r of the result array. Two characters.
[Tag] => OdÃ³metro
When I encode the array again I get the character escaped to ASCII, which is correct according to the JSON spec:
"Tag"=>"Od\u00f3metro"
Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.
Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only available in PHP 5.4. I will have to use some regex, as I only have 5.3.
$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...
I found the following way to fix this issue. I hope it helps you.
json_encode($data,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);
Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.
Here's why I think so:
json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is exactly what you would see when UTF-8 text is read as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, and that byte sequence is Ã³ in ISO 8859-1).
Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).
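The reasoning above can be checked with a few lines. This is only a sketch: the input bytes are an assumed ISO 8859-1 spelling of "Odómetro", and mb_convert_encoding is used here in place of utf8_encode to do the same ISO 8859-1 to UTF-8 conversion.

```php
<?php
$latin1 = "Od\xF3metro";  // "Odómetro" as ISO 8859-1 bytes (assumed input)

// json_encode refuses byte sequences that are not valid UTF-8.
$rejected = json_encode($latin1);

// Converting from ISO 8859-1 to UTF-8 makes the string encodable.
$utf8    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$encoded = json_encode($utf8);

// Treating the UTF-8 bytes as ISO 8859-1 again reproduces the
// two-character mangling ("Ã³") seen in the print_r output.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump($rejected);
echo $encoded, "\n";
echo bin2hex($misread), "\n";
```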
JSON_UNESCAPED_UNICODE was added in PHP 5.4, so it looks like you need to upgrade your version of PHP to take advantage of it. 5.4 is not released yet though! :(
There is a 5.4 alpha release candidate on QA though if you want to play on your development machine.
A hacky way of doing JSON_UNESCAPED_UNICODE in PHP 5.3. Really disappointed by PHP json support. Maybe this will help someone else.
$array = some_json();
// Encode all string children in the array to html entities.
array_walk_recursive($array, function(&$item, $key) {
if(is_string($item)) {
$item = htmlentities($item);
}
});
$json = json_encode($array);
// Decode the html entities and end up with unicode again.
$json = html_entity_decode($json);
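Another route for PHP 5.3 is the regex idea mentioned in the question. The sketch below is an assumption-laden illustration, not a drop-in replacement for JSON_UNESCAPED_UNICODE: json_unescape_unicode is a hypothetical helper name, it leaves control characters, ASCII, and surrogate halves escaped, and it would wrongly rewrite a literal backslash that is itself followed by uXXXX.

```php
<?php
// Hypothetical helper: turn \uXXXX escapes in a JSON string back into UTF-8.
function json_unescape_unicode($json)
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function ($m) {
            $codePoint = hexdec($m[1]);
            // Keep ASCII/control escapes and surrogate halves untouched,
            // so the output stays valid JSON.
            if ($codePoint < 0x80 || ($codePoint >= 0xD800 && $codePoint <= 0xDFFF)) {
                return $m[0];
            }
            // Interpret the four hex digits as one big-endian UCS-2 code unit.
            return mb_convert_encoding(pack('H*', $m[1]), 'UTF-8', 'UCS-2BE');
        },
        $json
    );
}

$array = array('Tag' => "Od\xC3\xB3metro");  // "Odómetro" as UTF-8 bytes
$json  = json_unescape_unicode(json_encode($array));
echo $json, "\n";
```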
$json = array('tag' => 'Odómetro'); // Original array
$json = json_encode($json); // {"tag":"Od\u00f3metro"}
$json = json_decode($json); // Od\u00f3metro becomes Odómetro
echo $json->{'tag'}; // Odómetro
echo utf8_decode($json->{'tag'}); // Odómetro
You were close, just use utf8_decode.
try setting the utf-8 encoding in your page:
header('content-type:text/html;charset=utf-8');
this works for me:
$arr = array('tag' => 'Odómetro');
$encoded = json_encode($arr);
$decoded = json_decode($encoded);
echo $decoded->{'tag'};
Try Using:
utf8_decode() and utf8_encode
To encode an array that contains special characters, ISO 8859-1 to UTF8. (If utf8_encode & utf8_decode is not what is working for you, this might be an option)
Everything that is in ISO-8859-1 should be converted to UTF8:
$utf8 = utf8_encode('이 감사의 마음을 전합니다!'); //contains UTF8 & ISO 8859-1 characters;
$iso88591 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$data = $iso88591;
Encode should work after this:
$encoded_data = json_encode($data);
Convert UTF-8 to & from ISO 8859-1
I am using this scraper for IMDB, and the problem is that some characters come back as hexadecimal entities such as &#xEF;.
I use this scraper with cURL, and the response is a string encoded in UTF-8.
I tried to get the encoding of the string with mb_detect_encoding(), and it returns UTF-8:
$html = $this->geturl("${imdbUrl}combined");
mb_detect_encoding($html);
So I have a string with some hex values inside, like this for example:
$var = 'Sa&#xEF;d Taghmaoui';
So I tried to decode $html with utf8_decode(), but no luck: I still have some characters in hex.
So I have a few questions:
1- What's the best solution for this? I can imagine different approaches, for example reading the string and using a regex to replace every hex code with its character, but I am not sure this is the best solution, and I also don't know how to write the regex.
2- Can this be solved through cURL? I mean, is there some configuration that sets the cURL encoding to UTF-8, for example?
I tried the functions recode_string, iconv, and mb_convert_encoding.
Well, basically my problem is that the response from the scraper comes with UTF-8 encoding, but before printing the text I need to process the data with these functions:
$var = 'Sa&#xEF;d Taghmaoui';
htmlspecialchars(html_entity_decode($var, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
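To make that last step concrete, here is a small sketch. It assumes the scraped string carries a hexadecimal numeric entity such as &#xEF; (the sample value is an assumption); html_entity_decode turns it into the real UTF-8 character, and htmlspecialchars then re-escapes only the HTML-significant characters.

```php
<?php
$var = 'Sa&#xEF;d Taghmaoui';  // assumed sample with a hex numeric entity

// Decode entities into real UTF-8 characters.
$decoded = html_entity_decode($var, ENT_QUOTES, 'UTF-8');

// Re-escape only &, < and > so the text is safe to print in HTML.
$safe = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');

echo $safe, "\n";
```

No regex or cURL setting is needed for this: the bytes are already UTF-8, and only the entities need decoding.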
My boss is forcing me to use an access mdb database (yes, I'm serious) in a php server.
I can connect it and retrieve data from it, but as you could imagine, I have problems with encodings because I want to work using utf8.
The thing is that now I have two "solutions" to translate Windows-1252 to UTF-8
This is the first way:
mb_convert_encoding($string, "UTF-8", "Windows-1252").
It works, but the problem is that special chars are not properly converted, for example char º is converted to \u00ba and char Ó is converted to \u00d3.
My second way is doing this:
mb_convert_encoding(mb_convert_encoding($string, "UTF-8", "Windows-1252"), "HTML-ENTITIES", "UTF-8")
It works too, but the same thing happens: special chars are not correctly converted. Char º is converted to &ordm;.
Does anybody know how to properly change the encoding, including special chars?
Or does anybody know how to convert from &ordm; and \u00ba to something readable?
I did a simple test to convert code points to letters:
<?php
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
$string_with_codepoint = "Ahed \u00d3\u00ba\u00d3";
// $string_with_codepoint = mb_convert_encoding($string, "UTF-8", "Windows-1252");
$output = codepoint_decode($string_with_codepoint);
echo $output; // Ahed ÓºÓ
Credit goes to this answer.
I finally found the solution.
I had the solution from the beginning but I was doing my tests wrong.
My bad.
The right way to do it for me is mb_convert_encoding($string, "UTF-8", "Windows-1252")
But I was checking the result like this:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo json_encode($stringUTF8);
That's why it was returning Unicode escapes like \u20ac. If I had done this instead:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo $stringUTF8;
I would have seen the solution from the beginning. It was json_encode() that was turning special chars into Unicode escapes.
Thanks everybody for your help!!
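For anyone who lands here with the same confusion, a short sketch of what was going on. The sample bytes are an assumption, and JSON_UNESCAPED_UNICODE needs PHP 5.4 or later.

```php
<?php
$win1252 = "\x80 \xBA \xD3";  // "€ º Ó" as Windows-1252 bytes (assumed sample)

// The conversion itself is correct: the result is real UTF-8.
$utf8 = mb_convert_encoding($win1252, 'UTF-8', 'Windows-1252');

// json_encode escapes non-ASCII characters by default...
$escaped = json_encode($utf8);

// ...unless asked not to (PHP 5.4+).
$readable = json_encode($utf8, JSON_UNESCAPED_UNICODE);

echo $utf8, "\n";
echo $escaped, "\n";
echo $readable, "\n";
```

Both JSON forms decode back to the same UTF-8 string, so the escapes are only a different spelling, not data loss.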
My php file is in UTF-8 encoding and I am trying to encode my data for safe sending into application but some characters get encoded incorrectly.
$text = "Š";
$text = urlencode(utf8_decode($text));
echo $text;
This echoes %3F, but according to the URL-encoding reference found here (http://www.w3schools.com/tags/ref_urlencode.asp), "Š" should be converted into %8A. PHP's own reference also does not state which reference it is using. Could this be an encoding/decoding issue or something else?
utf8_decode tries to convert from UTF-8 to ISO-8859-1 but Š does not exist in ISO-8859-1. So you obtain '?' (= %3F), the substitution character.
It exists in CP1252 (maybe others), under the hexadecimal code 8A. So:
$text = urlencode(iconv('UTF-8', 'CP1252', $text));
Should give what you expect. In fact, you shouldn't decode a Unicode string.
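A quick way to see the difference, as a sketch: mb_convert_encoding is used here in place of iconv (either should do the same conversion), and the input is "Š" written as explicit UTF-8 bytes so the result does not depend on how the script file is saved.

```php
<?php
$text = "\xC5\xA0";  // "Š" (U+0160) as UTF-8 bytes

// Convert to Windows-1252 (CP1252), where Š is the single byte 0x8A.
$cp1252 = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
$legacy = urlencode($cp1252);

// Encoding the UTF-8 bytes directly gives the two-byte form instead.
$utf8Encoded = urlencode($text);

echo $legacy, "\n";
echo $utf8Encoded, "\n";
```

Which of the two forms is right depends on what the receiving application expects, so that part has to be checked on the other side.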
I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 JÃ¤rvamaa
EE.05 JÃµgevamaa
EE.07 LÃ¤Ã¤nemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding specific entities (meaning entities that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding $str = iconv('', 'UTF-8', $str); (which may or may not work)...
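If the source really is UTF-8 already, running it through a Latin-1 to UTF-8 conversion encodes it a second time, which produces exactly this kind of garbage. A rough sketch of a guard follows; ensure_utf8 is a hypothetical helper, and the ISO-8859-1 fallback is only a guess that the caller has to get right.

```php
<?php
// Hypothetical helper: convert only when the bytes are not already valid UTF-8.
// $assumedSource is a guess supplied by the caller, not something detected.
function ensure_utf8($text, $assumedSource = 'ISO-8859-1')
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;  // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

$alreadyUtf8 = "J\xC3\xA4rvamaa";  // "Järvamaa" as UTF-8 bytes
$latin1      = "J\xE4rvamaa";      // "Järvamaa" as ISO-8859-1 bytes

// What an unconditional Latin-1 to UTF-8 conversion does to UTF-8 input.
$doubleEncoded = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');

echo ensure_utf8($alreadyUtf8), "\n";
echo ensure_utf8($latin1), "\n";
echo $doubleEncoded, "\n";
```

Validity checking is not the same as detection: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so knowing the source encoding is still the safer option.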
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is '\xef\xbb\xbf', we should add it at the very start of the file.
<?php
function writeStringToFile($file, $string){
$f=fopen($file, "wb");
$string = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM to the content
fputs($f, $string);
fclose($f);
}
?>
The $file could be anything: text, XML...
The $string is your UTF8 encoded string.
Try it now and it will write a UTF8 encoded file with your UTF8 content (string).
writeStringToFile('test.xml', 'éèàç');
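To check whether a file written this way actually starts with the BOM, something like the following sketch can be used. The temporary file and variable names are made up for the demo. Note that a BOM is optional for UTF-8: it is a hint that some editors rely on, not what makes the bytes UTF-8.

```php
<?php
$path    = tempnam(sys_get_temp_dir(), 'bom');   // throwaway file for the demo
$bom     = "\xEF\xBB\xBF";
$content = "\xC3\xA9\xC3\xA8\xC3\xA0\xC3\xA7";   // "éèàç" as UTF-8 bytes

// Write BOM + content, then read the raw bytes back.
file_put_contents($path, $bom . $content);
$raw = file_get_contents($path);

// The first three bytes tell us whether a BOM is present.
$hasBom = strncmp($raw, $bom, 3) === 0;

// Strip it again before processing the text.
$body = $hasBom ? substr($raw, 3) : $raw;

unlink($path);
var_dump($hasBom);
echo $body, "\n";
```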
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
// $s is already UTF-8 here, so write it as-is; utf8_encode() is only for ISO-8859-1 input.
fwrite($f, $s);
fclose($f);
?>