Getting � instead of special characters when using gettext and Smarty - PHP

I am using $encoding = 'utf-8'; in gettext, and in my HTML I have set <meta charset="utf-8">. I have also set UTF-8 in my .po files, but I still get � when I write æøå. What could be wrong?

Let's see how the values you mention are at the byte level.
I copied the æøå from your question and � from your title. The reason for � is that I had to use a Windows console application to fetch the title of your question and its codepage was Windows 1252 (copying from the browser gave me Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)).
In a script encoded in UTF-8, this code:
<?php
$s = 'æøå';
$s2 = '�';
// bin2hex() prints the raw bytes of each converted string.
echo "s iso-8859-1 ", bin2hex(mb_convert_encoding($s, "ISO-8859-1", "UTF-8")), "\n";
echo "s2 win-1252 ", bin2hex(mb_convert_encoding($s, "WINDOWS-1252", "UTF-8")), "\n";
prints:
s iso-8859-1 e6f8e5
s2 win-1252 e6f8e5
So the byte representations match: these single bytes are not valid UTF-8, and a page declared as UTF-8 shows them as �. The problem is that when you write æøå, either:
You're writing it in ISO-8859-1 instead of UTF-8. Check your text editor.
The value is being converted from UTF-8 to ISO-8859-1 somewhere along the way (unlikely).
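To check which case applies, it can help to look at the raw bytes. This is a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative. It starts from the bytes e6 f8 e5 seen in the output above and shows that they are not well-formed UTF-8, while an explicit conversion from ISO-8859-1 produces the expected UTF-8 bytes.

```php
<?php
// Bytes of "æøå" as an editor saving in ISO-8859-1 would write them (assumed input).
$latin1 = "\xE6\xF8\xE5";

// These bytes are not well-formed UTF-8, so a UTF-8 page renders them as �.
$isValidUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Converting explicitly from ISO-8859-1 yields the proper UTF-8 bytes.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($isValidUtf8);      // bool(false)
echo bin2hex($utf8), "\n";   // c3a6c3b8c3a5
```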

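You need to set the codeset for your text domain:
bind_textdomain_codeset($domain, "UTF-8");
Otherwise gettext returns the translations in the codeset of the current locale, which may not be UTF-8, and you will get the � character.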
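For context, a typical setup sequence might look like the sketch below. The helper name init_gettext_utf8() and its parameters are hypothetical, chosen only for illustration; it assumes the gettext extension is loaded and that the requested locale is installed on the system.

```php
<?php
// Hypothetical helper: the name and parameters are illustrative, not part of gettext.
function init_gettext_utf8(string $domain, string $localeDir, string $locale): void
{
    setlocale(LC_ALL, $locale);                 // the locale must be installed on the system
    bindtextdomain($domain, $localeDir);        // directory holding <locale>/LC_MESSAGES/<domain>.mo
    bind_textdomain_codeset($domain, 'UTF-8');  // return translated strings encoded as UTF-8
    textdomain($domain);                        // make it the default domain for _() and gettext()
}
```

The call to bind_textdomain_codeset() is the part that matters for this question; the other three calls are the usual surrounding setup.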

Related

How to detect MacRoman encoding in PHP?

PHP's mb_detect_encoding() doesn't understand the MacRoman encoding. My app allows users to upload data in csv format and I need to convert it to utf8 because the users are not tech-savvy. I will never be able to get all of them to understand how to do it and control their encoding.
This is what I'm doing:
$encoding_detection_order = array('UTF-8', 'UTF-7', 'ASCII', 'ISO-8859-1', 'EUC-JP', 'SJIS', 'eucJP-win', 'SJIS-win', 'JIS', 'ISO-2022-JP', );
$encoding = mb_detect_encoding($value, $encoding_detection_order, true);
$converted_value = iconv($encoding, 'UTF-8//TRANSLIT', $value);
This works great for most situations, but if my user is on a Mac and they save the CSV in MacRoman encoding, then the above code will usually wrongly detect the text as ISO-8859-1 which causes the iconv() to produce bad output.
For example, the accented e in Jaimé has a hex value of 0x8e in MacRoman. Read as a Windows Latin-1 byte, 0x8e is Ž (that is the Windows-1252 mapping; strict ISO-8859-1 has a control character there), so when I convert it to UTF-8, I just get the UTF-8 version of Ž when I should be getting é.
I need to be able to dynamically differentiate MacRoman from other encodings so that I convert it properly.

PHP iconv from utf-8 to windows-1252 with no special characters

I'm trying to encode a file contents like this:
$f_file = fopen("dreams.txt", "w");
$string = "Los sueños se cumplen.";
$string_encoded = iconv( mb_detect_encoding( $string ), 'Windows-1252//TRANSLIT', $string );
fwrite($f_file, $string_encoded);
fclose($f_file);
If $string includes a special character such as "ñ" or "á", the file is saved with Windows-1252 encoding, but if $string does not include one, the file is reported as UTF-8. I need the file to be Windows-1252 encoded.
What am I doing wrong?
The first 128 characters (0x00 to 0x7F) are the same in ASCII, ANSI (ISO-8859-1), Windows-1252, and UTF-8, so it's impossible to tell what "the" encoding is just by looking at a document with only characters from that set: they are all equally applicable.
Modern editors will see this and go "it's 2018 so I'm going to tell you it's UTF-8", and they won't even be wrong: until you add those special characters, all these encoding schemes are interchangeable. It's not until you introduce characters above 0x7F that the encoding has to be made explicit again.
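This can be checked at the byte level. The following is a small sketch, assuming the mbstring extension; the variable names are made up for the example, and the ñ is written as an escape sequence so the result does not depend on how the script file itself is saved.

```php
<?php
// "\xC3\xB1" is "ñ" in UTF-8.
$plain    = "Los suenos se cumplen.";          // only bytes below 0x80
$accented = "Los sue\xC3\xB1os se cumplen.";   // contains ñ

$plain1252    = mb_convert_encoding($plain, 'Windows-1252', 'UTF-8');
$accented1252 = mb_convert_encoding($accented, 'Windows-1252', 'UTF-8');

var_dump($plain1252 === $plain);        // bool(true): byte-for-byte identical
var_dump($accented1252 === $accented);  // bool(false): ñ became the single byte 0xF1
echo bin2hex(substr($accented1252, 7, 1)), "\n"; // f1
```

So a file written from $plain1252 is valid Windows-1252 and valid UTF-8 at the same time; an editor has nothing to distinguish them by.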

Convert UTF-8 to WINDOWS-1252 using PHP

I need to convert text from UTF-8 to Windows-1252 using PHP and I'm not having much luck so far. My aim is to transfer text to a third-party system and exclude any characters not in the Windows-1252 character set.
I've tried both iconv and mb_convert_encoding but both give unexpected results.
$text = 'KØBENHAVN Ø ô& üü þþ';
echo iconv("UTF-8", "WINDOWS-1252", $text);
echo mb_convert_encoding($text, "WINDOWS-1252");
Output for both is 'K?BENHAVN ? ?& ?? ??'
I would not have expected the ?'s as these characters are in the WINDOWS-1252 character set.
Can anyone help cast some light on this for me please.
I ended up converting the text from UTF-8 to WINDOWS-1252 and then back from WINDOWS-1252 to UTF-8. This gave the desired output.
$text = "Ѭjanky";
// //IGNORE applies to the target encoding: characters it cannot represent are dropped.
$converted = iconv("UTF-8", "WINDOWS-1252//IGNORE", $text);
$converted = iconv("WINDOWS-1252", "UTF-8//IGNORE", $converted);
echo $converted; // outputs "janky"
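A different way to drop unrepresentable characters, not the one used in the answer above, is mbstring's substitute-character setting. This is a sketch under the assumption that the mbstring extension is available; the helper name utf8_to_cp1252_dropping() is hypothetical.

```php
<?php
// Hypothetical helper name; assumes the mbstring extension.
function utf8_to_cp1252_dropping(string $utf8): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop characters that cannot be represented
    $out = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
    mb_substitute_character($previous);      // restore the earlier setting
    return $out;
}

echo utf8_to_cp1252_dropping("\u{046C}janky"), "\n";              // janky
echo bin2hex(utf8_to_cp1252_dropping("K\u{00D8}BENHAVN")), "\n"; // 4bd842454e4841564e
```

The second call shows that Ø survives as the single byte 0xD8. Whether it then displays correctly depends on the terminal or page being told the output is Windows-1252.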

convert UTF-8 to ANSI (windows-1252)

I'm trying to save a string in Hebrew to a file, with the file ANSI encoded.
All attempts have failed, I'm afraid.
The PHP file itself is UTF-8.
So here's the code I'm trying :
$to_file = "בדיקה אם נרשם";
$to_file = mb_convert_encoding($to_file, "WINDOWS-1255", "UTF-8");
file_put_contents(dirname(__FILE__) ."/txt/TESTING.txt",$to_file);
This returns false for some reason.
Another attempt was :
$to_file = iconv("UTF-8", "windows-1252", $to_file);
This returns an empty string. While this did not work, changing the output charset to windows-1255 DID work. So the function itself works, but for some reason it does not convert to 1252.
I ran this function before and after the iconv and printed the results
mb_detect_encoding ($to_file);
before the iconv the encoding is UTF-8.
after the iconv the encoding is ASCII(??)
I'd really appreciate any help you can give
Windows-1252 is a Latin encoding; you cannot encode Hebrew characters in Windows-1252. That's why it doesn't work.
Windows-1255 is an encoding for Hebrew, that's why it works.
The reason it doesn't work with mb_convert_encoding is that mb_ doesn't support Windows-1255.
Detecting encodings is by definition impossible. Windows-1255 is a single-byte encoding; it's virtually impossible to distinguish any one single byte encoding from another. The result is just as valid in ASCII as it is in Windows-1255 or Windows-1252 or ISO-8859 or any other single byte encoding.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more information.
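You can use this:
<?php
$heb = 'טקסט בעברית .. # ';
// The /e modifier was removed in PHP 7, so use a callback instead.
// Each Windows-1255 letter byte (0xE0-0xFA) becomes 0xD7 followed by (byte - 80).
$utf = preg_replace_callback('/[\xE0-\xFA]/', function ($m) {
    return chr(215) . chr(ord($m[0]) - 80);
}, $heb);
echo '<pre>';
print_r($heb);
echo '</pre>';
echo '------';
echo '<pre>';
print_r($utf);
echo '</pre>';
?>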
Output will be like this:
���� ������ .. # <-- $heb - what we get when we print hebrew ANSI Windows 1255
טקסט בעברית .. # <- $utf - The Converted ANSI Windows 1255 to now UTF ...:)
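The arithmetic behind this mapping: Windows-1255 stores the letters א to ת as bytes 0xE0 to 0xFA, Unicode places them at U+05D0 to U+05EA, and the UTF-8 form of those code points is 0xD7 followed by 0x90 to 0xAA, which is the original byte minus 0x50 (80). The sketch below applies the same idea to explicit bytes so it does not depend on how the script file is saved; the function name is hypothetical, and it handles only the letters, not vowel points or other Windows-1255 symbols.

```php
<?php
// Hypothetical helper: maps only the Hebrew letters 0xE0-0xFA of Windows-1255
// to UTF-8 (U+05D0-U+05EA); every other byte is left untouched.
function win1255_letters_to_utf8(string $bytes): string
{
    return preg_replace_callback(
        '/[\xE0-\xFA]/',
        function (array $m): string {
            return "\xD7" . chr(ord($m[0]) - 0x50);
        },
        $bytes
    ) ?? $bytes;
}

$heb = "\xE8\xF7\xF1\xE8";   // "טקסט" in Windows-1255
echo bin2hex(win1255_letters_to_utf8($heb)), "\n"; // d798d7a7d7a1d798
```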

Convert ASCII TO UTF-8 Encoding

How do I convert ASCII encoding to UTF-8 in PHP?
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything, because ASCII is already valid UTF-8.
But if you still want to convert, just to be sure that it's UTF-8, then you can use iconv:
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert ASCII to UTF-8:
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string built from code points U+0000 to U+007F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such a string is pointless.
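A quick way to confirm this for a given string is sketched below, assuming the mbstring extension; the variable names are illustrative.

```php
<?php
$ascii = "characters";                // only code points U+0000 to U+007F

var_dump(mb_check_encoding($ascii, 'ASCII'));  // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));  // bool(true)

$converted = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');
var_dump($converted === $ascii);      // bool(true): no bytes changed

$accented = "ch\xC3\xA1r";            // "chár" in UTF-8, which is not ASCII
var_dump(mb_check_encoding($accented, 'ASCII')); // bool(false)
```

If the ASCII check fails, the input is in some other encoding, and that encoding has to be known before any conversion makes sense.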
Use utf8_encode(). Note that it assumes the input is ISO-8859-1, and that it is deprecated as of PHP 8.2.
The manual page can be found here: http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation of what Unicode is and how it works: http://www.joelonsoftware.com/articles/Unicode.html
