I'm trying to save a Hebrew string to a file, with the file ANSI encoded.
All attempts have failed, I'm afraid.
The PHP file itself is UTF-8.
So here's the code I'm trying :
$to_file = "בדיקה אם נרשם";
$to_file = mb_convert_encoding($to_file, "WINDOWS-1255", "UTF-8");
file_put_contents(dirname(__FILE__) ."/txt/TESTING.txt",$to_file);
This returns false for some reason.
Another attempt was :
$to_file = iconv("UTF-8", "windows-1252", $to_file);
This returns an empty string. While windows-1252 did not work, changing the output charset to windows-1255 DID work. So the function itself works, but for some reason it does not convert to 1252.
I ran this function before and after the iconv and printed the results
mb_detect_encoding ($to_file);
before the iconv the encoding is UTF-8.
after the iconv the encoding is ASCII(??)
I'd really appreciate any help you can give
Windows-1252 is a Latin encoding; you cannot encode Hebrew characters in Windows-1252. That's why it doesn't work.
Windows-1255 is an encoding for Hebrew, that's why it works.
The reason it doesn't work with mb_convert_encoding is that the mbstring extension (the mb_ functions) doesn't support Windows-1255.
Reliably detecting encodings is impossible in general. Windows-1255 is a single-byte encoding; it's virtually impossible to distinguish any one single-byte encoding from another. The result is just as valid in Windows-1255 as it is in Windows-1252 or ISO-8859 or any other single-byte encoding, so whatever the detector reports is only a guess.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more information.
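As a rough illustration of the answer above, the following sketch converts the string from the question to Windows-1255 with iconv and then decodes the very same bytes as Windows-1252, which also succeeds (just with the wrong characters). It assumes the iconv extension is loaded and that the underlying iconv library recognizes the name WINDOWS-1255 (glibc does); the output path in the temp directory is a placeholder, not the path from the question.

```php
<?php
// Assumes this file is saved as UTF-8, as in the question.
$text = "בדיקה אם נרשם";

// Hebrew letters exist in Windows-1255, so this conversion succeeds.
$hebrew = iconv("UTF-8", "WINDOWS-1255", $text);
if ($hebrew === false) {
    exit("Conversion to Windows-1255 failed\n");
}

// Windows-1255 uses one byte per character.
echo strlen($hebrew), " bytes: ", bin2hex($hebrew), "\n";

// The same bytes are also valid Windows-1252; they simply stand for
// Latin characters there, which is why detection cannot decide.
$misread = iconv("WINDOWS-1252", "UTF-8", $hebrew);
echo $misread, "\n";

// Placeholder path; replace with the real target file.
$path = sys_get_temp_dir() . "/TESTING.txt";
$written = file_put_contents($path, $hebrew);
echo $written === false ? "write failed\n" : "wrote $written bytes\n";
```

Converting to Windows-1252 instead fails because none of the Hebrew letters has a slot in that code page.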
You can use this:
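<?php
// $heb must hold Windows-1255 bytes (e.g. this file is saved as Windows-1255).
$heb = 'טקסט בעברית .. # ';
// The /e modifier was removed in PHP 7, so use preg_replace_callback instead.
// Windows-1255 letters 0xE0-0xFA map to the UTF-8 bytes 0xD7 0x90-0xAA.
$utf = preg_replace_callback('/[\xE0-\xFA]/', function ($m) {
    return chr(215) . chr(ord($m[0]) - 80);
}, $heb);
echo '<pre>';
print_r($heb);
echo '</pre>';
echo '------';
echo '<pre>';
print_r($utf);
echo '</pre>';
?>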
Output will be like this:
���� ������ .. # <-- $heb - what we get when we print hebrew ANSI Windows 1255
טקסט בעברית .. # <- $utf - The Converted ANSI Windows 1255 to now UTF ...:)
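The byte arithmetic in the snippet above only covers the letters alef to tav. A minimal sketch of the same conversion using iconv, which consults the full Windows-1255 table, might look like this; it assumes the iconv extension is available, and the input is written with byte escapes so the example does not depend on how the file itself is saved.

```php
<?php
// "טקסט" as raw Windows-1255 bytes.
$cp1255 = "\xE8\xF7\xF1\xE8";

// Let iconv do the table lookup instead of hand-written arithmetic.
$utf8 = iconv("WINDOWS-1255", "UTF-8", $cp1255);

// The manual mapping (0xE0-0xFA becomes 0xD7 followed by byte - 80),
// kept here only for comparison.
$manual = preg_replace_callback(
    '/[\xE0-\xFA]/',
    function ($m) { return chr(215) . chr(ord($m[0]) - 80); },
    $cp1255
);

echo bin2hex($utf8), "\n";    // d798d7a7d7a1d798
var_dump($utf8 === $manual);  // bool(true)
```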
I'm trying to encode a file contents like this:
$f_file = fopen("dreams.txt", "w");
$string = "Los sueños se cumplen.";
$string_encoded = iconv( mb_detect_encoding( $string ), 'Windows-1252//TRANSLIT', $string );
fwrite($f_file, $string_encoded);
fclose($f_file);
If $string includes a special character such as "ñ" or "á", the file is saved with Windows-1252 encoding, but if $string does not include one, the file is reported as UTF-8. I need the file to have Windows-1252 encoding.
What am I doing wrong?
The first 128 code points (0x00 to 0x7F) are encoded identically in ASCII, ISO-8859-1, Windows-1252 (what Windows calls "ANSI"), and UTF-8, so it's impossible to tell what "the" encoding is just by looking at a document with only characters from that set: they are all equally applicable.
Modern editors will see this and go "it's 2018 so I'm going to tell you it's UTF8", and they won't even be wrong: until you add those special characters, all these encoding schemes are interchangeable. It's not until you introduce bytes above 0x7F that you have to be explicit about what the encoding is supposed to be.
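To make that concrete, here is a small sketch (assuming the iconv extension and a script file saved as UTF-8) that converts an ASCII-only string and an accented one to Windows-1252 and compares the bytes before and after. The unaccented sample sentence is made up for the comparison.

```php
<?php
$plain    = "Los suenos se cumplen.";  // ASCII only (hypothetical variant)
$accented = "Los sueños se cumplen.";  // contains ñ, stored as UTF-8

$plain1252    = iconv("UTF-8", "WINDOWS-1252", $plain);
$accented1252 = iconv("UTF-8", "WINDOWS-1252", $accented);

// ASCII-only text is byte-for-byte the same in both encodings.
var_dump($plain1252 === $plain);        // bool(true)

// Only the accented text actually changes during conversion.
var_dump($accented1252 === $accented);  // bool(false)

// ñ is two bytes in UTF-8 and one byte in Windows-1252.
echo bin2hex("\xC3\xB1"), " vs ",
     bin2hex(iconv("UTF-8", "WINDOWS-1252", "\xC3\xB1")), "\n"; // c3b1 vs f1
```

So the file without special characters is already a valid Windows-1252 file; the editor is just labelling it with its preferred default.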
I have a string like this:
$str = "\xC4";
According to Wikipedia, C4 is the ISO-8859-1 hex code for Ä. Now I want to lowercase this string to get ä (also in ISO-8859-1).
I tried various solutions using strtolower and mb_strtolower. None of them worked. The output was garbled every time.
You can specify the encoding in mb_strtolower(), so just specify it and it all works fine:
echo mb_strtolower($str, "ISO-8859-1");
//^^^^^^^^^^
output:
ä
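strtolower("\xC4") can work too, but it depends on the PHP version: before PHP 8.2 it follows the current locale (so a matching ISO-8859-1 locale has to be active), and from PHP 8.2 on it only converts ASCII letters. Either way, you need to interpret the resulting byte (xE4) using the ISO-8859-1 encoding, otherwise you'll obviously see garbage. If you're doing this in a browser, set the appropriate header to tell the browser which encoding to expect: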
header('Content-Type: text/html; charset=iso-8859-1');
echo strtolower("\xC4");
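One way to check the result without depending on how the terminal or browser decodes it is to look at the raw bytes. The sketch below assumes the mbstring extension is loaded; it lowercases the byte with an explicit encoding and then converts to UTF-8 purely for display.

```php
<?php
$upper = "\xC4";                               // Ä in ISO-8859-1
$lower = mb_strtolower($upper, "ISO-8859-1");  // result is still ISO-8859-1

// Inspect the byte itself rather than its rendering.
echo bin2hex($lower), "\n";                    // e4

// Convert for display on a UTF-8 terminal or page.
echo mb_convert_encoding($lower, "UTF-8", "ISO-8859-1"), "\n";
```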
I have a weird problem with the following code:
$str = "נסיון"; // <--- Hebrew chars
echo mb_detect_encoding ($str)."<br><br><br>";
$str = iconv (mb_detect_encoding($str),'UCS-2BE',$str);
echo mb_detect_encoding ($str)."<br><br><br>";
This will output :
UTF-8
UTF-8
This code is written in a file that's encoded (using Notepad++) as UTF-8 without BOM. I tried other encodings and they didn't work either.
I also tried converting the string using :
$str = mb_convert_encoding($str,'UCS-2BE');
But that didn't work either. Any insights?
From the documentation for mb_detect_order, the function that establishes the order in which mb_detect_encoding tests different encodings:
mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP
For ISO-8859-*, mbstring always detects as ISO-8859-*.
For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
So, you can't detect the encoding of the second string with the mb functions.
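Since detection cannot confirm the result, a rough way to verify that the conversion itself worked is to inspect the bytes and convert back. This sketch assumes the mbstring extension and a script file saved as UTF-8; it passes the source encoding explicitly rather than relying on detection.

```php
<?php
$str = "נסיון"; // stored as UTF-8 in this file

// Name the source encoding instead of asking mb_detect_encoding.
$ucs2 = mb_convert_encoding($str, "UCS-2BE", "UTF-8");

// Each Hebrew letter becomes a two-byte big-endian code unit (05 xx).
echo bin2hex($ucs2), "\n"; // 05e005e105d905d505df

// Converting back recovers the original string.
var_dump(mb_convert_encoding($ucs2, "UTF-8", "UCS-2BE") === $str); // bool(true)
```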
How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything because ASCII is already a valid UTF-8.
But if you still want to convert, just to be sure that it's UTF-8, then you can use iconv:
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert from ASCII to UTF-8:
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string built from code points 0x00 to 0x7F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such a string is pointless.
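A short sketch of that point, assuming the mbstring and iconv extensions are available: a pure-ASCII string validates as both ASCII and UTF-8, and converting it changes nothing. The sample strings are made up for illustration.

```php
<?php
$string = "plain ASCII text";

// The same bytes are valid in both encodings.
var_dump(mb_check_encoding($string, "ASCII"));  // bool(true)
var_dump(mb_check_encoding($string, "UTF-8"));  // bool(true)

// So the conversion is a no-op.
var_dump(iconv("ASCII", "UTF-8", $string) === $string); // bool(true)

// A byte above 0x7F (here 0xE9) means the input was never ASCII.
var_dump(mb_check_encoding("caf\xE9", "ASCII")); // bool(false)
```

If the last check fails for real data, the input is in some other single-byte encoding, and that encoding has to be known before converting.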
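Use utf8_encode(). Note that it converts from ISO-8859-1 to UTF-8 (it does nothing special for ASCII) and is deprecated as of PHP 8.2.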
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation of what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html
I am using $encoding = 'utf-8'; in gettext, and in my HTML code I have set <meta charset="utf-8">. I have also set utf-8 in my .po files, but I still get � when I write æøå! What can be wrong?
Let's see how the values you mention are at the byte level.
I copied the æøå from your question and � from your title. The reason for � is that I had to use a Windows console application to fetch the title of your question and its codepage was Windows 1252 (copying from the browser gave me Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)).
In a script encoded in UTF-8, this gives:
<?php
$s = 'æøå';
$s2 = '�';
// unpack("H*", ...) returns an array; element 1 holds the hex string.
echo "s iso-8859-1 ", unpack("H*", mb_convert_encoding($s, "ISO-8859-1", "UTF-8"))[1], "\n";
echo "s2 win-1252 ", unpack("H*", mb_convert_encoding($s, "WINDOWS-1252", "UTF-8"))[1], "\n";
?>
s iso-8859-1 e6f8e5
s2 win-1252 e6f8e5
So the byte representation matches. The problem here is that when you write æøå either:
You're writing it in ISO-8859-1, instead of UTF-8. Check your text editor.
The value is being converted from UTF-8 to ISO-8859-1 (unlikely)
You need to set this
bind_textdomain_codeset($domain, "UTF-8");
Otherwise you will get the � character