I'm trying to encode a file's contents like this:
$f_file = fopen("dreams.txt", "w");
$string = "Los sueños se cumplen.";
$string_encoded = iconv( mb_detect_encoding( $string ), 'Windows-1252//TRANSLIT', $string );
fwrite($f_file, $string_encoded);
fclose($f_file);
If $string includes a special character such as "ñ" or "á", the file is saved with Windows-1252 encoding, but if $string does not include any, the file is encoded as UTF-8. I need the file with Windows-1252 encoding.
What am I doing wrong?
The first 128 characters (byte values 0 to 127) used in ASCII, ANSI (ISO-8859-1), Windows-1252, and UTF-8 are all the same, so it's impossible to tell what "the" encoding is just by looking at a document with only characters from that set: they are all equally applicable.
Modern editors will see this and go "it's 2018 so I'm going to tell you it's UTF-8", and they won't even be wrong: until you add those special characters, all these encoding schemes are interchangeable. It's not until you introduce characters with byte values above 127 that you have to be explicit about what the encoding is supposed to be.
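To make that concrete, here is a minimal sketch that compares the bytes of a string in UTF-8 and in Windows-1252. It assumes the iconv extension is available; the helper name bytes_in_both is made up for this illustration.

```php
<?php
// Hypothetical helper: hex dump of a UTF-8 string and of its Windows-1252 form.
function bytes_in_both(string $utf8): array
{
    $cp1252 = iconv('UTF-8', 'Windows-1252', $utf8);
    if ($cp1252 === false) {
        throw new RuntimeException('String cannot be represented in Windows-1252');
    }
    return [bin2hex($utf8), bin2hex($cp1252)];
}

// Only characters below 128: the bytes are identical in both encodings.
[$utf8Hex, $cp1252Hex] = bytes_in_both('Los suenos se cumplen.');
var_dump($utf8Hex === $cp1252Hex);

// "ñ" written as explicit UTF-8 bytes: two bytes in UTF-8, one in Windows-1252.
[$utf8Hex, $cp1252Hex] = bytes_in_both("\xC3\xB1");
echo $utf8Hex, ' vs ', $cp1252Hex, PHP_EOL;
```

So a file without "ñ" or "á" already is a valid Windows-1252 file; the editor simply has no way to label it differently from UTF-8.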
Related
I'm trying to save a string in hebrew to file, while having the file ANSI encoded.
All attempts failed, I'm afraid.
The PHP file itself is UTF-8.
So here's the code I'm trying:
$to_file = "בדיקה אם נרשם";
$to_file = mb_convert_encoding($to_file, "WINDOWS-1255", "UTF-8");
file_put_contents(dirname(__FILE__) ."/txt/TESTING.txt",$to_file);
This returns false for some reason.
Another attempt was :
$to_file = iconv("UTF-8", "windows-1252", $to_file);
This returns an empty string. While this did not work, changing the output charset to windows-1255 DID work. So the function itself works, but for some reason it does not convert to 1252.
I ran this function before and after the iconv and printed the results
mb_detect_encoding ($to_file);
Before the iconv, the encoding is UTF-8.
After the iconv, the encoding is ASCII(??)
I'd really appreciate any help you can give
Windows-1252 is a Latin encoding; you cannot encode Hebrew characters in Windows-1252. That's why it doesn't work.
Windows-1255 is an encoding for Hebrew, that's why it works.
The reason it doesn't work with mb_convert_encoding is that mb_ doesn't support Windows-1255.
Detecting encodings is by definition impossible. Windows-1255 is a single-byte encoding; it's virtually impossible to distinguish any one single byte encoding from another. The result is just as valid in ASCII as it is in Windows-1255 or Windows-1252 or ISO-8859 or any other single byte encoding.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more information.
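As a rough sketch of both points, the snippet below encodes a Hebrew word as Windows-1255 with iconv and then decodes the very same bytes as Windows-1252 without any error. It assumes the local iconv build knows the name "Windows-1255" (glibc does); to_cp1255 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert UTF-8 text to Windows-1255 or fail loudly.
function to_cp1255(string $utf8): string
{
    $out = iconv('UTF-8', 'Windows-1255', $utf8);
    if ($out === false) {
        throw new RuntimeException('Text cannot be represented in Windows-1255');
    }
    return $out;
}

// The word "בדיקה" written as explicit UTF-8 bytes.
$hebrew = "\xD7\x91\xD7\x93\xD7\x99\xD7\xA7\xD7\x94";

$bytes = to_cp1255($hebrew);
echo bin2hex($bytes), PHP_EOL; // one byte per letter

// The same bytes are also valid Windows-1252; they just mean Latin letters there.
echo iconv('Windows-1252', 'UTF-8', $bytes), PHP_EOL;
```

Because both interpretations succeed, no detection function can know which one was intended; the encoding has to be known from context.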
You can use this:
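<?php
// $heb must hold Windows-1255 bytes, e.g. because this file is saved as Windows-1255.
$heb = 'טקסט בעברית .. # ';
// The /e modifier was removed in PHP 7, so use preg_replace_callback instead.
// Each Hebrew letter byte 0xE0-0xFA becomes the two UTF-8 bytes 0xD7, (byte - 80).
$utf = preg_replace_callback(
    "/[\xE0-\xFA]/",
    function ($m) { return chr(215) . chr(ord($m[0]) - 80); },
    $heb
);
echo '<pre>';
print_r($heb);
echo '</pre>';
echo '------';
echo '<pre>';
print_r($utf);
echo '</pre>';
?>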
Output will be like this:
���� ������ .. # <-- $heb - what we get when we print hebrew ANSI Windows 1255
טקסט בעברית .. # <- $utf - The Converted ANSI Windows 1255 to now UTF ...:)
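If iconv is available, the same conversion can probably be done without a hand-written byte mapping. This is only a sketch: it assumes the local iconv build supports "Windows-1255", and cp1255_to_utf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decode Windows-1255 bytes into UTF-8 or fail loudly.
function cp1255_to_utf8(string $bytes): string
{
    $out = iconv('Windows-1255', 'UTF-8', $bytes);
    if ($out === false) {
        throw new RuntimeException('Input is not valid Windows-1255');
    }
    return $out;
}

// The word "טקסט" as explicit Windows-1255 bytes.
$heb = "\xE8\xF7\xF1\xE8";
echo cp1255_to_utf8($heb), PHP_EOL;
```

Unlike the regular expression, this also covers the non-letter characters that Windows-1255 defines above byte 127.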
I'm doing a simple (I thought) directory listing of files, like so:
$files = scandir(DOCROOT.'files');
foreach($files as $file)
{
echo ' <li>'.$file.PHP_EOL;
}
The problem is that the file names contain Norwegian characters (æ, ø, å), and for some reason they come out as question marks. Why is this?
I can apparently fix(?) it by doing this before I echo it out:
$file = mb_convert_encoding($file, 'UTF-8', 'pass');
But it makes little sense to me why this helps, since pass should mean no character encoding conversion is performed, according to the docs... *confused*
Here is an example: http://random.geekality.net/files/index.php
It appears the file names are encoded in ISO Latin 1, but the page is interpreted by default as UTF-8. The characters do not come out as "question marks", but as Unicode replacement characters (�). That means the browser, which tries to interpret the byte stream as UTF-8, has encountered a byte that is invalid in UTF-8 and inserts the replacement character at that point instead. Switch your browser to ISO Latin 1 and see the difference (View > Encoding > ...).
So what you need to do is to convert the strings from ISO Latin 1 to UTF-8, if you designate your page to be UTF-8 encoded. Use mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1') to do so.
Why it works if you specify the $from encoding as pass I can only guess. What you're telling mb_convert_encoding with that is to convert from pass to UTF-8. I guess that makes mb_convert_encoding take the mb_internal_encoding value as the $from encoding, which happens to be ISO Latin 1. I suppose it's equivalent to 'auto' when used as the $from parameter.
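A more explicit variant could look like the sketch below. It assumes the mbstring extension is loaded and that any name which is not valid UTF-8 is ISO-8859-1; filename_to_utf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return a file name as UTF-8.
// Assumption: names that are not already valid UTF-8 are ISO-8859-1.
function filename_to_utf8(string $name): string
{
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name; // already UTF-8 (or plain ASCII), leave untouched
    }
    return mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');
}

// "blåbær.txt" as explicit ISO-8859-1 bytes, as scandir() might return it.
$latin1 = "bl\xE5b\xE6r.txt";
echo filename_to_utf8($latin1), PHP_EOL;
```

Naming the source encoding explicitly avoids relying on whatever 'pass' or the internal encoding happens to do.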
When detecting the encoding of some text from Word (saved as a CSV file) using...
$encoding = mb_detect_encoding($value, 'WINDOWS-1252, ISO-8859-1', true);
$value = iconv($encoding, 'UTF-8//IGNORE', $value);
If a string has curly quotes, the $encoding will be set to ISO-8859-1, not WINDOWS-1252 as it should be, so the string will read "self-motivated" with funny boxes around it and not “self-motivated” in its UTF-8 encoding.
Any ideas on how to resolve this other than replacing the curly quotes, because this could affect other characters too?
Windows-1252 and ISO-8859-1 only differ in bytes 0x80 to 0x9F. Windows-1252 assigns printable characters (such as curly quotes) to most of them, while ISO-8859-1 has no printable characters there. If you know your encoding is either Windows-1252 or ISO-8859-1, you can determine which it is by the existence of such bytes. If no such bytes are included, and you know it is one of these two encodings, you can convert from either.
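That rule can be sketched as follows, assuming the input is known to be one of these two encodings and that iconv is available; latin_to_utf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert text that is either Windows-1252 or ISO-8859-1 to UTF-8.
function latin_to_utf8(string $value): string
{
    // Bytes 0x80-0x9F carry printable characters only in Windows-1252.
    $from = preg_match('/[\x80-\x9F]/', $value) === 1 ? 'Windows-1252' : 'ISO-8859-1';
    $out = iconv($from, 'UTF-8', $value);
    if ($out === false) {
        // A few bytes in that range (e.g. 0x81) may be undefined in Windows-1252.
        throw new RuntimeException("Input is not valid $from");
    }
    return $out;
}

// Curly quotes as Windows-1252 bytes 0x93 and 0x94.
echo latin_to_utf8("\x93self-motivated\x94"), PHP_EOL;
```

This sidesteps mb_detect_encoding entirely, which cannot separate the two encodings on its own.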
I once created a function to convert almost everything to UTF-8. It also has some content-sniffing functionality inside; maybe this helps you?
http://php.net/manual/function.utf8-encode.php#102382
How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything, because ASCII is already valid UTF-8.
But if you still want to convert, just to be sure that it's UTF-8, then you can use iconv
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert an ASCII string to UTF-8. More info here
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string built with code points from x00 to x7F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such a string is pointless.
Use utf8_encode(), which converts ISO-8859-1 to UTF-8 (note that it is deprecated as of PHP 8.2).
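A quick way to check whether a string falls into that category is sketched below; is_pure_ascii is a hypothetical helper name, and the mb_check_encoding call assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: true if every byte is in the range 0x00-0x7F.
function is_pure_ascii(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

$plain = 'characters';
var_dump(is_pure_ascii($plain));              // no byte above 0x7F
var_dump(mb_check_encoding($plain, 'UTF-8')); // the same bytes are valid UTF-8

// "chár" with "á" written as explicit UTF-8 bytes is not pure ASCII.
var_dump(is_pure_ascii("ch\xC3\xA1r"));
```

If the check passes, the string can be used as UTF-8 as-is, with no conversion step.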
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation of what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html
I want to convert “Cool†back to its original string. The original string is cool. (' is a backquote.)
It seems that you just forgot to specify the character encoding properly.
Because “ is what you get when the character “ (U+201C) encoded in UTF-8 (0xE2809C) is interpreted with a single-byte character encoding like Windows-1252 (default character encoding in some browsers) where 0xE2, 0x80, and 0x9C represent the characters â, €, and œ respectively.
So just make sure to specify your character encoding properly. Or if you actually want to use Windows-1252 as your output character encoding, you can convert your UTF-8 data with mb_convert_encoding, iconv or similar functions.
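The mechanism can be reproduced in a few lines. This is a sketch that assumes iconv is available; the helper names misread_as_cp1252 and undo_misread are made up for the example.

```php
<?php
// Hypothetical helper: treat UTF-8 bytes as if they were Windows-1252 text.
function misread_as_cp1252(string $utf8Bytes): string
{
    return iconv('Windows-1252', 'UTF-8', $utf8Bytes);
}

// Hypothetical helper: reverse the misinterpretation above.
function undo_misread(string $mojibake): string
{
    return iconv('UTF-8', 'Windows-1252', $mojibake);
}

$quote = "\xE2\x80\x9C";               // U+201C in UTF-8
$garbled = misread_as_cp1252($quote);  // the three characters â, €, œ
echo $garbled, PHP_EOL;
echo bin2hex(undo_misread($garbled)), PHP_EOL; // back to the original three bytes
```

The reverse step only works when every misread byte is defined in Windows-1252; the closing quote U+201D ends in byte 0x9D, which has no Windows-1252 character, so fixing the declared encoding at the source is the more reliable route.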
There's a wide variety of character encoding functions in PHP, especially if you have access to the multibyte string functions. (mbstring is thankfully enabled on most PHP installs.)
What you need to do is convert the encoding of the original string to the encoding you require, but as I don't know what encoding has been used/is required all I can suggest is that you could try using the mb_convert_encoding function, possibly after using mb_detect_encoding on the original string.
Incidentally, I'd highly recommend attempting to keep all data in UTF-8, (text files, HTML encoding, database connections/data, etc.) as you'll make your life a lot easier this way.