I encountered a problem with converting the Windows-1257 file to UTF-8. The original file has
<?xml version="1.0" encoding="windows-1257"?>
on top and I try to convert it using this code:
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
file_put_contents('data/rmtools/import/utf8/'.$files_single, $unicode_xml);
It saves the file as UTF-8, but when I open this file I still get the error:
XML parsing error: Input is not proper UTF-8, indicate encoding ! Bytes: 0x04 0x50 0x72 0x65
Is there any proper way I could convert it to readable UTF-8, or it means that there is still some symbols in the file which is NOT on UTF-8?
You're trying to convert UTF8 to UTF8//IGNORE, and that's why you're receiving that error. The first parameter is the in_charset. iconv on PHP.net Please change
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
to
$unicode_xml = iconv("CP1257", "UTF-8//IGNORE", $baltic_xml);
However I'd personally recommend you to use mb_* as iconv relies heavily on your OS's implementation of iconv and can show differences in between OS, mb_* on the other hand is pure php extension and is consistent. Making your code use mb_* changes whole to
ini_set('mbstring.substitute_character','none'); //to remove the unknown characters, in place of //IGNORE in iconv
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);
$unicode_xml = utf8_encode($unicode_xml); //to correct utf-8 bytes
$unicode_xml = preg_replace('/[^\PC\s]/u', '', $unicode_xml); //to remove control chars in case it has
file_put_contents('data/rmtools/import/utf8/' . $files_single, $unicode_xml);
According to mb supported encodings CP-1257 is not one of them, you may use ISO-8859-13 instead, however please note that there are some inconsistencies between them in some graphical characters (language characters however seem to be consistent according to wikipedia )
Related
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
I'm trying to save a string in hebrew to file, while having the file ANSI encoded.
All attemps failed I'm afraid.
The PHP file itself is UTF-8.
So here's the code I'm trying :
$to_file = "בדיקה אם נרשם";
$to_file = mb_convert_encoding($to_file, "WINDOWS-1255", "UTF-8");
file_put_contents(dirname(__FILE__) ."/txt/TESTING.txt",$to_file);
This returns false for some reason.
Another attempt was :
$to_file = iconv("UTF-8", "windows-1252", $to_file);
This returns an empty string. while this did not work, Changing the outpout charset to windows-1255 DID work. so the function itself works, But for some reason it does not convert to 1252.
I ran this function before and after the iconv and printed the results
mb_detect_encoding ($to_file);
before the iconv the encoding is UTF-8.
after the iconv the encoding is ASCII(??)
I'd really appreciate any help you can give
Windows-1252 is a Latin encoding; you cannot encode Hebrew characters in Windows-1252. That's why it doesn't work.
Windows-1255 is an encoding for Hebrew, that's why it works.
The reason it doesn't work with mb_convert_encoding is that mb_ doesn't support Windows-1255.
Detecting encodings is by definition impossible. Windows-1255 is a single-byte encoding; it's virtually impossible to distinguish any one single byte encoding from another. The result is just as valid in ASCII as it is in Windows-1255 or Windows-1252 or ISO-8859 or any other single byte encoding.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more information.
You can use this:
<?php
$heb = 'טקסט בעברית .. # ';
$utf = preg_replace("/([\xE0-\xFA])/e","chr(215).chr(ord(\${1})-80)",$heb);
echo '<pre>';
print_r($heb);
echo '<pre>';
echo '------';
echo '<pre>';
print_r($utf);
echo '<pre>';
?>
Output will be like this:
���� ������ .. # <-- $heb - what we get when we print hebrew ANSI Windows 1255
טקסט בעברית .. # <- $utf - The Converted ANSI Windows 1255 to now UTF ...:)
I have a weird problem , the following code :
$str = "נסיון" // <--- Hebrew chars
echo mb_detect_encoding ($str)."<br><br><br>";
$str = iconv (mb_detect_encoding($str),'UCS-2BE',$str);
echo mb_detect_encoding ($str)."<br><br><br>";
This will output :
UTF-8
UTF-8
This code is written in a file that's encoded (using Notepad++) in UTF-8 Without BOM, trying other encodings and didn't work.
I also tried converting the string using :
$str = mb_convert_encoding($str,'UCS-2BE');
But that didn't work either. Any insights?
From the documentation for mb_detect_order, the function that establishes the order in which mb_detect_encoding tests different encodings:
mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.
UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP
For ISO-8859-*, mbstring always detects as ISO-8859-*.
For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
So, you can't detect the encoding of the second string with the mb functions.
I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values and if none occured I would go on to letting mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encoding don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specified explicitly
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
I am trying to use file_put_contents (and file_get_contents for that matter) with a UTF-8 ¥ following this stackoverflow post: How to write file in UTF-8 format? which uses:
$data = mb_convert_encoding($data, 'UTF-8', 'OLD-ENCODING');
Which wasn't really explained well, since it produces an error of:
mb_convert_encoding(): Illegal character encoding specified
So 'OLD-ENCODING' was just a placeholder they were using.
The question I have is what encoding should I change this to? ASCII or ISO-8859-1? What encoding do most web hosts use? Does it matter?
When I open the file, I will get the symbol correctly, only if I have my notepad set with encoding UTF-8. If I open it with another character set it will show up with a "?".
Try without third parameter.
$str = mb_convert_encoding($str, "UTF-8");
Or auto:
$str = mb_convert_encoding($str, "UTF-8", "auto");
More info and examples on:
http://php.net/manual/function.mb-convert-encoding.php
mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data));
mb_detect_encoding