convert UCS-2LE BOM encoding to UTF without losing information PHP - php

I have a .csv file encoded in UCS-2LE with a BOM. I need to make some changes to it with preg_replace, so I want to convert the file to UTF-8. However, when I convert it, all spaces disappear and all the words on a line are stuck together.
My code is :
$content = file_get_contents( "myFile.csv" );
$content = mb_convert_encoding( $content, 'UCS-2LE', 'UTF-8');
What is the proper way to make the conversion so that I do not lose any spaces or characters?
Before converting - screenshot in Excel:
After converting the file:

Change the second line to this:
$content = mb_convert_encoding($content, 'UTF-8', 'UCS-2LE');
The second argument is the target encoding (to); the third is the source encoding (from).
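The argument order can be checked on a small in-memory sample before touching the real file. This is a minimal sketch; the helper name ucs2leToUtf8 and the sample bytes are illustrative and not part of the original question.

```php
<?php
// Hypothetical helper: convert UCS-2LE bytes to UTF-8.
function ucs2leToUtf8(string $bytes): string
{
    // Signature: mb_convert_encoding($string, $to_encoding, $from_encoding)
    return mb_convert_encoding($bytes, 'UTF-8', 'UCS-2LE');
}

// "a b" in UCS-2LE: every character takes two bytes, low byte first.
$sample = "a\x00 \x00b\x00";

// With the arguments in the right order the space between "a" and "b" survives.
echo ucs2leToUtf8($sample), "\n";
```

Once the content is UTF-8, preg_replace can be applied to it directly (with the /u modifier if the pattern itself contains non-ASCII characters).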

Related

Change encoding of a file to UTF-8 in PHP

I need to convert a CSV file from UCS-2LE to UTF-8 encoding. So far I've tried the following:
$str = file_get_contents($file);
$str = mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
file_put_contents($newfile, $str);
But the problem is that PHP writes the new file as UTF-8 with a BOM instead of plain UTF-8 (according to Notepad++).
Notepad++ also has an option to set the encoding to UTF-8 without the BOM.
I don't understand why PHP adds a BOM when I explicitly asked for UTF-8 only.
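One likely explanation (an assumption, since the source file is not shown) is that the UCS-2LE file itself starts with a BOM, and the conversion carries that character over as the three UTF-8 bytes EF BB BF. A minimal sketch for dropping it after conversion follows; the helper name stripUtf8Bom is hypothetical.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM if one is present.
function stripUtf8Bom(string $utf8): string
{
    $bom = "\xEF\xBB\xBF"; // U+FEFF encoded as UTF-8
    if (strncmp($utf8, $bom, 3) === 0) {
        return substr($utf8, 3);
    }
    return $utf8; // no BOM, return the input unchanged
}

// Usage sketch, following the variable names from the question:
// $str = mb_convert_encoding(file_get_contents($file), 'UTF-8', 'UCS-2LE');
// file_put_contents($newfile, stripUtf8Bom($str));
```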

DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding

I'm generating an XML file from a database that is encoded as UTF-8. However, in some specific cases it is not converted properly and I get this message:
DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding ! Bytes: 0x96 0x20 0x50 0x61 in Entity, line: 1
I have already tried every solution I could find online, from iconv to regex replacements, but none of them solves the problem. The mb_ encoding check reports ASCII, which is supposedly compatible with UTF-8, and even the file itself appears to be UTF-8.
This is the start of my file. It loads the file path from the database into the variable $xml_file; all input from the database is decoded using utf8_decode.
<?php
$content = utf8_encode(file_get_contents($xml_file));
//$encoding = mb_detect_encoding($content);
//$myXMLString = file_put_contents($xml_file, iconv('WINDOWS-1251', 'UTF-8', file_get_contents($xml_file)));
$xml_doc = new DomDocument();
$xml_doc->formatOutput = true;
$xml_doc->preserveWhiteSpace = false;
$xml_doc->loadXML($content);
?>
This only happens with some items; the others are generated correctly. However, I cannot find any particular difference between them, nor a permanent fix.
How I fixed it:
I managed to fix this by converting the content to UTF-8 again:
$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
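Note that //IGNORE silently drops every byte that is not valid UTF-8, so the offending character is lost. The byte 0x96 from the error message is an en dash in Windows-1252, so an alternative is to convert from that encoding when the input is not valid UTF-8. This is only a sketch: the helper name toValidUtf8 is hypothetical, and Windows-1252 as the real source encoding is an assumption.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as-is, otherwise convert from a fallback encoding.
// The Windows-1252 default is an assumption based on the 0x96 byte in the error message.
function toValidUtf8(string $content, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content; // already valid UTF-8, nothing to do
    }
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

// The bytes from the error message: 0x96 0x20 0x50 0x61
$fixed = toValidUtf8("\x96 Pa");
echo bin2hex($fixed), "\n"; // 0x96 becomes the three-byte UTF-8 en dash
```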

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method that looks for Latin1 values and, if none occurred, lets mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?
Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
You can specify it explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');
If you do not want to worry about what encodings you will allow, you can add them all
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
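An ordered check can also be written with mb_check_encoding, which only tests whether the bytes are valid in a given encoding and so makes the priority order explicit. This is a sketch, not a replacement for knowing the real source encoding; the helper name firstMatchingEncoding is hypothetical.

```php
<?php
// Hypothetical helper: return the first candidate encoding in which $val is valid.
function firstMatchingEncoding(string $val, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($val, $encoding)) {
            return $encoding;
        }
    }
    return null; // none of the candidates matched
}

$order = ['ASCII', 'UTF-8', 'ISO-8859-15'];
$val = "\xE9"; // a single byte above 127: not ASCII and not valid UTF-8
$from = firstMatchingEncoding($val, $order);
if ($from !== null) {
    $val = mb_convert_encoding($val, 'UTF-8', $from);
}
```

Single-byte encodings such as ISO-8859-15 accept almost any byte sequence, so they only make sense as the last entry in the list.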

how to convert an ASCII encoded string to UTF8 in php? [duplicate]

I tried to do:
file_put_contents ( $file_name, utf8_encode($data) ) ;
But when I check the file encoding from the shell with the Linux command 'file file_name',
I get: 'file_name: ASCII text'
Does that mean utf8_encode didn't work? If so, what is the right way to convert from ASCII to UTF-8?
If your string doesn't contain any non-ASCII characters, then you likely won't see differences, since UTF-8 is backwards compatible with ASCII. Try writing, for example, the text "1000 さくら" and see what happens.
Please note that utf8_encode only converts a string encoded in
ISO-8859-1 to UTF-8. A more appropriate name for it would be
"iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do
not need this function. If your text is already in UTF-8, you do not
need this function. In fact, applying this function to text that is
not encoded in ISO-8859-1 will most likely simply garble that text.
If you need to convert text from any encoding to any other encoding,
look at iconv() instead.
See http://php.net/manual/en/function.utf8-encode.php
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
Found at: Convert ASCII TO UTF-8 Encoding
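The point can be verified directly: a string that contains only bytes 0 to 127 is byte-for-byte identical in ASCII and UTF-8, which is why the file command still reports "ASCII text". A small sketch; the helper name isPureAscii is hypothetical.

```php
<?php
// Hypothetical helper: true when every byte of $s is in the 7-bit ASCII range.
function isPureAscii(string $s): bool
{
    return preg_match('/^[\x00-\x7F]*$/', $s) === 1;
}

$data = "1000 sakura";

// Converting pure ASCII to UTF-8 changes nothing, because the bytes are the same.
$converted = mb_convert_encoding($data, 'UTF-8', 'ASCII');

var_dump(isPureAscii($data));                 // only 7-bit bytes
var_dump(mb_check_encoding($data, 'UTF-8'));  // and therefore valid UTF-8
var_dump($converted === $data);               // conversion is a no-op
```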
Try this:
$data = mb_convert_encoding($data, 'UTF-8', 'ASCII');
file_put_contents ( $file_name, $data );
or use a stream filter to change the file encoding while copying (the filter name has the form convert.iconv.<from>/<to>):
$fd = fopen($file, 'r');
stream_filter_append($fd, 'convert.iconv.ASCII/UTF-8');
stream_copy_to_stream($fd, fopen($output, 'w'));
Reference: How to write file in UTF-8 format?

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding specific entities (meaning entities that are invalid in other encodings).
Try getting rid of the mb_detect_encoding line altogether.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
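The garbled output in the question is what double encoding looks like: UTF-8 bytes treated as Latin-1 and encoded to UTF-8 a second time. The sketch below reproduces it and shows a guard that encodes only once. It uses mb_convert_encoding from ISO-8859-1 in place of utf8_encode, which is deprecated as of PHP 8.2; the helper name encodeOnce is hypothetical, and ISO-8859-1 as the fallback source encoding is an assumption.

```php
<?php
$utf8 = "J\xC3\xA4rvamaa"; // "Järvamaa" as UTF-8 bytes

// Treating the UTF-8 bytes as ISO-8859-1 and encoding again garbles the text:
// each non-ASCII character turns into two characters.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Hypothetical helper: convert only when the input is not already valid UTF-8.
// ISO-8859-1 as the source encoding is an assumption, not something detected.
function encodeOnce(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex($double), "\n";            // longer than the original byte sequence
echo bin2hex(encodeOnce($utf8)), "\n";  // unchanged
```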
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is "\xEF\xBB\xBF", we should add it at the very start of the file.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // prepend the UTF-8 BOM to the content
    fputs($f, $string);
    fclose($f);
}
?>
The $file could be anything text or xml...
The $string is your UTF8 encoded string.
Try it now and it will write a UTF8 encoded file with your UTF8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>
