I have looked all over the internet and I cannot find an answer.
I am scraping thousands of CSVs from a source outside my control. The CSVs can be in ANY character encoding, so I need to convert them all to UTF-8.
I have read online that if you convert UTF-8 to UTF-8 the data gets scrambled, so what I am trying to do is detect the character encoding of the file and, if it's not UTF-8, convert it to UTF-8 (I plan to use iconv).
I have tried everything on Stack Overflow (and other sites) but I cannot seem to get the current encoding of the file.
If I use
mb_detect_encoding(file_get_contents($csvPath), mb_detect_order(), TRUE);
or
mb_detect_encoding(file_get_contents($csvPath),'auto');
Has anyone got any suggestions on how I can detect the encoding of the CSV, or a better way to convert the files without knowing the original encoding?
I've figured it out after hours of trial and error. Forget mb_detect_encoding, it's useless.
Go to the shell instead and use iconv (installed by default on OSX and Linux).
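$output = shell_exec("file --mime-encoding GBP_AUD_Week1.csv"); // prints "GBP_AUD_Week1.csv: <encoding>"
$output = trim(str_replace("$csvPath: ", '', $output)); // $csvPath holds the same file name; trim() drops the trailing newline
This gives the current file encoding.
shell_exec("iconv -f $output -t utf-8 GBP_AUD_Week1.csv > GBP_AUD_Week1Converted.csv"); // the command must be passed as a string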
Note:
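I tried to overwrite the file instead of creating a new one, but when I did this the file was blank and the encoding was binary. That is because the shell truncates the redirect target before iconv gets to read it, so the output has to go to a different file.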
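The same detect-then-convert flow can also be sketched without shelling out. This is a sketch under assumptions, not the method from the answer: convertFileToUtf8 is a hypothetical helper, it assumes the fileinfo, iconv and mbstring extensions are loaded, and it relies on the same libmagic guess that `file --mime-encoding` prints, so the detected encoding is still only a guess. It reads the whole input before writing anything, which avoids the blank-file problem from the note.

```php
<?php
// Hypothetical helper, not from the original answer.
// Returns the encoding that was detected for $sourcePath.
function convertFileToUtf8(string $sourcePath, string $targetPath): string
{
    $data = file_get_contents($sourcePath);
    if ($data === false) {
        throw new RuntimeException("Cannot read $sourcePath");
    }

    if (mb_check_encoding($data, 'UTF-8')) {
        // Already valid UTF-8 (plain ASCII included): keep the bytes unchanged.
        $encoding = 'utf-8';
    } else {
        // libmagic's guess, e.g. "iso-8859-1"; the value `file --mime-encoding` reports.
        $encoding = (new finfo(FILEINFO_MIME_ENCODING))->file($sourcePath);
        // "@" silences the warning for unusable guesses such as "binary";
        // the false return value is handled just below.
        $converted = is_string($encoding) ? @iconv($encoding, 'UTF-8', $data) : false;
        if ($converted === false) {
            throw new RuntimeException("Cannot convert $sourcePath from '$encoding'");
        }
        $data = $converted;
    }

    // Write to a temporary file and rename it, so a failed write never leaves
    // a half-written target, even when $targetPath equals $sourcePath.
    $tmpPath = $targetPath . '.tmp';
    file_put_contents($tmpPath, $data);
    rename($tmpPath, $targetPath);

    return $encoding;
}
```

Calling convertFileToUtf8($csvPath, $csvPath) would then convert a file in place; passing a different target path keeps the original untouched.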
I work on a system that automates signature generation for Outlook. The part that generates the .htm files works great. But now I also need to add files in .txt format. If I use the content without any change in the encoding, all my accented characters are converted to a different value, for example: "é" becomes "Ã©" or "ô" becomes "Ã´".
This issue clearly looked like an encoding conflict of some sort. I tried to correct it by converting the text value input to the "Windows-1252" encoding.
$myText = iconv( mb_detect_encoding( $myText ) , "Windows-1252//TRANSLIT", $myText);
But it didn't change anything. I also tried with :
$myText = mb_convert_encoding($myText, "Windows-1252");
And it didn't work either. For both of these tests, I checked the file type with Atom (my IDE) and it recognises these files as UTF-8. But when I check in the terminal with file -I signature.txt it reports this encoding: signature.txt: text/plain; charset=iso-8859-1
Note that if I manually change the encoding to Windows-1252 in Atom, the characters are correct.
Has anyone met the same problem? Is there another way in PHP to specify the encoding of the file?
I figured it out. The code to use was (as pointed out by @Powerlord):
$monTexteTXT = mb_convert_encoding($monTexteTXT, "Windows-1252", "UTF-8");
I had a false negative when I first tried this solution, because the characters seemed broken when I opened the file in my editor. But once it was opened with Outlook it was fine.
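One way to check such a conversion independently of any editor is to look at the bytes. The sketch below is illustrative only: the sample string is an assumption, and it needs the mbstring extension.

```php
<?php
// "é" written as explicit UTF-8 bytes, so the example does not depend on
// how this source file itself is saved.
$utf8 = "\xC3\xA9";

// Declaring the source encoding turns the two UTF-8 bytes into the single
// Windows-1252 byte for "é".
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
echo bin2hex($cp1252), "\n"; // e9

// Reading the UTF-8 bytes as if they were Windows-1252 reproduces the
// symptom from the question: each byte becomes its own character.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // Ã©
```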
I'm trying to save a text file as UTF-8 using Laravel's Storage facade. Unfortunately I couldn't find a way, and it saves as us-ascii. How can I save it as UTF-8?
Currently I'm using the following code to save the file:
Storage::disk('public')->put('files/test.txt', $fileData);
You should be able to prepend "\xEF\xBB\xBF" (the BOM which marks it as UTF-8) to your $fileData. So:
Storage::disk('public')->put('files/test.txt', "\xEF\xBB\xBF" . $fileData);
There are other ways to convert your text before writing it to the file, but this is the simplest and easiest to read and execute. As far as I know, there are also no character encoding methods within Illuminate\Filesystem\Filesystem.
For more information: https://stackoverflow.com/a/9047876/823549 and What's different between UTF-8 and UTF-8 without BOM?.
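A minimal sketch of the BOM approach, kept independent of Laravel so it runs on its own. The constant UTF8_BOM and the helper withUtf8Bom are hypothetical names introduced here; the only addition over the answer is a check that avoids writing the BOM twice.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: prepend the BOM unless the data already starts with one.
function withUtf8Bom(string $data): string
{
    return strncmp($data, UTF8_BOM, 3) === 0 ? $data : UTF8_BOM . $data;
}

$fileData = withUtf8Bom("plain text");
echo bin2hex(substr($fileData, 0, 3)), "\n"; // efbbbf
```

The result can then be handed to the put() call from the answer in place of the manual concatenation.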
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
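The subset relationship is easy to check. The following is a small sketch with made-up sample data, assuming the mbstring and iconv extensions are available.

```php
<?php
$ascii = "id,price\n1,9.99\n";

// Every byte is below 0x80, so the string is already valid UTF-8 ...
$asciiIsUtf8 = mb_check_encoding($ascii, 'UTF-8');
// ... and "converting" it to UTF-8 yields exactly the same bytes.
$unchanged = iconv('ASCII', 'UTF-8', $ascii) === $ascii;

// A lone 0xE9 byte ("é" in Latin-1) is neither ASCII nor valid UTF-8,
// so a file containing it needs a real transcoding step.
$latin1 = "caf\xE9";
$latin1IsUtf8 = mb_check_encoding($latin1, 'UTF-8');

var_dump($asciiIsUtf8, $unchanged, $latin1IsUtf8); // true, true, false
```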
I recommend using mb_convert_encoding instead:
$fileData = mb_convert_encoding($fileData, "UTF-8", "auto");
Storage::disk('public')->put('files/test.txt', $fileData);
I have a txt file that contains Greek characters. When I open the file with Notepad it shows that the encoding is ASCII.
But the only way I can read the Greek characters is to change the character set to DOS737 (in OpenOffice Writer or EditPad Lite).
The process I need to implement in PHP is to open the file, split the text and import it into a database. Everything is OK except that I cannot get the Greek characters as they are.
I tried iconv but with no result.
I also tried mb_convert_encoding($data[0], "DOS737"); but I get the warning mb_convert_encoding(): Unknown encoding "DOS737"
I also tried utf8_encode, but with no luck.
Any suggestions?
Finally found it.
It was easy... For anyone who might have the same issue, use iconv("cp737", "UTF-8", $string);
Using PHP CLI, this works well:
$result = iconv (LATIN1, 'UTF-8', N�n��;M�tt);
Result is: Nönüß
This also works for CP437, Windows, Macintosh etc.
On Apache, the SAME code results in:
$result = iconv (LATIN1, 'UTF-8', N�n��;M�tt);
Result is: NÃ¶nÃ¼ÃŸ
I googled around and added setlocale(LC_ALL, "en_US.utf8"); to the script, but it made no difference. Thanks for helping!
I run Debian Linux with apache2 and PHP 5.4. I am trying to convert different CSV files into UTF-8 for processing as they are being uploaded.
UPDATE: I found my own solution.
$result = utf8_decode (iconv (LATIN1, 'UTF-8', N�n��;M�tt));
utf8_decode makes it show up correctly in the browser and when saved to the MySQL DB.
There are always two sides to encoding: the encoded string, and the entity interpreting this encoded string into readable characters! This "entity", as I'll ambiguously call it, can be the database, the browser, your text editor, the console, or whatever else.
$result = iconv('LATIN1', 'UTF-8', 'N�n��;M�tt');
Result is: Nönüß
Not sure where you're getting 'N�n��;M�tt' from exactly, but the UNICODE REPLACEMENT CHARACTERS � in there indicate that you're trying to interpret this string as UTF-8, but the string is not actually UTF-8 encoded. Using iconv to convert it from Latin-1 to UTF-8 makes the correct characters appear - that means the string was originally Latin-1 encoded and converting it to your expected encoding solved the discrepancy.
On apache, the SAME code results in NÃ¶nÃ¼ÃŸ
That means the interpreting party here is not interpreting the string as UTF-8 this time, even though the string is UTF-8. I assume by "Apache" you mean "in the browser". You need to tell your browser through HTTP headers or HTML meta tags that it's supposed to interpret the text as UTF-8.
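A minimal sketch of both sides, assuming the script produces an HTML response and that the iconv extension knows the charset name WINDOWS-1252 (browsers generally treat a Latin-1 label as Windows-1252). The variable names are illustrative.

```php
<?php
// Tell the interpreting side (the browser) which encoding the body uses.
// This has to happen before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$latin1 = "N\xF6n\xFC\xDF";                      // "Nönüß" as Latin-1 bytes
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);   // the conversion from the question
echo $utf8, "\n";

// What a reader sees when it takes those UTF-8 bytes for Windows-1252:
// every byte of a two-byte sequence is shown as a separate character.
$misread = iconv('WINDOWS-1252', 'UTF-8', $utf8);
```

The HTML equivalent of the header is a meta charset="utf-8" tag in the document head.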
I found my own solution.
$result = utf8_decode (iconv (LATIN1, 'UTF-8', N�n��;M�tt));
Guess what utf8_decode does. It converts the encoding of a string from UTF-8 to Latin-1. So the above code converts Latin-1 to ... Latin-1.
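The round trip can be verified on the bytes. This sketch uses mb_convert_encoding for the UTF-8 to Latin-1 step, which performs the same conversion as utf8_decode (deprecated as of PHP 8.2); the sample bytes are an assumption standing in for the garbled input.

```php
<?php
$original = "N\xF6n\xFC\xDF";                          // Latin-1 input bytes
$asUtf8 = iconv('ISO-8859-1', 'UTF-8', $original);     // Latin-1 -> UTF-8
$backAgain = mb_convert_encoding($asUtf8, 'ISO-8859-1', 'UTF-8'); // UTF-8 -> Latin-1

// The two conversions cancel out: the result is the untouched Latin-1 input.
var_dump($backAgain === $original); // bool(true)
```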
Please read the following:
UTF-8 all the way through
Handling Unicode Front To Back In A Web App
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
I have searched all over the Internet and SO, still no luck in the following:
I would like to know, how to properly save a file using file_put_contents when filename has some unicode characters. (Windows 7 as OS)
$string = "jérôme.jpg" ; //UTF-8 string
file_put_contents("images/" . $string, "stuff");
Results in a file:
jГ©rГґme.jpg
I tried all possible combinations of functions such as iconv and mb_convert_encoding with all possible encodings, and converted the source file into different encodings as well.
All proper headers are set, browser recognises UTF-8 properly.
However, I can successfully copy-paste and create a file with such a name in Explorer's GUI, so how can I do it via PHP?
The last-resort solution was to urlencode the string and save the file under that name.
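For reference, the URL-encoding fallback could look like the sketch below. It uses rawurlencode rather than urlencode, which is a choice made here and not stated in the question; the variable names are illustrative.

```php
<?php
$name = "j\xC3\xA9r\xC3\xB4me.jpg";   // "jérôme.jpg" as UTF-8 bytes

// Percent-encode every non-ASCII byte so the file name is pure ASCII.
$safeName = rawurlencode($name);
echo $safeName, "\n"; // j%C3%A9r%C3%B4me.jpg

// The original name can be recovered later for display.
$restored = rawurldecode($safeName);
```

The encoded name can then be passed to file_put_contents("images/" . $safeName, "stuff"), at the cost of a file name that is hard to read in Explorer.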
This might be late, but I just found a solution that closes this painful issue for me as well.
Forget about iconv and multibyte solutions; the problem is on Windows! (In the link you'll find all its beauty about this.)
After numerous attempts to solve this, I came across URLify and decided that the best way to cope with Unicode in filenames is to transliterate them before writing to the file.
Example of transliterating a filename before saving it:
$filename = "Αρχείο.php"; // greek name for 'file'
echo URLify::filter($filename,128,"",TRUE);
// output: arxeio.php