PHP encoding issue reading files - ISO-8859-1 & UTF-8 Conflict

I have a PHP file that reads a CSV file, sent via an API, that I'm assuming is in UTF-8. I'm using fopen() to read it.
The issue is that my output comes back as:
IU?Q?JL?.?/Q?R??/)?J-.?))VH?/OM?K-NI?T0?P?*ͩT0204jzԴ?H???X???# D??K
I checked my php5 config settings. The default is already UTF-8:
; php.net/default-charset
;default_charset = "UTF-8"
I also changed ISO-8859-1 to UTF-8 below:
[iconv]
;iconv.input_encoding = UTF-8
;iconv.internal_encoding = UTF-8
;iconv.output_encoding = UTF-8
;mssql.charset = "UTF-8"
The output is still the same. Any suggestions or steps I could take to solve the issue?

I have never opened files with PHP, but have you tried reading the file with fgets() as well?
$data = fopen($file, "r");
fgets($data);

If you are just reading from the source, there is no problem, as PHP doesn't make any encoding assumptions about strings. So if your source sends you the data as UTF-8, it is UTF-8. The default_charset setting in PHP is just a header sent before your page, which can be overridden in a number of ways. Check whether your browser is actually showing the page in the correct encoding: in Chrome, go to the menu More Tools / Encoding, where you'll see the encoding that is being used.

I had to use the compress.zlib stream wrapper to solve the issue:
$f_pointer=fopen("compress.zlib://URL","r");
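The compress.zlib wrapper decompresses gzip data transparently while reading, which would explain why the raw output looked like an encoding problem. A minimal sketch, assuming the zlib extension is enabled; the temporary file and its contents are made up for illustration:

```php
<?php
// Hypothetical sample: a small gzip-compressed CSV in the temp directory.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, gzencode("id,name\n1,T\xC3\xB3th\n"));

// The raw bytes start with the gzip magic number 0x1f 0x8b, not with text.
$raw = file_get_contents($path);
$isGzip = strncmp($raw, "\x1f\x8b", 2) === 0;

// Reading through compress.zlib:// yields the decompressed lines.
$lines = [];
$fp = fopen('compress.zlib://' . $path, 'r');
while (($line = fgets($fp)) !== false) {
    $lines[] = rtrim($line, "\r\n");
}
fclose($fp);
unlink($path);

var_dump($isGzip);
print_r($lines);
```

Checking the first two bytes for the gzip magic number is a quick way to tell compressed data apart from a genuine charset mismatch before changing any encoding settings.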

Related

PHP read a line from a csv file return wrong in charset

I have a CSV file. If I set the charset to ISO-8859-2 (Eastern Europe) in LibreOffice Calc, it renders the characters correctly, but the server's locale is set to EN-UK.
I cannot read the characters correctly in PHP. For example:
it returns T�t instead of Tót.
I tried many things, like:
echo (mb_detect_encoding("T�t","ISO-8859-2","UTF-8"));
I know the character probably does not exist in UTF-8, but I tried.
I also tried to set the correct charset in the header:
header('Content-Type: text/html; charset=iso-8859-2');
echo "T�th";
but it returns TÄĹźËth instead of Tóth.
Please help me solve this. Thanks in advance.

I advise against setting the header to charset=iso-8859-2. It is usual to work with UTF-8. If the data is available in a different encoding, it should be converted to UTF-8 and then processed as CSV. The following example code can be kept this simple because the newline characters in UTF-8 and ISO-8859-2 are the same.
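$fileName = "yourpath/Iso8859_2.csv";
$fp = fopen($fileName, "r");
// Compare against false so that a line containing only "0" does not end the loop.
while (($row = fgets($fp)) !== false) {
    // Convert each line from ISO-8859-2 to UTF-8 before parsing it as CSV.
    $strUtf8 = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-2');
    $arr = str_getcsv($strUtf8);
    var_dump($arr);
}
fclose($fp);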
The exact encoding of the CSV file must be known. mb_detect_encoding is not suitable for determining the encoding of a file.
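To illustrate why the encoding has to be known rather than guessed: the same byte is a valid character in several single-byte encodings, so the bytes alone cannot tell them apart. A small sketch, assuming the mbstring extension; the sample byte 0xF5 is chosen for illustration only:

```php
<?php
// One byte, two equally valid readings.
$byte = "\xF5";

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'); // o with tilde (U+00F5)
$asLatin2 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-2'); // o with double acute (U+0151)

echo bin2hex($asLatin1), PHP_EOL;
echo bin2hex($asLatin2), PHP_EOL;
```

Both conversions succeed without any error, which is why a wrong guess produces plausible-looking but incorrect text instead of a failure.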

I get an Ansi string instead of Utf-8 from Utf-8 mysql table [duplicate]

When I moved from PHP/MySQL shared hosting to my own VPS, I found that the code which outputs user names in UTF-8 from the MySQL database outputs ?�??????� instead of 鬼神❗. My page has UTF-8 encoding, and I have default_charset = "UTF-8" in php.ini, header('Content-Type: text/html; charset=utf-8'); in my PHP file, and <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the HTML part of it.
My database has the collation utf8_bin, and the table has the same. On both the previous and the current hosting, phpMyAdmin shows this database record as й¬јзҐћвќ—. When I create an ANSI text file in Notepad++, paste й¬јзҐћвќ— into it and select the Encoding->Encode in UTF-8 menu, I see 鬼神❗, so I suppose it is a correctly encoded UTF-8 string.
Ok, and then I added
init_connect='SET collation_connection = utf8_general_bin'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_bin
skip-character-set-client-handshake
to my.cnf, and now my page shows й¬јзҐћвќ— instead of ?�??????�. This is the same output I get in phpMyAdmin on both hostings, so I'm on the right track. Still, on my old hosting the same PHP script returns a UTF-8 web page with the name 鬼神❗, while on the new hosting it shows й¬јзҐћвќ—. It looks like the string is UTF-8 encoded twice: I get a UTF-8 string, give it to Notepad++ as an ANSI string, and it encodes it into the correct UTF-8 string.
However when I try utf8_encode() I get Ð¹Â¬ÑÐ·ÒÑÐ²Ñâ, and utf8_decode() returns ?�???????. The same result return mb_convert_encoding($name,"UTF-8","ISO-8859-1"); and iconv( "ISO-8859-1","UTF-8", $name);.
So how could I reproduce the same conversion Notepad++ does?
See answer below.

The solution was simple yet not obvious to me, as I never saw my.cnf on that shared hosting. It seems that the server had the following settings:
init_connect='SET collation_connection = cp1252'
init_connect='SET NAMES cp1252'
character-set-server=cp1252
So, to do no harm to other code on my new server, I have to place mysql_query("SET NAMES CP1252"); at the top of each PHP script that works with UTF-8 strings.
The trick here is that the script gets the string as is (ANSI) and outputs it, and since the browser is told that the page is UTF-8 encoded, it simply renders my strings as UTF-8.
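On the question of reproducing the Notepad++ conversion in PHP: the garbled form shown in phpMyAdmin matches the UTF-8 bytes of 鬼神❗ read one byte at a time as Windows-1251 characters. This is an observation about the strings quoted above, not something the poster states, and it would explain why utf8_decode() did not help, since that function assumes ISO-8859-1. A minimal sketch, assuming the mbstring extension is available:

```php
<?php
// Bytes of the intended name in UTF-8 (U+9B3C, U+795E, U+2757).
$intended = "\xE9\xAC\xBC\xE7\xA5\x9E\xE2\x9D\x97";

// Simulate the damage: treat each UTF-8 byte as a Windows-1251 character
// and re-encode the result as UTF-8 (the "double encoding").
$garbled = mb_convert_encoding($intended, 'UTF-8', 'Windows-1251');

// Reverse it: map each character back to its single Windows-1251 byte.
$repaired = mb_convert_encoding($garbled, 'Windows-1251', 'UTF-8');

echo bin2hex($repaired), PHP_EOL;
```

The reverse step is the same thing Notepad++ does when an "ANSI" file is re-interpreted as UTF-8 on a system whose ANSI code page is Windows-1251.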

UTF-8 dates doesn't encode properly

I am having a hard time with character sets. I suspect that my function that displays the date returns a non-UTF-8 character (août is shown with a question mark inside a diamond, or as aoÃ»t).
When working on my local server everything is fine, but when I push my code to my staging server, it does not display properly.
My PHP files are saved as UTF-8 without BOM.
If I inspect my output page, the headers indicate UTF-8.
My local machine is a Mac with MAMP installed, and my staging server runs CentOS with cPanel.
Here is the part I suspect is causing the problem:
$langCode = "fr_FR"; /* Also tried fr_FR.UTF-8 */
setlocale(LC_ALL, $langCode);
$monthName = _(strftime("%B",strtotime($dateStr)));
echo $monthName; /* Also tried utf8_encode($monthName): it worked on my staging server but not on my local server! I'm using */

I finally found how to track down the bug and fix it.
setlocale(LC_ALL, 'fr_FR');
var_dump(mb_detect_encoding(_(strftime("%B",strtotime($dateStr)))));
The dump returned UTF-8 on my local server and FALSE on the staging server.
The PHP.net documentation for mb_detect_encoding() says:
Return Values
The detected character encoding or FALSE if the encoding cannot be detected from the given string.
So the charset can't be detected. I tried to force it again:
setlocale(LC_ALL, 'fr_FR.UTF-8');
var_dump(mb_detect_encoding(_(strftime("%B",strtotime($dateStr)))));
This time the dump returned UTF-8 on both the local and the staging server. So I rolled back my code to see why it had not worked the first time I tried fr_FR.UTF-8, and I realized I had been using utf8_encode() at the time. As user deceze points out in a comment on this function's documentation:
In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.
Thank you for your help, everyone!
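Since the same code produced ISO-8859-1 output on one server and UTF-8 output on the other, one defensive option is to convert only when the string is not already valid UTF-8. The helper name ensureUtf8 and the ISO-8859-1 fallback are assumptions for illustration, and the sketch requires the mbstring extension:

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not already valid UTF-8.
// The ISO-8859-1 fallback is an assumption; use the encoding the source really has.
function ensureUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8; converting again would garble it
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$fromLatin1Locale = "ao\xFBt";     // the month name as an ISO-8859-1 locale would return it
$fromUtf8Locale   = "ao\xC3\xBBt"; // the month name as a UTF-8 locale would return it

echo ensureUtf8($fromLatin1Locale), PHP_EOL;
echo ensureUtf8($fromUtf8Locale), PHP_EOL;
```

Unlike an unconditional utf8_encode(), this leaves text that is already UTF-8 untouched, so it behaves the same on both servers.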

Put this meta tag in your HTML code, inside <head></head>:
<meta charset="UTF-8">

It seems your server is configured to send the header
content-type: text/html; charset=UTF-8
by default. You could change your server configuration, or you could add the following at the very start
<?php
header("content-type: text/html; charset=UTF-8");
?>
to set this header yourself.

You need to use:
<?php
$conn = mysql_connect("localhost","root","root");
mysql_select_db("test");
mysql_query("SET NAMES 'utf8'", $conn);//put this line after you select db.

Accents in uploaded file being replaced with '?'

I am building a data import tool for the admin section of a website I am working on. The data is in both French and English, and contains many accented characters. Whenever I attempt to upload a file, parse the data, and store it in my MySQL database, the accents are replaced with '?'.
I have text files containing data (charset is iso-8859-1) which I upload to my server using CodeIgniter's file upload library. I then read the file in PHP.
My code is similar to this:
$this->upload->do_upload();
$data = array('upload_data' => $this->upload->data());
$fileHandle = fopen($data['upload_data']['full_path'], "r");
while (($line = fgets($fileHandle)) !== false) {
    echo $line;
}
This produces lines with accents replaced with '?'. Everything else is correct.
If I download my uploaded file from my server over FTP, the charset is still iso-8859-1, but a diff reveals that the file has changed. However, if I open the file in TextEdit, it displays properly.
I attempted to use PHP's stream_encoding method to explicitly set my file stream to iso-8859-1, but my build of PHP does not have the method.
After running out of ideas, I tried wrapping my strings in both utf8_encode and utf8_decode. Neither worked.
If anyone has any suggestions about things I could try, I would be extremely grateful.

It's important to see whether the corruption happens before or after the query is issued to MySQL. There are too many possible things happening here to pinpoint it. Are you able to output your MySQL query to check this?
Assuming that your query IS properly formed (no corruption at the stage where the query is output), there are a couple of things that you should check:
What is the character encoding of the database itself (collation)?
What is the charset of the connection? This may not be set up correctly in your MySQL config and can be set manually using the 'SET NAMES' command.
In my own application I issue a 'SET NAMES utf8' as my first query after establishing a connection, as I am unable to change the MySQL config.
See this:
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
Edit: If the issue is not related to mysql I'd check the following
You say the encoding of the file is 'charset is iso-8859-1' - can I ask how you are sure of this?
What happens if you save the file itself as utf8 (Without BOM) and try to reprocess it?
What is the encoding of the php file that is performing the conversion? (What are you using to write your php - it may be 'managing' this for you in an undesired way)
(an aside) Are the files you are processing suitable for processing using fgetcsv instead?
http://php.net/manual/en/function.fgetcsv.php

Files uploaded to your server should be returned the same on download. That means, the encoding of the file (which is just a bunch of binary data) should not be changed. Instead you should take care that you are able to store the binary information of that file unchanged.
To achieve that with your database, create a BLOB field. That's the right column type for it. It's just binary data.
Assuming you're using MySQL, this is the reference: The BLOB and TEXT Types, look out for BLOB.

The problem is that your data is in iso-8859-1 while it is presumably treated as utf-8 further down the line. To convert it to the correct charset, you can use the iconv function, like so:
$output_string = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $input_string);
iso-8859-1 does encode Western European accented letters, but as single bytes that are not valid utf-8, so they show up as '?' when the text is treated as utf-8.
It would be so much better if everything were utf-8, as it handles virtually every character known to man.
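For the line-by-line loop in the question, one way to get UTF-8 lines without the unavailable stream_encoding function is PHP's built-in convert.iconv stream filter. A minimal sketch, assuming the iconv extension is enabled; the temporary file and its single sample line are made up for illustration:

```php
<?php
// Hypothetical sample file: one line of ISO-8859-1 text (e-acute stored as byte 0xE9).
$path = tempnam(sys_get_temp_dir(), 'imp');
file_put_contents($path, "caf\xE9\n");

$fileHandle = fopen($path, 'r');
// Convert everything read from the stream from ISO-8859-1 to UTF-8.
stream_filter_append($fileHandle, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);

$lines = [];
while (($line = fgets($fileHandle)) !== false) {
    $lines[] = rtrim($line, "\r\n");
}
fclose($fileHandle);
unlink($path);

echo bin2hex($lines[0]), PHP_EOL;
```

With the filter attached, the rest of the import code only ever sees UTF-8, which should match a database connection that has been set to utf8.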

fwrite() and UTF8

I am creating a file using PHP's fwrite(), and I know all my data is in UTF-8 (I have done extensive testing on this: when saving data to the DB and outputting it on a normal web page, everything works fine and reports as UTF-8), but I am being told the file I am outputting contains non-UTF-8 data :( Is there a command in bash (CentOS) to check the format of a file?
When using vim it shows the content as:
Donâ~#~Yt do anything .... Itâ~#~Ys a
great site with
everything....Weâ~#~Yve only just
launched/
Any help would be appreciated: Either confirming the file is UTF8 or how to write utf8 content to a file.
UPDATE
To clarify how I know I have data in UTF-8, I have done the following:
The DB is set to utf8.
When saving data to the database I run this first:
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "UTF-8", $enc);
Just before I run fwrite() I check the data with the following. Note that each piece of data returns 'IS utf-8':
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'NOT UTF-8';
else print 'IS utf-8';
Thanks!

If you know the data is in UTF-8, then you want to mark the file as such.
I wrote a solution in answer to another thread.
The solution is the following: as the UTF-8 byte-order mark is \xef\xbb\xbf, we should add it to the start of the document.
<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // prepend the BOM: this is what makes the magic
    fputs($f, $string);
    fclose($f);
}
?>
You can adapt it to your code. Basically, you just want to make sure that you write a UTF-8 file (as you said you know your content is UTF-8 encoded).

fwrite() itself is binary-safe, but whether your data reaches the file unchanged also depends on the mode the file was opened in. In text mode, your data, be it correctly encoded or not, might get altered on the way (for example, by line-ending translation on some systems).
To be on the safe side, you should use fopen() with the binary mode flag, which is b. Afterwards, fwrite() will save your string data "as-is", which in PHP has so far meant binary data, because strings in PHP are binary strings.
Background: some systems distinguish between text and binary data. The binary flag explicitly tells PHP on such systems to use binary output. When you deal with UTF-8, you should take care that the data does not get mangled. That is prevented by handling the string data as binary data.
However, if, contrary to what you said in your question, the UTF-8 encoding of the data is not preserved, then your encoding is already broken, and even binary-safe handling will keep the broken state. Still, with the binary flag you ensure that it is not the fwrite() part of your application that is breaking things.
It has rightly been written in another answer here that you cannot know the encoding if you have only the data. However, you can check whether the data is valid UTF-8, which gives you at least some chance to check the encoding. I've posted a PHP function that does this in a UTF-8 related question, so it might be of use if you need to debug things: Answer to: SimpleXML and Chinese. Look for can_be_valid_utf8_statemachine, that's the name of the function.
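As a stand-in for that validation step (this is not the can_be_valid_utf8_statemachine function mentioned above), mbstring's mb_check_encoding() can report whether a string is well-formed UTF-8. A minimal sketch, assuming the mbstring extension; the helper name isValidUtf8 and the sample strings are made up for illustration:

```php
<?php
// Hypothetical wrapper around mbstring's validity check.
function isValidUtf8(string $data): bool
{
    return mb_check_encoding($data, 'UTF-8');
}

$curlyQuote = "Don\xE2\x80\x99t"; // UTF-8 text with a right single quotation mark
$latin1     = "caf\xE9";          // ISO-8859-1 text, not valid UTF-8
$ascii      = "plain text";       // pure ASCII, which is also valid UTF-8

var_dump(isValidUtf8($curlyQuote)); // bool(true)
var_dump(isValidUtf8($latin1));     // bool(false)
var_dump(isValidUtf8($ascii));      // bool(true)

// The strlen()/mb_strlen() comparison only reports whether multi-byte
// sequences are present: pure ASCII is valid UTF-8 but has equal lengths.
var_dump(strlen($ascii) === mb_strlen($ascii, 'UTF-8')); // bool(true)
```

Note that validity alone does not rule out double encoding: text that was encoded twice is still well-formed UTF-8, just with the wrong characters.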

//add BOM to fix UTF-8 in Excel
fputs($fp, $bom =( chr(0xEF) . chr(0xBB) . chr(0xBF) ));
I find this piece works for me :)

The problem is your data is double encoded. I assume your original text is something like:
Don’t do anything
with ’, i.e., not the straight apostrophe, but the right single quotation mark.
If you write a PHP script with this content and encoded in UTF-8:
<?php
//File in UTF-8
echo utf8_encode("Don’t"); //this will double encode
You will get something similar to your output.
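The effect can be reproduced at the byte level. The sketch below uses mb_convert_encoding() with ISO-8859-1 as the source encoding, which performs the same conversion as utf8_encode() (deprecated since PHP 8.2), and then undoes one level of encoding. It assumes the mbstring extension:

```php
<?php
$original = "Don\xE2\x80\x99t"; // UTF-8 text with a right single quotation mark

// Same effect as utf8_encode(): each byte is treated as an ISO-8859-1 character.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo one level of encoding.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), PHP_EOL; // 446f6ee2809974
echo bin2hex($double), PHP_EOL;   // 446f6ec3a2c280c29974
echo bin2hex($restored), PHP_EOL; // 446f6ee2809974
```

The three bytes of the quotation mark become six bytes after the second encoding, which is what editors then display as an accented "a" followed by control characters.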

$handle = fopen($file,"w");
fwrite($handle, pack("CCC",0xef,0xbb,0xbf)); // UTF-8 BOM
fwrite($handle,$data); // $data: the UTF-8 content to write (variable name assumed)
fclose($handle);

I know all my data is in UTF8 - wrong.
Encoding is not the format of a file. So check the charset in the headers of the page you are taking the data from:
header("Content-type: text/html; charset=utf-8;");
And check whether the data really is in a multi-byte encoding:
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'not UTF-8';
else print 'utf-8';

There may be a reason:
first, the information you get from the database may not be UTF-8.
If you are sure that it is, use this. I always use it and it works:
$file= fopen('../logs/logs.txt','a');
fwrite($file,PHP_EOL."_____________________output_____________________".PHP_EOL);
fwrite($file,print_r($value,true));

The only thing I had to do was add a UTF-8 BOM to the CSV. The data was correct, but the file reader (an external application) couldn't read the file properly without the BOM.

Try this simple method: add the following at the top of the page, before the <body> tag:
<head>
<meta charset="utf-8">
</head>
