I was trying to write UTF-8 text read from a SQLite database to a text file using the fwrite function, but with no luck at all.
When I echo the content to the browser I can read it with no problem. As a last resort, I created the same tables in a MySQL database, and surprisingly it worked!
What could be the cause, and how can I debug this so that I can use the SQLite DB?
I am using PDO.
Below is the code I am using to read from DB and write to file:
$arFile = realpath(APP_PATH.'output/Arabic.txt');
$arfh = fopen($arFile, 'w');
$arTxt = '';
$key = 'somekey';
$sql = 'SELECT ot.langv AS orgv, et.langv AS engv, at.langv AS arbv FROM original ot LEFT JOIN en_vals et ON ot.langk=et.langk
LEFT JOIN ar_vals at ON ot.langk=at.langk
WHERE ot.langk=:key';
$stm = $dbh->prepare($sql);
$stm->execute(array(':key'=>$key));
if( $row = $stm->fetch(PDO::FETCH_ASSOC) ){
$arTxt .= '$_LANG["'.$key.'"] = "'.special_escape($row['arbv']).'";'."\n";
}
fwrite( $arfh, $arTxt);
fclose($arfh);
SQLite stores text into the database as it receives it. So if you store UTF-8 encoded text into the SQLite database, you can read UTF-8 text from it.
If you store, let's say, LATIN-1 text into the database, then you can read LATIN-1 text from it.
SQLite itself does not care. So you get out what you put in.
Since you write in your question that the text looks good in the browser, I would assume that at least some validly encoded values have been stored in the database. Check which encoding your browser reports for the page when you view that data.
If it says UTF-8, then fine. You might simply be viewing the text file in an editor that does not support UTF-8, because fwrite does not care about the encoding either: it just puts the string data into the file and that's it.
Unless you provide additional information with your question, it's hard to say anything more specific.
See as well: How to change character encoding of a PDO/SQLite connection in PHP?
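One way to check the "you get out what you put in" claim is to look at the raw bytes of a value right before it is written. The helper below is a minimal sketch, not part of the original answer; the name describeBytes is hypothetical.

```php
<?php
// Hypothetical helper: summarise the raw bytes of a string so that the value
// read from the database can be compared with what ends up in the file.
function describeBytes($s)
{
    // With the /u modifier, preg_match() returns 1 for valid UTF-8 subjects
    // and false when the subject contains invalid UTF-8.
    $valid = preg_match('//u', $s) === 1;

    return sprintf(
        '%d bytes, hex=%s, valid UTF-8: %s',
        strlen($s),
        bin2hex($s),
        $valid ? 'yes' : 'no'
    );
}
```

Calling describeBytes($row['arbv']) just before fwrite and comparing the hex with a hex dump of the output file should show whether the bytes change on the way to disk or whether only the viewer interprets them differently.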
I was surprised that I was unable to find a straightforward answer to this question by searching.
I have a web application in PHP that takes user input. Due to the nature of the application, users may often use extended ASCII characters (a.k.a. "ALT codes").
My specific issue at the moment is with ALT code 26, which is a right arrow (→). This will be accompanied with other text to be stored in the same field (for example, 'this→that').
My column type is NVARCHAR.
Here's what I've tried:
I've tried doing no conversions and just inserting the value as normal, but the value gets stored as thisâ??that.
I've tried converting the value to UCS-2 in PHP using iconv('UTF-8', 'UCS-2', $value), but I get an error saying Unclosed quotation mark after the character string 't'.. The query ends up looking like this: UPDATE myTable SET myColumn = 'this�!that'.
I've tried doing the above conversion and then adding an N before the quoted value, but I get the same error message. The query looks like this: UPDATE myTable SET myColumn = N'this�!that'.
I've tried removing the UCS-2 conversion and just adding the N before the quoted value, and the query works again, but the value is stored as thisâ that.
I've tried using utf8_decode($value) in PHP, but then the arrow is just replaced with a question mark.
So can anyone answer the (seemingly simple) question of, how can I store this value in my database and then retrieve it as it was originally typed?
I'm using PHP 5.5 and MSSQL 2012. If any question of driver/OS version comes into play, it's a Linux server connecting via FreeTDS. There is no possibility of changing this.
You might try base64-encoding the input. This is fairly trivial to handle with PHP's base64_encode() and base64_decode(), and it should handle whatever your users throw at it.
(edit: You can apparently also do the base64 encoding on the SQL Server side. This doesn't seem like something it should be responsible for imho, but it's an option.)
It seems like your freetds.conf is wrong. You need a TDS protocol version >= 7.0 to support unicode. See this for more details.
Edit your freetds.conf:
[global]
# TDS protocol version
tds version = 7.4
client charset = UTF-8
Also make sure to configure PHP correctly:
ini_set('mssql.charset', 'UTF-8');
The accepted answer seems to do the job: yes, you can encode the value to base64 and decode it again, but then every application that uses that remote database has to change to support base64-encoded fields. If there is a remote MS SQL Server database, there may well be another application (or applications) using it, and each of them would have to be changed to handle both plain text and base64-encoded text.
I searched a little and found how to send Unicode text to MS SQL Server by using PHP to convert the Unicode bytes to hex numbers and passing them in the MS SQL command.
If you go at the PHP documentation for the mssql_fetch_array (http://php.net/manual/ru/function.mssql-fetch-array.php#80076), you'll see at the comments a pretty good solution that converts the text to UNICODE HEX values and then sends that HEX data directly to MS SQL Server like this:
Convert Unicode Text to HEX Data
// sending data to database
$utf8 = 'Δοκιμή με unicode → Test with Unicode'; // some Greek text for example
$ucs2 = iconv('UTF-8', 'UCS-2LE', $utf8);
// converting UCS-2 string into "binary" hexadecimal form
$arr = unpack('H*hex', $ucs2);
$hex = "0x{$arr['hex']}";
// IMPORTANT!
// please note that value must be passed without apostrophes
// it should be "... values(0x0123456789ABCEF) ...", not "... values('0x0123456789ABCEF') ..."
mssql_query("INSERT INTO mytable (myfield) VALUES ({$hex})", $link);
Now all the text actually is stored to the NVARCHAR database field correctly as UNICODE, and that's all you have to do in order to send and store it as plain text and not encoded.
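The conversion step can be wrapped in a small function so that the hex literal is built in one place. This is only a sketch of the technique described above; the function name toNvarcharHexLiteral is hypothetical, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: turn a UTF-8 string into a 0x... literal holding the
// same text as UCS-2LE bytes, for use in a query without quotes around it.
function toNvarcharHexLiteral($utf8)
{
    $ucs2 = iconv('UTF-8', 'UCS-2LE', $utf8);
    if ($ucs2 === false) {
        // iconv() returns false when the input cannot be converted.
        throw new InvalidArgumentException('Input could not be converted to UCS-2LE');
    }

    return '0x' . bin2hex($ucs2);
}
```

Because the result contains only the characters 0-9, a-f and x, it can be placed in the statement as-is, which is why the comments above insist on passing it without apostrophes.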
To retrieve that text, you need to ask MS SQL Server to send back UNICODE encoded text like this:
Retrieving Unicode Text from MS SQL Server
// retrieving data from database
// IMPORTANT!
// please note that "varbinary" expects a number of bytes, and UCS-2 uses 2 bytes per character
// (a field of 100 UCS-2 chars would need 200 bytes)
// myfield is of 50 length, so I set VARBINARY to 100
$result = mssql_query("SELECT CONVERT(VARBINARY(100), myfield) AS myfield FROM mytable", $link);
while (($row = mssql_fetch_array($result, MSSQL_BOTH)))
{
// we get data in UCS-2
// I use UTF-8 in my project, so I encode it back
echo '1. '.iconv('UCS-2LE', 'UTF-8', $row['myfield']).PHP_EOL;
// or you can even use mb_convert_encoding to convert from UCS-2LE to UTF-8
echo '2. '.mb_convert_encoding($row['myfield'], 'UTF-8', 'UCS-2LE').PHP_EOL;
}
[Screenshot: the MS SQL table with the Unicode data after the INSERT]
[Screenshot: the output of a PHP page displaying the values]
I'm not sure if you can reach my test page here, but you can try to see the live results:
http://dbg.deve.wiznet.gr/php56/mssql/test1.php
I want to read data from a .txt file using PHP. I want to show the data in one column, but I am unable to.
My file.txt contains this data:
ہو
تو
کر
جو
نہ
And I'm getting output like this:
ہو 㰊牢>تو 㰊牢>کر 㰊牢>جو 㰊牢>نہ 㰊牢>؟ 㰊牢>ہیں 㰊牢>ان 㰊牢夾畯栠癡湡攠牲
<?php
$conn = mysql_connect("localhost","root","");
mysql_select_db("Project",$conn);
mysql_query("SET NAMES UTF8");
$handle = @fopen("file.txt", "r");
while (!feof($handle)) {
$buffer = fgets($handle, 1024);
list($urdu)=explode('/n',$buffer);
echo $urdu."<br>";
$sql = "INSERT INTO fb (urdu) VALUES('$urdu')";
mysql_query($sql) or die(mysql_error());
}
So please tell me how I can display the text like this:
ہو
تو
کر
جو
نہ
The problem seems to come from an encoding mismatch, so first figure out where the mismatch occurs. It could be that:
a) your PHP code is reading the input using an incorrect encoding (for example, treating the file as iso-8859 when the source file is encoded some other way)
b) your PHP code is writing the output using an incorrect encoding
c) whatever you use to read the output (your browser) is set to a different encoding than the bytes you are writing.
Once you figure out which of these three places causes the problem, you can fix it by finding out what your source encoding is and reading/writing with that encoding instead of another one (which your system has probably set as the default).
Maybe you can use mb_detect_encoding and mb_convert_encoding.
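For case (a), one thing worth ruling out is a file saved as UTF-16, since mixing UTF-16 bytes with single-byte markup tends to produce unrelated-looking characters. The sketch below assumes the file starts with a byte-order mark and that the mbstring extension is loaded; the function name fileContentsToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: normalise raw file contents to UTF-8 based on the BOM.
// Without a BOM the data is returned unchanged and assumed to be UTF-8.
function fileContentsToUtf8($raw)
{
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($raw, 3); // UTF-8 with BOM: drop the BOM
    }
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        return mb_convert_encoding((string) substr($raw, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding((string) substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }

    return $raw;
}
```

The whole file would be read with file_get_contents('file.txt'), converted once, and only then split into lines, because fgets on UTF-16 data cuts at single newline bytes rather than at 2-byte characters.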
I can't figure out what I'm doing wrong. I'm getting file content from the database. When I echo the content, everything displays just fine, when I write it to a file (.html) it breaks. I've tried iconv and a few other solutions, but I just don't understand what I should put for the first parameter, I've tried blanks, and that didn't work very well either. I assume it's coming out of the DB as UTF-8 if it's echoing properly. Been stuck a little while now without much luck.
function file($fileName, $content) {
if (!file_exists("out/".$fileName)) {
$file_handle = fopen(DOCROOT . "out/".$fileName, "wb") or die("can't open file");
fwrite($file_handle, iconv('UTF-8', 'UTF-8', $content));
fclose($file_handle);
return TRUE;
} else {
return FALSE;
}
}
Source of the html file looks like.
Comes out of the DB like this:
<h5>Текущая стабильная версия CMS</h5>
goes in file like this
<h5>Ð¢ÐµÐºÑƒÑ‰Ð°Ñ ÑÑ‚Ð°Ð±Ð¸Ð»ÑŒÐ½Ð°Ñ Ð²ÐµÑ€ÑÐ¸Ñ CMS</h5>
EDIT:
Turns out the root of the problem was Apache serving the files incorrectly. Adding
AddDefaultCharset utf-8
To my .htaccess file fixed it. Hours wasted... At least I learned something though.
Edit: The database encoding does not seem to be the issue here, so this part of the answer is retained for information only
I assume it's coming out of the DB as UTF-8
This is most likely your problem. What database type do you use? Have you set the character encoding and collation for the database, the table, the connection and the transfer?
If I were to hazard a guess, I would say your database is MySQL and that the collation for the database / table / column is utf8_general_ci?
However, MySQL's utf8 is not actually full UTF-8: it stores at most 3 bytes per character rather than 4, so it cannot store the whole UTF-8 character set; see UTF-8 all the way through.
So you need to go through every table and column in your MySQL database and change it from utf8_ to utf8mb4_ (available since MySQL 5.5.3), which uses up to 4 bytes per character and covers the whole UTF-8 range.
Also, if you do any PHP work on the data strings, be aware that you should be using the mb_ PHP functions for multibyte encodings.
And finally, you need to specify a connection character set for the database. Don't run with the default one, as it will almost certainly not be utf8mb4; you can then have the correct data in the database, but that data is repackaged as 3-byte utf8 before being treated as 4-byte UTF-8 by PHP at the other end.
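With PDO, the connection character set can be given in the DSN. The following is a minimal sketch assuming the pdo_mysql driver; the helper name utf8mb4Dsn and the host and database names in the usage comment are placeholders, not values from the question.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that requests utf8mb4 for the
// connection, so it does not fall back to the server's default charset.
function utf8mb4Dsn($host, $dbname)
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Usage (placeholder credentials):
// $pdo = new PDO(utf8mb4Dsn('localhost', 'cms'), $user, $password);
```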
Hope this helps, and if your DB is not MySQL, let us know what it is!
Edit:
function file($fileName, $content) {
if (!file_exists("out/".$fileName)) {
$file_handle = fopen(DOCROOT . "out/".$fileName, "wb") or die("can't open file");
fwrite($file_handle, iconv('UTF-8', 'UTF-8', $content));
fclose($file_handle);
return TRUE;
} else {
return FALSE;
}
}
Your $file_handle only opens the file inside an if statement that runs only when the file does not exist.
Your iconv is worthless here, converting from "utf-8" to, er, "utf-8". Character detection is extremely haphazard and hard for programs to do correctly, so it's generally advised not to try to guess what a character encoding is; you need to know what it is and tell the function.
The comment by Dean is actually very important. The HTML should have a <meta charset="UTF-8"> inside <head>.
That iconv call is actually not useful and, if you are right that you are getting your content as UTF-8, it is not necessary.
You should check the character set of your database connection. Your database can be encoded in UTF-8 but the connection could be in another character set.
Good luck!
I am building a data import tool for the admin section of a website I am working on. The data is in both French and English, and contains many accented characters. Whenever I attempt to upload a file, parse the data, and store it in my MySQL database, the accents are replaced with '?'.
I have text files containing data (charset is iso-8859-1) which I upload to my server using CodeIgniter's file upload library. I then read the file in PHP.
My code is similar to this:
$this->upload->do_upload()
$data = array('upload_data' => $this->upload->data());
$fileHandle = fopen($data['upload_data']['full_path'], "r");
while (($line = fgets($fileHandle)) !== false) {
echo $line;
}
This produces lines with accents replaced with '?'. Everything else is correct.
If I download my uploaded file from my server over FTP, the charset is still iso-8859-1, but a diff reveals that the file has changed. However, if I open the file in TextEdit, it displays properly.
I attempted to use PHP's stream_encoding method to explicitly set my file stream to iso-8859-1, but my build of PHP does not have the method.
After running out of ideas, I tried wrapping my strings in both utf8_encode and utf8_decode. Neither worked.
If anyone has any suggestions about things I could try, I would be extremely grateful.
It's important to see whether the corruption happens before or after the query is issued to MySQL. There are too many possible things happening here to be able to pinpoint it. Are you able to output your MySQL query to check this?
Assuming that your query IS properly formed (no corruption at the stage where the query is output), there are a couple of things that you should check.
What is the character encoding of the database itself? (collation)
What is the charset of the connection? This may not be set up correctly in your MySQL config and can be set manually using the 'SET NAMES' command.
In my own application I issue a 'SET NAMES utf8' as my first query after establishing a connection as I am unable to change the MySQL config.
See this.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
Edit: If the issue is not related to mysql I'd check the following
You say the encoding of the file is 'charset is iso-8859-1' - can I ask how you are sure of this?
What happens if you save the file itself as utf8 (Without BOM) and try to reprocess it?
What is the encoding of the php file that is performing the conversion? (What are you using to write your php - it may be 'managing' this for you in an undesired way)
(an aside) Are the files you are processing suitable for processing using fgetcsv instead?
http://php.net/manual/en/function.fgetcsv.php
Files uploaded to your server should be returned the same on download. That means, the encoding of the file (which is just a bunch of binary data) should not be changed. Instead you should take care that you are able to store the binary information of that file unchanged.
To achieve that with your database, create a BLOB field. That's the right column type for it. It's just binary data.
Assuming you're using MySQL, this is the reference: The BLOB and TEXT Types, look out for BLOB.
The problem is that your data is iso-8859-1 while something further along is probably treating it as utf-8. To convert it to the correct charset, you can use the iconv function, like so:
$output_string = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $input_string);
iso-8859-1 does encode the accented characters used in French, but as single bytes that are not valid utf-8, which would explain why they show up as '?'.
It would be so much better if everything were utf-8, as it handles virtually every character known to man.
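If the uploaded file really is iso-8859-1, as the question states, each line can be converted while it is being read. This is a sketch under that assumption, using mbstring; the function name latin1LineToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: convert one line of ISO-8859-1 text to UTF-8.
// The source encoding is an assumption and is stated explicitly, not guessed.
function latin1LineToUtf8($line)
{
    return mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1');
}

// Usage inside the read loop from the question:
// while (($line = fgets($fileHandle)) !== false) {
//     echo latin1LineToUtf8($line);
// }
```

Running this on data that is already UTF-8 would double encode it, so the conversion should be applied exactly once, at the point where the file is read.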
I am creating a file using php fwrite() and I know all my data is in UTF8 ( I have done extensive testing on this - when saving data to db and outputting on normal webpage all work fine and report as utf8.), but I am being told the file I am outputting contains non utf8 data :( Is there a command in bash (CentOS) to check the format of a file?
When using vim it shows the content as:
Donâ~#~Yt do anything .... Itâ~#~Ys a
great site with
everything....Weâ~#~Yve only just
launched/
Any help would be appreciated: Either confirming the file is UTF8 or how to write utf8 content to a file.
UPDATE
To clarify how I know I have data in UTF8, I have done the following:
DB is set to utf8. When saving data to the database I run this first:
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "UTF-8", $enc);
Just before I run fwrite I have checked the data with the following. Note that each piece of data returns 'IS utf-8':
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'NOT UTF-8';
else print 'IS utf-8';
Thanks!
If you know the data is in UTF8 then you want to set up the header.
I wrote a solution in answer to another thread.
The solution is the following: as the UTF-8 byte-order mark is \xef\xbb\xbf, we should add it at the start of the document.
<?php
function writeStringToFile($file, $string){
$f=fopen($file, "wb");
$string="\xEF\xBB\xBF".$string; // this is what makes the magic
fputs($f, $string);
fclose($f);
}
?>
You can adapt it to your code, basically you just want to make sure that you write a UTF8 file (as you said you know your content is UTF8 encoded).
fwrite() itself is binary-safe, but a file handle opened in text mode is not necessarily so. That means that your data - be it correctly encoded or not - might get mangled by the underlying routines on some systems.
To be on the safe side, you should use fopen() with the binary mode flag, that's b. Afterwards, fwrite() will save your string data "as-is", which in PHP is binary data, because strings in PHP are binary strings.
Background: Some systems distinguish between text and binary data. The binary flag explicitly tells PHP on such systems to use binary output. When you deal with UTF-8 you should take care that the data does not get mangled. That's prevented by handling the string data as binary data.
However: if, contrary to what you say in your question, the UTF-8 encoding of the data was not preserved, then your encoding is already broken, and binary-safe handling will only preserve the broken state. Still, with the binary flag you ensure that it is not the fwrite() part of your application that is breaking things.
It has been rightly written in another answer here that you cannot know the encoding if you have only the data. However, you can check whether the data is valid UTF-8, which gives you at least some chance to check the encoding. I've posted a PHP function which does this in a UTF-8 related question, so it might be of use if you need to debug things: see Answer to: SimpleXML and Chinese and look for can_be_valid_utf8_statemachine, that's the name of the function.
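As a lighter alternative to the state machine mentioned above, mbstring can do the validity check. The sketch below is not that function; it assumes the mbstring extension is loaded and reports the first line that is not valid UTF-8, and the name firstInvalidUtf8Line is hypothetical.

```php
<?php
// Hypothetical helper: return the 1-based number of the first line that is
// not valid UTF-8, or null if every line validates.
function firstInvalidUtf8Line($text)
{
    foreach (explode("\n", $text) as $index => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            return $index + 1;
        }
    }

    return null;
}
```

A null result only says the bytes form valid UTF-8. Text that was encoded twice is still valid UTF-8, so this check cannot detect that kind of damage.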
//add BOM to fix UTF-8 in Excel
fputs($fp, $bom =( chr(0xEF) . chr(0xBB) . chr(0xBF) ));
I find this piece works for me :)
The problem is your data is double encoded. I assume your original text is something like:
Don’t do anything
with ’, i.e., not the straight apostrophe, but the right single quotation mark.
If you write a PHP script with this content and encoded in UTF-8:
<?php
//File in UTF-8
echo utf8_encode("Don’t"); //this will double encode
You will get something similar to your output.
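The effect can be reproduced at byte level and, if that is what happened, reversed by one conversion in the opposite direction. This is a sketch assuming the extra step treated UTF-8 bytes as ISO-8859-1 (which is what utf8_encode does); both function names are hypothetical and mbstring is assumed to be loaded.

```php
<?php
// Hypothetical helper: encode already-UTF-8 text a second time by treating
// its bytes as ISO-8859-1, the same mistake utf8_encode() makes on UTF-8 input.
function doubleEncode($utf8)
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: undo exactly one such extra encoding step.
function undoDoubleEncode($doubleEncoded)
{
    return mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
}
```

The right single quotation mark is 3 bytes in UTF-8 (e2 80 99); after the extra step it becomes 6 bytes (c3 a2 c2 80 c2 99), which editors display as an accented "a" followed by two control characters.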
$handle = fopen($file,"w");
fwrite($handle, pack("CCC",0xef,0xbb,0xbf));
fwrite($handle,$data); // $data holds the UTF-8 text to write
fclose($handle);
I know all my data is in UTF8 - wrong.
Encoding is not the format of a file. So, check the charset in the headers of the page you are taking the data from:
header("Content-type: text/html; charset=utf-8;");
And check whether the data really contains multi-byte characters:
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'not UTF-8';
else print 'utf-8';
There may be a reason for this:
the information you first get from the database is not utf-8.
If you are sure it is, use this; I always use it and it works:
$file= fopen('../logs/logs.txt','a');
fwrite($file,PHP_EOL."_____________________output_____________________".PHP_EOL);
fwrite($file,print_r($value,true));
The only thing I had to do was add a UTF8 BOM to the CSV; the data was correct, but the file reader (an external application) couldn't read the file properly without the BOM.
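A minimal sketch of that approach, assuming the rows are already UTF-8 and the consuming application expects a BOM; the function name writeCsvWithBom is hypothetical.

```php
<?php
// Hypothetical helper: write rows as CSV, preceded by a UTF-8 byte-order mark.
function writeCsvWithBom($path, array $rows)
{
    $fh = fopen($path, 'wb'); // binary mode, so the bytes are written as-is
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    fwrite($fh, "\xEF\xBB\xBF"); // the BOM goes in once, before any data

    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly.
        fputcsv($fh, $row, ',', '"', '\\');
    }

    fclose($fh);
}
```

The BOM does not change the encoding of the data; it only helps readers that otherwise guess a different charset.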
Try this simple method, which is more useful: add this to the top of the page, before the <body> tag:
<head>
<meta charset="utf-8">
</head>