utf8_encode or decode isn't doing what I expect

I am taking an XML file and reading it into various strings, before writing to a database, however I am having difficulty with German characters.
The XML file starts off
<?xml version="1.0" encoding="UTF-8"?>
Then an example of where I am having problems is this part
<name><![CDATA[PONS Großwörterbuch Deutsch als Fremdsprache Android]]></name>
My PHP has this relevant section
$dom = new DOMDocument();
$domNode = $xmlReader->expand();
$element = $dom->appendChild($domNode);
$domString = utf8_encode($dom->saveXML($element));
$product = new SimpleXMLElement($domString);
//read in data
$arr = $product->attributes();
$link_ident = $arr["id"];
$link_id = $platform . "" . $link_ident;
$link_name = $product->name;
So $link_name becomes PONS GroÃwÃ¶rterbuch Deutsch als Fremdsprache Android
I then did a
$link_name = utf8_decode($link_name);
Which when I echoed back in terminal worked fine
PONS GroÃwÃ¶rterbuch Deutsch als Fremdsprache Android as is now
PONS Großwörterbuch Deutsch als Fremdsprache Android after utf8decode
However when it is written into my database it appears as:
PONS KompaktwÃ¶rterbuch Deutsch-Englisch (Android)
The collation for link_name in MySQL is utf8_general_ci
How should I be doing this to get it correctly written into my database?
This is the code I use to write to the database
$link_name = utf8_decode($link_name);
$link_id = mysql_real_escape_string($link_id);
$link_name = mysql_real_escape_string($link_name);
$description = mysql_real_escape_string($description);
$metadesc = mysql_real_escape_string($metadesc);
$link_created = mysql_real_escape_string($link_created);
$link_modified = mysql_real_escape_string($link_modified);
$website = mysql_real_escape_string($website);
$cost = mysql_real_escape_string($cost);
$image_name = mysql_real_escape_string($image_name);
$query = "REPLACE into jos_mt_links
(link_id, link_name, alias, link_desc, user_id, link_published,link_approved, metadesc, link_created, link_modified, website, price)
VALUES ('$link_id','$link_name','$link_name','$description','63','1','1','$metadesc','$link_created','$link_modified','$website','$cost')";
echo $link_name . " has been inserted ";
and when I run it from shell I see
PONS Kompaktwörterbuch Deutsch-Englisch (Android) has been inserted

You've got a UTF-8 string from an XML file, and you're putting it into a UTF-8 database. So there is no encoding or decoding to be done; just put the original string into the database. Make sure you've called mysql_set_charset('utf8') first to tell the database that UTF-8 strings are coming (MySQL spells the character set name 'utf8', without the hyphen).
utf8_decode and utf8_encode are misleadingly named. They are only for converting between UTF-8 and ISO-8859-1 encodings. Calling utf8_decode, which converts to ISO-8859-1, will naturally lose any characters you have that don't fit in that encoding. You should generally avoid these functions unless there's a specific place where you need to be using 8859-1.
You should not consider what the terminal shows when you echo a string to be definitive. The terminal has its own encoding problems and especially under Windows it is likely to be impossible to output every character properly. On a Western Windows install the system code page (which the terminal will use to turn the bytes PHP spits out into characters to display on-screen) will be code page 1252, which is similar to but not the same as ISO-8859-1. This is why utf8_decode, which spits out ISO-8859-1, appeared to make the text appear as you expected. But that's of little use. Internally you should be using UTF-8 for all strings.
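The double-encoding effect described above can be reproduced without a database. The following is a minimal sketch, assuming the mbstring extension is available. It uses mb_convert_encoding between ISO-8859-1 and UTF-8, which is the same conversion utf8_encode and utf8_decode perform, and it writes the sample word with byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "Großwörterbuch" as UTF-8 bytes (ß = C3 9F, ö = C3 B6).
$original = "Gro\xC3\x9Fw\xC3\xB6rterbuch";

// Same conversion as utf8_encode(): every input byte is treated as one
// ISO-8859-1 character, so each 2-byte UTF-8 sequence grows to 4 bytes.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Same conversion as utf8_decode(): it only undoes the extra step above.
$restored = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

echo strlen($original), " bytes in the original UTF-8 string\n";
echo strlen($doubleEncoded), " bytes after the unnecessary encode\n";
echo $restored === $original ? "decode restores the original\n" : "decode differs\n";
```

In other words, the decode step in the question only cancels out the earlier encode step; dropping both leaves the original UTF-8 bytes untouched.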

You must use the mb_convert_encoding or iconv function before you write into your database.

Related

How to store and retrieve extended ASCII characters in MSSQL

I was surprised that I was unable to find a straightforward answer to this question by searching.
I have a web application in PHP that takes user input. Due to the nature of the application, users may often use extended ASCII characters (a.k.a. "ALT codes").
My specific issue at the moment is with ALT code 26, which is a right arrow (→). This will be accompanied with other text to be stored in the same field (for example, 'this→that').
My column type is NVARCHAR.
Here's what I've tried:
I've tried doing no conversions and just inserting the value as normal, but the value gets stored as thisâ??that.
I've tried converting the value to UCS-2 in PHP using iconv('UTF-8', 'UCS-2', $value), but I get an error saying Unclosed quotation mark after the character string 't'.. The query ends up looking like this: UPDATE myTable SET myColumn = 'this�!that'.
I've tried doing the above conversion and then adding an N before the quoted value, but I get the same error message. The query looks like this: UPDATE myTable SET myColumn = N'this�!that'.
I've tried removing the UCS-2 conversion and just adding the N before the quoted value, and the query works again, but the value is stored as thisâ that.
I've tried using utf8_decode($value) in PHP, but then the arrow is just replaced with a question mark.
So can anyone answer the (seemingly simple) question of, how can I store this value in my database and then retrieve it as it was originally typed?
I'm using PHP 5.5 and MSSQL 2012. If any question of driver/OS version comes into play, it's a Linux server connecting via FreeTDS. There is no possibility of changing this.

You might try base64 encoding the input, this is fairly trivial to handle with PHP's base64_encode() and base64_decode() and it should handle what ever your users throw at it.
(edit: You can apparently also do the base64 encoding on the SQL Server side. This doesn't seem like something it should be responsible for imho, but it's an option.)

It seems like your freetds.conf is wrong. You need a TDS protocol version >= 7.0 to support unicode. See this for more details.
Edit your freetds.conf:
[global]
# TDS protocol version
tds version = 7.4
client charset = UTF-8
Also make sure to configure PHP correctly:
ini_set('mssql.charset', 'UTF-8');

The accepted answer seems to do the job: yes, you can encode the value as base64 and decode it again. But then every application that uses that remote database has to change to support base64-encoded fields. If there is a remote MS SQL Server database, there may well be other applications using it, and each of them would have to handle both plain text and base64-encoded text.
I searched a little bit and I found how to send UNICODE text to the MS SQL Server using MS SQL commands and PHP to convert the UNICODE bytes to HEX numbers.
If you go at the PHP documentation for the mssql_fetch_array (http://php.net/manual/ru/function.mssql-fetch-array.php#80076), you'll see at the comments a pretty good solution that converts the text to UNICODE HEX values and then sends that HEX data directly to MS SQL Server like this:
Convert Unicode Text to HEX Data
// sending data to database
$utf8 = 'Δοκιμή με unicode → Test with Unicode'; // some Greek text for example
$ucs2 = iconv('UTF-8', 'UCS-2LE', $utf8);
// converting UCS-2 string into "binary" hexadecimal form
$arr = unpack('H*hex', $ucs2);
$hex = "0x{$arr['hex']}";
// IMPORTANT!
// please note that value must be passed without apostrophes
// it should be "... values(0x0123456789ABCEF) ...", not "... values('0x0123456789ABCEF') ..."
mssql_query("INSERT INTO mytable (myfield) VALUES ({$hex})", $link);
Now all the text actually is stored to the NVARCHAR database field correctly as UNICODE, and that's all you have to do in order to send and store it as plain text and not encoded.
To retrieve that text, you need to ask MS SQL Server to send back UNICODE encoded text like this:
Retrieving Unicode Text from MS SQL Server
// retrieving data from database
// IMPORTANT!
// please note that "varbinary" expects a number of bytes,
// which is twice the field length in UCS-2 characters
// myfield has length 50, so I set VARBINARY to 100
$result = mssql_query("SELECT CONVERT(VARBINARY(100), myfield) AS myfield FROM mytable", $link);
while (($row = mssql_fetch_array($result, MSSQL_BOTH)))
{
// we get data in UCS-2
// I use UTF-8 in my project, so I encode it back
echo '1. '.iconv('UCS-2LE', 'UTF-8', $row['myfield']).PHP_EOL;
// or you can even use mb_convert_encoding to convert from UCS-2LE to UTF-8
echo '2. '.mb_convert_encoding($row['myfield'], 'UTF-8', 'UCS-2LE').PHP_EOL;
}
The MS SQL Table with the UNICODE Data after the INSERT
The output result using a PHP page to display the values
I'm not sure if you can reach my test page here, but you can try to see the live results:
http://dbg.deve.wiznet.gr/php56/mssql/test1.php
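The conversion in both directions can be checked without a database connection. The following is a sketch only, assuming the mbstring extension is available; the helper names toNvarcharHexLiteral and fromNvarcharBytes are hypothetical and not part of any library. It uses UTF-16LE, which is byte-for-byte identical to UCS-2LE for characters in the Basic Multilingual Plane such as the arrow from the question.

```php
<?php
// Hypothetical helper: UTF-8 text -> unquoted 0x... literal for an NVARCHAR column.
function toNvarcharHexLiteral(string $utf8): string
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
    return '0x' . bin2hex($utf16);
}

// Hypothetical helper: bytes read back via CONVERT(VARBINARY(n), column) -> UTF-8.
function fromNvarcharBytes(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
}

$value = "this\xE2\x86\x92that"; // "this→that" written as UTF-8 bytes
$literal = toNvarcharHexLiteral($value);
echo $literal, "\n";

// Simulate the raw bytes the server would return for the VARBINARY column.
$bytesFromServer = hex2bin(substr($literal, 2));
echo fromNvarcharBytes($bytesFromServer) === $value ? "round trip ok\n" : "mismatch\n";
```

Because the literal consists only of hex digits, it cannot contain the stray quote or null bytes that broke the quoted UCS-2 attempts in the question.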

trying to export csv in microsoft excel with special character like chinese but failed

I have a web app where I am trying to export to CSV from a database.
It runs perfectly with the English character set, but when I put some Chinese text in the database my CSV shows garbage characters like ????.
<?php
$con=mysqli_connect(global_dbhost,global_dbusername,global_dbpassword,global_dbdatabase);
if(isset($_GET['csv']))
{
$query ='SELECT CONCAT("TC00", `t_id`),m_id,s_id,t_name,Description,start_date,end_date,start_time,end_time,status,active FROM tc_task';
$today = date("dmY");
//CSVExport($query);
$con=mysqli_connect(global_dbhost,global_dbusername,global_dbpassword,global_dbdatabase);
//echo 'inside function';
$sql_csv = mysqli_query($con,$query) or die("Error: " . mysqli_error($con)); //Replace this line with what is appropriate for your DB abstraction layer
file_put_contents("csvLOG.txt","\n inside ajax",FILE_APPEND);
header("Content-type:application/octet-stream");
header("Content-Disposition:attachment;filename=caring_data".$today.".csv");
while($row = mysqli_fetch_row($sql_csv)) {
print '"' . stripslashes(implode('","',$row)) . "\"\n";
}
exit;
}
?>

Solution available here:
Open in notepad (or equivalent)
Re-Save CSV as Unicode (not UTF-8)
Open in Excel
Profit
Excel does not handle UTF-8. If you go to the import options for CSV files, you will notice there is no choice for UTF-8 encoding. Since Unicode is supported, the above should work (though it is an extra step). The equivalent can likely be done on the PHP side (if you can save as Unicode instead of UTF-8). This seems doable according to this page which suggests:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
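Applied to a whole export, the idea is to build the CSV as UTF-8 first and convert it once at the end. This is a rough sketch, assuming the mbstring extension is available and that the rows come from the database as UTF-8; csvForExcel is a hypothetical helper name, and the sample row stands in for a real query result.

```php
<?php
// Hypothetical helper: rows of UTF-8 strings -> CSV bytes in UTF-16LE with a BOM.
function csvForExcel(array $rows): string
{
    $buffer = fopen('php://temp', 'r+');
    foreach ($rows as $row) {
        // Delimiter, enclosure and escape character are passed explicitly.
        fputcsv($buffer, $row, ',', '"', '\\');
    }
    rewind($buffer);
    $utf8 = stream_get_contents($buffer);
    fclose($buffer);

    // FF FE is the UTF-16LE byte order mark, the same bytes as chr(255).chr(254).
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

// Sample row: an id plus two Chinese characters written as UTF-8 bytes.
$rows = [["TC001", "\xE4\xBB\xBB\xE5\x8A\xA1"]];
$csv = csvForExcel($rows);
echo strlen($csv), " bytes, starting with BOM ", bin2hex(substr($csv, 0, 2)), "\n";
```

In the export script, the returned string would be printed after the headers instead of printing each row directly.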

Umlauts are displayed as ? in my MySQL database?

this is my local scenario:
I have an application which reads some CSV files and writes the content to my local MySQL database. The content contains umlauts and other special characters, such as "ß" or "Ä". Locally everything works fine: the umlauts are written to the db and also displayed correctly inside the app which reads the db.
Now I moved this scenario to the amazon cloud and suddenly "ß" becomes "?" in the db. I checked what the program reads from the CSV files and there it is still a "ß". So it must be the writing to the database I guess, question is, why was this working locally but not on my cloud server? Is this a db problem, or a PHP problem?
Thanks! :)

Did you check the encoding on both databases? Most likely there might be the problem.

You need to have your database UTF-8 encoded.
Here is an excellent overview article that explains how encoding works in MySQL, and multiple ways to fix it:
http://www.bluebox.net/news/2009/07/mysql_encoding/

you can use:
iconv("UTF-8", "ISO-8859-1", $stringhere);
this will convert the string for you
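Whether a conversion like this helps depends on which encoding the CSV bytes are actually in and which encoding the database connection expects. The following is a minimal sketch of checking before converting, assuming the mbstring extension is available and assuming that any input that is not valid UTF-8 is ISO-8859-1; normalizeToUtf8 is a hypothetical helper name.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume ISO-8859-1.
function normalizeToUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Stra\xDFe";     // "Straße" with ß as the single ISO-8859-1 byte DF
$utf8   = "Stra\xC3\x9Fe"; // "Straße" with ß as the UTF-8 bytes C3 9F

echo bin2hex(normalizeToUtf8($latin1)), "\n";
echo bin2hex(normalizeToUtf8($utf8)), "\n";
```

Both calls print the same hex string, so the value written to the database is consistent regardless of how the CSV was saved.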

it all depends on how you upload the CSV file.
Before writing the data to the MySQL server, this code might help:
$Notes = $_POST['Notes']; // or contents of the CSV file- but split the data first according to newline etc.
$charset = mysqli_character_set_name($DBConnect);
printf ("To check your character set but not necessary %s\n",$charset);
$Notes = str_replace('"', '"', $Notes); //double quotes for mailto: emails.
$von = array("ä","ö","ü","ß","Ä","Ö","Ü"," ","é"); //to correct double whitepaces as well
$zu = array("ä","ö","ü","ß","Ä","Ö","Ü"," ","é");
$Notes = str_replace($von, $zu, $Notes);
echo " Notes:".$Notes."<br>" ;
$Notes = mysqli_real_escape_string($DBConnect, $Notes); //for mysqli DB connection.
echo " Notes:".$Notes ;

PHP: string breaks at special character

I wrote a small PHP script which does a "branding" on an existing PDF file. This means that on every page I put a string like "belongs to " at a specific position. For this I use Zend_Pdf from the Zend Framework.
Because the script is used in a German-speaking area, one line contains the special character "ö" ("Gehört zu ").
On my local machine (Windows, XAMPP) the script worked fine, but after moving it to my hoster's webspace (some Linux), the string breaks at "ö". That means only "Geh" appears in my PDF.
The code is this:
if (substr($file, strlen($file) - 4) === '.pdf') {
$name = $user->GetName;
$fontSize = 12;
$xTextPos = 100;
$yTextPos = 10;
set_include_path(dirname(__FILE__)); // set include_path for external library Zend Framework
require_once('Zend' .DS . 'Pdf.php');
$pdf = Zend_Pdf::load($file);
$font = Zend_Pdf_Font::fontWithName(Zend_Pdf_Font::FONT_HELVETICA);
$branding = 'Gehört zu ' . $name; // German for: 'Belongs to ', problem with 'ö'
foreach ($pdf->pages as &$page) {
$page->setFont($font, $fontSize);
$page->drawText($branding, $xTextPos, $yTextPos);
}
}
I guess the problem is related to some kind of default charset or language setting of the PHP environment. So I searched here and tried out:
$branding = utf8_encode('Gehört zu ') . $name;
...and I made some experiments with functions like html_entity_decode, but nothing helped, so I decided to stop groping in the dark and open my own question.
Looking forward to any hints. Thank you in advance for your help!
EDIT: Meanwhile I found the same (?) problem, solved on a German forum. But if I do it like they say...
$branding = mb_convert_encoding('Gehört zu ', 'ISO-8859-1') . $name;
... the resulting branding in the PDF is "Gehrt zu ". The "ö" is skipped now.
For this I found another hint on the Zend issue tracker.
To sum up, I can drop all the UTF-8 things and concentrate on Latin-1, AKA ISO 8859-1.
I still don't understand why the code worked on my Windows + XAMPP and now crashes on my hoster's Linux.

Your guess is right, the problem is related to encoding. Where exactly the encoding is messed up is hard to say from afar. I'm assuming you work not only with Zend_Pdf, but also have the MVC in place (meaning a complete Zend_Application).
You should check if your application serves pages as UTF-8, by setting:
resources.view.encoding = "UTF-8"
and also placing the appropriate meta-tags in your layout/view.
Depending on what Editor you use, your files may be encoded in a different encoding. You can use Notepad++ on Windows to check your file-encoding and for converting it to UTF-8 (don't just set the encoding to UTF-8, this might mess up your file!) if necessary. I recommend using Eclipse with text file encoding set to "UTF-8" (Preferences > General > Workspace) to make sure your code files are encoded in UTF-8.
Now comes the crucial part:
Zend_Pdf_Page::drawText(string $text, float $x, float $y, string $charEncoding)
See that last argument... set it. If you're lucky, you can skip the previous stuff and just set the encoding there.
edit: I missed something: database connections. You should check the encoding there too. I frequently work with MS SQL Server, which uses Latin-1 internally; not setting driver_options.CharacterSet can mess things up pretty badly too. This might be relevant if you have something like Gehört zu: Günther, where the name Günther is fetched from the database.

Encoding is also depending of the file encoding.
If you save your file as UTF-8, for example, and call utf8_encode("ö"), then you'll encode to UTF-8 something that is already UTF-8.
So you may want to check what your file encoding is, and what your PDF lib is requiring. Then apply the right formula/transformation.
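One way to see which encoding a string literal really has is to look at its bytes. This is a small sketch, assuming the mbstring extension is available; it writes the text with byte escapes to stand in for a source file saved as UTF-8, and no Zend_Pdf call is made here.

```php
<?php
// "Gehört zu " as it would appear in a source file saved as UTF-8 (ö = C3 B6).
$utf8 = "Geh\xC3\xB6rt zu ";

// The same text as ISO-8859-1, where ö is the single byte F6.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo 'UTF-8:      ', bin2hex($utf8), ' (', strlen($utf8), " bytes)\n";
echo 'ISO-8859-1: ', bin2hex($latin1), ' (', strlen($latin1), " bytes)\n";
echo mb_check_encoding($latin1, 'UTF-8') ? "latin1 bytes are valid UTF-8\n" : "latin1 bytes are not valid UTF-8\n";
```

Running bin2hex on the real $branding string on both machines would show whether the Windows and Linux copies of the script contain the same bytes; the name of whichever encoding they turn out to be in is what the $charEncoding argument of drawText expects.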

SQLite UTF-8 output with fwrite

I was trying to output UTF-8 text read from SQLite database to a text file using fwrite function, but with no luck at all.
When I echo the content to the browser I can read it with no problem. As a last resort, I created the same tables into MySQL database, and surprisingly it worked!
What could be the cause, how can I debug this so that I can use SQLite DB?
I am using PDO.
Below is the code I am using to read from DB and write to file:
$arFile = realpath(APP_PATH.'output/Arabic.txt');
$arfh = fopen($arFile, 'w');
$arTxt = '';
$key = 'somekey';
$sql = 'SELECT ot.langv AS orgv, et.langv AS engv, at.langv AS arbv FROM original ot LEFT JOIN en_vals et ON ot.langk=et.langk
LEFT JOIN ar_vals at ON ot.langk=at.langk
WHERE ot.langk=:key';
$stm = $dbh->prepare($sql);
$stm->execute(array(':key'=>$key));
if( $row = $stm->fetch(PDO::FETCH_ASSOC) ){
$arTxt .= '$_LANG["'.$key.'"] = "'.special_escape($row['arbv']).'";'."\n";
}
fwrite( $arfh, $arTxt);
fclose($arfh);

What could be the cause, how can I debug this so that I can use SQLite DB?
SQLite stores text into the database as it receives it. So if you store UTF-8 encoded text into the SQLite database, you can read UTF-8 text from it.
If you store, let's say, LATIN-1 text into the database, then you can read LATIN-1 text from it.
SQLite itself does not care. So you get out what you put in.
As you write in your question that display in browser looks good I would assume that at least some valid encoded values have been stored inside the database. You might want to look into your browser when you view that data, what your browser tells you in which encoding that data is.
If it says UTF-8 then fine. You might simply be viewing the text file with an editor that does not support UTF-8. fwrite also does not care about the encoding; it just puts the string data into the file and that's it.
So as long as you don't provide additional information with your question it's hard to tell something more specific.
See as well: How to change character encoding of a PDO/SQLite connection in PHP?
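To debug this, it can help to inspect the bytes before and after writing rather than trusting an editor. The following is a minimal sketch, assuming the mbstring extension is available; the Arabic sample string is a stand-in for $row['arbv'], and the temporary file stands in for Arabic.txt.

```php
<?php
// Stand-in for $row['arbv']: an Arabic word written as UTF-8 bytes.
$text = "\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7";

// Step 1: is the string fetched from the database valid UTF-8 at all?
$isUtf8 = mb_check_encoding($text, 'UTF-8');
echo $isUtf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
echo bin2hex($text), "\n";

// Step 2: write it with fwrite and read the raw bytes back.
$path = tempnam(sys_get_temp_dir(), 'enc');
$fh = fopen($path, 'w');
fwrite($fh, $text);
fclose($fh);

$readBack = file_get_contents($path);
unlink($path);
echo $readBack === $text ? "file bytes match the string\n" : "file bytes differ\n";
```

If the bytes match and are valid UTF-8, the file itself is fine and the remaining suspect is the program used to view it; if the string is already not valid UTF-8 in step 1, the problem lies in what was stored in or read from the SQLite database.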
