PHP SAX Parser and UTF-8

Unfortunately, I am running into trouble with the PHP SAX parser and UTF-8 encoding.
The case:
I have an XML file that is encoded in UTF-8. The file is parsed using the standard PHP SAX parser. The data is stored in container objects and inserted into a MySQL database. Unfortunately, some characters look wrong in the database (mostly German umlauts). For example, Gürtel looks like GÃ¼rtel.
The following code fragment shows how the parser is instantiated:
$saxParser = xml_parser_create("UTF-8");
Does this suffice to parse UTF-8 files? If yes, what am I missing? Some special database setup when inserting?
Thanks in advance.

Check the encoding step by step to find where it goes wrong:
Print the value you retrieve from the XML.
Print the SQL statement you build.
When printing the values, make sure your browser reads the output with the correct encoding.
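Because a browser or terminal sits between you and the data, the printed characters themselves can mislead. A minimal sketch of a byte-level check follows. The helper name describeEncoding is hypothetical, and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: report a value's raw bytes and whether they form valid UTF-8.
function describeEncoding(string $label, string $value): string
{
    $verdict = mb_check_encoding($value, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return sprintf('%s: %s (%s)', $label, bin2hex($value), $verdict);
}

// "Gürtel" as UTF-8 and as ISO-8859-1, written as explicit bytes so the
// result does not depend on how this file itself is saved.
$utf8   = "G\xC3\xBCrtel";
$latin1 = "G\xFCrtel";

echo describeEncoding('value from XML', $utf8), "\n";
echo describeEncoding('value from legacy source', $latin1), "\n";
```

Calling the helper on the parsed value and again on the final SQL string shows at which step the bytes stop being valid UTF-8.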
You have to ensure that every component uses the proper encoding:
PHP script
Save your PHP files as UTF-8 without a BOM, because a BOM can cause problems. Use only multibyte string functions when working with UTF-8 strings.
XML file
Make sure the XML file starts with
<?xml version="1.0" encoding="UTF-8" ?>
and the file is properly saved with the encoding set to UTF-8.
SQL column (collation)
VARCHAR(length) [CHARACTER SET charset_name] [COLLATE collation_name]
Communication between MySQL server and PHP script
Run this command right after opening the connection to the MySQL server:
SET NAMES 'UTF8'
SET NAMES indicates what character set the client will use to send SQL statements to the server.
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
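To separate the parser from the database, here is a small sketch that parses a UTF-8 document with the SAX parser and inspects the bytes it delivers. The sample document and variable names are made up for illustration. The last step simulates a component that wrongly treats the UTF-8 bytes as ISO-8859-1, which is one plausible way to end up with the garbled form from the question.

```php
<?php
// A tiny UTF-8 document; "ü" is written as its two UTF-8 bytes.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><item>G\xC3\xBCrtel</item>";

$text = '';
$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
    $text .= $data;
});
xml_parse($parser, $xml, true);

echo bin2hex($text), "\n"; // 47c3bc7274656c: still correct UTF-8 after parsing

// Simulate a connection that assumes Latin-1 and re-encodes the bytes to UTF-8.
$misread = iconv('ISO-8859-1', 'UTF-8', $text);
echo $misread, "\n"; // GÃ¼rtel
```

If the parsed bytes are correct, as here, the parser setup is sufficient and the damage happens later, typically on the connection to MySQL.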

Related

Tiny but strong russian issue

Plugin used: http://www.tinybutstrong.com/plugins.php
Russian characters are not displaying correctly.
In the MySQL database they are stored correctly; the collation is utf8_general_ci.
I used define('OPENTBS_ALREADY_UTF8','already_utf8');
It looks like a UTF-8 problem.
You have to check that the whole data chain is UTF-8:
all your PHP scripts
the data injected into the template (usually from a database); you also have to check that your PHP script retrieves the data as UTF-8. See "How do I make MySQL return UTF-8?"
the template (it is already UTF-8, since it is a LibreOffice or MS Office template)
Once this chain is OK, you have to use the OPENTBS_ALREADY_UTF8 option to load the template.
$TBS->LoadTemplate('my_template.odt', OPENTBS_ALREADY_UTF8);
You can check that your chain is OK with a test like this:
echo "<!doctype html><html><head><title>Test</title><meta charset='UTF-8'></head><body>";
echo $my_data_from_database;
echo "</body></html>";
exit;
where $my_data_from_database is a data item retrieved from the database that contains special characters such as Russian characters.

MYSQL REPLACE in QUERY and CDATA output usage still causes broken XML

I have a PHP script that queries a MySQL database and displays that information in XML format output.
I have a troublesome column that I have no control over (I can only SELECT). This column is full of carriage returns and similar characters.
In the MySQL query, I have tried to use REPLACE on this column like this:
REPLACE(PropertyInformation, '\r\n', '') AS PropertyInformation
In the PHP script I also wrap the exported XML in CDATA as I was told this could help, like this:
<Description><![CDATA[' . $PropertyInformation . ']]></Description>
I also form the XML like this in the script:
header("Content-Type: text/xml;charset=UTF-8");
echo '<?xml version="1.0" encoding="UTF-8"?>';
The result is broken because your data is not UTF-8 even though you declare it to be (<?xml version="1.0" encoding="UTF-8"?>).
You need to convert your data to UTF-8.
There are three ways to do that:
convert your data in the database to UTF-8, or
convert it in the SELECT statement, or
convert the data to UTF-8 in PHP, leaving the data in the database as-is.
For the first option, take a dump of the database, run it through iconv, and import it back.
For the second, use SELECT CONVERT(latin1column USING utf8) ...
For the third, again use iconv. Assuming your data is ISO-8859-1: $converted = iconv("ISO-8859-1", "UTF-8", $text);
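The third option can be sketched as follows. This assumes the column really holds ISO-8859-1 text; the sample value, the Property wrapper element, and the buildXml helper are hypothetical. SimpleXML is used only to check whether the generated document is well-formed.

```php
<?php
// Assumption: the database returns ISO-8859-1; "café" is written as explicit bytes.
$fromDatabase = "caf\xE9";

// Hypothetical helper that mirrors the question's CDATA wrapping.
function buildXml(string $description): string
{
    return '<?xml version="1.0" encoding="UTF-8"?>'
        . '<Property><Description><![CDATA[' . $description . ']]></Description></Property>';
}

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings

// Declared as UTF-8 but containing a raw Latin-1 byte: the parser rejects it.
$broken = simplexml_load_string(buildXml($fromDatabase));

// Converted first: the document parses and the text comes back as UTF-8.
$converted = iconv('ISO-8859-1', 'UTF-8', $fromDatabase);
$fixed = simplexml_load_string(buildXml($converted));

var_dump($broken === false);
var_dump((string) $fixed->Description === "caf\xC3\xA9");
```

Note that CDATA only protects against markup characters such as < and &; it does nothing for bytes that are invalid in the declared encoding.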

Why does my output change?

I'm working with UTF-8 encoding in PHP and I keep managing to get the output just as I want it. And then without anything happening with the code, the output all of a sudden changes.
Previously I was getting Hebrew output. Now I'm getting "&&&&&".
Any ideas what might be causing this?
These are the most common places where the problem originates:
Your editor that you’re creating the PHP/HTML files in
The web browser you are viewing your site through
Your PHP web application running on the web server
The MySQL database
Anywhere else external you’re reading/writing data from (memcached, APIs, RSS feeds, etc)
And a few things you can try:
Configuring your editor
Ensure that your text editor, IDE or whatever you’re writing the PHP code in saves your files in UTF-8 format. Your FTP client, scp, SFTP client doesn’t need any special UTF-8 setting.
Making sure that web browsers know to use UTF-8
To make sure your users’ browsers all know to read/write all data as UTF-8 you can set this in two places.
The content-type tag
Ensure the content-type META tag specifies UTF-8 as the character set like this:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
The HTTP response headers
Make sure that the Content-Type response header also specifies UTF-8 as the character set. In PHP, one way is to set default_charset, which PHP uses for the Content-Type header it sends:
ini_set('default_charset', 'utf-8');
Configuring the MySQL Connection
Now that you know all of the data you’re receiving from users is in UTF-8 format, you need to configure the client connection between PHP and the MySQL database.
There’s a generic way of doing this by simply executing the MySQL query:
SET NAMES utf8;
…and depending on which client/driver you’re using, there are helper functions to do this more easily instead:
With the built-in mysql functions (removed in PHP 7)
mysql_set_charset('utf8', $link);
With MySQLi
$mysqli->set_charset("utf8");
With PDO_MySQL (as you connect)
$pdo = new PDO(
'mysql:host=hostname;dbname=defaultDbName',
'username',
'password',
array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8")
);
The MySQL Database
We’re pretty much there now; you just need to make sure that MySQL knows to store the data in your tables as UTF-8. You can check their encoding by looking at the Collation value in the output of SHOW TABLE STATUS (in phpMyAdmin this is shown in the list of tables).
If your tables are not already in UTF-8 (it’s likely they’re in latin1), then you’ll need to convert them by running the following command for each table:
ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
One last thing to watch out for
With all of these steps complete, your application should be free of character set problems.
There is one thing to watch out for: most of the PHP string functions are not Unicode-aware. For example, if you run strlen() against a multi-byte character, it returns the number of bytes in the input, not the number of characters. You can work around this by using the Multibyte String PHP extension, though it’s not that common for these byte/character issues to cause problems.
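A short illustration of the byte versus character difference, assuming the mbstring extension is available. The strings are written as explicit UTF-8 bytes so the result does not depend on the encoding of the script file.

```php
<?php
$word = "G\xC3\xBCrtel"; // "Gürtel" as UTF-8 bytes

$byteCount = strlen($word);             // 7: counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // 6: counts characters
echo $byteCount, ' bytes, ', $charCount, " characters\n";

// Byte-based substr() can cut a multi-byte character in half.
$cut  = substr("\xC3\xBCber", 0, 1);             // a lone 0xC3 byte, not valid UTF-8
$safe = mb_substr("\xC3\xBCber", 0, 1, 'UTF-8'); // the complete character "ü"

var_dump(mb_check_encoding($cut, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($safe, 'UTF-8')); // bool(true)
```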
Taken from here: http://webmonkeyuk.wordpress.com/2011/04/23/how-to-avoid-character-encoding-problems-in-php/
Try setting the content type with a header like this:
header('Content-Type: text/html; charset=utf-8');
Try this function:
$html = "Bla Bla Bla...";
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
For more: http://php.net/manual/en/function.mb-convert-encoding.php
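Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. A possible replacement, sketched here under the assumption that mbstring is loaded, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity. The sample text is illustrative only.

```php
<?php
// The Hebrew word "שלום", written with code point escapes (PHP 7+) so the
// encoding of this file does not matter.
$html = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}";

// Conversion map: [first code point, last code point, offset, mask].
// This range covers everything outside ASCII.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$entities = mb_encode_numericentity($html, $map, 'UTF-8');

echo $entities, "\n"; // &#1513;&#1500;&#1493;&#1501;
```

The result is pure ASCII, so it displays correctly even if the page charset is misdeclared, at the cost of larger output.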
I put together this method and called it in the file I'm working with, and that seemed to resolve the issue.
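function setutf_8()
{
    // The charset parameter is written with "=", not ":".
    header('Content-Type: text/html; charset=utf-8');
    mb_internal_encoding('UTF-8');
    mb_http_output('UTF-8');
    // mb_http_input() only reports the detected input encoding and cannot set it, so it is omitted.
    mb_language('uni');
    mb_regex_encoding('UTF-8');
    ob_start('mb_output_handler');
}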
Thank you for all your help! :)

Same dataset outputs different characters : phpmyadmin / own query

I'm trying to get some data from the db, but the output isn't what I expected.
Doing my own querying on the db, I get this output: string 'C�te d�Ivoire' (length=13)
Querying the db from phpMyAdmin, I get normal output: Côte d’Ivoire
The php.ini default charset, the MySQL db default charset, and the <meta> charset are all set to utf-8.
I can't figure out where the encoding changes such that I get different output with the same configuration.
P.S.: I'm using the mysqli driver.
In the same page that gives you wrong results, try first running this instruction
print base64_encode("Côte");
The correct answer is Q8O0dGU.... If you get something else, like Q/R0ZQo..., this means that your script is working with another charset (here Latin-1) instead of UTF-8. MySQL and the browser may still be playing tricks as well, but this check tells you whether PHP and/or your editor are playing you false.
Next, extract Côte from the database and output its base64_encode. If you see Q8O0..., then the connection between MySQL and PHP is safely UTF-8. If not, then whatever else might also be needed, you need to change the MySQL charset (SET NAMES utf8 and/or ALTER the table and database collation).
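The two fingerprints can be reproduced independently of any editor setting by writing the bytes explicitly. This is only a sketch of the check described above, not part of the original answer.

```php
<?php
$utf8   = "C\xC3\xB4te"; // "Côte" encoded as UTF-8 (5 bytes)
$latin1 = "C\xF4te";     // "Côte" encoded as ISO-8859-1 (4 bytes)

$utf8Fingerprint   = base64_encode($utf8);
$latin1Fingerprint = base64_encode($latin1);

echo $utf8Fingerprint, "\n";   // Q8O0dGU=
echo $latin1Fingerprint, "\n"; // Q/R0ZQ==
```

Because base64 output is plain ASCII, it survives any browser or terminal charset confusion, which is what makes it a reliable probe. A trailing newline in the input changes the tail of the string, which presumably explains the Q/R0ZQo... form quoted above.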
If PHP is UTF-8, and MySQL is UTF-8, and you still see invalid characters, then it's something between PHP and the browser. Verify that the content type header is sent correctly; if not, try sending it yourself as the first thing in the script:
header('Content-Type: text/html; charset=UTF-8');
For example, in the Apache configuration you should have
AddDefaultCharset utf-8
Verify also that your browser is not set to override both server charset and auto-detection.
NOTE: as a rule of thumb, if you get a single diamond with a question mark instead of an international character, this means that a UTF-8 reader received an invalid UTF-8 byte sequence. In other words, the entity showing the diamond (your browser) is expecting UTF-8, but is receiving something else, for example Latin-1, a.k.a. ISO-8859-1.
Another difficult-to-track way of getting that error is if the output somehow contains a byte order mark (BOM). This may happen if you create a file such as
###<?php
header("Content-Type: text/html; charset=UTF-8");
?>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
</head>
<body>
Hellò, world!
</body>
</html>
where that ### is an (invisible in most editors) UTF8 BOM. To remove it, you either need to save the file as "without BOM" if the editor allows it, or use a different editor.
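If the editor gives no indication, the BOM can also be detected in code, since it is always the three bytes EF BB BF at the very start of the file. A minimal sketch; the helper names hasBom and stripBom are hypothetical.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the contents start with a UTF-8 byte order mark.
function hasBom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

// Return the contents without a leading BOM, unchanged otherwise.
function stripBom(string $contents): string
{
    return hasBom($contents) ? substr($contents, 3) : $contents;
}

$withBom = UTF8_BOM . '<html>';
var_dump(hasBom($withBom));   // bool(true)
var_dump(stripBom($withBom)); // string(6) "<html>"
```

Running hasBom(file_get_contents($path)) over the project's PHP files is one way to find the file that emits the stray bytes.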
If you do your "own querying" with the mysql command-line tool, you also have to set the option --default-character-set=utf8. Otherwise, please tell us how you do your own querying.

PHP: Simple XML and different codepages and getting the data correctly

I am working on this project where I receive different XML files from different sources. My PHP script should read them, parse them, and store them into the mysql database.
To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8 encoding, from Germany in iso-8859-1 encoding, from the Czech Republic in cp1250, and so on...
When I pass the xml-data to SimpleXMLElement and print an asXML() on this object, I see the xml data correctly as it was in the original xml file.
When I try to assign a field to a PHP-variable and print this variable on the screen, the text looks corrupted, and is of course also corrupted when inserted into the mysql database.
Example:
The XML:
<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km ; Dìèín - Rozb 741,85km </name>
...
The PHP code:
$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";
Result of the code (on linux bash shell) moves the cursor upwards and then prints: bín - Rozb 741,85km ; DÄ (the cursor movement is of course related to the incorrect characters that are printed out by PHP)
I think that PHP converts its data to UTF-8 to store it in a string parameter, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would show the correct result, but it doesn't. Also I should be able to store the data in a format that is combinable with all the other sources.
I don't know much about encodings/codepages, this is probably why I can't get it to work right, but what I do know is that if I copy/paste the texts from the different languages to for example a new UltraEdit file, all of them show up right. How does UltraEdit handle this? Does it use UTF-8 (which I presume can show anything?)
How can I convert my data so that it will always show up, with whatever encoding on the source?
Try iconv instead:
$str = iconv('UTF-8', 'WINDOWS-1250', $str);
The problem is that your input file is malformed. There is no character ì (LATIN SMALL LETTER I WITH GRAVE) in Windows-1250.
The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).
The fact that such a character shows correctly in the shell is likely fortuitous.
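For well-formed input, the general pattern can be sketched like this: SimpleXML reads the encoding from the XML declaration and returns strings as UTF-8, so conversion is only needed when the output target expects something else. The sample document below uses ISO-8859-1 and is made up for illustration.

```php
<?php
// A document declared and encoded as ISO-8859-1; "ü" is the single byte 0xFC.
$source = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>G\xFCrtel</name>";

$xml  = new SimpleXMLElement($source);
$name = (string) $xml;

echo bin2hex($name), "\n"; // 47c3bc7274656c: the value is now UTF-8

// Convert only at the output boundary, for example for a Windows-1250 terminal.
$forLegacyTerminal = iconv('UTF-8', 'WINDOWS-1250', $name);
echo bin2hex($forLegacyTerminal), "\n"; // 47fc7274656c
```

Storing the UTF-8 value in the database unchanged keeps data from all sources in one common encoding.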
