Decoding XML from UTF-8 to ISO-8859-1 in PHP

I'm trying to "decode" an XML file (and transform it with XSLT), but I'm having trouble with the encoding of both files. The scenario is as follows:
I have a data-entry site that is entirely encoded in ISO-8859-1 (our Oracle database uses that encoding, so I can't change it). I have two files: an XML file that describes the data-entry form and an XSLT file that transforms it into HTML. Both files are saved in ISO-8859-1, and both carry the corresponding XML declaration (encoding="ISO-8859-1"). Whenever I read the files and show them in the browser, the special characters (ñ, á, ¿) appear either as raw UTF-8 byte sequences or as question marks (depending on the method I use for showing them), but never in their "normal" representation.
My code for showing the XML file is:
<?php
$xslString = file_get_contents("catalog.xsl");
$xslString = utf8_decode($xslString);
$xslDoc = simplexml_load_string($xslString);
$xmlString = file_get_contents("questionnaire.xml");
$xmlString = utf8_decode($xmlString);
$xmlDoc = simplexml_load_string($xmlString);
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
?>
I already tried several combinations of DOMDocument, iconv, mb_convert_encoding, but they show the XML file as unencoded UTF, a question mark or a double question mark.
On the other hand, this also messes up my data entry, since if I want to enter one of those characters, they either show as ? or ?? on the corresponding data field on the DB, or they get truncated at the first special char (if I use iconv).
What am I missing? Is there a workaround? I can't convert anything to UTF-8 because of the database.
I hope I'm being clear enough, please excuse my English.
Thanks in advance!

Hope this helps others. In the end, there were two things:
1) I was reading the XML/XSL files like this (in my original script):
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xmlFile);
$xmlDoc->load("xmlfile.xml");
?>
which effectively changed the encoding to UTF-8. I changed the lines to:
<?php
$xmlString = file_get_contents("xmlfile.xml");
$xmlDoc = simplexml_load_string($xmlString);
?>
I also removed the utf8_decode calls, and it worked like a charm. Now my special characters appear on screen as intended. As a side effect, the data entered in the form is now saved correctly to my database, so I killed two birds with one stone.
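For readers who want to see why dropping the manual decoding helps, here is a minimal sketch. It assumes the XML parser honours the declared ISO-8859-1 encoding and hands text back to PHP as UTF-8; the sample document, the element name and the variable names are hypothetical, and the final conversion step is only one possible way to get ISO-8859-1 bytes back for the page or the database.

```php
<?php
// Hypothetical sample: an ISO-8859-1 document in which "ñ" is the single byte 0xF1.
$xmlString = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
           . "<form><label>A\xF1o</label></form>";

// Let the parser read the declared encoding itself; no utf8_decode() beforehand.
$xmlDoc = simplexml_load_string($xmlString);

// Text read back from SimpleXML is UTF-8, so "ñ" is now the two bytes 0xC3 0xB1.
$labelUtf8 = (string) $xmlDoc->label;

// Convert explicitly when the page or database expects ISO-8859-1.
$labelLatin1 = mb_convert_encoding($labelUtf8, 'ISO-8859-1', 'UTF-8');
```

Calling utf8_decode on the raw file first would treat ISO-8859-1 bytes as if they were UTF-8, which is a likely source of the question marks described in the question.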

Related

MYSQL REPLACE in QUERY and CDATA output usage still causes broken XML

I have a PHP script that queries a MySQL database and displays that information in XML format output.
I have a troublesome column that I have no control over (I can only SELECT). This column is filled with character returns etc.
In the MySQL query, I have tried to use REPLACE on this column like this:
REPLACE(PropertyInformation, '\r\n', '') AS PropertyInformation
In the PHP script I also wrap the exported XML in CDATA as I was told this could help, like this:
<Description><![CDATA[' . $PropertyInformation . ']]></Description>
I also form the XML like this in the script:
header("Content-Type: text/xml;charset=UTF-8");
echo '<?xml version="1.0" encoding="UTF-8"?>

The result is broken because your data is not UTF-8 even though you declare it to be (<?xml version="1.0" encoding="UTF-8"?>).
You need to convert your data to UTF-8.
There are three ways to do that:
convert the data in the database to UTF-8, or
convert it in the SELECT statement, or
convert it to UTF-8 in PHP, leaving the data in the database as-is.
For the first option, take a dump of the database, run an iconv conversion on it and import it back.
For the second, use SELECT CONVERT(latin1column USING utf8) ...
For the third, use iconv again, assuming your data is ISO-8859-1: $converted = iconv("ISO-8859-1", "UTF-8", $text);
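The third option can be sketched as follows. This is only an illustration under the assumption that the column holds ISO-8859-1 bytes; the sample value and variable names are hypothetical, and mb_check_encoding is used here merely to show the difference before and after conversion.

```php
<?php
// Hypothetical value as it might come from a latin1 column:
// "Señor" with "ñ" stored as the single byte 0xF1.
$propertyInformation = "Se\xF1or";

// The raw bytes are not valid UTF-8, which is why the XML breaks.
$isValidBefore = mb_check_encoding($propertyInformation, 'UTF-8');

// Convert to UTF-8 so the bytes match the encoding the XML declares.
$converted = iconv('ISO-8859-1', 'UTF-8', $propertyInformation);
$isValidAfter = mb_check_encoding($converted, 'UTF-8');

// Build the output with the converted text inside the CDATA section.
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . '<Description><![CDATA[' . $converted . ']]></Description>';
```

CDATA only protects against markup characters such as < and &; it does not change the byte encoding, so the conversion is still needed.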

encoding issues in drupal when importing from wordpress

I am currently moving blog posts from WordPress to Drupal. However, after the move,
some of the text is not being displayed correctly.
wordpress is displaying :
When it hasn’t (html code is <h2>When it hasn’t</h2>)
Drupal is displaying :
When it hasnâ€™t (html code is <h2>When it hasnâ€™t</h2>)
In the wordpress and drupal db the value is correct. The source is the same.
<h2>When it hasnâ€™t</h2>
I did a search and found many options. None of them helped.
Below are the ones I have done and checked.
1) I double-checked that UTF-8 is the character encoding in Drupal and WordPress.
I also made a simple test.php file to check nothing else was coming in the way
and it still did not display correctly.
2) I made sure when we take a mysqldump and upload to drupal utf-8
is used.
3) I also made sure the .php file is in utf-8 when saved.
4) I changed the encoding type in chrome for every option available and nothing
displayed it correctly.
5) I also used php functions to recode it but they did not work.
$value2="<h2>When it hasnâ€™t</h2>";
$out = recode_string('..utf-8', $value2);
//output - When it hasnt
$out2= mb_convert_encoding($value2,'UTF-8', "UTF-8");
// output - When it hasnÃ¢â‚¬â„¢t
$out3 = @iconv('UTF-8', 'utf-8', $value2);
// output - When it hasnÃ¢â‚¬â„¢t
I have run out of options now and I am stuck. Please help.

You say the text in both databases is correct, but this doesn't mean much: to view the content of a record you must use some client, and quite a few transformations may happen depending on how the text is rendered so you can read it.
So only two things matter:
the encoding of the column
the encoding of the HTML page returned by Drupal
Since your page outputs â€™ (the bytes 0xE2 0x80 0x99 read as CP1252) for ’ (Unicode U+2019, whose UTF-8 encoding is 0xE2 0x80 0x99), I guess the column is indeed UTF-8; however, something between the database and the browser treats the text as CP1252. This is what you have to check:
If using MySQL, the connection encoding must be UTF-8 so that what you have in your PHP script is UTF-8 text. You can use SET NAMES 'utf8' (MySQL's name for the character set has no hyphen). Note that if you don't need the full Unicode set, you can even use CP1252: the only important thing is that you know the encoding, since PHP strings are just byte arrays.
Explicitly define the response encoding in the HTTP Content-Type header. That is, configure Drupal to call header('Content-Type: text/html; charset=utf-8');
If the HTTP response encoding is different from the one used for the text retrieved from the database, transcode the query result accordingly.
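The mix-up described above can be reproduced in a few lines. This is a sketch for diagnosis only, assuming the mbstring extension is available; the variable names are hypothetical. It misreads the UTF-8 bytes of ’ as Windows-1252 (CP1252) and then reverses the mistake.

```php
<?php
// U+2019 (right single quotation mark) encoded as UTF-8: three bytes.
$apostropheUtf8 = "\xE2\x80\x99";

// Simulate a layer that wrongly treats those bytes as Windows-1252
// and re-encodes them to UTF-8. The result displays as "â€™".
$mojibake = mb_convert_encoding($apostropheUtf8, 'UTF-8', 'Windows-1252');

// Reversing the mistaken step recovers the original three bytes.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
```

If text in a dump already looks like $mojibake at the byte level, the wrong conversion happened before storage, and fixing the connection or header encoding alone will not repair it.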

Storing HTML in MySQL

I'm storing HTML and text data in my database table in its raw form - however I am having a slight problem in getting it to output correctly. Here is some sample data stored in the table AS IS:
<p>Professional Freelance PHP & MySQL developer based in Manchester.
<br />Providing an unbeatable service at a competitive price.</p>
To output this data I do:
echo $row['details'];
And this outputs the data correctly, however when I do a W3C validator check it says:
character "&" is the first character of a delimiter but occurred as data
So I tried using htmlentities and htmlspecialchars, but this just causes the HTML tags to be output as text on the page.
What is the correct way of doing this?

Use &amp; instead of &.

What you want to do is use the php function htmlentities()...
It will convert your input into html entities, and then when it is outputted it will be interpreted as HTML and outputted as the result of that HTML...For example:
$mything = "<b>BOLD & BOLD</b>";
//normally would throw an error if not converted...
//lets convert!!
$mynewthing = htmlentities($mything);
Now, just insert $mynewthing to your database!!

htmlentities is basically a superset of htmlspecialchars, and htmlspecialchars also replaces < and >.
Actually, what you are trying to do is to fix invalid HTML code, and I think this needs an ad-hoc solution:
$row['details'] = preg_replace("/&(?![#0-9a-z]+;)/i", "&amp;", $row['details']);
This is not a perfect solution, since it will fail for strings like: someone&son; (with a trailing ;), but at least it won't break existing HTML entities.
However, if you have decision power over how the data is stored, please enforce that the HTML code stored in the database is correct.
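To make the behaviour of that pattern concrete, here is a small sketch. The function name escapeBareAmpersands and the sample string are hypothetical; the pattern itself is the one from the answer above, with &amp; as the replacement.

```php
<?php
// Hypothetical helper wrapping the ad-hoc regex: escape every "&" that
// does not already start something that looks like an entity.
function escapeBareAmpersands(string $html): string
{
    return preg_replace('/&(?![#0-9a-z]+;)/i', '&amp;', $html);
}

// One bare ampersand, one named entity and one numeric entity.
$details = '<p>PHP & MySQL developer &amp; consultant &#169; 2010</p>';
$fixed = escapeBareAmpersands($details);
```

Only the bare ampersand is rewritten; the tags and the existing entities pass through unchanged, which is what htmlentities and htmlspecialchars cannot do on already-marked-up text.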

In my projects I use an XSLT parser, so I had to change to &#160; (e.g.). But this is the safest way I found.
Here is my code:
$html = trim(addslashes(htmlspecialchars(
html_entity_decode($_POST['html'], ENT_QUOTES, 'UTF-8'),
ENT_QUOTES, 'UTF-8'
)));
And when you read from DB, don't forget to use stripslashes();
$html = stripslashes($mysq_row['html']);

How do I get PHP to accept ISO-8859-1 characters in general?

This has been bugging me for ages and I want to get to the bottom of this once and for all. I have an associative array which fields I have defined using ISO-8859-1 characters. For instance:
array("utført" => "red");
I also have another array that I have loaded in from a file. I have printed this array out in a browser, checking that values like Æ, Ø and Å are intact. I try to compare two fields from these arrays and I'm slapped by the message:
Undefined index: utfã¸rt on line 39
I can't help but sob. Every single damn time I involve any letters outside UTF-8 in a script they are at some point converted into ã¸r or similar nonsense.
My script file is encoded in ISO-8859-1, the document from which I'm loading my data is the same, and so is the MySQL table I'm trying to save the data to.
So the only conclusion I can draw is that PHP isn't accepting just any character set in its code, and I have to somehow force PHP to speak Norwegian.
Thanks for any suggestions
Just FYI, I won't accept any answers in the lines of "Just don't use those characters" or "Just replace those characters with UTF equivalents at file load" or any other hack solutions

When you read your data from an external file, try converting it to the proper encoding.
Something like this is what I have in mind:
$f = file_get_contents('externaldata.txt');
$f = mb_convert_encoding($f, 'iso-8859-1');
// from this point deal with $f whatever you want
Also, look at mb_convert_encoding() manual for more info.

PHP: Simple XML and different codepages and getting the data correctly

I am working on this project where I receive different XML files from different sources. My PHP script should read them, parse them, and store them into the mysql database.
To parse the XML files, I use the SimpleXMLElement class in PHP. I receive files from Belgium in UTF-8 encoding, from Germany in iso-8859-1 encoding, from the Czech Republic in cp1250, and so on...
When I pass the xml-data to SimpleXMLElement and print an asXML() on this object, I see the xml data correctly as it was in the original xml file.
When I try to assign a field to a PHP-variable and print this variable on the screen, the text looks corrupted, and is of course also corrupted when inserted into the mysql database.
Example:
The XML:
<?xml version="1.0" encoding="cp1250"?>
...
<name>Labe Dìèín - Rozb 741,85km ; Dìèín - Rozb 741,85km </name>
...
The PHP code:
$sxml = file_get_contents("test.xml");
$xml = new SimpleXMLElement($sxml);
//echo $xml->asXML() . "\n"; // content will show up correctly in the shell
$name = (string)$xml->ftm->fairway_section->geo_object->name;
echo $name . "\n";
Result of the code (on linux bash shell) moves the cursor upwards and then prints: bÃn - Rozb 741,85km ; DÄ (the cursor movement is of course related to the incorrect characters that are printed out by PHP)
I think that PHP converts its data to UTF-8 to store it in a string parameter, so I presumed that using mb_convert_encoding to convert from UTF-8 to cp1250 would show the correct result, but it doesn't. Also I should be able to store the data in a format that is combinable with all the other sources.
I don't know much about encodings/codepages, this is probably why I can't get it to work right, but what I do know is that if I copy/paste the texts from the different languages to for example a new UltraEdit file, all of them show up right. How does UltraEdit handle this? Does it use UTF-8 (which I presume can show anything?)
How can I convert my data so that it will always show up, with whatever encoding on the source?

Try iconv instead:
$str = iconv('UTF-8', 'WINDOWS-1250', $str);
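As a rough illustration of that call, the sketch below assumes the string obtained from SimpleXML is UTF-8 and that the installed iconv knows the WINDOWS-1250 name. The sample word is hypothetical and written as explicit bytes so that the example does not depend on the encoding of the script file.

```php
<?php
// "Děčín" as UTF-8 bytes, the form in which SimpleXML is assumed to return text.
$nameUtf8 = "D\xC4\x9B\xC4\x8D\xC3\xADn";

// Convert to Windows-1250 for a consumer that expects that code page.
$nameCp1250 = iconv('UTF-8', 'WINDOWS-1250', $nameUtf8);

// Converting back restores the UTF-8 form, which is the safer
// common format when several source encodings must be combined.
$roundTrip = iconv('WINDOWS-1250', 'UTF-8', $nameCp1250);
```

Storing everything as UTF-8 and converting only at the edges avoids having to remember which source produced which bytes.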

The problem is that your input file is malformed. There is no character ì (LATIN SMALL LETTER I WITH GRAVE) in Windows-1250.
The closest character is U+00ED (LATIN SMALL LETTER I WITH ACUTE).
The fact such character shows correctly in the shell is likely fortuitous.
