Parsing XML and output encoding in PHP

I generate a lot of posts in WordPress from an XML file. The problem: accented characters.
The XML declaration of the feed is:
<?xml version="1.0" encoding="ISO-8859-15"?>
Here is the complete feed: http://flux.netaffiliation.com/rsscp.php?maff=177053821BA2E13E910D54
My site is in UTF-8.
So I use the function utf8_encode, but that does not solve the problem: the accented characters are still displayed incorrectly.
Does anyone have an idea?
EDIT 04-10-2011 18:02 (French time):
Here is my code:
/**
 * Parse an RSS feed from NetAffiliation and convert each item to a post
 * @param string $flux external link
 * @return bool
 */
private function parseFluxNetAffiliation($flux)
{
    $content = file_get_contents($flux);
    $content = iconv("iso-8859-15", "utf-8", $content);
    $xml = new DOMDocument;
    $xml->loadXML($content);
    // get the first link: http://www.netaffiliation.com
    $link = $xml->getElementsByTagName('link')->item(0);
    //echo $link->textContent;
    // get all items and build a multidimensional array
    $items = $xml->getElementsByTagName('item');
    $offers = array();
    // walk the items
    foreach ($items as $item)
    {
        $childs = $item->childNodes;
        // walk the children
        foreach ($childs as $child)
        {
            $offers[$child->nodeName][] = $child->nodeValue;
        }
    }
    unset($offers['#text']);
    // create one article for each offer
    $nbrPosts = count($offers['title']);
    if ($nbrPosts <= 0)
    {
        echo self::getFeedback("Le flux ne contient aucune offre", 'error');
        return false;
    }
    $i = 0;
    while ($i < $nbrPosts)
    {
        // Create post object
        $description = '<p>'.$offers['description'][$i].'</p><p>'.$offers['link'][$i].'</p>';
        $my_post = array(
            'post_title' => $offers['title'][$i],
            'post_content' => $description,
            'post_status' => 'publish',
            'post_author' => 1,
            'post_category' => array(self::getCatAffiliation())
        );
        // Insert the post into the database
        wp_insert_post($my_post);
        $i++;
    }
    echo self::getFeedback("Le flux a généré {$nbrPosts} article(s) depuis le flux NetAffiliation dans la catégorie affiliation", 'updated');
    return false;
}
All the posts are generated, but the accented characters come out garbled. You can see the result here: http://monsieur-mode.com/test/

There are plenty of difficulties to master when switching between different encodings. Encodings that use more than one byte per character (so-called multibyte encodings) such as UTF-8, which WordPress uses, deserve special attention in PHP.
First, make sure that all the files you create are saved in the same encoding as the one they are served with. For example, the encoding you pick in the "Save as..." dialog should match the one you declare in the HTTP Content-Type header.
Second, you need to verify that the input has the same encoding as the output you want to deliver. In your case, the input file is encoded as ISO-8859-15, so you'll need to convert it to UTF-8 using iconv().
Third, you must know that PHP's core string functions work on bytes, not characters, so they are not aware of multibyte encodings such as UTF-8. Functions such as htmlentities() can produce strange characters when they are not told the right charset (before PHP 5.4 their default was ISO-8859-1). Many byte-oriented functions, such as strlen() and substr(), have multibyte alternatives prefixed with mb_. If your encoding is UTF-8, check your files for such functions and replace them where necessary.
For more information about these topics, see Wikipedia on variable-width encodings and the multibyte string page in the PHP manual.
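A small sketch of the third point, assuming the mbstring extension is loaded. The sample word is made up and written with \u{...} escapes so that the encoding of the script file does not matter.

```php
<?php
// "été": three characters, but five bytes in UTF-8 (each "é" takes two bytes).
$word = "\u{E9}t\u{E9}";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($word);
$charLength = mb_strlen($word, 'UTF-8');

// mb_strtoupper() knows how to upper-case the accented letters.
$upper = mb_strtoupper($word, 'UTF-8');

// htmlentities() is given the charset of its input explicitly.
$escaped = htmlentities($word, ENT_QUOTES, 'UTF-8');

echo $byteLength, ' bytes, ', $charLength, ' characters', PHP_EOL;
echo $upper, ' / ', $escaped, PHP_EOL;
```

Passing the encoding explicitly to every mb_ call avoids depending on the mbstring.internal_encoding / default_charset configuration of the server.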

By default, most applications work with UTF-8 data and output UTF-8 content. WordPress is no exception and certainly works on a UTF-8 basis.
I would not convert anything at all when printing; instead, change the encoding declared in your header from ISO-8859-15 to UTF-8.

If your incoming XML data is ISO-8859-15, use iconv() to convert it:
$stream = file_get_contents("stream.xml");
$stream = iconv("iso-8859-15", "utf-8", $stream);

mb_convert_encoding() saved my life.
Here is my solution:
$content = preg_replace('/ encoding="ISO-8859-15"/is','',$content);
$content = mb_convert_encoding($content,"UTF-8");
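A variant of the same idea with the source encoding spelled out, as a sketch only. It assumes the feed really is ISO-8859-15, as its declaration claims; the helper name feedToUtf8 and the sample feed string are hypothetical. A likely reason the iconv() attempt in the question still produced garbled text is that the XML declaration kept saying ISO-8859-15 after the bytes had been converted, so the parser decoded the UTF-8 bytes a second time.

```php
<?php
/**
 * Convert a feed to UTF-8 and make its XML declaration agree with the new bytes.
 * Hypothetical helper: $from is whatever the feed's declaration announces.
 */
function feedToUtf8(string $xml, string $from = 'ISO-8859-15'): string
{
    $utf8 = mb_convert_encoding($xml, 'UTF-8', $from);

    // Rewrite encoding="..." inside the <?xml ... ?> declaration only (first match).
    return preg_replace('/(<\?xml[^>]*encoding=")[^"]*(")/i', '${1}UTF-8${2}', $utf8, 1);
}

// Sample ISO-8859-15 input: \xE9 is "é" and \xA4 is the euro sign in that charset.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>"
     . "<rss><title>caf\xE9 10 \xA4</title></rss>";

$doc = new DOMDocument();
$doc->loadXML(feedToUtf8($raw));

// DOM always hands back UTF-8 strings.
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
echo $title, PHP_EOL;
```

Another option is to skip the manual conversion entirely and pass the untouched bytes to loadXML(), letting the parser honor the declaration; the point is to convert exactly once.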

Related

PHP json encode - Malformed UTF-8 characters, possibly incorrectly encoded [duplicate]

I'm calling json_encode($data) on a data array, and one of the fields contains Russian characters.
I used mb_detect_encoding() to check the encoding of that field, and it reports UTF-8.
I think json_encode fails because of some bad characters in it, like "ра▒". I tried a lot of things. Running utf8_encode on the data bypasses the error, but then the data no longer looks correct.
What can be done about this issue?
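The issue happens when the string contains some invalid (non-UTF-8) byte sequences, even though most of it is valid UTF-8. Converting from UTF-8 to UTF-8 drops or replaces the invalid sequences (depending on mbstring's substitute-character setting), and then json_encode works.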
$data['name'] = mb_convert_encoding($data['name'], 'UTF-8', 'UTF-8');
If you have a multidimensional array to encode as JSON, you can use the function below.
If JSON_ERROR_UTF8 occurred:
$encoded = json_encode( utf8ize( $responseForJS ) );
The function below cleans the array data recursively:
/* Use it when json_encode hits corrupt UTF-8 chars
 * useful for = malformed utf-8 characters possibly incorrectly encoded by json_encode
 */
function utf8ize( $mixed ) {
    if (is_array($mixed)) {
        foreach ($mixed as $key => $value) {
            $mixed[$key] = utf8ize($value);
        }
    } elseif (is_string($mixed)) {
        return mb_convert_encoding($mixed, "UTF-8", "UTF-8");
    }
    return $mixed;
}
Please make sure to initialize your PDO object with the charset set to utf8.
This should fix the problem and avoid any re-utf8izing dance.
$pdo = new PDO("mysql:host=localhost;dbname=mybase;charset=utf8", 'user', 'password');
As of PHP 7.2, two flags let you handle invalid UTF-8 directly in json_encode:
https://www.php.net/manual/en/function.json-encode
json_encode($text, JSON_INVALID_UTF8_IGNORE);
Or
json_encode($text, JSON_INVALID_UTF8_SUBSTITUTE);
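A short sketch comparing the two flags with the default behaviour, assuming PHP 7.2 or later. The sample string is made up: two ASCII letters followed by a byte that is not valid UTF-8.

```php
<?php
$bad = "ab\xB1";   // "ab" followed by a stray byte that is invalid in UTF-8

// Default behaviour: encoding fails and json_last_error() reports the reason.
$plain = json_encode($bad);
$error = json_last_error();

// IGNORE silently drops the invalid bytes.
$ignored = json_encode($bad, JSON_INVALID_UTF8_IGNORE);

// SUBSTITUTE replaces them with U+FFFD, the Unicode replacement character.
$substituted = json_encode($bad, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($plain, $error === JSON_ERROR_UTF8, $ignored, $substituted);
```

SUBSTITUTE keeps a visible marker where data was lost, which tends to make the underlying encoding problem easier to notice than IGNORE does.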
Just add charset=utf8 to your PDO connection string,
as in the connection line below:
$pdo = new PDO("mysql:host=localhost;dbname=mybase;charset=utf8", 'user', 'password');
Hope this helps.
Remove HTML entities before JSON encoding. I used html_entity_decode() in PHP and the problem was solved
$json = html_entity_decode($source);
$data = json_decode($json,true);
Do you by any chance have UUIDs in your result set? In that case the following database flag will help:
PDO::DBLIB_ATTR_STRINGIFY_UNIQUEIDENTIFIER => true
If your data is correctly encoded in the database, make sure to use the mb_* functions for string handling before json_encode. Functions like substr or strlen work on bytes, so with utf8mb4 data they can cut your text in the middle of a character and leave malformed UTF-8.
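To illustrate, here is a minimal sketch (the sample string is made up, and mbstring is assumed to be available) of how a byte-based cut produces a string that json_encode rejects, while the mb_ equivalent does not:

```php
<?php
$name = "d\u{E9}j\u{E0} vu";   // "déjà vu" in UTF-8

// substr() cuts after 2 bytes: "d" plus the first half of "é".
$cutBytes = substr($name, 0, 2);

// mb_substr() cuts after 2 characters: "dé".
$cutChars = mb_substr($name, 0, 2, 'UTF-8');

$brokenJson = json_encode($cutBytes);                          // fails on malformed UTF-8
$validJson  = json_encode($cutChars, JSON_UNESCAPED_UNICODE);  // encodes normally

var_dump(mb_check_encoding($cutBytes, 'UTF-8'), $brokenJson, $validJson);
```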
I know this is kind of an old topic, but it was what I needed. I just had to modify the answer by 'jayashan perera'.
//...code
$stmt->execute();
$result = $stmt->fetchAll(PDO::FETCH_ASSOC);
for ($i = 0; $i < sizeof($result); $i++) {
    $tempCnpj = $result[$i]['CNPJ'];
    $tempFornecedor = json_encode(html_entity_decode($result[$i]['Nome_fornecedor']), true);
    $tempData = $result[$i]['efetivado_data'];
    $tempNota = $result[$i]['valor_nota'];
    $arrResposta[$i] = ["Status"=>"true", "Cnpj"=>"$tempCnpj", "Fornecedor"=>$tempFornecedor, "Data"=>"$tempData", "Nota"=>"$tempNota" ];
}
echo json_encode($arrResposta);
And in the .js file I used:
obj = JSON.parse(msg);

How to fix encoding with dom

I am trying to scrape some old pages and present them in a modern design using DOM.
I have a problem with the encoding. The content is in French.
I am using this code to get the content that I want. There are 2 types of content: "Categories" and "Data".
$html = new DOMDocument();
$html->validateOnParse = true;
@$html->loadHTML($page);
$xpath = new DOMXPath($html);
$table = $xpath->query("//*[@style='background: white']")->item(0);
Then I process the content. First I pass the categories to a function that converts them to an id:
function category_to_id($category) {
    $categories = array('Forêts','Assurance','Aéronautique','Equipement ','Autre');
    foreach ($categories as $id => $cat) {
        if (trim($cat) == trim($category)) {
            return $id + 1;
        }
    }
}
Then I store everything in a MySQL database.
My first problem is that my function only works for categories without special characters, like Assurance.
The second is that when I look in the database, I find the data stored with garbled accented characters instead of Travaux d'électricité.
I tried adding $html->encoding = 'utf-8'; but that didn't change anything.
What am I doing wrong, and how can I fix it?
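DOM doesn't assume UTF-8 by default when loading HTML, so you should convert the page into a form it reads correctly, for example by turning the non-ASCII characters into HTML entities:
$xml->loadHTML(mb_convert_encoding($page, 'HTML-ENTITIES', "UTF-8"));
Alternatively, you could utf8_decode your string: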
echo category_to_id(utf8_decode("Travaux d'électricité"));
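Another commonly used way to get the same effect, shown here only as a sketch: declare the charset to the HTML parser by prepending a meta tag before loading. The helper name loadUtf8Html and the sample markup are hypothetical, and the page is assumed to already be UTF-8.

```php
<?php
/**
 * Load a UTF-8 HTML string into DOMDocument without mangling accents.
 * Hypothetical helper: the prepended meta tag tells libxml which charset to use.
 */
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect parser warnings instead of printing them (old pages are rarely valid HTML).
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc  = loadUtf8Html("<p>Travaux d'\u{E9}lectricit\u{E9}</p>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $text, PHP_EOL;
```

Text read back from the DOM is then UTF-8, so it can be compared directly with UTF-8 string literals such as the category names, provided the script file itself is saved as UTF-8.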

PHP. JSON encode utf-8

I want to encode JSON, but when I use the json_encode function, the result is not a UTF-8 string. I added the header header('Content-Type: application/json; charset=utf-8'); and the data from the database comes back fine. How can I solve the problem?
My code:
foreach ($dbh->query('SELECT Event.name, Event.description, Category.name as category FROM Event, Category WHERE Event.category_id = Category.category_id') as $row) {
    $event['name'] = utf8_encode($row['name']);
    $event['description'] = utf8_encode($row['description']);
    $event['category'] = utf8_encode($row['category']);
    $events[] = $event;
}
echo json_encode($events);
PHP's json_encode always needs UTF-8 strings, regardless of your charset. You must encode all your strings beforehand.
To clarify, you must use utf8_encode on data extracted from your database only if it is not already in UTF-8.
json_encode(array(
    "one" => utf8_encode("super string &éùà"),
    "two" => utf8_encode("super string &éùà")
));
Note: utf8_encode only converts from ISO 8859-1 (and is deprecated as of PHP 8.2). If you are using another charset, see iconv().
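Running utf8_encode on data that is already UTF-8 encodes it twice, which is one plausible cause of the garbled output in the question. A hedged sketch of a guard against that, where the helper name toUtf8 and the sample values are hypothetical and the fallback charset is something you have to know from your database setup:

```php
<?php
/**
 * Return $value as UTF-8, converting from $from only when it is not valid UTF-8 yet.
 * Heuristic: a legacy-encoded string can, in rare cases, also be valid UTF-8.
 */
function toUtf8(string $value, string $from): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;   // already UTF-8: converting again would double-encode it
    }
    return mb_convert_encoding($value, 'UTF-8', $from);
}

$latin1  = "caf\xE9";       // "café" as ISO-8859-1 bytes
$already = "caf\u{E9}";     // "café" as UTF-8

$events = [
    ['name' => toUtf8($latin1, 'ISO-8859-1')],
    ['name' => toUtf8($already, 'ISO-8859-1')],
];

$json = json_encode($events, JSON_UNESCAPED_UNICODE);
echo $json, PHP_EOL;
```

Setting the connection charset so that the database returns UTF-8 in the first place removes the need for this kind of guessing.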

Detect Encoding and Convert Everything to UTF-8 with PHP

I want to extract various data from URLs and convert it to UTF-8, no matter which encoding the original page uses (or at least in a way that works for most source encodings).
So, after looking and searching many discussions and answers, I finally came with the following code, with which I am parsing HTML data twice (once for detecting encoding and a second time for getting the actual data). This is working at least on all the checked URLs. But I think that the code is poorly written.
Can anyone let me know if there are any better alternatives to do the same or if I need any improvements on the code?
<?php
header('Content-Type: text/html; charset=utf-8');
require_once 'curl.php';
require_once 'curl_response.php';
$curl = new Curl;
$url = "http://" . $_GET['domain'];
$curl_response = $curl->get($url);
$header_content_type = $curl_response->headers['Content-Type'];
$dom_doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $curl_response);
libxml_use_internal_errors(FALSE);
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if (strtolower($meta->getAttribute('http-equiv')) == 'content-type') {
        $meta_content_type = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('charset') != '') {
        $html5_charset = $meta->getAttribute('charset');
    }
}
if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
    $charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
    $charset = $m[1];
} elseif (!empty($html5_charset)) {
    $charset = $html5_charset;
} elseif (preg_match('/encoding=(.+)/', $curl_response, $m)) {
    $charset = $m[1];
} else {
    // browser default charset
    // $charset = 'ISO-8859-1';
}
if (!empty($charset) && $charset != "utf-8") {
    $tmp = iconv($charset, 'utf-8', $curl_response);
    libxml_use_internal_errors(TRUE);
    $dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $tmp);
    libxml_use_internal_errors(FALSE);
}
$page_title = $dom_doc->getElementsByTagName('title')->item(0)->nodeValue;
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if (strtolower($meta->getAttribute('name')) == 'description') {
        $meta_description = $meta->getAttribute('content');
    }
    if (strtolower($meta->getAttribute('name')) == 'keywords') {
        $meta_tags = $meta->getAttribute('content');
    }
}
print $charset;
print "<hr>";
print $page_title;
print "<hr>";
print $meta_description;
print "<hr>";
print $meta_tags;
print "<hr>";
print "Memory Peak Usages: " . memory_get_peak_usage()/1024/1024 . " MB";
?>
Your question is too open-ended, and I've voted to close it. However, I will still provide a stub of an answer that will, hopefully, point you in the right direction.
At the moment, you are checking user-defined input for the charset. This is a very, very, very bad move, for various reasons:
* Most webmasters on small sites will just send header("Content-type: text/html; charset=utf-8") because they've heard it is good practice, without actually encoding their content as UTF-8. Not taking this into account will lead to mangled UTF-8 output.
* Some webmasters do the opposite: they do not set a header, and their web server outputs ISO-8859-1 headers despite a UTF-8 encoding. Visibly on a page, this does not matter, but it matters for DOMDocument (I've had this issue recently).
* Double UTF-8 encoding through iconv is never fun.
I'd strongly advise using a utility to decode UTF-8 until there are no more entities within the UTF-8 extended range of characters and then encoding once, rather than relying on iconv or multibyte encoding. The reason is simple: these can get it wrong. You can also set an error handler to parse DOMDocument errors in order to catch and redirect the loadXML "failed due to malformed XML" errors, which will not be related to your character encoding at all. Basically, the key to your problem is to not blindly do stuff.
If you'd like good targets where you need to worry about UTF-8, parse the home page of Google Play. They send out malformed replies (which is what initially forced me to go through the UTF-8-decode-until-nothing-is-in-the-range approach). It will also show you that DOMDocument can fail due to a wide variety of reasons - not just charset - and that you need to follow the errors to deal with them.
Other performance pointers outside of that big encoding snafu include:
* Fragmenting your code into functions that return results. You've got a lot of repetition in there; learn to use functions so you don't have to explicitly write the same core logic multiple times.
* This:
    if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
        $charset = $m[1];
    } elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
  is horrible. You can easily replace it with a strpos call, which will speed up this particular set of ifs by about 5-10x.
* $metas = $dom_doc->getElementsByTagName('meta'); - you're aware that DOMDocument will go through your entire DOM when you use this method, right? Consider restricting the XPath query to just the head tag (which is always the first child of html, the document element. XPath: /html/head; note that XPath positions start at 1, not 0, so /html/head[0] would match nothing).
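The first two pointers above can be sketched together: one small function, built on stripos instead of a regular expression, that can be reused for both the HTTP header and the meta tag value. The function name charsetFromContentType is hypothetical, and no speed-up is measured here.

```php
<?php
/**
 * Extract the charset parameter from a Content-Type style value,
 * e.g. 'text/html; charset=ISO-8859-15'. Returns null when none is present.
 */
function charsetFromContentType(?string $value): ?string
{
    if ($value === null) {
        return null;
    }
    $pos = stripos($value, 'charset=');
    if ($pos === false) {
        return null;
    }
    // Take everything after "charset=", stop at the next parameter, drop quotes and spaces.
    $charset = substr($value, $pos + strlen('charset='));
    $charset = trim(explode(';', $charset, 2)[0], " \t\"'");

    return $charset === '' ? null : strtolower($charset);
}

// The same helper serves the HTTP header and the <meta http-equiv> content attribute.
var_dump(charsetFromContentType('text/html; charset=ISO-8859-15'));
var_dump(charsetFromContentType('text/html'));
```

Lower-casing the result also sidesteps the case-sensitive comparison against "utf-8" in the original script, which would treat "UTF-8" as a different charset.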
In regard to performance, you should call unset() when you're done with variables or values, even if you're going to reset their values, but not if you need the value further down your script. PHP does not hand the memory back right away; it reuses the preallocated memory released by unset() for future use.
Another thing you could do is take huge chunks of that code and split them into functions that return values. Remember that variables local to a function are automatically released after it returns, unless you're working with global variables.
Those will help performance and memory utilization.

XML character encoding issue with PHP

I have code that creates an XML document. My only problem is with the encoding of words like á, olá and ção.
These characters don't appear correctly, and when I try to read the XML I get an error relating to that character.
$dom_doc = new DOMDocument("1.0", "utf-8");
$dom_doc->preserveWhiteSpace = false;
$dom_doc->formatOutput = true;
$element = $dom_doc->createElement("hotels");
while ($row = mysql_fetch_assoc($result)) {
    $contact = $dom_doc->createElement( "m" . $row['id'] );
    $nome = $dom_doc->createElement("nome", $row['nome'] );
    $data1 = $dom_doc->createElement("data1", $row['data'] );
    $data2 = $dom_doc->createElement("data2", $row['data2'] );
    $contact->appendChild($nome);
    $contact->appendChild($data1);
    $contact->appendChild($data2);
    $element->appendChild($contact);
}
$dom_doc->appendChild($element);
What can I change to fix my problem? I am using UTF-8.
Please try putting 'á', 'olá' or 'ção' directly in your script:
$data1 = $dom_doc->createElement("data1", 'ção');
If that works without problems, then it is probably the data you get from MySQL that is wrongly encoded.
Are you sure your MySQL connection returns correct UTF-8?
To find out, make your PHP script dump the data into an HTML document with the meta tag set to UTF-8 and see whether the characters display correctly.
You can also call:
$data1 = $dom_doc->createElement("data1", mb_detect_encoding($row['data']));
and see which encoding PHP detects for your data.
If you can't convert the data in your database, or change its settings, you can use mb_convert_encoding to do it on the fly: http://www.php.net/manual/en/function.mb-convert-encoding.php
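A sketch of the on-the-fly conversion, under stated assumptions: the rows come back in ISO-8859-1 when they are not valid UTF-8 (that source encoding is a guess you would need to confirm for your database), and the function name buildHotelsXml and the sample rows are hypothetical. Text is added through createTextNode so that characters like & are escaped as well.

```php
<?php
/**
 * Build the <hotels> document, converting each name to UTF-8 first if needed.
 * $sourceEncoding is an assumption about what the database connection returns.
 */
function buildHotelsXml(array $rows, string $sourceEncoding = 'ISO-8859-1'): string
{
    $doc  = new DOMDocument('1.0', 'utf-8');
    $root = $doc->appendChild($doc->createElement('hotels'));

    foreach ($rows as $row) {
        $nome = $row['nome'];
        if (!mb_check_encoding($nome, 'UTF-8')) {
            $nome = mb_convert_encoding($nome, 'UTF-8', $sourceEncoding);
        }

        $hotel = $root->appendChild($doc->createElement('m' . $row['id']));
        $el    = $hotel->appendChild($doc->createElement('nome'));
        $el->appendChild($doc->createTextNode($nome));   // escapes &, <, > for us
    }

    return $doc->saveXML();
}

$xml = buildHotelsXml([
    ['id' => 1, 'nome' => "recep\xE7\xE3o"],            // ISO-8859-1 bytes for "recepção"
    ['id' => 2, 'nome' => "ol\u{E1} & c\u{E3}o"],       // already UTF-8
]);

echo $xml;
```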
You are using UTF-8, the 8-bit Unicode encoding format. Even though it properly supports all 1,112,064 code points in Unicode, it's possible that there is an issue here.
Try UTF-16 as the encoding, just an idea. See below:
$dom_doc = new DOMDocument("1.0", "utf-16");
OR
$dom_doc = new DOMDocument("1.0", "ISO-10646");
