PHP How to convert strings from DomCrawler to UTF-8 - php

I collect some data with DomCrawler and store it in an array, but it seems to fail on special characters such as è, à, ï, etc.
For example, I get Ã¨ instead of è when I echo the result.
When I store my results in a .json file I get this: \u00c3\u00a8
My goal is to save the special character in the .json file.
I've tried converting the encoding, but it doesn't give the result I want.
$html = file_get_contents($url);
$crawler = new Crawler($html);
$h1 = $crawler->filter('h1');
$title = $h1->text();
$title = mb_convert_encoding($title, "HTML-ENTITIES", "UTF-8");
Is there any way I can have my special characters shown?
Thanks a lot!

When the HTML is passed to the constructor, the crawler assumes that it is in ISO-8859-1. You have to tell it explicitly that your DOM is in UTF-8 with the addHTMLContent method:
$html = file_get_contents($url);
$crawler = new Crawler;
$crawler->addHTMLContent($html, 'UTF-8');
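
To illustrate where the Ã¨ and the \u00c3\u00a8 in the question come from, here is a minimal sketch using only core PHP functions (it assumes the mbstring extension is loaded; the variable names are made up for illustration). It reproduces the misreading of UTF-8 bytes as ISO-8859-1, and shows that json_encode() keeps the literal character when the input is correct UTF-8 and JSON_UNESCAPED_UNICODE is passed.

```php
<?php
// "è" encoded as UTF-8 is the two bytes 0xC3 0xA8.
$utf8 = "\xC3\xA8";

// Treating those two bytes as ISO-8859-1 characters and re-encoding
// them as UTF-8 produces the mojibake "Ã¨" seen in the question.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// json_encode() escapes non-ASCII characters by default, so the
// mojibake ends up as two \u escapes.
$escaped = json_encode(['title' => $mojibake]);
echo $escaped, PHP_EOL; // {"title":"\u00c3\u00a8"}

// With correctly decoded input, JSON_UNESCAPED_UNICODE writes the
// character itself into the JSON output.
$literal = json_encode(['title' => $utf8], JSON_UNESCAPED_UNICODE);
echo $literal, PHP_EOL; // {"title":"è"}
```

So once the crawler is told the document is UTF-8, passing JSON_UNESCAPED_UNICODE when writing the .json file should keep è readable in the file.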

Related

PHP decoding square brackets href attr to html file

Saving HTML to a file encodes the square brackets in the href attribute.
// My string
$teaserTest = "<a href='[CLICK_URL]'><strong>testgerr</strong></a>";
//Calling save function
saveFile($teaserTest);
//Save function
function saveFile($stringToAdd){
$doc = new DOMDocument();
$doc->formatOutput = true;
$doc->loadHTML('<html><head><title>Test</title></head><body>'.$stringToAdd.'</body></html>');
$doc->saveHTMLFile("Campaigns/test.html");
}
File result: <a href="%5BCLICK_URL%5D">
I'm trying to keep the "[" unencoded.
Square brackets are reserved characters in URLs, as specified in the relevant RFC. They matter for IP address literals, for example: http://[::1]/example/
That is why encoding them is the correct behaviour. If you need a placeholder, use a different pattern for it.
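
One way to read "use a different pattern" is sketched below, assuming the dom extension is available. The helper name renderTeaser and the placeholder __CLICK_URL__ are hypothetical; the idea is to give DOMDocument a token that contains no reserved characters and to put the bracketed token back only after serialization.

```php
<?php
// Hypothetical helper: swaps the bracketed token for a URL-safe
// placeholder before parsing and restores it after serialization.
function renderTeaser(string $fragment): string
{
    $token = '[CLICK_URL]';
    $placeholder = '__CLICK_URL__'; // assumed not to occur elsewhere in the markup

    $doc = new DOMDocument();
    $doc->formatOutput = true;
    $doc->loadHTML(
        '<html><head><title>Test</title></head><body>'
        . str_replace($token, $placeholder, $fragment)
        . '</body></html>'
    );

    // saveHTML() returns the document as a string, so the token can be
    // restored before anything is written to disk.
    return str_replace($placeholder, $token, $doc->saveHTML());
}

$html = renderTeaser("<a href='[CLICK_URL]'><strong>testgerr</strong></a>");
echo $html;
```

The returned string can then be written with file_put_contents() instead of saveHTMLFile(), since the replacement has to happen on the serialized output.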

How to fix encoding with dom

I am trying to scrape some old pages and present them in a modern design using DOM.
I have a problem with the encoding. The content is in French.
I am using this code to get the content I want. There are two types of content: "Categories" and "Data".
$html = new DOMDocument();
$html->validateOnParse = true;
@$html->loadHTML($page);
$xpath = new DOMXPath($html);
$table = $xpath->query("//*[@style='background: white']")->item(0);
Then I process the content. First I pass the categories to a function that converts them to an ID:
function category_to_id($category) {
$categories = array('Forêts','Assurance','Aéronautique','Equipement ','Autre');
foreach ($categories as $id => $cat) {
if(trim($cat) == trim($category)) {
return $id + 1;
}
}
}
Then I store everything in a MySQL database.
My first problem is that my function only works for categories without special characters, like Assurance.
The second is that when I look in the database, I find the data stored as Travaux d'Ã©lectricitÃ© instead of Travaux d'électricité.
I tried adding $html->encoding = 'utf-8'; but that didn't change anything.
What am I doing wrong, and how can I fix it?
DOM doesn't use UTF-8 by default, so you should convert the page before loading it:
$html->loadHTML(mb_convert_encoding($page, 'HTML-ENTITIES', 'UTF-8'));
Alternatively, you could utf8_decode your string:
echo category_to_id(utf8_decode("Travaux d'électricité"));
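
As far as I know, both the 'HTML-ENTITIES' pseudo-encoding and utf8_decode() are deprecated as of PHP 8.2. The sketch below shows a substitute technique rather than the answer's exact call: mb_encode_numericentity() turns every non-ASCII character into a numeric entity, so the markup handed to loadHTML() is pure ASCII and the charset guess no longer matters. The sample $page is invented for illustration, and mbstring and dom are assumed to be loaded.

```php
<?php
// Invented sample standing in for the scraped page (UTF-8 source).
$page = '<html><body><p>Forêts</p></body></html>';

// Encode every code point from U+0080 up to U+10FFFF as a numeric
// entity; ASCII markup is left untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($page, $convmap, 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($ascii);

// DOM hands strings back as UTF-8, so the value can be compared
// directly with UTF-8 literals such as those in category_to_id().
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'Forêts');
```

With the text coming back as real UTF-8, the comparison in category_to_id() should work as long as the PHP source file itself is saved as UTF-8, and the database connection charset still has to be UTF-8 for the stored values to look right.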

PHP: Converting xml to array

I have an xml string. That xml string has to be converted into PHP array in order to be processed by other parts of software my team is working on.
For the XML -> array conversion I'm using something like this:
if(get_class($xmlString) != 'SimpleXMLElement') {
$xml = simplexml_load_string($xmlString);
}
if(!$xml) {
return false;
}
It works fine - most of the time :) The problem arises when my "xmlString" contains something like this:
<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>
Then simplexml_load_string won't do its job (and I know that's because of the "<" character).
As I can't influence any other part of the code (I can't open up the module that generates the XML string and tell it "encode special characters, please!"), I need your suggestions on how to fix that problem BEFORE calling simplexml_load_string.
Do you have any ideas? I've tried
str_replace("<", "&lt;", $xmlString)
but that simply ruins the entire $xmlString... :(
Well, then you can replace the special characters inside the quoted values of $xmlString with their HTML entity counterparts using htmlspecialchars() and preg_replace_callback().
I know this is not performance-friendly, but it does the job :)
<?php
$xmlString = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';
$xmlString = preg_replace_callback('~(?:").*?(?:")~',
function ($matches) {
return htmlspecialchars($matches[0], ENT_NOQUOTES);
},
$xmlString
);
header('Content-Type: text/plain');
echo $xmlString; // you will see the special characters are converted to HTML entities :)
echo PHP_EOL . PHP_EOL; // tidy :)
$xmlobj = simplexml_load_string($xmlString);
var_dump($xmlobj);
?>
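
A variation on the same idea is sketched below: try a normal parse first and repair the string only if that fails, so well-formed input is never touched. The function name loadXmlLenient is hypothetical, and the pattern assumes attribute values are always double-quoted and never contain a literal double quote.

```php
<?php
// Hypothetical helper: parse normally, and only if that fails escape
// "<" inside double-quoted attribute values and try once more.
function loadXmlLenient(string $xmlString)
{
    // Collect libxml errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        $repaired = preg_replace_callback(
            '~="([^"]*)"~',
            function (array $m): string {
                return '="' . str_replace('<', '&lt;', $m[1]) . '"';
            },
            $xmlString
        );
        $xml = simplexml_load_string($repaired);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml; // SimpleXMLElement, or false if the repair did not help
}

$xml = loadXmlLenient('<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>');
$key = (string) $xml->Node0['Key'];
echo $key, PHP_EOL; // <1
```

Escaping only "<" leaves existing entities such as &amp; alone, which avoids double-encoding values that were already correct.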

PHP file_get_contents and domxpath UTF-8 encoding issue

I'm reading an external file which contains this :
<td>ÖZGÜR </td>
And I read it like this :
$html = file_get_contents("");
$html = str_replace("charset=iso8859-9" , "charset=utf-8" , $html);
$rows = $x->query('//tr[contains(@class,"tablerow")]');
foreach($rows as $node)
{
echo $node->childNodes->item(12)->nodeValue;
}
It does not echo ÖZGÜR; it echoes �ZGÜR instead.
Which encoding function should I call here?
Thanks for any help !
You should use the
mb_internal_encoding("UTF-8");
function to change the encoding instead of
$html = str_replace("charset=iso8859-9" , "charset=utf-8" , $html);
If the data is stored in a database, then you need to change the connection encoding when fetching the data.
Call mysql_set_charset('utf8', $constring) and you will be able to retrieve it in UTF-8.
Try converting $html to UTF-8 after you load it with file_get_contents, something like:
$html = iconv('ISO-8859-9', 'UTF-8', $html);
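
To make the effect of that iconv() call concrete, here is a small sketch that starts from raw ISO-8859-9 bytes rather than a downloaded page (the sample string is invented; the iconv and mbstring extensions are assumed to be available). It includes İ, one of the letters that distinguishes ISO-8859-9 from ISO-8859-1.

```php
<?php
// "ÖZGÜR İ" as raw ISO-8859-9 bytes: Ö = 0xD6, Ü = 0xDC, İ = 0xDD.
$latin5 = "\xD6ZG\xDCR \xDD";

// Convert the bytes themselves; editing the charset label in the
// markup alone does not change how the text is stored.
$utf8 = iconv('ISO-8859-9', 'UTF-8', $latin5);
if ($utf8 === false) {
    throw new RuntimeException('Conversion from ISO-8859-9 failed');
}

// Sanity check that the result is well-formed UTF-8.
$isValid = mb_check_encoding($utf8, 'UTF-8');

echo $utf8, PHP_EOL; // ÖZGÜR İ
```

After the conversion, the charset declared inside the HTML should also say UTF-8, which is where the str_replace from the question becomes useful as a second step.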

XML character encoding issue with PHP

I have code which creates an XML document; my only problem is the encoding of words like á, olá and ção.
These characters don't appear correctly, and when I try to read the XML I get an error relating to that character.
$dom_doc = new DOMDocument("1.0", "utf-8");
$dom_doc->preserveWhiteSpace = false;
$dom_doc->formatOutput = true;
$element = $dom_doc->createElement("hotels");
while ($row = mysql_fetch_assoc($result)) {
$contact = $dom_doc->createElement( "m" . $row['id'] );
$nome = $dom_doc->createElement("nome", $row['nome'] );
$data1 = $dom_doc->createElement("data1", $row['data'] );
$data2 = $dom_doc->createElement("data2", $row['data2'] );
$contact->appendChild($nome);
$contact->appendChild($data1);
$contact->appendChild($data2);
$element->appendChild($contact);
}
$dom_doc->appendChild($element);
What can I change to fix my problem? I am using UTF-8.
Please try putting 'á', 'olá' or 'ção' directly in your script:
$data1 = $dom_doc->createElement("data1", 'ção');
If that works without problems, the data you get from MySQL is probably wrongly encoded.
Are you sure your MySQL connection outputs correct UTF-8?
To find out, make your PHP script dump the data into an HTML document with the meta charset set to UTF-8 and see whether the characters display correctly.
You can also call :
$data1 = $dom_doc->createElement("data1", mb_detect_encoding($row['data']));
and see what encoding is detected by PHP for your data.
If you can't convert the data in your database, or change its settings, you can use mb_convert_encoding to do it on the fly: http://www.php.net/manual/en/function.mb-convert-encoding.php
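
The on-the-fly conversion could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper toUtf8 is hypothetical, any value that is not valid UTF-8 is assumed to be ISO-8859-1, and a hard-coded array stands in for the MySQL rows. The mbstring and dom extensions are assumed to be loaded.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone and assumes anything
// else is ISO-8859-1 (an assumption; adjust to the real source charset).
function toUtf8(string $value): string
{
    return mb_check_encoding($value, 'UTF-8')
        ? $value
        : mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Stand-in for the database rows: "olá" stored as ISO-8859-1 bytes.
$rows = [['id' => 1, 'nome' => "ol\xE1"]];

$doc = new DOMDocument('1.0', 'utf-8');
$doc->formatOutput = true;
$hotels = $doc->appendChild($doc->createElement('hotels'));

foreach ($rows as $row) {
    $contact = $hotels->appendChild($doc->createElement('m' . $row['id']));
    $nome = $contact->appendChild($doc->createElement('nome'));
    // createTextNode() also escapes characters such as & and <.
    $nome->appendChild($doc->createTextNode(toUtf8($row['nome'])));
}

$xml = $doc->saveXML();
echo $xml;
```

Setting the connection charset so that the database already returns UTF-8 is usually the cleaner fix; the helper is a fallback for when that is not possible.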
You are using UTF-8, the 8-bit Unicode encoding format. Even though it properly supports all 1,112,064 code points in Unicode, it's possible that there is an issue here.
Try UTF-16 instead; just an idea. See below:
$dom_doc = new DOMDocument("1.0", "utf-16");
OR
$dom_doc = new DOMDocument("1.0", "ISO-10646");
