PHP file_get_contents and DOMXPath UTF-8 encoding issue

I'm reading an external file which contains this:
<td>ÖZGÜR </td>
And I read it like this:
$html = file_get_contents("");
$html = str_replace("charset=iso8859-9", "charset=utf-8", $html);
// Assumed setup (not shown in the original post): build the XPath object.
$doc = new DOMDocument();
$doc->loadHTML($html);
$x = new DOMXPath($doc);
$rows = $x->query('//tr[contains(@class,"tablerow")]');
foreach ($rows as $node) {
    echo $node->childNodes->item(12)->nodeValue;
}
It does not echo ÖZGÜR; it echoes �ZGÜR instead.
What encoding function should I call here?
Thanks for any help!

You should use
mb_internal_encoding("UTF-8");
to set the internal encoding instead of
$html = str_replace("charset=iso8859-9" , "charset=utf-8" , $html);
If the data is stored in a database, then you also need to change the connection encoding when fetching the data:
mysql_set_charset('utf8', $constring);
Then you will be able to retrieve it in UTF-8.

Try converting $html to UTF-8 right after you read it with file_get_contents, something like:
$html = iconv('ISO-8859-9', 'UTF-8', $html);
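A minimal sketch of that conversion, assuming the page really is ISO-8859-9 and that its meta tag spells the label iso8859-9 as in the question. The helper name latin5HtmlToUtf8 and the sample bytes are made up for illustration. The point is that the bytes and the declared charset have to change together, because relabelling alone converts nothing.

```php
<?php
// Hypothetical helper: convert ISO-8859-9 (Turkish) HTML to UTF-8 and
// update the declared charset so a parser does not misread the new bytes.
function latin5HtmlToUtf8(string $html): string
{
    // Re-encode the actual bytes first.
    $utf8 = iconv('ISO-8859-9', 'UTF-8', $html);
    if ($utf8 === false) {
        throw new RuntimeException('iconv could not convert the input');
    }
    // Then relabel the document to match its new encoding.
    return str_ireplace('charset=iso8859-9', 'charset=utf-8', $utf8);
}

// "ÖZGÜR" as ISO-8859-9 bytes: Ö is 0xD6 and Ü is 0xDC.
$sample = "<meta charset=iso8859-9><td>\xD6ZG\xDCR </td>";
$converted = latin5HtmlToUtf8($sample);
echo $converted, PHP_EOL;
```

The converted string can then be handed to DOMDocument and queried as in the question.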

Related

Can't decode my HTML entities

I am storing my raw html like this...
nl2br(htmlentities($this->input->post('raw_html')))
In my database the data looks like this...
&lt;ul&gt; &lt;li&gt;Improve our understanding of this issue&lt;/li&gt; &lt;li&gt;Strengthen your listening and writing skills&lt;/li&gt; &lt;/ul&gt;
When I try and display the markup from my database I use this:
echo html_entity_decode($html_from_db, ENT_COMPAT, 'UTF-8');
But I get this output being shown in the browser:
<ul> <li>Improve our understanding of this issue</li> <li>Strengthen your listening and writing skills</li> </ul>
Lesson name
And html entities are shown in my source code... so no entities are being decoded.
Why is this not working?

You probably encoded the string twice when calling htmlentities. Look at the function's parameters:
function htmlentities ($string, $quote_style = null, $charset = null, $double_encode = true) {}
So you can try this:
nl2br(htmlentities($this->input->post('raw_html'), ENT_QUOTES, 'UTF-8', false))
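To see what the fourth parameter changes, here is a small sketch with a made-up input string that already contains one entity. It also shows why a single html_entity_decode call leaves entities behind when the data was encoded twice, which is what the question describes.

```php
<?php
// Hypothetical input that already contains one entity (&amp;).
$raw = '<li>Tom &amp; Jerry</li>';

// Default behaviour ($double_encode = true): the existing entity is encoded again.
$default = htmlentities($raw, ENT_QUOTES, 'UTF-8');

// With $double_encode = false, existing entities are left alone.
$single = htmlentities($raw, ENT_QUOTES, 'UTF-8', false);

// Encoding an already-encoded string: one decode pass only undoes one layer.
$twice = htmlentities($default, ENT_QUOTES, 'UTF-8');
$decodedOnce = html_entity_decode($twice, ENT_QUOTES, 'UTF-8');

echo $default, PHP_EOL; // &lt;li&gt;Tom &amp;amp; Jerry&lt;/li&gt;
echo $single, PHP_EOL;  // &lt;li&gt;Tom &amp; Jerry&lt;/li&gt;
var_dump($decodedOnce === $default); // bool(true): still encoded once
```

If the stored data has already been encoded twice, decoding it twice on output is a workaround, but fixing the encoding step is cleaner.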

PHP How to convert strings from DomCrawler to UTF-8

I have some data that I collect with DomCrawler and store in an array, but it looks like it fails when it comes to special characters like è, à, ï, etc.
As an example, I get Ã¨ instead of è when I echo the result.
When I store my results in a .json file I get this: \u00c3\u00a8
My goal is to save the special character in the .json file.
I've tried encoding it, but that doesn't seem to give the result I want.
$html = file_get_contents($url);
$crawler = new Crawler($html);
$h1 = $crawler->filter('h1');
$title = $h1->text();
$title = mb_convert_encoding($title, "HTML-ENTITIES", "UTF-8");
Is there any way I can have my special characters shown?
Thanks a lot!

When you pass the HTML to the constructor, the crawler assumes that it is in ISO-8859-1. You have to tell it explicitly that your DOM is in UTF-8 with the addHtmlContent method:
$html = file_get_contents($url);
$crawler = new Crawler;
$crawler->addHtmlContent($html, 'UTF-8');
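The \u00c3\u00a8 in the .json file is what this misreading produces: the two UTF-8 bytes of è are each treated as an ISO-8859-1 character. The sketch below reproduces that without DomCrawler, using mb_convert_encoding to simulate the wrong assumption. It also shows json_encode's JSON_UNESCAPED_UNICODE flag, in case the literal character is wanted in the file. The variable names are only illustrative.

```php
<?php
// "è" is the two bytes C3 A8 in UTF-8.
$correct = "\xC3\xA8";

// Simulate the misreading: treat each UTF-8 byte as an ISO-8859-1 character
// and re-encode it as UTF-8. The result displays as "Ã¨".
$mojibake = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo json_encode($mojibake), PHP_EOL; // "\u00c3\u00a8", as in the question
echo json_encode($correct), PHP_EOL;  // "\u00e8", a valid JSON escape for è
echo json_encode($correct, JSON_UNESCAPED_UNICODE), PHP_EOL; // the literal character
```

Note that "\u00e8" is already correct JSON; any JSON parser turns it back into è. The flag only changes how the file looks.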

PHP json decode with æøå

I have a problem with JSON decoding my products, which contain the special characters "æøå".
Decode code and echo:
$products = json_decode($details['items'],true);
foreach($products as $pro){
..
<?php echo $pro['name']; ?>
..
In my database the name of the product looks like this: 'SpÃ¥ner'. However, the echo prints 'Spu00e5ner'. It needs to be 'Spåner'.
I know the code isn't up to date, but there has to be a way to show the special characters.
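
I made a function that might help you with your problem.
function convertChars($char){
    // Round-trip through entities to normalise the input characters.
    $return = html_entity_decode(htmlentities($char, ENT_QUOTES, 'UTF-8'), ENT_QUOTES , 'ISO-8859-15');
    // Transliterate to plain ASCII.
    $return = iconv("UTF-8","ASCII//TRANSLIT",$return);
    // Lowercase and drop everything except letters and digits.
    return strtolower(preg_replace('/[^a-zA-Z0-9]+/','',$return));
}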
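The 'Spu00e5ner' output in the question looks like the JSON escape \u00e5 with its backslash removed. That would happen if something such as stripslashes ran on the JSON before it was saved. This is a guess about the cause, not something the question confirms. The sketch below, with a made-up stored value, shows the difference between an intact and a stripped escape.

```php
<?php
// Hypothetical stored value; single quotes keep the backslash literal.
$stored = '{"name":"Sp\u00e5ner"}';

// With the escape intact, json_decode returns the real character.
$good = json_decode($stored, true);

// Simulate a layer that strips backslashes before the data is saved.
$damaged = stripslashes($stored);
$bad = json_decode($damaged, true);

echo $good['name'], PHP_EOL; // Spåner
echo $bad['name'], PHP_EOL;  // Spu00e5ner
```

If this is the cause, the fix belongs where the JSON is written, so that the backslashes survive.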

XML character encoding issue with PHP

I have code that creates an XML document; my only problem is with the encoding of words like á, olá and ção.
These characters don't appear correctly, and when I try reading the XML I get an error relating to that character.
$dom_doc = new DOMDocument("1.0", "utf-8");
$dom_doc->preserveWhiteSpace = false;
$dom_doc->formatOutput = true;
// Create the root element on the same document ($dom was undefined).
$element = $dom_doc->createElement("hotels");
while ($row = mysql_fetch_assoc($result)) {
    // One child element per row, named after the row id.
    $contact = $dom_doc->createElement( "m" . $row['id'] );
    $nome = $dom_doc->createElement("nome", $row['nome'] );
    $data1 = $dom_doc->createElement("data1", $row['data'] );
    $data2 = $dom_doc->createElement("data2", $row['data2'] );
    $contact->appendChild($nome);
    $contact->appendChild($data1);
    $contact->appendChild($data2);
    $element->appendChild($contact);
}
// Attach the root once, after all rows have been added.
$dom_doc->appendChild($element);
What can I change to fix my problem? I am using UTF-8.

Please try putting 'á', 'olá' or 'ção' directly in your script:
$data1 = $dom_doc->createElement("data1", 'ção');
If that works without a problem, then the data you get from MySQL is probably wrongly encoded.
Are you sure your MySQL connection outputs correct UTF-8?
To find out, make your PHP script dump the data into an HTML document with the meta tag set to UTF-8 and see if the characters display correctly.
You can also call:
$data1 = $dom_doc->createElement("data1", mb_detect_encoding($row['data']));
and see what encoding PHP detects for your data.
If you can't convert the data in your database or change its settings, you can use mb_convert_encoding to do it on the fly: http://www.php.net/manual/en/function.mb-convert-encoding.php
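A minimal sketch of the on-the-fly conversion, assuming the database returns ISO-8859-1 bytes. The source encoding is an assumption and the sample value is made up; substitute whatever the connection really delivers.

```php
<?php
// Hypothetical database value: "olá" as ISO-8859-1 bytes (á is 0xE1).
$fromDb = "ol\xE1";

// Convert only when the value is not already valid UTF-8.
// Assumption: anything that fails the check is ISO-8859-1.
if (!mb_check_encoding($fromDb, 'UTF-8')) {
    $fromDb = mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-1');
}

// DOMDocument expects UTF-8 strings for element values.
$doc = new DOMDocument('1.0', 'utf-8');
$doc->appendChild($doc->createElement('nome', $fromDb));
$xml = $doc->saveXML();
echo $xml;
```

Setting the connection charset so that the database returns UTF-8 in the first place avoids the per-value check.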

You are using UTF-8, the 8-bit Unicode encoding format. Even though it properly supports all 1,112,064 code points in Unicode, it's possible that there is an issue here.
Try UTF-16 as the encoding; just an idea. See below:
$dom_doc = new DOMDocument("1.0", "utf-16");
OR
$dom_doc = new DOMDocument("1.0", "ISO-10646");

Encoding Conversion with PHP

I'm trying to do a Latin1 to UTF-8 conversion for WordPress and had no luck with the tutorial posted in the Codex. I came up with this to check the encoding and convert:
while ($row = mysql_fetch_assoc($sql)) {
    if (!mb_check_encoding($row['post_content'], 'UTF-8')) {
        $row = mb_convert_encoding($row['post_content'], 'ISO-8859-1', 'UTF-8');
        if (!mb_check_encoding($row['post_content'], 'UTF-8')) {
            echo 'Can\'t Be Converted<br/>';
        }
        else {
            echo '<br/>'.$row.'<br/><br/>';
        }
    }
    else {
        echo 'UTF-8<br/>';
    }
}
This works... sort of. I'm not getting any rows that can't be converted, but I did notice that Panamá becomes Panam.
Am I missing a step? Or am I doing this all wrong?
UPDATE
The data stored within the database is corrupt (Ã¡ characters are stored). So it's looking more like a find-and-replace job than a conversion. I haven't found any great solutions so far for doing this automagically.

This will help you: http://php.net/manual/en/book.iconv.php
Furthermore, you can set your MySQL connection to utf8 this way:
mysql_set_charset('utf8', $this->getConnection());
In my code, $this->getConnection() returns the value that was returned by
mysql_connect(MYSQL_SERVER, DB_LOGIN, DB_PASS);

Refer to the PHP documentation for mb_convert_encoding:
string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding ] )
The target encoding comes before the source encoding, so your code is attempting to convert to ISO-8859-1 from UTF-8!
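A short sketch of both directions with made-up sample bytes. The first call is the Latin1 to UTF-8 conversion the question is after. The second addresses the UPDATE: if the stored text is double-encoded (it displays as "PanamÃ¡"), converting from UTF-8 back to ISO-8859-1 undoes one layer. This assumes the damage came through ISO-8859-1; data mangled via Windows-1252 may need that encoding name instead.

```php
<?php
// "Panamá" as ISO-8859-1 bytes (á is the single byte 0xE1).
$latin1 = "Panam\xE1";

// Target encoding comes second, source encoding third.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Hypothetical double-encoded value, which displays as "PanamÃ¡".
// Going from UTF-8 back to ISO-8859-1 undoes one layer of the damage.
$doubleEncoded = "Panam\xC3\x83\xC2\xA1";
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($utf8 === "Panam\xC3\xA1");     // bool(true): valid UTF-8 "Panamá"
var_dump($repaired === "Panam\xC3\xA1"); // bool(true): same bytes after repair
```

Run a repair like this on a copy of the data first, since rows that are already correct UTF-8 would be damaged by it.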
