Character encoding issues - UTF-8 / Issue while transmitting data on the internet? - php

I've got data being sent from the client side like this:
// $booktitle = "Comí habitación bailé"
$xml_obj = new DOMDocument('1.0', 'utf-8');
// node created with booktitle and added to xml_obj
// NO htmlentities / other transformations done
$returnHeader = drupal_http_request($url, $headers = array("Content-Type: text/xml; charset=utf-8"), $method = 'POST', $data = $xml_data, $retry = 3);
When I receive it at my end (via that drupal_http_request) and I do htmlentities on it, I get the following:
Comí habitación bailé
Which when displayed looks like gibberish:
Comí Habitación Bailé
What is going wrong?
Edit 1)
<?php
$title = "Comí habitación bailé";
echo "title=$title\n";
echo 'encoding is '.mb_detect_encoding($title);
$heutf8 = htmlentities($title, ENT_COMPAT, "UTF-8");
echo "heutf8=$heutf8\n";
?>
Running this test script on a Windows machine and redirecting to a file shows:
title=Comí habitación bailé
encoding is UTF-8heutf8=
Running this on a linux system:
title=Comí habitación bailé
encoding is UTF-8PHP Warning: htmlentities(): Invalid multibyte sequence in argument in /home/testaccount/public_html/test2.php on line 5
heutf8=

I think you shouldn't encode the entities with htmlentities just to output them correctly (as stated in the comments, you should use htmlspecialchars to avoid cross-site scripting). Just set the correct headers and meta tag and echo the values normally:
<?php
header ('Content-type: text/html; charset=utf-8');
?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
</body>
</html>
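As a minimal sketch of the htmlspecialchars approach mentioned above: the helper name escape_html and the sample title are made up for illustration. It escapes only the HTML-significant characters and leaves accented characters as plain UTF-8, which the browser renders correctly once the charset header is sent.

```php
<?php
// Hypothetical helper: escapes text for HTML output without turning
// accented characters into entities.
function escape_html(string $text): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
    // byte sequences with U+FFFD instead of returning an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Sample title containing markup that must not reach the browser unescaped.
$title = "Com\u{ED} habitaci\u{F3}n <b>bail\u{E9}</b>";
$safeTitle = escape_html($title);

// In a real page, header('Content-Type: text/html; charset=utf-8')
// would be sent before this output.
echo $safeTitle, "\n";
```

ENT_SUBSTITUTE is worth noting here: without it, an invalid byte sequence makes the function return an empty string, which matches the empty heutf8= output in the test script from the question.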

htmlentities interprets its input as ISO-8859-1 by default in PHP versions before 5.4 (later versions default to UTF-8); are you passing UTF-8 for the charset parameter?
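To illustrate the effect of the charset parameter, here is a small sketch that encodes the same UTF-8 bytes twice, once declared as UTF-8 and once declared as ISO-8859-1. The variable names are illustrative only.

```php
<?php
$utf8 = "\u{E9}"; // "é" stored as the two UTF-8 bytes 0xC3 0xA9

// Declared as UTF-8: the two bytes are read as one character.
$correct = htmlentities($utf8, ENT_COMPAT, 'UTF-8');

// Declared as ISO-8859-1: each byte is read as a separate character.
$mangled = htmlentities($utf8, ENT_COMPAT, 'ISO-8859-1');

echo $correct, "\n"; // &eacute;
echo $mangled, "\n"; // &Atilde;&copy;
```

The second result is the same kind of output that renders as "Ã©" in the browser, which is the gibberish pattern described in the question.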

Try passing the header information as a key/value array.
Something like:
$headers = array("Content-Type" => "text/xml; charset=utf-8");

Related

PHP: Get encoded html entities

I'm trying to get the html entities of a UTF-8 string,
Example: example.com/search?q=مرحبا
<?php
echo htmlentities($_GET['q']);
?>
I got:
مرحبا0مرحبا
It's UTF-8 text not html entities,
what I need is:
&#1605;&#1585;&#1581;&#1576;&#1575;
I have tried urldecode and htmlentities functions!
Add this code to the start of your file:
header('Content-Type: text/html; charset=utf-8');
The browser needs to know the page is UTF-8. The same declaration can also go in the head section as a meta tag:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
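I think you can solve it by splitting the string into characters and taking the code point of each one.
Building on Mark Baker's answer and vartec's answer, you get:
<?php
// Split the UTF-8 string into individual characters.
$chrArray = preg_split('//u', $_GET['q'], -1, PREG_SPLIT_NO_EMPTY);
$htmlEntities = "";
foreach ($chrArray as $chr) {
// mb_ord() (PHP 7.2+) stands in here for the _uniord() helper from the referenced answer.
$htmlEntities .= '&#' . mb_ord($chr, 'UTF-8') . ';';
}
echo $htmlEntities;
?>
I have not tested it.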
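As an alternative sketch to the per-character loop, the mbstring extension has a built-in for this conversion, mb_encode_numericentity. The wrapper name to_numeric_entities and the hard-coded sample string are assumptions for illustration; in the question the input would come from $_GET['q'].

```php
<?php
// Hypothetical wrapper: converts every non-ASCII character to a decimal
// numeric entity and leaves ASCII untouched.
function to_numeric_entities(string $text): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($text, $convmap, 'UTF-8');
}

// The Arabic word from the question, written with code point escapes so the
// example does not depend on the encoding of the source file.
$query = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}";
$encoded = to_numeric_entities($query);

echo $encoded, "\n"; // &#1605;&#1585;&#1581;&#1576;&#1575;
```

Note that the browser will render these entities as the original text, so the result is only visible as entities when viewing the page source.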

trying to get "é" character to print out correctly

I am trying to read the RSS/XML feed from iTunes, and I have noticed that artists and songs with special characters, such as the é in Beyoncé, show up as Beyoncé.
I have tried the following to get it to display correctly, without success. I have searched Google and this site for the correct answer, but nothing has worked.
Here is what I have tried; I may be way off.
echo html_entity_decode($entry->imartist, ENT_COMPAT, 'UTF-8');
here is the full code
function itunes(){
$itunes_feed = "https://itunes.apple.com/au/rss/topsongs/limit=100/explicit=true/xml";
$itunes_feed = file_get_contents($itunes_feed);
$itunes_feed = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $itunes_feed);
$itunes_xml = new SimpleXMLElement($itunes_feed);
$itunes_entry = $itunes_xml->entry;
foreach($itunes_entry as $entry){
echo html_entity_decode($entry->title."<br>", ENT_COMPAT, 'UTF-8');
echo html_entity_decode($entry->imartist, ENT_COMPAT, 'UTF-8');
echo "<br><br>";
// Get the value of the entry ID, by using the 'im' namespace within the <id> attribute
$entry_id['im'] = $entry->id->attributes('im', TRUE);
echo (string)$entry_id['im']['artist'];
//echo $entry_id['artist']."<br>";
}
}
That feed is valid UTF-8, so you shouldn't need to decode it with html_entity_decode. What happens if you add <meta charset="utf-8" /> to the <head> of the HTML page?
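A minimal sketch of reading the namespaced artist element directly, without html_entity_decode and without stripping prefixes via preg_replace. The inline XML is a made-up stand-in for the real feed, and its structure and namespace URI are assumptions.

```php
<?php
// Inline sample standing in for the iTunes feed (structure assumed).
$sample = '<?xml version="1.0" encoding="utf-8"?>'
    . '<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://itunes.apple.com/rss">'
    . '<entry><title>Halo</title><im:artist>Beyonc' . "\u{E9}" . '</im:artist></entry>'
    . '</feed>';

$feed = new SimpleXMLElement($sample);

$artists = [];
foreach ($feed->entry as $entry) {
    // children('im', true) selects child elements by namespace prefix.
    $artists[] = (string) $entry->children('im', true)->artist;
}

foreach ($artists as $artist) {
    // SimpleXML returns UTF-8 strings; only HTML escaping is needed.
    echo htmlspecialchars($artist, ENT_QUOTES, 'UTF-8'), "<br>\n";
}
```

The output is still raw UTF-8, so the page must declare that charset for the é to render correctly.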

Read Persian (Unicode chars) text file using php

I am reading a Persian text file (using PHP) with the code below:
/* Reading the file name and the book (UTF-8) */
if(file_exists($SourceDirectoryFile))
{
$NameBook = "name.txt";
$AboutBook = "about.txt";
$myFile = "Computer-Technolgy/2 ($i)/".$NameBook;
$fh = fopen($myFile, 'r');
$theData = fread($fh, filesize($myFile));
fclose($fh);
echo 'Name file: '. $theData.'<hr/>';
}
Contents of name.txt:
آموزش شبكه هاي کامپيوتري (LEARNING NETWORK)
Output:
Name file: ����� ���� ��� ��������� (LEARNING NETWORK)
You are seeing this because you are echoing the raw contents without telling the browser which encoding they use. The browser needs that information to display the text correctly.
The easiest way is to use the snippet below.
/* Reading the file name and the book (UTF-8) */
if (file_exists($SourceDirectoryFile))
{
$NameBook = "name.txt";
$AboutBook = "about.txt";
// Using file_get_contents instead. Less code
$myFile = "Computer-Technolgy/2 ($i)/" . $NameBook;
$contents = file_get_contents($myFile);
// I want my browser to display UTF-8 characters
header('Content-Type: text/html; charset=UTF-8');
echo 'Name file: ' . $contents . '<hr/>';
}
Please note that the header function must be called before any output is sent to the browser. If anything is displayed before this call, move the header statement to the top. Otherwise you will get a warning on screen that the headers have already been sent.
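The header only helps if the file really contains UTF-8. As a rough sketch, the bytes can be checked before they are echoed; the function name read_utf8_text is hypothetical, and it takes the raw bytes (for example from file_get_contents($myFile)) rather than a path.

```php
<?php
// Hypothetical helper: returns the text if it is valid UTF-8, or null if the
// bytes are in some other encoding and need converting first.
function read_utf8_text(string $bytes): ?string
{
    // Drop a leading UTF-8 byte order mark, if present.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = (string) substr($bytes, 3);
    }

    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : null;
}

// Valid UTF-8 (a BOM followed by one Arabic-script letter) passes the check.
$ok = read_utf8_text("\xEF\xBB\xBF\u{0622}");

// Bytes from a single-byte legacy encoding are rejected.
$bad = read_utf8_text("\xC2\xE3");

var_dump($ok !== null, $bad === null);
```

If the check fails, the file was most likely saved in a legacy code page, and re-saving it as UTF-8 in the editor is the simplest fix.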
You'll need to make sure that the page where you're displaying the text file has correct encoding.
The final and best solution is this:
Add this line right after your connection call:
mysqli_set_charset( $con, 'utf8');
Like this:
$con = mysqli_connect("localhost","root","amirahmad","shoutit");
mysqli_set_charset( $con, 'utf8');
Finally, add this line right under the head tag in your HTML to make sure your page uses the UTF-8 charset, like this:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
And that's it. You can read the official documentation here: php.net charset

Simple RSS encoding issue

Consider the following PHP code for getting RSS news on a site I'm developing:
<?php
$url = "http://dariknews.bg/rss.php";
$xml = simplexml_load_file($url);
$feed_title = $xml->channel->title;
$feed_description = $xml->channel->description;
$feed_link = $xml->channel->link;
$item = $xml->channel->item;
function getTheData($item){
for ($i = 0; $i < 4; $i++) {
$article_title = $item[$i]->title;
$article_description = $item[$i]->description;
$article_link = $item[$i]->link;
echo "<p><h3>". $article_title. "</h3></p><small>".$article_description."</small><p>";
}
}
?>
The data accumulated by this function should be presented in the following HTML format:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Новини от Дарик</title>
</head>
<body>
<?php getTheData($item);?>
</body>
</html>
As you can see, I added both windows-1251 (Cyrillic) and UTF-8 encodings, but the RSS feed is unreadable unless I change the browser encoding to UTF-8. The default encoding in my case is Cyrillic, and with it the feed is unreadable. Any help making this RSS readable in Cyrillic (it's from Bulgaria) will be greatly appreciated.
I've just tested your code, and the Bulgarian characters displayed fine when I removed the charset=windows-1251 meta tag and left only the UTF-8 one. Want to try that and see if it works?
Also, you might want to change your <html> tag to reflect the fact that your page is in Bulgarian like this: <html xmlns="http://www.w3.org/1999/xhtml" lang="bg" xml:lang="bg">
Or maybe you need to force the web server to send the content as UTF-8 by sending a Content-Type header:
<?php
header("Content-Type: text/html; charset=UTF-8");
?>
Just be sure to include this before ANY other content (even whitespace) is sent to the browser. If you don't you'll get the PHP "headers already sent" error.
Maybe you should take a look at htmlentities.
It converts certain characters to HTML entities.
$titleEncoded = htmlentities((string) $article_title, ENT_XHTML, 'cp1251');
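SimpleXML always returns UTF-8 strings, so serving the page as UTF-8 is the simpler route. If the page really has to stay in windows-1251, a sketch of the alternative is to convert each string explicitly before output. The sample title below is made up for illustration.

```php
<?php
// "Дарик" written with code point escapes, i.e. a UTF-8 string such as
// SimpleXML would return.
$utf8Title = "\u{0414}\u{0430}\u{0440}\u{0438}\u{043A}";

// Convert from UTF-8 to the single-byte Cyrillic code page.
$cp1251Title = mb_convert_encoding($utf8Title, 'Windows-1251', 'UTF-8');

// Five characters become five bytes.
echo bin2hex($cp1251Title), "\n"; // c4e0f0e8ea
```

With this approach the page must declare only charset=windows-1251, and any character outside that code page is lost in the conversion.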

Trying to display Vietnamese characters with php

When I try to display Vietnamese characters with the following code:
<?php
$str = "Nghệ thuật cắm hoa vải";
//echo utf8_encode(html_entity_decode(($str)));
echo html_entity_decode($str);
//echo $str;
?>
I get Ngh�? thu�?t c??m hoa va?i as a result.
I have tried several options but couldn't make it work. Any ideas?
Is the PHP script encoded in UTF-8? If it is, send a header indicating so:
header("Content-type: text/html; charset=utf-8");
Alternatively, do:
echo mb_convert_encoding($str, "HTML-ENTITIES", "UTF-8");
Works fine for me: http://codepad.org/uTmORRmz
Does your browser support Unicode?
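If the string actually contains numeric entities, one more thing to check is the charset argument of html_entity_decode, which defaulted to ISO-8859-1 before PHP 5.4 and cannot represent Vietnamese letters in that encoding. A sketch with the target charset stated explicitly follows; the entity-encoded input is an assumption about what the original string looked like.

```php
<?php
// Assumed input: part of the phrase with two letters as numeric entities.
$encoded = 'Ngh&#7879; thu&#7853;t';

// Decode into UTF-8 explicitly rather than relying on the default charset.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n"; // Nghệ thuật
```

The decoded bytes are UTF-8, so the Content-type header shown above is still needed for the browser to display them.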
