I have a script which caches a number of RSS feeds. However, I have noticed strange characters appearing on the page where I output the cached contents (stored in a DB).
For instance the RSS feed contains the characters: Introducing&#8230;: ...
Which should read: Introducing...: ...
However my page displays it as: Introducingâ€¦: ...
It seems that these strange characters are actually being stored in the database like this.
Can anyone suggest where I might be going wrong?
Do I need to encode on the way into the database and then decode on the way out?
You need to make sure that the encoding of the RSS feed is the same as in your DB. Otherwise you first need to convert the content.
The encoding of the feed should be in the XML header:
<?xml version="1.0" encoding="UTF-8"?>
You can use this function to convert it to the encoding you use in the DB (preferably UTF-8):
http://php.net/manual/function.mb-convert-encoding.php
When you use UTF-8, make sure you also set the database connection to UTF-8, for example in MySQL:
SET NAMES 'utf8';
Then set the correct output Content-Type as described by Anthony Williams. Ideally, do both: set the META Content-Type and send the Content-Type HTTP header.
Since your application seems to decode the HTML entities of that cached RSS feed before writing them to the DB, you can also re-encode them on output, so they go out the way you received them:
<?php echo htmlentities($string, ENT_QUOTES, 'UTF-8'); ?>
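The conversion step described in this answer (read the encoding from the XML header, then convert to the encoding used in the DB) could be sketched roughly as follows. This is written in Python purely to show the byte-level logic; in PHP the same job is done by mb_convert_encoding(). The helper name feed_to_utf8 is hypothetical, and the sketch assumes the feed uses an ASCII-compatible encoding so that the declaration itself can be read as bytes.

```python
import re


def feed_to_utf8(raw: bytes, default: str = "utf-8") -> bytes:
    """Transcode a feed to UTF-8 using the encoding named in its XML header.

    Hypothetical helper for illustration. Falls back to `default` when the
    header carries no encoding attribute.
    """
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    declared = match.group(1).decode("ascii") if match else default
    # Decode with the declared encoding, then re-encode as UTF-8 for storage.
    return raw.decode(declared).encode("utf-8")


# Example: a feed declared as ISO-8859-1 containing the byte 0xE9 ("é").
sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><title>caf\xe9</title>'
converted = feed_to_utf8(sample)
print(converted)
```

Note that the XML header in the converted bytes still names the old encoding, so it would need updating if the result is parsed as XML again rather than just stored as text.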
The fact that there are 3 bad characters in the output suggests that the RSS feed is being interpreted so that the HTML character reference is converted to UTF-8.
Try setting the text encoding of your display page to UTF-8 by adding the following to the output HTML in the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Alternatively, since this is PHP you can set the HTTP header directly:
<?php
header("Content-Type: text/html; charset=UTF-8");
?>
However, a better solution might be to avoid converting the entity in the first place. Have you got a call to html_entity_decode() in the code that retrieves the RSS feed? If so, then it might be wise to remove it.
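To see where the three bad characters come from, here is a minimal sketch in Python (used only to make the bytes visible; the same thing happens in PHP after html_entity_decode()). It assumes the feed wrote the ellipsis as the numeric reference &#8230; and that the page was being read as Windows-1252; both are assumptions for illustration.

```python
import html

# Assumed feed fragment: the ellipsis written as a numeric character reference.
feed_text = "Introducing&#8230;: ..."

# Decoding the reference yields one character, U+2026 (horizontal ellipsis).
decoded = html.unescape(feed_text)

# Stored as UTF-8, that single character occupies three bytes: E2 80 A6.
utf8_bytes = decoded.encode("utf-8")

# A page read as Windows-1252 shows each of those bytes as its own character.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)
print(misread)
```

If the entity is never decoded, the stored text stays pure ASCII and displays correctly whatever the page encoding is, which is why removing the decode step also works.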
Related
I have a problem when using accents in HTML. My page sometimes loads with all characters OK and sometimes with the typical strange characters like Ã; refreshing the page is enough to flip it between right and wrong. This seems completely random, except that the first load after clearing the cache is always wrong.
Of course I have the meta line in headers
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
The file has a .php extension; I don't know if this is relevant, but I include the following two lines in the PHP section:
header("Content-Type: text/html;charset=UTF-8");
ini_set('default_charset', 'UTF-8');
Thanks
Those settings tell the browser which encoding you claim to be using, but they don't change the encoding of the data itself.
If your data is not UTF-8 encoded, you need to convert it in your code using something like the utf8_encode() function (which assumes ISO-8859-1 input) or the mb_convert_encoding() function.
You can use the mb_detect_encoding() function to find out what encoding your data is in, and then convert accordingly.
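As a rough illustration of "detect, then convert", the following Python sketch checks whether some bytes are already valid UTF-8 and transcodes them only if they are not. The helper name ensure_utf8 and the latin-1 fallback are assumptions made for this example; the fallback has to be whatever encoding the data really uses.

```python
def ensure_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return `raw` unchanged if it is valid UTF-8, else transcode it.

    Hypothetical helper for illustration; `fallback` is an assumed source
    encoding, not something this function can discover by itself.
    """
    try:
        raw.decode("utf-8")  # strict check: raises on invalid byte sequences
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


print(ensure_utf8(b"caf\xe9"))              # latin-1 input gets converted
print(ensure_utf8("café".encode("utf-8")))  # valid UTF-8 is left alone
```

Checking first avoids double-encoding: converting data that is already UTF-8 a second time is a common way to end up with sequences like Ã© on the page.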
I have data in a table that looks like this (based on SQLYog):
(1) µéÁÂÓ ·Óá¡§
But when the forum system that is reading the data shows it on screen it looks like this:
(2) ต้มยำ ทำแกง
The second output is the correct one (Thai language).
I'm writing a script that is going to pull all this data and import it into a new database (MongoDB) but when I pull the data and echo to the browser I get the output like the first one (1) above.
How do I go about converting this so that when I insert it (or output it to a browser) it is saved and displayed correctly like (2)?
I haven't been able to output the text like (2) but I WAS able to get the output to look like (1) by including in my html:
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
And then when echoing the data doing:
echo iconv('latin1', 'utf-8', $string);
I'm sure it's something really simple but I'm not familiar enough with unicode etc to work this out! Thanks dudes!
UPDATE
I'm now one step closer. I called:
mysql_query("SET NAMES 'utf8'");
And was then able to output (1) using just:
echo $string;
So I guess that MySQL is now converting latin1 to utf8 for me over the connection instead of me having to do this in PHP via iconv.
Still can't do Thai character output to the browser though!
You need to make sure that your script is using a UTF-8 encoding for the database connection, and you need to make sure all the areas in your script that manipulate the value do so with operations that are safe for multi-byte characters. Finally, if you are displaying the value in a browser, you need to output the meta tag for utf-8 as you seem to already be doing.
I managed to solve this.
The text I was getting from the database was windows-874 (the codepage for Thai). After I googled the Thai codepage that put me on the correct path for converting to utf-8. Once I switched the header to:
header('Content-type: text/html; charset=windows-874');
I was able to see the Thai characters correctly so I then disabled the header again and used:
iconv('windows-874', 'UTF-8', $string);
This converted the windows-874 to utf-8 and the page still displayed correctly even without the header or meta tag.
So... a lesson for character set newbies - find out what codepage your text is likely to be encoded with and then try a conversion from that to utf-8 :)
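The relationship between the two outputs can be reproduced with a short Python sketch (Python is used here only to make the bytes visible; the fix itself is the iconv() call above). It assumes, as the answer found, that the stored bytes are windows-874, for which Python's codec name is cp874.

```python
# Bytes as they would sit in the table: Thai text encoded as windows-874.
raw = "ต้มยำ ทำแกง".encode("cp874")

# Reading those bytes as latin1 gives the garbled form (1).
as_latin1 = raw.decode("latin-1")

# Reading them as windows-874 gives the Thai text (2), which can then be
# written out as UTF-8, the same conversion iconv('windows-874', 'UTF-8', ...)
# performs in PHP.
as_thai = raw.decode("cp874")
utf8_bytes = as_thai.encode("utf-8")

print(raw.hex())
print(as_latin1)
print(as_thai)
```

Nothing is lost while the bytes are merely mislabelled as latin1: each Thai character is still one intact byte, so decoding with the right codepage recovers the text.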
I'm developing a site with CodeIgniter that supports multiple languages. When a user searches in their native language, the first page of results is fine, but when I paginate the results the characters are not decoded.
This is the url which is used to paginate.
When I print the URI segment I get %E0%B4%AE
I tried urlencode and urldecode, and then I got different characters like à´®
Can anyone tell me how I can decode this type of character set?
While urldecode is what you should be using, the reason that you are getting the wrong output printed is probably because the output page's encoding hasn't been set to UTF-8, and is thus defaulting to ISO-8859-1. Hence, while the characters have been decoded correctly by PHP, the browser then interprets the characters in the wrong encoding, resulting in incorrect display.
To fix the problem, send a charset in the Content-type header before any output like so:
header('Content-type: <type>; charset=utf-8');
If your output page is HTML, you could alternatively use this tag in the head:
<meta charset="utf-8">
If you take the second option, be sure to place the tag as early as possible in the head, as browsers do not scan past the first 1024 bytes of the page for this declaration.
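The explanation above can be checked with a small Python sketch (used only to show the bytes; in PHP, urldecode() does the decoding). It shows that the three percent-escapes decode to a single UTF-8 character, and that the same three bytes read as ISO-8859-1 give exactly the garbled output from the question.

```python
from urllib.parse import unquote

segment = "%E0%B4%AE"

# Percent-decoding as UTF-8 yields a single character, U+0D2E.
decoded = unquote(segment, encoding="utf-8")

# The same three bytes read as ISO-8859-1 produce the garbled display.
misread = decoded.encode("utf-8").decode("latin-1")

print(len(decoded), hex(ord(decoded)))
print(misread)
```

So the decoding step is already correct; only the charset the browser is told to use needs to change.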
I'm pulling some content from my database, and when I display it I occasionally get random characters scattered throughout the content. I am seeing a lot of Â where spaces were/are. I'm also getting â€™ in some places.
The characters don't appear when I view in phpMyAdmin. How do I encode the content correctly? Is it something I should do BEFORE I insert the content or is it something I do when I am displaying?
What character set is the data stored in?
For example, if the data is stored as UTF-8, then when displaying the data, you need to make sure the page encoding is set to UTF-8 as well.
If it is stored in some other character set, then set the page encoding as appropriate.
You can do this by passing appropriate headers:
Content-Type: text/html; charset=utf-8
Or letting the browser know in your document:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
And in HTML5:
<meta charset="utf-8" />
That's UTF-8 being misinterpreted as CP1252. Make sure all the appropriate headers are in place.
>>> print(u'â€™'.encode('cp1252').decode('utf-8'))
’
IMO, the best thing would be to work on utf-8 on your files/database (or at least the same encoding in all places).
Please check what you have under $db['default']['char_set'] and $db['default']['dbcollat'] in your application/config/database.php, and which encoding you are using in your views/HTML. If you see the data correctly in phpMyAdmin, then the problem may be in your views.
Try to use utf8_encode or utf8_decode when you print your text.
I have a PHP page with mixed Latin and Arabic characters. The charset declaration tag is in the HTML code
and the file is saved as UTF-8. All the text is static and in the PHP file (it does not come from a DB or an external source).
When I browse the site, some pages randomly get corrupted in IE and FF and display all question marks. After I refresh the page, the text is displayed properly, though... I have been working with Arabic and Hebrew for a long time and this is the first time I have run into this issue. Can anybody think of a cause?
Chrome is always fine...
Turns out the script reference that was before the meta description was causing the problem. I moved
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
to be the first item after the opening head tag and this is no longer an issue. Thanks for all the comments..
P.S. I wasn't the one who coded this page, I'm only working on localizing it; that's why I didn't even think that the meta tag being after a script would make a difference...
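One possible explanation is that browsers only pre-scan roughly the first 1024 bytes of a page for a charset declaration, so anything that pushes the meta tag further down can leave the browser guessing. Whether that is exactly what happened here is not certain, but a quick check along these lines can catch the problem. This is a Python sketch; the helper name early_charset and the regular expression are assumptions for illustration, not a full HTML parser.

```python
import re

# Matches charset=... in either <meta charset="..."> or the http-equiv form.
_CHARSET_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([\w-]+)", re.IGNORECASE
)


def early_charset(html_bytes: bytes, limit: int = 1024):
    """Return the charset declared within the first `limit` bytes, or None."""
    match = _CHARSET_RE.search(html_bytes[:limit])
    return match.group(1).decode("ascii").lower() if match else None


good = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
late = b"<html><head><script>" + b" " * 2000 + b'</script><meta charset="utf-8">'

print(early_charset(good))
print(early_charset(late))
```

Sending the charset in the HTTP Content-Type header as well sidesteps the question, since the header is seen before any of the markup.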
Try to send appropriate header, something like this:
header("Content-Type: text/xml; charset=utf-8");
Try using utf8_encode() on your content:
http://php.net/manual/en/function.utf8-encode.php
If you have some text you want to store in a DB and display even if the page encoding is latin-1, there is a free tool that can convert Unicode to escaped HTML:
http://www.sprawk.com/tools/escapeUnicode
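The same kind of escaping can be done in code rather than with an online tool. A minimal Python sketch follows; the helper name escape_non_ascii is hypothetical, and it only replaces non-ASCII characters with numeric character references (it does not escape <, > or &).

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


escaped = escape_non_ascii("Introducing\u2026: caf\u00e9")
print(escaped)
```

The result is plain ASCII, so it survives being stored in a latin-1 column and displays the same way whatever charset the page declares.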