SimpleXML & html entities = strange characters - php

I am fetching a feed like this:
$posts = new SimpleXMLElement(WP_ROOT_URL . 'feed/', 0, true);
In this feed, one of the items contains an HTML entity for the "hyphen character" (an en dash), which should render as –
However, when it comes back from SimpleXML, all I get is â€“. I have read similar questions on SO, and some say to make sure the page is set to UTF-8, but I am not sure how that would stop SimpleXML from returning the strange characters.
Either way, I do have this on the page where the data is output:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
What can I do here to get the correct entity?

PHP strings have no unified or managed encoding, so think of them as containing bytes rather than characters. The result always contains the bytes 0xE28093; only the interpretation changes. You can see this by calling bin2hex() on the result.
Interpreted as Windows-1252, those bytes come out as â€“; interpreted as UTF-8, they come out as –.
If you are echoing this on a web page, you can make the browser interpret your output as UTF-8 like this:
<?php
header("Content-Type: text/html; charset=UTF-8"); //Put this before any output
echo "stuff";
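A minimal sketch of the byte-level view described above, assuming the mbstring extension is available. The variable names are illustrative; the only value taken from the answer is the byte sequence 0xE28093.

```php
<?php
// The en dash exactly as the answer describes it: three bytes, E2 80 93.
$dash = "\xE2\x80\x93";

// bin2hex() exposes the raw bytes, independent of any charset.
$hex = bin2hex($dash);

// Simulate a browser that reads those bytes as Windows-1252:
// each byte becomes its own character, re-encoded here as UTF-8.
$misread = mb_convert_encoding($dash, 'UTF-8', 'Windows-1252');

echo $hex, "\n";                          // e28093
echo mb_strlen($dash, 'UTF-8'), "\n";     // 1 character when read as UTF-8
echo mb_strlen($misread, 'UTF-8'), "\n";  // 3 characters when read as Windows-1252
```

The string never changes between the two readings; only the charset the reader assumes does, which is why the Content-Type header is the fix.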

Related

Remove non-standard characters from html PHP

How can I remove only � (using cURL to get the data)?
$str = "Check this out <a href=�http://www.somewebsite.com�>Somewebsite</a>, this is a great website
Windows� (XP 32bit/Vista/7/8/8.1)";
I just want � to be removed.
I tried
$output = preg_replace("/[^A-Za-z0-9]/","",$str);
It removes the HTML as well, but I want to keep the HTML.
Instead of a bad workaround like that, you should fix your charset issue. Your problem is likely that you don't use the same character encoding at all levels of your application/scripts. Anything that has, or can be set to, a specific character encoding should be set to the same one. The most general ones are below.
Save the document as UTF-8 (or UTF-8 w/o BOM). (If you're using Notepad++, it's Format -> Convert to UTF-8 or UTF-8 w/o BOM.)
The header in both PHP and HTML should be set to UTF-8:
HTML: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />, inside the <head> tag of your document.
PHP: header('Content-Type: text/html; charset=utf-8'); - PHP headers have to be sent BEFORE any output is made (no HTML, no whitespace, no echo/print - nothing).
There are other aspects as well that might need to be set to UTF-8, it depends on what kind of PHP functions you are using and so on. But the above is generally a good start.
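If the fetched page turns out to be in a different encoding, converting it is usually safer than deleting bytes. The sketch below is only an illustration: it assumes the cURL response is Windows-1252 (the question does not say), uses a made-up input string with the problem bytes written as hex escapes, and needs the mbstring extension.

```php
<?php
// Hypothetical input: the same kind of text as in the question, with the
// problem bytes written as hex escapes. Assumption: the source page is
// Windows-1252, where 0x94 is a right curly quote and 0xAE is the ® sign.
$raw = "<a href=\x94http://www.somewebsite.com\x94>Somewebsite</a> Windows\xAE";

// Invalid UTF-8 is what a browser renders as the replacement character.
$wasValidUtf8 = mb_check_encoding($raw, 'UTF-8');

// Convert instead of deleting; the markup is left untouched.
$utf8 = $wasValidUtf8 ? $raw : mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

echo $utf8, "\n";
```

Unlike the preg_replace attempt, this keeps the tags and turns the unknown bytes into real characters, provided the assumed source encoding is right.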

How to decode UTF-8 characters in CodeIgniter?

I'm developing a site with CodeIgniter that supports multiple languages. When a user searches in their native language, the first page of results is fine, but when I paginate the results the characters are not decoded.
This is the url which is used to paginate.
When I print the URI segment I get %E0%B4%AE
I tried urlencode and urldecode, and then I got different characters like à´®
Can anyone tell me how I can decode this kind of character set?
While urldecode is what you should be using, the reason that you are getting the wrong output printed is probably because the output page's encoding hasn't been set to UTF-8, and is thus defaulting to ISO-8859-1. Hence, while the characters have been decoded correctly by PHP, the browser then interprets the characters in the wrong encoding, resulting in incorrect display.
To fix the problem, send a charset in the Content-type header before any output like so:
header('Content-type: <type>; charset=utf-8');
If your output page is HTML, you could alternatively use this tag in the head:
<meta charset="utf-8">
If you take the second option, be sure to place the tag as early as possible in the head, as browsers do not scan past the first 1024 bytes of the page for this declaration.
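A short sketch of what happens to the segment from the question, assuming the mbstring extension is available. It suggests that urldecode already does its job and that only the charset the browser assumes differs.

```php
<?php
// The URI segment quoted in the question.
$segment = '%E0%B4%AE';

// urldecode() turns the percent escapes back into raw bytes.
$decoded = urldecode($segment);

echo bin2hex($decoded), "\n";            // e0b4ae
echo mb_strlen($decoded, 'UTF-8'), "\n"; // 1: a single character when read as UTF-8

// What a browser shows if it assumes ISO-8859-1: one character per byte.
$misread = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
echo mb_strlen($misread, 'UTF-8'), "\n"; // 3
```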

What's the cause of these characters when I pull content from the database?

I'm pulling some content from my database and when I display it, I am getting some random characters occasionally dispersed throughout the content. I am seeing a lot of Â where spaces were/are. I'm also getting â€™ in some places.
The characters don't appear when I view in phpMyAdmin. How do I encode the content correctly? Is it something I should do BEFORE I insert the content or is it something I do when I am displaying?
What character set is the data stored in?
For example, if the data is stored as UTF-8, then when displaying the data, you need to make sure the page encoding is set to UTF-8 as well.
If it is stored in some other character set, then set the page encoding as appropriate.
You can do this by passing appropriate headers:
Content-Type: text/html; charset=utf-8
Or letting the browser know in your document:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
And in HTML5:
<meta charset="utf-8" />
That's UTF-8 being misinterpreted as CP1252. Make sure all the appropriate headers are in place.
>>> print u'â€™'.encode('cp1252').decode('utf-8')
’
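The same round trip can be sketched in PHP, assuming the mbstring extension is available. This is meant as a diagnostic to confirm the diagnosis above, not as a fix to apply to stored data.

```php
<?php
// "â€™" written as UTF-8 hex escapes so the example does not depend on
// how this file itself is saved: â (C3 A2), € (E2 82 AC), ™ (E2 84 A2).
$mojibake = "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2";

// Map each character back to its single Windows-1252 byte...
$bytes = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');

// ...which yields E2 80 99, the UTF-8 encoding of the right single quote.
echo bin2hex($bytes), "\n"; // e28099
```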
IMO, the best thing would be to work in UTF-8 in your files and database (or at least use the same encoding in all places).
Please check what you have under $db['default']['char_set'] and $db['default']['dbcollat'] in your application/config/database.php and what encoding you are using in your views/HTML. If you see the data correctly in phpMyAdmin, then the problem may be in your views.
Try utf8_encode or utf8_decode when you print your text. Note that these only convert between ISO-8859-1 and UTF-8, and they are deprecated as of PHP 8.2.

Saving special characters to DB then display using PHP

I have a script which caches a number of RSS feeds, however I have noticed that I've started getting strange characters appearing in the page where I output the cached contents (Stored in DB).
For instance the RSS feed contains the characters: Introducing…: ...
Which should read: Introducing...: ...
However my page displays it as: Introducingâ€¦: ...
It seems that these strange chars are actually being stored in the database like this.
Can anyone suggest where I might be going wrong?
Do I need to encode on the way into the database, then decode on the way out?
You need to make sure that the encoding of the RSS feed is the same as in your DB. Otherwise you first need to convert the content.
The encoding of the feed should be in the XML header:
<?xml version="1.0" encoding="UTF-8"?>
You can use this function to convert it to the encoding you use in the DB (preferably UTF-8):
http://php.net/manual/function.mb-convert-encoding.php
When you use UTF-8, make sure you also set the database connection to UTF-8, e.g. in MySQL (which spells the charset name without a hyphen):
SET NAMES 'utf8';
Then set the correct output content type as described by Anthony Williams. Ideally you do both: set the META Content-Type and send the Content-Type HTTP header.
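The conversion and connection steps above might look roughly like this. It is a sketch under assumptions: the function names, host, and database name are hypothetical, the feed's declared encoding is passed in by the caller, mbstring is installed, and the connection uses PDO with utf8mb4 rather than SET NAMES.

```php
<?php
// Convert feed text from its declared encoding to UTF-8 before storing it.
// $declared would come from the feed's XML header; the name is hypothetical.
function feedToUtf8(string $text, string $declared): string
{
    return strcasecmp($declared, 'UTF-8') === 0
        ? $text
        : mb_convert_encoding($text, 'UTF-8', $declared);
}

// Build a PDO MySQL DSN that asks for a UTF-8 connection.
// Host and database names are placeholders, not taken from the question.
function utf8Dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8mb4";
}

echo bin2hex(feedToUtf8("Caf\xE9", 'ISO-8859-1')), "\n"; // 436166c3a9
echo utf8Dsn('localhost', 'feeds'), "\n";
// $pdo = new PDO(utf8Dsn('localhost', 'feeds'), $user, $password);
```

Setting the charset in the DSN has the same intent as SET NAMES, but it is applied when the connection is opened.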
Since your application seems to decode the htmlentities of that cached RSS feed before writing them to the DB, you may also output them like you got them in the first place
<?php echo htmlentities($string, ENT_QUOTES, 'UTF-8'); ?>
The fact that there are 3 bad characters in the output suggests that the RSS feed is being interpreted so that the HTML character reference is converted to UTF-8.
Try setting the text encoding of your display page to UTF-8 by adding the following to the output HTML in the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Alternatively, since this is PHP you can set the HTTP header directly:
<?php
header("Content-Type: text/html; charset=UTF-8");
?>
However, a better solution might be to avoid converting the entity in the first place. Have you got a call to html_entity_decode() in the code that retrieves the RSS feed? If so, then it might be wise to remove it.
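To illustrate why a decoded reference shows up as three bad characters, here is a small sketch. The numeric reference &#8230; is an assumption; the question does not show which form the feed actually uses.

```php
<?php
// Assumption: the feed carries the ellipsis as the numeric reference &#8230;.
$fromFeed = 'Introducing&#8230;: ...';

// Decoding replaces the 7-byte ASCII reference with one UTF-8 character.
$decoded = html_entity_decode($fromFeed, ENT_QUOTES, 'UTF-8');

// That one character is three bytes (E2 80 A6), hence three "bad"
// characters on a page that is not read as UTF-8.
$ellipsis = substr($decoded, strlen('Introducing'), 3);
echo bin2hex($ellipsis), "\n"; // e280a6
```

Left undecoded, the reference stays plain ASCII, so it displays correctly whatever charset the page declares.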

Unicode and PHP - am I doing something wrong?

I'm using Kohana 3, which has full support for Unicode.
I have this as the first child of my <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The Unicode character I am inserting is é, as in Café.
However, I am getting the triangle with a ? (as in could not decode character).
As far as I can tell in my own code, I am not doing any string manipulation on the text.
In fact, I have placed the accent straight into a view's PHP file and it is still not working.
I copied the character from this page: http://www.fileformat.info/info/unicode/char/00e9/index.htm
I've only just started examining PHP's Unicode limitations, so I could be doing something horribly wrong.
So, how do I display this character? Do I need to resort to the HTML entity?
Update
So this works
Caf<?php echo html_entity_decode('&eacute;', ENT_NOQUOTES, 'UTF-8'); ?>
Why does that work? If I copy the output accented e from that script and insert it into my document, it doesn't work.
View the http headers. You should see something like
Content-Type: text/html; charset=UTF-8
Browsers ignore the meta tag if a real HTTP header states a different encoding.
update
Whatcha get from this?
echo bin2hex('é');
echo chr(0xc3) . chr(0xa9);
You should get c3a9é, otherwise I'd say file encoding issue.
I guess you see �, the replacement character for invalid UTF-8 byte sequences. Your text is not UTF-8 encoded. Check your editor's settings to control the encoding of the PHP file.
If you're not sure about the encoding of your sources, you can enforce UTF-8 compatibility as described here (German text): Force UTF-8.
You should never need entities except the basic ones.
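The file-encoding diagnosis can be checked with a few lines. This sketch assumes the view file was saved as ISO-8859-1 (the answers only establish that it is not UTF-8) and that the mbstring extension is available.

```php
<?php
// é as it is stored when a file is saved as UTF-8 versus ISO-8859-1.
$utf8   = "\xC3\xA9";
$latin1 = "\xE9";

// A lone 0xE9 is not a valid UTF-8 sequence, so a page declared as UTF-8
// shows the replacement character in its place.
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Converting from the file's real encoding produces the expected bytes.
echo bin2hex(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1')), "\n"; // c3a9
```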
