Character encoding fail: why does \xBD display improperly in PHP + HTML?

I'm just trying to understand character encoding a bit better, so I'm doing a few tests.
I have a PHP file that is saved as UTF-8 and looks like this:
<?php
declare(encoding='UTF-8');
header( 'Content-type: text/html; charset=utf-8' );
?><!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<title>Test</title>
</head>
<body>
<?php echo "\xBD"; # Does not work ?>
<?php echo htmlentities( "\xBD" ) ; # Works ?>
</body>
</html>
The page itself shows a diamond question mark for the first echo and ½ for the second.
The gist of the problem is that my web application has a bunch of character encoding problems: people copy and paste from Outlook or Word, and the characters get transformed into diamond question marks. (Do those have a real name?)
I'm trying to learn how to make sure all my input is transformed into UTF-8 when the page loads (basically $_GET, $_POST, and $_REQUEST), and that all output is done using proper UTF-8 handling methods.
My question is: Why is my page showing the question mark for the first echo, and does anyone have any other information about making a UTF-8 safe web app in PHP?

0xBD on its own is not valid UTF-8: it is a continuation byte with no lead byte before it. If you want to encode "½" in UTF-8, you need to use 0xC2 0xBD instead.
>>> print(b'\xc2\xbd'.decode('utf-8'))
½
If you want to use text from another charset (Latin-1 in this case) then you need to transcode it to UTF-8 first using the various iconv or mb functions.
Also:
$ charinfo �
U+FFFD REPLACEMENT CHARACTER
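A minimal Python 3 sketch of the transcoding step described above, assuming the stray byte really came from Latin-1 (ISO-8859-1). The variable names are illustrative only; in PHP the corresponding step would go through iconv or the mb functions.

```python
# Assumption: the input byte string is Latin-1 (ISO-8859-1) text.
latin1_bytes = b"\xbd"                 # "½" in Latin-1

# Transcode: decode with the source charset, then re-encode as UTF-8.
text = latin1_bytes.decode("latin-1")
utf8_bytes = text.encode("utf-8")

print(text)              # ½
print(utf8_bytes.hex())  # c2bd
```

The important part is knowing the source charset; decoding with the wrong one still "succeeds" for Latin-1 but yields the wrong characters.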

\xBD is not valid UTF-8; what you want is \xC2\xBD. The question mark symbol is what applications substitute for invalid byte sequences, so if you see it in your UTF-8 text, the text is either not UTF-8 or corrupted.
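To illustrate the substitution, here is a small Python 3 sketch (not PHP, and only an analogy for what a browser does) showing that a strict UTF-8 decoder rejects a lone 0xBD, while a lenient one replaces it with U+FFFD.

```python
raw = b"\xbd"  # lone continuation byte, not valid UTF-8

try:
    raw.decode("utf-8")      # strict decoding rejects the byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD, the "diamond question mark".
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok)            # False
print(hex(ord(replaced)))   # 0xfffd
```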

Related

Convert utf8 without bom to utf 8

Files
index.php :
<?php
include_once 'index_a.php';
?>
index_a.php :
<html>
<head>
<title>test</title>
</head>
<body>
casa
</body>
</html>
Results
The first result is from the index.php and the second index_a.php.
Why do those quotes appear?
If I convert index_a.php to UTF-8 without BOM, the quotation marks do not appear, but I want the file to be encoded in UTF-8.
Your question doesn't make sense: a UTF-8 encoded file may have a BOM (but shouldn't, as the byte ordering for UTF-8 is fixed). In both cases your file will be UTF-8 encoded, so you're done already. What happened here is that you've asked an XY question.
So, what you really want to know is: why do those quotes show up for a normal UTF8 encoded file without BOM, but not when there is a BOM? The answer is that you're giving the browser HTML code that could be any version of HTML, and expecting it to know which version you want rendered.
Without any knowledge of the document type, the browser may, or may not, treat any whitespace between tags as a single whitespace, or no whitespace, depending on the rendering mode it guessed you wanted. So if you really don't want that " " then you shouldn't rely on the file encoding; you should make it explicit to the browser that what you're giving it to render is proper HTML. Add
<!doctype html>
at the top so that all browsers know this is a modern HTML5 content file and should be parsed accordingly, rather than falling back into an unpredictable quirks mode.
edit
http://jsbin.com/helikafuni/1/ shows proper HTML5 doctype and element use (you're using ancient HTML 4.01 syntax. It's time to read up on how HTML5 changed a lot of the rules and use those new rules instead).
If you want to change the encoding of your files, I would suggest using Notepad++.
After you have installed it, you can open your files in it and change the encoding like this:
(See the menu item "Convert to UTF-8")
UPDATE:
This should work for you:
index.php:
<?php
include_once 'index_a.php';
?>
index_a.php:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>test</title>
</head>
<body>
casa
</body>
</html>
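For reference, a UTF-8 BOM is the three bytes EF BB BF at the start of a file. PHP sends any bytes outside <?php ... ?> tags to the browser as output, so a BOM at the top of an included file ends up in the page. The following Python 3 sketch shows how to detect and strip one; the file content here is a hypothetical stand-in for index_a.php.

```python
import codecs

# Hypothetical file content: a UTF-8 BOM followed by the page markup.
with_bom = codecs.BOM_UTF8 + "<html>casa</html>".encode("utf-8")

print(with_bom[:3].hex())   # efbbbf

# "utf-8-sig" drops a leading BOM if present and is a no-op otherwise.
text = with_bom.decode("utf-8-sig")
without_bom = text.encode("utf-8")

print(without_bom.startswith(codecs.BOM_UTF8))   # False
```

Saving as "UTF-8 without BOM" in an editor achieves the same result, and the file is still UTF-8 either way.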

Weird characters though using the right encoding

I work on a website that has different language interfaces; so far I use English and German.
When the German text is loaded, it shows weird characters,
even though I use
header('Content-type: text/html; charset=utf-8');
and also in the html header
<META http-equiv="content-type" content="text/html; charset=utf-8">
What else can I do to solve it?
Thanks
The content of the page also needs to be in UTF-8. Your content was probably made using MS Word, which on Western European systems typically uses the Windows-1252 encoding. You need to re-save your document as UTF-8.
Declaring UTF-8 does not convert the content for you.
If those strings are saved in a file, the file has to be encoded in UTF-8 too.
If you're getting them from a database, they'll have to be stored as UTF-8 and you'll have to set the connection charset to UTF-8.
You could also check whether your text is UTF-8 and, if not, convert it with utf8_encode. Note that utf8_encode assumes the input is ISO-8859-1, and it is deprecated as of PHP 8.2 in favor of mb_convert_encoding.
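The "check, then convert" idea can be sketched as follows in Python 3. The helper name to_utf8 and the cp1252 fallback are assumptions made for illustration; a real application should use whatever legacy charset its data actually came from.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode from fallback."""
    try:
        raw.decode("utf-8")          # validity check only
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback charset.
        return raw.decode(fallback).encode("utf-8")

# Both inputs end up as the same UTF-8 bytes.
print(to_utf8("Grüße".encode("utf-8")).hex())
print(to_utf8("Grüße".encode("cp1252")).hex())
```

This heuristic is not foolproof: some legacy byte sequences happen to be valid UTF-8, so fixing the encoding at the source is preferable to guessing.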

Header special char conversion?

I have some problems with character conversion in my PHP page's header.
I have to develop a snippet of code that, by means of a web service (XML-RPC protocol), can interface with another snippet of code written in Python.
This is the Python snippet's output:
Output={'metaTagKeyWords': '', 'metaTagTitle': '10% DISCOUNT FOR 3 NIGHTS','metaTagDescription': 'Questa \xc3\xa8 una prova: devo vedere che succede.\r\n\r\nProva prova.\r\n\r\nDaje.\r\n\r\nENGLISH VERSION !!!!\r\n'}
So I have to convert some characters: first of all \xc3\xa8, which is the UTF-8 encoding of "è", and secondly the "\r\n" characters.
I know how to proceed with the "\r\n" characters, but I don't know how to convert the Unicode character.
I have already tried something like this:
htmlentities($data[$META_TITLE_KEY], ENT_QUOTES, 'UTF-8')
But it didn't work.
Moreover, I had already tried to convert the string to UTF-8 in Python (so that the entity would be u'\xc3' or something like that), but the results are pretty much the same.
One additional piece of information: the conversion has to be used in the PHP file's header, in the "meta description" tag.
EDIT1:
It seems that what we believed to be UTF-8 is instead Latin-1. So, if I change this part of the header:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
in
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1" />
it works.
But I have to have a UTF-8 charset, so I suppose I have to do something in the Python application logic (because when I go from the editor to the DB I encode something, while when I return from the DB to the editor I decode something).
Stay tuned for more info.
EDIT2:
Maybe some function that I use to save my data to the Postgres DBMS converts the data to Latin-1 and then to UTF-8. So, if I add this instruction:
d_meta[element] = codeDbToEditor(d_meta[element]).replace('\r\n', ' ').decode('latin-1')
everything seems to work.
Have I had the right "inspiration"?
$str="Hello Loréane";
echo utf8_encode($str);
Hope It Helps
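To make the EDIT1 observation concrete, here is a short Python 3 sketch that decodes the same two bytes from the output above in both charsets. It only illustrates the byte-level difference and is not a reconstruction of the asker's codeDbToEditor function.

```python
raw = b"Questa \xc3\xa8 una prova"   # bytes as shown in the Python output above

as_utf8 = raw.decode("utf-8")        # the two bytes form one character
as_latin1 = raw.decode("latin-1")    # each byte becomes its own character

print(as_utf8)     # Questa è una prova
print(as_latin1)   # Questa Ã¨ una prova
```

If a page declared as UTF-8 shows the two-character form, the text was most likely encoded twice somewhere between the database and the page.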

PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes

I have looked around and can't seem to find a solution so here it is.
I have the following code:
$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;
This is the simple version, but even with this, any hyphens or apostrophes are turned into: â (euro sign) (trademark sign).
Things I have tried are:
echo (string)$xmlstr->report_description; /* did not work */
echo addslashes($xmlstr->report_description); /* yes I know this doesn't work with hyphens, was mainly trying to see if I could escape the apostrophes */
echo addslashes((string)$xmlstr->report_description); /* did not work */
I also tried htmlspecialchars (again, I know it does not work with hyphens), htmlentities, and a few other tricks.
Now the situation is I am getting the XML files from a feed so I cannot change them, but they are pretty standard. The text with the hyphens etc are encapsulated in a cdata tag and encoding is UTF-8. If I check the source I am shown the hyphens and apostrophes in the source.
Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.
I am sure that in my rush to find the answer I have overlooked something simple, and since this is really the first time I have ever used SimpleXML, I am probably missing a very simple solution. Just don't dock me for it; I really did try to find the answer on my own.
Thanks again.
This is the simple version, but even trying this any hyphens apostrophes are turned into: ^a (euro sign) trademark sign.
This is caused by incorrect charset guessing (and possibly recoding).
If a text contains a "curly apostrophe" = "Right single quotation mark" = U+2019 character, saving it in UTF-8 encoding results in bytes 0xE2 0x80 0x99. If the same file is then read again assuming its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as characters â€™ (=small "a" with circumflex, euro sign, trademark sign). Again if this incorrectly interpreted text is saved as UTF-8 the original character results in byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
Summary: Your original data is UTF-8 and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which is usually actually treated as windows-1252). A probable reason for this charset assumption is that default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1
PS. This is a very common problem. Just do a Google or Bing search with the query doesnâ€™t -doesn't and you'll see many pages with this same encoding error.
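The byte sequences described in this answer can be reproduced with a few lines of Python 3. This is a sketch for checking the explanation, using cp1252 as Python's name for windows-1252; the variable names are illustrative.

```python
original = "\u2019"                           # RIGHT SINGLE QUOTATION MARK

utf8_bytes = original.encode("utf-8")         # correct UTF-8 encoding
misread = utf8_bytes.decode("cp1252")         # wrongly read as windows-1252
double_encoded = misread.encode("utf-8")      # wrong text saved as UTF-8 again

# Reversing the wrong step recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex())       # e28099
print(misread)                # â€™
print(double_encoded.hex())   # c3a2e282ace284a2
print(repaired == original)   # True
```

The repair step only works while the damage is a clean single misread; it is better to find the component that assumes windows-1252 and fix it there.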
Do you know the document's character set?
You could do header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you haven't already.
Make sure you have set up SimpleXML to use UTF-8 too.
Be sure that any character references in the XML use numeric (hex) notation, not named HTML entities.
Also maybe:
$string = html_entity_decode($string, ENT_QUOTES, "utf-8");
will help.
This is a symptom of declaring an incorrect character set in the <head> section of your page (or not declaring and using default character set without accents and special characters).
This does the trick for latin languages.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
For TOTAL NEWBIES, html pages for browsers have a basic layout, with a HEAD or HEADER which serves to tell the browser some basic stuff about the page, as well as preload some scripts that the page will use to achieve its functionality(ies).
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Hello world
</body>
</html>
If the <head> section is omitted, the browser will use defaults (it takes some things for granted, such as falling back to a locale-dependent legacy character set like windows-1252). Accented letters saved as UTF-8 are then misread and show up as "weird characters".

Unicode and PHP - am I doing something wrong?

I'm using Kohana 3, which has full support for Unicode.
I have this as the first child of my <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The Unicode character I am inserting is é as in Café.
However, I am getting the triangle with a ? (as in could not decode character).
As far as I can tell in my own code, I am not doing any string manipulation on the text.
In fact, I have placed the accent straight into a view's PHP file and it is still not working.
I copied the character from this page: http://www.fileformat.info/info/unicode/char/00e9/index.htm
I've only just started examining PHP's Unicode limitations, so I could be doing something horribly wrong.
So, how do I display this character? Do I need to resort to the HTML entity?
Update
So this works
Caf<?php echo html_entity_decode('&eacute;', ENT_NOQUOTES, 'UTF-8'); ?>
Why does that work? If I copy the output accented e from that script and insert it into my document, it doesn't work.
View the http headers. You should see something like
Content-Type: text/html; charset=UTF-8
Browsers ignore the meta tag if a real HTTP header states a different encoding.
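When inspecting headers, the charset is just a parameter of the Content-Type value. The Python 3 sketch below pulls it out with the standard library; the helper name charset_from_header is hypothetical, and the ISO-8859-1 default is an assumption that mirrors the historical HTTP/1.1 default rather than what any particular browser does.

```python
from email.message import Message

def charset_from_header(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset lowercases the value; failobj is used when absent.
    return msg.get_content_charset(failobj=default)

print(charset_from_header("text/html; charset=UTF-8"))   # utf-8
print(charset_from_header("text/html"))                  # iso-8859-1
```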
update
Whatcha get from this?
echo bin2hex('é');
echo chr(0xc3) . chr(0xa9);
You should get c3a9é, otherwise I'd say file encoding issue.
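The same byte-level check can be reproduced in Python 3, which makes the two possible file encodings easy to compare. This is a sketch for comparison only; the PHP test above is what actually inspects your source file.

```python
e_acute = "\u00e9"   # é

print(e_acute.encode("utf-8").hex())         # c3a9  (file saved as UTF-8)
print(e_acute.encode("latin-1").hex())       # e9    (file saved as Latin-1)
print(bytes([0xC3, 0xA9]).decode("utf-8"))   # é
```

If bin2hex prints e9 rather than c3a9, the PHP file itself was saved in a single-byte encoding.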
I guess you see �, the replacement character for invalid UTF-8 byte sequences. Your text is not UTF-8 encoded. Check your editor's settings to control the encoding of the PHP file.
If you're not sure about the encoding of your sources, you can enforce UTF-8 compatibility as described here (German text): Force UTF-8.
You should never need entities except the basic ones.
