Kohana 3 - In View ® shows invalid character

Kohana 3 - In View ® shows invalid character - php

We are in the process of converting to UTF-8 throughout our site. For the most part we've not had any problems, but presently the ® symbol shows up in the views as an invalid character, but when if the value is output in the Controller then it is displays correctly.
We have made certain to include the meta attribute in the main view:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Is there a setting that we are missing?

Your files should be UTF-8 encoded (UTF-8 without BOM, ANSI as UTF-8).
Note: Kohanas' HTML::chars() uses Kohana::$charset to decide which charset to use when encoding, so use it instead of htmlspecialchars().

Related

IE decodes umlauts in encoded urls to wrong charset

I have a file url (file:///...) with an umlaut in a intranet solution.
The url is encoded which turned the ä into an %C3%A4.
When I click the link in Firefox/Chrome the character is an ä and the file is displayed.
In IE the character is changed to Ã which results in a 404-error.
I tried with and without the following charset definition but it does not seem to work. (The file is UTF-8 encoded)
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
I tried without encoding which does work but since I am using PHP-DOMDoc which causes the encoding I would rather not parse the content again.
I cannot avoid the umlauts in the urls since the customer enters these.
Is there any solution for this problem?

Weird characters though using the right encoding

I work on a website that has different language interfaces, so far I use english and german.
when the german text is loaded, it shows weird characters like the following screenshot
though I use
header('Content-type: text/html; charset=utf-8');
and also in the html header
<META http-equiv="content-type" content="text/html; charset=utf-8">
what else can I do to solve it ?
Thanks

The content of the page needs to also be in UTF-8. Your content was probably made using MS Word, which uses Windows 1251 encoding. You need to re-save your document as UTF-8.
UTF-8 does not convert formats for you.

If those strings are saved in a file, the file has to be encoded in UTF-8 too.
If you're getting them from a database, they'll have to be stored as UTF-8 and you'll have to set the connection charset to utf-8.
You could also check whether your text is UTF-8 and if not, convert it with utf8_encode.

$_POST will convert from utf-8 to Ã¤ Ã¶ Ã¼ etc

I am new here, so I apologize if I am doing anything wrong.
I have a form which submits user input onto another page. User is expected to type ä, ö, é, etc... I have placed all of the following in the document:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
header('Content-Type:text/html; charset=UTF-8');
<form action="whatever.php" accept-charset="UTF-8">
I even tried:
ini_set('default_charset', 'UTF-8');
When the other page loads, I need to check what the user input with something like:
if ( $_POST['field'] == $check ) {
...
}
But if he inputs something like 'München', PHP will compare 'MÃ¼nchen' with 'München' and will never trigger TRUE even though it should. Since it is specified UTF-8 everywhere, I am guessing that the server is converting to something else (Windows-1252 as I read on another thread) because it does not support or is not configured to UTF-8. I am using Apache on a local server before I load into production; I have not changed (and don't know how to) any of the default settings. I've been working on a Windows 7, editing with Notepad++ enconding my files in ANSI. If I bin2hex('München') I get '4dc3bc6e6368656e'.
If I echo $_POST['field']; it displays 'München' correctly.
I have researched everywhere for an explanation, all I find is that I should include those tags/headings I already have.
Any help is much appreciated.

You are facing many different problems at the same, let's start with the simplest one.
Problem 1) You say that echo $_POST['field']; will display it correctly? What do you mean with "display"? It can be displayed correctly in two cases:
either the field is in UTF-8 and your page has been declared as UTF-8 and the browser is displaying it as UTF-8 or,
the field is in Latin-1 and the browser has decided (through the auto-detection heuristics) that your page is in Latin-1.
So, the fact that echo $_POST['field']; is correct tells you nothing.
Problem 2) You are using
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
header('Content-Type:text/html; charset=UTF-8');
Is this PHP code? If it is, it will be an error because the header must be set before sending out any byte. If you do this you will not set the Content-Type header and PHP should generate a warning.
Problem 3) You are using
<form action="whatever.php" accept-charset="UTF-8">
Some browsers (IE, mostly) ignore accept-charset if they can coerce the data to be sent in ASCII or ISO Latin-1. So the data will be in UTF-8 and declared as ISO Latin-1 or ISO Latin-1 and sent as ISO Latin-1 (but this second case is not your case).
Have a look at https://stackoverflow.com/a/8547004/449288 to see how to solve this problem.
Problem 4) Which strings are you comparing? For example, if you have
$city = "München"
$_POST['city'] == $city
The result of this code will depend on the encoding of the PHP file. If the file is encoded in ISO Latin-1 and the $_POST correctly contains UTF-8 data, the == will compare different bytes and will return false.

Another solution that may be helpful is in Apache, you can place a directive in your configuration file (httpd.conf) or .htacess called AddDefaultCharset. It looks like this:
AddDefaultCharset utf-8
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset
That will override any other default charsets.

I changed "mbstring.detect_order = pass" in my php.ini file and i worked

I've used Unicode characters in my forms and file many times. I had not any problem up to now.
Try to do these steps and check the result:
Remove header('Content-Type:text/html; charset=UTF-8'); from your HTML form codes.
Use your form just like <form action="whatever.php"> without accept-charset="UTF-8". (It's better to insert the method of sending data in your form tag).
In target page (whatever.php), insert again <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in a <head> tag.
I always did my project like what I mentioned here and I did not have any problem with Unicode strings.

This is due to the character encoding of the PHP file(s).
The hardcoded München is stored with the character encoding of the source file(s), in this case ANSI and when that value is compared to the UTF-8 encoded value provided in the $_POST variable, the two will, quite naturally, differ.
The solution to your problem is one of:
Serve and process content with the same encoding as that of the source file(s), in this case likely to be windows-1252.
This would, for starters, include changing the content="text/html; charset=UTF-8" to content="text/html; charset=windows-1252" whenever serving HTML data.
Avoid all hardcoded values that could be affected by character encoding issues between UTF-8 and windows-1252, more or less only hardcode values that only includes English letters and numbers.
Any UTF-8 values would have to be read from a source that ensures they are UTF-8 encoded (for instance a database set to use UTF-8 as storage encoding as well as connection encoding).
Wrap all hardcoded assignments in utf8_encode(), for instance $value = utf8_encode ('München');
Change the encoding of the source file(s) to UTF-8.
This can be accomplished in any number of ways, a decent text editor will be able to do it or the outstanding libiconv can be used, especially for batch processing.
Either solution 1 or 4 would be my preferred solution, especially if multiple people are involved in the project.
As a side-note, some text editors (notably Notepad++) has the option of using either UTF-8 or UTF-8 without BOM. The BOM (Byte Order Mark) is pointless in UTF-8 and will cause problems when writing headers in PHP (most often when doing a redirect). This is because the BOM is right in front of the initial <?php, causing the server to send the BOM just as it would had there been any other character in front. The difference is you'd note a character in front, but the BOM isn't displayed.
Rule of thumb: Always use UTF-8 without BOM.

PHP SimpleXML Values returned have weird characters in place of hyphens and apostrophes

I have looked around and can't seem to find a solution so here it is.
I have the following code:
$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;
This is the simple version, but even trying this any hyphens r apostrophes are turned into: ^a (euro sign) trademark sign.
Things I have tried are:
echo = (string)$xmlstr->report_description; /* did not work */
echo = addslashes($xmlstr->report_description); /* yes I know this doesnt work with hyphens, was mainly trying to see if I could escape the apostrophes */
echo = addslashes((string)$xmlstr->report_description); /* did not work */
also htmlspecial(again i know does not work with hyphens), htmlentities, and a few other tricks.
Now the situation is I am getting the XML files from a feed so I cannot change them, but they are pretty standard. The text with the hyphens etc are encapsulated in a cdata tag and encoding is UTF-8. If I check the source I am shown the hyphens and apostrophes in the source.
Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.
I am sure that in my rush to find the answer I have overlooked something simple and the fact that this is really the first time I have ever used SimpleXML I am missing a very simple solution. Just don't dock me for it I really did try and find the answer on my own.
Thanks again.

This is the simple version, but even
trying this any hyphens apostrophes
are turned into: ^a (euro sign)
trademark sign.
This is caused by incorrect charset guessing (and possibly recoding).
If a text contains a "curly apostrophe" = "Right single quotation mark" = U+2019 character, saving it in UTF-8 encoding results in bytes 0xE2 0x80 0x99. If the same file is then read again assuming its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as characters â€™ (=small "a" with circumflex, euro sign, trademark sign). Again if this incorrectly interpreted text is saved as UTF-8 the original character results in byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2
Summary: Your original data is UTF-8 and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which is usually actually treated as windows-1252). A probable reason for this charset assumption is that default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1
PS. this is a very common problem. Just do a Google or Bing search with query doesnâ€™t -doesn't and you'll see many pages with this same encoding error.

Do you know the document's character set?
You could do header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you havent already.

Make sure you have set up SimpleXML to use UTF-8 too.
Be sure that all the entities are encoded using hex notation, not HTML entities.
Also maybe:
$string = html_entity_decode($string, ENT_QUOTES, "utf-8");
will help.

This is a symptom of declaring an incorrect character set in the <head> section of your page (or not declaring and using default character set without accents and special characters).
This does the trick for latin languages.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
For TOTAL NEWBIES, html pages for browsers have a basic layout, with a HEAD or HEADER which serves to tell the browser some basic stuff about the page, as well as preload some scripts that the page will use to achieve its functionality(ies).
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Hello world
</body>
</html>
if the <head> section is omitted, html will use defaults (take some things for granted - like using the northamerican character set, which does NOT include many accented letters, whch show up as "weird characters".

Unicode and PHP - am I doing something wrong?

I'm using Kohana 3, which has full support for Unicode.
I have this as the first child of my <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The Unicode character I am inserting into is é as in Café.
However, I am getting the triangle with a ? (as in could not decode character).
As far as I can tell in my own code, I am not doing any string manipulation on the text.
In fact, I have placed the accent straight into a view's PHP file and it is still not working.
I copied the character from this page: http://www.fileformat.info/info/unicode/char/00e9/index.htm
I've only just started examining PHP's Unicode limitations, so I could be doing something horribly wrong.
So, how do I display this character? Do I need to resort to the HTML entity?
Update
So this works
Caf<?php echo html_entity_decode('é', ENT_NOQUOTES, 'UTF-8'); ?>
Why does that work? If I copy the output accented e from that script and insert it into my document, it doesn't work.

View the http headers. You should see something like
Content-Type: text/html; charset=UTF-8
Browsers don't pay much attention to meta tags, if there was a real http header stating a different encoding.
update
Whatcha get from this?
echo bin2hex('é');
echo chr(0xc3) . chr(0xa9);
You should get c3a9é, otherwise I'd say file encoding issue.

I guess, you see �, the replacement character for invalid UTF-8 byte sequences. Your text is not UTF-8 encoded. Check your editor’s settings to control the encoding of the PHP file.
If you’re not sure about the encoding of your sources, you can enforce UTF-8 compatibilty as described here (German text): Force UTF-8.
You should never need entities except the basic ones.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.