Google Autocompleter and character encoding

Google Autocompleter and character encoding - php

I am using this autocompleter from Google
http://code.google.com/p/jquery-autocomplete/ (if you click on "Source" you can find all the source files for the script)
and everything is working fine, except it's having problems with special Croatian characters (like č, ć, ž etc. I'm not sure if you'll see these, so here's an idea of what I am talking about: link - the letter c with a hachek on top etc.)
Here's the setup:
an html file points to a jquery autocomplete script and a php file with the results array
the metadata for the html file has a charset of utf-8, no other pages have any kind of encoding at all
the array in the php file has those special characters encoded with html codes (the letter "ž" is replaced with ž so a typical array element looks like this: "Požega" => "5")
when I enter a search string into the input field, the returning results are encoded correctly - Požega etc. but when I click the result to accept it, it enters Požega into the input field, which is obviously not what I want
when my search string has a special letter in it, the script doesn't find anything
How do I fix this? Should I just replace the HTML special codes in the array with the actual special letters(it seems to work fine then, but I'm not sure whether everybody will see this as I intended)? If not, how do I set the character encoding on all pages so the special letters display correctly on the input field and they're searchable?
Thanks for the help!

Character encoding is such a pain in the ass with browsers. There are several things you can do to cover your bases, one of which you've already done.
Set the tag to indicate charset of UTF-8
Use .htaccess to define a charset of UTF-8
Use PHP to define a charset of UTF-8 in the header (something like: header('Content-Type: text/html; charset=UTF-8');"
Making sure these are true should ensure that the data shows up on all UTF-8 supported browsers. By the way, I can see the special characters, so you must be doing something right. :)

Related

encoding from any language to utf in php

I insert from csv characters from different languages..
I apply this to every set of characters:
private function process_elements($element){
utf8_encode($element);
return $element;
}
The problem is when they go into the database, they go like this:
???????? ?? ???????????? ????? ??????? ??? ???????...
When I retrieve them from the databse, I also get this.
This happens with greek. However, when I retrieve greek pages (through scrapping), who are on a utf encoded page. The characters look like this:
Î”ÎµÏ‚ webcam Î´Ï‰Î¼Î¬Ï„Î¹Î± | Gr.ImLive.com
which is okay, because when i use the utf8_encode function, they look normal on the screen..
But when the data is taken from the csv and be put into the database, i get those question marks..
Is there a way to encode form any language to utf.. why retrieving data from csv and a utf8 encoded webpage makes such a difference.. they look the same.. how do I address that problem?

please take a look at this
it will help you
Handling Unicode Front To Back In A Web App

It's not about "languages", it's about encodings. Text is encoded as bits and bytes. Any one byte is equal to any other byte. If you only have a blob of bytes, you cannot know what encoding it represents. You can guess, but that's not accurate. You have to know what encoding some text is in by reading the accompanying meta data. That may be documentation, a <meta> tag or an HTTP header. Then you need to treat the text in that encoding.
utf8_encode actually converts text from ISO-8859-1 to UTF-8. It does not simply encode anything to UTF-8, because it does not have the means to determine what something is encoded in either. If your text is already UTF-8 encoded or was not ISO-8859-1 encoded to begin with, you're just garbling the text (as you are).

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third party website which returns a text file on which I need to do a few string replacements to replace certain characters by their html entity equivalents e.g I need to replace í by í.
Using string_replace/preg_replace_callback on the response directly didn't result in matches (whether searching for í directly or using its hex code \x00\xED), so I used utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters by Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using php?
*edit - some further research reveals
utf8_decode("í") == í;
utf8_encode("í") == ÃƒÂ;
utf8_encode("\xc3\xad") == ÃƒÂ;

utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?

You're probably specify the characters/strings you want replaced via string literals in the php source code? If you do, then the values of those string literals depends on the encoding you save your php file in. So while you see the character í, maybe the literal value is a latin encoded í, like maybe 8859-1 encoding, or maybe its windows cp1252 í, or maybe its utf8 í, or maybe even utf32 í...i dont know off hand how many of those are different, but i know at least some have different byte representations, and so wont match in a php string comparison.
my point is, you need to specify the correct character that will match whatever encoding your incoming text is in.
heres an example without using literals
$iso8859_1 = chr(236);
$utf8 = utf8_encode(chr(236));
be warned, text editors may or may not convert the existing characters when you change the encoding, if you decide to change the file encoding to utf8. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.
also-just because the other server claims its utf8, doesn't mean it really is.

Special characters escaping with JS and PHP

my application geting Text from a input field an post it over ajax to a php file for saving it to db.
var title = encodeURIComponent($('#title').val());
if I escape() the title it is all OK but i have Problems with "+" character. So i use the encodeURIComponent().
Now i habe a Problem with german special characters like "ö" "ä" "ü" they will be displayed like a crypdet something....
Have some an idea how can i solve this problem?
Thx

I suppose this has to do with encoding : your HTML page might be using UTF-8, and the special characters are encoded like this :
>>> encodeURIComponent('ö');
"%C3%B6"
When your PHP page receives this, it has to know it's UTF-8, and deal with it as UTF-8 -- which means that everything on the server-side has to work with UTF-8 :
PHP code must use functions that can work with multi-byte characters
The database (db, tables, columns, ...) must use UTF-8 for storing data
When generating HTML pages, you need to indicate it's UTF-8 too, ...
For instance, if you are using var_dump() on the PHP side to display what's been sent from the client, don't forget to indicate that the generated page is in UTF-8, with something like this :
header('Content-type: text/html; charset=UTF-8');
Else, the browser will use it's default charset -- which is not necessarily the right one, and possibly display garbage.

You might use escape("AbcÄüö") and you would get "Abc%C4%FC%F6"
In php you could then use urldecode($myValue) to get "AbcÄüö" again

Problems with utf-8 encoding in php

Another utf-8 related problem I believe...
I am using php to update data in a mysql db then display that data elsewhere in the site. Previously I have run into utf-8 problems before where special characters are displayed as question marks when viewed in a browser but this one seems slightly different.
I have a number of records to enter that contain the è character. If I enter this directly in the db then it appears correctly on the page so I take this to mean that utf-8 content is being output correctly.
However when I try and update the values in the db through php, then the è character is replaced. What appears instead is & Atilde ; & uml ; (without the spaces) which appears in the browser as Ã¨
I have the tables in the database set to use UTF-8. I believe this is correct cos, as mentioned, if I update the db through phpMyAdmin, its all ok. Similarly I have set the character encoding for the page which seems to be correct. I am also running the sql statement "SET NAMES 'utf8';" before trying to update the db.
Anyone have any other ideas as to where the problem may lie?
Many thanks

Yup.
The character you have is LATIN SMALL LETTER E WITH GRAVE. As you can see, in UTF-8 that character is encoded into two bytes 0xC3 and 0xA8.
But in many default, western encodings (such as ISO-8859-1) which are single-byte only, this multi-byte character is decoded as two separate characters, LATIN CAPITAL LETTER A WITH TILDE and DIAERESIS. Notice how they are both encoded as C3 and A8 in ISO-8859-1?
Furthermore, it looks like PHP is processing these characters through htmlentities() which result in the Ã and ¨ respectively.
So, where exactly is the problem in your code? Well, htmlentities() could be doing it all by itself since its 3rd argument is a encoding name - which you may not have properly set to 'UTF-8'. But it could be some other string processing function as well. (Note: As a general rule, it's a bad idea to store HTML entities in the database - this step should be reserved for time of display)
There are a bunch of other ways to trip yourself up with UTF-8 in php - I suggest hitting up the cheatsheet and make sure you're in good shape.

Well it is your own code convert characters into entities.
To make it right:
Ban htmlentities function from your scripts forever.
Use htmlspecialchars, but not on insert, but whan displaying data.
Repair existing data in the database using html_entity_decode.

I suppose you're taking the results of some form submission and inserting the results in the database. If so, you must ensure that you instruct the browser to send UTF-8 data and you should validate the user input for a valid UTF-8 stream.
Change your form element to include accept-charset:
<form accept-charset="utf-8" method="post" ... >
<input type="text name="field" />
...
</form>
Validate the data with:
$valid = array_key_exists("field", $_POST) && !is_array($_POST['field']) &&
preg_match('//u', $_POST['field']) && ...; //check length with mb_strlen etc.

I think you miss Content-Type declaration on the html page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you don't have it, the browser will guess the encoding, and convert any characters outside of that encoding to entities when posting a form.

Get non-UTF-8-form fields as UTF-8 in PHP?

I have a form served in non-UTF-8 (it’s actually in Windows-1251). People, of course, post there any characters they like to. The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities so I can still recognise them. For example, if user types an →, I receive an →. That’s partially great, like, if I just echo it back, the browser will correctly display the → no matter what.
The problem is, I actually do a htmlspecialchars () on the text before displaying it (it’s a PHP function to convert special characters to HTML entities, e.g. & becomes &). My users sometimes type things like — or ©, and I want to display them as actual — or ©, not — and ©.
There’s no way for me to distinguish an → from →, because I get them both as →. And, since I htmlspecialchars () the text, and I also get a → for a → from browser, I echo back an &#8594; which gets displayed as → in a browser. So the user’s input gets corrupted.
Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?
Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.
Thanks!

The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities
Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “ƛ” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘Б’ character.
I actually do a htmlspecialchars () on the text before displaying it
Yes. You must do that, or else you've got a security problem.
Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself
Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.
I know that the good idea is to switch the whole software to UTF-8,
Yup. Well, at least the encoding of the page containing the form should be UTF-8.

<form action="action.php" method="get" accept-charset="UTF-8">
<!-- some elements -->
</form>
All browsers should return the values in the encoding specified in accept-charset.

You check to see if the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, do whatever you want to with it. I would do this by looking at each character &, #, 8, 5, 9, 4, and parsing it into something you can apply something to.
Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like © you will want to handle it differently than → because the second one has the # in it.
I think this answers your question.

The html_entity_decode function is probably what you want.

You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false do avoid the character references being encoded again.
Or you first decode those existing character references.

You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. Especially the mb_convert_encoding() to move it from windows-1251 to UTF-8, or where ever.
I don't quite understand your question though, because if someone enters & as a text string, when you do the htmlspecialchars() that should convert it to &amp; ... which when ran back through a html_entity_decode() would come out as the text string the user entered.
This is of course if you haven't used the double_encode option when running your string through the htmlspecialchars()

mbstring supports the "charset" HTML-Entities
for($i=0; $i<strlen($out); $i++) {
printf('%02X ', ord($out[$i]));
}61 20 E2 86 92 20 62 20 26 20 63 E2 86 92 is the byte-sequence for → (RIGHTWARDS ARROW) in utf8.

You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity because they look identical. The real solution is to give up on Windows 1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding and all these problems should just go away.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Google Autocompleter and character encoding - php

Related

encoding from any language to utf in php

Strange behaviour when encoding cURL response as UTF-8

Special characters escaping with JS and PHP

Problems with utf-8 encoding in php

Get non-UTF-8-form fields as UTF-8 in PHP?

Categories

Resources