Google Geocode: PHP Implementation - character encoding issues

Google Geocode: PHP Implementation - character encoding issues - php

I'm working with UK address data and also International address data.
I need to geocode the address data for use on a google map. I'm doing this using the HTTP service. Ie/ Constructing a query string and passing it to file_get_contents($THEURL).
I've managed to geocode 80% of the address data perfectly, however those addresses in countries like Norway and Sweeden that contain special characters will not return a geocode.The code returned is 602 (cannot find an address).
Looking into the documentation I can see that the string sent to google must be UTF8 encoded.
I've tried the following to ensure the string is UTF8 encoded / remove the special characters.
1) Using UTF8 encode on the query string - this often results in malformed characters being displayed on the screen.
2) mb_check_encoding reports the string is correctly encoded.
3) Using a function to substitue special characters for thier europiene eqivilents (in the hope google api will compensate.
Can anyone suggest a reason why my method isn't working (whether to do with encoding or not?).

You need to systematically go through every encoding aspect in your system and define what encoding it is in. Mb_detect_encoding and guesswork are not a good approach here.
You need to check the encoding of:
incoming data
pages
GET parameters
database connection
database table collations
the script files you work with
If malformed characters occur, chances are you are using ISO-8859-1 or some other non-UTF-8 encoding somewhere. When everything is clean UTF-8, the request should go through.
A very good article on the basics is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Related

Proper handling of foreign characters / emoji

Having trouble getting foreign characters and Emoji to display.
Edited to clarify
A user types an Emoji character into a text field which is then sent to the server (php) and saved into the database (mysql). When displaying the text we grab a JSON encoded string from the server, which is parsed and displayed on the client side.
QUESTION: the character for a "trophy" emoji saved in the DB reads as
%uD83C%uDFC6
When that is sent back to the client we don't see the emoji picture, we actually see the raw encoded text.
How would we get the client side to read that text as an emoji character and display the image?
(all on an iphone / mobile safari)
Thanks!

Check the encodings used by your client, your web server, and your database table. Make sure they are all using encodings that can handle the characters you are concerned about.

Looks like the problem is my MySql encoding... utf8mb4 would allow it - unfortunately it's unavailable before MySQL v5.5

the character for a "trophy" emoji saved in the DB reads as %uD83C%uDFC6
Then your data are already mangled. %u escapes are specific to the JavaScript escape() function, which should generally never be used. Make sure your textarea->PHP handling uses standards-compliant encoding, eg encodeURIComponent if you need to get a JS variable into a URL query.
Then, having proper raw UTF-8 strings in your PHP layer, you can worry about getting MySQL to store characters like the emoji that are outside of the Basic Multilingual Plane. Best way is columns with a utf8mb4 collation; if that is not available try binary columns which will allow you to store any byte sequence (treating it as UTF-8 when it comes back out). That way, however, you won't get case-insensitive comparisons.

Html special characters in email

I had written a script to read email from a mailbox.
in some email i am getting some data being converted into wiered characters that are breaking my further processing.
those character looks something like this http://brucejohnson.ca/HTMLCharacters13.html
Any idea how to convert them into original content.

if the script is giving you those characters, then you have two options, see the character as is, or see the numerical equivalent of that character (in various bases - octal, hex etc).
Are you sure that your script isn't trying to read an encrypted mail, and that your script works fine?
Try putting some dummy test data through the functions/script you've written to see if it produces the output you expect.
Hope this helps

You need to check the charset encoding in the email headers first.
Once you have done this you then chose 1 of 2 methods, change the charset in the HTML or change the charset (where possible) to the charset you're already using (probably UTF-8)
If you dynamically change the HTML charset in the header then your biggest problem is the users will need to specify the correct charset in their browser settings, for example mine is set to UTF-8 however my emails are in ISO-8859-1 so if I was to employ this method every time I look at the site I would need to change my browser charset but a friend of mine has ISO-8859-1 as his normal charset so he would have no problems.
If you encode the characters to UTF-8 (e.g. utf8_encode in php) you need to ensure the content isn't already in UTF-8 otherwise you may find the encode function creates other invalid characters.
The way I handle this is basically to decode the mime header of the email, then use preg_match in PHP to detect the charset being used, from there I run the encoding to UTF-8 or not.
This is a very complicated activity at times dealing mail and various charsets based on the sender of the email, you don't really know in advance what charset will be used so you need to really understand the various charsets, how they are best stored if storing them and how they are best displayed, you then need to translate this to your app and target market.
GOod luck with your app

have u checked the character encoding It must be UTF-8. If it is western europian then change to UTF-8

PHP char encoding to UTF-8 from various sources

Hey, guys. I work for http://pastebin.com and we have a little issue with the new API and char encoding.
On the site itself we run a meta tag which specifies that everything on the site, including the forms, are utf-8. Because of this all chars get stored in the right way, without having to modify any char types.
With the API however, people can send data from all kinds of different sources & forms, and therefor has to get checked and possibly changed, before storing it.
Chars that are giving a problem are for example:
고객님이 티빙
Iñtërnâtiônàlizætiøn
♥♥♥♥♥
идите в *оопу, он лучший)
What would be a good way to approach this data input to the API to make sure all chars get stored in a valid UTF-8 format, which will work on our site.

Assuming your client is sending utf8 data and headers correctly: Sounds like you're doing a utf8_encode() on already-encoded utf8 data.

Duplicate: What is the best way to handle uploaded text files of different encodings?
In a nutshell, the only reliable way is having the client specify what encoding they are using. Automatic encoding detection is imperfect and tends to be unreliable.
You could for example specify that incoming data needs an encoding specified if it's not UTF-8.

How to make PHP use the right charset?

I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it to use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.

You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.
tell the database what encoding you're sending it bytes in, using mysql_set_charset().
Currently you are using EUC-KR in the database so presumably you want to use that encoding in both the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications, as if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can sneak through an SQL injection.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)

I don't know the charset, but if you are using HTML to show the results you should set the charset of the html
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
You can also use iconv (php function) to convert the charset to a different charset
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But i guess that in your case you will only have to change the meta tag.

Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, that in itself is neither right nor wrong nor anything else. The problem is when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8, that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.

How to write ½ in php

Quick question, how can I make this valid :
if($this->datos->bathrooms == "1½"){$select1 = JText::_( 'selected="selected"' );}
The ½ doesn't seem to be recognized. I tried to write it as ½ but then it looks for ½ literally, and not the ½ sign. Any ideas?

As many others have noted, you have a character encoding problem, most likely. I'm not sure what encodings PHP supports but you need to take the whole picture into account. For this example I'm assuming your PHP script is responding to a FORM post.
Some app (yours, most likely) writes some HTML which is encoded using some encoding and sent to the browser. Common choices are ISO-8859-1 and UTF-8. You should always use UTF-8 if you can. Note: it's not the default for the web (sadly).
The browser downloads this html and renders the page. Browsers use Unicode internally, mostly, or some superset. The user submits a form. The data in that form is encoded, usually with the same encoding that the page was sent in. So if you send UTF-8 it gets sent back to you as UTF-8.
PHP reads the bytes of the incoming request and sets up its internal variables. This is where you might have a problem, if it is not picking the right encoding.
You are doing a string comparison, which decomposes to a byte comparison, but the bytes that make up the characters depends on the encoding used. As Peter Bailey wrote,
In ISO-8859-1 this character is encoded as 0xBD
In UTF-8 this character is encoded as 0xC2BD
You need to verify the text encoding along each step of the way to make sure it is happening as you expect. You can verify the data sent to the browser by changing the encoding from the browser's auto-detected encoding to something else to see how the page changes.
If your data is not coming from the browser, but rather from the DB, you need to check the encodings between your app and the DB.
Finally, I'd suggest that it's impractical to use a string like 1½ as a key for comparison as you are. I'd recommend using 1.5 and detecting that at display time, then changing how the data is displayed only. Advantages: you can order the results by number of bathrooms if the value is numeric as opposed to a string, etc. Plus you avoid bugs like this one.

The character you are looking for is the Unicode character Vulgar Fraction One Half
There are a multitude of ways to make sure you are displaying this character properly, all of which depend on the encoding of your data. By looking here we can see that
In ISO-8859-1, a popular western encoding, this character is encoded as BD
In UTF-8, a popular international encoding, this character is encoded ad C2BD
What this means is that if your PHP file is UTF-8 encoded, but you are sending this to the browser as ISO-8850-1 (or the other way around), the character will not render properly.
As others have posted, you can also use the HTML Entity for this character which will be character-encoding agnostic and will always render (in HTML) properly, regardless of the output encoding.

Try comparing it with "1½"

Use the PHP chr function to create the character by its hex 0xBD or dec 189:
if($this->datos->bathrooms == "1".chr(189)){$select1 = JText::_( 'selected="selected"' );}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.