I have a situation where, after several years of use, we suddenly have some JSON-encoded values that are giving our Perl script fits due to backslashes.
The issues are with accented characters like í and é. An example is Matí encoded as Mat\ud873.
It is unclear what may have changed in the environment. PHP, Perl, and MySQL are involved. The table collation is latin1_swedish_ci and this may have been changed by a co-worker screwing around.
Does this ring any bells for anyone?
The problem here is internationalization on the JavaScript end, not the collation of your DB table. If you had no such problems before, it's likely that no users were inputting international characters before, or the character set of your HTML pages was ISO-8859-1/cp1252 (which would have limited form POST data on the client end.) New users or changed HTML headers could have caused this problem to manifest itself, but the issue is really on the side of the Perl script.
JSON defines strings as double-quoted sets of characters, with Unicode escape sequences used when more than a 7-bit encoding is necessary. The first 128 ISO-8859-1 characters (0–127) can be represented as-is, but any extended-ASCII/multi-byte characters will end up as \uXXXX values. For example, the character é (e-acute), which is #233 in ISO-8859-1, will show up as \u00E9 (since é is U+00E9 in Unicode), and the string "résumé" would be stored as "r\u00E9sum\u00E9".
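A quick illustration of that behavior with PHP's json_encode (a minimal sketch, assuming the script file itself is saved as UTF-8):
echo json_encode("résumé"); // prints "r\u00e9sum\u00e9" (non-ASCII characters escaped by default)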
Not knowing what your Perl script is attempting to do, all I can say is it may be experiencing difficulty when trying to de-reference the escape sequence. Perl has its own set of escape sequences, and \u mid-string actually means "make the next character upper-case", so you're probably seeing a lot of "00E9" stuff from your Perl script instead of the accented characters, or you may get parse errors depending on your script.
Since you're creating/storing the JSON from POST data in PHP, you have some options (sketched in code after this list):
Convert the special characters to HTML entities (htmlentities())
Reduce all special characters from UTF-8 (if that's what your POST data comes in as) to ISO-8859-1 via utf8_decode() (you may lose data with this approach)
Scrub the resultant JSON by replacing this regex match: /\\u[0-9a-fA-F]{4}/ with "" (nothing) (you may lose data with this approach)
Double-escape the resultant JSON by changing all "\" characters to "\\" before feeding it to your Perl script (be wary of SQL injection!)
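A minimal sketch of all four options (untested; $input and $json are my placeholder names for the raw POST string and the encoded JSON):
$safe = htmlentities($input, ENT_COMPAT, 'ISO-8859-1'); // option 1: HTML entities
$latin = utf8_decode($input); // option 2: UTF-8 down to ISO-8859-1 (lossy)
$stripped = preg_replace('/\\\\u[0-9a-fA-F]{4}/', '', $json); // option 3: strip escape sequences (lossy)
$doubled = str_replace('\\', '\\\\', $json); // option 4: double the backslashes for Perl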
I have a Fedora machine acting as a server, with Apache running PHP 5.3.
A script acts as an entry page for various sources sending me "messages".
The PHP script is called like serverAddress/phpScript.php?message=MyMessage; the message is then saved via PDO to a SQL Server 2008 db.
If the message contains any special characters (e.g. German) like üäöß, then in the db I will get gibberish like üäöß instead of the correct string.
The db is perfectly capable of UTF-8 - I can connect and send/retrieve german characters without any issue with other tools (not via php).
Inside the php script:
if I echo the input string I get the correct string üäöß
if I save it to a file (log the input) I see the gibberish: üäöß
What is causing this behavior? How can I fix it?
multibyte is enabled (yum install php-mbstring followed by an Apache restart)
at the start of my php script I have:
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
From what I understand, the default encoding when dealing with MSSQL via PDO is UTF-8.
New development:
A colleague pointed me to the PDO_DBLIB page (visible only from cache at the moment) where I saw $res->bindValue(':value', iconv('UTF-8', 'ISO8859-1', $value));
I replaced all my $res->bindParam(':text',$text); with $res->bindParam(':text',iconv('UTF-8', 'ISO8859-1',$text)); and everything worked :).
The mb_internal_encoding.... and all other lines were no longer needed.
Why does it work when using the ISO8859-1 encoding?
A database may handle special characters without even supporting the Unicode set (of which UTF-8 is one encoding, specifically a variable-length one).
A character set is a mapping between numbers and characters. Unicode and ASCII are common examples of charsets. Unicode states that the sign € maps to the number 8364 (really, the code point U+20AC). UTF-8 is a way to encode Unicode code points, and represents U+20AC with three bytes: 0xE2 0x82 0xAC; UTF-16 is another encoding for Unicode code points, and uses two bytes for this character: 0x20AC. Both encodings refer to the same entry 8364 in the Unicode catalogue.
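One way to see both encodings from PHP (assuming the script file itself is saved as UTF-8):
$euro = '€';
echo bin2hex($euro); // e282ac (UTF-8, three bytes)
echo bin2hex(mb_convert_encoding($euro, 'UTF-16BE', 'UTF-8')); // 20ac (UTF-16, two bytes)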
ASCII is both a charset and an encoding scheme: the ASCII character set maps number from 0 to 127 to 128 human chars, and the ASCII encoding requires a single byte.
Always remember that a String is a human concept. It's represented in a computer by the tuple (byte_content, encoding). Let's say you want to store Unicode strings in your database. Please note: it's not necessary to use the Unicode set if you just need to support German users. It's useful when you want to store Arabic, Chinese, Hebrew and German at the same time in the same column. MS SQL Server uses UCS-2 to encode Unicode, and this holds true for columns declared NCHAR or NVARCHAR (note the N prefix). So your first action should be checking whether the target column types are actually nvarchar (or nchar).
Then, let's assume that all input strings are UTF-8 encoded in your PHP script. You want to execute something like
$stmt->bindParam(':text', $utf8_encoded_text);
According to the documentation, UTF-8 is the default string encoding. I hope it's smart enough to work with NVARCHAR, otherwise you may need to use the extra options.
Your colleague's solution doesn't store Unicode strings: it converts them into the ISO-8859-1 space, then saves the bytes in plain CHAR or VARCHAR columns. The difference is that you won't be able to store characters outside of the ISO-8859-1 space (e.g. Polish).
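A hypothetical demonstration of that loss (the //TRANSLIT behavior is implementation-dependent; glibc shown):
var_dump(iconv('UTF-8', 'ISO-8859-1', 'ł')); // bool(false) plus a notice: Polish ł has no Latin1 slot
var_dump(iconv('UTF-8', 'ISO-8859-1//TRANSLIT', 'ł')); // at best "l", a lossy approximation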
Take a look at this article on "Handling Unicode Front to Back in a Web App". By far one of the best articles I've seen on the subject. If you follow the guide and the issues are still present, then you know for sure that it's not your fault.
I'm trying to pull sport players from my database that are already stored as Unicode values. When calling json_encode it gives up when it hits Unicode characters in the format I've got:
$values = array('a'=>'BERDYCH, Tomáš','b'=>'FEDERER, Roger');
echo json_encode($values);
the result is
{"a":"BERDYCH, Tom","b":"FEDERER, Roger"}
You can see 'Tom' was cut off because it reached the Unicode characters.
I understand json_encode only handles \uxxxx-style characters, but the problem is that my database of thousands of sporting competitors already contains Unicode values, so somehow I need to convert á-type characters into \uxxxx without updating my data source.
Any ideas?
json_encode() does this when it encounters byte sequences that are not valid UTF-8.
If you are fetching data from the database, the most likely reason is that your connection is not UTF-8 encoded, and you are getting ISO-8859-1 data from your queries.
Show your database code for a suggestion how to change this.
I understand json_encode only handles \uxxxx style characters
This is not true. json_encode() outputs Unicode characters encoded this way, but it doesn't expect them in the incoming data.
Your source code and/or the data coming from the database is not encoded in UTF-8. I'd guess it's one of the specialized ISO-8859 encodings, but I'm not sure. When saving your source code, make sure it's saved in UTF-8. When getting data from the database, make sure you're setting the connection to utf8.
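For example, a sketch with mysqli (the credentials are placeholders; PDO accepts an equivalent charset= parameter in the DSN since PHP 5.3.6):
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8'); // make the connection itself speak UTF-8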
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App.
To make sure they are UTF-8, encode all values in your array (note that utf8_encode() assumes ISO-8859-1 input):
$values = array_map('utf8_encode', $values);
If that doesn't help, use mb_detect_encoding() and mb_convert_encoding() to change language-specific encoding to UTF-8.
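A sketch of that fallback (the candidate encoding list is an assumption; tune it to your data):
$values = array_map(function ($v) {
    $enc = mb_detect_encoding($v, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
    return mb_convert_encoding($v, 'UTF-8', $enc ? $enc : 'ISO-8859-1');
}, $values);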
It's a C# question, but take a look at Converting Unicode strings to escaped ascii string for an implementation that does this.
I'm looking into how characters are handled when they fall outside the character set declared for a page.
In this case the page is set to ISO-8859-1, and the previous programmer decided to escape input using htmlentities($string, ENT_COMPAT). This is then stored into Latin1 tables in MySQL.
As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed.
I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.
Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...
htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list) and leaves the rest as-is. So you're right: using it for non-Latin1 data makes no sense. Moreover, HTML-encoding database entries and using latin1 are both bad ideas, and I'd suggest getting rid of them both.
A word of warning: after removing htmlentities(), remember that you still need to escape quotes in the data you're going to insert into the DB (mysql_real_escape_string or similar).
He could have used it as a basic safety precaution, i.e. to prevent users from inserting HTML/JavaScript into the input (because < and > will be escaped as well).
BTW, if you want to support Eastern and Western European languages, I would suggest using UTF-8 as the default character encoding.
Yes,
though not because Czech characters are outside of Latin1, but because they occupy the same byte positions in a different table (Latin2). So the database takes them as the corresponding Latin1 characters.
Using htmlentities is always bad. The only proper solution for storing different languages is to use the UTF-8 charset.
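A small demonstration of the byte aliasing described above (assuming mbstring is available):
$byte = "\xB9"; // one raw byte
echo mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-2'); // š, what a Czech user meant (Latin2)
echo mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'); // ¹, what a latin1 column stores it as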
Take note that htmlentities / htmlspecialchars take a 3rd parameter (since PHP 4.1.0) for the charset. ISO-8859-1 is the default, so if you apply htmlentities without a 3rd parameter to a UTF-8 string, for example, the output will be corrupted.
You can detect and convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input matches the desired charset.
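For example (a sketch; the candidate encodings are my assumption):
$enc = mb_detect_encoding($input, array('UTF-8', 'ISO-8859-1'), true);
$input = mb_convert_encoding($input, 'UTF-8', $enc ? $enc : 'ISO-8859-1');
$safe = htmlentities($input, ENT_COMPAT, 'UTF-8'); // 3rd parameter now matches the actual encoding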
Quick question: how can I make this valid:
if($this->datos->bathrooms == "1½"){$select1 = JText::_( 'selected="selected"' );}
The ½ doesn't seem to be recognized. I tried to write it as &frac12; but then it looks for &frac12; literally, and not the ½ sign. Any ideas?
As many others have noted, you have a character encoding problem, most likely. I'm not sure what encodings PHP supports but you need to take the whole picture into account. For this example I'm assuming your PHP script is responding to a FORM post.
Some app (yours, most likely) writes some HTML which is encoded using some encoding and sent to the browser. Common choices are ISO-8859-1 and UTF-8. You should always use UTF-8 if you can. Note: it's not the default for the web (sadly).
The browser downloads this html and renders the page. Browsers use Unicode internally, mostly, or some superset. The user submits a form. The data in that form is encoded, usually with the same encoding that the page was sent in. So if you send UTF-8 it gets sent back to you as UTF-8.
PHP reads the bytes of the incoming request and sets up its internal variables. This is where you might have a problem, if it is not picking the right encoding.
You are doing a string comparison, which decomposes to a byte comparison, but the bytes that make up the characters depend on the encoding used. As Peter Bailey wrote:
In ISO-8859-1 this character is encoded as 0xBD
In UTF-8 this character is encoded as 0xC2 0xBD
You need to verify the text encoding along each step of the way to make sure it is happening as you expect. You can verify the data sent to the browser by changing the encoding from the browser's auto-detected encoding to something else to see how the page changes.
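The simplest way to pin that first step down is to declare the encoding explicitly instead of letting the browser guess, e.g.:
header('Content-Type: text/html; charset=utf-8');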
If your data is not coming from the browser, but rather from the DB, you need to check the encodings between your app and the DB.
Finally, I'd suggest that it's impractical to use a string like 1½ as a key for comparison as you are. I'd recommend using 1.5 and detecting that at display time, then changing how the data is displayed only. Advantages: you can order the results by number of bathrooms if the value is numeric as opposed to a string, etc. Plus you avoid bugs like this one.
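A hypothetical sketch of that approach (formatBathrooms is my own name; the file must be saved as UTF-8 for the ½ literal):
function formatBathrooms($n) {
    // store 1.5 in the DB; append ½ only when rendering
    return ($n == floor($n)) ? (string)(int)$n : ((int)floor($n)) . '½';
}
echo formatBathrooms(1.5); // prints 1½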
The character you are looking for is the Unicode character Vulgar Fraction One Half (U+00BD).
There are a multitude of ways to make sure you are displaying this character properly, all of which depend on the encoding of your data. Looking at that character's encoding tables, we can see that
In ISO-8859-1, a popular western encoding, this character is encoded as BD
In UTF-8, a popular international encoding, this character is encoded as C2 BD
What this means is that if your PHP file is UTF-8 encoded but you are sending this to the browser as ISO-8859-1 (or the other way around), the character will not render properly.
As others have posted, you can also use the HTML Entity for this character which will be character-encoding agnostic and will always render (in HTML) properly, regardless of the output encoding.
Try comparing it with "1½"
Use the PHP chr function to create the character by its hex 0xBD or dec 189:
if($this->datos->bathrooms == "1".chr(189)){$select1 = JText::_( 'selected="selected"' );}
I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP, and I am using AMFPHP to pass the data on to Flex.
The problem I am having is that the data is being copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character to swap it for a plain equivalent seems to work, but obviously this requires compiling a list of all the characters that may appear, and means there is scope for this continuing as new characters are used for the first time. It also seems like a brute-force workaround rather than a proper solution.
Does anyone know what causes this and have any good workarounds/fixes? I have had similar problems when using UTF-8 characters in HTML documents that aren't set to use UTF-8. Is this the same thing, and if so, how do I get Flex to use UTF-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character, so a trivial ad-hoc replace for the smart-quote characters would be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to re-encode the output you put into it at document-creation time into the encoding the consumer of that document expects, e.g. using iconv.
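For example, a hedged sketch targeting Windows-1252 (which, unlike ISO-8859-1, has slots for Word's smart quotes; $utf8Text is a placeholder):
$out = iconv('UTF-8', 'Windows-1252//TRANSLIT', $utf8Text);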