decoding ISO characters

decoding ISO characters - php

I got Chinese characters encoded in ISO-8859-1, for example 兼 = 兼
Those characters are taken form the database using AJAX and sent by Json using json_encode.
I then use the template Handlebars to set the data on the page.
When I look at the ajax page the characters are displayed correctly, the source is still encoded.
But the final result displays the encrypted characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1, but I can change them in UTF8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.

You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.

Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity 兼 instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpet those numeric entities:
{{{your expression}}}

Related

Why is php converting certain characters to '?'

Everything in my code is running my database(Postgresql) is using utf8 encoding, I've checked the php.ini file its encoding is utf8, I tried debugging to see if it was any of the functions I used that were doing this, but nothing everything is running as expected, however after my frontend sends a post request to backend server through curl for some text to be inserted in the database, some characters like 'da' are converted to '?' in postgre and in memcached, I think php is converting them to Latin-1 again after the request reaches the other side for some reason becuase I use utf8_encode before the request and utf8_decode on the other side
this is the code to send the request
$pre_opp->
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));

Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf_encode and utf_decode anyways, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= str_replace("%"," ",utf8_decode(urldecode($_POST["Data"])));

You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what uf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.

encoding from any language to utf in php

I insert from csv characters from different languages..
I apply this to every set of characters:
private function process_elements($element){
utf8_encode($element);
return $element;
}
The problem is when they go into the database, they go like this:
???????? ?? ???????????? ????? ??????? ??? ???????...
When I retrieve them from the databse, I also get this.
This happens with greek. However, when I retrieve greek pages (through scrapping), who are on a utf encoded page. The characters look like this:
Î”ÎµÏ‚ webcam Î´Ï‰Î¼Î¬Ï„Î¹Î± | Gr.ImLive.com
which is okay, because when i use the utf8_encode function, they look normal on the screen..
But when the data is taken from the csv and be put into the database, i get those question marks..
Is there a way to encode form any language to utf.. why retrieving data from csv and a utf8 encoded webpage makes such a difference.. they look the same.. how do I address that problem?

please take a look at this
it will help you
Handling Unicode Front To Back In A Web App

It's not about "languages", it's about encodings. Text is encoded as bits and bytes. Any one byte is equal to any other byte. If you only have a blob of bytes, you cannot know what encoding it represents. You can guess, but that's not accurate. You have to know what encoding some text is in by reading the accompanying meta data. That may be documentation, a <meta> tag or an HTTP header. Then you need to treat the text in that encoding.
utf8_encode actually converts text from ISO-8859-1 to UTF-8. It does not simply encode anything to UTF-8, because it does not have the means to determine what something is encoded in either. If your text is already UTF-8 encoded or was not ISO-8859-1 encoded to begin with, you're just garbling the text (as you are).

Encode ● special character in php

I need to store the special char ● inside a database.
I'm doing this:
htmlentities($text,ENT_QUOTES,'UTF-8')
for reasons I don't understand the char ● does not get encoded, it remains in its "normal" (●) form instead of its encoded (●) form, it looks like this character is part of the utf-8 charset, but I need to have it encoded (●) anyway in the db. I cannot use another charset because I need UTF-8.
Why does this happen?
Thanks
EDIT: It's not a discussion about encoding or not encoding chars in the database, I would like to know why that particular char gets completely ignored.

You can use the multi-byte functions to convert your UTF-8 string to US-ASCI while replacing any non-ASCII character by a character reference:
mb_substitute_character('entity');
$ascii = mb_convert_encoding($utf8, 'ASCII', 'UTF-8');
Although I don’t see any reason to do this either. Your database won’t interpret any string as HTML.

The built-in function with a widest set of HTML entities is probably mb_convert_encoding:
<?php
// Assuming UTF-8 input:
echo mb_convert_encoding('●', 'HTML-ENTITIES', 'UTF-8');
In any case, once you've stored plain text as HTML, it's difficult to get it back to plain text (or re-encode it if specs change). I'd recommend storing stuff as-is.

(PHP) rawurlencode/decode seems to encode '£' sign as 'Â£' (%C2%A3 instead of %A3)

So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web-server, and we've used rawurlencode for this. This works fine with almost every character I've found, expect for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh - the checks are done on the backend side - and try and refill the form boxes with the information the user had used), then when the %C2 is run through rawurldecode/encode, it becomes Ã? - aka, %C3?. And of course the "£" is also turned into another Â£!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgable about these things. I heard somewhere that I can encode £s as &pound manually, but why should I need to do that when the database can handle "£"s, and there is a percentage-encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.

The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)

This is probably encoding A3 character in your native character set to C2A3 in UTF-8 encoding, which seems to be the valid UTF-8 encoding for an ANSI A3. Just consume your encoded url using UTF-8 encoding, or specify an ANSI encoding to urlencode.
Artefacto's answer represents a case when you need to convert character encodings, for example, you are displaying a page and the page encoding is set to Latin-1. (Raw)Urlencode will produce escaped strings with multibyte character representations. (Raw)Urldecode will by default produce utf-8 encoded strings, and will represent £ as two bytes. If you display this string making a claim that it is a ISO-8859 encoded string, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.

utf-8 and htmlentities in RSS feeds

I'm writing some RSS feeds in PHP and stuggling with character-encoding issues. Should I utf8_encode() before or after htmlentities() encoding? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:
$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));
And why?

It's important to pass the character set to the htmlentities function, as the default is ISO-8859-1:
utf8_encode(htmlentities($source,ENT_COMPAT,'utf-8'));
You should apply htmlentities first as to allow utf8_encode to encode the entities properly.
(EDIT: I changed from my opinion before that the order didn't matter based on the comments. This code is tested and works well).

First: The utf8_encode function converts from ISO 8859-1 to UTF-8. So you only need this function, if your input encoding/charset is ISO 8859-1. But why don’t you use UTF-8 in the first place?
Second: You don’t need htmlentities. You just need htmlspecialchars to replace the special characters by character references. htmlentities would replace “too much” characters that can be encoded directly using UTF-8. Important is that you use the ENT_QUOTES quote style to replace the single quotes as well.
So my proposal:
// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)
// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')

Don't use htmlentities()!
Simply use UTF-8 characters. Just make sure you declare encoding of the feed in HTTP headers (Content-Type:application/xml;charset=UTF-8) or failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.

It might be easier to forget htmlentities and use a CDATA section. It works for the title section, which doesn't seem support encoded HTML characters in Firefox's RSS viewer:
<title><![CDATA[News & Updates " > » ☂ ☺ ☹ ☃ Test!]]></title>

You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF8 first, and then have ampersands (and possibly some of the UTF-8 characters as well) turned in to HTML entities. If you do the entities first, then some of the international characters may not be handled properly.
If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.

After much trial & error, I finally found a way to properly display a string from a utf8-encoded database value, through an xml file, to an html page:
$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';
I hope this helps someone.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

decoding ISO characters - php

Related

Why is php converting certain characters to '?'

encoding from any language to utf in php

Encode ● special character in php

(PHP) rawurlencode/decode seems to encode '£' sign as 'Â£' (%C2%A3 instead of %A3)

utf-8 and htmlentities in RSS feeds

Categories

Resources