PHP - Replace JSON with the correct Unicode symbol - php

Ok, so I have some JSON, that when decoded, I print out the result. Before the JSON is decoded, I use stripslashes() to remove extra slashes. The JSON contains website links, such as https://www.w3schools.com/php/default.asp and descriptions like Hello World, I have u00249999999 dollars
When I print out the JSON, I would like it to print out
Hello World, I have $9999999 dollars, but it prints out Hello World, I have u00249999999 dollars.
I assume that the u0024 is not getting parsed because it has no backslash. The forward slashes in the website links aren't removed by stripslashes(), which is good, so I think it is the backslashes of the Unicode escapes that stripslashes() removes.
How do I get the PHP to automatically detect and parse the Unicode dollar sign? I would also like to apply this rule to every single Unicode symbol.
Thanks In Advance!

According to the PHP documentation on stripslashes(), it
un-quotes a quoted string.
That means it removes the backslashes used for escaping characters (including those in Unicode escape sequences). Once they are gone, you can no longer be sure that a sequence such as "u0024" was meant to be a Unicode escape; your user could simply have typed it.
Besides that, you will run into trouble when using stripslashes() on a JSON value that contains escaped quotes. Consider this example:
{
"key": "\"value\""
}
This will become invalid when using stripslashes() because it will then look like this:
{
"key": ""value""
}
Which is not parseable as it isn't a valid JSON object. When you don't use stripslashes(), the JSON parser converts all escape sequences itself, so json_decode() already returns the actual Unicode characters your data contains, ready to be output to the client.
Conclusion: I'd suggest not using stripslashes() when dealing with JSON, as it may break things (as seen in the previous example, but also in your problem).
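To illustrate the point, here is a minimal sketch that decodes the same JSON with and without stripslashes(). The payload is a hypothetical one modelled on the question, not the asker's real data.

```php
<?php
// Hypothetical payload modelled on the question: one escaped Unicode
// sequence (\u0024 is "$") and one URL with escaped forward slashes.
$json = '{"description":"Hello World, I have \u00249999999 dollars",'
      . '"url":"https:\/\/www.w3schools.com\/php\/default.asp"}';

// Decode directly: the JSON parser resolves \u0024 and \/ itself.
$decoded = json_decode($json, true);
echo $decoded['description'], "\n"; // Hello World, I have $9999999 dollars
echo $decoded['url'], "\n";         // https://www.w3schools.com/php/default.asp

// For comparison, stripping the backslashes first destroys the escape.
$stripped = json_decode(stripslashes($json), true);
echo $stripped['description'], "\n"; // Hello World, I have u00249999999 dollars
```

The links come out fine either way, which is why the damage done by stripslashes() only shows up in the Unicode escapes.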

Your assumption is correct: u0024 is not getting parsed because it has no backslash. You can use a regex to add the backslash back after the conversion.
It looks like you have UTF-8 encoded strings internally and PHP outputs them properly, but your browser fails to auto-detect the encoding (it falls back to ISO 8859-1 or some other encoding).
The best way is to tell the browser that UTF-8 is being used by sending the corresponding HTTP header:
header("content-type: text/html; charset=UTF-8");
Then, you can leave the rest of your code as-is and don't have to html-encode entities or create other mess.
If you want, you can additionally declare the encoding in the generated HTML by using the <meta> tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
for HTML <= 4.01, or
<meta charset="UTF-8">
for HTML5.
The HTTP header has priority over the <meta> tag, but the latter may be useful if the HTML is saved to disk and then read locally.

The main question you have to answer is: why do you need to strip slashes at all?
And, if it really is necessary to strip slashes, how will you handle the encoding? It is probably a good idea to convert the Unicode escapes before stripping slashes, not after, using html_entity_decode.
Anyway, you can try to fix the problem with this workaround:
$string = "Hello World, I have u00249999999 dollars";
$string = preg_replace( "/u([0-9A-Fa-f]{4})/", "&#x$1;", $string ); // turn "u" + exactly 4 hex digits into a numeric entity
$string = html_entity_decode( $string, ENT_COMPAT, 'UTF-8' ); // convert to utf-8

Related

Output PHP string to show escaped characters

In PHP, is it at all possible to output the contents of a string to show any escaped characters that may be contained within the string? I get that the whole point of escaping characters is so that they aren't treated in the usual way. But I would still like to be able to view the raw contents of a string so I can see for myself exactly how characters like \n and \r, etc. are represented. Does PHP have a method for doing this?
Use json_encode() to encode the string as JSON. The JSON encoding of strings (which is, in fact, JavaScript) is the same as the one used by PHP. Both JavaScript and PHP were inspired from C and they copied the notation of string literals from it.
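As a small sketch of that suggestion (the sample string is made up for illustration), json_encode() makes the control characters visible again:

```php
<?php
// A string containing real control characters (double quotes make PHP
// interpret \n, \r and \t).
$raw = "line one\nline two\r\tindented";

// json_encode() re-escapes the control characters so they become visible.
$visible = json_encode($raw);
echo $visible, "\n"; // "line one\nline two\r\tindented" (including the quotes)
```

Note that the result is wrapped in double quotes, since it is a JSON string literal.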
If you use single quotation marks, it should do what you need,
e.g. echo 'this\n'; will output this\n whereas echo "this\n"; will output this and a new line.

decoding ISO characters

I got Chinese characters encoded in ISO-8859-1, for example &#20860; = 兼
Those characters are taken from the database using AJAX and sent as JSON using json_encode.
I then use the Handlebars template engine to set the data on the page.
When I look at the AJAX page the characters are displayed correctly, while the source is still encoded.
But the final result displays the encoded characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1, but I can change them in UTF8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
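A minimal sketch of the difference between the two functions, assuming the database value is the decimal numeric entity for 兼 (the variable names are only illustrative):

```php
<?php
// &#20860; is the decimal numeric entity for the character U+517C.
$entity = '&#20860;';

// htmlspecialchars_decode() only handles &amp; &lt; &gt; and quotes,
// so a numeric entity like this one passes through unchanged.
$unchanged = htmlspecialchars_decode($entity);

// html_entity_decode() resolves it, given a charset that can hold the result.
$decoded = html_entity_decode($entity, ENT_COMPAT, 'UTF-8');

echo $unchanged, "\n";
echo $decoded, "\n";
echo bin2hex($decoded), "\n"; // UTF-8 bytes of the decoded character
```

The decoded character only displays correctly if the page itself is served as UTF-8.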
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}

Illegal non-standard quotes in XML

I'm allowing some user input on my website, that later is read in XML. Every once in a while I get these weird single or double quotes like this ”’. These are directly copied from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my xml. htmlentities did not seem to touch them.
Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.
EDIT- I forgot to clarify these quotes are not being used in attributes, but in the following way:
<SomeTag>User’s Input</SomeTag>
Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:
<?xml version="1.0" encoding="UTF-8"?>
There may also be a UTF-8 option in the parser's API.
Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!
Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.
Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".
If you need to get rid of them, a simple global replace using a text editor will do the job fine.
But you might try to work out first why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply some unwanted transcoding is happening somewhere along the line).
If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:
$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;
For me gives:
&rdquo;&rsquo;
whereas
$html = htmlentities( '”’' );
echo $html;
gets confused:
â??â??
If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
Stay away from Microsoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart quotes".
These quote characters are truly non-standard and never made it into the official Latin-1 character set. All the MS Office apps "helpfully" replace standard quote characters with these abominations.
Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips and regexes to get rid of these.
Use
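$s = 'User’s Input';
$descriptfix = preg_replace('/[“”]/u', '"', $s); // curly double quotes to straight; /u treats the pattern as UTF-8
$descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix); // curly single quotes to straight
echo '<SomeTag>' . htmlentities($descriptfix, ENT_QUOTES, 'UTF-8') . '</SomeTag>';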

Convert HTML numbered entities in php to unicode for use on iPhone

I'm creating a web service to transfer json to an iPhone app. I'm using json-framework to receive the json, and that works great because it automatically decodes things like "\u2018". The problem I'm running into is there doesn't seem to be a comprehensive way to get all the characters in one fell swoop.
For example html_entity_decode() gets most things, but it leaves behind stuff like &#8216; (‘). In order to catch these entities and convert them to something json-framework can use (e.g., \u2018), I'm using this code to convert the &# to \u, convert the numbers to hex, and then strip the ending semicolon.
function func($matches) {
    return "\u" . dechex($matches[1]);
}
$json = preg_replace_callback("/&#(\d{4});/", "func", $json);
This is working for me at the moment, but it just doesn't feel right. It seems like I'm surely missing some characters that are going to come back to haunt me later.
Does anyone see flaws in this approach? Can anyone think of characters this approach will miss?
Any help would be most appreciated!
From where are you getting this HTML-encoded input? If you're scraping a web page you should be using an HTML parser, which will decode both entity and character references for you. If you are getting them in form input data, you've got a problem with encodings (make sure to serve the page containing the form as UTF-8 to avoid this).
If you must convert an HTML-encoded stretch of literal text to JSON, you should do it by HTML-decoding first then JSON-encoding, rather than attempting to go straight to JSON format (which will fail for a bunch of other characters that need escaping). Use the built-in decoder and encoder functions rather than trying to create JSON-encoded characters like \u.... yourself (as there are traps there).
$html= 'abc &quot; def &#1234; ghi &#x1234; jkl \n mno';
$raw= html_entity_decode($html, ENT_COMPAT, 'utf-8');
$json= json_encode($raw);
"abc \" def \u04d2 ghi \u1234 jkl \\n mno"
&#8216; is a decimal numbered entity, while I believe \u2018 is a hexadecimal representation. HTML also supports hexadecimal numbered entities (e.g., &#x2018;), but once you've found # as the entity prefix you're looking at either decimal or hex. There are also named entities (e.g., &amp;) but it doesn't sound like you need to cover those cases in your code.
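A quick way to check this for yourself is the sketch below. It assumes UTF-8 is passed explicitly to html_entity_decode(), which may well be why the question saw the entity left behind.

```php
<?php
// Decimal and hexadecimal references to the same code point, U+2018.
$decimal = html_entity_decode('&#8216;', ENT_QUOTES, 'UTF-8');
$hex     = html_entity_decode('&#x2018;', ENT_QUOTES, 'UTF-8');

var_dump($decimal === $hex);      // bool(true)
echo json_encode($decimal), "\n"; // "\u2018"
```

So both spellings decode to the same character, and json_encode() produces the \u form without any hand-written conversion.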
$html_escape = "&quot;Love sex magic rise&quot; &amp; &#23609;&#30495;&#24076; &#8216;";
$utf8 = mb_convert_encoding($html_escape, 'UTF-8', 'HTML-ENTITIES');
echo json_encode(array(
"title" => $utf8
));
// {"title":"\"Love sex magic rise\" & \u5c39\u771f\u5e0c \u2018"}
This works well for me.

PHP Json_Encode strange characters?

I am using JSON_ENCODE in PHP to output data.
When it gets to this word: Æther it outputs \u00c6ther.
Anyone know of a way to make json output that character or am I going to have to change the text to not have that character in it?
That's the unicode version of the character. JavaScript should handle it properly. You'll notice the slash before it which means that it's an escape sequence. The u indicates it's a unicode code point and the hex digits represent the actual character.
See here for some more info.
That is working as specified. The RFC ( http://www.ietf.org/rfc/rfc4627.txt ) indicates that any character may be escaped, and your average printable character can be written in the \uXXXX format.
Any JSON parser that cannot understand a character escaped in that way is not compliant with the standard. Work on resolving that problem rather than trying to coax PHP into misbehaving as well.
(It is legal to put UTF-8 characters into JSON strings without escaping them as well, with a few exceptions, but the safe approach of escaping anything questionable is wise.)
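If you do want the literal character in the output, json_encode() accepts the JSON_UNESCAPED_UNICODE flag (PHP 5.4 and later). A minimal sketch, assuming the source file is saved as UTF-8:

```php
<?php
$word = 'Æther'; // assumes this source file is saved as UTF-8

$escaped   = json_encode($word);                         // "\u00c6ther"
$unescaped = json_encode($word, JSON_UNESCAPED_UNICODE); // "Æther"

// Both forms decode to the same string in a compliant parser.
var_dump(json_decode($escaped) === json_decode($unescaped)); // bool(true)
```

Either output is valid JSON; the escaped default is simply the more conservative choice if you are unsure how the response will be transported.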
