getting json_encode to not escape html entities - php

I send json_encoded data from my PHP server to an iPhone app. Strings containing HTML entities, like '&', are escaped by json_encode and sent as &amp;.
I am looking to do one of two things:
make json_encode not escape html entities. Doc says 'normal' mode shouldn't escape it but it doesn't work for me. Any ideas?
make the iPhone app un-escape html entities cheaply. The only way I can think of doing it now involves spinning up a XML/HTML parser which is very expensive. Any cheaper suggestions?
Thanks!

Neither PHP 5.3 nor PHP 5.2 touch the HTML entities.
You can test this with the following code:
<?php
header("Content-type: text/plain"); //makes sure entities are not interpreted
$s = 'A string with &amp; &#x6F8 entities';
echo json_encode($s);
You'll see the only thing PHP does is to add double quotes around the string.

json_encode does not do that. You have another component that is doing the HTML encoding.
If you use the JSON_HEX_ options you can prevent any < or & characters from appearing in the output (they get converted to \u003C or similar JS string literal escapes), thus possibly avoiding the problem:
json_encode($s, JSON_HEX_TAG|JSON_HEX_AMP|JSON_HEX_QUOT)
though this would depend on knowing exactly which characters are being HTML-encoded further downstream. Maybe non-ASCII characters too?
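A minimal sketch of the difference the JSON_HEX_ flags make, assuming a made-up sample string (the variable names are illustrative only):

<?php
// Hypothetical sample string with the characters a downstream HTML encoder would touch.
$s = 'Tom & "Jerry" <b>';

// Default mode: only the double quotes are backslash-escaped.
$plain = json_encode($s);

// With the JSON_HEX_ flags, <, >, & and " become \uXXXX escapes.
$hex = json_encode($s, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_QUOT);

echo $plain, "\n"; // "Tom & \"Jerry\" <b>"
echo $hex, "\n";   // "Tom \u0026 \u0022Jerry\u0022 \u003Cb\u003E"

// Any JSON parser turns the escapes back into the original characters.
var_dump(json_decode($hex) === $s); // bool(true)

Because the escaped form contains no literal & or <, a later HTML-encoding step has nothing to rewrite, and the client gets the original string back after parsing the JSON.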

Based on the manual it appears that json_encode shouldn't be escaping your entities, unless you explicitly tell it to, in PHP 5.3. Are you perhaps running an older version of PHP?

Going off of Artefacto's answer, I would recommend using this header instead of plain text; it is specifically designed for JSON data.
<?php
header('Content-Type: application/json'); //Also makes sure entities are not interpreted
$s = 'A string with &amp; &#x6F8 entities';
echo json_encode($s);
For more specific reasons to use this content type, check out the post "What is the correct JSON content type?"

Related

using htmlentities with superglobal variables

I'm learning PHP from a book. The book says I should be careful when using superglobal variables, so it's better to wrap them in htmlentities like this.
$came_from = htmlentities($_SERVER['HTTP_REFERER']);
So I wrote this code:
<?php
$came_from=htmlentities($_SERVER['HTTP_REFERER']);
echo $came_from;
?>
However, the output of the code above was the same as without htmlentities(); it didn't change anything at all. I thought that it would change \ into something else. Did I use it wrong?
So, by default, htmlentities() encodes characters using ENT_COMPAT (converts double-quotes and leave single-quotes alone) and ENT_HTML401. Seeing as the backslash isn't part of the HTML 4.01 entity spec (as far as I can see anyway), it won't be converted.
If you specify the ENT_HTML5 flag, you get a different result
php > echo htmlentities('abc\123');
abc\123
php > echo htmlentities('abc\123', ENT_HTML5);
abc&bsol;123
This is because backslash is part of the HTML5 spec. See http://dev.w3.org/html5/html-author/charref
Sorry. My previous answer was absolutely wrong. I was confusing it with something else. My apologies. Let me rephrase my answer:
htmlentities will convert special characters into their HTML entities. "<", for example, will be converted to "&lt;". Your browser will automatically recognise this HTML entity and decode it back to "<", so you won't notice any difference.
The reason for this is to prevent problems when saving your document in an encoding other than UTF-8. Any characters that are not encoded might get corrupted for this reason.
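To see what htmlentities() actually changes, it helps to look at the raw string rather than the rendered page. A minimal sketch, assuming a hypothetical referer value that contains markup (a real $_SERVER['HTTP_REFERER'] often contains nothing that needs encoding, which is why the output can look unchanged):

<?php
// Hypothetical referer value; stands in for $_SERVER['HTTP_REFERER'].
$referer = 'http://example.com/?q=<script>alert("x")</script>';

// Encode the special characters so the browser shows them as text.
$safe = htmlentities($referer, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
// http://example.com/?q=&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;

// Decoding the entities gives back the original value.
var_dump(html_entity_decode($safe, ENT_QUOTES, 'UTF-8') === $referer); // bool(true)

Viewing the page source (or sending Content-type: text/plain) shows the encoded form; the rendered page shows the decoded characters.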

converting special characters in HTML into the appropriate coding for PHP

I am making a website where one fills out a form and it creates a PDF. The user will be able to enter diacritics and special characters. Because of the way I send the characters to PHP, they arrive as HTML-encoded characters, e.g. &agrave;. I need to convert these into something PHP can read, so that when I pass the text through our PDF maker the output has the diacritic character and not the HTML code for it.
I wrote a test to try this out but I haven't been able to figure it out. If I have to I will end up writing an array for every possible character they can use and translate the incoming string but I am trying to find an easier solution.
Here is the code of my test:
$title = "Test of Title for use With This Project and it should also wrap because it is s&ograve; long! Acutally it is even longer than previously expected!";
$ti = htmlspecialchars_decode($title);
I have been attempting to use htmlspecialchars_decode() to convert it, but it still comes out as &ograve; and not ò. Is there an easy way to do this?
See the documentation which tells you it won't touch most of the characters you care about and to use html_entity_decode instead.
Use the html_entity_decode function instead of htmlspecialchars_decode (which only decodes entities such as &amp;, &quot;, &lt; and &gt;, i.e. the special HTML chars, not all entities).
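A short sketch of the difference between the two functions, using a shortened, hypothetical version of the test string:

<?php
// Shortened sample title containing a named entity for a diacritic.
$title = 'it is s&ograve; long!';

// htmlspecialchars_decode() only knows the handful of special-character entities,
// so &ograve; is left untouched.
$special = htmlspecialchars_decode($title);

// html_entity_decode() knows the full HTML entity table; passing 'UTF-8'
// makes it return the character as UTF-8 bytes.
$decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');

echo $special, "\n"; // it is s&ograve; long!
echo $decoded, "\n"; // it is sò long!

Whether the PDF library then renders the character correctly depends on the encoding it expects, so the 'UTF-8' argument here is an assumption about that library.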

JSON Encode and curly quotes

I've run into an interesting behavior in the native PHP 5 implementation of json_encode(). Apparently, when serializing an object to a JSON string, the encoder will null out any properties that are strings containing "curly" quotes, the kind that would potentially be copy-pasted out of MS Word documents with the auto conversion enabled.
Is this an expected behavior of the function? What can I do to force these kinds of characters to convert to their basic equivalents? I've checked for character encoding mismatches between the database returning the data and the administration page that inserts it, and everything is set up correctly. It definitely seems like the encoder just refuses these values because of these characters. Has anyone else encountered this behavior?
EDIT:
To clarify;
MS Word will take standard quotation marks and apostrophes and convert them to more aesthetic "fancy" or "curly" quotes. These characters can cause problems when placed in content managers that have charset mismatches between their editing interface (in the HTML) and the database encoding.
That's not the problem here, though. For example, I have a json_object representing a person's profile and the string:
Jim O’Shea
(the Unicode code point for that apostrophe being \u2019)
will come out null in the JSON object when fetched from the database and directly json_encoded.
{"model_name":"Bio","logged":true,"BioID":"17","Name":null,"Body":"Profile stuff!","Image":"","Timestamp":"2011-09-23 11:15:24","CategoryID":"1"}
Never had this specific problem (i.e. with json_encode()) but a simple - albeit a bit ugly - solution I have used in other places is to loop through your data and pass it through this function I got from somewhere (will credit it when I find out where I got it):
function convert_fancy_quotes ($str) {
    return str_replace(array(chr(145),chr(146),chr(147),chr(148),chr(151)),array("'","'",'"','"','-'),$str);
}
json_encode has the nasty habit of silently dropping strings that it finds invalid (i.e. non-UTF8) characters in. (See here for background: How to keep json_encode() from dropping strings with invalid characters)
My guess is the curly quotes are in the wrong character set, or get converted along the way. For example, it could be that your database connection is ISO-8859-1 encoded.
Can you clarify where the data comes from in what format?
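A minimal sketch of that guess, assuming the name arrives as Windows-1252 bytes (the byte \x92 is the curly apostrophe in that charset; the real source encoding in the question is unknown):

<?php
// Assumption: the database connection hands back Windows-1252 bytes.
$name = "Jim O\x92Shea";

// The lone \x92 byte is not valid UTF-8, so json_encode() rejects the string.
// On PHP 5.5 and later the whole call returns false; older versions nulled the value.
$raw = json_encode(array('Name' => $name));
var_dump($raw);

// Converting to UTF-8 first gives json_encode() a string it accepts.
$utf8 = iconv('Windows-1252', 'UTF-8', $name);
$ok = json_encode(array('Name' => $utf8));
echo $ok, "\n"; // {"Name":"Jim O\u2019Shea"}

If this is the cause, setting the database connection charset to UTF-8 is likely a cleaner fix than converting each string by hand.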
If I ever need to do that, I first copy the text into Notepad and then copy it from there. Notepad forces it to be normal quotes. Never had to do it through code though...

XML character encoding issues with accents

I have had the problem a few times now while working on projects and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from Twitter and uploading them to my DB. However, when I output them to the screen I get these characters:
"moved to dusseldorf.�"
OR
también
and if I have Russian characters then I get lots of ugly boxes in place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
PHP, MYSQL
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities but I really appreciate a solution that works across the board on projects.
Thanks in advance.
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() treats the input as UTF-8 instead of mangling multibyte characters byte by byte. For more information see the documentation for htmlentities().
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
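A small sketch of why htmlspecialchars() is enough once everything is UTF-8, using a hypothetical tweet text (written with byte escapes so the example does not depend on the file's own encoding):

<?php
// Hypothetical tweet text in UTF-8; \xC3\xA9 is the letter é.
$tweet = "tambi\xC3\xA9n <3";

// htmlentities() rewrites every character that has a named entity.
$entities = htmlentities($tweet, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only touches the characters that are special in HTML
// and leaves the accented letter as it is.
$special = htmlspecialchars($tweet, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // tambi&eacute;n &lt;3
echo $special, "\n";  // también &lt;3

With a UTF-8 page and a UTF-8 database connection, the second form displays correctly and keeps the stored data free of entities.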
Make sure that you set your PHP internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores with UTF-8 encoding, though you say that's already the case.
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &gt;, &lt;, &amp;, &quot; and &apos;. All other characters need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
Hope that helps.

weird characters such as &#8234; &#8236; &#8207;

My friend has been playing around with some language stuff on our site and our file names are being output with these characters now. Usually I'd wait for him to wake up, but this is a pretty big issue as we are getting e-mails about the weird characters in the file names.
You don't see the characters when echoed in HTML, but we have the names being output to a header, which does show the characters, like so:
header('Content-Disposition: attachment; filename="'.$title.'.'.strtolower($type).'";');
How can we stop these characters from displaying? They are also being input to our database, with file names such as &#8234;asdfmovie&#8236;&#8207; - I have googled the codes but I can't find any results for them.
Does anyone know what they are? and how to avoid them?
Thank you
html_entity_decode()
http://php.net/manual/en/function.html-entity-decode.php
These are html entities that are valid in HTML. Your email client is actually encoding them into HTML entities (a double effect), which means that the actual entities are what you're seeing. Just make sure that anything passed into the email runs through the html_entity_decode() function.
These are HTML entities which can be decoded using html_entity_decode, like echo html_entity_decode($str, ENT_COMPAT, 'UTF-8').
It's wrong to store such values in the database though, as you are seeing. The values should be stored in their original form and only HTML entity encoded when necessary for outputting to HTML. Figure out where they're being HTML encoded and fix that. If you already have a database full of this nonsense... um, have fun reversing it. :o)
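A minimal sketch of cleaning such a title before it goes into the header. The numeric entities &#8234;, &#8236; and &#8207; are Unicode directional formatting characters (U+202A, U+202C and U+200F), which are invisible in HTML. Stripping them after decoding is an extra step not mentioned in the answers above, and clean_filename() is a hypothetical helper name:

<?php
// Hypothetical helper: decode entities, then drop invisible directional marks.
function clean_filename($title) {
    // Turn &#8234; and friends back into the actual Unicode characters.
    $decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');
    // Remove left-to-right/right-to-left marks and embedding/override controls.
    return preg_replace('/[\x{200E}\x{200F}\x{202A}-\x{202E}]/u', '', $decoded);
}

$clean = clean_filename('&#8234;asdfmovie&#8236;&#8207;');
echo $clean, "\n"; // asdfmovie

This only tidies the output; the longer-term fix is still to find where the values are being entity-encoded before they are stored.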
