How to encode large HTML lists for jQuery? - php

I have a PHP script that fetches a relatively large amount of data, and formats it as HTML unordered lists for use in an Ajax application.
Considering the data is in the order of tens to possibly more than a hundred KB, and that I want to be able to differentiate between the different lists with Javascript, what would be the best way to go about doing this?
I thought about json_encode, but that results in [null] when more than a certain amount of rows are requested (maybe PHP memory limit?).
Thanks a lot,
Fela

Certain illegal characters in the string could be breaking the json_encode() function in PHP which you will need to sanitize before this will work correctly. You could do this using regular expressions if this becomes a problem.
However, if you are sending requests with that amount of data it may be unwise to send this using AJAX as your application will seem very unresponsive. It may be better to get this data directly from the database as this would be a far faster method although you will have to obviously compromise.

I cannot click up, but I agree with Daniel West.
Ensure your strings are UTF-8 encoded or use mysql_set_charset('utf8') when you connect. The default charset for mysql is unfortunately Latin/Windows. Null is the result of a failed encoding because of this, if it was out of memory, the script itself would fail.

I would pass the data around via JSON, and do it in small batches that you can present incrementally. This will make it appear faster and may get around the memory issues you are having. If the data above the scroll ( first 40 lines or so ) loads quick, it is ok if the rest of the page takes several seconds to load. If you want to get tricky, you can even load the first page and then wait for a scroll event to load the rest, so you don't have to hit the server too much if the user never scrolls to look at the data below the scrollbar. If php is returning null from the json_encode it is because of invalid characters. If you cant control the data, you could just send HTML from the server to the client and avoid all the encoding/decoding, but this means more data to transfer.

With jquery you can transform your unordered list into a javascript array in your ajax application.
$.map( $('li'), function (element) { return $(element).text() });
Also, underscorejs as some very neat functions for javascript arrays and collections.
http://documentcloud.github.com/underscore/

I would prefer JSON, and I thought the 'null' you get from PHP encode is resulted from jSON's double escaping mechanism with javaScript. I have explained it in another post.
You need to double escape special character(One suspicious 'null' cause is '\n' in your case)
json parse error with double quotes

Related

hidden field becomes visible with large amount of serialized data

I have hidden fields which contain a very large amount of serialized data (I'm talking around 1300 records from a database). With all of this data, the hidden fields become visible as text boxes containing the serialized data. When, on the other hand, I limit this data to say, 200 records, the fields remain hidden as they should. I went ahead and inspected this issue in Chrome's "inspect element" and I noticed that many of the HTML characters throughout the page are all out of place.
For example:
class="cont_buffer" turned into class="”cont_buffer”"
The extra quotation marks are messing up the input type="hidden" and thus showing the fields.
What could I do to get around this issue?
Two things:
First, it sounds like you should be html encoding the value of the hidden input with htmlentities($value, ENT_QUOTES) to make sure quotes are properly encoded for use as an HTML attribute value.
However, there is probably a better way to do what you are doing rather than to pass thousands of records back to the browser so they can be returned by the browser to the server. Instead of passing to a hidden input, you are probably better off storing the rows temporarily in $_SESSION. In some cases, it may even be faster to just requery that many rows when the form post is recieved, rather than have the browser pass them back.
If you are serializing the rows in PHP with serialize(), I assume you intend to unserialize() them in PHP. Use extreme caution when doing so, because this opens the possibility that the browser could send malicious data back that would unserialize into something harmful to your application. If you must send serialized data from the browser to unserialize(), be certain to validate the object or array you receive -- make sure it contains all the keys or properties you expect, and only those you expect so that you don't have problems when iterating over it.

Encoding large numbers with json_encode in php

I have a php script that outputs a json-encoded object with large numbers (greater than PHP_MAX_INT) so to store those numbers internally, I have to store them as strings. However, I need them to be shown as un-quoted numbers to the client.
I've thought of several solutions, many of which haven't worked. Most of the ideas revolve around writing my own JSON encoder, which I have done already, but don't want to take the time to change all the places I have json_encode to instead say my_json_encode.
Since I have no control over the server, I cannot turn remove the JSON library. I cannot undeclare json_encode, nor can I rename it. Is there any easy way to handle all this, or is the best option to just go through each and every file and rename all the method calls?
With javascript being loosely typed, why the need to control the type in the JSON data? What are you doing with this number in javascript, and would parseInt\parseFloat not be able to make the leap from string to number on the client side?
The only option I had was to use my own json_encode method renamed to my_json_encode, and then change everywhere that called that method.

Store html entities in database? Or convert when retrieved?

Quick question, is it a better idea to call htmlentities() (or htmlspecialchars()) before or after inserting data into the database?
Before: The new longer string will cause me to have to change the database to hold longer values in the field. (maxlength="800" could change to a 804 char string)
After: This will require a lot more server processing, and hundreds of calls to htmlspecialchars() could be made on every page load or AJAX load.
SOOO. Will converting when results are retrieved slow my code significantly? Should I change the DB?
I'd recommend storing the most raw form of the data in the database. That gives you the most flexibility when choosing how and where to output that data.
If you find that performance is a problem, you could cache the HTML-formatted version of this data somehow. Remember that premature optimization is a bad thing.
I have no experience of php but generally I always convert or escape nearest to output. You don't know when your output requirements will change, for example you may want to spit out data as XML, or JSON arrays and so escaping for HTML and then storing means you're limited to using the data as HTML alone.
In a php/MySQL web app, data flows in two ways
Database -> scripting language (php) -> HTML output -> browser ->screen
and
Keyboard-> browser-> $_POST -> php -> SQL statement -> database.
Data is defined as everything provided by the user.
ALWAYS ALWAYS ALWAYS....
A) process data through mysql_real_escape_string as you move it into an SQL statement, and
B) process data through htmlspecialchars as you move it into the HTML output.
This will protect you from sql injection attacks, and enable html characters and entities to display properly (unless you manage to forget one place, and then you have opened up a security hole).
Did I mention that this has to be done for every single piece of data any user could ever have touched, altered or provided via a script?
p.s. For performance reasons, use UTF-8 encoding everywhere.
It's best to store text as raw and encode it as needed, to be honest, you always need to htmlencode your data anyways when you're outputting it to the wbe page to prevent XSS hacking.
You shouldn't encode your data before you put it in the database. The main reason are:
If such data is near the column size limit, say 32 chars, if the title was "Steve & Fred blah blah" then you might go over that column limit because a 1 char & becomes a 5 char & amp;
You are assuming the data will always be displayed in a web page, in the future you never know where you'll be looking at the data and you might not want it encoded, now you have to decode it and it's possible you might not have access to PHP's decode function
It is the way of the craftsman to "measure twice, optimize once".
If you don't need high performance for your website, store it as raw data and when you output it do what you want.
If you need performance then consider storing it twice: raw data to do what you want with it and another field with the filtered data. It could be seen as redundant, but CPU is expensive, while data storage is really cheap.
The easiest way is store the data "as is" and then convert to htmlentities wherever it is needed.
The safest solution is to filter the data before it goes in into the Database as this prevents possible attacks on your server and database from the lack of security implementation, and then convert it however you need when needed. Also if you are using PDO this will happen automatically for you using prepared statements.
http://php.net/PDO
We had this debate at work recently. We decided to store the escaped values in the database, because before (when we were storing it unescaped) there were corner cases where data was being displayed without being escaped. This can lead to XSS. So we decided to store it escaped to be safe, and if you want it unescaped you have to do the work yourself.
Edit: So to everyone who disagrees, let me add some backstory for my case. Let's say you're working in a team of 50+ people... and data from the database is not guaranteed to be HTML-Encoded on the way out - there's no built-in mechanism for it so the developer has to write the code to do it. And this data is shown all over the place so it's not going through 1 developer's code it's going through 30's - most of whom have no clue about this data (or that it could even contain angle brackets which is rare) and merely want to get it shown on the page, move on, and forget about it.
Do you still think it's better to put the data, in HTML, into the database and rely on random people who are not-you to do things properly? Because frankly, while it certainly may not seem warm-fuzzy-best-practicey, I prefer to fail closed (meaning when the data comes through in a Word Doc it looks like Value<Stock rather than Value<Stock) rather than open (so the Word Doc looks right with no work, but some corner of the platform may/likely-is vulnerable to XSS). You can't have both.

Anyone have issues going from ColdFusion's serializeJSON method to PHP's json_decode?

The Interwebs are no help on this one. We're encoding data in ColdFusion using serializeJSON and trying to decode it in PHP using json_decode. Most of the time, this is working fine, but in some cases, json_decode returns NULL. We've looked for the obvious culprits, but serializeJSON seems to be formatting things as expected. What else could be the problem?
UPDATE: A couple of people (wisely) asked me to post the output that is causing the problem. I would, except we just discovered that the result set is all of our data (listing information for 2300+ rental properties for a total of 565,135 ASCII characters)! That could be a problem, though I didn't see anything in the PHP docs about a max size for the string. What would be the limiting factor there? RAM?
UPDATE II: It looks like the problem was that a couple of our users had copied and pasted Microsoft Word text with "smart" quotes. Those pesky users...
You could try operating in UTF-8 and also letting PHP know that fact.
I had an issue with PHP's json_decode not being able to decode a UTF-8 JSON string (with some "weird" characters other than the curly quotes that you have). My solution was to hint PHP that I was working in UTF-8 mode by inserting a Content-Type meta tag in the HTML page that was doing the submit to the PHP. That way the content type of the submitted data, which is the JSON string, would also be UTF-8:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
After that, PHP's json_decode was able to properly decode the string.
can you replicate this issue reliably? and if so can you post sample data that returns null? i'm sure you know this, but for informational sake for others stumbling on this who may not, RFC 4627 describes JSON, and it's a common mistake to assume valid javascript is valid JSON. it's better to think of JSON as a subset of javascript.
in response to the edit:
i'd suggest checking to make sure your information is being populated in your PHP script (before it's being passed off to json_decode), and also validating that information (especially if you can reliably reproduce the error). you can try an online validator for convenience. based on the very limited information it sounds like perhaps it's timing out and not grabbing all the data? is there a need for such a large dataset?
I had this exact problem and it turns out it was due to ColdFusion putting none printable characters into the JSON packets (these characters did actually exist in our data) but they can't go into JSON.
Two questions on this site fixed this problem for me, although I went for the PHP solution rather than the ColdFusion solution as I felt it was the more elegant of the two.
PHP solution
Fix the string before you pass it to json_decode()
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
ColdFusion solution
Use the cleanXmlString() function in that SO question after using serializeJSON()
You could try parsing it with another parser, and looking for an error -- I know Python's JSON parsers are very high quality. If you have Python installed it's easy enough to run the text through demjson's syntax checker. If it's a very large dataset you can use my library jsonlib -- memory use will be higher than with demjson, but it will run faster because it's written in C.

How do I HTML Encode all the output in a web application?

I want to prevent XSS attacks in my web application. I found that HTML Encoding the output can really prevent XSS attacks. Now the problem is that how do I HTML encode every single output in my application? I there a way to automate this?
I appreciate answers for JSP, ASP.net and PHP.
One thing that you shouldn't do is filter the input data as it comes in. People often suggest this, since it's the easiest solution, but it leads to problems.
Input data can be sent to multiple places, besides being output as HTML. It might be stored in a database, for example. The rules for filtering data sent to a database are very different from the rules for filtering HTML output. If you HTML-encode everything on input, you'll end up with HTML in your database. (This is also why PHP's "magic quotes" feature is a bad idea.)
You can't anticipate all the places your input data will travel. The safe approach is to prepare the data just before it's sent somewhere. If you're sending it to a database, escape the single quotes. If you're outputting HTML, escape the HTML entities. And once it's sent somewhere, if you still need to work with the data, use the original un-escaped version.
This is more work, but you can reduce it by using template engines or libraries.
You don't want to encode all HTML, you only want to HTML-encode any user input that you're outputting.
For PHP: htmlentities and htmlspecialchars
For JSPs, you can have your cake and eat it too, with the c:out tag, which escapes XML by default. This means you can bind to your properties as raw elements:
<input name="someName.someProperty" value="<c:out value='${someName.someProperty}' />" />
When bound to a string, someName.someProperty will contain the XML input, but when being output to the page, it will be automatically escaped to provide the XML entities. This is particularly useful for links for page validation.
A nice way I used to escape all user input is by writing a modifier for smarty wich escapes all variables passed to the template; except for the ones that have |unescape attached to it. That way you only give HTML access to the elements you explicitly give access to.
I don't have that modifier any more; but about the same version can be found here:
http://www.madcat.nl/martijn/archives/16-Using-smarty-to-prevent-HTML-injection..html
In the new Django 1.0 release this works exactly the same way, jay :)
My personal preference is to diligently encode anything that's coming from the database, business layer or from the user.
In ASP.Net this is done by using Server.HtmlEncode(string) .
The reason so encode anything is that even properties which you might assume to be boolean or numeric could contain malicious code (For example, checkbox values, if they're done improperly could be coming back as strings. If you're not encoding them before sending the output to the user, then you've got a vulnerability).
You could wrap echo / print etc. in your own methods which you can then use to escape output. i.e. instead of
echo "blah";
use
myecho('blah');
you could even have a second param that turns off escaping if you need it.
In one project we had a debug mode in our output functions which made all the output text going through our method invisible. Then we knew that anything left on the screen HADN'T been escaped! Was very useful tracking down those naughty unescaped bits :)
If you do actually HTML encode every single output, the user will see plain text of <html> instead of a functioning web app.
EDIT: If you HTML encode every single input, you'll have problem accepting external password containing < etc..
The only way to truly protect yourself against this sort of attack is to rigorously filter all of the input that you accept, specifically (although not exclusively) from the public areas of your application. I would recommend that you take a look at Daniel Morris's PHP Filtering Class (a complete solution) and also the Zend_Filter package (a collection of classes you can use to build your own filter).
PHP is my language of choice when it comes to web development, so apologies for the bias in my answer.
Kieran.
OWASP has a nice API to encode HTML output, either to use as HTML text (e.g. paragraph or <textarea> content) or as an attribute's value (e.g. for <input> tags after rejecting a form):
encodeForHTML($input) // Encode data for use in HTML using HTML entity encoding
encodeForHTMLAttribute($input) // Encode data for use in HTML attributes.
The project (the PHP version) is hosted under http://code.google.com/p/owasp-esapi-php/ and is also available for some other languages, e.g. .NET.
Remember that you should encode everything (not only user input), and as late as possible (not when storing in DB but when outputting the HTTP response).
Output encoding is by far the best defense. Validating input is great for many reasons, but not 100% defense. If a database becomes infected with XSS via attack (i.e. ASPROX), mistake, or maliciousness input validation does nothing. Output encoding will still work.
there was a good essay from Joel on software (making wrong code look wrong I think, I'm on my phone otherwise I'd have a URL for you) that covered the correct use of Hungarian notation. The short version would be something like:
Var dsFirstName, uhsFirstName : String;
Begin
uhsFirstName := request.queryfields.value['firstname'];
dsFirstName := dsHtmlToDB(uhsFirstName);
Basically prefix your variables with something like "us" for unsafe string, "ds" for database safe, "hs" for HTML safe. You only want to encode and decode where you actually need it, not everything. But by using they prefixes that infer a useful meaning looking at your code you'll see real quick if something isn't right. And you're going to need different encode/decode functions anyways.

Categories