PHP, convert $_GET elements to HTML entities - php

Edit: I think %27 is actually the wrong kind of quote. I am still stuck though, I cannot find a PHP function that does the conversion I want.
Edit (again): I found a solution where I stick %26rsquo%3bs into the URL and it turns into ’. It works so I posted it as an answer below but I'd still be interested in knowing how it'd be done with PHP functions.
I'm working on a website that uses a PHP tree as if it were a directory. For example, if someone types index.php?foo=visual programming (or index.php?foo=visual%20programming) then the website opens the item "Visual Programming" (I'm using strtolower()).
Another working example would be index.php?foo=visual programming&bar=animated path finder which opens "Animated Path Finder", a child of "Visual Programming".
The problem is that some of the items are named things like "Conway’s Game of Life" which uses a HTML entity. My guess of what someone should type to open this would be index.php?foo=visual%20programming&bar=conway%27s%20game%20of%20life. The problem is that ' is not === to ’.
What do I need to do to make this work? Here is my code that selects an item based on $_GET (the PHP is inside of <script type="text/javascript">):
<?php
function echoActiveDirectory($inTree) {
// Compare $_GET with PhpTree
$itemId = 0;
foreach ($_GET as $name) {
if ($inTree->children !== null) {
foreach ($inTree->children as $child) {
if (strtolower($child->title) === strtolower($name)) {
$itemId = $child->id;
$inTree = $child;
break;
}
}
}
}
// Set jsItems[$itemId].selected(), it will be 0 if nothing was found
echo "\t\tjsItems[".$itemId."].selected();\n";
}
echo "// Results of PHP echoActiveDirectory(\$root)\n";
echoActiveDirectory($root);
?>
The website is a work in progress, but it can be tested here to see $_GET working: http://alexsimes.com/index.php

The hex code %27 (39 decimal) will never translate to ’, since it is a completely different entity (Wikipedia). It could be translated to &apos;, but PHP doesn't do that (although I don't know the reason for that).
Edit
While there is no standard for URL-encoding multibyte character sets, PHP will treat a string as just a set of bytes, and if those match an UTF-8 sequence, it will work:
php -r 'echo htmlentities(urldecode("%E2%80%99"), ENT_QUOTES|ENT_HTML401);'
should output
’

You can use html_entity_encode() and html_entity_decode() PHP functions to convert those characters to html entities or decode them back to desired characters before comparison.

You can try the htmlentities function to convert special characters to corresponding html entity. But in your case if data is already stored in db as html entity form, the data from $_GET parameter must be first passed through htmlentities before using it in your query.

Related

PHP - HTML Entities/Special Chars Confusion

I'm working with a database and comparing values to strings to then create new records. I've encountered an issue with a comparison with a database value that is stored in a $type variable - the offending value is:
<recordID>
In my PHP script I do a test to see if the database value = "":
if ($type == '<recordID>') {
// create new records etc
}
however I've just noticed this test is failing, and I'm assuming it's the "<" and ">" characters that is the issue. If I echo the $type variable I get this in the browser source view:
<contactID>
I can see the issue relates to htmlentities and html special chars but I haven't been able to work out the function to use to be able to get this comparison above to work.
You can use builtin for it
<?php
echo htmlspecialchars_decode($type);
?>

Extra linebreak when producing JSON with PHP

I have some chunks of HTML in my database that I put in a JSON this way, upon first page render:
foreach ($data as $key => $value) {
$output .= '
"'.$key.'" : "'.addSlashes($value).'",
';
}
usually it works fine, except in a few cases where I get a leading line feed like this below:
"html" : "
<div id=\"object_01\" class=\"object_textbox\" namn=\"object_01\" style=\"z-index:1;font-family:Arial;left:750.9776000976563px;top:30.979339599609375px;position:absolute;width:709.6444444656372px;height:327.6444444656372px;\"><div class=\"object_inner\" style=\"border:0px solid rgb(0,0,0);background-color:rgb(255,255,255);color:rgb(0,0,0);opacity:.8;font-size:28px;line-height:42px;padding:20px;\">“Behold my Beloved Son, in whom I am well pleased, in whom I have glorified my name—hear ye him.” And it came to pass, as they understood they cast their eyes up again towards heaven; and behold, they saw a Man descending out of heaven; and he was clothed in a white robe; and he came down and stood in the midst of them...</div></div><div id=\"object_02\" class=\"object_textbox\" namn=\"object_02\" style=\"z-index:2;font-family:Arial;position:absolute;top:309.9988098144531px;left:1269.9826049804688px;width:187.6444444656372px;height:48.64444446563721px;\"><div class=\"object_inner\" style=\"border:0px solid rgb(0,0,0);font-style:italic;font-weight:bold;\">3 Nephi 11:7-8</div></div>",
Which obviously breaks the entiry page since javascript doesn't support linefeeds inside strings.
Any idéa what might cause it, or a way around it?
All the data from the database is picked up from a page at form save, so the user never gets to manually insert that linefeed... I just use jQuery element.html() to pick it up so there shouldn't be a linefeed there...
Never build JSON "manually". Use the built in function:
<?php
$output = json_encode($data);
One of the reasons JSON is so popular (other than it being so compatible with JavaScript) is that it is a widely supported serialization scheme. Almost every language and platform now have libraries for working with it so that you can reliably serialize and deserialize your data in any environment for any other.
Once you're using json_encode, whatever ends up in that data is what was there before serialization. So if the line break is still there, it's not the serialization causing it--you'll need to find it in your other code/data.

Images not uploading when htmlentities has 'UTF-8' set

I have a form that, among other things, accepts an image for upload and sticks it in the database. Previously I had a function filtering the POSTed data that was basically:
function processInput($stuff) {
$formdata = $stuff;
$formdata = htmlentities($formdata, ENT_QUOTES);
return "'" . mysql_real_escape_string(stripslashes($formdata)) . "'";
}
When, in an effort to fix some weird entities that weren't getting converted properly I changed the function to (all that has changed is I added that 'UTF-8' bit in htmlentities):
function processInput($stuff) {
$formdata = $stuff;
$formdata = htmlentities($formdata, ENT_QUOTES, 'UTF-8'); //added UTF-8
return "'" . mysql_real_escape_string(stripslashes($formdata)) . "'";
}
And now images will not upload.
What would be causing this? Simply removing the 'UTF-8' bit allows images to upload properly but then some of the MS Word entities that users put into the system show up as gibberish. What is going on?
**EDIT: Since I cannot do much to change the code on this beast I was able to slap a bandaid on by using htmlspecialchars() rather than htmlentities() and that seems to at least leave the image data untouched while converting things like quotes, angle brackets, etc.
bobince's advice is excellent but in this case I cannot now spend the time needed to fix the messy legacy code in this project. Most stuff I deal with is object oriented and framework based but now I see first hand what people mean when they talk about "spaghetti code" in PHP.
function processInput($stuff) {
$formdata = $stuff;
$formdata = htmlentities($formdata, ENT_QUOTES);
return "'" . mysql_real_escape_string(stripslashes($formdata)) . "'";
}
This function represents a basic misunderstanding of string processing, one common to PHP programmers.
SQL-escaping, HTML-escaping and input validation are three separate functions, to be used at different stages of your script. It makes no sense to try to do them all in one go; it will only result in characters that are ‘special’ to any one of the processes getting mangled when used in the other parts of the script. You can try to tinker with this function to try to fix mangling in one part of the app, but you'll break something else.
Why are images being mangled? Well, it's not immediately clear via what path image data is going from a $_FILES temporary upload file to the database. If this function is involved at any point though, it's going to completely ruin the binary content of an image file. Backslashes removed and HTML-escaped... no image could survive that.
mysql_real_escape_string is for escaping some text for inclusion in a MySQL string literal. It should be used always-and-only when making an SQL string literal with inserted text, and not globally applied to input. Because some things that come in in the input aren't going immediately or solely to the database. For example, if you echo one of the input values to the HTML page, you'll find you get a bunch of unwanted backslashes in it when it contains characters like '. This is how you end up with pages full of runaway backslashes.
(Even then, parameterised queries are generally preferable to manual string hacking and mysql_real_escape_string. They hide the details of string escaping from you so you don't get confused by them.)
htmlentities is for escaping text for inclusion in an HTML page. It should be used always-and-only in the output templating bit of your PHP. It is inappropriate to run it globally over all your input because not everything is going to end up in an HTML page or solely in an HTML page, and most probably it's going to go to the database first where you absolutely don't want a load of < and & rubbish making your text fail to search or substring reliably.
(Even then, htmlspecialchars is generally preferable to htmlentities as it only encodes the characters that really need it. htmlentities will add needless escaping, and unless you tell it the right encoding it'll also totally mess up all your non-ASCII characters. htmlentities should almost never be used.)
As for stripslashes... well, you sometimes need to apply that to input, but only when the idiotic magic_quotes_gpc option is turned on. You certainly shouldn't apply it all the time, only when you detect magic_quotes_gpc is on. It is long deprecated and thankfully dying out, so it's probably just as good to bomb out with an error message if you detect it being turned on. Then you could chuck the whole processInput thing away.
To summarise:
At start time, do no global input processing. You can do application-specific validation here if you want, like checking a phone number is just numbers, or removing control characters from text or something, but there should be no escaping happening here.
When making an SQL query with a string literal in it, use SQL-escaping on the value as it goes into the string: $query= "SELECT * FROM t WHERE name='".mysql_real_escape_string($name)."'";. You can define a function with a shorter name to do the escaping to save some typing. Or, more readably, parameterisation.
When making HTML output with strings from the input or the database or elsewhere, use HTML-escaping, eg.: <p>Hello, <?php echo htmlspecialchars($name); ?>!</p>. Again, you can define a function with a short name to do echo htmlspecialchars to save on typing.

PHP HTML Entities

I want to display on screen data send by the user,
remembering it can contain dangerous code, it is the best to clean this data with html entities.
Is there a better way to do html entities, besides this:
$name = clean($name, 40);
$email = clean($email, 40);
$comment = clean($comment, 40);
and this:
$data = array("name", "email," "comment")
function confHtmlEnt($data)
{
return htmlentities($data, ENT_QUOTES, 'UTF-8');
}
$cleanPost = array_map('confHtmlEnt', $_POST);
if so, how, and how does my wannabe structure
for html entities look?
Thank you for not flaming the newb :-).
"Clean POST", the only problem is you might not know in what context will your data appear. I have a Chat server now that works via browser client and a desktop client and both need data in a different way. So make sure you save the data as "raw" as possible into the DB and then worry about filtering it on output.
Do not encode everything in $_POST/$_GET. HTML-escaping is an output-encoding issue, not an input-checking one.
Call htmlentities (or, usually better, htmlspecialchars) only at the point where you're taking some plain text and concatenating or echoing it into an HTML page. That applies whether the text you are using comes from a submitted parameter, or from the database, or somewhere else completely. Call mysql_real_escape_string only at the point you insert plain text into an SQL string literal.
It's tempting to shove all that escaping stuff in its own box at the top of the script and then forget about it. But text preparation really doesn't work like that, and if you pretend it does you'll find your database irreparably full of double-encoded crud, backslashes on your HTML page and security holes you didn't spot because you were taking data from a source other than the (encoded) parameters.
You can make the burden of remembering to mysql_real_escape_string go away by using mysqli's parameterised queries or another higher-level data access layer. You can make the burden of typing htmlspecialchars every time less bothersome by defining a shorter-named function for it, eg.:
<?php
function h($s) {
echo(htmlspecialchars($s, ENT_QUOTES));
}
?>
<h1> Blah blah </h1>
<p>
Blah blah <?php h($title); ?> blah.
</p>
or using a different templating engine that encodes HTML by default.
If you wish to convert the five special HTML characters to their equivalent entities, use the following method:
function filter_HTML($mixed)
{
return is_array($mixed)
? array_map('filter_HTML',$mixed)
: htmlspecialchars($mixed,ENT_QUOTES);
}
That would work for both UTF-8 or single-byte encoded string.
But if the string is UTF-8 encoded, make sure to filter out any invalid characters sequence, prior to using the filter_HTML() function:
function make_valid_UTF8($str)
{
return iconv('UTF-8','UTF-8//IGNORE',$str)
}
Also see: http://www.phpwact.org/php/i18n/charsets#character_sets_character_encoding_issues
You need to clean every element bevor displaying it. I do it usually with a function and an array like your secound example.
If you use a framework with a template engine, there is quite likely a possibility to auto-encode strings. Apart from that, what's simpler than calling a function and getting the entity-"encoded" string back?
Check out the filter libraries in php, in particular filter_input_array.
filter_input_array(INPUT_POST, FILTER_SANITIZE_SPECIAL_CHARS);

PHP form auto escaping posted data?

I have an HTML form POSTing to a PHP page.
I can read in the data using the $_POST variable on the PHP.
However, all the data seems to be escaped.
So, for example
a comma (,) = %2C
a colon (:) = %3a
a slash (/) = %2
so things like a simple URL of such as http://example.com get POSTed as http%3A%2F%2Fexample.com
Any ideas as to what is happening?
Actually you want urldecode. %xx is an URL encoding, not a html encoding. The real question is why are you getting these codes. PHP usually decodes the URL for you as it parses the request into the $_GET and $_REQUEST variables. POSTed forms should not be urlencoded. Can you show us some of the code generating the form? Maybe your form is being encoded on the way out for some reason.
See the warning on this page: http://us2.php.net/manual/en/function.urldecode.php
Here is a simple PHP loop to decode all POST vars
foreach($_POST as $key=>$value) {
$_POST[$key] = urldecode($value);
}
You can then access them as per normal, but properly decoded. I, however, would use a different array to store them, as I don't like to pollute the super globals (I believe they should always have the exact data in them as by PHP).
This shouldn't be happening, and though you can fix it by manually urldecode()ing, you will probably be hiding a basic bug elsewhere that might come round to bite you later.
Although when you POST a form using the default content-type ‘application/x-www-form-encoded’, the values inside it are URL-encoded (%xx), PHP undoes that for you when it makes values available in the $_POST[] array.
If you are still getting unwanted %xx sequences afterwards, there must be another layer of manual URL-encoding going on that shouldn't be there. You need to find where that is. If it's a hidden field, maybe the page that generates it is accidentally encoding it using urlencode() instead of htmlspecialchars(), or something? Putting some example code online might help us find out.

Categories