htmlspecialchars - different escaping for attributes compared to everything else?


I have been reading up on htmlspecialchars() for escaping user input and user input from the database. Before anyone says anything, yes, I am filtering on db input as well as using prepared statements with bindings. I am only concerned about securing the output.
I am confused as to when to use ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES. I came across the following excerpt while doing my research:
The second argument in the htmlspecialchars() call is ENT_COMPAT. I've
used that because it's a safe default: it will also escape
double-quote characters ". You only really need to do that if you're
outputting inside an HTML attribute (like <img src="<?php echo htmlspecialchars($img_path, ENT_COMPAT, 'UTF-8'); ?>">). You could use
ENT_NOQUOTES everywhere else.
I have found similar comments elsewhere as well. What is the purpose of converting single and/or double quotes for attributes yet not converting them elsewhere? The only thing I can think of is if you were adding actual html into the page for instance:
My variable is : <img src="somepic.jpg" alt="some text"> if you converted the double quotes here it would not render properly because of the escaped quotes. In the example given in the excerpt though I can't even think of an instance where any type of quote would be used.
Secondly, in this particular reference it says to use ENT_NOQUOTES everywhere else. Why? My personal thought process is telling me to use ENT_QUOTES everywhere and ENT_NOQUOTES if and only if the variable is an actual html attribute that requires them.
I've done lots of searching and reading, but still confused about all of this. My main goal is to secure output to the page so there is no html, php, js manipulation happening.

Just use ENT_QUOTES everywhere. PHP gives the option in case you need it, but 99% of the time you don't. Escaping the quotes unnecessarily is harmless.
htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
Because that code is too long to keep writing everywhere, wrap it in a tiny function.
function es($string) {
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
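To make the difference between the three quote flags concrete, here is a minimal sketch that runs the same string through each of them. The sample input is made up for illustration; the expected output in the comments assumes the default document type (ENT_HTML401).

```php
<?php
// Hypothetical sample input containing both kinds of quotes and markup characters.
$input = "Tom's \"quote\" <b> & more";

// ENT_NOQUOTES: only &, < and > are converted.
$noQuotes = htmlspecialchars($input, ENT_NOQUOTES, 'UTF-8');

// ENT_COMPAT: double quotes are converted as well, single quotes are left alone.
$compat = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES: both double and single quotes are converted.
$quotes = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $noQuotes, "\n"; // Tom's "quote" &lt;b&gt; &amp; more
echo $compat, "\n";   // Tom's &quot;quote&quot; &lt;b&gt; &amp; more
echo $quotes, "\n";   // Tom&#039;s &quot;quote&quot; &lt;b&gt; &amp; more
```

All three results display identically as text between tags; only the ENT_QUOTES result is also safe inside both double-quoted and single-quoted attribute values.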

Within HTML there are different contexts where different characters are considered special. For example, within a double-quoted attribute value, a literal double quote would be interpreted as the attribute value delimiter:
8.2.4.38 Attribute value (double-quoted) state
Consume the next input character:
↪ U+0022 QUOTATION MARK (")
Switch to the after attribute value (quoted) state.
↪ U+0026 AMPERSAND (&)
Switch to the character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").
↪ U+0000 NULL
Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.
↪ EOF
Parse error. Switch to the data state. Reconsume the EOF character.
↪ Anything else
Append the current input character to the current attribute's value.
In such a case the double quote needs to be encoded using a character reference. Single-quoted attribute values are similar, but there the first literal single quote is considered the end delimiter of the attribute value.
The same applies to the data context, i.e., outside a tag:
8.2.4.1 Data state
Consume the next input character:
↪ U+0026 AMPERSAND (&)
Switch to the character reference in data state.
↪ "<" (U+003C)
Switch to the tag open state.
↪ U+0000 NULL
Parse error. Emit the current input character as a character token.
↪ EOF
Emit an end-of-file token.
↪ Anything else
Emit the current input character as a character token.
As you can see, the only character that would be considered harmful with regard to Cross-Site Scripting is <, as it would switch to the tag open state. So it needs to be encoded using a character reference to avoid the injection of a tag.
However, it is also allowed to use character references instead of the literal characters even though they are not special in the corresponding context or even at all. For example, the following are equivalent:
<a href="http://example.com/">
<a href="&#x68;&#x74;&#x74;&#x70;&#x3A;&#x2F;&#x2F;example.com/">
So only certain special characters are really required to be encoded as character references depending on the context but it doesn’t harm to encode other characters that are special in other contexts as well.
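The context dependence described above can be sketched with a single-quoted attribute. The attacker-controlled value below is hypothetical; the point is only that a flag which leaves the apostrophe alone lets the value close the attribute, while ENT_QUOTES does not.

```php
<?php
// Hypothetical attacker-controlled value destined for a single-quoted attribute.
$value = "x' onmouseover='alert(1)";

// ENT_COMPAT leaves the apostrophe alone, so the value terminates the attribute early
// and the rest is parsed as a new attribute.
$unsafe = "<input value='" . htmlspecialchars($value, ENT_COMPAT, 'UTF-8') . "'>";

// ENT_QUOTES encodes the apostrophe, so the whole value stays inside the attribute.
$safe = "<input value='" . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . "'>";

echo $unsafe, "\n"; // <input value='x' onmouseover='alert(1)'>
echo $safe, "\n";   // <input value='x&#039; onmouseover=&#039;alert(1)'>
```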

Related

php is incorrectly converting &not in strings to ¬

I need to make up a simple string in PHP which is a string of data to be posted to another site.
The problem is that one of the fields is 'notify_url=..' and when I use that PHP takes the & in front of it and the not part to mean the logical operator AND NOT and converts it to a ¬ character:
$string = 'field1=1234&field2=this&notify_url=http';
prints as 'field1=1234&field2=this¬ify_url=http'
The encoding on my page is UTF-8.
I have tried creating the string with single quotes as well as double quotes. I have tried making the field names variables and concatenating them in, but it always produces the special character.
This is not being urlencoded because the string is meant to be hashed before the form is submitted to verify posted data.

PHP isn't doing that, it's your browser interpreting HTML entity notation. & has a special meaning in HTML as the start of an HTML entity, and &not happens to be a valid HTML entity. You need to HTML-encode characters with special meanings:
echo htmlspecialchars($string);
// field1=1234&amp;field2=this&amp;notify_url=http
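Since the question mentions that the string is hashed, it may help to keep the raw string and the displayed copy separate. This is a sketch only: the use of sha256 is an assumption for illustration, as the question does not say which hash is used.

```php
<?php
$string = 'field1=1234&field2=this&notify_url=http';

// Hash the raw, unescaped string. The sha256 algorithm is an assumption here.
$hash = hash('sha256', $string);

// Escape only the copy that is printed into the HTML page.
$display = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

echo $display, "\n"; // field1=1234&amp;field2=this&amp;notify_url=http
```

The browser turns &amp; back into & when rendering, so the page shows the original string while $string itself is never modified.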

Using Regular Expressions with user input in PHP

I was wondering if anybody knew how to get around this problem.
I am gathering user input from a HTML form which is then posted using htmlspecialchars into PHP to avoid issues when using quotes/etc...
However, I also want to run server-side validation checks on the data being gathered through regular expressions - though I'm not sure how to go about this.
So far, I have thought of decoding the htmlspecialchars - but because I am going to be using the strings straight away, this means that the code could break after I run this conversion. E.g.: let's say the user entered a double quote, ", into a field. This would be converted to &quot;. If I then decode this and use it in a variable, it could end up like $string = """;, which is going to give me issues.
Any advice on this would be greatly appreciated!

You seem to misunderstand the difference between data and how this data is altered to be parseable in a certain context.
A php string can contain any data. What is stored in this string is the "raw" form: the form in which we want to manipulate the data if needed.
In certain contexts, not all characters are valid. For example, in an HTML textarea, the < and > characters may not be used literally, because they are special characters. We still want to be able to use these characters. To use special characters in a context, we escape them. By escaping a special character, it loses its special meaning. In the context of an HTML textarea, the < character is escaped as the sequence &lt;. Unlike the < character, this escaped sequence does not have a special meaning in HTML, and thus if we send the following sequence to the browser, it knows how to parse that sequence and display the right thing: <textarea>&lt;</textarea>. When we talk about what data this textarea contains, we do not say that it contains &lt;, but instead we say that it contains <.
As you said, in a php script, in a double quoted string, the " character has a special meaning. This has only to do with parsing. PHP simply does not know how to parse a sequence $str = """;. If we would want to have the double quote in such a double quoted string, we would need to escape it. We escape a double quote in a php double quoted string by prepending it with a \. To make a string containing a single double quote, using the double quoted notation, you would write $str = "\"";.
However, none of this matters. You are taking input from an HTML form. When you click the submit button, the browser reads the current value of the textarea (which is already the decoded text, not the HTML source). The browser then encodes it in the way dictated by the form tag and sends it to the server. The server then decodes the blob of text back into its raw data form. That data is passed to PHP, and it is this form you will encounter in $_POST['myTextarea'].
In conclusion: If data is encoded, realize for which context it was encoded and decode it based on that context. You do not need to escape for php quoted strings, because you are working on internal strings. There is nothing to parse. Remind yourself that when you are going to use the data somewhere, that you should take care that all special characters in your data for that particular context are escaped.

I suppose that htmlspecialchars() function is called after posting the form to PHP. Simplest solution then will be to match against regular expression first and then do htmlspecialchars().
Also, if you have string encoded with htmlspecialchars(), after decoding with htmlspecialchars_decode(), PHP internal representation will be "\"", so you break nothing. There is big difference how you write strings by hand to PHP file and how PHP internally handle them. You really don't need to be bothered by this.
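The order suggested above (validate the raw value first, escape only when printing) might look like the following sketch. The variable $raw stands in for something like $_POST['myTextarea'], and the validation pattern is just an example rule, not a recommendation.

```php
<?php
// Hypothetical stand-in for $_POST['myTextarea']; it holds the raw, unescaped value.
$raw = 'He said "hi" <now>';

// Validate the raw value. This pattern (1 to 50 characters, no ASCII control
// characters) is only an example rule.
$isValid = preg_match('/^[^\x00-\x1F]{1,50}$/u', $raw) === 1;

// Escape only at output time, for the HTML context.
$html = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Decoding gives back exactly the raw value; quotes inside a string variable
// never need PHP source-code escaping.
$roundTrip = htmlspecialchars_decode($html, ENT_QUOTES);

echo $html, "\n"; // He said &quot;hi&quot; &lt;now&gt;
```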

What do the ENT_HTML5, ENT_HTML401, ... modifiers on html_entity_decode do?

Since php 5.4 html_entity_decode introduces four new flags, with a minimal explanation
ENT_HTML401 Handle code as HTML 4.01.
ENT_XML1 Handle code as XML 1.
ENT_XHTML Handle code as XHTML.
ENT_HTML5 Handle code as HTML 5.
I want to understand what are they for. In which cases are they significant?
My guess (though I may be wrong) is that each standard encodes some unusual characters that the others do not, and the flags exist to respect that.
My research: htmlentities has the same minimal explanation, with no examples too. I have googled with no luck.

I started wondering what behavior these constants have when I saw these constants at the htmlspecialchars page. The documentation was rubbish, so I started digging in the source code of PHP.
Basically, these constants affect whether certain entities are encoded or not (or decoded for html_entity_decode). The most obvious effect is whether the apostrophe (') is encoded to &#039; (for ENT_HTML401) or &apos; (for others). Similarly, it determines whether &apos; is decoded or not when using html_entity_decode. (&#039; is always decoded).
All usages can be found in ext/standard/html.c and its header file. From ext/standard/html.h:
#define ENT_HTML_DOC_HTML401 0
#define ENT_HTML_DOC_XML1 16
#define ENT_HTML_DOC_XHTML 32
#define ENT_HTML_DOC_HTML5 (16|32)
(replace ENT_HTML_DOC_ by ENT_ to get their PHP constant names)
I started looking for all occurrences of these constants, and can share the following on the behaviour of the ENT_* constants:
It affects which numeric entities will be decoded or not. For example, &#xFDD0; gets decoded to an unreadable/invalid character for ENT_HTML401, ENT_XHTML and ENT_XML1. For ENT_HTML5, however, this is considered an invalid character and hence it stays &#xFDD0;. (C function unicode_cp_is_allowed)
With ENT_SUBSTITUTE enabled, invalid code unit sequences for a specified character set are replaced with �. (does not depend on document type!)
With ENT_DISALLOWED enabled, code points that are disallowed for the specified document type are replaced with �. (does not depend on charset!)
With ENT_IGNORE, the same invalid code unit sequences from ENT_SUBSTITUTE are removed and no replacement is done (depends on choice of "document type", e.g. ENT_HTML5)
Disallow the numeric reference to a carriage return (U+000D, e.g. &#13;) for ENT_HTML5 (line 976)
ENT_XHTML shares the entity map with ENT_HTML401. The only difference is that &apos; will be converted to an apostrophe with ENT_XHTML while ENT_HTML401 does not convert it (see this line)
ENT_HTML401 and ENT_XHTML use exactly the same entity map (minus the difference from the previous point). ENT_HTML5 uses its own map. Others (currently ENT_XML1) have a very limited decoding map (&gt;, &amp;, &lt;, &apos;, &quot; and their numeric equivalents). (see C function unescape_inverse_map)
Note for the previous point: when only a few entities must be escaped (think of htmlspecialchars), all document types use the same map as ENT_XML1, except for ENT_HTML401. That one will not use &apos;, but &#039;.
That covers almost everything. I am not going to list all entity differences, instead I would like to point at https://github.com/php/php-src/tree/php-5.4.11/ext/standard/html_tables for some text files that contain the mappings for each type.
What ENT_* should I use for htmlspecialchars?
When using htmlspecialchars with ENT_COMPAT (default) or ENT_NOQUOTES, it does not matter which one you pick (see below). I saw some answers here on SO that boil down to this:
<input value="<?php echo htmlspecialchars($str, ENT_HTML5);?>" >
This is insecure. It will override the default value ENT_HTML401 | ENT_COMPAT which has as difference that HTML5 entities are used, but also that quotes are not escaped anymore! In addition, this is redundant code. The entities that have to be encoded by htmlspecialchars are the same for all ENT_HTML401, ENT_HTML5, etc.
Just use ENT_COMPAT or ENT_QUOTES instead. The latter also works when you use apostrophes for attributes (value='foo'). If you only have two arguments for htmlspecialchars, do not include the argument at all since it is the default (ENT_HTML401 is 0, remember?).
When you want to print something on the page (between tags, not attributes), it does not matter at all which one you pick as it will have equal effect. It is even sufficient to use ENT_NOQUOTES | ENT_HTML401 which equals to the numeric value 0.
See also below, about ENT_SUBSTITUTE and ENT_DISALLOWED.
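The pitfall with passing a document-type flag on its own can be checked with a short sketch. The sample string is made up; the comments show what each call returns.

```php
<?php
// Hypothetical sample containing a double quote and an apostrophe.
$str = 'a"b\'c';

// ENT_HTML5 alone carries no quote flag, so it behaves like ENT_NOQUOTES | ENT_HTML5:
// neither quote is escaped.
$html5Only = htmlspecialchars($str, ENT_HTML5, 'UTF-8');

// With ENT_QUOTES, the apostrophe becomes &#039; for ENT_HTML401 ...
$html401 = htmlspecialchars($str, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// ... and &apos; for ENT_HTML5.
$html5 = htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $html5Only, "\n"; // a"b'c
echo $html401, "\n";   // a&quot;b&#039;c
echo $html5, "\n";     // a&quot;b&apos;c
```

Combining the document type with an explicit quote flag, as in the last two calls, avoids silently losing quote escaping.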
What ENT_* should I use for htmlentities?
If your text editor or database is so crappy that you cannot include non-US-ASCII characters (e.g. UTF-8), you can use htmlentities. Otherwise, save some bytes and use htmlspecialchars instead (see above).
Whether you need to use ENT_HTML401, ENT_HTML5 or something else depends on how your page is served. When you have a HTML5 page (<!doctype html>), use ENT_HTML5. XHTML or XML? Use the corresponding ENT_XHTML or ENT_XML1. With no doctype or plain ol' HTML4, use ENT_HTML401 (which is the default when omitted).
Should I use ENT_DISALLOWED, ENT_IGNORE or ENT_SUBSTITUTE?
By default, when the input contains a byte sequence that is invalid for the given character set, an empty string is returned. To have a � in place of an invalid byte sequence, specify ENT_SUBSTITUTE (note that &#xFFFD; is emitted for non-UTF-8 charsets). When you specify ENT_IGNORE, though, the invalid sequences are silently dropped, even if you also specified ENT_SUBSTITUTE.
Invalid characters for a document type are substituted by the same replacement character (or its entity) above when ENT_DISALLOWED is specified. This happens regardless of having ENT_IGNORE set (which has nothing to do with invalid chars for doctypes).
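A small sketch of how invalid input is handled with and without these flags, using a made-up string with one byte (0xFF) that can never occur in valid UTF-8. Without either flag, htmlspecialchars returns an empty string for such input.

```php
<?php
// Hypothetical input: "a", an invalid UTF-8 byte, then "b".
$bad = "a\xFFb";

// No error-handling flag: the whole result is an empty string.
$default = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE: the invalid byte is replaced with U+FFFD (bytes EF BF BD in UTF-8).
$substituted = htmlspecialchars($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE: the invalid byte is silently dropped.
$ignored = htmlspecialchars($bad, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

var_dump($default);          // string(0) ""
var_dump(bin2hex($substituted)); // string(10) "61efbfbd62"
var_dump($ignored);          // string(2) "ab"
```

ENT_SUBSTITUTE is usually the more defensible choice of the two, because silently dropping bytes can join text that was never meant to be adjacent.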

If a URL contains a quote how do you specify the rel=canonical value?

Say the path of your URL is:
/thisisa"quote/helloworld/
Then how do you create the rel=canonical URL?
Is this kosher?
<link rel="canonical" href="/thisisa&quot;/helloworld/" />
UPDATE
To clarify, I'm getting a form submission, I need to convert part of the query string into the URL. So the steps are:
.htaccess does the redirect
PHP processes a directory as a query string.
The query string will be dynamically inserted into the:
Title,
Description,
Keywords
Canonical URL.
Spit back into the form's input box
So I need to know which processing has to be done each step of the way...On the first cut, this is my take:
Title: htmlspecialchars($rawQuery)
Description: htmlspecialchars($rawQuery)
Keywords: htmlspecialchars($rawQuery)
Canonical URL: This is the tricky part. It must match the same URL .htaccess redirects to but even so, I think the raw query is unsafe because quotes can cause JavaScript injection. Worried about urlencode($rawquery) since it's coming from the URL, wouldn't it already be URL-encoded?
Spit back into form: htmlspecialchars($rawQuery)

You have to split your question into two:
Do I need to encode the double quotation mark character in the URL path?
Yes, the quotation mark character (U+0022) is not allowed literally in a URL and must be encoded as %22.
Do I need to encode the double quotation mark character in a HTML attribute value?
It depends on how you declare the attribute value:
By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.
If you’re using the double quotation mark character to declare the attribute value (attr="value"), then you must encode the double quotation mark character inside the attribute value declaration with a character reference (&#34;, &#x22; or &quot;).
If you’re using the single quotation mark character (U+0027) for your attribute value declaration (attr='value'), then you don’t need to encode the quotation mark character. But it’s recommended to do so.
And since you have slash and a double quotation mark in your attribute value, the third case (using no quotes at all) is not applicable:
In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
Now bringing both answers together
Since a double quotation mark must be encoded in a URL (whereas the single quotation mark is allowed as-is), you can use the following to encode the path segments of your URL path:
$path = '/thisisa"quote/helloworld/';
$path = implode('/', array_map('rawurlencode', explode('/', $path)));
And if you want to put that URL path in a HTML attribute, use the htmlspecialchars function to encode remaining special HTML characters:
echo '<link rel="canonical" href="' . htmlspecialchars($path) . '" />';
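Putting the two steps together, a sketch might look like this. The helper name canonical_link is hypothetical, and the function assumes it receives a decoded path; a path that is already percent-encoded would be encoded twice (% becomes %25).

```php
<?php
// Hypothetical helper: builds a canonical <link> tag from a decoded URL path.
function canonical_link(string $path): string
{
    // Step 1: percent-encode each path segment, keeping the slashes as separators.
    $encodedPath = implode('/', array_map('rawurlencode', explode('/', $path)));

    // Step 2: escape for the HTML attribute context.
    return '<link rel="canonical" href="'
        . htmlspecialchars($encodedPath, ENT_QUOTES, 'UTF-8')
        . '" />';
}

echo canonical_link('/thisisa"quote/helloworld/'), "\n";
// <link rel="canonical" href="/thisisa%22quote/helloworld/" />
```

After percent-encoding a path, htmlspecialchars has nothing left to change here; it starts to matter once a query string with & separators is appended.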

Use URL escaping, in this case %22
http://everything2.com/title/URL+escape+sequences

A quote is not even a valid URL character, so I think long-term you should address this. It is specifically excluded from the URI syntax by RFC 2396.
To solve the immediate problem though, you'll need to escape the character, using %22.

If the URL contains a double quote then contain it with single quotes.
<link rel="canonical" href='foo.com/thisisa"/helloworld/' />
Do not use HTML encoding in URI strings. That is invalid syntax as the ampersand must be encoded in URIs since it is a function special character. Instead always use percent encoding for URIs.

I would say you want to use the HEX value for a quote which is %22.
Read this to learn more about URL Encoding.

Escaping double quotes in a value for a sticky form in PHP

I'm having a little bit of trouble making a sticky form that will remember what is entered in it on form submission if the value has double quotes. The problem is that the HTML is supposed to read something like:
<input type="text" name="something" value="Whatever value you entered" />
However, if the phrase: "How do I do this?" is typed in with quotes, the resulting HTML is similar to:
<input type="text" this?="" do="" i="" how="" value="" name="something"/>
How would I have to filter the double quotes? I've tried it with magic quotes on and off, I've used stripslashes and addslashes, but so far I haven't come across the right solution. What's the best way to get around this problem for PHP?

You want htmlentities().
<input type="text" value="<?php echo htmlentities($myValue); ?>">

The above will encode all sorts of characters that have html entity code. I prefer to use:
htmlspecialchars($myValue, ENT_QUOTES, 'utf-8');
This will only encode:
'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'
You could also do a strip_tags on the $myValue to remove html and php tags.
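Applied to the sticky form from the question, a minimal sketch looks like this. The variable $submitted stands in for the posted value (for example $_POST['something']).

```php
<?php
// Hypothetical stand-in for the submitted value, e.g. $_POST['something'].
$submitted = '"How do I do this?"';

// Escape the value for the attribute context before echoing it back into the form.
$field = '<input type="text" name="something" value="'
    . htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8')
    . '" />';

echo $field, "\n";
// <input type="text" name="something" value="&quot;How do I do this?&quot;" />
```

The browser decodes &quot; when it builds the field, so the user sees the original quotes and the same raw text is posted back on the next submission.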

This is what I use:
htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE | ENT_DISALLOWED | ENT_HTML5, 'UTF-8')
ENT_QUOTES tells PHP to convert both single and double quotes, which I find desirable.
ENT_SUBSTITUTE and ENT_DISALLOWED deal with invalid Unicode. They're quite similar - as far as I understand, the first substitutes invalid code unit sequences, i.e. invalidly encoded characters or sequences that do not represent characters, while the second substitutes invalid code points for the given document type, i.e. characters which are not allowed for the document type specified (or the default if not explicitly specified). The documentation is undesirably laconic on them.
ENT_HTML5 is the document type I use. You can use a different one, but it should match your page doctype.
UTF-8 is the encoding of my document. I suggest that, unless you are absolutely sure you're using PHP 5.4.0 or later, you explicitly specify the encoding - especially if you'll be dealing with non-English text. A host I do some work on uses 5.2.something, which defaults to ISO-8859-1 and produces gibberish.
As thesmart suggests, htmlspecialchars encodes only reserved HTML characters while htmlentities converts everything that has an HTML representation. In most contexts either will do the job. Here is a discussion on the subject.
One more thing: it is a best practice to keep magic quotes disabled since they give a false sense of security and are deprecated in 5.3.0 and removed from 5.4.0. If they are enabled, each quote in your fields will be prepended by a backslash on postback (and multiple postbacks will add more and more slashes). I see that the OP is able to change the setting, but for future references: if you are on a shared host or otherwise don't have access to php.ini, the easiest way is to add
php_flag magic_quotes_gpc Off
to the .htaccess file.
