I was wondering if anybody knew how to get around this problem.
I am gathering user input from an HTML form, which is then posted into PHP and passed through htmlspecialchars to avoid issues with quotes and similar characters.
However, I also want to run server-side validation checks on the data being gathered through regular expressions - though I'm not sure how to go about this.
So far, I have thought of decoding the htmlspecialchars output, but because I am going to be using the strings straight away, I worry that the code could break after I run this conversion. For example, let's say the user entered a double quote, ", into a field. This would be converted to &quot;. If I then decode this and use it in a variable, it could end up like $string = """; which is going to give me issues.
Any advice on this would be greatly appreciated!
You seem to misunderstand the difference between data and how this data is altered to be parseable in a certain context.
A php string can contain any data. What is stored in this string is the "raw" form: the form in which we want to manipulate the data if needed.
In certain contexts, not all characters are valid. For example, in an HTML textarea, the < and > characters may not be used literally, because they are special characters. We still want to be able to use these characters. To use special characters in a context, we escape them. By escaping a special character, it loses its special meaning. In the context of an HTML textarea, the < character is escaped as the sequence &lt;. Unlike the < character, this escaped sequence does not have a special meaning in HTML, and thus if we send the following sequence to the browser, it knows how to parse that sequence and display the right thing: <textarea>&lt;</textarea>. When we talk about the data this textarea contains, we do not say that it contains &lt;; we say that it contains <.
As you said, in a PHP script, in a double-quoted string, the " character has a special meaning. This has to do only with parsing: PHP simply does not know how to parse the sequence $str = """;. If we want a double quote inside a double-quoted string, we need to escape it. We escape a double quote in a PHP double-quoted string by prepending a \. To make a string containing a single double quote, using the double-quoted notation, you would write $str = "\"";.
However, none of this matters here. You are taking input from an HTML form. When you click the submit button, the browser reads what is in the textarea (and, I believe, decodes it as HTML). The browser then encodes it in the way dictated by the form tag and sends it to the server. The server then decodes the blob of text back into its raw data form. That data is passed to PHP, and it is this form you will encounter in $_POST['myTextarea'].
In conclusion: if data is encoded, work out for which context it was encoded and decode it based on that context. You do not need to escape for PHP quoted strings, because you are working on internal strings; there is nothing to parse. Remember that whenever you use the data somewhere, all characters that are special in that particular context must be escaped.
I assume that htmlspecialchars() is called after the form has been posted to PHP. The simplest solution is then to match against the regular expression first and call htmlspecialchars() afterwards.
Also, if you have a string encoded with htmlspecialchars(), then after decoding it with htmlspecialchars_decode() the value PHP holds is the same as the literal "\"", so you break nothing. There is a big difference between how you write strings by hand in a PHP file and how PHP handles them internally. You really don't need to worry about this.
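A minimal sketch of that ordering. The field name `username`, the pattern, and both helper function names are assumptions made up for illustration; a literal string stands in for `$_POST['username']`.

```php
<?php
// Hypothetical rule for illustration: 3-20 letters, digits, or underscores.
function is_valid_username(string $raw): bool
{
    return preg_match('/\A[A-Za-z0-9_]{3,20}\z/', $raw) === 1;
}

// Escape only at the point where the value is written into HTML.
function html_escape(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// In a real request this would be $_POST['username'].
$raw = 'bob"<script>';

$valid = is_valid_username($raw);  // validation sees the raw characters
$safe  = html_escape($raw);        // escaping happens only for display

echo $valid ? "valid\n" : "invalid\n";
echo $safe, "\n";
```

The point of the split is that the regular expression never has to know about entities such as `&quot;`, because it runs before any HTML escaping takes place.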
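The round trip can be checked directly. This is only a small sketch with made-up variable names; it shows that the decoded value is a one-character string and that no PHP quoting is involved at run time.

```php
<?php
$raw     = '"';  // one character: a double quote
$encoded = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$decoded = htmlspecialchars_decode($encoded, ENT_QUOTES);

var_dump($encoded);           // string(6) "&quot;"
var_dump($decoded === $raw);  // bool(true)
var_dump(strlen($decoded));   // int(1)
```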
Related
In PHP, is it at all possible to output the contents of a string to show any escaped characters that may be contained within the string? I get that the whole point of escaping characters is so that they aren't treated in the usual way. But I would still like to be able to view the raw contents of a string so I can see for myself exactly how characters like \n and \r, etc. are represented. Does PHP have a method for doing this?
Use json_encode() to encode the string as JSON. The JSON notation for strings (which comes from JavaScript) uses largely the same backslash escapes as PHP double-quoted strings. Both JavaScript and PHP were inspired by C and copied the notation of string literals from it.
If you use single quotation marks, it should do what you need.
For example, echo 'this\n'; will output this\n, whereas echo "this\n"; will output this followed by a newline.
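As a rough illustration of the json_encode() approach (the sample text is made up):

```php
<?php
// A string containing a carriage return, a line feed, and a tab.
$text = "first line\r\nsecond\tline";

// json_encode() writes control characters back out as visible escapes.
$visible = json_encode($text);
echo $visible, "\n";  // "first line\r\nsecond\tline"
```

Note that by default json_encode() also escapes forward slashes as `\/` and non-ASCII characters as `\uXXXX`, so the output is a JSON view of the string rather than an exact copy of a PHP literal.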
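The difference between the two quoting styles shows up in the string lengths. A small sketch:

```php
<?php
$single = 'this\n';  // backslash and n stay as two literal characters
$double = "this\n";  // \n becomes a single line-feed character

var_dump(strlen($single));  // int(6)
var_dump(strlen($double));  // int(5)
```

This only affects how a literal in the source file is parsed. A string that already contains a real line feed at run time is not changed by echoing it from a single-quoted context.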
I need to make up a simple string in PHP which is a string of data to be posted to another site.
The problem is that one of the fields is 'notify_url=..', and when I use that, PHP seems to take the & in front of it together with the "not" part to mean the logical operator AND NOT, and converts &not into a ¬ character:
$string = 'field1=1234&field2=this&notify_url=http';
prints as 'field1=1234&field2=this¬ify_url=http'
The encoding on my page is UTF-8.
I have tried creating the string with single quotes as well as double quotes. I have tried making the field names variables and concatenating them in, but it always produces the special character.
This is not being urlencoded because the string is meant to be hashed before the form is submitted to verify posted data.
PHP isn't doing that; it's your browser interpreting HTML entity notation. & has a special meaning in HTML as the start of an HTML entity, and &not happens to be a valid HTML entity. You need to HTML-encode characters with special meanings:
echo htmlspecialchars($string);
// field1=1234&amp;field2=this&amp;notify_url=http
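Since the question mentions hashing the string, one way to keep the two concerns apart is to hash the raw value and escape only the copy that is echoed into the page. This is a sketch; the choice of SHA-256 is an assumption, as the question does not say which hash is used.

```php
<?php
$string = 'field1=1234&field2=this&notify_url=http';

// Hash the raw string. The algorithm is an assumption for illustration.
$hash = hash('sha256', $string);

// Escape only the copy that is written into an HTML page.
$display = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
echo $display, "\n";  // field1=1234&amp;field2=this&amp;notify_url=http
```

The browser turns `&amp;` back into `&` when rendering, so the user sees the original text while `$string` itself is never modified.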
I have been reading up on htmlspecialchars() for escaping user input and user input from the database. Before anyone says anything, yes, I am filtering on db input as well as using prepared statements with bindings. I am only concerned about securing the output.
I am confused as to when to use ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES. I came across the following excerpt while doing my research:
The second argument in the htmlspecialchars() call is ENT_COMPAT. I've
used that because it's a safe default: it will also escape
double-quote characters ". You only really need to do that if you're
outputting inside an HTML attribute (like <img src="<?php echo htmlspecialchars($img_path, ENT_COMPAT, 'UTF-8'); ?>">). You could use
ENT_NOQUOTES everywhere else.
I have found similar comments elsewhere as well. What is the purpose of converting single and/or double quotes for attributes yet not converting them elsewhere? The only thing I can think of is if you were adding actual html into the page for instance:
My variable is : <img src="somepic.jpg" alt="some text"> if you converted the double quotes here it would not render properly because of the escaped quotes. In the example given in the excerpt though I can't even think of an instance where any type of quote would be used.
Secondly, in this particular reference it says to use ENT_NOQUOTES everywhere else. Why? My personal thought process is telling me to use ENT_QUOTES everywhere and ENT_NOQUOTES if and only if the variable is an actual html attribute that requires them.
I've done lots of searching and reading, but still confused about all of this. My main goal is to secure output to the page so there is no html, php, js manipulation happening.
Just use ENT_QUOTES everywhere. PHP gives the option in case you need it, but 99% of the time you don't. Escaping the quotes unnecessarily is harmless.
htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
Because that code is too long to keep writing everywhere, wrap it in a tiny function.
function es($string) {
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
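To see why ENT_QUOTES is the safer default, here is a sketch comparing it with ENT_COMPAT for a value placed in a single-quoted attribute. The hostile input is made up for illustration, and the wrapper is repeated (with type hints added) so the snippet runs on its own.

```php
<?php
function es(string $string): string
{
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}

// Hypothetical hostile input aimed at a single-quoted attribute.
$input = "x' onmouseover='alert(1)";

$compat = htmlspecialchars($input, ENT_COMPAT, 'UTF-8');  // leaves ' untouched
$quotes = es($input);                                     // encodes ' as well

echo "<input value='", $quotes, "'>\n";
```

With ENT_COMPAT the single quote survives and would close the attribute early; with ENT_QUOTES it is replaced by a character reference and stays inside the value.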
Within HTML there are different contexts in which different characters are considered special. For example, within a double-quoted attribute value, a literal double quote would be interpreted as the attribute value delimiter:
8.2.4.38 Attribute value (double-quoted) state
Consume the next input character:
↪ U+0022 QUOTATION MARK (")
Switch to the after attribute value (quoted) state.
↪ U+0026 AMPERSAND (&)
Switch to the character reference in attribute value state, with the additional allowed character being U+0022 QUOTATION MARK (").
↪ U+0000 NULL
Parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.
↪ EOF
Parse error. Switch to the data state. Reconsume the EOF character.
↪ Anything else
Append the current input character to the current attribute's value.
In such a case the double quote needs to be encoded using a character reference. Single-quoted attribute values are similar, but there the first literal single quote is considered the end delimiter of the attribute value.
The same applies to the data context, i.e., outside a tag:
8.2.4.1 Data state
Consume the next input character:
↪ U+0026 AMPERSAND (&)
Switch to the character reference in data state.
↪ "<" (U+003C)
Switch to the tag open state.
↪ U+0000 NULL
Parse error. Emit the current input character as a character token.
↪ EOF
Emit an end-of-file token.
↪ Anything else
Emit the current input character as a character token.
As you can see, the only character that would be considered harmful with regard to Cross-Site Scripting is <, as it would switch to the tag open state. So it needs to be encoded using a character reference to avoid the injection of a tag.
However, it is also allowed to use character references instead of the literal characters even though they are not special in the corresponding context or even at all. For example, the following are equivalent:
<a href="http://example.com/">
<a href="http&#x3A;//example.com/">
So only certain special characters really need to be encoded as character references, depending on the context, but it does no harm to also encode characters that are special in other contexts.
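The equivalence of a literal character and its character reference can be checked from PHP with html_entity_decode(). This is a sketch; the encoded URL below is a hypothetical example rather than one taken from the specification.

```php
<?php
$literal = 'http://example.com/';

// Hypothetical attribute value with ':' and '/' written as character references.
$encoded = 'http&#x3A;&#x2F;&#x2F;example.com&#x2F;';

$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === $literal);  // bool(true)
```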
I'm running quoted_printable_decode() on HTML content that is stored in DB and has a lot of these types of characters =C5=DD= etc..
However, I also have this string in the HTML which I did not mean to replace:
link
Since it has =b in it, it replaces it as well.
Is there any way to avoid this?
Encode the = as =3D, which is the equivalent in Quoted Printable.
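A sketch of the effect, using made-up markup because the link from the question is not shown in full. PHP's quoted_printable_encode() produces the `=3D` form, which then survives decoding.

```php
<?php
// Hypothetical markup: "=ba" looks like a quoted-printable escape.
$html = '<a href=bar.html>link</a>';

// Decoding the raw markup corrupts it, because "=ba" is read as a byte.
$broken = quoted_printable_decode($html);

// Encoding first turns "=" into "=3D", so the round trip is lossless.
$encoded = quoted_printable_encode($html);
$decoded = quoted_printable_decode($encoded);

echo $encoded, "\n";  // <a href=3Dbar.html>link</a>
var_dump($decoded === $html);  // bool(true)
```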
I am using json_encode() in PHP to output data.
When it gets to this word: Æther it outputs \u00c6ther.
Anyone know of a way to make json output that character or am I going to have to change the text to not have that character in it?
That's the Unicode escape for the character. JavaScript should handle it properly. You'll notice the backslash before it, which means that it's an escape sequence. The u indicates that it's a Unicode code point, and the hex digits identify the actual character.
See here for some more info.
That is working as specified. The RFC ( http://www.ietf.org/rfc/rfc4627.txt ) indicates that any character may be escaped, and your average printable character can be written in the \uXXXX format.
Any JSON parser that cannot understand a character escaped in that way is not compliant with the standard. Work on resolving that problem rather than trying to coax PHP into misbehaving as well.
(It is legal to put UTF-8 characters into JSON strings without escaping them as well, with a few exceptions, but the safe approach of escaping anything questionable is wise.)
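For reference, a sketch showing that both forms decode to the same string. If literal output is still wanted, json_encode() accepts the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4); the word is written with a `\u{...}` escape here (PHP 7+) so the example does not depend on the file's encoding.

```php
<?php
$word = "\u{C6}ther";  // "Æther"

$escaped   = json_encode($word);                          // default behaviour
$unescaped = json_encode($word, JSON_UNESCAPED_UNICODE);  // literal UTF-8

echo $escaped, "\n";    // "\u00c6ther"
echo $unescaped, "\n";  // "Æther"

// A compliant parser yields the same string from either form.
var_dump(json_decode($escaped) === json_decode($unescaped));  // bool(true)
```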