I'm sanitizing USER_AGENT for logging in PHP and need to know whether to use substr() or mb_strcut().
Seeing how USER_AGENT is directly derived from the HTTP request header User-Agent, I'm going to assume you're interested in HTTP headers.
Is it possible that HTTP headers will contain bytes outside the 7-bit ASCII range? Yes.
Is it likely that you'll actually see this in practice and need to handle it properly? I'd say no.
Therefore I suggest a third option: first strip all non-ASCII characters from the string, then use regular multibyte-unsafe functions to your heart's content.
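For example, a minimal sketch of that approach (the ASCII range and the 512-byte cap are just illustrative assumptions, not anything from your code):
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// Keep only printable 7-bit ASCII (space through tilde); drop everything else.
$clean = preg_replace('/[^\x20-\x7E]/', '', $userAgent);
// With non-ASCII gone, the plain byte-based substr() is safe for truncation.
$logValue = substr($clean, 0, 512); // 512 is an arbitrary length cap for the log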
I've been coding in PHP for a while, and this is the first time I came across this issue.
My goal is to pass a GET variable (a URL) without encoding or decoding it, which means that "%2F" will not turn into "/" and vice versa. The reason for that is that I'm passing this variable to a 3rd-party website and the variable must stay exactly the way it is.
Right now what's happening is that this URL (passed as a GET variable): http://example.com/something%2Felse turns into http://example.com/something/else.
How can I prevent php from encoding what's passed in GET?
Apache denies all URLs with %2F in the path part, for security reasons: scripts can't normally (i.e. without rewriting) tell the difference between %2F and / because the PATH_INFO environment variable is automatically URL-decoded (which is stupid, but a long-standing part of the CGI specification, so nothing can be done about it).
You can turn this feature off using the AllowEncodedSlashes directive, but note that other web servers will still disallow it (with no option to turn that off), and that other characters may also be taboo (eg. %5C), and that %00 in particular will always be blocked by both Apache and IIS. So if your application relied on being able to have %2F or other characters in a path part you'd be limiting your compatibility/deployment options.
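If you do control the Apache config, the directive looks roughly like this (a sketch only; the NoDecode value needs Apache 2.2.18 or later, and the vhost details are placeholders):
<VirtualHost *:80>
    # Let %2F stay in the path without being decoded to a literal slash
    AllowEncodedSlashes NoDecode
</VirtualHost>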
I am using urlencode() while preparing the search URL
You should use rawurlencode(), not urlencode(), for escaping path parts. urlencode() is misnamed: it is actually for application/x-www-form-urlencoded data, such as in the query string or the body of a POST request, and not for other parts of the URL.
The difference is that + doesn't mean space in path parts. rawurlencode() will correctly produce %20 instead, which will work both in form-encoded data and other parts of the URL.
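A quick illustration of the difference (the sample string is just for the demo):
echo urlencode('a path/with spaces');    // a+path%2Fwith+spaces
echo rawurlencode('a path/with spaces'); // a%20path%2Fwith%20spaces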
Hex (base16) percent-encoding is part of the HTTP protocol; you can't prevent it, or it would break the actual HTTP request to the server.
Use:
urlencode() to encode
urldecode() to decode
Please show an actual example of how you are sending the url to the 3rd party.
It should read http%3A%2F%2Fexample.com%2Fsomething%2Felse, not just the odd %2F as in your example.
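For instance, assuming you build the query string by hand (the third-party host here is a placeholder):
$target = 'http://example.com/something/else';
// Encode the whole value before putting it into the query string;
// PHP decodes it back into $_GET['url'] on the receiving side.
$link = 'http://third-party.example/search?url=' . urlencode($target);
// gives ...?url=http%3A%2F%2Fexample.com%2Fsomething%2Felse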
I have a cookie which I am setting based on data in a field in Drupal. When I write the cookie using PHP, the extended ASCII characters are shown as hex codes (e.g. %7E), but if I write a similar cookie with JavaScript then the extended ASCII characters are shown in the cookie as single characters (e.g. ~).
This is the string I want in my cookie.
Section1~email,email.calendar,calendar.wordpresssml,wordpress.moodlesml,moodle.maharasml,mahara.gdrive,gdrive.eportfolio,eportfolioblogs.wiki,wiki.youtube,email.feature,feature|Section2~reader,reader|
If I use
setcookie("p", "Section1~email,email.calendar,calendar.wordpresssml,wordpress.moodlesml,moodle.maharasml,mahara.gdrive,gdrive.eportfolio,eportfolioblogs.wiki,wiki.youtube,email.feature,feature|Section2~reader,reader|", $expire);
I get Section1%7Eemail%2Cemail.calendar%2Ccalendar.wordpresssml%2Cwordpress.moodlesml%2Cmoodle.maharasml%2Cmahara.gdrive%2Cgdrive.eportfolio%2Ceportfolioblogs.wiki%2Cwiki.youtube%2Cemail.feature%2Cfeature%7CSection2%7Ereader%2Creader%7C
rather than the string I want. If I write the cookie using JavaScript it works fine. I know this is an encoding issue but I would really like PHP to write the cookie using the full set of Extended ASCII characters.
Why is this a problem? You have to remember that this will appear in the Set-Cookie HTTP header and by not encoding it you will always have to remember what the special characters are and avoid using them. With encoding, you don't have to worry about that.
With PHP, when you do $_COOKIE['p'] it will appear as you originally intended. If you want to extract it in Javascript using document.cookie or something, then you can use decodeURIComponent(cookieValue).
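A minimal sketch of that round trip (the shortened $value and the one-hour expiry are placeholders):
$value  = 'Section1~email,calendar|Section2~reader|'; // short stand-in for your full string
$expire = time() + 3600;                              // assumed one-hour lifetime
setcookie('p', $value, $expire);   // the Set-Cookie header will show %7E, %2C, %7C, ...
// On the next request PHP decodes it for you automatically:
echo $_COOKIE['p'];                // prints the original string with ~ , | intact
// In JavaScript, decode the raw cookie value yourself, e.g. with decodeURIComponent(...)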
I've done a lot of research on this and a lot of testing.
As I understand it, HTTP headers are only set if the web server is set up to do so, and may default to a particular encoding even if the developers didn't intend this. Meta headers are only set if the developer decided to do so in their code... this may also be set automatically by some development frameworks (which is problematic if the developer didn't consider this).
I've found that if these are set at all, they often conflict with each other. eg. the HTTP header says the page is iso-8859-1 while the meta tag specifies windows-1252. I could assume one supersedes the other ( likely the meta tag ), but that seems fairly unreliable. It also seems like very few developers consider this when dealing with their data, so dynamically generated sites are often mixing encodings or using encodings that they don't intend to via different encodings coming from their database.
My conclusion has been to do the following:
Check the encoding of every page using mb_detect_encoding().
If that fails, I use the meta encoding ( http-equiv="Content-Type"... ).
If there is no meta content-type, I use the HTTP headers ( content_type ).
If there is no http content-type, I assume UTF-8.
Finally, I convert the document using mb_convert_encoding(). Then I scrape it for content. ( I've purposely left out the encoding to convert to, to avoid that discussion here. )
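Roughly, the cascade in code looks like this (get_meta_charset() and get_header_charset() are placeholder helpers I'd implement with regexes over the HTML and the Content-Type header; UTF-8 as the target is only there to make the sketch complete, since I've left the real target encoding out of the question):
$html = file_get_contents($url);
$encoding = mb_detect_encoding($html, mb_detect_order(), true); // strict detection
if ($encoding === false) {
    $encoding = get_meta_charset($html);          // placeholder: parse <meta http-equiv="Content-Type" ...>
}
if (empty($encoding)) {
    $encoding = get_header_charset($http_response_header); // placeholder: parse the HTTP Content-Type header
}
if (empty($encoding)) {
    $encoding = 'UTF-8';                          // last-resort assumption
}
$converted = mb_convert_encoding($html, 'UTF-8', $encoding);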
I'm attempting to get as much accurate content as possible, and not just ignore webpages because the developers didn't set their headers properly.
What problems do you see with this approach?
Am I going to run into problems using the mb_detect_encoding() and mb_convert_encoding() methods?
Yes, you will run into problems. mb_detect_encoding is not really reliable, see these examples:
This outputs bool(false), indicating that detection failed:
var_dump(mb_detect_encoding(file_get_contents('http://www.pazaruvaj.com/')));
This other one outputs string(5) "UTF-8", which is obviously an incorrect result. HTTP headers and http-equiv are correctly set on this website, and it's not valid UTF-8:
var_dump(mb_detect_encoding(file_get_contents('http://www.arukereso.hu/')));
I suggest you apply all available methods and also make use of external libraries (like this one: http://mikolajj.republika.pl/) and use the most probable encoding.
Another approach to make it more precise is to build a country-specific list of possible character sets and use only those with mb_convert_encoding. In Hungary, for example, ISO-8859-2 or UTF-8 is most probable; others are not really worth considering. The country can be guessed from the combination of TLD, the Content-Language HTTP header and IP address location. Although this requires some research work and extra development, it might be worth the effort.
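A sketch of that idea, using the Hungarian candidates above as the assumed whitelist (the strict third argument makes mb_detect_encoding reject near-matches):
$candidates = array('UTF-8', 'ISO-8859-2');            // assumed country-specific list
$encoding   = mb_detect_encoding($html, $candidates, true);
if ($encoding !== false) {
    $html = mb_convert_encoding($html, 'UTF-8', $encoding);
} else {
    // none of the expected charsets matched; fall back to another strategy
}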
Some comments under the documentation of mb_convert_encoding report that iconv works better for Japanese character sets.
Never trust the input. But is that also true for the character encoding? Is it good practice to check the encoding of the string received, to avoid unexpected errors? Some people use preg_match to check for invalid strings. Others validate it byte by byte. And others normalize using iconv. What is the fastest and safest way to do this check?
edit
I noticed that if I try to save a corrupted UTF-8 string in my MySQL database, the string will be truncated without warning. Are there countermeasures for this eventuality?
Is it good practice to check the encoding of the string received, to avoid unexpected errors?
No. There is no reliable way to detect the incoming data's encoding*, so the common practice is to define which encoding is expected:
If you are exposing an API of some sort, or a script that gets requests from third party sites, you will usually point out in the documentation what encoding you are expecting.
If you have forms on your site that are submitted to scripts, you will usually have a site-wide convention of which character set is used.
The possibility that broken data comes in is always there, if the declared encoding doesn't match the data's actual encoding. In that case, your application should be designed so there are no errors except that a character gets displayed the wrong way.
Looking at the encoding that the request declares the incoming data to be in, as @Ignacio suggests, is a very interesting idea, but I have never seen it implemented in the PHP world. That is not saying anything against it, but you were asking about common practices.
*: It is often possible to verify whether incoming data has a specific encoding. For example, UTF-8 has specific byte values that can't stand on their own, but form a multi-byte character. ISO-8859-1 special characters overlap with those values, and will therefore be detected as invalid in UTF-8. But detecting a completely unknown encoding from an arbitrary set of data is close to impossible.
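For the specific case of UTF-8, a minimal verification could look like this (the field name is a placeholder, and which variant is fastest is something you would have to benchmark yourself):
$input = isset($_POST['field']) ? $_POST['field'] : ''; // 'field' is a placeholder name
// Option 1: the mbstring validator
$isValidUtf8 = mb_check_encoding($input, 'UTF-8');
// Option 2: preg_match with the /u modifier fails (returns false) on invalid UTF-8
$isValidUtf8Alt = (preg_match('//u', $input) === 1);
if (!$isValidUtf8) {
    // reject the request, or normalize with iconv('UTF-8', 'UTF-8//IGNORE', $input)
}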
Look at the charset specified in the request.
Your site publishes the web service or produces the form, so you can specify which encoding you expect. If the input passes your validation, everything is OK. If it doesn't, you don't need to care why it didn't pass. If it was due to a wrong encoding, it is not your fault.
I'm looking at encoding strings to prevent XSS attacks. Right now we want to use a whitelist approach, where any characters outside of that whitelist will get encoded.
Right now, we're taking things like '(' and outputting '&#40;' instead. As far as we can tell, this will prevent most XSS.
The problem is that we've got a lot of international users, and when the whole site's in japanese, encoding becomes a major bandwidth hog. Is it safe to say that any character outside of the basic ASCII set isn't a vulnerability and they don't need to be encoded, or are there characters outside the ASCII set that still need to be encoded?
Might be (a lot) easier if you just pass the encoding to htmlentities()/htmlspecialchars
echo htmlspecialchars($string, ENT_QUOTES, 'utf-8');
But if this is sufficient or not depends on what you're printing (and where).
see also:
http://shiflett.org/blog/2005/dec/googles-xss-vulnerability
http://jimbojw.com/wiki/index.php?title=Sanitizing_user_input_against_XSS
http://www.erich-kachel.de/?p=415 (in German. If I find something similar in English -> update) edit: well, I guess you can get the main point without being fluent in German ;)
The string javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41)) passes htmlentities() unchanged. Now consider something like <a href="<?php echo htmlentities($_GET['homepage']); ?>">, which will send <a href="javascript:eval(String.fromCharCode(97,108,101,114,116,40,39,88,83,83,39,41))"> to the browser. And that boils down to href="javascript:eval(\"alert('XSS')\")". While htmlentities() gets the job done for the contents of an element, it's not so good for attributes.
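One common way to handle that attribute case is to validate the URL scheme before echoing; a hedged sketch (allowing only http/https is my assumption about what you want):
$homepage = isset($_GET['homepage']) ? $_GET['homepage'] : '';
$scheme   = parse_url($homepage, PHP_URL_SCHEME);
// Only plain web links survive; javascript:, data:, etc. are replaced.
if (!in_array(strtolower((string) $scheme), array('http', 'https'), true)) {
    $homepage = '#';
}
echo '<a href="' . htmlspecialchars($homepage, ENT_QUOTES, 'utf-8') . '">link</a>';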
In general, yes, you can depend on anything non-ASCII to be "safe"; however, there are some very important caveats to consider:
Always ensure that what you're sending to the client is tagged as UTF-8. This means having a header that explicitly says "Content-Type: text/html; charset=utf-8" on every single page, including all of your error pages if any of the content on those error pages is generated from user input. (Many people forget to test their 404 page, and have that page include the not-found URL verbatim.)

Always ensure that what you're sending to the client is valid UTF-8. This means you cannot simply pass through bytes received from the user back to the user again. You need to decode the bytes as UTF-8, apply your HTML-encoding XSS prevention, and then encode them as UTF-8 as you write them back out.
The first of those two caveats is to keep the client's browser from seeing a bunch of stuff including high-letter characters and falling back to some local multibyte character set. That local multi-byte character set may have multiple ways of specifying harmful ascii characters that you won't have defended against. Related to this, some older versions of certain browsers - cough ie cough - were a bit overeager in detecting that a page was UTF-7; this opens up no end of XSS possibilities. To defend against this, you might want to make sure you html-encode any outgoing "+" sign; this is excessive paranoia when you're generating proper Content-Type headers, but will save you when some future person flips a switch that turns off your custom headers. (For example, by putting a poorly configured caching reverse proxy in front of your app, or by doing something to insert an extra banner header - php won't let you set any HTTP headers if any output is already written)
The second of those is because it is possible in UTF-8 to specify "overlong" sequences that, while invalid under current specs, will be interpreted by older browsers as ASCII characters. (See what Wikipedia has to say.) Also, it is possible that someone may insert a single bad byte into a request; if you pass this back to the user, it can cause some browsers to replace both the bad byte and one or more bytes after it with "?" or some other "couldn't understand this" character. That is, a single bad byte could cause some good bytes to also be swallowed up. If you look closely at what you're outputting, there's probably a spot somewhere where an attacker who was able to wipe a byte or two out of the output could do some XSS. Decoding the input as UTF-8 and then re-encoding it prevents this attack vector.
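A hedged sketch of that decode-and-re-encode step in PHP (the field name is a placeholder; mb_convert_encoding with UTF-8 as both source and target substitutes invalid sequences instead of passing them through, and ENT_SUBSTITUTE keeps htmlspecialchars from returning an empty string on bad input):
$raw = isset($_GET['comment']) ? $_GET['comment'] : ''; // 'comment' is a placeholder field name
// Normalize to valid UTF-8 first, so invalid byte sequences never reach the browser.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
// Then apply the HTML-encoding XSS prevention with an explicit charset.
echo htmlspecialchars($utf8, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');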