curl downloads http://mysite.com/Lunacy%20Disc%202%20of%202%20(U)(Saturn).zip
but not
http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip
Why is this the case?
Do I need to convert it to the first format? Using the URL generated via urlencode($url) fails.
Two problems:
urlencode will also encode the slashes on you. It's meant to encode query strings for use in URLs, not full URLs.
urlencode encodes spaces as +. You need rawurlencode if you want spaces as %20.
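A minimal sketch of both problems, using the asker's URL (expected output shown in comments):
<?php
$url = "http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip";
echo urlencode($url);
// http%3A%2F%2Fmysite.com%2FLunacy+Disc+2+of+2+%28U%29%28Saturn%29.zip
// -> the scheme and slashes are encoded too, and spaces become '+'
echo rawurlencode($url);
// http%3A%2F%2Fmysite.com%2FLunacy%20Disc%202%20of%202%20%28U%29%28Saturn%29.zip
// -> spaces become %20, but slashes are still encoded, so only ever
//    encode the path/filename portion, never the whole URL
?>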
To convert a URL to the "first format", you can use the PHP function urlencode.
Now, for the "why", the answer can probably be found in RFC 1738 - Uniform Resource Locators (URL).
Quoting some paragraphs:
Octets must be encoded if they have no corresponding graphic
character within the US-ASCII coded character set, if the use of the
corresponding character is unsafe, or if the corresponding character
is reserved for some other interpretation within the particular URL
scheme.
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The octets 80-FF hexadecimal are not
used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
control characters; these must be encoded.
A space has the code 0x20 -- it's not in the range 00-1F, so it doesn't have to be encoded for that reason... But, a bit later:
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
And here, you know why the space character has to be escaped/encoded too ;-)
urlencode() does indeed fail with curl. If your problem is just with spaces, you can manually substitute them:
$url = str_replace(' ', '%20', $url);
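If the URL can contain more special characters than just spaces, a hedged alternative (a sketch using PHP's parse_url) is to encode each path segment individually:
<?php
// Encode every path segment, leaving the scheme, host and slashes alone
$url   = "http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip";
$parts = parse_url($url);
$path  = implode('/', array_map('rawurlencode', explode('/', $parts['path'])));
echo $parts['scheme'] . '://' . $parts['host'] . $path;
// http://mysite.com/Lunacy%20Disc%202%20of%202%20%28U%29%28Saturn%29.zip
?>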
You need to urlencode to translate the spaces (in your example; there are other characters that require it) for transmission across the internet. The encoding ensures that the various communications protocols don't terminate or otherwise mangle the string while they're handling it.
http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip
That is not a valid URL. Accessing URLs like this may work in your browser, because most modern browsers will automatically encode the URL for you if required. The curl library does not do this automatically.
Why? Because some characters have special meanings, such as # (an HTML anchor / fragment delimiter).
So all characters except alphanumeric ones are encoded, whether they strictly need to be or not.
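For illustration (the filename here is made up), this is how such a special character would otherwise change the URL's meaning:
<?php
// '#' would otherwise start a fragment and cut the URL short
echo urlencode('file #2.zip');    // file+%232.zip
echo rawurlencode('file #2.zip'); // file%20%232.zip
?>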
Related
I've done some tests, and it appears that when I test this:
http://127.0.0.1/test.php?x={some non-english string}
http://127.0.0.1/test.php?x=الapple
By examining the output of:
echo bin2hex($_GET["x"]);
In Firefox & Chrome, I get the UTF-8 representation of the string in the $_GET['x'] variable: d8a7d9846170706c65. In IE, I get 3f3f6170706c65, which is wrong.
And I know that PHP does not change encoding, and only sees the string as a byte array.
The question is:
Is this controlled by the browser used?
Is it reliable to always assume the input is in UTF-8 encoding?
Is there a way to manage what encoding the browser sends to the server, across all browsers?
It makes a difference where the request originated.
If it’s from a user’s input, e.g., entering the URL into the browser’s address field, most browsers follow the suggestion in RFC 3986 and use UTF-8 as encoding:
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; […]
Although this is intended for new URI schemes and HTTP is quite old.
However, if the URL was embedded in a document, e.g., as a link or form action, the document’s encoding is used, unless the data was already URL-encoded. And in case the data has a wrong encoding, invalid sequences may be replaced with certain characters that denote those invalid sequences, like � (U+FFFD) does in Unicode. Similarly, the invalidly encoded characters ل and ا may have been replaced by ?, which has the code point 0x3F in ASCII.
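That reading can be checked against the hex the asker saw: two ASCII '?' bytes followed by "apple" produce exactly IE's output:
<?php
echo bin2hex('??apple');               // 3f3f6170706c65 -- IE's result
echo bin2hex("\xD8\xA7\xD9\x84apple"); // d8a7d9846170706c65 -- the UTF-8 bytes
?>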
I think it should come down to how urldecode (http://www.php.net/manual/en/function.urldecode.php) interprets it, since the $_GET variables are all passed through that function (see http://php.net/manual/en/reserved.variables.get.php).
EDIT
To encode the characters to UTF-8 for use in a URL from the client side, you can use the encodeURI in JavaScript.
For the example you gave, you can do encodeURI('الapple');, which should return "%D8%A7%D9%84apple"
Giving this to PHP's urldecode function (as it would be automatically) returns the original string, with the following hex output;
echo bin2hex(urldecode("%D8%A7%D9%84apple")); //outputs d8a7d9846170706c65
Yes, it's possible!
To encode the URL:
<?php
$url = "http://127.0.0.1/test.php?x=".urlencode("some non-english string");
?>
To decode the URL:
<?php
// Note: PHP already URL-decodes $_GET values, so calling urldecode()
// again can double-decode '+' and '%' sequences in the data.
$url = urldecode($_GET["x"]);
?>
I have a big sitemap created dynamically with PHP: a sitemap index with some 230 separate sitemaps, each containing between 3,000 and 15,000 URLs.
In most of those 230 sitemaps everything is OK, but in some of them certain URLs contain special characters, and Google returns an error and does not accept the sitemap. An example of a normal, accepted URL:
http://www.site.com/Gentofte-Greve/Denmark 1 Badmintonligaen/12-fe-juice_a-1091627-1-33-1-odds/
An example of a URL which corrupts the entire sitemap file for Google:
http://www.site.com/Team%20%C5rhus%20Elite-Solr%F8d%20Strand/Denmark 1 Badmintonligaen/12-fe-juice_a-1091631-1-33-1-odds/
Any special character, for example the Nordic ones, will wreck the sitemap. Here is an example of Nordic characters: http://www.borgos.nndata.no/alfabet.htm
My question is: HOW do I encode those special characters (and other similar ones) so the sitemap still checks out fine? Which PHP encoding function do I use, if that's a solution? Or is the only solution to use str_replace and replace those characters with plain ones? That would work, since the URL resolves no matter what you write in the first part of it (that part is for SEO only), but it would be time-consuming. I'd prefer to be able to write those special characters in a way that doesn't wreck the sitemap for Google.
Everything else regarding my sitemaps is fine, they're coded in UTF-8 or at least they should be with this line:
<?xml version='1.0' encoding='UTF-8'?>
Are the %C5 and %F8 sequences meant to represent the characters U+00C5 (Å) and U+00F8 (ø)? If so, you need to use their UTF-8 encodings, not their raw Unicode codepoint numbers. 'Å' should be %C3%85, and 'ø' should be %C3%B8.
For more information about URI encoding, see RFC 3986.
Doing this in PHP is complicated by the fact that PHP strings are really byte strings, not Unicode character strings. They can't store abstract Unicode characters; they can only store the encoded representation of those characters, in a particular encoding such as UTF-8 or UTF-16. You can use the mbstring extension to work with encoded Unicode strings, but doing this correctly will probably mean using the mbstring functions for all handling of Unicode text throughout your application.
You should be looking to fix this encoding problem at the source: how did your program get a string that contains the byte 0xC5 to represent the character U+00C5? Something, somewhere, must've assumed that Unicode codepoint numbers translate directly into bytes, which is wrong. Find and fix that, so that your data is read into the PHP string in UTF-8 form to begin with, and then use the mbstring functions for any manipulation of the string afterward.
Once you have a string that contains the UTF-8 representation of your URL, rawurlencode() should give you the correct percent-escaped result.
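A minimal sketch of that last point, assuming the team name (taken from the question's URL) is already stored as UTF-8:
<?php
// This source file must itself be saved as UTF-8 for the literal to be UTF-8
$team = "Team Århus Elite-Solrød Strand";
echo rawurlencode($team);
// Team%20%C3%85rhus%20Elite-Solr%C3%B8d%20Strand
?>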
So, I've run into a problem with PHP's rawurlencode function. All text fields in our web app are of course converted before being processed by the web server, and we've used rawurlencode for this. This works fine with almost every character I've found, except for the "£" sign. Now, there is no reason for our users to ever enter a pound sign, but they might, so I want to take care of this.
The problem is that rawurlencode doesn't encode a pound sign entered on the webpage as %A3, but instead as %C2%A3. Even worse, if the user failed to enter another bit of critical information (which causes the webpage to refresh -- the checks are done on the backend side -- and try to refill the form boxes with the information the user had entered), then when the %C2 is run through rawurldecode/encode, it becomes Ã? (aka %C3?). And of course the "£" is also turned into "Â£"!
So, what is causing this? I assume it's a character encoding issue, but I'm not that knowledgeable about these things. I heard somewhere that I can encode £s as &pound; manually, but why should I need to do that when the database can handle "£"s, and there is a percent encoding for a pound sign? Is this a bug in rawurlencode, or a bug caused by differing character sets?
Thanks for any help.
The standard requires forms to be submitted in the character encoding you specify in <form accept-charset="..."> or UTF-8 if it's not specified or the text the user has entered cannot be represented in the charset you specify.
Clearly, you're receiving the pound sign encoded in UTF-8. If you want to convert it to ISO-8859-15, write:
iconv("UTF-8", "ISO-8859-15//TRANSLIT", $original)
This is probably because the 0xA3 character in your native character set is being encoded as 0xC2 0xA3 in UTF-8, which is the valid UTF-8 encoding of U+00A3. Just consume your encoded URL as UTF-8, or specify an ANSI encoding when building it with urlencode.
Artefacto's answer covers the case where you need to convert character encodings, for example when you are displaying a page whose encoding is set to Latin-1. (Raw)urlencode will produce escaped strings with multibyte character representations. (Raw)urldecode simply yields the decoded bytes; if those bytes were UTF-8, £ will be represented as two bytes. If you then display this string while claiming it is an ISO-8859 encoded string, it will appear as two characters.
A primer on PHP and UTF-8: http://www.phpwact.org/php/i18n/utf-8
Some "hot tips": http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/
Likely, between getting the string from rawurldecode, and using the string, the locale is assumed to be ISO8859, so two bytes get interpreted as two characters when they represent one.
Use mb_convert_encoding to force PHP to realize that the bytes in the string represent a UTF-8 encoded string.
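A hedged sketch of that fix (assuming the decoded bytes really are UTF-8 and the rest of the pipeline expects ISO-8859-1):
<?php
$raw = rawurldecode('%C2%A3');  // two bytes, 0xC2 0xA3: UTF-8 for '£'
$latin1 = mb_convert_encoding($raw, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1);          // a3 -- a single Latin-1 pound sign
?>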
PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe, would it work properly if it was only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. This byte may not appear anywhere else in the sequence, so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
It's correct because UTF-8 multibyte characters are exclusively non-ASCII (128+ byte value) characters beginning with a byte that defines how many bytes follow, so you can't accidentally end up matching a part of one UTF-8 multibyte character with another.
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (a bytes in ASCII range), since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be safe that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
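A quick check of that guarantee (a sketch; both haystack and needle are assumed to be valid UTF-8):
<?php
$s = "h\xC3\xA9llo w\xC3\xB6rld";      // "héllo wörld" in UTF-8
echo str_replace('o', '0', $s);        // héll0 wörld -- 'é' and 'ö' untouched
echo str_replace("\xC3\xA9", 'e', $s); // hello wörld -- multibyte needle works too
?>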
Well, I do have a counter-example: I have a UTF-8 encoded settings ".ini" file specifying application settings like the email sender name. It says something like:
email_from = Märta
and I read it from there into the variable $sender_name. Now I do the replacement in the message body (UTF-8 again), which ends with:
regards
{sender}
$message = str_replace("{sender}",$sender_name,$message);
The email is absolutely correct in every respect, but the sender is totally broken. There are other cases (like explode()) where something goes wrong with a UTF-8 string: it is healthy before the conversion but not after it. Sorry to say, there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file, so the problem may well lie in that very function, and str_replace() may well be innocent.
No, you cannot.
From practice I am telling you: if you have some multibyte symbols (like ◊) and others that are not multibyte, it won't work correctly, because some symbols take 2-4 bytes each.
str_replace operates on fixed bytes and replaces them... As a result, you can get something that isn't any symbol -- trash, etc.
Yes, I think this is correct, at least I couldn't find any counter-example.
I have user-submitted tags that can be any kind of (valid) UTF-8 string. I want to know if it is safe to include them in the URL merely by running them through urlencode().
In other words, is urlencode() safe to use for valid UTF-8 strings?
(By valid I mean I'd have already force-encoded them to UTF-8.)
urlencode does not depend on a specific character encoding. It just looks at the bytes, interprets them as ASCII characters, and replaces any byte that is either not allowed in ASCII (0x80–0xFF) or not allowed verbatim in a URL.
Now to your question: yes, using urlencode does encode any string in any character encoding to be safely used -- but only in the URL query! Because urlencode formats the input according to application/x-www-form-urlencoded, which differs from the “normal” percent encoding in how the space is encoded: in application/x-www-form-urlencoded, spaces are replaced by +, while the “normal” percent encoding replaces them by %20.
If you want the “normal” percent encoding, use rawurlencode instead.
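So for the tag use case, a minimal sketch (the tag and base URL here are made up):
<?php
$tag = "café+crème";   // arbitrary, already valid UTF-8
echo "http://example.com/tags/" . rawurlencode($tag);
// http://example.com/tags/caf%C3%A9%2Bcr%C3%A8me
?>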
Just to be entirely on the safe side, I would remove newlines first. They are not dangerous in themselves, but they can be stepping stones in exploiting other vulnerabilities.
Yes, urlencode() should make a safe URL string out of any input string, as long as whatever that URL maps to (folder/file/htaccess) doesn't have funky characters in it. Whenever sanitizing stuff from a user where they could be posting something funky, I love this function:
utf8_encode()