How to get a cryptographically strong integer from 0-X in PHP? - php

I want to generate random alphanumeric strings in PHP. They will be used in places where the strength of random numbers is important (publicly visible IDs in URLs and the like).
As I understand, in PHP the main source of cryptographically strong randomness is openssl_random_pseudo_bytes(). This however returns a string of raw bytes, not alphanumeric characters.
To convert them to alphanumerics I could either hash them (which would produce a longer-than-necessary string of a limited set of hex characters), or base64_encode() them (which would produce a string with +, / and = in it - not alphanumerics).
So I think that instead I could use the random bytes as a source of entropy and generate my own string consisting only of the characters 0-9a-zA-Z.
The problem then becomes how to translate from 256 distinct values (one byte of input) to 62 distinct values (one character of output), in such a way that all 62 characters are equally likely. (Otherwise, since 256 = 4 × 62 + 8, there will be 8 characters that appear more often than the rest.)
Or perhaps I should use another approach entirely? I would like my string to be as short as possible (say, 20 characters or so - shorter URLs are better) and consist only of alphanumeric characters (so that it doesn't need to be specially escaped anywhere).

You can implement your own base64 encoding, sort of. If you can allow two specific symbols - these can be anything, for example . and -, it doesn't really matter. It can even be a space for one of them. In any case, what you would do is this:
$alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-";
// using . and - for the two symbols here
$input = [123,193,21,13]; // whatever your input is, I'm assuming an array of bytes
$output = "";
foreach($input as $byte) {
    $output .= $alphabet[$byte%64];
}
Assuming random input, all characters have equal probability of appearing.
That being said, if you can't allow anything except pure alphanumeric, cut the symbols from the $alphabet and use %62 instead of %64. This does introduce a bias towards the characters 0 through 7: since 256 = 4 × 62 + 8, each of them comes up 5 times out of 256 instead of 4. I don't think it's significant enough to worry about.
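If the bias does matter for your use case, one common way around it is rejection sampling: throw away any byte that falls in the uneven tail instead of wrapping it with the modulo. The following is only a sketch under a few assumptions: it uses random_bytes() (PHP 7+) rather than openssl_random_pseudo_bytes(), and the function name randomAlphanumeric is made up for illustration.

```php
<?php
// Hypothetical helper (not from the answers above): builds an alphanumeric
// string by rejection sampling so that all 62 characters are equally likely.
function randomAlphanumeric(int $length): string
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $out = '';
    while (strlen($out) < $length) {
        // Ask for a batch of bytes; some of them will be rejected below.
        foreach (unpack('C*', random_bytes($length)) as $byte) {
            // 248 = 4 * 62, so bytes 0..247 map evenly onto the 62 characters.
            // Bytes 248..255 are the uneven tail and are simply skipped.
            if ($byte < 248 && strlen($out) < $length) {
                $out .= $alphabet[$byte % 62];
            }
        }
    }
    return $out;
}

echo randomAlphanumeric(20), PHP_EOL;
```

On PHP 7 and later, random_int(0, 61) gives an unbiased index directly, which may be simpler if you don't need to work from raw bytes.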

I found this function on php.net in the user comments.
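function crypto_rand($min,$max) {
    $range = $max - $min + 1; // number of possible values, $max included
    if ($range <= 1) return $min; // not so random...
    $length = (int) (log($range,2) / 8) + 1; // bytes needed to cover the range
    // note: the modulo leaves a slight bias unless $range is a power of two
    return $min + (hexdec(bin2hex(openssl_random_pseudo_bytes($length,$s))) % $range);
}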
Then do something like
$string = '';
for($i=0; $i<20; $i++)
{
    $string.= chr(crypto_rand(1,26)+96); //or +64 for upper case
}
Or similar.

note: THIS IS WRONG! I leave this attempted answer for reference only.
(31 * 256) % 62 = 0
For each output alphanumeric character, generate 31 random values. Sum these 31 values and take the modulo 62.
Kind of brutal, but this is the only "mathematically correct" option I can think of :)

Related

Cryptographically secure random ASCII-string in PHP

I know about random_bytes() in PHP 7, and I want to use it for generating a cryptographically secure (e.g. hard to guess) random string for use as a one-time token or for longer term storage in a cookie.
Unfortunately, I don't know how to convert the output of random_bytes() to a string consisting only of human readable characters, so browsers don't get confused. I know about bin2hex(), but I'd prefer to use the full ASCII-range instead of hex numbers, for the sake of more bits per length.
Any ideas?
Unfortunately Peter O. deleted his answer after receiving negative attention in a review queue, perhaps because he phrased it as a question. I believe it is a legitimate answer so I will reprise it.
One easy solution is to encode your random data into the base64 alphabet using base64_encode(). This will not produce the "full ASCII-range" as you have requested but it will give you most of it. An even larger ASCII range is output by a suitable base85 encoder, but php does not have a built-in one. You can probably find plenty of open-source base85 encoders for php though. In my opinion the decrease in length of base85 over base64 is unlikely to be worth the extra code you have to maintain.
I personally just use a GUID library and concatenate a couple of GUIDs to get a long unique token string. You also have the option to remove the dashes to keep it difficult to know the source and if you want to make it even more complex you can randomly cut back the string by up to 10 char to add complexity to its unknown length.
I use this library for generating my GUIDs
https://packagist.org/packages/ramsey/uuid
use Ramsey\Uuid\Uuid;
$token = Uuid::uuid4() . '-' . Uuid::uuid4();
Sorry, I overlooked the part about you wanting to use the full scope of 26 alpha char with numeric... Not sure I have an answer for you in this respect but you should have faith in the difficulty of guessing a UUID4, especially when you add a couple together and obfuscate the length by a factor of 10 to make guessing more complex.
Actually, if you could safely generate an array of random numbers in the range of valid ascii char codes then you could convert the entire random array of codes into the respective ascii char and implode them together as a single string.
function randomAsciiString($length) {
    return implode('', array_map(
        function($value) {
            return chr($value);
        },
        array_map(
            function($value) {
                return random_int(33, 126);
            },
            array_fill(0, $length, null) // $length elements, so the result is exactly $length chars
        )
    ));
}
echo randomAsciiString(128); // Normal 128 char string
echo randomAsciiString(random_int(118, 128)); // obfuscated length char string for extra complexity.
Of course, you should be mindful that you're using all the standard keys on the keyboard, and some of those characters are going to upset things that are sensitive (e.g. quotes).
Let's consider the letters to be used. For the sake of simplicity I will assume that you intend only big and small English letters to be used. This means that you have 26 big letters and 26 small letters, 52 different possible values. If we view a byte array of n elements as a number of n digits in base 256 and we convert this number into a base 52 number, where A is 0, B is 1, C is 2, ..., a is 26, ..., z is 51, then converting these digits into the corresponding letters will yield the text you wanted.
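The base conversion described above can be sketched roughly as follows. This is only an illustration: the function name bytesToLetters is made up, and it does the base-256 to base-52 conversion with plain long division so that no big-number extension is assumed.

```php
<?php
// Hypothetical helper: treats $bytes as one big base-256 number and rewrites
// it in base 52 (A = 0 ... Z = 25, a = 26 ... z = 51) by repeated long division.
function bytesToLetters(string $bytes): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    // Base-256 digits, most significant first.
    $digits = array_values(unpack('C*', $bytes));
    $out = '';
    while (count($digits) > 0) {
        $remainder = 0;
        $quotient = [];
        foreach ($digits as $d) {
            $acc = $remainder * 256 + $d;
            $q = intdiv($acc, 52);
            $remainder = $acc % 52;
            // Skip leading zeros so the quotient shrinks on every pass.
            if ($q > 0 || count($quotient) > 0) {
                $quotient[] = $q;
            }
        }
        // The remainder of each division is the next base-52 digit,
        // produced least significant first.
        $out = $alphabet[$remainder] . $out;
        $digits = $quotient;
    }
    return $out;
}

echo bytesToLetters(random_bytes(16)), PHP_EOL;
```

Two things to keep in mind with this approach: leading zero bytes collapse, so the output length varies slightly, and the first letter of the result is not uniformly distributed, because a power of 256 is never an exact power of 52.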

How to treat two chars in a string as a byte?

Consider:
$tag = "4F";
$tag is a string containing two characters, '4' and 'F'. I want to be able to treat these as the upper and lower nibbles respectively of a whole byte (4F) so that I can go on to compute the bit-patterns (01001111)
As these are technically characters, they can be treated in their own right as a byte each - '4' on the ASCII table is 0x34 (decimal 52) and 'F' is 0x46 (decimal 70).
Pretty much all the PHP built-in functions that allow manipulation of bytes (that I've seen so far) are variations on the latter description: '4' is 0x34, and not the upper nibble of a byte.
I don't know of any quick or built-in way to get PHP to handle this the way I want, but it feels like it should be there.
How do I convert a string "4F" to the byte 4F, or treat each char as a nibble in a nibble-pair. Are there any built in functions to get PHP to handle a string like "4F" or "3F0E" as pairs of nibbles?
Thanks.
If you're wanting "the decimal representation of a hex digit", hexdec is one way to go.
If you're wanting "bit pattern for hex digit", then use base_convert. The docs even show an example of this maneuver:
Example #1 base_convert() example
$hexadecimal = 'a37334';
echo base_convert($hexadecimal, 16, 2);
The above example will output:
101000110111001100110100
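For the "4F" case from the question, a minimal sketch might look like this. It assumes that a zero-padded 8-bit pattern is what is wanted, and it also shows hex2bin() for longer strings such as "3F0E", which is not mentioned in the answer above.

```php
<?php
$tag = "4F";

// Treat the two characters as the nibbles of one byte.
$byte = hexdec($tag); // 79

// decbin() drops leading zeros, so pad back to a full 8 bits.
$bits = str_pad(decbin($byte), 8, "0", STR_PAD_LEFT);
echo $bits, PHP_EOL; // 01001111

// For longer strings, hex2bin() turns every nibble pair into one raw byte.
$raw = hex2bin("3F0E");
$values = array_values(unpack('C*', $raw)); // [63, 14]
print_r($values);
```

Note that base_convert() goes through a float internally for large inputs, so for long hex strings it is probably safer to convert byte by byte as above.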

How do I get the value encoded in a SHORT (signed 16 bit number) that is explicitly MSBF in PHP

I need to unpack binary data that is encoded rather exotically: a 32 bit 2's complement bit pattern, representing a SHORT.USHORT decimal fraction, with a signed SHORT integer component and an unsigned SHORT "this many 1/65536 parts" decimal fraction component. To make things even more fun, the sign of the SHORT is determined by the first bit in the 2's complement 32 bit pattern. Not by its sign after decoding to 'real' bit pattern.
An example of this would be the following:
2's complement bit pattern: 11111111110101101010101010101100
converted 'normal' pattern: 00000000001010010101010101010100
SHORT bits (upper 16): 0000000000101001 (decimal: 41)
USHORT bits (lower 16): 0101010101010100 (decimal: 21844)
actual number encoded: -41.333 (41, negative from high MSB + 21844/65536)
(if you think this scheme is insane: it certainly seems that way, doesn't it? It's the byte format used in Type2 fonts that are encoded in a CFF block, or "compact font format" block. Crazy as it is, this format is set in stone, and we're about 20 years too late to have it changed. This is the byte layout in a CFF font, and the only thing we get to worry about now is how to correctly decode it)
Problems occur when we're dealing with patterns like these:
2's complement bit pattern: 00000000000000000000000000000001
converted pattern: 11111111111111111111111111111111
upper 16 bits: 1111111111111111 (decimal 65535 *OR* -1)
lower 16 bits: 1111111111111111 (decimal 65535)
SHORT.USHORT number: -65536 *OR* 1
Depending on who you ask, the pattern 1111111111111111 can be decoded either as 65535, such as when interpreted as a bit pattern in a larger (32 or 64 bit) number, or as -1, when interpreted as a 16 bit signed integer. The only correct interpretation here, however, is as the latter, so this leads us to the question's subject line:
what PHP code do I use to turn this 16 bit pattern into the correct number, given that PHP has no pack/unpack parameter for unpacking as 16 bit int with the most significant bit first? There is a parameter for unpacking a 16 bit int using machine-indicated byte order, but this is going to give problems because font data storage is non-negotiable: all fonts, allwhere, everywhen, must be encoded using Motorola/Big Endian byte ordering, irrespective of the machine's preferred byte ordering.
My code for going from 32-bit 2's complement to final value at the moment is this:
// read in 32 bit pattern, representing a 2's complement pattern
$p2c = 0x01000000 * $b[$x] + 0x010000 * $b[$x+1] + 0x0100 * $b[$x+2] + $b[$x+3];
// convert 2's complement to plain form
$p = (~$p2c + 1) & 0xFFFFFFFF;
// get lower 16 bits, representing an unsigned short.
// due to unsigned-ness, this value is always correct.
$ushort = 0xFFFF & $p;
// get higher 16 bits, representing a signed short.
// due to its sign, this value can be spectacularly wrong!
$short = ($p >> 16);
// "reconstitute" the FIXED format number
$num = - ($short + round($ushort/65536,3));
This had a pretty simple answer that I completely ignored for no good reason, and of course didn't think of until I wrote this question.
$short = $pattern >> 16;
if($short >= 32768) { $short -= 65536; }
and voila.
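The same sign fix can be combined with unpack()'s "n" format, which reads an unsigned 16-bit value in big-endian order regardless of the machine's byte order. This is a sketch only: the helper name readInt16BE is made up, and the offset argument to unpack() assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: reads a signed, big-endian 16-bit integer from $data.
function readInt16BE(string $data, int $offset = 0): int
{
    // "n" = unsigned short, always big-endian (Motorola) byte order.
    $unsigned = unpack('n', $data, $offset)[1];
    // Values with the top bit set represent negatives in two's complement.
    return $unsigned >= 0x8000 ? $unsigned - 0x10000 : $unsigned;
}

echo readInt16BE("\xFF\xFF"), PHP_EOL; // -1
echo readInt16BE("\x00\x29"), PHP_EOL; // 41
```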

How to generate a 128-bit long string?

Basically, I'm looking for a function to perform the following
generateToken(128)
which will return a 128-bit string consisting of integers or alphabet characters.
Clarification: From the comments, I had to change the question. Apparently, I am looking for a string that is 16 characters long if it needs to be 128 bits.
Is there a reason you must restrict the string to integers? That actually makes the problem a lot harder because each digit gives you 3.3 bits (because 2^3.3 ~= 10). It's tricky to generate exactly 128 bits of token in this manner.
Much easier is to allow hexadecimal encoding (4 bits per character). You can then generate 128 genuine random bits, then encode them in hex for use in your application. Base64 encoding (6 bits per character) is also useful for this kind of thing.
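openssl_random_pseudo_bytes will give you a string of random bytes that you can use bin2hex to encode, otherwise you can use mt_rand in your own token-generation routine (bear in mind that mt_rand is not cryptographically secure, so it only suits tokens that don't need to be unguessable).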
EDIT: After reading the updates to the question it seems that you want to generate a token that represents 128 bits of data and the actual string length (in characters) is not so important. If I guess your intention correctly (that this is a unique ID, possibly for identification/authentication purposes) then I'd suggest you use openssl_random_pseudo_bytes to generate the right number of bits for your problem, in this case 128 (16 bytes). You can then encode those bits in any way you see fit: hex and base64 are two possibilities.
Note that hex encoding will use 32 characters to encode 128 bits of data since each character only encodes 4 bits (128 / 4 = 32). Base64 will use 22 characters (128 / 6 = 21.3). Each character takes up 8 bits of storage but only encodes 4 or 6 bits of information.
Be very careful not to confuse encoded string length with raw data length. If you choose a 16-character string using alphanumeric characters (a-z, A-Z, 0-9) then you only get 6 bits of information per character (log base 2 of 62 is nearly 6), so your 16-character string will only encode 96 bits of information. You should think of your token as an opaque byte array and only worry about turning it into / from a character string when you actually try to send it over the wire or put it in a cookie or whatever.
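To make the difference between raw length and encoded length concrete, here is a small sketch. It assumes PHP 7+ for random_bytes(); on older versions openssl_random_pseudo_bytes(16) could stand in for it.

```php
<?php
// 128 bits of raw token data = 16 bytes.
$raw = random_bytes(16);

// Hex: 4 bits per character, so 32 characters.
$hex = bin2hex($raw);

// Base64: 6 bits per character, so 22 characters once the "=" padding is removed.
$b64 = rtrim(base64_encode($raw), '=');

printf("raw: %d bytes, hex: %d chars, base64: %d chars\n",
    strlen($raw), strlen($hex), strlen($b64));
```

All three hold the same 128 bits; only the representation changes.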
As of PHP 5.3:
$rand128 = bin2hex(openssl_random_pseudo_bytes(16));
What is your purpose?
If you just want a unique id, then use uniqid:
http://www.php.net/manual/en/function.uniqid.php
It's not random, it's essentially a hex string based on microtime. If you do uniqid('', true), then it will return a hex string based on microtime as well as tack on a bunch of random numbers on the end of the id (so even if two calls come in on the same microsecond, it is unlikely that they'll share a unique id).
If you need a 16-character string exactly, then what purpose? Are you salting passwords? How random should the string be? All in all, you can always just do:
$toShow = array();
for($i = 0; $i<16; $i++){
    $toShow[] = chr(mt_rand(ord('a'), ord('z')));
}
return implode('', $toShow); // join the 16 characters into one string
Now this creates a string of characters that are between 'a' and 'z'. You can change "ord('a')" to 0, and "ord('z')" to 255 to get a fully random binary string... or any other range you need.

urlencode vs rawurlencode?

If I want to create a URL using a variable I have two choices to encode the string. urlencode() and rawurlencode().
What exactly are the differences and which is preferred?
It will depend on your purpose. If interoperability with other systems is important then it seems rawurlencode is the way to go. The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode).
rawurlencode follows RFC 1738 prior to PHP 5.3.0 and RFC 3986 afterwards (see http://us2.php.net/manual/en/function.rawurlencode.php)
Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
Note on RFC 3986 vs 1738. rawurlencode prior to php 5.3 encoded the tilde character (~) according to RFC 1738. As of PHP 5.3, however, rawurlencode follows RFC 3986 which does not require encoding tilde characters.
urlencode encodes spaces as plus signs (not as %20 as done in rawurlencode)(see http://us2.php.net/manual/en/function.urlencode.php)
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
This corresponds to the definition for application/x-www-form-urlencoded in RFC 1866.
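A quick way to see both documented differences (the space and the tilde) side by side, assuming PHP 5.3 or later:

```php
<?php
$input = 'a b~c';

// urlencode: space becomes "+", tilde is percent-encoded.
echo urlencode($input), PHP_EOL;    // a+b%7Ec

// rawurlencode: space becomes "%20", tilde is left alone (RFC 3986).
echo rawurlencode($input), PHP_EOL; // a%20b~c
```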
Additional Reading:
You may also want to see the discussion at http://bytes.com/groups/php/5624-urlencode-vs-rawurlencode.
Also, RFC 2396 is worth a look. RFC 2396 defines valid URI syntax. The main part we're interested in is from 3.4 Query Component:
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
As you can see, the + is a reserved character in the query string and thus would need to be encoded as per RFC 3986 (as in rawurlencode).
Proof is in the source code of PHP.
I'll take you through a quick process of how to find out this sort of thing on your own in the future any time you want. Bear with me, there'll be a lot of C source code you can skim over (I explain it). If you want to brush up on some C, a good place to start is our SO wiki.
Download the source (or use https://heap.space/ to browse it online), grep all the files for the function name, you'll find something such as this:
PHP 5.3.6 (most recent at time of writing) describes the two functions in their native C code in the file url.c.
RawUrlEncode()
PHP_FUNCTION(rawurlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }
    out_str = php_raw_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}
UrlEncode()
PHP_FUNCTION(urlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }
    out_str = php_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}
Okay, so what's different here?
They both are in essence calling two different internal functions respectively: php_raw_url_encode and php_url_encode
So go look for those functions!
Lets look at php_raw_url_encode
PHPAPI char *php_raw_url_encode(char const *s, int len, int *new_length)
{
    register int x, y;
    unsigned char *str;
    str = (unsigned char *) safe_emalloc(3, len, 1);
    for (x = 0, y = 0; len--; x++, y++) {
        str[y] = (unsigned char) s[x];
#ifndef CHARSET_EBCDIC
        if ((str[y] < '0' && str[y] != '-' && str[y] != '.') ||
            (str[y] < 'A' && str[y] > '9') ||
            (str[y] > 'Z' && str[y] < 'a' && str[y] != '_') ||
            (str[y] > 'z' && str[y] != '~')) {
            str[y++] = '%';
            str[y++] = hexchars[(unsigned char) s[x] >> 4];
            str[y] = hexchars[(unsigned char) s[x] & 15];
#else /*CHARSET_EBCDIC*/
        if (!isalnum(str[y]) && strchr("_-.~", str[y]) != NULL) {
            str[y++] = '%';
            str[y++] = hexchars[os_toascii[(unsigned char) s[x]] >> 4];
            str[y] = hexchars[os_toascii[(unsigned char) s[x]] & 15];
#endif /*CHARSET_EBCDIC*/
        }
    }
    str[y] = '\0';
    if (new_length) {
        *new_length = y;
    }
    return ((char *) str);
}
And of course, php_url_encode:
PHPAPI char *php_url_encode(char const *s, int len, int *new_length)
{
    register unsigned char c;
    unsigned char *to, *start;
    unsigned char const *from, *end;
    from = (unsigned char *)s;
    end = (unsigned char *)s + len;
    start = to = (unsigned char *) safe_emalloc(3, len, 1);
    while (from < end) {
        c = *from++;
        if (c == ' ') {
            *to++ = '+';
#ifndef CHARSET_EBCDIC
        } else if ((c < '0' && c != '-' && c != '.') ||
                   (c < 'A' && c > '9') ||
                   (c > 'Z' && c < 'a' && c != '_') ||
                   (c > 'z')) {
            to[0] = '%';
            to[1] = hexchars[c >> 4];
            to[2] = hexchars[c & 15];
            to += 3;
#else /*CHARSET_EBCDIC*/
        } else if (!isalnum(c) && strchr("_-.", c) == NULL) {
            /* Allow only alphanumeric chars and '_', '-', '.'; escape the rest */
            to[0] = '%';
            to[1] = hexchars[os_toascii[c] >> 4];
            to[2] = hexchars[os_toascii[c] & 15];
            to += 3;
#endif /*CHARSET_EBCDIC*/
        } else {
            *to++ = c;
        }
    }
    *to = 0;
    if (new_length) {
        *new_length = to - start;
    }
    return (char *) start;
}
One quick bit of knowledge before I move forward: EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. But basically, this means the EBCDIC byte 0x4C isn't the L it would be in ASCII, it's actually a <. I'm sure you see the confusion here.
Both of these functions manage EBCDIC if the web server has defined it.
Also, they both use an array of chars (think string type) hexchars look-up to get some values, the array is described as such:
/* rfc1738:
...The characters ";",
"/", "?", ":", "#", "=" and "&" are the characters which may be
reserved for special meaning within a scheme...
...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL...
For added safety, we only leave -_. unencoded.
*/
static unsigned char hexchars[] = "0123456789ABCDEF";
Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.
Differences in ASCII:
URLENCODE:
Calculates a start/end length of the input string, allocates memory
Walks through a while-loop, increments until we reach the end of the string
Grabs the present character
If the character is equal to ASCII Char 0x20 (ie, a "space"), add a + sign to the output string.
If it's not a space, and it's also not alphanumeric (isalnum(c)), and also isn't a _, -, or . character, then we output a % sign to array position 0, then look up two hex digits in the hexchars array. The value being converted is c itself or, in the EBCDIC build, os_toascii[c] (an array from Apache that translates the native character to its ASCII code). Position 1 gets the high nibble (the value bitwise shifted right by 4), and position 2 gets the low nibble (the value bitwise ANDed with 15, i.e. 0xF, which keeps only the lowest four bits). At the end, you'll end up with something encoded.
If it ends up it's not a space, it's alphanumeric or one of the _-. chars, it outputs exactly what it is.
RAWURLENCODE:
Allocates memory for the string
Iterates over it based on length provided in function call (not calculated in function as with URLENCODE).
Note: Many programmers have probably never seen a for loop iterate this way, it's somewhat hackish and not the standard convention used with most for-loops, pay attention, it assigns x and y, checks for exit on len reaching 0, and increments both x and y. I know, it's not what you'd expect, but it's valid code.
Assigns the present character to a matching character position in str.
It checks if the present character is alphanumeric, or one of the _-. chars, and if it isn't, we do almost the same assignment as with URLENCODE where it performs lookups, however, we increment differently, using y++ rather than to[1], this is because the strings are being built in different ways, but reach the same goal at the end anyway.
When the loop's done and the length's gone, it actually terminates the string, assigning the \0 byte.
It returns the encoded string.
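If the C is hard to follow, the core of the rawurlencode loop can be re-expressed in PHP itself. This is an illustrative port of the ASCII logic only, not the actual implementation, and the name sketchRawUrlEncode is made up.

```php
<?php
// Hypothetical PHP rendering of the percent-encoding loop walked through above.
function sketchRawUrlEncode(string $s): string
{
    $hexchars = '0123456789ABCDEF';
    // Characters that pass through untouched (alphanumerics plus -_.~).
    $unreserved = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~';
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $ch = $s[$i];
        if (strpos($unreserved, $ch) !== false) {
            $out .= $ch;
        } else {
            $c = ord($ch);
            // High nibble: shift right by 4. Low nibble: mask with 15 (0xF).
            $out .= '%' . $hexchars[$c >> 4] . $hexchars[$c & 15];
        }
    }
    return $out;
}

echo sketchRawUrlEncode('asd asd'), PHP_EOL; // asd%20asd
```

Adding a branch that emits + for a space, and dropping ~ from the pass-through list, would give the urlencode behaviour instead.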
Differences:
UrlEncode checks for space, assigns a + sign, RawURLEncode does not.
Both terminate the string with a \0 byte: UrlEncode via *to = 0, RawUrlEncode via str[y] = '\0' (so this is a moot point).
They iterate differently, one may be prone to overflow with malformed strings, I'm merely suggesting this and I haven't actually investigated.
They basically iterate differently, one assigns a + sign in the event of ASCII 0x20.
Differences in EBCDIC:
URLENCODE:
Same iteration setup as with ASCII
Still translating the "space" character to a + sign. Note-- I think this needs to be compiled in EBCDIC or you'll end up with a bug? Can someone edit and confirm this?
It checks if the present char is a char before 0, with the exception of being a . or -, OR less than A but greater than char 9, OR greater than Z and less than a but not a _. OR greater than z (yeah, EBCDIC is kinda messed up to work with). If it matches any of those, do a similar lookup as found in the ASCII version (it just doesn't require a lookup in os_toascii).
RAWURLENCODE:
Same iteration setup as with ASCII
Same check as described in the EBCDIC version of URL Encode, with the exception that if it's greater than z, it excludes ~ from the URL encode.
Same assignment as the ASCII RawUrlEncode
Still appending the \0 byte to the string before return.
Grand Summary
Both use the same hexchars lookup table
Both terminate the string with a \0 byte (UrlEncode writes *to = 0, RawUrlEncode writes str[y] = '\0').
If you're working in EBCDIC I'd suggest using RawUrlEncode, as it manages the ~ that UrlEncode does not (this is a reported issue). It's worth noting that the space check compares against the C literal ' ', which is the platform's own space character (0x20 in ASCII, 0x40 in EBCDIC).
They iterate differently, one may be faster, one may be prone to memory or string based exploits.
UrlEncode makes a space into +, RawUrlEncode makes a space into %20 via array lookups.
Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.
Suggested implementations
Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, whereas urlencode does things the old school way, where + meant "space."
If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally double-encoding, or similar "oops" scenarios around this space/%20/+ issue.
If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode, however, I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred. Give it a shot if you're up for playing around, let us know how it worked out for you.
Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).
echo rawurlencode('http://www.google.com/index.html?id=asd asd');
yields
http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd%20asd
while
echo urlencode('http://www.google.com/index.html?id=asd asd');
yields
http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd+asd
The difference being the asd%20asd vs asd+asd
urlencode differs from RFC 1738 by encoding spaces as + instead of %20
One practical reason to choose one over the other is if you're going to use the result in another environment, for example JavaScript.
In PHP urlencode('test 1') returns 'test+1' while rawurlencode('test 1') returns 'test%201' as result.
But if you need to "decode" this in JavaScript using decodeURI() function then decodeURI("test+1") will give you "test+1" while decodeURI("test%201") will give you "test 1" as result.
In other words the space (" ") encoded by urlencode to plus ("+") in PHP will not be properly decoded by decodeURI in JavaScript.
In such cases the rawurlencode PHP function should be used.
I believe spaces must be encoded as:
%20 when used inside URL path component
+ when used inside URL query string component or form data (see 17.13.4 Form content types)
The following example shows the correct use of rawurlencode and urlencode:
echo "http://example.com"
. "/category/" . rawurlencode("latest songs")
. "/search?q=" . urlencode("lady gaga");
Output:
http://example.com/category/latest%20songs/search?q=lady+gaga
What happens if you encode path and query string components the other way round? For the following example:
http://example.com/category/latest+songs/search?q=lady%20gaga
The webserver will look for the directory latest+songs instead of latest songs
The query string parameter q will contain lady gaga
1. What exactly are the differences and
The only difference is in the way spaces are treated:
urlencode - based on legacy implementation converts spaces to +
rawurlencode - based on RFC 1738 translates spaces to %20
The reason for the difference is because + is reserved and valid (unencoded) in urls.
2. which is preferred?
I'd really like to see some reasons for choosing one over the other ... I want to be able to just pick one and use it forever with the least fuss.
Fair enough, I have a simple strategy that I follow when making these decisions which I will share with you in the hope that it may help.
I think it was the HTTP/1.1 specification RFC 2616 which called for "Tolerant applications"
Clients SHOULD be tolerant in parsing the Status-Line and servers
tolerant when parsing the Request-Line.
When faced with questions like these the best strategy is always to consume as much as possible and produce what is standards compliant.
So my advice is to use rawurlencode to produce standards compliant RFC 1738 encoded strings and use urldecode to be backward compatible and accommodate anything you may come across to consume.
Now you could just take my word for it but lets prove it shall we...
php > $url = <<<'EOD'
<<< > "Which, % of Alice's tasks saw $s @ earnings?"
<<< > EOD;
php > echo $url, PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > echo urlencode($url), PHP_EOL;
%22Which%2C+%25+of+Alice%27s+tasks+saw+%24s+%40+earnings%3F%22
php > echo rawurlencode($url), PHP_EOL;
%22Which%2C%20%25%20of%20Alice%27s%20tasks%20saw%20%24s%20%40%20earnings%3F%22
php > echo rawurldecode(urlencode($url)), PHP_EOL;
"Which,+%+of+Alice's+tasks+saw+$s+@+earnings?"
php > // oops that's not right???
php > echo urldecode(rawurlencode($url)), PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > // now that's more like it
It would appear that PHP had exactly this in mind. Even though I've never come across anyone refusing either of the two formats, I can't think of a better strategy to adopt as your de facto strategy, can you?
nJoy!
The difference is in the return values, i.e:
urlencode():
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits and
spaces encoded as plus (+) signs. It
is encoded the same way that the
posted data from a WWW form is
encoded, that is the same way as in
application/x-www-form-urlencoded
media type. This differs from the »
RFC 1738 encoding (see rawurlencode())
in that for historical reasons, spaces
are encoded as plus (+) signs.
rawurlencode():
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits. This
is the encoding described in » RFC
1738 for protecting literal characters
from being interpreted as special URL
delimiters, and for protecting URLs
from being mangled by transmission
media with character conversions (like
some email systems).
The two are very similar, but the latter (rawurlencode) will replace spaces with a '%' and two hex digits, which is suitable for encoding passwords or such, where a '+' is not e.g.:
echo '<a href="ftp://user:', rawurlencode('foo @+%/'),
     '@ftp.example.com/x.txt">';
//Outputs <a href="ftp://user:foo%20%40%2B%25%2F@ftp.example.com/x.txt">
urlencode: This differs from the
» RFC 1738 encoding (see
rawurlencode()) in that for historical
reasons, spaces are encoded as plus
(+) signs.
Spaces encoded as %20 vs. +
The biggest reason I've seen to use rawurlencode() in most cases is because urlencode encodes text spaces as + (plus signs) where rawurlencode encodes them as the commonly-seen %20:
echo urlencode("red shirt");
// red+shirt
echo rawurlencode("red shirt");
// red%20shirt
I have specifically seen certain API endpoints that accept encoded text queries expect to see %20 for a space and as a result, fail if a plus sign is used instead. Obviously this is going to differ between API implementations and your mileage may vary.
I believe urlencode is for query parameters, whereas the rawurlencode is for the path segments. This is mainly due to %20 for path segments vs + for query parameters. See this answer which talks about the spaces: When to encode space to plus (+) or %20?
However %20 now works in query parameters as well, which is why rawurlencode is always safer. However the plus sign tends to be used where user experience of editing and readability of query parameters matter.
Note that this means rawurldecode does not decode + into spaces (http://au2.php.net/manual/en/function.rawurldecode.php). This is why the $_GET is always automatically passed through urldecode, which means that + and %20 are both decoded into spaces.
If you want the encoding and decoding to be consistent between inputs and outputs and you have selected to always use + and not %20 for query parameters, then urlencode is fine for query parameters (key and value).
The conclusion is:
Path Segments - always use rawurlencode/rawurldecode
Query Parameters - for decoding always use urldecode (done automatically), for encoding, both rawurlencode or urlencode is fine, just choose one to be consistent, especially when comparing URLs.
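Putting that conclusion into code, one possible sketch uses rawurlencode for the path segment and http_build_query for the query string, whose fourth argument selects between the two space styles. The URL pieces are borrowed from the "lady gaga" example earlier on this page.

```php
<?php
// Path segments: always rawurlencode.
$path = '/category/' . rawurlencode('latest songs') . '/search';
$params = ['q' => 'lady gaga'];

// Query string, form style: spaces become "+" (same as urlencode).
$plusStyle = $path . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC1738);

// Query string, RFC 3986 style: spaces become "%20" (same as rawurlencode).
$percentStyle = $path . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);

echo $plusStyle, PHP_EOL;    // /category/latest%20songs/search?q=lady+gaga
echo $percentStyle, PHP_EOL; // /category/latest%20songs/search?q=lady%20gaga

// Either form decodes to the same value on the receiving side,
// because query parsing treats both "+" and "%20" as a space.
parse_str(parse_url($plusStyle, PHP_URL_QUERY), $fromPlus);
parse_str(parse_url($percentStyle, PHP_URL_QUERY), $fromPercent);
var_dump($fromPlus === $fromPercent); // bool(true)
```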
simple
* rawurlencode the path
- path is the part before the "?"
- spaces must be encoded as %20
* urlencode the query string
- Query string is the part after the "?"
- spaces are better encoded as "+"
= rawurlencode is more compatible generally
