is PHP str_word_count() multibyte safe? - php

I want to use str_word_count() on a UTF-8 string.
Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).
But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.
So I guess I want to know...
Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?
Are there any equivalent 'space' characters in UTF-8, which are not ASCII " " (space)?
This is where the problem might lie I guess.

I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:
Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)
And perhaps as well:
Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
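A quick way to double-check byte sequences like these is to dump them with bin2hex(). This is only a small sketch; it assumes PHP 7 or later for the "\u{...}" escape syntax, and the variable names are purely illustrative.

```php
<?php
// Character name => the character itself, written with PHP 7+ Unicode escapes.
$spaces = [
    'NO-BREAK SPACE'      => "\u{00A0}",
    'NEXT LINE (NEL)'     => "\u{0085}",
    'LINE SEPARATOR'      => "\u{2028}",
    'PARAGRAPH SEPARATOR' => "\u{2029}",
];

$hex = [];
foreach ($spaces as $name => $char) {
    // bin2hex() shows the raw UTF-8 bytes that make up the string.
    $hex[$name] = bin2hex($char);
    echo $name, ': ', $hex[$name], "\n";
}
```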
Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.
If we want to put this to a test, we can set the locale to UTF-8 and pass in a string that is invalid as UTF-8 because it contains a lone \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, hence not multibyte safe (the question does not define the term any more precisely):
<?php
/**
* is PHP str_word_count() multibyte safe?
* @link https://stackoverflow.com/q/8290537/367456
*/
echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";
$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);
var_dump($result);
Output:
New Locale: en_US.utf8
array(3) {
[0]=>
string(5) "aword"
[6]=>
string(5) "bword"
[12]=>
string(5) "aword"
}
As this demo shows, the function completely fails to keep the locale promise it makes on the manual page. (I neither wonder nor moan about this; most often, if you read that a function is locale specific in PHP, run for your life and find one that is not.) I exploit that here to demonstrate that it does nothing at all regarding the UTF-8 character encoding.
Instead, for UTF-8 you should take a look at the PCRE extension:
Matching Unicode letter characters in PCRE/PHP
PCRE has a good understanding of Unicode, and of UTF-8 in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
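As a rough illustration of the PCRE approach, the sketch below counts runs of Unicode letters. The function name utf8_letter_runs() is made up for this example, and treating "letters plus combining marks" as a word is an assumption, not a definition taken from the manual.

```php
<?php
// Hypothetical helper, not part of PHP or of the linked answer.
function utf8_letter_runs(string $text): int
{
    // \p{L} matches any Unicode letter and \p{M} any combining mark;
    // the u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{M}]+/u', $text);
    if ($count === false) {
        // preg_match_all() returns false on errors, e.g. when the subject is not valid UTF-8.
        throw new InvalidArgumentException('PCRE error code ' . preg_last_error());
    }
    return $count;
}

// NO-BREAK SPACE and LINE SEPARATOR separate words just like the ASCII space does.
echo utf8_letter_runs("aword\u{00A0}bword\u{2028}cword dword"), "\n"; // 4
```

Unlike the str_word_count() demo above, the lone \xA0 byte is rejected as invalid input here instead of being silently treated as a separator.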

About the "template answer" - I don't get the demand "working faster". We're not talking about long times or lot of counts here, so who cares if it takes some milliseconds longer or not?
However, a str_word_count working with soft hyphen:
function my_word_count($str) {
return str_word_count(str_replace("\xC2\xAD",'', $str));
}
a function that complies with the asserts (but is probably not faster than str_word_count):
function my_word_count($str) {
$mystr = str_replace("\xC2\xAD",'', $str); // soft hyphen encoded in UTF-8
return preg_match_all('~[\p{L}\'\-]+~u', $mystr); // regex expecting UTF-8
}
The preg function is essentially the same as what's already been proposed, except that a) it already returns a count, so there is no need to supply matches, which should make it faster, and b) there really should not be an iconv fallback, IMO.
About a comment:
I can see that your PCRE functions are wrost (performance) than my
preg_word_count() because need a str_replace that you not need:
'~[^\p{L}\'-\xC2\xAD]+~u' works fine (!).
I considered that a different thing: string replace will only remove the multibyte character, but that regex of yours will deal with \xC2 and \xAD in any order they might appear, which is wrong. Consider a registered sign, which is \xC2\xAE.
However, now that I think about it, due to the way valid UTF-8 works, it wouldn't really matter, so that should be usable equally well. (With the u modifier, PCRE reads \xAD in the pattern as the code point U+00AD, the soft hyphen itself, and \xC2 as U+00C2, which is a letter anyway.) So we can just have the function
function my_word_count($str) {
return preg_match_all('~[\p{L}\'\-\xC2\xAD]+~u', $str); // regex expecting UTF-8
}
without any need for matches or other replacements.
About str_word_count(str_replace("\xC2\xAD",'', $str));, if is stable
with UTF8, is good, but seems is not.
If you read this thread, you'll know str_replace is safe if you stick to valid UTF-8 strings. I didn't see any evidence in your link of the contrary.
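To make the soft hyphen handling explicit, here is a hedged variant of the functions above that names the code point with \x{AD} instead of spelling out its UTF-8 bytes. The name count_words_utf8() is hypothetical, and returning a negative number on PCRE errors simply mirrors the preg_word_count() convention discussed in this thread.

```php
<?php
// Hypothetical name; a variant of the my_word_count() functions above.
function count_words_utf8(string $text): int
{
    // The pattern is single-quoted, so PHP passes \x{AD} through to PCRE untouched.
    // With the u modifier, PCRE reads it as code point U+00AD (SOFT HYPHEN).
    $count = preg_match_all('~[\p{L}\'\x{AD}-]+~u', $text);
    // Report PCRE failures (such as invalid UTF-8) as a negative number.
    return $count === false ? -preg_last_error() : $count;
}

echo count_words_utf8("(five-six se\xC2\xADven)"), "\n"; // 2
```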

EDITED (to show new clues): there is a possible solution using str_word_count() with PHP v5.1!
function my_word_count($str, $myLangChars="àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ") {
return str_word_count($str, 0, $myLangChars);
}
but it is not 100% complete: I tried to add \xC2\xAD (the SHy, SOFT HYPHEN character), which must be a word component in any language, to $myLangChars, and it does not work.
Another solution, not as fast but complete and flexible (extracted from here), is based on the PCRE library and has an option to mimic the str_word_count() behaviour on input that is not valid UTF-8:
/**
* Like str_word_count() but showing how preg can do the same.
* This function is most flexible but not faster than str_word_count.
* @param $wRgx the "word regular expression" as defined by user.
* @param $triggError changes behaviour causing error event.
* @param $OnBadUtfTryAgain when true mimic the str_word_count behaviour.
* @return 0 or positive integer as word-count, negative as PCRE error.
*/
function preg_word_count($s,$wRgx='/[-\'\p{L}\xC2\xAD]+/u', $triggError=true,
$OnBadUtfTryAgain=true) {
if ( preg_match_all($wRgx,$s,$m) !== false )
return count($m[0]);
else {
$lastError = preg_last_error();
$chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
if ($OnBadUtfTryAgain && $chkUtf8)
return preg_word_count(
iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
);
elseif ($triggError) trigger_error(
$chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
E_USER_NOTICE
);
return -$lastError;
}
}
(TEMPLATE ANSWER) help for bounty!
(this is not an answer but a help for the bounty, because I can neither edit nor duplicate the question)
We want to count "real-world words" in a UTF-8 Latin text.
FOR BOUNTY, WE NEED:
a function that complies with the asserts below and is faster than str_word_count;
or str_word_count working with the SHy character (how to?);
or preg_word_count working faster (using preg_replace? word-separator regular expression?).
ASSERTS
Suppose that a "multibyte safe" function my_word_count() exists; then the following asserts must be true:
assert_options(ASSERT_ACTIVE, 1);
$text = "1,2,3,4=0 (1 2 3 4)=0 (... ,.)=0 (2.5±0.1; 0.5±0.2)=0";
assert( my_word_count($text)==0 ); // no word there
$text = "(one two,three;four)=4 (five-six se\xC2\xADven)=2";
assert( my_word_count($text)==6 ); // hyphen merges two words
$text = "(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1";
assert( my_word_count($text)==4 ); // a UTF8 case
$text = "(ÍSÔ9000-X, ISÔ 9000-X, ÍSÔ-9000-X)=6"; //Codes are words?
assert( my_word_count($text)==6 ); // suppose no: X is another word

All it does is count the number of spaces, or words in between. If you're curious, you can just make your own counting function using explode and count.
Any time the ASCII space byte is found, it splits, and that's all there really is to it.
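A counting function built from explode and count, as suggested here, might look like the sketch below (the helper name is hypothetical). Note that its result is not always the same as that of str_word_count(), which per the manual looks for runs of alphabetic characters (plus ' and -) rather than for spaces.

```php
<?php
// Hypothetical helper: count the non-empty pieces between ASCII spaces.
function space_separated_count(string $text): int
{
    // array_filter() with 'strlen' drops the empty pieces that repeated spaces produce.
    return count(array_filter(explode(' ', $text), 'strlen'));
}

$sample = 'one two,three';
echo space_separated_count($sample), "\n"; // 2
echo str_word_count($sample), "\n";        // 3
```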

Related

How can I use strlen in php for Persian?

I have this code:
$string = 'علی';
echo strlen($string);
Since $string has 3 Persian characters, output must be 3 but I get 6.
علی has 3 characters. Why is my output 6?
How can I use strlen() in PHP for Persian and get the real character count?
Use mb_strlen
Returns the number of characters in string str having character encoding (the second parameter) encoding. A multi-byte character is counted as 1.
Since your 3 characters are all multi-byte, you get 6 returned with strlen, but this returns 3 as expected.
echo mb_strlen($string,'utf-8');
Note
It's important not to underestimate the power of this method and any similar alternatives. For example, one could be inclined to say: OK, if the characters are multi-byte, then just get the length with strlen and divide it by 2. But that will only work if every character of your string takes exactly two bytes; even a period . will invalidate the count. For example this
echo mb_strlen('علی.','utf-8');
Returns 4 which is correct. So this function is not only taking the whole length and dividing it by 2, it counts 1 for every multi-byte character and 1 for every single-byte character.
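The byte count versus character count difference can be seen side by side in a short sketch. The word is written with Unicode escapes here, assuming it consists of the letters U+0639, U+0644 and U+06CC; the preg_match_all() fallback is an addition for installations without mbstring.

```php
<?php
// The word from the question plus a period: three 2-byte letters and one 1-byte character.
$mixed = "\u{0639}\u{0644}\u{06CC}.";

$bytes = strlen($mixed);
// Fall back to a PCRE-based count when the mbstring extension is not loaded.
$chars = function_exists('mb_strlen')
    ? mb_strlen($mixed, 'UTF-8')
    : preg_match_all('/./su', $mixed);

echo $bytes, ' bytes, ', $chars, " characters\n"; // 7 bytes, 4 characters
```

Dividing 7 by 2 gives 3.5, which shows why the "divide by 2" shortcut breaks as soon as a single-byte character appears.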
Note2:
It looks like you decided not to use this method because the mbstring extension is not enabled by default in old PHP versions and you might have decided not to try enabling it :) For future readers, though: it is not difficult, and it's advisable to enable it if you are dealing with multi-byte characters, as it's not only the length that you might need to deal with. See Manual
try this:
function ustrlen($text)
{
if(function_exists('mb_strlen'))
return mb_strlen( $text , 'utf-8' );
return count(preg_split('//u', $text)) - 2;
}
it will work for any php version.
mb_strlen function is your friend
$string = 'علی';
echo mb_strlen($string, 'utf8');
As of PHP5, iconv_strlen() can be used (as described in php.net, it returns the character count of a string, so probably it's the best choice):
iconv_strlen("علی");
// 3
Based on this answer by chernyshevsky@hotmail.com, you can try this (note that utf8_decode() is deprecated as of PHP 8.2):
function string_length (string $string) : int {
return strlen(utf8_decode($string));
}
string_length("علی");
// 3
Also, as others answered, you can use mb_strlen():
mb_strlen("علی");
// 3
Notes
There is a small difference between them for invalid input:
iconv_strlen("a\xCC\r"); // A notice
string_length("a\xCC\r"); // 3
mb_strlen("a\xCC\r"); // 2
Performance: mb_strlen() is the fastest. There is no real difference in performance between iconv_strlen() and string_length(), but amazingly, mb_strlen() is about 9 times faster than both (as I tested)!

Issue with string comparison in PHP

I have two strings with seemingly the same values. One is stored as a key in an array, the other a value in another different array. I compare the two using ==, ===, and strcmp. All treat them as different strings. I do a var_dump and this is what I get.
string(17) "Valentine’s Day"
string(15) "Valentine's Day"
Does anyone have any idea why the first string would be 17 characters and the second 15?
Update: This is slightly more obvious now that I have pasted this out of my editor, whose font made the two different apostrophes almost indistinguishable.
The first string contains a Unicode character for the apostrophe while the second string just has a regular ASCII ' character.
The Unicode character takes up more space.
If you run the PHP ord() function on each of those characters you'll see that you get different values for each:
echo ord("’"); //226 This is just the first byte of the multi-byte character (see comments below for details from ircmaxell)
echo ord("'"); //39 (0x27)
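Since ord() only reports the first byte, dumping all bytes with bin2hex() makes the two-byte difference between 17 and 15 easier to see. This sketch assumes the curly apostrophe is U+2019 (RIGHT SINGLE QUOTATION MARK) encoded as UTF-8.

```php
<?php
$curly    = "Valentine\u{2019}s Day"; // U+2019 RIGHT SINGLE QUOTATION MARK (assumed)
$straight = "Valentine's Day";        // U+0027 APOSTROPHE

// strlen() counts bytes, which is what var_dump() reports as well.
echo strlen($curly), ' ', strlen($straight), "\n"; // 17 15

// Three bytes for the curly apostrophe, one for the ASCII one.
echo bin2hex("\u{2019}"), ' ', bin2hex("'"), "\n"; // e28099 27
```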
As a complement to @Mark's answer above, which is right (the ’ is a multi-byte character, most probably UTF-8, while ' is not): you can easily convert it to ASCII (or ISO-8859-1) using iconv, for example:
echo iconv('utf-8', 'ascii//TRANSLIT', $str);
Note: Not all characters can be transformed from multi-byte to ASCII or latin1. You can use //IGNORE to have them removed from the resulting string.
’ != '
mainly. If you don't want this to be an issue, you could do something like this:
if (str_replace('’', '\'', "Valentine’s Day") == "Valentine's Day") {
// the two strings now compare as equal
}

Parsing multibyte string in PHP

I would like to write a (HTML) parser based on a state machine, but I have doubts about how to actually read/use the input. I decided to load the whole input into one string and then work with it as with an array, holding its index as the current parsing position.
There would be no problems with single-byte encoding, but in multi-byte encoding each value does not represent a character, but a byte of a character.
Example:
$mb_string = 'žščř'; //4 multi-byte characters in UTF-8
for($i=0; $i < 4; $i++)
{
echo $mb_string[$i], PHP_EOL;
}
Outputs:
Ĺ
ž
Ĺ
Ą
This means I cannot iterate through the string in a loop to check single characters, because I never know if I am in the middle of a character or not.
So the questions are:
How do I read a single character from a string in a multi-byte safe and performance-friendly way?
Is it a good idea to work with the string as if it were an array in this case?
How would you read the input?
http://php.net/mb_string is the thing you're looking for
just mb_substr characters one by one
not until PHP6
what input exactly? The usual way in general
mb_internal_encoding("UTF-8");
$mb_string = 'žščř';
$l=mb_strlen($mb_string);
for($i=0;$i<$l;$i++){
print(mb_substr($mb_string,$i,1)."<br/>");
}
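For a state machine that needs to index characters repeatedly, one option is to split the string into an array of characters once and iterate over that, instead of calling mb_substr() for every position. The helper below is a hypothetical sketch of that idea and assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters in one pass.
function utf8_chars(string $s): array
{
    // An empty pattern with the u modifier matches between code points;
    // PREG_SPLIT_NO_EMPTY drops the empty strings at both ends.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $chars;
}

foreach (utf8_chars('žščř') as $i => $char) {
    echo $i, ': ', $char, PHP_EOL;
}
```

The resulting array can then be addressed by index, which matches the "string as an array plus current position" design from the question.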
Without using the mb_-related functions, you can still handle multi-byte encoded strings with the standard substring functions, as long as you read in multiples of the number of bytes used to encode each character.
For example, for a UTF-8 encoded string whose characters each take 2 bytes, if you need the first character from the string
$string = 'žščř'; //4 multi-byte characters in UTF-8
You have to get the $string[0] AND $string[1] values, so you are actually looking for the substring covering indexes 0 and 1 (for the first character).
Note that $string[0] or $string[N] will reference the first (or Nth) byte of the multi-byte string.
regards,

php regular expression to filter out junk

So I have an interesting problem: I have a string, and for the most part I know what to expect:
http://www.someurl.com/st=????????
Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ
Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.
The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward, as many as I need to fill up the 8 character quota. Is there a regex way to do this, or will I need to roll up my sleeves and go nested-loop style?
update:
To clear up some confusion, I get an input string that's like this:
[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????
except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I have been able to find patterns in neither). Usually the ?s are all together hence me just grabbing the last 8 chars, but sometimes they aren't which results in some missing data and returned garbage :-\
$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
$clean = join(
array_filter(
str_split($var, 1),
function ($char) {
return (
array_key_exists(
$char,
array_flip(array_merge(
range('A','Z'),
range('a','z'),
range((string)'0',(string)'9'),
array(':','.','/','-','_')
))
)
);
}
)
);
Hah, that was a joke. Here's a regex for you:
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
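The fallback strategy described in the question (take the trailing run of upper-case letters and digits, and top it up from after st= if it is too short) can be sketched roughly as follows. The function name and the $want parameter are hypothetical, and the sketch assumes the st= marker itself survives intact; garbage that happens to contain upper-case letters or digits will still fool it.

```php
<?php
// Hypothetical helper implementing the fallback strategy from the question.
function extract_code(string $s, int $want = 8): ?string
{
    // Trailing run of upper-case letters and digits (may be empty, so this always matches).
    preg_match('/[A-Z0-9]*\z/', $s, $m);
    $tail = $m[0];
    if (strlen($tail) >= $want) {
        return substr($tail, -$want);
    }

    $start = strpos($s, 'st=');
    if ($start === false) {
        return null; // the marker itself was broken up by garbage
    }

    // Everything between "st=" and the trailing run, stripped of non-candidates.
    $middle = substr($s, $start + 3, strlen($s) - $start - 3 - strlen($tail));
    $candidates = preg_replace('/[^A-Z0-9]/', '', $middle);

    $missing = $want - strlen($tail);
    if (strlen($candidates) < $missing) {
        return null; // not enough characters to fill the quota
    }
    return substr($candidates, 0, $missing) . $tail;
}

var_dump(extract_code("http://www.someurl.com/st=AB\xFE\xEEx;CD1234")); // string(8) "ABCD1234"
```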
As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":
__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__
What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().
You can use this regular expression :
if (preg_match('/[\'^£$%&*()}{@#~?><>,|=_+¬-]/', $string) ==1)

urlencode vs rawurlencode?

If I want to create a URL using a variable I have two choices to encode the string. urlencode() and rawurlencode().
What exactly are the differences and which is preferred?
It will depend on your purpose. If interoperability with other systems is important then it seems rawurlencode is the way to go. The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode).
rawurlencode follows RFC 1738 prior to PHP 5.3.0 and RFC 3986 afterwards (see http://us2.php.net/manual/en/function.rawurlencode.php)
Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
Note on RFC 3986 vs 1738. rawurlencode prior to php 5.3 encoded the tilde character (~) according to RFC 1738. As of PHP 5.3, however, rawurlencode follows RFC 3986 which does not require encoding tilde characters.
urlencode encodes spaces as plus signs (not as %20 as done in rawurlencode)(see http://us2.php.net/manual/en/function.urlencode.php)
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
This corresponds to the definition for application/x-www-form-urlencoded in RFC 1866.
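A small comparison makes the two differences (space and tilde) visible. The sample value is arbitrary, and the output assumes PHP 5.3 or later, where rawurlencode() leaves ~ unencoded.

```php
<?php
$value = 'a b~c+d';

echo rawurlencode($value), "\n"; // a%20b~c%2Bd
echo urlencode($value), "\n";    // a+b%7Ec%2Bd
```

In both cases a literal + in the input becomes %2B, so it cannot be confused with an encoded space.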
Additional Reading:
You may also want to see the discussion at http://bytes.com/groups/php/5624-urlencode-vs-rawurlencode.
Also, RFC 2396 is worth a look. RFC 2396 defines valid URI syntax. The main part we're interested in is from 3.4 Query Component:
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
As you can see, the + is a reserved character in the query string and thus would need to be encoded as per RFC 3986 (as in rawurlencode).
Proof is in the source code of PHP.
I'll take you through a quick process of how to find out this sort of thing on your own in the future any time you want. Bear with me, there'll be a lot of C source code you can skim over (I explain it). If you want to brush up on some C, a good place to start is our SO wiki.
Download the source (or use https://heap.space/ to browse it online), grep all the files for the function name, you'll find something such as this:
PHP 5.3.6 (most recent at time of writing) describes the two functions in their native C code in the file url.c.
RawUrlEncode()
PHP_FUNCTION(rawurlencode)
{
char *in_str, *out_str;
int in_str_len, out_str_len;
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
&in_str_len) == FAILURE) {
return;
}
out_str = php_raw_url_encode(in_str, in_str_len, &out_str_len);
RETURN_STRINGL(out_str, out_str_len, 0);
}
UrlEncode()
PHP_FUNCTION(urlencode)
{
char *in_str, *out_str;
int in_str_len, out_str_len;
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
&in_str_len) == FAILURE) {
return;
}
out_str = php_url_encode(in_str, in_str_len, &out_str_len);
RETURN_STRINGL(out_str, out_str_len, 0);
}
Okay, so what's different here?
They both are in essence calling two different internal functions respectively: php_raw_url_encode and php_url_encode
So go look for those functions!
Lets look at php_raw_url_encode
PHPAPI char *php_raw_url_encode(char const *s, int len, int *new_length)
{
register int x, y;
unsigned char *str;
str = (unsigned char *) safe_emalloc(3, len, 1);
for (x = 0, y = 0; len--; x++, y++) {
str[y] = (unsigned char) s[x];
#ifndef CHARSET_EBCDIC
if ((str[y] < '0' && str[y] != '-' && str[y] != '.') ||
(str[y] < 'A' && str[y] > '9') ||
(str[y] > 'Z' && str[y] < 'a' && str[y] != '_') ||
(str[y] > 'z' && str[y] != '~')) {
str[y++] = '%';
str[y++] = hexchars[(unsigned char) s[x] >> 4];
str[y] = hexchars[(unsigned char) s[x] & 15];
#else /*CHARSET_EBCDIC*/
if (!isalnum(str[y]) && strchr("_-.~", str[y]) != NULL) {
str[y++] = '%';
str[y++] = hexchars[os_toascii[(unsigned char) s[x]] >> 4];
str[y] = hexchars[os_toascii[(unsigned char) s[x]] & 15];
#endif /*CHARSET_EBCDIC*/
}
}
str[y] = '\0';
if (new_length) {
*new_length = y;
}
return ((char *) str);
}
And of course, php_url_encode:
PHPAPI char *php_url_encode(char const *s, int len, int *new_length)
{
register unsigned char c;
unsigned char *to, *start;
unsigned char const *from, *end;
from = (unsigned char *)s;
end = (unsigned char *)s + len;
start = to = (unsigned char *) safe_emalloc(3, len, 1);
while (from < end) {
c = *from++;
if (c == ' ') {
*to++ = '+';
#ifndef CHARSET_EBCDIC
} else if ((c < '0' && c != '-' && c != '.') ||
(c < 'A' && c > '9') ||
(c > 'Z' && c < 'a' && c != '_') ||
(c > 'z')) {
to[0] = '%';
to[1] = hexchars[c >> 4];
to[2] = hexchars[c & 15];
to += 3;
#else /*CHARSET_EBCDIC*/
} else if (!isalnum(c) && strchr("_-.", c) == NULL) {
/* Allow only alphanumeric chars and '_', '-', '.'; escape the rest */
to[0] = '%';
to[1] = hexchars[os_toascii[c] >> 4];
to[2] = hexchars[os_toascii[c] & 15];
to += 3;
#endif /*CHARSET_EBCDIC*/
} else {
*to++ = c;
}
}
*to = 0;
if (new_length) {
*new_length = to - start;
}
return (char *) start;
}
One quick bit of knowledge before I move forward: EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. But basically, this means the EBCDIC byte 0x4c isn't the L it is in ASCII, it's actually a <. I'm sure you see the confusion here.
Both of these functions manage EBCDIC if the web server has defined it.
Also, they both use an array of chars (think string type) hexchars look-up to get some values, the array is described as such:
/* rfc1738:
...The characters ";",
"/", "?", ":", "#", "=" and "&" are the characters which may be
reserved for special meaning within a scheme...
...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL...
For added safety, we only leave -_. unencoded.
*/
static unsigned char hexchars[] = "0123456789ABCDEF";
Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.
Differences in ASCII:
URLENCODE:
Calculates a start/end length of the input string, allocates memory
Walks through a while-loop, increments until we reach the end of the string
Grabs the present character
If the character is equal to ASCII Char 0x20 (ie, a "space"), add a + sign to the output string.
If it's not a space, and it's also not alphanumeric, and also isn't a _, -, or . character, then we output a % sign to array position 0, look up the high nibble of c (c >> 4, a bitwise shift right by 4) in the hexchars array and assign that hex digit to position 1, and look up the low nibble (c & 15, a bitwise AND that keeps only the lowest four bits) and assign that hex digit to position 2. In the ASCII build the "not alphanumeric" test is written as explicit character-range comparisons; the isalnum(c) call and the os_toascii table (an array from Apache that translates EBCDIC bytes to ASCII codes) appear only in the CHARSET_EBCDIC branch. At the end, you'll end up with something encoded.
If it ends up it's not a space, it's alphanumeric or one of the _-. chars, it outputs exactly what it is.
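The URLENCODE steps above can be mirrored in a few lines of PHP, which may be easier to follow than the C source. This is only a sketch of the ASCII branch: the function name is hypothetical, and the $raw flag (which switches to the rawurlencode() rules) is an addition for comparison, not something found in the C code.

```php
<?php
// Hypothetical PHP port of the ASCII branch of php_url_encode(); $raw switches
// to the rawurlencode() rules (no "+" for spaces, "~" left alone).
function sketch_url_encode(string $s, bool $raw = false): string
{
    $hexchars = '0123456789ABCDEF';
    $safe = $raw ? '-_.~' : '-_.';
    $out = '';
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        $ch = $s[$i];
        $c = ord($ch);
        $alnum = ($c >= 0x30 && $c <= 0x39)   // 0-9
              || ($c >= 0x41 && $c <= 0x5A)   // A-Z
              || ($c >= 0x61 && $c <= 0x7A);  // a-z
        if (!$raw && $ch === ' ') {
            $out .= '+';
        } elseif ($alnum || strpos($safe, $ch) !== false) {
            $out .= $ch;
        } else {
            // High nibble, then low nibble, each mapped to a hex digit.
            $out .= '%' . $hexchars[$c >> 4] . $hexchars[$c & 15];
        }
    }
    return $out;
}

$input = "a b~c/\xC3\xA9";
var_dump(sketch_url_encode($input) === urlencode($input));          // bool(true)
var_dump(sketch_url_encode($input, true) === rawurlencode($input)); // bool(true)
```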
RAWURLENCODE:
Allocates memory for the string
Iterates over it based on length provided in function call (not calculated in function as with URLENCODE).
Note: Many programmers have probably never seen a for loop iterate this way, it's somewhat hackish and not the standard convention used with most for-loops, pay attention, it assigns x and y, checks for exit on len reaching 0, and increments both x and y. I know, it's not what you'd expect, but it's valid code.
Assigns the present character to a matching character position in str.
It checks whether the present character is alphanumeric or one of the _-.~ chars, and if it isn't, we do almost the same assignment as with URLENCODE where it performs lookups; however, we increment differently, using y++ rather than to[1]. This is because the strings are being built in different ways, but they reach the same goal at the end anyway.
When the loop's done and the length's gone, it actually terminates the string, assigning the \0 byte.
It returns the encoded string.
Differences:
UrlEncode checks for space, assigns a + sign, RawURLEncode does not.
Both terminate the output string with a \0 byte (UrlEncode with *to = 0, RawUrlEncode with str[y] = '\0'), so there is no difference there.
They iterate differently; one may be prone to overflow with malformed strings. I'm merely suggesting this and I haven't actually investigated.
They basically iterate differently, one assigns a + sign in the event of ASCII 0x20.
Differences in EBCDIC:
URLENCODE:
Same iteration setup as with ASCII
Still translating the "space" character to a + sign. Note-- I think this needs to be compiled in EBCDIC or you'll end up with a bug? Can someone edit and confirm this?
It checks whether the present char is not alphanumeric (isalnum(c)) and not one of the _, -, or . characters. (The explicit range comparisons of the ASCII branch, such as "less than A but greater than 9", cannot be used here because EBCDIC letters are not contiguous; yeah, EBCDIC is kinda messed up to work with.) If it matches, it does a similar lookup as found in the ASCII version, except that the byte is first translated through os_toascii so that the hex digits are those of the ASCII code.
RAWURLENCODE:
Same iteration setup as with ASCII
Same check as described in the EBCDIC version of URL Encode, with the exception that ~ is added to the characters that are meant to stay unencoded.
Same assignment as the ASCII RawUrlEncode
Still appending the \0 byte to the string before return.
Grand Summary
Both use the same hexchars lookup table
Both terminate the string with \0.
If you're working in EBCDIC I'd suggest using RawUrlEncode, as it manages the ~ that UrlEncode does not (this is a reported issue). It's worth noting that the space check compares against the ' ' character constant, which is 0x20 in an ASCII build and 0x40 in an EBCDIC build.
They iterate differently, one may be faster, one may be prone to memory or string based exploits.
UrlEncode makes a space into +, RawUrlEncode makes a space into %20 via array lookups.
Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.
Suggested implementations
Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, whereas urlencode does things the old school way, where + meant "space."
If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally double-encoding, or similar "oops" scenarios around this space/%20/+ issue.
If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode, however, I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred. Give it a shot if you're up for playing around, let us know how it worked out for you.
Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).
echo rawurlencode('http://www.google.com/index.html?id=asd asd');
yields
http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd%20asd
while
echo urlencode('http://www.google.com/index.html?id=asd asd');
yields
http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd+asd
The difference being the asd%20asd vs asd+asd
urlencode differs from RFC 1738 by encoding spaces as + instead of %20
One practical reason to choose one over the other is if you're going to use the result in another environment, for example JavaScript.
In PHP urlencode('test 1') returns 'test+1' while rawurlencode('test 1') returns 'test%201' as result.
But if you need to "decode" this in JavaScript using decodeURI() function then decodeURI("test+1") will give you "test+1" while decodeURI("test%201") will give you "test 1" as result.
In other words the space (" ") encoded by urlencode to plus ("+") in PHP will not be properly decoded by decodeURI in JavaScript.
In such cases the rawurlencode PHP function should be used.
I believe spaces must be encoded as:
%20 when used inside URL path component
+ when used inside URL query string component or form data (see 17.13.4 Form content types)
The following example shows the correct use of rawurlencode and urlencode:
echo "http://example.com"
. "/category/" . rawurlencode("latest songs")
. "/search?q=" . urlencode("lady gaga");
Output:
http://example.com/category/latest%20songs/search?q=lady+gaga
What happens if you encode path and query string components the other way round? For the following example:
http://example.com/category/latest+songs/search?q=lady%20gaga
The webserver will look for the directory latest+songs instead of latest songs
The query string parameter q will contain lady gaga
1. What exactly are the differences and
The only difference is in the way spaces are treated:
urlencode - based on legacy implementation converts spaces to +
rawurlencode - based on RFC 1738 translates spaces to %20
The reason for the difference is because + is reserved and valid (unencoded) in urls.
2. which is preferred?
I'd really like to see some reasons for choosing one over the other ... I want to be able to just pick one and use it forever with the least fuss.
Fair enough, I have a simple strategy that I follow when making these decisions which I will share with you in the hope that it may help.
I think it was the HTTP/1.1 specification RFC 2616 which called for "Tolerant applications"
Clients SHOULD be tolerant in parsing the Status-Line and servers
tolerant when parsing the Request-Line.
When faced with questions like these the best strategy is always to consume as much as possible and produce what is standards compliant.
So my advice is to use rawurlencode to produce standards compliant RFC 1738 encoded strings and use urldecode to be backward compatible and accommodate anything you may come across to consume.
Now you could just take my word for it, but let's prove it, shall we...
php > $url = <<<'EOD'
<<< > "Which, % of Alice's tasks saw $s @ earnings?"
<<< > EOD;
php > echo $url, PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > echo urlencode($url), PHP_EOL;
%22Which%2C+%25+of+Alice%27s+tasks+saw+%24s+%40+earnings%3F%22
php > echo rawurlencode($url), PHP_EOL;
%22Which%2C%20%25%20of%20Alice%27s%20tasks%20saw%20%24s%20%40%20earnings%3F%22
php > echo rawurldecode(urlencode($url)), PHP_EOL;
"Which,+%+of+Alice's+tasks+saw+$s+@+earnings?"
php > // oops that's not right???
php > echo urldecode(rawurlencode($url)), PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > // now that's more like it
It would appear that PHP had exactly this in mind. Even though I've never come across anyone refusing either of the two formats, I can't think of a better strategy to adopt as your de facto strategy, can you?
nJoy!
The difference is in the return values, i.e:
urlencode():
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
rawurlencode():
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 1738 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
The two are very similar, but the latter (rawurlencode) will replace spaces with a '%' and two hex digits, which is suitable for encoding passwords or such, where a '+' is not e.g.:
echo '<a href="ftp://user:', rawurlencode('foo @+%/'),
'@ftp.example.com/x.txt">';
//Outputs <a href="ftp://user:foo%20%40%2B%25%2F@ftp.example.com/x.txt">
urlencode: This differs from the » RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
Spaces encoded as %20 vs. +
The biggest reason I've seen to use rawurlencode() in most cases is because urlencode encodes text spaces as + (plus signs) where rawurlencode encodes them as the commonly-seen %20:
echo urlencode("red shirt");
// red+shirt
echo rawurlencode("red shirt");
// red%20shirt
I have specifically seen certain API endpoints that accept encoded text queries expect to see %20 for a space and as a result, fail if a plus sign is used instead. Obviously this is going to differ between API implementations and your mileage may vary.
I believe urlencode is for query parameters, whereas the rawurlencode is for the path segments. This is mainly due to %20 for path segments vs + for query parameters. See this answer which talks about the spaces: When to encode space to plus (+) or %20?
However %20 now works in query parameters as well, which is why rawurlencode is always safer. However the plus sign tends to be used where user experience of editing and readability of query parameters matter.
Note that this means rawurldecode does not decode + into spaces (http://au2.php.net/manual/en/function.rawurldecode.php). This is why the $_GET is always automatically passed through urldecode, which means that + and %20 are both decoded into spaces.
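The decoding difference is easy to check with a string that mixes both conventions. The sample value is arbitrary; parse_str() is used here as a stand-in for how a query string ends up in $_GET.

```php
<?php
$encoded = 'lady+gaga%20live';

echo urldecode($encoded), "\n";    // lady gaga live
echo rawurldecode($encoded), "\n"; // lady+gaga live

// parse_str() decodes a query string the same way, so + and %20 both become spaces.
parse_str('q=' . $encoded, $params);
echo $params['q'], "\n";           // lady gaga live
```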
If you want the encoding and decoding to be consistent between inputs and outputs and you have selected to always use + and not %20 for query parameters, then urlencode is fine for query parameters (key and value).
The conclusion is:
Path Segments - always use rawurlencode/rawurldecode
Query Parameters - for decoding always use urldecode (done automatically), for encoding, both rawurlencode or urlencode is fine, just choose one to be consistent, especially when comparing URLs.
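Putting the conclusion into code, a URL builder might encode path segments with rawurlencode() and leave the query string to http_build_query(), whose encoding type argument selects between %20 and +. The build_url() helper and its parameters are hypothetical and only meant as a sketch.

```php
<?php
// Hypothetical helper; the name and parameters are illustrative only.
function build_url(string $base, array $segments, array $query, int $encType = PHP_QUERY_RFC3986): string
{
    // Each path segment is encoded on its own so that the "/" separators stay literal.
    $path = implode('/', array_map('rawurlencode', $segments));
    // PHP_QUERY_RFC3986 encodes spaces as %20, PHP_QUERY_RFC1738 as +.
    $qs = http_build_query($query, '', '&', $encType);
    return $base . '/' . $path . ($qs === '' ? '' : '?' . $qs);
}

echo build_url('http://example.com', ['category', 'latest songs'], ['q' => 'lady gaga']), "\n";
// http://example.com/category/latest%20songs?q=lady%20gaga
echo build_url('http://example.com', ['category', 'latest songs'], ['q' => 'lady gaga'], PHP_QUERY_RFC1738), "\n";
// http://example.com/category/latest%20songs?q=lady+gaga
```

Whichever query encoding is chosen, using it consistently keeps URL comparisons predictable.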
simple
* rawurlencode the path
- path is the part before the "?"
- spaces must be encoded as %20
* urlencode the query string
- Query string is the part after the "?"
-spaces are better encoded as "+"
= rawurlencode is more compatible generally
