urlencode vs rawurlencode? - php

If I want to create a URL using a variable I have two choices to encode the string. urlencode() and rawurlencode().
What exactly are the differences and which is preferred?

It will depend on your purpose. If interoperability with other systems is important then it seems rawurlencode is the way to go. The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode).
rawurlencode follows RFC 1738 prior to PHP 5.3.0 and RFC 3986 afterwards (see http://us2.php.net/manual/en/function.rawurlencode.php)
Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
Note on RFC 3986 vs 1738. rawurlencode prior to php 5.3 encoded the tilde character (~) according to RFC 1738. As of PHP 5.3, however, rawurlencode follows RFC 3986 which does not require encoding tilde characters.
urlencode encodes spaces as plus signs (not as %20 as done in rawurlencode) (see http://us2.php.net/manual/en/function.urlencode.php)
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
This corresponds to the definition for application/x-www-form-urlencoded in RFC 1866.
Additional Reading:
You may also want to see the discussion at http://bytes.com/groups/php/5624-urlencode-vs-rawurlencode.
Also, RFC 2396 is worth a look. RFC 2396 defines valid URI syntax. The main part we're interested in is from 3.4 Query Component:
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
As you can see, the + is a reserved character in the query string and thus would need to be encoded as per RFC 3986 (as in rawurlencode).
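A quick way to see the space and tilde handling described above is to run both functions on the same input. This is only a minimal sketch: the sample string is made up for illustration, and the tilde result assumes PHP 5.3 or later.

```php
<?php
// Sample input chosen to contain a space, a tilde and a plus sign.
$value = 'a b~c+d';

$raw  = rawurlencode($value); // RFC 3986 style
$form = urlencode($value);    // application/x-www-form-urlencoded style

echo $raw, PHP_EOL;  // a%20b~c%2Bd
echo $form, PHP_EOL; // a+b%7Ec%2Bd
```

Both functions percent-encode the literal plus sign, so a + in the output of urlencode always stands for a space.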

Proof is in the source code of PHP.
I'll take you through a quick process of how to find out this sort of thing on your own in the future any time you want. Bear with me, there'll be a lot of C source code you can skim over (I explain it). If you want to brush up on some C, a good place to start is our SO wiki.
Download the source (or use https://heap.space/ to browse it online), grep all the files for the function name, you'll find something such as this:
PHP 5.3.6 (most recent at time of writing) describes the two functions in their native C code in the file url.c.
RawUrlEncode()
PHP_FUNCTION(rawurlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }
    out_str = php_raw_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}
UrlEncode()
PHP_FUNCTION(urlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }
    out_str = php_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}
Okay, so what's different here?
They both are in essence calling two different internal functions respectively: php_raw_url_encode and php_url_encode
So go look for those functions!
Let's look at php_raw_url_encode
PHPAPI char *php_raw_url_encode(char const *s, int len, int *new_length)
{
    register int x, y;
    unsigned char *str;
    str = (unsigned char *) safe_emalloc(3, len, 1);
    for (x = 0, y = 0; len--; x++, y++) {
        str[y] = (unsigned char) s[x];
#ifndef CHARSET_EBCDIC
        if ((str[y] < '0' && str[y] != '-' && str[y] != '.') ||
            (str[y] < 'A' && str[y] > '9') ||
            (str[y] > 'Z' && str[y] < 'a' && str[y] != '_') ||
            (str[y] > 'z' && str[y] != '~')) {
            str[y++] = '%';
            str[y++] = hexchars[(unsigned char) s[x] >> 4];
            str[y] = hexchars[(unsigned char) s[x] & 15];
#else /*CHARSET_EBCDIC*/
        if (!isalnum(str[y]) && strchr("_-.~", str[y]) != NULL) {
            str[y++] = '%';
            str[y++] = hexchars[os_toascii[(unsigned char) s[x]] >> 4];
            str[y] = hexchars[os_toascii[(unsigned char) s[x]] & 15];
#endif /*CHARSET_EBCDIC*/
        }
    }
    str[y] = '\0';
    if (new_length) {
        *new_length = y;
    }
    return ((char *) str);
}
And of course, php_url_encode:
PHPAPI char *php_url_encode(char const *s, int len, int *new_length)
{
    register unsigned char c;
    unsigned char *to, *start;
    unsigned char const *from, *end;
    from = (unsigned char *)s;
    end = (unsigned char *)s + len;
    start = to = (unsigned char *) safe_emalloc(3, len, 1);
    while (from < end) {
        c = *from++;
        if (c == ' ') {
            *to++ = '+';
#ifndef CHARSET_EBCDIC
        } else if ((c < '0' && c != '-' && c != '.') ||
                   (c < 'A' && c > '9') ||
                   (c > 'Z' && c < 'a' && c != '_') ||
                   (c > 'z')) {
            to[0] = '%';
            to[1] = hexchars[c >> 4];
            to[2] = hexchars[c & 15];
            to += 3;
#else /*CHARSET_EBCDIC*/
        } else if (!isalnum(c) && strchr("_-.", c) == NULL) {
            /* Allow only alphanumeric chars and '_', '-', '.'; escape the rest */
            to[0] = '%';
            to[1] = hexchars[os_toascii[c] >> 4];
            to[2] = hexchars[os_toascii[c] & 15];
            to += 3;
#endif /*CHARSET_EBCDIC*/
        } else {
            *to++ = c;
        }
    }
    *to = 0;
    if (new_length) {
        *new_length = to - start;
    }
    return (char *) start;
}
One quick bit of knowledge before I move forward: EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. Basically, this means the byte 0x4c in EBCDIC isn't the L it is in ASCII, it's actually a <. I'm sure you see the confusion here.
Both of these functions manage EBCDIC if the web server has defined it.
Also, they both use a char array (think string type) named hexchars as a look-up table to get the hex digits. The array is described as such:
/* rfc1738:
...The characters ";",
"/", "?", ":", "#", "=" and "&" are the characters which may be
reserved for special meaning within a scheme...
...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL...
For added safety, we only leave -_. unencoded.
*/
static unsigned char hexchars[] = "0123456789ABCDEF";
Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.
Differences in EBCDIC (the #else branches):
URLENCODE:
Calculates the start and end of the input string, allocates memory
Walks through a while-loop, incrementing until we reach the end of the string
Grabs the present character
If the character is a space (' '), adds a + sign to the output string. The ' ' literal takes the compiler's character set, so on an EBCDIC build it compares against the EBCDIC space.
If it's not a space, and it's also not alphanumeric (isalnum(c)), and also isn't a _, -, or . character, then we output a % sign to position 0, look up the present character c in the os_toascii array (a table from Apache that translates an EBCDIC character to its ASCII code), shift that value right by 4 bits to get the high nibble, and use it as an index into hexchars to fill position 1. Position 2 gets the same lookup, except we do a bitwise AND with 15 (0xF) to keep the low nibble. Each nibble is a number from 0 to 15, so each one picks a single hex digit, and you end up with %XX.
If it's alphanumeric or one of the _-. chars, it outputs exactly what it is.
RAWURLENCODE:
Allocates memory for the string
Iterates over it based on the length provided in the function call (rather than computing an end pointer as URLENCODE does).
Note: Many programmers have probably never seen a for loop iterate this way. It's somewhat hackish and not the standard convention used with most for-loops, so pay attention: it initializes x and y, exits when len reaches 0 (decrementing it on every pass), and increments both x and y. I know, it's not what you'd expect, but it's valid code.
Assigns the present character to a matching character position in str.
It is meant to check whether the present character is alphanumeric or one of the _-.~ chars, and if it isn't, do almost the same assignment as with URLENCODE where it performs lookups. (As quoted, the condition reads strchr(...) != NULL, which tests the opposite of the == NULL used in URLENCODE.) We increment differently, using y++ rather than to[1]; this is because the strings are being built in different ways, but they reach the same goal at the end anyway.
When the loop's done and the length's gone, it terminates the string, assigning the \0 byte.
It returns the encoded string.
Differences:
UrlEncode checks for space and assigns a + sign, RawUrlEncode does not.
Both terminate the string with a \0 byte: UrlEncode writes *to = 0, RawUrlEncode writes str[y] = '\0'.
They iterate differently (pointers versus indexes). One may be prone to overflow with malformed strings; I'm merely suggesting this and I haven't actually investigated. Both allocate three bytes per input byte plus one, which covers the worst case of every byte being encoded.
Differences in ASCII (the #ifndef CHARSET_EBCDIC branches):
URLENCODE:
Same iteration setup as with EBCDIC
Still translating the "space" character to a + sign.
It checks if the present char is a char before 0, with the exception of being a . or -, OR less than A but greater than char 9, OR greater than Z and less than a but not a _, OR greater than z. If it matches any of those, it does a similar lookup as found in the EBCDIC version (it just doesn't require a lookup in os_toascii).
RAWURLENCODE:
Same iteration setup as with EBCDIC
Same check as described in the ASCII version of URL Encode, with the exception that if it's greater than z, it leaves ~ unencoded.
Same assignment as the EBCDIC RawUrlEncode, minus the os_toascii lookup
Still appending the \0 byte to the string before return.
Grand Summary
Both use the same hexchars lookup table
Both terminate the string with \0.
If you're working in EBCDIC I'd suggest using RawUrlEncode, as it manages the ~ that UrlEncode does not (this is a reported issue). It's worth noting that a space is 0x20 in ASCII but 0x40 in EBCDIC; the ' ' literal in the C source compiles to whichever applies.
They iterate differently, one may be faster, one may be prone to memory or string based exploits.
UrlEncode makes a space into +, RawUrlEncode makes a space into %20 via array lookups.
Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.
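For readers who would rather not trace the C, the non-EBCDIC (#ifndef CHARSET_EBCDIC) branches can be approximated in a few lines of PHP. This is a rough sketch for illustration, not PHP's actual implementation; sketch_encode() and its $raw flag are made-up names.

```php
<?php
// Illustrative re-implementation of the non-EBCDIC branch; not PHP's real code.
function sketch_encode(string $s, bool $raw): string
{
    $hexchars = '0123456789ABCDEF';
    // The raw variant additionally leaves "~" unencoded.
    $safe = $raw ? '-_.~' : '-_.';
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $ch = $s[$i];
        $c = ord($ch);
        $isAlnum = ($c >= 0x30 && $c <= 0x39)   // 0-9
                || ($c >= 0x41 && $c <= 0x5A)   // A-Z
                || ($c >= 0x61 && $c <= 0x7A);  // a-z
        if (!$raw && $ch === ' ') {
            $out .= '+';
        } elseif ($isAlnum || strpos($safe, $ch) !== false) {
            $out .= $ch;
        } else {
            // High nibble, then low nibble, each used as an index into hexchars.
            $out .= '%' . $hexchars[$c >> 4] . $hexchars[$c & 15];
        }
    }
    return $out;
}

echo sketch_encode('a b~/', false), PHP_EOL; // a+b%7E%2F
echo sketch_encode('a b~/', true), PHP_EOL;  // a%20b~%2F
```

The only two places where the $raw flag matters are the space check and the extra ~ in the safe list, which matches the differences found in the source.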
Suggested implementations
Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, whereas urlencode does things the old school way, where + meant "space."
If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally decoding it a second time, or similar "oops" scenarios around this space/%20/+ issue.
If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode, however, I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred. Give it a shot if you're up for playing around, let us know how it worked out for you.
Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).
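The "oops" scenario around a literal + sign is easy to reproduce. The snippet below is a small sketch with a made-up sample value; it shows a safe round trip and then what happens when urldecode runs on text that is already decoded.

```php
<?php
$original = '1+1=2';

$encoded = rawurlencode($original);
echo $encoded, PHP_EOL;                          // 1%2B1%3D2
echo urldecode($encoded), PHP_EOL;               // 1+1=2 (round trip is safe)

// Decoding a second time treats the literal plus as a form-encoded space.
echo urldecode(urldecode($encoded)), PHP_EOL;    // 1 1=2
// rawurldecode leaves a literal plus alone.
echo rawurldecode(urldecode($encoded)), PHP_EOL; // 1+1=2
```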

echo rawurlencode('http://www.google.com/index.html?id=asd asd');
yields
http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd%20asd
while
echo urlencode('http://www.google.com/index.html?id=asd asd');
yields
http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd+asd
The difference being the asd%20asd vs asd+asd
urlencode differs from RFC 1738 by encoding spaces as + instead of %20

One practical reason to choose one over the other is if you're going to use the result in another environment, for example JavaScript.
In PHP urlencode('test 1') returns 'test+1' while rawurlencode('test 1') returns 'test%201' as result.
But if you need to "decode" this in JavaScript using decodeURI() function then decodeURI("test+1") will give you "test+1" while decodeURI("test%201") will give you "test 1" as result.
In other words the space (" ") encoded by urlencode to plus ("+") in PHP will not be properly decoded by decodeURI in JavaScript.
In such cases the rawurlencode PHP function should be used.

I believe spaces must be encoded as:
%20 when used inside URL path component
+ when used inside URL query string component or form data (see 17.13.4 Form content types)
The following example shows the correct use of rawurlencode and urlencode:
echo "http://example.com"
. "/category/" . rawurlencode("latest songs")
. "/search?q=" . urlencode("lady gaga");
Output:
http://example.com/category/latest%20songs/search?q=lady+gaga
What happens if you encode path and query string components the other way round? For the following example:
http://example.com/category/latest+songs/search?q=lady%20gaga
The webserver will look for the directory latest+songs instead of latest songs
The query string parameter q will contain lady gaga
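The path/query split can be wrapped in a small helper so each part always gets the matching encoder. This is a sketch under assumptions: build_url() is a hypothetical function, not a PHP built-in, and it relies on http_build_query() with the PHP_QUERY_RFC1738 and PHP_QUERY_RFC3986 constants (available since PHP 5.4).

```php
<?php
// build_url() is a hypothetical helper, not part of PHP.
function build_url(string $base, array $segments, array $query, int $encType = PHP_QUERY_RFC1738): string
{
    // Path segments: always percent-encode, so a space becomes %20.
    $path = implode('/', array_map('rawurlencode', $segments));
    // Query string: PHP_QUERY_RFC1738 gives "+", PHP_QUERY_RFC3986 gives "%20".
    return $base . '/' . $path . '?' . http_build_query($query, '', '&', $encType);
}

$segments = ['category', 'latest songs', 'search'];
$query = ['q' => 'lady gaga'];

echo build_url('http://example.com', $segments, $query), PHP_EOL;
// http://example.com/category/latest%20songs/search?q=lady+gaga
echo build_url('http://example.com', $segments, $query, PHP_QUERY_RFC3986), PHP_EOL;
// http://example.com/category/latest%20songs/search?q=lady%20gaga
```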

1. What exactly are the differences and
The only difference is in the way spaces are treated:
urlencode - based on legacy implementation converts spaces to +
rawurlencode - based on RFC 1738 translates spaces to %20
The reason for the difference is because + is reserved and valid (unencoded) in urls.
2. which is preferred?
I'd really like to see some reasons for choosing one over the other ... I want to be able to just pick one and use it forever with the least fuss.
Fair enough, I have a simple strategy that I follow when making these decisions which I will share with you in the hope that it may help.
I think it was the HTTP/1.1 specification RFC 2616 which called for "Tolerant applications"
Clients SHOULD be tolerant in parsing the Status-Line and servers
tolerant when parsing the Request-Line.
When faced with questions like these the best strategy is always to consume as much as possible and produce what is standards compliant.
So my advice is to use rawurlencode to produce standards compliant RFC 1738 encoded strings and use urldecode to be backward compatible and accommodate anything you may come across to consume.
Now you could just take my word for it but let's prove it, shall we...
php > $url = <<<'EOD'
<<< > "Which, % of Alice's tasks saw $s @ earnings?"
<<< > EOD;
php > echo $url, PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > echo urlencode($url), PHP_EOL;
%22Which%2C+%25+of+Alice%27s+tasks+saw+%24s+%40+earnings%3F%22
php > echo rawurlencode($url), PHP_EOL;
%22Which%2C%20%25%20of%20Alice%27s%20tasks%20saw%20%24s%20%40%20earnings%3F%22
php > echo rawurldecode(urlencode($url)), PHP_EOL;
"Which,+%+of+Alice's+tasks+saw+$s+@+earnings?"
php > // oops that's not right???
php > echo urldecode(rawurlencode($url)), PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > // now that's more like it
It would appear that PHP had exactly this in mind. Even though I've never come across anyone refusing either of the two formats, I can't think of a better strategy to adopt as your de facto strategy, can you?
nJoy!

The difference is in the return values, i.e:
urlencode():
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits and
spaces encoded as plus (+) signs. It
is encoded the same way that the
posted data from a WWW form is
encoded, that is the same way as in
application/x-www-form-urlencoded
media type. This differs from the »
RFC 1738 encoding (see rawurlencode())
in that for historical reasons, spaces
are encoded as plus (+) signs.
rawurlencode():
Returns a string in which all
non-alphanumeric characters except -_.
have been replaced with a percent (%)
sign followed by two hex digits. This
is the encoding described in » RFC
1738 for protecting literal characters
from being interpreted as special URL
delimiters, and for protecting URLs
from being mangled by transmission
media with character conversions (like
some email systems).
The two are very similar, but the latter (rawurlencode) will replace spaces with a '%' and two hex digits, which is suitable for encoding passwords or such, where a '+' is not e.g.:
echo '<a href="ftp://user:', rawurlencode('foo @+%/'),
'@ftp.example.com/x.txt">';
//Outputs <a href="ftp://user:foo%20%40%2B%25%2F@ftp.example.com/x.txt">

urlencode: This differs from the
» RFC 1738 encoding (see
rawurlencode()) in that for historical
reasons, spaces are encoded as plus
(+) signs.

Spaces encoded as %20 vs. +
The biggest reason I've seen to use rawurlencode() in most cases is because urlencode encodes text spaces as + (plus signs) where rawurlencode encodes them as the commonly-seen %20:
echo urlencode("red shirt");
// red+shirt
echo rawurlencode("red shirt");
// red%20shirt
I have specifically seen certain API endpoints that accept encoded text queries expect to see %20 for a space and as a result, fail if a plus sign is used instead. Obviously this is going to differ between API implementations and your mileage may vary.

I believe urlencode is for query parameters, whereas the rawurlencode is for the path segments. This is mainly due to %20 for path segments vs + for query parameters. See this answer which talks about the spaces: When to encode space to plus (+) or %20?
However %20 now works in query parameters as well, which is why rawurlencode is always safer. However the plus sign tends to be used where user experience of editing and readability of query parameters matter.
Note that this means rawurldecode does not decode + into spaces (http://au2.php.net/manual/en/function.rawurldecode.php). This is why the $_GET is always automatically passed through urldecode, which means that + and %20 are both decoded into spaces.
If you want the encoding and decoding to be consistent between inputs and outputs and you have selected to always use + and not %20 for query parameters, then urlencode is fine for query parameters (key and value).
The conclusion is:
Path Segments - always use rawurlencode/rawurldecode
Query Parameters - for decoding always use urldecode (done automatically), for encoding, both rawurlencode or urlencode is fine, just choose one to be consistent, especially when comparing URLs.
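The decoding side of this conclusion can be checked with parse_str(), which decodes a query string the same way PHP fills $_GET. The query string below is a made-up example that mixes both space encodings.

```php
<?php
// Made-up query string: "a" uses +, "b" uses %20 for the space.
parse_str('a=lady+gaga&b=lady%20gaga', $params);

echo $params['a'], PHP_EOL; // lady gaga
echo $params['b'], PHP_EOL; // lady gaga

// rawurldecode keeps the plus, urldecode turns it into a space.
echo rawurldecode('lady+gaga'), PHP_EOL; // lady+gaga
echo urldecode('lady+gaga'), PHP_EOL;    // lady gaga
```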

simple
* rawurlencode the path
  - path is the part before the "?"
  - spaces must be encoded as %20
* urlencode the query string
  - query string is the part after the "?"
  - spaces are better encoded as "+"
= rawurlencode is more compatible generally

PHP: Questionmark and other special characters break my rename function, how to fix? [duplicate]

I know that / is illegal in Linux, and the following are illegal in Windows
(I think) * . " / \ [ ] : ; | ,
What else am I missing?
I need a comprehensive guide, however, and one that takes into account
double-byte characters. Linking to outside resources is fine with me.
I need to first create a directory on the filesystem using a name that may
contain forbidden characters, so I plan to replace those characters with
underscores. I then need to write this directory and its contents to a zip file
(using Java), so any additional advice concerning the names of zip directories
would be appreciated.
The forbidden printable ASCII characters are:
Linux/Unix:
/ (forward slash)
Windows:
< (less than)
> (greater than)
: (colon - sometimes works, but is actually NTFS Alternate Data Streams)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
Non-printable characters
If your data comes from a source that would permit non-printable characters then there is more to check for.
Linux/Unix:
0 (NULL byte)
Windows:
0-31 (ASCII control characters)
Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.
Reserved file names
The following filenames are reserved:
Windows:
CON, PRN, AUX, NUL
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
(both on their own and with arbitrary file extensions, e.g. LPT1.txt).
Other rules
Windows:
Filenames cannot end in a space or dot.
macOS:
You didn't ask for it, but just in case: Colon : and forward slash / depending on context are not permitted (e.g. Finder supports slashes, terminal supports colons). (More details)
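The Windows rules listed above (forbidden characters, control characters, reserved device names, no trailing space or dot) can be combined into one check. This is a rough sketch, not an exhaustive validator: is_valid_windows_name() is a hypothetical helper, and matching the reserved names case-insensitively is an assumption based on Windows ignoring case in file names.

```php
<?php
// Hypothetical helper applying the Windows rules listed above.
function is_valid_windows_name(string $name): bool
{
    // Forbidden printable characters plus control characters 0-31.
    if ($name === '' || preg_match('#[<>:"/\\\\|?*\x00-\x1F]#', $name)) {
        return false;
    }
    // Names cannot end in a space or dot.
    if (in_array(substr($name, -1), [' ', '.'], true)) {
        return false;
    }
    // Reserved device names, alone or with any extension (assumed case-insensitive).
    $stem = strtoupper(explode('.', $name)[0]);
    return !preg_match('/^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/', $stem);
}

var_dump(is_valid_windows_name('report.txt')); // bool(true)
var_dump(is_valid_windows_name('LPT1.txt'));   // bool(false)
var_dump(is_valid_windows_name('a:b'));        // bool(false)
```

A name that passes this check can still fail on disk, for example because of path length limits, so treat it as a first filter only.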
A “comprehensive guide” of forbidden filename characters is not going to work on Windows because it reserves filenames as well as characters. Yes, characters like * " ? and others are forbidden, but there are an infinite number of names composed only of valid characters that are forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly-allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for naming files and folders are on the Microsoft docs.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.
If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.
Under Linux and other Unix-related systems, there were traditionally only two characters that could not appear in the name of a file or directory, and those are NUL '\0' and slash '/'. The slash, of course, can appear in a pathname, separating directory components.
Rumour1 has it that Steven Bourne (of 'shell' fame) had a directory containing 254 files, one for every single letter (character code) that can appear in a file name (excluding /, '\0'; the name . was the current directory, of course). It was used to test the Bourne shell and routinely wrought havoc on unwary programs such as backup programs.
Other people have covered the rules for Windows filenames, with links to Microsoft and Wikipedia on the topic.
Note that MacOS X has a case-insensitive file system. Current versions of it appear to allow colon : in file names, though historically that was not necessarily always the case:
$ echo a:b > a:b
$ ls -l a:b
-rw-r--r-- 1 jonathanleffler staff 4 Nov 12 07:38 a:b
$
However, at least with macOS Big Sur 11.7, the file system does not allow file names that are not valid UTF-8 strings. That means the file name cannot consist of the bytes that are always invalid in UTF-8 (0xC0, 0xC1, 0xF5-0xFF), and you can't use the continuation bytes 0x80..0xBF as the only byte in a file name. The error given is 92 Illegal byte sequence.
POSIX defines a Portable Filename Character Set consisting of:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 . _ -
Sticking with names formed solely from those characters avoids most of the problems, though Windows still adds some complications.
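Checking a name against the Portable Filename Character Set takes a single regular expression. This is a minimal sketch; is_portable_name() is a hypothetical helper, and it only tests the character set, not length limits or reserved names.

```php
<?php
// Hypothetical helper: true if $name uses only A-Z, a-z, 0-9, ".", "_" and "-".
function is_portable_name(string $name): bool
{
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[A-Za-z0-9._-]+$/D', $name) === 1;
}

var_dump(is_portable_name('backup_2021-11-12.tar')); // bool(true)
var_dump(is_portable_name('a:b'));                   // bool(false)
```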
1 It was Kernighan & Pike in ['The Practice of Programming'](http://www.cs.princeton.edu/~bwk/tpop.webpage/) who said as much in Chapter 6, Testing, §6.5 Stress Tests:
When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
Note that the directory must have contained entries . and .., so it was arguably 253 files (and 2 directories), or 255 name entries, rather than 254 files. This doesn't affect the effectiveness of the anecdote, or the careful testing it describes.
TPOP was previously at
http://plan9.bell-labs.com/cm/cs/tpop and
http://cm.bell-labs.com/cm/cs/tpop but both are now (2021-11-12) broken.
See also Wikipedia on TPOP.
Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name context is quite short, and unless you have some very specific naming requirements your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but with a whitelist it is easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
Letters (a-z A-Z) - Unicode characters as well, if needed
Digits (0-9)
Underscore (_)
Hyphen (-)
Space
Dot (.)
And any additional safe characters you wish to allow. Beyond this, you just have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
Name must contain at least one letter or number (to avoid only dots/spaces)
Name must start with a letter or number (to avoid leading dots/spaces)
Name may not end with a dot or space (simply trim those if present, like Explorer does)
This already allows quite complex and nonsensical names. For example, these names would be possible with these rules, and be valid file names in Windows/Linux:
A...........ext
B -.- .ext
In essence, even with so few whitelisted characters you should still decide what actually makes sense, and validate/adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.
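The whitelist and the three extra rules above could be sketched as follows. This is one possible reading, not a definitive implementation: sanitize_name() is a hypothetical helper, it handles ASCII letters only (Unicode letters would need a different pattern), and replacing disallowed characters with an underscore is an assumption taken from the question.

```php
<?php
// Hypothetical helper: returns a cleaned name, or null if no usable name remains.
function sanitize_name(string $name): ?string
{
    // Replace anything outside the whitelist with an underscore.
    $clean = preg_replace('/[^A-Za-z0-9_\- .]/', '_', $name);
    // Trim trailing dots and spaces, like Explorer does.
    $clean = rtrim($clean, ' .');
    // Must start with a letter or digit, which also guarantees it contains one.
    if (!preg_match('/^[A-Za-z0-9]/', $clean)) {
        return null;
    }
    return $clean;
}

var_dump(sanitize_name('report: draft?.txt')); // string(18) "report_ draft_.txt"
var_dump(sanitize_name('...'));                // NULL
```

Reserved names in the target file system still need a separate check, as noted above.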
The easy way to get Windows to tell you the answer is to attempt to rename a file via Explorer and type in a backslash, \, for the new name. Windows will pop up a message box telling you the list of illegal characters.
A filename cannot contain any of the following characters:
\ / : * ? " < > |
Microsoft Docs - Naming Files, Paths, and Namespaces - Naming Conventions
Well, if only for research purposes, then your best bet is to look at this Wikipedia entry on Filenames.
If you want to write a portable function to validate user input and create filenames based on that, the short answer is don't. Take a look at a portable module like Perl's File::Spec to get a glimpse of all the hoops needed to accomplish such a "simple" task.
Discussing different possible approaches
Difficulties with defining what's legal and what's not were already addressed, and whitelists were suggested. But not only Windows but also many unixoid OSes support more-than-8-bit characters such as Unicode. You could here also talk about encodings such as UTF-8. You can consider Jonathan Leffler's comment, where he gives info about modern Linux and describes details for MacOS. Wikipedia states that (for example) the
modifier letter colon [(See 7. below) is] sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The [inherited ASCII] colon itself is not permitted.
Therefore, I want to present a much more liberal approach using Unicode Homoglyph characters to replace the "illegal" ones. I found the result in my comparable use-case by far more readable and it's only limited by the used font, which is very broad, 3903 characters for Windows default. Plus you can even restore the original content from the replacements.
Possible choices and research notes
To keep things organized, I will always give the character, its name and the hexadecimal number representation. The latter is not case sensitive and leading zeroes can be added or omitted freely, so for example U+002A and u+2a are equivalent. If available, I'll try to point to more info or alternatives - feel free to show me more or better ones.
Instead of * (U+2A * ASTERISK), you can use one of the many listed, for example U+2217 ∗ (ASTERISK OPERATOR) or the Full Width Asterisk U+FF0A ＊. u+20f0 ⃰ combining asterisk above from combining diacritical marks for symbols might also be a valid choice. You can read 4. for more info about the combining characters.
Instead of . (U+2E . full stop), one of these could be a good option, for example ⋅ U+22C5 dot operator.
Instead of " (U+22 " quotation mark), you can use “ U+201C english leftdoublequotemark, more alternatives see here. I also included some of the good suggestions of Wally Brockway's answer, in this case u+2036 ‶ reversed double prime and u+2033 ″ double prime - I will from now on denote ideas from that source by ¹³.
Instead of / (U+2F / SOLIDUS), you can use ∕ DIVISION SLASH U+2215 (others here), ̸ U+0338 COMBINING LONG SOLIDUS OVERLAY, ̷ COMBINING SHORT SOLIDUS OVERLAY U+0337 or u+2044 ⁄ fraction slash¹³. Be aware about spacing for some characters, including the combining or overlay ones, as they have no width and can produce something like -> ̸th̷is which is ̸th̷is. With added spaces you get -> ̸ th ̷ is, which is ̸ th ̷ is. The second one (COMBINING SHORT SOLIDUS OVERLAY) looks bad in the stackoverflow-font.
Instead of \ (U+5C Reverse solidus), you can use ⧵ U+29F5 Reverse solidus operator (more) or u+20E5 ⃥ combining reverse solidus overlay¹³.
To replace [ (U+5B [ Left square bracket) and ] (U+005D ] Right square bracket), you can use for example U+FF3B ［ FULLWIDTH LEFT SQUARE BRACKET and U+FF3D ］ FULLWIDTH RIGHT SQUARE BRACKET (from here, more possibilities here).
Instead of : (u+3a : colon), you can use U+2236 ∶ RATIO (for mathematical usage) or U+A789 ꞉ MODIFIER LETTER COLON, (see colon (letter), sometimes used in Windows filenames as it is identical to the colon in the Segoe UI font used for filenames. The colon itself is not permitted ... source and more replacements see here). Another alternative is this one: u+1361 ፡ ethiopic wordspace¹³.
Instead of ; (u+3b ; semicolon), you can use U+037E ; GREEK QUESTION MARK (see here).
For | (u+7c | vertical line), there are some good substitutes such as: U+2223 ∣ DIVIDES, U+0964 । DEVANAGARI DANDA, U+01C0 ǀ LATIN LETTER DENTAL CLICK (the last ones from Wikipedia) or U+2D4F ⵏ Tifinagh Letter Yan. Also the box drawing characters contain various other options.
Instead of , (, U+002C COMMA), you can use for example ‚ U+201A SINGLE LOW-9 QUOTATION MARK (see here).
For ? (U+003F ? QUESTION MARK), these are good candidates: U+FF1F ？ FULLWIDTH QUESTION MARK or U+FE56 ﹖ SMALL QUESTION MARK (from here and here). There are also two more from the Dingbats Block (search for "question") and the u+203d ‽ interrobang¹³.
While my machine seems to accept it unchanged, I still want to include > (u+3e greater-than sign) and < (u+3c less-than sign) for the sake of completeness. The best replacement here is probably also from the quotation block, such as u+203a › single right-pointing angle quotation mark and u+2039 ‹ single left-pointing angle quotation mark respectively. The tifinagh block only contains ⵦ (u+2D66)¹³ to replace <. The last notion is ⋖ less-than with dot u+22D6 and ⋗ greater-than with dot u+22D7.
For additional ideas, you can also look for example into this block. You still want more ideas? You can try to draw your desired character and look at the suggestions here.
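A replacement table like this is straightforward to apply and to reverse with strtr(). The sketch below picks one candidate per forbidden character from the options above; the particular selection and the names HOMOGLYPHS, to_safe_name() and from_safe_name() are illustrative assumptions, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// One possible mapping, chosen from the candidates listed above.
const HOMOGLYPHS = [
    '*'  => "\u{2217}", // ASTERISK OPERATOR
    '"'  => "\u{201C}", // LEFT DOUBLE QUOTATION MARK
    '/'  => "\u{2215}", // DIVISION SLASH
    '\\' => "\u{29F5}", // REVERSE SOLIDUS OPERATOR
    ':'  => "\u{A789}", // MODIFIER LETTER COLON
    '|'  => "\u{2223}", // DIVIDES
    '?'  => "\u{FF1F}", // FULLWIDTH QUESTION MARK
    '<'  => "\u{2039}", // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    '>'  => "\u{203A}", // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
];

// Swap each forbidden ASCII character for its look-alike.
function to_safe_name(string $name): string
{
    return strtr($name, HOMOGLYPHS);
}

// Restore the original characters from the look-alikes.
function from_safe_name(string $name): string
{
    return strtr($name, array_flip(HOMOGLYPHS));
}

$safe = to_safe_name('What is 1/2 * 4?');
echo $safe, PHP_EOL;
echo from_safe_name($safe), PHP_EOL; // What is 1/2 * 4?
```

Restoring is only exact if the original name did not already contain one of the look-alike characters.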
How do you type these characters
Say you want to type ⵏ (Tifinagh Letter Yan). To get all of its information, you can always search for this character (ⵏ) on a suitable platform such as this Unicode Lookup (please add 0x when you search for hex) or that Unicode Table (which only allows searching by name, in this case "Tifinagh Letter Yan"). You should obtain its Unicode number U+2D4F and the HTML code &#11599; (note that 2D4F is hexadecimal for 11599). With this knowledge, you have several options to produce these special characters, including the use of
code points to unicode converter or again the Unicode Lookup to reversely convert the numerical representation into the unicode character (remember to set the code point base below to decimal or hexadecimal respectively)
a one-line macro in AutoHotkey: :?*:altpipe::{U+2D4F} to type ⵏ instead of the string altpipe - this is the way I input those special characters; my AutoHotkey script can be shared if there is common interest
Alt characters or Alt codes, by pressing and holding Alt, followed by the decimal number for the desired character (more info for example here, look at a table here or there). For the example, that would be Alt+11599. Be aware that many programs do not fully support this Windows feature for all of Unicode (as of the time of writing). Microsoft Office is an exception where it usually works; some other OSes provide similar functionality. Typing these chars with Alt combinations into MS Word is also the way Wally Brockway suggests in his answer¹³ that was already mentioned - if you don't want to convert all the hexadecimal values to decimal, you can find some of them there¹³.
in MS Office, you can also use ALT + X as described in this MS article to produce the chars
if you rarely need it, you can of course still just copy-paste the special character of your choice instead of typing it
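The hexadecimal-to-decimal conversion needed for the Alt codes above can also be done in a couple of lines of Python. This is only a minimal sketch; the helper name describe_code_point is made up for illustration and is not part of any of the tools mentioned.

```python
import unicodedata


def describe_code_point(hex_code: str) -> tuple[int, str, str]:
    """Return (decimal value, character, Unicode name) for a hex code point such as '2D4F'."""
    value = int(hex_code, 16)   # hexadecimal -> decimal, the number typed after Alt
    char = chr(value)           # the character itself, ready to copy-paste
    return value, char, unicodedata.name(char)


print(describe_code_point("2D4F"))
```

The decimal value in the first slot is what you would type as the Alt code; the character in the second slot can simply be copied.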
For Windows you can check it using PowerShell
$PathInvalidChars = [System.IO.Path]::GetInvalidPathChars() #36 chars
To display UTF-8 codes you can convert
$enc = [system.Text.Encoding]::UTF8
$PathInvalidChars | foreach { $enc.GetBytes($_) }
$FileNameInvalidChars = [System.IO.Path]::GetInvalidFileNameChars() #41 chars
$FileOnlyInvalidChars = @(':', '*', '?', '\', '/') #5 chars - as a difference
For anyone looking for a regex:
const BLACKLIST = /[<>:"\/\\|?*]/g;
In Windows 10 (2019), the following error message appears when you try to type any of the forbidden characters:
A file name can't contain any of the following characters:
\ / : * ? " < > |
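If you want to report which of those nine characters a proposed name contains, something like the following Python sketch could work. It only covers the printable characters from the dialog above, not control characters or reserved names such as CON, and the names WINDOWS_FORBIDDEN and forbidden_in are invented for this example.

```python
# The nine printable characters listed in the Windows error message.
WINDOWS_FORBIDDEN = set('\\/:*?"<>|')


def forbidden_in(name: str) -> list[str]:
    """Return the forbidden characters found in name, in order of first appearance."""
    seen = []
    for ch in name:
        if ch in WINDOWS_FORBIDDEN and ch not in seen:
            seen.append(ch)
    return seen


print(forbidden_in("report: draft?.txt"))
```

An empty list means none of the nine characters is present; it does not guarantee the name is valid on every file system.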
Here's a c# implementation for windows based on Christopher Oezbek's answer
It was made more complex by the containsFolder boolean, but hopefully covers everything
/// <summary>
/// This will replace invalid chars with underscores, there are also some reserved words that it adds underscore to
/// </summary>
/// <remarks>
/// https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names
/// </remarks>
/// <param name="containsFolder">Pass in true if filename represents a folder\file (passing true will allow slash)</param>
public static string EscapeFilename_Windows(string filename, bool containsFolder = false)
{
StringBuilder builder = new StringBuilder(filename.Length + 12);
int index = 0;
// Allow colon if it's part of the drive letter
if (containsFolder)
{
Match match = Regex.Match(filename, @"^\s*[A-Z]:\\", RegexOptions.IgnoreCase);
if (match.Success)
{
builder.Append(match.Value);
index = match.Length;
}
}
// Character substitutions
for (int cntr = index; cntr < filename.Length; cntr++)
{
char c = filename[cntr];
switch (c)
{
case '\u0000':
case '\u0001':
case '\u0002':
case '\u0003':
case '\u0004':
case '\u0005':
case '\u0006':
case '\u0007':
case '\u0008':
case '\u0009':
case '\u000A':
case '\u000B':
case '\u000C':
case '\u000D':
case '\u000E':
case '\u000F':
case '\u0010':
case '\u0011':
case '\u0012':
case '\u0013':
case '\u0014':
case '\u0015':
case '\u0016':
case '\u0017':
case '\u0018':
case '\u0019':
case '\u001A':
case '\u001B':
case '\u001C':
case '\u001D':
case '\u001E':
case '\u001F':
case '<':
case '>':
case ':':
case '"':
case '/':
case '|':
case '?':
case '*':
builder.Append('_');
break;
case '\\':
builder.Append(containsFolder ? c : '_');
break;
default:
builder.Append(c);
break;
}
}
string built = builder.ToString();
if (built == "")
{
return "_";
}
if (built.EndsWith(" ") || built.EndsWith("."))
{
built = built.Substring(0, built.Length - 1) + "_";
}
// These are reserved names, in either the folder or file name, but they are fine if following a dot
// CON, PRN, AUX, NUL, COM0 .. COM9, LPT0 .. LPT9
builder = new StringBuilder(built.Length + 12);
index = 0;
foreach (Match match in Regex.Matches(built, @"(^|\\)\s*(?<bad>CON|PRN|AUX|NUL|COM\d|LPT\d)\s*(\.|\\|$)", RegexOptions.IgnoreCase))
{
Group group = match.Groups["bad"];
if (group.Index > index)
{
builder.Append(built.Substring(index, group.Index - index)); // copy everything up to the reserved word, including any whitespace before it
}
builder.Append(group.Value);
builder.Append("_"); // putting an underscore after this keyword is enough to make it acceptable
index = group.Index + group.Length;
}
if (index == 0)
{
return built;
}
if (index < built.Length) // copy the remainder, even if it is a single character
{
builder.Append(built.Substring(index));
}
return builder.ToString();
}
The only illegal characters on Unix may be / and NUL, but some consideration for command-line interpretation should be included.
For example, while it might be legal to name a file 1>&2 or 2>&1 in Unix, file names such as this might be misinterpreted when used on a command line.
Similarly it might be possible to name a file $PATH, but when trying to access it from the command line, the shell will translate $PATH to its variable value.
The .NET Framework System.IO provides the following functions for invalid file system characters:
Path.GetInvalidFileNameChars
Path.GetInvalidPathChars
Those functions should return appropriate results depending on the platform the .NET runtime is running in. That said, the Remarks in the documentation pages for those functions say:
The array returned from this method is not guaranteed to contain the
complete set of characters that are invalid in file and directory
names. The full set of invalid characters can vary by file system.
I always assumed that banned characters in Windows filenames meant that all exotic characters would also be outlawed. The inability to use ?, / and : in particular irked me. One day I discovered that it was virtually only those chars which were banned. Other Unicode characters may be used. So the nearest Unicode characters to the banned ones I could find were identified and MS Word macros were made for them as Alt+?, Alt+: etc. Now I form the filename in Word, using the substitute chars, and copy it to the Windows filename. So far I have had no problems.
Here are the substitute chars (Alt + the decimal Unicode) :
⃰ ⇔ Alt8432
⁄ ⇔ Alt8260
⃥ ⇔ Alt8421
∣ ⇔ Alt8739
ⵦ ⇔ Alt11622
⮚ ⇔ Alt11162
‽ ⇔ Alt8253
፡ ⇔ Alt4961
‶ ⇔ Alt8246
″ ⇔ Alt8243
As a test I formed a filename using all of those chars and Windows accepted it.
This is good enough for me in Python:
import re

def fix_filename(name, max_length=255):
    """
    Replace invalid characters on Linux/Windows/MacOS with underscores.
    List from https://stackoverflow.com/a/31976060/819417
    Trailing spaces & periods are ignored on Windows.
    >>> fix_filename("  COM1  ")
    '_ COM1 _'
    >>> fix_filename("COM10")
    'COM10'
    >>> fix_filename("COM1,")
    'COM1,'
    >>> fix_filename("COM1.txt")
    '_.txt'
    >>> all('_' == fix_filename(chr(i)) for i in list(range(32)))
    True
    """
    return re.sub(r'[/\\:|<>"?*\0-\x1f]|^(AUX|COM[1-9]|CON|LPT[1-9]|NUL|PRN)(?![^.])|^\s|[\s.]$', "_", name[:max_length], flags=re.IGNORECASE)
See also this outdated list for additional legacy stuff like = in FAT32.
As of 18/04/2017, no simple black or white list of characters and filenames is evident among the answers to this topic - and there are many replies.
The best suggestion I could come up with was to let the user name the file however he likes. Using an error handler when the application tries to save the file, catch any exceptions, assume the filename is to blame (obviously after making sure the save path was ok as well), and prompt the user for a new file name. For best results, place this checking procedure within a loop that continues until either the user gets it right or gives up. Worked best for me (at least in VBA).
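The same "try, catch, ask again" idea can be sketched in Python. To keep the example runnable without a prompt, the user's successive attempts are passed in as a list; in a real application each candidate would come from asking the user again. The function name save_with_fallback is an assumption of this sketch, not something from the answer above.

```python
import os
import tempfile


def save_with_fallback(directory, candidate_names, data):
    """Try each candidate name in turn; return the first one that could be written."""
    for name in candidate_names:
        try:
            with open(os.path.join(directory, name), "wb") as handle:
                handle.write(data)
            return name
        except (OSError, ValueError):
            # OSError: the OS rejected the name; ValueError: embedded NUL byte.
            continue
    return None  # the "user" gave up without finding an acceptable name


with tempfile.TemporaryDirectory() as tmp:
    print(save_with_fallback(tmp, ["bad\0name.txt", "good.txt"], b"hello"))
```

As in the VBA version, this assumes the directory itself is fine, so any failure is blamed on the file name.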
In Unix shells, you can quote almost every character in single quotes '. Except the single quote itself, and you can't express control characters, because \ is not expanded. Accessing the single quote itself from within a quoted string is possible, because you can concatenate strings with single and double quotes, like 'I'"'"'m' which can be used to access a file called "I'm" (double quote also possible here).
So you should avoid all control characters, because they are too difficult to enter in the shell. The rest still is funny, especially files starting with a dash, because most commands read those as options unless you have two dashes -- before, or you specify them with ./, which also hides the starting -.
If you want to be nice, don't use any of the characters the shell and typical commands use as syntactical elements, sometimes position dependent, so e.g. you can still use -, but not as first character; same with ., you can use it as first character only when you mean it ("hidden file"). When you are mean, your file names are VT100 escape sequences ;-), so that an ls garbles the output.
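If you generate shell commands from a script, you don't have to build the quoting by hand. As a rough sketch, Python's standard shlex.quote produces the same 'I'"'"'m' form described above; the wrapper shell_safe and its ./ prefix for leading dashes are my own additions for illustration.

```python
import shlex


def shell_safe(filename: str) -> str:
    """Quote a file name for a POSIX shell; prefix ./ if it starts with a dash."""
    if filename.startswith("-"):
        filename = "./" + filename  # keeps commands from reading it as an option
    return shlex.quote(filename)


for name in ["I'm", "$PATH", "2>&1", "-rf"]:
    print(shell_safe(name))
```

This only helps with quoting for the shell; it does not make control characters in file names any easier to type.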
When creating internet shortcuts in Windows, to create the file name, it skips illegal characters, except for forward slash, which is converted to minus.
I had the same need and was looking for a recommendation or standard reference and came across this thread. My current blacklist of characters that should be avoided in file and directory names is:
$CharactersInvalidForFileName = {
"pound" -> "#",
"left angle bracket" -> "<",
"dollar sign" -> "$",
"plus sign" -> "+",
"percent" -> "%",
"right angle bracket" -> ">",
"exclamation point" -> "!",
"backtick" -> "`",
"ampersand" -> "&",
"asterisk" -> "*",
"single quotes" -> "“",
"pipe" -> "|",
"left bracket" -> "{",
"question mark" -> "?",
"double quotes" -> "”",
"equal sign" -> "=",
"right bracket" -> "}",
"forward slash" -> "/",
"colon" -> ":",
"back slash" -> "\\",
"blank spaces" -> " ",
"at sign" -> "@"
};

Is there a way to delimit "ucwords()" in PHP such that the first char is not automatically uppercased?

PHP has the function ucwords(), which allows for custom delimiters. This works well, and will turn my test string into My Test String no problem.
Take the following example: I want to make a super awesome 2009 gamer tag.
$gamerTag = 'xxx_l33t_xxx'; // Not yet epic.
echo ucwords($gamerTag,"x"); // want it to return 'xXx_l33t_xXx'
I would have assumed strings would delimit case-sensitively and update the second x in each case, ignoring the third, since at that point the middle one would no longer match our delimiter.
However, this actually returns XxX_l33t_xXx, since it will automatically uppercase the first letter in the string.
I know that there are other methods of doing this (str_split() array loops and preg_replace() with a reverse lookup come to mind), but my primary question becomes the following:
Is there a way to delimit ucwords() such that it does not automatically uppercase the first character of the string?
The internal behaviour is unfortunately that the first character of the string will always be converted to upper case, regardless of the delimiters you pass in.
Digging into the PHP source, this is the implementation of ucwords:
*r = toupper((unsigned char) *r);
for (r_end = r + Z_STRLEN_P(return_value) - 1; r < r_end; ) {
if (mask[(unsigned char)*r++]) {
*r = toupper((unsigned char) *r);
}
}
From https://github.com/php/php-src/blob/master/ext/standard/string.c#L2651
Here r is the return value, and mask is a char array of the delimiting characters. The first call to toupper (outside of the loop) means that there's no way to prevent the first character being converted.
Because this is done, it means the second character is not converted, since it's now preceded by X, not x. The third character is handled "correctly".
This can actually cause some strange cascading behaviour, since the return value is being iterated over while it's being modified:
php > echo ucwords('aaa', 'A');
AAA
The initial string doesn't contain the delimiting character anywhere, but the result is completely upper-case.
As mentioned in a comment, there's an open PHP bug to reflect this behaviour in the documentation here: https://bugs.php.net/bug.php?id=78393
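To make the cascading effect easier to follow, here is a rough Python re-enactment of the C loop quoted above. It is not PHP's code, just a sketch that walks the string the same way (uppercase the first character, then uppercase whatever follows a delimiter, reading from the already modified buffer). The name ucwords_emulated is made up, and it is only meant for ASCII input.

```python
def ucwords_emulated(text: str, delimiters: str = " \t\r\n\f\v") -> str:
    """Mimic the quoted loop: the buffer is modified while it is being scanned."""
    chars = list(text)
    if not chars:
        return text
    chars[0] = chars[0].upper()           # unconditional, like the first toupper()
    i = 0
    while i < len(chars) - 1:
        is_delim = chars[i] in delimiters  # tested against the modified buffer
        i += 1
        if is_delim:
            chars[i] = chars[i].upper()
    return "".join(chars)


print(ucwords_emulated("xxx_l33t_xxx", "x"))
print(ucwords_emulated("aaa", "A"))
```

Because the first character is already upper case when the loop inspects it, a lower-case delimiter no longer matches there, while an upper-case delimiter keeps matching each freshly converted character.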

http_build_query function's excessive urlencoding

Why, when building a query string with the http_build_query function, does it urlencode square brackets [] outside values, and how do I get rid of it?
$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query($query, "", "&");
var_dump($queryString);
var_dump("urldecoded: " . urldecode($queryString));
outputs:
var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B
urldecoded: var[foo]=value&var[bar]=encodedBracket[
The function correctly urlencoded a [ in encodedBracket[ in the first line of the output but what was the reason to encode square brackets in var[foo]= and var[bar]=? As you can see, urldecoding the string also decoded reserved characters in values, encodedBracket%5B should have stayed as was for the query string to be correct and not become encodedBracket[.
According to section 2.2 Reserved Characters of Uniform Resource Identifier (URI): Generic Syntax
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm. If
data for a URI component would conflict with a reserved character's
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," /
";" / "="
So shouldn't http_build_query really produce more readable output with characters like [] urlencoded only where it's required? How do I make it produce such output?
Here's a quick function I wrote to produce nicer query strings. It not only doesn't encode square brackets but will also omit the array key if it matches the index.
Note it doesn't support objects or the additional options of http_build_query. The $prefix argument is used for recursion and should be omitted for the initial call.
function http_clean_query(array $query_data, ?string $prefix=null): string {
$parts = [];
$i = 0;
foreach ($query_data as $key=>$value) {
if ($prefix === null) {
$key = rawurlencode($key);
} else if ($key === $i) {
$key = $prefix.'[]';
$i++;
} else {
$key = $prefix.'['.rawurlencode($key).']';
}
if (is_array($value)) {
if (!empty($value)) $parts[] = http_clean_query($value, $key);
} else {
$parts[] = $key.'='.rawurlencode($value);
}
}
return implode('&', $parts);
}
I know this is a bit old, but I think it's still relevant today.
TL;DR: http_build_query() is working correctly
Longer explanation: Yes, http_build_query encodes [] and it looks awful... but it's the correct behavior: [] are reserved characters as per rfc3986#section-2.2. And... no, they are NOT reserved for passing arrays!
What [] are reserved for is defined in rfc3986#section-3.2.2:
A host identified by an Internet Protocol literal address, version
6 [RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax. In
anticipation of future, as-yet-undefined IP literal address formats,
an implementation may use an optional version flag to indicate such a
format explicitly rather than rely on heuristic determination.
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
So basically it is reserved for something like https://[2607:f8b0:4004:808::200e]
There is another question about this same topic here: https://stackoverflow.com/a/1016737/1204976
I found the following "fix" here:
[...] the workable 'fix' I have been using was to postprocess http_build_query() output
with the following - a 'solution' which makes my skin crawl just a little:
function http_build_query_unborker($s) {
return preg_replace_callback('#%5[bd](?=[^&]*=)#i', function($match) {
return urldecode($match[0]);
}, $s);
}
So now it would become:
$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query_unborker(http_build_query($query, "", "&"));
var_dump($queryString);
var_dump("urldecoded: " . urldecode($queryString)); // var[foo]=value&var[bar]=encodedBracket%5B
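For anyone who wants to see what that regular expression does outside of PHP, here is a small Python sketch of the same post-processing step. It assumes the input is an already percent-encoded query string like the one http_build_query() printed in the question; decode_key_brackets is a name invented for this example.

```python
import re
from urllib.parse import unquote

# %5B / %5D that are still followed by an "=" before the next "&", i.e. part of a key.
_KEY_BRACKET = re.compile(r"%5[bd](?=[^&]*=)", re.IGNORECASE)


def decode_key_brackets(query: str) -> str:
    """Decode percent-encoded square brackets in keys, leaving values untouched."""
    return _KEY_BRACKET.sub(lambda m: unquote(m.group(0)), query)


print(decode_key_brackets("var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B"))
```

The lookahead is what protects the values: a bracket inside a value is followed by "&" or the end of the string, never by a literal "=", because "=" inside values is itself percent-encoded.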
You've got many questions here. Speaking in RFC terms of should, and reading your own questions in these same terms. I take your questions from bottom to top:
How do I make it produce such output?
By using a different encoder, Net_URL2 (pear / packagist) for example:
$vars = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$url = new Net_URL2('');
$url->setQueryVariables($vars);
$query = $url->getQuery();
var_dump($query); // string(41) "var[foo]=value&var[bar]=encodedBracket%5B"
So shouldn't http_build_query really produce more readable output with characters like [] urlencoded only where it's required?
No, it should not. Even though it is not required to encode the square brackets inside the query part, it is recommended. What is recommended should be done.
Next to that, the http_build_query() function is not about creating "more readable output". It is only about creating the query of an HTTP URI. For such a query part, square brackets should be percent-encoded. These are reserved characters not specifically allowed for query.
What was the reason to encode square brackets in var[foo]= and var[bar]=?
The reason to encode square brackets there is the same reason to encode square brackets in encodedBracket[. The differentiation you do between these parts in your question is purely syntactic on your own, within an URI these parts are treated equal. There are no sub-parts of a query part in an URI. So making a distinction between the bracket var[ or the bracket encodedBracket[ is purely unrelated to URI encoding of the query part.
As you say that the percent-encoding of encodedBracket[ to encodedBracket%5B is correct and as it belongs into the same part of the URI (the query part), logic dictates that you must accept that encoding the bracket in var[ to var%5B is equally correct in terms of URI encoding. Same URI part, same encoding. The only ending delimiter the query part has is "#".
Additionally your reasoning shows a misunderstanding in this part:
As you can see, urldecoding the string also decoded reserved characters in values, encodedBracket%5B should have stayed as was for the query string to be correct and not become encodedBracket[.
If you urldecode, all percent-encoded sequences will be decoded - regardless whether the percent-encoding was representing a reserved character or not. In terms of correct, it's the opposite of what you stated: %5B has to be decoded to [ regardless if it was at the beginning, in the middle or at the end of the string.
Why, when building a query string with the http_build_query function, does it urlencode square brackets [] outside values, and how do I get rid of it?
It's easier to answer the second part, see at the beginning of the answer, it's already answered.
About the why this perhaps might not be immediately visible especially as you might have found out that PHP itself accepts percent-encoded and verbatim square brackets in the query (even intermixed) without any problems.
How come the differences and why is that so? Is it really as simple as you outline it in your question? Is this a cosmetic difference only?
First of all, not encoding square brackets in the query part of a URI violates RFC3986 in the sense that the query part should not contain the brackets from the gen-delims characters unencoded. Non-percent-encoded square brackets cannot be part of query according to the ABNF:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Getting rid of these therefore is not suggested (at least for encoding purposes following the standard) as it will change the URI:
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
This is already a good hint that for the URI you ask for, it has a different meaning than the URI PHP creates via the built-in function.
And further on:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.
This is not the case for all characters in gen-delims but per the ABNF:
"/" / "?" / ":" / "@"
So it therefore looks like http_build_query() went the route of percent-encoding square brackets, as those are reserved characters and not specifically allowed by the URI scheme for that part (the query). Basically nothing wrong with it; it follows the recommendation of RFC3986. And it is not suggesting a different meaning for those parts of the query.
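The query grammar quoted earlier is small enough to check mechanically. The following Python sketch translates the ABNF for query (pchar plus "/" and "?") into a regular expression; the names QUERY_RE and is_rfc3986_query are my own, and this tests syntax only, not what any server will accept.

```python
import re

# query = *( pchar / "/" / "?" )
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
QUERY_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")


def is_rfc3986_query(query: str) -> bool:
    """True if the whole string matches the RFC 3986 query production."""
    return QUERY_RE.fullmatch(query) is not None


print(is_rfc3986_query("var%5Bfoo%5D=value"))  # percent-encoded brackets
print(is_rfc3986_query("var[foo]=value"))      # literal brackets
```

Literal square brackets fail the check simply because "[" and "]" appear in none of the alternatives, which is the point made above.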
However you clearly say, that technically these brackets aren't delimiters in the query. And yes, that is true:
The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
So comparing to what has been identified earlier as reserved characters not specifically allowed:
"#" / "[" / "]"
(already a pretty small list) it should be clear that "#" must stay reserved otherwise the URI gets broken (a true, separating delimiter at the end of query), but the square brackets must not be specifically allowed when representing an unequal URI without data-loss and preserving all URI delimiters:
If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So if you can still follow me, one might want actually do what you're asking for: Creating an URI in which the square brackets meaning as a delimiter (e.g. representing a fraction of an array definition) but not having this as data. Albeit the data of the character is preserved per RFC 3986.
It therefore is technically possible to create an URI with the square brackets not percent encoded within the query. Technically even inside values, like it would be a syntactical difference outside of values, this is only another syntactic difference for inside of values.
This is also the reason why browsers preserve the state of square brackets within the query when you enter these into your browser. Percent-encoded or not - the browser passes that part of the URI as-is to the server so that the underlying processes on the server can benefit from syntactic differences that might have been expressed by that.
So choose the URL encoding correctly for the underlying platform. Just because it's possible does not mean it works in a stable manner. The way http_build_query() does it is the most stable (safe) way following RFC 3986. However, it's a should in the RFC, so if you understand this point, you can have valid reasons not to percent-encode the square brackets.
One reason you name in your question is readability. This is especially important when you write down URLs for example on a sheet of paper. I'm not so sure if a square bracket is such a good distinguishable character and if not percent encoding does even help with readability. But I have not tried it. PHP would accept both ways. But then you won't need to do that programmatically. So perhaps readability wasn't really the case in your scenario.

Why is base64_encode() adding a slash "/" in the result?

I am encoding the URL suffix of my application:
$url = 'subjects?_d=1';
echo base64_encode($url);
// Outputs
c3ViamVjdHM/X2Q9MQ==
Notice the slash before 'X2'.
Why is this happening? I thought base64 only outputted A-Z, 0-9 and '=' as padding?
No. The Base64 alphabet includes A-Z, a-z, 0-9 and + and /.
You can replace them if you don't care about portability towards other applications.
See: http://en.wikipedia.org/wiki/Base64#Variants_summary_table
You can use something like these to use your own symbols instead (replace - and _ by anything you want, as long as it is not in the base64 base alphabet, of course!).
The following example converts the normal base64 to base64url as specified in RFC 4648:
function base64url_encode($s) {
return str_replace(array('+', '/'), array('-', '_'), base64_encode($s));
}
function base64url_decode($s) {
return base64_decode(str_replace(array('-', '_'), array('+', '/'), $s));
}
In addition to all of the answers above, pointing out that / is part of the expected base64 alphabet, it should be noted that the particular reason you saw a / in your encoded string, is because when base64 encoding ASCII text, the only way to generate a / is to have a question mark in a position divisible by three.
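This is easy to verify outside PHP as well. Here is a short Python sketch using the standard base64 module; the helper names b64 and b64url are only there to keep the example compact.

```python
import base64


def b64(text: str) -> str:
    """Standard Base64 of ASCII text (alphabet includes + and /)."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")


def b64url(text: str) -> str:
    """URL-safe variant from RFC 4648 (- and _ instead of + and /)."""
    return base64.urlsafe_b64encode(text.encode("ascii")).decode("ascii")


print(b64("subjects?_d=1"))     # the string from the question
print(b64("ab?"), b64("a?b"))   # '?' as third vs. second byte of a group
print(b64url("subjects?_d=1"))
```

In "ab?" the question mark is the third byte of its 3-byte group, so its low six bits (all ones) become the sextet 63, which is "/". Moved to the second position, the same character no longer produces a slash.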
Sorry, you thought wrong. A-Za-z0-9 only gets you 62 characters. Base64 uses two additional characters, in PHP's case / and +.
There is nothing special in that.
The base 64 "alphabet" or "digits" are A-Z,a-z,0-9 plus two extra characters + (plus) and / (slash).
You can later encode / with %2f if you want.
Not directly related, and enough people above have answered and explained solutions quite well.
However, going a bit outside of the scope of things. If you want readable base text, try looking into Base58. It's worth considering if you want only alphanumeric characters.
For base64 the valid charset is:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
the = is used as filler for the last bytes
A-Z is 26 characters.
0-9 is 10 characters.
= is one character. That gives a total of 37 characters, which is some way short of 64.
/ is one of the 64 characters. You can see a complete list on the wikipedia page.

is PHP str_word_count() multibyte safe?

I want to use str_word_count() on a UTF-8 string.
Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).
But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.
So I guess I want to know...
Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?
Are there any equivalent 'space' characters in UTF-8, which are not ASCII " " (space)?
This is where the problem might lie I guess.
I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:
Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)
And perhaps as well:
Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.
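If you want to double-check the byte sequences for these characters, a few lines of Python will print them. This is just a verification sketch; the SPACES table and utf8_hex helper are names I made up, and isspace() reflects Python's notion of whitespace, not PHP's.

```python
SPACES = {
    "NO-BREAK SPACE": "\u00a0",
    "NEXT LINE (NEL)": "\u0085",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
}


def utf8_hex(char: str) -> str:
    """Hex dump of the UTF-8 encoding of a single character."""
    return char.encode("utf-8").hex()


for name, char in SPACES.items():
    print(f"U+{ord(char):04X} {name}: {utf8_hex(char)} isspace={char.isspace()}")
```

None of these encodings contains the ASCII space byte 0x20, which is why a byte-oriented function that only looks for ASCII whitespace will not split on them.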
If we want to put this to a test, we can set the locale to UTF-8, pass in an invalid string containing a \xA0 sequence and if this still counts as word-breaking character, that function is clearly not UTF-8 safe, hence not multibyte safe (as same non-defined as per the question):
<?php
/**
* is PHP str_word_count() multibyte safe?
* @link https://stackoverflow.com/q/8290537/367456
*/
echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";
$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);
var_dump($result);
Output:
New Locale: en_US.utf8
array(3) {
[0]=>
string(5) "aword"
[6]=>
string(5) "bword"
[12]=>
string(5) "aword"
}
As this demo shows, that function totally fails on the locale promise it gives on the manual page (I do not wonder nor moan about this, most often if you read that a function is locale specific in PHP, run for your life and find one that is not) which I exploit here to demonstrate that it by no means does anything regarding the UTF-8 character encoding.
Instead for UTF-8 you should take a look into the PCRE extension:
Matching Unicode letter characters in PCRE/PHP
PCRE has a good understanding of Unicode and UTF-8 in PHP in specific. It can also be quite fast if you craft the regular expression pattern carefully.
About the "template answer" - I don't get the demand "working faster". We're not talking about long times or lot of counts here, so who cares if it takes some milliseconds longer or not?
However, a str_word_count working with soft hyphen:
function my_word_count($str) {
return str_word_count(str_replace("\xC2\xAD",'', $str));
}
a function that complies with the asserts (but is probably not faster than str_word_count):
function my_word_count($str) {
$mystr = str_replace("\xC2\xAD",'', $str); // soft hyphen encoded in UTF-8
return preg_match_all('~[\p{L}\'\-]+~u', $mystr); // regex expecting UTF-8
}
The preg function is essentially the same what's already proposed, except that a) it already returns a count so no need to supply matches, which should make it faster and b) there really should not be iconv fallback, IMO.
About a comment:
I can see that your PCRE functions are wrost (performance) than my
preg_word_count() because need a str_replace that you not need:
'~[^\p{L}\'-\xC2\xAD]+~u' works fine (!).
I considered that a different thing, string replace will only remove the multibyte character, but regex of yours will deal with \xC2 and \xAD in any order they might appear, which is wrong. Consider a registered sign, which is \xC2\xAE.
However, now that I think about it due to the way valid UTF-8 works, it wouldn't really matter, so that should be usable equally well. So we can just have the function
function my_word_count($str) {
return preg_match_all('~[\p{L}\'\-\xC2\xAD]+~u', $str); // regex expecting UTF-8
}
without any need for matches or other replacements.
About str_word_count(str_replace("\xC2\xAD",'', $str));, if is stable
with UTF8, is good, but seems is not.
If you read this thread, you'll know str_replace is safe if you stick to valid UTF-8 strings. I didn't see any evidence in your link of the contrary.
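For comparison, the same counting idea can be tried in Python. Its re module has no \p{L}, so this sketch approximates "letter" with [^\W\d_] (word characters minus digits and underscore), which is close but not identical to the PCRE class. WORD_RE and word_count are names invented here, and the sample strings are taken from the asserts posted in this thread.

```python
import re

# Letters (approximated), apostrophe, hyphen and soft hyphen (U+00AD) form a word.
WORD_RE = re.compile(r"(?:[^\W\d_]|['\-\u00ad])+")


def word_count(text: str) -> int:
    """Count runs of letters, apostrophes, hyphens and soft hyphens."""
    return len(WORD_RE.findall(text))


print(word_count("(one two,three;four)=4 (five-six se\u00adven)=2"))
print(word_count("(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1"))
```

Note that Python strings are already decoded Unicode, so the soft hyphen is one character here rather than the two bytes \xC2\xAD seen by PHP.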
EDITED (to show new clues): there is a possible solution using str_word_count() with PHP v5.1!
function my_word_count($str, $myLangChars="àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ") {
return str_word_count($str, 0, $myLangChars);
}
but it is not 100% reliable: I tried adding \xC2\xAD (SHY, the SOFT HYPHEN character) to $myLangChars, since it must be a word component in any language, and it does not work (see).
Another, not so fast, but complete and flexible solution (extracted from here), based on PCRE library, but with an option to mimic the str_word_count() behaviour on non-valid-UTF8:
/**
* Like str_word_count() but showing how preg can do the same.
* This function is most flexible but not faster than str_word_count.
* @param $wRgx the "word regular expression" as defined by user.
* @param $triggError changes behaviour causing error event.
* @param $OnBadUtfTryAgain when true mimic the str_word_count behaviour.
* @return 0 or positive integer as word-count, negative as PCRE error.
*/
function preg_word_count($s,$wRgx='/[-\'\p{L}\xC2\xAD]+/u', $triggError=true,
$OnBadUtfTryAgain=true) {
if ( preg_match_all($wRgx,$s,$m) !== false )
return count($m[0]);
else {
$lastError = preg_last_error();
$chkUtf8 = ($lastError==PREG_BAD_UTF8_ERROR);
if ($OnBadUtfTryAgain && $chkUtf8)
return preg_word_count(
iconv('CP1252','UTF-8',$s), $wRgx, $triggError, false
);
elseif ($triggError) trigger_error(
$chkUtf8? 'non-UTF8 input!': "error PCRE_code-$lastError",
E_USER_NOTICE
);
return -$lastError;
}
}
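The "retry as CP1252 when the input is not valid UTF-8" fallback used above can be illustrated on its own. This Python sketch shows only that decoding step, under the assumption (taken from the iconv call) that non-UTF-8 input is CP1252; to_text is a hypothetical helper name.

```python
def to_text(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to CP1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that CP1252 leaves undefined become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


print(to_text("três".encode("utf-8")))
print(to_text(b"tr\xeas"))
```

Guessing an encoding this way can silently misread text in some other 8-bit charset, so it is a convenience rather than a guarantee.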
(TEMPLATE ANSWER) help for bounty!
(this is not an answer, it is help for the bounty, because I can neither edit nor duplicate the question)
We want to count "real-world words" in a UTF-8 Latin text.
FOR BOUNTY, WE NEED:
a function that complies with the asserts below and is faster than str_word_count;
or str_word_count working with SHy character (how to?);
or preg_word_count working faster (using preg_replace? word-separator regular expression?).
ASSERTS
Suppose that a "multibyte safe" function my_word_count() exists, then the following asserts must be true:
assert_options(ASSERT_ACTIVE, 1);
$text = "1,2,3,4=0 (1 2 3 4)=0 (... ,.)=0 (2.5±0.1; 0.5±0.2)=0";
assert( my_word_count($text)==0 ); // no word there
$text = "(one two,three;four)=4 (five-six se\xC2\xADven)=2";
assert( my_word_count($text)==6 ); // hyphen merges two words
$text = "(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1";
assert( my_word_count($text)==4 ); // a UTF8 case
$text = "(ÍSÔ9000-X, ISÔ 9000-X, ÍSÔ-9000-X)=6"; //Codes are words?
assert( my_word_count($text)==6 ); // suppose no: X is another word
All it does is count the number of spaces, or words in between. If you're curious, you can just make your own counting function using explode and count.
Any time the ASCII space byte is found, it splits, and that's all there really is to it.
