PHP: comparing URIs which differ in percent-encoding - php

In PHP, I want to compare two relative URLs for equality. The catch: URLs may differ in percent-encoding, e.g.
/dir/file+file vs. /dir/file%20file
/dir/file(file) vs. /dir/file%28file%29
/dir/file%5bfile vs. /dir/file%5Bfile
According to RFC 3986, servers should treat these URIs identically. But if I use == to compare, I'll end up with a mismatch.
So I'm looking for a PHP function which accepts two strings and returns TRUE if they represent the same URI (discounting encoded/decoded variants of the same char, upper-case/lower-case hex digits in encoded chars, and + vs. %20 for spaces), and FALSE if they're different.
I know in advance that only ASCII chars are in these strings-- no unicode.

function uriMatches($uri1, $uri2)
{
return urldecode($uri1) == urldecode($uri2);
}
echo uriMatches('/dir/file+file', '/dir/file%20file'); // TRUE
echo uriMatches('/dir/file(file)', '/dir/file%28file%29'); // TRUE
echo uriMatches('/dir/file%5bfile', '/dir/file%5Bfile'); // TRUE
See urldecode().

EDIT: Please look at @webbiedave's response. His is much better (I wasn't even aware that there was a function in PHP to do that... you learn something new every day).
You will have to parse the strings to find every occurrence of a percent-encoding, i.e. a % followed by two hex digits. Convert those two hex digits to a number with hexdec() and pass it to the chr() function to get the character that the percent-encoding stands for. Rebuild the strings and then you should be able to match them.
Not sure that's the most efficient method, but considering URLs are not usually that long, it shouldn't be too much of a performance hit.
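A minimal sketch of the manual approach described above, assuming ASCII-only input as stated in the question. The function name decodePercentEscapes is made up for illustration; note that, unlike urldecode(), it leaves + untouched.

```php
<?php
// Hypothetical helper (not from the original answer): decode each %XX escape by hand.
function decodePercentEscapes(string $uri): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            // hexdec() turns the two hex digits into a number, chr() turns that into the byte.
            return chr(hexdec($m[1]));
        },
        $uri
    );
}

// Upper-case and lower-case hex digits decode to the same character.
var_dump(decodePercentEscapes('/dir/file%5bfile') === decodePercentEscapes('/dir/file%5Bfile'));
```

In practice urldecode() or rawurldecode() does the same job in one call, so this is mostly useful for seeing what those functions do.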

I know this problem here seems to be solved by webbiedave, but I had my own problems with it.
First problem: Encoded characters are case-insensitive. So %C3 and %c3 are both the exact same character, although they are different as a URI. So both URIs point to the same location.
Second problem: folder%20(2) and folder%20%282%29 are both validly urlencoded URIs, which point to the same location, although they are different URIs.
Third problem: If I get rid of the url encoded characters I have two locations having the same URI like bla%2Fblubb and bla/blubb.
So what to do then? To compare two URIs, I need to normalize both of them: split them into their components, urldecode each path segment and query part once, rawurlencode them again, glue everything back together, and then compare the results.
And this could be the function to normalize it:
function normalizeURI($uri) {
    $components = parse_url($uri);
    $normalized = "";
    // isset() avoids "undefined index" notices for components that are absent
    if (isset($components['scheme'])) {
        $normalized .= $components['scheme'] . ":";
    }
    if (isset($components['host'])) {
        $normalized .= "//";
        if (isset($components['user'])) { // userinfo is rare in URIs, but anything can happen
            $normalized .= rawurlencode(urldecode($components['user']));
            if (isset($components['pass'])) {
                $normalized .= ":" . rawurlencode(urldecode($components['pass']));
            }
            $normalized .= "@"; // userinfo is separated from the host by "@"
        }
        $normalized .= $components['host'];
        if (isset($components['port'])) {
            $normalized .= ":" . $components['port'];
        }
    }
    if (isset($components['path'])) {
        // parse_url() keeps the path's leading "/", so no extra separator is added here
        $path = explode("/", $components['path']);
        $path = array_map("urldecode", $path);
        $path = array_map("rawurlencode", $path);
        $normalized .= implode("/", $path);
    }
    if (isset($components['query'])) {
        $query = explode("&", $components['query']);
        foreach ($query as $i => $c) {
            $c = explode("=", $c);
            $c = array_map("urldecode", $c);
            $c = array_map("rawurlencode", $c);
            $c = implode("=", $c);
            $query[$i] = $c;
        }
        $normalized .= "?" . implode("&", $query);
    }
    return $normalized;
}
Now you can alter webbiedave's function to this:
function uriMatches($uri1, $uri2) {
return normalizeURI($uri1) === normalizeURI($uri2);
}
That should do it. And yes, it is a lot more complicated than even I wanted it to be.
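To illustrate the three problems listed above on a plain relative path, here is a small sketch. It assumes the input is only a path (no scheme, host or query), and the helper name normalizePathSegments is hypothetical.

```php
<?php
// Hypothetical helper: decode and re-encode each path segment separately,
// so an encoded slash (%2F) inside a segment is never confused with a separator.
function normalizePathSegments(string $path): string
{
    $segments = explode('/', $path);
    $segments = array_map('urldecode', $segments);
    $segments = array_map('rawurlencode', $segments);
    return implode('/', $segments);
}

// Same location, different spelling: both normalize to the same string.
var_dump(normalizePathSegments('folder%20(2)') === normalizePathSegments('folder%20%282%29'));
// Different locations stay different.
var_dump(normalizePathSegments('bla%2Fblubb') === normalizePathSegments('bla/blubb'));
```

Decoding the whole string in one go would turn bla%2Fblubb into bla/blubb, which is why the split happens before the decoding step.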

Related

Replace string between two slashes [duplicate]

I have to modify an URL like this:
$string = "/st:1/sc:RsrlYQhSQvs=/fp:1/g:3/start:2015-07-01/end:2015-07-30";
Namely, I want to delete st:1 with a regex. I used:
preg_replace("/\/st:(.*)\//",'',$string)
but I got
end:2015-07-30
while I would like to get:
/sc:RsrlYQhSQvs=/fp:1/g:3/start:2015-07-01/end:2015-07-30
Same if I would like to delete fp:1.
You can use:
$string = preg_replace('~/st:[^/]*~','',$string);
[^/]* will only match till next /
You are using greedy matching with . that matches any character.
Use a more restricted pattern:
preg_replace("/\/st:[^\/]*/",'',$string)
The [^\/]* negated character class only matches 0 or more characters other than /.
Another solution would be to use lazy matching with the *? quantifier, but it is not as efficient as the negated character class.
FULL REGEX EXPLANATION:
\/st: - literal /st:
[^\/]* - 0 or more characters other than /.
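As a rough sketch, the negated character class can be wrapped in a small helper so that any segment can be removed by its key. The name removeSegment is an assumption made for this example, and the ~ delimiter is used so the slashes need no escaping.

```php
<?php
// Hypothetical helper: remove the "/key:value" segment for the given key.
function removeSegment(string $path, string $key): string
{
    // [^/]* stops at the next slash, so only this one segment is removed.
    return preg_replace('~/' . preg_quote($key, '~') . ':[^/]*~', '', $path);
}

$string = "/st:1/sc:RsrlYQhSQvs=/fp:1/g:3/start:2015-07-01/end:2015-07-30";
echo removeSegment($string, 'st'), "\n";
echo removeSegment($string, 'fp'), "\n";
```

Because the pattern requires a colon right after the key, removing st does not touch the start: segment.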
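You need to add ? to your regex to make the quantifier lazy:
<?php
$string = "/st:1/sc:RsrlYQhSQvs=/fp:1/g:3/start:2015-07-01/end:2015-07-30";
// The pattern also consumes the slash after st:1, so put one back to keep the leading slash.
echo preg_replace("/\/st:(.*?)\//", '/', $string);
?>
Output: https://eval.in/397658
You can remove the other segments, such as fp:1, in the same way.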
Instead of using regex here, you should write parsing utility functions for your special format string. They are simple, they don't take too long to write, and they will make your life a lot easier:
function readPath($path) {
$parameters = array();
foreach(explode('/', $path) as $piece) {
// Here we make sure we have something
if ($piece == "") {
continue;
}
// This here is just a fancy way of splitting the array returned
// into two variables.
list($key, $value) = explode(':', $piece);
$parameters[$key] = $value;
}
return $parameters;
}
function writePath($parameters) {
$path = "";
foreach($parameters as $key => $value) {
$path .= "/" . implode(":", array($key, $value));
}
return $path;
}
Now you can just work on it as a php array, in this case you would go:
$parameters = readPath($string);
unset($parameters['st']);
$string = writePath($parameters);
This makes for much more readable and reusable code, additionally since most of the time you are dealing with only slight variations of this format you can just change the delimiters each time or even abstract these functions to using different delimiters.
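One way to abstract the delimiters, as suggested above, is to pass them in as parameters. This is only a sketch; the names readDelimited and writeDelimited are made up for this example.

```php
<?php
// Hypothetical generalisation of readPath(): the delimiters are parameters.
function readDelimited(string $input, string $pairSep = '/', string $kvSep = ':'): array
{
    $parameters = [];
    foreach (explode($pairSep, $input) as $piece) {
        if ($piece === '') {
            continue; // skip the empty piece before a leading separator
        }
        // Split on the first key/value separator only; a missing value becomes ''.
        $parts = explode($kvSep, $piece, 2);
        $parameters[$parts[0]] = $parts[1] ?? '';
    }
    return $parameters;
}

// Hypothetical generalisation of writePath().
function writeDelimited(array $parameters, string $pairSep = '/', string $kvSep = ':'): string
{
    $out = '';
    foreach ($parameters as $key => $value) {
        $out .= $pairSep . $key . $kvSep . $value;
    }
    return $out;
}

$parameters = readDelimited("/st:1/sc:RsrlYQhSQvs=/fp:1/g:3/start:2015-07-01/end:2015-07-30");
unset($parameters['st']);
echo writeDelimited($parameters), "\n";
```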
Another way to deal with this is to convert the string to conform to a normal path query, using something like:
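function readPath($path) {
    // parse_str() returns nothing; it writes the parsed values into its second argument.
    parse_str(strtr($path, "/:", "&="), $parameters);
    return $parameters;
}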
In your case though since you are using the "=" character in a url you would also need to url encode each value so as to not conflict with the format, this would involve similarly structured code to above though.

Convert unicode URL to ASCII

I'm writing a PHP application that accepts a URL from the user, and then processes it by making some calls to binaries with system()*. However, to avoid many complications that arise with this, I'm trying to convert the URL, which may contain Unicode characters, into ASCII characters.
Let's say I have the following URL:
https://täst.de:8118/news/zh-cn/新闻动态/2015/
Here two parts need to be dealt with: the hostname and the path.
For the hostname, I can simply call idn_to_ascii().
However, I can't simply call urlencode() over the path, as each of the characters that need to remain unmodified will also be converted (e.g. news/zh-cn/新闻动态/2015/ -> news%2Fzh-cn%2F%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81%2F2015%2F as opposed to news/zh-cn/%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81/2015/).
How should I approach this problem?
*I'd rather not deal with system() calls and the resulting complexity, but given that the functionality is only available by calling binaries, I unfortunately have no choice.
Split the URL by /, urlencode() the parts that need it, then put it back together:
$url = explode("/", $url);
$url[2] = idn_to_ascii($url[2]);
$url[5] = urlencode($url[5]);
$url = join("/", $url);
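A more general sketch of the same idea, applied to the path only: encode every segment but leave the slashes alone. The function name encodePathSegments is an assumption, the input is assumed to be UTF-8, and a path that is already percent-encoded would be encoded twice.

```php
<?php
// Hypothetical helper: percent-encode each path segment, keeping "/" as the separator.
function encodePathSegments(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

echo encodePathSegments('/news/zh-cn/新闻动态/2015/'), "\n";
// /news/zh-cn/%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81/2015/
```

rawurlencode() is used here rather than urlencode() so that a space in a segment becomes %20 instead of +.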
You could use PHP's iconv function:
iconv("UTF-8", "ASCII//TRANSLIT", $url);
The following can be used for this transformation:
function convertpath ($path) {
$path1 = '';
$len = strlen ($path);
for ($i = 0; $i < $len; $i++) {
if (preg_match ('/^[A-Za-z0-9\/?=+%_.~-]$/', $path[$i])) {
$path1 .= $path[$i];
}
else {
$path1 .= urlencode ($path[$i]);
}
}
return $path1;
}

$_REQUEST is replacing . with _ [duplicate]

If I pass PHP variables with . in their names via $_GET PHP auto-replaces them with _ characters. For example:
<?php
echo "url is ".$_SERVER['REQUEST_URI']."<p>";
echo "x.y is ".$_GET['x.y'].".<p>";
echo "x_y is ".$_GET['x_y'].".<p>";
... outputs the following:
url is /SpShipTool/php/testGetUrl.php?x.y=a.b
x.y is .
x_y is a.b.
... my question is this: is there any way I can get this to stop? Cannot for the life of me figure out what I've done to deserve this
PHP version I'm running with is 5.2.4-2ubuntu5.3.
Here's PHP.net's explanation of why it does it:
Dots in incoming variable names
Typically, PHP does not alter the names of variables when they are passed into a script. However, it should be noted that the dot (period, full stop) is not a valid character in a PHP variable name. For the reason, look at it:
<?php
$varname.ext; /* invalid variable name */
?>
Now, what the parser sees is a variable named $varname, followed by the string concatenation operator, followed by the barestring (i.e. unquoted string which doesn't match any known key or reserved words) 'ext'. Obviously, this doesn't have the intended result.
For this reason, it is important to note that PHP will automatically replace any dots in incoming variable names with underscores.
That's from http://ca.php.net/variables.external.
Also, according to this comment these other characters are converted to underscores:
The full list of field-name characters that PHP converts to _ (underscore) is the following (not just dot):
chr(32) ( ) (space)
chr(46) (.) (dot)
chr(91) ([) (open square bracket)
chr(128) - chr(159) (various)
So it looks like you're stuck with it, so you'll have to convert the underscores back to dots in your script using dawnerd's suggestion (I'd just use str_replace though.)
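The renaming can be reproduced without a web request, because parse_str() applies the same name mangling as $_GET and $_POST. A minimal sketch (the variable name $vars and the sample query string are just examples):

```php
<?php
// Dots and spaces in top-level names are replaced with underscores.
parse_str('x.y=a.b&first%20name=Ada', $vars);
var_dump($vars);
// The keys are "x_y" and "first_name"; the values are left untouched.
```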
Long-since answered question, but there is actually a better answer (or work-around). PHP lets you at the raw input stream, so you can do something like this:
$query_string = file_get_contents('php://input');
which will give you the $_POST array in query string format, periods as they should be.
You can then parse it if you need (as per POSTer's comment)
<?php
// Function to fix up PHP's messing up input containing dots, etc.
// `$source` can be either 'POST' or 'GET'
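function getRealInput($source) {
    $pairs = explode("&", $source == 'POST' ? file_get_contents("php://input") : $_SERVER['QUERY_STRING']);
    $vars = array();
    foreach ($pairs as $pair) {
        if ($pair === '') {
            continue; // empty input or a stray "&"
        }
        // Split on the first "=" only; the value may be missing or contain "=" itself.
        $nv = explode("=", $pair, 2);
        $name = urldecode($nv[0]);
        $value = isset($nv[1]) ? urldecode($nv[1]) : '';
        $vars[$name] = $value;
    }
    return $vars;
}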
// Wrapper functions specifically for GET and POST:
function getRealGET() { return getRealInput('GET'); }
function getRealPOST() { return getRealInput('POST'); }
?>
Hugely useful for OpenID parameters, which contain both '.' and '_', each with a certain meaning!
Highlighting an actual answer by Johan in a comment above - I just wrapped my entire post in a top-level array which completely bypasses the problem with no heavy processing required.
In the form you do
<input name="data[database.username]">
<input name="data[database.password]">
<input name="data[something.else.really.deep]">
instead of
<input name="database.username">
<input name="database.password">
<input name="something.else.really.deep">
and in the post handler, just unwrap it:
$posdata = $_POST['data'];
For me this was a two-line change, as my views were entirely templated.
FYI. I am using dots in my field names to edit trees of grouped data.
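The effect of the wrapper array can be checked with parse_str(), which handles names the same way as $_POST: only the top-level name is mangled, keys inside the brackets keep their dots. A small sketch using the field names from above (the values are invented):

```php
<?php
parse_str('data[database.username]=root&data[something.else.really.deep]=1', $post);
// Dots inside the brackets survive.
var_dump($post['data']);
```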
Do you want a solution that is standards compliant, and works with deep arrays (for example: ?param[2][5]=10) ?
To fix all possible sources of this problem, you can apply at the very top of your PHP code:
$_GET = fix( $_SERVER['QUERY_STRING'] );
$_POST = fix( file_get_contents('php://input') );
$_COOKIE = fix( $_SERVER['HTTP_COOKIE'] );
The working of this function is a neat idea that I came up with during my summer vacation of 2013. Do not be discouraged by a simple regex, it just grabs all query names, encodes them (so dots are preserved), and then uses a normal parse_str() function.
function fix($source) {
$source = preg_replace_callback(
'/(^|(?<=&))[^=[&]+/',
function($key) { return bin2hex(urldecode($key[0])); },
$source
);
parse_str($source, $post);
$result = array();
foreach ($post as $key => $val) {
$result[hex2bin($key)] = $val;
}
return $result;
}
This happens because a period is an invalid character in a variable's name, the reason for which lies very deep in the implementation of PHP, so there are no easy fixes (yet).
In the meantime you can work around this issue by:
Accessing the raw query data via either php://input for POST data or $_SERVER['QUERY_STRING'] for GET data
Using a conversion function.
The below conversion function (PHP >= 5.4) encodes the names of each key-value pair into a hexadecimal representation and then performs a regular parse_str(); once done, it reverts the hexadecimal names back into their original form:
function parse_qs($data)
{
$data = preg_replace_callback('/(?:^|(?<=&))[^=[]+/', function($match) {
return bin2hex(urldecode($match[0]));
}, $data);
parse_str($data, $values);
return array_combine(array_map('hex2bin', array_keys($values)), $values);
}
// work with the raw query string
$data = parse_qs($_SERVER['QUERY_STRING']);
Or:
// handle posted data (this only works with application/x-www-form-urlencoded)
$data = parse_qs(file_get_contents('php://input'));
This approach is an altered version of Rok Kralj's, but with some tweaking to work, to improve efficiency (avoids unnecessary callbacks, encoding and decoding on unaffected keys) and to correctly handle array keys.
A gist with tests is available and any feedback or suggestions are welcome here or there.
public function fix(&$target, $source, $keep = false) {
if (!$source) {
return;
}
$keys = array();
$source = preg_replace_callback(
'/
# Match at start of string or &
(?:^|(?<=&))
# Exclude cases where the period is in brackets, e.g. foo[bar.blarg]
[^=&\[]*
# Affected cases: periods and spaces
(?:\.|%20)
# Keep matching until assignment, next variable, end of string or
# start of an array
[^=&\[]*
/x',
function ($key) use (&$keys) {
$keys[] = $key = base64_encode(urldecode($key[0]));
return urlencode($key);
},
$source
);
if (!$keep) {
$target = array();
}
parse_str($source, $data);
foreach ($data as $key => $val) {
// Only unprocess encoded keys
if (!in_array($key, $keys)) {
$target[$key] = $val;
continue;
}
$key = base64_decode($key);
$target[$key] = $val;
if ($keep) {
// Keep a copy in the underscore key version
$key = preg_replace('/(\.| )/', '_', $key);
$target[$key] = $val;
}
}
}
The reason this happens is PHP's old register_globals functionality. The . character is not a valid character in a variable name, so PHP converts it to an underscore to keep the names compatible.
In short, it's not good practice to use periods in URL variables.
If looking for any way to literally get PHP to stop replacing '.' characters in $_GET or $_POST arrays, then one such way is to modify PHP's source (and in this case it is relatively straightforward).
WARNING: Modifying PHP C source is an advanced option!
Also see this PHP bug report which suggests the same modification.
To explore you'll need to:
download PHP's C source code
disable the . replacement check
./configure, make and deploy your customized build of PHP
The source change itself is trivial and involves updating just one half of one line in main/php_variables.c:
....
/* ensure that we don't have spaces or dots in the variable name (not binary safe) */
for (p = var; *p; p++) {
if (*p == ' ' /*|| *p == '.'*/) {
*p='_';
....
Note: compared to original || *p == '.' has been commented-out
Example Output:
given a QUERY_STRING of a.a[]=bb&a.a[]=BB&c%20c=dd,
running <?php print_r($_GET); now produces:
Array
(
[a.a] => Array
(
[0] => bb
[1] => BB
)
[c_c] => dd
)
Notes:
this patch addresses the original question only (it stops replacement of dots, not spaces).
running on this patch will be faster than script-level solutions, but those pure-.php answers are still generally-preferable (because they avoid changing PHP itself).
in theory a polyfill approach is possible here and could combine approaches -- test for the C-level change using parse_str() and (if unavailable) fall-back to slower methods.
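The feature test mentioned in the last note could look roughly like this. It is only a sketch: the function name dotsArePreserved is hypothetical, and the fallback would be one of the script-level parsers from the other answers.

```php
<?php
// Hypothetical probe: does this PHP build keep dots in incoming variable names?
function dotsArePreserved(): bool
{
    parse_str('a.b=1', $probe);
    // A stock build produces the key "a_b"; a patched build would produce "a.b".
    return isset($probe['a.b']);
}

if (dotsArePreserved()) {
    // Patched build: $_GET and $_POST can be used as they are.
} else {
    // Stock build: fall back to parsing the raw query string or php://input.
}
```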
My solution to this problem was quick and dirty, but I still like it. I simply wanted to post a list of filenames that were checked on the form. I used base64_encode to encode the filenames within the markup and then just decoded it with base64_decode prior to using them.
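A minimal sketch of that round trip, assuming the encoded filename is used as the field name (the filename and the variable names are invented for illustration). Base64 output contains no dots, spaces or brackets, so PHP has nothing to replace:

```php
<?php
$filename = 'report.final.pdf';

// When rendering the form: use the Base64 form as the field name.
$field = base64_encode($filename);

// Simulate the submitted data; a browser would URL-encode the name like this.
parse_str(urlencode($field) . '=on', $posted);

// In the handler: decode the field name again.
$original = base64_decode(key($posted), true);
var_dump($original === $filename);
```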
After looking at Rok's solution I have come up with a version which addresses the limitations in my answer below, crb's above and Rok's solution as well. See my improved version.
@crb's answer above is a good start, but there are a couple of problems.
It reprocesses everything, which is overkill; only those fields that have a "." in the name need to be reprocessed.
It fails to handle arrays in the same way that native PHP processing does, e.g. for keys like "foo.bar[]".
The solution below addresses both of these problems now (note that it has been updated since originally posted). This is about 50% faster than my answer above in my testing, but will not handle situations where the data has the same key (or a key which gets extracted the same, e.g. foo.bar and foo_bar are both extracted as foo_bar).
<?php
public function fix2(&$target, $source, $keep = false) {
if (!$source) {
return;
}
preg_match_all(
'/
# Match at start of string or &
(?:^|(?<=&))
# Exclude cases where the period is in brackets, e.g. foo[bar.blarg]
[^=&\[]*
# Affected cases: periods and spaces
(?:\.|%20)
# Keep matching until assignment, next variable, end of string or
# start of an array
[^=&\[]*
/x',
$source,
$matches
);
foreach (current($matches) as $key) {
$key = urldecode($key);
$badKey = preg_replace('/(\.| )/', '_', $key);
if (isset($target[$badKey])) {
// Duplicate values may have already unset this
$target[$key] = $target[$badKey];
if (!$keep) {
unset($target[$badKey]);
}
}
}
}
Well, the function I include below, "getRealPostArray()", isn't a pretty solution, but it handles arrays and supports both names: "alpha_beta" and "alpha.beta":
<input type='text' value='First-.' name='alpha.beta[a.b][]' /><br>
<input type='text' value='Second-.' name='alpha.beta[a.b][]' /><br>
<input type='text' value='First-_' name='alpha_beta[a.b][]' /><br>
<input type='text' value='Second-_' name='alpha_beta[a.b][]' /><br>
whereas var_dump($_POST) produces:
'alpha_beta' =>
array (size=1)
'a.b' =>
array (size=4)
0 => string 'First-.' (length=7)
1 => string 'Second-.' (length=8)
2 => string 'First-_' (length=7)
3 => string 'Second-_' (length=8)
var_dump( getRealPostArray()) produces:
'alpha.beta' =>
array (size=1)
'a.b' =>
array (size=2)
0 => string 'First-.' (length=7)
1 => string 'Second-.' (length=8)
'alpha_beta' =>
array (size=1)
'a.b' =>
array (size=2)
0 => string 'First-_' (length=7)
1 => string 'Second-_' (length=8)
The function, for what it's worth:
function getRealPostArray() {
if ($_SERVER['REQUEST_METHOD'] !== 'POST') {#Nothing to do
return null;
}
$neverANamePart = '~#~'; #Any arbitrary string never expected in a 'name'
$postdata = file_get_contents("php://input");
$post = [];
$rebuiltpairs = [];
$postraws = explode('&', $postdata);
foreach ($postraws as $postraw) { #Each is a string like: 'xxxx=yyyy'
$keyvalpair = explode('=',$postraw);
if (empty($keyvalpair[1])) {
$keyvalpair[1] = '';
}
$pos = strpos($keyvalpair[0],'%5B');
if ($pos !== false) {
$str1 = substr($keyvalpair[0], 0, $pos);
$str2 = substr($keyvalpair[0], $pos);
$str1 = str_replace('.',$neverANamePart,$str1);
$keyvalpair[0] = $str1.$str2;
} else {
$keyvalpair[0] = str_replace('.',$neverANamePart,$keyvalpair[0]);
}
$rebuiltpair = implode('=',$keyvalpair);
$rebuiltpairs[]=$rebuiltpair;
}
$rebuiltpostdata = implode('&',$rebuiltpairs);
parse_str($rebuiltpostdata, $post);
$fixedpost = [];
foreach ($post as $key => $val) {
$fixedpost[str_replace($neverANamePart,'.',$key)] = $val;
}
return $fixedpost;
}
Using crb's answer, I wanted to recreate the $_POST array as a whole. Keep in mind that you'll still have to ensure you're encoding and decoding correctly on both the client and the server. It's important to understand when a character is truly invalid and when it is truly valid. Additionally, people should always escape client data before using it in any database command, without exception.
<?php
unset($_POST);
$_POST = array();
$p0 = explode('&',file_get_contents('php://input'));
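foreach ($p0 as $key => $value)
{
    // Split on the first "=" only, and tolerate pairs that have no value.
    $p1 = explode('=', $value, 2);
    $_POST[$p1[0]] = isset($p1[1]) ? $p1[1] : '';
    //OR...
    //$_POST[urldecode($p1[0])] = isset($p1[1]) ? urldecode($p1[1]) : '';
}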
print_r($_POST);
?>
I recommend using this for individual cases only; offhand, I'm not sure about the downsides of putting this at the top of your primary header file.
My current solution (based on previous replies in this topic):
function parseQueryString($data)
{
$data = rawurldecode($data);
$pattern = '/(?:^|(?<=&))[^=&\[]*[^=&\[]*/';
$data = preg_replace_callback($pattern, function ($match){
return bin2hex(urldecode($match[0]));
}, $data);
parse_str($data, $values);
return array_combine(array_map('hex2bin', array_keys($values)), $values);
}
$_GET = parseQueryString($_SERVER['QUERY_STRING']);

str_replace() on multibyte strings dangerous?

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do?
$string = str_replace('"', '\\"', $string);
In particular, if the input was in a character set that might have a valid character like 0xbf5c, so an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid character followed by an unquoted double quote (").
Is there an easy way to mitigate this problem, or am I misunderstanding the issue in the first place?
(In my case, the string is going into the value attribute of an HTML input tag: echo '<input type="text" value="' . $string . '">';)
EDIT: For that matter, what about a function like preg_quote()? There's no charset argument for it, so it seems totally useless in this scenario. When you DON'T have the option of limiting charset to UTF-8 (yes, that'd be nice), it seems like you are really handicapped. What replace and quoting functions are available in that case?
No, you’re right: Using a singlebyte string function on a multibyte string can cause an unexpected result. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:
$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));
Edit: Here’s a mb_replace implementation using the split-join variant:
function mb_replace($search, $replace, $subject, &$count=0) {
if (!is_array($search) && is_array($replace)) {
return false;
}
if (is_array($subject)) {
// call mb_replace for each single string in $subject
foreach ($subject as &$string) {
$string = mb_replace($search, $replace, $string, $c);
$count += $c;
}
} elseif (is_array($search)) {
if (!is_array($replace)) {
foreach ($search as &$string) {
$subject = mb_replace($string, $replace, $subject, $c);
$count += $c;
}
} else {
$n = max(count($search), count($replace));
while ($n--) {
$subject = mb_replace(current($search), current($replace), $subject, $c);
$count += $c;
next($search);
next($replace);
}
}
} else {
$parts = mb_split(preg_quote($search), $subject);
$count = count($parts)-1;
$subject = implode($replace, $parts);
}
return $subject;
}
As regards the combination of parameters, this function should behave like the singlebyte str_replace.
The code is perfectly safe with sane multibyte-encodings like UTF-8 and EUC-TW, but dangerous with broken ones like Shift_JIS, GB*, etc. Rather than going through all the headache and overhead to be safe with these legacy encodings, I would recommend just supporting only UTF-8.
You could use mb_ereg_replace after first specifying the charset with mb_regex_encoding(). Alternatively, if you use UTF-8, you can use preg_replace with the u modifier.
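A small sketch of the preg_replace variant, assuming the input is supposed to be UTF-8 (the function name escapeQuotesUtf8 is made up). With the u modifier, PCRE checks the subject first and preg_replace() returns null for invalid UTF-8, which gives you a chance to reject stray bytes such as 0xbf before any quoting happens.

```php
<?php
// Hypothetical helper: escape double quotes, refusing input that is not valid UTF-8.
function escapeQuotesUtf8(string $s): ?string
{
    // '\\\\"' is the three characters \\" in PHP, i.e. a literal backslash plus a quote.
    return preg_replace('/"/u', '\\\\"', $s);
}

var_dump(escapeQuotesUtf8('say "hi"'));
var_dump(escapeQuotesUtf8("\xbf\x22"));   // NULL: a lone 0xbf byte is not valid UTF-8
```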

Get PHP to stop replacing '.' characters in $_GET or $_POST arrays?

If I pass PHP variables with . in their names via $_GET PHP auto-replaces them with _ characters. For example:
<?php
echo "url is ".$_SERVER['REQUEST_URI']."<p>";
echo "x.y is ".$_GET['x.y'].".<p>";
echo "x_y is ".$_GET['x_y'].".<p>";
... outputs the following:
url is /SpShipTool/php/testGetUrl.php?x.y=a.b
x.y is .
x_y is a.b.
... my question is this: is there any way I can get this to stop? Cannot for the life of me figure out what I've done to deserve this
PHP version I'm running with is 5.2.4-2ubuntu5.3.
Here's PHP.net's explanation of why it does it:
Dots in incoming variable names
Typically, PHP does not alter the
names of variables when they are
passed into a script. However, it
should be noted that the dot (period,
full stop) is not a valid character in
a PHP variable name. For the reason,
look at it:
<?php
$varname.ext; /* invalid variable name */
?>
Now, what
the parser sees is a variable named
$varname, followed by the string
concatenation operator, followed by
the barestring (i.e. unquoted string
which doesn't match any known key or
reserved words) 'ext'. Obviously, this
doesn't have the intended result.
For this reason, it is important to
note that PHP will automatically
replace any dots in incoming variable
names with underscores.
That's from http://ca.php.net/variables.external.
Also, according to this comment these other characters are converted to underscores:
The full list of field-name characters that PHP converts to _ (underscore) is the following (not just dot):
chr(32) ( ) (space)
chr(46) (.) (dot)
chr(91) ([) (open square bracket)
chr(128) - chr(159) (various)
So it looks like you're stuck with it, so you'll have to convert the underscores back to dots in your script using dawnerd's suggestion (I'd just use str_replace though.)
Long-since answered question, but there is actually a better answer (or work-around). PHP lets you at the raw input stream, so you can do something like this:
$query_string = file_get_contents('php://input');
which will give you the $_POST array in query string format, periods as they should be.
You can then parse it if you need (as per POSTer's comment)
<?php
// Function to fix up PHP's messing up input containing dots, etc.
// `$source` can be either 'POST' or 'GET'
function getRealInput($source) {
$pairs = explode("&", $source == 'POST' ? file_get_contents("php://input") : $_SERVER['QUERY_STRING']);
$vars = array();
foreach ($pairs as $pair) {
$nv = explode("=", $pair);
$name = urldecode($nv[0]);
$value = urldecode($nv[1]);
$vars[$name] = $value;
}
return $vars;
}
// Wrapper functions specifically for GET and POST:
function getRealGET() { return getRealInput('GET'); }
function getRealPOST() { return getRealInput('POST'); }
?>
Hugely useful for OpenID parameters, which contain both '.' and '_', each with a certain meaning!
Highlighting an actual answer by Johan in a comment above - I just wrapped my entire post in a top-level array which completely bypasses the problem with no heavy processing required.
In the form you do
<input name="data[database.username]">
<input name="data[database.password]">
<input name="data[something.else.really.deep]">
instead of
<input name="database.username">
<input name="database.password">
<input name="something.else.really.deep">
and in the post handler, just unwrap it:
$posdata = $_POST['data'];
For me this was a two-line change, as my views were entirely templated.
FYI. I am using dots in my field names to edit trees of grouped data.
Do you want a solution that is standards compliant, and works with deep arrays (for example: ?param[2][5]=10) ?
To fix all possible sources of this problem, you can apply at the very top of your PHP code:
$_GET = fix( $_SERVER['QUERY_STRING'] );
$_POST = fix( file_get_contents('php://input') );
$_COOKIE = fix( $_SERVER['HTTP_COOKIE'] );
The working of this function is a neat idea that I came up during my summer vacation of 2013. Do not be discouraged by a simple regex, it just grabs all query names, encodes them (so dots are preserved), and then uses a normal parse_str() function.
function fix($source) {
$source = preg_replace_callback(
'/(^|(?<=&))[^=[&]+/',
function($key) { return bin2hex(urldecode($key[0])); },
$source
);
parse_str($source, $post);
$result = array();
foreach ($post as $key => $val) {
$result[hex2bin($key)] = $val;
}
return $result;
}
This happens because a period is an invalid character in a variable's name, the reason for which lies very deep in the implementation of PHP, so there are no easy fixes (yet).
In the meantime you can work around this issue by:
Accessing the raw query data via either php://input for POST data or $_SERVER['QUERY_STRING'] for GET data
Using a conversion function.
The below conversion function (PHP >= 5.4) encodes the names of each key-value pair into a hexadecimal representation and then performs a regular parse_str(); once done, it reverts the hexadecimal names back into their original form:
function parse_qs($data)
{
$data = preg_replace_callback('/(?:^|(?<=&))[^=[]+/', function($match) {
return bin2hex(urldecode($match[0]));
}, $data);
parse_str($data, $values);
return array_combine(array_map('hex2bin', array_keys($values)), $values);
}
// work with the raw query string
$data = parse_qs($_SERVER['QUERY_STRING']);
Or:
// handle posted data (this only works with application/x-www-form-urlencoded)
$data = parse_qs(file_get_contents('php://input'));
This approach is an altered version of Rok Kralj's, but with some tweaking to work, to improve efficiency (avoids unnecessary callbacks, encoding and decoding on unaffected keys) and to correctly handle array keys.
A gist with tests is available and any feedback or suggestions are welcome here or there.
public function fix(&$target, $source, $keep = false) {
if (!$source) {
return;
}
$keys = array();
$source = preg_replace_callback(
'/
# Match at start of string or &
(?:^|(?<=&))
# Exclude cases where the period is in brackets, e.g. foo[bar.blarg]
[^=&\[]*
# Affected cases: periods and spaces
(?:\.|%20)
# Keep matching until assignment, next variable, end of string or
# start of an array
[^=&\[]*
/x',
function ($key) use (&$keys) {
$keys[] = $key = base64_encode(urldecode($key[0]));
return urlencode($key);
},
$source
);
if (!$keep) {
$target = array();
}
parse_str($source, $data);
foreach ($data as $key => $val) {
// Only unprocess encoded keys
if (!in_array($key, $keys)) {
$target[$key] = $val;
continue;
}
$key = base64_decode($key);
$target[$key] = $val;
if ($keep) {
// Keep a copy in the underscore key version
$key = preg_replace('/(\.| )/', '_', $key);
$target[$key] = $val;
}
}
}
The reason this happens is because of PHP's old register_globals functionality. The . character is not a valid character in a variable name, so PHP coverts it to an underscore in order to make sure there's compatibility.
In short, it's not a good practice to do periods in URL variables.
If looking for any way to literally get PHP to stop replacing '.' characters in $_GET or $_POST arrays, then one such way is to modify PHP's source (and in this case it is relatively straightforward).
WARNING: Modifying PHP C source is an advanced option!
Also see this PHP bug report which suggests the same modification.
To explore you'll need to:
download PHP's C source code
disable the . replacement check
./configure, make and deploy your customized build of PHP
The source change itself is trivial and involves updating just one half of one line in main/php_variables.c:
....
/* ensure that we don't have spaces or dots in the variable name (not binary safe) */
for (p = var; *p; p++) {
if (*p == ' ' /*|| *p == '.'*/) {
*p='_';
....
Note: compared to the original, || *p == '.' has been commented out.
Example Output:
given a QUERY_STRING of a.a[]=bb&a.a[]=BB&c%20c=dd,
running <?php print_r($_GET); now produces:
Array
(
[a.a] => Array
(
[0] => bb
[1] => BB
)
[c_c] => dd
)
Notes:
this patch addresses the original question only (it stops replacement of dots, not spaces).
running with this patch will be faster than script-level solutions, but those pure-PHP answers are still generally preferable (because they avoid changing PHP itself).
in theory a polyfill approach is possible here and could combine approaches: test for the C-level change using parse_str() and (if unavailable) fall back to slower methods.
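A rough sketch of that polyfill idea, assuming the patched build also changes what parse_str() returns. The function names phpPreservesDots() and parseQueryPortable() are hypothetical, and the script-level fallback is passed in as a callable so that any of the pure-PHP answers could be plugged in:

```php
<?php
// Probe: does this PHP build keep dots in top-level names?
function phpPreservesDots(): bool
{
    parse_str('a.b=1', $probe);
    return array_key_exists('a.b', $probe);
}

// Use the native parser when it already keeps dots, otherwise
// delegate to a slower script-level parser supplied by the caller.
function parseQueryPortable(string $raw, callable $fallback): array
{
    if (phpPreservesDots()) {
        parse_str($raw, $result);
        return $result;
    }
    return $fallback($raw);
}
```

On an unpatched build phpPreservesDots() returns false, so every call goes through the fallback.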
My solution to this problem was quick and dirty, but I still like it. I simply wanted to post a list of filenames that were checked on the form. I used base64_encode to encode the filenames within the markup and then just decoded it with base64_decode prior to using them.
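Roughly, that idea looks like the sketch below. The helper names encodeFieldName() and decodeFieldNames() are made up for illustration, the form submission is simulated with http_build_query() and parse_str(), and every submitted key is assumed to have been base64-encoded by the page:

```php
<?php
// Encode a name so it contains no dots or spaces for PHP to replace.
function encodeFieldName(string $name): string
{
    return base64_encode($name);
}

// Decode the keys of a submitted array back to the original names.
function decodeFieldNames(array $submitted): array
{
    $out = [];
    foreach ($submitted as $key => $value) {
        $decoded = base64_decode((string) $key, true);
        // Keep the key as-is if it is not valid base64.
        $out[$decoded === false ? $key : $decoded] = $value;
    }
    return $out;
}

// Simulate a checked checkbox named after a file.
$query = http_build_query([encodeFieldName('report.v1 final.pdf') => 'on']);
parse_str($query, $received);
$files = decodeFieldNames($received);
print_r($files);
```

The base64 alphabet has no periods or spaces, so the name survives PHP's parsing and can be restored afterwards.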
After looking at Rok's solution I have come up with a version which addresses the limitations in my answer below, crb's above and Rok's solution as well. See my improved version.
@crb's answer above is a good start, but there are a couple of problems.
It reprocesses everything, which is overkill; only those fields that have a "." in the name need to be reprocessed.
It fails to handle arrays in the same way that native PHP processing does, e.g. for keys like "foo.bar[]".
The solution below addresses both of these problems now (note that it has been updated since originally posted). This is about 50% faster than my answer above in my testing, but will not handle situations where the data has the same key (or a key which gets extracted the same, e.g. foo.bar and foo_bar are both extracted as foo_bar).
<?php
public function fix2(&$target, $source, $keep = false) {
if (!$source) {
return;
}
preg_match_all(
'/
# Match at start of string or &
(?:^|(?<=&))
# Exclude cases where the period is in brackets, e.g. foo[bar.blarg]
[^=&\[]*
# Affected cases: periods and spaces
(?:\.|%20)
# Keep matching until assignment, next variable, end of string or
# start of an array
[^=&\[]*
/x',
$source,
$matches
);
foreach (current($matches) as $key) {
$key = urldecode($key);
$badKey = preg_replace('/(\.| )/', '_', $key);
if (isset($target[$badKey])) {
// Duplicate values may have already unset this
$target[$key] = $target[$badKey];
if (!$keep) {
unset($target[$badKey]);
}
}
}
}
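The duplicate-key limitation of this version comes from the fact that it works on data PHP has already parsed. A small illustration using parse_str(), which collapses names the same way (the query string is made up):

```php
<?php
// Both names map to the same key, so the second value overwrites the first.
parse_str('foo.bar=1&foo_bar=2', $parsed);
print_r($parsed);
```

Only one foo_bar entry remains after parsing, so there is nothing left from which to restore a separate foo.bar value.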
Well, the function I include below, "getRealPostArray()", isn't a pretty solution, but it handles arrays and supports both names: "alpha_beta" and "alpha.beta":
<input type='text' value='First-.' name='alpha.beta[a.b][]' /><br>
<input type='text' value='Second-.' name='alpha.beta[a.b][]' /><br>
<input type='text' value='First-_' name='alpha_beta[a.b][]' /><br>
<input type='text' value='Second-_' name='alpha_beta[a.b][]' /><br>
With these inputs, var_dump($_POST) produces:
'alpha_beta' =>
array (size=1)
'a.b' =>
array (size=4)
0 => string 'First-.' (length=7)
1 => string 'Second-.' (length=8)
2 => string 'First-_' (length=7)
3 => string 'Second-_' (length=8)
while var_dump(getRealPostArray()) produces:
'alpha.beta' =>
array (size=1)
'a.b' =>
array (size=2)
0 => string 'First-.' (length=7)
1 => string 'Second-.' (length=8)
'alpha_beta' =>
array (size=1)
'a.b' =>
array (size=2)
0 => string 'First-_' (length=7)
1 => string 'Second-_' (length=8)
The function, for what it's worth:
function getRealPostArray() {
if ($_SERVER['REQUEST_METHOD'] !== 'POST') {#Nothing to do
return null;
}
$neverANamePart = '~#~'; #Any arbitrary string never expected in a 'name'
$postdata = file_get_contents("php://input");
$post = [];
$rebuiltpairs = [];
$postraws = explode('&', $postdata);
foreach ($postraws as $postraw) { #Each is a string like: 'xxxx=yyyy'
$keyvalpair = explode('=', $postraw, 2);
if (!isset($keyvalpair[1])) { #empty() would also discard a value of '0'
$keyvalpair[1] = '';
}
$pos = strpos($keyvalpair[0],'%5B');
if ($pos !== false) {
$str1 = substr($keyvalpair[0], 0, $pos);
$str2 = substr($keyvalpair[0], $pos);
$str1 = str_replace('.',$neverANamePart,$str1);
$keyvalpair[0] = $str1.$str2;
} else {
$keyvalpair[0] = str_replace('.',$neverANamePart,$keyvalpair[0]);
}
$rebuiltpair = implode('=',$keyvalpair);
$rebuiltpairs[]=$rebuiltpair;
}
$rebuiltpostdata = implode('&',$rebuiltpairs);
parse_str($rebuiltpostdata, $post);
$fixedpost = [];
foreach ($post as $key => $val) {
$fixedpost[str_replace($neverANamePart,'.',$key)] = $val;
}
return $fixedpost;
}
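To try the same placeholder idea without sending a real POST request, here is a variant that takes the raw string as an argument. This is only a sketch: parseKeepingDots() is a hypothetical name, it decodes the key first so that both [ and %5B are recognised, and like the function above it restores dots only (spaces in names still become underscores):

```php
<?php
function parseKeepingDots(string $raw, string $placeholder = '~#~'): array
{
    $pairs = [];
    foreach (explode('&', $raw) as $pair) {
        if ($pair === '') {
            continue;
        }
        // Split on the first '=' only; a missing value becomes ''.
        [$key, $value] = array_pad(explode('=', $pair, 2), 2, '');
        $key = urldecode($key);
        // Only the part before the first bracket is mangled by PHP.
        $pos = strpos($key, '[');
        $head = $pos === false ? $key : substr($key, 0, $pos);
        $tail = $pos === false ? '' : substr($key, $pos);
        $head = str_replace('.', $placeholder, $head);
        $pairs[] = urlencode($head . $tail) . '=' . $value;
    }
    parse_str(implode('&', $pairs), $parsed);

    // Swap the placeholder back to a dot in the top-level keys.
    $fixed = [];
    foreach ($parsed as $key => $val) {
        $fixed[str_replace($placeholder, '.', (string) $key)] = $val;
    }
    return $fixed;
}

print_r(parseKeepingDots('alpha.beta%5Ba.b%5D%5B%5D=x&alpha_beta[a.b][]=y'));
```

As with the original, the placeholder must be a string that never appears in a real field name.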
Using crb's answer, I wanted to recreate the $_POST array as a whole. Keep in mind that you'll still have to ensure you're encoding and decoding correctly on both the client and the server. It's important to understand when a character is truly invalid and when it is truly valid. Additionally, always escape client data before using it in any database command, without exception.
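<?php
// Rebuild $_POST from the raw request body so key names are not mangled.
$_POST = array();
$p0 = explode('&',file_get_contents('php://input'));
foreach ($p0 as $value)
{
if ($value === '') {
continue; // skip empty segments, e.g. an empty request body
}
// Split on the first '=' only and default to '' when there is no value
$p1 = array_pad(explode('=',$value,2), 2, '');
$_POST[$p1[0]] = $p1[1];
//OR...
//$_POST[urldecode($p1[0])] = urldecode($p1[1]);
}
print_r($_POST);
?>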
I recommend using this for individual cases only; offhand, I'm not sure what the downsides of putting this at the top of your primary header file would be.
My current solution (based on previous replies in this topic):
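function parseQueryString($data)
{
// Caveat: decoding the whole string first turns an encoded %26 or %3D inside a value into a literal & or =
$data = rawurldecode($data);
// Match each top-level key name (up to =, & or the start of an array)
$pattern = '/(?:^|(?<=&))[^=&\[]*[^=&\[]*/';
$data = preg_replace_callback($pattern, function ($match){
// Hex-encode the key so parse_str() finds no dots or spaces to replace
return bin2hex(urldecode($match[0]));
}, $data);
parse_str($data, $values);
// Restore the original key names
return array_combine(array_map('hex2bin', array_keys($values)), $values);
}
$_GET = parseQueryString($_SERVER['QUERY_STRING']);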
