Can you accurately process a directory structure by URI alone? - php

Take the following complex URI (or path, what have you).
/directory/subdirectory/flashy-seo-directory/?query=123&complexvar=abc/123etc
Take this simpler one.
/directory/?query=123
What methodology would you use to accurately process the URI and separate the directory from the filename/query/etc.?
I know how to do this in simple, expected, and typical scenarios where everything is formatted "normally" or "favorably", but what I'd like to know is whether the following approach will accurately cover all possible valid directory names/structures/queries/etc. For example, I once saw a URI like this that I don't quite understand: /directory/index.php/something/?query=123. Not even sure what's going on there.
Methodology (not dependent on any specific programming language, though I am using PHP for this)
explode entire URI by / placing each bit in a neat array
$bits = explode( '/', $uri );
Loop through each array item and determine(?) at what point we've "reached" the portion of the URI that is no longer directory structure
Note which array key is no longer directory structure and implode the prior keys to assemble the directory
--
My idea for Step 2 was basically to check that there are no query-specific characters (?, &, =). I haven't seen any directories with .s in them, but as you can see you can have a query variable such as ?q=abc/123, so simply checking for / wouldn't work. I've seen directories with the ~ symbol, so a simple [A-Za-z0-9-] regex might not work in every scenario. I'm wondering how Step 2 can be done accurately.
This is needed because the URI can capture a "virtual directory" the script may be running under that doesn't actually exist anywhere, perhaps via .htaccess for SEO or what have you. It therefore needs to be properly and accurately accounted for in order to have robust and flexible functionality throughout.

If you are only interested in the path part, and there is no host involved, then you only need to split (explode) the string at the first valid URI path delimiter.
Valid delimiters: ; # ?
$uri = "/directory/flashy-seo-directory/?query=123&complexvar=abc/123etc";
foreach (str_split("#;?") as $dlm) {
$uri = str_contains($uri, $dlm) ? explode($dlm, $uri, 2)[0] : $uri;
}
echo($uri);
Result:
/directory/flashy-seo-directory/

I suppose you're looking for parse_url()
https://www.php.net/manual/en/function.parse-url
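As a minimal sketch of the parse_url() suggestion applied to the URIs from the question: the helper name splitPathAndQuery is hypothetical, while parse_url(), parse_str() and the PHP_URL_* constants are built in. It only separates the path from the query; it makes no attempt to decide which path segments are real directories.

```php
<?php
// Hypothetical helper: split a URI into its path and its decoded query parameters.
function splitPathAndQuery(string $uri): array
{
    $path  = parse_url($uri, PHP_URL_PATH);   // everything before "?" or "#"
    $query = parse_url($uri, PHP_URL_QUERY);  // everything between "?" and "#"

    $params = [];
    if (is_string($query)) {
        parse_str($query, $params);           // decode into an associative array
    }

    return [
        'path'   => is_string($path) ? $path : '',
        'params' => $params,
    ];
}

$parts = splitPathAndQuery('/directory/subdirectory/flashy-seo-directory/?query=123&complexvar=abc/123etc');
echo $parts['path'], PHP_EOL;                 // /directory/subdirectory/flashy-seo-directory/
echo $parts['params']['complexvar'], PHP_EOL; // abc/123etc
```

The slash inside complexvar stays in the query because the split happens at the first ?, not at each /.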

Related

How to separate filename from path? basename() versus preg_split() with array_pop()

Why use basename() in PHP scripts if what this function is actually doing may be written in 2 lines:
$subFolders = preg_split("!\\\|\\/!ui", $path); // explode on `\` or `/`
$name = array_pop($subFolders); // extract last element's value (filename)
Am I missing something?
However I guess this code above may not work correctly if file name has some \ or / in it. But that's not possible, right?
PHP may run on many different systems with many different filesystems and naming conventions. Using a function like basename() guarantees (or at least, is supposed to guarantee) a correct result regardless of the platform. This increases the code's portability.
Also, compare a simple basename() call to your code - which is more readable, and more obvious to other programmers in regards to what it is supposed to do?
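For comparison, here is a short sketch of the built-in path helpers on a forward-slash path. The sample path is made up for illustration; basename(), dirname() and pathinfo() are standard PHP functions.

```php
<?php
// Illustrative path only; it does not need to exist on disk.
$path = '/directory/subdirectory/report.final.php';

$file = basename($path);                     // 'report.final.php'
$stem = basename($path, '.php');             // 'report.final' (suffix stripped)
$dir  = dirname($path);                      // '/directory/subdirectory'
$ext  = pathinfo($path, PATHINFO_EXTENSION); // 'php'

echo $file, ' | ', $stem, ' | ', $dir, ' | ', $ext, PHP_EOL;
```

Each call states its intent in the function name, which is the readability point made above.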

Clean directory pathname php

basically what i want to do is:
include($_SERVER['REQUEST_URI']);
Problem is, that this is not safe.
It would be safe if it pointed to "/allowed/directory/" or its subdirectories.
So i test for that with startsWith("/allowed/directory/").
However I'm still afraid of something like:
"allowed/directory/../../bad/directory"
Is there a way to check whether a string points to a specific directory or one of its subdirectories in PHP?
(Basically apply all the /../ - or am i missing another security flaw?)
The PHP function realpath() resolves ../ segments and collapses repeated slashes, returning the canonical absolute path (or false if the path does not exist).
Though you are right, this can be a fairly dangerous operation. IMO the paths should be restricted to a known set of characters (like "a-zA-Z_" and / ). Also, path strings should be limited to a known size (like 256 chars).
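A minimal sketch of the realpath() idea, assuming the allowed directory exists on disk. The function name isInsideDirectory and both parameter names are hypothetical; the check compares canonical paths rather than the raw request string.

```php
<?php
// Hypothetical helper: true only if $candidate resolves to $baseDir itself
// or to something underneath it. Non-existent paths are rejected.
function isInsideDirectory(string $candidate, string $baseDir): bool
{
    $base = realpath($baseDir);
    $real = realpath($candidate);   // resolves "..", ".", symlinks, extra slashes

    if ($base === false || $real === false) {
        return false;               // one of the paths does not exist
    }

    $base = rtrim($base, DIRECTORY_SEPARATOR);
    $prefix = $base . DIRECTORY_SEPARATOR;

    // Compare against "base/" so that "/allowed/directory-evil" does not pass.
    return $real === $base || strncmp($real, $prefix, strlen($prefix)) === 0;
}
```

A caller would run this check first and only include the file when it returns true.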
Once you've determined the prefix is correct, you can use preg_match like this:
if (preg_match("#^[A-Za-z0-9/]+$#", $string)) {
// correct
}
else {
// incorrect
}
The variable part you're checking (the non-static part) should typically be just alphanumeric.
As long as you're using include to include local PHP files and properly validate your input (keeping that input simple), you should be fine. Just be extremely careful and test things thoroughly. You typically want to avoid passing user input into sensitive functions such as include, but with a framework it's sometimes difficult to avoid that.
Another thing you could do is have a list of valid inputs to do an exact comparison. You could have this in an ini file and load it with parse_ini_file. This is usually the safest thing to do, just a little more work. You can also use a PHP file with an array, which works better with APC.
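The exact-comparison idea could look roughly like this. The page keys, file names and the resolveAllowedPage helper are all hypothetical; the list is shown as a PHP array rather than an ini file.

```php
<?php
// Hypothetical allow-list: request key => file that may be included.
const ALLOWED_PAGES = [
    'home'    => 'pages/home.php',
    'contact' => 'pages/contact.php',
];

// Returns the file for a known key, or null for anything else.
function resolveAllowedPage(string $requested): ?string
{
    return ALLOWED_PAGES[$requested] ?? null;
}

$file = resolveAllowedPage('contact');
echo $file ?? 'not allowed', PHP_EOL;   // pages/contact.php
```

Because user input is only ever used as an array key, a value such as ../../bad/directory never reaches include.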

Codeigniter Routes, unexpected behavior and don't know why

So, I don't know how many times I have done something similar with CodeIgniter in the past across a handful of sites. However, having just got myself a fresh copy and started a rebuild on a new site, I'm already running into a twisted little issue that I can't figure out. I don't know what I am missing in this equation.
So this is my Routes currently:
$route['default_controller'] = "home";
$route['404_override'] = 'home/unknown';
$route['post/(:any)/(:any)'] = 'post/$1/$2';
//$route['post/(:any)'] = 'post/$1';
//$route['post'] = 'post';
$route['host-hotel'] = 'host_hotel';
$route['floor-plan'] = 'floor_plan';
$route['wine-facts'] = 'wine_facts';
$route['exhibitor-application'] = 'exhibitor_application';
$route['photo-gallery'] = 'photo_gallery';
$route['video-gallery'] = 'video_gallery';
Now the problem is specifically with this guy.
$route['post/(:any)/(:any)'] = 'post/$1/$2';
I've even tried naming the segment article instead of post, thinking maybe it's a protected name of sorts in CI. If you notice, above this line I have also tried adding variations of the URL so it could handle either just the first segment or one or two more thereafter. Granted, they are commented out in this example, but they didn't work.
If I go to domain.com/post the behavior is as expected. Anything thereafter is where the issue starts, whether I add numbers, letters, or any combination thereof.
For example, at domain.com/post/s2hj or domain.com/post/s2hj/avd the page starts acting like the 404 behavior (page not found), which makes no sense to me; as I said, I've done routes like this in the past, and to me this route looks proper. Does anyone have any ideas or suggestions on what to look for?
:any matches literally any character, including slashes. This will be changed in CodeIgniter 3.0, but for the time being you can use ([^/]+) to catch a single segment.
Another way to try this is through Good ol' regexp:
$route['post/([^/]+)/([^/]+)'] = 'post/$1/$2';
If you also need to handle URLs with only one segment after post, make the second segment optional:
$route['post/([^/]+)(?:/([^/]+))?'] = 'post/$1/$2';
REGEXP explained:
() captures the string matched inside it, which gives us $1, $2, etc. at the other end.
(?:) groups like the above but does not capture, so it does not produce a $1, $2, etc. of its own.
[] is the set of allowed characters. [0-9] will match digits only.
[^] is the reversed logic: everything except the given characters.
[^/] is everything but /, thus giving us the desired result.
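To see how a pattern of this kind behaves outside the framework, here is a small sketch using plain preg_match() with a first segment that is required and a second that is optional. The helper name matchPostRoute is hypothetical, and the ^...$ anchors are written out by hand; this is not CodeIgniter's own routing code.

```php
<?php
// Hypothetical helper: returns [segment1, segment2|null] or null if no match.
function matchPostRoute(string $uri): ?array
{
    // ([^/]+)          one segment, no slashes allowed
    // (?:/([^/]+))?    optionally, a slash followed by a second segment
    if (preg_match('#^post/([^/]+)(?:/([^/]+))?$#', $uri, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2] ?? null];
}

var_dump(matchPostRoute('post/s2hj'));      // one segment captured, second is null
var_dump(matchPostRoute('post/s2hj/avd'));  // both segments captured
var_dump(matchPostRoute('post/a/b/c'));     // null: too many segments
```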

get the '/foo' out of 'http://someplace.com/index.php/foo'

As far as I know, PHP has a function to get the '/foo/bar/' out of a URL like: 'http://someplace.com/index.php/foo/bar/'
Can't remember what the function is called.
[edit]
I remember using something like this in ExpressionEngine (see this), and later coming across an article explaining such a function built into PHP. However, I can't recall what it was.
[edit #2]
I know that there are functions to get the URL and several to manipulate it. However, I clearly remember that there was one function doing just this specific thing. Look at the ExpressionEngine example I linked to in order to get a better understanding of what I mean.
[edit #3]
It wasn't ExpressionEngine I had used. It was CodeIgniter. But it's basically the same thing.
[edit #4]
Maybe I am wrong. I just remember coming across just such a function in an article once...
Case closed (unless someone stumbles upon just such a function).
I believe you are looking for parse_url.
parse_url('http://someplace.com/index.php/foo');
/*
Array
(
[scheme] => http
[host] => someplace.com
[path] => /index.php/foo
)
*/
You can then manipulate the path item to remove /index.php.
It's not a function. It's a variable: $_SERVER['PATH_INFO']
That's $_SERVER['PATH_INFO']. It may not be available on all systems; it depends on the web server passing it on. In Apache, that's the AcceptPathInfo option.
response to gregoire:
It's impossible to pull out path_info from a url with 100% reliability unless it's being done on the webserver handling that url at the time - you cannot tell where the actual script part ends and the path_info starts, especially if the path is something like
/a/b/c/scriptishere/path/info
There's no '.html', or '.php', or '.aspx' or whatever to even give you a hint. As such, this is the only way to 100% reliably answer the OP's question. Anything else is a guess; even "index.php" in the OP's sample could be a directory, with 'foo' being the actual script.
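A rough sketch of reading PATH_INFO on the server that handles the request. The helper name pathInfoFrom is hypothetical, and the fallback (stripping SCRIPT_NAME from the request path) is an assumption that only helps when the server populates those two variables consistently; it is not a general solution.

```php
<?php
// Hypothetical helper: takes the $_SERVER array so it can be exercised with sample data.
function pathInfoFrom(array $server): string
{
    // Preferred: the web server already split script and path info for us.
    if (isset($server['PATH_INFO'])) {
        return (string) $server['PATH_INFO'];
    }

    // Assumed fallback: remove the script name from the front of the request path.
    $path   = parse_url($server['REQUEST_URI'] ?? '', PHP_URL_PATH);
    $script = $server['SCRIPT_NAME'] ?? '';
    if (is_string($path) && $script !== '' && strncmp($path, $script, strlen($script)) === 0) {
        return (string) substr($path, strlen($script));
    }

    return '';
}

// In a real request this would be called as pathInfoFrom($_SERVER).
echo pathInfoFrom(['PATH_INFO' => '/foo/bar/']), PHP_EOL;   // /foo/bar/
```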
If the string is always going to have index.php in it, why not just substr, like so:
$url = "http://someplace.com/index.php/foo/bar/";
$delim = 'index.php';
$path = substr($url,strpos($url,$delim)+strlen($delim));
That's a little verbose, but if you could clarify where this string is coming from and which parts are going to change, I could give a more concise answer.
You could also use regular expressions:
$matches = array();
preg_match('#index\.php/(.*)$#', $url, $matches);
$matches[1] will contain the captured path; $matches[0] will be the full matched text.
I didn't test that regex, but something like that should work.

How to deal with question mark in url in php single entry website

I'm dealing with two question marks in a single entry website.
I'm trying to use urlencode to handle it.
The original URL:
'search.php?query='.quote_replace(addmarks($search_results['did_you_mean'])).'&search=1'
I want to use it in the single entry website:
'index.php?page='.urlencode('search?query='.quote_replace(addmarks($search_results['did_you_mean'])).'&search=1')
It doesn't work, and I don't know if I must use urldecode and where I can use it also.
Why not just rewrite it to become
index.php?page=search&query=...
mod_rewrite will do this for you if you use the [QSA] (query string append) flag.
http://wiki.apache.org/httpd/RewriteQueryString
$_SERVER['QUERY_STRING'] will give you everything after the first "?" in a URL.
From here you can parse using "explode" or common string functions.
Example:
http://xxx/info.php?test=1?test=2&test=3
$_SERVER['QUERY_STRING'] =>test=1?test=2&test=3
list($localURL, $remoteURL) = explode("?", $_SERVER['QUERY_STRING'], 2);
$localURL => 'test=1'
$remoteURL => 'test=2&test=3'
Hope this helps
I would suggest changing the server code to handle a simpler query form. The current approach is likely to lead you nowhere in the very near future.
Use
index.php?page=search&query=...
as your query format, but do not rewrite it with mod_rewrite back into your original format just to satisfy your current application logic. Handle it with better logic on the server side instead. Write some ifs and thens, switches and cases, but do not try to put the logic of the application into your URLs. It will give you really awkward URLs, and you'll soon see that there is not much room in that layer to handle all the logic you will need. :)
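A small sketch of the flat index.php?page=search&query=... format suggested above. The helper name buildPageUrl and the sample search text are hypothetical; http_build_query(), parse_url() and parse_str() are built in and take care of encoding and decoding, so a ? inside the search text does not start a second query string.

```php
<?php
// Hypothetical helper: build a single-entry URL with one flat query string.
function buildPageUrl(string $page, array $params): string
{
    // http_build_query() URL-encodes each value.
    return 'index.php?' . http_build_query(['page' => $page] + $params, '', '&');
}

$url = buildPageUrl('search', ['query' => 'what? where', 'search' => 1]);

// On the receiving side PHP fills $_GET automatically; parse_str() shows the same decoding.
parse_str(parse_url($url, PHP_URL_QUERY), $received);

echo $received['page'], PHP_EOL;    // search
echo $received['query'], PHP_EOL;   // what? where
```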
