How do I apply URL normalization rules in PHP?


Is there a pre-existing function or class for URL normalization in PHP?
Specifically, I want to follow the semantics-preserving normalization rules laid out in the Wikipedia article on URL normalization (or whatever 'standard' I should be following):
Converting the scheme and host to lower case
Capitalizing letters in escape sequences
Adding trailing / (to directories, not files)
Removing the default port
Removing dot-segments
Right now, I'm thinking that I'll just use parse_url(), and apply the rules individually, but I'd prefer to avoid reinventing the wheel.

The PEAR Net_URL2 library looks like it'll do at least part of what you want. It removes dot segments, fixes capitalization, and drops the default port:
include("Net/URL2.php");
$url = new Net_URL2('HTTP://example.com:80/a/../b/c');
print $url->getNormalizedURL();
emits:
http://example.com/b/c
I doubt there's a general-purpose mechanism for adding trailing slashes to directories, because you need a way to map URLs to directories, which is hard to do generically. But it's close.
References:
http://pear.php.net/package/Net_URL2
http://pear.php.net/package/Net_URL2/docs/latest/Net_URL2/Net_URL2.html
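If you'd rather not pull in PEAR, the parse_url() route mentioned in the question can be sketched roughly as below. This is only a minimal sketch: normalize_url() is a made-up name, it assumes an absolute http/https-style URL, it leaves URLs with user:pass@ credentials untouched, and it does not attempt the trailing-slash rule at all.

```php
<?php
// Hypothetical helper for illustration; not part of PHP or Net_URL2.
function normalize_url(string $url): string
{
    $parts = parse_url($url);
    // Leave anything we cannot parse (or that carries credentials) untouched.
    if ($parts === false || !isset($parts['scheme'], $parts['host']) || isset($parts['user'])) {
        return $url;
    }

    // Rule: lower-case the scheme and host.
    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);

    // Rule: remove the port only when it is the scheme's default.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = '';
    if (isset($parts['port']) && ($defaultPorts[$scheme] ?? null) !== $parts['port']) {
        $port = ':' . $parts['port'];
    }

    // Rule: capitalize the hex digits in escape sequences (%7e -> %7E).
    $upperEscapes = static function (string $s): string {
        return preg_replace_callback('/%[0-9a-f]{2}/i', static function (array $m): string {
            return strtoupper($m[0]);
        }, $s);
    };

    // Rule: remove dot-segments ("." and "..") from the path.
    $segments = explode('/', $upperEscapes($parts['path'] ?? '/'));
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;
        }
        if ($segment === '..') {
            if (count($out) > 1) { // never pop the leading empty segment
                array_pop($out);
            }
            continue;
        }
        $out[] = $segment;
    }
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = ''; // "/a/b/.." resolves to "/a/", keeping the slash
    }
    $path = implode('/', $out);

    $query    = isset($parts['query']) ? '?' . $upperEscapes($parts['query']) : '';
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $scheme . '://' . $host . $port . $path . $query . $fragment;
}
```

Calling normalize_url('HTTP://example.com:80/a/../b/c') should give the same http://example.com/b/c as the Net_URL2 example above.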

Related

Should I save a path with or without a trailing slash at the end, what's the convention

I'm always confused about whether I should add a trailing slash at the end of a path, and I often mix it up, leading to file-not-found errors.
Example with drupal:
$base_theme = '/sites/all/themes/myTheme/';
or
$base_theme = '/sites/all/themes/myTheme';
The image path could extend the base theme:
$base_image = $base_theme.'images/';
or
$base_image = $base_theme.'/images';
Is there any convention? Or can I pick whichever one I prefer?
I would choose to end all paths with a trailing slash, since too many slashes is better than no slash.

TL;DR: There's no real convention. A trailing slash is the more widely recognized format. The important thing is that you're consistent throughout your design and that you convey your usage clearly.
There's no real convention; but there are considerations to make.
Advantages of a trailing slash:
Trailing slash usually indicates a folder path (or a prettified URL) whereas a file extension denotes a direct file link. (Think example.com/home/ VS example.com/style.css).
This is usually friendlier for people coming from UNIX and such, as in the terminal a clear convention is to leave a trailing slash for directories.
As a programmer, a trailing slash makes mistakes less harmful. Accidentally adding a second slash looks ugly (http://example.com/styles//myfile.css) but will not break the link, whereas forgetting a slash will: http://example.com/stylesmyfile.css. The behavior can be confusing for query strings, though: http://example.com/thread?id=1 vs. http://example.com/thread/?id=1 <- the result really depends on how you handle your .htaccess.
Advantages of no trailing slash:
Prettier, some might say
It's easier to remember, and more readable, to always add a slash when appending a path to a variable: $baseURL . '/path.php' is easier to remember than $baseURL . 'path.php'.
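Whichever style you pick, one way to stop worrying about it is to join paths through a tiny helper that tolerates both. A minimal sketch, assuming forward-slash paths like the Drupal example in the question; join_path() is a hypothetical name, not a PHP or Drupal function:

```php
<?php
// Hypothetical helper: joins two path pieces with exactly one slash between them,
// whether or not $base ends with a slash or $child starts with one.
function join_path(string $base, string $child): string
{
    return rtrim($base, '/') . '/' . ltrim($child, '/');
}
```

With this, join_path('/sites/all/themes/myTheme/', 'images/') and join_path('/sites/all/themes/myTheme', '/images/') both give /sites/all/themes/myTheme/images/, so a stray or missing slash in the base no longer breaks the path.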

PHP directory separators, forcing forward slash; non-intrusive

Whenever I work with PHP (often) I typically work on a Windows box, however I (try to) develop platform agnostic applications; one major point of issue being the use of directory separators.
As many know, doing any filesystem work in a Windows environment in PHP, you can use forward slashes in lieu of backwards, and PHP sorts it out under the hood. This is all fine when it comes to using string literals to pass a path to fopen() or whatever; but when retrieving paths, be it __FILE__ or expanding with realpath(), the paths retrieved are of course using the OS appropriate slashes. Also, I've noticed some inconsistencies in trailing slashes. Once or twice __DIR__ has appended one (a backslash) and realpath() too (I prefer the trailing slash, but not intermittently)
This is clearly a problem for string comparison, because instead of doing:
compare_somehow('path/to/file.php', __DIR__);
For the sake of reliability, I'm having to go:
compare_somehow('path/to/file.php', rtrim(strtr(__DIR__, '\\', '/'), '/') . '/');
This seems like a lot of work. I can drop it into a function, sure; but then I'm stuck with an arbitrary function dependency in all my OO code.
I understand that PHP isn't perfect, and accommodations need to be made, but surely there must exist some platform agnostic workaround to force filesystem hits to retrieve forward slashed paths, or at least a non-intrusive way to introduce a class-independent function for this purpose.
Summary question(s):
Is there some magical (though reliable) workaround, hack, or otherwise to force PHP to kick back forward slashed filesystem paths, regardless of the server OS?
I'm going to assume the answer is no to the above, so moving on; what provisions can I make to enforce forward slash (or whatever choice, really) as the directory separator? I'm assuming through the aforementioned filter function, but now where should it go?
Forward slash for everything. Even if the host OS separator is #*&#.

As I commented I can't really see why you would have to do this (I'd be interested in a quick description of the specific problem you are solving), but here is a possible solution using the output of __FILE__ as an example:-
$path = str_replace('\\', '/', __FILE__);
See it working
This will (should?) work regardless of which slashes the OS returns (I think).
Unfortunately I'm not aware of "some magical (though reliable) workaround, hack, or otherwise to force PHP to kick back forward slashed filesystem paths, regardless of the server OS" other than this. I imagine it could be wrapped in a helper class, but that still gives you an arbitrary dependency in your code.
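For what it's worth, the str_replace() call and the rtrim() from the question combine into a one-line helper. This is just a sketch; forward_slash_dir() is a made-up name, and it assumes you always want the trailing slash, as the question says it prefers:

```php
<?php
// Hypothetical helper; the name is an assumption, not a PHP built-in.
function forward_slash_dir(string $path): string
{
    // Convert backslashes to forward slashes, then force exactly one trailing slash.
    return rtrim(str_replace('\\', '/', $path), '/') . '/';
}
```

It doesn't remove the dependency, but it does keep the comparison logic in one place, e.g. compare_somehow('path/to/', forward_slash_dir(__DIR__)).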

Optional regular expression segment, but list of requirements if present?

I have a small routing engine in PHP. I'm trying to allow it to optionally match different "formats", such as requests to "/user/profile.json" or "/user/profile.xml". However, it should also match just a plain "/user/profile".
So, if the format is present, it must be ".json" or ".xml". But it isn't required to be present at all.
Here is what I have so far:
#^GET /something/([a-zA-Z0-9\.\-_]+)(\.(html|json))?$#
Obviously, this doesn't work. This allows any "format" to be requested since the entire format segment is optional. How can I keep it optional, but constrain the formats that can be requested?

^GET /something/([a-zA-Z0-9._-]+)(\.(html|json))?$
allows dots in the first character class, so any file extension is legal. I expect you did that on purpose so filenames with dots in them are possible.
However, this means that if a filename contains a dot, it must end in either .html or .json. Right?
So change the regex to (using the \w shorthand for [A-Za-z0-9_]):
^GET /something/([\w.-]+\.(html|json)|[\w-]+)$
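Here is a quick sketch of how that pattern could be wired up with preg_match(). The function name and the returned array keys are my own invention for illustration, and the delimiters are the # from the question:

```php
<?php
// Hypothetical wrapper around the pattern above; returns null when the request does not match.
function match_something_route(string $request): ?array
{
    $pattern = '#^GET /something/([\w.-]+\.(html|json)|[\w-]+)$#';
    if (preg_match($pattern, $request, $m) !== 1) {
        return null;
    }
    // Group 2 only exists when the first alternative (the one with a format) matched.
    return ['resource' => $m[1], 'format' => $m[2] ?? null];
}
```

"GET /something/profile" and "GET /something/profile.json" both match, while "GET /something/profile.csv" returns null, because a name containing a dot must end in one of the allowed formats.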

Alternative suggestion:
Instead of putting the desired output format into the URL, have the client specify it via the Accept Header in the HTTP Request (where it belongs). Content negotiation is baked into the HTTP protocol, so you do not have to reinvent it via URLs. Technically, it is wrong to put the format into the URL. Your URIs should point to the resource itself and not the resource representation.
Also see W3C: Content Negotiation: why it is useful, and how to make it work
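As a rough illustration of the idea, a very simplified Accept-header negotiation could look like the sketch below. negotiate_format() and its parameters are hypothetical names; it only handles exact media types and q-values, ignores wildcards such as */*, and is not a full implementation of HTTP content negotiation:

```php
<?php
// Hypothetical helper: picks a format from an Accept header.
// $supported maps media types to your internal format names.
function negotiate_format(string $acceptHeader, array $supported, string $default): string
{
    $ranked = [];
    foreach (explode(',', $acceptHeader) as $i => $item) {
        $params = array_map('trim', explode(';', $item));
        $type = strtolower(array_shift($params));
        if ($type === '') {
            continue;
        }
        $q = 1.0; // a missing q parameter means full preference
        foreach ($params as $param) {
            if (stripos($param, 'q=') === 0) {
                $q = (float) substr($param, 2);
            }
        }
        $ranked[] = [$type, $q, $i];
    }

    // Highest q first; ties keep the order used in the header.
    usort($ranked, static function (array $a, array $b): int {
        return ($b[1] <=> $a[1]) ?: ($a[2] <=> $b[2]);
    });

    foreach ($ranked as [$type, $q]) {
        if ($q > 0 && isset($supported[$type])) {
            return $supported[$type];
        }
    }
    return $default;
}
```

In a router you would pass it something like $_SERVER['HTTP_ACCEPT'] ?? '' together with a map such as ['application/json' => 'json', 'application/xml' => 'xml'].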

The issue arises from the fact that most extensions are alphanumeric, yet your regex allows a dot along with those characters:
#^GET /something/[a-zA-Z0-9\.\-_]+(\.(html|json))?$#
The problem is the section [a-zA-Z0-9\.\-_]+. An extension such as .csv gets through because it still matches that character class.
If something has dots in its file name, then by default it has a file extension (intentional or unintentional). The file My.Finance.Documents has the extension ".Documents" even though you'd assume it to be a text file or something else.
I hate doing it, but I think you might want to have a larger conditional in your regex, something along the lines of (this is an example, I haven't tested it):
#^GET /something/([^\.]+|.*\.(?:html|json))$#
Basically, if the file name has no dots in it, it's OK. If it does have a dot in it (which guarantees it has an extension), it must end with .html or .json.

extracting one or more urls from a string in php

I'm trying to extract one or more URLs from a plain text string in PHP. Here are some examples:
"mydomain.com has hit the headlines again"
extract " http://www.mydomain.com"
"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"
extract "http://www.domain.com" , "http://www.anotherdomain.co.uk" , "http://www.thirddomain.net"
There are two special cases I need to handle. I'm thinking regex, but I don't fully understand regular expressions:
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word dot needs to be replaced with the symbol . , so dot com would be .com
P.S. I'm aware of PHP validation/regex for URL but can't work out how I would use this to achieve the end goal.
Thanks

In this case it will be hard to get 100% correct results.
Depending on the input, you may try to match only the most popular top-level domains (add more as needed):
(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b
You may need to remove the word boundary (\b) to get different results.
You can test it here:
http://bit.ly/dlrgzQ
EDIT: about your cases
1) remove from what?
2) this could be done in php like:
$result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
But I have a few important notes:
These regexes are more like guidance, not actual production code.
Working with this kind of loose rule on text is wacky to say the least, and adding more special cases will make it even loonier. Consider this: even Stack Overflow doesn't do that. It links:
http://example.org
but not!
example.org
It would be easier if you'd said what you are trying to achieve. If you want to process text that ends up somewhere on the WWW later, then this is a very bad idea! You should not do this on your own (as you said, you don't understand regex), as it would just be a can of XSS worms. Better to think about some kind of markup language such as Markdown or BBCode.
Also get interested in: http://htmlpurifier.org/
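To show how the two pieces above might fit together, here is a rough sketch using preg_match_all(). extract_urls() is a made-up name, the TLD list is the same incomplete one as above, and forcing every hit into the http://www. form is only an assumption based on the expected output in the question. Same caveat as before: guidance, not production code.

```php
<?php
// Hypothetical helper: finds domain-like tokens and returns them as http://www.<domain>.
function extract_urls(string $text): array
{
    $tlds = 'com|org|net|biz|edu|uk|ly|gov'; // incomplete on purpose; add more as needed

    // Case 2: turn "example dot com" into "example.com" before matching.
    $text = preg_replace('/\s+dot\s+(?=(?:' . $tlds . ')\b)/i', '.', $text);

    preg_match_all('~(?:https?://)?[a-z0-9.-]+\.(?:' . $tlds . ')\b~i', $text, $matches);

    $urls = [];
    foreach ($matches[0] as $match) {
        // Strip any existing scheme and "www." so the prefix is not doubled.
        $bare = preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $match);
        $urls[] = 'http://www.' . strtolower($bare);
    }
    return $urls;
}
```

Fed the second example string from the question, it returns http://www.domain.com, http://www.anotherdomain.co.uk and http://www.thirddomain.net. Treat the results as plain strings and escape them before they go anywhere near HTML output.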

Using a dash (-) instead of a plus (+) in Zend Framework URLs

By default, it seems my ZF is separating multiple parameter words with plus signs,
eg. /product/test+product+name
I would like to use -> /product/test-product-name
Here is the line from routes.ini
routes.product.route = "product/:productName"
routes.product.defaults.controller = product
routes.product.defaults.action = product
What can I do to fix this?

This happens because the URLs are urlencoded to ensure document validity. You'll need to filter/replace the terms (productName) before generating routes. A simple str_replace should be all that you need. In my app, I filter excess whitespace and then replace spaces with dashes.
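Something along these lines, as a minimal sketch (slugify_name() is a hypothetical name, not a Zend Framework function; call it on productName before handing the value to the router or URL helper):

```php
<?php
// Hypothetical helper: trims the name, then collapses each run of whitespace into one dash.
function slugify_name(string $name): string
{
    return preg_replace('/\s+/', '-', trim($name));
}
```

slugify_name('test product name') gives test-product-name, so nothing is left for urlencode() to turn into a plus sign.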

Well, as the + sign is commonly understood by browsers as a word separator, I don't think Zend has provided an option, and it most likely just uses +s because that is correct.
You might have to edit the source.
You may want to look at the Regex Routing here. It seems like it might be useful.
