Removing characters after the last alpha character (if they exist) in a URL in PHP - php

I'm redoing a website and on this website there are two types of URL such as:
http://www.example.com/category/subcategory/one-friend-url-040569485.php
http://www.example.com/category/subcategory/one-friend-url.php
I need to get only one-friend-url, without the .php extension or the -040569485.php suffix, in either of the above cases, so I can use one-friend-url in a MySQL lookup.
So if the URL does not have a -040569485.php suffix, only the .php extension is removed; otherwise the whole -040569485.php suffix is removed.
What would be the best way to do this (php, regex or .htaccess)?

If you need the part between "subcategory/" and ".php", with the trailing numbers optional, you can try this regex:
"subcategory/(.*?)(?:-?\d*?)\.php"
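As a rough illustration, the same idea could be applied in PHP with preg_match(). This is only a sketch: the function name extractSlug() is made up for the example, and the pattern is a slightly stricter variant of the one above (it looks at the file name only and requires at least one digit after the hyphen).

```php
<?php
// Hypothetical helper (not from the question): returns the slug without
// an optional "-digits" suffix and without the ".php" extension.
function extractSlug(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no usable path component
    }

    // Work on the last path segment only, e.g. "one-friend-url-040569485.php".
    $file = basename($path);

    // Lazy capture, then an optional "-digits" suffix, then ".php" at the end.
    if (preg_match('/^(.+?)(?:-\d+)?\.php$/', $file, $m) === 1) {
        return $m[1];
    }

    return null;
}

echo extractSlug('http://www.example.com/category/subcategory/one-friend-url-040569485.php'), "\n"; // one-friend-url
echo extractSlug('http://www.example.com/category/subcategory/one-friend-url.php'), "\n";           // one-friend-url
```

The returned slug is still user input, so it should reach MySQL through a bound parameter rather than string concatenation.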

Related

URL Decoded Prior to htaccess Rewrite Rule

I have the following rewrite rule in .htaccess :-
RewriteRule ^.*/-y.* /handleurl.php [L]
Its purpose is to display appropriate pages depending on the values in the url, for example:
example.com/books/BookA/-y?act=x will display bookA page
the variable holding the book name is encoded such that ...
example.com/books/Book B/-y?act=x becomes example.com/books/book+B/-y?act=x
... which is fine (it's decoded in handleurl.php)
however if the book is called Book A/B I have ...
example.com/books/Book A/B/-y?act=x which becomes example.com/books/Book+A%2FB/-y?act=x
It appears that htaccess decodes this before the rewrite rule, so the rewrite rule sees too many elements in the URL delineated by the /.
Is there any way I can get the rewrite rule to ignore the encoded / as intended?
I have seen a previous response to a similar question, but I only need the / to be ignored, not other encoded characters.
"It appears that htaccess decodes this before the rewrite rule, so the rewrite rule sees too many elements in the URL delineated by the /"
This is not the problem. Whether the URL-path /books/Book+A%2FB/-y is decoded or not makes no difference here (*1). Both forms would match the (rather generous) regex ^.*/-y.* in the RewriteRule pattern.
(*1 But yes, the URL-path matched by the RewriteRule pattern is URL decoded, ie. %-decoded.)
The problem is likely to be that Apache (by default) rejects - with a 404 - any URL that contains a %-encoded slash ie. %2F (or backslash %5C) in the URL-path portion of the URL. This is a security feature, that otherwise "could potentially allow unsafe paths" (source).
However, this can be overridden with the AllowEncodedSlashes directive. But this directive can only be used in a server or virtualhost context. It cannot be used in .htaccess.
You need to set either AllowEncodedSlashes On, which allows encoded slashes and also decodes them like other encoded characters, or AllowEncodedSlashes NoDecode, which permits encoded slashes but does not decode them. The latter is preferred and is probably what you are expecting.
Aside#1:
RewriteRule ^.*/-y.* /handleurl.php [L]
The regex ^.*/-y.* is very generic, possibly too generic. This is the same as simply /-y. What is the .* after -y intended to match? From your example URLs it looks like -y is always at the end of the URL-path, so this could be anchored, eg. /-y$. And if the URL that you need to match always starts /books/ then maybe this should also be included in the regex?
Aside#2:
...the book name is encoded such that ...
example.com/books/Book B/-y?act=x becomes example.com/books/book+B/-y?act=x ... which is fine (it's decoded in handleurl.php)
This isn't strictly "URL encoded", you have converted the space into a + in the URL-path. The + is a valid "URL encoding" for a space when used in the query string only. A + in the URL-path is a literal + (and will be seen by search engines as such). In the URL-path, a space would be URL encoded as %20. (You may have used the wrong PHP encoding functions, eg. urlencode() instead of rawurlencode()?)
Of course, you are free to convert/encode the URL however you wish to create a more readable URL - providing it's valid.
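The difference between the two PHP encoding functions mentioned above can be seen in a short sketch. The book name and the example.com URL are taken from the question; the variable names are only illustrative.

```php
<?php
$name = 'Book A/B';

// urlencode() targets query strings: a space becomes "+".
echo urlencode($name), "\n";    // Book+A%2FB

// rawurlencode() follows RFC 3986 and suits path segments: a space becomes "%20".
echo rawurlencode($name), "\n"; // Book%20A%2FB

// Path segment encoded with rawurlencode(), query string built separately.
$url = 'https://example.com/books/' . rawurlencode($name) . '/-y?' . http_build_query(['act' => 'x']);
echo $url, "\n"; // https://example.com/books/Book%20A%2FB/-y?act=x
```

Either way the slash still ends up as %2F, so the AllowEncodedSlashes setting remains relevant.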
The rewrite rule was never the problem. I think it was Apache not liking the encoded '/' and the fact that the downstream URL handling program was using '/' as a delimiter when identifying the individual URL elements. I have to work out: 1) whether I want to allow '/' in the variables that make up the elements of the friendly URL, and 2) if so, how to pass it without upsetting Apache and how to subsequently dissect the URL. Maybe I will convert '/' to '~' for the benefit of the URL, then convert back to '/' prior to subsequent display. Thank you Mr White.

How to pass a forward slash (%2F) in a Laravel 5 URL with additional GET parameters

Users can search my site. Sometimes they might use a search term containing a forward slash (search with / slash) which when submitted by the form turns into %2F in the url.
eg. www.mysite.com/search/search+with+%2F+slash
I have used the answer from here which works great to give me the right page and not send me to a 404.
My problem now is I use pagination on the page and custom filters which are both passed as get vars in the url and when accessing the GET var it's empty.
eg. www.mysite.com/search/search+with+%2F+slash?page=2
This is my current route
$this->get('search/{search_term}', ['uses' => 'SearchController@search'])
->where('search_term', '(.*(?:%2F:)?.*)');
I'm not sure what to do from here.
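Including an encoded slash (%2F) in the path component of a URL is not a good idea. The HTTP RFC (RFC 2616) does not settle how it should be treated: it only states equivalence for characters outside the "reserved" and "unsafe" sets, and "/" is a reserved character: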
Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
In practice, the handling of these URLs is inconsistent. Some web servers (and even some browsers!) will treat %2F as equivalent to a real slash, some will treat it differently, and some tools, including some web application firewalls and proxies, will simply reject URLs which contain such a sequence.
If you need to include user input like this in a URL, you should probably put it in a query string (/search/?q=search+with+%2f+slash).
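A minimal sketch of that suggestion in plain PHP, assuming a hypothetical helper buildSearchUrl() and the parameter names q and page (neither comes from the question). http_build_query() takes care of the encoding, so the slash never appears in the path.

```php
<?php
// Hypothetical helper: puts the search term and page number in the query string.
function buildSearchUrl(string $term, int $page = 1): string
{
    return '/search/?' . http_build_query(['q' => $term, 'page' => $page]);
}

$url = buildSearchUrl('search with / slash', 2);
echo $url, "\n"; // /search/?q=search+with+%2F+slash&page=2

// Reading it back: parse_str() decodes the values again.
parse_str((string) parse_url($url, PHP_URL_QUERY), $params);
echo $params['q'], "\n";    // search with / slash
echo $params['page'], "\n"; // 2
```

With the term in the query string, pagination and filter parameters sit next to it as ordinary GET variables.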

Regex on File Names

I have a function called getContents(), which accepts a regex for the file names it finds.
I scan the js folder for javascript files, with the following two regex patterns:
$js['head'] = "/(\.head\.js\.php)|(\.head\.js)|(\.h.js)/";
$js['foot'] = "/(\.foot\.js\.php)|(\.foot\.js)|(\.f.js)|(\.js)^(\.head\.js)/";
I have a naming system that determines where each JavaScript file gets loaded: in the <head> tag or in the footer of the HTML page. All files are loaded at the bottom of the page unless the name says otherwise (.head.js, for example).
A few days ago I noticed that the $js['foot'] array was also including .head.js files, causing them to be loaded twice. So I added ^(\.head\.js) and it worked: it stopped the .head.js files being added to the footer array. I was quite pleased with myself, because I suck at regex. However, it now seems that standard .js files (any normal .js files) aren't being loaded into the $js['foot'] array. Why is this? If I remove the ^(\.head\.js) part, it loads them.
To be clear, I want the $js['foot'] array to load files ending with:
.foot.js.php
.foot.js
.f.js
.js
And IGNORE all:
.head.js.php
.head.js
.h.js
Can someone correct my regex above to do this? I thought the ^ operator meant NOT, but I was wrong!
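^(\.head\.js) in the middle of the pattern can never match, because ^ is an anchor that matches the start of the string (or line). It only means "not" as the first character inside a character class such as [^abc]. As a result, the alternative (\.js)^(\.head\.js) never matches, and plain .js files are dropped.
You actually need a negative lookbehind assertion to stop the footer regex from matching .head.js and .h.js files. Including the dot in the lookbehind keeps names that merely end in h, such as graph.js, in the footer list:
$js['head'] = '/\.head\.js(?:\.php)?|\.h\.js/';
$js['foot'] = '/\.foot\.js(?:\.php)?|(?<!\.head|\.h)\.js/';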
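The lookbehind approach can be checked with a small script. This is a sketch under assumptions: the patterns are an end-anchored variant (with $) of the ones in this answer, and classifyScript() is a made-up helper, not part of the asker's getContents().

```php
<?php
// Assumed variant of the answer's patterns, anchored at the end of the file name.
$js = [
    'head' => '/\.head\.js(?:\.php)?$|\.h\.js$/',
    'foot' => '/\.foot\.js(?:\.php)?$|(?<!\.head|\.h)\.js$/',
];

// Hypothetical helper: returns 'head', 'foot', or null for non-script files.
function classifyScript(string $file, array $patterns): ?string
{
    foreach (['head', 'foot'] as $slot) {
        if (preg_match($patterns[$slot], $file) === 1) {
            return $slot;
        }
    }
    return null;
}

foreach (['app.head.js', 'app.h.js', 'app.foot.js.php', 'app.f.js', 'graph.js', 'style.css'] as $file) {
    echo $file, ' => ', classifyScript($file, $js) ?? 'ignored', "\n";
}
```

Here graph.js lands in the footer list because the lookbehind checks for ".h" rather than a bare "h".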

How to pass a filename via URL?

I need to pass filenames via the url, e.g.:
http://example.com/images/niceplace.jpg
The problem I'm having is when the file name contains a blank character, e.g.:
http://example.com/images/nice place.jpg
or
http://example.com/images/nice%20place.jpg
For these two URLs, codeigniter complains about the blank char: "The URI you submitted has disallowed characters."
How should I go about fixing this?
I know I can add the blank character to the permitted_uri_chars in config.php but I'm looking for a better solution as there might be other disallowed characters in a filename.
I figured out a solution.
The URL is generated using rawurlencode().
Then, within the images controller, the filename is decoded using rawurldecode(html_entity_decode($filename)).
I successfully tested this solution with a few special characters I can think of and with UTF-8 characters.
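The encode/decode round trip described here might look like the sketch below. The file name is an invented example, and the html_entity_decode() step from the solution is left out because it depends on how the framework hands over the segment.

```php
<?php
$filename = 'nice place & co.jpg';

// Encode when generating the link.
$encoded = rawurlencode($filename);
echo $encoded, "\n"; // nice%20place%20%26%20co.jpg

// Decode inside the controller before touching the file system.
$decoded = rawurldecode($encoded);
var_dump($decoded === $filename); // bool(true)
```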
You can use urlencode():
http://php.net/urlencode
Without it, you will run into other issues when a filename contains an & character, and a few others. urlencode() takes care of such characters.
This configuration option exists to keep certain characters out of the URI, and you want to work around it in some cases. I think the most appropriate solutions are:
Pass file name as a parameter - http://domain.com/images/?image=test.jpg
Remove all non-alphanumeric characters, except perhaps a few others (dash, underscore, etc.), from the file name when you save it. In my opinion this is better, because some characters can cause other problems elsewhere.
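The second option could be sketched as follows. sanitizeFilename() is a hypothetical helper, and the set of allowed characters (letters, digits, dot, underscore, dash) is an assumption to adjust as needed.

```php
<?php
// Hypothetical helper: replaces every run of disallowed characters with one dash.
function sanitizeFilename(string $name): string
{
    $clean = preg_replace('/[^A-Za-z0-9._-]+/', '-', $name);

    // Drop dashes left over at either end.
    return trim((string) $clean, '-');
}

echo sanitizeFilename('nice place & co.jpg'), "\n"; // nice-place-co.jpg
echo sanitizeFilename('niceplace.jpg'), "\n";       // niceplace.jpg
```

Two different original names can map to the same sanitized name, so a uniqueness check at save time is still needed.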
One of the better ways to handle URLs in this situation is to encode/encrypt the URL parameters with the encryption class, which also keeps the URL secure. Encode when building the URL, and decode when reading it:
$encrypted = $this->encrypt->encode($param1);
$param1 = $this->encrypt->decode($encrypted);
Alternatively, if you want special characters to be allowed in the URL, change the setting in your config.php file.
File Location: application/config/config.php
$config['permitted_uri_chars'] = 'a-z 0-9~%.:_\-';
Add to the right-hand side all the characters you want your application to allow.

Optional regular expression segment, but list of requirements if present?

I have a small routing engine in PHP. I'm trying to allow it to optionally match different "formats", such as requests to "/user/profile.json" or "/user/profile.xml". However, it should also match just a plain "/user/profile".
So, if the format is present, it must be ".json" or ".xml". But it isn't required to be present at all.
Here is what I have so far:
#^GET /something/([a-zA-Z0-9\.\-_]+)(\.(html|json))?$#
Obviously, this doesn't work. This allows any "format" to be requested since the entire format segment is optional. How can I keep it optional, but constrain the formats that can be requested?
^GET /something/([a-zA-Z0-9._-]+)(\.(html|json))?$
allows dots in the first character class, so any file extension is legal. I expect you did that on purpose so filenames with dots in them are possible.
However, this means that if a filename contains a dot, it must end in either .html or .json. Right?
So change the regex to (using the \w shorthand for [A-Za-z0-9_]):
^GET /something/([\w.-]+\.(html|json)|[\w-]+)$
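A quick way to try this pattern in PHP is sketched below. The helper matchRoute() and the sample request lines are assumptions for illustration; the pattern itself is the one given above, wrapped in # delimiters.

```php
<?php
// Hypothetical helper: returns the matched resource and format, or null if the
// request line does not match the route.
function matchRoute(string $requestLine): ?array
{
    $pattern = '#^GET /something/([\w.-]+\.(html|json)|[\w-]+)$#';

    if (preg_match($pattern, $requestLine, $m) !== 1) {
        return null;
    }

    return [
        'resource' => $m[1],
        'format'   => $m[2] ?? null, // group 2 is absent when no extension was given
    ];
}

var_dump(matchRoute('GET /something/profile'));      // resource "profile", format null
var_dump(matchRoute('GET /something/profile.json')); // resource "profile.json", format "json"
var_dump(matchRoute('GET /something/profile.csv'));  // NULL
```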
Alternative suggestion:
Instead of putting the desired output format into the URL, have the client specify it via the Accept Header in the HTTP Request (where it belongs). Content negotiation is baked into the HTTP protocol, so you do not have to reinvent it via URLs. Technically, it is wrong to put the format into the URL. Your URIs should point to the resource itself and not the resource representation.
Also see W3C: Content Negotiation: why it is useful, and how to make it work
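For the Accept header route, a deliberately simplified sketch follows. pickFormat() and the $supported map are invented for this example; it honours q-values but ignores wildcards such as */*, so it is not a complete implementation of HTTP content negotiation.

```php
<?php
// Hypothetical helper: picks the supported media type with the highest q-value.
// $supported maps media types to the router's internal format names.
function pickFormat(string $acceptHeader, array $supported): ?string
{
    $best = null;
    $bestQ = 0.0;

    foreach (explode(',', $acceptHeader) as $part) {
        $pieces = array_map('trim', explode(';', $part));
        $type = strtolower((string) array_shift($pieces));

        // Default quality is 1.0 unless a "q=" parameter says otherwise.
        $q = 1.0;
        foreach ($pieces as $param) {
            if (stripos($param, 'q=') === 0) {
                $q = (float) substr($param, 2);
            }
        }

        if (isset($supported[$type]) && $q > $bestQ) {
            $best = $supported[$type];
            $bestQ = $q;
        }
    }

    return $best;
}

$supported = ['application/json' => 'json', 'text/html' => 'html'];
var_dump(pickFormat('text/html;q=0.8, application/json', $supported)); // string(4) "json"
var_dump(pickFormat('text/csv', $supported));                          // NULL
```

In a real request the header would typically come from $_SERVER['HTTP_ACCEPT'].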
The issue arises because most extensions are alphanumeric, and the character class in your regex allows both dots and alphanumeric characters:
#^GET /something/[a-zA-Z0-9\.\-_]+(\.(html|json))?$#
The problematic part is [a-zA-Z0-9\.\-_]+. A name ending in .csv, for example, gets through because the whole name, dot included, still matches that character class.
If something has dots in its file name, then by default it has a file extension (intentional or unintentional). The file My.Finance.Documents has the extension ".Documents" even though you'd assume it to be a text file or something else.
I hate doing it, but I think you might want to have a larger conditional in your regex, something along the lines of (this is an example, I haven't tested it):
#^GET /something/([^\.]+|.*\.(?:html|json))$#
Basically, if the file name has no dots in it, it's OK. If it does have a dot in it (which guarantees it has an extension), it must end with .html or .json.
