PHP Regex - URL - link to a file

How can I identify via preg_match that a string containing a URL actually points to a file and not to a valid page? For example:
www.example.com/a.png
www.example.com/a/b/c/d.mp4
www.example.com/e/f/h.xls
If I just do an explode on "." and check last index, it will not work. Also, I don't have the complete list of possible extensions and want to write something generic.
Thanks.

\/.+\.(?!php|php5)[a-zA-Z0-9]{1,4}
(php and php5 are examples for blacklist here)
Or
explode on . and do an array_pop on it.
I suggest using a whitelist instead of a blacklist: list only the extensions you allow.
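A minimal sketch of the whitelist idea, using parse_url() and pathinfo() rather than a regex. The function name points_to_file and the example whitelist are made up for illustration, and the sketch assumes that a URL without a scheme (as in the question) can safely be prefixed with http:// before parsing.

```php
<?php
// Hypothetical helper: true when the URL path ends in a whitelisted extension.
function points_to_file(string $url, array $allowed): bool
{
    // parse_url() only recognises the host when a scheme is present,
    // so add one for inputs like "www.example.com/a.png" (assumption).
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path at all, e.g. "www.example.com"
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return $ext !== '' && in_array($ext, $allowed, true);
}

$allowed = ['png', 'mp4', 'xls']; // example whitelist only
var_dump(points_to_file('www.example.com/a/b/c/d.mp4', $allowed));
var_dump(points_to_file('www.example.com/index.php', $allowed));
```

Because only the path component is inspected, dots in the host name or in a query string do not affect the result.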

Related

Regular expression filename matching

I am writing a PHP function in Drupal to detect duplicate file uploads and attempting to compare the uploaded filename to previously uploaded files.
I have example files of:
trees-nature_0.jpg
trees-nature_1.jpg
trees-nature0.jpg
trees-nature.jpg
I am trying to match all of them using the following code:
file_scan_directory('image/uploads', "/trees-nature[*]?.jpg/");
However, all I get back is trees-nature.jpg.
I would appreciate some correction.

Your regex is not correct. Use:
file_scan_directory('image/uploads', '/trees-nature.*?\.jpg/');

You can use the following:
file_scan_directory('image/uploads', '/trees-nature(.*?)\.jpg/');
Correction:
[ ] cannot be used as parentheses; in a regex it defines a character class.
* is not a wildcard in a regex; you have to use .* instead.
. also has a special meaning (any character), so you need to escape it as \.
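To see the difference without Drupal, here is a small sketch that runs both patterns over the four example file names with preg_grep(). The anchors ^ and $ in the corrected pattern are my addition and are not part of the answers above.

```php
<?php
$files = ['trees-nature_0.jpg', 'trees-nature_1.jpg', 'trees-nature0.jpg', 'trees-nature.jpg'];

// Original pattern: [*]? is an optional literal asterisk, and the
// unescaped . matches exactly one arbitrary character.
$original = array_values(preg_grep('/trees-nature[*]?.jpg/', $files));

// Corrected pattern: .* means "any characters", \. is a literal dot.
$corrected = array_values(preg_grep('/^trees-nature.*\.jpg$/', $files));

print_r($original);  // only trees-nature.jpg
print_r($corrected); // all four names
```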

PHP: How to get URL of relative file

Does PHP have a native function that returns the full URL of a file you declare with a relative path? I need to get: "http://www.domain.com/projects/test/img/share.jpg" from "img/share.jpg" So far I've tried the following:
realpath('img/share.jpg');
// Returns "/home/user/www.domain.com/projects/test/img/share.jpg"
I also tried:
dirname(__FILE__)
// Returns "/home/user/www.domain.com/projects/test"
And this answer states that the following can be tampered with client-side:
'http://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'].'img/share.jpg'
plus, my path will vary depending on whether I'm accessing from /test/index.php or just test/ without index.php and I don't want to hard-code whether it's http or https.
Does anybody have a solution for this? I'll be sending these files to another person who will upload to their server, so the folder structure will not match "/home/user/www.domain.com/"

echo preg_replace('#^' . preg_quote($_SERVER['DOCUMENT_ROOT'], '#') . '#', 'http://www.example.com/', realpath('img/share.jpg'), 1);
Docs: preg_replace and preg_quote.
The arguments of preg_replace:
'#^' . preg_quote($_SERVER['DOCUMENT_ROOT'], '#') . '#' - Takes the document root (e.g., /home/user/www.domain.com/), escapes it for literal use in a regular expression, and wraps it in # delimiters, which preg_replace requires. The ^ anchors the match to the start of the path.
'http://www.example.com/' - the string to replace the regex match with.
realpath('img/share.jpg') - the absolute file path, including the document root.
1 - the maximum number of replacements to make.

How about
echo preg_replace("'". preg_quote($_SERVER['DOCUMENT_ROOT'], "'") ."'", 'http://www.example.com/', realpath('img/share.jpg'), 1);
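Since the question asks not to hard-code the host or the scheme, here is a rough sketch that takes them as parameters. The name path_to_url and its signature are hypothetical. In a real script the arguments would probably come from realpath(), $_SERVER['DOCUMENT_ROOT'] and $_SERVER['HTTP_HOST'], and the last of these is client-supplied, as the question already notes.

```php
<?php
// Hypothetical helper: maps an absolute path under the document root to a URL.
function path_to_url(string $absPath, string $docRoot, string $host, bool $https = false): string
{
    // Normalise so the result is the same with or without a trailing slash.
    $docRoot = rtrim($docRoot, '/');
    // Anchor at the start so only the leading document root is removed.
    $pattern = '#^' . preg_quote($docRoot, '#') . '#';
    $relative = preg_replace($pattern, '', $absPath, 1);
    return ($https ? 'https' : 'http') . '://' . $host . '/' . ltrim($relative, '/');
}

echo path_to_url(
    '/home/user/www.domain.com/projects/test/img/share.jpg',
    '/home/user/www.domain.com/',
    'www.domain.com'
), "\n";
```

Stripping the slash from both sides of the join avoids a doubled slash when the document root has no trailing slash.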

Regex to match base name of files with multiple extensions

I'm trying to match files of the following structure in PHP.
Input:
filename.ext1
filename.ext1.ext2
filename.ext3.ext2.ext1
filename.ext4.ext2.ext1.ext4
file name with spaces and no way of knowing how long.ext1
file name with spaces and no way of knowing how long.ext1.ext2
file name with spaces and no way of knowing how long.ext2.ext1.ext3
file name with spaces and no way of knowing how long.ext3.ext1.ext4.ext3
Output:
filename
filename
filename
filename
file name with spaces and no way of knowing how long
file name with spaces and no way of knowing how long
file name with spaces and no way of knowing how long
file name with spaces and no way of knowing how long
What I've already attempted (doesn't work of course and I already understand why):
^(?P<basename>.*)(\.ext4)|(\.ext3)|(\.ext2)|(\.ext1).*$
I'd like to extract the base name of the file and basically strip all extensions, because there's no way of knowing in which order they may appear. I've tried several solutions presented here but they did not work for me. The extensions could be anything alphanumeric of any length.
I'm fairly new to regular expressions and am confused that apparently you cannot simply search forward to the first dot and remove it including everything that comes after.
To learn, I'd also love to see how to do the reverse and just match all the extensions including the first dot.
Update:
I didn't think about file names that contain dots. So obviously my thinking regarding "searching forward" is flawed. Does anyone have a solution for the case
file name with spaces and no. way of knowing how long.ext3.ext1.ext4.ext3
or even
file name with spaces and no way of knowing.how.long.ext3.ext1.ext4.ext3
The latter one would quite possibly only work when certain extensions are given. So please assume ext1-4 are given but are in an unpredictable sequence.

Quick and dirty:
preg_replace("/\.(ext1|ext2|ext3|ext4)/i", "", $filename)

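There's no need to use regular expressions for this; PHP has the built-in function basename() for that. Note, however, that basename() strips only a single suffix, and only one you pass explicitly as its second argument.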

Does something simple like this work for you?
^[^.]*
Basically, it just matches the string before the first dot.

This regex should work for you:
^.+?(?=\.[^.]*$)
Online Demo: http://regex101.com/r/uT2oK5
This matches everything before the very last dot only, so only the final extension is left out. See the examples included in the link.

"am confused that apparently you cannot simply search forward to the first dot and remove it including everything that comes after."
Since regexes are read from left to right, looking for a single dot will lead you straight to the first dot. That said, you would thus be able to use:
preg_replace("/\..*/", "", $filename);
.* matches any characters except newlines.
If the filename has dots, this obviously won't work, since part of the filename will then be removed.
As per update, if you have the specific extensions, you can use something like this:
preg_replace("/(?:\.ext[1-4])+$/m", "", $filename);
regex101 demo
In a broader perspective, you could use something like this if you have an array of extensions at your disposition:
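$exts = array(".ext1", ".ext2", ".ext3", ".ext4");
// Quote each extension separately; quoting the joined string would also escape the | separators.
$quoted = array_map(function ($ext) { return preg_quote($ext, "/"); }, $exts);
$result = preg_replace("/(?:". join("|", $quoted) .")+$/m", "", $filename);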
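The question also asks about the reverse, capturing the extensions including the first dot. A possible sketch, assuming the list of known extensions is given: one pattern with two named groups returns the base name and the extension chain together. The function split_extensions is hypothetical and takes the extensions without leading dots.

```php
<?php
// Hypothetical helper: returns [base name, extension chain] for known extensions.
function split_extensions(string $filename, array $exts): array
{
    $quoted = array_map(function ($ext) { return preg_quote($ext, '/'); }, $exts);
    // Lazy base name, then one or more known extensions up to the end.
    $pattern = '/^(?P<base>.*?)(?P<exts>(?:\.(?:' . implode('|', $quoted) . '))+)$/i';
    if (preg_match($pattern, $filename, $m)) {
        return [$m['base'], $m['exts']];
    }
    return [$filename, '']; // no known extension at the end
}

list($base, $chain) = split_extensions(
    'file name with spaces and no way of knowing.how.long.ext3.ext1.ext4.ext3',
    ['ext1', 'ext2', 'ext3', 'ext4']
);
echo $base, "\n";  // file name with spaces and no way of knowing.how.long
echo $chain, "\n"; // .ext3.ext1.ext4.ext3
```

Because the extension group is anchored to the end of the string, dots inside the base name are left alone.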

.*(?=\.)
Try this? It will match everything before the last dot, even if there's a dot in the file name.

This is easy with just plain old php functions. No need for fancy regex.
$name = substr($filename, 0, strpos($filename, '.'));
This won't work for filenames that contain a dot, as in your updated example; to handle those you would likely need to know in advance which extensions to expect.

regex to get current page or directory name?

I am trying to get the page or last directory name from a url
For example, if the URL is http://www.example.com/dir/ I want it to return dir, and if the URL is http://www.example.com/page.php I want it to return page. Note that I do not want the trailing slash or the file extension.
I tried this:
$regex = "/.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*/i";
$name = strtolower(preg_replace($regex,"$2",$url));
I ran this regex in PHP and it returned nothing. (however I tested the same regex in ActionScript and it worked!)
So what am I doing wrong here, how do I get what I want?
Thanks!!!

Don't use / as the regex delimiter if it also contains slashes. Try this:
$regex = "#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i";

You may try to escape the "/" in the middle. Unescaped, it simply closes your regex. So this may work:
$regex = "/.*\.(com|gov|org|net|mil|edu)\/([a-z_\-]+).*/i";
You may also make the regex somewhat more general, but that's another problem.

You can use this
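$parts = explode('/', rtrim($url, '/'));
$last = array_pop($parts);
Then apply a simple regex to remove any file extension. (array_pop() takes its argument by reference, so the exploded array goes into a variable first; rtrim() drops a trailing slash so that /dir/ yields dir.)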

Assuming you want to match the entire address after the domain portion:
$regex = "%://[^/]+/([^?#]+)%i";
The above assumes a URL of the format extension://domainpart/everythingelse.

Then again, it seems that the problem here isn't that your RegEx isn't powerful enough, just mistyped (closing delimiter in the middle of the string). I'll leave this up for posterity, but I strongly recommend you check out PHP's parse_url() method.
This should adequately deliver:
substr($s = basename($_SERVER['REQUEST_URI']), 0, strrpos($s,'.') ?: strlen($s))
But this is better:
preg_replace('/[#\.\?].*/','',basename($path));
Although, your example is short, so I cannot tell if you want to preserve the entire path or just the last element of it. The preceding example will only preserve the last piece, but this should save the whole path while being generic enough to work with just about anything that can be thrown at you:
preg_replace('~(?:/$|[#\.\?].*)~','',substr(parse_url($path, PHP_URL_PATH),1));

As much as I personally love using regular expressions, more 'crude' (for want of a better word) string functions might be a good alternative for you. The snippet below uses sscanf to parse the path part of the URL for the first bunch of letters.
$url = "http://www.example.com/page.php";
$path = parse_url($url, PHP_URL_PATH);
sscanf($path, '/%[a-z]', $part);
// $part = "page";

This expression:
(?<=^[^:]+://[^.]+(?:\.[^.]+)*/)[^/]*(?=\.[^.]+$|/$)
Gives the following results:
http://www.example.com/dir/ dir
http://www.example.com/foo/dir/ dir
http://www.example.com/page.php page
http://www.example.com/foo/page.php page
Apologies in advance if this is not valid PHP regex - I tested it using RegexBuddy.

Save yourself the regular expression and make PHP's other functions feel more loved.
$url = "http://www.example.com/page.php";
$filename = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
Warning: PATHINFO_FILENAME requires PHP 5.2 or later.
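A quick sketch of how that behaves for both URLs from the question. The wrapper name last_segment_name is made up, and the is_string() guard for URLs without a path is my addition.

```php
<?php
// Hypothetical wrapper around the parse_url() + pathinfo() approach.
function last_segment_name(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return ''; // URL has no path component
    }
    // pathinfo() ignores a trailing slash and drops the last extension.
    return pathinfo($path, PATHINFO_FILENAME);
}

echo last_segment_name('http://www.example.com/dir/'), "\n";     // dir
echo last_segment_name('http://www.example.com/page.php'), "\n"; // page
```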

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1. Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain text URLs and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word.
3. Has the domain somewhere in the middle.
4. Could contain any number of any characters after the domain.
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or with a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.

You can use parse_url on the URLs and inspect the associative array it returns. That lets you filter by domain or apply even finer-grained control.
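A minimal sketch of that idea, assuming the blocked entries are full domain names such as domain1.com (the question only shows placeholders). The function is_blocked_url is hypothetical. It only classifies a single URL; finding the URLs inside a post is a separate step.

```php
<?php
// Hypothetical helper: true when the URL's host is a blocked domain
// or a subdomain of one.
function is_blocked_url(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // not a parseable absolute URL
    }
    $host = strtolower($host);
    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        // Match the domain itself or any host ending in ".domain".
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

var_dump(is_blocked_url('http://files.domain1.com/x.rar', ['domain1.com']));
var_dump(is_blocked_url('http://notdomain1.com/', ['domain1.com']));
```

Comparing against the parsed host avoids false hits on URLs that merely contain the blocked name somewhere in the path.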

I think you can avoid the overhead of this by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);

Hmm, my first guess: you put $filterthese directly inside a single-quoted string. Single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
rest of the host name: )(\.[a-z0-9.-]+)?(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
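For comparison, a pared-down sketch that only handles plain-text URLs (no anchor tags) and can be tried outside vBulletin. The function filter_links and the exact character classes are my own assumptions. Each domain name goes through preg_quote() before being joined with |.

```php
<?php
// Hypothetical helper: replaces plain-text links to the given domains.
function filter_links(string $message, array $domains, string $replacement): string
{
    $alternatives = implode('|', array_map(
        function ($d) { return preg_quote($d, '!'); },
        $domains
    ));
    // scheme, optional subdomains, blocked name, rest of host, optional path
    $pattern = '!https?://(?:[a-z0-9-]+\.)*(?:' . $alternatives . ')\.[a-z0-9.-]+(?:/[^\s"\'<>]*)?!i';
    return preg_replace($pattern, $replacement, $message);
}

echo filter_links(
    'see http://www.domain1.com/file.rar now',
    ['domain1', 'domain2'],
    '[filtered]'
), "\n";
```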
