Regex to extract a string between two specific forward slashes - php

Hi I have the following text:
file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/
I need to retrieve test-test-test#2016-10-04.txt# from the string above. If I can also exclude the hash even better.
I've tried looking at examples like "Regex to find text between second and third slashes", but I'm having trouble getting it working. Can anyone help?
I'm using PHP regex to do this.

You may try the regular expression below:
\/([a-z\-]*\#[0-9\-\.]*[a-z]{3}\#)\/
A working example is here: https://www.regex101.com/r/RYsh7H/1
Explanation:
[a-z\-]* => Matches the test-test-test part: lowercase letters, which may contain dashes
\# => Matches constant # sign
[0-9\-\.]* => Matches the file name with digits, dashes and {dot}
[a-z]{3}\# => Matches your 3 letter extension and #
PS: If you really do not need the #, you do not have to use a regex; you may consider using PHP's parse_url() function instead.
Hope this helps;
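A minimal sketch of how the pattern above might be applied with preg_match(). Splitting it into two capturing groups (so the # signs can be left out) and the variable names are assumptions added for illustration, not part of the answer itself.

```php
<?php
$input = 'file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/';

// Same character classes as the pattern above, but the name and the
// date/extension are captured separately so the # signs can be dropped.
// "~" is used as the delimiter so the slashes need no escaping.
$pattern = '~/([a-z\-]*)#([0-9\-.]*[a-z]{3})#/~';

$name = $datedFile = $withHashes = null;
if (preg_match($pattern, $input, $m)) {
    $name       = $m[1];                        // test-test-test
    $datedFile  = $m[2];                        // 2016-10-04.txt
    $withHashes = $m[1] . '#' . $m[2] . '#';    // test-test-test#2016-10-04.txt#
    echo $name, PHP_EOL, $datedFile, PHP_EOL, $withHashes, PHP_EOL;
}
```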

basename() also works, so you can do it like this:
echo basename('file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/');
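If the trailing # should go as well, one possible follow-up is rtrim() with a character list. This is only a sketch; the variable names are made up for the example.

```php
<?php
$input = 'file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/';

// basename() ignores the trailing slash and returns the last path component.
$last = basename($input);      // test-test-test#2016-10-04.txt#

// rtrim() with a character list removes only the trailing "#".
$clean = rtrim($last, '#');    // test-test-test#2016-10-04.txt

echo $last, PHP_EOL, $clean, PHP_EOL;
```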

Without regex you can do:
$url_parts = parse_url('file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/');
$segments = explode('/', $url_parts['path']); // end() needs a variable, not a function result
echo end($segments);
or better:
$url_path = parse_url('file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/', PHP_URL_PATH);
$segments = explode('/', $url_path); // end() needs a variable, not a function result
echo end($segments);
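One caveat worth checking with this input: parse_url() treats the first # as the start of the fragment, so the path stops before it. The sketch below shows the split and one possible way to put the pieces back together; the variable names and the rtrim() step are assumptions for illustration.

```php
<?php
$input = 'file:/home/dx/reader/validation-garage/IDON/test-test-test#2016-10-04.txt#/';

$parts = parse_url($input);

// Everything after the first "#" ends up in "fragment", not in "path".
$path     = $parts['path'] ?? '';
$fragment = $parts['fragment'] ?? '';

// The last path segment only holds the text before the first "#".
$segments = explode('/', $path);
$name     = end($segments);

// Re-attach the fragment, then drop the trailing "#" and "/" characters.
$full = rtrim($name . '#' . $fragment, '#/');

echo $name, PHP_EOL, $full, PHP_EOL;
```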

Related

Extract only numbers from link with codeception

I have this link, and I need to work only with the numbers from it.
How would I extract them?
I didn't find any answer that works with Codeception.
https://www.my-website.com/de/booking/extras#tab-nav-extras-1426
I tried something like this:
$I->grabFromCurrentUrl('\d+');
But it won't work.
Any ideas?
Staying within the framework:
The manual clearly says that:
grabFromCurrentUrl
Executes the given regular expression against the current URI and
returns the first capturing group. If no parameters are provided, the
full URI is returned.
Since you didn't use any capturing groups (...), nothing is returned.
Try this:
$I->grabFromCurrentUrl('~(\d+)$~');
The $ at the end is optional; it just states that the string should end with the pattern.
Also note that the opening and closing pattern delimiters you would normally use (/) are replaced by tilde (~) characters for convenience, since the input string is likely to contain multiple forward slashes. Custom pattern delimiters are completely standard in regexp, as @Naktibalda pointed out in this answer.
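The difference a capturing group makes can be seen with plain preg_match(), outside the framework. This is a sketch only, not Codeception's implementation, and the $uri value is a hypothetical stand-in for whatever the framework would see.

```php
<?php
// Hypothetical stand-in for the current URI.
$uri = '/de/booking/extras#tab-nav-extras-1426';

// No capturing group: only $m[0] (the whole match) exists.
preg_match('~\d+$~', $uri, $m);
$hasGroup = isset($m[1]);    // false

// With a capturing group, $m[1] holds the digits.
preg_match('~(\d+)$~', $uri, $m);
$id = $m[1];                 // "1426"

var_dump($hasGroup, $id);
```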
You can use parse_url() to parse the entire URL and then extract the part you are interested in. After that, you can use a regex to extract only the numbers from the string.
$url = "https://www.my-website.com/de/booking/extras#tab-nav-extras-1426";
$parsedUrl = parse_url($url);
$fragment = $parsedUrl['fragment']; // Contains: tab-nav-extras-1426
$id = preg_replace('/[^0-9]/', '', $fragment);
var_dump($id); // Output: string(4) "1426"
A variant using preg_match() after parse_url():
$url = "https://www.my-website.com/de/booking/extras#tab-nav-extras-1426";
preg_match('/\d+$/', parse_url($url)['fragment'], $id);
var_dump($id[0]);
// Outputs: string(4) "1426"

PHP: Check if string is URL [duplicate]

I'm not very good at regular expressions at all.
I've been using a lot of framework code to date, but I'm unable to find one that is able to match a URL like http://www.example.com/etcetc, but it is also is able to catch something like www.example.com/etcetc and example.com/etcetc.
For matching all kinds of URLs, the following code should work:
<?php
$regex = "((https?|ftp)://)?"; // Scheme
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and password
$regex .= "([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))"; // Host or IP address
$regex .= "(:[0-9]{2,5})?"; // Port
$regex .= "(/([a-z0-9+\$_%-]\.?)+)*/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+/\$_.-]*)?"; // GET query
$regex .= "(#[a-z_.-][a-z0-9+\$%_.-]*)?"; // Anchor
?>
Then, the correct way to check against the regex is as follows:
<?php
if (preg_match("~^$regex$~i", 'www.example.com/etcetc', $m)) {
    var_dump($m);
}
if (preg_match("~^$regex$~i", 'http://www.example.com/etcetc', $m)) {
    var_dump($m);
}
?>
Courtesy: Comments made by splattermania in the PHP manual: preg_match
RegEx Demo in regex101
This worked for me in all cases I had tested:
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z0-9\&\.\/\?\:@\-_=#])*/';
Tests:
http://test.test-75.1474.stackoverflow.com/
https://www.stackoverflow.com
https://www.stackoverflow.com/
http://wwww.stackoverflow.com/
http://wwww.stackoverflow.com
http://test.test-75.1474.stackoverflow.com/
http://www.stackoverflow.com
http://www.stackoverflow.com/
stackoverflow.com/
stackoverflow.com
http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:pass@example.com/etcetc
example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www/
Every valid Internet URL has at least one dot, so the above pattern simply looks for at least two strings joined by a dot that contain only characters a URL may have.
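A small sketch for trying the pattern against a few strings. The sample list is made up for illustration, and the @ inside the character classes is an assumption about the intended pattern.

```php
<?php
// Pattern from the answer above ("@" assumed inside the character classes).
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z0-9\&\.\/\?\:@\-_=#])*/';

$samples = [
    'http://www.example.com/etcetc',
    'www.example.com/etcetc',
    'example.com/etcetc?query=aasd&dest=asds',
    'no dots here',
    'end.',
];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($url_pattern, $sample) === 1;
    printf("%-45s %s\n", $sample, $results[$sample] ? 'match' : 'no match');
}
```

Because the pattern is unanchored and only requires a single dot, a word followed by a full stop such as 'end.' also counts as a match.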
Try this:
/^(https?:\/\/)?(www\.)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/
It works the way people usually want.
It accepts URLs with or without http://, https://, and www.
You can put a question mark after part of a regular expression to make that part optional, so you would want to use:
http:\/\/(www\.)?
That will match anything that has either http://www. or http:// (with no www.)
You could just use a replace method to remove the above, thus getting you the domain. It depends on what you need the domain for.
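A sketch of the replace idea, extended to https as an assumption. The helper name stripSchemeAndWww() is hypothetical.

```php
<?php
// Hypothetical helper: strips an optional scheme and an optional "www." prefix.
function stripSchemeAndWww(string $url): string
{
    // "s?" makes the "s" optional; "(www\.)?" makes the whole group optional.
    return preg_replace('~^(https?://)?(www\.)?~i', '', $url);
}

echo stripSchemeAndWww('http://www.example.com/etcetc'), PHP_EOL; // example.com/etcetc
echo stripSchemeAndWww('https://example.com/etcetc'), PHP_EOL;    // example.com/etcetc
echo stripSchemeAndWww('www.example.com/etcetc'), PHP_EOL;        // example.com/etcetc
```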
Try something like this:
.*([\w-]+\.)+[a-z]{2,5}(/[\w-]+)*
Use:
/(https?:\/\/)?((?:(\w+-)*\w+)\.)+(?:[a-z]{2})(\/?\w?-?=?_?\??&?)+[\.]?([a-z0-9\?=&_\-%#])?/g
It matches something.com, http(s):// or www. It does not match other [something]:// URLs though, but for my purpose that's not necessary.
The regex matches e.g.:
http://foo.co.uk/
www.regex.com/foo.html?q=bar$some=thi-ng,regex
regex.foo.com/blog
You can try this:
r"(https?:\/\/)?([\w-]+\.)+([a-z]{2,5})(\/+\w+)? "
Selection:
may start with http:// or https:// (optional)
one or more words, each ending with a dot (.)
followed by 2 to 5 characters from [a-z]
followed by "/[anything]" (optional)
followed by a space
Try this
$url_reg = '/(ftp|https?):\/\/(\w+:?\w*@)?(\S+)(:[0-9]+)?(\/([\w#!:.?+=&%@!\/-])?)?/';
I have been using the following, which works for all my test cases, as well as fixes any issues where it would trigger at the end of a sentence preceded by a full-stop (end.), or where there were single character initials, such as 'C.C. Plumbing'.
The following regex contains multiple {2,}s, which means two or more matches of the previous pattern.
((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]{2,}\.([a-zA-Z0-9\&\.\/\?\:@\-_=#]){2,}
Matches URLs such as, but not limited to:
https://example.com
http://example.com
example.com
example.com/test
example.com?value=test
Does not match non-URLs such as, but not limited to:
C.C Plumber
A full-stop at the end of a sentence.
Single characters such as a.b or x.y
Please note: Due to the above, this will not match any single character URLs, such as: a.co, but it will match if it is preceded by a URL scheme, such as: http://a.co.
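The claims above can be checked with a short script. This is a sketch: the ~ delimiters, the sample list, and the @ inside the character classes are assumptions added for the example.

```php
<?php
// Pattern from the answer above ("@" assumed inside the character classes),
// wrapped in "~" delimiters for preg_match().
$pattern = '~((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]{2,}\.([a-zA-Z0-9\&\.\/\?\:@\-_=#]){2,}~';

$samples = [
    'example.com',
    'example.com?value=test',
    'http://a.co',
    'a.co',
    'C.C Plumber',
    'A full-stop at the end of a sentence.',
];

$results = [];
foreach ($samples as $sample) {
    // At least two allowed characters are required on each side of a dot.
    $results[$sample] = preg_match($pattern, $sample) === 1;
    printf("%-40s %s\n", $sample, $results[$sample] ? 'match' : 'no match');
}
```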
I ran into many issues getting the answer from anubhava to work: PHP treats $ inside double-quoted strings as the start of a variable, so the preg_match call wasn't working.
Here is what I used:
// Regular expression
$re = '/((https?|ftp):\/\/)?([a-z0-9+!*(),;?&=.-]+(:[a-z0-9+!*(),;?&=.-]+)?@)?([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))(:[0-9]{2,5})?(\/([a-z0-9+%-]\.?)+)*\/?(\?[a-z+&$_.-][a-z0-9;:@&%=+\/.-]*)?(#[a-z_.-][a-z0-9+$%_.-]*)?/i';
// Match all
preg_match_all($re, $blob, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
// The first element of the array is the full match
The Composer package URL highlight does a good job of this in PHP:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$matches = $urlHighlight->getUrls($string);
?>
If it does not have to be regex, you could always use the validate filters that are in PHP.
filter_var('http://example.com', FILTER_VALIDATE_URL);
filter_var (mixed $variable [, int $filter = FILTER_DEFAULT [, mixed $options ]]);
Types of Filters
Validate Filters
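Since FILTER_VALIDATE_URL rejects inputs without a scheme, such as example.com/etcetc, one possible workaround is to prepend http:// before validating. This is a sketch only; the helper name looksLikeUrl() and the choice of http:// as the default scheme are assumptions.

```php
<?php
// Hypothetical helper: validates with filter_var(), prepending "http://"
// when the input has no scheme, since FILTER_VALIDATE_URL requires one.
function looksLikeUrl(string $candidate): bool
{
    if (!preg_match('~^[a-z][a-z0-9+.\-]*://~i', $candidate)) {
        $candidate = 'http://' . $candidate;
    }
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_dump(looksLikeUrl('http://www.example.com/etcetc')); // bool(true)
var_dump(looksLikeUrl('www.example.com/etcetc'));        // bool(true)
var_dump(looksLikeUrl('example.com/etcetc'));            // bool(true)
var_dump(looksLikeUrl('not a url'));                     // bool(false)
```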
Regex if you want to ensure a URL starts with HTTP/HTTPS:
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)
If you do not require the HTTP protocol:
[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)

regex to clean up url

I am looking for a way to get a valid url out of a string like:
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
My original solution was:
preg_match('#^[^:|]*#', str_replace('//', '/', $string), $modifiedPath);
But obviously it's going to remove a slash from the http:// instead of the one in the middle of the string.
My expected output that I want from the original is:
http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
I could always break off the http part of the string first but would like a more elegant solution in the form of regex if possible. Thanks.
This will do exactly what you are asking:
<?php
$string = 'http://somesite.com/directory//sites/9/my_forms/3-895a3e/somefilename.jpg|:||:||:||:|19845';
preg_match('/^([^|]+)/', $string, $m); // get everything up to and NOT including the first pipe (|)
$string = $m[1];
$string = preg_replace('/(?<!:)\/\//', '/' ,$string); // replace all occurrences of // as long as they are not preceded by :
echo $string; // outputs: http://somesite.com/directory/sites/9/my_forms/3-895a3e/somefilename.jpg
exit;
?>
EDIT:
(?<!X) in regular expressions is the syntax for what is called a negative lookbehind. The X is replaced with the character(s) we are testing for.
The following expression would match every instance of double slashes (//):
\/\/
But we need to make sure that the match we are looking for is NOT preceded by the : character, so we need to "look behind" our match to see if the : character is there. If it is, then we don't want it to be counted as a match:
(?<!:)\/\/
The ! is what says NOT to match in our lookbehind. If we changed it to (?<=:)\/\/ then it would only match the double slashes that did have the : preceding them.
A quick tutorial on lookahead and lookbehind can explain it all better than I can.
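To see the two kinds of lookbehind side by side, here is a short sketch on a shortened version of the URL. The ~ delimiters and variable names are choices made for the example.

```php
<?php
$url = 'http://somesite.com/directory//sites';

// Negative lookbehind: "//" NOT preceded by ":" (only the one after "directory").
$negative = preg_match_all('~(?<!:)//~', $url, $m1);

// Positive lookbehind: "//" that IS preceded by ":" (only the one after "http:").
$positive = preg_match_all('~(?<=:)//~', $url, $m2);

// Collapse only the unwanted double slash.
$fixed = preg_replace('~(?<!:)//~', '/', $url);

echo $negative, ' ', $positive, ' ', $fixed, PHP_EOL;
// 1 1 http://somesite.com/directory/sites
```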
Assuming all your strings are in the form given, you don't need any but the simplest of regexes to do this; if you want an elegant solution, then a regex is definitely not what you need. Also, double slashes are legal in a URL, just like in a Unix path, and mean the same thing a single slash does, so you don't really need to get rid of them at all.
Why not just
$url = preg_split('/\|/', $string)[0];
?
If you really, really care about getting rid of the double slashes in the URL, then you can follow this with
$url = preg_replace('/([^:])\/\//', '$1/', $url);
or even combine them into
$url = preg_replace('/([^:])\/\//', '$1/', preg_split('/\|/', $string)[0]);
although that last form gets a little bit hairy.
Since this is a quite strictly defined situation, I'd consider just one preg to be the most elegant solution.
Off the top of my head:
$sanitizedURL = preg_replace('~((?<!:)/(?=/)|\\|.+)~', '', $rawURL);
Basically, what this does is look for any forward slash that IS NOT preceded by a colon (:) and IS followed by another forward slash. It also matches the first pipe character and every character following it.
Anything found is removed from the result.
I can explain the RegEx in more detail if you like.

Need a regular expression to capture url path

I am using PHP, and I have been trying to create a regular expression pattern to capture part of URL path, but to no avail.
The possible URL path could be any of these:
"product/zzz"
"yyyyyyyy/product/zzz"
"xxxxx/yyyyyyyy/product/zzz"
"xxxxx/yyyyyyyy/.../product/zzz" (... means other possible words)
What I need to capture is the part before "product".
For the first case, the result should be an empty string.
For the rest, the results are "yyyyyyyy", "xxxxx/yyyyyyyy" and "xxxxx/yyyyyyyy/...".
Can anyone here give me a hint? Thanks!
PS.
It looks like the part I want is a repetition of the same pattern "xxxx/", but I am not good at using groups in regex.
Update:
I probably found a solution, by capturing pattern "xxx/" with zero or more repetitions: "([^/]+/)*"
so the full regex should be "(([^/]+/)*)product/([^/]+)"
@SERPRO: it passed the test in your "Live RegExp".
Hope it is helpful.
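A sketch of that regex applied to the sample paths with preg_match(). The ^ and $ anchors, the ~ delimiters, and the rtrim() that drops the trailing slash are additions made for this example.

```php
<?php
$paths = [
    'product/zzz',
    'yyyyyyyy/product/zzz',
    'xxxxx/yyyyyyyy/product/zzz',
];

$prefixes = [];
foreach ($paths as $path) {
    // Group 1 is the repeated "segment/" part in front of "product".
    if (preg_match('~^(([^/]+/)*)product/([^/]+)$~', $path, $m)) {
        // Drop the trailing slash that the repetition leaves behind.
        $prefixes[$path] = rtrim($m[1], '/');
    }
}

print_r($prefixes);
```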
I would use parse_url():
$path = parse_url($url, PHP_URL_PATH);
// Deal with $path to figure out what's before '/product/'
This should work for you:
#(.*?)/?product.*\b#
You can see an example of result strings here:
http://xrg.es/#5awa10
This should do it:
^(.*[^/]|)/*product/[^/]+/*$
It will also allow an arbitrary number of slashes at the end of the path.
The part inside parentheses is your result.

How to get last digits which are number before '.html' string

There is a string, for example: http://address.com/sef-title-of-topic-1111.html
I could not get 1111 in any way with a regexp in PHP. Is it possible? How?
My code:
$address = 'http://address.com/sef-title-of-topic-1111.html';
preg_match('#-(.*?)\.html#sim',$address,$result);
If the url example is how they will always appear (ie. ending in hyphen, numbers, .html) then this should work:
$str = "http://address.com/sef-title-of-topic-1111.html";
preg_match('#.*-(\d+)\.html#', $str, $matches);
print_r($matches);
If they won't always match the pattern you gave in your question, then clarify by showing alternative values for your $address value.
If you know that the extension is definitely .html (and not .htm for example) and that the number always has four digits, then you could use
$lastNos = substr($input, -9, -5);
Clearly a simple solution but you have not specified why regex is required.
If the URL will always be in this format I would use str_replace to strip the .html then explode by "-" and find the last piece.
Of course all of that is assuming the URL is always in this format.
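The str_replace and explode approach described above could look roughly like this. It is a sketch that assumes the URL always ends in "-<number>.html"; the variable names are made up.

```php
<?php
$url = 'http://address.com/sef-title-of-topic-1111.html';

// Strip the extension, then split on "-" and keep the last piece.
$withoutExt = str_replace('.html', '', $url);
$pieces     = explode('-', $withoutExt);
$number     = end($pieces);   // "1111"

echo $number, PHP_EOL;
```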
If the format is always the same, you don't need a regex.
$url = "http://address.com/sef-title-of-topic-1111.html";
$pieces = explode("-", $url); // the last piece is "1111.html"
echo explode(".", end($pieces))[0]; // prints 1111
edit: sorry, my PHP is a bit rusty
