Put URLs from string into array using regex (problem with trailing period) - php

I am trying to write a function that pulls all URLs from a string and removes a potential trailing period from the end of each.
function getUrls($string) {
$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
return ($matches[0]);
}
But that returns http://test.com. (note the trailing period). If I have
$string = "Hi I am sharing http://test.com.";
$urls = getUrls($string);
It returns the URL with the period at the end.

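This one seems to work (taken from here). Note that it matches only two dot-separated labels after the scheme, so http://www.test.com/path would be cut down to http://www.test: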
$regex="/(https?:\/\/+[\w\-]+\.[\w\-]+)/i";
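If you want to keep paths and query strings, another option is to match loosely and then strip trailing punctuation from each hit. This is only a minimal sketch: the function name getUrlsTrimmed and the set of characters passed to rtrim() are my own assumptions, so adjust them to your input.

```php
<?php
// Hypothetical helper: match loosely, then strip trailing punctuation.
function getUrlsTrimmed(string $text): array
{
    // Stop at whitespace, double quotes and angle brackets.
    preg_match_all('~https?://[^\s"<>]+~i', $text, $matches);

    return array_map(
        static function (string $url): string {
            // Assumed set of sentence punctuation that should not end a URL.
            return rtrim($url, '.,;:!?');
        },
        $matches[0]
    );
}

print_r(getUrlsTrimmed('Hi I am sharing http://test.com. Also https://example.org/a?b=1, thanks'));
```

The trade-off is that a URL which legitimately ends in one of those characters will lose it.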

In case anyone comes across this, here is what I put together:
$aProtocols = array('http:\/\/', 'https:\/\/', 'ftp:\/\/', 'news:\/\/', 'nntp:\/\/', 'telnet:\/\/', 'irc:\/\/', 'mms:\/\/', 'ed2k:\/\/', 'xmpp:', 'mailto:');
$aSubdomains = array('www'=>'http://', 'ftp'=>'ftp://', 'irc'=>'irc://', 'jabber'=>'xmpp:');
$sRELinks = '/(?:(' . implode('|', $aProtocols) . ')[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])|(?:(?:(?:(?:[^#:<>(){}`\'"\/\[\]\s]+:)?[^#:<>(){}`\'"\/\[\]\s]+#)?(' . implode('|', array_keys($aSubdomains)) . ')\.(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6}(?:[\/#?](?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?)|(?:(?:[^#:<>(){}`\'"\/\[\]\s]+#)?((?:(?:(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))(?:\.(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))){3})|(?:[A-Fa-f0-9:]{16,39}))|(?:(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6}))\/(?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s](?:[#?](?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?)?)|(?:[^#:<>(){}`\'"\/\[\]\s]+:[^#:<>(){}`\'"\/\[\]\s]+#((?:(?:(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))(?:\.(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))){3})|(?:[A-Fa-f0-9:]{16,39}))|(?:(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6}))(?:\/(?:(?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?)?(?:[#?](?:[^\^\[\]{}|\\"\'<>`\s]*[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)?))|([^#:<>(){}`\'"\/\[\]\s]+#(?:(?:(?:[^`~!##$%^&*()_=+\[{\]}\\|;:\'",<.>\/?\s]+\.)+[a-z]{2,6})|(?:(?:(?:(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))(?:\.(?:(?:[0-1]?[0-9]?[0-9])|(?:2[0-4][0-9])|(?:25[0-5]))){3})|(?:[A-Fa-f0-9:]{16,39}))))(?:[^\^*\[\]{}|\\"<>\/`\s]+[^!#\^()\[\]{}|\\:;"\',.?<>`\s])?)/i';
function getUrls($string) {
global $sRELinks;
preg_match_all($sRELinks, $string, $matches);
return ($matches[0]);
}
From http://yellow5.us/journal/server_side_text_linkification/

Depending on how strict you want to be, consider the Liberal, Accurate Regex Pattern for Matching URLs regular expression pattern discussed on Daring Fireball. The pattern in full is:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
If you are interested in how it works, Alan Storm has a great explanation.
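For reference, here is roughly how that pattern could be plugged into the function from the question. Treat it as a sketch: the name getUrlsLiberal is made up, and I picked ~ as the delimiter only because it does not occur in the pattern.

```php
<?php
// Hypothetical wrapper around the Daring Fireball pattern quoted above.
function getUrlsLiberal(string $text): array
{
    // "~" delimiters avoid having to escape the slashes in the pattern.
    $pattern = '~\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))~i';
    preg_match_all($pattern, $text, $matches);

    // $matches[0] holds the full matches; the other entries are sub-groups.
    return $matches[0];
}

print_r(getUrlsLiberal('Hi I am sharing http://test.com.'));
```

Because the pattern requires the last character to be non-punctuation (or a slash), the trailing period is left out of the match.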

Related

PHP regex replace multiple patterns with callback

I'm trying to run a simple replacement on some input data that could be described as follows:
take a regular expression
take an input data stream
on every match, replace the match through a callback
Unfortunately, preg_replace_callback() doesn't work as I'd expect. It gives me all the matches on the entire line, not individual matches. So I need to put the line together again after replacement, but I don't have the information to do that. Case in point:
<?php
echo replace("/^\d+,(.*),(.*),.*$/", "12,LOWERME,ANDME,ButNotMe")."\n";
echo replace("/^\d+-\d+-(.*) .* (.*)$/", "13-007-THISLOWER ThisNot THISAGAIN")."\n";
function replace($pattern, $data) {
return preg_replace_callback(
$pattern,
function($match) {
return strtolower($match[0]);
}, $data
);
}
https://www.tehplayground.com/hE1ZBuJNtFiHbdHO
gives me 12,lowerme,andme,butnotme, but I want 12,lowerme,andme,ButNotMe.
I know using $match[0] is wrong. It's just to illustrate here. Inside the closure I need to run something like
foreach ($match as $m) { /* do something */ }
But as I said, I have no information about the position of the matches in the input string which makes it impossible to put the string together again.
I've dug through the PHP documentation and run several searches, but couldn't find a solution.
Clarifications:
I know that $match[1], $match[2]... etc contain the matches. But only a string, not a position. Imagine in my example the final string is also ANDME instead of ButNotMe - according to the regex, it should not be matched and the callback should not be applied to it. That's why I'm using regexes in the first place instead of string replacements.
Also, the reason I'm using capture groups this way is that I need the replacement process to be configurable. So I cannot hardcode something like "replace #1 and #2 but not #3". On a different input file, the positions might be different, or there might be more replacements needed, and only the regex used should change.
So if my input is "15,LOWER,ME,NotThis,AND,ME,AGAIN", I want to be able to just change the regex, not the code and get the desired result. Basically, both $pattern and $data are variable.
This uses preg_match() and PREG_OFFSET_CAPTURE to return the capture groups and the offset within the original string where it is found. This then uses substr_replace() with each capture group to replace only the part of the string which is to be changed - this stops any chance of replacing similar text which you do not want to be changed...
function lowerParts (string $input, string $regex ) {
preg_match($regex, $input, $matches, PREG_OFFSET_CAPTURE);
array_shift($matches);
foreach ( $matches as $match ) {
$input = substr_replace($input, strtolower($match[0]),
$match[1], strlen($match[0]));
}
return $input;
}
echo lowerParts ("12,LOWERME,ANDME,ButNotMe", "/^\d+,(.*),(.*),.*$/");
gives...
12,lowerme,andme,ButNotMe
But also with
echo lowerParts ("12,LOWERME,ANDME,LOWERME", "/^\d+,(.*),(.*),.*$/");
it gives
12,lowerme,andme,LOWERME
Edit:
If the replacement data is of different lengths, then you would need to chop the string up into parts and replace each one. The complication is that each change in length alters the relative position of the offsets, so this has to keep track of what this offset is. This version also has a parameter which is the process you want to apply to the strings (this example just passes "strtolower") ...
function processParts (string $input, string $regex, callable $process ) {
preg_match($regex, $input, $matches, PREG_OFFSET_CAPTURE);
array_shift($matches);
$offset = 0;
foreach ( $matches as $match ) {
$replacement = $process($match[0]);
$input = substr($input, 0, $match[1]+$offset)
.$replacement.
substr($input, $match[1]+$offset+strlen($match[0]));
$offset += strlen($replacement) - strlen($match[0]);
}
return $input;
}
echo processParts ("12,LOWERME,ANDME,LOWERME", "/^\d+,.*,(.*),(.*)$/", "strtolower");
This will work:
function replaceGroups(string $pattern, string $string, callable $callback)
{
preg_match($pattern, $string, $matches, PREG_OFFSET_CAPTURE);
array_shift($matches);
foreach (array_reverse($matches) as $match) {
$string = substr_replace($string, $callback($match[0]), $match[1], strlen($match[0])); // offsets are in bytes, so use strlen(), not mb_strlen()
}
return $string;
}
echo replaceGroups("/^\d+-\d+-(.*) .* (.*)$/", "13-007-THISLOWER ThisNot THISAGAIN", 'strtolower');
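If the pattern can match more than once in the input, the same idea can be moved inside preg_replace_callback(), which accepts PREG_OFFSET_CAPTURE as a flag since PHP 7.4. A minimal sketch, assuming numbered, non-nested groups that appear left to right; the function name replaceGroupsInAllMatches is made up for illustration.

```php
<?php
// Hypothetical helper; needs PHP 7.4+ for the $flags argument.
function replaceGroupsInAllMatches(string $pattern, string $subject, callable $callback): string
{
    $result = preg_replace_callback(
        $pattern,
        static function (array $m) use ($callback): string {
            // With PREG_OFFSET_CAPTURE every entry is [text, byte offset in subject].
            [$whole, $start] = $m[0];
            $groups = array_slice($m, 1);

            // Go right to left so earlier offsets stay valid after a replacement.
            foreach (array_reverse($groups) as [$text, $offset]) {
                if ($offset < 0) {
                    continue; // group did not take part in the match
                }
                $whole = substr_replace($whole, $callback($text), $offset - $start, strlen($text));
            }

            return $whole;
        },
        $subject,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $subject;
}

echo replaceGroupsInAllMatches('/id=(\w+);keep=\w+/', 'id=ABC;keep=XYZ id=DEF;keep=UVW', 'strtolower'), "\n";
```

Only the captured parts go through the callback; everything else in each match is returned unchanged.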

find all subdomains in a string using php

Hello, here is my HTML:
<div>
hello.domain.com
holla.domain.com
stack.domain.com
overflow.domain.com </div>
I want to return an array with: hello, holla, stack, overflow
Then I have this URL: https://hello.domain.com/c/mark?lang=fr
I want to return the value: mark
I assume it should be done with regular expressions, but any approach, regex or not, is fine. Thank you.
Part 1: Subdomains
$regex = '~\w+(?=\.domain\.com)~i';
preg_match_all($regex, $yourstring, $matches);
print_r($matches[0]);
See the matches in the regex demo.
Match Array:
[0] => hello
[1] => holla
[2] => stack
[3] => overflow
Explanation
The i modifier makes it case-insensitive
\w+ matches letters, digits or underscores (our match)
The lookahead (?=\.domain\.com) asserts that it is followed by .domain.com
Part 2: Substring
$regex = '~https://hello\.domain\.com/c/\K[^\s#?]+(?=\?)~';
if (preg_match($regex, $yourstring, $m)) {
$thematch = $m[0];
}
else { // no match...
}
See the match in the regex demo.
Explanation
https://hello\.domain\.com/c/ matches https://hello.domain.com/c/
The \K tells the engine to drop what was matched so far from the final match it returns
[^\s#?]+ matches any chars that are not a white-space char, ? or # url fragment marker
The lookahead (?=\?) asserts that it is followed by a ?
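To make the effect of \K more concrete, here is a small comparison with an ordinary capture group. It is only an illustration; the variable names are arbitrary.

```php
<?php
$url = 'https://hello.domain.com/c/mark?lang=fr';

// With \K the wanted text is the whole match, i.e. $m[0].
preg_match('~https://hello\.domain\.com/c/\K[^\s#?]+(?=\?)~', $url, $m);
$viaK = $m[0];

// The same result with a capture group: the wanted text is $g[1].
preg_match('~https://hello\.domain\.com/c/([^\s#?]+)\?~', $url, $g);
$viaGroup = $g[1];

var_dump($viaK, $viaGroup);
```

Both variables end up holding the same path segment; \K just saves you from indexing into a group.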
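Although I am not sure where you are trying to take this,
$input = 'something.domain.com';
// trim($input, '.domain.com') would strip any of those characters from both ends, not the suffix
$string = preg_replace('/\.domain\.com$/', '', $input);
may help you.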
About the second part of your question, you can use the parse_url function:
$yourURL = 'https://hello.domain.com/c/mark?lang=fr';
$segments = explode('/', parse_url($yourURL, PHP_URL_PATH));
$result = end($segments); // end() takes its argument by reference, so it needs a variable
For the second part of your question (extract part of a URL) others have answered with a highly specific regex solution. More generally what you are trying to do is parse a URL for which there already exists the parse_url() function. You will find the following more flexible and applicable to other URLs:
php > $url = 'https://hello.domain.com/c/mark?lang=fr';
php > $urlpath = parse_url($url, PHP_URL_PATH);
php > print $urlpath ."\n";
/c/mark
php > print basename($urlpath) . "\n";
mark
php > $url = 'ftp://some.where.com.au/abcd/efg/wow?lang=id&q=blah';
php > print basename(parse_url($url, PHP_URL_PATH)) . "\n";
This assumes that you are after the last part of the URL path, but you could use explode("/", $urlpath) to access other components in the path.
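As a rough sketch of that last point, the path can be split into segments, and the same approach works for the host if you want the subdomain. I am assuming here that the subdomain is simply the first label of the host.

```php
<?php
$url = 'https://hello.domain.com/c/mark?lang=fr';

// Path segments: strip the leading slash, then split on "/".
$path = parse_url($url, PHP_URL_PATH);
$segments = explode('/', trim($path, '/'));

// Subdomain: assumed to be the first dot-separated label of the host.
$host = parse_url($url, PHP_URL_HOST);
$subdomain = explode('.', $host)[0];

print_r($segments);
echo $subdomain, "\n";
```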

Replace only the last match using preg_replace()

Suppose a user supplies a regex, and only the last match in the string should be replaced by a given replacement string.
Example:
$str = "hello, world, hello!";
// For now, regex will be for example just word,
// but it should work with match too
replaceLastMatch($str, "hello", "replacement");
echo $str; // Should output "hello, world, replacement!";
Use a negative lookahead to ensure that you only match the last occurrence of the search string:
function replaceLastMatch($str, $search, $replace) {
$pattern = sprintf('~%s(?!.*%1$s)~', $search);
return preg_replace($pattern, $replace, $str, 1);
}
Usage:
$str = "hello, world, hello!";
echo replaceLastMatch($str, 'h\w{4}', 'replacement');
echo replaceLastMatch($str, 'hello', 'replacement');
Output:
hello, world, replacement!
Here is what I came up with:
Short version:
It is vulnerable, though (e.g. if the user's pattern contains groups such as (abc), this will break):
function replaceLastMatch($string, $search, $replacement) {
// Escape all / as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^(.*)' . str_replace('/', '\\/', $search) . '(.*?)$/s';
return preg_replace($search, '${1}' . $replacement . '${2}', $string);
}
Longer version (personally recommended):
This version allows the user to use groups without interfering with this function (e.g. pattern ((ab[cC])+(XY)*){1,5}):
function replaceLastMatch($string, $search, $replacement) {
// Escape all '/' as it delimits the regex
// Construct the regex pattern to be ungreedy at the right (? behind .*)
$search = '/^.*(' . str_replace('/', '\\/', $search) . ').*?$/s';
// Match our regex and store matches including offsets
// If regex does not match, return $string as-is
if(1 !== preg_match($search, $string, $matches, PREG_OFFSET_CAPTURE))
return $string;
return substr($string, 0, $matches[1][1]) . $replacement
. substr($string, $matches[1][1] + strlen($matches[1][0]));
}
One general warning: You should be very careful with user input, as it can do all kinds of nasty stuff. Always be prepared for inputs that are rather "unproductive".
Explanation:
The core of the match last functionality is the ? (greediness inversion) operator (see Repetition - somewhere in the middle).
Repetition patterns (e.g. .*) are greedy by default, consuming as much as they can possibly match. Making a pattern ungreedy (e.g. .*?) makes it match as little as possible (while still allowing the overall match to succeed).
Hence, in our case, the greedy front part of the pattern will always have precedence over the non-greedy back part and our custom middle part will match the very last instance possible.
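A quick way to see this in action is to compare where the middle part ends up with a greedy versus a lazy prefix. This is just a demonstration snippet, not part of the functions above.

```php
<?php
$subject = 'hello, world, hello!';

// Greedy prefix: .* grabs as much as possible, so "hello" lands on its last occurrence.
preg_match('/^.*(hello)/s', $subject, $greedy, PREG_OFFSET_CAPTURE);

// Lazy prefix: .*? grabs as little as possible, so "hello" lands on its first occurrence.
preg_match('/^.*?(hello)/s', $subject, $lazy, PREG_OFFSET_CAPTURE);

echo $greedy[1][1], "\n"; // 14
echo $lazy[1][1], "\n";   // 0
```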

Can't get PHP Regex working

I'm trying to use PHP regular expressions. I've tried this code:
$regex = "c:(.+),";
$input = "otherStuff094322f98c:THIS,OtherStuffHeree129j12dls";
$match = Array();
preg_match_all($regex, $input, $match);
It should return the substring THIS ("c" and ":" followed by any character combination followed by ",") from $input. But it returns an empty array. What am I doing wrong?
You need delimiters (for example slashes) around the pattern for the regex to work.
Also, .+ is greedy, so if the input contains more than one comma it will match up to the last one, which you don't want. Use .+? or [^,]+ instead:
$regex = "/c:(.+?),/";
or
$regex = "/c:([^,]+),/";
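Put together with the input from the question, it could look like the following sketch. The second input string ($twoCommas) is something I made up to show the difference between greedy and lazy matching.

```php
<?php
$input = 'otherStuff094322f98c:THIS,OtherStuffHeree129j12dls';

// Delimited pattern; [^,]+ stops at the first comma.
preg_match_all('/c:([^,]+),/', $input, $match);
print_r($match[1]); // contents of the capture group for every match

// Made-up input with two commas to compare greedy and lazy quantifiers.
$twoCommas = 'c:THIS,that,';
preg_match('/c:(.+),/', $twoCommas, $greedy);
preg_match('/c:(.+?),/', $twoCommas, $lazy);

echo $greedy[1], "\n"; // THIS,that
echo $lazy[1], "\n";   // THIS
```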

regular expression in PHP to create wiki-style links

I'm developing a site which is going to use wiki-style links to internal content eg [[Page Name]]
I'm trying to write a regex to achieve this and I've got as far as turning it into a link and replacing spaces with dashes (this is our space substitute rather than underscores) but only for page names of two words.
I could write a separate regex for all likely numbers of words (say from 10 downwards) but I'm sure there must be a neater way of doing it.
Here's what I have at the moment:
$regex = "#[\[][\[]([^\s\]]*)[\s]([^\s\]]*)[\]][\]]#";
$description = preg_replace($regex,"$1 $2",$description);
If someone can advise me how I can modify this regex so it works for any number of words that would be really helpful.
You can use the preg_replace_callback() function which accepts a callback to process the replacement string. You can also use lazy quantifiers in the pattern instead of a lot of negations inside character classes.
The external preg_replace_callback will extract the matched text and pass it to the callback function, which will return the properly modified version.
$str = '[[Page Name with many words]]';
echo preg_replace_callback('/\[\[(.*?)\]\]/', 'parse_tags', $str);
function parse_tags($match) {
$text = $match[1];
$slug = preg_replace('/\s+/', '-', $text);
return '<a href="' . $slug . '">' . $text . '</a>';
}
You should use a callback function to do the replacement (using preg_replace_callback):
$str = preg_replace_callback('/\[\[([^\]]+)\]\]/', function($matches) {
return '<a href="' . preg_replace('/\s+/', '-', $matches[1]) . '">' . $matches[1] . '</a>';
}, $str);
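If the page names come from user-editable content, you may also want to escape them before putting them into HTML. The following is a sketch under that assumption; the function name wikiLinks is invented here, and using the bare slug as the href is a guess at your URL scheme.

```php
<?php
// Hypothetical helper: turns [[Page Name]] into a link with a dashed slug.
function wikiLinks(string $text): string
{
    $result = preg_replace_callback(
        '/\[\[([^\]]+)\]\]/',
        static function (array $m): string {
            $name = trim($m[1]);
            // Collapse each run of whitespace into a single dash.
            $slug = preg_replace('/\s+/', '-', $name);

            // Escape both values so quotes or angle brackets cannot break the markup.
            return '<a href="' . htmlspecialchars($slug, ENT_QUOTES) . '">'
                . htmlspecialchars($name, ENT_QUOTES) . '</a>';
        },
        $text
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $text;
}

echo wikiLinks('See [[Page Name with many words]] and [[Home]].'), "\n";
```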
