I've been searching for a regex to replace plain-text URLs in a string (the string can contain more than one URL) with:
url
and I found this:
http://mathiasbynens.be/demo/url-regex
I would like to use diegoperini's regex (which, according to the tests, is the best):
_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS
But I want to make it global so that it replaces all the URLs in a string.
When I use this:
/_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS/g
It does not work. How do I make this regex global, and what do the underscore at the beginning and the "_iuS" at the end mean?
I would like to use it with php so I am using:
preg_replace($regex, '$0', $examplestring);
The underscores are the regex delimiters; i, u and S are pattern modifiers:
i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower
case letters.
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible
with Perl. Pattern and subject strings are treated as UTF-8. (Note that this is
the lowercase u used in the pattern above, not the uppercase U, which is
PCRE_UNGREEDY and inverts the greediness of quantifiers.)
S
When a pattern is going to be used several times, it is worth spending more
time analyzing it in order to speed up the time taken for matching. If this
modifier is set, then this extra analysis is performed. At present, studying
a pattern is useful only for non-anchored patterns that do not have a single
fixed starting character.
For more information, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
When you added the / ... /g, you added a second pair of regex delimiters plus the modifier g, which does not exist in PCRE. That's why it did not work.
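To illustrate the point about delimiters and the missing g modifier, here is a minimal sketch. The pattern is a simplified stand-in (an assumption, not diegoperini's full regex), and the anchor markup in the replacement is only illustrative.

```php
<?php
// Simplified stand-in pattern (an assumption, NOT diegoperini's full regex):
// a scheme followed by a run of non-whitespace characters.
// "~" is the delimiter; "i" and "u" are modifiers. PCRE has no "g" modifier.
$pattern = '~(?:https?|ftp)://\S+~iu';

$text = 'See http://a.example/x and https://b.example/y now';

// preg_replace() already replaces every match; $count receives the total.
// The anchor markup in the replacement is illustrative only.
$linked = preg_replace($pattern, '<a href="$0">$0</a>', $text, -1, $count);

// preg_match_all() is the "global" counterpart of preg_match().
preg_match_all($pattern, $text, $matches);

echo $linked, PHP_EOL;
print_r($matches[0]);
```

Because preg_replace() and preg_match_all() are global by design, the only changes the original pattern should need are dropping the ^ and $ anchors and keeping a single pair of delimiters.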
I agree with @verdesmarald and used this pattern in the following function:
// diegoperini's pattern without the ^ and $ anchors, so it matches URLs anywhere in the text.
$pattern = '_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS';
$string = preg_replace_callback(
    $pattern,
    // A closure is used here because create_function() was removed in PHP 8.0.
    function ($match) {
        $m = trim(strtolower($match[0]));
        // Strip the scheme and any "www." from the display text.
        $m = str_replace(array("http://", "https://", "ftp://", "www."), "", $m);
        // Truncate long display text to 25 characters.
        if (strlen($m) > 25) {
            $m = substr($m, 0, 25) . "...";
        }
        return $m;
    },
    $string
);
return $string;
It seems to do the trick and resolves an issue I was having. As @verdesmarald said, removing the ^ and $ characters allowed the pattern to work even in my preg_replace_callback().
The only thing that concerns me is how efficient the pattern is. If used in a busy, high-traffic web app, could it cause a bottleneck?
UPDATE
The above regex pattern breaks if there is a trailing dot at the end of the path section of a URL, like so: http://www.mydomain.com/page.. To solve this I modified the final part of the regex pattern by adding ^. to the character class, making it [^\s^.]. As I read it, this means: do not match a trailing space or dot.
In my tests so far it seems to be working fine.
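One caveat with the [^\s^.] class: inside a character class, the second ^ is a literal caret, and the class excludes every dot, not only a trailing one, so a path such as /page.html would be cut at the first dot. The sketch below compares it with one possible alternative, a negative lookbehind that only refuses to end on a dot. The host part is a simplified stand-in (an assumption), not the full pattern.

```php
<?php
$text = 'Visit http://www.mydomain.com/page.html. Thanks';

// Simplified stand-in for the scheme and host (assumption):
// any run of characters that are neither whitespace nor a slash.
$host = 'https?://[^\s/]+';

// Path class from the update above: the path stops at the FIRST dot.
preg_match('~' . $host . '(?:/[^\s^.]*)?~i', $text, $a);

// Alternative sketch: match the path greedily, then refuse to end on a dot.
preg_match('~' . $host . '(?:/\S*(?<!\.))?~i', $text, $b);

echo $a[0], PHP_EOL; // http://www.mydomain.com/page
echo $b[0], PHP_EOL; // http://www.mydomain.com/page.html
```

The lookbehind version drops the sentence-ending dot but keeps dots inside the path. Whether that is the right trade-off depends on the URLs you expect.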
Related
There is a website from which I would like to get every string that matches the pattern <td> (any content) </td>.
So I wrote this:
preg_match("/<td>.*</td>/", $web , $matches);
die(var_dump($matches));
That returns null. How do I fix the problem? Thanks for helping.
OK. You are just not escaping the slash properly, I guess.
Also use groups to capture the cell contents.
<td>(.*)<\/td>
should do. Don't forget to match globally if you want ALL the td's (preg_match_all in PHP).
Usually parsing HTML with regex is not a good idea, try to use DOM parsers instead.
Example -> http://simplehtmldom.sourceforge.net/
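As a rough illustration of the DOM approach, here is a sketch that uses PHP's built-in DOMDocument instead of the library linked above (a swapped-in technique, chosen only because it needs no download). The sample markup is hypothetical.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<table><tr><td>one</td><td>two <b>bold</b></td></tr></table>';

// Collect parser warnings internally; real-world HTML is often sloppy.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    // textContent is the text of the cell with nested tags removed.
    $cells[] = $td->textContent;
}

print_r($cells);
```

Unlike the regex, this keeps working when a cell has attributes, nested tags, or line breaks inside it.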
Test the above regex with
$web = file_get_contents('http://www.w3schools.com/html/html_tables.asp' );
preg_match_all("/<td>(.*)<\/td>/", $web , $matches);
print_r( $matches);
Lazy Quantifier, Different Delimiter
You need .*? rather than .*, otherwise you can overshoot the closing </td>. Also, your / delimiter needed to be escaped when it appeared in </td>. We can replace it with another one that doesn't need escaping.
Do this:
$regex = '~<td>.*?</td>~';
preg_match_all($regex, $web, $matches);
print_r($matches[0]);
Explanation
The ~ is just an esthetic tweak: you can use any delimiter you like around your regex pattern, and in general ~ is more versatile than /, which needs to be escaped more often, for instance in </td>.
The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as many characters as needed to allow the next token to match (shortest match). Without the ?, the .* first matches the whole string, then backtracks only as far as needed to allow the next token to match (longest match).
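The difference between the two quantifiers is easy to see on a short, hypothetical row of cells. This is only a sketch on a one-line string; cells that span several lines would also need the s modifier so that the dot matches newlines.

```php
<?php
// Hypothetical one-line sample standing in for the page source.
$web = '<tr><td>one</td><td>two</td></tr>';

// Greedy: .* runs to the end, then backtracks to the LAST </td>.
preg_match_all('~<td>.*</td>~', $web, $greedy);

// Lazy: .*? stops at the FIRST </td>; the group captures the cell content.
preg_match_all('~<td>(.*?)</td>~', $web, $lazy);

print_r($greedy[0]); // a single match spanning both cells
print_r($lazy[0]);   // one match per cell
print_r($lazy[1]);   // the captured contents: one, two
```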
Sorry that my question is so horribly worded, but I have no idea how to state it as a question. It is easier for me to just show code and explain.
I am trying to write a function to allow for tagging of words. We have a database of words that we call the glossary. I want to take a large amount of text and look for multiple instances of [G]some word/words here[/G]. Then I want to replace each of them with {WORD/WORDS BETWEEN [G][/G]}
Here is my current function:
function getGlossary($str)
{
$patterns = array();
$patterns[]='/\[G\](.*)\[\/G\]/';
$replacements = array();
$replacements[]='$1';
return preg_replace($patterns, $replacements, $str);
}
echo getGlossary($txt);
If I only do a single instance of the [G][/G] tag it works.
$txt='What you need to know about [G]beans[/G]';
This will output
What you need to know about beans
However this
$txt='What you need to know about [G]beans[/G] and [G]corn[/G]';
will output
What you need to know about beans[/G] and [G]corn
I am sure I have something wrong in my pattern. Any help would be appreciated.
You need to make your dot-star lazy: .*?
Without the ? to keep the .* in check, the .* will eat up all characters up to the final [/G].
The * quantifier is greedy, so the .* starts off by matching all the characters in the string up to the very end. Then it backtracks only as far as needed to allow the [/G] to match (therefore, it only backtracks to the last [/G]).
The ? makes quantifiers "lazy", so that they only match as far as needed for the rest of the regex to match. Therefore it will only match up to the first [/G].
Modify your regex like so:
$pattern = "~\[G\](.*?)\[/G\]~";
Note that to make the regex easier to read, I have changed the delimiter and unescaped the forward slash, as there is no need to escape slashes unless the delimiter is a slash. Common delimiters include ~, %, @, #... But really tildes are the most beautiful. :)
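Put together with the function from the question, the fix might look like the sketch below. The '$1' replacement is taken from the question as posted; substitute whatever markup you actually need.

```php
<?php
function getGlossary($str)
{
    // Lazy .*? stops at the first [/G]; ~ as delimiter avoids escaping the slash.
    return preg_replace('~\[G\](.*?)\[/G\]~', '$1', $str);
}

$txt = 'What you need to know about [G]beans[/G] and [G]corn[/G]';
echo getGlossary($txt), PHP_EOL; // What you need to know about beans and corn
```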
Reference
The Many Degrees of Regex Greed
Repetition with Star and Plus
I am working on getting a regex to work, but now I am starting to go in circles.
This regex would be used in CodeIgniter for routing purposes, something like:
$route['([\p{Ll}\p{Cyrillic}0-9\s\-]+)-(\d+).html'] = "blog/$2";
I've got a regex that does what I need to:
$pattern = "/^[\p{Ll}0-9\s\-]+$/u";
But for some reason it doesn't want to work in the pattern below:
$str="asdбв-37.html";
$pattern = "#^([\p{Ll}\p{Cyrillic}0-9\s\-]+)-(\d+).html#";
$result = (bool) preg_match($pattern, $str);
if($result)
echo "$str is composed of Cyrillic and alphanumeric characters\n";
My end goal is to check that every character, from any language, is written in lower case, which is why I have used \p{Ll}.
The pattern that works does not allow periods, which is why it fails for asdбв-37.html. Try adding them in:
^[a-zA-Z\p{Cyrillic}0-9\s.-]+$
[Also, you don't need to escape the - if it's at the beginning or end of a character class; in those positions it is already a literal.]
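One more thing that may be worth checking, as an assumption about the cause rather than a confirmed diagnosis: the failing pattern in the question has no u modifier, while the working one does. Without u, PCRE reads the subject byte by byte, so multi-byte Cyrillic letters are not seen as single characters. A sketch of the route-style pattern with u added, the dot escaped, and the end anchored:

```php
<?php
$str = 'asdбв-37.html';

// "u" makes PCRE treat pattern and subject as UTF-8,
// "\." matches a literal dot, and "$" anchors the end of the string.
$pattern = '#^([\p{Ll}\p{Cyrillic}0-9\s-]+)-(\d+)\.html$#u';

if (preg_match($pattern, $str, $m)) {
    echo "slug: {$m[1]}, id: {$m[2]}\n"; // slug: asdбв, id: 37
}
```

Note that \p{Cyrillic} also matches uppercase Cyrillic letters, so if everything must be lower case, \p{Ll} alone may be closer to the stated goal.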
I am not sure if this problem is a boo-boo on my part or something about CI. I have a preg_replace process to convert a published gdoc spreadsheet url back into the original spreadsheet url.
$pat ='/(^[a-z\/\.\:]*?sheet\/)(pub)([a-zA-Z0-9\=\?]*)(\&output\=html)/';
$rep ='$1ccc$3#gid=0';
$theoriginal = preg_replace( $pat, $rep, $published );
This works fine in a test page run locally. This test page isn't framed by CI - it's just a basic php page.
When I copy and paste the pattern and replacement into the CI view which it's intended for, no joy.
Is this malfunction caused by CI or my 'bad' ? Are there easy-to-implement remedies ?
Here's a bit more code from the CI view:
<body id="sites" >
<?php
foreach ( $dets as $item )
{
$nona = $item->nona;
$address = $item->address;
$town = $item->town;
$pc = $item->pc;
$foto1 = $item->foto1;
$foto1txt = $item->foto1txt;
$foto2 = $item->foto2;
$foto2txt = $item->foto2txt;
$costurl = $item->costurl;
$sid = $item->sid;
}
//convert published spreadsheet url to gdoc spreadsheet url
$pat ='/(^[a-z\/\.\:]*?sheet\/)(pub)([a-zA-Z0-9\=\?]*)(\&output\=html)/i';
$rep ='$1ccc$3#gid=0';
$spreadsheet = preg_replace( $pat, $rep, $costurl);
Tom
The pattern you came to can be "tidied" up a bit:
~^(.*?sheet/)pub(.*)(&[a-z=]*)$~
See the regex demo.
The leading ^ and trailing $ are not usually put inside the groups. The / can be left unescaped if you use a regex delimiter other than /. A & and = are not special regex metacharacters, = is only "special" in positive lookaround constructs. So, your pattern means:
^ - start of a string anchor
(.*?sheet/) - Group 1: any 0+ chars other than line break chars, as few as possible (and since I believe the point is to only match pub in the URL path, not the query string, you actually need to replace .*? with [^?#]*?, a negated character class matching 0+ chars other than # and ?), up to the first occurrence of sheet/ and the subsequent subpatterns...
pub - a substring
(.*) - Group 2: any 0+ chars other than line break chars, as many as possible, up to the last occurrence of the subsequent subpatterns...
(&[a-z=]*) - Group 3: a & followed with 0 or more ASCII letters (since i modifier is used, the [a-z] pattern will also match uppercase letters) and/or =
$ - end of string anchor.
It seems to me that you may also use a better pattern like
~^([^?#]*?sheet/)pub(.*)(&[a-z=]*)$~
   ^^^^^^
See this regex demo. Explanation of the change is provided in the explanation above.
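Applied in PHP, the tidied pattern might be used as in the sketch below. The sample URL and key are hypothetical, and the replacement string is adapted from the question: because the groups are renumbered, the query part is now $2 rather than $3.

```php
<?php
// Hypothetical published-sheet URL; real URLs may look different.
$costurl = 'https://docs.google.com/spreadsheet/pub?key=0Abc123&output=html';

// The i modifier lets [a-z] match uppercase letters as well.
$pat = '~^([^?#]*?sheet/)pub(.*)(&[a-z=]*)$~i';

// $1 = everything up to "sheet/", $2 = the query part before the last "&...".
$rep = '$1ccc$2#gid=0';

$spreadsheet = preg_replace($pat, $rep, $costurl);
echo $spreadsheet, PHP_EOL; // https://docs.google.com/spreadsheet/ccc?key=0Abc123#gid=0
```

If the same code behaves differently inside the CI view, dumping $costurl with var_dump() just before the call is a quick way to check whether the input is what you expect.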
I know this has been asked before, as I've just been reading those answers, but I still can't get this to work (properly).
I'm very new to regex and am trying to do something that sounds pretty simple:
The string would be:
http://www.something.com/section/filter/colour/red-#998682/size/small/
What I would like to do is a preg_replace to remove the -#?????? so the URL looks like:
http://www.something.com/section/filter/colour/red/size/small/
So i tried:
$string = $theURL;
$pattern = '/-\#(.*)\//i';
$replacement = '/';
$newURL = preg_replace($pattern, $replacement, $string);
That sort of works, but it doesn't stop. If I have anything after the -#??????, it removes that as well. But I thought having the / on the end would stop it doing that?
Hoping someone can help, and thanks for reading.
PCRE is greedy by default, meaning that .* will match as big a chunk as possible. Make it ungreedy by adding the U flag (for the entire pattern) or use .*? (for just that wildcard part):
/-\#(.*)\//iU
or
/-\#(.*?)\//i
You need to use a non-greedy quantifier.
$pattern = '/-\#(.*?)\//i';
Your regex is greedy, which means that (.*)\/ looks for the last slash, not the first one.
The (.*) pattern is greedy, which means it'll match as many characters as possible. To match everything up to the first slash, use (.*?):
$pattern = '/-\#(.*?)\//i';
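To see the difference on the URL from the question, here is a short sketch. The third variant, a negated character class, is an alternative that is not taken from the answers above; it avoids the greedy/lazy question entirely.

```php
<?php
$theURL = 'http://www.something.com/section/filter/colour/red-#998682/size/small/';

// Greedy: (.*) runs to the LAST slash, so everything after the colour is lost.
$greedy = preg_replace('/-\#(.*)\//i', '/', $theURL);

// Lazy: (.*?) stops at the FIRST slash after "-#".
$lazy = preg_replace('/-\#(.*?)\//i', '/', $theURL);

// Alternative sketch: remove "-#" plus everything up to (not including) the next slash.
$negated = preg_replace('~-#[^/]*~', '', $theURL);

echo $greedy, PHP_EOL;  // http://www.something.com/section/filter/colour/red/
echo $lazy, PHP_EOL;    // http://www.something.com/section/filter/colour/red/size/small/
echo $negated, PHP_EOL; // http://www.something.com/section/filter/colour/red/size/small/
```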