preg_replace_callback pattern issue - php

I'm using the following pattern to capture links, and turn them into HTML friendly links. I use the following pattern in a preg_replace_callback and for the most part it works.
"#(https?|ftp)://(\S+[^\s.,>)\];'\"!?])#"
But this pattern fails when the text reads like so:
http://mylink.com/page[/b]
At that point it captures the [/b, assuming it is part of the link, resulting in this:
woodmill.co.uk[/b]
I've looked over the pattern, and used some cheat sheets to try and follow what is happening, but it has foxed me. Can any of you code ninjas help?

Try adding the open square bracket to your character class:
(\S+[^\s.,>)[\];'\"!?])
            ^
UPDATE
Try this more effective URL regex:
^(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w \.-]*)*/?$
(From: http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/)
I have no experience directly with PHP regular expressions, but the above is simple and generic enough that I wouldn't expect any problems. You may want to modify it some to extract just the domain, like you seem to be with your current regex.

OK, I solved the problem. Thanks to @Cyborgx37 and @MikeBrant for your help. Here's the solution.
Firstly I replaced my regexp pattern with the one that João Castro used in this question: Making a url regex global
The problem with that pattern is that it captured any trailing dots, so in the final section of the pattern I added ^. making the final part look like so: [^\s^.]. As I read it: do not match a trailing space or dot. (Inside a character class only the leading ^ negates; the second one merely excludes a literal caret as well, so [^\s.] would be enough.)
This still caused an issue matching bbcode as I mentioned above, so I used preg_replace_callback() and create_function() to filter it out. The final create_function() looks like this:
create_function('$match','
    $match[0] = preg_replace("/\[\/?(.*?)\]/", "", $match[0]);
    $match[0] = preg_replace("/\<\/?(.*?)\>/", "", $match[0]);
    $m = trim(strtolower($match[0]));
    $m = str_replace("http://", "", $m);
    $m = str_replace("https://", "", $m);
    $m = str_replace("ftp://", "", $m);
    $m = str_replace("www.", "", $m);
    if (strlen($m) > 25)
    {
        $m = substr($m, 0, 25) . "...";
    }
    return "$m";
'), $string);
Tests so far are looking good, so I'm happy it is now solved.
Thanks again, and I hope this helps someone else :)
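For readers on current PHP versions: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the callback above has to become a closure. What follows is only a minimal sketch, not the poster's exact code. The helper name linkify(), the simplified URL pattern and the <a href> wrapper are all assumptions made for illustration. Instead of stripping BBCode after the match, the pattern here refuses to match square or angle brackets in the first place, so [/b] ends the URL.

```php
<?php
// Hypothetical helper: wraps plain-text URLs in anchors and shortens the label.
function linkify(string $text): string
{
    // Assumption: URLs never contain whitespace, square/angle brackets or quotes,
    // so BBCode such as [/b] terminates the match instead of being captured.
    $pattern = '~(?:https?|ftp)://[^\s\[\]<>"\']+~i';

    return preg_replace_callback($pattern, function (array $match): string {
        // Split off trailing punctuation so a sentence-ending dot stays outside the link.
        $url  = rtrim($match[0], '.,;:!?)');
        $tail = substr($match[0], strlen($url));

        // Build the short label the same way the original callback does.
        $label = strtolower($url);
        $label = str_replace(['http://', 'https://', 'ftp://', 'www.'], '', $label);
        if (strlen($label) > 25) {
            $label = substr($label, 0, 25) . '...';
        }

        // Escape both values before placing them in HTML.
        $href = htmlspecialchars($url, ENT_QUOTES);
        return '<a href="' . $href . '">' . htmlspecialchars($label, ENT_QUOTES) . '</a>' . $tail;
    }, $text);
}

echo linkify('[b]http://mylink.com/page[/b]');
```

With this input the BBCode tags are left untouched around the generated anchor, and the label is the URL without its scheme.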

Related

php - Remove All Strings Starting From The First Specific Character

I have this kind of string:
DURATION : 00:23:55.060000000
I want to convert it to this:
00:23:55.060000000
Please note that after DURATION, it has many spaces.
EDIT:
It seems that I made you upset, guys. :D
I tried this and it is not working:
preg_replace('/^Duration,\s+/', '', $result[20])
How to do it with php?
Your regex is messed up. You are looking for something in uppercase and your regex is in mixed case. And there is a stray comma lying around.
So if you rewrite that like:
preg_replace('/^Duration\s+: /i', '', $result[20])
(the i modifier after the regular expression makes it case-insensitive)
or:
preg_replace('/^DURATION\s+: /', '', $result[20])
It'll work.
But mostly, it seems that you want to catch the timestamp, and disregard the rest. For me, the code would be much clearer if your regex reflected that.
E.g.:
if (preg_match("|(?<timestamp>\d\d:\d\d\:\d\d\.\d{9})|", $string, $matches)) {
    echo $matches['timestamp'];
}
Solution :
$duration = substr($duration, strpos($duration, (":")) + 2);
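The strpos() one-liner above assumes exactly one space after the colon and that a colon is always present. A slightly more defensive sketch of the same idea follows; the function name stripLabel() is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: returns everything after the first colon, trimmed on the left.
function stripLabel(string $line): string
{
    $pos = strpos($line, ':');
    if ($pos === false) {
        return $line; // no label present; return the input unchanged
    }
    // ltrim() copes with any amount of whitespace after the colon.
    return ltrim(substr($line, $pos + 1));
}

echo stripLabel('DURATION        : 00:23:55.060000000');
```

Because only the first colon is used as the split point, the colons inside the timestamp itself are preserved.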
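I hope this can be useful for others who need it:
preg_replace('/^[^:]*:\s*/', '', $result[20]);
Code explanation:
Starting at the beginning of the string, strip everything up to and including the first colon, then any whitespace that follows it.
The colons inside the timestamp are untouched because the pattern is anchored with ^ and only consumes up to the first colon. (An alternation such as /duration|^(.*?):|\s/i leaves the first colon behind when the string starts with DURATION, because the ^ branch can no longer match once the label has been consumed.)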

Converting links occurring inside a string

I am attempting to change a string occurrence, e.g. http://www.bbc.co.uk/, so that it appears inside an HTML link, e.g. http://www.bbc.co.uk.
However, for some reason my regex conversion does not work. Can someone please point me in the right direction?
$text = "I love this website http://www.bbc.co.uk/";
$x = preg_replace("#[a-z]+://[^<>\s]+[[a-z0-9]/]#i", "\\0", $text);
var_dump($x);
This outputs I love this website http://www.bbc.co.uk/ (no HTML link).
Your weird character class is at fault:
[[a-z0-9]/]
Double square brackets are for POSIX character classes like [[:digit:]].
You meant to write just:
[a-z0-9/]
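To illustrate the corrected character class, here is a small sketch using the question's input. The <a href> replacement string is an assumption about the intended output, since the question only says the URL should end up inside an HTML link.

```php
<?php
$text = "I love this website http://www.bbc.co.uk/";

// Corrected class [a-z0-9/]: the URL must end in a letter, digit or slash.
// The anchor markup in the replacement is an assumed target format.
$x = preg_replace('#[a-z]+://[^<>\s]+[a-z0-9/]#i', '<a href="$0">$0</a>', $text);

echo $x;
```

The original class [[a-z0-9]/] is read as the class [[a-z0-9] followed by a literal /], which is why it never matched this text.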
It is because your regex is not giving you a match: as written, the final part of the pattern requires the URL to be followed by a literal /]. Try something like this:
$pattern = '#https?://.*\b#i';
$replace = '$0';
$x = preg_replace($pattern, $replace, $text);
Note that I am not actually trying to validate the URL format here, so I just accept anything like http(s):// up to the last word boundary on the line (the .* is greedy). It didn't seem as if you were going for a true URL validation regex anyway (i.e. validating that there is at least one ., that the TLD component has 2-6 characters, etc.), so I figured I would give you the simplest pattern that would match.
Use this:
$x = preg_replace('#http://[?=&a-z0-9._/-]+#i', '<a target="_blank" href="$0">$0</a>', $text);

Make me understand preg_replace

I've been looking all over the internet for some useful information and I think I found too much. I'm trying to understand regular expressions but don't get it.
Let's say, for instance, $data="A bunch of text [link=123] another bunch of text.", and it should get replaced with "< a href=\"123.html\">123< /a>".
I've been trying around a lot with code similar to this:
$find = "/[link=[([0-9])]/";
$replace = "< a href=\"$1\">$1< /a>";
echo preg_replace ($find, $replace, $data);
but the output is always the same as the original $data.
I think I have to see something relevant to my problem to understand the basics.
Remove the extra [] around the (), and add + after the [0-9] to quantify it. Also, escape the [] that make up the tag itself.
$find = "/\[link=(\d+)\]/"; // "\d" is equivalent to "[0-9]"
$replace = "$1";
echo preg_replace($find,$replace,$data);
The regex would be \[link=([\d]+)\]
A good source for a quick overview of regular expressions can be found here: http://www.regular-expressions.info/
If you are really interested in the power of regular expressions, you should buy this book: Mastering Regular Expressions
A good program for testing your regex on a Windows client is: RegEx-Trainer
You are missing the + quantifier, so your pattern matches only if a single digit follows link=.
There is also an extra pair of [..], which causes the outer [...] to be treated as a character class.
You also forgot to escape the opening [ and the closing ].
Solution:
$find = "/\[link=([0-9]+)\]/";
<?php
$data= "A bunch of text [link=123] another bunch of text.";
$find = '/\[link=([0-9]+?)\]/';
echo preg_replace($find, "$1", $data);
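Putting the answers together, a complete sketch might look like this. The replacement markup follows the target the question states (123.html as the href, 123 as the link text); the variable name $html is introduced here only for illustration.

```php
<?php
$data = "A bunch of text [link=123] another bunch of text.";

// \[ and \] match the literal brackets; (\d+) captures one or more digits as $1.
$find    = '/\[link=(\d+)\]/';
$replace = '<a href="$1.html">$1</a>';

$html = preg_replace($find, $replace, $data);
echo $html;
```

Single quotes are used for both strings so that PHP does not try to interpret the backslashes or the $1 reference itself.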

Making a url regex global

I've been searching for a regex to replace plain text url's in a string (the string can contain more than 1 url), by:
url
and I found this:
http://mathiasbynens.be/demo/url-regex
I would like to use the diegoperini's regex (which according to the tests is the best):
_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS
But I want to make it global to replace all the URLs in a string.
When I use this:
/_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS/g
It does not work. How do I make this regex global, and what do the underscore at the beginning and the "_iuS" at the end mean?
I would like to use it with php so I am using:
preg_replace($regex, '$0', $examplestring);
The underscores are the regex delimiters, the i, u and S are pattern modifiers :
i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower
case letters.
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible
with Perl. Pattern and subject strings are treated as UTF-8.
S
When a pattern is going to be used several times, it is worth spending more
time analyzing it in order to speed up the time taken for matching. If this
modifier is set, then this extra analysis is performed. At present, studying
a pattern is useful only for non-anchored patterns that do not have a single
fixed starting character.
For more information see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
When you added the / ... /g, you added another regex delimiter plus the modifier g, which does not exist in PCRE; that's why it did not work. preg_replace() already replaces every match by default.
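A short sketch of the delimiter and modifier points, using a deliberately simplified pattern (an assumption for readability, not the long regex from the question). It shows that preg_replace() is "global" without any g flag, and that the optional fourth argument limits the number of replacements.

```php
<?php
$text = 'See http://a.example/x and HTTP://b.example/y';

// Underscores are the delimiters; i = caseless, u = treat pattern and subject as UTF-8.
$pattern = '_https?://\S+_iu';

// No "g" modifier: every match is replaced by default.
$all = preg_replace($pattern, '<$0>', $text);

// The fourth argument caps the number of replacements.
$first = preg_replace($pattern, '<$0>', $text, 1);

// preg_match_all() returns how many matches were found.
$count = preg_match_all($pattern, $text, $matches);

echo $all, "\n", $first, "\n", $count, "\n";
```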
I agree with @verdesmarald and used this pattern in the following function:
$string = preg_replace_callback(
    "_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS",
    create_function('$match','
        $m = trim(strtolower($match[0]));
        $m = str_replace("http://", "", $m);
        $m = str_replace("https://", "", $m);
        $m = str_replace("ftp://", "", $m);
        $m = str_replace("www.", "", $m);
        if (strlen($m) > 25)
        {
            $m = substr($m, 0, 25) . "...";
        }
        return "$m";
    '), $string);
return $string;
It seems to do the trick, and resolves an issue I was having. As @verdesmarald said, removing the ^ and $ characters allowed the pattern to work even in my preg_replace_callback().
The only thing that concerns me is how efficient the pattern is. If used in a busy/high-traffic web app, could it cause a bottleneck?
UPDATE
The above regex pattern breaks if there is a trailing dot at the end of the path section of a URL, as in "http://www.mydomain.com/page." at the end of a sentence. To solve this I modified the final part of the regex pattern by adding ^. making the final part look like so: [^\s^.]. As I read it: do not match a trailing space or dot.
In my tests so far it seems to be working fine.
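One hedged alternative for the trailing-dot problem is a negative lookbehind at the end of the pattern, which lets dots appear inside the path (page.html) but stops the match from ending on punctuation. The pattern below is a simplified stand-in chosen for illustration, not the long regex above.

```php
<?php
$text = 'Read http://www.mydomain.com/page.html. Then continue.';

// Simplified stand-in pattern (an assumption, not the long regex above).
// (?<![.,;:!?]) forces the match to end on a non-punctuation character,
// so \S+ backtracks off the sentence-ending dot but keeps the dot in "page.html".
$pattern = '~(?:https?|ftp)://\S+(?<![.,;:!?])~i';

preg_match($pattern, $text, $m);
echo $m[0];
```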

preg_replace with URL problem

I use a preg_replace to auto insert HTML links within paragraphs.
Here's what I currently use:
$pattern = "~(?!(?:[^<\[]+[>\]]|[^>\]]+<\/a>))(".preg_quote($find_keyword, '/').")\b~msUi";
$replacement = "\$0";
$article_content = preg_replace($pattern, $replacement, stripslashes($article_content), 1, $added );
It works great, except 1 problem:
It doesn't match and replace if the keyword is a URL.
If: $find_keyword="http://www.mysite.com/" it won't come up with any matches even though it's in the content.
I already tried escaping $find_keyword with preg_quote, which didn't make any difference.
Any regex experts know a solution? Thanks.
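The second argument to preg_quote() should be the delimiter your pattern actually uses. Your pattern is delimited by ~, so forward slashes are harmless there, but pass '~' so that any ~ in the keyword gets escaped:
$find_keyword = preg_quote("http://www.mysite.com/", '~');
Also check the trailing \b: it only matches between a word character and a non-word character, so a keyword that ends in / and is followed by a space or the end of the string will never match.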
http://www.php.net/manual/en/function.preg-quote.php
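A small sketch of quoting a URL keyword for a ~-delimited pattern. It also shows why a trailing \b can prevent a match when the keyword ends in a slash: \b needs a word character on one side, and neither "/" nor the following space is one. The lookahead (?=\s|$) used as a replacement is an assumption about what "end of keyword" should mean, and the sample content is invented for the demonstration.

```php
<?php
$find_keyword = 'http://www.mysite.com/';
$content      = 'Visit http://www.mysite.com/ today.';

// Quote the keyword for a pattern delimited by ~.
$quoted = preg_quote($find_keyword, '~');

// With \b: no word boundary exists between "/" and the following space.
$withBoundary = preg_match('~' . $quoted . '\b~i', $content);

// With a lookahead for whitespace or end of string instead of \b.
$withLookahead = preg_match('~' . $quoted . '(?=\s|$)~i', $content);

var_dump($withBoundary, $withLookahead);
```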
