This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Replace URLs in text with HTML links
I'm writing some code to convert plain-text links to HyperText links in PHP. I'm using a regular expression (of which I know almost nothing) and I'm coming up against some problems:
I don't know how to conditionally include/exclude an item in the expression
I don't know how to include URL parameters in my expression
I don't know much about RegEx!
This is the code I've written:
function text2link($text)
{
$text = preg_replace
(
"/(ftp\.|www\.|http:\/\/|https:\/\/|)(.*)(\.)(com|net|co\.uk)/i",
"<a href='http://$2$3$4' target='_blank'>$1$2$3$4</a>",
$text
);
return $text;
}
This code will convert the following "styles" of hyperlinks:
http://www.google.com
www.google.com
google.com
etc., however, the links are forced to HTTP. No matter what you type in, the expression forces HTTP. Is is possible to conditionally include the necessary protocol? For example:
"<a href='[SOME IF-CONDITION GOES HERE TO DECIDE WHAT PROTOCOL TO USE]$2$3$4' target='_blank'>$1$2$3$4</a>",
This condition would be something along the lines of:
if $1 == "http, else if $1 == "https", else if $1 == "ftp",
etc., etc.
Also I realise I'm not accounting for anywhere near enough extensions, this is just test code right now.
Edit: I should note that I'm not looking for "fool proof", unbreakable solutions requiring 200 lines. I don't mind if some obscure domains don't get picked up. I just want the most common domains and URLs to get detected and converted to clickable links.
You don't need to tweak your regex:
function text2link($text, $protocol="http://")
{
$text = preg_replace
(
"/(ftp\.|www\.|http:\/\/|https:\/\/|)(.*)(\.)(com|net|co\.uk)/i",
"<a href='$protocol$2$3$4' target='_blank'>$1$2$3$4</a>",
$text
);
return $text;
}
You can pass in an optional protocol as a second parameter, and it'll default to http:// if you don't.
Related
I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
/http://.*\.mp4/ will give you all characters between http:// and .mp4, inclusive.
See it in action.
If you need the session id, use something like /http://.*\.mp4?sessionid=\d+/
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
The following captures any url in your html
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
if you want to do the same in javascript you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}
This question already has answers here:
How can I convert ereg expressions to preg in PHP?
(4 answers)
Closed 3 years ago.
My hosts recently updated their PHP to 5.3 (without warning) and I now have to replace the code for my page navigation on my index page. This is what I'm currently using:
<? if (eregi(".shtml", $load)) {if (!#readfile("$load")) { readfile("error.shtml"); } } if (!eregi(".shtml", $load)) {if (!#readfile(include("/home/content/j/p/l/jplegacy/html/coranto/news.txt") )) ;}?>
This is what I use as a sample in my links to navigate with:
Archive
About Us
I looked at two different techniques, preg_match() and stristr().
preg_match() outputs my error page and news.txt file from Coranto, but doesn't navigate me to the pages like in the links above. What can I do here to make that work?
stristr() doesn't give me the warnings, but it doesn't navigate the page and only outputs my Coranto news page. If this would be better, what can I do to make this work?
What do I need to do to fix this? I am completely lost. :(
ereg(i) is considered to be deprecated. You should use preg_match for regular expressions as the preg is faster than the other engine. Also, if you want ends with ".shtml" that regex is wrong, and you don't need to resort to regular expressions. Ends with .shtml is "^*.shtml$". If you do choose to use preg_match, all preg_match regular expresions, like perl, are surrounded by "/" (slash), so the correct regex would look like:
preg_match("/^.*\.shtml$/", $load)
But on to the better solution for "ends with .shtml". Generally you shouldn't use regular expressions unless the situation actually calls for them (they're significantly slower than using substr and matching the last part of the string), simple matches like that don't qualify.
if (substr($test, sizeof($load) - 7, 6 )) == '.shtml') { ... }
I know there have been many questions asking for help converting URLs to clickable links in strings, but I haven't found quite what I'm looking for.
I want to be able to match any of the following examples and turn them into clickable links:
http://www.domain.com
https://www.domain.net
http://subdomain.domain.org
www.domain.com/folder
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
domain.net
domain.com/folder
I do not want to match random.stuff.separated.with.periods.
EDIT: Please keep in mind that these URLs need to be found within larger strings of 'normal' text. For example, I want to match 'domain.net' in "Hello! Come check out domain.net!".
I think this could be accomplished with a regex that can determine whether the matching url contains .com, .net, .org, or .edu followed by either a forward slash or whitespace. Other than a user typo, I can't imagine any other case in which a valid URL would have one of those followed by anything else.
I realize there are many valid domain extensions out there, but I don't need to support them all. I can just choose which to support with something like (com|net|org|edu) in the regex. Unfortunately, I'm not skilled enough with regex yet to know how to properly implement this.
I'm hoping someone can help me find a regular expression (for use with PHP's preg_replace) that can match URLs based on just about any text connected by one or more dots and either ending with one of the specified extensions followed by whitespace OR containing one of the specified extensions followed by a slash and possibly folders.
I did several searches and so far have not found what I'm looking for. If there already exists a SO post that answers this, I apologize.
Thanks in advance.
--- EDIT 3 ---
After days of trial and error and some help from SO, here's what works:
preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.(\w+|-)*)+(?<=\.net|org|edu|com|cc|br|jp|dk|gs|de)(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2]))
return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$event_desc);
This is a modified version of anubhava's code below and so far seems to do exactly what I want. Thanks!
You can use this regex:
#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is
Code:
$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',
);
foreach($arr as $url) {
$link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2]))
return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$url);
echo $link . "\n";
OUTPUT:
http://www.domain.com/?foo=bar
http://www.that"sallfolks.com
This is really cool site: https://www.domain.net/ isn't it?
http://subdomain.domain.org
www.domain.com/folder
Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
Hello! Check out my site at domain.net!
welcome.to.computers
Hello.Come visit oursite.com!
foo.bar
domain.com/folder
PS: This regex only supports http and https scheme in URL. So eg: if you want to support ftp also then you need to modify the regex a little.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/]*/'
That works for your examples. You might want to add extra characters support for "-", "&", "?", ":", etc in the last bracket.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/'
This will support parameters and port numbers.
eg.: www.foo.ca:8888/test?param1=val1¶m2=val2
Thanks a ton. I modified his final solution to allow all domains (.ca, .co.uk), not just the specified ones.
$html = preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.[a-z]{2,3})+(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2])) return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$url);
This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Identifying if a URL is present in a string
Php parse links/emails
I'm working on some PHP code which takes input from various sources and needs to find the URLs and save them somewhere. The kind of input that needs to be handled is as follows:
http://www.youtube.com/watch?v=IY2j_GPIqRA
Try google: http://google.com! (note exclamation mark is not part of the URL)
Is http://somesite.com/ down for anyone else?
Output:
http://www.youtube.com/watch?v=IY2j_GPIqRA
http://google.com
http://somesite.com/
I've already borrowed one regular expression from the internet which works, but unfortunately wipes the query string out - not good!
Any help putting together a regular expression, or perhaps another solution to this problem, would be appreciated.
Jan Goyvaerts, Regex Guru, has addressed this issue in his blog. There are quite a few caveats, for example extracting URLs inside parentheses correctly. What you need exactly depends on the "quality" of your input data.
For the examples you provided, \b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$] works when used in case-insensitive mode.
So to find all matches in a multiline string, use
preg_match_all('/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&##\/%=~_|$?!:,.]*[A-Z0-9+&##\/%=~_|$]/i', $subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
Why not try this one. It is the first result of Googling "URL regular expression".
((https?|ftp|gopher|telnet|file|notes|ms-help):((\/\/)|(\\\\))+[\w\d:##%\/;$()~_?\+-=\\\.&]*)
Not PHP, but it should work, I just slightly modified it by escaping forward slashes.
source
I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';