PHP Find a URL in any form within a string - php

I'm looking for some form of input security on a project I working on.
Basically I wish to flag text if the user has inputted any form of a URL.
IE 'For more of my pic visit myhotpic.net'
Hence it would detect a url and then I can flag the string for validation via staff.
So I would need to check for any form of a URL.
There is a similar question here
Finding urls from text string via php and regex?
with an answer. But I have tired this with various strings and I do not get the expected results.
For example
$pattern = '#(www\.|https?:\/\/){?}[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';
$count = preg_match_all($pattern, 'http://www.Imaurl.com', $matches, PREG_PATTERN_ORDER);
returns matches as
array(3) {
[0]=>
array(0) {
}
[1]=>
array(0) {
}
[2]=>
array(0) {
}
}
and no error is return via preg_last_error()
Why is this not working? Is there an error in the Regex? I would assume it to be fine as other users have had success with it.
I cannot seem to find a suitable answer for my problem anywhere else.

In the regex, change {?} to just ?. Then it will work. No idea what {?} is supposed to mean (I've never seen anything like that).
Your regex will work fine for some URLs, but you should be aware that URLs can be much more complicated than you might assume, and a regex that can match every URL is VERY complex. You might want to look up a better regex—you only need one complicated enough to handle the sorts of URLs you're expecting to match.

Just to add a little work on this specific question;
I took the original Regex as given by the OP and carried out some tweaks to it:
This is NOT perfect but does improve upon the original.
Added a netagive lookahead to avoid domains beginning with # (such as email addresses)
removed the incorrect {?}
Made the http or www a requirement rather than optional.
added _ and - characters to accepted URL character set ( I know this concept overall can be greatly expanded upon ).
so;
#(?<!#)(www\.|https?:\/\/)[a-z0-9-_]{2,254}\.[a-z0-9]{2,4}(\S*)#gi
Example:
check out my facebook www.prop-ERty-bg.ru/11be check out my facebook
www.property-bg.ru/11be horsae#microsoft.com
catches both www.property-bg.ru/11b but avoids the email address. See it in action.

Related

Simple web scraping in PHP

To make it clear from a beginning, I have total consent to do this by the website administrator until they build an API.
What I want to do is get, say, a number or any piece of data found in a specific part of the site, althought it's place in line can change.
An example of what I wish to do, if I were to store the html in a variable through file_get_contents, and wanted to find somewhere in the source where it says "<p>User status: Online.</p>"; I would need to store the text between "status: " and ".</p>" in a variable, only knowing these two strings to find it, but knowing as well that there's only one possible scenario where those two texts are in the same line
EDIT: I seem to have forgotten the most important part of this. Well, the question is how to do what I just described, if you have a lot of text, how can I find what's between one piece of text and another piece of text, and store it in a variable?
There are a couple ways to scrape websites, one would be to use CSS Selectors and another would be to use XPath, which both select elements from the DOM.
Since I can't see the full HTML of the webpage it would be hard for me to determine which method is better for you. There is another option which may be frowned upon, but in this case it might work.
You could use a Regex (regular expressions) to find the characters, I'm not the best at regular expressions but here is some sample code of how that might work:
<?php
$subject = "<html><body><p>Some User</p><p>User status: Online.</p></body></html>";
$pattern = '/User status: (.*)\<\/p\>/';
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
Sample output:
Array
(
[0] => User status: Online.</p>
[1] => Online.
)
Basically what the regex above is doing is matching a pattern, in this case it looks for the string "User status: " then matches all the characters (.*) up to the ending paragraph tag (escaped).
Here is the pattern that will return just "Online" without the period, wasn't sure if all statuses ended in a period but here is what it would look like:
'/User status: (.*)\.\<\/p\>/'

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[]#!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?) so if you want better validation you will need more cleverness
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
You could use this one as presented on the:
http://regex101.com/r/zV1uI7
On the bottom of the site you got it explained step by step.

Parsing link from javascript function

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
/http://.*\.mp4/ will give you all characters between http:// and .mp4, inclusive.
See it in action.
If you need the session id, use something like /http://.*\.mp4?sessionid=\d+/
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
The following captures any url in your html
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
if you want to do the same in javascript you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}

PHP / RegEx - Convert URLs to links by detecting .com/.net/.org/.edu etc

I know there have been many questions asking for help converting URLs to clickable links in strings, but I haven't found quite what I'm looking for.
I want to be able to match any of the following examples and turn them into clickable links:
http://www.domain.com
https://www.domain.net
http://subdomain.domain.org
www.domain.com/folder
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
domain.net
domain.com/folder
I do not want to match random.stuff.separated.with.periods.
EDIT: Please keep in mind that these URLs need to be found within larger strings of 'normal' text. For example, I want to match 'domain.net' in "Hello! Come check out domain.net!".
I think this could be accomplished with a regex that can determine whether the matching url contains .com, .net, .org, or .edu followed by either a forward slash or whitespace. Other than a user typo, I can't imagine any other case in which a valid URL would have one of those followed by anything else.
I realize there are many valid domain extensions out there, but I don't need to support them all. I can just choose which to support with something like (com|net|org|edu) in the regex. Unfortunately, I'm not skilled enough with regex yet to know how to properly implement this.
I'm hoping someone can help me find a regular expression (for use with PHP's preg_replace) that can match URLs based on just about any text connected by one or more dots and either ending with one of the specified extensions followed by whitespace OR containing one of the specified extensions followed by a slash and possibly folders.
I did several searches and so far have not found what I'm looking for. If there already exists a SO post that answers this, I apologize.
Thanks in advance.
--- EDIT 3 ---
After days of trial and error and some help from SO, here's what works:
preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.(\w+|-)*)+(?<=\.net|org|edu|com|cc|br|jp|dk|gs|de)(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2]))
return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$event_desc);
This is a modified version of anubhava's code below and so far seems to do exactly what I want. Thanks!
You can use this regex:
#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is
Code:
$arr = array(
'http://www.domain.com/?foo=bar',
'http://www.that"sallfolks.com',
'This is really cool site: https://www.domain.net/ isn\'t it?',
'http://subdomain.domain.org',
'www.domain.com/folder',
'Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself',
'subdomain.domain.net',
'subdomain.domain.edu/folder/subfolder',
'Hello! Check out my site at domain.net!',
'welcome.to.computers',
'Hello.Come visit oursite.com!',
'foo.bar',
'domain.com/folder',
);
foreach($arr as $url) {
$link = preg_replace_callback('#(\s|^)((?:https?://)?\w+(?:\.\w+)+(?<=\.(net|org|edu|com))(?:/[^\s]*|))(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2]))
return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$url);
echo $link . "\n";
OUTPUT:
http://www.domain.com/?foo=bar
http://www.that"sallfolks.com
This is really cool site: https://www.domain.net/ isn't it?
http://subdomain.domain.org
www.domain.com/folder
Hello! You can visit vertigofx.com/mysite/rocks for some awesome pictures, or just go to vertigofx.com by itself
subdomain.domain.net
subdomain.domain.edu/folder/subfolder
Hello! Check out my site at domain.net!
welcome.to.computers
Hello.Come visit oursite.com!
foo.bar
domain.com/folder
PS: This regex only supports http and https scheme in URL. So eg: if you want to support ftp also then you need to modify the regex a little.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/]*/'
That works for your examples. You might want to add extra characters support for "-", "&", "?", ":", etc in the last bracket.
'/(http(s)?:\/\/)?[\w\/\.]+(\.((com)|(edu)|(net)|(org)))[\w\/\?=&-;]*/'
This will support parameters and port numbers.
eg.: www.foo.ca:8888/test?param1=val1&param2=val2
Thanks a ton. I modified his final solution to allow all domains (.ca, .co.uk), not just the specified ones.
$html = preg_replace_callback('#(\s|^)((https?://)?(\w|-)+(\.[a-z]{2,3})+(\:[0-9]+)?(?:/[^\s]*)?)(?=\s|\b)#is',
create_function('$m', 'if (!preg_match("#^(https?://)#", $m[2])) return $m[1]."".$m[2].""; else return $m[1]."".$m[2]."";'),
$url);

URL Validation Regex

As far as i know there are many other questions similar to title, but my main reason for asking this question is i want my validation as perfect as i want. Here is my explanation which URL should valid
http:// (if given then match otherwise skip),
domain.com (should match & return validate)
subdomain.domain.com (should match & return validate)
www.com (should return false)
http://www.com (should return false)
I searched a lot about perfect regex pattern according to my need but didn't succeed so thats why i made my self and posting here to want to know that anyother Valid URL would it skip or not except http://localhost.
If yes then please correct me.
Pattern:
((?:http|https|ftp)://)?(?:www.)?((?!www)[A-Z0-9][A-Z0-9_-]*(?:.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?/?
I know this actually doesn't answer your question directly, but REGEXes aside, you can also use filter_var(), with the flag FILTER_VALIDATE_URL, which returns the URL in case of valid url, or FALSE otherwise:
var_dump(filter_var('http://example.com', FILTER_VALIDATE_URL));
// string(18) http://example.com
You can read here the filters used by this function, especially the last row regarding flags used by the VALIDATE_URL filter.
I actually don't know how it's implemented internally, but I suppose it works better than many regexes you can find outside in the wild internet.

Categories