How to create a custom youtube url form validator - php

I want to create a custom form validator to check if my user is sending a youtube url.
I've already created my lib/validator/youtubeValidator.class.php
Then I use it in my MyForm.class.php : new YoutubeValidator(........)
Here is the code :
class YoutubeValidator extends sfValidatorUrl
{
protected function configure($options = array(), $messages = array())
{
$this->addMessage('invalid', 'Veuillez entrer un lien Youtube');
}
protected function doClean($url)
{
$pattern =
'%^# Match any youtube URL
(?:https?://)? # Optional scheme. Either http or https
(?:www\.)? # Optional www subdomain
(?: # Group host alternatives
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
(?: # Group path alternatives
/embed/ # Either /embed/
| /v/ # or /v/
| /watch\?v= # or /watch\?v=
) # End path alternatives.
) # End host alternatives.
([\w-]{10,12}) # Allow 10-12 for 11 char youtube id.
$%x'
;
$result = preg_match($pattern, $url, $matches);
if (false !== $result)
{
return $matches[1];
}
return false;
if (false !== $result)
{
throw new sfValidatorError($this, 'invalid', array('value' => $value));
}
else
{
return true;
}
}
}
But it does not work at all.
It would also be great if my validator could check whether the YouTube video actually exists.

You probably need to change the last lines to something like this:
$result = preg_match($pattern, $url, $matches);
if (!$result) // preg_match() returns 0 when nothing matches and false on error
{
throw new sfValidatorError($this, 'invalid', array('value' => $url));
}
return $url;
This only checks whether the URL submitted by the user matches your regular expression, i.e. whether it looks like a YouTube URL. If it does not, the validator throws an sfValidatorError.
UPDATE
-- deleted--
UPDATE 2
class YoutubeValidator extends sfValidatorUrl
{
protected function configure($options = array(), $messages = array())
{
parent::configure($options, $messages);
$this->setMessage('invalid', 'Veuillez entrer un lien Youtube');
}
protected function doClean($value)
{
$pattern = "/(http(s)?:\/\/)?(?:youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=)([^#\&\?]*).*/";
preg_match($pattern, $value, $matches);
if (empty($matches[3]))
{
throw new sfValidatorError($this, 'invalid', array('value' => $value));
}
return $matches[3];
}
}
I've tested it and it seems to work (it returns the actual video ID when using $form->getValues()).
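For readers who want to try the matching logic outside Symfony, the core of doClean() can be sketched as a plain function. This is only an illustration: the function name youtube_id_or_null and the simplified pattern are assumptions, not part of the answer above. It relies on the fact that preg_match() returns 1 on a match, 0 on no match, and false only on error, so the result is compared against 1 rather than against false.

```php
<?php
// Hypothetical helper (not from the answer): returns the 11-character
// video ID, or null when the URL does not look like a YouTube link.
function youtube_id_or_null($url)
{
    // Simplified pattern, assumed for illustration only.
    $pattern = '~^(?:https?://)?(?:www\.)?'
             . '(?:youtu\.be/|youtube\.com/(?:embed/|v/|watch\?v=))'
             . '([\w-]{11})(?![\w-])~';

    // preg_match() yields 1 (match), 0 (no match) or false (error).
    if (preg_match($pattern, $url, $matches) !== 1) {
        return null;
    }
    return $matches[1];
}
```

Inside a validator, a null result would be the place to throw the sfValidatorError.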

Related

Why is this preg_match letting almost anything through? [duplicate]

I've been looking for a simple regex for URLs, does anybody have one handy that works well? I didn't find one with the zend framework validation classes and have seen several implementations.
Use the filter_var() function to validate whether a string is URL or not:
var_dump(filter_var('example.com', FILTER_VALIDATE_URL));
It is bad practice to use regular expressions when not necessary.
EDIT: Be careful, this solution is not unicode-safe and not XSS-safe. If you need a complex validation, maybe it's better to look somewhere else.
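A minimal sketch of how filter_var() might be combined with an explicit scheme check, so that only absolute http(s) URLs with a host pass. The function name is_http_url is made up for this example, and the sketch does not address the Unicode or XSS caveats mentioned in the edit above.

```php
<?php
// Hypothetical helper: accepts only absolute http(s) URLs that have a host.
function is_http_url($url)
{
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // parse_url() returns null for a missing component, hence the casts.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host   = (string) parse_url($url, PHP_URL_HOST);

    return in_array($scheme, array('http', 'https'), true) && $host !== '';
}
```

The scheme whitelist is the design choice here: filter_var() on its own also accepts schemes such as ftp.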
I used this on a few projects, I don't believe I've run into issues, but I'm sure it's not exhaustive:
$text = preg_replace(
'#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
"'$3$4'",
$text
);
Most of the random junk at the end is to deal with situations like http://domain.example. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it over from project to project.
As per the PHP manual - parse_url should not be used to validate a URL.
Unfortunately, it seems that filter_var('example.com', FILTER_VALIDATE_URL) does not perform any better.
Both parse_url() and filter_var() will pass malformed URLs such as http://...
Therefore in this case - regex is the better method.
As per John Gruber (Daring Fireball):
Regex:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))
Used in preg_match() (with ~ as the delimiter, because the pattern itself contains unescaped slashes):
preg_match("~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~", $url)
Here is the extended regex pattern (with comments):
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
For more details please look at:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Just in case you want to know if the url really exists:
function url_exist($url){//returns true if the URL exists
$c=curl_init();
curl_setopt($c,CURLOPT_URL,$url);
curl_setopt($c,CURLOPT_HEADER,1);//get the header
curl_setopt($c,CURLOPT_NOBODY,1);//and *only* get the header
curl_setopt($c,CURLOPT_RETURNTRANSFER,1);//get the response as a string from curl_exec(), rather than echoing it
curl_setopt($c,CURLOPT_FRESH_CONNECT,1);//don't use a cached version of the url
if(!curl_exec($c)){
//echo $url.' inexists';
return false;
}else{
//echo $url.' exists';
return true;
}
//$httpcode=curl_getinfo($c,CURLINFO_HTTP_CODE);
//return ($httpcode<400);
}
I don't think that using regular expressions is a smart thing to do in this case. It is impossible to match all of the possibilities and even if you did, there is still a chance that url simply doesn't exist.
Here is a very simple way to test if url actually exists and is readable :
if (preg_match("#^https?://.+#", $link) and @fopen($link,"r")) echo "OK";
(if there is no preg_match then this would also validate all filenames on your server)
I've used this one with good success - I don't remember where I got it from
$pattern = "/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i";
The best URL Regex that worked for me:
function valid_URL($url){
return preg_match('%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu', $url);
}
Examples:
valid_URL('https://twitter.com'); // true
valid_URL('http://twitter.com'); // true
valid_URL('http://twitter.co'); // true
valid_URL('http://t.co'); // true
valid_URL('http://twitter.c'); // false
valid_URL('htt://twitter.com'); // false
valid_URL('http://example.com/?a=1&b=2&c=3'); // true
valid_URL('http://127.0.0.1'); // true
valid_URL(''); // false
valid_URL(1); // false
Source: http://urlregex.com/
function validateURL($URL) {
$pattern_1 = "/^(http|https|ftp):\/\/(([A-Z0-9][A-Z0-9_-]*)(\.[A-Z0-9][A-Z0-9_-]*)+.(com|org|net|dk|at|us|tv|info|uk|co.uk|biz|se)$)(:(\d+))?\/?/i";
$pattern_2 = "/^(www)((\.[A-Z0-9][A-Z0-9_-]*)+.(com|org|net|dk|at|us|tv|info|uk|co.uk|biz|se)$)(:(\d+))?\/?/i";
if(preg_match($pattern_1, $URL) || preg_match($pattern_2, $URL)){
return true;
} else{
return false;
}
}
Edit:
As incidence pointed out, this code relies on eregi(), which has been deprecated since PHP 5.3.0 (2009-06-30) and was removed in PHP 7.0, so use it accordingly.
Just my two cents but I've developed this function and have been using it for a while with success. It's well documented and separated so you can easily change it.
// Checks if string is a URL
// @param string $url
// @return bool
function isURL($url = NULL) {
if($url==NULL) return false;
$protocol = '(http://|https://)';
$allowed = '([a-z0-9]([-a-z0-9]*[a-z0-9]+)?)';
$regex = "^". $protocol . // must include the protocol
'(' . $allowed . '{1,63}\.)+'. // 1 or several sub domains with a max of 63 chars
'[a-z]' . '{2,6}'; // followed by a TLD
if(eregi($regex, $url)==true) return true;
else return false;
}
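Since eregi() is no longer available in current PHP versions, the same building blocks can be sketched with preg_match(). This is a hedged port, not the author's code: the name is_url_preg is hypothetical, the {1,63} quantifier of the original is dropped because it did not actually limit the label length, and like the original the pattern is not anchored at the end, so anything may follow the TLD.

```php
<?php
// Hypothetical preg_match() port of isURL(); same building blocks.
function is_url_preg($url = null)
{
    if ($url === null || $url === '') {
        return false;
    }
    $protocol = '(?:http://|https://)';               // must include the protocol
    $allowed  = '[a-z0-9](?:[-a-z0-9]*[a-z0-9])?';    // one host label
    // "~" is the delimiter so the slashes need no escaping; the "i" flag
    // replaces the case-insensitivity that eregi() provided.
    $regex = '~^' . $protocol . '(?:' . $allowed . '\.)+[a-z]{2,6}~i';

    return preg_match($regex, $url) === 1;
}
```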
And there is your answer =) Try to break it, you can't!!!
function link_validate_url($text) {
$LINK_DOMAINS = 'aero|arpa|asia|biz|com|cat|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|mobi|local';
$LINK_ICHARS_DOMAIN = (string) html_entity_decode(implode("", array( // @TODO completing letters ...
"æ", // æ
"Æ", // Æ
"À", // À
"à", // à
"Á", // Á
"á", // á
"Â", // Â
"â", // â
"å", // å
"Å", // Å
"ä", // ä
"Ä", // Ä
"Ç", // Ç
"ç", // ç
"Ð", // Ð
"ð", // ð
"È", // È
"è", // è
"É", // É
"é", // é
"Ê", // Ê
"ê", // ê
"Ë", // Ë
"ë", // ë
"Î", // Î
"î", // î
"Ï", // Ï
"ï", // ï
"ø", // ø
"Ø", // Ø
"ö", // ö
"Ö", // Ö
"Ô", // Ô
"ô", // ô
"Õ", // Õ
"õ", // õ
"Œ", // Œ
"œ", // œ
"ü", // ü
"Ü", // Ü
"Ù", // Ù
"ù", // ù
"Û", // Û
"û", // û
"Ÿ", // Ÿ
"ÿ", // ÿ
"Ñ", // Ñ
"ñ", // ñ
"þ", // þ
"Þ", // Þ
"ý", // ý
"Ý", // Ý
"¿", // ¿
)), ENT_QUOTES, 'UTF-8');
$LINK_ICHARS = $LINK_ICHARS_DOMAIN . (string) html_entity_decode(implode("", array(
"ß", // ß
)), ENT_QUOTES, 'UTF-8');
$allowed_protocols = array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'mailto', 'irc', 'ssh', 'sftp', 'webcal');
// Starting a parenthesis group with (?: means that it is grouped, but is not captured
$protocol = '((?:'. implode("|", $allowed_protocols) .'):\/\/)';
$authentication = "(?:(?:(?:[\w\.\-\+!$&'\(\)*\+,;=" . $LINK_ICHARS . "]|%[0-9a-f]{2})+(?::(?:[\w". $LINK_ICHARS ."\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})*)?)?@)";
$domain = '(?:(?:[a-z0-9' . $LINK_ICHARS_DOMAIN . ']([a-z0-9'. $LINK_ICHARS_DOMAIN . '\-_\[\]])*)(\.(([a-z0-9' . $LINK_ICHARS_DOMAIN . '\-_\[\]])+\.)*('. $LINK_DOMAINS .'|[a-z]{2}))?)';
$ipv4 = '(?:[0-9]{1,3}(\.[0-9]{1,3}){3})';
$ipv6 = '(?:[0-9a-fA-F]{1,4}(\:[0-9a-fA-F]{1,4}){7})';
$port = '(?::([0-9]{1,5}))';
// Pattern specific to external links.
$external_pattern = '/^'. $protocol .'?'. $authentication .'?('. $domain .'|'. $ipv4 .'|'. $ipv6 .' |localhost)'. $port .'?';
// Pattern specific to internal links.
$internal_pattern = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]]+)";
$internal_pattern_file = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]\.]+)$/i";
$directories = "(?:\/[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'#!():;*#\[\]]*)*";
// Yes, four backslashes == a single backslash.
$query = "(?:\/?\?([?a-z0-9". $LINK_ICHARS ."+_|\-\.~\/\\\\%=&,$'():;*#\[\]{} ]*))";
$anchor = "(?:#[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'():;*#\[\]\/\?]*)";
// The rest of the path for a standard URL.
$end = $directories .'?'. $query .'?'. $anchor .'?'.'$/i';
$message_id = '[^@].*@'. $domain;
$newsgroup_name = '(?:[0-9a-z+-]*\.)*[0-9a-z+-]*';
$news_pattern = '/^news:('. $newsgroup_name .'|'. $message_id .')$/i';
$user = '[a-zA-Z0-9'. $LINK_ICHARS .'_\-\.\+\^!#\$%&*+\/\=\?\`\|\{\}~\'\[\]]+';
$email_pattern = '/^mailto:'. $user .'@'.'(?:'. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $query .'?$/';
if (strpos($text, '<front>') === 0) {
return false;
}
if (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
return false;
}
if (in_array('news', $allowed_protocols) && preg_match($news_pattern, $text)) {
return false;
}
if (preg_match($internal_pattern . $end, $text)) {
return false;
}
if (preg_match($external_pattern . $end, $text)) {
return false;
}
if (preg_match($internal_pattern_file, $text)) {
return false;
}
return true;
}
function is_valid_url ($url="") {
if ($url=="") {
$url=$this->url;
}
$url = @parse_url($url);
if ( ! $url) {
return false;
}
$url = array_map('trim', $url);
$url['port'] = (!isset($url['port'])) ? 80 : (int)$url['port'];
$path = (isset($url['path'])) ? $url['path'] : '';
if ($path == '') {
$path = '/';
}
$path .= ( isset ( $url['query'] ) ) ? "?$url[query]" : '';
if ( isset ( $url['host'] ) AND $url['host'] != gethostbyname ( $url['host'] ) ) {
if ( PHP_VERSION >= 5 ) {
$headers = get_headers("$url[scheme]://$url[host]:$url[port]$path");
}
else {
$fp = fsockopen($url['host'], $url['port'], $errno, $errstr, 30);
if ( ! $fp ) {
return false;
}
fputs($fp, "HEAD $path HTTP/1.1\r\nHost: $url[host]\r\n\r\n");
$headers = fread ( $fp, 128 );
fclose ( $fp );
}
$headers = ( is_array ( $headers ) ) ? implode ( "\n", $headers ) : $headers;
return ( bool ) preg_match ( '#^HTTP/.*\s+(200|301|302)\s#i', $headers );
}
return false;
}
Inspired by this .NET StackOverflow question and by an article referenced from that question, here is a URI validator (URI means it validates both URLs and URNs).
if( ! preg_match( "/^([a-z][a-z0-9+.-]*):(?:\\/\\/((?:(?=((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*))(\\3)@)?(?=(\\[[0-9A-F:.]{2,}\\]|(?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*))\\5(?::(?=(\\d*))\\6)?)(\\/(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\8)?|(\\/?(?!\\/)(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\10)?)(?:\\?(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\11)?(?:#(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\12)?$/i", $uri ) )
{
throw new \RuntimeException( "URI has not a valid format." );
}
I have successfully unit-tested this function inside a ValueObject I made named Uri and tested by UriTest.
UriTest.php (Contains valid and invalid cases for both URLs and URNs)
<?php
declare( strict_types = 1 );
namespace XaviMontero\ThrasherPortage\Tests\Tour;
use XaviMontero\ThrasherPortage\Tour\Uri;
class UriTest extends \PHPUnit_Framework_TestCase
{
private $sut;
public function testCreationIsOfProperClassWhenUriIsValid()
{
$sut = new Uri( 'http://example.com' );
$this->assertInstanceOf( 'XaviMontero\\ThrasherPortage\\Tour\\Uri', $sut );
}
/**
* @dataProvider urlIsValidProvider
* @dataProvider urnIsValidProvider
*/
public function testGetUriAsStringWhenUriIsValid( string $uri )
{
$sut = new Uri( $uri );
$actual = $sut->getUriAsString();
$this->assertInternalType( 'string', $actual );
$this->assertEquals( $uri, $actual );
}
public function urlIsValidProvider()
{
return
[
[ 'http://example-server' ],
[ 'http://example.com' ],
[ 'http://example.com/' ],
[ 'http://subdomain.example.com/path/?parameter1=value1&parameter2=value2' ],
[ 'random-protocol://example.com' ],
[ 'http://example.com:80' ],
[ 'http://example.com?no-path-separator' ],
[ 'http://example.com/pa%20th/' ],
[ 'ftp://example.org/resource.txt' ],
[ 'file://../../../relative/path/needs/protocol/resource.txt' ],
[ 'http://example.com/#one-fragment' ],
[ 'http://example.edu:8080#one-fragment' ],
];
}
public function urnIsValidProvider()
{
return
[
[ 'urn:isbn:0-486-27557-4' ],
[ 'urn:example:mammal:monotreme:echidna' ],
[ 'urn:mpeg:mpeg7:schema:2001' ],
[ 'urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66' ],
[ 'rare-urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66' ],
[ 'urn:FOO:a123,456' ]
];
}
/**
* @dataProvider urlIsNotValidProvider
* @dataProvider urnIsNotValidProvider
*/
public function testCreationThrowsExceptionWhenUriIsNotValid( string $uri )
{
$this->expectException( 'RuntimeException' );
$this->sut = new Uri( $uri );
}
public function urlIsNotValidProvider()
{
return
[
[ 'only-text' ],
[ 'http//missing.colon.example.com/path/?parameter1=value1&parameter2=value2' ],
[ 'missing.protocol.example.com/path/' ],
[ 'http://example.com\\bad-separator' ],
[ 'http://example.com|bad-separator' ],
[ 'ht tp://example.com' ],
[ 'http://exampl e.com' ],
[ 'http://example.com/pa th/' ],
[ '../../../relative/path/needs/protocol/resource.txt' ],
[ 'http://example.com/#two-fragments#not-allowed' ],
[ 'http://example.edu:portMustBeANumber#one-fragment' ],
];
}
public function urnIsNotValidProvider()
{
return
[
[ 'urn:mpeg:mpeg7:sch ema:2001' ],
[ 'urn|mpeg:mpeg7:schema:2001' ],
[ 'urn?mpeg:mpeg7:schema:2001' ],
[ 'urn%mpeg:mpeg7:schema:2001' ],
[ 'urn#mpeg:mpeg7:schema:2001' ],
];
}
}
Uri.php (Value Object)
<?php
declare( strict_types = 1 );
namespace XaviMontero\ThrasherPortage\Tour;
class Uri
{
/** @var string */
private $uri;
public function __construct( string $uri )
{
$this->assertUriIsCorrect( $uri );
$this->uri = $uri;
}
public function getUriAsString()
{
return $this->uri;
}
private function assertUriIsCorrect( string $uri )
{
// https://stackoverflow.com/questions/30847/regex-to-validate-uris
// http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/
if( ! preg_match( "/^([a-z][a-z0-9+.-]*):(?:\\/\\/((?:(?=((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*))(\\3)@)?(?=(\\[[0-9A-F:.]{2,}\\]|(?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*))\\5(?::(?=(\\d*))\\6)?)(\\/(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\8)?|(\\/?(?!\\/)(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\10)?)(?:\\?(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\11)?(?:#(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\12)?$/i", $uri ) )
{
throw new \RuntimeException( "URI has not a valid format." );
}
}
}
Running UnitTests
There are 65 assertions in 46 tests. Caution: there are 2 data providers for valid and 2 more for invalid expressions. One is for URLs and the other for URNs. If you are using PHPUnit 5.6.* or earlier, you need to join the two data providers into a single one.
xavi@bromo:~/custom_www/hello-trip/mutant-migrant$ vendor/bin/phpunit
PHPUnit 5.7.3 by Sebastian Bergmann and contributors.
.............................................. 46 / 46 (100%)
Time: 82 ms, Memory: 4.00MB
OK (46 tests, 65 assertions)
Code coverage
There is 100% code coverage in this sample URI checker.
"/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"
1. (http(s?):\/\/) matches http:// or https://
2. ([a-z0-9\-]+\.)+ matches one or more labels, each followed by a dot:
2.0 [a-z0-9\-] means any character a-z, any digit 0-9, or the (-) sign
2.1 (+) means one or more of those characters, e.g. a1w, a9-, c559s, f
2.2 \. is a literal (.) sign
2.3 the (+) sign after ([a-z0-9\-]+\.) means repeat 2.0-2.2 at least once,
e.g. abc.defgh0.ig. or aa.b.ced.f.gh. It also covers www.yyy.com
3. [a-z]{2,4} means at least 2 but not more than 4 characters a-z. It is
meant to reject cases such as https://www.google.co.kr.asdsdagfsdfsf
(this only holds when the pattern is anchored with ^ and $)
4. (\.[a-z]{2,4})*(\/[^ ]+)* means:
4.1 \.[a-z]{2,4} is like number 3 but starts with a (.) sign
4.2 (*) means (\.[a-z]{2,4}) may appear any number of times, or not at all
4.3 \/ means a literal / (the backslash only escapes it)
4.4 [^ ] means any character except a space
4.5 (+) means one or more of 4.4
4.6 (*) after (\/[^ ]+) means 4.3-4.5 may appear any number of times, or
not at all. This covers cases such as
https://stackoverflow.com/posts/51441301/edit
5. In PHP the regex is written between delimiters, "/ /", so it becomes
"/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"
6. The letter i at the end means ignore case: A is treated the same as a,
and SoRRy the same as sorry.
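The walkthrough above can be tried out with a short sketch. One assumption is added here that is not in the answer: the pattern is wrapped in ^ and $ so that the whole string has to match; without the anchors, preg_match() is satisfied by any matching prefix. The function name matches_url_pattern is hypothetical.

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
// ^ and $ are added here (an assumption) to force a full-string match.
function matches_url_pattern($url)
{
    $pattern = '/^(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}'
             . '(\.[a-z]{2,4})*(\/[^ ]+)*$/i';

    return preg_match($pattern, $url) === 1;
}
```

With the anchors in place, the over-long last label from point 3 is rejected, while the path example from point 4.6 is accepted.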
OK, so this is a little bit more complex than a simple regex, but it allows for different types of URLs.
Examples:
google.com
www.microsoft.com/
http://www.yahoo.com/
https://www.bandcamp.com/artist/#!someone-special!
All of which should be marked as valid.
function is_valid_url($url) {
// First check: is the url just a domain name? (allow a slash at the end)
$_domain_regex = "|^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})/?$|";
if (preg_match($_domain_regex, $url)) {
return true;
}
// Second: Check if it's a url with a scheme and all
$_regex = '#^([a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))$#';
if (preg_match($_regex, $url, $matches)) {
// pull out the domain name, and make sure that the domain is valid.
$_parts = parse_url($url);
if (!in_array($_parts['scheme'], array( 'http', 'https' )))
return false;
// Check the domain using the regex, stops domains like "-example.com" passing through
if (!preg_match($_domain_regex, $_parts['host']))
return false;
// This domain looks pretty valid. Only way to check it now is to download it!
return true;
}
return false;
}
Note that there is an in_array check for the protocols that you want to allow (currently only http and https are in that list).
var_dump(is_valid_url('google.com')); // true
var_dump(is_valid_url('google.com/')); // true
var_dump(is_valid_url('http://google.com')); // true
var_dump(is_valid_url('http://google.com/')); // true
var_dump(is_valid_url('https://google.com')); // true
For anyone developing with WordPress, just use
esc_url_raw($url) === $url
to validate a URL (here's WordPress' documentation on esc_url_raw). It handles URLs much better than filter_var($url, FILTER_VALIDATE_URL) because it is unicode and XSS-safe. (Here is a good article mentioning all the problems with filter_var).
Peter's Regex doesn't look right to me for many reasons. It allows all kinds of special characters in the domain name and doesn't test for much.
Frankie's function looks good to me and you can build a good regex from the components if you don't want a function, like so:
^(http://|https://)(([a-z0-9]([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6}
Untested but I think that should work.
Also, Owen's answer doesn't look 100% either. I took the domain part of the regex and tested it on a Regex tester tool http://erik.eae.net/playground/regexp/regexp.html
I put the following line:
(\S*?\.\S*?)
in the "regexp" section
and the following line:
-hello.com
under the "sample text" section.
The result allowed the minus character through. Because \S means any non-space character.
Note the regex from Frankie handles the minus because it has this part for the first character:
[a-z0-9]
Which won't allow the minus or any other special character.
Here is the way I did it. I should mention that I am not so sure about the regex, but it should work though :)
$pattern = "#((http|https)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|”|\"|'|:|\<|$|\.\s)#i";
$text = preg_replace_callback($pattern,function($m){
return "$m[1]$m[4]";
},
$text);
This way you won't need the e (eval) modifier on your pattern.
Hope it helps :)
Here's a simple class for URL validation that uses a regex and then cross-references the domain against popular RBL (Realtime Blackhole List) servers:
Install:
require 'URLValidation.php';
Usage:
require 'URLValidation.php';
$urlVal = new UrlValidation(); //Create Object Instance
Pass a URL as the parameter of the domain() method and check the return value.
$urlArray = ['http://www.bokranzr.com/test.php?test=foo&test=dfdf', 'https://en-gb.facebook.com', 'https://www.google.com'];
foreach ($urlArray as $k=>$v) {
echo var_dump($urlVal->domain($v)) . ' URL: ' . $v . '<br>';
}
Output:
bool(false) URL: http://www.bokranzr.com/test.php?test=foo&test=dfdf
bool(true) URL: https://en-gb.facebook.com
bool(true) URL: https://www.google.com
As you can see above, www.bokranzr.com is listed as malicious website via an RBL so the domain was returned as false.
I've found this to be the most useful for matching a URL..
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
There is a PHP native function for that:
$url = 'http://www.yoururl.co.uk/sub1/sub2/?param=1&param2/';
if ( ! filter_var( $url, FILTER_VALIDATE_URL ) ) {
// Wrong
}
else {
// Valid
}
Returns the filtered data, or FALSE if the filter fails.
Check it here

Get YouTube ID from URL regex pattern

I've seen a couple of different examples on the site, but they don't get the ID out of all the YouTube URL variants. As an example, the following link doesn't work with the regex pattern below. Any help would be wonderful. Thanks in advance:
The problem seems to be this case: if a user goes to the YouTube homepage and clicks on one of the videos there, they get this URL:
http://www.youtube.com/watch?v=hLSoU53DXK8&feature=g-vrec
My regex puts it in the database as hLSoU53DXK8-vrec, and I need it without -vrec.
// YOUTUBE
$youtube = $_POST['youtube'];
function getYoutubeId($youtube) {
$url = parse_url($youtube);
if($url['host'] !== 'youtube.com' &&
$url['host'] !== 'www.youtube.com'&&
$url['host'] !== 'youtu.be'&&
$url['host'] !== 'www.youtu.be')
return false;
$youtube = preg_replace('~
# Match non-linked youtube URL in the wild. (Rev:20111012)
https?:// # Required scheme. Either http or https.
(?:[0-9A-Z-]+\.)? # Optional subdomain.
(?: # Group host alternatives.
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com followed by
\S* # Allow anything up to VIDEO_ID,
[^\w\-\s] # but char before ID is non-ID char.
) # End host alternatives.
([\w\-]{11}) # $1: VIDEO_ID is exactly 11 chars.
(?=[^\w\-]|$) # Assert next char is non-ID or EOS.
(?! # Assert URL is not pre-linked.
[?=&+%\w]* # Allow URL (query) remainder.
(?: # Group pre-linked alternatives.
[\'"][^<>]*> # Either inside a start tag,
| </a> # or inside <a> element text contents.
) # End recognized pre-linked alts.
) # End negative lookahead assertion.
[?=&+%\w]* # Consume any URL (query) remainder.
~ix',
'$1',
$youtube);
return $youtube;
}
$youtube_id = getYoutubeId($youtube);
$url = "http://www.youtube.com/watch?v=hLSoU53DXK8&feature=g-vrec";
$query_string = array();
parse_str(parse_url($url, PHP_URL_QUERY), $query_string);
$id = $query_string["v"];
Unfortunately the solution above does not retrieve the YouTube ID for the short URL "http://youtu.be". So based on the solutions above I wrote this function:
function get_youtube_id( $youtube_url ) {
$url = parse_url($youtube_url);
if( $url['host'] !== 'youtube.com' &&
$url['host'] !== 'www.youtube.com'&&
$url['host'] !== 'youtu.be'&&
$url['host'] !== 'www.youtu.be')
return '';
if( $url['host'] === 'youtube.com' || $url['host'] === 'www.youtube.com' ) :
parse_str(parse_url($youtube_url, PHP_URL_QUERY), $query_string);
return $query_string["v"];
endif;
$youtube_id = substr( $url['path'], 1 );
if( strpos( $youtube_id, '/' ) )
$youtube_id = substr( $youtube_id, 0, strpos( $youtube_id, '/' ) );
return $youtube_id;
}
$youtube = "theURL";
$query_string = array();
parse_str(parse_url($youtube, PHP_URL_QUERY), $query_string);
$youtube_id = $query_string["v"];
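The parse_url()/parse_str() approach from the answers above can be sketched as one function that also guards against missing array keys. Everything specific here is an assumption made for illustration: the name extract_youtube_id, the set of accepted hosts, and the rule that an ID is exactly 11 characters from [\w-].

```php
<?php
// Hypothetical helper: returns the video ID or null.
function extract_youtube_id(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // malformed URL or no host (e.g. a bare string)
    }
    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    $path = isset($parts['path']) ? $parts['path'] : '';
    $id = null;

    if ($host === 'youtu.be') {
        // Short links carry the ID as the first path segment.
        $segments = explode('/', trim($path, '/'));
        $id = $segments[0];
    } elseif ($host === 'youtube.com') {
        if (preg_match('~^/(?:embed|v)/([^/]+)~', $path, $m) === 1) {
            $id = $m[1];
        } elseif (isset($parts['query'])) {
            // "v" may appear anywhere in the query string.
            parse_str($parts['query'], $query);
            $id = (isset($query['v']) && is_string($query['v'])) ? $query['v'] : null;
        }
    }
    // Assumed ID shape: exactly 11 characters of [A-Za-z0-9_-].
    return ($id !== null && preg_match('~^[\w-]{11}\z~', $id) === 1) ? $id : null;
}
```

Because the query string is parsed rather than pattern-matched, trailing parameters such as &feature=g-vrec do not leak into the ID.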

RegEx pattern to get the YouTube video ID from any YouTube URL

Let's take these URLs as an example:
http://www.youtube.com/watch?v=8GqqjVXhfMU&feature=youtube_gdata_player
http://www.youtube.com/watch?v=8GqqjVXhfMU
This PHP function will NOT properly obtain the ID in case 1, but will in case 2. Case 1 is very common, where ANYTHING can come behind the YouTube ID.
/**
* get YouTube video ID from URL
*
* @param string $url
* @return string YouTube video id or FALSE if none found.
*/
function youtube_id_from_url($url) {
$pattern =
'%^# Match any YouTube URL
(?:https?://)? # Optional scheme. Either http or https
(?:www\.)? # Optional www subdomain
(?: # Group host alternatives
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
(?: # Group path alternatives
/embed/ # Either /embed/
| /v/ # or /v/
| /watch\?v= # or /watch\?v=
) # End path alternatives.
) # End host alternatives.
([\w-]{10,12}) # Allow 10-12 for 11 char YouTube id.
$%x'
;
$result = preg_match($pattern, $url, $matches);
if (false !== $result) {
return $matches[1];
}
return false;
}
What I'm thinking is that there must be a way where I can just look for the "v=", no matter where it lies in the URL, and take the characters after that. In this manner, no complex RegEx will be needed. Is this off base? Any ideas for starting points?
if (preg_match('/youtube\.com\/watch\?v=([^\&\?\/]+)/', $url, $id)) {
$values = $id[1];
} else if (preg_match('/youtube\.com\/embed\/([^\&\?\/]+)/', $url, $id)) {
$values = $id[1];
} else if (preg_match('/youtube\.com\/v\/([^\&\?\/]+)/', $url, $id)) {
$values = $id[1];
} else if (preg_match('/youtu\.be\/([^\&\?\/]+)/', $url, $id)) {
$values = $id[1];
}
else if (preg_match('/youtube\.com\/verify_age\?next_url=\/watch%3Fv%3D([^\&\?\/]+)/', $url, $id)) {
$values = $id[1];
} else {
// not an youtube video
}
This is what I use to extract the ID from a YouTube URL. I think it works in all cases.
Note that at the end, $values holds the ID of the video.
Instead of a regex, I highly recommend parse_url() and parse_str():
$url = "http://www.youtube.com/watch?v=8GqqjVXhfMU&feature=youtube_gdata_player";
parse_str(parse_url( $url, PHP_URL_QUERY ), $vars );
echo $vars['v'];
Done
You could just use parse_url and parse_str:
$query_string = parse_url($url, PHP_URL_QUERY);
parse_str($query_string, $params); // the second argument is required as of PHP 8.0
echo $params['v'];
I have used the following patterns because YouTube has a youtube-nocookie.com domain too:
'#youtube(?:-nocookie)?\.com/watch[#\?].*?v=([^"\& ]+)#i',
'#youtube(?:-nocookie)?\.com/embed/([^"\&\? ]+)#i',
'#youtube(?:-nocookie)?\.com/v/([^"\&\? ]+)#i',
'#youtube(?:-nocookie)?\.com/\?v=([^"\& ]+)#i',
'#youtu\.be/([^"\&\? ]+)#i',
'#gdata\.youtube\.com/feeds/api/videos/([^"\&\? ]+)#i',
In your case it would only mean to extend the existing expressions with an optional (-nocookie) for the regular YouTube.com URL like so:
if (preg_match('/youtube(?:-nocookie)?\.com\/watch\?v=([^\&\?\/]+)/', $url, $id)) {
If you change your proposed expression to NOT contain the final $, it should work like you intended. I added the -nocookie as well.
/**
* get YouTube video ID from URL
*
* @param string $url
* @return string YouTube video id or FALSE if none found.
*/
function youtube_id_from_url($url) {
$pattern =
'%^# Match any YouTube URL
(?:https?://)? # Optional scheme. Either http or https
(?:www\.)? # Optional www subdomain
(?: # Group host alternatives
youtu\.be/ # Either youtu.be,
|youtube(?:-nocookie)?\.com # or youtube.com and youtube-nocookie
(?: # Group path alternatives
/embed/ # Either /embed/
| /v/ # or /v/
| /watch\?v= # or /watch\?v=
) # End path alternatives.
) # End host alternatives.
([\w-]{10,12}) # Allow 10-12 for 11 char YouTube id.
%x'
;
$result = preg_match($pattern, $url, $matches);
if ($result) { // preg_match() returns 0 when nothing matches, false only on error
return $matches[1];
}
return false;
}
Another easy way is using parse_str() on the query string:
<?php
$url = 'http://www.youtube.com/watch?v=8GqqjVXhfMU&feature=youtube_gdata_player';
parse_str(parse_url($url, PHP_URL_QUERY), $yt);
// The associative array $yt now contains all of the key-value pairs from the query string
echo $yt['v']; // echoes '8GqqjVXhfMU'
?>
The parse_url suggestions are good. If you really want a regex you can use this:
/(?<=v=)[^&]+/
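A small sketch of how the lookbehind behaves in practice. The two function names are hypothetical, and the stricter variant (requiring ? or & directly before v=, and stopping at #) is an addition for comparison, not part of the answer above.

```php
<?php
// Lookbehind from the answer: takes whatever follows the first "v=".
function id_after_v_loose($url)
{
    return preg_match('/(?<=v=)[^&]+/', $url, $m) === 1 ? $m[0] : null;
}

// Assumed stricter variant: "v" must be a whole query parameter name.
function id_after_v_strict($url)
{
    return preg_match('/(?<=[?&]v=)[^&#]+/', $url, $m) === 1 ? $m[0] : null;
}
```

The loose form is fooled by parameters whose name merely ends in "v" (for example rev=1), and neither form handles youtu.be links, which carry no v= at all.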
SOLUTION for any YOUTUBE LINK:
http://youtube.com/v/dQw4w9WgXcQ
http://youtube.com/watch?v=dQw4w9WgXcQ
http://www.youtube.com/watch?feature=player&v=dQw4w9WgXcQ&var2=bla
http://youtu.be/dQw4w9WgXcQ
==
https://stackoverflow.com/a/20614061/2165415
Here is one solution
/**
* credits goes to: http://stackoverflow.com/questions/11438544/php-regex-for-youtube-video-id
* update: mobile link detection
*/
public function parseYouTubeUrl($url)
{
$pattern = '#^(?:https?://)?(?:www\.)?(?:m\.)?(?:youtu\.be/|youtube\.com(?:/embed/|/v/|/watch\?v=|/watch\?.+&v=))([\w-]{11})(?:.+)?$#x';
preg_match($pattern, $url, $matches);
return (isset($matches[1])) ? $matches[1] : false;
}
It can deal with mobile links too.
Here is my function for retrieving Youtube ID !
function getYouTubeId($url) {
if (!(strpos($url, 'v=') !== false)) return false;
$parse = explode('v=', $url);
$code = $parse[1];
if (strlen($code) < 11) return false;
$code = substr($code, 0, 11);
return $code;
}

Youtube API - Extract video ID

I am coding a functionality that allows users to enter a Youtube video URL. I would like to extract the video ID from these urls.
Does Youtube API support some kind of function where I pass the link and it gives the video ID in return. Or do I have to parse the string myself?
I am using PHP ... I would appreciate any pointers / code samples in this regard.
Thanks
Here is an example function that uses a regular expression to extract the youtube ID from a URL:
/**
* get youtube video ID from URL
*
* @param string $url
* @return string Youtube video id or FALSE if none found.
*/
function youtube_id_from_url($url) {
$pattern =
'%^# Match any youtube URL
(?:https?://)? # Optional scheme. Either http or https
(?:www\.)? # Optional www subdomain
(?: # Group host alternatives
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
(?: # Group path alternatives
/embed/ # Either /embed/
| /v/ # or /v/
| /watch\?v= # or /watch\?v=
) # End path alternatives.
) # End host alternatives.
([\w-]{10,12}) # Allow 10-12 for 11 char youtube id.
$%x'
;
$result = preg_match($pattern, $url, $matches);
if ($result) {
return $matches[1];
}
return false;
}
echo youtube_id_from_url('http://youtu.be/NLqAF9hrVbY'); # NLqAF9hrVbY
It's an adaptation of the answer from a similar question.
It's not directly the API you're looking for but probably helpful. Youtube has an oembed service:
$url = 'http://youtu.be/NLqAF9hrVbY';
var_dump(json_decode(file_get_contents(sprintf('http://www.youtube.com/oembed?url=%s&format=json', urlencode($url)))));
Which provides some more meta-information about the URL:
object(stdClass)#1 (13) {
["provider_url"]=>
string(23) "http://www.youtube.com/"
["title"]=>
string(63) "Hang Gliding: 3 Flights in 8 Days at Northside Point of the Mtn"
["html"]=>
string(411) "<object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/NLqAF9hrVbY?version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/NLqAF9hrVbY?version=3" type="application/x-shockwave-flash" width="425" height="344" allowscriptaccess="always" allowfullscreen="true"></embed></object>"
["author_name"]=>
string(11) "widgewunner"
["height"]=>
int(344)
["thumbnail_width"]=>
int(480)
["width"]=>
int(425)
["version"]=>
string(3) "1.0"
["author_url"]=>
string(39) "http://www.youtube.com/user/widgewunner"
["provider_name"]=>
string(7) "YouTube"
["thumbnail_url"]=>
string(48) "http://i3.ytimg.com/vi/NLqAF9hrVbY/hqdefault.jpg"
["type"]=>
string(5) "video"
["thumbnail_height"]=>
int(360)
}
But the ID is not a direct part of the response. However it might contain the information you're looking for and it might be useful to validate the youtube URL.
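As a rough sketch of using the oEmbed service for validation: the two helper names below are hypothetical, and the use of https for the endpoint is an assumption (the answer above uses http). The idea is simply that a failed request, or a body that is not JSON, is treated as "this URL could not be confirmed as a video".

```php
<?php
// Hypothetical helper: builds the oEmbed request URL for a given video URL.
function youtube_oembed_endpoint(string $videoUrl): string
{
    return 'https://www.youtube.com/oembed?format=json&url=' . urlencode($videoUrl);
}

// Hypothetical helper: returns the decoded oEmbed data as an array, or null
// when the request fails or the response body is not valid JSON.
function youtube_oembed_lookup(string $videoUrl): ?array
{
    // @ suppresses the warning file_get_contents() raises on HTTP errors.
    $json = @file_get_contents(youtube_oembed_endpoint($videoUrl));
    if ($json === false) {
        return null;
    }
    $data = json_decode($json, true);
    return is_array($data) ? $data : null;
}
```

A caller could then treat `youtube_oembed_lookup($url) !== null` as a sign that the service recognised the URL. This needs network access and `allow_url_fopen`, so it is probably better suited to form submission than to hot code paths.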
I have made slight changes to the above regular expression. It works fine for the YouTube short URL (used in the above example) and for simple video URLs where no other parameter comes after the video code, but it does not work for URLs like
http://www.youtube.com/watch?v=B_izAKQ0WqQ&feature=related, because the video code is not the last parameter in this URL.
Similarly, v={video_code} does not always come directly after watch? (as the above regular expression assumes). For example, if the user has selected a language or location from the footer, such as English (UK) from the Language option, the URL will be http://www.youtube.com/watch?feature=related&hl=en-GB&v=B_izAKQ0WqQ
So I have modified the above regular expression, but credit definitely goes to hakre for providing the base regular expression. Thanks @hakre:
function youtube_id_from_url($url) {
$pattern =
'%^# Match any youtube URL
(?:https?://)? # Optional scheme. Either http or https
(?:www\.)? # Optional www subdomain
(?: # Group host alternatives
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
(?: # Group path alternatives
/embed/ # Either /embed/
| /v/ # or /v/
| .*v= # or anything up to v= (e.g. /watch?feature=related&v=)
) # End path alternatives.
) # End host alternatives.
([\w-]{10,12}) # Allow 10-12 for 11 char youtube id.
($|&).* # if additional parameters are also in query string after video id.
$%x'
;
$result = preg_match($pattern, $url, $matches);
if ($result) { // preg_match() returns 0 on no match, so test for a truthy result
return $matches[1];
}
return false;
}
You can use the PHP function parse_url to extract host name, path, query string and the fragment. You can then use PHP string functions to locate the video id.
function getYouTubeVideoId($url)
{
$video_id = false;
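$url = parse_url($url);
if ($url === false || !isset($url['host']))
{
return false; // malformed URL, or no host component to inspect
}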
if (strcasecmp($url['host'], 'youtu.be') === 0)
{
#### (dontcare)://youtu.be/<video id>
$video_id = substr($url['path'], 1);
}
elseif (strcasecmp($url['host'], 'www.youtube.com') === 0)
{
if (isset($url['query']))
{
parse_str($url['query'], $url['query']);
if (isset($url['query']['v']))
{
#### (dontcare)://www.youtube.com/(dontcare)?v=<video id>
$video_id = $url['query']['v'];
}
}
if ($video_id == false)
{
$url['path'] = explode('/', substr($url['path'], 1));
if (in_array($url['path'][0], array('e', 'embed', 'v')))
{
#### (dontcare)://www.youtube.com/(whitelist)/<video id>
$video_id = $url['path'][1];
}
}
}
return $video_id;
}
$urls = array(
'http://youtu.be/dQw4w9WgXcQ',
'http://www.youtube.com/?v=dQw4w9WgXcQ',
'http://www.youtube.com/?v=dQw4w9WgXcQ&feature=player_embedded',
'http://www.youtube.com/watch?v=dQw4w9WgXcQ',
'http://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=player_embedded',
'http://www.youtube.com/v/dQw4w9WgXcQ',
'http://www.youtube.com/e/dQw4w9WgXcQ',
'http://www.youtube.com/embed/dQw4w9WgXcQ'
);
foreach ($urls as $url)
{
echo sprintf('%s -> %s' . PHP_EOL, $url, getYouTubeVideoId($url));
}
Simple as return substr(strstr($url, 'v='), 2, 11);
I know this is a very late answer but I found this thread while searching for the topic so I want to suggest a more elegant way of doing this using oEmbed:
echo get_embed('youtube', 'https://www.youtube.com/watch?v=IdxKPCv0bSs');
function get_embed($provider, $url, $max_width = '', $max_height = ''){
$providers = array(
'youtube' => 'http://www.youtube.com/oembed'
/* you can add support for more providers here */
);
if(!isset($providers[$provider])){
return 'Invalid provider!';
}
$movie_data_json = @file_get_contents(
$providers[$provider] . '?url=' . urlencode($url) .
"&maxwidth={$max_width}&maxheight={$max_height}&format=json"
);
if(!$movie_data_json){
$error = error_get_last();
/* remove the PHP stuff from the error and show only the HTTP error message */
$error_message = preg_replace('/.*: (.*)/', '$1', $error['message']);
return $error_message;
}else{
$movie_data = json_decode($movie_data_json, true);
return $movie_data['html'];
}
}
oEmbed makes it possible to embed content from more sites by just adding their oEmbed API endpoint to the $providers array in the above code.
Here is a simple solution that has worked for me.
In these YouTube URL types the video ID is the first word that consists of word characters plus "-" and is at least 9 characters long, delimited by non-word characters. So you can search the URL for the regex below and take the first captured group of the first match as your answer. The first match is used because some YouTube parameters, such as enablejsapi, are also longer than 8 characters, but they always come after the video ID.
Regex: "\W([\w-]{9,})(\W|$)"
Here is the working java code:
String[] youtubeUrls = {
"https://www.youtube.com/watch?v=UzRtrjyDwx0",
"https://youtu.be/6butf1tEVKs?t=22s",
"https://youtu.be/R46-XgqXkzE?t=2m52s",
"http://youtu.be/dQw4w9WgXcQ",
"http://www.youtube.com/?v=dQw4w9WgXcQ",
"http://www.youtube.com/?v=dQw4w9WgXcQ&feature=player_embedded",
"http://www.youtube.com/watch?v=dQw4w9WgXcQ",
"http://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=player_embedded",
"http://www.youtube.com/v/dQw4w9WgXcQ",
"http://www.youtube.com/e/dQw4w9WgXcQ",
"http://www.youtube.com/embed/dQw4w9WgXcQ"
};
String pattern = "\\W([\\w-]{9,})(\\W|$)";
Pattern pattern2 = Pattern.compile(pattern);
for (int i=0; i<youtubeUrls.length; i++){
Matcher matcher2 = pattern2.matcher(youtubeUrls[i]);
if (matcher2.find()){
System.out.println(matcher2.group(1));
}
else System.out.println("Not found");
}
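For readers following the PHP answers in this thread, the same heuristic might be sketched in PHP as below. The function name is hypothetical. One caveat that follows directly from the pattern: a long host label such as youtube-nocookie is itself a 9+ character word, so it is matched before the video ID.

```php
<?php
// Hypothetical helper: returns the first "long word" (9+ characters of
// [A-Za-z0-9_-]) that is delimited by non-word characters, or null.
function youtube_id_longest_word(string $url): ?string
{
    if (preg_match('/\W([\w-]{9,})(?:\W|$)/', $url, $m) === 1) {
        return $m[1];
    }
    return null;
}

echo youtube_id_longest_word('https://youtu.be/6butf1tEVKs?t=22s'), PHP_EOL;
// Caveat: here the host label is matched first, not the video ID.
echo youtube_id_longest_word('www.youtube-nocookie.com/embed/dQw4-9W_XcQ?rel=0'), PHP_EOL;
```

So this approach is only reasonable when the input is already known to come from youtube.com or youtu.be.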
As mentioned in a comment below the valid answer, we use it like this, and it works mighty fine!
function youtube_id_from_url($url) {
$url = trim(strtok("$url", '?'));
$url = str_replace("#!/", "", "$url");
$pattern =
'%^# Match any youtube URL
(?:https?://)? # Optional scheme. Either http or https
(?:www\.)? # Optional www subdomain
(?: # Group host alternatives
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
(?: # Group path alternatives
/embed/ # Either /embed/
| /v/ # or /v/
| /watch\?v= # or /watch\?v=
) # End path alternatives.
) # End host alternatives.
([\w-]{10,12}) # Allow 10-12 for 11 char youtube id.
$%x'
;
$result = preg_match($pattern, $url, $matches);
if ($result) {
return $matches[1];
}
return false;
}
How about this one:
function getVideoId() {
$query = parse_url($this->url, PHP_URL_QUERY);
$arr = explode('=', $query);
$index = array_search('v', $arr);
if ($index !== false) {
if (isset($arr[$index++])) {
$string = $arr[$index++];
if (($amp = strpos($string, '&')) !== false) {
return substr($string, 0, $amp);
} else {
return $string;
}
} else {
return false;
}
}
return false;
}
No regex, support multiple query parameters, i.e, https://www.youtube.com/watch?v=PEQxWg92Ux4&index=9&list=RDMMom0RGEnWIEk also works.
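Note that splitting the query on "=" only finds v when it is the first parameter: for feature=related&v=ID the pieces are feature, related&v and ID, so array_search('v', ...) fails. A variation that lets parse_str() do the splitting might look like the sketch below. The function name is hypothetical, and the 11-character check is an assumption taken from the "11 char youtube id" comment earlier in this thread.

```php
<?php
// Hypothetical helper: reads the v parameter from the query string,
// wherever it appears, and returns it only if it looks like a video ID.
function youtube_id_from_query(string $url): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or a malformed URL
    }
    parse_str($query, $params);
    $id = $params['v'] ?? null;
    // Assumption: IDs are 11 characters of [A-Za-z0-9_-].
    if (is_string($id) && preg_match('/^[\w-]{11}$/', $id) === 1) {
        return $id;
    }
    return null;
}

echo youtube_id_from_query('http://www.youtube.com/watch?feature=related&hl=en-GB&v=B_izAKQ0WqQ'), PHP_EOL;
```

Short links such as http://youtu.be/dQw4w9WgXcQ carry the ID in the path rather than the query, so they return null here and need separate handling.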
For JAVA developers
Got this working for me, also supports no-cookie url's:
private static final Pattern youtubeId = Pattern.compile("^(?:https?\\:\\/\\/)?.*(?:youtu.be\\/|vi?\\/|vi?=|u\\/\\w\\/|embed\\/|(watch)?vi?=)([^#&?]*).*$");
@VisibleForTesting
String getVideoId(final String url) {
final Matcher matcher = youtubeId.matcher(url);
if(matcher.find()){
return matcher.group(2);
}
return "";
}
Some tests to check YouTube URLs:
@ParameterizedTest
@MethodSource("youtubeTestUrls")
void videoIdFromUrlTest(final String url, final String videoId) {
final String matchedVidID = this.youtubeService.getVideoId(url);
assertEquals(videoId, matchedVidID);
}
private static Stream<Arguments> youtubeTestUrls() {
return Stream.of(
Arguments.of("www.youtube-nocookie.com/embed/dQw4-9W_XcQ?rel=0", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/user/Scobleizer#p/u/1/dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/watch?v=dQw4-9W_XcQ&feature=channel", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/watch?v=dQw4-9W_XcQ&playnext_from=TL&videos=osPknwzXEas&feature=sub", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/ytscreeningroom?v=dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/user/SilkRoadTheatre#p/a/u/2/dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("http://youtu.be/dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/watch?v=dQw4-9W_XcQ&feature=youtu.be", "dQw4-9W_XcQ"),
Arguments.of("http://youtu.be/dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("https://www.youtube.com/user/Scobleizer#p/u/1/dQw4-9W_XcQ?rel=0", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/watch?v=dQw4-9W_XcQ&playnext_from=TL&videos=dQw4-9W_XcQ&feature=sub", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/ytscreeningroom?v=dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/embed/dQw4-9W_XcQ?rel=0", "dQw4-9W_XcQ"),
Arguments.of("https://www.youtube.com/watch?v=dQw4-9W_XcQ", "dQw4-9W_XcQ"),
Arguments.of("http://youtube.com/v/dQw4-9W_XcQ?feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("http://youtube.com/vi/dQw4-9W_XcQ?feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("http://youtube.com/?v=dQw4-9W_XcQ&feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("http://www.youtube.com/watch?v=dQw4-9W_XcQ&feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("http://youtube.com/?vi=dQw4-9W_XcQ&feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("https://youtube.com/watch?v=dQw4-9W_XcQ&feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("http://youtube.com/watch?vi=dQw4-9W_XcQ&feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("http://youtu.be/dQw4-9W_XcQ?feature=youtube_gdata_player", "dQw4-9W_XcQ"),
Arguments.of("https://www.youtube.com/watch?v=yYw2Q141thM&list=PLOwEeBApnYoUFioRitjwz-DREzFGOSgiE&index=2", "yYw2Q141thM"),
Arguments.of("https://www.youtube.com/watch?", "")
);
}

PHP validation/regex for URL

I've been looking for a simple regex for URLs, does anybody have one handy that works well? I didn't find one with the zend framework validation classes and have seen several implementations.
Use the filter_var() function to validate whether a string is URL or not:
var_dump(filter_var('example.com', FILTER_VALIDATE_URL));
It is bad practice to use regular expressions when not necessary.
EDIT: Be careful, this solution is not unicode-safe and not XSS-safe. If you need a complex validation, maybe it's better to look somewhere else.
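One reason for the caution: FILTER_VALIDATE_URL checks the general shape of a URL but does not restrict the scheme, so a further check is needed if only web links should be accepted. A minimal sketch, assuming only http and https are wanted (the function name is hypothetical):

```php
<?php
// Hypothetical helper: accepts a string only if filter_var() considers it
// a URL AND its scheme is http or https.
function is_http_url(string $candidate): bool
{
    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($candidate, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(is_http_url('https://example.com/path?x=1'));
var_dump(is_http_url('example.com'));
var_dump(is_http_url('ftp://example.com/file.txt'));
```

This does not make the value safe to print: it still has to be escaped for the context it is output in.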
I used this on a few projects, I don't believe I've run into issues, but I'm sure it's not exhaustive:
$text = preg_replace(
'#((https?|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i',
"'$3$4'",
$text
);
Most of the random junk at the end is to deal with situations like http://domain.example. in a sentence (to avoid matching the trailing period). I'm sure it could be cleaned up, but since it worked, I've more or less just copied it over from project to project.
As per the PHP manual - parse_url should not be used to validate a URL.
Unfortunately, it seems that filter_var('example.com', FILTER_VALIDATE_URL) does not perform any better.
Both parse_url() and filter_var() will pass malformed URLs such as http://...
Therefore in this case - regex is the better method.
As per John Gruber (Daring Fireball):
Regex:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))
using in preg_match():
preg_match("~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~", $url)
Here is the extended regex pattern (with comments):
(?xi)
\b
( # Capture 1: entire matched URL
(?:
https?:// # http or https protocol
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
For more details please look at:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Just in case you want to know if the url really exists:
function url_exist($url){// if this passes, the URL exists
$c=curl_init();
curl_setopt($c,CURLOPT_URL,$url);
curl_setopt($c,CURLOPT_HEADER,1);//get the header
curl_setopt($c,CURLOPT_NOBODY,1);//and *only* get the header
curl_setopt($c,CURLOPT_RETURNTRANSFER,1);//get the response as a string from curl_exec(), rather than echoing it
curl_setopt($c,CURLOPT_FRESH_CONNECT,1);//don't use a cached version of the url
if(!curl_exec($c)){
//echo $url.' inexists';
return false;
}else{
//echo $url.' exists';
return true;
}
//$httpcode=curl_getinfo($c,CURLINFO_HTTP_CODE);
//return ($httpcode<400);
}
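The commented-out CURLINFO_HTTP_CODE lines hint at a stricter check: curl_exec() can succeed even when the server answers with an error status. A sketch of that variant follows. Both function names are hypothetical, the choice of "below 400 counts as existing" mirrors the commented-out line, and the 10-second timeout and redirect following are assumptions.

```php
<?php
// Hypothetical helper: treats 2xx and 3xx responses as "exists".
function http_status_is_ok(int $code): bool
{
    return $code >= 200 && $code < 400;
}

// Hypothetical helper: sends a HEAD-style request and checks the status code.
function url_responds(string $url): bool
{
    $c = curl_init($url);
    curl_setopt_array($c, [
        CURLOPT_NOBODY         => true, // headers only
        CURLOPT_RETURNTRANSFER => true, // do not echo the response
        CURLOPT_FOLLOWLOCATION => true, // assumption: follow redirects
        CURLOPT_TIMEOUT        => 10,   // assumption: give up after 10 seconds
    ]);
    $ok = curl_exec($c) !== false;
    $code = (int) curl_getinfo($c, CURLINFO_HTTP_CODE);
    return $ok && http_status_is_ok($code);
}
```

Some servers reject HEAD requests, so a false result here does not prove that the page is missing.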
I don't think that using regular expressions is a smart thing to do in this case. It is impossible to match all of the possibilities and even if you did, there is still a chance that url simply doesn't exist.
Here is a very simple way to test if url actually exists and is readable :
if (preg_match("#^https?://.+#", $link) and @fopen($link,"r")) echo "OK";
(if there is no preg_match then this would also validate all filenames on your server)
I've used this one with good success - I don't remember where I got it from
$pattern = "/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i";
The best URL Regex that worked for me:
function valid_URL($url){
return preg_match('%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu', $url);
}
Examples:
valid_URL('https://twitter.com'); // true
valid_URL('http://twitter.com'); // true
valid_URL('http://twitter.co'); // true
valid_URL('http://t.co'); // true
valid_URL('http://twitter.c'); // false
valid_URL('htt://twitter.com'); // false
valid_URL('http://example.com/?a=1&b=2&c=3'); // true
valid_URL('http://127.0.0.1'); // true
valid_URL(''); // false
valid_URL(1); // false
Source: http://urlregex.com/
function validateURL($URL) {
$pattern_1 = "/^(http|https|ftp):\/\/(([A-Z0-9][A-Z0-9_-]*)(\.[A-Z0-9][A-Z0-9_-]*)+.(com|org|net|dk|at|us|tv|info|uk|co.uk|biz|se)$)(:(\d+))?\/?/i";
$pattern_2 = "/^(www)((\.[A-Z0-9][A-Z0-9_-]*)+.(com|org|net|dk|at|us|tv|info|uk|co.uk|biz|se)$)(:(\d+))?\/?/i";
if(preg_match($pattern_1, $URL) || preg_match($pattern_2, $URL)){
return true;
} else{
return false;
}
}
Edit:
As incidence pointed out, this code relies on eregi(), which has been DEPRECATED since the release of PHP 5.3.0 (2009-06-30), so use it with that in mind.
Just my two cents but I've developed this function and have been using it for a while with success. It's well documented and separated so you can easily change it.
// Checks if string is a URL
// @param string $url
// @return bool
function isURL($url = NULL) {
if($url==NULL) return false;
$protocol = '(http://|https://)';
$allowed = '([a-z0-9]([-a-z0-9]*[a-z0-9]+)?)';
$regex = "^". $protocol . // must include the protocol
'(' . $allowed . '{1,63}\.)+'. // 1 or several sub domains with a max of 63 chars
'[a-z]' . '{2,6}'; // followed by a TLD
if(eregi($regex, $url)==true) return true;
else return false;
}
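Since eregi() is no longer available in current PHP versions, the same idea can be expressed with preg_match(). The sketch below is a variation rather than a drop-in copy: the function name is hypothetical, and the label pattern is simplified to "starts and ends with a letter or digit, at most 63 characters" instead of repeating the original group.

```php
<?php
// Hypothetical preg_match() variant of the function above.
function is_url_preg(?string $url = null): bool
{
    if ($url === null || $url === '') {
        return false;
    }
    $protocol = '(?:http://|https://)';               // must include the protocol
    $label    = '[a-z0-9](?:[-a-z0-9]{0,61}[a-z0-9])?'; // one domain label, max 63 chars
    // "~" is used as the delimiter because the pattern itself contains "/".
    $regex = '~^' . $protocol . '(?:' . $label . '\.)+[a-z]{2,6}~i';
    return preg_match($regex, $url) === 1;
}

var_dump(is_url_preg('http://example.com'));
var_dump(is_url_preg('http://-bad.com'));
```

Like the original, the pattern is not anchored at the end, so anything after the TLD (path, query, fragment) is accepted without being checked.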
And there is your answer =) Try to break it, you can't!!!
function link_validate_url($text) {
$LINK_DOMAINS = 'aero|arpa|asia|biz|com|cat|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|mobi|local';
$LINK_ICHARS_DOMAIN = (string) html_entity_decode(implode("", array( // @TODO completing letters ...
"æ", // æ
"Æ", // Æ
"À", // À
"à", // à
"Á", // Á
"á", // á
"Â", // Â
"â", // â
"å", // å
"Å", // Å
"ä", // ä
"Ä", // Ä
"Ç", // Ç
"ç", // ç
"Ð", // Ð
"ð", // ð
"È", // È
"è", // è
"É", // É
"é", // é
"Ê", // Ê
"ê", // ê
"Ë", // Ë
"ë", // ë
"Î", // Î
"î", // î
"Ï", // Ï
"ï", // ï
"ø", // ø
"Ø", // Ø
"ö", // ö
"Ö", // Ö
"Ô", // Ô
"ô", // ô
"Õ", // Õ
"õ", // õ
"Œ", // Œ
"œ", // œ
"ü", // ü
"Ü", // Ü
"Ù", // Ù
"ù", // ù
"Û", // Û
"û", // û
"Ÿ", // Ÿ
"ÿ", // ÿ
"Ñ", // Ñ
"ñ", // ñ
"þ", // þ
"Þ", // Þ
"ý", // ý
"Ý", // Ý
"¿", // ¿
)), ENT_QUOTES, 'UTF-8');
$LINK_ICHARS = $LINK_ICHARS_DOMAIN . (string) html_entity_decode(implode("", array(
"ß", // ß
)), ENT_QUOTES, 'UTF-8');
$allowed_protocols = array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'mailto', 'irc', 'ssh', 'sftp', 'webcal');
// Starting a parenthesis group with (?: means that it is grouped, but is not captured
$protocol = '((?:'. implode("|", $allowed_protocols) .'):\/\/)';
$authentication = "(?:(?:(?:[\w\.\-\+!$&'\(\)*\+,;=" . $LINK_ICHARS . "]|%[0-9a-f]{2})+(?::(?:[\w". $LINK_ICHARS ."\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})*)?)?@)";
$domain = '(?:(?:[a-z0-9' . $LINK_ICHARS_DOMAIN . ']([a-z0-9'. $LINK_ICHARS_DOMAIN . '\-_\[\]])*)(\.(([a-z0-9' . $LINK_ICHARS_DOMAIN . '\-_\[\]])+\.)*('. $LINK_DOMAINS .'|[a-z]{2}))?)';
$ipv4 = '(?:[0-9]{1,3}(\.[0-9]{1,3}){3})';
$ipv6 = '(?:[0-9a-fA-F]{1,4}(\:[0-9a-fA-F]{1,4}){7})';
$port = '(?::([0-9]{1,5}))';
// Pattern specific to external links.
$external_pattern = '/^'. $protocol .'?'. $authentication .'?('. $domain .'|'. $ipv4 .'|'. $ipv6 .' |localhost)'. $port .'?';
// Pattern specific to internal links.
$internal_pattern = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]]+)";
$internal_pattern_file = "/^(?:[a-z0-9". $LINK_ICHARS ."_\-+\[\]\.]+)$/i";
$directories = "(?:\/[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'#!():;*@\[\]]*)*";
// Yes, four backslashes == a single backslash.
$query = "(?:\/?\?([?a-z0-9". $LINK_ICHARS ."+_|\-\.~\/\\\\%=&,$'():;*@\[\]{} ]*))";
$anchor = "(?:#[a-z0-9". $LINK_ICHARS ."_\-\.~+%=&,$'():;*@\[\]\/\?]*)";
// The rest of the path for a standard URL.
$end = $directories .'?'. $query .'?'. $anchor .'?'.'$/i';
$message_id = '[^@].*@'. $domain;
$newsgroup_name = '(?:[0-9a-z+-]*\.)*[0-9a-z+-]*';
$news_pattern = '/^news:('. $newsgroup_name .'|'. $message_id .')$/i';
$user = '[a-zA-Z0-9'. $LINK_ICHARS .'_\-\.\+\^!#\$%&*+\/\=\?\`\|\{\}~\'\[\]]+';
$email_pattern = '/^mailto:'. $user .'@'.'(?:'. $domain .'|'. $ipv4 .'|'. $ipv6 .'|localhost)'. $query .'?$/';
if (strpos($text, '<front>') === 0) {
return false;
}
if (in_array('mailto', $allowed_protocols) && preg_match($email_pattern, $text)) {
return false;
}
if (in_array('news', $allowed_protocols) && preg_match($news_pattern, $text)) {
return false;
}
if (preg_match($internal_pattern . $end, $text)) {
return false;
}
if (preg_match($external_pattern . $end, $text)) {
return false;
}
if (preg_match($internal_pattern_file, $text)) {
return false;
}
return true;
}
function is_valid_url ($url="") {
if ($url=="") {
$url=$this->url;
}
$url = @parse_url($url);
if ( ! $url) {
return false;
}
$url = array_map('trim', $url);
$url['port'] = (!isset($url['port'])) ? 80 : (int)$url['port'];
$path = (isset($url['path'])) ? $url['path'] : '';
if ($path == '') {
$path = '/';
}
$path .= ( isset ( $url['query'] ) ) ? "?$url[query]" : '';
if ( isset ( $url['host'] ) AND $url['host'] != gethostbyname ( $url['host'] ) ) {
if ( PHP_VERSION >= 5 ) {
$headers = get_headers("$url[scheme]://$url[host]:$url[port]$path");
}
else {
$fp = fsockopen($url['host'], $url['port'], $errno, $errstr, 30);
if ( ! $fp ) {
return false;
}
fputs($fp, "HEAD $path HTTP/1.1\r\nHost: $url[host]\r\n\r\n");
$headers = fread ( $fp, 128 );
fclose ( $fp );
}
$headers = ( is_array ( $headers ) ) ? implode ( "\n", $headers ) : $headers;
return ( bool ) preg_match ( '#^HTTP/.*\s+(200|301|302)\s#i', $headers ); // group, not character class: accept only these status codes
}
return false;
}
Inspired by this .NET StackOverflow question and by an article referenced from that question, here is a URI validator (URI means it validates both URLs and URNs).
if( ! preg_match( "/^([a-z][a-z0-9+.-]*):(?:\\/\\/((?:(?=((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*))(\\3)@)?(?=(\\[[0-9A-F:.]{2,}\\]|(?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*))\\5(?::(?=(\\d*))\\6)?)(\\/(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\8)?|(\\/?(?!\\/)(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\10)?)(?:\\?(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\11)?(?:#(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\12)?$/i", $uri ) )
{
throw new \RuntimeException( "URI has not a valid format." );
}
I have successfully unit-tested this function inside a ValueObject I made named Uri and tested by UriTest.
UriTest.php (Contains valid and invalid cases for both URLs and URNs)
<?php
declare( strict_types = 1 );
namespace XaviMontero\ThrasherPortage\Tests\Tour;
use XaviMontero\ThrasherPortage\Tour\Uri;
class UriTest extends \PHPUnit_Framework_TestCase
{
private $sut;
public function testCreationIsOfProperClassWhenUriIsValid()
{
$sut = new Uri( 'http://example.com' );
$this->assertInstanceOf( 'XaviMontero\\ThrasherPortage\\Tour\\Uri', $sut );
}
/**
* @dataProvider urlIsValidProvider
* @dataProvider urnIsValidProvider
*/
public function testGetUriAsStringWhenUriIsValid( string $uri )
{
$sut = new Uri( $uri );
$actual = $sut->getUriAsString();
$this->assertInternalType( 'string', $actual );
$this->assertEquals( $uri, $actual );
}
public function urlIsValidProvider()
{
return
[
[ 'http://example-server' ],
[ 'http://example.com' ],
[ 'http://example.com/' ],
[ 'http://subdomain.example.com/path/?parameter1=value1&parameter2=value2' ],
[ 'random-protocol://example.com' ],
[ 'http://example.com:80' ],
[ 'http://example.com?no-path-separator' ],
[ 'http://example.com/pa%20th/' ],
[ 'ftp://example.org/resource.txt' ],
[ 'file://../../../relative/path/needs/protocol/resource.txt' ],
[ 'http://example.com/#one-fragment' ],
[ 'http://example.edu:8080#one-fragment' ],
];
}
public function urnIsValidProvider()
{
return
[
[ 'urn:isbn:0-486-27557-4' ],
[ 'urn:example:mammal:monotreme:echidna' ],
[ 'urn:mpeg:mpeg7:schema:2001' ],
[ 'urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66' ],
[ 'rare-urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66' ],
[ 'urn:FOO:a123,456' ]
];
}
/**
* @dataProvider urlIsNotValidProvider
* @dataProvider urnIsNotValidProvider
*/
public function testCreationThrowsExceptionWhenUriIsNotValid( string $uri )
{
$this->expectException( 'RuntimeException' );
$this->sut = new Uri( $uri );
}
public function urlIsNotValidProvider()
{
return
[
[ 'only-text' ],
[ 'http//missing.colon.example.com/path/?parameter1=value1&parameter2=value2' ],
[ 'missing.protocol.example.com/path/' ],
[ 'http://example.com\\bad-separator' ],
[ 'http://example.com|bad-separator' ],
[ 'ht tp://example.com' ],
[ 'http://exampl e.com' ],
[ 'http://example.com/pa th/' ],
[ '../../../relative/path/needs/protocol/resource.txt' ],
[ 'http://example.com/#two-fragments#not-allowed' ],
[ 'http://example.edu:portMustBeANumber#one-fragment' ],
];
}
public function urnIsNotValidProvider()
{
return
[
[ 'urn:mpeg:mpeg7:sch ema:2001' ],
[ 'urn|mpeg:mpeg7:schema:2001' ],
[ 'urn?mpeg:mpeg7:schema:2001' ],
[ 'urn%mpeg:mpeg7:schema:2001' ],
[ 'urn#mpeg:mpeg7:schema:2001' ],
];
}
}
Uri.php (Value Object)
<?php
declare( strict_types = 1 );
namespace XaviMontero\ThrasherPortage\Tour;
class Uri
{
/** @var string */
private $uri;
public function __construct( string $uri )
{
$this->assertUriIsCorrect( $uri );
$this->uri = $uri;
}
public function getUriAsString()
{
return $this->uri;
}
private function assertUriIsCorrect( string $uri )
{
// https://stackoverflow.com/questions/30847/regex-to-validate-uris
// http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/
if( ! preg_match( "/^([a-z][a-z0-9+.-]*):(?:\\/\\/((?:(?=((?:[a-z0-9-._~!$&'()*+,;=:]|%[0-9A-F]{2})*))(\\3)@)?(?=(\\[[0-9A-F:.]{2,}\\]|(?:[a-z0-9-._~!$&'()*+,;=]|%[0-9A-F]{2})*))\\5(?::(?=(\\d*))\\6)?)(\\/(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\8)?|(\\/?(?!\\/)(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/]|%[0-9A-F]{2})*))\\10)?)(?:\\?(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\11)?(?:#(?=((?:[a-z0-9-._~!$&'()*+,;=:@\\/?]|%[0-9A-F]{2})*))\\12)?$/i", $uri ) )
{
throw new \RuntimeException( "URI has not a valid format." );
}
}
}
Running UnitTests
There are 65 assertions in 46 tests. Caution: there are 2 data-providers for valid and 2 more for invalid expressions. One is for URLs and the other for URNs. If you are using a version of PhpUnit of v5.6* or earlier then you need to join the two data providers into a single one.
xavi@bromo:~/custom_www/hello-trip/mutant-migrant$ vendor/bin/phpunit
PHPUnit 5.7.3 by Sebastian Bergmann and contributors.
.............................................. 46 / 46 (100%)
Time: 82 ms, Memory: 4.00MB
OK (46 tests, 65 assertions)
Code coverage
There is 100% code coverage in this sample URI checker.
"/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"
1. (http(s?):\/\/) matches http:// or https://
2. ([a-z0-9\-]+\.)+ matches one or more labels, each followed by a dot:
2.1 [a-z0-9\-] means any character a-z, any digit 0-9, or the (-) sign
2.2 (+) means one or more of those characters, e.g. a1w, a9-, c559s, f
2.3 \. is the (.) sign
2.4 the (+) sign after ([a-z0-9\-]+\.) means repeat 2.1-2.3 at least once, e.g. abc.defgh0.ig. or aa.b.ced.f.gh., and also the www.yyy. part of www.yyy.com
3. [a-z]{2,4} means at least 2 but not more than 4 characters a-z; it is intended to keep an over-long ending such as the one in https://www.google.co.kr.asdsdagfsdfsf from being treated as the top-level domain
4. (\.[a-z]{2,4})*(\/[^ ]+)* means:
4.1 \.[a-z]{2,4} is like number 3 but starts with the (.) sign
4.2 (*) means (\.[a-z]{2,4}) may appear any number of times, or not at all
4.3 \/ means the (/) sign
4.4 [^ ] means any character except a space
4.5 (+) means one or more of the characters in 4.4
4.6 (*) after (\/[^ ]+) means 4.3-4.5 may repeat any number of times, or not appear at all; this covers a case like https://stackoverflow.com/posts/51441301/edit
5. In PHP the regex is written between "/ /" delimiters, which gives
"/(http(s?):\/\/)([a-z0-9\-]+\.)+[a-z]{2,4}(\.[a-z]{2,4})*(\/[^ ]+)*/i"
6. Almost forgot: the letter i at the end means ignore case, e.g. A is the same as a, and SoRRy is the same as sorry.
Note : Sorry for bad English. My country not use it well.
OK, so this is a little bit more complex than a simple regex, but it allows for different types of URLs.
Examples:
google.com
www.microsoft.com/
http://www.yahoo.com/
https://www.bandcamp.com/artist/#!someone-special!
All of which should be marked as valid.
function is_valid_url($url) {
// First check: is the url just a domain name? (allow a slash at the end)
$_domain_regex = "|^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})/?$|";
if (preg_match($_domain_regex, $url)) {
return true;
}
// Second: Check if it's a url with a scheme and all
$_regex = '#^([a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))$#';
if (preg_match($_regex, $url, $matches)) {
// pull out the domain name, and make sure that the domain is valid.
$_parts = parse_url($url);
if (!in_array($_parts['scheme'], array( 'http', 'https' )))
return false;
// Check the domain using the regex, stops domains like "-example.com" passing through
if (!preg_match($_domain_regex, $_parts['host']))
return false;
// This domain looks pretty valid. Only way to check it now is to download it!
return true;
}
return false;
}
Note that there is an in_array check for the protocols that you want to allow (currently only http and https are in that list).
var_dump(is_valid_url('google.com')); // true
var_dump(is_valid_url('google.com/')); // true
var_dump(is_valid_url('http://google.com')); // true
var_dump(is_valid_url('http://google.com/')); // true
var_dump(is_valid_url('https://google.com')); // true
For anyone developing with WordPress, just use
esc_url_raw($url) === $url
to validate a URL (here's WordPress' documentation on esc_url_raw). It handles URLs much better than filter_var($url, FILTER_VALIDATE_URL) because it is unicode and XSS-safe. (Here is a good article mentioning all the problems with filter_var).
Peter's Regex doesn't look right to me for many reasons. It allows all kinds of special characters in the domain name and doesn't test for much.
Frankie's function looks good to me and you can build a good regex from the components if you don't want a function, like so:
^(http://|https://)(([a-z0-9]([-a-z0-9]*[a-z0-9]+)?){1,63}\.)+[a-z]{2,6}
Untested but I think that should work.
Also, Owen's answer doesn't look 100% either. I took the domain part of the regex and tested it on a Regex tester tool http://erik.eae.net/playground/regexp/regexp.html
I put the following line:
(\S*?\.\S*?)
in the "regexp" section
and the following line:
-hello.com
under the "sample text" section.
The result allowed the minus character through. Because \S means any non-space character.
Note the regex from Frankie handles the minus because it has this part for the first character:
[a-z0-9]
Which won't allow the minus or any other special character.
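The difference between the two domain patterns can be checked quickly in PHP. This is only a sketch: the anchors and the "~" delimiters are added here for the comparison, and the stricter pattern is a simplified stand-in for the first-character rule described above, not a copy of Frankie's full regex.

```php
<?php
// Loose: \S matches any non-space character, so a leading "-" slips through.
$loose = '~^(\S*?\.\S*?)$~';

// Stricter stand-in: every label must start and end with a letter or digit.
$strict = '~^[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9](?:[-a-z0-9]*[a-z0-9])?)*\.[a-z]{2,6}$~i';

foreach (['-hello.com', 'hello.com'] as $host) {
    printf(
        "%s loose=%d strict=%d\n",
        $host,
        preg_match($loose, $host),
        preg_match($strict, $host)
    );
}
```

The loose pattern accepts both hosts, while the stricter one rejects -hello.com and accepts hello.com.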
Here is the way I did it. I should mention that I am not so sure about the regex, but it should work, though :)
$pattern = "#((http|https)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|”|\"|'|:|\<|$|\.\s)#i";
$text = preg_replace_callback($pattern,function($m){
return "$m[1]$m[4]";
},
$text);
This way you won't need the eval modifier (/e) on your pattern.
Hope it helps :)
Here's a simple class for URL Validation using RegEx and then cross-references the domain against popular RBL (Realtime Blackhole Lists) servers:
Install:
require 'URLValidation.php';
Usage:
require 'URLValidation.php';
$urlVal = new UrlValidation(); //Create Object Instance
Add a URL as the parameter of the domain() method and check the return value.
$urlArray = ['http://www.bokranzr.com/test.php?test=foo&test=dfdf', 'https://en-gb.facebook.com', 'https://www.google.com'];
foreach ($urlArray as $k=>$v) {
echo var_dump($urlVal->domain($v)) . ' URL: ' . $v . '<br>';
}
Output:
bool(false) URL: http://www.bokranzr.com/test.php?test=foo&test=dfdf
bool(true) URL: https://en-gb.facebook.com
bool(true) URL: https://www.google.com
As you can see above, www.bokranzr.com is listed as a malicious website by an RBL, so the domain was returned as false.
I've found this to be the most useful for matching a URL:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
There is a PHP native function for that:
$url = 'http://www.yoururl.co.uk/sub1/sub2/?param=1&param2/';
if ( ! filter_var( $url, FILTER_VALIDATE_URL ) ) {
// Wrong
}
else {
// Valid
}
Returns the filtered data, or FALSE if the filter fails.
Check it here
