PHP Regex for IMDB/TMDB Urls

PHP Regex for IMDB/TMDB Urls - php

I'm writing a code what compares a links from imdb and tmdb.
The code matches link to imdb and then transforms it for the tmdb link, if was inserted.
The links look like:
https://www.imdb.com/title/tt0848228
https://www.themoviedb.org/movie/24428
I want to ask if these regexs are correct for movies links.
For ex.
$imdb_url = https://www.imdb.com/title/tt0848228
if (strpos($imdb_url, 'themoviedb.org') == true) {
preg_match_all('/\\d*-/', $imdb_url, $tmdb_id);
$tmdb_id = $tmdb_id[0];
$tmdb_id = str_replace('-', '', $tmdb_id);
$tmdb_id = $tmdb_id[0];
$request_url = amy_movie_provider_build_query_url('tmdb', $tmdb_id, $api_key);
$the_data = wp_remote_get($request_url, array(
'timeout' => $timeout,
));
if (!is_wp_error($the_data) && !empty($the_data)) {
$movie_data = json_decode($the_data['body'], true);
$result = amy_movie_add_tmdb_movie_data($movie_data);
echo $result;
exit;
} else {
$result = esc_html__('Provider TMDB being error!', 'amy-movie-extend');
echo $result;
exit;
}
exit;
}
And else for imdb link:
else if (strpos($imdb_url, 'www.imdb.com') == true) {
preg_match_all('/tt\\d{7}/', $imdb_url, $imdb_id);
$imdb_id = $imdb_id[0];
$imdb_id = $imdb_id[0];
}
I think it's not working because something may be wrong with not existing /movie prefix in the link, but I tried changing that and it still catches error 404.

Why not combining the domain part with the rest of the URI? Why once omitting the subdomain and once making it mandatory?
$sURI= 'whatever';
if( preg_match( '#imdb\\.com/title/tt(\\d{7})#i', $sURI, $aMatch ) ) {
echo 'IMDb, movie #'. $aMatch[1];
} else
if( preg_match( '#themoviedb.org/movie/(\\d+)($|-)#i', $sURI, $aMatch ) ) {
echo 'TMDb, movie #'. $aMatch[1];
} else {
echo 'Unrecognized';
}
This way it doesn't matter if the IMDb URI comes with www. or not. Since the movie IDs have a fixed length we don't even need to expect/care a slash following. Your mistake was expecting a slash without any need.
Same for TMDb, which either ends right away (but we want to get all digits to the end, not just the first) or is followed by a dash. i is for really distorted URIs for whichever reason. Your mistake was to expect a dash and to make digits entirely optional (when at least one should be needed, as in https://www.themoviedb.org/movie/9)
Side note: Using \\d in a PHP string for a regular expression is the correct way, as you first have to deal with the string context - there an effective backslash has to be escaped by the backslash itself. And only after that the scope of the regular expression is encountered. \d only also works because unknown string escapings are silently ignored.

Related

How to convert random domain names into lowercase consistent urls?

I have this function in a class:
protected $supportedWebsitesUrls = ['www.youtube.com', 'www.vimeo.com', 'www.dailymotion.com'];
protected function isValid($videoUrl)
{
$urlDetails = parse_url($videoUrl);
if (in_array($urlDetails['host'], $this->supportedWebsitesUrls))
{
return true;
} else {
throw new \Exception('This website is not supported yet!');
return false;
}
}
It basically extracts the host name from any random url and then checks if it is in the $supportedWebsitesUrls array to ensure that it is from a supported website. But if I add say: dailymotion.com instead of www.dailymotion.com it won't detect that url. Also if I try to do WWW.DAILYMOTION.COM it still won't work. What can be done? Please help me.

You can use preg_grep function for this. preg_grep supports regex matches against a given array.
Sample use:
$supportedWebsitesUrls = array('www.dailymotion.com', 'www.youtube.com', 'www.vimeo.com');
$s = 'DAILYMOTION.COM';
if ( empty(preg_grep('/' . preg_quote($s, '/') . '/i', $supportedWebsitesUrls)) )
echo 'This website is not supported yet!\n';
else
echo "found a match\n";
Output:
found a match

You can run a few checks on it;
For lower case vs upper case, the php function strtolower() will sort you out.
as for checking with the www. at the beginning vs without it, you can add an extra check to your if clause;
if (in_array($urlDetails['host'], $this->supportedWebsitesUrls) || in_array('www.'.$urlDetails['host'], $this->supportedWebsitesUrls))

if else on variable link input

I have a method of pulling Youtube video data from API links. I use Wordpress and ran into a snag.
In order to pull the thumbnail, views, uploader and video title I need the user to input the 11 character code at the end of watch?v=_______. This is documented with specific instructions for the user, but what if they ignore it and paste the whole url?
// the url 'code' the user should input.
_gXp4hdd2pk
// the wrong way, when the user pastes the whole url.
https://www.youtube.com/watch?v=_gXp4hdd2pk
If the user accidentally pastes the entire URL and not the 11 character code then is there a way I can use PHP to grab either the code or whats at the end of this url (11 characters after 'watch?v='?
Here is my PHP code to pull the data:
// $url is the code at the end of 'watch?v=' that the user inputs
$url = get_post_meta ($post->ID, 'youtube_url', $single = true);
// $code is a variable for placing the $url in a youtube link so I can output it to an API link
$code = 'http://www.youtube.com/watch?v=' . $url;
// $code is called at the end of this oembed code, allowing me to decode json data and pull elements from json to echo in my html
// echoed output returns json file. example: http://www.youtube.com/oembed?url=http://www.youtube.com/watch?v=_gXp4hdd2pk
$json = file_get_contents('http://www.youtube.com/oembed?url='.urlencode($code));
Im looking for something like...
"if user inputs code, use this block of code, else if user inputs whole url use a different block of code, else throw error."
Or... if they use the whole URL can PHP only use a specific section of that url...?
EDIT: Thank you for all the answers! I am new to PHP, so thank you all for your patience. It is difficult for graphic designers to learn PHP, even reading the PHP manual can give us headaches. All of your answers were great and the ones ive tested have worked. Thank you so much :)

Try this,
$code = 'https://www.youtube.com/watch?v=_gXp4hdd2pk';
if (filter_var($code, FILTER_VALIDATE_URL) == TRUE) {
// if `$code` is valid url
$code_arr = explode('?v=', $code);
$query_str = explode('&', $code_arr[1]);
$new_code = $query_str[0];
} else {
// if `$code` is not a valid url like '_gXp4hdd2pk'
$new_code = $code;
}
echo $new_code;

Here's a simple option for you to do, unless you want to use regex like Nisse Engström's Answer.
Using the function parse_url() you could do something like this:
$url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184';
$split = parse_url('https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184');
$params = explode('&', $split['query']);
$video_id = str_replace('v=', '', $params[0]);
now $video_id would return:
_gXp4hdd2pk
from the $url supplied in the above code.
I suggest you read the parse_url() documentation to ensure you understand and grasp it all :-)
Update
for your comment.
You'd use something like this to make sure the parsed value is a valid URL:
// this will check if valid url
if (filter_var($code, FILTER_VALIDATE_URL)) {
// its valid as it returned true
// so run the code
$url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184';
$split = parse_url('https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184');
$params = explode('&', $split['query']);
$video_id = str_replace('v=', '', $params[0]);
} else {
// they must have posted the video code as the if check returned false.
$video_id = $url;
}

Just try as follows ..
$url =" https://www.youtube.com/watch?v=_gXp4hdd2pk";
$url= explode('?v=', $url);
$endofurl = end($url);
echo $endofurl;
Replace $url variable with input .

I instruct my users to copy and paste the whole youtube url.
Then, I do this:
$video_url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk'; // this is from user input
$parsed_url = parse_url($video_url);
parse_str($parsed_url['query'], $query);
$vidID = isset($query['v']) ? $query['v'] : NULL;
$url = "http://gdata.youtube.com/feeds/api/videos/". $vidID; // this is used for the Api

$m = array();
if (preg_match ('#^(https?://www.youtube.com/watch\\?v=)?(.+)$#', $url, $m)) {
$code = $m[2];
} else {
/* No match */
}
The code uses a Regular Expression to match the user input (the subject) against a pattern. The pattern is enclosed in a pair of delimiters (#) of your choice. The rest of the pattern works like this:
^ matches the beginning of the string.
(...) creates a subpattern.
? matches 0 or 1 of the preceeding character or subpattern.
https? matches "http" or "https".
\? matches "?".
(.+) matches 1 or more arbitrary charactes. The . matches any character (except newline). + matches 1 or more of the preceeding character or subpattern.
$ matches the end of the string.
In other words, optionally match an http or https base URL, followed by the video code.
The matches are then written to $m. $m[0] contains the entire string, $m[1] contains the first subpattern (base URL) and $m[2] contains the second subpattern (code).

PHP Auto-correcting URLs

I dont wan't reinvent wheel, but i couldnt find any library that would do this perfectly.
In my script users can save URLs, i want when they give me list like:
google.com
www.msn.com
http://bing.com/
and so on...
I want to be able to save in database in "correct format".
Thing i do is I check is it there protocol, and if it's not present i add it and then validate URL against RegExp.
For PHP parse_url any URL that contains protocol is valid, so it didnt help a lot.
How guys you are doing this, do you have some idea you would like to share with me?
Edit:
I want to filter out invalid URLs from user input (list of URLs). And more important, to try auto correct URLs that are invalid (ex. doesn't contains protocol). Ones user enter list, it should be validated immediately (no time to open URLs to check those they really exist).
It would be great to extract parts from URL, like parse_url do, but problem with parse_url is, it doesn't work well with invalid URLs. I tried to parse URL with it, and for parts that are missing (and are required) to add default ones (ex. no protocol, add http). But parse_url for "google.com" wont return "google.com" as hostname but as path.
This looks like really common problem to me, but i could not find available solution on internet (found some libraries that will standardize URL, but they wont fix URL if it is invalid).
Is there some "smart" solution to this, or I should stick with my current:
Find first occurrence of :// and validate if it's text before is valid protocol, and add protocol if missing
Found next occurrence of / and validate is hostname is in valid format
For good measure validate once more via RegExp whole URL
I just have feeling I will reject some valid URLs with this, and for me is better to have false positive, that false negative.

I had the same problem with parse_url as OP, this is my quick and dirty solution to auto-correct urls(keep in mind that the code in no way are perfect or cover all cases):
Results:
http:/wwww.example.com/lorum.html => http://www.example.com/lorum.html
gopher:/ww.example.com => gopher://www.example.com
http:/www3.example.com/?q=asd&f=#asd =>http://www3.example.com/?q=asd&f=#asd
asd://.example.com/folder/folder/ =>http://example.com/folder/folder/
.example.com/ => http://example.com/
example.com =>http://example.com
subdomain.example.com => http://subdomain.example.com
function url_parser($url) {
// multiple /// messes up parse_url, replace 2+ with 2
$url = preg_replace('/(\/{2,})/','//',$url);
$parse_url = parse_url($url);
if(empty($parse_url["scheme"])) {
$parse_url["scheme"] = "http";
}
if(empty($parse_url["host"]) && !empty($parse_url["path"])) {
// Strip slash from the beginning of path
$parse_url["host"] = ltrim($parse_url["path"], '\/');
$parse_url["path"] = "";
}
$return_url = "";
// Check if scheme is correct
if(!in_array($parse_url["scheme"], array("http", "https", "gopher"))) {
$return_url .= 'http'.'://';
} else {
$return_url .= $parse_url["scheme"].'://';
}
// Check if the right amount of "www" is set.
$explode_host = explode(".", $parse_url["host"]);
// Remove empty entries
$explode_host = array_filter($explode_host);
// And reassign indexes
$explode_host = array_values($explode_host);
// Contains subdomain
if(count($explode_host) > 2) {
// Check if subdomain only contains the letter w(then not any other subdomain).
if(substr_count($explode_host[0], 'w') == strlen($explode_host[0])) {
// Replace with "www" to avoid "ww" or "wwww", etc.
$explode_host[0] = "www";
}
}
$return_url .= implode(".",$explode_host);
if(!empty($parse_url["port"])) {
$return_url .= ":".$parse_url["port"];
}
if(!empty($parse_url["path"])) {
$return_url .= $parse_url["path"];
}
if(!empty($parse_url["query"])) {
$return_url .= '?'.$parse_url["query"];
}
if(!empty($parse_url["fragment"])) {
$return_url .= '#'.$parse_url["fragment"];
}
return $return_url;
}
echo url_parser('http:/wwww.example.com/lorum.html'); // http://www.example.com/lorum.html
echo url_parser('gopher:/ww.example.com'); // gopher://www.example.com
echo url_parser('http:/www3.example.com/?q=asd&f=#asd'); // http://www3.example.com/?q=asd&f=#asd
echo url_parser('asd://.example.com/folder/folder/'); // http://example.com/folder/folder/
echo url_parser('.example.com/'); // http://example.com/
echo url_parser('example.com'); // http://example.com
echo url_parser('subdomain.example.com'); // http://subdomain.example.com

It's not 100% foolproof, but a 1 liner.
$URL = (((strpos($URL,'https://') === false) && (strpos($URL,'http://') === false))?'http://':'' ).$URL;
EDIT
There was apparently a problem with my initial version if the hostname contain http.
Thanks Trent

PHP router without regular expressions

I have been working on a fancy router/dispatcher class for weeks now trying to decide how I wanted it, I got it perfect IMO except performance is not what I am wanting from it. It uses a route map arrap = /forums/viewthread/:id/:page => 'forums/viewthread/(?\d+)' and loops through my map array with regex to get a match, I am trying to get something better on a high traffic site, here is a start...
$uri = "forum/viewforum/id-522/page-3";
$parts = explode("/", $uri);
$controller = $parts['0'];
$method = $parts['1'];
if($parts['2'] != ''){
$idNumber = $parts['2'];
}
if($parts['3'] != ''){
$pageNumber = $parts['3'];
}
Where I need help is sometime an id and a page will not be present sometime one or the other and sometimes both, so obvioulsy my above code would not cover that, it assumes array item 2 is always the id and 3 is always the page, could someone show me a practical way of matchting up the page and id to a variable only if they exist in the URI and without using regular expressions?
You can see what I have so far on my regular expressions versions in this question Is this a good way to match URI to class/method in PHP for MVC

This seems more extendable:
$parts = explode("/", $uri);
$parts_count=count($parts);
//set default values
$page_info=array('id'=>0,'page'=>0);
for($i=2;$i<$parts_count;$i++) {
if(strpos($parts[$i],'-')!==FALSE) {
list($info_type,$info_val)=explode('-',$parts[$i],2);
if(isset($page_info[$info_type])) {
$page_info[$info_type]=(int)$info_val;
}
}
}
then just use $page_info values. You can easily add other values this way and more levels of '/'.

if ( ! empty($parts['2']))
{
if (strpos($parts['2'], 'id-') !== FALSE)
{
$idNumber = str_replace('id-', '', $parts['2']);
}
elseif (strpos($parts['2'], 'page-') !== FALSE)
{
$pageNumber = str_replace('id-', '', $parts['2']);
}
}
And do the same for $part[3]

basic if/then with PHP

Okay so i set up this thing so that I can print out page that people came from, and then put dummy tags on certain pages. Some pages have commented out "linkto" tags with text in between them.
My problem is that some of my pages don't have "linkto" text. When I link to this page from there I want it to grab everything between "title" and "/title". How can I change the eregi so that if it turns up empty, it should then grab the title?
Here is what I have so far, I know I just need some kind of if/then but I'm a rank beginner. Thank you in advance for any help:
<?php
$filesource = $_SERVER['HTTP_REFERER'];
$a = fopen($filesource,"r"); //fopen("html_file.html","r");
$string = fread($a,1024);
?>
<?php
if (eregi("<linkto>(.*)</linkto>", $string, $out)) {
$outdata = $out[1];
}
//echo $outdata;
$outdatapart = explode( " " , $outdata);
echo $part[0];
?>

Here you go: if eregi() fails to match, the $outdata assignment will never happen as the if block will not be executed. If it matches, but there's nothing between the tags, $outdata will be assigned an empty string. In both cases, !$outdata will be true, so we can fallback to a second match on the title tag instead.
if(eregi("<linkto>(.*?)</linkto>", $string, $link_match)) {
$outdata = $link_match[1];
}
if(!$outdata && eregi("<title>(.*?)</title>", $string, $title_match)) {
$outdata = $title_match[1];
}
I also changed the (.*) in the match to (.*?). This means, don't be greedy. In the (.*) form, if you had $string set to
<title>Page Title</title> ...
... <iframe><title>A second title tag!</title></iframe>
The regex would match
Page Title</title> ... ... <iframe><title>A second title tag!
Because it tries to match as much as possible, as long as the text is between any and any other !. In the (.*?) form, the match does what you'd expect - it matches
Page Title
And stops as soon as it is able.
...
As an aside, this thing is an interesting scheme, but why do you need it? Pages can link to other pages and pass parameters via the query string:
...
Then somescript.php can access the prevpage parameter via the $_GET['prevpage'] superglobal variable.
Would that solve your problem?

The POSIX regex extension (ereg etc.) will be deprecated as of PHP 5.3.0 and may be gone completely come PHP 6, you're better off using the PCRE functions (preg_match and friends).
The PCRE functions are also faster, binary safe and support more features like non-greedy matching etc.
Just a pointer.

you need if, else.
if(eregi(...))
{
.
.
.
}
else
{
just grab title;
}
perhaps you should have done a quick google search to find this very simple answer.

Just add another if test before you assign the match to $outdata:
if (eregi("<linkto>(.*)</linkto>", $string, $out)) {
if ($out[1] != "") {
$outdata = $out[1];
} else {
// Look in the title.
}
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.