Getting part of a string using REGEX

I have an Amazon link:
http://www.amazon.com/Pampers-Softcare-Fresh-Wipes-Count/dp/B007KXO998/ref=pd_zg_rss_ts_165796011_165796011_7?ie=UTF8&tag=elson06-20
I'm trying to get the product ID B007KXO998, which comes after dp/ and before /ref=pd_zg_rss_ts_165796011_165796011_7
I want to extract it using a regex or any other method that works.
The format of the URL is static; it will not change.

$string = 'http://www.amazon.com/iOttie-Windshield-INCREDIBLE-BlackBerry-Revolution/dp/B007FHX9OK?SubscriptionId=AKIAJJPPYQPVMQLOYLKQ&tag=elson06-20&linkCode=sp1&camp=2025&creative=165953&creativeASIN=B007FHX9OK';
//$string = 'http://www.amazon.com/Pampers-Softcare-Fresh-Wipes-Count/dp/B007KXO998/ref=pd_zg_rss_ts_165796011_165796011_7?ie=UTF8&tag=elson06-20';
$pid = basename((false !== strpos($string, '/ref='))
? pathinfo($string, PATHINFO_DIRNAME)
: parse_url($string, PHP_URL_PATH));
echo $pid; // Outputs B007KXO998 or B007FHX9OK, will work for both types of URLs
You don't need a regex, PHP has built-in functions to parse URLs.
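To see what those built-in functions return, here is a minimal sketch that prints the components parse_url() produces for the second (commented-out) URL above. The variable name $parts is only illustrative.

```php
<?php
// The second (commented-out) URL from the snippet above.
$string = 'http://www.amazon.com/Pampers-Softcare-Fresh-Wipes-Count/dp/B007KXO998/ref=pd_zg_rss_ts_165796011_165796011_7?ie=UTF8&tag=elson06-20';

// parse_url() splits the URL into an associative array of components.
$parts = parse_url($string);

echo $parts['host'], "\n";  // host name only
echo $parts['path'], "\n";  // everything between the host and the "?"
echo $parts['query'], "\n"; // everything after the "?"
```

The product ID is one segment of the path component, which is why basename() or explode('/', ...) can pick it out without a regex.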

Will the URLs always be in this exact format, or will it be expected to match any Amazon URL?
If the format will always be like this, then you can use @cryptic's answer. Otherwise, a pattern like |dp/([A-Z0-9]+)|i would be more flexible.
This will match any alphanumeric string (case insensitive) directly following dp/ in the string. Well, the entire match will include the dp/ part, but the parenthetical portion is a sub-match which will match only the product id.
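As a rough sketch of how that pattern could be used with preg_match (the function name extract_asin_after_dp is hypothetical, chosen just for this example):

```php
<?php
// Hypothetical helper; only the pattern comes from the text above.
function extract_asin_after_dp(string $url): ?string
{
    // $matches[0] holds the whole match including "dp/",
    // $matches[1] holds only the parenthesized sub-match.
    if (preg_match('|dp/([A-Z0-9]+)|i', $url, $matches)) {
        return $matches[1];
    }
    return null; // no "dp/<id>" segment found
}

$withRef = 'http://www.amazon.com/Pampers-Softcare-Fresh-Wipes-Count/dp/B007KXO998/ref=pd_zg_rss_ts_165796011_165796011_7?ie=UTF8&tag=elson06-20';
$withQuery = 'http://www.amazon.com/iOttie-Windshield-INCREDIBLE-BlackBerry-Revolution/dp/B007FHX9OK?SubscriptionId=AKIAJJPPYQPVMQLOYLKQ&tag=elson06-20';

echo extract_asin_after_dp($withRef), "\n";
echo extract_asin_after_dp($withQuery), "\n";
```

The character class stops at the first character that is not a letter or digit, so both the /ref=... and the ?... variants yield only the ID.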
Edit: According to this page, Amazon's product IDs (ASINs) can be present in a wide variety of URLs, making them difficult to match, and my code above won't catch them all.
One way to try to catch these would be to use parse_url to extract the host and path portions of the URL. From there, you can check the host against known Amazon domain names, then explode the path and check each component for an alphanumeric string that is ten characters long. Even then, the ASIN for books is the book's ISBN, and there are 13-digit versions which Amazon might use in some cases (though I don't have evidence that they do).
Here is a very basic example that I haven't thoroughly tested:
$url = get_url_from_wherever();
$url_parts = parse_url($url);
$host = $url_parts['host'];
$path = explode('/', $url_parts['path']);
$amazon_hosts = array(
'amazon.com', // United States
'amazon.ca', // Canada
'amazon.cn', // China
'amazon.fr', // France
'amazon.it', // Italy
'amazon.de', // Germany
'amazon.es', // Spain
'amazon.co.jp', // Japan
'amazon.co.uk', // United Kingdom
'amzn.to' // URL Shortener
);
$amazon_hosts = array_map('preg_quote', $amazon_hosts);
$asin = FALSE; // initialize in case we don't find the ASIN
if (preg_match('/(^|\.)(' . implode('|', $amazon_hosts) . ')$/i', $host)) {
// valid host
foreach($path as $path_component) {
if (preg_match('/^[A-Z0-9]{10}$/i', $path_component)) {
// this is probably the ASIN, since the string is a 10-character alphanumeric
$asin = $path_component;
}
}
}
if ($asin) {
// process ASIN
} else {
// couldn't find an ASIN in this URL
}

Here's what I did, since I'm pretty sure that the link has always the same format:
$link = 'http://www.amazon.com/Pampers-Softcare-Fresh-Wipes-Count/dp/B007KXO998/ref=pd_zg_rss_ts_165796011_165796011_7?ie=UTF8&tag=elson06-20';
$link = parse_url($link);
$link = explode('/',$link['path']);
$link = $link[3];
echo $link; //B007KXO998

Related

PHP Regex for IMDB/TMDB Urls

I'm writing code that compares links from IMDb and TMDb.
The code matches an IMDb link, or handles a TMDb link instead if that is what was entered.
The links look like:
https://www.imdb.com/title/tt0848228
https://www.themoviedb.org/movie/24428
I want to ask whether these regexes are correct for movie links.
For example:
$imdb_url = 'https://www.imdb.com/title/tt0848228';
if (strpos($imdb_url, 'themoviedb.org') == true) {
preg_match_all('/\\d*-/', $imdb_url, $tmdb_id);
$tmdb_id = $tmdb_id[0];
$tmdb_id = str_replace('-', '', $tmdb_id);
$tmdb_id = $tmdb_id[0];
$request_url = amy_movie_provider_build_query_url('tmdb', $tmdb_id, $api_key);
$the_data = wp_remote_get($request_url, array(
'timeout' => $timeout,
));
if (!is_wp_error($the_data) && !empty($the_data)) {
$movie_data = json_decode($the_data['body'], true);
$result = amy_movie_add_tmdb_movie_data($movie_data);
echo $result;
exit;
} else {
$result = esc_html__('Provider TMDB being error!', 'amy-movie-extend');
echo $result;
exit;
}
exit;
}
And the else branch for an IMDb link:
else if (strpos($imdb_url, 'www.imdb.com') == true) {
preg_match_all('/tt\\d{7}/', $imdb_url, $imdb_id);
$imdb_id = $imdb_id[0];
$imdb_id = $imdb_id[0];
}
I think it's not working because of the missing /movie prefix in the link, but I tried changing that and it still returns a 404 error.

Why not combine the domain part with the rest of the URI? Why omit the subdomain in one check and make it mandatory in the other?
$sURI= 'whatever';
if( preg_match( '#imdb\\.com/title/tt(\\d{7})#i', $sURI, $aMatch ) ) {
echo 'IMDb, movie #'. $aMatch[1];
} else
if( preg_match( '#themoviedb\\.org/movie/(\\d+)($|-)#i', $sURI, $aMatch ) ) {
echo 'TMDb, movie #'. $aMatch[1];
} else {
echo 'Unrecognized';
}
This way it doesn't matter whether the IMDb URI comes with www. or not. Since the movie IDs have a fixed length, we don't even need to expect or care about a trailing slash. Your mistake was expecting a slash without any need.
The same goes for TMDb, where the ID either ends the URI (and we want all digits up to the end, not just the first) or is followed by a dash. The i flag makes the match case-insensitive, for URIs that arrive with odd capitalization for whatever reason. Your mistake was to expect a dash and to make the digits entirely optional (when at least one should be required, as in https://www.themoviedb.org/movie/9).
Side note: using \\d in a PHP string for a regular expression is the correct way, because the string literal is processed first, and there a literal backslash has to be escaped by another backslash. Only after that does the regular expression engine see the pattern. \d alone also works, but only because PHP leaves unrecognized escape sequences in a string untouched.
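A small sketch of that side note, assuming single-quoted strings as in the code above (the variable names $escaped and $unescaped are made up for the demonstration):

```php
<?php
// In single-quoted strings only \\ and \' are escape sequences, so both
// literals below end up as the same two characters: a backslash and a "d".
$escaped   = '\\d';
$unescaped = '\d';

var_dump($escaped === $unescaped);
var_dump(strlen($escaped));

// Either way the regex engine receives \d{7} for the IMDb pattern.
preg_match('#imdb\\.com/title/tt(\\d{7})#i', 'https://www.imdb.com/title/tt0848228', $aMatch);
echo $aMatch[1], "\n";
```

Writing the doubled backslash is still the safer habit, since it keeps working if the following letter ever happens to form a real escape sequence in a double-quoted string.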

if else on variable link input

I have a method of pulling Youtube video data from API links. I use Wordpress and ran into a snag.
In order to pull the thumbnail, views, uploader and video title, I need the user to input the 11-character code at the end of watch?v=_______. This is documented with specific instructions for the user, but what if they ignore them and paste the whole URL?
// the url 'code' the user should input.
_gXp4hdd2pk
// the wrong way, when the user pastes the whole url.
https://www.youtube.com/watch?v=_gXp4hdd2pk
If the user accidentally pastes the entire URL and not the 11-character code, is there a way I can use PHP to grab either the code or what's at the end of this URL (the 11 characters after 'watch?v=')?
Here is my PHP code to pull the data:
// $url is the code at the end of 'watch?v=' that the user inputs
$url = get_post_meta ($post->ID, 'youtube_url', $single = true);
// $code is a variable for placing the $url in a youtube link so I can output it to an API link
$code = 'http://www.youtube.com/watch?v=' . $url;
// $code is called at the end of this oembed code, allowing me to decode json data and pull elements from json to echo in my html
// echoed output returns json file. example: http://www.youtube.com/oembed?url=http://www.youtube.com/watch?v=_gXp4hdd2pk
$json = file_get_contents('http://www.youtube.com/oembed?url='.urlencode($code));
I'm looking for something like...
"if user inputs code, use this block of code, else if user inputs whole url use a different block of code, else throw error."
Or... if they use the whole URL can PHP only use a specific section of that url...?
EDIT: Thank you for all the answers! I am new to PHP, so thank you all for your patience. It is difficult for graphic designers to learn PHP; even reading the PHP manual can give us headaches. All of your answers were great and the ones I've tested have worked. Thank you so much :)

Try this,
$code = 'https://www.youtube.com/watch?v=_gXp4hdd2pk';
if (filter_var($code, FILTER_VALIDATE_URL) == TRUE) {
// if `$code` is valid url
$code_arr = explode('?v=', $code);
$query_str = explode('&', $code_arr[1]);
$new_code = $query_str[0];
} else {
// if `$code` is not a valid url like '_gXp4hdd2pk'
$new_code = $code;
}
echo $new_code;

Here's a simple option for you to do, unless you want to use regex like Nisse Engström's Answer.
Using the function parse_url() you could do something like this:
$url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184';
$split = parse_url($url);
$params = explode('&', $split['query']);
$video_id = str_replace('v=', '', $params[0]);
Now $video_id would contain:
_gXp4hdd2pk
from the $url supplied in the above code.
I suggest you read the parse_url() documentation to ensure you understand and grasp it all :-)
Update
for your comment.
You'd use something like this to check whether the submitted value ($code here) is a valid URL:
// this will check if $code is a valid URL
if (filter_var($code, FILTER_VALIDATE_URL)) {
// it is a valid URL, so extract the video ID from its query string
$split = parse_url($code);
$params = explode('&', $split['query']);
$video_id = str_replace('v=', '', $params[0]);
} else {
// not a URL, so they must have posted the video code itself
$video_id = $code;
}

Just try the following:
$url = "https://www.youtube.com/watch?v=_gXp4hdd2pk";
$url = explode('?v=', $url);
$endofurl = end($url);
echo $endofurl;
Replace the value of $url with the user input.

I instruct my users to copy and paste the whole youtube url.
Then, I do this:
$video_url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk'; // this is from user input
$parsed_url = parse_url($video_url);
parse_str($parsed_url['query'], $query);
$vidID = isset($query['v']) ? $query['v'] : NULL;
$url = "http://gdata.youtube.com/feeds/api/videos/". $vidID; // this is used for the Api

$m = array();
if (preg_match ('#^(https?://www.youtube.com/watch\\?v=)?(.+)$#', $url, $m)) {
$code = $m[2];
} else {
/* No match */
}
The code uses a Regular Expression to match the user input (the subject) against a pattern. The pattern is enclosed in a pair of delimiters (#) of your choice. The rest of the pattern works like this:
^ matches the beginning of the string.
(...) creates a subpattern.
? matches 0 or 1 of the preceding character or subpattern.
https? matches "http" or "https".
\? matches "?".
(.+) matches 1 or more arbitrary characters. The . matches any character (except newline). + matches 1 or more of the preceding character or subpattern.
$ matches the end of the string.
In other words, optionally match an http or https base URL, followed by the video code.
The matches are then written to $m. $m[0] contains the entire string, $m[1] contains the first subpattern (base URL) and $m[2] contains the second subpattern (code).
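To make the contents of $m concrete, here is a short sketch that runs the same pattern against both kinds of input from the question. The $inputs and $codes arrays are just scaffolding for the demonstration.

```php
<?php
$pattern = '#^(https?://www.youtube.com/watch\\?v=)?(.+)$#';

// One bare code and one full URL, as in the question.
$inputs = ['_gXp4hdd2pk', 'https://www.youtube.com/watch?v=_gXp4hdd2pk'];
$codes  = [];

foreach ($inputs as $input) {
    $m = array();
    if (preg_match($pattern, $input, $m)) {
        // $m[1] is the optional base URL (empty when only the code was given),
        // $m[2] is whatever follows it.
        printf("base URL: '%s', code: '%s'\n", $m[1], $m[2]);
        $codes[] = $m[2];
    }
}
```

Note that (.+) takes everything after v=, so extra query parameters such as &list=... would end up in the code as well.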

I don't want to reinvent the wheel, but I couldn't find any library that does this perfectly.
In my script users can save URLs. When they give me a list like:
google.com
www.msn.com
http://bing.com/
and so on...
I want to be able to save them in the database in a "correct format".
What I do now is check whether a protocol is present, add one if it is not, and then validate the URL against a RegExp.
For PHP's parse_url, any URL that contains a protocol is valid, so it didn't help much.
How do you handle this? Do you have any ideas you would like to share?
Edit:
I want to filter out invalid URLs from user input (a list of URLs) and, more importantly, try to auto-correct URLs that are invalid (e.g. ones that don't contain a protocol). Once the user enters the list, it should be validated immediately (there is no time to open the URLs to check that they really exist).
It would be great to extract parts from the URL, like parse_url does, but the problem with parse_url is that it doesn't work well with invalid URLs. I tried to parse the URL with it and add default values for parts that are missing but required (e.g. no protocol, add http). But parse_url for "google.com" won't return "google.com" as the hostname; it returns it as the path.
This looks like a really common problem to me, but I could not find a ready-made solution on the internet (I found some libraries that standardize URLs, but they won't fix a URL if it is invalid).
Is there some "smart" solution to this, or should I stick with my current approach:
Find the first occurrence of :// and check that the text before it is a valid protocol, adding a protocol if it is missing
Find the next occurrence of / and validate that the hostname is in a valid format
For good measure, validate the whole URL once more via RegExp
I just have a feeling I will reject some valid URLs with this, and for me it is better to have a false positive than a false negative.
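The three steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name normalize_user_url, the list of allowed schemes and the hostname pattern are all invented for the example, the scheme is only looked for at the very start of the input, and the final check uses filter_var with FILTER_VALIDATE_URL in place of a hand-written RegExp.

```php
<?php
// Hypothetical helper: the function name, the allowed schemes and the host
// pattern are assumptions made for this sketch, not part of the question.
function normalize_user_url(string $input, array $allowedSchemes = ['http', 'https']): ?string
{
    $input = trim($input);
    if ($input === '') {
        return null;
    }

    // Step 1: check for a scheme at the very start; add http:// if none is there.
    if (preg_match('#^([a-z][a-z0-9+.-]*)://#i', $input, $m)) {
        if (!in_array(strtolower($m[1]), $allowedSchemes, true)) {
            return null; // a scheme is present but not one we accept
        }
    } else {
        $input = 'http://' . $input;
    }

    // Step 2: now that a scheme is present, parse_url() can find the host.
    $host = parse_url($input, PHP_URL_HOST);
    if (!is_string($host)
        || !preg_match('/^([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$/i', $host)) {
        return null;
    }

    // Step 3: final sanity check of the whole URL.
    return filter_var($input, FILTER_VALIDATE_URL) !== false ? $input : null;
}

foreach (['google.com', 'www.msn.com', 'http://bing.com/'] as $candidate) {
    echo normalize_user_url($candidate), "\n";
}
```

The hostname pattern is deliberately simple and will reject things like localhost, IP addresses and internationalized domains, so it would need loosening if false negatives are the bigger worry.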

I had the same problem with parse_url as the OP. This is my quick and dirty solution to auto-correct URLs (keep in mind that the code is in no way perfect and does not cover all cases):
Results:
http:/wwww.example.com/lorum.html => http://www.example.com/lorum.html
gopher:/ww.example.com => gopher://www.example.com
http:/www3.example.com/?q=asd&f=#asd => http://www3.example.com/?q=asd&f=#asd
asd://.example.com/folder/folder/ => http://example.com/folder/folder/
.example.com/ => http://example.com/
example.com => http://example.com
subdomain.example.com => http://subdomain.example.com
function url_parser($url) {
// multiple /// messes up parse_url, replace 2+ with 2
$url = preg_replace('/(\/{2,})/','//',$url);
$parse_url = parse_url($url);
if(empty($parse_url["scheme"])) {
$parse_url["scheme"] = "http";
}
if(empty($parse_url["host"]) && !empty($parse_url["path"])) {
// Strip slash from the beginning of path
$parse_url["host"] = ltrim($parse_url["path"], '\/');
$parse_url["path"] = "";
}
$return_url = "";
// Check if scheme is correct
if(!in_array($parse_url["scheme"], array("http", "https", "gopher"))) {
$return_url .= 'http'.'://';
} else {
$return_url .= $parse_url["scheme"].'://';
}
// Check if the right amount of "www" is set.
$explode_host = explode(".", $parse_url["host"]);
// Remove empty entries
$explode_host = array_filter($explode_host);
// And reassign indexes
$explode_host = array_values($explode_host);
// Contains subdomain
if(count($explode_host) > 2) {
// Check if subdomain only contains the letter w(then not any other subdomain).
if(substr_count($explode_host[0], 'w') == strlen($explode_host[0])) {
// Replace with "www" to avoid "ww" or "wwww", etc.
$explode_host[0] = "www";
}
}
$return_url .= implode(".",$explode_host);
if(!empty($parse_url["port"])) {
$return_url .= ":".$parse_url["port"];
}
if(!empty($parse_url["path"])) {
$return_url .= $parse_url["path"];
}
if(!empty($parse_url["query"])) {
$return_url .= '?'.$parse_url["query"];
}
if(!empty($parse_url["fragment"])) {
$return_url .= '#'.$parse_url["fragment"];
}
return $return_url;
}
echo url_parser('http:/wwww.example.com/lorum.html'); // http://www.example.com/lorum.html
echo url_parser('gopher:/ww.example.com'); // gopher://www.example.com
echo url_parser('http:/www3.example.com/?q=asd&f=#asd'); // http://www3.example.com/?q=asd&f=#asd
echo url_parser('asd://.example.com/folder/folder/'); // http://example.com/folder/folder/
echo url_parser('.example.com/'); // http://example.com/
echo url_parser('example.com'); // http://example.com
echo url_parser('subdomain.example.com'); // http://subdomain.example.com

It's not 100% foolproof, but it's a one-liner.
$URL = (((strpos($URL,'https://') === false) && (strpos($URL,'http://') === false))?'http://':'' ).$URL;
EDIT
There was apparently a problem with my initial version if the hostname contained http.
Thanks, Trent.
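A variation on the same idea, sketched here as an assumption rather than as the answerer's code, is to look for the scheme only at the very start of the string, so that an http:// appearing later in the URL does not suppress the prefix. The function name ensure_scheme is hypothetical.

```php
<?php
// Hypothetical helper: prepend http:// unless the string already
// starts with http:// or https:// (case-insensitive).
function ensure_scheme(string $url): string
{
    return preg_match('#^https?://#i', $url) ? $url : 'http://' . $url;
}

echo ensure_scheme('example.com'), "\n";
echo ensure_scheme('https://example.com'), "\n";
echo ensure_scheme('example.com/?next=http://other.example'), "\n";
```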

Create URL with only A-Z characters that includes variable and extension

I am trying to create file links based on a variable, with a "prefix" in front and an extension at the end.
Here's what I have:
$url = "http://www.example.com/mods/" . ereg("^[A-Za-z_\-]+$", $title) . ".php";
Example of what I want it to output (assuming $title = "testing";):
http://www.example.com/mods/testing.php
What it currently outputs:
http://www.example.com/mods/.php
Thanks in advance!

Perhaps this is what you need:
$title = "testing";
if(preg_match("/^[A-Za-z_\-]+$/", $title, $match)){
$url = "http://www.example.com/mods/".$match[0].".php";
}
else{
// Think of something to do here...
}
Now $url is http://www.example.com/mods/testing.php.

Do you want to keep letters and remove all other chars in the URL?
In this case the following should work:
$title = ...
$fixedtitle=preg_replace("/[^A-Za-z_-]/", "", $title);
$url = "http://www.example.com/mods/".$fixedtitle.".php";
The inverted character class removes everything you do not want.
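For instance, with a made-up title that contains spaces, a digit and punctuation (the sample value is hypothetical):

```php
<?php
$title = 'my title 2!'; // hypothetical input

// Remove every character that is not a letter, underscore or hyphen.
$fixedtitle = preg_replace("/[^A-Za-z_-]/", "", $title);
$url = "http://www.example.com/mods/" . $fixedtitle . ".php";

echo $url, "\n";
```

Unlike the preg_match approach, this never fails; it silently drops the unwanted characters, which may or may not be what you want for a file link.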

OK, first, it's important to realize that ereg() is deprecated (it was removed entirely in PHP 7), so you should use preg_match instead.
Secondly, both ereg() and preg_match return the status of the match, not the match itself. So
ereg("^[A-Za-z_\-]+$", $title)
returns the length of the matched string if you pass it a third variable to store the matches in, 1 if there's a match but you didn't pass that variable, and FALSE if there's no match.
I'm not sure why it's displaying
http://www.example.com/mods/.php
It should actually be outputting
http://www.example.com/mods/1.php
if everything was working correctly. A FALSE return concatenates as an empty string, which would produce the first URL, so the match is probably failing on the actual value of $title. Either way, it's not doing what you want. You need to pass another variable to the function that will store the matches found. If the match is successful (which you can check using the return value of the function), that variable will be an array of the matches.
Note that preg_match by default returns only the first match, but it still generates an array (which can be used to get isolated portions of the match), whereas preg_match_all matches multiple occurrences.
See http://www.php.net/manual/en/function.preg-match.php for more details.
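A quick sketch of that difference, using a made-up subject string:

```php
<?php
$subject = 'a1 b22 c333'; // hypothetical sample text

// preg_match stops after the first match and returns 1 (or 0 if none).
$found = preg_match('/\d+/', $subject, $first);

// preg_match_all keeps going and returns the number of matches.
$count = preg_match_all('/\d+/', $subject, $all);

print_r($first);  // array with the first match only
print_r($all[0]); // array with every match
```

In both cases the return value tells you whether anything matched, and the third argument receives the matched text.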
Your regex looks more or less correct.
So the proper code should look something like:
$title = 'testing'; //making sure that $title is what we think it is
if (preg_match('/^[A-Za-z_\-]+$/',$title,$matches)) {
$url = "http://www.example.com/mods/" . $matches[0] . ".php";
} else {
//match failed, put error code in here
}

URL validation detect movie title

How can I detect the movie title from a URL?
http://www.mysite.com/2430-Moonrise-Kingdom.aspx
http://www.mysite.com/2405-Dark-Shadows.aspx
http://www.mysite.com/2415-Madagascar-3-Europes-Most-Wanted.aspx
I need to convert the URL from:
http://www.mysite.com/2405-Dark-Shadows.aspx
to: "Dark Shadows" or "Dark-Shadows"
My code is:
$regexUrl = "/\/[0-9]{2,4}\-[a-zA-Z0-9\-\.]+\.aspx\/?";
echo preg_match($regexUrl, $_SERVER["REQUEST_URI"]);

Something seems to go wrong with your pattern at the end. Try this (note the last few characters):
/\/[0-9]{2,4}\-[a-zA-Z0-9\-\.]+\.aspx/
You'll also want to add a group in order to more easily grab the part that is the title.
/\/[0-9]{2,4}\-([a-zA-Z0-9\-\.]+)\.aspx\/?/
Finally, here's a simplified version that won't break if the number at the beginning has only one digit or more than four, or if the title has unusual characters (perhaps it's not an English-language film).
/\/\d+-(.+)\.aspx$/
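As a sketch of how the simplified pattern, written with both delimiters, could be used to get the title from the question's URLs (the function name title_from_url is hypothetical):

```php
<?php
// Hypothetical helper built around the simplified pattern.
function title_from_url(string $url): ?string
{
    // Group 1 captures everything between "<digits>-" and ".aspx".
    if (preg_match('/\/\d+-(.+)\.aspx$/', $url, $m)) {
        return str_replace('-', ' ', $m[1]); // "Dark-Shadows" becomes "Dark Shadows"
    }
    return null; // URL does not follow the expected format
}

echo title_from_url('http://www.mysite.com/2405-Dark-Shadows.aspx'), "\n";
echo title_from_url('http://www.mysite.com/2415-Madagascar-3-Europes-Most-Wanted.aspx'), "\n";
```

Drop the str_replace call if the hyphenated form is preferred.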

$url = 'http://www.mysite.com/2415-Madagascar-3-Europes-Most-Wanted.aspx';
$url = basename($url, '.aspx');
$url = substr($url, 1+strpos($url, '-'));
$url = strtr($url, '-', ' ');
echo $url;
But if this is your site, you should rather use this:
$id = (int)basename($url);
and load the title from the DB. That way, even when the title part of the URL gets corrupted, you can still load the proper page.
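A brief sketch of why the cast works: basename() returns the last path segment, and (int) keeps only its leading digits. The second URL below is a made-up "corrupted" variant, used only to show that the ID survives.

```php
<?php
$url = 'http://www.mysite.com/2415-Madagascar-3-Europes-Most-Wanted.aspx';
$id = (int)basename($url); // basename gives "2415-Madagascar-...aspx"

// Hypothetical corrupted URL: the title is wrong, the numeric prefix is intact.
$corrupted = 'http://www.mysite.com/2415-Madagaskar.aspx';
$sameId = (int)basename($corrupted);

echo $id, "\n";
echo $sameId, "\n";
```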
