Regex to parse megavideo URL - php

I'm trying to write a regex to parse this URL for a PHP script:
http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780
to get this: B4PZHP0N
Can someone help? Thanks in advance.

Since you're in PHP, just use parse_url and substr:
$mega = 'http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780';
$want = substr(parse_url($mega, PHP_URL_PATH), 3, 8);
Demo: http://ideone.com/f3viH

Try this regex:
/^http:\/\/www\.megavideo\.com\/v\/(.{8}).*$/
(The error has been corrected)
Also see my ideone or my jsfiddle.

/([^:.\/]+)[a-f0-9]{32}/
So if it matches, B4PZHP0N is in capture buffer 1, ie: $1
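As a minimal sketch of how that capture group can be read in PHP (the function name megavideoId is made up for illustration; the pattern is the one given above):

```php
<?php
// Hypothetical helper; only the regex comes from the answer above.
function megavideoId(string $url): ?string
{
    // Group 1 captures whatever precedes the trailing 32 hex characters.
    if (preg_match('/([^:.\/]+)[a-f0-9]{32}/', $url, $m)) {
        return $m[1];
    }
    return null; // no match
}

echo megavideoId('http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780'), "\n";
```

preg_match() fills $m[0] with the whole match and $m[1] with the first group, which is the PHP equivalent of $1.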

I have done something similar, but a bit more generic,
so that the ID can come after /v/, ?v= or &v=:
$url = 'http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780';
foreach (array('/v/', '?v=', '&v=') as $k)
{
$pos = strpos($url, $k);
if ($pos>0)
{
$pos += strlen($k);
break;
}
}
if (!$pos)
die("not found");
$id = substr($url, $pos, 8);
die($id);
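The same idea could also be expressed as a single regex. This is only a sketch: the function name videoIdFromUrl is hypothetical, and it assumes the ID is always exactly 8 alphanumeric characters, which the question suggests but does not guarantee.

```php
<?php
// Hypothetical helper; assumes an 8-character alphanumeric ID.
function videoIdFromUrl(string $url): ?string
{
    // Accept the ID after "/v/", "?v=" or "&v=".
    if (preg_match('~(?:/v/|[?&]v=)([A-Za-z0-9]{8})~', $url, $m)) {
        return $m[1];
    }
    return null; // none of the three markers was found
}

echo videoIdFromUrl('http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780'), "\n";
```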

Related

PHP replace URL segment with str_replace();

I have "/foo/bar/url/" coming straight after my domain name.
What I want is to find the penultimate slash in my string and replace it with a slash followed by a hash sign, i.e. turn / into /#. (The problem is not how to get the URL, but how to handle it.)
How could this be achieved? What is the best practice for doing things like this?
At the moment I'm pretty sure that I should use str_replace().
UPD. I think preg_replace() would be suitable for my case. But then there is another problem: what should the regexp look like to solve my issue?
P.S. Just in case: I'm using the SilverStripe framework (v3.1.12).
$url = '/foo/bar/url/';
if (false !== $last = strrpos($url, '/')) {
if (false !== $penultimate = strrpos($url, '/', $last - strlen($url) - 1)) {
$url = substr_replace($url, '/#', $penultimate, 1);
}
}
echo $url;
This will output
/foo/bar/#url/
If you want to strip the last /:
echo rtrim($url, '/'); // print /foo/bar/#url
Here is a method that would work. There are probably cleaner ways.
// Let's assume you already have $url_string populated
$url_string = "http://whatever.com/foo/bar/url/";
$url_explode = explode("/",$url_string); // split on forward slashes
$portion_count = count($url_explode);
$affected_portion = $portion_count - 2; // Minus two because array index starts at 0 and also we want the second to last occurrence
$i = 0;
$output = "";
foreach ($url_explode as $portion){
if ($i > 0){
$output.= "/"; // put back the separator that explode() removed
}
if ($i == $affected_portion){
$output.= "#"; // the hash goes right after the penultimate slash
}
$output.=$portion;
$i++;
}
$new_url = $output; // http://whatever.com/foo/bar/#url/
Assuming you now have
$url = $this->Link(); // e.g. /foo/bar/my-urlsegment
You can combine it like
$handledUrl = $this->ParentID
? $this->Parent()->Link() . '#' . $this->URLSegment
: $this->Link();
where $this->Parent()->Link() is e.g. /foo/bar and $this->URLSegment is my-urlsegment
$this->ParentID also checks if we have a parent page or are on the top level of SiteTree
I might be too late to answer this question, but I thought this might help you. You can simply use preg_replace like this:
$url = '/foo/bar/url/';
echo preg_replace('~(\/)(\w+)\/$~',"$1#$2",$url);
Output:
/foo/bar/#url
In my case this solved my problem:
$url = $this->Link();
$url = rtrim($url, '/');
$url = substr_replace($url, '#', strrpos($url, '/') + 1, 0);

How to filter URLs that contain white space with preg match?

I parse through a text that contains several links. Some of them contain white spaces but have a file ending. My current pattern is:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);
This works the same way:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);
I don't know much about the patterns and didn't find a good tutorial that explains the meaning of all possible patterns and shows examples.
How could I filter an URL like this:
http://my-url.com/my doc.doc or even http://my-url.com/my doc with more white spaces.doc
The \s in those preg_match_all calls stands for a whitespace character. But how could I check whether there is a file extension after one or more spaces?
Is it possible?
Why not just make use of PHP's filter functions?
<?php
$url = "http://my-url.com/my doc.doc";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
OUTPUT :
URL is not valid
This might be what you are looking for; it uses urlencode:
$file = "my doc with more white spaces.doc";
echo "http://my-url.com/" . urlencode($file);
which produces:
http://my-url.com/my+doc+with+more+white+spaces.doc
or with rawurlencode
produces:
http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
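For comparison, here is a small sketch that prints both variants side by side, using the same file name as above:

```php
<?php
$file = "my doc with more white spaces.doc";

// urlencode() turns each space into "+" (form-style encoding).
echo "http://my-url.com/" . urlencode($file), "\n";

// rawurlencode() turns each space into "%20", which is the usual form inside a URL path.
echo "http://my-url.com/" . rawurlencode($file), "\n";
```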
EDIT: Something like the following might help to parse your urls with parse_url
$url = 'http://my-url.com/my doc with more white spaces.doc';
$purl = parse_url($url);
$rurl = "";
if(isset($purl['scheme'])){
$rurl .= $purl['scheme'] . "://";
}
if(isset($purl['host'], $purl['path'])){
$rurl .= $purl['host'] . rawurlencode($purl['path']);
}
if($rurl === ""){
$rurl = $url;#error parsing error/invalid url?
}
Note that rawurlencode() also encodes the slashes in the path, so for URLs with subdirectories encode each segment separately:
$purl['path'] = implode('/', array_map(function($value){return rawurlencode($value);}, explode('/', $purl['path'])));
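Putting those pieces together, one possible sketch looks like the following. The function name encodeUrlPath is hypothetical, and the sketch deliberately ignores port, query string and fragment; it also assumes the path is not already percent-encoded (otherwise it would be encoded twice).

```php
<?php
// Hypothetical helper: percent-encodes every path segment but keeps the slashes.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not parseable as an absolute URL; return it unchanged
    }
    $path = isset($parts['path']) ? $parts['path'] : '';
    // Encode segment by segment so that "/" survives.
    $segments = array_map('rawurlencode', explode('/', $path));
    return $parts['scheme'] . '://' . $parts['host'] . implode('/', $segments);
}

echo encodeUrlPath('http://my-url.com/docs/my doc with more white spaces.doc'), "\n";
```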
I don't know much about php but this regex
(http|ftp)(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
will match every url even with spaces
I think this regex will do.
use this regex
preg_match_all("/^(?si)(?>\s*)(((?>https?:\/\/(?>www\.)?)?(?=[\.-a-z0-9]{2,253}(?>$|\/|\?|\s))[a-z0-9][a-z0-9-]{1,62}(?>\.[a-z0-9][a-z0-9-]{1,62})+)(?>(?>\/|\?).*)?)?(?>\s*)$/", $input_lines, $output_array);
Alright after doing this really helpful tutorial I finally know how the regex syntax works. After finishing it I experimented a bit on this site
It was pretty easy after figuring out that all hyperlinks in my parsed document were in between quotation marks so I just had to change the regex to:
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);
so that after the " it is looking for the next match that begins with http.
But that's not the full solution yet. The user Class was right: without applying rawurlencode to the filenames it won't work.
So the next step was this:
function endsWith($haystack, $needle)
{
return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}
if(endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")){
$file = substr( $textlink, strrpos( $textlink, '/' )+1 );
$rest_url=substr($textlink, 0, strrpos($textlink, '/' )+1 );
$textlink=$rest_url.rawurlencode($file);
}
That extracts the filenames from the URLs and rawurlencodes them so that the output links are correct.
I think this should work:
$url = '...';
$url_new = '';
$array = explode(' ',$url);
foreach($array as $name => $val){
if ($val!=' '){
$url_new = $url_new.$val;
}
}

how can I get a list of all files and urls on a webpage

I'm trying to get a list of all files and URLs on a webpage. It's something like the list given on http://tools.pingdom.com when you type in some URL. Now I'm trying to do this in PHP by using cURL or wget. Does anyone have a suggestion about how I can get this kind of file/path list?
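$url="http://www.xyz.com";
$data=file_get_contents($url);
$data = strip_tags($data,"<a>"); // keep only the <a> tags
$d = preg_split("/<\/a>/",$data);
$urls = array(); // collected href values
foreach ( $d as $k=>$string ){
if( strpos($string, "<a href=") !== FALSE ){
// drop everything up to and including the opening quote of href
$string = preg_replace("/.*<a\s+href=\"/sm","",$string);
// drop everything from the closing quote onwards
$string = preg_replace("/\".*/s","",$string);
$urls[] = $string;
}
}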
edit:
or you can use this function:
function getAllUrls($string)
{
$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
return ($matches[0]);
}
$url_array = getAllUrls($string);
print_r($url_array);
Once you have the document in a string use regex to find all the URLs.
Match URLs with regex
Use regex with PHP
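As an alternative to the regex approaches above, here is a rough sketch that uses PHP's DOM extension instead (a different technique from the answers, and it assumes ext-dom is available). The function name extractLinks and the choice of tags are assumptions made for illustration; it only collects href/src attributes from a few common tags and does not resolve relative URLs.

```php
<?php
// Hypothetical helper: collects href/src attribute values from an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    // Tag name => attribute that holds the URL (an illustrative selection).
    $sources = array('a' => 'href', 'link' => 'href', 'img' => 'src', 'script' => 'src');
    foreach ($sources as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value !== '') {
                $links[] = $value;
            }
        }
    }
    return array_values(array_unique($links));
}

$html = '<html><head><link rel="stylesheet" href="/style.css"></head>'
      . '<body><a href="http://example.com/a">A</a><img src="/logo.png">'
      . '<script src="/app.js"></script></body></html>';
print_r(extractLinks($html));
```

The HTML itself could be fetched with file_get_contents() or cURL, as in the first answer.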

PHP Regex to Remove http:// from string

I have full URLs as strings, but I want to remove the http:// at the beginning of the string to display the URL nicely (ex: www.google.com instead of http://www.google.com)
Can someone help?
$str = 'http://www.google.com';
$str = preg_replace('#^https?://#', '', $str);
echo $str; // www.google.com
That will work for both http:// and https://
You don't need regular expression at all. Use str_replace instead.
str_replace('http://', '', $subject);
str_replace('https://', '', $subject);
Combined into a single operation as follows:
str_replace(array('http://','https://'), '', $urlString);
Better use this:
$url = parse_url($url);
$url = $url['host'];
echo $url;
Simpler and works for http:// https:// ftp:// and almost all prefixes.
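A small sketch of the parse_url approach with a fallback follows; the function name hostOf is made up. Two things are worth knowing: parse_url only reports a host when the string has a scheme (or starts with //), and taking the host also drops any path, which may or may not be what you want for display.

```php
<?php
// Hypothetical helper: returns the host part, or the input unchanged if none is found.
function hostOf(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? $host : $url;
}

echo hostOf('http://www.google.com/search?q=php'), "\n";
echo hostOf('www.google.com'), "\n"; // no scheme, so the input comes back as-is
```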
Why not use parse_url instead?
To remove http://domain ( or https ) and to get the path:
$str = preg_replace('#^https?\:\/\/([\w*\.]*)#', '', $str);
echo $str;
If you insist on using RegEx:
preg_match( "/^(https?:\/\/)?(.+)$/", $input, $matches );
$url = $matches[2];
Yeah, I think that str_replace() and substr() are faster and cleaner than regex. Here is a safe fast function for it. It's easy to see exactly what it does. Note: return substr($url, 7) and substr($url, 8), if you also want to remove the //.
// slash-slash protocol remove https:// or http:// and leave // - if it's not a string starting with https:// or http:// return whatever was passed in
function universal_http_https_protocol($url) {
// Breakout - give back bad passed in value
if (empty($url) || !is_string($url)) {
return $url;
}
// starts with http://
if (strlen($url) >= 7 && "http://" === substr($url, 0, 7)) {
// slash-slash protocol - remove http: leaving //
return substr($url, 5);
}
// starts with https://
elseif (strlen($url) >= 8 && "https://" === substr($url, 0, 8)) {
// slash-slash protocol - remove https: leaving //
return substr($url, 6);
}
// no match, return unchanged string
return $url;
}
<?php
// (PHP 4, PHP 5, PHP 7)
// preg_replace — Perform a regular expression search and replace
$array = [
'https://lemon-kiwi.co',
'http://lemon-kiwi.co',
'lemon-kiwi.co',
'www.lemon-kiwi.co',
];
foreach( $array as $value ){
$url = preg_replace("(^https?://)", "", $value );
echo $url . "\n";
}
This code outputs:
lemon-kiwi.co
lemon-kiwi.co
lemon-kiwi.co
www.lemon-kiwi.co
See documentation PHP preg_replace

Remove protocol and subdomain from URL

I have a string like this:
http://www.downlinegoldmine.com/viralmarketing
I need to remove http://www. from the string if it exists, as well as http:// if www is not included.
In few words I just need the domain name without any protocol.
parse_url is the perfect tool for the job. You would first call it to split the url in parts, then check the hostname part to see if it starts with www. and strip it, then assemble the url back.
Update: code
echo normalize_url('http://www.downlinegoldmine.com/viralmarketing');
function normalize_url($url) {
$parts = parse_url($url);
unset($parts['scheme']);
// parse_url() uses the keys 'host', 'user' and 'pass'
if (substr($parts['host'], 0, 4) == 'www.') {
$parts['host'] = substr($parts['host'], 4);
}
if (function_exists('http_build_url')) {
// This PECL extension makes life a lot easier
return http_build_url($parts);
}
// Otherwise it's the hard way
$result = null;
if (!empty($parts['user'])) {
$result .= $parts['user'];
if (!empty($parts['pass'])) {
$result .= ':'.$parts['pass'];
}
$result .= '@';
}
$result .= $parts['host'].$parts['path'];
if (!empty($parts['query'])) {
$result .= '?'.$parts['query'];
}
if (!empty($parts['fragment'])) {
$result .= '#'.$parts['fragment'];
}
return $result;
}
See it in action.
Just use parse_url (see: http://php.net/manual/de/function.parse-url.php ). It will also incorporate different protocols and paths etc.
$nvar = preg_replace("#http://(www\.)?#i", "", "http://www.downlinegoldmine.com/viralmarketing");
Test:
php> echo preg_replace("#http://(www\.)?#i", "", "http://www.downlinegoldmine.com/viralmarketing");
downlinegoldmine.com/viralmarketing
php> echo preg_replace("#http://(www\.)?#i", "", "http://downlinegoldmine.com/viralmarketing");
downlinegoldmine.com/viralmarketing
There's probably a better way, but:
$url = preg_replace("#^(http://)?(www\\.)?#i", "", $url);
$url = strncmp('http://', $url, 7) ? $url : substr($url, 7);
$url = strncmp('www.', $url, 4) ? $url : substr($url, 4);
You can use the following to remove the https://, http://, and www. from a url.
$url = 'http://www.downlinegoldmine.com/viralmarketing';
echo preg_replace('/https?:\/\/|www\./', '', $url);
above returns downlinegoldmine.com/viralmarketing
and you can use the following to remove the urls path as well as the https://, http://, and www..
$url = 'http://www.downlinegoldmine.com/viralmarketing';
echo implode('/', array_slice(explode('/',preg_replace('/https?:\/\/|www\./', '', $url)), 0, 1));
above returns downlinegoldmine.com
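If only the bare domain is needed, a sketch based on parse_url avoids touching a "www." that appears elsewhere in the string. The function name bareDomain is hypothetical, and it returns null when the input has no scheme, because parse_url cannot identify a host in that case.

```php
<?php
// Hypothetical helper: host without the scheme, a leading "www.", or the path.
function bareDomain(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host found, e.g. the scheme is missing
    }
    // Strip "www." only at the very start of the host.
    return preg_replace('/^www\./i', '', $host);
}

echo bareDomain('http://www.downlinegoldmine.com/viralmarketing'), "\n";
```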
