PHP Regex to Remove http:// from string - php

I have full URLs as strings, but I want to remove the http:// at the beginning of the string to display the URL nicely (ex: www.google.com instead of http://www.google.com)
Can someone help?

$str = 'http://www.google.com';
$str = preg_replace('#^https?://#', '', $str);
echo $str; // www.google.com
That will work for both http:// and https://
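If the scheme might be written in upper case, the same pattern can take the i flag. A minimal sketch, assuming a helper name (strip_http_scheme) that is not part of the answer above:

```php
<?php
// Hypothetical helper: strips a leading http:// or https:// in any letter case.
function strip_http_scheme(string $url): string
{
    // ^ anchors the match to the start of the string, so an "http://"
    // that appears later (for example in a query string) is left alone.
    return preg_replace('#^https?://#i', '', $url);
}

echo strip_http_scheme('http://www.google.com'), PHP_EOL;  // www.google.com
echo strip_http_scheme('HTTPS://www.google.com'), PHP_EOL; // www.google.com
```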

You don't need a regular expression at all. Use str_replace instead.
str_replace('http://', '', $subject);
str_replace('https://', '', $subject);
Combined into a single operation as follows:
str_replace(array('http://','https://'), '', $urlString);
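One caveat worth checking before choosing this: str_replace removes every occurrence, not only the one at the start. A small comparison, using a made-up URL that contains a second http:// in its query string:

```php
<?php
// Example input (hypothetical): the query string contains another URL.
$urlString = 'http://example.com/redirect?to=http://other.example';

// str_replace removes every occurrence, including the one in the query string.
$everywhere = str_replace(array('http://', 'https://'), '', $urlString);

// An anchored pattern only touches the prefix.
$prefixOnly = preg_replace('#^https?://#', '', $urlString);

echo $everywhere, PHP_EOL; // example.com/redirect?to=other.example
echo $prefixOnly, PHP_EOL; // example.com/redirect?to=http://other.example
```

For plain display of simple URLs the difference rarely matters, but it does if the string can carry another URL inside it.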

Better use this:
$url = parse_url($url);
$url = $url['host'];
echo $url;
Simpler and works for http:// https:// ftp:// and almost all prefixes.
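Note that parse_url only reports a host when the input has a scheme; a bare www.google.com ends up in the path component instead. A sketch that falls back to the original string in that case (display_host is a hypothetical name):

```php
<?php
// Hypothetical helper: returns the host if parse_url() finds one,
// otherwise returns the input unchanged.
function display_host(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    // The host is null when the input has no scheme (it is then treated
    // as a path) and false when the URL is seriously malformed.
    return is_string($host) ? $host : $url;
}

echo display_host('http://www.google.com/search?q=php'), PHP_EOL; // www.google.com
echo display_host('www.google.com'), PHP_EOL;                     // www.google.com
```

Keep in mind this drops the path and query, which may or may not be what you want for display.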

Why not use parse_url instead?

To remove http://domain ( or https ) and to get the path:
$str = preg_replace('#^https?://[\w.-]*#', '', $str);
echo $str;

If you insist on using RegEx:
preg_match( "/^(https?:\/\/)?(.+)$/", $input, $matches );
$url = $matches[2];

Yeah, I think that str_replace() and substr() are faster and cleaner than regex. Here is a safe fast function for it. It's easy to see exactly what it does. Note: return substr($url, 7) and substr($url, 8), if you also want to remove the //.
// slash-slash protocol: remove https: or http: and leave // - if the string does not start with https:// or http://, return whatever was passed in
function universal_http_https_protocol($url) {
    // Breakout - give back bad passed in value
    if (empty($url) || !is_string($url)) {
        return $url;
    }
    // starts with http://
    if (strlen($url) >= 7 && "http://" === substr($url, 0, 7)) {
        // slash-slash protocol - remove http: leaving //
        return substr($url, 5);
    }
    // starts with https://
    elseif (strlen($url) >= 8 && "https://" === substr($url, 0, 8)) {
        // slash-slash protocol - remove https: leaving //
        return substr($url, 6);
    }
    // no match, return unchanged string
    return $url;
}

<?php
// (PHP 4, PHP 5, PHP 7)
// preg_replace — Perform a regular expression search and replace
$array = [
'https://lemon-kiwi.co',
'http://lemon-kiwi.co',
'lemon-kiwi.co',
'www.lemon-kiwi.co',
];
foreach ($array as $value) {
    $url = preg_replace("(^https?://)", "", $value);
    echo $url, PHP_EOL;
}
This code outputs:
lemon-kiwi.co
lemon-kiwi.co
lemon-kiwi.co
www.lemon-kiwi.co
See documentation PHP preg_replace

Related

php regex preg_replace_callback

I have some inherited code whose purpose is to identify URLs in a string and prepend the http:// protocol onto them if it doesn't exist.
return preg_replace_callback(
'/((https?:\/\/)?\w+(\.\w{2,})+[\w?&%=+\/]+)/i',
function ($match) {
if (stripos($match[1], 'http://') !== 0 && stripos($match[1], 'https://') !== 0) {
$match[1] = 'http://' . $match[1];
}
return $match[1];
},
$string);
It's working, except when a domain has a hyphen in it. So, for instance, the following string will only partially work.
$string = "In front mfever.com/1 middle http://mf-ever.com/2 at the end";
Can any regex genius see what's wrong with it?
You just need to add the optional dash:
((https?:\/\/)?\w+\-?\w+(\.\w{2,})+[\w?&%=+\/]+)
See it work here https://regex101.com/r/Tkdapj/1
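As an alternative sketch, a character class that includes the hyphen allows any number of hyphens in the first label rather than exactly one. This swaps the \w+\-?\w+ above for [\w-]+; the function name prepend_http is made up for the example:

```php
<?php
// Hypothetical wrapper around the question's callback, with [\w-]+ for the first label.
function prepend_http(string $text): string
{
    return preg_replace_callback(
        '#(?:https?://)?[\w-]+(?:\.\w{2,})+[\w?&%=+/]+#i',
        function (array $match): string {
            $url = $match[0];
            // Only add the scheme when the match does not already start with one.
            if (stripos($url, 'http://') !== 0 && stripos($url, 'https://') !== 0) {
                $url = 'http://' . $url;
            }
            return $url;
        },
        $text
    );
}

$string = "In front mfever.com/1 middle http://mf-ever.com/2 at the end";
echo prepend_http($string), PHP_EOL;
// In front http://mfever.com/1 middle http://mf-ever.com/2 at the end
```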

Greek URL conversion and trimming unwanted numbers and symbols

This problem is a little complicated since I'm a newbie to PHP encoding.
My site uses utf-8 encoding.
After a lot of tests, i found some solution. I use this kind of code:
function chr_conv($str)
{
    // URL-encoded Greek characters to look for
    $a = array('%CE%B2', '%CE%B3', '%CE%B4', '%CE%B5' /* etc. */);
    // Latin replacement characters, in the same order
    $b = array('a', 'b', 'c', 'd' /* etc. */);
    return str_replace($a, $b, $str);
}
function replace_old($str)
{
    // Fragments to strip from the URL
    $a1 = array('index.php', '/http://' /* etc. */);
    // Their replacements, in the same order
    $a2 = array('', '' /* etc. */);
    return str_replace($a1, $a2, $str);
}
function sanitize($url)
{
$url= replace_old(replace_old($url));
$url = strtolower($url);
$url = preg_replace('/[0-9]/', '', $url);
$url = preg_replace('/[?]/', '', $url);
$url = substr($url,1);
return $url;
}
function wbz404_process404()
{
$options = wbz404_getOptions();
$urlRequest = $_SERVER['REQUEST_URI'];
$url = chr_conv($urlRequest);
$requestedURL = replace_old(replace_old($url));
$requestedURL .= wbz404_SortQuery($urlParts);
//Get URL data if it's already in our database
$redirect = wbz404_loadRedirectData($requestedURL);
echo sanitize($requestedURL);
echo "</br>";
echo $requestedURL;
echo "</br>";
}
When incoming url is:
/content.php?147-%CE%A8%CE%AC%CF%81%CE%B9-%CE%BC%CE%B5-%CF%80%CF%81%CE%AC%CF%83%CE%B1%28%CE%A7%CE%BF%CF%8D%CE%BC%CF%80%CE%BB%CE%B9%CE%BA%29
I get:
/content.php?147-psari-me-prasa-choumplik
I want only:
/psari-me-prasa-choumplik
without the content.php?147- before URL.
But the most important problem is that I get an endless loop instead of the correct URL.
What am I doing wrong?
Keep in mind that an .htaccess solution won't work since I have a lighttpd server, not Apache.
I am assuming it's not always ?147- that you need to skip, but always everything up to the first hyphen. In that case, add the following before the echo:
$requestedURL = substr($requestedURL, strpos($requestedURL, '-') + 1);
This searches for the position of the first hyphen, adds one to skip the hyphen itself, and returns the rest of $requestedURL from there to the end of the string. (strpos finds the first occurrence; strrpos would find the last hyphen and return only choumplik.)
If the prefix is always /content.php?147-, replace strpos($requestedURL, '-') + 1 with the number 17.
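A short sketch of the first-hyphen approach, assuming the numeric prefix always ends at the first hyphen and that the result should keep a leading slash as in the question. It uses strpos, which finds the first occurrence, and does not address the endless loop:

```php
<?php
$requestedURL = '/content.php?147-psari-me-prasa-choumplik';

// strpos() returns the index of the first hyphen, or false if there is none.
$pos = strpos($requestedURL, '-');

// Keep everything after the first hyphen and put the leading slash back.
$slug = ($pos === false) ? $requestedURL : '/' . substr($requestedURL, $pos + 1);

echo $slug, PHP_EOL; // /psari-me-prasa-choumplik
```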

How to filter URLs that contain white space with preg match?

I parse through a text that contains several links. Some of them contain white spaces but have a file ending. My current pattern is:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);
This works the same way:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);
I don't know much about the patterns and didn't find a good tutorial that explains the meaning of all possible patterns and shows examples.
How could I filter an URL like this:
http://my-url.com/my doc.doc or even http://my-url.com/my doc with more white spaces.doc
The \s in that preg_match_all functions stands for a white space. But how could I check if there is a file ending behind one or some white spaces?
Is it possible?
Why not just make use of PHP's filter functions?
<?php
$url = "http://my-url.com/my doc.doc";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
OUTPUT :
URL is not valid
This might be what you are looking for; it uses urlencode:
$file = "my doc with more white spaces.doc";
echo "http://my-url.com/" . urlencode($file);
which produces:
http://my-url.com/my+doc+with+more+white+spaces.doc
or, with rawurlencode in place of urlencode, produces:
http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
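Side by side, the two encoders differ only in how they treat the space here. The variable names $plus and $raw are just for the example:

```php
<?php
$file = 'my doc with more white spaces.doc';

// urlencode() follows form encoding and turns spaces into "+".
$plus = 'http://my-url.com/' . urlencode($file);

// rawurlencode() follows RFC 3986 and turns spaces into "%20",
// which is the usual choice for the path part of a URL.
$raw = 'http://my-url.com/' . rawurlencode($file);

echo $plus, PHP_EOL; // http://my-url.com/my+doc+with+more+white+spaces.doc
echo $raw, PHP_EOL;  // http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
```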
EDIT: Something like the following might help to parse your urls with parse_url
$url = 'http://my-url.com/my doc with more white spaces.doc';
$purl = parse_url($url);
$rurl = "";
if(isset($purl['scheme'])){
$rurl .= $purl['scheme'] . "://";
}
if(isset($purl['host'], $purl['path'])){
$rurl .= $purl['host'] . rawurlencode($purl['path']);
}
if($rurl === ""){
$rurl = $url;#error parsing error/invalid url?
}
for sub directories you can do
$purl['path'] = implode('/', array_map(function($value){return rawurlencode($value);}, explode('/', $purl['path'])));
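Putting the two pieces together, a sketch that encodes each path segment separately so the slashes survive. The function name encode_url_path is hypothetical, and for brevity it ignores port, query and fragment:

```php
<?php
// Hypothetical helper: percent-encodes each path segment of a URL.
// Simplification: port, query string and fragment are dropped.
function encode_url_path(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not something we can safely rebuild
    }
    // Encode segment by segment; encoding the whole path would also
    // turn the "/" separators into %2F.
    $segments = explode('/', $parts['path'] ?? '');
    $path = implode('/', array_map('rawurlencode', $segments));

    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

echo encode_url_path('http://my-url.com/docs/my doc with more white spaces.doc'), PHP_EOL;
// http://my-url.com/docs/my%20doc%20with%20more%20white%20spaces.doc
```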
I don't know much about php but this regex
(http|ftp)(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
will match every url even with spaces
I think this regex will do.
use this regex
preg_match_all("/^(?si)(?>\s*)(((?>https?:\/\/(?>www\.)?)?(?=[\.-a-z0-9]{2,253}(?>$|\/|\?|\s))[a-z0-9][a-z0-9-]{1,62}(?>\.[a-z0-9][a-z0-9-]{1,62})+)(?>(?>\/|\?).*)?)?(?>\s*)$/", $input_lines, $output_array);
Alright after doing this really helpful tutorial I finally know how the regex syntax works. After finishing it I experimented a bit on this site
It was pretty easy after figuring out that all hyperlinks in my parsed document were in between quotation marks so I just had to change the regex to:
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);
so that after the " it is looking for the next match that begins with http.
But that's not the full solution yet. The user Class was right: without applying rawurlencode to the filenames, it won't work.
So the next step was this:
function endsWith($haystack, $needle)
{
return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}
if(endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")){
$file = substr( $textlink, strrpos( $textlink, '/' )+1 );
$rest_url=substr($textlink, 0, strrpos($textlink, '/' )+1 );
$textlink=$rest_url.rawurlencode($file);
}
That filters the filenames from the URLs and rawurlencodes them so that the output links are correct.
I think this should work:
$url = '...';
$url_new = '';
$array = explode(' ',$url);
foreach($array as $name => $val){
if ($val!=' '){
$url_new = $url_new.$val;
}
}

Regex to parse megavideo URL

I'm trying to write a regex to parse this URL for a PHP script:
http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780
to get this: B4PZHP0N
Can someone help? Thanks in advance.
Since you're in PHP, just use parse_url and substr:
$mega = 'http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780';
$want = substr(parse_url($mega, PHP_URL_PATH), 3, 8);
Demo: http://ideone.com/f3viH
Try this regex:
/^http:\/\/www\.megavideo\.com\/v\/(.{8}).*$/
(The error has been corrected)
Also see my ideone or my jsfiddle.
/([^:.\/]+)[a-f0-9]{32}/
So if it matches, B4PZHP0N is in capture buffer 1, ie: $1
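In PHP the capture is read from the matches array rather than $1. A sketch, assuming the ID is always 8 characters followed by a 32-character hex string as in the example URL; megavideo_id is a made-up name:

```php
<?php
// Hypothetical helper: returns the 8-character ID, or null if the URL does not fit.
// Assumption: the ID is followed by exactly 32 lower-case hex characters.
function megavideo_id(string $url): ?string
{
    if (preg_match('#/v/(.{8})[a-f0-9]{32}$#', $url, $matches)) {
        return $matches[1]; // first capture group
    }
    return null;
}

echo megavideo_id('http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780'), PHP_EOL;
// B4PZHP0N
```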
I have done something similar but a bit more generic.
so the id can come either after /v/, ?v= or &v=
$url = 'http://www.megavideo.com/v/B4PZHP0Nb2e8a877f8378e778446318596415780';
foreach (array('/v/', '?v=', '&v=') as $k)
{
$pos = strpos($url, $k);
if ($pos>0)
{
$pos += strlen($k);
break;
}
}
if (!$pos)
die("not found");
$id = substr($url, $pos, 8);
die($id);

Remove protocol and subdomain from URL

I have a string like this:
http://www.downlinegoldmine.com/viralmarketing
I need to remove http://www. from the string if it exists, as well as http:// if www is not included.
In few words I just need the domain name without any protocol.
parse_url is the perfect tool for the job. You would first call it to split the url in parts, then check the hostname part to see if it starts with www. and strip it, then assemble the url back.
Update: code
echo normalize_url('http://www.downlinegoldmine.com/viralmarketing');
function normalize_url($url) {
    $parts = parse_url($url);
    unset($parts['scheme']);
    // parse_url() stores the hostname under the 'host' key
    if (substr($parts['host'], 0, 4) == 'www.') {
        $parts['host'] = substr($parts['host'], 4);
    }
    if (function_exists('http_build_url')) {
        // This PECL extension makes life a lot easier
        return http_build_url($parts);
    }
    // Otherwise it's the hard way
    $result = '';
    // parse_url() uses the keys 'user' and 'pass' for credentials
    if (!empty($parts['user'])) {
        $result .= $parts['user'];
        if (!empty($parts['pass'])) {
            $result .= ':'.$parts['pass'];
        }
        // credentials are separated from the host by @
        $result .= '@';
    }
    $result .= $parts['host'].(isset($parts['path']) ? $parts['path'] : '');
    if (!empty($parts['query'])) {
        $result .= '?'.$parts['query'];
    }
    if (!empty($parts['fragment'])) {
        $result .= '#'.$parts['fragment'];
    }
    return $result;
}
See it in action.
Just use parse_url (see: http://php.net/manual/de/function.parse-url.php ). It will also incorporate different protocols and paths etc.
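For the simple case in the question, a shorter parse_url sketch could look like this. The name bare_domain is hypothetical, and it keeps only host and path:

```php
<?php
// Hypothetical helper: host without a leading "www.", followed by the path.
// Query string and fragment are ignored in this sketch.
function bare_domain(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return $url; // no scheme or unparsable: leave the input alone
    }
    $host = preg_replace('/^www\./i', '', $host);
    $path = (string) parse_url($url, PHP_URL_PATH); // null becomes ''

    return $host . $path;
}

echo bare_domain('http://www.downlinegoldmine.com/viralmarketing'), PHP_EOL;
// downlinegoldmine.com/viralmarketing
echo bare_domain('http://downlinegoldmine.com/viralmarketing'), PHP_EOL;
// downlinegoldmine.com/viralmarketing
```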
$nvar = preg_replace("#http://(www\.)?#i", "", "http://www.downlinegoldmine.com/viralmarketing");
Test:
php> echo preg_replace("#http://(www\.)?#i", "", "http://www.downlinegoldmine.com/viralmarketing");
downlinegoldmine.com/viralmarketing
php> echo preg_replace("#http://(www\.)?#i", "", "http://downlinegoldmine.com/viralmarketing");
downlinegoldmine.com/viralmarketing
There's probably a better way, but:
$url = preg_replace("#^(http://)?(www\\.)?#i", "", $url);
$url = strncmp('http://', $url, 7) ? $url : substr($url, 7);
$url = strncmp('www.', $url, 4) ? $url : substr($url, 4);
You can use the following to remove https://, http://, and www. from a URL. The dot after www is escaped so that it only matches a literal dot.
$url = 'http://www.downlinegoldmine.com/viralmarketing';
echo preg_replace('/https?:\/\/|www\./', '', $url);
The above returns downlinegoldmine.com/viralmarketing
You can use the following to remove the URL's path as well as https://, http://, and www.
$url = 'http://www.downlinegoldmine.com/viralmarketing';
echo implode('/', array_slice(explode('/', preg_replace('/https?:\/\/|www\./', '', $url)), 0, 1));
The above returns downlinegoldmine.com
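Two details of an unanchored pattern are easy to miss: an unescaped dot matches any character, and without ^ the pattern can match in the middle of the URL. A sketch with a made-up URL whose path happens to contain "www":

```php
<?php
// Example input (hypothetical): the path contains the letters "www".
$url = 'http://example.com/awwwards';

// Unanchored, unescaped dot: "wwwa" inside the path is removed too.
$loose = preg_replace('/https?:\/\/|www./', '', $url);

// Anchored, with the dot escaped: only the prefix is removed.
$anchored = preg_replace('#^(?:https?://)?(?:www\.)?#i', '', $url);

echo $loose, PHP_EOL;    // example.com/ards
echo $anchored, PHP_EOL; // example.com/awwwards

// The domain alone is the part before the first slash.
echo explode('/', $anchored, 2)[0], PHP_EOL; // example.com
```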
