php regex preg_replace_callback - php

I have some inherited code whose purpose is to identify URLs in a string and prepend the http:// protocol to them if it is missing.
return preg_replace_callback(
'/((https?:\/\/)?\w+(\.\w{2,})+[\w?&%=+\/]+)/i',
function ($match) {
if (stripos($match[1], 'http://') !== 0 && stripos($match[1], 'https://') !== 0) {
$match[1] = 'http://' . $match[1];
}
return $match[1];
},
$string);
It's working, except when a domain has a hyphen in it. So, for instance, the following string will only partially work.
$string = "In front mfever.com/1 middle http://mf-ever.com/2 at the end";
Can any regex genius see what's wrong with it?

You just need to add the optional dash:
((https?:\/\/)?\w+\-?\w+(\.\w{2,})+[\w?&%=+\/]+)
See it work here https://regex101.com/r/Tkdapj/1
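Note that \w+\-?\w+ only allows a single hyphen, and only in the first label of the host. Here is a minimal sketch of a slightly broader variant, assuming hyphens may appear anywhere in any host label: it swaps \w for the character class [\w-] in the host part. The function name prependHttp is made up for illustration; the callback is the one from the question.

```php
<?php
// Hypothetical helper: same callback as in the question, but the host part
// uses [\w-] so that labels such as "mf-ever" or "my-long-host" are matched.
function prependHttp(string $string): string
{
    return preg_replace_callback(
        '/((https?:\/\/)?[\w-]+(\.[\w-]{2,})+[\w?&%=+\/]+)/i',
        function (array $match): string {
            // Only add the protocol when none is present yet.
            if (stripos($match[1], 'http://') !== 0 && stripos($match[1], 'https://') !== 0) {
                return 'http://' . $match[1];
            }
            return $match[1];
        },
        $string
    );
}

echo prependHttp("In front mfever.com/1 middle http://mf-ever.com/2 at the end"), "\n";
// In front http://mfever.com/1 middle http://mf-ever.com/2 at the end
```

Like the original pattern, this still requires at least one character after the last dot-separated label, so it is a loose URL matcher rather than a validator.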

Related

How to append embed code to youtube anchor link

I use this code to make urls clickable as anchor links in forum posts.
function makelink($str) {
$str = preg_replace_callback('/((((http)(s)?:\/\/)|www\.)[-0-9æøåa-zA-Z?-??-?\(\)%_+\.~#?&;:#\/\/=]+)(?<!\.)/i', function($matches) {
if (strtolower(substr($matches[0], 0 , 4)) == 'www.') {
$matches[0] = 'http://' . $matches[0];
}
return '<a href="'.$matches[0].'">'.$matches[0].'</a>';
}, $str);
return trim($str);
}
It works fine. Now I need to also make youtube links into embed codes underneath the link (appended to the link I guess).
It's fine if this needs an extra replacement pass.
How could I make some code that replaces the resulting anchor (if it's a youtube link):
https://www.youtube.com/watch?v=LOLAy72Tv24
With this:
https://www.youtube.com/watch?v=LOLAy72Tv24
<br />
<iframe class="youtube" width="350" height="250" src="https://www.youtube.com/embed/LOLAy72Tv24/" allowfullscreen></iframe>
So it just needs to take the video id and put out embed code underneath the original link, while outputting both together.
Here's my proposed solution:
function makelink($str) {
$pattern = '/((((http)(s)?:\/\/)|www\.)[-0-9æøåa-zA-Z?-??-?\(\)%_+\.~#?&;:#\/\/=]+)(?<!\.)/i';
$str = preg_replace_callback($pattern, function($matches) {
if (strtolower(substr($matches[0], 0 , 4)) == 'www.') {
$matches[0] = 'http://' . $matches[0];
}
// store anchor tag html in a variable instead of returning immediately
$html = '<a href="'.$matches[0].'">'.$matches[0].'</a>';
if (isYouTubeVideoUrl($matches[0])) {
$html .= '<br />'.makeiFrame($matches[0]);
}
return $html;
}, $str);
return trim($str);
}
function isYouTubeVideoUrl(string $url): bool
{
// Short URLs carry the id in the path, so they have no v= parameter to check.
return isYouTubeShortUrl($url)
|| (parse_url($url, PHP_URL_HOST) === 'www.youtube.com'
&& strpos((string) parse_url($url, PHP_URL_QUERY), 'v=') !== false);
}
function isYouTubeShortUrl(string $url): bool
{
return parse_url($url, PHP_URL_HOST) === 'youtu.be';
}
function makeiFrame(string $url): string {
$embedUrl = 'https://www.youtube.com/embed/'.getYouTubeVideoId($url).'/';
return '<iframe class="youtube" width="350" height="250" src="'.$embedUrl.'" allowfullscreen></iframe>';
}
function getYouTubeVideoId(string $url): string
{
if (isYouTubeShortUrl($url)) {
preg_match('/[^\/]+$/', $url, $matches);
return $matches[0];
}
preg_match('/(?<=v=)(.*?)(?=(&|$))/', $url, $matches);
return $matches[0];
}
It is designed to work both with regular YouTube URLs (https://www.youtube.com/watch?v=LOLAy72Tv24) and short YouTube URLs (https://youtu.be/LOLAy72Tv24). It also supports the v parameter being anywhere in the query string for regular URLs.
Most of the code is pretty straightforward, the key lies in extracting the video id.
Short URLs have the format where the id is behind a slash, so [^\/]+$ looks for any characters that are not a slash at the end of the string:
[^\/] matches any character not a slash
+ is a quantifier for one or more, greedy
$ asserts the position at the end of the string
Regular URLs have the format where the id is in a parameter named v, so (?<=v=)(.*?)(?=(&|$)) looks for everything between v= and either & or the end of the string:
(?<=v=) is a positive lookbehind, assuring that we look for a string right after v=
(.*?) matches zero or more characters (any, except for line terminators), lazy
(?=(&|$)) is a positive lookahead, assuring that we look for a string right before an ampersand (&) or the end of a string ($)
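As an alternative to the two regexes, here is a minimal sketch that lets parse_url and parse_str do the splitting. extractYouTubeId is a made-up name, and the sketch assumes only the two hosts mentioned above (www.youtube.com and youtu.be) need to be recognised.

```php
<?php
// Hypothetical helper: returns the video id, or null when the URL is not a
// recognised YouTube video URL.
function extractYouTubeId(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    // Short URL: the id is the path without its leading slash.
    if ($parts['host'] === 'youtu.be') {
        $id = trim($parts['path'] ?? '', '/');
        return $id !== '' ? $id : null;
    }

    // Regular URL: the id is the v parameter, wherever it sits in the query.
    if ($parts['host'] === 'www.youtube.com') {
        parse_str($parts['query'] ?? '', $query);
        if (isset($query['v']) && is_string($query['v']) && $query['v'] !== '') {
            return $query['v'];
        }
    }

    return null;
}

echo extractYouTubeId('https://www.youtube.com/watch?v=LOLAy72Tv24'), "\n"; // LOLAy72Tv24
echo extractYouTubeId('https://youtu.be/LOLAy72Tv24'), "\n";                // LOLAy72Tv24
```

Because the query string is parsed rather than searched, a parameter such as rev= is not mistaken for v=.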

How to filter URLs that contain white space with preg match?

I parse through a text that contains several links. Some of them contain white spaces but have a file ending. My current pattern is:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);
This works the same way:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);
I don't know much about the patterns and didn't find a good tutorial that explains the meaning of all possible patterns and shows examples.
How could I match a URL like this:
http://my-url.com/my doc.doc or even http://my-url.com/my doc with more white spaces.doc
The \s in those preg_match_all calls stands for a whitespace character. But how could I check whether there is a file ending after one or more spaces?
Is it possible?
Why not just make use of PHP's filter functions?
<?php
$url = "http://my-url.com/my doc.doc";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
OUTPUT :
URL is not valid
this might be what you are looking for which uses urlencode
$file = "my doc with more white spaces.doc";
echo " http://my-url.com/" . urlencode($file);
which produces:
http://my-url.com/my+doc+with+more+white+spaces.doc
With rawurlencode instead, it produces:
http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
EDIT: Something like the following might help to parse your urls with parse_url
$url = 'http://my-url.com/my doc with more white spaces.doc';
$purl = parse_url($url);
$rurl = "";
if(isset($purl['scheme'])){
$rurl .= $purl['scheme'] . "://";
}
if(isset($purl['host'], $purl['path'])){
// encode the path but keep the slashes between segments
$rurl .= $purl['host'] . str_replace('%2F', '/', rawurlencode($purl['path']));
}
if($rurl === ""){
$rurl = $url;#error parsing error/invalid url?
}
for sub directories you can do
$purl['path'] = implode('/', array_map(function($value){return rawurlencode($value);}, explode('/', $purl['path'])));
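Putting the parse_url and per-segment rawurlencode ideas together, a minimal sketch could look like this. encodeUrlPath is a made-up name, and the sketch assumes the URL consists only of scheme, host and path (port, query string and fragment are ignored).

```php
<?php
// Hypothetical helper: percent-encodes every path segment of a URL while
// leaving the slashes between the segments untouched.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        // Not something we can rebuild, so hand it back unchanged.
        return $url;
    }

    $segments = explode('/', $parts['path'] ?? '');
    $path = implode('/', array_map('rawurlencode', $segments));

    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

echo encodeUrlPath('http://my-url.com/my doc with more white spaces.doc'), "\n";
// http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
```

Encoding segment by segment avoids turning the path separators into %2F, which is what happens when rawurlencode is applied to the whole path at once.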
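I don't know much about PHP, but this regex
(http|ftp)(s)?://([\w-]+\.)+[\w-]+(/[\w ./?%&=-]*)?
will match URLs even when the path contains spaces. The hyphen sits at the end of the last character class so that PCRE reads it literally rather than as a range.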
I think this regex will do:
preg_match_all("/^(?si)(?>\s*)(((?>https?:\/\/(?>www\.)?)?(?=[\.-a-z0-9]{2,253}(?>$|\/|\?|\s))[a-z0-9][a-z0-9-]{1,62}(?>\.[a-z0-9][a-z0-9-]{1,62})+)(?>(?>\/|\?).*)?)?(?>\s*)$/", $input_lines, $output_array);
Alright after doing this really helpful tutorial I finally know how the regex syntax works. After finishing it I experimented a bit on this site
It was pretty easy after figuring out that all hyperlinks in my parsed document were in between quotation marks so I just had to change the regex to:
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);
so that after the " it is looking for the next match that begins with http.
But that's not the full solution yet. The user Class was right: without running the filenames through rawurlencode it won't work.
So the next step was this:
function endsWith($haystack, $needle)
{
return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}
if(endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")){
$file = substr( $textlink, strrpos( $textlink, '/' )+1 );
$rest_url=substr($textlink, 0, strrpos($textlink, '/' )+1 );
$textlink=$rest_url.rawurlencode($file);
}
That extracts the filenames from the URLs and rawurlencodes them so that the output links are correct.
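The chain of endsWith calls can also be written with pathinfo and a list of extensions. This is only a sketch: encodeFileLink is a made-up name, the extension list is the one from the code above, and unlike endsWith the comparison here is case-insensitive, which is an assumption on my part.

```php
<?php
// Hypothetical helper: if the link ends in one of the given file extensions,
// rawurlencode the part after the last slash; otherwise return it unchanged.
function encodeFileLink(
    string $link,
    array $extensions = ['doc', 'docx', 'pdf', 'jpg', 'jpeg', 'png']
): string {
    $extension = strtolower(pathinfo($link, PATHINFO_EXTENSION));
    if (!in_array($extension, $extensions, true)) {
        return $link;
    }

    $slash = strrpos($link, '/');
    if ($slash === false) {
        // No directory part at all, so the whole string is the filename.
        return rawurlencode($link);
    }

    return substr($link, 0, $slash + 1) . rawurlencode(substr($link, $slash + 1));
}

echo encodeFileLink('http://my-url.com/my doc with more white spaces.doc'), "\n";
// http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
```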
I think this should work:
$url = '...';
$url_new = '';
$array = explode(' ',$url);
foreach($array as $name => $val){
if ($val!=' '){
$url_new = $url_new.$val;
}
}

Preg-replace - replace all URLs except a domain and its subdomains

I have a Glype proxy and I don't want it to parse external URLs. All URLs on the page are automatically converted to: http://proxy.com/browse.php?u=[URL HERE]. Example: if I visit The Pirate Bay through my proxy, I don't want the following URLs to be parsed:
ByteLove.com (Not to: http://proxy.com/browse.php?u=http://bytelove.com&b=0)
BayFiles.com (Not to: http://proxy.com/browse.php?u=http://bayfiles.com&b=0)
BayIMG.com (Not to: http://proxy.com/browse.php?u=http://bayimg.com&b=0)
PasteBay.com (Not to: http://proxy.com/browse.php?u=http://pastebay.com&b=0)
Ipredator.com (Not to: http://proxy.com/browse.php?u=https://ipredator.se&b=0)
etc.
Of course I want to keep the internal URLs, so:
thepiratebay.se/browse (To: http://proxy.com/browse.php?u=http://thepiratebay.se/browse&b=0)
thepiratebay.se/top (To: http://proxy.com/browse.php?u=http://thepiratebay.se/top&b=0)
thepiratebay.se/recent (To: http://proxy.com/browse.php?u=http://thepiratebay.se/recent&b=0)
etc.
Is there a preg_replace call that replaces all URLs except thepiratebay.se and its subdomains (as in the example)? Another function is also welcome (such as DOMDocument, QueryPath, substr or strpos, but not str_replace, because then I would have to define all URLs).
I've found something, but I'm not familiar with preg_replace:
$exclude = '.thepiratebay.se';
$pattern = '(https?\:\/\/.*?\..*?)(?=\s|$)';
$message= preg_replace("~(($exclude)?($pattern))~i", '$2$5$6', $message);
I guess you would need to provide a whitelist to tell which domains should be proxied:
$whitelist = array();
$whitelist[] = "internal1.se";
$whitelist[] = "internal2.no";
$whitelist[] = "internal3.com";
// and so on...
$string = 'External link 1<br>';
$string .= 'Internal link 1<br>';
$string .= 'Internal link 2<br>';
$string .= 'External link 2<br>';
//Assuming the URL always is inside '' or "" you can use this pattern:
$pattern = '#(https?://proxy\.org/browse\.php\?u=(https?[^&|\"|\']*)(&?[^&|\"|\']*))#i';
$string = preg_replace_callback($pattern, "my_callback", $string);
//I had only PHP 5.2 on my server, so I decided to use a callback function.
function my_callback($match) {
global $whitelist;
// set return bypass proxy URL
$returnstring = urldecode($match[2]);
foreach ($whitelist as $white) {
// check if URL matches whitelist
if (stripos($match[2], $white) !== false) {
$returnstring = $match[0];
break;
}
}
return $returnstring;
}
echo "NEW STRING[:\n" . $string . "\n]\n";
You can use preg_replace_callback() to execute a callback function for every match. In that function you can determine whether the matched string should be converted or not.
<?php
$string = 'http://foobar.com/baz and http://example.org/bumm';
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i';
$string = preg_replace_callback($pattern, function($match) {
if (stripos($match[0], 'example.org/') !== false) {
// exclude all URLs containing example.org
return $match[0];
} else {
return 'http://proxy.com/?u=' . urlencode($match[0]);
}
}, $string);
echo $string, "\n";
(Example is using PHP 5.3 closure notation)
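A substring check such as stripos also matches hosts that merely contain the domain. Here is a rough sketch that compares the parsed host instead, so that only the domain itself and its subdomains count as internal. The function names, the URL pattern and the proxy prefix are assumptions for illustration, and the target URL is passed through urlencode as in the example above.

```php
<?php
// Hypothetical helper: true when the URL's host is $domain or a subdomain of it.
function isSameDomainOrSubdomain(string $url, string $domain): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    $host = strtolower($host);
    $suffix = '.' . strtolower($domain);

    return $host === strtolower($domain) || substr($host, -strlen($suffix)) === $suffix;
}

// Hypothetical helper: proxies internal links only and leaves external ones alone.
function proxifyInternalLinks(
    string $text,
    string $domain,
    string $proxy = 'http://proxy.com/browse.php?u='
): string {
    return preg_replace_callback(
        '#https?://[^\s"\'<>]+#i',
        function (array $match) use ($domain, $proxy): string {
            return isSameDomainOrSubdomain($match[0], $domain)
                ? $proxy . urlencode($match[0]) . '&b=0'
                : $match[0];
        },
        $text
    );
}

echo proxifyInternalLinks('http://thepiratebay.se/top and http://bayimg.com', 'thepiratebay.se'), "\n";
// http://proxy.com/browse.php?u=http%3A%2F%2Fthepiratebay.se%2Ftop&b=0 and http://bayimg.com
```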

PHP Regex to Remove http:// from string

I have full URLs as strings, but I want to remove the http:// at the beginning of the string to display the URL nicely (ex: www.google.com instead of http://www.google.com)
Can someone help?
$str = 'http://www.google.com';
$str = preg_replace('#^https?://#', '', $str);
echo $str; // www.google.com
That will work for both http:// and https://
You don't need regular expression at all. Use str_replace instead.
str_replace('http://', '', $subject);
str_replace('https://', '', $subject);
Combined into a single operation as follows:
str_replace(array('http://','https://'), '', $urlString);
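Keep in mind that str_replace removes every occurrence, not just the one at the beginning of the string. If that matters (for example when another URL appears inside the query string), a prefix-only check without regex could be sketched like this; stripHttpPrefix is a made-up name.

```php
<?php
// Hypothetical helper: removes a leading http:// or https:// (case-insensitive)
// and leaves any later occurrence in the string alone.
function stripHttpPrefix(string $url): string
{
    foreach (['https://', 'http://'] as $prefix) {
        if (strncasecmp($url, $prefix, strlen($prefix)) === 0) {
            return substr($url, strlen($prefix));
        }
    }
    return $url;
}

echo stripHttpPrefix('http://www.google.com'), "\n";
// www.google.com
echo stripHttpPrefix('https://example.com/?next=http://other.test'), "\n";
// example.com/?next=http://other.test
```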
Better use this:
$url = parse_url($url);
$url = $url['host'];
echo $url;
Simpler and works for http:// https:// ftp:// and almost all prefixes.
Why not use parse_url instead?
To remove http://domain ( or https ) and to get the path:
$str = preg_replace('#^https?\:\/\/([\w*\.]*)#', '', $str);
echo $str;
If you insist on using RegEx:
preg_match( "/^(https?:\/\/)?(.+)$/", $input, $matches );
$url = $matches[2];
Yeah, I think that str_replace() and substr() are faster and cleaner than regex. Here is a safe fast function for it. It's easy to see exactly what it does. Note: return substr($url, 7) and substr($url, 8), if you also want to remove the //.
// slash-slash protocol remove https:// or http:// and leave // - if it's not a string starting with https:// or http:// return whatever was passed in
function universal_http_https_protocol($url) {
// Breakout - give back bad passed in value
if (empty($url) || !is_string($url)) {
return $url;
}
// starts with http://
if (strlen($url) >= 7 && "http://" === substr($url, 0, 7)) {
// slash-slash protocol - remove http: leaving //
return substr($url, 5);
}
// starts with https://
elseif (strlen($url) >= 8 && "https://" === substr($url, 0, 8)) {
// slash-slash protocol - remove https: leaving //
return substr($url, 6);
}
// no match, return unchanged string
return $url;
}
<?php
// (PHP 4, PHP 5, PHP 7)
// preg_replace — Perform a regular expression search and replace
$array = [
'https://lemon-kiwi.co',
'http://lemon-kiwi.co',
'lemon-kiwi.co',
'www.lemon-kiwi.co',
];
foreach( $array as $value ){
$url = preg_replace("(^https?://)", "", $value );
echo $url, "\n";
}
This code outputs:
lemon-kiwi.co
lemon-kiwi.co
lemon-kiwi.co
www.lemon-kiwi.co
See the PHP documentation for preg_replace.

Codeigniter and preg_replace

I use Codeigniter to create a multilingual website and everything works fine, but when I try to use the "alternative languages helper" by Luis I've got a problem. This helper uses a regular expression to replace the current language with the new one:
$new_uri = preg_replace('/^'.$actual_lang.'/', $lang, $uri);
The problem is that I have a URL like this: http://www.example.com/en/language/english/ and I want to replace only the first "en" without changing the word "english". I tried to use the limit for preg_replace:
$new_uri = preg_replace('/^'.$actual_lang.'/', $lang, $uri, 1);
but this doesn't work for me. Any ideas?
You could do something like this:
$regex = '#^'.preg_quote($actual_lang, '#').'(?=/|$)#';
$new_uri = preg_replace($regex, $lang, $uri);
The trailing lookahead (?=/|$) basically means "only match if the next character is a forward slash or the end of the string"...
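Wrapped in a function, that idea could be sketched as follows. switchLanguage is a made-up name, and the sketch assumes the URI has no leading slash, as in the helper above.

```php
<?php
// Hypothetical helper: replaces the language code only when it is the whole
// first segment of the URI.
function switchLanguage(string $uri, string $actualLang, string $lang): string
{
    $regex = '#^' . preg_quote($actualLang, '#') . '(?=/|$)#';

    // preg_replace returns null on error; fall back to the untouched URI.
    return preg_replace($regex, $lang, $uri) ?? $uri;
}

echo switchLanguage('en/language/english', 'en', 'fr'), "\n"; // fr/language/english
echo switchLanguage('english/en', 'en', 'fr'), "\n";          // english/en
```

The second call is left alone because "en" is followed by "g" rather than by a slash or the end of the string.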
Edit:
If the code you always want to replace is at the beginning of the path, you could always do:
if (stripos($url, $actual_lang) !== false) {
if (strpos($url, '://') !== false) {
// drop the leading slash so the language code is the first segment
$path = ltrim(parse_url($url, PHP_URL_PATH), '/');
} else {
$path = $url;
}
list($language, $rest) = explode('/', $path, 2);
if ($language == $actual_lang) {
$url = str_replace($path, $lang . '/' . $rest, $url);
}
}
It's a bit more code, but it should be fairly robust. You could always build a class to do this for you (by parsing, replacing and then rebuilding the URL)...
If you know what the beginning of the URL will always be, use it in the regex!
$domain = "http://www.example.com/";
$regex = '#(?<=^' . preg_quote($domain, '#') . ')' . preg_quote($actual_lang, '#') . '\b#';
$new_uri = preg_replace($regex, $lang, $uri);
In the case of your example, the regular expression would become #(?<=^http\://www\.example\.com/)en\b# which would match en only if it follows the specified beginning of the URL ((?<=...) in a regular expression specifies a positive lookbehind) and is followed by a word boundary (so english wouldn't match).
