php - preg_replace - adding a protocol to href and src elements


Is it possible to add a protocol to URLs (in href and src attributes) that don't already contain one?
For example, I would like to replace this URL:
TEXT
to:
TEXT
Two things are important:
if the original URL in href/src starts with a slash "/", the protocol and domain should be added without a trailing slash; if it doesn't start with a slash, the protocol and domain should be added with a trailing slash,
if the original URL starts with "../" or "./" etc., that prefix should be removed and then the protocol and domain should be added with a trailing slash.
Is it possible to do this in one regex?
Thanks.
EDIT:
Here is my code:
$url = 'http://my-page.com/';
$html = file_get_contents($url);
preg_match('"charset=([A-Za-z0-9\-]+)"si', $html, $charset);
$charset = strlen($charset[1]) > 3 ? $charset[1] : 'UTF-8';
$html = mb_convert_encoding($html, 'HTML-ENTITIES', $charset);
preg_match_all('"href=\"(.*?)\""si', $html, $matches);
foreach($matches[1] AS $key => $value)
{
    if ( preg_match("/^(http|https):/", $value) )
    {
        continue;
    }
    $html = str_replace($value, $url.$value, $html);
}
preg_match_all('"src=\"(.*?)\""si', $html, $matches);
foreach($matches[1] AS $key => $value)
{
    if ( preg_match("/^(http|https):/", $value) )
    {
        continue;
    }
    $html = str_replace($value, $url.$value, $html);
}
echo $html;

I would use this regex in a sed or other recipe:
sed 's|href="|href="http://site.domain|g'
(The delimiter has to be something other than "/", because the URL itself contains slashes.)
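For the rules in the question (leading slash, "./" and "../" prefixes), one regex plus a callback is probably enough. Below is a minimal sketch, not a definitive implementation: the function name addBaseToUrls is made up for illustration, it assumes the attribute values are quoted, and it simply drops leading "./" and "../" segments as the question asks rather than resolving them the way a browser would.

```php
<?php
// Hypothetical helper (name is an assumption): prefix relative href/src values with $base.
function addBaseToUrls(string $html, string $base): string
{
    $base = rtrim($base, '/'); // normalise so exactly one slash is added below

    // Group 1: attribute name, group 2: opening quote, group 3: attribute value.
    $pattern = '~\b(href|src)=(["\'])(.*?)\2~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($base): string {
        [$whole, $attr, $quote, $url] = $m;

        // Leave empty values, anything with a scheme (http:, mailto:, data:, ...),
        // protocol-relative URLs and fragment-only links untouched.
        if ($url === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $whole;
        }

        // Remove any run of leading "./" or "../" segments, then leading slashes.
        $path = preg_replace('~^(?:\.{1,2}/)+~', '', $url);
        $path = ltrim($path, '/');

        return $attr . '=' . $quote . $base . '/' . $path . $quote;
    }, $html);

    return $result ?? $html; // preg_replace_callback() returns null on a regex error
}
```

Usage would be something like echo addBaseToUrls($html, 'http://my-page.com/');. Because the callback rewrites each match in place, it avoids the problem with str_replace() in the question's loop, which replaces every occurrence of a value anywhere in the document.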

Related

Replace all relative URLs with absolute URLS

I've seen a few answers (like this one), but I have some more complex scenarios I'm not sure how to account for.
I essentially have full HTML documents. I need to replace every single relative URL with absolute URLs.
Elements from potential HTML look as follows, may be other cases as well:
<img src="/relative/url/img.jpg" />
<form action="/">
<form action="/contact-us/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" />
Desired Output would be:
// "//example.com/" is ideal, but "http(s)://example.com/" are acceptable
<img src="//example.com/relative/url/img.jpg" />
<form action="//example.com/">
<form action="//example.com/contact-us/">
<a href='//example.com/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" /> <!-- Unmodified -->
I DON'T want to replace protocol relative URLs, since they already function as absolute URLs. I've come up with some code that works, but I'm wondering if I can clean it up a little, as it's extremely repetitive.
But I have to account for single and double quoted attribute values for src, href, and action (am I missing any attributes that can have relative URLs?) while simultaneously avoiding protocol relative URLs.
Here's what I have so far:
// Make URL replacement protocol relative to not break insecure/secure links
$url = str_replace( array( 'http://', 'https://' ), '//', $url );
// Temporarily Modify Protocol-Relative URLS
$str = str_replace( 'src="//', 'src="::TEMP_REPLACE::', $str );
$str = str_replace( "src='//", "src='::TEMP_REPLACE::", $str );
$str = str_replace( 'href="//', 'href="::TEMP_REPLACE::', $str );
$str = str_replace( "href='//", "href='::TEMP_REPLACE::", $str );
$str = str_replace( 'action="//', 'action="::TEMP_REPLACE::', $str );
$str = str_replace( "action='//", "action='::TEMP_REPLACE::", $str );
// Replace all other Relative URLS
$str = str_replace( 'src="/', 'src="'. $url .'/', $str );
$str = str_replace( "src='/", "src='". $url ."/", $str );
$str = str_replace( 'href="/', 'href="'. $url .'/', $str );
$str = str_replace( "href='/", "href='". $url ."/", $str );
$str = str_replace( 'action="/', 'action="'. $url .'/', $str );
$str = str_replace( "action='/", "action='". $url ."/", $str );
// Change Protocol Relative URLs back
$str = str_replace( 'src="::TEMP_REPLACE::', 'src="//', $str );
$str = str_replace( "src='::TEMP_REPLACE::", "src='//", $str );
$str = str_replace( 'href="::TEMP_REPLACE::', 'href="//', $str );
$str = str_replace( "href='::TEMP_REPLACE::", "href='//", $str );
$str = str_replace( 'action="::TEMP_REPLACE::', 'action="//', $str );
$str = str_replace( "action='::TEMP_REPLACE::", "action='//", $str );
I mean, it works, but it's uuugly, and I was thinking there's probably a better way to do it.

New Answer
If your real html document is valid (and has a parent/containing tag), then the most appropriate and reliable technique will be to use a proper DOM parser.
Here is how DOMDocument and Xpath can be used to elegantly target and replace your designated tag attributes:
Code1 - Nested Xpath Queries: (Demo)
$domain = '//example.com';
$tagsAndAttributes = [
    'img' => 'src',
    'form' => 'action',
    'a' => 'href'
];
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($tagsAndAttributes as $tag => $attr) {
    foreach ($xpath->query("//{$tag}[not(starts-with(@{$attr}, '//'))]") as $node) {
        $node->setAttribute($attr, $domain . $node->getAttribute($attr));
    }
}
echo $dom->saveHTML();
Code2 - Single Xpath Query w/ Condition Block: (Demo)
$domain = '//example.com';
$targets = [
    "//img[not(starts-with(@src, '//'))]",
    "//form[not(starts-with(@action, '//'))]",
    "//a[not(starts-with(@href, '//'))]"
];
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query(implode('|', $targets)) as $node) {
    if ($src = $node->getAttribute('src')) {
        $node->setAttribute('src', $domain . $src);
    } elseif ($action = $node->getAttribute('action')) {
        $node->setAttribute('action', $domain . $action);
    } else {
        $node->setAttribute('href', $domain . $node->getAttribute('href'));
    }
}
echo $dom->saveHTML();
Old Answer: (...regex is not "DOM-aware" and is vulnerable to unexpected breakage)
If I understand you properly, you have a base value in mind, and you only want to apply it to relative paths.
Pattern Demo
Code: (Demo)
$html=<<<HTML
<img src="/relative/url/img.jpg" />
<form action="/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//site.com/protocol-relative-img.jpg" />
HTML;
$base='https://example.com';
echo preg_replace('~(?:src|action|href)=[\'"]\K/(?!/)[^\'"]*~',"$base$0",$html);
Output:
<img src="https://example.com/relative/url/img.jpg" />
<form action="https://example.com/">
<a href='https://example.com/relative/url/'>Note the Single Quote</a>
<img src="//site.com/protocol-relative-img.jpg" />
Pattern Breakdown:
~ #Pattern delimiter
(?:src|action|href) #Match: src or action or href
= #Match equal sign
[\'"] #Match single or double quote
\K #Restart fullstring match (discard previously matched characters)
/ #Match slash
(?!/) #Negative lookahead (zero-length assertion): must not be a slash immediately after first matched slash
[^\'"]* #Match zero or more non-single/double quote characters
~ #Pattern delimiter

I think that the <base> element is what you are looking for:
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
<base> is an empty element that goes in the <head>. Using <base href="https://example.com/path/" /> tells the browser to resolve all relative URLs in the document against https://example.com/path/ instead of the document's own URL.
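If the HTML is being processed in PHP anyway, the tag can be injected instead of rewriting every attribute. A rough sketch, assuming the document already has a <head> element; addBaseTag is a hypothetical name, not a built-in:

```php
<?php
// Hypothetical helper: insert a <base> tag right after the first opening <head> tag.
function addBaseTag(string $html, string $baseHref): string
{
    $tag = '<base href="' . htmlspecialchars($baseHref, ENT_QUOTES) . '">';

    // A callback is used so that "$" or "\" in the URL is never read as a backreference.
    $result = preg_replace_callback(
        '~<head\b[^>]*>~i',
        fn (array $m): string => $m[0] . $tag,
        $html,
        1 // only the first <head>
    );

    return $result ?? $html;
}
```

Note that <base> only changes how a browser resolves the links; the attribute values in the markup stay relative, so this does not help if the absolute URLs are needed in the HTML source itself.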

Adjust youtube url to embed parser to remove extras

I have this php code to turn youtube urls into videos automatically:
$search = '%
(?:https?://)?
(?:www\.)?
(?:
youtu\.be/
| youtube\.com
(?:
/embed/
| /v/
| /watch\?v=
| /watch\?feature=player_embedded&v=
)
)
([\w\-]{10,12})
\b
%x';
$replace = "<iframe class=\"youtube-player\" width=\"550\" height=\"385\" src=\"http://www.youtube.com/embed/$1\" data-youtube-id=\"$1\" frameborder=\"0\" allowfullscreen></iframe>";
return preg_replace($search, $replace, $url);
What would be the easiest way to strip out anything after the video id?

Wow. The suggested link actually links to another regex. Use parse_url and parse_str, that's what they're there for, see this answer. Parsing URLs with regex is hard and there's no reason to reinvent the wheel.
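To make the parse_url / parse_str suggestion concrete, here is a small sketch. The function name youtubeId is made up, the list of URL shapes is taken from the question's regex (youtu.be, /embed/, /v/, /watch?v=), and the 10 to 12 character ID check mirrors the question's pattern rather than any official rule.

```php
<?php
// Hypothetical helper: return the video ID from a YouTube URL, or null if none is found.
function youtubeId(string $url): ?string
{
    // parse_url() only recognises the host reliably when a scheme is present.
    if (!preg_match('~^https?://~i', $url)) {
        $url = 'https://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    $path = $parts['path'] ?? '';

    if ($host === 'youtu.be') {
        $id = ltrim($path, '/');
    } elseif ($host === 'youtube.com' && preg_match('~^/(?:embed|v)/([^/]+)~', $path, $m)) {
        $id = $m[1];
    } elseif ($host === 'youtube.com' && $path === '/watch') {
        parse_str($parts['query'] ?? '', $query); // extra parameters are simply ignored
        $id = $query['v'] ?? '';
    } else {
        return null;
    }

    // Same character set and length range as the question's regex.
    return is_string($id) && preg_match('~^[\w-]{10,12}\z~', $id) ? $id : null;
}
```

Since the query string is parsed rather than pattern-matched, anything after the video ID (feature=..., t=..., and so on) is dropped without extra work.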

I found a way thanks to others links, here's a function to search a body of text and replace all youtube links with videos:
function youtube($body)
{
    $video_pattern = '~(?:http|https|)(?::\/\/|)(?:www.|)(?:youtu\.be\/|youtube\.com(?:\/embed\/|\/v\/|\/watch\?v=|\/ytscreeningroom\?v=|\/feeds\/api\/videos\/|\/user\S*[^\w\-\s]|\S*[^\w\-\s]))([\w\-]{11})[a-z0-9;:@#?&%=+\/\$_.-]*~i';
    preg_match_all($video_pattern, $body, $matches);
    //print_r($matches[0]);
    foreach ($matches[0] as $url)
    {
        // strpos() can return 0, so compare strictly against false
        if (strpos($url, 'feature=youtu.be') !== false || strpos($url, 'youtu.be') === false)
        {
            parse_str(parse_url($url, PHP_URL_QUERY), $id);
            $id = $id['v'];
        }
        else
        {
            $id = basename($url);
        }
        $body = str_replace($url, "<iframe class=\"youtube-player\" width=\"550\" height=\"385\" src=\"http://www.youtube.com/embed/{$id}\" data-youtube-id=\"{$id}\" frameborder=\"0\" allowfullscreen></iframe>", $body);
    }
    return $body;
}

check string for youtube link or image

I have a small piece of code that checks a string for a URL and adds the <a href> tag to create a link. I also have it check the string for a YouTube link and then add rel="youtube" to the <a> tag.
How can I get the code to only add rel to the youtube links?
How can I get it to add a different rel to any type of image link?
$text = "http://site.com a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M here is another site";
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '\4', $text );
if(preg_match('/http:\/\/www\.youtube\.com\/watch\?v=[^&]+/', $linkstring, $vresult)) {
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '<a rel="youtube" href="\0">\4</a>', $text );
$type= 'youtube';
}
else {
$type = 'none';
}
echo $text;
echo $linkstring, "<br />";
echo $type, "<br />";

Try http://simplehtmldom.sourceforge.net/.
Code:
<?php
include('simple_html_dom.php');
$html = str_get_html('Link');
$html->find('a', 0)->rel = 'youtube';
echo $html;
Output:
[username@localhost dom]$ php dom.php
Link
You can build an entire page DOM or a simple single link with this library.
Detecting hostname of URL:
Pass the url to parse_url. parse_url returns an array of the URL parts.
Code:
print_r(parse_url('http://www.youtube.com/watch?v=UyxqmghxS6M'));
Output:
Array
(
[scheme] => http
[host] => www.youtube.com
[path] => /watch
[query] => v=UyxqmghxS6M
)
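Building on the parse_url output above, the host and path can decide which rel value a link gets. This is only a sketch: relForUrl is a hypothetical name, and the list of image extensions is an assumption borrowed from the other answer in this thread.

```php
<?php
// Hypothetical helper: pick a rel value for a URL, or null if no special rel applies.
function relForUrl(string $url): ?string
{
    // parse_url() returns null (or false) for missing parts; the cast turns that into ''.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));

    // youtube.com itself, any subdomain of it, or the short youtu.be host.
    if ($host === 'youtu.be' || $host === 'youtube.com' || substr($host, -12) === '.youtube.com') {
        return 'youtube';
    }

    // Images are detected by file extension only.
    if (preg_match('~\.(?:png|jpe?g|bmp|gif)$~', $path)) {
        return 'image';
    }

    return null;
}
```

Each URL found in the text can then be wrapped individually, so only the YouTube links receive rel="youtube" instead of every link in the string.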

Try the following:
//text
$text = "http://site.com/bounty.png a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M&featured=true here is another site";
//Youtube links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}youtube\.com\/watch\?v=([a-z0-9\-_\|]{11})[^\s]*/i";
$replacement = '<a rel="youtube" href="http://www.youtube.com/watch?v=\3">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
//image links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}[^\/]+\/[^\s]+\.(png|jpg|jpeg|bmp|gif)[^\s]*/i";
$replacement = '<a rel="image" href="\0">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
Note that the latter can only detect links to images that have a file extension. Links like www.example.com?image=3 will not be detected.

PHP Get end string on url between / and /

I need to get the last string content of the url between / and /
For example:
http://mydomain.com/get_this/
or
http://mydomain.com/lists/get_this/
I need to get where get_this is in the url.

trim() removes the trailing slash, strrpos() finds the last occurrence of / (after it's trimmed), and substr() gets all content after the last occurrence of /.
$url = trim($url, '/');
echo substr($url, strrpos($url, '/')+1);
View output
Even better, you can just use basename(), like hakre suggested:
echo basename($url);
View output

Assuming there always is a trailing slash:
$parts = explode('/', $url);
$get_this = $parts[count($parts)-2]; // -2 since there will be an empty array element due to the trailing slash
If not:
$url = trim($url, '/'); // If there is a trailing slash in this URL instance get rid of it so we're always sure the last part is where we expect it
$parts = explode('/', $url);
$get_this = $parts[count($parts)-1];

Something like this should work.
<?php
$subject = "http://mydomain.com/lists/get_this/";
$pattern = '/\/([^\/]*)\/$/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>

Just use parse_url() and explode():
<?php
$url = "http://mydomain.com/lists/get_this/";
$path = parse_url($url, PHP_URL_PATH);
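// array_filter() preserves the original keys, so reindex before using count() as an offset
$path_array = array_values(array_filter(explode('/', $path)));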
$last_path = $path_array[count($path_array) - 1];
echo $last_path;
?>

You can try this:
preg_match("/http:\/\/([a-z0-9\.]+)\/(.+)\/(.*)\/?/", $url, $matches);
print_r($matches);

php regex to get string inside href tag

I need a regex that will give me the string inside an href attribute, without the surrounding quotes.
For example, I need to extract theurltoget.com from the following:
<a href="theurltoget.com">URL</a>
Additionally, I only want the base URL part, i.e. from http://www.mydomain.com/page.html I only want http://www.mydomain.com/

Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:
$xml = simplexml_load_string($myHtml);
$list = $xml->xpath("//@href");
$preparedUrls = array();
foreach($list as $item) {
    // cast the SimpleXMLElement attribute to a string before parsing it
    $item = parse_url((string) $item);
    $preparedUrls[] = $item['scheme'] . '://' . $item['host'] . '/';
}
print_r($preparedUrls);

$html = '<a href="http://www.mydomain.com/page.html">URL</a>';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com

this expression will handle 3 options:
no quotes
double quotes
single quotes
'/href=["\']?([^"\'>]+)["\']?/'

Use the answer by @Alec if you're only looking for the base URL part (the 2nd part of the question by @David)!
$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
This will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html" class="myclass" rel="myrel
)
So you can use $href = $info["scheme"] . "://" . $info["host"]
Which gives you:
// http://www.mydomain.com
When you are looking for the entire URL inside the href, you should use another regex, for instance the one provided by @user2520237.
$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);
this will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html
)
Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"];
Which gives you:
// http://www.mydomain.com/page.html

http://www.the-art-of-web.com/php/parse-links/
Let's start with the simplest case - a well formatted link with no extra attributes:
/<a href=\"([^\"]*)\">(.*)<\/a>/iU

For all href values replacement:
function replaceHref($html, $replaceStr)
{
    $match = array();
    // [^"]+ stops at the closing quote; a greedy .+ would run on to the last quote in the line
    preg_match_all('/<a [^>]*href="([^"]+)"/', $html, $match);
    // $match[1] holds the captured href values (count($match) is only the number of groups)
    foreach ($match[1] as $href)
    {
        $html = str_replace($href, $replaceStr.urlencode($href), $html);
    }
    return $html;
}
$replaceStr = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);
echo $replaceHtml;

This will handle the case where there are no quotes around the URL.
/<a [^>]*href="?([^">]+)"?>/
But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
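For completeness, this is roughly what the DOM route looks like with PHP's bundled DOMDocument. It is a sketch under assumptions: hrefBaseUrls is a made-up name, the input is a non-empty HTML string, and links without a scheme and host (relative links) are skipped.

```php
<?php
// Hypothetical helper: collect "scheme://host/" for every <a href> in an HTML string.
function hrefBaseUrls(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; keep libxml's warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bases = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $parts = parse_url($a->getAttribute('href'));
        if (is_array($parts) && isset($parts['scheme'], $parts['host'])) {
            $bases[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }

    return $bases;
}
```

The parser deals with single quotes, double quotes, missing quotes and extra attributes, which are exactly the cases the regexes in this thread handle one by one.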

~href="(https?://[^/"]*)~
(A delimiter other than "/" is used here so the slashes in the URL don't need escaping.)
I think you should be able to handle the rest.

Because lookbehind and lookahead are cool:
/(?<=href=\")[^\"]+(?=\")/
It will match only what you want, without quotation marks
Array (
[0] => theurltoget.com )
