Replace all relative URLs with absolute URLs - php

I've seen a few answers (like this one), but I have some more complex scenarios I'm not sure how to account for.
I essentially have full HTML documents. I need to replace every single relative URL with absolute URLs.
Elements in the HTML may look like the following (there may be other cases as well):
<img src="/relative/url/img.jpg" />
<form action="/">
<form action="/contact-us/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" />
Desired Output would be:
// "//example.com/" is ideal, but "http(s)://example.com/" are acceptable
<img src="//example.com/relative/url/img.jpg" />
<form action="//example.com/">
<form action="//example.com/contact-us/">
<a href='//example.com/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" /> <!-- Unmodified -->
I DON'T want to replace protocol relative URLs, since they already function as absolute URLs. I've come up with some code that works, but I'm wondering if I can clean it up a little, as it's extremely repetitive.
But I have to account for single and double quoted attribute values for src, href, and action (am I missing any attributes that can have relative URLs?) while simultaneously avoiding protocol relative URLs.
Here's what I have so far:
// Make URL replacement protocol relative to not break insecure/secure links
$url = str_replace( array( 'http://', 'https://' ), '//', $url );
// Temporarily Modify Protocol-Relative URLS
$str = str_replace( 'src="//', 'src="::TEMP_REPLACE::', $str );
$str = str_replace( "src='//", "src='::TEMP_REPLACE::", $str );
$str = str_replace( 'href="//', 'href="::TEMP_REPLACE::', $str );
$str = str_replace( "href='//", "href='::TEMP_REPLACE::", $str );
$str = str_replace( 'action="//', 'action="::TEMP_REPLACE::', $str );
$str = str_replace( "action='//", "action='::TEMP_REPLACE::", $str );
// Replace all other Relative URLS
$str = str_replace( 'src="/', 'src="'. $url .'/', $str );
$str = str_replace( "src='/", "src='". $url ."/", $str );
$str = str_replace( 'href="/', 'href="'. $url .'/', $str );
$str = str_replace( "href='/", "href='". $url ."/", $str );
$str = str_replace( 'action="/', 'action="'. $url .'/', $str );
$str = str_replace( "action='/", "action='". $url ."/", $str );
// Change Protocol Relative URLs back
$str = str_replace( 'src="::TEMP_REPLACE::', 'src="//', $str );
$str = str_replace( "src='::TEMP_REPLACE::", "src='//", $str );
$str = str_replace( 'href="::TEMP_REPLACE::', 'href="//', $str );
$str = str_replace( "href='::TEMP_REPLACE::", "href='//", $str );
$str = str_replace( 'action="::TEMP_REPLACE::', 'action="//', $str );
$str = str_replace( "action='::TEMP_REPLACE::", "action='//", $str );
I mean, it works, but it's uuugly, and I was thinking there's probably a better way to do it.

New Answer
If your real html document is valid (and has a parent/containing tag), then the most appropriate and reliable technique will be to use a proper DOM parser.
Here is how DOMDocument and Xpath can be used to elegantly target and replace your designated tag attributes:
Code1 - Nested Xpath Queries: (Demo)
$domain = '//example.com';
$tagsAndAttributes = [
'img' => 'src',
'form' => 'action',
'a' => 'href'
];
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($tagsAndAttributes as $tag => $attr) {
foreach ($xpath->query("//{$tag}[not(starts-with(@{$attr}, '//'))]") as $node) {
$node->setAttribute($attr, $domain . $node->getAttribute($attr));
}
}
echo $dom->saveHTML();
Code2 - Single Xpath Query w/ Condition Block: (Demo)
$domain = '//example.com';
$targets = [
"//img[not(starts-with(@src, '//'))]",
"//form[not(starts-with(@action, '//'))]",
"//a[not(starts-with(@href, '//'))]"
];
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query(implode('|', $targets)) as $node) {
if ($src = $node->getAttribute('src')) {
$node->setAttribute('src', $domain . $src);
} elseif ($action = $node->getAttribute('action')) {
$node->setAttribute('action', $domain . $action);
} else {
$node->setAttribute('href', $domain . $node->getAttribute('href'));
}
}
echo $dom->saveHTML();
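Both snippets above prefix every value that does not start with `//`, which would also include values that are already absolute (for example `https://...`). A minimal sketch of a narrower variant follows, assuming only root-relative values (exactly one leading slash) should be rewritten; the function name `absolutizeRootRelative` is hypothetical, and the input is assumed to have a single containing element, as noted above.

```php
<?php
// Hypothetical helper: prefix root-relative src/href/action values with $domain.
function absolutizeRootRelative(string $html, string $domain): string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select the attribute nodes themselves: one leading slash, but not two.
    $condition = "[starts-with(., '/') and not(starts-with(., '//'))]";
    $query = "//@src{$condition} | //@href{$condition} | //@action{$condition}";

    foreach ($xpath->query($query) as $attr) {
        // Write back through the owning element so the value is stored as plain text.
        $attr->ownerElement->setAttribute($attr->nodeName, $domain . $attr->nodeValue);
    }
    return $dom->saveHTML();
}

$sample = '<div><img src="/a.jpg"><a href="//cdn.example.com/x">x</a><form action="/"></form></div>';
echo absolutizeRootRelative($sample, '//example.com');
```

Selecting attribute nodes rather than elements means the tag-to-attribute map is not needed, at the cost of also touching `src`, `href`, or `action` on tags not listed in the question.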
Old Answer: (...regex is not "DOM-aware" and is vulnerable to unexpected breakage)
If I understand you properly, you have a base value in mind, and you only want to apply it to relative paths.
Pattern Demo
Code: (Demo)
$html=<<<HTML
<img src="/relative/url/img.jpg" />
<form action="/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//site.com/protocol-relative-img.jpg" />
HTML;
$base='https://example.com';
echo preg_replace('~(?:src|action|href)=[\'"]\K/(?!/)[^\'"]*~',"$base$0",$html);
Output:
<img src="https://example.com/relative/url/img.jpg" />
<form action="https://example.com/">
<a href='https://example.com/relative/url/'>Note the Single Quote</a>
<img src="//site.com/protocol-relative-img.jpg" />
Pattern Breakdown:
~ #Pattern delimiter
(?:src|action|href) #Match: src or action or href
= #Match equal sign
[\'"] #Match single or double quote
\K #Restart fullstring match (discard previously matched characters)
/ #Match slash
(?!/) #Negative lookahead (zero-length assertion): must not be a slash immediately after first matched slash
[^\'"]* #Match zero or more non-single/double quote characters
~ #Pattern delimiter

I think that the <base> element is what you are looking for...
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
The <base> element is an empty element that goes in the <head>. Using <base href="https://example.com/path/" /> will tell all relative URLs in the document to resolve against https://example.com/path/ instead of the document's own URL.

Related

php - preg_replace - adding a protocol to href and src elements

Is it possible to add a protocol to urls (href & src) which don't contain the protocols ?
For example, I would like to replace this URL:
TEXT
to:
TEXT
But two things are important:
if the original URL in href/src starts with a slash "/", the protocol and domain should be added without a trailing slash; when the original URL does not start with a slash, the protocol and domain should be added with a trailing slash,
if the original URL starts with "../" or "./" etc., that prefix should be removed and then the protocol and domain should be added with a trailing slash.
Is it possible to do it in one regex ?
Thanks.
EDIT:
There is my code:
$url = 'http://my-page.com/';
$html = file_get_contents($url);
preg_match('"charset=([A-Za-z0-9\-]+)"si', $html, $charset);
$charset = strlen($charset[1]) > 3 ? $charset[1] : 'UTF-8';
$html = mb_convert_encoding($html, 'HTML-ENTITIES', $charset);
preg_match_all('"href=\"(.*?)\""si', $html, $matches);
foreach($matches[1] AS $key => $value)
{
if ( preg_match("/^(http|https):/", $value) )
{
continue;
}
$html = str_replace($value, $url.$value, $html);
}
preg_match_all('"src=\"(.*?)\""si', $html, $matches);
foreach($matches[1] AS $key => $value)
{
if ( preg_match("/^(http|https):/", $value) )
{
continue;
}
$html = str_replace($value, $url.$value, $html);
}
echo $html;
I would use this regex in a sed or other recipe (note the | delimiter, since the replacement itself contains slashes):
sed 's|href="|href="http://site.domain|g'
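The rules listed in the question (leave URLs that already have a protocol alone, strip leading "./" and "../", and end up with exactly one slash between domain and path) can be sketched in PHP with preg_replace_callback(). This is only an illustration under assumptions: the function name `addBaseToUrls` is hypothetical, attribute values are assumed to be quoted, and "../" is simply dropped rather than resolved against a directory.

```php
<?php
// Hypothetical helper: make href/src values absolute against $base.
function addBaseToUrls(string $html, string $base): string
{
    $base = rtrim($base, '/');

    return preg_replace_callback(
        '~\b(href|src)=(["\'])(.*?)\2~i',
        function (array $m) use ($base): string {
            $url = $m[3];
            // Leave scheme-qualified, protocol-relative, and fragment-only URLs alone.
            if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
                return $m[0];
            }
            // Drop leading "./" and "../" segments, then any leading slashes.
            $path = preg_replace('~^(?:\.{1,2}/)+~', '', $url);
            $path = ltrim($path, '/');

            return $m[1] . '=' . $m[2] . $base . '/' . $path . $m[2];
        },
        $html
    );
}

echo addBaseToUrls('<a href="../about.html">About</a>', 'http://my-page.com/');
```

Because the whole attribute is matched and rebuilt in one pass, a value cannot be prefixed twice, which can happen when str_replace() is run repeatedly over the full document.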

DOMDocument->saveHTML() vs urlencode with commercial at symbol (@)

Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" url-encoded. [@MERGEID] becomes %5B@MERGEID%5D.
Later in my code I need to replace [@MERGEID] with an ID. So I search for urlencode('[@MERGEID]') - however, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match - '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'
Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
My question is, what RFC spec is DOMDocument using, and why is it different from urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?
Demo code:
$message = '<a data-tag="thebottomlink" href="http://www.google.com?ref=abc">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); //Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {
$link = $element->getAttribute('href'); //http://www.google.com?ref=abc
$tag = $element->getAttribute('data-tag'); //thebottomlink
if ($link) {
$newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
if ($tag) {
$newlink .= '&tag=' . $tag;
}
$element->setAttribute('href', $newlink);
}
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge);
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D
I believe that these two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while $element->setAttribute('href', $newlink); encodes a complete URL to be used as a URL.
For example:
urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com
This is convenient for encoding the query part, but it cannot be used on <a href='...'>.
However:
$element->setAttribute('href', $newlink); // -> http://www.google.com
will properly encode the string so that it is still usable in href. The reason it cannot encode @ is that it cannot tell whether @ is part of the query or part of the userinfo or an email URL (for example: mailto:invisal@google.com or invisal@127.0.0.1).
Solution
Instead of using [@MERGEID], you can use @@MERGEID@@. Then, you replace that with your ID later. This solution does not even require you to use urlencode.
If you insist on using urlencode, you can just use %40 instead of @. So, your code will look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;
You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
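To make the mismatch concrete, here is a small sketch comparing the two encoded forms from the question. The helper name `fillMergeId` is hypothetical; it simply replaces the token in whichever of the three forms it appears.

```php
<?php
$token = '[@MERGEID]';

// urlencode() percent-encodes the brackets and the "@".
$encoded = urlencode($token);

// The form the question reports from saveHTML(): brackets encoded, "@" left alone.
$asSavedByDom = str_replace('%40', '@', $encoded);

// Hypothetical helper: replace the token however it ended up being encoded.
function fillMergeId(string $message, string $id): string
{
    $forms = ['[@MERGEID]', '%5B@MERGEID%5D', '%5B%40MERGEID%5D'];
    return str_replace($forms, $id, $message);
}

echo fillMergeId('http://www.example.com/click/%5B@MERGEID%5D?url=x', '42');
```

Listing all three forms avoids depending on which encoder touched the string last.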
urlencode function and rawurlencode are mostly based on RFC 1738. However, since 2005 the current RFC in use for URIs standard is RFC 3986.
On the other hand, the DOM extension uses UTF-8 encoding, which is based on RFC 3629. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.
The generic URI syntax mandates that new URI schemes that provide for
the representation of character data in a URI must, in effect,
represent characters from the unreserved set without translation, and
should convert all other characters to bytes according to UTF-8, and
then percent-encode those values.
Here is a function to decode URLs according to RFC 3986.
<?php
function myUrlEncode($string) {
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
return str_replace($entities, $replacements, urldecode($string));
}
?>
PHP Fiddle.
Update:
Since UTF8 has been used to encode $message:
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))
Use urldecode($message) when returning the URL without percents.
die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge);
The root cause of your problem has been very well explained from a technical point of view.
In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.
By processing your input $message through a DomDocument object, you have moved to a higher level of abstraction. It is wrong to manipulate as a unique plain string something that has been "promoted" to a HTML stream.
Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:
$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';
$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document
// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);
// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);
echo $dom->saveHTML();
As illustrated above, today we are trying to fix the problem when your token is inside a href, but one day we may want to search and replace the tag elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.
(an alternative option would be not loading a DomDocument until all low-level replacements are done, but I am guessing this is not practical)
Complete proof of concept:
function searchAndReplace(DOMNode $node, $search, $replace) {
if($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
$input = $attribute->nodeValue;
$output = str_replace($search, $replace, $input);
$attribute->nodeValue = $output;
}
}
if(!$node instanceof DOMElement) { // this test needs double-checking
$input = $node->nodeValue;
$output = str_replace($search, $replace, $input);
$node->nodeValue = $output;
}
if($node->hasChildNodes()) {
foreach ($node->childNodes as $child) {
searchAndReplace($child, $search, $replace);
}
}
}
$token = '<>&;[@MERGEID]';
$message = '<a/>';
$dom = new DOMDocument();
$dom->loadHTML($message);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo#$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);
echo $dom->saveHTML();
searchAndReplace($dom, $token, '*replaced*');
echo $dom->saveHTML();
If you use saveXML() it won't mess with the encoding the way saveHTML() does:
PHP
//your code...
$message = $dom_document->saveXML();
EDIT: also remove the XML tag:
//this will add an xml tag, so just remove it
$message=preg_replace("/\<\?xml(.*?)\?\>/","",$message);
echo $message;
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>Google</body></html>
Notice that both still correctly convert & to &amp;
Would it not make sense to just urlencode the original [@MERGEID] when saving it in the first place as well? Your search should then match without the need for the str_replace.
$newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link;
I know this does not answer the first post of the question, but you cannot post code in comments as far as I can tell.

Add http:// to a link if it doesn't have it

I have this very simple URL BBCode parser, which I wish to adjust so that if the link does not contain http://, it is added. How can I do this?
$find = array(
"/\[url\=(.+?)\](.+?)\[\/url\]/is",
"/\[url\](.+?)\[\/url\]/is"
);
$replace = array(
"$2",
"$1"
);
$body = preg_replace($find, $replace, $body);
You can use a (http://)? to match the http:// if exists, and ignore the group result in 'replace to' pattern and use your own http:// , like this:
$find = array(
"/\[url\=(http:\/\/)?(.+?)\](.+?)\[\/url\]/is",
"/\[url\](http:\/\/)?(.+?)\[\/url\]/is"
);
$replace = array(
"$3",
"$2"
);
$body = preg_replace($find, $replace, $body);
if(strpos($string, 'http://') === FALSE) {
// add http:// to string
}
// I've added the http:// to the regex as an optional, non-capturing group,
// then always add it in the replacement
$find = array(
"/\[url\=(?:http:\/\/)?(.+?)\](.+?)\[\/url\]/is",
"/\[url\](.+?)\[\/url\]/is"
);
$replace = array(
"$2",
"http://$1"
);
$body = preg_replace($find, $replace, $body);
If you would use a callback function and preg_replace_callback(), you can use something like this:
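A minimal sketch of the callback approach, under assumptions: the function name `bbcodeUrls` and the `<a href="...">` output format are illustrative, not taken from the question's code.

```php
<?php
// Hypothetical helper: convert [url]...[/url] and [url=...]...[/url] to links,
// adding http:// when the target has no http(s) scheme.
function bbcodeUrls(string $body): string
{
    return preg_replace_callback(
        '~\[url(?:=(.+?))?\](.+?)\[/url\]~is',
        function (array $m): string {
            // With [url=target]text[/url] group 1 holds the target;
            // with [url]target[/url] group 1 is empty and the text is the target.
            $target = $m[1] !== '' ? $m[1] : $m[2];
            if (!preg_match('~^https?://~i', $target)) {
                $target = 'http://' . $target;
            }
            return '<a href="' . htmlspecialchars($target, ENT_QUOTES) . '">' . $m[2] . '</a>';
        },
        $body
    );
}

echo bbcodeUrls('[url]example.com[/url] and [url=https://example.org]Example[/url]');
```

Using `~` as the pattern delimiter avoids having to escape the slashes in `http://` and `[/url]`.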
You can do it this way. It strips any existing 'http://' and then always prepends 'http://':
$string = 'http://'. str_replace('http://', '', $string);

PHP remove newline in string for both Linux and Windows

Currently I have to call
$html = str_replace($search="\r\n", $replace='', $subject=$html);
$html = str_replace($search="\n", $replace='', $subject=$html);
to remove new line character in string $html. Is there a better/shorter way?
Try:
$html = str_replace(array("\r", "\n"), '', $html);
Yes, you can do that at once by using an array:
$search = array("\r\n", "\n");
$result = str_replace($search, $replace='', $subject=$html);
See the str_replace docs.

php regex to get string inside href tag

I need a regex that will give me the string inside an href tag and inside the quotes also.
For example i need to extract theurltoget.com in the following:
URL
Additionally, I only want the base url part. I.e. from http://www.mydomain.com/page.html i only want http://www.mydomain.com/
Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:
$xml = simplexml_load_string($myHtml);
$list = $xml->xpath("//@href");
$preparedUrls = array();
foreach($list as $item) {
$item = parse_url($item);
$preparedUrls[] = $item['scheme'] . '://' . $item['host'] . '/';
}
print_r($preparedUrls);
$html = 'URL';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com
this expression will handle 3 options:
no quotes
double quotes
single quotes
'/href=["\']?([^"\'>]+)["\']?/'
Use the answer by @Alec if you're only looking for the base url part (the 2nd part of the question by @David)!
$html = 'URL';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
This will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html" class="myclass" rel="myrel
)
So you can use $href = $info["scheme"] . "://" . $info["host"]
Which gives you:
// http://www.mydomain.com
When you are looking for the entire URL inside the href, you should use another regex, for instance the regex provided by @user2520237.
$html = 'URL';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);
this will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html
)
Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"];
Which gives you:
// http://www.mydomain.com/page.html
http://www.the-art-of-web.com/php/parse-links/
Let's start with the simplest case - a well formatted link with no extra attributes:
/<a href=\"([^\"]*)\">(.*)<\/a>/iU
For all href values replacement:
function replaceHref($html, $replaceStr)
{
$match = array();
$url = preg_match_all('/<a [^>]*href="(.+)"/', $html, $match);
if(count($match))
{
for($j=0; $j<count($match[1]); $j++)
{
$html = str_replace($match[1][$j], $replaceStr.urlencode($match[1][$j]), $html);
}
}
return $html;
}
$replaceStr = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);
echo $replaceHtml;
This will handle the case where there are no quotes around the URL.
/<a [^>]*href="?([^">]+)"?>/
But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
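A minimal sketch of the DOM route for this question, assuming the goal is the scheme-plus-host part of each link; the function name `baseUrlsFromLinks` is hypothetical.

```php
<?php
// Hypothetical helper: return "scheme://host/" for every <a href> that has both parts.
function baseUrlsFromLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bases = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // getAttribute() returns the href value regardless of quote style or other attributes.
        $parts = parse_url($a->getAttribute('href'));
        if (is_array($parts) && isset($parts['scheme'], $parts['host'])) {
            $bases[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }
    return $bases;
}

print_r(baseUrlsFromLinks('<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>'));
```

Relative links have no scheme or host after parse_url(), so they are skipped rather than guessed at.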
~href="(https?://[^/]*)~
I think you should be able to handle the rest.
Because Positive and Negative Lookbehind are cool
/(?<=href=\").+(?=\")/
It will match only what you want, without quotation marks
Array (
[0] => theurltoget.com )
