php - preg_replace - adding a protocol to href and src elements - php

Is it possible to add a protocol to URLs (href and src) that don't already contain one?
For example, I would like to replace this URL:
TEXT
to:
TEXT
Two things are important:
if the original URL in href/src starts with a slash ("/"), the protocol and domain should be added without a trailing slash; if it doesn't start with a slash, the protocol and domain should be added with a trailing slash,
if the original URL starts with "../", "./", etc., that prefix should be removed, and then the protocol and domain should be added with a trailing slash.
Is it possible to do this in one regex?
Thanks.
EDIT:
Here is my code:
$url = 'http://my-page.com/';
$html = file_get_contents($url);
preg_match('"charset=([A-Za-z0-9\-]+)"si', $html, $charset);
$charset = strlen($charset[1]) > 3 ? $charset[1] : 'UTF-8';
$html = mb_convert_encoding($html, 'HTML-ENTITIES', $charset);
preg_match_all('"href=\"(.*?)\""si', $html, $matches);
foreach($matches[1] AS $key => $value)
{
if ( preg_match("/^(http|https):/", $value) )
{
continue;
}
$html = str_replace($value, $url.$value, $html);
}
preg_match_all('"src=\"(.*?)\""si', $html, $matches);
foreach($matches[1] AS $key => $value)
{
if ( preg_match("/^(http|https):/", $value) )
{
continue;
}
$html = str_replace($value, $url.$value, $html);
}
echo $html;

I would use a regex like this in sed or a similar tool (note the | delimiter, since the URL itself contains slashes):
sed 's|href="|href="http://site.domain|g'
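The two rules from the question can also be sketched in PHP with preg_replace_callback. This is a minimal sketch rather than a full URL resolver: the function name absolutizeUrls and the sample markup are made up for illustration, $base is assumed to carry no trailing slash, and leading "./" or "../" segments are simply dropped (as the question asks) instead of being resolved against a directory.

```php
<?php
// Hypothetical helper: prefixes relative href/src values with a base URL.
// Assumption: $base has no trailing slash, e.g. 'http://my-page.com'.
function absolutizeUrls(string $html, string $base): string
{
    // Captures the attribute name, the quote character and the quoted value.
    $pattern = '~\b(href|src)=(["\'])(.*?)\2~i';

    return preg_replace_callback($pattern, function (array $m) use ($base): string {
        $url = $m[3];

        // Leave empty values, URLs with a scheme, protocol-relative URLs
        // and fragment-only links untouched.
        if ($url === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
            return $m[0];
        }

        // Drop leading "./" and "../" segments, then any leading slash,
        // so exactly one slash separates the base from the path.
        $path = preg_replace('~^(?:\.{1,2}/)+~', '', $url);
        $path = ltrim($path, '/');

        return $m[1] . '=' . $m[2] . $base . '/' . $path . $m[2];
    }, $html);
}

$html = '<a href="/about">A</a> <img src="../img/x.png"> <a href="page.html">P</a>';
echo absolutizeUrls($html, 'http://my-page.com'), "\n";
```

Because the whole attribute is matched and rewritten once, this avoids the double-replacement problem that str_replace() on bare URL values can cause when the same value appears several times in the page.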

Related

Replace all relative URLs with absolute URLS

I've seen a few answers (like this one), but I have some more complex scenarios I'm not sure how to account for.
I essentially have full HTML documents. I need to replace every single relative URL with absolute URLs.
Elements in the HTML may look like the following (there may be other cases as well):
<img src="/relative/url/img.jpg" />
<form action="/">
<form action="/contact-us/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" />
Desired Output would be:
// "//example.com/" is ideal, but "http(s)://example.com/" are acceptable
<img src="//example.com/relative/url/img.jpg" />
<form action="//example.com/">
<form action="//example.com/contact-us/">
<a href='//example.com/relative/url/'>Note the Single Quote</a>
<img src="//example.com/protocol-relative-img.jpg" /> <!-- Unmodified -->
I DON'T want to replace protocol relative URLs, since they already function as absolute URLs. I've come up with some code that works, but I'm wondering if I can clean it up a little, as it's extremely repetitive.
But I have to account for single and double quoted attribute values for src, href, and action (am I missing any attributes that can have relative URLs?) while simultaneously avoiding protocol relative URLs.
Here's what I have so far:
// Make URL replacement protocol relative to not break insecure/secure links
$url = str_replace( array( 'http://', 'https://' ), '//', $url );
// Temporarily Modify Protocol-Relative URLS
$str = str_replace( 'src="//', 'src="::TEMP_REPLACE::', $str );
$str = str_replace( "src='//", "src='::TEMP_REPLACE::", $str );
$str = str_replace( 'href="//', 'href="::TEMP_REPLACE::', $str );
$str = str_replace( "href='//", "href='::TEMP_REPLACE::", $str );
$str = str_replace( 'action="//', 'action="::TEMP_REPLACE::', $str );
$str = str_replace( "action='//", "action='::TEMP_REPLACE::", $str );
// Replace all other Relative URLS
$str = str_replace( 'src="/', 'src="'. $url .'/', $str );
$str = str_replace( "src='/", "src='". $url ."/", $str );
$str = str_replace( 'href="/', 'href="'. $url .'/', $str );
$str = str_replace( "href='/", "href='". $url ."/", $str );
$str = str_replace( 'action="/', 'action="'. $url .'/', $str );
$str = str_replace( "action='/", "action='". $url ."/", $str );
// Change Protocol Relative URLs back
$str = str_replace( 'src="::TEMP_REPLACE::', 'src="//', $str );
$str = str_replace( "src='::TEMP_REPLACE::", "src='//", $str );
$str = str_replace( 'href="::TEMP_REPLACE::', 'href="//', $str );
$str = str_replace( "href='::TEMP_REPLACE::", "href='//", $str );
$str = str_replace( 'action="::TEMP_REPLACE::', 'action="//', $str );
$str = str_replace( "action='::TEMP_REPLACE::", "action='//", $str );
I mean, it works, but it's uuugly, and I was thinking there's probably a better way to do it.
New Answer
If your real html document is valid (and has a parent/containing tag), then the most appropriate and reliable technique will be to use a proper DOM parser.
Here is how DOMDocument and Xpath can be used to elegantly target and replace your designated tag attributes:
Code1 - Nested Xpath Queries: (Demo)
$domain = '//example.com';
$tagsAndAttributes = [
'img' => 'src',
'form' => 'action',
'a' => 'href'
];
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($tagsAndAttributes as $tag => $attr) {
foreach ($xpath->query("//{$tag}[not(starts-with(@{$attr}, '//'))]") as $node) {
$node->setAttribute($attr, $domain . $node->getAttribute($attr));
}
}
echo $dom->saveHTML();
Code2 - Single Xpath Query w/ Condition Block: (Demo)
$domain = '//example.com';
$targets = [
"//img[not(starts-with(@src, '//'))]",
"//form[not(starts-with(@action, '//'))]",
"//a[not(starts-with(@href, '//'))]"
];
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query(implode('|', $targets)) as $node) {
if ($src = $node->getAttribute('src')) {
$node->setAttribute('src', $domain . $src);
} elseif ($action = $node->getAttribute('action')) {
$node->setAttribute('action', $domain . $action);
} else {
$node->setAttribute('href', $domain . $node->getAttribute('href'));
}
}
echo $dom->saveHTML();
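A possible variant of the DOM approach, as a sketch only: it selects any element whose src, href or action starts with exactly one slash, so absolute URLs such as http://... and elements that lack the attribute are left alone as well. The function name prefixRootRelative and the sample markup are assumptions for illustration, and the input is assumed to have a single containing tag.

```php
<?php
// Hypothetical helper: prefixes root-relative src/href/action values with $domain.
function prefixRootRelative(string $html, string $domain): string
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach (['src', 'href', 'action'] as $attr) {
        // Values that start with one slash but not with two.
        $query = "//*[starts-with(@{$attr}, '/') and not(starts-with(@{$attr}, '//'))]";
        foreach ($xpath->query($query) as $node) {
            $node->setAttribute($attr, $domain . $node->getAttribute($attr));
        }
    }

    return $dom->saveHTML();
}

$html = '<div><img src="/relative/url/img.jpg"><form action="/"></form>'
      . '<a href="//example.com/kept">kept</a></div>';
echo prefixRootRelative($html, '//example.com');
```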
Old Answer: (...regex is not "DOM-aware" and is vulnerable to unexpected breakage)
If I understand you properly, you have a base value in mind, and you only want to apply it to relative paths.
Code: (Demo)
$html=<<<HTML
<img src="/relative/url/img.jpg" />
<form action="/">
<a href='/relative/url/'>Note the Single Quote</a>
<img src="//site.com/protocol-relative-img.jpg" />
HTML;
$base='https://example.com';
echo preg_replace('~(?:src|action|href)=[\'"]\K/(?!/)[^\'"]*~',"$base$0",$html);
Output:
<img src="https://example.com/relative/url/img.jpg" />
<form action="https://example.com/">
<a href='https://example.com/relative/url/'>Note the Single Quote</a>
<img src="//site.com/protocol-relative-img.jpg" />
Pattern Breakdown:
~ #Pattern delimiter
(?:src|action|href) #Match: src or action or href
= #Match equal sign
[\'"] #Match single or double quote
\K #Restart fullstring match (discard previously matched characters)
/ #Match slash
(?!/) #Negative lookahead (zero-length assertion): must not be a slash immediately after first matched slash
[^\'"]* #Match zero or more non-single/double quote characters
~ #Pattern delimiter
I think that the <base> element is what you are looking for:
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
<base> is an empty element that goes in the <head>. Using <base href="https://example.com/path/" /> tells all relative URLs in the document to resolve against https://example.com/path/ instead of the document's own URL.
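If the HTML is being generated or post-processed in PHP anyway, adding the element could look roughly like this. It is a sketch under assumptions: the helper name addBaseTag is made up, the document is assumed to contain a <head> tag, and any <base> that already exists is not checked for.

```php
<?php
// Hypothetical helper: inserts a <base> element right after the opening <head> tag.
function addBaseTag(string $html, string $baseHref): string
{
    $tag = '<base href="' . htmlspecialchars($baseHref, ENT_QUOTES) . '">';

    // A callback is used so that "$" or "\" in the URL is never read as a
    // backreference; the limit of 1 touches only the first <head>.
    return preg_replace_callback('~<head\b[^>]*>~i', function (array $m) use ($tag): string {
        return $m[0] . $tag;
    }, $html, 1);
}

echo addBaseTag('<html><head><title>t</title></head><body></body></html>', 'https://example.com/path/'), "\n";
```

Unlike the rewriting approaches, this leaves every attribute value as it was and relies on the browser to resolve the relative URLs.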

Adjust youtube url to embed parser to remove extras

I have this php code to turn youtube urls into videos automatically:
$search = '%
(?:https?://)?
(?:www\.)?
(?:
youtu\.be/
| youtube\.com
(?:
/embed/
| /v/
| /watch\?v=
| /watch\?feature=player_embedded&v=
)
)
([\w\-]{10,12})
\b
%x';
$replace = "<iframe class=\"youtube-player\" width=\"550\" height=\"385\" src=\"http://www.youtube.com/embed/$1\" data-youtube-id=\"$1\" frameborder=\"0\" allowfullscreen></iframe>";
return preg_replace($search, $replace, $url);
What would be the easiest way to strip out anything after the video id?
Wow. The suggested link actually links to another regex. Use parse_url and parse_str, that's what they're there for, see this answer. Parsing URLs with regex is hard and there's no reason to reinvent the wheel.
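A minimal sketch of the parse_url() and parse_str() approach, assuming the URL shapes listed in the question (youtu.be/ID, /embed/ID, /v/ID and /watch?v=ID). The function name youtubeId is made up for illustration, and the 10 to 12 character length check is taken from the question's pattern. Anything after the ID (extra query parameters, fragments) is ignored because only the path or the v parameter is read.

```php
<?php
// Hypothetical helper: returns the video ID from a YouTube URL, or null.
function youtubeId(string $url): ?string
{
    // parse_url() only recognises the host when a scheme is present.
    if (!preg_match('~^https?://~i', $url)) {
        $url = 'http://' . ltrim($url, '/');
    }

    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    $host = strtolower(preg_replace('~^www\.~i', '', $parts['host']));
    $path = $parts['path'] ?? '';
    $id   = null;

    if ($host === 'youtu.be') {
        $id = ltrim($path, '/');
    } elseif ($host === 'youtube.com') {
        if (preg_match('~^/(?:embed|v)/([^/]+)~', $path, $m)) {
            $id = $m[1];
        } else {
            // Covers /watch?v=ID with any number of extra parameters.
            parse_str($parts['query'] ?? '', $query);
            $id = $query['v'] ?? null;
        }
    }

    return is_string($id) && preg_match('~^[\w-]{10,12}$~', $id) ? $id : null;
}

var_dump(youtubeId('http://www.youtube.com/watch?feature=player_embedded&v=UyxqmghxS6M'));
```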
I found a way thanks to others links, here's a function to search a body of text and replace all youtube links with videos:
function youtube($body)
{
$video_pattern = '~(?:http|https|)(?::\/\/|)(?:www.|)(?:youtu\.be\/|youtube\.com(?:\/embed\/|\/v\/|\/watch\?v=|\/ytscreeningroom\?v=|\/feeds\/api\/videos\/|\/user\S*[^\w\-\s]|\S*[^\w\-\s]))([\w\-]{11})[a-z0-9;:@#?&%=+\/\$_.-]*~i';
preg_match_all($video_pattern, $body, $matches);
//print_r($matches[0]);
foreach ($matches[0] as $url)
{
if (strpos($url, 'feature=youtu.be') !== false || strpos($url, 'youtu.be') === false) // strict checks: strpos() can return 0
{
parse_str(parse_url($url, PHP_URL_QUERY), $id);
$id = $id['v'];
}
else
{
$id = basename($url);
}
$body = str_replace($url, "<iframe class=\"youtube-player\" width=\"550\" height=\"385\" src=\"http://www.youtube.com/embed/{$id}\" data-youtube-id=\"{$id}\" frameborder=\"0\" allowfullscreen></iframe>", $body);
}
return $body;
}

check string for youtube link or image

I have a small piece of code that checks a string for a URL and adds the <a href> tag to create a link. I also have it check the string for a YouTube link and then add rel="youtube" to the <a> tag.
How can I get the code to only add rel to the youtube links?
How can I get it to add a different rel to any type of image link?
$text = "http://site.com a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M here is another site";
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '\4', $text );
if(preg_match('/http:\/\/www\.youtube\.com\/watch\?v=[^&]+/', $linkstring, $vresult)) {
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '<a rel="youtube" href="\0">\4</a>', $text );
$type= 'youtube';
}
else {
$type = 'none';
}
echo $text;
echo $linkstring, "<br />";
echo $type, "<br />";
Try http://simplehtmldom.sourceforge.net/.
Code:
<?php
include('simple_html_dom.php');
$html = str_get_html('Link');
$html->find('a', 0)->rel = 'youtube';
echo $html;
Output:
[username@localhost dom]$ php dom.php
Link
You can build an entire page DOM or a simple single link with this library.
Detecting hostname of URL:
Pass the url to parse_url. parse_url returns an array of the URL parts.
Code:
print_r(parse_url('http://www.youtube.com/watch?v=UyxqmghxS6M'));
Output:
Array
(
[scheme] => http
[host] => www.youtube.com
[path] => /watch
[query] => v=UyxqmghxS6M
)
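Building on parse_url(), choosing the rel value could be sketched as follows. The helper name relForUrl is an assumption, as are the host list and the image extension list; URLs without a scheme (such as www.anothersite.com/) have no host as far as parse_url() is concerned and simply get no rel here.

```php
<?php
// Hypothetical helper: decides which rel value (if any) a link should get.
function relForUrl(string $url): ?string
{
    // parse_url() returns null/false for missing or malformed parts; cast to ''.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $path = (string) parse_url($url, PHP_URL_PATH);

    if (in_array($host, ['youtube.com', 'www.youtube.com', 'youtu.be'], true)) {
        return 'youtube';
    }

    // Only the path is inspected, so a query string cannot hide the extension.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (in_array($ext, ['png', 'jpg', 'jpeg', 'bmp', 'gif'], true)) {
        return 'image';
    }

    return null;
}

var_dump(relForUrl('http://www.youtube.com/watch?v=UyxqmghxS6M'));
var_dump(relForUrl('http://site.com/bounty.png'));
```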
Try the following:
//text
$text = "http://site.com/bounty.png a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M&featured=true here is another site";
//Youtube links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}youtube\.com\/watch\?v=([a-z0-9\-_\|]{11})[^\s]*/i";
$replacement = '<a rel="youtube" href="http://www.youtube.com/watch?v=\3">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
//image links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}[^\/]+\/[^\s]+\.(png|jpg|jpeg|bmp|gif)[^\s]*/i";
$replacement = '<a rel="image" href="\0">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
Note that the latter can only detect links to images that have a file extension. As such, links like www.example.com?image=3 will not be detected.

PHP Get end string on url between / and /

I need to get the last string content of the url between / and /
For example:
http://mydomain.com/get_this/
or
http://mydomain.com/lists/get_this/
I need to get where get_this is in the url.
trim() removes the trailing slash, strrpos() finds the last occurrence of / (after it's trimmed), and substr() gets all content after the last occurrence of /.
$url = trim($url, '/');
echo substr($url, strrpos($url, '/')+1);
Even better, you can just use basename(), like hakre suggested:
echo basename($url);
Assuming there always is a trailing slash:
$parts = explode('/', $url);
$get_this = $parts[count($parts)-2]; // -2 since there will be an empty array element due to the trailing slash
If not:
$url = trim($url, '/'); // If there is a trailing slash in this URL instance get rid of it so we're always sure the last part is where we expect it
$parts = explode('/', $url);
$get_this = $parts[count($parts)-1];
Something like this should work.
<?php
$subject = "http://mydomain.com/lists/get_this/";
$pattern = '/\/([^\/]*)\/$/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, 3);
print_r($matches);
?>
Just use parse_url() and explode():
<?php
$url = "http://mydomain.com/lists/get_this/";
$path = parse_url($url, PHP_URL_PATH);
$path_array = array_values(array_filter(explode('/', $path))); // array_filter() keeps the original keys, so reindex
$last_path = $path_array[count($path_array) - 1];
echo $last_path;
?>
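The parse_url() and basename() ideas can be combined into one short helper. This is only a sketch, and the name lastPathSegment is made up. Reading the path component first means a query string or fragment after the final slash does not end up in the result, which is what would happen with basename() on the full URL.

```php
<?php
// Hypothetical helper: returns the last non-empty path segment of a URL.
function lastPathSegment(string $url): string
{
    // Take only the path, so "?page=2" or "#top" cannot leak into the result.
    $path = (string) parse_url($url, PHP_URL_PATH);

    // basename() ignores a trailing slash and returns the final segment.
    return basename($path);
}

echo lastPathSegment('http://mydomain.com/lists/get_this/'), "\n";
echo lastPathSegment('http://mydomain.com/get_this/?page=2'), "\n";
```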
You can try this:
preg_match("/http:\/\/([a-z0-9\.]+)\/(.+)\/(.*)\/?/", $url, $matches);
print_r($matches);

php regex to get string inside href tag

I need a regex that will give me the string inside an href tag and inside the quotes also.
For example, I need to extract theurltoget.com from the following:
URL
Additionally, I only want the base URL part, i.e. from http://www.mydomain.com/page.html I only want http://www.mydomain.com/
Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:
$xml = simplexml_load_string($myHtml);
$list = $xml->xpath("//@href");
$preparedUrls = array();
foreach($list as $item) {
$item = parse_url($item);
$preparedUrls[] = $item['scheme'] . '://' . $item['host'] . '/';
}
print_r($preparedUrls);
$html = 'URL';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com
This expression handles three cases:
no quotes
double quotes
single quotes
'/href=["\']?([^"\'>]+)["\']?/'
Use the answer by @Alec if you're only looking for the base URL part (the second part of the question by @David)!
$html = 'URL';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
This will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html" class="myclass" rel="myrel
)
So you can use $href = $info["scheme"] . "://" . $info["host"]
Which gives you:
// http://www.mydomain.com
If you are looking for the entire URL inside the href, you should use a different regex, for instance the one provided by @user2520237.
$html = 'URL';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);
this will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html
)
Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"];
Which gives you:
// http://www.mydomain.com/page.html
http://www.the-art-of-web.com/php/parse-links/
Let's start with the simplest case - a well formatted link with no extra attributes:
/<a href=\"([^\"]*)\">(.*)<\/a>/iU
For all href values replacement:
function replaceHref($html, $replaceStr)
{
$match = array();
$url = preg_match_all('/<a [^>]*href="(.+)"/', $html, $match);
if (!empty($match[1]))
{
for ($j = 0; $j < count($match[1]); $j++) // loop over the captured URLs, not over the capture groups
{
$html = str_replace($match[1][$j], $replaceStr.urlencode($match[1][$j]), $html);
}
}
return $html;
}
$replaceStr = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);
echo $replaceHtml;
This will handle the case where there are no quotes around the URL.
/<a [^>]*href="?([^">]+)"?>/
But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
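As a rough illustration of the DOM route, the sketch below reads every <a href> with DOMDocument and reduces it to scheme and host with parse_url(). The function name baseUrlsFromLinks is hypothetical, and the sample markup is reconstructed from the array output shown earlier in this thread, so treat it as an assumption.

```php
<?php
// Hypothetical helper: returns "scheme://host/" for every <a href> in the markup.
function baseUrlsFromLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bases = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $parts = parse_url($a->getAttribute('href'));
        // Skip relative links, which have no scheme or host.
        if (is_array($parts) && isset($parts['scheme'], $parts['host'])) {
            $bases[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }

    return $bases;
}

print_r(baseUrlsFromLinks('<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>'));
```

Extra attributes, single quotes or unquoted values do not matter here, since the parser has already separated the href value from the rest of the tag.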
~href="(https?://[^/]*)~
I think you should be able to handle the rest.
Because Positive and Negative Lookbehind are cool
/(?<=href=\").+(?=\")/
It will match only what you want, without quotation marks
Array (
[0] => theurltoget.com )
