Regex for Replacing Absolute URLs with Relative URLs - php

How can I write a regex that converts any absolute URL to a relative path? For example:
src="http://www.test.localhost/sites/"
would become
src="/sites/"
The domains are not static.
I can't use parse_url (as per this answer) because the URL is part of a larger string that contains non-URL data as well.

Solution
You can use the following regex:
/https?:\/{2}[^\/]+/
Which would match the following:
http://www.test.localhost/sites/
http://www.domain.localhost/sites/
http://domain.localhost/sites/
So it would be:
$domain = preg_replace('/https?:\/{2}[^\/]+/', '', $domain);
Explanation
http: Match the literal 'http'
s?: Optionally match an 's' right after 'http'
:: Match the ':' character
\/{2}: Match the '//'
[^\/]+: Match one or more characters that are not a slash (/), i.e. the host name
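Note that the pattern above strips the scheme and host from every URL in the string, including URLs that appear in plain text. A minimal sketch, assuming you only want to rewrite src and href attributes; the sample input and the variable names $html and $relative are made up for illustration:

```php
<?php
// Hypothetical input: markup mixed with non-URL text, as described in the question.
$html = 'Intro text <img src="http://www.test.localhost/sites/a.png"> and '
      . '<a href="https://domain.localhost/sites/">link</a>, plus a plain mention of http://example.localhost/x.';

// Remove scheme and host only when the URL directly follows src= or href= and a quote.
// [^/"']+ stops at the first slash or quote, so the host cannot run past the attribute.
$relative = preg_replace('~\b(src|href)=(["\'])https?://[^/"\']+~i', '$1=$2', $html);

echo $relative, PHP_EOL;
```

The URL in the plain text is left alone. A URL with no path at all (src="http://host") would end up as an empty attribute, so that case may need separate handling.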

This expression, or an improved version of it, might work to some extent:
^\s*src=["']\s*https?:\/\/(?:[^\/]+)([^"']+?)\s*["']$
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Test
$re = '/^\s*src=["\']\s*https?:\/\/(?:[^\/]+)([^"\']+?)\s*["\']$/m';
$str = 'src=" http://www.test.localhost/sites/ "
src=" https://www.test.localhost/sites/"
src=" http://test.localhost/sites/ "
src="https://test.localhost/sites/ "
src="https://localhost/sites/ "
src=\'https://localhost/ \'
src=\'http://www.test1.test2.test3.test4localhost/sites1/sites2/sites3/ \'';
$subst = 'src="$1"';
var_export(preg_replace($re, $subst, $str));
Output
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/"
src="/sites1/sites2/sites3/"

$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xp = new DOMXPath($dom);
// Visit every src attribute in the document.
foreach ($xp->query('//@src') as $attr) {
    $url = parse_url($attr->nodeValue);
    // Skip values that do not start with an http(s) scheme.
    if ( !isset($url['scheme']) || stripos($url['scheme'], 'http') !== 0 )
        continue;
    $src = $url['path']
        . ( isset($url['query']) ? '?' . $url['query'] : '' )
        . ( isset($url['fragment']) ? '#' . $url['fragment'] : '' );
    $attr->parentNode->setAttribute('src', $src);
}
$result = $dom->saveHTML();
I added an if condition to skip cases where it isn't possible to tell whether the beginning of the src attribute is a domain or the beginning of a path. Depending on what you are trying to do, you can remove this test.
If you are working with parts of an HTML document (i.e. not a full document), you have to replace $result = $dom->saveHTML() with something like:
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
    $result .= $dom->saveHTML($childNode);
}

Related

PHP Substring of a regex replacement

I have the following regex :
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</A>",$string);
Using it to parse this string : http://www.ttt.com.ar/hello_world
Produces this new string :
<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/hello_world</A>
So far, so good. What I want to do is make the second $1 in the replacement a substring of $1, producing an output like:
<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/...</A>
Pseudocode of what I mean:
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">substring($1,0,24)..</A>",$string);
Is this even possible? Probably I'm just doing it all wrong :)
Thanks in advance.
Check out preg_replace_callback():
$string = 'http://www.ttt.com.ar/hello_world';
$string = preg_replace_callback(
"/([\w]+:\/\/[\w\-?&;#~=\.\/\#]+[\w\/])/i",
function($matches) {
$link = $matches[1];
$substring = substr($link, 0, 24) . '..';
return "<a target=\"_blank\" href=\"$link\">$substring</a>";
},
$string
);
var_dump($string);
// <a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/he..</a>
Note that older PHP versions also offered the e modifier to execute code inside preg_replace(). It was deprecated in PHP 5.5.0 and removed in PHP 7.0.0 in favor of preg_replace_callback().
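A variation on the callback approach, as a sketch only: it shortens the label only when the URL is longer than the limit and escapes the values before putting them into HTML. The function name linkify(), the simplified URL pattern and the sample inputs are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper; the default limit of 24 comes from the question.
function linkify(string $text, int $limit = 24): string
{
    // Simplified URL pattern chosen for this sketch (not the pattern from the question).
    return preg_replace_callback(
        '~\bhttps?://[^\s<>"\']+~i',
        function (array $m) use ($limit): string {
            $url = $m[0];
            // Shorten the visible label only when the URL exceeds the limit.
            $label = strlen($url) > $limit ? substr($url, 0, $limit) . '...' : $url;
            // Escape both values before placing them in HTML.
            return '<a target="_blank" href="' . htmlspecialchars($url, ENT_QUOTES) . '">'
                . htmlspecialchars($label, ENT_QUOTES) . '</a>';
        },
        $text
    );
}

$long  = linkify('See http://www.ttt.com.ar/hello_world today');
$short = linkify('See http://a.io/x today');
echo $long, PHP_EOL, $short, PHP_EOL;
```

Since the first 24 characters of the sample URL end in "he", the long label becomes http://www.ttt.com.ar/he... while the short URL is kept whole.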
You can use a capturing group inside of a lookahead like this:
preg_replace(
"/((?=(.{24}))[\w]+:\/\/[\w\-?&;#~=\.\/\#]+[\w\/])/i",
"<a target=\"_blank\" href=\"$1\">$2..</A>",
$string);
This will capture the entire URL in group 1, but it will also capture the first 24 characters of it in group 2.
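One thing to keep in mind with the lookahead approach: .{24} counts characters from the start of the URL without caring where the URL ends. The sketch below uses a lightly simplified copy of the pattern (hyphen escaped, duplicate escapes dropped) and made-up sample strings to show the three cases.

```php
<?php
// Simplified copy of the pattern above; the hyphen is escaped inside the character class.
$pattern     = '/((?=(.{24}))\w+:\/\/[\w\-?&;#~=.\/]+[\w\/])/i';
$replacement = '<a target="_blank" href="$1">$2..</a>';

// Case 1: the URL is longer than 24 characters, so group 2 is its first 24 characters.
$long = preg_replace($pattern, $replacement, 'http://www.ttt.com.ar/hello_world');

// Case 2: fewer than 24 characters remain, the lookahead fails and nothing is replaced.
$short = preg_replace($pattern, $replacement, 'http://a.io/x');

// Case 3: a short URL followed by text; group 2 spills over into the following text.
$spill = preg_replace($pattern, $replacement, 'http://a.io/x is a short link, really');

echo $long, PHP_EOL, $short, PHP_EOL, $spill, PHP_EOL;
```

If short URLs can occur in your input, the callback version is probably the safer choice.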
This is bad practice. Regexes should not be used to parse or modify XML content in an application.
Suggestions:
Use a DOM parser to read and modify the value
Use parse_url() to get the protocol and domain name
Example:
$doc = new DOMDocument();
$doc->loadHTML(
'<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/hello_world</A>'
);
$link = $doc->getElementsByTagName('a')->item(0);
$url = parse_url($link->nodeValue);
$link->nodeValue = $url['scheme'] . '://' . $url['host'] . '/...';
echo $doc->saveHTML();

PHP - Inner HTML recursive replace

I need to perform a recursive str_replace on a portion of HTML (with recursive I mean inner nodes first), so I wrote:
$str = //get HTML;
$pttOpen = '(\w+) *([^<]{1,100}?)';
$pttClose = '\w+';
$pttHtml = '(?:(?!(?:<x-)).+)';
while (preg_match("%<x-(?:$pttOpen)>($pttHtml)*</x-($pttClose)>%m", $str, $match)) {
list($outerHtml, $open, $attributes, $innerHtml, $close) = $match;
$newHtml = //some work....
str_replace($outerHtml, $newHtml, $str);
}
The idea is to first replace non-nested x-tags.
But it only works if innerHtml is on the same line as the opening tag (so I guess I misunderstood what the /m modifier does). I don't want to use a DOM library, because I just need simple string replacement. Any help?
Try this regex:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s
Demo
http://regex101.com/r/nA2zO5
Sample code
$str = // get HTML
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s';
while (preg_match($pattern, $str, $matches)) {
$newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $matches['open'], $matches['innerHtml']);
$str = str_replace($matches[0], $newHtml, $str);
}
echo htmlspecialchars($str);
Output
Initially, $str contained this text:
<x-foo>
sdfgsdfgsd
<x-bar>
sdfgsdfg
</x-bar>
<x-baz attr1='5'>
sdfgsdfg
</x-baz>
sdfgsdfgs
</x-foo>
It ends up with:
<ns:foo>
sdfgsdfgsd
<ns:bar>
sdfgsdfg
</ns:bar>
<ns:baz>
sdfgsdfg
</ns:baz>
sdfgsdfgs
</ns:foo>
Since I didn't know what work is done on $newHtml, I mimicked it by replacing x- with ns: and removing any attributes.
Thanks to @Alex I came up with this:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is
Without the ((?!<x-).)*) in the innerHtml pattern it won't work with nested tags (it will first match outer ones, which isn't what I wanted). This way innermost ones are matched first. Hope this helps.
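To see the innermost-first order in action, here is a small sketch built on that pattern. The sample input, the $order array and the ns: renaming are placeholders for whatever the real replacement does.

```php
<?php
// Hypothetical input with one level of nesting, spread over several lines.
$str = "<x-foo>\nouter\n<x-bar id='1'>\ninner\n</x-bar>\n</x-foo>";

// (?:(?!<x-).)* consumes the content one character at a time but refuses to step
// onto another opening "<x-", so only a tag without nested x-tags can match.
// The s modifier lets "." cross newlines; m only changes what ^ and $ match.
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>(?:(?!<x-).)*)</x-(?P=open)>%is';

$order = [];
do {
    $str = preg_replace_callback(
        $pattern,
        function (array $m) use (&$order): string {
            $order[] = $m['open']; // record the order in which tags are replaced
            // Stand-in for the real work: rename the tag and drop its attributes.
            return sprintf('<ns:%1$s>%2$s</ns:%1$s>', $m['open'], $m['innerHtml']);
        },
        $str,
        -1,
        $count
    );
} while ($count > 0); // repeat until a pass replaces nothing

print_r($order);
echo $str, PHP_EOL;
```

The loop is needed because an outer tag only becomes matchable after its inner x-tags have been rewritten in an earlier pass.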
I don't know exactly what kind of changes you are trying to make, but this is how I would proceed:
$pattern = <<<'EOD'
~
<x-(?<tagName>\w++) (?<attributes>[^>]*+) >
(?<content>(?>[^<]++|<(?!/?x-))*) #by far more efficient than (?:(?!</?x-).)*
</x-\g<tagName>>
~x
EOD;
function callback($m) { // example function
return '<n-' . $m['tagName'] . $m['attributes'] . '>' . $m['content']
. '</n-' . $m['tagName'] . '>';
};
do {
$code = preg_replace_callback($pattern, 'callback', $code, -1, $count);
} while ($count);
echo htmlspecialchars(print_r($code, true));

PHP RegEx Negation Word

This is my preg_replace pattern:
/<img(.*?)src="(.*?)"/
This is my replacement:
<img$1src="'.$path.'$2"
I want to exclude one case:
If the img tag has rel="customimg", skip it and don't run the replacement on it.
Example: skip this line
<img rel="customimg" src="http..">
What should I add to this regex pattern?
I searched other posts, but couldn't find exactly this.
Because the src attribute may use single or double quotes, I suggest using
preg_replace(
"/(<img\b(?!.*\brel=[\"']customimg[\"']).*?\bsrc=)([\"']).*?\2/i",
"$1$2" . $path . "$2",
$string);
Edit:
To add url prefix instead of full url replacement, use
preg_replace(
"/(<img\b(?!.*\brel=[\"']customimg[\"']).*?\bsrc=)([\"'])(.*?)\2/i",
"$1$2" . $path . "$3$2",
$string);
Add a negative lookahead:
/<img(?![^>]*\srel="customimg")(.*?)src="(.*?)"/
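A quick way to check the negative lookahead is to run it on two tags, one with rel="customimg" and one without. This is only a sketch: the $path value and the markup are made up, and the groups use [^>]* and [^"]* instead of .*? so that a match cannot leave the current tag.

```php
<?php
$path = '/cdn'; // hypothetical prefix
$html = '<img rel="customimg" src="/a.png"><img class="x" src="/b.png">';

// (?![^>]*\srel="customimg") looks ahead inside the current tag only ([^>]* cannot
// pass the closing ">") and rejects the tag when it carries rel="customimg".
$result = preg_replace(
    '/<img(?![^>]*\srel="customimg")([^>]*?)src="([^"]*)"/',
    '<img$1src="' . $path . '$2"',
    $html
);

echo $result, PHP_EOL;
```

Only the second tag gets the prefix. If $path could contain a dollar sign or a backslash, it would have to be escaped before being concatenated into the replacement string.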
Because I only see regex "solutions" coming in, here is an answer using DOMDocument:
<?php
$path = 'the/path';
$doc = new DOMDocument();
@$doc->loadHTML('<img rel="customimg" src="/image.jpgm"><img src="/image.jpg">');
$xpath = new DOMXPath($doc);
$imageNodes = $xpath->query('//img[not(@rel="customimg")]');
foreach ($imageNodes as $node) {
$node->setAttribute('src', $path . $node->getAttribute('src'));
}
Demo: http://codepad.viper-7.com/uID5wz
It would seem like it'd be easier/more expressive to do
if(strpos($haystackString, '"customimg"') === false) // The === is important
{
// your preg_replace here
}
Edit: Thanks for pointing out missing param guys

Add id attribute to hyperlinks through PHP Regular Expressions

I am still relatively new to regular expressions and feel my code is being too greedy. I am trying to add an id attribute to existing links in a piece of code. My function looks like this:
function addClassHref($str) {
//$str = stripslashes($str);
$preg = "/<[\s]*a[\s]*href=[\s]*[\"\']?([\w.-]*)[\"\']?[^>]*>(.*?)<\/a>/i";
preg_match_all($preg, $str, $match);
foreach ($match[1] as $key => $val) {
$pattern[] = '/' . preg_quote($match[0][$key], '/') . '/';
$replace[] = "<a id='buttonRed' href='$val'>{$match[2][$key]}</a>";
}
return preg_replace($pattern, $replace, $str);
}
This adds the id tag like I want but it breaks the hyperlink. For example:
If the original code is: <a href="http://www.google.com">Link</a>
Instead of <a id="class" href="http://www.google.com">Link</a>
It is giving
<a id="class" href="http">Link</a>
Any suggestions or thoughts?
Do not use regular expressions to parse XML or HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$all_a = $doc->getElementsByTagName('a');
$firsta = $all_a->item(0);
$firsta->setAttribute('id', 'idvalue');
echo $doc->saveHTML($firsta);
You've got some overcomplications in your regex :)
Also, there's no need for the loop as preg_replace() will hit all the instances of the search pattern in the relevant string. The first regex below will take everything in the a tag and simply add the id attribute on at the end.
$str = 'Link' . "\n" .
'Link' . "\n" .
'Link';
$p = "{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i";
$r = "<a $1 id=\"class\">$2</a>";
echo preg_replace($p, $r, $str);
If you only want to capture the href attribute you could do the following:
$p = '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i';
$r = "<a href='$1' id='class'>$2</a>";
Your first subpattern ([\w.-]*) doesn't match :, thus it stops at "http".
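A short check makes this visible. The sketch compares the original character class with a capture that runs up to the closing quote; the sample link and the names $narrow and $wide are made up.

```php
<?php
$link = '<a href="http://www.google.com">Link</a>';

// Original subpattern: [\w.-]* cannot match ":" or "/", so group 1 ends after "http".
preg_match('/<\s*a\s*href=\s*["\']?([\w.-]*)/i', $link, $narrow);

// Capturing everything up to the matching closing quote yields the whole URL.
preg_match('/<\s*a\s*href=\s*(["\'])(.*?)\1/i', $link, $wide);

echo $narrow[1], PHP_EOL; // http
echo $wide[2], PHP_EOL;   // http://www.google.com
```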
Couldn't you just use a simple str_replace() for this? Regex seems like overkill if this is all you're doing.
$str = str_replace('<a ', '<a id="someID" ', $str);

PHP preg_replace weirdness with custom urls

I'm using the following code to add <span> tags after <a> tags.
$html = preg_replace("~<a.*?href=\"$url\".*?>.*?</a>~i", "$0<span>test</span>", $html);
The code is working fine for regular links (ie. http://www.google.com/), but it will not perform a replace when the contents of $url are $link$/3/.
This is example code to show the (mis)behaviour:
<?php
$urls = array();
$urls[] = '$link$/3/';
$urls[] = 'http://www.google.com/';
$html = '<a href="$link$/3/">Test Link</a>' . "\n" . '<a href="http://www.google.com/">Google</a>';
foreach($urls as $url) {
$html = preg_replace("~<a.*?href=\"$url\".*?>.*?</a>~i", "$0<span>test</span>", $html);
}
echo $html;
?>
And this is the output it produces:
<a href="$link$/3/">Test Link</a>
<a href="http://www.google.com/">Google</a><span>test</span>
Use $url = preg_quote($url, '~'); otherwise the dollar signs keep their usual regex meaning: end of input.
just somebody is correct; you must escape your special regex characters if you mean for them to be interpreted as literal.
It also looks to me like it can't perform the replace because it never makes a match.
Try replacing this line:
$urls[] = '$link$/3/';
With this:
$urls[] = '$link/3/';
$ is considered a special regex character and needs to be escaped. Use preg_quote() to escape $url before passing it to preg_replace().
$url = preg_quote($url, '~');
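Here is a side-by-side sketch of the unquoted and quoted variants. The markup in $html is an assumption about what the original link looked like.

```php
<?php
$url  = '$link$/3/';
$html = '<a href="$link$/3/">Test Link</a>'; // assumed markup

// Unquoted: "$" is an end-of-subject assertion, so the pattern can never match here.
$raw = preg_replace('~<a.*?href="' . $url . '".*?>.*?</a>~i', '$0<span>test</span>', $html);

// Quoted: preg_quote() escapes "$" and also the "~" delimiter passed as second argument.
$quoted = preg_replace(
    '~<a.*?href="' . preg_quote($url, '~') . '".*?>.*?</a>~i',
    '$0<span>test</span>',
    $html
);

echo $raw, PHP_EOL, $quoted, PHP_EOL;
```

The first call returns the input unchanged; the second appends the span after the link.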
$ has a special meaning in regex: end of line. Your expression is being expanded like this:
$html = preg_replace("~<a.*?href=\"$link$/3/\".*?>.*?</a>~i", "$0<span>test</span>", $html);
This fails because it can't find "link" between two end-of-line assertions. Try escaping the $ in the $urls array:
$urls[] = '\$link\$/3/';
