I have the following regex :
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</A>",$string);
Using it to parse this string : http://www.ttt.com.ar/hello_world
Produces this new string :
<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/hello_world</A>
So far, so good. What I want is for the link text to be a substring of $1, producing output like:
<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/...</A>
Pseudocode of what I mean:
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">substring($1,0,24)..</A>",$string);
Is this even possible? Probably I'm just doing it all wrong :)
Thanks in advance.
Check out preg_replace_callback():
$string = 'http://www.ttt.com.ar/hello_world';
$string = preg_replace_callback(
"/([\w]+:\/\/[\w\-?&;#~=\.\/\#]+[\w\/])/i",
function($matches) {
$link = $matches[1];
$substring = substr($link, 0, 24) . '..';
return "<a target=\"_blank\" href=\"$link\">$substring</a>";
},
$string
);
var_dump($string);
// <a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/he..</a>
// (use substr($link, 0, 22) . '...' to get "http://www.ttt.com.ar/..." instead)
Note: you can also use the e modifier to execute functions in preg_replace(), but it was deprecated in PHP 5.5.0 in favor of preg_replace_callback() and removed in PHP 7.0.0.
You can use a capturing group inside of a lookahead like this:
preg_replace(
"/((?=(.{24}))[\w]+:\/\/[\w\-?&;#~=\.\/\#]+[\w\/])/i",
"<a target=\"_blank\" href=\"$1\">$2..</A>",
$string);
This will capture the entire URL in group 1, but it will also capture the first 24 characters of it in group 2.
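One caveat worth checking before relying on the lookahead: (?=(.{24})) only succeeds when at least 24 characters follow the start of the match, so shorter URLs are skipped entirely. The sketch below runs the pattern on the sample URL and on a short one. The short URL is a made-up input, and the hyphen in the character class is escaped as a precaution for newer PCRE versions.

```php
<?php
$pattern = '/((?=(.{24}))\w+:\/\/[\w\-?&;#~=.\/]+[\w\/])/i';
$replacement = '<a target="_blank" href="$1">$2..</a>';

// Sample URL from the question: group 2 holds its first 24 characters
$long = preg_replace($pattern, $replacement, 'http://www.ttt.com.ar/hello_world');
echo $long, "\n";
// <a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/he..</a>

// Hypothetical short URL: fewer than 24 characters, so the lookahead fails
$short = preg_replace($pattern, $replacement, 'http://a.io/x');
echo $short, "\n";
// http://a.io/x   (unchanged, not linked)
```

If short URLs must still be linked, preg_replace_callback() with substr() avoids the length requirement.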
You are showing bad practice. Regexes should not be used to parse or modify XML/HTML content in application code.
Suggestions:
Use a DOM parser to read and modify the value
Use parse_url() to get the scheme and host name
Example:
$doc = new DOMDocument();
$doc->loadHTML(
'<a target="_blank" href="http://www.ttt.com.ar/hello_world">http://www.ttt.com.ar/hello_world</A>'
);
$link = $doc->getElementsByTagName('a')->item(0);
$url = parse_url($link->nodeValue);
$link->nodeValue = $url['scheme'] . '://' . $url['host'] . '/...';
echo $doc->saveHTML();
Related
How can I write a regex expression that will convert any absolute URLs to relative paths. For example:
src="http://www.test.localhost/sites/
would become
src="/sites/"
The domains are not static.
I can't use parse_url (as per this answer) because it is part of a larger string, that contains no-url data as well.
Solution
You can use the following regex:
/https?:\/{2}[^\/]+/
Which would match the following:
http://www.test.localhost/sites/
http://www.domain.localhost/sites/
http://domain.localhost/sites/
So it would be:
$domain = preg_replace('/https?:\/{2}[^\/]+/', '', $domain);
Explanation
http: Look for 'http'
s?: Look for an 's' after the 'http' if there's one
: : Look for the ':' character
\/{2}: Look for the '//'
[^\/]+: Go for anything that is not a slash (/)
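As a rough illustration of the explanation above, here is the same pattern applied to a larger string that mixes URLs with other data. The markup is made up for the example. The second pattern is a suggested variant, not part of the original answer: it also stops at quotes and whitespace, because the original would keep consuming up to the next slash when a URL has no path.

```php
<?php
// Sample markup is made up for illustration.
$html = '<img src="http://www.test.localhost/sites/a.png"> '
      . '<a href="https://example.localhost/x?y=1">x</a>';

// Strip scheme and host, keeping the path, query and fragment.
$relative = preg_replace('/https?:\/{2}[^\/]+/', '', $html);
echo $relative, "\n";
// <img src="/sites/a.png"> <a href="/x?y=1">x</a>

// Stricter variant (an assumption, not from the answer): also stop at
// quotes and whitespace so a URL without a path keeps its closing quote.
$noPath = '<a href="http://test.localhost">home</a>';
$strict = preg_replace('/https?:\/{2}[^\/"\'\s]+/', '', $noPath);
echo $strict, "\n";
// <a href="">home</a>
```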
My guess is that maybe this expression or an improved version of that might work to some extent:
^\s*src=["']\s*https?:\/\/(?:[^\/]+)([^"']+?)\s*["']$
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
Test
$re = '/^\s*src=["\']\s*https?:\/\/(?:[^\/]+)([^"\']+?)\s*["\']$/m';
$str = 'src=" http://www.test.localhost/sites/ "
src=" https://www.test.localhost/sites/"
src=" http://test.localhost/sites/ "
src="https://test.localhost/sites/ "
src="https://localhost/sites/ "
src=\'https://localhost/ \'
src=\'http://www.test1.test2.test3.test4localhost/sites1/sites2/sites3/ \'';
$subst = 'src="$1"';
var_export(preg_replace($re, $subst, $str));
Output
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/sites/"
src="/"
src="/sites1/sites2/sites3/"
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xp = new DOMXPath($dom);
foreach ($xp->query('//@src') as $attr) {
$url = parse_url($attr->nodeValue);
// skip anything that is not an absolute http(s) URL
if ( !isset($url['scheme']) || stripos($url['scheme'], 'http') !== 0 )
continue;
// a URL without a path (http://host) becomes "/"
$src = ( isset($url['path']) ? $url['path'] : '/' )
. ( isset($url['query']) ? '?' . $url['query'] : '' )
. ( isset($url['fragment']) ? '#' . $url['fragment'] : '' );
$attr->parentNode->setAttribute('src', $src);
}
$result = $dom->saveHTML();
I added an if condition to skip cases where it isn't possible to tell whether the beginning of the src attribute is a domain or the beginning of a path. Depending on what you are trying to do, you can remove this test.
If you are working with parts of an HTML document (i.e. not a full document), you have to replace $result = $dom->saveHTML() with something like:
$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
$result .= $dom->saveHTML($childNode);
}
I need to perform a recursive str_replace on a portion of HTML (by recursive I mean inner nodes first), so I wrote:
$str = //get HTML;
$pttOpen = '(\w+) *([^<]{1,100}?)';
$pttClose = '\w+';
$pttHtml = '(?:(?!(?:<x-)).+)';
while (preg_match("%<x-(?:$pttOpen)>($pttHtml)*</x-($pttClose)>%m", $str, $match)) {
list($outerHtml, $open, $attributes, $innerHtml, $close) = $match;
$newHtml = //some work....
$str = str_replace($outerHtml, $newHtml, $str);
}
The idea is to first replace non-nested x-tags.
But it only works if innerHtml is on the same line as the opening tag (so I guess I misunderstood what the /m modifier does). I don't want to use a DOM library, because I just need simple string replacement. Any help?
Try this regex:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s
Demo
http://regex101.com/r/nA2zO5
Sample code
$str = // get HTML
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s';
while (preg_match($pattern, $str, $matches)) {
$newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $matches['open'], $matches['innerHtml']);
$str = str_replace($matches[0], $newHtml, $str);
}
echo htmlspecialchars($str);
Output
Initially, $str contained this text:
<x-foo>
sdfgsdfgsd
<x-bar>
sdfgsdfg
</x-bar>
<x-baz attr1='5'>
sdfgsdfg
</x-baz>
sdfgsdfgs
</x-foo>
It ends up with:
<ns:foo>
sdfgsdfgsd
<ns:bar>
sdfgsdfg
</ns:bar>
<ns:baz>
sdfgsdfg
</ns:baz>
sdfgsdfgs
</ns:foo>
Since I didn't know what work is done on $newHtml, I mimicked it by replacing x- with ns: and removing any attributes.
Thanks to @Alex I came up with this:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is
Without the ((?!<x-).)*) in the innerHtml pattern it won't work with nested tags (it will first match outer ones, which isn't what I wanted). This way innermost ones are matched first. Hope this helps.
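To see the innermost-first order in action, here is a small sketch using that pattern in a loop. The input string and the square-bracket replacement are placeholders standing in for the real "some work" step, which the question does not show.

```php
<?php
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is';

// Made-up input: <x-bar> is nested inside <x-foo>, spread over several lines
$str = "<x-foo>\na\n<x-bar>\nb\n</x-bar>\nc\n</x-foo>";

$order = [];
while (preg_match($pattern, $str, $m)) {
    $order[] = $m['open'];
    // Placeholder transformation: swap the x- tag for square brackets
    $new = '[' . $m['open'] . ']' . $m['innerHtml'] . '[/' . $m['open'] . ']';
    $str = str_replace($m[0], $new, $str);
}

echo implode(', ', $order), "\n"; // bar, foo
```

The inner tag is replaced first because the outer tag cannot match while its content still contains "<x-". The s modifier is what lets the dot cross line breaks.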
I don't know exactly what kind of changes you are trying to make, but this is how I would proceed:
$pattern = <<<'EOD'
~
<x-(?<tagName>\w++) (?<attributes>[^>]*+) >
(?<content>(?>[^<]++|<(?!/?x-))*) #by far more efficient than (?:(?!</?x-).)*
</x-\k<tagName>>
~x
EOD;
function callback($m) { // example function
return '<n-' . $m['tagName'] . $m['attributes'] . '>' . $m['content']
. '</n-' . $m['tagName'] . '>';
};
do {
$code = preg_replace_callback($pattern, 'callback', $code, -1, $count);
} while ($count);
echo htmlspecialchars(print_r($code, true));
I found this code (Swap all youtube urls to embed via preg_replace()) to swap YouTube URLs (http://www.youtube.com/watch?v=CfDQ92vOfdc, or http://www.youtube.com/v/CfDQ92vOfdc) into YouTube embed URLs (http://www.youtube.com/embed/CfDQ92vOfdc), but it doesn't seem to be working. Any ideas? I don't know much about regular expressions.
Here's the code:
$string = 'http://www.youtube.com/watch?v=CfDQ92vOfdc';
$search = '#<a (?:.*?)href=["\\\']http[s]?:\/\/(?:[^\.]+\.)*youtube\.com\/(?:v\/|watch\?(?:.*?\&)?v=|embed\/)([\w\-\_]+)["\\\']#ixs';
$replace = 'http://www.youtube.com/embed/$2';
$url = preg_replace($search,$replace,$string);
but it's still displaying as:
http://www.youtube.com/watch?v=CfDQ92vOfdc
instead of:
http://www.youtube.com/embed/CfDQ92vOfdc
Thanks in advance.
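One problem is that your expression expects an <a href="..."> tag around the address, while your string is a bare URL.
Another issue is that your $replace string refers to $2, but the pattern has only one capturing group, so the video ID is in $1. (The quoting style is not the problem: preg_replace() resolves $1 itself, whether the string is single- or double-quoted.)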
This simpler expression should work:
$string = 'http://www.youtube.com/watch?v=CfDQ92vOfdc';
$search = '/youtube\.com\/watch\?v=([a-zA-Z0-9]+)/smi';
$replace = "youtube.com/embed/$1";
$url = preg_replace($search,$replace,$string);
echo $url;
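A slightly broader sketch along the same lines, in case it helps: it also accepts the /v/ form from the question, a v parameter that follows other query parameters, and IDs containing "-" or "_". The function name and the third test URL are made up for illustration.

```php
<?php
// Hypothetical helper, not part of the answer above.
function youtube_embed_url(string $url): string
{
    // watch?v=ID (v may follow other query parameters) or /v/ID
    $search = '~youtube\.com/(?:watch\?(?:[^\s#]*?&)?v=|v/)([\w-]+)~i';
    // Single quotes are fine here: preg_replace() expands $1 itself
    return preg_replace($search, 'youtube.com/embed/$1', $url);
}

echo youtube_embed_url('http://www.youtube.com/watch?v=CfDQ92vOfdc'), "\n";
// http://www.youtube.com/embed/CfDQ92vOfdc
echo youtube_embed_url('http://www.youtube.com/v/CfDQ92vOfdc'), "\n";
// http://www.youtube.com/embed/CfDQ92vOfdc
echo youtube_embed_url('http://www.youtube.com/watch?feature=share&v=a-b_c'), "\n";
// http://www.youtube.com/embed/a-b_c
```

Anything after the ID (for example further query parameters) is left in place, so strip it separately if needed.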
Either change
$string = 'http://www.youtube.com/watch?v=CfDQ92vOfdc';
to
$string = '<a href="http://www.youtube.com/watch?v=CfDQ92vOfdc" ></a>';
OR
$search = '#<a (?:.*?)href=["\\\']http[s]?:\/\/(?:[^\.]+\.)*youtube\.com\/(?:v\/|watch\?(?:.*?\&)?v=|embed\/)([\w\-\_]+)["\\\']#ixs';
to
$search = '#(.*?)(?:href="https?://)?(?:www\.)?(?:youtu\.be/|youtube\.com(?:/embed/|/v/|/watch\?.*?v=))([\w\-]{10,12}).*#x';
If anyone is still looking for a more straightforward solution, here it is.
I just played with your code until it gave me an easy solution.
$string = $content;
$search = '/www\.youtube\.com\/watch\?v=([a-zA-Z0-9]+)/smi';
$replace = "<iframe width='560' height='315' src='https://youtube.com/embed/$1' frameborder='0' allowfullscreen></iframe>
";
$content = preg_replace($search,$replace,$string);
NOTE: to choose which links are processed, just edit the $search part.
To process links from www.youtube.com, use:
$search = '/www\.youtube\.com\/watch\?v=([a-zA-Z0-9]+)/smi';
To process plain youtube.com links as well, just remove the www. part:
$search = '/youtube\.com\/watch\?v=([a-zA-Z0-9]+)/smi';
Here is a function I wrote that returns the result so you can echo it:
function youtube_url_to_embed($youtube_url) {
$search = '/youtube\.com\/watch\?v=([a-zA-Z0-9]+)/smi';
$replace = "youtube.com/embed/$1";
$embed_url = preg_replace($search,$replace,$youtube_url);
return $embed_url;
}
I am still relatively new to regular expressions and feel my code is being too greedy. I am trying to add an id attribute to existing links in a piece of code. My function looks like this:
function addClassHref($str) {
//$str = stripslashes($str);
$preg = "/<[\s]*a[\s]*href=[\s]*[\"\']?([\w.-]*)[\"\']?[^>]*>(.*?)<\/a>/i";
preg_match_all($preg, $str, $match);
foreach ($match[1] as $key => $val) {
$pattern[] = '/' . preg_quote($match[0][$key], '/') . '/';
$replace[] = "<a id='buttonRed' href='$val'>{$match[2][$key]}</a>";
}
return preg_replace($pattern, $replace, $str);
}
This adds the id attribute like I want, but it breaks the hyperlink. For example:
If the original code is: <a href="http://www.google.com">Link</a>
Instead of <a id="class" href="http://www.google.com">Link</a>
It is giving
<a id="class" href="http">Link</a>
Any suggestions or thoughts?
Do not use regular expressions to parse XML or HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$all_a = $doc->getElementsByTagName('a');
$firsta = $all_a->item(0);
$firsta->setAttribute('id', 'idvalue');
echo $doc->saveHTML($firsta);
You've got some overcomplications in your regex :)
Also, there's no need for the loop as preg_replace() will hit all the instances of the search pattern in the relevant string. The first regex below will take everything in the a tag and simply add the id attribute on at the end.
$str = 'Link' . "\n" .
'Link' . "\n" .
'Link';
$p = "{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i";
$r = "<a $1 id=\"class\">$2</a>";
echo preg_replace($p, $r, $str);
If you only want to capture the href attribute you could do the following:
$p = '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i';
$r = "<a href='$1' id='class'>$2</a>";
Your first subpattern ([\w.-]*) doesn't match :, thus it stops at "http".
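A quick way to confirm this is to run both character classes against the same tag. This is only a sketch; the variable names are made up, and the second pattern is one possible fix (capture everything up to the closing quote).

```php
<?php
$html = '<a href="http://www.google.com">Link</a>';

// Original subpattern: [\w.-]* cannot match ":" or "/", so it stops early
preg_match('/href=["\']?([\w.-]*)/', $html, $m);
$before = $m[1];
echo $before, "\n"; // http

// Capture everything up to the closing quote instead
preg_match('/href=["\']([^"\']*)["\']/', $html, $m);
$after = $m[1];
echo $after, "\n"; // http://www.google.com
```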
Couldn't you just use a simple str_replace() for this? Regex seems like overkill if this is all you're doing.
$str = str_replace('<a ', '<a id="someID" ', $str);
I have a string with some hyperlinks inside. I want to match only a certain one of them with a regex. I can't know whether the href or the class comes first; it may vary.
This is an example string:
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
From the above string I want to select only the link that has the class nextpostslink.
So, the match in this example should return this -
»eee
This regex is the most close I could get -
/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
But it is selecting the links from the start of the string.
I think my problem is in the (.*) , but I can't figure out how to change this to select only the needed link.
I would appreciate your help.
It's much better to use a genuine HTML parser for this. Abandon all attempts to use regular expressions on HTML.
Use PHP's DOMDocument instead:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
foreach ($dom->getElementsByTagName('a') as $link) {
$classes = explode(' ', $link->getAttribute('class'));
if (in_array('nextpostslink', $classes)) {
// $link has the class "nextpostslink"
}
}
Not sure if that's what you're after, but anyway: it's a bad idea to parse HTML with regex. Use an XPath implementation to reach the desired elements. The following XPath expression would give you all the 'a' elements with class "nextpostslink":
//a[contains(@class,"nextpostslink")]
There are loads of xpath info around, since you didn't mention your programming language here goes a quick xpath tutorial using java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
Edit:
php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
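For PHP specifically, a minimal sketch with DOMDocument and DOMXPath could look like the following. The markup is made up for the example, and the query is a stricter variant of the expression above: it compares whole class tokens, so a class such as "nextpostslink-x" would not match.

```php
<?php
// Made-up sample markup; only the first link carries the wanted class.
$html = '<div class="wp-pagenavi">'
      . '<a class="nextpostslink" href="http://stv.localhost/channel/political/page/2">next</a>'
      . '<a href="http://stv.localhost/channel/political/page/8" class="last">last</a>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Token-exact class test, independent of attribute order
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " nextpostslink ")]';

$found = [];
foreach ($xpath->query($query) as $a) {
    $found[] = [$a->getAttribute('href'), $a->textContent];
}
print_r($found);
```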
This would work in php:
/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m
This is of course assuming that the class attribute always comes after the href attribute.
This is a code snippet:
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
echo "URL: " . $matches[2] . "\n";
echo "Text: " . $matches[6] . "\n";
}
I would however suggest first matching the link and then getting the url so that the order of the attributes doesn't matter:
<?php
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
$link = $matches[0];
$text = $matches[4];
$regexp = "/href=(\"|')([^'\"]*)(\"|')/";
$matches = array();
if(preg_match($regexp, $link, $matches)) {
$url = $matches[2];
echo "URL: $url\n";
echo "Text: $text\n";
}
}
You could of course extend the regexp to match either variant (class first vs. href first), but it would be very long and I don't think it would improve performance.
Just as a proof of concept I created a regexp that doesn't care about the order:
/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m
The text will be in group 12 and the URL will be in either group 3 or group 10 depending on the order.
Since the question asks for a regex, here is one: <a\s[^>]*class=["']nextpostslink["'][^>]*>(.*?)<\/a>.
The order of the attributes doesn't matter, and it accepts both single and double quotes.
Check the regex online: https://regex101.com/r/DX03KD/1/
I replaced the (.*) with [^'"]+ as follows:
<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>
Note: I tried this with RegexBuddy, so I didn't need to escape the <>'s or /