I'm stuck: I'm not at all familiar with preg_match, and it is not always easy to use.
I'm trying to replace all the
For example:
Site1 => Site1
But the <a href can be written in many ways: a HREF, A href, a double-spaced <A  href, and so on. How can I handle all of these? Bear in mind that performance is key.
I've tried the following with str_replace, but of course that does not cover all the variants of <a href (capitalized or not).
$str = 'sitename1<br />sitename2<br />sitename3';
$Replace = str_replace('<a href="', '<a href="https://example.com/&i=1243123&r=', $str);
echo $Replace;
Try this (PHP 5.3+):
$link = preg_replace_callback('#<a(.*?)href="(.*?)"(.*?)>#is', function ($match) {
    return sprintf(
        '<a%shref="%s"%s>',
        $match[1],
        'http://example.com?u=' . urlencode($match[2]),
        $match[3]
    );
}, 'Site1');
echo $link;
The only fully reliable way of doing this is to use a proper HTML parser.
Happily, PHP has one built-in.
You'd first load the HTML with DomDocument's loadHTML function: http://php.net/manual/en/domdocument.loadhtml.php
Then search the parsed tree with XPath and manipulate the A tags: http://php.net/manual/en/domxpath.query.php
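The two steps above could look roughly like the following. This is only a sketch: the helper name prefixLinks(), the wrapper <div>, and the redirect prefix are assumptions made for illustration, and it assumes the markup is plain ASCII (loadHTML needs extra care with other encodings).

```php
<?php
// Sketch only: prefixLinks() and $prefix are illustrative names, not from the question.
function prefixLinks(string $html, string $prefix): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    // Wrap the fragment in a single root element so no <html>/<body> is added.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The HTML parser lower-cases tag and attribute names, so <A  HREF=...> is found too.
    foreach ($xpath->query('//a[@href]') as $a) {
        $a->setAttribute('href', $prefix . urlencode($a->getAttribute('href')));
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo prefixLinks('<A  HREF="http://site1.com">Site1</A><br />', 'https://example.com/?r='), "\n";
```

Because the parser normalizes case and whitespace, none of the spelling variants from the question need special handling. It will likely be slower than a single str_replace, so it is worth measuring if performance really is key.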
I am working with an editor that works purely with internal relative links for files which is great for 99% of what I use it for.
However, I am also using it to insert links to files within an email body and relative links don't cut the mustard.
Instead of modifying the editor, I would like to search the string from the editor and replace the relative links with external links as shown below
Replace
files/something.pdf
With
https://www.someurl.com/files/something.pdf
I have come up with the following but I am wondering if there is a better / more efficient way to do it with PHP
<?php
$string = 'A link, some other text, A different link';
preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $string, $result);
if (!empty($result['href'])) {
// Found a link.
$baseUrl = 'https://www.someurl.com';
$newUrls = array();
$newString = '';
foreach($result['href'] as $url) {
$newUrls[] = $baseUrl . '/' . $url;
}
$newString = str_replace($result['href'], $newUrls, $string);
echo $newString;
}
?>
Many thanks
Lee
You can simply use preg_replace to replace every double-quoted URL that starts with files:
$string = 'A link, some other text, A different link';
$string = preg_replace('/"(files.*?)"/', '"https://www.someurl.com/$1"', $string);
The result would be:
A link, some other text, A different link
You really should use DOMDocument for a job like this, but if you want to use a regex, this one does the job:
$string = '<a some_attribute href="files/something.pdf" class="abc">A link</a>, some other text, <a class="def" href="files/somethingelse.pdf" attr="xyz">A different link</a>';
$baseUrl = 'https://www.someurl.com';
$newString = preg_replace('/(<a[^>]+href=([\'"]))(.+?)\2/i', "$1$baseUrl/$3$2", $string);
echo $newString,"\n";
Output:
<a some_attribute href="https://www.someurl.com/files/something.pdf" class="abc">A link</a>, some other text, <a class="def" href="https://www.someurl.com/files/somethingelse.pdf" attr="xyz">A different link</a>
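For comparison, a DOMDocument version of the same rewrite might look like this. It is a sketch under assumptions: absolutizeLinks() is a hypothetical helper name, the fragment is wrapped in a <div> so it has one root, and "relative" is taken to mean any href without a scheme that is not a fragment or a protocol-relative URL.

```php
<?php
// Sketch only: absolutizeLinks() is a hypothetical helper name.
function absolutizeLinks(string $html, string $baseUrl): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip empty values, fragments, protocol-relative URLs, and anything with a scheme.
        if ($href === '' || preg_match('~^(?:#|//|[a-z][a-z0-9+.-]*:)~i', $href)) {
            continue;
        }
        $a->setAttribute('href', rtrim($baseUrl, '/') . '/' . ltrim($href, '/'));
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$string = '<a href="files/something.pdf">A link</a>, some other text, '
        . '<a href="https://other.example/x.pdf">Already absolute</a>';
echo absolutizeLinks($string, 'https://www.someurl.com'), "\n";
```

Unlike the regex, this leaves links that are already absolute untouched, which matters if the editor output ever mixes both kinds.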
I'm using the following regex to select the href="http part inside an <a> tag that doesn't contain rel="nofollow" yet:
preg_replace(
"/<a\b(?=[^>]+\b(href=\"http))(?![^>]+\brel=\"nofollow\")/",
"rel=\"nofollow\" href=\"http://",
$input_string
);
The thing is it only replaces the <a because that's the first match.
How is it possible to select the a tag but exclude the <a part from the results so it only will match href="http? Because preg_match does return <a AND href="http, but I only need href="http :)
The reason I think this might be the only right solution is that I don't know in advance how many <a> tags the given string contains, or whether they already contain rel="nofollow". I need to make sure I only replace http:// with rel="nofollow" http:// inside <a> tags that have no rel="nofollow".
EDIT 1:
giuseppe straziota asked for an input and output example so here it is:
input:
this is a string with a lot of content and <a href="http://information.nl" class="aClass">links</a> and whatever....
output:
this is a string with a lot of content and <a rel="nofollow" href="http://information.nl" class="aClass">links</a> and whatever....
EDIT 2:
I ran a couple more tests; these are the results:
code (exact copy/paste):
$input_string = 'this is a string with a lot of content and <a href="http://information.nl" class="aClass">links</a> and whatever....';
$input_string = preg_replace(
'/<a\b(?=[^>]+\b(href="http))(?![^>]+\brel="nofollow")/',
'rel="nofollow" href="http://',
$input_string
);
echo htmlentities($input_string);
result from php 7.0.5:
this is a string with a lot of content and rel="nofollow" href="http:// href="http://information.nl" class="aClass">links</a> and whatever....
And it should be:
this is a string with a lot of content and <a rel="nofollow" href="http://information.nl" class="aClass">links</a> and whatever....
EDIT 3:
I tried this regex:
$test = preg_replace(
'/(?=<a\b[^>]+\b(href="http))(?![^>]+\brel="nofollow")/',
'rel="nofollow" href="http://',
$input_string
);
But now it places the 'rel="nofollow" href="http://', right before the <a, so the result:
rel="nofollow" href="http://<a href="http://information.nl" class="aClass">links</a>
Not exactly what I want either...
I was overthinking it. I adapted my preg_replace call so that I can simply use the first regex:
$test = preg_replace(
'/<a(?=\b[^>]+\b(href="http))(?![^>]+\brel="nofollow")/',
'<a rel="nofollow"',
$input_string
);
It replaces the <a tag, so I should have taken advantage of that like I do now.
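As a quick check of that final pattern, here is a small sketch that wraps it in a function and runs it on one link without rel="nofollow" and one that already has it. The function name addNofollow() and the shortened sample strings are assumptions for illustration.

```php
<?php
// Hypothetical wrapper around the pattern from the post above.
function addNofollow(string $html): string
{
    return preg_replace(
        '/<a(?=\b[^>]+\b(href="http))(?![^>]+\brel="nofollow")/',
        '<a rel="nofollow"',
        $html
    );
}

$plain  = 'content and <a href="http://information.nl" class="aClass">links</a> and whatever....';
$marked = 'content and <a rel="nofollow" href="http://information.nl">links</a>';

// Gains rel="nofollow" directly after <a.
echo addNofollow($plain), "\n";
// Left unchanged, because the negative lookahead sees the existing rel="nofollow".
echo addNofollow($marked), "\n";
```

Note that the pattern is case-sensitive and only recognizes double-quoted attributes, so markup such as <A HREF='http://...'> passes through untouched.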
I need to find a way to read content posted by user to find any hyperlinks that might have been included, create anchor tags, add target and rel=nofollow attribute to all those links.
I have come across some REGEX solutions like this:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But on other SO questions about the same problem, it has been strongly recommended NOT to use regex and to use PHP's DOMDocument instead.
Whichever way is best, I need to add attributes like the ones mentioned above in order to harden all external links on the website.
First of all, the guidelines you mentioned advised against parsing HTML with regexes. As far as I understand, what you are trying to do is to parse plain text from user and convert it into HTML. For that purpose, regexes are usually just fine.
(Note that I assume you parse the text into links yourself and aren't using an external library for that. In the latter case you'd need to fix the HTML the library outputs, and for this you should use DOMDocument to iterate over all <a> tags and add the proper attributes to them.)
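For the library case mentioned in the note, iterating over the <a> tags with DOMDocument could be sketched as follows. The helper name hardenLinks(), the wrapper <div>, and the choice of target="_blank" are assumptions for illustration; the markup is assumed to be plain ASCII.

```php
<?php
// Sketch only: hardenLinks() is a hypothetical helper name.
function hardenLinks(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        // Keep any existing rel tokens and append nofollow only if it is missing.
        $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
        }
        $a->setAttribute('rel', implode(' ', $rel));
        $a->setAttribute('target', '_blank'); // assumed target value
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo hardenLinks('See <a href="http://example.com" rel="noopener">this page</a>.'), "\n";
```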
Now, you can parse it in two ways: server side, or client side.
Server side
Pros:
It outputs ready to use HTML.
It doesn't require users to enable Javascript.
Cons:
You need to add rel="nofollow" attribute for the bots to not follow the links.
Client side
Pros:
You don't need to add rel="nofollow" attribute for the bots, since they don't see the links in the first place - they're generated with Javascript and bots usually don't parse Javascript.
Cons:
Creating links that way requires users to enable Javascript.
Implementing stuff like that in Javascript can give the impression that site is slow, especially if there is a lot of text to parse.
It makes caching parsed text difficult.
I'll focus on implementing it server-side.
Server-side implementation
So, in order to parse links from user input and add any attribute you want to them, you can use something like this:
<?php
function replaceLinks($text)
{
    $regex = '/'
        . '(?<!\S)'
        . '(((ftp|https?)?:?)\/\/|www\.)'
        . '(\S+?)'
        . '(?=$|\s|[,]|\.\W|\.$)'
        . '/m';
    return preg_replace_callback($regex, function($match)
    {
        return '<a'
            . ' target=""'
            . ' rel="nofollow"'
            . ' href="' . $match[0] . '">'
            . $match[0]
            . '</a>';
    }, $text);
}
Explanation:
(?<!\S): not preceded by non-whitespace characters.
(((ftp|https?)?:?)\/\/|www\.): accept ftp://, http://, https://, ://, // and www. as beginning of URLs.
(\S+?) match everything that is not whitespace in non-greedy fashion.
(?=$|\s|[,]|\.\W|\.$): every URL must be followed by either end of line, whitespace, a comma, a dot followed by a character other than \w (this is to allow .com, .co.jp etc. to match), or a dot followed by end of line.
m flag - match multiline text.
Testing
Now, to support my claim that it works I added a few test cases:
$tests = [];
$tests []= ['http://example.com', '<a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= ['https://example.com', '<a target="" rel="nofollow" href="https://example.com">https://example.com</a>'];
$tests []= ['ftp://example.com', '<a target="" rel="nofollow" href="ftp://example.com">ftp://example.com</a>'];
$tests []= ['://example.com', '<a target="" rel="nofollow" href="://example.com">://example.com</a>'];
$tests []= ['//example.com', '<a target="" rel="nofollow" href="//example.com">//example.com</a>'];
$tests []= ['www.example.com', '<a target="" rel="nofollow" href="www.example.com">www.example.com</a>'];
$tests []= ['user@www.example.com', 'user@www.example.com'];
$tests []= ['testhttp://example.com', 'testhttp://example.com'];
$tests []= ['example.com', 'example.com'];
$tests []= [
'test http://example.com',
'test <a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= [
'multiline' . PHP_EOL . 'blah http://example.com' . PHP_EOL . 'test',
'multiline' . PHP_EOL . 'blah <a target="" rel="nofollow" href="http://example.com">http://example.com</a>' . PHP_EOL . 'test'];
$tests []= [
'text //example.com/slashes.php?parameters#fragment, some other text',
'text <a target="" rel="nofollow" href="//example.com/slashes.php?parameters#fragment">//example.com/slashes.php?parameters#fragment</a>, some other text'];
$tests []= [
'text //example.com. new sentence',
'text <a target="" rel="nofollow" href="//example.com">//example.com</a>. new sentence'];
Each test case is composed of two parts: source input and expected output. I used the following code to determine whether the function passes the tests above:
foreach ($tests as $test)
{
list ($source, $expected) = $test;
$actual = replaceLinks($source);
if ($actual != $expected)
{
echo 'Test ' . $source . ' failed.' . PHP_EOL;
echo 'Expected: ' . $expected . PHP_EOL;
echo 'Actual: ' . $actual . PHP_EOL;
die;
}
}
echo 'All tests passed' . PHP_EOL;
I think this gives you an idea of how to solve the problem. Feel free to add more tests and experiment with the regex itself to make it suitable for your specific needs.
You might be interested in Goutte.
You can define your own filters, etc.
Get the content to post using jQuery and process it before posting it to PHP.
$('#idof_content').val(
$('#idof_content').val().replace(/\b(http(s|):\/\/|)(www\.\S+)/ig,
"<a href='http\$2://\$3' target='_blank' rel='nofollow'>\$3</a>"));
Let's say I have the following string (taken from a much larger string containing several similar fragments):
$str = "<div class='testdiv remove'>randomtext</div>
<div class='testdiv'>randomtext randomtext</div>";
The class 'remove' was added by a JavaScript function. How would I remove all elements with the class 'remove', and all links, so that the string becomes this:
$str = "<div class='testdiv'>randomtext </div>";
I can't use jQuery to remove these tags, since I have to feed the string into a PHP library function. How would I remove them?
Use a dom parser http://simplehtmldom.sourceforge.net/
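PHP's built-in DOMDocument can do the same job without an external library. The following is a sketch under assumptions: stripRemovedAndLinks() is a hypothetical helper name, the fragment is wrapped in a <div> to give it one root, and the link in the sample string is invented for illustration.

```php
<?php
// Sketch only: stripRemovedAndLinks() is a hypothetical helper name.
function stripRemovedAndLinks(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Elements whose class list contains the token "remove", plus every <a> element.
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " remove ")] | //a';
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the children of the wrapper, dropping the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$str = "<div class='testdiv remove'>randomtext</div>"
     . "<div class='testdiv'>randomtext <a href='page.html'>randomtext</a></div>";
echo stripRemovedAndLinks($str), "\n";
```

The concat/normalize-space test matches "remove" only as a whole class token, so a class such as "removed" would not be affected.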
Use a regular expression :)
$pattern = "#(?:<div class='testdiv remove'>[\s\S]+?</div>|<a[^>]+>[^<]+</a>)#i";
$str = preg_replace($pattern, "", $str);
I have a string that contains some hyperlinks. I want to match only one particular link among them with a regex. I can't know whether the href or the class comes first; it may vary.
This is an example string:
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
From the above string, I want to select only the link that has the class nextpostslink.
So, the match in this example should return this -
»eee
This regex is the closest I could get -
/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
But it is selecting the links from the start of the string.
I think my problem is in the (.*) , but I can't figure out how to change this to select only the needed link.
I would appreciate your help.
It's much better to use a genuine HTML parser for this. Abandon all attempts to use regular expressions on HTML.
Use PHP's DOMDocument instead:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
foreach ($dom->getElementsByTagName('a') as $link) {
$classes = explode(' ', $link->getAttribute('class'));
if (in_array('nextpostslink', $classes)) {
// $link has the class "nextpostslink"
}
}
Not sure if that's what you're after, but anyway: it's a bad idea to parse HTML with a regex. Use an XPath implementation to reach the desired elements. The following XPath expression gives you all the 'a' elements whose class attribute contains "nextpostslink":
//a[contains(@class,"nextpostslink")]
There is plenty of XPath information around. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
Edit:
php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
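In PHP, the XPath query from this answer could be run with DOMXPath roughly as follows. This is a sketch: findLinksByClass() is a hypothetical helper name, the sample markup is a shortened stand-in for the question's HTML (the href of the nextpostslink anchor is an assumption), and the class name is assumed to contain no quote characters.

```php
<?php
// Sketch only: findLinksByClass() is a hypothetical helper name.
function findLinksByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match the class as a whole token, so "nextpostslink2" would not qualify.
    // Assumption: $class contains no quote characters.
    $query = sprintf(
        '//a[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $found = [];
    foreach ($xpath->query($query) as $a) {
        $found[] = ['href' => $a->getAttribute('href'), 'text' => $a->textContent];
    }
    return $found;
}

$html = "<div class='wp-pagenavi'>"
      . "<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>"
      . "<a class='nextpostslink' href='http://stv.localhost/channel/political/page/2'>next</a>"
      . "<a class=\"cccc\">xxx</a>"
      . "</div>";
print_r(findLinksByClass($html, 'nextpostslink'));
```

The order of href and class in the tag makes no difference here, which is the main difficulty with the regex attempts.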
This would work in php:
/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m
This is of course assuming that the class attribute always comes after the href attribute.
This is a code snippet:
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
echo "URL: " . $matches[2] . "\n";
echo "Text: " . $matches[6] . "\n";
}
I would however suggest first matching the link and then getting the url so that the order of the attributes doesn't matter:
<?php
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
$link = $matches[0];
$text = $matches[4];
$regexp = "/href=(\"|')([^'\"]*)(\"|')/";
$matches = array();
if(preg_match($regexp, $link, $matches)) {
$url = $matches[2];
echo "URL: $url\n";
echo "Text: $text\n";
}
}
You could of course extend the regexp to match either of the two variants (class first vs. href first), but it would be very long and I don't think it would improve performance.
Just as a proof of concept I created a regexp that doesn't care about the order:
/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m
The text will be in group 12 and the URL will be in either group 3 or group 10 depending on the order.
Since the question asks for a regex, here is one: <a\s[^>]*class=["|']nextpostslink["|'][^>]*>(.*)<\/a>.
The order of the attributes doesn't matter, and it handles both single and double quotes.
Check the regex online: https://regex101.com/r/DX03KD/1/
I replaced the (.*) with [^'"]+ as follows:
<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>
Note: I tried this with RegexBuddy, so I didn't need to escape the <, >, or / characters.