Alter all a href links in php - php

Currently working on something where i need to add the UTM tag to all links, got 1/2 minor issues i cant figure out
This is the code im am using, the issue is if a link got a parameter like ?test=test then this refuses to add the utm tags.
The other issue is a minor issue that im not sure would make sence to change, insted of me having to add a url, it could be neat if it added utm tags to ALL a href's by default with out knowing the domain name.
Hope someone can help me out and push me in the right direction.
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
function($matches){
$url_modifier = 'utm=some&medium=stuff';
if (!isset($matches[2])) return $matches[1]."/?$url_modifier";
$q = strpos($matches[2],'?');
if ($q===false) return $matches[1]."?$url_modifier";
if ($q==strlen($matches[2])-1) return $matches[1].$url_modifier;
return $matches[1]."&$url_modifier";
},
$html);

once detected the urls you can use parse_url() and parse_str() to elaborate the url, add utm and medium and rebuild it without caring too much about the content of the get parameters or the hash:
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
function ($matches) {
$link = $matches[0];
if (strpos($link, '#') !== false) {
list($link, $hash) = explode('#', $link);
}
$res = parse_url($link);
$result = '';
if (isset($res['scheme'])) {
$result .= $res['scheme'].'://';
}
if (isset($res['host'])) {
$result .= $res['host'];
}
if (isset($res['path'])) {
$result .= $res['path'];
}
if (isset($res['query'])) {
parse_str($res['query'], $res['query']);
} else {
$res['query'] = [];
}
$res['query']['utm'] = 'some';
$res['query']['medium'] = 'stuff';
if (count($res['query']) > 0) {
$result .= '?'.http_build_query($res['query']);
}
if (isset($hash)) {
$result .= '#'.$hash;
}
return $result;
},
$html
);
As you can see, the code is longer but simpler
Edit
I made some change, searching for every href="xxx" inside the text. If the link is not from add-link.com the script will skip it, otherwise he will try to print it in the best way possible
$html = 'blabla a
a
a
a
a
a
a
a
a
a
a
';
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'/href="([^"]+)"/i',
function ($matches) {
$link = $matches[1];
// ignoring outer links
if(strpos($link,'add-link.com') === false) return 'href="'.$link.'"';
if (strpos($link, '#') !== false) {
list($link, $hash) = explode('#', $link);
}
$res = parse_url($link);
$result = '';
if (isset($res['scheme'])) {
$result .= $res['scheme'].'://';
} else if(isset($res['host'])) {
$result .= '//';
}
if (isset($res['host'])) {
$result .= $res['host'];
}
if (isset($res['path'])) {
$result .= $res['path'];
} else {
$result .= '/';
}
if (isset($res['query'])) {
parse_str($res['query'], $res['query']);
} else {
$res['query'] = [];
}
$res['query']['utm'] = 'some';
$res['query']['medium'] = 'stuff';
if (count($res['query']) > 0) {
$result .= '?'.http_build_query($res['query']);
}
if (isset($hash)) {
$result .= '#'.$hash;
}
return 'href="'.$result.'"';
},
$html
);
var_dump($html_text);

Related

PHP - strpos doesn't work correctly for me?

So I am trying to search through JSON, using strpos for it, but it doesn't for me. I always get that "Items doesn't exist".
Here's the PHP
$getItems = file_get_contents('Items.json');
$decodeItems = json_decode($getItems,true);
//$output = ''.$decodeItems['items'][0]['name'].'';
$output = '';
if(isset($_POST['search'])){
$searchq = $_POST['search'];
$searchq = preg_replace("#[^0-9a-z]#i","",$searchq);
foreach($decodeItems['items'] as $data){
if($chance = strpos($data, $searchq) !== FALSE){
if($data['name'] == $chance){
$name = $data['name'];
$output .= "<div>".$name."</div>";
}
}
else{
$output = 'Items no';
}
}
}
Here's the Example JSON
{"img":"-9a81dlWLwJ2UUGcVs_nsVtzdOEdtWwKGZZLQHTxDZ7I56KU0Zwwo4NUX4oFJZEHLbXU5A1PIYQh5hlcX0nvUOGsx8DdQBJjIAVHubSaIAlp1fb3YihQ-tWglYy0lfjjOr6fxjpQ7MFz373Fodyl0AXh-ENkMWinJ4eXcA8-ZFHUq1K_xum70ZO56oOJlyUgjHI5fA","name":"&#x2605 Bowie Knife","assetid":"6442574944","myprice":"155.36","condition":"","originalname":"\u2605 Bowie Knife","inspect":"steam:\/\/rungame\/730\/76561202255233023\/+csgo_econ_action_preview%20S76561198034643020A6442574944D2305887113904442996","special":"0","floatvalue":null,"bitskinsprice":"117.36"}
From the manual the return value of strpos:
Returns the position of where the needle exists relative to the beginning of the haystack string (independent of offset). Also note that string positions start at 0, and not 1.
Returns FALSE if the needle was not found.
So, I suggest you to change this part of your code:
if(strpos($data['name'], $searchq) === true)
to
if(strpos($data['name'], $searchq) !== FALSE)
Updating my answer according to the comments below.
The JSON needs to follow the following notation in order for the code above to work:
{"items":[
{
"img":"img1",
"name":"name1",
"assetid":"1",
"myprice":"155.36",
"condition":"",
"originalname":"name1 original",
"inspect":"whatever",
"special":"0",
"floatvalue":null,
"bitskinsprice":"117.36"
},
{
"img":"img2",
"name":"name2",
"assetid":"2",
"myprice":"175.11",
"condition":"",
"originalname":"name2 original",
"inspect":"whatever2",
"special":"0",
"floatvalue":null,
"bitskinsprice":"55.55"
}
]};
Updating my answer according to the comments below.
OK, I made some changes to your script to deal correctly with the JSON file.
$getItems = file_get_contents('Items.json');
$decodeItems = json_decode($getItems,true);
//$output = ''.$decodeItems['items'][0]['name'].'';
$output = '';
if(isset($_POST['search'])){
$searchq = $_POST['search'];
$searchq = preg_replace("#[^0-9a-z]#i","",$searchq);
foreach($decodeItems['items'] as $data){
// this was *if($chance = strpos($data, $searchq) !== FALSE){*
if(strpos($data['name'], $searchq) !== FALSE) {
// here was another unneeded *if($data['name'] == $chance){*
$name = $data['name'];
$output .= "<div>".$name."</div>";
}
}
if (empty($output)) {
$output = 'Items no';
}
}

PHP synchronized file read write

i am using the function write_php_ini to write files
see the accepted answer with around 40 upvotes at:
create ini file, write values in PHP
originally i was copying from here:
How to read and write to an ini file with PHP
from the answer that is not accepted, because they looked so similar
the problem is, what if 2 process write to it at the same time? (that is, the second post seems to handle it, whereas the first one does not)
how to avoid this problem?
this is the code:
public static function write_ini_file($assoc_arr, $path, $has_sections=FALSE) {
$content = "";
if ($has_sections) {
foreach ($assoc_arr as $key=>$elem) {
$content .= "[".$key."]\n";
foreach ($elem as $key2=>$elem2) {
if(is_array($elem2))
{
for($i=0;$i<count($elem2);$i++)
{
$content .= $key2."[] = \"".$elem2[$i]."\"\n";
}
}
else if($elem2=="") $content .= $key2." = \n";
else $content .= $key2." = \"".$elem2."\"\n";
}
}
}
else {
foreach ($assoc_arr as $key=>$elem) {
if(is_array($elem))
{
for($i=0;$i<count($elem);$i++)
{
$content .= $key."[] = \"".$elem[$i]."\"\n";
}
}
else if($elem=="") $content .= $key." = \n";
else $content .= $key." = \"".$elem."\"\n";
}
}
if (!$handle = fopen($path, 'w')) {
return false;
}
$success = fwrite($handle, $content);
fclose($handle);
return $success;
}

Using regex function in a while loop

I have a function that gets a specific link from a specific website, and it works, but the problem starts when I try to use this function in a while loop. When I tried that, the links length starts to stack up for some reason.
function getLinks($link) {
$link1 = $link;
$content = file_get_contents($link1);
$content = str_replace("<", "", $content);
$content = str_replace(">", "", $content);
preg_match("~previous page.+?next page~i", $content, $match);
preg_match("~\"(/.+?)\"~i", $match[0], $match);
$link2 = "https://en.wiktionary.org".$match[1];
echo $link1."<br>";
echo $link2."<br>";
return $link2;
}
$firstLink = getLinks("https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages");
Result firstLink = getLinks():
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
^--- See how it works fine when it's like this? Then when I put it in a while loop:
$count = 0;
while ($count < 5) {
$count++;
$firstLink = getLinks($firstLink);
}
The results comes up totally messed up, and the links started to stack up upon each other, like so:
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bpagefrom=BAGSIE%0Abagsie&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bpagefrom=BAGSIE%0Abagsie&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
This is driving me insane, so if anyone know what I did wrong, please, please tell me. Thank you.
Regular function in while loop:
function addOne($num) {
echo $num."<br>";
$num++;
return $num;
}
$num = 0;
$count = 0;
while ($count < 5) {
$count++;
$num = addOne($num);
}
^---Works just fine
Your problem is with HTML entities. I've re-wrote the function to address that issue, repeated URLs and to make it more efficient. You call it with a depth parameter, which would in your case be your while's max.
function getLinks($linkd, $depth, $checked=array()) {
if(!is_array($linkd)) $linkd=array($linkd);
foreach($linkd as $link)
{
if(isset($checked[$link])) continue;
$link1 = $link;
$content = file_get_contents($link1);
$content = str_replace("<", "", $content);
$content = str_replace(">", "", $content);
preg_match("~previous page.+?next page~i", $content, $match);
preg_match("~\"(/.+?)\"~i", $match[0], $match);
$link2 = "https://en.wiktionary.org".$match[1];
echo $link1."<br>";
echo $link2."<br>";
$checked[$link] = true;
if($depth>0)
{
$depth--;
return getLinks(html_entity_decode($link2), $depth, $checked);
}
else
{
return $link2;
}
}
}
$firstLink = "https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages";
$firstLink = getLinks($firstLink, 5);

php function to extract link from string

i want to extract href link from text or string. i write a little function to do that but this is slow when a string to transform is large. My code is
function spy_linkIntoString_Format($text) {
global $inc_lang; $lang = $inc_lang['tlang_media'];
$it = explode(' ' ,$text);
$result = '';
foreach($it as $jt) {
$a = trim($jt);
if(preg_match('/((?:[\w\d]+\:\/\/)?(?:[\w\-\d]+\.)+[\w\-\d]+(?:\/[\w\-\d]+)*(?:\/|\.[\w\-\d]+)?(?:\?[\w\-\d]+\=[\w\-\d]+\&?)?(?:\#[\w\-\d]*)?)/', $jt)) {
$pros_lis = str_replace('www.','',$jt);
$pros_lis = (strpos($pros_lis, 'http://') === false ? 'http://'. $pros_lis : $pros_lis);
$urlregx = parse_url($pros_lis);
$host_name = (!empty($urlregx['host']) ? $urlregx['host'] : '.com');
if($host_name == 'youtube.com') {
$string_v = $urlregx['query']; parse_str($string_v, $outs); $stID = $outs['v'];
$result .= '<a title="Youtube video" coplay="'.$stID.'" cotype="1" class="media_spy_vr5" href="#"><span class="link_media"></span>'.$lang['vtype_youtube'].'</a> ';
} elseif($host_name == 'vimeo.com') {
$path_s = $urlregx['path']; $patplode = explode("/", $path_s); $stID = $patplode[1];
$result .= '<a title="Vimeo video" coplay="'.$stID.'" cotype="2" class="media_spy_vr5" href="#"><span class="link_media"></span>'.$lang['vtype_vimeo'].'</a> ';
} elseif($host_name == 'travspy.com') {
$result .= '</span>'.$pros_lis.' ';
} else {
$result .= '<span class="jkt_445 c8_big_corner"></span>'.$pros_lis.' ';
}
} else {
$result .= $jt.' ';
}
}
return trim($result);/**/
}
Can i do this run fast?
You should rewrite this to use preg_match_allinstead of splitting the text into words (i.e. drop the explode).
$regex = '/\b((?:[\w\d]+\:\/\/)?(?:[\w\-\d]+\.)+[\w\-\d]+(?:\/[\w\-\d]+)*(?:\/|\.[\w\-\d]+)?(?:\?[\w\-\d]+\=[\w\-\d]+\&?)?(?:\#[\w\-\d]*)?)\b/';
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$url = $match[0];
// your link generator
}
You seem to be breaking the text into words separated by spaces, and then matching each word against a regular expression. This is very time consuming indeed.
A faster way to do this is to perform the regular expression search on the entire text and then iterate over it's results.
preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
foreach($result[0] as $jt){
//do what you normally do with $jt
}

How to add rel="nofollow" to links with preg_replace()

The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since its an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest, since it does not match the blog_url, its obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhitespace = FALSE;
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode($oldRel, ' ');
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo which wasn't declared anywhere.
Try this one (PHP 5.3+):
skip selected address
allow manually set rel parameter
and code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// will be same because it's already contains rel parameter
echo nofollow('something'); // ad
// add rel="nofollow" parameter to anchor
echo nofollow('something', 'localhost');
// skip this link as internall link
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's how it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[#href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the strpos() stuff, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited at processing a whole page rather than a HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc...
Thanks #alex for your nice solution. But, I was having a problem with Japanese text. I have fixed it as following way. Also, this code can skip multiple domains with the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** #var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode($values, ' ');
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. #see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
<?
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is the another solution which has whitelist option and add tagret Blank attribute.
And also it check if there already a rel attribute before add a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress decision:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
a good script which allows to add nofollow automatically and to keep the other attributes
function nofollow(string $html, string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$rel = $rel[1];
}
$rel = 'rel="' . ($rel ? (strpos($rel, 'nofollow') ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}

Categories