PHP Encode parts of a string and then decode - php

I have a string containing HTML and some placeholders.
The placeholders always start with {{ and end with }}.
I'm trying to encode the contents of the placeholders and then decode them later.
While they're encoded they ideally need to be valid HTML, as I want to use DOMDocument on the string. The problem I'm having is that it ends up being a mess, because the placeholders are usually something like:
<img src="{{image url="mediadir/someimage.jpg"}}"/>
Sometimes they're something like this though:
<p>Some text</p>
{{widget type="pagelink" pageid="1"}}
<div class="whatever">Content</div>
I was wondering what the best way of doing this is, thanks!
UPDATE: CONTEXT
The overall problem is that I have Magento site with a bunch of static links like:
Link text
And I need to replace them with widgets to the page so that if the URL changes the links update. So replace the above with something like this.
{{widget type="Magento\Cms\Block\Widget\Page\Link" anchor_text="Link Text" template="widget/link/link_block.phtml" page_id="123"}}
I have something which does this using the PHP DOMDocument functionality. It looks up CMS page through their URL, finds the ID and replaces the anchor node with the widget text. This works fine if the page doesn't already contain any widgets or URL placeholders.
However if it does then the placeholders come out broken when processed through the DOMDocument saveHTML() function.
My idea for a solution was to encode the widgets and URL placeholders before passing the content to the DOMDocument loadHTML() function, and decode them after the saveHTML() function, when it is a string again.
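A minimal sketch of that encode/decode idea, assuming placeholders never nest and the content never already contains the token text. The helper names encodePlaceholders() and decodePlaceholders() and the PLACEHOLDER_n_END token format are made up for illustration; they are not Magento APIs:

```php
<?php
// Hypothetical helper: swap each {{...}} placeholder for an inert token
// so the markup is plain HTML while DOMDocument works on it.
function encodePlaceholders(string $html, array &$map): string
{
    $map = [];
    return preg_replace_callback('/\{\{.*?\}\}/s', function (array $m) use (&$map): string {
        // The _END suffix keeps token 1 from being a prefix of token 10.
        $token = 'PLACEHOLDER_' . count($map) . '_END';
        $map[$token] = $m[0];
        return $token;
    }, $html);
}

// Hypothetical helper: put the original placeholders back after saveHTML().
function decodePlaceholders(string $html, array $map): string
{
    return strtr($html, $map);
}
```

The tokens contain only letters, digits and underscores, so they should survive inside both attribute values and text nodes without being escaped.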
UPDATE: CODE
This is a cut down version of what I've got currently. It's messy but it does work in replacing pages with widgets.
$pageCollection = $this->pageCollectionFactory->create();
$collection = $pageCollection->load();
$findarray = array('http', 'mailto', '.pdf', '{', '}');
$findarray2 = array('mailto', '.pdf', '{', '}');
$specialurl = 'https://www.example.com';
$relative_links = 0;
$missing_pages = 0;
$fixed_links = 0;
try {
foreach ($collection as $page) {
$dom = new \DOMDocument();
$content = $this->cleanMagentoCode( $page->getContent() );
libxml_use_internal_errors(true); // Suppress warnings created by reading bad HTML
$dom->loadHTML( $content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD ); // Load HTML without doctype or html containing elements
$elements = $dom->getElementsByTagName("a");
for ($i = $elements->length - 1; $i >= 0; $i --) {
$link = $elements->item($i);
$found = false;
// To clean up later
if ( strpos($link->getAttribute('href'), $specialurl) !== FALSE ) {
foreach ($findarray2 as $find) {
if (stripos($link->getAttribute('href'), $find) !== FALSE) {
$found = true;
break;
}
}
} else {
foreach ($findarray as $find) {
if (stripos($link->getAttribute('href'), $find) !== FALSE) {
$found = true;
break;
}
}
}
if ( strpos($link->getAttribute('href'), '#') === 0 ) {
$found = true;
}
if ( $link->getAttribute('href') == '' ) {
$found = true;
}
if ( !$found ) {
$url = parse_url($link->getAttribute('href'));
if ( isset( $url['path'] ) ) {
$identifier = rtrim( ltrim($url['path'], '/'), '/' );
try {
$pagelink = $this->pageRepository->getById($identifier);
// Fix link
if ($this->fixLinksFlag($input)) {
if ( stripos( $link->getAttribute('class'), "btn" ) !== FALSE ) {
$link_template = "widget/link/link_block.phtml";
} else {
$link_template = "widget/link/link_inline.phtml";
}
$widgetcode = '{{widget type="Magento\Cms\Block\Widget\Page\Link" anchor_text="' . $link->nodeValue . '" template="' . $link_template . '" page_id="' . $pagelink->getId() . '"}}';
$widget = $dom->createTextNode($widgetcode);
$link->parentNode->replaceChild($widget, $link);
}
}
}
}
}
$page->setContent( $this->dirtyMagentoCode( $dom->saveHTML() ) );
$page->save();
}
}

Related

extracting and printing an html element by it's id using DOMDocument

I want to extract a couple of tables from a web page and show them on my page.
I was going to use regex to extract them, but then I saw the DOMDocument class,
and it seems cleaner. I've looked on Stack Overflow, and all the questions seem to be about getting inner text or using a loop to get the inner nodes of elements. I want to know how I can extract and print an HTML element by its id.
$html = file_get_contents("www.site.com");
$xml = new DOMDocument();
$xml->loadHTML($html);
$xpath = new DOMXPath($xml);
$table =$xpath->query("//*[@id='myid']");
$table->saveHTML(); // this obviously doesn't work
how can i show or echo the $table as an actual html table on my page ?
Firstly, DOMDocument has a getElementById() method so your XPath is unnecessary - although I suspect that is how it works underneath.
Secondly, in order to get fragments of markup rather than a whole document, you use DOMNode::C14N(), so your code would look like this:
<?php
// Load the HTML into a DOMDocument
// Alternatively, loadHTMLFile() can load the document straight from a URL or file
$html = file_get_contents("www.site.com");
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
// Get the target element
$element = $dom->getElementById('myid');
// Get the HTML as a string
$string = $element->C14N();
See a working example.
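As an alternative sketch, assuming PHP 5.3.6 or later: DOMDocument::saveHTML() accepts a node argument and serializes just that element as HTML rather than canonical XML. The function name elementHtmlById() and the sample markup are invented for illustration:

```php
<?php
// Hypothetical helper: return the outer HTML of the element with the given id,
// or null when no such element exists.
function elementHtmlById(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $element = $dom->getElementById($id);
    return $element === null ? null : $dom->saveHTML($element);
}

$page = '<html><body><table id="myid"><tr><td>cell</td></tr></table><p>other</p></body></html>';
echo elementHtmlById($page, 'myid');
```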
You can use DOMElement::C14N() to get the canonicalized HTML(XML) representation of a DOMElement, or if you like a bit more control so that you can filter certain elements and attributes you can use something like this:
function toHTML($nodeList, $tagsToStrip=array('script','object','noscript','form','style'),$attributesToSkip=array('on*')) {
$html = '';
foreach($nodeList as $subIndex => $values) {
if(!in_array(strtolower($values->nodeName), $tagsToStrip)) {
if(substr($values->nodeName,0,1) != '#') {
$html .= ' <'.$values->nodeName;
if($values->attributes) {
for($i=0;$values->attributes->item($i);$i++) {
if( !in_array( strtolower($values->attributes->item($i)->nodeName) , $attributesToSkip ) && (in_array('on*',$attributesToSkip) && substr( strtolower($values->attributes->item($i)->nodeName) ,0 , 2) != 'on') ) {
$vvv = $values->attributes->item($i)->nodeValue;
if( in_array( strtolower($values->attributes->item($i)->nodeName) , array('src','href') ) ) {
$vvv = resolve_href( $this->url , $vvv );
}
$html .= ' '.$values->attributes->item($i)->nodeName.'="'.$vvv.'"';
}
}
}
if(in_array(strtolower($values->nodeName), array('br','img'))) {
$html .= ' />';
} else {
$html .= '> ';
if(!$values->firstChild) {
$html .= htmlspecialchars( $values->textContent , ENT_COMPAT , 'UTF-8' , true );
} else {
$html .= toHTML($values->childNodes,$tagsToStrip,$attributesToSkip);
}
$html .= ' </'.$values->nodeName.'> ';
}
} elseif(substr($values->nodeName,1,1) == 't') {
$inner = htmlspecialchars( $values->textContent , ENT_COMPAT , 'UTF-8' , true );
$html .= $inner;
}
}
}
return $html;
}
echo toHTML($table);

Extracting certain portions of HTML from within PHP

Ok, so I'm writing an application in PHP to check my sites if all the links are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXml and DOMDocument objects to extract the tags but when I run the app with a sample site I usually get a ton of errors if I use the SimpleXml object type.
So is there a way to scan the html document for href attributes that's pretty much as simple as using SimpleXml?
<?php
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link)
{
// store the $link into a file
foreach($link->attributes() as $attribute=>$value);
{
//procedure to place the href value into a file
}
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently confused about how I should treat the string that I'm getting with the HTML code in it...
just to be clear, I'm using the following primitive way of getting the html file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
fclose($file_handle);
?>
Any info would be useful as well as any other language alternatives where the above problem is more elegantly fixed (python, c or c++)
You can use DOMDocument::loadHTML
Here's a bunch of code we use for a HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
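@$dom->loadHTML($result); // @ suppresses warnings caused by malformed HTML
$links = extractLink(implode("\n", getTags($dom, 'a'))); // getTags() returns an array of <a> fragments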
function extractLink( $html, $argument = 1 ) {
$href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
preg_match_all($href_regex_pattern,$html,$matches);
if (count($matches)) {
if (is_array($matches[$argument]) && count($matches[$argument])) {
return $matches[$argument][0];
}
return $matches[1];
}
return array(); // no matches found
}
function getTags( $dom, $tagName, $element = false, $children = false ) {
$html = array(); // collects one HTML fragment per matching node
$domxpath = new DOMXPath($dom);
$children = ($children) ? "/".$children : '';
$filtered = $domxpath->query("//$tagName" . $children);
$i = 0;
while( $myItem = $filtered->item($i++) ){
$newDom = new DOMDocument;
$newDom->formatOutput = true;
$node = $newDom->importNode( $myItem, true );
$newDom->appendChild($node);
$html[] = $newDom->saveHTML();
}
if ($element !== false && isset($html[$element])) {
return $html[$element];
} else
return $html;
}
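If only the href values are needed, a shorter sketch can read them straight from the DOM instead of re-serializing each node and running a regex over it. The function name extractHrefs() is hypothetical, and the sketch assumes the HTML string is not empty:

```php
<?php
// Hypothetical helper: collect the href of every <a> element that has one.
function extractHrefs(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate real-world, invalid HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

print_r(extractHrefs('<p><a href="/a">A</a> <a name="x">no href</a></p>'));
```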
You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php

Why doesn't my properly defined variable evaluate for length correctly (and subsequently work in the rest of my code)?

I employed the answer Marc B suggested and still got nothing in the variable when I echo'd it after the while loop, so I added a few checks to show status of things as they were processed through the code. When the if/else statement runs next, it shows the result that there is length to the variable. The next if/else statement branches to the else statement and then takes the else statement in the next if/else saying the xpath found nothing. So obviously when I go to use the variable $BEmp3s, it has nothing in it.
This doesn't make much sense to me since in the beginning, the echo of $BEpost_content shows the proper content in its entirety but the evaluation on its length shows nothing/NULL? Please help!
<?php
// Start MP3 URL
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$xpath = new DOMXpath($doc);
// End MP3 URL
$a = 1;
if (have_posts()) :
while ( have_posts() ) : the_post();
?>
<?php
$BEpost_content = get_the_content();
if (strlen($BEpost_content) > 0) {
echo "<div id='debug_content'>get_the_content has something</div>";
} else {
echo "<div id='debug_content'>BEpost_content is empty</div>" ;
};
$success = $doc->loadHTML($BEpost_content);
if ($success === FALSE) {
echo "<div id='debug_loadcontent'>loadHTML failed to load post content</div>";
} else {
$hrefs = $xpath->query("//a[contains(@href,'mp3')]/@href");
if ($hrefs->length > 0) {
echo "<div id='debug_xpath'>xpath found something</div>";
} else {
echo "<div id='debug_xpath'>xpath found nothing</div>";
};
$BEmp3s = $hrefs->item(0);
};
?>
Here is the function get_the_content() which returns a string to my knowledge:
function get_the_content($more_link_text = null, $stripteaser = 0) {
global $post, $more, $page, $pages, $multipage, $preview;
if ( null === $more_link_text )
$more_link_text = __( '(more...)' );
$output = '';
$hasTeaser = false;
// If post password required and it doesn't match the cookie.
if ( post_password_required($post) ) {
$output = get_the_password_form();
return $output;
}
if ( $page > count($pages) ) // if the requested page doesn't exist
$page = count($pages); // give them the highest numbered page that DOES exist
$content = $pages[$page-1];
if ( preg_match('/<!--more(.*?)?-->/', $content, $matches) ) {
$content = explode($matches[0], $content, 2);
if ( !empty($matches[1]) && !empty($more_link_text) )
$more_link_text = strip_tags(wp_kses_no_null(trim($matches[1])));
$hasTeaser = true;
} else {
$content = array($content);
}
if ( (false !== strpos($post->post_content, '<!--noteaser-->') && ((!$multipage) || ($page==1))) )
$stripteaser = 1;
$teaser = $content[0];
if ( ($more) && ($stripteaser) && ($hasTeaser) )
$teaser = '';
$output .= $teaser;
if ( count($content) > 1 ) {
if ( $more ) {
$output .= '<span id="more-' . $post->ID . '"></span>' . $content[1];
} else {
if ( ! empty($more_link_text) )
$output .= apply_filters( 'the_content_more_link', ' <a href="' . get_permalink() . "#more-{$post->ID}\" class=\"more-link\">$more_link_text</a>", $more_link_text );
$output = force_balance_tags($output);
}
}
if ( $preview ) // preview fix for javascript bug with foreign languages
$output = preg_replace_callback('/\%u([0-9A-F]{4})/', '_convert_urlencoded_to_entities', $output);
return $output;
}
Your previous question told you to check the length of hrefs to see whether it had content. This is correct because $hrefs is a DOMNodeList, which exposes a length property. get_the_content() returns a string (see docs).
To check string length use strlen
To check if null use is_null
To check if set use isset
Difference between isset and is_null
Update
You asked why your code branches incorrectly at the following line:
$hrefs = $xpath->query("//a[contains(@href,'mp3')]/@href");
You also say that $xpath is defined further up in the code. However, you redefine $doc so why would $xpath have the correct values in it?
$success = $doc->loadHTML($BEpost_content); //you change $doc here!
$xpath = new DOMXpath($doc); //so perhaps you should load it into xpath here?
$hrefs = $xpath->query("//a[contains(@href,'mp3')]/@href"); //selects the href attribute of every <a> whose href contains 'mp3'
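Putting that together, a sketch that creates the DOMXPath only after the HTML has been loaded. The function name firstMp3Href() is hypothetical, and the sketch assumes the post content is a non-empty string:

```php
<?php
// Hypothetical helper: return the first href containing "mp3", or null.
function firstMp3Href(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($loaded === false) {
        return null;
    }

    // Build the XPath object from the document that now holds the content.
    $xpath = new DOMXPath($doc);
    $hrefs = $xpath->query("//a[contains(@href, 'mp3')]/@href");

    return $hrefs->length > 0 ? $hrefs->item(0)->nodeValue : null;
}
```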

(PHP) Regex for finding specific href tag

i have a html document with n "a href" tags with different target urls and different text between the tag.
For example:
<span ....>lorem ipsum</span>
<span ....>example</span>
example3
<img ...>test</img>
without a d as target url
As you can see the target urls switch between "d?, d., d/d?, d/d." and between the "a tag" there could be any type of html which is allowed by w3c.
I need a Regex which gives me all links which has one of these combination in the target url:
"d?, d., d/d?, d/d." and has "Lorem" or "test" between the "a tags" in any position including sub html tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test as followed:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.
If there is a easier way with SimpleXml or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!
Here you go:
$html = array
(
'<span ....>lorem ipsum</span>',
'<span ....>example</span>',
'example3',
'<img ...>test</img>',
'without a d as target url',
);
$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');
foreach ($anchors as $anchor)
{
if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
{
$result[] = strval($anchor['href']);
}
}
echo '<pre>';
print_r($result);
echo '</pre>';
Output:
Array
(
[0] => http://www.example.com/d?12345abc
[1] => http://www.example.com/d/d.1234
)
The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:
function phXML($xml, $xpath = null)
{
if (extension_loaded('libxml') === true)
{
libxml_use_internal_errors(true);
if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
{
if (is_string($xml) === true)
{
$dom = new DOMDocument();
if (@$dom->loadHTML($xml) === true)
{
return phXML(@simplexml_import_dom($dom), $xpath);
}
}
else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
{
if (isset($xpath) === true)
{
$xml = $xml->xpath($xpath);
}
return $xml;
}
}
}
return false;
}
I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
Here is a Regular Expression which works:
$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);
The only thing is that it relies on there being a new-line character between each <a> tag. Otherwise it will match something like:
example3<img ...>test</img>
Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.
There's a good list of them here:
Robust and Mature HTML Parser for PHP
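As a rough sketch of the parser route using only the bundled DOM extension: walk the anchors, test the href with a small regex and test the text content (which includes text inside nested tags) separately. The function name findMatchingLinks() and the sample markup are assumptions made for illustration, since the original links are not shown:

```php
<?php
// Hypothetical helper: return hrefs whose URL contains "/d?" or "/d." and whose
// anchor text (including nested tags) contains "lorem" or "test".
function findMatchingLinks(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $urlMatches  = preg_match('~/d[.?]~', $href) === 1;
        $textMatches = preg_match('~lorem|test~i', $a->textContent) === 1;
        if ($urlMatches && $textMatches) {
            $result[] = $href;
        }
    }
    return $result;
}
```

Splitting the URL check from the text check avoids the greedy ".*?" problem entirely, because the parser has already decided where each anchor starts and ends.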
Will print only first and fourth link because two conditions are met.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);
for($i = 0; $i < $count; $i++){
if(
strpos($matches[1][$i], '/d') !== false
&&
preg_match('#(lorem|test)#is', $matches[3][$i]) == true
)
{
echo $matches[1][$i];
}
}

How to add rel="nofollow" to links with preg_replace()

The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since it's an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest: since it does not match the blog_url, it's obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode(' ', $oldRel);
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo which wasn't declared anywhere.
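To illustrate why an undefined $rseo would produce that symptom: $my_folder ends up empty, the pattern becomes '~~', and an empty pattern matches every string, so the "matches $my_folder" branch is true for every link. A small sketch, where the helper name matchesFolder() is hypothetical:

```php
<?php
// An empty pattern matches anything, which is what an empty $my_folder produces.
var_dump(preg_match('~' . '' . '~', '<a href="http://localhost/mytest/page">')); // int(1)

// Hypothetical guard: treat an empty folder as "no match" and escape the
// folder string so characters like "." are taken literally.
function matchesFolder(string $tag, string $folder): bool
{
    if ($folder === '') {
        return false;
    }
    return preg_match('~' . preg_quote($folder, '~') . '~', $tag) === 1;
}
```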
Try this one (PHP 5.3+):
skip selected address
allow manually set rel parameter
and code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// stays the same because it already contains a rel parameter
echo nofollow('something');
// adds the rel="nofollow" parameter to the anchor
echo nofollow('something', 'localhost');
// skips this link as an internal link
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's what it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() stuff, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited at processing a whole page rather than a HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc...
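For an HTML block rather than a whole page, one possible workaround is sketched below. It assumes libxml 2.7.7 or later (for LIBXML_HTML_NOIMPLIED) and ASCII-only content. The fragment is wrapped in a temporary div so there is a single root element, and only that div's children are serialized. The function name addNofollowToFragment() is hypothetical:

```php
<?php
// Hypothetical helper: add rel="nofollow" to external and "cloaked" links in a
// fragment without DOM adding <html>, <body> or a DOCTYPE to the output.
function addNofollowToFragment(string $fragment, string $blogUrl, string $myFolder = ''): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $fragment . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $isInternal = strpos($href, $blogUrl) === 0;
        $isCloaked  = $myFolder !== '' && strpos($href, $myFolder) === 0;
        if ($isInternal && !$isCloaked) {
            continue; // plain internal link, leave untouched
        }
        $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $rel, true)) {
            $rel[] = 'nofollow';
            $a->setAttribute('rel', implode(' ', $rel));
        }
    }

    // Serialize the wrapper's children only, dropping the temporary <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```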
Thanks @alex for your nice solution. However, I had a problem with Japanese text, which I fixed as follows. This code can also skip multiple domains via the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** @var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode(' ', $values);
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. @see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
<?php
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is another solution, which has a whitelist option and adds a target="_blank" attribute.
It also checks whether a rel attribute already exists before adding a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress solution:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
A script that adds nofollow automatically and keeps the other attributes:
function nofollow(string $html, string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$rel = $rel[1];
}
$rel = 'rel="' . ($rel ? (strpos($rel, 'nofollow') !== false ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}
