Prevent recursion on replacing elements

Prevent recursion on replacing elements - php

at first: No, this is not a duplicate. I know that there are some possibilities to search for elements in a HTML-page, but this is not really my problem.
I will outline my problem:
My PHP-code is for reasons I can not change called 2-3 times on every page-rendering.
My code crawls the html-content for specific words and replaces them with a link.
To archive this I am using https://github.com/sunra/php-simple-html-dom-parser .
This is my source:
foreach ( $dom->find( 'text' ) as $element ) {
//$config['exclusions'] is an array like ['a', 'img']
if ( !in_array( $element->parent()->tag, $config[ 'exclusions' ] ) ) {
foreach ( $markers as $marker ) {
$text = $marker[ 'text' ];
$url = $marker[ 'url' ];
$tip = strip_tags( $marker[ 'excerpt' ] );
$tooltip = ( $tooltip ? "data-uk-tooltip title='$tip'" : "" );
$tmpval = "tmpval-$i";
$element->innertext = preg_replace(
'/\b' . preg_quote( $text, "/" ) . '\b/i',
"<a href='$url' $hrefclass target='$target' $tmpval>\$0</a>",
$element->innertext,
1
);
$element->innertext = str_replace( $tmpval, $tooltip, $element->innertext );
$i++;
}
}
}
The problem is: If the $tooltip contains a word that matches a marker, this word is being replaced. So the result is <a href='foo.html' target='_self' data-uk-tooltip title='<a href='bar.html'...'>\$0</a> which destroys the markup of the page.
So my question: How can I prevent this?

Use lookbehind:
$element->innertext = preg_replace(
'(?<!\w=['"])\b' . preg_quote( $text, "/" ) . '\b/ig',
"<a href='$url' $hrefclass target='$target' $tmpval>\$0</a>",
$element->innertext,
1
);

Related

Replacing characters at start/end of string with PHP's ltrim/rtrim

I am looking to clean whitespace and hyphen characters from the start/end of a string.
I have the following test code:
$sample_titles = array(
'— — -Daisy Chain –',
'Ocarina for a Mermaid –',
' –—Another for a Sailor - '
);
foreach ( $sample_titles as $title ) {
$updated_title = ltrim( $title, ' -–—' );
$updated_title = rtrim( $updated_title, ' -–—' );
echo $updated_title . '<br/>';
}
This correctly outputs:
Daisy Chain
Ocarina for a Mermaid
Another for a Sailor
However, when I apply the same ltrim/rtrim logic in a foreach loop (over post titles, I'm "cleaning" imported data, the rest of the code is irrelevant) like this:
foreach ( $product_ids as $key => $product_id ) {
$title = get_the_title( $product_id );
$updated_title = ltrim( $title, ' -–—' );
$updated_title = rtrim( $updated_title, ' -–—' );
echo $updated_title . '<br/>';
};
I still end up with the hyphens/dashes/whitespace like this:
Orb 3 –
Ocarina for a Mermaid –
Mini Marina –
Any ideas why this works in one context but not the other?

Simply wrapping the title grab function in html_entity_decode() fixed the issue I was having:
$product_title = html_entity_decode( get_the_title( $product_id ) );

Wordpress: Automatically change URLs in the_content section

The solution from here isn't solving our problem.
I have already a solution to change all links in a field form in our theme. I am using different arrays for every network like $example_network_1 $example_network_2 with a PHP foreach for each affiliate network.
Now I need a solution to use this same arrays for the content of a WordPress post.
This solution is working for one network, but it caused a 404e Error for YouTube videos. We put the URL from a YouTube video and WordPress generates automatically an embedded video. With the following code we get a 404 error iframe or something like this.
We need a solution for more than one network.
I am very thankful for every help!
$example_network_1 = array(
array('shop'=>'shop1.com','id'=>'11111'),
array('shop'=>'shop2.co.uk','id'=>'11112'),
);
$example_network_2 = array(
array('shop'=>'shop-x1.com','id'=>'11413'),
array('shop'=>'shop-x2.net','id'=>'11212'),
);
add_filter( 'the_content', 'wpso_change_urls' ) ;
function wpso_found_urls( $matches ) {
global $example_network_1,$example_network_2;
foreach( $example_network_1 as $rule ) {
if (strpos($matches[0], $rule['shop']) !== false) {
$raw_url = trim( $matches[0], '"' ) ;
return '"https://www.network-x.com/click/'. $rule['id'].'lorem_lorem='.rawurlencode($raw_url ) . '"';
}
/*
foreach( $example_network_2 as $rule ) {
if (strpos($matches[0], $rule['shop']) !== false) {
$raw_url = trim( $matches[0], '"' ) ;
return '"https://www.network-y.com/click/'. $rule['id'].'lorem='.rawurlencode($raw_url ) . '"';
}
*/
}
}
function wpso_change_urls( $content ) {
global $example_network_1,example_network_2;
return preg_replace_callback( '/"+(http|https)(\:\/\/\S*'. $example_network_1 ['shop'] .'\S*")/', 'wpso_found_urls', $content ) ;
// return preg_replace_callback( '/"+(http|https)(\:\/\/\S*'. $example_network_2 ['shop'] .'\S*")/', 'wpso_found_urls', $content ) ;
}

autoembed is hooked at the_content with priority 8 on wp-includes/class-wp-embed.php:39
Try to lower the priority of the the_content filter so that the URL replacement happens before the embed, something like this:
add_filter( 'the_content', function ( $content ) {
/*
* Here, we define the replacements for each site in the network.
* '1' = main blog
* '2' = site 2 in the network, and so on
*
* To add more sites, just add the key number '3', etc
*/
$network_replacements = [
'1' => [
[ 'shop' => 'shop1.com', 'id' => '11111' ],
[ 'shop' => 'shop2.co.uk', 'id' => '11112' ],
],
'2' => [
[ 'shop' => 'shop-x1.com', 'id' => '11413' ],
[ 'shop' => 'shop-x2.net', 'id' => '11212' ],
]
];
// Early bail: Current blog ID does not have replacements defined
if ( ! array_key_exists( get_current_blog_id(), $network_replacements ) ) {
return $content;
}
$replacements = $network_replacements[ get_current_blog_id() ];
return preg_replace_callback( '/"+(http|https)(\:\/\/\S*' . $replacements['shop'] . '\S*")/', function( $matches ) use ( $replacements ) {
foreach ( $replacements as $rule ) {
if ( strpos( $matches[0], $rule['shop'] ) !== false ) {
$raw_url = trim( $matches[0], '"' );
return '"https://www.network-x.com/click/' . $rule['id'] . 'lorem_lorem=' . rawurlencode( $raw_url ) . '"';
}
}
}, $content );
}, 1, 1 );
This is not a copy and paste solution, but should get you going. You might need to tweak your "preg_replace_callback" code, but you said it was working so I just left it is it was.

If preventing the wp auto-embed works, then just add this line to your theme functions.php
remove_filter( 'the_content', array( $GLOBALS['wp_embed'], 'autoembed' ), 8 );

I wrote solution without test. Your code is hard to test without your site but I think that problem is with regex. Callback is hard to debugging. My version below.
First step, change your structure. I suspect that domains are unique. One dimensional array is more useful.
$domains = array(
'shop1.com'=>'11111',
'shop2.co.uk'=>'11112',
'shop-x1.com'=>'11413',
'shop-x2.net'=>'11212',
);
Next:
$dangerouschars = array(
'.'=>'\.',
);
function wpso_change_urls( $content ) {
global $domains,$dangerouschars;
foreach($domains as $domain=>$id){
$escapedDomain = str_replace(array_keys($dangerouschars),array_values($dangerouschars), $domain);
if (preg_match_all('/=\s*"(\s*https?:\/\/(www\.)?'.$escapedDomain.'[^"]+)\s*"\s+/mi', $content, $matches)){
// $matches[0] - ="https://example.com"
// $matches[1] - https://example.com
for($i = 0; $i<count($matches[0]); $i++){
$matchedUrl = $matches[1][$i];
$url = rawurlencode($matchedUrl);
//is very important to replace with ="..."
$content = str_replace($matches[0][$i], "=\"https://www.network-x.com/click/{$id}lorem_lorem={$url}\" ", $content);
}
}
}
return $content;
}
Example script

DOMXPath does not print HTML-source in <pre><code>...</code></pre> on screen

I created a little script that modifies the output of a specific page:
foreach ( $markers as $marker ) {
foreach ( $xpath->query( '//text()[not(ancestor::a) and not(ancestor::pre) and not(ancestor::code) and not(ancestor::img) and not(ancestor::script) and not(ancestor::style)]' ) as $node ) {
$text = $marker[ 'text' ];
$url = $marker[ 'url' ];
$tip = strip_tags( $marker[ 'excerpt' ] );
$tooltip = ( $tooltip ? "data-uk-tooltip='' title='$tip'" : "" );
$replaced = preg_replace(
'/\b' . preg_quote( $text, "/" ) . '\b/i',
"<a href='$url' $hrefclass target='$target' $tooltip>\$0</a>",
$node->wholeText
);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML( $replaced );
$node->parentNode->replaceChild( $newNode, $node );
}
}
$event->setContent( utf8_encode( $dom->saveHTML()) );
Now I got a problem:
If the page-content contains a <pre><code><strong>some-content</strong></code></pre> the elements are not displayed as text - they are getting parsed as source code.
So it does not display the text <strong>some-content</strong> but parses it and displays some-content

Wrap your XML output in a textarea.
<textarea style="width:100%; height:150px">
<strong>
some-content
</strong>
</textarea>

Extend PHP regex to cover "srcset" and "style" attributes

I've created a WordPress plugin that turn all links into protocol-relative URLs (removing http: and https:) based off the tags and attributes that I list in the $tag and $attribute variables. This is part of the function. To save space, the rest of the code can be found here.
$content_type = NULL;
# Check for 'Content-Type' headers only
foreach ( headers_list() as $header ) {
if ( strpos( strtolower( $header ), 'content-type:' ) === 0 ) {
$pieces = explode( ':', strtolower( $header ) );
$content_type = trim( $pieces[1] );
break;
}
}
# If the content-type is 'NULL' or 'text/html', apply rewrite
if ( is_null( $content_type ) || substr( $content_type, 0, 9 ) === 'text/html' ) {
$tag = 'a|base|div|form|iframe|img|link|meta|script|svg';
$attribute = 'action|content|data-project-file|href|src|srcset|style';
# If 'Protocol Relative URL' option is checked, only apply change to internal links
if ( $this->option == 1 ) {
# Remove protocol from home URL
$website = preg_replace( '/https?:\/\//', '', home_url() );
# Remove protocol from internal links
$links = preg_replace( '/(<(' . $tag . ')([^>]*)(' . $attribute . ')=["\'])https?:\/\/' . $website . '/i', '$1//' . $website, $links );
}
# Else, remove protocols from all links
else {
$links = preg_replace( '/(<(' . $tag . ')([^>]*)(' . $attribute . ')=["\'])https?:\/\//i', '$1//', $links );
}
}
# Return protocol relative links
return $links;
This works as intended, but it doesn't work on these examples:
<!-- Within the 'style' attribute -->
<div class="some-class" style='background-color:rgba(255,255,255,0);background-image:url("http://placehold.it/300x200");background-position:center center;background-repeat:no-repeat'>
<!-- Within the 'srcset' attribute -->
<img src="http://placehold.it/600x300" srcset="http://placehold.it/500 500x, http://placehold.it/100 100w">
However, the code partially works for these examples.
<div class="some-class" style='background-color:rgba(255,255,255,0);background-image:url("http://placehold.it/300x200");background-position:center center;background-repeat:no-repeat'>
<img src="http://placehold.it/600x300" srcset="//placehold.it/500 500x, http://placehold.it/100 100w">
I've played around with adding additional values to the $tag and $attribute variables, but that didn't help. I'd assume I need to update the rest of my regex to cover these two additional tags? Or is there is a different way to approach it, such as DOMDocument?

I was able to simplify the code by doing the following:
$content_type = NULL;
# Check for 'Content-Type' headers only
foreach ( headers_list() as $header ) {
if ( strpos( strtolower( $header ), 'content-type:' ) === 0 ) {
$pieces = explode( ':', strtolower( $header ) );
$content_type = trim( $pieces[1] );
break;
}
}
# If the content-type is 'NULL' or 'text/html', apply rewrite
if ( is_null( $content_type ) || substr( $content_type, 0, 9 ) === 'text/html' ) {
# Remove protocol from home URL
$website = $_SERVER['HTTP_HOST'];
$links = str_replace( 'https?://' . $website, '//' . $website, $links );
$links = preg_replace( '|https?://(.*?)|', '//$1', $links );
}
# Return protocol relative links
return $links;

Cut a string/url to always get a final string/url with a specific data and it's value in php

I have an url that contain the word "&key".
The "&key" word can be at the beginning or at the end of our url.
Ex1= http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxx&c4=xxx&f1=xxx
Ex2= http://xxxxx.com?c1=xxx&key=xxx&c2=xxx&c3=xxx&c4=xxx&f1=xxx
What I would like to get is all the time the url with the Key element and it's value.
R1: http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxx
R2: http://xxxxx.com?c1=xxx&key=xxx
Here is what I have done:
$lp_sp_ad_publisher = "http://xxxxx.com?c1=xxx&c2=xxx&c3=xxx&key=xxxc4=xxxf1=xxx";
$lp_sp_ad_publisher_cut_link = explode("&", $lp_sp_ad_publisher_cut[1]); // tab
$lp_sp_ad_publisher_cut_link_final = $lp_sp_ad_publisher_cut_link[0]; // http://xxxxx.com?c1=xxx
$counter = 1;
// finding &key inside $lp_sp_ad_publisher_cut_link_final
while ((strpos($lp_sp_ad_publisher_cut_link_final, '&key')) !== false);
{
$lp_sp_ad_publisher_cut_link_final .= $lp_sp_ad_publisher_cut_link[$counter];
echo 'counter: ' . $counter . ' link: ' . $lp_sp_ad_publisher_cut_link_final . '<br/>';
$counter++;
}
I'm only looping once all the time. I guess the while loop isn't refreshing with the inside new value. Any solution?

EDIT: Sorry, I misunderstood the question.
This is tricky because the url key and value can be anything, so it might be safer to breakdown the URL using a combination of parse_url() and parse_str(), then put the url back together leaving off the part you don't want. Something like this:
function cut_url( $url='', $key='' )
{
$output = '';
$parts = parse_url( $url );
$query = array();
if( isset( $parts['scheme'] ) )
{
$output .= $parts['scheme'].'://';
}
if( isset( $parts['host'] ) )
{
$output .= $parts['host'];
}
if( isset( $parts['path'] ) )
{
$output .= $parts['path'];
}
if( isset( $parts['query'] ) )
{
$output .= '?';
parse_str( $parts['query'], $query );
}
foreach( $query as $qkey => $qvalue )
{
$output .= $qkey.'='.$qvalue.'&';
if( $qkey == $key ) break;
}
return rtrim( $output, '&' );
}
Usage:
$input = 'https://www.xxxxx.com/test/path/index.php?c1=xxx&c2=xxx&key=xxx&c3=xxx&c4=xxx&f1=xxx';
$output = cut_url( $input, 'key' );
Output:
https://www.xxxxx.com/test/path/index.php?c1=xxx&c2=xxx&key=xxx

If the intention is to always ensure that the parameter key and it's associated value appear at the end of the string, how about something like:
$tmp=array();$key='';
$parts=explode( '&', parse_url( $_SERVER['REQUEST_URI'], PHP_URL_QUERY ) );
foreach( $parts as $pair ) {
list( $param,$value )=explode( '=',$pair );
if( $param=='key' )$key=$pair;
else $tmp[]=$pair;
}
$query = implode( '&', array( implode( '&', $tmp ), $key ) );
echo $query;
or,
parse_str( $_SERVER['QUERY_STRING'], $pieces );
foreach( $pieces as $param => $value ){
if( $param=='key' ) $key=$param.'='.$value;
else $tmp[]=$param.'='.$value;
}
$query = implode( '&', array( implode( '&', $tmp ), $key ) );
update
I'm puzzled that you were "not getting the good result"!
consider the url:
https://localhost/index.php?sort=0&dir=false&tax=23&cost=99&aardvark=creepy&key=banana&tree=large&ac=dc&limit=1000#569f945674935
The above would output:
sort=0&dir=false&tax=23&cost=99&aardvark=creepy&tree=large&ac=dc&limit=1000&key=banana
so the key=banana gets placed last using either method above.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Prevent recursion on replacing elements - php

Use lookbehind: $element->innertext = preg_replace( '(?<!\w=['"])\b' . preg_quote( $text, "/" ) . '\b/ig', "<a href='$url' $hrefclass target='$target' $tmpval>\$0</a>", $element->innertext, 1 );

Related

Replacing characters at start/end of string with PHP's ltrim/rtrim

Wordpress: Automatically change URLs in the_content section

DOMXPath does not print HTML-source in <pre><code>...</code></pre> on screen

Extend PHP regex to cover "srcset" and "style" attributes

Cut a string/url to always get a final string/url with a specific data and it's value in php

Categories

Resources