Keep Only subdirectory from href and src (ROOT html links )

Hello, I have code that copies the HTML from an external URL and echoes it on my page.
Some of that HTML contains links and/or picture src attributes.
I need some help truncating them (from absolute URLs to relative URLs inside $data).
For example, inside the HTML there is an href:
<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
or a src:
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
I would like to keep only the subdirectory (the path):
/products/score-vs-ibd/
/Filters/MinDUp1.gif
Maybe with preg_replace, but I'm not familiar with regular expressions.
This is my original code, which works very well, but now I'm stuck truncating the links.
<?php
$post_tags = get_the_tags();
if ( $post_tags ) {
$tag = $post_tags[0]->name;
}
$html= file_get_contents('https://www.trade-ideas.com/ticky/ticky.html?symbol='. "$tag");
$start = strpos($html,'<div class="span3 height-325"');
$end = strpos($html,'<!-- /span -->',$start);
$data= substr($html,$start,$end-$start);
echo $data ;
?>

Here is the code:
function getUrlPath($url) {
$re = '/(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*)/';
preg_match($re, $url, $matches);
return $matches[1];
}
Example: getUrlPath("http://myassets.com:80/files/images/image.gif") returns files/images/image.gif (note: without the leading slash).
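If a regex feels fragile here, PHP's built-in parse_url() can do the same job. The following is only a minimal sketch: the function name urlPathWithQuery is hypothetical (not from the answer above), and it assumes you want to keep the leading slash and any query string.

```php
<?php
// Hypothetical helper: returns the path (plus query string, if any) of a URL.
function urlPathWithQuery(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return ''; // seriously malformed URL
    }
    // A URL such as http://example.com has no path component; treat it as "/".
    $path = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    return $path;
}

echo urlPathWithQuery('http://myassets.com:80/files/images/image.gif'), "\n";
```

Unlike the regex version, this keeps the leading slash, which is what the question's expected output shows.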

You can locate all the URLs in the html string with a regex using preg_match_all().
The regex:
'/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i'
will capture both the entire URL and the path/query string for every occurrence of ="http://domain/path" or ='https://domain/path?query' (http/https, single or double quotes, with/without query string).
Then you can just use str_replace() to update the html string.
<?php
$html = '<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
<img src=\'https://static.trade-ideas.com/Filters/MinDUp1.gif?param=value\'>';
$pattern = '/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i';
$urls = [];
preg_match_all($pattern, $html, $urls);
//var_dump($urls);
foreach($urls[1] as $i => $uri){
$html = str_replace($uri, $urls[2][$i], $html);
}
echo $html;
Note that this changes every absolute URL enclosed in quotes immediately following an =, not only those in href and src attributes.
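One caveat with the pattern above: because (\/.*) is greedy, two absolute URLs on the same line end up in a single match, and only the first one gets rewritten. A possible variation, sketched below, restricts the match to href and src and stops at the closing quote. The function name stripUrlOrigin is hypothetical, and the sketch assumes attribute values are always quoted.

```php
<?php
// Hypothetical helper: turns absolute http(s) URLs in href/src into root-relative ones.
function stripUrlOrigin(string $html): string
{
    // 1: attribute name, 2: opening quote, 3: everything after the host up to the closing quote
    $pattern = '~\b(href|src)\s*=\s*(["\'])https?://[^/"\'?#\s]+([^"\']*)\2~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $path = $m[3] === '' ? '/' : $m[3];
        if ($path[0] !== '/') {
            $path = '/' . $path; // e.g. http://host?x=1 becomes /?x=1
        }
        return $m[1] . '=' . $m[2] . $path . $m[2];
    }, $html);
}

$html = '<a href="https://www.trade-ideas.com/products/score-vs-ibd/" title="x">'
      . '<img src=\'https://static.trade-ideas.com/Filters/MinDUp1.gif?param=value\'>';
echo stripUrlOrigin($html), "\n";
```

In the question's code this would be called as echo stripUrlOrigin($data); in place of echo $data;.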

Related

How to preg_match_all to get the text inside the tags "<h3>" and "<h3> <a/> </h3>"

Hello, I am currently creating an automatic table of contents for my WordPress site. My reference is
https://webdeasy.de/en/wordpress-table-of-contents-without-plugin/
Problem:
Everything goes well unless the <h3> tag contains an <a> link; in that case the $names result is missing.
I think the problem is in this regex:
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?>(.*)<\/h[3,4]>/", $content, $matches);
// get text under <h3> or <h4> tag.
$names = $matches[2];
I have tried modifying the regex (I don't really understand this)
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?><a(.*)>(.*)<\/a><\/h[3,4]>/", $content, $matches);
// get text under <a> tag.
$names = $matches[4];
The code above works for finding the text inside <h3><a>a text</a></h3>, but an h3 tag that doesn't contain an <a> tag is a problem.
My question:
How can I combine the two snippets above?
My expectation is that when the first regex returns no result, the second one is used instead.
Or maybe there is a better solution? Thank you.

Here's a way that will remove any tags inside of header tags
$html = <<<EOT
<h3>Here's an alternative solution</h3> to using regex. <h3>It may <a name='#thing'>not</a></h3> be the most elegant solution, but it works
EOT;
preg_match_all('#<h(.*?)>(.*?)<\/h(.*?)>#si', $html, $matches);
foreach ($matches[0] as $num=>$blah) {
$look_for = preg_quote($matches[0][$num],"/");
$tag = str_replace("<","",explode(">",$matches[0][$num])[0]);
$replace_with = "<$tag>" . strip_tags($matches[2][$num]) . "</$tag>";
$html = preg_replace("/$look_for/", $replace_with,$html,1);
}
echo "<pre>$html</pre>";
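The same idea can be written more compactly with preg_replace_callback(). The sketch below is an assumption-laden variation rather than the answer's own code: the function name stripTagsInsideHeadings is hypothetical, it only touches h1 to h6 (the pattern above would also match tags such as <html> or <head>), and it keeps the heading's attributes intact.

```php
<?php
// Hypothetical helper: removes nested tags (e.g. <a>) inside <h1>..<h6>, keeping their text.
function stripTagsInsideHeadings(string $html): string
{
    return preg_replace_callback(
        // 1: tag name (h1..h6), 2: attributes, 3: inner HTML
        '#<(h[1-6])\b([^>]*)>(.*?)</\1>#si',
        function (array $m): string {
            return '<' . $m[1] . $m[2] . '>' . strip_tags($m[3]) . '</' . $m[1] . '>';
        },
        $html
    );
}

echo stripTagsInsideHeadings("<h3 id=\"intro\">It may <a name='#thing'>not</a></h3> be elegant"), "\n";
```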

The answer by @kinglish is the basis of this solution, thank you very much. I slightly modified and simplified it according to the article linked in my question. This code worked for me:
preg_match_all('#(\<h[3-4])\sid=\"(.*?)\"?\>(.*?)(<\/h[3-4]>)#si',$content, $matches);
$tags = $matches[0];
$ids = $matches[2];
$raw_names = $matches[3];
/* Clean $raw_names of other HTML tags */
$clean_names= array_map(function($v){
return trim(strip_tags($v));
}, $raw_names);
$names = $clean_names;

remove tags <a> to a specific URL domain php [duplicate]

This script is not mine; I am trying to modify it. It searches for all the <a> tags and removes them, keeping only the link text. How would you modify the code so that it removes only the links to a given domain or URL? For example, removing the links to www.domainurl.com from:
fsdf
<a title="Google Adsense" href="https://www.domainurl.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">fgddf</a>
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
The result would look like this:
fsdf
fgddf
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
This is the code:
if (in_array ( 'OPT_STRIP', $camp_opt )) {
echo '<br>Striping links ';
//$abcont = strip_tags ( $abcont, '<p><img><b><strong><br><iframe><embed><table><del><i><div>' );
preg_match_all('{<a.*?>(.*?)</a>}' , $abcont , $allLinksMatchs);
$allLinksTexts = $allLinksMatchs[1];
$allLinksMatchs=$allLinksMatchs[0];
$j = 0;
foreach ($allLinksMatchs as $singleLink){
if(! stristr($singleLink, 'twitter.com'))
$abcont = str_replace($singleLink, $allLinksTexts[$j], $abcont);
$j++;
}
}
I tried the following, but it did not work for me:
Regex:
Specifying the domain in the preg_match_all search
preg_match_all('{<a.*?[^>]* href="((https?:\/\/)?([\w\-])+\.{1}domainurl\.([a-z]{2,6})([\/\w\.-]*)*\/?)">(.*?)</a>}' , $abcont , $allLinksMatchs);
Any ideas? I would be very grateful.

Rather than try and parse HTML with regular expressions, as you suggested, I have chosen to use the DOMDocument class instead.
function remove_domain($str, $domainsToRemove)
{
$domainsToRemove = is_array($domainsToRemove) ? $domainsToRemove : array_slice(func_get_args(), 1);
$dom = new DOMDocument;
$dom->loadHTML("<div>{$str}</div>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$anchors = $dom->getElementsByTagName('a');
// Code taken and modified from: http://php.net/manual/en/domnode.replacechild.php#50500
$i = $anchors->length - 1;
while ($i > -1) {
$anchor = $anchors->item($i);
foreach ($domainsToRemove as $domain) {
if (strpos($anchor->getAttribute('href'), $domain) !== false) {
// $new = $dom->createElement('p', $anchor->textContent);
$new = $dom->createTextNode($anchor->textContent);
$anchor->parentNode->replaceChild($new, $anchor);
// The anchor has been replaced, so stop checking the remaining domains.
break;
}
}
$i--;
}
// Create HTML string, then remove the wrapping div.
$html = $dom->saveHTML();
$html = substr($html, 5, strlen($html) - (strlen('</div>') + 1) - strlen('<div>'));
return $html;
}
You can then use the above code in the following examples.
Notice how you can pass in a single domain as a string, pass an array of domains, or take advantage of func_get_args and pass any number of domains as separate arguments.
$str = <<<str
fsdf
<a title="Google Adsense" href="https://www.domainurl.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">fgddf</a>
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
str;
// Example usage
remove_domain($str, 'domainurl.com');
remove_domain($str, 'domainurl.com', 'googlead.com');
remove_domain($str, ['domainurl.com', 'googlead.com']);
Firstly, I have stored your string in a variable, but that is just so that I could utilize it for the answer; replace $str with wherever you get that code from.
The loadHTML function takes an HTML string, but requires one child element - hence why I have wrapped the string in a div.
The while loop will iterate over the anchor elements, and then replace any that match a specified domain with just the content of the anchor tags.
Note that I have left a commented-out createElement line in the loop, which you can use instead of the createTextNode line. It replaces the anchor element with a p tag, which has a default style of display: block; meaning that your layout is less likely to break. However, since your expected output is just text nodes, I have left this as an option only.

What about:
<a.*? href=\".*www\.googlead\.com.*\">(.*?)<\/a>
So it becomes:
preg_match_all('{<a.*? href=\".*www\.googlead\.com.*\">(.*?)<\/a>}' , $abcont , $allLinksMatchs);
This matches only the <a> tags whose href contains www.googlead.com, so only those links are removed.

The following assumes your HTML is contained in a variable.
Using preg_replace should be a better option; here's a function that should help you a bit:
function removeLinkTagsOfDomain($html, $domain) {
// Escape all regex special characters, including the "/" delimiter
$domain = preg_quote($domain, '/');
// Search for <a> tags with a href attribute containing the specified domain;
// [^>]* and [^"]* keep the match inside one tag, and (.*?) stops at the first </a>
$pattern = '/<a\s[^>]*href="[^"]*' . $domain . '[^"]*"[^>]*>(.*?)<\/a>/is';
// Final replacement (the inner content of the <a> tag)
$replacer = '$1';
return preg_replace($pattern, $replacer, $html);
}
// Usage:
$domains = [...];
$html = '...';
foreach ($domains as $d) {
$html = removeLinkTagsOfDomain($html, $d);
}

Replace string with different strings containing pattern

Say you have a string like
$str = "<img src='i12'><img src='i105'><img src='i12'><img src='i24'><img src='i15'>....";
Is it possible to replace every i+n with the nth value of an array called $arr,
so that, for example, <img src='i12'> is replaced by <img src='...'> containing the value of $arr[12]?

If I were you, I'd simply parse the markup, and process/alter it accordingly:
$dom = new DOMDocument;
$dom->loadHTML($str);//parse your markup string
$imgs = $dom->getElementsByTagName('img');//get all images
$cleanStr = '';//the result string
foreach($imgs as $img)
{
$img->setAttribute(
'src',
//get the value of src, chop off the first char (i)
//use the rest as the index, optionally cast to int
$array[substr($img->getAttribute('src'), 1)]
);
$cleanStr .= $dom->saveXML($img);//get string representation of node
}
echo $cleanStr;//echoes new markup
Note that the code above replaces each src with the corresponding value from the array (called $array here, $arr in your question), not with a literal string like $array[n].

I would use preg_replace for this:
$pattern="/(src=)'\w(\d+)'/";
$replacement = '${1}\'\$arr[$2]\'';
preg_replace($pattern, $replacement, $str);
$pattern="/(src=)'\w(\d+)'/";
It matches blocks of text like src='letter + digits'.
The two capture groups hold the src= part and the digits so that they can be written back.
$replacement = '${1}\'\$arr[$2]\'';
This builds the replacement: the literal text $arr[N] in quotes, where N is the captured number. The array itself is not read.
Test
php > $str = "<img src='i12'><img src='i105'><img src='i12'><img src='i24'><img src='i15'>....";
php > $pattern="/(src=)'\w(\d+)'/";
php > $replacement = '${1}\'\$arr[$2]\'';
php > echo preg_replace($pattern, $replacement, $str);
<img src='$arr[12]'><img src='$arr[105]'><img src='$arr[12]'><img src='$arr[24]'><img src='$arr[15]'>....
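As the test output shows, preg_replace() inserts the literal text $arr[12] rather than the array's value. If the goal is to insert the actual values, preg_replace_callback() is one way to do it. This is a rough sketch only: the function name replaceSrcFromArray and the sample file names are made up for illustration, and indexes missing from the array are left untouched.

```php
<?php
// Hypothetical helper: replaces src='iN' with src='<value of $arr[N]>'.
function replaceSrcFromArray(string $str, array $arr): string
{
    return preg_replace_callback(
        '/(src=)\'i(\d+)\'/',
        function (array $m) use ($arr): string {
            $index = (int) $m[2];
            if (!array_key_exists($index, $arr)) {
                return $m[0]; // no such entry: keep the original attribute
            }
            return $m[1] . "'" . $arr[$index] . "'";
        },
        $str
    );
}

// Sample data, invented for the example
$arr = [12 => 'a.png', 105 => 'b.png'];
echo replaceSrcFromArray("<img src='i12'><img src='i105'><img src='i24'>", $arr), "\n";
```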

PHP: replace string from a particular string to a particular string

How can I remove the part of a string that lies between two particular substrings? Please see my example below.
$string_value="<b>Man one</b> <img src=\"http://www.abc.com/image1.jpg\">and <b>Man two</b> <img src=\"http://www.abc.com/img2.png\">";
My expected output is <b>Man one</b> and <b>Man two</b>; only the image tags should be deleted.
So I have to cut everything from "<img" to ">" out of $string_value.
How can I cut everything between "<img" and ">" using preg_replace or anything else?
The replacement should be blank.

Looks like you want to strip the tags. You can do it easily by calling the function strip_tags which gets HTML and PHP tags stripped from a given string.
$string_value = strip_tags($string_value);
EDIT:
Since you want to strip only the <img tag, you can make use of this function:
function strip_only($str, $tags) {
if(!is_array($tags)) {
$tags = (strpos($tags, '>') !== false ? explode('>', str_replace('<', '', $tags)) : array($tags));
if(end($tags) == '') array_pop($tags);
}
foreach($tags as $tag) $str = preg_replace('#</?'.$tag.'[^>]*>#is', '', $str);
return $str;
}
and call it as:
$string_value = strip_only($string_value,'img');

You can use regular expressions to exclude IMG tags:
<?php
$text = "Man one <img src=\"http://www.abc.com/image1.jpg\">and Man two <img src=\"http://www.abc.com/img2.png\">";
$pattern = "/<img(.*?)>/i";
$replace = '';
print preg_replace($pattern,$replace,$text);
?>

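Just use strip_tags($string_value, '<b>'); and you will get your desired output. The second argument lists the tags to keep, so the <img> tags are removed while the <b> tags stay.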

PHP preg_replace weirdness with custom urls

I'm using the following code to add <span> tags behind <a> tags.
$html = preg_replace("~<a.*?href=\"$url\".*?>.*?</a>~i", "$0<span>test</span>", $html);
The code works fine for regular links (e.g. http://www.google.com/), but it does not perform a replacement when $url contains $link$/3/.
This is example code to show the (mis)behaviour:
<?php
$urls = array();
$urls[] = '$link$/3/';
$urls[] = 'http://www.google.com/';
$html = '<a href="$link$/3/">Test Link</a>' . "\n" . '<a href="http://www.google.com/">Google</a>';
foreach($urls as $url) {
$html = preg_replace("~<a.*?href=\"$url\".*?>.*?</a>~i", "$0<span>test</span>", $html);
}
echo $html;
?>
And this is the output it produces:
<a href="$link$/3/">Test Link</a>
<a href="http://www.google.com/">Google</a><span>test</span>

Use $url = preg_quote($url, '~'); otherwise the dollar signs are interpreted as usual: as end-of-input anchors.

just somebody is correct; you must escape your special regex characters if you mean for them to be interpreted as literal.
It also looks to me like it can't perform the replace because it never makes a match.
Try replacing this line:
$urls[] = '$link$/3/';
With this:
$urls[] = '$link/3/';

$ is considered a special regex character and needs to be escaped. Use preg_quote() to escape $url before passing it to preg_replace().
$url = preg_quote($url, '~');
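Put together with the loop from the question, that could look roughly like the sketch below. The wrapper function appendSpanAfterLinks is hypothetical, and the sample HTML is assumed to contain anchors whose href values are exactly the strings in $urls.

```php
<?php
// Hypothetical wrapper: appends <span>test</span> after every link to one of $urls.
function appendSpanAfterLinks(string $html, array $urls): string
{
    foreach ($urls as $url) {
        // Escape regex metacharacters ($, ., ? ...) and the "~" delimiter
        $quoted = preg_quote($url, '~');
        $html = preg_replace(
            '~<a\b[^>]*href="' . $quoted . '"[^>]*>.*?</a>~is',
            '$0<span>test</span>',
            $html
        );
    }
    return $html;
}

$html = '<a href="$link$/3/">Test Link</a>' . "\n"
      . '<a href="http://www.google.com/">Google</a>';
echo appendSpanAfterLinks($html, ['$link$/3/', 'http://www.google.com/']), "\n";
```

Building the pattern in a single-quoted string also avoids PHP trying to interpolate anything that looks like a variable.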

$ has a special meaning in regex: end of line. Your expression is being expanded like this:
$html = preg_replace("~<a.*?href=\"$link$/3/\".*?>.*?</a>~i", "$0<span>test</span>", $html);
This fails because it can't find "link" between two end-of-line anchors. Try escaping the $ in the $urls array:
$urls[] = '\$link\$/3/';
