how to replace url path in html file - php

How to find and replace all URL paths in an HTML file? I have an HTML file with links from Wayback Machine, like these:
"/web/2016***/http://blog.mydomain.com/archive/img.jpg"
"/web/2016***/http://blog.mydomain.com/archive/img2.jpg"
"/web/2016***/http://blog.mydomain.com/archive/page2.html"
The 2016*** part is dynamic. How do I extract these elements:
"/archive/img.jpg"
"/archive/img2.jpg"
"/archive/page2.html"
I have tried:
$html = $url;
$content = file_get_contents($html);
$newhtml = preg_replace( 'web/-[^-.]*\./' , '/' , $content);
file_put_contents('post1.html', $newhtml);

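Try this regular expression: \/web.*blog\.mydomain\.com(.*)
In preg_replace() the pattern must be wrapped in delimiters, and the result must be assigned:
$newhtml = preg_replace('/\/web.*blog\.mydomain\.com(.*)/', '\1', $content);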
Check it out in action: https://regex101.com/r/m5ZaRo/3
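A stricter variant, as a minimal sketch: it assumes every archived link has the shape /web/<something>/http(s)://<host>/path and that the dynamic part never contains a slash or a quote. The helper name strip_wayback_prefix and the timestamp in the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: removes "/web/<dynamic part>/http(s)://<host>" and keeps the path.
function strip_wayback_prefix(string $content, string $host): string
{
    // "#" is used as the delimiter so slashes need no escaping.
    // preg_quote() escapes the dots in the host name.
    // (?=/) requires a path to follow, so the host itself is removed but the leading "/" stays.
    $pattern = '#/web/[^/"\']+/https?://' . preg_quote($host, '#') . '(?=/)#i';

    // preg_replace() returns null on a regex error; fall back to the input in that case.
    return preg_replace($pattern, '', $content) ?? $content;
}

// The timestamp below is a placeholder, not taken from the question.
$content = '<img src="/web/20160101000000/http://blog.mydomain.com/archive/img.jpg">'
         . '<a href="/web/20160101000000/http://blog.mydomain.com/archive/page2.html">next</a>';

echo strip_wayback_prefix($content, 'blog.mydomain.com');
```

Because the dynamic part is matched with a negated character class rather than .*, two archived links on the same line are handled separately.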

Related

Keep Only subdirectory from href and src (ROOT html links )

Hello, I have code that copies the HTML from an external URL and echoes it on my page.
Some of the HTML contains links and/or image src attributes.
I need some help truncating them (from absolute URLs to relative URLs inside $data).
For example, inside the HTML there is an href
<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
or SRC
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
I would like to keep only the subdirectory:
/products/score-vs-ibd/
/Filters/MinDUp1.gif
Maybe with preg_replace, but I'm not familiar with regular expressions.
This is my original code, which works very well, but now I'm stuck truncating the links.
<?php
$post_tags = get_the_tags();
if ( $post_tags ) {
$tag = $post_tags[0]->name;
}
$html= file_get_contents('https://www.trade-ideas.com/ticky/ticky.html?symbol='. "$tag");
$start = strpos($html,'<div class="span3 height-325"');
$end = strpos($html,'<!-- /span -->',$start);
$data= substr($html,$start,$end-$start);
echo $data ;
?>
Here is the code:
function getUrlPath($url) {
$re = '/(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*)/';
preg_match($re, $url, $matches);
return $matches[1];
}
Example: getUrlPath("http://myassets.com:80/files/images/image.gif") returns files/images/image.gif
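As an alternative sketch, PHP's built-in parse_url() can split the URL without a hand-written regex. The helper name url_path is hypothetical, and unlike the regex above the returned path keeps its leading slash.

```php
<?php
// Hypothetical helper: returns the path (plus query string, if any) of a URL,
// or null when the URL is malformed or has no path.
function url_path(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['path'])) {
        return null;
    }

    $path = $parts['path'];
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    return $path;
}

var_dump(url_path('http://myassets.com:80/files/images/image.gif'));
```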
You can locate all the URLs in the html string with a regex using preg_match_all().
The regex:
'/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i'
will capture both the entire URL and the path/query string for every occurrence of ="http://domain/path" or ='https://domain/path?query' (http/https, single or double quotes, with/without query string).
Then you can just use str_replace() to update the html string.
<?php
$html = '<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
<img src=\'https://static.trade-ideas.com/Filters/MinDUp1.gif?param=value\'>';
$pattern = '/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i';
$urls = [];
preg_match_all($pattern, $html, $urls);
//var_dump($urls);
foreach($urls[1] as $i => $uri){
$html = str_replace($uri, $urls[2][$i], $html);
}
echo $html;
Note, this will change all absolute URLs enclosed in quotes immediately following an =.
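One caveat with the pattern above: the greedy .* in the second group runs to the last quote on the line, so a tag with further quoted attributes after the URL would be captured too far. A possible variant, as a sketch only, uses negated character classes and does the replacement in a single preg_replace() call. The alt attribute in the sample is added purely for illustration.

```php
<?php
// Sample tag with a second quoted attribute after the URL.
$html = '<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif" alt="chart">';

// [^\/\'"]+ stops the host at the first slash or quote;
// [^\'"]* stops the path at the closing quote instead of running to the end of the line.
$pattern = '/(=[\'"])https?:\/\/[^\/\'"]+(\/[^\'"]*)/i';

// $1 is the "=" plus opening quote, $2 is the path (and query string, if present).
$relative = preg_replace($pattern, '$1$2', $html);

echo $relative;
```

Like the original, this still rewrites every quoted absolute URL that directly follows an =, whatever the attribute.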

PHP convert a URL into clickable link but without image src

I want to convert a string in PHP. This string may contain URL or image tags or other tags, but I don't want to convert image tags src value into a link. For example:
We have a link https://youtube.com/watch/8374h87shdv which needs to be converted but this is not to be <image class="emoji" alt="emoji" src="https://icloud.com/png/sdsdv234f.png"
The above string needs to be converted but without src URL.
I am using this currently:
function convert_strings( $content ){
$url = '~(?:(https?)://([^\s<]+)|(www\.[^\s<]+?\.[^\s<]+))(?<![\.,:])~i';
$content = preg_replace($url, '$0', $content);
$content = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/', '#$1', $content);
$content = convert_smilies( $content );
return $content;
}
But this converts every URL, including the one in the src attribute. How can I avoid that?
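One possible approach, sketched under the assumption that attribute values are always quoted: a negative lookbehind skips any URL that directly follows a quote or an equals sign. The helper name linkify_bare_urls and the simplified URL pattern are illustrative, not the pattern from the question.

```php
<?php
// Hypothetical helper: wraps bare URLs in <a> tags but leaves URLs inside
// quoted attribute values (such as src="...") untouched.
function linkify_bare_urls(string $content): string
{
    // (?<![\'"=]) rejects a URL that directly follows a quote or "=".
    // [^\s<>"\']+ ends the URL at whitespace, a tag boundary, or a quote.
    $pattern = '~(?<![\'"=])\bhttps?://[^\s<>"\']+~i';

    return preg_replace($pattern, '<a href="$0">$0</a>', $content) ?? $content;
}

$text = 'See https://youtube.com/watch/8374h87shdv and <img src="https://icloud.com/png/sdsdv234f.png">';

echo linkify_bare_urls($text);
```

This simplified pattern does not trim trailing punctuation the way the original one does, so a URL followed directly by a comma or period would include it.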

Replacing Relative Links with External Links in PHP String

I am working with an editor that works purely with internal relative links for files which is great for 99% of what I use it for.
However, I am also using it to insert links to files within an email body and relative links don't cut the mustard.
Instead of modifying the editor, I would like to search the string from the editor and replace the relative links with external links as shown below
Replace
files/something.pdf
With
https://www.someurl.com/files/something.pdf
I have come up with the following but I am wondering if there is a better / more efficient way to do it with PHP
<?php
$string = '<a href="files/something.pdf">A link</a>, some other text, <a href="files/somethingelse.pdf">A different link</a>';
preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $string, $result);
if (!empty($result['href'])) {
// Found a link.
$baseUrl = 'https://www.someurl.com';
$newUrls = array();
$newString = '';
foreach($result['href'] as $url) {
$newUrls[] = $baseUrl . '/' . $url;
}
$newString = str_replace($result['href'], $newUrls, $string);
echo $newString;
}
?>
Many thanks
Lee
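You can simply use preg_replace() to replace all URLs that start with files inside double quotes:
$string = '<a href="files/something.pdf">A link</a>, some other text, <a href="files/somethingelse.pdf">A different link</a>';
$string = preg_replace('/"(files.*?)"/', '"https://www.someurl.com/$1"', $string);
The result would be:
<a href="https://www.someurl.com/files/something.pdf">A link</a>, some other text, <a href="https://www.someurl.com/files/somethingelse.pdf">A different link</a>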
You really should use DOMDocument for this kind of job, but if you want to use a regex, this one works:
$string = '<a some_attribute href="files/something.pdf" class="abc">A link</a>, some other text, <a class="def" href="files/somethingelse.pdf" attr="xyz">A different link</a>';
$baseUrl = 'https://www.someurl.com';
$newString = preg_replace('/(<a[^>]+href=([\'"]))(.+?)\2/i', "$1$baseUrl/$3$2", $string);
echo $newString,"\n";
Output:
<a some_attribute href="https://www.someurl.com/files/something.pdf" class="abc">A link</a>, some other text, <a class="def" href="https://www.someurl.com/files/somethingelse.pdf" attr="xyz">A different link</a>
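For comparison, a minimal sketch of the DOMDocument route. The helper name absolutize_links is hypothetical, and the sketch assumes the input is an ASCII HTML fragment (loadHTML() needs extra care with other encodings).

```php
<?php
// Hypothetical helper: makes relative <a href> values absolute using DOMDocument.
function absolutize_links(string $html, string $baseUrl): string
{
    $doc = new DOMDocument();

    // Wrap the fragment in a <div> so it has a single root element; the flags stop
    // libxml from adding <html>/<body> and a doctype around it.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Skip empty hrefs and anything that starts with a scheme, a slash, or "#".
        if ($href === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|/|#)~i', $href)) {
            continue;
        }
        $link->setAttribute('href', rtrim($baseUrl, '/') . '/' . $href);
    }

    // Serialise only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$string = '<a class="abc" href="files/something.pdf">A link</a>, some other text';
echo absolutize_links($string, 'https://www.someurl.com');
```

The parser handles attribute order and quoting style, which is what the regex versions have to approximate.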

replace all url from html file with regex php

I want to save an external HTML file using PHP.
How can I search for and replace all URLs with a regex in PHP? For example:
href="/web/***/http://blog.domain.com/site/styles-site.css"
where *** is a dynamic code that is not known in advance,
should be replaced with
href="site/styles-site.css"
My code:
$html=$url;
$content = file_get_contents($html);
$newhtml = preg_replace( 'web/-[^-.]*\./' , '.' , $content);
file_put_contents('post1.html', $newhtml);

Removal of bad hyperlinks and the content inside of them

Ok, basically I have an array of bad urls and I would like to search through a string and strip them out. I want to strip everything from the opening tag to the closing tag, but only if the url in the hyperlink is in the array of bad urls. Here is how I would picture it working but I don't understand regular expressions well.
foreach($bad_urls as $bad_url){
$pattern = "/<a*$bad_url*</a>/";
$replacement = ' ';
preg_replace($pattern, $replacement, $content);
}
Thanks in advance.
Assuming that your 'bad urls' are properly formatted URLs, I would suggest doing something like this:
foreach ($bad_urls as $bad_url) {
    // Match an <a> tag whose href equals the bad URL, up to its closing </a>.
    // [^>]* keeps the match inside one opening tag; the "i" flag covers upper-case tags.
    $pattern = '/<a\s[^>]*href="' . convert_to_pattern($bad_url) . '"[^>]*>.*?<\/a>/is';
    $replacement = ' ';
    $content = preg_replace($pattern, $replacement, $content);
}
and separately
function convert_to_pattern($url)
{
    // preg_quote() escapes regex metacharacters, plus the "/" delimiter given as second argument.
    return preg_quote($url, '/');
}
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, find all the <a> tags and check the href property. Much simpler and fool-proof.
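A minimal sketch of that DOM approach, assuming the bad URLs must match the href value exactly. The helper name remove_bad_links and the example.com-style URLs are made up for illustration.

```php
<?php
// Hypothetical helper: removes every <a> element whose href is listed in $badUrls,
// together with everything between its opening and closing tag.
function remove_bad_links(string $html, array $badUrls): string
{
    $doc = new DOMDocument();

    // Wrap the fragment in a <div> so it has a single root element.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array first: removing nodes while iterating
    // the live list would skip elements.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $link) {
        if (in_array($link->getAttribute('href'), $badUrls, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    // Serialise only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = 'Keep <a href="http://good.example/">this</a>, drop <a href="http://bad.example/">that</a>.';
echo remove_bad_links($content, ['http://bad.example/']);
```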
