How to replace URL path in HTML file - PHP

How to find and replace all URL paths in an HTML file? I have an HTML file with links from Wayback Machine, like these:
"/web/2016***/http://blog.mydomain.com/archive/img.jpg"
"/web/2016***/http://blog.mydomain.com/archive/img2.jpg"
"/web/2016***/http://blog.mydomain.com/archive/page2.html"
The 2016*** part is dynamic. How do I extract these elements:
"/archive/img.jpg"
"/archive/img2.jpg"
"/archive/page2.html"
I have tried:
$html = $url;
$content = file_get_contents($html);
$newhtml = preg_replace( 'web/-[^-.]*\./' , '/' , $content);
file_put_contents('post1.html', $newhtml);

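Try this regular expression: \/web.*blog\.mydomain\.com(.*)
$newhtml = preg_replace('/\/web.*blog\.mydomain\.com(.*)/', '\1', $content);
PHP requires delimiters around the pattern (the outer slashes here). Because .* is greedy, this assumes at most one Wayback URL per line.
Check it out in action: https://regex101.com/r/m5ZaRo/3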
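If several Wayback links can share a line, a narrower pattern that matches only the prefix may be safer. The following is a minimal sketch, not a definitive implementation: the helper name stripWaybackPrefix and the sample timestamps are made up for illustration, and it assumes the dynamic part never contains a slash or a quote.

```php
<?php
// Hypothetical helper: removes "/web/<dynamic part>/http://<host>" and keeps the path.
function stripWaybackPrefix(string $html, string $host): string
{
    // "~" is used as the delimiter so the slashes need no escaping.
    // [^/"']+ stands for the dynamic 2016*** segment (assumed to hold no slash or quote).
    $pattern = '~/web/[^/"\']+/https?://' . preg_quote($host, '~') . '~i';
    return preg_replace($pattern, '', $html);
}

// Sample markup with two links on the same line (timestamps are invented).
$sample = '<img src="/web/20160101000000/http://blog.mydomain.com/archive/img.jpg">'
        . '<a href="/web/20160305123456/http://blog.mydomain.com/archive/page2.html">next</a>';

echo stripWaybackPrefix($sample, 'blog.mydomain.com');
// <img src="/archive/img.jpg"><a href="/archive/page2.html">next</a>
```

Replacing the prefix with an empty string, rather than capturing the rest of the line, lets every occurrence on a line be rewritten.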

Related

Keep Only subdirectory from href and src (ROOT html links )

Hello, I have code that copies the HTML from an external URL and echoes it on my page.
Some of the HTML contains links and/or picture src attributes.
I need some help truncating them (from absolute URL to relative URL inside $data).
For example, inside the HTML there is an href
<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
or src
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
I would like to keep only the subdirectory:
/products/score-vs-ibd/
/Filters/MinDUp1.gif
Maybe with preg_replace, but I'm not familiar with regular expressions.
This is my original code, which works very well, but now I'm stuck truncating the links.
<?php
$post_tags = get_the_tags();
if ( $post_tags ) {
$tag = $post_tags[0]->name;
}
$html= file_get_contents('https://www.trade-ideas.com/ticky/ticky.html?symbol='. "$tag");
$start = strpos($html,'<div class="span3 height-325"');
$end = strpos($html,'<!-- /span -->',$start);
$data= substr($html,$start,$end-$start);
echo $data ;
?>

Here is the code:
function getUrlPath($url) {
$re = '/(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*)/';
preg_match($re, $url, $matches);
return $matches[1];
}
Example: getUrlPath("http://myassets.com:80/files/images/image.gif") returns files/images/image.gif
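For a single, well-formed URL, PHP's built-in parse_url() can do the same job without a hand-written regex. This is a small sketch; the wrapper name getUrlPathBuiltin is hypothetical, and stripping the leading slash is only there to mirror the output of the function above.

```php
<?php
// Hypothetical wrapper around the built-in parse_url().
function getUrlPathBuiltin(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    // parse_url() returns false for seriously malformed URLs and null when no path is present.
    if ($path === false || $path === null) {
        return null;
    }
    // Drop the leading slash to match the regex-based helper's result.
    return ltrim($path, '/');
}

echo getUrlPathBuiltin('http://myassets.com:80/files/images/image.gif');
// files/images/image.gif
```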

You can locate all the URLs in the html string with a regex using preg_match_all().
The regex:
'/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i'
will capture both the entire URL and the path/query string for every occurrence of ="http://domain/path" or ='https://domain/path?query' (http/https, single or double quotes, with/without query string).
Then you can just use str_replace() to update the html string.
<?php
$html = '<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
<img src=\'https://static.trade-ideas.com/Filters/MinDUp1.gif?param=value\'>';
$pattern = '/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i';
$urls = [];
preg_match_all($pattern, $html, $urls);
//var_dump($urls);
foreach($urls[1] as $i => $uri){
$html = str_replace($uri, $urls[2][$i], $html);
}
echo $html;
Note, this will change all absolute URLs enclosed in quotes immediately following an =.
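To limit the change to href and src attributes and do it in one pass, preg_replace_callback() combined with parse_url() is one option. The sketch below is an assumption-laden illustration: the function name absoluteToRootRelative is invented here, and it expects attribute values to be quoted and to contain no quotes themselves.

```php
<?php
// Hypothetical helper: turns absolute http(s) URLs in href/src into root-relative ones.
function absoluteToRootRelative(string $html): string
{
    // Group 1: attribute name, group 2: the quote character, group 3: the absolute URL.
    $pattern = '~\b(href|src)=(["\'])(https?://[^"\']+)\2~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $parts = parse_url($m[3]);
        if ($parts === false) {
            return $m[0]; // leave unparseable URLs untouched
        }
        $relative = $parts['path'] ?? '/';
        if (isset($parts['query'])) {
            $relative .= '?' . $parts['query'];
        }
        // Rebuild the attribute with the original quote style.
        return $m[1] . '=' . $m[2] . $relative . $m[2];
    }, $html);
}

echo absoluteToRootRelative('<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">');
// <img src="/Filters/MinDUp1.gif">
```

Unlike the str_replace() loop, this leaves other quoted values that follow an = sign (for example a data attribute holding a URL) unchanged.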

PHP convert a URL into clickable link but without image src

I want to convert a string in PHP. This string may contain URL or image tags or other tags, but I don't want to convert image tags src value into a link. For example:
We have a link https://youtube.com/watch/8374h87shdv which needs to be converted but this is not to be <image class="emoji" alt="emoji" src="https://icloud.com/png/sdsdv234f.png"
The above string needs to be converted but without src URL.
I am using this currently:
function convert_strings( $content ){
$url = '~(?:(https?)://([^\s<]+)|(www\.[^\s<]+?\.[^\s<]+))(?<![\.,:])~i';
$content = preg_replace($url, '$0', $content);
$content = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/', '#$1', $content);
$content = convert_smilies( $content );
return $content;
}
But this converts every URL, including the one inside src. How can I convert only the URLs that are not inside an image tag?

Replacing Relative Links with External Links in PHP String

I am working with an editor that works purely with internal relative links for files which is great for 99% of what I use it for.
However, I am also using it to insert links to files within an email body and relative links don't cut the mustard.
Instead of modifying the editor, I would like to search the string from the editor and replace the relative links with external links as shown below
Replace
files/something.pdf
With
https://www.someurl.com/files/something.pdf
I have come up with the following but I am wondering if there is a better / more efficient way to do it with PHP
<?php
$string = '<a href="files/something.pdf">A link</a>, some other text, <a href="files/somethingelse.pdf">A different link</a>';
preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $string, $result);
if (!empty($result)) {
// Found a link.
$baseUrl = 'https://www.someurl.com';
$newUrls = array();
$newString = '';
foreach($result['href'] as $url) {
$newUrls[] = $baseUrl . '/' . $url;
}
$newString = str_replace($result['href'], $newUrls, $string);
echo $newString;
}
?>
Many thanks
Lee

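You can simply use preg_replace to replace every double-quoted URL that starts with files:
$string = '<a href="files/something.pdf">A link</a>, some other text, <a href="files/somethingelse.pdf">A different link</a>';
$string = preg_replace('/"(files.*?)"/', '"https://www.someurl.com/$1"', $string);
The result would be:
<a href="https://www.someurl.com/files/something.pdf">A link</a>, some other text, <a href="https://www.someurl.com/files/somethingelse.pdf">A different link</a>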

You really should use DOMDocument for this kind of job, but if you want to use a regex, this one works:
$string = '<a some_attribute href="files/something.pdf" class="abc">A link</a>, some other text, <a class="def" href="files/somethingelse.pdf" attr="xyz">A different link</a>';
$baseUrl = 'https://www.someurl.com';
$newString = preg_replace('/(<a[^>]+href=([\'"]))(.+?)\2/i', "$1$baseUrl/$3$2", $string);
echo $newString,"\n";
Output:
<a some_attribute href="https://www.someurl.com/files/something.pdf" class="abc">A link</a>, some other text, <a class="def" href="https://www.someurl.com/files/somethingelse.pdf" attr="xyz">A different link</a>
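One thing the regex does not guard against is an href that is already absolute, which would get the base URL prepended as well. A possible variation, sketched under the assumption that links with a scheme (http:, mailto:), protocol-relative links (//host) and fragment links (#id) should stay as they are, could look like this; the function name absolutizeLinks is hypothetical.

```php
<?php
// Hypothetical helper: prefixes relative hrefs in <a> tags with $baseUrl.
function absolutizeLinks(string $html, string $baseUrl): string
{
    // The negative lookahead skips values starting with a scheme, "//" or "#".
    $pattern = '~(<a\b[^>]*\bhref=(["\']))(?![a-z][a-z0-9+.-]*:|//|#)(.+?)\2~i';

    return preg_replace_callback($pattern, function (array $m) use ($baseUrl): string {
        // Normalise the slashes so exactly one separates base and path.
        return $m[1] . rtrim($baseUrl, '/') . '/' . ltrim($m[3], '/') . $m[2];
    }, $html);
}

$string = '<a class="def" href="files/somethingelse.pdf" attr="xyz">A different link</a>';
echo absolutizeLinks($string, 'https://www.someurl.com');
// <a class="def" href="https://www.someurl.com/files/somethingelse.pdf" attr="xyz">A different link</a>
```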

replace all url from html file with regex php

I want to save an external HTML file using PHP.
How do I search and replace all URLs with a regex in PHP? For example,
href="/web/***/http://blog.domain.com/site/styles-site.css"
where *** is a dynamic code that is not known in advance, should be replaced with
href="site/styles-site.css"
My code:
$html=$url;
$content = file_get_contents($html);
$newhtml = preg_replace( 'web/-[^-.]*\./' , '.' , $content);
file_put_contents('post1.html', $newhtml);

Removal of bad hyperlinks and the content inside of them

Ok, basically I have an array of bad urls and I would like to search through a string and strip them out. I want to strip everything from the opening tag to the closing tag, but only if the url in the hyperlink is in the array of bad urls. Here is how I would picture it working but I don't understand regular expressions well.
foreach($bad_urls as $bad_url){
$pattern = "/<a*$bad_url*</a>/";
$replacement = ' ';
preg_replace($pattern, $replacement, $content);
}
Thanks in advance.

Assuming that your 'bad urls' are properly formatted URLs, I would suggest doing something like this:
foreach($bad_urls as $bad_url){
// [^>]* keeps the match inside one opening tag; the i flag covers A/HREF, U makes .+ lazy.
$pattern = '/<a\s[^>]*href="' . convert_to_pattern($bad_url) . '".+<\/a>/isU';
$replacement = ' ';
// PHP has no preg_replace_all(); preg_replace() already replaces every match.
$content = preg_replace($pattern, $replacement, $content);
}
and separately
function convert_to_pattern($url)
{
// preg_quote() escapes all regex metacharacters, plus the "/" delimiter given as second argument.
return preg_quote($url, '/');
}

Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, find all the <a> tags and check the href property. Much simpler and fool-proof.
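The DOM approach described above might be sketched as follows. This is an illustration under assumptions rather than a drop-in solution: the function name removeBadLinks and the wrapper div are inventions for this example, the input is treated as an HTML fragment, and non-ASCII text may need extra encoding handling that is not shown.

```php
<?php
// Hypothetical helper: removes every <a> element whose href is listed in $badUrls.
function removeBadLinks(string $html, array $badUrls): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so it has a single root element; @ silences warnings on sloppy markup.
    // The flags stop libxml from adding <html>/<body> and a doctype.
    @$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Copy the live node list first: removing nodes while iterating it would skip elements.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $link) {
        if (in_array($link->getAttribute('href'), $badUrls, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = 'Keep <a href="http://good.example/">this</a>, drop <a href="http://bad.example/">that</a>.';
echo removeBadLinks($content, ['http://bad.example/']);
```

The comparison here is an exact string match on href; matching by host or prefix would need parse_url() or a similar check in place of in_array().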

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
