PHP RegEx: Replace all relative paths in an HTML string with absolute paths

I'm working on a project where I need to scrape some content from the same site, but from a subfolder, and store it. I know it's not ideal, but sadly it's the best approach for the client.
I need to change all references from relative to absolute URLs.
All the references (images, CSS, JS) are relative, in both of these forms:
"../../imgs/"
"/js/"
... which means they don't work in my sub-folder. I need a function that matches the regex on these references and replaces the path.
When I try this:
function getRelativeContent($url) {
    $page = file_get_contents($url);
    //url needs trailing /
    if (substr($url, -1, 1) != "/")
        $url .= "/";
    $page = preg_replace('/src="(\/)?([\w_\-\/\.\?&=#%#]*)"/i','src="' . $url . '$2"', $page);
    $page = preg_replace('/href="(\/)?([\w_\-\/\.\?&=#%#]*)"/i','href="' . $url . '$2"', $page);
    return $page;
}
echo getRelativeContent($url);
Then these URLs don't work:
<link href="/cassette.axd/stylesheet/fdbdaa59cb97b35f06f65fd41cb60caa3975cc0f/forbrug-rwd_(max-width 767px)" type="text/css" rel="stylesheet" media="(max-width: 767px)">
<img src="https://www.domain.dk/~/media/2561BD6AFBD64402877E4ACED01F97FD.ashx" />
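The character class in the pattern above does not include parentheses or spaces, so the first URL is never matched. A minimal sketch of one way around this, not a definitive fix: match any attribute value up to the closing quote and explicitly skip values that are already absolute. The function name absolutizeAttributes is an assumption made for illustration; "../" segments are left in place for the browser to resolve.

```php
<?php
// Hypothetical helper: rewrite relative src/href values against $pageUrl.
function absolutizeAttributes(string $html, string $pageUrl): string
{
    $parts  = parse_url($pageUrl);
    // Scheme and host only, e.g. "https://www.domain.dk".
    $origin = $parts['scheme'] . '://' . $parts['host'];
    // Directory of the page, always ending in "/".
    $path   = $parts['path'] ?? '/';
    $dir    = $origin . substr($path, 0, strrpos($path, '/') + 1);

    return preg_replace_callback(
        // Any double-quoted src/href value; the lookbehind skips data-src etc.
        '/(?<![\w-])(src|href)="([^"]*)"/i',
        function (array $m) use ($origin, $dir): string {
            $value = $m[2];
            // Leave empty values, URLs with a scheme (http:, mailto:, data:),
            // protocol-relative URLs, and fragments untouched.
            if ($value === '' || preg_match('~^(?:[a-z][a-z0-9+.\-]*:|//|#)~i', $value)) {
                return $m[0];
            }
            // Root-relative values hang off the origin, others off the directory.
            $absolute = $value[0] === '/' ? $origin . $value : $dir . $value;
            return $m[1] . '="' . $absolute . '"';
        },
        $html
    );
}
```

Called as absolutizeAttributes(file_get_contents($url), $url), this rewrites "/js/app.js" against the host and "../../imgs/a.png" against the page's directory, while absolute URLs pass through unchanged.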


Related

How to remove the part of the URL other than the shop URL from the currently fetched web page URL

In my PrestaShop shop I fetch the current web page URL using the PHP code below.
<?php
$url="http://".$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
echo $url;
?>
The URL it currently echoes is http://shoppingworld.com/int/Mens-Tshirts/Fashion.html
My shop URL is http://shoppingworld.com/int/
I need to remove the part of the URL that comes after the shop URL.
Try this:
<?php
$url_path="http://www.shoppingworld.com/int/Mens-Tshirts/Fashion.html";
$a = parse_url($url_path, PHP_URL_SCHEME);
$b = parse_url($url_path, PHP_URL_HOST);
$url_name_parse=explode('/',$url_path);
$url_name=$url_name_parse[3];
echo ($a . "://" . $b .'/' .$url_name.'/'); ?>
Program Output
http://www.shoppingworld.com/int/
You can't directly get that partial url.
Try this,
$url = 'http://shoppingworld.com/int/Mens-Tshirts/Fashion.html';
$parsed = parse_url($url);
$path_array = explode('/', $parsed['path']);
echo $parsed['scheme'] . '://' . $parsed['host'] .'/'. $path_array[1] . '/';
$arr_url = parse_url($url);
$host = $arr_url['host'];
$service_uri = $arr_url['path'];
Read more about parse_url() in the PHP manual.
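The answers above can be combined into one small function. This is only a sketch: the name baseUrl and the $depth parameter are assumptions for illustration, and it presumes the shop always lives in the first path segment.

```php
<?php
// Hypothetical helper: keep the scheme, host, and first $depth path segments.
function baseUrl(string $url, int $depth = 1): string
{
    $parts    = parse_url($url);
    // Split the path into non-empty segments: "/int/Mens-Tshirts/Fashion.html"
    // becomes ["int", "Mens-Tshirts", "Fashion.html"].
    $segments = array_values(array_filter(explode('/', $parts['path'] ?? ''), 'strlen'));
    $kept     = array_slice($segments, 0, $depth);

    $base = $parts['scheme'] . '://' . $parts['host'] . '/';
    return $kept ? $base . implode('/', $kept) . '/' : $base;
}

echo baseUrl('http://shoppingworld.com/int/Mens-Tshirts/Fashion.html');
// prints http://shoppingworld.com/int/
```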

Remove subdomain from URL string, keeping the rest of the URL (for replacing the subdomain)

I have a location menu that has to change the location. The good thing is that every URL exists in every city, and every city is a subdomain:
city1.domain.com.uk/index.php?page=category/238/12
city2.domain.com.uk/index.php?page=category/238/12
This is what I'm trying: I want to break up the URL to remove the subdomain, so I can replace it for each item in the menu.
I want to get index.php?page=category/238/12
<?PHP
$protocol = strpos(strtolower($_SERVER['SERVER_PROTOCOL']),'https')=== FALSE ? 'http' : 'https';
$host = $_SERVER['HTTP_HOST'];
$script = $_SERVER['SCRIPT_NAME'];
$params = $_SERVER['QUERY_STRING'];
$url = $protocol . '://' . $host . $script . '?' . $params;
// break it up using the "."
$urlb = explode('.',$url);
// get the domain
$dns = $urlb[count($urlb)-1];
// get the extension
$ext = $urlb[count($urlb)+0];
//put it back together
$fullDomain = $dns.'.'.$ext;
echo $fullDomain;
?>
But I get this: php?page=category/238/12
I also haven't thought of a solution for another issue I will be facing.
If I'm looking at a product, the URL changes to something like
city2.domain.com.uk/index.php?page=item/preview/25
But the products don't exist in every city, so my user will get a 404.
=(
How can I add a condition to the process so that if page=item/preview/25, I replace it with
page=index/index
You can split the domain as:
$url = "city1.domain.com.uk/index.php?page=category/238/12";
list($subDomain, $params) = explode('?', $url);
list($domain, $sub) = explode('/', $subDomain);
$newUrl = $sub . "?" . $params;
echo $newUrl;
Cheers!
How about this:
<?php
// SERVER_PROTOCOL is e.g. "HTTP/1.1" and never contains "https"; check the HTTPS flag instead
$protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
$host = $_SERVER['HTTP_HOST'];
$script = $_SERVER['SCRIPT_NAME'];
$params = $_SERVER['QUERY_STRING'];
$url = $protocol . '://' . $host . $script . '?' . $params;
$url = parse_url($url);
// everything after the first dot of the host, i.e. the host without the subdomain
$dns = substr($url['host'], stripos($url['host'], '.') + 1);
$query = isset($url['query']) ? $url['query'] : '';
$fullDomain = $url['scheme'] . "://" . $dns . $url['path'] . "?" . $query;
// the text between "=" and the first "/" of the query, e.g. "item" in page=item/preview/25
if (substr($query, stripos($query, '=') + 1, stripos($query, '/') - stripos($query, '=') - 1) == 'item') {
    echo "redirect";
} else {
    echo "don't redirect";
}
echo "<br>" . $fullDomain;
?>
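Both parts of the question (swapping the subdomain and falling back for item pages) can be handled in one function. The following is a sketch under assumptions: the name cityUrl is made up for illustration, the input URL is expected to include a scheme, and the query values are expected to be plain strings.

```php
<?php
// Hypothetical helper: point $url at another city's subdomain. Item pages
// fall back to the index page, since products do not exist in every city.
function cityUrl(string $url, string $city): string
{
    $parts  = parse_url($url);
    // Drop the first host label ("city1") and keep the rest ("domain.com.uk").
    $domain = substr($parts['host'], strpos($parts['host'], '.') + 1);

    parse_str($parts['query'] ?? '', $query);
    if (strpos($query['page'] ?? '', 'item/') === 0) {
        $query['page'] = 'index/index';
    }
    // Build the query by hand so the slashes in "page" are not percent-encoded.
    $pairs = array();
    foreach ($query as $key => $value) {
        $pairs[] = $key . '=' . $value;
    }
    $queryString = $pairs ? '?' . implode('&', $pairs) : '';

    return $parts['scheme'] . '://' . $city . '.' . $domain
        . ($parts['path'] ?? '/') . $queryString;
}

echo cityUrl('http://city2.domain.com.uk/index.php?page=item/preview/25', 'city1');
// prints http://city1.domain.com.uk/index.php?page=index/index
```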

Get Request URI PHP

I am stuck getting the URI in my WordPress application, and my lack of PHP knowledge is slowing my progress.
I have this URL
http://abc.com/my-blog/abc/cde
I need to create a URL something like
http://abc.com/my-blog/added-value/abc/cde
where http://abc.com/my-blog is the URL of my WordPress blog, which I can easily get using the following method
home_url()
I can use PHP's $_SERVER["REQUEST_URI"] to get the request URI, which comes out as
/my-blog/abc/cde
but then I have no direct way to add the value as required.
Is there an easy way in PHP or WordPress to get the following information
Home URL
Rest part of the URL
so that in the end I can do the following?
Home-URL + custom-value + Rest part of the URL
My point of confusion
On my local setup, $_SERVER["REQUEST_URI"] gives me /my-blog/abc/cde, where /my-blog is the installation directory of WordPress, and I can easily skip the first level.
On the production server it's not the same, as /my-blog will not be part of the URL.
Very briefly:
<?php
$url = "http://abc.com/my-blog/abc/cde";
$parts = parse_url($url);
$path = explode("/", $parts["path"]);
array_splice($path, 2, 0, array("added-part")); //This line does the magic!
echo $parts["scheme"] . "://" . $parts["host"] . implode("/",$path);
OK, so if $addition is the bit you want in the middle and $uri is what you obtain from $_SERVER["REQUEST_URI"], then this:
$addition = "MIDDLEBIT/";
$uri = "/my-blog/abc/cde";
$parts = explode("/",$uri);
$homeurl = $parts[1]."/";
$resturl = ""; // initialise before appending in the loop
for($i=2;$i<count($parts);$i++){
    $resturl .= $parts[$i]."/";
}
echo $homeurl . $addition . $resturl;
Should print:
my-blog/MIDDLEBIT/abc/cde/
You might want to use explode or some other string function. Some examples below:
$urlBits = explode("/", trim($_SERVER["REQUEST_URI"], "/"));
//blog address
$blogAddress = $urlBits[0];
//abc
$secondPartOfUri = $urlBits[1];
//cde
$thirdPartOfUri = $urlBits[2];
//all of uri except your blog address
$uri = str_replace("/my-blog/", "", $_SERVER["REQUEST_URI"]);
This is a way to get the current URL in PHP:
public static function getCurrentUrl($withQuery = true)
{
    // SERVER_PROTOCOL is e.g. "HTTP/1.1", so test the HTTPS flag for the scheme
    $protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
    $uri = $protocol . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    return $withQuery ? $uri : str_replace('?' . $_SERVER['QUERY_STRING'], '', $uri);
}
You can store the home URL in a variable in WordPress using get_home_url():
$home_url = get_home_url();
$custom_value = '/SOME_VALUE';
$uri = $_SERVER['REQUEST_URI'];
$new_url = $home_url . $custom_value . $uri;
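The local-versus-production difference raised in the question can be handled by stripping the home URL's own path from the request URI before joining the pieces. A minimal sketch in plain PHP: insertAfterHome is a hypothetical name, $homeUrl stands in for whatever home_url() returns, and $requestUri stands in for $_SERVER['REQUEST_URI'].

```php
<?php
// Hypothetical helper: insert $segment between the home URL and the rest of
// the request path. $homeUrl stands in for the value WordPress's home_url()
// returns; $requestUri stands in for $_SERVER['REQUEST_URI'].
function insertAfterHome(string $homeUrl, string $requestUri, string $segment): string
{
    $homeUrl  = rtrim($homeUrl, '/');
    // Path part of the home URL: "/my-blog" locally, "" at the domain root.
    $homePath = rtrim((string) parse_url($homeUrl, PHP_URL_PATH), '/');

    // Strip the install directory from the front of the request URI, if present.
    if ($homePath !== '' && strpos($requestUri, $homePath . '/') === 0) {
        $rest = substr($requestUri, strlen($homePath));
    } else {
        $rest = $requestUri;
    }

    return $homeUrl . '/' . trim($segment, '/') . '/' . ltrim($rest, '/');
}

echo insertAfterHome('http://abc.com/my-blog', '/my-blog/abc/cde', 'added-value');
// prints http://abc.com/my-blog/added-value/abc/cde
```

With a home URL at the domain root ('http://abc.com' and '/abc/cde'), the same call yields http://abc.com/added-value/abc/cde, so one code path covers both setups.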

Scrape FULL image src with PHP

I am trying to scrape img src's with php, I can get the src fine, but if the src does not include the full path then I can't really reuse it. Is there a way to grab the full path of the image using php (browsers can get it if you use the right click menu).
I.e., how do I get a full path including the domain in one of the following two examples?
src="../foo/logo.png"
src="/images/logo.png"
Thanks,
Allan
You don't need a regex... just some patience. I don't really want to write the code for you, but just check if the src starts with http://, and if not, you have like 3 different cases.
If it begins with a / then prepend http://domain.com
If it begins with .. you'll have to split the full URL and hack off pieces until the src starts with a /
Else (it begins with a letter), take the full URL, strip it down to the last slash, then append the src URL.
Or.... be lazy and steal this script
$url = "http://www.goat.com/money/dave.html";
$rel = "../images/cheese.jpg";
$com = InternetCombineUrl($url, $rel);
// Returns http://www.goat.com/images/cheese.jpg
function InternetCombineUrl($absolute, $relative) {
    // A relative URL that already has a scheme is absolute: return it as-is.
    $p = parse_url($relative);
    if (!empty($p["scheme"])) return $relative;
    // Split the base URL; the defaults keep every key defined.
    $base = parse_url($absolute) + array("scheme" => "", "user" => "", "pass" => "", "host" => "", "path" => "/");
    if ($relative !== "" && $relative[0] == '/') {
        // Root-relative: the base path is ignored.
        $parts = explode("/", $relative);
    }
    else {
        // Otherwise start from the directory of the base path.
        $dir = substr($base["path"], 0, strrpos($base["path"], "/") + 1);
        $parts = explode("/", $dir . $relative);
    }
    // Resolve "." and ".." with a stack, so consecutive ".." segments work.
    $cparts = array();
    foreach ($parts as $part) {
        if ($part === '' || $part === '.') {
            continue;
        }
        if ($part === '..') {
            array_pop($cparts);
            continue;
        }
        $cparts[] = $part;
    }
    $path = implode("/", $cparts);
    $url = "";
    if ($base["scheme"]) {
        $url = $base["scheme"] . "://";
    }
    if ($base["user"]) {
        $url .= $base["user"];
        if ($base["pass"]) {
            $url .= ":" . $base["pass"];
        }
        $url .= "@";
    }
    if ($base["host"]) {
        $url .= $base["host"] . "/";
    }
    $url .= $path;
    return $url;
}
From http://www.web-max.ca/PHP/misc_24.php
Unless you have the site URL you're starting with (in which case you can prepend it to the value of the src attribute) it seems like all you're left with there is a string.
I'm assuming you don't have access to any additional information of course. If you're parsing HTML, I'd assume you must be able to access an absolute URL to at least the HTML page, but perhaps not.

Some Pictures Don't Load (CSS Problem?)

I am dynamically loading a website via file_get_contents with the following script.
<?php
header('Content-Type: text/html; charset=iso-8859-1');
$url = (substr($_GET['url'], 0, 7) == 'http://') ? $_GET['url'] : "http://{$_GET['url']}";
$base_url = explode('/', $url);
$base_url = (substr($url, 0, 7) == 'http://') ? $base_url[2] : $base_url[0];
if (file_get_contents($url) != false) {
$content = @file_get_contents($url);
// $search = array('#(<a\s*[^>]*href=[\'"]?(?![\'"]?http))#', '|(<img\s*[^>]*src=[\'"]?)|');
// $replace = array('\1proxy2.php?url=', '\1'.$url.'/');
// $new_content = preg_replace($search, $replace, $content);
function prepend_proxy($matches) {
$url = (substr($_GET['url'], 0, 7) == 'http://') ? $_GET['url'] : "http://{$_GET['url']}";
$prepend = $matches[2] ? $matches[2] : $url;
$prepend = 'http://h899310.devhost.se/proxy/proxy2.php?url='. $prepend .'/';
return $matches[1] . $prepend . $matches[3];
}
function imgprepend_proxy($matches2) {
$url = (substr($_GET['url'], 0, 7) == 'http://') ? $_GET['url'] : "http://{$_GET['url']}";
$prepend2 = $matches2[2] ? $matches2[2] : $url;
$prepend2 = $prepend2 .'/';
return $matches2[1] . $prepend2 . $matches2[3];
}
$new_content = preg_replace_callback(
'|(href=[\'"]?)(https?://)?([^\'"\s]+[\'"]?)|i',
'prepend_proxy',
preg_replace_callback(
'|(src=[\'"]?)(https?://)?([^\'"\s]+[\'"]?)|i',
'imgprepend_proxy',
$content
)
);
echo "<base href='http://{$base_url}' />";
echo $new_content;
} else {
echo "Sidan kan inte visas";
}
?>
Now the problem is that some pictures don't show on some websites, for example sites that use CSS links. I think it is a CSS problem.
You can test the script here to see what I mean:
http://h899310.devhost.se/proxy/index.html
How can I fix this?
It would appear that one of your URL replacement methods is adding a slash too many. Visit one of the pages your proxy provides, and you will see several URLs beginning with:
http:///www.msdn.com
Take for example loading msdn.com; the CSS won't load, because when looking at the source code of the proxy'd page, we see the URL to the CSS is (note the three forward slashes):
http://h899310.devhost.se/proxy/proxy2.php?url=http:///i3.msdn.microsoft.com/global/global-bn20090721.css
Viewing the URL directly reveals a warning in your script showing that file_get_contents can't load the URL:
Warning: file_get_contents(http:///i3.msdn.microsoft.com/global/global-bn20090721.css) [function.file-get-contents]: failed to open stream: No error in D:\users\u190790\h899310.devhost.se\Wwwroot\proxy\proxy2.php on line 9
Sidan kan inte visas
Looking briefly at your code, the problem seems to be with $prepend (and $prepend2): the trailing slash is appended even when the matched URL already starts with http://. Append it only in the fallback case:
<?php
$prepend2 = $matches2[2] ? $matches2[2] : $url . '/';
?>
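To see the effect in isolation, here is a stand-alone sketch of the src callback that reuses the question's pattern. The name prefixSrc and the $base parameter are assumptions for illustration, and protocol-relative URLs (starting with //) are not handled.

```php
<?php
// Hypothetical stand-alone version of the src callback: $base plays the role
// of the proxied page's URL ($url in the question's script).
function prefixSrc(array $matches, string $base): string
{
    // $matches[2] is "http://" or "https://" when the URL is already absolute,
    // in which case the whole match is returned unchanged.
    if ($matches[2] !== '') {
        return $matches[0];
    }
    // Relative URL: join base and path with exactly one slash.
    return $matches[1] . rtrim($base, '/') . '/' . ltrim($matches[3], '/');
}

$html = '<img src="http://i3.msdn.microsoft.com/a.png"><img src="img/b.png">';
$out = preg_replace_callback(
    '|(src=[\'"]?)(https?://)?([^\'"\s]+[\'"]?)|i',
    function (array $m): string {
        return prefixSrc($m, 'http://www.msdn.com');
    },
    $html
);
echo $out;
// prints <img src="http://i3.msdn.microsoft.com/a.png"><img src="http://www.msdn.com/img/b.png">
```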
header('Content-Type: text/html; charset=iso-8859-1');
This makes your proxy send every response as text/html, so CSS and images fetched through the proxy won't load (or at least won't display correctly).
