Getting Absolute Path of External Web Page Images - php

I am working on a bookmarklet that fetches all the photos from an external page using an HTML DOM parser (as suggested in an earlier SO answer). The photos are fetched correctly and displayed in my bookmarklet popup, but I am having a problem with photos that use a relative path.
For example, say the external page is http://www.example.com/dir/index.php:
photo source 1: img source='hostname/photos/photo.jpg' - works, because it is absolute
photo source 2: img source='/photos/photo.jpg' - does not work, because it is not absolute
I tried deriving the directory from the current URL with dirname or pathinfo, but that gives different results for host/dir/ (host is returned as the parent directory) and host/dir/index.php (host/dir is returned as the parent directory, which is correct).
How can I get these relative photos?

FIXED (added support for query-string only image paths)
function make_absolute_path ($baseUrl, $relativePath) {
// Parse URLs, return FALSE on failure
if ((!$baseParts = parse_url($baseUrl)) || (!$pathParts = parse_url($relativePath))) {
return FALSE;
}
// Work-around for pre- 5.4.7 bug in parse_url() for relative protocols
if (empty($baseParts['host']) && !empty($baseParts['path']) && substr($baseParts['path'], 0, 2) === '//') {
$parts = explode('/', ltrim($baseParts['path'], '/'));
$baseParts['host'] = array_shift($parts);
$baseParts['path'] = '/'.implode('/', $parts);
}
if (empty($pathParts['host']) && !empty($pathParts['path']) && substr($pathParts['path'], 0, 2) === '//') {
$parts = explode('/', ltrim($pathParts['path'], '/'));
$pathParts['host'] = array_shift($parts);
$pathParts['path'] = '/'.implode('/', $parts);
}
// Relative path has a host component, just return it
if (!empty($pathParts['host'])) {
return $relativePath;
}
// Normalise base URL (fill in missing info)
// If base URL doesn't have a host component return error
if (empty($baseParts['host'])) {
return FALSE;
}
if (empty($baseParts['path'])) {
$baseParts['path'] = '/';
}
if (empty($baseParts['scheme'])) {
$baseParts['scheme'] = 'http';
}
// Start constructing return value
$result = $baseParts['scheme'].'://';
// Add username/password if any
if (!empty($baseParts['user'])) {
$result .= $baseParts['user'];
if (!empty($baseParts['pass'])) {
$result .= ":{$baseParts['pass']}";
}
$result .= '@';
}
// Add host/port
$result .= !empty($baseParts['port']) ? "{$baseParts['host']}:{$baseParts['port']}" : $baseParts['host'];
// Inspect relative path
if ($relativePath[0] === '/') {
// Leading / means from root
$result .= $relativePath;
} else if ($relativePath[0] === '?') {
// Leading ? means query the existing URL
$result .= $baseParts['path'].$relativePath;
} else {
// Get the current working directory
$resultPath = rtrim(substr($baseParts['path'], -1) === '/' ? trim($baseParts['path']) : str_replace('\\', '/', dirname(trim($baseParts['path']))), '/');
// Split the image path into components and loop them
foreach (explode('/', $relativePath) as $pathComponent) {
switch ($pathComponent) {
case '': case '.':
// a single dot means "this directory" and can be skipped
// an empty component (e.g. from a doubled slash) is a mistake on somebody's part, and can also be skipped
break;
case '..':
// a double dot means "up a directory"
$resultPath = rtrim(str_replace('\\', '/', dirname($resultPath)), '/');
break;
default:
// anything else can be added to the path
$resultPath .= "/$pathComponent";
break;
}
}
// Add path to result
$result .= $resultPath;
}
return $result;
}
Tests:
echo make_absolute_path('http://www.example.com/dir/index.php','/photos/photo.jpg')."\n";
// Outputs: http://www.example.com/photos/photo.jpg
echo make_absolute_path('http://www.example.com/dir/index.php','photos/photo.jpg')."\n";
// Outputs: http://www.example.com/dir/photos/photo.jpg
echo make_absolute_path('http://www.example.com/dir/index.php','./photos/photo.jpg')."\n";
// Outputs: http://www.example.com/dir/photos/photo.jpg
echo make_absolute_path('http://www.example.com/dir/index.php','../photos/photo.jpg')."\n";
// Outputs: http://www.example.com/photos/photo.jpg
echo make_absolute_path('http://www.example.com/dir/index.php','http://www.yyy.com/photos/photo.jpg')."\n";
// Outputs: http://www.yyy.com/photos/photo.jpg
echo make_absolute_path('http://www.example.com/dir/index.php','?query=something')."\n";
// Outputs: http://www.example.com/dir/index.php?query=something
I think that should deal correctly with just about everything you're likely to encounter, and should roughly match the logic used by a browser. It should also correct any oddities you might get on Windows with stray backslashes from using dirname().
First argument is the full URL of the page where you found the <img> (or <a> or whatever) and second argument is the contents of the src/href etc attribute.
If anyone finds something that doesn't work (cos I know you'll all be trying to break it :-D), let me know and I'll try and fix it.

'/' should be the base path. Check the first character returned from your dom parser, and if it is a '/' then just prefix it with the domain name.
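The check described above could look roughly like the following. This is a minimal sketch, not a full resolver: prefix_root_relative is a hypothetical helper name, and it only handles paths that start with a single '/', returning everything else unchanged.

```php
<?php
// Hypothetical helper: prefix a root-relative path with the page's scheme and host.
function prefix_root_relative($pageUrl, $src)
{
    // Only a single leading slash is root-relative; '//' is protocol-relative.
    if ($src === '' || $src[0] !== '/' || substr($src, 0, 2) === '//') {
        return $src;
    }
    $parts = parse_url($pageUrl);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return $src; // cannot determine the domain, leave the value alone
    }
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (!empty($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    return $origin . $src;
}

echo prefix_root_relative('http://www.example.com/dir/index.php', '/photos/photo.jpg'), "\n";
// Outputs: http://www.example.com/photos/photo.jpg
```

Paths such as photos/photo.jpg or ../photos/photo.jpg still need the directory handling shown in the longer answer.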

Related

PHP: Find images and links with relative path in output and convert them to absolute path

There are a lot of posts on converting relative to absolute paths in PHP. I'm looking for a specific implementation beyond these posts (hopefully). Could anyone please help me with this specific implementation?
I have a PHP variable containing diverse HTML, including hrefs and imgs containing relative urls. Mostly (for example) /en/discover or /img/icons/facebook.png
I want to process this PHP variable in such a way that the values of my hrefs and imgs will be converted to http://mydomain.com/en/discover and http://mydomain.com/img/icons/facebook.png
I believe the question below covers the solution for hrefs. How can we expand this to also consider imgs?
Change a relative URL to absolute URL
Would a regex be in order? Or since we're dealing with a lot of output should we use DOMDocument?
After some further research I've stumbled upon this article from Gerd Riesselmann on how to solve the absence of a base href solution for RSS-feeds. His snippet actually solves my question!
http://www.gerd-riesselmann.net/archives/2005/11/rss-doesnt-know-a-base-url
<?php
function relToAbs($text, $base)
{
if (empty($base))
return $text;
// base url needs trailing /
if (substr($base, -1, 1) != "/")
$base .= "/";
// Replace links
$pattern = "/<a([^>]*) " .
"href=\"[^http|ftp|https|mailto]([^\"]*)\"/";
$replace = "<a\${1} href=\"" . $base . "\${2}\"";
$text = preg_replace($pattern, $replace, $text);
// Replace images
$pattern = "/<img([^>]*) " .
"src=\"[^http|ftp|https]([^\"]*)\"/";
$replace = "<img\${1} src=\"" . $base . "\${2}\"";
$text = preg_replace($pattern, $replace, $text);
// Done
return $text;
}
?>
Thank you Gerd! And thank you shadyyx for pointing me in the direction of base href!
Excellent solution.
However, there is a small typo in the pattern. As written above, it truncates the first character of the href or src. Here are patterns that work as intended:
// Replace links
$pattern = "/<a([^>]*) " .
"href=\"([^http|ftp|https|mailto][^\"]*)\"/";
and
// Replace images
$pattern = "/<img([^>]*) " .
"src=\"([^http|ftp|https][^\"]*)\"/";
The opening parenthesis of the second replacement references are moved. This brings the first character of the href or src which doesn't match http|ftp|https into the replacement references.
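One caveat applies to both sets of patterns: [^http|ftp|https|mailto] is a character class, so it rejects any value whose first character is one of those letters or '|' (href="about.html" would be left alone, for instance). A negative lookahead states the intent more directly. The following is a sketch only, assuming PHP 5.3+ for the closure; relToAbsLookahead is a hypothetical name and not part of Gerd's snippet. It also drops one leading slash from the value so the trailing slash of the base is not doubled.

```php
<?php
// Hypothetical variant of relToAbs() that uses a negative lookahead
// instead of a character class to skip absolute URLs.
function relToAbsLookahead($text, $base)
{
    if (empty($base)) {
        return $text;
    }
    // base url needs exactly one trailing /
    $base = rtrim($base, '/') . '/';
    // (?!...) rejects values that begin with a scheme such as http: or mailto:,
    // with '//' (protocol-relative) or with '#' (fragment only).
    $pattern = '~<(a|img)\b([^>]*?)\s(href|src)="(?![a-z][a-z0-9+.-]*:|//|#)/?([^"]*)"~i';
    return preg_replace_callback($pattern, function ($m) use ($base) {
        return '<' . $m[1] . $m[2] . ' ' . $m[3] . '="' . $base . $m[4] . '"';
    }, $text);
}

$html = '<a href="/en/discover">Discover</a> <img src="/img/icons/facebook.png">';
echo relToAbsLookahead($html, 'http://mydomain.com'), "\n";
// Outputs: <a href="http://mydomain.com/en/discover">Discover</a> <img src="http://mydomain.com/img/icons/facebook.png">
```

Like the original, this treats every relative value as relative to the base URL itself and only understands double-quoted attributes.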
I found that when the href/src value and the base URL became more complex, the accepted answer's solution didn't work for me.
For example:
base url: http://www.journalofadvertisingresearch.com/ArticleCenter/default.asp?ID=86411&Type=Article
href src: /ArticleCenter/LeftMenu.asp?Type=Article&FN=&ID=86411&Vol=&No=&Year=&Any=
incorrectly returned: /ArticleCenter/LeftMenu.asp?Type=Article&FN=&ID=86411&Vol=&No=&Year=&Any=
I found the below function which correctly returns the url. I got this from a comment here: http://php.net/manual/en/function.realpath.php from Isaac Z. Schlueter.
This correctly returned: http://www.journalofadvertisingresearch.com/ArticleCenter/LeftMenu.asp?Type=Article&FN=&ID=86411&Vol=&No=&Year=&Any=
function resolve_href ($base, $href) {
// href="" ==> current url.
if (!$href) {
return $base;
}
// href="http://..." ==> href isn't relative
$rel_parsed = parse_url($href);
if (array_key_exists('scheme', $rel_parsed)) {
return $href;
}
// add an extra character so that, if it ends in a /, we don't lose the last piece.
$base_parsed = parse_url("$base ");
// if it's just server.com and no path, then put a / there.
if (!array_key_exists('path', $base_parsed)) {
$base_parsed = parse_url("$base/ ");
}
// href="/ ==> throw away current path.
if ($href[0] === "/") {
$path = $href;
} else {
$path = dirname($base_parsed['path']) . "/$href";
}
// bla/./bloo ==> bla/bloo
$path = preg_replace('~/\./~', '/', $path);
// resolve /../
// loop through all the parts, popping whenever there's a .., pushing otherwise.
$parts = array();
foreach (
explode('/', preg_replace('~/+~', '/', $path)) as $part
) if ($part === "..") {
array_pop($parts);
} elseif ($part!="") {
$parts[] = $part;
}
return (
(array_key_exists('scheme', $base_parsed)) ?
$base_parsed['scheme'] . '://' . $base_parsed['host'] : ""
) . "/" . implode("/", $parts);
}

PHP Regex to determine relative or absolute path

I'm using cURL to pull the contents of a remote site. I need to check all "href=" attributes and determine whether they're relative or absolute paths, then take the value of the link and rewrite it to something like href="http://www.website.com/index.php?url=[ABSOLUTE_PATH]"
Any help would be greatly appreciated.
Here is one possible solution, if I understood the question correctly:
$prefix = 'http://www.website.com/index.php?url=';
$regex = '~(<a.*?href\s*=\s*")(.*?)(".*?>)~is';
$html = file_get_contents('http://cnn.com');
$html = preg_replace_callback($regex, function($input) use ($prefix) {
$parsed = parse_url($input[2]);
if (is_array($parsed) && sizeof($parsed) == 1 && isset($parsed['path'])) {
return $input[1] . $prefix . $parsed['path'] . $input[3];
}
// Leave every other link untouched; returning nothing would delete it
return $input[0];
}, $html);
echo $html;
A combination of a regex and PHP's parse_url() should help:
// find all links in a page used within href="" or href='' syntax
$links = array();
preg_match_all('/href=(?:(?:"([^"]+)")|(?:\'([^\']+)\'))/i', $page_contents, $links, PREG_SET_ORDER);
// iterate through each match and check if it's "absolute"
$urls = array();
foreach ($links as $match) {
// group 1 holds a double-quoted value, group 2 a single-quoted one
$link = !empty($match[2]) ? $match[2] : $match[1];
$path = $link;
if ((substr($link, 0, 7) == 'http://') || (substr($link, 0, 8) == 'https://')) {
// the current link is an "absolute" URL - parse it to get just the path
$parsed = parse_url($link);
$path = isset($parsed['path']) ? $parsed['path'] : '/';
}
$urls[] = 'http://www.website.com/index.php?url=' . $path;
}
To determine whether the URL is absolute, I simply check whether it begins with http:// or https://; if your URLs use other schemes such as ftp:// or tel:, you might need to handle those as well.
This solution does use a regex to parse HTML, which is often frowned upon. To avoid that, you could switch to DOMDocument, but there's no need for the extra code if there aren't any issues.
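For comparison, the DOMDocument route mentioned above might look like this. It is a sketch under a few assumptions: the DOM extension is enabled, collect_hrefs is a hypothetical helper name, and the input markup is not empty.

```php
<?php
// Hypothetical helper: collect href values with DOMDocument instead of a regex.
function collect_hrefs($html)
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

$page = '<p><a href="/news">News</a> <a href=\'http://example.com/a\'>A</a></p>';
print_r(collect_hrefs($page));
```

The parser copes with single quotes, double quotes and unquoted attributes alike, which is the main thing the regex has to special-case.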

Obtaining deepest file path with PHP

Does anyone have a brilliant idea how to obtain the elements with the deepest path from an array with file paths? If this sounds weird, imagine the following array:
/a/b
/a
/1/2/3/4
/1/2
/1/2/3/5
/a/b/c/d/e
What I want to obtain is:
/1/2/3/4
/1/2/3/5
/a/b/c/d/e
Wondering what the fastest method is without having to iterate over the whole array over and over again. Language is PHP (5.2).
Following your clarifications, here's a function that would do it. It keeps an array of the "deepest paths" found and compares each path against it. The best-case scenario is O(n) (if all paths are subpaths of the largest one) and the worst-case scenario is O(n^2) (if all paths are completely distinct).
Note that continue 2 means "continue on the outer loop".
<?php
function getDeepestPaths($array)
{
$deepestPaths = array();
foreach ($array as $path)
{
$pathLength = strlen($path);
// look for all the paths we consider the longest
// (note how we're using references to the array members)
foreach ($deepestPaths as &$deepPath)
{
$deepPathLength = strlen($deepPath);
// if $path is prefixed by $deepPath, this means that $path is
// deeper, so we replace $deepPath with $path
if (substr($path, 0, $deepPathLength) == $deepPath)
{
$deepPath = $path;
continue 2;
}
// otherwise, if $deepPath is prefixed by $path, this means that
// $path is shallower; so we should stop looking
else if (substr($deepPath, 0, $pathLength) == $path)
{
continue 2;
}
}
// $path matches nothing currently in $deepestPaths, so we should
// add it to the array
$deepestPaths[] = $path;
}
return $deepestPaths;
}
$paths = array('/a/b', '/a', '/1/2/3/4', '/1/2', '/1/2/3/5', '/a/b/c/d/e');
print_r(getDeepestPaths($paths));
?>
If your folder names don't end with slashes, you'll want to do an additional check in the two ifs: that the character next to the prefix in the deeper path is a slash, because otherwise a path like /foo/bar will be seen as a "deeper path" than /foo/b (and will replace it).
if (substr($path, 0, $deepPathLength) == $deepPath && $path[$deepPathLength] == '/')
if (substr($deepPath, 0, $pathLength) == $path && $deepPath[$pathLength] == '/')
$aPathes = array(
'/a/b',
'/a',
'/1/2/3/4',
'/1/2',
'/1/2/3/5',
'/a/b/c/d/e'
);
function getDepth($sPath) {
return substr_count($sPath, '/');
}
$aPathDepths = array_map('getDepth', $aPathes);
arsort($aPathDepths);
foreach ($aPathDepths as $iKey => $iDepth) {
echo $aPathes[$iKey] . "\n";
}
=== UPDATE ===
$aUsed = array();
foreach ($aPathes as $sPath) {
foreach ($aUsed as $iIndex => $sUsed) {
if (substr($sUsed, 0, strlen($sPath)) == $sPath || substr($sPath, 0, strlen($sUsed)) == $sUsed) {
if (strlen($sUsed) < strlen($sPath)) {
array_splice($aUsed, $iIndex, 1);
$aUsed[] = $sPath;
}
continue 2;
}
}
$aUsed[] = $sPath;
}
If you can guarantee that the "spelling" is always the same (i.e. "/a/b c/d" vs. /a/b\ /c/d), then a simple string comparison should be enough to see whether one of the strings is fully contained within the other. If it is, discard the shorter string.
Note that you will need to compare in both directions.
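One way to avoid comparing every pair in both directions is to sort first. This is a different technique from the answers above, offered as a sketch: deepest_paths_sorted is a hypothetical name, and it assumes paths use '/' as the separator. Each path gets a trailing slash so that /foo/b is not mistaken for a prefix of /foo/bar; after a plain string sort, any descendant of a path then follows it directly, so one pass over neighbours is enough.

```php
<?php
// Hypothetical helper: keep only paths that have no descendant in the list.
function deepest_paths_sorted(array $paths)
{
    // A trailing slash makes "/foo/b" and "/foo/bar" distinguishable by prefix.
    $normalised = array();
    foreach ($paths as $path) {
        $normalised[] = rtrim($path, '/') . '/';
    }
    $normalised = array_values(array_unique($normalised));
    sort($normalised, SORT_STRING);

    $deepest = array();
    $count = count($normalised);
    for ($i = 0; $i < $count; $i++) {
        $current = $normalised[$i];
        $next = ($i + 1 < $count) ? $normalised[$i + 1] : null;
        // After sorting, any descendant of $current follows it directly.
        if ($next === null || strncmp($next, $current, strlen($current)) !== 0) {
            $deepest[] = rtrim($current, '/');
        }
    }
    return $deepest;
}

print_r(deepest_paths_sorted(array('/a/b', '/a', '/1/2/3/4', '/1/2', '/1/2/3/5', '/a/b/c/d/e')));
// Contains: /1/2/3/4, /1/2/3/5, /a/b/c/d/e
```

The sort makes this O(n log n) overall, and the result comes back in sorted order rather than input order.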

Using fopen, fwrite multiple times in a foreach loop

I want to save files from an external server into a folder on my server using fopen, fwrite.
First the page from the external site is loaded and scanned for any image links. Then that list is sent from an array to the fwrite function. The files are created, but they aren't valid jpg files; viewing them in the browser, it seems like their path on my server is written to them.
Here is the code:
//read the file
$data = file_get_contents("http://foo.html");
//scan content for jpg links
preg_match_all('/src=("[^"]*.jpg)/i', $data, $result);
//save img function
function save_image($inPath,$outPath)
{
$in= fopen($inPath, "rb");
$out= fopen($outPath, "wb");
while ($chunk = fread($in,8192))
{
fwrite($out, $chunk, 8192);
}
fclose($in);
fclose($out);
}
//output each img link from array
foreach ($result[1] as $imgurl) {
echo "$imgurl<br />\n";
$imgn = (basename ($imgurl));
echo "$imgn<br />\n";
save_image($imgurl, $imgn);
}
The save_image function works if I write out a list:
save_image('http://foo.html', 'foo1.jpg');
save_image('http://foo.html', 'foo1.jpg');
I was hoping that I'd be able to just loop the list from the matches in the array.
Thanks for looking.
There are two problems with your script. Firstly the quote mark is being included in the external image URL. To fix this your regex should be:
/src="([^"]*\.jpg)/i
Secondly, the image URLs are probably not absolute (don't include http:// and the file path). Put this at the start of your foreach to fix that:
$url = 'http://foo.html';
# If the image is absolute.
if(substr($imgurl, 0, 7) == 'http://' || substr($imgurl, 0, 8) == 'https://')
{
$url = '';
}
# If the image URL starts with /, it goes from the website's root.
elseif(substr($imgurl, 0, 1) == '/')
{
# Repeat until only http:// and the domain remain.
while(substr_count($url, '/') != 2)
{
$url = dirname($url);
}
}
# If only http:// and a domain without a trailing slash.
elseif(substr_count($url, '/') == 2)
{
$url .= '/';
}
# If the web page has an extension, find the directory name.
elseif(strrpos($url, '.') > strrpos($url, '/'))
{
# dirname() drops the trailing slash, so add it back.
$url = dirname($url) . '/';
}
$imgurl = $url. $imgurl;
fopen isn't guaranteed to work. You should check the return value of any function that may return something different on error.
fopen() - Returns a file pointer resource on success, or FALSE on error.
In fact all the file functions return false on error.
To figure out where it is failing I would recommend using a debugger, or printing out some information in the save_image function. i.e. What the $inPath and $outPath are, so you can validate they are being passed what you would expect.
The main issue I see is that the regex may not capture the full http:// path. Most sites leave this off and use relative paths. You should code in a check for that and add it in if that is not present.
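The return-value checks recommended above could be added along these lines. This is a sketch, not a drop-in replacement: save_image_checked is a hypothetical name, and logging failures with error_log() is just one possible way to surface them.

```php
<?php
// Hypothetical variant of save_image() that checks every return value.
// Returns true when the whole file was copied, false otherwise.
function save_image_checked($inPath, $outPath)
{
    $in = @fopen($inPath, 'rb');
    if ($in === false) {
        error_log("Cannot open source: $inPath");
        return false;
    }
    $out = @fopen($outPath, 'wb');
    if ($out === false) {
        error_log("Cannot open destination: $outPath");
        fclose($in);
        return false;
    }
    $ok = true;
    while (!feof($in)) {
        $chunk = fread($in, 8192);
        // Stop as soon as a read or a write fails.
        if ($chunk === false || fwrite($out, $chunk) === false) {
            $ok = false;
            break;
        }
    }
    fclose($in);
    fclose($out);
    return $ok;
}
```

Logging $inPath on failure makes it obvious when a relative URL or a stray quote mark has been passed in instead of a full http:// address.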
Your match includes the src bit, so try this instead:
preg_match_all('/(?<=src=")[^"]*\.jpg/i', $data, $result);
And then I think this should work (there is no capture group, so the matches are in $result[0]):
//output each img link from array
foreach ($result[0] as $imgurl) {
echo "$imgurl<br />\n";
$imgn = (basename ($imgurl));
echo "$imgn<br />\n";
save_image($imgurl, $imgn);
}

Ensure a user-defined path is safe in PHP

I am implementing a simple directory listing script in PHP.
I want to ensure that the passed path is safe before opening directory handles and echoing the results willy-nilly.
$f = $_GET["f"];
if(! $f) {
$f = "/";
}
// make sure $f is safe
$farr = explode("/",$f);
$unsafe = false;
foreach($farr as $farre) {
// protect against directory traversal
if(strpos($farre,"..") != false) {
$unsafe = true;
break;
}
if(end($farr) != $farre) {
// make sure no dots are present (except after the last slash in the file path)
if(strpos($farre,".") != false) {
$unsafe = true;
break;
}
}
}
Is this enough to make sure a path sent by the user is safe, or are there other things I should do to protect against attack?
It may be that realpath() is helpful to you.
realpath() expands all symbolic links and resolves references to '/./', '/../' and extra '/' characters in the input path, and returns the canonicalized absolute pathname.
However, this function assumes that the path in question actually exists. It will not canonicalize a non-existent path; in that case FALSE is returned.
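A common way to apply realpath() here is to resolve the requested path under a fixed base directory and then confirm the result is still inside that directory. The sketch below assumes such a base directory exists; resolve_inside is a hypothetical helper name. Separately, note that strpos() returns 0 for a match at the start of the string and 0 != false evaluates to false, so the loop in the question does not flag a component that is exactly "..".

```php
<?php
// Hypothetical helper: resolve $userPath under $baseDir,
// or return false if it does not exist or escapes the base directory.
function resolve_inside($baseDir, $userPath)
{
    $base = realpath($baseDir);
    if ($base === false) {
        return false;
    }
    $target = realpath($base . DIRECTORY_SEPARATOR . $userPath);
    if ($target === false) {
        return false; // path does not exist
    }
    // Compare against the base plus a separator so "/data" does not match "/database".
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    if ($target === $base || strncmp($target, $prefix, strlen($prefix)) === 0) {
        return $target;
    }
    return false;
}
```

In the listing script this would be called with the directory being served and the value of $_GET["f"], opening the directory handle only when the result is not false.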
