How to scrape only the largest images from the DOM?

I am using SimpleHTMLDOM to scrape pages (on servers other than mine).
The basic implementation is
try {
$html = file_get_html(urldecode(trim($url)));
} catch (Exception $e) {
echo $url;
}
foreach ($html->find('img') as $element) {
$src = "";
$src = $element->src;
if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
$images[] = $src;
}
}
This works fine but it returns all images from the page, including small avatars, icons, and button images. Of course I'd like to avoid these.
I then tried to insert within the loop as follows
...
if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
$size = getimagesize($src);
if ($size[0] > 200) {
$images[] = $src;
}
}
...
That works well on a page like http://cnn.com.
But in others it returns numerous errors.
For example
http://www.huffingtonpost.com/2012/05/27/alan-simpson-republicans_n_1549604.html
gives a bunch of errors like
<p>Severity: Warning</p>
<p>Message: getimagesize(/images/snn-logo-comments.png): failed to open stream: No such file or directory
<p>Severity: Warning</p>
<p>Message: getimagesize(/images/close-gray.png): failed to open stream: No such file or directory
which seem to be happening because of relative URLs in some images. The problem here is that this crashes the script and then no images are loaded, with my Ajax box loading forever.
Do you have any ideas how to troubleshoot this?

The problem is that the image URLs are relative to the site root, so your server can't make sense of them to fetch them and find out their size. You could refer to this question to figure out how to get absolute URLs from relative ones.

The approach you tried with image size checking is correct.
However, in order for it to work on all sites, you would need to add some kind of relative URL parsing.
I don't know if there are any libraries or such for it but here's a quick overview on how to do it:
Find the domain part of the URL you're scraping
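Assume any URL starting with / is root-relative, i.e. an absolute path on the same host. You can fetch these simply by concatenating the scheme and domain with the path
Assume any URL that starts with neither a scheme (http://, https://) nor / is relative to the directory of the current page. You may need to resolve any .. markers in the URL to locate the expected path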
Check for the <base> tag in the document: If the document has a <base> tag, it will anchor all relative paths into the path defined in the tag.
You may be able to find a library to convert relative paths and absolute paths into something you can use, but in most cases they will not account for the <base> tag mentioned in the last point.
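The steps above can be sketched roughly as follows. This is a minimal sketch, not a full RFC 3986 resolver: resolveUrl() is a hypothetical helper name, it ignores query strings and trailing slashes, and it does not read the <base> tag itself (if the document has one, its href would be passed as $base instead of the page URL).

```php
<?php
// Hypothetical helper: resolve an image src ($rel) against the page URL ($base).
function resolveUrl(string $base, string $rel): string
{
    // Already absolute (has a scheme): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $rel)) {
        return $rel;
    }
    $p = parse_url($base);
    $scheme = $p['scheme'] ?? 'http';
    $origin = $scheme . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');

    // Protocol-relative, e.g. //cdn.example.com/a.png
    if (strpos($rel, '//') === 0) {
        return $scheme . ':' . $rel;
    }
    if (strpos($rel, '/') === 0) {
        // Root-relative, e.g. /images/a.png
        $path = $rel;
    } else {
        // Document-relative: append to the directory part of the base path.
        $basePath = $p['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $rel;
    }
    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    return $origin . '/' . implode('/', $out);
}
```

In the scraping loop, the src would be passed through this helper before calling getimagesize(), so the size check always receives a full URL.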

Try something like this assuming a url of http://somedomain.com...
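$parts = explode('/', $url);
// Keep the scheme so the result is a fetchable URL, e.g. http://somedomain.com
$domain = $parts[0] . '//' . $parts[2];
// ... snip ...
if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    // Prefix root-relative paths with the domain BEFORE fetching the image
    if (strpos($src, '/') === 0) {
        $src = $domain . $src;
    }
    // Suppress the warning and check for failure instead
    $size = @getimagesize($src);
    if ($size !== false && $size[0] > 200) {
        $images[] = $src;
    }
}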
This will help some, but it isn't foolproof. I can't think of many domains that use ../../ relative paths to images, but I'm sure some do. You could also test whether the image's src attribute contains the domain at all and prepend the domain when it doesn't, but there's no promise that will work every time either. There may be a better way: perhaps have a default method and load a config with predefined domain "fixes" for troublesome domains.

Related

Set img src without knowing extension

So I have a few images in the server (public_html/img/profile_pictures/).
This is how I currently set the image:
echo "<img src='img/profile_pictures/main_photo.png'/>";
The main_photo can change each day, but if it changes to main_photo.jpg instead, it won't show (because the extension, .png, is hardcoded on that line). Is it possible to display the photo without knowing the extension of the image file?

If you want a PHP code, then try this. This code will look for main_photo.* inside your folder and automatically set the extension upon finding one.
Remember to set the path properly
<?php
$yourPhotoPath = "img/profile_pictures/";
foreach (glob($yourPhotoPath.'main_photo.*') as $filename) {
    // basename() already includes whatever extension glob() found
    echo "<img src='".$yourPhotoPath.basename($filename)."'/>";
}
?>

If a photo isn't loaded, its width and height are null.
Although I would advise you to write a class that checks and loads images, I get the feeling you want a simple solution. So, given the premise that the photo is either
<img src='img/profile_pictures/main_photo.png'/>
or
<img src='img/profile_pictures/main_photo.jpg'/>
and that neither this path nor this filename ever changes and in the folder is only one picture,
you could simply echo both.
The img of the one that is empty will not be shown.
A better way would be to write a class that loads your photo and checks whether the photo is really there, like
$path = 'img/profile_pictures/main_photo.png';
if(!file_exists($path))
{
//use the jpg path
$path = 'img/profile_pictures/main_photo.jpg';
}
You can of course just inline this if statement, but it's bad practice to mix business logic and formatting logic, so I advise you to write a class for it.
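As a rough illustration of keeping that check out of the markup, here is a minimal sketch using a plain function rather than a full class. findPhoto() and its default extension list are hypothetical names chosen for this example, not part of the original answer.

```php
<?php
// Hypothetical helper: return the first existing "$name.<ext>" in $dir, or null.
function findPhoto(string $dir, string $name, array $extensions = ['png', 'jpg', 'jpeg', 'gif']): ?string
{
    foreach ($extensions as $ext) {
        $candidate = rtrim($dir, '/') . '/' . $name . '.' . $ext;
        if (is_file($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Usage: the template only deals with the result, not with the lookup.
$photo = findPhoto('img/profile_pictures', 'main_photo');
if ($photo !== null) {
    echo "<img src='" . htmlspecialchars($photo, ENT_QUOTES) . "'/>";
}
```

The order of the extension list decides which file wins if more than one main_photo.* exists.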

What is the best way to get the parameters in PHP?

Let me clarify my question completely; apologies to everybody.
I have code from a website that is no longer online. The HTML is in pages with a .php extension, in a virtual host folder on my PC using WampServer: C:\wamp\1VH\PRU1. When the site was online, it had a folder containing a file called image.php. This file was called from other pages of the site like this (a snippet from C:\wamp\1VH\PRU1\example.php):
"<div><img src="https://www.example.com/img/image.php?f=images&folder=foods&type=salads&desc=green&dim=50&id=23" alt="green salad 23"></div>"
And the images were displayed correctly.
Now that I have this project locally and do NOT have the code of that image.php file, I must write it myself so that the images are displayed the same way as when the site was online.
When the site was online and I opened an image returned by image.php, the URL was, following the example, https://example.com/images/foods/salads/green_50/23.png.
Since the site is now local only and I have not written image.php, because I'm not sure how to write it, the images are obviously not displayed.
In C:\wamp\1VH\PRU1\example.php, the code was changed by replacing "https://www.example.com/img/image.php?" with the local path "img/image.php?".
The same folder also contains an "img" folder (where image.php must be placed) and an "images" folder, which contains /foods/salads/green_50/23.png, 24.png, 25.png, and so on.
So I have exactly the same folder structure as the online site, and I changed the only code I could, for example using jQuery to replace "https://www.example.com/img/image.php?" with "img/image.php?", but what I cannot do is replace all the code after image.php to obtain an image file.
So I think the easiest way to get the images working normally is to create the image.php file that is missing from my virtual host.
I'd like to know how to read the parameters and return the correct image in the image.php file.
The image in the DIV example must be C:/wamp/1VH/PRU1/images/foods/salads/green_50/23.png
I have the correct folders and images on my PC; I only need to write the image.php file.
Note that the parameters are separated by "&", and I must join the values of "desc=green&dim=50&" so that the result is green_50 (a folder on my PC).
TVM.

You probably want something like this.
image.php
$id = intval($_GET['id']);
echo '<div><img src="images/foods/salads/green_50/'.$id.'.png" alt="green salad '.$id.'"></div>';
Then you would call this page
www.example.com/image.php?id=23
So you can see here that we have id=23 in the query part of the URL, and we access it in PHP using $_GET['id']. Pretty simple. In this case it equals 23; if it were id=52, it would be that number instead.
Now the intval part is very important. For security reasons you should never put user input directly into file paths. I won't get into the details of directory traversal attacks, but if you just allow anything in there, that's what you would be vulnerable to. It's often overlooked, so you wouldn't be the first.
https://en.wikipedia.org/wiki/Directory_traversal_attack
Now granted, the server should have user permissions set up properly, but I say why gamble when we can be safe with 1 line of code.
This should get you started. For the rest of them I would set up a whitelist like this:
For
folder=foods
You would make an array with the permissible values,
$allowedFolders = [
'foods',
'clothes',
'kids'
];
etc...
Then you would check it like this
///set a default
$folder = '';
if(!empty($_GET['folder'])){
if(in_array($_GET['folder'], $allowedFolders)){
$folder = $_GET['folder'].'/';
}else{
throw new Exception('Invalid value for "folder"');
}
}
etc...
Then at the end you would stitch all the "cleaned" values together. As I said before a lot of people simply neglect this and just put the stuff right in the path. But, it's not the right way to do it.
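To make the final "stitching" step concrete, here is a minimal sketch that validates each parameter and then joins the cleaned values into a relative path. The whitelist contents and the name buildImagePath() are assumptions for illustration; the real allowed values depend on the folders that exist on disk.

```php
<?php
// Hypothetical whitelists, one per text parameter.
const ALLOWED = [
    'f'      => ['images'],
    'folder' => ['foods', 'clothes', 'kids'],
    'type'   => ['salads'],
    'desc'   => ['green'],
];

// Build a relative image path from query parameters (normally $_GET).
function buildImagePath(array $query): string
{
    $parts = [];
    foreach (ALLOWED as $key => $allowed) {
        $value = $query[$key] ?? '';
        if (!in_array($value, $allowed, true)) {
            throw new InvalidArgumentException("Invalid value for \"$key\"");
        }
        $parts[$key] = $value;
    }
    // Numeric parameters are cast to int rather than whitelisted.
    $dim = (int) ($query['dim'] ?? 0);
    $id  = (int) ($query['id'] ?? 0);

    // desc and dim are joined with "_" to form the folder name, e.g. green_50
    return sprintf('%s/%s/%s/%s_%d/%d.png',
        $parts['f'], $parts['folder'], $parts['type'], $parts['desc'], $dim, $id);
}
```

Because only whitelisted strings and integers reach the path, a value such as ../../etc is rejected before it can be used.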
Anyway hope that helps.

You essentially just need to parse the $_GET parameters, check that the file exists and is a real image, and then serve it by setting the appropriate Content-Type header and outputting the file's contents.
This should do the trick:
<?php
// define expected GET parameters
$params = ['f', 'folder', 'type', 'desc', 'dim', 'id'];
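// loop over parameters in order to build path: <project>/images/foods/salads/green_50/23.png
// assumption: image.php lives in img/, next to the images/ folder, so start one level up
// (starting from null would yield "/images/...", a path at the filesystem root)
$path = __DIR__ . '/..';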
foreach ($params as $key => $param) {
if (isset($_GET[$param])) {
$path .= ($param == 'dim' ? '_' : '/').basename($_GET[$param]);
unset($params[$key]);
}
}
$path .= '.png';
// check all params were passed
if (!empty($params)) {
die('Invalid request');
}
// check file exists
if (!file_exists($path)) {
die('File does not exist');
}
// check file is image
if (!getimagesize($path)) {
die('Invalid image');
}
// all good serve file
header("Content-Type: image/png");
header('Content-Length: '.filesize($path));
readfile($path);
https://3v4l.org/tTALQ

use $_GET[];
<?php
$yourParam = $_GET['param_name'];
?>

I can obtain the values of the parameters in the image.php file this way:
<?php
$f = $_GET['f'];
$folder = $_GET['folder'];
$type = $_GET['type'];
$desc = $_GET['desc'];
$dim = $_GET['dim'];
$id = $_GET['id'];
?>
But what must I do so that the image
C:/wamp/1VH/PRU1/images/foods/salads/green_50/23.png
is displayed correctly in the DIV through the IMG SRC attribute?

Just another php image exists issue

All of this stuff is for example (names aren't actual).
Everything is also located on localhost:8080 (USBWebserver 8.5)
Directory Structure:
(Files located on localhost:8080/[project_name])
/ajax
/ajax_file.php
/img
/250x250
/[image_name].jpg
Code (From ajax_file.php):
$url = 'img/250x250/'.$image_name.'.jpg';
$url = file_exists($url);
This will return false.
I've tried an img_exists($url) function that used cURL, but it did not work.
I've also tried:
$url = 'img/250x250/'.$image_name.'.jpg';
$image_check = getimagesize($url);
if (!is_array($image_check))
{
$url = 'img/default_image.png';
}
but this returns a warning for getimagesize() saying no file or directory exists.
When I put $url = 'img/250x250/'.$image_name.'.jpg' into <img src="$url" /> the image shows up...but if the image does not exist then it comes up with a broken image...
How come anything I try to do fails in some way?
I want a default image to show up when the image is broken :/
EDIT
$url = 'img/products/250x250/'.$image_name.'.jpg';
$url = var_dump(file_exists($url));
Returns bool(false)
$url = '../img/products/250x250/'.$image_name.'.jpg';
$url = var_dump(file_exists($url));
Returns bool(false)

It appears as if you need to branch out of the ajax folder before accessing img folder?
Try:
$url = '../img/250x250/'.$image_name.'.jpg';
@Alex Lunix
My guess is that he put the img tag inside of the actual php page, not the ajax script.

If you're in /ajax/ajax_file.php and you look for 'img/250x250/'.$image_name.'.jpg' it will be looking for /ajax/img/250x250/'.$image_name.'.jpg. Instead you should be using
$url = '../img/250x250/'.$image_name.'.jpg';
Although I'm not sure why it shows up in image tags, my guess is you're getting lucky and your browser is fixing the url.
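One way to avoid the mismatch is to keep the filesystem path (used by file_exists(), resolved on the server) separate from the web path (used in the img tag, resolved by the browser against the page URL). The following is a minimal sketch under the directory layout from the question; imageOrDefault() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the web path of the image if the file exists
// under $projectRoot, otherwise the web path of the default image.
function imageOrDefault(string $projectRoot, string $imageName): string
{
    // basename() strips any directory parts from the supplied name.
    $relative = 'img/250x250/' . basename($imageName) . '.jpg';
    return is_file($projectRoot . '/' . $relative)
        ? $relative
        : 'img/default_image.png';
}

// Inside ajax/ajax_file.php the project root is one level above __DIR__:
// $url = imageOrDefault(dirname(__DIR__), $image_name);
```

The returned value is still relative to the project root, which is what the page that renders the img tag expects.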

TCPDF cant image because it is using a wrong directory path

My images appear in the PDF document on my localhost, but on the production site I get the error "TCPDF ERROR: [Image] Unable to get image". I am using an HTML img tag to include the images, and the src is the directory path to the image, not a URL. I found out that TCPDF prepends the path to my www folder to the path I give it, like this:
path to picture I give to TCPDF: home/inc_dir/img/pic.jpg
TCPDF looks for it here: home/www/home/inc_dir/pic.jpg
Can someone please help me find out why TCPDF is concatenating the directories?

You can also change only the image path instead of the main path. Use:
define('K_PATH_IMAGES', '/path/to/images/');
require_once('tcpdf.php');
This won't break fonts/ and other tcpdf paths.

TCPDF is using $_SERVER['DOCUMENT_ROOT'] as a root directory of all your images, and builds their absolute paths in relation to it. You can change it either in $_SERVER or with this PHP constant: K_PATH_MAIN:
define('K_PATH_MAIN', '/path/to/my-images/');
require_once 'tcpdf.php';

I use image data instead of paths. It can be passed to TCPDF using an @ in the image's src attribute, like so:
<img src="@<?php echo base64_encode(file_get_contents('/path/to/image.png')) ?>" />
An img tag in HTML takes a Base64-encoded string, unlike the Image() function, which takes unencoded data.
I don't know if this is even documented; I found it by reading the code (tcpdf.php, line 18824 pp):
if ($imgsrc[0] === '@') {
// data stream
$imgsrc = '@'.base64_decode(substr($imgsrc, 1));
$type = '';
}

I had the same problem, but it is now resolved.
I changed the code of TCPDF.php as follows:
Old Code
if ($tag['attribute']['src'][0] == '/') {
$tag['attribute']['src'] = $_SERVER['DOCUMENT_ROOT'].$tag['attribute']['src'];
}
$tag['attribute']['src'] = urldecode($tag['attribute']['src']);
$tag['attribute']['src'] = str_replace(K_PATH_URL, K_PATH_MAIN, $tag['attribute']['src']);
New Code
if ($tag['attribute']['src'][0] == '/') {
$tag['attribute']['src'] = $_SERVER['DOCUMENT_ROOT'].$tag['attribute']['src'];
$tag['attribute']['src'] = urldecode($tag['attribute']['src']);
$tag['attribute']['src'] = str_replace(K_PATH_URL, K_PATH_MAIN, $tag['attribute']['src']);
}
Please try this.

Ideal way to change all img links to point to cookieless domain in PHP

I have a PHP site using MVC, which builds HTML dynamically for most requests.
I'm updating my site to host images/static content on a cookieless domain.
Currently the images/css are written out as links to relative URLs.
The best I can think of for now is to change all html that writes out <img> tags and css links to use a PHP function which inserts an absolute URL with the cookieless domain instead of the relative URL. However, this involves a lot of change to code and there is potential to miss a few tags/links.
Any suggestions on a better way to handle this?

Use the HTML <base> tag? I'm not sure how that would work on subpages if you don't have any static part of the HTML.

You COULD (if you were lazy) do something like this:
At the start of the request (The top of the first file):
ob_start('parseImages');
Then declare the function:
function parseImages($data, $status) {
    static $body = '';
    // Collect every chunk; $status is a bitmask of PHP_OUTPUT_HANDLER_* flags,
    // so test the END bit instead of comparing for equality
    $body .= $data;
    if (!($status & PHP_OUTPUT_HANDLER_END)) {
        // not the final chunk yet: emit nothing
        return '';
    }
    $dom = new DOMDocument();
    $dom->loadHTML($body);
    $imgs = $dom->getElementsByTagName('img');
    foreach ($imgs as $img) {
        $src = (string) $img->getAttribute('src');
        if (substr($src, 0, 4) != 'http') {
            //internal link
            $src = 'http://my.cookieless.com/' . ltrim($src, '/');
            $img->setAttribute('src', $src);
        }
    }
    return $dom->saveHTML();
}

The sed solution:
Use sed on all you html files. You can loop through all files with PHP (or a script function). glob is a great function for this.
<?php
// this sed assumes you have
// <img src="imgs/example.png" />
// and want to change it to
// <img src="http://www.noCookies.com/imgs/example.png" />
// Loop through each .html file in current directory and create an .html2
// If satisfied w results can finalize .html2 files
// I use .html2 so you don't overwrite your original files
foreach (glob("*.html") as $file)
{
// You have to escape double quotes
$lookFor = 'src=\"imgs';
$replaceWith = 'src=\"http://www.noCookies.com/imgs';
// create the sed command:
// _g is to do a global search
$shellCommand = "sed s_{$lookFor}_{$replaceWith}_g {$file}";
// create a NEW file and open it
$fp = fopen("{$file}2", "w");
// perform sed and write it to the new file
fwrite($fp, shell_exec($shellCommand));
fclose($fp);
}
?>
Edit - original answer follows:
Another option, which is probably just as slow as or slower than the original, is mod_rewrite.
If you run Apache and all the images are in separate folders (it also works if they're not, but then it's more work), you could use mod_rewrite, something like the rule below. I'm a little rusty on mod_rewrite, but you want to take the relative path and swap it for an absolute path:
RewriteEngine on
RewriteRule ^media/images/(.*) http://noCookies.com/media/images/$1 [L]
I think it's definitely the way to go, though. Then you can change the old img sources slowly over time. Or just keep the old ones and add in the correct img src for any new images you add.
Here is the mod_rewrite documentation, which is a little opaque. Some of the tutorials on preventing image hotlinking could also be helpful.
I just remembered, this is a good mod_rewrite tips and tricks page.
