I'm trying to extract images from a web page.
I am using the following code, but it gives no output even though I know the page contains images (I used an eBay page as an example):
$html = "http://www.ebay.co.uk/itm/190706137456?_trkparms=clkid%3D1088812801530482649&_qi=RTM944765";
$dom = new domDocument;
#$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
Further to this, is it possible to extract only JPG images and, beyond that, only images above a certain height/width?
I've been using simple_html_dom recently, but it fails a lot of the time and I find it slow.
Is there a way, for example, instead of looking for 'img' and 'src', to just find anything that ends in '.jpg' and then strip everything before 'http://...etc etc..'?
Try using $dom->loadHTMLFile() instead of $dom->loadHTML. So...
$html = "http://www.ebay.co.uk/itm/190706137456?_trkparms=clkid%3D1088812801530482649&_qi=RTM944765";
$dom = new domDocument();
$dom->loadHTMLFile($html);
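For context on why the original printed nothing: loadHTML() parses a string of markup, while loadHTMLFile() reads from a path or URL, so the DOM has to be loaded with one or the other before it contains any img elements. Here is a minimal sketch of the string variant that runs without a network connection. The helper name imageSources and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: return the src of every <img> found in a string of HTML.
function imageSources(string $html): array
{
    $dom = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html); // expects markup, not a URL
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Illustrative markup only.
$sample = '<html><body><p><img src="a.jpg"><img src="b.png"></p></body></html>';
print_r(imageSources($sample));
```

If you already fetch the page yourself (for example with file_get_contents), passing that string to loadHTML() should behave the same way.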
You can filter image types (and get only the file name) in your foreach() loop. Try something like this:
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$filename = basename($image->getAttribute('src'));
$ext = pathinfo($filename, PATHINFO_EXTENSION);
if ($ext == 'jpg') {
echo $filename . '<br>';
}
}
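One caveat with the basename() approach: a src such as photo.JPG?size=large would not compare equal to 'jpg', because of the upper-case extension and the query string. A rough sketch that strips the query with parse_url() and compares case-insensitively follows. The function name hasJpegExtension is hypothetical, and treating .jpeg as a match is my assumption.

```php
<?php
// Hypothetical helper: true when the URL's path ends in .jpg or .jpeg,
// ignoring case and any ?query or #fragment.
function hasJpegExtension(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // malformed URL or no path component
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, ['jpg', 'jpeg'], true);
}

var_dump(hasJpegExtension('http://example.com/a/photo.JPG?size=large'));
```

Inside the loop you could then write if (hasJpegExtension($image->getAttribute('src'))) instead of comparing $ext directly.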
You can also filter by image width and height, but the values you get back may not be what you expect. You'd imagine that by using these attributes...
$width = $image->getAttribute('width');
$height = $image->getAttribute('height');
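...it would return the values of width="xxx" and height="yyy", but in my testing it didn't; it looked as though it picked up the style attributes instead. (As far as I know, getAttribute() only returns the named attribute's own value, or an empty string when the tag has no such attribute, so the page may simply not set width and height on its img tags.) Keep that in mind. That said, you can use a similar approach for width and height: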
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$filename = basename($image->getAttribute('src'));
$width = $image->getAttribute('width');
$height = $image->getAttribute('height');
$ext = pathinfo($filename, PATHINFO_EXTENSION);
if ($ext == 'jpg' && ($width > 20 && $height > 10)) {
echo $filename . "($width x $height)" . '<br>';
}
}
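To see what getAttribute() actually hands back, here is a small experiment on made-up markup: one tag declares width and height, the other sizes itself only through CSS. This is a sketch under the assumption that DOMElement::getAttribute() returns an empty string for a missing attribute, which is how I understand it to behave.

```php
<?php
// Illustrative markup: one tag with width/height attributes, one with only CSS.
$dom = new DOMDocument();
$dom->loadHTML(
    '<body><img src="a.jpg" width="120" height="80">'
    . '<img src="b.jpg" style="width:300px"></body>'
);

$report = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $report[$img->getAttribute('src')] = [
        'width'  => $img->getAttribute('width'),  // '' when the attribute is absent
        'height' => $img->getAttribute('height'),
        'style'  => $img->getAttribute('style'),
    ];
}
var_dump($report);
```

Note that an empty string compared with > 20 is false, so images without explicit width and height attributes would be silently skipped by the filter above.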
Hopefully that works for you. Here's everything, in case you need it:
$html = "http://www.ebay.co.uk/itm/190706137456?_trkparms=clkid%3D1088812801530482649&_qi=RTM944765";
$dom = new domDocument();
$dom->loadHTMLFile($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$filename = basename($image->getAttribute('src'));
$width = $image->getAttribute('width');
$height = $image->getAttribute('height');
$ext = pathinfo($filename, PATHINFO_EXTENSION);
if ($ext == 'jpg' && ($width > 20 && $height > 10)) {
echo $filename . "($width x $height)" . '<br>';
}
}
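On the last part of the question (just finding anything that ends in '.jpg'): a regular expression over the raw page text can do that without a DOM at all. This is only a sketch. The function name findJpegUrls and the pattern are my own, it only sees absolute http(s) URLs, and it can be fooled by odd cases such as a.jpg.html.

```php
<?php
// Hypothetical helper: pull absolute http(s) URLs ending in .jpg/.jpeg out of raw text.
function findJpegUrls(string $text): array
{
    // Match lazily up to the extension; never cross whitespace, quotes or angle brackets.
    preg_match_all('~https?://[^\s"\'<>]+?\.jpe?g\b~i', $text, $matches);
    return array_values(array_unique($matches[0]));
}

// Illustrative input only.
$text = '<img src="http://example.com/a.jpg?x=1"> '
    . "<a href='https://example.com/b.JPEG'>b</a> http://example.com/c.png";
print_r(findJpegUrls($text));
```

The DOM approach is usually more robust, but this also picks up JPEG links that appear in href attributes, inline scripts or plain text.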
Related
I am using PHP's getimagesize to set the width and height columns in the database. It works fine with every image except some GIFs shot with an iPhone camera (Live Photos). In that case the width and height are switched. My first thought was to use exif_read_data to check the orientation and rotate, but I do not believe it supports GIFs. What is the best way to identify disoriented images, if this is indeed the issue?
$img = base64_decode(preg_replace('#^data:image/\w+;base64,#i', '', $media));
if (!$img) {
echo "!img media: $media";
return false;
}
$tfn = $user_id.''.time().'tmp';
file_put_contents($tfn, $img);
$info = getimagesize($tfn);
unlink($tfn);
if (($info[0] > 0 && $info[1] > 0 && $info['mime'])) {
$mfa = explode('/', $info['mime']);
if($mfa[0] != 'image'){
echo "not an image";
return 0;
}
$media_format = $mfa[1];
if ($media_type == "gif") { $media_format = "gif"; $media_type = "image"; }
$media_name = md5("kaa".$page_id."".time()."".$user_id).".".$media_format;
$file = rand(0,1000000);
$path = "/var/www/html/api/media/posts/".$file;
$data = base64_decode(preg_replace('#^data:image/\w+;base64,#i', '', $media));
if(!file_exists($path))
mkdir($path, 0777, true);
if(!file_put_contents($path.'/'.$media_name, $data)){
return 0;
}
$media_p = $file.'/'.$media_name;
$dimens = $info[0].'x'.$info[1];
//echo "info[3] = ";echo $info[3];
}
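On the mechanics only (this does not address the orientation problem): the temporary file can be avoided, since getimagesizefromstring() measures the decoded bytes in memory. Here is a minimal sketch. The helper name dataUriImageInfo and the shape of its return value are assumptions of mine, and the sample is a 1x1 GIF used purely as input.

```php
<?php
// Hypothetical helper: decode a data: URI and measure the image in memory.
// Returns ['width' => int, 'height' => int, 'mime' => string], or null on failure.
function dataUriImageInfo(string $dataUri): ?array
{
    $base64 = preg_replace('#^data:image/\w+;base64,#i', '', $dataUri);
    $bytes = base64_decode($base64, true); // strict mode rejects invalid characters
    if ($bytes === false || $bytes === '') {
        return null;
    }
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return null; // not a recognisable image
    }
    return ['width' => $info[0], 'height' => $info[1], 'mime' => $info['mime']];
}

// A 1x1 GIF, used purely as sample input.
$sample = 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
print_r(dataUriImageInfo($sample));
```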
I am able to scrape the images from a website using PHP, but I want to scrape only the first image whose height and width are both greater than 200px. How can I get the dimensions of the first image source? Here is my code:
$html_3 = file_get_contents('http://beignindian.com');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html_3, $matches );
$main_image_1 = $matches[ 1 ][ 0 ];
You can use the getimagesize function to get the image's width and height. Once you have them, add an if condition to decide whether to run the rest of the code.
list($width, $height) = getimagesize($main_image_1); // I am assuming that $main_image_1 has image source.
echo "width: " . $width . "<br />";
echo "height: " . $height;
if($width > 200 && $height > 200) {
// perform something here.
}
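The snippet above only looks at the first match. If the goal is the first image that is actually large enough, one way is to walk the matches and stop at the first hit. A sketch, assuming a hypothetical helper firstLargeImage. The $measure parameter is my addition: it defaults to getimagesize() and exists only so the example can be tried without downloading anything.

```php
<?php
// Hypothetical helper: return the first URL whose measured size exceeds the minimums.
// $measure defaults to getimagesize(); it is injectable so the sketch can run offline.
function firstLargeImage(array $urls, int $minWidth, int $minHeight, ?callable $measure = null): ?string
{
    $measure = $measure ?? 'getimagesize';
    foreach ($urls as $url) {
        $size = $measure($url);
        if ($size === false) {
            continue; // unreadable or not an image
        }
        [$width, $height] = $size;
        if ($width > $minWidth && $height > $minHeight) {
            return $url; // stop at the first match
        }
    }
    return null;
}

// Offline stand-in for getimagesize(), with made-up sizes.
$sizes = ['a.gif' => [50, 50], 'b.jpg' => [640, 480], 'c.jpg' => [800, 600]];
$stub = function ($url) use ($sizes) {
    return $sizes[$url] ?? false;
};
var_dump(firstLargeImage(['a.gif', 'missing', 'b.jpg', 'c.jpg'], 200, 200, $stub));
```

With the question's variables this would be called as firstLargeImage($matches[1], 200, 200).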
Update:
If you need to loop through all the images on a website, use the following code:
$host = "http://www.beingindian.com/";
$html = file_get_contents($host);
// create new DOMDocument
$document = new DOMDocument('1.0', 'UTF-8');
// set error level
$internalErrors = libxml_use_internal_errors(true);
// load HTML
$document->loadHTML($html);
// Restore error level
libxml_use_internal_errors($internalErrors);
$images = $document->getElementsByTagName('img');
foreach ($images as $image) {
$image_source = $image->getAttribute('src');
// check if image URL is an absolute URL or relative URL
$image_url = (filter_var($image_source, FILTER_VALIDATE_URL))?$image_source:$host.$image_source;
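$size = @getimagesize($image_url);
if ($size === false) {
continue; // skip anything that cannot be read as an image
}
list($width, $height) = $size;
if($width > 200 && $height > 200) {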
// perform something here.
}
else {
// perform something here.
}
}
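Prepending $host only covers one kind of relative URL. Pages also use protocol-relative (//cdn...) and root-relative (/img/...) sources, and filter_var() rejects both. Here is a rough sketch of a slightly more careful resolver. The name absoluteImageUrl is hypothetical, and it deliberately ignores ports, ../ segments and base tags.

```php
<?php
// Hypothetical helper: turn an <img src> value into an absolute URL.
// Handles absolute, protocol-relative (//host/x) and root-relative (/x) forms;
// any other relative path is simply appended to the base, which is a simplification.
function absoluteImageUrl(string $src, string $base): string
{
    if (preg_match('~^https?://~i', $src)) {
        return $src; // already absolute
    }
    $scheme = parse_url($base, PHP_URL_SCHEME) ?: 'http';
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }
    $host = parse_url($base, PHP_URL_HOST);
    if (strpos($src, '/') === 0) {
        return $scheme . '://' . $host . $src;
    }
    return rtrim($base, '/') . '/' . $src;
}

echo absoluteImageUrl('/img/a.jpg', 'http://www.example.com/'), "\n";
```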
I have a string that contains text and photos, as you can see below.
My code so far gets all the images and uploads them into a folder.
I need to replace the original links with the newly uploaded ones, in the correct order.
$nextstep = "Hello there this is image 1 <img src='http://www.demosite.com/wp-content/uploads/2015/01.jpg' width='653' height='340' alt='xxx' title='xxx'> !! And Now you can see image number 2 <img src='http://www.demosite.com/wp-content/uploads/2015/02.jpg' width='653' height='340' alt='xxx' title='xxx'>";
$string = $nextstep;
$doc = new DOMDocument();
$doc->loadHTML($string);
$images = $doc->getElementsByTagName('img');
foreach ($images as $image) { //STARTING LOOP
echo "<br>";
echo $image->getAttribute('src') . "\n";
echo "<br>";
$urlimg = $image->getAttribute('src'); //IMAGE URL
$URL = urldecode($urlimg);
$image_name = (stristr($URL,'?',true))?stristr($URL,'?',true):$URL;
$pos = strrpos($image_name,'/');
$image_name = substr($image_name,$pos+1);
$extension = stristr($image_name,'.');
if($extension == '.jpg' || $extension == '.png' || $extension == '.gif' || $extension == '.jpeg'){
$img = '../images/' . $image_name;
file_put_contents($img, file_get_contents($urlimg)); //UPLOAD THEM ONE BY ONE
}
}
It's not clear what the desired outcome is here. It sounds like you want to change the src URL in your existing string to the one where you've saved the images. If this isn't the case please do try updating the question for more clarity.
Here's a simple way to break down the problem...
Step 1 - Extract the img tags from DOM using source string
$html = <<<'HTML'
Hello there this is image 1 <img src="http://www.demosite.com/wp-content/uploads/2015/01.jpg" width="653" height="340" alt="xxx" title="xxx"> !!
And Now you can see image number 2 <img src="http://www.demosite.com/wp-content/uploads/2015/02.jpg" width="653" height="340" alt="xxx" title="xxx">
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html);
$imgs = $dom->getElementsByTagName('img');
// Store the list of image urls in an array - this will come in handy later
$imgURLs = [];
foreach($imgs as $img) {
if (!$img->hasAttribute('src')) {
continue;
}
$imgURLs[] = $img->getAttribute('src');
}
Step 2 - Save the image in a different location
$newImgURLs = []; // new modified URLs where images were moved
$newPath = '../images'; // wherever you're saving the images
foreach($imgURLs as $imgURL) {
/**
* Use parse_url and pathinfo to break down the URL parts and extract the
* filename/extension instead of the fragile implementation you used above
*/
$URLparts = parse_url($imgURL);
$file = pathinfo($URLparts['path']);
$fileName = $file['filename'] . '.' . $file['extension'];
$newFileName = $newPath . '/' . $fileName;
$newImgURLs[] = $URLparts['scheme'] . '://' .
$URLparts['host'] . $file['dirname'] . '/' . $newFileName .
(isset($URLparts['query']) ? ('?' . $URLparts['query']) : null) .
(isset($URLparts['fragment']) ? ('#' . $URLparts['fragment']) : null);
// download image and save to new location
file_put_contents($newFileName, file_get_contents($imgURL));
}
Step 3 - Modify the img src URLs to new path
foreach($imgs as $i => $img) {
$img->setAttribute('src', $newImgURLs[$i]);
}
echo $dom->saveHTML(); // new updated DOM
// or just create a new $html string from scratch using the new URLs.
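One thing to watch in the three steps: Step 1 skips img tags without a src, so the indexes in $newImgURLs can drift out of line with $imgs in Step 3. A variation that avoids the parallel arrays is to rewrite each src in a single pass through a mapping callback. This is a sketch only. rewriteImageSources and $toLocal are names I made up, and the mapping (keep the file name, point it at ../images) is an assumption about what you want.

```php
<?php
// Hypothetical helper: rewrite every <img src> through $mapper and return the new HTML.
function rewriteImageSources(string $html, callable $mapper): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $img->setAttribute('src', $mapper($img->getAttribute('src')));
        }
    }
    return $dom->saveHTML();
}

// Assumed mapping: keep the file name, point it at ../images.
$toLocal = function (string $url): string {
    return '../images/' . basename(parse_url($url, PHP_URL_PATH));
};

echo rewriteImageSources(
    '<p>One <img src="http://www.demosite.com/wp-content/uploads/2015/01.jpg"> two '
    . '<img src="http://www.demosite.com/wp-content/uploads/2015/02.jpg?v=3"></p>',
    $toLocal
);
```

The download itself (file_put_contents with file_get_contents) could live inside the callback, so each image is saved and relinked in the same step.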
I currently have this code:
$img_dir = 'tsimgs/';
$images = array();
$files = scandir($img_dir);
foreach($files as $f) {
// pathinfo() avoids passing explode()'s result straight to end(), which expects a variable
$extension = pathinfo($f, PATHINFO_EXTENSION);
if($extension == 'jpg' || $extension == 'png') {
$images[] = $f;
}
}
$random = array_rand($images);
$chosen = $images[$random];
$chosen_extension = pathinfo($chosen, PATHINFO_EXTENSION);
if($chosen_extension == 'jpg') {
$image = imagecreatefromjpeg($img_dir.$chosen);
}
elseif($chosen_extension == 'png') {
$image = imagecreatefrompng($img_dir.$chosen);
}
// WRITE OUT THE IMAGE //
header('Content-type: image/jpeg');
imagejpeg($image, NULL, 100);
imagedestroy($image);
Basically, this displays a random image each time the page is visited. How would I write a script to check which image is currently being displayed to the user? Is that even possible?
This might not be a perfect solution, but if you are already using sessions you could set a session variable when selecting the image to show to the user, and then fetch it when you want to know which image is being displayed.
You have to remember though, that if the user has multiple windows open there is no guarantee that this works.
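A minimal sketch of the idea, assuming a hypothetical helper pickAndRemember. It takes the storage array by reference, so a plain array can stand in for $_SESSION here. In a real page you would call session_start() first and pass $_SESSION instead. The key name current_image is also made up.

```php
<?php
// Hypothetical helper: pick a random image and remember the choice in $store.
// In a real page $store would be $_SESSION (after session_start()).
function pickAndRemember(array $images, array &$store): string
{
    $chosen = $images[array_rand($images)];
    $store['current_image'] = $chosen;
    return $chosen;
}

// Plain array standing in for $_SESSION so the sketch runs on the command line.
$fakeSession = [];
$shown = pickAndRemember(['a.jpg', 'b.png', 'c.jpg'], $fakeSession);
echo $fakeSession['current_image'] === $shown ? "remembered\n" : "mismatch\n";
```

Any later request from the same session can then read the stored key to find out which file was served last.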
I am parsing websites (not any in particular, but a variety) and saving various information from them, including images. So far I am saving the image src and then going back through again to check the size of the image. What I would like to do is check the image size when I first parse the page....
Here's what I have...
$imgs = $dom->getElementsByTagName('img');
$imageArray = array();
foreach ($imgs as $img) {
$image = $this->returnProperURL($img->getAttribute('src'));
$imageSize = getimagesize($image);
$imageWidth = $imageSize[0];
$imageHeight = $imageSize[1];
if($imageWidth > $this->minImageWidth && $imageHeight > $this->minImageHeight){
$imageArray[] = $image;
}
}
Instead of using getimagesize() after I have the image src, is there a way I can check the image size the first time through, when I am parsing it? As I'm sure you suspect, it's taking twice as long as it should.
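One possible way to cut down the second pass, offered as a sketch rather than a definitive answer: treat the tag's own width and height attributes as a cheap first check and only measure the file when they are missing. The helper name isLargeEnough and the injectable $measure callback are assumptions of mine, and declared attributes are not guaranteed to match the real file dimensions.

```php
<?php
// Hypothetical helper: decide whether an <img> is big enough, measuring the file
// only when the tag does not declare a numeric width and height.
function isLargeEnough(DOMElement $img, int $minWidth, int $minHeight, callable $measure): bool
{
    $w = $img->getAttribute('width');
    $h = $img->getAttribute('height');
    if (ctype_digit($w) && ctype_digit($h)) {
        return (int) $w > $minWidth && (int) $h > $minHeight; // no download needed
    }
    $size = $measure($img->getAttribute('src'));
    return $size !== false && $size[0] > $minWidth && $size[1] > $minHeight;
}

// Illustrative markup and an offline stand-in for getimagesize().
$dom = new DOMDocument();
$dom->loadHTML(
    '<body><img src="a.jpg" width="300" height="250"><img src="b.jpg">'
    . '<img src="c.jpg" width="10" height="10"></body>'
);
$calls = 0;
$stub = function ($src) use (&$calls) {
    $calls++;
    return [400, 400];
};
foreach ($dom->getElementsByTagName('img') as $img) {
    var_dump(isLargeEnough($img, 200, 200, $stub));
}
```

In the loop from the question, $measure would be 'getimagesize' and the src would first go through returnProperURL().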