scraping all images from a website using DOMDocument - php

I basically want to get ALL the images on any website using DOMDocument,
but I can't even load my HTML, for reasons I don't know yet.
$url="http://<any_url_here>/";
$dom = new DOMDocument();
#$dom->loadHTML($url); //i have also tried removing #
$dom->preserveWhiteSpace = false;
$dom->saveHTML();
$images = $dom->getElementsByTagName('img');
foreach ($images as $image)
{
echo $image->getAttribute('src');
}
Nothing gets printed. Did I do something wrong with the code?

You don't get a result because $dom->loadHTML() expects HTML, and you are giving it a URL. You first need to fetch the HTML of the page you want to parse; you can use file_get_contents() for that.
I used this in my image grab class. Works fine for me.
$html = file_get_contents('http://www.google.com/');
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
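For real pages it helps to check that the download succeeded before parsing. The following is a minimal sketch, not part of the original answer; the function names imageSourcesFromHtml() and fetchImageSources() are made up for illustration.

```php
<?php
// Hypothetical helper: collect the src of every <img> in an HTML string.
function imageSourcesFromHtml(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }
    // Real-world markup is rarely valid, so keep parser errors internal.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src'); // '' when the attribute is missing
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Hypothetical helper: download a page, then extract its image sources.
function fetchImageSources(string $url): array
{
    $html = @file_get_contents($url); // false on failure
    return $html === false ? [] : imageSourcesFromHtml($html);
}

print_r(imageSourcesFromHtml('<p><img src="a.png"><img alt="no src"><img src="b.jpg"></p>'));
```

Note that the src values come back exactly as written in the page, so relative paths still need to be resolved against the page URL before they can be downloaded.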

Related

How To Fetch images from a website and store in mysql in php

Hi, I want to get all images from a website and store them in MySQL using PHP. I am using the HTML DOM, but it shows an error. Please help. Here is the error:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 73 in C:\xampp\htdocs\Oifind\Formula.php on line 3
And here is my code:
<?php
$html = file_get_contents('http://www.google.com/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$src= $image->getAttribute('src');
echo "<img src='".$src."'>";
}
?>
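The warning means the parser found an ampersand that does not start a valid entity, which is common in real pages (for example an unescaped & in a link). DOMDocument still builds the tree, so one common approach is to let libxml collect its errors internally instead of printing them. A minimal sketch, assuming only the image URLs matter; the sample markup and the function name are made up for illustration:

```php
<?php
// Hypothetical helper: parse HTML without emitting libxml warnings.
function imagesIgnoringParseWarnings(string $html): array
{
    // Collect parser errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // libxml_get_errors() would list what the parser complained about;
    // this sketch simply discards the errors.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $sources[] = $image->getAttribute('src');
    }
    return $sources;
}

// Sample markup (an assumption): the unescaped & in the href is the kind of
// input that can trigger htmlParseEntityRef warnings.
$html = '<a href="/search?q=php&hl=en">search</a><img src="logo.png">';
print_r(imagesIgnoringParseWarnings($html));
```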

Accessing child element

I have this XML, and I need to get every value from <g:detailed_image>. I've tried, but what I get is only the first value in each <g:detailed_images>. I wonder what's wrong with my code.
Here's the xml code:
<item>
<g:detailed_images>
<g:detailed_image>hat.png</g:detailed_image>
<g:detailed_image>tie.png</g:detailed_image>
<g:detailed_image>eye_glass.png</g:detailed_image>
<g:detailed_image>watch.png</g:detailed_image>
</g:detailed_images>
</item>
<item>
<g:detailed_images>
<g:detailed_image>shoe.png</g:detailed_image>
<g:detailed_image>socks.png</g:detailed_image>
<g:detailed_image>hand_gloves.png</g:detailed_image>
<g:detailed_image>scarf.png</g:detailed_image>
</g:detailed_images>
</item>
And this is my code:
foreach($xpath->evaluate('//item') as $item)
{
$detailed_images = $xpath->evaluate('g:detailed_images', $item);
foreach ($detailed_images as $img)
{
$simg = $xpath->evaluate('string(g:detailed_image)', $img);
echo 'image = ';
echo $simg;
}
}
My result is:
image = hat.png
image = shoe.png
While what I want is this:
image = hat.png
image = tie.png
image = eye_glass.png
image = watch.png
image = shoe.png
image = socks.png
image = hand_gloves.png
image = scarf.png
Thanks for the help.
As you can see, you're only getting the first detailed_image of each detailed_images. That is because string() applied to a node set returns the string value of its first node only. So, keeping the way you're doing it, you'd need to query g:detailed_image without string(), loop over the result with another foreach, and print each node. But you don't need all that XPath querying to get those elements. One query is enough:
//item/g:detailed_images/g:detailed_image
PHP Code:
$dom = new DOMDocument;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
foreach($xpath->evaluate('//item/g:detailed_images/g:detailed_image') as $item) {
var_dump($item->nodeValue);
}
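If the images need to stay grouped per item, the nested-loop approach also works once string() is dropped. This is a minimal sketch under assumptions: the items sit inside a root element that declares the g: prefix, the namespace URI below is only an example value, and imagesPerItem() is a made-up name.

```php
<?php
// Assumed namespace URI for the g: prefix; use whatever the real feed declares.
const G_NS = 'http://base.google.com/ns/1.0';

// Hypothetical helper: one list of image names per <item>.
function imagesPerItem(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('g', G_NS); // make the prefix explicit

    $result = [];
    foreach ($xpath->query('//item') as $item) {
        $names = [];
        // Relative query: every g:detailed_image below this item.
        foreach ($xpath->query('g:detailed_images/g:detailed_image', $item) as $node) {
            $names[] = $node->nodeValue;
        }
        $result[] = $names;
    }
    return $result;
}

$xml = '<feed xmlns:g="' . G_NS . '">'
     . '<item><g:detailed_images>'
     . '<g:detailed_image>hat.png</g:detailed_image>'
     . '<g:detailed_image>tie.png</g:detailed_image>'
     . '</g:detailed_images></item>'
     . '<item><g:detailed_images>'
     . '<g:detailed_image>shoe.png</g:detailed_image>'
     . '</g:detailed_images></item>'
     . '</feed>';

foreach (imagesPerItem($xml) as $names) {
    foreach ($names as $name) {
        echo 'image = ', $name, "\n";
    }
}
```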

Get src attribute of specific image with DOM PHP

I'm trying to do the same thing as this jQuery snippet, but in PHP (Facebook Open Graph doesn't execute JS code, so it has to run server side):
<script>captureurl=jQuery('.blog-content').find('img').attr('src');
jQuery('head').append("<meta property='og:image' content="+captureurl+"/></meta>");</script>
I've seen that I could get an image attribute like this:
<?php $doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$imgs = $xpath->query("//img");
for ($i=0; $i < $imgs->length; $i++) {
$img = $imgs->item($i);
$src = $img->getAttribute("src");
// do something with $src
} ?>
But how can I target the first image src in the div with .blog-content class?
Thanks for your help :)
Replace $xpath->query("//img") with the following:
$imgs = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " blog-content ")]//img'); // img elements inside any element whose class list includes blog-content
The first match is then $imgs->item(0).
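To mirror the jQuery snippet, the query has to select images inside the element that carries the blog-content class and then take the first one. A minimal sketch, assuming the class name contains no quote characters; firstImageSrcInClass() and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: src of the first <img> inside an element with $class.
function firstImageSrcInClass(string $html, string $class): ?string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word in the class list, then take the first
    // descendant <img> in document order. Assumes $class contains no quotes.
    $query = sprintf(
        '(//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]//img)[1]',
        $class
    );
    $img = $xpath->query($query)->item(0);

    return $img instanceof DOMElement ? $img->getAttribute('src') : null;
}

$html = '<img src="/logo.png">'
      . '<div class="blog-content post"><p><img src="/a.jpg"></p><img src="/b.jpg"></div>';

$src = firstImageSrcInClass($html, 'blog-content');
if ($src !== null) {
    echo '<meta property="og:image" content="', htmlspecialchars($src, ENT_QUOTES), '">', "\n";
}
```

For a live page, $html would come from the fetched document, or the helper could call loadHTMLFile($url) instead of loadHTML().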

Parse and extract image URL file names based on specific alt tags

I'm trying to print a list of image file names from a web page, without the .png extension.
I only want the file names from image URLs that sit inside a div with class="cartoon".
Example Structure:
<div class="cartoon">
<img src="URL/images/element8/12345.png" alt="cartoon">
</div>
Desired Output: 12345
Here is my code that I use to return all images
include('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('URL');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img"); // find your image
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
echo $tag->getAttribute('src');
}
Do you want to do it with XPath?
How about:
.//*[contains(@class, "cartoon")]//img[not(contains(@src, "png"))]
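If the goal is the file name without its extension (12345 in the example), one option is to select the images inside div.cartoon and strip the extension in PHP rather than in XPath. A minimal sketch under that reading of the question; cartoonImageNames() and the example.com URL are made up for illustration:

```php
<?php
// Hypothetical helper: file names (without extension) of images in div.cartoon.
function cartoonImageNames(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " cartoon ")]//img';

    $names = [];
    foreach ($xpath->query($query) as $img) {
        // Keep only the path part so a query string does not end up in the name.
        $path = parse_url($img->getAttribute('src'), PHP_URL_PATH);
        if (is_string($path) && $path !== '') {
            $names[] = pathinfo($path, PATHINFO_FILENAME);
        }
    }
    return $names;
}

$html = '<div class="cartoon">'
      . '<img src="http://example.com/images/element8/12345.png" alt="cartoon">'
      . '</div>'
      . '<img src="http://example.com/images/banner.png">';

print_r(cartoonImageNames($html));
```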

printing out html content from domelement using nodeValue

I have image in html. I parse it to DOMDocument and start working with it...
$doc = new DOMDocument();
$doc->loadHTML($article_header);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
$container = $img->parentNode;
if ($container->tagName != "a") {
$image_inside=utf8_decode($img->nodeValue);
echo "3".$image_inside;
die;
}
}
This code mostly works: line 3 gets the image, and line 6 detects that there is no "a" tag above this "img" tag. Line 8 should then print my initial image, but I only see "3", without the image tag.
I inspected the element and nothing is there; only "3" comes out. Why can't I print the image?
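nodeValue returns an element's text content, and an <img> element has none, so nothing is printed. You could use:
DOMDocument::saveXML($img);
From the PHP documentation for saveXML().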
$doc = new DOMDocument();
$doc->loadHTML($article_header);
$imgs = $doc->getElementsByTagName('img');
foreach ($imgs as $img) {
$container = $img->parentNode;
if ($container->tagName != "a") {
echo utf8_decode($doc->saveXML($img));
die;
}
}
If you're using PHP 5.3.6 or later, you could use (from How to return outer html of DOMDocument?)
$doc->saveHTML($img);
Note the caveat mentioned in the linked-to question:
(...) use saveXml(), but that would create XML compliant markup. In the case of an <a>(<img>) element, that shouldn't be an issue though.
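The difference between nodeValue and the serialized element can be seen side by side. A minimal sketch, assuming PHP 5.3.6 or later for saveHTML() with a node argument; outerHtmlOfUnlinkedImages() and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: outer HTML of every <img> whose parent is not an <a>.
function outerHtmlOfUnlinkedImages(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->parentNode->nodeName !== 'a') {
            // saveHTML($node) serializes the element itself, tag and attributes.
            $result[] = $doc->saveHTML($img);
        }
    }
    return $result;
}

$html = '<p><a href="#"><img src="linked.jpg"></a><img src="photo.jpg" alt="Photo"></p>';

foreach (outerHtmlOfUnlinkedImages($html) as $outer) {
    echo $outer, "\n";
}
```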
