PHP Image Scraper only selected images - php

I'm trying to build a simple image scraper to scrape certain images from a site; however, what I have so far scrapes all the images.
<?php
$url = "http://www.techbuy.com.au/";
$html = file_get_contents('http://www.techbuy.com.au/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image)
{
$fimage = $image->getAttribute('src');
echo "<img src='$url" . "$fimage' ></img>";
}
?>
How can I make it scrape only, say, the second image and leave the rest?

<?php
$url = "http://www.techbuy.com.au/";
$html = file_get_contents('http://www.techbuy.com.au/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $key => $image)
{
if ($key === 1) {
$fimage = $image->getAttribute('src');
echo "<img src='$url" . "$fimage' ></img>";
}
}
?>
Grabs the second image.

if ($images->length >= 2) { $src = $images->item(1)->getAttribute("src"); }
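A minimal sketch of the same idea wrapped in a function, for picking any image by position. The function name nthImageSrc and the sample markup are made up for illustration; the only DOM calls used are getElementsByTagName(), item() and getAttribute(). Indexes are zero-based, so the second image is index 1.

```php
<?php
// Hypothetical helper: return the src of the image at zero-based position $n,
// or null when the page has fewer images than that.
function nthImageSrc(string $html, int $n): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    // item() returns null for an index that is out of range.
    $img = $dom->getElementsByTagName('img')->item($n);

    return $img !== null ? $img->getAttribute('src') : null;
}

// Sample markup (an assumption, not taken from the real site).
$html = '<p><img src="a.png"><img src="b.png"><img src="c.png"></p>';

var_dump(nthImageSrc($html, 1)); // second image
var_dump(nthImageSrc($html, 5)); // no such image
```

Returning null for a missing image keeps the caller from having to check $images->length first.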

Related

how to alter and then show attributes in html with php

In my table, I have a row that contains a string like this:
<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>
I want to give the <img> tag an alt attribute. I've got quite close, but somehow my code still outputs two <img> tags although the string only has one. Can anyone tell me what I'm doing wrong?
This is my code so far:
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str);
$content = $dom->getElementsByTagName('*');
foreach ($content as $i => $node)
{
if ($node->nodeName == 'html' || $node->nodeName == 'body')
{
continue; // dont need to process these tags, right?
}
if ($node->nodeName == 'img')
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML($node);
}
$content = $dom->getElementsByTagName('img');
foreach ($content as $node) {
$img_src = $node->getAttribute('src');
$filename = basename($img_src);
$node->setAttribute('alt', $filename);
}
echo $dom->saveHTML();
Loop only through images with $content = $dom->getElementsByTagName('img');
Move $dom->saveHTML(); after the loop.
Get filename with $filename = basename($img_src);
The slightly changed code below does the work. It only gets the img tags and saves the HTML outside the loop. Note that I changed the way that HTML was loaded, to not include the wrapper tags.
<?php
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveHTML();
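If the LIBXML_HTML_NOIMPLIED flags are not an option, another way to get the fragment back without the wrapper tags is to load it normally and serialize only the children of <body>. The sketch below assumes that approach; the function name addAltFromFilename is hypothetical, and it uses the file name as the alt text only where no alt exists yet.

```php
<?php
// Hypothetical helper: add an alt attribute (the file name) to every <img>
// in an HTML fragment and return the fragment without <html>/<body> wrappers.
function addAltFromFilename(string $fragment): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($fragment);        // the parser wraps the fragment in <html><body>
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        if (!$img->hasAttribute('alt')) {
            $img->setAttribute('alt', basename($img->getAttribute('src')));
        }
    }

    // Serialize only the children of <body>, which drops the wrapper tags.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
    }

    return $out;
}

$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
echo addAltFromFilename($str);
```

The HTML is serialized once, after the loop, so the <img> tag appears only once in the output.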
The problem is that when you use
echo $dom->saveXML($node);
in the loop, it will output for various tags and so the output is not the end result, but a combination of other parts of the document.
Try changing it to
echo $node->nodeName."=>".$dom->saveXML($node).PHP_EOL;
to see what it does.
You could just remove the current echo and add
echo $dom->saveXML();
after the end of the loop.
Alternatively, if you just want to process the <img> tags, you can limit the loop more specifically...
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML();

PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one

I need to replace an <img> tag with a <video> tag. I want to change every <img> tag whose src points to a WebM file, but I do not know how to continue with the code below.
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
}
}
}
$html = $Dom->saveHTML();
return $html;
}
Like Roman, I'm using http://php.net/manual/en/domnode.replacechild.php
but I iterate with a for loop (backwards, since nodes are replaced along the way) and test for the .webm extension in the src with a simple strpos().
$contents = <<<STR
this is some HTML with an <img src="test1.png"/> in it.
this is some HTML with an <img src="test2.png"/> in it.
this is some HTML with an <img src="test.webm"/> in it,
but it should be a video tag - when iframe() is done.
STR;
function iframe($text)
{
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$images = $dom->getElementsByTagName("img");
for ($i = $images->length - 1; $i >= 0; $i --) {
$nodePre = $images->item($i);
$src = $nodePre->getAttribute('src');
// search in src for ".webm"
if(strpos($src, '.webm') !== false ) {
$nodeVideo = $dom->createElement('video');
$nodeVideo->setAttribute("src", $src);
$nodeVideo->setAttribute("controls", '');
$nodePre->parentNode->replaceChild($nodeVideo, $nodePre);
}
}
$html = $dom->saveHTML();
return $html;
}
echo iframe($contents);
Part of output:
this is some HTML with an <video src="test.webm"></video> in it,
but it should be a video tag - when iframe() is done.
Use this code:
(...)
if( strtolower( $pathinfo['extension'] ) === 'webm')
{
//If extension webm change tag to <video>
$new = $Dom->createElement( 'video', $link->nodeValue );
foreach( $link->attributes as $attribute )
{
$new->setAttribute( $attribute->name, $attribute->value );
}
$link->parentNode->replaceChild( $new, $link );
}
(...)
The code above creates a new <video> node with the img node's value, copies all of the img attributes onto the new node, and finally replaces the old node with the new one.
Please note that if the old node has an id, the code will produce a warning.
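For reference, here is a self-contained sketch of that attribute-copying approach. The function name imgToVideo and the sample input are hypothetical. It assumes the src has no query string (pathinfo() would otherwise include it in the extension), and it copies the node list into an array first, because replacing nodes while iterating the live list directly can skip elements.

```php
<?php
// Hypothetical helper: replace every <img> whose src ends in .webm
// with a <video controls> element that keeps the img attributes.
function imgToVideo(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Snapshot the live node list so replacements cannot disturb the iteration.
    $images = iterator_to_array($dom->getElementsByTagName('img'));

    foreach ($images as $img) {
        // PATHINFO_EXTENSION yields '' when the src has no extension.
        $ext = strtolower(pathinfo($img->getAttribute('src'), PATHINFO_EXTENSION));
        if ($ext !== 'webm') {
            continue;
        }

        $video = $dom->createElement('video');
        foreach ($img->attributes as $attr) {
            $video->setAttribute($attr->name, $attr->value);
        }
        $video->setAttribute('controls', '');

        $img->parentNode->replaceChild($video, $img);
    }

    return $dom->saveHTML();
}

echo imgToVideo('<p><img src="test1.png"/> and <img src="test.webm"/> and <img src="clip.WEBM"/></p>');
```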
Solution with DOMDocument::createElement and DOMNode::replaceChild functions:
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
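// Iterate over a snapshot: replacing nodes while looping over the live node list can skip elements.
foreach (iterator_to_array($links) as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
// PATHINFO_EXTENSION returns '' when there is no extension, so no undefined index notice.
if (strtolower(pathinfo($href, PATHINFO_EXTENSION)) === 'webm') {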
//If extension webm change tag to <video>
$video = $Dom->createElement('video');
$video->setAttribute("src", $href);
$video->setAttribute("controls", '');
$link->parentNode->replaceChild($video, $link);
}
}
}
$html = $Dom->saveHTML();
return $html;
}
http://php.net/manual/en/domdocument.createelement.php
http://php.net/manual/en/domnode.replacechild.php

php - how to get the img url from a href using loadhtml

How do I get the img URL from inside an a href link using DOM loadHTML? I tried using $link->nodeValue to get the img src URL, but it is not working.
Example url source:
<img src="www.google.com/test.jpg" />Photo NodeValue
My php code:
// -------------------------------------------------------------------------
// ----------------------- Get URLs From Source
// -------------------------------------------------------------------------
function getVidesURL($url) {
$web_source = $this->getSource($url);
if($web_source != '') {
$Data = $this->Websites_Data[$this->getHost($url)];
preg_match($Data['Index_Preg_Match'], $web_source, $Videos_Page);
$Videos_Page = $Videos_Page[$Data['Index_Preg_Match_Num']];
if($Videos_Page != '') {
$dom = new DOMDocument;
@$dom->loadHTML($Videos_Page);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$Video_Status = "";
$Video_Error = "";
$Video = array(
"URL" => $link->getAttribute('href'),
"Title" => $link->getAttribute('title'),
"MSG" => $link->nodeValue,
);
// Get Image URL Start
$dom = new DOMDocument;
@$dom->loadHTML($Video['MSG']);
$Video_Image = $dom->getElementsByTagName('img');
foreach ($Video_Image as $Image) {
$Video = array(
"IMG" => $link->getAttribute('src'),
);
}
$Videos_URLs .= $Video['IMG'] . '<br />';
}
// Get Image URL Stop
return $Videos_URLs;
}
}
}
The only problem with my code is that I don't know how to get the img URL from inside the a href link.
Here is a small function that can pull out image sources from an HTML input:
<?php
echo PHP_EOL;
var_dump(getImgSrcFromHTML('<img src="www.google.com/test.jpg" />Photo NodeValue<div><img src="www.google.com/test2.jpg" /></div><table><tr><td><img src="www.google.com/test3.jpg" /></td></tr></table>'));
echo PHP_EOL;
function getImgSrcFromHTML($html){
$doc = new DOMDocument();
$doc->loadHTML($html);
$imagePaths = array();
$imageTags = $doc->getElementsByTagName('img');
foreach ($imageTags as $tag) {
$imagePaths[] = $tag->getAttribute('src');
}
if(!empty($imagePaths)) {
return $imagePaths;
} else {
return false;
}
}
Hope this helps.
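To tie each image to its link, which is what the question asks for, one option is to look for the <img> inside each <a> element rather than reloading nodeValue (nodeValue is only the text content, so the img tag is already gone by then). This is a sketch under assumptions: the function name getLinkedImages and the sample markup are made up, and the array keys simply mirror the ones used in the question.

```php
<?php
// Hypothetical helper: for every <a>, collect its href, title, text and
// the src of the first <img> nested inside it (null when there is none).
function getLinkedImages(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Search only within this link, not the whole document.
        $img = $link->getElementsByTagName('img')->item(0);

        $result[] = array(
            'URL'   => $link->getAttribute('href'),
            'Title' => $link->getAttribute('title'),
            'MSG'   => trim($link->textContent),
            'IMG'   => $img !== null ? $img->getAttribute('src') : null,
        );
    }

    return $result;
}

// Sample markup (an assumption about what the real page looks like).
$rows = getLinkedImages('<a href="/watch/1" title="First"><img src="www.google.com/test.jpg" />Photo NodeValue</a>');
print_r($rows);
```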

How to get all images from a webpage in all cases?

I am using this script to get all images from a generic external webpage:
$url = ANY URL HERE;
$html = @file_get_contents($url,false,$context);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
But in some cases like this ( where the image is in "rel:image_src" )
<img src="http://example.com/example.png" rel:image_src="http://example.com/dir/me.jpg" />
it doesn't work.
How can I handle this case?
You could include both:
foreach ($images as $image) {
echo $image->getAttribute('src');
echo $image->getAttribute('rel:image_src');
}
Check if the node has an attribute rel:image_src
foreach ($images as $image) {
if( $image->hasAttribute('rel:image_src') ) {
echo $image->getAttribute('rel:image_src');
} else {
echo $image->getAttribute('src');
}
}
If you want the rel:image_src to take precedence, check for the attribute's presence and use it selectively:
$url = ANY URL HERE;
$html = @file_get_contents($url,false,$context);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
if ($image->hasAttribute('rel:image_src'))
{
echo $image->getAttribute('rel:image_src');
}
else
{
echo $image->getAttribute('src');
}
}

Taking node values from a site, and re-outputting only selected node tags that can be styled

I am pulling my hair trying to get this to work with php.
The problem: I am just trying to scrape products off a site and have them display as a list of products without anything else that I can style in css. What I'd like to output is <div id='product'><a href= $link ><img src= $image /></a><br/><p>$productText</p></div> as a list of products for a site (basically scrape them). This is a project I am trying for fun, here is the code:
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.amazon.com/gp/search/ref=sr_nr_p_8_3?rh=n%3A2619533011%2Ck%3Apet+products%2Cn%3A%212619534011%2Cn%3A2975312011%2Cp_72%3A2661618011%2Cp_8%3A2661607011&bbn=2975312011&keywords=pet+products&ie=UTF8&qid=1328429080&rnid=2661603011#/ref=sr_st?bbn=2975312011&keywords=pet+products&qid=1328429127&rh=n%3A2619533011%2Ck%3Apet+products%2Cn%3A!2619534011%2Cn%3A2975312011%2Cp_72%3A2661618011%2Cp_8%3A2661607011');
$xpath = new DOMXPath( $html );
$productName = $xpath->query( "//div[@id='btfResults']/div/div[4]/div[1]/a/text()" );
$link = $xpath->query( "//div[@id='btfResults']/div/div[3]/a/@href" );
$image = $xpath->query( "//div[@id='btfResults']/div/div[3]/a/img/@src" );
foreach ($productName as $n){
$productText = $n->nodeValue;
}
foreach ($image as $n){
$imageLink = $n->nodeValue;
}
foreach ($link as $n){
$linkLink = $n->nodeValue;
}
foreach ($link as $n)
{
echo "<div id='product'><a href= $linkLink ><img src= $imageLink /></a><br/><p>$productText</p></div>";
}
The truth is I have no clue how to get the right results I want. Let me know if this needs further explaining. Thanks!
Fixed and tested:
<?php
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.amazon.com/gp/search/ref=sr_nr_p_8_3?rh=n%3A2619533011%2Ck%3Apet+products%2Cn%3A%212619534011%2Cn%3A2975312011%2Cp_72%3A2661618011%2Cp_8%3A2661607011&bbn=2975312011&keywords=pet+products&ie=UTF8&qid=1328429080&rnid=2661603011#/ref=sr_st?bbn=2975312011&keywords=pet+products&qid=1328429127&rh=n%3A2619533011%2Ck%3Apet+products%2Cn%3A!2619534011%2Cn%3A2975312011%2Cp_72%3A2661618011%2Cp_8%3A2661607011');
$xpath = new DOMXPath( $html );
$btfResults = $xpath->query("//div[@id='btfResults']/div"); // get all item nodes
foreach ($btfResults as $node) // iterate through all items
{
$productText = $linkLink = $imageLink = null; // reset result variables for each loop
if ($productName = $xpath->query("./div[4]/div[1]/a/text()", $node)->item(0)) // fetch productName node from the item
{
$productText = $productName->nodeValue;
}
if ($link = $xpath->query("./div[3]/a/@href", $node)->item(0)) // fetch link node from the item
{
$linkLink = $link->nodeValue;
}
if ($image = $xpath->query("./div[3]/a/img/@src", $node)->item(0)) // fetch image node from the item
{
$imageLink = $image->nodeValue;
}
if ($productText && $linkLink && $imageLink) // only return a result when all variables are set.
{
echo '<div id="product"><a href="'.$linkLink.'"><img src="'.$imageLink.'"/></a><br/><p>'.$productText.'</p></div>';
}
}
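The key technique in the fix above is running each XPath query relative to one item node (the second argument to query()), so the name, link and image always belong to the same product. Below is a small offline sketch of that technique. The sample markup and its structure are invented for illustration and do not match the real Amazon page; it also uses a class instead of a repeated id, and escapes the values before printing.

```php
<?php
// Invented sample markup standing in for the real results page.
$sample = <<<HTML
<div id="btfResults">
  <div><a href="/p/1"><img src="/img/1.jpg"></a><h3>Dog bowl</h3></div>
  <div><a href="/p/2"><img src="/img/2.jpg"></a><h3>Cat toy</h3></div>
  <div><h3>No link or image</h3></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$products = array();

foreach ($xpath->query("//div[@id='btfResults']/div") as $item) {
    // string(...) evaluates to '' when the relative path matches nothing.
    $name  = trim($xpath->evaluate('string(./h3)', $item));
    $link  = $xpath->evaluate('string(./a/@href)', $item);
    $image = $xpath->evaluate('string(./a/img/@src)', $item);

    // Keep only items where all three values were found.
    if ($name !== '' && $link !== '' && $image !== '') {
        $products[] = array('name' => $name, 'link' => $link, 'image' => $image);
    }
}

foreach ($products as $p) {
    printf(
        '<div class="product"><a href="%s"><img src="%s"/></a><br/><p>%s</p></div>' . PHP_EOL,
        htmlspecialchars($p['link']),
        htmlspecialchars($p['image']),
        htmlspecialchars($p['name'])
    );
}
```

Collecting the rows into an array first keeps the scraping separate from the markup, so the output template can be restyled without touching the queries.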
