scraping images from url using php - php

I am trying to build a page that grabs and saves images from another URL. Here is what I want on the page:
A text box (to enter the URL I want to get images from).
A save dialog box to specify the path to save the images to.
However, I only want to save images from inside a specific element on that URL.
For example, in my code I say: go to example.com and grab all images inside the element with class="images".
Note: not all images on the page, just those inside that element.
Whether the element has 3 images in it or 50 or 100, I don't care.
Here's what I tried in PHP, and it works:
<?php
$html = file_get_contents('http://www.tgo-tv.net');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html, $matches );
echo $matches[ 1 ][ 0 ];
?>
This gets the image name and path, but what I want is a save dialog box, and the code should save the image directly into that path instead of echoing it.
I hope that makes sense.
Edit 2
It's OK not to have a save dialog box; I will specify the save path in the code.

If you want something generic, you can use:
<?php
$the_site = "http://somesite.com";
$the_tag = "div"; #
$the_class = "images";
$html = file_get_contents($the_site);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//'.$the_tag.'[contains(@class,"'.$the_class.'")]/img') as $item) {
$img_src = $item->getAttribute('src');
print $img_src."\n";
}
Usage:
Change the site and the tag (which can be div, span, a, etc.), and also change the class name.
For example, change the values to:
$the_site = "https://stackoverflow.com/questions/23674744/what-is-the-equivalent-of-python-any-and-all-functions-in-javascript";
$the_tag = "div"; #
$the_class = "gravatar-wrapper-32";
Output:
https://www.gravatar.com/avatar/67d8ca039ee1ffd5c6db0d29aeb4b168?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24da669dda96b6f17a802bdb7f6d429f?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=32&d=identicon&r=PG
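The snippet above only prints the URLs. To actually save the files, as the question asks, something along these lines might work. It is a minimal sketch, not a definitive implementation: the function names extract_image_urls and save_images are made up for illustration, the image URLs are assumed to be absolute, and allow_url_fopen is assumed to be enabled. Unlike the query above, it also picks up images nested deeper inside the element and matches the class as a whole word.

```php
<?php
// Collect the src of every <img> nested anywhere inside elements whose
// class attribute contains the given class name (hypothetical helper).
function extract_image_urls(string $html, string $tag, string $class): array
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word, so "images" does not match "images-old".
    $query = sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]//img[@src]',
        $tag,
        $class
    );

    $urls = [];
    foreach ($xpath->query($query) as $img) {
        $urls[] = $img->getAttribute('src');
    }
    return array_values(array_unique($urls));
}

// Download each URL into $dir and return the paths that were written
// (hypothetical helper; assumes absolute URLs).
function save_images(array $urls, string $dir): array
{
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    $saved = [];
    foreach ($urls as $url) {
        $data = @file_get_contents($url);
        if ($data === false) {
            continue;                   // skip images that fail to download
        }
        $name = basename((string) parse_url($url, PHP_URL_PATH));
        if ($name === '') {
            $name = md5($url);          // fall back to a hash if the URL has no file name
        }
        $path = rtrim($dir, '/') . '/' . $name;
        file_put_contents($path, $data);
        $saved[] = $path;
    }
    return $saved;
}
```

Usage would look like save_images(extract_image_urls(file_get_contents($the_site), $the_tag, $the_class), '/path/to/save'). Relative src values would need to be turned into absolute URLs first.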

Maybe you should try the HTML DOM Parser for PHP (simple_html_dom). I found this tool recently and, to be honest, it works pretty well. It has jQuery-like selectors, as you can see on its site. I suggest you take a look and try something like:
<?php
require_once("./simple_html_dom.php");
$html = file_get_html("http://example.com"); // Load the page to search (placeholder URL)
foreach ($html->find("div") as $parent) // Start from the root and find the parent tag to search in ("div" searches all divs; "div.images" limits it to a class)
{
foreach ($parent->find("img") as $img) // Search for img tags inside each parent found
{
echo $img->src . "<br>"; // Output the img's src attribute (for <img src="www.example.com/cat.png"> you get www.example.com/cat.png)
}
}
?>
I hope this helps, more or less.

Related

Change src attribute of img, using Simple HTML Dom php library

I'm totally new to php, and I'm having a hard time changing the src attribute of img tags.
I have a website that pulls a part of a page using Simple Html Dom php, here is the code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tabuademares.com/br/bahia/morro-de-sao-paulo');
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
$elem = $html->find('table[id=tabla_mareas]', 0);
echo $elem;
?>
This code correctly returns the part of the page I want. But when I do this, the img tags come with the src of the original page: /assets/svg/icon_name.svg
What I want to do is change the original src so that it looks like this: http://www.mywebsite.com/wp-content/themes/mytheme/assets/svg/icon_name.svg
I want to put the URL of my site in front of /assets/svg/icon_name.svg
I already tried some tutorials, but I could not make any of them work.
Could someone please help a noob in PHP?
I got it working. So if someone has the same question, here is how I managed to get the code working.
<?php
// Note: you must download simple_html_dom.php from
// this link: https://sourceforge.net/projects/simplehtmldom/files/
// and then include it
include_once('simple_html_dom.php');
//target the website
$html = file_get_html('http://the_target_website.com');
//loop through all images in the HTML DOM
foreach($html ->find('img') as $item) {
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $item->src;
// Set an attribute
$item->src = 'http://yourwebsite.com/'.$value;
}
//save the variable
$html->save();
//find in the HTML the div whose content you want
$elem = $html->find('div[id=container]', 0);
//output it using echo
echo $elem;
?>
That's it!
Did you read the documentation on reading and modifying attributes?
As per that:
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;
// Set a attribute
$e->href = 'ursitename'.$value;
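For comparison, the same idea can be sketched with PHP's built-in DOMDocument instead of Simple HTML DOM. This is only an illustration under some assumptions: the function name prefix_image_sources is made up, and the rule "leave absolute and protocol-relative URLs alone" is my own choice rather than something from the answers above. It also avoids the double slash you get when a base ending in "/" is joined to a src starting with "/".

```php
<?php
// Prefix relative <img src> values with $base; leave absolute URLs alone
// (hypothetical helper).
function prefix_image_sources(string $html, string $base): string
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Skip empty values, URLs with a scheme (http:, https:, data:)
        // and protocol-relative URLs (//cdn.example.com/...).
        if ($src === '' || preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $src)) {
            continue;
        }
        // Join base and path with exactly one slash between them.
        $img->setAttribute('src', rtrim($base, '/') . '/' . ltrim($src, '/'));
    }
    // saveHTML() returns a full document, including doctype, html and body.
    return $dom->saveHTML();
}
```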

PHP Regex replace link if it does not have data attribute

I need to loop through a bunch of HTML code and remove the <a> </a> tags from all links which don't include the data attribute data-link="keepLink"
Here is an example of body value I need to modify:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
After the modification I need it to look like (so the offer link is removed):
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
So far I have managed to get the first half of the link removing if it doesn't include a data-link="keepLink" attribute. But the closing </a> is still present.
Here is the regex I have used:
$result["body_value"] = preg_replace('/<a (?![^>]*data-link="keepLink").*?>/i', '', $result["body_value"]);
So the new body value looks like:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel</a>.</strong>
The DOMDocument extension is available by default in PHP. It is presumably faster and is designed exactly for what you are trying to achieve. You can use it to load your document and search for any links without a data-link attribute like this:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com'); // load the file
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[not(@data-link=\'keepLink\')]'); // search for links that do not have the 'data-link' attribute set to 'keepLink'
foreach($nodes as $element){
$textInside = $element->nodeValue; // get the text inside the link
$parentNode = $element->parentNode; // save parent node
$parentNode->replaceChild(new DOMText($textInside), $element); // remove the element
}
$myNewHTML = $dom->saveHTML(); // see http://php.net/manual/ro/domdocument.savehtml.php for limitations such as auto-adding of doc-type
echo $myNewHTML;
Proof of concept: https://3v4l.org/ejatQ.
Please bear in mind that this will take only the text values inside the elements without a data-link='keepLink' attribute value.
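If the links may contain markup (for example <em> or <img>) that should survive, one option is to move the child nodes out of each link before removing it, instead of replacing the link with its text. A rough sketch follows, assuming the input is an HTML fragment in UTF-8; the function name unwrap_links and the gr-root wrapper id are made up for illustration.

```php
<?php
// Remove every <a> lacking data-link="keepLink" but keep whatever it wrapped
// (hypothetical helper).
function unwrap_links(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The wrapper div gives the fragment a single root to serialize back;
    // the XML prolog is a common trick to make loadHTML() read UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8" ?><div id="gr-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // iterator_to_array() takes a snapshot, so removing nodes is safe.
    $links = iterator_to_array($xpath->query('//a[not(@data-link="keepLink")]'));
    foreach ($links as $a) {
        while ($a->firstChild) {
            // Move each child in front of the link, preserving order.
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $root = $xpath->query('//div[@id="gr-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```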
If you are set on regex and don't want to use a parser, try this:
<a (?!data-link=)[^>]*>((?!<\/a>).*?)<\/a>
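Replace it with $1 to keep your link text. Note that the lookahead only checks whether data-link= comes directly after "<a ", so a link with data-link placed after another attribute would still be stripped.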
See https://regex101.com/r/wKQk4p/2
Please say if you need any further explanation.

How to display image url from website sub pages using php code

I am using the PHP code below to display images from web pages. It displays the image URLs from the main page, but not the image URLs from sub-pages.
<?php
include_once('simple_html_dom.php');
$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('img') as $img)
{
echo $img->src."<br />";
echo $img."<br/>";
}
?>
If by sub-page you mean a page that http://fffmovieposters.com is linking to, then of course that script won't show any of those since you're not loading those pages.
You basically have to write a spider that not only finds images, but also anchor tags and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
Pseudo'ish code
$todo = ['http://fffmovieposters.com'];
$done = [];
$images = [];
while( ! empty($todo))
$link = array_shift($todo);
$done[] = $link;
$html = get html;
$images += find <img> tags
$newLinks = find <a> tags
remove all external links and all links already in $done from $newLinks
$todo += $newLinks;
Or something like that...
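A runnable version of that loop might look roughly like this. It is a sketch under several assumptions: it uses the built-in DOMDocument rather than simple_html_dom, it only follows absolute links on the starting host, and the names crawl_images, $fetch and $maxPages are made up for illustration. $fetch is any callable that takes a URL and returns the HTML, or false on failure.

```php
<?php
// Breadth-first crawl of a single host, collecting <img> src values
// (hypothetical helper).
function crawl_images(string $start, callable $fetch, int $maxPages = 50): array
{
    $host   = parse_url($start, PHP_URL_HOST);
    $todo   = [$start];
    $done   = [];
    $images = [];

    while ($todo && count($done) < $maxPages) {
        $link = array_shift($todo);
        if (isset($done[$link])) {
            continue;                   // already processed
        }
        $done[$link] = true;

        $html = $fetch($link);
        if ($html === false || trim($html) === '') {
            continue;                   // fetch failed or page was empty
        }

        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();

        foreach ($dom->getElementsByTagName('img') as $img) {
            $images[$img->getAttribute('src')] = true;   // keys deduplicate
        }
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            // Follow only absolute links that stay on the starting host.
            if (parse_url($href, PHP_URL_HOST) === $host && !isset($done[$href])) {
                $todo[] = $href;
            }
        }
    }
    return array_keys($images);
}
```

For a real site you could call crawl_images('http://fffmovieposters.com/', 'file_get_contents'). Relative links are ignored here; resolving them against the current page would be the next step.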

PHP appendChild giving nice fatal error -Uncaught exception 'DOMException' with message 'Hierarchy Request Error - how can I add html AFTER a tag

In my piece of code I'm trying to find all img tags using PHP DOM, add another img tag directly after and then wrap all of that in a div, i.e.
<!-- From this... -->
<img src="originalImage.jpg" />
<!-- ...to this... -->
<div class="wrappingDiv">
<img src="originalImage.jpg" />
<img src="newImage.jpg" />
</div>
This is the PHP that I'm bastardising trying:
$dom = new domDocument;
$dom->loadHTML($the_content_string);
$dom->preserveWhiteSpace = false;
//get all images and chuck them in an array
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
//create the surrounding div
$div = $image->ownerDocument->createElement('div');
$image->setAttribute('class','main-image');
$added_a = $image->parentNode->insertBefore($div,$image);
$added_a->setAttribute('class','theme-one');
$added_a->appendChild($image);
//create the second image
$secondary_image = $image->ownerDocument->createElement('img');
$added_img = $image->appendChild($secondary_image);
$added_img->setAttribute('src', $twine_img_url);
$added_img->setAttribute('class', $twine_class);
$added_img->appendChild($image);
}
echo $dom->saveHTML();
Everything up to where I create the $added_img variable works fine. Well, at the very least it doesn't error. It's those last four lines that kill it dead.
I'm clearly doing something relatively idiotic... Any lovely, lovely people out there able to point out where I've douched things up?
First:
You are trying to append an image to an image here (but of course in HTML an image cannot have a child image; append the images to the div):
$added_img = $image->appendChild($secondary_image);
it has to be
$added_img = $added_a->appendChild($secondary_image);
and again here:
$added_img->appendChild($image);
has to be:
$added_a->appendChild($image);
But this will not work at all, because NodeLists are live. As soon as you append a new image, it becomes part of $images and you run into an infinite loop. So first populate an array with the initial images instead of iterating over the NodeList.
$imageList= $dom->getElementsByTagName('img');
$images=array();
for($i=0;$i<$imageList->length;++$i)
{
$images[]=$imageList->item($i);
}
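Putting both points together, the whole thing could be sketched as below. The function name wrap_images and its parameters are made up for illustration, and iterator_to_array() is used here as a shorter way to take the snapshot than the for loop above.

```php
<?php
// Wrap every <img> in a div and add a second image right after it
// (hypothetical helper).
function wrap_images(string $html, string $secondSrc): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Snapshot the live NodeList before adding any new <img> elements.
    $images = iterator_to_array($dom->getElementsByTagName('img'));

    foreach ($images as $image) {
        $div = $dom->createElement('div');
        $div->setAttribute('class', 'wrappingDiv');
        $image->parentNode->insertBefore($div, $image); // div goes where the image was
        $div->appendChild($image);                      // moves the original inside the div

        $second = $dom->createElement('img');
        $second->setAttribute('src', $secondSrc);
        $div->appendChild($second);                     // lands after the original image
    }
    // saveHTML() returns a full document, including doctype, html and body.
    return $dom->saveHTML();
}
```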

get the href value of a specific element and load it

I'm using jQuery to add rel=brochure using $('.imageOuter a').attr('rel', 'brochure'); this works as expected.
However, I want to grab the link that has rel as brochure. I'm trying to do this with loadHTML, as below:
function getBrochureLink() {
$doc = new DOMDocument();
$doc->loadHTML($file);
$area = $doc->getElementsByTagName('body')->item(0);
$links = $area->getElementsByTagName("link");
foreach($links as $l) {
if($l->getAttribute("rel") == "brochure") {
$brochureLink = $l->getAttribute("href");
}
}
}
Sadly $brochureLink is empty and not grabbing it.
Your issue is that the attr is set via Javascript. When you retrieved the page's contents via loadHTML, the JS was not executed, so you can't find the matching link.
You'll have to either run the JS on the server side, put the attr into the DOM directly without JS, or find another architecture for whatever you're attempting to accomplish.
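If you go with putting the attribute into the markup directly, the lookup itself could be sketched like this. It assumes rel="brochure" is already present in the HTML the server sees and that it sits on an <a> element (the function in the question searches for link elements and never returns a value). The function name get_brochure_link is made up for illustration.

```php
<?php
// Return the href of the first <a rel="brochure"> in $html, or null if none
// (hypothetical helper).
function get_brochure_link(string $html): ?string
{
    libxml_use_internal_errors(true);   // tolerate invalid real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $node  = $xpath->query('//a[@rel="brochure"][@href]')->item(0);
    return $node ? $node->getAttribute('href') : null;
}
```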
