Extract an attribute from a specific element in DOM - php

I want to extract only the src of the second image in an HTML file. I am using the PHP DOM parser:
foreach($html->find('img[src]') as $element)
$src = $element->getAttribute('src');
echo $src;
However, I am getting the src of the last image in the page, instead of the one I am looking for.
Can I display only a specific src outside of the foreach loop?

Your loop is missing braces ({}), so it is equivalent to:
foreach($html->find('img[src]') as $element) {
$src = $element->getAttribute('src');
}
echo $src;
So the echo runs only once, after the last iteration of your loop, and prints the $src of the last element.

Using the example from their website, I'd go with this (braces are key here):
$count = 1;
foreach($html->find('img') as $element) {
if ($count == 2) {
echo $element->src;
break;
}
$count += 1;
}
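If the position is known up front, the loop can be skipped by indexing into the result directly. The sketch below uses PHP's built-in DOMDocument rather than the parser from the question, and the inline markup and file names are made up for illustration. Simple HTML DOM offers a comparable shortcut through the optional index argument of find().

```php
<?php
// Hypothetical sample markup standing in for the real page.
$markup = '<html><body><img src="a.png"><img src="b.png"><img src="c.png"></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

// item() is zero-based, so index 1 is the second image; it returns null
// when the page has fewer than two images.
$second = $doc->getElementsByTagName('img')->item(1);
$src = $second !== null ? $second->getAttribute('src') : null;

echo $src, PHP_EOL;
```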

Related

How to get image url by page in PHP

This is my code :
<form method="POST">
<input name="link">
<button type="submit">></button>
</form>
<title>GET IMAGE URL</title>
<?php
if (!isset($_POST['link'])) exit();
$link = $_POST['link'];
$parse = explode('.html', $link);
echo '<div id="pin" style="float:center"><textarea class="text" cols="110" rows="50">';
for ($i = 1; $i <=5; $i++)
{
if ($i > 1)
$link = "$parse[0]-$i.html";
$get = file_get_contents($link);
if (preg_match_all('/src="(.*?)"/', $get, $matches))
{
foreach ($matches[1] as $content)
echo $content."\r\n";
}
}
echo '</textarea>';
The site I'm trying to get the img src values from spreads its content over 10 to 15 pages, so I want my code to get all the img URLs until the last page. How can I do that without a fixed loop count?
If I use:
for ($i = 1; $i <=5; $i++)
this will only get the img URLs from 5 pages, but I want it to continue until the last page. Then I won't need to edit the loop every time I submit another URL with a different number of pages.
From this:
this will only get the img URLs from 5 pages, but I want it to continue until the last page. Then I won't need to edit the loop every time I submit another URL with a different number of pages.
I understand that your problem is the dynamic number of pages. Your URLs have a next-page link at the bottom:
下一页 ("Next page")
Detect it and fetch your images in a while loop:
<?php
// Link given in form
$link = "http://www.xiumm.org/photos/XiuRen-17305.html";
$parse = explode('.html', $link);
$i=1;
// Initialize a boolean
$nextPageFound = true;
while($nextPageFound) {
// Construct URL Every time when nextPageFound
if ($i == 1) {
$url = "$parse[0].html";
echo "First Page<br><br>";
} else {
$url = "$parse[0]-$i.html";
}
// Getting URL Contents
$get = file_get_contents($url);
if (preg_match_all('/src="(.*?)"/', $get, $matches))
{
// echoing contents
foreach ($matches[1] as $content)
echo $content."<br>";
}
// check nextPageBtn if available
if (strpos($get, '"nextPageBtn"') !== false) {
$nextPageFound = true;
// increment +1
$i++;
echo "<br>Page $i<br><br>";
} else {
$nextPageFound = false;
echo "THE END";
}
}
?>
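The same idea can be split into small functions so the stopping rule is easier to test. This is only a sketch: pageUrl(), extractSrcs(), collectSrcs(), the $maxPages cap, and the example.test URLs are all hypothetical names introduced here, and the fetcher is a stand-in so the code runs without network access.

```php
<?php
// Hypothetical helper: build the URL of page $i from the first page's URL.
function pageUrl(string $firstPageUrl, int $i): string
{
    $base = explode('.html', $firstPageUrl)[0];
    return $i === 1 ? $base . '.html' : $base . '-' . $i . '.html';
}

// Hypothetical helper: pull every src="..." value out of a chunk of HTML.
function extractSrcs(string $html): array
{
    preg_match_all('/src="(.*?)"/', $html, $matches);
    return $matches[1];
}

// Collect srcs page by page. $fetch is any callable that returns the HTML
// for a URL, or false on failure; $maxPages guards against an endless loop.
function collectSrcs(string $firstPageUrl, callable $fetch, int $maxPages = 100): array
{
    $srcs = [];
    for ($i = 1; $i <= $maxPages; $i++) {
        $html = $fetch(pageUrl($firstPageUrl, $i));
        if ($html === false) {
            break; // page could not be loaded
        }
        $srcs = array_merge($srcs, extractSrcs($html));
        if (strpos($html, '"nextPageBtn"') === false) {
            break; // no next-page marker: this was the last page
        }
    }
    return $srcs;
}

// Stand-in for file_get_contents(): two made-up pages keyed by URL.
$fakeSite = [
    'http://example.test/gallery.html'   => '<img src="1.jpg"><a class="nextPageBtn">next</a>',
    'http://example.test/gallery-2.html' => '<img src="2.jpg"><img src="3.jpg">',
];
$fetch = function (string $url) use ($fakeSite) {
    return $fakeSite[$url] ?? false;
};

$all = collectSrcs('http://example.test/gallery.html', $fetch);
print_r($all);
```

For a real site, passing 'file_get_contents' as $fetch should behave like the loop above, assuming the pages mark their next-page link with "nextPageBtn".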
You should use an HTML/XML parser, like DOMDocument, in combination with DOMXPath (XPath is a query language for (X)HTML and XML document structures):
// create DOMDocument
$doc = new DOMDocument();
// load remote HTML file
$doc->loadHTMLFile( $link );
// create DOMXPath
$xpath = new DOMXPath( $doc );
// fetch all IMG elements that have a src attribute
$nodes = $xpath->query( '//img[@src]' );
// loop through found IMG elements and echo their src attribute values
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->getAttribute( 'src' ) . PHP_EOL;
}
Regarding the XPath query //div[contains(@class,'pic_box')]//@src, mentioned by @Enuma in the comments:
The resulting DOMNodeList of that query will not contain DOMElement objects, but DOMAttr objects, because the query directly asks for attributes, not elements. Since DOMAttr represents an attribute and not an element, the method getAttribute() does not exist. To get the value of the attribute you have to use the property DOMAttr->value.
So, we have to slightly alter the relevant part of our example code from above to:
// loop through found src attributes and echo their value
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->value . PHP_EOL;
}
Putting it all together, our example code then becomes:
// create DOMDocument
$doc = new DOMDocument();
// load remote HTML file
$doc->loadHTMLFile( $link );
// create DOMXPath
$xpath = new DOMXPath( $doc );
// fetch all src attributes that are descendants of div.pic_box
$nodes = $xpath->query( "//div[contains(@class,'pic_box')]//@src" );
// loop through found src attributes and echo their value
for( $i = 0; $i < $nodes->length; $i++ ) {
echo $nodes->item( $i )->value . PHP_EOL;
}
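To see the DOMElement versus DOMAttr difference without a remote page, the two query styles can be run side by side on inline markup. The markup, class names, and file names below are invented for the example.

```php
<?php
// Hypothetical markup; only the images inside div.pic_box should match.
$markup = '<html><body>'
    . '<div class="pic_box"><img src="one.jpg"><img src="two.jpg"></div>'
    . '<div class="other"><img src="skip.jpg"></div>'
    . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Element query: each node is a DOMElement, so getAttribute() is available.
$fromElements = [];
foreach ($xpath->query('//div[contains(@class, "pic_box")]//img[@src]') as $img) {
    $fromElements[] = $img->getAttribute('src');
}

// Attribute query: each node is a DOMAttr, so read its value property.
$fromAttributes = [];
foreach ($xpath->query('//div[contains(@class, "pic_box")]//@src') as $attr) {
    $fromAttributes[] = $attr->value;
}

print_r($fromElements);
print_r($fromAttributes);
```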
PS: For DOMDocument to be able to load remote files, I believe the allow_url_fopen setting in php.ini needs to be enabled. But since it already appeared to be working for @Enuma, it's not actually relevant here.

Pass all results from a foreach loop to a new variable

I have used the following little bit of code to find all links on a page (home.php) and echoed them as URLs. It works fine, but how do I pass the results to a new variable? If I create a new variable:
$myvariable ="$element->href";
it only contains the last result of many.
// Create DOM from URL or file
$html = file_get_html('http://www.somewebsite.xxx/include/home.php');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Concatenate with a String Operator:
$myvar = '';
foreach($html->find('a') as $element) {
$myvar .= $element->href . '<br>';
}
Or use an array:
$myvar = array(); // start from an empty array so [] does not hit an earlier string value
foreach($html->find('a') as $element) {
$myvar[] = $element->href; // removed <br> for implode, you can add it back
}
// if you want the array as one string
$myvar = implode('<br>', $myvar);
Use an array:
// Create DOM from URL or file
$html = file_get_html('http://www.somewebsite.xxx/include/home.php');
$urls = array();
foreach($html->find('a') as $element) {
$urls[] = $element->href;
}
print_r($urls);
You could use an array to hold the values of all the links from the page in question. In the end, the array is the variable you are looking for. Here's how:
<?php
//USE THE HTML DOM PARSER TO PARSE ALL THE HTML DATA ON THE PAGE: $page
$page = 'http://www.somewebsite.xxx/include/home.php';
$html = file_get_html($page);
// LOOPING THROUGH THE DOM ELEMENTS SELECT ONLY THE <a> TAGS
// AND BUNDLE THEM INTO AN ARRAY...
// THE ARRAY NOW FORMS THE VARIABLE YOU HAD EXPECTED TO CREATE..
$arrAnchors = array(); // INITIALIZE $arrAnchors TO AN EMPTY ARRAY...
foreach($html->find('a') as $element) {
// PUSH ALL THE ANCHOR'S HREF ATTRIBUTES (URLs) INTO THE $arrAnchors ARRAY
$arrAnchors[] = $element->href . '<br>';
}
// NOW TRY TO DUMP THE CONTENT OF YOUR $arrAnchors....
var_dump($arrAnchors); // DISPLAYS A NUMERICALLY INDEXED ARRAY OF LINKS ON THE PAGE: $page
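For comparison, the same collect-into-an-array pattern can be written with PHP's built-in DOMDocument. This is a sketch on made-up inline markup rather than the home.php page from the question, so the URLs are placeholders.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$markup = '<html><body><a href="/home">Home</a><a href="/about">About</a><a>no href</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

// Append each href to the array instead of overwriting a single variable.
$urls = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $urls[] = $anchor->getAttribute('href');
    }
}

// Join the array only at the point where one string is needed.
$asString = implode('<br>', $urls);
echo $asString, PHP_EOL;
```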

PHP DOMDocument parentNode->replaceChild causing foreach to skip next item

I am parsing html in the $content variable with the DOMDocument to replace all iframes with images. The foreach is only replacing the ODD iframes. I have removed all the code in the foreach and found the piece of code causing this is: '$iframe->parentNode->replaceChild($link, $iframe);'
Why would the foreach be skipping every other iframe?
The code:
$count = 1;
$dom = new DOMDocument;
$dom->loadHTML($content);
$iframes = $dom->getElementsByTagName('iframe');
foreach ($iframes as $iframe) {
$src = $iframe->getAttribute('src');
$width = $iframe->getAttribute('width');
$height = $iframe->getAttribute('height');
$link = $dom->createElement('img');
$link->setAttribute('class', 'iframe-'.self::return_video_type($iframe->getAttribute('src')).' iframe-'.$count.' iframe-ondemand-placeholderImg');
$link->setAttribute('src', $placeholder_image);
$link->setAttribute('height', $height);
$link->setAttribute('width', $width);
$link->setAttribute('data-iframe-src', $src);
$iframe->parentNode->replaceChild($link, $iframe);
echo "here:".$count;
$count++;
}
$content = $dom->saveHTML();
return $content;
This is the problem line of code
$iframe->parentNode->replaceChild($link, $iframe);
A DOMNodeList, such as that returned from getElementsByTagName, is "live":
that is, changes to the underlying document structure are reflected in all relevant NodeList... objects
So when you remove the element (in this case by replacing it with another one) it no longer exists in the node list, and the next one in line takes its position in the index. Then when foreach hits the next iteration, and hence the next index, one will be effectively skipped.
Don't remove elements from the DOM via foreach like this.
An approach that works instead would be to use a while loop to iterate and replace until your $iframes node list is empty.
Example:
while ($iframes->length) {
$iframe = $iframes->item(0);
$src = $iframe->getAttribute('src');
$width = $iframe->getAttribute('width');
$height = $iframe->getAttribute('height');
$link = $dom->createElement('img');
$link->setAttribute('class', 'iframe-'.self::return_video_type($iframe->getAttribute('src')).' iframe-'.$count.' iframe-ondemand-placeholderImg');
$link->setAttribute('src', $placeholder_image);
$link->setAttribute('height', $height);
$link->setAttribute('width', $width);
$link->setAttribute('data-iframe-src', $src);
$iframe->parentNode->replaceChild($link, $iframe);
echo "here:".$count;
$count++;
}
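Another way around the live list, not taken from the answer above, is to copy the nodes into a plain PHP array first and iterate over that snapshot. The sketch below uses invented markup and a simplified replacement image to show the idea.

```php
<?php
// Hypothetical markup with three iframes to replace.
$markup = '<html><body>'
    . '<iframe src="a"></iframe><iframe src="b"></iframe><iframe src="c"></iframe>'
    . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($markup);
libxml_clear_errors();

// iterator_to_array() takes a snapshot, so later DOM changes do not
// shift the positions we are iterating over.
$iframes = iterator_to_array($dom->getElementsByTagName('iframe'));

foreach ($iframes as $iframe) {
    $img = $dom->createElement('img');
    $img->setAttribute('data-iframe-src', $iframe->getAttribute('src'));
    $iframe->parentNode->replaceChild($img, $iframe);
}

$remaining = $dom->getElementsByTagName('iframe')->length;
$replaced  = $dom->getElementsByTagName('img')->length;
echo "iframes left: $remaining, images created: $replaced", PHP_EOL;
```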
I faced this issue today and, guided by the answer above, wrote a simple solution:
$iframes = $dom->getElementsByTagName('iframe');
for ($i=0; $i< $iframes->length; $i++) {
$iframe = $iframes->item($i);
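if (/* condition to replace */ true) {
// replace $iframe here; the live list then shrinks by one
$i--; // step back so the node that moved into this index is not skipped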
}
}
Hope this helps.

For Loop - Return first 2 instead of all

<?php
require 'simple_html_dom.php';
$html = file_get_html("website" . date("Ymd"));
foreach($html->find('td[class=x]') as $element)
echo $element;
?>
I am using the above code to parse a website. Instead of returning all the td elements, I would like to return only the first two. I think I would need to edit the foreach loop. How can I do this? I have limited PHP experience.
One technique would be to use a counter
$counter = 0;
foreach ($html->find('td[class=x]') as $element) {
if($counter<=1){
echo $element;
}
$counter++;
}
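The limit can also be pushed into the query itself. The sketch below swaps in PHP's built-in DOMXPath and made-up table markup; it is not the Simple HTML DOM API from the question. With Simple HTML DOM, slicing the array returned by find() with array_slice() should give a similar result.

```php
<?php
// Hypothetical markup with three matching cells.
$markup = '<html><body><table><tr>'
    . '<td class="x">one</td><td class="x">two</td><td class="x">three</td>'
    . '</tr></table></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The parentheses make position() count across the whole result set,
// so only the first two matching cells in the document are returned.
$firstTwo = [];
foreach ($xpath->query('(//td[@class="x"])[position() <= 2]') as $cell) {
    $firstTwo[] = $cell->textContent;
}

print_r($firstTwo);
```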

PHP get element using a known attribute

So let's say I have:
<?php
$template = '<img src="{image}" editable="all image_all" />';
$template .= '<div>';
$template .= '<img src="{image}" editable="yes" />';
$template .= '</div>';
?>
Now what I would like is to make the script go through all the elements whose src is {image} and check whether any of them have editable="all" as the first value of the editable attribute. If so, get the second editable value (e.g. image_all) and include that in the src.
This task can be simplified with a library suggested in the comments, Simple HTML DOM Parser. It is as easy as this:
$images = array(); //an array for your images with {image} in src
$html = str_get_html("..."); // parse your markup (elided here) into a Simple HTML DOM object
foreach($html->find('img') as $element) {
if($element->src == '{image}') {
//add to the collection
$images[] = $element;
}
//Also you can compare for the editable attribute same way as above.
}
If you want to get the second editable value and save it in an array like $src, try this code:
$src = array(); // collects the second editable values
$content=new DOMDocument();
$content->loadHTML($template);
$elements=simplexml_import_dom($content);
$images=$elements->xpath('//img');
foreach ($images as $img) {
if(preg_match('/all /i', $img['editable']))
$src[]=substr($img['editable'],4) ;
}
print_r($src);
will output:
Array ( [0] => image_all )
Try this:
include('simple_html_dom.php');
$html = str_get_html('<div><img src="{image}" editable="all image_all" /><img src="{image}" editable="yes" /></div>');
$second_args= array();
foreach($html->find('img[src="{image}"]') as $element){
$editables = explode(' ',$element->editable);
if($editables[0] === "all"){
$second_args[] = $editables[1];
}
}
print_r($second_args);
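The same check can be done without an external library. The following is a sketch using PHP's built-in DOMDocument and DOMXPath on the template from the question; $secondValues is a name introduced here for illustration.

```php
<?php
// The template from the question, as a single string.
$template = '<img src="{image}" editable="all image_all" />'
    . '<div><img src="{image}" editable="yes" /></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($template);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$secondValues = [];
foreach ($xpath->query('//img[@src="{image}"]') as $img) {
    // Split the editable attribute on whitespace: "all image_all" => ["all", "image_all"].
    $parts = preg_split('/\s+/', trim($img->getAttribute('editable')));
    if ($parts[0] === 'all' && isset($parts[1])) {
        $secondValues[] = $parts[1];
    }
}

print_r($secondValues);
```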
