Nested foreach loops - php

I know the solution is simple, but it keeps slipping my mind. When I parse the page with this code and print the $links array, all of the href parts are correct, yet the img part only holds the last src found on the page.
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
// get links
$href = $item->getAttribute("href");
// get images.
foreach ($images as $item) {
$img = $item->getAttribute('src');
}
$links[] = array(
'href' => $href,
'img' => $img
);
}
print_r(array_values($links));

The foreach statement for images should build an array, so that the final array ($links) is multi-dimensional, with $img as the nested array.

Check if this works for you:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
// get links
$href = $item->getAttribute("href");
// get images.
foreach ($images as $item) {
$img = $item->getAttribute('src');
// storing the image src
$links[] = array(
'img' => $img
);
}
$links[] = array(
'href' => $href
);
}
print_r(array_values($links));

You reuse the variable $item in the inner foreach, so the inner loop overwrites the outer loop's value.
Try this without the inner foreach:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $key=>$item) {
// get links
$href = $item->getAttribute("href");
// get the image at the same position, if there is one
$image = $images->item($key);
$img = $image ? $image->getAttribute('src') : null;
$links[] = array(
'href' => $href,
'img' => $img
);
}
unset($item);
print_r(array_values($links));
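
If the images are nested inside the links (an assumption; the question does not show the page markup), another option is to search for img elements within each anchor rather than across the whole document. The sketch below uses hypothetical sample markup and a separate loop variable for each loop, and builds the nested array the question describes:

```php
<?php
// Hypothetical sample markup; the real $html comes from the page being parsed.
$html = '<div><a href="/one"><img src="a.png"><img src="b.png"></a>'
      . '<a href="/two">text only</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $srcs = array();
    // Search only inside the current anchor, with a distinct loop variable.
    foreach ($anchor->getElementsByTagName('img') as $image) {
        $srcs[] = $image->getAttribute('src');
    }
    $links[] = array(
        'href' => $anchor->getAttribute('href'),
        'img'  => $srcs,
    );
}
print_r($links);
```

A link without images simply gets an empty 'img' array, so every entry has the same shape.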

Related

how to alter and then show attributes in html with php

In my table, I have a row that contains a string like this:
<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>
I want to give the <img> tag an alt attribute. I've got quite close, but somehow my code still outputs two <img> tags although the string only has one. Can anyone tell me what I'm doing wrong?
This is my code so far:
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str);
$content = $dom->getElementsByTagName('*');
foreach ($content as $i => $node)
{
if ($node->nodeName == 'html' || $node->nodeName == 'body')
{
continue; // dont need to process these tags, right?
}
if ($node->nodeName == 'img')
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML($node);
}
$content = $dom->getElementsByTagName('img');
foreach ($content as $node) {
$img_src = $node->getAttribute('src');
$filename = basename($img_src);
$node->setAttribute('alt', $filename);
}
echo $dom->saveHTML();
Loop only through images with $content = $dom->getElementsByTagName('img');
Move $dom->saveHTML(); after the loop.
Get filename with $filename = basename($img_src);
The slightly changed code below does the work. It only gets the img tags and saves the HTML outside the loop. Note that I changed the way that HTML was loaded, to not include the wrapper tags.
<?php
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveHTML();
The problem is that when you use
echo $dom->saveXML($node);
in the loop, it will output for various tags and so the output is not the end result, but a combination of other parts of the document.
Try changing it to
echo $node->nodeName."=>".$dom->saveXML($node).PHP_EOL;
to see what it does.
You could just remove the current echo and add
echo $dom->saveXML();
after the end of the loop.
Alternatively, if you just want to process the <img> tags, you can limit the loop more specifically...
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML();

Array filter in PHP

I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2. It can be empty, or it can have four or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assume that two elements in the $links2 array contain $word. When I try to filter $links2 by removing the matching elements:
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
the filter removes only one element, and array_diff doesn't solve the problem. Any suggestions?
Solved by adding a condition inside the loop:
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document') === false) {
$links2[] = array(
'value' => $link->textContent,
);
}
}
$su=count($links2);
echo $su;
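
A likely reason the original unset() call removed only one element: array_search() looks for an element equal to $word, but each element of $links2 is an array whose 'value' merely contains $word. The search therefore returns false, and false used as an array key is treated as 0, so only the first element is removed. If the array has already been built, it can also be filtered afterwards. The following is a minimal sketch with hypothetical sample data in the same shape as $links2:

```php
<?php
// Hypothetical sample data in the same shape as $links2 from the question.
$links2 = array(
    array('value' => 'First paragraph'),
    array('value' => 'document.write("ad one");'),
    array('value' => 'Second paragraph'),
    array('value' => 'document.write("ad two");'),
);
$word = 'document.write(';

// Keep only the rows whose text does not contain $word.
$filtered = array_filter($links2, function ($row) use ($word) {
    return strpos($row['value'], $word) === false;
});

// array_filter preserves the original keys, so reindex for a clean 0..n list.
$filtered = array_values($filtered);
print_r($filtered);
```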

PHP: DOM get url and anchors (but not IMG)

I want to select all URLs from an HTML page into an array like:
This is a webpage with
different kinds of <img src="someimg.png">
The output I would like is:
with => http://somesite.se/link1.php
Now I get:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the links that contain an image between the opening <a> and closing </a> tags, only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ = suppresses warnings from malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>
Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
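
The same check can also be done on the DOM itself instead of with a regular expression on the serialized markup. The sketch below uses hypothetical sample markup and, as a deliberate simplification, uses the link's trimmed text content as the array key rather than the innerHTML() helper from the question:

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<p><a href="http://somesite.com/link1.pdf"><img src="someimg.png"></a> '
      . '<a href="http://somesite.com/link1.pdf">with</a> '
      . '<a href="http://somesite.com/page.php">not a pdf</a></p>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$result = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // True when the anchor has at least one <img> descendant.
    $hasImage = $link->getElementsByTagName('img')->length > 0;
    if (preg_match('/\.pdf$/i', $href) && !$hasImage) {
        $result[trim($link->textContent)] = $href;
    }
}
print_r($result);
```

Checking the node list length avoids depending on how the img tag happens to be written in the source.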

creating multidimensional array with two arrays

I am indexing web pages. The code scans a given web page for its links and its title. The links and the title are stored in two different arrays. I would like to create a multidimensional array that has the word Array, followed by the links, followed by the individual titles of the links. I have the code; I just don't know how to put it together.
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
//links
$links = Array();
$URL = 'http://www.youtube.com'; // change it for urls to grab
// grabs the urls from URL
$file = file_get_html($URL);
foreach ($file->find('a') as $theelement) {
$links[] = url_to_absolute($URL, $theelement->href);
}
print_r($links);
//titles
$titles = Array();
$str = file_get_contents($URL);
$titles[] = preg_match_all( "/\<title\>(.*)\<\/title\>/", $str, $title );
print_r($title[1]);
You should be able to do this, assuming there are as many links as titles, so that they correspond to the same array keys.
$newArray = array();
foreach ($links as $key=>$val)
{
$newArray[$key]['link'] = $val;
$newArray[$key]['title'] = $titles[$key];
}
It is not clear what you want.
Anyway, here is how I would rewrite your code in a more organized way:
require_once('simplehtmldom_1_5/simple_html_dom.php');
require_once('url_to_absolute/url_to_absolute.php');
$info = array();
$urls = array(
'http://www.youtube.com',
'http://www.google.com.br'
);
foreach ($urls as $url)
{
$str = file_get_contents($url);
$html = str_get_html($str);
// find() with an index returns a single element instead of an array
$title = strval($html->find('title', 0)->plaintext);
$links = array();
foreach($html->find('a') as $anchor)
{
$links[] = url_to_absolute($url, strval($anchor->href));
}
$links = array_unique($links);
$info[$url] = array(
'title' => $title,
'links' => $links
);
}
print_r($info);

Problem with sort and then output

I'm using the PHP below to generate some HTML output:
<?php
$url = "images.xml";
$xmlstr = file_get_contents($url);
$xml = new SimpleXMLElement($xmlstr);
$images = array();
$ids = array();
foreach ($xml->image as $image) {
$images[]['id'] = $image -> id;
$images[]['link'] = $image->href;
$images[]['src'] = $image->source;
$images[]['title'] = $image->title;
$images[]['alt'] = $image->alt;
$ids[] = $image -> id;
}
array_multisort($ids, SORT_ASC, $images);
foreach ($images as $image){
echo "<a href='".$image['link']."'><img src='".$image['src']."' alt='".$image['alt']."' title='".$image['title']."' /></a>";
}
?>
If I change the code here:
foreach ($images as $image){
echo $image['link'];
echo "Item";
}
I get the image link 3 times, which is correct because there are 3 records in the XML. But I get 12 copies of the text Item.
Why is this happening?
You're putting each attribute in a new row in the array.
Try this:
foreach ($xml->image as $image)
{
$images[] = array(
'id' => $image->id,
'link' => $image->href,
'src' => $image->source,
'title' => $image->title,
'alt' => $image->alt
);
$ids[] = $image -> id;
}
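
To illustrate the one-row-per-image approach end to end, here is a minimal sketch. The inline XML is hypothetical (images.xml is not shown in the question), the casts turn the SimpleXMLElement nodes into plain values, and usort() is swapped in for the separate $ids array and array_multisort():

```php
<?php
// Hypothetical XML in the shape the question implies (images.xml is not shown).
$xmlstr = '<images>'
        . '<image><id>2</id><href>/b</href><source>b.png</source><title>B</title><alt>b</alt></image>'
        . '<image><id>1</id><href>/a</href><source>a.png</source><title>A</title><alt>a</alt></image>'
        . '</images>';
$xml = new SimpleXMLElement($xmlstr);

$images = array();
foreach ($xml->image as $image) {
    // One row per image; casts turn SimpleXMLElement nodes into plain values.
    $images[] = array(
        'id'    => (int) $image->id,
        'link'  => (string) $image->href,
        'src'   => (string) $image->source,
        'title' => (string) $image->title,
        'alt'   => (string) $image->alt,
    );
}

// Sort the rows by id in ascending order.
usort($images, function ($a, $b) {
    return $a['id'] - $b['id'];
});
print_r($images);
```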
