How to alter and then show attributes in HTML with PHP - php

In my table, I have a row that contains a string like this:
<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>
I want to give the <img> tag an alt attribute. I've got quite close, but my code still outputs two <img> tags although the string only has one. Can anyone tell me what I'm doing wrong?
This is my code so far:
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str);
$content = $dom->getElementsByTagName('*');
foreach ($content as $i => $node)
{
if ($node->nodeName == 'html' || $node->nodeName == 'body')
{
continue; // dont need to process these tags, right?
}
if ($node->nodeName == 'img')
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML($node);
}

$content = $dom->getElementsByTagName('img');
foreach ($content as $node) {
$img_src = $node->getAttribute('src');
$filename = basename($img_src);
$node->setAttribute('alt', $filename);
}
echo $dom->saveHTML();
Loop only through the images with $content = $dom->getElementsByTagName('img');
Move $dom->saveHTML(); after the loop.
Get the filename with $filename = basename($img_src);

The slightly changed code below does the work. It only gets the img tags and saves the HTML outside the loop. Note that I changed the way that HTML was loaded, to not include the wrapper tags.
<?php
$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
$new_html = '';
$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveHTML();

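The problem is that when you use
echo $dom->saveXML($node);
in the loop, each matched element is serialized together with its descendants. The <img> is therefore printed once as part of its parent <p> and once more when the loop reaches the <img> itself, so the output is not the final document but a series of overlapping fragments.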
Try changing it to
echo $node->nodeName."=>".$dom->saveXML($node).PHP_EOL;
to see what it does.
You could just remove the current echo and add
echo $dom->saveXML();
after the end of the loop.
Alternatively, if you just want to process the <img> tags, you can limit the loop more specifically...
$content = $dom->getElementsByTagName('img');
foreach ($content as $i => $node)
{
$img_src = $node->getAttribute('src');
$path_arr = explode('/', $img_src);
$filename = $path_arr[count($path_arr)-1]; // myface.jpg
$alt = 'blah';
$node->setAttribute('alt', $alt);
}
echo $dom->saveXML();
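If only the original fragment is wanted back, without the doctype, <html> and <body> wrappers that loadHTML adds, one option is to serialize the children of <body> one by one. The following is a minimal sketch under that assumption; the function name addAltFromFilename is made up for illustration, and the alt text is simply the file name, as in the basename() suggestion above.

```php
<?php
// Hypothetical helper: adds an alt attribute (the file name) to every <img>
// that lacks one and returns the fragment without <html>/<body> wrappers.
function addAltFromFilename(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Setting attributes does not change the node list, so foreach is safe here.
    foreach ($dom->getElementsByTagName('img') as $img) {
        if (!$img->hasAttribute('alt')) {
            $img->setAttribute('alt', basename($img->getAttribute('src')));
        }
    }

    // loadHTML wraps the fragment in <html><body>; emit only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$str = '<p>hello</p><p>this is patrick</p><p><img src="/assets/img/myface.jpg" width="320" height="320"/></p>';
echo addAltFromFilename($str);
```

Because the whole result is built after the loop, the <img> appears exactly once in the output.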

Related

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, checking for the right rel value, and then take the href attribute from that...
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
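For comparison, the XPath route mentioned above might look like the sketch below. The sample XML is made up for illustration and uses plain, un-namespaced <link> elements loaded with loadXML; a real converted feed may use namespaces or different rel values, in which case the expressions would need adjusting.

```php
<?php
// Made-up sample data for illustration; a real feed will differ.
$xml = <<<XML
<rss><channel>
<item>
<title>First post</title>
<link rel="alternate" href="http://example.com/alt/1"/>
<link rel="self" href="http://example.com/self/1"/>
</item>
</channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($xml);
$xpath = new DOMXPath($feed);

$rows = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    // Expressions are evaluated relative to the current <item>.
    // string(...) yields '' when no matching node exists.
    $url   = $xpath->evaluate('string(link[@rel="self"]/@href)', $item);
    $title = $xpath->evaluate('string(title)', $item);
    $rows[] = ['url' => $url, 'title' => $title];
    echo 'url: ' . $url . PHP_EOL . 'title: ' . $title . PHP_EOL;
}
```

Wrapping the expression in string() avoids the null check that item(0)->nodeValue would otherwise need.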
function get_links($link)
{
$ret = array();
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($link));
$dom->preserveWhiteSpace = false;
$links = $dom->getElementsByTagName('a');
foreach ($links as $tag){
$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
}
return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url . PHP_EOL;
}

PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one

I need to change an <img> tag to a <video> tag when its src points to a WebM file. I do not know how to continue with the code so that every such tag is changed.
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
}
}
}
$html = $Dom->saveHTML();
return $html;
}
Like Roman, I'm using DOMNode::replaceChild (http://php.net/manual/en/domnode.replacechild.php), but I iterate backwards with a for loop and test for the .webm extension in the src with a simple strpos().
$contents = <<<STR
this is some HTML with an <img src="test1.png"/> in it.
this is some HTML with an <img src="test2.png"/> in it.
this is some HTML with an <img src="test.webm"/> in it,
but it should be a video tag - when iframe() is done.
STR;
function iframe($text)
{
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$images = $dom->getElementsByTagName("img");
for ($i = $images->length - 1; $i >= 0; $i --) {
$nodePre = $images->item($i);
$src = $nodePre->getAttribute('src');
// search in src for ".webm"
if(strpos($src, '.webm') !== false ) {
$nodeVideo = $dom->createElement('video');
$nodeVideo->setAttribute("src", $src);
$nodeVideo->setAttribute("controls", '');
$nodePre->parentNode->replaceChild($nodeVideo, $nodePre);
}
}
$html = $dom->saveHTML();
return $html;
}
echo iframe($contents);
Part of output:
this is some HTML with an <video src="test.webm"></video> in it,
but it should be a video tag - when iframe() is done.
Use this code:
(...)
if( strtolower( $pathinfo['extension'] ) === 'webm')
{
//If extension webm change tag to <video>
$new = $Dom->createElement( 'video', $link->nodeValue );
foreach( $link->attributes as $attribute )
{
$new->setAttribute( $attribute->name, $attribute->value );
}
$link->parentNode->replaceChild( $new, $link );
}
(...)
The code above creates a new <video> node whose nodeValue is taken from the <img> node, copies all of the <img> attributes onto it, and finally replaces the old node with the new one.
Please note that if the old node has an id, the code will produce a warning.
Solution with DOMDocument::createElement and DOMNode::replaceChild functions:
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
$video = $Dom->createElement('video');
$video->setAttribute("src", $href);
$video->setAttribute("controls", '');
$link->parentNode->replaceChild($video, $link);
}
}
}
$html = $Dom->saveHTML();
return $html;
}
http://php.net/manual/en/domdocument.createelement.php
http://php.net/manual/en/domnode.replacechild.php
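One caveat with the foreach variants: the list returned by getElementsByTagName is live, so replacing nodes while iterating over it can cause elements to be skipped, which is presumably why the for-loop answer walks the list backwards. A minimal sketch of an alternative, assuming a snapshot taken with iterator_to_array is acceptable (the function name webmImagesToVideo is made up for illustration):

```php
<?php
// Hypothetical helper name; swaps <img src="*.webm"> for <video controls>.
function webmImagesToVideo(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live list first, so replacing nodes cannot disturb iteration.
    $images = iterator_to_array($dom->getElementsByTagName('img'));
    foreach ($images as $img) {
        $src = $img->getAttribute('src');
        // PATHINFO_EXTENSION returns '' when there is no extension.
        if (strtolower(pathinfo($src, PATHINFO_EXTENSION)) !== 'webm') {
            continue;
        }
        $video = $dom->createElement('video');
        $video->setAttribute('src', $src);
        $video->setAttribute('controls', '');
        $img->parentNode->replaceChild($video, $img);
    }
    return $dom->saveHTML();
}

echo webmImagesToVideo('<p><img src="a.webm"><img src="b.webm"><img src="c.png"></p>');
```

Using pathinfo() with PATHINFO_EXTENSION also avoids the undefined-index notice that $pathinfo['extension'] raises for a src without an extension.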

PHP Image Scraper only selected images

I'm trying to build a simple image scraper to scrape certain images from a site; however, what I have so far scrapes all the images:
<?php
$url = "http://www.techbuy.com.au/";
$html = file_get_contents('http://www.techbuy.com.au/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image)
{
$fimage = $image->getAttribute('src');
echo "<img src='$url" . "$fimage' ></img>";
}
?>
How can I make it, say, scrape the second image and leave the rest?
<?php
$url = "http://www.techbuy.com.au/";
$html = file_get_contents('http://www.techbuy.com.au/');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $key => $image)
{
if ($key === 1) {
$fimage = $image->getAttribute('src');
echo "<img src='$url" . "$fimage' ></img>";
}
}
?>
Grabs the second image.
if ($images->length >= 2) { $src = $images->item(1)->getAttribute("src"); }

How to get all images from a webpage in all cases?

I am using this script to get all images from a generic external webpage:
$url = ANY URL HERE;
$html = @file_get_contents($url,false,$context);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
}
But in some cases, like this one (where the image is in "rel:image_src"),
<img src="http://example.com/example.png" rel:image_src="http://example.com/dir/me.jpg" />
it doesn't work.
How can I handle this?
You could output both:
foreach ($images as $image) {
echo $image->getAttribute('src');
echo $image->getAttribute('rel:image_src');
}
Check if the node has an attribute rel:image_src
foreach ($images as $image) {
if( $image->hasAttribute('rel:image_src') ) {
echo $image->getAttribute('rel:image_src');
} else {
echo $image->getAttribute('src');
}
}
If you want the rel:image_src to take precedence, check for the attribute's presence and use it selectively:
$url = ANY URL HERE;
$html = @file_get_contents($url,false,$context);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
if ($image->hasAttribute('rel:image_src'))
{
echo $image->getAttribute('rel:image_src');
}
else
{
echo $image->getAttribute('src');
}
}

PHP: DOM get url and anchors (but not IMG)

I want to select all URLs from an HTML page into an array like:
This is a webpage with
different kinds of <img src="someimg.png">
The output I would like is:
with => http://somesite.com/link1.php
Now I get:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the links that contain an image between the opening <a> and closing </a> tags, only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ = suppresses the warnings raised by malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>
Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
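As an alternative to matching the serialized markup with a second regex, the DOM itself can be asked whether an anchor contains an image. The sketch below assumes that the plain link text is an acceptable array key (the question's code uses the inner HTML instead); the function name pdfTextLinks and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: maps link text to href for .pdf links without an <img>.
function pdfTextLinks(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (!preg_match('/\.pdf$/i', $href)) {
            continue; // not a PDF link
        }
        // Skip anchors that have an <img> anywhere among their descendants.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }
        $result[trim($link->textContent)] = $href;
    }
    return $result;
}

print_r(pdfTextLinks(
    '<p><a href="http://somesite.com/a.pdf"><img src="someimg.png"></a> '
    . '<a href="http://somesite.com/b.pdf">with</a></p>'
));
```

Note that two links with identical text would overwrite each other, just as with the inner-HTML keys in the original code.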
