How do I get the img URL from inside an a href link using DOM loadHTML? I tried using $link->nodeValue to get the img src URL, but it is not working.
Example URL source:
<img src="www.google.com/test.jpg" />Photo NodeValue
My PHP code:
// -------------------------------------------------------------------------
// ----------------------- Get URLs From Source
// -------------------------------------------------------------------------
function getVidesURL($url) {
$web_source = $this->getSource($url);
if($web_source != '') {
$Data = $this->Websites_Data[$this->getHost($url)];
preg_match($Data['Index_Preg_Match'], $web_source, $Videos_Page);
$Videos_Page = $Videos_Page[$Data['Index_Preg_Match_Num']];
if($Videos_Page != '') {
$dom = new DOMDocument;
@$dom->loadHTML($Videos_Page);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$Video_Status = "";
$Video_Error = "";
$Video = array(
"URL" => $link->getAttribute('href'),
"Title" => $link->getAttribute('title'),
"MSG" => $link->nodeValue,
);
// Get Image URL Start
$dom = new DOMDocument;
@$dom->loadHTML($Video['MSG']);
$Video_Image = $dom->getElementsByTagName('img');
foreach ($Video_Image as $Image) {
$Video = array(
"IMG" => $link->getAttribute('src'),
);
}
$Videos_URLs .= $Video['IMG'] . '<br />';
}
// Get Image URL Stop
return $Videos_URLs;
}
}
}
The only problem with my code is that I don't know how to get the img URL from inside the a href.
Here is a small function that can pull out image sources from an HTML input:
<?php
echo PHP_EOL;
var_dump(getImgSrcFromHTML('<img src="www.google.com/test.jpg" />Photo NodeValue<div><img src="www.google.com/test2.jpg" /></div><table><tr><td><img src="www.google.com/test3.jpg" /></td></tr></table>'));
echo PHP_EOL;
function getImgSrcFromHTML($html){
$doc = new DOMDocument();
$doc->loadHTML($html);
$imagePaths = array();
$imageTags = $doc->getElementsByTagName('img');
foreach ($imageTags as $tag) {
$imagePaths[] = $tag->getAttribute('src');
}
if(!empty($imagePaths)) {
return $imagePaths;
} else {
return false;
}
}
Hope this helps.
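As for why $link->nodeValue did not work in the question: nodeValue only returns the text content of the link, so the <img> markup is never part of it. A minimal sketch that pairs each link's href with the src of any image nested inside it follows. The function name getLinkImages and the sample markup (an <img> wrapped in an <a> element) are assumptions for illustration, not taken from the original post.

```php
<?php
// Hypothetical helper: maps each <a href> to the src values of its nested <img> tags.
function getLinkImages($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $result = array();
    foreach ($doc->getElementsByTagName('a') as $link) {
        $sources = array();
        // Calling getElementsByTagName() on the link searches only its descendants.
        foreach ($link->getElementsByTagName('img') as $img) {
            $sources[] = $img->getAttribute('src');
        }
        $result[$link->getAttribute('href')] = $sources;
    }
    return $result;
}

// Assumed sample markup: the image sits inside the link.
$sample = '<a href="page.html" title="Photo"><img src="www.google.com/test.jpg" />Photo NodeValue</a>';
$linkImages = getLinkImages($sample);
print_r($linkImages);
```

Because the inner loop runs on the link element itself, there is no need to build a second DOMDocument from the link's text.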
Related
I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from that one:
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
function get_links($link)
{
$ret = array();
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($link));
$dom->preserveWhiteSpace = false;
$links = $dom->getElementsByTagName('a');
foreach ($links as $tag){
$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
}
return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url . PHP_EOL;
}
If you inspect the page to get the img src, you'll see something like this: /images/March/img1.jpeg. But as you know, that's not a full address. I want to scrape this page and get the proper src value. How can I do that?
Thanks in advance.
<?php
$content=file_get_contents('http://www.example.com/');
$dom= new DOMDocument();
$dom->loadHTML($content);
$xpath=new DOMXPath($dom);
$img=$xpath->query("(//img)[2]/@src");
foreach($img as $val){
$images=$val->nodeValue;//just returns img/march/img1.jpeg
//instead of www.example.com/img.....
}
?>
You have to build the absolute URL manually, like this:
<?php
$content = file_get_contents('http://www.example.com/');
$dom = new DOMDocument();
$dom->loadHTML($content);
// DOMXPath needs the document it should query
$xpath = new DOMXPath($dom);
$img = $xpath->query("(//img)[2]/@src");
// Make Absolute Url
function getAbsUrl($value, $baseurl)
{
$Parsed = parse_url($value);
if (empty($Parsed['host'])) {
// Relative
return rtrim($baseurl, '/') . '/' . ltrim($Parsed['path'], '/');
} else {
return $value;
}
}
foreach ($img as $val) {
$images = getAbsUrl($val->nodeValue, 'http://www.example.com/');
}
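The helper above always appends the path to the base URL, which covers root-relative paths when the base is the site root. A slightly broader sketch follows that also handles protocol-relative and document-relative values. The name resolveUrl is hypothetical, and the function is an approximation: it ignores ports, query strings on the base, and ".." segments.

```php
<?php
// Hypothetical helper: resolves $value against $baseUrl for the most common cases.
function resolveUrl($value, $baseUrl) {
    // Already absolute: starts with a scheme such as http: or https:
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $value)) {
        return $value;
    }

    $base   = parse_url($baseUrl);
    $scheme = isset($base['scheme']) ? $base['scheme'] : 'http';
    $host   = isset($base['host']) ? $base['host'] : '';

    // Protocol-relative, e.g. //cdn.example.com/img.png
    if (strpos($value, '//') === 0) {
        return $scheme . ':' . $value;
    }

    $origin = $scheme . '://' . $host;

    // Root-relative, e.g. /images/March/img1.jpeg
    if (strpos($value, '/') === 0) {
        return $origin . $value;
    }

    // Document-relative: resolve against the directory part of the base path.
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $value;
}

$page = 'http://www.example.com/news/index.html';
echo resolveUrl('/images/March/img1.jpeg', $page), PHP_EOL;
echo resolveUrl('img/a.png', $page), PHP_EOL;
echo resolveUrl('//cdn.example.com/a.png', 'https://www.example.com/'), PHP_EOL;
```

Passing the URL of the page that was actually fetched as $baseUrl matters for document-relative paths, since they resolve against that page's directory rather than the site root.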
I need to change an <img> tag into a <video> tag. I do not know how to continue with the code so that it changes every tag whose src points to a WebM file.
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
}
}
}
$html = $Dom->saveHTML();
return $html;
}
Like Roman, I'm using http://php.net/manual/en/domnode.replacechild.php,
but I iterate with a for loop and test for the .webm extension in the src with a simple strpos().
$contents = <<<STR
this is some HTML with an <img src="test1.png"/> in it.
this is some HTML with an <img src="test2.png"/> in it.
this is some HTML with an <img src="test.webm"/> in it,
but it should be a video tag - when iframe() is done.
STR;
function iframe($text)
{
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$images = $dom->getElementsByTagName("img");
for ($i = $images->length - 1; $i >= 0; $i --) {
$nodePre = $images->item($i);
$src = $nodePre->getAttribute('src');
// search in src for ".webm"
if(strpos($src, '.webm') !== false ) {
$nodeVideo = $dom->createElement('video');
$nodeVideo->setAttribute("src", $src);
$nodeVideo->setAttribute("controls", '');
$nodePre->parentNode->replaceChild($nodeVideo, $nodePre);
}
}
$html = $dom->saveHTML();
return $html;
}
echo iframe($contents);
Part of output:
this is some HTML with an <video src="test.webm"></video> in it,
but it should be a video tag - when iframe() is done.
Use this code:
(...)
if( strtolower( $pathinfo['extension'] ) === 'webm')
{
//If extension webm change tag to <video>
$new = $Dom->createElement( 'video', $link->nodeValue );
foreach( $link->attributes as $attribute )
{
$new->setAttribute( $attribute->name, $attribute->value );
}
$link->parentNode->replaceChild( $new, $link );
}
(...)
The code above creates a new node with the video tag and the img node's nodeValue, copies all of the img attributes onto the new node, and finally replaces the old node with the new one.
Please note that if the old node has an id, the code will produce a warning.
Solution with DOMDocument::createElement and DOMNode::replaceChild functions:
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
// Snapshot the live node list so that replacing nodes does not skip elements
$links = iterator_to_array($Dom->getElementsByTagName('img'));
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
$video = $Dom->createElement('video');
$video->setAttribute("src", $href);
$video->setAttribute("controls", '');
$link->parentNode->replaceChild($video, $link);
}
}
}
$html = $Dom->saveHTML();
return $html;
}
http://php.net/manual/en/domdocument.createelement.php
http://php.net/manual/en/domnode.replacechild.php
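One more variant, as a sketch rather than a definitive version: it copies the live node list into an array before replacing anything, and it checks the extension on the URL path only, so a src with a query string or without any extension does not trip it up. The function name imgToVideo and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical variant of iframe(): swaps <img> tags whose src path ends in .webm for <video>.
function imgToVideo($text) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($text);
    libxml_clear_errors();

    // Copy the live node list into an array so replacing nodes does not disturb iteration.
    $images = iterator_to_array($dom->getElementsByTagName('img'));
    foreach ($images as $img) {
        $src = $img->getAttribute('src');
        // Look at the path only, so "clip.webm?v=2" still counts as WebM.
        $path = (string) parse_url($src, PHP_URL_PATH);
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($extension !== 'webm') {
            continue;
        }
        $video = $dom->createElement('video');
        $video->setAttribute('src', $src);
        $video->setAttribute('controls', '');
        $img->parentNode->replaceChild($video, $img);
    }
    return $dom->saveHTML();
}

$converted = imgToVideo('<p><img src="a.png"><img src="clip.webm?v=2"><img src="b.webm"></p>');
echo $converted;
```

Using pathinfo() with PATHINFO_EXTENSION returns an empty string when there is no extension, which avoids the undefined index notice that $pathinfo['extension'] can raise.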
I want to select all URLs from an HTML page into an array like:
This is a webpage with
different kinds of <img src="someimg.png">
The output I would like is:
with => http://somesite.se/link1.php
Now I get:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the URLs/links that contain an image between the opening and closing tags, only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ = suppresses warnings from malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>
Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
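An alternative that stays inside the DOM instead of running a regex over the serialized markup is to ask each link whether it has any <img> descendants. This is only a sketch: the function name getTextPdfLinks and the sample URLs are made up for illustration.

```php
<?php
// Hypothetical helper: collects links to PDF files that contain no image, keyed by link text.
function getTextPdfLinks($html) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (!preg_match('/\.pdf$/i', $href)) {
            continue;   // not a PDF link
        }
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;   // the link wraps an image, skip it
        }
        $result[trim($link->textContent)] = $href;
    }
    return $result;
}

// Assumed sample markup: one image link and one text link to the same PDF.
$sample = '<a href="http://somesite.com/doc1.pdf"><img src="someimg.png"></a>'
        . '<a href="http://somesite.com/doc1.pdf">with</a>';
$pdfLinks = getTextPdfLinks($sample);
print_r($pdfLinks);
```

Since the check looks at the parsed tree, it does not depend on how the <img> tag happens to be written in the source.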
I would like to extract all img tags that are inside an anchor tag using the PHP DOM object.
I am trying it with the code below, but it returns every anchor tag, and the text of an anchor is empty when it only contains an img tag.
function get_links($url) {
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the url's contents into the DOM
@$xml->loadHTMLFile($url);
// Empty array to hold all links to return
$links = array();
//Loop through each <a> tag in the dom and add it to the link array
foreach($xml->getElementsByTagName('a') as $link)
{
$hrefval = '';
if(strpos($link->getAttribute('href'),'www') > 0)
{
//$links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
else
{
//$links[] = array('url' => GetMainBaseFromURL($url).$link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.GetMainBaseFromURL($url).$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
}
foreach($xml->getElementsByTagName('img') as $link)
{
$srcval = '';
if(strpos($link->getAttribute('src'),'www') > 0)
{
//$links[] = array('src' => $link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
else
{
//$links[] = array('src' => GetMainBaseFromURL($url).$link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.GetMainBaseFromURL($url).$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
}
//Return the links
//$links = unsetblankvalue($links);
return $links;
}
This returns all anchor tags and all img tags separately.
$xml = new DOMDocument;
libxml_use_internal_errors(true);
$xml->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($xml);
foreach ($xpath->query('//a[contains(@href, "www")]/img') as $entry) {
var_dump($entry->getAttribute('src'));
}
strpos() is used incorrectly in the code: it returns 0 when the match is at the very start of the string, so the > 0 test treats that case as "not found".
Instead of using
if(strpos($link->getAttribute('href'),'www') > 0)
Use
if(strpos($link->getAttribute('href'),'www')!==false )
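A short illustration of the difference between the two checks. The href values are made up for the example.

```php
<?php
// Hypothetical href that starts with "www", so the match is at position 0.
$href = 'www.example.com/page.html';

$position    = strpos($href, 'www');       // int(0): found at the first character
$looseCheck  = ($position > 0);            // false: 0 is not greater than 0
$strictCheck = ($position !== false);      // true: a match was found

// When the needle is absent, strpos() returns false rather than a position.
$missing = strpos('example.com/page.html', 'www');

var_dump($position, $looseCheck, $strictCheck, $missing);
```

With the > 0 test, any href that begins with "www" falls into the else branch and gets the base URL prepended, which is probably not what was intended.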