how to get exact img src in xpath - php

If you inspect the page to get the img src, you'll see something like this: /images/March/img1.jpeg. But that is a relative path, not a full address. I want to scrape this page and get the proper src value. How can I do that?
Thanks in advance.
<?php
$content = file_get_contents('http://www.example.com/');
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$img = $xpath->query("(//img)[2]/@src");
foreach ($img as $val) {
    $images = $val->nodeValue; // just returns img/march/img1.jpeg
    // instead of www.example.com/img.....
}
?>

You have to build the absolute URL manually, like this:
<?php
$content = file_get_contents('http://www.example.com/');
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom); // DOMXPath needs the document it will query
$img = $xpath->query("(//img)[2]/@src");

// Make absolute URL: prefix relative values with the base URL
function getAbsUrl($value, $baseurl)
{
    $parsed = parse_url($value);
    if (empty($parsed['host'])) {
        // Relative: join base URL and path with exactly one slash
        return rtrim($baseurl, '/') . '/' . ltrim($parsed['path'] ?? '', '/');
    }
    // Already absolute
    return $value;
}

$images = [];
foreach ($img as $val) {
    $images[] = getAbsUrl($val->nodeValue, 'http://www.example.com/');
}
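The helper above always joins the path to the site root. As a rough sketch of a slightly fuller resolver (the function name resolveUrl and the sample URLs are hypothetical, not from the answer), the version below also distinguishes protocol-relative, root-relative and document-relative values. It deliberately does not normalise "../" segments, so treat it as a starting point rather than a complete URL resolver.

```php
<?php
// Hypothetical helper: resolve $relative against the URL of the page it came from.
function resolveUrl(string $relative, string $base): string
{
    // Already absolute, e.g. "https://cdn.example.com/a.png"
    if (!empty(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative, e.g. "//cdn.example.com/a.png"
    if (strpos($relative, '//') === 0) {
        return $scheme . ':' . $relative;
    }

    // Root-relative, e.g. "/images/March/img1.jpeg"
    if (strpos($relative, '/') === 0) {
        return $origin . $relative;
    }

    // Document-relative: append to the directory part of the base path
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $relative;
}

echo resolveUrl('/images/March/img1.jpeg', 'http://www.example.com/news/index.html'), PHP_EOL;
echo resolveUrl('img1.jpeg', 'http://www.example.com/news/index.html'), PHP_EOL;
```

Passing the URL of the scraped page as $base (rather than only the host) is what lets document-relative values such as img1.jpeg resolve next to the page.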

Related

Web scraping with Xpath, grabbing img

I am trying to scrape some images from a page, but I can't grab them. My path is correct (I think), but XPath returns 0 results. Any idea what is wrong with my path?
function pageContent($url)
{
$html = cache()->rememberForever($url, function () use ($url) {
return file_get_contents($url);
});
$parser = new \DOMDocument();
$parser->loadHTML($html);
return $parser;
}
$url = 'https://sumai.tokyu-land.co.jp/osaka';
@$parser = pageContent($url);
$resimler = [];
$rota = new \DOMXPath($parser);
$images = $rota->query("//section//div[@class='p-articlelist-content-left']//div[@class='p-articlelist-content-img']//img");
foreach ($images as $image) {
$resimler[] = $image->getAttribute("src");
}
var_dump($resimler);

You were looking for a div[@class='p-articlelist-content-img'] instead of a ul.
In addition to that, you should not be hiding error messages with the @ operator; instead, use the libxml_use_internal_errors() function as it was intended.
Finally, the // search in XPath is expensive, so avoid it where possible, and you can get the attribute value directly from the query (I don't know if this is any more efficient though.)
function pageContent(String $url) : \DOMDocument
{
$html = cache()->rememberForever($url, function () use ($url) {
return file_get_contents($url);
});
$parser = new \DOMDocument();
libxml_use_internal_errors(true);
$parser->loadHTML($html);
libxml_use_internal_errors(false);
return $parser;
}
$url = "https://sumai.tokyu-land.co.jp/osaka";
$parser = pageContent($url);
$resimler = [];
$rota = new \DOMXPath($parser);
$images = $rota->query("//ul[@class='p-articlelist-content-img']/li/img/@src");
foreach ($images as $image) {
$resimler[] = $image->nodeValue;
}
var_dump($resimler);
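To try the two points above without fetching the live site, here is a small self-contained sketch. The inline markup is a stand-in that only assumes the ul/li/img structure described in the answer; the real page is not reproduced here. It collects libxml parse errors instead of suppressing them, and selects the src attributes directly in the XPath query.

```php
<?php
// Stand-in markup (an assumption, not the real page).
$html = <<<HTML
<html><body>
<ul class="p-articlelist-content-img">
  <li><img src="/img/a.jpg"></li>
  <li><img src="/img/b.jpg"></li>
</ul>
<div class="other"><img src="/img/skip.jpg"></div>
</body></html>
HTML;

// Collect parse errors internally instead of hiding them with @.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$errors = libxml_get_errors(); // inspect or log these if the markup is broken
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the attribute nodes directly; each result is a DOMAttr.
$xpath = new DOMXPath($doc);
$sources = [];
foreach ($xpath->query("//ul[@class='p-articlelist-content-img']/li/img/@src") as $attr) {
    $sources[] = $attr->nodeValue;
}
print_r($sources);
```

Restoring the previous libxml setting, rather than hard-coding false, avoids changing error handling for other code that may already rely on internal errors.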

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.

I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from that:
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
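For comparison, the same lookup can be written with XPath. This is only a sketch: the inline feed is a made-up stand-in for the converted output, it is parsed with loadXML() rather than loadHTML(), and it assumes the link elements carry no namespace prefix (a real Atom-derived feed may need registerNamespace()).

```php
<?php
// Made-up feed fragment (an assumption; element names follow the question).
$xml = <<<XML
<rss><channel>
  <item>
    <title>First post</title>
    <link rel="alternate" href="https://example.com/post/1.html"/>
    <link rel="self" href="https://example.com/feeds/1"/>
  </item>
</channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($xml);
$xpath = new DOMXPath($feed);

$results = [];
foreach ($xpath->query('//item') as $item) {
    // The second argument makes $item the context node, so paths are relative to it.
    $results[] = [
        'url'   => $xpath->evaluate("string(link[@rel='self']/@href)", $item),
        'title' => $xpath->evaluate('string(title)', $item),
    ];
}
print_r($results);
```

Wrapping the path in string() makes evaluate() return an empty string when no matching link exists, so there is no null node to guard against.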

function get_links($link)
{
$ret = array();
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($link));
$dom->preserveWhiteSpace = false;
$links = $dom->getElementsByTagName('a');
foreach ($links as $tag){
$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
}
return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url . PHP_EOL;
}

PHP - DOMDocument - need to change/replace an existing HTML tag w/ a new one

I need to change an <img> tag to a <video> tag. I do not know how to continue with the code so that it changes every tag whose src points to a WebM file.
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
}
}
}
$html = $Dom->saveHTML();
return $html;
}

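Like Roman, I'm using http://php.net/manual/en/domnode.replacechild.php, but I iterate with a for loop and test for the .webm extension in the src with a simple strpos(). Iterating backwards keeps the indexes valid, since the node list returned by getElementsByTagName() is live and shrinks as img nodes are replaced.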
$contents = <<<STR
this is some HTML with an <img src="test1.png"/> in it.
this is some HTML with an <img src="test2.png"/> in it.
this is some HTML with an <img src="test.webm"/> in it,
but it should be a video tag - when iframe() is done.
STR;
function iframe($text)
{
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$images = $dom->getElementsByTagName("img");
for ($i = $images->length - 1; $i >= 0; $i --) {
$nodePre = $images->item($i);
$src = $nodePre->getAttribute('src');
// search in src for ".webm"
if(strpos($src, '.webm') !== false ) {
$nodeVideo = $dom->createElement('video');
$nodeVideo->setAttribute("src", $src);
$nodeVideo->setAttribute("controls", '');
$nodePre->parentNode->replaceChild($nodeVideo, $nodePre);
}
}
$html = $dom->saveHTML();
return $html;
};
echo iframe($contents);
Part of output:
this is some HTML with an <video src="test.webm"></video> in it,
but it should be a video tag - when iframe() is done.

Use this code:
(...)
if( strtolower( $pathinfo['extension'] ) === 'webm')
{
//If extension webm change tag to <video>
$new = $Dom->createElement( 'video', $link->nodeValue );
foreach( $link->attributes as $attribute )
{
$new->setAttribute( $attribute->name, $attribute->value );
}
$link->parentNode->replaceChild( $new, $link );
}
(...)
The code above creates a new video node with the img node's nodeValue as its value, copies all of the img attributes to the new node, and finally replaces the old node with the new one.
Please note that if the old node has an id, the code will produce a warning.

Solution with DOMDocument::createElement and DOMNode::replaceChild functions:
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
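$links = $Dom->getElementsByTagName('img');
// Snapshot the live node list so that replacing nodes does not skip elements.
foreach (iterator_to_array($links) as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
// PATHINFO_EXTENSION yields '' (no notice) when the path has no extension.
$extension = pathinfo($href, PATHINFO_EXTENSION);
if (strtolower($extension) === 'webm') {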
//If extension webm change tag to <video>
$video = $Dom->createElement('video');
$video->setAttribute("src", $href);
$video->setAttribute("controls", '');
$link->parentNode->replaceChild($video, $link);
}
}
}
$html = $Dom->saveHTML();
return $html;
}
http://php.net/manual/en/domdocument.createelement.php
http://php.net/manual/en/domnode.replacechild.php
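Putting the pieces from these answers together, a minimal self-contained sketch might look like the following. The function name imgToVideo and the sample markup are hypothetical. It copies the node list into an array first, because getElementsByTagName() returns a live list and replacing nodes during a plain foreach can skip elements.

```php
<?php
// Hypothetical helper: swap <img> tags whose src ends in .webm for <video> tags.
function imgToVideo(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // iterator_to_array() takes a snapshot, so the loop is unaffected by replacements.
    foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
        $src  = $img->getAttribute('src');
        // Look only at the path so a query string does not hide the extension.
        $path = (string) parse_url($src, PHP_URL_PATH);
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'webm') {
            continue;
        }
        $video = $dom->createElement('video');
        $video->setAttribute('src', $src);
        $video->setAttribute('controls', '');
        $img->parentNode->replaceChild($video, $img);
    }

    // Returns the whole document, including the html/body wrapper libxml adds.
    return $dom->saveHTML();
}

$out = imgToVideo('<p><img src="a.webm"><img src="b.webm"><img src="c.png"></p>');
echo $out;
```

Two adjacent WebM images are used in the sample on purpose: that is the case where iterating the live list directly tends to miss the second one.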

php - how to get the img url from a href using loadhtml

How do I get the img URL from inside an a href link using DOM loadHTML? I tried using $link->nodeValue to get the img src URL, but it is not working.
Example url source:
<img src="www.google.com/test.jpg" />Photo NodeValue
My php code:
// -------------------------------------------------------------------------
// ----------------------- Get URLs From Source
// -------------------------------------------------------------------------
function getVidesURL($url) {
$web_source = $this->getSource($url);
if($web_source != '') {
$Data = $this->Websites_Data[$this->getHost($url)];
preg_match($Data['Index_Preg_Match'], $web_source, $Videos_Page);
$Videos_Page = $Videos_Page[$Data['Index_Preg_Match_Num']];
if($Videos_Page != '') {
$dom = new DOMDocument;
@$dom->loadHTML($Videos_Page);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$Video_Status = "";
$Video_Error = "";
$Video = array(
"URL" => $link->getAttribute('href'),
"Title" => $link->getAttribute('title'),
"MSG" => $link->nodeValue,
);
// Get Image URL Start
$dom = new DOMDocument;
@$dom->loadHTML($Video['MSG']);
$Video_Image = $dom->getElementsByTagName('img');
foreach ($Video_Image as $Image) {
$Video = array(
"IMG" => $link->getAttribute('src'),
);
}
$Videos_URLs .= $Video['IMG'] . '<br />';
}
// Get Image URL Stop
return $Videos_URLs;
}
}
}
The only problem with my code is that I don't know how to get the img URL from the a href.

Here is a small function that can pull out image sources from an HTML input:
<?php
echo PHP_EOL;
var_dump(getImgSrcFromHTML('<img src="www.google.com/test.jpg" />Photo NodeValue<div><img src="www.google.com/test2.jpg" /></div><table><tr><td><img src="www.google.com/test3.jpg" /></td></tr></table>'));
echo PHP_EOL;
function getImgSrcFromHTML($html){
$doc = new DOMDocument();
$doc->loadHTML($html);
$imagePaths = array();
$imageTags = $doc->getElementsByTagName('img');
foreach ($imageTags as $tag) {
$imagePaths[] = $tag->getAttribute('src');
}
if(!empty($imagePaths)) {
return $imagePaths;
} else {
return false;
}
}
Hope this helps.
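To keep each image tied to its link, one option is to search for img elements inside each a element instead of re-parsing nodeValue, which holds only text and no markup. The sketch below assumes the img sits inside the a tag; the sample markup and the array keys mirror the question but are otherwise made up.

```php
<?php
// Sample markup (an assumption): one link with an image, one without.
$html = '<a href="/video/1" title="First"><img src="www.google.com/test.jpg" />Photo NodeValue</a>'
      . '<a href="/video/2" title="Second">No image here</a>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$videos = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // Search only within this <a>; item(0) is null when the link has no image.
    $img = $link->getElementsByTagName('img')->item(0);
    $videos[] = [
        'URL'   => $link->getAttribute('href'),
        'Title' => $link->getAttribute('title'),
        'MSG'   => trim($link->textContent),
        'IMG'   => $img !== null ? $img->getAttribute('src') : null,
    ];
}
print_r($videos);
```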

How to return outer html of DOMDocument?

I'm trying to replace video links inside a string - here's my code:
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach ($doc->getElementsByTagName("a") as $link)
{
$url = $link->getAttribute("href");
if(strpos($url, ".flv"))
{
echo $link->outerHTML();
}
}
Unfortunately, outerHTML doesn't work when I'm trying to get the html code for the full hyperlink like <a href='http://www.myurl.com/video.flv'></a>
Any ideas how to achieve this?

As of PHP 5.3.6 you can pass a node to saveHtml, e.g.
$domDocument->saveHtml($nodeToGetTheOuterHtmlFrom);
Previous versions of PHP did not implement that possibility. You'd have to use saveXml(), but that would create XML compliant markup. In the case of an <a> element, that shouldn't be an issue though.
See http://blog.gordon-oheim.biz/2011-03-17-The-DOM-Goodie-in-PHP-5.3.6/
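Applied to the loop from the question, that could look like the sketch below. The sample markup is made up, and the check uses strpos() !== false so a match at position 0 is not treated as a miss.

```php
<?php
// Made-up content with one .flv link and one ordinary link.
$content = '<p>Watch <a href="http://www.myurl.com/video.flv">the video</a>'
         . ' or read <a href="/about.html">about us</a>.</p>';

$doc = new DOMDocument();
$doc->loadHTML($content);

$matches = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    if (strpos($link->getAttribute('href'), '.flv') !== false) {
        // Passing a node to saveHTML() serialises just that node (PHP 5.3.6+).
        $matches[] = $doc->saveHTML($link);
    }
}
print_r($matches);
```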

You can find a couple of suggestions in the user notes of the DOM section of the PHP Manual.
For example, here's one posted by xwisdom:
<?php
// code taken from the Raxan PDI framework
// returns the html content of an element
protected function nodeContent($n, $outer=false) {
$d = new DOMDocument('1.0');
$b = $d->importNode($n->cloneNode(true),true);
$d->appendChild($b); $h = $d->saveHTML();
// remove outter tags
if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
return $h;
}
?>

A simple solution is to define your own function that returns the outer HTML:
function outerHTML($e) {
$doc = new DOMDocument();
$doc->appendChild($doc->importNode($e, true));
return $doc->saveHTML();
}
Then you can use it in your code:
echo outerHTML($link);

Save a file that contains href links as links.html, or change links.html to a URL such as google.com/fly.html that has .flv links in it. Change flv to wmv or whatever extension you want the href values for. If there are other href attributes, it will pick them up as well.
<?php
$contents = file_get_contents("links.html");
$domdoc = new DOMDocument();
$domdoc->preserveWhiteSpace = false;
$domdoc->loadHTML($contents);
$xpath = new DOMXPath($domdoc);
$query = '//@href';
$nodeList = $xpath->query($query);
foreach ($nodeList as $node){
if(strpos($node->nodeValue, ".flv")){
$linksList = $node->nodeValue;
$htmlAnchor = new DOMElement("a", $linksList);
$htmlURL = new DOMAttr("href", $linksList);
$domdoc->appendChild($htmlAnchor);
$htmlAnchor->appendChild($htmlURL);
$domdoc->saveHTML();
echo ("<a href='". $node->nodeValue. "'>". $node->nodeValue. "</a><br />");
}
}
echo("done");
?>
