Fetching all images src from specific div - php

Suppose, I have HTML structure like:
<div>
<div class="content">
<p>This is dummy text</p>
<p><img src="a.jpg"></p>
<p>This is dummy text</p>
<p><img src="b.jpg"></p>
</div>
</div>
I want to fetch all image src from .content div. I tried :
<?php
// a new dom object
$dom = new domDocument;
// load the html into the object
$dom->loadHTML("example.com/article/2345");
// discard white space
$dom->preserveWhiteSpace = false;
//get element by class
$finder = new DomXPath($dom);
$classname = 'content';
$content = $finder->query("//*[contains(#class, '$classname')]");
foreach($content as $item){
echo $item->nodevalue;
}
But, I cannot get anything when I loop through $content. PLease Help.

Change your XPath query as shown below:
// loading html content from remote url
$html = file_get_contents("http://nepalpati.com/entertainment/22577/");
#$dom->loadHTML($html);
...
$classname = 'content';
$img_sources = [];
// getting all images within div with class "content"
$content = $finder->query("//div[#class='$classname']/p/img");
foreach ($content as $img) {
$img_sources[] = $img->getAttribute('src');
}
...
var_dump($img_sources);
// the output:
array(2) {
[0]=>
string(68) "http://nepalpati.com/mediastorage/images/2072/Falgun/khole-selfi.jpg"
[1]=>
string(72) "http://nepalpati.com/mediastorage/images/2072/Falgun/khole-hot-selfi.jpg"
}

Related

How to extract content from all DIV's with class="sco" using DOM and PHP

I have got multiple DIV's with content in a php file : div class="sco".
The following code works very well but it only extracts the content from the first DIV. How can I extract content from all DIV's ?
Thank you in advance.
$html = file_get_contents('http://www..../include/test.php');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$finder = new DomXPath($doc);
$node = $finder->query("//*[contains(#class, 'sco')]");
print_r($doc->saveHTML($node->item(0)));
On the last line, you're accessing only the first item (index 0): $node->item(0)
Instead, loop the $node and print each item:
$node = $finder->query("//*[contains(#class, 'sco')]");
foreach($node as $item){
print_r($doc->saveHTML($item));
}
Just iterate over your DOMNodeList object:
<?php
$html = '
<div class="sco">a</div>
<div class="sco">b</div>
<div class="sco">c</div>
';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$finder = new DomXPath($doc);
$node = $finder->query("//*[contains(#class, 'sco')]");
for($i=0;$i<$node->length;$i++)
{
echo "<pre>";
var_dump($node->item($i)->nodeValue);
echo "</pre>";
}
Output:
string(1) "a"
string(1) "b"
string(1) "c"

How to get PHP DOM getElementsByTagName('body') with html tags

Im getting the body content but without html tags(It is cleaned up) inside the body.I need with all html tags inside the body. what do I want to change on my code?
$doc = new DOMDocument();
#$doc->loadHTMLFile($myURL);
$elements2 = $doc->getElementsByTagName('body');
foreach ($elements2 as $el2) {
echo $el2->nodeValue, PHP_EOL;
echo "<br/>";
}
You will need to save the body child nodes as HTML. I suggest using Xpath to fetch the nodes, this avoids the outer loop:
$html = <<<'HTML'
<html>
<body>
Foo
<p>Bar</p>
</body>
</html>
HTML;
$document = new DOMDocument();
$document->loadHtml($html);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('//body/node()') as $node) {
$result .= $document->saveHtml($node);
}
var_dump($result);
Output:
string(29) "
Foo
<p>Bar</p>
"

Get href value from matching anchor text

I'm pretty new to the DOMDocument class and can't seem to find an answer for what i'm trying to do.
I have a large html file and i want to grab the link from an element based on the anchor text.
so for example
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
i want to get the value of the href attribute of any element that has the text keyword. Hope that was clear
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
if ($a->nodeValue === $keyword) {
echo $a->getAttribute('href'); // prints "http://link.com"
break;
}
}

How do I extract this value using PHP Dom

I do have html file this is just a prt of it though...
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want extract only url address (for example: http://www.gigablast.com and http://ask.com/ - there are atleast 10 urls in that html) from above using PHP Dom Document..I know up to this but dont know how to move ahead??
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
then what?? this is inside tag hence I cant use $data->getElementsByTagName() here!!
Using XPath to narrow down the field to a elements inside the <div class="res_main"> element:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[#class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter, you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->attributes->href;
}
// $urls is your collection of urls in the original document.

how to use dom php parser

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First i have to tell you that you can't use the same id on two different divs; there are classes for that point. Every element should have an unique id.
Code to get the contents of the div with id="interestingbox"
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("*/div[#id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("*/div[#class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
<body>
<html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[#id='foo' ]");
$dataString = innerXML($node->item(0));
$dataArr = explode("<br />", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $nameUno<br />seconddata = $nameDos<br />thirddata = $nameTres<br />"
WebExtractor: https://github.com/knyga/webextractor
It can parse page with css, regex, xpath selectors.
Look package and tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory; use
WebExtractor\DataExtractor\DataExtractorTypes; use
WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory(); $extractor =
$factory->createDataExtractor(DataExtractorTypes::CSS); $client = new
Client; $content =
$client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content); $h1 =
$extractor->setSelector('h1')->extract();

Categories