Php Dom Document results error - php

I would like to scrape some elements from html, but I am unable to scrape the data as I need.
html
<div class="opinions">
<ul>
<li>
<div class="imgcontainers">
<a href="domainname.com" title="title"> <img width="160" src="image.jpg" />
</a>
</div>
<p class="caption">
asdfad
<span>November 03, 2015 09:29 This is article title</span>
</p>
</li>
</ul>
</div>
$dom = new DOMDocument();
$classname = 'opinions';
$html = get_page($url);
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXPath($dom);
$articles = $xpath->query("//*[@class='" . $classname . "']");
$p = $articles->getElementsByTagName('a');
$div = $articles->getElementsByTagName('div');
foreach($p as $value){
$title = $value->getAttribute("href");
echo $title;
}
When I run this script I get this error: "Call to undefined method DOMNodeList::getElementsByTagName()"
What I need is every href link, every img src path (if present), and the span text value. Please advise how to achieve this.

Maybe you can learn something from my code
Or, if you decide to include my function, here is how I do it:
$html = ""; //your html
$props = array(
array("tagname"=>"div", "props"=>array("class"=>"opinions")),
//the '/' before 'a' is for all descendant <a> of <div>
array("tagname"=>"/a"),
);
$options = array("property"=>"href");
require_once 'getNodeValue.php';
$hrefs = getNodeValue($html, $props, $options);
print_r($hrefs);
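For what it's worth, the error in the question comes from calling getElementsByTagName() on the DOMNodeList that query() returns; that method only exists on individual nodes and on the document. Below is a minimal sketch that loops over the list instead and runs further XPath queries relative to each node. The sample markup is copied from the question, and the `$items` array and its keys are names I made up for illustration.

```php
<?php
// Sample markup from the question; in practice $html would come from the fetched page.
$html = <<<'HTML'
<div class="opinions">
<ul>
<li>
<div class="imgcontainers">
<a href="domainname.com" title="title"><img width="160" src="image.jpg" /></a>
</div>
<p class="caption">
asdfad
<span>November 03, 2015 09:29 This is article title</span>
</p>
</li>
</ul>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$items = []; // hypothetical result structure: one entry per <li>

// query() returns a DOMNodeList, so iterate it and query relative to each node.
foreach ($xpath->query("//div[@class='opinions']//li") as $li) {
    $link = $xpath->query(".//a", $li)->item(0);
    $img  = $xpath->query(".//img", $li)->item(0);
    $span = $xpath->query(".//p[@class='caption']/span", $li)->item(0);

    $items[] = [
        'href' => $link ? $link->getAttribute('href') : null,
        'src'  => $img ? $img->getAttribute('src') : null,
        'text' => $span ? trim($span->textContent) : null,
    ];
}

print_r($items);
```

The null checks cover list items that have no image or no span, which the question says can happen.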

Related

How to extract a link between paragraph tags

I'm trying to fetch a link that sits between p tags, but my result is "/playlist" and I need the link "song/54826/father-friend".
I've been on this for hours now. Please help me out.
<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>
include('simple_html_dom.php');
$url="some url";
$html = file_get_contents($url);
$links = [];
$document = new DOMDocument;
$document ->loadHTML($html);
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]//a/@href");
foreach ($anchorTags as $anchorTag) {
$links[] = $anchorTag->nodeValue;
}
echo $links[1];
You need to modify your xpath so it scopes to the right element.
$document = new DOMDocument;
$document ->loadHTML('<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>');
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]/p[@class=\"track__track\"]/a/@href");
foreach ($anchorTags as $anchorTag) {
echo $anchorTag->nodeValue;
}
https://3v4l.org/YY0dD
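To make the scoping difference visible, here is a small sketch that runs both expressions against the same markup. It assumes the song link is an `<a>` directly inside the `<p class="track__track">` element; the pasted HTML does not show that anchor, so the markup below is a guess built from the href quoted in the question.

```php
<?php
// Assumed markup: the <a> inside the paragraph is not visible in the question's HTML.
$html = '<div class="track__body">'
      . '<p class="track__track">'
      . '<a href="song/54826/father-friend">Father &amp; Friend</a>'
      . '<span class="track__artist">Alain Clark</span>'
      . '</p>'
      . '<a href="/playlist" class="track__playlist"></a>'
      . '</div>';

$document = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$document->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($document);

// Unscoped: every <a> anywhere under the div, in document order.
$all = [];
foreach ($xPath->evaluate('//div[@class="track__body"]//a/@href') as $attr) {
    $all[] = $attr->nodeValue;
}

// Scoped: only <a> elements that are direct children of the paragraph.
$scoped = [];
foreach ($xPath->evaluate('//div[@class="track__body"]/p[@class="track__track"]/a/@href') as $attr) {
    $scoped[] = $attr->nodeValue;
}

print_r($all);
print_r($scoped);
```

With the unscoped query the result depends on the order of links in the page, which is why picking an element by index is fragile.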

Problem to get content from html string variable in php

I have this code:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$links = [];
$container = $doc->getElementById("content");
$arr = $container->getElementsByTagName("a");
foreach($arr as $item) {
$href = $item->getAttribute("href");
$title = $item->getAttribute("title");
$links[] = [
'href' => $href,
'title' => $title
];
}
for($i = 0, $l = count($links); $i < $l; ++$i) {
echo $links[$i]['title'].' '.$links[$i]['href'].'<br />';
}
The HTML structure looks like this:
<div class="post-content right-col">
<a title="" href="https://www.swisscars.pl/samochody/516321/">
<img src="https://swisscars.pl/uploads2/180843_0.jpg" alt="" class="thumb alignleft" height="75" width="75"/>
</a>
<h2 style="line-height:150%;">
<a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" title="Renault Kangoo II (96’011 km)">
Renault Kangoo II (96’011 km) </a>
</h2>
Do końca aukcji: <span id="countdown100">2018-10-23 14:00:00 GMT+02:00</span><p>DATA ZAKONCZENIA AUKCJI: 2018-10-23 14:00</p>
</div>
</div>
I want to get values only from the a tag with the attribute rel="bookmark". I tried using the hasAttribute function, but it is not working. How can I get only the content of the a tag with rel="bookmark"? Does PHP have a hasAttribute() function or something similar?
Thanks for help
XPath might be an option, you could do this:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[@rel="bookmark"]');
That should return a DOMNodeList you could loop through.
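A minimal sketch of that loop, collecting the same href/title pairs the question builds. The markup is a trimmed-down version of the question's HTML with an assumed wrapper `<div id="content">`; `$viaHasAttribute` is a name I made up to show the hasAttribute() route the question asks about, which does exist on DOMElement.

```php
<?php
// Simplified sample markup; the wrapper div with id="content" is an assumption.
$content = <<<'HTML'
<div id="content">
<a title="" href="https://www.swisscars.pl/samochody/516321/">thumbnail</a>
<h2><a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" title="Renault Kangoo II">Renault Kangoo II</a></h2>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($content);
libxml_clear_errors();

// Option 1: let XPath do the filtering.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@rel="bookmark"]') as $a) {
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'title' => $a->getAttribute('title'),
    ];
}

// Option 2: filter by hand with DOMElement::hasAttribute() / getAttribute().
$viaHasAttribute = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('rel') && $a->getAttribute('rel') === 'bookmark') {
        $viaHasAttribute[] = $a->getAttribute('href');
    }
}

print_r($links);
print_r($viaHasAttribute);
```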

Get href value from matching anchor text

I'm pretty new to the DOMDocument class and can't seem to find an answer for what I'm trying to do.
I have a large HTML file and I want to grab the link from an element based on its anchor text.
For example:
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
I want to get the value of the href attribute of any element whose text is the keyword. I hope that is clear.
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
if ($a->nodeValue === $keyword) {
echo $a->getAttribute('href'); // prints "http://link.com"
break;
}
}
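The same lookup can probably be pushed into XPath, which also tolerates stray whitespace around the anchor text. This is only a sketch: the markup is hypothetical (I am assuming the keyword sits inside an `<a>` element), and the keyword is concatenated into the expression, so it must not contain a double quote.

```php
<?php
// Hypothetical markup: the <a> element around "Keyword" is an assumption.
$html = '<div class="main"><a href="http://link.com"> Keyword </a> other text '
      . '<a href="http://other.com">Something else</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$keyword = 'Keyword'; // must not contain a double quote, or the expression breaks
$xpath = new DOMXPath($doc);

// normalize-space() trims and collapses whitespace before the comparison.
$query = '//a[normalize-space(.) = "' . $keyword . '"]/@href';

$hrefs = [];
foreach ($xpath->query($query) as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Unlike the loop with `break`, this collects every matching link rather than only the first one.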

How do I extract this value using PHP Dom

I have an HTML file; this is just a part of it:
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URLs (for example http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in that HTML) from the above using PHP DOMDocument. I got this far but don't know how to proceed:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
What next? The links are nested inside other tags, so I assumed I can't use $data->getElementsByTagName() here.
Use XPath to narrow the search down to the <a> elements inside the <div class="res_main"> elements:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
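A minimal sketch of that addition, assuming markup shaped like the question's (shortened and closed properly here so it parses cleanly); `$images` is just a name chosen for illustration.

```php
<?php
// Shortened sample based on the question's markup.
$html = <<<'HTML'
<div id="result">
<div class="res_main"><h2 class="res_main_top">
<img src="/ff/gigablast.com.png" alt="favicon for gigablast.com" width="16" height="16" />
<a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>
</h2></div>
<div class="res_main"><h2 class="res_main_top">
<img src="/ff/ask.com.png" alt="favicon for ask.com" width="16" height="16" />
<a href="http://ask.com/" rel="nofollow">Ask.com</a>
</h2></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$result = $doc->getElementById('result');

// getElementsByTagName() searches all descendants, not only direct children.
$urls = [];
foreach ($result->getElementsByTagName('a') as $a) {
    $urls[] = $a->getAttribute('href');
}

$images = [];
foreach ($result->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($urls);
print_r($images);
```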
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.

how to use dom php parser

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, note that you can't use the same id on two different divs; that is what classes are for. Every element's id should be unique.
Code to get the contents of the div with id="interestingbox"
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
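One caveat with matching on class: an exact comparison against the class attribute only matches when the attribute is exactly that string, so a div with two classes is skipped. A common XPath 1.0 workaround is to pad the attribute with spaces and look for the padded token. The sketch below uses made-up markup (the extra `featured` and `notinterestingbox` classes are my own) to show the difference.

```php
<?php
// Made-up markup: "featured" and "notinterestingbox" are illustrative class names.
$html = '<html><body>'
      . '<div class="interestingbox featured"><div>Content1</div><div>Content2</div></div>'
      . '<div class="interestingbox">a link</div>'
      . '<div class="notinterestingbox">skip me</div>'
      . '</body></html>';

$dom_document = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom_document->loadHTML($html);
libxml_clear_errors();

$dom_xpath = new DOMXpath($dom_document);

// Exact match: only the div whose class attribute is exactly "interestingbox".
$exact = $dom_xpath->query("//div[@class='interestingbox']");

// Token match: any div that has "interestingbox" as one of its classes.
$token = $dom_xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' interestingbox ')]"
);

$texts = [];
foreach ($token as $box) {
    $texts[] = trim($box->textContent);
}

echo $exact->length, " exact, ", $token->length, " by token\n";
print_r($texts);
```

The surrounding spaces in the pattern are what keep `notinterestingbox` from matching.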
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes empty elements as <br/>, without a space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
