I would like to scrape some elements from html, but I am unable to scrape the data as I need.
html
<div class="opinions">
<ul>
<li>
<div class="imgcontainers">
<a href="domainname.com" title="title"> <img width="160" src="image.jpg" />
</a>
</div>
<p class="caption">
asdfad
<span>November 03, 2015 09:29 This is article title</span>
</p>
</li>
</ul>
</div>
$dom = new DOMDocument();
$classname = 'opinions';
$html = get_page($url);
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$xpath = new DOMXPath($dom);
$articles = $xpath->query("//*[@class='" . $classname . "']");
$p = $articles->getElementsByTagName('a');
$div = $articles->getElementsByTagName('div');
foreach($p as $value){
$title = $value->getAttribute("href");
echo $title;
}
When I run this script I get this error: "Call to undefined method DOMNodeList::getElementsByTagName()"
What I need is every href link, every img src path (if there is one) and the span text value. Please advise how to achieve this.
Maybe you can learn something from my code
Or, if you decide to include my function, here is how I do it:
$html = ""; //your html
$props = array(
array("tagname"=>"div", "props"=>array("class"=>"opinions")),
//the '/' before 'a' is for all descendant <a> of <div>
array("tagname"=>"/a"),
);
$options = array("property"=>"href");
require_once 'getNodeValue.php';
$hrefs = getNodeValue($html, $props, $options);
print_r($hrefs);
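The error in the question arises because DOMXPath::query() returns a DOMNodeList, and getElementsByTagName() only exists on DOMDocument and DOMElement. A minimal sketch of one way around it, assuming markup like the sample in the question (the inline $html string and the $rows array are illustrative names, not part of the original code), is to loop over the list and run further queries relative to each matched node:

```php
<?php
// Sample markup modeled on the question; in real use it would come from the fetched page.
$html = <<<'HTML'
<div class="opinions">
  <ul>
    <li>
      <div class="imgcontainers">
        <a href="domainname.com" title="title"><img width="160" src="image.jpg" /></a>
      </div>
      <p class="caption">asdfad <span>November 03, 2015 09:29 This is article title</span></p>
    </li>
  </ul>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];

// query() returns a DOMNodeList, so iterate it and query relative to each <li>.
foreach ($xpath->query("//div[@class='opinions']//li") as $li) {
    $link = $xpath->query('.//a', $li)->item(0);
    $img  = $xpath->query('.//img', $li)->item(0);
    $span = $xpath->query('.//p[@class="caption"]/span', $li)->item(0);

    $rows[] = [
        'href' => $link !== null ? $link->getAttribute('href') : null,
        'src'  => $img !== null ? $img->getAttribute('src') : null, // null when there is no image
        'text' => $span !== null ? trim($span->textContent) : null,
    ];
}

print_r($rows);
```

The second argument to query() is the context node, which is what keeps each lookup inside the current list item.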
I'm trying to fetch a link that sits between p tags, but my result is "/playlist" and I need the link like "song/54826/father-friend".
I've been on this for hours now. Please help me out.
<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>
include('simple_html_dom.php');
$url="some url";
$html = file_get_contents($url);
$links = [];
$document = new DOMDocument;
$document ->loadHTML($html);
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]//a/@href");
foreach ($anchorTags as $anchorTag) {
$links[] = $anchorTag->nodeValue;
}
echo $links[1];
You need to modify your XPath expression so that it is scoped to the right element.
$document = new DOMDocument;
$document ->loadHTML('<div class="track__body">
<p class="track__track">
Father & Friend
<span class="track__artist" data-artist="" id="zwartelijst_artist">Alain Clark</span>
</p>
<a href="/playlist" class="track__playlist">
</a>');
$xPath = new DOMXPath($document );
$anchorTags = $xPath->evaluate("//div[@class=\"track__body\"]/p[@class=\"track__track\"]/a/@href");
foreach ($anchorTags as $anchorTag) {
echo $anchorTag->nodeValue;
}
https://3v4l.org/YY0dD
I have this code:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$links = [];
$container = $doc->getElementById("content");
$arr = $container->getElementsByTagName("a");
foreach($arr as $item) {
$href = $item->getAttribute("href");
$title = $item->getAttribute("title");
$links[] = [
'href' => $href,
'title' => $title
];
}
for($i = 0, $l = count($links); $i < $l; ++$i) {
echo $links[$i]['title'].' '.$links[$i]['href'].'<br />';
}
The HTML structure looks like this:
<div class="post-content right-col">
<a title="" href="https://www.swisscars.pl/samochody/516321/">
<img src="https://swisscars.pl/uploads2/180843_0.jpg" alt="" class="thumb alignleft" height="75" width="75"/>
</a>
<h2 style="line-height:150%;">
<a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" title="Renault Kangoo II (96’011 km)">
Renault Kangoo II (96’011 km) </a>
</h2>
Do końca aukcji: <span id="countdown100">2018-10-23 14:00:00 GMT+02:00</span><p>DATA ZAKONCZENIA AUKCJI: 2018-10-23 14:00</p>
</div>
</div>
I want to get values only from the a tag which has the attribute rel="bookmark". I tried to use the hasAttribute function, but it is not working. Please explain how I can get only the content of a tags with the rel="bookmark" attribute. Does PHP have a hasAttribute() function or something similar?
Thanks for the help.
XPath might be an option, you could do this:
$doc = new \DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a[@rel="bookmark"]');
That should return a DOMNodeList you could loop through.
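Looping through that list could look like the following sketch. The $content string is a cut-down stand-in for the page in the question, and the $links array layout mirrors the one used there.

```php
<?php
// Cut-down stand-in for the page in the question.
$content = <<<'HTML'
<div id="content">
  <div class="post-content right-col">
    <a title="" href="https://www.swisscars.pl/samochody/516321/"><img src="https://swisscars.pl/uploads2/180843_0.jpg" alt="" /></a>
    <h2><a href="https://www.swisscars.pl/samochody/516321/" rel="bookmark" title="Renault Kangoo II">Renault Kangoo II</a></h2>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML usually triggers parser warnings
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];

// Only anchors carrying rel="bookmark" are matched; the image link is skipped.
foreach ($xpath->query('//a[@rel="bookmark"]') as $a) {
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'title' => $a->getAttribute('title'),
    ];
}

print_r($links);
```

DOMElement also has hasAttribute() and getAttribute(), so the original getElementsByTagName() loop could be kept with a check such as $item->getAttribute('rel') === 'bookmark' inside it.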
I'm pretty new to the DOMDocument class and can't seem to find an answer for what I'm trying to do.
I have a large HTML file and I want to grab the link from an element based on the anchor text.
So, for example:
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
I want to get the value of the href attribute of any element that has the text "Keyword". I hope that was clear.
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
if ($a->nodeValue === $keyword) {
echo $a->getAttribute('href'); // prints "http://link.com"
break;
}
}
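The same lookup can also be pushed into an XPath expression. The sketch below assumes the keyword is the text of an anchor; the sample markup and the http://link.com URL are placeholders chosen for illustration, and $hrefs is a hypothetical name.

```php
<?php
// Placeholder markup: an anchor whose visible text is the keyword.
$html = '<div class="main"><a href="http://link.com"><img src="http://images.com/spacer.gif"/>Keyword</a> other text</div>';
$keyword = 'Keyword';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// normalize-space(.) trims and collapses whitespace in the anchor's text.
// The keyword is embedded in the expression, so it must not contain a double quote.
$hrefs = [];
foreach ($xpath->query('//a[normalize-space(.) = "' . $keyword . '"]/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Unlike the strict === comparison on nodeValue, this version tolerates leading or trailing whitespace around the anchor text.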
I have an HTML file; this is just a part of it:
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URLs (for example http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in that HTML) from the above using PHP's DOMDocument. I know this much, but I don't know how to move ahead:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
Then what? The URL is inside a nested tag, so I don't think I can use $data->getElementsByTagName() here.
Using XPath to narrow down the field to <a> elements inside the <div class="res_main"> elements:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
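For instance, image sources could be collected the same way. This is only a sketch against an inline sample modeled on the question's markup (loaded with loadHTML() rather than from urllist.html); $images is an illustrative name.

```php
<?php
// Inline sample modeled on the question's markup, with the tags closed.
$html = <<<'HTML'
<div id="result">
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/gigablast.com.png" alt="favicon for gigablast.com" width="16" height="16" />
      <a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>
    </h2>
  </div>
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/ask.com.png" alt="favicon for ask.com" width="16" height="16" />
      <a href="http://ask.com/" rel="nofollow">Ask.com - What's Your Question?</a>
    </h2>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// getElementById() returns a DOMElement, which supports getElementsByTagName().
$result = $doc->getElementById('result');

$urls = [];
foreach ($result->getElementsByTagName('a') as $a) {
    $urls[] = $a->getAttribute('href');
}

$images = [];
foreach ($result->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($urls);
print_r($images);
```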
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.
I'm new to DOM parsing in PHP:
I have an HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox"
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
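One caveat worth sketching: a test like @class='interestingbox' compares the whole attribute string, so it would miss an element such as class="interestingbox wide". A common XPath 1.0 idiom for matching a single class token is shown below; the "wide" and "notinterestingbox" classes and the sample markup are made up for illustration.

```php
<?php
// Made-up sample: one div with two classes, one exact match, one near miss.
$html = '<html><body>
<div class="interestingbox wide"><div>Content1</div></div>
<div class="interestingbox">a link</div>
<div class="notinterestingbox">skip me</div>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Exact string comparison: only class="interestingbox" matches.
$exact = $xpath->query("//div[@class='interestingbox']");

// Token match: pad the class list with spaces and look for " interestingbox ".
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' interestingbox ')]"
);

echo $exact->length, ' ', $token->length, "\n";
```

Padding both sides with spaces is what keeps "notinterestingbox" from matching, which a plain contains(@class, 'interestingbox') would not do.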
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo' ]");
$dataString = innerXML($node->item(0));
$dataArr = explode("<br />", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, or XPath selectors.
Look at the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();