Get href value from matching anchor text - php

I'm pretty new to the DOMDocument class and can't seem to find an answer for what I'm trying to do.
I have a large HTML file and I want to grab the link from an element based on its anchor text.
For example:
$html = <<<HTML
<div class="main">
<a href="http://link.com"><span><font><img src="http://images.com/spacer.gif"/>Keyword</font></span></a>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
I want to get the value of the href attribute of any element whose text is the keyword. I hope that is clear.

$html = <<<HTML
<div class="main">
<a href="http://link.com"><span><font><img src="http://images.com/spacer.gif"/>Keyword</font></span></a>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
    if ($a->nodeValue === $keyword) {
        echo $a->getAttribute('href'); // prints "http://link.com"
        break;
    }
}
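The same lookup can also be expressed as a single XPath query. The following is a minimal sketch, not the answerer's method: it assumes the keyword sits inside an <a href="http://link.com"> element (the sample markup below is simplified and partly hypothetical) and uses normalize-space() so that surrounding whitespace does not break the comparison.

```php
<?php
// Simplified sample markup; the <a href="http://link.com"> wrapper is an assumption.
$html = '<div class="main"><a href="http://link.com"><span>'
      . '<img src="http://images.com/spacer.gif"/> Keyword </span></a> other text</div>';
$keyword = 'Keyword';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// normalize-space(.) trims and collapses whitespace in the anchor's text content.
// Concatenating $keyword into the query is only safe here because it has no quotes.
$hrefs = [];
foreach ($xpath->query('//a[normalize-space(.) = "' . $keyword . '"]/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}
print_r($hrefs);
```

Unlike the loop above, this collects every matching link rather than stopping at the first one.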

Related

XPath extract attribute from <div> in PHP

I want to extract an attribute from a <div> and display its value.
<div class="b-text-4xl b-text-btc-first b-font-bold btcecc-animated liveup livedown" data-price="38696.15125182" data-live-price="bitcoin" data-rate="1" data-currency="USD" data-timeout="1610051644181"><span>38,696.15</span> <b class="fiat-symbol">$</b></div>
I need the value of "data-price".
The location of the full html is at https://www.btc-echo.de/kurs/bitcoin/
I tried this:
$url = "https://www.btc-echo.de/kurs/bitcoin/";
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML(utf8_encode(file_get_contents($url)));
$xpath = new DOMXpath($doc);
foreach ($xpath->query('//*[@id="main"]/div[1]/div[3]/div[2]/div[1]/div/div/div[1]/div/@data-price') as $textNode) {
    echo $textNode->nodeValue;
}
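A long positional path like the one above tends to break whenever the page layout changes. One possible alternative, sketched here under assumptions, is to select the element by one of its own attributes. The example parses an inline copy of the <div> from the question rather than the live page, and the choice of data-live-price="bitcoin" as the selector is an assumption. If the live page fills in the value with JavaScript, it will not be present in the HTML that file_get_contents() returns.

```php
<?php
// Inline copy of the element from the question (class list shortened); a live page may differ.
$html = '<div id="main"><div class="b-text-4xl" data-price="38696.15125182" '
      . 'data-live-price="bitcoin" data-rate="1" data-currency="USD">'
      . '<span>38,696.15</span> <b class="fiat-symbol">$</b></div></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select the data-price attribute of any div marked as the bitcoin live price.
$prices = [];
foreach ($xpath->query('//div[@data-live-price="bitcoin"]/@data-price') as $attr) {
    $prices[] = $attr->nodeValue;
}
print_r($prices);
```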

PHP DomDocument get text after tag

I have this in my php file.
<?php
$str = '<div>
<p>Text</p>
I need this text...
<p>next p</p>
... and this
</div>
';
$dom = new DOMDocument();
$dom->loadHTML($str);
$p = $dom->getElementsByTagName('p');
foreach ($p as $item) {
    echo $item->nodeValue;
}
This gives me the correct text for the p tags, but I also need the text between the p tags ("I need this text...", "... and this").
Does anyone know how to get the text after the p tag?
Best
Use DOMXPath:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div/text()') as $textNode) {
    echo $textNode->nodeValue;
}
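Put together with the markup from the question, the query might be used as in the sketch below. Trimming each node and skipping empty results is an addition here, not part of the answer; it is there because the text nodes carry the line breaks around the <p> tags.

```php
<?php
$str = '<div>
<p>Text</p>
I need this text...
<p>next p</p>
... and this
</div>';

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

// text() selects only the text nodes that are direct children of the div,
// so the text inside the <p> elements is not included.
$texts = [];
foreach ($xpath->query('//div/text()') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text !== '') { // skip whitespace-only nodes between tags
        $texts[] = $text;
    }
}
print_r($texts);
```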

Parse HTML with PHP do not remove all the html tag?

I want to parse HTML using PHP.
My HTML file looks like this:
<div class="main">
<div class="text">
Welcom to Stackoverflow
</div>
</div>
Now I want to extract only this part:
<div class="text">
Welcom to Stackoverflow
</div>
For this I wrote the following code:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="main"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}
This code gives only the text:
Welcom to Stackoverflow
but I want the tag as well. How can I do this?
If you only want the div with class "text", try this:
Change your query to: $xpath->query('//div[@class="text"]');
For the output, use: echo $dom->saveHTML($tag);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="text"]');
foreach ($tags as $tag) {
    echo $dom->saveHTML($tag);
}
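Note that @class="text" compares the whole attribute value, so it would miss an element such as <div class="text highlighted">. A common XPath 1.0 idiom for matching one class token is sketched below; the sample markup and the class names "highlighted" and "textual" are made up for illustration.

```php
<?php
// Hypothetical markup: one div carries "text" among several classes,
// another has a class that merely starts with "text".
$html = '<div class="main"><div class="text highlighted">Welcome</div>'
      . '<div class="textual">Other</div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so " text " only matches a whole token.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " text ")]';
$matches = [];
foreach ($xpath->query($query) as $tag) {
    $matches[] = $dom->saveHTML($tag); // serializes the element including its tags
}
print_r($matches);
```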
The QueryPath library for HTML/XML parsing makes such things much easier.

How do I extract this value using PHP Dom

I have an HTML file; this is just a part of it:
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URLs (for example http://www.gigablast.com/ and http://ask.com/; there are at least 10 URLs in that HTML) from the markup above using PHP DOMDocument. I have got this far but don't know how to proceed:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
Then what? The URL is inside a tag, so I can't use $data->getElementsByTagName() here!
Use XPath to narrow the field down to the <a> elements inside the <div class="res_main"> elements:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
    $href = $node->getAttribute('href');
    if (!empty($href)) {
        $urls[] = $href;
    }
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
    $urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $anchor) {
    // getAttribute() returns the attribute value as a string ('' if it is missing)
    $urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.

PHP: Fetch content from a html page using xpath()

I'm trying to fetch the content of a div in a html page using xpath and domdocument. This is the structure of the page:
<div id="content">
<div class="div1"></div>
<span class="span1"></span>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<div class="div2"></div>
</div>
I want to get only the content of the p elements, not the spans and divs. I came up with the XPath expression .//*[@id='content']/p, but something must be wrong because I'm getting only the first p. I also tried other expressions with following-sibling and node(), but they all return only the first p.
.//*[@id='content']/span/following-sibling::p
.//*[@id='content']/node()[self::p]
This is how I used XPath:
$domDocument=new DOMDocument();
$domDocument->encoding = 'UTF-8';
$domDocument->loadHTML($page);
$domXPath = new DOMXPath($domDocument);
$domNodeList = $domXPath->query($this->xpath);
$content = $this->GetHTMLFromDom($domNodeList);
And this is how I get the HTML from the nodes:
private function GetHTMLFromDom($domNodeList){
    $domDocument = new DOMDocument();
    $node = $domNodeList->item(0);
    foreach ($node->childNodes as $childNode) {
        $domDocument->appendChild($domDocument->importNode($childNode, true));
    }
    return $domDocument->saveHTML();
}
This XPath expression:
//div[@id='content']/p
results in the wanted node set (five p elements).
EDIT: Now it's clear what your problem is. You need to iterate over the NodeList:
private function GetHTMLFromDom($domNodeList){
    $domDocument = new DOMDocument();
    // Import every matched node, not just item(0)
    foreach ($domNodeList as $node) {
        $domDocument->appendChild($domDocument->importNode($node, true));
    }
    return $domDocument->saveHTML();
}
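As a possible simplification, DOMDocument::saveHTML() accepts a node argument, so the matched elements could be serialized directly without building a second document. The sketch below is self-contained and uses hypothetical sample content ("one", "two") in place of the real page.

```php
<?php
// Hypothetical page content standing in for $page from the question.
$page = '<div id="content"><div class="div1"></div><span class="span1"></span>'
      . '<p>one</p><p>two</p><div class="div2"></div></div>';

$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

// Serialize each matched <p> from the original document and concatenate.
$out = '';
foreach ($xpath->query("//div[@id='content']/p") as $p) {
    $out .= $doc->saveHTML($p);
}
echo $out;
```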
