DOMXpath and get the result - php

I am trying to get the content of a span tag like this:
<span class="AB" itemprop='ratingValue'>Count</span>
The problem is that the class "AB" is used on many spans in the HTML, and I only want the one span whose itemprop is ratingValue.
In PHP I generally use this code:
$html= file_get_contents('url');
$html = escapeshellarg($html);
$html = nl2br($html);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@itemprop='ratingValue']");
if ($results->length > 0) {
echo $review_count_html = $results->item(0)->nodeValue;
}
This code usually works, but for this request I get no results. Can anybody help me? Thanks a lot.
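A minimal sketch of one way to make this work, assuming the markup looks like the sample below (the sample HTML and the value 4.5 are made up for illustration). It feeds the raw HTML to DOMDocument without escapeshellarg() or nl2br(), since escapeshellarg() adds shell quoting around the string and escapes single quotes, which may be why an attribute written as itemprop='ratingValue' no longer matches.

```php
<?php
// Hypothetical sample markup: two spans share class "AB", only one has itemprop.
$html = <<<'HTML'
<div>
  <span class="AB">Other</span>
  <span class="AB" itemprop='ratingValue'>4.5</span>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);              // raw HTML: no escapeshellarg(), no nl2br()
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select by attribute, so the shared class name does not matter.
$results = $xpath->query("//span[@itemprop='ratingValue']");

$ratingValue = null;
if ($results->length > 0) {
    $ratingValue = trim($results->item(0)->nodeValue);
    echo $ratingValue, "\n";
}
```

With a real page, $html would come from file_get_contents('url') as in the question.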

Related

Attempted XPath query not showing any results

I'm currently working on a fantasy sports site, and I want to be able to pull basic stats from another site. (I don't have much experience with XML or pulling data from other sites).
I inspected the element to get its XPath, which gave me: //*[@id="cp1_ctl01_pnlPlayerStats"]/table[1]/tbody/tr[4]/td[18]
I've looked into a couple of methods for pulling the info, but I just end up with empty elements in the table on my site.
Here's my code:
$doc = new DOMDocument();
@$doc->loadHTMLFile($P_RotoLink);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//*[@id="cp1_ctl01_pnlPlayerStats"]/table[1]/tbody/tr[4]/td[18]');
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
A few things I've tried have thrown errors, and whenever I finally get past them or suppress them I get empty content. I've tried a bunch of different formats, but none seem to give me the desired content.
Edit: Here's the source HTML, I want to grab the value within the td (13.0).
Edit 2: So this is what I'm trying now:
$html = file_get_contents($P_RotoLink);
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXpath( $doc);
foreach ($xpath->query('//*[@id="cp1_ctl01_pnlPlayerStats"]/table//tr[4]/td[18]') as $node) {
$ppg = substr($node->textContent,0,3);
echo $ppg;
}
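The problem is that the table in the screenshot doesn't have a tbody node, but your XPath expression includes tbody, so DOMXPath::query returns an empty node list. Browsers insert tbody into the DOM they display, so an XPath copied from the inspector often contains it even when the source HTML does not. I suggest dropping tbody and selecting the rows with //tr.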
Example
$html = <<<'HTML'
<div id="cp1_ctl01_pnlPlayerStats">
<table>
<tr></tr>
<tr>
<td><span>0.9</span>1.0<span>3.0</span></td><td>2.0</td>
</tr>
</table>
</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$expr = '//*[@id="cp1_ctl01_pnlPlayerStats"]/table//tr[2]/td[1]/text()';
$td = $xp->query($expr);
if ($td->length) {
var_dump($td[0]->nodeValue);
}
Output
string(3) "1.0"
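text() is a node test that selects the text node children of the context node, so only the text directly inside the td is returned, not the text inside the nested span elements.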
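To make the difference concrete, here is a small sketch that compares text() with the string value of the same td. The markup is a cut-down version of the example above; the variable names are only for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<table><tr><td><span>0.9</span>1.0<span>3.0</span></td></tr></table>');
$xp = new DOMXPath($doc);

// text() matches only text nodes that are direct children of the td.
$direct = [];
foreach ($xp->query('//td/text()') as $textNode) {
    $direct[] = $textNode->nodeValue;
}

// The string value of the td concatenates the text of all its descendants.
$all = $xp->evaluate('string(//td)');

var_dump($direct); // one entry: "1.0"
var_dump($all);    // "0.91.03.0"
```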

get all <h2> tag and <p> tag text from mysql text datatype column

I am trying to split and fetch the p tag and h2 tag texts from a database column. I have tried the code below, but it returns only the first result. For example, in my database I have
<h2>india</h2><p>country</p><h2>dravid</h2><p>cricket player</p>
I want to fetch the h2 results and the paragraph results separately, but the code below returns only the first h2 and the first paragraph. How do I get the text of all h2 and p tags from the database?
$getdata = $res['review_content'];
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($getdata); // loads your html
$xpath = new DOMXPath($doc);
$heading = $xpath->evaluate("string(//h2/text())");
// paragraph text
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($getdata); // loads your html
$xpath = new DOMXPath($doc);
$paragraph = $xpath->evaluate("string(//p/text())");
When I echo $heading it prints only india, but I want to display both india and dravid.
Try the code below. It first parses the HTML into a DOM object, then finds the elements by their tag name with getElementsByTagName and reads the content of each tag from its textContent property.
<?php
$getdata = '<h2>india</h2><p>country</p><h2>dravid</h2><p>cricket player</p>';
libxml_use_internal_errors(true);
$pTag = array();
$h2Tag = array();
$xmlDoc = new DOMDocument();
$xmlDoc->loadHTML($getdata);
$searchNode = $xmlDoc->getElementsByTagName("p");
foreach($searchNode as $d){
$pTag[] = $d->textContent;
}
$searchNode = $xmlDoc->getElementsByTagName("h2");
foreach($searchNode as $d){
$h2Tag[] = $d->textContent;
}
// $pTag now holds the text of every p tag
// $h2Tag now holds the text of every h2 tag
?>
You can use the function getElementsByTagName.
Example:
$h2 = $doc->getElementsByTagName('h2');
$p = $doc->getElementsByTagName('p');
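If you would rather stay with XPath, the sketch below shows why the original code returned one value: in XPath 1.0, string() applied to a node-set uses only its first node. Looping over query() results collects every match. The array names $headings and $paragraphs are just illustrative.

```php
<?php
$getdata = '<h2>india</h2><p>country</p><h2>dravid</h2><p>cricket player</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($getdata);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// string() converts a node-set by taking its first node only.
$first = $xpath->evaluate('string(//h2)');

// query() returns every matching node; collect the text of each one.
$headings = [];
foreach ($xpath->query('//h2') as $h2) {
    $headings[] = $h2->textContent;
}
$paragraphs = [];
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = $p->textContent;
}

print_r($headings);   // india, dravid
print_r($paragraphs); // country, cricket player
```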
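You should try this code; it should help you get the desired result.
$db_string = html_entity_decode($file_contents); // string from the database
$doc = new DOMDocument();
libxml_use_internal_errors(true);
// loadHTML accepts a fragment with several top-level tags; loadXML needs a single root element
$doc->loadHTML($db_string);
$para = $doc->getElementsByTagName("p");
$a = $doc->getElementsByTagName("a");
$para_values = array();
foreach ($para as $p_tag) {
// each $p_tag is already a DOMElement, so there is no item() to call
$para_values[] = $p_tag->nodeValue;
}
$a_values = array();
foreach ($a as $a_tag) {
$a_values[] = $a_tag->nodeValue;
}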

DOM Parser grabbing href of <a> tag by class="Decision"

I'm working with a DOM parser and I'm having issues. I'm basically trying to grab the href of each <a> tag whose class is 'thumbnail '. I've been trying to print the links on the screen and still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still get nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
You were almost there:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
public $length =>
int(25)
}
25 matches, I'd call it success.
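The contains(concat(...)) expression is a common XPath 1.0 idiom for matching one class token inside a space-separated class attribute. A small offline sketch follows, with made-up markup, to show what it matches compared with an exact comparison of the attribute value.

```php
<?php
// Hypothetical markup: only the first two links carry the "thumbnail" class token.
$html = '<body>
<a class="thumbnail may-blank" href="/a">A</a>
<a class="thumbnail" href="/b">B</a>
<a class="thumbnails" href="/c">C</a>
<a class="title" href="/d">D</a>
</body>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Pad the normalized class list with spaces, then look for " thumbnail ".
$expr = "//a[contains(concat(' ', normalize-space(@class), ' '), ' thumbnail ')]";
$hrefs = [];
foreach ($xpath->query($expr) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

// An exact comparison only matches when "thumbnail" is the whole attribute value.
$exactCount = $xpath->query('//a[@class="thumbnail"]')->length;

print_r($hrefs);        // /a and /b
var_dump($exactCount);  // int(1)
```

The padding with spaces is what keeps "thumbnails" from matching, which a plain contains(@class, 'thumbnail') or strpos check would not do.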
This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');
foreach($hyperlinks as $hyperlink) {
echo $hyperlink->getAttribute('href'), '<br>';
}
If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps everything you need: http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// find() returns an array of elements, so print each href:
foreach ($links as $link) {
echo $link->href, "\n";
}
I tested it and made some changes. This works perfectly too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();
// loop thru all the A elements found
foreach($dom->getElementsByTagName('a') as $link) {
$url = $link->getAttribute('href');
$class = $link->getAttribute('class');
// Check if the URL is not empty and if the class contains thumbnail
if(!empty($url) && strpos($class,'thumbnail') !== false) {
array_push($links, $url);
}
}
// Print results
print_r($links);
?>

Strip links, keep markup and text (with or without specific domains) [duplicate]

I am trying to remove certain links depending on their id attribute, but keep the content of the link. For example I want to turn
Some text goes here
to
Some text goes here
I have tried using the below.
$dom = new DOMDocument;
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xp = new DOMXPath($dom);
foreach($xp->query('//a[contains(@id="remove")]') as $oldNode) {
$revised = strip_tags($oldNode);
}
$revised = mb_substr($dom->saveXML($xp->query('//body')->item(0)), 6, -7, "UTF-8");
echo $revised;
This is roughly taken from here, but it just spits back the same content as $html.
Any ideas on how I would achieve this?
That's my function for that:
function DOMRemove(DOMNode $from) {
$sibling = $from->firstChild;
do {
$next = $sibling->nextSibling;
$from->parentNode->insertBefore($sibling, $from);
} while ($sibling = $next);
$from->parentNode->removeChild($from);
}
So this:
$dom->loadHTML('Hello <a><span>World</span></a>');
$a = $dom->getElementsByTagName('a')->item(0); // get first
DOMRemove($a);
Should give you:
Hello <span>World</span>
To get nodes with a specific ID, use XPath:
$xpath = new DOMXpath($dom);
$node = $xpath->query('//a[@id="something"]')->item(0); // get first
DOMRemove($node);
An approach similar to @netcoder's answer but using a different loop structure and DOMElement methods.
$html = '<html><body>This <a id="remove">link</a> was removed.</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[@id="remove"]') as $link) {
// Move all link tag content to its parent node just before it.
while($link->hasChildNodes()) {
$child = $link->removeChild($link->firstChild);
$link->parentNode->insertBefore($child, $link);
}
// Remove the link tag.
$link->parentNode->removeChild($link);
}
$html = $dom->saveXML();
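For reference, here is a self-contained sketch of the same unwrapping loop run against made-up markup that has one link to strip and one to keep. The sample HTML and the helper name unwrapLinks are assumptions for illustration only.

```php
<?php
// Move the children of every <a id="..."> out in front of it, then drop the empty link.
function unwrapLinks(DOMDocument $dom, string $id): void
{
    $xpath = new DOMXPath($dom);
    // The id is interpolated directly, so it must not contain a double quote.
    foreach ($xpath->query('//a[@id="' . $id . '"]') as $link) {
        while ($link->hasChildNodes()) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<p>Some <a id="remove" href="/x"><b>text</b></a> goes <a id="keep" href="/y">here</a></p>');
libxml_clear_errors();

unwrapLinks($dom, 'remove');

// Serialize only the paragraph rather than the whole wrapped document.
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, "\n";
```

The XPath result list is a static snapshot, so removing nodes inside the loop does not disturb the iteration.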
Use:
//a[@id='remove']/node()
|
//*[a[@id='remove']]/node()[not(self::a[@id='remove'])]
This selects all children of any a whose id attribute is "remove", together with all preceding and following siblings of that a that are not themselves an a whose id attribute is "remove".

how to handle DOM in PHP

My PHP code
$dom = new DOMDocument();
@$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="text"]');
foreach ($tags as $tag) {
echo $tag->textContent;
}
What I'm trying to do here is get the content of the div that has class 'text'. The problem is that when I loop and echo the results I only get the text; I can't get the HTML with the images and tags like p, br, img, etc. I also tried $tag->nodeValue, but that didn't work either.
Personally, I like Simple HTML Dom Parser.
include "lib.simple_html_dom.php";
$html = str_get_html($file);
foreach($html->find('div.text') as $e){
echo $e->innertext;
}
Pretty simple, huh? It accommodates selectors like jQuery :)
What you need to do is create a temporary document, add the element to that and then use saveHTML():
foreach ($tags as $tag) {
$doc = new DOMDocument;
$doc->appendChild($doc->importNode($tag, true));
$html = $doc->saveHTML();
}
I found this snippet at http://www.php.net/manual/en/class.domelement.php:
<?php
function getInnerHTML($Node)
{
$Body = $Node->ownerDocument->documentElement->firstChild->firstChild;
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Body,true));
return $Document->saveHTML();
}
?>
Not sure if it works though.
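Another option, sketched here under the assumption that PHP 5.3.6 or later is available: DOMDocument::saveHTML() accepts a node argument, so the inner HTML of a div can be built by serializing each of its child nodes. The sample markup and the helper name innerHtml are made up for illustration.

```php
<?php
// Concatenate the serialized form of every child node of $node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical input standing in for $file from the question.
$file = '<div class="text"><p>Hello<br>world</p><img src="a.png" alt="pic"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($file);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$inner = [];
foreach ($xpath->query('//div[@class="text"]') as $tag) {
    $inner[] = innerHtml($tag); // markup inside the div, without the div itself
}

echo $inner[0], "\n";
```

Whitespace in the serialized output can vary between libxml versions, so it is safer to compare the result loosely than byte for byte.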
