XPATH/PHP - Smarter way to accomplish this? - php

I have the following:
$html = '<a href="/path/to/page.html"><img src="path/to/image.jpg" alt="Alt name" />Page name</a>';
I need to extract the href and src attributes and the anchor text.
My solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
$href = $node->getAttribute('href');
$title = $node->nodeValue;
}
foreach ($dom->getElementsByTagName('img') as $node) {
$img = $node->getAttribute('src');
}
What would be the smarter way?

You can avoid the loops if you use DOMXPath to grab the elements directly:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath( $dom);
$a = $xpath->query( '//a')->item( 0); // Get the first <a> node
$img = $xpath->query( './/img', $a)->item( 0); // Get the <img> inside that <a> (the leading dot keeps the search relative to $a)
Now, you can do:
echo $a->getAttribute('href');
echo $a->nodeValue;
echo $img->getAttribute('src');
This will print:
/path/to/page.html
Page name
path/to/image.jpg

Possible alternative approach:
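$dom = new DOMDocument;
$dom->loadHTML($html); // loadHTML() is an instance method; calling it statically fails on PHP 8
$domXpath = new DOMXPath($dom);
$href = $domXpath->query('//a/@href')->item(0)->nodeValue;
$src = $domXpath->query('//img/@src')->item(0)->nodeValue;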
Empty/null checks are up to you.
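The empty/null checks mentioned above could look roughly like this. It is only a sketch, assuming the sample markup from the question; the helper name firstAttribute is hypothetical and not part of the DOM extension.

```php
<?php
// Hypothetical helper: returns the first attribute value matched by an
// XPath expression, or null when the expression is invalid or matches nothing.
function firstAttribute(DOMXPath $xpath, string $expression): ?string
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

$html = '<a href="/path/to/page.html"><img src="path/to/image.jpg" alt="Alt name" />Page name</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$href  = firstAttribute($xpath, '//a/@href');
$src   = firstAttribute($xpath, '//a/img/@src');
$title = firstAttribute($xpath, '//a/@title'); // not present in the sample, so null

var_dump($href, $src, $title);
```

Returning null instead of calling ->item(0)->nodeValue directly avoids an error on a missing node when the page does not contain the expected element.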

http://ca2.php.net/manual/en/function.preg-match.php - if you want to use regex
or
http://php.net/manual/en/book.simplexml.php
if you need to use xml parsing.
// Simple xml
$xml = simplexml_load_string($html);
$attr = $xml->attributes();
echo 'href: ' . $attr['href'] . PHP_EOL;

Related

Xpath nodeValue/textContent unable to see <BR> tag

HTML is as follows:
<a>ABC<BR>DEF</a>
However, both nodeValue and textContent attributes show "ABCDEF" as the value.
Any way to show or parse the <BR>?
Maybe this'll help you: DOMNode::C14N
It'll return the HTML of the node.
<?php
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
@$doc->loadHTML($a);
$finder = new DomXPath($doc);
$nodes = $finder->query("//a");
foreach ($nodes as $node) {
var_dump($node->c14n());
}
I know you have already solved your problem, but I wanted to add a more direct way of solving it...
$a = '<a>ABC<BR>DEF</a>';
$doc = new DOMDocument();
$doc->loadHTML($a);
$xp = new DomXPath($doc);
$nodes = $xp->query("//a/node()");
$text = '';
foreach ($nodes as $node) {
$text .= $doc->saveHTML($node);
}
echo $text;
Outputs...
ABC<br>DEF

get value of href inside of div from external site using PHP

Good day.
I want to find a certain HTML attribute on an external website.
I want to get the href value of an <a> element, but the problem is that its id, class, or name is random.
<div class="static">
Dynamic
</div>
This code should display all the hrefs in http://example.com
In this case I use DOMDocument and XPath to select the elements you want to access because it's very flexible and easy to use.
<?php
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//a/@href");
print_r($nodeList);
// To access the values inside nodes
foreach($nodeList as $node){
echo "<p>" . $node->nodeValue . "</p>";
}
Use jQuery to get the value as follows:
var link = $(".static>a").attr("href");
You can use PHP DOMDocument:
<?php
$exampleurl = "http://YourDomain.com"; //set your url
$filterClass = "dynamicclass";
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($exampleurl);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href'); // all href
$class = $element->getAttribute('class');
if($class==$filterClass){
echo $href;
}
}
?>
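The jQuery selector .static>a shown earlier can also be expressed in XPath on the PHP side, which avoids depending on the random class of the link itself. This is a sketch only: the sample markup, the href values, and the class x9f3k are assumptions standing in for the external page.

```php
<?php
// Sample markup standing in for the external page (an assumption).
$html = '<div class="static"><a href="/some/page" class="x9f3k">Dynamic</a></div>'
      . '<div class="other"><a href="/ignored">Other</a></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the href of <a> elements that are direct children of
// <div class="static">, regardless of their own id, class, or name.
$hrefs = [];
foreach ($xpath->query('//div[@class="static"]/a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

The predicate @class="static" only matches when the class attribute is exactly that string; a div carrying several classes would need a token-based test instead.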

DOM Parser grabbing href of <a> tag by class="Decision"

I'm working with a DOM parser and I'm having issues. I'm basically trying to grab the href within the tag that only contain the class ID of 'thumbnail '. I've been trying to print the links on the screen and still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
You were almost there:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
public $length =>
int(25)
}
25 matches, I'd call it success.
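To see why the longer expression above is worth the extra typing, here is a small comparison of three ways to test the class attribute. It is a sketch on made-up markup (the three links and their hrefs are assumptions), and the helper name hrefs is hypothetical.

```php
<?php
// Sample markup (an assumption) with three different class attributes.
$html = '<a class="thumbnail" href="/a">A</a>'
      . '<a class="thumbnail may-blank" href="/b">B</a>'
      . '<a class="thumbnails" href="/c">C</a>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: collects the href of every element matched by the expression.
function hrefs(DOMXPath $xpath, string $expression): array
{
    $out = [];
    foreach ($xpath->query($expression) as $node) {
        $out[] = $node->getAttribute('href');
    }
    return $out;
}

// Exact comparison: misses elements that carry additional classes.
$exact = hrefs($xpath, '//a[@class="thumbnail"]');

// Plain substring test: also matches unrelated classes such as "thumbnails".
$substring = hrefs($xpath, '//a[contains(@class, "thumbnail")]');

// Token test: pads the normalized class list with spaces and looks for " thumbnail ".
$token = hrefs($xpath, "//a[contains(concat(' ', normalize-space(@class), ' '), ' thumbnail ')]");

print_r(['exact' => $exact, 'substring' => $substring, 'token' => $token]);
```

On this sample the exact test finds only /a, the substring test finds all three links, and the token test finds /a and /b, which is the behaviour a CSS selector like a.thumbnail would have.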
This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');
foreach($hyperlinks as $hyperlink) {
echo $hyperlink->getAttribute('href'), '<br>';
}
If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need: http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// done.
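// find() returns an array of elements, so print each href in turn
foreach ($links as $link) {
echo $link->href, "\n";
}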
Tested it and made some changes - this works perfectly too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();
// loop thru all the A elements found
foreach($dom->getElementsByTagName('a') as $link) {
$url = $link->getAttribute('href');
$class = $link->getAttribute('class');
// Check if the URL is not empty and if the class contains thumbnail
if(!empty($url) && strpos($class,'thumbnail') !== false) {
array_push($links, $url);
}
}
// Print results
print_r($links);
?>

get value of <h2> of html page with PHP DOM?

I have a var of a HTTP (craigslist) link $link, and put the contents into $linkhtml. In this var is the HTML code for a craigslist page, $link.
I need to extract the text between <h2> and </h2>. I could use a regexp, but how do I do this with PHP DOM? I have this so far:
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
What do I do next to put the contents of the element <h2> into a var $title?
If DOMDocument looks too complicated to understand or use, you may try PHP Simple HTML DOM Parser, which provides a very easy way to parse HTML.
require 'simple_html_dom.php';
$html = '<h1>Header 1</h1><h2>Header 2</h2>';
$dom = new simple_html_dom();
$dom->load( $html );
$title = $dom->find('h2',0)->plaintext;
echo $title; // outputs: Header 2
You can use this code:
$linkhtml= file_get_contents($link);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($linkhtml); // loads your html
$xpath = new DOMXPath($doc);
$h2text = $xpath->evaluate("string(//h2/text())");
// $h2text is your text between <h2> and </h2>
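Two details of the snippet above may be worth spelling out: libxml_use_internal_errors(true) collects parse errors instead of emitting warnings (an alternative to the @ operator), and string(//h2/text()) returns only the first text node of the first <h2>. The following sketch illustrates both; the sample markup is an assumption standing in for the downloaded page.

```php
<?php
// Sample markup (an assumption) standing in for the downloaded page.
// The <h2> contains nested markup so the two XPath expressions differ.
$linkhtml = '<html><body><h2>Nice <b>bike</b> for sale</h2></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($linkhtml);

$errors = libxml_get_errors();         // array of LibXMLError objects (may be empty)
libxml_clear_errors();                 // free the collected errors
libxml_use_internal_errors($previous); // restore the earlier setting

$xpath = new DOMXPath($doc);

// First text node of the first <h2> only.
$firstTextNode = $xpath->evaluate('string(//h2/text())');

// String value of the first <h2>, including text inside nested elements.
$title = trim($xpath->evaluate('string(//h2)'));

echo $title, "\n";
```

If the heading on the real page contains nested tags, string(//h2) is probably the safer choice for filling $title.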
You can do this with XPath: untested, may contain errors
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//h2"); // match <h2> anywhere, not only as a direct child of <body>
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}

Simple HTML DOM gets only 1 element

I'm following a simplified version of the scraping tutorial by NetTuts here, which basically finds all divs with class=preview
http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/comment-page-1/#comments
This is my code. The problem is that when I count $items I get only 1, so it's getting only the first div with class=preview, not all of them.
$articles = array();
$html = new simple_html_dom();
$html->load_file('http://net.tutsplus.com/page/76/');
$items = $html->find('div[class=preview]');
echo "count: " . count($items);
Try using DOMDocument and DOMXPath:
$file = file_get_contents('http://net.tutsplus.com/page/76/');
$dom = new DOMDocument();
@$dom->loadHTML($file);
$domx = new DOMXPath($dom);
$nodelist = $domx->evaluate("//div[@class='preview']");
foreach ($nodelist as $node) { print $node->nodeValue; }
