We have this code
$page = file_get_contents('http://example.aspx?a=14&c=14213&med=0');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('table');
foreach($divs as $div) {
// Loop through the tableĀ“s looking for one withan id of "Table2"
// Then echo out its contents
if ($div->getAttribute('id') === 'Table2') {
echo $div->childNodes;
}
}
As you see the code works, but outputs plain text, because the function of childnodes, but we need to output the code of "Table2" instead of plain text.
How can I do this?
Solved, with this code
$dom = new DOMDocument();
$data = file_get_contents('http://example.aspx?a=14&c=14213&med=0');
$dom->loadHTML($data); // $data is your html code, grab it using file_get_contents or cURL.
$xpath = new DOMXPath($dom);
$div = $xpath->query('//table[#id="Table2"]');
$div = $div->item(0);
echo $dom->saveXML($div);
Related
I'm getting the content from a DOMDocument and using query to get a certain part of the DOM.
I get the content perfectly but not the HTML tags, only pure text.
Here is the code I use:
$DOM = file_get_contents($link);
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML($DOM);
$finder = new DomXPath($doc);
$founds = 0;
$nodes = $finder->query("//div[contains(#class, 'td-post-content td-pb-padding-side')]");
foreach ( $nodes as $entrada ) {
if ( $founds == 0 ) {
$description = $entrada->nodeValue;
} $founds++;
}
Once I echo $description, no html tags like <p> <div> and so on, are not included.
What's wrong?
I'm working with a DOM parser and I'm having issues. I'm basically trying to grab the href within the tag that only contain the class ID of 'thumbnail '. I've been trying to print the links on the screen and still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
#$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
#$dom->loadHTML($html);
// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
You were almost there:
<?php
$dom = new DOMDocument();
#$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(#class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
public $length =>
int(25)
}
25 matches, I'd call it success.
This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
#$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[#class="thumbnail"]');
foreach($hyperlinks as $hyperlink) {
echo $hyperlink->getAttribute('href'), '<br>;'
}
if you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need -- http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// done.
echo $links;
Tested it and made some changes - this works perfect too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
#$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();
// loop thru all the A elements found
foreach($dom->getElementsByTagName('a') as $link) {
$url = $link->getAttribute('href');
$class = $link->getAttribute('class');
// Check if the URL is not empty and if the class contains thumbnail
if(!empty($url) && strpos($class,'thumbnail') !== false) {
array_push($links, $url);
}
}
// Print results
print_r($links);
?>
I have a var of a HTTP (craigslist) link $link, and put the contents into $linkhtml. In this var is the HTML code for a craigslist page, $link.
I need to extract the text between <h2> and </h2>. I could use a regexp, but how do I do this with PHP DOM? I have this so far:
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
#$dom->loadHTML($linkhtml);
What do I do next to put the contents of the element <h2> into a var $title?
if DOMDocument looks complicated to understand/use to you, then you may try PHP Simple HTML DOM Parser which provides the easiest ever way to parse html.
require 'simple_html_dom.php';
$html = '<h1>Header 1</h1><h2>Header 2</h2>';
$dom = new simple_html_dom();
$dom->load( $html );
$title = $dom->find('h2',0)->plaintext;
echo $title; // outputs: Header 2
You can use this code:
$linkhtml= file_get_contents($link);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($linkhtml); // loads your html
$xpath = new DOMXPath($doc);
$h2text = $xpath->evaluate("string(//h2/text())");
// $h2text is your text between <h2> and </h2>
You can do this with XPath: untested, may contain errors
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
#$dom->loadHTML($linkhtml);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("/html/body/h2");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
I'm following a simplified version of the scraping tutorial by NetTuts here, which basically finds all divs with class=preview
http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/comment-page-1/#comments
This is my code. The problem is that when I count $items I get only 1, so it's getting only the first div with class=preview, not all of them.
$articles = array();
$html = new simple_html_dom();
$html->load_file('http://net.tutsplus.com/page/76/');
$items = $html->find('div[class=preview]');
echo "count: " . count($items);
Try using DOMDocument and DOMXPath:
$file = file_get_contents('http://net.tutsplus.com/page/76/');
$dom = new DOMDocument();
#$dom->loadHTML($file);
$domx = new DOMXPath($dom);
$nodelist = $domx->evaluate("//div[#class='preview']");
foreach ($nodelist as $node) { print $node->nodeValue; }
I am trying to remove certain links depending on their ID tag, but leave the content of the link. For example I want to turn
Some text goes here
to
Some text goes here
I have tried using the below.
$dom = new DOMDocument;
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xp = new DOMXPath($dom);
foreach($xp->query('//a[contains(#id="remove")]') as $oldNode) {
$revised = strip_tags($oldNode);
}
$revised = mb_substr($dom->saveXML($xp->query('//body')->item(0)), 6, -7, "UTF-8");
echo $revised;
roughly taken from here but it just spits back the same content of $html.
Any idea's on how I would achieve this?
That's my function for that:
function DOMRemove(DOMNode $from) {
$sibling = $from->firstChild;
do {
$next = $sibling->nextSibling;
$from->parentNode->insertBefore($sibling, $from);
} while ($sibling = $next);
$from->parentNode->removeChild($from);
}
So this:
$dom->loadHTML('Hello <span>World</span>');
$a = $dom->getElementsByTagName('a')->item(0); // get first
DOMRemove($a);
Should give you:
Hello <span>World</span>
To get nodes with a specific ID, use XPath:
$xpath = new DOMXpath($dom);
$node = $xpath->query('//a[#id="something"]')->item(0); // get first
DOMRemove($node);
An approach similar to #netcoder's answer but using a different loop structure and DOMElement methods.
$html = '<html><body>This link was removed.</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[#id="remove"]') as $link) {
// Move all link tag content to its parent node just before it.
while($link->hasChildNodes()) {
$child = $link->removeChild($link->firstChild);
$link->parentNode->insertBefore($child, $link);
}
// Remove the link tag.
$link->parentNode->removeChild($link);
}
$html = $dom->saveXML();
Use:
//a[#id='remove']/node()
|
//*[a[#id='remove']]/node()[not(self::a[#id=''remove])]
This selects all children of any a having attribute id with value "remove" and all preceding and following siblings of this a that are not themselves another a having attribute id with value of "remove"