Scraping with DOMDocument PHP - php

This is the current code that I have for scraping.
$item is the HTML for the div HTML within the loop.
$doc = DOMDocument::loadHTML($item);
$xpath = new DOMXPath($doc);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
// do work here
}
I am changing the first two lines to be...
$doc = new DOMDocument();
$xpath = $doc->load($item);
With that, I am getting the following error...
Fatal error: Uncaught Error: Call to a member function query() on bool in
The error comes from $entries = $xpath->query($link); and I cannot figure out what to change this line to.
Any help would be appreciated.
UPDATE:
same error
$doc = new DOMDocument();
$xpath = $doc->loadHTML($item);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
// do work here
}

Look at the return value from DOMDocument::load()...
Returns true on success or false on failure. If called statically, returns a DOMDocument or false on failure.
Emphasis mine. Notice that you're not calling it statically anymore with your change.
So, with code like $xpath = $doc->load($item);, $xpath will be a bool (true or false), not a DOMXPath object, and your error makes total sense: Fatal error: Uncaught Error: Call to a member function query() on bool.
I just scooped out the Xpath stuff I'm using right now for my own PHP scraper. This should work...
$dom = new DOMDocument;
@$dom->loadHTML(mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8'));
$xpath = new DOMXPath($dom);
Explanation:
new DOMDocument : New class instance of DOMDocument().
@$dom->loadHTML : The @ operator suppresses warnings. This class is very verbose about malformed markup, and you don't want to see those warnings all the time.
mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8') : loadHTML() assumes ISO-8859-1 unless the markup declares its charset, so converting UTF-8 text to HTML entities first keeps non-ASCII characters intact. Note that the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
new DOMXPath($dom); : New class instance of DOMXPath().
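Applied to the code from the question, the same pattern could be sketched as follows. The sample markup in $item is made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator here; that is a substitution, not what the answer above uses.

```php
<?php
// Hypothetical sample markup standing in for the $item from the question.
$item = '<div><a class="s-item__link" href="https://example.com/1">First</a>'
      . '<a class="other" href="https://example.com/2">Second</a></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// loadHTML() returns a bool here, so it must not be assigned to $xpath.
$loaded = $doc->loadHTML($item);
libxml_clear_errors();

$hrefs = [];
if ($loaded) {
    // The XPath object is built from the document, not from the return value.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//a[@class='s-item__link']") as $entry) {
        $hrefs[] = $entry->getAttribute('href');
    }
}

print_r($hrefs);
```

Only the anchor whose class attribute is exactly s-item__link is matched, so $hrefs ends up with a single entry.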

->load expects a filename as first parameter as shown in the documentation.
In your first code block, you use loadHTML.
Use ->loadHTML instead of ->load on an empty DOMDocument, then build the DOMXPath from the document rather than from the return value:
$doc = new DOMDocument();
$doc->loadHTML($item);
$xpath = new DOMXPath($doc);
public load ( string $filename , int $options = 0 ) : DOMDocument|bool
public loadHTML ( string $source , int $options = 0 ) : DOMDocument|bool
public loadHTMLFile ( string $filename , int $options = 0 ) : DOMDocument|bool
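To make the difference concrete, here is a small sketch that calls both methods on an instance and inspects what comes back. The file name is hypothetical and is assumed not to exist on disk.

```php
<?php
// Keep libxml from printing warnings for the failed load below.
libxml_use_internal_errors(true);

$doc = new DOMDocument();

// loadHTML() parses a string of markup.
$fromString = $doc->loadHTML('<p>hello</p>');

// load() treats its argument as a file path or URL, not as markup.
// The file name is hypothetical and assumed not to exist.
$fromFile = $doc->load('no-such-file-for-this-example.xml');
libxml_clear_errors();

var_dump($fromString); // a bool, not a DOMDocument
var_dump($fromFile);   // also a bool
```

In both cases the instance call hands back a bool, which is why the result cannot be used as a DOMXPath object.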

Related

DOMDocument returns empty data

Execution never enters the if block below. An exact copy of this code on another virtual server with the same PHP version does enter it and returns the result from cURL using a web page as the resource. Is there something wrong with this snippet that I am missing? (I am trying to be succinct, but there is more code.) Any related tips would be helpful.
class News extends General
{
public $CU;
// Constructor
public function __construct()
{
$this->CU = new CURL();
$dom = new DOMDocument();
@$dom->loadHTML($this->CU->Response['Body']);
$xpath = new DOMXPath($dom);
}
$result_list = $xpath->query("//div[contains(@class, 's-search-results')]//div[contains(@class, 's-result-item')]//h2//a");
$price_list = $xpath->query("//div[contains(@class, 's-search-results')]//div[contains(@class, 's-result-item')]//span[contains(@data-a-color, 'base')]//span[contains(@class, 'a-offscreen')]");
if($result_list->length > 0 && $result_list->length==$price_list->length)
{
ECHO "In the loop";
}
}

How to traverse child elements by xpath in DomNodeList?

I have this PHP script:
<?php
libxml_use_internal_errors(true);
/* Create a new DomDocument object */
$dom = new DomDocument;
$dom_grep = new DomDocument;
/* Load the HTML */
$dom->loadHTMLFile("http://domain.com/catalog/0_1.html");
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query all <table> nodes containing specified class name */
$nodes = $xpath->query("/html/.//table[@class='right']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* How to make Xpath in code below??? */
foreach ($nodes as $i => $node) {
$child[$i]["title"] = $node->query("//tr[@class='bg3']//h3");
$child[$i]["href"] = $node->query("a['href=/catalog/details']");
}
?>
But I got this error in result:
"Fatal error: Call to undefined method DOMElement::query()" in $child array
How can I run another XPath query relative to each node in $nodes?
Thank you!
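One common way to handle this: DOMElement has no query() method, but DOMXPath::query() accepts a context node as its second argument, so the same $xpath object can be reused inside the loop. The sketch below assumes made-up markup in place of the catalog page, and starts-with() is my guess at what the href filter in the question was meant to do.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical markup standing in for the catalog page from the question.
$html = '<table class="right"><tr class="bg3"><td><h3>Item A</h3>'
      . '<a href="/catalog/details/1">more</a></td></tr></table>'
      . '<table class="right"><tr class="bg3"><td><h3>Item B</h3>'
      . '<a href="/catalog/details/2">more</a></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$child = [];

foreach ($xpath->query("//table[@class='right']") as $i => $node) {
    // The leading "." keeps the search inside $node; a bare "//..."
    // would search the whole document again.
    $title = $xpath->query(".//tr[@class='bg3']//h3", $node)->item(0);
    $link  = $xpath->query(".//a[starts-with(@href, '/catalog/details')]", $node)->item(0);

    $child[$i] = [
        'title' => $title ? $title->textContent : null,
        'href'  => $link ? $link->getAttribute('href') : null,
    ];
}

print_r($child);
```

Each entry in $child then holds the title and link found inside its own table rather than the first match in the whole document.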

PHP DOMDocument And DOMXpath

I am trying to find the last paragraph tag in a block of HTML using DOMDocument/DOMXpath but can't seem to figure it out.
# Create DOMDocument Object
$dom = new DOMDocument;
# Load HTML into DomDocument Object
$dom->loadHTML($data['component2']);
# Create DOMXPath Object and load DOMDocument Object into XPath for magical goodness
$xpath = new DOMXPath($dom);
# Loop through each comment node
foreach($xpath->query('//p') as $node) {
// krumo($node->parentNode);
print_r($node->parentNode->lastChild);
}
exit();
The print_r returns an empty DOMText Object ( )... any idea on how to find the last paragraph in a block of HTML using DOMDocument/DOMXPath?
Working Code:
# Create DOMDocument Object
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
# Load HTML into DomDocument Object
$dom->loadHTML($data['component2']);
# Create DOMXPath Object and load DOMDocument Object into XPath for magical goodness
$xpath = new DOMXPath($dom);
$q = $xpath->query('//div[@class="t_content"]/p[last()]');
$data['component2'] = str_replace(utf8_decode($q->item(0)->nodeValue), "", $data['component2']);
Use this instead:
print_r($node->parentNode->lastChild->nodeValue);
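As a side note on the XPath itself, the position of last() matters. The sketch below, using made-up markup, contrasts //p[last()], which picks the last p under each parent, with (//p)[last()], which picks the single last p in the document.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical markup with a nested paragraph to show the difference.
$html = '<div class="t_content"><p>First</p><div><p>Nested</p></div><p>Last</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// last() is evaluated per parent: one match for each element that contains <p> children.
$perParent = $xpath->query('//p[last()]');

// Parentheses collect every <p> first, then last() picks the final one.
$overall = $xpath->query('(//p)[last()]');

foreach ($perParent as $p) {
    echo 'per parent: ', $p->textContent, "\n";
}
echo 'overall: ', $overall->item(0)->textContent, "\n";
```

Querying for the element directly also avoids relying on lastChild, which can be a text node rather than a paragraph.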

How to insert HTML to PHP DOMNode?

Is there any way I can insert an HTML template to existing DOMNode without content being encoded?
I have tried to do that with:
$dom->createElement('div', '<h1>Hello world</h1>');
$dom->createTextNode('<h1>Hello world</h1>');
The output is pretty much the same, with only difference that first code would wrap it in a div.
I have tried to loadHTML from a string, but I have no idea how to append its body content to another DOMDocument.
In javascript, this process seems to be quite simple and obvious.
You can use
DOMDocumentFragment::appendXML — Append raw XML data
Example:
// just some setup
$dom = new DOMDocument;
$dom->loadXml('<html><body/></html>');
$body = $dom->documentElement->firstChild;
// this is the part you are looking for
$template = $dom->createDocumentFragment();
$template->appendXML('<h1>This is <em>my</em> template</h1>');
$body->appendChild($template);
// output
echo $dom->saveXml();
Output:
<?xml version="1.0"?>
<html><body><h1>This is <em>my</em> template</h1></body></html>
If you want to import from another DOMDocument, replace the three lines with
$tpl = new DOMDocument;
$tpl->loadXml('<h1>This is <em>my</em> template</h1>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
Using TRUE as the second argument to importNode will do a recursive import of the node tree.
If you need to import (malformed) HTML, change loadXml to loadHTML. This will trigger the HTML parser of libxml (what ext/DOM uses internally):
libxml_use_internal_errors(true);
$tpl = new DOMDocument;
$tpl->loadHtml('<h1>This is <em>malformed</em> template</h2>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
libxml_use_internal_errors(false);
Note that libxml will try to correct the markup, e.g. it will change the wrong closing </h2> to </h1>.
It works with another DOMDocument for parsing the HTML code. But you need to import the nodes into the main document before you can use them in it:
$newDiv = $dom->createElement('div');
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($str);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $dom->importNode($node, true);
$newDiv->appendChild($node);
}
And as a handy function:
function appendHTML(DOMNode $parent, $source) {
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($source);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $parent->ownerDocument->importNode($node, true);
$parent->appendChild($node);
}
}
Then you can simply do this:
$elem = $dom->createElement('div');
appendHTML($elem, '<h1>Hello world</h1>');
I do not want to struggle with XML, because it throws errors faster, and I am not a fan of prefixing an @ to prevent error output. loadHTML does the better job in my opinion, and it is as simple as this:
$doc = new DOMDocument();
$div = $doc->createElement('div');
// use a helper to load the HTML into a string
$helper = new DOMDocument();
$helper->loadHTML('This is my HTML Link.');
// now the magic!
// import the document node of the $helper object deeply (true)
// into the $div and append as child.
$div->appendChild($doc->importNode($helper->documentElement, true));
// add the div to the $doc
$doc->appendChild($div);
// final output
echo $doc->saveHTML();
Here is a simple example using DOMDocumentFragment:
$doc = new DOMDocument();
$doc->loadXML("<root/>");
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$doc->documentElement->appendChild($f);
echo $doc->saveXML();
Here is a helper function for replacing a DOMNode:
/**
* Helper function for replacing $node (DOMNode)
* with an XML code (string)
*
* @param DOMNode $node
* @param string $xml
*/
public function replaceNodeXML(&$node, $xml) {
$f = $this->dom->createDocumentFragment();
$f->appendXML($xml);
$node->parentNode->replaceChild($f,$node);
}
Source: Some old "PHP5 Dom Based Template" article.
And here is another suggestion posted by Pian0_M4n to use value attribute as workaround:
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string &copy;</div><i></i> with all the &copy; that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up
Not ideal, but at least it works.
Gumbo's code works perfectly! One small enhancement: pass TRUE as the second argument to importNode so that it also works with nested HTML snippets. Change
$node = $parent->ownerDocument->importNode($node);
to
$node = $parent->ownerDocument->importNode($node, TRUE);

Can I get the matched DOM string with PHP and DOMDocument?

I've got my HTML inside of $html.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@id="header"]');
foreach($tags as $tag) {
var_dump($tag->nodeValue); // the innerHTML of that element
var_dump($tag); // object(DOMElement)#3 (0) { }
}
Is there a way to get that node, or remove it?
Basically, I'm parsing an existing website and need to remove elements from it. What method do I call to do that?
Thanks
Have you checked out DOMNode::removeChild ?
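A rough sketch of both parts of the question, assuming a small made-up page in $html: DOMDocument::saveHTML() accepts a node and returns its markup as a string, and removeChild() is called on the parent of the node to be removed.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical page standing in for the site being parsed.
$html = '<html><body><div id="header"><h1>Title</h1></div><p>Body text</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$removed = [];

foreach ($xpath->query('//div[@id="header"]') as $tag) {
    // saveHTML($node) serializes just that node, tags included.
    $removed[] = $dom->saveHTML($tag);
    // A node is removed through its parent.
    $tag->parentNode->removeChild($tag);
}

print_r($removed);

// The remaining document no longer contains the header div.
$remaining = $dom->saveHTML();
echo $remaining;
```

The node list returned by DOMXPath::query() is a static snapshot, so removing nodes inside the loop does not disturb the iteration.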
