How do I edit the HTML of elements? I tried this, but I get this error:
Fatal error: Uncaught InvalidArgumentException: Attaching DOM nodes
from multiple documents in the same crawler is forbidden.
$crawler = new Crawler('<h1>The title</h1>');
$crawler
->filter('h1,h2,h3,h4,h5,h6')
->each(function (Crawler $crawler, $i) use (&$replace) {
$crawler->html('<span>test</span>' . $crawler->html());
});
Use this:
$doc = new DOMDocument;
$doc->loadHTML($html);
$crawler = new Crawler($doc);
$crawler
->filter('h1,h2,h3,h4,h5,h6')
->each(function (Crawler $crawler) use ($doc) {
foreach ($crawler as $node) {
$span = $doc->createElement('span', 'test');
$node->parentNode->insertBefore($span, $node);
}
});
Important: create the new tag with the same DOMDocument object that the Crawler object uses.
As explained in The DomCrawler Component docs:
An instance of the Crawler represents a set of DOMElement objects, which are nodes that can be traversed...
So, you need to traverse the Crawler object before you manipulate the DOMElements.
This is the current code that I have for scraping.
$item is the HTML for the div HTML within the loop.
$doc = DOMDocument::loadHTML($item);
$xpath = new DOMXPath($doc);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
// do work here
}
I am changing the first two lines to be...
$doc = new DOMDocument();
$xpath = $doc->load($item);
With that, I am getting the following error...
Fatal error: Uncaught Error: Call to a member function query() on bool in
The error comes from $entries = $xpath->query($link); and I cannot figure out what to change this line to.
Any help would be appreciated.
UPDATE:
same error
$doc = new DOMDocument();
$xpath = $doc->loadHTML($item);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
// do work here
}
Look at the return value from DOMDocument::load()...
Returns true on success or false on failure. If called statically, returns a DOMDocument or false on failure.
Emphasis: Mine. Notice that you're not calling it statically anymore with your change.
So, with code like $xpath = $doc->load($item);, of course $xpath will be a bool (true or false), and your error makes total sense: Fatal error: Uncaught Error: Call to a member function query() on bool.
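To make that concrete, here is a minimal sketch of the corrected flow. The sample markup assigned to $item is made up for illustration, since the real HTML is not shown in the question. The point is that loadHTML() fills the document in place and only returns a bool, so the DOMXPath has to be built from $doc itself.

```php
<?php
// Hypothetical sample markup standing in for the real $item.
$item = '<div><a class="s-item__link" href="/one">One</a>'
      . '<a class="other" href="/two">Two</a></div>';

$doc = new DOMDocument();

// loadHTML() populates $doc and returns true/false; it does not return an XPath object.
$loaded = $doc->loadHTML($item);
if ($loaded === false) {
    throw new RuntimeException('Could not parse the HTML');
}

// Build the XPath helper from the document, not from the return value.
$xpath = new DOMXPath($doc);
$entries = $xpath->query("//a[@class='s-item__link']");

$hrefs = [];
foreach ($entries as $entry) {
    $hrefs[] = $entry->getAttribute('href');
}
```

Only the link whose class attribute is exactly s-item__link ends up in $hrefs.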
I just scooped out the Xpath stuff I'm using right now for my own PHP scraper. This should work...
$dom = new DOMDocument;
@$dom->loadHTML(mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8'));
$xpath = new DOMXPath($dom);
Explanation:
new DOMDocument : New class instance of DOMDocument().
@$dom->loadHTML : The @ symbol suppresses warnings. This class is very wordy with its errors, and you don't want to see them all the time.
mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8') : loadHTML() does not assume UTF-8 unless the markup declares it, so non-ASCII characters are converted to HTML entities first. Note that the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
new DOMXPath($dom); : New class instance of DOMXPath().
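If you would rather not use the error-suppression operator, one alternative is to let libxml collect the parse errors internally. This is only a sketch: the value of $htmltext is an invented sample, and what ends up in $errors depends on the libxml version installed.

```php
<?php
// Hypothetical, slightly malformed sample input.
$htmltext = '<p>Unclosed <b>markup</p><custom-tag>x</custom-tag>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($htmltext);

// Inspect the collected errors if needed, then clear the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$count = $xpath->query('//p')->length;
```

Restoring the previous setting keeps the change local to this one parse.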
->load expects a filename as first parameter as shown in the documentation.
In your first code block, you use loadHTML.
Use ->loadHTML instead of ->load on an empty DOMDocument, and build the DOMXPath from the document rather than from the return value:
$doc = new DOMDocument();
$doc->loadHTML($item);
$xpath = new DOMXPath($doc);
public load ( string $filename , int $options = 0 ) : DOMDocument|bool
public loadHTML ( string $source , int $options = 0 ) : DOMDocument|bool
public loadHTMLFile ( string $filename , int $options = 0 ) : DOMDocument|bool
I am in the middle of a process that should extract something from a HTML page. I am fairly new to DomDocument in PHP, but I got this together from some tutorials and Stack Overflow.
Unfortunately, I need to know what element I am currently getting in the foreach loop below. As far as I know, the getName() function has something to do with XML, because it gives an Undefined Function Fatal error. Do you guys know any way to do this?
$rawdom = new DOMDocument();
$rawdom->loadHTML($page);
$finder = new DomXPath($rawdom);
$nodes = $finder->query("//dl[contains(@class, 'layout__definitionlist')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node) {
echo $node->getName();
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = $tmp_dom->saveHTML();
echo $innerHTML;
With DOMElement objects, the element name is not accessible using a getName() function, but as property $tagName:
echo $node->tagName;
getName() is only available with SimpleXMLElement, which is another XML/XPath API for PHP.
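A short sketch of reading element names inside a loop, assuming an invented sample for $page. It uses nodeName, which exists on every DOMNode, whereas tagName is only defined on DOMElement; for elements the two give the same value.

```php
<?php
// Hypothetical sample markup standing in for the real $page.
$page = '<dl class="layout__definitionlist"><dt>Term</dt><dd>Definition</dd></dl>';

$rawdom = new DOMDocument();
$rawdom->loadHTML($page);
$finder = new DOMXPath($rawdom);

// Select the element children of the matching <dl> and record their names.
$names = [];
foreach ($finder->query("//dl[contains(@class, 'layout__definitionlist')]/*") as $node) {
    $names[] = $node->nodeName;
}
```

The trailing /* in the query matches element children only, so text nodes never reach the loop.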
I am trying to find the last paragraph tag in a block of HTML using DOMDocument/DOMXpath but can't seem to figure it out.
# Create DOMDocument Object
$dom = new DOMDocument;
# Load HTML into DomDocument Object
$dom->loadHTML($data['component2']);
# Create DOMXPath Object and load DOMDocument Object into XPath for magical goodness
$xpath = new DOMXPath($dom);
# Loop through each comment node
foreach($xpath->query('//p') as $node) {
// krumo($node->parentNode);
print_r($node->parentNode->lastChild);
}
exit();
The print_r returns an empty DOMText Object ( )... any idea on how to find the last paragraph in a block of HTML using DOMDocument/DOMXPath?
Working Code:
# Create DOMDocument Object
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
# Load HTML into DomDocument Object
$dom->loadHTML($data['component2']);
# Create DOMXPath Object and load DOMDocument Object into XPath for magical goodness
$xpath = new DOMXPath($dom);
$q = $xpath->query('//div[@class="t_content"]/p[last()]');
$data['component2'] = str_replace(utf8_decode($q->item(0)->nodeValue), "", $data['component2']);
Use this instead:
print_r($node->parentNode->lastChild->nodeValue);
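Another option is to let XPath pick the last paragraph directly, which avoids lastChild altogether. A likely reason lastChild returned a DOMText is that it can be a whitespace text node following the final paragraph rather than the paragraph itself. The sketch below assumes an invented sample for the HTML block.

```php
<?php
// Hypothetical sample standing in for $data['component2'].
$html = "<div class=\"t_content\">\n<p>First</p>\n<p>Last</p>\n</div>";

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (//p)[last()] selects the last <p> in document order across the whole document.
$last = $xpath->query('(//p)[last()]')->item(0);

// item(0) is null when the document contains no <p> at all.
$text = $last !== null ? $last->nodeValue : null;
```

Note the parentheses: //p[last()] without them would match the last paragraph of every parent, not just the final one in the document.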
Is there any way I can insert an HTML template to existing DOMNode without content being encoded?
I have tried to do that with:
$dom->createElement('div', '<h1>Hello world</h1>');
$dom->createTextNode('<h1>Hello world</h1>');
The output is pretty much the same, with only difference that first code would wrap it in a div.
I have tried to loadHTML from a string, but I have no idea how to append its body content to another DOMDocument.
In javascript, this process seems to be quite simple and obvious.
You can use
DOMDocumentFragment::appendXML — Append raw XML data
Example:
// just some setup
$dom = new DOMDocument;
$dom->loadXml('<html><body/></html>');
$body = $dom->documentElement->firstChild;
// this is the part you are looking for
$template = $dom->createDocumentFragment();
$template->appendXML('<h1>This is <em>my</em> template</h1>');
$body->appendChild($template);
// output
echo $dom->saveXml();
Output:
<?xml version="1.0"?>
<html><body><h1>This is <em>my</em> template</h1></body></html>
If you want to import from another DOMDocument, replace the three lines with
$tpl = new DOMDocument;
$tpl->loadXml('<h1>This is <em>my</em> template</h1>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
Using TRUE as the second argument to importNode will do a recursive import of the node tree.
If you need to import (malformed) HTML, change loadXml to loadHTML. This will trigger the HTML parser of libxml (what ext/DOM uses internally):
libxml_use_internal_errors(true);
$tpl = new DOMDocument;
$tpl->loadHtml('<h1>This is <em>malformed</em> template</h2>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
libxml_use_internal_errors(false);
Note that libxml will try to correct the markup, e.g. it will change the wrong closing </h2> to </h1>.
It works with another DOMDocument for parsing the HTML code. But you need to import the nodes into the main document before you can use them in it:
$newDiv = $dom->createElement('div');
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($str);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $dom->importNode($node, true);
$newDiv->appendChild($node);
}
And as a handy function:
function appendHTML(DOMNode $parent, $source) {
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($source);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $parent->ownerDocument->importNode($node, true);
$parent->appendChild($node);
}
}
Then you can simply do this:
$elem = $dom->createElement('div');
appendHTML($elem, '<h1>Hello world</h1>');
I do not want to struggle with XML, because it throws errors faster, and I am not a fan of prefixing an @ to suppress error output. loadHTML does the better job in my opinion, and it is as simple as this:
$doc = new DOMDocument();
$div = $doc->createElement('div');
// use a helper to load the HTML into a string
$helper = new DOMDocument();
$helper->loadHTML('This is my HTML Link.');
// now the magic!
// import the document node of the $helper object deeply (true)
// into the $div and append as child.
$div->appendChild($doc->importNode($helper->documentElement, true));
// add the div to the $doc
$doc->appendChild($div);
// final output
echo $doc->saveHTML();
Here is a simple example using DOMDocumentFragment:
$doc = new DOMDocument();
$doc->loadXML("<root/>");
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$doc->documentElement->appendChild($f);
echo $doc->saveXML();
Here is a helper function for replacing a DOMNode:
/**
* Helper function for replacing $node (DOMNode)
* with an XML code (string)
*
* @param DOMNode $node
* @param string $xml
*/
public function replaceNodeXML(&$node, $xml) {
$f = $this->dom->createDocumentFragment();
$f->appendXML($xml);
$node->parentNode->replaceChild($f,$node);
}
Source: Some old "PHP5 Dom Based Template" article.
And here is another suggestion, posted by Pian0_M4n, to use a value attribute as a workaround:
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string &copy;</div><i></i> with all the &copy; that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up
Not ideal, but at least it works.
Gumbo's code works perfectly! Just one small enhancement: add the TRUE parameter so that it works with nested HTML snippets. Change
$node = $parent->ownerDocument->importNode($node);
to
$node = $parent->ownerDocument->importNode($node, TRUE);
I've got my HTML inside of $html.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@id="header"]');
foreach($tags as $tag) {
var_dump($tag->nodeValue); // the innerHTML of that element
var_dump($tag); // object(DOMElement)#3 (0) { }
}
Is there a way to get that node, or remove it?
Basically, I'm parsing an existing website and need to remove elements from it. What method do I call to do that?
Thanks
Have you checked out DOMNode::removeChild ?
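As a rough sketch of how that could look, assuming an invented sample for $html: removeChild() is called on the parent of the node you want to drop, so each match is removed through its own parentNode.

```php
<?php
// Hypothetical sample markup standing in for the real $html.
$html = '<html><body><div id="header">Header</div><p>Body text</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Defensive copy of the matches. This is harmless for XPath results and
// matters for live lists such as those from getElementsByTagName().
$tags = iterator_to_array($xpath->query('//div[@id="header"]'));

foreach ($tags as $tag) {
    // A node is removed by asking its parent to drop it.
    $tag->parentNode->removeChild($tag);
}

$result = $dom->saveHTML();
```

After the loop, $result holds the serialized document without the header div.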