Symfony DOMCrawler: How to change html? - php

How can I edit the HTML of elements? I tried this, but I get this error:
Fatal error: Uncaught InvalidArgumentException: Attaching DOM nodes
from multiple documents in the same crawler is forbidden.
$crawler = new Crawler('<h1>The title</h1>');
$crawler
->filter('h1,h2,h3,h4,h5,h6')
->each(function (Crawler $crawler, $i) use (&$replace) {
$crawler->html('<span>test</span>' . $crawler->html());
});

Use this:
$doc = new DOMDocument;
$doc->loadHTML($html);
$crawler = new Crawler($doc);
$crawler
->filter('h1,h2,h3,h4,h5,h6')
->each(function (Crawler $crawler) use ($doc) {
foreach ($crawler as $node) {
$span = $doc->createElement('span', 'test');
$node->parentNode->insertBefore($span, $node);
}
});
Important: Create the new tag with the same DOMDocument object that the Crawler was built from.
As explained in The DomCrawler Component docs:
An instance of the Crawler represents a set of DOMElement objects, which are nodes that can be traversed...
So you need to iterate over the Crawler object to reach the underlying DOMElement nodes before manipulating them.
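The "same document" rule can be seen with plain DOMDocument objects, without the Crawler. The following is a minimal sketch, assuming a hard-coded sample page; the `$other` document and the `$wrongDocument` flag are illustrative names, not part of the original answer.

```php
<?php
// Two independent documents: nodes cannot be moved between them directly.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><h1>The title</h1></body></html>');
$other = new DOMDocument();

$h1 = $doc->getElementsByTagName('h1')->item(0);

// A node created by a different document is rejected with a DOMException.
$wrongDocument = false;
try {
    $foreign = $other->createElement('span', 'test');
    $h1->parentNode->insertBefore($foreign, $h1);
} catch (DOMException $e) {
    $wrongDocument = ($e->getCode() === DOM_WRONG_DOCUMENT_ERR);
}

// A node created by the owning document is accepted.
$span = $doc->createElement('span', 'test');
$h1->parentNode->insertBefore($span, $h1);

echo $doc->saveHTML($doc->getElementsByTagName('body')->item(0)), PHP_EOL;
```

If a node really has to come from another document, DOMDocument::importNode() can copy it into the target document first.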

Related

Scraping with DOMDocument PHP

This is the current code that I have for scraping.
$item is the HTML for the div HTML within the loop.
$doc = DOMDocument::loadHTML($item);
$xpath = new DOMXPath($doc);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
// do work here
}
I am changing the first two lines to be...
$doc = new DOMDocument();
$xpath = $doc->load($item);
With that, I am getting the following error...
Fatal error: Uncaught Error: Call to a member function query() on bool in
The error comes from $entries = $xpath->query($link); and I cannot figure out how to change this line.
Any help would be appreciated.
UPDATE:
same error
$doc = new DOMDocument();
$xpath = $doc->loadHTML($item);
$link = "//a[@class='s-item__link']";
$entries = $xpath->query($link);
foreach ($entries as $entry) {
// do work here
}
Look at the return value from DOMDocument::load()...
Returns true on success or false on failure. If called statically, returns a DOMDocument or false on failure.
Emphasis: Mine. Notice that you're not calling it statically anymore with your change.
So, with code like $xpath = $doc->load($item);, $xpath will of course be a bool (true or false), and your error makes total sense: Fatal error: Uncaught Error: Call to a member function query() on bool.
I just scooped out the Xpath stuff I'm using right now for my own PHP scraper. This should work...
$dom = new DOMDocument;
@$dom->loadHTML(mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8'));
$xpath = new DOMXPath($dom);
Explanation:
new DOMDocument : New class instance of DOMDocument().
@$dom->loadHTML : The @ symbol suppresses warnings. This class is very verbose with its errors, and you don't want to see them all the time.
mb_convert_encoding($htmltext, 'HTML-ENTITIES', 'UTF-8') : Converts non-ASCII characters in the UTF-8 input to HTML entities, so that loadHTML(), which otherwise assumes ISO-8859-1 when no charset is declared, does not mangle them.
new DOMXPath($dom); : New class instance of DOMXPath().
->load expects a filename as first parameter as shown in the documentation.
In your first code block, you use loadHTML.
Use ->loadHTML instead of ->load on an empty DOMDocument, and build the DOMXPath from the document rather than from the return value:
$doc = new DOMDocument();
$doc->loadHTML($item);
$xpath = new DOMXPath($doc);
public load ( string $filename , int $options = 0 ) : DOMDocument|bool
public loadHTML ( string $source , int $options = 0 ) : DOMDocument|bool
public loadHTMLFile ( string $filename , int $options = 0 ) : DOMDocument|bool
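Putting the answers together, a minimal sketch of the corrected flow might look like this. The sample markup in `$item` and the `$hrefs` array are assumptions made for illustration; only the class name comes from the question.

```php
<?php
// Hypothetical sample markup standing in for the question's $item.
$item = '<div><a class="s-item__link" href="https://example.com/1">First</a>'
      . '<a class="other" href="https://example.com/2">Second</a></div>';

$doc = new DOMDocument();
// loadHTML() returns a bool; the parsed tree lives in $doc itself.
$loaded = $doc->loadHTML($item);

$hrefs = [];
if ($loaded) {
    // DOMXPath is constructed from the document, not from the return value.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//a[@class='s-item__link']") as $entry) {
        $hrefs[] = $entry->getAttribute('href');
    }
}
print_r($hrefs);
```

Note that `@class='s-item__link'` only matches when the attribute is exactly that string; for elements with several classes, `contains(@class, ...)` is a common looser alternative.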

Determine what element is now xpath html foreach?

I am in the middle of a process that should extract something from a HTML page. I am fairly new to DomDocument in PHP, but I got this together from some tutorials and Stack Overflow.
Unfortunately, I need to know what element I am currently getting in the foreach loop below. As far as I know, the getName() function has something to do with XML, because it gives an Undefined Function Fatal error. Do you guys know any way to do this?
$rawdom = new DOMDocument();
$rawdom->loadHTML($page);
$finder = new DomXPath($rawdom);
$nodes = $finder->query("//dl[contains(@class, 'layout__definitionlist')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node) {
echo $node->getName();
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = $tmp_dom->saveHTML();
echo $innerHTML;
With DOMElement objects, the element name is not accessible using a getName() function, but as property $tagName:
echo $node->tagName;
getName() is only available with SimpleXMLElement, which is another XML/XPath API for PHP.
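As a small illustration of the difference, the sketch below collects element names inside a matched list. The sample `$page` markup and the `$names` and `$textName` variables are assumptions for the example. It also shows nodeName, which is defined on every DOMNode and so works for text nodes as well.

```php
<?php
// Hypothetical sample page; only the class name comes from the question.
$page = '<dl class="layout__definitionlist main"><dt>Term</dt><dd>Definition</dd></dl>';

$rawdom = new DOMDocument();
$rawdom->loadHTML($page);
$finder = new DOMXPath($rawdom);

// Collect the tag name of every element child of the matched <dl>.
$names = [];
foreach ($finder->query("//dl[contains(@class, 'layout__definitionlist')]/*") as $node) {
    $names[] = $node->tagName;
}

// nodeName exists on all node types; for a text node it is "#text".
$textName = $rawdom->getElementsByTagName('dt')->item(0)->firstChild->nodeName;

print_r($names);
echo $textName, PHP_EOL;
```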

PHP DOMDocument And DOMXpath

I am trying to find the last paragraph tag in a block of HTML using DOMDocument/DOMXpath but can't seem to figure it out.
# Create DOMDocument Object
$dom = new DOMDocument;
# Load HTML into DomDocument Object
$dom->loadHTML($data['component2']);
# Create DOMXPath Object and load DOMDocument Object into XPath for magical goodness
$xpath = new DOMXPath($dom);
# Loop through each paragraph node
foreach($xpath->query('//p') as $node) {
// krumo($node->parentNode);
print_r($node->parentNode->lastChild);
}
exit();
The print_r returns an empty DOMText Object ( )... any idea on how to find the last paragraph in a block of HTML using DOMDocument/DOMXPath?
Working Code:
# Create DOMDocument Object
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
# Load HTML into DomDocument Object
$dom->loadHTML($data['component2']);
# Create DOMXPath Object and load DOMDocument Object into XPath for magical goodness
$xpath = new DOMXPath($dom);
$q = $xpath->query('//div[@class="t_content"]/p[last()]');
$data['component2'] = str_replace(utf8_decode($q->item(0)->nodeValue), "", $data['component2']);
Use this instead:
print_r($node->parentNode->lastChild->nodeValue);
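One detail worth knowing when using last() is that `//p[last()]` selects the last paragraph within each parent, while `(//p)[last()]` selects the single last paragraph in the whole document. A minimal sketch, assuming sample markup with two containers (the `$html`, `$perParent` and `$overall` names are illustrative):

```php
<?php
// Hypothetical markup with two containers, each holding two paragraphs.
$html = '<div><p>A</p><p>B</p></div><div><p>C</p><p>D</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Last <p> of every parent element: one match per container.
$perParent = [];
foreach ($xpath->query('//p[last()]') as $p) {
    $perParent[] = $p->nodeValue;
}

// Last <p> of the whole document: the parentheses group all matches first.
$overall = $xpath->query('(//p)[last()]')->item(0)->nodeValue;

print_r($perParent);
echo $overall, PHP_EOL;
```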

How to insert HTML to PHP DOMNode?

Is there any way I can insert an HTML template to existing DOMNode without content being encoded?
I have tried to do that with:
$dom->createElement('div', '<h1>Hello world</h1>');
$dom->createTextNode('<h1>Hello world</h1>');
The output is pretty much the same, with only difference that first code would wrap it in a div.
I have tried to loadHTML from a string, but I have no idea how to append its body content to another DOMDocument.
In JavaScript, this process seems to be quite simple and obvious.
You can use
DOMDocumentFragment::appendXML — Append raw XML data
Example:
// just some setup
$dom = new DOMDocument;
$dom->loadXml('<html><body/></html>');
$body = $dom->documentElement->firstChild;
// this is the part you are looking for
$template = $dom->createDocumentFragment();
$template->appendXML('<h1>This is <em>my</em> template</h1>');
$body->appendChild($template);
// output
echo $dom->saveXml();
Output:
<?xml version="1.0"?>
<html><body><h1>This is <em>my</em> template</h1></body></html>
If you want to import from another DOMDocument, replace the three lines with
$tpl = new DOMDocument;
$tpl->loadXml('<h1>This is <em>my</em> template</h1>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
Using TRUE as the second argument to importNode will do a recursive import of the node tree.
If you need to import (malformed) HTML, change loadXml to loadHTML. This will trigger the HTML parser of libxml (what ext/DOM uses internally):
libxml_use_internal_errors(true);
$tpl = new DOMDocument;
$tpl->loadHtml('<h1>This is <em>malformed</em> template</h2>');
$body->appendChild($dom->importNode($tpl->documentElement, TRUE));
libxml_use_internal_errors(false);
Note that libxml will try to correct the markup, e.g. it will change the wrong closing </h2> to </h1>.
It works with another DOMDocument for parsing the HTML code. But you need to import the nodes into the main document before you can use them in it:
$newDiv = $dom->createElement('div');
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($str);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $dom->importNode($node, true);
$newDiv->appendChild($node);
}
And as a handy function:
function appendHTML(DOMNode $parent, $source) {
$tmpDoc = new DOMDocument();
$tmpDoc->loadHTML($source);
foreach ($tmpDoc->getElementsByTagName('body')->item(0)->childNodes as $node) {
$node = $parent->ownerDocument->importNode($node, true);
$parent->appendChild($node);
}
}
Then you can simply do this:
$elem = $dom->createElement('div');
appendHTML($elem, '<h1>Hello world</h1>');
I do not want to struggle with XML, because it throws errors more readily, and I am not a fan of prefixing an @ to suppress error output. loadHTML does the better job in my opinion, and it is as simple as this:
$doc = new DOMDocument();
$div = $doc->createElement('div');
// use a helper to load the HTML into a string
$helper = new DOMDocument();
$helper->loadHTML('This is my HTML Link.');
// now the magic!
// import the document node of the $helper object deeply (true)
// into the $div and append as child.
$div->appendChild($doc->importNode($helper->documentElement, true));
// add the div to the $doc
$doc->appendChild($div);
// final output
echo $doc->saveHTML();
Here is a simple example using DOMDocumentFragment:
$doc = new DOMDocument();
$doc->loadXML("<root/>");
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$doc->documentElement->appendChild($f);
echo $doc->saveXML();
Here is a helper function for replacing a DOMNode:
/**
* Helper function for replacing $node (DOMNode)
* with an XML code (string)
*
* @param DOMNode $node
* @param string $xml
*/
public function replaceNodeXML(&$node, $xml) {
$f = $this->dom->createDocumentFragment();
$f->appendXML($xml);
$node->parentNode->replaceChild($f,$node);
}
Source: Some old "PHP5 Dom Based Template" article.
And here is another suggestion, posted by Pian0_M4n, to use the value attribute as a workaround:
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string ©</div><i></i> with all the © that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up
Not ideal, but at least it works.
Gumbo's code works perfectly! One small enhancement: pass TRUE as the second argument to importNode so that it also works with nested HTML snippets. Change
$node = $parent->ownerDocument->importNode($node);
to
$node = $parent->ownerDocument->importNode($node, TRUE);
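The effect of that second argument can be checked directly. This is a minimal sketch, reusing the template string from the earlier answer; the `$source`, `$target`, `$shallow` and `$deep` names are assumptions for the example.

```php
<?php
$source = new DOMDocument();
$source->loadXML('<h1>This is <em>my</em> template</h1>');

$target = new DOMDocument();
$target->loadXML('<html><body/></html>');
$body = $target->documentElement->firstChild;

// Without the second argument only the element itself is copied.
$shallow = $target->importNode($source->documentElement);
// With true, the element and all of its descendants are copied.
$deep = $target->importNode($source->documentElement, true);

$body->appendChild($shallow);
$body->appendChild($deep);

echo $target->saveXML($body), PHP_EOL;
```

The shallow copy ends up as an empty h1 element, which is why nested snippets appear to lose their content when the flag is omitted.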

Can I get the matched DOM string with PHP and DOMDocument?

I've got my HTML inside of $html.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@id="header"]');
foreach($tags as $tag) {
var_dump($tag->nodeValue); // the innerHTML of that element
var_dump($tag); // object(DOMElement)#3 (0) { }
}
Is there a way to get that node, or remove it?
Basically, I'm parsing an existing website and need to remove elements from it. What method do I call to do that?
Thanks
Have you checked out DOMNode::removeChild?
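A minimal sketch of both parts of the question, assuming a small sample page in `$html`: saveHTML() with a node argument returns the markup of just that node, and removeChild() is called on the node's parent. The `$removed` and `$remaining` variables are illustrative names.

```php
<?php
// Hypothetical sample page.
$html = '<html><body><div id="header">Site header</div><p>Content</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Copy the matches into an array first, as a defensive habit when removing nodes in a loop.
$tags = iterator_to_array($xpath->query('//div[@id="header"]'));

$removed = [];
foreach ($tags as $tag) {
    // Serialize only the matched node (its "outer HTML").
    $removed[] = $dom->saveHTML($tag);
    // A node is removed through its parent.
    $tag->parentNode->removeChild($tag);
}

$remaining = $xpath->query('//div[@id="header"]')->length;
print_r($removed);
echo $remaining, PHP_EOL;
```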
