Load HTML containing namespaces with DOMDocument


I want to load an HTML snippet that contains namespaced elements with DOMDocument:
<div class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu">
</div>
</div>
But I can't figure out how to preserve the namespaces. I tried loading it with loadHTML(), but HTML does not have namespaces, so they get stripped.
I tried loading it with loadXML(), but that doesn't work either because <my:text value="huhu"> is not well-formed XML.
What I need is a loadHTML() method which doesn't strip namespaces, or a loadXML() method which does not validate the markup: a combination of these two methods.
My code so far:
$html = '<div class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu">
</div>
</div>';
libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->formatOutput = false;
$domDoc->resolveExternals = false;
$domDoc->substituteEntities = false;
$domDoc->strictErrorChecking = false;
$domDoc->validateOnParse = false;
$domDoc->loadHTML($html/*, LIBXML_NOERROR | LIBXML_NOWARNING*/);
$xpath = new DOMXPath($domDoc);
$xpath->registerNamespace ( 'my', 'http://www.example.com/' );
// -----> This results in zero nodes cause namespace gets stripped by loadHTML()
$nodes = $xpath->query('//my:*');
var_dump($nodes);
Is there a way to achieve what I want? I would be very grateful for any advice.
EDIT: I opened an enhancement request for libxml2 to provide an option to preserve namespaces in HTML: https://bugzilla.gnome.org/show_bug.cgi?id=711670

First, namespaces are allowed in XML (or XHTML) only. HTML does not support namespaces.
Given that it is XHTML and the xmlns declaration is present in the snippet, then you can access elements by namespace using DOMDocument::getElementsByTagNameNS():
$html = <<<EOF
<div xmlns:my="http://www.example.com/" class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu" />
</div>
</div>
EOF;
$domDoc = new DOMDocument();
$domDoc->loadXML($html);
var_dump(
// it is possible to use wildcard `*` here
$domDoc->getElementsByTagNameNS('http://www.example.com/', '*')
);
However, since the namespace declaration is commonly defined on the root element <html> rather than on sub nodes, the code above will not work for most snippets.
So part two of the solution would be to check whether the declaration is present and, if not, inject it (working on this).
As I said, the code above works for XML / XHTML only. It is still open how to do that with HTML. (check the discussion below)
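A minimal sketch of the injection step, assuming the fragment is already well-formed XML (so <my:text> is self-closed) and that the prefix my maps to http://www.example.com/. The helper name loadWithNamespace and the wrapper element root are made up for illustration; instead of checking for an existing declaration, the sketch always declares the prefix on the wrapper, which is harmless if the fragment declares it again.

```php
<?php
// Hypothetical helper: wrap a fragment in a root element that declares the
// namespace prefix, so loadXML() can resolve prefixed elements inside it.
function loadWithNamespace(string $fragment, string $prefix, string $uri): DOMDocument
{
    $wrapped = sprintf(
        '<root xmlns:%s="%s">%s</root>',
        $prefix,
        htmlspecialchars($uri, ENT_QUOTES),
        $fragment
    );

    $doc = new DOMDocument();
    if (!$doc->loadXML($wrapped)) {
        throw new RuntimeException('Fragment is not well-formed XML');
    }
    return $doc;
}

$fragment = '<div class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu" />
</div>
</div>';

$doc = loadWithNamespace($fragment, 'my', 'http://www.example.com/');

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('my', 'http://www.example.com/');
$nodes = $xpath->query('//my:*');

echo $nodes->length, "\n";                         // number of my:* elements found
echo $nodes->item(0)->getAttribute('value'), "\n"; // value of the first one
```

This does not help with the original, non-well-formed <my:text value="huhu"> tag; that still has to be closed before loadXML() will accept it.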

Technically it's neither valid XML nor HTML (or XHTML), because HTML does not allow namespaced elements, while well-formed XML requires that empty elements be self-closing and that the namespace prefix be declared. So you're basically asking "how can I have DOMDocument treat this invalid HTML as valid XML even though it's not valid XML either?", which is going to prove difficult, and one might ask why libxml should be updated to allow for this. If I update your snippet to:
$html = <<<XML
<div xmlns:my="http://www.example.com/" class="something-first">
<div class="something-child something-good another something-great">
<my:text value="huhu" />
</div>
</div>
XML;
adding in the NS registration and closing the my:text, it works just fine with:
$domDoc = new DOMDocument();
$domDoc->loadXML($html);
echo $domDoc->saveXML();
Notice that the namespace is not stripped out here. In your original snippet the namespace is stripped out, as I understand it, because the markup is not valid XML or HTML. XPath can't query by the namespace since the prefix was never declared via xmlns and was therefore dropped.
So I guess the question is: why are you petitioning for invalid XML support rather than adding that closing slash? Is it because the data comes from an external source, or because in some context the empty non-closing tag is valid?

Related

php xml DOMDocument close tag element

I am using PHP's DOMDocument to generate an XML file.
I append new component elements inside the components tag of a sample XML file, but they are written as self-closing tags. I want an explicit closing tag instead.
My code produces this:
<component expiresOn="2022-12-31" id="pam" />
I want this instead:
<component expiresOn="2022-12-31" id="pam"></component>
My PHP CODE SAMPLE
$dom = new DOMDocument();
$dom->load("Config.xml");
$components = $dom->getElementsByTagName('components')->item(0);
if(!empty($_POST["pam"])) {
$pam = $_POST["pam"];
$component = $dom->createElement('component');
$component->setAttribute('expiresOn', $expirydate);
$component->setAttribute('id', "pam");
$components->appendChild($component);
}
$dom->save("Config.xml");
I tested the following suggestion, but it is not working; the XML and PHP code there differ from mine.
$dom->saveXml($dom,LIBXML_NOEMPTYTAG);
Self-closing tags using createElement
I tested following.

You're trying to use DOMDocument::saveXML to save the new XML back into the original file, but all that function does is return the XML as a string. Since you aren't assigning the result to anything, nothing happens.
If you want to save the XML back to your file, as well as avoiding self-closing tags, you'll need to use the save method as you originally were, and also pass the option:
$dom->save('licenceConfig.xml', LIBXML_NOEMPTYTAG);
See https://3v4l.org/e6N5s for a demo
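As a rough illustration of what the option changes, here is a small in-memory sketch. It assumes a stand-in <components/> document rather than the real Config.xml, and uses saveXML() only so that both serializations can be compared as strings; save() accepts the same option when writing to a file.

```php
<?php
// Stand-in for the real configuration file (assumption for this sketch).
$dom = new DOMDocument();
$dom->loadXML('<components/>');

$component = $dom->createElement('component');
$component->setAttribute('expiresOn', '2022-12-31');
$component->setAttribute('id', 'pam');
$dom->documentElement->appendChild($component);

// Default serialization collapses empty elements into self-closing tags.
$default = $dom->saveXML($component);

// LIBXML_NOEMPTYTAG expands them into an open/close pair.
$expanded = $dom->saveXML($component, LIBXML_NOEMPTYTAG);

echo $default, "\n";
echo $expanded, "\n";
```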

How do I save as a HTML fragment, not as full DOM model

Here's the issue: I have a web page that saves HTML fragments on the server side. In PHP, when I load a fragment into DOMDocument, add a custom attribute to a given element, and save the HTML to a file, the output gains html, body, and other unnecessary elements. That output is not valid for my purpose: the fragment goes back to the browser to be inserted into the existing DOM, and you cannot have nested HTML/BODY elements. Here's a quick example of my code:
$html="<div><magic></magic>
<video controls></video>
<a href='http://example.com'>Example</a><br>
<a href='http://google.com'>Google</a><br></div>
";
$dom = new DOMDocument();
$dom->loadHTML($html);
$html=$dom->C14N();
echo $html;
But it shows:
<html>
<body>
<div>
<magic></magic>
<video controls=""></video>
Example
<br></br>
Google
<br></br>
</div>
</body>
</html>
How do I save just the fragmented HTML? I came up with $dom->C14N() but it still adds html and body tags. It also adds </br> but that's no big deal.
At this point, I am resorting to preg_replace to remove html and body tags but it would be nice if there's a way to save it as a fragment.

You need to initialize the DOM structure like this:
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html=$dom->saveHTML();
See PHP documentation:
LIBXML_HTML_NOIMPLIED (integer)
Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.
LIBXML_HTML_NODEFDTD (integer)
Sets HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found.
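Putting the two flags together, a minimal sketch might look like the following. It assumes the fragment has a single root element (here the outer div); fragments with several top-level elements tend to come out restructured with these flags. The attribute name data-checked is invented for the example.

```php
<?php
$html = "<div><magic></magic>
<video controls></video>
<a href='http://example.com'>Example</a><br></div>";

// <magic> is not a known HTML tag, so collect parser warnings quietly.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// With NOIMPLIED, the fragment's own root element is the document element.
$dom->documentElement->setAttribute('data-checked', 'yes');

// saveHTML() serializes as HTML, without html/body wrappers or a doctype.
$fragment = $dom->saveHTML();
echo $fragment;
```

Using saveHTML() rather than C14N() also avoids XML-style output such as <br></br>.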

PHP Modify an included file

I have a bunch of .html files that I am including on a page. Conditionally, I need to add classes to some of the components in these files, for example:
<div id='foo' class='bar'></div>
to
<div id='foo' class='bar bar2'></div>
I know I can do this with some inline PHP like this
<div id='foo' class="bar <?php echo " bar2"; ?>"></div>
However, having PHP in any of the files I'm including is not an option.
I also looked into including a file and then modifying afterward, but that doesn't seem possible. Then I was thinking I should read the files line-by-line, and add it in then.
Is there a nicer way I'm not thinking of?

Since having PHP is not an option, you could use PHP's DOM Parser with an XPath selector:
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlFile);
$finder = new DomXPath($dom);
// getting the class name using XPath
$nodes = $finder->query("//*[contains(@class, 'bar')]");
// changing the class name using setAttribute
foreach ($nodes as $node) {
$node->setAttribute('class', $node->getAttribute('class') . ' bar2');
}
// modified HTML source
$html = $dom->saveHTML();
That should get you started.
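One caveat with contains(@class, 'bar') is that it also matches class names that merely contain the substring, such as barista. A sketch of a stricter variant, using the common XPath 1.0 idiom of padding the class list with spaces, is shown below; the inline HTML string stands in for a file loaded with loadHTMLFile(), and the second div is invented to show the difference.

```php
<?php
// Inline stand-in for one of the included .html files (assumption).
$html = "<div id='foo' class='bar'></div><div id='baz' class='barista'></div>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match 'bar' only as a whole class token, not as a substring.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' bar ')]";

foreach ($xpath->query($query) as $node) {
    // Append to the existing class list instead of overwriting it.
    $node->setAttribute('class', $node->getAttribute('class') . ' bar2');
}

$foo = $xpath->query("//*[@id='foo']")->item(0);
$baz = $xpath->query("//*[@id='baz']")->item(0);

echo $foo->getAttribute('class'), "\n";
echo $baz->getAttribute('class'), "\n";
```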

You can use the DOMDocument class in PHP to retrieve the information from the file and then add attributes and data.
I don't really remember the code for DOMDocument so I haven't included any code here (sorry), but here are some links:
Use this method to get the HTML from your file:
http://php.net/manual/en/domdocument.loadhtmlfile.php
Review the DOMDocument class:
http://php.net/manual/en/class.domdocument.php

You may need to use .php instead of .html.
So do like below:
$variableClass="bar2";
include("htmlfilename.html");
where htmlfilename.html consists of
<div id='foo' class="bar <?php echo $variableClass; ?>"></div>

Depends on what you actually want to achieve - but basically this tends to be better solved by jQuery on the client.
But anyway you might put your HTML fragments in a DOM object, analyze and modify it, and read the HTML back after the modifications, for example:
// including an HTML file writes to the output stream, so buffer this
ob_start();
include('myfile.html');
$html = ob_get_clean();
// make a DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($html);
// make the changes you need to
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@id="foo"]');
// etc...
// get modified HTML
$html = $doc->saveHTML();
Hope this helps.

Using namespaces on XML

I need to work with namespaces in XML generated by my code and process the namespaced tags. For instance:
<system:include file="./test.php" cache="true" />
That would be part of the final output, but the special tags (like system:include) must be processed before the content is sent to the client.
So I will walk all elements of the final output looking for namespaced or otherwise specific tags. The problem is that if I use DOMDocument and read the output as XML, I get errors about the namespace declaration (Namespace prefix system on include is not defined in Entity).
My test code is:
<?php
$document = new DOMDocument();
$document->loadXML('
<system:include file="./test.php" cache="true" />
');
foreach($document->childNodes as $node) {
var_dump($node->nodeName);
}
?>
I need this because I have to process some special tags and convert them to real HTML. For instance: convert <b> to <strong> (just an example!), or do something more useful such as including and caching a specific page using tags.
Another example:
<h7>Hello World!</h7>
Converts to:
<div class="h7">Hello World!</div>
Note: the output buffer (ob) contents will be sent to a specific method that searches for these special tags. So I don't know if I can add the namespace declarations beforehand (it will probably be hard and slow).
Bye!

I can get it to work if I specify a root element in the XML, and then declare the system namespace inside the root element. <root xmlns:system="system">...</root>
<?php
function dump($root) {
foreach($root->childNodes as $node) {
echo $node->nodeName;
echo "\n";
dump($node);
}
}
$doc = new DOMDocument();
$doc->loadXML('<root xmlns:system="system"><system:include file="./test.php" cache="true" /></root>');
dump($doc);
?>
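Building on the wrapper idea, a hedged sketch of the processing step could look like this. It assumes the buffered output is otherwise well-formed XML, uses a made-up namespace URI (urn:example:system) for the system prefix, and replaces each system:include with a comment as a placeholder for whatever the real include/cache handler would produce.

```php
<?php
// Stand-in for the buffered page output (assumption for this sketch).
$output = '<p>Before</p><system:include file="./test.php" cache="true" /><p>After</p>';

// Hypothetical namespace URI; any URI works as long as it is used consistently.
$nsUri = 'urn:example:system';

$doc = new DOMDocument();
$doc->loadXML('<root xmlns:system="' . $nsUri . '">' . $output . '</root>');

// Copy the live node list first: replacing nodes while iterating it
// directly can skip entries.
$includes = iterator_to_array($doc->getElementsByTagNameNS($nsUri, 'include'));

foreach ($includes as $include) {
    $file = $include->getAttribute('file');
    // Placeholder: a real handler would include or cache $file here.
    $replacement = $doc->createComment(' included: ' . $file . ' ');
    $include->parentNode->replaceChild($replacement, $include);
}

// Serialize only the children of the wrapper, dropping <root> itself.
$result = '';
foreach ($doc->documentElement->childNodes as $child) {
    $result .= $doc->saveXML($child);
}

echo $result, "\n";
```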

PHP Scraping using XPath - html5 issue?

I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.
The page to be scraped looks something like:
<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div><span>Blah</span></div>
<div><span>Blah</span> Blah</div>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>
and I'm attempting to parse it like this:
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));
NB: dump() just wraps print_r() but adds some stack trace info and formatting.
The output is as follows:
14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value
14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)
Which I'm assuming means it was unable to find anything in the document which matches my selector? I've tried a number of variations, just to see if I can get something back:
/input/@value
/input
//input
/div
The only selector which I've been able to get anything from is / which returns the entire document.
What am I doing wrong?
EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).
There's been a suggestion that this isn't possible due to the parser choking on html5 - (as is the internal page) anyone have any experience of this?

If your selector starts with a single slash(/), it means the absolute path from the root. You need to use double slash (//) which selects all matching elements regardless of their location.
print_r won't show anything useful here. Everything in your code was fine except for actually reading the value.
List classes in PHP such as DOMNodeList usually have a property called length; check that instead.
$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
echo $b->item(0)->value;
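For an offline check of the same idea, here is a small sketch that runs the selector against the pared-down markup from the question instead of fetching the live page, and tests length before reading the attribute. The variable names are chosen for the example only.

```php
<?php
// Pared-down page from the question, inlined so no network access is needed.
$contents = '<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div>
<form method="POST" action="blah">
<input name="SomeName" id="SomeId" value="GET ME"/>
<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
</form>
</div>
</body>
</html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($contents);
libxml_clear_errors();

$xpath  = new DOMXPath($dom);
$result = $xpath->query("//input[@id='csrfToken-login']/@value");

$value = null;
if ($result !== false && $result->length > 0) {
    // The matched node is a DOMAttr, so its text is in ->value.
    $value = $result->item(0)->value;
}

echo $value ?? 'no match', "\n";
```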

DOMXPath looks fine to me.
As for the xpath use descendant-or-self shortcut // to get to the input tag
//input[@id='SomeId']/@value

I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.
Your XPath is correct, by the way.
You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
I hope this helps,
Zachary
