I have a very simple script that takes a small test XML string and tries to validate it with DOMDocument. In testing, the loadHTML() call completes fine, but as soon as I call validate(), the browser hangs forever and the page never loads. Here's the code:
$content = '<?xml version="1.0" encoding="utf-8"?><mainElement></mainElement>';
$dom = new DOMDocument;
$dom->loadHTML($content);
if (!$dom->validate()) {
    echo 'fail';
} else {
    echo 'success!';
}
It seems that if you want to validate content loaded with loadHTML(), you need a DOCTYPE declaration (without it, you get an infinite loop). For example, the following code works and prints fail:
$content = "
<!DOCTYPE html>
<html>
<body>
Content of the document......
</body>
</html>
";
$dom = new DOMDocument();
$dom->loadHTML($content);
if (!$dom->validate()) {
    echo 'fail';
} else {
    echo 'success!';
}
XML is more tolerant: validate() works even if you did not declare a DTD, although it returns false. In your case, you could use the loadXML() method instead, and your code will print fail.
Tested with php 7.0.13.
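As a rough sketch of the loadXML() route, the snippet below compares a document that carries its own internal DTD with one that has none. The helper name validatesAgainstDtd and the inline DTD are made up for illustration; only loadXML(), validate() and the libxml error functions come from PHP itself.

```php
<?php
// Hypothetical helper: true only if $xml parses and satisfies its own DTD.
function validatesAgainstDtd(string $xml): bool
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $valid = $dom->loadXML($xml) && $dom->validate();

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $valid;
}

// Illustrative document with an internal DTD subset.
$withDtd = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mainElement [
  <!ELEMENT mainElement EMPTY>
]>
<mainElement/>
XML;

// The string from the question: well-formed, but no DTD to validate against.
$withoutDtd = '<?xml version="1.0" encoding="utf-8"?><mainElement></mainElement>';

var_dump(validatesAgainstDtd($withDtd));
var_dump(validatesAgainstDtd($withoutDtd));
```

The second call is expected to report false for the same reason as in the answer above: without a DTD there is nothing to validate against.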
I use the following function to validate XML coming from a Web API before trying to parse it:
function isValidXML($xml) {
    $doc = @simplexml_load_string($xml);
    if ($doc) {
        return true;
    } else {
        return false;
    }
}
For some reason, it fails on the following XML. While it's a bit light in content, it looks valid to me.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><connection-response-list xmlns="http://www.ca.com/spectrum/restful/schema/response" />
Why would this fail? I tried another method of validating the XML that used DOMDocument and libxml_get_errors(), but it was actually more fickle.
EDIT: I should mention that I'm using PHP 5.3.8.
I think your interpretation is just wrong here – var_dump($doc) should give you
object(SimpleXMLElement)#1 (0) {
}
– but since it is an “empty” SimpleXMLElement, if($doc) considers it to be false-y due to PHP’s loose type comparison rules.
You should be using
if ($doc !== false)
here – a type-safe comparison.
(Had simplexml_load_string actually failed, it would have returned false – but it didn’t, see var_dump output I have shown above, that was tested with exactly the XML string you’ve given.)
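A minimal sketch of the check with the strict comparison applied. The function name isWellFormedXml is hypothetical, and swapping the @ operator for libxml_use_internal_errors() is my own substitution rather than something the question used.

```php
<?php
// Hypothetical helper: reports whether $xml parses at all.
function isWellFormedXml(string $xml): bool
{
    // Keep parse errors out of the output without resorting to @.
    $previous = libxml_use_internal_errors(true);

    $doc = simplexml_load_string($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Strict comparison: an empty SimpleXMLElement is falsy, but it is not false.
    return $doc !== false;
}

// The XML string from the question.
$emptyRoot = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    . '<connection-response-list xmlns="http://www.ca.com/spectrum/restful/schema/response" />';

// Illustrative malformed input (mismatched tags).
$broken = '<a><b></a>';

var_dump(isWellFormedXml($emptyRoot));
var_dump(isWellFormedXml($broken));
```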
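SimpleXML parses this document, but the root element is empty, and an empty SimpleXMLElement evaluates to false in if ($doc). A self-closing tag at the root with nothing inside it won't pass that check.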
See the following code when a root element is added:
<?php
function isValidXML($xml)
{
    $doc = @simplexml_load_string($xml);
    if ($doc) {
        return true;
    } else {
        return false;
    }
}
var_dump(isValidXML('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><connection-response-list xmlns="http://www.ca.com/spectrum/restful/schema/response" /></root>'));
// returns true
print_r(isValidXML('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><connection-response-list xmlns="http://www.ca.com/spectrum/restful/schema/response" /></root>'));
// returns 1
?>
Hope that helps.
I have a Flash application that posts XML to a PHP page, which validates it against an XSD schema. I'm trying to do the same thing from an HTML page, using either XMLHttpRequest or jQuery's ajax call, but I keep running into the same issues: "document has no document" or the 'Access-Control-Allow-Origin' header issue. I can fix one but not the other.
My PHP page looks like this:
function libxml_append_errors() {
    global $returnXML, $errors;
    $e = libxml_get_errors();
    foreach ($e as $error) {
        $en = $returnXML->createElement("error", trim($error->message));
        $en->setAttribute('line', $error->line);
        $errors->appendChild($en);
    }
    libxml_clear_errors();
}
libxml_use_internal_errors(true);
$xml = new DOMDocument();
$contents = $HTTP_RAW_POST_DATA;
$xml->loadXML($contents);
$returnXML = new DOMDocument("1.0", "utf-8");
$rootNode = $returnXML->createElement("result");
$returnXML->appendChild($rootNode);
$errors = $returnXML->createElement("errors");
if (!$xml->schemaValidate('muffin_dumplings.xsd'))
{
    libxml_append_errors();
}
$rootNode->appendChild($errors);
echo $returnXML->saveXML();
Either way I'm looking to get xml back with any validation errors or a simple empty error xml node same as I do with flash.
If anyone cares, here is the answer: when you post to PHP via Ajax, having the right headers is not enough; you also need to pass a content type with the request. Then it works.
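For the server side of this exchange, here is a hedged sketch of the same validation written as a function without globals. The function name validateToResultXml and the tiny inline schema are assumptions for illustration; the real page would pass the contents of muffin_dumplings.xsd. It takes the request body as a plain string so that it can be fed from php://input, since $HTTP_RAW_POST_DATA was removed in PHP 7.0.

```php
<?php
// Hypothetical helper: validates $body against $xsdSource and returns
// a <result><errors>...</errors></result> document as a string.
function validateToResultXml(string $body, string $xsdSource): string
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $result = new DOMDocument('1.0', 'utf-8');
    $root = $result->appendChild($result->createElement('result'));
    $errors = $root->appendChild($result->createElement('errors'));

    $messages = [];
    $xml = new DOMDocument();
    if ($body === '') {
        // loadXML() rejects an empty string, so report that case ourselves.
        $messages[] = ['message' => 'Empty request body', 'line' => 0];
    } elseif (!$xml->loadXML($body) || !$xml->schemaValidateSource($xsdSource)) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = ['message' => trim($error->message), 'line' => $error->line];
        }
    }

    foreach ($messages as $m) {
        $node = $result->createElement('error');
        // A text node escapes characters such as & in the libxml message.
        $node->appendChild($result->createTextNode($m['message']));
        $node->setAttribute('line', (string) $m['line']);
        $errors->appendChild($node);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $result->saveXML();
}

// Illustrative schema: a single <note> element containing a string.
$xsd = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    . '<xs:element name="note" type="xs:string"/>'
    . '</xs:schema>';

$okResult = validateToResultXml('<note>hi</note>', $xsd);
$badResult = validateToResultXml('<other/>', $xsd);

echo $okResult, $badResult;
```

In the actual page the call would look roughly like echo validateToResultXml(file_get_contents('php://input'), file_get_contents('muffin_dumplings.xsd'));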
<?php
class parsedictionary {
    public function _process() {
        $webpage="http://www.oppapers.com/essays/Computerized-World/160871?read_essay";
        $doc=new DOMDocument();
        $doc->loadHTML($webpage);
        echo $doc;
    }
}
$obj=new parsedictionary();
$obj->_process();
?>
I can't get the content of that page.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p>http://www.oppapers.com/essays/Computerized-World/160871?read_essay</p>
</body>
</html>
But I need to get the content of that page.
The DOMDocument class is obviously not a string; you can iterate it, perform operations on it, but it can't just be echoed. Check the documentation to see what you can do with it: http://www.php.net/domdocument
To get the page contents, you can either use file_get_contents or do echo $doc->saveHTML()
Edit: Didn't realize you had another problem in your code; you can just use this instead:
public function _process() {
    return file_get_contents('http://www.oppapers.com/essays/Computerized-World/160871?read_essay');
}
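If the goal is to work with the page as a DOM rather than just print it, the markup has to be fetched first and then handed to loadHTML(), which expects HTML source and not a URL. A small sketch, with a literal string standing in for the downloaded page; the helper name parseHtml is hypothetical.

```php
<?php
// Hypothetical helper: parses an HTML string into a DOMDocument.
function parseHtml(string $markup): DOMDocument
{
    // Real-world pages are rarely clean; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($markup);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Stand-in for the markup that file_get_contents($url) would return.
$doc = parseHtml('<html><body><p>Hello</p></body></html>');

// A DOMDocument cannot be echoed directly; serialize it first.
$html = $doc->saveHTML();
echo $html;
```

For a remote page the call would be parseHtml(file_get_contents($url)), which requires allow_url_fopen to be enabled; loadHTMLFile() is the DOMDocument method that accepts a path or URL directly.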
<?php
echo $doc->saveHTML();
?>
Works like a charm.
Well, the error is pretty clear in this case: echo expects a string, and you're feeding it a DOMDocument, which cannot be converted to one. Extract the markup from the DOMDocument as a string first (for example with saveHTML()) and echo that instead.
I have several functions in a class that each return saveHTML(). When I echo more than one of them, some of the HTML is repeated. I initially solved this by calling saveHTML($node), but that doesn't seem to be an option now.
I didn't know saveHTML($domnode) was only available from PHP 5.3.6 onwards, and I have no control over the server I uploaded the files to, so now I have to make it compatible with PHP 5.2.
For simplicity's sake, and only to show my problem, the code looks similar to this:
<?php
class HTML
{
    private $dom;
    function __construct($dom)
    {
        $this->dom = $dom;
    }
    public function create_paragraph()
    {
        $p = $this->dom->createElement('p','Text 1.');
        $this->dom->appendChild($p);
        return $this->dom->saveHTML();
    }
    public function create_paragraph2()
    {
        $p = $this->dom->createElement('p','Text 2.');
        $this->dom->appendChild($p);
        return $this->dom->saveHTML();
    }
}
$dom = new DOMDocument;
$html = new HTML($dom);
?>
<html>
<body>
<?php
echo $html->create_paragraph();
echo $html->create_paragraph2();
?>
</body>
</html>
Outputs:
<html>
<body>
<p>Text 1.</p>
<p>Text 1.</p><p>Text 2.</p>
</body>
I have an idea why it's happening but I have no idea how to not make it repeat without saveHTML($domnode). How can I make it work properly with PHP 5.2?
Here's an example of what I want to be able to do:
http://codepad.viper-7.com/o61DdJ
What I do, is just save the node as XML. There are a few differences in the syntax, but it's good enough for most uses:
return $dom->saveXml($node);
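Applied to the class from the question, the idea might look like the sketch below. The class name ParagraphBuilder and its single parameterized method are my own simplification, and the paragraph is deliberately not appended to the document here, so each call serializes only the node it just created. The code avoids namespaces and scalar type hints with PHP 5.2 in mind, though I have not run it on that version.

```php
<?php
// Hypothetical rewrite of the question's class: each call returns only its own node.
class ParagraphBuilder
{
    private $dom;

    public function __construct(DOMDocument $dom)
    {
        $this->dom = $dom;
    }

    public function createParagraph($text)
    {
        $p = $this->dom->createElement('p');
        // A text node escapes special characters such as & and <.
        $p->appendChild($this->dom->createTextNode($text));

        // Serialize just this node instead of the whole document.
        return $this->dom->saveXML($p);
    }
}

$builder = new ParagraphBuilder(new DOMDocument());
$first = $builder->createParagraph('Text 1.');
$second = $builder->createParagraph('Text 2.');

echo $first, "\n", $second, "\n";
```

As the answer notes, the XML serialization differs from HTML in a few places, for example empty elements are written as self-closing tags.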
You have return $this->dom->saveHTML(); twice in your class. (As far as I know, you don't have to return it inside the class anywhere unless it is a private function.)
If you take return $this->dom->saveHTML(); out of create_paragraph(), it will echo without returning. It's a DOM thing as far as I know, but I am new to this, like you.
Is there an option with DOMDocument to remove the first line:
<?xml version="1.0" encoding="UTF-8"?>
The class instantiation automatically adds it to the output, but is it possible to get rid of it?
I think using DOMDocument is a universal solution for valid XML files:
If you have XML already loaded in a variable:
$t_xml = new DOMDocument();
$t_xml->loadXML($xml_as_string);
$xml_out = $t_xml->saveXML($t_xml->documentElement);
For XML file from disk:
$t_xml = new DOMDocument();
$t_xml->load($file_path_to_xml);
$xml_out = $t_xml->saveXML($t_xml->documentElement);
This comment helped: http://www.php.net/manual/en/domdocument.savexml.php#88525
If you want to output HTML, use the saveHTML() function. It automatically avoids a whole lot of XML idiom and handles closed/unclosed HTML idiom properly.
If you want to output XML you can use the fact that DOMDocument is a DOMNode (namely: '/' in XPath expression), thus you can use DOMNode API calls on it to iterate over child nodes and call saveXML() on each child node. This does not output the XML declaration, and it outputs all other XML content properly.
Example:
$xml = get_my_document_object();
foreach ($xml->childNodes as $node) {
    echo $xml->saveXML($node);
}
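The loop above can be wrapped into a small function that returns the string instead of echoing it. This is only a sketch; the name xmlWithoutDeclaration and the sample document are made up for illustration.

```php
<?php
// Hypothetical helper: serializes every top-level node, skipping the XML declaration.
function xmlWithoutDeclaration(DOMDocument $doc): string
{
    $out = '';
    foreach ($doc->childNodes as $node) {
        // Top-level comments and processing instructions are kept;
        // only the declaration is left out.
        $out .= $doc->saveXML($node);
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0" encoding="UTF-8"?><!-- note --><root><a/></root>');

$xmlOut = xmlWithoutDeclaration($doc);
echo $xmlOut;
```

Compared with saveXML($doc->documentElement), this variant also preserves anything that sits next to the root element.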
For me, none of the answers above worked:
$dom = new \DOMDocument();
$dom->loadXXX('<?xml encoding="utf-8" ?>' . $content); // loadXML or loadHTML
$dom->saveXML($dom->documentElement);
The above didn't work for me if I had partial HTML, e.g.
<p>Lorem</p>
<p>Ipsum</p>
It then removed everything after <p>Lorem</p>.
The only solution that worked for me was:
foreach ($doc->childNodes as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}
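One caveat with the loop above: childNodes is a live list, so removing nodes while iterating it may skip entries. A hedged variant that copies the list first is sketched below; the function name and the sample document are hypothetical.

```php
<?php
// Hypothetical helper: removes top-level processing instructions from $doc.
function removeProcessingInstructions(DOMDocument $doc): void
{
    // Copy the live node list so that removals do not disturb the iteration.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node instanceof DOMProcessingInstruction) {
            $doc->removeChild($node);
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0"?><?first a?><?second b?><root/>');

removeProcessingInstructions($doc);

$cleaned = $doc->saveXML($doc->documentElement);
echo $cleaned;
```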
I had the same problem, but I am using symfony/serializer for XML creation. If you also want to achieve this with the Symfony serializer, you can do it this way:
use Symfony\Component\Serializer\Encoder\XmlEncoder;

$encoder = new XmlEncoder();
$encoder->encode($nodes[$rootNodeName], 'xml', [
    XmlEncoder::ROOT_NODE_NAME => $rootNodeName,
    XmlEncoder::ENCODING => $encoding,
    XmlEncoder::ENCODER_IGNORED_NODE_TYPES => [
        XML_PI_NODE, // this flag is the solution
    ],
]);
You can use output buffering to remove it. A bit of a hack but it works.
ob_start();
// dom stuff
$output = ob_get_contents();
ob_end_clean();
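// Strip only a leading XML declaration; an unanchored pattern would remove every line.
$clean = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $output);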