I'm using an instance of PHP's built-in XMLReader to read a user-generated XML file. Usually the file's content starts like the following sample, and everything works fine:
<?xml version="1.0" encoding="UTF-8"?>
<openimmo>
<uebertragung art="OFFLINE" umfang="VOLL" version="1.2.7" (...)
However, another user generates and sends the XML file with different software. The XML produced by that software starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<openimmo xmlns="http://www.openimmo.de" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openimmo.de openimmo.xsd">
<uebertragung art="OFFLINE" umfang="VOLL" version="1.2.7" (...)
Which causes my importer to fail with the following error:
XMLReader::read(): Element '{http://www.openimmo.de}openimmo': No matching global declaration available for the validation root.
I'm already doing validation by manually applying some XSD schema. The passed file follows the same schema, just explicitly specifies the xmlns attributes. How can I work around this issue? How can I tell XMLReader to just ignore that xmlns statement?
My code (simplified to the relevant sections) looks like the following snippet:
$reader = new XMLReader();
$success = @$reader->open($path);
if (!$success) { /* error handling */ }
$reader->setSchema($localOpenImmoXsdPath);
/* then starts reading and throws the above exception */
Namespace information is fundamental and there's no way an XML parser is going to ignore it.
Your options are either (a) send the file back to sender, saying it doesn't conform to the agreed schema, or (b) transform the file sent to you so that it does conform, by changing the namespace. That's a fairly simple XSLT transformation.
My immediate instinct was to look at the OpenImmo specs to see what they say about namespaces and schema conformance, but unfortunately access to the specs requires registration and licensing. Basically, either the specs allow both these formats, which would be a pretty shoddy spec, or they only allow one of them, in which case you shouldn't be accepting both.
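Option (b) could be sketched roughly as follows, assuming PHP's xsl extension is installed. The function name stripNamespaces is a hypothetical helper, not part of PHP or OpenImmo, and the stylesheet simply drops every namespace (including namespaced attributes such as xsi:schemaLocation), which may be too aggressive for documents that mix several vocabularies:

```php
<?php
// Hypothetical helper: returns a copy of the XML with all elements moved
// into "no namespace". Requires the xsl extension (XSLTProcessor).
function stripNamespaces(string $xml): string
{
    $xslt = <<<'XSL'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <!-- Recreate every element using only its local name. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <!-- Keep unqualified attributes, drop namespaced ones. -->
  <xsl:template match="@*">
    <xsl:if test="namespace-uri() = ''">
      <xsl:copy/>
    </xsl:if>
  </xsl:template>
  <!-- Copy text, comments and processing instructions unchanged. -->
  <xsl:template match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
XSL;

    $style = new DOMDocument();
    $style->loadXML($xslt);

    $source = new DOMDocument();
    $source->loadXML($xml);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($style);

    return (string) $processor->transformToXml($source);
}
```

The result could then be handed to the existing importer, for example via $reader->XML(stripNamespaces(file_get_contents($path))) before calling setSchema(). Note that this loads the whole file into memory, which gives up the streaming advantage of XMLReader for very large files.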
Related
I have a set of XLSX files that PhpSpreadsheet cannot load, because simplexml_load_string returns an empty SimpleXMLelement from (for instance) the workbook XML file.
The file has the following format, that can be loaded by simplexml after removing all occurrences of the x: namespace, and the declaration itself (that is, for instance, the <x:workbook> tag has been converted to <workbook>).
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<x:workbook xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" mc:Ignorable="x15 xr xr6 xr10 xr2" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:fileVersion appName="xl" lastEdited="7" lowestEdited="4" rupBuild="23801" />
<x:workbookPr codeName="ThisWorkbook" />
<mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
<mc:Choice Requires="x15">
<x15ac:absPath xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" url=".........." />
</mc:Choice>
</mc:AlternateContent>
<xr:revisionPtr revIDLastSave="0" documentId=".........." xr6:coauthVersionLast="46" xr6:coauthVersionMax="46" xr10:uidLastSave="{00000000-0000-0000-0000-000000000000}" />
<x:bookViews>
<x:workbookView xWindow="-120" yWindow="-120" windowWidth="29040" windowHeight="15840" xr2:uid="{00000000-000D-0000-FFFF-FFFF00000000}" />
</x:bookViews>
<x:sheets>
<x:sheet name="......" sheetId="1" r:id="rId1" />
</x:sheets>
<x:calcPr calcId="191029" />
</x:workbook>
I'm not sure the XML file is wrong, since the XLSX file(s) can be opened with, for instance, LibreOffice. Anyway, I have managed to load the file(s) by hacking a simple-minded function cleanup_xml() into Xlsx.php:
//~ http://schemas.openxmlformats.org/spreadsheetml/2006/main"
$xmlWorkbook = simplexml_load_string(
cleanup_xml($this->securityScanner->scan($this->getFromZipArchive($zip, "{$rel['Target']}"))),
'SimpleXMLElement',
Settings::getLibXmlLoaderOptions()
);
Is there a proper, clean way to make the SimpleXML API load such files?
edit:
I was wrong thinking all problems were gone after the cleanup_xml hack.
Seems that also the data rows XML file has problems, probably the same as above...
edit:
Indeed, I moved cleanup_xml() into XmlScanner::scan, to apply to every loaded XML, and now seems to work...
edit:
Seems the namespace declaration is correct, at least, from this simple example...
Then, I wonder why simplexml_load_string doesn't accept the format:
<x:workbook ... xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
....
</x:workbook>
while it apparently accepts
<workbook ... xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
....
</workbook>
edit
Having dug into the SimpleXML API, this answer helped me understand the problem. Now I can try to rewrite my hackish cleanup_xml to account for namespaces... Just wondering if PhpSpreadsheet offers a better way... It seems strange that this problem has gone unnoticed before...
edit
ok, now I've found the bug report...
This appears to be a bug in PhpSpreadsheet.
Opening an XLSX file I created this week with a real copy of Microsoft Excel, the "workbook.xml" starts like this:
<workbook
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="x15 xr xr6 xr10 xr2"
xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"
xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6"
xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10"
xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2">
This declares eight different namespaces that will be used in the document. One happens to be defined as the "default namespace", and the other seven are assigned prefixes - but all of that is just local to this specific file.
If we look at your XML document, we can see all the same namespaces in use, plus an extra one:
<x:workbook
xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"
xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6"
xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10"
xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2"
mc:Ignorable="x15 xr xr6 xr10 xr2"
xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
The only difference is that the namespace "http://schemas.openxmlformats.org/spreadsheetml/2006/main" has been assigned prefix "x", rather than set as the default namespace, but that makes no difference to its meaning. A different library might label the namespaces completely differently, just because of the way it generates the XML:
<ns0:workbook
xmlns:ns0="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:ns1="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:ns2="http://schemas.openxmlformats.org/markup-compatibility/2006"
ns2:Ignorable="x15 xr xr6 xr10 xr2"
xmlns:ns3="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"
xmlns:ns4="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
xmlns:ns5="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6"
xmlns:ns6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10"
xmlns:ns7="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2">
As explained in this reference answer, SimpleXML's namespace handling is based around using the ->children() method to select the namespace you want to work with. The correct way to use this is to always specify the namespace URI you want, e.g. "http://schemas.openxmlformats.org/spreadsheetml/2006/main" or "http://schemas.microsoft.com/office/spreadsheetml/2016/revision10".
However, because the same program generally creates XML documents with the same choice of prefixes, it's easy to write incorrect code which relies on:
A particular namespace being the default, and therefore selected before you first call ->children()
Particular namespaces being bound to particular prefixes, and therefore selectable by looking up that prefix
The author of PhpSpreadsheet appears to have made both mistakes, meaning that when you try to load a document created by a different program, it doesn't find the namespaces it expects even though they're actually there.
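To illustrate the URI-based approach, here is a minimal sketch in plain SimpleXML (not PhpSpreadsheet code). The constant MAIN_NS and the function sheetNames are names made up for this example, and the two sample documents are trimmed-down stand-ins for a real workbook.xml:

```php
<?php
// Namespace URI taken from the workbook XML above.
const MAIN_NS = 'http://schemas.openxmlformats.org/spreadsheetml/2006/main';

// Hypothetical helper: lists sheet names regardless of which prefix
// (if any) the document binds to the main spreadsheet namespace.
function sheetNames(string $xml): array
{
    $workbook = simplexml_load_string($xml);
    $names = [];
    // Always select children by namespace URI, never by prefix.
    foreach ($workbook->children(MAIN_NS)->sheets->children(MAIN_NS)->sheet as $sheet) {
        // "name" is an unqualified attribute, so it is in no namespace.
        $names[] = (string) $sheet->attributes()->name;
    }
    return $names;
}

// Same content, once with a default namespace and once with the "x" prefix.
$defaultForm  = '<workbook xmlns="' . MAIN_NS . '"><sheets>'
              . '<sheet name="Sheet1" sheetId="1"/></sheets></workbook>';
$prefixedForm = '<x:workbook xmlns:x="' . MAIN_NS . '"><x:sheets>'
              . '<x:sheet name="Sheet1" sheetId="1"/></x:sheets></x:workbook>';

$sameResult = sheetNames($defaultForm) === sheetNames($prefixedForm);
```

Because the lookup only ever mentions the URI, both spellings of the document go through the same code path.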
I am using the xsd2php library to parse an XSD which describes an API request body. Then, using the same library (which itself uses jms/serializer), I try to serialize objects:
$payload = new TrackRequest;
$searchCriteria = new SearchCriteriaAType;
$searchCriteria->addToConsignmentNumber(11111);
$payload->setSearchCriteria($searchCriteria);
$levelOfDetail = new LevelOfDetailAType;
$levelOfDetail->setSummary(true);
$payload->setLevelOfDetail($levelOfDetail);
Using basic serializer settings:
$serializerBuilder = SerializerBuilder::create();
$serializerBuilder->addMetadataDir(__DIR__ . '/../../metadata/Tracking', 'TNTExpressConnect\Tracking\XSD');
$serializerBuilder->setPropertyNamingStrategy(new IdenticalPropertyNamingStrategy);
$serializerBuilder->configureHandlers(function (HandlerRegistryInterface $handler) use ($serializerBuilder) {
$serializerBuilder->addDefaultHandlers();
$handler->registerSubscribingHandler(new BaseTypesHandler()); // XMLSchema List handling
$handler->registerSubscribingHandler(new XmlSchemaDateHandler()); // XMLSchema date handling
});
Serialization results in:
<?xml version="1.0" encoding="UTF-8"?>
<result>
<searchCriteria>
<account/>
<alternativeConsignmentNumber/>
<consignmentNumber>
<entry><![CDATA[11111]]></entry>
</consignmentNumber>
<customerReference/>
<pieceReference/>
</searchCriteria>
<levelOfDetail>
<summary>true</summary>
</levelOfDetail>
</result>
Regarding these results I have several questions:
Why is the root element <result> and not <TrackRequest>?
How do I get rid of CDATA?
How do I get rid of the <entry> tags in favor of a separate consignmentNumber tag for each entry?
How do I replace <summary>true</summary> with the self-closing tag <summary/>?
I guess I could create a dedicated handler for each of these cases, but maybe there is a built-in solution which I overlooked in the documentation (maybe some config options that can be placed in YAML).
And if I do have to create handlers, maybe someone can point me to a more sophisticated example that explains how to do it right.
I'm not a big fan of annotations, so I would prefer to use separate config files.
Thank you in advance.
You should have a look at the YAML Reference. A lot of things can be set up in the metadata files.
To change the "result" to "TrackRequest" add this line to the file:
Vendor\MyBundle\Model\ClassName:
    xml_root_name: TrackRequest ## Changes the root element
To get rid of cdata in entry change the property:
properties:
    entry:
        xml_element:
            cdata: false ## Add this to disable cdata tags
I just came across the same problems as you did. I hope it helps.
I post a question here as a last resort, I have browsed the web and went through many attempts but did not succeed.
I am trying to replicate an XXE attack in order to prevent such attacks, but I cannot seem to get my head around the way PHP works with XML entities. For the record, I am using PHP 5.5.10 on Ubuntu 12.04, but I have done some tests on 5.4 and 5.3, and libxml2 seems to be version 2.7.8 (which does not seem to default to not resolving entities).
In the following example, calling libxml_disable_entity_loader() with true or false has no effect, or I am doing something wrong.
$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY c PUBLIC "bar" "/etc/passwd">
]>
<root>
<test>Test</test>
<sub>&c;</sub>
</root>
XML;
libxml_disable_entity_loader(true);
$dom = new DOMDocument();
$dom->loadXML($xml);
// Prints Test.
print $dom->textContent;
But, I could specifically pass some arguments to loadXML() to allow some options, and that works when the entity is a local file, not when it is an external URL.
$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY c PUBLIC "bar" "/etc/passwd">
]>
<root>
<test>Test</test>
<sub>&c;</sub>
</root>
XML;
$dom = new DOMDocument();
$dom->loadXML($xml, LIBXML_NOENT | LIBXML_DTDLOAD);
// Prints Test.
print $dom->textContent;
Now if we are changing the entity to something else, as in the following example, the entity is resolved but I could not disable it at all using the parameters or function... What is happening?!
$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY c "Blah blah">
]>
<root>
<test>Test</test>
<sub>&c;</sub>
</root>
XML;
$dom = new DOMDocument();
$dom->loadXML($xml);
// Prints Test.
print $dom->textContent;
The only way that I could find was to overwrite the properties of the DOMDocument object.
resolveExternals set to 1
substituteEntities set to 1
Then they are resolved, or not.
So to summarise, I would really like to understand what I am obviously not understanding. Why do those parameters and function seem to have no effect? Is libxml2 taking precedence over PHP?
Many thanks!
References:
https://www.owasp.org/index.php/XML_External_Entity_%28XXE%29_Processing
http://au2.php.net/libxml_disable_entity_loader
http://au2.php.net/manual/en/libxml.constants.php
http://www.vsecurity.com/download/papers/XMLDTDEntityAttacks.pdf
http://www.mediawiki.org/wiki/XML_External_Entity_Processing
How can I use PHP's various XML libraries to get DOM-like functionality and avoid DoS vulnerabilities, like Billion Laughs or Quadratic Blowup?
Keeping it simple .. As it should be simple :-)
Your first code snippet
libxml_disable_entity_loader does or does not do anything here based on whether your system resolves entities by default or not (mine does not). This is controlled by LIBXML_NOENT option of libxml.
Without it the document processor may not even try translating external entities and therefore libxml_disable_entity_loader has nothing to really influence (if libxml does not load entities by default which seems to be the case in your test-case).
Add LIBXML_NOENT to loadXML() like this:
$dom->loadXML($xml, LIBXML_NOENT);
and you'll quickly get:
PHP Warning: DOMDocument::loadXML(): I/O warning : failed to load external entity "/etc/passwd" in ...
PHP Warning: DOMDocument::loadXML(): Failure to process entity c in Entity, line: 7 in ...
PHP Warning: DOMDocument::loadXML(): Entity 'c' not defined in Entity, line: 7 in ...
Your second code snippet
In this scenario you've enabled entity resolving by using the LIBXML_NOENT option, that's why it goes after /etc/passwd.
The example works just fine on my machine even for external URL - I changed the ENTITY to an external one like this:
<!ENTITY c PUBLIC "bar" "https://stackoverflow.com/opensearch.xml">
It can, however, also be influenced by e.g. the allow_url_fopen PHP INI setting: set it to false and PHP won't ever load a remote file.
Your third code snippet
The XML entity that you've provided is not an external one but rather an internal one (see e.g. here).
Your entity:
<!ENTITY c "Blah blah">
How an internal entity is declared:
<!ENTITY name "entity_value">
Therefore there is no reason for PHP or libxml to prevent resolving such entity.
Conclusion
I've quickly put up a PHP XXE tester script which tries out different settings and shows whether XXE is successful and in which case.
The only line that should actually show a warning is the "LIBXML_NOENT" one.
If any other line shows "WARNING, external entity loaded!", your setup allows loading external entities by default.
You SHOULD use libxml_disable_entity_loader() regardless of the default settings on your or your provider's machine. If your app ever gets migrated, it might become vulnerable instantly.
Correct usage
As the MediaWiki page you linked states:
Unfortunately, the way that libxml2 implements the disabling, the library is crippled when external entities are disabled, and functions that would otherwise be safe cause an exception in the entire parsing.
$oldValue = libxml_disable_entity_loader(true);
// do whatever XML-processing related
libxml_disable_entity_loader($oldValue);
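The save-and-restore pattern above could be wrapped up as in the following sketch. The function name parseXmlSafely is hypothetical. The version guard is an addition that assumes you may also run on PHP 8.0 or later, where libxml_disable_entity_loader() is deprecated, and LIBXML_NONET is passed as an extra precaution against network access during parsing:

```php
<?php
// Hypothetical wrapper around the save/restore pattern shown above.
function parseXmlSafely(string $xml): DOMDocument
{
    $previous = null;
    // libxml_disable_entity_loader() is deprecated as of PHP 8.0,
    // so only call it on older versions.
    if (PHP_VERSION_ID < 80000) {
        $previous = libxml_disable_entity_loader(true);
    }

    try {
        $dom = new DOMDocument();
        // LIBXML_NONET forbids network access while loading the document.
        if (!$dom->loadXML($xml, LIBXML_NONET)) {
            throw new RuntimeException('Could not parse XML');
        }
        return $dom;
    } finally {
        // Restore whatever setting was active before, even on failure.
        if ($previous !== null) {
            libxml_disable_entity_loader($previous);
        }
    }
}
```

Restoring the old value in a finally block keeps the global setting from leaking into unrelated code that legitimately needs to load external files.</root>');
assert($dom instanceof DOMDocument);
assert($dom->documentElement->nodeName === 'root');
assert($dom->textContent === 'Test');
</test>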
Note: libxml_disable_entity_loader() also prohibits loading external xml files directly (not through entities):
<?php
$remote_xml = "https://stackoverflow.com/opensearch.xml";
$dom = new DOMDocument();
if ($dom->load($remote_xml) !== FALSE)
echo "loaded remote xml!\n";
else
echo "failed to load remote xml!\n";
libxml_disable_entity_loader(true);
if ($dom->load($remote_xml) !== FALSE)
echo "loaded remote xml after libxml_disable_entity_loader(true)!\n";
else
echo "failed to load remote xml after libxml_disable_entity_loader(true)!\n";
On my machine:
loaded remote xml!
PHP Warning: DOMDocument::load(): I/O warning : failed to load external entity "https://stackoverflow.com/opensearch.xml" in ...
failed to load remote xml after libxml_disable_entity_loader(true)!
It might perhaps be related to this PHP bug but PHP is being really stupid about it as:
libxml_disable_entity_loader(true);
$dom->loadXML(file_get_contents($remote_xml));
works just fine.
Validating any VAST2.0 XML tag
$xsdPath = 'https://github.com/chrisdinn/vast/blob/master/lib/vast_2.0.1.xsd';
$domdoc= new DOMDocument();
$domdoc->loadHTML($xml_input);
if(!$domdoc->schemaValidate($xsdPath)){/* ... */}
returns nonsense messages like Error 1845: Element 'html': No matching global declaration available for the validation root.
In my opinion, this does not really make sense, because neither the schema XSD nor the VAST XML contains or requires an element with the name <html>.
Trying the same with
$reader = new XMLReader();
$reader->XML($xml_input);
$valid = $reader->setSchema($xsdPath);
$reader->read();
$reader->close();
returns the same error codes.
I checked the XSD twice. It is the same as the one at https://github.com/chrisdinn/vast/blob/master/lib/vast_2.0.1.xsd.
Any idea how to fix this?
To load XML you should use loadXML(), not loadHTML():
// Use the raw version of the file, not the GitHub HTML page
$xsdPath = 'https://raw.github.com/chrisdinn/vast/master/lib/vast_2.0.1.xsd';
$domdoc = new DOMDocument();
// Load the input as XML, not as HTML
$domdoc->loadXML($xml_input);
if (!$domdoc->schemaValidate($xsdPath)) {
// ...
}
When I dereference the URI you give for the XSD schema document, I don't get an XSD schema document. I get an HTML document which displays a rendering of the XSD schema document. It makes perfect sense to me for a validator expecting to see an xs:schema element to issue the error message you quote, when instead it sees an HTML element.
You can either find a URI that actually serves the XML document your validator needs, or you can make a local copy and point to that local copy. But expecting PHP's schema validation to find the XSD document buried in that HTML is asking more than you can reasonably expect.
The error is rather straight forward:
Error 1845: Element 'html': No matching global declaration available for the validation root.
This means that the element <html> is not declared in the XSD, so the document cannot be validated against that XSD.
In my opinion, this does not really make sense because both the schema xsd and the vast xml do not contain or require a markup or element with the name <html>.
You're loading an HTML document. Regardless of whether the string/file contains that html element or not, the DOMDocument does contain it, so the validation tries to validate it against the XSD and then fails because the XSD does not have any declaration for it.
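The difference between the two loaders can be seen in a small self-contained sketch. The inline schema below is a made-up stand-in with a single VAST element, not the real VAST 2.0.1 XSD, and the variable names are only for illustration:

```php
<?php
// Stand-in schema for illustration only; NOT the real VAST 2.0.1 XSD.
$xsd = <<<'XSD'
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="VAST">
    <xs:complexType>
      <xs:attribute name="version" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

$xmlInput = '<VAST version="2.0"/>';

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Loaded as XML, the root element is the one from the input.
$asXml = new DOMDocument();
$asXml->loadXML($xmlInput);
$validAsXml = $asXml->schemaValidateSource($xsd);

// Loaded as HTML, the parser wraps the input in <html><body>...</body></html>.
$asHtml = new DOMDocument();
$asHtml->loadHTML($xmlInput);
$rootAsHtml = $asHtml->documentElement->nodeName;
$validAsHtml = $asHtml->schemaValidateSource($xsd);

// Print whatever libxml complained about along the way.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), PHP_EOL;
}
libxml_clear_errors();
```

schemaValidateSource() takes the schema as a string, which also sidesteps the problem of a URL that serves an HTML page instead of the XSD itself.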
I am trying to validate an XML message that is signed using XMLDSig. In order to create a message digest, I need to canonicalize the message first. It works fine, except that DOMNode::C14N() removes the second namespace from the code below:
<?xml version="1.0" encoding="UTF-8"?><DirectoryRes xmlns="http://www.idealdesk.com/ideal/messages/mer-acq/3.3.1" xmlns:ns2="http://www.w3.org/2000/09/xmldsig#" version="3.3.1">
<createDateTimestamp>2012-10-29T17:04:56.374Z</createDateTimestamp>
<Acquirer>
<acquirerID>0050</acquirerID>
</Acquirer>
<Directory>
<directoryDateTimestamp>2012-10-29T17:04:56.374Z</directoryDateTimestamp>
<Country>
<countryNames>Deutschland</countryNames>
<Issuer>
<issuerID>NLINGB2U152</issuerID>
<issuerName>Issuer Simulator</issuerName>
</Issuer>
</Country>
</Directory>
</DirectoryRes>
Canonicalizing the XML above results in the following XML:
<DirectoryRes xmlns="http://www.idealdesk.com/ideal/messages/mer-acq/3.3.1" version="3.3.1">
<createDateTimestamp>2012-10-29T17:04:56.374Z</createDateTimestamp>
<Acquirer>
<acquirerID>0050</acquirerID>
</Acquirer>
<Directory>
<directoryDateTimestamp>2012-10-29T17:04:56.374Z</directoryDateTimestamp>
<Country>
<countryNames>Deutschland</countryNames>
<Issuer>
<issuerID>NLINGB2U152</issuerID>
<issuerName>Issuer Simulator</issuerName>
</Issuer>
</Country>
</Directory>
</DirectoryRes>
The remote server I am testing with keeps this namespace when calculating the message digest, so validation obviously fails. I confirmed this issue by first adding the namespace back in before creating my own digest to compare to the digest embedded in the message (the signature was stripped from the XML code above prior to posting). The code however has to work with different servers, some of which may or may not add namespaces (they are not part of the specifications, but as far as I know just adding a redundant namespace declaration shouldn't hurt). I looked this up in the W3C XML C14N specs and they say root elements should always keep their namespaces, except empty default namespaces. The disappearing namespace is neither the default, nor empty, so I am not sure whether this is a bug in DOMNode::C14N() or whether I overlooked something important.
The c14n spec suggests that extra namespaces don't make it into the canonicalized form.
If the document actually made use of ns2, that declaration should make it into the output of ->C14N().
You probably already figured this out, but since you are communicating with iDEAL you have to follow their "Signing iDEAL messages" remarks:
For the purpose of generating the digest of the main message, the inclusive canonicalization algorithm must be used. This method of canonicalization of the main message is not (always) explicitly indicated in the iDEAL XML messages. For this reason this transform has not been included in the example messages in this document. Merchants are not required to explicitly indicate this transform in their messages.
Source: https://www.pronamic.eu/wp-content/uploads/sites/2/2016/06/Merchant-Integration-Guide-v3-3-1-ENG-February-2015.pdf
This can be confusing since the CanonicalizationMethod element algorithm is http://www.w3.org/2001/10/xml-exc-c14n#. For the digest, however, you always have to use http://www.w3.org/TR/2001/REC-xml-c14n-20010315. This canonicalization algorithm will also leave the xmlns:ns2="http://www.w3.org/2000/09/xmldsig#" in place.
In the xmlseclibs library, which is used a lot for iDEAL, the http://www.w3.org/TR/2001/REC-xml-c14n-20010315 canonicalization algorithm is the default:
https://github.com/simplesamlphp/xmlseclibs/blob/v1.3.2/xmlseclibs.php#L872
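The difference between the two algorithms can be checked directly with DOMNode::C14N(), whose first argument switches between inclusive (false, the default) and exclusive (true) canonicalization. The sketch below uses a shortened version of the message from the question. The SHA-256 digest line is an assumption about the digest method, so check it against the DigestMethod in your actual messages:

```php
<?php
// Shortened version of the message from the question.
$xml = '<DirectoryRes xmlns="http://www.idealdesk.com/ideal/messages/mer-acq/3.3.1" '
     . 'xmlns:ns2="http://www.w3.org/2000/09/xmldsig#" version="3.3.1">'
     . '<createDateTimestamp>2012-10-29T17:04:56.374Z</createDateTimestamp>'
     . '</DirectoryRes>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// Inclusive canonicalization: in-scope namespace declarations are kept.
$inclusive = $dom->documentElement->C14N(false);

// Exclusive canonicalization: namespace declarations that are not
// visibly used are dropped.
$exclusive = $dom->documentElement->C14N(true);

$keptInclusive = strpos($inclusive, 'xmlns:ns2=') !== false;
$keptExclusive = strpos($exclusive, 'xmlns:ns2=') !== false;

// Assumption: SHA-256 digest, base64-encoded as in XMLDSig DigestValue.
$digest = base64_encode(hash('sha256', $inclusive, true));
```

If your own canonical output is missing the unused declaration, it is worth checking whether the exclusive flag is being set somewhere along the way.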