Using XSD schema validation for XPath queries

I'm using the following code to create a DOMDocument and validate it against an external xsd file.
<?php
$xmlPath = "/xml/some/file.xml";
$xsdPath = "/xsd/some/schema.xsd";
$doc = new \DOMDocument();
$doc->loadXML(file_get_contents($xmlPath), LIBXML_NOBLANKS);
if (!$doc->schemaValidate($xsdPath)) {
throw new InvalidXmlFileException();
}
Update 2 (rewritten question)
This works fine, meaning that if the XML doesn't match the definitions of XSD it will throw a meaningful exception.
Now I want to retrieve information from the DOMDocument using XPath. That works fine as well; however, from this point on the DOMDocument is completely detached from the XSD! For example, if I have a DOMNode I cannot tell whether it is a simpleType or a complexType. I can check whether the node has child nodes (hasChildNodes()), but this is not the same thing. There is also a lot more information in the XSD (such as the minimum and maximum number of occurrences).
The question really is: do I have to query the XSD myself, or is there a programmatic way of asking these kinds of questions, e.g. is this DOMNode a complex or a simple type?
In another post it was suggested "to process the schema using a real schema processor, and then use its API to ask questions about the contents of the schema". Does XPath have an API to retrieve information from the XSD, or is there a different convenient way to do it with DOMDocument?
For the record, the original question
Now, I wanted to proceed to parse information from the DOMDocument using XPath. To increase the integrity of the data I'm storing in a database, and to give meaningful error messages to the client, I wanted to constantly use the schema information to validate the queries. That is, I wanted to validate fetched childNodes against the allowed child nodes defined in the XSD. I wanted to do that by using XPath on the XSD document.
However, I stumbled across this post. It basically says that doing this yourself is a rather quirky approach, and that you should instead use a real schema processor and use its API to make the queries. If I understand that correctly, I am using a real schema processor with schemaValidate, but what is meant by using its API?
I kind of guessed already I'm not using the schema in a correct way, but I have no idea how to research a proper usage.
The question
If I use schemaValidate on the DOMDocument is that a one-time validation (true or false) or is it tied to the DOMDocument for longer then? Precisely, can I use the validation also for adding nodes somehow or can I use it to select nodes I'm interested in as suggested by the referenced SO post?
Update
The question was rated unclear, so I want to try again. Say I would like to add a node or edit a node value. Can I use the schema provided in the xsd so that I can validate the user input? Originally, in order to do that I wanted to query the xsd manually with another XPath instance to get the specs for a certain node. But as suggested in the linked article this is not best practice. So the question would be, does the DOM lib offer any API to make such a validation?
Maybe I'm overthinking it. Maybe I just add the node and run the validation again and see where/why it breaks? In that case, the answer of the custom error handling would be correct. Can you confirm?

Your question is not very clear, but it sounds like you want detailed reporting about any schema validation failures. While DOMDocument::schemaValidate() only returns a boolean, you can use the libxml error-handling functions to get more detailed information.
We can start with your original code, only changing one thing at the top:
<?php
// without this, errors are echoed directly to screen and/or log
libxml_use_internal_errors(true);
$xmlPath = "file.xml";
$xsdPath = "schema.xsd";
$doc = new \DOMDocument();
$doc->loadXML(file_get_contents($xmlPath), LIBXML_NOBLANKS);
if (!$doc->schemaValidate($xsdPath)) {
throw new InvalidXmlFileException();
}
And then we can make the interesting stuff happen in the exception which is presumably (based on the code you've provided) caught somewhere higher up in the code.
<?php
class InvalidXmlFileException extends \Exception
{
private $errors = [];
public function __construct()
{
foreach (libxml_get_errors() as $err) {
$this->errors[] = self::formatXmlError($err);
}
libxml_clear_errors();
}
/**
* Return an array of error messages
*
* @return array
*/
public function getXmlErrors(): array
{
return $this->errors;
}
/**
* Return a human-readable error message from a libxml error object
*
* @return string
*/
private static function formatXmlError(\LibXMLError $error): string
{
$return = "";
switch ($error->level) {
case \LIBXML_ERR_WARNING:
$return .= "Warning $error->code: ";
break;
case \LIBXML_ERR_ERROR:
$return .= "Error $error->code: ";
break;
case \LIBXML_ERR_FATAL:
$return .= "Fatal Error $error->code: ";
break;
}
$return .= trim($error->message) .
"\n Line: $error->line" .
"\n Column: $error->column";
if ($error->file) {
$return .= "\n File: $error->file";
}
return $return;
}
}
So now when you catch your exception you can just iterate over $e->getXmlErrors():
try {
// do stuff
} catch (InvalidXmlFileException $e) {
foreach ($e->getXmlErrors() as $err) {
echo "$err\n";
}
}
For the formatXmlError function I just copied an example from the PHP documentation that parses the error into something human readable, but no reason you couldn't return some structured data or whatever you like.
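The "add the node and run the validation again" idea from the question can be sketched as follows. This is a minimal illustration, not a definitive implementation: the inline schema, the element names and the variable names are all made up for the example, and it uses schemaValidateSource() only so that the snippet needs no external files.

```php
<?php
// Hypothetical inline schema: a <books> root holding one or more <book> strings.
$xsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadXML('<books><book>First</book></books>');
$validBefore = $doc->schemaValidateSource($xsd);

// Add an element the schema does not allow, then validate the whole document again.
$doc->documentElement->appendChild($doc->createElement('magazine', 'Oops'));
$validAfter = $doc->schemaValidateSource($xsd);

// The collected errors describe where and why the second validation failed.
$messages = array_map(
    fn (LibXMLError $e) => trim($e->message),
    libxml_get_errors()
);
libxml_clear_errors();
```

Validation here is a one-off check of the current tree: the schema is not attached to the document, so every change has to be followed by another validation call.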

I think what you're looking for is the PSVI (post-schema-validation infoset); see this answer for some references.
Another option would be to use XPath 2.0, which has operators for checking schema types.
I don't know whether there are PHP libraries that let you access the PSVI or run XPath 2.0 queries; in Java there is Xerces for the PSVI and Saxon for XPath 2.0.
For example, with Xerces it is possible to cast a DOM Element to a Xerces ElementPSVI in order to get the schema information of an element.
I should warn you that using XPath on the schema (as you were doing) will work only for simple cases. The XML of the schema is very different from the actual schema model (the assembled schema), which is a graph of components with properties. Those properties are derived from the XML declarations in the schema file, but by very complex rules that are almost impossible to recreate with XPath.
So you need at least the PSVI or XPath 2.0 queries but, in my experience, obtaining decent validation messages for application users from an XML schema is difficult.
What are you trying to achieve?
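To make the "simple cases only" caveat concrete, here is a rough sketch of querying the schema document itself with DOMXPath. The schema and the helper name hasInlineComplexType() are hypothetical, invented for this illustration. It only detects a complexType declared inline under an element declaration; named types referenced through a type attribute, imports, includes and substitution groups are exactly the parts it does not handle.

```php
<?php
// Hypothetical schema: <book> has an inline complex type, <title> is a plain string.
$xsdSource = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

$xsd = new DOMDocument();
$xsd->loadXML($xsdSource);

$xpath = new DOMXPath($xsd);
$xpath->registerNamespace('xs', 'http://www.w3.org/2001/XMLSchema');

// Hypothetical helper: true if an element declaration with this name
// carries an inline <xs:complexType> child.
function hasInlineComplexType(DOMXPath $xpath, string $elementName): bool
{
    $query = sprintf('boolean(//xs:element[@name="%s"]/xs:complexType)', $elementName);
    return $xpath->evaluate($query);
}

$bookIsComplex = hasInlineComplexType($xpath, 'book');
$titleIsComplex = hasInlineComplexType($xpath, 'title');
```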

Related

Include simple fields based on context in Fractal

I am using the Fractal library to transform a Book object into JSON by using a simple transformer:
class BookTransformer extends \League\Fractal\TransformerAbstract
{
public function transform(Book $book)
{
return [
'name' => $book->getName()
// ...
];
}
}
And I am performing the transformation as follows.
$book = new Book('My Awesome Book');
$resource = new \League\Fractal\Resource\Item($book, new BookTransformer());
$fractal = new \League\Fractal\Manager();
$fractal->setSerializer(new \League\Fractal\Serializer\ArraySerializer());
$json = $fractal->createData($resource)->toJson();
This works great. However, I have certain fields on my Book object that should not always be included, because this depends on the context in which the transformation is done. In my particular use case, the JSON returned to AJAX requests from my public website should not include sensitive information, while that information should be included when the data is requested from an admin backend.
So, let's say that a book has a topSecretValue field, which is a string. This field should not be included in one transformation, but should be included in another. I took a look at transformer includes, and played around with it, but this only works with resources. In my case, I need to somehow include different fields (not resources) for different contexts. I have been digging around and could not find anything in the Fractal library that could help me, but maybe I am missing something?
I came up with a working solution, but it is not the prettiest the world has ever seen. By having a BaseBookTransformer that transforms fields that should always be included, I can extend this transformer to add fields for other contexts, e.g. AdminBookTransformer or TopSecretValueBookTransformer, something like the below.
class AdminBookTransformer extends BookTransformer
{
public function transform(Book $book)
{
$arr = parent::transform($book);
$arr['topSecretValue'] = $book->getTopSecretValue();
return $arr;
}
}
This works fine, although it is not as "clean" as using includes (if it were possible), because I have to actually use a different transformer.
So the question is: is there anything in Fractal that enables me to accomplish this in a simpler/cleaner way, or is there a better way to do it, be it the Fractal way or not?

GetElementsByTagName alternative to DOMDocument

I am creating an HTML document with DOMDocument, but I have a problem when searching it with the getElementsByTagName method. What I found is that, because I'm generating the document on the fly, it does not recognize the tags I have inserted.
I tried DOMXPath, but to no avail :S
For now, what I do is walk through all the children of a node and store them in an array, but I need to convert that array to a DOMNodeList, and doing
return (DOMNodeList) $my_array;
generates a syntax error.
My specific question is: how can I search for tags with the getElementsByTagName method, or what alternative is there to achieve the task?
Remember that I'm generating the DOMDocument on the fly.
If you need more information, I'll gladly add it to the question.
Sure, Jonathan Sampson.
I apologize for the way I edited the question. I did not quite understand this forum's format.
For a better understanding of what I'm doing, here is the inheritance chain.
I have this base class
abstract class ElementoBase {
...
}
And I have this class that inherits from the previous one, with an abstract function insertar (insert)
abstract class Elemento extends ElementoBase {
...
public abstract function insertar ( $elemento );
}
Then I have a whole series of classes that represent the HTML tags and inherit from the class above, e.g.
class A extends Elemento {
}
...
Now the code I use to insert the tags into the document is as follows:
public function insertar ( $elemento ) {
$this->getElemento ()->appendChild ( $elemento->getElemento () );
}
where the function getElemento() returns a DOMElement.
Moreover, before inserting the element I do some validations that depend on the HTML tag being inserted,
because each tag has its own very specific rules.
Since I'm generating the HTML code on the fly, there is obviously no HTML file.
To your question, the theory tells me to do this:
$myListTags = $this->getElemento ()->getElementsByTagName ( $tag );
but it always returns null. From what I have researched, this is because I'm not loading an HTML file, because if I do
$myHtmlFile = $this->getDocumento ()->loadHTMLFile ( $filename );
$myListTags = $myHtmlFile->getElementsByTagName ( $etiqueta );
I do get the list of HTML tags.

I am assuming you have created a valid HTML file with DOMDocument. Your basic problem is to parse or search the HTML doc for a particular tag name.
To search an HTML file, a good solution available in PHP is the Simple HTML DOM parser.
You can just run the following code and you are done!
$html = file_get_html('url to your html file');
foreach($html->find('tag name') as $element)
{
// perform the action you want to do here.
// example: echo $element->someproperty;
}

$doc = new DOMDocument('1.0', 'iso-8859-1');
$doc->appendChild(
$doc->createElement('Filiberto', 'It works!')
);
$nodeList = $doc->getElementsByTagName('Filiberto');
var_dump($nodeList->item(0)->nodeValue);
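The same holds when searching from an element rather than from the document, and DOMXPath also works on a tree built in memory. A small sketch, with tag names and variable names chosen only for illustration:

```php
<?php
// Build a small tree in memory; no HTML file is loaded at any point.
$doc = new DOMDocument('1.0', 'UTF-8');
$body = $doc->appendChild($doc->createElement('body'));
$body->appendChild($doc->createElement('a', 'first'));
$div = $body->appendChild($doc->createElement('div'));
$div->appendChild($doc->createElement('a', 'second'));

// DOMElement::getElementsByTagName() searches all descendants of that element.
$fromBody = $body->getElementsByTagName('a');

// DOMXPath can run a relative query with an element as the context node.
$xpath = new DOMXPath($doc);
$viaXPath = $xpath->query('.//a', $div);
```

Both calls return a DOMNodeList, so there is no need to build one by hand from an array.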

Recursive tree rendering with Agile Toolkit

I have a following situation. I have a Model A with following properties:
id int
name varchar(255)
parent_id int (references same Model A).
Now, I need to render a tree view using that Model A. Of course, I could just load all the data, group it by parent_id and "render it" using traditional string concatenation, e.g.
class Model_A extends Model_Table {
...
function render_branch($nodes, $parent){
if (!isset($nodes[$parent])){
return null;
}
$out = "<ul>";
foreach ($nodes[$parent] as $node){
$out .= "<li>" . $node["name"];
$out .= $this->render_branch($nodes, $node["id"]);
$out .= "</li>";
}
$out .= "</ul>";
return $out;
}
function init(){
parent::init();
$nodes = array(); // preload from db and arrange so that key = parent id and value = array of children
$this->template->set("tree", $this->render_branch($nodes, 0));
}
}
Now, I would instead like to use the atk4 native lister/SMlite template parser for this purpose. But if you try to do that, you end up with a nasty lister where, in format row, you would still have to substitute a specific tag with the output of another lister, which you would in turn have to destroy to avoid runtime memory overflows.
Any suggestions?
P.S. The code above is not tested; it just shows the concept.
Thanks!
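For reference, the plain-PHP concept above, including the "arrange so that key = parent" step, could look roughly like this outside of atk4. The function names groupByParent() and renderBranch() and the sample rows are made up for this sketch; it does not use any Agile Toolkit API.

```php
<?php
// Group flat rows by parent_id so that $nodes[$parentId] lists that parent's children.
function groupByParent(array $rows): array
{
    $nodes = [];
    foreach ($rows as $row) {
        $nodes[$row['parent_id']][] = $row;
    }
    return $nodes;
}

// Recursively render one branch as a nested <ul>; returns '' when there are no children.
function renderBranch(array $nodes, int $parent): string
{
    if (!isset($nodes[$parent])) {
        return '';
    }
    $out = '<ul>';
    foreach ($nodes[$parent] as $node) {
        $out .= '<li>' . htmlspecialchars($node['name']);
        $out .= renderBranch($nodes, $node['id']);
        $out .= '</li>';
    }
    return $out . '</ul>';
}

// Sample data, assuming root rows are stored with parent_id = 0.
$rows = [
    ['id' => 1, 'name' => 'Root', 'parent_id' => 0],
    ['id' => 2, 'name' => 'Child', 'parent_id' => 1],
];
$html = renderBranch(groupByParent($rows), 0);
```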

Okay, the right time has come and a proper add-on has been created. To use it, bring your add-ons and atk4 up to date and follow this article to find out how.
http://www.ambienttech.lv/blog/2012-07-06/tree_view_in_agile_toolkit.html

As per Jancha's comment:
Okay, after spending some time looking at possible options, I found that the easiest thing to do in this particular case was to use the above-mentioned example. The only way to make it more native would be to use an external template for nodes and use SMlite and clone region + render to move the HTML out into the template. Apart from that, using a traditional lister did not seem to be efficient enough. So, atk4 guys, follow up with query tree view plugin and create a proper backend! It would be cool. Thanks, j

Using the Data Mapper Pattern, Should the Entities (Domain Objects) know about the Mapper?

I'm working with Doctrine2 for the first time, but I think this question is generic enough to not be dependent on a specific ORM.
Should the entities in a Data Mapper pattern be aware of, and use, the Mapper?
I have a few specific examples, but they all seem to boil down to the same general question.
If I'm dealing with data from an external source - for example a User has many Messages - and the external source simply provides the latest few entities (like an RSS feed), how can $user->addMessage($message) check for duplicates unless it either is aware of the Mapper, or it 'searches' through the collection (seems like an inefficient thing to do).
Of course a Controller or Transaction Script could check for duplicates before adding the message to the user - but that doesn't seem quite right, and would lead to code duplication.
If I have a large collection - again a User with many Messages - how can the User entity provide limiting and pagination for the collection without actually proxying a Mapper call?
Again, the Controller or Transaction Script or whatever is using the Entity could use the Mapper directly to retrieve a collection of the User's Messages limited by count, date range, or other factors - but that too would lead to code duplication.
Is the answer using Repositories and making the Entity aware of them? (At least for Doctrine2, and whatever analogous concept is used by other ORMs.) At that point the Entity is still relatively decoupled from the Mapper.

Rule #1: Keep your domain model simple and straightforward.
First, don't prematurely optimize something because you think it may be inefficient. Build your domain so that the objects and syntax flow correctly. Keep the interfaces clean: $user->addMessage($message) is clean, precise and unambiguous. Underneath the hood you can utilize any number of patterns/techniques to ensure that integrity is maintained (caching, lookups, etc). You can utilize Services to orchestrate (complex) object dependencies, probably overkill for this but here is a basic sample/idea.
class User
{
public function addMessage(Message $message)
{
// One solution, loop through all messages first, throw error if already exists
$this->messages[] = $message;
}
public function getMessages()
{
return $this->messages;
}
}
class MessageService
{
public function addUserMessage(User $user, Message $message)
{
// Ensure unique message for user
// One solution is loop through $user->getMessages() here and make sure unique
// This is more or less the only path to adding a message, so ensure its integrity here before proceeding
// There could also be ACL checks placed here as well
// You could also create functions that provide checks to determine whether certain criteria are met/unmet before proceeding
if ($this->doesUserHaveMessage($user,$message)) {
throw new \RuntimeException('User already has this message.');
}
$user->addMessage($message);
}
// Note, this may not be the correct place for this function to "live"
public function doesUserHaveMessage(User $user, Message $message)
{
// Do a database lookup here
return $user->hasMessage($message); // assumes User exposes a hasMessage() check
}
}
class MessageRepository
{
public function find(/* criteria */)
{
// Use caching here
return $message;
}
}
class MessageFactory
{
public function createMessage($data)
{
//
$message = new Message();
// setters
return $message;
}
}
// Application code
$user = $userRepository->find(/* lookup criteria */);
$message = $messageFactory->createMessage(/* data */);
// Could wrap in try/catch
$messageService->addUserMessage($user, $message);
I've been working with Doctrine2 as well. Your domain entity objects are just that: objects. They should not have any idea of where they came from; the domain model just manages them and passes them around to the various functions that manage and manipulate them.
Looking back over, I'm not sure that I completely answered your question. However, I don't think that the entities themselves should have any access to the mappers. Create Services/Repositories/Whatever to operate on the objects and utilize the appropriate techniques in those functions...
Don't overengineer it from the onset either. Keep your domain focused on its goal and refactor when performance is actually an issue.
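A compact, runnable variant of the idea might look like the sketch below. Everything in it is an assumption made for illustration: the MessageLookup interface, the in-memory implementation and the guid field are hypothetical, and in a real application the lookup would more likely be backed by a repository query than by a loop. The point is only that the entity stays unaware of how the duplicate check is performed.

```php
<?php
final class Message
{
    private string $guid;

    public function __construct(string $guid)
    {
        $this->guid = $guid;
    }

    public function getGuid(): string
    {
        return $this->guid;
    }
}

// The entity only manages its own state; it knows nothing about persistence.
final class User
{
    /** @var Message[] */
    private array $messages = [];

    public function addMessage(Message $message): void
    {
        $this->messages[] = $message;
    }

    public function getMessages(): array
    {
        return $this->messages;
    }
}

// Hypothetical abstraction over "does this user already have this message?".
interface MessageLookup
{
    public function userHasMessage(User $user, Message $message): bool;
}

// In-memory stand-in; a database-backed implementation could replace it.
final class InMemoryMessageLookup implements MessageLookup
{
    public function userHasMessage(User $user, Message $message): bool
    {
        foreach ($user->getMessages() as $existing) {
            if ($existing->getGuid() === $message->getGuid()) {
                return true;
            }
        }
        return false;
    }
}

final class MessageService
{
    private MessageLookup $lookup;

    public function __construct(MessageLookup $lookup)
    {
        $this->lookup = $lookup;
    }

    // Returns false instead of adding when the message is a duplicate.
    public function addUserMessage(User $user, Message $message): bool
    {
        if ($this->lookup->userHasMessage($user, $message)) {
            return false;
        }
        $user->addMessage($message);
        return true;
    }
}

$service = new MessageService(new InMemoryMessageLookup());
$user = new User();
$first = $service->addUserMessage($user, new Message('abc'));
$second = $service->addUserMessage($user, new Message('abc'));
```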

IMO, an Entity should be oblivious of where it came from, who created it and how its related Entities are populated. In the ORM I use (my own) I can define joins between two tables and limit the results by specifying (in C#):
SearchCriteria sc = new SearchCriteria();
sc.AddSort("Message.CREATED_DATE","DESC");
sc.MaxRows = 10;
results = Mapper.Read(sc, new User(new Message()));
That will result in a join which is limited to 10 items, ordered by the creation date of the message. The Message items will be added to each User. If I write:
results = Mapper.Read(sc, new Message(new User()));
the join is reversed.
So, it is possible to make Entities completely unaware of the mapper.

No.
Here's why: trust. You cannot trust data to act on the benefit of the system. You can only trust the system to act on data. This is a fundamental of programming logic.
Let's say something nasty slipped into the data and it was intended for XSS. If a data chunk is performing actions or if it's evaluated, then the XSS code gets blended into things and it will open a security hole.
Let not the left hand know what the right hand doeth! (mostly because you don't want to know)

Error when merging two XML documents using XPath & DOMDocument

About a year ago I wrote a jQuery-inspired library which allowed you to manipulate the DOM using PHP's XPath and DOMDocument. I recently wanted to clean it up and post it as an open source project. I've been spending the past few days making improvements and implementing some more of PHP's native OO features.
Anyhow, I thought I'd add a new method which allows you to merge a separate XML document with the current one. The catch here is that this method asks for 2 XPath expressions. The first one fetches the elements you want to merge into the existing document. The second specifies the destination path of these merged elements.
The method works well in fetching matching elements from both paths, but I'm having issues with importing the foreign elements into the current DOM. I keep getting the dreaded 'Wrong Document Error' message.
I thought I knew what I was doing, but I suppose I was wrong. If you look at the following code, you can see that I'm first iterating through the current document's matching elements, then through the foreign document's matching elements.
Within the second nested loop is where I am attempting to merge each foreign element into the destination path in the current document.
Not sure what I'm doing wrong here as I'm clearly importing the foreign node into the current document before appending it.
public function merge($source, $path_origin, $path_destination)
{
$Dom = new self;
if(false == $Dom->loadXml($source))
{
throw new DOMException('XML source could not be loaded into the DOM.');
}
$XPath = new DOMXPath($Dom);
foreach($this->path($path_destination, true) as $Destination)
{
if(false == in_array($Destination->nodeName, array('#text', '#document')))
{
foreach($XPath->query($path_origin) as $Origin)
{
if(false == in_array($Destination->nodeName, array('#text', '#document')))
{
$this->importNode($Origin, true);
$Destination->appendChild($Origin->cloneNode(true));
}
}
}
}
return $this;
}
You can find the library in its entirety in the following Github repo:
http://github.com/wilhelm-murdoch/DomQuery
Halps!!!

importNode doesn't "change" the node so it belongs to another document. It creates a new node belonging to the new document and returns it. So you should be getting its return value and using that in appendChild.
$Destination->appendChild($this->importNode($Origin, true));
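A standalone sketch of the same fix, using two plain DOMDocument instances in place of the library's classes (the XML snippets and variable names are invented for the example):

```php
<?php
$target = new DOMDocument();
$target->loadXML('<root><dest/></root>');

$source = new DOMDocument();
$source->loadXML('<items><item>a</item><item>b</item></items>');

$dest = $target->getElementsByTagName('dest')->item(0);
$xpath = new DOMXPath($source);

foreach ($xpath->query('/items/item') as $origin) {
    // importNode() returns a copy owned by $target; $origin itself still belongs to $source.
    $imported = $target->importNode($origin, true);
    $dest->appendChild($imported);
}
```

Appending $origin, or a clone of it, would still raise the Wrong Document Error, because cloneNode() keeps the clone in the source document.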
