I'm working on a PHP script using SimpleXML / XPath that needs to print citations for sentences from an XML file which has structure similar to the following:
<text name="text_title">
<book name="book_title">
<chapter name="chapter_title">
<sentence name="sentence_number" id="0000">
<word attr="desired_val" id="1111" />
<word attr="undesired_val" id="2222" />
</sentence>
</chapter>
</book>
</text>
The issue is that I need to return each sentence containing a word bearing attr="desired_val", and then a citation containing its text, book, chapter, and sentence number. I'm currently doing the first part with the xpath query
//word[@attr='desired_val']/ancestor::sentence
and the second part with a series of subsequent xpath queries based on the ID attribute of each returned sentence, e.g. for the text node:
/text[book/chapter/sentence[@id={$id}]]/@name
(and so on for the other relevant nodes). My issue is that this becomes grossly inefficient with large numbers of records and causes the script to time out with more than about ten results. Can anyone suggest a better way to do this?
If you need all matches, the only optimization I can think of is to reduce the enormous number of queries. Building the whole list of matches first and then searching the document again for each match to collect the remaining information takes a lot of time. It would be better to pull all the necessary data from your document in a single step. The same problem occurs in database applications, where people execute too many SQL statements instead of doing everything in one query.
XQuery is to XML what SQL is to relational databases. If you use XQuery instead of XPath, you can collect all the necessary data in one step. The following example has been tested with Saxon-HE as the XQuery engine.
<results>
{
for $x in doc("text.xml")/text/book/chapter/sentence/word
where $x/@attr = "desired_val"
return <match text="{$x/../../../../@name}"
book="{$x/../../../@name}"
chapter="{$x/../../@name}"
sentence="{$x/../@name}" />
}
</results>
The following command
java -cp /usr/share/java/Saxon-HE.jar net.sf.saxon.Query '!indent=yes' text.xquery
extracts the required information from the document in just one step.
<?xml version="1.0" encoding="UTF-8"?>
<results>
<match chapter="chapter_title"
text="text_title"
book="book_title"
sentence="sentence_number"/>
</results>
Saxon-HE can be installed on Ubuntu by the following command.
apt-get install libsaxonhe-java
I do not know which XQuery engine is best suited for PHP.
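If no XQuery engine is available, the same "collect everything in one pass" idea can be sketched in plain PHP. This is only a minimal sketch, assuming the DOM extension is enabled and that every sentence sits exactly three levels below text, as in the sample; the variable names are illustrative. It runs one XPath query and then walks up via parentNode instead of issuing a new query per sentence.

```php
<?php
// Sample input copied from the structure shown in the question.
$xml = <<<XML
<text name="text_title">
<book name="book_title">
<chapter name="chapter_title">
<sentence name="sentence_number" id="0000">
<word attr="desired_val" id="1111" />
<word attr="undesired_val" id="2222" />
</sentence>
</chapter>
</book>
</text>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

$citations = [];
// One query for all matching sentences. Filtering on the sentence itself
// (rather than word/ancestor::sentence) yields each sentence only once.
foreach ($xpath->query("//sentence[word/@attr='desired_val']") as $sentence) {
    // Walk up the tree instead of running further XPath queries.
    $chapter = $sentence->parentNode;
    $book = $chapter->parentNode;
    $text = $book->parentNode;

    $citations[] = [
        'text' => $text->getAttribute('name'),
        'book' => $book->getAttribute('name'),
        'chapter' => $chapter->getAttribute('name'),
        'sentence' => $sentence->getAttribute('name'),
    ];
}

foreach ($citations as $citation) {
    echo implode(', ', $citation), PHP_EOL;
}
```

Because the ancestors are reached through parentNode, the cost per result no longer depends on the size of the document.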
I am having trouble finding a solution to a problem I am facing when parsing XML.
Let me describe what I have now and what the issue is:
I have links to XML files that contain, for example:
<products>
..
<product>
<id>1</id>
<name><![CDATA[ this is a test product name ]]></name>
<link><![CDATA[http://www.google.com]]></link>
<image><![CDATA[http://www.google.com/image.jpg]]></image>
<sku><![CDATA[ ]]></sku>
<category><![CDATA[ System > Technology ]]></category>
<price>20</price>
<description><![CDATA[ ]]></description>
<instock><![CDATA[ Y ]]></instock>
<availability>Y</availability>
</product>
..
</products>
Another XML has:
<products>
..
<product>
<productID>1</productID>
<title><![CDATA[ ]]></title>
<link><![CDATA[http://www.google.com]]></link>
<image><![CDATA[http://www.google.com/image.jpg]]></image>
<sku><![CDATA[ ]]></sku>
<categoryPath><![CDATA[ System > Technology ]]></categoryPath>
<price>20</price>
<description><![CDATA[ ]]></description>
<instock><![CDATA[ Y ]]></instock>
<availability>Y</availability>
<size>40</size>
</product>
..
</products>
Now, the differences between them are:
1) The first one has a tag named "name"; the other has a tag named "title".
2) The second one has some tags that the first one does not.
Now the problem is that I am parsing the XML file via PHP like this:
$xml->products->product[$i]->id
$xml->products->product[$i]->name
and so on. If I do this, the code I have written will work only for the first one. The missing tags are not a problem for now, because those are not required fields and I insert NULL into the database for them.
But what about the second XML? Can I do something "automatically" so that I don't have to ask clients to correct those tags?
Or can this only be done manually, by fetching the content of the link (via PHP) and renaming them?
I do not have the files from my clients, just the links to the XML.
Thanks in advance!
OK! I believe I have found some solutions to my problem. I am writing them here in case someone has the same issue:
Solutions:
i) Read all the children of the XML file, however they are written (tag names are case-sensitive), and add them to the database. After that, a dashboard/PHP file with SQL queries matches those child element tags to the ones that you want.
In this case, you may want to create a file named whatever you like, for example test.xml, and build the feed that you want in it, with the correct XML tag names. You could then update it every few hours (according to your needs) via a cron job.
ii) Manually create a PHP file with the parsing logic for every XML that you get. Just make sure to keep the XML link in your DB.
iii) Ask the client to give you the correct XML. XML is case-sensitive for a reason.
If you choose the first solution, you need to change php.ini too, because the XML files may be too large and max_execution_time is probably too low to run all these PHP/MySQL scripts.
If someone needs more explanation or has better advice, please share!
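The tag-matching step in solution (i) can also be done in PHP while parsing, with a lookup table of known aliases per field. This is a rough sketch under stated assumptions: the $fieldAliases table and the normalizeProduct() helper are hypothetical names, the alias pairs are taken from the two sample feeds above, and the inline feed is a shortened copy of the second sample.

```php
<?php
// Canonical field name => tag names seen in different client feeds.
// (Hypothetical table; extend it as new feeds appear.)
$fieldAliases = [
    'id' => ['id', 'productID'],
    'name' => ['name', 'title'],
    'category' => ['category', 'categoryPath'],
    'price' => ['price'],
    'size' => ['size'],
];

/**
 * Map one <product> element onto the canonical field names.
 * Fields that are missing from the feed stay NULL.
 */
function normalizeProduct(SimpleXMLElement $product, array $fieldAliases): array
{
    $row = [];
    foreach ($fieldAliases as $field => $aliases) {
        $row[$field] = null;
        foreach ($aliases as $tag) {
            if (isset($product->{$tag})) {
                $row[$field] = trim((string) $product->{$tag});
                break;
            }
        }
    }
    return $row;
}

$feedXml = <<<XML
<products>
<product>
<productID>1</productID>
<title><![CDATA[ Test product ]]></title>
<categoryPath><![CDATA[ System > Technology ]]></categoryPath>
<price>20</price>
</product>
</products>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$feed = simplexml_load_string($feedXml, 'SimpleXMLElement', LIBXML_NOCDATA);

$rows = [];
foreach ($feed->product as $product) {
    $rows[] = normalizeProduct($product, $fieldAliases);
}
print_r($rows);
```

With this layout, supporting a new client feed usually means adding an alias to the table rather than writing a new parser.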
I would like to create a function where users can create their own XML feed. The feed should look, for example, like the following (quite simple) feed:
<xml>
<products>
<product>Product 1</product>
<product>Product 2</product>
</products>
</xml>
A very important part of the setup is that there is a connection between the database and the feed template; for example, the product title is loaded from the database. So the user should create, for example, the following 'text/xml' as a basis:
<xml>
<products>
%whileProducts%
<product>%title%</product>
%/whileProducts%
</products>
</xml>
It is possible to insert the product title via str_replace, but is it also possible to create a while loop via a replace function? To make it a bit more difficult: there could be loops inside a loop; for example, a user might want to create a feed with a while loop for the products and, inside this loop, another loop for the colors and/or sizes of each product.
No, it's not. str_replace() can only perform literal replacements of one set of constant strings with another corresponding set of constant strings; it can't do anything more complex.
What you want here is a templating engine. Since XML is involved, XSLT may be an appropriate tool to use; it's not simple, though. There are many other templating engines for PHP available, and recommending one is outside the scope of this question.
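To illustrate what a templating engine has to do beyond str_replace(), here is a deliberately tiny sketch that expands the %whileX% ... %/whileX% blocks from the question with preg_replace_callback() and recursion. The renderTemplate() function and the shape of the $data array are assumptions made for this example, and it is not a substitute for a real engine (no escaping rules for attributes, no error reporting, no support for a loop nested inside a loop of the same name).

```php
<?php
/**
 * Render a template with %placeholder% values and %whileName% ... %/whileName%
 * loop blocks. A block named "Products" iterates over $data['products'].
 */
function renderTemplate(string $template, array $data): string
{
    // 1) Expand loop blocks; each row is rendered recursively, so a loop
    //    inside the block (e.g. colors per product) is handled as well.
    $template = preg_replace_callback(
        '/%while(\w+)%(.*?)%\/while\1%/s',
        function (array $m) use ($data): string {
            $rows = $data[lcfirst($m[1])] ?? [];
            $out = '';
            foreach ($rows as $row) {
                $out .= renderTemplate($m[2], $row);
            }
            return $out;
        },
        $template
    );

    // 2) Replace the remaining scalar placeholders, escaped for XML.
    return preg_replace_callback(
        '/%(\w+)%/',
        function (array $m) use ($data): string {
            $value = $data[$m[1]] ?? '';
            return is_scalar($value)
                ? htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES)
                : '';
        },
        $template
    );
}

$template = <<<TPL
<xml>
<products>
%whileProducts%
<product>%title%</product>
%/whileProducts%
</products>
</xml>
TPL;

// In the real application these rows would come from the database.
$feed = renderTemplate($template, [
    'products' => [
        ['title' => 'Product 1'],
        ['title' => 'Product 2'],
    ],
]);
echo $feed, PHP_EOL;
```

Even this small version needs a regular expression, a callback, and recursion, which is why an existing engine (or XSLT) is usually the safer choice for user-editable templates.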
I'm looking into the possibility of efficiently comparing two similar XML files and updating outdated information.
The main XML file I'm working with is about 200-250 MB in size. The second is a tad smaller.
The two XML files look pretty much like this:
<product>
<Category>BOOK</Category>
<Bookgroup>BOOKF</Bookgroup>
<Productname>Name of the book</Productname>
<Productcode>123456789</Productcode>
<Price>79.00</Price>
<Availability>Stock On Order</Availability>
<ProductURL>www.url.com</ProductURL>
<Release>07.08.2013</Release>
<Author>Name of author</Author>
<Genre>Crime</Genre>
<BookType>Pocket</BookType>
<Language>English</Language>
</product>
As you can see, I'm working with books, and the purpose of having a second XML file with the same information is that I only want one copy of each book for further use.
Basically, I'm trying to figure out how I can efficiently parse the first XML and check whether each book exists in the second XML. If it exists, I'll check whether the product information (price, availability, etc.) has been updated. If it has, it needs to be updated in the second XML as well.
If it doesn't exist, it needs to be added to the second XML.
Using XMLReader, I'm able to parse each book from the first XML fairly fast (40ish seconds to loop through 4.5 million lines of XML and echo out all the books), using an approach similar to this one.
My problem occurs when I want to check whether a book exists in the second XML and change the second XML if the book needs to be updated or added.
Would it, for example, be possible to use XMLReader on the second XML, stop at the node with the same book title as the one I've stopped at in the first XML, and then make the check? If so, how?
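One possible approach, sketched below, avoids seeking through the second file for every book: stream the second file once to build an in-memory index keyed by Productcode, then stream the first file and look each book up in that index. This is only a sketch under assumptions not stated in the question: both files wrap their product elements in a single root element, Productcode is unique, and an index holding just the compared fields fits in memory. The helper names and the tiny temporary files are made up for the demonstration.

```php
<?php
/** Stream <product> elements from a file one at a time. */
function readProducts(string $path): Generator
{
    $reader = new XMLReader();
    $reader->open($path);
    // Advance to the first <product> element.
    while ($reader->read() && $reader->name !== 'product') {
        // keep reading
    }
    while ($reader->name === 'product') {
        yield new SimpleXMLElement($reader->readOuterXml());
        $reader->next('product'); // jump to the next sibling <product>
    }
    $reader->close();
}

/** The fields that decide whether a book counts as "updated". */
function comparedFields(SimpleXMLElement $product): array
{
    return [
        'Price' => (string) $product->Price,
        'Availability' => (string) $product->Availability,
    ];
}

// Two tiny stand-ins for the large files.
$firstPath = tempnam(sys_get_temp_dir(), 'first');
$secondPath = tempnam(sys_get_temp_dir(), 'second');
file_put_contents($firstPath, '<products>'
    . '<product><Productcode>111</Productcode><Price>79.00</Price><Availability>In Stock</Availability></product>'
    . '<product><Productcode>222</Productcode><Price>55.00</Price><Availability>In Stock</Availability></product>'
    . '<product><Productcode>333</Productcode><Price>10.00</Price><Availability>In Stock</Availability></product>'
    . '</products>');
file_put_contents($secondPath, '<products>'
    . '<product><Productcode>111</Productcode><Price>79.00</Price><Availability>In Stock</Availability></product>'
    . '<product><Productcode>222</Productcode><Price>50.00</Price><Availability>In Stock</Availability></product>'
    . '</products>');

// Pass 1: index the second file by product code.
$index = [];
foreach (readProducts($secondPath) as $product) {
    $index[(string) $product->Productcode] = comparedFields($product);
}

// Pass 2: stream the first file and classify every book.
$toAdd = [];
$toUpdate = [];
foreach (readProducts($firstPath) as $product) {
    $code = (string) $product->Productcode;
    if (!isset($index[$code])) {
        $toAdd[] = $code;
    } elseif ($index[$code] !== comparedFields($product)) {
        $toUpdate[] = $code;
    }
}

echo 'add: ', implode(', ', $toAdd), PHP_EOL;
echo 'update: ', implode(', ', $toUpdate), PHP_EOL;

unlink($firstPath);
unlink($secondPath);
```

The second file itself would then be rewritten in a single streaming pass (for example with XMLWriter), applying the collected additions and updates, rather than being edited in place for each book.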
I am working on a script for judges of a film festival to review and vote for films. I was thinking I could minimize the project by saving all the results in a single XML file. My concern however is if multiple judges are casting their vote at the same time, will there be a conflict with the XML file being written to at the same time?
Here is my thought on a schema :
<festival>
<teams>
<team id='*'>
<name></name>
<video>http://vimeo.com/####</video>
<ratings>
<judge id='%'>(1-7)</judge>
</ratings>
<nominations>
<judge id='%'>#</judge>
</nominations>
</team>
</teams>
<awards>
<award id='#'>Best Director</award>
</awards>
<judges>
<judge id='%'>
<name></name>
<email></email>
<password></password>
<lastVideoWatched></lastVideoWatched>
</judge>
</judges>
</festival>
Okay, first of all:
Why are you using XML files? It would be much easier to use a database for this sort of thing. Even plain MySQL (accessed through mysqli) will work quite well. You can use SimpleXML to parse the file and hold it in memory before committing, but I cannot see any way of handling concurrent transactions without building an engine in the middle.
If you'll take my advice, switch to MySQL.
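If the single XML file is kept anyway, writes would at least have to be serialized so that two judges cannot overwrite each other's changes. A minimal sketch of that idea uses an advisory lock via flock(); it does not make the writes concurrent, it only makes them take turns. The recordRating() function is hypothetical, the file layout follows the schema in the question, and the ids are assumed to be trusted values (they are inserted into the XPath expressions unescaped).

```php
<?php
/** Store one judge's rating for a team while holding an exclusive lock. */
function recordRating(string $path, string $teamId, string $judgeId, int $score): void
{
    $handle = fopen($path, 'c+');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Blocks until every other writer has released its lock.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Cannot lock $path");
        }
        $doc = new DOMDocument();
        $doc->loadXML(stream_get_contents($handle));
        $xpath = new DOMXPath($doc);

        $ratings = $xpath->query("/festival/teams/team[@id='$teamId']/ratings")->item(0);
        if ($ratings === null) {
            throw new RuntimeException("Unknown team $teamId");
        }
        // Reuse the judge's existing vote if there is one, else append it.
        $judge = $xpath->query("judge[@id='$judgeId']", $ratings)->item(0);
        if ($judge === null) {
            $judge = $doc->createElement('judge');
            $judge->setAttribute('id', $judgeId);
            $ratings->appendChild($judge);
        }
        $judge->nodeValue = (string) $score;

        // Rewrite the file while still holding the lock.
        ftruncate($handle, 0);
        rewind($handle);
        fwrite($handle, $doc->saveXML());
        fflush($handle);
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}

// Demonstration on a temporary file.
$path = tempnam(sys_get_temp_dir(), 'festival');
file_put_contents($path, '<festival><teams><team id="1"><ratings/></team></teams></festival>');

recordRating($path, '1', '7', 5);
recordRating($path, '1', '7', 6); // the same judge changes the vote
recordRating($path, '1', '8', 3);

$votes = [];
$saved = simplexml_load_file($path);
foreach ($saved->xpath('/festival/teams/team[@id="1"]/ratings/judge') as $judge) {
    $votes[(string) $judge['id']] = (string) $judge;
}
print_r($votes);
unlink($path);
```

Every vote rereads and rewrites the whole file, and every reader that needs a consistent view has to take a lock as well, which is the kind of bookkeeping a database does for you.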
Here is my XML file :
<?xml version="1.0" encoding="utf-8"?>
<root>
<category>
<name>Category</name>
<desc>Category</desc>
<category>
<name>Subcategory</name>
<desc>Sub-category</desc>
<category>
<name>Subcategory</name>
<desc>Sub-category</desc>
</category>
</category>
</category>
</root>
My tree can have any number of levels. There are no requirements about this.
First question:
Is my XML suitable for this kind of requirement?
And how could I optimize it (if needed)?
Second question:
How could I parse it with DOMDocument?
I know how to load an XML document, but I don't know how to parse it.
I read a little about recursion, but I was not able to understand how to apply it with PHP/DOMDocument.
Thanks for the help!
EDIT
What I want to do is manage a category system.
I tried with SQL, but it was too hard to manage using the relational model, even with nested selects, etc.
So I want to be able to build a tree from my XML,
like
Category
Sub Category
Sub sub category
with no limit on the depth.
I want to be able to search for a category and retrieve all its children (subcategories), its parent(s), its siblings, etc., each optionally.
Well, there's nothing wrong with the XML you're using here, but you don't say enough about what you want to DO with the data for anyone to give you a quality answer about whether or not your XML will capture what you need. As for "[parsing] it with DOMDocument", you can load it into a DOMDocument object like so:
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<category>
<name>Category</name>
<desc>Category</desc>
<category>
<name>Subcategory</name>
<desc>Sub-category</desc>
<category>
<name>Subcategory</name>
<desc>Sub-category</desc>
</category>
</category>
</category>
</root>
XML;
$d = new DOMDocument();
$d->loadXML($xml);
At this point, the question once again becomes: Now what do you want to DO with it?
If you're just asking how to handle a structure like this, I'd say write two functions: one that accepts the full structure, and one that accepts a reference to a category DOMNode. The first function does any initial processing and then passes each top-level category node to the second. The second function processes the current node's properties as needed and then recurses into its child categories, if any are present.
It would be more efficient to process this flat, of course, in one loop, but then you would lose the literal representation of the hierarchy.
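The two-function approach described above could look roughly like this. It is a sketch, not the only way to do it: the function names and the array layout (name, desc, children) are my own choices, and it assumes the DOM extension is available.

```php
<?php
/** Second function: convert one <category> element, recursing into children. */
function categoryToArray(DOMElement $category): array
{
    $node = ['name' => null, 'desc' => null, 'children' => []];
    foreach ($category->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip whitespace text nodes
        }
        if ($child->tagName === 'name' || $child->tagName === 'desc') {
            $node[$child->tagName] = $child->textContent;
        } elseif ($child->tagName === 'category') {
            $node['children'][] = categoryToArray($child); // recursion
        }
    }
    return $node;
}

/** First function: accept the whole document and start at the top level. */
function buildCategoryTree(DOMDocument $doc): array
{
    $tree = [];
    foreach ($doc->documentElement->childNodes as $child) {
        if ($child instanceof DOMElement && $child->tagName === 'category') {
            $tree[] = categoryToArray($child);
        }
    }
    return $tree;
}

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<category>
<name>Category</name>
<desc>Category</desc>
<category>
<name>Subcategory</name>
<desc>Sub-category</desc>
<category>
<name>Subcategory</name>
<desc>Sub-category</desc>
</category>
</category>
</category>
</root>
XML;

$d = new DOMDocument();
$d->loadXML($xml);
$tree = buildCategoryTree($d);
print_r($tree);
```

The recursion depth equals the nesting depth of the categories, so there is no fixed limit on the number of levels.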
Recapping the point above about what you want to do with it... IMHO there are three broad classes of thing one might do with a chunk of XML.
Having instantiated a DOMDocument and loaded XML into it, you can search it for nodes using XPath queries, much as you search a relational database using SQL SELECT queries. You can extract the attributes of nodes, the sub-nodes of nodes, and the text within nodes, which is a species of parsing, I'd say. PHP's DOMXPath class will do this for you.
You can instead turn your XML into something else (a different XML dialect, XHTML, etc.) using XSL transforms. That may or may not be parsing per se, but it does involve parsing. PHP's XSLTProcessor class will do this.
Another major idea, which DOMDocument does not really support, is a streaming parser. The parser consumes XML in a linear manner and, while doing so, invokes callback functions at each node of interest. The somewhat venerable SAX API is AFAIK the archetypal streaming parser. PHP's XML Parser extension (xml_parser_create() and its handler functions) works in this event-driven style, and XMLReader offers a pull-based streaming alternative.
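Of these three approaches, the XPath one maps most directly onto the searches mentioned in the question's edit (children, parents, siblings). Here is a small sketch, assuming the same nested category layout; the category names (Books, Fiction, and so on) and the names() helper are invented for the example, and the searched name is assumed not to contain quotes.

```php
<?php
$xml = <<<XML
<root>
<category>
<name>Books</name>
<category>
<name>Fiction</name>
<category><name>Crime</name></category>
</category>
<category>
<name>Non-fiction</name>
</category>
</category>
</root>
XML;

$d = new DOMDocument();
$d->loadXML($xml);
$xpath = new DOMXPath($d);

/** Run a query relative to $context and return the text of each hit. */
function names(DOMXPath $xpath, string $expression, DOMNode $context): array
{
    $result = [];
    foreach ($xpath->query($expression, $context) as $node) {
        $result[] = $node->textContent;
    }
    return $result;
}

// Find a category by its name, at any depth.
$fiction = $xpath->query("//category[name='Fiction']")->item(0);

// Direct subcategories.
$children = names($xpath, 'category/name', $fiction);
// All enclosing categories, outermost first.
$parents = names($xpath, 'ancestor::category/name', $fiction);
// Categories on the same level.
$siblings = names(
    $xpath,
    'preceding-sibling::category/name | following-sibling::category/name',
    $fiction
);

echo 'children: ', implode(', ', $children), PHP_EOL;
echo 'parents: ', implode(', ', $parents), PHP_EOL;
echo 'siblings: ', implode(', ', $siblings), PHP_EOL;
```

Replacing category/name with descendant::category/name would return the whole subtree instead of only the direct children.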
But, yeah, what do you want to do with your XML?
You said you tried SQL and it didn't work for you. Just a tip: if you use Oracle, take a look at START WITH ... CONNECT BY; if you use SQL Server, use recursive CTEs. These approaches do solve the problem.