I'm looking for an efficient way to compare two similar XML files and update outdated information.
The main XML file I'm working with is about 200-250 MB in size. The second is slightly smaller.
Both XML files look roughly like this:
<product>
<Category>BOOK</Category>
<Bookgroup>BOOKF</Bookgroup>
<Productname>Name of the book</Productname>
<Productcode>123456789</Productcode>
<Price>79.00</Price>
<Availability>Stock On Order</Availability>
<ProductURL>www.url.com</ProductURL>
<Release>07.08.2013</Release>
<Author>Name of author</Author>
<Genre>Crime</Genre>
<BookType>Pocket</BookType>
<Language>English</Language>
</product>
As you can see, I'm working with books. The purpose of the second XML file, which holds the same information, is to keep only one copy of each book for further use.
Basically, I'm trying to figure out how to efficiently parse the first XML file and check whether each book exists in the second one. If it exists, I'll check whether its product information (price, availability, etc.) has changed; if so, the second XML file needs to be updated as well.
If it doesn't exist, it needs to be added to the second XML file.
Using XMLReader I can parse each book from the first XML file fairly fast (about 40 seconds to loop through 4.5 million lines of XML and echo all the books) by using an approach similar to this.
My problem arises when I want to check whether a book exists in the second XML file and then update or add it there.
Would it, for example, be possible to use XMLReader on the second XML file, stop at the node with the same book title as the one I've stopped at in the first file, and then make the check? If so, how?
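One possible way to do this, as a rough sketch rather than a definitive answer: instead of seeking inside the second file for every book, stream the first file into an array keyed by product code, then stream the second file once and rewrite it with XMLWriter. This assumes that Productcode uniquely identifies a book, that the products sit directly under a single root element, and that the distinct books of one feed fit in memory as a PHP array. The function names readProducts and mergeFeeds and the <products> root element are made up for this illustration.

```php
<?php
// Stream every <product> in $path and pass its child values to $handle as an array.
function readProducts(string $path, callable $handle): void
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    // Advance to the first <product> element.
    while ($reader->read() && $reader->name !== 'product');
    while ($reader->name === 'product') {
        // Only this single product is expanded into memory.
        $node = simplexml_load_string($reader->readOuterXml());
        $fields = [];
        foreach ($node->children() as $child) {
            $fields[$child->getName()] = (string) $child;
        }
        $handle($fields);
        $reader->next('product'); // skip to the next sibling <product>
    }
    $reader->close();
}

// Merge the newer feed ($newPath) into the master file ($oldPath), writing $outPath.
function mergeFeeds(string $newPath, string $oldPath, string $outPath): array
{
    // Assumption: Productcode is the unique key of a book.
    $incoming = [];
    readProducts($newPath, function (array $p) use (&$incoming) {
        $incoming[$p['Productcode']] = $p;
    });

    $writer = new XMLWriter();
    $writer->openUri($outPath);
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('products'); // hypothetical root element
    $write = function (array $p) use ($writer) {
        $writer->startElement('product');
        foreach ($p as $name => $value) {
            $writer->writeElement($name, $value);
        }
        $writer->endElement();
        $writer->flush(); // keep the output buffer small
    };

    $stats = ['updated' => 0, 'added' => 0, 'unchanged' => 0];
    readProducts($oldPath, function (array $old) use (&$incoming, &$stats, $write) {
        $code = $old['Productcode'];
        if (isset($incoming[$code])) {
            $new = $incoming[$code];
            $changed = array_diff_assoc($new, $old) || array_diff_assoc($old, $new);
            $stats[$changed ? 'updated' : 'unchanged']++;
            $old = $new;
            unset($incoming[$code]); // whatever remains afterwards is new
        } else {
            $stats['unchanged']++;
        }
        $write($old);
    });
    foreach ($incoming as $p) { // books that exist only in the newer feed
        $write($p);
        $stats['added']++;
    }
    $writer->endElement();
    $writer->endDocument();
    $writer->flush();
    return $stats;
}
```

Writing to a separate output file and renaming it over the old one afterwards avoids editing a 200 MB document in place, which XMLReader cannot do anyway since it is read-only.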
Related
I am having trouble finding a solution to a problem I am facing when parsing XML files.
Let me describe what I have now and what the issue is:
I have links to XML files that contain, for example:
<products>
..
<product>
<id>1</id>
<name><![CDATA[ this is a test product name ]]></name>
<link><![CDATA[http://www.google.com]]></link>
<image><![CDATA[http://www.google.com/image.jpg]]></image>
<sku><![CDATA[ ]]></sku>
<category><![CDATA[ System > Technology ]]></category>
<price>20</price>
<description><![CDATA[ ]]></description>
<instock><![CDATA[ Y ]]></instock>
<availability>Y</availability>
</product>
..
</products>
Another XML has:
<products>
..
<product>
<productID>1</productID>
<title><![CDATA[ ]]></title>
<link><![CDATA[http://www.google.com]]></link>
<image><![CDATA[http://www.google.com/image.jpg]]></image>
<sku><![CDATA[ ]]></sku>
<categoryPath><![CDATA[ System > Technology ]]></categoryPath>
<price>20</price>
<description><![CDATA[ ]]></description>
<instock><![CDATA[ Y ]]></instock>
<availability>Y</availability>
<size>40</size>
</product>
..
</products>
The differences between the two are:
1) The first one has a tag named "name", while the other one has a tag named "title".
2) The second one has some tags that the first one does not.
The problem is that I am parsing the XML file in PHP like this:
$xml->products->product[$i]->id
$xml->products->product[$i]->name
and so on. Written this way, my code only works for the first feed. The missing tags are not a problem for now, because those fields are not required and I insert NULL into the database for them.
But what about the second XML? Can I handle it "automatically", so that I don't have to ask the client to correct those tags?
Or can this only be done manually, by fetching the content of the link (via PHP) and renaming the tags?
I do not have the file from my clients, just the link to the XML.
Thanks in advance!
OK! I believe I have found some solutions to my problem. I'm writing them here in case someone has the same issue:
Solutions:
i) Read all the child elements of the XML file, however they are written (tag names are case-sensitive), and add them to the database. After that, a dashboard/PHP file with SQL queries matches those child element tags against the ones you want.
In this case, you may want to create a file named whatever you like, for example test.xml, containing the XML you want with the correct tag elements. You could then update it every few hours (according to your needs) via a cron job.
ii) Manually create a PHP file with the parsing logic for every XML feed you receive. Just make sure to keep the XML link in your DB.
iii) Ask the client to give you the correct XML. XML is case-sensitive for a reason.
If you choose the first solution, you also need to change php.ini, because the XML files may be too large and max_execution_time is probably too low to run all these PHP/MySQL scripts.
If someone needs more explanation or has better advice, please share!
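A variation on solution i) that does the matching in PHP rather than in SQL could look like the sketch below. It is only an illustration: the FIELD_ALIASES map and the normalizeProduct function are hypothetical names, and the alias lists cover only the tags from the two sample feeds above. Each product is reduced to a row with fixed keys, and fields a feed does not provide become NULL.

```php
<?php
// Hypothetical alias map: canonical column name => tag names seen in client feeds.
const FIELD_ALIASES = [
    'id'       => ['id', 'productID'],
    'name'     => ['name', 'title'],
    'category' => ['category', 'categoryPath'],
    'price'    => ['price'],
];

// Turn one <product> element into a row with canonical keys.
// Tag names are matched case-insensitively; missing fields become null.
function normalizeProduct(SimpleXMLElement $product): array
{
    $values = [];
    foreach ($product->children() as $child) {
        // Casting to string also returns the content of CDATA sections.
        $values[strtolower($child->getName())] = trim((string) $child);
    }
    $row = [];
    foreach (FIELD_ALIASES as $canonical => $aliases) {
        $row[$canonical] = null;
        foreach ($aliases as $alias) {
            $key = strtolower($alias);
            if (isset($values[$key])) {
                $row[$canonical] = $values[$key];
                break;
            }
        }
    }
    return $row;
}

// Inline sample standing in for a client feed fetched from its link.
$xml = simplexml_load_string(
    '<products><product><productID>1</productID>'
    . '<title><![CDATA[ Test product ]]></title><price>20</price></product></products>'
);
foreach ($xml->product as $product) {
    $row = normalizeProduct($product);
    // $row can now be bound to a prepared INSERT statement.
}
```

Supporting a new client feed then only means adding its tag names to the alias map, not writing another parser.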
I have a simple PHP application that uses a MySQL DB, but I think a database may be unnecessary for such simple operations.
Anyway, I have some problems with the XML operations.
Let's say I want an XML structure like this:
<root>
<experiment>
<name>test</name>
<accessCount>5</accessCount>
<downloadEntry>
<date>2015-11-27</date>
<comment>comment</comment>
</downloadEntry>
<downloadEntry>
<date>2015-11-28</date>
<comment>comment</comment>
</downloadEntry>
</experiment>
</root>
Now I would like to know how to do these operations:
Count the download entries (the number of downloadEntry nodes) of the experiment named "test". Via XPath?
Get the download entries of the experiment named "test", but with pagination, so something like LIMIT 0,5.
The biggest problem is that when there are no experiments, so the XML file is empty, loading it with simplexml_load_file fails and I can't open it. Yes, I can add a condition: if the XML is empty, don't open it. But I need to write to it, and I can't write to it if it isn't open.
Is there a solution for that?
Thanks, everyone.
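A minimal sketch of how these three operations might be handled with SimpleXML. The function names loadExperiments, downloadEntries and downloadEntriesPage are invented for this example. It assumes the experiment name contains no double quote, and it treats a missing or zero-byte file as an empty <root/> document so that there is always something to write to.

```php
<?php
// Load the XML file, or start from an empty <root/> when the file is missing or empty.
function loadExperiments(string $path): SimpleXMLElement
{
    if (is_file($path) && filesize($path) > 0) {
        $xml = simplexml_load_file($path);
        if ($xml === false) {
            throw new RuntimeException("Invalid XML in $path");
        }
        return $xml;
    }
    return new SimpleXMLElement('<root/>');
}

// All <downloadEntry> nodes of the experiment with the given name.
function downloadEntries(SimpleXMLElement $xml, string $name): array
{
    // Assumption: $name contains no double quote (XPath 1.0 cannot escape it).
    $nodes = $xml->xpath('/root/experiment[name="' . $name . '"]/downloadEntry');
    return $nodes ?: [];
}

// Rough equivalent of SQL's LIMIT $offset, $limit.
function downloadEntriesPage(SimpleXMLElement $xml, string $name, int $offset, int $limit): array
{
    return array_slice(downloadEntries($xml, $name), $offset, $limit);
}

// Inline sample; in the application this would come from loadExperiments($path).
$xml = new SimpleXMLElement(
    '<root><experiment><name>test</name><accessCount>5</accessCount>'
    . '<downloadEntry><date>2015-11-27</date><comment>comment</comment></downloadEntry>'
    . '<downloadEntry><date>2015-11-28</date><comment>comment</comment></downloadEntry>'
    . '</experiment></root>'
);
$total = count(downloadEntries($xml, 'test'));          // number of entries
$firstPage = downloadEntriesPage($xml, 'test', 0, 5);   // like LIMIT 0,5
```

After adding experiments or entries with addChild(), calling $xml->asXML($path) writes the document back to disk, which also covers the case where the file did not exist before.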
How can I parse an 88 GB RDF file with PHP?
This RDF is filled with entities and facts about each entity.
I'm trying to iterate through each entity and check for certain facts about it, then write those facts to an XML document I created earlier in the script.
So as I navigate the RDF, for each entity I create a <card></card> element and give it a child called <facts>. I run through all the facts on the entity, take the ones I need, and write them as <fact></fact> children inside <facts></facts>.
How can I parse the RDF, extract the data, and write it to XML?
First, use an RDF parser. Googling for a PHP RDF parser turns up lots of results; I don't use PHP personally, but I'm sure one of them will do the job. Make sure it's a streaming parser, though: you're not going to hold 88 GB of RDF in memory on your workstation.
Second, you said you need to 'iterate through each entity'. That might be tricky if the triples are not sorted by subject in the original file, or if the parser does not report them in file order.
Assuming that is not a problem, you can keep the triples for each subject in a local data structure, and when you get a triple with a subject different from the one you've queued locally, run whatever business logic you need and write out the XML. You might want to make sure you can't queue up so many statements locally that you run out of memory.
Lastly, I'm going to assume you have a good reason to turn RDF into an XML format that is not RDF/XML, but you might reconsider your design just in case.
Or you could put the data in an RDF database and write SPARQL queries against it, transforming the query results into whatever XML or other format you need.
I think your best option would be:
Use some external tool (probably something like rapper?) to convert the source file from Turtle into N-Triples format.
Iterate over the file one line at a time via fopen + fgets; N-Triples enforces a strict one-statement-per-line constraint, which is perfect in this case.
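The line-by-line approach, combined with the per-subject queueing described in the earlier answer, could be sketched roughly as follows. This is a simplified illustration and not a full N-Triples parser: it assumes the file has already been converted to N-Triples, that all triples of one subject are adjacent, and it keeps the object as raw text. The function names, the <cards> root element and the subject/predicate attributes are made up for this example.

```php
<?php
// Split one N-Triples line into [subject, predicate, object].
// Returns null for blank lines, comments and lines that do not match.
function parseTriple(string $line): ?array
{
    $line = trim($line);
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    if (!preg_match('/^(\S+)\s+(\S+)\s+(.+?)\s*\.$/', $line, $m)) {
        return null;
    }
    return [$m[1], $m[2], $m[3]];
}

// Write one <card> containing only the facts whose predicate is in $wanted.
function writeCard(XMLWriter $out, string $subject, array $facts, array $wanted): void
{
    $out->startElement('card');
    $out->writeAttribute('subject', $subject);
    $out->startElement('facts');
    foreach ($facts as [$predicate, $object]) {
        if (in_array($predicate, $wanted, true)) {
            $out->startElement('fact');
            $out->writeAttribute('predicate', $predicate);
            $out->text($object);
            $out->endElement();
        }
    }
    $out->endElement(); // </facts>
    $out->endElement(); // </card>
    $out->flush();      // write to disk now so memory use stays flat
}

// Assumption: all triples of one subject are adjacent in the input file.
function convertTriples(string $ntPath, string $xmlPath, array $wanted): void
{
    $in = fopen($ntPath, 'r');
    if ($in === false) {
        throw new RuntimeException("Cannot open $ntPath");
    }
    $out = new XMLWriter();
    $out->openUri($xmlPath);
    $out->startDocument('1.0', 'UTF-8');
    $out->startElement('cards');

    $current = null; // subject whose facts are currently queued
    $facts = [];
    while (($line = fgets($in)) !== false) {
        $triple = parseTriple($line);
        if ($triple === null) {
            continue;
        }
        [$subject, $predicate, $object] = $triple;
        if ($current !== null && $subject !== $current) {
            writeCard($out, $current, $facts, $wanted);
            $facts = [];
        }
        $current = $subject;
        $facts[] = [$predicate, $object];
    }
    if ($current !== null) {
        writeCard($out, $current, $facts, $wanted); // last entity in the file
    }
    fclose($in);
    $out->endElement();
    $out->endDocument();
    $out->flush();
}
```

Only the facts of one entity are held in memory at a time, so the 88 GB input size matters for run time but not for memory, as long as no single entity has an enormous number of triples.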
I am trying to build a very simple price comparison script.
So far, I have written code that fetches product XML feeds from shops and, with the help of XSLT, creates a single global XML file from all those input XMLs. I use XSLT because the shops use different element names.
Now I want to take it one step further and create a search form that displays the matching products when I search for a term, let's say "laptop".
I know how to create a form, but I need some coding guidance on how to make it search my XML file (products.xml) and display let's say the
Thank you
You might want to check out http://php.net/manual/en/class.xmlreader.php
Using that it is pretty easy to navigate through an XML file and grab all the info you need.
EDIT:
On second thought, http://php.net/manual/en/book.simplexml.php is a MUCH simpler way to achieve what you're trying to do. Hence the name, I guess ;)
You can use the SimpleXML library to parse your XML file. In my opinion SimpleXML is easier to use than XMLReader. Note, though, that SimpleXML was only introduced in PHP 5.
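As a rough sketch of what such a search might look like with SimpleXML: the <product>, <name> and <price> element names are assumptions about how products.xml is laid out, searchProducts is a made-up function name, and an inline sample stands in for the real file.

```php
<?php
// Return the products whose name contains $term (case-insensitive).
// The <product> and <name> element names are assumptions about products.xml.
function searchProducts(SimpleXMLElement $xml, string $term): array
{
    $matches = [];
    foreach ($xml->product as $product) {
        if (stripos((string) $product->name, $term) !== false) {
            $matches[] = $product;
        }
    }
    return $matches;
}

// Inline sample; the real script would use simplexml_load_file('products.xml')
// and take the term from the submitted form, e.g. $_GET['q'] ?? ''.
$xml = simplexml_load_string(
    '<products>'
    . '<product><name>Gaming Laptop 15</name><price>899</price></product>'
    . '<product><name>USB cable</name><price>5</price></product>'
    . '</products>'
);
foreach (searchProducts($xml, 'laptop') as $product) {
    // Escape values before printing them into an HTML page.
    echo htmlspecialchars($product->name . ' - ' . $product->price), PHP_EOL;
}
```

This loads the whole file for every search, which is fine for a small feed; for a large products.xml, XMLReader or a database would probably be the better fit.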
I have an XML feed that I have to check periodically for updates. The XML consists of many elements, and I'm trying to figure out the best (and probably fastest) way to find out which elements have changed since the last time I checked.
My idea is to first check lastBuildDate and, if it differs from the previous value, parse the XML again. This would involve keeping each element with all of its attributes in my database. But each element can have a different number of attributes as well as other nested elements. So if I store each element in my database, what would be the best way to keep them?
That's why I'm asking for your help :) Thank you.
Most modern databases will store your XML as a blob if you like. (You tagged PHP... MySQL? If so, use MEDIUMTEXT.) Store your XML and generate a diff when you get a new one. If you don't have an XML diff tool, canonicalize both XML listings then run a text diff.
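A variant of the canonicalize-then-diff idea, sketched under some assumptions: instead of diffing the whole document as text, each element is canonicalized with DOMNode::C14N() and hashed, so only one short hash per element has to be stored. The <item> and <guid> element names and the function names fingerprintItems and diffFeeds are assumptions for this illustration, not something taken from the feed in the question.

```php
<?php
// Map each <item> to a hash of its canonical XML, keyed by its <guid>.
// The <item> and <guid> element names are assumptions about the feed.
function fingerprintItems(string $xmlString): array
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = false; // ignore indentation-only differences
    if (!$doc->loadXML($xmlString)) {
        throw new RuntimeException('Invalid XML');
    }
    $hashes = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        $guid = $item->getElementsByTagName('guid')->item(0)->textContent;
        // C14N() normalizes attribute order, quoting and empty-element syntax.
        $hashes[$guid] = sha1($item->C14N());
    }
    return $hashes;
}

// Compare the hashes stored from the last run with the current ones.
function diffFeeds(array $previous, array $current): array
{
    return [
        'added'   => array_keys(array_diff_key($current, $previous)),
        'removed' => array_keys(array_diff_key($previous, $current)),
        'changed' => array_keys(array_diff_assoc(
            array_intersect_key($current, $previous),
            $previous
        )),
    ];
}
```

The hash array can be stored as a two-column table (guid, hash). Elements with a varying number of attributes or nested children then need no special schema, because only changed items have to be parsed again.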