How can I parse an 88 GB RDF file with PHP?
This RDF is filled with entities and facts about each entity.
I'm trying to iterate through each entity and check for certain facts on it, then write those facts to an XML document I created earlier in the script.
So as I navigate the RDF, for each entity I create a <card></card> element and give it a child called <facts>. I run through all the facts on the entity, take the ones I need, and write them as <fact></fact> children inside the <facts></facts> element.
How can I parse the rdf, extract the data, and write it to XML?
First, use an RDF parser. Googling for a PHP RDF parser turned up lots of results; I don't use PHP personally, but I'm sure one of them will do the job of parsing RDF. Just make sure it's a streaming parser; you're not going to hold 88 GB of RDF in memory on your workstation.
Second, you said you need to 'iterate through each entity'. That might be tricky if the triples aren't sorted by subject in the original file, or if the parser doesn't report them in that order.
Assuming that is not a problem, you can keep the triples for each subject in a local data structure, and when you get a triple with a subject different from the one you've been queuing locally, run whatever business logic you need and write out the XML. You might also want to make sure you can't queue up so many statements locally that you run out of memory.
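Roughly like this, assuming the parser hands you triples one at a time as (subject, predicate, object) arrays and that you write the output with XMLWriter; writeCard() and isWantedFact() are placeholders for your own code:

function handleTriple(array $triple, array &$facts, &$current, XMLWriter $xml) {
    list($s, $p, $o) = $triple;
    // $current starts out as null; flush when the subject changes
    if ($current !== null && $s !== $current) {
        writeCard($current, $facts, $xml);
        $facts = array();
    }
    $current = $s;
    $facts[] = array($p, $o);
}

function writeCard($subject, array $facts, XMLWriter $xml) {
    $xml->startElement('card');
    $xml->startElement('facts');
    foreach ($facts as $fact) {
        list($p, $o) = $fact;
        if (isWantedFact($p)) {            // isWantedFact() is your own filter
            $xml->startElement('fact');
            $xml->writeAttribute('predicate', $p);
            $xml->text($o);
            $xml->endElement();
        }
    }
    $xml->endElement(); // </facts>
    $xml->endElement(); // </card>
}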
Lastly, I'm going to assume you have a good reason to take RDF and turn it into an XML format that is not RDF/XML. But you might reconsider your design just in case.
Or you could put the data in an RDF database and write SPARQL queries against it, transforming query results into whatever XML or anything else you need.
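For what it's worth, querying a SPARQL endpoint from PHP needs nothing beyond plain HTTP; a rough sketch (the endpoint URL here is made up):

$endpoint = 'http://localhost:3030/ds/query';   // illustrative endpoint
$query = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10';
$ctx = stream_context_create(array('http' => array(
    'header' => "Accept: application/sparql-results+json\r\n",
)));
$json = file_get_contents($endpoint . '?query=' . urlencode($query), false, $ctx);
$results = json_decode($json, true);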
I think your best option would be:
use some external tool (probably something like rapper?) to convert the source file from Turtle into N-Triples format
iterate over the file one line at a time via fopen+fgets, since N-Triples defines a strict one-statement-per-line constraint, which is perfect in this case
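A minimal sketch of that loop (the whitespace split is deliberately crude; a real N-Triples reader has to honour quoting inside literals):

$fh = fopen('data.nt', 'r');
while (($line = fgets($fh)) !== false) {
    $line = trim($line);
    if ($line === '' || $line[0] === '#') {
        continue; // skip blank lines and comments
    }
    // subject, predicate, and the rest of the line as the object
    list($s, $p, $o) = preg_split('/\s+/', $line, 3);
    // ... group by $s and write out your XML as described above ...
}
fclose($fh);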
Related
I'm looking into the possibility of efficiently comparing two similar XML files and updating outdated information.
The main XML file I'm working with is about 200-250 MB in size. The second is a tad smaller.
The two XML files pretty much look like this:
<product>
<Category>BOOK</Category>
<Bookgroup>BOOKF</Bookgroup>
<Productname>Name of the book</Productname>
<Productcode>123456789</Productcode>
<Price>79.00</Price>
<Availability>Stock On Order</Availability>
<ProductURL>www.url.com</ProductURL>
<Release>07.08.2013</Release>
<Author>Name of author</Author>
<Genre>Crime</Genre>
<BookType>Pocket</BookType>
<Language>English</Language>
</product>
As you can see I'm working with books, and the purpose of having a second XML file with the same information is that I only want one copy of each book for further use.
Basically I'm trying to figure out how I can efficiently parse through the first XML and check whether the book exists in the second XML. If it exists, I'll check whether the product information (price, availability etc.) has been updated. If it has, it needs to be updated in the second XML as well.
If it doesn't exist, it needs to be added to the second XML.
Using XMLReader I'm able to parse through each book from the first XML fairly fast (40-ish seconds to loop through 4.5 million lines of XML and echo out all the books) by using a similar approach as this.
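For context, the loop I mean looks roughly like this (a minimal sketch; element names from my sample above):

$reader = new XMLReader();
$reader->open('products.xml');
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        // expand just this <product> into a DOM node and wrap it in SimpleXML
        $doc = new DOMDocument();
        $book = simplexml_import_dom($doc->importNode($reader->expand(), true));
        echo (string)$book->Productname, "\n";
    }
}
$reader->close();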
My problem occurs when I want to check if this book exists in the second XML and make changes in the second XML if it needs to be updated or added.
Would it for example be possible to use XMLReader on the second XML and stop at nodes with the same book title as the one I've stopped at in the first XML, and then make the check? If so, how?
I'm using PHP to take XML files and convert them into single-line, tab-delimited plain text with set columns (i.e. it ignores certain tags the database does not need, and certain tags will be empty). The problem I ran into is that it took 13 minutes to go through 56k (+ change) files, which I think is ridiculously slow. (An average folder has upwards of a million XML files.) I'll probably cronjob it overnight anyway, but it is completely untestable at a reasonable pace while I'm at work for things like missing files and corrupt files and such.
Here's hoping someone can help me make the thing faster. The XML files themselves are not too big (<1k lines) and I don't need every single data tag, just some. Here's my data node method:
function dataNode($entries) {
    $out = "";
    foreach ($entries as $e) {
        // element text, then a marker separating it from its attributes
        $out .= $e->nodeValue . "[ATTRIBS]";
        foreach ($e->attributes as $name => $node) {
            $out .= $name . "=" . $node->nodeValue;
        }
    }
    return $out;
}
where $entries is a DOMNodeList generated from XPath queries for the nodes I need. So the question is: what is the fastest way to go to a target data node or nodes (if I have 10 keyword nodes from my XPath query then I need all of them printed from that function) and output the node value and all its attributes?
I read here that iterating through a DOMNodeList isn't constant time, but I can't really use the solution given, because a sibling of the node I want might be one that I don't need, or might need a different format function called before I write it to file, and I really don't want to run every node through a gigantic switch statement on each iteration trying to format the data.
Edit: I'm an idiot; I had my write function inside my processing loop, so on every iteration it had to reopen the file I was writing to. Thanks for both of your help. I'm trying to learn XSLT right now, as it seems very useful.
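For anyone hitting the same thing, the fix was just opening the handle once, outside the loop; a sketch (processFile() stands in for my conversion code):

$fh = fopen('output.txt', 'a');      // open once, before the loop
foreach ($files as $file) {
    fwrite($fh, processFile($file)); // processFile() is illustrative
}
fclose($fh);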
A comment would be a little short, so I'll write this as an answer:
It's hard to say where exactly your setup can benefit from optimizing. Perhaps it's possible to join many of your XML files together before loading.
From the information you give in your question, I would assume it's the disk operations that are taking the time rather than the XML parsing. I've found DOMDocument and XPath quite fast even on large files: an XML file of up to 60 MB takes about 4-6 seconds to load, a file of 2 MB only a fraction of that.
Having many small files (< 1k lines) means a lot of disk work, opening and closing files. Additionally, I have no clue how you iterate over directories/files; sometimes this can be sped up dramatically as well, especially as you say that you have millions of file nodes.
So perhaps concatenating/merging files is an option for you; it can be done quite safely and would reduce the time needed to test your converter.
If you encounter missing or corrupt files, you should log and catch these errors. That way you can let the job run through and check for errors later.
Additionally, if possible, try to make your workflow resumable: e.g. if an error occurs, save the current state, and next time continue from that state.
The suggestion in a comment above to run an XSLT on the files first to transform them is a good idea as well. Adding a layer in the middle to transpose the data can reduce the overall problem dramatically, because it reduces complexity.
This workflow on XML files has helped me so far:
Preprocess the file (plain text filters, optional)
Parse the XML: load it into DOMDocument, iterate with XPath, etc.
My parser sends out events with the parsed data when it finds any.
The parser throws a specific exception if it encounters data that is not in the expected format. That makes it possible to catch errors in the parser itself.
All other errors are converted to exceptions as well.
Exceptions can be caught and operations finished cleanly, e.g. by moving on to the next file.
Logger, resumer and exporter (file export) can hook onto the events; sort of the visitor pattern. A rough sketch of the event idea follows below.
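All names here are invented, but roughly: the parser emits events, and the logger/exporter subscribe to them:

class ParserEvents {
    private $listeners = array();
    public function on($event, callable $fn) {
        $this->listeners[$event][] = $fn;
    }
    public function emit($event, $payload = null) {
        $fns = isset($this->listeners[$event]) ? $this->listeners[$event] : array();
        foreach ($fns as $fn) {
            $fn($payload);
        }
    }
}

$events = new ParserEvents();
$events->on('record', function ($data) { echo formatRow($data); }); // formatRow() is hypothetical
$events->on('error', function ($e) { error_log($e->getMessage()); });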
I've built such a system to process larger XML files whose formats change. It's flexible enough to deal with changes (e.g. replacing the parser with a new version while keeping logging and exporting intact). The event system really made it work for me.
Instead of a gigantic switch statement, I normally use a $state variable for the parser's state while iterating over a DOMNodeList. $state can be handy for resuming operations later: restore the state and go to the last known position, then continue.
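Sketched out (state names made up; the point is a small state machine on $state rather than one giant switch over every node type):

$state = 'expect-title';
foreach ($nodes as $node) {   // $nodes is your DOMNodeList from XPath
    switch ($state) {
        case 'expect-title':
            $title = $node->nodeValue;
            $state = 'expect-body';
            break;
        case 'expect-body':
            handleEntry($title, $node->nodeValue); // your own handler
            $state = 'expect-title';
            break;
    }
}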
I'm trying to determine the best course of action for the display of data for a project I'm working on. My client is currently using a proprietary CMS geared towards managing real estate. It easily lets you add properties, square footage, price, location, etc. The company that runs this CMS provides the data in a pretty straightforward XML file that they say offers access to all of the data my client enters.
I've read up on PHP5's SimpleXML feature and I grasp the basic concepts well enough, but my question is: can I access the XML data in a similar fashion as if I were querying a MySQL database?
For instance, assuming each entry has a unique ID, will I be able to set up a view and display just that record using a URL variable like: http://example.com/apartment.php?id=14
Can you also display results based on values within strings? I'm thinking of a form submit that returns only two-bedroom properties in this case.
Sorry in advance if this is a noob question. I'd rather not build a custom CMS for my client, if for no other reason than that they'd only have to log in to one location and update accordingly.
Some short answers on your questions:
a. Yes, you can access XML data with queries, but using XPath instead of SQL. XPath is for XML what SQL is for databases, though it works quite differently.
b. Yes, you can build a PHP program that receives an id as a parameter and uses it for an XPath search on a given XML file.
c. All data in an XML file is a string, so searching for or displaying strings is no problem. Even your example id=14 is handled as a string.
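A minimal sketch covering points b and c, assuming the feed looks something like <properties><property id="14"><bedrooms>2</bedrooms>...</property></properties> (all names made up; adjust to the real feed):

$xml = simplexml_load_file('listings.xml');          // path is illustrative

// apartment.php?id=14
$id = (int)(isset($_GET['id']) ? $_GET['id'] : 0);   // cast, so nothing unsafe lands in the query
$hits = $xml->xpath("//property[@id='$id']");
if ($hits) {
    $property = $hits[0];
    // render the single record here
}

// form submit that returns only two-bedroom properties
$twoBed = $xml->xpath("//property[bedrooms='2']");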
You might be interested in this further information:
http://www.ibm.com/developerworks/library/x-simplexml.html?S_TACT=105AGX06&S_CMP=LP
http://www.ibm.com/developerworks/library/x-xmlphp1.html?S_TACT=105AGX06&S_CMP=LP
PHP can access XML not only via SimpleXML but also with DOM. SimpleXML accesses the elements like PHP-arrays, DOM provides a w3c-DOM-compatible api.
See php.net for other ways to access XML, but they don't seem appropriate for your case.
I want to store the contents of an xml file in the database. Is there an easy way to do it?
Can I write some script that can do the task for me?
The schema of the XML file looks like this:
<schedule start="20100727120000 +0530" stop="20100727160000 +0530" ch_id="0210.CHNAME.in">
<title>Title_info</title>
<date>20100727</date>
<category>cat_02</category>
</schedule>
One thing to note: how do I read the start time? I need the +0530 offset applied to the time.
Thank you so much.
You'll probably want to create a table called schedules that matches your data, then read the contents of the XML file with an XML parser of your choice. SimpleXML might be the right tool for this job.
As for the dates, I recommend you try using the function date_parse_from_format().
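date_parse_from_format() and its sibling DateTime::createFromFormat() share the same format syntax; 'YmdHis O' matches your sample's digits plus the UTC offset. For example (assuming <schedule> is the document root; the file name is illustrative):

$xml = simplexml_load_file('schedule.xml');
$raw = (string)$xml['start'];                        // "20100727120000 +0530"
$start = DateTime::createFromFormat('YmdHis O', $raw);
echo $start->format(DateTime::ATOM);                 // 2010-07-27T12:00:00+05:30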
Look up SimpleXML on the PHP site. Offhand I'm not too hot on it, but basically you will end up with a loop which adds your data to an object, e.g.:
$xml
and you will be able to call tags like $xml->schedule->title, $xml->schedule->date and $xml->schedule->category, and attributes like $xml->schedule['start'], but you might want to check that.
I had to do this recently for a client, and this was the best way I could find. The attributes may be tricky; I can't quite remember, but you might have to look into namespaces and such... anyway, look up SimpleXML and you're on the right track.
I have an XML feed that I have to check periodically for updates. The XML consists of many elements, and I'm trying to figure out the best (and probably fastest) way to find out which elements have been updated since the last time I checked.
What I'm thinking of is to first check the lastBuildDate for modifications, and if it differs from the previous one, to start parsing the XML again. This would involve keeping each element with all of its attributes in my database. But each element can have a different number of attributes, as well as other nested elements. So if I were to store each element in my database, what would be the best way to keep them?
That's why I'm asking for your help :) Thank you.
Most modern databases will store your XML as a blob if you like. (You tagged PHP... MySQL? If so, use MEDIUMTEXT.) Store your XML and generate a diff when you get a new one. If you don't have an XML diff tool, canonicalize both XML documents and then run a text diff.
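A minimal sketch of the canonicalize-then-compare step, using DOMDocument's built-in C14N() (file names are illustrative):

$old = new DOMDocument();
$old->load('feed_previous.xml');
$new = new DOMDocument();
$new->load('feed_current.xml');

if ($old->C14N() !== $new->C14N()) {
    // the feed changed: reparse and update your stored copy,
    // or hand both canonical strings to a text diff
}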