I'm using PHP to take XML files and convert them into single-line, tab-delimited plain text with set columns (i.e. it ignores certain tags if the database does not need them, and certain tags will be empty). The problem I ran into is that it took 13 minutes to go through 56k (+ change) files, which I think is ridiculously slow. (The average folder has upwards of a million XML files.) I'll probably cronjob it overnight anyway, but it is completely untestable at a reasonable pace while I'm at work for things like missing files, corrupt files and such.
Here's hoping someone can help me make the thing faster. The XML files themselves are not too big (<1k lines) and I don't need every single data tag, just some. Here's my data node method:
function dataNode($entries) {
    $out = "";
    foreach ($entries as $e) {
        $out .= $e->nodeValue . "[ATTRIBS]";
        foreach ($e->attributes as $name => $node)
            $out .= $name . "=" . $node->nodeValue;
    }
    return $out;
}
where $entries is a DOMNodeList generated from XPath queries for the nodes I need. So the question is: what is the fastest way to go to a target data node or nodes (if I have 10 keyword nodes from my XPath query then I need all of them printed from that function) and output the node value and all its attributes?
I read here that iterating through a DOMNodeList isn't constant time, but I can't really use the solution given, because a sibling of the node I want might be one that I don't need, or one that needs a different format function called before I write it to file, and I really don't want to run every node through a gigantic switch statement on each iteration trying to format the data.
Edit: I'm an idiot. I had my write function inside my processing loop, so every iteration had to reopen the file I was writing to. Thanks to both of you for your help; I'm trying to learn XSLT right now as it seems very useful.
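For anyone hitting the same problem, a minimal sketch of the fix described in the edit above (the glob() pattern, the //keyword query and the output filename are assumptions, not from the original code): open the output handle once before the loop, write inside it, and close once at the end.

$out = fopen('export.tsv', 'w');                    // open once, outside the loop
foreach (glob('/path/to/xml/*.xml') as $file) {     // file source is an assumption
    $doc = new DOMDocument();
    if (!@$doc->load($file)) {                      // skip corrupt files, log them for later
        error_log("Could not parse $file");
        continue;
    }
    $xpath   = new DOMXPath($doc);
    $entries = $xpath->query('//keyword');          // example query, adjust to your schema
    fwrite($out, dataNode($entries) . "\n");        // write inside the loop...
}
fclose($out);                                       // ...but open/close only once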
A comment would be a little short, so I'll write it as an answer:
It's hard to say where exactly your setup can benefit from optimizing. Perhaps it's possible to join many of your XML files together before loading.
From the information you give in your question, I would assume that it's more the disk operations that are taking the time than the XML parsing. I found DOMDocument and XPath quite fast even on large files. An XML file of up to 60 MB takes about 4-6 seconds to load, a file of 2 MB only a fraction of that.
Having many small files (< 1k) means a lot of work on the disk, opening and closing files. Additionally, I have no clue how you iterate over directories/files; sometimes this can be sped up dramatically as well, especially as you say that you have millions of files.
So perhaps concatenating/merging files is an option for you, which can be done quite safely so as to reduce the time it takes to test your converter.
If you encounter missing or corrupt files, you should create a log and catch these errors. That way you can let the job run through and check for errors later.
Additionally, if possible, try to make your workflow resumable, e.g. if an error occurs, the current state is saved and next time you can continue from that state.
The suggestion in a comment above to run an XSLT on the files first is a good idea as well. Adding a layer in the middle to transform the data can reduce the overall problem dramatically, because it reduces complexity.
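As a rough illustration of that suggestion, a minimal sketch using PHP's XSLTProcessor (the stylesheet transform.xsl and the file names are placeholders; the stylesheet would use <xsl:output method="text"/> to emit the tab-delimited line):

$xslDoc = new DOMDocument();
$xslDoc->load('transform.xsl');                 // hypothetical stylesheet producing tab-delimited text
$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);

$xmlDoc = new DOMDocument();
$xmlDoc->load('input.xml');                     // one of the many small source files
$line = $proc->transformToXML($xmlDoc);         // returns the transformed output as a string

file_put_contents('export.tsv', $line, FILE_APPEND);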
This workflow on XML files has helped me so far:
Preprocess the file (plain text filters, optional)
Parse the XML: load it into DOMDocument, iterate with XPath, etc.
My parser sends out events with the parsed data when it finds it.
The parser throws a specific exception if it encounters data that is not in the expected format. That makes it possible to spot errors in the parser itself.
All other errors are converted to exceptions as well.
Exceptions can be caught and operations finished cleanly, e.g. move on to the next file.
Logger, resumer and exporter (file export) can hook onto the events, sort of a visitor pattern.
I've built such a system to process larger XML files whose formats change. It's flexible enough to deal with changes (e.g. replace the parser with a new version while keeping logging and exporting). The event system really made it work for me.
Instead of a gigantic switch statement, I normally use a $state variable for the parser's state while iterating over a DOMNodeList. $state can be handy for resuming operations later: restore the state, go to the last known position, then continue.
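A minimal sketch of that $state idea (the node names, states and $nodeList are made-up assumptions for illustration): the switch dispatches on a small set of parser states rather than on every possible node type.

$record = array('keywords' => array());
$state  = 'expect_title';                        // hypothetical states
foreach ($nodeList as $node) {                   // $nodeList: a DOMNodeList from your XPath query
    switch ($state) {
        case 'expect_title':
            if ($node->nodeName === 'title') {
                $record['title'] = $node->nodeValue;
                $state = 'expect_keywords';      // advance the state, resumable if saved
            }
            break;
        case 'expect_keywords':
            if ($node->nodeName === 'keyword') {
                $record['keywords'][] = $node->nodeValue;
            }
            break;
    }
}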
Related
How can I parse an 88 GB RDF file with PHP?
This RDF is filled with entities and facts about each entity.
I'm trying to iterate through each entity and check for certain facts on each entity, then write those facts to an XML document I created earlier in the script.
So as I navigate the RDF, for each entity I create a <card></card> element and give it a child called <facts>. I run through all the facts on the entity, take the ones I need, and write them as <fact></fact> element children inside the <facts></facts>.
How can I parse the rdf, extract the data, and write it to XML?
First, use an RDF parser. Googling for a PHP RDF parser turns up lots of results; I don't use PHP personally, but I'm sure one of them will do the job of parsing RDF. Just make sure it's a streaming parser; you're not going to hold 88 GB of RDF in memory on your workstation.
Second, you said you need to 'iterate through each entity'. That might be tricky if either the entities are not sorted by subject in the original file, or the parser does not report them in that order.
Assuming that is not a problem, you can just keep the triples for each subject in a local data structure, and when you get a triple with a subject different from the ones you've queued locally, do whatever business logic you need and write out the XML. You might want to make sure you can't queue up so many statements locally that you'll run out of memory.
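A minimal sketch of that buffering idea, assuming a streaming source of triples (read_triples() and write_card() are hypothetical helpers, not from any particular library):

$currentSubject = null;
$buffer = array();                                   // triples queued for the current subject

foreach (read_triples('data.nt') as $triple) {       // hypothetical streaming generator yielding [s, p, o]
    list($s, $p, $o) = $triple;
    if ($currentSubject !== null && $s !== $currentSubject) {
        write_card($currentSubject, $buffer);        // hypothetical: emit <card><facts>...</facts></card>
        $buffer = array();
    }
    $currentSubject = $s;
    $buffer[] = array($p, $o);
}
if ($buffer) {
    write_card($currentSubject, $buffer);            // flush the last subject
}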
Lastly, I'm going to assume you have a good reason to take RDF and turn it into an XML format that is not RDF/XML, but you might want to reconsider your design just in case.
Or you could put the data in an RDF database and write SPARQL queries against it, transforming query results into whatever XML or anything else you need.
I think your best option would be:
use some external tool (probably something like rapper?) to convert the source file from Turtle into N-Triples format
iterate over the file one line at a time via fopen + fgets, since N-Triples has a strict one-statement-per-line constraint, which is perfect in this case (a sketch follows below)
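A minimal sketch of that line-by-line loop (the file name is assumed; the actual triple parsing depends on what you need to extract):

$fh = fopen('data.nt', 'r');                         // the converted N-Triples file
while (($line = fgets($fh)) !== false) {
    $line = trim($line);
    if ($line === '' || $line[0] === '#') {          // skip blank lines and comments
        continue;
    }
    // one complete statement per line; split or match it as your extraction requires,
    // e.g. preg_match('/^(\S+)\s+(\S+)\s+(.+)\s*\.$/', $line, $m);
}
fclose($fh);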
There are quite a few threads about similar topics, yet I have not been able to fully work out a solution to my problem.
What I'd like to do is quite simple. I have a flat-file DB, with data stored like this:
$username:$worldLocation:$resources
The issue is that I would like to have a submit-data HTML page that updates this line, based upon a search for the term, using PHP:
search db for - $worldLocation
if $worldLocation found
replace entire line with $username:$worldLocation:$updatedResources
I know there should be a fairly easy way to get this done, but I am unable to figure it out at the moment. I will keep trying while this post is up, but if you know a way I could use, I would greatly appreciate the help.
Thank you
I always loved C, and the functions that came into PHP from C.
Check out fscanf and fprintf.
These will make your life easier when reading and writing in a fixed format. For example:
$filehandle = fopen("file.txt", "r");   // "r" for reading; "c" is a write-only mode
while ($values = fscanf($filehandle, "%s\t%s\t%s\n")) {
    list($a, $b, $c) = $values;
    // do something with $a, $b, $c
}
fclose($filehandle);
Also, there is no performance workaround that avoids reading the entire file, changing one line, and writing the entire file back. You have to do it.
This is about as efficient as you can get, because you are most probably running native C code; I read somewhere that PHP just wraps C's functions in these cases.
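A minimal sketch of that read/modify/write approach for the colon-delimited format in the question (the file name is an assumption; $worldLocation and $updatedResources would come from your submit form):

$lines = file('db.txt', FILE_IGNORE_NEW_LINES);       // read the whole flat file
foreach ($lines as $i => $line) {
    $fields = explode(':', $line);                    // $username:$worldLocation:$resources
    if (isset($fields[1]) && $fields[1] === $worldLocation) {
        $lines[$i] = "$fields[0]:$worldLocation:$updatedResources";
        break;                                        // assumes each location appears only once
    }
}
file_put_contents('db.txt', implode("\n", $lines) . "\n");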
You like the hard way, so be it...
Make each line the same length. Add spaces, tabs, capital X's etc. to fill in the blanks.
When you want to replace the line, find it, and since each line is of a fixed length you can overwrite it in place.
For speed and less hassle, use a database (even SQLite).
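A rough sketch of the fixed-length idea above (the record length and file name are assumptions; $username, $worldLocation and $updatedResources come from the question):

$recordLength = 64;                                   // assumed fixed line length, newline included
$fh = fopen('db.txt', 'r+');
$lineNumber = 0;
while (($line = fgets($fh)) !== false) {
    $fields = explode(':', rtrim($line));
    if (isset($fields[1]) && $fields[1] === $worldLocation) {
        $new = str_pad("$username:$worldLocation:$updatedResources", $recordLength - 1) . "\n";
        fseek($fh, $lineNumber * $recordLength);      // jump back to the start of this record
        fwrite($fh, $new);                            // overwrite in place, same length
        break;
    }
    $lineNumber++;
}
fclose($fh);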
If you're committed to the flat file, the simplest thing is iterating through each line, writing a new file & changing the one that matches.
Yeah, it sucks.
I'd strongly recommend switching over to a 'proper' database. If you're concerned about resources or the complexity of running a server, you can look into SQLite or Berkeley DB. Both of these use a database that is 'just a file', removing the issue of installing and maintaining a DB server, but still giving you the ability to quickly and easily search, replace and delete individual records. If you still need the flat file for some other reason, you can easily write some import/export routines.
Another interesting possibility, if you want to be creative, would be to look at your filesystem as a database. Give each user a directory. In each directory, have a file for locations. In each file, update the resources. This means that, to insert a row, you just write to a new file. To update a file, you just rewrite a single file. Deleting a user is just nuking a directory. Sure, there's a bit more overhead in slurping the whole thing into memory.
Other ways of solving the problem might be to make your flat-file write-only, since appending to the end of a file is a trivial operation. You then create a second file that lists "dead" line numbers that should be ignored when reading the flat file. Similarly, you could easily "X" out the existing lines (which, again, is far easier than trying to update lines in a file that might not be the same length) and append your new data to the end.
Those second two ideas aren't really meant to be practical solutions as much as they are to show you that there's always more than one way to solve a problem.
OK... after a few hours' work, this example worked fine for me.
I intended to code an editing tool and used it for password updates, and it did the trick!
Not only does this page send an email to the user with the new password (sorry, the address is hardcoded to avoid posting additional code), but it also edits the entry for that user and re-writes all the file info into a new file.
When done, it obviously swaps filenames, storing the old file as usuarios_old.txt.
Grab the code here (sorry, Stack Overflow got VERY picky about code posting):
https://www.iot-argentina.xyz/edit_flat_databse.txt
Is this what you are looking for:
UPDATE `table` SET `field_to_replace` = '$username:$worldLocation:$updatedResources' WHERE `field` = '$worldLocation';
I have files I need to convert into a database. These files (I have over 100k of them) are from an old system (generated by a COBOL script). I am now part of the team that migrates data from this system to the new system.
Now, because we have a lot of files to parse (each file is from 50 MB to 100 MB), I want to make sure I use the right methods to convert them into SQL statements.
Most of the files have the following format:
#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
the address2 is optional and can be empty
or
#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
These are the 2 most common line formats (I'd say around 50%); other than these, all the lines look the same but with different information.
Now, my question is: what should I do to open them as efficiently as possible and parse them correctly?
Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can output into SQL commands which you can also insert via a command line.
If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Output is an array which you can parse into your insert. One of the first user-provided examples takes CSV and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine.
If the id# in the first column is unique in your input data, then you should definitely make it the primary key in MySQL, to save you from duplicating data if you have to restart your batch.
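A minimal sketch of the fgetcsv() approach for the first line format above (the table, column names, file name and credentials are placeholders):

$pdo = new PDO('mysql:host=localhost;dbname=migration', 'user', 'pass');   // placeholder credentials
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO clients (id, name, address1, address2, city, state, zip, country)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?)'
);

$fh = fopen('export001.dat', 'r');                  // one of the COBOL export files (name assumed)
while (($row = fgetcsv($fh, 0, "\t")) !== false) {
    $row[0] = trim($row[0], '#');                   // strip the leading # from the id
    array_pop($row);                                // drop the trailing # field
    $stmt->execute(array_slice($row, 0, 8));        // id, name, address1, address2, city, state, zip, country
}
fclose($fh);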
When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, SQL), we got a big gain in performance using the function preg_match_all() (less than 10% of the time required using explode / trim / formatting).
Huge files (>100 MB) are parsed in 2 or 3 minutes on a Core 2 Duo (the drawback is that memory consumption is very high, since it creates a giant array with all the information ready to be synthesized).
Regular expressions allow you to identify the content of a line if you have variations within the same file.
But if your files are simple, try ghoti's suggestion (fgetcsv); it will work fine.
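As a rough illustration of the preg_match_all() approach for the first line format above (the file name, pattern and field mapping are assumptions):

$contents = file_get_contents('export001.dat');      // whole file in memory, as noted above
$pattern  = '/^#(\d+)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t#\r?$/m';
preg_match_all($pattern, $contents, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = id, $m[2] = name, $m[3] = address1, $m[4] = address2,
    // $m[5] = city, $m[6] = state, $m[7] = zip, $m[8] = country
}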
If you're already familiar with PHP then using it is a perfectly fine tool.
If records do not span multiple lines, the best way to do this to guarantee that you won't run out of memory will be to process one line at a time.
I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.
If you can use the CSV features and you use the SPL, make sure to set your options correctly for the tab characters.
You can use trim to remove the # from the first and last fields easily enough after the call to fgetcsv
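A minimal sketch with the SPL classes mentioned above (the directory path is a placeholder):

$dir = new RecursiveDirectoryIterator('/data/exports', FilesystemIterator::SKIP_DOTS);
foreach (new RecursiveIteratorIterator($dir) as $fileInfo) {
    $file = $fileInfo->openFile('r');
    $file->setFlags(SplFileObject::READ_CSV | SplFileObject::SKIP_EMPTY);
    $file->setCsvControl("\t");                      // tab as the delimiter
    foreach ($file as $row) {
        if ($row === false || $row === array(null)) {
            continue;                                // skip blank lines
        }
        $row[0] = trim($row[0], '#');                // strip the leading # from the id
        // ... build and run your INSERT here
    }
}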
Just sit and parse.
It's a one-time operation, and looking for the most efficient way makes no sense.
A more or less sane way would be enough.
As a matter of fact, you'll most likely waste more time overall looking for the super-extra-best solution. Say your code runs for an hour. You then spend another hour finding a solution that runs 30% faster. You'll have spent 1.7 hours vs. 1.
I have a couple hundred single words that are identified in a foreach routine and pushed into an array.
I would like to check each one (word) to see if it exists in an existing txt file that is a single column of 200k+ lines.
(Similar to a huge "bad word" routine, I guess, but in the end this will add to the "filter" file.)
I don't know whether I should do this with preg_match in the loop, or whether I should combine the arrays somehow and use array_unique.
I would like to add the ones not found to the main file as well. I'm also using flock in an attempt to avoid any multi-access issues.
Is this a pipe dream? Well, it is for this beginner. My attempts have timed out at 30 seconds.
Stackoverflow has been such a great resource. I don't know what I would do without it. Thanks in advance either way.
Sorry, but that sounds like a REALLY AWFUL APPROACH!
Doing a whole scan (of a table, list or whatever) to check whether something already exists is just... wrong.
This is what hashtables are for!
Your case sounds like a classical database job...
If you don't have a database available, you can use a local SQLite file, which will provide the essential functionality.
Let me explain the background...
A lookup of "foo" in a hashtable basically takes O(1) time, that is, a constant amount of time, because your algorithm knows WHERE to look and can see whether it's THERE. Hashmaps can run into ambiguity because of the one-way nature of hashing procedures, which really doesn't matter that much, because the hashmap delivers a few possible matches which can be compared directly (for a reasonable number of elements, like probably the Google index, *laugh*).
So if you want (for some reason) to stay with your text-file approach, consider the following:
Sort your file and insert your data at the right place (alphabetically would be the most intuitive approach). Then you can jump from position to position and isolate the area where the word should be. There are several algorithms available, just have a Google, but keep in mind that it takes longer the more data you have; usually your running time will be O(log(n)), where n is the size of the table.
Well, this is all basically just to guide you onto the right track.
You could also shard your data, for example saving every word beginning with "a" in the file a.txt and so on; or split the word into characters, create a folder for every character with the last character as the file, and then check whether the file exists. Those are stupid suggestions, as you will probably run out of inodes on your disk, but they illustrate that you can CHECK for EXISTENCE without having to do a FULL SCAN.
The main thing is that you have to project some search tree onto a reasonable structure (like a database system does automatically for you); the folder example illustrates the basic principle.
This Wikipedia entry might be a good place to start: http://en.wikipedia.org/wiki/Binary_search_tree
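As a rough sketch of the SQLite suggestion above (the database file name and table layout are assumptions; $newWords stands for the array your foreach routine builds):

$db = new PDO('sqlite:' . __DIR__ . '/words.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)');

$insert = $db->prepare('INSERT OR IGNORE INTO words (word) VALUES (?)');
$check  = $db->prepare('SELECT 1 FROM words WHERE word = ?');

foreach ($newWords as $word) {                   // $newWords: the couple hundred words from your routine
    $check->execute(array($word));
    if ($check->fetchColumn() === false) {
        // word was not in the filter yet: handle it, then remember it
        $insert->execute(array($word));
    }
    $check->closeCursor();
}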
If the file is too large, then it is not a good idea to read it all into memory. You can process it line by line:
<?php
$words = array('a', 'b', 'c'); # words to insert, assumed to be unique
$fp = fopen('words.txt', 'r+');
while (!feof($fp))
{
    $line = trim(fgets($fp));
    $key = array_search($line, $words); # is this line one of the words we want to add?
    if ($key !== false)
    {
        unset($words[$key]);            # already present in the file, no need to add it
        if (!$words) break;             # nothing left to add, stop reading
    }
}
# whatever is left in $words was not found, so append it to the end of the file
foreach ($words as $word)
{
    fputs($fp, "$word\n");
}
fclose($fp);
?>
It loops through the entire file, checking to see if the current line (assumed to be a single word) exists in the array. If it does, that element is removed from the array. If there is nothing left in the array, then the search stops. After cycling through the file, if the array is not empty, it adds each of them to the file.
(File locking and error handling are not implemented in this example.)
Note that this is a very bad way to store this data (file based, unsorted, etc). Even sqlite would be a big improvement. You could always simply write an exporter to .txt if you needed it in plain text.
I have an XML feed that I have to check periodically for updates. The XML consists of many elements, and I'm trying to figure out the best (and probably fastest) way to find out which elements have been updated since the last time I checked.
What I'm thinking of is to first check lastBuildDate for modifications, and if it differs from the previous one, to start parsing the XML again. This would involve keeping each element with all of its attributes in my database. But each element can have a different number of attributes, as well as other nested elements. So if I were to store each element in my database, what would be the best way to keep them?
That's why I'm asking for your help :) Thank you.
Most modern databases will store your XML as a blob if you like. (You tagged PHP... MySQL? If so, use MEDIUMTEXT.) Store your XML and generate a diff when you get a new one. If you don't have an XML diff tool, canonicalize both XML listings then run a text diff.
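A minimal sketch of that canonicalize-then-compare idea using DOMDocument::C14N() (where the old feed comes from, e.g. a MEDIUMTEXT column, is an assumption):

$old = new DOMDocument();
$old->loadXML($previousFeedXml);                 // the previously stored copy of the feed
$new = new DOMDocument();
$new->loadXML($currentFeedXml);                  // the feed you just fetched

$oldCanonical = $old->C14N();                    // canonical form: stable attribute order, whitespace, etc.
$newCanonical = $new->C14N();

if ($oldCanonical !== $newCanonical) {
    // something changed: write both canonical strings to temp files and run a text diff,
    // or walk the elements to work out which ones differ
}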