I have files I need to convert into a database. These files (I have over 100k) are from an old system (generated from a COBOL script). I am now part of the team that migrate data from this system to the new system.
Now, because we have a lot of files to parse (each files is from 50mb to 100mb) I want to make sure I use the right methods in order to convert them to sql statement.
Most of the files have these following format:
#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
the address2 is optional and can be empty
or
#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
these are the 2 most common lines (I'll say around 50%), other than these, all the line looks the same but with different information.
Now, my question is what should I do to open them to be as efficient as possible and parse them correctly?
Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can output into SQL commands which you can also insert via a command line.
If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Output is an array which you can parse into your insert. One of the first user-provided examples takes CSV and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine.
If the id# in the first column is unique in your input data, then you should definitely insert this into a primary key in mysql, to save you from duplicating data if you have to restart your batch.
When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, sql), we had a big gain in performance using the function preg_match_all(less than 10% of the time required using explode / trims / formatting).
Huge files (>100Mb) are parsed in 2 or 3 minutes in a core 2 duo (the drawback is that memory consumption is very high since it creates a giant array with all the information ready to be synthesized).
Regular expressions allow you to identify the content of line if you have variations within the same file.
But if your files are simple, try ghoti suggestion (fgetscv), will work fine.
If you're already familiar with PHP then using it is a perfectly fine tool.
If records do not span multiple lines, the best way to do this to guarantee that you won't run out of memory will be to process one line at a time.
I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.
If you can use the CSV features and you use the SPL, make sure to set your options correctly for the tab characters.
You can use trim to remove the # from the first and last fields easily enough after the call to fgetcsv
Just sit and parse.
It's one-time operation and looking for the most efficient way makes no sense.
Just more or less sane way would be enough.
As a matter of fact, most likely you'll waste more overall time looking for the super-extra-best solution. Say, your code will run for a hour. You will spend another hour to find a solution that runs 30% faster. You'll spend 1,7 hours vs. 1.
Related
I'm currently attempting to come up with a solution to the following problem:
I have been tasked with parsing large (+-3500 Lines 300kb) pipe delimited text files and comparing them line by line to corresponding codes within our database. An example of a file would be:
File name: 015_A.txt
File content (example shows only 4 lines):
015|6999|Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.|1|1|0|0|2016/01/01
015|3715|It has roots in a piece of classical Latin literature from 45 BC|1|1|213.5|213.5|2016/01/01
015|3724|Making it over 2000 years old.|1|1|617.4|617.4|2016/01/01
015|4028|Words will go here.|1|1|74|74|2016/01/01
I will be providing a web interface which I have already built to allow a file to be selected from the browser and then uploaded to the server.
Using the above example pipe file I will only be using these:
Code (using above line 1 as an example: 6999)
Price (using above line 1 as an example: 0)
I would then (to my mind not sure if this is best method) need to run a query (our DB is MSSQL) for each line, example:
SELECT t.Price
FROM table t
WHERE t.code = '6999'
If t.Price === 0 then line 1 has passed. As it is equal to the source file.
This is where I believe I just needed to ask some advice as I am sure there are many ways to tackle this problem, I would just like to, if possible be pointed in the direction of doing this in an efficient manner. (Example best method of parsing the file? Do I run a query per code or rather do a SQL statement using an IN clause and then compare every code and price? Should I scrap this idea and use some form of pure SQL tool bearing in mind I have pipe file to deal with / import.)
Any advice would be greatly appreciated.
Your story appears to end somewhat prematurely. Is the only thing this script should do is check the values in the database match the files in the file? If so, it would be simpler just to extract the data from the database and overwrite the file. If not, then this implies you need to retain some record of the variations.
This has some bearing on the approach taken to the reconcilliation; running 3500 queries against the database is going to take some time - mostly spent on the network and in query parsing (i.e. wasted). OTOH comparing 3500 records in a single SELECT to find mismatches will take no time at all.
The problem is that your data is out at the client and uploading via a browser only gets it halfway to the database. If you create another table on the database (not a temporary table - add a column to represent the file) it is possible to INSERT multiple rows in a single DML statement, but really you should batch them up in lots of 100 or so records, meaning you only need to execute 36 queries to complete the operation - and you've got a record of the data in the database which simplifies how you report the mismatches.
You probably should not use the DBMS supplied utilities for direct import unless you ABSOLUTELY trust the source data.
I was just wondering which is considered to be more efficient in PHP.
I need to loop through MySQL results and output it to a file using a specific format, I am interested to know whether it's more efficient to create a huge string (I'm going to be going through over 100,000 rows) and write the string to a file, or write (append) what I want to add to the file using fwrite, as I'm going through the results?
From the outset, I believe using a very long string will be more efficient than keep writing to a file but I am wondering whether PHP is going to have any issues with processing a very long string?
Thanks.
PHP wont have an issue with creating a very long string, it will however use a lot of memory. As long as you have enough memory available then this won't be an issue. Reading and writing from memory is much faster than writing to disk so going down the memory route could be an option although with 100,000 rows, you could split it up and get the best of both worlds.
What I mean by this is every 10,000 rows append your string into the file and then empty the string, this way, you'll only be writing to file 10 times and storing a smaller amount of data in memory.
If you create a large string, try to collect all values into an array and join it together before writing. Strings are immutable, if you just append values to a String, this will lead to bad performance.
Best solution would be to collect for example 1000 results, write this part to a file, collet another 1000 results and so on.
There are quite a few different threads about this similar topic, yet I have not been able to fully comprehend a solution to my problem.
What I'd like to do is quite simple, I have a flat-file db, with data stored like this -
$username:$worldLocation:$resources
The issue is I would like to have a submit data html page that would update this line based upon a search of the term using php
search db for - $worldLocation
if $worldLocation found
replace entire line with $username:$worldLocation:$updatedResources
I know there should be a fairly easy way to get this done but I am unable to figure it out at the moment, I will keep trying as this post is up but if you know a way that I could use I would greatly appreciate the help.
Thank you
I always loved c, and functions that came into php from c.
Check out fscanf and fprintf.
These will make your life easier while reading writing in a format. Like say:
$filehandle = fopen("file.txt", "c");
while($values = fscanf($filehandle, "%s\t%s\t%s\n")){
list($a, $b, $c) = $values;
// do something with a,b,c
}
Also, there is no performance workaround for avoiding reading the entire file into memory -> changing one line -> writing the entire file. You have to do it.
This is as efficient as you can get. Because you most probably using native c code since I read some where that php just wraps c's functions in these cases.
You like the hard way so be it....
Make each line the same length. Add space, tab, capital X etc to fill in the blanks
When you want to replace the line, find it and as each line is of a fixed length you can replace it.
For speed and less hassle use a database (even SQLLite)
If you're committed to the flat file, the simplest thing is iterating through each line, writing a new file & changing the one that matches.
Yeah, it sucks.
I'd strongly recommend switching over to a 'proper' database. If you're concerned about resources or the complexity of running a server, you can look into SQLite or Berkeley DB. Both of these use a database that is 'just a file', removing the issue of installing and maintaining a DB server, but still you the ability to quickly & easily search, replace and delete individual records. If you still need the flat file for some other reason, you can easily write some import/export routines.
Another interesting possibility, if you want to be creative, would be to look at your filesystem as a database. Give each user a directory. In each directory, have a file for locations. In each file, update the resources. This means that, to insert a row, you just write to a new file. To update a file, you just rewrite a single file. Deleting a user is just nuking a directory. Sure, there's a bit more overhead in slurping the whole thing into memory.
Other ways of solving the problem might be to make your flat-file write-only, since appending to the end of a file is a trivial operation. You then create a second file that lists "dead" line numbers that should be ignored when reading the flat file. Similarly, you could easily "X" out the existing lines (which, again, is far easier than trying to update lines in a file that might not be the same length) and append your new data to the end.
Those second two ideas aren't really meant to be practical solutions as much as they are to show you that there's always more than one way to solve a problem.
ok.... after a few hours work..this example woorked fine for me...
I intended to code an editing tool...and use it for password update..and it did the
trick!
Not only does this page send and email to user (sorry...address harcoded to avoid
posting aditional code) with new password...but it also edits entry for thew user
and re-writes all file info in new file...
when done, it obviously swaps filenames, storing old file as usuarios_old.txt.
grab the code here (sorry stackoverflow got VERY picky about code posting)
https://www.iot-argentina.xyz/edit_flat_databse.txt
Is that what you are location for :
update `field` from `table` set `field to replace` = '$username:$worldlocation:$updatesResources' where `field` = '$worldLocation';
I have a function to write a text file based on the form settings, a rather large form.
Shortly, I want to compare the output of a function to a single file, and only do execution (rewriting the file) if the destination file is different from the output. As you guess, it is a performance concern.
Is it doable, BTW?
The process is, I fill up some forms:
A single file is written to contain some "specific" selected options
Some "non-specific" options do not necessarily write anything to the file.
The form is updateable anytime, so the content of the file may grow or shrink based on different options.
It only needs a rewrite to the file if I am at point #1.
When at point #2, nothing should be written.
This is what I tried:
if ($output != file_get_contents($filepath)) {
// save the data
}
But I felt so much delay of execution in this.
I found a almost similar issue here: Can I use file_get_contents() to compare two files?, but my issue is different. Mine is comparing the result of the process to an already existing file which simply the result of the process previously. And only rewrite the file if they are different.
No sensitive data on the form, btw.
Any hint is very much appreciated.
Thanks
To compare a whole file with a string (I suppose it's a string, isn't it?) the only way is to read whole file and do comparison. To improve performance you can read file line by line and stop at first different line, as Explosion Pills said before me.
If your file is really big, and you want to improve performance further, you can do some hashing stuff:
Generate the output, let's say $output.
Calculate md5($output) and store in $output_md5.
Compare $output_md5 with a stored one, let's say in file output.md5.
Are they equal?
If yes, do nothing.
If not, save $output into output.txt and $output_md5 in output.md5.
Rather than load the entire file into memory, it may be faster to read it line-by-line (fgets) and compare it to the input string also line-by-line. You could even go as small as character-by-character, but I think that's overkill.
You could always try a combination of what was in the other post, the sha1_file($file) function, with the sha1($string) function, and check the equality of that.
I'm using php to take xml files and convert them into single line tab delimited plain text with set columns (i.e. ignores certain tags if database does not need it and certain tags will be empty). The problem I ran into is that it took 13 minutes to go through 56k (+ change) files, which I think is ridiculously slow. (average folder has upwards of a million xml files) I'll probably cronjob it overnight anyways, but it is completely untestable at a reasonable pace while I'm at work for things like missing files and corrupt files and such.
Here's hoping someone can help me make the thing faster, the xml files themselves are not too big (<1k lines) and I don't need every single data tag, just some, here's my data node method:
function dataNode ($entries) {
$out = "";
foreach ($entries as $e) {
$out .= $e->nodeValue."[ATTRIBS]";
foreach ($e->attributes as $name => $node)
$out .= $name."=".$node->nodeValue;
}
return $out;
}
where $entries is a DOMNodeList generated from XPath queries for the nodes I need. So the question is, what is the fastest way to go to a target data node or nodes (if I have 10 keyword nodes from my XPath query then I need all of them to be printed from that function) and output the nodevalue and all it's attributes?
I read here that iterating through a DOMNodeList isn't constant time but I can't really use the solution given because a sibling to the node I want might be one that I don't need or need to call a different format function before I write it to file and I really don't want to run the node through a gigantic switch statement for every iteration trying to format out the data.
Edit: I'm an idiot, I had my write function inside my processing loop so every iteration it had to reopen the file I was writing to, thanks for both of your help, I'm trying to learn XSLT right now as it seems very useful.
A comment would be a little short, so I write it as an answer:
It's hard to say where actually your setup can benefit from optimizing. Perhaps it's possible to join multiple of your many XML files together before loading.
From the information you give in your question I would assume that it's more the disk operations that are taking the time than the XML parsing. I found DomDocument and Xpath quite fast even on large files. An XML file with up to 60 MB takes about 4-6 secs to load, a file of 2MB only a fraction.
Having many small files (< 1k) would mean a lot of work on the disk, opening / closing files. Additionally, I have no clue how you iterate over directories/files, sometimes this can be speed up dramatically as well. Especially as you say that you have millions of file nodes.
So perhaps concatenating/merging files is an option for you which can be run quite safe so to reduce the time to test your converter.
If you encounter missing or corrupt files, you should create a log and catch these errors. So you can let the job run through and check for errors later.
Additionally, if possible, you can try to make your workflow resumeable. E.g. if an error occurs, the current state is saved and next time you can continue at this state.
The suggestion above in a comment to run an XSLT on the files is a good idea as well to transform them first. Having a new layer in the middle to transpose data can help to reduce the overall problem dramatically as it can reduce complexity.
This workflow on XML files has helped me so far:
Preprocess the file (plain text filters, optional)
Parse the XML. That's loading into DomDocument, XPath iterating etc.
My Parser sends out events with the parsed data if found.
The Parser throws a specific exception if data is encountered that is not in the expected format. That allows to realize errors in the own parser.
Every other errors are converted to Exceptions as well.
Exceptions can be caught and operations finished. E.g. go to next file etc.
Logger, Resumer and Exporter (file-export) can hook onto the events. Sort of the visitor pattern.
I've build such a system to process larger XML files which formats change. It's flexible enough to deal with changes (e.g. replace the parser with a new version while keeping logging and exporting). The event system really pushed it for me.
Instead of a gigantic switch statement I normally use a $state variable for the parsers state while iterating over a domnodelist. $state can be handy to resume operations later. Restore the state and go to the last known position, then continue.