I'm currently attempting to come up with a solution to the following problem:
I have been tasked with parsing large (roughly 3,500 lines, ~300 KB) pipe-delimited text files and comparing them line by line to corresponding codes within our database. An example of a file would be:
File name: 015_A.txt
File content (example shows only 4 lines):
015|6999|Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.|1|1|0|0|2016/01/01
015|3715|It has roots in a piece of classical Latin literature from 45 BC|1|1|213.5|213.5|2016/01/01
015|3724|Making it over 2000 years old.|1|1|617.4|617.4|2016/01/01
015|4028|Words will go here.|1|1|74|74|2016/01/01
I will be providing a web interface which I have already built to allow a file to be selected from the browser and then uploaded to the server.
Using the above example pipe file, I will only be using these fields:
Code (using line 1 above as an example: 6999)
Price (using line 1 above as an example: 0)
I would then (to my mind, though I'm not sure this is the best method) need to run a query against our database (MSSQL) for each line, for example:
SELECT t.Price
FROM table t
WHERE t.code = '6999'
If t.Price equals 0, then line 1 has passed, as it matches the price in the source file.
This is where I felt I needed to ask for advice, as I am sure there are many ways to tackle this problem; I would just like to be pointed in the direction of an efficient approach. For example: what is the best method of parsing the file? Do I run a query per code, or rather a single SQL statement using an IN clause and then compare every code and price? Should I scrap this idea and use some form of pure SQL tool, bearing in mind I have a pipe-delimited file to deal with / import?
Any advice would be greatly appreciated.
Your story appears to end somewhat prematurely. Is checking that the values in the database match the values in the file the only thing this script should do? If so, it would be simpler just to extract the data from the database and overwrite the file. If not, then this implies you need to retain some record of the variations.
This has some bearing on the approach taken to the reconciliation; running 3500 queries against the database is going to take some time, mostly spent on the network and in query parsing (i.e. wasted). On the other hand, comparing 3500 records in a single SELECT to find mismatches will take no time at all.
The problem is that your data is out at the client, and uploading via a browser only gets it halfway to the database. If you create another table in the database (not a temporary table; add a column to identify the file), it is possible to INSERT multiple rows in a single DML statement. Batch them up in lots of around 100 records, meaning you only need to execute around 36 queries to complete the operation, and you then have a record of the data in the database, which simplifies how you report the mismatches.
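For illustration, here is a minimal, untested PHP sketch of that approach. The staging_prices table, its columns, the placeholder table name prices, and the PDO sqlsrv connection details are all assumptions (not the poster's schema), and the price is assumed to be the seventh pipe-delimited field:

<?php
// Rough sketch only: batch the pipe file into a staging table, then find
// mismatches with one query. Table/column names are assumptions.
$pdo = new PDO('sqlsrv:Server=localhost;Database=mydb', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$fileName = '015_A.txt';
$batch = [];
$fh = fopen($fileName, 'r');
while (($line = fgets($fh)) !== false) {
    $fields = explode('|', trim($line));
    if (count($fields) < 8) {
        continue; // skip malformed lines
    }
    // Field 1 is the code; assuming the seventh field is the price to compare.
    $batch[] = [$fileName, $fields[1], $fields[6]];
    if (count($batch) === 100) {
        insertBatch($pdo, $batch);
        $batch = [];
    }
}
fclose($fh);
if ($batch) {
    insertBatch($pdo, $batch);
}

// One multi-row INSERT per 100 lines instead of one query per line.
function insertBatch(PDO $pdo, array $rows): void
{
    $placeholders = implode(',', array_fill(0, count($rows), '(?,?,?)'));
    $stmt = $pdo->prepare(
        "INSERT INTO staging_prices (file_name, code, price) VALUES $placeholders"
    );
    $stmt->execute(array_merge(...$rows));
}

// A single SELECT then reports every mismatch for this file
// ('prices' stands in for the table queried in the question).
$stmt = $pdo->prepare(
    'SELECT s.code, s.price AS file_price, t.Price AS db_price
       FROM staging_prices s
       JOIN prices t ON t.code = s.code
      WHERE s.file_name = ? AND t.Price <> s.price'
);
$stmt->execute([$fileName]);
$mismatches = $stmt->fetchAll(PDO::FETCH_ASSOC);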
You probably should not use the DBMS-supplied utilities for direct import unless you ABSOLUTELY trust the source data.
Related
I am working on a genealogy project and I am trying to read the GEDCOM (*.ged) file format.
Instead of using columns, the format uses lines: each line starts with a level number, followed by a tag and its associated value, with child nodes sitting under their parent. It is a very simple numbering system: 0 marks a root node, 1, 2, 3... mark the nested sub-nodes, then another 0 starts the next root node, and so on.
The problem I'm having is that I have placed variables as checkpoints indicating which part/section of the file the program is currently in (head / submission / individual), in order to narrow down what part of the file it is processing at any given point. But one of the child tags (specifically DATE) appears inside both INDI-BIRT-DATE and INDI-CHAN-DATE, and the parser therefore fails to differentiate them and parse the correct date at the intended checkpoint.
Unlike the head and submission records, there are multiple individuals (INDI), and it is therefore difficult to write code that matches the scenario.
preg_match('/^\d.(DATE)\s(.*)/i', '2 DATE Jan 01 2022', $matches); matches both instances, and therefore my birth date is overwritten by the last match (the CHAN date).
What I would like to know is how to track the depth as a variable, so that I can decide what level the program is currently at and limit which matches and which code execute to that depth.
Update: I moved a couple of checkpoints inside so they could only be opened once per individual, and closed when they are not needed.
I have created an image showing how to create/read a GED file in PHP.
The code assumes that it has a file open (fopen) and is reading line by line (a while/fgets loop). Instead of reading the entire file into memory (arrays), it's best to store it all in a relational database; that code can be written in place of the first die() call. It's also worth mentioning that the programs that create these GED files can put hidden characters at the front of a line, such as a zero-width no-break space (the Unicode BOM code point).
So use the following: $line = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', trim($line));
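To make the depth idea concrete, here is a small, untested sketch; the $context stack, the file name, and the tag checks are illustrative assumptions, not the poster's actual code:

<?php
// Sketch: keep a stack of tags keyed by level so a DATE line can be
// attributed to BIRT rather than CHAN.
$fh = fopen('family.ged', 'r');
$context = [];          // level => tag, e.g. [0 => 'INDI', 1 => 'BIRT']
$birthDate = null;
while (($line = fgets($fh)) !== false) {
    // Strip the BOM-like characters mentioned above, then split the line.
    $line = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', trim($line));
    if (!preg_match('/^(\d+)\s+(\S+)(?:\s+(.*))?$/', $line, $m)) {
        continue;
    }
    [, $level, $tag, $value] = array_pad($m, 4, '');
    $level = (int) $level;
    // At level 0 a record line looks like "0 @I1@ INDI": the pointer comes
    // before the tag, so swap them for context tracking.
    if ($level === 0 && $value !== '' && $tag[0] === '@') {
        [$tag, $value] = [$value, $tag];
    }
    // Forget everything deeper than the current level, then record this tag.
    $context = array_slice($context, 0, $level, true);
    $context[$level] = $tag;
    // A DATE under INDI -> BIRT is a birth date; under CHAN it is not.
    if ($tag === 'DATE'
        && ($context[0] ?? '') === 'INDI'
        && ($context[1] ?? '') === 'BIRT') {
        $birthDate = $value;
    }
}
fclose($fh);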
I have been trying for a while now to import this kind of file into phpMyAdmin, and also with an external PHP LOAD DATA INFILE script, however I can't seem to get the result I would like. I'm not sure whether I am entering the correct data in the format-specific options.
Columns separated with: (space character)
Columns enclosed with: -
Columns escaped with: -
Lines terminated with: auto (\n)
Could someone suggest what I should do? The snippet of text below is what the text file looks like (without the bullet points).
Perhaps phpmyadmin is not the way to go?
I would show you guys pictures but I don't have the reputation...yet.
If this helps, I have a link from where I got the dataset:
- https://snap.stanford.edu/data/web-Amazon.html
The layout is shown in the link. I would use Python if need be, though I don't have experience with that language; I'm willing to use the parser which is provided at the link (but wouldn't know where to start).
-product/productId- B000068VBQ
-product/title- Fisher-Price Rescue Heroes: Lava Landslide
-product/price- 8.88
-review/userId- unknown
-review/profileName- unknown
-review/helpfulness- 11/11
-review/score- 2.0
-review/time- 1042070400
-review/summary- Requires too much coordination
-review/text- I bought this software for my 5 year old. He has a couple of the other RH software games..
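As a rough, untested sketch (not from any posted answer): in the linked dataset each line is of the form "key: value" and records appear to be separated by blank lines, so the blocks could be read in PHP along these lines. The reviews table, its columns, and the file name are placeholders:

<?php
// Sketch only: parse "key: value" blocks separated by blank lines and
// insert each record into MySQL. Table/column names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=amazon;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $pdo->prepare(
    'INSERT INTO reviews (product_id, title, price, user_id, score, summary, text)
     VALUES (?, ?, ?, ?, ?, ?, ?)'
);

$record = [];
$fh = fopen('reviews.txt', 'r');   // placeholder file name
while (($line = fgets($fh)) !== false) {
    $line = trim($line);
    if ($line === '') {                       // blank line: end of one record
        if ($record) {
            saveRecord($stmt, $record);
            $record = [];
        }
        continue;
    }
    // Each line looks like "product/title: Fisher-Price ...".
    [$key, $value] = array_pad(explode(':', $line, 2), 2, '');
    $record[trim($key)] = trim($value);
}
if ($record) {
    saveRecord($stmt, $record);               // last record in the file
}
fclose($fh);

function saveRecord(PDOStatement $stmt, array $r): void
{
    $stmt->execute([
        $r['product/productId'] ?? null,
        $r['product/title']     ?? null,
        $r['product/price']     ?? null,
        $r['review/userId']     ?? null,
        $r['review/score']      ?? null,
        $r['review/summary']    ?? null,
        $r['review/text']       ?? null,
    ]);
}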
For a project, I need to get some word definitions into a database. All the definitions can be found in multiple DB files, but the DB files that I got are for a C language program and are in ASCII form (I believe). I need to somehow parse through the files line by line and add the data into a MySQL database.
I would prefer using PHP and/or MySQL.
I tried writing a PHP script to go through and do it, but it times out, is intensive on my system, and in most cases doesn't complete.
I heard about LOAD DATA INFILE from MySQL but have no clue how to use it with this.
The file names change for each file and do not have a specific extension; however, all of them can be read as text, and I am sure they are all the same in terms of content.
I uploaded the contents of one file here.
You can see that some lines are useless, but the lines starting with { are good, and the pattern is essentially: the first word is the dictionary term, the content within ( ) is the definition, and the parts within " " are sample sentences.
All I need to extract are the terms, definitions and sentences.
The definitions are provided by Princeton University and the license is open source (and I will be crediting them).
Unless you want to reinvent the wheel I would go with something like wordnet2sql. It will output an SQL script that you can use to create your MySQL tables.
You can find the database specifications on Princeton's website.
LOAD DATA is useful for CSV files but not so much for special database formats.
I have files I need to convert into a database. These files (I have over 100k of them) are from an old system (generated by a COBOL script). I am now part of the team that migrates data from this system to the new system.
Now, because we have a lot of files to parse (each file is from 50 MB to 100 MB), I want to make sure I use the right methods to convert them to SQL statements.
Most of the files have these following format:
#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
the address2 is optional and can be empty
or
#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
These are the two most common line formats (I'd say around 50% of the lines); other than these, all the lines look similar but carry different information.
Now, my question is: what should I do to open and parse these files as efficiently and correctly as possible?
Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can output into SQL commands which you can also insert via a command line.
If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Output is an array which you can parse into your insert. One of the first user-provided examples takes CSV and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine.
If the id# in the first column is unique in your input data, then you should definitely insert it into a primary key column in MySQL, to save you from duplicating data if you have to restart your batch.
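A minimal, untested sketch of that fgetcsv() approach with a tab delimiter, assuming the first line format described in the question; the clients table, its columns, the file name, and the use of INSERT IGNORE (which simply skips already-loaded ids if the batch is restarted, thanks to the primary key) are assumptions:

<?php
// Sketch: read tab-delimited lines with fgetcsv() and insert each row,
// using the id from column 1 as the primary key.
$pdo = new PDO('mysql:host=localhost;dbname=migration;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO clients
        (id, name, address1, address2, city, state, zip, country)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?)'
);

$fh = fopen('clients.dat', 'r');
while (($row = fgetcsv($fh, 0, "\t")) !== false) {
    // Layout: #id, name, address1, address2, city, state, zip, country, #
    if (count($row) !== 9) {
        continue;                      // only handles the first format described above
    }
    $row[0] = trim($row[0], '#');      // strip the leading # from the id
    $stmt->execute(array_slice($row, 0, 8));
}
fclose($fh);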
When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, SQL), we had a big gain in performance using the function preg_match_all() (less than 10% of the time required using explode / trim / formatting).
Huge files (>100 MB) are parsed in 2 or 3 minutes on a Core 2 Duo (the drawback is that memory consumption is very high, since it creates a giant array with all the information ready to be processed).
Regular expressions allow you to identify the content of each line even if you have variations within the same file.
But if your files are simple, try ghoti's suggestion (fgetcsv); it will work fine.
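For reference, a small sketch of the preg_match_all() idea for the first line format in the question; the pattern, field names, and file name are assumptions, and the whole file is read at once, which is where the high memory use mentioned above comes from:

<?php
// Sketch: match every record of the first format in one preg_match_all() call.
$contents = file_get_contents('clients.dat');

$pattern = '/^#(?P<id>[^\t]*)\t(?P<name>[^\t]*)\t(?P<address1>[^\t]*)\t'
         . '(?P<address2>[^\t]*)\t(?P<city>[^\t]*)\t(?P<state>[^\t]*)\t'
         . '(?P<zip>[^\t]*)\t(?P<country>[^\t]*)\t#\r?$/m';

preg_match_all($pattern, $contents, $matches, PREG_SET_ORDER);

// $matches is now one big array of records, e.g. $matches[0]['name'].
foreach ($matches as $m) {
    printf("%s -> %s, %s\n", $m['id'], $m['name'], $m['country']);
}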
If you're already familiar with PHP then using it is a perfectly fine tool.
If records do not span multiple lines, the best way to do this to guarantee that you won't run out of memory will be to process one line at a time.
I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.
If you can use the CSV features and you use the SPL, make sure to set your options correctly for the tab characters.
You can use trim() to remove the # from the first and last fields easily enough after the call to fgetcsv().
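A short sketch of that SPL approach, assuming a tab delimiter and the first line format from the question; the file name is a placeholder:

<?php
// Sketch: SplFileObject in CSV mode with a tab delimiter, one line at a time.
$file = new SplFileObject('clients.dat');
$file->setFlags(
    SplFileObject::READ_CSV | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY
);
$file->setCsvControl("\t");            // tab-separated, default enclosure/escape

foreach ($file as $row) {
    if (!is_array($row) || count($row) < 2) {
        continue;                                       // skip empty/short lines
    }
    $row[0] = trim($row[0], '#');                       // strip leading # marker
    $row[count($row) - 1] = trim(end($row), '#');       // strip trailing # marker
    // ...validate the fields and build the INSERT for this row here...
}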
Just sit and parse.
It's a one-time operation, and looking for the most efficient way makes no sense.
Any more or less sane way would be enough.
As a matter of fact, you'll most likely waste more overall time looking for the super-extra-best solution. Say your code runs for an hour, and you spend another hour finding a solution that runs 30% faster: you'll have spent 1.7 hours vs. 1.
I need some help with a project of mine. It is about a DVD database. At the moment I am planning to implement a CSV import function to import DVDs, with all their information, from a file.
I will do this in three steps.
Step 1
- show the data I want to import, building arrays
- import the data, building session arrays
Step 2
- edit the information
Step 3
- show the result before the update
- update data
So far it works, but I have a problem with large files. The CSV data has 20 columns (title, genre, plot, etc.), and for each line in the CSV there are some arrays I create to use in the next steps.
When I have more than about 500 lines, the browser often collapses while importing; I get no response.
Anyway, now I am trying to do this as an AJAX call process. The advantage is that I can define how many lines the system handles per call, and the user can see that the system is still working, like a status bar when down-/uploading a file.
At the moment I am trying to find a useful example illustrating how I can do this, but I have not been able to find anything useful so far.
Maybe you have some tips or an example of how this could work, say processing 20 lines per call and building the array.
Afterwards I would like to use the same function to build the session arrays used in the next step, and so on.
Some information:
I use fgetcsv() to read the rows from the file. I go through the rows, and for each column I have different queries, like: is the item id unique, does the title exist, does the description exist, etc.
So if one of these values is missing, I get an error telling me in which row and column the error occurs.
I'd appreciate any help from you.
Use the 'LOAD DATA INFILE' syntax. I've used it on files upwards of 500 MB with 3 million rows, and it takes seconds, not minutes.
http://dev.mysql.com/doc/refman/5.0/en/load-data.html
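A hedged example of what that could look like for the 20-column CSV described above, run here through PDO; the table name, file path, and CSV options are assumptions, and PDO::MYSQL_ATTR_LOCAL_INFILE must be enabled for LOCAL INFILE:

<?php
// Sketch: load the whole DVD CSV into an import table in one statement.
// Adjust the terminators/enclosure to match the actual file.
$pdo = new PDO(
    'mysql:host=localhost;dbname=dvds;charset=utf8mb4',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]   // needed for LOCAL INFILE via PDO
);

// The dvd_import table is assumed to have columns in the same order as the CSV.
$pdo->exec(
    "LOAD DATA LOCAL INFILE '/tmp/dvds.csv'
     INTO TABLE dvd_import
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     IGNORE 1 LINES"
);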
While this is not the direct answer you were looking for:
500 lines shouldn't take too long to process, so here's another thought for you.
Create a temporary table with the right structure of fields.
You can then extract from it, using SELECT statements, the various unique entries for the plot, genre, etc., rather than building a bunch of arrays along the way.
A MySQL import of your data into that table would be very fast.
You can then edit it as required, and finally insert into your final table the data from your temporary, but now validated, table.
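A rough sketch of that flow, issued from PHP here for consistency; all table and column names are assumptions:

<?php
// Sketch of the temporary-table flow described above.
$pdo = new PDO('mysql:host=localhost;dbname=dvds;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Pull the distinct genres out of the import table instead of building arrays.
$genres = $pdo->query('SELECT DISTINCT genre FROM dvd_import')
              ->fetchAll(PDO::FETCH_COLUMN);

// After validating/editing dvd_import, copy it into the final table in one statement.
$pdo->exec(
    'INSERT INTO dvds (title, genre, plot)
     SELECT title, genre, plot FROM dvd_import'
);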
In terms of doing it with AJAX, you would have to use a repeating timed event to refresh the status; the problem is that rather than 20 lines, it would need to be based on a specific time period, as your browser has no way of knowing how far along the server is, assuming the CSV is uploaded and you then process it in 20-line chunks.
If you paste the CSV into a big textbox, you could work on it by taking the first 20 lines and passing the remainder on to the next page, etc., but that strikes me as a potential mess.
So, while I know I've not answered your question directly, I hope I have given you food for thought about alternative and possibly more practical approaches.
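For what it's worth, a minimal server-side sketch of the 20-lines-per-call idea from the question; the endpoint name, session keys, and the JSON response shape are all assumptions. The client would call it repeatedly, passing back the offset it received, until done is true, updating a progress bar between calls:

<?php
// import_chunk.php: hypothetical endpoint processing 20 CSV lines per call.
// The uploaded file path is assumed to be stored in the session beforehand.
session_start();
header('Content-Type: application/json');

$chunkSize = 20;
$offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;

$fh = fopen($_SESSION['csv_path'], 'r');
// Skip the lines handled by previous calls.
for ($i = 0; $i < $offset && fgetcsv($fh) !== false; $i++) {
}

$errors = [];
$processed = 0;
while ($processed < $chunkSize && ($row = fgetcsv($fh)) !== false) {
    // ...validate the 20 columns here and record any problems in $errors,
    // noting the row/column as the question describes...
    $_SESSION['rows'][] = $row;
    $processed++;
}
$done = feof($fh);
fclose($fh);

echo json_encode([
    'offset' => $offset + $processed,
    'done'   => $done,
    'errors' => $errors,
]);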