I am trying to parse XML files to store data in a database. I have written the PHP code below and it runs successfully.
The problem is that it takes around 8 minutes to read one complete file (which is around 30 MB), and I have to parse around 100 files every hour.
So my current code is clearly of no use to me. Can anybody suggest a better solution, or should I switch to another language?
From what I have found online, I could do this with Perl/Python, or with something called XSLT (which, frankly, I am not so sure about).
$xml = new XMLReader();
$xml->open($file);

// advance to the first <node1> element
while ($xml->read() && $xml->name !== 'node1');

while ($xml->name === 'node1') {
    // load the whole <node1> subtree into SimpleXML
    $node = new SimpleXMLElement($xml->readOuterXML());
    foreach ($node->node2 as $node2) {
        //READ
    }
    $xml->next('node1');
}
$xml->close();
Here's an example of a script I used to parse the WURFL XML database found here.
I used Python's ElementTree module and wrote out a JavaScript array, although you can easily modify the script to write a CSV instead (just change the final three lines).
import xml.etree.ElementTree as ET

tree = ET.parse('C:/Users/Me/Documents/wurfl.xml')
root = tree.getroot()

dicto = {}  # to store the data
for device in root.iter("device"):  # parse out the device objects
    dicto[device.get("id")] = [0, 0, 0, 0]  # set up a list to store the needed variables
    for child in device:  # iterate through each device
        if child.get("id") == "product_info":  # find the product_info id
            for grand in child:
                if grand.get("name") == "model_name":  # and the model_name id
                    dicto[device.get("id")][0] = grand.get("value")
                    dicto[device.get("id")][3] += 1
        elif child.get("id") == "display":  # and the display id
            for grand in child:
                if grand.get("name") == "physical_screen_height":
                    dicto[device.get("id")][1] = grand.get("value")
                    dicto[device.get("id")][3] += 1
                elif grand.get("name") == "physical_screen_width":
                    dicto[device.get("id")][2] = grand.get("value")
                    dicto[device.get("id")][3] += 1
    if dicto[device.get("id")][3] != 3:  # make sure I had enough
        # otherwise it's an incomplete dataset
        del dicto[device.get("id")]

arrays = []
for key in dicto.keys():  # sort this all into another list
    arrays.append(key)
arrays.sort()  # and sort it alphabetically

with open('C:/Users/Me/Documents/wurfl1.js', 'w') as new:  # now to write it out
    for item in arrays:
        new.write('{\n id:"'+item+'",\n Product_Info:"'+dicto[item][0]+'",\n Height:"'+dicto[item][1]+'",\n Width:"'+dicto[item][2]+'"\n},\n')
Just counted this as I ran it again - took about 3 seconds.
In Perl you could use XML::Twig, which is designed to process huge XML files (bigger than can fit in memory).
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $file = shift @ARGV;

XML::Twig->new( twig_handlers => { 'node1/node2' => \&read_node } )
         ->parsefile( $file );

sub read_node {
    my ( $twig, $node2 ) = @_;
    # your code; the whole node2 string is $node2->sprint
    $twig->purge;  # if you want to reduce the memory footprint
}
You can find more info about XML::Twig at xmltwig.org
In the case of Python I would recommend using lxml.
Since you are having performance problems, I would recommend iterating through your XML and processing it part by part; this saves a lot of memory and is likely to be much faster.
On an old server I read a 10 MB XML file within 3 seconds; your situation might be different.
About iterating with lxml: http://lxml.de/tutorial.html#tree-iteration
Review this line of code:
$node = new SimpleXMLElement($xml->readOuterXML());
The documentation for readOuterXML has a comment that it sometimes attempts to reach out for namespaces etc. Either way, I would suspect a big performance problem here.
Consider using readInnerXML() if you can.
I'm looking for an algorithm strategy. I have a CSV file with 162 columns and 55,000 lines.
I want to order the data by a date (which is in column 3).
First I tried to put everything directly into an array, but memory explodes.
So I decided to:
1/ Put the first 3 columns in an array.
2/ Order this array with usort.
3/ Read the CSV file to recover the other columns.
4/ Add the complete line to a new CSV file.
5/ Replace the line with an empty string in the CSV file that was read.
//First read of the file
while (($data = fgetcsv($handle, 0, ';')) !== false)
{
    $tabLigne[$columnNames[0]] = $data[0];
    $tabLigne[$columnNames[1]] = $data[1];
    $tabLigne[$columnNames[2]] = $data[2];

    $dateCreation = DateTime::createFromFormat('d/m/Y', $tabLigne['Date de Création']);
    if ($dateCreation !== false)
    {
        $tableauDossiers[$row] = $tabLigne;
    }
    $row++;
    unset($data);
    unset($tabLigne);
}
//Order the array by date
usort(
    $tableauDossiers,
    function ($x, $y) {
        $date1 = DateTime::createFromFormat('d/m/Y', $x['Date de Création']);
        $date2 = DateTime::createFromFormat('d/m/Y', $y['Date de Création']);
        return $date1->format('U') > $date2->format('U');
    }
);
fclose($handle);
copy(PATH_CSV.'original_file.csv', PATH_CSV.'copy_of_file.csv');

for ($row = 3; $row <= count($tableauDossiers); $row++)
{
    $handle = fopen(PATH_CSV.'copy_of_file.csv', 'c+');
    $tabHandle = file(PATH_CSV.'copy_of_file.csv');

    fgetcsv($handle);
    fgetcsv($handle);
    $rowHandle = 2;

    while (($data = fgetcsv($handle, 0, ';')) !== false)
    {
        if ($tableauDossiers[$row]['Caisse Locale Déléguée'] == $data[0]
            && $tableauDossiers[$row]['Date de Création'] == $data[1]
            && $tableauDossiers[$row]['Numéro RCT'] == $data[2])
        {
            fputcsv($fichierSortieDossier, $data, ';');

            $tabHandle[$rowHandle] = str_replace("\n", '', $tabHandle[$rowHandle]);
            file_put_contents(PATH_CSV.'copy_of_file.csv', $tabHandle);
            unset($tabHandle);
            break;
        }
        $rowHandle++;
        unset($data);
        unset($tabLigne);
    }
    fclose($handle);
    unset($handle);
}
This algorithm works, but it takes far too long to execute.
Any idea how to improve it?
Thanks
Assuming you are limited to PHP and cannot use a database as suggested in the comments, the next best option is an external sorting algorithm:
1. Split the file into small files. The files should be small enough to sort in memory.
2. Sort each of these files individually in memory.
3. Merge the sorted files into one big file by comparing the first lines of each file.
The merging of the sorted files can be done very memory-efficiently: you only ever need the first line of each file in memory, and the line with the smallest timestamp goes to the resulting file.
For really big files you can cascade the merging, i.e. if you have 10,000 files you can merge groups of 100 files first and then merge the resulting 100 files. A PHP sketch of the merge step follows the example below.
Example
For readability I use commas to separate values instead of line breaks.
The unsorted file (imagine it to be too big to fit into memory):
1, 6, 2, 4, 5, 3
Split the files in parts that are small enough to fit into memory:
1, 6, 2
4, 5, 3
Sort them individually:
1, 2, 6
3, 4, 5
Now merge:
Compare 1 & 3 → take 1
Compare 2 & 3 → take 2
Compare 6 & 3 → take 3
Compare 6 & 4 → take 4
Compare 6 & 5 → take 5
Take 6.
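As a rough illustration, here is a minimal PHP sketch of that merge step (not the poster's code). It assumes the chunk files are already individually sorted, use ';' as the delimiter, and carry the d/m/Y date in column index 2 (column 3, as stated in the question); the file names and the toTimestamp() helper are made up for illustration.
// Hedged sketch: k-way merge of already-sorted chunk files into one output file.
function mergeSortedChunks(array $chunkPaths, string $outPath): void
{
    $out = fopen($outPath, 'w');
    $handles = [];
    $current = []; // first unwritten row of each chunk

    foreach ($chunkPaths as $i => $path) {
        $handles[$i] = fopen($path, 'r');
        $current[$i] = fgetcsv($handles[$i], 0, ';');
    }

    while ($current) {
        // pick the chunk whose current row has the smallest date
        $best = null;
        foreach ($current as $i => $row) {
            if ($best === null || toTimestamp($row[2]) < toTimestamp($current[$best][2])) {
                $best = $i;
            }
        }
        fputcsv($out, $current[$best], ';');

        // advance that chunk; drop it once it is exhausted
        $next = fgetcsv($handles[$best], 0, ';');
        if ($next === false) {
            fclose($handles[$best]);
            unset($current[$best], $handles[$best]);
        } else {
            $current[$best] = $next;
        }
    }
    fclose($out);
}

function toTimestamp(string $date): int
{
    // '!' resets unparsed fields so only the date matters
    return DateTime::createFromFormat('!d/m/Y', $date)->getTimestamp();
}
Usage would be something like mergeSortedChunks(['chunk_0.csv', 'chunk_1.csv'], 'sorted.csv') after the split-and-sort passes.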
You have a fairly large set of data to process, so you need to do something to optimize it.
You could increase your memory limit, but that only postpones the error: when there is a bigger file, it will crash again (or get way too slow).
The first option is to try to minimize the amount of data. Remove all non-relevant columns from the file. Whichever solution you apply, a smaller dataset is always faster.
I suggest you put it into a database and apply your requirements to it, then use that result to create a new file. A database is made to manage large datasets, so it will take a whole lot less time.
Taking that much data and writing it to a file from PHP will still be slow, but could be manageable. Another tactic is using the command line via a .sh file. If you have basic terminal/SSH skills, you have basic .sh writing capabilities. In that file you can use mysqldump to export to CSV. mysqldump will be significantly faster, but it's a bit trickier to get going when you're used to PHP.
To improve your current code:
- The unset calls at the end of the first loop don't do anything useful. The variables barely store any data and are reset anyway when the next iteration of the while starts.
- Instead of using DateTime for everything (easier to work with, but slower), use epoch values. I don't know what format the dates are in now, but if you use epoch seconds (like the result of time()), you are comparing two plain numbers. Your usort() will improve drastically, as it no longer has to use the heavy DateTime class, just a simple number comparison (see the sketch below).
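For instance, a minimal sketch of that change (the 'ts' key and the '!d/m/Y' format are assumptions based on the question's date format, not part of the original code):
// Convert each date to an epoch integer once, while building the array,
// instead of constructing DateTime objects inside the comparator.
foreach ($tableauDossiers as &$ligne) {
    $ligne['ts'] = DateTime::createFromFormat('!d/m/Y', $ligne['Date de Création'])
                           ->getTimestamp();
}
unset($ligne);

// Plain integer comparison; <=> (PHP 7+) also returns the -1/0/1 that usort() expects.
usort($tableauDossiers, function ($x, $y) {
    return $x['ts'] <=> $y['ts'];
});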
This all assumes that you need to do it multiple times. If not, just open it in Excel or Numbers, sort it there and save a copy.
I've only tried this on a small file, but the principle is very similar to your idea of reading the file, storing the dates and then sorting them, then reading the original file and writing out the sorted data.
In this version, the load just reads the dates and creates an array which holds the date and the position in the file of the start of the line (using ftell() after each read to get the file pointer).
It then sorts this array (as the date is the first element, a normal sort() works).
Then it goes through the sorted array and, for each entry, uses fseek() to locate the record in the file, reads the line (using fgets()) and writes it to the output file.
$file = "a.csv";
$out = "sorted.csv";
$handle = fopen($file, "r");
$tabligne = [];
$start = 0;
while ( $data = fgetcsv($handle) ) {
$tabligne[] = ['date' => DateTime::createFromFormat('d/m/Y', $data[2]),
'start' => $start ];
$start = ftell($handle);
}
sort($tabligne);
$outHandle = fopen( $out, "w" );
foreach ( $tabligne as $entry ) {
fseek($handle, $entry['start']);
$copy = fgets($handle);
fwrite($outHandle, $copy);
}
fclose($outHandle);
fclose($handle);
I would load the data in a database and let that worry about the underlying algorithm.
If this is a one-time issue, I would suggest not automating it and using a spreadsheet instead.
I am running a range of queries in BigQuery and exporting them to CSV via PHP. There are reasons why this is the easiest method for me to do this (multiple queries dependent on variables within an app).
I am struggling with memory issues when the result set is larger than 100 MB. The memory usage of my code seems to grow in line with the result set, which I thought paging would avoid. Here is my code:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query, ['maxResults' => 5000]);

$FH = fopen($storagepath, 'w');
$rows = $queryResults->rows();

foreach ($rows as $row) {
    fputcsv($FH, $row);
}

fclose($FH);
The $queryResults->rows() function returns a Google Iterator which uses paging to scroll through the results, so I do not understand why memory usage grows as the script runs.
Am I missing a way to discard previous pages from memory as I page through the results?
UPDATE
I have noticed that since upgrading to the v1.4.3 BigQuery PHP API, memory usage actually caps out at 120 MB for this process, even when the result set goes far beyond this (currently processing a 1 GB result set). But 120 MB still seems like too much. How can I identify and fix where this memory is being used?
UPDATE 2
This 120 MB seems to be tied to about 24 KB per maxResults row in the page; e.g. adding 1000 rows to maxResults adds 24 MB of memory. So my question is now: why does one row of data use 24 KB in the Google Iterator? Is there a way to reduce this? The data itself is < 1 KB per row.
Answering my own question
The extra memory is used by a load of PHP type mapping and other data-structure info that comes alongside the data from BigQuery. Unfortunately I couldn't find a way to reduce the memory usage below around 24 KB per row multiplied by the page size. If someone finds a way to reduce the bloat that comes along with the data, please post below.
However, thanks to one of the comments, I realized you can extract a query directly to CSV in a Google Cloud Storage bucket. This is really easy:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query);

$qJobInfo = $queryResults->job()->info();
$dataset  = $bq->dataset($qJobInfo['configuration']['query']['destinationTable']['datasetId']);
$table    = $dataset->table($qJobInfo['configuration']['query']['destinationTable']['tableId']);

$extractJob = $table->extract('gs://mybucket/'.$filename.'.csv');
$table->runJob($extractJob);
However, this still didn't solve my issue, as my result set was over 1 GB, so I had to make use of the data sharding function by adding a wildcard:
$extractJob = $table->extract('gs://mybucket/'.$filename.'*.csv');
This created ~100 shards in the bucket. These need to be recomposed using gsutil compose <shard filenames> <final filename>. However, gsutil only lets you compose 32 files at a time. Given that I will have a variable number of shards, often above 32, I had to write some code to clean them up.
//Save the above job as a variable
$eJob = $table->runJob($extractJob);
$eJobInfo = $eJob->info();

//This bit of info from the job tells you how many shards were created
$eJobFiles = $eJobInfo['statistics']['extract']['destinationUriFileCounts'][0];

$composedFiles = 0; $composeLength = 0; $subfile = 0; $fileString = "";

while (($composedFiles < $eJobFiles) && ($eJobFiles > 1)) {
    while (($composeLength < 32) && ($composedFiles < $eJobFiles)) {
        // gsutil creates shards with a 12-digit number after the filename,
        // so build a string of up to 32 such filenames at a time
        $fileString .= "gs://bucket/$filename" . str_pad($composedFiles, 12, "0", STR_PAD_LEFT) . ".csv ";
        $composedFiles++;
        $composeLength++;
    }
    $composeLength = 0;
    // Compose a batch of up to 32 shards into a subfile
    system("gsutil compose $fileString gs://bucket/".$filename."-".$subfile.".csv");
    $subfile++;
    $fileString = "";
}

if ($eJobFiles > 1) {
    //Compose all the subfiles
    system('gsutil compose gs://bucket/'.$filename.'-* gs://fm-sparkbeyond/YouTube_1_0/' . $filepath . '.gz');
}
Note: in order to give my Apache user access to gsutil, I had to allow it to create a .config directory in the web root. Ideally you would use the gsutil PHP library instead, but I didn't want the code bloat.
If anyone has a better answer, please post it:
- Is there a way to get smaller output from the BigQuery library than 24 KB per row?
- Is there a more efficient way to clean up a variable number of shards?
I want to take File.open('somefile', 'w+') and have it read one file, take one line of text at a time, and visibly write it slowly into another file. The reason I ask is that I can find nothing that already does this in code, nor anything that actually controls how fast a program writes to a file. I know this can be simulated in a program such as Adobe After Effects, as long as you provide a cursor after a character and the visual effect doesn't happen too quickly, but I've got 4,000 lines of code that I want to iterate over and can't afford to do this manually. The effect can also be achieved with a Microsoft macro, but that requires the text to be entered into the macro manually, with no option to copy and paste.
Solutions preferred in Python, Ruby, or PHP.
If I understood properly what you are trying to achieve, here you go:
input = File.read('readfrom.txt')

File.open('writeto.txt', 'w+') do |f|
  input.chars.each do |c|
    f.print(c)  # print 1 char
    f.flush     # flush the stream
    sleep 1     # sleep
  end
end
This is one quick and dirty way of doing it in Python.
from time import sleep

mystring = 'My short text with a newline here\nand then ensuing text'
dt = 0.2  # 0.2 seconds between characters

# open the output file once, then write and flush one character at a time
with open('fn_out', 'w+') as f:
    for ch in mystring:
        f.write(ch)
        f.flush()
        sleep(dt)
f.flush() will result in updating the file with the changes.
One could make this more elaborate by having a longer pause after each newline, or a variable timestep dt.
To watch the change, one has to repeatedly reload the file, as pointed out by @Tom Lord, so you could run something like this beforehand to watch it in the terminal:
watch -n 0.1 cat fn_out
After some serious testing, I have finally developed a piece of code that does the very thing I want. Tom Lord gave me some new words to use in my search terms ("simulate typing"), which led me to win32ole and its SendKeys function. Here is code that iterates over all the characters in a file and prints them out exactly as they were saved, while simulating typing. I will see about making this into a gem for future use.
require 'win32ole'

wsh = WIN32OLE.new("WScript.Shell")
wsh.Run("Notepad.exe")
while not wsh.AppActivate("Notepad")
  sleep 1
end

def fileToArray(file)
  x = []
  File.foreach("#{file}") do |line|
    x << line.split('')
  end
  return x.flatten!
end

tests = fileToArray("readfrom.txt")

x = 0
while x < tests.length   # was <=, which read one element past the end
  send = tests[x]
  wsh.SendKeys("#{send}")
  x += 1
  sleep 0.1
end
I have a 1.3 GB text file that I need to extract some information from in PHP. I have researched it and have come up with a few ways to do what I need to do, but as always I am after a little clarification on which method would be best, or whether a better one exists that I don't know about.
The information I need from the text file is only the first 40 characters of each line, and there are around 17 million lines in the file. The 40 characters from each line will be inserted into a database.
The methods I have are below;
// REMOVE TIME LIMIT
set_time_limit(0);
// REMOVE MEMORY LIMIT
ini_set('memory_limit', '-1');

// OPEN FILE
$handle = @fopen('C:\Users\Carl\Downloads\test.txt', 'r');
if ($handle) {
    while (($buffer = fgets($handle)) !== false) {
        $insert[] = substr($buffer, 0, 40);
    }
    if (!feof($handle)) {
        // END OF FILE
    }
    fclose($handle);
}
The above reads each line one at a time and gets the data. I have all the database inserts sorted, doing 50 inserts at a time, ten times over, in a transaction.
The next method is really the same as the above, but calls file() to store all the lines in an array before doing a foreach to get the data. I am not sure about this method though, as the array would essentially have over 17 million values.
Another method would be to extract only part of the file, rewrite the file without the used data, and after that part has been executed recall the script using a header call.
What would be the quickest and most efficient way to get this done? Or is there a better approach that I haven't thought of?
Also, I plan to use this script with WAMP, but running it in a browser while testing caused timeout problems even with the script timeout set to 0. Is there a way I can execute the script without accessing the page through a browser?
What you have so far is good; just don't use the file() function, as it would most probably hit the RAM usage limit and terminate your script.
I wouldn't even accumulate stuff into the $insert[] array, as that wastes RAM as well. If you can, insert into the database right away.
BTW, there is a nice tool called cut that you could use to process the file:
cut -c1-40 file.txt
You could even redirect cut's stdout to some PHP script that inserts into database.
cut -c1-40 file.txt | php -f inserter.php
inserter.php could then read lines from php://stdin and insert them into the DB, as sketched below.
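For example, a minimal sketch of such an inserter.php (the DSN, table, and column names are placeholders, not from the original post):
<?php
// inserter.php -- read lines piped in from `cut` on STDIN and insert each
// one with a prepared statement (connection details are placeholders).
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO lines (prefix) VALUES (?)');

$in = fopen('php://stdin', 'r');
while (($line = fgets($in)) !== false) {
    $stmt->execute([rtrim($line, "\r\n")]);
}
fclose($in);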
"cut" is a standard tool available on all Linuxes, if you use Windows you can get it with MinGW shell, or as part of msystools (if you use git) or install native win32 app using gnuWin32.
Why are you doing this in PHP when your RDBMS almost certainly has bulk import functionality built in? MySQL, for example, has LOAD DATA INFILE:
LOAD DATA INFILE 'data.txt'
INTO TABLE `some_table`
FIELDS TERMINATED BY ''
LINES TERMINATED BY '\n'
( @line )
SET `some_column` = LEFT( @line, 40 );
One query.
MySQL also has the mysqlimport utility that wraps this functionality from the command line.
None of the above. The problem with using fgets() is that it does not work the way you expect: when the maximum number of characters is reached, the next call to fgets() continues on the same line. You have correctly identified the problem with using file(). The third method is an interesting idea, and you could pull it off with other solutions as well.
That said, your first idea of using fgets() is pretty close; we just need to slightly modify its behaviour. Here's a customized version that will work as you'd expect:
function fgetl($fp, $len) {
    $l = 0;
    $buffer = '';
    // read to the end of the line, but keep only the first $len characters
    while (false !== ($c = fgetc($fp)) && PHP_EOL !== $c) {
        if ($l < $len)
            $buffer .= $c;
        ++$l;
    }
    // distinguish end-of-file from an empty line
    if (0 === $l && false === $c) {
        return false;
    }
    return $buffer;
}
Execute the insert operation immediately, or you will waste memory. Make sure you are using prepared statements to insert this many rows; this will drastically reduce execution time, since you only submit the data on each insert instead of the full query (see the sketch below).
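A hedged sketch of that advice, reusing the fgetl() helper above and the file path from the question; the DSN, table and column names are placeholders:
<?php
// Read 40-character prefixes with fgetl() and insert them right away with a
// prepared statement, wrapped in a transaction to cut per-row overhead.
$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO lines (prefix) VALUES (?)');

$fp = fopen('C:\Users\Carl\Downloads\test.txt', 'r');

$pdo->beginTransaction();
while (false !== ($prefix = fgetl($fp, 40))) {
    $stmt->execute([$prefix]);   // nothing accumulates in PHP memory
}
$pdo->commit();

fclose($fp);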
I have a CSV file with records sorted on the first field. I managed to write a function that does a binary search through that file, using fseek for random access.
However, this is still a pretty slow process, since when I seek to some file position I actually need to look left for the \n character, so I can be sure I'm reading a whole line (once a whole line is read, I can check its first field value, as mentioned above).
Here is the function that returns the line containing the character at position $x:
function fgetLineContaining( $fh, $x ) {
    if( $x > 125145411 ) // 12514511 is the last pos in my file
        return "";

    // now go as far left as possible, until a newline is found
    // or the beginning of the file
    $c = '';
    while( $x > 0 && $c != "\n" && $c != "\r") {
        fseek($fh, $x);
        $x--;                   // go left in the file
        $c = fgetc( $fh );
    }
    $x += 2;                    // skip the newline char
    fseek( $fh, $x );
    return fgets( $fh, 1024 );  // return the line from the beginning until \n
}
While this is working as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks are slowing things down quite a bit.
Is there a better way to seek to the line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, enabling the file to be read object by object. Does PHP support that?
Thanks
I think you really should consider using SQLite or MySQL again (as others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough RAM to keep its indexes in memory?
One thing you can try to speed things up under the current algorithm is to load the (~100 MB?) file onto a RAM disk. Whatever you choose to do, CSV or SQLite, this WILL help, especially if hard-drive seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups; a rough sketch of this follows.
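As one way to realize that idea (the file name is a placeholder; it assumes the question's file, which is sorted on its first field and comma- or semicolon-delimited as appropriate):
<?php
// Load the whole CSV into an array keyed by the first field, so each lookup
// is a single hash access instead of repeated fseek()/fgetc() calls.
$index = [];
$fh = fopen('data.csv', 'r');              // placeholder file name
while (($row = fgetcsv($fh)) !== false) {
    $index[$row[0]] = $row;                // first field is the search key
}
fclose($fh);

// O(1) lookup afterwards:
$needle = 'some-key';
$record = isset($index[$needle]) ? $index[$needle] : null;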
Also, one thing to keep in mind: PHP isn't exactly fast at low-level operations. You might want to consider moving away from PHP in favour of C or C++.