Splitting large files in half

Splitting large files in half - php

tl;dr: I need a way of splitting 5 GB / ~11m row files in ~half (or thirds) while keeping track of exactly every file I create and of course not breaking any lines, so I can process both files at once
I have a set of 300 very large json-like files I need to parse with a php script periodically. Each file is about 5 GB decompressed. I've optimized the hell out of parsing script and it's reached it's speed limit. But it's still a single-threaded script running for about 20 hours on a 16 core server.
I'd like to split each file into approximately half, and have two parsing scripts run at once, to "fake" multi-threaded-ness and speed up run time. I can store global runtime information and "messages" between threads in my sql database. That should cut the total runtime in half, having one thread downloading the files, another decompressing them, and two more loading them into sql in parallel.
That part is actually pretty straight forward, where I'm stuck is splitting up the file to be parsed. I know there is a split tool that can break down files into chunks based on KB or line count. Problem is that doesn't quite work for me. I need to split these files in half (or thirds or quarters) cleanly. And without having any excess data go into an extra file. I need to know exactly what files the split command has created so I can note easy file in my sql table so the parsing script can know which files are ready to be parsed. If possible, I'd even like to avoid running wc -l in this process. That may not be possible, but it takes about 7 seconds for each file, 200 files, means 35 extra minutes of runtime.
Despite what I just said, I guess I run wc -l file on my file, divide that by n, round the result up, and use split to break the file into that many lines. That should always give me exactly n files. Than I can just know that ill have filea, fileb and so on.
I guess the question ultimately is, is there a better way to deal with this problem? Maybe theres another utility that will split in a way thats more compatible with what I'm doing. Or maybe there's another approach entirely that I'm overlooking.

I had the same problem and it wasn't easy to find a solution.
First you need to use jq to convert your JSON to string format.
Then use the GNU version of split, it has an extra --filter option which allows processing individual chunks of data in much less space as it does not need to create any temporary files:
split --filter='shell_command'
Your filter command should read from stdin:
jq -r '' file.json | split -l 10000 --filter='php process.php'
-l will tell split to work on 10000 lines at a time.
In process.php file you just need to read from stdin and do whatever you want.

Related

php getting big data from url - optimise

I am using file_get_contents to get 1 million records from URL and output the results which is in json format and I can't go for pagination and currently working by increasing my memory. Is there any other solution for this?

If you're processing large amounts of data, fscanf will probably prove
valuable and more efficient than, say, using file followed by a split
and sprintf command. In contrast, if you're simply echoing a large
amount of text with little modification, file, file_get_contents, or
readfile might make more sense. This would likely be the case if
you're using PHP for caching or even to create a makeshift proxy
server.
More
The right way to read files with PHP

Read, and remove, X number of lines from big text file in PHP

There are a lot of different scenarios that are similar (replace text in file, read specific lines etc) but I have not found a good solution to what I want to do:
Messages (strings) are normally sent to a queue. If the server that handles the queue is down the messages are saved to a file. One message per line.
When the server is up again I want to start sending the messages to the server. The file with messages could be "big" so I do not want to read the entire file into memory. I also only want to send a message once, so the file need to reflect if a message has been sent(in other words: don't get 100 lines and then PHP timeout after 95 so the next time the same thing will happen again).
What I basically need is to read one line from a big text file and then delete that line when it has been processed by my script, without constantly reading/writing the whole file.
I have seen different solutions (fread, SplFileObject etc) that can read a line from a file without reading the entire file (into memory) but I have not seen a good way to delete the line that was just read without going through the entire file and saving it again.
I'm guessing that it can be done since the thing that needs to be done is to remove x bytes from the beginning or the end of the file, depending where you read the lines from.
To be clear: I do not think it's a good solution to read the first line from the file, use it, and then read all the other lines just to write them to a tmp-file and then from there to the original file. Read/write 100000 lines just to get one line.
The problem can be fixed in other ways, like creating a number of smaller files so they can be read/written without to much performance problems, but I would like to know if anyone has a solution to the exact problem.
Update:
Since it can't be done did I end up using Sqlite.

How do i process a file in parts [duplicate]

I have a large file, 100,000 lines. I can read each line and process it, or I can store the lines in an array then process them. I would prefer to use the array for extra features, but I'm really concerned about the memory usage associated with storing that many lines in an array, and if it's worth it.

There are two functions you should familiarize yourself with.
The first is file(), which reads an entire file into an array, with each line as an array element. This is good for shorter files, and probably isn't what you want to be using on a 100k line file. This function handles its own file management, so you don't need to explicitly open and close the file yourself.
The second is fgets(), which you can use to read a file one line at a time. You can use this to loop for as long as there are more lines to process, and run your line processing inside the loop. You'll need to use fopen() to get a handle on this file, you may want to track the file pointer yourself for recovery management (i.e. so you won't have to restart processing from scratch if something goes sideways and the script fails), etc.
Hopefully that's enough to get you started.

How about a combination of the two? Read 1000 lines into an array, process it, delete the array, then read 1000 more, etc. Monitor memory usage and adjust how many you read into an array at a time.

Generating ZIP files with PHP + Apache on-the-fly in high speed?

To quote some famous words:
“Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”
While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)
Suppose you have this big collection with JPG files and a few odd SWF files. With "big" I mean "a couple thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there's a few new JPG files. The total size of all the stuff is thus around 1 GB, and is slowly but steadily increasing. Files are VERY rarely changed or deleted.
The users can view each of the files individually on the webpage. However there is also the wish to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.
The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.
Since the amount of criteria is big enough, I cannot pre-generate all the possible ZIP files and must do it on-the-fly. Another problem is that the download can be quite large and for users with slow connections it's quite likely that it will take an hour or more. Support for "resume" is therefore a must-have.
On the bright side however the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.
The problems then that I have identified are thus:
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Will passing large amounts of file data through PHP not be a performance hit in itself?
How would you implement this? Is PHP up to the task at all?
Added:
By now two people have suggested to store the requested ZIP files in a temporary folder and serving them from there as usual files. While this is indeed an obvious solution, there are several practical considerations which make this infeasible.
The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreads of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also there are many possible filter combinations and many of them are likely to be selected by the users.
As a result, the ZIP files will be pretty slow to generate (due to sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.

This may be what you need:
http://pablotron.org/software/zipstream-php/
This lib allows you to build a dynamic streaming zip file without swapping to disk.

Use e.g. the PhpConcept Library Zip library.
Resuming must be supported by your webserver except the case where you don't make the zipfiles accessible directly. If you have a php script as mediator then pay attention to sending the right headers to support resuming.
The script creating the files shouldn't timeout ever just make sure the users can't select thousands of files at once. And keep something in place to remove "old zipfiles" and watch out that some malicious user doesn't use up your diskspace by requesting many different filecollections.

You're going to have to store the generated zip file, if you want them to be able to resume downloads.
Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (hash of the search filters maybe). Then you send the correct headers to the user and echo file_get_contents to the user.
To support resuming you need to check out the $_SERVER['HTTP_RANGE'] value, it's format is detailed here and once your parsed that you'll need to run something like this.
$size = filesize($zip_file);
if(isset($_SERVER['HTTP_RANGE'])) {
//parse http_range
$range = explode( '-', $seek_range);
$new_length = $range[1] - $range[0]
header("HTTP/1.1 206 Partial Content");
header("Content-Length: $new_length");
header("Content-Range: bytes {$range[0]}-$range[1]");
echo file_get_contents($zip_file, FILE_BINARY, null, $range[0], $new_length);
} else {
header("Content-Range: bytes 0-$size");
header("Content-Length: ".$size);
echo file_get_contents($zip_file);
}
This is very sketchy code, you'll probably need to play around with the headers and the contents to the HTTP_RANGE variable a bit. You can use fopen and fwrite rather than file_get contents if you wish and just fseek to the right place.
Now to your questions
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
You can remove it if you want to, however if something goes pear shaped and your code get stuck in an infinite loop at can lead to interesting problems should that infinite loop be logging and error somewhere and you don't notice, until a rather grumpy sys-admin wonders why their server ran out of hard disk space ;)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Cache the file to the hard disk, means you wont have this problem.
Will passing large amounts of file data through PHP not be a performance hit in itself?
Yes it wont be as fast as a regular download from the webserver. But it shouldn't be too slow.

i have a download page, and made a zip class that is very similar to your ideas.
my downloads are very big files, that can't be zipped properly with the zip classes out there.
and i had similar ideas as you.
the approach to give up the compression is very good, with that you not even need fewer cpu resources, you save memory because you don't have to touch the input files and can pass it throught, you can also calculate everything like the zip headers and the end filesize very easy, and you can jump to every position and generate from this point to realize resume.
I go even further, i generate one checksum from all the input file crc's, and use it as an e-tag for the generated file to support caching, and as part of the filename.
If you have already download the generated zip file the browser gets it from the local cache instead of the server.
You can also adjust the download rate (for example 300KB/s).
One can make zip comments.
You can choose which files can be added and what not (for example thumbs.db).
But theres one problem that you can't overcome with the zip format completely.
Thats the generation of the crc values.
Even if you use hash-file to overcome the memory problem, or use hash-update to incrementally generate the crc, it will use to much cpu resources.
Not much for one person, but not recommend for professional use.
I solved this with an extra crc value table that i generate with an extra script.
I add this crc values per parameter to the zip class.
With this, the class is ultra fast.
Like a regular download script, as you mentioned.
My zip class is work in progress, you can have a look at it here: http://www.ranma.tv/zip-class.txt
I hope i can help someone with that :)
But i will discontinue this approach, i will reprogram my class to a tar class.
With tar i don't need to generate crc values from the files, tar only need some checksums for the headers, thats all.
And i don't need an extra mysql table any more.
I think it makes the class easier to use, if you don't have to create an extra crc table for it.
It's not so hard, because tars file structure is easier as the zip structure.
PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
If your script is safe and it closes on user abort, then you can remove it completely.
But it would be safer, if you just renew the timeout on every file that you pass throught :)
With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
Yes that would work.
I had generated a checksum from the input file crc's.
I used this as an e-tag and as part of the zip filename.
If something changed, the user can't resume the generated zip,
because the e-tag and filename changed together with the content.
Will passing large amounts of file data through PHP not be a performance hit in itself?
No, if you only pass throught it will not use much more then a regular download.
Maybe 0.01% i don't know, its not much :)
I assume because php don't do much with the data :)

You can use ZipStream or PHPZip, which will send zipped files on the fly to the browser, divided in chunks, instead of loading the entire content in PHP and then sending the zip file.
Both libraries are nice and useful pieces of code. A few details:
ZipStream "works" only with memory, but cannot be easily ported to PHP 4 if necessary (uses hash_file())
PHPZip writes temporary files on disk (consumes as much disk space as the biggest file to add in the zip), but can be easily adapted for PHP 4 if necessary.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.