Handle Large File with PHP - php

I have a file of around 10 GB or more. Each line of the file contains a single number from 1 to 10 and nothing else. The task is to read the numbers from the file, sort them in ascending or descending order, and write the sorted numbers to a new file.
Can anyone please help me with this?

I'm assuming this is some kind of homework and the goal is to sort more data than you can hold in RAM?
Since you only have the numbers 1-10, this is not a complicated task. Just open your input file and count how many occurrences of each number you have. After that you can construct a simple loop and write the values into another file. The following example is pretty self-explanatory.
$inFile = '/path/to/input/file';
$outFile = '/path/to/output/file';
$input = fopen($inFile, 'r');
if ($input === false) {
    throw new Exception('Unable to open: ' . $inFile);
}
//$map will be an array with keys 1 to 10, filled with 0-s
$map = array_fill(1, 10, 0);
//Read file line by line and count how many of each specific number you have
while (($line = fgets($input)) !== false) {
    $int = (int) $line;
    //Ignore blank lines and anything outside 1-10
    if (isset($map[$int])) {
        $map[$int]++;
    }
}
fclose($input);
$output = fopen($outFile, 'w');
if ($output === false) {
    throw new Exception('Unable to open: ' . $outFile);
}
/*
 * Reverse array if you need to change direction between
 * ascending and descending order (pass true as the second
 * argument so the numbers are kept as keys)
 */
//$map = array_reverse($map, true);
//Write values into your output file
foreach ($map as $number => $count) {
    $string = ((string) $number) . PHP_EOL;
    for ($i = 0; $i < $count; $i++) {
        fwrite($output, $string);
    }
}
fclose($output);
Since you are dealing with huge files, you should also check the script execution time limit (max_execution_time) of your PHP environment. The example above will take a VERY long time for 10GB+ files, but since I didn't see any limitations concerning execution time or performance in your question, I'm assuming that is OK.
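If the write phase turns out to be the slow part, one cheap thing to try is writing many lines per fwrite() call instead of one. A minimal sketch, not part of the answer above: writeRepeated() and the chunk size of 100000 lines are assumptions made up for illustration, and whether it helps should be measured on your own machine.

```php
<?php
// Hypothetical helper: write $count copies of "$number\n" to $handle,
// but at most $chunkLines lines per fwrite() call, so the buffer built
// by str_repeat() stays small no matter how large $count is.
function writeRepeated($handle, int $number, int $count, int $chunkLines = 100000): void
{
    $line = $number . PHP_EOL;
    while ($count > 0) {
        $n = min($count, $chunkLines);
        fwrite($handle, str_repeat($line, $n));
        $count -= $n;
    }
}
```

Inside the foreach over $map, the inner for loop would then become a single call such as writeRepeated($output, $number, $count);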

I had a similar issue before. Trying to manipulate such a large file ended up being a huge drain on resources and the script couldn't cope. The easiest solution I ended up with was to import the file into a MySQL database using a fast bulk-load statement called LOAD DATA INFILE
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Once it's in, you should be able to manipulate the data.
Alternatively, you could just read the file line by line while writing the sorted numbers into another file line by line. I'm not too sure how well this would work, though.
Have you had any previous attempts at it or are you just after a possible method of doing it?

If that's all, you don't need PHP (if you have a Linux machine at hand):
sort -n file > file_sorted-asc
sort -nr file > file_sorted-desc
Edit: OK, here's your solution in PHP (if you have a Linux machine at hand):
<?php
// Sort ascending
`sort -n file > file_sorted-asc`;
// Sort descending
`sort -nr file > file_sorted-desc`;
?>
:)
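If the file names ever come from user input or contain spaces, the backtick version above can be made a bit safer by escaping the paths and checking the exit code. A rough sketch, assuming a Linux machine with sort on the PATH; buildSortCommand() and sortFile() are hypothetical names, not something from the answer:

```php
<?php
// Hypothetical helper: build a `sort` command line with shell-escaped paths.
function buildSortCommand(string $inFile, string $outFile, bool $descending = false): string
{
    $flags = $descending ? '-nr' : '-n';
    return 'sort ' . $flags . ' ' . escapeshellarg($inFile) . ' > ' . escapeshellarg($outFile);
}

// Run the command and report whether `sort` exited with status 0.
function sortFile(string $inFile, string $outFile, bool $descending = false): bool
{
    exec(buildSortCommand($inFile, $outFile, $descending), $output, $exitCode);
    return $exitCode === 0;
}
```

sort does an external merge sort with temporary files, so it copes with inputs larger than RAM; the PHP process itself only waits for it to finish.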

Related

PHP Array Processing Ability Decreases

I need help processing files holding about 46k lines or more than 30MB of data.
My original idea was to open the file and turn each line into an array element. This worked the first time as the array held about 32k values total.
The second time the process was repeated, the array only held 1011 elements, and the third time it could only hold 100.
I'm confused and don't know much about the backend array processes. Can someone explain what is happening and fix the code?
function file_to_array($cvsFile){
$handle = fopen($cvsFile, "r");
$path = fread($handle, filesize($cvsFile));
fclose($handle);
//Turn the file into an array and separate lines to elements
$csv = explode(",", $path);
//Remove common double spaces
foreach ($csv as $key => $line){
$csv[$key] = str_replace(' ', '', str_getcsv($line));
}
array_filter($csv);
//get the row count for the file and array
$rows = count($csv);
$filerows = count(file($cvsFile)); //this no longer works
echo "File has $filerows and array has $rows";
return $csv;
}
The approach here can be split in two:
1. Optimized file reading and processing
2. Proper storage solution
Optimized file processing can be done like so:
$handle = fopen($cvsFile, "r");
$csv = [];
$rowsSucceed = 0;
$rowsFailed = 0;
if ($handle) {
    while (($line = fgets($handle)) !== false) { // Reading file by line
        // Process CSV line and check if it was parsed correctly
        // And count as you go
        $line = rtrim($line, "\r\n");
        $parsedLine = $line === '' ? [] : str_getcsv($line);
        if (!empty($parsedLine)) {
            $csv[] = $parsedLine;
            $rowsSucceed++;
        } else {
            $rowsFailed++;
        }
    }
    fclose($handle);
} else {
    // Error handling
}
$totalLines = $rowsSucceed + $rowsFailed;
You can also avoid array_filter() simply by not adding a processed line if it is empty.
This helps to reduce memory usage during script execution.
Proper storage
Proper storage is needed here for performing operations on a large amount of data. Repeated file reads are inefficient and expensive. Using a simple file-based database like SQLite can help you a lot and increase the overall performance of your script.
For this purpose you should probably load your CSV directly into the database and then perform the count operation on the parsed data, avoiding repeated file line counts etc.
It also gives you the further advantage of working with the data without keeping it all in memory.
Your question says you want to "turn each line into an array element" but that is definitely not what you are doing. The code is quite clear; it reads the entire file into $path and then uses explode() to make one massive flat array of every element on every line. Then later you're trying to run str_getcsv() on each item, which of course isn't going to work; you've already exploded all the commas away.
Looping over the file using fgetcsv() makes more sense:
function file_to_array($cvsFile) {
    $csv = [];
    $filerows = 0;
    $handle = fopen($cvsFile, "r");
    while ($line = fgetcsv($handle)) {
        $filerows++;
        // skip empty lines
        if ($line[0] === null) {
            continue;
        }
        //Remove common double spaces
        $csv[] = str_replace(' ', '', $line);
    }
    //get the row count for the file and array
    $rows = count($csv);
    echo "File has $filerows and array has $rows";
    fclose($handle);
    return $csv;
}
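If the resulting array itself gets too large, the same fgetcsv() loop can hand rows out one at a time instead of collecting them. A rough sketch using a generator; readCsvRows() is a made-up name, and the delimiter, enclosure and escape characters are assumed to be the usual defaults, written out explicitly:

```php
<?php
// Hypothetical generator: yield one parsed CSV row at a time instead of
// building the whole array in memory.
function readCsvRows(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open: $path");
    }
    try {
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // fgetcsv() returns [null] for a blank line
            }
            yield $row;
        }
    } finally {
        // Runs when the loop ends or the generator is destroyed early
        fclose($handle);
    }
}
```

The caller then writes foreach (readCsvRows($cvsFile) as $row) { ... } and only ever holds one row at a time.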

php - for loop repeating itself / going out of sequence

I'm very new to PHP, making errors and learning as I go. Please be gentle! :)
I want to access some data from Blizzard.com's API. For this particular data set, it's not a single block of JSON; rather, each object has its own URL to access. I estimate that there are approx. 150000 objects, but I don't know the start or end points of the number range. So I have to assume it starts at 1 and work past the highest number I know (269065).
To get the data, I need to access each object's data via a JSON file, which I read, get the contents of & drop in to a text file (this could be written as an insert in to a SQL db too, as I'm able to do this if it's the text file that's the issue). But to be honest, I would love to get to the bottom of why this is happening as much as anything!
I wasn't going to try and run ~250000 iterations in a for loop, I thought I'd try something I considered small, 2000.
The for loop starts with $a as 1, uses $a as part of the URL, loads & decodes the JSON, checks to see if the first field (ID) in the object is set, if it is, it writes a few fields to data.txt & if the first field (ID) isn't set it just writes $a to data.txt (so I know it's a null for other purposes not outlined here).
Simple! Or so I thought. After approx. 183 iterations, the data written to the text file goes awry, as seen in the output below. It is out of sequence and starts at 1 again, then back to 184, ad nauseam. The script then seems to be locked in some kind of infinite loop, outputting in a random order until I close the page 10-20 minutes later.
I have obviously made a big mistake! But I have no idea what I have done wrong to have caused this. During my attempts I have rewritten the code with new variable names, so a new text does not conflict with code that could be running in memory.
I've tried resetting variables to blank at the end of the loop in case something was being reused that was causing a problem.
If anyone could point out any errors in my code, or suggest something for me to look in to, to handle bigger loops that would be brilliant. I am assuming my issue may be a time out or memory problem. But I don't know where to start & was hoping I'd find some suggestions here.
If it's relevant, I am using 000webhostapp.com as my host provider for now, until I get some paid for hosting.
1 ... 182 183 1 184 2 3 185 4 186 5 187 6 188 7 189 190 8 191
for ($a = 1; $a <= 2000; $a++) {
$json = "https://eu.api.battle.net/wow/recipe/".$a."?locale=en_GB&<MYPRIVATEAPIKEY>";
$contents = file_get_contents($json);
$data = json_decode($contents,true);
if (isset($data['id'])) {
$file = fopen("data.txt","a");
fwrite($file,$data['id'].",'".$data['name']."'\n");
fclose($file);
} else {
$file = fopen("data.txt","a");
fwrite($file,$a."\n");
fclose($file);
}
}
The content of the file I'm trying to access is
{"id":33994,"name":"Precise Strikes","profession":"Enchanting","icon":"spell_holy_greaterheal"}
I scrapped the original plan and wrote this instead. Thank you again to everyone who took the time out of their day to help and offer suggestions!
$b = $mysqli->query("SELECT id FROM `static_recipes` order by id desc LIMIT 1;")->fetch_object()->id;
if (empty($b)) {$b=1;};
$count = $b+101;
$write = [];
for ($a = $b+1; $a < $count; $a++) {
$json = "https://eu.api.battle.net/wow/recipe/".$a."?locale=en_GB&apikey=";
$contents = @file_get_contents($json);
$data = json_decode($contents,true);
if (isset($data['id'])) {
$write [] = "(".$data['id'].",'".addslashes($data['name'])."','".addslashes($data['profession'])."','".addslashes($data['icon'])."')";
} else {
$write [] = "(".$a.",'a','a','a'".")";
}
}
$SQL = ('INSERT INTO `static_recipes` (id, name, profession, icon) VALUES '.implode(',', $write));
$mysqli->query($SQL);
$mysqli->close();
$write = [];
for ($a = 1; $a <= 2000; $a++) {
$json = "https://eu.api.battle.net/wow/".$a."?locale=en_GB&<MYPRIVATEAPIKEY>";
$contents = file_get_contents($json);
$data = json_decode($contents,true);
if (isset($data['id'])) {
$write [] = $data['id'].",'".$data['name']."'\n";
} else {
$write [] = $a."\n";
}
}
$file = fopen("data.txt","a");
fwrite($file, implode('', $write));
fclose($file);
Also, why do you assume that the same ID isn't returned by several "https://eu.api.battle.net/wow/[N]" URLs?
Also, if you do end up running ~250000 requests, think about curl_multi_init(): http://php.net/manual/en/function.curl-multi-init.php
I can't really see anything obviously wrong with your code, though I can't run it as I don't have the JSON.
It is possible that there is some kind of race condition, since you're opening and closing the same file hundreds of times very quickly.
File operations might seem atomic, but that is not necessarily so. Here's an interesting SO thread:
Does PHP wait for filesystem operations (like file_put_contents) to complete before moving on?
As some others suggested, maybe just open the file before you enter the loop, then close it when the loop ends.
I'd try it first and see if it helps.
There's nothing in your original code that would cause that sort of behaviour. PHP will not arbitrarily change the value of a variable. You are opening this file in append mode, are you certain that you're not looking at old data? Maybe output some debug messages as you process the data. It's likely you'd run up against some rate limiting on the API server, so putting a pause in there somewhere may improve reliability.
The only substantive change I'd suggest to your code is opening the file once and closing it when you're done.
$file = fopen("data_1_2000.txt", "w");
for ($a = 1; $a <= 2000; $a++) {
$json = "https://eu.api.battle.net/wow/recipe/$a?locale=en_GB&<MYPRIVATEAPIKEY>";
$contents = file_get_contents($json);
$data = json_decode($contents, true);
if (!empty($data['id'])) {
$data["name"] = str_replace("'", "\\'", $data["name"]);
$record = "$data[id],'$data[name]'";
} else {
$record = $a;
}
fwrite($file, "$record\n");
sleep(1);
echo "$a "; if ($a % 50 === 0) echo "\n";
}
fclose($file);
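On the rate-limiting point: instead of a fixed sleep(1) after every request, you could pause only when a request fails and then try again. A minimal sketch under assumptions: fetchJsonWithRetry() is a hypothetical helper, the fetcher is passed in as a callable so it can be swapped out, and three attempts with a one-second pause are arbitrary defaults.

```php
<?php
// Hypothetical helper: call $fetch($url) up to $maxAttempts times, pausing
// $delaySeconds between attempts. Returns the decoded JSON as an array,
// or null if every attempt failed or returned something that is not JSON.
function fetchJsonWithRetry(callable $fetch, string $url, int $maxAttempts = 3, int $delaySeconds = 1): ?array
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $contents = $fetch($url);
        if ($contents !== false) {
            $data = json_decode($contents, true);
            if (is_array($data)) {
                return $data;
            }
        }
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds);
        }
    }
    return null;
}
```

In the loop above this would be called as $data = fetchJsonWithRetry('file_get_contents', $json); Note that an ID which simply does not exist will also be retried, so keep the attempt count low.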

php fgetcsv multiple lines not only one or all

I want to read very big CSV files and insert them into a database. That already works:
if(($handleF = fopen($path."\\".$file, 'r')) !== false){
$i = 1;
// loop through the file line-by-line
while(($dataRow = fgetcsv($handleF,0,";")) !== false) {
// Only start at the startRow, otherwise skip the row.
if($i >= $startRow){
// Check if to use headers
if($lookAtHeaders == 1 && $i == $startRow){
$this->createUberschriften( array_map(array($this, "convert"), $dataRow ) );
} else {
$dataRow = array_map(array($this, "convert"), $dataRow );
$data = $this->changeMapping($dataRow, $startCol);
$this->executeInsert($data, $tableFields);
}
unset($dataRow);
}
$i++;
}
fclose($handleF);
}
My problem with this solution is that it's very slow. But the files are too big to load directly into memory... So I want to ask if there is a possibility to read, for example, 10 lines into the $dataRow array, rather than only one line or all of them.
I want to get a better balance between memory and performance.
Do you understand what I mean? Thanks for the help.
Greetz
V
EDIT:
Ok, I still have to try to find a solution with the MSSQL database. My solution was to stack the data and then do a multi-row MSSQL insert:
while(($dataRow = fgetcsv($handleF,0,";")) !== false) {
// Only start at the startRow, otherwise skip the row.
if($i >= $startRow){
// Check if to use headers
if($lookAtHeaders == 1 && $i == $startRow){
$this->createUberschriften( array_map(array($this, "convert"), $dataRow ) );
} else {
$dataRow = array_map(array($this, "convert"), $dataRow );
$data = $this->changeMapping($dataRow, $startCol);
$this->setCurrentRow($i);
if(count($dataStack) > 210){
array_push($dataStack, $data);
#echo '<pre>', print_r($dataStack), '</pre>';
$this->executeInsert($dataStack, $tableFields, true);
// reset the stack
unset($dataStack);
$dataStack = array();
} else {
array_push($dataStack, $data);
}
unset($data);
}
$i++;
unset($dataRow);
}
}
Finally, I have to loop over the stack and build a multi-row insert in the method "executeInsert", to create a query like this:
INSERT INTO [myTable] (field1, field2) VALUES ('data1', 'data2'),('data2', 'data3')...
That works much better. I still have to find the best balance, but for that I only need to change the value '210' in the code above. I hope that helps everybody with a similar problem.
Attention: Don't forget to execute the method "executeInsert" again after reading the complete file, because there may still be some data left in the stack, and the method is only executed when the stack reaches the size of 210....
Greetz
V
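The stacking idea, including the final flush for the leftover rows, can be sketched roughly like this. Both function names are made up for illustration, the bracket quoting follows the [myTable] query above, and the table and field names are assumed to be trusted identifiers. SQL Server caps a single VALUES list at 1000 rows and a request at 2100 parameters, so the batch size has to stay below both.

```php
<?php
// Hypothetical helper: build one parameterised multi-row INSERT.
// Returns [sql, params]; every row must have one value per field.
function buildMultiInsert(string $table, array $fields, array $rows): array
{
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($fields), '?')) . ')';
    $sql = 'INSERT INTO [' . $table . '] (' . implode(', ', $fields) . ') VALUES '
         . implode(', ', array_fill(0, count($rows), $rowPlaceholder));
    $params = [];
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }
    return [$sql, $params];
}

// Hypothetical helper: collect rows into batches of $batchSize, call $flush
// for every full batch and once more for whatever is left at the end.
// Returns the number of batches flushed.
function insertInBatches(iterable $rows, int $batchSize, callable $flush): int
{
    $stack = [];
    $batches = 0;
    foreach ($rows as $row) {
        $stack[] = $row;
        if (count($stack) >= $batchSize) {
            $flush($stack);
            $stack = [];
            $batches++;
        }
    }
    if ($stack !== []) {
        $flush($stack); // the leftover rows that never filled a batch
        $batches++;
    }
    return $batches;
}
```

The $flush callable would call buildMultiInsert() and execute the statement with whatever database layer is in use; using placeholders instead of quoted values also removes the need to escape the data by hand.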
I think your bottleneck is not reading the file, which is just a text file. Your bottleneck is the INSERT into the SQL table.
Try this: just comment out the line that actually does the insert and you will see the difference.
I had this same issue in the past, where I did exactly what you are doing: reading a CSV with 5+ million lines and inserting it into a MySQL table. The execution time was 60 hours, which is unrealistic.
My solution was to switch to another DB technology. I selected MongoDB and the execution time was reduced to 5 minutes. MongoDB performs really fast in this scenario and also has a tool called mongoimport that will allow you to import a CSV file directly from the command line.
Give it a try if the DB technology is not a limitation on your side.
Another solution would be splitting the huge CSV file into chunks and then running the same PHP script multiple times in parallel, with each instance taking care of the chunks with a specific prefix or suffix in the filename.
I don't know which specific OS you are using, but on Unix/Linux there is a command-line tool called split that will do that for you and will also add any prefix or suffix you want to the filenames of the chunks.

Split large Excel/Csv file to multiple files on PHP or Javascript

I have an Excel (file.xls) / CSV (file.csv) file that contains, or will contain, hundreds of thousands of entries, maybe even millions. Is it possible to split it into multiple files? Like file.xls into file1.xls, file2.xls, file3.xls and so on.
Are there any libraries to use? Is this possible in PHP? Or how about JavaScript?
And where can I specify how many rows to include in each file?
Thanks
Quick and dirty way of splitting a CSV file into several CSV files
$inputFile = 'input.csv';
$outputFile = 'output';
$splitSize = 10000;
$in = fopen($inputFile, 'r');
$out = null;
$rowCount = 0;
$fileCount = 1;
// fgetcsv() returns false at end of file, which ends the loop
while (($data = fgetcsv($in)) !== false) {
    // Start a new output file every $splitSize rows
    if (($rowCount % $splitSize) == 0) {
        if ($out !== null) {
            fclose($out);
        }
        $out = fopen($outputFile . $fileCount++ . '.csv', 'w');
    }
    fputcsv($out, $data);
    $rowCount++;
}
if ($out !== null) {
    fclose($out);
}
fclose($in);
Yes, it is possible to do that in PHP with CSV files. You basically iterate over the large file and chunk it every X rows, forwarding those rows to another file.
You can find out how to open the large CSV file as an iterator in this answer:
Answer to "how to extract data from csv file in php"
Then you need to chunk the iterator into parts of X rows each. That can be done as outlined here:
Answer to "Need some advice with PHP loop"
Only, instead of outputting multiple <ul>...</ul> HTML lists, you copy the rows into new files. That basically works as outlined in:
Answer to "How can I split a CSV file in PHP?"
However, this time you want to use the SplFileObject::fputcsv method. Make sure you use the latest stable PHP for this; otherwise you need to do it differently, see fputcsv().
If the first line of the original file contains column headers, you might also be interested in the following:
Answer to "Process CSV Into Array With Column Headings For Key"
It just shows some ways to extend / process the incoming file. You might not need the full abstraction done there; just keeping the first line around might already be enough.
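"Keeping the first line around" could look roughly like the sketch below. It uses plain fgetcsv()/fputcsv() rather than the SplFileObject approach from the linked answers; splitCsvWithHeader() is a hypothetical name, and a comma-separated file with the default enclosure and escape characters is assumed.

```php
<?php
// Hypothetical helper: split $inputFile into parts of at most $rowsPerFile
// data rows, repeating the header row at the top of every part.
// Returns the list of part file names that were created.
function splitCsvWithHeader(string $inputFile, string $outputPrefix, int $rowsPerFile): array
{
    $in = fopen($inputFile, 'r');
    if ($in === false) {
        throw new RuntimeException("Unable to open: $inputFile");
    }
    $parts = [];
    $header = fgetcsv($in, 0, ',', '"', '\\');
    if ($header === false) { // empty input: nothing to split
        fclose($in);
        return $parts;
    }
    $out = null;
    $rowsInPart = 0;
    while (($row = fgetcsv($in, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // blank line
        }
        // Open the next part when there is none yet or the current one is full
        if ($out === null || $rowsInPart >= $rowsPerFile) {
            if ($out !== null) {
                fclose($out);
            }
            $name = $outputPrefix . (count($parts) + 1) . '.csv';
            $out = fopen($name, 'w');
            if ($out === false) {
                throw new RuntimeException("Unable to create: $name");
            }
            fputcsv($out, $header, ',', '"', '\\');
            $parts[] = $name;
            $rowsInPart = 0;
        }
        fputcsv($out, $row, ',', '"', '\\');
        $rowsInPart++;
    }
    if ($out !== null) {
        fclose($out);
    }
    fclose($in);
    return $parts;
}
```

A part is only opened once there is a row to put into it, so no empty trailing file is created when the row count is an exact multiple of $rowsPerFile.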
I think you can also use "split by file size":
$part = 1;
$maxSize = 50; // 50 MB
$fopen = fopen('filename.csv', 'r') or die('ERROR');
while (($line = fgetcsv($fopen, 10000, ";")) !== FALSE) {
    // Append the row to the current part, then close it so its size on disk is up to date
    $ftowrite = fopen("Part_$part.csv", 'a');
    fputcsv($ftowrite, $line);
    fclose($ftowrite);
    clearstatcache();
    // Check the size of the part just written (same file name as above)
    $size = filesize("Part_$part.csv") / 1000000;
    if ($size > $maxSize) {
        $part++;
    }
}
fclose($fopen);

Crunch lots of files to generate stats file

I have a bunch of files I need to crunch and I'm worried about scalability and speed.
The filename and file data (only the first line) are stored in an array in RAM to create some statistics files later in the script.
The files must remain files and can't be put into a database.
The filenames are formatted in the following fashion:
Y-M-D-title.ext (where Y is Year, M is Month and D is Day)
I'm currently using glob to list all the files and create my array.
Here is a sample of the code creating the array "for year" or "month" (it's used in a function with only one parameter, $period):
[...]
function create_data_info($period=NULL){
$data = array();
$files = glob(ROOT_DIR.'/'.'*.ext');
$size = sizeOf($files);
$existing_title = array(); //Used so we can handle having the same titles two times at different date.
if (isSet($period)){
if ( "year" === $period ){
for ($i = 0; $i < $size; $i++) {
$info = extract_info($files[$i], $existing_file);
//Create the data array with all the data ordered by year/month/day
$data[(int)$info[5]][] = $info;
unset($info);
}
}elseif ( "month" === $period ){
for ($i = 0; $i < $size; $i++) {
$info = extract_info($files[$i], $existing_file);
$key = $info[5].$info[6];
//Create the data array with all the data ordered by year/month/day
$data[(int)$key][] = $info;
unset($info);
}
}
}
[...]
}
function extract_info($file, &$existing){
$full_path_file = $file;
$file = basename($file);
$info_file = explode("-", $file, 4);
$filetitle = explode(".", $info_file[3]);
$info[0] = $filetitle[0];
if (!isSet($existing[$info[0]]))
$existing[$info[0]] = -1;
$existing[$info[0]] += 1;
if ($existing[$info[0]] > 0)
//We have already found a post with this title
//the creation of the cache is based on info[4] data for the filename
//so we need to tune it
$info[0] = $info[0]."-".$existing[$info[0]];
$info[1] = $info_file[3];
$info[2] = $full_path_file;
$post_content = file(ROOT_DIR.'/'.$file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$info[3] = $post_content[0]; //first line of the files
unset($post_content);
$info[4] = filemtime(ROOT_DIR.'/'.$file);
$info[5] = $info_file[0]; //year
$info[6] = $info_file[1]; //month
$info[7] = $info_file[2]; //day
return $info;
}
So in my script I only call create_data_info(PERIOD) (PERIOD being "year", "month", etc.).
It returns an array filled with the info I need, and then I can loop through it to create my statistics files.
This process is done every time the PHP script is launched.
My question is: is this code optimal (certainly not), and what can I do to squeeze some juice from my code?
I don't know how I can cache this (or even if it's possible), as there is a lot of I/O involved.
I can change the tree structure if it would make a difference compared to a flat structure, but from what I found out with my tests, flat seems to be best.
I already thought about creating a little "booster" in C doing only the crunching, but since it's I/O bound, I don't think it would make a huge difference, and the application would be a lot less compatible for shared hosting users.
Thank you very much for your input, I hope I was clear enough here. Let me know if you need clarification (and forgive my English mistakes).
To begin with, you should use DirectoryIterator instead of the glob function. When it comes to scandir vs opendir vs glob, glob is as slow as it gets.
Also, when you are dealing with a large number of files you should try to do all your processing inside one loop; PHP function calls are rather slow.
I see you are using unset($info); yet in every loop iteration $info gets a new value anyway. PHP does its own garbage collection, if that's your concern. unset is a language construct, not a function, and should be pretty fast, but when it is used where it isn't needed it still makes the whole thing a bit slower.
You are passing $existing by reference. Is there a practical reason for this? In my experience references make things slower.
Lastly, your script seems to do a lot of string processing. You might want to consider some kind of "serialize data and base64 encode/decode" solution, but you should benchmark that specifically; it might be faster, it might be slower, depending on your whole code. (My idea is that serialize/unserialize MIGHT run faster as these are native PHP functions, while custom functions doing string processing are slower.)
My answer was not very I/O related, but I hope it was helpful.
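For what it's worth, the DirectoryIterator suggestion combined with reading only one line per file could be sketched like this. firstLines() is a hypothetical helper, not the asker's extract_info(); unlike file() with FILE_SKIP_EMPTY_LINES it returns the literal first line even if that line is empty, and whether it actually beats glob() should be benchmarked on the real directory.

```php
<?php
// Hypothetical helper: map file name => first line for every file in $dir
// with the given extension, reading a single line per file with fgets()
// instead of loading each whole file with file().
function firstLines(string $dir, string $extension): array
{
    $result = [];
    foreach (new DirectoryIterator($dir) as $entry) {
        if (!$entry->isFile() || $entry->getExtension() !== $extension) {
            continue;
        }
        $handle = fopen($entry->getPathname(), 'r');
        if ($handle === false) {
            continue;
        }
        $line = fgets($handle);
        fclose($handle);
        $result[$entry->getFilename()] = $line === false ? '' : rtrim($line, "\r\n");
    }
    // DirectoryIterator returns entries in no particular order
    ksort($result);
    return $result;
}
```

Because the names start with Y-M-D, sorting by file name also puts the entries in date order.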
