I have a large CSV file containing every postcode in the UK (2,558,797 records). I need to import it, sort the data into a multi-dimensional array, and then save the contents of that array to the database.
The problem is that if I try to process the whole file I get an "allowed memory size exhausted" error. I can handle about 128,000 records in any one go. Is there a way I can split the task up so that I can process the whole file? I've looked at fseek, but that works in bytes rather than rows, and I don't know how many bytes 128,000 rows take up.
How can I process the entire file without hitting the memory limit? I've been trying to get this working for the last 6 hours and I've not had any joy.
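Something like the following is what I have in mind: remember the byte offset with ftell() at the end of each batch so the next run can fseek() straight back to it. This is only a rough, untested sketch, and the batch size and offset-file path are made up:
$batchSize = 100000; // made-up batch size that fits in memory
$offsetFile = '/tmp/postcode_import.offset'; // made-up location to remember where we stopped

$fp = fopen('postcode.csv', 'r');
if (file_exists($offsetFile)) {
    fseek($fp, (int) file_get_contents($offsetFile)); // resume where the previous batch finished
}

$rows = 0;
while ($rows < $batchSize && ($line = fgetcsv($fp)) !== FALSE) {
    // ... sort $line into the array / write this batch to the database ...
    $rows++;
}

file_put_contents($offsetFile, ftell($fp)); // a byte position, not a row count
fclose($fp);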
This is my code so far:
// This script takes a long time to run
ini_set('max_execution_time', 300);
// First we need to verify the files that have been uploaded.
$file = Validation::factory($_FILES);
$file->rule('import_update_file', 'Upload::not_empty');
$file->rule('import_update_file', 'Upload::valid');
$file->rule('import_update_file', 'Upload::size', array(':value', '8M'));
$file->rule('import_update_file', 'Upload::type', array(':value', array('zip')));
if (Request::current()->method() == Request::POST && $file->check())
{
$file_name = date('Y-m-d-').'update.zip';
$dir = Upload::save($file['import_update_file'], $file_name);
if ($dir === false)
{
throw new Kohana_Exception('Unable to save uploaded file!', NULL, 1);
}
$zip = new ZipArchive;
if (($res = $zip->open($dir)) !== TRUE)
{
throw new Kohana_Exception('Unable to open uploaded zip file! Error: '.$res, NULL, 1);
}
$zip->extractTo(realpath(Upload::$default_directory), array('localauthority.csv', 'postcode.csv'));
$zip->close();
if( ! file_exists(realpath(Upload::$default_directory).DIRECTORY_SEPARATOR.'localauthority.csv') OR
! file_exists(realpath(Upload::$default_directory).DIRECTORY_SEPARATOR.'postcode.csv'))
{
throw new Kohana_Exception('Missing file from uploaded zip archive! Expected localauthority.csv and postcode.csv', NULL, 1);
}
$local_authorities = Request::factory('local_authority/read')->execute();
// We start by combining the files, sorting the postcodes and local authority names under the local authority codes.
$update = array();
if (($fp = fopen(realpath(Upload::$default_directory).DIRECTORY_SEPARATOR.'localauthority.csv', 'r')) === FALSE)
{
throw new Kohana_Exception('Unable to open localauthority.csv file.', NULL, 1);
}
while (($line = fgetcsv($fp)) !== FALSE)
{
// Column 0 = Local Authority Code
// Column 1 = Local Authority Name
$update[$line[0]] = array(
'name' => $line[1],
'postcodes' => array()
);
}
fclose($fp);
unlink(realpath(Upload::$default_directory).DIRECTORY_SEPARATOR.'localauthority.csv');
if (($fp = fopen(realpath(Upload::$default_directory).DIRECTORY_SEPARATOR.'postcode.csv', 'r')) === FALSE)
{
throw new Kohana_Exception('Unable to open postcode.csv file.', NULL, 1);
}
$i = 1;
while (($line = fgetcsv($fp)) !== FALSE && $i <= 128000)
{
$postcode = trim(substr($line[0], 0, 4));
echo "Line ".sprintf("%03d", $i++) . ": Postcode: ".$line[0]."; Shortened Postcode: ".$postcode."; LAC: ".$line[1]."<br>";
// Column 0 = Postcode
// Column 1 = Local Authority Code
if ( ! array_key_exists($line[1], $update))
{
echo $line[1]." not in array<br>";
continue;
}
if ( ! in_array($postcode, $update[$line[1]]['postcodes']))
{
$update[$line[1]]['postcodes'][] = $postcode;
}
}
fclose($fp);
unlink(realpath(Upload::$default_directory).DIRECTORY_SEPARATOR.'postcode.csv');
echo '<pre>'; var_dump($update); echo '</pre>';
}
else
{
throw new Kohana_Exception('Invalid file uploaded!', NULL, 1);
}
We currently open a CSV file from our server and then do some stuff with the data, like this:
$CSVfile = fopen('filename.csv', "r");
if($CSVfile !== FALSE) {
$count = 0;
while(! feof($CSVfile)) {
$data = fgetcsv($CSVfile, 5000, ",");
if ($count > 0 && !empty($data)) {
// Do something
}
}
}
We need to change the system as the file will now be hosted on an external server, so we need to SFTP into it to retrieve the file. I've installed phpseclib and got the connection working, but it just keeps echoing the file contents to the screen. I have it set up like this:
include 'vendor/autoload.php';
$sftp = new \phpseclib\Net\SFTP('SERVER');
if (!$sftp->login('USERNAME', 'PASSWORD')) {
exit('Login Failed');
} else {
$file = $sftp->fetch('FILE_LOCATION');
}
$CSVfile = fopen($file, "r");
if($CSVfile !== FALSE) {
$count = 0;
while(! feof($CSVfile)) {
$data = fgetcsv($CSVfile, 5000, ",");
if ($count > 0 && !empty($data)) {
// Do Something
}
}
}
How do I get the new system to read the file contents and do something with it rather than just showing all the file contents?
As @verjas commented already, there's no fetch method in phpseclib.
If you want to download a file to a string, use SFTP::get:
$contents = $sftp->get("/remote/path/file.csv");
You can then use str_getcsv to parse the contents. According to a contributed note in the function's documentation, this should do it:
$data = str_getcsv($contents, "\n"); // parse the rows
foreach ($data as $line)
{
$row_data = str_getcsv($line); // parse the items in rows
}
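If the file is too big to hold in a string, another option is to pass a local path as the second argument to SFTP::get(), which saves the download to that file instead of returning a string, and then stream it with fgetcsv as before. A sketch:
$local = tempnam(sys_get_temp_dir(), 'csv'); // temporary local copy
if (!$sftp->get('FILE_LOCATION', $local)) {  // second argument = local destination path
    exit('Download failed');
}
if (($CSVfile = fopen($local, 'r')) !== FALSE) {
    while (($data = fgetcsv($CSVfile, 5000, ',')) !== FALSE) {
        // Do something with $data
    }
    fclose($CSVfile);
}
unlink($local); // remove the temporary copy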
I have an issue in my PHP script which I don't understand. I know there are several questions about this issue, but none of them fits my case.
I have one input file, delimited by tabs, named testfile.txt.
From this file I create a new file named result.txt, containing the contents of columns 0 and 7 of testfile.txt.
When I execute my PHP script, I get this error:
Notice: Undefined offset: 7
What I don't understand is that result.txt is created correctly, with the data from columns 0 and 7 of testfile.txt. If I do:
echo $dataFromTestFile[7];
the contents of column 7 are printed.
So I don't really understand why I get this notice, or how to remove it.
Here's my php script:
<?php
if (false !== ($ih = fopen('/opt/lampp/htdocs/ngs/tmp/testfile.txt', 'r'))) {
$oh = fopen('/opt/lampp/htdocs/ngs/tmp/result.txt', 'w');
while (false !== ($dataFromTestFile = fgetcsv($ih,0,"\t"))) {
// this is where I build my new row
$outputData = array($dataFromTestFile[0], $dataFromTestFile[7]);
fputcsv($oh, $outputData);
//echo $dataFromTestFile[7];
}
fclose($ih);
fclose($oh);
}
?>
Sample data of testfile.txt:
Input Errors AccNo Genesymbol Variant Reference Coding Descr. Coding
aaa ddd fdfd dfdf fefefd ref1 fdfdfd fdfdf dfdfde
I suspect this is the line that's causing the error:
$outputData = array($dataFromTestFile[0], $dataFromTestFile[7]);
You are trying to use array elements at a specific index without checking whether they actually exist.
Also, you are writing an array to the result file; did you mean to create comma-separated values in that file?
Try this:
$source = '/opt/lampp/htdocs/ngs/tmp/testfile.txt';
$result = '/opt/lampp/htdocs/ngs/tmp/result.txt';
if (($handle = fopen($source, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
if (isset($data[0]) && isset($data[7])) {
file_put_contents($result, $data[0] .','. $data[7] ."\r\n", FILE_APPEND); // append, otherwise each row overwrites the previous one
}
}
fclose($handle);
}
Alternatively, you could write the result as a CSV like this:
$sourceFile = '/opt/lampp/htdocs/ngs/tmp/testfile.txt';
$resultFile = '/opt/lampp/htdocs/ngs/tmp/result.txt';
$resultData = array();
// Parse source file
if (($handle = fopen($sourceFile, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
if (isset($data[0]) && isset($data[7])) {
$resultData[] = array($data[0], $data[7]);
}
}
fclose($handle);
}
// Write result file
if (sizeof($resultData)) {
$h = @fopen($resultFile, 'w');
if (!$h) {
exit('Failed to open result file for writing.');
}
foreach ($resultData as $resultRow) {
fputcsv($h, $resultRow, ',', '"');
}
fclose($h);
}
Make sure column 7 exists in your testfile.txt. Since indexes start from zero, it may actually be column number 6. You can also
var_dump($dataFromTestFile)
to inspect the contents of the variable; the array keys and values might be of interest for your issue.
I have a couple of huge files (11 MB and 54 MB) that I need to read to process the rest of the script. Currently I'm reading the files and storing them in an array like so:
$pricelist = array();
$fp = fopen($DIR.'datafeeds/pricelist.csv','r');
while (($line = fgetcsv($fp, 0, ",")) !== FALSE) {
if ($line) {
$pricelist[$line[2]] = $line;
}
}
fclose($fp);
...but I'm constantly getting memory-limit errors from my web host. How do I read the file more efficiently?
I don't need to store everything; I already have the keyword, which exactly matches the array key $line[2], and I only need to read that one line.
If you know the key, why don't you filter by it? You can also use the memory_get_usage() function to see how much memory is allocated after you fill your $pricelist array.
echo memory_get_usage() . "\n";
$yourKey = 'some_key';
$pricelist = array();
$fp = fopen($DIR.'datafeeds/pricelist.csv','r');
while (($line = fgetcsv($fp, 0, ",")) !== FALSE) {
if (isset($line[2]) && $line[2] == $yourKey) {
$pricelist[$line[2]] = $line;
break;
/* If there is a possiblity to have multiple lines
we can store each line in a separate array element
$pricelist[$line[2]][] = $line;
*/
}
}
fclose($fp);
echo memory_get_usage() . "\n";
You can try this (I have not checked whether it works properly):
$data = explode("\n", shell_exec('cat filename.csv | grep KEYWORD'));
You will get all the lines containing the keyword, each as an element of the array.
Let me know if it helps.
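A slightly more careful variant of the same idea, with the shell arguments escaped and each matching line parsed as CSV (a sketch; the keyword is hypothetical):
$keyword = 'some_key'; // hypothetical keyword
$output = shell_exec('grep ' . escapeshellarg($keyword) . ' ' . escapeshellarg('filename.csv'));
foreach (explode("\n", trim((string) $output)) as $line) {
    if ($line === '') {
        continue; // skip blank lines
    }
    $row = str_getcsv($line); // split the matching line into its CSV fields
    // use $row ...
}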
I agree with what user2864740 said: "The problem is the in-memory usage caused by the array itself and is not about "reading" the file".
My solution is:
Split your $priceList array into chunks
Load only one chunk into memory at a time
Keep the other chunks in an intermediate file
N.B.: I did not test what I've written.
<?php
define ("MAX_LINE", 10000) ;
define ("CSV_SEPERATOR", ',') ;
function intermediateBuilder ($csvFile, $intermediateCsvFile) {
$pricelist = array ();
$currentLine = 0;
$totalSerializedArray = 0;
if (!is_file($csvFile)) {
throw new Exception ("this is not a regular file: " . $csvFile);
}
$fp = fopen ($csvFile, 'r');
if (!$fp) {
throw new Exception ("can not read this file: " . $csv);
}
while (($line = fgetcsv($fp, 0, CSV_SEPERATOR)) !== FALSE) {
if ($line) {
$pricelist[$line[2]] = $line;
}
if (++$currentLine == MAX_LINE) {
$fp2 = fopen ($intermediateCsvFile, 'a');
if (!$fp2) throw new Exception ("can not write in this intermediate csv file: " . $intermediateCsvFile);
fputs ($fp2, serialize ($pricelist) . "\n");
fclose ($fp2);
unset ($pricelist);
$pricelist = array ();
$currentLine = 0;
$totalSerializedArray++;
}
}
// Flush any remaining rows that did not fill a complete chunk
if (count($pricelist)) {
$fp2 = fopen ($intermediateCsvFile, 'a');
if (!$fp2) throw new Exception ("can not write in this intermediate csv file: " . $intermediateCsvFile);
fputs ($fp2, serialize ($pricelist) . "\n");
fclose ($fp2);
$totalSerializedArray++;
}
fclose($fp);
return $totalSerializedArray;
}
/**
* @param array : by reference unserialized array
* @param integer : the array number to read from the intermediate csv file; start from index 1
* @param string : the (relative|absolute) path/name of the intermediate csv file
* @throws Exception
*/
function loadArray (&$array, $arrayNumber, $intermediateCsvFile) {
$currentLine = 0;
$fp = fopen ($intermediateCsvFile, 'r');
if (!$fp) {
throw new Exception ("can not read this intermediate csv file: " . $intermediateCsvFile);
}
// Each chunk was written as one serialized line, so read it back with fgets() rather than fgetcsv()
while (($line = fgets($fp)) !== FALSE) {
if (++$currentLine == $arrayNumber) {
fclose ($fp);
$array = unserialize (rtrim ($line, "\n"));
return;
}
}
throw new Exception ("the array number argument [" . $arrayNumber . "] is invalid (out of bounds)");
}
Usage example
try {
$totalSerializedArray = intermediateBuilder ($DIR . 'datafeeds/pricelist.csv',
$DIR . 'datafeeds/intermediatePricelist.csv');
$priceList = array () ;
$arrayNumber = 1;
loadArray ($priceList,
$arrayNumber,
$DIR . 'datafeeds/intermediatePricelist.csv');
if (!array_key_exists ($key, $priceList)) {
if (++$arrayNumber > $totalSerializedArray) $arrayNumber = 1;
loadArray ($priceList,
$arrayNumber,
$DIR . 'datafeeds/intermediatePricelist.csv');
// In a real run you would keep loading chunks until the key is found or every chunk has been checked
}
}
catch (Exception $e) {
// TODO : log the error ...
}
You can drop the
if ($line) {
That only repeats the check from the loop condition. If your file is 54 MB and you are going to retain every line from the file as an array, plus the key from column 3 (which is hashed for lookup), I could see that requiring 75-85 MB to store it all in memory. That isn't much; most WordPress or Magento pages using widgets run 150-200 MB. But if your host is set low, it could be a problem.
You can try filtering out some rows by changing the if($line) to something like if($line[1] == 'book') to reduce how much you store. But the only sure way to handle storing that much content in memory is to have that much memory available to the script.
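A sketch of that filtering idea (the 'book' value and the column index are only examples):
$pricelist = array();
$fp = fopen($DIR.'datafeeds/pricelist.csv', 'r');
while (($line = fgetcsv($fp, 0, ',')) !== FALSE) {
    // Keep only the rows you actually need; column 1 and 'book' are example values
    if (isset($line[1]) && $line[1] == 'book') {
        $pricelist[$line[2]] = $line;
    }
}
fclose($fp);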
You can try setting a bigger memory limit with this; change the limit to whatever you need.
ini_set('memory_limit', '2048M');
But it also depends on how you intend to use the script.
I want to upload a CSV file to the database, so I am using a WordPress plug-in to do that. The original file is 350 MB, but I copied some of the data into a new file, which is now 14 MB and has 66,872 lines in total.
When I try to upload that file, the script stops working after loading 63,296 lines of data into the array. I checked the forums and most say it is a memory_limit issue. I even changed memory_limit to 2000M, but it didn't help.
Here is the code from the plugin:
function csv_file_data($file, $delim) {
$this->checkUploadDirPermission ();
ini_set ( "auto_detect_line_endings", true );
$data_rows = array ();
$resource = fopen ( $file, 'r' );
//print $file;
$init = 0;
while ( $keys = fgetcsv ( $resource, '', $this->delim, '"' ) ) {
print $keys;
print $init;
if ($init == 0) {
$this->headers = $keys;
} else {
array_push ( $data_rows, $keys );
}
$init ++;
}
//print_r($data_rows);
print $init;
fclose ( $resource );
ini_set ( "auto_detect_line_endings", false );
return $data_rows;
}
You should not load the entire file into memory.
Here is a correct example:
if (($handle = fopen("test.csv", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 2000, ",")) !== FALSE) {
// Do your staff with $data array
}
fclose($handle);
}
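If the rows ultimately need to end up in the database, the plugin's csv_file_data() could be adapted along these lines, handing each row to a callback as it is read instead of collecting them all (a sketch; the $callback, e.g. a single-row database insert, is hypothetical):
function csv_file_data_streaming($file, $delim, $callback) {
    ini_set("auto_detect_line_endings", true);
    $resource = fopen($file, 'r');
    $headers = fgetcsv($resource, 0, $delim, '"'); // first line holds the column headers
    while (($keys = fgetcsv($resource, 0, $delim, '"')) !== FALSE) {
        $callback($headers, $keys); // process one row, then forget it
    }
    fclose($resource);
    ini_set("auto_detect_line_endings", false);
}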
This is what I have so far:
<?php
$file = "18201010338AM16390621000846.png";
$test = file_get_contents($file, FILE_BINARY);
echo str_replace("\n","<br>",$test);
?>
The output is sorta what I want, but I really only need lines 3-7 (inclusive). This is what the output looks like now: http://silentnoobs.com/pbss/collector/test.php. I am trying to get the data from "PunkBuster Screenshot (±) AAO Bridge Crossing" to "Resulting: w=394 X h=196 sample=2". I think it'd be fairly straightforward to read through the file and store each line in an array; line[0] would need to be "PunkBuster Screenshot (±) AAO Bridge Crossing", and so on. All those lines are subject to change, so I can't just search for something fixed.
I've tried for a few days now, and it doesn't help much that I'm poor at php.
The PNG file format defines that a PNG document is split up into multiple chunks of data. You must therefore navigate your way to the chunk you desire.
The data you want to extract seems to be stored in a tEXt chunk. I've written the following class to allow you to extract chunks from PNG files.
class PNG_Reader
{
private $_chunks;
private $_fp;
function __construct($file) {
if (!file_exists($file)) {
throw new Exception('File does not exist');
}
$this->_chunks = array ();
// Open the file
$this->_fp = fopen($file, 'rb'); // binary mode, so the byte stream is not mangled on Windows
if (!$this->_fp)
throw new Exception('Unable to open file');
// Read the magic bytes and verify
$header = fread($this->_fp, 8);
if ($header != "\x89PNG\x0d\x0a\x1a\x0a")
throw new Exception('Is not a valid PNG image');
// Loop through the chunks. Byte 0-3 is length, Byte 4-7 is type
$chunkHeader = fread($this->_fp, 8);
while ($chunkHeader) {
// Extract length and type from binary data
$chunk = @unpack('Nsize/a4type', $chunkHeader);
// Store position into internal array
if (!isset($this->_chunks[$chunk['type']]))
$this->_chunks[$chunk['type']] = array ();
$this->_chunks[$chunk['type']][] = array (
'offset' => ftell($this->_fp),
'size' => $chunk['size']
);
// Skip to next chunk (over body and CRC)
fseek($this->_fp, $chunk['size'] + 4, SEEK_CUR);
// Read next chunk header
$chunkHeader = fread($this->_fp, 8);
}
}
function __destruct() { fclose($this->_fp); }
// Returns all chunks of said type
public function get_chunks($type) {
if (!isset($this->_chunks[$type]))
return null;
$chunks = array ();
foreach ($this->_chunks[$type] as $chunk) {
if ($chunk['size'] > 0) {
fseek($this->_fp, $chunk['offset'], SEEK_SET);
$chunks[] = fread($this->_fp, $chunk['size']);
} else {
$chunks[] = '';
}
}
return $chunks;
}
}
You may use it as such to extract your desired tEXt chunk as such:
$file = '18201010338AM16390621000846.png';
$png = new PNG_Reader($file);
$rawTextData = $png->get_chunks('tEXt');
$metadata = array();
foreach($rawTextData as $data) {
$sections = explode("\0", $data);
if(count($sections) > 1) {
$key = array_shift($sections);
$metadata[$key] = implode("\0", $sections);
} else {
$metadata[] = $data;
}
}
<?php
$fp = fopen('18201010338AM16390621000846.png', 'rb');
$sig = fread($fp, 8);
if ($sig != "\x89PNG\x0d\x0a\x1a\x0a")
{
print "Not a PNG image";
fclose($fp);
die();
}
while (!feof($fp))
{
$data = unpack('Nlength/a4type', fread($fp, 8));
if ($data['type'] == 'IEND') break;
if ($data['type'] == 'tEXt')
{
list($key, $val) = explode("\0", fread($fp, $data['length']));
echo "<h1>$key</h1>";
echo nl2br($val);
fseek($fp, 4, SEEK_CUR);
}
else
{
fseek($fp, $data['length'] + 4, SEEK_CUR);
}
}
fclose($fp);
?>
It assumes a basically well formed PNG file.
I ran into this problem a few days ago, so I made a library to extract the metadata (Exif, XMP and GPS) of a PNG in PHP, 100% native. I hope it helps. :) PNGMetadata
How about:
http://www.php.net/manual/en/function.getimagesize.php
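For the image dimensions that could be as simple as the sketch below, although getimagesize() does not expose the tEXt chunks themselves:
$info = getimagesize('18201010338AM16390621000846.png');
echo $info[0] . ' x ' . $info[1] . ' (' . $info['mime'] . ')'; // width, height and MIME type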