PHP addslashes failing when files are too large

I have a set of PHP scripts that load files into a database for an auto-updater program to use later. The program works fine until a file exceeds the 10MB range. The rough idea of the script is that it pulls files from disk in a specific location, and loads them into the database. This allows us to store in source control, and update in sets as needed.
Initially, I thought that I was hitting a limit on the database SQL based on my initial searches. However, after further testing, it seems to be something PHP specific. I checked the Apache error log, but I did not see any errors for this script or the includes. Once the PHP script reaches the addslashes function, the script seems to stop executing. (I added echo statements between each script statement.)
I'm hoping that it is something simple that I am missing, but I couldn't find anything related to addslashes failing after several hours of searching online.
Any ideas?
Thanks in advance.
mysql_connect('localhost', '****', '****') or die('Could not connect to the database');
mysql_select_db('****') or die('Could not select database');

function get_filelist($path)
{
    return get_filelist_recursive("/build/".$path);
}

function get_filelist_recursive($path)
{
    $i = 0;
    $list = array();
    if( !is_dir($path) )
        return get_filedetails($path);
    if ($handle = opendir($path))
    {
        while (false !== ($file = readdir($handle)))
        {
            if($file!='.' && $file!='..' && $file[0]!='.')
            {
                if( is_dir($path.'/'.$file) )
                {
                    $list = $list + get_filelist_recursive($path.'/'.$file);
                }
                else
                {
                    $list = $list + get_filedetails($path.'/'.$file);
                }
            }
        }
        closedir($handle);
        return $list;
    }
}

function get_filedetails($path)
{
    $item = array();
    $details = array();
    $details[0] = filesize($path);
    $details[1] = sha1_file($path);
    $item[$path] = $details;
    return $item;
}

$productset = mysql_query("select * from product where status is null and id=".$_REQUEST['pid']);
$prow = mysql_fetch_assoc($productset);
$folder = "product/".$prow['name'];
$fileset = get_filelist($folder);

while (list($key, $val) = each($fileset))
{
    $fh = fopen($key, 'rb') or die("Cannot open file");
    $data = fread($fh,$val[0]);
    $data = addslashes($data);
    fclose($fh);
    $filename = substr( $key, strlen($folder) + 1 );
    $query = "insert into file(name,size,hash,data,manifest_id) values('".$filename."','".$val[0]."','".$val[1]."','".$data."','".$prow['manifest_id']."')";
    $retins = mysql_query($query);
    if( $retins == false )
        echo "BUILD FAILED: $key, $val[0] $val[1].<br>\n";
}

header("Location: /patch/index.php?pid=".$_REQUEST['pid']);

Don't use addslashes; use mysql_real_escape_string in this case. You are also likely hitting the max_allowed_packet limit by trying to insert such large files; the default value is 1MB.
If you use mysqli (which is recommended), you can indicate that the column is binary and it will send the query in chunks.
Also make sure you aren't hitting any PHP memory limits or the maximum execution time.
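For reference, here is a minimal sketch of that mysqli approach with a prepared statement and a chunked blob transfer. It reuses the question's table and variable names ($key, $val, $filename, $prow), but the connection details and the 1 MB chunk size are assumptions:
$mysqli = new mysqli('localhost', '****', '****', '****');
$stmt = $mysqli->prepare('insert into file(name,size,hash,data,manifest_id) values(?,?,?,?,?)');
// 'b' marks the data column as a blob; bind NULL here and stream the content below.
$null = null;
$stmt->bind_param('sisbi', $filename, $val[0], $val[1], $null, $prow['manifest_id']);
$fh = fopen($key, 'rb') or die('Cannot open file');
while (!feof($fh)) {
    // Each chunk still has to fit within max_allowed_packet on the server.
    $stmt->send_long_data(3, fread($fh, 1024 * 1024));
}
fclose($fh);
$stmt->execute() or die('BUILD FAILED: ' . $key);
$stmt->close();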

Related

Reading > 1GB GZipped CSV files from external FTP server

In a scheduled task of my Laravel application I'm reading several large gzipped CSV files, ranging from 80mb to 4gb on an external FTP server, containing products which I store in my database based on a product attribute.
I loop through a list of product feeds that I want to import but each time a fatal error is returned: 'Allowed memory size of 536870912 bytes exhausted'. I can bump up the length parameter of the fgetcsv function from 1000 to 100000 which solves the problem for the smaller files (< 500mb) but for the larger files it will return the fatal error.
Is there a solution that allows me to either download or unzip the .csv.gz files, reading the lines (by batch or one by one) and inserting the products into my database without running out of memory?
$feeds = [
    "feed_baby-mother-child.csv.gz",
    "feed_computer-games.csv.gz",
    "feed_general-books.csv.gz",
    "feed_toys.csv.gz",
];

foreach ($feeds as $feed) {
    $importedProducts = array();
    $importedFeedProducts = 0;

    $csvfile = 'compress.zlib://ftp://' . config('app.ftp_username') . ':' . config('app.ftp_password') . '@' . config('app.ftp_host') . '/' . $feed;

    if (($handle = fopen($csvfile, "r")) !== FALSE) {
        $row = 1;
        $header = fgetcsv($handle, 1, "|");

        while (($data = fgetcsv($handle, 1000, "|")) !== FALSE) {
            if($row == 1 || array(null) !== $data){ $row++; continue; }

            $product = array_combine($header, $data);
            $importedProducts[] = $product;
        }

        fclose($handle);
    } else {
        echo 'Failed to open: ' . $feed . PHP_EOL;
        continue;
    }

    // start inserting products into the database below here
}
The problem is probably not the gzip stream itself. You could of course download the file first and process it afterwards, but that would keep the same issue, because you are loading all the products into a single array in memory:
$importedProducts[] = $product;
You could comment this line out and check whether that stops you from hitting your memory limit.
Usually I would create a method such as addProduct($product) to handle this in a memory-safe way. From there you can decide on a maximum number of products to buffer before doing a bulk insert; to get good speed I usually use something between 1000 and 5000 rows.
For example
class ProductBatchInserter
{
    private $maxRecords = 1000;
    private $records = [];

    public function addProduct($record)
    {
        $this->records[] = $record;
        if (count($this->records) >= $this->maxRecords) {
            $this->flush();
        }
    }

    // Call flush() once after the loop so the last partial batch is inserted too.
    public function flush()
    {
        if (!empty($this->records)) {
            EloquentModel::insert($this->records);
            $this->records = [];
        }
    }
}
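Used inside the feed loop, the buffered array assignment would then be replaced by something like this (a sketch built on the hypothetical class above):
$inserter = new ProductBatchInserter();
while (($data = fgetcsv($handle, 1000, "|")) !== FALSE) {
    // Buffer the row; the inserter writes a batch to the database every 1000 rows.
    $inserter->addProduct(array_combine($header, $data));
}
// Insert whatever is still buffered once the loop is done.
$inserter->flush();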
However, I usually don't implement this as a single class; in my projects I tend to integrate it as a BulkInsertable trait that can be used on any Eloquent model. But this should give you a direction for avoiding memory limits.
Alternatively, the easier but significantly slower option is to insert each row right where you currently assign it to the array. That will put a ridiculous load on your database, though, and will be very slow.
If the GZIP stream is the bottleneck
I don't expect this to be the issue, but if it were, you could use gzopen()
https://www.php.net/manual/en/function.gzopen.php
and pass the handle it returns to fgetcsv. I expect the compress.zlib:// stream wrapper you are already using does the same thing for you, but if not, I mean something like this:
$input = gzopen('input.csv.gz', 'r');
while (($row = fgetcsv($input)) !== false) {
// do something memory safe, like suggested above
}
If you need to download the file anyway, there are many ways to do it, but make sure you use something memory safe such as fopen()/fgets() or a Guzzle stream, and don't use something like file_get_contents(), which loads the whole file into memory.
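If you do go the download-first route, a minimal memory-safe sketch with plain stream functions could look like this (the FTP URL and the local path are placeholders):
// Copy the remote file to disk in small chunks instead of buffering it all in memory.
$remote = fopen('ftp://user:pass@ftp.example.com/feed_toys.csv.gz', 'rb');
$local = fopen('/tmp/feed_toys.csv.gz', 'wb');
while (!feof($remote)) {
    fwrite($local, fread($remote, 8192));
}
fclose($remote);
fclose($local);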

PHP memory exhaused while using array_combine in foreach loop

I'm having trouble when trying to use array_combine in a foreach loop. It ends up with an error:
PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 85 bytes) in
Here is my code:
$data = array();
$csvData = $this->getData($file);

if ($columnNames) {
    $columns = array_shift($csvData);
    foreach ($csvData as $keyIndex => $rowData) {
        $data[$keyIndex] = array_combine($columns, array_values($rowData));
    }
}

return $data;
The source CSV file I've used has approximately 1,000,000 rows. For this row
$csvData = $this->getData($file)
I was using a while loop to read the CSV and assign it to an array, and that works without any problem. The trouble comes from array_combine and the foreach loop.
Do you have any idea how to resolve this, or simply a better solution?
UPDATED
Here is the code to read the CSV file (using while loop)
$data = array();

if (!file_exists($file)) {
    throw new Exception('File "' . $file . '" do not exists');
}

$fh = fopen($file, 'r');
while ($rowData = fgetcsv($fh, $this->_lineLength, $this->_delimiter, $this->_enclosure)) {
    $data[] = $rowData;
}
fclose($fh);

return $data;
UPDATED 2
The code above works without any problem with a CSV file of up to roughly 20,000-30,000 rows. From 50,000 rows and up, the memory is exhausted.
You're in fact keeping (or trying to keep) two distinct copies of the whole dataset in memory. First you load the whole CSV data into memory using getData(), and then you copy that data into the $data array by looping over it and creating a new array.
You should use stream-based reading when loading the CSV data, so that only one copy of the data set is kept in memory. If you're on PHP 5.5+ (which you definitely should be, by the way), this is as simple as changing your getData method to look like this:
protected function getData($file) {
    if (!file_exists($file)) {
        throw new Exception('File "' . $file . '" do not exists');
    }

    $fh = fopen($file, 'r');
    while ($rowData = fgetcsv($fh, $this->_lineLength, $this->_delimiter, $this->_enclosure)) {
        yield $rowData;
    }
    fclose($fh);
}
This makes use of a so-called generator, which is a PHP >= 5.5 feature. The rest of your code should continue to work, since the inner workings of getData should be transparent to the calling code (that is only half of the truth: array_shift() won't work on a generator, which is what the update below handles).
UPDATE to explain how extracting the column headers works now.
$data = array();
$csvData = $this->getData($file);

if ($columnNames) { // don't know what this one does exactly
    $columns = null;
    foreach ($csvData as $keyIndex => $rowData) {
        if ($keyIndex === 0) {
            $columns = $rowData;
        } else {
            $data[$keyIndex/* -1 if you need 0-index */] = array_combine(
                $columns,
                array_values($rowData)
            );
        }
    }
}

return $data;

How can i upload a csv file in mysql database using multithreads? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have a csv file, containing millions of email addresses which I want to upload fast into a mysql database with PHP.
Right now I'm using a single threaded program which takes too much time to upload.
//get the csv file
$file = $_FILES['csv']['tmp_name'];
$handle = fopen($file,"r");

//loop through the csv file and insert into database
do {
    if ($data[0]) {
        $expression = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";
        if (preg_match($expression, $data[0])) {
            $query = mysql_query("SELECT * FROM `postfix`.`recipient_access` where recipient='".$data[0]."'");
            mysql_query("SET NAMES utf8");
            $fetch = mysql_fetch_array($query);
            if($fetch['recipient'] != $data[0]){
                $query = mysql_query("INSERT INTO `postfix`.`recipient_access`(`recipient`, `note`) VALUES('".addslashes($data[0])."','".$_POST['note']."')");
            }
        }
    }
} while ($data = fgetcsv($handle,1000,",","'"));
First of all, I can't stress this enough: fix your indentation - it will make life easier for everyone.
Secondly, the answer depends a lot on the actual bottlenecks you are encountering:
Regular expressions are very slow, especially when they're in a loop.
Databases tend to work well either for WRITES or for READS but not BOTH: try decreasing the number of queries beforehand (see the sketch after this list).
It stands to reason that the less PHP code in your loop, the faster it will work. Consider reducing the number of conditions, for instance.
For the record, your code is not safe against MySQL injection: filter $_POST beforehand. [*]
[*] Speaking of which, it's faster to access a variable than an array index like $_POST.
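As an illustration of cutting down the queries, here is a sketch that drops the per-row SELECT entirely; it assumes recipient has (or can be given) a UNIQUE index so the database rejects duplicates on its own:
// Assumes a UNIQUE index on `recipient`; INSERT IGNORE then skips duplicates
// without a separate SELECT for every row.
$recipient = mysql_real_escape_string($data[0]);
$note = mysql_real_escape_string($_POST['note']);
mysql_query("INSERT IGNORE INTO `postfix`.`recipient_access` (`recipient`, `note`) VALUES ('" . $recipient . "', '" . $note . "')");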
You can simulate multithreading by having your main program split the huge CSV file into smaller chunks and run each chunk in a different process.
common.php
class FileLineFinder {
    protected $handle, $length, $curpos;

    public function __construct($file){
        // assign to the object's properties, not to locals
        $this->handle = fopen($file, 'r');
        $this->length = strlen(PHP_EOL);
        $this->curpos = 0;
    }
    public function next_line(){
        while(!feof($this->handle)){
            $b = fread($this->handle, $this->length);
            $this->curpos += $this->length;
            if ($b == PHP_EOL) return $this->curpos;
        }
        return false;
    }
    public function skip_lines($count){
        for($i = 0; $i < $count; $i++)
            $this->next_line();
    }
    public function __destruct(){
        fclose($this->handle);
    }
}

function exec_async($cmd, $outfile, $pidfile){
    exec(sprintf("%s > %s 2>&1 & echo $! >> %s", $cmd, $outfile, $pidfile));
}
main.php
require('common.php');

$maxlines = 200; // maximum lines subtask will be processing at a time
$note = $_POST['note'];
$file = $_FILES['csv']['tmp_name'];
$outdir = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'out' . DIRECTORY_SEPARATOR;

//make sure our output directory exists
if(!is_dir($outdir))
    if(!mkdir($outdir, 0755, true))
        die('Cannot create output directory: '.$outdir);

// run a task for each chunk of lines in the csv file
$i = 0; $pos = 0;
$l = new FileLineFinder($file);
do {
    $i++;
    exec_async(
        'php -f sub.php -- '.$pos.' '.$maxlines.' '.escapeshellarg($file).' '.escapeshellarg($note),
        $outdir.'proc'.$i.'.log',
        $outdir.'proc'.$i.'.pid'
    );
    $l->skip_lines($maxlines);
} while($pos = $l->next_line());

// wait for each task to finish
do {
    $tasks = count(glob($outdir.'proc*.pid'));
    echo 'Remaining Tasks: '.$tasks.PHP_EOL;
} while ($tasks > 0);

echo 'Finished!'.PHP_EOL;
sub.php
require('common.php');

$start = (int)$argv[1];
$count = (int)$argv[2];
$file = $argv[3];
$note = mysql_real_escape_string($argv[4]);

$lines = 0;
$handle = fopen($file, 'r');
fseek($handle, $start, SEEK_SET);

$expression = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";

mysql_query('SET NAMES utf8');

//loop through the csv file and insert into database
do {
    $lines++;
    if ($data[0]) {
        if (preg_match($expression, $data[0])) {
            $query = mysql_query('SELECT * FROM `postfix`.`recipient_access` where recipient="'.$data[0].'"');
            $fetch = mysql_fetch_array($query);
            if($fetch['recipient'] != $data[0]){
                $query = mysql_query('INSERT INTO `postfix`.`recipient_access`(`recipient`, `note`) VALUES("'.$data[0].'","'.$note.'")');
            }
        }
    }
} while (($data = fgetcsv($handle, 1000, ',', '\'')) && ($lines < $count));
Credits
https://stackoverflow.com/a/2162528/314056
https://stackoverflow.com/a/45966/314056
The most pressing thing to do is to make sure your database is properly indexed so the lookup query you do for every row is as fast as possible.
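A minimal sketch of that, assuming recipient is not indexed yet (the index name is made up):
// One-off: index the column used by the per-row lookup so each SELECT is an
// index lookup instead of a full table scan.
mysql_query("ALTER TABLE `postfix`.`recipient_access` ADD INDEX `idx_recipient` (`recipient`)");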
Other than that, there simply isn't that much you can do. For a multithreaded solution, you'll have to go outside PHP.
You could also just import the CSV file into MySQL, and then weed out the superfluous data using your PHP script - that is likely to be the fastest way.
Just a general suggestion: the key to speeding up any program is to know which part takes most of the time,
and then figure out how to reduce it. Sometimes you will be very surprised by the actual result.
By the way, I don't think multithreading would solve your problem.
Put the whole loop inside an SQL transaction. That will speed things up by an order of magnitude.
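A sketch of that idea in the question's mysql_* style, assuming the table uses a transactional engine such as InnoDB:
mysql_query('START TRANSACTION');
while (($data = fgetcsv($handle, 1000, ",", "'")) !== false) {
    // ... the per-row validation and INSERT from the question go here ...
}
// One commit for the whole file instead of an implicit commit after every INSERT.
mysql_query('COMMIT');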

php memory limit and reading/writing temp files

Using the function below, I am pulling rows from tables, encoding them, then putting them in CSV format. I am wondering if there is an easier way to prevent high memory usage. I don't want to have to rely on ini_set. I believe the memory consumption is caused by reading the temp file and gzipping it up. I'd love to be able to have a limit of 64MB of RAM to work with. Any ideas? Thanks!
function exportcsv($tables) {
    foreach ($tables as $k => $v) {
        $fh = fopen("php://temp", 'w');
        $sql = mysql_query("SELECT * FROM $v");
        while ($row = mysql_fetch_row($sql)) {
            $line = array();
            foreach ($row as $key => $vv) {
                $line[] = base64_encode($vv);
            }
            fputcsv($fh, $line, chr(9));
        }
        rewind($fh);
        $data = stream_get_contents($fh);
        $gzdata = gzencode($data, 6);
        $fp = fopen('sql/'.$v.'.csv.gz', 'w');
        fwrite($fp, $gzdata);
        fclose($fp);
        fclose($fh);
    }
}
Untested, but hopefully you get the idea:
function exportcsv($tables) {
    foreach ($tables as $k => $v) {
        $fh = fopen('compress.zlib://sql/' .$v. '.csv.gz', 'w');
        $sql = mysql_unbuffered_query("SELECT * FROM $v");
        while ($row = mysql_fetch_row($sql)) {
            fputcsv($fh, array_map('base64_encode', $row), chr(9));
        }
        fclose($fh);
        mysql_free_result($sql);
    }
}
Edit:
Points of interest are the use of mysql_unbuffered_query() and of PHP's compression stream. Regular mysql_query() buffers the entire result set into memory, and using the compression stream avoids buffering the data yet again in PHP memory as a string before writing it to a file.
Pulling the whole file into memory via stream_get_contents() is probably what's killing you. Not only do you have to hold the base64 data (which is usually about 33% larger than its raw content), you've got the CSV overhead to deal with as well. If memory is a problem, consider simply calling a command-line gzip app instead of gzipping inside of PHP, something like:
... database loop here ...
exec('gzip yourfile.csv');
And you can probably optimize things a little bit better inside the DB loop, and encode in-place, rather than building a new array for each row:
while ($row = mysql_fetch_row($result)) {
    foreach ($row as $key => $val) {
        $row[$key] = base64_encode($val);
    }
    fputcsv($fh, $row, chr(9));
}
Not that this will reduce memory usage much - it's only a single row of data, so unless you're dealing with huge record fields, it won't have much effect.
You could insert some flushing there; currently the entire temp stream contents are held in memory and then flushed at the end, but you can flush manually after each row:
fflush($fh);
Also, instead of gzipping the entire file at the end, you could gzip line by line using
$gz = gzopen('sql/'.$v.'.csv.gz', 'w9'); // gzopen() expects a file path, not an existing handle
gzwrite($gz, $content);
gzclose($gz);
This will write the packed data line by line rather than creating an entire file and then gzipping it.
I found this suggestion for compressing in chunks on http://johnibanez.com/node/21
It looks like it wouldn't be hard to modify for your purposes.
function gzcompressfile($source, $level = false){
    $dest = $source . '.gz';
    $mode = 'wb' . $level;
    $error = false;

    if ($fp_out = gzopen($dest, $mode)) {
        if ($fp_in = fopen($source, 'rb')) {
            while(!feof($fp_in)) {
                gzwrite($fp_out, fread($fp_in, 1024*512));
            }
            fclose($fp_in);
        } else {
            $error = true;
        }
        gzclose($fp_out);
    } else {
        $error = true;
    }

    if ($error)
        return false;
    else
        return $dest;
}

500 error after a lot of mysql_query calls in php

I have a PHP script that steps through a folder containing tab-delimited files, parsing them line by line and inserting the data into a MySQL database. I cannot use LOAD TABLE because of security restrictions on my server, and I do not have access to the configuration files. The script works just fine parsing 1 or 2 smaller files, but when working with several large files I get a 500 error. There do not appear to be any error logs containing messages pertaining to the error, at least none that my hosting provider gives me access to. Below is the code; I am also open to suggestions for alternate ways of doing what I need to do. Ultimately I want this script to fire off every 30 minutes or so, inserting new data and deleting the files when finished.
EDIT: After making the changes Phil suggested, the script still fails, but I now have the following message in my error log: "mod_fcgid: read data timeout in 120 seconds". It looks like the script is timing out. Any idea where I can change the timeout setting?
$folder = opendir($dir);
while (($file = readdir($folder)) !== false) {
    $filepath = $dir . "/" . $file;
    //If it is a file and ends in txt, parse it and insert the records into the db
    if (is_file($filepath) && substr($filepath, strlen($filepath) - 3) == "txt") {
        uploadDataToDB($filepath, $connection);
    }
}

function uploadDataToDB($filepath, $connection) {
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);
    ini_set('max_execution_time', 300);

    $insertString = "INSERT INTO dirty_products values(";
    $count = 1;

    $file = @fopen($filepath, "r");
    while (($line = fgets($file)) !== false) {
        $values = "";
        $valueArray = explode("\t", $line);
        foreach ($valueArray as $value) {
            //Escape single quotes
            $value = str_replace("'", "\'", $value);
            if ($values != "")
                $values = $values . ",'" . $value . "'";
            else
                $values = "'" . $value . "'";
        }
        mysql_query($insertString . $values . ")", $connection);
        $count++;
    }
    fclose($file);

    echo "Count: " . $count . "</p>";
}
First thing I'd do is use prepared statements (using PDO).
Using the mysql_query() function, you're creating a new statement for every insert and you may be exceeding the allowed limit.
If you use a prepared statement, only one statement is created and compiled on the database server.
Example
function uploadDataToDB($filepath, $connection) {
    ini_set('display_errors', 'On');
    error_reporting(E_ALL);
    ini_set('max_execution_time', 300);

    $db = new PDO(/* DB connection parameters */);
    $stmt = $db->prepare('INSERT INTO dirty_products VALUES (
        ?, ?, ?, ?, ?, ?)');
    // match number of placeholders to number of TSV fields

    $count = 1;

    $file = @fopen($filepath, "r");
    while (($line = fgets($file)) !== false) {
        $valueArray = explode("\t", $line);
        $stmt->execute($valueArray);
        $count++;
    }
    fclose($file);

    $db = null;

    echo "Count: " . $count . "</p>";
}
Considering you want to run this script on a schedule, I'd avoid the web server entirely and run the script via the CLI using cron or whatever scheduling service your host provides. This will help you avoid any timeout configured in the web server.
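A sketch of what such a CLI entry point could look like (the file name, paths and crontab line are placeholders):
// import.php - hypothetical CLI entry point, scheduled e.g. via:
//   */30 * * * * php /path/to/import.php
// When run from the CLI there is no mod_fcgid/web-server timeout to worry about.
set_time_limit(0);
$connection = mysql_connect(/* connection parameters */);
$dir = '/path/to/tab-delimited-files';
// ... same directory loop and uploadDataToDB() as above ...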
