MySQL Is Not Inserting All Successful Insert Queries...Why? - php

Before I go on, this is purely a question of intuition. That is, I'm not seeking answers to work out specific bugs in my PHP/MySQL code. Rather, I want to understand the range of possible issues that I need to consider in resolving my issue. To that end, I will not post code or attach scripts - I will simply explain what I did and what is happening.
I have written a PHP script that:
Reads a CSV text file of X records to be inserted into a MySQL database table and/or updates duplicate entries where applicable;
Inserts said records into what I will call a "root" table for that data set;
Selects subset records of specific fields from the "root" table and then inserts those records into a "master" table; and
Creates an output export text file from the master table for distribution.
There are several CSV files that I am processing via separate scheduled cron tasks every 30 minutes. All said, from the various sources, there are an estimated 420,000 insert transactions from file to root table, and another 420,000 insert transactions from root table to master table via the scheduled tasks.
One of the tasks involves a CSV file of about 400,000 records by itself. The processing contains no errors, but here's the problem: of the 400,000 records that MySQL indicates have been successfully inserted into the root table, only about 92,000 of those records actually end up stored in the root table - I'm losing about 308,000 records from that scheduled task.
The other scheduled tasks process about 16,000 and 1,000 transactions respectively, and these transactions process perfectly. In fact, if I reduce the number of transactions from 400,000 to, say, 10,000, then these process just fine as well. Clearly, that's not the goal here.
To address this issue, I have tried several remedies...
Upping the memory of my server (and increasing the max limit in the php.ini file)
Getting a dedicated database with expanded memory (as opposed to a shared VPS database)
Rewriting my code to substantially eliminate memory-hungry stored arrays and to process the fgetcsv() reads on the fly
Using INSERT DELAYED MySQL statements (as opposed to plain INSERT statements)
...and none of these remedies have worked as desired.
What range of remedial actions should be considered at this point, given the lack of success in the actions taken so far? Thanks...

The source data in the CSV may have duplicate records. Even though there are 400,000 records in the CSV, your 'insert or update' logic trims them down to a reduced set. Low memory could lead to exceptions and the like, but not to this kind of silent data loss.
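One quick way to test whether duplicate handling explains the shortfall is to count inserts versus updates as the file loads. A minimal sketch, assuming a UNIQUE key on the root table and an already-open connection; the table and column names here are made up, not your actual schema:
<?php
// Assumes root_table has a UNIQUE key on record_id; names are placeholders.
$inserted = 0;
$updated  = 0;
$csv = fopen('feed.csv', 'r');
while (($row = fgetcsv($csv)) !== false) {
    $id      = mysql_real_escape_string($row[0]);
    $payload = mysql_real_escape_string($row[1]);
    mysql_query("INSERT INTO root_table (record_id, payload)
                 VALUES ('$id', '$payload')
                 ON DUPLICATE KEY UPDATE payload = VALUES(payload)");
    // affected rows: 1 = new row inserted, 2 = existing row updated,
    // 0 = duplicate key but values identical
    if (mysql_affected_rows() == 1) {
        $inserted++;
    } else {
        $updated++;
    }
}
fclose($csv);
echo "Inserted: $inserted, duplicates/updates: $updated" . PHP_EOL;
If the two counters add up to 400,000 but $inserted is only around 92,000, the "loss" is just deduplication, not a storage problem.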

I suspect there are problems in the CSV file.
My suggestion:
Print some debugging information for each line read from the CSV. This will show you how many lines are processed.
On every insert/update, print the error (if any).
It's something like this:
<?php
$csv  = fopen('sample.csv', 'r');
$line = 1;
while (($item = fgetcsv($csv)) !== false) {
    echo 'Line ' . $line++ . '... ';
    $sql = ''; // your SQL query
    mysql_query($sql);
    $error = mysql_error();
    if ($error == '') {
        echo 'OK' . PHP_EOL;
    } else {
        echo 'FAILED' . PHP_EOL . $error . PHP_EOL;
    }
}
So, if there are any errors, you can see them and find the problem (which lines of the CSV are causing it).

Related

how to handle large size of update query in mysql with laravel

Is there a way that I can update 100k records in a query so that the MySQL database works smoothly?
Suppose there is a table users containing hundreds of thousands of records and I have to update approximately fifty thousand of them. For the update I have the IDs of those records (around fifty thousand) stored in a CSV file.
1 - Will the query be OK, given that its size would be very large? Or, if there is any way to split it into smaller chunks, let me know.
2 - Considering the Laravel framework, is there any option to read part of the file rather than the whole file, to avoid memory issues? I do not want to read the whole file at once.
Any suggestions are welcome!
If you're thinking of building a query like UPDATE users SET column = 'value' WHERE id = 1 OR id = 2 OR id = 3 ... OR id = 50000 or WHERE id IN (1, 2, 3, ..., 50000) then that will probably be too big. If you can make some logic to summarize that, it would shorten the query and speed things up on MySQL's end significantly. Maybe you could make it WHERE id >= 1 AND id <= 50000.
If that's not an option, you could do it in bursts. You're probably going to loop through the rows of the CSV file, build the query as a big WHERE id = 1 OR id = 2... query and every 100 rows or so (or 50 if that's still too big), run the query and start a new one for the next 50 IDs.
Or you could just run 50,000 single UPDATE queries on your database. Honestly, if the table makes proper use of indexes, running 50,000 queries should only take a few seconds on most modern webservers. Even the busiest servers should be able to handle that in under a minute.
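If you go the burst route, here's a rough sketch in plain PDO (the connection details, table/column names, and CSV layout are assumptions on my part; in Laravel you'd likely use the query builder instead):
<?php
// Run the update in bursts of IDs read from the CSV file.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$file = fopen('/path/to/ids.csv', 'r');
$ids  = [];

$flush = function (array $ids) use ($pdo) {
    if (!$ids) {
        return;
    }
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $pdo->prepare("UPDATE users SET column = 'value' WHERE id IN ($placeholders)");
    $stmt->execute($ids);
};

while (($row = fgetcsv($file)) !== false) {
    $ids[] = (int) $row[0];     // assuming the ID is in the first CSV column
    if (count($ids) >= 100) {   // burst size; tune as needed
        $flush($ids);
        $ids = [];
    }
}
$flush($ids);                   // remaining IDs
fclose($file);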
As for reading a file in chunks, you can use PHP's basic file access functions for that:
$file = fopen('/path/to/file.csv', 'r');

// read one line at a time from the file (fgets reads up to the
// next newline character if you don't provide a number of bytes)
while (!feof($file)) {
    $line = fgets($file);

    // or, since it's a CSV file:
    $row = fgetcsv($file);
    // $row is now an array with all the CSV columns

    // do stuff with the line/row
}

// set the file pointer to 60 kb into the file
fseek($file, 60*1024);

// close the file
fclose($file);
This will not read the full file into memory. Not sure if Laravel has its own way of dealing with files, but this is how to do that in basic PHP.
Depending on the data you have to update, I would suggest a few approaches:
If all users are updated with the same value, then, as @rickdenhaan said,
you can build batched queries every X rows of the CSV.
If every individual user has to be updated with a unique value, you have to run single queries.
If any updated columns are indexed, you should disable autocommit and manage the transaction manually to avoid index maintenance after each single update (see the sketch after this list).
To avoid memory leakage, my opinion is the same as @rickdenhaan's: you should read the CSV line by line using fgetcsv().
To avoid possible timeouts, you can, for example, move the script's processing into Laravel queues.
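A minimal sketch of the transaction-wrapped, single-query-per-row approach, assuming a CSV of "id,value" pairs; the connection details and table/column names are placeholders:
<?php
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('UPDATE users SET column = ? WHERE id = ?');
$file = fopen('/path/to/updates.csv', 'r');

$pdo->beginTransaction();           // one transaction instead of autocommit per row
try {
    while (($row = fgetcsv($file)) !== false) {
        $stmt->execute([$row[1], $row[0]]);
    }
    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}
fclose($file);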

SQL select query memory issues

I've got a server running a live traffic log DB that holds a big stats table. Now I need to create a smaller table from it, let's say 30 days back.
This server also has a slave server that copies the data and is 5 seconds behind the master.
I created this slave in order to offload SELECT queries from the master, so the master only handles insert/update for the traffic log.
Now I need to copy the last day into the smaller table, still without touching the "real" DB,
so I need to select from the slave and insert into the real smaller table. (The slave only allows read operations.)
I am working with PHP and I can't solve this with one query, since that would mean using two different databases in a single query... If it's possible, please let me know how.
When using two queries, I need to hold the last day's data as a PHP MySQL result object. For 300K-650K rows, that is starting to become a memory problem. I would select it in chunks by ID (by constraining the IDs in the WHERE clause), but I don't have an auto-increment ID field and there's no ID for the rows (when storing traffic data, an ID would take a lot of space).
So I am trying this idea and I would like to get a second opinion.
If I take the whole last day at once (300K rows), it will overload PHP's memory.
I can use LIMIT chunks, or a new idea: selecting one column at a time and copying that column to the new real table. But I don't know if the second method is possible. Does INSERT fill the first open space at the column level or the row level?
The main idea is reducing the size of the select, so is it possible to build a select by columns and then insert them as columns in MySQL?
If this is simply a memory problem in PHP, you could try using PDO and fetching one result row at a time instead of all of them at once.
From PHP.net for PDO:
<?php
function getFruit($conn) {
    $sql = 'SELECT name, color, calories FROM fruit ORDER BY name';
    foreach ($conn->query($sql) as $row) {
        print $row['name'] . "\t";
        print $row['color'] . "\t";
        print $row['calories'] . "\n";
    }
}
?>
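One caveat: by default the MySQL driver buffers the entire result set on the PHP side even when you fetch row by row. To actually stream rows and keep memory roughly constant, you can turn off buffered queries on the connection. A sketch, with placeholder connection details and table name:
<?php
$pdo = new PDO('mysql:host=slave-host;dbname=stats', 'user', 'pass');
// fetch rows from the server as you iterate instead of buffering them all
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT * FROM traffic_log WHERE log_date = CURDATE() - INTERVAL 1 DAY');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // insert this row into the smaller table via the master connection
}
// note: with unbuffered queries, finish reading this result set before
// issuing another query on the same connection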
Well, here is where PHP starts to be weird. I took your advice and started to use chunks for the data. I used a loop advancing a LIMIT in 2,000-row jumps. What was interesting is that when I started to use PHP's memory-usage and memory-peak functions, I found out that the reason the chunk method doesn't work at large scale with looping is that assigning a new value to a variable doesn't release the memory of what was there before the assignment, so you must use unset() or null in order to keep your PHP memory under control.

Auto index, repair and optimize MySQL table on every page load

I am in a debate with a guy telling me that there is no performance hit for using his function that...
Auto index, repair and optimize MySQL tables using PHP class __destruct() on every single page load by every user who runs the page.
He is asking me why I think it is not good for performance, but I do not really know. Can someone tell me why such a thing isn't good?
UPDATE His reasoning...
Optimizing and repairing the database tables eliminates the overhead bytes that can slow down additional queries when multiple connections and heavy table use are concerned, even with a performance-tuned database schema with indexing enabled.
Not to mention that the execution time to perform these operations is slim to none in terms of memory and processor threading.
Opening, reading, writing, updating and then cleaning up after oneself makes more sense to me than performing the same operations and leaving unnecessary overhead behind waiting for a cron entry to clean up.
Instead of arguing, why not measure? Use a toolkit to profile where you're spending time, such as Instrumentation for PHP. Prove that the optimize step of your PHP request is taking a long time.
Reindexing is an expensive process, at least as costly as doing a table-scan as if you did not have an index. You should build indexes infrequently, so that you serve many PHP requests with the aid of the index for every one time you build the index. If you're building the index on every PHP request, you might as well not define indexes at all, and just run table-scans all the time.
REPAIR TABLE is only relevant for MyISAM tables (and Archive tables). I don't recommend using MyISAM tables. You should just use InnoDB tables. Not only for the sake of performance, but also data safety. MyISAM is very susceptible to data corruption, whereas InnoDB protects against that in most cases by maintaining internal checksums per page.
OPTIMIZE TABLE for an InnoDB table rebuilds all the data and index pages. This is going to be immensely expensive once your table grows to a non-trivial size. Certainly not something you would want to do on every page load. I would even say you should not do OPTIMIZE TABLE during any PHP web request -- do it offline via some script or admin interface.
A table restructure also locks the table. You will queue up all other PHP requests that access the same table for a long time (i.e. minutes or even hours, depending on the size of the table). When each PHP request gets its chance, it'll run another table restructure. It's ridiculous to incur this amount of overhead on every PHP request.
You can also use an analogy: you don't rebuild or optimize an entire table or index during every PHP request for the same reason you don't give your car a tune-up and oil change every time you start it:
It would be expensive and inconvenient to do so, and it would give no extra benefit compared to performing engine maintenance on an appropriate schedule.
Because every single operation (index, repair and optimize) takes considerable time; in fact they are VERY expensive (table locks, disk IO, risk of data loss) if the tables are even slightly big.
Doing this on every page load is definitely not recommended. It should be done only when needed.
REPAIR TABLE can cause data loss, as stated in the documentation, so it requires a prior backup to avoid further problems. Also, it is intended to be run only in case of disaster (something HAS failed).
OPTIMIZE TABLE locks the table under maintenance, so it can cause problems for concurrent users.
My 0.02: Database management operations should not be part of common user transactions as they are expensive in time and resources as your tables grow.
I have put the following code into our scheduled job that runs early every morning, when users don't access our site frequently (I read that OPTIMIZE locks the affected tables during optimization).
The advantage of this approach is that a single query is composed with all table names comma-separated, instead of executing many queries, one for each table to optimize.
It assumes that you already have a DB connection opened and a DB selected, so you can use this code without specifying the connection, database name, etc.
$q = "SHOW TABLE STATUS WHERE Data_Free > '0'";
$res = mysql_query($q); $TOOPT = mysql_num_rows($res);
$N = 0; // number of optimized tables
if(mysql_num_rows($res) > 0)
{
$N = 1;
while($t = mysql_fetch_array($res))
{
$TNAME = $t['Name']; $TSPACE += $t['Data_free'];
if($N < 2)
{
$Q = "OPTIMIZE TABLE ".$TNAME."";
}
else
{
$Q .= ", ".$TNAME."";
}
$N++;
} // endwhile tables
mysql_query($Q);
} // endif tables found (to optimize)
The docs state...
(optimize reference)
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
So after extensive changes have been performed, performance is enhanced by using the 'OPTIMIZE' command.
(flush reference)
FLUSH TABLES has several variant forms. FLUSH TABLE is a synonym for
FLUSH TABLES, except that TABLE does not work with the WITH READ LOCK
variant.
So by using the 'FLUSH TABLE' form rather than the 'FLUSH TABLES ... WITH READ LOCK' variant, no read lock is performed.
(repair reference)
Normally, you should never have to run REPAIR TABLE. However, if
disaster strikes, this statement is very likely to get back all your
data from a MyISAM table. If your tables become corrupted often, you
should try to find the reason for it, to eliminate the need to use
REPAIR TABLE. See Section C.5.4.2, “What to Do If MySQL Keeps
Crashing”, and Section 13.5.4, “MyISAM Table Problems”.
It is my understanding that if the 'REPAIR TABLE' command is run consistently, the problems that build up as large record sets accumulate would be eliminated, since constant maintenance is performed. If I am wrong, I would like to see benchmarks, as my own attempts have not shown anything too detrimental, although the record sets have been under the 10k mark.
Here is the piece of code that is being used and that @codedev is asking about...
class db
{
    protected static $dbconn;

    // rest of database class

    public function index($link, $database)
    {
        $obj = $this->query('SHOW TABLES');
        $results = $this->results($obj);
        foreach ($results as $key => $value) {
            if (isset($value['Tables_in_'.$database])) {
                $this->query('REPAIR TABLE '.$value['Tables_in_'.$database]);
                $this->query('OPTIMIZE TABLE '.$value['Tables_in_'.$database]);
                $this->query('FLUSH TABLE '.$value['Tables_in_'.$database]);
            }
        }
    }

    public function __destruct()
    {
        $this->index($this->dbconn, $this->configuration['database']);
        $this->close();
    }
}

Import and process text file within MySQL

I'm working on a research project that requires me to process large csv files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row-by-row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks to see if the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and inserts the information into the appropriate tables. The list of entities is over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
    // if name of organization is null, skip it
    if ($data[44] == '') return null;

    // use each of the possible names to check if organization exists
    $names = array($data[44], $data[45], $data[46], $data[47]);

    // cycle through the names
    foreach ($names as $name) {
        // check to see if there is actually an entry here
        if ($name != '') {
            if (($org_id = $this->_parseOrg($name)) != null) {
                $this->update_org_meta($org_id, $data); // updates some information of existing entity based on record
                return $org_id;
            }
        }
    }
    return $this->_addOrg($data);
}

private function _parseOrg($name)
{
    // check to see if it matches any org names
    // db class function, performs simple "like" match
    $this->db->where('org_name', $name, 'like');
    $result = $this->db->get('orgs');
    if (mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    // check to see if matches any org aliases
    $this->db->where('org_alias_name', $name, 'like');
    $result = $this->db->get('orgs_aliases');
    if (mysql_num_rows($result) == 1) {
        $row = mysql_fetch_object($result);
        return $row->org_id;
    }

    return null; // no matches, have to add new entity
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records per hour, which, given the size, means a few solid days for each file. The way my db is structured requires several different tables to be updated for each record because I'm compiling multiple external datasets. So, each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?
I'm running this on my Mac OS 10.6 with local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at most to completely load and process the data.
you might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
example code (may be of use to you)
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
Hope this helps
It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends the most time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast. Doing naive benchmark tests, I can easily sustain 10k/sec insert/select queries on one of my older quad-cores. Instead of doing one SELECT after another to test if the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb lets you use executemany() to do multiple inserts simultaneously; you could almost certainly leverage that to your advantage. Perhaps your PHP client lets you do the same thing?
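In PHP, the closest analogue to executemany() is to accumulate rows and issue one multi-row INSERT per batch. A rough sketch; the $records variable, connection details and the table/column names are placeholders, not the asker's actual schema:
<?php
$pdo   = new PDO('mysql:host=localhost;dbname=research', 'user', 'pass');
$batch = [];

function flushBatch(PDO $pdo, array $batch) {
    if (!$batch) {
        return;
    }
    $groups = implode(',', array_fill(0, count($batch), '(?, ?)')); // one (?, ?) group per row
    $stmt   = $pdo->prepare("INSERT INTO orgs (org_name, org_alias_name) VALUES $groups");
    $stmt->execute(array_merge(...$batch)); // flatten [[a,b],[c,d]] into [a,b,c,d]
}

foreach ($records as $record) {        // $records: rows already parsed from the CSV
    $batch[] = [$record[44], $record[45]];
    if (count($batch) >= 500) {        // batch size; tune as needed
        flushBatch($pdo, $batch);
        $batch = [];
    }
}
flushBatch($pdo, $batch);              // flush whatever is left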
Another thing to consider: with Python you can use multiprocessing to try to parallelize as much as possible. PyMOTW has a good article about multiprocessing.

How to translate and migrate data

I am building a PHP web application that lets a user upload an MS Access database (CSV export), which is then translated and migrated into a MySQL database.
The MS Access database consists of one table called t_product of 100k rows. This table is not designed well. As an example, the following query:
SELECT part_number, model_number FROM t_product
will return:
part_number model_number
100 AX1000, AX1001, AX1002
101 CZ10, CZ220, MB100
As you can see, the model numbers are listed as comma-separated values instead of individual records in another table. There are many more issues of this nature. I'm writing a script to clean this data before importing it into the MySQL database. The script will also map existing Access columns to a properly designed relational database.
My issue is that my script takes too long to complete. Here's simplified code to explain what I'm doing:
$handle = fopen("MSAccess.csv, "r");
// get each row from the csv
while ($data=fgetcsv($handle, 1000, ","))
{
mysql_query("INSERT INTO t_product (col1, col2 etc...) values ($data[0], $data[1], etc...");
$prodId = mysql_last_insert_id();
// using model as an example, there are other columns
// with csv values that need to be broken up
$arrModel = explode(',', $data[2]);
foreach($arrModel as $modelNumber)
mysql_query("INSERT INTO t_model (product_id, col1, col2 etc...) values ($prodId, $modelNumber[0], $modelNumber[1] etc...");
}
The problem here is that each while-loop iteration makes a tremendous number of calls to the database. For every product record, I have to insert N model numbers, Y part numbers, X serial numbers etc...
I started another approach where I stored the whole CSV in an array. I then write one batch query like
$sql = "INSERT INTO t_product (col1, col2, etc...) values ";
foreach($arrParam as $val)
$sql .= " ($val[0], $val[1], $val[2]), "
But I ran into excessive memory errors with this approach. I increased the max memory limit to 64M and I'm still running out of memory.
What is the best way to tackle this problem?
Maybe I should write all my queries to a *.sql file first, then import the *.sql file into the mysql database?
This may be entirely not the direction you want to go, but you can generate the MySQL creation script directly from MS Access with the free MySQL Migration Toolkit
Perhaps you could allow the user to upload the Access db, and then have your PHP script call the Migration toolkit?
If you're going to try optimizing the code you have there already, I would try aggregating the INSERTS and see if that helps. This should be easy to add to your code. Something like this (C# pseudocode):
int flushCount = 0;
while (!done)
{
    // Build next query, concatenate to last set of queries
    if (++flushCount == 5)
    {
        // Flush queries to database
        // Reset query string to empty
        flushCount = 0;
    }
}
// Flush remaining queries to the database
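A rough PHP rendering of that aggregate-and-flush idea, built around the asker's existing mysql_* loop; the flush threshold and the column names are illustrative only, and an open connection is assumed for the escaping calls:
<?php
$handle = fopen("MSAccess.csv", "r");
$values = [];

while ($data = fgetcsv($handle, 1000, ",")) {
    $values[] = sprintf("('%s', '%s')",
        mysql_real_escape_string($data[0]),
        mysql_real_escape_string($data[1]));

    if (count($values) >= 500) {             // flush threshold, tune as needed
        mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(',', $values));
        $values = [];
    }
}
if ($values) {                               // flush whatever is left
    mysql_query("INSERT INTO t_product (col1, col2) VALUES " . implode(',', $values));
}
fclose($handle);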
I decided to write all my queries into a .SQL file. This gave me the opportunity to normalize the CSV file into a proper relational database. Afterwards, my PHP script called exec("mysql -h dbserver.com -u myuser -pmypass dbname < db.sql");
This solved my memory problems and it was much faster than multiple queries from php.
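In case it helps anyone doing the same, here is a rough sketch of that approach; the file names, credentials, columns and escaping are placeholders, not the actual schema:
<?php
$handle = fopen("MSAccess.csv", "r");
$sqlOut = fopen("db.sql", "w");

while ($data = fgetcsv($handle, 1000, ",")) {
    $part = addslashes($data[0]);            // simplistic escaping for the sketch
    fwrite($sqlOut, "INSERT INTO t_product (part_number) VALUES ('$part');\n");
    fwrite($sqlOut, "SET @prod_id = LAST_INSERT_ID();\n");

    // one normalized row per comma-separated model number
    foreach (explode(',', $data[2]) as $model) {
        $model = addslashes(trim($model));
        fwrite($sqlOut, "INSERT INTO t_model (product_id, model_number) VALUES (@prod_id, '$model');\n");
    }
}
fclose($handle);
fclose($sqlOut);

// import the generated file in one shot
exec("mysql -h dbserver.com -u myuser -pmypass dbname < db.sql");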
