I'm trying to create a script to import about 10 million records into a MySQL database.
When I did it as a loop of single queries, importing just 2,000 records took 20 minutes.
So I'm trying to do this with transactions. The problem is that inside my loop there are some SELECT queries that need to run first to get the values for the inserts. Only the last two queries (the insert and the update) could go into a transaction.
Something like this:
foreach ($record as $rec) {
    //select sth
    //do sth with result
    //second select sth
    //do sth with second result
    //prepare values from above results and $rec

    // below part I'd like to do with a transaction
    //insert with new record
    //update table
}
I know this is a little messy and not exact, but the real function is more complicated, so I decided to post just a draft. I need advice, not complete code.
Regards
Transactions are for multiple statements that need to be treated as a single group that either entirely succeeds or entirely fails. It sounds like your issue has a lot more to do with performance than transactions. Unless there is a bit of information that you haven't included that involves groups of statements "which all must succeed at the same time", transactions are just a distraction.
There are a few ways to approach your problem depending on some things that aren't immediately obvious from your post.
-If your data source for the 10M records is a table in the same database that you are going to populate with the new records (via the inserts and updates at the end of your loop), then you might be able to do everything through a single database query. SQL is very expressive, and with joins and some of the built-in functions (SUBSTR(), UPPER(), REVERSE(), CASE...END, etc.) you might be able to do everything you want. This would require reading up on SQL and reframing your goals in terms of set operations.
-If you are inserting records sourced from outside the database (like from a file), then I would organize your code like this:
//select sth
//do sth with result
//second select sth
//do sth with second result
//prepare values from above results so that $rec info can be added in later
foreach ($record as $rec) {
    //construct a big insert statement
}
//insert the new records by running the big insert statement
//update table
The advantage here is that you only hit the db with a few queries, instead of a few queries per $rec, so your performance will be better (since db calls have overhead). For 10M rows you may need to break the above up into a few chunks, since there is a limit to how big a single insert can be (see max_allowed_packet). I would suggest breaking the 10M into 5K or 10K chunks by adding another loop around the above that partitions off the chunks from the 10M.
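A minimal sketch of that chunked approach, assuming PDO, an already-loaded $records array, and a hypothetical buildRowValues() helper that returns one properly escaped, parenthesised values list per record (the table and column names below are placeholders, not from the original post):

<?php
// Sketch only: one multi-row INSERT per chunk instead of one INSERT per record.
$chunkSize = 5000;

foreach (array_chunk($records, $chunkSize) as $chunk) {
    $values = [];
    foreach ($chunk as $rec) {
        $values[] = buildRowValues($rec);   // e.g. "('abc', 1, 2)", already escaped
    }
    $sql = "INSERT INTO target_table (col1, col2, col3) VALUES " . implode(',', $values);
    $pdo->exec($sql);
}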
A clearer answer could have been given if you added details about your data source, what transformations you want to do on the data, what the purpose of the
//select sth
//do sth with result
//second select sth
//do sth with second result
section is (in terms of how it adds information to your later insert statements), and what the "prepare values" section of your code does.
Related
It's kind of hard to express my need in the title.
CodeIgniter is performing a SELECT query on a table of 800,000+ rows in one shot.
It takes a lot of memory, and on one specific server I get an "Out of memory" fatal error.
For performance purposes, I would like to split the SELECT into two selects: the first 50% of the rows, and then the remaining 50%.
I reuse this set of data to perform an INSERT afterwards.
How can I do that without losing/forgetting a single row?
Aside from the fact that operations like that are closely tied to performance issues, you can use unbuffered_row.
Basically, if you have a job with data that large, you should use unbuffered_row, which is provided and integrated in the built-in query builder.
It's very well documented in the result rows section of the documentation.
For example:
$query = $this->db->select('*')->from('your_table')->get();

while ($row = $query->unbuffered_row())
{
    //do your job
}
This will avoid your memory problem.
My use case:
I have multiple scripts inserting into a table, on the order of several inserts per second. I am seeing performance degradation, so I think there would be a performance benefit in batching the queries and inserting several hundred rows every minute or so.
Question:
How would I go about doing this using mysqli? My current code uses a wrapper (pastebin), and looks like:
$array = array(); // BIG ARRAY OF VALUES (more than 100k rows worth)

foreach ($array as $key => $value) {
    $db->q('INSERT INTO `player_items_attributes` (`column1`, `column2`, `column3`) VALUES (?, ?, ?)',
           'iii', $value['test1'], $value['test2'], $value['test3']);
}
Notes:
I looked at using transactions, but it sounds like those would still hit the server with one query per row, instead of queuing them up. I would prefer to keep using a wrapper (feel free to suggest one with similar functionality to my current one), but if that's not possible I will try to build the suggestions into the wrapper I use.
Sources:
Wrapper came from here
Edit:
I am trying to optimize table speed, rather than script speed. The table has more than 35 million rows and a few indexes.
The MySQL INSERT syntax allows for one INSERT query to insert multiple rows, like this:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
where each set of parenthesised values represents another row in your table. So, by working down the array you could create multiple rows in one query.
There is one major limitation: the total size of the query must not exceed the configured limit. For 100k rows you'd probably have to break this down into blocks of, say, 250 rows, reducing your 100k queries to 400. You might be able to go further.
I'm not going to attempt to code this - you'd have to code something and try it in your environment.
Here's a pseudo-code version:
escape entire array              // array_walk(), real_escape_string()
block_size = 250;                // number of rows to insert per query
current_block = 0;
rows_array = [];

while (next_element <= number of rows) {
    create parenthesised set and push to rows_array   // implode()
    current_block++
    if (current_block == block_size) {
        implode rows_array and append to query
        execute query
        set current_block = 0
        reset rows_array
        reset query
    }
    next_element++
}

if (there are any rows left over) {
    implode rows_array and append to query
    execute the query for the last block
}
I can already think of a potentially faster implementation with array_map() - try it.
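For illustration only, here is a minimal sketch of the blocked approach with plain mysqli and the player_items_attributes table from the question; the 250-row block size is the one suggested above, and since the bound types in the question are integers ('iii') the sketch simply casts to int instead of escaping strings:

<?php
// Sketch: insert $array (from the question) in blocks of 250 rows per query.
$blockSize = 250;

foreach (array_chunk($array, $blockSize) as $block) {
    $rows = [];
    foreach ($block as $value) {
        // One parenthesised row per element; (int) casts act as the escaping here
        $rows[] = sprintf('(%d, %d, %d)',
            (int) $value['test1'], (int) $value['test2'], (int) $value['test3']);
    }
    $sql = 'INSERT INTO `player_items_attributes` (`column1`, `column2`, `column3`) VALUES '
         . implode(',', $rows);
    $mysqli->query($sql);
}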
I was playing with PDO on PostgreSQL 9.2.4 and trying to fetch data from a table with millions of rows. My query returns about 100,000 rows.
I do not use any of PDOStatement's fetch functions; I simply use the result from the PDO object itself and loop through it.
But it gets slower and slower over time. At the beginning it was fetching about 200 rows per second, but the closer it gets to the end, the slower it becomes. Now, at row 30,000, it fetches only 1 row per second. Why is it getting slower?
I do this, and it's pretty simple:
$dbh = new PDO("pgsql...");
$sql = "SELECT x, y FROM point WHERE name is NOT NULL and place IN ('area1', 'area2')";
$res = $dbh->query($sql);
$ins_sql = "INSERT INTO mypoints (x, y) VALUES ";
$ins_vals = [];
$ins_placeholders = [];
foreach($res as $row) {
$ins_placeholders[] = "(?,?)";
$ins_vals = array_merge($ins_vals, [$row['x'], $row['y']]);
printCounter();
}
// now build up one insert query using placeholders and values,
// to insert all of them in one shot into table mypoints
The printCounter function simply increments an int variable and prints it, so I can see how many rows have already been put into that array before I create my insert statement from it. I use one-shot inserts to speed things up; that's better than doing 100,000 single inserts.
But that foreach loop gets slower over time. How can I increase the speed?
Is there a difference between fetch() and the simple loop method of using the PDOStatement in foreach?
When I start this PHP script, the query itself takes about 5-10 seconds, so this has nothing to do with how the table is set up or whether I need indexes.
I have other tables returning 1 million rows, and I'm not sure what the best way to fetch them is. I can raise PHP's memory_limit if needed, so the most important thing for me is SPEED.
Appreciate any help.
It's not likely that the slowness is related to the database, because after the $dbh->query() call, the query is finished and the resulting rows are all in memory (they are not in PHP variables yet, but they're in memory accessible at the pgsql module level).
The more likely culprit is the array_merge operation. The array becomes larger at every loop iteration, and the operation recreates the entire array each time.
You may want to do instead:
$ins_vals[] = [$row['x'], $row['y']];
Although personally, when concerned with speed, I'd use an even simpler flat structure:
$ins_vals[] = $x;
$ins_vals[] = $y;
Another unrelated point: the code builds a query with a huge number of placeholders, which is not how placeholders are normally used. To send large numbers of values to the server, the efficient way is to use COPY, possibly into a temporary table followed by server-side merge operations if it's not a plain insertion.
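As a rough illustration of the COPY route, here is a sketch using the pdo_pgsql driver's pgsqlCopyFromArray() method; the DSN is a placeholder, and it assumes mypoints has exactly the columns x and y in that order:

<?php
// Sketch: collect the rows as tab-separated lines and bulk-load them with COPY.
$dbh = new PDO('pgsql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder DSN

$sql  = "SELECT x, y FROM point WHERE name IS NOT NULL AND place IN ('area1', 'area2')";
$rows = [];
foreach ($dbh->query($sql) as $row) {
    // One COPY input line per row, fields separated by a tab (the driver default)
    $rows[] = $row['x'] . "\t" . $row['y'];
}

// Sends the whole batch as a single COPY command instead of thousands of placeholders
$dbh->pgsqlCopyFromArray('mypoints', $rows);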
I don't know why, but using the fetch() method instead, filling $ins_vals like this:
$ins_vals[] = $x;
$ins_vals[] = $y;
and using beginTransaction and commit now makes my script unbelievably fast.
Now it takes only about 1 minute to add my 100,000 points.
I think both array_merge and that "ugly" looping through the PDOStatement slowed down my script.
And why on earth did someone downvote my question? Are you punishing me for my missing knowledge? Thanks.
OK, I wrote a class where I set the SQL and then add the values for each row with a method call. Whenever it reaches a specific limit, it starts a transaction, prepares the statement with as many placeholders as there are values, executes it with the array holding all the values, and then commits (a sketch of such a class follows below).
This seems to be fast enough; at least it doesn't get slower anymore.
For some reason it's faster to add values in a flat structure, as Daniel suggested. That's enough for me.
Sometimes it's good to have a function do one step of the insertion, because when the function returns, all the memory used inside it is freed, so your memory usage stays low.
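Purely as an illustration of the approach described above (not the actual class), here is a minimal sketch of such a batching helper using PDO; the class and method names are made up:

<?php
// Sketch of a batch inserter: buffer rows, then flush them with one prepared
// statement inside a transaction whenever a limit is reached.
class BatchInserter
{
    private $pdo;
    private $sqlPrefix;   // e.g. "INSERT INTO mypoints (x, y) VALUES "
    private $limit;
    private $rows = [];   // one array of values per buffered row

    public function __construct(PDO $pdo, $sqlPrefix, $limit = 1000)
    {
        $this->pdo = $pdo;
        $this->sqlPrefix = $sqlPrefix;
        $this->limit = $limit;
    }

    public function addRow(array $values)
    {
        $this->rows[] = $values;
        if (count($this->rows) >= $this->limit) {
            $this->flush();
        }
    }

    public function flush()
    {
        if (!$this->rows) {
            return;
        }
        // One "(?,?)" group per buffered row, values passed as a flat array
        $group        = '(' . implode(',', array_fill(0, count($this->rows[0]), '?')) . ')';
        $placeholders = implode(',', array_fill(0, count($this->rows), $group));
        $flat         = [];
        foreach ($this->rows as $row) {
            foreach ($row as $v) {
                $flat[] = $v;
            }
        }

        $this->pdo->beginTransaction();
        $this->pdo->prepare($this->sqlPrefix . $placeholders)->execute($flat);
        $this->pdo->commit();

        $this->rows = [];   // the batch's memory is released here
    }
}

Usage would look something like $batch = new BatchInserter($dbh, 'INSERT INTO mypoints (x, y) VALUES '); then $batch->addRow([$x, $y]); inside the fetch loop, plus a final $batch->flush(); at the end.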
I have a mysql table with a lot of data in it. All of the rows in this table need to have one field modified in a way that is not easily expressed in pure SQL.
I'd like to be able to loop over the table row by row and update each entry one by one.
However, to do this I would do something like:
$sql = "SELECT id,value FROM objects";
foreach ($dbh->query($sql) as $row)
{
$value = update_value( $row['value'] );
$id = $row['id'];
$update_sql = "UPDATE objects SET value='$value' WHERE id=$d";
$dbh->query( $update_sql );
}
Will this do something bad? (Other than potentially being slow?)
Clarification: In particular, I'm worried about the first SELECT using a cursor rather than retrieving all the data in one hit within the foreach, and about some cursor-invalidation issue I don't know about being caused by the update inside the loop. If there is some rule like "don't update the same table while scanning it with another cursor", it will likely only show up on huge tables, so a small test case on my part is pretty much useless.
If someone can point me to docs that say doing this is OK, rather than to a particular problem with working this way, that would also be great.
The results of a single query are consistent, so updates won't affect it. Things to keep in mind:
Use prepared statements; this reduces the traffic between your process and the database, because only the values are transferred each time instead of a whole query.
If you're worried about other processes running at the same time, you should use transactions and proper locking, e.g.
// transaction started
SELECT id,value
FROM objects
LOCK IN SHARE MODE
// your other code
// commit transaction
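As a rough sketch of what the prepared-statement version of the loop from the question could look like (same objects table and update_value() function; the locking follows the outline above and can be dropped if concurrent writers are not a concern):

<?php
// Sketch: row-by-row update with a prepared UPDATE, so only the values travel
// to the server on each iteration; the whole pass runs in one transaction.
$dbh->beginTransaction();

$update = $dbh->prepare("UPDATE objects SET value = :value WHERE id = :id");

foreach ($dbh->query("SELECT id, value FROM objects LOCK IN SHARE MODE") as $row) {
    $update->execute([
        ':value' => update_value($row['value']),
        ':id'    => $row['id'],
    ]);
}

$dbh->commit();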
Seems like you have two options right out of the gate:
(straightforward): use something like fetchAll() to get all the results of the first query before you start looping through it. This will keep you from having overlapping cursors (see the sketch after this list).
(more obscure): change this to use a stored function (in place of update_value) so you can turn the two queries into a single UPDATE objects SET value = some_function(id).
Depending on the size and duration of this, you may need to lock everything beforehand.
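A minimal sketch of the first option, reusing the code from the question:

<?php
// Sketch: materialize the full result set first, then update row by row,
// so the SELECT is finished before any UPDATE runs.
$rows = $dbh->query("SELECT id, value FROM objects")->fetchAll(PDO::FETCH_ASSOC);

$update = $dbh->prepare("UPDATE objects SET value = ? WHERE id = ?");
foreach ($rows as $row) {
    $update->execute([update_value($row['value']), $row['id']]);
}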
I want to use PDO in my application, but before that I want to understand how PDOStatement->fetch and PDOStatement->fetchAll work internally.
For my application, I want to do something like "SELECT * FROM myTable" and write the result to a CSV file; the table has around 90,000 rows of data.
My question is, if I use PDOStatement->fetch as I am using it here:
// First, prepare the statement
$query = "SELECT * FROM tableName";
$stmt = $this->connection->prepare($query);

// Execute the statement
$stmt->execute();

var_dump($stmt->fetch(PDO::FETCH_ASSOC)); // note: this already consumes the first row

while ($row = $stmt->fetch(PDO::FETCH_ASSOC))
{
    echo "Hi";
    // Export every row to a file
    fputcsv($data, $row);
}
After every fetch from the database, will the result of that fetch be stored in memory?
Meaning, when I do the second fetch, would memory hold the data of the first fetch as well as the second?
So if I have 90,000 rows of data and I am fetching them one at a time, would memory keep accumulating each new fetch result without releasing the previous ones, so that by the last fetch memory already holds 89,999 rows?
Is this how PDOStatement::fetch works?
Performance-wise, how does this stack up against PDOStatement::fetchAll?
Update: Something about fetch and fetchAll from a memory usage point of view.
Just wanted to add something to this question, as I recently found out something regarding fetch and fetchAll; hopefully this will make the question worthwhile for people who visit it in the future to get some understanding of fetch and fetchAll.
fetch does not store information in memory; it works row by row. It goes through the result set and returns row 1, then goes back to the result set and returns row 2; note that it will not return rows 1 and 2 together, only row 2. So fetch stores nothing in memory, whereas fetchAll stores the whole result set in memory. fetch is therefore the better option compared to fetchAll when dealing with a result set of around 100K rows.
PHP generally keeps its results on the server. It all depends on the driver. MySQL can be used in an "unbuffered" mode, but it's a tad tricky to use. fetchAll() on a large result set can cause network flooding, memory exhaustion, etc.
In every case where I need to process more than 1,000 rows, I'm not using PHP. Consider also if your database engine already has a CSV export operation. Many do.
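For instance, MySQL can write the CSV itself with SELECT ... INTO OUTFILE; note that the file is created on the database server's host and the MySQL user needs the FILE privilege. A rough sketch, with a placeholder path and the table name from the question:

<?php
// Sketch: let MySQL produce the CSV server-side instead of looping in PHP.
// '/tmp/export.csv' is a placeholder and must not already exist on the server.
$sql = "SELECT * FROM tableName
        INTO OUTFILE '/tmp/export.csv'
        FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'";
$this->connection->exec($sql);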
I advise you to use PDO::FETCH_LAZY instead of PDO::FETCH_ASSOC for big data.
I used it to export to CSV row by row, and it works fine, without any "out of memory" errors.
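A minimal sketch of that row-by-row export, assuming an already-open $pdo connection; the output path and column names are placeholders:

<?php
// Sketch: stream rows to a CSV file one at a time using FETCH_LAZY,
// which reuses a single row object instead of accumulating an array.
$out  = fopen('/tmp/export.csv', 'w');          // placeholder path
$stmt = $pdo->prepare("SELECT * FROM tableName");
$stmt->execute();

while ($row = $stmt->fetch(PDO::FETCH_LAZY)) {
    // Columns are read on demand from the lazy row object
    fputcsv($out, [$row->id, $row->name]);      // placeholder column names
}

fclose($out);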