PDO PHP Postgres: slow fetching of data

I was playing with PDO on PostgreSQL 9.2.4 and trying to fetch data from a table having millions of rows. My query returns about 100,000 rows.
I don't use any of PDOStatement's fetch functions; I simply use the result from the PDO object itself and loop through it.
But it keeps getting slower over time. At the beginning it fetched around 200 rows per second, but the closer it gets to the end, the slower it becomes. Now, at row 30,000, it fetches only 1 row per second. Why is it getting slower?
I do this, it's pretty simple:
$dbh = new PDO("pgsql...");
$sql = "SELECT x, y FROM point WHERE name IS NOT NULL AND place IN ('area1', 'area2')";
$res = $dbh->query($sql);
$ins_sql = "INSERT INTO mypoints (x, y) VALUES ";
$ins_vals = [];
$ins_placeholders = [];
foreach ($res as $row) {
    $ins_placeholders[] = "(?,?)";
    $ins_vals = array_merge($ins_vals, [$row['x'], $row['y']]);
    printCounter();
}
// now build up one insert query using placeholders and values,
// to insert all of them in one shot into table mypoints
The printCounter function simply increments an int variable and prints it, so I can see how many rows have already been put into that array before I create my insert statement out of it. I use a single multi-row insert to speed things up; that's better than doing 100,000 separate inserts.
But that foreach loop keeps getting slower over time. How can I increase the speed?
Is there a difference between fetch() and the simple loop over the PDOStatement in a foreach?
When I start this PHP script, the query itself takes about 5-10 seconds, so this has nothing to do with how the table is set up or whether I need indexes.
I have other tables returning 1 million rows, and I'm not sure what the best way to fetch them is. I can raise PHP's memory_limit if needed, so the most important thing for me is SPEED.
Appreciate any help.

It's not likely that the slowness is related to the database, because after the $dbh->query() call, the query is finished and the resulting rows are all in memory (they are not in PHP variables yet, but they're in memory accessible at the pgsql module level).
The more likely culprit is the array_merge operation. The array becomes larger at every loop iteration, and the operation recreates the entire array each time, so the cost of each iteration grows with the number of rows already collected, which is why the loop keeps slowing down.
You may want to do instead:
$ins_vals[] = [$row['x'], $row['y']];
Although personally, when concerned with speed, I'd use an even simpler flat structure:
$ins_vals[] = $x;
$ins_vals[] = $y;
Another, unrelated point: the code builds a query with a huge number of placeholders, which is not how placeholders are normally used. To send large numbers of values to the server, the efficient way is to use COPY, possibly into a temporary table followed by server-side merge operations if it's not a plain insertion.
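For illustration, here is a minimal sketch of the COPY approach using the pgsql driver's PDO::pgsqlCopyFromArray() (it assumes the $dbh connection and the mypoints(x, y) table from the question; the tab separator is the method's default):

$res = $dbh->query("SELECT x, y FROM point WHERE name IS NOT NULL AND place IN ('area1', 'area2')");
$copyRows = [];
foreach ($res as $row) {
    // one tab-separated line per row, matching the column order of mypoints
    $copyRows[] = $row['x'] . "\t" . $row['y'];
}
// bulk-load everything in a single COPY instead of thousands of placeholders
$dbh->pgsqlCopyFromArray('mypoints', $copyRows);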

I don't know why, but using the fetch() method instead and filling $ins_vals like this:
$ins_vals[] = $x;
$ins_vals[] = $y;
and using beginTransaction and commit now makes my script unbelievably fast.
Now it takes only about 1 minute to add my 100,000 points.
I think both array_merge and that "ugly" looping through the PDOStatement slowed down my script.
And why the heck did someone downvote my question? Are you punishing me because of my missing knowledge? Thanks.
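For reference, a minimal sketch of that faster version (same connection and tables as above; this is an approximation, not the exact code used):

$res = $dbh->query("SELECT x, y FROM point WHERE name IS NOT NULL AND place IN ('area1', 'area2')");
$ins_vals = [];
$ins_placeholders = [];
while ($row = $res->fetch(PDO::FETCH_ASSOC)) {
    $ins_placeholders[] = "(?,?)";
    $ins_vals[] = $row['x'];   // flat structure, no array_merge
    $ins_vals[] = $row['y'];
}
$dbh->beginTransaction();
$stmt = $dbh->prepare("INSERT INTO mypoints (x, y) VALUES " . implode(',', $ins_placeholders));
$stmt->execute($ins_vals);
$dbh->commit();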

OK, I wrote a class where I set the SQL and then add the values for each row with a method call. Whenever it reaches a specific limit, it starts a transaction, prepares the statement with as many placeholders as there are values, executes it with the array holding all the values, and then commits.
This seems to be fast enough; at least it doesn't get slower anymore.
For some reason it's faster to add values in a flat structure, as Daniel suggested. That's enough for me.
Sometimes it's good to have a function doing one step of the insertion, because when the function returns, all the memory used inside it is freed, so your memory usage stays low.
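A rough sketch of such a class (the class and method names are made up for illustration; it assumes a PDO connection and a fixed number of columns per row):

class BatchInserter
{
    private $dbh;
    private $sql;        // e.g. "INSERT INTO mypoints (x, y) VALUES "
    private $cols;       // values per row
    private $limit;      // rows per flush
    private $vals = [];

    public function __construct(PDO $dbh, $sql, $cols, $limit = 1000)
    {
        $this->dbh   = $dbh;
        $this->sql   = $sql;
        $this->cols  = $cols;
        $this->limit = $limit;
    }

    public function addRow(array $row)
    {
        foreach ($row as $v) {
            $this->vals[] = $v;                 // keep a flat value array
        }
        if (count($this->vals) >= $this->limit * $this->cols) {
            $this->flush();
        }
    }

    public function flush()
    {
        if (!$this->vals) {
            return;
        }
        $rowCount       = (int) (count($this->vals) / $this->cols);
        $rowPlaceholder = '(' . implode(',', array_fill(0, $this->cols, '?')) . ')';
        $placeholders   = implode(',', array_fill(0, $rowCount, $rowPlaceholder));

        $this->dbh->beginTransaction();
        $stmt = $this->dbh->prepare($this->sql . $placeholders);
        $stmt->execute($this->vals);
        $this->dbh->commit();

        $this->vals = [];                       // frees the batch's memory
    }
}

// usage: add rows as they are fetched from the SELECT, flush the remainder at the end
$ins = new BatchInserter($dbh, "INSERT INTO mypoints (x, y) VALUES ", 2);
while ($row = $res->fetch(PDO::FETCH_ASSOC)) {
    $ins->addRow([$row['x'], $row['y']]);
}
$ins->flush();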

Related

How do I paginate or batch a call to mssql_execute()?

When I make a call to an MSSQL database using ad-hoc SQL (for example, "SELECT foo FROM tablename"), I can give a batch size for that call. This is very useful when I expect a lot of data returned.
In my case, I have a table with over 200 million rows, and I'm getting them all. Yes, I have reasons for being a big slurpy data hog like this.
My DB guys said, "Hey, stop using ad-hoc SQL, here, use this nifty SP. It does the same thing."
So I'm using it with the mssql_execute() function call, but there's no way to specify a batch size when doing this as there is with mssql_query().
Not only do I have to do an ini_set('memory_limit', '64G'); to make this work, I also have to sweat it out while the SP call takes upwards of half an hour to run. Once it runs, I can do a loop on mssql_fetch_row(), no problem, but that initial call is a nail-biter!
And once I'm done, I have a process taking up 57G of memory (on a 96G box) that then takes a full hour at 80% CPU just to unwind and garbage collect. Yeah, I could kill the process, but that's a hack.
There has to be a better way!
With ad-hoc SQL, I call mssql_query() with a batch size of 10,000 rows and process them and then go back for more. I can then do something like echo "Yes, indeed, I'm on row $i right now..." and salve my paranoia that everything is running right.
So... what's the appropriate way to do this if I'm forced to use the SP that my DB guys want me to use?
Assuming the table has a primary key, I suggest you ask the DB guys to add 2 parameters to the stored procedure, one for the number of rows to be returned and another for the starting key value. Pass NULL for the initial batch and the last key value returned for each subsequent batch. This will provide efficient forward-only pagination. For example:
CREATE PROCEDURE dbo.usp_select_tablename
    @NumRows int
    , @StartKey int = NULL
AS
IF @StartKey IS NULL
BEGIN
    SELECT TOP(@NumRows) foo, [Key]
    FROM tableName
    ORDER BY [Key];
END
ELSE
BEGIN
    SELECT TOP(@NumRows) foo, [Key]
    FROM tableName
    WHERE [Key] > @StartKey
    ORDER BY [Key];
END;
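On the PHP side, the client loop would then look roughly like the sketch below, using the legacy mssql extension's stored-procedure API (it assumes the procedure above, that the key column comes back as "Key", and an existing connection $conn; SQLINT4 is the extension's int type constant):

$batchSize = 10000;
$startKey  = null;
$i = 0;

do {
    $stmt = mssql_init('dbo.usp_select_tablename', $conn);
    mssql_bind($stmt, '@NumRows', $batchSize, SQLINT4);
    // NULL start key on the first batch, last seen key afterwards
    mssql_bind($stmt, '@StartKey', $startKey, SQLINT4, false, $startKey === null);

    $result = mssql_execute($stmt);

    $rowsInBatch = 0;
    while ($row = mssql_fetch_assoc($result)) {
        // ... process $row['foo'] ...
        $startKey = $row['Key'];          // remember where this batch ended
        $rowsInBatch++;
        $i++;
    }
    mssql_free_statement($stmt);

    echo "Yes, indeed, I'm on row $i right now...\n";
} while ($rowsInBatch === $batchSize);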

Recommendation for mysqli batch queries

My use case:
I have multiple scripts inserting into a table in the order of several inserts per second. I am seeing performance degradation, so I think there would be performance benefits in "batching queries" and inserting several hundred rows every minute or so.
Question:
How would I go about doing this using mysqli? My current code uses a wrapper (pastebin), and looks like:
$array = array(); // BIG ARRAY OF VALUES (more than 100k rows worth)
foreach ($array as $key => $value) {
    $db->q('INSERT INTO `player_items_attributes` (`column1`, `column2`, `column3`) VALUES (?, ?, ?)', 'iii', $value['test1'], $value['test2'], $value['test3']);
}
Notes:
I looked at using transactions, but it sounds like those would still hit the server, instead of queuing. I would prefer to use a wrapper (feel free to suggest one with similar functionality to what my current one offers), but if not possible I will try to build suggestions into the wrapper I use.
Sources:
Wrapper came from here
Edit:
I am trying to optimize table speed, rather than script speed. This table has more than 35 million rows and has a few indexes.
The MySQL INSERT syntax allows for one INSERT query to insert multiple rows, like this:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
where each set of parenthesised values represents another row in your table. So, by working down the array you could create multiple rows in one query.
There is one major limitation: the total size of the query must not exceed the configured limit (max_allowed_packet in MySQL). For 100k rows you'd probably have to break this down into blocks of, say, 250 rows, reducing your 100k queries to 400. You might be able to go further.
I'm not going to attempt to code this - you'd have to code something and try it in your environment.
Here's a pseudo-code version:
escape entire array          // array_walk(), real_escape_string()
block_size = 250             // number of rows to insert per query
current_block = 0
rows_array = []

while (next_element <= number of rows) {
    create parenthesised set and push to rows_array   // implode()
    if (current_block == block_size) {
        implode rows_array and append to query
        execute query
        set current_block = 0
        reset rows_array
        reset query
    }
    current_block++
    next_element++
}
if (there are any records left over) {
    implode rows_array and append to query
    execute the query for the last block
}
I can already think of a potentially faster implementation with array_map() - try it.
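For what it's worth, a minimal concrete sketch of the chunked approach (it assumes a mysqli connection $mysqli and the table/columns from the question; the values are cast to int since the wrapper binds them as 'iii'):

$blockSize = 250;

foreach (array_chunk($array, $blockSize) as $block) {
    // build one "(v1, v2, v3)" group per row
    $groups = array_map(function ($value) {
        return sprintf(
            '(%d, %d, %d)',
            (int) $value['test1'],
            (int) $value['test2'],
            (int) $value['test3']
        );
    }, $block);

    $sql = 'INSERT INTO `player_items_attributes` (`column1`, `column2`, `column3`) VALUES '
         . implode(',', $groups);
    $mysqli->query($sql);
}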

Can I perform concurrent requests and modifications using PDO?

I have a mysql table with a lot of data in it. All of the rows in this table need to have one field modified in a way that is not easily expressed in pure SQL.
I'd like to be able to loop over the table row by row, and update all the entries one by one.
However to do this I would do something like:
$sql = "SELECT id,value FROM objects";
foreach ($dbh->query($sql) as $row)
{
$value = update_value( $row['value'] );
$id = $row['id'];
$update_sql = "UPDATE objects SET value='$value' WHERE id=$d";
$dbh->query( $update_sql );
}
Will this do something bad? (Other than potentially being slow?)
Clarification: In particular, I'm worried about the first select using a cursor (rather than retrieving all the data in one hit before the foreach), and then there being some cursor invalidation I don't know about, caused by the update inside the loop. If there is some rule like "don't update the same table while scanning it with another cursor", it's likely to only show up on huge tables, so a small test case is pretty much useless.
If someone can point me to docs that say doing this is OK, rather than a particular problem with working this way, that'd also be great.
The results of a single query are consistent, so later updates won't affect them. Things to keep in mind:
Use prepared statements; it will reduce the traffic between your process and the database, because only the values are transferred instead of a whole query every time.
If you're worried about other processes running at the same time, you should use transactions and proper locking, e.g.
// transaction started
SELECT id,value
FROM objects
LOCK IN SHARE MODE
// your other code
// commit transaction
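A hedged sketch of that combination, a prepared statement plus a transaction with a shared lock, reusing the $dbh connection and the update_value() function from the question:

$dbh->beginTransaction();

// read the rows under a shared lock so concurrent writers can't change them mid-run
$rows = $dbh->query("SELECT id, value FROM objects LOCK IN SHARE MODE")
            ->fetchAll(PDO::FETCH_ASSOC);

// one prepared UPDATE, executed with fresh values for each row
$update = $dbh->prepare("UPDATE objects SET value = :value WHERE id = :id");
foreach ($rows as $row) {
    $update->execute([
        ':value' => update_value($row['value']),
        ':id'    => $row['id'],
    ]);
}

$dbh->commit();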
Seems like you have two options right out of the gate:
(straightforward): use something like fetchAll to get all the results of the first query before you start looping through it. This will help keep you from overlapping cursors.
(more obscure): change this to use a stored function (in place of update_value) so you can turn the two queries into a single UPDATE objects SET value = some_function(id).
Depending on the size and duration of this, you may need to lock everything beforehand.

How does PHP PDO work internally?

I want to use PDO in my application, but before that I want to understand how PDOStatement->fetch and PDOStatement->fetchAll work internally.
For my application, I want to do something like "SELECT * FROM myTable" and write the result into a CSV file; the table has around 90,000 rows of data.
My question is, if I use PDOStatement->fetch as I am using it here:
// First, prepare the statement
$query = "SELECT * FROM tableName";
$stmt = $this->connection->prepare($query);
// Execute the statement
$stmt->execute();
var_dump($stmt->fetch(PDO::FETCH_ASSOC)); // debug output; note that this consumes the first row
while ($row = $stmt->fetch(PDO::FETCH_ASSOC))
{
    echo "Hi";
    // Export every row to a file
    fputcsv($data, $row);
}
After every fetch from the database, will the result of that fetch be stored in memory?
Meaning, when I do the second fetch, will memory hold the data of the first fetch as well as the second?
So if I have 90,000 rows of data and I'm fetching them one at a time, does memory keep accumulating each new fetch result without removing the previous ones, so that by the last fetch memory already holds 89,999 rows of data?
Is this how PDOStatement::fetch works?
Performance-wise, how does this stack up against PDOStatement::fetchAll?
Update: something about fetch and fetchAll from a memory-usage point of view
Just wanted to add something to this question, as I recently found out something regarding fetch and fetchAll; hopefully this makes the question worthwhile for people who visit it in the future to get some understanding of fetch and fetchAll.
fetch does not store information in memory; it works row by row. It goes through the result set and returns row 1, then goes back to the result set and returns row 2 (note that it does not return rows 1 and 2 together, only row 2). So fetch stores nothing in memory, whereas fetchAll stores everything in memory. fetch is therefore the better option compared to fetchAll if we are dealing with a result set of around 100K rows.
PHP generally keeps its results on the server. It all depends on the driver. MySQL can be used in an "unbuffered" mode, but it's a tad tricky to use. fetchAll() on a large result set can cause network flooding, memory exhaustion, etc.
In every case where I need to process more than 1,000 rows, I'm not using PHP. Consider also if your database engine already has a CSV export operation. Many do.
I advise you to use PDO::FETCH_LAZY instead of PDO::FETCH_ASSOC for big data.
I used it for exporting to CSV row by row and it works fine.
Without any "out of memory" errors.

How do PHP/MySQL database queries work exactly?

I have used MySQL a lot, but I always wondered exactly how it works: when I get a positive result, where is the data stored exactly? For example, I write like this:
$sql = "SELECT * FROM TABLE";
$result = mysql_query($sql);
while ($row = mysql_fetch_object($result)) {
    echo $row->column_name;
}
When a result is returned, I'm assuming it holds all the result data. Or does it return fragments, only returning data when it is asked for, like $row->column_name?
Or does it really return every single row of data even if you only wanted one column in $result?
Also, if I paginate using LIMIT, does it hold THAT original (old) result even if the database is updated?
The details are implementation dependent but generally speaking, results are buffered. Executing a query against a database will return some result set. If it's sufficiently small all the results may be returned with the initial call or some might be and more results are returned as you iterate over the result object.
Think of the sequence this way:
1. You open a connection to the database;
2. There is possibly a second call to select a database, or it might be done as part of (1);
3. That authentication and connection step is (at least) one round trip to the server (ignoring persistent connections);
4. You execute a query on the client;
5. That query is sent to the server;
6. The server has to determine how to execute the query;
7. If the server has previously executed the query, the execution plan may still be in the query cache. If not, a new plan must be created;
8. The server executes the query as given and returns a result to the client;
9. That result will contain some buffer of rows that is implementation dependent. It might be 100 rows, or more, or less. All columns are returned for each row;
10. As you fetch more rows, eventually the client will ask the server for more rows. This may be when the client runs out or it may be done preemptively. Again, this is implementation dependent.
The idea of all this is to minimize roundtrips to the server without sending back too much unnecessary data, which is why if you ask for a million rows you won't get them all back at once.
LIMIT clauses--or any clause in fact--will modify the result set.
Lastly, (7) is important because SELECT * FROM table WHERE a = 'foo' and SELECT * FROM table WHERE a = 'bar' are two different queries as far as the database optimizer is concerned so an execution plan must be determined for each separately. But a parameterized query (SELECT * FROM table WHERE a = :param) with different parameters is one query and only needs to be planned once (at least until it falls out of the query cache).
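To illustrate that last point, a parameterized query is prepared once and then executed with different values; a short sketch with PDO (the $pdo connection is a placeholder):

// prepared once; subsequent executions reuse the same statement with new values
$stmt = $pdo->prepare('SELECT * FROM table WHERE a = :param');

$stmt->execute([':param' => 'foo']);
$foos = $stmt->fetchAll(PDO::FETCH_ASSOC);

$stmt->execute([':param' => 'bar']);
$bars = $stmt->fetchAll(PDO::FETCH_ASSOC);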
I think you are confusing the two types of variables you're dealing with, and neither answer really clarifies that so far.
$result is a MySQL result object. It does not "contain any rows." When you say $result = mysql_query($sql), MySQL executes the query, and knows what rows will match, but the data has not been transferred over to the PHP side. $result can be thought of as a pointer to a query that you asked MySQL to execute.
When you say $row = mysql_fetch_object($result), that's when PHP's MySQL interface retrieves a row for you. Only that row is put into $row (as a plain old PHP object, but you can use a different fetch function to ask for an associative array, or specific column(s) from each row.)
Rows may be buffered with the expectation that you will be retrieving all of the rows in a tight loop (which is usually the case), but in general, rows are retrieved when you ask for them with one of the mysql_fetch_* functions.
If you only want one column from the database, then you should SELECT that_column FROM .... Using a LIMIT clause is also a good idea whenever possible, because MySQL can usually perform significant optimizations if it knows that you only want a certain group of rows.
The first question can be answered by reading up on resources
Since you are SELECTing "*", every column is returned for each mysql_fetch_object call. Just look at print_r($row) to see.
In simple words, the resource returned is like an ID that the MySQL library associates with other data. I think of it like the identification card in your wallet: it's just a number and some information, but it is associated with a lot more information if you give it to the government, your cell-phone company, etc.
