So I have a custom artisan command that I wrote to slug a column and store it into a new column. I have a progress bar implemented, and for some reason, when the command reaches 50% completion it jumps to 100%. The issue is that it has only executed the code on half of the data.
I am using the chunk() function to break the data into chunks of 1,000 rows to eliminate memory exhaustion issues. This is necessary because my dataset is extremely large.
I have looked through my PHP error logs, MySQL error logs, and Laravel logs, and I can't find any error or log line pertaining to this command. Any ideas on where to even start looking for the issue?
$jobTitles = ModelName::where($columnName, '<>', '')
    ->whereNull($slugColumnName)
    ->orderBy($columnName)
    ->chunk(1000, function ($jobTitles) use ($jobCount, $bar, $columnName, $slugColumnName) {
        foreach ($jobTitles as $jobTitle) {
            $jobTitle->$slugColumnName = Str::slug($jobTitle->$columnName);
            $jobTitle->save();
        }
        $bar->advance(1000);
    });
$bar->finish();
What's happening is that whereNull($slugColumnName), combined with the callback setting $slugColumnName, causes results to be missed on subsequent chunks.
The order of events is something like this:
Get first set of rows: select * from table where column is null limit 100;
For each of the rows, set column to a value.
Get next set of rows: select * from table where column is null limit 100 offset 100;
Continue and increase the offset until no more results.
The problem is that after the second step you have removed 100 results from the total. Say you begin with 1,000 matching rows; by the time the second query runs, only 900 rows still match.
The offset therefore appears to skip an entire chunk: the second query starts at row 100 of the remaining matches, even though those first 100 remaining rows have not been touched yet.
For more detail, see the section on chunking in the official documentation.
I have not tested this to verify it works as expected for your use-case, but it appears that using chunkById will account for this issue and correct your results.
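Applied to the command above, that change might look something like the following. This is only a sketch: it assumes the model's primary key is id (chunkById pages by the key, so the orderBy call is dropped), and it advances the bar by the actual chunk size since the last chunk may hold fewer than 1,000 rows.

use Illuminate\Support\Str;

ModelName::where($columnName, '<>', '')
    ->whereNull($slugColumnName)
    ->chunkById(1000, function ($jobTitles) use ($bar, $columnName, $slugColumnName) {
        foreach ($jobTitles as $jobTitle) {
            $jobTitle->$slugColumnName = Str::slug($jobTitle->$columnName);
            $jobTitle->save();
        }
        // Advance by the real number of rows processed in this chunk.
        $bar->advance(count($jobTitles));
    });

$bar->finish();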
Related
I'm running two Laravel (front) / Lumen (api) applications that are using the same DB. (I'm aware that the separation might not be ideal, but this is only for testing.)
When I count the number of rows in each application, the front returns 50000 rows (correct, matching what I get when I query the DB directly) and the api returns 50001. Always one off.
When outputting, I tried counting the number of rows to see if it was just the rowCount method that was wrong, but it does indeed output 50001 rows. With that many rows it's difficult to see exactly which row is the extra one.
I stumbled across this: Php PDO rowCount() return wrong result
I'm always running:
DB::connection()->getPdo()->prepare($query)
to get the rows.
And then originally:
$result->rowCount()
Changed to counting separately - as per suggestion:
$count = DB::connection()->getPdo()->query("SELECT count(*) AS cnt FROM table")->fetchColumn(0);
Then:
DB::table()->select()->count()
And lastly:
DB::table()->select('SELECT count(*) as cnt')->value('cnt')
But all of them return exactly one row too many. I'm also aware that they probably all fall back on PDO.
My first reaction was that it was a transaction, but I'm not using any, unless there is something built in by default in Lumen/Laravel?
I am running a query and retrieving the results with oci_fetch_array, and I am getting a fatal out-of-memory error after I hit a certain volume of records. The result set is 100k rows and about 60 columns.
I have my memory_limit in php.ini set to 2 gigs.
memory_limit = 2056M
It seems to happen when I have more than one person running the script at the same time (or same person running twice as it is set up to run in the background).
It only takes 2 concurrent jobs of 100k records to cause the error.
Everything I've found on oci_fetch_array states that it isn't caching the whole result set into memory, but it looks like it IS.
This is my code (very straightforward):
while ($row = oci_fetch_array($stid, OCI_ASSOC + OCI_RETURN_NULLS)) {
    array_push($resultfile, $row);
    $tablerow = $tablerow + 1;
    unset($row);
}
The error happens on the oci_fetch_array call after it hits a certain number of loops.
The output file is only about 94 MB on average, so it doesn't seem like I should be anywhere near the memory limit.
The line below is what is causing the high memory usage:
array_push($resultfile, $row);
oci_fetch_array() is unbuffered, meaning it fetches rows one by one until no rows are left. I would suggest not pushing each row into another array; instead, write your logic inside the while loop itself.
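For example, if the end goal is the output file, something along these lines streams each row straight to disk instead of accumulating the whole result set in $resultfile. This is only a sketch; the output path and CSV format are assumptions:

$fp = fopen('/tmp/result.csv', 'w'); // assumed output path
$tablerow = 0;

while ($row = oci_fetch_array($stid, OCI_ASSOC + OCI_RETURN_NULLS)) {
    fputcsv($fp, $row);   // write the row out immediately, keeping nothing in memory
    $tablerow = $tablerow + 1;
}

fclose($fp);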
I'm running on IBM i (an AS/400) V7R2, PHP v5.6.5, Zend Server v8.0.2.
I have a query which takes less than a second to execute from iNavigator. When I run the same query from a PHP script and then loop through it using:
$numRows = 0;
while ($row = db2_fetch_assoc($stmt))
{
    //Do stuff
    $numRows++;
}
echo $numRows;
$numRows ends up only being a fraction of the expected result set and I get this error in the Zend logs:
PHP Warning: db2_fetch_assoc(): Fetch Failure in /path/to/script.php on line XXX
Note that the value of $numRows varies every time I run it. It is almost as if it is timing out and stopping before it can iterate through the whole result set, yet the page loads in seconds. Apart from the missing results, everything seems to function and load perfectly fine on the page.
Does anyone know what might be contributing to this behavior?
Is it possible that the data has errors? One possibility is decimal data errors.
@Buck Calabro got me on the right track. The issue wasn't decimal data errors but rather a sub-query in the view definition which was returning more than one row. So it was a "Result of SELECT more than one row" error.
If I did a simple SELECT * FROM VIEW1 in iNavigator or PHP everything seemed to come up fine. It wasn't until I either ran the mentioned query in STRSQL or ran the view definition manually as if it weren't part of a view in iNavigator that the error would be reported.
To help future users here is basically what was happening.
TABLEA contains a single column with 10 rows.
I write a view such as this:
CREATE VIEW VIEWA (COL1, COL2, COL3)
AS SELECT 1, 2, (
SELECT * FROM TABLEA
);
The sub-select returns 10 rows and the DB engine doesn't know how to handle that. If you add FETCH FIRST 1 ROW ONLY to the sub-query, the error goes away. That isn't to say you will logically get the correct results, though, since you may need the 2nd row rather than the first. It would also be advisable to specify an ORDER BY and/or WHERE clause to ensure that the first (and only) row returned is the one you actually want.
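Put together, the corrected view might look something like this. It is only a sketch: <some condition> and <some col> are placeholders for whatever narrows the sub-query down to the row you actually need.

CREATE VIEW VIEWA (COL1, COL2, COL3)
AS SELECT 1, 2, (
    SELECT * FROM TABLEA
    WHERE <some condition>
    ORDER BY <some col>
    FETCH FIRST 1 ROW ONLY
);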
The situation is something like the following:
1- A MySQL InnoDB table undergoes a transactional select as follows:
<?php
....
doQuery('START TRANSACTION');
$sql = "SELECT * FROM table where amount < 10 FOR UPDATE";
$res = doQuery($sql);
// Then a loop through $res updates some fields (the amount field) in the same table, setting it to values greater than 10
//After the loop
doQuery('COMMIT');
At the XAMPP localhost, I opened two different browser windows, Firefox and Opera, requesting the script URL at the same time. I expected that only one of them would be able to retrieve values for $res. However, the script returns a fatal error:
Fatal error: Maximum execution time of 30 seconds exceeded
What is the cause of this error? Is it because the two clients, Firefox and Opera, are not able to select, or because they are not able to update?
I also need a solution that keeps the transaction and gives me the expected result, i.e. only one browser returns results.
You could just add set_time_limit(0); at the top of the script, but that's not a good solution for scripts accessible via HTTP.
Your script enters a deadlock. To avoid this, add an ORDER BY to the query to ensure that both queries try to select the records in the same order. Also make sure there is an index on amount, otherwise the query will have to lock the entire table.
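Concretely, the changes might look something like this sketch. It assumes the table has an id primary key to order by and reuses the doQuery() helper from the question; neither is confirmed by the original code.

// One-time schema change: an index on amount, so the WHERE clause
// doesn't force InnoDB to lock every row in the table.
doQuery('CREATE INDEX idx_amount ON table (amount)');

doQuery('START TRANSACTION');

// Ordering by the primary key means both clients try to lock rows in the
// same order, so the second one waits instead of deadlocking with the first.
$sql = 'SELECT * FROM table WHERE amount < 10 ORDER BY id FOR UPDATE';
$res = doQuery($sql);

// ... loop through $res and update the amount field as before ...

doQuery('COMMIT');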
When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
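One shape that alternative can take: materialize the filtered rows in a CTE, count them once, and attach the count with a join that still yields a row when the requested page is empty. A sketch only, reusing the placeholders from the query above:

WITH cte AS (
   SELECT *
   FROM   bar
   WHERE  <some condition>
   )
, ct AS (SELECT count(*) AS full_count FROM cte)
SELECT sub.*, ct.full_count
FROM  (
   SELECT *
   FROM   cte
   ORDER  BY <some col>
   LIMIT  <pagesize>
   OFFSET <offset>
   ) sub
RIGHT  JOIN ct ON true;  -- keeps one row carrying full_count even when OFFSET exceeds the row count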
Sequence of events in a SELECT query
( 0. CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. WHERE clause (and JOIN conditions, though none in your example) filter qualifying rows from the base table(s). The rest is based on the filtered subset.
( 2. GROUP BY and aggregate functions would go here.) Not here.
( 3. Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY
( 6. DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
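In PHP, that might look something like the following sketch; the connection string and query are placeholders, and pg_num_rows() reports the rows actually returned (i.e. after LIMIT / OFFSET), not the full count:

$conn = pg_connect('dbname=mydb');   // assumed connection string
$result = pg_query($conn, 'SELECT foo FROM bar LIMIT 20 OFFSET 40');

// Rows returned by this query, after LIMIT / OFFSET were applied.
$returned = pg_num_rows($result);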
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to run the query in its entirety, even if the LIMIT clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used, IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way to do it is to remember the value of the row you are ordering by for the previous row (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
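In PHP/Postgres terms that rewrite might look roughly like this; the table name bar, the column order_row, and the last_seen request parameter are all placeholders:

// Keyset pagination: instead of OFFSET, remember the ORDER BY value of the
// last row on the previous page and start just after it.
$lastSeen = isset($_GET['last_seen']) ? $_GET['last_seen'] : 0;

$result = pg_query_params(
    $conn,
    'SELECT * FROM bar WHERE order_row > $1 ORDER BY order_row LIMIT 10',
    array($lastSeen)
);

while ($row = pg_fetch_assoc($result)) {
    // render the row; keep $row['order_row'] from the last one for the next page link
}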
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say, 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.
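One way to do that in PHP, as a sketch, assuming the APCu extension is available and that $countSql holds the COUNT(*) query used for paging (both assumptions):

$cacheKey = 'page_count:' . md5($countSql);
$total = apcu_fetch($cacheKey, $found);

if (!$found) {
    // Cache miss: run the COUNT(*) query once and keep the result for 5 minutes.
    $total = (int) pg_fetch_result(pg_query($conn, $countSql), 0, 0);
    apcu_store($cacheKey, $total, 300);
}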
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.