I have a Postgres database with multiple schemas, each schema with multiple tables, and each table with a few million rows.
From time to time, I have to download a file containing a few million lines and compare each line against the rows in all tables; if a matching row is found, I must update a column in it.
The first thing I tried was reading the file line by line, running a SELECT query on each table in each schema, and, if the row is found, running the UPDATE. It works fine on my testing platform, but against the actual database it would run forever, since it only executes around 1000 queries per second (I checked with the query: SELECT xact_commit+xact_rollback FROM pg_stat_database WHERE datname = 'mailtng_lists';).
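Stripped down, that first approach looks something like this (schema, table and column names are placeholders; the real script repeats the SELECT/UPDATE for every table in every schema):

<?php
// Placeholder names throughout; in reality this runs against every table in every schema.
$pdo = new PDO('pgsql:host=localhost;dbname=mailtng_lists', 'user', 'pass');

$select = $pdo->prepare('SELECT id FROM some_schema.some_table WHERE email = :email');
$update = $pdo->prepare('UPDATE some_schema.some_table SET flagged = true WHERE id = :id');

$file = fopen('huge_file.txt', 'r');
while (($line = fgets($file)) !== false) {
    $select->execute([':email' => trim($line)]);      // one SELECT per line, per table
    if ($row = $select->fetch(PDO::FETCH_ASSOC)) {
        $update->execute([':id' => $row['id']]);      // one UPDATE per match
    }
}
fclose($file);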
The second thing I tried was to separate the main script from the script that connects to the database. I split the huge file into chunks of 100K lines each and then called the script that does the connection X times with the following command:
foreach ($chunks as $chunk) // $chunks is the result of the split command (no problem here)
{
    exec("updater.php $chunk");
}
But it brought no improvement at all; the number of queries per second was still very low. The last thing I tried was doing the same with shell_exec, so the script wouldn't have to wait for the output, but the server crashed: I had 173 chunks, so it ended up with 173 PHP instances running at once.
Any idea on how to handle this situation?
I have a rather large product feed which I split up into multiple CSV files of 20k lines each.
feed-0.csv
feed-1.csv
etc
I load this CSV into a temp table and export 3 new CSV files which I will later load into separate tables.
products.csv
attributes.csv
prices.csv
The above of course also contain 20k lines each, just like the (split) source. So far so good, all going well.
Another part of the script loads the above 3 CSV files into their respective tables: db.products, db.attributes and db.prices. When I select 1 file (be it feed-0.csv or feed-9.csv, any split file will do) the database is updated with 20k rows in each respective table. Still no problem there.
Now I create a loop where I iterate through the split CSV files and add 20k rows to each table on each pass.
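Simplified, the loop looks something like this (illustrative only, not my exact code; it assumes LOCAL INFILE is enabled on client and server):

<?php
// Illustrative sketch: the real script fills the temp table and exports
// products.csv, attributes.csv and prices.csv on every pass before these loads.
$mysqli = new mysqli('localhost', 'user', 'pass', 'db');

foreach (glob('feed-*.csv') as $feedFile) {
    // ... load $feedFile into the temp table and export the 3 CSVs (not shown) ...

    foreach (['products', 'attributes', 'prices'] as $table) {
        $ok = $mysqli->query(
            "LOAD DATA LOCAL INFILE '{$table}.csv'
             INTO TABLE {$table}
             FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
             LINES TERMINATED BY '\\n'"
        );
        if (!$ok) {
            echo "Load into {$table} failed: " . $mysqli->error . PHP_EOL;
        }
    }
}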
This works well until I hit the 3rd loop; then I get mismatching numbers in the database, e.g.
db.products - 57569 rows
db.attributes - 58661 rows
db.prices - 52254 rows
So while after the previous loop everything was at 40k, now I have mismatching numbers.
I have checked products.csv, attributes.csv and prices.csv on each loop, and each has the 20k lines it should have.
I have tried with random split feeds, e.g. feed-1.csv, feed-5.csv, feed-7.csv or feed-1.csv, feed-8.csv and feed-3.csv. So I changed the files and I changed the order, but each time on the 3rd and further loops I get this problem.
I tried to import split files from different feeds too, but each time, on the 3rd loop, I get incorrect numbers. The source files should be good. When I run just 1 or 2 files in any sequence the results are good.
I suspect that I am hitting some limitation somewhere. I thought it might be an InnoDB buffer issue, so I restarted the server, but the issue remains (InnoDB buffer around 25% used after the 3rd loop).
I am using MariaDB 10.1.38, PHP 7.3.3, and an InnoDB buffer pool size of 500 MB.
Any pointers on which direction to search in for a solution would be welcome!
I have a very large database table (more than 700k records) that I need to export to a .csv file. Before exporting it, I need to check some options (provided by the user via the GUI) and filter the records. Unfortunately this filtering action cannot be achieved via SQL code (for example, a column contains serialized data, so I need to unserialize it and then check whether the record "passes" the filtering rules).
Doing all records at once leads to memory limit issues, so I decided to break the process into chunks of 50k records. So instead of loading 700k records at once, I load 50k records, apply the filters, save to the .csv file, then load another 50k records and so on (until it reaches the 700k records). This way I avoid the memory issue, but it takes around 3 minutes (and this time will increase if the number of records increases).
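Simplified, the chunked version looks something like this (table, column and filter names are illustrative; passesFilters() stands in for the real GUI-driven rules):

<?php
// Illustrative sketch of the chunked export; the real filter rules and columns differ.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$out = fopen('export.csv', 'w');

$chunkSize = 50000;
for ($offset = 0; ; $offset += $chunkSize) {
    $stmt = $pdo->query(sprintf(
        'SELECT * FROM records ORDER BY id LIMIT %d OFFSET %d',
        $chunkSize, $offset
    ));
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break; // past the last chunk
    }
    foreach ($rows as $row) {
        $options = unserialize($row['options']);   // the serialized column
        if (passesFilters($options)) {             // user-provided filtering rules
            fputcsv($out, $row);
        }
    }
}
fclose($out);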
Is there any other way of doing this process (better in terms of time) without changing the database structure?
Thanks in advance!
The best thing one can do is to get PHP out of the mix as much as possible. That is always the case for loading CSV data, or exporting it.
In the example below, I have a 26 million row student table and I will export 200K rows of it. Granted, the column count in the student table is small; it exists mostly for testing other things I do with campus info for students. But you will get the idea, I hope. The issue will be how long it takes for your:
... and then check if the record "passes" the filtering rules.
which in theory could happen in the db engine without PHP. "Without PHP" should be the mantra, but whether that is possible here is yet to be determined. The point is: get PHP processing out of the equation. PHP is many things; an adequate partner for bulk DB processing it is not.
select count(*) from students;
-- 26.2 million
select * from students limit 1;
+----+-------+-------+
| id | thing | camId |
+----+-------+-------+
| 1 | 1 | 14 |
+----+-------+-------+
drop table if exists xOnesToExport;
create table xOnesToExport
( id int not null
);
insert xOnesToExport (id) select id from students where id>1000000 limit 200000;
-- 200K rows, 5.1 seconds
alter table xOnesToExport ADD PRIMARY KEY(id);
-- 4.2 seconds
SELECT s.id,s.thing,s.camId INTO OUTFILE 'outStudents_20160720_0100.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM students s
join xOnesToExport x
on x.id=s.id;
-- 1.1 seconds
The above 1AM timestamped file with 200K rows was exported as a CSV via the join. It took 1 second.
LOAD DATA INFILE and SELECT ... INTO OUTFILE are companion statements that, for one thing, cannot be beaten for speed short of raw table moves. Secondly, people rarely seem to use the latter. They are flexible too, if one looks into everything they can do with use cases and tricks.
For Linux, use LINES TERMINATED BY '\n'. I am on a Windows machine at the moment with the code blocks above. The only differences tend to be the paths to the file and the line terminator.
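For completeness, the companion import of a file in that same format would look roughly like this (students_copy is a placeholder target table whose columns match the exported id, thing, camId):

LOAD DATA INFILE 'outStudents_20160720_0100.txt'
INTO TABLE students_copy
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
-- students_copy is a placeholder; adjust the path and line terminator per platform as noted above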
Unless you tell it to do otherwise, PHP slurps your entire result set at once into RAM. It's called a buffered query. It stops working when your result set contains hundreds of thousands of rows, as you have discovered.
PHP's designers made buffered queries the default to make life simpler for web site developers who only need to read a few rows of data and display them.
You need an unbuffered query to do what you're doing. Your PHP program will then read and process one row at a time. But be careful to make your program read all the rows of that unbuffered result set; you can really foul things up if you leave a partial result set dangling in limbo between MySQL and your PHP program.
You didn't say whether you're using mysqli or PDO. Both of them offer mode settings to make your queries unbuffered. If you're using the old-school mysql_ interface, you're probably out of luck.
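For example, a minimal sketch of both (connection details and table name are placeholders):

<?php
// mysqli: pass MYSQLI_USE_RESULT to get an unbuffered result set.
$mysqli = new mysqli('localhost', 'user', 'pass', 'db');
$result = $mysqli->query('SELECT * FROM big_table', MYSQLI_USE_RESULT);
while ($row = $result->fetch_assoc()) {
    // process one row at a time; read the result set through to the end
}
$result->free();

// PDO: turn off buffered queries on the MySQL driver.
$pdo = new PDO('mysql:host=localhost;dbname=db', 'user', 'pass');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query('SELECT * FROM big_table');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process one row at a time
}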
I have a PHP application showing 3 tables of data, each from the same MySQL table. Each record has an integer field named status which can have the value 1, 2 or 3. Table 1 shows all records with status = 1, Table 2 shows status = 2 and Table 3 shows status = 3.
To achieve this, three MySQL queries could be run, using WHERE to filter by status and iterating through each result set once to populate its table.
Another approach would be to select everything from the table and then iterate through the same result set once for each table, using PHP to test the value of status each time.
Would one of these approaches be significantly more efficient than the other? Or would one of them be considered better practice than the other?
Generally, it's better to filter on the RDBMS side so you can reduce the amount of data you need to transfer.
Transferring data from the RDBMS server over the network to the PHP client is not free. Networks have a capacity, and you can generate so much traffic that it becomes a constraint on your application performance.
For example, recently I helped a user who was running queries many times per second, each generating 13MB of result set data. The queries execute quickly on the server, but they couldn't get the data to his app because he was simply exhausting his network bandwidth. This was a performance problem that didn't happen during his testing, because when he ran one query at a time, it was within the network capacity.
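In this case that just means letting MySQL do the filtering; a rough sketch (the status column follows the question; the table name and connection details are placeholders):

<?php
// One filtered query per status value; only the rows each HTML table needs are transferred.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('SELECT * FROM records WHERE status = :status');

$tables = [];
foreach ([1, 2, 3] as $status) {
    $stmt->execute([':status' => $status]);
    $tables[$status] = $stmt->fetchAll(PDO::FETCH_ASSOC);
}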
If you use the second method you connect to the database only once, so it's more efficient.
And even if it wasn't, it's more elegant that way IMO.
Of course there are some situations where it would be better to query three times (e.g. when getting the info from a single query would be complicated), but in most cases I would do it the second way.
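Roughly (table and column names are illustrative):

<?php
// Single query; split the rows into the three tables in PHP, testing status once per row.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query('SELECT * FROM records')->fetchAll(PDO::FETCH_ASSOC);

$tables = [1 => [], 2 => [], 3 => []];
foreach ($rows as $row) {
    $tables[$row['status']][] = $row;
}
// $tables[1], $tables[2] and $tables[3] now back the three HTML tables.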
I would create a stored procedure that returns all the fields you need, pre-formatted, no more, no less.
Then just loop over the result in PHP without querying any other table.
This way you run only 1 query, and you only get the bytes you need. So: same bandwidth, fewer requests = more performance.
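Something along these lines (the procedure and column names are just an illustration):

<?php
// Assumed procedure, defined once on the server, returning only the pre-formatted fields:
//   CREATE PROCEDURE dashboard_rows()
//       SELECT status, name, created_at FROM records ORDER BY status;

$mysqli = new mysqli('localhost', 'user', 'pass', 'app');
$result = $mysqli->query('CALL dashboard_rows()');

while ($row = $result->fetch_assoc()) {
    // route each row to the right HTML table based on $row['status']
}
$result->free();
$mysqli->next_result(); // a CALL leaves an extra result pending; clear it before further queries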
My PHP application sends a SELECT statement to MySQL with HTTPClient.
It takes about 20 seconds or more.
I thought MySQL couldn't produce the result immediately, because MySQL Administrator showed a "sending data" or "copying to tmp table" state while I was waiting for the result.
But when I send the same SELECT statement from another application like phpMyAdmin or JMeter, it takes 2 seconds or less. 10 times faster!
Does anyone know why MySQL performs so differently?
Like #symcbean already said, PHP's mysql driver caches query results. This is also why you can run another mysql_query() while inside a while ($row = mysql_fetch_array()) loop.
The reason MySQL Administrator or phpMyAdmin shows results so fast is that they append a LIMIT 10 to your query behind your back.
If you want to get your query results fast, I can offer some tips. They all come down to selecting only what you need, when you need it:
Select only the columns you need; don't throw SELECT * everywhere. This might bite you later when you want another column but forget to add it to the SELECT statement, so do this where it matters (like tables with 100 columns or a million rows).
Don't throw a 20-by-1000 table in front of your user. She can't find what she's looking for in a giant table anyway. Offer sorting and filtering. As a bonus, find out what she generally looks for and offer a way to show those records with a single click.
With very big tables, select only the primary keys of the records you need, then retrieve the additional details in the while() loop (see the sketch after these tips). This might look illogical because you make more queries, but when you deal with queries involving ~10 tables, hundreds of concurrent users, locks and query caches, things don't always make sense at first :)
These are some tips I learned from my boss and from my own experience. As always, YMMV.
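A rough sketch of that last tip (table and column names are made up):

<?php
// Fetch only the primary keys first, then pull the details row by row in the loop.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$ids = $pdo->query('SELECT id FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY')
           ->fetchAll(PDO::FETCH_COLUMN);

$details = $pdo->prepare(
    'SELECT o.id, o.total, c.name
       FROM orders o JOIN customers c ON c.id = o.customer_id
      WHERE o.id = :id'
);
foreach ($ids as $id) {
    $details->execute([':id' => $id]);
    $row = $details->fetch(PDO::FETCH_ASSOC);
    // render / process $row
}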
Does anyone know why MySQL performs so differently?
Because MySQL caches query results, and the operating system caches disk I/O (see this link for a description of the process in Linux)
Before I go on: this is purely a question of intuition. That is, I'm not seeking answers to work out specific bugs in my PHP/MySQL code. Rather, I want to understand the range of possible issues that I need to consider in resolving my issue. To that end, I will not post code or attach scripts; I will simply explain what I did and what is happening.
I have written a PHP script that:
Reads a CSV text file of X records to be inserted into a MySQL database table and/or updates duplicate entries where applicable;
Inserts said records into what I will call a "root" table for that data set;
Selects a subset of specific fields from the "root" table and then inserts those records into a "master" table; and
Creates an output export text file from the master table for distribution.
There are several CSV files that I am processing via separate scheduled cron tasks every 30 minutes. All said, from the various sources, there are an estimated 420,000 insert transactions from file to root table, and another 420,000 insert transactions from root table to master table via the scheduled tasks.
One of the tasks involves a CSV file of about 400,000 records by itself. The processing produces no errors, but here's the problem: of the 400,000 records that MySQL indicates have been successfully inserted into the root table, only about 92,000 of those records actually end up in the root table; I'm losing about 308,000 records from that scheduled task.
The other scheduled tasks process about 16,000 and 1,000 transactions respectively, and these transactions process perfectly. In fact, if I reduce the number of transactions from 400,000 to, say, 10,000, then these process just fine as well. Clearly, that's not the goal here.
To address this issue, I have tried several remedies...
Upping the memory of my server (and increasing the max limit in the php.ini file)
Getting a dedicated database with expanded memory (as opposed to a shared VPS database)
Rewriting my code to substantially eliminate stored arrays that suck down memory, processing the fgetcsv() results on the fly
Using INSERT DELAYED MySQL statements (as opposed to plain INSERT statements)
...and none of these remedies have worked as desired.
What range of remedial actions should be considered at this point, given the lack of success in the actions taken so far? Thanks...
The source data in the CSV may have duplicate records. Even though there are 400,000 records in the CSV, your 'insert or update' logic trims them down to a reduced set. Low memory could lead to exceptions etc., but not to this kind of data loss.
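A quick way to confirm this is to count total lines versus distinct key values in the file (the key column index is a placeholder):

<?php
// Compare total lines with distinct keys; a large gap explains the "missing" rows.
$keys = [];
$total = 0;
$csv = fopen('bigfile.csv', 'r');
while (($row = fgetcsv($csv)) !== false) {
    $total++;
    $keys[$row[0]] = true;   // assumes column 0 holds the unique key
}
fclose($csv);
echo "$total lines, " . count($keys) . " distinct keys" . PHP_EOL;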
I suspect there are problems in the CSV file.
My suggestion:
Print some debugging information for each line read from the CSV. This will show you how many lines are processed.
On every insert/update, print the error (if any).
It's something like this:
<?php
$csv = fopen('sample.csv', 'r');
$line = 1;
while (($item = fgetcsv($csv)) !== false) {
    echo 'Line ' . $line++ . '... ';
    $sql = ''; // your SQL query, built from $item
    mysql_query($sql);
    $error = mysql_error();
    if ($error == '') {
        echo 'OK' . PHP_EOL;
    } else {
        echo 'FAILED' . PHP_EOL . $error . PHP_EOL;
    }
}
fclose($csv);
So, if there are any errors, you can see them and find the problem (which lines of the CSV are problematic).