I have a ftp repository that is currently at 2761 files (PDF files).
I have a MySQL table (a list of those files) that's actually at 29k+ files (it hasn't been parsed since a recent upgrade).
I'd like to provide admins with a one-click script that will do the following:
1) Compare the "existing" filenames with the rows in the database table
2) Delete any rows that are not in the existing filesystem
3) Add a new row for a file that doesn't appear in the database table
This usually is handled by an AppleScript/FolderAction/Perl script method, but it's not perfect (it chokes sometimes when large numbers of files are added at a time - like on heavy news nights).
It takes about 10-20 seconds to build the file list from the FTP repository (using $file_list = ftp_nlist($conn_id,$target_dir) ), and I'm not sure how best to go about comparing it with the DB table (I'm positive that a WHERE NOT IN (big_fat_list) would be a nightmare query to run).
Any suggestions?
Load the list of filenames into another table, then perform a couple of queries that fulfill your requirements.
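For example, a rough sketch of that approach, assuming a hypothetical staging table file_staging and a catalogue table files keyed by filename (both names are illustrative):
CREATE TEMPORARY TABLE file_staging (
    filename VARCHAR(255) NOT NULL PRIMARY KEY
);

-- (load the ftp_nlist() result into file_staging here)

-- 1) delete catalogue rows whose file is gone from the FTP repository
DELETE f
FROM files f
LEFT JOIN file_staging s ON s.filename = f.filename
WHERE s.filename IS NULL;

-- 2) add rows for files that are on the server but not yet in the table
INSERT INTO files (filename)
SELECT s.filename
FROM file_staging s
LEFT JOIN files f ON f.filename = s.filename
WHERE f.filename IS NULL;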
Yes, that is the solution. I suggest using a PDO prepared INSERT statement to reduce the time it takes.
Or do what mysqldump does: generate a multi-row insert, i.e.
insert into table (column1, column2, ... ) values
(...),
(...),
...
;
insert into ...
You would have to check the maximum allowed statement size (max_allowed_packet) in the MySQL documentation.
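For instance, a generated batch for the hypothetical file_staging table from the earlier sketch might look like this (the filenames are placeholders):
INSERT INTO file_staging (filename) VALUES
('story-001.pdf'),
('story-002.pdf'),
('story-003.pdf');
-- ...next statement with the next few hundred filenames...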
I usually dump the recursive directory list with dates and file sizes to a temporary table. Then I remove items not found:
delete from a
where not exists (
    select null as nothing
    from temp b
    where a.key = b.key )
Then I update items already there (for file sizes, CRCs):
update a
join temp b on a.key = b.key
set a.nonkeyfield1 = b.nonkeyfield1,
    a.nonkeyfield2 = b.nonkeyfield2
Then I add items found:
insert into A ( field, list)
select field, list
from temp b
where not exists (
select null as nothing
from A
where b.key = a.key )
This is from memory, so test it first before you fly. The select null as nothing keeps you from wasting RAM while you check things.
Related
So I need to left join a table from MySQL with a couple of thousands of ids.
It’s like I need to temporarily build a table for the join then delete it, but that just sounds not right.
Currently the task is done by code but proves pretty slow on the results, and an sql query might be faster.
My thought was to use ...WHERE ID IN (“.$string_of_values.”);
But that cuts off the ids that have no match on the table.
So, how is it possible to tell MySQL to LEFT JOIN a table with a list of ids?
As I understand your task, you need to left-join your working table to your ids, i.e. the output must contain all these ids even if there is no matching row in the working table. Am I correct?
If so, then you must convert your ids list to a rowset.
You already tried saving them to a table. This is a useful and safe practice. Some additional points:
If your dataset is used once and may be dropped immediately after the final query executes, then you may create this table as TEMPORARY. Then you do not need to care about this table: it will be deleted automatically when the connection is closed, but until then it may be reused (including editing its data) within that connection. Of course, the queries which create and fill this table and the final query must be executed over the same connection in that case.
If your dataset is small enough (roughly, not more than a few megabytes) then you may create this table with the option ENGINE = Memory. In this case only the table definition file (a small text file) is actually written to disk, whereas the table body is stored in memory only, so access to it is fast.
You may create one or more indexes on such a table and improve final query performance.
All these options may be combined.
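A minimal sketch combining those options (table and column names are illustrative):
CREATE TEMPORARY TABLE tmp_ids (
    id INT NOT NULL,
    PRIMARY KEY (id)              -- index that the final join can use
) ENGINE = Memory;

INSERT INTO tmp_ids (id) VALUES (1), (2), (22);   -- a couple of thousand ids in practice

SELECT t.id, w.some_column
FROM tmp_ids t
LEFT JOIN working_table w ON w.id = t.id;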
Another option is to create such rowset dynamically.
In MySQL 5.x the only option is to build such a rowset in a subquery, like:
SELECT ...
FROM ( SELECT 1 AS id UNION SELECT 2 UNION SELECT 22 ... ) AS ids_rowset
LEFT JOIN {working tables}
...
In MySQL 8+ you have additional options.
You may do the same but use CTE:
WITH ids_rowset AS ( SELECT 1 AS id UNION SELECT 2 UNION SELECT 22 ... )
SELECT ...
FROM ids_rowset
LEFT JOIN {working tables}
...
Alternatively you may transfer your ids list in some serialized form and parse it to the rowset in the query (in recursive CTE, or by using some table function, for example, JSON_TABLE).
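For example, a sketch of the JSON_TABLE variant, assuming MySQL 8+ and that the id list arrives as a JSON array string (working_table and some_column are placeholders):
SELECT ids.id, w.some_column
FROM JSON_TABLE(
         '[1, 2, 22]',                        -- the serialized id list
         '$[*]' COLUMNS (id INT PATH '$')
     ) AS ids
LEFT JOIN working_table w ON w.id = ids.id;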
All these methods create a once-used rowset (of course, a CTE can be reused within the query). And such a rowset cannot be indexed for query improvement (the server may index this dataset during query execution if it finds that reasonable, but you cannot influence this).
In my database, I have several tables. One is a checkpoint table that makes note of a user choosing to finalize one of their projects. This table contains a timestamp that is automatically created. Whenever a user finalizes their project a new row is added to the checkpoint table (that way we can also keep a history of previous times the project was finalized).
I have several other tables with timestamps (or tables that I could add timestamp columns too) that automatically update when their tables change.
Is there a simple way to be able to tell if any of the other tables have updated their data since the project was last finalized? I do not need to know which tables have changed data just that there are tables that have changed data.
For example, if a user changes data in one of their tables I want to be able to display a message indicating that their project has unfinalized data.
There are a couple of ways that I have thought about doing this:
Checking every single table to see if any timestamps are newer than the latest timestamp in the checkpoint table.
Add an additional timestamp column (I already have a created and updated timestamp column) to the main project table. Most of the other tables are linked directly or indirectly to this main project table. Add triggers to every other table to update this timestamp when their data changes. I am not quite sure yet how to correctly set up a proper trigger for this.
Creating a new table with just the project_id and a timestamp column. Add a trigger to the other tables as shown in option 2.
As new modules are added, I will be adding more tables to the project so will need something that is easy to scale as well.
Each of these approaches seems like there would be a lot of steps involved.
Would one of these approaches be more efficient or viable than another? Is there another approach that I am not thinking about? If triggers are the best way to do this how would I go about setting up the trigger?
A simplified overview of my tables looks like this:
main_project_table
id
user_id (FK to user_table)
created_timestamp
updated_timestamp
checkpoint_group_table (users can choose which group to finalize their project too)
id
user_id (FK to user_table)
group_name
checkpoint_table (the table that records the finalized data and time of finalization)
id
checkpoint_group_id (FK to checkpoint_group_table)
project_id (FK to main_project_table)
project_finalized_timestamp
parent_table (several of these)
id
project_id (FK to main_project_table)
child_table (0 or more of these for each parent_table)
id
parent_id (FK to parent_table)
You really only have three solutions: Middleware, Triggers, and General Log File.
Middleware solution:
Add a timestamp field to each relevant table and declare it as DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP. This will update the timestamp field to the current time on every update.
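For example, adding such a column to one of the tables might look like this (the column name timestamp is simply what the query below assumes):
ALTER TABLE parent_table
    ADD COLUMN `timestamp` TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP;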
Assuming that users are going through some API, you can write a JOIN query that returns the latest timestamp. It would look like this:
SELECT
CASE
WHEN b.timestamp IS NOT NULL THEN 0
WHEN c.timestamp IS NOT NULL THEN 0
WHEN d.timestamp IS NOT NULL THEN 0
WHEN e.timestamp IS NOT NULL THEN 0
ELSE 1
END AS `test`
FROM checkpoint_table a
LEFT JOIN main_project_table b
ON a.project_id = b.id
AND b.timestamp > a.project_finalized_timestamp
LEFT JOIN checkpoint_group_table c
ON b.user_id = c.user_id
AND c.timestamp > a.project_finalized_timestamp
LEFT JOIN parent_table d
ON b.id = d.project_id
AND d.timestamp > a.project_finalized_timestamp
LEFT JOIN child_table e
ON d.id = e.parent_id
AND e.timestamp > a.project_finalized_timestamp
Now when a request is routed to the tables you can run this query, and if test == 0 you return the message.
<?php
class middleware{
public function getMessage(){
// run query
if($data[0]['test'] == 0){ // 0 = something was modified after finalization
return "project has unfinalized data";
}else{
return null;
}
}
}
Trigger Solution:
CREATE TRIGGER checkpoint_group_after_update
AFTER UPDATE ON checkpoint_group_table
FOR EACH ROW
    UPDATE main_project_table
    SET main_project_table.updated_timestamp = CURTIME()
    WHERE main_project_table.user_id = NEW.user_id;
The advantages are that it is perhaps more elegant than the middleware solution. The disadvantages are that triggers are not in plain view, and in my experience, when processes run in the background they are eventually forgotten. In the long term you could be left with a Jenga puzzle, which would make life difficult.
General Log File Solution:
MySQL can log every query on the server. It is possible to read this log file, parse it, and figure out whether any tables were updated. This way you can tell if anything was updated after the project was finalized.
Turn on a general log file.
SET GLOBAL general_log = 'ON';
Set the path of the log file.
SET GLOBAL general_log_file = '/var/log/mysql/mysql_general.log';
Confirm by going to the command terminal.
mysql -se "SHOW VARIABLES" | grep -e general_log
You might need to reset MySQL.
sudo service mysql restart
This script can get you started...
$v = shell_exec("sudo less /var/log/mysql/mysql_general.log");
$lines = explode("\n",$v);
$new = array();
// Re-join wrapped entries: continuation lines start with whitespace.
foreach($lines as $i => $line){
    if(substr($line,0,1) != " "){
        if(isset($l)){
            array_push($new,$l);
        }
        $l = $line;
    }else{
        $l .= preg_replace('/\s+/', ' ', $line);
    }
}
if(isset($l)){
    array_push($new,$l); // keep the final entry as well
}
$lines = $new;
$index = array();
// Split each entry on tabs: timestamp, connection id + command, query text.
foreach($lines as $i => $line){
$e = explode("\t",$line);
$new = array();
foreach($e as $key => $value){
$new[$key] = trim($value);
}
$index[$i] = $new;
}
This will result in entries like this:
array(3) {
[0]=> string(27) "2017-10-01T08:17:04.659274Z"
[1]=> string(8) "70 Query"
[2]=> string(129) "UPDATE checkpoint_group_table SET group_name = 'Dev Group' Where id=6"
}
From here you can use a library called PHP-SQL-Parser to parse out the query.
An advantage of this approach is that it might scale well, since you will not have to add any columns to your database. The disadvantages are that it involves more code, and that means more complexity. You probably cannot really do this solution without writing unit tests for it.
If I were in your situation, I would make a table with the fields project_id (FK) and a boolean is_finalized. Every time a project is finalized, I would add an entry to it.
+-----------------+--------------+
| project_id | is_finalized |
|-----------------|--------------|
| 12 | 1 |
+-----------------+--------------+
Before any update/insert, just check whether this key exists for the project; if it exists, set the flag to 0. When loading the project, check the value: if it is 0, show the message "project has unfinalized data".
The message should be shown only if the key exists and the value is 0. If the project has never been finalized, the table won't have the row, hence no message.
This is quite easy, faster to process (rather than checking each timestamp) and extensible, since it only depends on the update or insert queries, which you can reuse in your upcoming modules. Comparing timestamps across multiple tables could get messy.
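A rough sketch of that flag table and the statements involved (table and column names are just one way to do it; project_id 12 matches the example above):
CREATE TABLE project_finalized (
    project_id   INT NOT NULL PRIMARY KEY,   -- FK to main_project_table.id
    is_finalized TINYINT(1) NOT NULL DEFAULT 1
);

-- run as part of every insert/update that touches project data
UPDATE project_finalized SET is_finalized = 0 WHERE project_id = 12;

-- when loading the project, warn if this returns a row with 0
SELECT is_finalized FROM project_finalized WHERE project_id = 12;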
...I do not need to know which tables have changed data just that there are tables that have changed data...
Use a join query to generate a single data set, JSON-encode or serialize it, then MD5 it and keep the hashed string in the database. Next time, compare against it: if there is any difference, the data set has changed. This is the general idea behind large data/file comparison and code repositories.
But in light of
"...more tables to the project..."
just use MD5 on each data row in each table. Once a row changes, its hashed string will be different.
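In MySQL the hashing can be sketched like this (the column list and project_id 12 are illustrative):
-- One hash per row; store it and compare on the next run
SELECT id, MD5(CONCAT_WS('|', col_a, col_b, col_c)) AS row_hash
FROM parent_table
WHERE project_id = 12;

-- Or a single hash for a project's whole data set
-- (GROUP_CONCAT output is capped by group_concat_max_len)
SELECT MD5(GROUP_CONCAT(CONCAT_WS('|', id, col_a, col_b) ORDER BY id)) AS set_hash
FROM parent_table
WHERE project_id = 12;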
Plan A: An off-the-wall solution:
Set up Master-Slave. The Slave will contain an 'old' copy of the data.
Establish "delayed" replication. Let's say 1 hour.
Get pt-table-checksum; run it twice an hour.
That will discover changes within an hour. (The timings may need tweaking if data size is quite large or small.)
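For the delayed-replication part, the slave-side commands would look roughly like this (MySQL 5.6+; 3600 seconds = 1 hour):
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 3600;
START SLAVE;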
Plan B
Deny all direct access from actual humans. Instead, build an application that handles all normal accesses through some API. Then I would instrument the API to collect whatever I choose.
Ad Hoc queries (for which there is no API):
Perhaps disallow them
Perhaps have a review board (me) approve them before they run.
Perhaps have an API that runs the query, but immediately logs/emails/rings bells/whatever.
Really not sure why these answers are suggesting reliance on IDs or complex data-logging, this is a fairly common problem with some very simple solutions.
Use those parent/child relationships
Note: when documenting a schema, it is important to note more than just FK relationships, but also the type of relationship (one-to-one, many-to-one, one-to-many, many-to-many).
You already have a fairly well defined parent/child relationships, I assume to be:
main_project one<--many parent one<--many child
Use them one of two ways:
Update a date for parent and main_project which stores the most recent date any child was modified.
Use a combination of join/max/modified in a query utilizing main_project, parent, and child.
child_updated date
main_project.child_updated
parent.child_updated
Whenever any child is updated, also update the child_updated dates for main_project and parent; similarly, when a parent is updated, update main_project. This can be done with triggers, PHP, or some clever use of joins or views over the main_project objects. I would highly advise sticking to doing this with PHP models of those tables.
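If you do choose triggers for one of the child tables, a minimal sketch might look like this (the child_updated columns and the trigger name are assumptions, and you would want matching AFTER INSERT/DELETE triggers as well):
CREATE TRIGGER child_table_after_update
AFTER UPDATE ON child_table
FOR EACH ROW
    UPDATE parent_table p
    JOIN main_project_table m ON m.id = p.project_id
    SET p.child_updated = NOW(),
        m.child_updated = NOW()
    WHERE p.id = NEW.parent_id;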
join/max/modified
Just build a query to get you four values, then check them:
checkpoint_table.main_project_finalized
main_project.modified
MAX(parent.modified)
MAX(child.modified)
These joins can get a bit tricky, so you'll have to play with this a bit.
SELECT m.modified AS modified,
       MAX(cp.project_finalized_timestamp) AS finalized,
       MAX(p.modified) AS parent_modified,
       MAX(ch.modified) AS child_modified
FROM main_project_table m
LEFT JOIN checkpoint_table cp
    ON m.id = cp.project_id
LEFT JOIN parent_table p
    ON m.id = p.project_id
LEFT JOIN child_table ch
    ON p.id = ch.parent_id
GROUP BY m.id
This will give you ONE row of all the dates you care about, allowing you to create some simple logic for it in PHP.
$result = // retrieve joined data as above
if ($result['finalized'] < max($result['modified'], $result['parent_modified'], $result['child_modified'])) {
// changed
}
There are some good solutions mentioned so far. Another one is to make use of MySQL's information schema. With it you can, for example, select all tables that have a timestamp column with a name you know and check their modification times. This is probably the most dynamic and seamless approach, but not really the best one. I would typically only do something like this if I were building an interface on top of legacy or third-party code and didn't have control over that part of the application.
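A sketch of that information_schema lookup, assuming the convention is a column named updated_timestamp:
-- Which tables in the current schema carry the agreed-upon timestamp column?
SELECT TABLE_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND COLUMN_NAME = 'updated_timestamp';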
Architecturally I think the best approach is to have your application aware of pertinent tables / fields and do an audit of them. I am assuming that the data is relational to the object at question and therefore although they are foreign tables, they can still be easily checked for modifications.
Another good idea would be to add versioning to all of the tables in question so that during this step in your application you can show what changed.
I have created a script that reads data from a CSV file, checks if the data already exists in the database, and imports it if it does not. If the data does exist (the code of a specific product), then the rest of the information needs to be updated from the CSV file.
For example: I have a member with code WTW-2LT, name Alex and surname Johnson in my CSV file. The script checks whether a member with code WTW-2LT, name Alex and surname Johnson already exists. If it does, the contact details and extra details need to be updated from the CSV (other details like subject and lecturer also need to be checked; all details are on one line in the CSV). If it doesn't exist, the new member just has to be created.
Here is my script so far, with most other checks left out for now to avoid distraction:
while ($row = fgetcsv($fp, null, ";")) {
if ($header === null) {
$header = $row;
continue;
}
$record = array_combine($header, $row);
$member = $this->em->getRepository(Member::class)->findOneBy([
'code' =>$record['member_code'],
'name' =>$record['name'],
'surname' =>$record['surname'],
]);
if (!$member) {
$member = new Member();
$member->setCode($record['member_code']);
$member->setName($record['name']);
$member->setSurname($record['surname']);
}
$member->setContactNumber($record['phone']);
$member->setAddress($record['address']);
$member->setEmail($record['email']);
$subject = $this->em->getRepository(Subject::class)->findOneBy([
'subject_code' => $record['subj_code']
]);
if (!$subject) {
$subject = new Subject();
$subject->setCode($record['subj_code']);
}
$subject->setTitle($record['subj_title']);
$subject->setDescription($record['subj_desc']);
$subject->setLocation($record['subj_loc']);
$lecturer = $this->em->getRepository(Lecturer::class)->findOneBy([
'subject' => $subject,
'name' => $record['lec_name'],
'code' => $record['lec_code'],
]);
if (!$lecturer) {
$lecturer = new Lecturer();
$lecturer->setSubject($subject);
$lecturer->setName($record['lec_name']);
$lecturer->setCode($record['lec_code']);
}
$lecturer->setEmail($record['lec_email']);
$lecturer->setContactNumber($record['lec_phone']);
$member->setLecturer($lecturer);
$validationErrors = $this->validator->validate($member);
if (!count($validationErrors)) {
$this->em->persist($member);
$this->em->flush();
} else {
// ...
}
}
You can see that this script has to query the database 3 times to check whether one CSV line exists. In my case I have files with up to 2000+ lines, so performing 3 queries per line just to check whether it exists is quite time-consuming.
Unfortunately, I also can't import the rows in batches, because if one subject doesn't exist the script will create it several times before the batch is flushed to the database, and then I'm left with duplicate records that serve no purpose.
How can I improve performance and speed as much as possible? For example, should I first fetch all records from the database and store them in arrays (memory-consuming?), then do the checks against those arrays and add new lines to them?
Can somebody please help me find a way to improve this (with sample code, please)?
To be honest, I don't find 2000+ rows with 3x that number of queries all that much. But since you are asking about performance, here are my two cents:
Using a framework will always add overhead, meaning that if you wrote this code in native PHP it would already run quicker. I'm not familiar with Symfony, but I assume you store your data in MySQL. MySQL has the command INSERT ... ON DUPLICATE KEY UPDATE. If you have a unique key over the 3 fields (code, name, surname), which I assume, you can use it to insert the data, and if the key already exists, update the values in the database. MySQL will do the checks for you to see whether the data has changed: if not, no disk write will happen.
I'm quite certain you can run native SQL from Symfony, allowing you to keep the security the framework provides while speeding up your inserts.
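A sketch of such a statement, assuming the member data lives in a table named member with a unique key over (code, name, surname), and with placeholders bound from each CSV row:
INSERT INTO member (code, name, surname, phone, address, email)
VALUES (:code, :name, :surname, :phone, :address, :email)
ON DUPLICATE KEY UPDATE
    phone   = VALUES(phone),
    address = VALUES(address),
    email   = VALUES(email);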
Generally, if you want performance, my best experience has been to dump all the data into the database and then transform it in there using SQL statements. The DBMS will be able to optimize all of your steps that way.
You can import CSV files directly into your MySQL database with the SQL command:
LOAD DATA INFILE 'data.csv'
INTO TABLE tmp_import
The command has a lot of options with which you can specify your CSV file's format; see for example:
MySQL Ref on LOAD DATA INFILE
https://stackoverflow.com/a/18941427/1220835
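A slightly fuller sketch matching the semicolon-separated file from the question (the file path, header line, and column order are assumptions):
LOAD DATA INFILE '/var/lib/mysql-files/data.csv'
INTO TABLE tmp_import
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(member_code, name, surname, phone, address, email,
 subj_code, subj_title, subj_desc, subj_loc,
 lec_code, lec_name, lec_email, lec_phone);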
If your data.csv is a full dump containing all old and new rows, then you can just replace your current table with the imported one after you fix it up a bit.
For example, your CSV file (and import table) might look a bit like this:
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name1, ...
WTW-2LT, Alex, Johnson, subj_code1, ..., lec_name2, ...
WTW-2LT, Alex, Johnson, subj_code2, ..., lec_name3, ...
WTW-2LU, John, Doe, subj_code3, ..., lec_name4, ...
You could then get the distinct rows via grouping:
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname
If member_code is a key, you can just GROUP BY member_code in MySQL. The DBMS won't complain, even though I believe this is technically against the standard.
To get the rest of your data you do the same:
SELECT subj_code, subj_title, member_code
FROM tmp_import
GROUP BY subj_code
and
SELECT lec_code, lec_name, subj_code
FROM tmp_import
GROUP BY lec_code
assuming subj_code and lec_code are keys for subjects and lecturers, respectively.
To actually save this result as a table, you can use MySQL's CREATE TABLE ... SELECT syntax, for example:
CREATE TABLE tmp_import_members
SELECT member_code, name, surname
FROM tmp_import
GROUP BY member_code, name, surname
You can then do the inserts and updates in two queries:
INSERT INTO members (member_code, name, surname)
SELECT member_code, name, surname
FROM tmp_import_members
WHERE tmp_import_members.member_code NOT IN (
SELECT member_code FROM members WHERE member_code IS NOT NULL
);
UPDATE members
JOIN tmp_import_members ON
members.member_code = tmp_import_members.member_code
SET
members.name = tmp_import_members.name,
members.surname = tmp_import_members.surname;
and the same for subjects and lecturers to your liking.
This all amounts to
one bulk import of your CSV file, which should be very fast,
3 temporary tables for your members, subjects, and lecturers,
3 insert and 3 update statements (one per table)
one DROP TABLE per temporary table after you're done.
Again: If your CSV-File contains all rows you could just replace your existing tables and save the 3 inserts and 3 updates.
Make sure that you create indexes on the relevant columns of your temporary tables so that MySQL can speed up the NOT IN and JOIN in the above queries.
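For example (the index names, and the subject/lecturer staging table names, are assumptions):
ALTER TABLE tmp_import_members   ADD INDEX idx_member_code (member_code);
ALTER TABLE tmp_import_subjects  ADD INDEX idx_subj_code (subj_code);
ALTER TABLE tmp_import_lecturers ADD INDEX idx_lec_code (lec_code);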
You can run a custom SQL query first to get the counts from all three tables:
SELECT
(SELECT COUNT(*) FROM member WHERE someCondition) as memberCount,
(SELECT COUNT(*) FROM subject WHERE someCondition) as subjectCount,
(SELECT COUNT(*) FROM lecturer WHERE someCondition) as lecturerCount
Then, based on the counts, you can tell whether the data is already present in your tables. You don't have to run the queries multiple times to check for uniqueness if you go with native SQL.
Check out this link to learn how to run custom SQL in Doctrine:
Symfony2 & Doctrine: Create custom SQL-Query
We have a query that we run, it is an Insert from Select statement, and it literally finishes in seconds - and is able to insert 1.3 million records with ease.
We have new requirements. The table that is being created and inserted into must now exist on a DIFFERENT mysql server on the same subnet.
How can this be done? What is the most efficient way to do it? Should we back up the table at the file level and copy it to the new server?
Or should we run a query against both databases, and if so - how long will it take?
We need to do this process manually -
Here is an example of the current code that we use to do this:
$populateNewTable = "insert into tbl_free_minutes_".strtolower(date("M")).date("Y")." (cust_id,telephone_number,state,created_datetime,plan_id,requested_planid) SELECT id,telephone_number,state,created_datetime,plan_id,requested_planid FROM tbl_customer WHERE created_datetime <= '".$date." 23:59:59' AND account_status IN ( 'Active', 'ActiveLL','Suspend') AND plan_id IN ('14','13','15','16','17','19','20','21','22','23','24','60','26', '27','28','29','30','31','32','33','34','38','39','40','44','45','46','47','48', '49','50','51','52','53','54','55','56','57','58','59','35','36','66','63','64', '65','67','68','69',76,77,78,79,82,83,84,85,86,87,88,89,93,94,95,96) and unlimited_plan != 'Y' ";
Sas
I suggest using "SELECT ... INTO OUTFILE" to extract your data on the first server.
Then transfer the file created by the query to the second server.
And last, use "LOAD DATA INFILE" on the second server to load your data in your table.
Some links:
http://dev.mysql.com/doc/refman/5.5/en/select-into.html
http://dev.mysql.com/doc/refman/5.5/en/load-data.html
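Putting the two statements together, a rough sketch (the paths, the example date, and the monthly table name are illustrative; LOAD DATA INFILE runs on the second server after you transfer the file):
-- On the first server: export the rows selected by the existing query
SELECT id, telephone_number, state, created_datetime, plan_id, requested_planid
INTO OUTFILE '/tmp/free_minutes.tsv'
FROM tbl_customer
WHERE created_datetime <= '2013-12-31 23:59:59'
  AND account_status IN ('Active', 'ActiveLL', 'Suspend')
  AND plan_id IN (14, 13, 15 /* ...the full plan list from the query above... */)
  AND unlimited_plan != 'Y';

-- Transfer /tmp/free_minutes.tsv to the second server, then load it there:
LOAD DATA INFILE '/tmp/free_minutes.tsv'
INTO TABLE tbl_free_minutes_dec2013
(cust_id, telephone_number, state, created_datetime, plan_id, requested_planid);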
I am creating two pg_dumps, DUMP1 and DUMP2.
DUMP1 and DUMP2 are exactly the same, except DUMP2 was dumped in REVERSE order of DUMP1.
Is there any way I can sort the two dumps so that the two dump files are exactly the same (when using diff)?
I am using PHP and Linux. I tried using "sort" in Linux, but that does not work...
Thanks!
From your previous question, I assume that what you are really trying to do is compare two databases to see whether they are the same, including the data.
As we saw there, pg_dump is not going to behave deterministically. The fact that one file is the reverse of the other is probably just coincidence.
Here is a way that you can do the total comparison including schema and data.
First, compare the schema using this method.
Second, compare the data by dumping it all to a file in an order that will be consistent. Order is guaranteed by first sorting the tables by name and then by sorting the data within each table by primary key column(s).
The query below generates the COPY statements.
select
'copy (select * from '||r.relname||' order by '||
array_to_string(array_agg(a.attname), ',')||
') to STDOUT;'
from
pg_class r,
pg_constraint c,
pg_attribute a
where
r.oid = c.conrelid
and r.oid = a.attrelid
and a.attnum = ANY(conkey)
and contype = 'p'
and relkind = 'r'
group by
r.relname
order by
r.relname
Running that query will give you a list of statements like copy (select * from test order by a,b) to STDOUT; Put those all in a text file, run them through psql for each database, and then compare the output files. You may need to tweak the output settings of COPY.
My solution was to write my own program for sorting pg_dump output. Feel free to download PgDumpSort, which sorts the dump by primary key. With the Java default memory of 512MB it should work with up to 10 million records per table, since the record info (primary key value, file offsets) is held in memory.
You use this little Java program e.g. with
java -cp ./pgdumpsort.jar PgDumpSort db.sql
And you get a file named "db-sorted.sql", or specify the output file name:
java -cp ./pgdumpsort.jar PgDumpSort db.sql db-$(date +%F).sql
And the sorted data is in a file like "db-2013-06-06.sql"
Now you can create patches using diff
diff --speed-large-files -uN db-2013-06-05.sql db-2013-06-06.sql >db-0506.diff
This allows you to create incremental backups, which are usually much smaller. To restore the files you have to apply the patch to the original file using:
patch -p1 < db-0506.diff
(Source code is inside of the JAR file)
If
performance is less important than order
you only care about the data not the schema
and you are in a position to recreate both dumps (you don't have to work with existing dumps)
you can dump the data in CSV format in a determined order like this:
COPY (select * from your_table order by some_col) to stdout
with csv header delimiter ',';
See COPY (v14)
It's probably not worth the effort to parse out the dump.
It will be far, far quicker to restore DUMP2 into a temporary database and dump the temporary database in the right order.
Here is another solution to the problem: https://github.com/tigra564/pgdump-sort
It allows sorting both DDL and DML including resetting volatile values (like sequence values) to some canonical values to minimize the resulting diff.