Sorting a PostgreSQL database dump (pg_dump) - PHP

I am creating two pg_dumps, DUMP1 and DUMP2.
DUMP1 and DUMP2 are exactly the same, except DUMP2 was dumped in REVERSE order of DUMP1.
Is there any way that I can sort the two dumps so that the two dump files are exactly the same (when compared with diff)?
I am using PHP and Linux. I tried using "sort" on Linux, but that does not work...
Thanks!

From your previous question, I assume that what you are really trying to do is compare two databases to see if they are the same, including the data.
As we saw there, pg_dump is not going to behave deterministically. The fact that one file is the reverse of the other is probably just coincidental.
Here is a way that you can do the total comparison including schema and data.
First, compare the schema using this method.
Second, compare the data by dumping it all to a file in an order that will be consistent. Order is guaranteed by first sorting the tables by name and then by sorting the data within each table by primary key column(s).
The query below generates the COPY statements.
select
    'copy (select * from '||r.relname||' order by '||
    array_to_string(array_agg(a.attname order by a.attnum), ',')||
    ') to STDOUT;'
from
    pg_class r,
    pg_constraint c,
    pg_attribute a
where
    r.oid = c.conrelid
    and r.oid = a.attrelid
    and a.attnum = ANY(conkey)
    and contype = 'p'
    and relkind = 'r'
group by
    r.relname
order by
    r.relname;
Running that query will give you a list of statements like copy (select * from test order by a,b) to STDOUT; Put those all in a text file and run them through psql for each database and then compare the output files. You may need to tweak the output settings of COPY.
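If you want to drive that from PHP, a rough, untested sketch could look like this (it assumes the generated COPY statements were saved to a file called copy_statements.sql, and that psql can connect to both databases without a password prompt; all names are placeholders):
$databases = array('db1', 'db2');        // the two databases to compare (placeholder names)
$copy_file = 'copy_statements.sql';      // file holding the generated COPY statements

foreach ($databases as $db) {
    // psql -f executes every COPY ... TO STDOUT statement and prints the rows
    shell_exec('psql -d ' . escapeshellarg($db)
             . ' -f ' . escapeshellarg($copy_file)
             . ' > ' . escapeshellarg('data_' . $db . '.txt'));
}

// An empty diff means the data is identical
echo shell_exec('diff data_db1.txt data_db2.txt');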

My solution was to write my own program to process the pg_dump output. Feel free to download PgDumpSort, which sorts the dump by primary key. With the default Java memory of 512 MB it should work with up to 10 million records per table, since the record info (primary key value, file offsets) is held in memory.
You use this little Java program e.g. with
java -cp ./pgdumpsort.jar PgDumpSort db.sql
And you get a file named "db-sorted.sql", or specify the output file name:
java -cp ./pgdumpsort.jar PgDumpSort db.sql db-$(date +%F).sql
And the sorted data is in a file like "db-2013-06-06.sql"
Now you can create patches using diff
diff --speed-large-files -uN db-2013-06-05.sql db-2013-06-06.sql >db-0506.diff
This allows you to create incremental backups, which are usually much smaller. To restore the file you have to apply the patch to the original file using
patch -p1 < db-0506.diff
(Source code is inside of the JAR file)

If
- performance is less important than order,
- you only care about the data, not the schema, and
- you are in a position to recreate both dumps (you don't have to work with existing dumps),
then you can dump the data in CSV format in a determined order like this:
COPY (select * from your_table order by some_col) to stdout
with csv header delimiter ',';
See the PostgreSQL documentation for COPY (v14).

It's probably not worth the effort to parse out the dump.
It will be far, far quicker to restore DUMP2 into a temporary database and dump the temporary in the right order.
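For example, roughly (a sketch only; it assumes plain-SQL dumps and uses placeholder file and database names):
$tmp_db = 'dump_compare_tmp';                                               // throwaway database

shell_exec('createdb ' . escapeshellarg($tmp_db));                          // create it
shell_exec('psql -d ' . escapeshellarg($tmp_db) . ' -f DUMP2.sql');         // restore DUMP2 into it
shell_exec('pg_dump ' . escapeshellarg($tmp_db) . ' > DUMP2-redumped.sql'); // dump it again
shell_exec('dropdb ' . escapeshellarg($tmp_db));                            // clean up

echo shell_exec('diff DUMP1.sql DUMP2-redumped.sql');                       // compare against DUMP1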

Here is another solution to the problem: https://github.com/tigra564/pgdump-sort
It allows sorting both DDL and DML including resetting volatile values (like sequence values) to some canonical values to minimize the resulting diff.

MySQL Optimization for data tables and Query optimization

So, a LOT of details here; my main objective is to do this as fast as possible.
I am calling an API which returns a large JSON-encoded string.
I am storing the quoted encoded string in MySQL (InnoDB) with 3 fields (tid (key), json, tags) in a table called store.
I will, up to 3+ months later, pull information from this database using:
WHERE tags LIKE "%something%" AND tags LIKE "%somethingelse%"
Tags are + delimited. (Which makes them too long to be efficiently keyed.)
Example:
'anime+pikachu+shingeki no kyojin+pokemon+eren+attack on titan+'
I do not wish to repeat API calls at ANY time. If you are going to include an API call, use:
API(tag, time);
All of the JSON data is needed.
This table is an active archive.
One idea I had was to put the tags into their own 2-column table (pid, tag (key)), where pid points to tid in the store table.
Questions
Are there any MySQL configurations I can change to make this faster?
Are there any table structure changes I can do to make this faster?
Is there anything else I can do to make this faster?
Quoted JSON example (messy; to see a clean example, see the Tumblr API v2):
'{\"blog_name\":\"roxannemariegonzalez\",\"id\":62108559921,\"post_url\":\"http:\\/\\/roxannemariegonzalez.tumblr.com\\/post\\/62108559921\",\"slug\":\"\",\"type\":\"photo\",\"date\":\"2013-09-24 00:36:56 GMT\",\"timestamp\":1379983016,\"state\":\"published\",\"format\":\"html\",\"reblog_key\":\"uLdTaScb\",\"tags\":[\"anime\",\"pikachu\",\"shingeki no kyojin\",\"pokemon\",\"eren\",\"attack on titan\"],\"short_url\":\"http:\\/\\/tmblr.co\\/ZxlLExvrzMen\",\"highlighted\":[],\"bookmarklet\":true,\"note_count\":19,\"source_url\":\"http:\\/\\/weheartit.com\\/entry\\/78231354\\/via\\/roxannegonzalez?page=2\",\"source_title\":\"weheartit.com\",\"caption\":\"\",\"link_url\":\"http:\\/\\/weheartit.com\\/entry\\/78231354\\/via\\/roxannegonzalez\",\"image_permalink\":\"http:\\/\\/roxannemariegonzalez.tumblr.com\\/image\\/62108559921\",\"photos\":[{\"caption\":\"\",\"alt_sizes\":[{\"width\":500,\"height\":444,\"url\":\"http:\\/\\/31.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_500.png\"},{\"width\":400,\"height\":355,\"url\":\"http:\\/\\/25.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_400.png\"},{\"width\":250,\"height\":222,\"url\":\"http:\\/\\/31.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_250.png\"},{\"width\":100,\"height\":89,\"url\":\"http:\\/\\/25.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_100.png\"},{\"width\":75,\"height\":75,\"url\":\"http:\\/\\/25.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_75sq.png\"}],\"original_size\":{\"width\":500,\"height\":444,\"url\":\"http:\\/\\/31.media.tumblr.com\\/c8a87bee925b0b0674773af63e43f954\\/tumblr_mtltpkLvuo1qmfyxko1_500.png\"}}]}'
Look into the MySQL MATCH()/AGAINST() functions and the FULLTEXT index feature; this is probably what you are looking for. Make sure a FULLTEXT index will operate reasonably on a JSON document.
What kind of data sizes are we talking about? Huge amounts of memory are cheap these days, so having the entire MySQL dataset buffered in memory, where you can do full-text scans, isn't unreasonable.
Breaking out some of the JSON field values and putting them into their own columns would allow you to search quickly for those... but that doesn't help you in the general case.
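For example, a rough sketch against the store table from the question (untested; credentials are placeholders, and FULLTEXT on InnoDB needs MySQL 5.6+):
$db = new mysqli('localhost', 'user', 'pass', 'mydb');   // placeholder credentials

// one-time schema change: full-text index on the tags column
$db->query('ALTER TABLE store ADD FULLTEXT INDEX ft_tags (tags)');

// boolean mode lets you require several tags at once
$stmt = $db->prepare('SELECT tid, json FROM store WHERE MATCH(tags) AGAINST (? IN BOOLEAN MODE)');
$search = '+pikachu +"attack on titan"';                 // both tags must be present
$stmt->bind_param('s', $search);
$stmt->execute();
$result = $stmt->get_result();
while ($row = $result->fetch_assoc()) {
    // work with $row['json']
}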
This option you suggested is the correct design:
One idea I had was to put the tags into their own 2-column table (pid, tag (key)), where pid points to tid in the store table.
But if you're searching LIKE '%something%' then the leading '%' will mean the index can only be used to reduce disk reads - you will still need to scan the entire index.
If you can drop the leading '%' (because you now have the entire tag) then this is certainly the way to go. The trailing '%' is not as important.
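A sketch of that design (untested; the store_tags table name and credentials are placeholders, pid/tag/tid follow the question):
$db = new mysqli('localhost', 'user', 'pass', 'mydb');   // placeholder credentials

// one-time setup: one row per (post, tag) pair
$db->query('CREATE TABLE IF NOT EXISTS store_tags (
                pid BIGINT UNSIGNED NOT NULL,
                tag VARCHAR(191) NOT NULL,
                PRIMARY KEY (pid, tag),
                KEY idx_tag (tag)
            ) ENGINE=InnoDB');

// posts carrying BOTH tags: one join per required tag, equality instead of LIKE
$stmt = $db->prepare('SELECT s.tid, s.json
                        FROM store s
                        JOIN store_tags t1 ON t1.pid = s.tid AND t1.tag = ?
                        JOIN store_tags t2 ON t2.pid = s.tid AND t2.tag = ?');
$tag_a = 'pikachu';
$tag_b = 'attack on titan';
$stmt->bind_param('ss', $tag_a, $tag_b);
$stmt->execute();
$result = $stmt->get_result();
With the whole tag stored per row, each lookup is an index equality match rather than a scan of every tag string.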

Different PostgreSQL fields orders in PHP when retrieved from two different servers

I am using CodeIgniter 2.1.3 and PHP 5.4.8, and two PostgreSQL servers (P1 and P2).
I am having a problem with the list_fields() function in CodeIgniter. When I retrieve fields from the P1 server, the fields are in the order in which I originally created the table. However, if I use the exact same code to retrieve fields from the P2 server, the fields are in reverse order.
If the fields from P1 are array('id', 'name', 'age'),
the fields from P2 become array('age', 'name', 'id').
I don't think this is a CodeIgniter-specific problem, but rather a general database configuration or PHP problem, because the code is identical.
This is the code that I get fields with.
$fields = $this->db->list_fields("clients");
I have to clarify something. @muistooshort claims in a comment above:
Technically there is no defined order to the columns in a table.
@mu may be thinking of the order of rows, which is arbitrary without ORDER BY.
This is completely incorrect for the order of columns, which is well defined and stored in the column pg_attribute.attnum. It's used in many places, like INSERT without a column definition list or SELECT *. It is preserved through a dump/restore cycle and has significant bearing on storage size and performance.
You cannot simply change the order of columns in PostgreSQL, because that has not been implemented yet. It's deeply wired into the system and hard to change. There is a Postgres Wiki page about it, and it's on the project's TODO list:
Allow column display reordering by recording a display, storage, and
permanent id for every column?
Find out for your table:
SELECT attname, attnum
FROM pg_attribute
WHERE attrelid = 'myschema.mytable'::regclass
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
ORDER BY attnum;
It is unwise to use SELECT * in some contexts, where the columns of the underlying table may change and break your code. It is explicitly wise to use SELECT * in other contexts, where you need all columns (in default order).
As to the primary question
This should not occur. SELECT * returns columns in a well defined order in PostgreSQL. Some middleware must be messing with you.
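If you need a deterministic field list from PHP regardless of what the driver reports, one workaround (a sketch only; connection details are placeholders, the clients table is from the question) is to ask the catalog directly:
$pdo = new PDO('pgsql:host=localhost;dbname=mydb', 'user', 'pass');   // placeholder DSN

$sql = 'SELECT column_name
          FROM information_schema.columns
         WHERE table_schema = :schema
           AND table_name = :table
         ORDER BY ordinal_position';   // ordinal_position follows pg_attribute.attnum

$stmt = $pdo->prepare($sql);
$stmt->execute(array(':schema' => 'public', ':table' => 'clients'));
$fields = $stmt->fetchAll(PDO::FETCH_COLUMN);   // e.g. array('id', 'name', 'age')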
I suspect you are used to MySQL, which allows you to reorder columns post-table creation. PostgreSQL does not let you do this, so when you run:
ALTER TABLE foo ADD bar int;
it always puts the new column at the end of the table, and there is no way to change the order.
On PostgreSQL you should not assume that the order of the columns is meaningful, because it can differ from server to server based on the order in which the columns were defined.
However, the fact that these are reversed is odd to me.
If you want to see the expected order on the db, use:
\d foo
from psql
If these are reversed then the issue is in the db creation (this is my first impression). That's the first thing to look at. If that doesn't show the problem then there is something really odd going on with CodeIgniter.

Would I gain a performance boost by normalizing this table?

I'm running a sort of forum modification code on my server. In my database, I have a HUGE table called core, which basically includes all the admin-editable settings, of which there are a lot. My question is, should I normalize this?
I've created a fiddle to show you how large this table is: http://sqlfiddle.com/#!2/f4536/1
You'll see some columns called gear_notifications, gear_bank, gear_* etc. These indicate whether a certain system is turned on. For example, if gear_bank=1, then the Bank System is turned on. At the moment, in the __construct of my DB file, I run the following query:
$settings = mysql_query("SELECT GROUP_CONCAT(d.domain) AS domains, GROUP_CONCAT(d.zb) AS zb_info, c.* FROM core c JOIN domains d ON d.cid = c.id WHERE c.id ='$cid' LIMIT 1");
Ignoring the JOIN, you can see a problem here straight away: the query returns EVERY field from the core table, regardless of whether the corresponding system is turned on. For example, if gear_bank=0, the query will still return bank_name, bank_history_perpage, bank_* etc. While this doesn't present any practical problem with how the code runs (as it can just ignore any data it does not need), I'd much prefer it if it didn't have to load that data.
Would I be better off creating one table called core which has all the gear_* values, then corresponding tables (core_bank, core_* etc) for their corresponding values?
I've been querying this (bad pun, sorry!) for a while now and I just don't know enough to work out whether this will provide a performance boost for my code. Note that the above query will be run on EVERY page.
If I were to switch to the new system, i.e. the multiple tables, how would I get all the information I need? I don't want to run one query against core to work out which systems are turned on and then run extra queries on all the corresponding tables. I'd much prefer a single query which JOINs all the necessary tables based on the values of gear_* in core. Would that be possible?
I've not yet released this forum modification so I can make as many changes as I like without any real-world impact :) .
If you split your core into multiple tables you can then use NULL to indicate if a particular entry has info for a particular system. The joins will then work as expected.
However, your table really is not that big and it's doubtful that you will actually notice any speed improvement at the application level by breaking it up.
And if you don't need to query on individual columns then just use one TEXT column, put all your attributes in an array and then serialize the array into the text column.
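For the last option, a minimal sketch (the settings column and connection details are made up for illustration; json_encode()/json_decode() would work just as well as serialize()/unserialize()):
$pdo = new PDO('mysql:host=localhost;dbname=forum', 'user', 'pass');   // placeholder connection
$cid = 1;                                                              // core row id

$settings = array(
    'gear_bank'            => 1,
    'bank_name'            => 'First National',
    'bank_history_perpage' => 25,
);

// write: the whole settings array goes into one TEXT column
$stmt = $pdo->prepare('UPDATE core SET settings = :s WHERE id = :id');
$stmt->execute(array(':s' => serialize($settings), ':id' => $cid));

// read
$stmt = $pdo->prepare('SELECT settings FROM core WHERE id = :id');
$stmt->execute(array(':id' => $cid));
$settings = unserialize($stmt->fetchColumn());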

Comparing 2700+ filenames against 29000+ rows in a db table

I have an FTP repository that currently holds 2761 files (PDF files).
I have a MySQL table (a list of those files) that is actually at 29k+ files (it hasn't been parsed since a recent upgrade).
I'd like to provide admins with a one-click script that will do the following:
1) Compare the "existing" filenames with the rows in the database table
2) Delete any rows that are not in the existing filesystem
3) Add a new row for a file that doesn't appear in the database table
This is usually handled by an AppleScript/Folder Action/Perl script method, but it's not perfect (it sometimes chokes when large numbers of files are added at once, like on heavy news nights).
It takes about 10-20 seconds to build the file list from the FTP repository (using $file_list = ftp_nlist($conn_id,$target_dir) ), and I'm not sure how best to go about comparing it with the DB table (I'm positive that a WHERE NOT IN (big_fat_list) would be a nightmare query to run).
Any suggestions?
Load the list of filenames into another table, then perform a couple of queries that fulfill your requirements.
Yup, that is the solution. I suggest you use a PDO prepared insert statement to reduce the time.
Or do what mysqldump does and generate multi-row inserts:
insert into your_table (column1, column2, ...) values
(...),
(...),
(...);
insert into ...
You would have to check the maximum allowed size of a multi-row values list on the MySQL site.
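A sketch of the PDO approach (untested; connection details are placeholders, $conn_id and $target_dir are the FTP handle and directory from the question):
$pdo = new PDO('mysql:host=localhost;dbname=archive', 'user', 'pass');   // placeholder connection

// staging table for the current FTP listing (TEMPORARY = visible to this connection only)
$pdo->exec('CREATE TEMPORARY TABLE ftp_files (filename VARCHAR(255) PRIMARY KEY)');

$file_list = ftp_nlist($conn_id, $target_dir);   // as in the question

$stmt = $pdo->prepare('INSERT INTO ftp_files (filename) VALUES (?)');
$pdo->beginTransaction();                        // one transaction keeps 2761 inserts fast
foreach ($file_list as $path) {
    $stmt->execute(array(basename($path)));
}
$pdo->commit();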
I usually dump the recursive directory list with dates and file sizes to a temporary table. Then I remove items not found:
delete from A
where not exists (
    select null as nothing
    from temp b
    where A.key = b.key );
Then I update items already there (for file sizes, CRCs):
update A
join temp b on A.key = b.key
set A.nonkeyfield1 = b.nonkeyfield1,
    A.nonkeyfield2 = b.nonkeyfield2;
Then I add items found:
insert into A (field, list)
select field, list
from temp b
where not exists (
    select null as nothing
    from A
    where b.key = A.key );
This is from memory, so test it first before you fly. The select null as nothing keeps you from wasting RAM while you check things.

Generating MySQL statistics

I have a CSV file which is generated weekly and loaded into a MySQL database. I need to make a report which will include various statistics on the records imported. The first such statistic is how many records were imported.
I use PHP to interface with the database, and will be using PHP to generate a page showing such statistics.
However, the CSV files are imported via a MySQL script, quite separate from any PHP.
Is it possible to count the records that were imported and store the number in a different field/table, or in some other way?
Adding an additional time field to work out which rows were added after a certain time is not possible, as the structure of the database cannot be changed.
Is there a query I can use while importing from a MySQL script, or a better way to generate/count the number of imported records from within PHP?
You can get the number of records in a table using the following query.
SELECT COUNT(*) FROM tablename
So what you can do is count the number of records before the import and after the import, and then take the difference, like so:
$before_count = mysql_fetch_assoc(mysql_query("SELECT COUNT(*) AS c FROM tablename"));
// Run mysql script
$after_count = mysql_fetch_assoc(mysql_query("SELECT COUNT(*) AS c FROM tablename"));
$records_imported = $after_count['c'] - $before_count['c'];
You could do this all from the MySQL script if you would like, but I think using PHP for it turns out to be a bit cleaner.
A bit of a barrel-scraper, but depending on permissions you could edit the cron-executed MySQL script to output some pre-update stats to a file using INTO OUTFILE and then parse the resulting file in PHP. You'd then have the 'before' stats and could execute the stats queries again via PHP to obtain the 'after' stats.
However, as with many of these approaches, it'll be next to impossible to find updates to existing rows using this solution. (Although new rows should be trivial to detect.)
Not really sure what you're after, but here's a bit more detail:
1) Get MySQL to export the relevant stats to a known directory using SELECT... INTO OUTFILE. This directory would need to be readable/writable by the MySQL user/group and your web server's user/group (or whatever user/group you're running PHP as) if you're going to automate the CLI via cron on a weekly basis. The file should be in CSV format and datestamped as "stats_export_YYYYMMDD.csv".
2) Get PHP to scan the export directory for files beginning "stats_export_", perhaps using the "scandir" function with a simple substr test (see the sketch after these steps). You can then add each matching filename to an array. Once you've run out of files, sort the array to ensure it's in date order.
3) Read the stats data from each of the files listed in the array in turn using fgetcsv. It would be wise to place this data into a clean array which also contains the relevant datestamp as extracted from the filename.
4) At this point you'll have a summary of the stats at the end of each day in an array. You can then execute the relevant stats SQL queries again (if required) directly from PHP and add the stats to the data array.
5) Compare/contrast and output as required.
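Steps 2 and 3 might look roughly like this (a sketch only; the export directory path is an example):
$export_dir = '/var/lib/mysql-files/stats';   // wherever INTO OUTFILE writes (example path)

// step 2: collect and sort the export files
$files = array();
foreach (scandir($export_dir) as $name) {
    if (substr($name, 0, 13) === 'stats_export_') {
        $files[] = $name;
    }
}
sort($files);   // the YYYYMMDD datestamp keeps them in date order

// step 3: read each file with fgetcsv, keyed by the datestamp from the filename
$stats = array();
foreach ($files as $name) {
    $date = substr($name, 13, 8);             // YYYYMMDD
    $fh = fopen($export_dir . '/' . $name, 'r');
    while (($row = fgetcsv($fh)) !== false) {
        $stats[$date][] = $row;
    }
    fclose($fh);
}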
Load the files using PHP and 'LOAD DATA INFILE .... INTO TABLE ..' and then get the number of imported rows using mysqli_affected_rows() (or mysql_affected_rows()).
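Roughly (untested; the file path, table name and CSV layout are placeholders, and LOCAL has to be enabled on both the client and the server):
$db = new mysqli('localhost', 'user', 'pass', 'mydb');   // placeholder credentials

$db->query("LOAD DATA LOCAL INFILE '/path/to/weekly.csv'
            INTO TABLE imported_data
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            IGNORE 1 LINES");

$records_imported = $db->affected_rows;   // rows loaded by that statement
echo "Imported $records_imported records";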
