Postgres pg_dump dumps database in a different order every time - php

I am writing a PHP script (which also uses Linux bash commands) that will run through test cases against a PostgreSQL database (8.4.2) by doing the following:
1.) Create a DB
2.) Modify the DB
3.) Store a database dump of the DB (pg_dump)
4.) Do regression testing by doing steps 1.) and 2.), and then take another database dump and compare it (diff) with the original database dump from step number 3.)
However, I am finding that pg_dump will not always dump the database in the same way. It will dump things in a different order every time. Therefore, when I do a diff on the two database dumps, the comparison will result in the two files being different, when they are actually the same, just in a different order.
Is there a different way I can go about doing the pg_dump?
Thanks!

It's worth distinguishing schema and data here. The schema is dumped in a fairly deterministic order, most objects alphabetically, constrained by inter-object dependencies. There are some limited cases where the order is not fully constrained and may appear random to an outside observer, but that may get fixed in the next version.
The data on the other hand is dumped in disk order. This is usually what you want, because you want dumps to be fast and not use insane amounts of resources to do sorting. What you might be observing is that when you "modify the DB" you are doing an UPDATE, which will actually delete the old value and append the new value at the end. And that will of course upset your diff strategy.
A tool that might be more suitable for your purpose is pg_comparator.

Here is a handy script for pre-processing pg_dump output to make it more suitable for diffing and storing in version control:
https://github.com/akaihola/pgtricks
pg_dump_splitsort.py splits the dump into the following files:
0000_prologue.sql: everything up to the first COPY
0001_<schema>.<table>.sql
...
NNNN_<schema>.<table>.sql: data for each table sorted by the first field
9999_epilogue.sql: everything after the last COPY
The files for table data are numbered so a simple sorted concatenation of all files can be used to re-create the database:
$ cat *.sql | psql <database>
I've found that a good way to take a quick look at differences between dumps is to use the meld tool on the whole directory:
$ meld old-dump/ new-dump/
Storing the dump in version control also gives a decent view on the differences. Here's how to configure git to use color in diffs:
# ~/.gitconfig
[color]
diff = true
[color "diff"]
frag = white blue bold
meta = white green bold
commit = white red bold
Note: If you have created/dropped/renamed tables, remember to delete all .sql files before post-processing the new dump.

It is impossible to force pg_dump to dump data in any particular order, as it dumps data in disk order - it is much faster this way.
You can use the "-a -d" options for pg_dump (data only, dumped as INSERT commands) and then sort the output, but newlines embedded in the data will make the sorted output unusable for restoring. For a basic comparison of whether anything changed at all, though, it would suffice.
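For example, a rough "did anything change" check could look like this (a sketch only; mydb is a placeholder, and note that on newer pg_dump versions the flag is spelled --inserts, since -d now means something else):
$ pg_dump -a -d mydb | sort > before.sql
# ... recreate and modify the database, then dump and sort again ...
$ pg_dump -a -d mydb | sort > after.sql
$ diff before.sql after.sql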

If you are just interested in the schema:
You could do your diff table by table, using a combination of these options to dump the schema for only one table at a time. You could then compare the files individually, or cat them all into one file in a known order.
-s, --schema-only dump only the schema, no data
-t, --table=TABLE dump the named table(s) only
To generate the list of tables to feed to the above, query information_schema.tables.
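A rough sketch of that approach (assuming the tables live in the public schema of a database called mydb; adjust the names to your setup):
#!/bin/sh
# dump each table's schema to its own file, in alphabetical order
DB=mydb
for T in `psql -At -d $DB -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' ORDER BY table_name"`; do
    pg_dump -s -t $T $DB > schema_$T.sql
done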

As of May 2010, a patch to pg_dump exists that may be helpful to anyone interested in this matter - it adds an "--ordered" option to the utility:
Using --ordered will order the data by primary key or unique index, if one exists, and use the "smallest" ordering (i.e. the least number of columns required for a unique order). Note that --ordered could crush your database server if you try to order very large tables, so use judiciously.
I didn't test it, but I guess it's worth a try.

It is not unusual for PostgreSQL to behave nondeterministically - perhaps timer-triggered reorganization processes or something similar run in the background. Furthermore, I am not aware of a way to force pg_dump to reproduce bit-identical output on successive runs.
I suggest changing your comparison logic, because it is the comparison that is misbehaving - it reports differences even though both dumps represent the same database state. This of course means some additional work, but in my opinion it is the correct way to attack the problem.

If performance is less important than order, you could use:
COPY (select * from your_table order by some_col) to stdout
with csv header delimiter ',';
See COPY (9.5)
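For example, to produce a sorted, diff-friendly per-table dump from the shell (your_table, some_col and mydb are placeholders):
$ psql -d mydb -c "COPY (SELECT * FROM your_table ORDER BY some_col) TO STDOUT WITH CSV HEADER DELIMITER ','" > your_table.csv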

Related

Simulate MySQL connection to analyze queries to rebuild table structure (reverse-engineering tables)

I have just been tasked with recovering/rebuilding an extremely large and complex website that had no backups and was fully lost. I have a complete (hopefully) copy of all the PHP files; however, I have absolutely no clue what the database structure looked like (other than that it is certainly at least 50 or so tables... so fairly complex). All data has been lost and the original developer was fired about a year ago in a fiery feud (so I am told). I have been a PHP developer for quite a while and am plenty comfortable trying to sort through everything and get the application/site back up and running... but the lack of a database will be a huge struggle. So... is there any way to simulate a MySQL connection to some software that will capture all incoming queries and attempt to use the requested field and table names to rebuild the structure?
It seems to me that if I start clicking through the application and it passes a query like
SELECT name, email, phone FROM contact_table WHERE contact_id='1'
...there should be a way to capture that info and assume there was a table called "contact_table" that had at least 4 fields with those names... If I can do that repetitively, each time adding some sample data to the discovered fields and then moving on to another page, then eventually I should have a rough copy of most of the database structure (at least all public-facing parts). This would be MUCH easier than manually reading all the code and pulling out every reference, reading all the joins and subqueries, and sorting through it all manually.
Anyone ever tried this before? Any other ideas for reverse-engineering the database structure from PHP code?
mysql> SET GLOBAL general_log=1;
With this configuration enabled, the MySQL server writes every query to a log file (datadir/hostname.log by default), even those queries that have errors because the tables and columns don't exist yet.
http://dev.mysql.com/doc/refman/5.6/en/query-log.html says:
The general query log can be very useful when you suspect an error in a client and want to know exactly what the client sent to mysqld.
As you click around in the application, it should generate SQL queries, and you can have a terminal window open running tail -f on the general query log. As you see queries run that reference tables or columns that don't exist yet, create those tables and columns. Then repeat clicking around in the app.
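For example (the log file path is just an illustration; log_output and general_log_file are standard server variables in MySQL 5.1+):
mysql> SET GLOBAL log_output = 'FILE';
mysql> SET GLOBAL general_log_file = '/tmp/mysql-general.log';
mysql> SET GLOBAL general_log = 1;

$ tail -f /tmp/mysql-general.log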
A number of things may make this task even harder:
If the queries use SELECT *, you can't infer the names of columns or even how many columns there are. You'll have to inspect the application code to see what column names are used after the query result is returned.
If INSERT statements omit the list of column names, you can't know what columns there are or how many. On the other hand, if INSERT statements do specify a list of column names, you can't know if there are more columns that were intended to take on their default values.
Data types of columns won't be apparent from their names, nor string lengths, nor character sets, nor default values.
Constraints, indexes, primary keys, foreign keys won't be apparent from the queries.
Some tables may exist (for example, lookup tables), even though they are never mentioned by name by the queries you find in the app.
Speaking of lookup tables, many databases have sets of initial values stored in tables, such as all possible user types and so on. Without the knowledge of the data for such lookup tables, it'll be hard or impossible to get the app working.
There may have been triggers and stored procedures. Procedures may be referenced by CALL statements in the app, but you can't guess what the code inside triggers or stored procedures was intended to be.
This project is bound to be very laborious, time-consuming, and involve a lot of guesswork. The fact that the employer had a big feud with the developer might be a warning flag. Be careful to set the expectations so the employer understands it will take a lot of work to do this.
PS: I'm assuming you are using a recent version of MySQL, such as 5.1 or later. If you use MySQL 5.0 or earlier, you should just add log=1 to your /etc/my.cnf and restart mysqld.
Crazy task. Is the code such that the DB queries are at all abstracted? Could you replace the query functions with something which would log the tables, columns and keys, and/or actually create the tables or alter them as needed, before firing off the real query?
Alternatively, it might be easier to do some text processing, regex matching, grep/sort/uniq on the queries in all of the PHP files. The goal would be to get it down to a manageable list of all tables and columns in those tables.
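For instance, something along these lines could pull out candidate query strings for review (a crude sketch; the regex assumes the old mysql_query() API and single-line, double-quoted query strings, so it will certainly not catch everything):
$ grep -rhoE 'mysql_query[[:space:]]*\([[:space:]]*"[^"]*"' --include='*.php' . | sort | uniq -c | sort -rn > queries.txt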
I once had a similar task; fortunately, I was able to find an old backup.
If you could find a way to extract the queries - say, regex match all of the occurrences of mysql_query or whatever extension was used to query the database - you could then use something like php-sql-parser to parse the queries, and hopefully from that you would be able to get a list of most tables and columns. However, that is only half the battle. The other half is determining the data types for every single column, and that would be rather impossible to do automatically from PHP. It would basically require you to inspect it line by line. There are best practices, but who's to say that the old dev followed them? Whether a column called "date" should be stored as DATE, DATETIME, INT, or VARCHAR(50) with some sort of ugly manual string handling can only be determined by looking at the actual code.
Good luck!
You could build some triggers with the BEFORE action time, but unfortunately this will only work for INSERT, UPDATE, or DELETE commands.
http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html

PHP Array efficiency vs mySQL query

I have a MySQL table with about 9.5K rows; these won't change much, but I may slowly add to them.
I have a process where, if someone scans a barcode, I have to check whether that barcode matches a value in this table. What would be the fastest way to accomplish this? I must mention there is no pattern to these values.
Here are some thoughts:
Ajax call to a PHP file to query the MySQL table (my thought is this would be the slowest)
Load this MySQL table into an array on login. Then, when scanning, make an Ajax call to a PHP file to check the array
Load this table into an array on login. When viewing the scanning page, somehow load that array into a JavaScript array and check with JavaScript. (This seems to me to be the fastest because it eliminates the Ajax call and the MySQL query. Would it be efficient to split it into smaller arrays so I don't lag the server and browser?)
Honestly, I'd never load the entire table for anything. All I'd do is make an AJAX request back to a PHP gateway that then queries the database, and returns the result (or nothing). It can be very fast (as it only depends on the latency) and you can cache that result heavily (via memcached, or something like it).
There's really no reason to ever load the entire array for "validation"...
It is much faster to use a well-indexed MySQL table than to look through an array for something.
But in the end it all depends on what you really want to do with the data.
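A minimal sketch of such a gateway (the AJAX-to-PHP approach described above), assuming a table named barcodes with an indexed code column; the connection details are placeholders:
<?php
// check-barcode.php - returns 1 if the scanned code exists, 0 otherwise
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$code = isset($_GET['code']) ? $_GET['code'] : '';

// the prepared statement uses the index on barcodes.code for a fast lookup
$stmt = $pdo->prepare('SELECT 1 FROM barcodes WHERE code = ? LIMIT 1');
$stmt->execute(array($code));

echo $stmt->fetchColumn() ? '1' : '0';
The JavaScript side then only needs to check the returned flag, and the response could additionally be cached (e.g. in memcached) keyed by the barcode.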
As you mention, your table contains around 9.5K rows. There is no sense in loading that data on login or on the scanning page.
It is better to index your table and do an AJAX call whenever required.
Best of Luck!!
While 9.5K rows is not that much, the associated amount of data would still take some time to transfer.
Therefore - and in general - I'd propose running the validation of values on the server side. AJAX is the right technology to do this quite easily.
Loading all 9.5K rows only to find one specific row is definitely a waste of resources. Run a SELECT query for the single value.
Exposing PHP functionality on the client side / AJAX
Have a look at the xajax project, which allows you to expose whole PHP classes or single methods as AJAX methods on the client side. Moreover, xajax helps with the exchange of parameters between client and server.
Indexing the attributes to be searched
Please ensure that the column which holds the barcode value is indexed. If the verification process tends to be slow, look out for MySQL table scans.
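For example, assuming the table and column are called barcodes and code (adjust to your schema):
ALTER TABLE barcodes ADD INDEX idx_code (code);
-- or, if each barcode must be unique anyway:
ALTER TABLE barcodes ADD UNIQUE INDEX idx_code (code);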
Avoiding table scans
To avoid table scans and keep your queries running fast, use fixed-size fields. VARCHAR(), among other types, makes queries slower, since rows no longer have a fixed size; tables without fixed-size rows effectively prevent the database from easily predicting the location of the next row of the result set. Therefore, use e.g. CHAR(20) instead of VARCHAR(20).
Finally: Security!
Don't forget that any data transferred to the client side may expose sensitive information. While your 9.5K rows may not get rendered by the client's browser, the rows do exist in the generated HTML page. Using "Show source", any user would be able to figure out all valid numbers.
Exposing valid barcode values may or may not be a security problem in your project context.
PS: While not related to your question, I'd propose using PHPExcel for reading or writing spreadsheet data. Unlike other solutions, e.g. PEAR-based frameworks, PHPExcel has no external dependencies.

php parsing speed optimization

I would like to add tooltips or generate links according to the elements available in the database. For example, if the HTML page printed is:
to reboot your linux host in single-user mode you can ...
I will use explode(" ", $row['page']), and the idea is now to look up every single word in the page to find out whether it has a related reference. In this example, let's say I've got a table reference with one entry for reboot and one for linux:
reboot: restart a computer
linux: operating system
now my output will look like:
to <a href="ref/reboot">reboot</a> your <a href="ref/linux">linux</a> host in single-user mode you can ...
Instead of having a static list generated when I save the content, if I add more keywords in the future the existing text will become more interactive.
My main concern and question is: how can I create an efficient enough process to do this?
Should I store all the DB entries in an array and compare against them?
Do an SQL query for each word (seems crazy)?
Dump the table to a file and use a very long regex or a "grep -f pattern data" way of doing it?
Or... I'm sure there must be a better way of doing it; I just don't have a clue about it. Or maybe this will be far too resource-unfriendly and I should avoid doing such things altogether.
Cheers!
Depending on the number of keywords in the DB, there are two solutions:
1. If the number of keywords is less than the number of words in the text, just pull all the keywords from the DB and compare against them.
2. If the number of keywords is greater than the number of words in the text, dynamically create a single query that fetches all the necessary words, e.g. SELECT * FROM keywords WHERE keyword='system' OR keyword='linux' etc. (see the sketch below).
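A sketch of option 2, using an IN() list with bound parameters instead of chained ORs (a keywords table with keyword and description columns and a PDO connection in $pdo are assumptions):
<?php
// $words = the unique words found in the page text (assumed non-empty)
$words = array_values(array_unique(explode(' ', $row['page'])));

// build one query: SELECT ... WHERE keyword IN (?, ?, ...)
$placeholders = implode(',', array_fill(0, count($words), '?'));
$stmt = $pdo->prepare("SELECT keyword, description FROM keywords WHERE keyword IN ($placeholders)");
$stmt->execute($words);
$found = $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // keyword => description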
However, if you are really concerned about resources, I would suggest you create a caching system: you process each page once, then store both the original text and the result in the DB. If the keyword table is updated, you can reprocess all the pages again.
I would have added an additional field for each article that would contain 'keyword table version' which was used to process this article.
Each time a user opens an article, you should compare this version with version of the keyword list. If it is outdated, you process the article and save the results to the articles table. Otherwise you just show the article.
You can control the load by adding a date column for processing, and check it as well. If the item is relatively fresh, you may want to postpone the processing. Again, you may compare the version difference: if it is greater than 5 or 10, for instance, you should update the article. If you have an important keyword added, just increase the version of the keywords table by 10 and all your articles will be forced to update.
The main idea is distributing the load to user requests, and caching the results.
If your system is heavily loaded, you may want to use a random number generator to decide that you should update the article only with a 10% chance, for instance.
You can have an index of keywords stored somewhere statically (database, file, or in an array). When the content is updated, you can rebuild or update the index accordingly. You just have to make sure that it can be looked up very quickly.
Once you have it, you can look up very quickly whether a given word is in the index, because the index is optimized for this.
I would store the index as a sorted list in a file and look words up using binary search. This is a simple solution, and I think it should be quick enough if there is not too much data to process.
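A sketch of that lookup, assuming the index file holds one keyword per line and is already sorted (the file name is a placeholder):
<?php
// load the sorted keyword index once per request
$index = file('keywords.idx', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

function keyword_exists(array $index, $word)
{
    $lo = 0;
    $hi = count($index) - 1;
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        $cmp = strcmp($index[$mid], $word);
        if ($cmp === 0) return true;   // found it
        if ($cmp < 0) $lo = $mid + 1;  // look in the upper half
        else          $hi = $mid - 1;  // look in the lower half
    }
    return false;
}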
Or maybe you can send a list of words in the article to the database in one SQL query and have it return the list of articles that matches any of the word in the list.
Also, after an article is processed, you should cache the result, so that on subsequent requests for the same article you can serve the processed version instead of processing it every time.

What Would be a Suitable Way to Log Changes Within a Database Using CodeIgniter

I want to create a simple auditing system for my small CodeIgniter application, such that it would take a snapshot of a table entry before the entry is edited. One way I could think of would be to create a news_audit table, which would replicate all the columns in the news table. It would also create a new record for each change, with the added column of date added. What are your views and opinions on building such functionality into a PHP web application?
There are a few things to take into account before you decide which solution to use:
If your table is large (or could become large), your audit trail needs to be in a separate table as you describe, or performance will suffer.
If you need an audit trail that can't (potentially) be modified except by adding new entries, the application needs INSERT-only permissions on it (and, to be cast iron, it needs to live on a dedicated logging server...).
I would avoid creating audit records in the same table, as it might be confusing to another developer (who might not realize they need to filter out the old ones) and will clutter the table with audit rows, which will force the db to cache more disk blocks than it needs to (== performance cost). Also, indexing this properly might be a problem if your db does not index NULLs. Querying for the most recent version will involve a sub-query if you choose to timestamp them all.
The cleanest way to solve this, if your database supports it, is to create an UPDATE TRIGGER on your news table that copies the old values to a separate audit table (which needs only INSERT permissions). This way the logic is built into the database, so your applications need not be concerned with it; they just UPDATE the data and the db takes care of keeping the change log. The body of the trigger will just be an INSERT statement, so if you haven't written one before it should not take long to do.
If I knew which db you are using I might be able to post an example...
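For example, if the database happens to be MySQL, a trigger along these lines would do it (a sketch only; the news columns id, title, body and the news_audit layout are assumptions):
CREATE TABLE news_audit (
    id INT NOT NULL,
    title VARCHAR(255),
    body TEXT,
    changed_at DATETIME NOT NULL
);

DELIMITER //
CREATE TRIGGER news_before_update
BEFORE UPDATE ON news
FOR EACH ROW
BEGIN
    -- copy the pre-update values into the audit table
    INSERT INTO news_audit (id, title, body, changed_at)
    VALUES (OLD.id, OLD.title, OLD.body, NOW());
END//
DELIMITER ;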
What we do (and you would want to set up archiving beforehand, depending on size and use) is create an audit table that stores the user information, the time, the table name, and the changes as XML.
If you are in SQL2005+ you can then easily search the XML for changes if needed.
We then added triggers to our table to catch what we wanted to audit (inserts, deletes, updates...)
Then with simple serialization we are able to restore and replicate changes.
What scale are we looking at here? On average, are entries going to be edited often or infrequently?
Depending on how many edits you expect for the average item, it might make more sense to store diffs of large blocks of data as opposed to a full copy of the data.
One way I like is to put it into the table itself. You would simply add a 'valid_until' column. When you "edit" a row, you simply make a copy of it and stamp the 'valid_until' field on the old row. The valid rows are the ones without 'valid_until' set. In short, you make it copy-on-write (see the sketch below). Don't forget to make your primary keys a combination of the original primary key and the valid_until field. Also set up constraints or triggers to make sure that for each ID there can be only one row that does not have its valid_until set.
This has upsides and downsides. The upside is fewer tables. The downside is far more rows in your tables. I would recommend this structure if you often need to access old data. By simply adding a WHERE clause to your queries you can query the state of a table at a previous date/time.
If you only need to access your old data occasionally then I would not recommend this though.
You can take this all the way to the extreme by building a temporal database.
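A small SQL sketch of the edit step under that copy-on-write scheme (a news table with id, title, body and valid_until columns is an assumption):
BEGIN;
-- archive the current version by copying it with valid_until stamped
INSERT INTO news (id, title, body, valid_until)
    SELECT id, title, body, NOW()
    FROM news
    WHERE id = 42 AND valid_until IS NULL;
-- then apply the edit to the live row (valid_until stays NULL)
UPDATE news
    SET title = 'New title'
    WHERE id = 42 AND valid_until IS NULL;
COMMIT;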
In small to medium size projects I use the following set of rules:
All code is stored under a revision control system (e.g. Subversion)
There is a directory for SQL patches in the source code (e.g. patches/)
All files in this directory start with a serial number followed by a short description (e.g. 086_added_login_unique_constraint.sql)
All changes to the DB schema must be recorded as separate files. No file can be changed after it has been checked in to the version control system. All bugs must be fixed by issuing another patch. It is important to stick closely to this rule.
A small script remembers the serial number of the last executed patch in the local environment and runs subsequent patches when needed.
This way you can guarantee that you can recreate your DB schema easily without needing to import a whole data dump. Creating such patches is a no-brainer: just run the command in the console/UI/web frontend and copy-paste it into a patch file if successful. Then just add it to the repo and commit the changes.
This approach scales reasonably well. It worked for a PHP/PostgreSQL project consisting of 1300+ classes and 200+ tables/views.

Migrating database changes from development to live

Perhaps the biggest risk in pushing new functionality to live lies with the database modifications required by the new code. In Rails, I believe they have 'migrations', in which you can programmatically make changes to your development host, and then make the same changes live along with the code that uses the revised schema. And roll both back if need be, in a synchronized fashion.
Has anyone come across a similar toolset for PHP/MySQL? Would love to hear about it, or any programmatic or process solutions to help make this less risky...
I don't trust programmatic migrations. If it's a simple change, such as adding a NULLable column, I'll just add it directly to the live server. If it's more complex or requires data changes, I'll write a pair of SQL migration files and test them against a replica database.
When using migrations, always test the rollback migration. It is your emergency "oh shit" button.
I've never come across a tool that would do the job. Instead I've used individual files, numbered so that I know which order to run them: essentially, a manual version of Rails migrations, but without the rollback.
Here's the sort of thing I'm talking about:
000-clean.sql # wipe out everything in the DB
001-schema.sql # create the initial DB objects
002-fk.sql # apply referential integrity (simple if kept separate)
003-reference-pop.sql # populate reference data
004-release-pop.sql # populate release data
005-add-new-table.sql # modification
006-rename-table.sql # another modification...
I've never actually run into any problems doing this, but it's not very elegant. It's up to you to track which scripts need to run for a given update (a smarter numbering scheme could help). It also works fine with source control.
Dealing with surrogate key values (from autonumber columns) can be a pain, since the production database will likely have different values than the development DB. So, I try never to reference a literal surrogate key value in any of my modification scripts if at all possible.
I've used this tool before and it worked perfectly.
http://www.mysqldiff.org/
It takes as an input either a DB connection or a SQL file, and compares it to the same (either another DB connection or another SQL file). It can spit out the SQL to make the changes or make the changes for you.
@yukondude:
I'm using Perl myself, and I've gone down the route of Rails-style migrations semi-manually in the same way.
What I did was have a single table "version" with a single column "version", containing a single row of one number which is the current schema version. Then it was (quite) trivial to write a script to read that number, look in a certain directory and apply all the numbered migrations to get from there to here (and then updating the number).
In my dev/stage environment I frequently (via another script) pull the production data into the staging database, and run the migration script. If you do this before you go live you'll be pretty sure the migrations will work. Obviously you test extensively in your staging environment.
I tag the new code and the required migrations under one version control tag. To deploy to stage or live, you just update everything to this tag and run the migration script, which is fairly quick. (You might want to have arranged a short downtime if it's really wacky schema changes.)
The solution I use (originally developed by a friend of mine) is another addendum to yukondude's approach.
Create a schema directory under version control, and then for each DB change you make, keep a .sql file with the SQL you want executed, along with the SQL query that updates the db_schema table.
Create a database table called "db_schema" with an integer column named version.
In the schema directory create two shell scripts, "current" and "update". Executing current tells you which version of the db schema the database you're connected to is currently at. Running update executes each .sql file numbered greater than the version in the db_schema table sequentially until you're up to the greatest numbered file in your schema dir.
Files in the schema dir:
0-init.sql
1-add-name-to-user.sql
2-add-bio.sql
Here is what a typical file looks like; note the db_schema update at the end of every .sql file:
BEGIN;
-- comment about what this is doing
ALTER TABLE user ADD COLUMN bio text NULL;
UPDATE db_schema SET version = 2;
COMMIT;
The "current" script (for psql):
#!/bin/sh
VERSION=`psql -q -t <<EOF
\set ON_ERROR_STOP on
SELECT version FROM db_schema;
EOF
`
[ $? -eq 0 ] && {
echo $VERSION
exit 0
}
echo 0
the update script (also psql):
#!/bin/bash
CURRENT=`./current`
LATEST=`ls -vr *.sql |egrep -o "^[0-9]+" |head -n1`
echo current is $CURRENT
echo latest is $LATEST
[[ $CURRENT -gt $LATEST ]] && {
echo That seems to be a problem.
exit 1
}
[[ $CURRENT -eq $LATEST ]] && exit 0
#SCRIPT_SET="-q"
SCRIPT_SET=""
for (( I = $CURRENT + 1 ; I <= $LATEST ; I++ )); do
SCRIPT=`ls $I-*.sql |head -n1`
echo "Adding '$SCRIPT'"
SCRIPT_SET="$SCRIPT_SET $SCRIPT"
done
echo "Applying updates..."
echo $SCRIPT_SET
for S in $SCRIPT_SET ; do
psql -v ON_ERROR_STOP=TRUE -f $S || {
echo FAIL
exit 1
}
done
echo OK
My 0-init.sql has the full initial schema structure along with the initial "UPDATE db_schema SET version = 0;". Shouldn't be too hard to modify these scripts for MySQL. In my case I also have
export PGDATABASE="dbname"
export PGUSER="mike"
in my .bashrc. And it prompts for the password for each file that is executed.
Symfony has a plugin called sfMigrationsLight that handles basic migrations. CakePHP also has migrations.
For whatever reason, migration support has never really been a high priority for most of the PHP frameworks and ORMs out there.
Pretty much what Lot105 described.
Each migration needs an apply and rollback script, and you have some kind of control script which checks which migration(s) need to be applied and applies them in the appropriate order.
Each developer then keeps their db in sync using this scheme, and when applied to production the relevant changes are applied. The rollback scripts can be kept to back out a change if that becomes necessary.
Some changes can't be done with a simple ALTER script such as a tool like sqldiff would produce; some changes don't require a schema change but a programmatic change to existing data. So you can't really generalise, which is why you need a human-edited script.
I use SQLyog to copy the structure, and I ALWAYS, let me repeat ALWAYS make a backup first.
I've always preferred to keep my development site pointing to the same DB as the live site. This may sound risky at first but in reality it solves many problems. If you have two sites on the same server pointing to the same DB, you get a real time and accurate view of what your users will see when it goes live.
You will only ever have 1 database and so long as you make it a policy to never delete a column from a table, you know your new code will match up with the database you are using.
There is also significantly less havoc when migrating. You only need to move over the PHP scripts and they are already tested using the same DB.
I also tend to create a symlink to any folder that is a target for user uploads. This means there is no confusion on which user files have been updated.
Another side effect is the option of porting over a small group of 'beta testers' to use the site in everyday use. This can lead to a lot of feedback that you can implement before the public launch.
This may not work in all cases but I've started moving all my updates to this model. It's caused much smoother development and launches.
In the past I have used LiquiBase, a Java-based tool where you configure your migrations as XML files. You can generate the necessary SQL with it.
Today I'd use the Doctrine 2 library which has migration facilities similar to Ruby.
The Symfony 2 framework also has a nice way to deal with schema changes - its command line tool can analyze the existing schema and generate SQL to match the database to the changed schema definitions.
