Here's the situation:
I have a MySQL DB on a remote server. I need data from 4 of its tables. On occasion, the schema of these tables is changed (new fields are added, but never removed). At the moment, the tables have > 300,000 records.
This data needs to be imported into the localhost MySQL instance. The same 4 tables exist there (with the same names), but the fields needed are a subset of the fields in the remote DB tables. The data in these local tables is considered read-only and is never written to. Everything needs to be run in a transaction so there is always some data in the local tables, even if it is a day old. The localhost tables are used by an active website, so this entire process needs to complete as quickly as possible to minimize downtime.
This process runs once per day.
The options as I see them:
Option 1: Get a mysqldump of the structure and data of the remote tables and save it to a file. Drop the localhost tables and run the dumped SQL script, then recreate the needed indexes on the 4 tables.
Option 2: Truncate the localhost tables. Run SELECT queries against the remote DB in PHP, retrieving only the fields needed instead of the entire rows, then loop through the results and build INSERT statements from that data (the local side of this option is sketched below).
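For Option 2, a minimal sketch of what the local side might look like (table and column names are placeholders; note that TRUNCATE causes an implicit commit in MySQL, so DELETE is used here to keep the whole refresh inside a single InnoDB transaction):

START TRANSACTION;

-- Clear the old copy (DELETE instead of TRUNCATE so it can still be rolled back)
DELETE FROM table1;

-- Re-insert the rows fetched from the remote server, batched into multi-row
-- INSERTs built by the PHP loop
INSERT INTO table1 (id, field_a, field_b) VALUES
    (1, 'value a1', 'value b1'),
    (2, 'value a2', 'value b2');

-- ...repeat for the other three tables...

COMMIT;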
My questions:
Performance-wise, which is my best option?
Which one will complete the fastest?
Will either one put a heavier load on the server?
Would indexing the tables take the same amount of time in both options?
If there is no good reason for the local DB to be a subset of the remote one, make the structure the same and enable database replication on the needed tables. Replication works by the master tracking all changes made and managing each slave DB's pointer into those changes; each slave effectively asks "give me all changes since my last request." For a sizeable database, this is far more efficient than either of the alternatives you have listed, and it comes at only modest cost.
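For reference, a rough sketch of the slave-side setup (host, user, and table names are placeholders; the CHANGE REPLICATION FILTER statement needs MySQL 5.7+, on older versions the same filter goes into the slave's my.cnf as replicate-do-table lines):

-- Binlog coordinates come from SHOW MASTER STATUS on the master
CHANGE MASTER TO
    MASTER_HOST = 'remote.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '********',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;

-- Only replicate the 4 tables the website needs
CHANGE REPLICATION FILTER REPLICATE_DO_TABLE = (mydb.table1, mydb.table2, mydb.table3, mydb.table4);

START SLAVE;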
As for schema changes, I think the ALTER information is logged by the master, so the slave(s) can replicate those as well. The mechanism definitely replicates DROP TABLE IF EXISTS ... and CREATE TABLE ... SELECT, so ALTER should logically follow, but I have not tried it.
Here it is: confirmation that ALTER is properly replicated.
This is somewhat of an abstract question, but hopefully pretty simple at the same time. I just have no idea of the best way to go about this other than an export/import, and I can't do that due to permission issues. So I need some alternatives.
On one server, we'll call it 1.2.3, I have a database with 2 schemas, Rdb and test. These schemas have 27 and 3 tables respectively. This database stores call info from our phone system, but we have reader access only, so we're very limited in what we can do beyond selecting and joining for data records and info.
I then have a production database server, call it 3.2.1, with my main schemas, and I'd like to place those 30 tables into one of these production schemas. Once that's done, I'll need to create a script that will check the data on the first connection and then update the new schema on the production connection, but that's after the bulk migration is done.
I'm wondering if a PHP script would be the way to go about this initial migration, though. I'm using MySQL Workbench and the export wizard fails for the read-only database, but if there's another way in the interface then I don't know about it.
It's quite a bit of data, and I'm not necessarily looking for the fastest way but the easiest and most fail-safe way.
For a one time data move, the easiest way is to use the command line tool mysqldump to dump your tables to file, then load the resulting file with mysql. This assumes that you are either shutting down 1.2.3, or will reconfigure your phone system to point to 3.2.1 (or update DNS appropriately). Also, this is much easier if you can get downtime on the phone system to move the data.
we have reader access only so we're very limited in what we can do beyond selecting and joining for data records
This really limits your options.
Master/Slave replication requires the REPLICATION SLAVE privilege, and you would probably need a user with the SUPER privilege to create a replication user.
Trigger-based replication solutions like SymmetricDS will require a user with CREATE ROUTINE in order to create the triggers.
An "Extract, Transform, Load" solution like CloverETL will work best if the tables have LAST_CHANGED timestamps. If they don't, you would need the ALTER privilege to add them.
Different tools for different goals.
Master/Slave replication is generally used for disaster recovery, availability, or read scaling.
Heterogeneous replication is used to replicate some (or all) tables between different environments (possibly different RDBMSes, or different replica sets) in a continuous but asynchronous fashion.
ETL is used for bulk hourly/daily/periodic data movements, with the ability to pick a subset of columns, aggregate, convert timestamp formats, merge multiple sources, and generally fix whatever you need to in the data.
That should help you determine what your situation really is: whether it's a one-time load with a temporary data sync, or an ongoing (real-time or delayed) replication.
Edit:
Check out the Percona Toolkit, specifically pt-table-sync and pt-table-checksum: https://www.percona.com/doc/percona-toolkit/LATEST/index.html
They will help with this.
This is probably something any team will encounter at some point, so I'm counting on the experience other people have had.
We are in the process of migrating an old MySQL database to a new database whose structure has changed quite a bit. Some tables were split into multiple smaller tables, and some data was joined from several smaller tables into one larger table.
We ran a test and it takes a few hours to migrate the database to the new form. The problem is that the old database is our production database and it changes every minute. We cannot have a few hours of downtime.
What approach do you think would be ok in such a situation?
Let's say you have a table called "users" with 1M rows. It's being changed every second: some fields are updated, some rows are added, and some rows are deleted. That's why we cannot simply take a snapshot at a certain point in time; by the time the migration is done, we would have 3 hours of unsynced data.
One approach I've used in the past was to use replication.
We created a replication scheme between the old production database and a slave which was used for the data migration. When we started the migration, we switched off the replication temporarily, and used the slave database as the data source for the migration; the old production system remained operational.
Once the migration script had completed, and our consistency checks had run, we re-enabled replication from the old production system to the replicated slave. Once the replication had completed, we hung up the "down for maintenance" sign on production, re-ran the data migration scripts and consistency checks, pointed the system to the new database, and took down the "down for maintenance" sign.
There was downtime, but it was minutes, rather than hours.
This does depend on your database schema to make it easy to identify changed/new data.
If your schema does not lend itself to easy querying to find new or changed records, and you don't want to add new columns to keep track of this, the easiest solution is to create separate tables to keep track of the migration status.
For instance:
TABLE: USERS (your normal, replicated table)
----------------------
USER_ID
NAME
ADDRESS
.....
TABLE: USERS_STATUS (keeps track of changes, only exists on the "slave")
-----------------
USER_ID
STATUS
DATE
You populate this table via triggers on the USERS table for insert, delete, and update; for each of those actions, you set a different status.
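For example, the insert trigger might look something like this (a sketch using the column names from the example above; the update and delete triggers follow the same pattern, with the delete trigger reading OLD.USER_ID instead of NEW.USER_ID):

CREATE TRIGGER users_track_insert
AFTER INSERT ON USERS
FOR EACH ROW
    INSERT INTO USERS_STATUS (USER_ID, STATUS, DATE)
    VALUES (NEW.USER_ID, 'INSERT', NOW());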
This allows you to quickly find all records that changed since you ran your first migration script, and only migrate those records.
Because you're not modifying your production environment, and the triggers only fire on the "slave" environment, you shouldn't introduce any performance or instability problems on the production environment.
There's one approach I used once that should work for you too; however, you'll need to modify your production data for it. Just briefly:
Add a new column named "migrated" (or similar) to every table you want to migrate. Give it a boolean type. Set it to 0 by default.
When your migration script runs, it has to set this flag to 1 for every entry that has been migrated to the new DB. All entries that are already 1 have to be ignored. That way you won't run into synchronization issues.
That way you can run the migration script as often as you like.
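A rough sketch of that idea (table and column names are just examples):

ALTER TABLE users ADD COLUMN migrated TINYINT(1) NOT NULL DEFAULT 0;

-- Each run of the migration script only picks up rows that are not flagged yet
SELECT * FROM users WHERE migrated = 0;

-- ...transform and insert those rows into the new database, then flag them:
UPDATE users SET migrated = 1 WHERE user_id IN (/* the ids just migrated */);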
You will have downtime, but only a minimal one, because during that window you only have to migrate a few rows (practically just the last "delta" between the previous run of the migration script and now).
Could you run the new database in parallel with the current one? That way you can later migrate the old data from your old db to your new one and your "live" situation will already have been captured on the new one.
What I mean is: when you write something to the old db, you will also have to write the data to the new one.
We created an import script that imports about 120 GB of data into a MySQL database. The data is stored in a few hundred directories (each one a separate database). Each directory contains files with table structures and table data.
The issue: it works on my local machine with a subset of the actual data, but when the import is run on the server (which takes a few days), not all the tables are created (even tables that were tested locally). The odd thing is that the script, when run on the server, does not show any errors during the creation of the tables.
Here is, at a high level, how the script works:
Find all directories that represent a database
Create all databases
Per database, loop through the tables: create the table, fill the table
I added the code in a gist: https://gist.github.com/3349872
Add more logging to see which steps succeeded, since you might be having problems with memory usage or execution times.
Why don't you create SQL files from the given CSV files and then just do a normal import in bash?
mysql -u root -ppassword db_something < db_user.sql
Ouch, the problem was in the code. An amazingly stupid mistake.
When testing the code on a subset of all the files, all table information and table content were available. When a table could not be created, the function logged a message and then returned. On the real data this was the mistake: there are files with no data and no structure, so after creating a few tables, the creation of the tables for a certain database failed, the function returned, and so the remaining tables were never created.
I am working on a PHP class implementing PDO to sync a local database's table with a remote one.
The Question
I am looking for some ideas / methods / suggestions on how to implement a 'backup' feature in my 'syncing' process. The idea is: before the actual insert of the data takes place, I do a full wipe of the local table's data. Time is not a factor, so I figure this is the cleanest and simplest solution, and I won't have to worry about checking for differences in the data and all that jazz. The problem is, I want to implement some kind of safety measure in case there is a problem during the insert of the data, like loss of the internet connection or something. The only idea I have so far is: copy said table to be synced -> wipe said table -> insert remote table's data into local table -> if successful, delete backup copy.
Check out mk-table-sync. It compares two tables on different servers, using checksums of chunks of rows. If a given chunk is identical between the two servers, no copying is needed. If the chunk differs, it copies just the chunk it needs. You don't have to wipe the local table.
Another alternative is to copy the remote data to a distinct table name. If it completes successfully, then DROP the old table and RENAME the new local copy to the original table's name. If the copy fails or is interrupted, then drop the local copy with the distinct name and try again. Meanwhile, your other local table with the previous data is untouched.
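A minimal sketch of that swap (table names are placeholders):

-- Load the remote data into a working copy first
CREATE TABLE my_table_new LIKE my_table;
-- ...insert the remote data into my_table_new here...

-- Both renames happen in one atomic operation, so readers always see a complete table
RENAME TABLE my_table TO my_table_old,
             my_table_new TO my_table;
DROP TABLE my_table_old;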
The following is a web tool that syncs databases between you and a server or other developers.
It is Git-based, so you should be using Git in your project.
But it is only helpful while developing an application; it is not a tool for comparing databases.
To sync databases, you regularly push code to Git.
Git project: https://github.com/hardeepvicky/DB-Sync
I have two databases: one is a Firebird database, the other is a MySQL database.
The Firebird database is the main one, where the information changes. I have to synchronize those changes to the other, MySQL, database.
I have no control over the Firebird one: I can just SELECT from it. I cannot add triggers, events, or similar. I have full control over the MySQL database.
The synchronization has to be done over the internet, as these two servers are not connected in any way and are in different locations.
Synchronization has to be done in PHP on the server that also hosts the MySQL database.
Currently I just go through every record (every 15 minutes), calculate a hash of each row, compare the two hashes, and if they don't match, I update the whole row. It works, but it just seems very wrong and not optimized in any way.
Is there any other way to do this? Am I missing something?
Thank you.
I have done the same thing once, and I don't think there is a generally better solution.
You can only more or less optimize what you have so far. For example:
If some of the tables have a column with "latest update" information, you can select only the rows that were changed since the last sync (see the sketch below).
You can change the comparison mechanism: instead of comparing and updating whole rows, you can compare individual columns and, on the MySQL side, update only the changed ones. I believe that would speed things up with MyISAM tables, but probably not if you use InnoDB.
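To illustrate the first point, a sketch assuming the Firebird tables expose some kind of LAST_UPDATE timestamp and the MySQL copies share the same primary key (all table and column names here are placeholders; rows deleted on the Firebird side would still need separate handling):

-- Firebird side: fetch only rows changed since the previous successful run
SELECT * FROM calls WHERE last_update > :last_successful_sync;

-- MySQL side: apply each fetched row as an upsert instead of hashing and
-- comparing whole rows
INSERT INTO calls (id, caller, duration, last_update)
VALUES (:id, :caller, :duration, :last_update)
ON DUPLICATE KEY UPDATE
    caller = VALUES(caller),
    duration = VALUES(duration),
    last_update = VALUES(last_update);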