I would like to write a PHP script that merges several databases, and I would like to be sure of how to go about it before I start anything.
I have 4 databases which have the same structure and almost the same data. I want to merge them without any duplicate entries while preserving (or re-linking) the foreign keys.
For example, there is a db1.product table which is almost the same as db2.products, so I think I would have to use a LIKE comparison on the name and description columns to be sure that I only insert new rows. But then, when merging the orders table, I have to make sure that the productID still points to the right product.
So I thought of 2 solutions:
Either I use, for each table, insert into db1.x select * from db2.x and then rebuild the links and check for duplicates using triggers.
Or I delete the duplicate entries and update the foreign keys (after having dropped the constraints), and then insert the rows into the main database.
I just heard of MySQL Data Compare and Toad for MySQL; could they help me merge the tables?
Could someone point me towards the right solution?
Sorry for my English, and thank you!
The first thing is: how are you determining whether products are the same? You mentioned a LIKE comparison on name and description. You need to establish a rule that says a product is one and the same across your db1, db2 and so on.
For now, let's assume that a product's name and description are the attributes that define it.
ALTER TABLE products ADD UNIQUE (`name`, `description`);
Run this on all of your databases.
After you've done that, select one of the databases you wish to import into and run the following query:
INSERT IGNORE INTO db1.products SELECT * FROM db2.products;
Repeat for the remaining databases.
Naturally, this all fails if you can't determine how you're going to compare the products.
Note: never use reserved words, such as "name", for your column names.
Firstly, good luck with this - sounds like a tricky job.
Secondly, I wouldn't do this with PHP - I'd write SQL to do the work, assuming this is a one-off migration task and not a recurring task.
As an approach, I would do the following.
Create a database with the schema you want - it sounds like each of your 4 databases has small variations in the schema. Just create the schema for now; don't worry about the data.
Create a "working" database, with the same schema, but with columns for "old" primary keys. For instance:
table ORDER
order_id int primary key auto increment
old_order_id int not null
...other columns...
table ORDER_LINE
order_line_id int primary key auto increment
old_order_line_id int not null
order_id int foreign key
...other columns...
Table by table, insert into your working database from your first source database. Let the primary keys auto_increment, but put the original primary key into the "old_" column.
For instance:
insert into workingdb.orders
select null, order_id, ...other columns...
from db1.orders
Where you have a foreign key, populate it by finding the record in the old_ column.
For instance:
insert into workingdb.order_line
select null, ol.order_line_id, o.order_id, ...other columns...
from db1.order_line ol,
     workingdb.orders o
where ol.order_id = o.old_order_id
Rinse and repeat for the other databases.
Finally, copy the data from your working database into the "proper" database. This is optional - it may help to retain the old IDs for lookups etc.
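For that last copy, a minimal sketch might look like the following (the target database name maindb and the non-key column names are placeholders, not from the question):
-- Sketch only: copy everything except the old_* helper columns into the final schema.
insert into maindb.orders (order_id, order_date)          -- order_date stands in for the real columns
select order_id, order_date
from workingdb.orders;

insert into maindb.order_line (order_line_id, order_id, quantity)   -- quantity is likewise a placeholder
select order_line_id, order_id, quantity
from workingdb.order_line;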
I know this has something to do with the primary key and unique keys, but I'm not sure how to make it work. Basically, I want MySQL to insert a new row even if its data is a duplicate of previous rows. Right now, any duplicate of a previous row results in the row not being inserted. Help is very much appreciated.
The table you have defined has all of its columns in the primary key and marked unique. It's ideal to keep one column as the primary key (perhaps with auto increment) and the rest as non-indexed columns. Check the table definition with the following MySQL query if you are not familiar with phpMyAdmin:
desc tablename;
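For comparison, here is a sketch of a definition that allows duplicate data rows (the column names are made up, since your actual table definition isn't shown):
CREATE TABLE entries (
  id INT NOT NULL AUTO_INCREMENT,   -- only this column must be unique
  col_a VARCHAR(100) NOT NULL,      -- hypothetical data columns
  col_b VARCHAR(100) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;
With only id in the primary key, MySQL will happily insert rows whose col_a and col_b values duplicate earlier rows.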
I am migrating a custom-made web site to WordPress. First I have to migrate the data from the previous web site, and then, every day, I have to insert some data using an API.
The data I would like to insert comes with a unique ID representing a single football game.
In order to avoid inserting the same game multiple times, I made a db table with the following structure:
CREATE TABLE `ss_highlight_ids` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`highlight_id` int(10) unsigned zerofill NOT NULL DEFAULT '0000000000',
PRIMARY KEY (`id`),
UNIQUE KEY `highlight_id_UNIQUE` (`highlight_id`),
KEY `highlight_id_INDEX` (`highlight_id`) COMMENT 'Contains a list with all the highlight IDs. This is used as an index, and disallows the creation of duplicate records.'
) ENGINE=InnoDB AUTO_INCREMENT=2967 DEFAULT CHARSET=latin1
and when I try to insert a new record into my WordPress db, I would first like to look up this table to see if the ID already exists.
The question now :)
What's preferable? To load all the IDs using a single SQL query and then use plain PHP to check whether the current game ID exists, or is it better to query the DB for every single row I insert?
I know that MySQL queries are resource-expensive, but on the other hand I currently have about 3k records in this table, and this will grow past 30-40k in the next few years, so I don't know if it's good practice to load all of those records into PHP.
What is your opinion / suggestion ?
UPDATE #1
I just found that my table is 272 KiB in size with 2,966 rows. This means that in the near future it looks like it will reach a size of about 8,000 KiB or more, and keep growing.
UPDATE #2
Maybe I have not made it clear enough. For the first insertion, I have to iterate over a CSV file with about 12K records, and after the CSV insertion I will insert about 100-200 records every day. All of those records require a lookup in the table with the IDs.
So the exact question is: is it better to run 12K queries in MySQL during the CSV insertion and then about 100-200 MySQL queries every day, or to just load the IDs into server memory and use PHP for the lookup?
Your table has a column id which is auto_increment; that means there is no need to insert anything into that column. It will fill itself.
highlight_id is UNIQUE, so it may as well be the PRIMARY KEY; get rid of id.
A PRIMARY KEY is a UNIQUE key is an INDEX. So this is redundant:
KEY `highlight_id_INDEX` (`highlight_id`)
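Putting those points together, the table could shrink to something like this (a sketch that keeps your original column definition):
CREATE TABLE `ss_highlight_ids` (
  `highlight_id` int(10) unsigned zerofill NOT NULL DEFAULT '0000000000',
  PRIMARY KEY (`highlight_id`)   -- the UNIQUE key and the extra KEY become unnecessary
) ENGINE=InnoDB DEFAULT CHARSET=latin1;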
Back to your question... SQL is designed to do things in batches. Don't defeat that by doing things one row at a time.
How can the table be 272 KiB in size if it has only two columns and 2,966 rows? If there are more columns in the table, show them. They often give good clues about what you are doing and how to make it more efficient.
2966 rows is 'trivial'; you will have to look closely to see performance differences.
Loading from CSV...
If this is a replacement, use LOAD DATA, building a new table, then RENAME to put it into place. One CREATE, one LOAD, one RENAME, one DROP. Much more efficient than 100 queries of any kind.
If the CSV is updates/inserts, LOAD into a temp table, then do INSERT ... ON DUPLICATE KEY UPDATE ... to perform the updates/inserts into the real table. One CREATE, one LOAD, one IODKU. Much more efficient than 100 queries of any kind.
If the CSV is something else, please elaborate.
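For the updates/inserts case, a rough sketch (the file path, staging table name, and CSV layout here are assumptions):
-- Stage the CSV in a temporary copy of the table, then merge it in one statement.
CREATE TEMPORARY TABLE ss_highlight_ids_stage LIKE ss_highlight_ids;

LOAD DATA INFILE '/path/to/highlights.csv'   -- hypothetical path and format
INTO TABLE ss_highlight_ids_stage
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(highlight_id);

INSERT INTO ss_highlight_ids (highlight_id)
SELECT highlight_id FROM ss_highlight_ids_stage
ON DUPLICATE KEY UPDATE highlight_id = VALUES(highlight_id);   -- duplicates are effectively skipped

DROP TEMPORARY TABLE ss_highlight_ids_stage;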
I want to begin with a thank you; you guys have been good to me.
I will go straight to the question.
Having a table with over 400 columns, is that bad?
I have web forms that consist mainly of questions that require check box answers.
The total number of check boxes can run up to 400 if not more.
I actually modeled one of the forms, and put each check box in a column (took me hours to do).
Because of my unfamiliarity with database design, I did not feel like that was the right way to go.
So I read somewhere that some people use the serialize function to store a group of check boxes as text in a column.
I just want to know if that would be the best way to store these check boxes.
Oh, and some more info: I will be using the CakePHP ORM with these tables.
Thanks again in advance.
My database looks something like this:
Table: Patients, Table: admitForm, Table: SomeOtherForm
Each form table will have a PatientId.
As I stated above, I first attempted to create a table for each form and then put each check box in a column. That took me forever to do.
Then I read somewhere that serializing the check boxes per question would be a good idea.
So I'm asking what would be a good approach.
For questions with multiple options, just add another table.
The question that nobody has asked you yet is: do you need to do data mining, or put the answers to these checkbox questions into a WHERE clause in a query? If you don't need to run any queries that look at the data contained in these answers, then you can simply serialize them into a few fields. You could even pack them into numbers (although all who come after you will hate you if you pack the data).
Here's my idea of a schema.
== Edit #3 ==
Updated the ERD with the ability to store free-form answers; also linked patient_response_option to the question_option_link table so a patient's response will be saved with the correct option context (we know which question the response belongs to). I will post a few queries soon.
== Edit #2 ==
Updated ERD with form data
== Edit #1 ==
The short answer to your question is no, 400 columns is not the right approach. As an alternative, check out the following schema:
== Original ==
According to your recent edit, you will want to incorporate a pivot table. A pivot table breaks up an M:M relationship between 'patients' and 'options'; for example, many patients can have many options. For this to work, you don't need a table with 400 columns; you just need to incorporate the aforementioned pivot table.
Example schema:
// patient table
tableName: patient
id: int(11), autoincrement, unsigned, not null, primary key
name_first: varchar(100), not null
name_last: varchar(100), not null
// Options table
tableName: option
id: int(11), autoincrement, unsigned, not null, primary key
name: varchar(100), not null, unique key
// pivot table
tableName: patient_option_link
id: int(11), autoincrement, unsigned, not null, primary key
patient_id: Foreign key to patient (`id`) table
option_id: Foreign key to option (`id`) table
With this schema you can have any number of 'options' without having to add a new column to the patient table. That matters because, if you have a large number of rows, your database will be crushed if you ever have to run an ALTER TABLE ... ADD COLUMN command.
I added an id to the pivot table, so if you ever need to handle individual rows, they will be easier to work with, vs having to know the patient_id and option_id.
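In MySQL, that schema might look roughly like this (a sketch; the types and key names beyond those listed above are assumptions):
CREATE TABLE patient (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name_first VARCHAR(100) NOT NULL,
  name_last VARCHAR(100) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- "option" is a reserved word in MySQL, hence the backticks.
CREATE TABLE `option` (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_option_name (name)
) ENGINE=InnoDB;

CREATE TABLE patient_option_link (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  patient_id INT UNSIGNED NOT NULL,
  option_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (patient_id) REFERENCES patient (id),
  FOREIGN KEY (option_id) REFERENCES `option` (id)
) ENGINE=InnoDB;

-- Example lookup: every patient who ticked a given (hypothetical) option.
SELECT p.name_first, p.name_last
FROM patient p
JOIN patient_option_link pol ON pol.patient_id = p.id
JOIN `option` o ON o.id = pol.option_id
WHERE o.name = 'some_checkbox';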
I think I would split this out into 3 tables. One table representing whatever entity is answering the questions. A second table containing the questions themselves. Finally, a third junction table that will be populated with the primary key of the first table and the id of the question from the second table whenever the entity from the first table selects the check box for that question.
Usually 400 columns means your data could be normalized better and broken into multiple tables. 400 columns might actually be appropriate, though, depending on the use case. An example where it might be appropriate is if you need these fields on every single query AND you need to filter records using these columns (ie: use them in your WHERE clause)... in that case the SQL JOINs will likely be more expensive than having a sparsely populated "wide" table.
If you never need to use SQL to filter out records based on these "checkboxes" (I'm guessing they are yes/no boolean/tinyint type values) then serializing is a valid approach. I would go this route if I needed to use the checkbox values most of time I query the table, but don't need to use them in a WHERE clause.
If you don't need these checkbox values, or only need a small subset of them, on the majority of requests to your table, then it's likely you should work on breaking your table into multiple tables. One approach is to have a table with the checkbox values (id, record_id, checkbox_name, checkbox_value) where record_id is the id of your primary table record. This implies a one-to-many relationship between your primary records and your checkbox values, as sketched below.
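A sketch of that layout (the table name and types are assumptions; the column names come from the list above):
-- One row per checkbox per record; record_id points back at the primary table.
CREATE TABLE record_checkbox (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  record_id INT UNSIGNED NOT NULL,             -- id of the row in your primary table
  checkbox_name VARCHAR(100) NOT NULL,
  checkbox_value TINYINT(1) NOT NULL DEFAULT 0,
  PRIMARY KEY (id),
  KEY ix_record (record_id)
) ENGINE=InnoDB;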
I have some location data, which is in a table locations, the key being the unique location_id.
I have some user data, which is in a table users, the key being the unique user_id.
Two ways I was thinking of linking these two together:
I can put the 'location' in each user's data.
'SELECT user_id FROM users WHERE location = "LOCATIONID";'
//this IS NOT searching with the table's key
//this does not require an explode
//this stores 1 integer per user
I can also put the 'userIDs' as a comma delimited string of ids into each location's data.
'SELECT userIDs FROM locations WHERE location_id = "LOCATIONID";'
//this IS searching with the tables key
//this needs an explode() once the comma delimited list is retrieved
//this stores 1 string of user ids per location
So I wonder which would be most efficient. I'm not really sure how much the size of the stored data could also impact the speed. I want retrievals that are as fast as possible when trying to find out which users are at which location.
This is just an example, and there will be many other tables like location to compare to the users, so the efficiency, or lack of, will be multiplied across the whole system.
Stick with option 1. Keep your database tables normalised as much as possible till you know you have a performance problem.
There's a whole slew of problems with option 2, including not being able to use the user IDs until you pull them into PHP, and then having to fire off many more SQL queries, one for each ID. This is extremely inefficient. Do as much inside MySQL as possible; the optimisations the database layer can make while running the query will easily be quicker than anything you write in PHP.
Regarding your point about not searching on the primary key, you should add an index to the location column. All columns that are in a WHERE clause should be indexed as a general rule. This negates the issue of not searching on the primary key, as the primary key is just another type of index for the purposes of performance.
Use the first one to keep your data normalized. You can then query for all users for a location directly from the database without having to go back to the database for each user.
Be sure to add the correct index on your users table too.
CREATE TABLE locations (
locationId INT PRIMARY KEY AUTO_INCREMENT
) ENGINE=INNODB;
CREATE TABLE users (
userId INT PRIMARY KEY AUTO_INCREMENT,
location INT,
INDEX ix_location (location)
) ENGINE=INNODB;
Or, to only add the index:
ALTER TABLE users ADD INDEX ix_location(location);
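With that index in place, finding every user at a location is a single indexed query, for example:
SELECT userId
FROM users
WHERE location = 123;   -- 123 is a placeholder location id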
Have you heard of foreign keys?
Get details from many tables using a JOIN.
You can also use a subquery.
As you said, there are two tables: users and locations.
Keep userid as a foreign key in locations and fetch it based on that.
When you store the user IDs as a comma-separated list in a table, that table is not normalized (in particular, it violates the first normal form, item 4).
It is perfectly valid to denormalize tables for optimization purposes, but only after you have measured that this is actually where the bottleneck is in your specific situation. That, however, can only be determined if you know which queries are executed how often, how long they take, and whether their performance is critical (relative to other queries).
Stick with option 1 unless you know exactly why you have to denormalize your table.
How do I retrieve a multi-column PK in MySQL?
For example, I have my primary key set up as
PRIMARY KEY (donor_id,country_id)
Now, if I want to get the primary key value without concatenating those 2 fields in a SELECT query, how do I do that? I want to use this in a view (or, better yet, directly in PHPMaker).
It's not clear what you mean by "without concatenating". A simple
SELECT donor_id, country_id FROM table WHERE ...;
will retrieve the records; you don't need to apply a CONCAT() function or anything like that. This is the Right Way to select two columns from a table; the fact that they both happen to be declared part of the primary key changes nothing.
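If you want this available through a view, as mentioned in the question, a minimal sketch (the table name donors is assumed) would be:
-- The view simply exposes both key columns; no concatenation is needed.
CREATE VIEW donor_country AS
SELECT donor_id, country_id
FROM donors;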
No special way is needed to get the records from a table that has a multi-column PK in MySQL. Things might be different if you are using an ORM; an ORM may or may not have special or different syntax/features for working with tables that have a multi-column PK.