This is a somewhat abstract question, but hopefully a simple one. I have no idea what the best way to go about this is, other than an export/import, and I can't do that because of permission issues. So I need some alternatives.
On one server, which we'll call 1.2.3, I have a database with 2 schemas, Rdb and test. These schemas have 27 and 3 tables respectively. This database stores call info from our phone system, but we have reader access only, so we're very limited in what we can do beyond selecting and joining for data records and info.
I then have a production database server, call it 3.2.1, with my main schemas, and I'd like to place the previous 30 tables into one of these production schemas. After the migration is done, I'll need to create a script that will check the data on the first connection and then update the new schema on the production connection, but that's after the bulk migration is done.
I'm wondering if a PHP script would be the way to go about this initial migration, though. I'm using MySQL Workbench, and the export wizard fails for the read-only database; if there's another way in the interface, I don't know about it.
It's quite a bit of data, and I'm not necessarily looking for the fastest way but the easiest and most fail safe way.
For a one time data move, the easiest way is to use the command line tool mysqldump to dump your tables to file, then load the resulting file with mysql. This assumes that you are either shutting down 1.2.3, or will reconfigure your phone system to point to 3.2.1 (or update DNS appropriately). Also, this is much easier if you can get downtime on the phone system to move the data.
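As a rough sketch of the dump-and-load step, the snippet below only builds the two command lines and prints them; it does not run anything. The host addresses and the schema name Rdb come from the question, while the user names (reader, writer), the target schema (prod_schema) and the dump file (rdb.sql) are placeholders. The --single-transaction and --skip-lock-tables options are included on the assumption that the read-only account lacks the LOCK TABLES privilege; they only give a consistent snapshot for transactional tables such as InnoDB.

```python
import shlex


def dump_command(host, user, schema, out_file):
    """Build a mysqldump command that dumps every table of one schema."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--password",            # no value given, so the tool prompts for it
        "--single-transaction",  # consistent snapshot without locking (InnoDB)
        "--skip-lock-tables",    # assumption: the reader has no LOCK TABLES privilege
        f"--result-file={out_file}",
        schema,                  # schema name only, so no CREATE DATABASE is emitted
    ]


def load_command(host, user, target_schema):
    """Build the mysql command that replays a dump into the target schema."""
    return ["mysql", f"--host={host}", f"--user={user}", "--password", target_schema]


# Placeholder account, schema and file names; adjust to the real environment.
print(shlex.join(dump_command("1.2.3", "reader", "Rdb", "rdb.sql")))
print(shlex.join(load_command("3.2.1", "writer", "prod_schema")) + " < rdb.sql")
```

Dumping each source schema by name, rather than with --databases, keeps CREATE DATABASE and USE statements out of the file, so the tables land in whichever production schema the load command names.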
we have reader access only so we're very limited in what we can do beyond selecting and joining for data records
This really limits your options.
Master/Slave replication requires the REPLICATION SLAVE privilege, and you would probably need a user with the SUPER privilege to create a replication user.
Trigger-based replication solutions like SymmetricDS will require a user with CREATE ROUTINE in order to create the triggers.
An "Extract, Transform, Load" solution like Clover ETL will work best if the tables have LAST_CHANGED timestamps. If they don't, then you would need the ALTER TABLE privilege to add them.
Different tools for different goals.
Master/Slave replication is generally used for Disaster Recovery, Availability or Read Scaling.
Heterogeneous replication is used to replicate some (or all) tables between different environments (which could be different RDBMSs, or different replica sets) in a continuous, but asynchronous, fashion.
ETL for bulk, hourly/daily/periodic data movements, with the ability to pick a subset of columns, aggregate, convert timestamp formats, merge with multiple sources, and generally fix whatever you need to with the data.
That should help you determine really what your situation is - whether it's a one time load with a temporary data sync, or if it's an on-going replication (real-time, or delayed).
Edit:
https://www.percona.com/doc/percona-toolkit/LATEST/index.html
Check out the Percona Toolkit, specifically pt-table-sync and pt-table-checksum. They will help with this.
Related
I just took over a pretty terrible database design, which heavily uses comma-separated values to store data. I know, I know, it is hell.
The DB is MySQL, and I am currently accessing it using MySQL Workbench.
I already have an idea of what to remove and which new relation tables are needed.
So, my question is: how should I go about migrating the comma-separated data to the new tables? Are there any tools that specialize in normalizing databases?
Edit:
The server code is in PHP.
Define your new tables and attributes first.
Then, use PHP or Python or your favorite language with MySQL calls and write a one-time converter which loops over the old table(s), reads the records, and inserts the proper records into the new tables.
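A minimal sketch of such a one-time converter is shown here, assuming a hypothetical old table old_items with a comma-separated tags column and a new relation table item_tags. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; against MySQL the same loop would go through a MySQL driver instead.

```python
import sqlite3


def normalize_tags(conn):
    """Split each comma-separated old_items.tags value into item_tags rows."""
    rows = conn.execute("SELECT id, tags FROM old_items").fetchall()
    with conn:  # one transaction: commit on success, roll back on any error
        for item_id, tags in rows:
            seen = set()
            for raw in (tags or "").split(","):
                tag = raw.strip()
                if tag and tag not in seen:  # skip blanks and duplicates
                    seen.add(tag)
                    conn.execute(
                        "INSERT INTO item_tags (item_id, tag) VALUES (?, ?)",
                        (item_id, tag),
                    )


# Hypothetical schema and sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_items (id INTEGER PRIMARY KEY, tags TEXT)")
conn.execute(
    "CREATE TABLE item_tags (item_id INTEGER NOT NULL, tag TEXT NOT NULL, "
    "PRIMARY KEY (item_id, tag))"
)
conn.executemany(
    "INSERT INTO old_items (id, tags) VALUES (?, ?)",
    [(1, "red, green"), (2, "blue,,blue"), (3, None)],
)
conn.commit()

normalize_tags(conn)
print(conn.execute("SELECT item_id, tag FROM item_tags ORDER BY item_id, tag").fetchall())
```

Running the whole conversion inside one transaction means a bad record leaves the new table empty rather than half filled, which makes the script safe to correct and re-run.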
It appears you are looking for standard practices. There are varying degrees of denormalized databases out there. The ones I have come across have been normalized with custom code and tools.
SQL Server Integration Services (SSIS) can be used in some cases. In your case, I'd build a script for the migration that involves:
creation of normalized tables
creating a stored procedure or PHP script(s) to read data from the denormalized table, transform it and load it into the normalized tables
creating a log table or log file
performing the migration in sandbox; write logs while doing so
version control the script
correct the proc/script as needed
create another sandbox
run the full script on sandbox
if successful, run the full script on prod (with logging)
SSIS is used for ETL in many organizations; it is a standard tool in the Microsoft BI stack and can also be used to migrate data between non-Microsoft DBs.
An open source ETL tool called Talend might also help in transforming your data. I personally believe that a PHP script will be the fastest and easiest way to manipulate the data.
I am creating an application that utilizes MySQL and PHP. My current web hosting provider has a MySQL database size limitation of 1 GB, but I am allowed to create many 1 GB databases. Even if I were able to find another web hosting provider that allowed larger databases, I wonder how data integrity and speed are affected by larger databases. Is it better to keep databases small in terms of disk size? In other words, what is the best practice method of storing the same data (all text) from thousands of users? I am new to database design and planning. Eventually, I would imagine that a single database with data from thousands of users would grow to be inefficient, and that optimally the data should be distributed among smaller databases. Do I have this correct?
On a related note, how would my application know when to create another table (or switch to another table that was manually created)? For example, if I had 1 database that filled up with 1 GB of data, I would want my application to continue working without any service delays. How would I control the input of data from 1 table to a second, newly created database?
Similarly, suppose a user joins the website in 2011 and creates 100 records of information, thousands of other users do the same, and the 1 GB database fills up. Later on, that original user adds an additional 100 records, which are created in another 1 GB database. How would my PHP code know which database to query for the 2 sets of 100 records? Would this be managed automatically in some way on the MySQL end? Would it need to be managed in the PHP code with IF/THEN/ELSE statements? Is this a service that some web hosting providers offer?
This is a very abstract question and I'm not sure the generic Stack Overflow is the right place to ask it.
In any case. What is the best practice method of storing? How about: in a file on disk. Keep in mind that a database is just a glorified file that has fancy 'read' and 'write' commands.
Optimization is hard, you can only ever trade things. CPU for memory usage, read speed for write speed, bulk data storage or speed. (Or get a better host provider and make your databases as large as you want ;) )
To answer your second question, if you do go with your database approach you will need to set up some system to 'migrate' users from a database to another if one gets full. If you reach 80% of 1GB, start migrating users.
Detecting the size of a database is a tricky problem. You could, I suppose look at the RAW files on disk to see how big they are, but perhaps there are more clever ways.
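One such way, sketched here under assumptions, is to ask MySQL itself: information_schema.tables exposes data_length and index_length per table, and summing them per schema gives an estimate of the size. The figures are approximate for InnoDB, and a host may measure its 1 GB limit differently. The function names are hypothetical, the cursor is assumed to come from a MySQL DB-API driver that uses %s placeholders, and the 80% threshold is the one suggested above.

```python
SIZE_QUERY = (
    "SELECT COALESCE(SUM(data_length + index_length), 0) "
    "FROM information_schema.tables WHERE table_schema = %s"
)


def schema_size_bytes(cursor, schema):
    """Return the approximate on-disk size of one schema, in bytes.

    `cursor` is assumed to be a DB-API cursor from a MySQL driver.
    """
    cursor.execute(SIZE_QUERY, (schema,))
    return int(cursor.fetchone()[0])


def needs_migration(size_bytes, limit_bytes=1024 ** 3, threshold=0.8):
    """True once the schema has reached the given fraction of its size limit."""
    return size_bytes >= threshold * limit_bytes


# 900 MB is past 80% of 1 GiB, 800 MB is not.
print(needs_migration(900_000_000), needs_migration(800_000_000))
```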
I would suggest that SQLite will be the best option in your case. It supports databases of up to 2 terabytes (2^41 bytes), and the best part is that it requires no server-side installation, so it is compatible everywhere. All you need is a library to work with the SQLite database.
You can also choose your host without looking at which databases and sizes it supports.
I'm writing a program that runs (24/7) on a Linux server and adds entries to a MySQL database.
The contents of the database are presented on a web interface with PHP and the user should be able to delete entries using the web interface.
Is it possible to access the database from multiple processes at the same time?
Yes, databases are designed for this purpose quite well. You'll want to keep a few things in mind in your designs:
Concurrency and race conditions on database writes.
Performance.
Separate database permissions for separate applications.
Unless you're doing something like accessing the DB using a singleton, the maximum number of simultaneous MySQL connections PHP will use is limited in your php.ini. I believe it defaults to 100.
Yes multiple users can access the database at the same time.
You should however take care that the data is consistent.
If you create or edit an entry with many small SQL statements and in the meantime someone uses the web interface, this may lead to some errors.
If you have a simple db this should not be a problem, else you should consider using transactions.
http://dev.mysql.com/doc/refman/5.0/en/ansi-diff-transactions.html
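To illustrate what a transaction buys you, here is a small sketch in which an entry and its detail rows are written together or not at all. The entries and entry_details tables are hypothetical, and Python's built-in sqlite3 module stands in for MySQL so the example runs without a server; with MySQL the tables would need a transactional engine such as InnoDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE entry_details (entry_id INTEGER NOT NULL, detail TEXT NOT NULL)"
)


def add_entry(conn, name, details):
    """Insert an entry and all of its detail rows, or nothing at all."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        cur = conn.execute("INSERT INTO entries (name) VALUES (?)", (name,))
        entry_id = cur.lastrowid
        for detail in details:
            conn.execute(
                "INSERT INTO entry_details (entry_id, detail) VALUES (?, ?)",
                (entry_id, detail),
            )


add_entry(conn, "first", ["a", "b"])
try:
    add_entry(conn, "second", ["ok", None])  # None violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the half-written "second" entry has been rolled back

print(conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0])
print(conn.execute("SELECT COUNT(*) FROM entry_details").fetchone()[0])
```

Another process reading the tables never sees the "second" entry without its details, which is the kind of inconsistency the answer above warns about.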
Yes, and there will not be any problems deleting records while that automated program runs 24/7, provided you are using the InnoDB engine. This is because InnoDB isolates concurrent transactions from one another with row-level locking, so conflicting changes to the same rows are applied one after another and the database stays consistent.
The answer to "How to implement the ACID model for a database" has many relevant points.
Read about the ACID properties of a database. A MySQL database with the InnoDB engine will take care of all these things for you, so you need not worry about them.
I am developing a project at work for which I need to create and maintain Summary Tables for performance reasons. I believe the correct term for this is Materialized Views.
I have 2 main reasons to do this:
Denormalization
I normalized the tables as much as possible. So there are situations where I would have to join many tables to pull data. We work with MySQL Cluster, which has pretty poor performance when it comes to JOIN's.
So I need to create Denormalized Tables that can run faster SELECT's.
Summarize Data
For example, I have a Transactions table with a few million records. The transactions come from different websites. The application needs to generate a report that will display the daily or monthly transaction counts and total revenue amounts per website. I don't want the report script to calculate this every time, so I need to generate a Summary Table that will have a breakdown by [site,date].
That is just one simple example. There are many different kinds of summary tables I need to generate and maintain.
In the past I have done these things by writing several cron scripts to keep each summary table updated. But in this new project, I am hoping to implement a more elegant and proper solution.
I would prefer a PHP based solution, as I am not a server administrator, and I feel the most comfortable when I can control everything through my application code.
Solutions that I have considered:
Copying VIEW's
If the resulting table can be represented as a single SELECT query, I can generate a VIEW. Since they are slow, there can be a cronjob that copies this VIEW into a real table.
However, some of these SELECT queries can be so slow that it's not acceptable even for cronjobs. It is not very efficient to recreate the whole summary data, if older rows are not even being updated much.
Custom Cronjobs for each Summary Table
This is the solution I have used before, but now I am trying to avoid it if possible. If there will be many summary tables, it can be messy to maintain.
MySQL Triggers
It is possible to add triggers to the main tables so that every time there is an INSERT, UPDATE or DELETE, the summary tables get updated accordingly.
There would be no cronjobs and the summaries would be in real time. However, if there is ever a need to rebuild a summary table from scratch, it would have to be done with another solution (probably copying a VIEW, the first solution above).
Using ORM Hooks/Triggers
I am using Doctrine as my ORM. There is a way to add event listeners that will trigger stuff on INSERT/UPDATE/DELETE, which in turn can update the summary tables. In a sense this solution is similar to the MySQL triggers solution above, but I will have better control over these triggers since they will be implemented in PHP.
Implementation Considerations:
Complete Rebuilds
I want to avoid having to rebuild the summary tables, for efficiency, and only update for new data. But in case something goes wrong, I need the capability to rebuild the summary table from scratch using existing data on the main tables.
Ignoring UPDATE/DELETE on Old Data
Some summaries can assume that older records will never be updated or deleted, but only new records will be inserted. The summary process can save a lot of work by making the assumption that it doesn't need to check for updates on older data.
But of course this won't apply to all tables.
Keeping a Log
Let's assume that I won't have access to, or do not want to use the binary MySQL logs.
For summarizing new data, the summary process just needs to remember the last primary key id's for the last records it summarized. Next time it runs, it can summarize everything after that id. However, to keep track of older records that have been updated/deleted, it needs another log so it can go back and re-summarize that data.
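The "remember the last primary key" part of this idea can be sketched as follows. Everything here is illustrative: the table and column names are made up, amounts are stored as integer cents, and Python's built-in sqlite3 module stands in for MySQL. The sketch covers only newly inserted rows and does not attempt the second log for updated or deleted records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (id INTEGER PRIMARY KEY, site TEXT, day TEXT, amount INTEGER);
CREATE TABLE daily_summary (site TEXT, day TEXT, tx_count INTEGER, revenue INTEGER,
                            PRIMARY KEY (site, day));
CREATE TABLE summary_state (name TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
INSERT INTO summary_state VALUES ('daily_summary', 0);
""")


def refresh_daily_summary(conn):
    """Fold transactions newer than the stored last_id into daily_summary."""
    with conn:  # summary rows and last_id are committed together
        (last_id,) = conn.execute(
            "SELECT last_id FROM summary_state WHERE name = 'daily_summary'"
        ).fetchone()
        (max_id,) = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM transactions"
        ).fetchone()
        rows = conn.execute(
            "SELECT site, day, COUNT(*), SUM(amount) FROM transactions "
            "WHERE id > ? AND id <= ? GROUP BY site, day",
            (last_id, max_id),
        ).fetchall()
        for site, day, count, total in rows:
            cur = conn.execute(
                "UPDATE daily_summary SET tx_count = tx_count + ?, revenue = revenue + ? "
                "WHERE site = ? AND day = ?",
                (count, total, site, day),
            )
            if cur.rowcount == 0:  # no summary row for this (site, day) yet
                conn.execute(
                    "INSERT INTO daily_summary VALUES (?, ?, ?, ?)",
                    (site, day, count, total),
                )
        conn.execute(
            "UPDATE summary_state SET last_id = ? WHERE name = 'daily_summary'",
            (max_id,),
        )


conn.executemany(
    "INSERT INTO transactions (site, day, amount) VALUES (?, ?, ?)",
    [("a.com", "2011-01-01", 500), ("a.com", "2011-01-01", 250), ("b.com", "2011-01-01", 100)],
)
conn.commit()
refresh_daily_summary(conn)

conn.execute("INSERT INTO transactions (site, day, amount) VALUES ('a.com', '2011-01-01', 50)")
conn.commit()
refresh_daily_summary(conn)  # only the one new row is summarized this time

print(conn.execute("SELECT * FROM daily_summary ORDER BY site").fetchall())
```

Updating the summary and the stored last_id in the same transaction means a failed run neither loses nor double counts rows when it is retried.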
I would appreciate any kind of strategies, suggestions or links that can help. Thank you!
As noted above materialized views in Oracle are different than indexed views in SQL Server. They are very cool and useful. See http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repmview.htm for details
MySQL does not have support for these, however.
One thing you mention several times is poor performance. Have you checked your database design for proper indexing and run explain plans on the queries to see why they are slow? See here: http://dev.mysql.com/doc/refman/5.1/en/using-explain.html. This is of course assuming that your server is tuned properly and that you have MySQL set up and tuned, e.g. buffer caches, etc.
To your direct question: what it sounds like you want to do is something we do often in a data warehouse situation. We have a production database and a DW that pulls in all sorts of information, then aggregates and pre-calculates it to speed up querying. This may be overkill for you, but you can decide. Depending on the latency you define for your reports, i.e. how often you need them, we normally go through an ETL (extract, transform, load) process periodically (daily, weekly, etc.) to populate the DW from the production system. This keeps the impact on the production system low and moves all reporting to another set of servers, which also lessens the load. On the DW side, I would normally design my schemas differently, i.e. using star schemas (http://www.orafaq.com/node/2286). Star schemas have fact tables (things you want to measure) and dimensions (things you want to aggregate the measures by: time, geography, product categories, etc.). SQL Server also includes an additional engine called SQL Server Analysis Services (SSAS) to look at fact tables and dimensions, pre-calculate, and build OLAP data cubes. In these data cubes you can drill down and look at all types of patterns, and do data analysis and data mining. Oracle does things slightly differently, but the outcome is the same.
Whether you want to go the above route really depends on the business need and how much value you get from data analysis. As I said, it is likely overkill if you just have a few summary tables, but you may find some of the concepts helpful as you think things through. If your business is going toward a business intelligence solution, then this is something to consider.
PS You can actually set a DW up to work in "real-time" using something called ROLAP if that is the business need. MicroStrategy has a good product that works well for this.
PPS You may also want to look at PowerPivot from MS (http://www.powerpivot.com/learn.aspx). I have only played with it, so I cannot tell you how it works on very large datasets.
Flexviews (http://flexvie.ws) is an open source PHP/MySQL based project. Flexviews adds incrementally refreshable materialized views (like the materialized views in Oracle) to MySQL, using PHP and stored procedures.
It includes FlexCDC, a PHP based change data capture utility which reads binary logs, and the Flexviews MySQL stored procedures which are used to define and maintain the views.
Flexviews supports joins (inner join only) and aggregation so it can be used to create summary tables. Moreover, you can use Flexviews in combination with Mondrian's (a ROLAP server) aggregation designer to create summary tables that the ROLAP tool can automatically use.
If you don't have access to the logs (it can read them remotely, btw, so you don't need server access, but you do need SUPER privs) then you can use 'COMPLETE' refresh with Flexviews. This automates creating a new table with 'CREATE TABLE ... AS SELECT' under a new table name. It then uses RENAME TABLE to swap the new table for the old one, renaming the old one with an _old postfix. Finally, it drops the old table. The advantage here is that the SQL to create the view is stored in the database (flexviews.mview) and can be refreshed with a simple API call which automates the swapping process.
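The build-then-swap pattern behind a complete refresh can be sketched in a few lines. This is not Flexviews code or its API, only an illustration of the same idea with hypothetical table names, using Python's built-in sqlite3 module as a stand-in. SQLite renames tables one at a time with ALTER TABLE, whereas MySQL can do both renames atomically in a single RENAME TABLE statement.

```python
import sqlite3


def complete_refresh(conn, name, select_sql):
    """Rebuild table `name` from select_sql, then swap it in for the old copy.

    `name` and `select_sql` are interpolated into SQL, so they must be trusted.
    """
    new, old = f"{name}_new", f"{name}_old"
    conn.execute(f"DROP TABLE IF EXISTS {new}")  # clear leftovers of a failed run
    conn.execute(f"DROP TABLE IF EXISTS {old}")
    conn.execute(f"CREATE TABLE {new} AS {select_sql}")
    existing = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchall()
    if existing:
        conn.execute(f"ALTER TABLE {name} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {new} RENAME TO {name}")
    conn.execute(f"DROP TABLE IF EXISTS {old}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (site TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)", [("a.com", 500), ("b.com", 100)]
)
conn.commit()

SUMMARY_SQL = "SELECT site, SUM(amount) AS revenue FROM transactions GROUP BY site"
complete_refresh(conn, "site_totals", SUMMARY_SQL)

conn.execute("INSERT INTO transactions VALUES ('a.com', 250)")
conn.commit()
complete_refresh(conn, "site_totals", SUMMARY_SQL)  # full rebuild picks up the new row

print(conn.execute("SELECT site, revenue FROM site_totals ORDER BY site").fetchall())
```

Readers keep querying the old table while the new one is being built, so the slow SELECT never blocks reports; only the short rename step touches the live name.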
Many database libraries come set up for multiple database connections - but I've never actually known of a scripting application that needed to connect to two databases during its run. (Compiled, daemon-running languages are a different matter.)
I understand having database slaves so that you can spread the load out - but usually only one of them is chosen on startup to handle that script's needs.
So why would a PHP or Ruby application need to connect to more than one database? Or rather, why would you split your data up among several databases?
The only thing I can think of is bad design from a slowly evolving system that started off in multiple separate parts.
Are you talking about different physical database servers or different databases in the "schema" sense?
Regarding physical servers, if you're using MySQL replication you might write to a master and always read from a slave. This helps split the load among the databases.
The simple answer is "scalability".
The ready availability of replication and clustering in a number of database products makes multiple database use a definite 'this must be possible'. Any decent ORM should know how to connect to multiple databases as required.
But even when the main application doesn't connect to more than one, there will often be other needs that do. Report generation, either scripted or ad-hoc, often involves queries that run for a long time. These are best run on database replicants dedicated to (and configured for) these queries so they don't disrupt the main application.
Another good use is a type of scripted processing. Many apps will have a regular process that needs to rummage through a large part of the database. Whilst updates obviously have to go to the master, the big read queries can be run off a replicant.
Of course, the obvious need is simple performance. I oversaw a webapp and database that grew from surviving comfortably on one MySQL database on a 32-bit dual-core machine with 3 GB to needing two 8-core 64-bit servers with 8 GB. Once it reached this stage, it relied on the database handler directing traffic to both servers. We had a window of about 50 minutes in a day where it could survive on just one database.
I have a Ruby application that connects to multiple databases. One database contains user login credentials (which is shared between several other projects). Another database contains archived data that my application tracks and compares (that only my application accesses). Another database contains data regarding physical machine resources which my application uses to generate new data (these resources are used by several different applications). By splitting the data into multiple databases, different applications only access the data that they need to be accessing.
It is all too frequently the case that some of the data you need is stored in The Wrong Database. Sometimes it's personnel records in a PeopleSoft (Oracle) database. Maybe it's Enterprise CRM data on Informix. Or some departmental database stored in MS SQL Server. Whatever it is, it's in a different database, but you still need access (hopefully read-only).
Unless your primary database is magic-based, it isn't going to be able to provide you with remote table access for every other database out there. (Most will only provide remote access to other databases of the same type, eg: MySQL->MySQL.) When that all too frequent situation occurs, you'll have no other option but to have multiple database connections, and be glad that your framework supports it.
I have a site that connects with two databases. One powers the website content (CMS DB); the other drives a web application that runs within the site (large amounts of non-CMS data). In fact, the latter uses replication.
I don't feel that's bad design. If one set of data has no relation to the other, then it makes sense even from a pure organization perspective to house it in a separate DB. Otherwise, people would just put all their tables in one DB.
For added security, I always create two accounts for every database: a read-only account (good for SELECT) and a read-write account (for SELECT, UPDATE, INSERT, DELETE and whatever else I might need). On some pages, I may need to use both accounts, thus I will consume two connections for only one database.
Well, reading from one and writing to another is a very common use case. It's easy and fun to write a data access layer that reads from one connection (reading from the slave), and writes to another (the master). A single script might make multiple reads before writing -- perhaps some lookups are necessary for validation, for instance.
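A toy version of such a data access layer might look like the sketch below. The class name and the "starts with SELECT" routing rule are simplifications made up for illustration, and two separate in-memory SQLite databases stand in for the master and the slave, so nothing is actually replicated between them.

```python
import sqlite3


class SplitConnection:
    """Send SELECT statements to the replica and everything else to the primary."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def execute(self, sql, params=()):
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = self.replica if is_read else self.primary
        cursor = target.execute(sql, params)
        if not is_read:
            target.commit()  # make each write durable straight away
        return cursor


# Two independent databases stand in for master and slave (no replication here).
primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for db in (primary, replica):
    db.execute("CREATE TABLE notes (body TEXT)")
replica.execute("INSERT INTO notes VALUES ('from replica')")
replica.commit()

db = SplitConnection(primary, replica)
db.execute("INSERT INTO notes VALUES (?)", ("new note",))  # goes to the primary
print(db.execute("SELECT body FROM notes").fetchall())      # read from the replica
print(primary.execute("SELECT body FROM notes").fetchall())
```

The read does not see the row that was just written, which mirrors a real caveat: replication is asynchronous, so a script that must read its own write may need to send that particular read to the master.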
Scripting languages are also frequently used for integration. You might have two off-the-shelf codebases, both of which want to maintain their own database. Your integration code might want to talk to both of them.
You can usually design your way out of using more than one connection, but in general I don't see anything fundamentally wrong with using connections to more than one database.
There are other reasons to have multiple databases. We have one application that everyone can access. We also have client databases that are very different from client to client. It is easier to maintain the application that all clients use (and which is maintained by a different team) if the client-specific data is separated out into its own databases. It is also easier to move a client to a new server when they become a large enterprise client, rather than one of the smaller clients who run on a server with many other clients.
Further, there are types of data that are transactional and need to be in databases that are set to full recovery mode with full transaction logging. Other data is only populated from imports and does not need transactional logging, which might slow down the system as the log grew enough to handle the 10,000,000 record import. These are often split out to a separate database so they can be in simple recovery mode, as it is not necessary to recover data from the transaction log if there is a problem; it can be easily recovered by re-running the import.
Then data is split out into data warehouses, which are optimized for data reporting, not transactions. Again, these reporting databases are usually separate databases (often on separate servers).
Then you have the databases for multiple different COTS applications (we have accounting databases, credit card transaction processing databases, HR databases, our project management database). A particular website might need to access more than one of these or transfer information from one to the other. Believe me, vendors won't let you copy their database structure into one database to rule them all.
We have several hundred databases here on many different servers.