PHP, MySQL: question on how to "sync" two databases - php

Note: my question is in the last paragraph.
I have multiple sources of files that get inserted into a database (call it process/database A). These files contain the same type of information but in different formats (different column headers, column orders, numbers of columns, etc.), but process A puts them into a unified table, nice and neat. I need this data from multiple sources inserted into another database as well (process/database B), but I'm not sure of the best way to do it. DB B is part of a piece of software we use; it is not open source, but a DB connection can be made.
We've had process A up and running for a while. Process B is something new, meant to improve the physical workflow at the warehouse. Since the data is already unified in process A, it seems to me that I should pull this unified data and insert it into B. That would save me the repetitive work of remapping everything for process B.
My question is, if I want to "sync" these two databases, what would be the optimal approach? It's not exactly "syncing," I suppose, because the two tables (I only need to reference one table on each DB) have different columns. I see these approaches:
1) Compare the entire DBs and pull any new data from DB A to insert into DB B. However, DB B has over 50K rows, while DB A is much smaller and growing slowly.
2) Have the user input a date from which to look for new data rows to insert from A to B.
3) Check the latest date (data rows are dated) in DB B, and insert accordingly.
Do you guys have any input? I'm not too familiar with MySQL's processing speed, so I'm not sure whether approach 1) is a good option. I'm also not sure what conventions (if any) exist for these kinds of tasks; I imagine it isn't too uncommon a thing to do. But 1) seems to be the more complete way of doing things. Any comments or alternative options are appreciated. I'd like to keep things in PHP, as this will be a feature of a web application. TIA!
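For reference, here's roughly what I have in mind for approach 3), as a minimal sketch; it assumes PDO connections $dbA and $dbB and made-up table/column names, with B's latest dated row marking the cutoff:

// find the newest row already in B
$latest = $dbB->query('SELECT MAX(created_at) FROM b_items')->fetchColumn();

// pull only newer rows from A's unified table
$select = $dbA->prepare('SELECT sku, qty, created_at FROM a_unified WHERE created_at > :latest');
$select->execute(array('latest' => $latest ? $latest : '1970-01-01 00:00:00'));

$insert = $dbB->prepare('INSERT INTO b_items (sku, qty, created_at) VALUES (:sku, :qty, :created_at)');
foreach ($select as $row) {
    // remap A's unified columns onto B's schema here
    $insert->execute(array(
        'sku'        => $row['sku'],
        'qty'        => $row['qty'],
        'created_at' => $row['created_at'],
    ));
}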

Use MySQL clustering.
See: http://en.wikipedia.org/wiki/MySQL_Cluster

Related

PHP / MySQL - Compare tables from 2 different databases

I've got 2 frameworks (Laravel for the web, CodeIgniter for the API) and 2 different databases. I've built a function (on the API) which detects changes in one database (in 2 tables) and applies the changes to the other database.
Note: there is no way to run both the web and the API on the same database, which is why I'm doing this.
Anyway, it's important that every little change is recognized. If it's a new record or a deleted record, it's simple and no problem at all. But if the record exists in both databases, I need to compare their values to detect changes, and this part becomes challenging.
I know how to do this in the slowest, heaviest way (fetch each record and compare).
My question is: how would you suggest making this work in a smart, fast way?
Thanks a lot.
As long as the MySQL user has SELECT rights on both databases, you can qualify the database name in the query like so:
SELECT * FROM `db1`.`table1`;
SELECT * FROM `db2`.`table1`;
It doesn't matter which database was selected when you connected from PHP; the correct database will be used in the query.
The backticks are optional when the database/table name is purely alphanumeric and not an SQL keyword.
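From there, one cross-database query can spot new and changed rows in a single pass. A sketch, assuming both tables share an `id` key and have a comparable `updated_at` column (both names are assumptions):

SELECT a.id
FROM `db1`.`table1` a
LEFT JOIN `db2`.`table1` b USING (id)
WHERE b.id IS NULL                    -- missing from db2: a new record
   OR a.updated_at <> b.updated_at;   -- exists in both but changed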
Depending on the response time of the 'slave' database, there are two options which don't increase the overhead too much:
If you can combine both databases into a single database by prefixing one or both sets of tables, you can use FOREIGN KEYS to let the database do the tough work for you.
Use the TIMESTAMP-field which you can set to update itself by the DB whenever the row gets updated.
Option 1 would be my best guess, but it might mean a physical change to the running system, and if FOREIGN KEYS are new to you, you might want to test first, since they can be a real PITA (IMHO).
Option 2 is easier to implement, but you still have to detect deleted rows manually.
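For option 2, MySQL can maintain that timestamp by itself; a sketch (`updated_at` is an illustrative column name):

ALTER TABLE `table1`
  ADD COLUMN `updated_at` TIMESTAMP NOT NULL
  DEFAULT CURRENT_TIMESTAMP
  ON UPDATE CURRENT_TIMESTAMP;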

Dealing with millions of data records in MySQL and PHP/Laravel

I have an activity-records table named revisions (shown in the following image), built for a big learning management system; it mainly keeps a record of CRUD operations on tables (e.g. who did what to which object, and when).
This table may contain up to 3M records. I want to build search functionality for it on the front end with PHP/Laravel.
Now my question is: what should I consider when building high-performance search for tables with millions of records? What can be done at the code level and the database level, and is there 3rd-party stuff to support these kinds of issues?
I am experienced with building systems in PHP/Laravel, Python/Django, Ruby, etc., but I have never dealt with a case like this, with millions of records. So please keep my knowledge/experience level in mind: I have NO experience at this scale.
Note: the search will be an advanced search, letting users search by different criteria and parameters: the object that was changed, who changed it, when it was changed, etc.
Let me know if my question still isn't clear.
I would recommend taking a look at Elasticsearch (https://www.elastic.co/products/elasticsearch) and saving your activity records to its storage whenever you save to the main database. Then you can easily search any field. Elasticsearch can store schema-free JSON documents; if you prefer a more SQL-like way, there is another search engine, Sphinx (http://sphinxsearch.com/).
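A minimal indexing sketch with the official elasticsearch-php client; the index name and field names here are assumptions, not your actual schema:

require 'vendor/autoload.php';

$client = Elasticsearch\ClientBuilder::create()->build();

// mirror one activity record into Elasticsearch alongside the normal DB write
$client->index(array(
    'index' => 'revisions',
    'type'  => 'revision',   // mapping type, required by older Elasticsearch versions
    'id'    => 12345,        // the row's primary key, so re-indexing overwrites
    'body'  => array(
        'user_id'    => 42,
        'action'     => 'update',
        'object'     => 'course',
        'created_at' => date('c'),
    ),
));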
There is no problem inserting a zillion rows into a table. Performance problems come when you try to do non-trivial SELECTs on the table. You mentioned "search"; you will have to limit what the 'users' can search for. But at least make a stab at what they might want to search for.
You mentioned "searching for an object", but I don't see a column called object. How many rows might there be for a given object? Do you need all the rows? Or selected ones? (An INDEX on object is likely to make the query efficient, regardless of table size.)
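For example, something like the following, with the column names being guesses since the table structure isn't shown:

ALTER TABLE revisions
  ADD INDEX idx_object (object_type, object_id, created_at);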
Third-party software sometimes gets in the way of dealing with really large tables. Beware.

JS, PHP, and MySQL to get large data

I am using Ajax to send a query to a PHP server, which then runs the SQL query to get data. Because the query involves three tables (two of them large), JOINing the three tables is very slow.
So I split the SQL query into three queries. That improves efficiency for small datasets, but for large datasets, because the PHP program runs the three queries one by one and processes the results after each, it hits the 30-second timeout (the default). I don't want to remove this default setting.
To avoid the timeout, I am also considering running the three queries, returning the results to JS, and letting the client side do the processing.
Is there another way to do this?
Edit:
Basically, I want three outputs, title, extviews, and allviews, for each item WHERE extviews > somevalue. title comes from one small table; extviews and allviews are aggregated from two different large tables. I have all the fields indexed, but joining the two big tables still takes a long time.
So I first aggregate one table to get extviews for each item, along with a list of item ids. The results are organized as an array for JSON output to JS. Then, using the list of ids, I get the title for each item and aggregate the other table to get allviews. Then I update the array with the new results.
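Roughly, the single query I started from looks like this (a sketch with made-up table and column names):

SELECT t.title, e.extviews, a.allviews
FROM (SELECT item_id, COUNT(*) AS extviews
      FROM ext_views_log
      GROUP BY item_id
      HAVING extviews > 100) e
JOIN items t ON t.id = e.item_id
LEFT JOIN (SELECT item_id, COUNT(*) AS allviews
           FROM all_views_log
           GROUP BY item_id) a ON a.item_id = e.item_id;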
Unless your MySQL server is really overloaded, it's usually quicker to use joins. I guess you've already defined indexes on your tables (for the fields used in join conditions and WHERE clauses)?
Doing the processing on the client side might also be a problem, since you'll have to send a lot of data in order to do the join...
Edit:
If all "easy" optimisation is done, then you have 2 choices... The one you just described (doing it on client size, if it's possible - what is the size (in bytes) of the json arrays you send to the client?)
Your other choice is to do the processing in the background (via cron) & cache somehow the results.
As already indicated by other people responding to your post, you should give us an idea of the structure of your three tables and the intent of each. Based upon that information, you may be able to get significant performance improvements by optimizing your database structure. To make it easier to understand, let's assume that someone had a website running off an intelligently designed database. I could easily make that application perform ten times worse solely by modifying the structure of the database.
Now, maybe there's some reason why you need to have three distinct tables, but I can't make that judgment without knowing what the fields in the database are, what you're aggregating, and what your web application is doing in the first place. Is it read heavy or write heavy? The solution may be as simple as denormalizing your database so that you don't need to use any joins.
I can say from a cursory glance at your description of what you're doing that this application can't possibly scale efficiently and that you really need to reconsider your design. The first warning sign for me is the fact that you stated that one of the joins is just to link the title to two other tables. To me, being forced to do a join just to get the title of an object seems indicative of over-normalization. Some data redundancy is not necessarily a bad thing, and in some situations it's absolutely mandatory. Also, you say that you have two large tables that you use aggregate functions on and then join everything together. I can tell you right now that you're going to run into some serious performance issues if every hit to your application involves a triple join and two aggregate functions (COUNT, I'm assuming).
Ultimately, we'll be able to give you a better response once you provide more information as to what you're trying to accomplish, and the general structure of the database you set up for it.

How should I version my data in an MS SQL shared server environment?

The server is a shared Windows hosting server with Hostgator. We are allowed "unlimited" MS SQL databases and each is allowed "unlimited" space. I'm writing the website in PHP. The data (not the DB schema, but the data) needs to be versioned such that (ideally) my client can select the DB version he wants from a select box when he logs in to the website, and then (roughly once a month) tag the current data, also through a simple form on the website. I've thought of several theoretical ways to do this and I'm not excited about any of them.
1) Put a VersionNumber column on every table; have a master Version table that lists all versions for the select box at login. When tagged, every row without a version number in every table in the db would be duplicated, and the original would be given a version number.
This seems like the easiest idea for both me and my client, but I'm concerned the db would be awfully slow in just a few months, since every table will grow by (at least) its original size every month. There's not a whole lot of data, and there probably never will be, in any one version. But multiplying versions in the same table just scares me.
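A sketch of what one tag would run per table under this idea (table and column names are illustrative):

BEGIN TRANSACTION;

-- stamp the current working set with the new version number
UPDATE Employee SET VersionNumber = 27 WHERE VersionNumber IS NULL;

-- re-create the working set as unversioned copies of what was just tagged
INSERT INTO Employee (Name, AddressId, VersionNumber)
SELECT Name, AddressId, NULL
FROM Employee
WHERE VersionNumber = 27;

COMMIT;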
2) Duplicate the DB every time we tag.
It looks like this would have to be done manually by my client since the server is shared, so I already dislike the idea. But in addition, the old DBs would have to be able to work with the current website code, and as changes are made to the DB structure over time (which is inevitable) the old DBs will no longer work with the new website code.
3) Create duplicate tables (with the version in their name) inside the same database every time we tag. Like [v27_Employee].
The benefit here over idea (1) would be that no table would grow humongous in size, keeping queries fast, and over idea (2) that it could theoretically be done easily through the simple website tag form rather than manually by my client. The problems are that the queries in my PHP code are going to get all discombobulated as I try to sort out which Employee table joins with which Address table depending upon which version is selected, since they all have the same base name but different prefixes; and also that as the code changes, the old DB tables will no longer match it, the same problem as (2).
So, finally, does anyone have any good recommendations? Best practices? Things they did that worked in the past?
Thanks guys.
Option 1 is the most obvious solution because it has the lowest maintenance overhead and is the easiest to work with: you can view any version at any time simply by filtering on VersionNumber in your queries. If you want or need to, this means you could also implement option 3 at the same time by creating views for each version number instead of real tables. If your application only queries one version at a time, consider making VersionNumber the first column of a clustered primary key, so that all the data for one version is physically stored together.
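A sketch of that key layout (illustrative names):

CREATE TABLE Employee (
    VersionNumber int NOT NULL,
    EmployeeId    int NOT NULL,
    Name          nvarchar(100) NOT NULL,
    CONSTRAINT PK_Employee
        PRIMARY KEY CLUSTERED (VersionNumber, EmployeeId)
);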
And it isn't clear how much data you have anyway. You say it's "not a whole lot", but that means nothing. If you really have a lot of data (say, into hundreds of millions of rows) and if you have Enterprise Edition (you didn't say what edition you're using), you can use table partitioning to 'split' very large tables for better performance.
My conclusion would be to do the simplest, easiest thing to maintain right now. If it works fine then you're done. If it doesn't, you will at least be able to rework your design from a simple, stable starting point. If you do something more complicated now, you will have much more work to do if you ever need to redesign it.
You could copy your versionable tables into a new database every month. If you then need to join a versionable table with a non-versionable table, you'd need a cross-database join, which is supported in SQL Server. This approach is a bit cleaner than duplicating tables in a single schema, since your database explorer would otherwise get unwieldy with all the old tables.
What I finally wound up doing was creating a new schema for each version and duplicating the tables and triggers and keys each time the DB is versioned. So, for example, I had this table:
[dbo].[TableWithData]
And I duplicated it into this table in the same DB:
[v1].[TableWithData]
Then, when the user wants to view old tables, they select which version and my code automatically changes every instance of [dbo] in every query to [v1]. It's conceptually fairly simple and the user doesn't have to do anything complicated to version -- just type in "v1" to a form and hit a submit button. My PHP and SQL does the rest.
I did find that some tables had to remain separate -- I made a different schema called [ctrl] into which I put tables that will not be versioned, like the username / password table for example. That way I just duplicate the [dbo] tables.
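The core of each tag boils down to something like this (a sketch; note that SELECT ... INTO copies the data but not keys or triggers, which have to be scripted separately):

CREATE SCHEMA v1;
GO

SELECT * INTO [v1].[TableWithData] FROM [dbo].[TableWithData];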
It's been operational for a year or so and seems to work well at the moment; they've only versioned maybe 4 times so far. The only consistent problem I can't figure out is that triggers seem to get lost somehow. That's probably an issue with my very complex PHP rather than with the DB versioning concept itself, though.

Preferred method for Materialized Views (Summary Tables) with MySQL

I am developing a project at work for which I need to create and maintain Summary Tables for performance reasons. I believe the correct term for this is Materialized Views.
I have 2 main reasons to do this:
Denormalization
I normalized the tables as much as possible, so there are situations where I have to join many tables to pull data. We work with MySQL Cluster, which has pretty poor performance when it comes to JOINs.
So I need to create denormalized tables that allow faster SELECTs.
Summarize Data
For example, I have a Transactions table with a few million records. The transactions come from different websites. The application needs to generate a report that displays the daily or monthly transaction counts and total revenue per website. I don't want the report script to calculate this every time, so I need to generate a Summary Table with a breakdown by [site, date].
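The rollup itself would be something like this (made-up column names):

SELECT site,
       DATE(created_at) AS day,
       COUNT(*)         AS tx_count,
       SUM(amount)      AS revenue
FROM Transactions
GROUP BY site, DATE(created_at);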
That is just one simple example. There are many different kinds of summary tables I need to generate and maintain.
In the past I have done these things by writing several cron scripts to keep each summary table updated. But in this new project, I am hoping to implement a more elegant and proper solution.
I would prefer a PHP based solution, as I am not a server administrator, and I feel the most comfortable when I can control everything through my application code.
Solutions that I have considered:
Copying VIEW's
If the resulting table can be represented as a single SELECT query, I can create a VIEW. Since these VIEWs are slow, a cronjob can copy the VIEW into a real table.
However, some of these SELECT queries can be so slow that they're not acceptable even in a cronjob. It is not very efficient to recreate the whole summary when the older rows aren't even being updated much.
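The copy step a cronjob would run can be as simple as this (a sketch, assuming a VIEW named summary_view and a real table summary with matching columns):

TRUNCATE TABLE summary;
INSERT INTO summary SELECT * FROM summary_view;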
Custom Cronjobs for each Summary Table
This is the solution I have used before, but now I am trying to avoid it if possible. If there will be many summary tables, it can be messy to maintain.
MySQL Triggers
It is possible to add triggers to the main tables so that every time there is an INSERT, UPDATE or DELETE, the summary tables get updated accordingly.
There would be no cronjobs and the summaries would be in real time. However if there is ever a need to rebuild a summary table from scratch, it would have to be done with another solution (probably #1 above).
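A sketch for the Transactions example above, assuming the summary table has a UNIQUE key on (site, day) (all names illustrative):

CREATE TRIGGER transactions_after_insert
AFTER INSERT ON Transactions
FOR EACH ROW
  INSERT INTO summary_site_date (site, day, tx_count, revenue)
  VALUES (NEW.site, DATE(NEW.created_at), 1, NEW.amount)
  ON DUPLICATE KEY UPDATE
    tx_count = tx_count + 1,
    revenue  = revenue + NEW.amount;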
Using ORM Hooks/Triggers
I am using Doctrine as my ORM. There is a way to add event listeners that will trigger stuff on INSERT/UPDATE/DELETE, which in turn can update the summary tables. In a sense this solution is similar to #3 above, but I will have better control over these triggers since they will be implemented in PHP.
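A sketch of such a listener using Doctrine 2's event system (the Transaction entity and all table/column names are made up):

use Doctrine\Common\EventSubscriber;
use Doctrine\ORM\Events;
use Doctrine\ORM\Event\LifecycleEventArgs;

class SummarySubscriber implements EventSubscriber
{
    public function getSubscribedEvents()
    {
        return array(Events::postPersist);
    }

    public function postPersist(LifecycleEventArgs $args)
    {
        $entity = $args->getEntity();
        if (!$entity instanceof Transaction) {
            return;
        }
        // fold the new transaction into its [site, date] summary bucket
        $args->getEntityManager()->getConnection()->executeUpdate(
            'INSERT INTO summary_site_date (site, day, tx_count, revenue)
             VALUES (?, ?, 1, ?)
             ON DUPLICATE KEY UPDATE
               tx_count = tx_count + 1,
               revenue  = revenue + VALUES(revenue)',
            array($entity->getSite(), $entity->getDate(), $entity->getAmount())
        );
    }
}

You'd register it with $em->getEventManager()->addEventSubscriber(new SummarySubscriber()); postUpdate and postRemove listeners would adjust the same bucket.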
Implementation Considerations:
Complete Rebuilds
I want to avoid having to rebuild the summary tables, for efficiency, and only update for new data. But in case something goes wrong, I need the capability to rebuild the summary table from scratch using existing data on the main tables.
Ignoring UPDATE/DELETE on Old Data
Some summaries can assume that older records will never be updated or deleted, but only new records will be inserted. The summary process can save a lot of work by making the assumption that it doesn't need to check for updates on older data.
But of course this won't apply to all tables.
Keeping a Log
Let's assume that I won't have access to, or do not want to use the binary MySQL logs.
For summarizing new data, the summary process just needs to remember the last primary key id of the records it has summarized. Next time it runs, it can summarize everything after that id. However, to keep track of older records that have been updated or deleted, it needs another log so it can go back and re-summarize that data.
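For the new-data case, I picture a high-water mark kept in its own row; a sketch with made-up names, using PDO:

// the last id this summary already covers, and the current ceiling
$lastId = (int) $pdo->query("SELECT last_id FROM summary_state WHERE name = 'transactions'")->fetchColumn();
$maxId  = (int) $pdo->query('SELECT MAX(id) FROM Transactions')->fetchColumn();

// fold everything in between into the summary in one pass
$stmt = $pdo->prepare(
    'INSERT INTO summary_site_date (site, day, tx_count, revenue)
     SELECT site, DATE(created_at), COUNT(*), SUM(amount)
     FROM Transactions
     WHERE id > :last AND id <= :max
     GROUP BY site, DATE(created_at)
     ON DUPLICATE KEY UPDATE
       tx_count = tx_count + VALUES(tx_count),
       revenue  = revenue  + VALUES(revenue)'
);
$stmt->execute(array('last' => $lastId, 'max' => $maxId));

// advance the high-water mark
$pdo->prepare("UPDATE summary_state SET last_id = :max WHERE name = 'transactions'")
    ->execute(array('max' => $maxId));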
I would appreciate any kind of strategies, suggestions or links that can help. Thank you!
As noted above, materialized views in Oracle are different from indexed views in SQL Server. They are very cool and useful. See http://download.oracle.com/docs/cd/B10500_01/server.920/a96567/repmview.htm for details.
MySQL, however, does not have support for them.
One thing you mention several times is poor performance. Have you checked your database design for proper indexing, and run EXPLAIN plans on the queries to see why they are slow? See http://dev.mysql.com/doc/refman/5.1/en/using-explain.html. This of course assumes that your server is tuned properly and that MySQL itself is set up and tuned (buffer caches, etc.).
To your direct question: what you describe is something we often do in a data warehouse situation. We have a production database and a DW that pulls in all sorts of information, then aggregates and pre-calculates it to speed up querying. This may be overkill for you, but you can decide. Depending on the latency you define for your reports (i.e. how often you need them), we normally go through an ETL (extract, transform, load) process periodically (daily, weekly, etc.) to populate the DW from the production system. This keeps the impact on the production system low and moves all reporting to another set of servers, which also lessens the load. On the DW side, I would normally design my schemas differently, i.e. using star schemas (http://www.orafaq.com/node/2286). Star schemas have fact tables (things you want to measure) and dimensions (things you want to aggregate the measures by: time, geography, product categories, etc.). SQL Server also includes an additional engine called SQL Server Analysis Services (SSAS) to look at fact tables and dimensions, pre-calculate, and build OLAP data cubes. In these data cubes you can drill down and look at all types of patterns, and do data analysis and data mining. Oracle does things slightly differently, but the outcome is the same.
Whether you want to go the above route really depends on the business need and how much value you get from data analysis. As I said, it is likely overkill if you just have a few summary tables, but some of the concepts may be helpful as you think things through. If your business is moving toward a business intelligence solution, then this is something to consider.
PS You can actually set a DW up to work in "real-time" using something called ROLAP if that is the business need. Microstrategy has a good product that works well for this.
PPS You may also want to look at PowerPivot from MS (http://www.powerpivot.com/learn.aspx). I have only played with it, so I can't tell you how it works on very large datasets.
Flexviews (http://flexvie.ws) is an open-source PHP/MySQL based project. Flexviews adds incrementally refreshable materialized views (like the materialized views in Oracle) to MySQL, using PHP and stored procedures.
It includes FlexCDC, a PHP based change data capture utility which reads binary logs, and the Flexviews MySQL stored procedures which are used to define and maintain the views.
Flexviews supports joins (inner join only) and aggregation, so it can be used to create summary tables. Moreover, you can use Flexviews in combination with the aggregation designer of Mondrian (a ROLAP server) to create summary tables that the ROLAP tool can automatically use.
If you don't have access to the logs (it can read them remotely, btw, so you don't need server access, but you do need SUPER privs), then you can use 'COMPLETE' refresh with Flexviews. This automates creating a new table with 'CREATE TABLE ... AS SELECT' under a new table name. It then uses RENAME TABLE to swap the new table for the old one, renaming the old one with an _old postfix. Finally it drops the old table. The advantage here is that the SQL to create the view is stored in the database (flexviews.mview), and the view can be refreshed with a simple API call which automates the swapping process.
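The swap it automates looks roughly like this (illustrative names):

CREATE TABLE summary_new AS
  SELECT site, DATE(created_at) AS day, COUNT(*) AS tx_count
  FROM Transactions
  GROUP BY site, DATE(created_at);

RENAME TABLE summary TO summary_old,
             summary_new TO summary;

DROP TABLE summary_old;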
