Quickly copy big MySQL database for testing - php

When running our suite of acceptance tests we would like to execute every single test against a defined database state. If one of the tests writes to the database (creating users, for example), it must of course not affect later tests.
We have thought about several options to achieve that, but copying the whole database before every single test does not seem like the best solution (thinking of possible performance issues).
Another idea was to use MySQL transactions, but some of our tests trigger many HTTP requests, so different PHP processes are spawned and the transaction would be lost too early for a clean rollback after the full test is done.
Are there better ways to guarantee a defined database state for each of our acceptance tests? We would like to keep it simpler than system-level solutions such as aufs or btrfs snapshots.

You could approach this problem using PHPUnit.
It is used for automated testing with PHP. It is not the only library, but it is one of the most widely used.
You can use it for database testing as well ( https://phpunit.de/manual/current/en/database.html ). Basically, it lets you accomplish exactly what you are looking for: import the whole database once, and then in each test suite load what you need and restore the previous state afterwards. For example, you could temporarily save the current state of table A and, after you are done with all the tests of the suite, simply restore it instead of reloading the whole database.
By the way, having a minimal database with only the information required for testing will help a lot as well. In that case you don't have to deal with big performance issues, and you can simply restore it after each test suite.
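A minimal sketch of what such a test case could look like with the DbUnit extension referenced in the manual link above; the connection credentials, the app_test schema and the users.xml fixture are placeholders you would adapt to your setup:

<?php
// Sketch: the fixture is reloaded before every test, so writes performed
// by one test cannot leak into the next one.
class UserDatabaseTest extends PHPUnit_Extensions_Database_TestCase
{
    // Connection used to set up / tear down the fixture.
    public function getConnection()
    {
        $pdo = new PDO('mysql:host=localhost;dbname=app_test', 'test_user', 'secret');
        return $this->createDefaultDBConnection($pdo, 'app_test');
    }

    // The defined database state every test starts from.
    public function getDataSet()
    {
        return $this->createFlatXMLDataSet(__DIR__ . '/fixtures/users.xml');
    }

    public function testCreatingAUserDoesNotAffectOtherTests()
    {
        $this->getConnection()->getConnection()
             ->exec("INSERT INTO users (name) VALUES ('temporary')");
        // ... assertions ...
        // The next test starts again from the users.xml fixture.
    }
}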

Related

Should I keep similar tests for Unit Testing and DBUnit testing in a Data Mapper?

I've been practicing and reading about test driven development. I've been doing well so far since much of it is straightforward, but I have questions as to what to test in certain classes, such as the one below.
public function testPersonEnrollmentDateIsSet()
{
    // For the sake of simplicity I've abstracted the PDO/Connection mocking
    $PDO = $this->getPDOMock();
    $PDOStatement = $this->getPDOStatementMock();

    $PDO->method('prepare')->willReturn($PDOStatement);
    $PDOStatement->method('fetch')->willReturn('2000-01-01');

    $AccountMapper = $this->MapperFactory->build(
        'AccountMapper',
        array($PDO)
    );

    $Person = $this->EntityFactory->build('Person');
    $Account = $this->EntityFactory->build('Account');
    $Person->setAccount($Account);

    $AccountMapper->getAccountEnrollmentDate($Person);

    $this->assertEquals(
        '2000-01-01',
        $Person->getAccountEnrollmentDate()
    );
}
I'm a little unsure if I should even be testing this at all for two reasons.
Unit Testing Logic
In the example above I'm testing if the value is mapped correctly. Just the logic alone, mocking the connection so no database. This is awesome because I can run a test of exclusively business logic without any other developers having to install or configure a database dependency.
DBUnit Testing Result
However, a separate configuration can be run on demand to test the SQL queries themselves, which is another type of test altogether, while also consolidating the SQL tests so they run separately from the unit tests.
The test would be exactly the same as above, except the PDO connection would not be mocked out and would be a real database connection.
Cause For Concern
I'm torn because although it is testing for different things, it's essentially duplicate code.
If I get rid of the unit test, I introduce a required database dependency at all times. As the codebase grows, the tests will become slower over time; there is no out-of-the-box testing, and extra effort is required from other developers to set up a configuration.
If I get rid of the database test, I can't be sure the SQL will return the expected information.
My questions are:
Is this a legitimate reason to keep both tests?
Is it worth it or is it possible this may become a maintenance nightmare?
According to your comments, it seems like you want to test what PDO is doing (are prepare, execute, etc. being called properly and doing the things I expect of them?).
That is not what you want to do here with unit testing. You do not want to test PDO or any other library, and you are not supposed to test the logic or the behavior of PDO. Trust that those things have been covered by the people in charge of PDO.
What you want to test is the correct execution of YOUR code, and therefore the logic of YOUR code. You want to test getAccountExpirationDate and the results you expect from it, not the results you expect from PDO calls.
Therefore, if I understand your code right, you only have to test that the expiration date of the $Person you pass as a parameter has been set to the expected value from $result. If you didn't use PDO in the right way, the test will fail anyway!
A correct test could be:
Create a UserMapper object and open a DB transaction
Insert data in DB that will be retrieved in $result
Create a Person object with correct taxId and clientID to get the wanted $result from your query
Run UserMapper->getAccountExpirationDate($Person)
Assert that $Person->expirationDate equals whatever you set as the expected value from $result
Rollback DB transaction
You may also check the format of the date, etc. Anything that is related to the logic you implemented.
If you are going to test such things in many tests, use setUp and tearDown to initialize and roll back your DB around each test.
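A minimal sketch of that setUp/tearDown pattern; the connection details and the accounts table with its tax_id, client_id and expiration_date columns are assumptions you would adapt to your schema:

<?php
// Sketch: wrap every test in a transaction and roll it back afterwards,
// so each test sees the same database state.
class UserMapperTest extends PHPUnit_Framework_TestCase
{
    private $pdo;

    protected function setUp()
    {
        $this->pdo = $this->createTestConnection();
        $this->pdo->beginTransaction();
        // Insert the rows that getAccountExpirationDate() is expected to read.
        $this->pdo->exec("INSERT INTO accounts (tax_id, client_id, expiration_date)
                          VALUES ('123', '456', '2020-12-31')");
    }

    protected function tearDown()
    {
        // Undo everything the test (and setUp) wrote.
        $this->pdo->rollBack();
    }

    private function createTestConnection()
    {
        // Connection details are placeholders.
        return new PDO('mysql:host=localhost;dbname=app_test', 'test_user', 'secret');
    }
}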
Is this a legitimate reason to keep both tests?
I think no one could tell from your simple example which test you need to keep. If you like, you can keep both. I say that because the answer depends a lot on unknown factors of your project, including future requirements and modifications.
Personally, I don't think you need both, at least not for the same function. If your function is more likely to break in its logic than in its data handling, feel free to drop the database-backed tests. You will learn to use the correct type of test over time.
I'm torn because although it is testing for different things, it's essentially duplicate code.
Yes, it is.
If I get rid of the database test, I can't be sure the SQL will return the expected information.
This is a valid concern if your database gets complicated enough. You are using DBUnit, so I would assume you have a sample database with sample information. Over time, a production database may undergo multiple structural modifications (adding fields, removing fields, even reworking data relationships...). These things will add to the effort of maintaining your test database. However, they can also help you realize when some changes to the database aren't right.
If I get rid of the unit test, I introduce a required database dependency at all times. As the codebase grows, the tests will become slower over time; there is no out-of-the-box testing, and extra effort is required from other developers to set up a configuration.
What's wrong with requiring a database dependency?
Firstly, I don't see how the extra effort from other developers to set up a configuration comes into play here. You only set it up once, don't you? In my opinion, it's a minor factor.
Second, the tests will get slower over time. Of course. You can count on an in-memory database like H2 or HSQLDB to mitigate that risk, but they will certainly take longer. However, do you actually need to worry? Most of these data-backed tests are integration tests. They should be the responsibility of the continuous integration server, which runs the exhaustive test suite every now and then to prevent regression bugs. As developers, I think we should only run a small set of lightweight tests during development.
In the end, I would like to remind you of the number one rule of unit testing: test what can break. Don't write tests for the sake of testing. You will never have 100% coverage. However, since 20% of the code is used 80% of the time, focusing your testing effort on your main business flow might be a good idea.

When should I use a static array instead of a new table in my database?

I've implemented an Access Control List using 2 static arrays (for the roles and the resources), but I added a new table in my database for the permissions.
The idea of using a static array for the roles is that we won't create new roles all the time, so the data won't change often. I thought the same for the resources, also because I think the resources are something only the developers should handle, since they're more related to the code than to the data. Do you know of any reasons to use a static array instead of a database table? When, and why?
The problem with hardcoding values into your code is that compared with a database change, code changes are much more expensive:
You usually need to create a new package to deploy. That package would need to be regression tested to verify that no bugs have been introduced. Hint: even if you only change one line of code, regression tests are necessary to verify that nothing went wrong in the build process (e.g. a library isn't packaged correctly, causing a module to fail).
Updating code can mean downtime, which also increases risk; there is always a chance that the update fails.
In an enterprise environment it is usually a lot quicker to get DB updates approved than code changes.
All that costs time/effort/money. Note that, in my opinion, holding reference data or static data in a database does not mean a hit on performance, because the data can always be cached.
Your static array is an example of 'hard-coding' your data into your program, which is fine if you never ever want to change it.
In my experience, for your use case, this is not ever going to be true, and hard-coding your data into your source will result in you being constantly asked to update those things you assume will never change.
Protip: to a project manager and/or client, nothing is immutable.
I think this just boils down to how you think the database will be used in the future. If you leave the data in arrays, and then later want to create another application that interacts with this database, you will start to have to maintain the roles/resources data in both code bases. But, if you put the roles/resources into the database, the database will be the one authority on them.
I would recommend putting them in the database. You could read the tables into arrays at startup, and you'll have the same performance benefits and the flexibility to have other applications able to get this information.
Also, when/if you get to writing a user management system, it is easier to display the roles/resources of a user by joining the tables than it is to get back the roles/resources IDs and have to look up the pretty names in your arrays.
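A minimal sketch of the "read the tables into arrays at startup" approach mentioned above; the table and column names (roles, resources, id, name) and the connection details are assumptions:

<?php
// Sketch: load roles and resources from the database once at startup
// and keep them in memory, so lookups are as cheap as a static array.
function loadLookupTable(PDO $pdo, $table)
{
    // $table comes from trusted code below, not from user input.
    return $pdo->query("SELECT id, name FROM {$table}")
               ->fetchAll(PDO::FETCH_KEY_PAIR); // array of id => name
}

$pdo       = new PDO('mysql:host=localhost;dbname=app', 'app_user', 'secret');
$roles     = loadLookupTable($pdo, 'roles');
$resources = loadLookupTable($pdo, 'resources');

// Other applications (or a user management UI) can read the same tables,
// e.g. displaying pretty names with a JOIN instead of looking up IDs in code.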
With static arrays you get performance, since you do not need to access the database all the time, but safety is more important than performance, so I suggest you manage the permissions in the database.
Read up on RBAC.
Things considered static should be coded as static. That is, if you really consider them static.
But I suggest using class constants instead of static array values.
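For example, a hypothetical Role class using class constants instead of a static array might look like this:

<?php
// Sketch: class constants give you named, IDE-discoverable values
// instead of "magic" strings scattered through a static array.
class Role
{
    const ADMIN  = 'admin';
    const EDITOR = 'editor';
    const GUEST  = 'guest';
}

// Usage (hasRole() is assumed to exist on your user object):
if ($user->hasRole(Role::ADMIN)) {
    // grant access to the admin-only resource
}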

Using TDD to Create a Report

I'm currently trying to use PHPUnit to learn about Test Driven Development (TDD) and I have a question about writing reports using TDD.
First off, I understand the basic process of TDD.
But my question is this: How do you use TDD to write a report?
Say you've been tasked with writing a report about the number of cars that pass by a given intersection, broken down by color, type, and weight. All of the above data has been captured in a database table, but you're being asked to correlate it.
How do you go about writing tests for a method whose outcome you don't know? The outcome of the method that correlates this data is going to change based on the date range and other limiting criteria the user may provide when running the report. How do you work within the confines of TDD in this situation using a framework like PHPUnit?
You create test data beforehand that represents the type of data you will receive in production, then test your code against that, refreshing the table each time you run the test (i.e. in your setUp() method).
You can't test against the actual data you will receive in production, no matter what you're testing. You're only testing that the code works as expected for a given scenario. For example, if you load your testing table with five rows of blue cars, then you want your report to show five blue cars when you test it. You're testing the parts of the report, so that when you're done you will have tested the whole of the report automatically.
As a comparison, if you were testing a function that expected a positive integer between 1 and 100, would you write 100 tests to test each individual integer? No, you would test something within the range, then something on and around the boundaries (e.g. -1, 0, 1, 50, 99, 100, and 101). You don't test, for example, 55, because that test will go down the same code path as 50.
Identify your code paths and requirements, then create suitable tests for each one of them. Your tests will become a reflection of your requirements. If the tests pass, then the code will be an accurate representation of your requirements (and if your requirements are wrong, TDD can't save you from that anyway).
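As an illustration of that boundary-testing idea, a sketch using a PHPUnit data provider; isValidQuantity() is a hypothetical function accepting integers from 1 to 100:

<?php
// Hypothetical function under test: accepts integers from 1 to 100.
function isValidQuantity($quantity)
{
    return is_int($quantity) && $quantity >= 1 && $quantity <= 100;
}

// Sketch: test on, around, and inside the boundaries rather than every value.
class QuantityValidatorTest extends PHPUnit_Framework_TestCase
{
    public function quantityProvider()
    {
        return array(
            array(-1,  false),
            array(0,   false),
            array(1,   true),
            array(50,  true),
            array(99,  true),
            array(100, true),
            array(101, false),
        );
    }

    /**
     * @dataProvider quantityProvider
     */
    public function testIsValidQuantity($quantity, $expected)
    {
        $this->assertSame($expected, isValidQuantity($quantity));
    }
}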
You don't use the same data when running the test suites as when running your script. You use test data. So if you want to interact with a database, a good solution is to create an SQLite database stored in RAM.
Similarly, if your function interacts with a filesystem, you can use a virtual filesystem.
And if you have to interact with objects, you can mock them too.
The good thing is you can test with all the vicious edge-case data you think of while writing the code (hey, what if the data contains unescaped quotes?).
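A minimal sketch of the in-memory SQLite idea; the cars table and fixture rows are placeholders for whatever your report actually reads, and note that SQLite's SQL dialect differs from MySQL's in places:

<?php
// Sketch: build a throwaway in-memory database for each test run.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Schema and fixture data live entirely in RAM and vanish when the test ends.
$pdo->exec('CREATE TABLE cars (id INTEGER PRIMARY KEY, color TEXT, type TEXT, weight INTEGER)');
$pdo->exec("INSERT INTO cars (color, type, weight) VALUES
            ('blue', 'sedan', 1200),
            ('blue', 'truck', 3500),
            ('red',  'sedan', 1100)");

// Known input -> known expected output: exactly two blue cars.
$count = $pdo->query("SELECT COUNT(*) FROM cars WHERE color = 'blue'")->fetchColumn();
// assert that $count equals 2 in your test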
It is very difficult, and often unwise, to test directly against your production server, so your best bet is to fake it.
First, you create a stub: a special object that stands in for the database and lets your unit tests pretend that some value came from the DB when it really came from you. If need be, you have something capable of generating values that are not knowable to you in advance but are still accessible to the tests.
Once everything is working there, you can have a data set in the DB itself in some testing schema -- basically, you connect but with different parameters, so that while the code thinks it is looking at PRODUCTION.CAR_TABLE it is really looking at TESTING.CAR_TABLE. You may even want to have the test drop/create the table each time (though that might be a bit much, it does result in more reliable tests).
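One way to switch between the two schemas, sketched with hypothetical names (the APP_ENV variable, schema names, and credentials are all assumptions):

<?php
// Sketch: the application asks this factory for its connection; the test
// runner sets APP_ENV=testing so the same queries hit the testing schema.
function createConnection()
{
    $schema = (getenv('APP_ENV') === 'testing') ? 'TESTING' : 'PRODUCTION';
    return new PDO(
        "mysql:host=localhost;dbname={$schema}",
        'app_user',
        'secret'
    );
}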

How to "upgrade" the database in real world?

My company has developed a web application using PHP + MySQL. The system can display a product's original price and discount price to the user. If you aren't logged in you get the original price; if you are logged in you get the discount price. It is pretty easy to understand.
But my company wants more features in the system: it wants to display different prices based on the user. For example, user A is a golden partner, so he gets 50% off; user B is a silver partner and only gets 30% off. This logic was not prepared for in the original system, so I need to add some attributes to the database, at least a user type in this example. Is there any recommendation on how to migrate the current database to my new version of the database? Also, all the data should be preserved, and the server should work 24/7 (without stopping the database).
Is it possible to do so? Also, any recommendations for future maintenance? Thank you.
I would recommend writing a tool that runs SQL scripts against your database incrementally, much like Rails migrations.
In the system I am currently working on, we have such a tool written in Python. We name our scripts something like 000000_somename.sql, where the zeros are the revision number in our SCM (Subversion), and the tool is run as part of development/testing and finally when deploying to production.
This has the benefit of letting you go back in time in terms of database changes, much as you can with code (if you use a source code version control tool).
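A minimal sketch of such a runner in PHP; the migrations directory, the schema_migrations bookkeeping table, and the connection details are assumptions, and each .sql file is assumed to contain a single statement:

<?php
// Sketch: apply numbered .sql files in order, remembering which ones have run.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'deploy_user', 'secret');
$pdo->exec('CREATE TABLE IF NOT EXISTS schema_migrations (filename VARCHAR(255) PRIMARY KEY)');

$applied = $pdo->query('SELECT filename FROM schema_migrations')
               ->fetchAll(PDO::FETCH_COLUMN);

$files = glob(__DIR__ . '/migrations/*.sql');
sort($files); // 000123_add_user_type.sql sorts after 000122_...

foreach ($files as $file) {
    $name = basename($file);
    if (in_array($name, $applied)) {
        continue; // already applied on this database
    }
    echo "Applying {$name}\n";
    $pdo->exec(file_get_contents($file)); // one statement per file assumed

    $stmt = $pdo->prepare('INSERT INTO schema_migrations (filename) VALUES (?)');
    $stmt->execute(array($name));
}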
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
Here are more concrete examples of ALTER TABLE.
http://php.about.com/od/learnmysql/p/alter_table.htm
You can add the necessary columns to your table with ALTER TABLE, then set the user type for each user with UPDATE, and then deploy the new version of your app that uses the new column.
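For example (a sketch; the users table, the user_type column, the default value, and the IDs are placeholders for your own schema):

<?php
// Sketch: add the new column, backfill it, then deploy the code that uses it.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'deploy_user', 'secret');

// 1. Add the attribute (existing rows get the default).
$pdo->exec("ALTER TABLE users ADD COLUMN user_type VARCHAR(20) NOT NULL DEFAULT 'standard'");

// 2. Backfill the partner levels for existing users.
$pdo->exec("UPDATE users SET user_type = 'gold'   WHERE id IN (1, 2, 3)");
$pdo->exec("UPDATE users SET user_type = 'silver' WHERE id IN (4, 5)");

// 3. Only after this succeeds, deploy the application code that reads user_type.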
Did you use an ORM for the data access layer? I know Doctrine comes with a migration API which allows switching versions up and down (in case something goes wrong with the new version).
Outside of any framework or ORM considerations, a fast script will minimize slowdown (or downtime if the process is too long).
In my opinion, I'd rather have a 30-second website interruption with an information page than a shorter interruption with visible bugs or no display at all. If interruption time matters, it's best to do this at night or when traffic is lower.
This can all be done in one script (or at least launched from one command line). When we had to write such scripts, we included the following steps in a shell script:
put the application in standby (temporary static page): you can use an .htaccess redirect or whatever is applicable to your app/server environment.
svn update (or switch) to upgrade source code and assets
empty caches, clean up temp files, etc.
rebuild generated classes (Symfony-specific)
upgrade the DB structure with ALTER / CREATE TABLE queries
if needed, migrate data from the old structure to the new one: depending on what you changed in the structure, this may require fetching data before altering the DB structure, or using temporary tables.
if all went well, remove the temporary page. Upgrade done.
if something went wrong, display a red message to the operator so they can see what happened, try to fix it, and then remove the waiting page by hand.
The script should run checks at each step and stop at the first error, and it should be verbose (but concise) about what it does at every step, so you can fix the app faster if something goes wrong.
The best would be a recoverable script (error at step 2 - stop process - manual fix - resume at step 3); I never took the time to implement it that way.
It works pretty well, but this kind of script has to be tested intensively, on an environment as close as possible to the production one.
In general we develop such scripts locally and test them on the same platform as the production environment (just with different paths and a different DB).
If the waiting page is not an option, you can go without it, but you need to ensure data and user session integrity. For example, use LOCK on tables during the upgrade/data transfer and use exclusive locks on modified files (SVN does, I think).
There could be other, better solutions, but this is basically what I use and it does the job for us. The major drawback is that this kind of script has to be rewritten for each major release, which has made me look for other ways to do this, but which ones? I would be glad if someone here had a better and simpler alternative.

Smart (?) Database Cache

I've seen several database cache engines; all of them are pretty dumb (i.e., "keep this query cached for X minutes") and require that you manually delete the whole cache repository after an INSERT / UPDATE / DELETE query has been executed.
About 2 or 3 years ago I developed an alternative DB cache system for a project I was working on. The idea was basically to use regular expressions to find the table(s) involved in a particular SQL query:
$query_patterns = array
(
    'INSERT'   => '/INTO\s+(\w+)\s+/i',
    'SELECT'   => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:[LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS]\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
    'UPDATE'   => '/UPDATE\s+(\w+)\s+SET/i',
    'DELETE'   => '/FROM\s+((?:[\w]|,\s*)+)/i',
    'REPLACE'  => '/INTO\s+(\w+)\s+/i',
    'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
    'LOAD'     => '/INTO\s+TABLE\s+(\w+)/i',
);
I know that these regexes probably have some flaws (my regex skills were pretty green back then) and obviously don't match nested queries, but since I never use those that isn't a problem for me.
Anyway, after finding the involved tables I would alphabetically sort them and create a new folder in the cache repository with the following naming convention:
+table_a+table_b+table_c+table_...+
In case of a SELECT query, I would fetch the results from the database, serialize() them and store them in the appropriate cache folder, so for instance the results of the following query:
SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC;
Would be stored in:
/cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5
The MD5 hash being of the query itself. Upon a subsequent identical SELECT query, the results would be trivial to fetch.
In case of any other type of write query (INSERT, REPLACE, UPDATE, DELETE and so on) I would glob() all the folders that had +matched_table(s)+ in their name and delete all of their file contents. This way it wouldn't be necessary to delete the whole cache, just the cache used by the affected and related tables.
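Roughly, the store/fetch/invalidate flow looked like the sketch below; the table list is assumed to come from the regex matching above, and the /cache base path and folder/file naming are the simplified convention described earlier:

<?php
// Sketch of the folder-per-table-combination cache described above.
define('CACHE_DIR', '/cache');

function cacheKeyDir(array $tables)
{
    sort($tables); // alphabetical order -> stable folder name
    return CACHE_DIR . '/+' . implode('+', $tables) . '+';
}

function cacheFetch($query, array $tables)
{
    $file = cacheKeyDir($tables) . '/' . md5($query) . '.md5';
    return is_file($file) ? unserialize(file_get_contents($file)) : false;
}

function cacheStore($query, array $tables, array $rows)
{
    $dir = cacheKeyDir($tables);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    file_put_contents($dir . '/' . md5($query) . '.md5', serialize($rows));
}

function cacheInvalidate(array $tables)
{
    // Wipe every folder whose name mentions any of the written tables.
    foreach ($tables as $table) {
        foreach (glob(CACHE_DIR . '/*+' . $table . '+*', GLOB_ONLYDIR) as $dir) {
            array_map('unlink', glob($dir . '/*.md5'));
        }
    }
}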
The system worked pretty well and the difference in performance was visible, although the project had many more read queries than write queries. Since then I have started using transactions and FK CASCADE UPDATES / DELETES, and I never had the time to perfect the system to make it work with these features.
I've used MySQL Query Cache in the past but I must say the performance doesn't even compare.
I'm wondering: am I the only one who sees beauty in this system? Are there any bottlenecks I may not be aware of? Why do popular frameworks like CodeIgniter and Kohana (I'm not sure about Zend Framework) have such rudimentary DB cache systems?
More importantly, do you see this as a feature worth pursuing? If yes, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serialization of query results)?
I appreciate all input, thanks.
I can see the beauty in this solution; however, I believe it only works for a very specific set of applications. Scenarios where it is not applicable include:
Databases which utilize cascading deletes/updates or any kind of triggers. E.g., your DELETE to table A may cause a DELETE from table B. The regex will never catch this.
Accessing the database from points which do not go through your cache invalidation scheme, e.g. crontab scripts. If you ever decide to implement replication across machines (introduce read-only slaves), that may also disturb the cache (because it does not go through cache invalidation either).
Even if these scenarios are not realistic for your case, they do still answer the question of why frameworks do not implement this kind of cache.
Regarding whether this is worth pursuing, it all depends on your application. Would you care to supply more information?
The solution, as you describe it, is at risk of concurrency issues. When you're receiving hundreds of queries per second, you're bound to hit a case where an UPDATE statement runs, but before you can clear your cache, a SELECT reads from it and gets stale data. Additionally, you may run into issues when several UPDATEs hit the same set of rows in a short time period.
In a broader sense, best practice with caching is to cache the largest objects possible. E.g., rather than having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.
Better still, if you can cache whole pages (e.g., you show the same homepage to everyone; a profile page appears identical to almost everyone, etc.), do that. One cache fetch for a whole, pre-rendered page will dramatically outperform dozens of cache fetches for row/query-level caches followed by re-rendering the page.
Long story short: profile. If you take the time to do some measurement, you'll likely find that caching large objects, or even pages, rather than small queries used to build those things, is a huge performance win.
While I do see the beauty in this - especially for environments where resources are limited and cannot easily be extended, like shared hosting - I personally would fear complications in the future: what if somebody newly hired and unaware of the caching mechanism starts using nested queries? What if some external service starts updating the table without the cache noticing?
For a specialized, defined project that urgently needs a speedup that cannot be helped by adding processor power or RAM, this looks like a great solution. As a general component, I find it too shaky, and would fear subtle problems in the long run that stem from people forgetting that there is a cache to be aware of.
I suspect that the regexes may not cover every case - certainly they don't seem to deal with mixing schema-qualified table names and bare table names. E.g. consider
update stats.measures set amount=50 where id=1;
and
use stats;
update measures set amount=50 where id=1;
Then there's PL/SQL.
Then there's the fact that it depends on every client opting in to an advisory control mechanism, i.e. it presupposes that all database access is from machines implementing the caching control mechanism on a shared filesystem.
(As a small point - wouldn't it be simpler to just check the modification times on the data files to determine whether the cached version of a query on a defined set of tables is still current, rather than trying to identify whether the cache control mechanism has spotted an update? It would certainly be a lot more robust.)
Stepping back a bit, implementing this from scratch using a robust architecture would mean that all queries would have to be intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser. It certainly requires a common storage substrate for all the instances of the control mechanism. It probably needs an understanding of the data dictionary - all things which are already implemented by the database itself.
You state that "I've used MySQL Query Cache in the past but I must say the performance doesn't even compare."
I find this rather odd. Certainly when dealing with large result sets from queries, my experience is that loading the data into the heap from a database is a lot faster than unserializing large arrays - although large result sets are rather atypical of web-based applications.
When I've tried to speed up database access (after fixing everything else of course) then I've gone down the route of replicating and partitioning data across multiple DBMS instances.
C.
This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a similar set of regular expressions are used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read to one of those tables comes up, it's routed to the master. If a query is reading from a table whose data needn't be up-to-the-second accurate, then it's routed to the slave. Generally, information only really needs to be current when it's something a user changed themselves (i.e., editing a user's profile).
They talk about this a good bit in the O'Reilly book High Performance MySQL. I used it quite a bit when developing a system for handling session splits back in the day.
The improvement you describe is to avoid invalidating caches that are guaranteed to not have been affected by an update because they draw data from a different table.
That is of course nice, but I am not sure if it is fine-grained enough to make a real difference. You would still be invalidating lots of caches that did not really need to be invalidated (because the update was on the same table, but on different rows).
Also, even this "simple" scheme relies on being able to detect the relevant tables by looking at the SQL query string. This can be difficult to do in the general case, because of views, table aliases, and multiple catalogs.
It is very difficult to automatically (and efficiently) detect whether a cache needs to be invalidated. Because of that, you can either use a very simple scheme (such as invalidating on every update, or per table, as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for the specific application with deep hooks into the query logic (probably difficult to write and hard to maintain), or accept that the cache can contain stale data and just refresh it periodically.
