MySQL Performance Planning - php

I'm in the process of writing a system to search through a MySQL database of real estate listings. I'm concerned about performance and wanted some input on how to handle this.
The table that will be the most frequently queried is the 'listings' table and will contain over 600k records with 86 columns. This table will also be updated every 30 minutes as listings change.
Almost every search will be against records with a status of 'active' which will be about 15k of the 600k records. However, I need to retain all of the records for our internal reports. Also, each query will likely be searching for various parameters (#beds, #baths, etc) so caching may not be feasible.
I was considering maintaining a second table containing the PKs of records marked 'active', then creating a view of the two tables joined on the listing's PK. However, I know that under certain conditions views can be very inefficient.
I did have the thought of maintaining two databases since the inactive listings won't be searched frequently and will require less maintenance.
Fortunately it's not in production yet and I have time for performance testing. One more thing, this will be hosted on a dedicated Linux server with the front-end written in PHP. Any insight offered is greatly appreciated.

I suggest that you create an archive table. You could set up a process to run every 30 minutes or once per day, depending on the requirements.
The archive table would have the same columns as the original table, plus an EffDate and an EndDate that hold the dates (or datetimes) between which the record was active.
Such a table will make it possible to recreate the history at any point in time -- something that will prove useful, I'm sure.
You will need code to maintain this. The basic logic is to compare each record in your table against the most current version in the archive (the row where EndDate is NULL and the ids match). Then (a sketch follows the steps below):
1. If the new record is not present, create an archive record with the current date as EffDate.
2. If it is present and all columns are the same, do nothing.
3. Otherwise, set EndDate on the existing archive record to the current date and do (1).
Any archive records that no longer have a corresponding new record at all should have EndDate set to the current date.
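A minimal MySQL sketch of those steps, assuming a listings table with primary key id and an archive table listings_archive with the same columns plus EffDate and EndDate (beds and baths stand in for whichever columns you actually track):

-- Close any open archive row that no longer has an identical current row
-- (this covers both modified and removed listings)
UPDATE listings_archive a
LEFT JOIN listings l
       ON l.id = a.id
      AND l.beds <=> a.beds           -- repeat the null-safe comparison
      AND l.baths <=> a.baths         -- for every tracked column
SET a.EndDate = CURDATE()
WHERE a.EndDate IS NULL
  AND l.id IS NULL;

-- Open a fresh archive row for every current listing that has no open
-- archive row (new listings, plus the ones just closed above)
INSERT INTO listings_archive (id, beds, baths, EffDate, EndDate)
SELECT l.id, l.beds, l.baths, CURDATE(), NULL
FROM listings l
LEFT JOIN listings_archive a
       ON a.id = l.id
      AND a.EndDate IS NULL
WHERE a.id IS NULL;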
Typically, I have such tables updated once per day.
In code that does this, I have a big ugly query (Excel helps me build it) that does the comparisons and determines which records are "New", "Modified", and "Removed". The "Removed" and "Modified" records have their current archive row's EndDate set to the current date. The "New" and "Modified" records then get a new archive row with EffDate set to the current date.
The values for EndDate and EffDate might be a day earlier or later than stated, depending on how the updates really work. For a nightly update, for instance, the EffDate might be set to tomorrow or even to the date when the listing takes effect.

Related

Storing a huge amount of records in a MySQL table... is it normal?

I have developed a small system based on PHP and MySQL.
Teachers create notes using a WYSIWYG editor and share them with the students.
For the teachers I have done some analysis; one part of it is how long students spend on each course and on which date. I use jQuery to detect browser activity and add records to a table.
Say the subject is Mathematics and it has 3 chapters: I accumulate the time spent on each chapter, group it by date, and display it.
If the student is idle for 15 minutes, I stop counting and add a record.
If the student goes to another tab, I stop counting and add a record.
If the student clicks another link, I add another record.
Closing the browser does the same.
If they go to another chapter, I again add a record and start counting for the next chapter.
Finally I accumulate the time spent.
Now I am using it in my college, and I can see that within 2 months around 10,000 records have been added.
I am using an INT(10) AUTO_INCREMENT primary key.
Is this good practice? Are there any alternatives?
This is how the analysis is displayed, and this is the table that stores the records (screenshots omitted).
Sorry for the LONG POST
Simple answer: 10,000 records is peanuts. For performance, make sure you have indexes on your foreign keys.
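For example, if the activity table references students and chapters through columns such as student_id and chapter_id (hypothetical names, adjust to your schema), the indexes could be added like this:

ALTER TABLE activity_log
  ADD INDEX idx_activity_student (student_id),
  ADD INDEX idx_activity_chapter (chapter_id);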

Storing User Login Time in a User Table

In a table of Users, I want to keep track of the time of day each user logs in as running totals. For example:
UserID   midnightTo6am   6amToNoon   noonTo6pm   6pmToMidnight
User1    3               2           7           1
User2    4               9           1           8
Note that this is part of a larger table that contains more information about a user, such as address and gender, hair color, etc, etc.
In this example, what is the best way to store this data? Should it be part of the users table, despite knowing that not every user will log in during every time period (a user may never log in between 6am and noon)? Or is this table a 1NF failure because of repeating columns that should be moved to a separate table?
If stored as part of the Users Table, there may be empty cells that never get populated with data because the user never logs in at that time.
If this data is a 1NF failure and the data is to be put in a separate table, how would I ensure that a +1 for a certain time goes smoothly? Would I search for the user in the separate table to see if they have logged in at that time before and +1? Or add a column to that table if it is their first time logging in during that time period?
Any clarifications or other solutions are welcome!
I would recommend storing the login events either in a file based log or in a simple table with just the userid and DATETIME of the login.
Once a day, or however often you need to report on the data you illustrated in your question, aggregate that data up into a table in the shape that you want. This way you're not throwing away any raw data and can always reaggregate for different periods, by hour, etc at a later date.
Addition: I suspect that the fastest way of deriving the aggregated data would be to run a range query for each of your aggregation periods, so you're searching for (e.g.) login dates in the range 2011-12-25 00:00:00 to 2011-12-25 06:00:00. If you go with that approach, an index on (datetime, user_id) would work well. It seems counter-intuitive since you want to work on a user-centric basis, but the index on the DATETIME field allows the rows to be found quickly, and the trailing user_id column then allows fast grouping.
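A minimal sketch of that approach, with illustrative names (login_events, user_id, logged_at) rather than anything from the question:

-- Raw log: one row per login event
CREATE TABLE login_events (
    user_id   INT      NOT NULL,
    logged_at DATETIME NOT NULL,
    KEY idx_time_user (logged_at, user_id)
);

-- One range query per bucket, e.g. the midnight-to-6am bucket for 2011-12-25
SELECT user_id, COUNT(*) AS logins
FROM login_events
WHERE logged_at >= '2011-12-25 00:00:00'
  AND logged_at <  '2011-12-25 06:00:00'
GROUP BY user_id;

The application (or a nightly job) would then add each bucket's count to the running totals.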
A couple of things. Firstly, this is not a violation of 1NF, so doing it as 4 columns may in fact be acceptable. Secondly, if you do go with this design, you should not use NULLs; use zero instead (with the possible exception of existing records). Thirdly, whether you should use this design or split it into another table (or two) depends on your purpose and usage. If your standard use of the table does not involve this information, it should go into another table with a one-to-one relationship, and if you may need to increase the granularity of the login times, you should also use another table. Finally, if you do split this off into another table with a timestamp, give some consideration to privacy.

Check if a row was added to a MySQL table

I have a table which contains orders, and orders are being added to the table by users as time goes by.
I want to implement a service that checks if a row was added to the table.
Is there a specific way to do that?
thanks!
If you want to know which rows have been added since the last time you checked, put a timestamp in each row, and keep track somewhere (separately) of the newest row you've seen so far. To find new rows, query for all rows whose timestamp is newer than the newest one you've seen before. Then take the most recent timestamp from the result set, and use it to update your "newest row seen so far" variable.
The database itself doesn't keep track of which rows have been newly-added because the meaning of "new" depends on who's asking. A row that was added six months ago is "new" to someone who hasn't checked since then. That's why you have to use timestamps, and have the application keep track of which timestamp currently marks the boundary between "old" and "new".
Edit: Actually, instead of timestamps, you might want to use an auto-increment integer column. With timestamps there's a slight chance that two rows may be added so close together in time that they get the same timestamp, and if the application does its query at a moment when only one of those rows has been inserted, it'll "miss" the other one next time it checks for new rows because it thinks that timestamp has been seen already. A value that always increases for every new row would avoid that problem, plus many tables have one already (for use as a primary key).
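A minimal sketch of the auto-increment approach, assuming the orders table has an AUTO_INCREMENT primary key named id and that :last_seen_id is a value the application remembered from its previous check (bound via PDO or similar):

-- Fetch everything added since the last check
SELECT *
FROM orders
WHERE id > :last_seen_id
ORDER BY id;

After processing the result set, store the largest id it returned and use that as :last_seen_id next time.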

How do I delete records in a table while keeping certain data?

My site has a lot of incoming searches, which are stored in a database to show recent queries on my website. Due to the high volume of searches the database is getting bigger in size, so what I want is to keep only the most recent queries in the database, say 10 records. This keeps my database small and queries will be faster.
I am able to store incoming queries in the database but don't know how to restrict or delete excess/old data from the table.
any help??
well I am using PHP and MySQL
Hopefully you have a timestamp column in your table (or have the freedom to add one). AFAIK, you have to add the timestamp explicitly when you add data to the table. Then you can do something along the lines of:
DELETE FROM tablename WHERE timestamp < '<a date two days in the past, or whatever>';
You'd probably want to just do this periodically, rather than every time you add to the table.
I suppose you could also just limit the size to the most recent ten records by checking the size of the table every time you are about to add a line, and deleting the oldest record (again, using the timestamp column you added) if adding the new record will make it too large.
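As a sketch of that idea in a single statement, assuming the table and columns are named recent_searches, id and created_at (illustrative names only):

-- Keep only the 10 newest rows
DELETE FROM recent_searches
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id
        FROM recent_searches
        ORDER BY created_at DESC
        LIMIT 10
    ) AS newest
);

The inner derived table is needed because MySQL will not let a DELETE reference its target table directly in a subquery.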
Falkon's answer is good - though you might not want to have your archive in a table, depending on your needs for that forensic data. You could also set up a cron job that just uses mysqldump to make a backup of the database (with the date in the filename), and then delete the excess records. This way you can easily make backups of your old data, or search it with whatever tool, and your database stays small.
You should write a PHP script, started by cron (e.g. once a day), which moves some data from the main table TableName to an archive table TableNameArchive with exactly the same structure.
The SQL inside the script should look like:
INSERT INTO TableNameArchive
SELECT * FROM TableName
WHERE data < '2010-06-01';  -- of course, supply your own condition here
Next you should DELETE the old records from TableName.
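The matching cleanup is a one-liner; it must reuse the same condition so that only the rows just copied to the archive are removed:

DELETE FROM TableName
WHERE data < '2010-06-01';  -- same condition as the INSERT above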

How to do monthly refresh of large DB tables without interrupting user access to them

I have four DB tables in an Oracle database that need to be rewritten/refreshed every week or every month. I am writing a script in PHP, using the standard OCI functions, that will read new data in from XML and refresh these four tables. The four tables have the following properties:
TABLE A - up to 2mil rows, one primary key (One row might take max 2K data)
TABLE B - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 1100 bytes of data)
TABLE C - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 1100 bytes of data)
TABLE D - up to 10mil rows, one foreign key pointing to TABLE A (One row might take max 120 bytes of data)
So I need to repopulate these tables without damaging the user experience. I obviously can't delete the tables and just repopulate them as it is a somewhat lengthy process.
I've considered just a big transaction where I DELETE FROM all of the tables and just regenerate them. I get a little concerned about the length of the transaction (don't know yet but it could take an hour or so).
I wanted to create temp table replicas of all of the tables and populate those instead. Then I could DROP the main tables and rename the temp tables. However, you can't do the DROP and ALTER TABLE statements within a transaction as they always do an auto commit. This should be quick (four DROP and four ALTER TABLE statements), but it can't guarantee that a user won't get an error within that short period of time.
Now, as a combination of the two ideas, I'm considering doing the temp tables, then doing a DELETE FROM on all four original tables and then an INSERT INTO from the temp tables to repopulate the main tables. Since there are no DDL statements here, this would all work within a transaction. However, I'm then wondering if the memory it takes to process some 60 million records within a transaction is going to get me in trouble (this would be a concern for the first idea as well).
I would think this would be a common scenario. Is there a standard or recommended way of doing this? Any tips would be appreciated. Thanks.
You could have a synonym for each of your big tables. Create new incarnations of your tables, populate them, drop and recreate the synonyms, and finally drop your old tables. This has the advantage of (1) only one actual set of DML (the inserts) avoiding redo generation for your deletes and (2) the synonym drop/recreate is very fast, minimizing the potential for a "bad user experience".
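A rough sketch of that pattern with illustrative names, where applications always query through the synonym table_a and the current data happens to live in table_a_1 (CREATE OR REPLACE SYNONYM stands in for the drop-and-recreate step):

CREATE TABLE table_a_2 AS SELECT * FROM table_a_1 WHERE 1 = 0;  -- new incarnation, same shape
-- ... bulk-load table_a_2 with the fresh data, add indexes/constraints ...
CREATE OR REPLACE SYNONYM table_a FOR table_a_2;                -- near-instant switch for readers
DROP TABLE table_a_1 PURGE;                                     -- old incarnation, no longer referenced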
Reminds me of a minor peeve of mine about Oracle's synonyms: why isn't there an ALTER SYNONYM command?
I'm assuming your users don't actually modify the data in these tables, since it is reloaded from another source every week, so it doesn't really matter if you lock the tables for a full hour. The users can still query the data; you just have to size your rollback segment appropriately. A simple DELETE+INSERT therefore should work fine.
Now if you want to speed things up AND if the new data differs little from the previous data, you could load the new data into temporary tables and apply only the delta with a combination of MERGE and DELETE, like this:
Setup:
CREATE TABLE a (ID NUMBER PRIMARY KEY, a_data CHAR(200));
CREATE GLOBAL TEMPORARY TABLE temp_a (
ID NUMBER PRIMARY KEY, a_data CHAR(200)
) ON COMMIT PRESERVE ROWS;
-- Load A
INSERT INTO a
(SELECT ROWNUM, to_char(ROWNUM) FROM dual CONNECT BY LEVEL <= 10000);
-- Load TEMP_A with extra rows
INSERT INTO temp_a
(SELECT ROWNUM + 100, to_char(ROWNUM + 100)
FROM dual
CONNECT BY LEVEL <= 10000);
UPDATE temp_a SET a_data = 'x' WHERE mod(ID, 1000) = 0;
This MERGE statement will insert the new rows and update the old rows only if they are different:
MERGE INTO a
USING (SELECT temp_a.id, temp_a.a_data
         FROM temp_a
         LEFT JOIN a ON (temp_a.id = a.id)
        WHERE decode(a.a_data, temp_a.a_data, 1) IS NULL) temp_a
   ON (a.id = temp_a.id)
 WHEN MATCHED THEN
   UPDATE SET a.a_data = temp_a.a_data
 WHEN NOT MATCHED THEN
   INSERT (id, a_data) VALUES (temp_a.id, temp_a.a_data);
You will then need to delete the rows that aren't in the new set of data:
DELETE FROM a WHERE a.id NOT IN (SELECT temp_a.id FROM temp_a);
-- 100 rows deleted
You would insert into A first, then into the child tables, and delete in the reverse order.
Am I the only one (except Vincent) who would first test the simplest possible solution, i.e. DELETE/INSERT, before trying to build something more advanced?
However, I'm then wondering if the memory it takes to process some 60 million records within a transaction is going to get me in trouble (this would be a concern for the first idea as well).
Oracle manages memory quite well; it hasn't been written by a bunch of Java novices (oops, it just came out of my mouth!). So the real question is: do you have to worry about the performance penalties of thrashing the REDO and UNDO log files? In other words, build a performance test case, run it on your server and see how long it takes. During the DELETE/INSERT the system will not be as responsive as usual, but other sessions can still perform SELECTs without any fear of deadlocks, memory leaks or system crashes. Hint: DB servers are usually disk-bound, so getting a proper RAID array is usually a very good investment.
On the other hand, if the performance is critical, you can select one of the alternative approaches described in this thread:
partitioning if you have the license
table renaming if you don't, but be mindful that DDLs on the fly can cause some side effects such as object invalidation, ORA-06508...
In Oracle you can partition your tables and indexes based on a date or time column; that way, to remove a lot of data you can simply drop a partition instead of performing a DELETE.
We used to use this to manage monthly archives of 100 Million+ records and not have downtime.
http://www.oracle.com/technology/oramag/oracle/06-sep/o56partition.html is a super handy page for learning about partitioning.
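For illustration only (the table and column names here are invented), a range-partitioned history table would let a whole month be removed as a metadata operation:

CREATE TABLE listings_hist (
    listing_id NUMBER,
    load_date  DATE,
    payload    VARCHAR2(2000)
)
PARTITION BY RANGE (load_date) (
    PARTITION p_2010_05 VALUES LESS THAN (DATE '2010-06-01'),
    PARTITION p_2010_06 VALUES LESS THAN (DATE '2010-07-01')
);

-- Dropping a month is near-instant compared with deleting millions of rows
ALTER TABLE listings_hist DROP PARTITION p_2010_05;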
I assume that this refreshing activity is the only way of data changing in these tables, so that you don't need to worry about inconsistencies due to other writing processes during the load.
All that deleting and inserting will be costly in terms of undo usage; you also would exclude the option of using faster data loading techniques. For example, your inserts will go much, much faster if you insert into the tables with no indexes, then apply the indexes after the load is done. There are other strategies as well, but both of them preclude the "do it all in one transaction" technique.
Your second choice would be my choice - build the new tables, then rename the old ones to a dummy name, rename the temps to the new name, then drop the old tables. Since the renames are fast, you'd have a less than one second window when the tables were unavailable, and you'd then be free to drop the old tables at your leisure.
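A sketch of that rename swap for one table (names are illustrative; each rename is a quick DDL):

ALTER TABLE table_a RENAME TO table_a_old;   -- move the live table out of the way
ALTER TABLE table_a_new RENAME TO table_a;   -- the freshly loaded copy takes its place
DROP TABLE table_a_old PURGE;                -- clean up at your leisure

The brief gap between the two renames is the sub-second window mentioned above.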
If that one-second window is unacceptable, one method I've used in situations like this is to use an additional locking object - specifically, a table with a single row that users would be required to select from before they access the real tables, and that your load process could lock in exclusive mode before it does the rename operation.
Your PHP script would use two connections to the db - one where you do the lock, the other where you do the loading, renaming and dropping. This way the implicit commits in the work connection won't release the lock held by the other connection.
So, in the script, you'd do something like:
Connection 1:
Create temp tables, load them, create new indexes
Connection 2:
LOCK TABLE Load_Locker IN SHARE ROW EXCLUSIVE MODE;
Connection 1:
Perform renaming swap of old & new tables
Connection 2:
Rollback;
Connection 1:
Drop old tables.
Meanwhile, your clients would issue the following command immediately after starting a transaction (or a series of selects):
LOCK TABLE Load_Locker IN SHARE MODE;
You can have as many clients locking the table this way as you like - your process above will block behind them until they have all released the lock, at which point subsequent clients will block until you perform your operations. Since the only thing you're doing inside the context of the SHARE ROW EXCLUSIVE lock is renaming tables, your clients would only ever block for an instant. Additionally, this level of granularity allows you to control how long the clients have a read-consistent view of the old table; without it, if you had a client that did a series of reads that took some time, you might end up changing the tables mid-stream and wind up with weird results if the early queries pulled old data and the later queries pulled new data. Using SET TRANSACTION READ ONLY would be another way of addressing this issue if you weren't using my approach.
The only real downside to this approach is that if your client read transactions take some time, you run the risk of other clients being blocked for longer than an instant, since any locks in SHARE MODE that occur after your load process issues its SHARE ROW EXCLUSIVE lock will block until the load process finishes its task. For example:
10:00 user 1 issues SHARE lock
10:01 user 2 issues SHARE lock
10:03 load process issues SHARE ROW EXCLUSIVE lock (and is blocked)
10:04 user 3 issues SHARE lock (and is blocked by load's lock)
10:10 user 1 releases SHARE
10:11 user 2 releases SHARE (and unblocks loader)
10:11 loader renames tables & releases SHARE ROW EXCLUSIVE (and releases user 3)
10:11 user 3 commences queries, after being blocked for 7 minutes
However, this is really pretty kludgy. Kinlan's solution of partitioning is most likely the way to go. Add an extra column to your source tables that contains a version number, partition your data based on that version, then create views that look like your current tables but only show rows with the current version (determined by the value of a row in a "CurrentVersion" table). Then just do your load into the table, update your CurrentVersion table, and drop the partition for the old data.
Why not add a version column? That way you can add the new rows with a different version number. Create a view against the table that specifies the current version. After the new rows are added recompile the view with the new version number. When that's done, go back and delete the old rows.
What we do in some cases is have two versions of the tables, say SalesTargets1 and SalesTargets2 (an active and an inactive one). Truncate the records from the inactive one and populate it. Since no one but you uses the inactive one, there should be no locking issues or impact on the users while it is populating. Then have a view that selects all the information from the active table (it should be named what the current table is now, say SalesTargets in my example). Then, to switch to the refreshed data, all you have to do is run an alter view statement.
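A sketch of that switch-over using the SalesTargets naming from above; note that in Oracle the view is typically repointed with CREATE OR REPLACE VIEW rather than a literal ALTER VIEW:

TRUNCATE TABLE SalesTargets2;          -- refresh the inactive copy
-- ... load SalesTargets2 with the new data ...
CREATE OR REPLACE VIEW SalesTargets AS
SELECT * FROM SalesTargets2;           -- readers now see the refreshed copy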
Have you evaluated the size of the delta (of changes)?
If the number of rows that get updated (as opposed to inserted) each time you load a new rowset is not too high, then I think you should consider importing the new set of data into a set of staging tables, doing an update-where-exists / insert-where-not-exists (UPSERT) pass, and then just refreshing your indexes (ok ok, indices).
Treat it like ETL.
I'm going with an upsert method here.
I added an additional "delete" column to each of the tables.
When I begin processing the feed, I set the delete field for every record to '1'.
Then I go through a series of updates if the record exists, or inserts if it does not. For each of those inserts/updates, the delete field is set to zero.
At the end of the process I delete all records that still have a delete value of '1'.
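A sketch of that pattern for one table, using Oracle's MERGE to express the update-or-insert step (the table and column names here are illustrative, not the actual schema):

UPDATE table_a SET delete_flag = 1;              -- assume every row is gone until the feed says otherwise

MERGE INTO table_a a
USING staging_a s
   ON (a.id = s.id)
 WHEN MATCHED THEN
   UPDATE SET a.payload = s.payload, a.delete_flag = 0
 WHEN NOT MATCHED THEN
   INSERT (id, payload, delete_flag) VALUES (s.id, s.payload, 0);

DELETE FROM table_a WHERE delete_flag = 1;       -- rows the feed no longer contains
COMMIT;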
Thanks everybody for your answers. I found it very interesting/educational.
