SQL Query speed and refinement - php

So I have just set up a database which holds only one table with the following fields:
key_value: holds 6 digit code for a key
redeemed: boolean for if the key is redeemed
redeemed_by: who redeemed it
redeemed_date: when it was redeemed
software_name: name of the software the key relates to
I basically start with an empty database, and then when someone purchases through PayPal they get their own key, which is added to the database. After this they open an app that lets them input their code; the code is then looked up in the database and marked as redeemed so it can't be used again. This results in both redeemed and unredeemed codes being in one table.
If I were to reach a good few thousand purchases, would this cause the database to slow down significantly, or maybe crash? What if it were a bigger number, say 10,000?
What exactly would be a good solution for this? Even if I had a separate table of redeemed keys, the app would still have to look in that table to see whether a key was redeemed.
Thanks for any answer, I am still learning databases and SQL!

I think your design is sound. You might want to add indexes based on what queries you will be running. key_value sounds like a good primary key which would also serve as an index for updating redeemed.
As noted by Marc B, the hardware is your only likely consideration for performance.
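As a rough sketch of what that could look like in MySQL (the table name and column types here are assumptions; only the column names come from the question):

CREATE TABLE software_keys (
    key_value     CHAR(6)      NOT NULL PRIMARY KEY,  -- the 6 digit code
    redeemed      TINYINT(1)   NOT NULL DEFAULT 0,    -- boolean flag
    redeemed_by   VARCHAR(100) NULL,
    redeemed_date DATETIME     NULL,
    software_name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

-- Redeeming a key is then a single lookup on the primary key:
UPDATE software_keys
SET redeemed = 1, redeemed_by = 'buyer@example.com', redeemed_date = NOW()
WHERE key_value = 'ABC123' AND redeemed = 0;

Checking the affected-row count of that UPDATE tells you whether the key existed and was still unredeemed.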

I would use two tables for this: one for what you have spec'ed out, and another as an archive table, with a job that migrates redeemed/expired records over on a regular basis.
Reasoning: The primary purpose of the table is for the benefit of redemptions, not for use as an archive. Over time, as more and more redeemed records are found in the table, the performance for lookups of unredeemed records starts getting worse and worse because of all the "deadwood" in the table. (Do you think eBay houses all active and completed auctions in one table?)
If you still absolutely need a "one-table" solution, you can easily create a view that merges the two tables.
Also, if you set up a proper primary key, performance will not degrade quickly (for a while at least), because the key eliminates the table scans you are alluding to as record volumes grow.
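A sketch of that archive job in MySQL, assuming a live software_keys table like the one above and a software_keys_archive table with identical columns:

-- Run on a schedule (e.g. nightly): copy redeemed rows to the archive,
-- then remove them from the live table.
START TRANSACTION;

INSERT INTO software_keys_archive
SELECT * FROM software_keys WHERE redeemed = 1;

DELETE FROM software_keys WHERE redeemed = 1;

COMMIT;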

Related

SQL - auto increment within group inside one table [duplicate]

I have got a table which has an id (primary key with auto increment), uid (a key referring to a user's id, for example) and something else which, for my question, won't matter.
I want to make, let's call it, a different auto-increment sequence on id for each uid value.
So, if I add an entry with uid 10, the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. If I add a new one with uid 4, its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such functionality natively? (non-Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
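That example looks roughly like the following (a sketch along the lines of the manual's animals table; note ENGINE=MyISAM and the composite primary key with the grouping column first):

CREATE TABLE animals (
    grp  ENUM('fish','mammal','bird') NOT NULL,
    id   MEDIUMINT NOT NULL AUTO_INCREMENT,
    name CHAR(30) NOT NULL,
    PRIMARY KEY (grp, id)
) ENGINE=MyISAM;

INSERT INTO animals (grp, name) VALUES
    ('mammal','dog'), ('mammal','cat'),
    ('bird','penguin'), ('fish','lax'),
    ('mammal','whale'), ('bird','ostrich');

-- id starts again at 1 within each distinct grp value.
SELECT * FROM animals ORDER BY grp, id;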
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like the one in the example above. It has to be the whole table: since you're trying to restrict INSERT, there's no specific row to lock (if you were governing access to a given row with UPDATE, you could lock just that row). But locking the table makes access to it serial, which limits your throughput. A minimal sketch of this approach follows this list.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
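A minimal sketch of the first option in MySQL, using a hypothetical table mytable(userid, id) with a composite primary key:

LOCK TABLES mytable WRITE;

-- Compute the next per-user id while no other session can insert.
SELECT IFNULL(MAX(id), 0) + 1 INTO @next_id
FROM mytable
WHERE userid = 4;

INSERT INTO mytable (userid, id) VALUES (4, @next_id);

UNLOCK TABLES;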
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you asked about efficiency. Unless you are dealing with extreme volumes, storing an 8-byte DATETIME isn't much of an overhead compared to using, for example, a 4-byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle), the general logic is simple...
Start a transaction (often one is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
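For example, in MySQL that logic might look something like the sketch below, assuming a hypothetical amendments table using the suggested field names, and assuming the engine allows the trigger to read the table it is defined on. The caveats listed next still apply.

CREATE TABLE amendments (
    property_id  INT NOT NULL,
    amendment_id INT NOT NULL,
    details      TEXT,
    PRIMARY KEY (property_id, amendment_id)
);

DELIMITER //
CREATE TRIGGER amendments_before_insert
BEFORE INSERT ON amendments
FOR EACH ROW
BEGIN
    -- Only generate a value when the caller passes 0 (or NULL) for amendment_id.
    IF NEW.amendment_id IS NULL OR NEW.amendment_id = 0 THEN
        SET NEW.amendment_id = (
            SELECT IFNULL(MAX(amendment_id), 0) + 1
            FROM amendments
            WHERE property_id = NEW.property_id
        );
    END IF;
END//
DELIMITER ;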
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
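As a sketch of that stored procedure route in MySQL (again using the hypothetical amendments table from above; the composite primary key still guards against duplicates under concurrency, so callers should retry on a duplicate-key error):

DELIMITER //
CREATE PROCEDURE add_amendment(IN p_property_id INT, IN p_details TEXT)
BEGIN
    -- Compute the next per-property amendment_id and insert it in one statement.
    INSERT INTO amendments (property_id, amendment_id, details)
    SELECT p_property_id, IFNULL(MAX(amendment_id), 0) + 1, p_details
    FROM amendments
    WHERE property_id = p_property_id;
END//
DELIMITER ;

CALL add_amendment(10, 'first amendment for property 10');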
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.

mysql historical data and record id

I am setting up a new part of an application with historical data requirements for the transactions table in MySQL. Originally, in the old version, transactions were not historical, with a structure like this:
id|buyerid|prodid|price|status
And other fields, with the id being referenced in links to access the Transaction Details page, as well as used as a foreign key in other tables across the application to reference particular transactions for various purposes.
Now the requirement is to answer reporting questions like "Show all transactions that had a particular status in Feb 2014" AND "What did a transaction look like in Feb 2014".
The new design I'm testing at the moment is below:
id|buyerid|prodid|price|status|active|start_date|end_date
Where active is used to indicate the latest record and start_date is when it is created. No records are to be modified; instead, end_date is populated and a new record is created with the same details plus the modification.
Now the question is: what to do about the transaction id field? Because in this new design it is more of a history id, and it cannot be used as a foreign key across the application since it is going to change with every update.
I can think of two options:
Create a separate table, transaction_ids, with just one column: an auto-increment primary key tid, plus a foreign key column tid in the main transactions table. Every time a brand new transaction is created, insert into the ids table and use that id as the tid to trace this particular transaction across the system.
Use the buyerid and prodid combination, which is always unique in my application - no buyer can get the same product twice.
Is the second solution better? Does anyone know of a better way to handle this?
What you are trying to achieve is called Event Sourcing.
Think in terms of events changing the status of your transaction, rather than tracing the status itself in time.
You still have your transaction with its own primary key, and you rebuild the current (or past) status applying each event.
I would also suggest you start coding your business models first, and only after that think about persistence and the best way to map it to a database.
The second solution looks better, although I will say that there is a lot of ambiguity in your question.
I am saying the second solution is better because the transaction_ids table you are talking about in solution 1 is basically REDUNDANT. It does not serve any purpose. Even if the transaction id repeats itself in the transactions table, that does not mean you need a separate table to generate the ids and set up a PK-FK relation. Most probably you will still be querying the data by user-id and prod-id and not by transaction-id.
Basically what you need is some kind of audit history table where you insert a record for every operation/transaction/modification done and capture some basic details like - Username, Date/time, old value, new value etc. You do not need status or start date and end date columns. Once a record is inserted in this audit history table then it is never going to be touched again.
You will have to design your report carefully.
Taking the two previous answers into consideration, here is the solution I will go with: all of the data updates in my application come through one single function, which is already set up to audit particular fields of my choosing, so I will mark the transaction status to be audited among the others. The table structure for the audit table is similar to this:
|id|table|table_id|column|old_val|new_val|who|when|
Except that there is slightly more advanced object mapping via object IDs instead of a simple table name. I can then use this data in a JOIN to the main (normal, non-historical) transactions table to provide the required reporting.
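For reference, the reporting join described here could look something like the following sketch. The table and column names are illustrative (the audit columns follow the layout above, with table, column and when renamed to avoid reserved words):

SELECT t.id, a.old_val, a.new_val, a.who, a.changed_at
FROM transactions AS t
JOIN audit_history AS a
  ON  a.table_name  = 'transactions'
  AND a.table_id    = t.id
  AND a.column_name = 'status'
WHERE a.changed_at >= '2014-02-01'
  AND a.changed_at <  '2014-03-01';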

Build PHP function to retrieve a variety of mySQL database queries and correctly traverse through multiple tables via their foreign key relationship

I am trying to build a robust PHP function that allows me to traverse my normalized database. My MySQL database has 6 tables with the following column names (I am only including the primary and foreign keys, as well as some limited table columns, for simplicity) so that you can see how they are related.
tableA:
partID (primary key)
tableABJunction
itemID (foreign key)
partID (foreign key)
tableB
itemID (primary key)
itemName
sales
customerID (foreign key)
itemID (foreign key)
partDate
itemID (foreign key)
customer
customerID (primary key)
nameFirst
nameLast
When I need to generate a query such as "What are the names of the customers that ordered itemID = 12?", I first have to query the sales table for all customerIDs where itemID = 12 and then query the customer table to find out their first and last names. Sometimes I may need to perform a query where I have to return data from all 6 tables, based on a query asking for all information pertaining to a customer named John Smith. Is there any easy way to build a function to handle this variety of queries, without having to build a query for every possible type of search?
Currently, my approach is to pass the following to php via AJAX:
web_conditionArray (contains the column names and values of the data provided, such as nameFirst => 'John', nameLast => 'Smith'); web_resultArray (contains the table name and the columns that I am requesting: sales => 'itemID, itemName').
The issue I am having with this approach is finding a way to store the relationships between all of the MySQL tables and their foreign keys, so that my PHP program knows how to link the tables together and run the correct query to get from the data provided for one table to the data requested from another. Any suggestions or a better way to solve this? I was initially thinking of a doubly linked list, but the flow from table to table is not linear, given that there is a fork where tableB links to both the sales and partDate tables.
I tried to be as specific as I could in describing this situation without writing a novel; however, please let me know if you need any additional information to refine my question further.
Looking at your table structure, I imagine it would be possible to construct logic to calculate the relationships between tables, and dynamically construct queries, but it seems to me that that would be far more work than manually constructing queries for your particular database. I'm assuming that your tables have many more fields in them, but that you've only included the most important, and have definitely included all primary and foreign keys.
Based on that, you have only three information objects in your database: Parts, Items and Customers. You should, therefore, not need more than 12 manually constructed queries to make your system work. You just need to ensure that you simplify your queries to work with whole information objects, and use the PHP layer to filter them later.
So, you reduce your query logic to:
"Fetch me all [Parts, Items or Customers] (and possibly also all [Parts, Items or Customers]) related to [Part, Item or Custromer] (and possibly [Part, Item or Customer])"
This results in the following queries:
All Customers for a Part
All Customers for an Item
All Customers for a Part and an Item
All Items for a Part
All Items for a Customer
All Items for a Part and a Customer
All Parts for an Item
All Parts for a Customer
All Parts for a Customer and an Item
All Parts and Customers for an Item
All Customers and Items for a Part
All Items and Parts for a Customer
(This is the full list of logical relationships - some may not make any sense practically, which makes your life easier)
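As an illustration, the second of those ("All Customers for an Item") would be a fixed query along these lines, using the table and column names from the question:

SELECT DISTINCT c.customerID, c.nameFirst, c.nameLast
FROM customer AS c
JOIN sales AS s ON s.customerID = c.customerID
WHERE s.itemID = 12;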
So, your PHP script needs to perform the following tasks:
Identify which object(s) are required for the criteria of the query. This is based on the fields supplied.
Construct a WHERE clause for your query which identifies the primary key for the criteria objects from the fields passed.
Identify which object(s) are required for the result of the query, based on the fields requested.
Select the query based on the criteria and return objects, and insert the constructed WHERE clause.
Perform the query, extracting all information available about the requested objects
Filter the results, extracting only the required information
Return the final results.
First, know that my answer will most likely be downvoted to hell (as this methodology is constantly downvoted despite its correctness). DBAs want you to believe that just because a complex query can be done with a SQL statement, it should be (like how server-siders think everything client-side should be done with server-side code, or how client-siders think layouts should be done client-side instead of with CSS). No. Complex queries are for people sitting at command lines needing to come up with on-demand data grabbing for specific, non-routine reasons. For processing speed, SELECTing, UPDATEing, and DELETEing should always be done off the PK server-side.
It sounds like you have a set of legitimately large tables.
Assuming it's large and speed is the primary concern (and not development time), use only a primary key and no other indexes, because the more indexes you have, the more the database has to maintain them, when really the comparisons that DBAs would have you do are faster server-side.
The primary key will take some finagling, but it's the most important thing after data types and lengths. For instance, the non-FK, independent tables like tableA, tableB, and customer should probably have an auto-increment INT PK (generally, remember that computers think in terms of integers), but the ones with multiple FKs should probably have no auto-increment INT and instead a composite PK, with the less variant SELECTed FK first. For example, on my site I store vote totals on links by userID and linkID. If a user is logged in, they'll need to know how many votes they've placed on a link, so the userID is the one less likely to change, and that's first in my PK on that table. Counting this on demand, database-side or server-side, was a performance nightmare.
For just a few lines of code, you will GREATLY improve speed. Sorting on the PK via PHP will cut latency by 50%. Absorbing JOINs into PHP will decrease the rate of latency spikes. Having no on-demand MySQL calculations will keep your site from becoming paralyzed.
If you step away from the dogma that just because a SQL statement can get you the results you should use a SQL statement instead of a server-side language (C++ being the fastest), you'll see performance skyrocket.
If you can be more specific with the tables you're trying to obfuscate, I can get more specific, but you probably get the idea.
AJAX has changed the game and forced a refocus. CSS for layouts; JS for client-side programming; server-side code for... server-side processing; the database for storing everything that lasts longer than a moment.
Bring on the downvotes! LOL

Database with 40000+ records per day

I am creating a database for keeping track of water usage per person for a city in South Florida.
There are around 40000 users, each one uploading daily readouts.
I was thinking of ways to set up the database, and it would seem easier to give each user a separate table. This should ease the download of data because the server will not have to sort through a table with tens of millions of entries.
Am I false in my logic?
Is there any way to index table names?
Are there any other ways of setting up the DB to both raise the speed and keep the layout simple enough?
-Thank you,
Jared
p.s.
The essential data for the readouts are:
-locationID (table name in my idea)
-Reading
-ReadDate
-ReadTime
p.p.s. During this conversation, I uploaded 5k tables and the server froze. ~.O
Thanks for your help, y'all.
Setting up thousands of tables is not a good idea. You should maintain one table and put all entries in that table. MySQL can handle a surprisingly large amount of data. The biggest issue that you will encounter is the number of queries you can handle at a time, not the size of the database. For columns holding numbers, use INT with the UNSIGNED attribute; for columns holding text, use VARCHAR of an appropriate size (or TEXT if the text is large).
Handling users
If you need to identify records with users, set up another table that might look something like this:
user_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
name VARCHAR(100) NOT NULL
When you need to link a record to a user, just reference the user's user_id. For the record information I would set up the SQL something like:
id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
u_id INT(10) UNSIGNED NOT NULL
reading - I'm not sure what your readings look like; if it's a number use INT, if it's text use VARCHAR
read_time TIMESTAMP
You can also consolidate the date and time of the reading to a TIMESTAMP.
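Putting that together, a sketch of the two tables (the table names and the reading type are assumptions):

CREATE TABLE users (
    user_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE readings (
    id        INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    u_id      INT(10) UNSIGNED NOT NULL,
    reading   INT UNSIGNED NOT NULL,       -- assuming a numeric readout
    read_time TIMESTAMP NOT NULL,
    KEY idx_user_time (u_id, read_time)    -- supports per-user, per-date queries
) ENGINE=InnoDB;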
Do NOT create a separate table for each user.
Keep indexes on the columns that identify a user and on any other common constraints such as date.
Think about how you want to query the data at the end. How on earth would you sum up the data from ALL users for a single day?
If you are worried about the primary key, I would suggest a composite key of (LocationID, Date).
Edit: Lastly, (and I do mean this in a nice way) but if you are asking these sorts of questions about database design, are you sure that you are qualified for this project? It seems like you might be in over your head. Sometimes it is better to know your limitations and let a project pass by, rather than implement it in a way that creates too much work for you and folks aren't satisfied with the results. Again, I am not saying don't, I am just saying have you asked yourself if you can do this to the level they are expecting. It seems like a large amount of users constantly using it. I guess I am saying that learning certain things while at the same time delivering a project to thousands of users may be an exceptionally high pressure environment.
Generally speaking, tables should represent sets of things. In your example it's easy to identify the sets you have: users and readouts; so the theoretically best structure would be having those two tables, where the readout entries have a reference to the id of the user.
MySQL can handle very large amounts of data, so your best bet is to just try the user-readouts structure and see how it performs. Alternatively, you may want to look into a document-based NoSQL database such as MongoDB or CouchDB, since storing readout reports as individual documents could be a good choice as well.
If you create a summary table that contains the monthly total per user, surely that would be the primary usage of the system, right?
Every month, you crunch the numbers and store the totals in a second table. You can prune the log table on a rolling 12-month period, i.e., the old data can be stuffed in the corner to keep the indexes smaller, since you'll only need to access it when the city is accused of fraud.
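A sketch of that monthly crunch, assuming hypothetical readings(user_id, reading, read_time) and monthly_totals tables:

INSERT INTO monthly_totals (user_id, month_start, total_usage)
SELECT user_id, '2014-02-01', SUM(reading)
FROM readings
WHERE read_time >= '2014-02-01'
  AND read_time <  '2014-03-01'
GROUP BY user_id;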
So exactly how you store the daily readouts isn't that big of a concern that you need to be freaking out about it. Giving each user his own table is not the proper solution. If you have tons and tons of data, then you might want to consider sharding via something like MongoDB.

Purpose of Secondary Key

What is the purpose of the Secondary key? Say I have a table that logs down all the check-ins (similar to Foursquare), with columns id, user_id, location_id, post, time, and there can be millions of rows, many people have stated to use secondary keys to speed up the process.
Why does this work? And should both user_id and location_id be secondary keys?
I'm using MySQL, btw...
Edit: There will be a page that lists/calculates all the check-ins for a particular user, and another page that lists all the users who have checked in to a particular location.
MySQL queries:
Type 1
SELECT location_id FROM checkin WHERE user_id = 1234
SELECT user_id FROM checkin WHERE location_id = 4321
Type 2
SELECT COUNT(location_id) as num_users FROM checkin
SELECT COUNT(user_id) as num_checkins FROM checkin
The key (also called an index) is for speeding up queries. If you want to see all check-ins for a given user, you need a key on the user_id field. If you want to see all check-ins for a given location, you need an index on the location_id field. You can read more in the MySQL documentation.
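In MySQL terms that is simply the following (the index names are arbitrary):

ALTER TABLE checkin
    ADD INDEX idx_user (user_id),
    ADD INDEX idx_location (location_id);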
I want to comment on your question and your examples.
Let me just suggest strongly to you that since you are using MySQL you make sure that your tables are using the innodb engine type for many reasons you can research on your own.
One important feature of InnoDB is that you have referential integrity. What does that mean? In your checkin table, you have a foreign key of user_id which is the primary key of the user table. With referential integrity, MySQL will not let you insert a row with a user_id that doesn't exist in the user table. Using MyISAM, you can. That alone should be enough to make you want to use the innodb engine.
To your question about keys/indexes, essentially when a table is defined and a key is declared for a column or some combination of columns, mysql will create an index.
Indexes are essential for performance as a table grows with the insert of rows.
All relational databases and document databases depend on an implementation of B-tree indexing. What B-trees are very good at is finding an item (or determining it is not there) using a predictable number of lookups. So when people talk about the performance of a relational database, the essential building block of that is the use of B-tree indexes, which are created via KEY statements or with ALTER TABLE or CREATE INDEX statements.
To understand why this is, imagine that your user table was simply a text file, with one line per row, perhaps separated by commas. As you add a row, a new line in the text file gets added at the bottom.
Eventually you get to the point that you have 10,000 lines in the file.
Now you want to find out if you entered a line for one particular person with the last name of Smith. How can you find that out?
Without any sorting of the file, or a separate index, you have but one option: start at the first line and scan through every line in the table looking for a match. Even if you found a Smith, that might not be the only 'Smith' in the table, so you have to read the entire file from top to bottom every time you want to do this search.
Obviously as the table grows the performance of searching gets worse and worse.
In relational database parlance, this is known as a "table scan". The database has to start at the first row and scan through reading every row until it gets to the end.
Without indexes, relational databases still work, but they are highly dependent on IO performance.
With a Btree index, the rows you want to find are found in the index first. The indexes have a pointer directly to the data you want, so the table no longer needs to be scanned, but instead the individual data pages required are read. This is how a database can maintain adequate performance even when there are millions or 10's or 100's of millions of rows.
To really start to gain insight into how MySQL works, you need to get familiar with EXPLAIN EXTENDED ... and start looking at the explain plans for queries. Simple ones like those you've provided will have simple plans that show you how many rows are being examined to get a result and whether or not they are using one or more indexes.
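For example, for the first query in the question:

EXPLAIN EXTENDED
SELECT location_id FROM checkin WHERE user_id = 1234;

The key column of the output shows which index (if any) is used, and the rows column estimates how many rows will be examined.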
For your summary queries, indexes are not helpful because you are doing a COUNT(). The table will need to be scanned when you have no other criteria constraining the search.
I did notice what looks like a mistake in your summary queries. Just based on your labels, I would think that these are the right queries to get what you want, given your column alias names:
SELECT COUNT(DISTINCT user_id) as num_users FROM checkin
SELECT COUNT(*) as num_checkins FROM checkin
This is yet another reason to use InnoDB, which when properly configured has a data cache (the InnoDB buffer pool) similar to other RDBMSs like Oracle and SQL Server. MyISAM doesn't cache data at all, so if you are repeatedly running the same sorts of queries that might require a lot of IO, MySQL will have to do all that data reading work over and over, whereas with InnoDB that data could very well be sitting in cache memory and the result returned without having to go back and read from storage.
Primary vs Secondary
There really is no such distinction internally. A primary key is special because it allows the database to find one single row. Primary keys must be unique, and to reflect that, the associated B-tree index is unique, which simply means that it will not allow two keys with the same data to exist in the index.
Whether or not an index is unique is an excellent tool that allows you to maintain the consistency of your database in many other cases. Let's say you have an 'employee' table with the SS_Number column to store social security #. It makes sense to have an index on that column if you want the system to support finding an employee by SS number. Without an index, you will tablescan. But you also want to have that index be unique, so that once an employee with a SS# is inserted, there is no way the database will let you enter a duplicate employee with the same SS#.
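A sketch of such a unique secondary index (hypothetical employee table):

CREATE UNIQUE INDEX idx_ss_number ON employee (SS_Number);

Once that index exists, a second INSERT with the same SS_Number fails with a duplicate-key error.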
But to demystify this for you: when you define your tables and declare keys, these indexes are simply created for you and used automagically in most cases.
It's when you aren't dealing with keys (primary or foreign), as in the example of usernames, first and last names, SS#s, etc., that you also need to be aware of how to create an index, because you are searching (using WHERE clause criteria) on one or more columns that aren't keys.
