Database with 40000+ records per day


I am creating a database for keeping track of water usage per person for a city in South Florida.
There are around 40000 users, each one uploading daily readouts.
I was thinking of ways to set up the database, and it would seem easier to give each user a separate table. This should speed up data retrieval because the server will not have to sort through a table with tens of millions of entries.
Is my logic wrong?
Is there any way to index table names?
Are there any other ways of setting up the DB to both raise the speed and keep the layout simple enough?
-Thank you,
Jared
p.s.
The essential data for the readouts are:
-locationID (table name in my idea)
-Reading
-ReadDate
-ReadTime
p.p.s. During this conversation, I uploaded 5k tables and the server froze. ~.O
Thanks for your help, y'all.

Setting up thousands of tables is not a good idea. You should maintain one table and put all entries in that table. MySQL can handle a surprisingly large amount of data. The biggest issue that you will encounter is the number of queries that you can handle at a time, not the size of the database. For columns that hold numbers, use INT with the UNSIGNED attribute; for columns that hold text, use VARCHAR of an appropriate size (or TEXT if the text is large).
Handling users
If you need to identify records with users, set up another table that might look something like this:
user_id INT(10) UNSIGNED AUTO_INCREMENT PRIMARY KEY
name VARCHAR(100) NOT NULL
When you need to link a record to the user, just reference the user's user_id. For the record information I would set up the SQL something like:
id INT(10) UNSIGNED AUTO_INCREMENT PRIMARY KEY
u_id INT(10) UNSIGNED
reading (I'm not sure what your reading looks like. If it's a number use INT; if it's text use VARCHAR.)
read_time TIMESTAMP
You can also consolidate the date and time of the reading into a single TIMESTAMP.
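As a rough illustration of the single-table layout described above, here is a minimal Python sketch. It uses SQLite in place of MySQL purely so that it runs without a server; the index name, the helper functions and the sample rows are assumptions made for the example, not part of the original answer.

```python
import sqlite3

# SQLite stands in for MySQL here; column names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY AUTOINCREMENT,
        name    TEXT NOT NULL
    );
    CREATE TABLE readings (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        u_id      INTEGER NOT NULL REFERENCES users(user_id),
        reading   INTEGER NOT NULL,
        read_time TEXT NOT NULL  -- date and time consolidated in one column
    );
    -- Assumed index: most lookups are expected to filter by user and time.
    CREATE INDEX idx_readings_user_time ON readings (u_id, read_time);
""")


def add_reading(user_id, reading, read_time):
    """Insert one readout for a user into the shared readings table."""
    conn.execute(
        "INSERT INTO readings (u_id, reading, read_time) VALUES (?, ?, ?)",
        (user_id, reading, read_time),
    )


def readings_for_user(user_id):
    """Return (read_time, reading) pairs for one user, oldest first."""
    return conn.execute(
        "SELECT read_time, reading FROM readings "
        "WHERE u_id = ? ORDER BY read_time",
        (user_id,),
    ).fetchall()


# Made-up sample data: two users, three readouts.
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Alice",), ("Bob",)])
add_reading(1, 120, "2011-03-01 08:00:00")
add_reading(2, 95, "2011-03-01 08:05:00")
add_reading(1, 130, "2011-03-02 08:00:00")
```

Every user's rows live in the same table, and the lookup for one user is just a WHERE clause on u_id.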

Do NOT create a separate table for each user.
Keep indexes on the columns that identify a user and on any other common constraints such as date.
Think about how you want to query the data at the end. How on earth would you sum up the data from ALL users for a single day?
If you are worried about the primary key, I would suggest a composite key of (LocationID, Date).
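To make the composite-key suggestion and the "sum for ALL users on a single day" question concrete, here is a small Python sketch. SQLite is used as a stand-in for MySQL, and the table name, index name and sample values are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readouts (
        location_id INTEGER NOT NULL,
        read_date   TEXT    NOT NULL,   -- 'YYYY-MM-DD'
        reading     INTEGER NOT NULL,
        PRIMARY KEY (location_id, read_date)  -- one readout per location per day
    );
    -- Separate index on the date, since daily totals filter on it alone.
    CREATE INDEX idx_readouts_date ON readouts (read_date);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO readouts (location_id, read_date, reading) VALUES (?, ?, ?)",
    [(1, "2011-03-01", 120), (2, "2011-03-01", 95), (1, "2011-03-02", 130)],
)


def daily_total(read_date):
    """Total usage across all locations for one day (0 if there is no data)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(reading), 0) FROM readouts WHERE read_date = ?",
        (read_date,),
    ).fetchone()
    return row[0]


total_first_day = daily_total("2011-03-01")
total_empty_day = daily_total("2011-03-03")
```

With one table this is a single query; with a table per user it would need one query per table.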
Edit: Lastly, (and I do mean this in a nice way) but if you are asking these sorts of questions about database design, are you sure that you are qualified for this project? It seems like you might be in over your head. Sometimes it is better to know your limitations and let a project pass by, rather than implement it in a way that creates too much work for you and folks aren't satisfied with the results. Again, I am not saying don't, I am just saying have you asked yourself if you can do this to the level they are expecting. It seems like a large amount of users constantly using it. I guess I am saying that learning certain things while at the same time delivering a project to thousands of users may be an exceptionally high pressure environment.

Generally speaking, tables should represent sets of things. In your example, it's easy to identify the sets you have: users and readouts. Therefore the theoretically best structure would be to have those two tables, where each readout entry holds a reference to the id of the user.
MySQL can handle very large amounts of data, so your best bet is to just try the user-readouts structure and see how it performs. Alternatively you may want to look into a document-based NoSQL database such as MongoDB or CouchDB, since storing readout reports as individual documents could be a good choice as well.

If you create a summary table that contains the monthly total per user, surely that would be the primary usage of the system, right?
Every month, you crunch the numbers and store the totals into a second table. You can prune the log table on a rolling 12-month period, i.e. the old data can be stuffed in the corner to keep the indexes smaller, since you'll only need to access it when the city is accused of fraud.
So exactly how you store the daily readouts isn't that big of a concern that you need to be freaking out about it. Giving each user his own table is not the proper solution. If you have tons and tons of data, then you might want to consider sharding via something like MongoDB.
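The monthly crunch-and-prune routine described in this answer could look roughly like the following Python sketch. SQLite stands in for MySQL, and the table names, function names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (
        u_id      INTEGER NOT NULL,
        reading   INTEGER NOT NULL,
        read_date TEXT    NOT NULL          -- 'YYYY-MM-DD'
    );
    CREATE TABLE monthly_totals (
        u_id  INTEGER NOT NULL,
        month TEXT    NOT NULL,             -- 'YYYY-MM'
        total INTEGER NOT NULL,
        PRIMARY KEY (u_id, month)
    );
""")


def roll_up_month(month):
    """Store one total per user for the given 'YYYY-MM' month (run once per month)."""
    conn.execute(
        "INSERT INTO monthly_totals (u_id, month, total) "
        "SELECT u_id, ?, SUM(reading) FROM readings "
        "WHERE substr(read_date, 1, 7) = ? GROUP BY u_id",
        (month, month),
    )


def prune_before(cutoff_date):
    """Delete raw readings older than the cutoff; returns how many rows went."""
    cur = conn.execute("DELETE FROM readings WHERE read_date < ?", (cutoff_date,))
    return cur.rowcount


# Made-up sample data.
conn.executemany(
    "INSERT INTO readings (u_id, reading, read_date) VALUES (?, ?, ?)",
    [(1, 100, "2010-01-15"), (1, 50, "2010-01-20"),
     (2, 70, "2010-01-16"), (1, 30, "2011-02-01")],
)
roll_up_month("2010-01")
totals = conn.execute(
    "SELECT u_id, total FROM monthly_totals WHERE month = '2010-01' ORDER BY u_id"
).fetchall()
pruned = prune_before("2010-02-01")
remaining = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

In a real deployment the pruned rows would more likely be moved to an archive than deleted outright, as the answer suggests.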

Related

SQL Query speed and refinement

So I have just set up a database which holds only one table with the following fields:
key_value: holds 6 digit code for a key
redeemed: boolean for if the key is redeemed
redeemed_by: who redeemed it
redeemed_date: when it was redeemed
software_name: name of the software the key relates to
I basically start with an empty database and then when someone purchases through PayPal, they get their own key and it is added to the database. After this they open an app which lets them input their code which is then searched for in the database and marked as redeemed so it can't be used again - this results in both redeemed and unredeemed codes being in one table.
If I were to reach a good few thousand purchases, would this cause the database to slow down significantly, or maybe even crash? What if it were a bigger number, say 10,000?
What exactly would be a good solution for this, even if I had another table of redeemed keys, it would have to look in the redeemed table to see if it was redeemed?
Thanks for any answer, I am still learning databases and SQL!

I think your design is sound. You might want to add indexes based on what queries you will be running. key_value sounds like a good primary key which would also serve as an index for updating redeemed.
As noted by Marc B, the hardware is your only likely consideration for performance.
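For illustration, here is a minimal Python sketch of the single-table design with key_value as the primary key. SQLite is used in place of whatever database the asker runs, and the table name, the redeem helper and the sample key are assumptions. The UPDATE only matches an unredeemed row, so a key cannot be redeemed twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE license_keys (
        key_value     TEXT PRIMARY KEY,          -- primary key doubles as the lookup index
        redeemed      INTEGER NOT NULL DEFAULT 0,
        redeemed_by   TEXT,
        redeemed_date TEXT,
        software_name TEXT NOT NULL
    )
""")


def redeem(key_value, user, when):
    """Mark a key as redeemed; returns True only if it existed and was unused."""
    cur = conn.execute(
        "UPDATE license_keys "
        "SET redeemed = 1, redeemed_by = ?, redeemed_date = ? "
        "WHERE key_value = ? AND redeemed = 0",
        (user, when, key_value),
    )
    return cur.rowcount == 1


# Made-up sample key.
conn.execute(
    "INSERT INTO license_keys (key_value, software_name) VALUES (?, ?)",
    ("483920", "ExampleApp"),
)
first_try = redeem("483920", "jo", "2011-05-01")
second_try = redeem("483920", "sam", "2011-05-02")
unknown_key = redeem("000000", "jo", "2011-05-01")
```

Because the lookup goes through the primary key, the cost of each redemption does not depend on how many redeemed rows share the table.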

I would use two tables for this: One for what you have spec'ed out, but another as an archive table with a job that migrates over redeemed/expired records on a regular basis.
Reasoning: The primary purpose of the table is for the benefit of redemptions, not for use as an archive. Over time, as more and more redeemed records are found in the table, the performance for lookups of unredeemed records starts getting worse and worse because of all the "deadwood" in the table. (Do you think eBay houses all active and completed auctions in one table?)
If you still absolutely need a "one-table" solution, you can easily create a view that merges the two tables.
Also, if you set up a proper primary key, lookups will use the index instead of a table scan, so performance will not degrade quickly (for a while, at least) as the record volume grows. Table scans are what you are alluding to when you worry about slowdowns.

Proper unique news article view counter approach

I have looked at different ways to approach this but I would like a method which does not allow people to get around it. Just need a simple, light-weight method to count the number of views off different news articles which are stored in a database:
id | title | body | date | views
1 Stack Overflow 2010-01-01 23
Session
- Could they not just clear browser data and reload page for another view? Any way to stop this?
Database table of ip addresses
- Tons of entries, may hinder performance
Log file
- Same issue as database however I've seen lots of examples
For a performance critical system and for ensuring accuracy, which method should I look into further?
Thanks.

If you're looking to figure out how many unique visitors you have to a given page, then you need to keep information that is unique to each visitor somewhere in your application to reference.
IP addresses are definitely the "safest" way to go, as a user would have to jump through a good many hoops to manually change their IP address. That being said you would have to store a pretty massive amount of data if this is a commercial web-site for each and every page.
What is more reasonable to do is to store the information in a cookie on the client's machine. Sure if your client doesn't allow cookies you will have a skewed number and sure the user can wipe their browser history and you will have a skewed number but overall your number should be relatively accurate.
You could potentially keep this information cached or in session-level variables, but then if your application crashes or restarts you're SOL.
If you REALLY need to have nearly 100% accurate numbers then your best bet is to log the IP addresses of each page's unique visitors. This will ensure you the most accurate count. This is pretty extreme though and if you can take a ~5+% hit in accuracy then I would definitely go for the cookies.

I think that to keep it lightweight you should use someone else's processing power, so for that reason you should sign up to Google Analytics and insert their code into your pages that you want to track.
If you want more accuracy then track each database request in the database itself; or employ a log reading tool that then drops summaries of page reads into a database or file system each day.

Another suggestion:
When the user visits your website log their IP address in a table and drop a cookie with a unique ID. Store this unique ID in a table, along with a reference to the IP address record. This way you are able to figure out a more accurate count (and make adjustments to your final number)
Setup an automated task to create summary tables - making querying the data much faster. This will also allow you to prune the data on a regular basis.
If you're happy to sacrifice better accuracy then this might be a solution:
This would be the "holding" table, which contains the raw data. This is not the table you'd use to query data from; it'd just be for writing to. You'd run through this whole table on a daily/weekly/monthly basis. Yet again, you may need indexes dependent on how you wish to prune this.
CREATE TABLE `article_views` (
`article_id` int(10) unsigned NOT NULL,
`doy` smallint(5) unsigned NOT NULL,
`ip_address` int(10) unsigned NOT NULL
) ENGINE=InnoDB
You'd then have a summary table, which you would update on a daily/weekly or monthly basis which would be super fast to query.
CREATE TABLE `summary_article_uniques_2011` (
`article_id` int(10) unsigned NOT NULL,
`doy` smallint(5) unsigned NOT NULL,
`unique_count` int(10) unsigned NOT NULL,
PRIMARY KEY (`article_id`,`doy`),
KEY(`doy`)
) ENGINE=InnoDB
Example queries:
Unique count for a specific article on a day:
SELECT unique_count FROM summary_article_uniques_2011 WHERE article_id=? AND doy=" . date('z') . "
Counts per day for a specific article:
SELECT unique_count FROM summary_article_uniques_2011 WHERE article_id=?
Counts across the entire site, most popular articles today:
SELECT article_id FROM summary_article_uniques_2011 WHERE doy=? ORDER BY unique_count DESC LIMIT 10
Note: the ORDER BY in this query cannot use an index. If you are going to have a lot of articles, your best bet is to add another summary table or an index that includes unique_count.
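The step that moves data from the holding table into the summary table is not spelled out above, so here is one way it might look, sketched in Python with SQLite standing in for MySQL. The function names and the sample traffic are assumptions. IP addresses are stored as integers, mirroring the int(10) unsigned column.

```python
import ipaddress
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article_views (
        article_id INTEGER NOT NULL,
        doy        INTEGER NOT NULL,
        ip_address INTEGER NOT NULL
    );
    CREATE TABLE summary_article_uniques_2011 (
        article_id   INTEGER NOT NULL,
        doy          INTEGER NOT NULL,
        unique_count INTEGER NOT NULL,
        PRIMARY KEY (article_id, doy)
    );
""")


def ip_to_int(ip):
    """Convert a dotted IPv4 string to the integer stored in ip_address."""
    return int(ipaddress.IPv4Address(ip))


def record_view(article_id, doy, ip):
    """Write-only path: append one raw view to the holding table."""
    conn.execute(
        "INSERT INTO article_views (article_id, doy, ip_address) VALUES (?, ?, ?)",
        (article_id, doy, ip_to_int(ip)),
    )


def summarise_day(doy):
    """Collapse one day's raw views into unique counts, then prune the raw rows."""
    conn.execute(
        "INSERT OR REPLACE INTO summary_article_uniques_2011 "
        "(article_id, doy, unique_count) "
        "SELECT article_id, doy, COUNT(DISTINCT ip_address) "
        "FROM article_views WHERE doy = ? GROUP BY article_id, doy",
        (doy,),
    )
    conn.execute("DELETE FROM article_views WHERE doy = ?", (doy,))


def top_articles(doy, limit=10):
    """Most popular articles for a day, read from the summary table only."""
    return conn.execute(
        "SELECT article_id, unique_count FROM summary_article_uniques_2011 "
        "WHERE doy = ? ORDER BY unique_count DESC, article_id LIMIT ?",
        (doy, limit),
    ).fetchall()


# Made-up traffic: article 1 gets a repeat visit from the same address.
for article_id, ip in [(1, "10.0.0.1"), (1, "10.0.0.1"), (1, "10.0.0.2"), (2, "10.0.0.1")]:
    record_view(article_id, 41, ip)
summarise_day(41)
ranking = top_articles(41)
holding_rows_left = conn.execute("SELECT COUNT(*) FROM article_views").fetchone()[0]
```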

How to design a generic database whose layout may change over time?

Here's a tricky one - how do I programmatically create and interrogate a database whose contents I can't really foresee?
I am implementing a generic input form system. The user can create PHP forms with a WYSIWYG layout and use them for any purpose he wishes. He can also query the input.
So, we have three stages:
a form is designed and generated. This is a one-off procedure, although the form can be edited later. This designs the database.
someone or several people make use of the form - say for daily sales reports, stock keeping, payroll, etc. Their input to the forms is written to the database.
others, maybe management, can query the database and generate reports.
Since these forms are generic, I can't predict the database structure - other than to say that it will reflect HTML form fields and consist of the data input from a collection of edit boxes, memos, radio buttons and the like.
Questions and remarks:
A) how can I best structure the database, in terms of tables and columns? What about primary keys? My first thought was to use the control name to identify each column, then I realized that the user can edit the form and rename, so that maybe "name" becomes "employee" or "wages" becomes ":salary". I am leaning towards a unique number for each.
B) how best to key the rows? I was thinking of a timestamp to allow me to query and a column for the row Id from A)
C) I have to handle column rename/insert/delete. For deletion, I am unsure whether to delete the data from the database. Even if the user is not inputting it from the form any more, he may wish to query what was previously entered. Or there may be some legal requirements to retain the data. Any gotchas in column rename/insert/delete?
D) For the querying, I can have my PHP interrogate the database to get column names and generate a form with a list where each entry has a database column name, a checkbox to say if it should be used in the query and, based on column type, some selection criteria. That ought to be enough to build searches like "position = 'senior salesman' and salary > 50k".
E) I probably have to generate some fancy charts - graphs, histograms, pie charts, etc for query results of numerical data over time. I need to find some good FOSS PHP for this.
F) What else have I forgotten?
This all seems very tricky to me, but I am database n00b - maybe it is simple to you gurus?
Edit: please don't tell me not to do it. I don't have any choice :-(
Edit: in real life I don't expect column rename/insert/delete to be frequent. However it is possible that after running for a few months a change to the database might be required. I am sure this happens regularly. I fear that I have worded this question badly and that people think that changes will be made willy-nilly every 10 minutes or so.
Realistically, my users will define a database when they lay out the form. They might get it right first time and never change it - especially if they are converting from paper forms. Even if they do decide to change, this might only happen once or twice ever, after months or years - and that can happen in any database.
I don't think that I have a special case here, nor that we should be concentrating on change. Perhaps better to concentrate on linkage - what's a good primary key scheme? Say, perhaps, for one text input, one numerical and a memo?

"This all seems very tricky to me, but
I am database n00b - maybe it is
simple to you gurus?"
Nope, it really is tricky. Fundamentally what you're describing is not a database application, it is a database application builder. In fact, it sounds as if you want to code something like Google App Engine or a web version of MS Access. Writing such a tool will take a lot of time and expertise.
Google has implemented flexible schemas by using its BigTable platform. It allows you to flex the schema pretty much at will. The catch is, this flexibility makes it very hard to write queries like "position = 'senior salesman' and salary > 50k".
So I don't think the NoSQL approach is what you need. You want to build an application which generates and maintains RDBMS schemas. This means you need to design a metadata repository from which you can generate dynamic SQL to build and change the users' schemas and also generate the front end.
Things your metadata schema needs to store
For schema generation:
foreign key relationships (an EMPLOYEE works in a DEPARTMENT)
unique business keys (there can be only one DEPARTMENT called "Sales")
reference data (permitted values of EMPLOYEE.POSITION)
column data type, size, etc
whether column is optional (i.e NULL or NOT NULL)
complex business rules (employee bonuses cannot exceed 15% of their salary)
default value for columns
For front-end generation
display names or labels ("Wages", "Salary")
widget (drop down list, pop-up calendar)
hidden fields
derived fields
help text, tips
client-side validation (associated JavaScript, etc)
That last points to the potential complexity in your proposal: a regular form designer like Joe Soap is not going to be able to formulate the JS to (say) validate that an input value is between X and Y, so you're going to have to derive it using templated rules.
These are by no means exhaustive lists, it's just off the top of my head.
For primary keys I suggest you use a column of GUID datatype. Timestamps aren't guaranteed to be unique, although if you run your database on an OS which goes to six places (i.e. not Windows) it's unlikely you'll get clashes.
last word
'My first thought was to use the control name to identify each column, then I realized that the user can edit the form and rename, so that maybe "name" becomes "employee" or "wages" becomes ":salary". I am leaning towards a unique number for each.'
I have built database schema generators before. They are hard going. One thing which can be tough is debugging the dynamic SQL. So make it easier on yourself: use real names for tables and columns. Just because the app user now wants to see a form titled HEADCOUNT it doesn't mean you have to rename the EMPLOYEES table. Hence the need to separate the displayed label from the schema object name. Otherwise you'll find yourself trying to figure out why this generated SQL statement failed:
update table_11123
set col_55542 = 'HERRING'
where col_55569 = 'Bootle'
/
That way madness lies.
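A tiny Python sketch of the idea of keeping the displayed label separate from the schema object name might look like this. The metadata layout, the helper names and the integer surrogate key are all assumptions made for illustration (the answer itself suggests a GUID key), and SQLite stands in for the target database.

```python
import sqlite3

# Hypothetical metadata for one user-designed form: real column names for the
# schema, free-text labels for the front end.
FORM = {
    "table": "employees",
    "fields": [
        {"column": "position", "label": "Job title", "type": "TEXT", "required": True},
        {"column": "salary", "label": "Wages", "type": "INTEGER", "required": False},
    ],
}
ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL"}


def quote_ident(name):
    """Allow only plain identifiers, since names are spliced into dynamic SQL."""
    if not name.isidentifier():
        raise ValueError("unsafe identifier: " + repr(name))
    return '"' + name + '"'


def build_create_table(form):
    """Generate the CREATE TABLE statement from the metadata."""
    cols = ['"id" INTEGER PRIMARY KEY']
    for field in form["fields"]:
        if field["type"] not in ALLOWED_TYPES:
            raise ValueError("unsupported type: " + field["type"])
        not_null = " NOT NULL" if field["required"] else ""
        cols.append(quote_ident(field["column"]) + " " + field["type"] + not_null)
    return "CREATE TABLE " + quote_ident(form["table"]) + " (" + ", ".join(cols) + ")"


def rename_label(form, column, new_label):
    """Change what the user sees; the schema object name is left untouched."""
    for field in form["fields"]:
        if field["column"] == column:
            field["label"] = new_label


ddl = build_create_table(FORM)
conn = sqlite3.connect(":memory:")
conn.execute(ddl)

rename_label(FORM, "salary", "Salary")
ddl_after_rename = build_create_table(FORM)
```

Renaming a label changes only the metadata, so generated SQL keeps readable names such as employees.salary.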

In essence, you are asking how to build an application without specifications. Relational databases were not designed so that you can do this effectively. The common approach to this problem is an Entity-Attribute-Value design and for the type of system in which you want to use it, the odds of failure are nearly 100%.
It makes no sense for example, that the column called "Name" could become "Salary". How would a report where you want the total salary work if the salary values could have "Fred", "Bob", 100K, 1000, "a lot"? Databases were not designed to let anyone put anything anywhere. Successful database schemas require structure which means effort with respect to specifications on what needs to be stored and why.
Therefore, to answer your question, I would rethink the problem. The entire approach of trying to make an app that can store anything in the universe is not a recipe for success.

Like Thomas said, a relational database is not a good fit for your problem. However, you may want to take a look at NoSQL databases like MongoDB.

See this article:
http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/
for someone else's experience of your problem.

This is for A) & B), and is not something I have done but thought it was an interesting idea that Reddit put to use, see this link (look at Lesson 3):
http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html

Not sure about the database but for charts instead of using PHP for the charts, I recommend looking into using javascript (http://www.reynoldsftw.com/2009/02/6-jquery-chart-plugins-reviewed/). Advantages to this are some of the processing is offloaded to the client side for chart displays and they can be interactive.

The other respondents are correct that you should be very cautious with this approach because it is more complex and less performant than the traditional relational model - but I've done this type of thing to accommodate departmental differences at work, and it worked fine for the amount of use it got.
Basically I set it up like this, first - a table to store some information about the Form the user wants to create (obviously, adjust as you need):
--************************************************************************
-- Create the User_forms table
--************************************************************************
create table User_forms
(
form_id integer identity,
name varchar(200),
status varchar(1),
author varchar(50),
last_modifiedby varchar(50),
create_date datetime,
modified_date datetime
)
Then a table to define the fields to be presented on the form, including any limits and the order and page on which they are to be presented (my app presented the fields as a multi-page wizard type of flow).
--************************************************************************
-- Create the field configuration table to hold the entry field configuration
--************************************************************************
create table field_configuration
(
field_id integer identity,
form_id SMALLINT,
status varchar(1),
fieldgroup varchar(20),
fieldpage integer,
fieldseq integer,
fieldname varchar(40),
fieldwidth integer,
description varchar(50),
minlength integer,
maxlength integer,
maxval varchar(13),
minval varchar(13),
valid_varchars varchar(20),
empty_ok varchar(1),
all_caps varchar(1),
value_list varchar(200),
ddl_queryfile varchar(100),
allownewentry varchar(1),
query_params varchar(50),
value_default varchar(20)
);
Then my perl code would loop through the fields in order for page 1 and put them on the "wizard form" ... and the "next" button would present the page 2 fields in order etc.
I had javascript functions to enforce the limits specified for each field as well ...
Then a table to hold the values entered by the users:
--************************************************************************
-- Field to contain the values
--************************************************************************
create table form_field_values
(
session_Id integer identity,
form_id integer,
field_id integer,
value varchar(MAX)
);
That would be a good starting point for what you want to do, but keep an eye on performance as it can really slow down any reports if they add 1000 custom fields. :-)
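To show what reporting against such a value table involves, here is a hedged Python sketch that pivots the stored values back into columns so a filter like "position = 'senior salesman' and salary > 50k" can be applied. SQLite stands in for the original database. The entry_id column (grouping the values of one submission), the trimmed-down field_configuration table and the sample data are assumptions, not the answer's exact schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE field_configuration (
        field_id  INTEGER PRIMARY KEY,
        form_id   INTEGER NOT NULL,
        fieldname TEXT    NOT NULL
    );
    CREATE TABLE form_field_values (
        entry_id INTEGER NOT NULL,   -- assumed: one id per submitted form
        form_id  INTEGER NOT NULL,
        field_id INTEGER NOT NULL,
        value    TEXT,
        PRIMARY KEY (entry_id, field_id)
    );
""")
conn.executemany(
    "INSERT INTO field_configuration (field_id, form_id, fieldname) VALUES (?, ?, ?)",
    [(1, 7, "position"), (2, 7, "salary")],
)
conn.executemany(
    "INSERT INTO form_field_values (entry_id, form_id, field_id, value) VALUES (?, ?, ?, ?)",
    [(1, 7, 1, "senior salesman"), (1, 7, 2, "62000"),
     (2, 7, 1, "senior salesman"), (2, 7, 2, "48000"),
     (3, 7, 1, "clerk"), (3, 7, 2, "70000")],
)


def entries_matching(form_id, position, min_salary):
    """Pivot one row per field into one row per entry, then filter."""
    return conn.execute(
        """
        SELECT entry_id, position, salary FROM (
            SELECT v.entry_id AS entry_id,
                   MAX(CASE WHEN c.fieldname = 'position' THEN v.value END) AS position,
                   MAX(CASE WHEN c.fieldname = 'salary'   THEN v.value END) AS salary
            FROM form_field_values v
            JOIN field_configuration c ON c.field_id = v.field_id
            WHERE v.form_id = ?
            GROUP BY v.entry_id
        )
        WHERE position = ? AND CAST(salary AS INTEGER) > ?
        ORDER BY entry_id
        """,
        (form_id, position, min_salary),
    ).fetchall()


matches = entries_matching(7, "senior salesman", 50000)
```

Each extra field adds another CASE expression to the pivot, and every value is stored as text and must be cast before comparison, which is where the performance and complexity cost mentioned above comes from.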

Database Design

I'm trying to build out a mysql database design for a project. The problem is coming up with the best solution. Basically in my application, I will have to insert approximately 10-30 rows per user. The primary key will be a random CHAR(16) string. There will also be a datetime index, and an additional column (with an index) called "data".
Day to day, there will only be a heavy amount of inserts and lookups on the table. The lookups will always be joined based on the primary key (so joining those 10-30 rows per user).
I will at times need to be able to look at a few specific months (or a full year even) and use mysql GROUP BY functions on the "data" index as well.
At its current volume and estimates, I would expect the table to grow 9.3m rows/month. I do expect this to increase.
So my question comes down to this: mysql partitions, programmatic table separation, or another solution? and are things best separated by month or year? We are running on RHEL, so getting mysql 5.1 may be a bit of work, but if that's a better solution it may be worth going for.
innoDB has already been selected for this project. Day to day performance is the primary concern.

This doesn't answer your question, but it needs to be mentioned...
The primary key will be a random CHAR(16) string.
This is a Bad Idea. Use an UNSIGNED BIGINT column with AUTO_INCREMENT. No need to reinvent the wheel: you won't have to worry about key management or collisions that way.

Partition the data on the dates (and maybe additionally on the user, if it is per-user data and you have lots of users).
Then create a monthly table with the SUM, COUNT, AVG, etc that you need and the appropriate group by. You can partition that table as well (but dates probably won't be a meaningful partition)
Then create a yearly table like the monthly table.
Populate the monthly and yearly tables with REPLACE INTO ... SELECT ... statements.
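The REPLACE INTO ... SELECT step can be sketched as follows in Python, with SQLite standing in for MySQL (SQLite also accepts REPLACE INTO; partitioning itself is not shown). The table and column names and the sample rows are assumptions. Because the summary table is keyed on (month, data), re-running the refresh overwrites the existing rows instead of duplicating them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        created TEXT NOT NULL,              -- 'YYYY-MM-DD HH:MM:SS'
        data    TEXT NOT NULL
    );
    CREATE TABLE events_monthly (
        month     TEXT    NOT NULL,         -- 'YYYY-MM'
        data      TEXT    NOT NULL,
        row_count INTEGER NOT NULL,
        PRIMARY KEY (month, data)
    );
""")


def refresh_month(month):
    """Recompute the per-'data' counts for one month; safe to run repeatedly."""
    conn.execute(
        "REPLACE INTO events_monthly (month, data, row_count) "
        "SELECT ?, data, COUNT(*) FROM events "
        "WHERE strftime('%Y-%m', created) = ? GROUP BY data",
        (month, month),
    )


def monthly_rows(month):
    """Read the pre-aggregated counts for one month."""
    return conn.execute(
        "SELECT data, row_count FROM events_monthly WHERE month = ? ORDER BY data",
        (month,),
    ).fetchall()


# Made-up sample rows.
conn.executemany(
    "INSERT INTO events (created, data) VALUES (?, ?)",
    [("2011-01-05 10:00:00", "a"), ("2011-01-09 11:30:00", "a"),
     ("2011-01-20 09:15:00", "b"), ("2011-02-01 00:00:00", "a")],
)
refresh_month("2011-01")
refresh_month("2011-01")          # second run replaces rather than duplicates
first_pass = monthly_rows("2011-01")

conn.execute("INSERT INTO events (created, data) VALUES (?, ?)", ("2011-01-25 12:00:00", "b"))
refresh_month("2011-01")
second_pass = monthly_rows("2011-01")
```

A yearly table could be refreshed the same way, reading from the monthly table instead of the raw rows.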

Website: What is the best way to store a large number of user variables?

I'm designing a website using PHP and MySQL currently and as the site proceeds I find myself adding more and more columns to the users table to store various variables.
Which got me thinking, is there a better way to store this information? Just to clarify, the information is global, can be affected by other users so cookies won't work, also I'd lose the information if they clear their cookies.
The second part of my question is, if it does turn out that storing it in a database is the best way, would it be less expensive to have a large number of columns or rather to combine related columns into delimited varchar columns and then explode them in PHP?
Thanks!

In my experience, I'd rather get the database right than start adding comma separated fields holding multiple items. Having to sift through multiple comma separated fields is only going to hurt your program's efficiency and the readability of your code.
Also, if your table is growing too much, then perhaps you need to look into splitting it into multiple tables joined by foreign keys?

I'd create a user_meta table, with three columns: user_id, key, value.

I wouldn't go for the option of grouping columns together and exploding them. It's untidy work and very unmanageable. Instead, maybe try spreading those columns over a few tables and using InnoDB's transaction feature.
If you still dislike the idea of frequently updating the database, and if this method complies with what you're trying to achieve, you can use APC's caching function to store (cache) information "globally" on the server.

MongoDB (and its NoSQL cousins) are great for stuff like this.

The database is a perfectly fine place to store such data, as long as they're variables and not, say, huge image files. The database has all the optimizations and specifications for storing and retrieving large amounts of data. Anything you set up on file system level will always be beaten by what the database already has in terms of speed and functionality.
would it be less expensive to have a large number of columns or rather to combine related columns into delimited varchar columns and then explode them in PHP?
IMO it's less a performance question than a maintenance one - it's not fun to manage hundreds of columns. Storing such data - perhaps as serialized objects - in a TEXT field is a viable option - as long as you are 100% sure you will never have to run any queries on that data.
But why not use a normalized user_variables table like so:
id | user_id | variable_name | variable_value
?
It is a bit more complex to query, but provides for a very clean table structure all round. You can easily add arbitrary user variables that way.
If you are doing a lot of queries like SELECT * FROM USERS WHERE variable257 = 'green' you may have to stick with having specific columns.
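A minimal Python sketch of the normalized user_variables table follows, using SQLite in place of MySQL. The UNIQUE constraint, the helper functions and the sample variables are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_variables (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id        INTEGER NOT NULL,
        variable_name  TEXT    NOT NULL,
        variable_value TEXT,
        UNIQUE (user_id, variable_name)   -- assumed: one value per user per name
    )
""")


def set_variable(user_id, name, value):
    """Insert the variable, or overwrite it if the user already has one."""
    conn.execute(
        "INSERT OR REPLACE INTO user_variables "
        "(user_id, variable_name, variable_value) VALUES (?, ?, ?)",
        (user_id, name, value),
    )


def get_variables(user_id):
    """Return all of a user's variables as a name -> value dict."""
    rows = conn.execute(
        "SELECT variable_name, variable_value FROM user_variables WHERE user_id = ?",
        (user_id,),
    ).fetchall()
    return dict(rows)


def users_where(name, value):
    """The 'a bit more complex' query: find users by one variable's value."""
    rows = conn.execute(
        "SELECT user_id FROM user_variables "
        "WHERE variable_name = ? AND variable_value = ? ORDER BY user_id",
        (name, value),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up sample data.
set_variable(1, "colour", "blue")
set_variable(1, "colour", "green")    # overwrites the earlier value
set_variable(1, "level", "3")
set_variable(2, "colour", "green")

vars_user_1 = get_variables(1)
green_users = users_where("colour", "green")
```

Adding a new kind of variable needs no schema change, only a new variable_name.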

The database is definitely the best place to store the data. (I'm assuming you were thinking of storing it in flat files otherwise) You'd definitely get better performance and security from using a DB over storing in files.
With regards to the storing your data in multiple columns or delimiting them... It's a personal choice but you should consider a few things
If you're going to delimit the items, you need to think of what you're going to delimit them with (something that's not likely to crop up within the text you're delimiting)
I often find that it helps to try and visualise whether another programmer of your level would be able to understand what you've done with little help.
Yes, as Pekka said, if you want to perform queries on the data stored you should stick with the separate columns
You may also get a slight performance boost from not retrieving and parsing ALL your data every time if you just want a couple of fields of information
I'd suggest going with the separate columns as it offers you the option of much greater flexibility in the future. And there's nothing worse than having to drastically change your data structure and migrate information down the track!

I would recommend setting up a memcached server (see http://memcached.org/). It has proven to be viable with lots of the big sites. PHP has two extensions that integrate a client into your runtime (see http://php.net/manual/en/book.memcached.php).
Give it a try, you won't regret it.
EDIT
Sure, this will only be an option for data that's frequently used and would otherwise have to be loaded from your database again and again. Keep in mind though that you will still have to save your data to some kind of persistent storage.

A document-oriented database might be what you need.
If you want to stick to a relational database, don't take the naïve approach of just creating a table with oh so many fields:
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
PROPERTY_1 VARCHAR(50),
PROPERTY_2 VARCHAR(50),
PROPERTY_3 VARCHAR(50),
...
PROPERTY_915 VARCHAR(50),
PRIMARY KEY (ENTITY_ID)
);
Instead define an Attribute table:
CREATE TABLE Attribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
DESCRIPTION VARCHAR(30),
/* optionally */
DEFAULT_VALUE /* whatever type you want */,
/* end_optionally */
PRIMARY KEY (ATTRIBUTE_ID)
);
Then define your SomeEntity table, which only includes the essential attributes (for example, required fields in a registration form):
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
ESSENTIAL_1 VARCHAR(30),
ESSENTIAL_2 VARCHAR(30),
ESSENTIAL_3 VARCHAR(30),
PRIMARY KEY (ENTITY_ID)
);
And then define a table for those attributes that you might or might not want to store.
CREATE TABLE EntityAttribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
ENTITY_ID CHAR(10) NOT NULL,
ATTRIBUTE_VALUE /* the same type as Attribute.DEFAULT_VALUE;
if you didn't create that field, then any type */,
PRIMARY KEY (ATTRIBUTE_ID, ENTITY_ID)
);
Evidently, in your case, that SomeEntity is the user.
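To show how the optional DEFAULT_VALUE column could be used when reading attributes back, here is a short Python sketch with SQLite standing in for MySQL. All values are stored as text for simplicity, and the sample attributes and the attributes_for helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Attribute (
        ATTRIBUTE_ID  TEXT PRIMARY KEY,
        DESCRIPTION   TEXT,
        DEFAULT_VALUE TEXT
    );
    CREATE TABLE SomeEntity (
        ENTITY_ID   TEXT PRIMARY KEY,
        ESSENTIAL_1 TEXT
    );
    CREATE TABLE EntityAttribute (
        ATTRIBUTE_ID    TEXT NOT NULL REFERENCES Attribute(ATTRIBUTE_ID),
        ENTITY_ID       TEXT NOT NULL REFERENCES SomeEntity(ENTITY_ID),
        ATTRIBUTE_VALUE TEXT,
        PRIMARY KEY (ATTRIBUTE_ID, ENTITY_ID)
    );
""")

# Made-up sample data: two attributes, two users, one override.
conn.executemany(
    "INSERT INTO Attribute (ATTRIBUTE_ID, DESCRIPTION, DEFAULT_VALUE) VALUES (?, ?, ?)",
    [("theme", "UI theme", "light"), ("lang", "Language", "en")],
)
conn.executemany(
    "INSERT INTO SomeEntity (ENTITY_ID, ESSENTIAL_1) VALUES (?, ?)",
    [("u1", "alice@example.com"), ("u2", "bob@example.com")],
)
conn.execute(
    "INSERT INTO EntityAttribute (ATTRIBUTE_ID, ENTITY_ID, ATTRIBUTE_VALUE) VALUES (?, ?, ?)",
    ("theme", "u1", "dark"),
)


def attributes_for(entity_id):
    """Every attribute for one entity, falling back to the attribute's default."""
    rows = conn.execute(
        "SELECT a.ATTRIBUTE_ID, COALESCE(ea.ATTRIBUTE_VALUE, a.DEFAULT_VALUE) "
        "FROM Attribute a "
        "LEFT JOIN EntityAttribute ea "
        "  ON ea.ATTRIBUTE_ID = a.ATTRIBUTE_ID AND ea.ENTITY_ID = ? "
        "ORDER BY a.ATTRIBUTE_ID",
        (entity_id,),
    ).fetchall()
    return dict(rows)


u1_attrs = attributes_for("u1")
u2_attrs = attributes_for("u2")
```

Only the overridden values take up rows in EntityAttribute; everything else is served from the defaults.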

Instead of MySQL you might consider using a triplestore, or a key-value store
That way you get the benefits of having all the multithreading, multiuser, performance and caching voodoo figured out, without all the trouble of trying to figure out ahead of time what kind of values you really want to store.
Downsides: it's a bit more costly to figure out the average salary of all the people in Idaho who also own hats.

It depends on what kind of user info you are storing. If it's session-related data, use PHP sessions in coordination with session event handlers to store your session data in a single data field in the DB.
