I have a MySQL/PHP performance-related question.
I need to store an index list associated with each record in a table. Each list contains 1000 indices. I need to be able to quickly access any index value in the list associated to a given record. I am not sure about the best way to go. I've thought of the following ways and would like your input on them:
Store the list in a string as a comma-separated value list or as JSON. Probably terrible performance, since I need to pull the whole list out of the DB into PHP only to retrieve a single value. Parsing the string won't exactly be fast either... I could keep a number of expanded lists in a Least Recently Used cache on the PHP side to reduce load.
Make a list table with 1001 columns to store the list and its primary key. I'm not sure how costly this is in terms of storage, and it also feels like abusing the system. And then, what if I need to store 100,000 indices?
Only store in SQL the name of a binary file containing my indices and perform an fopen(); fseek(); fread(); fclose() cycle for each access? I'm not sure how the filesystem cache will react to that. If it goes badly there are solutions available to address the issue... but that sounds a bit overkill, no?
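For reference, option 3 would look roughly like this (just a sketch, assuming each list is stored as consecutive fixed-width 4-byte unsigned integers so the Nth value can be located by offset; the file name is hypothetical):

<?php
// Read the $position-th index (0-based) from a binary file that stores
// the list as consecutive 4-byte unsigned little-endian integers.
function readIndexValue($filename, $position)
{
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $filename");
    }
    fseek($handle, $position * 4);      // jump straight to the value
    $raw = fread($handle, 4);           // read exactly one value
    fclose($handle);

    $unpacked = unpack('Vvalue', $raw); // 'V' = unsigned 32-bit little-endian
    return $unpacked['value'];
}

// Example: fetch the 42nd index of record 7 (hypothetical file name).
echo readIndexValue('/data/indices/record_7.bin', 42);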
What do you think of that?
What about a good old one-to-many relationship?
records
-------
id int
record ...
indices
-------
record_id int
index varchar
Then:
SELECT *
FROM records
LEFT JOIN indices
ON records.id = indices.record_id
WHERE indices.index = 'foo'
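To make that concrete, a possible schema for the two tables (a sketch; the exact column types are assumptions, and note that index is a reserved word in MySQL, so the column definition needs backticks):

CREATE TABLE records (
    id     INT NOT NULL AUTO_INCREMENT,
    record VARCHAR(255),            -- whatever your record data is
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE indices (
    record_id INT NOT NULL,
    `index`   VARCHAR(50) NOT NULL, -- `index` is reserved, hence the backticks
    KEY ix_record (record_id),
    KEY ix_index (`index`)          -- lets WHERE indices.index = 'foo' use an index
) ENGINE=InnoDB;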
The standard solution is to create another table, with one row per (record, index) pair, and add a MySQL index to allow fast searches:
CREATE TABLE IF NOT EXISTS `table_list` (
`IDrecord` int(11) NOT NULL,
`item` int(11) NOT NULL,
KEY `IDrecord` (`IDrecord`)
)
Change the item's type according to your needs - I used int in my example.
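If you also need to jump straight to the Nth value of a record's list (as the question implies), one option is to add a position column with a composite primary key (a sketch; the position column is my addition, not part of the original answer):

CREATE TABLE IF NOT EXISTS `table_list` (
  `IDrecord` int(11) NOT NULL,
  `position` int(11) NOT NULL,      -- 0..999, the value's place in the list
  `item`     int(11) NOT NULL,
  PRIMARY KEY (`IDrecord`, `position`)
) ENGINE=InnoDB;

-- Fetch the 500th value of record 42 with a single indexed lookup:
SELECT item FROM table_list WHERE IDrecord = 42 AND position = 500;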
The most logical solution would be to put each value in its own row. Adding a MySQL index on that column will enable the DBMS to locate a value quickly, and should improve performance.
The reasons for not going with your other options are as follows:
Option 1
Storing multiple values in one MySQL cell is a violation of the first normal form of database normalisation. You can read up on it here.
Option 3
This relies heavily on files outside the database. You want to keep your data storage as localized as possible, to make it easier to maintain in the future.
Related
I currently have a database structure for dynamic forms as such:
grants_app_id | user_id | field_name | field_value
5             | 42434   | full_name  | John Doe
5             | 42434   | title      | Programmer
5             | 42434   | email      | example#example.com
I found this to be very difficult to manage, and it filled up the number of rows in the database very quickly. A single form can have up to 78 different field_names, so it proved very costly to update the field_values or simply to search them. I would like to combine the rows and use either JSON or PHP serialize to greatly reduce the impact on the database. Does anyone have any advice on how I should approach this? Thank you!
This would be the expected output:
grants_app_id | user_id | data
5             | 42434   | {"full_name":"John Doe", "title":"Programmer", "email":"example#example.com"}
It seems you don't have a simple primary key in those rows.
Speeding up the current solution:
create an index for (grants_app_id, user_id)
add an auto-incrementing primary key
switch from field_name to field_id
The index will make retrieving full forms a lot more fun (while taking a bit of extra time on insert).
The primary key allows you to update a row by specifying a single value backed by a unique index, which should generally be really fast.
You probably already have some definition of the fields. Add integer IDs and use them to speed up the process, as less data is stored, compared, and indexed.
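Put together, those three suggestions might look like this (a sketch; the table and column names are assumptions):

-- A lookup table for the ~78 known field names
CREATE TABLE fields (
    field_id   INT NOT NULL AUTO_INCREMENT,
    field_name VARCHAR(64) NOT NULL UNIQUE,
    PRIMARY KEY (field_id)
) ENGINE=InnoDB;

-- The existing answers table, with a surrogate primary key,
-- a field_id instead of field_name, and an index for whole-form retrieval
CREATE TABLE grant_answers (
    id            INT NOT NULL AUTO_INCREMENT,
    grants_app_id INT NOT NULL,
    user_id       INT NOT NULL,
    field_id      INT NOT NULL,
    field_value   TEXT,
    PRIMARY KEY (id),
    KEY ix_form (grants_app_id, user_id)
) ENGINE=InnoDB;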
Switching to a JSON-Encoded variant
Converting arrays to JSON and back can be done by using json_encode and json_decode since PHP 5.2.
How can you switch to JSON?
Probably the best way right now would be to use a PHP script (or similar) to retrieve all data from the old table, group it correctly, and insert it into a fresh table. Afterwards you can switch the table names, ... This is an offline approach.
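A rough sketch of such a migration script, assuming PDO and a new table grants_app_data(grants_app_id, user_id, data) (the table and connection details are assumptions):

<?php
// Offline migration: collapse one-row-per-field into one JSON row per form.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$rows = $pdo->query(
    'SELECT grants_app_id, user_id, field_name, field_value
       FROM grants_app_fields
      ORDER BY grants_app_id, user_id'
);

$forms = array();
foreach ($rows as $row) {
    $key = $row['grants_app_id'] . ':' . $row['user_id'];
    $forms[$key]['grants_app_id'] = $row['grants_app_id'];
    $forms[$key]['user_id']       = $row['user_id'];
    $forms[$key]['fields'][$row['field_name']] = $row['field_value'];
}

$insert = $pdo->prepare(
    'INSERT INTO grants_app_data (grants_app_id, user_id, data) VALUES (?, ?, ?)'
);
foreach ($forms as $form) {
    $insert->execute(array(
        $form['grants_app_id'],
        $form['user_id'],
        json_encode($form['fields']),
    ));
}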
An alternative would be to add a new column and indicate by field_name=NULL that the new column contains the data. Afterwards you are free to convert data at any time or store only new data as JSON.
Use JSON?
While it is certainly tempting to have all the data in one row, there are some things to remember:
With all fields stored in a single text field, searching for a value inside one field may become a two-phase approach, as a % inside a LIKE can bleed into other fields' values. Also, LIKE '%field:value%' cannot easily be optimized by indexing the column.
Changing a single field means rewriting all the stored fields in that row. As long as you are sure only one process changes the data at any given time this is OK; otherwise there tend to be more problems.
The JSON column needs to be big enough to hold field names + values + separators, which can be a lot. Also, if you miscalculate and a value in any field is too long, the result is truncation, with the risk of losing all information on all fields after the long value.
So in your case, even with 78 different fields, it may still be better to have a row per form, user, and field. (It may even turn out that JSON is more practicable for forms with few fields.)
As explained in this question, you have to remember that JSON is just text as far as MySQL is concerned.
Scenario 1
I have one table, let's say "member". In that table "member" I have 7 fields (memid, login_name, password, age, city, phone, country). In my table I have 10K records. I need to fetch one record, so I'm using a query like this:
mysql_query("select * from member where memid=999");
Scenario 2
I have the same table called "member", but I'm splitting it into two tables, member and member_txt. In my member_txt table I have (memid, age, phone, city, country) and in my member table I have (memid, login_name, password).
Which scenario fetches the data more quickly: keeping a single table, or splitting it into two tables linked by a reference?
Note: I need to fetch this data with PHP and MySQL. Please let me know which is the best method to follow.
we have 10K records
For your own health, use the single table approach.
As long as you are using a primary key for memid, things are going to be lightning fast. This is because PRIMARY KEY automatically creates an index, which points to the exact location of the data and eliminates the need to scan through rows it would otherwise have to read.
From http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. If a table has 1,000 rows, this is at least 100 times faster than reading sequentially. If you need to access most of the rows, it is faster to read sequentially, because this minimizes disk seeks.
Your second approach only makes your system more complex, and provides no benefits.
Use scenario 1.
Make memid the primary/unique key; then having one table is faster than having two tables.
In general you should not see too much impact on performance with 10k rows, as long as you're accessing them by your primary key.
Also note that fetching data from one table is faster than fetching data from two tables.
If you want to optimize further, list the column names you need in the SELECT statement instead of using the * operator.
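A sketch of both points (making memid the primary key and selecting only the columns you need):

-- Ensure lookups by memid use an index (skip if memid is already the primary key)
ALTER TABLE member ADD PRIMARY KEY (memid);

-- Fetch only the columns you actually need instead of SELECT *
SELECT login_name, city, phone FROM member WHERE memid = 999;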
What is the purpose of a secondary key? Say I have a table that logs all the check-ins (similar to Foursquare), with columns id, user_id, location_id, post, time, and there can be millions of rows; many people have said to use secondary keys to speed things up.
Why does this work? And should both user_id and location_id be secondary keys?
I'm using mySQL btw...
Edit: There will be a page that lists/calculates all the check-ins for a particular user, and another page that lists all the users who have checked in to a particular location.
mySQL Query
Type 1
SELECT location_id FROM checkin WHERE user_id = 1234
SELECT user_id FROM checkin WHERE location_id = 4321
Type 2
SELECT COUNT(location_id) as num_users FROM checkin
SELECT COUNT(user_id) as num_checkins FROM checkin
A key (also called an index) is for speeding up queries. If you want to see all check-ins for a given user, you need a key on the user_id field. If you want to see all check-ins for a given location, you need an index on the location_id field. You can read more in the MySQL documentation.
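A sketch of what that could look like (assuming the table is called checkin; the index names are arbitrary):

-- Secondary indexes for the two access patterns in the question
ALTER TABLE checkin ADD INDEX ix_user     (user_id);
ALTER TABLE checkin ADD INDEX ix_location (location_id);

-- Both Type 1 queries can now use an index instead of a full table scan
SELECT location_id FROM checkin WHERE user_id = 1234;
SELECT user_id     FROM checkin WHERE location_id = 4321;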
I want to comment on your question and your examples.
Let me just strongly suggest that, since you are using MySQL, you make sure your tables use the InnoDB engine type, for many reasons you can research on your own.
One important feature of InnoDB is that you have referential integrity. What does that mean? In your checkin table, you have a foreign key of user_id, which is the primary key of the user table. With referential integrity, MySQL will not let you insert a row with a user_id that doesn't exist in the user table. Using MyISAM, you can. That alone should be enough to make you want to use the InnoDB engine.
To your question about keys/indexes: essentially, when a table is defined and a key is declared for a column or some combination of columns, MySQL will create an index.
Indexes are essential for performance as a table grows with the insert of rows.
All relational databases and document databases depend on an implementation of B-tree indexing. What B-trees are very good at is finding an item (or determining it isn't there) using a predictable number of lookups. So when people talk about the performance of a relational database, the essential building block of that is the use of B-tree indexes, which are created via KEY statements or with ALTER TABLE or CREATE INDEX statements.
To understand why this is, imagine that your user table was simply a text file, with one line per row, perhaps separated by commas. As you add a row, a new line in the text file gets added at the bottom.
Eventually you get to the point that you have 10,000 lines in the file.
Now you want to find out if you entered a line for one particular person with the last name of Smith. How can you find that out?
Without any sort of ordering of the file, or a separate index, you have but one option, and that is to start at the first line in the file and scan through every line looking for a match. Even if you found a Smith, that might not be the only 'Smith' in the table, so you have to read the entire file from top to bottom every time you want to do this search.
Obviously as the table grows the performance of searching gets worse and worse.
In relational database parlance, this is known as a "table scan". The database has to start at the first row and scan through reading every row until it gets to the end.
Without indexes, relational databases still work, but they are highly dependent on IO performance.
With a B-tree index, the rows you want to find are found in the index first. The indexes have a pointer directly to the data you want, so the table no longer needs to be scanned; instead, only the individual data pages required are read. This is how a database can maintain adequate performance even when there are millions, tens of millions, or hundreds of millions of rows.
To really start to gain insight into how MySQL works, you need to get familiar with EXPLAIN EXTENDED ... and start looking at the explain plans for queries. Simple ones like those you've provided will have simple plans that show you how many rows are being examined to get a result and whether or not they are using one or more indexes.
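For example (a sketch; the exact output columns vary by MySQL version):

EXPLAIN SELECT location_id FROM checkin WHERE user_id = 1234;
-- With an index on user_id you would expect to see something like
-- type=ref, key=ix_user and a small "rows" estimate;
-- without it, type=ALL indicates a full table scan.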
For your summary queries, indexes are not helpful because you are doing a COUNT(). The table will need to be scanned when you have no other criteria constraining the search.
I did notice what looks like a mistake in your summary queries. Just based on your labels, I would think that these are the right queries to get what you would want given your column alias names.
SELECT COUNT(DISTINCT user_id) as num_users FROM checkin
SELECT COUNT(*) as num_checkins FROM checkin
This is yet another reason to use InnoDB, which when properly configured has a data cache (innodb buffer pool) similar to other rdbms's like oracle and sql server. MyISAM doesn't cache data at all, so if you are repeatedly querying the same sorts of queries that might require a lot of IO, MySQL will have to do all that data reading work over and over, whereas with InnoDB, that data could very well be sitting in cache memory and have the result returned without having to go back and read from storage.
Primary vs Secondary
There really is no such concept internally. A Primary key is special because it allows the database to find one single row. Primary keys must be unique, and to reflect that, the associated Btree index is unique, which simply means that it will not allow you to have 2 keys with the same data to exist in the index.
Whether or not an index is unique is an excellent tool that allows you to maintain the consistency of your database in many other cases. Let's say you have an 'employee' table with the SS_Number column to store social security #. It makes sense to have an index on that column if you want the system to support finding an employee by SS number. Without an index, you will tablescan. But you also want to have that index be unique, so that once an employee with a SS# is inserted, there is no way the database will let you enter a duplicate employee with the same SS#.
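A sketch of that employee example (the table and column names are assumptions):

CREATE TABLE employee (
    employee_id INT NOT NULL AUTO_INCREMENT,
    name        VARCHAR(100) NOT NULL,
    ss_number   CHAR(11) NOT NULL,
    PRIMARY KEY (employee_id),
    UNIQUE KEY ux_ss_number (ss_number)   -- fast lookup AND no duplicate SS#s
) ENGINE=InnoDB;

-- Uses the unique index; a second INSERT with the same ss_number will fail
SELECT employee_id, name FROM employee WHERE ss_number = '123-45-6789';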
But to demystify this for you: when you declare keys as you define your tables, these indexes are simply created for you and used automagically in most cases.
It's when you aren't dealing with keys (primary or foreign), as in the example of usernames, first and last names, SS#s, etc., that you also need to know how to create an index, because you are searching (via WHERE clause criteria) on one or more columns that aren't keys.
I have some location data, which is in a table locations with the key being the unique location_id
I have some user data, which is in a table users with the key being the unique user_id
Two ways I was thinking of linking these two together:
I can put the 'location' in each user's data.
'SELECT user_id FROM users WHERE location = "LOCATIONID";'
//this IS NOT searching with the table's key
//this does not require an explode
//this stores 1 integer per user
I can also put the 'userIDs' as a comma delimited string of ids into each location's data.
'SELECT userIDs FROM locations WHERE location_id = "LOCATIONID";'
//this IS searching with the tables key
//this needs an explode() once the comma delimited list is retrieved
//this stores 1 string of user ids per location
So I wonder which would be most efficient. I'm not really sure how much the size of the stored data could also impact the speed. I want retrievals to be as fast as possible when trying to find out which users are at which location.
This is just an example, and there will be many other tables like location to compare to the users, so the efficiency, or lack of, will be multiplied across the whole system.
Stick with option 1. Keep your database tables normalised as much as possible till you know you have a performance problem.
There's a whole slew of problems with option 2, including not being able to use the user IDs until you pull them into PHP, and then having to fire off many more SQL queries, one per ID. This is extremely inefficient. Do as much inside MySQL as possible; the optimisations that the database layer can make while running the query will easily be a lot quicker than anything you write in PHP.
Regarding your point about not searching on the primary key, you should add an index to the location column. All columns that are in a WHERE clause should be indexed as a general rule. This negates the issue of not searching on the primary key, as the primary key is just another type of index for the purposes of performance.
Use the first one to keep your data normalized. You can then query for all users for a location directly from the database without having to go back to the database for each user.
Be sure to add the correct index on your users table too.
CREATE TABLE locations (
locationId INT PRIMARY KEY AUTO_INCREMENT
) ENGINE=INNODB;
CREATE TABLE users (
userId INT PRIMARY KEY AUTO_INCREMENT,
location INT,
INDEX ix_location (location)
) ENGINE=INNODB;
Or to only add the index
ALTER TABLE users ADD INDEX ix_location(location);
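With that index in place, listing all users at a location is a single indexed query (a sketch):

-- All users at location 42; ix_location makes this fast
SELECT userId FROM users WHERE location = 42;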
Have you heard of foreign keys?
You can get details from multiple tables using a join.
You can also use a subquery.
As you said, there are two tables, users and locations.
Keep the user id as a foreign key in locations and fetch based on that.
When you store the user IDs as a comma-separated list in a table, that table is not normalized (in particular, it violates the first normal form, item 4).
It is perfectly valid to denormalize tables for optimization purposes. But only after you have measured that this is where the bottleneck actually is in your specific situation. This, however, can only be determined if you know which query is executed how often, how long they take and whether the performance of the query is critical (in relation to other queries).
Stick with option 1 unless you know exactly why you have to denormalize your table.
I'm currently designing a website using PHP and MySQL, and as the site grows I find myself adding more and more columns to the users table to store various variables.
Which got me thinking: is there a better way to store this information? Just to clarify, the information is global and can be affected by other users, so cookies won't work; also, I'd lose the information if they cleared their cookies.
The second part of my question is, if it does turn out that storing it in a database is the best way, would it be less expensive to have a large number of columns or rather to combine related columns into delimited varchar columns and then explode them in PHP?
Thanks!
In my experience, I'd rather get the database right than start adding comma-separated fields holding multiple items. Having to sift through multiple comma-separated fields is only going to hurt your program's efficiency and the readability of your code.
Also, if your table is growing too much, then perhaps you should look into splitting it into multiple tables joined by foreign keys?
I'd create a user_meta table, with three columns: user_id, key, value.
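A sketch of that (the column types are assumptions; note that key is a reserved word in MySQL, hence the backticks):

CREATE TABLE user_meta (
    user_id INT NOT NULL,
    `key`   VARCHAR(64) NOT NULL,
    `value` TEXT,
    PRIMARY KEY (user_id, `key`)   -- one value per user per key, fast lookups
) ENGINE=InnoDB;

-- Read one setting for a user:
SELECT `value` FROM user_meta WHERE user_id = 42 AND `key` = 'timezone';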
I wouldn't go for the option of grouping columns together and exploding them. It's untidy work and very unmanageable. Instead, maybe try spreading those columns over a few tables and using InnoDB's transaction feature.
If you still dislike the idea of frequently updating the database, and if this method complies with what you're trying to achieve, you can use APC's caching function to store (cache) information "globally" on the server.
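A minimal sketch of that with APC (the cache key name is arbitrary, and loadSettingsFromDatabase is a hypothetical helper; the data is lost on a server restart, so it still needs a persistent copy in the database):

<?php
// Cache a user's settings in APC for 10 minutes to avoid repeated DB reads.
$cacheKey = 'user_settings_42';               // hypothetical key name

$settings = apc_fetch($cacheKey, $success);
if (!$success) {
    $settings = loadSettingsFromDatabase(42); // your own DB lookup
    apc_store($cacheKey, $settings, 600);     // TTL in seconds
}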
MongoDB (and its NoSQL cousins) are great for stuff like this.
The database is a perfectly fine place to store such data, as long as they're variables and not, say, huge image files. The database has all the optimizations and facilities for storing and retrieving large amounts of data. Anything you set up at the file-system level will always be beaten by what the database already offers in terms of speed and functionality.
would it be less expensive to have a large number of columns or rather to combine related columns into delimited varchar columns and then explode them in PHP?
IMO it's not so much a performance question as a maintenance question - it's not fun to manage hundreds of columns. Storing such data - perhaps as serialized objects - in a TEXT field is a viable option, as long as you are 100% sure you will never have to query against that data.
But why not use a normalized user_variables table like so:
id | user_id | variable_name | variable_value
?
It is a bit more complex to query, but provides for a very clean table structure all round. You can easily add arbitrary user variables that way.
If you are doing a lot of queries like SELECT * FROM users WHERE variable257 = 'green', you may have to stick with specific columns.
The database is definitely the best place to store the data. (I'm assuming you were thinking of storing it in flat files otherwise) You'd definitely get better performance and security from using a DB over storing in files.
With regard to storing your data in multiple columns or delimiting them... it's a personal choice, but you should consider a few things:
If you're going to delimit the items, you need to think about what you're going to delimit them with (something that's not likely to crop up within the text you're delimiting).
I often find that it helps to try and visualise whether another programmer of your level would be able to understand what you've done with little help.
Yes, as Pekka said, if you want to perform queries on the stored data you should stick with the separate columns.
You may also get a slight performance boost from not retrieving and parsing ALL your data every time if you just want a couple of fields of information
I'd suggest going with the separate columns, as that gives you much greater flexibility in the future. And there's nothing worse than having to drastically change your data structure and migrate information down the track!
I would recommend setting up a memcached server (see http://memcached.org/). It has proven to be viable with lots of the big sites. PHP has two extensions that integrate a client into your runtime (see http://php.net/manual/en/book.memcached.php).
Give it a try, you won't regret it.
EDIT
Sure, this will only be an option for data that's frequently used and would otherwise have to be loaded from your database again and again. Keep in mind though that you will still have to save your data to some kind of persistent storage.
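A minimal usage sketch with the Memcached extension (assuming a memcached server on localhost:11211; the cache key name is arbitrary and loadSettingsFromDatabase is a hypothetical helper for the persistent lookup):

<?php
$mc = new Memcached();
$mc->addServer('localhost', 11211);

$settings = $mc->get('user_settings_42');        // hypothetical cache key
if ($settings === false) {
    $settings = loadSettingsFromDatabase(42);    // your own persistent lookup
    $mc->set('user_settings_42', $settings, 600); // expire after 10 minutes
}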
A document-oriented database might be what you need.
If you want to stick to a relational database, don't take the naïve approach of just creating a table with oh so many fields:
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
PROPERTY_1 VARCHAR(50),
PROPERTY_2 VARCHAR(50),
PROPERTY_3 VARCHAR(50),
...
PROPERTY_915 VARCHAR(50),
PRIMARY KEY (ENTITY_ID)
);
Instead, define an Attribute table:
CREATE TABLE Attribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
DESCRIPTION VARCHAR(30),
/* optionally */
DEFAULT_VALUE /* whatever type you want */,
/* end_optionally */
PRIMARY KEY (ATTRIBUTE_ID)
);
Then define your SomeEntity table, which only includes the essential attributes (for example, required fields in a registration form):
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
ESSENTIAL_1 VARCHAR(30),
ESSENTIAL_2 VARCHAR(30),
ESSENTIAL_3 VARCHAR(30),
PRIMARY KEY (ENTITY_ID)
);
And then define a table for those attributes that you might or might not want to store.
CREATE TABLE EntityAttribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
ENTITY_ID CHAR(10) NOT NULL,
ATTRIBUTE_VALUE /* the same type as SomeEntity.DEFAULT_VALUE;
if you didn't create that field, then any type */,
PRIMARY KEY (ATTRIBUTE_ID, ENTITY_ID)
);
Evidently, in your case, SomeEntity is the user.
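To read an entity's optional attributes back, you'd join the two tables, e.g. (a sketch; 'user000042' is a hypothetical ENTITY_ID value):

-- All stored attribute values for one entity, with readable names
SELECT a.DESCRIPTION, ea.ATTRIBUTE_VALUE
  FROM EntityAttribute ea
  JOIN Attribute a ON a.ATTRIBUTE_ID = ea.ATTRIBUTE_ID
 WHERE ea.ENTITY_ID = 'user000042';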
Instead of MySQL you might consider using a triplestore, or a key-value store
That way you get the benefit of having all the multithreading, multiuser, performance, and caching voodoo figured out for you, without the trouble of trying to figure out ahead of time what kind of values you really want to store.
Downsides: it's a bit more costly to figure out the average salary of all the people in Idaho who also own hats.
It depends on what kind of user info you are storing. If it's session-pertinent data, use PHP sessions in coordination with session save handlers to store your session data in a single data field in the DB.
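A rough sketch of a custom DB-backed session handler using session_set_save_handler, assuming PDO and a sessions(id, data, last_access) table (the table and connection details are assumptions):

<?php
// Store PHP session data in a single TEXT column of a `sessions` table.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

session_set_save_handler(
    function ($savePath, $sessionName) { return true; },  // open
    function () { return true; },                         // close
    function ($id) use ($pdo) {                           // read
        $stmt = $pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute(array($id));
        $data = $stmt->fetchColumn();
        return $data === false ? '' : $data;
    },
    function ($id, $data) use ($pdo) {                     // write
        $stmt = $pdo->prepare(
            'REPLACE INTO sessions (id, data, last_access) VALUES (?, ?, NOW())'
        );
        return $stmt->execute(array($id, $data));
    },
    function ($id) use ($pdo) {                            // destroy
        return $pdo->prepare('DELETE FROM sessions WHERE id = ?')->execute(array($id));
    },
    function ($maxLifetime) use ($pdo) {                   // gc
        $stmt = $pdo->prepare('DELETE FROM sessions WHERE last_access < FROM_UNIXTIME(?)');
        return $stmt->execute(array(time() - $maxLifetime));
    }
);

session_start();
$_SESSION['some_flag'] = true;  // now persisted to the database on write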