Speed of SELECT Distinct vs array unique - php

I am using WordPress with some custom post types (just to give a description of my DB structure - it's WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID     postID   city      state
auto   int      varchar   varchar
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forced me to add/update cities and states for every post, even though this will cause duplicates (in the city/state fields). This must be so because I must keep track of which states/cities exist (actually have a post associated with them). When a post is added or deleted, it takes its record to or from the cities table with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.

To answer your question about using SELECT distinct vs. array_unique, I would say that I would almost always prefer to limit the result set in the database assuming of course that you have an appropriate index on the field for which you are trying to get distinct values. This saves you time in transmitting extra data from DB to application and for the application reading that data into memory where you can work with it.
As far as your separate table design, it is hard to speculate whether this is a good approach or not. This would largely depend on how you are actually performing your query (i.e. are you doing two separate queries - one for post info and one for city/state info - or querying across a join?).
There is really only one definitive way to determine which approach is fastest: test both ways in your environment.
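As a rough illustration of the two approaches compared above, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL and a set as a stand-in for PHP's array_unique; the sample rows are made up. Both approaches return the same list, but the second one transfers every row to the application first.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table mirrors the cities
# table from the question (ID, postID, city, state). Sample data is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, postID INTEGER, city TEXT, state TEXT)"
)
rows = [
    (1, "Austin", "TX"),
    (2, "Austin", "TX"),
    (3, "Reno", "NV"),
    (4, "Austin", "TX"),
]
conn.executemany("INSERT INTO cities (postID, city, state) VALUES (?, ?, ?)", rows)

# Option 1: let the database remove the duplicates.
distinct_in_db = conn.execute(
    "SELECT DISTINCT city, state FROM cities ORDER BY city, state"
).fetchall()

# Option 2: fetch every row, then deduplicate in the application
# (roughly what array_unique would do on the PHP side).
all_rows = conn.execute("SELECT city, state FROM cities").fetchall()
distinct_in_app = sorted(set(all_rows))

print(len(all_rows), "rows transferred without DISTINCT")
print(len(distinct_in_db), "rows transferred with DISTINCT")
```

On a real table the difference in transferred rows grows with the number of duplicate city/state pairs, which is why testing against your own data matters.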

1) A fully normalized table (one holding only integer values, with the other tables holding just one int plus a varchar) has an advantage when you rarely do full-table joins and do a lot of searching on the normalized fields. The downside is that it requires large join/sort buffers and results in more complex queries, which are much less likely to be auto-optimized by MySQL, so you have to optimize your queries yourself.
2) SELECT DISTINCT will be faster in almost all cases. The only case where it will be slower is when you have a small sort buffer configured in /etc/my.cnf and a much larger memory limit for PHP.
A DISTINCT select can use indexes, while your code can't.
Also, sending a large amount of data to your app costs a lot of MySQL CPU time and wall-clock time.
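To see whether a DISTINCT query can be answered from an index, you can ask the database for its query plan. The sketch below does this with SQLite's EXPLAIN QUERY PLAN as a stand-in (in MySQL you would use EXPLAIN instead, and the output format differs); the index name idx_city_state is a hypothetical choice for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (ID INTEGER PRIMARY KEY, postID INTEGER, city TEXT, state TEXT)"
)
# Hypothetical composite index covering both columns used by the DISTINCT query.
conn.execute("CREATE INDEX idx_city_state ON cities (city, state)")
conn.executemany(
    "INSERT INTO cities (postID, city, state) VALUES (?, ?, ?)",
    [(1, "Austin", "TX"), (2, "Austin", "TX"), (3, "Reno", "NV")],
)

# The last column of each plan row holds the human-readable description.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT DISTINCT city, state FROM cities"
).fetchall()
plan = " | ".join(row[-1] for row in plan_rows)
print(plan)
```

If the plan mentions the index, the database is reading the distinct values from it rather than scanning the whole table, which is something array-level deduplication in application code cannot take advantage of.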

Related

Using SELECT * or SELECT all, cols is better for Queries [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, column2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of database.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen however, this still seems like a bad idea and SELECT * should never be used for a lot more technical reasons than I ever thought about.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason select queries are bad index coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
1) Read from database
2) Sent across the network
3) Marshalled into your process
4) (for ADO-type technologies) Saved in a data-table in-memory
5) Ignored and discarded / garbage-collected
Item #1 has many hidden costs including eliminating some potential covering index, causing data-page loads (and server cache thrashing), incurring row / page / table locks that might be otherwise avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
1) Programmer doesn't need to revisit the SQL to add columns
2) The network-transport of the SQL is smaller / faster
3) SQL Server query parse / validation time
4) SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, this all comes down, because of the way you specified the question, to the issue resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates and other similar things that get added by DBAs systemically for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system or user-defined fields.
You should only select the columns that you need. Even if you need all columns it's still better to list column names so that the sql server does not have to query system table for columns.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect too and it might not know how to process them.
Apart from this, if the table has a binary column, the query will be much slower and use more network resources.
There are four big reasons that select * is a bad thing:
1) The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
2) If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
3) Listing the column names makes your code far more self-documented, and so probably more readable.
4) If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
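The effect of a later schema change on the two query styles can be sketched as follows. This uses SQLite through Python's sqlite3 module purely for illustration; the widgets table and the last_updated column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (foo TEXT, bar TEXT)")
conn.execute("INSERT INTO widgets VALUES ('a', 'b')")

star_before = conn.execute("SELECT * FROM widgets").fetchone()
named_before = conn.execute("SELECT foo, bar FROM widgets").fetchone()

# A later schema change, e.g. an audit column added by a DBA (hypothetical name).
conn.execute("ALTER TABLE widgets ADD COLUMN last_updated TEXT")

star_after = conn.execute("SELECT * FROM widgets").fetchone()
named_after = conn.execute("SELECT foo, bar FROM widgets").fetchone()

print("SELECT * before/after:", star_before, star_after)
print("explicit before/after:", named_before, named_after)
```

The explicit query keeps returning the same shape, while the SELECT * result silently grows by one column that the consuming code never asked for.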
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.
Definitely define the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.
It's always better to specify the columns you need, if you think about it one time, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone later may add columns to the table that you actually do not need in your query and you'll be better off in that case by specifying all of your columns.
The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for your query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bringing those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system starts to evolve, you'll see that it's worth spending this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... query may get awful). Then - measure. Compare with query with column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
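A measurement along those lines might look like the sketch below. It is only a toy harness: it uses an in-memory SQLite database, a hypothetical posts table with a large body column, and arbitrary row counts, so the absolute numbers mean nothing for your server. The point is the shape of the comparison.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical table with one large column, like the body of a post.
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts (title, body) VALUES (?, ?)",
    [("title %d" % i, "x" * 2000) for i in range(2000)],
)

def timed(sql, repeats=20):
    """Run the query several times; return (average seconds, last result)."""
    start = time.perf_counter()
    for _ in range(repeats):
        rows = conn.execute(sql).fetchall()
    return (time.perf_counter() - start) / repeats, rows

star_time, star_rows = timed("SELECT * FROM posts")
narrow_time, narrow_rows = timed("SELECT id, title FROM posts")

print("SELECT *         : %.6f s per run" % star_time)
print("SELECT id, title : %.6f s per run" % narrow_time)
```

Run the same comparison against your real database and real queries before drawing conclusions.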
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
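The difference between a dynamically built query and a parameterised one can be sketched like this, using Python's sqlite3 module as a stand-in for whatever client library you use. The table contents and the malicious input are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SomeTable (Field1 TEXT, Field2 TEXT)")
conn.executemany(
    "INSERT INTO SomeTable VALUES (?, ?)",
    [("alice", "x"), ("bob", "y")],
)

# Input crafted to turn the WHERE clause into something that is always true.
user_input = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text and changes the query's meaning.
unsafe_sql = (
    "SELECT Field1, Field2 FROM SomeTable WHERE Field1 = '" + user_input + "'"
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is passed as a parameter and is only ever treated as a value.
safe_rows = conn.execute(
    "SELECT Field1, Field2 FROM SomeTable WHERE Field1 = ?", (user_input,)
).fetchall()

print("dynamic query returned", len(unsafe_rows), "rows")
print("parameterised query returned", len(safe_rows), "rows")
```

The concatenated query returns every row in the table, while the parameterised one correctly finds nothing.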
It is NOT faster to use explicit field names versus *, if and only if, you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that's a nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance-wise, I have seen comments saying both are equal, but on the usability side there are some pluses and minuses.
When you use SELECT * in a query and someone alters the table to add new fields that the existing query does not need, that is unnecessary overhead. And what if the newly added field is a blob or an image field? Your query response time is going to be really slow then.
On the other hand, if you use SELECT col1, col2, ... and the table is altered to add new fields that are needed in the result set, you always need to edit your SELECT query after the table alteration.
But I suggest always using SELECT col1, col2, ... in your queries and altering the query if the table gets altered later.
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using Select with the column names listed, it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt selecting column names means faster performance over Select *
Select is equally efficient (in terms of velocity) if you use * or columns.
The difference is about memory, not velocity. When you select several columns SQL Server must allocate memory space to serve you the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the execution plan, which in turn depends heavily on your WHERE clause and the number of JOIN, OUTER JOIN, etc ...
For your question just use SELECT *. If you need all the columns there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a select * because almost all my data is retrieved from SQL Server Views that precombine needed values from multiple tables into a single easy to access View.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQL Server is hard or even time-consuming. You just drag them over from the object browser (you can get all in one go by dragging from the word columns). To put a permanent performance hit on your system (because this can reduce the use of indexes and because sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see for instance) just to save less than a minute of development time is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)
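Accessing results by name rather than by index, as suggested above, can be sketched as follows. The example uses Python's sqlite3 module with its sqlite3.Row row factory; the widgets table is hypothetical, and most database client libraries offer a comparable name-based accessor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite3.Row lets each result row be indexed by column name as well as position.
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE widgets (foo TEXT, bar TEXT)")
conn.execute("INSERT INTO widgets (foo, bar) VALUES ('a', 'b')")

row = conn.execute("SELECT * FROM widgets").fetchone()
by_name = (row["foo"], row["bar"])

# Simulate the column order changing underneath the application.
reordered = conn.execute("SELECT bar, foo FROM widgets").fetchone()
by_name_reordered = (reordered["foo"], reordered["bar"])
by_index_reordered = (reordered[0], reordered[1])

print("by name, original order :", by_name)
print("by name, reordered      :", by_name_reordered)
print("by index, reordered     :", by_index_reordered)
```

Name-based access yields the same values in both cases, while positional access silently swaps them.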

Database Definition for Sphinx Search

Background
I am creating a MySQL database to store items such as courses where there may be many attributes to a single course. For example:
A single course may have any or all of the following attributes:
Title (varchar)
Secondary Title (varchar)
Description (text)
Date
Time
Specific Location (varchar; eg. White Hall Room 7)
General Location (varchar; eg. Las Vegas, NV)
Location Coords (floats; eg. lat, long)
etc.
The database is set up as follows:
A table storing specific course info:
courses table:
Course_ID (a Primary Key unique ID for each course)
Creator_ID (a unique ID for the creator)
Creation_Date (datetime of course creation)
Modified_Date (where this is the most recent timestamp the course was modified)
The table storing each courses multiple attributes is set up as follows:
course_attributes table:
Attribute_ID (a unique ID for each attribute)
Course_ID (reference to the specific course attribute is for)
Attribute (varchar defining the attribute; eg. 'title')
Value (text containing value of specified attribute; eg. 'Title Of My Course')
Desire
I would like to search this database using sphinx search. With this search, I have different fields weighing different amounts, for example: 'title' would be more important than 'description'.
Specific search fields that I wish to have are:
Title
Date
Location (string)
Location (geo - lat/long)
The Question
Should I define a View in Mysql to organize the attributes according to 'title', 'description', etc., or is there a way to define my sphinx.conf file to understand specific attributes?
I am open to all suggestions to solving this problem, whether it be rearrangement of the database/tables or the way in which I search.
Let me know if you need any additional details to help me find a solution.
Thanks in advance for the help
!--Update--!
OK, so after reading some of the answers, I feel that I should provide some additional information.
Latitude / Longitude
The latitude/longitude attributes are created by me internally after receiving the general location string. I can generate the values in any way I wish, meaning that I can store them together in a single lat/long attribute as 'float lat, float long' values or any other desired format. This is done only after they have been generated from the initial location string and verified. This is to guard against malformed data as #X-Zero and #Cody have suggested.
Keep in mind that the latitude and longitude was merely illustrating the need to have that field be searchable as opposed to anything more than that. It is simply another attribute; one of many.
Weighting Search Results
I know how to add weights to results in a Sphinx search query:
$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );
This causes the title column to have a higher weight than the description column if the structure was as #X-Zero suggested. My question was more directed to how one would apply the above logic with the current table definition.
Database Structure, Views, and Efficiency
Using my introductory knowledge of Views, I was thinking that I could possibly create something that displays a row for each course where each attribute is its own column. I don't know how to accomplish this or if it's even possible.
I am not the most confident with database structures, but the reason I set my tables up as described was because there are many cases where not all of the fields will be completed for every course and I was attempting to be efficient [yes, it seems as though I've failed].
I was thinking that using my current structure, each attribute would contain a value and would therefore cause no wasted space in the table. Alternatively, if I had a table with tons of potential attributes, I would think there would be wasted space. If I am incorrect, I am happy to learn why my understanding is wrong.
Let me preface this by saying that I've never even heard of Sphinx, nor (obviously) used it. However, from a database perspective...
Doing multi-domain columns like this is a terrible (I will hunt you down and kill you) idea. For one thing, it's impossible to index or sort meaningfully, period. You also have to pray that you don't get a latitude attribute with textual data (and because this can only be enforced programmatically, I'm going to guarantee this will happen) - doing so will cause all distance based formulas to crash. And speaking of location, what happens if somebody stores a latitude without a longitude (note that this is possible regardless of whether you are storing a single GeoLocation attribute, or the pair)?
Your best bet is to do the following:
Figure out which attributes will always be required. These belong in the course table (...mostly).
For each related set of optional attributes, create a table. For example, location (although this should probably be required...), which would contain Latitude/Longitude, City, State, Address, Room, etc. Allow the columns to be nullable (in sets - add constraints so users can't add just longitude and not latitude).
For every set of common queries add a view. Even (perhaps especially) if you persist in using your current design, use a view. This promotes separation between the logical and physical implementations of the database. (This assumes searching by SQL) You will then be able to search by specifying view_column is null or view_column = input_parameter or whichever.
For weighted searching (assuming dynamic weighting) your query will need to use left joins (inside the view as well - please document this), and use prepared-statement host-parameters (just save yourself the trouble of trying to escape things yourself). Check each set of parameters (both lat and long, for example), and assign the input weighting to a new column (per attribute), which can be summed up into a 'total' column (which must be over some threshold).
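The "nullable in sets" constraint mentioned for latitude/longitude could look something like the sketch below. It uses SQLite through Python's sqlite3 module; the location table and its column names are assumptions for illustration. MySQL only enforces CHECK constraints from version 8.0.16 onward, so on older versions the same rule would need a trigger or application-level validation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical location table: latitude and longitude must be given together
# or both left NULL.
conn.execute(
    """
    CREATE TABLE location (
        location_id INTEGER PRIMARY KEY,
        city        TEXT,
        latitude    REAL,
        longitude   REAL,
        CHECK ((latitude IS NULL) = (longitude IS NULL))
    )
    """
)

# Both coordinates present: accepted.
conn.execute(
    "INSERT INTO location (city, latitude, longitude) VALUES ('Las Vegas', 36.17, -115.14)"
)
# Neither coordinate present: accepted.
conn.execute("INSERT INTO location (city) VALUES ('Unknown')")

# Only one half of the pair: rejected by the CHECK constraint.
try:
    conn.execute("INSERT INTO location (city, latitude) VALUES ('Broken', 36.17)")
    half_pair_rejected = False
except sqlite3.IntegrityError:
    half_pair_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM location").fetchone()[0]
print("half pair rejected:", half_pair_rejected, "| rows stored:", row_count)
```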
EDIT:
Using views:
For your structure, what you would normally do is left join to the attributes table multiple times (one for each attribute needed), keying off of the attribute (which should really be an int FK to a table; you don't want both 'title' and 'Title' in there) and joining on course_id - the value would be included as part of the select. Using this technique, it would be simple to then get the list of columns, which you can then apparently weight in Sphinx.
The problem with this is if you need to do any data conversion - you are betting that you'll be able to find all conversions if the type ever changes. When using strongly typed columns, this ranges from trivial (the likelihood is that you end up with a uniquely named column) to unnecessary (views usually take their datatype definitions from the fields in the query); with your architecture, you'll likely end up looking through too many false positives.
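The repeated left-join technique described above can be sketched as follows, using the courses and course_attributes tables from the question. SQLite (via Python's sqlite3) stands in for MySQL, the view name course_search and the sample rows are invented, and only two attributes are pivoted to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE courses (Course_ID INTEGER PRIMARY KEY, Creator_ID INTEGER);
    CREATE TABLE course_attributes (
        Attribute_ID INTEGER PRIMARY KEY,
        Course_ID    INTEGER REFERENCES courses(Course_ID),
        Attribute    TEXT,
        Value        TEXT
    );
    INSERT INTO courses VALUES (1, 10), (2, 11);
    INSERT INTO course_attributes (Course_ID, Attribute, Value) VALUES
        (1, 'title', 'Intro to SQL'),
        (1, 'description', 'Basics of querying'),
        (2, 'title', 'Advanced Joins');

    -- One LEFT JOIN per attribute turns attribute rows into columns.
    CREATE VIEW course_search AS
    SELECT c.Course_ID,
           t.Value AS title,
           d.Value AS description
    FROM courses AS c
    LEFT JOIN course_attributes AS t
           ON t.Course_ID = c.Course_ID AND t.Attribute = 'title'
    LEFT JOIN course_attributes AS d
           ON d.Course_ID = c.Course_ID AND d.Attribute = 'description';
    """
)

view_rows = conn.execute(
    "SELECT Course_ID, title, description FROM course_search ORDER BY Course_ID"
).fetchall()
for row in view_rows:
    print(row)
```

Courses that lack an attribute still appear, with NULL in that column, which is what the left joins are for. A view like this could then serve as the source query for the search index.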
Database efficiency:
You're right, unfilled columns are wasted space. Usually, when something is optional(ish), that means you may need an additional table, which is why I suggested splitting off location into its own table: this prevents events which don't need a location (... what?) from 'wasting' the space, but then forces any event that defines a location to specify all required information. There's an additional benefit to splitting it off this way: if multiple events all use the same location (... not at the same time, we hope), a cross-reference table will save you a lot of space. Way more than your attributes table ever could (you're still having to store the complete location per event, after all). If you still have a lot of 'optional' attributes, I hear that NoSQL is made for these kinds of things (but I haven't really looked into it). However, other than that, the cost of an additional table is trivial; the cost of the data inside may not be, but the space required is weighed against the perceived value of the data stored. Remember that disk space is relatively cheap - it's developer/maintainer time that is expensive.
Side note for addresses:
You are probably going to want to create an address table. This would be completely divorced from the event information, and would include (among other things) the precomputed latitude/longitude (in the recommended datatype - I don't know what it is, but it's for sure not a comma-separated string). You would then have an event_address table that would be the cross-reference between the events and where they take place - if there is additional information (such as room), that should be kept in a location table that is referenced (instead of referencing address directly). Once a lat/long value is computed, you should never need to change it.
Thoughts on later updates for lat/long:
While specifying the lat/long values yourself is better, you're going to want to make them a required part of the address table (or part of/in addition to a purely lat/long only table). Frankly, multi-value columns (delimited lists) of any sort are just begging for trouble - you keep having to parse them every time you search on them (among other related issues). And the moment you make them separate rows, one of the pair will eventually get dropped - Murphy himself will personally intervene, if necessary. Additionally, updating them at different times from the addresses will result in an address having a lat/long pair that does not match; your best bet is to compute this at insertion time (there are a number of webservices to find this information for you).
Multi-domain tables:
With a multi-domain table, you're basically betting that the domain key (attribute) will never become out-of-sync with the value (err, value). I don't care how good you are, somewhere, somehow, it's going to happen: at my company, we had one of these in our legacy application (it stored FK links and which files the FKs refer to, along with an attribute). At one point an application was installed in production which promptly began storing the correct file links, but the FK links to a different file, for a given class of attribute. Thankfully, there were audit records in another file which allowed this to be reversed (... as near as they were able to tell).
In summary:
Revisit your required/optional data. Don't be afraid to create additional tables, each for a single entity, with every column for a single domain; you will also need relationship tables. You may also wish to place your audit data (last_updated_time) in a set of separate tables (single-domain tables will help immensely in this regard).
In the Sphinx config you define your index and the SQL queries that populate it. You can define basic attributes, see Sphinx Attributes.
Sphinx also supports geo searches on lat/long, but they need to be expressed in radians, definitely not text columns like you have. I agree with X-Zero that storing lat/lng values as strings is a bad idea.
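As a small illustration of the radians requirement mentioned above, here is a sketch of converting numeric degree values before they are handed to the search index. The function name and the sample coordinates are assumptions for the example; how the values are then fed to Sphinx is not shown.

```python
import math

def degrees_to_radians(lat_deg, lng_deg):
    """Convert a latitude/longitude pair from degrees to radians."""
    return math.radians(lat_deg), math.radians(lng_deg)

# Hypothetical coordinates, stored as numbers rather than as text.
lat_rad, lng_rad = degrees_to_radians(52.52, 13.405)
print(lat_rad, lng_rad)
```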

To serialize or to keep a separate table?

This question has risen on many different occasions for me but it's hard to explain without giving a specific example. So here goes:
Let's imagine for a while that we are creating an issue tracker database in PHP/MySQL. There is a "tasks" table. Now you need to keep track of people who are associated with a particular task (have commented or what not). These persons will get an email when a task changes.
There are two ways to solve this type of situation. One is to create a separate table task_participants:
CREATE TABLE IF NOT EXISTS `task_participants` (
`task_id` int(10) unsigned NOT NULL,
`person_id` int(10) unsigned NOT NULL,
UNIQUE KEY `task_id_person_id` (`task_id`,`person_id`)
);
And to query this table with SELECT person_id FROM task_participants WHERE task_id='XXX'.
If there are 5000 tasks and each task has 4 participants on average (the reporter, the subject for whom the task brought benefit, the solver and one commenter) then the task_participants table would have 5000*4 = 20 000 rows.
There is also another way: create a field in the tasks table and store a serialized array (JSON or PHP serialize()) of person_id's. Then there would be no need for this 20 000 row table.
What are your comments, which way would you go?
Go with the multiple records. It promotes database normalization. Normalization is very important. Updating a serialized value is no fun to maintain. With multiple records I can let the database do the work with INSERT, UPDATE and DELETE. Also, you are limiting your future joins by having a multivalued column.
Definitely do the cross reference table (the first option you listed). Why?
First of all, do not worry about the size of the cross reference table. Relational databases would have been out on their ear decades ago if they could not handle the scale of a simple cross reference table. Stop worrying about 20K or 200K records, etc. In fact, if you're going to worry about something like this, it's better to start worrying about why you've chosen a relational DB instead of a key-value DB. After that, and only when it actually starts to be a problem, then you can start worrying about adding an index or other tuning techniques.
Second, if you serialize the association info, you're probably opaque-ifying a whole dimension of your data that only your specialized JSON-enabled app can query. Serialization of data into a single cell in a table typically only makes sense if the embedded structure is (a) not something that contains data you would ever need to query outside your app, (b) not something you need to query the internals of efficiently (e.g., avg count(*) of people with tasks), and (c) just something you either do not have time to model out properly or that is in a prototypical state. So I say probably above, because it's not usually the case that data worth persisting fits these criteria.
Finally, by serializing your data, you are now forced to solve any computation on that serialized data in your code, which is just a big waste of time that you could have spent doing something more productive. Your database already can slice and dice that data any way you need, yet because your data is not in a format it understands, you need to now do that in your code. And now imagine what happens when you change the serialized data structure in V2.
I won't say there aren't use cases for serializing data (I've done it myself), but based on your case above, this probably isn't one of them.
There are a couple of great answers already, but they explain things in rather theoretical terms. Here's my (essentially identical) answer, in plain English:
1) 20k records is nothing to MySQL. If it gets up into the 20 million record range, then you might want to start getting concerned - but it still probably won't be an issue.
2) OK, let's assume you've gone with concatenating all the people involved with a ticket into a single field. Now... Quick! Tell me how many tickets Alice has touched! I have a feeling that Bob is screwing things up and Charlie is covering for him - can you get me a list of tickets that they both worked on, divided up by who touched them last?
With a separate table, MySQL itself can find answers to all kinds of questions about who worked on what tickets and it can find them fast. With everything crammed into a single field, you pretty much have to resort to using LIKE queries to find the (potentially) relevant records, then post-process the query results to extract the important data and summarize it yourself.
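The questions about Alice, Bob and Charlie above can be sketched against the task_participants table from the question. This uses Python's built-in sqlite3 instead of MySQL so it runs anywhere; the person ids (1 = Alice, 2 = Bob, 3 = Charlie) and the sample rows are assumptions made for the example.

```python
import sqlite3

ALICE, BOB, CHARLIE = 1, 2, 3  # hypothetical person ids

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_participants (
        task_id   INTEGER NOT NULL,
        person_id INTEGER NOT NULL,
        UNIQUE (task_id, person_id)
    )
""")
conn.executemany(
    "INSERT INTO task_participants VALUES (?, ?)",
    [(1, ALICE), (1, BOB),
     (2, ALICE), (2, BOB), (2, CHARLIE),
     (3, CHARLIE),
     (4, BOB), (4, CHARLIE)],
)

# How many tickets has Alice touched?
alice_count = conn.execute(
    "SELECT COUNT(*) FROM task_participants WHERE person_id = ?", (ALICE,)
).fetchone()[0]

# Which tickets did Bob and Charlie both work on?
shared = [row[0] for row in conn.execute("""
    SELECT task_id
    FROM task_participants
    WHERE person_id IN (?, ?)
    GROUP BY task_id
    HAVING COUNT(DISTINCT person_id) = 2
    ORDER BY task_id
""", (BOB, CHARLIE))]

print(alice_count, shared)  # prints: 2 [2, 4]
```

With a serialized participant list in the tasks table, both answers would require loading and decoding every task row in application code.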

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I once tried using this model with a table with 120+ attributes (growing by 5-10 every year) and adding about 100k+ rows every 6 months. The indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for every situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually choose a generous length, thus increasing the size of the indexes. In my case, attribs could have from 3 to 130 chars. Also, the value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a group_concat(), due to its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either have lookup tables for attribute values or make them ENUM().
EDIT 2: Of course, one could say I should have a look-up table for attribute possible values (reducing index sizes), but I should then make a join on that table.
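The join cost this answer describes can be seen in a small sketch that runs the same two-attribute search against an attribute-value table and against a wide table. It uses Python's built-in sqlite3 as a stand-in for MySQL; the table names (user_attr, users_wide), the attributes (city, smoker) and the sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Attribute-value layout: one row per (user, attribute).
CREATE TABLE user_attr (
    user_id INTEGER NOT NULL,
    attrib  TEXT    NOT NULL,
    value   TEXT    NOT NULL,
    PRIMARY KEY (user_id, attrib)
);
-- Wide layout: one column per attribute.
CREATE TABLE users_wide (
    user_id INTEGER PRIMARY KEY,
    city    TEXT NOT NULL,
    smoker  TEXT NOT NULL
);
""")

people = [(1, "Paris", "no"), (2, "Paris", "yes"), (3, "Rome", "no")]
for user_id, city, smoker in people:
    conn.execute("INSERT INTO users_wide VALUES (?, ?, ?)", (user_id, city, smoker))
    conn.execute("INSERT INTO user_attr VALUES (?, 'city', ?)", (user_id, city))
    conn.execute("INSERT INTO user_attr VALUES (?, 'smoker', ?)", (user_id, smoker))

# Attribute-value search: one self-join per searched attribute.
eav_hits = [r[0] for r in conn.execute("""
    SELECT a.user_id
    FROM user_attr a
    JOIN user_attr b ON b.user_id = a.user_id
    WHERE a.attrib = 'city'   AND a.value = 'Paris'
      AND b.attrib = 'smoker' AND b.value = 'no'
""")]

# Wide-table search: one extra condition per searched attribute.
wide_hits = [r[0] for r in conn.execute(
    "SELECT user_id FROM users_wide WHERE city = 'Paris' AND smoker = 'no'"
)]

print(eav_hits, wide_hits)  # prints: [1] [1]
```

Searching on 30-40 attributes in the attribute-value layout would repeat the self-join 30-40 times, while the wide table only gains more conditions in the WHERE clause.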
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for mysql. So, before trying to solve a problem make sure you actually have it.
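A rough sketch of the "generate dummy data and measure first" step. It swaps in Python's built-in sqlite3 and a simple timing loop for MySQL and ab, so the numbers it prints say nothing about MySQL itself; the table layout, the column names and the 100,000-row size (kept small so the example runs quickly) are assumptions for illustration.

```python
import random
import sqlite3
import time

def build_dummy_users(conn, n_rows, seed=0):
    """Fill a hypothetical users table with random but repeatable data."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER, city_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users (age, city_id) VALUES (?, ?)",
        ((rng.randint(18, 80), rng.randint(1, 500)) for _ in range(n_rows)),
    )
    conn.execute("CREATE INDEX idx_users_city_age ON users (city_id, age)")

def time_query(conn, sql, params=(), repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return best

conn = sqlite3.connect(":memory:")
build_dummy_users(conn, 100_000)
elapsed = time_query(
    conn,
    "SELECT id FROM users WHERE city_id = ? AND age BETWEEN ? AND ?",
    (42, 25, 35),
)
print(f"best of 5 runs: {elapsed * 1000:.2f} ms")
```

The same idea applies to the real schema: load a realistic volume of fake rows, run the queries the site actually issues, and only then decide whether there is a performance problem to solve.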
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a few heuristics and you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value)))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value)))
) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop-band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
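To make the points-scoring idea concrete, here is a small runnable sketch. It assumes the intended score per shared attribute is 1/(1 + d²), where d is the difference between the two users' values, and it uses Python's built-in sqlite3 in place of MySQL. The users and attributes table layouts follow the queries above; the sample users, attribute names and tolerance are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE attributes (
    user_id    INTEGER NOT NULL,
    attr_type  TEXT    NOT NULL,
    attr_value INTEGER NOT NULL,   -- ordinal values only
    PRIMARY KEY (user_id, attr_type)
);
""")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)", [
    (1, "age", 30), (1, "height", 170),   # the current user
    (2, "age", 31), (2, "height", 170),
    (3, "age", 30), (3, "height", 172),
    (4, "age", 40), (4, "height", 170),   # age falls outside the tolerance
])

matches = conn.execute("""
    SELECT candidate.id,
           SUM(1.0 / (1 + (ca.attr_value - me.attr_value)
                        * (ca.attr_value - me.attr_value))) AS match_score
    FROM users candidate
    JOIN attributes ca ON ca.user_id = candidate.id
    JOIN attributes me ON me.attr_type = ca.attr_type
    WHERE me.user_id = :me
      AND candidate.id <> :me
      AND ca.attr_value BETWEEN me.attr_value - :tol AND me.attr_value + :tol
    GROUP BY candidate.id
    ORDER BY match_score DESC
""", {"me": 1, "tol": 2}).fetchall()

for user_id, score in matches:
    print(user_id, round(score, 2))
```

User 2 ranks first (one exact match plus one value off by 1), user 3 second (one exact match plus one value off by 2), and user 4 last because only the height attribute falls within the tolerance. Note the 1.0 in the numerator: SQLite would otherwise perform integer division on these integer columns.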
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier - then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to the stereotype.
Can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in 1 table.
Also - you can easily enforce referential integrity with foreign keys using transactions and the InnoDB engine.

Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?

I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.
Some members of my team suggest creating tables using months as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Any suggestions on how to deal with this amount of data? Thanks.
Thank you for all the feedback.
I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:
You need to remember to create a new table every year or else your app breaks.
Querying aggregates against all rows regardless of year is harder.
Updating a date potentially means moving a row from one table to another.
It's harder to guarantee the uniqueness of pseudokeys across multiple tables.
My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).
Seems like it should be just fine holding everything in one table. It will make retrieval much easier in the future to maintain 1 table, as opposed to 12 tables per year. At 73,200 records per day it will take you almost 4 years to hit 100,000,000, which is still well within MySQL's capabilities.
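The arithmetic behind that estimate, as a quick sketch (a 365-day year is assumed):

```python
ROWS_PER_DAY = 73_200

rows_per_year = ROWS_PER_DAY * 365
days_to_100m = 100_000_000 / ROWS_PER_DAY
years_to_100m = days_to_100m / 365

print(f"{rows_per_year:,} rows per year")                       # 26,718,000 rows per year
print(f"{years_to_100m:.2f} years to reach 100,000,000 rows")   # 3.74 years
```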
Absolutely not.
It will ruin relationship between tables.
Table relations are built on field values, not table names.
Especially for this very table, which will grow by just 300 MB/year.
So in 100 days you have 7.3M rows, or about 27M a year. 27M rows isn't a lot anymore. MySQL can handle tables with millions of rows. It really depends on your hardware and your query types and query frequency.
But you should be able to partition that table (if MySQL supports partitioning), what you're describing is an old SQL Server method of partition. After building those monthly tables you'd build a view that concatenates them together to look like one big table... which is essentially what partitioning does but it's all under-the-covers and fully optimized.
Usually this creates more trouble than it's worth, it's more maintenance , your queries need more logic, and it's painful to pull data from more than one period.
We store 200+ million time based records in one (MyISAM) table, and queries are still blazingly fast.
You just need to ensure there's an index on your time/date column and that your queries make use of the index (e.g. a query that messes around with DATE_FORMAT or similar on a date column will likely not use an index). I wouldn't put them in separate tables just for the sake of retrieval performance.
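The point about functions on the date column can be illustrated with a query planner. This sketch uses Python's built-in sqlite3 rather than MySQL, with strftime playing the role of DATE_FORMAT, so the plan text is SQLite's and not what MySQL's EXPLAIN would print; the readings table and its columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (
    id       INTEGER PRIMARY KEY,
    taken_at TEXT    NOT NULL,   -- ISO-8601 date string
    value    INTEGER NOT NULL
);
CREATE INDEX idx_readings_taken_at ON readings (taken_at);
""")

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# A plain range comparison on the indexed column can use the index.
range_plan = plan("""
    SELECT id, value FROM readings
    WHERE taken_at >= '2010-09-01' AND taken_at < '2010-10-01'
""")

# Wrapping the column in a function hides it from the index.
wrapped_plan = plan("""
    SELECT id, value FROM readings
    WHERE strftime('%Y-%m', taken_at) = '2010-09'
""")

print(range_plan)
print(wrapped_plan)
```

The first plan reports an index search on idx_readings_taken_at, while the second falls back to scanning the whole table. Both queries return the same rows, so rewriting the filter as a range on the bare column is usually the cheaper choice.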
One thing that gets very painful with such a large number of records is when you have to delete old data; this can take a long time (10 minutes to 2 hours for e.g. wiping a month's worth of data in tables with hundreds of millions of rows). For that reason we've partitioned the tables, and use a time_dimension (see e.g. the time_dimension table a bit down here) relation table for managing the periods instead of simple date/datetime columns or strings/varchars representing dates.
Some members of my team suggest creating tables using months as the table name (september_2010), while others are suggesting having one table with lots of data in it...
Don't listen to them. You're already storing a date stamp, what about different months makes it a good idea to split the data that way? The engine will handle the larger data sets just fine, so splitting by month does nothing but artificially segregate the data.
My first reaction is: Aaaaaaaaahhhhhhhhh!!!!!!
Table names should not embed data values. You don't say what the data means, but supposing for the sake of argument it is, I don't know, temperature readings. Just imagine trying to write a query to find all the months in which average temperature increased over the previous month. You'd have to loop through table names. Worse yet, imagine trying to find all 30-day periods -- i.e. periods that might cross month boundaries -- where temperature increased over the previous 30-day period.
Indeed, just retrieving an old record would go from a trivial operation -- "select * where id=whatever" -- to a complex operation requiring you to have the program generate table names from the date on the fly. If you didn't know the date, you would have to scan through all the tables, searching each one for the desired record. Yuck.
With all the data in one properly-normalized table, queries like the above are pretty trivial. With separate tables for each month, they're a nightmare.
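A sketch of the first query mentioned in this answer (months whose average rose over the previous month), run against a single table. It uses Python's built-in sqlite3 in place of MySQL, and the readings table, its columns and the sample temperatures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, taken_at TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO readings (taken_at, value) VALUES (?, ?)",
    [("2010-07-10", 20), ("2010-07-20", 20),
     ("2010-08-05", 22),
     ("2010-09-15", 21),
     ("2010-10-01", 25)],
)

# One GROUP BY over the single table yields every month's average.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', taken_at) AS month, AVG(value) AS avg_value
    FROM readings
    GROUP BY month
    ORDER BY month
""").fetchall()

# Compare each month with the one before it.
increased = [cur[0] for prev, cur in zip(monthly, monthly[1:]) if cur[1] > prev[1]]
print(increased)  # prints: ['2010-08', '2010-10']
```

With one table per month, the same result would require generating a table name for every month in the range and stitching the per-table results together in application code.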
Just make the date part of the index and the performance penalty of having all the records in one table should be very small. If the size of the table really becomes a performance problem, I could dimly comprehend making one table for archive data with all the old stuff and one for current data with everything you retrieve regularly. But don't create hundreds of tables. Most database engines have ways to partition your data across multiple drives using "table spaces" or the like. Use the sophisticated features of the database if necessary, rather than hacking together a crude simulation.
Depends on what searches you'll need to do. If normally constrained by date, splitting is good.
If you do split, consider naming the tables like foo_2010_09 so the tables will sort alphanumerically.
what is your DB platform?
In SQL Server 2K5+ you can partition on date.
My bad, I didn't notice the tag. #thetaiko is right though, and this is well within MySQL's capabilities to deal with.
I would say it depends on how the data is used. If most queries are done over the complete data, it would be an overhead to always join the tables back together again.
If you most times only need a part of the data (by date), it is a good idea to segment the tables into smaller pieces.
For the naming I would do tablename_yyyymm.
Edit: For sure you should then also think about another layer between the DB and your app to handle the segmented tables depending on some date given. Which can then get pretty complicated.
I'd suggest dropping the year and just having one table per month, named after the month. Archive your data annually by renaming all the tables $MONTH_$YEAR and re-creating the month tables. Or, since you're storing a timestamp with your data, just keep appending to the same tables. I assume by virtue of the fact that you're asking the question in the first place, that segregating your data by month fits your reporting requirements. If not, then I'd recommend keeping it all in one table and periodically archiving off historical records when performance gets to be an issue.
I agree that this idea complicates your database needlessly. Use a single table. As others have pointed out, it's not nearly enough data to warrant extraneous handling. Unless you use SQLite, your database will handle it well.
However it also depends on how you want to access it. If the old entries are really only there for archival purposes, then the archive pattern is an option. It's common for versioning systems to have the infrequently used data separated out. In your case you'd only want everything >1 year to move out of the main table. And this is strictly a database administration task, not an application behavior. The application would only join the current list and the _archive list, if at all. Again, this highly depends on the use case. Are the old entries generally needed? Is there too much data to process regularly?
