more efficient database structure across multiple tables

more efficient database structure across multiple tables - php

I am setting up a MySQL database with multiple tables. Several of the tables will have fields with similar names that aren't necessarily for the same purpose.
For example, there's a users table that will have a name field, a category table with a name field and so on.
I've previously seen this setup up either with or without a preface to the field name, so in the above example using user_name, cat_name etc.
As these are all in separate tables, is there any benefit to structuring the database with or without this preface? I know that when using joins and calling the data through PHP you have to add a SELECT users.name AS username... to keep the fields from overwriting each other when using mysql_fetch_array. But i'm not sure if there's any efficiencies in using one method over the other?

It depends on what your shop does or your preference. There is nothing about a prefix that will make this better. Personally I would just keep it as name since: Users.Name and Orders.Name and Products.Name all contain tuples with different object types.
At the end of the day you want to be consistent. If you prefer a cat_ and a user_ prefix just be consistent with your design and include this prefix for all object types. To me less is more.

It's really just a matter of preference. I personally prefer the approach of using just name.
One thing to watch out for though, if you're doing any SELECT * FROM ... queries (which you shouldn't be; always select fields explicitly), you may end up selecting the wrong data.

One disadvantage is if anyone is stupid enough to use natural joins (you can guess that I find this a poor practice but mysql does allow it so you need to consider if that will happen) you may end up joining on those fields with the same name by accident.

Related

Using SELECT * or SELECT all, cols is better for Queries [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of database.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen however, this still seems like a bad idea and SELECT * should never be used for a lot more technical reasons that I ever though about.

One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason select queries are bad index coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.

Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs including eliminating some potential covering index, causing data-page loads (and server cache thrashing), incurring row / page / table locks that might be otherwise avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, this all comes down, because of the way you specified the question, to the issue resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates and other similar things that get added by DBAs for systemically for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system or user-defined fields.

You should only select the columns that you need. Even if you need all columns it's still better to list column names so that the sql server does not have to query system table for columns.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect too and it might not know how to process them.
Apart from this if the table has a binary column then the query will be much more slower and use more network resources.

There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.

Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.

Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.

SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.

Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.

Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.

definitely defining the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.

It's always better to specify the columns you need, if you think about it one time, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone later may add columns to the table that you actually do not need in your query and you'll be better off in that case by specifying all of your columns.

The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for you query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bring those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system start to evolve, you'll see that it's worth to spend this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.

While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... query may get awful). Then - measure. Compare with query with column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.

You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.

It is NOT faster to use explicit field names versus *, if and only if, you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that's a nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.

The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).

Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).

Performance wise I have seen comments that both are equal. but usability aspect there are some +'s and -'s
When you use a (select *) in a query and if some one alter the table and add new fields which do not need for the previous query it is an unnecessary overhead. And what if the newly added field is a blob or an image field??? your query response time is going to be really slow then.
In other hand if you use a (select col1,col2,..) and if the table get altered and added new fields and if those fields are needed in the result set, you always need to edit your select query after table alteration.
But I suggest always to use select col1,col2,... in your queries and alter the query if the table get altered later...

This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using Select , it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt selecting column names means faster performance over Select *

Select is equally efficient (in terms of velocity) if you use * or columns.
The difference is about memory, not velocity. When you select several columns SQL Server must allocate memory space to serve you the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the excecution plan which in turn depends heavily on your WHERE clause and the number of JOIN, OUTER JOIN, etc ...
For your question just use SELECT *. If you need all the columns there's no performance difference.

It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.

One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.

As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is a defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.

To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.

SELECT * is necessary if one wants to obtain metadata such as the number of columns.

Gonna get slammed for this, but I do a select * because almost all my data is retrived from SQL Server Views that precombine needed values from multiple tables into a single easy to access View.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)

What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.

And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQl server is hard or even time-consuming. You just drag them over from the object browser (you can get all in one go by dragging from the word columns). To put a permanent performance hit on your system (becasue this can reduce the use of indexes and becasue sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see for instance) just to save less than a minute of development time is short-sighted and unprofessional.

Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"

If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)

Speed of SELECT Distinct vs array unique

I am using WordPress with some custom post types (just to give a description of my DB structure - its WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID postID city state
auto int varchar varchar
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forced me to add/update cities and states for every post, even though this will cause duplicates (in the city/state fields). This must be so because I must keep track of which states/cities exist (actually have a post associated with them). When a post is added or deleted, it takes its record to or from the cities table with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.

To answer your question about using SELECT distinct vs. array_unique, I would say that I would almost always prefer to limit the result set in the database assuming of course that you have an appropriate index on the field for which you are trying to get distinct values. This saves you time in transmitting extra data from DB to application and for the application reading that data into memory where you can work with it.
As far as your separate table design, it is hard to speculate whether this is a good approach or not, this would largely depend on how you are actually preforming your query (i.e. are you doing two separate queries - one for post info and one for city/state info or querying across a join?).
The is really only one definitive way to determine what is fastest approach. That is to test both ways in your environment.

1) Fully normalized table(when it have only integer values and other tables have only one int+varchar) have advantage when you not dooing full table joins often and dooing alot of search on normalized fields. As downside it require large join/sort buffers and result more complex queries=much less chance query will be auto-optimized by mysql. So you have optimize your queries yourself.
2)Select distinct will be faster in almost any cases. Only case when it will be slower - you have low size sort buffer in /etc/my.conf and much more size memory buffer for php.
Distinct select can use indexes, while your code can't.
Also sending large amount of data to your app require alot of mysql cpu time and real time.

MySQL database optimization for 20.000 users or more

I have been looking for some optimization tips since I´m doing a RPG modification which uses MySQL to store data by PHP.
I´m using one unique table to store all user information in columns by his unique ID, and I have to store (many?) data for each user. Weapons and other information.
I´m using explode and implode as a method to store the weapons, for example, in one column with the 'text' value. I don´t know if that´s a good practice and I don´t know if I will have performance problems if I get thousands of players doing tons of UPDATES , SELECT , etc, requests.
I read that a Junction table may be better to store the weapons and all those information, but I don´t know if that will get better information that you request it by the explode method.
I mean, I should store all the weapons in a different table, each weapon with his information (each weapon have some information, like different columns, I use multiple explode for that inside the main explode) and the user owner of that weapon to identify the weapon than just have them in one column.
It can be 100 items at least to store, I don´t know if it´s good to make 100 records per user on a different table and call all of them all the time better than just call the column and use explode.
Also I want to improve my skills and knowledge to make the best performance MySQL database I can.
I hope somebody can tell me something.
Thanks, and sorry for my stupid english grammar.

It is almost always best practice to normalize your table data. There are some exceptions to this rule (especially in very high volume databases), but you probably do not need to worry about those exceptions until you get to the point of first understanding how to properly normalize and index your tables.
Typically, try to arrange your tables in a way that mimics real-world objects and their relations to each other.
So, in your case you have users - that is one table. Each user might have multiple weapons. So, you now have a weapons table. Since multiple different users might have the same weapon and each user might have multiple weapons, you have a many-to-many relationship between them, so you should have a table "users_weapons" or similar that does nothing but relate user id's to weapon id's.
Now say the users can all have armor. So now you add an armor table and a users_armor table (as this is likely many-to-many as well).
Just think through the different aspects of your game and try to understand the relationships between them. Make sure you can model these relationships in database tables before you even bother writing any code to actually implement the functionality.

Yes it is better to use several tables instead of one. It's better to db performance, easier to understand, easier to maintain and simplier to use as well.
Let's suggest that one user has several weapons with multiple features(but not unique among all weapons). And in one place in your game you just need to know the value of one specific feature:
doing it by your way you'll need to find user row in users table, fetch on column, explode it several times, and there you have your value, but it complicates even more if you want to change it and save then.
better way is having one table for user details(login, password, email etc), another table which keeps user weapons(name of weapon, image maybe) and table in which will be all features, special powers of weapons kept. You could keep all possible features of all weapons in extra table as well. This way you if you already know user id from user table, you'll have to only join 2 tables in your sql query, and there you got value of feature of specific weapon of user.
Example pseudo schema of tables:
users
user_id
user_name
password
email
weapons
weapon_id
user_id
weapon_name
image
weapons_features
feature_id
weapon_id
feature_name
feature_value
And if you really want to use some ordered data in text field in database encode it to JSON or serialize it. This way you don't have to explode and implode it!

As all guys said, typically you should start from normalized database structure.
If performance is ok, then great, nothing to do.
If not, you can try many different things:
Find and optimize query which works slow.
Denormalize queries - sometimes joins kill performance.
Change data access pattern used in application.
Store data in file system or use NoSQL/polyglot persistence solution.

Storing database info as array

Which is good practice? To store data as a comma separated list in the database or have multiple rows?
I have a table for accounts, classes, and enrolments.
If the enrolment table has 3 fields: ID, AccountID and ClassID, is it better for ClassID to be a varchar containing a comma separated list such as this: "24,21,182,12" or for it to be just an int and have one entry per enrolment?

tldr: Don't do this. That is, don't use a "packed array" here.
Use a correctly normalized design with "multiple rows". This is likely a good candidate for a Many-to-Many relationship. Consider this structure:
Classes 1:M Enrollments(Class,Student) M:1 Students
Following a properly normalized design will reduce pain. In addition, here are some other advantages:
Referential integrity (use InnoDB)
Consistent model described with relationships
Type enforcement (can't have "foo,,")
JOIN and query without needing custom code
"What are the names of the students in class A?"
"Who is taking more than one class?"
Columns can be useful indexed (query performance)
Generally faster than handling locally in code
More flexible and consistent
Can attach attributes to enrollments such as status
No need to have code to handle serialization at access sites
More accommodating of placeholders and ORMs

Never ever ever cram multiple values into a single database field by combining them with some sort of delimiter, like a comma, or fixed length substrings. In the rare cases where this clearly gives a benefit in storage requirements or performance ... see rule #1: never ever ever. Ever.
When you cram multiple values into a single field, you sabatague all the clever features built into the database engine to help you retrieve and manipulate values.
Like let's say you have this -- I guess it's some sort of student database.
Plan A
student (student_id, account_id, class_id_mash)
Plan B
student (student_id, account_id)
student_class (student_id, class_id)
Okay, lets' say you want a list of all the students taking class #27. With Plan B you write
select student_id
from student join student_class on student.student_id=student_class.student_id
where class_id=27
Easy.
How would you do it with Plan A? You might think
select student_id
from student
where class_id_mash like '%27%'
But that will not only find all students in class 27, but also all those in class 127 or 272.
Okay, how about:
select student_id
from student
where class_id_mash like '%,27,%'
There, now we won't find 127 or 272! But, oops, we also won't find it if the 27 happens to be the first or last one in the list, because then there aren't commas on both sides.
So okay, maybe we could get around that with more rules about delimiters or with a more complex matching expression. But it would be unnecessariliy complex and painful.
And even if we did it, every search for class id has to be a full-fill sequential search. With one value per field and multiple records, you can create an index on the class_id field for fast, efficient retrieval. (Some database engines have ways to index into the middle of text fields, but again, why get into complicated solutions when there's an easy solution?)
How do we validate the class_id's? With separate fields, we can say "class_id references class" and the database engine will insure that we don't enter an illegal value. With the mash, no such free validation.

I have done both, but instead of storing the information in the database as comma seperated, I use another delimiter, such as | (so that I don't worry about formatting on insert into db). Its more about how often you will query the data

If you are only going to need the complete list, it is fine to store it as a comma separated value. But if you need to query the list, they should be stored separately.

Is there a way to make a table name dynamic in a query?

I am trying to create a Class-Inheritance design for products.
There is the base table that contains all the common fields. Then for each product type there is a separate table containing the fields that are for that product type only
So in order to get all the data for a product I need to JOIN the base table with whatever table that correlates to the product_type listed in the base table. Is there a way to make this query join on the table dynamically?
Here is a query to try to illustrate what I am trying to do:
SELECT * FROM product_base b
INNER JOIN <value of b.product_type> t
ON b.product_base_id = t.product_base_id
WHERE b.product_base_id = :base_id
Is there a way to do this?

No, there's no way to do this. The table name must be known at the time of parsing the query, so the parser can tell if the table exists, and that it contains the columns you reference. Also the optimizer needs to know the table and its indexes, so it can come up with a plan of what indexes to use.
What you're asking for is for the table to be determined during execution, based on data found row-by-row. There's no way for the RDBMS to know at parse-time that all the data values correspond to real tables.
There's no reason you would do this to implement Class Table Inheritance. CTI supports true references between tables.
You're instead describing the antipattern of Polymorphic Associations.

Make 2 queries:
First select < value of b.product_type > and then use it in the second one (the one that you have, but replace < value of b.product_type > with the result from the first one).

No. There would be little point even if it were possible, as the query optimiser would not be able to make a plan without knowing anything about the right- hand side of the join.
You need to construct the query using concatenation or similar, but make sure that you only use a valid table name to avoid injection attacks.

You can create a procedure that takes the table name as an argument and constructs a dynamic-SQL query. But it's probably easier to do this in your server-side code (PHP). But rather than make it a variable (and as suggested vulnerable to injection attacks), create separate classes for the different join combinations. Use another class (like a dispatcher) to determine the correct class to instantiate.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.