Soon I'll be working on a catalog (PHP + MySQL) that will support multilingual content, and I'm considering the best way to design the database structure. At the moment I see 3 ways to handle multiple languages:
1) Having separate tables for each language's data. Schematically it'll look like this:
There will be one table, Main_Content_Items, storing basic data that cannot be translated, like ID, creation_date, hits, votes and so on. There will be only one such table, and it will serve all languages.
And here are the tables that will be duplicated for each language:
Common_Data_LANG table (example: common_data_en_us), storing common/"static" fields that can be translated and are present for any catalog item: title, desc and so on...
Extra_Fields_Data_LANG table, storing extra field data that can be translated but can differ between custom item groups, i.e. like: | id | item_id | field_type | value | ...
Then, when items are requested, we will look in the table that matches the user/default language and join the translatable data with the main_content table.
Pros:
we can update "main" data(i.e. hits, votes...) that are updated most often with only one query
we don't need to duplicate data 4x or more times if we have 4 or more languages, in comparison with a structure using only one table with a 'lang' field. So MySQL queries would take less time to go through a catalog of 100000 records (for example) rather than 400000 or more
Cons:
+2 tables for each language
2) Using 'lang' field in content tables:
Main_Content_Items table (storing basic data that cannot be translated, like ID, creation_date, hits, votes and so on...)
Common_Data table (storing common/"static" fields that can be translated and are present for any catalog item: | id | item_id | lang | title | desc | and so on...)
Extra_Fields_Data table (storing extra fields data that can be translated, but can be different for custom item groups, i.e. like: | id | item_id | lang | field_type | value | ...)
So we'll join common_data and extra_fields to main_content_items according to 'lang' field.
Pros:
we can update "main" data(i.e. hits, votes...) that are updated most often with only one query
we only need 3 tables for content data
Cons:
we have common_data and extra_fields tables filled with data for all languages, so they are X times bigger and queries run slower
3) Same as 2nd way, but with Main_Content_Items table merged with Common_Data, that has 'lang' field:
Pros:
...?
Cons:
we need to update "main" data (i.e. hits, votes...), which are updated most often, once for every language
we have common_data and extra_fields tables filled with data for all languages, so they are X times bigger and queries run slower
Will be glad to hear suggestions about "what is better" and "why"? Or are there better ways?
Thanks in advance...
I've given a similar answer in this question and highlighted the advantages of this technique (for example, it would be important for me to let the application decide on the language and build the query accordingly, changing only the lang parameter in the WHERE clause of the SQL query).
This gets pretty close to your second solution. I didn't quite get the "extra_fields" part, but if it makes sense, you could(!) merge it into the common_data table. I would advise against the first idea, since there will be too many tables and it is easy to lose track of the items in them.
To your edit: I still consider the second approach the better one (it's my opinion, so it's relative ;)). I'm no expert on optimization, but I think that with proper indexes and a proper table structure speed should not be a problem. As always, the best way to find the most effective approach is to try both methods and see which is best, since speed will vary with data, structure, ...
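As a rough illustration of the second approach (a 'lang' column in the translatable table), here is a minimal sketch. It uses Python with an in-memory SQLite database purely so that it runs on its own; the thread itself is about PHP and MySQL. Table names follow the question, the column is called description because desc is a reserved word, and the sample rows are invented.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_content_items (
        id            INTEGER PRIMARY KEY,
        creation_date TEXT NOT NULL,
        hits          INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE common_data (
        item_id     INTEGER NOT NULL REFERENCES main_content_items(id),
        lang        TEXT NOT NULL,
        title       TEXT NOT NULL,
        description TEXT,
        PRIMARY KEY (item_id, lang)  -- one translation per item and language
    );
    INSERT INTO main_content_items (id, creation_date) VALUES (1, '2010-01-01');
    INSERT INTO common_data VALUES (1, 'en_us', 'Red chair', 'A red chair');
    INSERT INTO common_data VALUES (1, 'de_de', 'Roter Stuhl', 'Ein roter Stuhl');
""")


def fetch_item(item_id, lang):
    """Return (id, hits, title) for one item in the requested language, or None."""
    return conn.execute(
        """SELECT m.id, m.hits, c.title
           FROM main_content_items AS m
           JOIN common_data AS c ON c.item_id = m.id
           WHERE m.id = ? AND c.lang = ?""",
        (item_id, lang),
    ).fetchone()


# Language-independent counters are updated once, whatever the language count.
conn.execute("UPDATE main_content_items SET hits = hits + 1 WHERE id = ?", (1,))

print(fetch_item(1, "en_us"))
print(fetch_item(1, "de_de"))
```

Only the lang parameter changes between the two calls, and the composite primary key (item_id, lang) doubles as the index the lookup needs.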
I have a clue on how to do this, but I was wondering if there are other methods out there, maybe a "best practice" approach.
I have a page that lists a number of datasets that can be found in a "catalogue" table in mysql, like the one below.
+----+----------+------+--------------------------+
| id | name | type | listItems |
+----+----------+------+--------------------------+
| 1 | dataset1 | SQL | id, name, location, type |
| 2 | dataset2 | SQL | id, gdp, import, export |
+----+----------+------+--------------------------+
The datasets are different, have different structures etc. What I'm trying to achieve is that when I click one of these links, I'm being shown all the records in the respective table. Normally this is a matter of extracting data from a table, but as I mentioned, the data could be different. From the first dataset, I want to list the id, name, location and type field, whereas from the second dataset, I'm looking for id, gdp, import, export and abbreviation. Not only are the columns different, but I don't want to extract all columns, just some.
My initial thought was to have an extra column in the catalogue table (the listItems column), specifying each table's default columns to be extracted. These would be stored in the following format:
id, name, location, type
Then, when I list items, I identify which dataset I'm using, I'm extracting these values from the catalogue table and then I query the database.
Is there a better way to do this?
You are part way there.
Next, you write PHP code to create the SELECT statement using the dataset name and list of columns.
After that, you may realize that you want different formatting: right justified numbers, maybe with commas; anchor tags for values that look like hyperlinks; left justify strings; etc.
How far do you want to take this? It can all be done in PHP, and there is where most of it belongs. Your "catalog" is about the only thing to store in the database, and very little is done via SQL.
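The step of building the SELECT statement from the catalogue row could look roughly like the sketch below. It is written in Python with an in-memory SQLite database only so that it runs on its own (the answer has PHP and MySQL in mind). The dataset1 table contents, the build_select helper and the identifier check are assumptions added for illustration. Because table and column names cannot be bound as query parameters, the sketch validates them before interpolating them.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalogue (id INTEGER PRIMARY KEY, name TEXT, type TEXT, listItems TEXT);
    INSERT INTO catalogue VALUES (1, 'dataset1', 'SQL', 'id, name, location, type');
    CREATE TABLE dataset1 (id INTEGER, name TEXT, location TEXT, type TEXT, notes TEXT);
    INSERT INTO dataset1 VALUES (1, 'alpha', 'Oslo', 'a', 'not listed');
""")

# Identifiers are interpolated into the SQL text, so accept plain names only.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_select(catalogue_id):
    """Build a SELECT for the dataset described by one catalogue row."""
    row = conn.execute(
        "SELECT name, listItems FROM catalogue WHERE id = ?", (catalogue_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no catalogue entry {catalogue_id}")
    table, list_items = row
    columns = [c.strip() for c in list_items.split(",")]
    for ident in [table, *columns]:
        if not IDENTIFIER.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    quoted = ", ".join(f'"{c}"' for c in columns)  # MySQL would use backticks
    return f'SELECT {quoted} FROM "{table}"', columns


sql, columns = build_select(1)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The returned column list can then drive the formatting decisions (justification, links and so on) in application code.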
For my MySQL tables I am using the InnoDB engine, and the structure of my tables looks like this:
Table user
id | username | etc...
----|------------|--------
1 | bruce | ...
2 | clark | ...
3 | tony | ...
Table user-emails
id | person_id | email
----|-------------|---------
1 | 1 | bruce#wayne-ent.com
2 | 1 | ceo#wayne-ent.com
3 | 2 | clark.k#daily-planet.com
To fetch data from the database I've written a tiny framework. E.g. on __construct($id) it checks whether there is a person with the given id; if yes, it creates the corresponding model and saves only the id field to an array. During runtime, if I need another field from the model, it fetches only that value from the database, saves it to the array and returns it. The same goes for the emails field: for that, my code accesses the user-emails table and gets all the emails for the corresponding user.
For small models this works alright, but now I am working on another project where I have to fetch a lot of data at once for a list and that takes some time. Also I know that many connections to MySQL and many queries are quite stressful for the server, so..
My question now is: Should I fetch all data at once (with left joins etc.) while constructing the model and save the fields as an array or should I use some other method?
Why do people insist on referring to the entities and domain objects as "models"?
Unless your entities are extremely large, I would populate the entire entity, when you need it. And, if "email list" is part of that entity, I would populate that too.
As I see it, the question is more related to "what to do with tables, that are related by foreign keys".
Let's say you have Users and Articles tables, where each article has a specific owner associated by a user_id foreign key. In this case, when populating the Article entity, I would only retrieve the user_id value instead of pulling in all the information about the user.
But in your example with Users and UserEmails, the emails seem to be a part of the User entity, and something that you would often call via $user->getEmailList().
TL;DR
I would do this in two queries, when populating User entity:
select all you need from Users table and apply to User entity
select all user's emails from the UserEmails table and apply it to User entity.
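The two queries listed above might be sketched like this. Python with an in-memory SQLite database is used only to keep the example self-contained; the User class, the load_user helper, the underscore table names and the example.com addresses are assumptions for illustration, not part of the original framework.

```python
import sqlite3
from dataclasses import dataclass, field

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE user_emails (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        email   TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'bruce'), (2, 'clark'), (3, 'tony');
    INSERT INTO user_emails (user_id, email) VALUES
        (1, 'bruce@example.com'), (1, 'ceo@example.com'), (2, 'clark@example.com');
""")


@dataclass
class User:
    id: int
    username: str
    emails: list = field(default_factory=list)


def load_user(user_id):
    """Populate a whole User entity with two queries; return None if absent."""
    # Query 1: everything needed from the users table.
    row = conn.execute(
        "SELECT id, username FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    user = User(*row)
    # Query 2: all of the user's emails, attached to the same entity.
    user.emails = [
        email
        for (email,) in conn.execute(
            "SELECT email FROM user_emails WHERE user_id = ? ORDER BY id", (user_id,)
        )
    ]
    return user


print(load_user(1))
```

Keeping the SQL in a loader function rather than in the entity itself is in the spirit of the data mapper pattern.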
P.S
You might want to look at data mapper pattern for "how" part.
In my opinion you should fetch all your fields at once, and divide queries in a way that makes your code easier to read/manage.
When we're talking about one query or two, the difference is usually negligible unless the combined query (with JOINs or whatever) is overly complex. Usually an index or two is the solution to a very slow query.
If we're talking about one vs hundreds or thousands of queries, that's when the connection/transmission overhead becomes more significant, and reducing the number of queries can make an impact.
It seems that your framework suffers from premature optimization. You are hyper-concerned about fetching too many fields from a row, but why? Do you have thousands of columns or something?
The time consuming part of your query is almost always the lookup, not the transmission of data. You are causing the database to do the "hard" part over and over again as you pull one field at a time.
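To make the cost of field-at-a-time fetching concrete, the sketch below counts the statements needed to list three users both ways. It is a Python and SQLite illustration with invented data; the run wrapper and the city column are hypothetical and exist only so the queries can be counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, city TEXT);
    INSERT INTO users VALUES
        (1, 'bruce', 'Gotham'), (2, 'clark', 'Metropolis'), (3, 'tony', 'Malibu');
""")

query_count = 0


def run(sql, params=()):
    """Execute a statement and count it, so both strategies can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params)


def list_field_by_field():
    """One lookup per field per row, like the lazy-loading framework."""
    ids = [i for (i,) in run("SELECT id FROM users ORDER BY id")]
    result = []
    for user_id in ids:
        (username,) = run("SELECT username FROM users WHERE id = ?", (user_id,)).fetchone()
        (city,) = run("SELECT city FROM users WHERE id = ?", (user_id,)).fetchone()
        result.append((user_id, username, city))
    return result


def list_whole_rows():
    """One query that returns every needed field for every row."""
    return run("SELECT id, username, city FROM users ORDER BY id").fetchall()


query_count = 0
slow = list_field_by_field()
queries_field_by_field = query_count

query_count = 0
fast = list_whole_rows()
queries_whole_rows = query_count

print(queries_field_by_field, queries_whole_rows)
```

Both functions return the same rows, but the first repeats the row lookup for every field, and the gap grows with the number of rows and fields.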
Description:
I am building a rating system with mysql/php. I am confused as to how I would set up the database.
Here is my article setup:
Article table:
id | user_id | title | body | date_posted
This is my assumed rating table:
Rating table:
id | article_id | score | ? user_id ?
Problem:
I don't know if I should place the user_id in the rating table. My plan is to use a query like this:
SELECT ... WHERE user_id = 1 AND article_id = 10
But I know that it's redundant data as it stores the user_id twice. Should I figure out a JOIN on the tables or is the structure good as is?
It depends. I'm assuming that the articles are unique to individual users? In that case, I would retain the user_id in your rating table and then just alter your query to:
SELECT ... WHERE article_id = 10
or
SELECT ... WHERE user_id = 1
Depending on what info you're trying to pull.
You're not "storing the user_id twice" so much as using the user_id to link the article to unique data associated to the user in another table. You're taking the right approach, except in your query.
I don't see anything wrong with this approach. The user id being stored twice is not particularly relevant, since one belongs to a rating entry and the other, I assume, identifies the article owner.
The benefit of this way is that you can prevent multiple scores being recorded for each user by making the pair (article_id, user_id) unique and using REPLACE INTO to manage scoring.
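A minimal sketch of that idea follows, using Python with an in-memory SQLite database so it can run on its own (SQLite accepts REPLACE INTO much as MySQL does). The rate helper and the sample scores are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rating (
        id         INTEGER PRIMARY KEY,
        article_id INTEGER NOT NULL,
        user_id    INTEGER NOT NULL,
        score      INTEGER NOT NULL,
        UNIQUE (article_id, user_id)  -- at most one rating per user per article
    );
""")


def rate(article_id, user_id, score):
    """Insert a rating, replacing any earlier one by the same user."""
    conn.execute(
        "REPLACE INTO rating (article_id, user_id, score) VALUES (?, ?, ?)",
        (article_id, user_id, score),
    )


rate(10, 1, 3)
rate(10, 2, 5)
rate(10, 1, 4)  # user 1 changes their mind; the old row is replaced

rows = conn.execute(
    "SELECT user_id, score FROM rating WHERE article_id = 10 ORDER BY user_id"
).fetchall()
print(rows)
```

Note that REPLACE deletes the conflicting row and inserts a new one, so the surrogate id changes when a rating is replaced.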
There are many things to elaborate on this depending on whether or not this rating system needs to be intelligent to prevent gaming, etc. How large the user base is, etc.
I bet for any normal person, this setup would not be detrimental to even a relatively large scale system.
... semi irrelevant:
Just FYI, depending on the importance and gaming aspects of this score, you could use STDDEV() to fetch the standard deviation of the score column and read it alongside the average...
SELECT STDDEV(`score`) FROM `rating` WHERE `article_id` = {article_id}
That would help you account for outliers, supposing you cared whether or not it looked like people were ganging up on a particular article to shoot it down or praise it without valid cause.
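As a small worked example of how a standard deviation could be used here: MySQL's STDDEV() is the population standard deviation, which Python's statistics.pstdev also computes. The scores and the 1.5-sigma threshold below are invented for illustration and are not a recommendation.

```python
from statistics import mean, pstdev

# Hypothetical scores for one article; the single 1 is the suspicious vote.
scores = [5, 4, 5, 4, 1]

avg = mean(scores)   # 3.8
sd = pstdev(scores)  # population standard deviation, sqrt(2.16)

# Flag scores that lie more than 1.5 standard deviations from the mean.
outliers = [s for s in scores if abs(s - avg) > 1.5 * sd]

print(avg, round(sd, 4), outliers)
```

Here the squared deviations sum to 10.8, so the variance is 10.8 / 5 = 2.16, and only the score of 1 falls outside the band.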
You should not, because of third normal form: you need to keep the independence.
"The third normal form (3NF) is a normal form used in database normalization. 3NF was originally defined by E.F. Codd in 1971.[1] Codd's definition states that a table is in 3NF if and only if both of the following conditions hold:
The relation R (table) is in second normal form (2NF)
Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R."
Source here: http://en.wikipedia.org/wiki/Third_normal_form
First normal Form: http://en.wikipedia.org/wiki/First_normal_form
Second normal Form: http://en.wikipedia.org/wiki/Second_normal_form
You should take a look at normalization and the E/R model; it will help you a lot.
normalization in wikipedia: http://en.wikipedia.org/wiki/Database_normalization
I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attributes, album_attributes, artist_attributes). There are several thousand (or even hundred thousand) tags applicable to each item type.
I am trying to accomplish two tasks, and I'm having a very hard time getting the queries to perform acceptably.
Task 1) Get all tracks that have any given tags (if provided) by artists that have any given tags (if provided) on albums with any given tags (if provided). Any set of tags may not be present (i.e. only a track tag is active, no artist or album tags)
Variation: The results are also presentable by artist or by album rather than by track
Task 2) Get a list of tags that are applied to the results from the previous filter, along with a count of how many tracks have each given tag.
What I am after is some general guidance in approach. I have tried temp tables, inner joins, IN(), all my efforts thus far result in slow responses. A good example of the results I am after can be seen here: http://www.yachtworld.com/core/listing/advancedSearch.jsp, except they only have one tier of tags, I am dealing with three.
Table structures:
Table: attribute_tag_groups
Column | Type |
------------+-----------------------------+
id | integer |
name | character varying(255) |
type | enum (track, album, artist) |
Table: attribute_tags
Column | Type |
--------------------------------+-----------------------------+
id | integer |
attribute_tag_group_id | integer |
name | character varying(255) |
Table: track_attribute_tags
Column | Type |
------------+-----------------------------+
track_id | integer |
tag_id | integer |
Table: artist_attribute_tags
Column | Type |
------------+-----------------------------+
artist_id | integer |
tag_id | integer |
Table: album_attribute_tags
Column | Type |
------------+-----------------------------+
album_id | integer |
tag_id | integer |
Table: artists
Column | Type |
------------+-----------------------------+
id | integer |
name | varchar(350) |
Table: albums
Column | Type |
------------+-----------------------------+
id | integer |
artist_id | integer |
name | varchar(300) |
Table: tracks
Column | Type |
-------------+-----------------------------+
id | integer |
artist_id | integer |
album_id | integer |
compilation | boolean |
name | varchar(300) |
EDIT I am using PHP, and I am not opposed to doing any sorting or other hijinx in script, my #1 concern is speed of return.
If you want speed, I would suggest you look into Solr/Lucene. You can store your data and have very speedy lookups by calling Solr and parsing the result from PHP. As an added benefit you get faceted searches as well (which is task 2 of your question, if I interpret it correctly). The downside is of course that you might have redundant information (stored once in the DB, once in the Solr document store). And it does take a while to set up (well, you could learn a lot from the Drupal Solr integration).
Just check out the PHP reference docs for Solr.
Here's an article on how to use Solr with PHP, just in case: http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/.
You should probably try to denormalize your data. Your structure is optimised for insert/update load, but not for queries. As I understand it, you will have many more select queries than insert/update queries.
For example you can do something like this:
store your data in the normalized structure.
create an aggregate table like this
track_id, artist_tags, album_tags, track_tags
1 , jazz/pop/, jazz/rock, /heavy-metal/
or
track_id, artist_tags, album_tags, track_tags
1 , 1/2/, 1/3, 4/
to speed up searches you should probably create a FULLTEXT index on the *_tags columns
query this table with sql like
select * from aggregate where MATCH (album_tags) AGAINST ('rock')
rebuild this table incrementally once a day.
I think the answer greatly depends on how much money you wish to spend on your project. Some tasks are even theoretically impossible to accomplish under strict conditions (for example, that you must use only one weak server). I will assume that you are ready to upgrade your system.
First of all, your table structure forces JOINs, and I think you should avoid them if possible when writing high-performance applications. I don't know what "attribute_tag_groups" is, so I propose this table structure: tag(varchar 255), id(int), id_type(enum (track, album, artist)). Id can be artist_id, track_id or album_id depending on id_type. This way you will be able to look up all your data in one table, but of course it will use much more memory.
Next, you should consider using several databases. It will help even more if each database contains only part of your data (each lookup will be faster). Deciding how to spread your data between databases is usually a rather hard task: I suggest you gather some statistics about tag length, find ranges of length that give similar track/artist result counts, and hard-code them into your lookup code.
Of course you should consider MySQL tuning (I am sure you did that, but just in case). All your tables should reside in RAM; if that is impossible, try to get SSD disks, RAID etc. Proper indexing and database types/settings are really important too (MySQL may even show some bottlenecks in its internal statistics).
This suggestion may sound mad, but sometimes it is good to let PHP do some calculations that MySQL could do itself. MySQL databases are much harder to scale, while a server for PHP processing can be added in a matter of minutes. Also, different PHP threads can run on different CPU cores, which MySQL has problems with. You can increase your PHP performance by using some advanced modules (you can even write them yourself: profile your PHP scripts and hard-code the bottlenecks in fast C code).
Last, but I think most important: you must use some type of caching. I know that it is really hard, but I don't think there has been any big project without a really good caching system. In your case some tags will surely be much more popular than others, so caching should greatly increase performance. Caching is a form of art: depending on how much time you can spend on it and how many resources are available, you can make 99% of all requests use the cache.
Using other databases/indexing tools may help you, but you should always consider the theoretical query speed comparison (O(n), O(n log(n))...) to understand whether they can really help. Using these tools sometimes gives you only a low performance gain (like a constant 20%), but they may complicate your application design, and most of the time it is not worth it.
From my experience most 'slow' MySQL databases don't have the correct indexes and/or queries. So I would check these first:
Make sure every data table's id field is the primary key. Just in case.
For all data tables, create an index on the external id fields followed by the id, so that MySQL can use it in searches.
For your glue tables, set a primary key on the two fields, first the subject, then the tag. This is for normal browsing. Then create a normal index on the tag id. This is for searching.
Still slow? Are you using MyISAM for your tables? It is designed for quick queries.
If still slow, run an EXPLAIN on a slow query and post both the query and result in the question. Preferably with an importable sql dump of your complete database structure.
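The glue-table indexing suggested in this checklist, together with the two tasks from the question, could be sketched as follows for the track tier only (artist and album tags would follow the same pattern). This is a Python and SQLite illustration with invented rows; the index name and both helper functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tracks (id INTEGER PRIMARY KEY, artist_id INTEGER, album_id INTEGER, name TEXT);
    CREATE TABLE track_attribute_tags (
        track_id INTEGER NOT NULL,
        tag_id   INTEGER NOT NULL,
        PRIMARY KEY (track_id, tag_id)  -- subject first: browsing the tags of a track
    );
    -- Secondary index on the tag: searching for tracks by tag.
    CREATE INDEX ix_track_tags_tag ON track_attribute_tags (tag_id);
    INSERT INTO tracks VALUES (1, 1, 1, 'A'), (2, 1, 1, 'B'), (3, 2, 2, 'C');
    INSERT INTO track_attribute_tags VALUES (1, 10), (1, 11), (2, 10), (3, 12);
""")


def tracks_with_any_tag(tag_ids):
    """Task 1, track tier only: ids of tracks carrying any of the given tags."""
    if not tag_ids:
        return []
    marks = ", ".join("?" for _ in tag_ids)
    sql = f"""SELECT DISTINCT t.id
              FROM tracks AS t
              JOIN track_attribute_tags AS tt ON tt.track_id = t.id
              WHERE tt.tag_id IN ({marks})
              ORDER BY t.id"""
    return [i for (i,) in conn.execute(sql, list(tag_ids))]


def tag_counts(track_ids):
    """Task 2: how many of the matched tracks carry each tag."""
    if not track_ids:
        return []
    marks = ", ".join("?" for _ in track_ids)
    sql = f"""SELECT tag_id, COUNT(*)
              FROM track_attribute_tags
              WHERE track_id IN ({marks})
              GROUP BY tag_id
              ORDER BY tag_id"""
    return conn.execute(sql, list(track_ids)).fetchall()


matching = tracks_with_any_tag([10])
facets = tag_counts(matching)
print(matching, facets)
```

Whether these indexes are actually used on a real MySQL server is what the EXPLAIN output mentioned above would show.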
Things you may give a try:
Use a query analyzer to explore the bottlenecks of your queries. (Most of the time the underlying DBS does an amazing job of optimizing.)
Your table structure is well normalized, but personal experience has shown me that you can achieve much greater performance with structures that let you avoid joins and subqueries. For your case I would suggest storing the tag information in one field. (This requires support from the underlying DBS.)
So far.
Check your indices and whether they are used correctly. Maybe MySQL isn't up to the task. PostgreSQL should be similar to use but has better performance in complex situations.
On a completely different track, google map-reduce and use one of these new fancy no-SQL databases for really really large data sets. This can do distributed search on multiple servers in parallel.
Here is the scenario 1.
I have a table called "items", inside the table has 2 columns, e. g. item_id and item_name.
I store my data in this way:
item_id | item_name
Ss001 | Shirt1
Sb002 | Shirt2
Tb001 | TShirt1
Tm002 | TShirt2
... etc. I store it this way:
the first letter is the code for the type of clothes, i.e. S for shirt, T for tshirt
the second letter is the size, i.e. s for small, m for medium and b for big
Let's say my items table has 10,000 items and I want fast retrieval. Say I want to find a particular shirt; can I use:
Method1:
SELECT * from items WHERE item_id LIKE 'Sb99';
or should I do it like:
Method2:
SELECT * from items WHERE item_id LIKE 'S%';
*Store the result, then execute a second search for the size, then a third search for the id. Like the hash table concept.
What I want to achieve is, instead of searching all the data, to minimize the search by searching on the clothes code first, followed by the size code and then the id code. Which one is better in terms of speed in MySQL? And which one is better in the long run? I want to reduce the traffic and not disturb the database so often.
Thanks guys for solving my first scenario. But another scenario comes in:
Scenario 2:
I am using PHP and MySQL. Continuing from the previous story, if my users table structure is like this:
user_id | username | items_collected
U0001 | Alex | Ss001;Tm002
U0002 | Daniel | Tb001;Sb002
U0003 | Michael | ...
U0004 | Thomas | ...
I store items_collected in id form because one day each user may collect up to hundreds of items. If I stored them as strings, i.e. Shirt1, pants2, ..., it would require a very large amount of database space (imagine if we have 1000 users and some item names are very long).
Would it be easier to maintain if I store it in id form?
And let's say I want to display the image, and the image's name is the item's name + jpg. How do I do that? Is it something like this:
$result = Select items_collected from users where user_id = $userid
Using php explode:
$itemsCollected = explode(";", $result);
After that, match each item against the items table, so it would look like:
shirt1, pants2 etc
Then, using a loop, go through each value and add ".jpg" to display the image?
The first method will be faster - but IMO it's not the right way of doing it. I'm in agreement with tehvan about that.
I'd recommend keeping the item_id as is, but add two extra fields one for the code and one for the size, then you can do:
select * from items where item_code = 'S' and item_size = 'm'
With indexes the performance will be greatly increased, and you'll be able to easily match a range of sizes, or codes.
select * from items where item_code = 'S' and item_size IN ('m','s')
Migrate the db as follows:
alter table items add column item_code varchar(1) default '';
alter table items add column item_size varchar(1) default '';
update items set item_code = SUBSTRING(item_id, 1, 1);
update items set item_size = SUBSTRING(item_id, 2, 1);
The changes to the code should be equally simple to add. The long term benefit will be worth the effort.
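The migration above can be tried end to end with the sketch below. It uses Python with an in-memory SQLite database so it is self-contained, which means substr() replaces MySQL's SUBSTRING(). The index name is an assumption and the rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id TEXT PRIMARY KEY, item_name TEXT);
    INSERT INTO items VALUES
        ('Ss001', 'Shirt1'), ('Sb002', 'Shirt2'),
        ('Tb001', 'TShirt1'), ('Tm002', 'TShirt2');

    -- Same migration as above; SQLite spells SUBSTRING() as substr().
    ALTER TABLE items ADD COLUMN item_code TEXT DEFAULT '';
    ALTER TABLE items ADD COLUMN item_size TEXT DEFAULT '';
    UPDATE items SET item_code = substr(item_id, 1, 1),
                     item_size = substr(item_id, 2, 1);

    -- Hypothetical index name; it covers the code-plus-size lookups.
    CREATE INDEX ix_items_code_size ON items (item_code, item_size);
""")

rows = conn.execute(
    """SELECT item_id FROM items
       WHERE item_code = 'S' AND item_size IN ('s', 'b')
       ORDER BY item_id"""
).fetchall()
print(rows)
```

The item_id column is left untouched, so existing code that looks items up by id keeps working during the transition.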
For scenario 2 - that is not an efficient way of storing and retrieving data from a database. When used in this way the database is only acting as a storage engine, by encoding multiple data into fields you are precluding the relational part of the database from being useful.
What you should do in that circumstance is to have another table, call it 'items_collected'. The schema would be along the lines of
CREATE TABLE items_collected (
id int(11) NOT NULL auto_increment PRIMARY KEY,
user_id varchar(10) NOT NULL,
item_id varchar(10) NOT NULL,
FOREIGN KEY (`user_id`) REFERENCES `users`(`user_id`),
FOREIGN KEY (`item_id`) REFERENCES `items`(`item_id`)
);
The foreign keys ensure that there is Referential integrity, it's essential to have referential integrity.
Then for the example you give you would have multiple records.
user_id | username | items_collected
U0001 | Alex | Ss001
U0001 | Alex | Tm002
U0002 | Daniel | Sb002
U0002 | Daniel | Tb001
U0003 | Michael | ...
U0004 | Thomas | ...
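With one row per collected item, the image question from scenario 2 becomes a single join. The sketch below uses Python with an in-memory SQLite database to stay self-contained; the image_files helper and the composite primary key are assumptions, and the rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, username TEXT NOT NULL);
    CREATE TABLE items (item_id TEXT PRIMARY KEY, item_name TEXT NOT NULL);
    CREATE TABLE items_collected (
        user_id TEXT NOT NULL REFERENCES users(user_id),
        item_id TEXT NOT NULL REFERENCES items(item_id),
        PRIMARY KEY (user_id, item_id)
    );
    INSERT INTO users VALUES ('U0001', 'Alex'), ('U0002', 'Daniel');
    INSERT INTO items VALUES
        ('Ss001', 'Shirt1'), ('Sb002', 'Shirt2'),
        ('Tb001', 'TShirt1'), ('Tm002', 'TShirt2');
    INSERT INTO items_collected VALUES
        ('U0001', 'Ss001'), ('U0001', 'Tm002'),
        ('U0002', 'Tb001'), ('U0002', 'Sb002');
""")


def image_files(user_id):
    """Image file names (item name + '.jpg') for everything a user collected."""
    rows = conn.execute(
        """SELECT i.item_name
           FROM items_collected AS c
           JOIN items AS i ON i.item_id = c.item_id
           WHERE c.user_id = ?
           ORDER BY i.item_id""",
        (user_id,),
    )
    return [name + ".jpg" for (name,) in rows]


print(image_files("U0001"))
```

No string splitting is needed, and the foreign keys reject a collected item that does not exist in the items table.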
The first optimization would be splitting the id into three different fields:
one for type, one for size, one for the current id ending (whatever the ending means)
If you really want to keep the current structure, go for the result straight away (option 1).
If you want to speed up retrieval you should split the column into multiple columns, one for each property.
Step 2 is to create an index for each column. Remember that MySQL generally uses only one index per table per query. So if you really want speedy queries and your queries vary a lot with these properties, then you might want to create an index on (type,size,ending), (type,ending,size) etc.
For example a query with
select * from items where type = 'S' and size = 's' and ending = '001'
Can benefit from the index (type,size,ending) but:
select * from items where size = 's' and ending = '001'
Can not, because the index can only be used from left to right, so it needs type, then size, then ending. This is why you might want multiple indexes if you really want fast searches.
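The left-to-right rule can be observed with a query planner. The sketch below asks SQLite (not MySQL) for its plan for both queries, so it only illustrates the general principle; MySQL's EXPLAIN output looks different and its optimizer may make other choices. The index name and the plan helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (type TEXT, size TEXT, ending TEXT, item_name TEXT);
    CREATE INDEX ix_type_size_ending ON items (type, size, ending);
""")


def plan(sql):
    """Return SQLite's query plan for a statement as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the 4th column is the description


# All three columns constrained, starting with the leading one.
with_prefix = plan(
    "SELECT * FROM items WHERE type = 'S' AND size = 's' AND ending = '001'"
)
# Leading column missing: the composite index cannot be searched.
without_prefix = plan(
    "SELECT * FROM items WHERE size = 's' AND ending = '001'"
)

print(with_prefix)
print(without_prefix)
```

The first plan names ix_type_size_ending, while the second falls back to scanning the table; the exact wording of the plan text varies between SQLite versions.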
One other note, generally it is not a good idea to use * in queries, but to select only the columns you need.
You need to have three columns for the model, size and id, and index them this way:
CREATE INDEX ix_1 ON items (model, size, id)
CREATE INDEX ix_2 ON items (size, id)
CREATE INDEX ix_3 ON items (id, model)
Then you'll be able to search efficiently on any subset of the parameters:
model-size-id, model-size and model queries will use ix_1;
size-id and size queries will use ix_2;
model-id and id queries will use ix_3
Index on your column as it is now is equivalent to ix_1, and you can use this index to efficiently search on the appropriate conditions (model-size-id, model-size and model).
Actually, there is a certain access path called INDEX SKIP SCAN that may be used to search on non-first columns of a composite index, but MySQL does not support it AFAIK.
If you need to stick to your current design, you need to index the field and use queries like:
WHERE item_id LIKE CONCAT(@model, '%')
WHERE item_id LIKE CONCAT(@model, @size, '%')
WHERE item_id = CONCAT(@model, @size, @id)
All these queries will use the index if any.
There is no need to split this into multiple queries.
I'm comfortable that you've designed your item_id to be searchable with a "Starts with" test. Indexes will solve that quickly for you.
I don't know MySQL, but in MSSQL having an index on a "Size" column that only has choices of S, M, L most probably won't achieve anything, the index won't be used because the values it contains are not sufficiently selective - i.e. its quicker to just go through all the data rather than "Find the first S entry in the index, now retrieve the data page for that row ..."
The exception is where the query is covered by the index - i.e. several parts of the WHERE clause (and indeed, all of them and also the SELECT columns) are included in the index. In this instance, however, the first field in the index (in MSSQL) needs to be selective. So put the column with the most distinct values first in the index.
Having said that if your application has a picklist for Size, Colour, etc. you should have those data attributes in separate columns in the record - and separate tables with lists of all the available Colours and Sizes, and then you can validate that the Colour / Size given to a Product is actually defined in the Colour / Size tables. Cuts down the Garbage-in / Garbage-out problem!
Your items_collected needs to be in a separate table so that it is "normalised". Don't store a delimited list in a single column; store it as individual rows in a separate table.
Thus your USERS table will contain user_id & username
Your new items_collected table will contain user_id & item_id (and possibly also Date Purchased or Invoice Number)
You can then say "What did Alex buy" (your design has that) and also "Who bought Ss001" (which, in your design, would require ploughing through all the rows in your USERS table and splitting out the items_collected to find which ones contained Ss001 [1])
[1] Note that using LIKE wouldn't really be safe for that because you might have an item_id of "Ss001XXX" which would match WHERE items_collected LIKE '%Ss001%'
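The LIKE pitfall from the footnote is easy to reproduce. In this sketch (Python with an in-memory SQLite database, invented rows) Daniel owns the hypothetical item Ss001XXX but not Ss001, and both storage designs are queried for "who has Ss001".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Delimited-list design from the question.
    CREATE TABLE users (user_id TEXT PRIMARY KEY, username TEXT, items_collected TEXT);
    INSERT INTO users VALUES
        ('U0001', 'Alex',   'Ss001;Tm002'),
        ('U0002', 'Daniel', 'Ss001XXX;Tb001');

    -- Normalised design: one row per collected item.
    CREATE TABLE items_collected (
        user_id TEXT NOT NULL,
        item_id TEXT NOT NULL,
        PRIMARY KEY (user_id, item_id)
    );
    INSERT INTO items_collected VALUES
        ('U0001', 'Ss001'), ('U0001', 'Tm002'),
        ('U0002', 'Ss001XXX'), ('U0002', 'Tb001');
""")

# Substring match on the delimited list: also catches 'Ss001XXX'.
by_like = [
    name
    for (name,) in conn.execute(
        "SELECT username FROM users WHERE items_collected LIKE '%Ss001%' ORDER BY user_id"
    )
]

# Exact match on the normalised rows: only real owners of 'Ss001'.
by_rows = [
    name
    for (name,) in conn.execute(
        """SELECT u.username
           FROM items_collected AS c
           JOIN users AS u ON u.user_id = c.user_id
           WHERE c.item_id = 'Ss001'
           ORDER BY u.user_id"""
    )
]

print(by_like, by_rows)
```

The substring search reports Daniel as well, while the row-based lookup returns only Alex and can use an index on item_id.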