I have a Website in 3 languages.
Which is the best way to structure the DB?
1) Create 3 tables, one for each language (e.g. Product_en, Product_es, Product_de) and retrieve data from the right table with an identifier:
e.g. on the PHP page I have a string:
$language = 'en';
so I get the data only with
SELECT * FROM Product_$language
2) Create 1 table with:
ID LANGUAGE NAME DESCR
and post on the page only
WHERE LANGUAGE = '$language'
3) Create 1 table with:
ID NAME_EN DESCR_EN NAME_ES DESCR_ES NAME_DE DESCR_DE
Thank you!
I'd rather go for the second option.
The first option doesn't seem flexible enough for searching records. What if you need to search across two languages? The best you could do there is UNION the results of two SELECT statements. The third one has data redundancy: you would need a name and description column for every language.
The second one is very flexible and handy. You can do whatever operations you want without any special handling, unless you want to pivot the records.
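For instance, a minimal sketch of option 2 with PDO, using the columns from the question (connection details are placeholders):
// Option 2: one table, one row per product per language.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$language = 'en'; // e.g. from the session or the URL
$stmt = $pdo->prepare('SELECT id, name, descr FROM product WHERE language = ?');
$stmt->execute(array($language));
$products = $stmt->fetchAll(PDO::FETCH_ASSOC);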
I would opt for option one or two. Which one really depends on your application and how you plan to access your data. When I have done similar localization in the past, I have used the single table approach.
What I like about this approach is that you don't need to change the DB schema at all should you add additional localizations. You also should not need to change your related code in that case, as the language identifier just becomes another value used in the query.
Going the separate-tables route you would be killing the database in no time.
Just do a table like:
TABLE languages with fields:
-- product name
-- product description
-- two-letter language code
This will allow you not only to have a better-structured database, but also to have products with only one translation. You can even show the default language when no translation exists for the requested one. You'll do that programmatically of course, but I think you get the idea.
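A rough sketch of that programmatic fallback, assuming the translation table also carries a product_id column to tie each row to a product (every name below is hypothetical, and $pdo is an open PDO connection):
// Try the requested language first; fall back to the default per column.
$default = 'en';
$sql = "SELECT p.id,
               COALESCE(t.name,  d.name)  AS name,
               COALESCE(t.descr, d.descr) AS descr
        FROM product p
        LEFT JOIN product_translations t ON t.product_id = p.id AND t.lang = :lang
        LEFT JOIN product_translations d ON d.product_id = p.id AND d.lang = :def";
$stmt = $pdo->prepare($sql);
$stmt->execute(array(':lang' => $language, ':def' => $default));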
I have languages and courses (based on language) defined in two tables. A reference to the language table is used in the course table to relate each course to a particular language. I also have a notes table that contains the notes of a specific course and is related to the course table. Now I have two issues.
First, in the code I need to take some specific action for the Spanish language only. How should I handle this, given that languages are entered by users and we have no idea what the ID for Spanish will be? If I use text (the language name) instead, then each time I need to fetch the ID for Spanish from the language table and then fetch all courses related to it from the course table.
Second, suppose Spanish notes are stored in four separate sections while other notes have only one section. Should I use the same table with four columns (one for each section) or two tables (notes and spanish_notes)? The former way leaves three columns blank for the notes of other languages, and I don't think that is good.
One quick solution to your first issue about multiple languages is to use language codes such as 'en', 'es', 'fr', etc. For instance, your language table could have both id and code columns, while your content tables carry an FK on the code. You could get this language code from the request's Accept-Language header, or from somewhere else.
For the second question, in terms of normalization it is better to have a separate table for the Spanish notes. It is way better for many reasons, such as redundancy and dependency concerns.
EDIT: PS. You could also have a look at language codes from here and HTTP Accept-Language from here.
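As a naive sketch of picking the code from Accept-Language in PHP (real headers carry q-weights you may want to honor; the supported-codes list is just an example):
$supported = array('en', 'es', 'fr'); // the codes in your language table
$lang = 'en';                         // default
$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
foreach (explode(',', $header) as $part) {
    $code = strtolower(substr(trim($part), 0, 2)); // "en-US;q=0.8" -> "en"
    if (in_array($code, $supported)) {
        $lang = $code;
        break;
    }
}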
Some inputs:
There could be 2 ways of doing this:
a. When languages are entered by users, use a SELECT dropdown box for accepting user input. For each SELECT option, you can set the language name as the text and the language ID as the value. This way you will know the language ID as soon as the user picks a language.
b. You can use a MySQL INNER JOIN between the "language" and "courses" tables, something like:
SELECT *
FROM `language` `l`
INNER JOIN `courses` `c` ON `l`.`language_id` = `c`.`language_id`
WHERE `l`.`language_name` = 'spanish';
I think it's okay to keep the notes for all sections in a single table. For the three extra columns that will only contain Spanish notes, you can simply set them to accept NULL values.
Hope it helps.
I am trying to create a script that finds the matching percentage between rows of my table. For example, my MySQL database has a products table containing the field name (indexed, FULLTEXT) with values like:
LG 50PK350 PLASMA TV 50" Plasma TV Full HD 600Hz
LG TV 50PK350 PLASMA 50"
LG S24AW 24000 BTU
Aircondition LG S24AW 24000 BTU Inverter
As you can see, all of them share some keywords, but the 1st and 2nd names are more similar to each other. Likewise, the 3rd and 4th share more keywords with each other than with the 1st and 2nd.
My MySQL DB has thousands of product names. What I want is to find the names that are more than a given percentage (let's say 60%) similar.
For example, as I said, the 1st and 2nd names (and any other names) that match each other with more than 60% similarity will be echoed in a group-style format to let me know that those products are similar. The 3rd and 4th names, and any others with more than 60% matching, will be echoed after them in another group, telling me that those products match.
If possible, it would be great to also echo the keywords shared by all the grouped matching names. For example, LG S24AW 24000 BTU is the keyword string contained in both the 3rd and 4th names.
At the end I will create a list of all those keywords.
What I have now is the following query (as Jitamaro suggested)
Select t1.name, t2.name From products t1, products t2
which pairs every name with every other name. Excuse me if I can't explain it quite right, but this is what it does (the real values are product names like the ones above):
Before the query
-name-
A
B
C
D
E
After the query
-name- -name-
A A
B A
C A
D A
E A
A B
B B
C B
D B
E B
.
.
.
Is there a way, either with MySQL or PHP, to find the matching names and extract the keywords as I described above? Please share code examples.
Thank you community.
Query the DB with LIKE or REGEXP:
SELECT * FROM product WHERE product_name LIKE '%LG%';
SELECT * FROM product WHERE product_name REGEXP "LG";
Loop the results and use similar_text():
$a = "LG 50PK350 PLASMA TV 50\" Plasma TV Full HD 600Hz"; // DB value
$b = "LG TV 50PK350 PLASMA 50\"" ; // USER QUERY
$i = similar_text($a, $b, $p);
echo("Matched: $i Percentage: $p%");
//outputs: Matched: 21 Percentage: 58.3333333333%
Your second example matches 62.0689655172%:
$a = "LG S24AW 24000 BTU"; // DB value
$b = "Aircondition LG S24AW 24000 BTU Inverter" ; // USER QUERY
$i = similar_text($a, $b, $p);
echo("Matched: $i Percentage: $p%");
You can define a percentage higher than, let's say, 40% to match products.
Please note that similar_text() is case-sensitive, so you should lowercase the strings first.
As for your second question, the levenshtein() function (in PHP; MySQL has no built-in equivalent) would be a good candidate.
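Putting the pieces together, a rough sketch of the grouping you describe, in pure PHP over an already-fetched list of names (the pairwise loop is O(n^2), so on thousands of rows treat this as a starting point only):
$names = array(
    'LG 50PK350 PLASMA TV 50" Plasma TV Full HD 600Hz',
    'LG TV 50PK350 PLASMA 50"',
    'LG S24AW 24000 BTU',
    'Aircondition LG S24AW 24000 BTU Inverter',
);
$threshold = 50; // the question suggests 60; note the first pair scores ~58%
foreach ($names as $i => $a) {
    foreach ($names as $j => $b) {
        if ($j <= $i) continue; // "handshake": compare each pair only once
        similar_text(strtolower($a), strtolower($b), $percent);
        if ($percent >= $threshold) {
            echo 'MATCH (' . round($percent, 1) . "%): $a | $b\n";
            // keywords present in both names
            $shared = array_intersect(
                explode(' ', strtolower($a)),
                explode(' ', strtolower($b))
            );
            echo 'shared keywords: ' . implode(' ', $shared) . "\n";
        }
    }
}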
When I look at your examples, I consider how I would try to find similar products based on the title. From your two examples, I can see one thing in each line that stands out above anything else: the model numbers. 50PK350 probably doesn't show up anywhere other than as related to this one model.
Now, MySQL itself isn't designed to deal with questions like this, but some bolt-on tools above it are. Part of the problem is that querying across all those fields in all positions is expensive; you really want to split the text up a certain way and index that. The similarity class of Lucene will grant a high score to words that rarely appear across the whole data set but appear as a high percentage of a given record. See High level explanation of Similarity Class for Lucene?
You should also look at Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Scoring each word against the Lucene similarity class ought to be faster and more reliable. The sum of your scores should give you the most related products. For the TV, I'd expect to see exact matches first, then some others of the same size, then brand, then TVs in general, etc.
Whatever you do, realize that unless you use another tool on top of the SQL system to create better data structures, your queries will be too slow and expensive. I think Lucene is probably the way to go. Sphinx or other options not mentioned may also be up for consideration.
This is trickier than it seems and there is information missing in your post:
How are people going to use this auto-complete function?
Is it relevant that you can find all names for a product? Apparently not all stores name their products similarly, so a clerk might not be able to find the product (s)he is looking for.
Do you have information about which product names are for the same product?
Is it relevant from which store you're searching? where is this auto-complete used?
Should the auto-complete really only suggest products that match all the words you typed? (it's not so hard, technically, to correct typos)
I think you need a more clear picture of what you (or better yet: the users) want this auto-complete function to do.
An auto-complete function is very much a user-friendly type feature. It aids the user, possibly in a fuzzy way so there is no single right answer. You have to figure out what works best, not what is easiest to do technically.
First figure out what you want, then worry about technology.
One possible solution is to use the Damerau-Levenshtein distance. It could be used like this:
select *
from products p
where DamerauLevenshtein(p.name, '<user input here>') <= X
You'll have to figure out the X that suits your needs best. It should be an integer greater than zero. You could have it hard-coded, parameterized, or calculated as needed.
The trickiest thing here is DamerauLevenshtein. It would have to be a stored function implementing the Damerau-Levenshtein algorithm. I don't have MySQL here, so I might write it for you later today.
Update: MySQL does not support arrays in stored procedures, so there is no way to implement Damerau-Levenshtein in MySQL except by using a temporary table for each function call, and that will result in terrible performance. So you have two options: loop through the results in PHP with levenshtein() as Alix Axel suggests, or migrate your database to PostgreSQL, where arrays are supported.
There is also the option to create a user-defined function, but this requires writing the function in C, linking it to MySQL, and possibly rebuilding MySQL, so this way you'll just add more headache.
Your approach seems sound. For matching similar products, I would suggest a trigram search. There's a pretty decent explanation of how this works along with the String::Trigram Perl module.
I would suggest using trigram search to get a list of matches, perhaps coupled with some manual review depending on how much data you have to deal with and how frequent you need to add new products. I've found this approach to work quite well in practice.
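To make the idea concrete, here is a minimal PHP sketch of trigram matching (the helper names are mine; a production version would precompute and index the trigrams in MySQL rather than loop in PHP):
// Split a string into overlapping 3-character sequences.
function trigrams($s) {
    $s = strtolower($s);
    $result = array();
    for ($i = 0, $n = strlen($s) - 2; $i < $n; $i++) {
        $result[] = substr($s, $i, 3);
    }
    return array_unique($result);
}

// Jaccard-style similarity: shared trigrams / total distinct trigrams.
function trigram_similarity($a, $b) {
    $ta = trigrams($a);
    $tb = trigrams($b);
    $shared = count(array_intersect($ta, $tb));
    $total  = count(array_unique(array_merge($ta, $tb)));
    return $total ? $shared / $total : 0;
}

echo trigram_similarity('LG S24AW 24000 BTU',
                        'Aircondition LG S24AW 24000 BTU Inverter');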
Maybe you want to find the longest common substring of the 2 strings? Then you would need to compute a suffix tree for each of your strings; see http://en.wikipedia.org/wiki/Longest_common_substring_problem.
If you want to check all names against each other you need a cross join in MySQL. There are many ways to achieve this:
1. Select a, b From t1, t2
2. Select a, b From t1 Join t2
3. Select a, b From t1 Cross Join t2
Then you can loop through the result. It is the same as creating a 2D array with n^2 elements, where each element is paired with every other element.
P.S.: Select t1.name, t2.name From products t1, products t2
It sounds like you've gone through all this trouble to explain a complex scenario, only to say that you want to ignore the optimal answers and just have us give you the "handshake" protocol (everything is compared to everything that hasn't been compared to it yet). So... pseudocode:
select * from table order by id
while (result) {
    select * from table where id > result_id
}
That will do it.
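In PHP over MySQL that pseudocode could look roughly like this ($pdo is an assumed PDO connection, table and column names are the question's; each pair of rows is visited exactly once):
$rows = $pdo->query('SELECT id, name FROM products ORDER BY id')
            ->fetchAll(PDO::FETCH_ASSOC);
$cmp = $pdo->prepare('SELECT id, name FROM products WHERE id > ? ORDER BY id');
foreach ($rows as $row) {
    $cmp->execute(array($row['id'])); // only rows not yet compared to this one
    foreach ($cmp->fetchAll(PDO::FETCH_ASSOC) as $other) {
        similar_text(strtolower($row['name']), strtolower($other['name']), $p);
        if ($p >= 60) {
            echo "{$row['name']} ~ {$other['name']} ($p%)\n";
        }
    }
}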
If your database simply had a UPC code as one of its fields, and this field was well-maintained, i.e., you could trust that it was entered correctly by the database maintainer and correctly reflected what the item was, then you wouldn't need to do all of the work you suggest.
An even better idea might be to have a UPC field in your next database, and constrain it as unique.
If database users attempt to put an already-existing UPC into the database, they get an error.
The database maintains its integrity.
And if such a database maintained its integrity, the necessity of doing what you suggest would never arise.
This probably doesn't help much with your current task (apologies), but for a future similar database you might wish to think about it...
I'd advise you to use a full-text search engine, like Sphinx. It has the possibility to implement any algorithm you want. For example, you may use "quorum" or "any" searches.
It seems that you might always want to return the shortest string? That's more of a question than anything. But then you might have something like...
SELECT * FROM products
WHERE product_name LIKE '%LG%'
ORDER BY LENGTH(product_name) ASC
LIMIT 1
This is a clustering problem, which can be resolved by data mining methods (http://en.wikipedia.org/wiki/Cluster_analysis). It requires a lot of memory and computation-intensive operations, which is not suitable for a database engine; otherwise, separate data mining, text mining, and business analytics software wouldn't exist.
This question is similar :) to this one:
What is the best way to implement a substring search in SQL?
Trigrams can easily find similar rows, and in that question I posted a PHP + MySQL + trigram solution.
You can use LIKE to find similar product names within the table. For example:
SELECT * FROM product WHERE product_name LIKE 'LG%';
Here is another idea (but I'm voting for levenshtein()):
Create a temporary table of all words used in the names and their frequencies (see the sketch after this list).
Choose a range of results (the most popular words are probably generic ones like LCD or LED; the most unique words could be good candidates, as they are likely actual product names).
For each of the resulting words, suggest either:
results containing those words
results containing the longest substring (like this: http://forums.mysql.com/read.php?10,277997,278020#msg-278020 ) of those words.
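As a rough sketch of that first step in PHP (building the word/frequency list; the table and column names are taken from the question, and $pdo is an assumed PDO connection):
$freq = array();
foreach ($pdo->query('SELECT name FROM products') as $row) {
    $words = preg_split('/\s+/', strtolower($row['name']), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $freq[$word] = isset($freq[$word]) ? $freq[$word] + 1 : 1;
    }
}
arsort($freq); // most popular words (LCD, LED, TV...) first
// asort($freq) instead would surface the rarest words, likely model numbers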
Ok, I think I was trying to implement a very similar thing. It can work the same way as the Google Chrome address box: when you type the address it gives you suggestions. This is what you are trying to achieve, as far as I am concerned.
I cannot give you an exact solution, but some advice.
You need to implement a dropdown box where someone starts to enter the product they are looking for
Then you need to get the current value of the dropdown and run a query like the one posted above, e.g. SELECT * FROM product WHERE product_name LIKE 'LG%';
Save results of the query
Refresh the page
Add the results of the query to the dropdown
Note:
You need to save the query results somewhere, like a text file with the HTML code, i.e. <option>LG TS 600</option>. These values will be used for populating your option box after the page refresh. You also need to set up a user session so that each user gets their own results; otherwise, if several users used the search at the same time, the results could clash. With a search ID and a session ID you can then match them. You can save this in a file or a table; a table would be more convenient. This is, in essence, the whole subsystem for what you are looking for.
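A bare-bones sketch of steps 2 and 5 combined (names are hypothetical; it simply turns the LIKE query into option elements, $pdo assumed):
$term = isset($_GET['term']) ? $_GET['term'] : '';
$stmt = $pdo->prepare('SELECT product_name FROM product WHERE product_name LIKE ? LIMIT 10');
$stmt->execute(array($term . '%')); // prefix match can use the index on product_name
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $name) {
    echo '<option>' . htmlspecialchars($name) . '</option>';
}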
I hope it helps.
Using PHP 5.x and MySQL 5.x
So I asked a question yesterday about the best way of handling dynamic multilanguage data; this question is about a good database structure to handle it. I have entries in the database for things like stories. My plan is to store a different version of the story for each language available. So if the site supports English and Spanish, then in the admin tool they can add an English version of a story and a Spanish version of it.
My thought was to have separate tables in the DB, one for each language: one story table for the English versions, one for Spanish, and one for whatever other languages. Then on the front end I simply let the visitor select which language to view the site in, and via variables I know which table to query to get that version of the story. But one issue is: what if there isn't a Spanish version but they selected Spanish? Is that a good solution, or is there a better way of doing this? Looking for suggestions.
Having multiple tables is really not necessary, unless you plan on having millions of stories... I usually go with two tables: one for the item and another for the localized item.
For example, a story would look like this:
table "story"
id INTEGER (Primary Key)
creation_date DATE_TIME
author VAR_CHAR(32)
..other general columns..
table "story_localized"
story_id INTEGER (Foreign Key of story.id) \
lang CHAR(2) -- (Indexed, unique -- story_id+lang)
content TEXT
..other localized columns..
Performing the query is simply a matter of JOINing the two tables:
SELECT s.*, sl.*
FROM story s
JOIN story_localized sl ON s.id = sl.story_id
WHERE s.id = 1 -- the story you're looking for here
AND sl.lang = 'en' -- your language here
-- other conditions here
This configuration gives a few advantages:
all your data is in the same pair of tables, no need to synchronize CRUD operations across per-language tables
you can add new languages to any story without needing to create yet more tables
etc.
** EDIT **
Just as a bonus, here is a "trick" to retrieve a story easily, regardless of the language it was written in:
SELECT *
FROM (SELECT s.*, sl.*, 1 AS priority
      FROM story s
      JOIN story_localized sl ON s.id = sl.story_id
      WHERE s.id = {storyId}
      AND sl.lang = {desiredLanguage} -- example: 'es'
      UNION
      SELECT s.*, sl.*, 2 AS priority
      FROM story s
      JOIN story_localized sl ON s.id = sl.story_id
      WHERE s.id = {storyId}
      AND sl.lang = {defaultLanguage} -- example: 'en'
      UNION
      SELECT s.*, sl.*, 3 AS priority -- ..any language found
      FROM story s
      JOIN story_localized sl ON s.id = sl.story_id
      WHERE s.id = {storyId}
     ) story_l
ORDER BY priority -- desired language first, then default, then any
LIMIT 1 -- only keep the best-ranked row
This will try to fetch the story in {desiredLanguage}; if the story is not available in that language, it will try to find it in {defaultLanguage} (i.e. the site's default language); and if still nothing is found, it no longer matters which language the story is in, so the first one found is fetched. All in one query, and all you need are 3 arguments: the story ID, the desired language, and a default fallback language.
Also, you can easily find out in which languages the story is available with a simple query:
SELECT sl.lang
FROM story_localized sl
WHERE sl.story_id = {storyId}
The best way to do this is to store the story as a multi-valued attribute, as @Yanick has shown you.
And it would be better for the interface if you populate your language choice element only with those languages in which the story is actually available.
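For example, a small sketch that feeds the availability query from above into the choice element ($pdo and $storyId assumed):
$stmt = $pdo->prepare('SELECT lang FROM story_localized WHERE story_id = ?');
$stmt->execute(array($storyId));
echo '<select name="lang">';
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $lang) {
    // one option per language the story actually exists in
    echo '<option value="' . $lang . '">' . $lang . '</option>';
}
echo '</select>';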
Another nice approach when it is not a long text (week names, month names...): in my app I am localizing musical genres and musical instruments. Use a text field to store JSON data.
So in my musical "genres" table there is a text field named "lang". In it I have the following JSON structure:
{"es_gl":"MARCHA ESPAÑOLA", "es_EN" : "SPANISH MARCH"}
Then with PHP it is really easy to get the data from the JSON structure. Let's say I have a genre in my $genre variable:
$myJSON = json_decode($genre->lang);
echo $myJSON->es_gl; // This echoes MARCHA ESPAÑOLA
echo $myJSON->es_es; // This echoes SPANISH MARCH
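And if the language code sits in a variable, PHP can read the JSON property dynamically:
$lang = 'es_gl'; // e.g. the visitor's chosen language code
echo $myJSON->{$lang}; // same as $myJSON->es_gl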
This is a good solution when your localized data is not big; for large content like posts or blogs it is less suitable.
Soon I'll be working on a catalog (PHP + MySQL) that will need multilanguage content support, and I'm now considering the best approach for designing the database structure. At the moment I see 3 ways of handling multiple languages:
1) Having separate tables for each language's specific data, i.e. schematically it'll look like this:
There will be one table Main_Content_Items storing basic data that cannot be translated, like ID, creation_date, hits, votes and so on; it will exist only once and will serve all languages.
And here are the tables that will be duplicated for each language:
Common_Data_LANG table (example: common_data_en_us), storing common/"static" fields that can be translated and are present for any catalog item: title, desc and so on...
Extra_Fields_Data_LANG table, storing extra-field data that can be translated but can differ between custom item groups, i.e. like: | id | item_id | field_type | value | ...
Then on an item request we will look in the table matching the user/default language and join the translatable data with the main_content table.
Pros:
we can update "main" data (i.e. hits, votes...), which is updated most often, with only one query
we don't need to duplicate data 4x or more times if we have 4 or more languages, compared with a structure using only one table with a 'lang' field; so MySQL queries would go through a catalog of 100,000 (for example) records rather than 400,000 or more
Cons:
+2 tables for each language
2) Using a 'lang' field in the content tables:
Main_Content_Items table (storing basic data that cannot be translated, like ID, creation_date, hits, votes and so on...)
Common_Data table (storing common/"static" fields that can be translated and are present for any catalog item: | id | item_id | lang | title | desc | and so on...)
Extra_Fields_Data table (storing extra-field data that can be translated but can differ between custom item groups, i.e. like: | id | item_id | lang | field_type | value | ...)
So we'll join common_data and extra_fields to main_content_items according to the 'lang' field.
Pros:
we can update "main" data (i.e. hits, votes...), which is updated most often, with only one query
we need only 3 tables for the content data
Cons:
the common_data and extra_fields tables are filled with data for all languages, so they are X times bigger and queries run slower
3) Same as the 2nd way, but with the Main_Content_Items table merged into Common_Data, which has the 'lang' field:
Pros:
...?
Cons:
we need to update the "main" data (i.e. hits, votes...), which is updated most often, once for every language
the common_data and extra_fields tables are filled with data for all languages, so they are X times bigger and queries run slower
I will be glad to hear suggestions about what is better and why, or whether there are better ways.
Thanks in advance...
I've given a similar answer in this question and highlighted the advantages of this technique (it would be, for example, important for me to let the application decide on the language and build the query accordingly, only changing the lang parameter in the WHERE clause of the SQL query).
This gets pretty close to your second solution. I didn't quite get the "extra_fields", but if it makes sense you could(!) merge it into the common_data table. I would advise you against the first idea, since there will be too many tables and it can be easy to lose track of the items in there.
To your edit: I still consider the second approach the better one (it's my opinion, so it's relative ;)). I'm no expert on optimization, but I think that with proper indexes and a proper table structure speed should not be a problem. As always, the best way to find the most effective way is to try both methods and see which is best, since speed will vary with data, structure, and so on.
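To make that concrete, this is the kind of query I mean for the second approach, where only the lang value ever changes (table and column names are taken loosely from the question; $pdo, $userLang and $itemId are assumed placeholders):
$sql = "SELECT m.id, m.hits, m.votes, c.title, c.descr
        FROM main_content_items m
        JOIN common_data c ON c.item_id = m.id AND c.lang = :lang
        WHERE m.id = :id";
$stmt = $pdo->prepare($sql);
$stmt->execute(array(':lang' => $userLang, ':id' => $itemId));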