Database Definition for Sphinx Search - php

Background
I am creating a MySQL database to store items such as courses where there may be many attributes to a single course. For example:
A single course may have any or all of the following attributes:
Title (varchar)
Secondary Title (varchar)
Description (text)
Date
Time
Specific Location (varchar; e.g. White Hall Room 7)
General Location (varchar; e.g. Las Vegas, NV)
Location Coords (floats; e.g. lat, long)
etc.
The database is set up as follows:
A table storing specific course info:
courses table:
Course_ID (a Primary Key unique ID for each course)
Creator_ID (a unique ID for the creator)
Creation_Date (datetime of course creation)
Modified_Date (the most recent timestamp at which the course was modified)
The table storing each course's multiple attributes is set up as follows:
course_attributes table:
Attribute_ID (a unique ID for each attribute)
Course_ID (reference to the specific course attribute is for)
Attribute (varchar defining the attribute; e.g. 'title')
Value (text containing the value of the specified attribute; e.g. 'Title Of My Course')
Desire
I would like to search this database using Sphinx Search. In this search, different fields carry different weights; for example, 'title' is more important than 'description'.
Specific search fields that I wish to have are:
Title
Date
Location (string)
Location (geo - lat/long)
The Question
Should I define a View in MySQL to organize the attributes according to 'title', 'description', etc., or is there a way to set up my sphinx.conf file so that it understands the specific attributes?
I am open to all suggestions to solving this problem, whether it be rearrangement of the database/tables or the way in which I search.
Let me know if you need any additional details to help me find a solution.
Thanks in advance for the help
!--Update--!
OK, so after reading some of the answers, I feel that I should provide some additional information.
Latitude / Longitude
The latitude/longitude attributes are created by me internally after receiving the general location string. I can generate the values in any way I wish, meaning that I can store them together in a single lat/long attribute as 'float lat, float long' values or any other desired format. This is done only after they have been generated from the initial location string and verified. This is to guard against malformed data as #X-Zero and #Cody have suggested.
Keep in mind that the latitude and longitude were merely illustrating the need to have that field be searchable, nothing more. It is simply another attribute; one of many.
Weighting Search Results
I know how to add weights to results in a Sphinx search query:
$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );
This causes the title column to have a higher weight than the description column if the structure were as #X-Zero suggested. My question was more directed at how one would apply the above logic with the current table definition.
Database Structure, Views, and Efficiency
Using my introductory knowledge of Views, I was thinking that I could possibly create something that displays a row for each course where each attribute is its own column. I don't know how to accomplish this or if it's even possible.
I am not the most confident with database structures, but the reason I set my tables up as described was because there are many cases where not all of the fields will be completed for every course and I was attempting to be efficient [yes, it seems as though I've failed].
I was thinking that using my current structure, each attribute would contain a value and would therefore cause no wasted space in the table. Alternatively, if I had a table with tons of potential attributes, I would think there would be wasted space. If I am incorrect, I am happy to learn why my understanding is wrong.

Let me preface this by saying that I've never even heard of Sphinx, nor (obviously) used it. However, from a database perspective...
Doing multi-domain columns like this is a terrible (I will hunt you down and kill you) idea. For one thing, it's impossible to index or sort meaningfully, period. You also have to pray that you don't get a latitude attribute with textual data (and because this can only be enforced programmatically, I'm going to guarantee this will happen) - doing so will cause all distance-based formulas to crash. And speaking of location, what happens if somebody stores a latitude without a longitude (note that this is possible regardless of whether you are storing a single GeoLocation attribute or the pair)?
Your best bet is to do the following:
Figure out which attributes will always be required. These belong in the course table (...mostly).
For each related set of optional attributes, create a table. For example, location (although this should probably be required...), which would contain Latitude/Longitude, City, State, Address, Room, etc. Allow the columns to be nullable (in sets - add constraints so users can't add just longitude and not latitude).
For every set of common queries, add a view. Even (perhaps especially) if you persist in using your current design, use a view. This promotes separation between the logical and physical implementations of the database. (This assumes searching by SQL.) You will then be able to search by specifying view_column IS NULL or view_column = input_parameter, or whichever you need.
For weighted searching (assuming dynamic weighting), your query will need to use left joins (inside the view as well - please document this) and prepared-statement host parameters (just save yourself the trouble of trying to escape things yourself). Check each set of parameters (both lat and long, for example), and assign the input weighting to a new column (per attribute), which can be summed up into a 'total' column (which must be over some threshold); a sketch follows below.
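A minimal sketch of that kind of weighted query, run against a view exposing title and description columns (the view name course_search is an assumption, the weights are the ones from the question, and the ? placeholders are prepared-statement host parameters):

SELECT scored.Course_ID, scored.total
FROM (
    SELECT v.Course_ID,
           CASE WHEN v.title       LIKE CONCAT('%', ?, '%') THEN 1000 ELSE 0 END
         + CASE WHEN v.description LIKE CONCAT('%', ?, '%') THEN 500  ELSE 0 END AS total
    FROM course_search v
) AS scored
WHERE scored.total >= 500      -- some threshold
ORDER BY scored.total DESC;

Each additional attribute gets its own CASE term, so the weights themselves could also be passed in as parameters if they need to be dynamic.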
EDIT:
Using views:
For your structure, what you would normally do is left join to the attributes table multiple times (one for each attribute needed), keying off of the attribute (which should really be an int FK to a table; you don't want both 'title' and 'Title' in there) and joining on course_id - the value would be included as part of the select. Using this technique, it would be simple to then get the list of columns, which you can then apparently weight in Sphinx.
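For example, a view over the question's courses / course_attributes tables might look like this (the view name and the exact attribute strings, e.g. 'general_location', are assumptions):

CREATE VIEW course_search AS
SELECT c.Course_ID,
       t.Value   AS title,
       d.Value   AS description,
       loc.Value AS general_location
FROM courses c
LEFT JOIN course_attributes t   ON t.Course_ID   = c.Course_ID AND t.Attribute   = 'title'
LEFT JOIN course_attributes d   ON d.Course_ID   = c.Course_ID AND d.Attribute   = 'description'
LEFT JOIN course_attributes loc ON loc.Course_ID = c.Course_ID AND loc.Attribute = 'general_location';

Each resulting column can then be fed to Sphinx as a separate field and weighted individually.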
The problem with this is if you need to do any data conversion - you are betting that you'll be able to find all conversions if the type ever changes. With strongly typed columns, this ranges from trivial (the likelihood is that you end up with a uniquely named column) to unnecessary (views usually take their datatype definitions from the fields in the query); with your architecture, you'll likely end up looking through too many false positives.
Database efficiency:
You're right, unfilled columns are wasted space. Usually, when something is optional(ish), that means you may need an additional table. Which is why I suggested splitting location off into its own table: this prevents events which don't need a location (... what?) from 'wasting' the space, but then forces any event that defines a location to specify all the required information. There's an additional benefit to splitting it off this way: if multiple events all use the same location (... not at the same time, we hope), a cross-reference table will save you a lot of space. Way more than your attributes table ever could (you're still having to store the complete location per event, after all). If you still have a lot of 'optional' attributes, I hear that NoSQL is made for these kinds of things (but I haven't really looked into it). Other than that, though, the cost of an additional table is trivial; the cost of the data inside may not be, but the space required is weighed against the perceived value of the data stored. Remember that disk space is relatively cheap - it's developer/maintainer time that is expensive.
Side note for addresses:
You are probably going to want to create an address table. This would be completely divorced from the event information, and would include (among other things) the precomputed latitude/longitude (in the recommended datatype - I don't know what it is, but it's for sure not a comma-separated string). You would then have an event_address table that would be the cross-reference between the events and where they take place - if there is additional information (such as room), that should be kept in a location table that is referenced (instead of referencing address directly). Once a lat/long value is computed, you should never need to change it.
Thoughts on later updates for lat/long:
While specifying the lat/long values yourself is better, you're going to want to make them a required part of the address table (or part of/in addition to a purely lat/long-only table). Frankly, multi-value columns (delimited lists) of any sort are just begging for trouble - you keep having to parse them every time you search on them (among other related issues). And the moment you make them separate rows, one of the pair will eventually get dropped - Murphy himself will personally intervene, if necessary. Additionally, updating them at a different time from the addresses will result in an address having a lat/long pair that does not match; your best bet is to compute this at insertion time (there are a number of web services that will find this information for you).
Multi-domain tables:
With a multi-domain table, you're basically betting that the domain key (attribute) will never become out-of-sync with the value (err, value). I don't care how good you are; somewhere, somehow, it's going to happen. At my company, we had one of these in our legacy application (it stored FK links and which files the FKs refer to, along with an attribute). At one point an application was installed in production which promptly began storing the correct file links, but the FK links to a different file, for a given class of attribute. Thankfully, there were audit records in another file which allowed this to be reversed (... as near as they were able to tell).
In summary:
Revisit your required/optional data. Don't be afraid to create additional tables, each for a single entity, with every column for a single domain; you will also need relationship tables. You may also wish to place your audit data (last_updated_time) in a set of separate tables (single-domain tables will help immensely in this regard).

In the Sphinx config you define your index and the SQL queries that populate it. You can define basic attributes; see Sphinx Attributes.
Sphinx also supports geo searches on lat/long, but the values need to be expressed in radians - definitely not text columns like you have. I agree with X-Zero that storing lat/lng values as strings is a bad idea.
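For illustration, the sql_query you point Sphinx at could do the conversion itself, so the values can be declared with sql_attr_float and used with the API's SetGeoAnchor() call; the source table/view and column names here are assumptions:

SELECT Course_ID, title, description, general_location,
       RADIANS(latitude)  AS lat,   -- declared as sql_attr_float = lat
       RADIANS(longitude) AS lng    -- declared as sql_attr_float = lng
FROM course_locations;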

Related

Speed of SELECT Distinct vs array unique

I am using WordPress with some custom post types (just to give a description of my DB structure - it's WP's).
Each post has custom meta, which is stored in a separate table (postmeta table). In my case, I am storing city and state.
I've added some actions to WP's save_post/trash_post hooks so that the city and state are also stored in a separate table (cities) like so:
ID (auto) | postID (int) | city (varchar) | state (varchar)
I did this because I assumed that this table would be faster than querying the rather large postmeta table for a list of available cities and states.
My logic also forces me to add/update cities and states for every post, even though this causes duplicates (in the city/state fields). It has to be this way because I must keep track of which states/cities actually exist (i.e. have a post associated with them). When a post is added or deleted, its record is added to or removed from the cities table along with it.
This brings me to my question(s).
Does this logic make sense or do I suck at DB design?
If it does make sense, my real question is this: **would it be faster to use MySQL's "SELECT DISTINCT" or just "SELECT *" and then use PHP's array_unique on the results?**
Edits for comments/answers thus far:
The structure of the table is exactly how I typed it out above. There is an index on ID, but the point of this table isn't to retrieve an indexed list, but to retrieve ALL results (that are unique) for a list of ALL available city/state combos.
I think I may go with (I don't know why I didn't think of this before) just adding a serialized list of city/state combos in ONE record in the wp_options table. Then I can just get that record, and filter out the unique records I need.
Can I get some feedback on this? I would imagine that retrieving and filtering a serialized array would be faster than storing the data in a separate table for retrieval.
To answer your question about using SELECT DISTINCT vs. array_unique: I would almost always prefer to limit the result set in the database, assuming of course that you have an appropriate index on the fields for which you are trying to get distinct values. This saves the time of transmitting extra data from the database to the application and of reading that data into memory before you can work with it.
As far as your separate-table design, it is hard to speculate whether this is a good approach or not; it largely depends on how you are actually performing your query (i.e. are you doing two separate queries - one for post info and one for city/state info - or querying across a join?).
There is really only one definitive way to determine the fastest approach: test both ways in your environment.
1) A fully normalized table (where the main table holds only integer values and the other tables each hold only an int plus a varchar) has the advantage when you don't do full table joins often and do a lot of searching on the normalized fields. The downside is that it requires large join/sort buffers and results in more complex queries, so there is much less chance a query will be auto-optimized by MySQL - you have to optimize your queries yourself.
2) SELECT DISTINCT will be faster in almost all cases. The only case where it will be slower is if you have a small sort buffer configured in /etc/my.cnf and a much larger memory limit for PHP.
A DISTINCT select can use indexes, while your PHP code can't.
Also, sending a large amount of data to your app costs a lot of MySQL CPU time and wall-clock time.
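For instance, with a composite index in place MySQL can often resolve the DISTINCT from the index alone (the index name below is an assumption; the table and columns are the ones from the question):

ALTER TABLE cities ADD INDEX idx_city_state (city, state);

SELECT DISTINCT city, state FROM cities;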

Could MySQL query help find similar customer records?

I have a directory of companies, provided to me, that they want stored and updated in a MySQL database. There is no unique identifier, such as company #1234, for each company record.
The fields are typical for a mailing list: contact name, company name, street address, city, state, zip code, phone number, and type of company. Updates will be sent to me as a CSV file, again with no unique company identifier.
How do I go about matching up the stored record in the DB with the new one so it can be updated? In this industry the contact name can change, and even the company name, because they add and remove partners. Their street address can change when they move the business, and they can even change their phone number. The majority of the companies have a website URL, so hopefully that won't change often, but it easily could as well.
I've seen that MySQL has a similarity match using % - would this be the answer for matching records with the new information?
I work in PHP, if there is a PHP solution. Thanks in advance to the kind soul who helps me out with this!
Without a primary key, it is always tricky.
The one-line answer: decide on rules that best suit your requirements.
If I were you, I would first go to the client and decide on some rules for identifying similar records. This step is necessary because, without a primary key, there is always a chance of a duplicate entry or of updating the wrong record.
Rules could be simple like:
1. Available fields:
contact name,
company name,
street address,
city,
state,
zip code,
phone number and
type of company (I hope this is the industry)
2. We will first match company name for similarity like
select * from table_name where company_name like '%$company_name%'
3. For all found records, match the zip code and phone number. If they match, stop - the record needs to be updated.
4. If no match is found in step 3, match the street address. If it matches, stop - the record needs to be updated.
5. And so on (a rough SQL sketch of rules 2-4 follows below).
Your client is the best person to decide these rules, as he is the owner of the product.
On the other hand, asking the client for rules is also important to protect you: in the absence of a primary key, even with all possible care, there is always a chance of duplicating records and/or updating the wrong record. You can only minimize the chances with good rules.
Since you have said that all the fields of the table can change, I think there is no simple way to correctly update the table every time, whatever algorithm you choose.
One way to achieve this would be to ask the people/system that sends you the updated records to also include the old values of the updated fields in the CSV file. If you have the old values, you can easily match them against the present records and update them with the new values.
This is a rather general question, but the solution itself is somewhat unique from project to project.
I would iterate over all records ordered by the time of their last change (a creation date, update timestamp, or similar). Next I'd match all entries with similar major fields: company name, address (though that might be risky), telephone, or URL (parsing domains only). Then I would recursively iterate over all found entries until no more results are found.
This algorithm will help you find the same entries as long as they do not have all major columns changed at once. If they do, there is no way to tell programmatically that it's the same firm.
It will also link rows with seemingly no connection (rows 1 and 3 in the example):
Example:
2001/01/01 Awesome firm, awesome.com
2002/02/02 Awesome firm, newaddress.com // linked with the first row over company name
2010/12/05 Ohsome inc, newaddress.com // linked over url
I came across a similar scenario in one of my earlier projects, in SQL Server. I did the following things to handle it.
1. Usually there will be 2 types of files:
a) Full feed (weekly): this has all the companies from the provider's database.
b) Incremental feed (daily): this has only the new records that are not in the full feed, plus updates (flagged I for inserts and U for updates).
2. Once I receive the full feed, I refresh my database table with it once a week. Here I also assign my own internal IDs to each company record (these IDs are for internal purposes).
3. On a daily basis I process the incremental feeds based on the flags (I = insert, U = update).
4. One very important thing here is managing the mapping table: when a record first arrives in the feed, assign it a new internal ID.
5. To compare the data and avoid duplicates, I used a fuzzy-matching algorithm to get all the potential matches and then used wildcard characters to filter out and identify which records were new and which were duplicates.
Have a look at the Damerau-Levenshtein distance algorithm. It calculates the "distance" between two strings, determining how many steps it takes to transform one string into the other. The fewer the steps, the closer the two strings are.
This article shows the algorithm implemented as a MySQL stored function. Here's the PHP version.
The algorithm is so much better than LIKE or SOUNDEX.
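Usage would look roughly like this, assuming you have installed the stored function from the linked article under a name such as damlev() (the function name, table, and column names here are all assumptions):

SELECT id, company_name,
       damlev(company_name, ?) AS distance   -- ? = incoming company name
FROM companies
ORDER BY distance
LIMIT 10;

Note that this computes the distance for every row, so it is best used to rank a candidate set that has already been narrowed down (e.g. by the LIKE rules above).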

Normalized table structure in MySql... Sort of?

I am wondering what thoughts are on the following table structures for MySQL.
I have a relationship between exercises and exercise parameters, where a single exercise can have multiple parameters.
For example, the exercise 'sit-ups' could have the parameters 'sets' & 'reps'.
All exercises start with a default set of parameters. For example: sets, reps, weight, hold & rest.
This list is fully customizable. Users can add parameters, remove parameters, or rename them, for each exercise in the database.
To express this relationship, I have the following one-to-many structure:
TABLE exercises
ID
Name
Table exerciseParameters
ID
exerciseID -> exercises(ID)
Name
What is concerning me is that, even though users have the option to rename / customize parameters, a lot of the time they don't. So my exerciseParameters table is filling up with repeated words like "Sets" & "Reps" quite a bit.
Is there a better way something like this should be organized, to avoid so much repetition? (Bearing in mind that the names of the parameters have to be user-customizable. For example "Reps" might get changed to "Hard Reps" by the user.) (Or am I making a big deal out of nothing, and this is ok as is?)
Thanks, in advance, for your help.
Unless you are dealing with millions of rows, I'd leave the structure as it is. It is straightforward and easy to query.
If you are dealing with millions of rows and you have measured the storage impact and deem it unacceptable, then you have couple of options (not necessarily mutually exclusive):
Don't store the defaults
If a parameter is not present in exerciseParameters simply assume it has a default value. The actual defaults can be stored in a separate table or outside the database altogether (depending on your querying needs).
If user changes the default parameter, store it in exerciseParameters.
If user deletes the default parameter, represent it as an exerciseParameters row containing a NULL value.
If user restores the default parameter to its original value, remove it from exerciseParameters.
This exploits the assumption that there will be many more unchanged than either edited or deleted defaults. The cost is in increased complexity (in both modification and querying) and potentially performance.
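A minimal sketch of the lookup under this scheme; it assumes a defaults table (here called defaultParameters, with Name and DefaultValue columns) and a nullable Value column added to exerciseParameters, neither of which is in the question's schema:

SELECT d.Name,
       COALESCE(ep.Value, d.DefaultValue) AS Value
FROM defaultParameters d
LEFT JOIN exerciseParameters ep
       ON ep.exerciseID = ?          -- the exercise being loaded
      AND ep.Name       = d.Name
WHERE ep.ID IS NULL                  -- default untouched
   OR ep.Value IS NOT NULL;          -- overridden (a stored NULL Value marks a deleted default)

Parameters the user added that have no default would be picked up separately, e.g. with a UNION against exerciseParameters alone.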
Reorganize your data model
So names (and values) are stored only once, making the repetitions cheaper. For example:
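(One possible layout; the lookup-table names here are assumptions:)

CREATE TABLE parameterNames (
  ID   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  Name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE parameterValues (
  ID    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  Value VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE exerciseParameters (
  ID               INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  exerciseID       INT UNSIGNED NOT NULL,  -- references exercises(ID)
  ParameterNameID  INT UNSIGNED NOT NULL,  -- references parameterNames(ID)
  ParameterValueID INT UNSIGNED NOT NULL   -- references parameterValues(ID)
);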
ParameterNameID and ParameterValueID are integers, so each repetition in exerciseParameters is much cheaper (storage-wise) than if they were strings. OTOH, you lose simplicity and potentially pay a price in querying performance (more JOINs needed).
Use a different DBMS
One that supports clustering and leading-edge index compression (for example, Oracle's ORGANIZATION INDEX COMPRESS table can greatly diminish the storage impact of repeated values).
You could add another table, defaultExerciseParams, with the default parameters and values. Whenever a user decides to override any of those, remove the param from this table and push it into the exerciseParameters table.

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version, which I guess is 7-8 years old, was probably done by someone not very knowledgeable in PHP and MySQL, so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine, I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I tried that model once, with a table of 120+ attributes (growing by 5-10 every year) and about 100k+ rows added every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for any situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually use a generous length, thus increasing the indexes. In my case, attribs could have from 3 to 130 chars. The value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP says, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT() given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems - either have lookup tables for attribute values or make them ENUMs.
EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but then I would have to join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
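A minimal sketch of such a table (the table name, column sizes, and the secondary index are assumptions):

CREATE TABLE user_attributes (
  user_id  INT UNSIGNED NOT NULL,          -- references users(id)
  property VARCHAR(64)  NOT NULL,          -- e.g. 'eye_color'
  value    VARCHAR(255) NULL,
  PRIMARY KEY (user_id, property),
  KEY idx_property_value (property, value(100))  -- supports "find users where property X = Y"
);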
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is create a table with 1 million rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 million rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations like caching (on all levels, from the MySQL query cache to HTML caching), getting better hardware, etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for it). So most people would probably start with something like:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate_attrs.user_id = candidate.id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value = current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you can get a very effective query:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate_attrs.user_id = candidate.id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
      BETWEEN current_user_attrs.attr_value - $tolerance
          AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(The value of $tolerance will affect the number of rows returned and query performance, assuming you've got an index on (attr_type, attr_value).)
This can be further refined into a points scoring system:
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate_attrs.user_id = candidate.id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
      BETWEEN current_user_attrs.attr_value - $tolerance
          AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs,
     attribute_subsets s
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate_attrs.user_id = candidate.id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND s.subset_name = $required_subset
  AND s.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
      BETWEEN current_user_attrs.attr_value - $tolerance
          AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code or the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, then apply the same approach to find the best match within the subset of candidates who have also been matched to that stereotype.
I can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. With 100 columns in one table, it rather sounds like it is not normalized right now.
Also, you can easily enforce referential integrity with foreign keys by using transactions and the InnoDB engine.

How to store multi-valued profile details?

I have many fields which are multi-valued and I am not sure how to store them. If I do 3NF then there are many tables. For example: nationality.
A person can have a single or dual nationality. If dual, this means it is one-to-many, so I create a user table and a user_nationality table (there is already a nationality lookup table). Or I could put both nationalities into the same row, like "American, German", and then unserialize it at run time. But then I don't know if I can search this - if I search for only German people, will it show up?
This is an example; I have over 30 fields which are multi-valued, so I assume I will not be creating 61 tables for this? 1 user table, 30 lookup tables to hold each multi-valued item's lookups, and 30 tables to hold the user_ values for the multi-valued items?
You must also keep in mind that some multi-valued fields group together, like "colleges I have studied at": it has a group of fields such as college name, degree type, timeline, etc., and a user can have 1 to many of these. So I assume I can create a separate table for this, like user_education, with these fields. But let's assume one of these fields is also a fixed-list multi-valued field, like "college campuses I visited" - then we end up in a never-ending chain of FK tables, which isn't a good design for social networks, as the goal is to put as much data into as few tables as possible for performance.
If you need to keep using SQL, you will need to create these tables. You will need to decide how far you are willing to go and impose limitations on the system (such as only being able to specify one campus).
As far as nationality goes, if you will only ever require two nationalities (worst-case scenario), you could consider a second nationality field (Nationality and Nationality2) to account for this. Of course this only applies to fields with a small maximum number of different values.
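To illustrate the user_nationality route from the question, which stays fully searchable (user_nationality is the name the question uses; the users and nationality table/column names are assumptions):

CREATE TABLE user_nationality (
  user_id        INT UNSIGNED NOT NULL,   -- references users(id)
  nationality_id INT UNSIGNED NOT NULL,   -- references the nationality lookup table
  PRIMARY KEY (user_id, nationality_id)
);

-- "Show me only German people":
SELECT u.*
FROM users u
JOIN user_nationality un ON un.user_id = u.id
JOIN nationality n       ON n.id = un.nationality_id
WHERE n.name = 'German';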
If your user table has a lot of related attributes, then one possibility is to create one attributes table with rows like (user_id, attribute_name, attribute_value). You can store all your attributes to one table. You can use this table to fetch attributes for given users, also search by attribute names and values.
The simple solution is to stop using a SQL table. This is what NoSQL is designed for. Check out CouchDB or Mongo. There, each value can be stored as a full structure - so this whole problem could be reduced to a single (not-really-)table.
The downside of pretty much any SQL-based solution is that it will be slow: either slow when fetching a single user (a massive JOIN statement won't execute quickly) or slow when searching (if you decide to store these values serialized).
You might also want to look at ORM which will map your objects to a database automatically.
http://en.wikipedia.org/wiki/List_of_object-relational_mapping_software#PHP
This is an example; I have over 30 fields which are multi-valued, so I assume I will not be creating 61 tables for this?
You're right that 61 is the maximum number of tables, but in reality it'll likely be fewer; take your own examples:
"colleges i have studied at"
"college campuses i visited"
In this case you'll probably only have one "college" table, so there would be four tables in this layout, not five.
I'd say don't be afraid of using lots of tables if the data set you're modelling is large - just make sure you keep an up-to-date ERD so you don't get lost! Also, don't get caught up too much in the "link table" paradigm - "link tables" can be "entities" in their own right. For example, you could think of the "colleges I have studied at" link table as a "college enrolments" table instead, give it its own primary key, and store each of the times you pay your course fees as rows in a (linked) "college enrolment payments" table.
