When to use comma-separated values in a DB Column? - php

OK, I know the technical answer is NEVER.
BUT, there are times when it seems to make things SO much easier with less code and seemingly few downsides, so please hear me out.
I need to build a Table called Restrictions to keep track of what type of users people want to be contacted by and that will contain the following 3 columns (for the sake of simplicity):
minAge
lookingFor
drugs
lookingFor and drugs can contain multiple values.
Database theory tells me I should use a join table to keep track of the multiple values a user might have selected for either of those columns.
But it seems that using comma-separated values makes things so much easier to implement and execute. Here's an example:
Let's say User 1 has the following Restrictions:
minAge => 18
lookingFor => 'Hang Out','Friendship'
drugs => 'Marijuana','Acid'
Now let's say User 2 wants to contact User 1. Well, first we need to see if he fits User 1's Restrictions, but that's easy enough EVEN WITH the comma-separated columns, as such:
First I'd get the Target's (User 1) Restrictions:
SELECT * FROM Restrictions WHERE UserID = 1
Now I just put those into respective variables as-is into PHP:
$targetMinAge = $row['minAge'];
$targetLookingFor = $row['lookingFor'];
$targetDrugs = $row['drugs'];
Now we just check if the SENDER (User 2) fits that simple Criteria:
SELECT COUNT(*)
FROM Users
WHERE
Users.UserID = 2 AND
Users.minAge >= $targetMinAge AND
Users.lookingFor IN ($targetLookingFor) AND
Users.drugs IN ($targetDrugs)
Finally, if COUNT == 1, User 2 can contact User 1, else they cannot.
How simple was THAT? It just seems really easy and straightforward, so what is the REAL problem with doing it this way as long as I sanitize all inputs to the DB every time a user updates their contact restrictions? Being able to use MySQL's IN function and already storing the multiple values in a format it will understand (e.g. comma-separated values) seems to make things so much easier than having to create join tables for every multiple-choice column. And I gave a simplified example, but what if there are 10 multiple choice columns? Then things start getting messy with so many join tables, whereas the CSV method stays simple.
So, in this case, is it really THAT bad if I use comma-separated values?
****ducks****

You already know the answer.
First off, your PHP code isn't even close to working, because it only works if user 2 has a single value in lookingFor or drugs. If either of those columns contains multiple comma-separated values then IN won't work, even if those values are in the exact same order as User 1's values. What do you expect IN to do when the value being tested is itself a comma-separated string?
Therefore, it's not "easy" to do what you want in PHP. It's actually quite a pain and would involve splitting user 2's fields into single values, writing dynamic SQL with many ORs to do the comparison, and then doing an extremely inefficient query to get the results.
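As a rough sketch of one such workaround (the variable names, the quote handling, and the FIND_IN_SET approach are my assumptions, not the answer's), this is the kind of code the CSV design forces on you:

```php
// Sketch only: split User 1's CSV column in PHP, then generate one OR'd
// FIND_IN_SET() per value against User 2's CSV column.
// Assumes a mysqli connection in $db and the target's row in $target.
$targetLookingFor = array_map('trim',
    explode(',', str_replace("'", '', $target['lookingFor'])));

$ors = array();
foreach ($targetLookingFor as $value) {
    $ors[] = "FIND_IN_SET('" . $db->real_escape_string($value) . "',"
           . " REPLACE(Users.lookingFor, \"'\", '')) > 0";
}

$sql = "SELECT COUNT(*) FROM Users
        WHERE Users.UserID = 2
          AND Users.minAge >= " . (int)$target['minAge'] . "
          AND (" . implode(' OR ', $ors) . ")";
// ...and then the whole block again for the drugs column.
```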
Furthermore, the fact that you even need to write PHP code to answer such a relatively simple question about the intersection of two sets means that your design is badly flawed. This is exactly the kind of problem (relational algebra) that SQL exists to solve. A correct design allows you to solve the problem in the database and then simply implement a presentation layer on top in PHP or some other technology.
Do it correctly and you'll have a much easier time.
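For contrast, a minimal sketch of the normalised design (table and column names here are illustrative, not prescriptive): each multi-valued column becomes a join table, and the check becomes a plain query.

```sql
-- One row per value, instead of a CSV string
CREATE TABLE RestrictionLookingFor (   -- what user 1 will accept
    UserID     INT NOT NULL,
    lookingFor VARCHAR(50) NOT NULL,
    PRIMARY KEY (UserID, lookingFor)
);

CREATE TABLE UserLookingFor (          -- what each user is looking for
    UserID     INT NOT NULL,
    lookingFor VARCHAR(50) NOT NULL,
    PRIMARY KEY (UserID, lookingFor)
);

-- "May user 2 contact user 1?"  A non-zero count means yes.
SELECT COUNT(*)
FROM Users u
JOIN Restrictions r            ON r.UserID = 1
JOIN UserLookingFor ulf        ON ulf.UserID = u.UserID
JOIN RestrictionLookingFor rlf ON rlf.UserID = r.UserID
                              AND rlf.lookingFor = ulf.lookingFor
WHERE u.UserID = 2
  AND u.minAge >= r.minAge;
-- (drugs gets the same pair of tables and the same pair of joins)
```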

Suppose User 1 is looking for 'Hang Out','Friendship' and User 2 is looking for 'Friendship','Hang Out'
Your code would not match them up, because 'Friendship','Hang Out' is not in ('Hang Out','Friendship')
That's the real problem here.

Related

Best way to create nested array from tables: multiple queries/loops VS single query/loop style

Say I have 2 tables,
which I can "merge" and represent in a single nested array.
I'm wondering what would be the best way to do that, considering:
efficiency
best-practice
DB/server-side usage trade-off
what you should do in real life
same case for 3, 4 or more tables that can be "merged" that way
The question is about ANY server-side/relational-db.
2 simple ways I was thinking about
(if you have others, please suggest!
notice I'm asking for a simple SERVER-SIDE and RELATIONAL-DB,
so please don't waste your time explaining why I shouldn't
use this kind of DB, use MVC design, etc., etc. ...):
2 loops, 5 simple "SELECT" queries
1 loop, 1 "JOIN" query
I've tried to give a simple and detailed example,
in order to explain myself & better understand your answers
(though how to write the code and/or
finding possible mistakes is not the issue here,
so try not to focus on that...)
SQL SCRIPTS FOR CREATING AND INSERTING DATA TO TABLES
CREATE TABLE persons
(
id int NOT NULL AUTO_INCREMENT,
fullName varchar(255),
PRIMARY KEY (id)
);
INSERT INTO persons (fullName) VALUES ('Alice'), ('Bob'), ('Carl'), ('Dan');
CREATE TABLE phoneNumbers
(
id int NOT NULL AUTO_INCREMENT,
personId int,
phoneNumber varchar(255),
PRIMARY KEY (id)
);
INSERT INTO phoneNumbers (personId, phoneNumber) VALUES ( 1, '123-456'), ( 1, '234-567'), (1, '345-678'), (2, '456-789'), (2, '567-890'), (3, '678-901'), (4, '789-012');
A JSON REPRESENTATION OF THE TABLES AFTER I "MERGED" THEM:
[
{
"id": 1,
"fullName": "Alice",
"phoneNumbers": [
"123-456",
"234-567",
"345-678"
]
},
{
"id": 2,
"fullName": "Bob",
"phoneNumbers": [
"456-789",
"567-890"
]
},
{
"id": 3,
"fullName": "Carl",
"phoneNumbers": [
"678-901"
]
},
{
"id": 4,
"fullName": "Dan",
"phoneNumbers": [
"789-012"
]
}
]
PSEUDO CODE FOR 2 WAYS:
1.
query: "SELECT id, fullName FROM persons"
personList = new List<Person>()
foreach row x in query result:
    current = new Person(x.fullName)
    query: "SELECT phoneNumber FROM phoneNumbers WHERE personId = x.id"
    foreach row y in query result:
        current.phoneNumbers.Push(y.phoneNumber)
    personList.Push(current)
print personList
2.
query: "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId"
personList = new List<Person>()
current = null
previouseId = null
foreach row x in query result:
    if ( x.id != previouseId )
        if ( current != null )
            personList.Push(current)
        current = new Person(x.fullName)
        previouseId = x.id
    current.phoneNumbers.Push(x.phoneNumber)
if ( current != null )
    personList.Push(current)
print personList
CODE IMPLEMENTATION IN PHP/MYSQL:
1.
/* get all persons */
$result = mysql_query("SELECT id, fullName FROM persons");
$personsArray = array(); //Create an array
//loop all persons
while ($row = mysql_fetch_assoc($result))
{
//add new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
/* add all person phone-numbers */
$id = $current['id'];
$sub_result = mysql_query("SELECT phoneNumber FROM phoneNumbers WHERE personId = {$id}");
$phoneNumbers = array();
while ($sub_row = mysql_fetch_assoc($sub_result))
{
$phoneNumbers[] = $sub_row['phoneNumber'];
}
//add phoneNumbers array to person
$current['phoneNumbers'] = $phoneNumbers;
//add person to final result array
$personsArray[] = $current;
}
echo json_encode($personsArray);
2.
/* get all persons and their phone-numbers in a single query */
$sql = "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId";
$result = mysql_query($sql);
$personsArray = array();
/* init temp vars to save current person's data */
$current = null;
$previouseId = null;
$phoneNumbers = array();
while ($row = mysql_fetch_assoc($result))
{
/*
if the current id is different from the previous id:
you've got to a new person.
save the previous person (if such exists),
and create a new one
*/
if ($row['id'] != $previouseId )
{
// in the first iteration,
// current (previous person) is null,
// don't add it
if ( !is_null($current) )
{
$current['phoneNumbers'] = $phoneNumbers;
$personsArray[] = $current;
$current = null;
$previouseId = null;
$phoneNumbers = array();
}
// create a new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
// set current as previous id
$previouseId = $current['id'];
}
// you always add the phone-number
// to the current phone-number list
$phoneNumbers[] = $row['phoneNumber'];
}
// don't forget to add the last person (saved in "current"),
// including their phone numbers
if (!is_null($current))
{
$current['phoneNumbers'] = $phoneNumbers;
$personsArray[] = $current;
}
echo json_encode($personsArray);
P.S.
This link is an example of a different question here, where I tried to suggest the second way: tables to single json
Preliminary
First, thank you for putting that much effort into explaining the problem, and for the formatting. It is great to see someone who is clear about what they are doing, and what they are asking.
But it must be noted that that, in itself, forms a limitation: you are fixed on the notion that this is the correct solution, and that with some small correction or guidance, this will work. That is incorrect. So I must ask you to give that notion up, to take a big step back, and to view (a) the whole problem and (b) my answer without that notion.
The context of this answer is:
all the explicit considerations you have given, which are very important, which I will not repeat
the two most important of which are what is best practice, and what I would do in real life
This answer is rooted in Standards, the higher order of, or frame of reference for, best practice. This is what the commercial Client/Server world does, or should be doing.
This issue, this whole problem space, is becoming a common problem. I will give a full consideration here, and thus answer another SO question as well. Therefore it might contain a tiny bit more detail than you require. If it does, please forgive this.
Consideration
The database is a server-based resource, shared by many users. In an online system, the database is constantly changing. It contains that One Version of the Truth (as distinct from One Fact in One Place, which is a separate, Normalisation issue) of each Fact.
The fact that some database systems do not have a server architecture, and that therefore the notion of a server in such software is false and misleading, is a separate but noted point.
As I understand it, JSON and JSON-like structures are required for "performance reasons", precisely because the "server" doesn't, cannot, perform as a server. The concept is to cache the data on each (every) client, such that you are not fetching it from the "server" all the time.
This opens up a can of worms. If you do not design and implement this properly, the worms will overrun the app.
Such an implementation is a gross violation of the Client/Server Architecture, which allows simple code on both sides, and appropriate deployment of software and data components, such that implementation times are small, and efficiency is high.
Further, such an implementation requires a substantial implementation effort, and it is complex, consisting of many parts. Each of those parts must be appropriately designed.
The web, and the many books written in this subject area, provide a confusing mix of methods, marketed on the basis of supposed simplicity; ease; anyone-can-do-anything; freeware-can-do-anything; etc. There is no scientific basis for any of those proposals.
Non-architecture & Sub-standard
As evidenced, you have learned that some approaches to database design are incorrect. You have encountered one problem, one instance where that advice is false. As soon as you solve this one problem, the next problem, which is not apparent to you right now, will be exposed. The notions are a never-ending set of problems.
I will not enumerate all the false notions that are sometimes advocated. I trust that as you progress through my answer, you will notice that one after the other marketed notion is false.
The two bottom lines are:
The notions violate Architecture and Design Standards, namely Client/Server Architecture; Open Architecture; Engineering Principles; and, to a lesser extent in this particular problem, Database Design Principles.
Which leads to people like you, who are trying to do an honest job, being tricked into implementing simple notions, which turn into massive implementations. Implementations that will never quite work, so they require substantial ongoing maintenance, and will eventually be replaced, wholesale.
Architecture
The central principle being violated is, never duplicate anything. The moment you have a location where data is duplicated (due to caching or replication or two separate monolithic apps, etc), you create a duplicate that will go out of synch in an online situation. So the principle is to avoid doing that.
Sure, for serious third-party software, such as a gruntly report tool, by design, they may well cache server-based data in the client. But note that they have put hundreds of man-years into implementing it correctly, with due consideration to the above. Yours is not such a piece of software.
Rather than providing a lecture on the principles that must be understood, or the evils and costs of each error, the rest of this answer provides the requested what would you do in real life, using the correct architectural method (a step above best practice).
Architecture 1
Do not confuse
the data which must be Normalised
with
the result set, which, by definition, is the flattened ("de-normalised" is not quite correct) view of the data.
The data, given that it is Normalised, will not contain duplicate values; repeating groups. The result set will contain duplicate values; repeating groups. That is pedestrian.
Note that the notion of Nested Sets (or Nested Relations), which is in my view not good advice, is based on precisely this confusion.
For forty-five years since the advent of the RM, they have been unable to differentiate base relations (for which Normalisation does apply) from derived relations (for which Normalisation does not apply).
Two of these proponents are currently questioning the definition of First Normal Form. 1NF is the foundation of the other NFs, if the new definition is accepted, all the NFs will be rendered value-less. The result would be that Normalisation itself (sparsely defined in mathematical terms, but clearly understood as a science by professionals) will be severely damaged, if not destroyed.
Architecture 2
There is a centuries-old scientific or engineering principle, that content (data) must be separated from control (program elements). This is because the analysis, design, and implementation of the two are completely different. This principle is no less important in the software sciences, where it has specific articulation.
In order to keep this brief (ha ha), instead of a discourse, I will assume that you understand:
That there is a scientifically demanded boundary between data and program elements. Mixing them up results in complex objects that are error-prone and hard to maintain.
The confusion of this principle has reached epidemic proportions in the OO/ORM world, the consequences reach far and wide.
Only professionals avoid this. For the rest, the great majority, they accept the new definition as "normal", and they spend their lives fixing problems that we simply do not have.
The architectural superiority, the great value, of data being both stored and presented in Tabular Form per Dr E F Codd's Relational Model. That there are specific rules for Normalisation of data.
And importantly, you can determine when the people, who write and market books, advise non-relational or anti-relational methods.
Architecture 3
If you cache data on the client:
Cache the absolute minimum.
That means cache only the data that does not change in the online environment. That means Reference and Lookup tables only, the tables that populate the higher level classifiers, the drop-downs, etc.
Currency
For every table that you do cache, you must have a method of (a) determining that the cached data has become stale, compared to the One Version of the Truth which exists on the server, and (b) refreshing it from the server, (c) on a table-by-table basis.
Typically, this involves a background process that executes every (say) five minutes, queries the MAX updated DateTime for each cached table on the client vs the DateTime on the server, and, if it has changed, refreshes the table and all its child tables, those that depend on the changed table.
That, of course, requires that you have an UpdatedDateTime column on every table. That is not a burden, because you need that for OLTP ACID Transactions anyway (if you have a real database, instead of a bunch of sub-standard files).
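A rough sketch of that check, per cached table (the table and column names and the refresh helper here are placeholders, not part of the answer):

```php
// Hypothetical background check: is the client's cached copy of RefTable stale?
$row = mysql_fetch_assoc(
    mysql_query("SELECT MAX(UpdatedDateTime) AS maxDtm FROM RefTable"));

if ($row['maxDtm'] > $cachedMaxDtm['RefTable']) {
    // stale: re-fetch RefTable and the child tables that depend on it
    refresh_cached_table('RefTable');          // placeholder helper
    $cachedMaxDtm['RefTable'] = $row['maxDtm'];
}
```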
Which really means, never replicate, the coding burden is prohibitive.
Architecture 4
In the sub-commercial, non-server world, I understand that some people advise the reverse caching of "everything".
That is the only way that programs like PostgreSQL can be used in a multi-user system.
You always get what you pay for: you pay peanuts, you get monkeys; you pay zero, you get zero.
The corollary to Architecture 3 is, if you do cache data on the client, do not cache tables that change frequently. These are the transaction and history tables. The notion of caching such tables, or all tables, on the client is completely bankrupt.
In a genuine Client/Server deployment, due to use of applicable standards, for each data window, the app should query only the rows that are required, for that particular need, at that particular time, based on context or filter values, etc. The app should never load the entire table.
If the same user using the same window inspected its contents, 15 minutes after the first inspection, the data would be 15 mins out of date.
For freeware/shareware/vapourware platforms, which define themselves by the absence of a server architecture, and thus by the result, that performance is non-existent, sure, you have to cache more than the minimum tables on the client.
If you do that, you must take all the above into account, and implement it correctly, otherwise your app will be broken, and the ramifications will drive the users to seek your termination. If there is more than one user, they will have the same cause, and soon form an army.
Architecture 5
Now we get to how you cache those carefully chosen tables on the client.
Note that databases grow, they are extended.
If the system is broken, a failure, it will grow in small increments, and require a lot of effort.
If the system is even a small success, it will grow exponentially.
If the system (each of the database, and the app, separately) is designed and implemented well, the changes will be easy, the bugs will be few.
Therefore, all the components in the app must be designed properly, to comply with applicable standards, and the database must be fully Normalised. This in turn minimises the effect of changes in the database, on the app, and vice versa.
The app will consist of simple, not complex, objects, which are easy to maintain and change.
For the data that you do cache on the client, you will use arrays of some form: multiple instances of a class in an OO platform; DataWindows (TM, google for it) or similar in a 4GL; simple arrays in PHP.
(Aside. Note what people in situations such as yours produce in one year, versus what professional providers produce using a commercial SQL platform, a commercial 4GL, and complying with Architecture and Standards.)
Architecture 6
So let's assume that you understand all the above, and appreciate its value, particularly Architecture 1 & 2.
If you don't, please stop here and ask questions, do not proceed to the below.
Now that we have established the full context, we can address the crux of your problem.
In those arrays in the app, why on Earth would you store flattened views of data ?
and consequently mess with, and agonise over, the problems
instead of storing copies of the Normalised tables ?
Answer
Never duplicate anything that can be derived. That is an Architectural Principle, not limited to Normalisation in a database.
Never merge anything.
If you do, you will be creating:
data duplication, and masses of it, on the client. The client will not only be fat and slow, it will be anchored to the floor with the ballast of duplicated data.
additional code, which is completely unnecessary
complexity in that code
code that is fragile, that will constantly have to change.
That is the precise problem you are suffering, a consequence of the method, which you know intuitively is wrong, that there must be a better way. You know it is a generic and common problem.
Note also that that method, that code, constitutes a mental anchor for you. Look at the way that you have formatted it and presented it so beautifully: it is of importance to you. I am reluctant to inform you of all this.
Which reluctance is easily overcome, due to your earnest and forthright attitude, and the knowledge that you did not invent this method
In each code segment, at presentation time, as and when required:
a. In the commercial Client/Server context
Execute a query that joins the simple, Normalised, unduplicated tables, and retrieves only the qualifying rows. Thereby obtaining current data values. The user never sees stale data. Here, Views (flattened views of Normalised data) are often used.
b. In the sub-commercial non-server context
Create a temporary result-set array, and join the simple, unduplicated, arrays (copies of tables that are cached), and populate it with only the qualifying rows, from the source arrays. The currency of which is maintained by the background process.
Use the Keys to form the joins between the arrays, in exactly the same way that Keys are used to form the joins in the Relational tables in the database.
Destroy those components when the user closes the window.
A clever version would eliminate the result-set array, and join the source arrays via the Keys, and limit the result to the qualifying rows.
Separate to being architecturally incorrect, Nested Arrays or Nested Sets or JSON or JSON-like structures are simply not required. This is the consequence of confusing the Architecture 1 Principle.
If you do choose to use such structures, then use them only for the temporary result-set arrays.
Last, I trust this discourse demonstrates that n tables is a non-issue. More important, that m levels deep in the data hierarchy, the "nesting", is a non-issue.
Answer 2
Now that I have given the full context (and not before), which removes the implications in your question, and makes it a generic, kernel one.
The question is about ANY server-side/relational-db. [Which is better]:
2 loops, 5 simple "SELECT" queries
1 loop, 1 "JOIN" query
The detailed examples you have given are not accurately described above. The accurate descriptions are:
Your Option 1
2 loops, each loop for loading each array
1 single-table SELECT query per loop
(executed n x m times ... the outermost loop, only, is a single execution)
Your Option 2
1 Joined SELECT query executed once
followed by 2 loops, each loop for loading each array
For the commercial SQL platforms, neither, because it does not apply.
The commercial SQL server is a set-processing engine. Use one query with whatever joins are required, that returns a result set. Never step through the rows using a loop, that reduces the set-processing engine to a pre-1970's ISAM system. Use a View, in the server, since it affords the highest performance and the code is in one place.
However, for the non-commercial, non-server platforms, where:
your "server" is not a set-processing engine ie. it returns single rows, therefore you have to fetch each row and fill the array, manually or
your "server" does not provide Client/Server binding, ie. it does not provide facilities on the client to bind the incoming result set to a receiving array, and therefore you have to step through the returned result set, row by row, and fill the array, manually,
as per your example then, the answer is, by a large margin, your option 2.
Please consider carefully, and comment or ask questions.
Response to Comment
Say I need to print this json (or other html page) to some STDOUT (example: an http response to: GET /allUsersPhoneNumbers. It's just an example to clarify what I'm expecting to get), should return this json. I have a php function that got these 2 result sets (1). Now it should print this json - how should I do that? This report could be an employee's monthly salary for a whole year, and so on. One way or another, I need to gather this information and represent it in a "JOIN"ed representation
Perhaps I was not clear enough.
Basically, do not use JSON unless you absolutely have to. Which means only when sending to some system that requires it; and that receiving system, and that demand, is stupid.
Make sure that your system doesn't make such demands on others.
Keep your data Normalised. Both in the database, and in whatever program elements that you write. That means (in this example) use one SELECT per table or array. That is for loading purposes, so that you can refer to and inspect them at any point in the program.
When you need a join, understand that it is:
a result-set; a derived relation; a view
therefore temporary, it exists for the duration of the execution of that element, only
a. For tables, join them in the usual manner, via Keys. One query, joining two (or more) tables.
b. For arrays, join arrays in the program, the same way you join tables in the database, via Keys.
For the example you have given, which is a response to some request, first understand that it is the category [4], and then fulfil it.
Why even consider JSON?
What has JSON got to do with this?
JSON is misunderstood and people are interested in the wow factor. It is a solution looking for a problem. Unless you have that problem it has no value.
Check these two links:
Copter - What is JSON
Stack Overflow - What is JSON
Now if you understand that, it is mostly for incoming feeds. Never for outgoing. Further, it requires parsing, deconstructing, etc, before it can be used.
Recall:
I need to gather this information and represent it in a "JOIN"ed representation
Yes. That is pedestrian. Joined does not mean JSONed.
In your example, the receiver is expecting a flattened view (eg. spreadsheet), with all the cells filled, and yes, for Users with more than one PhoneNumber, their User details will be repeated on the second and subsequent result-set rows. For any kind of print, eg. for debugging, I want a flattened view. It is just a:
SELECT ... FROM Person JOIN PhoneNumber
And return that. Or if you fulfil the request from arrays, join the Person and PhoneNumber Arrays, which may require a temporary result-set array, and return that.
please don't tell me you should get only 1 user at a time, etc. etc.
Correct. If someone tells you to regress to procedural processing (ie. row by row, in a WHILE loop), where the engine or your program has set processing (ie. processes an entire set in one command), that marks them as someone who should not be listened to.
I have already stated, your Option 2 is correct, Option 1 is incorrect. That is as far as the GET or SELECT is concerned.
On the other hand, for programming languages that do not have set-processing capability (ie. cannot print/set/inspect an array in a single command), or "servers" that do not provide client-side array binding, you do have to write loops, one loop per depth of the data hierarchy (in your example, two loops, one for Person, and one for PhoneNumber per User).
You have to do that to parse an incoming JSON object.
You have to do that to load each array from the result set that is returned in your Option 2.
You have to do that to print each array from the result set that is returned in your Option 2.
Response to Comment 2
I meant I have to return a result represented in a nested version (let's say I'm printing the report to the page); json was just an example of such a representation.
I don't think you understand the reasoning and the conclusions I have provided in this answer.
For printing and displaying, never nest. Print a flattened view, the rows returned from the SELECT per Option 2. That is what we have been doing, when printing or displaying data Relationally, for 31 years. It is easier to read, debug, search, find, fold, staple, mutilate. You cannot do anything with a nested array, except look at it, and say gee that is interesting.
Code
Caveat
I would prefer to take your code and modify it, but actually, looking at your code, it is not well written or structured, it cannot be reasonably modified. Second, if I use that, it would be a bad teaching tool. So I will have to give you fresh, clean code, otherwise you will not learn the correct methods.
These code examples follow my advice, so I am not going to repeat it. And this is way beyond the original question.
Query & Print
Your request, using your Option 2. One SELECT executed once. Followed by one loop. Which you can "pretty up" if you like.
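A minimal PHP sketch of that (this is not the author's code; it simply follows the description: one joined SELECT, executed once, then one loop printing the flattened view):

```php
// One joined SELECT, then one loop: print the flattened result set.
// Person columns simply repeat on each of that person's phoneNumber rows.
$result = mysql_query(
    "SELECT persons.id, persons.fullName, phoneNumbers.phoneNumber
     FROM persons
     LEFT JOIN phoneNumbers ON phoneNumbers.personId = persons.id
     ORDER BY persons.id");

printf("%-4s %-10s %s\n", 'id', 'fullName', 'phoneNumber');
while ($row = mysql_fetch_assoc($result)) {
    printf("%-4s %-10s %s\n", $row['id'], $row['fullName'], $row['phoneNumber']);
}
```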
In general it is a best practice to grab the data you need in as few trips to the database as possible then map the data into the appropriate objects. (Option 2)
But, to answer your question I would ask yourself what the use case for your data is. If you know for sure that you will be needing your person and your phone number data then I would say the second method is your best option.
However, option one can also have its use case when the joined data is optional. One example of this could be that on the UI you have a table of all your people, and if a user wants to see the phone number for a particular person they have to click on that person. Then it would be acceptable to "lazy-load" all of the phone numbers.
This is a common problem, especially if you are creating WebAPIs; converting those table sets to nested arrays is a big deal.
I always go for the second option (in a slightly different method, though), because the first is the worst possible way to do it... One thing I learned from my experience is to never query inside a loop; that is a waste of DB calls, well, you know what I'm trying to say.
Although I don't accept all the things PerformanceDBA said, there are two major things I need to address:
1. Don't have duplicate data
2. Fetch only data you want
The only problem I see in joining the tables is that we end up duplicating lots of data. Take your data for example: joining the Person and phoneNumber tables, we end up duplicating every person for each of his phone numbers. For two tables with a few hundred rows that's fine, but imagine we need to merge 5 tables with thousands of rows; it's huge...
So here's my solution:
Query:
SELECT id, fullName From Person;
SELECT personId, phoneNumber FROM phoneNumbers
WHERE personId IN (SELECT id From Person);
So I get two tables in my result set. Now I assign Table[0] to my Person list,
and use 2 loops to place the right phoneNumbers into the right person...
Code:
personList = ConvertToEntity<List<Person>>(dataset.Table[0]);
pnoList = ConvertToEntity<List<PhoneNumber>>(dataset.Table[1]);
foreach (person in personList) {
    foreach (pno in pnoList) {
        if (pno.PersonId == person.Id)
            person.PhoneNumbers.Add(pno);
    }
}
I think the above method reduces a lot of duplication and only gets me what I wanted. If there is any downside to the above method please let me know... and thanks for asking this kind of question...
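If you do the same thing in PHP, a rough sketch could look like this; it keeps the two-query idea from this answer but indexes the persons by id, so each phone number is matched by a lookup rather than a nested loop:

```php
// Sketch: two plain queries, merged in PHP by keying persons on their id.
$persons = array();
$result = mysql_query("SELECT id, fullName FROM persons");
while ($row = mysql_fetch_assoc($result)) {
    $row['phoneNumbers'] = array();
    $persons[$row['id']] = $row;
}

$result = mysql_query("SELECT personId, phoneNumber FROM phoneNumbers");
while ($row = mysql_fetch_assoc($result)) {
    if (isset($persons[$row['personId']])) {
        $persons[$row['personId']]['phoneNumbers'][] = $row['phoneNumber'];
    }
}

echo json_encode(array_values($persons));
```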

Many to many vs one row [duplicate]

This question already has answers here:
Many database rows vs one comma separated values row
(4 answers)
Closed 8 years ago.
I'm interested in how and why a many-to-many relationship is better than storing the information in one row.
Example: I have two tables, Users and Movies (very big data). I need to establish a relationship "view".
I have two ideas:
Make another column in the Users table called "views", where I will store the ids of the movies this user has viewed, in a string, for example: "2,5,7...". Then I will process this information in PHP.
Make new table users_movies (many to many), with columns user_id and movie_id. row with user_id=5 and movie_id=7 means that user 5 has viewed movie 7.
I'm interested in which of these methods is better and WHY. Please consider that the data is quite big.
The second method is better in just about every way. Not only will you utilize your DBs indexes to find records faster, it will make modification far far easier.
Approach 1) could answer the question "Which movies has User X viewed" with SQL like "...find_in_set(movie_id, user_movielist)...". But the other way round ("Which users have viewed movie X") won't work in plain SQL.
That's why I always would go for approach 2): clear normalized structure, both ways are simple joins.
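To illustrate, a sketch of approach 2) with both questions as plain joins (the table and column names are assumptions based on the question):

```sql
-- Junction table: one row per "user X viewed movie Y"
CREATE TABLE users_movies (
    user_id  INT NOT NULL,
    movie_id INT NOT NULL,
    PRIMARY KEY (user_id, movie_id),
    KEY (movie_id)
);

-- Which movies has user 5 viewed?
SELECT m.*
FROM Movies m
JOIN users_movies um ON um.movie_id = m.id
WHERE um.user_id = 5;

-- Which users have viewed movie 7?
SELECT u.*
FROM Users u
JOIN users_movies um ON um.user_id = u.id
WHERE um.movie_id = 7;
```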
It's just about the needs you have. If you need performance then you must accept redundancy of the information and add a column. If your main goal is to respect the Normalization paradigm then you should not have redundancy at all.
When I have to do this type of choice I try to estimate the space loss of redundancy vs the frequency of the query of interest and its performance.
A few more thoughts.
In your first situation, if you look up a particular user you can easily get the list of ids for the films they have seen. But you would then need a separate query to get the details such as the titles of those movies. This might be one query using IN with the list of ids, or one query per film id. This would be inefficient and clunky.
With MySQL there is a possible fudge to join in this situation using the FIND_IN_SET() function (although a downside of this is you are straying into non-standard SQL). You could join your table of films to the users using ON FIND_IN_SET(film.id, users.film_id) > 0. However this is not going to use an index for the join, and involves a function (which, while quick for what it does, will be slow when performed on thousands of rows).
If you wanted to find all the users who had viewed any film a particular user had viewed, then it is a bit more difficult. You can't just use FIND_IN_SET as it requires a single string and a comma-separated list. As a single query you would need to join the particular user to the film table to get a lot of intermediate rows, and then join that back against the users again (using FIND_IN_SET) to find the other users.
There are ways in SQL to split up a comma separated list of values, but they are messy and anyone who has to maintain such code will hate it!
These are all fudges. With the 2nd solution these are easy to do, and any resulting joins can easily use indexes (and possibly the whole queries can just use indexes without touching the actual data).
A further issue with the first solution is data integrity. You will have to manually check that a film doesn't appear twice for a user (with the 2nd solution this can easily be enforced using a unique key). You also cannot just add a foreign key to ensure that any film id for a user does actually exist. Further, you will have to manually ensure that nothing enters a character string into your delimited list of ids.

Optimizing an MYSQL COUNT ORDER BY query

I have recently written a survey application that has done its job and all the data is gathered. Now I have to analyze the data and I'm having some time issues.
I have to find out how many people selected what option and display it all.
I'm using this query, which does do its job:
SELECT COUNT(*)
FROM survey
WHERE users = ? AND table = ? AND col = ? AND row = ? AND selected = ?
GROUP BY users,table,col,row,selected
As evident by the "?", I'm using MySQLi (in PHP) to fetch the data when needed, but I fear this is causing it to be so slow.
The table consists of all the elements above (+ a unique ID) and all of them are integers.
To explain some of the fields:
Each survey was divided into 3 or 4 tables (sized from 2x3 to 5x5) with a 1 to 10 happiness grade to select from. (Questions are on the right and top of the table, then you answer where the questions intersect.)
users - age groups
table, row, col - explained above
selected - dooooh explained above
Now with the surveys complete and around 1 million entries in the table, the query is getting very slow. Sometimes it takes like 3 minutes, sometimes (I guess) the time limit expires and you get no data at all. I also don't have access to the full database, just my empty "testing" one, since the customer is kinda paranoid :S (and his server seems to be a bit slow).
Now (after the initial essay) my questions are: I left indexing out intentionally because with a lot of data being written during the survey, it would be a bad idea. But since no new data is coming in at this point, would it make sense to index all the fields of the table? How much sense does it make to index integers that never go above 10? (As you can guess, I haven't got a clue about indexes.) Do I need the primary unique ID in this table?
I read somewhere that indexing may help GROUP BY, but only if you group by the first columns in a table (and since my ID is first and, from my point of view, useless, can I remove it and gain anything by it?)
Is there another way to write my query that would basically do the same thing but in a shorter period of time?
Thanks for all your suggestions in advance!
Add an index on the columns that you "GROUP BY" or "WHERE" on. So that's ONE index incorporating users, table, col, row and selected in your case.
Some quick rules:
combine fields to have the WHERE first, and the GROUP BY elements last.
If you have other queries that only use part of it (e.g. users,table,col and selected) then leave the missing value (row, in this example) last.
Don't use too many indexes, as each will marginally slow updates to the table - so on really large systems you need to balance queries with indexes.
Edit: do you need the GROUP BY users, col, row, as these are used in the WHERE? If the WHERE has already filtered on them, you only need to group by "selected".
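As a concrete sketch of that advice (the index name is arbitrary, and `table` needs backticks because it is a reserved word):

```sql
-- One composite index covering the WHERE/GROUP BY columns, selected last
ALTER TABLE survey
  ADD INDEX idx_survey_where (users, `table`, col, row, selected);
```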

How to store searchable arrays in MySQL

So I've got this form with an array of checkboxes to search for an event. When you create an event, you choose one or more of the checkboxes and then the event gets created with these "attributes". What is the best way to store it in a MySQL database if I want to filter results when searching for these events? Would creating several columns with boolean values be the best way? Or possibly a new table with the checkbox values only?
I'm pretty sure serializing is out of the question because I wouldn't be able to query the serialized string for whether the checkbox was ticked or not, right?
Thanks
You can use the set datatype or a separate table that you join. Either will work.
I would not do a bunch of columns though.
You can search the set easily using FIND_IN_SET(), but it's not indexed, so it depends on how many rows you expect (up to a few thousand is probably OK - it's a very fast search).
The normal solution is a separate table with one column being the ID of the event, and the second column being the attribute using the enum datatype (don't use text, it's slower).
Create separate columns, or you can store them all in one column using a bit mask.
One way would be to create a new table with a column for each checkbox, as already described by others. I'll not add to that.
However, another way is to use a bitmask. You have just one column myCheckboxes and store the values as an int. Then in the code you have constants or another appropriate way to store the correlation between each checkbox and its bit. I.e.:
CHECKBOX_ONE 1
CHECKBOX_TWO 2
CHECKBOX_THREE 4
CHECKBOX_FOUR 8
...
CHECKBOX_NINE 256
Remember to always use the next power of two for new values, otherwise you'll get values that overlap.
So, if the first two checkboxes have been checked you should have 3 as the value of myCheckboxes for that row. If you have ONE and FOUR checked you'd have 9 as the value of myCheckboxes, etc. When you want to see which rows have, say, checkboxes ONE, THREE and NINE checked, your query would be like:
SELECT * FROM myTable where myCheckboxes & 1 AND myCheckboxes & 4 AND myCheckboxes & 256;
This query will return only rows having all this checkboxes marked as checked.
You should also use bitwise operations when storing and reading the data.
This is a very efficient way when it comes to speed. You have just a single column, probably just a smallint, and your searches are pretty fast. This can make a big difference if you have several different collections of checkboxes that you want to store and search through. However, this makes the values harder to understand. If you see the value 261 in the DB it'll not be easy for a human to immediately see that this means checkboxes ONE, THREE and NINE have been checked, whereas it is much easier for a human seeing separate columns for each checkbox. This normally is not an issue, cause humans don't need to manually poke the database, but it's something worth mentioning.
From the coding perspective it's not much of a difference, but you'll have to be careful not to corrupt the values, cause it's not that hard to mess up a single int; it's magnitudes easier to screw up the data than when it's stored in different columns. So test carefully when adding new stuff. All that said, the speed and low memory benefits can be very big if you have a ton of different collections.
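A small PHP sketch of that approach (the constant names are illustrative):

```php
// One bit per checkbox; always use powers of two.
define('CHECKBOX_ONE',   1);
define('CHECKBOX_TWO',   2);
define('CHECKBOX_THREE', 4);
define('CHECKBOX_FOUR',  8);

// Storing: combine the checked boxes with bitwise OR
$myCheckboxes = CHECKBOX_ONE | CHECKBOX_FOUR;          // 9

// Reading: test a single bit with bitwise AND
if ($myCheckboxes & CHECKBOX_FOUR) {
    echo "Checkbox four is checked\n";
}

// Searching for rows where ALL of a set of boxes are checked
// (mirrors the SQL shown above)
$mask = CHECKBOX_ONE | CHECKBOX_THREE;                 // 5
$sql  = "SELECT * FROM myTable WHERE myCheckboxes & $mask = $mask";
```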

How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think is the best way to go about it?
This is not an answer per se, but since few answers here suggested the attribute-value model, I just wanted to jump in and say my life experience.
I tried this model once, with a table with 120+ attributes (growing by 5-10 every year) and adding about 100k+ rows every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for any situation) is that you need to put a primary key on user_id,attrib on that second table. Not knowing the potential length of attrib, you would usually use a greater length value, thus increasing the indexes. In my case, attribs could have from 3 to 130 chars. Also, the value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I can't imagine how 30-40 joins would be efficient, or even a group_concat() due to length limitations.
My only viable solution was to go back to a table with as much columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either have lookup tables for attribute values or make them ENUM()s.
EDIT 2: Of course, one could say I should have a look-up table for attribute possible values (reducing index sizes), but I should then make a join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
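Searching such a key => val table typically takes one join per searched option; a rough sketch (the option names here are invented, and I assume user.id as the key):

```sql
-- Find users matching two profile options at once: one join per option.
-- Note `option` must be back-quoted (reserved word), and every value is a string.
SELECT u.*
FROM user u
JOIN user_profile p1 ON p1.user_id = u.id
                    AND p1.`option` = 'smoker'    AND p1.value = 'no'
JOIN user_profile p2 ON p2.user_id = u.id
                    AND p2.`option` = 'has_photo' AND p2.value = 'yes';
```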
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for mysql. So, before trying to solve a problem make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2 table route. This will significantly reduce the amount of storage, code complexity, and the effort to changing the system to add new attributes.
Assuming that each attribute can be represented by an Ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)....
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you could get a very effective query:
SELECT candidate.id,
COUNT(*)
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value))
)) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
SUM(1/(1+
((candidate_attrs.attr_value - current_user_attrs.attr_value)
*(candidate_attrs.attr_value - current_user_attrs.attr_value))
)) as match_score
FROM users candidate,
attributes candidate_attrs,
attributes current_user_attrs,
attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
BETWEEN current_user_attrs.attr_value-$tolerance
AND current_user_attrs.attr_value+$tolerance
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space, then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier - then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to the stereotype.
I can't really suggest anything without seeing the schema. Generally, a MySQL database has to be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in 1 table.
Also - you can easily enforce referential integrity with foreign keys, using transactions and the InnoDB engine.
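For illustration, a foreign key on an InnoDB table looks like this (the table and column names are placeholders):

```sql
CREATE TABLE user_attribute (
    user_id   INT NOT NULL,
    attribute VARCHAR(64) NOT NULL,
    value     VARCHAR(255),
    PRIMARY KEY (user_id, attribute),
    FOREIGN KEY (user_id) REFERENCES users (id)
        ON DELETE CASCADE
) ENGINE=InnoDB;
```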
