MongoDB (PHP) - Custom "id" and OrderWith number

First, I should say that I'm new to MongoDB and document-oriented DBs in general.
After some trouble with embedded documents in MongoDB (being unable to select only a nested document, for example a single comment in a blog post),
I redesigned the db. Now I have two collections, posts and comments (not the real deal; I'm using the blog example for convenience's sake).
Example - posts collection document:
Array {
'_id' : MongoId,
'title' : 'Something',
'body' : 'Something awesome'
}
Example - comments document:
Array {
'_id' : MongoId,
'postId' : MongoId,
'userId' : MongoId,
'commentId' : 33,
'comment' : 'Punch the punch line!'
}
As you can see, I have multiple comment documents (as I said before, I want to be able to select a single comment, and not an array of them).
My plan is this: I want to select a single comment from the collection using postId and commentId (commentId is unique only among comments with the same postId).
Oh, and commentId needs to be an int, so that I can use that value to calculate the next and previous documents, a sort of "orderWith" number.
Now I can get a comment like this:
URI: mongo.php?post=4de526b67cdfa94f0f000000&comment=4
Code: $comment = $collection->findOne(array("postId" => $theObjId, "commentId" => (int)$commentId));
I have a few questions.
Am I doing it right?
What is the best way to generate that kind of commentId?
What is the best way to ensure that commentId is unique among comments with the same postId (upsert?)?
How to deal with concurrent queries?

Am I doing it right?
This is a really difficult question. Does it work? Does it meet your performance needs? Are you comfortable maintaining it?
MongoDB doesn't have any notion of "normalization" or "the one true way". You model your data in a way that works for you.
What is the best way to generate that kind of commentId?
What is the best way to ensure that commentId is unique among comments with the same postId (upsert?)?
This is really a complex problem. If you want to generate monotonically increasing integer IDs (like auto-increment), then you need a central authority for generating these integers. That doesn't tend to scale very well.
The commonly suggested method is to use the ObjectId/MongoId. That will give you a unique ID.
However, you really want an integer. So take a look at findAndModify. You can keep a "last_comment_id" on your post and then update it when creating a new comment.
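Something along these lines might work for generating that per-post integer. It is a sketch only: it assumes the post document carries a last_comment_id counter, and a legacy PHP Mongo driver version that exposes MongoCollection::findAndModify (older versions can send the equivalent "findandmodify" command via MongoDB::command).
$m = new MongoClient();
$db = $m->blog;                       // hypothetical database name
$postId = new MongoId($_GET['post']); // the post the comment belongs to
// Atomically bump a per-post counter and get the updated post back.
$post = $db->posts->findAndModify(
    array('_id' => $postId),
    array('$inc' => array('last_comment_id' => 1)),
    null,
    array('new' => true)              // return the document *after* the update
);
$commentId = (int) $post['last_comment_id'];
// Insert the comment with the freshly reserved, per-post unique integer id.
$db->comments->insert(array(
    'postId'    => $postId,
    'userId'    => $userId,           // hypothetical: the current user's id
    'commentId' => $commentId,
    'comment'   => 'Punch the punch line!'
));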
How to deal with concurrent queries?
Why would concurrent queries be a problem? Two readers should be able to access the same data.
Are you worried about concurrent comments being created? Then see the findAndModify docs.

I don't know if The Big Picture will allow you to do this, but here is how I'd do it.
I'd have an array of comments contained inside each post. This means no joins are needed. In your case, normalization of comments doesn't give any benefit. I'd replace CommentID with CreatedAt as the time of creation.
This will let you have an easy data model to work with, as well as the ability to sort it.
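For illustration only, such a post document might look roughly like this (the field names are assumptions, not part of the answer above):
// Hypothetical shape of a post with embedded comments, sortable by createdAt.
$post = array(
    'title'    => 'Something',
    'body'     => 'Something awesome',
    'comments' => array(
        array(
            'userId'    => new MongoId(),
            'comment'   => 'Punch the punch line!',
            'createdAt' => new MongoDate(),
        ),
        // ... further comments appended with $push
    ),
);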

Related

Best way to create nested array from tables: multiple queries/loops VS single query/loop style

Say I have 2 tables,
which I can "merge" and represent in a single nested array.
I'm wondering what would be the best way to do that, considering:
efficiency
best-practice
DB/server-side usage trade-off
what you should do in real life
same case for 3, 4 or more tables that can be "merged" that way
The question is about ANY server-side/relational-db.
2 simple ways I was thinking about
(if you have others, please suggest!
notice I'm asking for a simple SERVER-SIDE and RELATIONAL-DB,
so please don't waste your time explaining why I shouldn't
use this kind of DB, use MVC design, etc., etc. ...):
2 loops, 5 simple "SELECT" queries
1 loop, 1 "JOIN" query
I've tried to give a simple and detailed example,
in order to explain myself & understand better your answers
(though how to write the code and/or
finding possible mistakes is not the issue here,
so try not to focus on that...)
SQL SCRIPTS FOR CREATING AND INSERTING DATA TO TABLES
CREATE TABLE persons
(
id int NOT NULL AUTO_INCREMENT,
fullName varchar(255),
PRIMARY KEY (id)
);
INSERT INTO persons (fullName) VALUES ('Alice'), ('Bob'), ('Carl'), ('Dan');
CREATE TABLE phoneNumbers
(
id int NOT NULL AUTO_INCREMENT,
personId int,
phoneNumber varchar(255),
PRIMARY KEY (id)
);
INSERT INTO phoneNumbers (personId, phoneNumber) VALUES ( 1, '123-456'), ( 1, '234-567'), (1, '345-678'), (2, '456-789'), (2, '567-890'), (3, '678-901'), (4, '789-012');
A JSON REPRESENTATION OF THE TABLES AFTER I "MERGED" THEM:
[
{
"id": 1,
"fullName": "Alice",
"phoneNumbers": [
"123-456",
"234-567",
"345-678"
]
},
{
"id": 2,
"fullName": "Bob",
"phoneNumbers": [
"456-789",
"567-890"
]
},
{
"id": 3,
"fullName": "Carl",
"phoneNumbers": [
"678-901"
]
},
{
"id": 4,
"fullName": "Dan",
"phoneNumbers": [
"789-012"
]
}
]
PSEUDO CODE FOR 2 WAYS:
1.
query: "SELECT id, fullName FROM persons"
personList = new List<Person>()
foreach row x in query result:
    current = new Person(x.fullName)
    "SELECT phoneNumber FROM phoneNumbers WHERE personId = x.id"
    foreach row y in query result:
        current.phoneNumbers.Push(y.phoneNumber)
    personList.Push(current)
print personList
2.
query: "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId"
personList = new List<Person>()
current = null
previouseId = null
foreach row x in query result:
    if ( x.id != previouseId )
        if ( current != null )
            personList.Push(current)
        current = new Person(x.fullName)
        previouseId = x.id
    current.phoneNumbers.Push(x.phoneNumber)
if ( current != null )
    personList.Push(current)
print personList
CODE IMPLEMENTATION IN PHP/MYSQL:
1.
/* get all persons */
$result = mysql_query("SELECT id, fullName FROM persons");
$personsArray = array(); //Create an array
//loop all persons
while ($row = mysql_fetch_assoc($result))
{
//add new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
/* add all person phone-numbers */
$id = $current['id'];
$sub_result = mysql_query("SELECT phoneNumber FROM phoneNumbers WHERE personId = {$id}");
$phoneNumbers = array();
while ($sub_row = mysql_fetch_assoc($sub_result))
{
$phoneNumbers[] = $sub_row['phoneNumber'];
}
//add phoneNumbers array to person
$current['phoneNumbers'] = $phoneNumbers;
//add person to final result array
$personsArray[] = $current;
}
echo json_encode($personsArray);
2.
/* get all persons and their phone-numbers in a single query */
$sql = "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId";
$result = mysql_query($sql);
$personsArray = array();
/* init temp vars to save current person's data */
$current = null;
$previouseId = null;
$phoneNumbers = array();
while ($row = mysql_fetch_assoc($result))
{
/*
if the current id is different from the previous id:
you've got to a new person.
save the previous person (if such exists),
and create a new one
*/
if ($row['id'] != $previouseId )
{
// in the first iteration,
// current (previous person) is null,
// don't add it
if ( !is_null($current) )
{
$current['phoneNumbers'] = $phoneNumbers;
$personsArray[] = $current;
$current = null;
$previouseId = null;
$phoneNumbers = array();
}
// create a new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
// set current as previous id
$previouseId = $current['id'];
}
// you always add the phone-number
// to the current phone-number list
$phoneNumbers[] = $row['phoneNumber'];
}
// don't forget to add the last person (saved in "current")
if (!is_null($current))
{
$current['phoneNumbers'] = $phoneNumbers;
$personsArray[] = $current;
}
echo json_encode($personsArray);
P.S.
this link is an example of a different question here, where I tried to suggest the second way: tables to single json
Preliminary
First, thank you for putting that much effort into explaining the problem, and for the formatting. It is great to see someone who is clear about what they are doing, and what they are asking.
But it must be noted that that, in itself, forms a limitation: you are fixed on the notion that this is the correct solution, and that with some small correction or guidance, this will work. That is incorrect. So I must ask you to give that notion up, to take a big step back, and to view (a) the whole problem and (b) my answer without that notion.
The context of this answer is:
all the explicit considerations you have given, which are very important, which I will not repeat
the two most important of which are what is best practice and what I would do in real life
This answer is rooted in Standards, the higher order of, or frame of reference for, best practice. This is what the commercial Client/Server world does, or should be doing.
This issue, this whole problem space, is becoming a common problem. I will give a full consideration here, and thus answer another SO question as well. Therefore it might contain a tiny bit more detail than you require. If it does, please forgive this.
Consideration
The database is a server-based resource, shared by many users. In an online system, the database is constantly changing. It contains that One Version of the Truth (as distinct from One Fact in One Place, which is a separate, Normalisation issue) of each Fact.
the facts that some database systems do not have a server architecture, and that the notion of a server in such software is therefore false and misleading, are separate but noted points.
As I understand it, JSON and JSON-like structures are required for "performance reasons", precisely because the "server" doesn't, cannot, perform as a server. The concept is to cache the data on each (every) client, such that you are not fetching it from the "server" all the time.
This opens up a can of worms. If you do not design and implement this properly, the worms will overrun the app.
Such an implementation is a gross violation of the Client/Server Architecture, which allows simple code on both sides, and appropriate deployment of software and data components, such that implementation times are small, and efficiency is high.
Further, such an implementation requires a substantial implementation effort, and it is complex, consisting of many parts. Each of those parts must be appropriately designed.
The web, and the many books written in this subject area, provide a confusing mix of methods, marketed on the basis of supposed simplicity; ease; anyone-can-do-anything; freeware-can-do-anything; etc. There is no scientific basis for any of those proposals.
Non-architecture & Sub-standard
As evidenced, you have learned that some approaches to database design are incorrect. You have encountered one problem, one instance in which that advice is false. As soon as you solve this one problem, the next problem, which is not apparent to you right now, will be exposed. The notions are a never-ending set of problems.
I will not enumerate all the false notions that are sometimes advocated. I trust that as you progress through my answer, you will notice that one marketed notion after another is false.
The two bottom lines are:
The notions violate Architecture and Design Standards, namely Client/Server Architecture; Open Architecture; Engineering Principles; and, to a lesser extent in this particular problem, Database Design Principles.
Which leads to people like you, who are trying to do an honest job, being tricked into implementing simple notions, which turn into massive implementations. Implementations that will never quite work, so they require substantial ongoing maintenance, and will eventually be replaced, wholesale.
Architecture
The central principle being violated is, never duplicate anything. The moment you have a location where data is duplicated (due to caching or replication or two separate monolithic apps, etc), you create a duplicate that will go out of synch in an online situation. So the principle is to avoid doing that.
Sure, for serious third-party software, such as a gruntly report tool, by design, they may well cache server-based data in the client. But note that they have put hundreds of man-years into implementing it correctly, with due consideration to the above. Yours is not such a piece of software.
Rather than providing a lecture on the principles that must be understood, or the evils and costs of each error, the rest of this answer provides the requested what would you do in real life, using the correct architectural method (a step above best practice).
Architecture 1
Do not confuse
the data which must be Normalised
with
the result set, which, by definition, is the flattened ("de-normalised" is not quite correct) view of the data.
The data, given that it is Normalised, will not contain duplicate values or repeating groups. The result set will contain duplicate values and repeating groups. That is pedestrian.
Note that the notion of Nested Sets (or Nested Relations), which is in my view not good advice, is based on precisely this confusion.
For forty-five years since the advent of the RM, they have been unable to differentiate base relations (for which Normalisation does apply) from derived relations (for which Normalisation does not apply).
Two of these proponents are currently questioning the definition of First Normal Form. 1NF is the foundation of the other NFs, if the new definition is accepted, all the NFs will be rendered value-less. The result would be that Normalisation itself (sparsely defined in mathematical terms, but clearly understood as a science by professionals) will be severely damaged, if not destroyed.
Architecture 2
There is a centuries-old scientific or engineering principle, that content (data) must be separated from control (program elements). This is because the analysis, design, and implementation of the two are completely different. This principle is no less important in the software sciences, where it has specific articulation.
In order to keep this brief (ha ha), instead of a discourse, I will assume that you understand:
That there is a scientifically demanded boundary between data and program elements. Mixing them up results in complex objects that are error-prone and hard to maintain.
The confusion of this principle has reached epidemic proportions in the OO/ORM world, the consequences reach far and wide.
Only professionals avoid this. For the rest, the great majority, they accept the new definition as "normal", and they spend their lives fixing problems that we simply do not have.
The architectural superiority, the great value, of data being both stored and presented in Tabular Form per Dr E F Codd's Relational Model. That there are specific rules for Normalisation of data.
And importantly, you can determine when the people, who write and market books, advise non-relational or anti-relational methods.
Architecture 3
If you cache data on the client:
Cache the absolute minimum.
That means cache only the data that does not change in the online environment. That means Reference and Lookup tables only, the tables that populate the higher level classifiers, the drop-downs, etc.
Currency
For every table that you do cache, you must have a method of (a) determining that the cached data has become stale, compared to the One Version of the Truth which exists on the server, and (b) refreshing it from the server, (c) on a table-by-table basis.
Typically, this involves a background process that executes every (say) five minutes, that queries the MAX updated DateTime for each cached table on the client vs the DateTime on the server, and if changed, refreshes the table, and all its child tables, those that depend on the changed table.
That, of course, requires that you have an UpdatedDateTime column on every table. That is not a burden, because you need that for OLTP ACID Transactions anyway (if you have a real database, instead of a bunch of sub-standard files).
Which really means, never replicate, the coding burden is prohibitive.
Architecture 4
In the sub-commercial, non-server world, I understand that some people advise the reverse caching of "everything".
That is the only way that programs like PostgreSQL can be used in a multi-user system.
You always get what you pay for: you pay peanuts, you get monkeys; you pay zero, you get zero.
The corollary to Architecture 3 is, if you do cache data on the client, do not cache tables that change frequently. These are the transaction and history tables. The notion of caching such tables, or all tables, on the client is completely bankrupt.
In a genuine Client/Server deployment, due to use of applicable standards, for each data window, the app should query only the rows that are required, for that particular need, at that particular time, based on context or filter values, etc. The app should never load the entire table.
If the same user using the same window inspected its contents, 15 minutes after the first inspection, the data would be 15 mins out of date.
For freeware/shareware/vapourware platforms, which define themselves by the absence of a server architecture, and thus by the result, that performance is non-existent, sure, you have to cache more than the minimum tables on the client.
If you do that, you must take all the above into account, and implement it correctly, otherwise your app will be broken, and the ramifications will drive the users to seek your termination. If there is more than one user, they will have the same cause, and soon form an army.
Architecture 5
Now we get to how you cache those carefully chosen tables on the client.
Note that databases grow, they are extended.
If the system is broken, a failure, it will grow in small increments, and require a lot of effort.
If the system is even a small success, it will grow exponentially.
If the system (each of the database, and the app, separately) is designed and implemented well, the changes will be easy, the bugs will be few.
Therefore, all the components in the app must be designed properly, to comply with applicable standards, and the database must be fully Normalised. This in turn minimises the effect of changes in the database, on the app, and vice versa.
The app will consist of simple, not complex, objects, which are easy to maintain and change.
For the data that you do cache on the client, you will use arrays of some form: multiple instances of a class in an OO platform; DataWindows (TM, google for it) or similar in a 4GL; simple arrays in PHP.
(Aside. Note that what people in situations such as yours produce in one year, professional providers produce in a fraction of that time, using a commercial SQL platform, a commercial 4GL, and complying with Architecture and Standards.)
Architecture 6
So let's assume that you understand all the above, and appreciate its value, particularly Architecture 1 & 2.
If you don't, please stop here and ask questions, do not proceed to the below.
Now that we have established the full context, we can address the crux of your problem.
In those arrays in the app, why on Earth would you store flattened views of data?
and consequently mess with, and agonise over, the problems,
instead of storing copies of the Normalised tables?
Answer
Never duplicate anything that can be derived. That is an Architectural Principle, not limited to Normalisation in a database.
Never merge anything.
If you do, you will be creating:
data duplication, and masses of it, on the client. The client will not only be fat and slow, it will be anchored to the floor with the ballast of duplicated data.
additional code, which is completely unnecessary
complexity in that code
code that is fragile, that will constantly have to change.
That is the precise problem you are suffering, a consequence of the method, which you know intuitively is wrong, that there must be a better way. You know it is a generic and common problem.
Note also that method, that code, constitutes a mental anchor for you. Look at the way that you have formatted it and presented it so beautifully: it is of importance to you. I am reluctant to inform you of all this.
Which reluctance is easily overcome, due to your earnest and forthright attitude, and the knowledge that you did not invent this method.
In each code segment, at presentation time, as and when required:
a. In the commercial Client/Server context
Execute a query that joins the simple, Normalised, unduplicated tables, and retrieves only the qualifying rows. Thereby obtaining current data values. The user never sees stale data. Here, Views (flattened views of Normalised data) are often used.
b. In the sub-commercial non-server context
Create a temporary result-set array, and join the simple, unduplicated arrays (copies of tables that are cached), and populate it with only the qualifying rows from the source arrays, the currency of which is maintained by the background process.
Use the Keys to form the joins between the arrays, in exactly the same way that Keys are used to form the joins in the Relational tables in the database.
Destroy those components when the user closes the window.
A clever version would eliminate the result-set array, and join the source arrays via the Keys, and limit the result to the qualifying rows.
Separate from being architecturally incorrect, Nested Arrays or Nested Sets or JSON or JSON-like structures are simply not required. This is a consequence of the confusion addressed under Architecture 1.
If you do choose to use such structures, then use them only for the temporary result-set arrays.
Last, I trust this discourse demonstrates that n tables is a non-issue. More important, that m levels deep in the data hierarchy, the "nesting", is a non-issue.
Answer 2
Now that I have given the full context (and not before), which removes the implications in your question, and makes it a generic, kernel one:
The question is about ANY server-side/relational-db. [Which is better]:
2 loops, 5 simple "SELECT" queries
1 loop, 1 "JOIN" query
The detailed examples you have given are not accurately described above. The accurate descriptions are:
Your Option 1
2 loops, each loop for loading each array
1 single-table SELECT query per loop
(executed n x m times ... the outermost loop, only, is a single execution)
Your Option 2
1 Joined SELECT query executed once
followed by 2 loops, each loop for loading each array
For the commercial SQL platforms, neither, because it does not apply.
The commercial SQL server is a set-processing engine. Use one query with whatever joins are required, that returns a result set. Never step through the rows using a loop, that reduces the set-processing engine to a pre-1970's ISAM system. Use a View, in the server, since it affords the highest performance and the code is in one place.
However, for the non-commercial, non-server platforms, where:
your "server" is not a set-processing engine ie. it returns single rows, therefore you have to fetch each row and fill the array, manually or
your "server" does not provide Client/Server binding, ie. it does not provide facilities on the client to bind the incoming result set to a receiving array, and therefore you have to step through the returned result set, row by row, and fill the array, manually,
as per your example then, the answer is, by a large margin, your option 2.
Please consider carefully, and comment or ask questions.
Response to Comment
Say I need to print this json (or other html page) to some STDOUT (example: an http response to GET /allUsersPhoneNumbers; it's just an example to clarify what I'm expecting to get), which should return this json. I have a php function that got these 2 result sets (1). Now it should print this json - how should I do that? This report could be an employee's monthly salary for a whole year, and so on. One way or another, I need to gather this information and represent it in a "JOIN"ed representation
Perhaps I was not clear enough.
Basically, do not use JSON unless you absolutely have to. Which means sending it to some system that requires it, which means that the receiving system, and that demand, are stupid.
Make sure that your system doesn't make such demands on others.
Keep your data Normalised. Both in the database, and in whatever program elements that you write. That means (in this example) use one SELECT per table or array. That is for loading purposes, so that you can refer to and inspect them at any point in the program.
When you need a join, understand that it is:
a result-set; a derived relation; a view
therefore temporary, it exists for the duration of the execution of that element, only
a. For tables, join them in the usual manner, via Keys. One query, joining two (or more) tables.
b. For arrays, join arrays in the program, the same way you join tables in the database, via Keys.
For the example you have given, which is a response to some request, first understand that it is the category [4], and then fulfil it.
Why even consider JSON?
What has JSON got to do with this?
JSON is misunderstood and people are interested in the wow factor. It is a solution looking for a problem. Unless you have that problem it has no value.
Check these two links:
Copter - What is JSON
Stack Overflow - What is JSON
Now if you understand that, it is mostly for incoming feeds. Never for outgoing. Further, it requires parsing, deconstructing, etc, before it can be used.
Recall:
I need to gather this information and represent it in a "JOIN"ed representation
Yes. That is pedestrian. Joined does not mean JSONed.
In your example, the receiver is expecting a flattened view (eg. spreadsheet), with all the cells filled, and yes, for Users with more than one PhoneNumber, their User details will be repeated on the second and subsequent result-set rows. For any kind of print, eg. for debugging, I want a flattened view. It is just a:
SELECT ... FROM Person JOIN PhoneNumber
And return that. Or if you fulfil the request from arrays, join the Person and PhoneNumber Arrays, which may require a temporary result-set array, and return that.
please don't tell me you should get only 1 user at a time, etc. etc.
Correct. If someone tells you to regress to procedural processing (ie. row by row, in a WHILE loop), where the engine or your program has set processing (ie. processes an entire set in one command), that marks them as someone who should not be listened to.
I have already stated, your Option 2 is correct, Option 1 is incorrect. That is as far as the GET or SELECT is concerned.
On the other hand, for programming languages that do not have set-processing capability (ie. cannot print/set/inspect an array in a single command), or "servers" that do not provide client-side array binding, you do have to write loops, one loop per depth of the data hierarchy (in your example, two loops, one for Person, and one for PhoneNumber per User).
You have to do that to parse an incoming JSON object.
You have to do that to load each array from the result set that is returned in your Option 2.
You have to do that to print each array from the result set that is returned in your Option 2.
Response to Comment 2
I meant I have to return a result represented in a nested version (let's say I'm printing the report to the page); json was just an example of such a representation.
I don't think you understand the reasoning and the conclusions I have provided in this answer.
For printing and displaying, never nest. Print a flattened view, the rows returned from the SELECT per Option 2. That is what we have been doing, when printing or displaying data Relationally, for 31 years. It is easier to read, debug, search, find, fold, staple, mutilate. You cannot do anything with a nested array, except look at it, and say gee that is interesting.
Code
Caveat
I would prefer to take your code and modify it, but actually, looking at your code, it is not well written or structured; it cannot be reasonably modified. Second, if I use that, it would be a bad teaching tool. So I will have to give you fresh, clean code, otherwise you will not learn the correct methods.
These code examples follow my advice, so I am not going to repeat it. And this is way beyond the original question.
Query & Print
Your request, using your Option 2. One SELECT executed once. Followed by one loop. Which you can "pretty up" if you like.
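A minimal sketch of that approach (not the answerer's original code), in the same mysql_* style as the examples above: one joined SELECT executed once, then one loop printing the flattened rows.
// One joined SELECT, one loop; the flattened view repeats the person's details
// on every phone-number row, which is the intended, printable form.
$sql = "SELECT persons.id, fullName, phoneNumber
        FROM persons
        LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId
        ORDER BY persons.id";
$result = mysql_query($sql);
while ($row = mysql_fetch_assoc($result))
{
    printf("%d\t%s\t%s\n", $row['id'], $row['fullName'], $row['phoneNumber']);
}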
In general it is a best practice to grab the data you need in as few trips to the database as possible then map the data into the appropriate objects. (Option 2)
But, to answer your question I would ask yourself what the use case for your data is. If you know for sure that you will be needing your person and your phone number data then I would say the second method is your best option.
However, option one can also have its use case when the joined data is optional. One example of this could be that on the UI you have a table of all your people and if a user wants to see the phone number for a particular person they have to click on that person. Then it would be acceptable to "lazy-load" all of the phone numbers.
This is a common problem, especially if you are creating Web APIs; converting those table sets to nested arrays is a big deal.
I always go for the second option (in a slightly different way though), because the first is the worst possible way to do it... One thing I learned from experience is to never query inside a loop; that is a waste of DB calls, well, you know what I'm trying to say.
Although I don't accept all the things PerformanceDBA said, there are two major things I need to address:
1. Don't have duplicate data
2. Fetch only data you want
The only problem I see in joining the tables is that we end up duplicating a lot of data. Take your data for example: joining the Person and phoneNumber tables, we end up duplicating every person for each of his phone numbers. For two tables with a few hundred rows it's fine, but imagine we need to merge 5 tables with thousands of rows; it's huge...
So here's my solution:
Query:
SELECT id, fullName From Person;
SELECT personId, phoneNumber FROM phoneNumbers
WHERE personId IN (SELECT id From Person);
So I get two tables in my result set; now I assign Table[0] to my Person list,
and use 2 loops to place the right phoneNumbers into the right person...
Code:
personList = ConvertToEntity<List<Person>>(dataset.Table[0]);
pnoList = ConvertToEntity<List<PhoneNumber>>(dataset.Table[1]);
foreach (person in personList) {
foreach (pno in pnoList) {
if (pno.PersonId == person.Id)
person.PhoneNumber.Add(pno);
}
}
I think the above method reduces a lot of duplication and only gets me what I wanted; if there is any downside to the above method please let me know... and thanks for asking these kinds of questions...
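In PHP, the same two-result-set idea can avoid the nested loops by indexing the phone numbers by personId first. A sketch only (mysql_* is used purely to stay consistent with the earlier examples):
// Two simple queries, then a lookup keyed by personId, so the merge is a single
// pass rather than a person x phone-number nested scan.
$persons = mysql_query("SELECT id, fullName FROM persons");
$phones  = mysql_query("SELECT personId, phoneNumber FROM phoneNumbers");

$phonesByPerson = array();
while ($row = mysql_fetch_assoc($phones))
{
    $phonesByPerson[$row['personId']][] = $row['phoneNumber'];
}

$personsArray = array();
while ($row = mysql_fetch_assoc($persons))
{
    $id = $row['id'];
    $personsArray[] = array(
        'id'           => $id,
        'fullName'     => $row['fullName'],
        'phoneNumbers' => isset($phonesByPerson[$id]) ? $phonesByPerson[$id] : array(),
    );
}
echo json_encode($personsArray);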

PHP/MySQL: Handling Questionnaire Input

I have a questionnaire for users to be matched by similar interests: 40 categories, each with 3 to 10 subcategories. Each of the subcategories has a 0 - 5 value related to how interested they are in that subcategory (0 being not even remotely interested, 5 being a die-hard fan). Let's take an example for a category, sports:
<input type="radio" name="int_sports_football" value="0">0</input>
<input type="radio" name="int_sports_football" value="1">1</input>
<input type="radio" name="int_sports_football" value="2">2</input>
<input type="radio" name="int_sports_football" value="3">3</input>
<input type="radio" name="int_sports_football" value="4">4</input>
<input type="radio" name="int_sports_football" value="5">5</input>
With so many of these, I have a table with the interest categories, but due to the size, have been using CSV format for the subcategory values (Bad practice for numerous reasons, I know).
Right now, I don't have the resources to create an entire database devoted to interests, and having 40 tables of data in the profiles database is messy. I've been pulling the CSV out (Which looks like 0,2,4,1,5,1), exploding them, and using the numbers as I desire, which seems really inefficient.
If it were simply yes/no I could see doing bit masking (which I do in another spot – maybe there's a way to make this work with 6-ary values? ). Is there another way to store this sort of categorized data efficiently?
You do not do this by adding an extra field per question to the user table, but rather you create a table of answers where each answer record stores a unique identifier for the user record. You can then query the two tables together using joins in order to isolate only those answers for a specific user. In addition, you want to create a questions table so you can link the answer to a specific question.
table 1) user: (uniqueID, identifying info)
table 2) answers: (uniqueID, userID, questionID, text) links to unique userID and unique questionID
table 3) question: (uniqueID, subcategoryID, text) links to uniqueID of a subcategory (e.g. football)
table 4) subcategories: (uniqueID, maincategoryID, text) links to uniqueID of a mainCategory (e.g. sports)
table 5) maincategories: (uniqueID,text)
An individual user has one user record, but MANY answer records. As the user answers a question, a new record is created in the answers table, storing the uniqueID of the user, the uniqueID of the question, and the value of their answer.
An answer record is linked to a single user record (by referencing the user's uniqueID field) and a single question record (via uniqueID of question).
A question record is linked to a single subcategory record.
A subcategory record is linked to a single category record.
Note this scheme only handles two levels of categories: sports->football. If you have 3 levels, then add another level in the same manner. If your levels are arbitrary, there may be some other scheme more suited.
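As a rough illustration of the join described above (table and column names follow the hypothetical schema in the list), fetching one user's answers might look like:
// Pull one user's answers together with the question, subcategory and category text.
$userId = 42; // hypothetical user
$sql = "SELECT mc.text AS category, sc.text AS subcategory, q.text AS question, a.text AS answer
        FROM answers a
        JOIN question q        ON q.uniqueID  = a.questionID
        JOIN subcategories sc  ON sc.uniqueID = q.subcategoryID
        JOIN maincategories mc ON mc.uniqueID = sc.maincategoryID
        WHERE a.userID = " . (int) $userId;
$result = mysql_query($sql);
while ($row = mysql_fetch_assoc($result))
{
    echo "{$row['category']} / {$row['subcategory']} / {$row['question']}: {$row['answer']}\n";
}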
okay, so, given that you have 40 categories and let's assume 10 subcategories, that leaves us with 400 question-answer pairs per user.
now, in order to design the best intermediary data storage, I would suggest starting out with a few questions:
1) what type of analysis will I need
2) what resources do I have
3) is this a one-time solution or should it be reused in the future
Well, if I were you, I would stick to very simple database structure e.g.:
question_id | user_id | answer
If I foresaw more of this kind of poll going on with the same questions and probably the same respondents, I would further extend the structure with a "campaign_id". This would work as raw data storage which would allow quick and easy statistics of any kind.
Now, you said a database is not an option. Well, you can mimic this very same structure using arrays and create your own statistical interface that works on top of that array storage, BUT you would save their time and yours if you could get SQL. As others suggest, there is always SQLite (a file-based database engine), which is easy to use and set up.
Now, if all that does not make you happy, there is another interesting approach. If the data set is fixed, meaning there are pretty much no conditional questions, then, given that you can create a question index, you could further create a 400-byte answer chunk, where each byte represents the answer as one of the given values. Then you create your statistical methods that, based on the question id, can easily operate on the $answer[$user][$nth] byte (or $answer[$nth][$user], again, based on the type of statistics you need).
This should help you get your mind set on the goal you want to achieve.
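A tiny sketch of that byte-chunk idea (the layout is purely hypothetical: one character per question, '0' to '5', at a position given by a fixed question index):
// 400 answers packed into a 400-byte string, one character per question.
$questionIndex = array('int_sports_football' => 0, 'int_music_guitar' => 1 /* ... */);

$answers = str_repeat('0', 400);                        // "no answer yet" everywhere

$answers[$questionIndex['int_sports_football']] = '4';  // record an answer

$nth   = $questionIndex['int_sports_football'];
$value = (int) $answers[$nth];                          // read the nth answer back: 4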
I know you said you don't have the resources to create a database, but I disagree. Using SQL seems like your best bet and PHP includes SQLite (http://us2.php.net/manual/en/book.sqlite.php) which means you wouldn't need to set up a MySQL database if that were a problem.
There are also tools for both MySQL and SQLite which would allow you to create tables and import your data from the CSV files without any effort.
maybe I am confused but it seems like you need a well designed relational database.
for example:
tblCategories (pkCategoryID, fldCategoryName)
tblSubCategory (pkSubCategoryID, fldSubCategoryName)
tblCategorySubCategory(fkCategoryID,fkSubCategoryID)
then use inner joins to populate the pages. hopefully this helps you :)
I consider a NoSQL architecture as a solution for scaling this kind of MySQL field in agile solutions.
To get it done ASAP, I'd create a class for the "interest" category that constructs sub-category instances extending from a category parent class, carrying the answer properties, which would be stored as a JSON object in that field. Example:
{
    "music": { // category
        "instruments": { // sub category
            "guitar": 5, // interest answers
            "piano": 2,
            "violin": 0,
            "drums": 4
        },
        "fav artist": {
            "lady gaga": 1,
            "kate perry": 2,
            "Joe satriani": 5
        }
    },
    "sports": {
        "fav sport": {
            "soccer": 5,
            "hockey": 2
        },
        "fav player": {
            "messi": 5,
            "Jordan": 5
        }
    }
}
NOTE that you need to use "abstraction" for the "category" class to keep the object architecture right
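For what it's worth, reading and updating such a JSON field from PHP is straightforward; a sketch only (the column, table and variable names are illustrative, not prescribed by the answer):
// Decode the stored JSON interests, adjust one answer, store it back.
$interests = json_decode($row['interests'], true);       // hypothetical column

$guitar = $interests['music']['instruments']['guitar'];  // 5

$interests['sports']['fav sport']['hockey'] = 3;         // change an answer

$json = mysql_real_escape_string(json_encode($interests));
mysql_query("UPDATE profiles SET interests = '{$json}' WHERE user_id = " . (int) $userId);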

Performance issues with mongo + PHP with pagination, distinct values

I have a mongodb collection that contains lots of books with many fields. Some key fields which are relevant for my question are:
{
book_id : 1,
book_title :"Hackers & Painters",
category_id : "12",
related_topics : [ {topic_id : "8", topic_name : "Computers"},
{topic_id : "11", topic_name : "IT"}
]
...
... (at least 20 fields more)
...
}
We have a form for filtering results (with many inputs/selectbox) on our search page. And of course there is also pagination. With the filtered results, we show all categories on the page. For each category, number of results found in that category is also shown on the page.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding "find" function with all filter parameters. That's cool. I can paginate results with skip and limit functions :
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" all results, collect categories and manage pagination with PHP, because filtered data often consists of nearly 200,000 results.
Problem 1 : I found mongodb::command() function in PHP manual with a "distinct" example. I think that I get distinct values by this method. But command function doesn't accept conditional parameters (for filtering). I don't know how to apply same filter params while asking for distinct values.
Problem 2 : Even if there is a way for sending filter parameters with mongodb::command function, this function will be another query in the process and take approximately same time (maybe more) with the previous query I think. And this will be another speed penalty.
Problem 3 : In order to get distinct topic_ids with number of results will be another query, another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at the problems from the wrong point of view. Can you help me solve the problems and give your opinions about the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to "count()" is to load each document and check. If you do skip() and limit() with no sort() then the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
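On Problem 1 specifically: the distinct database command does accept a query document, so the same filter used for find() can be applied when collecting distinct values. A sketch with the legacy PHP driver (collection and field names are illustrative):
// Run the "distinct" command with the same filter as the paginated find().
// $db is a MongoDB database object from the legacy driver.
$result = $db->command(array(
    'distinct' => 'books',
    'key'      => 'category_id',
    'query'    => $filter_params,   // same conditions as the search itself
));
$categoryIds = $result['values'];
Note that this returns the distinct values only, not the per-category result counts; counting still needs map/reduce or a separate query per category, as described above.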
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit() -- no matter if you use an index or not -- will have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (B-Tree) that can compare values to each other: it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for count-based indexation: a B-Tree has no way to tell you quickly what the 15,000th element is: it will have to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the
server to walk from the beginning of the collection, or index, to get
to the offset/skip position before it can start returning the page of
data (limit). As the page number increases skip will become slower and
more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow
you to easily jump to a specific page.
Make sure you really need this feature: Typically, nobody cares for the 42436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
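For completeness, range-based paging with the legacy PHP driver might look roughly like this (remembering the last _id of the previous page instead of skipping):
// Page forward by remembering the last _id seen, instead of using skip().
// Requires a stable sort on _id; $lastId comes from the previous page.
$query = $filter_params;
if (isset($lastId))
{
    $query['_id'] = array('$gt' => new MongoId($lastId));
}
$page = $lib_collection->find($query)
                       ->sort(array('_id' => 1))
                       ->limit(20);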
Distinct Topic Counts
I believe you might be using a sledgehammer as a floatation device. Take a look at your data: related_topics. I personally hate RDBMS because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: Let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writing changes to the DB needs to occur on both databases.

How to handle Tree structures returned from SQL query using PHP?

This is a "theoretical" question.
I'm having trouble defining the question so please bear with me.
When you have several related tables in a database, for example a table that holds "users" and a table that holds "phones"
both "phones" and "users" have a column called "user_id"
select user_id,name,phone from users left outer join phones on phones.user_id = users.user_id;
the query will provide me with rows of all the users whether they have a phone or not.
If a user has several phones, his name will be returned in 2 rows as expected.
columns=>|user_id|name|phone|
row0 = > | 0 |fred|NULL|
row1 = > | 1 |paul|tlf1|
row2 = > | 1 |paul|tlf2|
the name "paul" in the case above is a necessary duplicate which in the RDMS's eye's is not a duplicate at all!
It will then be handled by some server side scripting language, for example php.
How are these "necessary duplicates" actually handled in real websites or applications?
as in, how are the rows "mapped" into some usable object model?
p.s. if you decide to post examples, post them for php,mysql,sqlite if possible.
edit:
Thank you for providing answers; each answer has interpreted the question differently and as such is different and correct in its own way.
I have come to the conclusion that if round trips are expensive this will be the best way along with Jakob Nilsson-Ehle's solution, which was fitting for the theoretical question.
If round trips are cheap, I will do separate selects for phones and users as 9000 suggests; if I need to show a single phone for every user, I will give a primary column to the phones and join it with the user select, as Ollie Jones correctly suggests.
even though for real life applications I'm using 9000's answer, I think that for this unrealistic question Jakob Nilsson-Ehle's solution is most appropriate.
The thing I would probably do in this case in PHP would be to use the userId in a PHP array and then use that to continuously update the users.
A very simple example would be
$result = mysql_query('select user_id,name,phone from users left outer join phones on phones.user_id = users.user_id;');
$users = Array();
while($row = mysql_fetch_assoc($result)) {
$uid =$row['user_id'];
if(!array_key_exists($uid, $users)) {
$users[$uid] = Array('name' => $row['name'], 'phones' => Array());
}
$users[$uid]['phones'][] = $row['phone'];
}
Of course, depending on your programming style and the complexity of the user data, you might define a User class or something and populate the data, but that is fundamentally how I would do it.
Your data model inherently allows a user to have 0, 1, or more phones.
You could get your database to return either 0 or 1 phone items for each user by employing a nasty hack, like choosing the numerically smallest phone number. (MIN(phone) ... GROUP BY user). But numerically comparing phone numbers makes very little sense.
Your problem of ambiguity (which of several phone numbers) points to a problem in your data model design. Take a look, if you will, at some common telephone-directory apps. A speed-dial app on a mobile phone is a good example. Mostly they offer ways to put in multiple phone numbers, but they always have the concept of a primary phone number.
If you add a column to your phone table indicating number priority, and make it part of your primary (unique) key, and declare that priority=1 means the user's primary number, your app will not have this ambiguous issue any more.
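With such a priority column in place, showing exactly one number per user becomes a plain join; a sketch (the priority column itself is the assumption described above):
// One row per user, showing only the primary number (priority = 1).
$sql = "SELECT users.user_id, users.name, phones.phone
        FROM users
        LEFT OUTER JOIN phones
          ON phones.user_id = users.user_id AND phones.priority = 1";
$result = mysql_query($sql);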
You can't easily get a tree structure from an RDBMS, only a table structure. And you want a tree: [(user1, (phone1, phone2)), (user2, (phone2, phone3))...]. You can optimize towards different goals, though.
Round-trips are more expensive than sending extra info: go with your current solution. It fetches username multiple times, but you only have one round-trip per entire phone book. May make sense if your overburdened MySQL host is 1000 miles away.
Sending extra info is more expensive than round-trips, or you want more clarity: as #martinho-fernandes suggests, only fetch user IDs with phones, then fetch user details in another query. I'd stick with this approach unless your entire user details is a short username. With SQLite I'd stick with it at all times just for the sake of clarity.
Sounds like you're confusing the object data model with the relational data model - understanding how they differ in general, and in the specifics of your application, is essential to writing OO code on top of a relational database.
Trivial ORM is not the solution.
There are ORM mapping technologies such as hibernate - however these do not scale well. IME, the best solution is using a factory pattern to manage the mapping properly.

Implementing Recursive Comments in PHP/MySQL

I'm trying to write a commenting system, where people can comment on other comments, and these are displayed as recursive threads on the page (Reddit's commenting system is an example of what I'm trying to achieve). However, I am confused about how to implement such a system in a way that is not very slow and computationally expensive.
I imagine that each comment would be stored in a comments table and contain a parent_id, which would be a foreign key to another comment. My problem comes with how to get all of this data without a ton of queries, and then how to efficiently organize the comments into the order they belong in. Does anyone have any ideas on how best to implement this?
Try using a nested set model. It is described in Managing Hierarchical Data in MySQL.
The big benefit is that you don't have to use recursion to retrieve child nodes, and the queries are pretty straightforward. The downside is that inserting and deleting takes a little more work.
It also scales really well. I know of one extremely huge system which stores discussion hierarchies using this method.
Here's another site providing information on that method + some source code.
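For reference, retrieving a whole comment subtree from a nested set needs no recursion; a sketch assuming the usual lft/rgt columns described in that article (the column names are assumptions):
// Fetch every descendant of one comment in a single query,
// assuming nested-set columns lft and rgt on the comments table.
$sql = "SELECT node.*
        FROM comments AS node, comments AS parent
        WHERE node.lft BETWEEN parent.lft AND parent.rgt
          AND parent.comment_id = " . (int) $commentId . "
        ORDER BY node.lft";
$result = mysql_query($sql);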
It's just a suggestion, but since I'm facing the same problem right now,
how about adding a sequence field (int) and a depth field to the comments table,
and updating them as new comments are inserted?
The sequence field would serve the purpose of ordering the comments.
And the depth field would indicate the recursion level of the comment.
Then the hard part would be doing the right updates as users insert new comments.
I don't know yet how hard this is to implement,
but I'm pretty sure once implemented, we will have a performance gain over nested model based
solutions.
I created a small tutorial explaining the basic concepts behind the recursive approach. As people have said above, the recursive function doesn't scale as well, however, inserts are far more efficient.
Here are the links:
http://www.evanpetersen.com/index.php/item/php-and-mysql-recursion.html
and
http://www.evanpetersen.com/index.php/item/php-mysql-revisited.html
I normally work with a parent-child system.
For example, consider the following:
Table comment(
commentID,
pageID,
userID,
comment
[, parentID]
)
parentID is a foreign key to commentID (from the same table) which is optional (can be NULL).
For selecting comments use this for a 'root' comment:
SELECT * FROM comments WHERE pageID=:pageid AND parentID IS NULL
And this for a child:
SELECT * FROM comments WHERE pageID=:pageid AND parentID=:parentid
I had to implement recursive comments too.
I struggled with the nested set model; let me explain why:
Let's say you want comments for an article.
Let's call root comments the comments directly attached to this article.
Let's call reply comments the comments that are an answer to another comment.
I noticed (unfortunately) that I wanted the root comments to be ordered by date desc,
BUT I wanted the reply comments to be ordered by date asc!
Paradoxical!
So the nested set model didn't help me reduce the number of queries.
Here is my solution :
Create a comment table with following fields :
id
article_id
parent_id (nullable)
date_creation
email
whateverYouLike
sequence
depth
The 3 key fields of this implementation are parent_id, sequence and depth.
parent_id and depth helps to insert new nodes.
Sequence is the real key field, it's kind of nested model emulation.
Each time you insert a new root comment, its sequence is a multiple of x.
I chose x=1000, which basically means that I can have at most 1000 nested comments (that's the only drawback I found
in this system, but this limit can easily be modified; it's enough for my needs for now).
The most recent root comment has to be the one with the greatest sequence number.
Now reply comments :
we have two cases :
reply for a root comment, or reply for a reply comment.
In both cases the algorithm is the same:
take the parent's sequence and subtract one to get your sequence number.
Then you have to update the sequence numbers which are below the parent's sequence and above the base sequence,
which is the sequence of the root comment just below the root comment concerned.
I don't expect you to understand all this since I'm not a very good explainer,
but I hope it may give you new ideas.
(At least it worked better for me than the nested set model would: fewer requests, which is the real goal).
I’m taking a simple approach.
Save root id (if it’s comments then post_id)
Save parent_id
Then fetch all comments with post_id and recursively order them on the client.
I don’t care if there’s 1000 comments. This happens in memory.
It’s one database call, and that’s the expensive part.
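A minimal sketch of that in-memory step (column names assumed from the discussion above): group the rows by parent_id once, then walk the groups recursively.
// Turn a flat list of comment rows (id, parent_id, ...) into a nested tree.
// $rows comes from one SELECT of all comments for the post.
function buildTree(array $byParent, $parentId = 0)
{
    $branch = array();
    if (!isset($byParent[$parentId]))
    {
        return $branch;
    }
    foreach ($byParent[$parentId] as $comment)
    {
        $comment['children'] = buildTree($byParent, $comment['id']);
        $branch[] = $comment;
    }
    return $branch;
}

// Group rows by parent_id, keying top-level comments (NULL parent) as 0.
$byParent = array();
foreach ($rows as $row)
{
    $key = ($row['parent_id'] === null) ? 0 : $row['parent_id'];
    $byParent[$key][] = $row;
}
$tree = buildTree($byParent);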
