MongoDB schema design: can't get what I want - php

I think I have a problem with the schema design for my music app.
I have 3 collections: Artists, Tracks and Albums,
and 3 classes: artists, albums and tracks.
A document from artists:
[_id] => MongoId Object
(
[$id] => 4ee5bbfd615c219a07000000
)
[freeze] => false,
[genres] => Array,
[hits] => 0,
[name] => Sarya Al Sawas,
[pictures] => Array,
A document from albums:
[_id] => MongoId Object
(
[$id] => 4ee88308615c218128000000
)
[name] => Sabia
[slug] => wafiq-habib-ft-sarya-al-sawas-sabia
[year] => 1999
[genres] => Array,
[pictures] => Array,
[artists] => Array
(
[0] => MongoId Object
(
[$id] => 4ee34a3b615c21b624010000
)
[1] => MongoId Object
(
[$id] => 4ee5bbfd615c219a07000000
)
)
A document from tracks:
[_id] => MongoId Object
(
[$id] => 4ee8a056615c21542a000000
)
[name] => Bid Ashok
[slug] => wafiq-habib-ft-sarya-al-sawas-bid-ashok
[genres] => Array,
[file] => /m/tracks/t.4ee8a05540c624.04707814.mp3,
[freeze] => false,
[hits] => 0,
[duration] => 303,
[albums] => Array
(
[0] => MongoId Object
(
[$id] => 4ee5cbc3615c216509000000
)
)
[artists] => Array
(
[0] => MongoId Object
(
[$id] => 4ee5bbfd615c219a07000000
)
[1] => MongoId Object
(
[$id] => 4ee34a3b615c21b624010000
)
)
First of all, is this a good schema design?
I designed the schema this way because of the many-to-many relationships:
sometimes a track has 2 artists, and an album has 2 artists.
Anyway, I have a problem querying the albums attached to a specific track.
Let's say I'm on the artist page.
I need to get all of the artist's albums and tracks, so I do this:
$cursors = array(
'albums' => $this->albums->find(array('artists' => $artist->_id))->sort(array('_id' => -1)),
'tracks' => $this->tracks->find(array('artists' => $artist->_id))->sort(array('_id' => -1)),
'clips' => $this->clips->find(array('artists' => $artist->_id))->sort(array('_id' => -1))
);
foreach($cursors as $key => $cursor) {
foreach($cursor as $obj) {
$obj['name'] = ($this->lang->get() != 'ar' ? $obj['translated']['name'] : $obj['name']);
$obj['by'] = $this->artists()->get($obj['artists'])->toString('ft');
${$key}[] = $obj;
}
}
I then need to loop over all the tracks and get their album names. Let's say this artist has 3000 tracks:
I think it will be very slow...
So my question is: is this a good schema design?

Well, this is a very relational problem, and using a non-relational database for such a problem requires some effort. In general, I think your schema design is good.
What you're describing is called "the N+1 problem", because you'll have to make N+1 queries for N objects (in your case, it's more complicated, but I guess you get the idea).
Some remedies:
You can use the $in operator to find e.g. all tracks of a certain artist:
db.tracks.find({"artists" : { $in : [artist_id_1, artist_id_2, ...] } });
This doesn't work if the array of artists grows huge, but a few hundred, maybe a thousand should work fine. Make sure artists is indexed.
You can denormalize some of the information that is needed very often. For example, you might want to show the track list very often, so it makes sense to copy the artists' names to every track. Denormalization depends mostly on what you're trying to achieve from an end-user perspective. You might not want to store each and every artist's name in full, but only the first 50 characters because the UI doesn't show more in the overview anyway.
In fact, you're already denormalizing some data, such as the artist ids in album (which are redundant, because you could get them via the tracks as well). This makes queries easier, but it will be more write-heavy. Updates are ugly because you'll have to make sure they propagate through the system.
In some cases, it might make sense to 'join' on the client(!) rather than the server. This doesn't really fit your problem well, but it's noteworthy: suppose you have a list of friends. Normally the server has to look up each friend's name whenever it displays them. Instead, it could provide the client with an id-to-friend lookup table once, and from then on serve only the ids. Some JavaScript could replace the ids with the real names from the client's cache.
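For the album names on the artist page, the same $in idea can be applied to _id: collect the album ids from all tracks first, fetch the albums in one query, and attach the names in memory. The following is only a sketch in plain PHP. It assumes ids can be cast to strings and that the fetched albums are available as an array; the helper names collectAlbumIds and attachAlbumNames are hypothetical, as is the sample data.

```php
<?php
// Collect the distinct album ids referenced by a list of track documents.
function collectAlbumIds(array $tracks): array
{
    $ids = [];
    foreach ($tracks as $track) {
        foreach ($track['albums'] ?? [] as $albumId) {
            $ids[(string) $albumId] = $albumId; // keyed by string to de-duplicate
        }
    }
    return array_values($ids);
}

// Attach album names to each track, using albums that were fetched in ONE query.
function attachAlbumNames(array $tracks, array $albums): array
{
    $namesById = [];
    foreach ($albums as $album) {
        $namesById[(string) $album['_id']] = $album['name'];
    }
    foreach ($tracks as &$track) {
        $track['album_names'] = [];
        foreach ($track['albums'] ?? [] as $albumId) {
            if (isset($namesById[(string) $albumId])) {
                $track['album_names'][] = $namesById[(string) $albumId];
            }
        }
    }
    unset($track); // break the reference left over from the loop
    return $tracks;
}

// Sample data (made up for illustration).
$tracks = [
    ['name' => 'Track A', 'albums' => ['a1']],
    ['name' => 'Track B', 'albums' => ['a1', 'a2']],
];

// Criteria for the single album query, e.g. to pass to $this->albums->find().
$query = ['_id' => ['$in' => collectAlbumIds($tracks)]];

// Stand-in for the result of that query.
$albums = [
    ['_id' => 'a1', 'name' => 'Sabia'],
    ['_id' => 'a2', 'name' => 'Other'],
];

$tracks = attachAlbumNames($tracks, $albums);
```

This turns one album lookup per track into two queries in total (tracks, then albums), at the cost of holding the id-to-name map in memory.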

How to stop array results from being merged when they share the same array key?

I am having an issue trying to build an array where the status ID is the key and ALL posts related to the statuses are sub-arrays relating to the key (status ID).
Here's the (incorrect) array I am getting with both array_merge_recursive and manually adding items to the array (array 1):
Array
(
[res_1] => Array
(
[status_name] => NEEDS REVIEW
[status_color] => 666666
[post] => Array
(
[post_title] => Add feature that allows our team to add internal notes on ideas
[post_description] => Sometimes our team needs to leave feedback that users should not see publicly. Having this feature would allow the team to collaborate better at scale.
[categories] => Admin,Communication
)
)
[res_2] => Array
(
[status_name] => PLANNED
[status_color] => aa5c9e
[post] => Array
(
[post_title] => Add support for multiple languages
[post_description] => We have customers across the globe who would appreciate this page to be localized to their language
[categories] => Integrations
)
)
[res_3] => Array
(
[status_name] => IN PROGRESS
[status_color] => 3333cc
[post] => Array
(
[post_title] => Allow users to add an image with their feature request
[post_description] => Sometimes users want something visual, having an example really helps.
[categories] => Uncategorized
)
)
[res_4] => Array
(
[status_name] => COMPLETED
[status_color] => 7ac01d
[post] => Array
(
[post_title] => Add feature that allows #mentioning in comments
[post_description] => There is no hierarchy in comments so it's hard to reply to one specific user if there is a longer thread of comments.
[categories] => Communication
)
)
)
Here's something like what I am expecting to happen (every status ID is an array with multiple posts as sub-arrays):
Array
(
[res_1] => Array
(
[status_name] => NEEDS REVIEW
[status_color] => 666666
[post] => Array
(
[post_title] => Add feature that allows our team to add internal notes on ideas
[post_description] => Sometimes our team needs to leave feedback that users should not see publicly. Having this feature would allow the team to collaborate better at scale.
[categories] => Admin,Communication
)
)
[res_2] => Array
(
[status_name] => PLANNED
[status_color] => aa5c9e
[post] => Array
(
[post_title] => Add support for multiple languages
[post_description] => We have customers across the globe who would appreciate this page to be localized to their language
[categories] => Integrations
)
)
[res_3] => Array
(
[status_name] => IN PROGRESS
[status_color] => 3333cc
[post] => Array
(
[post_title] => Allow users to add an image with their feature request
[post_description] => Sometimes users want something visual, having an example really helps.
[categories] => Uncategorized
)
)
[res_4] => Array
(
[status_name] => COMPLETED
[status_color] => 7ac01d
[post] => Array(
[0] => Array (
[post_title] => Add feature that allows #mentioning in comments
[post_description] => There is no hierarchy in comments so its hard to reply to one specific user if there is a longer thread of comments.
[categories] => Communication
)
[1] => Array (
[post_title] => Feature Number 5
[post_description] => lorum ipsum awesomeness.
[categories] => Admin
)
)
)
Here's what I've tried:
Running two separate DB queries: one to fetch statuses and another to fetch posts then merging the arrays recursively and changing the array keys to a string. This does the same thing, post 5 never shows up in the newly merged array.
Same as above: ran two separate queries and rebuilt the array manually, with the same result; the 5th post never appears.
I printed out the database result from $stmt2->fetchAll(); all 5 posts are there in the result-set directly from the database. The 5th one just won't persist when merging arrays or building a fresh one so the posts can relate to the statuses.
I also tried joining the tables with SQL but even grouping by resolution_id does the same thing, post number 5 gets lost by the grouping. I've tried sub-queries too.
DB Results array for just posts:
Array
(
[0] => Array
(
[title] => Feature number 5
[0] => Feature number 5
[description] => lorum ipsum awesomeness
[1] => lorum ipsum awesomeness
[resolution_id] => 4
[2] => 4
[category_names] => Admin
[3] => Admin
)
[1] => Array
(
[title] => Allow users to add an image with their feature request
[0] => Allow users to add an image with their feature request
[description] => Sometimes users want something visual, having an example really helps.
[1] => Sometimes users want something visual, having an example really helps.
[resolution_id] => 3
[2] => 3
[category_names] => Uncategorized
[3] => Uncategorized
)
[2] => Array
(
[title] => Add support for multiple languages
[0] => Add support for multiple languages
[description] => We have customers across the globe who would appreciate this page to be localized to their language
[1] => We have customers across the globe who would appreciate this page to be localized to their language
[resolution_id] => 2
[2] => 2
[category_names] => Integrations
[3] => Integrations
)
[3] => Array
(
[title] => Add feature that allows #mentioning in comments
[0] => Add feature that allows #mentioning in comments
[description] => There is no hierarchy in comments so it's hard to reply to one specific user if there is a longer thread of comments.
[1] => There is no hierarchy in comments so it's hard to reply to one specific user if there is a longer thread of comments.
[resolution_id] => 4
[2] => 4
[category_names] => Communication
[3] => Communication
)
[4] => Array
(
[title] => Add feature that allows our team to add internal notes on ideas
[0] => Add feature that allows our team to add internal notes on ideas
[description] => Sometimes our team needs to leave feedback that users should not see publicly. Having this feature would allow the team to collaborate better at scale.
[1] => Sometimes our team needs to leave feedback that users should not see publicly. Having this feature would allow the team to collaborate better at scale.
[resolution_id] => 1
[2] => 1
[category_names] => Admin,Communication
[3] => Admin,Communication
)
)
Since the data is always going to be dynamic (users can choose status names and create as many as they need to) I can't just hard-code the status names/ids and run 4 queries to populate the columns.
To prevent this from being an essay long post, here are the bits of code that are building the array from array 1:
Builds the initial statuses array from the query results from the resolutions table.
$statuses = [];
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
$statuses['res_' . $row['id']] = ['status_name' => $row['name'], 'status_color' => $row['color']];
}
Adds the individual posts to the statuses array:
foreach ($dbposts as $row2) {
$statuses['res_' . $row2['resolution_id']]['post'] = ['post_title' => $row2['title'], 'post_description' => $row2['description'], 'categories' => $row2['category_names']];
}
The resolution ID is concatenated with res_ from when I tried doing an array merge based on keys. It would not merge when the keys were just integers.
Finally, some context on why I am trying to do this. I am building a platform where companies can have users submit feature requests and view the results in a list view or a board view. The list view was a piece of cake, but the board view needs to be grouped per status, and this is where I am having trouble. I hard-coded the board view values to demonstrate the expected end result.
Not looking for someone to write my code for me, just looking for some guidance - perhaps I am building or merging the arrays wrong?
To build the array of posts, you need to append elements to the array of posts. Currently, you are just assigning a single element to the array over and over, which overwrites the previous value of the entire array.
The code to append posts:
$statuses['res_' . $row2['resolution_id']]['post'][] = ['post_title' => $row2['title'], 'post_description' => $row2['description'], 'categories' => $row2['category_names']];
Note the [] which I added to the end of the left side of the assignment operator.
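To make the difference concrete, here is a small self-contained run of the same two loops with the append fix. The rows are made-up sample data shaped like the question's query results, and initializing 'post' to an empty array is an extra assumption so that statuses without posts still carry the key.

```php
<?php
// Status rows as they might come from the resolutions table (sample data).
$statusRows = [
    ['id' => 3, 'name' => 'IN PROGRESS', 'color' => '3333cc'],
    ['id' => 4, 'name' => 'COMPLETED', 'color' => '7ac01d'],
];

// Post rows (sample data); two of them share resolution_id 4.
$dbposts = [
    ['title' => 'Feature number 5', 'description' => 'lorum ipsum awesomeness', 'resolution_id' => 4, 'category_names' => 'Admin'],
    ['title' => 'Image upload', 'description' => 'Visual examples help.', 'resolution_id' => 3, 'category_names' => 'Uncategorized'],
    ['title' => 'Mentions in comments', 'description' => 'Reply to one user.', 'resolution_id' => 4, 'category_names' => 'Communication'],
];

$statuses = [];
foreach ($statusRows as $row) {
    $statuses['res_' . $row['id']] = [
        'status_name'  => $row['name'],
        'status_color' => $row['color'],
        'post'         => [], // empty list, so every status has the key
    ];
}

foreach ($dbposts as $row2) {
    // [] appends a new element instead of overwriting the whole 'post' value.
    $statuses['res_' . $row2['resolution_id']]['post'][] = [
        'post_title'       => $row2['title'],
        'post_description' => $row2['description'],
        'categories'       => $row2['category_names'],
    ];
}

echo count($statuses['res_4']['post']), "\n"; // both posts for status 4 are kept
```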

Mongo $sum slow

I need a bit of light on how to make Mongo perform better. I have 2 projects using Mongo. One of them has 140 million rows and every query runs nearly instantly: the data is displayed in little chunks, so with a few indexes Mongo is able to filter out 99% of the data and return the selected rows quickly. Mongo works well on this kind of project.
On the other hand, I have another project that works like Google Analytics, tracking visits, clicks, etc. The objective is to count clicks in a time range based on certain criteria (chosen through a form). I'm pitting it against MySQL on the same task.
First try
I used the traditional schema, row by row, something like:
{
'user':'abc',
'date':'2015-07-20',
'hour':02,
[....]
'clicks':30
}
With 200+ million rows (even with clicks packed by hour, as you can see), I have indexes on every field and some compound indexes on the most queried groups. Trying to aggregate and $sum clicks after a certain $match is really slow if the matched chunk of rows is big enough, and it gets worse with that total row count: the indexes eat the 32 GB of RAM in the server.
Second try
Taking advantage of Mongo's schema flexibility, I designed a grouping schema with as little duplicated data as possible: the properties of every type of click are determined by a unique combination of fields (with a unique index), and the clicks are grouped in a tree distributed by dates. Example row:
{
"user" : "asd",
[....]
"date" : {
"total" : 5,
"years" : {
"2015" : {
"total" : 5,
"months" : {
"06" : {
"total" : 5,
"days" : {
"30" : {
"total" : 2,
"hours" : {
"16" : 1,
"22" : 1
}
},
"28" : {
"total" : 1,
"hours" : {
"6" : 1
}
},
"29" : {
"total" : 2,
"hours" : {
"14" : 1,
"20" : 1
}
}
}
}
}
}
}
}
}
Thanks to this strategy, the 200+ million rows were reduced by a factor of 10 and the indexes now fit in memory. Insertion speed went down, because before inserting a new "row" you must check whether one with the same characteristics already exists and, if it does, merge the clicks into its dates tree.
When I need to count rows, the speed has improved compared to the traditional schema, but I need to do obscure aggregate things like this to count data:
['$sum'=>'$date.years.'.$year.'.months.'.$month.'.days.'.$day.'.total']
In general this performs a bit below the average speed of MySQL. Sometimes the difference is tight, but under certain conditions MySQL wins the battle by far, even though MySQL is counting 200 million rows and Mongo 20 million. That's not acceptable, because many times MySQL runs a query in 16 s while Mongo takes 120 s. I want to beat MySQL (MyISAM) so I can use Mongo as a replacement. I have tried lots of things, from sparse indexes on the dates tree to a second-level cache that saves some pre-processed results and mixes them. It's not possible to cache all permutations of the data for a certain day because there are a lot of [...] fields.
Shards could be a solution, but I don't think they will magically improve the speed by a factor of 2.
Give me some hints, please.
Update
Let's search some days for a certain country:
Mongo compressed schema
MongoDB count of rows where country = 'AD': 11389
Aggregate:
Array
(
[0] => Array
(
[$match] => Array
(
[country] => AD
)
)
[1] => Array
(
[$group] => Array
(
[_id] => Array
(
[country] => $country
)
[2015-07-01] => Array
(
[$sum] => $date.years.2015.months.07.days.01.total
)
[2015-07-02] => Array
(
[$sum] => $date.years.2015.months.07.days.02.total
)
[2015-07-03] => Array
(
[$sum] => $date.years.2015.months.07.days.03.total
)
[2015-07-04] => Array
(
[$sum] => $date.years.2015.months.07.days.04.total
)
[2015-07-05] => Array
(
[$sum] => $date.years.2015.months.07.days.05.total
)
[2015-07-06] => Array
(
[$sum] => $date.years.2015.months.07.days.06.total
)
[2015-07-07] => Array
(
[$sum] => $date.years.2015.months.07.days.07.total
)
[2015-07-08] => Array
(
[$sum] => $date.years.2015.months.07.days.08.total
)
[2015-07-09] => Array
(
[$sum] => $date.years.2015.months.07.days.09.total
)
[2015-07-10] => Array
(
[$sum] => $date.years.2015.months.07.days.10.total
)
[2015-07-11] => Array
(
[$sum] => $date.years.2015.months.07.days.11.total
)
[2015-07-12] => Array
(
[$sum] => $date.years.2015.months.07.days.12.total
)
)
)
[2] => Array
(
[$project] => Array
(
[_id] => $_id
[dates] => Array
(
[2015-07-01] => $2015-07-01
[2015-07-02] => $2015-07-02
[2015-07-03] => $2015-07-03
[2015-07-04] => $2015-07-04
[2015-07-05] => $2015-07-05
[2015-07-06] => $2015-07-06
[2015-07-07] => $2015-07-07
[2015-07-08] => $2015-07-08
[2015-07-09] => $2015-07-09
[2015-07-10] => $2015-07-10
[2015-07-11] => $2015-07-11
[2015-07-12] => $2015-07-12
)
)
)
)
Result:
Array
(
[data] => Array
(
[AD] => Array
(
[_id] => Array
(
[country] => AD
)
[dates] => Array
(
[2015-07-01] => 6080
[2015-07-02] => 6580
[2015-07-03] => 6178
[2015-07-04] => 6084
[2015-07-05] => 7085
[2015-07-06] => 7192
[2015-07-07] => 5672
[2015-07-08] => 6769
[2015-07-09] => 6370
[2015-07-10] => 6035
[2015-07-11] => 5513
[2015-07-12] => 6941
)
)
)
[time] => 17.0764780045
)
MySQL traditional schema
MySQL count of rows: 38515
MySQL query:
SELECT date,sum(clicks) as clicks FROM table WHERE ( country = "AD" AND ( date > 20150700 AND date < 20150712 ) ) GROUP BY country,date;
Result:
Array
(
[0] => Array
(
[date] => 20150701
[clicks] => 6080
)
[1] => Array
(
[date] => 20150702
[clicks] => 6580
)
[2] => Array
(
[date] => 20150703
[clicks] => 6178
)
[3] => Array
(
[date] => 20150704
[clicks] => 6084
)
[4] => Array
(
[date] => 20150705
[clicks] => 7085
)
[5] => Array
(
[date] => 20150706
[clicks] => 7192
)
[6] => Array
(
[date] => 20150707
[clicks] => 5672
)
[7] => Array
(
[date] => 20150708
[clicks] => 6769
)
[8] => Array
(
[date] => 20150709
[clicks] => 6370
)
[9] => Array
(
[date] => 20150710
[clicks] => 6035
)
[10] => Array
(
[date] => 20150711
[clicks] => 5513
)
)
time: 0.25689506530762
MongoDB traditional schema
Items count:
Aggregate:
Array
(
[0] => Array
(
[$match] => Array
(
[country] => AD
[date] => Array
(
[$in] => Array
(
[0] => 20150701
[1] => 20150702
[2] => 20150703
[3] => 20150704
[4] => 20150705
[5] => 20150706
[6] => 20150707
[7] => 20150708
[8] => 20150709
[9] => 20150710
[10] => 20150711
[11] => 20150712
)
)
)
)
[1] => Array
(
[$group] => Array
(
[_id] => Array
(
[country] => $country
)
[count] => Array
(
[$sum] => $clicks
)
)
)
)
Result:
Array
(
[result] => Array
(
[0] => Array
(
[_id] => Array
(
[country] => AD
)
[clicks] => 76499
)
)
[ok] => 1
)
time: 27.8900089264
I was holding off on answering, because I was sure that some MongoDB experts would answer. However, as no one is giving answers, I will give a few hints. Maybe some of them can help. But then again, I'm not a MongoDB expert. Take everything with a small grain of salt.
1) Which version are you using? If you are still on 2.6, try out 3.0.x (or newer) with the WiredTiger engine.
2) If you have a lot of data, sharding can help greatly. It will increase setup complexity, but as you will be able to process parts of the data set in parallel, you can get significant speed gains. Be careful to choose a proper shard key.
3) Consider creating several collections that can act as smaller views. Example: if you currently have 15 fields in [..], there is a great chance that lots of queries use just 1 or 2 of them at once, like country. Create one more collection in which you keep the country data and skip the rest. If a query uses only the country field and none of the other 15, use the small collection. If a query uses more fields, use the big one. That way queries on countries will be much faster, as you will be able to group the data more. However, this is not always possible, as building such small collections adds extra complexity. If you process data in some queue (to insert into the big collection), you could insert into the small one too. Or you could use aggregate queries with $out to build the smaller collections once every X minutes.
4) Come up with a 3rd schema. Your 2nd schema is easy to put data into, but it's hard to get data out of. You could use arrays more. That way it will be harder to get data in, but much easier and faster to query it. Keep in mind that in both your 2nd schema and my sample 3rd schema the documents grow, so MongoDB may need to move them around on disk, which is a really slow operation. Test whether that affects your setup. A small example of a potential collection schema:
{
"user": "asd",
[...],
"date": ISODate("2015-07-01T00:00:00Z"), // first date of the month
"total": 2222,
"daily": [
{"date": ISODate("2015-07-01T00:00:00Z"), "total": 22},
{"date": ISODate("2015-07-11T00:00:00Z"), "total": 200},
{"date": ISODate("2015-07-20T00:00:00Z"), "total": 2000},
]
}
When inserting data you can use an update with criteria (if you are in PHP): $criteria = ['user' => 'asd', 'daily.date' => new MongoDate('....') /* , other fields */] and the update clause $update = ['$inc' => ['total' => 1, 'daily.$.total' => 1]]. Check how many rows were updated. If 0, then create an insert from the same data, i.e. unset $criteria['daily.date'] and change the update to $update = ['$inc' => ['total' => 1], '$push' => ['daily' => ['date' => new MongoDate('..'), 'total' => 1]]].
Keep in mind that you can run into problems if you have several scripts inserting data. It is better to do everything one by one through a queue. Or, if you do it in parallel, make sure that $push does not end up adding several daily.date entries with the same date. So: you try to update and, if you can't update, you insert. As you use arrays and the positional operator, you can't use upserts; that's why the extra insert is needed. As I said, it's more complicated to get data in, but it will be easier to get data out.
Make sure to set up proper indexes, for example on 'daily.date', so that update queries do not need to check lots of documents. Even more, you can create a hash field that holds a hash of all the [...] fields, and use that in the update. That way it will be much easier to create a small index that pinpoints a particular document (you put 'daily.date', the hash field and a few more in the index, but you will not need to put the 15 [..] fields in it).
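The update-then-push flow described above could be sketched roughly as follows. The function name incrementDaily and the $update callable are hypothetical: the callable stands in for a thin wrapper around the driver's update call and is assumed to return the number of matched documents. Passing true for the upsert flag on the second attempt is also an assumption, made so that a document that does not exist yet gets created.

```php
<?php
// $update: function (array $criteria, array $modifier, bool $upsert): int
// Hypothetical wrapper around the driver; returns the number of matched documents.
function incrementDaily(callable $update, array $key, $day): void
{
    // 1st attempt: the day already exists in "daily", bump it via the positional operator.
    $criteria = $key + ['daily.date' => $day];
    $matched = $update(
        $criteria,
        ['$inc' => ['total' => 1, 'daily.$.total' => 1]],
        false
    );
    if ($matched > 0) {
        return;
    }

    // 2nd attempt: no entry for that day yet, push a new one (no positional operator here).
    $update(
        $key,
        [
            '$inc'  => ['total' => 1],
            '$push' => ['daily' => ['date' => $day, 'total' => 1]],
        ],
        true
    );
}

// Fake updater that records the calls and pretends the day is not present yet.
$calls = [];
$fakeUpdate = function (array $criteria, array $modifier, bool $upsert) use (&$calls): int {
    $calls[] = [$criteria, $modifier, $upsert];
    return 0;
};

// Plain strings stand in for driver date objects so the sketch runs without the extension.
incrementDaily($fakeUpdate, ['user' => 'asd', 'date' => '2015-07-01'], '2015-07-11');
```

Two writers can still both fall through to the second attempt for the same day, which is the duplicate daily.date problem mentioned above; the sketch does not solve that.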
With such a structure you can do a lot of things with queries. For example, if you need full months, just query on date and the [...] fields that you need, sum total, and you are done. If you need a date range (like the 1st to the 10th of the month), you can query by the [...] fields and date, project to get rid of unnecessary fields, $unwind daily, match again, but this time on the daily.date field, then project to rename fields, then group and sum. It's much more flexible than using $date.years.2015.months.07.days.03.total.
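As a rough illustration of the date-range case, the pipeline could be built as a PHP array like this. The function name dailyRangePipeline is hypothetical, the field names follow the sample schema above, and the optional $project stages are left out for brevity. Dates are plain strings here so the sketch runs without the MongoDB extension; with a real driver they would be the driver's date objects.

```php
<?php
// Build an aggregation pipeline that sums daily totals inside a date range.
function dailyRangePipeline(array $filters, $monthStart, $from, $to): array
{
    return [
        // Narrow down to the month documents that match the [...] fields.
        ['$match' => $filters + ['date' => $monthStart]],
        // One output document per element of the "daily" array.
        ['$unwind' => '$daily'],
        // Keep only the days inside the requested range.
        ['$match' => ['daily.date' => ['$gte' => $from, '$lte' => $to]]],
        // Sum the clicks per day across all matching documents.
        ['$group' => ['_id' => '$daily.date', 'clicks' => ['$sum' => '$daily.total']]],
        ['$sort' => ['_id' => 1]],
    ];
}

$pipeline = dailyRangePipeline(
    ['country' => 'AD'],
    '2015-07-01', // first day of the month, as stored in "date"
    '2015-07-01',
    '2015-07-10'
);
```

Note the single quotes around '$daily' and '$daily.total': in double quotes PHP would try to interpolate them as variables.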
Keep in mind that all of these are just hints. Test everything on your own. Maybe only 1 in 5 hints will work, but that can make all the difference.

PHP CodeIgniter - how to get a nested result array from related tables (same column names in different tables)

I am trying to produce pretty query results like Doctrine and other ORMs,
for example with the related tables article and article_category.
I want to get a query result like this:
Array
(
[0] => Array
(
[id] => 1
[title] => I am article title
[slug] => i-am-article-title
[category] => Array
(
[id] => 1
[name] => Category Name
[slug] => category-name
)
)
[1] => Array
(
[id] => 2
[title] => How to coding
[slug] => how-to-coding
[category] => Array
(
[id] => 4
[name] => Tutorial Area
[slug] => tutorial-area
)
)
)
I know this is basic, but I want to know how to create that result in a very simple way.
Thanks for all advice.
UPDATED.
To get that result, I switched to Eloquent from the Laravel framework. :)
No, you can't get the data in this shape directly from your database if you are using a relational database like MySQL or PostgreSQL.
You can get the effect you want with two queries, inserting the second query's result into your result array, or you can keep your categories in a separate table and do a JOIN with SQL.
As a note, other database systems return just what you asked for; consider switching to MongoDB (a NoSQL solution), which returns an object just like the one you want.
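If the JOIN route is taken, the clashing column names can be aliased with a prefix and the flat rows reshaped in PHP afterwards. This is only a sketch: the helper name nestCategory, the category_ prefix and the sample rows are assumptions, and the rows are taken to come from a query that selects something like c.id AS category_id, c.name AS category_name, c.slug AS category_slug.

```php
<?php
// Move every "category_*" column of a flat JOIN row into a nested 'category' array.
function nestCategory(array $rows, string $prefix = 'category_'): array
{
    $result = [];
    foreach ($rows as $row) {
        $item = [];
        $category = [];
        foreach ($row as $column => $value) {
            if (strpos($column, $prefix) === 0) {
                // Strip the prefix: category_name becomes name.
                $category[substr($column, strlen($prefix))] = $value;
            } else {
                $item[$column] = $value;
            }
        }
        $item['category'] = $category;
        $result[] = $item;
    }
    return $result;
}

// Flat rows as a JOIN with aliased columns might return them (sample data).
$rows = [
    ['id' => 1, 'title' => 'I am article title', 'slug' => 'i-am-article-title',
     'category_id' => 1, 'category_name' => 'Category Name', 'category_slug' => 'category-name'],
    ['id' => 2, 'title' => 'How to coding', 'slug' => 'how-to-coding',
     'category_id' => 4, 'category_name' => 'Tutorial Area', 'category_slug' => 'tutorial-area'],
];

$articles = nestCategory($rows);
```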

MongoDB Array Search in Query or client side

I am wondering what is better to do. I have pulled back a query result like this:
Array
(
[_id] => MongoId Object
(
[$id] => 4eeedd9545c717620a000007
)
[field1] => ...
[field2] => ...
[field3] => ...
[field4] => ...
[field5] => ...
[field6] => ...
[votes] => Array
(
[whoVoted] => Array
(
[0] => 4f98930cb1445d0a7d000001
[1] => 4f98959cb1445d0a7d000002
[2] => 4f88730cb1445d0a7d000003
)
)
)
Which would be faster:
Pull that entire array in 1 query and use in_array() to find the right id?
Pull everything from the first query except the votes and then do another MongoDB query to see if that id exists in the array?
It depends on a lot of factors that I suggest you test, but IMO most of the time it would be faster to just do 2 queries.
It depends on the size of the array being returned / searched.
Also, different servers are doing the work. What do you mean by faster? At what scale?
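For comparison, the two options look roughly like this in PHP. The function names are hypothetical and ids are plain strings for the sake of a runnable sketch. The second option relies on the fact that matching an array field against a single value selects documents where any element equals it.

```php
<?php
// Option 1: the whole document is already loaded, search the array in PHP.
function hasVotedClientSide(array $doc, string $userId): bool
{
    return in_array($userId, $doc['votes']['whoVoted'] ?? [], true);
}

// Option 2: criteria for a second query that lets MongoDB search the array.
function hasVotedCriteria($docId, string $userId): array
{
    return ['_id' => $docId, 'votes.whoVoted' => $userId];
}

// Sample document shaped like the one in the question.
$doc = [
    '_id' => '4eeedd9545c717620a000007',
    'votes' => [
        'whoVoted' => [
            '4f98930cb1445d0a7d000001',
            '4f98959cb1445d0a7d000002',
            '4f88730cb1445d0a7d000003',
        ],
    ],
];

$voted = hasVotedClientSide($doc, '4f98959cb1445d0a7d000002');
$criteria = hasVotedCriteria($doc['_id'], '4f98959cb1445d0a7d000002');
```

Which one wins depends on how large whoVoted gets: option 1 transfers the whole array on every read, option 2 costs an extra round trip.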

Find documents based on referenced ID in MongoDB & PHP

I've got a referenced Users collection object in my MongoDB Items collection. A random Item document looks like this:
PS: to clarify, I really don't want to embed Items into the Users collection.
Array
(
[_id] => MongoId Object
(
[$id] => 4d3c589378be56a008000000
)
[modified] => 1295800467
[order] => 1
[title] => MyFirstItem
[user] => Array
(
[$ref] => users
[$id] => MongoId Object
(
[$id] => 4d3c55e7a130717c09000012
)
)
)
So I need to find only the items that are assigned to a specific user. I found this question about my problem, but the solution didn't work for me:
MongoDB-PHP: JOIN-like query
Here is a snippet of my code, giving me no results at all:
$user = $db->users->findOne(array("_id" => new MongoID("4d3c55e7a130717c09000012")));
$items = $db->items->find(array("user" => array('$id' => $user["_id"])));
What is the correct way to find that data? Should I instead store a user_id as a MongoId without a reference?
I spent all day on this, thanks in advance!
Try
$items = $db->items->find(array('user.$id' => $user["_id"]));
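One PHP detail matters for a key such as user.$id: inside a double-quoted string PHP interpolates $id as a variable, so the key sent to MongoDB is no longer the literal field path. A small demonstration, where the value given to $id is arbitrary and only there to make the effect visible:

```php
<?php
$id = 'oops'; // any variable named $id that happens to be in scope

$doubleQuoted = array("user.$id" => 1); // PHP interpolates $id into the key
$singleQuoted = array('user.$id' => 1); // key stays the literal string user.$id

echo key($doubleQuoted), "\n"; // prints user.oops
echo key($singleQuoted), "\n"; // prints user.$id
```

So the query key should be written in single quotes, or with the dollar sign escaped as "user.\$id".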
