I'm querying my database using the aggregation pipeline, with two separate queries:
$groups_q = array(
'$group' => array(
'_id' => '$group_name',
'total_sum' => array('$sum' => 1)
)
);
$statuses_q = array(
'$group' => array(
'_id' => '$user_status',
'total_sum' => array('$sum' => 1)
)
);
$data['statuses'] = $this->mongo_db->aggregate('users',$statuses_q);
$data['groups'] = $this->mongo_db->aggregate('users',$groups_q);
And I'm getting what I want:
Array
(
[statuses] => Array
(
[result] => Array
(
[0] => Array
(
[_id] => Inactive
[total_sum] => 2
)
[1] => Array
(
[_id] => Active
[total_sum] => 5
)
)
[ok] => 1
)
[groups] => Array
(
[result] => Array
(
[0] => Array
(
[_id] => Accounting
[total_sum] => 1
)
[1] => Array
(
[_id] => Administrator
[total_sum] => 2
)
[2] => Array
(
[_id] => Rep
[total_sum] => 1
)
)
[ok] => 1
)
)
I don't want to query my database twice. Is there a better way to do it?
How can I accomplish it with one query? Should I use the $project operator?
You can't use a single aggregate() to do two grouped counts with your desired result format. Once the data has been grouped the first time you no longer have the details needed to create the second count.
The straightforward approach is to do two queries, as you are already doing ;-).
Thoughts on alternatives
If you really wanted to get the information in one aggregation query you could group on both fields and then do some manipulation in your application code. With two fields in the group _id, results are going to be every combination of group_name and status.
Example using the mongo shell:
db.users.aggregate(
{ $group: {
_id: { group_name: "$group_name", status: "$user_status" },
'total_sum': { $sum: 1 }
}}
)
That doesn't seem particularly efficient and lends itself to some convoluted application code because you have to iterate the results twice to get the expected groupings.
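As a rough illustration of that extra application-side step, here is a sketch in plain JavaScript (the language of the mongo shell) that folds the combined results back into two sets of counts. The `rows` sample data and the `foldCounts` helper are hypothetical; the only thing taken from the shell example is the shape of `_id`.

```javascript
// Hypothetical output of the combined $group: one row per (group_name, status) pair.
const rows = [
  { _id: { group_name: "Accounting", status: "Active" }, total_sum: 1 },
  { _id: { group_name: "Administrator", status: "Active" }, total_sum: 1 },
  { _id: { group_name: "Administrator", status: "Inactive" }, total_sum: 1 },
  { _id: { group_name: "Rep", status: "Active" }, total_sum: 1 },
];

// Sum total_sum per distinct value of one key of the compound _id.
function foldCounts(rows, key) {
  const counts = {};
  for (const row of rows) {
    const value = row._id[key];
    counts[value] = (counts[value] || 0) + row.total_sum;
  }
  return counts;
}

// Two passes over the same result set, one per desired grouping.
const groups = foldCounts(rows, "group_name");
const statuses = foldCounts(rows, "status");
```

The two calls are the "iterate the results twice" cost mentioned above.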
If you only wanted the unique names for each group instead of the names + counts, you could use $addToSet in a single group.
The other obvious alternative would be to do the grouping in your application code. Do a single find() projecting only the group_name and user_status fields, and build up your count arrays as you iterate the results.
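A minimal sketch of that application-side counting, in plain JavaScript. The `users` array stands in for whatever a find() with a projection would return; the sample documents and the `countGroupsAndStatuses` helper are hypothetical, and the field names are the ones used in the question.

```javascript
// Hypothetical documents, as a find() projecting two fields might return them.
const users = [
  { group_name: "Accounting", user_status: "Active" },
  { group_name: "Administrator", user_status: "Active" },
  { group_name: "Administrator", user_status: "Inactive" },
  { group_name: "Rep", user_status: "Active" },
];

// Build both count maps in a single pass over the documents.
function countGroupsAndStatuses(docs) {
  const groups = {};
  const statuses = {};
  for (const doc of docs) {
    groups[doc.group_name] = (groups[doc.group_name] || 0) + 1;
    statuses[doc.user_status] = (statuses[doc.user_status] || 0) + 1;
  }
  return { groups, statuses };
}

const counts = countGroupsAndStatuses(users);
```

This trades server-side aggregation for transferring every matching document to the client, so it is probably only reasonable for small collections.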
Related
I need a bit of light on how to get better performance out of Mongo. I have 2 projects using Mongo. One of them has 140 million rows and every query runs near instantly; the data is displayed in little chunks, so with a few indexes Mongo is able to filter out 99% of the data and return the selected rows quickly. Mongo works well on this kind of project.
On the other hand, I have another project that works like Google Analytics, tracking visits, clicks, etc. The objective is to count clicks in a time range based on certain criteria (using a form). I'm benchmarking it against MySQL on the same task.
First Try
I used the traditional schema of data, row by row, something like:
{
'user':'abc',
'date':'2015-07-20',
'hour':02,
[....]
'clicks':30
}
With 200+ million rows (even with clicks packed by hour, as you can see), I have an index on every field and some compound indexes on the most-queried groups. Trying to aggregate and $sum clicks for a given $match is really slow if the resulting set of rows is big enough, and it is even worse with that total row count: the indexes eat the 32 GB of RAM on the server.
Second try
Using the schema advantages of Mongo, I designed a grouping schema with as little duplicated data as possible: the properties of every type of click are determined by a unique combination of fields (with a unique index), and the clicks are then grouped in a tree distributed by dates. Example row:
{
"user" : "asd",
[....]
"date" : {
"total" : 5,
"years" : {
"2015" : {
"total" : 5,
"months" : {
"06" : {
"total" : 5,
"days" : {
"30" : {
"total" : 2,
"hours" : {
"16" : 1,
"22" : 1
}
},
"28" : {
"total" : 1,
"hours" : {
"6" : 1
}
},
"29" : {
"total" : 2,
"hours" : {
"14" : 1,
"20" : 1
}
}
}
}
}
}
}
}
}
Thanks to this strategy, the 200+ million rows were reduced by a factor of 10 and the indexes now fit in memory. Insertion speed dropped, because before inserting a new "row" you must check whether one with the same characteristics already exists and, if it does, merge the clicks into its dates tree.
When I need to count rows, the speed has improved over the traditional schema, but I need to do obscure aggregate things like this to count data:
['$sum'=>'$date.years.'.$year.'.months.'.$month.'.days.'.$day.'.total']
This performs a bit below the average speed of MySQL in general. The difference is usually tight, but under certain conditions MySQL wins by a wide margin. Considering MySQL is counting 200 million rows and Mongo 20 million, that's not acceptable: many times MySQL runs a query in 16s while Mongo takes 120s. I want to beat MySQL (MyISAM) so I can use Mongo as a replacement. I have tried lots of things, from sparse indexes on the dates tree to a second-level cache that saves some pre-processed results and mixes them. It's not possible to cache all permutations of the data for a given day because there are a lot of [...] fields.
Sharding could be a solution, but I don't think it will magically improve the speed by a factor of 2.
Give me some hints, please.
Update
Let's search some days for a certain country:
Mongo compressed schema
Mongodb count rows where country = 'AD': 11389
Aggregate:
Array
(
[0] => Array
(
[$match] => Array
(
[country] => AD
)
)
[1] => Array
(
[$group] => Array
(
[_id] => Array
(
[country] => $country
)
[2015-07-01] => Array
(
[$sum] => $date.years.2015.months.07.days.01.total
)
[2015-07-02] => Array
(
[$sum] => $date.years.2015.months.07.days.02.total
)
[2015-07-03] => Array
(
[$sum] => $date.years.2015.months.07.days.03.total
)
[2015-07-04] => Array
(
[$sum] => $date.years.2015.months.07.days.04.total
)
[2015-07-05] => Array
(
[$sum] => $date.years.2015.months.07.days.05.total
)
[2015-07-06] => Array
(
[$sum] => $date.years.2015.months.07.days.06.total
)
[2015-07-07] => Array
(
[$sum] => $date.years.2015.months.07.days.07.total
)
[2015-07-08] => Array
(
[$sum] => $date.years.2015.months.07.days.08.total
)
[2015-07-09] => Array
(
[$sum] => $date.years.2015.months.07.days.09.total
)
[2015-07-10] => Array
(
[$sum] => $date.years.2015.months.07.days.10.total
)
[2015-07-11] => Array
(
[$sum] => $date.years.2015.months.07.days.11.total
)
[2015-07-12] => Array
(
[$sum] => $date.years.2015.months.07.days.12.total
)
)
)
[2] => Array
(
[$project] => Array
(
[_id] => $_id
[dates] => Array
(
[2015-07-01] => $2015-07-01
[2015-07-02] => $2015-07-02
[2015-07-03] => $2015-07-03
[2015-07-04] => $2015-07-04
[2015-07-05] => $2015-07-05
[2015-07-06] => $2015-07-06
[2015-07-07] => $2015-07-07
[2015-07-08] => $2015-07-08
[2015-07-09] => $2015-07-09
[2015-07-10] => $2015-07-10
[2015-07-11] => $2015-07-11
[2015-07-12] => $2015-07-12
)
)
)
)
Result:
Array
(
[data] => Array
(
[AD] => Array
(
[_id] => Array
(
[country] => AD
)
[dates] => Array
(
[2015-07-01] => 6080
[2015-07-02] => 6580
[2015-07-03] => 6178
[2015-07-04] => 6084
[2015-07-05] => 7085
[2015-07-06] => 7192
[2015-07-07] => 5672
[2015-07-08] => 6769
[2015-07-09] => 6370
[2015-07-10] => 6035
[2015-07-11] => 5513
[2015-07-12] => 6941
)
)
)
[time] => 17.0764780045
)
MySQL traditional schema
Mysql count rows: 38515
Mysql query:
SELECT date,sum(clicks) as clicks FROM table WHERE ( country = "AD" AND ( date > 20150700 AND date < 20150712 ) ) GROUP BY country,date;
Result:
Array
(
[0] => Array
(
[date] => 20150701
[clicks] => 6080
)
[1] => Array
(
[date] => 20150702
[clicks] => 6580
)
[2] => Array
(
[date] => 20150703
[clicks] => 6178
)
[3] => Array
(
[date] => 20150704
[clicks] => 6084
)
[4] => Array
(
[date] => 20150705
[clicks] => 7085
)
[5] => Array
(
[date] => 20150706
[clicks] => 7192
)
[6] => Array
(
[date] => 20150707
[clicks] => 5672
)
[7] => Array
(
[date] => 20150708
[clicks] => 6769
)
[8] => Array
(
[date] => 20150709
[clicks] => 6370
)
[9] => Array
(
[date] => 20150710
[clicks] => 6035
)
[10] => Array
(
[date] => 20150711
[clicks] => 5513
)
)
time: 0.25689506530762
MongoDB traditional schema
Items count:
Aggregate:
Array
(
[0] => Array
(
[$match] => Array
(
[country] => AD
[date] => Array
(
[$in] => Array
(
[0] => 20150701
[1] => 20150702
[2] => 20150703
[3] => 20150704
[4] => 20150705
[5] => 20150706
[6] => 20150707
[7] => 20150708
[8] => 20150709
[9] => 20150710
[10] => 20150711
[11] => 20150712
)
)
)
)
[1] => Array
(
[$group] => Array
(
[_id] => Array
(
[country] => $country
)
[count] => Array
(
[$sum] => $clicks
)
)
)
)
Result:
Array
(
[result] => Array
(
[0] => Array
(
[_id] => Array
(
[country] => AD
)
[clicks] => 76499
)
)
[ok] => 1
)
time: 27.8900089264
I was holding off on answering because I was sure some MongoDB experts would answer. However, as no one is giving answers, I will give a few hints. Maybe some of them can help. But then again, I'm not a MongoDB expert. Take everything with a grain of salt.
1) Which version are you using? If you are still on 2.6, try out 3.0.x (or newer) with the WiredTiger engine.
2) If you have a lot of data, sharding can help greatly. It increases setup complexity, but as you will be able to process parts of the data set in parallel, you can get significant speed gains. But be careful to choose a proper shard key.
3) Consider creating several collections that act as smaller views. Example: if you currently have 15 fields in [..], there is a great chance that lots of queries use just 1 or 2 of them at once, like country. Create one more collection that keeps the country data and skips the rest. If a query uses only the country field and none of the other 15, use the small collection. If a query uses more fields, use the big one. That way queries on countries will be much faster, as you will be able to group the data more. However, this is not always possible, as building such small collections adds extra complexity. If you process data in some queue (to insert into the big collection), you could insert into the small one too. Or you could use aggregate queries with $out to build the smaller collections once every X minutes.
4) Come up with a 3rd schema. Your 2nd schema is easy to put data into, but hard to get data out of. You could use arrays more. That way it will be harder to get data in, but much easier and faster to query it. Keep in mind that in your 2nd schema and in my sample for the 3rd schema, documents grow, so MongoDB may need to move them around on disk, which is a really slow operation. Test whether that affects your setup. Small example of a potential collection schema:
{
"user": "asd",
[...],
"date": ISODate("2015-07-01T00:00:00Z"), // first date of the month
"total": 2222,
"daily": [
{"date": ISODate("2015-07-01T00:00:00Z"), "total": 22},
{"date": ISODate("2015-07-11T00:00:00Z"), "total": 200},
{"date": ISODate("2015-07-20T00:00:00Z"), "total": 2000},
]
}
When inserting data you can use an update with criteria (if you are in PHP): $criteria = ['user' => 'asd', 'daily.date' => new MongoDate('....'), /* other fields */] and the update clause $update = ['$inc' => ['total' => 1, 'daily.$.total' => 1]]. Check how many rows were updated. If 0, build the insert from the same data, i.e. unset $criteria['daily.date'] and change the update to $update = ['$inc' => ['total' => 1], '$push' => ['daily' => ['date' => new MongoDate('..'), 'total' => 1]]].
Keep in mind that you can run into problems if you have several scripts inserting data. It is better to do everything one at a time through a queue. Or, if you do it in parallel, make sure that $push does not end up adding several daily entries with the same date. So: you try to update and, if nothing could be updated, insert. Because you use arrays and the positional operator, you can't use upserts. That's why the extra insert is needed. As I said, it's more complicated to get data in, but it will be easier to get data out.
Make sure to set up proper indexes, for example on 'daily.date', so that update queries do not need to check lots of documents. Going further, you can create a hash field that holds a hash of all the [...] fields and use that in the update. That way it is much easier to create a small index that pinpoints a particular document (you put 'daily.date', the hash field and a few more in the index, but do not need to include all 15 [..] fields).
When you have such a structure you can do a lot of things with queries. For example, if you need full months, just query on date and the [...] fields that you need, sum total, and you are done. If you need some date range (like the 1st to the 10th of the month), you can query by the [...] fields and date, project to get rid of unnecessary fields, $unwind daily, match again (this time on the daily.date field), then project to rename fields, then group and sum. It's much more flexible than using $date.years.2015.months.07.days.03.total.
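The date-range steps described above could look roughly like the following pipeline, written as a plain JavaScript array in the style of the mongo shell. This is a sketch under assumptions: the `country` field stands in for one of the [...] fields, the `clicks` collection name is hypothetical, the rename step is folded into the $group, and the final $sort is an optional extra.

```javascript
// Hypothetical range: 1st to 10th of July 2015, inclusive.
const monthStart = new Date("2015-07-01T00:00:00Z");
const rangeStart = new Date("2015-07-01T00:00:00Z");
const rangeEnd = new Date("2015-07-10T00:00:00Z");

const pipeline = [
  // 1. Narrow to the month document(s) and the [...] fields of interest.
  { $match: { country: "AD", date: monthStart } },
  // 2. Drop fields the later stages do not need.
  { $project: { _id: 0, country: 1, daily: 1 } },
  // 3. Emit one document per element of the daily array.
  { $unwind: "$daily" },
  // 4. Keep only the days inside the requested range.
  { $match: { "daily.date": { $gte: rangeStart, $lte: rangeEnd } } },
  // 5. Sum the per-day totals, one output row per day.
  { $group: { _id: "$daily.date", clicks: { $sum: "$daily.total" } } },
  // 6. Optional: return the days in chronological order.
  { $sort: { _id: 1 } },
];

// In the mongo shell this could then be run as:
// db.clicks.aggregate(pipeline)
```

The first $match is the one that can use an index, so it is worth keeping it as selective as possible before the $unwind multiplies the documents.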
Keep in mind that all of these are just hints. Test everything on your own. Maybe only 1 of 5 hints will work, but that can make all the difference.
I have some mongoDB documents with the following structure:
[_id] => MongoId Object (
[$id] => 50664339b3e7a7cf1c000001
)
[uid] => 1
[name] => Alice
[words] => 1
[formatIds] => Array (
[0] => 1
[1] => 4
)
What I want to do is find all documents which have the value 1 in formatIds[ ]. I think it's possible to accomplish that. How can I do it in PHP?
UPDATE
Thanks for the help. It works fine now. Here is how I wrote the search:
$id=$_POST['id'];
$query = array('formatIds'=> "{$id}" );
$result = $stations_table->find($query); //where $stations_table = $db->stations;
MongoDB treats queries on array values the same way as queries on standard values, as per the docs.
Querying for array('formatIds' => 1) should work.
MongoDB effectively treats an array as multiple values for the same key, so:
$cursor = $mongo->base->collection->find(array('formatIds' => 1));
This assumes your $mongo object is set up correctly and that base/collection are replaced with your database and collection names.
It depends on whether you want the whole matching documents or only the array values that match your query.
If you don't mind pulling out the entire array and then searching it client side, you can of course use:
$c = $db->col->find(array('formatIds' => 1));
This works because a 1-dimensional array in MongoDB can be searched like a single field.
But to get only the values that match your query (the query above returns each document's whole array), use something like:
$db->command(array(
'aggregate' => 'col',
'pipeline' => array(
// single quotes: in double quotes PHP would interpolate $formatIds as a variable
array('$unwind' => '$formatIds'),
array('$match' => array('formatIds' => 1)),
array('$group' => array(
'_id' => '$_id',
'formats' => array('$push' => '$formatIds')
))
)
));
This gives you one result per document, with _id being the document's _id and a formats field holding only the array elements equal to 1.
I have four tables: followers, users, mixes, songs. I am trying to get all the mixes from all the followers of one user. I have that part figured out, but I also want to get the songs from each of those mixes. Currently my query gives me results, but each result is for one song on the mix, rather than one result per mix with an array of songs inside it. Any help would be amazing; my SQL skills aren't the greatest and I have spent a lot of time trying to figure this out!
My current query is:
SELECT followers.following_id, users.id, users.user_username, mixes.id, mixes.mix_created_date, mixes.mix_name,songs.song_artist
FROM followers, users, mixes,songs
WHERE followers.user_id = 46
AND users.id = followers.following_id
AND mixes.user_id = followers.following_id
AND mixes.id > 0
ORDER BY mixes.mix_created_date DESC
LIMIT 10
The current result is (from running this through a CakePHP custom query):
Array
(
[0] => Array
(
[followers] => Array
(
[following_id] => 47
)
[users] => Array
(
[id] => 47
[user_username] => someguy
)
[mixes] => Array
(
[id] => 45
[mix_created_date] => 2012-07-21 2:42:17
[mix_name] => this is a test
)
[songs] => Array
(
[song_artist] => Yo La Tengo
)
)
[1] => Array
(
[followers] => Array
(
[following_id] => 47
)
[users] => Array
(
[id] => 47
[user_username] => someguy
)
[mixes] => Array
(
[id] => 45
[mix_created_date] => 2012-07-21 2:42:17
[mix_name] => this is a test
)
[songs] => Array
(
[song_artist] => Animal Collective
)
)
As you can see, the mix ids are the same. I am trying to get the songs to be an array inside each result, like:
Array
(
[0] => Array
(
[followers] => Array
(
[following_id] => 47
)
[users] => Array
(
[id] => 47
[user_username] => someguy
)
[mixes] => Array
(
[id] => 45
[mix_created_date] => 2012-07-21 2:42:17
[mix_name] => this is a test
)
[songs] => Array
(
[0]=>array(
['song_artist'] => Yo La Tengo
),
[1]=>array(
['song_artist'] => Animal Collective
)
)
)
Really hoping this can be done with just one SQL statement! Thanks in advance!
You can use SQL joins to combine these tables in a single query.
Use this...
sql_join
First, a note: it looks like you have a missing condition. According to the above query, every song in the songs table will be joined with every possible result. There should probably be a condition similar to the following added (column names may differ depending on your tables):
...
and mix.song_id=songs.song_id
...
As for your question: I don't know PHP, so I'll address MySQL alone. I don't think it is possible to do this with MySQL. MySQL returns rows in the result set, and each row can hold only a single value in each column. To put a group of values (song names) in one column, they must be concatenated (which is possible: Can I concatenate multiple MySQL rows into one field?) and later split back apart in your PHP script. This is not a good idea, as you would need to choose a separator that you know will never appear in the concatenated values. Therefore I think it's better to remove the songs table from the query and, after getting the mix id, run a second query to get all songs in that mix.
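A different option from the second query suggested above is to keep the join and nest the rows in application code. The sketch below shows that regrouping in plain JavaScript; the same loop translates directly to PHP. The `rows` sample and the `nestSongs` helper are hypothetical, and the column names are assumptions based on the query in the question.

```javascript
// Hypothetical flat rows, one per (mix, song) pair, as a joined query would return them.
const rows = [
  { mix_id: 45, mix_name: "this is a test", song_artist: "Yo La Tengo" },
  { mix_id: 45, mix_name: "this is a test", song_artist: "Animal Collective" },
  { mix_id: 46, mix_name: "another mix", song_artist: "Low" },
];

// Collapse rows that share a mix_id into one entry with a songs array.
function nestSongs(rows) {
  const byMix = new Map(); // a Map keeps the mixes in first-seen order
  for (const row of rows) {
    if (!byMix.has(row.mix_id)) {
      byMix.set(row.mix_id, { id: row.mix_id, mix_name: row.mix_name, songs: [] });
    }
    byMix.get(row.mix_id).songs.push({ song_artist: row.song_artist });
  }
  return Array.from(byMix.values());
}

const mixes = nestSongs(rows);
```

Note that with this approach a LIMIT on the joined query counts song rows rather than mixes, which is one reason the two-query approach may still be simpler.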
I am wondering what is better to do. I have pulled back a query result like this:
Array
(
[_id] => MongoId Object
(
[$id] => 4eeedd9545c717620a000007
)
[field1] => ...
[field2] => ...
[field3] => ...
[field4] => ...
[field5] => ...
[field6] => ...
[votes] => Array
(
[whoVoted] => Array
(
[0] => 4f98930cb1445d0a7d000001
[1] => 4f98959cb1445d0a7d000002
[2] => 4f88730cb1445d0a7d000003
)
)
)
Which would be faster:
Pull that entire array in 1 query and use in_array() to find the right id?
Pull everything from the first query except the votes and then do another MongoDB query to see if that id exists in the array?
It depends on a lot of factors that I suggest you test, but IMO most of the time it would be faster to just do 2 queries.
It depends on the size of the array being returned / searched.
Also, different servers are doing the work. What do you mean by faster? At what scale?
I am playing around with a quotes database relating to a ski trip I run. I am trying to list the quotes, but sort by the person who said the quote, and am struggling to get the paginate helper to let me do this.
I have four relevant tables.
quotes, trips, people and attendances. Attendances is essentially a join table for people and trips.
Relationships are as follows;
Attendance belongsTo Person hasMany Attendance
Attendance belongsTo Trip hasMany Attendance
Attendance hasMany Quote belongsTo Attendance
In the QuotesController I use containable to retrieve the fields from Quote, along with the associated Attendance, and the fields from the Trip and Person associated with that Attendance.
function index() {
$this->Quote->recursive = 0;
$this->paginate['Quote'] = array(
'contain' => array('Attendance.Person', 'Attendance.Trip'));
$this->set('quotes', $this->paginate());
}
This seems to work fine, and in the view, I can echo out
foreach ($quotes as $quote) {
echo $quote['Attendance']['Person']['first_name'];
}
without any problem.
What I cannot get to work is accessing/using the same variable as a sort field in paginate
echo $this->Paginator->sort('Name', 'Attendance.Person.first_name');
or
echo $this->Paginator->sort('Location', 'Attendance.Trip.location');
Does not work. It appears to sort by something, but I'm not sure what.
The $quotes array I am passing looks like this;
Array
(
[0] => Array
(
[Quote] => Array
(
[id] => 1
[attendance_id] => 15
[quote_text] => Hello
)
[Attendance] => Array
(
[id] => 15
[person_id] => 2
[trip_id] => 7
[Person] => Array
(
[id] => 2
[first_name] => John
[last_name] => Smith
)
[Trip] => Array
(
[id] => 7
[location] => La Plagne
[year] => 2000
[modified] =>
)
)
)
I would be immensely grateful if someone could suggest how I might be able to sort by the first_name of the Person associated with the Quote. I suspect my syntax is wrong, but I have not been able to find the answer. Is it not possible to sort by a second level association in this way?
I am pretty much brand new with cakephp so please be gentle.
Thanks very much in advance.
I had a similar problem a while back. Not with sort, though. Try putting the associated table in another array.
echo $this->Paginator->sort('Name', 'Attendance.Person.first_name');
change to:
echo $this->Paginator->sort('Name', array('Attendance' => 'Person.first_name'));
Hope this helps
I'm also looking for help with this.
So far I've found that you can sort multi-level associations in the controller's pagination options after using the linkable plugin https://github.com/Terr/linkable.
But it breaks down when you try to sort from the paginator in the view. I'm using a controller for magazine clippings. Each clipping belongs to an issue and each issue belongs to a publication.
$this->paginate = array(
"recursive"=>0,
"link"=>array("Issue"=>array("Publication")),
"order"=>array("Publication.name"=>"ASC"),
"limit"=>10);
After debugging $this->Paginator->params->paging->Clipping in the view, you can see that the sort is described in two separate places, "defaults" and "options". The sort info needs to be present in both for it to work in the view.
Here it is after setting order in the controller:
[Clipping] => Array
(
[page] => 1
[current] => 10
[count] => 6685
[prevPage] =>
[nextPage] => 1
[pageCount] => 669
[defaults] => Array
(
[limit] => 10
[step] => 1
[recursive] => 0
[link] => Array
(
[Issue] => Array
(
[0] => Publication
)
)
[order] => Array
(
[Publication.name] => ASC
)
[conditions] => Array
(
)
)
[options] => Array
(
[page] => 1
[limit] => 10
[recursive] => 0
[link] => Array
(
[Issue] => Array
(
[0] => Publication
)
)
[order] => Array
(
[Publication.name] => ASC
)
[conditions] => Array
(
)
)
)
And here it is after using $this->Paginator->sort("Publication","Publication.name");.
Notice that the order array inside options is empty.
[Clipping] => Array
(
[page] => 1
[current] => 10
[count] => 6685
[prevPage] =>
[nextPage] => 1
[pageCount] => 669
[defaults] => Array
(
[limit] => 10
[step] => 1
[recursive] => 0
[link] => Array
(
[Issue] => Array
(
[0] => Publication
)
)
[order] => Array
(
[Publication.name] => DESC
)
[conditions] => Array
(
)
)
[options] => Array
(
[page] => 1
[limit] => 10
[recursive] => 0
[link] => Array
(
[Issue] => Array
(
[0] => Publication
)
)
[order] => Array
(
)
[conditions] => Array
(
)
)
Does one really need to modify the paginator class to make this work?
UPDATE:
I found the problem:
In the core Cake controller, the paginator merges defaults and options to create the find query.
But the order value in the options array is empty when using linkable to sort. Because options is merged after defaults, it overrides them, and the empty array replaces the default order.
The solution is to override the paginate function inside app_controller.php and unset the options array's order value if it is empty:
(line 1172 in cake/libs/controller/controller.php)
if(empty($options["order"])){
unset($options["order"]);
}
Then the default order will not be overwritten by the blank array.
Of course this should not be changed inside controller.php itself; put it in app_controller.php and move that to your app folder.
On CakePHP 3 this problem can be solved by adding the 'sortWhitelist' option to $this->paginate in your controller.
$this->paginate = [
// ...
'sortWhitelist' => ['id', 'status', 'Attendance.Person.first_name']
];
And then in your view:
echo $this->Paginator->sort('Name', 'Attendance.Person.first_name');
This is noted in the docs:
This option is required when you want to sort on any associated data, or computed fields that may be part of your pagination query:
However that could be easily missed by tired eyes, so hope this helps someone out there!