Mongo DB MapReduce in PHP

First of all it's my first time in Mongo...
Concept:
1. A user is able to describe an image in natural language.
2. Divide the user input and store the words he described in a collection called words.
3. Users must be able to go through the most used words and add those words to their description.
4. The system will use the most used words (for all users) and use those words to describe the image.
My words document (currently) is as follows (example):
{
"date": "date it was inserted",
"reported": 0,
"image_id": "image id",
"image_name": "image name",
"user": "user _id",
"word": "awesome"
}
The words will be duplicated so that each word can be associated with a user...
Problem: I need to perform a Mongo query that tells me the most used words (to describe an image) that were not created by a given user (to meet point 3 above).
I've looked at the MapReduce algorithm, but from what I read there are a couple of issues with it:
Can't sort results (I can't order from the most used to the least used)
With millions of documents it can have a large processing time.
Can't limit the number of results returned
I've thought about running a task at a given time each day to store in a document (in a different collection) the ranked list of words that a given user hasn't used to describe the given image. I would have to limit this to 300 results or something (any idea on a proper limit??) Something like:
{
user_id: "the user id",
words: [
{word: "test", count: 1000},
{word: "test2", count: 980},
{word: "etc", count: 300}
]
}
Problems I see with this solution are:
Results would have quite a delay which is not desirable.
Server load can spike while generating these documents for all users (I actually know very little about this in Mongo so this is just an assumption)
Maybe my approach doesn't make any sense... And maybe my lack of experience in Mongo is pointing me at the wrong "schema design".
Any idea of what could be a good approach for this kind of problem?
Sorry for the big post and thanks for your time and help!
João

As already mentioned you could use the group command which is easy to use, but you will need to sort the result on the client side. Also the result is returned as a single BSON object and for this reason must be fairly small – less than 10,000 keys, else you will get an exception.
Code example based on your data structure:
db.words.group({
key : {"word" : true},
initial: {count : 0},
reduce: function(obj, prev) { prev.count++},
cond: {"user" :{ $ne : "USERNAME_TO_IGNORE"}}
})
Another option is to use the new Aggregation framework, which will be released in the 2.2 version. Something like this should work (each stage is its own object in the pipeline array):
db.words.aggregate([
{ $match : { "user" : { "$ne" : "USERNAME_TO_IGNORE"} } },
{ $group : {
_id : "$word",
count: { $sum : 1}
} }
])
Or you can still use MapReduce. Actually you can limit and sort the output, because the result is a collection. Just use .sort() and .limit() on the output. Also you can use the incremental map-reduce output option, which will help you solve your performance issues. Have a look at the out parameter in MapReduce.
Below is an example that uses the incremental feature to merge the existing collection with new data in a words_usage collection:
m = function() {
emit(this.word, {count: 1});
};
r = function( key , values ){
var sum = 0;
values.forEach(function(doc) {
sum += doc.count;
});
return {count: sum};
};
db.runCommand({
mapreduce : "words",
map : m,
reduce : r,
out : { reduce: "words_usage"},
query : <query filter object>
})
// retrieve the top 10 words
db.words_usage.find().sort({"value.count" : -1}).limit(10)
I guess you can run the above MapReduce command from a cron job every few minutes or hours, depending on how accurate you want the results to be. For the update query criteria you can use the creation date of the words documents.
Once you have the system top words collection you can build per user top words or just compute them in real time (depends on the system size).
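To make the map/reduce logic above a bit more concrete, here is a minimal plain-JavaScript sketch (no MongoDB involved) of what the m and r functions compute, how the reduce output mode folds a new batch into the existing counts, and what the final sort/limit returns. The helper names countWords, mergeCounts and topWords and the sample data are made up for illustration, and excluding a user is assumed to happen through the query filter.

```javascript
// Hypothetical helpers that mimic the map/reduce above on plain arrays.

// "map" + "reduce": count each word, skipping documents created by
// ignoredUser (this plays the role of the `query` filter).
function countWords(docs, ignoredUser) {
  const counts = new Map();
  for (const doc of docs) {
    if (doc.user === ignoredUser) continue;
    counts.set(doc.word, (counts.get(doc.word) || 0) + 1);
  }
  return counts;
}

// Like out: {reduce: "words_usage"}: add the new batch to the existing counts.
function mergeCounts(existing, batch) {
  const merged = new Map(existing);
  for (const [word, count] of batch) {
    merged.set(word, (merged.get(word) || 0) + count);
  }
  return merged;
}

// Like find().sort({"value.count": -1}).limit(n)
function topWords(counts, n) {
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([word, count]) => ({ word, count }));
}

// Made-up sample data, processed in two runs as a cron job would.
const firstBatch = [
  { user: "u1", word: "awesome" },
  { user: "u2", word: "awesome" },
  { user: "u2", word: "sunset" },
];
const secondBatch = [
  { user: "u3", word: "sunset" },
  { user: "u3", word: "sunset" },
  { user: "u1", word: "beach" },
];

let usage = countWords(firstBatch, "u1");
usage = mergeCounts(usage, countWords(secondBatch, "u1"));
console.log(topWords(usage, 2));
// [ { word: 'sunset', count: 3 }, { word: 'awesome', count: 1 } ]
```

Each run only touches the new batch, which is the point of the incremental output mode.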

The group function is supposed to be a simpler version of MapReduce. You could use it like this to get a sum for each word:
db.coll.group(
{key: { a:true, b:true },
cond: { active:1 },
reduce: function(obj,prev) { prev.csum += obj.c; },
initial: { csum: 0 }
});

Related

MongoDB - Converting fields from int32 to int64

I have a very large dataset in MongoDB, in which there are documents with numeric fields. Due to some issue in the data import, some of these fields ended up with the int32 datatype while others ended up as int64.
I need to convert all of them to int32. Since many of the fields are nested documents/arrays I cannot use MongoChef or RoboMongo to edit the field and do a collection-wide replace.
What is my next best option? Would I need to write a script that loops through each document/field and explicitly typecasts them to NumberInt()? I could do this in PHP or Python, but I was wondering if there is a way to do this without writing extra code.
Is there any mongoshell magic that can be done? I would appreciate if any Mongo Masters can give me any insights.

To anyone looking to do this and coming here. You can run
db.foo.find().forEach(doc => {
const newBar = doc.bar.valueOf()
db.foo.update({
"_id" : doc._id
}, {
"$set" : {
"bar" : newBar
}
})
})
in the mongo shell. This might not be feasible for large collections. The key is to use .valueOf() on the Int64. You might want to check that this doesn't overflow.
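For the overflow check mentioned above, a minimal plain-JavaScript sketch could look like the following. The function names fitsInt32 and isExactAsNumber are hypothetical and not part of any MongoDB API. The idea is to test a value against the signed 32-bit range before narrowing it, and to notice when a 64-bit value is too large to be represented exactly as a JavaScript number.

```javascript
const INT32_MIN = -(2n ** 31n);
const INT32_MAX = 2n ** 31n - 1n;

// True when an integer (number, BigInt or numeric string) fits in a
// signed 32-bit field. BigInt() throws for non-integer numbers.
function fitsInt32(value) {
  const n = BigInt(value);
  return n >= INT32_MIN && n <= INT32_MAX;
}

// True when a 64-bit integer survives conversion to a JS number unchanged.
function isExactAsNumber(value) {
  return Number.isSafeInteger(Number(BigInt(value)));
}

console.log(fitsInt32(2147483647));           // true
console.log(fitsInt32(2147483648));           // false
console.log(isExactAsNumber(2n ** 53n + 1n)); // false
```

Documents that fail the check could be logged and skipped instead of being updated.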

Select condition within a hash column using Doctrine mongoDB ODM query builder

I have the following structure within a mongoDB collection:
{
"_id" : ObjectId("5301d337fa46346a048b4567"),
"delivery_attempts" : {
"0" : {
"live_feed_id" : 107,
"remaining_attempts" : 2,
"delivered" : false,
"determined_status" : null,
"date" : 1392628536
}
}
}
// > db.lead.find({}, {delivery_attempts:1}).pretty();
I'm trying to select any data from that collection where remaining_attempts are greater than 0 and a live_feed_id is equal to 107. Note that the "delivery_attempts" field is of a type hash.
I've tried using an addAnd within an elemMatch (not sure if this is the correct way to achieve this).
$qb = $this->dm->createQueryBuilder($this->getDocumentName());
$qb->expr()->field('delivery_attempts')
->elemMatch(
$qb->expr()
->field('remaining_attempts')->gt(0)
->addAnd($qb->expr()->field('live_feed_id')->equals(107))
);
I do appear to be getting the record detailed above. However, changing the greater than test to 3
->field('remaining_attempts')->gt(3)
still returns the record (which is incorrect). Is there a way to achieve this?
EDIT: I've updated the delivery_attempts field type from a "Hash" to a "Collection". This shows the data being stored as an array rather than an object:
"delivery_attempts" : [
{
"live_feed_id" : 107,
"remaining_attempts" : 2,
"delivered" : false,
"determined_status" : null,
"date" : 1392648433
}
]
However, the original issue still applies.

You can use a dot notation to reference elements within a collection.
$qb->field('delivery_attempts.remaining_attempts')->gt(0)
->field('delivery_attempts.live_feed_id')->equals(107);

It works fine for me if I run the query on mongo.
db.testQ.find({"delivery_attempts.remaining_attempts" : {"$gt" : 0}, "delivery_attempts.live_feed_id" : 107}).pretty()
so it seems something is wrong with your PHP query. I suggest running the profiler to see which query is actually run against mongo:
db.setProfilingLevel(2)
This will log all operations from the moment you enable profiling. Then you can query the log to see the actual queries:
db.system.profile.find().pretty()
This might help you to find the culprit.

It sounds like you solved your first problem, which was using the Hash type mapping (intended for storing BSON objects, or associative arrays in PHP) instead of the Collection mapping (intended for real arrays); however, the query criteria in the answer you submitted still seem incorrect.
$qb->field('delivery_attempts.remaining_attempts')->gt(0)
->field('delivery_attempts.live_feed_id')->equals(107);
You said in your original question:
I'm trying to select any data from that collection where remaining_attempts are greater than 0 and a live_feed_id is equal to 107.
I assume you'd like that criteria to be satisfied by a single element within the delivery_attempts array. If that's correct, the criteria you specified above may match more than you expect, since delivery_attempts.remaining_attempts can refer to any element in the array, as can the live_feed_id criteria. You'll want to use $elemMatch to restrict the field criteria to a single array element.
I see you were using elemMatch() in your original question, but the syntax looked a bit odd. There should be no need to use addAnd() (i.e. an $and operator) unless you were attempting to apply two query operators to the same field name. Simply add extra field() calls to the same query expression you're using for the elemMatch() method. One example of this from ODM's test suite is QueryTest::testElemMatch(). You can also use the debug() method on the query to see the raw MongoDB query object created by ODM's query builder.
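The difference between the two kinds of criteria can be illustrated without a database. The following plain-JavaScript sketch (sample data and function names are made up) mimics how dot-notation criteria may be satisfied by different array elements, while $elemMatch requires one element to satisfy all of them:

```javascript
// Made-up document: no single attempt has both live_feed_id 107
// and remaining_attempts > 0.
const lead = {
  delivery_attempts: [
    { live_feed_id: 107, remaining_attempts: 0 },
    { live_feed_id: 200, remaining_attempts: 2 },
  ],
};

// Dot notation: each condition may be met by a different element.
function matchesDotNotation(doc) {
  return (
    doc.delivery_attempts.some((a) => a.remaining_attempts > 0) &&
    doc.delivery_attempts.some((a) => a.live_feed_id === 107)
  );
}

// $elemMatch: one element must meet both conditions.
function matchesElemMatch(doc) {
  return doc.delivery_attempts.some(
    (a) => a.remaining_attempts > 0 && a.live_feed_id === 107
  );
}

console.log(matchesDotNotation(lead)); // true
console.log(matchesElemMatch(lead));   // false
```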

Mongodb and sorting sub array

Not sure if this can be done, so thought I would ask.
I have the following MongoDB documents:
{
"store":"abc",
"offers":[{
"spend":"100.00",
"cashback":"10.00",
"percentage":"0.10"
},{
"spend":"50.00",
"cashback":"5.00",
"percentage":"0.10"
}]
}
and
{
"store":"def",
"offers":[{
"spend":"50.00",
"cashback":"2.50",
"percentage":"0.05"
},{
"spend":"20.00",
"cashback":"1.00",
"percentage":"0.05"
}]
}
and
{
"store":"ghi",
"offers":[{
"spend":"50.00",
"cashback":"5.00",
"percentage":"0.10"
},{
"spend":"20.00",
"cashback":"2.00",
"percentage":"0.10"
}]
}
the sort needs to be by percentage.
I am not sure if I would have to use usort or another PHP function to do it, or if MongoDB is smart enough to do what I want to do.

Amazingly, yes, mongodb can do this:
// Sort ascending, by minimum percentage value in the docs' offers array.
db.collection.find({}).sort({ 'offers.percentage': 1 });
// Sort descending, by maximum percentage value in the docs' offers array.
db.collection.find({}).sort({ 'offers.percentage': -1 });
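As a rough illustration of the behaviour described in the comments above, here is a plain-JavaScript sketch that picks the smallest array value as the sort key for an ascending sort and the largest for a descending sort. The store "jkl" and the helper names are made up. The values are kept as strings, as in the question, so they compare lexicographically; storing them as numbers would be safer for real data (as strings, "100.00" sorts before "50.00").

```javascript
// Made-up documents; "jkl" has two different percentages on purpose.
const stores = [
  { store: "abc", offers: [{ percentage: "0.10" }, { percentage: "0.10" }] },
  { store: "def", offers: [{ percentage: "0.05" }, { percentage: "0.05" }] },
  { store: "jkl", offers: [{ percentage: "0.02" }, { percentage: "0.20" }] },
];

// Ascending sorts use the minimum array value, descending the maximum.
function sortKey(doc, direction) {
  const values = doc.offers.map((o) => o.percentage).sort();
  return direction === 1 ? values[0] : values[values.length - 1];
}

function sortStores(docs, direction) {
  return [...docs].sort((a, b) => {
    const ka = sortKey(a, direction);
    const kb = sortKey(b, direction);
    if (ka === kb) return 0;
    return (ka < kb ? -1 : 1) * direction;
  });
}

console.log(sortStores(stores, 1).map((d) => d.store));  // [ 'jkl', 'def', 'abc' ]
console.log(sortStores(stores, -1).map((d) => d.store)); // [ 'jkl', 'abc', 'def' ]
```

Note that "jkl" comes first in both directions, because a different element of its array is used as the key each time.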

Given your data structure of arrays within documents, I don't think it makes sense to do this sort in MongoDB -- Mongo will be returning entire documents (not arrays).
If you are trying to compare offers it would probably make more sense to have a separate collection instead of an embedded array. For example, you could then find offers matching a cashback of at least $5 sorted by spend or percentage discount.
If you are just trying to order the offers within a single document, you could do this in PHP with a usort().

Can MongoDB and its drivers preserve the ordering of document elements

I am considering using MongoDB to store documents that include a list of key/value pairs. The safe but ugly and bloated way to store this is as
[ ['k1' : 'v1'] , ['k2' : 'v2'], ...]
But document elements are inherently ordered within the underlying BSON data structure, so in principle:
{k1 : 'v1',
k2 : 'v2', ...}
should be enough. However I expect most language bindings will interpret these as associative arrays, and thus potentially scramble the ordering. So what I need to know is:
Does MongoDB itself promise to preserve item ordering of the second form.
Do language bindings have some API which can extract it ordered form -- even if the usual "convenient" API returns an associative array.
I am mostly interested in Javascript and PHP here, but I would also like to know about other languages. Any help is appreciated, or just a link to some documentation where I can go RTM.

From version 2.6 on, MongoDB preserves the order of fields where possible. However, the _id field always comes first and renaming fields can lead to re-ordering. Still, I'd generally try not to rely on details like this. As the original question mentions, there are also additional layers to consider, each of which must provide some sort of guarantee for the stability of the order...
Original Answer:
No, MongoDB does not make guarantees about the ordering of fields:
"There is no guarantee that the field order will be consistent, or the same, after an update."
In particular, in-place updates that change the document size will usually change the ordering of fields. For example, if you $set a field whose old value was of type number and the new value is NumberLong, fields usually get re-ordered.
However, arrays preserve ordering correctly:
[ {'key1' : 'value1'}, {'key2' : 'value2'}, ... ]
I don't see why this is "ugly" and "bloated" at all. Storing a list of complex objects couldn't be easier. However, abusing objects as lists is definitely ugly: Objects have associative array semantics (i.e. there can only be one field of a given name), while lists/arrays don't:
// not ok:
db.foo2.insert({"foo" : "bar", "foo" : "lala" });
db.foo2.find();
{ "_id" : ObjectId("4ef09cd9b37bc3cdb0e7fb26"), "foo" : "lala" }
// a list can do that
db.foo2.insert({ 'array' : [ {'foo' : 'bar'}, { 'foo' : 'lala' } ]});
db.foo2.find();
{ "_id" : ObjectId("4ef09e01b37bc3cdb0e7fb27"), "array" :
[ { "foo" : "bar" }, { "foo" : "lala" } ] }
Keep in mind that MongoDB is an object database, not a key/value store.
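On the language-binding side of the question, plain JavaScript is one concrete case where an object does not always iterate in insertion order: integer-like keys are listed first, in ascending order, and only then the remaining string keys in insertion order. A small sketch with made-up keys shows the difference from an array of single-key objects:

```javascript
// Same four pairs, once as an object and once as an array of pairs.
const asObject = { k2: "v2", k1: "v1", 10: "ten", 2: "two" };
const asArray = [{ k2: "v2" }, { k1: "v1" }, { 10: "ten" }, { 2: "two" }];

// Integer-like keys come first (ascending), then string keys as inserted.
console.log(Object.keys(asObject)); // [ '2', '10', 'k2', 'k1' ]

// The array keeps exactly the order in which the pairs were written.
const orderedKeys = asArray.map((pair) => Object.keys(pair)[0]);
console.log(orderedKeys); // [ 'k2', 'k1', '10', '2' ]
```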

As of Mongo 2.6.1, it DOES keep the order of your fields:
MongoDB preserves the order of the document fields following write operations except for the following cases:
The _id field is always the first field in the document.
Updates that include renaming of field names may result in the reordering of fields in the document.
http://docs.mongodb.org/manual/release-notes/2.6/#insert-and-update-improvements

One of the pain points of this is comparing documents to one another in the shell.
I've created a project that creates a custom mongorc.js which sorts the document keys by default for you when they are printed out so at least you can see what is going on clearly in the shell. It's called Mongo Hacker if you want to give it a whirl.

Though it's true that, as of Mongo 2.6.1, it does preserve order, one should still be careful with update operations.
mattwad makes the point that updates can reorder things, but there's at least one other concern I can think of.
For example $addToSet:
https://docs.mongodb.com/manual/reference/operator/update/addToSet/
$addToSet when used on embedded documents in an array is discussed / exemplified here:
https://stackoverflow.com/a/21578556/3643190
In the post, mnemosyn explains how $addToSet disregards the order when matching elements in its deep value by value comparison.
($addToSet only adds records when they're unique)
This is relevant if one decided to structure data like this:
[{key1: v1, key2: v2}, {key1: v3, key2: v4}]
With an update like this (notice the different order on the embedded doc):
db.collection.update({_id: "id"},{$addToSet: {field:
{key2: v2, key1: v1}
}});
Mongo will see this as a duplicate and NOT add this object to the array.

Selecting every Nth element from a large MongoDB collection w/ PHP?

I have a MongoDB collection with ~4M elements.
I want to grab X number of those elements, evenly spaced through the entire collection.
E.g., Get 1000 elements from the collection - one every 4000 rows.
Right now, I am getting the whole collection in a cursor and then only writing every Nth element. This gives me what I need but the original load of the huge collection takes a long time.
Is there an easy way to do this? Right now my guessed approach is to do a JS query on an incremented index property, with a modulus. A PHP implementation of this:
db.collection.find({i:{$mod:[10000,0]}})
But this seems like it will probably take just as much time for the query to run.
Jer

Use $sample.
This returns a random sample that is roughly "every Nth document".
To receive exactly every Nth document in a result set, you would have to provide a sort order and iterate the entire result set discarding all unwanted documents in your application.
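The "iterate and discard" approach could be sketched as a small generator. This is plain JavaScript over any synchronous iterable; everyNth is a hypothetical helper, and a real driver cursor would typically need the asynchronous equivalent (for await).

```javascript
// Yields the 1st, (n+1)th, (2n+1)th, ... item of a sorted result set.
function* everyNth(iterable, n) {
  let index = 0;
  for (const item of iterable) {
    if (index % n === 0) yield item;
    index++;
  }
}

// Stand-in for a sorted result set of 20 documents.
const ids = Array.from({ length: 20 }, (_, i) => i);
console.log([...everyNth(ids, 5)]); // [ 0, 5, 10, 15 ]
```

Only one item is held at a time, but every document is still read from the server, which is where most of the cost lies.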

I think the main problem is that the collection can be distributed over several servers, and thus you have to iterate over the entire collection.

Do not put the whole dataset in a cursor. Since row order is not important, just collect x random rows out of your total, return that as a result and then modify those records.

Personally I would design in a "modulus" value and populate it with something that is a function representative of the data. So if your data was inserted at regular intervals throughout the day you could do a modulus of the time; if there's nothing predictable then you could use a random value. With a collection of that size it would tend toward an even distribution pretty quickly.
An example using a random value...
// add the index (ensureIndex is called createIndex in newer shells)
db.example.ensureIndex({modulus: 1});
// insert a load of data; Math.floor gives a uniform integer in 0..999
db.example.insert({ your: 'data', modulus: Math.floor(Math.random() * 1000) });
// Get 1/1000 of the set
db.example.find({modulus: 1});
// Get roughly 1/3 of the set (values 0..332)
db.example.find({modulus: { $gte: 0, $lt: 333 }});

A simple (inefficient) way to do this is with a stream.
var stream = collection.find({}).stream();
var counter = 0;
stream.on("data", function (document) {
counter++;
if (counter % 10000 == 0) {
console.log(JSON.stringify(document, null, 2));
//do something every 10,000th time
}
});

If only your data was in a sql database, as it should be, ... this question wouldn't be in PHP and the answer would be so easy and quick ...
Loading anything into a cursor instead of calculating the information directly in the db is definitely a bad idea, is it not possible to do this directly in the MongoDB thingy ?
