MongoDB speed when returning last document - php

I have a web app with tons of documents. The user can enter an id (a valid MongoId / ObjectId), but if they don't enter one I have to retrieve the object with the last (highest) id.
I'm concerned about the speed of looking up that last object. I'm currently doing it like this:
db.docs.find({"status": 1}).sort({"_id": -1}).limit(1);
//Or in php:
$docs->find(array('status' => 1))->sort(array('_id' => -1))->limit(1)->getNext();
Isn't this a bit slow? First it looks for all docs with status 1, then sorts them, then applies the limit. Is there a better way to get the last document with status 1?

To make this performant you'd likely need to add an index on { status: 1, _id: -1 }.
You can also use findOne to tidy the syntax, but note that findOne returns a single document rather than a cursor, so a sort cannot be chained onto it; pass the sort as an option in drivers that support one. In the shell, the find().sort().limit(1) form above is already the idiomatic way, and with the index it only has to examine a single document.
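For reference, a minimal sketch of the index plus the one-document query with the current mongodb/mongodb PHP library (the question uses the legacy driver, so treat the exact API here as an assumption; database and collection names are illustrative):

$collection = (new MongoDB\Client)->mydb->docs;

// Compound index so the query below touches a single index entry.
$collection->createIndex(['status' => 1, '_id' => -1]);

// Last (highest _id) document with status = 1; the sort is passed as an option.
$last = $collection->findOne(
    ['status' => 1],
    ['sort' => ['_id' => -1]]
);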

Alternatively, store another value in a separate collection, or in an in-memory cache, holding the highest id the system has with status=1. It requires a small bit of logic when inserting/updating documents: compare the id of any document with status=1 against the currently cached id and update the cache if the new one is higher. You can then fetch the latest document directly by that cached id.
It is a little clunky, but it will probably perform much better than the find.sort.limit operation you are currently doing as the number of documents grows.
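A rough sketch of that cache idea with the mongodb/mongodb library, keeping the highest status=1 _id in a one-document meta collection (names and $newDocId are illustrative; $max only overwrites the stored value when the new one is higher):

$client = new MongoDB\Client;
$meta = $client->mydb->meta;
$docs = $client->mydb->docs;

// When inserting/updating a document with status = 1, record its _id if it is the highest seen so far.
$meta->updateOne(
    ['_id' => 'latest_status_1'],
    ['$max' => ['value' => $newDocId]],
    ['upsert' => true]
);

// Lookup: read the cached _id, then fetch that document directly.
$cached = $meta->findOne(['_id' => 'latest_status_1']);
$latest = $cached ? $docs->findOne(['_id' => $cached['value']]) : null;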

Related

Optimal way to detect fields to delete in database comparing to an array of IDs

I am trying to do the following.
I am consulting an external database using a web service. What the web service does is bring me all the products from an ERP system my client uses. As the server and the connection are not really fast, what I decided to do is basically synchronize the database on my web server and handle most operations there, so that the website can run smoothly.
Everything works fine; I just need one last step to guarantee that the inventory on the website matches the one available in the ERP. The only issue comes when the client deletes something on the ERP system.
At the moment I am trying to work out the ideal strategy (least resource- and time-consuming) for removing products from my Products table when I don't receive them in the web service result.
So I basically have the following process:
I query the web service for all the products, format them a little, and store them in an array. The final size is about 600 entries.
Then I run a foreach loop with the following sub-steps:
I query my database to check if product_id is present.
If the product is present, I just update it with the latest info (stock data).
If the product is not present, I just insert it.
So, I was thinking of doing the following, but I do not think it's the ideal way:
Do a SELECT * FROM Products and generate an array that has all the products.
Do a foreach loop over the resulting array and, on each iteration, scan the ERP array to check whether that specific product exists. If not, I delete it; if yes, I continue with the next product.
Considering that, after all the previous steps, this would involve a couple of nested foreach loops, I am a little worried that it might consume too much memory and take too long to process.
I was thinking that maybe something like array_diff or array_map could solve the issue, but I am not really experienced with these functions, and the structure of the two arrays differs a lot, so I am not sure it would work that easily.
What would you guys recommend?
It's actually quite simple:
SELECT id FROM Products
Then you have an array of your product Ids, for example:
[123,5679,345]
Then as you go and do your updates or inserts, remove the id from the array.
For the updates, the step "I query my database to check if product_id is present" is now redundant.
There are a few ways to remove the value from the array (when you do an update); this is how I would probably do it:
if (false !== ($index = array_search($data['product_id'], $myids))) {
    // Find the index of the product id in our list of ids from the local DB.
    // Note the !== comparison: array_search() can return 0 for the first index,
    // so we must check for boolean false explicitly.
    // The incoming product_id is in the local list, so do the update.
    unset($myids[$index]);
} else {
    // Otherwise, do the insert.
}
As I mentioned above, when doing your updates/inserts you no longer have to check whether the ID exists, because you already know from the array of IDs pulled from the database. This alone saves you n queries (approximately 600).
Then it's very simple to deal with any IDs left over.
// I wouldn't normally concatenate variables into SQL, but in this case it's a list of integer IDs that came from the database itself.
// You could of course build a prepared statement with a placeholder per ID, but for the sake of simplicity I'll leave that as an exercise for another day (a sketch follows below).
'DELETE FROM Products WHERE id IN(' . implode(',', $myids) . ')'
And because you unset IDs as you update, the only ones left are products that no longer exist.
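Since the prepared-statement variant is left as an exercise above, here is a rough sketch of it (assuming $pdo is an existing PDO connection and $myids holds the leftover IDs):

if (!empty($myids)) {
    // One placeholder per remaining ID, e.g. "?,?,?".
    $placeholders = implode(',', array_fill(0, count($myids), '?'));
    $stmt = $pdo->prepare("DELETE FROM Products WHERE id IN ($placeholders)");
    $stmt->execute(array_values($myids));
}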
Conclusion:
You have no choice (other than doing an ON DUPLICATE KEY query, or ignoring exceptions) but to pull out the product IDs. You're already doing this on a row-by-row basis, so we can effectively kill two birds with one stone.
If you need more data than just the ID (for example, to check whether the product actually changed before doing an update), then pull that data out too, but I would recommend using PDO with the FETCH_GROUP fetch mode. I won't go into the specifics beyond saying that it lets you easily build your array this way:
[{product_id} => [ {product_name}, {product_price} etc..]];
Basically the product_id is the key, with a nested array of the row data; this makes lookups easier.
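A small sketch of pulling that keyed array with PDO (column names are assumptions; note that because id is unique, PDO::FETCH_UNIQUE maps each id straight to its row, while PDO::FETCH_GROUP would wrap each row in an extra array):

$stmt = $pdo->query('SELECT id, name, price, stock FROM Products');
$myids = $stmt->fetchAll(PDO::FETCH_UNIQUE | PDO::FETCH_ASSOC);
// $myids now looks like: [123 => ['name' => ..., 'price' => ..., 'stock' => ...], ...]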
Then the lookup becomes:
// Then instead of array_search:
// if (false !== ($index = array_search($data['product_id'], $myids))) {
if (isset($myids[$data['product_id']])) {
    // The product exists locally: do your checks, then your update.
    unset($myids[$data['product_id']]);
} else {
    // Do the insert.
}
References:
http://php.net/manual/en/function.array-search.php
array_search — Searches the array for a given value and returns the first corresponding key if successful
WARNING This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE. Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.
UPDATE
There is one other really good way to do this: add a field called sync_date, and whenever you do your insert or update, set sync_date to the current date.
This way, when you are done, any products with a sync_date older than this run's can be deleted. It's best to cache the time once at the start so you use the exact same value throughout.
$time = date('Y-m-d H:i:s'); // or time() if you prefer a timestamp
// Use this same variable for the whole course of the script.
Then you can do
"DELETE FROM Products WHERE sync_date != '$time'"
This may actually be a bit better because it has more utility: when was the last time the sync ran? Now you know.
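Putting the update together, a hedged sketch of the whole sync_date flow (column names and the ON DUPLICATE KEY upsert are assumptions; $pdo is an existing PDO connection):

$time = date('Y-m-d H:i:s'); // cached once for the whole run

foreach ($erpProducts as $data) {
    // Upsert each product coming from the ERP and stamp it with this run's sync_date.
    $stmt = $pdo->prepare(
        'INSERT INTO Products (id, name, stock, sync_date) VALUES (?, ?, ?, ?)
         ON DUPLICATE KEY UPDATE name = VALUES(name), stock = VALUES(stock), sync_date = VALUES(sync_date)'
    );
    $stmt->execute([$data['product_id'], $data['name'], $data['stock'], $time]);
}

// Anything not touched in this run still has an older sync_date and can be removed.
$pdo->prepare('DELETE FROM Products WHERE sync_date != ?')->execute([$time]);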

How combine the sorted sets Redis?

I use the sorted set type in my Redis store.
For each user I create a separate KEY and put their data there.
Example of KEYs:
FEED:USER:1, FEED:USER:2, FEED:USER:3
I want to select the data from Redis for the users' keys 1, 2, 3, each sorted by score (a timestamp).
Put simply, I need to select data from each KEY over a time range and then combine all the results, sorted by score.
There are a couple of ways to do this but the right one depends on what you're trying to do. For example:
You can use ZRANGEBYSCORE (or ZREVRANGEBYSCORE) in your code for each FEED:USER:n key and "merge" the replies in the client
You can do a ZUNIONSTORE on the relevant keys and then do the ZRANGEBYSCORE on the result from the client.
However, if your "feeds" are large, #2's flow should be reversed - first range and then union.
You could also do similar types of processing entirely server-side with some Lua scripting.
EDIT: further clarifications
Re. 1: merging could be done client-side on the results you get from ZRANGEBYSCORE, or you could use a server-side Lua script to do it. Use WITHSCORES to get the timestamps and merge/sort on them. Regardless of where you choose to run this code (I'd probably use Lua for data locality), the implementation is up to you - let me know if you need help with that :)
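A small sketch of option #2 with the phpredis extension (the merged key name, TTL, and time window are assumptions; the source key names follow the question):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Merge the users' feeds into a temporary sorted set; scores (timestamps) are kept.
$redis->zUnionStore('FEED:MERGED:1:2:3', ['FEED:USER:1', 'FEED:USER:2', 'FEED:USER:3']);
$redis->expire('FEED:MERGED:1:2:3', 60); // don't keep the merged copy around forever

// Newest first, limited to the last 24 hours as an example, scores included.
$items = $redis->zRevRangeByScore(
    'FEED:MERGED:1:2:3',
    '+inf',
    (string) (time() - 86400),
    ['withscores' => true, 'limit' => [0, 20]]
);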

Use Redis for a timeout queue or leaderboard

I want to use Redis basically like this, if it (hypothetically) accepted SQL:
SELECT id, data, processing_due FROM qtable WHERE processing_due < NOW()
where processing_due is an integer timestamp of some sort.
The idea is then to also remove completed "jobs" with something like:
DELETE from qtable WHERE id = $someid
Which Redis commands would I use on the producing ("insert") and consuming ("select, delete from") end?
I find that Redis can be used as a queue, but I don't want the items strictly in the order they were inserted; rather, I want them based on whether "now" is past processing_due.
I imagine this is almost the same problem as a leaderboard?
(I try to wrap my head around how Redis works and it looks simple enough from the documentation, but I just don't get it.)
Would a decent solution be to do ZADD qtable <timestamp> <UUID> and then use the UUID as a key to store the (json) value under it?
You can use a Sorted Set in which the score is your time (an integer, as you've suggested) and then query it using ZRANGEBYSCORE. Each member would be a JSON representation of your "fields". For example: {id:"1",data:"bla",processing_due:"3198382"}
Regarding delete, just use ZREM when you find the relevant member to delete: pass your JSON string as the member argument and you're OK.
A possibly better variant would be to hold only generated IDs as members, and to save the JSON representation of your data in separate String-type keys keyed by those IDs. Just remember to keep the two structures in sync.
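A rough sketch of that last variant with the phpredis extension (key names and the payload shape are assumptions):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Producer: the score is processing_due, the member is just the job's ID.
$id = uniqid('job', true);
$processingDue = time() + 300;               // due in 5 minutes, for example
$redis->zAdd('qtable', $processingDue, $id);
$redis->set("qdata:$id", json_encode(['data' => 'bla']));

// Consumer: everything whose processing_due is already in the past.
$due = $redis->zRangeByScore('qtable', '-inf', (string) time());
foreach ($due as $jobId) {
    $job = json_decode($redis->get("qdata:$jobId"), true);
    // ... process $job ...
    $redis->zRem('qtable', $jobId);          // remove the completed job
    $redis->del("qdata:$jobId");
}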

Comparing documents within MongoDB

I'm looking to compare fields between potentially millions of documents within a mongo collection. The fields will be determined ahead of time and weights will be given to each field. These weights will then be used to return document pairs representing suggestions for 'like' documents. For instance, if 2 documents are being compared and both have the same value for the field 'first_name', the weight table will be referenced and the score for the pair will have that weight added to it. If another field is the same between the two, the score will updated to reflect a higher likeness.
I'm currently approaching this by iterating through the initial result set, then having an embedded iteration that also goes through the result set and compares each document to the one the first iterator is on (extremely inefficient). This is all currently done in PHP as it pulls documents through the cursor.
I'm open to any suggestions, including MapReduce implementations (which don't really seem applicable), cursor manipulation, or pretty much whatever you can conjure up to simplify the process, because I'm working at O(n^2) complexity right now (well, slightly better, as I skip the documents already covered by the first iterator).
To avoid O(n^2) you would have to look at storing fields and their values in a reference collection, e.g.:
{
    field: "firstName",
    value: "Remon",
    documents: [ <list with the _ids of all documents that have "field" set to "value"> ]
}
This way you can query directly on this collection to get all documents that are "like" your source document. Additionally this allows you to query for multiple key/value pairs with a single O(n) query.
Obviously the only tricky thing is maintaining this reference collection in the first place but in your case that seems pretty straightforward (update references when you update the fields).
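A minimal sketch of maintaining and querying that reference collection with the mongodb/mongodb PHP library ($doc, $weights, the collection name, and the field names are assumptions):

$refs = (new MongoDB\Client)->mydb->field_refs;

// Maintain: when a document's field value is set or changed, add its _id to the matching bucket.
$refs->updateOne(
    ['field' => 'first_name', 'value' => $doc['first_name']],
    ['$addToSet' => ['documents' => $doc['_id']]],
    ['upsert' => true]
);

// Query: fetch every bucket the source document falls into, then score the candidates.
$buckets = $refs->find(['$or' => [
    ['field' => 'first_name', 'value' => $doc['first_name']],
    ['field' => 'last_name',  'value' => $doc['last_name']],
]]);

$scores = [];
foreach ($buckets as $bucket) {
    foreach ($bucket['documents'] as $otherId) {
        if ($otherId == $doc['_id']) {
            continue; // skip the source document itself
        }
        // Add this field's weight to the pair's likeness score.
        $scores[(string) $otherId] = ($scores[(string) $otherId] ?? 0) + $weights[$bucket['field']];
    }
}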
Does that help?

Tracking a total count of items over a series of paged results

What is the ideal way to keep track of the total count of items when dealing with paged results?
This seems like a simple question at first but it is slightly more complicated (to me... just bail now if you find this too stupid for words) when I actually start thinking about how to do it efficiently.
I need to get a count of items from the database. This is simple enough. I can then store this count in some variable (a $_SESSION variable for instance). I can check to see if this variable is set and, if it isn't, get the count again. The tricky part is deciding the best way to determine when I need to get a new count. It seems I would need a new count if items have been added to or deleted from the total, or if I am reloading or revisiting the grid.
So, how would I decide when to clear this $_SESSION variable? I can see clearing it and getting a new count after an update/delete (or even adding to or subtracting from it to avoid the potentially expensive database hit), but (here comes the part I find tricky) what about when someone navigates away from the page, waits an arbitrary amount of time before going to the next page of results, or reloads the page?
Since we may be dealing with tens or hundreds of thousands of results, getting a count of them from the database could be quite expensive (right? Or is my assumption incorrect?). Since I need the total count to handle the total number of pages in the paged results... what's the most efficient way to handle this sort of situation and to persist it for... as long as might be needed?
BTW, I would get the count with an SQL query like:
SELECT COUNT(id) FROM foo;
I never use a session variable to store the total found by a query; I include the count in the regular query when I fetch the data, and the count itself comes from a second query:
// first query
SELECT SQL_CALC_FOUND_ROWS * FROM table LIMIT 0, 20;
// I don't actually use * but just select the columns I need...
// second query
SELECT FOUND_ROWS();
I've never noticed any performance degradation because of the second query, but I guess you will have to measure that if you want to be sure.
By the way, I use this with PDO; I haven't tried it in plain MySQL.
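For completeness, roughly what that looks like with PDO (a sketch; the connection details are placeholders, and note that SQL_CALC_FOUND_ROWS and FOUND_ROWS() are deprecated as of MySQL 8.0.17, where a separate COUNT(*) query is the recommended replacement):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$rows  = $pdo->query('SELECT SQL_CALC_FOUND_ROWS id FROM foo LIMIT 0, 20')
             ->fetchAll(PDO::FETCH_ASSOC);
$total = (int) $pdo->query('SELECT FOUND_ROWS()')->fetchColumn();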
Why store it in a session variable? Will the result change per user? I'd rather store it in a user cache like APC or memcached, choose the cache key wisely, and then clear it when inserting or deleting a record related to the query.
A good way to do this would be to use an ORM that does it for you, like Doctrine, which has a result cache.
To get the count, I know that using COUNT(*) is not worse than using COUNT(id). (question: Is it even better?)
EDIT: interesting article about this on the MySQL performance blog
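A hedged sketch of that cache idea with APCu (the successor to APC's user cache); the key name, TTL, and $pdo connection are assumptions:

$cacheKey = 'foo:total_count';

$total = apcu_fetch($cacheKey, $found);
if (!$found) {
    $total = (int) $pdo->query('SELECT COUNT(*) FROM foo')->fetchColumn();
    apcu_store($cacheKey, $total, 300); // recount at most every 5 minutes
}

// When inserting or deleting a record related to the query, invalidate (or adjust) the cached value:
apcu_delete($cacheKey);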
Most likely foo has a PRIMARY KEY index defined on the id column. Indexed COUNT() queries are usually quite easy on the DB.
However, if you want to go the extra mile, another option would be to insert a special hook into code that deals with inserting and deleting rows into foo. Have it write the number of total records into a protected file after each insert/update and read it from there. If every successful insert/update gets accounted for, the number in the protected file is always up-to-date.
