Comparing documents within MongoDB - php

I'm looking to compare fields between potentially millions of documents within a Mongo collection. The fields will be determined ahead of time and weights will be given to each field. These weights will then be used to return document pairs representing suggestions for 'like' documents. For instance, if two documents are being compared and both have the same value for the field 'first_name', the weight table will be referenced and the score for the pair will have that weight added to it. If another field is the same between the two, the score will be updated to reflect a higher likeness.
I'm currently approaching this by iterating through the initial result set, then having an embedded iteration that also goes through the result set and compares each document to the document that the first iterator is on (extremely inefficient). This is all currently done in PHP as it grabs elements through the cursor.
I'm open to any suggestions, including MapReduce implementations (it doesn't really seem applicable), cursor manipulation, pretty much whatever you can conjure up to simplify the process, because I'm working at O(n^2) complexity right now (well, a little better, since I skip the documents the first iterator has already covered).

To avoid O(n^2) you would have to look at storing fields and their values in a reference collection, e.g.:
{
field: "firstName",
value: "Remon",
documents : [ <list with all document _ids of documents that have "field" set to "value">]
}
This way you can query directly on this collection to get all documents that are "like" your source document. Additionally this allows you to query for multiple key/value pairs with a single O(n) query.
Obviously the only tricky thing is maintaining this reference collection in the first place, but in your case that seems pretty straightforward (update the references when you update the fields).
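A rough sketch of both halves in PHP, assuming the legacy Mongo driver; the reference collection, the weight table and the helper names are just placeholders for your setup, not something you already have:

// Keep the reference collection in sync whenever a field changes on a document.
function updateReference(MongoCollection $refs, $field, $oldValue, $newValue, $docId) {
    if ($oldValue !== null) {
        $refs->update(
            array('field' => $field, 'value' => $oldValue),
            array('$pull' => array('documents' => $docId))
        );
    }
    $refs->update(
        array('field' => $field, 'value' => $newValue),
        array('$addToSet' => array('documents' => $docId)),
        array('upsert' => true)
    );
}

// Score every document that shares at least one weighted field with the source document.
function scoreCandidates(MongoCollection $refs, array $sourceDoc, array $weights) {
    $scores = array();
    foreach ($weights as $field => $weight) {
        if (!isset($sourceDoc[$field])) { continue; }
        $ref = $refs->findOne(array('field' => $field, 'value' => $sourceDoc[$field]));
        if ($ref === null) { continue; }
        foreach ($ref['documents'] as $id) {
            if ((string) $id === (string) $sourceDoc['_id']) { continue; }
            $key = (string) $id;
            $scores[$key] = (isset($scores[$key]) ? $scores[$key] : 0) + $weight;
        }
    }
    arsort($scores); // highest likeness first
    return $scores;
}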
Does that help?

Related

Pairing fields in a table for a select query

Hi, so I have this database project I'm working on that involves transcribing archival sources to make them more accessible.
I'm revamping the database structure so I can make the depiction of the archival data more accurate to the manuscript sources. As part of that, I have this new table, which has both the labels/titles for columns of data in the documents, plus a "used" field which acts both as a flag for whether the field is used and as its position from left to right (as the order changes sometimes).
I'm wondering if there's a way to pair the columns together so I can do a query that - when asking for a single row to be returned - sorts the "used" fields numerically (returning all the ones that aren't -1), and also returns all the "label" fields sorted into the same order (e.g. if guns_used is 2, men_used is 1 and ship_name_position is 0, the query will put them in the correct order and also return guns_label, men_label and shipname_label in the correct order).
I'm also working with/around wordpress, so I have the contents of the whole wpdb thing available to me too.
I'm hoping to be able to "pair" the fields in some way so that if I order one set, the other gets ordered as well.
Edit:
I really would prefer to find a way to do this in a query, but until I find one I'm going to:
a) select the entire row that I need;
b) have a long series of if statements - one for each pair of _label/_used fields - assigning the values I want to the position in the array indicated by the value of the _used field.
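Something like this rough PHP sketch of that fallback might do it, assuming the row comes back as an associative array (e.g. from $wpdb->get_row(..., ARRAY_A)) and that every pair follows the name_used / name_label convention; the helper name is made up:

// $row is the single fetched row, e.g.
// array('guns_used' => 2, 'guns_label' => 'Guns', 'men_used' => 1, 'men_label' => 'Men', ...)
function orderedLabels(array $row) {
    $ordered = array();
    foreach ($row as $column => $value) {
        // Only look at the *_used columns and skip fields flagged as unused (-1).
        if (substr($column, -5) !== '_used' || (int) $value === -1) {
            continue;
        }
        $base = substr($column, 0, -5);            // e.g. "guns"
        $ordered[(int) $value] = $row[$base . '_label'];
    }
    ksort($ordered); // sort by the stored position, left to right
    return array_values($ordered);
}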

Algorithm for optimising compound index search in MongoDb

I have a collection X on which I have to apply a filter.
The filter is saved as a separate entity (collection filters) and the only data it holds is the field name and the conditions applied to that field name.
Example of filter:
Name is Stephan and Age BETWEEN 10, 20
Basically what I have to improve is the fact that each field in my filter is an index added upon creation of the filter.
The only structure that matches is a compound index on the fields filtered.
In conclusion, the problem is that when I have a filter like:
Name is Stephan and Age BETWEEN 10,20
My compound index in MongoDb will be: {'Name':1,'Age':1}
But then, if I add another filter, let's say: Age is 10 and Name is Adrian and Height BETWEEN 170,180
compound index is: {'Age':1,'Name':1, 'Height':1}
{'Name':1,'Age':1} <> {'Age':1,'Name':1, 'Height':1}
What can I do to make the last index fit with the first, and the other way around?
Please let me know if I haven't been explicit enough.
The cleanest solution to this problem is index intersections, which is currently in development. That way, an index for each of the criteria would be sufficient.
In the meantime, I see two options:
Use a separate search database that returns the relevant ids based on your criteria, then use $in in MongoDB to query the actual documents. There are a number of tools that use this approach, but it adds quite a bit of overhead because you need to code against and administer a second db, keep the data in sync, etc.
Use a smart mix of compound indexes and 'infinite range queries'. For instance, you can argue that a query for age in the range (0, 200) won't discard anybody from the result set, and neither will a height query between 0 and 400.
That might not be the cleanest approach, and its efficiency depends very much on the details of the queries, so that might require some fine-tuning.
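A sketch of that second option with the PHP driver, assuming a single compound index on {'Age':1,'Name':1,'Height':1} and made-up "cannot exclude anything" bounds for the numeric fields:

// Build one query shape that always touches the numeric fields of the
// compound index, so different filters can share the same index.
function buildFilterQuery(array $criteria) {
    // Bounds wide enough that no document is excluded (placeholder values).
    $fullRanges = array(
        'Age'    => array('$gte' => 0, '$lte' => 200),
        'Height' => array('$gte' => 0, '$lte' => 400),
    );
    $query = $criteria; // e.g. array('Name' => 'Stephan', 'Age' => array('$gte' => 10, '$lte' => 20))
    foreach ($fullRanges as $field => $range) {
        if (!isset($query[$field])) {
            $query[$field] = $range; // pad the unused field with an 'infinite' range
        }
    }
    return $query;
}

// "Name is Stephan and Age BETWEEN 10,20" then becomes:
// array('Name' => 'Stephan',
//       'Age'    => array('$gte' => 10, '$lte' => 20),
//       'Height' => array('$gte' => 0, '$lte' => 400))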

Database/datasource optimized for string matching?

I want to store a large number (~thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4
(each line has additional data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
Folder1
Folder1/Folder2/Folder3
Folder3
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters on the left of the wildcard, and in my case it can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution, that could be used from PHP. And I am open: it can be a separate server, a PHP library using files with regex, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
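If you go that route from PHP, one way to wire it up (a sketch only; the paths table, the pattern column and the wildcard-to-regex conversion are assumptions about your schema) is to pre-convert the stored keys into anchored regular expressions and let MySQL match the incoming string against them:

// Turn a stored key like "*/Fo*4" into a regex like "^.*/Fo.*4$".
// Assumes the keys only contain letters, digits and "/" besides the "*" wildcard.
function wildcardToRegexp($key) {
    return '^' . str_replace('*', '.*', $key) . '$';
}

// One-off: populate a pattern column from the existing keys, then query like this:
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('SELECT * FROM paths WHERE :path REGEXP pattern');
$stmt->execute(array('path' => 'Folder1/Folder2/Folder3'));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);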
You might want to use a multicore approach to run that search in a fraction of the time. For search and matching I would recommend FPGAs, but that's probably the hardest way to do it. Consider THIS ARTICLE on using CUDA: you can do such searches about 16x faster than usual. On multicore CPU systems you can use POSIX threads or a cluster of computers (MPI, for example), and you can call a Gearman service to run the searches using advanced algorithms.
Were it me, I'd store the key field twice ... once forward and once reversed (see MySQL's REVERSE function). You can then search the index with LEFT(main_field) and LEFT(reversed_field). It won't help when you have a wildcard in the middle of the string AND at the beginning (e.g. "*Folder1*Folder2"), but it will when you have a wildcard at the beginning or the end.
E.g. if you want to search */Folder1, then search WHERE LEFT(reversed_field, 8) = '1redloF/';
for Folder1/*/FolderX, search WHERE LEFT(reversed_field, 8) = 'XredloF/' AND LEFT(main_field, 8) = 'Folder1/'.
If your strings represent some kind of hierarchical structure (as it looks like in your sample content) - they're not "real" files, but you say you are open to alternative solutions - why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob - thanks to the hierarchical file structure, a glob search should be much faster than searching through all your database entries.
If needed you can match the results to your MySQL data - thanks to your MySQL index on the key this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only compete on a huge data set (but not a too-huge one, as @Kyle mentioned) with a hierarchical structure that is deep rather than wide.
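A small PHP sketch of both sides of that index (assuming the index lives under a myindex directory, and that the wildcards are in the search term rather than in the stored keys, as the edit below notes):

$indexDir = __DIR__ . '/myindex';

// Build an index entry: one empty file per key, mirroring the hierarchy.
function addIndexEntry($indexDir, $key) {
    $path = $indexDir . '/' . $key;
    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0777, true); // create the parent "folders" on demand
    }
    touch($path);
}

// Find matching keys for a pattern such as "Folder1/*" or "*/Folder4".
function findIndexEntries($indexDir, $pattern) {
    $matches = glob($indexDir . '/' . $pattern);
    // Strip the index prefix so the results are the original keys again.
    return array_map(function ($match) use ($indexDir) {
        return substr($match, strlen($indexDir) + 1);
    }, $matches);
}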
EDIT
Sorry, this would only work if the wildcards are in your search terms, not in the stored strings themselves.
As the wildcards (*) are in your data and not in your queries, I think you should start by breaking up your data into pieces. You should create an index table with columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2", store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*", store "Folder1/" in "wildcardEnd" and again assign the ID of the value in the main table to the "dataGroup" field in the above table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume mysql is clever enough not to do the truncate for every row but to start with the first character and match it against every row (using B-Tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart", but reversed. To cite Missy Elliott: "Is it worth it, let me work it / I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store a value of "4redloF/" in "wildcardStart". Then a WHERE clause like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
Of course, you could do the "REVERSE" already in your application logic.
Now about the tricky part. Something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now when you match something, you have to take care that every index record of a dataGroup gets returned for a complete match and that no overlapping occurs. This could also be solved in SQL, but that is beyond this question.
A database isn't the right tool for these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searching in memory. Load all the strings from the database (a few thousand strings is really no biggie), cache them, and run your search/match algorithm on them.
You will probably have to code the algorithm yourself, because the standard tools will be overkill for what you are trying to achieve and there is no guarantee they will be able to do exactly what you need.
I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work until you get the regexes right, but it will be the fastest way to go.
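A compact sketch of that in PHP, assuming * is the only wildcard in the stored keys and that everything fits comfortably in memory:

// Compile one PCRE pattern per stored key, e.g. "*/Fo*4" => '#^.*/Fo.*4$#'.
function compilePatterns(array $keys) {
    $patterns = array();
    foreach ($keys as $id => $key) {
        $patterns[$id] = '#^' . str_replace('\*', '.*', preg_quote($key, '#')) . '$#';
    }
    return $patterns;
}

// Return the ids of every stored key whose pattern matches the input string.
function matchInput(array $patterns, $input) {
    $hits = array();
    foreach ($patterns as $id => $pattern) {
        if (preg_match($pattern, $input)) {
            $hits[] = $id;
        }
    }
    return $hits;
}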
I suggest reading the keys and their associated payload into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped" then you can avoid the (slight) additional overhead of building a balanced tree. You can also avoid any tree maintenance code since, if I understand your problem correctly, the data will be changing frequently and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straightforward and much more efficient than just running a regex against a bunch of strings. You may even find while working it through that the wildcards in the tree lead to some shortcuts to prune the search space. A quick search shows lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, count(*) FROM your_sample_table GROUP BY folder_col, do you get duplicate folder_col values (i.e. count(*) greater than 1)?
If not, that means you can produce an SQL query that would generate a valid Sphinx index (see http://sphinxsearch.com/).
I wouldn't recommend doing text search on a large collection of data in MySQL. You need a database to store the data, but that would be it. For searching, use a search engine like:
Solr (http://lucene.apache.org/solr/)
Elastic Search (http://www.elasticsearch.org/)
Sphinx (http://sphinxsearch.com/)
Those services will allow you to do all sorts of funky text searches (including wildcards) in the blink of an eye ;-)

Performance issues with mongo + PHP with pagination, distinct values

I have a MongoDB collection that contains lots of books with many fields. Some key fields which are relevant for my question are:
{
book_id : 1,
book_title :"Hackers & Painters",
category_id : "12",
related_topics : [ {topic_id : "8", topic_name : "Computers"},
{topic_id : "11", topic_name : "IT"}
]
...
... (at least 20 fields more)
...
}
We have a form for filtering results (with many inputs/selectboxes) on our search page. And of course there is also pagination. With the filtered results, we show all categories on the page. For each category, the number of results found in that category is also shown on the page.
We are trying to use MongoDB instead of PostgreSQL, because performance and speed are our main concerns for this process.
Now the question is :
I can easily filter results by feeding the "find" function all filter parameters. That's cool. I can paginate results with the skip and limit functions:
$data = $lib_collection->find($filter_params, array())->skip(20)->limit(20);
But I have to collect the number of results found for each category_id and topic_id before pagination occurs. And I don't want to "foreach" all results, collect categories and manage pagination in PHP, because the filtered data often consists of nearly 200,000 results.
Problem 1: I found the mongodb::command() function in the PHP manual with a "distinct" example. I think I can get distinct values with this method. But the command function doesn't accept conditional parameters (for filtering). I don't know how to apply the same filter params while asking for distinct values.
Problem 2: Even if there is a way of sending filter parameters with the mongodb::command function, this will be another query in the process and take approximately the same time (maybe more) as the previous query, I think. And that will be another speed penalty.
Problem 3: Getting distinct topic_ids with the number of results will be yet another query, another speed penalty :(
I am new to working with MongoDB. Maybe I am looking at the problems from the wrong point of view. Can you help me solve the problems and give your opinions about the fastest way to get:
filtered results
pagination
distinct values with number of results found
from a large data set.
So the easy way to do filtered results and pagination is as follows:
$cursor = $lib_collection->find($filter_params, array());
$count = $cursor->count();
$data = $cursor->skip(20)->limit(20);
However, this method may be somewhat inefficient. If you query on fields that are not indexed, the only way for the server to count() is to load each document and check. If you do skip() and limit() with no sort(), then the server just needs to find the first 20 matching documents, which is much less work.
The number of results per category is going to be more difficult.
If the data does not change often, you may want to precalculate these values using regular map/reduce jobs. Otherwise you have to run a series of distinct() commands or in-line map/reduce. Neither one is generally intended for ad-hoc queries.
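For the distinct() route, the PHP driver's command() does accept the same filter, so you can at least reuse your $filter_params. A sketch (the collection name 'books' and the $db database handle are assumptions about your setup):

// Distinct category ids among the filtered documents (one command per field).
$result = $db->command(array(
    'distinct' => 'books',          // collection name
    'key'      => 'category_id',
    'query'    => $filter_params,   // same filter as the find()
));

$counts = array();
foreach ($result['values'] as $categoryId) {
    // Counting per value still costs one extra count() per category,
    // which is why this is not great for ad-hoc queries.
    $counts[$categoryId] = $lib_collection->count(
        array_merge($filter_params, array('category_id' => $categoryId))
    );
}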
The only other option is basically to load all of the search results and then count on the webserver (instead of the DB). Obviously, this is also inefficient.
Getting all of these features is going to require some planning and tradeoffs.
Pagination
Be careful with pagination on large datasets. Remember that skip() and limit() - no matter whether you use an index or not - will have to perform a scan. Therefore, skipping very far is very slow.
Think of it this way: the database has an index (B-Tree) that can compare values to each other: it can tell you quickly whether something is bigger or smaller than some given x. Hence, search times in well-balanced trees are logarithmic. This is not true for count-based indexation: a B-Tree has no way to tell you quickly what the 15,000th element is; it has to walk and enumerate the entire tree.
From the documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases, skip will become slower and more CPU intensive, and possibly IO bound, with larger collections.
Range-based paging provides better use of indexes but does not allow you to easily jump to a specific page.
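A small sketch of range-based paging with the PHP driver, reusing your $lib_collection and $filter_params and assuming results are ordered by _id (any indexed, strictly increasing sort key works the same way):

// First page: no lower bound yet.
$page = $lib_collection->find($filter_params)
                       ->sort(array('_id' => 1))
                       ->limit(20);

// Remember the last _id that was shown on this page ...
$lastId = null;
foreach ($page as $doc) {
    $lastId = $doc['_id'];
}

// ... and use it as the lower bound of the next page instead of skip().
$nextPage = $lib_collection->find(
        array_merge($filter_params, array('_id' => array('$gt' => $lastId)))
    )
    ->sort(array('_id' => 1))
    ->limit(20);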
Make sure you really need this feature: typically, nobody cares about the 42436th result. Note that most large websites never let you paginate very far, let alone show exact totals. There's a great website about this topic, but I don't have the address at hand nor the name to find it.
Distinct Topic Counts
I believe you might be using a sledgehammer as a flotation device. Take a look at your data: related_topics. I personally hate RDBMSs because of object-relational mapping, but this seems to be the perfect use case for a relational database.
If your documents are very large, performance is a problem and you hate ORM as much as I do, you might want to consider using both MongoDB and the RDBMS of your choice: Let MongoDB fetch the results and the RDBMS aggregate the best matches for a given category. You could even run the queries in parallel! Of course, writing changes to the DB needs to occur on both databases.

Compare/Diff Multiple (>Millions) Arrays

I'm not sure if this is possible, but I have millions of "lists" in a MySQL database, and would like to develop a system where I take one of the lists and compare it against all of the other lists in the database and return:
1.) Lists that closely resemble the primary list (some sort of % would be great)
2.) Given certain items in a list, it would return a list of items that are included in the majority of all the other lists (i.e. autocomplete a list based on popular options).
I would've initially thought this would be possible if I could create some sort of 'loose hash' so I could compare lists mathematically, but I haven't been able to find a solution that scales (since this is exponential when tackled head-on).
Any new ideas/solutions would be greatly appreciated. Thanks!
Your basic MD5 is a (somewhat) loose hash, supported by both PHP and MySQL and quite fast at these kinds of things. Just get an MD5 of whatever data and compare it to the others.
Do it in PHP: store the MD5 of the data as an array key and use isset().
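A minimal sketch of that in PHP, assuming each list is an array of its items and that you only need to spot exact duplicates (after normalising the item order):

// Fingerprint a list in a way that ignores item order.
function listFingerprint(array $items) {
    sort($items);                         // so array('a','b') and array('b','a') hash the same
    return md5(implode("\x00", $items));  // "\x00" separator avoids collisions from plain joining
}

$seen = array();
foreach ($lists as $listId => $items) {   // $lists: listId => array of items
    $hash = listFingerprint($items);
    if (isset($seen[$hash])) {
        echo "List $listId is a duplicate of list {$seen[$hash]}\n";
    } else {
        $seen[$hash] = $listId;
    }
}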
Your part 2) "Given certain items in a list, it would return a list of items that are included in the majority of all the other lists (i.e. autocomplete a list based on popular options)"
is not very clear, but I interpret it as: given a few items, find all lists that contain all or most of those items.
This should be easy once you create an index on your list elements, essentially like a hash table. The exact query will depend on your requirements, the length of the lists (whether that is a factor in defining the specs), etc.
If you're saying there are millions of lists, it is really not an option to load them all into a PHP script.
You could get the values of the list you are comparing the others to, and then run an SQL query similar to this:
SELECT list_id, COUNT(value) as c FROM lists WHERE value IN (a,b,c) GROUP BY list_id
ORDER BY c DESC
I'm not sure the SQL is correct, but the idea is to select the ids of the lists that have the same members and then sort the output by the number of list items that intersect with the original list. The percentage of item correspondence is easily obtained in this case.
