MongoDB category tallies of found set

I have a product collection. Most products have a category, a sub-category and a subsub-category; some only have 1 or 2 of those. I'm currently storing them in an array field 'category', it could look like ["german", "literature", "novels"], for a product of type "book" (there are about 15 types, each with its own category tree).
What I would like to do is do a search, maybe there's 10K matches, return 100 to the browser, and also present a list of categories with found-counts for the query. I don't know what the categories are in advance, and they can change also.
Different ways I'm looking at:
- MapReduce, but I hear this is "slow" and better geared to daily statistics than live searches
- Aggregation with $group (one suggestion I got): I looked at this, but I cannot see how it could count values instead of just summing or averaging them. Am I missing something?
- a second search that returns just the category field, for all products, so I can do the counts in the production code
- a looped search for each category that simply returns count() of the cursor. For this to work I would obviously need to know the categories, and it seems like a last resort.
Basically my question is "what is the best way?" It should be reasonably fast, and scale.
When this works, it's the same after the user clicks on a category - then the results should be tallied for the sub-categories of that category, and so on for the subsub-categories, if any.
Additional info: the collection will eventually have maybe a few million products. As we don't have the data yet it's hard to test against that; there are only about 50K products currently. Future plans include a sharded setup (there's a lot of other data besides "products").
Am I storing the categories in the right way, or should they be separate fields? Would that help? There are 3 items in the array right now, but this could increase later.
New to MongoDB, only worked lots with MySQL so far..
Clarifying the categories; for an example product of type "book", "german" will be the main category, "literature" a sub-category and "novels" its subsub-category. Other main categories are 5-6 other languages (for books), other subcategories are for example "academic & study", "business" or "travel & languages". Subsub-categories then depend on the sub-category (for that last, the SSC's could be "foreign language study", "sociolinguistics", ..). I am storing all three in one field, as an array, per product.
When someone does a search for "foo" on type "book", it'll find 123 products in English, 456 products in German, 789 products in French. What I want is to show a listing of all those main (language) categories in which products were found, along with the number of found products.
Then when someone selects "German", it will do another query and show the number of found German books, by subcategory (44 in "academic & study", 57 in "business", ...).

I'm currently storing them in an array field 'category', it could look like ["german", "literature", "novels"]
You should not use one array for three different fields, which are "category", "subcategory" and "sub-subcategory".
Also, why store the language as a category and not as "language"? Add a bit of logic to the "schema" of your database, since it will help you when things become more complicated.
If you do, it will be much easier to use aggregation (which is faster than Hadoop and works in a sharded cluster), because you won't have to query inside the arrays and you can get more accurate results. Since the values are really small, the field names should be small too ("c" for category, "sc" for subcategory, "ssc" for sub-subcategory), like this:
{ _id : xxxxxxxxxxxx , name : "A novel of german literature" , c : "german", sc : "literature", ssc : "novels" }
Since MongoDB is schema-less, you don't have to set all of these fields for every record. If you plan to have very different schemas between products, maybe you should use a different collection for each product type, but that is up to you.
What I would like to do is do a search, maybe there's 10K matches, return 100 to the browser, and also present a list of categories with found-counts for the query. I don't know what the categories are in advance, and they can change also.
Make good use of indexes (there are many kinds of indexes and you should probably use more than one), and use aggregation with $group, plus $limit to return just 100 records.
When this works, it's the same after the user clicks on a category - then the results should be tallied for the sub-categories of that category, and so on for the subsub-categories, if any.
Here is a sample query to get all subcategories of a category (using the schema described before):
db.products.aggregate([{ $match : { "c" : "german" } }, { $group : { _id : "$c", subcategories : { $addToSet : "$sc" } } }])
This query will return an array of all the subcategories that exist for the current category.
(Updated query in case your category is an array and not a single string)
db.products.aggregate([{ $match : { "c" : { $in : ["german", "english"] } } }, { $group : { _id : "$c", subcategories : { $addToSet : "$sc" } } }])
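The question also asked how $group can count documents rather than sum or average a field. The usual idiom is an accumulator of { $sum : 1 }, which adds 1 for every document in a group. Below is a minimal Python sketch of that idea, assuming the c / sc / ssc fields suggested above. The sample documents and the helper names tally_pipeline and tally_in_memory are made up for illustration, and the in-memory function only mimics simple equality matches so the result can be checked without a server.

```python
from collections import Counter

# Hypothetical sample documents using the c / sc / ssc fields suggested above.
products = [
    {"name": "foo novel", "type": "book", "c": "german", "sc": "literature", "ssc": "novels"},
    {"name": "foo guide", "type": "book", "c": "german", "sc": "business"},
    {"name": "foo story", "type": "book", "c": "english", "sc": "literature", "ssc": "novels"},
    {"name": "foo atlas", "type": "book", "c": "german", "sc": "travel & languages"},
]


def tally_pipeline(match, field):
    """Build an aggregation pipeline that counts matching documents per value of `field`."""
    return [
        {"$match": match},
        # {"$sum": 1} adds 1 per document, so each group ends up with its document count.
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def tally_in_memory(docs, match, field):
    """Pure-Python stand-in for the pipeline above (equality matches only)."""
    hits = [d for d in docs if all(d.get(k) == v for k, v in match.items())]
    return Counter(d[field] for d in hits if field in d)


# First level: counts per main category for the found set.
main_counts = tally_in_memory(products, {"type": "book"}, "c")
# After the user clicks "german": counts per subcategory within that category.
sub_counts = tally_in_memory(products, {"type": "book", "c": "german"}, "sc")
print(main_counts)
print(sub_counts)

# With pymongo (assumption: `db.products` is an existing collection), the same tally would be
# db.products.aggregate(tally_pipeline({"type": "book", "c": "german"}, "sc"))
```

The same pipeline shape works at every level: add the clicked category to the $match stage and group on the next field down.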

Related

Codeigniter Active Records LIKE Query

I am making a news website with Codeigniter, and I have an Articles MySQL table like
ID,Title,Body,Categories,Created etc...
In the Categories field I have categories separated by commas (,), like...
Article 1 Categories : National,Crime,Cinema
Article 2 Categories : National,City,Drama
Article 3 Categories : Funny,International,Cinema
Article 4 Categories : National,Crime,Cinema
I want to fetch articles with a specific category, like National (articles 1, 2, 4).
I tried many methods but nothing seems to work.
Please Help Thanks.

You can use the FIND_IN_SET function to query your Categories field:
FIND_IN_SET('Crime', your_table.Categories)
Your approach has a number of shortcomings. It would definitely be more scalable in the long run to change your table's relationship to categories: you can use a many-to-many relationship and a join table to query your categories more easily.
FIND_IN_SET will do a full table scan, and this comma-separated format makes it very difficult to aggregate and get article/category counts.
Is storing a delimited list in a database column really that bad?
Bill Karwin has included this anti pattern as the first chapter in his excellent book.
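As a rough illustration of the many-to-many layout suggested above, here is a small Python sketch using the standard sqlite3 module instead of MySQL and CodeIgniter, so it can be run anywhere. The table and column names (articles, categories, article_category) are assumptions, not taken from the question; the sample rows mirror the four example articles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Join table: one row per (article, category) pair.
    CREATE TABLE article_category (
        article_id INTEGER REFERENCES articles(id),
        category_id INTEGER REFERENCES categories(id),
        PRIMARY KEY (article_id, category_id)
    );
""")

# The four example articles from the question.
sample = {
    1: ["National", "Crime", "Cinema"],
    2: ["National", "City", "Drama"],
    3: ["Funny", "International", "Cinema"],
    4: ["National", "Crime", "Cinema"],
}
for article_id, names in sample.items():
    conn.execute("INSERT INTO articles (id, title) VALUES (?, ?)",
                 (article_id, f"Article {article_id}"))
    for name in names:
        conn.execute("INSERT OR IGNORE INTO categories (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO article_category SELECT ?, id FROM categories WHERE name = ?",
            (article_id, name),
        )

# Articles in a given category: a plain join, no string matching.
national_ids = [row[0] for row in conn.execute("""
    SELECT a.id FROM articles a
    JOIN article_category ac ON ac.article_id = a.id
    JOIN categories c ON c.id = ac.category_id
    WHERE c.name = ? ORDER BY a.id
""", ("National",))]

# Article counts per category, which is awkward with a comma-separated column.
counts = dict(conn.execute("""
    SELECT c.name, COUNT(*) FROM categories c
    JOIN article_category ac ON ac.category_id = c.id
    GROUP BY c.name
"""))
print(national_ids, counts)
```

With this layout, the category column of the join table can be indexed, so the lookup does not need to scan every article row.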

Optimal database structure for entries in flexible category/subcategory system?

I want to store reviews in a flexible system of categories and subcategories, and am currently designing the database structure for that. I have an idea how to do it, but I'm not entirely sure whether it could be done more elegantly and/or efficiently. These are my thoughts; if anybody can comment on if/how this can be improved I'd be really grateful.
(To keep this post concise, I only list the important fields for the tables.)
1.) The reviews are stored in the table "reviews". It has the following fields:
id: unique ID, auto-incrementing.
title: the title that will show up in <head><title>, etc.
stub: a version of the title without spaces, special chars, etc. so it can be part of the URL/URI
text: the actual content
2.) All categories are in the same table "categories"
id: unique ID, auto-incrementing.
title: the full title/name of the category as it will be output on the website
stub: version of the title that will be shown in the URL/URI.
parent_id: if this is a subcategory, here is the categories.id of the parent category. Else this is 0.
order_number: simple number to order the categories by (for display in the navigation menu)
3.) Now I need an indicator of which reviews are in which categories. They can be in multiple categories. My first idea was to add a "review_list" field to the categories and have it contain all reviews.id's that should be in this category. However, I think that adding and removing reviews from categories would be a hassle and inelegant. So my current idea is to have a table "review_in_category" with an entry for every review-category relation. The structure is:
id: Unique ID, auto-increment.
review_id: the reviews.id
category_id: the categories.id
So if a review is in 3 different categories it would result in 3 entries in the "review_in_category" table.
The idea is that when a user opens www.mydomain.de/animation/sci-fi/ the wrapper script will break up the URL into its parts. If it finds more than one category with category.stub = "sci-fi", it will check which of those has a parent category with the stub "animation". Once the correct category is identified (most of the time the stubs are unique anyway, so this check can be skipped) I want to SELECT all review_id's from "review_in_category" where the category_id matches the one determined by the wrapper script. All the review_id's are put into an array. A loop will iterate through this array and compose the SELECT statement for listing all review titles (and create links to them using the stub values) by "SELECT title, stub FROM reviews WHERE id=review_list[$counter]" and then add "OR id=review_list[$counter]" until the array is completely traversed.
So my questions are:
- Is creating a single SELECT statement with a potentially large number of "OR id=" parts an elegant and/or efficient way to handle this situation, or are there better variants?
- Does using a "taxonomy"-style table (review_in_category) make sense or would it be better to store the "membership"/"relation" directly in the reviews or category tables?
- Any other thoughts... I just started to learn this stuff and appreciate any feedback.
Thank you

Your design looks sound.
To retrieve all reviews in a category, you should use a join:
SELECT reviews.title, reviews.stub FROM reviews JOIN review_in_category ON reviews.id = review_in_category.review_id WHERE review_in_category.category_id = $category
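To make the join concrete, here is a sketch in Python with the standard sqlite3 module (not the asker's actual stack). It uses the three tables described in the question and goes one step further by also resolving the /animation/sci-fi/ style URL inside the same query through a self-join on categories. The sample rows and the function name reviews_for_path are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (id INTEGER PRIMARY KEY, title TEXT, stub TEXT);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, title TEXT, stub TEXT, parent_id INTEGER);
    CREATE TABLE review_in_category (id INTEGER PRIMARY KEY, review_id INTEGER, category_id INTEGER);

    -- Hypothetical sample data: two "sci-fi" subcategories under different parents.
    INSERT INTO categories VALUES (1, 'Animation', 'animation', 0),
                                  (2, 'Live Action', 'live-action', 0),
                                  (3, 'Sci-Fi', 'sci-fi', 1),
                                  (4, 'Sci-Fi', 'sci-fi', 2);
    INSERT INTO reviews VALUES (1, 'Robots in Space', 'robots-in-space'),
                               (2, 'Moon Base', 'moon-base'),
                               (3, 'Drawn Stars', 'drawn-stars');
    INSERT INTO review_in_category (review_id, category_id)
        VALUES (1, 3), (3, 3), (2, 4), (1, 4);
""")


def reviews_for_path(parent_stub, child_stub):
    """Return (title, stub) of all reviews in the subcategory identified by both stubs."""
    rows = conn.execute("""
        SELECT r.title, r.stub
        FROM categories child
        JOIN categories parent ON parent.id = child.parent_id
        JOIN review_in_category ric ON ric.category_id = child.id
        JOIN reviews r ON r.id = ric.review_id
        WHERE parent.stub = ? AND child.stub = ?
        ORDER BY r.id
    """, (parent_stub, child_stub))
    return rows.fetchall()


print(reviews_for_path("animation", "sci-fi"))
```

One query replaces both the loop that collects review ids and the long chain of "OR id=" conditions, and the bound parameters avoid building SQL from URL parts by hand.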

Select items that may be related (like: for 'orange', give 'bread') from MySQL database using PHP

I have a system in which I have to select "similar" records. Imagine a database containing a big list of products; when the user enters a partial name of a product, a list of products comes up as suggestions for what he is searching for. These products have a longer description field too.
This is NOT about a WHERE product_name LIKE '%entered_string%' query, I think. The logic is akin to the one Stack Overflow might use, id est: when you ask a question, it prompts you with Questions that may already have your answer and Similar questions, both obviously using a method to derive what I want to ask from my question title/content and search against the database, showing the results.
I just wonder whether it is accomplishable with PHP and using MySQL as the database.
Example:
Entering food should give us results like 1kg oranges, bread and cookies. All of these would have something in common which could help to link them programmatically to each other.

There can be lots of ways to approach this scenario, but I think the most straightforward one is to have multiple keywords/tags mapped to every item. When the user types, you would not search the item table; you would search the mapped keywords and load the relevant items based on that search.
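A tiny Python sketch of that keyword mapping, purely to show the lookup direction (keywords first, then items). In a real system the mapping would live in database tables; the items, keywords and the function name suggest here are all hypothetical.

```python
from collections import defaultdict

# Hypothetical items and the keywords/tags mapped to each of them.
items = {1: "1kg oranges", 2: "bread", 3: "cookies", 4: "Ford Focus"}
item_keywords = {
    1: ["food", "fruit"],
    2: ["food", "bakery"],
    3: ["food", "bakery", "sweet"],
    4: ["car"],
}

# Inverted index: keyword -> ids of the items tagged with it.
keyword_index = defaultdict(set)
for item_id, keywords in item_keywords.items():
    for keyword in keywords:
        keyword_index[keyword.lower()].add(item_id)


def suggest(term):
    """Return names of items whose keywords start with the (partial) search term."""
    term = term.strip().lower()
    matched_ids = set()
    for keyword, item_ids in keyword_index.items():
        if keyword.startswith(term):
            matched_ids |= item_ids
    return [items[item_id] for item_id in sorted(matched_ids)]


print(suggest("food"))
print(suggest("ba"))
```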

If you want similar products to show up, you need to put that information in your database.
So, make a category for foods, and assign every food product to that category. That way you can select similar products easily. There is no other efficient way to do this
So your database:
categories:
|id|name|
|1|fruit|
|2|Cars|
Products:
|id|name|category_id|
|1|apple|1|
|2|Ford focus|2|
And you can select like this:
SELECT `name`,`id` FROM `products` WHERE category_id = 1;
Another way (as suggested in a comment) is to use tags:
Products:
|id|name|tags|
|1|apple|"fruit food delicious"|
|2|Ford focus|"Car wheels bumper"|
The best way to query these is a fulltext search on the tags:
SELECT * FROM `products` WHERE MATCH(tags) AGAINST ('fruit')
Make sure to have a fulltext index on tags.

How to Handle Consuming Lots of Data from Multiple Sources in a Web SIte

This is a "meta" question that I am asking in an effort to better understand some tough nuts I've had to crack lately. Even if you don't get precisely what I'm reaching for here or there is too much text to read through, any practical input is appreciated and probably useful.
Assume you have a website that needs to use data that is stored in multiple tables of a database. That data will need to be iterated through in a multitude of ways, used for calculations in various places, etc.
So on a page that needs to display a collection of projects (from one db table) that each contain a collection of categories (from another db table) that each contain 1 or more items (from another db table) what is the best way to gather the data, organize it and iterate through it for display?
Since each project can have 1 or more categories and each category can have one or more items (but the items are unique to a specific category) what's the best way to organize the resulting pile?
My goal in the below example is to generate a table of projects where each project has the associated categories listed with it and each category has the associated items listed with it but I also need to aggregate data from the items table to display next to the project name
A Project Name (43 items and 2 of them have errors!)
- category 1
- item 1
- item 2
- category 2
- item 1
Another Project Name (12 items and no errors)
- category 1
- item 1
- category 2
- item 1
What I did was to retrieve the data from each table and stick it in a variable. Giving me something like:
$projects = array("id" => 1, "proj_id" => 1, "name" => "aname");
$categories = array("id" => 1, "cat_id" => 1234, "proj_id" => 1, "cat_name" => "acatname");
$items = array("id" => 1, "item_id" => 1234, "location" => "katmandu");
Then I went through the variables in nested foreach() loops building the rows I needed to display.
I ran into difficulties with this, as the foreach() loop would work fine when building something 2 levels deep (associating categories with projects) but it did not work as expected when I went three levels deep (I N C E P T I O N .. hah, couldn't resist) and tried adding the items to each category (it added all of them to one item instead... first or last, I don't recall which). Also, when something was present in the third level of the array, how would you add up that data and then get it out for use back up in the top level of the array being built?
I suppose I could have constructed a mega SQL query that did it all for me and put everything into a single array, saving me the loop confusion by flattening it out, but... well, that's why I'm here asking you all.
So, I suppose the heart of this question is: How do you handle getting lots of data from different tables and then combining it all for display and use in calculations?

Sounds like you're going to want to use SQL JOINs. Consider looking into them:
http://www.w3schools.com/sql/sql_join_left.asp
They'll pull data from multiple tables and aggregate it. It won't produce quite what you're looking for, but it will produce something that you can use in a different way.
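One way to use the flat result of such a JOIN is to fold the rows into a nested structure in a single pass, keeping the per-project totals up to date as you go. The sketch below is in Python rather than PHP and assumes each joined row carries (project, category, item, has_error); that row shape, the sample rows and the function name nest_rows are assumptions made for the example.

```python
# Hypothetical rows, shaped like the result of joining projects, categories and items.
rows = [
    ("A Project Name", "category 1", "item 1", False),
    ("A Project Name", "category 1", "item 2", True),
    ("A Project Name", "category 2", "item 1", False),
    ("Another Project Name", "category 1", "item 1", False),
    ("Another Project Name", "category 2", "item 1", False),
]


def nest_rows(joined_rows):
    """Group flat joined rows into project -> category -> items, with totals per project."""
    projects = {}
    for project, category, item, has_error in joined_rows:
        entry = projects.setdefault(project, {"categories": {}, "items": 0, "errors": 0})
        items = entry["categories"].setdefault(category, [])
        if item is not None:  # a LEFT JOIN yields None for categories without items
            items.append(item)
            entry["items"] += 1
            entry["errors"] += int(has_error)
    return projects


nested = nest_rows(rows)
for name, entry in nested.items():
    print(f"{name} ({entry['items']} items, {entry['errors']} with errors)")
    for category, items in entry["categories"].items():
        print(f"- {category}")
        for item in items:
            print(f"  - {item}")
```

Because the totals are accumulated while the rows are being grouped, there is no need to walk back up from the third level afterwards.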

Is Hadoop the sort of thing you're looking for?

How to find "related items" in PHP

We often see 'related items'. For instance, blogs have related posts, book sites have related books, etc. My question is: how is that relevancy compiled? If it's based only on tags, I often see related items that do not have the same tag. For instance, when searching for 'pink', a related item could have a 'purple' tag.
Does anyone have any idea?

There are many ways to calculate similarity of two items, but for a straightforward method, take a look at the Jaccard Coefficient.
http://en.wikipedia.org/wiki/Jaccard_index
Which is: J(A,B) = |intersection(A,B)| / |union(A,B)|
So let's say you want to compute the coefficient of two items:
Item A, which has the tags "books, school, pencil, textbook, reading"
Item B, which has the tags "books, reading, autobiography"
intersection(A,B) = books, reading
union(A,B) = books, school, pencil, textbook, reading, autobiography
so J(A,B) = 2/6 = .333
So the most related item to A would be the item which results in the highest Jaccard Coefficient when paired with A.
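The calculation above is short enough to sketch directly. This is Python rather than PHP; items A and B use the tags from the example, while item C and the helper most_related are made up to show the ranking step.

```python
def jaccard(tags_a, tags_b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:  # two items without any tags have nothing in common
        return 0.0
    return len(a & b) / len(union)


item_a = ["books", "school", "pencil", "textbook", "reading"]
item_b = ["books", "reading", "autobiography"]
print(jaccard(item_a, item_b))


def most_related(target_tags, candidates):
    """Return the name of the candidate with the highest coefficient against the target."""
    return max(candidates, key=lambda name: jaccard(target_tags, candidates[name]))


# Item C is a hypothetical third item, added only to have something to rank against.
candidates = {"B": item_b, "C": ["cars", "wheels"]}
print(most_related(item_a, candidates))
```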

Here are some of the ways:
- Manually connecting them. Put up a table with the fields item_id and related_item_id, then make an interface to insert the connections. Useful to relate two items that are related but have no resemblance or do not belong to the same category/tag (or in an uncategorized entry table). Example: Bath tub and rubber ducky
- Pull up some items that belong to the same category or have a similar tag. The idea is that those items must be somewhat related since they are in the same category. Example: in the page viewing LCD monitors, there are random LCD monitors (with same price range/manufacturer/resolution) in the "Related items" section.
- Do a text search matching current item's name (and or description) against other items in the table. You get the idea.

To get a simple list of related items based on tags, the basic solution goes like this:
3 tables: one with items, one with tags and one with the connections. The connection table consists of two columns, one for each id from the other two tables. An entry in the connection table links a tag with an item by putting their respective ids in a row.
Now, to get that list of related items:
Fetch all items which share at least one tag with the original item. Be sure to fetch the tags along with the items, and then use a simple rating mechanism to determine which item shares the most tags with the original one. Each shared tag increases the relation-relevancy by one.
Depending on your tagging habits, it might be smart to add some counter-mechanism to prevent large overarching tags from skewing the relevancy. To achieve this, you could give greater weight to tags used fewer times than a certain threshold. A threshold which has generally worked nicely for me is total_number_of_tag_uses/total_number_of_tags, which is the average number of uses per tag. If a tag's use count is smaller than the average, it increases the relation-relevancy by two instead of one.
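Here is one possible reading of that rating mechanism as a Python sketch, with the tag data held in a dict instead of the three tables. Each shared tag scores 1, and a shared tag used less often than the average scores 2. The items, tags and the function name related are invented for the example.

```python
from collections import Counter

# Hypothetical item -> tags mapping (in a database this comes from the connection table).
item_tags = {
    "A": {"books", "school", "pencil"},
    "B": {"books", "reading"},
    "C": {"pencil", "school", "books"},
    "D": {"books"},
    "E": {"pencil", "art"},
}

# How often each tag is used, and the average number of uses per tag (the threshold).
tag_uses = Counter(tag for tags in item_tags.values() for tag in tags)
average_uses = sum(tag_uses.values()) / len(tag_uses)


def related(item):
    """Rank other items by shared tags; tags used less than average count double."""
    scores = {}
    for other, tags in item_tags.items():
        if other == item:
            continue
        shared = item_tags[item] & tags
        if shared:
            scores[other] = sum(2 if tag_uses[tag] < average_uses else 1 for tag in shared)
    # Highest score first; ties broken by item name to keep the order stable.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


print(related("A"))
```

In this sample "books" is on four of the five items, so sharing it counts for little, while the rarer "school" tag pushes C clearly ahead of the rest.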

It can be more than a tag; for example, it can be the average of each word appearing in a paragraph, and then in titles, etc.

I would say they use an ontology for that, which also adds more great features to the application.

It can also be based on "people who bought this book also bought".
No matter how, you will need some sort of connection between your items, and these will mostly be made by human beings.

This is my implementation (gist) of the Jaccard index with PostgreSQL and Ruby on Rails...

Here is an implementation of the Jaccard index between two texts, based on bigrams.
https://packagist.org/packages/darkopetreski/textcategorization
