MongoDB Indexing vs Array Implementation for our specific application - php

Here is the issue. We are working with MongoDB-PHP.
In our application, we have many user groups where users can make posts. At present we keep the post IDs of each group in that group's document, in array format, so that when we need the first 10 posts we can grab them from the array with a slice operation.
Eg: Case 1
collection posts: //this collection stores all the posts of the various groups
{
{"_id":"1","post_text":"....",...}
{"_id":"2","post_text":"....",...}
}
collection groups: //this collection contains one document per group
{
{
"_id":"1",
"name":"Group ABC",
"post_ids":["1","2"...."100"]
//1,2..100 represent the MongoIDs of this group's posts
//so I can slice the first 10 posts of this group when someone visits its page
}
}
The alternative: instead of storing the post IDs in the group's document, we store the group ID in each post and index the posts collection on that field.
Eg: Case 2
collection posts
{
{"_id":"1","group_id":"1","post_text":"....",...}
{"_id":"2","group_id":"2","post_text":"....",...}
}
Also note that in Case 1 we do not have to sort anything, because array elements are pushed in order, while in Case 2 we have to apply a sort (by timestamp) after the find operation, which would read all matching documents from memory and then sort them.
Which approach would perform better, taking into consideration that the indexes would be held in RAM?
Please let me know if the issue is not clear from this question.

Doing one query (case #2) would be faster than doing two queries. Also, making documents bigger (e.g., appending new posts to post_ids in #1) is a fairly slow operation.

Related

laravel conditional join based on optional external keys

I have this DB structure:
ddts contains 3 optional external (foreign) keys: only one of panel_id, sawn_id or veneer_id can contain an ID, and the other two are null.
So a ddt can be exactly one of these 3 types:
sawn
panel
veneer
For every company_id (another external key) I need to sum some data from panels, sawns and veneers, but before summing I also need to convert some of the values to kg (with a function I implemented myself).
In the ddts model I have the methods panel(), sawn() and veneer().
I only need the final sum, but I guess that to get it I have to build up a huge collection and then manipulate it...
I'd like to understand what is best done in the query and what in code.
My first approach was:
1. select all companies
2. in a foreach loop, get all ddts for each company
3. in a foreach loop, determine the type of each ddt through an if statement
4. in a foreach loop, select the value to sum
5. convert it to kg where necessary
6. sum it
but it seems so long, and I'm quite sure that points 2 and 3 should be done by a JOIN, though it's not clear to me how!

PHP, MySQL, Efficient tag-driven search algorithm

I'm currently building a webshop. This shop allows users to filter products by category, plus a couple of optional, additional filters such as brand, color, etc.
At the moment, various properties are stored in different places, but I'd like to switch to a tag-based system. Ideally, my database should store tags with the following data:
product_id
tag_url_alias (unique)
tag_type (unique) (category, product_brand, product_color, etc.)
tag_value (not unique)
First objective
I would like to search for product_id's that are associated with anywhere between 1-5 particular tags. The tags are extracted from an SEO-friendly URL. So I will be retrieving a unique string (the tag_url_alias) for each tag, but I won't know the tag_type.
The search will be an intersection, so my search should return the product_id's that match all of the provided tags.
Second objective
Besides displaying the products that match the current filter, I would also like to display the product-count for other categories and filters which the user might supply.
For instance, my current search is for products that match the tags:
Shoe + Black + Adidas
Now, a visitor of the shop might be looking at the resulting products and wonder which black shoes other brands have to offer. So they might go to the "brand" filter and choose any of the other listed brands. Let's say they have 2 different options (in practice there will probably be many more), resulting in the following searches:
Shoe + Black + Nike > 103 results
Shoe + Black + K-swiss > 0 results
In this case, if they see the brand "K-swiss" listed as an available choice in their filter, their search will return 0 results.
This is obviously rather disappointing to the user... I'd much rather know in advance that switching the "brand" from "adidas" to "k-swiss" will return 0 results, and simply remove that option from the filter.
Same thing goes for categories, colors, etc.
In practice this would mean a single page view would not only return the filtered product list described in my primary objective, but potentially hundreds of similar yet different lists. One for each filter value that could replace another filter value, or be added to the existing filter values.
Capacity
I suspect my database will eventually contain:
between 250 and 1.000 unique tags
And it will contain:
between 10.000 and 100.000 unique products
Current Ideas
I did some Google searches and found the following article: http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
Judging by that article, running hundreds of queries to achieve the 2nd objective, is going to be a painfully slow route. The "toxy" example might work for my needs and it might be acceptable for my First objective, but it would be unacceptably slow for the Second objective.
I was thinking I might run individual queries that match 1 tag to its associated product_id's, cache those queries, and then calculate intersections on the results. But do I calculate these intersections in MySQL or in PHP? If I use MySQL, is there a particular way I should cache these individual queries, or is supplying the right indexes all I need?
I would imagine it's also quite possible to maybe even cache the intersections between two of these tag/product_id sets. The amount of intersections would be limited by the fact that a tag_type can have only one particular value, but I'm not sure how to efficiently manage this type of caching. Again, I don't know if I should do this in MySQL or in PHP. And if I do this in MySQL, what would be the best way to store and combine this type of cached results?
The Sphinx search engine can do this magic for you. It is VERY fast, and it can even handle word forms, which can be useful with SEO requests.
In Sphinx terms: make a document for each "product", index it by its tags, choose a proper ranker for the query (e.g., MATCH_ALL_WORDS) and run a batch request with the different tag combinations to get the best results.
Don't forget to use a cache such as memcached or any other.
I did not test this yet, but it should be possible to have one query to satisfy your second objective rather than triggering several hundred queries...
The query below illustrates how this should work in general.
The idea is to combine the three different requests at once and group by the dedicated value and collect only those which have any results.
SELECT t3.tag_value, count(*) FROM tagtable t1, tagtable t2, tagtable t3 WHERE
t1.product_id = t2.product_id AND
t2.product_id = t3.product_id AND
t1.tag_type='yourcategoryforShoe' AND t1.tag_value='Shoe' AND
t2.tag_type='product_color' AND t2.tag_value='Black' AND
t3.tag_type='product_brand'
GROUP BY t3.tag_value
HAVING count(*) > 0
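As a rough, runnable illustration of the single-query idea, here is a minimal sketch using Python's built-in sqlite3 module instead of MySQL. The table name tagtable comes from the query above; the sample rows, the index name and the tag_type values 'category' and 'product_brand' are assumptions made up for the example. It counts, per brand, the products that are also tagged Shoe and Black, so brands with no match never show up in the result.

```python
import sqlite3

# In-memory database with hypothetical sample data (assumption, not from the post).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tagtable (product_id INTEGER, tag_type TEXT, tag_value TEXT)"
)
# An index on (tag_type, tag_value, product_id) lets each self-join use a lookup.
conn.execute(
    "CREATE INDEX idx_tag ON tagtable (tag_type, tag_value, product_id)"
)
rows = [
    (1, "category", "Shoe"), (1, "product_color", "Black"), (1, "product_brand", "Adidas"),
    (2, "category", "Shoe"), (2, "product_color", "Black"), (2, "product_brand", "Nike"),
    (3, "category", "Shoe"), (3, "product_color", "Black"), (3, "product_brand", "Nike"),
    (4, "category", "Shoe"), (4, "product_color", "White"), (4, "product_brand", "K-swiss"),
]
conn.executemany("INSERT INTO tagtable VALUES (?, ?, ?)", rows)

# One query: fix category and color, leave the brand open, count per brand.
facet_sql = """
SELECT t3.tag_value, COUNT(*)
FROM tagtable t1
JOIN tagtable t2 ON t2.product_id = t1.product_id
JOIN tagtable t3 ON t3.product_id = t1.product_id
WHERE t1.tag_type = 'category' AND t1.tag_value = 'Shoe'
  AND t2.tag_type = 'product_color' AND t2.tag_value = 'Black'
  AND t3.tag_type = 'product_brand'
GROUP BY t3.tag_value
"""
brand_counts = dict(conn.execute(facet_sql).fetchall())
print(brand_counts)
```

With this sample data K-swiss has no black shoe, so it is simply absent from brand_counts and can be dropped from the filter. One such query would be needed per filter type (brand, color, category), not one per filter value.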

PHP/MYSQL store variables in array or separate fields

Premature optimization is the root of all evil...but...
I am allowing users to input data within categories, such as favorite players, favorite teams, etc. They can then use these choices to filter results. I let them input lists separated by commas, so after exploding the data I have it in an array. The question is how to store it.
Method 1: I could create a table of users, one row per user, with the categories, as in players, teams as fields and save the choices of each users as an array in the respective field. (userid would link to basic users table.)
Method 2. Or I could create separate tables for each thing, players, teams, etc, and have a fixed number of fields say 10, break up the array into each individual value, store and place it in its own field. (Already have this code working.) (Again userid is primary key.)
The advantage of Method 1 is it's a bit simpler, one table, no limit on number of choices.
Method 2 seems a bit more robust. The data is more visible and possibly easier to get and retrieve--although maybe not.
Does anyone have experience with this sort of thing and could recommend one over another?
Thanks for any recommendations, suggestions!

best way to store similar music

I have millions of songs, each song has its unique Song ID. Corresponding to each Song ID I have some attributes like song name, artist name, album name, year etc.
Now, I have implemented a mechanism to find out similarity ratio between two songs.
It gives me a value between 0 - 100.
So, I need to show similar music to users, which cannot be computed at run time. I need to precompute the similarity values between each and every pair of songs.
Hence, if I create a DB with three attributes,
song1, song2, similarity
I will be having n*n records where n is the number of songs.
And whenever I want to fetch the similar music, I need to execute this query:
SELECT song2 WHERE song1 = x AND similarity > 80 ORDER BY similarity DESC;
Please suggest something to maintain such information.
Thanks.
I think you'd be better off comparing similarity to a "prototypical" song or classification. Devise a fingerprint mechanism that includes metadata about the song and whatever audio mechanism you use to judge similarity. Place each song into one (or more) categories and score the song within that category, that is, how closely it matches the prototype for the category using the fingerprint. Note that you could have hundreds or thousands of categories, i.e., they're not the typical categories that you think of when you think of music.
Once you have this done, you can then maintain indexes by category and when finding similar songs you devise a weight based on the category and similarity measures within the category -- say by giving greater weight to the category in which the song is closest to the prototype. Multiply the weight by the square of the difference between the candidate song and the current song to the prototype for the category. Sum the weights for the say top 3 categories with lower values being more similar.
This way you only need to store a few items of metadata for each song rather than keep relationship between pairs of songs. If the main algorithm runs too slowly, you could keep cached pair-wise data for the most common songs and default to the algorithmic comparison when a song isn't in your cached data set.
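One possible reading of that weighting scheme, as a small Python sketch. Everything concrete here is an assumption for illustration: the dictionaries mapping category to distance-from-prototype (lower means closer), the 1/rank weights, and the penalty value used when a candidate has no score in a category. The answer does not specify any of these.

```python
def similarity_score(current, candidate, top_n=3, missing=100.0):
    """Lower score means the candidate is more similar to the current song.

    `current` and `candidate` map category name -> distance to that
    category's prototype (hypothetical representation).
    """
    # Take the categories in which the current song is closest to the prototype.
    top_categories = sorted(current, key=current.get)[:top_n]
    score = 0.0
    for rank, category in enumerate(top_categories):
        # Assumed weighting: the closest category counts most (1, 1/2, 1/3, ...).
        weight = 1.0 / (rank + 1)
        # Difference between the two songs' distances to the same prototype.
        diff = candidate.get(category, missing) - current[category]
        score += weight * diff ** 2
    return score


current_song = {"a": 5.0, "b": 10.0, "c": 20.0, "d": 60.0}
close_song = {"a": 6.0, "b": 12.0, "c": 20.0}
far_song = {"a": 50.0, "d": 60.0}

print(similarity_score(current_song, close_song))
print(similarity_score(current_song, far_song))
```

Only the per-category distances are stored for each song, so storage grows with the number of songs rather than with the number of song pairs.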
What you are proposing will work; however, you can reduce the number of rows by storing each pair only once, and then modifying your query to match the song ID in either song1 or song2.
Something like:
SELECT if(song1=?,song2,song1) as similar WHERE (song1 = ? or song2 =?) AND similarity > 80 ORDER BY similarity DESC;
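A minimal runnable sketch of the store-each-pair-once idea, using Python's built-in sqlite3 module in place of MySQL. The table name song_similarity, the index name and the sample rows are assumptions for the example, and CASE is used where the MySQL query uses if(). Storing the smaller ID in song1 guarantees that a pair can only be inserted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; each unordered pair is stored once with song1 < song2.
conn.execute("""
    CREATE TABLE song_similarity (
        song1 INTEGER NOT NULL,
        song2 INTEGER NOT NULL,
        similarity INTEGER NOT NULL,
        PRIMARY KEY (song1, song2),
        CHECK (song1 < song2)
    )
""")
# The primary key covers lookups by song1; this index covers lookups by song2.
conn.execute("CREATE INDEX idx_song2 ON song_similarity (song2)")


def add_pair(a, b, similarity):
    """Insert a pair in canonical order (smaller ID first)."""
    low, high = sorted((a, b))
    conn.execute(
        "INSERT INTO song_similarity VALUES (?, ?, ?)", (low, high, similarity)
    )


def similar_to(song_id, threshold=80):
    """Return IDs of songs similar to song_id, most similar first."""
    sql = """
        SELECT CASE WHEN song1 = ? THEN song2 ELSE song1 END AS similar
        FROM song_similarity
        WHERE (song1 = ? OR song2 = ?) AND similarity > ?
        ORDER BY similarity DESC
    """
    params = (song_id, song_id, song_id, threshold)
    return [row[0] for row in conn.execute(sql, params)]


add_pair(1, 2, 95)
add_pair(3, 1, 85)
add_pair(1, 4, 40)
add_pair(2, 3, 90)

print(similar_to(1))
print(similar_to(3))
```

This halves the row count, but the table is still quadratic in the number of songs, so it would probably also make sense to store only pairs above the similarity threshold.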
Maintaining and accessing the similarity information seems to require massive computing power. For example, if you already have 2000 songs processed, you still need to run the similarity analysis 2000 times for the next new song. This may not scale, and the data scheme can make the database slow within a short period of time.
I recommend finding some patterns and tagging each song. For example, you can analyze the songs for patterns such as "blues", "rock" or "90's" and give them tags. If you want to find songs similar to a given song, you can just query by all the tags that the given song has, e.g. "New age", "Slow" and "techno".
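A small Python sketch of the tag approach, assuming songs are ranked by the number of tags they share with the given song. The song IDs, the tag sets and the ranking rule are made up for the example; the answer itself only says to query by the song's tags.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: song ID -> set of tags.
song_tags = {
    "s1": {"New age", "Slow", "techno"},
    "s2": {"New age", "Slow"},
    "s3": {"techno"},
    "s4": {"blues"},
}

# Inverted index: tag -> songs carrying that tag (what a DB index on tag gives you).
songs_by_tag = defaultdict(set)
for song, tags in song_tags.items():
    for tag in tags:
        songs_by_tag[tag].add(song)


def similar_by_tags(song):
    """Return (other_song, shared_tag_count) pairs, most shared tags first."""
    shared = Counter()
    for tag in song_tags[song]:
        for other in songs_by_tag[tag]:
            if other != song:
                shared[other] += 1
    # Sort by count descending, then by ID so the order is deterministic.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


print(similar_by_tags("s1"))
```

Adding a new song only requires tagging that one song, instead of comparing it against every song already stored.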

Weighing search results

PHP / MySQL backend. I've got a database full of movies YouTube-style. Each video has a name and category. Videos and categories have a m:n relationship.
I'd like my visitors to be able to search for videos by entering all their search terms in one search field. I can't figure out how to return the best search results based on category matches and occurrences in the name.
What's the best way to go about something like this? Scoring? => Check for each search term whether it occurs in the name of the video; if so, award the video a point; check if the video is in categories that are also contained in the search query; if so, award it a point. Sort it by number points received? That sounds very expensive in terms of CPU usage.
Using Full-Text Search may help: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html#function_match
You can test several columns at once against an expression.
First, use full-text search. It can be either MySQL full-text search or some kind of external full-text search engine. I recommend Sphinx. It is very fast and simple, and it can even be integrated with MySQL using SphinxSE (so search indexes look like tables in MySQL). However, you have to install and configure it.
Second, think about splitting search results by search type. Any kind of full-text search will return a list of matched items sorted by relevancy. You can search all fields and get a single list. This is a bad idea, because hits by name and hits by category will be mixed. To solve this you can do multiple searches: search by name first, then search by category.
As a result you'll have two matching sets and you have a lot of options how to display this. Some ideas:
merge the 2 sets based on the relevancy rate returned by the search engine. This looks like the result of one single query, but you know what each item is (name hit or category hit), so you can highlight this
do the same merge as above but assign different weights to the different sets, for example relevancy = 0.7*name_relevancy+0.3*category_relevancy. This will make search results more natural
split results into tabs/groups, e.g. 'There are N titles and M categories matching your query'
Use bands when displaying results. For each page (assuming you are splitting search results using a paginator) display N items from the first set and M items from the second set (you can display the sets one after the other or shuffle the items). If there are not enough items in one of the sets, just take more items from the other set, so there are always M+N items per page
Any other way you can imagine
And you can use this method for any kind of field: name, category, actor, director, etc. However, the more fields you use, the more search queries you have to execute
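The weighted merge can be sketched in a few lines of Python. The input format (video ID mapped to a relevancy score, one dictionary per search) and the sample scores are assumptions for illustration; a real search engine would return its own result structure.

```python
def merge_weighted(name_hits, category_hits, name_weight=0.7, category_weight=0.3):
    """Merge two result sets into one list sorted by combined relevancy.

    Each argument maps video ID -> relevancy reported by one search
    (hypothetical format). A video missing from a set scores 0 there.
    """
    video_ids = set(name_hits) | set(category_hits)
    combined = {
        vid: name_weight * name_hits.get(vid, 0.0)
        + category_weight * category_hits.get(vid, 0.0)
        for vid in video_ids
    }
    # Highest combined relevancy first; ties broken by ID for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


name_hits = {"v1": 1.0, "v2": 0.5}
category_hits = {"v2": 1.0, "v3": 1.0}

for video_id, relevancy in merge_weighted(name_hits, category_hits):
    print(video_id, round(relevancy, 2))
```

Because the two input sets stay separate until the merge, each result can still be labelled as a name hit, a category hit, or both.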
I don't think you can avoid looking at the title and category of every movie for each search. So the CPU usage for that is a given. If you are concerned about the CPU usage of the sort, it would be negligible in most cases, since you would only be sorting the items that have more than zero points.
Having said that, what you probably want is a system that is partially rule-based and partially point-based. For instance, if you have a title that is equal to the search term, it should come first, regardless of points. Architect your search such that you can easily add rules and tweak points as you see fit to yield the best results.
Edit: In the event of an exact title match, you can take advantage of a DB index and not search the whole table. Optionally, the same goes for category.
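A minimal Python sketch of a partially rule-based, partially point-based ranking, done in memory for clarity. The video records, the one-point-per-match scoring and the single "exact title first" rule are assumptions chosen to mirror the description above; in practice the matching would be pushed into the database.

```python
def rank_videos(videos, query):
    """Rank videos: exact title matches first, then by points, then by name."""
    normalized_query = query.lower().strip()
    terms = normalized_query.split()
    scored = []
    for video in videos:
        name = video["name"].lower()
        name_words = set(name.split())
        categories = {c.lower() for c in video["categories"]}
        # Point-based part: one point per term found in the name or categories.
        points = sum(1 for t in terms if t in name_words)
        points += sum(1 for t in terms if t in categories)
        # Rule-based part: an exact title match always comes first.
        exact = name == normalized_query
        if exact or points > 0:
            scored.append((exact, points, video["name"]))
    scored.sort(key=lambda item: (not item[0], -item[1], item[2]))
    return [name for _, _, name in scored]


# Hypothetical sample data.
videos = [
    {"name": "Funny cat compilation", "categories": ["comedy", "animals"]},
    {"name": "Cat", "categories": ["animals"]},
    {"name": "Cooking pasta", "categories": ["food"]},
    {"name": "Cat comedy special", "categories": ["comedy"]},
]

print(rank_videos(videos, "cat"))
print(rank_videos(videos, "cat comedy"))
```

New rules can be added as extra leading elements of the sort key, and the point values can be tuned without touching the rules.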
