Variety in search results - PHP

First of all: Sorry for the long post, I am trying to explain a hard situation in an easy way and, at the same time, trying to give as much information as I can.
I have an algorithm that tries to determine user expectation during a search. There are a couple of ways I can use it, and I have the same problem with both of them, so let's say I use it for disambiguation. Well, with a DB structure like this one (or any other that gets the job done):
post
ID | TITLE
---+----------------------------------------------
1 | Orange developed the first 7G phone
2 | Orange: the fruit of gods
3 | Theory of Colors: Orange
4 | How to prepare the perfect orange juice
keywords
ID | WORD | ABOUT
---+----------+---------
1 | orange | company
2 | orange | fruit
3 | orange | color
post_keywords
ID | POST | KEYWORD
---+-------+---------
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 4 | 2
If, in a search box, a user searches for the word "orange", the algorithm sees that orange may refer to the company, the color, or the fruit and, by answering a couple of questions, it tries to determine which one the user is looking for. After all that, I get an array like this one:
$e = array(
    'fruit'   => 0.153257,
    'color'   => 0.182332,
    'company' => 0.428191,
);
At this point I know the user is probably looking for information about the fruit (because fruit's value is closest to 0), and if I am wrong my second bet goes to the color. At the bottom of the list, the company.
So, with a JOIN and ORDER BY FIELD(keywords.id, 2, 3, 1) I can give the results the (almost) perfect order:
- Orange: the fruit of gods
- How to prepare the perfect orange juice
- Theory of Colors: Orange
- Orange developed the first 7G phone
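For reference, a minimal sketch of how that keyword-id order can be derived from the $e array; the id mapping used here is the one from the example keywords table:

<?php
// Sketch: turn the ranking array into the id list for ORDER BY FIELD().
$e = array(
    'fruit'   => 0.153257,
    'color'   => 0.182332,
    'company' => 0.428191,
);

// Lower value = better match, so sort ascending.
asort($e);

// Map each "about" label to its keyword id (values from the example table).
$keywordIds = array('company' => 1, 'fruit' => 2, 'color' => 3);

$fieldOrder = array();
foreach (array_keys($e) as $about) {
    $fieldOrder[] = $keywordIds[$about];
}

// Gives: ORDER BY FIELD(keywords.id, 2, 3, 1)
$orderBy = 'ORDER BY FIELD(keywords.id, ' . implode(', ', $fieldOrder) . ')';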
Well... as you can imagine, I wouldn't come for help if everything were so nice. The problem is that in the previous example we have only 4 possible results, so if the user really was looking for the company he can find that result in the 4th position and everything is okay. But... if we have 200 posts about the fruit and 100 posts about the color, the first post about the company comes in position 301.
I am looking for a way to alternate the order (in a predictable and repeatable way) now that I know the user is most likely looking for the fruit, followed by the color, with the company at the end. I want to be able to show a post about the fruit in the first position (and possibly the second), followed by a post about the color, followed by one about the company, and start this cycle again until the results end.
Edit: I'll be happy with a MySQL trick or with an idea to change the approach, but I can't accept third-party solutions.

You can use user variables to provide a custom sort field.
SELECT
    p.*,
    CASE k.about
        WHEN 'company' THEN @sort_company := @sort_company + 1
        WHEN 'color'   THEN @sort_color   := @sort_color   + 1
        WHEN 'fruit'   THEN @sort_fruit   := @sort_fruit   + 1
        ELSE NULL
    END AS sort_order,
    k.about
FROM post p
JOIN post_keywords pk ON (p.id = pk.post)
JOIN keywords k ON (pk.keyword = k.id)
JOIN (SELECT @sort_fruit := 0, @sort_color := 0, @sort_company := 0) AS vars
ORDER BY sort_order, FIELD(k.id, 2, 3, 1)
The result will look like this:
| id | title | sort_order | about |
|---:|:----------------------------------------|-----------:|:--------|
| 2 | Orange: the fruit of gods | 1 | fruit |
| 3 | Theory of Colors: Orange | 1 | color |
| 1 | Orange developed the first 7G phone | 1 | company |
| 4 | How to prepare the perfect orange juice | 2 | fruit |
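For completeness, a minimal PDO sketch of running that query from PHP. It assumes an existing $pdo connection and the schema from the question; the WHERE clause on the searched word and the dynamically built FIELD() list are illustrative additions:

<?php
// Sketch only: assumes $pdo is an existing PDO connection and the
// post / post_keywords / keywords tables from the question.
// $fieldOrder is the keyword-id list derived from the $e array (2, 3, 1).
$fieldOrder = array(2, 3, 1);

$sql = "
    SELECT
        p.*,
        CASE k.about
            WHEN 'company' THEN @sort_company := @sort_company + 1
            WHEN 'color'   THEN @sort_color   := @sort_color   + 1
            WHEN 'fruit'   THEN @sort_fruit   := @sort_fruit   + 1
            ELSE NULL
        END AS sort_order,
        k.about
    FROM post p
    JOIN post_keywords pk ON (p.id = pk.post)
    JOIN keywords k       ON (pk.keyword = k.id)
    JOIN (SELECT @sort_fruit := 0, @sort_color := 0, @sort_company := 0) AS vars
    WHERE k.word = :word    -- added for illustration: limit to the searched term
    ORDER BY sort_order, FIELD(k.id, " . implode(', ', $fieldOrder) . ")";

$stmt = $pdo->prepare($sql);
$stmt->execute(array('word' => 'orange'));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo "{$row['sort_order']} {$row['about']} {$row['title']}\n";
}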

I think you do need some way of categorizing, or, I would prefer to say, clustering the answers. If you can do this, you can then start by showing the users the top scoring answer from each cluster. Hey, sometimes maximising diversity really is worth doing just for its own sake!
I think you should be able to cluster answers. You have some sort of scoring formula which tells you how good an answer a document is to a user query, perhaps based on a "bag of words" model. I suggest that you use this to tell how close one document is to another by treating the other document as a query. If you do exactly this, you might want to treat each document as a query with the other as the answer and average the two scores, so that the score d(a, b) has the property that d(a, b) = d(b, a).
Now you have a score (unfortunately probably not a distance: that is, with a score, high values mean close together) and you need a clustering algorithm. Ideally you want a fast one, but maybe it just has to be fast enough to be faster than a human reading through the answers.
One fast clustering algorithm is to keep track of N cluster centres (for some parameter N). Initialise these to the first N documents retrieved, then consider every other document one at a time. At each stage you are trying to reduce the maximum score found between any two of the cluster centres (which amounts to keeping the centres as far apart as possible). When you consider a new document, compute the score between that document and each of the N current cluster centres. If the maximum of these scores is less than the current maximum score between the N current cluster centres, then this document is further away from the cluster centres than they are from each other, so you want it. Swap it with one of the N cluster centres - whichever one makes the maximum score between the new N cluster centres the least.
This isn't a perfect clustering algorithm - for one thing, the result depends on the order in which documents are presented, which is a bad sign. It is, however, reasonably fast for small N, and it has one nice property: if you have k <= N clusters and (switching from scores to distances) every distance within a cluster is smaller than every distance between two points from different clusters, then the N cluster centres at the end will include at least one point from each of the k clusters. The first time you see a member of a cluster you haven't seen before, it will become a cluster centre, and you will never reduce the number of cluster centres held, because ejecting a point that lies in a different cluster from the other centres would not increase the minimum distance between any two points held as cluster centres (that is, would not reduce the maximum score between any two such points).
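A rough PHP sketch of that centre-swapping procedure, assuming a caller-supplied $score callable (higher = more similar, symmetric); everything here is illustrative rather than a drop-in implementation:

<?php
// Sketch of the centre-swapping idea described above.
// $docs  : documents, in the order they were retrieved.
// $n     : number of cluster centres to keep.
// $score : callable($a, $b) returning a similarity score (higher = closer),
//          assumed symmetric.
function diverseCentres(array $docs, $n, callable $score)
{
    // Highest pairwise similarity within a set of documents.
    $maxPair = function (array $set) use ($score) {
        $max = -INF;
        $count = count($set);
        for ($i = 0; $i < $count; $i++) {
            for ($j = $i + 1; $j < $count; $j++) {
                $max = max($max, $score($set[$i], $set[$j]));
            }
        }
        return $max;
    };

    // Initialise the centres with the first N documents retrieved.
    $centres = array_slice($docs, 0, $n);

    foreach (array_slice($docs, $n) as $candidate) {
        $currentMax = $maxPair($centres);

        // Similarity between the candidate and its closest current centre.
        $candidateMax = -INF;
        foreach ($centres as $centre) {
            $candidateMax = max($candidateMax, $score($candidate, $centre));
        }

        // The candidate is further from the centres than they are from each
        // other, so swap it in for whichever centre leaves the lowest
        // maximum pairwise similarity.
        if ($candidateMax < $currentMax) {
            $bestIdx = 0;
            $bestMax = INF;
            foreach (array_keys($centres) as $idx) {
                $trial = $centres;
                $trial[$idx] = $candidate;
                $trialMax = $maxPair($trial);
                if ($trialMax < $bestMax) {
                    $bestMax = $trialMax;
                    $bestIdx = $idx;
                }
            }
            $centres[$bestIdx] = $candidate;
        }
    }

    return $centres;
}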

Related

Select a few coordinates from a large set that look evenly distributed in the area

I have a MySQL table with area and lat/lon location columns. Every area has many locations, say 20,000. Is there a way to pick just a few, say 100, that look somewhat evenly distributed on the map?
The distribution doesn't have to be perfect, query speed is more important. If that is not possible directly with MySQL a very fast algorithm that somehow picks evenly distributed locations might also work.
Thanks in advance.
Edit: answering some requests in comments. The data doesn't have something that can be used, it's just area and coordinates of locations, example:
+-------+--------------+----------+-----------+------------+--------+--------+
| id | area | postcode | lat | lon | colour | size |
+-------+--------------+----------+-----------+------------+--------+--------+
| 16895 | Athens | 10431 | 37.983917 | 23.7293599 | red | big |
| 16995 | Athens | 11523 | 37.883917 | 23.8293599 | green | medium |
| 16996 | Athens | 10432 | 37.783917 | 23.7293599 | yellow | small |
| 17000 | Thessaloniki | 54453 | 40.783917 | 22.7293599 | green | small |
+-------+--------------+----------+-----------+------------+--------+--------+
There are some more columns with characteristics but those are just used for filtering.
In the meantime I did try getting every nth row; it seems to work, although it is a bit slow:
SET #a = 0;
select * from `locations` where (#a := #a + 1) % 200 = 0
Using RAND() also works, but it is a bit slow too.
Edit 2: It turns out it was easy to add postcodes to the table. Having those, grouping by postcode seems to give a result that looks nice to the eye. The only issue is that there are very large areas with around 3,000 distinct postcodes, and picking just 100 may mean many of them end up shown in one place, so I will probably need to do further processing in PHP.
Edit 3: answering @RickJames's questions from the comments so they are all in one place:
Please define "evenly distributed" -- evenly spaced in latitude? no two are "close" to each other? etc.
"Evenly distributed" was a bad choice of words. We just want to show some locations on the area that are not all in one place
Are the "areas" rectangles? Hexagons? Or gerrymandered congressional districts?
They can be thought of roughly as rectangles, but it shouldn't matter. One important thing I missed: we also need to show locations from multiple areas. Areas may be far apart from each other or neighbouring (but not overlapping). In that case we'd want the sample of 100 to be split between the areas.
Is the "100 per area" fixed? Or can it be "about 100"
It's not fixed, it's around 100 but we can change this if it doesn't look nice
Is there an AUTO_INCREMENT id on the table? Are there gaps in the numbers?
Yes, there is an AUTO_INCREMENT id, and it can have gaps.
Has the problem changed from "100 per area" to "1 per postal code"?
Nope, the problem is still the same: "show 100 per area in a way that not all of them are in the same place"; how this is done doesn't matter.
What are the total row count and desired number of rows in output?
The total row count depends on the area and the criteria; it can be up to 40k in an area. If the total is more than 1,000 we want to fall back to showing just a random 100. If it is 1,000 or less we can just show all of them.
Do you need a different sample each time you run the query?
The same sample or a different sample for the same criteria is fine.
Are you willing to add a column to the table?
It's not up to me, but if I have a good argument then most probably we can add a new column.
Here's an approach that may satisfy the goals.
Preprocess the table, making a new table, to get rid of "duplicate" items.
If the new table is small enough, a full scan of it may be fast enough.
As for "duplicates", consider this as a crude way to discover that two items land in the same spot:
SELECT  ROUND(latitude * 5),
        ROUND(longitude * 3),
        MIN(id) AS id_to_keep
    FROM tbl
    GROUP BY 1, 2
The "5" and "3" can be tweaked upward (or downard) to cause more (or fewer) ids to be kept. "5" and "3" are different because of the way the lat/lng are laid out; that ratio might work most temperate latitudes. (Use equal numbers near the equator, use a bigger ration for higher latitudes.)
There is a minor flaw... Two items very close to each other might be across the boundaries created by those ROUNDs.
How many rows are in the original table? How many rows does the above query generate? ( SELECT COUNT(*) FROM ( ... ) x; )
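A sketch of how that could be wired up from PHP, assuming a PDO connection, the locations table and columns from the question, and the 5/3 multipliers suggested above; ORDER BY RAND() is only applied to the already-reduced set of grid cells:

<?php
// Sketch only: one representative id per grid cell, then at most 100 of
// those cells picked for display. Assumes $pdo is an existing PDO
// connection and the locations table shown in the question.
$sql = "
    SELECT MIN(id) AS id_to_keep
    FROM locations
    WHERE area = :area
    GROUP BY ROUND(lat * 5), ROUND(lon * 3)   -- tweak 5 / 3 for your latitude
    ORDER BY RAND()                           -- cheap here: few rows remain
    LIMIT 100
";

$stmt = $pdo->prepare($sql);
$stmt->execute(array('area' => 'Athens'));
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);

// Fetch the full rows for the chosen representatives.
$locations = array();
if ($ids) {
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $rows = $pdo->prepare("SELECT * FROM locations WHERE id IN ($placeholders)");
    $rows->execute($ids);
    $locations = $rows->fetchAll(PDO::FETCH_ASSOC);
}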

Group coordinates by proximity to each other

I'm building a REST API so the answer can't include google maps or javascript stuff.
In our app, we have a table containing posts that looks like that :
ID | latitude | longitude | other_stuff
1 | 50.4371243 | 5.9681102 | ...
2 | 50.3305477 | 6.9420498 | ...
3 | -33.4510148 | 149.5519662 | ...
We have a view with a map that shows all the posts around the world.
Hopefully, we will have a lot of posts, and it would be ridiculous to show thousands and thousands of markers on the map. So we want to group them by proximity so we can have something like 2-3 markers per continent.
To be clear, we need this:
Image from https://github.com/googlemaps/js-marker-clusterer
I've done some research and found that k-means seems to be part of the solution.
As I am really, really bad at math, I tried a couple of PHP libraries, like this one: https://github.com/bdelespierre/php-kmeans, which seems to do a decent job.
However, there is a drawback: I have to scan the whole table each time the map is loaded. Performance-wise, it's awful.
So I would like to know if someone has already dealt with this problem, or if there is a better solution.
I kept searching and I've found an alternative to k-means: Geohash.
Wikipedia will explain it better than me: Wiki geohash
But to summarize: the world map is divided into a grid of 32 cells, and each one is given an alphanumeric character.
Each cell is also divided into 32 cells, and so on for 12 levels.
So if I do a GROUP BY on the first letter of the hash I get my clusters for the lowest zoom level; if I want more precision, I just need to group by the first N letters of my hash.
So what I've done is simply add one field to my table and generate the hash corresponding to my coordinates:
ID | latitude | longitude | geohash | other_stuff
1 | 50.4371243 | 5.9681102 | csyqm73ymkh2 | ...
2 | 50.3305477 | 6.9420498 | p24k1mmh98eu | ...
3 | -33.4510148 | 149.5519662 | 8x2s9674nd57 | ...
Now, if I want to get my clusters, I just have to do a simple query :
SELECT count(*) as nb_markers FROM mtable GROUP BY SUBSTRING(geohash,1,2);
In the substring, 2 is the level of precision and must be between 1 and 12.
PS: Lib I used to generate my hash
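To make the query above concrete on the PHP side, a small sketch (it assumes an existing $pdo connection, the mtable/geohash column described above, and that the precision is derived from the map zoom level); the AVG() columns are an illustrative way to get a position for each cluster marker:

<?php
// Sketch: one cluster marker per geohash prefix. Assumes $pdo and the
// mtable / geohash columns described above. $precision (1..12) would come
// from the map zoom level.
$precision = 2;

$sql = "
    SELECT SUBSTRING(geohash, 1, :len) AS cell,
           COUNT(*)       AS nb_markers,
           AVG(latitude)  AS lat,
           AVG(longitude) AS lng
    FROM mtable
    GROUP BY cell
";

$stmt = $pdo->prepare($sql);
$stmt->bindValue(':len', $precision, PDO::PARAM_INT);
$stmt->execute();

// Each row is one marker for the REST response: position + number of posts.
$clusters = $stmt->fetchAll(PDO::FETCH_ASSOC);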

PHP Recommendation Engine - Recommending Whiskies with 12 different taste ratings

I'm developing a whisky information system in PHP connected to a MySQL database with 3 tables: bottles (about 100 in total), users, and the bottles certain users have added as favourites to their whisky shelf.
I'm attempting to build a function to recommend whiskies to the user based on the whiskies they have already added to their whisky shelf.
Each whisky has a 'flavour profile' with 12 different flavour features (e.g. whether the whisky is nutty, smoky, etc.), and each feature is rated on a scale from 0 to 4. So I basically have 12 numbers to play with and compare to another 12 numbers.
I've done a fair bit of research on the subject but can only find simple implementations comparing one rating to another; I can't think of an efficient way to compare 12 numbers and return some kind of match percentage.
I was wondering if anyone has any suggestions on the best method to compare the whiskies in the database to the whiskies in the users favourites and recommend the closest matches?
What you are trying to accomplish is, in essence, Pandora for Whiskey. You will have to devise an algorithm which will compare different characteristics and provide some sort of weight that will affect the overall outcome. This is not a trivial process and your algorithm will undergo modification many times before it works optimally.
| CHARACTERISTICS | YOUR WHISKEY | WHISKEY #1 | WHISKEY #2 |
|-----------------+--------------+------------+------------|
| Smoky           |      x       |            |     x      |
| Nutty           |              |     x      |     x      |
In the above example, YOUR WHISKEY is one that you like, and WHISKEY #2 has more of your desired characteristics than WHISKEY #1. That is a very simple comparison, and it doesn't take very much into account.
You need to sit down with your possible data, create an algorithm, and then try it out on people. If it doesn't quite work right, tweak the algorithm some more. It's a continuous process that will eventually work as you want.
This similar post on collaborative filtering and recommendation systems might provide some more useful insight: What is algorithm behind the recommendation sites like last.fm, grooveshark, pandora?
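To make the 12-number comparison concrete: one straightforward option is to treat each flavour profile as a 12-dimensional vector and turn the distance between two vectors into a percentage. A sketch (function and variable names are illustrative, and it assumes each profile is an array of twelve 0-4 ratings):

<?php
// Sketch: compare two flavour profiles (arrays of twelve 0-4 ratings) and
// return a match percentage. Identical profiles give 100%.
function flavourMatch(array $a, array $b)
{
    $sumSquares = 0;
    foreach ($a as $i => $rating) {
        $diff = $rating - $b[$i];
        $sumSquares += $diff * $diff;
    }

    // Worst case: all 12 features differ by the full 0-4 range.
    $worst = sqrt(count($a) * 16);

    return round((1 - sqrt($sumSquares) / $worst) * 100, 1);
}

// Recommend the bottles closest to the average profile of the user's shelf.
// $shelf and $candidates are illustrative [bottle_id => profile] arrays
// pulled from the bottles/favourites tables.
function recommend(array $shelf, array $candidates, $limit = 5)
{
    $avg = array_fill(0, 12, 0);
    foreach ($shelf as $profile) {
        foreach ($profile as $i => $rating) {
            $avg[$i] += $rating / count($shelf);
        }
    }

    $scores = array();
    foreach ($candidates as $id => $profile) {
        $scores[$id] = flavourMatch($avg, $profile);
    }
    arsort($scores);                       // best matches first

    return array_slice($scores, 0, $limit, true);
}

Cosine similarity or a weighted distance (giving more weight to the flavour features the user seems to care about most) would slot into the same structure; the normalisation just keeps the result between 0 and 100.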

Statistical method for grading a set of exponential data

I have a PHP application that allows the user to specify a list of countries and a list of products. It tells them which retailer is the closest match. It does this using a formula similar to this:
(
(number of countries matched / number of countries selected) * (importance of country match)
+
(number of products matched / number of products selected) * (importance of product match)
)
*
(significance of both country and solution matching * (coinciding matches / number of possible coinciding matches))
Where [importance of country match] is 30%, [importance of product match] is 10% and [significance of both country and solution matching] is 2.5
So to simplify it: (country match + product match) * multiplier.
Think of it as [do they operate in that country? + do they sell that product?] * [do they sell that product in that country?]
This gives us a match percentage for each retailer which I use to rank the search results.
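Written out as code, the formula above is just a few lines. A sketch (the weights are the ones quoted below the formula, the function name is illustrative, and it assumes the user selected at least one country and one product):

<?php
// Sketch: direct transcription of the formula above. Weights 0.30, 0.10 and
// 2.5 are the ones quoted; with perfect matches everywhere the result is
// exactly 1.0, i.e. 100%.
function retailerMatch(
    $countriesMatched, $countriesSelected,
    $productsMatched, $productsSelected,
    $coinciding, $possibleCoinciding
) {
    $countryPart = ($countriesMatched / $countriesSelected) * 0.30;
    $productPart = ($productsMatched / $productsSelected) * 0.10;
    $coincidence = 2.5 * ($coinciding / $possibleCoinciding);

    return ($countryPart + $productPart) * $coincidence;
}

// Example: 2 of 3 countries, 1 of 2 products, 1 of 6 possible coinciding matches.
echo round(retailerMatch(2, 3, 1, 2, 1, 6) * 100, 1) . '%';   // 10.4%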
My data table looks something like this:
id | country | retailer_id | product_id
========================================
1 | FR | 1 | 1
2 | FR | 2 | 1
3 | FR | 3 | 1
4 | FR | 4 | 1
5 | FR | 5 | 1
Until now it's been fairly simple as it has been a binary decision. The retailer either operates in that country or sells that product or they don't.
However, I've now been asked to add some complexity to the system. I've been given the revenue data, showing how much of that product each retailer sells in each country. The data table now looks something like this:
id | country | retailer_id | product_id | revenue
===================================================
1 | FR | 1 | 1 | 1000
2 | FR | 2 | 1 | 5000
3 | FR | 3 | 1 | 10000
4 | FR | 4 | 1 | 400000
5 | FR | 5 | 1 | 9000000
My problem is that I don't want retailer 3, selling ten times as much as retailer 1, to be ten times better as a search result. Similarly, retailer 5 shouldn't be nine thousand times better a match than retailer 1. I've looked into using the mean, the mode and the median. I've tried using the deviation from the mean. I'm stumped as to how to make the big jumps less significant. My ignorance of the field of statistics is showing.
Help!
Consider using the log10() function. This reduces the direct scaling of results that you were describing: if you take log10() of the revenue, a retailer with a revenue 1000 times larger gets a value that is only 3 higher.
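A sketch of what that looks like in practice, normalising the dampened values against the largest revenue in the result set so the weight stays between 0 and 1 (the array below just reuses the example figures from the question):

<?php
// Sketch: dampen revenue with log10() and normalise against the largest
// revenue in the current result set. Figures are the ones from the question.
$revenues = array(1 => 1000, 2 => 5000, 3 => 10000, 4 => 400000, 5 => 9000000);

$maxLog = log10(max($revenues));

$weights = array();
foreach ($revenues as $retailerId => $revenue) {
    // Retailer 5 (9,000,000) ends up roughly 2.3x retailer 1 (1,000),
    // instead of 9,000x on the raw figures.
    $weights[$retailerId] = log10($revenue) / $maxLog;
}

// Blend this weight into the existing match score, e.g.:
// $finalScore = $matchPercentage * $weights[$retailerId];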
A classic way of "dampening" huge increases in value is the logarithm. If you look at the Wikipedia article on logarithms, you see that the function value initially grows fairly quickly but then much less so. As mentioned in the other answer, a logarithm with base 10 means that each time you multiply the input value by ten, the output value increases by one. Similarly, a logarithm with base two grows by one each time you multiply the input value by two.
If you want to weaken the effect of the logarithm, you could look into combining it with, say, a linear function, e.g. f(x) = log2 x + 0.0001 x... but that multiplier there would need to be tuned very carefully so that the linear part doesn't quickly overshadow the logarithmic part.
Coming up with this kind of weighting is inherently tricky, especially if you don't know exactly what the function is supposed to look like. However, there are programs that do curve fitting, i.e. you can give it pairs of function input/output and a template function, and the program will find good parameters for the template function to approximate the desired curve. So, in theory you could draw your curve and then make a program figure out a good formula. That can be a bit tricky, too, but I thought you might be interested. One such program is the open source tool QtiPlot.

How to optimize this MySQL table?

This is for an upcoming project. I have two tables - the first one keeps track of photos, and the second one keeps track of each photo's rank.
Photos:
+-------+-----------+------------------+
| id | photo | current_rank |
+-------+-----------+------------------+
| 1 | apple | 5 |
| 2 | orange | 9 |
+-------+-----------+------------------+
The photo rank keeps changing on a regular basis, and this is the table that tracks it:
Ranks:
+-------+-----------+----------+-------------+
| id | photo_id | ranks | timestamp |
+-------+-----------+----------+-------------+
| 1 | 1 | 8 | * |
| 2 | 2 | 2 | * |
| 3 | 1 | 3 | * |
| 4 | 1 | 7 | * |
| 5 | 1 | 5 | * |
| 6 | 2 | 9 | * |
+-------+-----------+----------+-------------+ * = current timestamp
Every rank is tracked for reporting/analysis purpose.
[Edit] Users will have access to the statistics on demand.
I talked to someone who has experience in this field, and he told me that storing ranks like above is the way to go. But I'm not so sure yet.
The problem here is data redundancy. There are going to be tens of thousands of photos. The photo rank changes on an hourly basis (often within minutes) for recent photos, but less frequently for older photos. At this rate the table will have millions of records within months, and since I do not have experience working with large databases, this makes me a little nervous.
I thought of this:
Ranks:
+-------+-----------+--------------------+
| id | photo_id | ranks |
+-------+-----------+--------------------+
| 1 | 1 | 8:*,3:*,7:*,5:* |
| 2 | 2 | 2:*,9:* |
+-------+-----------+--------------------+ * = current timestamp
That means some extra code in PHP to split the rank/time (and sorting), but that looks OK to me.
Is this a correct way to optimize the table for performance? What would you recommend?
The first one. Period.
Actually, with the second design you'll lose much more: a timestamp stored in an INT column occupies only 4 bytes of space, while the same timestamp stored in string format takes 10 bytes.
Your first design is correct for a relational database. The redundancy in the key columns is preferable because it gives you a lot more flexibility in how you validate and query the rankings. You can do sorts, counts, averages, etc. in SQL without having to write any PHP code to split your string six ways from Sunday.
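For instance, the kinds of reporting queries mentioned above stay one-liners with the first design. A quick sketch, assuming a PDO connection and the Photos/Ranks tables from the question:

<?php
// Sketch (assumes $pdo and the Ranks table from the question): with the
// normalised design, reporting is plain SQL - no string splitting in PHP.

// Average rank and number of rank changes per photo.
$averages = $pdo->query("
    SELECT photo_id, AVG(ranks) AS avg_rank, COUNT(*) AS nb_changes
    FROM Ranks
    GROUP BY photo_id
")->fetchAll(PDO::FETCH_ASSOC);

// Rank history of a single photo, newest first.
$history = $pdo->prepare("
    SELECT ranks, timestamp
    FROM Ranks
    WHERE photo_id = ?
    ORDER BY timestamp DESC
");
$history->execute(array(1));
$rows = $history->fetchAll(PDO::FETCH_ASSOC);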
It sounds like you would like to use a non-SQL database like CouchDB or MongoDB. These would allow you to store a semi-structured list of rankings right in the record for the photo, and subsequently query the rankings efficiently. With the caveat that you don't really know that the rankings are in the right format, as you do with SQL.
I would stick with your first approach. With the second you will have a lot of data stored in a single row, and as time goes by it only gets more ranks - especially if a photo gets thousands and thousands of rankings.
The first approach is also more maintainable - for instance, if you ever wish to delete a rank.
I'd say the database 'hit' of normalisation (querying the ranks table over and over) is nicely avoided by 'caching' the last rank in current_rank. It does not really matter that ranks grows tremendously if it is seldom queried (analysis/reporting, you said), never updated, and just gets records inserted at the end: even a very light box would have no problem with millions of rows in that table.
Your alternative would require lots of updates at different locations on the disk, possibly resulting in degraded performance.
Of course, if you need all the old data, and always by photo_id, you could plan a scheduled run that moves rows to another table rankings_old (possibly keyed by photo_id, year and month, keeping the rankings and their timestamps) once a month is over. Retrieving old data stays easy, and there are no updates needed in rankings_old or rankings, only inserts at the end of the table.
And take it from me: millions of records in a pure logging table should be absolutely no problem.
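A sketch of such a scheduled run as a plain row move (it assumes $pdo and a rankings_old table with the same structure as Ranks; the one-month cutoff is illustrative, and the per photo_id/year/month aggregation mentioned above would be a further refinement):

<?php
// Sketch of the scheduled archive run described above.
$cutoff = date('Y-m-d H:i:s', strtotime('-1 month'));

$pdo->beginTransaction();

// Copy old rows to the archive table...
$archive = $pdo->prepare("INSERT INTO rankings_old SELECT * FROM Ranks WHERE timestamp < ?");
$archive->execute(array($cutoff));

// ...then remove them from the live table, using the same cutoff.
$purge = $pdo->prepare("DELETE FROM Ranks WHERE timestamp < ?");
$purge->execute(array($cutoff));

$pdo->commit();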
Normalized data or not normalized data. You will find thousands of articles about that. :)
It really depends of your needs.
If you want to build your database only with performance (speed or RAM consumption or...) in mind you should only trust the numbers. To do that you have to profile your queries with the expected data "volume" (You can generate the data with some script you write). To profile your queries, learn how to read the results of the 2 following queries:
EXPLAIN extended...
SHOW STATUS
Then learn what to do to improve the figures (mysql settings, data structure, hardware, etc).
As a starter, I really advise these 2 great articles:
http://www.xaprb.com/blog/2006/10/12/how-to-profile-a-query-in-mysql/
http://ajohnstone.com/archives/mysql-php-performance-optimization-tips/
If you want to build for the academic beauty of normalization: just follow the books and the general recommendations. :)
Out of the two options - like everyone before me said - it has to be option 1.
What you should really be concerned about are the bottlenecks in the application itself. Are users going to refer to the historical data often, or does it only show up for a few select users? If the answer is that everyone gets to see historical data of the ranks, then option 1 is good enough. If you are not going to refer to the historical ranks that often, then you could create a third "archive" table, and before updating the ranks, you can copy the rows of the original rank table to the archive table. This ensures that the number of rows stays minimal on the main table that is being called.
Remember, if you're updating rows and there are tens of thousands of them, it might be more fruitful to get the results in your code (PHP/Python/etc.), truncate the table and insert the results back in, rather than updating it row by row, as that would be a potential bottleneck.
You may want to look up sharding as well (horizontal partitioning) - http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
And never forget to index well.
Hope that helped.
You stated the rank is only linked to the image, in which case all you need is table 1, with the rank kept updated in real time. Table 2 just stores unnecessary data. The disadvantage of this approach is that the user can't change his vote.
You said the second table is for analysis/statistics, so it actually isn't something that needs to be stored in the DB. My suggestion is to get rid of the second table and use a logging facility to record rank changes.
Your second design is very dangerous in case you have 1 million votes for a photo. Can PHP handle that?
With the first design you can do all the math at the database level, which will return you a small result set.
