I have a PHP application that allows the user to specify a list of countries and a list of products. It tells them which retailer is the closest match. It does this using a formula similar to this:
(
(number of countries matched / number of countries selected) * (importance of country match)
+
(number of products matched / number of products selected) * (importance of product match)
)
*
(significance of both country and solution matching * (coinciding matches / number of possible coinciding matches))
Where [importance of country match] is 30%, [importance of product match] is 10% and [significance of both country and solution matching] is 2.5
So to simplify it: (country match + product match) * multiplier.
Think of it as [do they operate in that country? + do they sell that product?] * [do they sell that product in that country?]
This gives us a match percentage for each retailer which I use to rank the search results.
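For concreteness, here is a minimal PHP sketch of that scoring with the weights above (the function and key names are illustrative, not the real code):

function matchScore(array $m): float
{
    $countryWeight    = 0.30; // importance of country match
    $productWeight    = 0.10; // importance of product match
    $bothSignificance = 2.5;  // significance of both country and product matching

    $countryScore = $m['countries_matched'] / $m['countries_selected'];
    $productScore = $m['products_matched']  / $m['products_selected'];
    $bothScore    = $m['coinciding_matches'] / $m['possible_coinciding_matches'];

    return ($countryScore * $countryWeight + $productScore * $productWeight)
         * ($bothSignificance * $bothScore);
}

// Example: 2 of 3 countries, 1 of 2 products, 1 of 6 possible country+product pairs matched.
echo matchScore([
    'countries_selected' => 3, 'countries_matched' => 2,
    'products_selected'  => 2, 'products_matched'  => 1,
    'possible_coinciding_matches' => 6, 'coinciding_matches' => 1,
]); // ≈ 0.104, i.e. a 10.4% match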
My data table looks something like this:
id | country | retailer_id | product_id
========================================
1 | FR | 1 | 1
2 | FR | 2 | 1
3 | FR | 3 | 1
4 | FR | 4 | 1
5 | FR | 5 | 1
Until now it's been fairly simple, as it has been a binary decision: a retailer either operates in a given country (or sells a given product) or they don't.
However, I've now been asked to add some complexity to the system. I've been given the revenue data, showing how much of that product each retailer sells in each country. The data table now looks something like this:
id | country | retailer_id | product_id | revenue
===================================================
1 | FR | 1 | 1 | 1000
2 | FR | 2 | 1 | 5000
3 | FR | 3 | 1 | 10000
4 | FR | 4 | 1 | 400000
5 | FR | 5 | 1 | 9000000
My problem is that I don't want retailer 3, selling ten times as much as retailer 1, to be ten times better as a search result. Similarly, retailer 5 shouldn't be nine thousand times better a match than retailer 1. I've looked into using the mean, the mode and the median. I've tried using the deviation from the mean. I'm stumped as to how to make the big jumps less significant. My ignorance of the field of statistics is showing.
Help!
Consider using the log10() function. This dampens the direct scaling you describe: if you take log10() of the revenue, a retailer with 1000 times the revenue gets a score that is only 3 higher, not 1000 times higher.
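A rough sketch of what that could look like in PHP (names are illustrative):

function revenueScore(float $revenue): float
{
    // +1 avoids log10(0) for rows with no revenue
    return log10($revenue + 1);
}

echo revenueScore(1000);    // ≈ 3.0
echo revenueScore(10000);   // ≈ 4.0  (10x the revenue, only +1 to the score)
echo revenueScore(9000000); // ≈ 6.95 (9000x the revenue, not even 2.5x the score)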
A classic tool for "dampening" huge increases in value is the logarithm. If you look at the Wikipedia article on logarithms, you'll see that the function value initially grows fairly quickly but then much less so. As mentioned in another answer, a logarithm with base 10 means that each time you multiply the input value by ten, the output value increases by one. Similarly, a logarithm with base two grows by one each time you multiply the input value by two.
If you want to weaken the effect of the logarithm, you could look into combining it with, say, a linear function, e.g. f(x) = log2 x + 0.0001 x... but that multiplier there would need to be tuned very carefully so that the linear part doesn't quickly overshadow the logarithmic part.
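For instance, with that exact formula: at x = 10,000 the logarithmic part is log2(10,000) ≈ 13.3 while the linear part is 0.0001 * 10,000 = 1; at x = 1,000,000 it is ≈ 19.9 versus 100, so the linear term is already dominating.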
Coming up with this kind of weighting is inherently tricky, especially if you don't know exactly what the function is supposed to look like. However, there are programs that do curve fitting, i.e. you can give it pairs of function input/output and a template function, and the program will find good parameters for the template function to approximate the desired curve. So, in theory you could draw your curve and then make a program figure out a good formula. That can be a bit tricky, too, but I thought you might be interested. One such program is the open source tool QtiPlot.
Related
I have a MySQL table with area and lat/lon location columns. Every area has many locations, say 20,000. Is there a way to pick just a few, say 100, that look somewhat evenly distributed on the map?
The distribution doesn't have to be perfect, query speed is more important. If that is not possible directly with MySQL a very fast algorithm that somehow picks evenly distributed locations might also work.
Thanks in advance.
Edit: answering some requests in the comments. The data doesn't have anything that can be used for this; it's just the area and the coordinates of the locations. Example:
+-------+--------------+----------+-----------+------------+--------+--------+
| id | area | postcode | lat | lon | colour | size |
+-------+--------------+----------+-----------+------------+--------+--------+
| 16895 | Athens | 10431 | 37.983917 | 23.7293599 | red | big |
| 16995 | Athens | 11523 | 37.883917 | 23.8293599 | green | medium |
| 16996 | Athens | 10432 | 37.783917 | 23.7293599 | yellow | small |
| 17000 | Thessaloniki | 54453 | 40.783917 | 22.7293599 | green | small |
+-------+--------------+----------+-----------+------------+--------+--------+
There are some more columns with characteristics but those are just used for filtering.
In the meantime I did try getting every nth row; it seems to work, although it's a bit slow:
SET @a = 0;
select * from `locations` where (@a := @a + 1) % 200 = 0
Using RAND() also works, but it's a bit slow too.
Edit2: Turns out it was easy to add postal codes to the table. With those, grouping by postal code gives a result that looks nice. The only issue is that there are very large areas with around 3000 distinct postcodes, and picking just 100 may leave many of them shown in one place, so I will probably need to further process the result in PHP.
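A simplified sketch of that kind of per-postcode grouping (one representative row per postcode, capped at 100), assuming a PDO connection in $pdo and the columns from the example table:

$stmt = $pdo->prepare("
    SELECT l.*
    FROM locations l
    JOIN (
        SELECT MIN(id) AS id
        FROM locations
        WHERE area = ?
        GROUP BY postcode
        LIMIT 100
    ) pick USING (id)
");
$stmt->execute(['Athens']);
$sample = $stmt->fetchAll(PDO::FETCH_ASSOC);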
Edit3, answering #RickJames questions in comments so they are in one place:
Please define "evenly distributed" -- evenly spaced in latitude? no two are "close" to each other? etc.
"Evenly distributed" was a bad choice of words. We just want to show some locations on the area that are not all in one place
Are the "areas" rectangles? Hexagons? Or gerrymandered congressional districts?
They can be thought of roughly as rectangles, but it shouldn't matter. An important thing I missed: we also need to show locations from multiple areas. Areas may be far apart from each other or neighbouring (but not overlapping). In that case we'd want the sample of 100 to be split between the areas.
Is the "100 per area" fixed? Or can it be "about 100"
It's not fixed, it's around 100 but we can change this if it doesn't look nice
Is there an AUTO_INCREMENT id on the table? Are there gaps in the numbers?
Yes, there is an AUTO_INCREMENT id, and it can have gaps.
Has the problem changed from "100 per area" to "1 per postal code"?
Nope, the problem is still the same, "show 100 per area in a way that not all of them are in the same place"; how this is done doesn't matter.
What are the total row count and desired number of rows in output?
Total row count depends on the area and criteria; it can be up to 40k in an area. If the total is more than 1000 we want to fall back to showing just a random 100. If it's 1000 or less we can just show all of them.
Do you need a different sample each time you run the query?
Same sample or different sample even with the same criteria is fine
Are you willing to add a column to the table?
It's not up to me, but if I have a good argument then most probably we can add a new column.
Here's an approach that may satisfy the goals.
Preprocess the table, making a new table, to get rid of "duplicate" items.
If the new table is small enough, a full scan of it may be fast enough.
As for "duplicates", consider this as a crude way to discover that two items land in the same spot:
SELECT ROUND(latitude * 5),
ROUND(longitude * 3),
MIN(id) AS id_to_keep
FROM tbl
GROUP BY 1,2
The "5" and "3" can be tweaked upward (or downard) to cause more (or fewer) ids to be kept. "5" and "3" are different because of the way the lat/lng are laid out; that ratio might work most temperate latitudes. (Use equal numbers near the equator, use a bigger ration for higher latitudes.)
There is a minor flaw... Two items very close to each other might be across the boundaries created by those ROUNDs.
How many rows are in the original table? How many rows does the above query generate? ( SELECT COUNT(*) FROM ( ... ) x; )
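A hedged sketch of the full pipeline (preprocess once, then sample at request time), mapping latitude/longitude onto the question's lat/lon columns; $pdo and the locations_thinned name are assumptions:

// One-off preprocessing: keep one id per grid cell.
$pdo->exec("
    CREATE TABLE locations_thinned AS
    SELECT ROUND(lat * 5) AS lat_cell,
           ROUND(lon * 3) AS lon_cell,
           MIN(id)        AS id
    FROM locations
    GROUP BY lat_cell, lon_cell
");

// At request time the thinned table is small enough to sample directly.
$stmt = $pdo->prepare("
    SELECT l.*
    FROM locations_thinned t
    JOIN locations l USING (id)
    WHERE l.area = ?
    ORDER BY RAND()
    LIMIT 100
");
$stmt->execute(['Athens']);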
I'm new here and have a project with a performance problem that seems hard to fix. I have created a search for objects that have availabilities, meaning a very simple structure:
ObjectID | Date | Number of available objects
---------------------------------------------
Object1 | 01.01.2019 | 1
Object1 | 02.01.2019 | 1
Object1 | 03.01.2019 | 0
Object1 | 04.01.2019 | 1
Object1 | 05.01.2019 | 1
Object2 | 01.01.2019 | 1
Object2 | 02.01.2019 | 1
Object2 | 03.01.2019 | 0
Object2 | 04.01.2019 | 1
Object2 | 05.01.2019 | 1
I'm working with MySQL and PHP.
A typical query would be:
Which objects are available for 10 days in a row between 01.01.2019 and 28.02.2019?
It's not really hard to make it work with MySQL, but once you have more than 10 users using the search function the server load becomes extremely high, even though the table is optimised (indexes etc.). The server has 2 cores and 4 GB of RAM.
I also tried storing the dates comma-separated per object in a table and letting the application search, but that creates extremely high traffic between the application and the database, which is also not a real solution.
In total we have around 20,000 objects and availabilities stored for a maximum of 500 days, so we have around 10,000,000 rows in my first solution.
Does anybody have an idea what's the most efficient way to do this?
(How should I store it to make the search fast?)
For this project I sadly cannot cache the searches.
Thanks for your help and kind regards, Christoph
Don't store dates in 28.02.2019 format. Flip it over, then use a DATE datatype in the table. Please provide SHOW CREATE TABLE.
What is your algorithm for searching?
The header says "number of objects", yet the values seem to be only 0 or 1, as if it is a boolean flag??
What is the maximum timespan? (If under 64, there are bit-oriented tricks we could play.)
By looking at adjacent rows (cf LAG(), if using MySQL 8.0), decide when an object changes state. Save those dates.
From that, it is one more hop to get "how many consecutive days" starting at one of those dates. This will be a simple query, and very fast if you have a suitable composite index.
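A hedged sketch of that LAG()-based approach in MySQL 8.0, driven from PHP; the table and column names (availability, object_id, d, available) are placeholders, and it assumes one row per object per day as in the sample data, so counting rows in a run equals counting consecutive days:

$sql = "
WITH flagged AS (
    SELECT object_id, d, available,
           CASE WHEN LAG(available) OVER w IS NULL
                  OR LAG(available) OVER w <> available
                THEN 1 ELSE 0 END AS state_change   -- 1 where the object changes state
    FROM availability
    WHERE d BETWEEN :from_date AND :to_date
    WINDOW w AS (PARTITION BY object_id ORDER BY d)
),
runs AS (
    SELECT object_id, available,
           SUM(state_change) OVER (PARTITION BY object_id ORDER BY d) AS run_id
    FROM flagged
)
SELECT DISTINCT object_id
FROM runs
WHERE available = 1
GROUP BY object_id, run_id
HAVING COUNT(*) >= :days";

$stmt = $pdo->prepare($sql);   // $pdo: assumed PDO connection
$stmt->execute([':from_date' => '2019-01-01', ':to_date' => '2019-02-28', ':days' => 10]);
$objects = $stmt->fetchAll(PDO::FETCH_COLUMN);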
First of all: Sorry for the long post, I am trying to explain a hard situation in an easy way and, at the same time, trying to give as much information as I can.
I have an algorithm that tries to determine user expectation during a search. There are a couple of ways I can use it and I have the same problem with both of them, so let's say I use it for disambiguation. Well, with a DB structure like this one (or any other that allows this to work):
post
ID | TITLE
---+----------------------------------------------
1 | Orange developed the first 7G phone
2 | Orange: the fruit of gods
3 | Theory of Colors: Orange
4 | How to prepare the perfect orange juice
keywords
ID | WORD | ABOUT
---+----------+---------
1 | orange | company
2 | orange | fruit
3 | orange | color
post_keywords
ID | POST | KEYWORD
---+-------+---------
1 | 1 | 1
2 | 2 | 2
3 | 3 | 3
4 | 4 | 2
If a user searches for the word "orange" in a search box, the algorithm sees that orange may refer to the company, the color, or the fruit and, by answering a couple of questions, it tries to determine which one the user is looking for. After all that I get an array like this one:
$e = array(
'fruit' => 0.153257,
'color' => 0.182332,
'company' => 0.428191,
);
At this point I know the user is probably looking for information about the fruit (because the fruit's value is closest to 0) and, if I am wrong, my second bet goes to the color. At the bottom of the list, the company.
So, with a Join and ORDER BY FIELD(keywords.id, 2,3,1) I can give the results the (almost) perfect order:
- Orange: the fruit of gods
- How to prepare the perfect orange juice
- Theory of Colors: Orange
- Orange developed the first 7G phone
Well... as you can imagine, I wouldn't come for help if everything were so nice. The problem is that in the previous example we have only 4 possible results, so if the user really was looking for the company he can find that result in the 4th position and everything is okay. But... if we have 200 posts about the fruit and 100 posts about the color, the first post about the company comes in the 301st position.
I am looking for a way to alternate the order (in a predictable and repeatable way) now that I know the user is most likely looking for the fruit, followed by the color, with the company at the end. I want to be able to show a post about the fruit in the first position (and possibly the second), followed by a post about the color, followed by one about the company, and start this cycle again until the results end.
Edit: I'll be happy with a MySQL trick or with an idea to change the approach, but I can't accept third-party solutions.
You can use variables to provide a custom sort field.
SELECT
p.*,
CASE k.about
WHEN 'company' THEN @sort_company := @sort_company + 1
WHEN 'color' THEN @sort_color := @sort_color + 1
WHEN 'fruit' THEN @sort_fruit := @sort_fruit + 1
ELSE NULL
END AS sort_order,
k.about
FROM post p
JOIN post_keywords pk ON (p.id = pk.post)
JOIN keywords k ON (pk.keyword = k.id)
JOIN (SELECT @sort_fruit := 0, @sort_color := 0, @sort_company := 0) AS vars
ORDER BY sort_order, FIELD(k.id, 2, 3, 1)
Result will look like this:
| id | title | sort_order | about |
|---:|:----------------------------------------|-----------:|:--------|
| 2 | Orange: the fruit of gods | 1 | fruit |
| 3 | Theory of Colors: Orange | 1 | color |
| 1 | Orange developed the first 7G phone | 1 | company |
| 4 | How to prepare the perfect orange juice | 2 | fruit |
I think you do need some way of categorizing, or, I would prefer to say, clustering the answers. If you can do this, you can then start by showing the users the top scoring answer from each cluster. Hey, sometimes maximising diversity really is worth doing just for its own sake!
I think you should be able to cluster answers. You have some sort of scoring formula which tells you how good an answer a document is to a user query, perhaps based on a "bag of words" model. I suggest that you use this to tell how close one document is to another document by treating the other document as a query. If you do exactly this you might want to treat each document as a query with the other as an answer and average the two scores, so that the score d(a, b) has the property that d(a, b) = d(b, a).
Now you have a score (unfortunately probably not a distance: that is, with a score, high values mean close together) and you need a clustering algorithm. Ideally you want a fast one, but maybe it just has to be fast enough to be faster than a human reading through the answers.
One fast clustering algorithm is to keep track of N (for some parameter N) cluster centres. Initialise these to the first N documents retrieved, then consider every other document one at a time. At each stage you are trying to reduce the maximum score found between any two documents in the cluster centre (which amounts to getting the documents as far apart as possible). When you consider a new document, compute the score between that document and each of the N current cluster centres. If the maximum of these scores is less than the current maximum score between the N current cluster centres, then this document is further away from the cluster centres than they are from each other so you want it. Swap it with one of the N cluster centres - whichever one makes that maximum score between the new N cluster centres the least.
This isn't a perfect clustering algorithm - for one thing, the result depends on the order in which documents are presented, which is a bad sign. It is, however, reasonably fast for small N, and it has one nice property: if you have k <= N clusters, and (switching from scores to distances) every distance within a cluster is smaller than every distance between two points from different clusters, then the N cluster centres at the end will include at least one point from each of the k clusters. The first time you see a member of a cluster you haven't seen before, it will become a cluster centre, and you will never reduce the number of cluster centres held, because ejecting a point which is in a different cluster from the other centres won't increase the minimum distance between any two points held as cluster centres (i.e. won't reduce the maximum score between any two such points).
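A rough PHP sketch of that swap-based selection, assuming documents are represented as bag-of-words arrays (term => weight) and a similarity() where higher means closer; all names here are illustrative:

// Cosine similarity as a stand-in scoring function: higher = more similar (closer).
function similarity(array $a, array $b): float {
    $dot = 0.0; $na = 0.0; $nb = 0.0;
    foreach ($a as $term => $w) {
        $na += $w * $w;
        if (isset($b[$term])) { $dot += $w * $b[$term]; }
    }
    foreach ($b as $w) { $nb += $w * $w; }
    return ($na > 0 && $nb > 0) ? $dot / sqrt($na * $nb) : 0.0;
}

// Largest similarity between any two current centres.
function maxPairwise(array $centres): float {
    $max = -INF;
    $k = count($centres);
    for ($i = 0; $i < $k; $i++) {
        for ($j = $i + 1; $j < $k; $j++) {
            $max = max($max, similarity($centres[$i], $centres[$j]));
        }
    }
    return $max;
}

function pickSpreadCentres(array $docs, int $n): array {
    $centres = array_slice($docs, 0, $n);        // initialise with the first N documents
    foreach (array_slice($docs, $n) as $doc) {
        // How close is this document to its closest current centre?
        $closest = -INF;
        foreach ($centres as $c) {
            $closest = max($closest, similarity($doc, $c));
        }
        // Keep it only if it is further from every centre than the centres
        // are from each other (lower score = further apart).
        if ($closest >= maxPairwise($centres)) {
            continue;
        }
        // Swap it with whichever centre leaves the smallest max pairwise score.
        $bestIdx = 0; $bestMax = INF;
        foreach (array_keys($centres) as $i) {
            $candidate = $centres;
            $candidate[$i] = $doc;
            $m = maxPairwise($candidate);
            if ($m < $bestMax) { $bestMax = $m; $bestIdx = $i; }
        }
        $centres[$bestIdx] = $doc;
    }
    return $centres;
}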
I'm building a REST API so the answer can't include google maps or javascript stuff.
In our app, we have a table containing posts that looks like this:
ID | latitude | longitude | other_stuff
1 | 50.4371243 | 5.9681102 | ...
2 | 50.3305477 | 6.9420498 | ...
3 | -33.4510148 | 149.5519662 | ...
We have a view with a map that shows all the posts around the world.
Hopefully, we will have a lot of posts, and it would be ridiculous to show thousands and thousands of markers on the map. So we want to group them by proximity so we can have something like 2-3 markers per continent.
To be clear, we need this:
Image from https://github.com/googlemaps/js-marker-clusterer
I've done some research and found that k-means seems to be part of the solution.
As I am really, really bad at math, I tried a couple of PHP libraries like this one: https://github.com/bdelespierre/php-kmeans which seems to do a decent job.
However, there is a drawback: I have to scan the whole table each time the map is loaded. Performance-wise, it's awful.
So I would like to know if someone has already dealt with this problem or if there is a better solution.
I kept searching and I've found an alternative to k-means: geohash.
Wikipedia will explain it better than me: Wiki geohash
But to summarize, the world map is divided into a grid of 32 cells, and each one is given an alphanumeric character.
Each cell is also divided into 32 cells, and so on for 12 levels.
So if I do a GROUP BY on the first letter of the hash I get my clusters for the lowest zoom level; if I want more precision, I just need to group by the first N letters of the hash.
So all I've done is add one field to my table and generate the hash corresponding to each row's coordinates:
ID | latitude | longitude | geohash | other_stuff
1 | 50.4371243 | 5.9681102 | csyqm73ymkh2 | ...
2 | 50.3305477 | 6.9420498 | p24k1mmh98eu | ...
3 | -33.4510148 | 149.5519662 | 8x2s9674nd57 | ...
Now, if I want to get my clusters, I just have to do a simple query:
SELECT count(*) as nb_markers FROM mtable GROUP BY SUBSTRING(geohash,1,2);
In the SUBSTRING, 2 is the level of precision, which must be between 1 and 12.
PS : Lib I used to generate my hash
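As a hedged sketch of driving that from PHP for a given zoom level (assuming a PDO connection in $pdo and the columns shown above), returning one marker position per cluster along with its count:

$precision = 2;   // 1 (coarsest) .. 12 (finest), tied to the zoom level
$stmt = $pdo->prepare("
    SELECT SUBSTRING(geohash, 1, ?) AS cell,
           COUNT(*)                 AS nb_markers,
           AVG(latitude)            AS lat,
           AVG(longitude)           AS lon
    FROM mtable
    GROUP BY cell
");
$stmt->execute([$precision]);
$clusters = $stmt->fetchAll(PDO::FETCH_ASSOC);   // one row per marker to draw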
My website deals with a lot of drug doses and units. Currently it is set up in a way that isn't really flexible about converting those doses and units into different values based on the user.
For instance, if a user submits a record where he took 1 ml of alcohol on Monday, 10 ml on Tuesday, and 2 liters on Tuesday, then the same substance uses two separate units. So what happens if I want to show the user the average of these three days ONLY in ml? What if the user wants to see it only in liters?
Here's what I have so far.
drugs
id | drug
1 | alcohol
2 | cannabis
units
id | unit
1 | ml
2 | mg
3 | g
4 | l
unit_conversion
id | from_unitid | to_unitid | multiplier
1 | 1 | 4 | 1000
2 | 4 | 1 | .001
3 | 2 | 3 | 1000
4 | 3 | 2 | .001
user_dose_line
id | drug_id | unit_id | dose
1 | 1 | 1 | 100
2 | 1 | 1 | 200
3 | 1 | 4 | 1
Here's how I'd ideally want it to work.
A user submits a record. He fills out that the drug is alcohol, the dose is 100, and the unit is ml. This is stored in the database as id #1 in user_dose_line.
Now let's say there is another table. It is the options table where the user sets what default unit of measurement he wants to use.
user_dose_options
id | user_id | drug_id | unit_id
1 | 22 | 1 | 4
This shows that the user has selected that any entry with alcohol in it, should always be converted to liters.
Here is my problem: where in my logic do I do these conversions?
I use CakePHP, which is a typical MVC framework. Should I be doing the conversion in the model? In the controller? What is the best practice for this? And am I choosing the most optimized route for this?
I plan to later add a lot more functionality for units/doses, so I need my database set up in the most efficient way to allow me to do lots of cool calculations with the data (display graphics, statistics, etc.).
You should have an input and an output method. Those should convert from the input unit to one standard unit which you use to store the data.
You could do this in the setter of your model.
i.e.
class userInTake {
    private $userId;
    private $drugId;
    private $dose;

    public function setDrugUse($userId, $drugId, $dose, $unitId, $desiredUnitId) {
        // Convert the dose to your default unit for this drug;
        // $desiredUnitId comes from user_dose_options for this user/drug,
        // and yourConversionMethod() is a placeholder for the actual conversion.
        $this->dose   = yourConversionMethod($dose, $unitId, $desiredUnitId);
        $this->userId = $userId;
        $this->drugId = $drugId;
    }
    // ...
}
At the point where you store the user input into your models, you do the conversion.
So if, for instance, a user inputs that he took 4 l of alcohol on Monday, you convert the litres to ml and then store the ml value.
So if a user inputs three occasions, Monday 4 l, Tuesday 40 ml, Friday 2.4 l, you convert each input to ml (4000, 40 and 2400).
When displaying the average you work from the 6440 ml total, i.e. roughly 2147 ml per occasion.
Your output function could choose the unit that serves best to display the value (e.g. if an ml value is greater than 10,000, display it as litres).
This way you don't have to store different units and only need your units for conversion upon input/output. That should make calculations a hell of a lot easier.
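If it helps, here is a rough sketch of what yourConversionMethod() could look like against the unit_conversion table from the question; $pdo is an assumed PDO connection, and it assumes each row stores "1 from-unit = multiplier to-units":

function convertDose(PDO $pdo, float $dose, int $fromUnitId, int $toUnitId): float
{
    if ($fromUnitId === $toUnitId) {
        return $dose;                      // nothing to convert
    }
    $stmt = $pdo->prepare("
        SELECT multiplier
        FROM unit_conversion
        WHERE from_unitid = ? AND to_unitid = ?
    ");
    $stmt->execute([$fromUnitId, $toUnitId]);
    $multiplier = $stmt->fetchColumn();
    if ($multiplier === false) {
        throw new InvalidArgumentException('No conversion defined for these units');
    }
    // Assumes the stored multiplier means: 1 from-unit = multiplier to-units.
    return $dose * (float) $multiplier;
}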
The ideal setup is that any operation related to the data should live in the model. You can override the afterFind() method to do the conversion right after the query and have the data ready to use.
http://book.cakephp.org/2.0/en/models/callback-methods.html#afterfind
Remember the philosophy: Fat models, skinny controllers.
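A minimal sketch of that callback in CakePHP 2.x, assuming the model is called UserDoseLine and that a conversion helper already exists (both names are assumptions, and the user's preferred unit would still have to be made available to the model):

App::uses('AppModel', 'Model');

class UserDoseLine extends AppModel {

    public function afterFind($results, $primary = false) {
        // For primary finds, each row looks like ['UserDoseLine' => ['dose' => ..., ...]].
        foreach ($results as &$row) {
            if (isset($row['UserDoseLine']['dose'],
                      $row['UserDoseLine']['unit_id'],
                      $row['UserDoseLine']['drug_id'])) {
                // convertToPreferredUnit() is a hypothetical helper that looks up
                // user_dose_options / unit_conversion and returns the converted dose.
                $row['UserDoseLine']['dose'] = $this->convertToPreferredUnit(
                    $row['UserDoseLine']['dose'],
                    $row['UserDoseLine']['unit_id'],
                    $row['UserDoseLine']['drug_id']
                );
            }
        }
        return $results;
    }
}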