What is the best way to fuzzy compare two tables - php

I have two tables with ~70,000 rows each. Both tables have a column "title". I need to compare the two tables and find the intersection between them by the title column. I tried JOIN and UNION, but the titles can be slightly different. I mean, in one table it can be New-York, USA but in the other it can be New York Usa. I googled it, and this is called "fuzzy string searching".
I already started with PHP and similar_text, but it's very slow... I think that for this task I should use something else, maybe R. I already pushed this data into BigQuery, but BigQuery only supports REGEXP for searching in a WHERE statement, or I can't understand how else it should be used.
Can R solve my speed problem?
Thanks!
Example of dataset1:
new-york, usa|100|5000
dataset2:
newyork usa|50|1000
nnNew-York |10|500
Example of desired output:
New-York, Usa|160|6500
In other words, I need to create a new table that will contain the merged data from both tables.
UPDATED
Thanks for your answers. I tried R and agrep; it works, but very slowly: 2,000 rows in 40 minutes, and I have 190,000 rows in total. Is that normal?

The answer to your question is "Levenshtein distance". However, with 70,000 rows in each table, this requires approximately 70,000 * 70,000 comparisons -- 4.9 billion. That is a lot.
Doing the work in R may be your best approach, because R will keep all the data in memory and probably be more efficient than an implementation in MySQL.
There are ways to short-circuit the searching. One method, for instance, is to divide each string into n-grams (trigrams are typical) and use these to reduce the search space. After all, "New York City" and "Dallas" have no letters in common, so no comparison really needs to be done.
There are probably routines in R to handle trigrams. They are pretty easy to do in MySQL, but not built-in.
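For illustration, here is a minimal sketch of trigram blocking in MySQL. Every name here (trigrams1, trigrams2, the threshold of 3 shared trigrams) is an assumption for the example; splitting each normalized title into 3-character substrings would happen in application code before loading:

-- Assumed schema: one row per (source row id, trigram) for each table.
CREATE TABLE trigrams1 (id INT, tri CHAR(3), INDEX (tri));
CREATE TABLE trigrams2 (id INT, tri CHAR(3), INDEX (tri));

-- Candidate pairs are those sharing enough trigrams; only these
-- then need the expensive Levenshtein comparison.
SELECT t1.id AS id1, t2.id AS id2, COUNT(*) AS shared
FROM trigrams1 t1
JOIN trigrams2 t2 ON t1.tri = t2.tri
GROUP BY t1.id, t2.id
HAVING shared >= 3;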

Related

Find closest match by number in MySQLI

I'm working on a PHP / MySQLI application, where the user needs to input a number, and then it should get the 5 closest records to that given number.
Can this be done with a simple SQL string, or do I need to get all the numbers into an array and then match on that...?
Thanks!
This is possible through the following query:
SELECT * FROM [table]
ORDER BY ABS([column] - [userinput])
LIMIT 5
However, if you could provide more information, we would also be able to offer a better solution. This query is not very scalable and will start to slow down after a couple of thousand rows.
How are you going to use this query? Are we talking thousands of records? What kind of numbers are they? Is there some pattern? All such questions would allow for a more precise solution that could possibly scale better with your system.
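If [column] is indexed, one way to keep this scalable is to cut the scan down with an indexed range before sorting. This is only a sketch under the assumption that a sensible search window (here ±100) is known in advance:

SELECT * FROM [table]
WHERE [column] BETWEEN [userinput] - 100 AND [userinput] + 100  -- index-friendly pre-filter; 100 is an assumed window
ORDER BY ABS([column] - [userinput])
LIMIT 5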

MySQL: ignore results where difference is more than x between rows

I have a simple PHP/HTML page that runs MySQL queries to pull temperature data and display it on a graph. Every once in a while some bad data is read from my sensors (DHT11 temp/RH sensors, read by an Arduino): a spike that is too high or too low, so I know it's not a good data point. I have found this easy to deal with when the value is "way" out of range, as in not a sane temperature; I just use a BETWEEN statement to filter out any records that cannot possibly be true.
I do realize that ultimately this should be fixed at the source so these bad readings never post in the first place, however as a debugging tool, I do actually want to record those errors in my DB, so I can track down the points in time when my hardware was erroring.
However, this does not help with the occasional spikes that fall within the range of sane temperatures. For example, if it is 65 F outside and the sensor throws an odd reading, I get a 107 F data point that totally screws up my graphs, scaling, etc. I can't filter that with a BETWEEN (that I know of), because 107 F is actually a plausible summertime temperature in my region.
Is there a way to filter out values based on their neighboring rows? Say I am reading five rows, for the sake of simplicity, and their result is: 77, 77, 76, 102, 77... can I say "anything that differs from its sequential neighbors by more than (x), ignore it because it's bad data"?
[/longWinded]
It is hard to answer without your schema, so I made a SQLFiddle to reproduce your problem.
You need to average the temperature over a time frame and then compare this value with the current row. If the difference is too big, we don't select the row. In my Fiddle this is done by:
abs(temp - (SELECT AVG(temp)
            FROM temperature AS t
            WHERE t.timeRead BETWEEN
                  DATE_ADD(temperature.timeRead, INTERVAL -3 HOUR)
                  AND DATE_ADD(temperature.timeRead, INTERVAL +3 HOUR))) < 8
This condition calculates the average temperature over the previous 3 hours and the next 3 hours. If the difference is more than 8 degrees, we skip the row.
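For context, a minimal complete query built around that condition might look like this (the temperature(timeRead, temp) schema is assumed from the Fiddle):

SELECT temperature.timeRead, temperature.temp
FROM temperature
WHERE abs(temp - (SELECT AVG(temp)
                  FROM temperature AS t
                  WHERE t.timeRead BETWEEN
                        DATE_ADD(temperature.timeRead, INTERVAL -3 HOUR)
                        AND DATE_ADD(temperature.timeRead, INTERVAL +3 HOUR))) < 8
ORDER BY temperature.timeRead;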

What is the most efficient way to find the euclidean distance in 3d using mysql?

I have a MySQL table with thousands of data points stored in 3 columns R, G, B. How can I find which data point is closest to a given point (a,b,c) using Euclidean distance?
I'm saving RGB values of colors separately in a table, so the values are limited to 0-255 in each column. What I'm trying to do is find the closest color match by finding the color with the smallest euclidean distance.
I could obviously run through every point in the table to calculate the distance but that wouldn't be efficient enough to scale. Any ideas?
I think the above comments are all true, but - in my humble opinion - they do not answer the original question. (Correct me if I'm wrong.) So, let me add my 50 cents:
You are asking for a SELECT statement. Given that your table is called 'colors', that your columns r, g and b are integers ranged 0..255, and that you are looking for the row in your table closest to a given value, let's say rr, gg, bb, I would dare to try the following:
select min(sqrt((rr-r)*(rr-r)+(gg-g)*(gg-g)+(bb-b)*(bb-b))) from colors;
Now, this answer comes with a lot of caveats, as I am not sure I got your question right, so please confirm whether it is, or correct me so that I can be of assistance.
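One caveat worth noting: MIN() here returns only the smallest distance, not the matching row. A small variant with the same assumed placeholders rr, gg, bb that returns the closest color itself would be:

SELECT r, g, b
FROM colors
ORDER BY SQRT((rr - r)*(rr - r) + (gg - g)*(gg - g) + (bb - b)*(bb - b))
LIMIT 1;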
Since you're looking for the minimum distance and not the exact distance, you can skip the square root. I think Squared Euclidean Distance applies here.
You've said the values are bounded to 0-255 per column, so you can make an indexed lookup table of squared differences (a difference ranges from -255 to 255, so 511 values).
Here is what I'm thinking in terms of SQL. r0, g0, and b0 represent the target color. The table vector would hold the squared values just mentioned. This solution visits all the records, but the result set can be reduced to 1 by sorting and selecting only the first row.
SELECT
    c.r, c.g, c.b,
    mR.dist + mG.dist + mB.dist AS squared_dist
FROM
    colors c,
    vector mR,
    vector mG,
    vector mB
WHERE
    c.r - r0 = mR.point AND
    c.g - g0 = mG.point AND
    c.b - b0 = mB.point
GROUP BY
    c.r, c.g, c.b
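For completeness, a hypothetical sketch of the vector lookup table assumed by the query (only the names come from the query above, the rest is an assumption):

-- One row per possible channel difference, with its square precomputed.
CREATE TABLE vector (
    point INT PRIMARY KEY,  -- difference c.channel - target, from -255 to 255
    dist  INT NOT NULL      -- point * point
);

Appending ORDER BY squared_dist LIMIT 1 to the query then yields the single closest color, as described above.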
The first level of optimization I see is to square the distance limit yourself, so that you don't need to perform a square root for each row.
The second level of optimization I would encourage is some preprocessing to avoid extraneous arithmetic on each query (which could add run time on large tables of RGBs). You'd have to do some benchmarking to see, but by substituting constant values for the target point a, b, c and the limit d before sending the query, you can take some stress off MySQL.
Note that the performance difference between the two approaches may be negligible. You'll have to run test queries on your system to determine which is faster.
I just re-read and noticed that you are ordering by distance. In that case, the d should be removed and everything moved to one side of the comparison. You can still plug in the constants to prevent extra processing on MySQL's end.
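To make the first optimization concrete, here is a hedged sketch with assumed constants: target color (120, 40, 200) and distance limit d = 50. Squaring the limit once avoids a per-row square root:

SELECT r, g, b
FROM colors
WHERE (r - 120)*(r - 120) + (g - 40)*(g - 40) + (b - 200)*(b - 200) < 2500;  -- 2500 = 50*50, the squared limit

When ordering by distance instead, drop the WHERE clause and use ORDER BY on the same squared expression.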
I believe there are two options.
You either have to, as you say, iterate across the entire set, comparing each point against a best-so-far distance that you initialize at an impossibly high value. This runs in linear time; since you're only comparing 1 point to every point in the set, it scales linearly.
I'm still thinking of another option... something along the lines of doing a breadth-first search away from the input point until a point from the set is found, but this requires a bit more thought (I imagine the 3D space would have to be pretty heavily populated for this to be more efficient on average, though).
If you run through every point and calculate the distance, don't use the square root function, it isn't necessary. The smallest sum of squares will be enough.
What you are trying to solve is the nearest-neighbour search problem. (In the planar case, select all points sorted by the x, y, or z axis, then use PHP to process them.)
MySQL also has spatial extensions, which may offer this as a built-in function. I'm not positive, though.

MySql speed of executing max(), min(), sum() on relatively large database

I have a relatively large database (130,000+ rows) of weather data, which is accumulating very fast (a new row is added every 5 minutes). On my website I publish min/max data for the day, and for the entire existence of my weather station (which is around 1 year).
Now I would like to know whether I would benefit from creating additional tables where these min/max data are stored, rather than letting PHP run a MySQL query searching for the day's min/max data and the min/max data for the entire existence of my weather station. Would a query for MAX(), MIN() or SUM() (I need SUM() to total rain accumulation per month) take that much longer than a simple query against a table that already holds those min, max and sum values?
That depends on whether your columns are indexed or not. In the case of MIN() and MAX(), you can read the following in the MySQL manual:
MySQL uses indexes for these operations:
To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a preprocessor that checks whether you are using WHERE key_part_N = constant on all key parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each MIN() or MAX() expression and replaces it with a constant.
In other words, if your columns are indexed, you are unlikely to gain much performance benefit from denormalization. If they are NOT, you will definitely gain performance.
As for SUM(), it is likely to be faster on an indexed column, but I'm not really confident about the performance gains here.
Please note that you should not be tempted to index your columns just from reading this post; adding indexes will slow down your update queries!
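A minimal sketch of the indexed case the manual describes, with an assumed weather(reading_date, temperature) table: given a composite index, the daily MIN()/MAX() becomes a couple of key lookups rather than a scan.

CREATE INDEX idx_day_temp ON weather (reading_date, temperature);  -- assumed table and columns

SELECT MIN(temperature), MAX(temperature)
FROM weather
WHERE reading_date = '2012-06-01';  -- constant on the key part before temperature, as the manual requires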
Yes, denormalization should help performance a lot in this case.
There is nothing wrong with storing calculations for historical data that will not change in order to gain performance benefits.
While I agree with RedFilter that there is nothing wrong with storing historical data, I don't agree that you will see much of a performance boost. Your database is not what I would consider a heavy-use database.
One of the major advantages of databases is indexes. They use advanced data structures to make data access lightning fast. Just think: every primary key you have is an index. You shouldn't be afraid of them. Of course, it would probably be counterproductive to make all your fields indexes, but that should never really be necessary. I would suggest researching indexes more to find the right balance.
As for the work done when a change happens, it is not that bad. An index is a tree-like representation of your field data, built to reduce a search down to a small number of near-binary decisions.
For example, think of finding a number between 1 and 100. Normally you would randomly stab at numbers, or just start at 1 and count up. This is slow. It would be much faster if, each time you chose a number, you could ask whether you were over or under. You would start at 50 and ask; if under, choose 75, and so on until you found the number. Instead of possibly going through 100 numbers, you would only have to go through around 7 to find the correct one.
The problem comes when you add 50 numbers and the range becomes 1 to 150. If you start at 50 again, your search is less optimized, as there are now 100 numbers above you; your binary search is out of balance. So you rebalance your search by starting at the new midpoint, namely 75.
So the work a database does on a change is just an adjustment to rebalance the midpoint of its index. It isn't actually a lot of work. If you are working on a database that is large and takes many changes a second, you would definitely need a strong strategy for your indexes. In a small database that gets very few changes, like yours, it's not a problem.

Difference in efficiency of retrieving all rows in one query, or each row individually?

I have a table in my database that has about 200 rows of data that I need to retrieve. How significant, if at all, is the difference in efficiency when retrieving all of them at once in one query, versus each row individually in separate queries?
The queries are usually made via a socket, so executing 200 queries instead of 1 represents a lot of overhead; besides, the RDBMS is optimized to fetch many rows in a single query.
200 queries instead of 1 will make the RDBMS initialize datasets, parse the query, fetch one row, populate the datasets, and send the results 200 times instead of once.
It's a lot better to execute only one query.
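To make the contrast concrete, a hedged sketch (table and column names are assumed for the example):

-- One round trip: let the RDBMS fetch the whole set at once.
SELECT id, title FROM items WHERE category_id = 7;

-- Versus 200 round trips: the same rows, but the parse/execute/transfer
-- overhead described above repeats once per row.
-- SELECT id, title FROM items WHERE id = ?;  -- executed in a loop, once per id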
I think the difference will be significant, because there will (I guess) be a lot of overhead in parsing and executing the query, packaging the data up to send back, etc., which you then pay for every row rather than once.
It is often useful to write a quick test which times various approaches, then you have meaningful statistics you can compare.
If you were talking about some constant number of queries k versus a greater constant number k + k1, you might find that more queries is better. I don't know for sure, but SQL has all sorts of unusual quirks, so it wouldn't surprise me if someone could come up with a scenario like this.
However if you're talking about some constant number of queries k versus some non-constant number of queries n you should always pick the constant number of queries option.
In general, you want to minimize the number of calls to the database. You can already assume that MySQL is optimized to retrieve rows, however you cannot be certain that your calls are optimized, if at all.
Extremely significant. Usually getting all the rows at once will take about as much time as getting one row. So let's say that time is 1 second (very high, but good for illustration): getting all the rows will take 1 second, while getting each row individually will take 200 seconds (1 second per row). A very dramatic difference. And this isn't counting where you get the list of 200 from to begin with.
All that said, you've only got 200 rows, so in practice it won't matter much.
But still, get them all at once.
Exactly as the others have said. Your RDBMS will not break a sweat throwing 200+++++ rows at you all at once. Getting all the rows in one associative array will also not make much difference to your script, since you no doubt already have a loop for grabbing each individual row.
All you need do is modify this loop to iterate through the array you are given [very minor tweak!]
The only time I have found it better to get fewer results from multiple queries instead of one big set is if there is lots of processing to be done on the results. I was able to cut out about 40,000 records from the result set (plus associated processing) by breaking the result set up. Anything you can build into the query that will allow the DB to do the processing and reduce result set size is a benefit, but if you truly need all the rows, just go get them.
