I'm working on a PHP/MySQLi application where the user inputs a number, and it should then get the 5 closest records to that number.
Can this be done with a simple SQL statement, or do I need to get all the numbers into an array and then match against that?
Thanks!
This is possible through the following query:
SELECT * FROM [table]
ORDER BY ABS([column] - [userinput])
LIMIT 5
However, if you could provide more information, we would be able to suggest a better solution. This query is not very scalable and will start to get slow after a couple of thousand rows.
How are you going to use this query? Are we talking thousands of records? What kind of numbers are they? Is there a pattern? Answers to such questions would allow for a more precise solution that could possibly scale better with your system.
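For reference, here is a minimal sketch of how the query could be run from PHP with mysqli and a prepared statement; the table name `items` and the column name `value` are placeholders, not from the question:

<?php
// Minimal sketch, assuming an existing mysqli connection in $mysqli and a
// placeholder table `items` with a numeric column `value`.
$userInput = (int) $_GET['number'];    // or however the number arrives

$stmt = $mysqli->prepare('SELECT * FROM items ORDER BY ABS(value - ?) LIMIT 5');
$stmt->bind_param('i', $userInput);    // 'i' = integer; use 'd' for decimals
$stmt->execute();

$result = $stmt->get_result();         // get_result() needs the mysqlnd driver
while ($row = $result->fetch_assoc()) {
    print_r($row);                     // the 5 closest rows, nearest first
}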
I have a big database of articles, and before adding new items to the DB I'd like to check whether similar items already exist and, if so, group them together so that I can later easily display them as a group of similar items.
Currently we use PHP's similar_text(), which is very simple but shockingly precise and fully satisfies our needs. The problem is that before we add an item to the DB, we first need to pull X items from the DB and then loop through every single one to check whether our new item is at least 75% similar, in order to group them together. This uses a lot of resources and time that we don't really have.
We use MySQL and Solr for all our queries. I've tried MySQL full-text search and Solr's More Like This. Compared to PHP's implementation they are super fast and efficient, but I just can't get a robust percentage score like the one PHP's similar_text() provides. Accuracy is crucial for our grouping.
For example using this MySQL query:
SELECT id, body, ROUND(((MATCH(body) AGAINST ('ARTICLE TEXT')) / scores.max_score) * 100) as relevance
FROM natural_text_test,
(SELECT MAX(MATCH(body) AGAINST('ARTICLE TEXT')) as max_score FROM natural_text_test LIMIT 1) scores
HAVING relevance > 75
ORDER BY relevance DESC
I get that an article with 130 words is 85% similar to another article with 4,700 words, whereas PHP's similar_text() returns only a 3% similarity score, which is well below our threshold and is correct in our case.
I've also looked into Levenshtein distance algorithm, but it seems that the same problem as with MySQL and Solr arises.
There has to be a better way to handle similarity checks, maybe I'm using the algorithms incorrectly?
Based on some of the Comments, I might propose this...
It seems that 75%-similar documents would have a lot of the same sentences in the same order.
Break the doc into sentences
Take a crude hash of each sentence, map it to a visible ascii character. This gives you a string that is, perhaps, 1/100th the size of the original doc.
Store that with the doc.
When searching, use levenshtein() on this string to find 'similar' documents.
Sure, hashing is imperfect, etc. But this is fast. And you could apply some other technique to double-check the few docs that are close.
For a hash, I might do
$md5  = md5($sentence);
$x    = hexdec(substr($md5, 0, 2)) & 0x3F;  // take 6 bits out of that hex string (0..63)
$hash = chr(ord('0') + $x);                 // map them onto a visible ASCII character
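Pulling those pieces together, a rough sketch of the whole idea might look like the following; the function names, the sentence splitting and the distance threshold are my own assumptions, not a fixed recipe:

<?php
// Reduce a document to a short fingerprint: one visible ASCII character per
// sentence, derived from 6 bits of that sentence's md5 (as sketched above).
function doc_fingerprint($doc) {
    // Naive sentence split on ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($doc), -1, PREG_SPLIT_NO_EMPTY);
    $fingerprint = '';
    foreach ($sentences as $sentence) {
        $normalized = strtolower(preg_replace('/\s+/', ' ', $sentence));
        $x = hexdec(substr(md5($normalized), 0, 2)) & 0x3F;  // 0..63
        $fingerprint .= chr(ord('0') + $x);                  // '0'..'o', all printable
    }
    return $fingerprint;
}

// Compare a new document's fingerprint against the stored ones; only the few
// close matches then need the expensive similar_text() double-check.
// Note: before PHP 8.0, levenshtein() arguments are limited to 255 characters.
function find_candidates($newDoc, array $storedFingerprints, $maxDistance = 5) {
    $fp = doc_fingerprint($newDoc);
    $candidates = array();
    foreach ($storedFingerprints as $docId => $stored) {
        if (levenshtein($fp, $stored) <= $maxDistance) {
            $candidates[] = $docId;
        }
    }
    return $candidates;
}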
I have two tables with ~70,000 rows each. Both tables have a column "title". I need to compare the tables and find their intersection by the title column. I tried using JOIN and UNION, but the titles can be a little different: in one table it can be New-York, USA while in the other it is New York Usa. I googled it, and this is called "fuzzy string searching".
I already started with PHP and similar_text(), but it's very slow... I think that for this task I should use something else, like R maybe. I already pushed this data into BigQuery, but BigQuery only supports REGEXP for searching in a WHERE statement, or I can't understand how it should be used.
Can R solve my speed problem?
Thanks!
Example of dataset1:
new-york, usa|100|5000
dataset2:
newyork usa|50|1000
nnNew-York |10|500
Example of desired output:
New-York, Usa|160|6500
In other words, I need to create a new table that will contain data from both tables.
UPDATED
Thanks for your answers. I tried R and agrep; it works, but very slowly: 2,000 rows in 40 minutes, and I have 190,000 rows in total. Is that normal?
The answer to your question is "Levenshtein distance". However, with 70,000 rows in each table, this requires approximately 70,000 * 70,000 comparisons -- about 4.9 billion. That is a lot.
Doing the work in R may be your best approach, because R will keep all the data in memory and probably be more efficient than an implementation in MySQL.
There are ways to short-circuit the searching. One method, for instance, is to divide each string into n-grams (trigrams are typical) and use these to reduce the search space. After all, "New York City" and "Dallas" have no letters in common, so no comparison really needs to be done.
There are probably routines in R to handle trigrams. They are pretty easy to do in MySQL, but not built-in.
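As a sketch of that blocking idea in plain PHP (rather than R or MySQL; the function names, the normalization and the in-memory index are illustrative assumptions):

<?php
// Split a normalized title into its set of 3-character substrings (trigrams).
function trigrams($title) {
    $s = strtolower(preg_replace('/[^a-zA-Z0-9]+/', '', $title));
    $grams = array();
    for ($i = 0, $n = strlen($s) - 2; $i < $n; $i++) {
        $grams[substr($s, $i, 3)] = true;
    }
    return $grams;
}

// Index the titles of table A by trigram, so each title of table B is only
// compared against rows it shares at least one trigram with.
function build_trigram_index(array $titlesA) {
    $index = array();
    foreach ($titlesA as $id => $title) {
        foreach (trigrams($title) as $gram => $unused) {
            $index[$gram][] = $id;
        }
    }
    return $index;
}

function candidates_for($titleB, array $index) {
    $ids = array();
    foreach (trigrams($titleB) as $gram => $unused) {
        if (isset($index[$gram])) {
            foreach ($index[$gram] as $id) {
                $ids[$id] = true;
            }
        }
    }
    return array_keys($ids);  // run levenshtein()/similar_text() only on these
}

With 70,000 titles per table this turns "compare everything with everything" into "compare each title with the handful of rows that share a trigram with it", which is where most of the speedup comes from.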
I need help with a query. I am taking input from a user where they enter a range between 1 and 100, so it could be 30-40 or 66-99. Then I need a query to pull data from a table that has a high_range and a low_range, to find a match to any number in their range.
So if a user did 30-40 and the table had entries for 1-80, 21-33, 32-40, 40-41, 66-99, and 1-29 it would find all but the last two in the table.
What is the easiest way to do this?
Thanks
If I understood correctly (i.e. you want any range that overlaps the one entered by the user), I'd say:
SELECT * FROM table WHERE low <= $high AND high >= $low
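For example, with mysqli and the 30-40 input from the question (the table name `ranges` and the parsing of the input are placeholders):

<?php
// Sketch: find every stored range that overlaps the user's range, assuming
// integer columns `low` and `high` on a placeholder table `ranges`.
list($low, $high) = array_map('intval', explode('-', '30-40'));

$stmt = $mysqli->prepare('SELECT * FROM ranges WHERE low <= ? AND high >= ?');
$stmt->bind_param('ii', $high, $low);   // note the order: low <= $high, high >= $low
$stmt->execute();
$rows = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
// With the sample data this returns 1-80, 21-33, 32-40 and 40-41,
// but not 66-99 or 1-29.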
What I understood is that the range is stored in the format low-high. If that is the case, then this is a poor design. I suggest splitting the values into two columns: low and high.
If you already have the values split, you can use some statement like:
SELECT * FROM myTable WHERE low <= $needleHigherBound AND high >= $needleLowerBound
If you have the values stored in one column and insist they stay so, you might find MySQL's SUBSTRING_INDEX function useful. But in this case you'll have to write a complicated query to parse the values of all the rows and then compare them to your search values. It seems like a lot of effort to cover up a design flaw.
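For completeness, the single-column workaround could look roughly like this (the table name `ranges` and the column `range_col`, holding values such as '21-33', are assumptions):

<?php
// Sketch only: parse a 'low-high' text column on the fly with SUBSTRING_INDEX.
// Note that this cannot use an index, which is part of why two columns are better.
// $low / $high are the user's bounds, parsed as in the example above.
$sql = "SELECT *
        FROM ranges
        WHERE CAST(SUBSTRING_INDEX(range_col, '-', 1)  AS UNSIGNED) <= ?  -- stored low  <= user high
          AND CAST(SUBSTRING_INDEX(range_col, '-', -1) AS UNSIGNED) >= ?  -- stored high >= user low";
$stmt = $mysqli->prepare($sql);
$stmt->bind_param('ii', $high, $low);
$stmt->execute();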
I want to make a table with the following columns in a MySQL database:
Index
Latitude
Longitude
Place (e.g. country, city, person, building, etc.)
It will have a huge number of rows, on the order of hundreds of thousands up to millions.
If I want to get nearest places of a selected row in the table, how can I do that in the fastest way?
It is no problem if additional information, indexing, or presorting is necessary.
======
Edit 1:
I have read the answers, and they use a formula; for example, from the top answer:
(((acos(sin((".$latitude."*pi()/180)) * sin((geo_latitude*pi()/180))+cos((".$latitude."*pi()/180)) * cos((geo_latitude*pi()/180)) * cos(((".$longitude."- geo_longitude)*pi()/180))))*180/pi())*60*1.1515*1.609344)
If I have 1 million rows, that means 1 million of these expensive calculations. I think it will be very slow.
Is there an optimization, for example filtering first:
1. If the input is City A at location 10.000, 20.000, then filter to cities located between 9.000 and 11.000.
2. Calculate with the formula above.
How to optimize the speed of that algorithm?
====
Edit 2:
Sorry, I had only read the top answer.
I found what I was looking for in another answer: http://www.scribd.com/doc/2569355/Geo-Distance-Search-with-MySQL
You can use a quadkey. A quadkey is a spatial index, like a quadtree: it sorts the points into a grid, and then you can search the grid around the center point. It's not easy to understand, but you can download my PHP class hilbert-curve at phpclasses.org. Or you can use the native MySQL spatial extension and the POINT datatype. However, my implementation uses a quadkey and a Hilbert curve and can be better; it depends a lot on the data. The problem with the haversine formula is that it is very slow, but you can use both approaches together to achieve better results.
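A rough sketch of the "filter first, then apply the distance formula" approach from Edit 1: the column names geo_latitude/geo_longitude follow the question's formula, while the table name `places` and the radius handling are assumptions.

<?php
// Sketch: cheap bounding-box prefilter (which can use an index on geo_latitude /
// geo_longitude), then rank only the survivors with the great-circle formula
// (6371 * ACOS(...) gives km and is equivalent to the expression in the question).
$lat = 10.0; $lng = 20.0; $radiusKm = 50;

$latDelta = $radiusKm / 111.0;                        // ~111 km per degree of latitude
$lngDelta = $radiusKm / (111.0 * cos(deg2rad($lat))); // degrees of longitude shrink with latitude
$minLat = $lat - $latDelta;  $maxLat = $lat + $latDelta;
$minLng = $lng - $lngDelta;  $maxLng = $lng + $lngDelta;

$sql = "SELECT *,
               6371 * ACOS(
                   COS(RADIANS(?)) * COS(RADIANS(geo_latitude)) *
                   COS(RADIANS(geo_longitude) - RADIANS(?)) +
                   SIN(RADIANS(?)) * SIN(RADIANS(geo_latitude))
               ) AS distance_km
        FROM places
        WHERE geo_latitude  BETWEEN ? AND ?
          AND geo_longitude BETWEEN ? AND ?
        ORDER BY distance_km
        LIMIT 10";
$stmt = $mysqli->prepare($sql);
$stmt->bind_param('ddddddd', $lat, $lng, $lat, $minLat, $maxLat, $minLng, $maxLng);
$stmt->execute();
$nearest = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);

The bounding box throws away almost all of the million rows cheaply, so the expensive trigonometry only runs on the few candidates inside the box.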
I have a relatively large database (130,000+ rows) of weather data, which is accumulating very fast (a new row is added every 5 minutes). On my website I publish min/max data for the day and for the entire existence of my weather station (which is around 1 year).
Now I would like to know whether I would benefit from creating additional tables where these min/max data are stored, rather than letting PHP run a MySQL query searching for the day's min/max and the min/max for the entire existence of my weather station. Would a query for MAX(), MIN() or SUM() (I need SUM() to total the rain accumulation for each month) take that much longer than a simple query against a table that already holds those min, max and sum values?
That depends on whether your columns are indexed or not. In the case of MIN() and MAX(), the MySQL manual says the following:
MySQL uses indexes for these operations:
To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a preprocessor that checks whether you are using WHERE key_part_N = constant on all key parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each MIN() or MAX() expression and replaces it with a constant.
In other words, if your columns are indexed, you are unlikely to gain much performance by denormalizing. If they are NOT, you will definitely gain performance.
As for SUM(), it is likely to be faster on an indexed column, but I'm not really confident about the performance gains here.
Please note that you should not rush to index your columns just because of this post: if you add indexes, your update queries will slow down!
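To make the quoted optimization concrete, a small sketch; the table and column names (`readings`, `reading_day`, `temperature`) are assumptions, with the day stored in a DATE column. With a composite index whose leading part is pinned by the WHERE clause, MySQL can answer the day's MIN()/MAX() with single key lookups instead of scanning the day's rows.

<?php
// One-time: composite index with the filtered column first and the aggregated
// column second (the "key parts before key_col" from the manual excerpt).
$mysqli->query('ALTER TABLE readings ADD INDEX idx_day_temp (reading_day, temperature)');

// Per request: constant on reading_day, MIN()/MAX() on temperature.
$today = date('Y-m-d');
$stmt = $mysqli->prepare(
    'SELECT MIN(temperature) AS day_min, MAX(temperature) AS day_max
     FROM readings
     WHERE reading_day = ?'
);
$stmt->bind_param('s', $today);
$stmt->execute();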
Yes, denormalization should help performance a lot in this case.
There is nothing wrong with storing calculations for historical data that will not change in order to gain performance benefits.
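As a sketch of what that denormalization could look like for this weather data (the table and column names are made up; adapt them to the real schema):

<?php
// Sketch: keep a small daily summary table next to the raw 5-minute readings.
// `readings`, `daily_summary` and their columns are assumed names.
$mysqli->query("CREATE TABLE IF NOT EXISTS daily_summary (
                    day        DATE PRIMARY KEY,
                    min_temp   DECIMAL(5,2),
                    max_temp   DECIMAL(5,2),
                    rain_total DECIMAL(7,2)
                )");

// Roll yesterday's raw rows up once a day (e.g. from a cron job); historical
// days never change, so each one is aggregated exactly once.
$start = date('Y-m-d 00:00:00', strtotime('yesterday'));
$end   = date('Y-m-d 00:00:00');
$stmt = $mysqli->prepare(
    "INSERT INTO daily_summary (day, min_temp, max_temp, rain_total)
     SELECT DATE(recorded_at), MIN(temperature), MAX(temperature), SUM(rain)
     FROM readings
     WHERE recorded_at >= ? AND recorded_at < ?
     GROUP BY DATE(recorded_at)
     ON DUPLICATE KEY UPDATE min_temp = VALUES(min_temp),
                             max_temp = VALUES(max_temp),
                             rain_total = VALUES(rain_total)"
);
$stmt->bind_param('ss', $start, $end);
$stmt->execute();

// Monthly rain totals and all-time extremes then come from the tiny summary table:
//   SELECT SUM(rain_total) FROM daily_summary WHERE day BETWEEN ? AND ?
//   SELECT MIN(min_temp), MAX(max_temp) FROM daily_summary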
While I agree with RedFilter that there is nothing wrong with storing historical data, I don't agree with the performance boost you will get. Your database is not what I would consider a heavy use database.
One of the major advantages of databases is indexes. They use advanced data structures to make data access lightning fast. Just think: every primary key you have is an index. You shouldn't be afraid of them. Of course, it would probably be counterproductive to make every field an index, but that should never really be necessary. I would suggest researching indexes more to find the right balance.
As for the work done when a change happens, it is not that bad. An index is a tree-like representation of your field data. This is done to reduce a search down to a small number of near-binary decisions.
For example, think of finding a number between 1 and 100. Normally you would randomly stab at numbers, or you would just start at 1 and count up. This is slow. Instead, it would be much faster if you set it up so that you could ask if you were over or under when you choose a number. Then you would start at 50 and ask if you are over or under. Under, then choose 75, and so on till you found the number. Instead of possibly going through 100 numbers, you would only have to go through around 6 numbers to find the correct one.
The problem arises when you add 50 numbers and the range becomes 1 to 150. If you start at 50 again, your search is less optimized, as there are 100 numbers above you: your binary search is out of balance. So what you do is rebalance the search by starting at the new mid-point, namely 75.
So the work a database does is just an adjustment to rebalance the mid-point of its index. It isn't actually a lot of work. If you are working on a database that is large and requires many changes a second, you would definitely need a strong strategy for your indexes. In a small database that gets very few changes, like yours, it's not a problem.