I work on a site which sells, let's say, stuff and offers a "vendors search". In this search you enter your city, postal code, or region plus a distance (in km or miles), and the site gives you a list of vendors.
To do that, I have a database with the vendors. In the form used to save these vendors, you enter their full address, and when you click the save button a request is made to Google Maps to get their latitude and longitude.
When someone does a search, I look in a table where I store all the search terms and their lat/lng.
This table looks like this:
+--------+-------+------+
| term | lat | lng |
+--------+-------+------+
So the first query is something very simple
select lat, lng from my_search_table where term = "the term"
If I find a result, I then search with a nice method for all the vendors in the range the visitor wants and print the result on a map.
If I don't find a result, I search with a levenshtein function, because people writing bruxelle or bruxeles instead of bruxelles is really common, and I don't want to hit Google Maps every time (I also have a "how many times searched" column in my table for stats).
So I query my_search_table with no WHERE clause and loop through all the results to find the smallest levenshtein distance. If the smallest distance is greater than 2, I request the coordinates from Google Maps.
Here is my problem: for some countries (we have several sites around the world), my_search_table has 15-20k+ entries... and PHP doesn't (really) like looping over that much data (which I perfectly understand), so my request hits the PHP timeout. I could increase the timeout, but the problem would just come back in a few months.
So I tried a levenshtein MySQL function (found on Stack Overflow, btw), but it's also very slow.
So my question is: is there any way to make this search fast, even on very large datasets?
My suggestion is based on three things:
First, your dataset is big: big enough to reject the idea of "select all" + "run levenshtein() in the PHP application".
Second, you have control over your database, so you can adjust some architecture-related things.
Finally, the performance of SELECT queries is what matters most, while the performance of adding new data doesn't matter.
The thing is, you cannot perform a fast levenshtein search because levenshtein itself is slow: calculating the levenshtein distance between two strings is expensive. Thus, you won't be able to resolve the issue with "smart searching" alone; you'll have to prepare some data.
A possible solution: create a group index and assign it when adding/updating data. That is, you'll store an additional column which holds some hash (numeric, for example). When adding new data, you'll:
Perform a levenshtein-distance search over all records in your table against the inserted data (for that you may either use your application or the MySQL function you've already mentioned)
Set the group index of the new row to the index value that the rows found in the previous step have.
If nothing is found, set some new group index value (it's the first row and there are no similar rows yet) which differs from any group index values already present in the table.
To find the desired rows, you'll then just select rows with the same group index value. That means your SELECT queries will be very fast. But, yes, this causes a huge overhead when adding/changing data, so it isn't applicable when insert/update performance matters.
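A minimal sketch of that flow, assuming a numeric group_id column and the levenshtein() MySQL function you already have installed (all names here are illustrative):

ALTER TABLE my_search_table
    ADD COLUMN group_id INT UNSIGNED NULL,
    ADD INDEX idx_group_id (group_id);

-- At insert time (the expensive step): find an existing group
-- whose terms are within levenshtein distance 2 of the new term
SELECT group_id
FROM my_search_table
WHERE levenshtein(term, 'bruxelle') <= 2
LIMIT 1;

-- If a row was found, insert the new term with that group_id;
-- otherwise assign a fresh value, e.g. MAX(group_id) + 1

-- At search time (the hot path): a plain indexed equality lookup
SELECT lat, lng
FROM my_search_table
WHERE group_id = 42;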
You could try MySQL's SOUNDS LIKE operator:
SELECT lat, lng FROM my_search_table WHERE term SOUNDS LIKE "the term"
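SOUNDS LIKE is shorthand for comparing SOUNDEX() values, so it can also be written out explicitly. Note that the query above has to compute SOUNDEX() on every row; a sketch of precomputing it into an indexed column instead (the term_soundex name and length are just guesses, and the usual caveat applies that SOUNDEX is tuned for English words):

-- Equivalent to: WHERE term SOUNDS LIKE 'bruxelle'
SELECT lat, lng FROM my_search_table
WHERE SOUNDEX(term) = SOUNDEX('bruxelle');

-- Precompute the value so the lookup can use an index
-- (MySQL's SOUNDEX can return more than 4 characters)
ALTER TABLE my_search_table
    ADD COLUMN term_soundex VARCHAR(32),
    ADD INDEX idx_term_soundex (term_soundex);
UPDATE my_search_table SET term_soundex = SOUNDEX(term);

SELECT lat, lng FROM my_search_table
WHERE term_soundex = SOUNDEX('bruxelle');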
You can use a k-d tree or a ternary search tree to speed up the search. The idea is to replace the full scan with a binary-search-style traversal of a prebuilt tree.
I am a junior PHP developer, growing day by day, and I'm stuck on a performance problem, described here:
I am making a search engine in PHP. My database has one table with 41 columns and millions of rows, so obviously it is a very large dataset. In index.php I have a form for searching data. When a user enters a search keyword and hits submit, the action goes to search.php with the results. The query is like this:
SELECT * FROM TABLE WHERE product_description LIKE '%mobile%' ORDER BY id ASC LIMIT 10
This is the first query. After the results show, I have to run 4 other queries like this:
SELECT DISTINCT(weight_u) as weight from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country_unit) as country_unit from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country) as country from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(hs_code) as hscode from TABLE WHERE product_description LIKE '%mobile%'
These queries are for FILTERS. The problem is that when I hit the search button, all of these queries run, at the cost of performance; it's very slow.
Is there any other method to fetch weight, country, country_unit and hs_code faster, or how else can I achieve this?
The same functionality is implemented here, where the filter bar appears after the table is filled with data. How can I achieve that? Please help.
I have tried to explain my full problem; if there is any mistake, please let me know and I will improve the question. I am also new to Stack Overflow.
Firstly, are you sure this code is working as you expect? The first query retrieves 10 records matching your search term. Those records might have duplicate weight_u, country_unit, country or hs_code values, so when you then execute the next 4 queries for your filters, it's entirely possible that you will get values back which are not in the first query's results, and the filter might not make sense.
If that's true, I would create the filter values in your client code (PHP): finding the unique values in 10 records is quick and easy, and it reduces the number of database round trips.
Finally, the biggest improvement you can make is to use MySQL's full-text search features. The reason your app is slow is that your search terms cannot use an index: you're wild-carding the start as well as the end. It's like searching the phone book for people whose name contains "ishra": you have to look at every record to check for a match. Full-text indexes are designed for this, and they also help with fuzzy matching.
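A minimal sketch of that, with placeholder table and index names (FULLTEXT on InnoDB needs MySQL 5.6+; older versions need MyISAM). Boolean mode with a trailing wildcard makes "mobile" also match "mobiles":

ALTER TABLE products
    ADD FULLTEXT INDEX ft_description (product_description);

SELECT * FROM products
WHERE MATCH(product_description) AGAINST('mobile*' IN BOOLEAN MODE)
ORDER BY id ASC
LIMIT 10;

One behavioural difference from LIKE '%mobile%': a full-text index matches words, not arbitrary substrings, so "mobile" inside a longer word such as "automobile" will no longer match.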
I'll give you some tips that will prove useful in many situations when querying a large dataset, or really any dataset.
Listing the fields you want instead of querying for '*' is better practice. The weight of this grows as you have more columns and more rows.
Always try to use the PKs to look up the data. The more specific the filter, the less it will cost.
An index in this kind of situation comes in pretty handy, as it makes the search more agile.
LIKE queries are generally pretty slow and resource-heavy, even more so in your situation. So again, the more specific you are, the better it will get.
I'd also add that if you just want to retrieve data from these tables again and again, maybe a VIEW would fit nicely.
Those are just some tips that came to my mind to ease your problem.
Hope it helps.
I run a SQL database with some user information.
On the main page, I would like to show some statistics about the database, and what I thought would be easy at first turned out to be complicated for me (I'm a newbie).
To give a practical example of what I'm trying to achieve, I will use a real situation:
In my CLIENTS table, my clients are from different countries (represented by a country code). One of the statistics I'm trying to show is WHICH COUNTRY HAS THE MOST CLIENTS.
Is there a simple way to find this kind of information? I understand I can simply count how many occurrences of a certain country I have in the TABLE, but I would need to compare with every country to check which one hosts the most clients.
I guess that sums up my question.
EDIT: I came up with a solution using PHP, but I'm just not sure if it's the best one. I did a loop over each country, checking its number of clients and comparing it to the previous one. If the count was higher, I updated the $higher_country var; if not, I just moved on to the next country. Would that be my only option?
You can do something like...
SELECT country_id, COUNT(country_id) AS nmbr
FROM clients
GROUP BY country_id
ORDER BY nmbr DESC
LIMIT 1
This counts the occurrences of each country value, orders them in reverse (highest first), and just picks the first record.
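One thing to be aware of: if two countries tie for first place, LIMIT 1 picks one of them arbitrarily. If you want all tied countries, a HAVING clause against the maximum count works:

SELECT country_id, COUNT(country_id) AS nmbr
FROM clients
GROUP BY country_id
HAVING COUNT(country_id) = (
    SELECT COUNT(country_id)
    FROM clients
    GROUP BY country_id
    ORDER BY COUNT(country_id) DESC
    LIMIT 1
);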
I'm working on an application which is a large database of chemical substances (approx 250,000 but rising) and associated data. I'm looking at ways to optimise the way searching is performed.
The application is running under PHP 7.0.27, MariaDB 5.5.56, and Apache 2.4.6
The application allows searching by chemical name and various chemical codes (such as EC number and CAS number). The schema is such that there are separate tables to hold the data, and the relationships of which codes apply to which chemicals.
These tables are in the database:
substances - unique ID and name for each chemical substance.
ecs - a list of EC Numbers
ecs_substances - which EC Number(s) apply to which substances
cas - a list of CAS Numbers
cas_substances - which CAS Number(s) apply to which substances
Note: there are other tables than the ones above where similar logic will apply, but for now I want to focus on these for this example.
It is possible for a substance to have multiple EC/CAS numbers, and a small number do not have them - i.e. it's not a simple 1:1 relationship.
The application has search fields for the substance name (substances.name), EC number (ecs.value) and CAS number (cas.value). These can be used on their own, or in conjunction with each other. For example: find a substance by name, or find a substance by name and CAS number.
I believe the "quickest" way of performing a search for any given value would be to use a LIKE condition on the specific table required. So if I want to look up substances which have "acids" as part of the name:
SELECT id FROM substances WHERE name LIKE '%acids%' LIMIT 0,250
However the results that the application gives are shown in a table which includes headings for substance name, CAS number, EC number. It also allows the results to be ordered on a column (e.g. order by substance name, CAS, EC, etc). This requires JOIN conditions.
I'm doing it like this:
$sql = 'SELECT
DISTINCT(substances.`id`),
substances.`name`,
"" AS cas_number,
"" AS ec_number
FROM
substances ';
// Search - EC Number, or if trying to order by EC column (JOIN has to occur to make that possible)
if ( (isset($search['ecNumber'])) || (isset($order['column']) && ($order['column'] == 'ec_number')) ) {
$sql .= ' LEFT JOIN ecs_substances ON substances.id = ecs_substances.substance_id LEFT JOIN ecs ON ecs_substances.ec_id = ecs.id ';
}
// Search - CAS Number, or if trying to order by CAS column (JOIN has to occur to make that possible)
if ( (isset($search['casNumber'])) || (isset($order['column']) && ($order['column'] == 'cas_number')) ) {
$sql .= ' LEFT JOIN cas_substances ON cas_substances.substance_id = substances.id LEFT JOIN cas ON cas_substances.cas_id = cas.id ';
}
The problem is that because of all the JOINs that are occurring it's slowing down how quickly the results can be obtained.
Benchmark: The first query I posted which just uses a LIKE condition on 1 table will execute in 140ms, whereas it's taking 506ms for the same search criteria with all of the JOIN statements in the second block of code.
I'd like to know if there are ways to optimise this such that the time taken to present results to the user decreases.
It's worth mentioning that the results are displayed in DataTables and PHP is producing a JSON feed of the results. The LIMIT 0,250 is something the end user can override by setting results per page, but I'm happy to limit them to say no more than 500 per page.
Some things I've looked into are:
Caching the JSON. Not a big fan of this because the data is updated quite regularly. The data presented must always be what is in the database, not some cached copy.
Do a search on the required table as in the first code sample. Update the other columns with ajax. This would "appear" to give instant results on the column the user has searched and then quickly thereafter populate the other columns required by the DataTable. This seems incredibly fiddly to do and I don't know whether it's really a good idea.
Consider FULLTEXT, because it allows for much faster searching than LIKE with a leading wildcard %: MATCH(col) AGAINST('+acid' IN BOOLEAN MODE)
Sounds like you need a "many:many" mapping table. Tips on efficiency in such: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Consider using GROUP_CONCAT(cas) for providing a comma-separated list of CASs.
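A sketch of what that could look like against the schema described in the question. Joining two many:many mappings at once multiplies rows, hence the DISTINCT inside GROUP_CONCAT and the GROUP BY replacing the DISTINCT of the original query:

SELECT s.id,
       s.name,
       GROUP_CONCAT(DISTINCT cas.value) AS cas_numbers,
       GROUP_CONCAT(DISTINCT ecs.value) AS ec_numbers
FROM substances AS s
LEFT JOIN cas_substances AS cs ON cs.substance_id = s.id
LEFT JOIN cas ON cas.id = cs.cas_id
LEFT JOIN ecs_substances AS es ON es.substance_id = s.id
LEFT JOIN ecs ON ecs.id = es.ec_id
WHERE s.name LIKE '%acids%'
GROUP BY s.id, s.name
LIMIT 0, 250;

JSON does not seem practical, and even less so since you are only on MySQL 5.5.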
I think a response time of half a second is quite good, given what you want to do. I assume you have already done all the necessary database optimizations (storage engine, indexes, etc.)?
There are several things you could explore:
Prepare all possible searches and store them in a database for quick access. This may sound stupid, but it's how I often achieve fast searches. It's difficult for me to judge what the best way to do this with your data would be, but you could start by adding a TEXT column to your substances table and storing all the information about the substance in it: its name and all EC/CAS numbers, separated by something like '|' or any other character not used in searches. I would call that the 'search' column. Alternatively, you could make a new table just for searching, with that column and the id of the substance. Now you can offer one search input field for all three types of data and search in one column only (see the sketch after this list). Would that work for you? Would it be faster? Possibly, but I cannot guarantee it; it's quite easy to try, though. There is a disadvantage: you would have to update that column with every change in the database.
Use a proper search engine. Several are available for MariaDB; start at https://mariadb.com/kb/en/library/about-sphinxse. It basically does something far more advanced than what I described under point 1: prepare a database with data optimized for searching.
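The sketch promised under point 1, using the table and column names from the question (the search column name itself is just an example). CONCAT_WS skips NULLs, so substances without EC/CAS numbers still get a usable value:

ALTER TABLE substances ADD COLUMN search TEXT;

UPDATE substances AS s
SET s.search = CONCAT_WS('|',
    s.name,
    (SELECT GROUP_CONCAT(ecs.value SEPARATOR '|')
     FROM ecs_substances es
     JOIN ecs ON ecs.id = es.ec_id
     WHERE es.substance_id = s.id),
    (SELECT GROUP_CONCAT(cas.value SEPARATOR '|')
     FROM cas_substances cs
     JOIN cas ON cas.id = cs.cas_id
     WHERE cs.substance_id = s.id)
);

-- One input field, one column, no JOINs at search time
SELECT id FROM substances WHERE search LIKE '%acids%' LIMIT 0, 250;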
Still, a response of half a second would be something I could live with.
I just recently discovered sphinx search which I want to use for my PHP application. I have a table of geolocations where every record stores a country code. For every user who uses the search function to look up geopositions, I know which country he is from.
How would I reweight the results so that matches are sorted ascending by distance to the user's country? I have already calculated a distance matrix from each country to every other country, which I can access via SQL. The country information in the geolocation database is stored as a 2-letter ISO country code.
What is a good solution to this problem? I have heard about UDFs; are they applicable here? Is it possible to solve this more easily by reformatting my table?
Thank you very much.
The "easiest" way to solve this is to have coordinates for each country. You then store the coordinates for each record in the sphinx index, and when searching find the coordinates and us it in the search. This way sphinx caculates the distance dynamically.
Did you have coordinates like this to create the matrix? This also presupposes you are just using a single 'point' per country; if your matrix is more advanced, e.g. taking the closest point on the borders of each country (to make distances between oddly shaped countries better), then it won't work so well.
In theory you could perhaps do this with payloads, by using the country name as keywords and the distance in a payload (arranged specially so that close distances get a high weight), but that would probably be expensive to index and might not work all that well in practice.
I currently work for a social networking website.
My boss recently had the idea to show search results in random order instead of the normal order (registration date). The problem with that is simple and obvious: if you go from one page to the next, you see different results each time, as the list is re-randomized on every page load.
I had the idea to store the results in the database plus cookies, something like this:
A cookie containing a serialized version of the $_POST request (needed if we want to re-sort)
A table which serves as the base for the search: searches (id, user_id, creation_date)
A table which stores the results and their order: searches_results (search_id, order, user_id)
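A minimal DDL sketch of those two tables (types and index names are just one possibility):

CREATE TABLE searches (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id       INT UNSIGNED NOT NULL,
    creation_date DATETIME NOT NULL,
    INDEX idx_user (user_id)
);

CREATE TABLE searches_results (
    search_id INT UNSIGNED NOT NULL,
    `order`   INT UNSIGNED NOT NULL,  -- `order` is a reserved word, hence the backticks
    user_id   INT UNSIGNED NOT NULL,  -- the user found by this search
    PRIMARY KEY (search_id, `order`)
);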
The flow would look something like this:
After each search, I store the "where" in a cookie or session
Then I erase the previous search in "searches"
Then I delete previous results in "searches_results"
Then I insert a row into "searches" for the key
Then I insert each user row into "searches_results"
And finally I redirect the user to something like ?search_id=[search_key]
There is a big flaw here: performance... it is definitely possible for this to either take the system down or make it very slow.
Any idea of the best way to structure this?
What if, instead of ordering randomly, you ordered by some function whose order is known and repeatable, just non-obvious? You could seed such a function with some data from the search query to make it even less obvious that it repeats. This way, you can page back and forth through your results and always get what you expect. Music players use this sort of function for their shuffle feature (so that if you click back, you get the previous song, and if you click next again, you're back where you started). I'm sure you can divine some function to accomplish this... bitwise XORing ID values with some constant (from the query) and then sorting by the resulting number might be sufficient. I chose XOR arbitrarily because it's a trivially simple function that will give you repeatable and non-obvious results.
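A sketch of what that could look like in MySQL, with placeholder table and column names; 12345 stands in for the per-search seed (e.g. derived from a hash of the stored query), so every page of the same search sorts identically:

SELECT id, username
FROM users
-- WHERE ...the original search conditions...
ORDER BY id ^ 12345
LIMIT 0, 20;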
Hmm, maybe, but doesn't the XOR operator only tell you whether something is an exclusive OR? I mean, there is no mathematical operation here, as far as I know, though.
Sorry, I know this doesn't help, but I don't understand why your boss would want this?
I know that if I search for a person on a social network, then I want the results to be ordered by relevance and relevance only. I think that randomized results would just be frustrating for the user, but maybe that's just me.
For example, if I search for "John Smith", then the first batch of results had better be people named "John Smith". Then show me similar names near the end of the results. I don't want to search for "John Smith" and get "Jon Smithers" as my second result.
Well, I'm with Matt in asking "Why?"
I think rmeador has a good suggestion as well. You could pick a sort from a different field or some sort of algorithm, even just from the permutations of DESC/ASC on last-updated or some other result field.
The other option would be to do the initial search only once, return only the matching IDs, store the full ID string in the database, and then make each subsequent page a lookup against those IDs.
My two cents.
I can see a scenario where a randomized result set is useful, not for searching but for browsing profiles or artists or local events: it offers more exposure to those that wouldn't show up in a traditionally ordered search.