Quick search for a similar text - php

I maintain a public blog where users can publish their own posts. Some users have more than a thousand different texts, and they might not remember that they have already published a given text. I would like to help users avoid publishing duplicates.
Comparing texts for exact equality is not good enough: the user might have changed the text a little, changed the formatting, or copied it from a different program. So I need a quick way to estimate whether a similar text already exists in the database.
My technology stack includes PHP, MySQL and Redis. How can I solve my problem using those or other tools?

PHP has a function called similar_text, which you can use to calculate the number of matching characters or the similarity as a percentage.
http://php.net/manual/en/function.similar-text.php
You could then check if the given text is within a certain margin of older blog posts.
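A minimal sketch of that check, assuming the user's older posts are already loaded into an array keyed by post ID. The names findSimilarPosts and normalizeText, and the 80% threshold, are hypothetical choices for illustration. Keep in mind that the PHP manual gives the complexity of similar_text as O(N**3), so this is likely only practical for short texts or a small candidate set.

```php
<?php
// Hypothetical helper: reduce formatting differences before comparing.
function normalizeText(string $text): string
{
    $text = strtolower(strip_tags($text));
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: returns postId => similarity percent for every stored
// post that reaches the threshold, best match first.
function findSimilarPosts(string $newText, array $existingPosts, float $minPercent = 80.0): array
{
    $matches = [];
    $needle = normalizeText($newText);
    foreach ($existingPosts as $postId => $oldText) {
        // similar_text() writes the similarity percentage into its third argument.
        similar_text($needle, normalizeText($oldText), $percent);
        if ($percent >= $minPercent) {
            $matches[$postId] = $percent;
        }
    }
    arsort($matches); // highest similarity first, keys preserved
    return $matches;
}

$posts = [
    1 => 'How to bake bread at home without a bread machine',
    2 => 'My favourite hiking trails in the Alps',
];
$similar = findSimilarPosts("How to  bake bread at home\nwithout a bread machine!", $posts);
```

Here $similar should contain only post 1, which the application could then show to the user as a possible duplicate.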
If you don't want to compare the text itself, you could tag the posts, based either on the tags of the original blog or on the subject of the post, and then show users the posts they have already made with similar tags.

You can use MySQL's MATCH ... AGAINST on a column with a full-text index.
As an example:
SELECT `table`.*,
       MATCH(userText) AGAINST ('this is user input') AS relevancy
FROM `table` -- placeholder name; TABLE is a reserved word, hence the backticks
ORDER BY relevancy DESC;
This returns the results ordered by relevancy.
Don't forget to add a full-text index on the userText column.

Related

PHP MySQL based search with approximate matching

I would like to make my search feature smarter about typos and special characters in product names.
For example, we have a product named "Post-it" and we want to show it if users type "Post it" or "Postit".
Another example: we have a product named "bic clic stic", and we want to show it if the user searches for "bic clic stick", since it is a close match.
Our current query looks like this:
SELECT name, image, sku, description FROM products WHERE name like '%KEYWORD%' AND ....
Most methods for approaching this problem are not particularly efficient: they still require full table scans (although some optimizations are available).
The usual technical solution is an algorithm called Levenshtein distance (or, more generically, edit distance). It measures the distance between two strings as the number of single-character insertions, deletions and substitutions needed to turn one into the other, and it works quite well for the examples in your question.
MySQL has no built-in Levenshtein function, so it has to be added as a stored function; searching for "MySQL Levenshtein" turns up various implementations.
The resulting query would look like:
SELECT name, image, sku, description
FROM products
WHERE levenshtein(name, 'KEYWORD') <= 3; -- or some threshold value
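PHP ships its own levenshtein() function, so the same filter can also run in application code. The following is only a sketch, assuming the candidate names have already been fetched (ideally narrowed down by some cheaper condition first); fuzzyMatchNames and the threshold of 2 are hypothetical.

```php
<?php
// Hypothetical helper: keeps the names within $maxDistance edits of the
// keyword, closest first. The comparison is case-insensitive.
function fuzzyMatchNames(string $keyword, array $names, int $maxDistance = 3): array
{
    $hits = [];
    foreach ($names as $name) {
        $distance = levenshtein(strtolower($keyword), strtolower($name));
        if ($distance <= $maxDistance) {
            $hits[$name] = $distance;
        }
    }
    asort($hits); // smallest distance first, keys preserved
    return $hits;
}

$names = ['Post-it', 'bic clic stic', 'Stapler'];
$postit = fuzzyMatchNames('Postit', $names, 2);        // one missing "-"
$bic = fuzzyMatchNames('bic clic stick', $names, 2);   // one extra "k"
```

Both examples from the question are a single edit away from the stored name, so a small threshold is enough for them; longer names may need a threshold that scales with their length.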
Another approach is to pre-process the search term (according to custom rules you define, e.g. breaking it into separate words), join the resulting words with %, and search using MySQL's LIKE (or even REGEXP) operator.
This requires no add-on for MySQL and no rearranging of your existing tables. Plus, the rules can change dynamically to suit your application.
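One possible version of that pre-processing step is sketched below. The splitting rule (break on anything that is not a letter or digit) and the name buildLikePattern are assumptions chosen for illustration, not the only reasonable rule.

```php
<?php
// Hypothetical helper: turns a raw search phrase into a LIKE pattern such as
// "%bic%clic%stic%". Returns null when the phrase contains no usable words.
function buildLikePattern(string $search): ?string
{
    // Split on anything that is not a letter or digit ("Post-it" -> Post, it).
    // The words therefore never contain the LIKE wildcards % or _.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $search, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return null;
    }
    return '%' . implode('%', $words) . '%';
}

$pattern = buildLikePattern('Post it');
// The pattern would then be bound as a parameter, for example with PDO:
//   $stmt = $pdo->prepare('SELECT name, image, sku, description FROM products WHERE name LIKE ?');
//   $stmt->execute([$pattern]);
```

With this rule, "Post it" and "Post-it" both become %Post%it%, which matches the stored name "Post-it". A search for "Postit" becomes %Postit% and would still miss it, so that case needs an extra rule or an edit-distance check.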

Matching a user entered title to a category - large INNODB database

I have a large InnoDB database with over 2 million products in it. The 'products' table has the following fields: id, title, description, category.
There is also a MyISAM table called 'category' that contains a list of all categories used on the website. This has the following fields: id,name,keywords,parentid.
My question is more about the logic rather than code, but what I am trying to achieve is as follows:
When a user lists a new product on the site, as they are typing the description it should try to work out what category to put the product in (with good accuracy).
I tried this initially by using MySQL MATCH() to match the entered title against a list of keywords in the category table, but this was far from accurate.
A better idea seems to be to match the user-entered title against the titles of products already in the database, group the matches by category, and sort by the largest group. However, on an InnoDB database I obviously can't use fulltext, and with 2 million items I think it would be pretty slow anyway?
How would you do it? I guess it would need to work in a similar way to how Stack Overflow displays similar questions.
A fulltext index on 2 million records is a valid option if you are running on a decent server. The initial indexing will take a while, but searches should be reasonably fast; MySQL can take it.
InnoDB supports fulltext indexes as of MySQL 5.6.4. You should consider upgrading.
If upgrading is not an option, please see this previous answer of mine where I suggest a workaround.
For your use case, you may want to take a look at the WITH QUERY EXPANSION option:
It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”.
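Whichever full-text variant is used, the grouping step from the question (take the best-matching existing products and pick the category that occurs most often among them) is simple to do in PHP once the rows are fetched. This is a rough sketch, assuming each row is an associative array with a category key as in the products table; rankCategories is a hypothetical name, and the same tally could equally be done in SQL with GROUP BY.

```php
<?php
// Hypothetical helper: $rows are the top full-text matches for the typed
// title. Returns category => number of matches, most frequent first.
function rankCategories(array $rows): array
{
    $counts = array_count_values(array_column($rows, 'category'));
    arsort($counts); // largest group first, keys preserved
    return $counts;
}

// Example rows as they might come back from the products table.
$rows = [
    ['id' => 10, 'title' => 'Red mountain bike', 'category' => 5],
    ['id' => 11, 'title' => 'Mountain bike helmet', 'category' => 9],
    ['id' => 12, 'title' => 'Kids mountain bike', 'category' => 5],
];
$ranked = rankCategories($rows);
$suggestedCategory = array_key_first($ranked);
```

Limiting the input to the top few dozen matches keeps the tally cheap and stops weak matches from outvoting strong ones.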

Improving Full Text Search MYSQL

I have a search-engine-based site that is currently in beta: http://www.jobportfolio.co.uk. The site has a job table with the following fields: job_company, job_title, job_description, job_location. All the fields are VARCHAR except for the description, which is a TEXT field. All the fields have FULLTEXT indexes.
My current approach is to search based on the title, location and company. This seems to work fine, but I would like to improve the search results by adding in the description field. The problem is that when I add the description field, the search takes a lot longer. Even with a table that contains only 12,000 rows, it is slow.
I am using the following MATCH AGAINST query to select the results:
MATCH(job_posts.job_title, job_company) AGAINST('".$this->mysqli_escape($job_title)."' IN BOOLEAN MODE)
Does anyone have any opinions on how to improve the performance of the search?
Hm, my first thought is to approach this problem from the "outside": is it acceptable to have a search form that uses multiple different fields? If you're willing to have 4 search strings that each search in a different column, I suspect that will reduce load by itself. For example:
When someone types in the "location" field, you add a clause to the query that matches the searched text against the location field only.
When someone types in the "description" field, you add a clause to the query that matches the search text against the description field. Otherwise you don't match anything against the description field.
If you don't need to be able to enter text into one place and search "all possible fields" for it, this solution will prevent extra slowness until someone specifically wants to search in the description text. So the query speed varies based on the searcher's needs.
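The per-field idea can be sketched as a small query builder that only adds a MATCH clause for the form fields the user actually filled in. The table and column names come from the question; the function name, the form field names, and the assumption that each column has its own FULLTEXT index are hypothetical.

```php
<?php
// Hypothetical builder: returns [sql, params] for a prepared statement.
// The description column is only searched when that field is filled in.
function buildJobSearch(array $input): array
{
    // Form field => column (column names taken from the question).
    $columns = [
        'title'       => 'job_title',
        'company'     => 'job_company',
        'location'    => 'job_location',
        'description' => 'job_description',
    ];
    $clauses = [];
    $params = [];
    foreach ($columns as $field => $column) {
        $value = trim((string) ($input[$field] ?? ''));
        if ($value === '') {
            continue; // field left empty: do not touch this column
        }
        $clauses[] = "MATCH($column) AGAINST (? IN BOOLEAN MODE)";
        $params[] = $value;
    }
    $sql = 'SELECT * FROM job_posts';
    if ($clauses) {
        $sql .= ' WHERE ' . implode(' AND ', $clauses);
    }
    return [$sql, $params];
}

[$sql, $params] = buildJobSearch(['location' => 'Leeds', 'description' => '']);
```

Binding the search text as a parameter also replaces the manual escaping used in the original query.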

PHP MySQL Search Suggestions

My web application will have several users, each with their own content uploaded to it. Every piece of content has a title, a description and tags (keywords). I can write a search script to search for content or a user name, but when the keywords are misspelled it doesn't return any results. For example, if there is a user named "Michael" in the database and the search query is "Micheal", I should get "Did you mean to search for 'Michael'?", in other words a search suggestion.
The suggestions should also cover the content uploaded by users. A user may title their content "Michael's activities May 2011", and suggestions should be generated for the individual words.
You could use SOUNDEX to search for similar-sounding names, like this:
SELECT * FROM users WHERE SOUNDEX(name) = SOUNDEX(:input)
or like this:
SELECT * FROM users WHERE name SOUNDS LIKE :input
(which is completely equivalent)
Edit: if you need to use an algorithm other than Soundex, as Martin Hohenberg suggested, you would need to add an extra column to your table, called, for example, sound_equivalent. (This is actually a more efficient solution, as this column can be indexed.) The query would then be:
SELECT * FROM users WHERE sound_equivalent = :input_sound_equivalent
The content of the sound_equivalent column can then be generated with a PHP algorithm, and inserted in the table with the rest of user parameters.
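As one possible choice of algorithm, PHP's built-in metaphone() could generate the column value. This is a sketch only: soundEquivalent is a hypothetical helper, and any other phonetic function would slot in the same way, as long as the same function is used when storing the user and when searching.

```php
<?php
// Hypothetical helper: the value stored in the sound_equivalent column and
// also computed for the search input.
function soundEquivalent(string $name): string
{
    return metaphone(trim($name));
}

$stored = soundEquivalent('Michael');   // saved alongside the user row
$searched = soundEquivalent('Micheal'); // bound to :input_sound_equivalent
```

Because metaphone() ignores vowels that are not at the start of the word, the misspelled and the correct name produce the same key here, so the indexed equality lookup finds the user.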
You can also use the PHP pspell extension to get spelling suggestions when a search returns no results.
Alternatively, create a table of the most common words (like: dog, house, city, numbers, water, internet). It doesn't need to be big (fewer than 10,000 words).
Then, when you explode the search term into words, check the word table for words LIKE the search terms, and echo the matches as suggestions.
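A LIKE lookup only finds words that contain the typed text, so it would not map "Micheal" to "Michael". The sketch below swaps in a different technique for the lookup step, ranking the word list by edit distance with PHP's levenshtein(); suggestWord, the in-memory word list and the distance limit of 2 are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the dictionary word closest to $input by edit
// distance, or null when the input is spelled correctly or nothing is close.
function suggestWord(string $input, array $dictionary, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $word) {
        $distance = levenshtein(strtolower($input), strtolower($word));
        if ($distance === 0) {
            return null; // exact match: nothing to suggest
        }
        if ($distance < $bestDistance) {
            $best = $word;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Words taken from user names and content titles.
$dictionary = ['activities', 'Michael', 'May'];
$suggestion = suggestWord('Micheal', $dictionary);
if ($suggestion !== null) {
    echo "Did you mean to search for '$suggestion'?";
}
```

For a multi-word query, the same function can be applied to each exploded word in turn.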

Need a PHP MySQL script to search for keywords in a database

I need to implement a search option for user comments that are stored in a MySQL database. I would optimally like it to work in a similar manner to a standard web page search engine, but I am trying to avoid the large scale solutions. I'd like to just get a feel for the queries that would give me decent results. Any suggestions? Thanks.
It's possible to create a full indexing solution with some straightforward steps. You could create a table that maps words to each post, then when you search for some words find all posts that match.
Here's a short algorithm:
When a comment is posted, convert the string to lowercase and split it into words (split on spaces, and optionally dashes/punctuation).
In a "words" table store each word with an ID, if it's not already in the table. (Here you might wish to ignore common words like 'the' or 'for'.)
In an "indexedwords" table map the IDs of the words you just inserted to the post ID of the comment (or article if that is what you want to return).
When searching, split the search term on words and find all posts that contain each of the words. (Again here you might want to ignore common words.)
Order the results by number of occurrences. If the results must contain all the words, you'd need to find the intersection of your different arrays of posts.
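The steps above can be sketched in plain PHP, with an in-memory array standing in for the "words" and "indexedwords" tables. The function names, the stop-word list and the requirement that a result contain every search word are assumptions for illustration.

```php
<?php
// Lowercase, split on non-alphanumerics, and drop a few common words.
function tokenize(string $text): array
{
    $stopWords = ['the', 'for', 'a', 'and'];
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($words, $stopWords));
}

// $index maps word => [postId => number of occurrences].
function indexComment(array &$index, int $postId, string $comment): void
{
    foreach (tokenize($comment) as $word) {
        $index[$word][$postId] = ($index[$word][$postId] ?? 0) + 1;
    }
}

// Returns the IDs of posts containing every search word, most occurrences first.
function search(array $index, string $query): array
{
    $scores = null;
    foreach (tokenize($query) as $word) {
        $postings = $index[$word] ?? [];
        if ($scores === null) {
            $scores = $postings; // first word: start from its posts
            continue;
        }
        // Keep only posts that also contain this word, summing the counts.
        $next = [];
        foreach ($scores as $postId => $count) {
            if (isset($postings[$postId])) {
                $next[$postId] = $count + $postings[$postId];
            }
        }
        $scores = $next;
    }
    $scores = $scores ?? [];
    arsort($scores);
    return array_keys($scores);
}

$index = [];
indexComment($index, 1, 'The red car is red');
indexComment($index, 2, 'A red bike');
indexComment($index, 3, 'Blue car');
$redPosts = search($index, 'red');        // posts 1 and 2, post 1 first
$redCarPosts = search($index, 'red car'); // only post 1 has both words
```

In the database version, the per-word lookups become queries against the "indexedwords" table, and the intersection can be expressed with GROUP BY and HAVING on the number of matched words.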
As an entry point, you can use MySQL LIKE queries.
For example if you have a table 'comments' with a column named 'comment', and you want to find all comments that contain the word 'red', use:
SELECT comment FROM comments WHERE comment LIKE '% red %';
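Note that this pattern only matches 'red' with a space on both sides, so it misses the word at the very start or end of a comment or next to punctuation. Also, a LIKE search with a leading wildcard cannot use an index and has to scan the whole table, so if your database is very large or if you run this query a lot, you will want to find an optimized solution, such as Sphinx (http://sphinxsearch.com).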
