Searching for keywords in mysql database most efficiently

I am working on a project where I have a database containing a summary field, which visitors to the site fill in through a web form.
When the user finishes entering the summary, I want to use the words they entered to look up similar records in the database that contain the same keywords.
I was thinking I could split the summary string that is submitted and then loop through the array and build up a query so the query would end up something like:
SELECT *
FROM my_table
WHERE summary LIKE '%keyword1%'
OR summary LIKE '%keyword2%'
OR summary LIKE '%keyword3%';
However, this seems massively inefficient, and as the database could grow quite big, could potentially become quite a slow query to run.
I then found the MySQL IN clause, but this only seems to work with multiple values where a field can only contain 1 of these values in a row.
Is there a way I can use the IN function, or is there a better MySQL function that I can use to do what I want, or is my first idea the only way round it?
An example of what I am trying to achieve is a bit like on Stack Overflow. When you lose focus of the title field, it pops up similar questions based on the title you've provided.

I would recommend reading the MySQL manual pages on InnoDB FULLTEXT Indexes and on Full-Text Restrictions. Full-text search was historically available only for MyISAM tables; MySQL 5.6 added FULLTEXT index support for InnoDB tables as well.
If you cannot upgrade your MySQL version, there is no reason why you cannot mix MyISAM and InnoDB tables in the same database. You would keep the textual information in MyISAM tables (where FTS indexes have historically been available) and join to InnoDB tables when needed. This avoids the "must upgrade to version 5.6" argument.
Legend: FTS=Full Text Search
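
As a rough sketch of how the submitted summary could be turned into a full-text query instead of a chain of LIKE conditions, the snippet below builds a MATCH ... AGAINST statement and its parameters. It is written in Python purely for illustration (the same string handling carries over to PHP). The function name, the min_len cutoff, the LIMIT of 10 and the %s placeholder style (as used by common Python MySQL drivers) are assumptions, and it presumes a FULLTEXT index already exists on my_table.summary.

```python
import re


def build_fulltext_query(summary, table="my_table", column="summary", min_len=3):
    """Return (sql, params) for a MySQL natural-language full-text search.

    Assumes a FULLTEXT index exists on the column, e.g. created with
    ALTER TABLE my_table ADD FULLTEXT INDEX (summary).
    The table and column names must come from trusted code, never from
    user input, because they are interpolated into the statement.
    """
    # Keep word-like tokens only; very short tokens are dropped because
    # MySQL ignores words below its configured minimum length anyway.
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9_]+", summary)
             if len(w) >= min_len]
    # Remove duplicates while preserving the order of first appearance.
    keywords = list(dict.fromkeys(words))
    search = " ".join(keywords)
    sql = (
        f"SELECT *, MATCH({column}) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        f"FROM {table} "
        f"WHERE MATCH({column}) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        f"ORDER BY score DESC LIMIT 10"
    )
    # The search string is passed twice: once for the score, once for the filter.
    return sql, (search, search)
```

The returned pair would be handed to a driver's execute(sql, params) call, so the user's text never ends up inside the SQL string itself.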

Related

Alert that a similar data is already entered

I am trying to figure out a way to detect whether a similar input has already been entered into a MySQL database.
I do not mean a duplicate entry, but one that is similar without being exact. When data entry staff enter a name, its spelling may differ from one person to another depending on how they hear it pronounced, so I need my PHP code to detect whether a similar entry already exists and warn the staff to double-check whether it is the same name.

This cannot be done with MySQL alone; it needs to be handled by your code. Before inserting any data into a table, a query needs to be formulated and executed to find potentially similar data. The definition of what counts as similar depends strongly on your business needs.
Fuzzy search is not a strength of MySQL, but you might get away with one or several LIKE conditions. Searches for variations up to a Levenshtein distance of two (maybe three) can be done by calculating the possible variations with wildcard placeholders and combining them into one query. This approach is described in more detail by Gordon Lesti. Depending on the complexity of those queries and the traffic on the system, this has great potential to bring a MySQL instance down quite easily.
If performance is an issue, a search engine like Elasticsearch might help. When inserting data into MySQL, the same data would also be added as a document to Elasticsearch. This would allow you to search for similar records using the fuzzy search capabilities of Elasticsearch, which performs far better than MySQL at this kind of search.
If the transactional safety of MySQL is not a requirement for the application, one might also opt to replace MySQL with Elasticsearch and use it not only for searching but also for persistence.
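
To make the Levenshtein idea above concrete, here is a minimal sketch of an application-side check, shown in Python for illustration. It compares a new name against names already fetched from the database; the function names, the threshold of 2 and the sample names are assumptions, and comparing against every stored name only suits small tables.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar_names(candidate, existing_names, max_distance=2):
    """Return (name, distance) pairs within max_distance, closest first."""
    wanted = candidate.strip().lower()
    hits = []
    for name in existing_names:
        distance = levenshtein(wanted, name.strip().lower())
        if distance <= max_distance:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical names already stored in the table.
stored = ["Mohammed", "Muhammad", "Michael"]
print(similar_names("Mohamed", stored))  # [('Mohammed', 1)]
```

If the list of hits is not empty, the form can warn the staff member before the INSERT is executed.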

Will my searches be slow in my database design?

I'm building a database of IT candidates for a friend who owns a recruitment company. He has thousands of candidates currently stored in an Excel spreadsheet and I'm converting it into a MySQL database.
Each candidate has a skill field with their skills listed as a string e.g. "javascript, php, nodejs..." etc.
My friend will have employees under him who will also search the database. For security reasons, however, we want to limit their search results to candidates with specific skills, depending on which vacancy they are working on (so they can't steal large sections of the database and go and set up their own recruitment company with the data).
So if an employee is working on a javascript role, they will be limited to search results where the candidate has the word "javascript" in their skills field. So if they searched for all candidates named "Michael" then it would only return "Michaels" with javascript skills for instance.
My concern is that the searches might take too long, since every search must scan the skills field, which can sometimes be a long string.
Is my concern justified? If so is there a way to optimize this?

If the number of records is in the thousands, you probably won't have any speed issues (just make sure you're not querying more often than you should).
You've tagged this question with a 'mysql' tag, so I'm assuming that's the database you're using. Make sure you add a FULLTEXT index to speed up the search. Please note, however, that this type of index is only available for InnoDB tables starting with MySQL 5.6.
Try the built-in search first, but if you find it to be too slow, or not accurate enough in its results, you can look at external full-text search engines. I've personally had very good experience with the Sphinx search server, where it easily indexed millions of text records and returned good results.

Your queries will require a full table scan (unless you use a full text index). I highly recommend that you change the data structure in the database by introducing two more tables: Skills and CandidateSkills.
The first would be a list of available skills, containing rows such as:
SkillId  SkillName
1        javascript
2        php
3        nodejs
The second would say which skills each person has:
CandidateId  SkillId
1            1
2            1
2            2
This will speed up the searches, but that is not the primary reason. The primary reason is to fix problems and enable functionality such as:
Preventing spelling errors in the list of searches.
Providing a basis for enabling synonym searches.
Making sure thought goes into adding new skills (because they need to be added to the Skills table).
Allowing the database to scale.
If you attempt to do what you want using a full text index, you will learn a few things. For instance, the default minimum word length is 4 (for MyISAM; InnoDB defaults to 3), which would be a problem if your skills include "C" or "C++". MySQL doesn't support synonyms, so you'd have to muck around to get that functionality. And you might get unexpected results if you have skills that are multiple words.
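
The two-table layout described above can be tried out with the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the Candidates table, the candidate names, the index name and the search_candidates helper are assumptions made for the example. The join itself is plain SQL and should carry over to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.executescript("""
CREATE TABLE Candidates (CandidateId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Skills (SkillId INTEGER PRIMARY KEY, SkillName TEXT NOT NULL UNIQUE);
CREATE TABLE CandidateSkills (
    CandidateId INTEGER NOT NULL REFERENCES Candidates(CandidateId),
    SkillId     INTEGER NOT NULL REFERENCES Skills(SkillId),
    PRIMARY KEY (CandidateId, SkillId)
);
-- Lets the database go from a skill to its candidates without a scan.
CREATE INDEX idx_candidateskills_skill ON CandidateSkills (SkillId, CandidateId);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO Candidates VALUES (?, ?)",
                 [(1, "Michael Adams"), (2, "Michael Brown"), (3, "Sarah Clark")])
conn.executemany("INSERT INTO Skills VALUES (?, ?)",
                 [(1, "javascript"), (2, "php"), (3, "nodejs")])
conn.executemany("INSERT INTO CandidateSkills VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2), (3, 2)])


def search_candidates(conn, name_prefix, required_skill):
    """Names starting with name_prefix, restricted to one required skill."""
    rows = conn.execute(
        """
        SELECT c.Name
        FROM Candidates c
        JOIN CandidateSkills cs ON cs.CandidateId = c.CandidateId
        JOIN Skills s ON s.SkillId = cs.SkillId
        WHERE s.SkillName = ? AND c.Name LIKE ?
        ORDER BY c.Name
        """,
        (required_skill, name_prefix + "%"),
    ).fetchall()
    return [row[0] for row in rows]


print(search_candidates(conn, "Michael", "php"))  # ['Michael Brown']
```

The required skill would be fixed on the server side from the vacancy the employee is assigned to, so it is never something the employee can change in the search form.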

Simulate MySQL connection to analyze queries to rebuild table structure (reverse-engineering tables)

I have just been tasked with recovering/rebuilding an extremely large and complex website that had no backups and was fully lost. I have a complete (hopefully) copy of all the PHP files however I have absolutely no clue what the database structure looked like (other than it is certainly at least 50 or so tables...so fairly complex). All data has been lost and the original developer was fired about a year ago in a fiery feud (so I am told). I have been a PHP developer for quite a while and am plenty comfortable trying to sort through everything and get the application/site back up and running...but the lack of a database will be a huge struggle. So...is there any way to simulate a MySQL connection to some software that will capture all incoming queries and attempt to use the requested field and table names to rebuild the structure?
It seems to me that if I start clicking through the application and it passes a query for
SELECT name, email, phone FROM contact_table WHERE contact_id='1'
...there should be a way to capture that info and assume there was a table called "contact_table" that had at least 4 fields with those names... If I can do that repetitively, each time adding some sample data to the discovered fields and then moving on to another page, then eventually I should have a rough copy of most of the database structure (at least all public-facing parts). This would be MUCH easier than manually reading all the code and pulling out every reference, reading all the joins and subqueries, and sorting through it all manually.
Anyone ever tried this before? Any other ideas for reverse-engineering the database structure from PHP code?

mysql> SET GLOBAL general_log=1;
With this configuration enabled, the MySQL server writes every query to a log file (datadir/hostname.log by default), even those queries that have errors because the tables and columns don't exist yet.
http://dev.mysql.com/doc/refman/5.6/en/query-log.html says:
The general query log can be very useful when you suspect an error in a client and want to know exactly what the client sent to mysqld.
As you click around in the application, it should generate SQL queries, and you can have a terminal window open running tail -f on the general query log. As you see queries that reference tables or columns that don't exist yet, create those tables and columns. Then repeat clicking around in the app.
A number of things may make this task even harder:
If the queries use SELECT *, you can't infer the names of columns or even how many columns there are. You'll have to inspect the application code to see what column names are used after the query result is returned.
If INSERT statements omit the list of column names, you can't know what columns there are or how many. On the other hand, if INSERT statements do specify a list of column names, you can't know if there are more columns that were intended to take on their default values.
Data types of columns won't be apparent from their names, nor string lengths, nor character sets, nor default values.
Constraints, indexes, primary keys, foreign keys won't be apparent from the queries.
Some tables may exist (for example, lookup tables), even though they are never mentioned by name by the queries you find in the app.
Speaking of lookup tables, many databases have sets of initial values stored in tables, such as all possible user types and so on. Without the knowledge of the data for such lookup tables, it'll be hard or impossible to get the app working.
There may have been triggers and stored procedures. Procedures may be referenced by CALL statements in the app, but you can't guess what the code inside triggers or stored procedures was intended to be.
This project is bound to be very laborious, time-consuming, and involve a lot of guesswork. The fact that the employer had a big feud with the developer might be a warning flag. Be careful to set the expectations so the employer understands it will take a lot of work to do this.
PS: I'm assuming you are using a recent version of MySQL, such as 5.1 or later. If you use MySQL 5.0 or earlier, you should just add log=1 to your /etc/my.cnf and restart mysqld.

Crazy task. Is the code such that the DB queries are at all abstracted? Could you replace the query functions with something which would log the tables, columns and keys, and/or actually create the tables or alter them as needed, before firing off the real query?
Alternatively, it might be easier to do some text processing, regex matching, grep/sort/uniq on the queries in all of the PHP files. The goal would be to get it down to a manageable list of all tables and columns in those tables.
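
A minimal sketch of that text-processing idea follows, in Python for illustration. It takes query strings (gathered from the general query log or grepped out of the PHP files) and collects the table and column names they mention. The regular expressions and the infer_schema name are assumptions for this example, and the pattern deliberately handles only simple single-table SELECT statements; joins, subqueries, INSERT and UPDATE statements would need more work or a real SQL parser.

```python
import re
from collections import defaultdict

# Matches "SELECT cols FROM table [WHERE conditions]" for one table only.
SELECT_RE = re.compile(
    r"\bSELECT\s+(?P<cols>.+?)\s+FROM\s+`?(?P<table>\w+)`?"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
# A word directly followed by a comparison operator is taken as a column.
WHERE_COL_RE = re.compile(
    r"`?(\w+)`?\s*(?:=|<>|!=|<=|>=|<|>|\bLIKE\b|\bIN\b)", re.IGNORECASE
)


def infer_schema(queries):
    """Map each table name to the sorted column names seen in the queries."""
    schema = defaultdict(set)
    for query in queries:
        match = SELECT_RE.search(query.strip())
        if not match:
            continue  # not a simple single-table SELECT; skip it
        columns = schema[match.group("table")]  # records the table even for SELECT *
        for col in match.group("cols").split(","):
            col = col.strip().strip("`")
            if re.fullmatch(r"\w+", col):  # ignore *, expressions and aliases
                columns.add(col)
        if match.group("where"):
            columns.update(WHERE_COL_RE.findall(match.group("where")))
    return {table: sorted(cols) for table, cols in schema.items()}


print(infer_schema(["SELECT name, email, phone from contact_table WHERE contact_id='1'"]))
# {'contact_table': ['contact_id', 'email', 'name', 'phone']}
```

The output is only a list of names; data types, keys and defaults still have to be worked out from the application code.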

I once had a similar task; fortunately I was able to find an old backup.
If you could find a way to extract the queries, say by regex-matching all of the occurrences of mysql_query or whatever extension was used to query the database, you could then use something like php-sql-parser to parse the queries, and hopefully from that you would be able to get a list of most tables and columns. However, that is only half the battle. The other half is determining the data types for every single column, and that would be close to impossible to do automatically from PHP. It would basically require you to inspect the code line by line. There are best practices, but who's to say that the old dev followed them? Whether a column called "date" should be stored as DATE, DATETIME, INT, or VARCHAR(50) with some sort of ugly manual string handling can only be determined by looking at the actual code.
Good luck!

You could build some triggers with the BEFORE action time, but unfortunately this will only work for INSERT, UPDATE, or DELETE commands.
http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html

Autocomplete concept

I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience in adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google has, without having many users to contribute to the creation of the data (the most common queries)? Is there some kind of open-source SQL table with autocomplete data in it or something similar?

For now, use the static data that you have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is submitted by another user (a kind of rank). Build an n-gram index over the queries, so that you can also auto-complete something like "Manchester United" when a person just types "United" (that is, not only on the starting string), and simply return the top N matches sorted by count.
The above table will keep improving gradually as your user base grows.
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge lies in returning the data to be displayed within a fraction of a second. So when your query store grows, you can use a search engine like Solr or Sphinx to do the searching, which will return the results to be rendered very quickly.
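
The query table described above might look like the following sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name search_queries, the helper names and the sample queries are assumptions; the counter column is called hit_count here rather than count to avoid clashing with the SQL COUNT function; and a plain LIKE '%...%' substring match stands in for a real n-gram index, which is fine for a small table but will not scale the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.execute("""
    CREATE TABLE search_queries (
        queryID   INTEGER PRIMARY KEY,
        query     TEXT NOT NULL UNIQUE,
        hit_count INTEGER NOT NULL DEFAULT 1
    )
""")


def record_query(conn, text):
    """Store a submitted query, or bump its counter if it is already known."""
    normalized = " ".join(text.lower().split())  # fold case and whitespace
    cur = conn.execute(
        "UPDATE search_queries SET hit_count = hit_count + 1 WHERE query = ?",
        (normalized,),
    )
    if cur.rowcount == 0:  # nothing updated, so this query is new
        conn.execute("INSERT INTO search_queries (query) VALUES (?)", (normalized,))


def suggest(conn, typed, limit=5):
    """Most frequent stored queries that contain the typed text anywhere."""
    pattern = "%" + " ".join(typed.lower().split()) + "%"
    rows = conn.execute(
        "SELECT query FROM search_queries WHERE query LIKE ? "
        "ORDER BY hit_count DESC, query LIMIT ?",
        (pattern, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical searches submitted by users.
for q in ["Manchester United", "manchester united", "Manchester  United",
          "United Airlines", "Manchester City", "manchester city"]:
    record_query(conn, q)

print(suggest(conn, "United"))  # ['manchester united', 'united airlines']
```

Typing "United" surfaces "manchester united" first because it has been searched three times, even though the stored query does not start with the typed text.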

You can use the Lucene search engine for this functionality. Refer to this link,
or you may also take a look at Lucene Solr Autocomplete...

Google has thousands of entries (and keeps adding more), arranged according to day, time, geolocation, language and so on, and the set grows with what users enter. Whenever a user types a word, the system checks the table of the words most used for that location, day and time and, if that gives no answer, falls back to general words. So you should categorize every word entered by your users, or build a general word-relation table in your database from which the most suitable suggestion can be looked up.

Yesterday I stumbled on something that answered my question. Google serves autocomplete suggestions as XML from the URL below, so it is worth using if you have too few users to build your own keyword database:
http://google.com/complete/search?q=[keyword]&output=toolbar
Replacing [keyword] with a word returns suggestions for that word; the task is then just to parse the returned XML and format the output to suit your needs.
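
Parsing such a response could be sketched as follows, in Python for illustration. The sample XML is hypothetical: its shape (a toplevel element holding CompleteSuggestion elements, each with a suggestion element carrying a data attribute) is an assumption about what the endpoint returns, not something documented, so check a real response before relying on it. No network request is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def suggestion_url(keyword):
    """Build the request URL, escaping the keyword for use in a query string."""
    return "http://google.com/complete/search?" + urlencode(
        {"q": keyword, "output": "toolbar"}
    )


def parse_suggestions(xml_text):
    """Collect the data attribute of every <suggestion> element."""
    root = ET.fromstring(xml_text)
    return [el.get("data") for el in root.iter("suggestion") if el.get("data")]


# Hypothetical response body, assuming the shape described above.
sample = (
    "<toplevel>"
    '<CompleteSuggestion><suggestion data="mysql tutorial"/></CompleteSuggestion>'
    '<CompleteSuggestion><suggestion data="mysql workbench"/></CompleteSuggestion>'
    "</toplevel>"
)
print(parse_suggestions(sample))  # ['mysql tutorial', 'mysql workbench']
```

Fetching the URL on the server and returning the parsed list to the page keeps the browser from calling the third-party endpoint directly.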

How do I get this lightning fast search?

I just came over this site: http://www.hittaplagget.se. If you enter the following search word moo, the autosuggest pops up immediately.
But if you go to my site, http://storelocator.no, and use the same search phrase (in "Search for brand" field), it takes a lot longer for autosuggest to suggest anything.
I know that we can only guess at what type of technology they are using, but hopefully someone here can make a better educated guess than I can.
In my solution I only do a SELECT moo% FROM table and return the results.
I have not yet indexed my table, as there are only 7000 rows in it. But I'm thinking of indexing my tables using Lucene.
Can anyone suggest what I need to do in order to get equally fast autosuggest?

You must add an index on the column holding your search terms, even at 7000 rows; otherwise, the database searches through the whole list every time. See http://dev.mysql.com/doc/refman/5.0/en/create-index.html.

Lucene is a full text search index and may or may not be what you're looking for. Lucene would find any occurrence of "moo" in the entire indexed column (e.g. Mootastic and Fantasticmoo) and does not necessarily speed up your search although it's faster than a where x like '%moo%' type of search.
As others have already pointed out a regular index (probably even unique?) is what you want if you're performing "starts with" type of searches.

You will need to table-scan the table, so I suggest:
Don't put any rows in the table you don't need - for example, "inactive" records - keep them in a different table
Don't put any columns in the table you don't need
You can achieve this by having a special "Search table" which just contains the rows/columns you're interested in, and updating it from the "Master table".
Table-scanning a 7000 row table should be extremely efficient if the rows are small; I understand from your problem domain that this will be the case.
But as others have pointed out, don't send the 7000 rows to the client side when it doesn't need them.
A conventional index can optimise a LIKE 'someprefix%' into a range-scan, so it is probably helpful having one. If you want to search for the string in any part of the entry, it is going to be a table-scan (which should not be slow on such a tiny table!)
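
To illustrate why a conventional index helps with LIKE 'someprefix%' but not with a match in the middle of a string, here is a small sketch in Python. A sorted list plays the role of the index, and two binary searches find the contiguous range of entries that start with the prefix. The function name and the sample terms are assumptions, and the upper bound relies on Python comparing strings by code point.

```python
import bisect


def prefix_range(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    # First position whose term is >= prefix.
    lo = bisect.bisect_left(sorted_terms, prefix)
    # First position past every term that starts with prefix: the highest
    # code point appended to the prefix sorts after all such terms.
    hi = bisect.bisect_left(sorted_terms, prefix + chr(0x10FFFF))
    return sorted_terms[lo:min(hi, lo + limit)]


# Hypothetical brand names, kept sorted the way an index would keep them.
terms = sorted(["fantasticmoo", "moo", "moomin", "moon", "mootastic", "mop"])
print(prefix_range(terms, "moo"))  # ['moo', 'moomin', 'moon', 'mootastic']
```

"fantasticmoo" is not found because it does not sit inside the sorted range for "moo"; locating it means looking at every entry, which is the table scan described above.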
