I'm designing a MySQL database, and I'd like some input on an efficient way to store blog/article data for searching.
Right now, I've made a separate column that stores the content to be searched - no duplicate words, no words shorter than four letters, and no words that are too common. So, essentially, it's a list of keywords from the original article. Also searched would be a list of tags, and the title field.
I'm not quite sure how MySQL indexes FULLTEXT columns, so would storing the data like that be ineffective, or redundant somehow? A lot of the articles are on the same topic, so would the score be hurt by so many of the rows having similar keywords?
Also, for this project, solutions like Sphinx, Lucene or Google Custom Search can't be used; only PHP and MySQL.
Thanks!
EDIT - Let me clarify:
Basically, I'm asking which way FULLTEXT would provide the fastest, most relevant results: by finding many instances of the search term in all the data, or just the single keyword among a handful of other words.
I think a separate keywords table would be over the top for what I need, so should I forget the keywords column and search on the article, or continue to select keywords for each row?
You should build the word list (according to the rules you've specified) in a separate table and then map it to each article in a join table, along with the number of occurrences:
words: id | name
articles: id | title | content
articles_words: id | article_id | word_id | occurrences
Now you can scan through the join table and even rank the articles by the occurrence of the word, and probably place some importance on the order in which the words were typed in the search query string.
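As a rough illustration of the ranking described above, here is a minimal sketch. It assumes Python's built-in sqlite3 as a stand-in for MySQL (only so it runs without a server); the sample rows and the rank_articles helper are hypothetical, and the query itself is plain SQL that should carry over.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and column names
# follow the words / articles / articles_words layout described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, content TEXT);
    CREATE TABLE articles_words (
        id INTEGER PRIMARY KEY,
        article_id INTEGER REFERENCES articles(id),
        word_id INTEGER REFERENCES words(id),
        occurrences INTEGER
    );
    INSERT INTO words VALUES (1, 'mysql'), (2, 'index');
    INSERT INTO articles VALUES
        (1, 'Indexing basics', 'sample text'),
        (2, 'MySQL tuning', 'sample text');
    INSERT INTO articles_words (article_id, word_id, occurrences) VALUES
        (1, 1, 2), (1, 2, 5), (2, 1, 4);
""")

def rank_articles(conn, terms):
    """Return (article_id, total_occurrences) rows for the terms, best first."""
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        SELECT aw.article_id, SUM(aw.occurrences) AS score
        FROM articles_words AS aw
        JOIN words AS w ON w.id = aw.word_id
        WHERE w.name IN ({placeholders})
        GROUP BY aw.article_id
        ORDER BY score DESC, aw.article_id
    """
    return conn.execute(sql, list(terms)).fetchall()

print(rank_articles(conn, ["mysql", "index"]))
```

Summing occurrences is only one possible score; weighting by the order of the words in the query would need an extra factor per term.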
Of course, this is a very academic solution. I'm not sure what your project requires, but FULLTEXT indexing is very powerful and you're usually better off using it in practical situations.
HTH.
Related
I have a large InnoDB database with over 2 million products in it. The 'products' table has the following fields: id, title, description, category.
There is also a MyISAM table called 'category' that contains a list of all categories used on the website. This has the following fields: id,name,keywords,parentid.
My question is more about the logic rather than code, but what I am trying to achieve is as follows:
When a user lists a new product on the site, as they are typing the description it should try to work out what category to put the product in (with good accuracy).
I tried this initially by using MySQL MATCH() to match the entered title against a list of keywords in the category table, but this was far from accurate.
A better idea seems to be to match the user-entered title against titles for products already in the database, grouping them by the category they are in and then sorting them by the largest group. However, on an InnoDB database I obviously can't use fulltext, and with 2 million items I think it would be pretty slow anyway?
How would you do it? I guess it would need to work in a similar way to how Stack Overflow displays similar questions.
A fulltext index on 2 million records is a valid option if you are running on a decent server. The initial indexing will take a while, that's for sure, but searches should be reasonably fast; MySQL can take it.
InnoDB supports fulltext indexes as of MySQL 5.6.4. You should consider upgrading.
If upgrading is not an option, please see this previous answer of mine where I suggest a workaround.
For your use case, you may want to take a look at the WITH QUERY EXPANSION option:
It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few most highly relevant documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search finds the documents that contain the word “MySQL” even if they do not contain the word “database”.
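The grouping idea from the question (match the typed title against existing product titles, then take the category with the strongest group) can be sketched independently of the search backend. The Python below is a hypothetical illustration: suggest_category, the sample rows, and the word-overlap score are all assumptions; in practice the candidate rows and their relevance would come from a MATCH ... AGAINST query.

```python
from collections import defaultdict

def suggest_category(title, products):
    """Pick the category whose products share the most words with `title`.

    `products` is an iterable of (product_title, category) pairs, for example
    the top rows returned by a full-text query. The overlap count is a
    stand-in for a real relevance score.
    """
    wanted = set(title.lower().split())
    votes = defaultdict(int)
    for product_title, category in products:
        overlap = len(wanted & set(product_title.lower().split()))
        if overlap:
            votes[category] += overlap
    if not votes:
        return None
    # Highest total wins; ties are broken alphabetically to stay deterministic.
    return min(votes, key=lambda c: (-votes[c], c))

rows = [
    ("red leather sofa", "furniture"),
    ("leather office chair", "furniture"),
    ("red running shoes", "footwear"),
]
print(suggest_category("red leather armchair", rows))
```

Limiting the candidate rows to the best few hundred matches keeps the vote cheap even on a table with millions of products.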
I am building a comics website and have implemented a search. The search originally used the image title... This worked fine:
if (strtolower($input)==strtolower(substr($row['title'],0,strlen($input)))) {
But now I feel that many people won't remember the title. So I've added a column to the database called "keywords", VARCHAR(55), and my test entry is the string "five, second, rule, food, rules". I figured I could replace $row['title'] with $row['keywords'] and the search would still work.
It works if I start searching from the beginning of the string, for example "five, seco...", but not if I start from the middle of the string, like "second" or "rule".
Any ideas on how I can make that work?
Thanks!
Doing it in a comic's column will only bring tears as it breaks normalization. What you should do is make a keyword table that has one word per entry and then a pivot table that matches keywords to comics. Then you can just do a join to find all the comics that match.
It's more work to set up, but more flexible in the long run.
EDIT:
keyword table:
id | keyword
1 | x-men
2 | action
3 | mystery
4 | batman
etc.
comic_keyword table:
comic_id | keyword_id
45 | 3
678 | 1
678 | 2
77 | 3
77 | 4
etc.
The second table (the pivot table) matches the ids of the comic to the ids of the keywords they're associated with. That's how many-to-many relationships should be modeled in most cases.
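To make the join concrete, here is a small sketch using the sample rows above. It assumes Python's built-in sqlite3 in place of MySQL so it can run anywhere, and comics_for_keyword is a hypothetical helper name; the JOIN itself is the part that matters.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the rows are the sample
# keyword / comic_keyword entries listed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keyword (id INTEGER PRIMARY KEY, keyword TEXT UNIQUE);
    CREATE TABLE comic_keyword (
        comic_id INTEGER,
        keyword_id INTEGER REFERENCES keyword(id),
        PRIMARY KEY (comic_id, keyword_id)
    );
    INSERT INTO keyword VALUES
        (1, 'x-men'), (2, 'action'), (3, 'mystery'), (4, 'batman');
    INSERT INTO comic_keyword VALUES
        (45, 3), (678, 1), (678, 2), (77, 3), (77, 4);
""")

def comics_for_keyword(conn, word):
    """Return the ids of all comics tagged with `word` (case-insensitive)."""
    sql = """
        SELECT ck.comic_id
        FROM comic_keyword AS ck
        JOIN keyword AS k ON k.id = ck.keyword_id
        WHERE k.keyword = ?
        ORDER BY ck.comic_id
    """
    return [row[0] for row in conn.execute(sql, (word.lower(),))]

print(comics_for_keyword(conn, "mystery"))
```

Because each keyword lives in its own row, a search for "rule" no longer depends on where the word sits inside a comma-separated string.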
The straightforward solution would be to use stripos instead (note the argument order: the string being searched comes first, then the search term):
if (stripos($row['keywords'], $input) !== false) {
    // row matches
}
However, this isn't really a good solution. It would be much better to offload the filtering to your database, so non-matching rows don't have to make the trip to your front end at all. If you keep keywords as a comma-separated field then LIKE or REGEXP would be a good choice of tool; if you normalized your database schema so that the 1-to-many relationship of comics to keywords is modeled with a separate comic_keywords table then matching would be even easier.
For example, assuming that comic has an id and comic_keywords has a comic_id and a keyword, you could select matching comics with
SELECT DISTINCT comic_id FROM comic_keywords WHERE keyword = 'blah';
You can split the column on the fly:
if (in_array(strtolower($input), explode(', ', strtolower($row['keywords'])))) {
    /* found it */
}
Looks like you need Sphinx or (less likely, but easier to set up) MySQL's full-text search.
I want to write a tag based search engine in MySQL, but I don't really know how to get to a pleasant result.
I used LIKE, but as I stored over 18k keywords in the database, it's pretty slow.
What I got is a table like this:
id (INT, primary key) | article_cloud (TEXT) | keyword (VARCHAR(40), FULLTEXT index)
So I store one keyword per row and save all the referring article numbers in article_cloud.
I tried the MATCH() AGAINST() stuff, which works fine as long as the user types in the whole keyword. But I also want a suggest search, so that there are relevant articles popping up, while the user is typing. So I still need a similar statement to LIKE, but faster. And I have no idea what I could do.
Maybe this is the wrong concept of tag-based searching. If you know a better one, please let me know. I've been fighting with this for days and can't figure out a satisfying solution. Thanks for reading :)
MATCH() AGAINST() / FULLTEXT searching is a quick fix to a problem - but your schema makes no sense at all - surely there are multiple keywords in each article? And using a fulltext index on a column which only contains a single word is rather dumb.
and save all the referring article numbers in article_cloud
No! storing multiple values in a single column is VERY bad practice. When those values are keys to another table, it's a mortal sin!
It looks like you've got a long journey ahead of you to create something which will work efficiently; the quickest route to the goal is probably to use Google or Yahoo's indexing services on your own data. But if you want to fix it yourself....
See this answer on creating a search engine - the keywords should be in a separate table with a N:1 relationship to your articles, primary key on keyword and article id, e.g.
CREATE TABLE article (
    id INTEGER NOT NULL AUTO_INCREMENT,
    modified TIMESTAMP,
    content TEXT,
    ...
    PRIMARY KEY (id)
);
CREATE TABLE keyword (
    word VARCHAR(20),
    article_id INTEGER, /* references article.id */
    relevance FLOAT DEFAULT 0.5, /* allow users to record relevance of keyword to article */
    PRIMARY KEY (word, article_id)
);
CREATE TEMPORARY TABLE search (
    word VARCHAR(20),
    PRIMARY KEY (word)
);
Then split the words entered by the user, convert them to a consistent case (same as used for populating the keyword table) and populate the search table, then find matches using....
SELECT article.id, SUM(keyword.relevance)
FROM article, keyword, search
WHERE article.id = keyword.article_id
AND keyword.word = search.word
GROUP BY article.id
ORDER BY SUM(keyword.relevance) DESC
LIMIT 0,3;
It'll be a lot more efficient if you can maintain a list of words, or rules about words, NOT to use as keywords (e.g. ignoring any word of 3 characters or fewer in mixed or lower case will omit stuff like 'a', 'to', 'was', 'and', 'He'...).
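The split-and-normalise step, together with the short-word rule, might look like the sketch below. It is written in Python purely as an illustration; search_terms, MIN_LENGTH, and the optional stopwords argument are hypothetical names, and the same function would be used both when populating the keyword table and when filling the search table.

```python
import re

MIN_LENGTH = 4  # assumption: words of 3 characters or fewer are dropped

def search_terms(query, stopwords=frozenset()):
    """Split user input into distinct, lower-cased terms worth looking up.

    Short words are dropped unless they are written entirely in upper case,
    which keeps acronyms such as "SQL" while discarding "a", "to", "He".
    """
    terms = []
    for word in re.findall(r"[A-Za-z0-9']+", query):
        if len(word) < MIN_LENGTH and not word.isupper():
            continue
        word = word.lower()
        if word in stopwords or word in terms:
            continue
        terms.append(word)
    return terms

print(search_terms("He was reading about SQL and MySQL indexes"))
```

Each returned term would become one row in the temporary search table before running the SELECT.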
Have a look at Sphinx and Lucene
I tried the MATCH() AGAINST() stuff, which works fine as long as the user types in the whole keyword.
What do you think FULLTEXT means?
I had 40,000 entries in my table with no indexes (local use), and searches with LIKE '%SOMETHING%' took at most 0.1 seconds.
You can also LIMIT your query's output.
I need to implement a search option for user comments that are stored in a MySQL database. I would optimally like it to work in a similar manner to a standard web page search engine, but I am trying to avoid the large scale solutions. I'd like to just get a feel for the queries that would give me decent results. Any suggestions? Thanks.
It's possible to create a full indexing solution with some straightforward steps. You could create a table that maps words to each post, then when you search for some words find all posts that match.
Here's a short algorithm:
When a comment is posted, convert the string to lowercase and split it into words (split on spaces, and optionally dashes/punctuation).
In a "words" table store each word with an ID, if it's not already in the table. (Here you might wish to ignore common words like 'the' or 'for'.)
In an "indexedwords" table map the IDs of the words you just inserted to the post ID of the comment (or article if that is what you want to return).
When searching, split the search term on words and find all posts that contain each of the words. (Again here you might want to ignore common words.)
Order the results by number of occurrences. If the results must contain all the words, you'd need to find the intersection (not the union) of your different arrays of posts.
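The steps above can be sketched in memory as follows. This is only an illustration in Python: CommentIndex, tokenize, and the tiny STOPWORDS set are hypothetical, and the dictionary of counters plays the role of the "words" and "indexedwords" tables.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "for", "a", "is", "and"}  # hypothetical list of common words

def tokenize(text):
    """Lower-case the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [word for word in cleaned.split() if word not in STOPWORDS]

class CommentIndex:
    def __init__(self):
        # word -> Counter mapping post_id to number of occurrences
        self.postings = defaultdict(Counter)

    def add(self, post_id, text):
        for word in tokenize(text):
            self.postings[word][post_id] += 1

    def search(self, query, require_all=False):
        """Return post ids ordered by total occurrences of the query words."""
        words = tokenize(query)
        scores = Counter()
        for word in words:
            scores.update(self.postings.get(word, {}))
        if require_all:
            # Keep only posts that appear in every word's posting list.
            matching = [set(self.postings.get(word, {})) for word in words]
            keep = set.intersection(*matching) if matching else set()
            scores = Counter({p: s for p, s in scores.items() if p in keep})
        return sorted(scores, key=lambda p: (-scores[p], p))

index = CommentIndex()
index.add(1, "The red car is red")
index.add(2, "A blue car")
index.add(3, "Red-blue stripes")
print(index.search("red car"))
```

With require_all=True the same query returns only post 1, since it is the only post containing both words.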
As an entry point, you can use MySQL LIKE queries.
For example if you have a table 'comments' with a column named 'comment', and you want to find all comments that contain the word 'red', use:
SELECT comment FROM comments WHERE comment LIKE '% red %';
Please note that LIKE searches with a leading wildcard cannot use an index and have to scan every row, so they can be slow. If your database is very large or if you run this query a lot, you will want to find an optimized solution, such as Sphinx (http://sphinxsearch.com).
I have a table called artists. Within it, there is a field for the artist name (artist_name). Then there is a field for SEO friendly artist name, we'll call it search_name.
I have over 40,000 artists in this table. So, I'd like to convert all artists names to search friendly. What is the best way to accomplish this? Not looking for code here, just ideas.
This is what I have thus far. I'm just not sure if I should call all 40,000 artists, loop through them and update?
// Does this artist name have any symbols, apostrophes, etc. If so, strip them out
// Does this artist have a space (the beatles)? If so, replace with + (the+beatles).
// insert into search field
As 40,000 records aren't that much, I'd grab all of them and loop through them in memory. By doing it in memory, unique checks should be pretty fast.
In the end, I'd just chain the commands together like: $query .= "UPDATE artists SET search_name = '$generated_name[$i]' WHERE id = $id[$i];" (the generated name is a string, so it needs quotes and should be escaped first).
By the way: I'd replace spaces with a minus.
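A minimal sketch of the in-memory pass, written in Python for illustration only. make_search_name and assign_search_names are hypothetical helpers, and the numeric suffix used for duplicates is just one possible convention.

```python
import re
import unicodedata

def make_search_name(name):
    """Lower-case `name`, drop symbols, and join the words with hyphens."""
    # Fold accented letters to plain ASCII where possible ("é" becomes "e").
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop apostrophes outright so "N'" does not leave a stray separator.
    ascii_name = ascii_name.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name).strip("-")

def assign_search_names(artists):
    """Map artist id to a unique search name for an iterable of (id, name)."""
    seen = {}
    result = {}
    for artist_id, name in artists:
        base = make_search_name(name)
        count = seen.get(base, 0) + 1
        seen[base] = count
        # Assumption: repeats get a numeric suffix (-2, -3, ...).
        result[artist_id] = base if count == 1 else f"{base}-{count}"
    return result

print(assign_search_names([(1, "The Beatles"), (2, "the beatles!"), (3, "AC/DC")]))
```

The resulting mapping can then be written back with one UPDATE per row, or batched as described above. Names made up entirely of symbols come out empty and would need special handling.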
You could go through and create a secondary table two columns wide (id, safe) and insert it from there.
Query Table 1
Convert artist names to safe names
Insert into Table 2
Use the id of both tables to match them. This would only allow one-to-one matches, though, if id is the index; you may want to create a third column if you want multiple safe names for a single artist (id | artistID | artistName).
Please consider using a full-text search engine. For example, the free Sphinx search engine is quite flexible, extremely fast, and supports word stemming.