I am building a data-driven website (using PHP and MySQL) for some farmers around the community (well, was, as you will see in a second). They wanted to be able to list their products and have people search those products so that their names come up along with a link to a page detailing all of their produce.
While I knew it would be a long list, I thought, "Since every search into a MySQL database is picky - including case-sensitive - I'll just make a list in alphabetical order and people can choose from a dropdown box what they want to search for. No problem."
Well, now it's a problem. He has expanded the parameters of the site: he now wants to include handmade and homemade products. Needless to say, we went from a few dozen to hundreds of potential products, and now a dropdown list is no longer feasible. But if I use a text field for visitors to search the site, unless they type the term with no spelling errors and use the same case, they won't get accurate results from the search.
Can anyone recommend another method? I am aware of the "LIKE" search, but it doesn't really solve my problem - especially since it could create false positives in the search. Any help would be appreciated, thanks!
Well, the question is somewhat vague, since you are talking about multiple search parameters.
It's more a design choice. For example, consider the following:
For items such as "homemade" or "handmade", you could have checkboxes underneath the input field where people can add additional flags to the search.
Say, search by name "John Smith" and check off "handmade" and "homemade".
The "handmade" and "homemade" fields in the database are simple flags (either on or off).
So, an SQL query might look like this:
SELECT * FROM products WHERE farmername LIKE '%$search%' AND handmade = 'handmade'
Or, when inputting the data, if handmade is checked, insert an integer into the handmade column, where 1 means handmade was checked and 0 means it was not.
So your query would then use AND handmade = 1 (or 0 for not handmade).
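A minimal sketch of that integer-flag approach (table and column names here are placeholders, not your actual schema):

-- Hypothetical products table with on/off flags stored as integers
CREATE TABLE products (
    id INT AUTO_INCREMENT PRIMARY KEY,
    farmername VARCHAR(100),
    productname VARCHAR(100),
    handmade TINYINT(1) NOT NULL DEFAULT 0,  -- 1 = handmade, 0 = not
    homemade TINYINT(1) NOT NULL DEFAULT 0   -- 1 = homemade, 0 = not
);

-- Search by farmer name, restricted to handmade items
SELECT * FROM products
WHERE farmername LIKE '%john smith%'
  AND handmade = 1;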
These are just some ideas to get you going, but this is more a design decision than a database decision: how do you create your tables to use the flags?
I would use two tables:
one with all possible search terms
one with synonyms of any terms that were applicable (e.g. "handmade" and "homemade")
Use AJAX to search for values from the first table as characters are entered in a text box. Return a list of possible terms using:
select term from search_table where term like '%<input string>%'
Only start returning values when you have fewer than 10 or so hits (i.e. don't populate the list when they have entered only 2 letters). Then, when a particular term is entered, search the second table for synonyms and include those in the search. On the results page, indicate that you included the synonyms, and maybe put an 'X' by each one to let the user re-search with those excluded.
Note that 'like' is case-insensitive.
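A rough sketch of those two tables and the synonym expansion (all names here are assumptions, not a fixed design):

-- Every searchable term, used to feed the AJAX autocomplete
CREATE TABLE search_terms (
    term VARCHAR(100) PRIMARY KEY
);

-- Synonym pairs, e.g. ('handmade', 'homemade')
CREATE TABLE term_synonyms (
    term VARCHAR(100),
    synonym VARCHAR(100)
);

-- Autocomplete lookup as the user types
SELECT term FROM search_terms WHERE term LIKE '%hand%';

-- Once a term is chosen, pull in its synonyms to widen the real search
SELECT synonym FROM term_synonyms WHERE term = 'handmade';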
I'm currently working on a system that will enable the use of hashtags on our site, and I'm having some trouble with how best and most efficiently to store the hashtags in the database. The design needs to be set up so that it's relatively simple to retrieve posts that match search terms (like on Twitter when you click the link of a hashtag and it shows all the tweets with that hashtag).
The hashtags will be stored in the db by extracting the terms from the content of created posts (also comparable to Twitter) and inserting them. How to insert them, of course, is the problem at hand:
At the moment I'm torn between 2 possible designs:
1) My first design idea (and perhaps more conventional) is a 3-table design:
the first table simply stores the post content and other data related to the post itself (I'm already using a table like this).
the second table simply stores new hashtags being used, basically functioning as a look-up for all hashtags that have been used.
the third table defines the relationships between hashtags and posts. Basically it would be a simple table with one column for the ID of a post and another column for the ID of a single hashtag stored in the previous table. So a post that has, for example, 3 hashtags would have 3 rows in this table, 1 for each hashtag with which it is associated.
2) The second design is 2-table design:
the same table with the post data stored in it, like above.
the 2nd table is a mix of the 2nd and 3rd tables in the first design: it holds the data about the relationships between hashtags and posts, but instead of storing each new hashtag in a table that assigns it an ID, it simply stores the actual hashtag itself (so for example "#test") along with the ID of the post. The same concept applies here: if a post has 3 hashtags in it, it stores 3 individual rows in the table.
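For concreteness, here is roughly what the two designs could look like in SQL (a sketch with invented names, not a finished schema):

-- Design 1: three tables
CREATE TABLE posts (
    post_id INT AUTO_INCREMENT PRIMARY KEY,
    content TEXT
);
CREATE TABLE hashtags (
    hashtag_id INT AUTO_INCREMENT PRIMARY KEY,
    tag VARCHAR(140) UNIQUE
);
CREATE TABLE post_hashtags (          -- one row per (post, hashtag) pair
    post_id INT,
    hashtag_id INT,
    PRIMARY KEY (post_id, hashtag_id)
);

-- Design 2: two tables, the tag string stored directly
CREATE TABLE post_tags (
    post_id INT,
    tag VARCHAR(140),                 -- e.g. '#test'
    PRIMARY KEY (post_id, tag)
);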
The reason I'm torn between the ideas is that the first option does seem to be the more standard way to do it, and there seems to be more "structure" to it. Since they are hashtags, however, I don't see a lot of purpose in actually assigning a unique ID to each hashtag, since hashtags aren't true classifications like a category or genre.
Also, when I make a search page for hashtags, I would have to use fewer JOINs, since I wouldn't need to look up the ID of the searched term and then go to another table to find the posts associated with that ID.
Additionally, when trying to simply list the hashtags of a post, one thing that would be kind of annoying is that the hashtags may print out differently than a user may have stylized them in their post. So for example if a user adds #testing, but another user had previously entered a post with #TeStIng, the hashtag for the post would then print out #TeStIng, since that's how it would have been saved in the database lookup table. Of course you could make it case-sensitive but in searches #testing and #TeStIng should be considered the same hashtag so that could get messy. Or am I wrong about this? Does anyone have a suggestion about how to avoid this scenario?
On the other hand my concern with the 2nd table design is that I fear it could become inefficient if the table grows to be huge, since looking up strings is slower than searching for integers (which I would be doing with the first design). However, since I would have to use more JOINs in the 1st design, would there actually be a performance difference? Just to be clear, when searching for strings themselves I would be using the = operator and not LIKE.
Similarly, I would imagine that the first design is more efficient if I wanted to make queries about the hashtags themselves, for example how many posts are using a certain hashtag and things like that, though it would not be very difficult with the 2nd design either, I just wonder about the efficiency again.
Any thoughts on what may work better? The most important thing is that it is efficient to search by hashtag, so for example finding posts that have #test associated with them. Ideally, I would also like to be able to retrieve a post's hashtags from the database as they were stylized by the user in the post content. All other queries and functions around analyzing hashtags are secondary at this point.
Purely from a database normalization perspective, your second design would not be in 3NF. There's a reason why you rely on the key, the whole key, and nothing but the key. If anything in the hashtag data changes in a way that has a direct impact on the post table, you end up with a logical inconsistency. For example, the table of hashtags has two rows: one with the hashtag #politics and another with the hashtag #politic. Let's say the person that created the post for the second hashtag decides to edit their post and updates the hashtag to #politics (perhaps because they made a typo). Which row do you update?
As for performance, I wouldn't worry about it in the least with the first design. Your database (like almost every major relational DBMS out there today) relies on a B-tree index (more specifically a B+ tree) to keep the cost of insertion/deletion/search low when you properly index these values. It can further optimize to O(1) hash-table lookups in some use cases, or you could even do that yourself down the road in a key/value cache store like Memcached/Redis. For the most part, indexing the hashtags to make searching for posts that use those hashtags faster is definitely the design you want to go for. The biggest cost factor isn't looking up a single hashtag (I'm assuming most searches will involve a single hashtag in this use case), but retrieving all of the posts that contain that hashtag.
As for addressing the case-insensitive search portion of your question, your DBMS most likely has a collation option that you can specify in your schema (like utf8_general_ci, where the ci suffix stands for case-insensitive comparison). The data will be stored as-is, but when compared to another value in a query, MySQL will do the comparison of characters in a case-insensitive manner.
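As a sketch, assuming MySQL and an invented table name:

-- Comparisons on `tag` are case-insensitive, but the stored value keeps its casing
CREATE TABLE hashtags (
    hashtag_id INT AUTO_INCREMENT PRIMARY KEY,
    tag VARCHAR(140) CHARACTER SET utf8 COLLATE utf8_general_ci,
    UNIQUE KEY (tag)
);

-- Matches '#TeStIng', '#testing', '#TESTING', ...
SELECT * FROM hashtags WHERE tag = '#testing';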
I am trying to find the best way to create a search engine for my website. All of the items that need to be searched are in the MySQL database, but I also have tables in the database that need to be excluded from the search (i.e. user information, navigation menu tables, etc.). The only solution that I have come up with so far is to search each table individually for the keyword and then display the results.
Is there an easier way to do this? I would like to have a 'table group' or something like that, so my query could be something like:
SELECT * FROM table_group WHERE any_column LIKE '%search_string%'
The database has around 30 tables right now, but tables can be dynamically added and this will grow as the site is used more. What is the best way to go about this?
If you want to pursue the idea of creating a search engine that queries the database, I think you have to query the database's metadata: the list of tables and the list of fields each table has.
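In MySQL that metadata lives in information_schema; a minimal example (the exclusion list is purely illustrative):

-- List every text-like column in the current database,
-- skipping tables you don't want searched
SELECT TABLE_NAME, COLUMN_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND DATA_TYPE IN ('char', 'varchar', 'text')
  AND TABLE_NAME NOT IN ('users', 'navigation_menu');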
However, this approach is difficult to follow through on, because in the end you have to present the information to the user in a printable view: finding a person's name (where you would present, say, the page userInfo.php to the user) is not the same as finding a company name (where you would have to present a totally different page).
I think at the very least you have to create search elements based on the pages you have in the application. This means creating a table with the information for the search engine: how to present this information, in which tables/fields it should search, etc.
Even then there will be tricky cases. What if the user searches for "smyth"? Should the search engine be smart enough to return "smith"? And if you have two fields, one with the first name and the other with the surname, what happens when the user types "john smith"?
I am trying to write a predictive search system for a website I am making.
The finished functionality will be a lot like this:
I am not sure of the best way to do this, but here is what I have so far:
Searches Table:
id - term - count
Every time a search is made it is inserted into the searches table.
When a user enters a character into the search input, the following occurs:
The page makes an AJAX request to a search PHP file
The PHP file connects to MySQL database and executes a query: SELECT * FROM searches WHERE term LIKE 'x%' AND count >= 10 ORDER BY count DESC LIMIT 10 (x = text in search input)
The top 10 results based on past search counts are then listed on the page
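A sketch of the table and the logging step behind that flow (assuming MySQL; names follow the schema above):

CREATE TABLE searches (
    id INT AUTO_INCREMENT PRIMARY KEY,
    term VARCHAR(255) UNIQUE,
    count INT NOT NULL DEFAULT 1
);

-- Log a search: insert the term, or bump its count if it already exists
INSERT INTO searches (term) VALUES ('potatoes')
ON DUPLICATE KEY UPDATE count = count + 1;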
This solution is far from perfect. If any random person searches for the same term 10 times, it will then show up as a recommended search (if somebody were to search for a term starting with the same characters). By this I mean: if somebody searched "poo poo" 10 times and then someone on the site searched for "po" looking for potatoes, they would see "poo poo" as a popular search. This is not cool.
A few ideas to get around this come to mind. For example, I could limit each insert query into the searches table to the user's IP address. This doesn't fully solve the problem though: if the user has a dynamic IP address, they could just restart their modem and perform the search 10 times on each IP address. Sure, the number of times it has to be entered could remain a secret, so it is a little more secure.
I suppose another solution would be to add a blacklist to remove words like "poo poo" from showing up.
My question is, is there a better way of doing this or am I moving along the right lines? I would like to write code that is going to allow this to scale up.
Thanks
You are on the right track.
What I would do:
Store every query uniquely. Add a table tracking each IP for each search term and only update your count once per IP (see the sketch after this list).
If a certain new/unique keyword gets upcounted more than X times in a given period of time, have your system mail you or your admin so you have the opportunity to blacklist the keyword manually. This has to be manual because some hot topic might also show this behavior.
This is the most interesting one: once the query is complete, check the number of results. It is pointless to suggest keywords that give no results, so only suggest queries that return at least X results. Queries like "poo poo" will give no results, so they won't show up in your suggestion list.
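A possible shape for the per-IP tracking mentioned above (names invented; INET6_ATON needs MySQL 5.6+):

-- One row per (term, IP); the count on `searches` only moves
-- the first time a given IP submits the term
CREATE TABLE search_ips (
    term VARCHAR(255),
    ip VARBINARY(16),
    PRIMARY KEY (term, ip)
);

INSERT IGNORE INTO search_ips (term, ip)
VALUES ('potatoes', INET6_ATON('203.0.113.7'));
-- In PHP, check the affected-row count of the INSERT IGNORE;
-- only when it is 1 (a new IP for this term) run:
UPDATE searches SET count = count + 1 WHERE term = 'potatoes';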
I hope this helps. Talk to me further in chat if you have questions :)
For example, you could add a new boolean column called validate and avoid using a blacklist. If validate is false, the term does not appear in the recommended list.
This field can be adjusted manually by an administrator (via a query or a back-office tool). You could add another column called audit, which stores the timestamp of the query. If the difference between the maximum and minimum timestamps exceeds a threshold, the validate field could default to false.
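For example (column names as described above, the rest assumed):

ALTER TABLE searches
    ADD COLUMN validate TINYINT(1) NOT NULL DEFAULT 1,
    ADD COLUMN audit TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

-- Only validated terms are ever suggested
SELECT term FROM searches
WHERE term LIKE 'po%' AND validate = 1 AND count >= 10
ORDER BY count DESC LIMIT 10;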
This solution is easy and fast to develop.
Regards and good luck.
I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience adding autocomplete from existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google has, without having so many users contributing to the creation of the data (most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?
For now, use the static data that you have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is supplied by another user (a kind of rank). Build an n-gram index over the queries (so that you could also autocomplete something like "Manchester United" when a person just types "United", i.e. not only matching from the starting string) and simply return the top N after sorting by count.
The above table will gradually keep improving as your user base grows.
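A bare-bones version of that table and the top-N lookup (the n-gram indexing itself is left out; a plain substring match stands in for it):

CREATE TABLE user_queries (
    queryID INT AUTO_INCREMENT PRIMARY KEY,
    query VARCHAR(255) UNIQUE,
    count INT NOT NULL DEFAULT 1     -- bumped each time the query recurs
);

-- 'United' also matches 'Manchester United', not just prefixes
SELECT query FROM user_queries
WHERE query LIKE '%United%'
ORDER BY count DESC
LIMIT 10;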
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge lies in returning the data to be displayed in a fraction of a second. So when your query store grows large, you can use a search engine like Solr/Sphinx, which will be very fast at returning the results to be rendered.
You can use the Lucene search engine for this functionality. Refer to this link,
or you may also take a look at Lucene Solr Autocomplete...
Google has thousands of entries, arranged according to day, time, geolocation, language, and so on, and the set keeps growing with users' entries. Whenever a user types a word, the system checks the table of "most-used words for that location + day + time" and, if there is no answer, then the "general words" table. So you should categorize every word entered by users, or make a general word-relation table in your database, where the most suitable search answer is referenced.
Yesterday I stumbled on something that answered my question. Google draws autocomplete suggestions from this XML file, so it is wise to use it if you have too few users to create your own database of keywords:
http://google.com/complete/search?q=[keyword]&output=toolbar
Just replacing [keyword] with some word will give suggestions for that word; then the task is just to parse the returned XML and format the output to suit your needs.
I just came across this site: http://www.hittaplagget.se. If you enter the search word moo, the autosuggest pops up immediately.
But if you go to my site, http://storelocator.no, and use the same search phrase (in "Search for brand" field), it takes a lot longer for autosuggest to suggest anything.
I know that we can only guess at what type of technology they are using, but hopefully someone here can make a better educated guess than I can.
In my solution I only do a SELECT with LIKE 'moo%' against the table and return the results.
I have not yet indexed my table, as there are only 7000 rows in it. But I'm thinking of indexing my tables using Lucene.
Can anyone suggest what I need to do in order to get equally fast autosuggest?
You must add an index on the column holding your search terms, even at 7000 rows; otherwise the database searches through the whole list every time. See http://dev.mysql.com/doc/refman/5.0/en/create-index.html.
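For example (assuming the table and column are both called brand):

-- A plain index lets MySQL range-scan prefix searches like LIKE 'moo%'
CREATE INDEX idx_brand ON brands (brand);

SELECT brand FROM brands WHERE brand LIKE 'moo%';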
Lucene is a full text search index and may or may not be what you're looking for. Lucene would find any occurrence of "moo" in the entire indexed column (e.g. Mootastic and Fantasticmoo) and does not necessarily speed up your search although it's faster than a where x like '%moo%' type of search.
As others have already pointed out a regular index (probably even unique?) is what you want if you're performing "starts with" type of searches.
You will need to table-scan the table, so I suggest:
Don't put any rows in the table you don't need - for example, "inactive" records - keep them in a different table
Don't put any columns in the table you don't need
You can achieve this by having a special "Search table" which just contains the rows/columns you're interested in, and updating it from the "Master table".
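One way to build such a search table, assuming a master table called products with an active flag:

-- A slim copy containing only the searchable rows and columns
CREATE TABLE product_search AS
SELECT id, name FROM products WHERE active = 1;

-- Refresh it periodically from the master table
TRUNCATE product_search;
INSERT INTO product_search (id, name)
SELECT id, name FROM products WHERE active = 1;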
Table-scanning a 7000 row table should be extremely efficient if the rows are small; I understand from your problem domain that this will be the case.
But as others have pointed out - don't send the 7000 rows to the client-side when it doesn't need it.
A conventional index can optimise a LIKE 'someprefix%' into a range-scan, so it is probably helpful having one. If you want to search for the string in any part of the entry, it is going to be a table-scan (which should not be slow on such a tiny table!)