Trying to build a search engine for my website

I am trying to find the best way to create a search engine for my website. All of the items that need to be searched are in a MySQL database, but the database also has tables that must be excluded from the search (e.g. user information, navigation menu tables, etc.). The only solution that I have come up with so far is to search each table individually for the keyword and then display the results.
Is there an easier way to do this? I would like to have a 'table group' or something like that so my query could be something like:
SELECT * FROM table_group WHERE any_column LIKE '%search_string%'
The database has around 30 tables right now, but tables can be dynamically added and this will grow as the site is used more. What is the best way to go about this?

If you want to pursue the idea of a search engine that queries the database directly, I think you have to query the database's metadata: the list of tables and the list of fields each table has.
However, this is difficult to follow through on, because in the end you have to present the information to the user in a readable view. Finding a person's name (in which case you have to show the user a page such as userInfo.php, to give an example) is not the same as finding a company name (in which case you have to show a totally different page).
I think you at least have to create search elements based on the pages you have in the application. This means creating a table with the information for the search engine: how to present each kind of result, which tables/fields it should search, etc.
Even then there will be tricky cases. What if the user searches for "smyth"? Should the search engine be smart enough to return "smith"? And if you have two fields, one with the first name and the other with the surname, what happens when the user types "john smith"?
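The "table with the information for the search engine" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python with SQLite as a stand-in for PHP/MySQL, the registry is a plain list instead of a database table, and every table, column, and page name (articles, companies, article.php, ...) is hypothetical. Tables that are not in the registry, such as a users table, are simply never searched.

```python
import sqlite3

# Hypothetical registry: which tables/columns are searchable and which page
# should display a hit. In the real site this could live in its own table.
SEARCH_TARGETS = [
    {"table": "articles", "columns": ["title", "body"], "page": "article.php"},
    {"table": "companies", "columns": ["name"], "page": "company.php"},
]


def escape_like(term):
    """Escape LIKE wildcards so user input is matched literally ('!' is the escape char)."""
    return term.replace("!", "!!").replace("%", "!%").replace("_", "!_")


def search_site(conn, term):
    """Run one LIKE query per registered table and return (page, row id) hits."""
    pattern = "%" + escape_like(term) + "%"
    hits = []
    for target in SEARCH_TARGETS:
        # Identifiers come from the registry above, never from user input.
        where = " OR ".join(col + " LIKE ? ESCAPE '!'" for col in target["columns"])
        sql = "SELECT id FROM " + target["table"] + " WHERE " + where + " ORDER BY id"
        params = [pattern] * len(target["columns"])
        for (row_id,) in conn.execute(sql, params):
            hits.append((target["page"], row_id))
    return hits
```

Adding a new searchable table then means adding one registry entry rather than writing another query. It does nothing for the "smyth"/"smith" or "john smith" cases mentioned above.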

Related

Database design for hashtags in MySQL

I'm currently working on a system that will enable the use of hashtags on our site, and I'm having some trouble with how to best and most efficiently store the hashtags in the database. The design needs to be set up so that it's relatively simple to retrieve posts that match search terms (like on Twitter when you click the link of a hashtag and it shows all the tweets with that hashtag).
The hashtags will be stored in the db by extracting the terms from the content of created posts (also comparable to Twitter) and inserting them. How to insert them is of course the problem at hand.
At the moment I'm torn between 2 possible designs:
1) My first design idea (and perhaps the more conventional one) is a 3-table design:
the first table simply stores the post content and other data related to the post itself (I'm already using a table like this).
the second table simply stores new hashtags as they are used, basically functioning as a lookup for all hashtags that have been used.
the third table defines the relationships between hashtags and posts. It would be a simple table with one column for the ID of a post and another column for the ID of a single hashtag stored in the previous table. So a post that has, for example, 3 hashtags would have 3 rows in this table, 1 for each hashtag with which it is associated.
2) The second design is a 2-table design:
the same table with the post data stored in it, like above.
the 2nd table is a mix of the 2nd and 3rd tables in the first design: it holds the relationships between hashtags and posts, but instead of storing each new hashtag in a table and assigning it an ID, it simply stores the actual hashtag (so for example "#test") itself along with the ID of the post. The same concept applies here: if a post has 3 hashtags in it, it would store 3 individual rows in the table.
The reason I'm torn between the ideas is that the first option does seem to be the more standard way to do it, and there seems to be more "structure" to it. Since they are hashtags, however, I don't see a lot of purpose in actually assigning a unique ID to each hashtag, since hashtags aren't true classifications like a category or genre.
Also, with the second design I would need fewer JOINs when I build a search page for hashtags, since I wouldn't need to look up the ID of the searched term and then go to another table to find the posts associated with that ID.
Additionally, when trying to simply list the hashtags of a post, one thing that would be kind of annoying is that the hashtags may print out differently than a user may have stylized them in their post. So for example if a user adds #testing, but another user had previously entered a post with #TeStIng, the hashtag for the post would then print out #TeStIng, since that's how it would have been saved in the database lookup table. Of course you could make it case-sensitive but in searches #testing and #TeStIng should be considered the same hashtag so that could get messy. Or am I wrong about this? Does anyone have a suggestion about how to avoid this scenario?
On the other hand my concern with the 2nd table design is that I fear it could become inefficient if the table grows to be huge, since looking up strings is slower than searching for integers (which I would be doing with the first design). However, since I would have to use more JOINs in the 1st design, would there actually be a performance difference? Just to be clear, when searching for strings themselves I would be using the = operator and not LIKE.
Similarly, I would imagine that the first design is more efficient if I wanted to make queries about the hashtags themselves, for example how many posts are using a certain hashtag and things like that, though it would not be very difficult with the 2nd design either, I just wonder about the efficiency again.
Any thoughts on what may work better? The most important thing is that it is efficient to search by hashtag, for example to find the posts that have #test associated with them. Ideally, I would also like to be able to retrieve a post's hashtags from the database as they were stylized by the user in the post content. All other queries and functions around analyzing hashtags are secondary at this point.

Purely from a database normalization perspective, your second design would not be in 3NF. There's a reason the rule says every non-key attribute should depend on the key, the whole key, and nothing but the key. If anything in the hashtag table changes that has a direct impact on the post table, you end up with a logical inconsistency. For example, the table of hashtags has two rows: one with the hashtag #politics and another with the hashtag #politic. Let's say the person that created the post for the second hashtag decides to edit their post and updates the hashtag to #politics (perhaps because they made a typo). Which row do you update?
As for performance, I wouldn't worry about it in the least with the first design. Your database (like almost every major relational DBMS out there today) relies on B-tree indexes, a kind of balanced search tree, to keep the cost of insertion, deletion, and search low when you index these values properly. Some storage engines can additionally use hash indexes for O(1) equality lookups, or you could do that yourself in a key/value cache store like Memcached or Redis down the road. For the most part, indexing the hashtags to speed up the search for posts that use them is definitely the design you want to go for, since the biggest cost factor isn't looking up a single hashtag (I'm assuming most searches will have a single hashtag in this use case) but retrieving all of the posts that contain that hashtag.
As for the case-insensitive search portion of your question, your DBMS most likely has a collation option that you can specify in your schema (like utf8_general_ci), where the ci stands for case-insensitive comparison. Meaning, the data will be stored as-is, but when compared in a query to another value, MySQL will compare the characters in a case-insensitive manner.
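To make the 3-table design with a case-insensitive lookup concrete, here is a minimal sketch. It is written in Python against SQLite so it can be run as-is; SQLite's COLLATE NOCASE plays the role that a _ci collation would play in MySQL. The table and column names are made up for illustration, and the styled_tag column on the relationship table is my own assumption about one way to keep each author's spelling; it is not something the designs above specify.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
CREATE TABLE hashtags (
    id  INTEGER PRIMARY KEY,
    tag TEXT NOT NULL COLLATE NOCASE UNIQUE   -- '#Test' and '#test' share one row
);
CREATE TABLE post_hashtags (
    post_id    INTEGER NOT NULL REFERENCES posts(id),
    hashtag_id INTEGER NOT NULL REFERENCES hashtags(id),
    styled_tag TEXT NOT NULL,                 -- assumption: tag as the author typed it
    PRIMARY KEY (post_id, hashtag_id)
);
CREATE INDEX idx_post_hashtags_hashtag ON post_hashtags (hashtag_id);
"""


def add_post(conn, content):
    """Insert a post and link it to every hashtag found in its content."""
    post_id = conn.execute(
        "INSERT INTO posts (content) VALUES (?)", (content,)
    ).lastrowid
    for styled in re.findall(r"#\w+", content):
        # The case-insensitive UNIQUE constraint makes this a no-op for known tags.
        conn.execute("INSERT OR IGNORE INTO hashtags (tag) VALUES (?)", (styled,))
        (tag_id,) = conn.execute(
            "SELECT id FROM hashtags WHERE tag = ?", (styled,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO post_hashtags VALUES (?, ?, ?)",
            (post_id, tag_id, styled),
        )
    return post_id


def posts_with_tag(conn, tag):
    """Return ids of posts using the tag, compared case-insensitively."""
    rows = conn.execute(
        "SELECT ph.post_id FROM hashtags h "
        "JOIN post_hashtags ph ON ph.hashtag_id = h.id "
        "WHERE h.tag = ? ORDER BY ph.post_id",
        (tag,),
    )
    return [row[0] for row in rows]


def styled_tags(conn, post_id):
    """Return the post's hashtags exactly as its author wrote them."""
    rows = conn.execute(
        "SELECT styled_tag FROM post_hashtags WHERE post_id = ? ORDER BY styled_tag",
        (post_id,),
    )
    return [row[0] for row in rows]
```

The search costs one extra JOIN compared with the 2-table design, but both sides of that JOIN are indexed, and each post still prints its own spelling of the tag.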

Allowing users to build views from my database and editing those fields

I'm building a site that contains "panels" which are used as containers for various information. I have set it up so panels are editable, which is simple for panels that just contain text. For that I just grab the content from the database and wrap it in a textarea rather than a <p> tag. For panels that contain table views however this is proving to be a more difficult task.
First off, I'm having trouble allowing the admin of the site to pick what information is in a given table (for example, if the admin wanted to add a panel view that showed each member's first name, last name, and picture, they could pick from those columns in my database). I've come up with a few ways to do this, but each has its own set of problems.
I tried using the INFORMATION_SCHEMA tables to generate a table containing the possible tables and columns that the user can choose from. But when it comes to building the query with PDO I have problems. For instance, with prepared statements you can't bind a variable as an identifier such as a table or column name.
I also thought of using MySQL views but I can't seem to figure out how to do it that way either.
My second problem is allowing the admin to add rows to the tables directly. Right now all the add row template does is create a row with a text field in each column. This is good for purely text options (like first name) but for things like pictures obviously a text field won't work. Should I create a table that contains this metadata or perform the check in PHP? If it's the latter, how would I know what input type the column needs?
I think my main problem is that I'm trying to solve too many things with only one design change (or not focusing on one problem at a time). It's resulting in me becoming very flustered and confused. Help is greatly appreciated, and if you need any more information, like how my database tables are currently set up, I'll provide an ERD.
Edit: I just wanted to make it clear that I don't want to allow the user to actually manipulate the tables in the database, but rather select what information from the existing tables is shown on a given panel.

Coding the ability for users to freely query a database has a lot of problems (including security) and is far more complicated than predefined queries that simply return a defined set of information.
It also places the burden of deciding which information might be useful onto the user, and it places the burden of deciding whether a given piece of information should be accessible to a particular user onto the query logic and database access rules.
Effectively you are trying to copy phpMyAdmin with a different design and with only your own database as a target.
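One middle ground between free-form queries and fully hard-coded ones is a whitelist of tables and columns that a panel is allowed to show. Since identifiers cannot be bound as parameters in a prepared statement, they are checked against the whitelist before being placed into the SQL text. The sketch below is in Python rather than PHP only so that it can be run and tested on its own; the members and events tables and their columns are hypothetical.

```python
# Hypothetical whitelist: the only tables and columns an admin may put on a panel.
PANEL_COLUMNS = {
    "members": {"first_name", "last_name", "picture_url"},
    "events": {"title", "starts_at"},
}


def build_panel_query(table, columns):
    """Return a SELECT statement for a panel, or raise ValueError.

    Every identifier must appear in PANEL_COLUMNS, so nothing the admin
    types ever reaches the SQL text unchecked.
    """
    allowed = PANEL_COLUMNS.get(table)
    if allowed is None:
        raise ValueError("table not available for panels: %r" % (table,))
    if not columns:
        raise ValueError("at least one column is required")
    unknown = [col for col in columns if col not in allowed]
    if unknown:
        raise ValueError("columns not available for panels: %r" % (unknown,))
    return "SELECT " + ", ".join(columns) + " FROM " + table
```

For example, build_panel_query("members", ["first_name", "last_name"]) produces a query for the member-list panel described in the question, while a request for a table or column outside the whitelist is rejected. The same dictionary could also carry per-column metadata, such as which input type the add-row form should render.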

SELECT DISTINCT or separate normalized table?

We're making the plans now, so before I start I want to make sure I'm handling things in the best way.
We have a products table to which we're adding a new field called 'format', which is going to be the structure of the product (bag, box, etc.). There are no set values for this; users can enter whatever they like into that field. However, we want to show a drop-down list of all formats that the user has already entered.
There are two ways I can think of to do that: either a basic SELECT DISTINCT on the products table to get all formats the user has already filled in, or a separate table that stores the formats and is linked to by the product.
Instinctively I'd like to use SELECT DISTINCT, since it would make my life easier. However, assuming a table of a billion products, which would be the best way to go?

I think I would opt for the second option (an additional table, plus a foreign key if you want to add a constraint), both because of the volume and because it lets you add management features, for example merging similar product formats.

If you decide to keep everything in one table, then build an index on the column. This should speed the processing for creating the list in the user application.
I'm somewhat agnostic about which is the best approach. Often, when designing user interfaces, you want to try out different things. Having to make database changes impedes the creative process of building the application.
On the other hand, generally when users pick things from a drop down box in the application, these "things" are excellent examples of "entities" -- and that is what tables are intended to store.
In the end, I would say do what is most convenient while developing the application. As you get closer to finalizing it, consider whether it would be better to store these things in a separate table. One of the big questions is whether you want to know all formats that have ever been used, even if no user currently has them defined.

Since you are letting users enter whatever they want, I would go with the 2nd option.
Create a new table, insert all the new 'formats' there, and link to it from the product table.
When you write the code that adds the format the user typed in, be sure to check whether an equal value already exists in the database, so you won't need to use DISTINCT on the new table either.
Also, keep the values consistent, for example by uppercasing only the first letter of each word.
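A minimal sketch of that "check before insert, and normalize the spelling" step, assuming a formats table with a unique name column. It is written in Python against SQLite so it runs standalone; the table and function names are made up, and INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3

SCHEMA = """
CREATE TABLE formats (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE products (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    format_id INTEGER REFERENCES formats(id)
);
"""


def normalize_format(raw):
    """Collapse whitespace and uppercase only the first letter of each word."""
    return " ".join(word.capitalize() for word in raw.split())


def get_or_create_format(conn, raw):
    """Return the id of the format, inserting it only if it is not there yet."""
    name = normalize_format(raw)
    if not name:
        raise ValueError("format must not be empty")
    # The UNIQUE constraint turns a duplicate insert into a no-op.
    conn.execute("INSERT OR IGNORE INTO formats (name) VALUES (?)", (name,))
    return conn.execute(
        "SELECT id FROM formats WHERE name = ?", (name,)
    ).fetchone()[0]


def list_formats(conn):
    """Values for the drop-down: one short query, no DISTINCT over products."""
    return [row[0] for row in conn.execute("SELECT name FROM formats ORDER BY name")]
```

With this in place, "gift   box" and "Gift Box" end up as the same row, and the drop-down reads from a table whose size is the number of distinct formats rather than the number of products.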

Integrating search on a website where the backend is MYSQL

I have a location search website for a city. We started out by collecting data for all possible categories in the city, like schools, colleges, departmental stores, etc., and stored each category's information in a separate table, as each entry had different details apart from its name, address, and phone number.
We had to integrate search into the website to enable people to find information, so we built an index table in which we stored the categories, the related keywords for each category, and the table which must be fetched if that category is searched for. Later on we added the ability to search on name and address as well, by adding another master table that gathers those fields from all the tables in one place. Now my doubts are the following:
The application design is improper, and we have written queries like select * from master where name like "%$input%" all over the place. Since our database is MySQL, with PHP on the server side, is there any suggestion for improving the design of the system?
People want more features, like splitting the keywords and ranking results according to relevance. Is there any ready-made framework available which runs search on a database?
I tried using Full Text Search in MySQL and it seems effective to me. Is that enough?
Correct me if I am wrong: I had a look at Lucene and Google Custom Search. Don't they work by crawling existing web pages and building their own index? I have a collection of tables in a MySQL database on which I have to apply searching. What options do I have?

To address your points:
Using %input% is very bad. The leading wildcard means no index can be used, so it will cause a full table scan on every query. Under any amount of load, or on even a remotely large dataset, your DB server will choke.
An RDBMS alone is not a good solution for this. You are looking in the right place by seeking a separate solution for search. Something which can communicate well with your RDBMS is good; something that runs inside an RDBMS won't do what you need.
Full Text Search in MySQL is workable for very basic keyword searches, nothing more. The scope of usefulness is extremely limited: you need a highly predictable usage model to leverage the built-in searching. It is called "search", but it's not really search the way most people think of it. It does not compare to the quality of search results we have come to expect from Google and Bing. In that sense of the word "search", it is something else, like Notepad vs Word. They are both things to type in, but that's about it.
As far as separate systems for handling search go, Lucene is very good. Lucene works however you want it to work, essentially. You can interact with it programmatically to insert indexable documents. Likewise, a Google Appliance (not Google Custom Search) can be given direct meta feeds which expose whatever you want to be indexed, such as data directly from a database.
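To give a feel for what "inserting indexable documents" and "splitting the keywords and ranking them" mean, here is a toy inverted index in Python. It is not Lucene and does not use Lucene's API or scoring; the class name, the tokenizer, and the "count the matching terms" ranking are all simplifications chosen for illustration. The point is only that rows from the database are fed in as documents and queries are answered from the index instead of with LIKE scans.

```python
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index: term -> ids of the documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        """Lowercase the text and split it into alphanumeric terms."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Index one document, e.g. the name and address of one database row."""
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return document ids ranked by how many distinct query terms they contain."""
        scores = Counter()
        for term in set(self.tokenize(query)):
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1
        # Most matching terms first; ties are broken by document id.
        return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A real engine adds stemming, term weighting, phrase queries, and persistent storage on top of this basic structure, which is why a dedicated tool is usually preferable to growing something like this yourself.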

Take a look at sphinx: http://www.sphinxsearch.com/
Per their site:
How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
It's quite popular with a lot of people in the Rails community right now, and they all rave about how awesome it is :)

How to store search criteria or search results?

I have a PHP classifieds website (mostly) and I am currently using MySQL as the database.
Later on I will use Solr or maybe Sphinx as a "search engine".
I want to make it possible for users to view "results" of searches they have made before, but I don't know where to start...
How is this done?
Currently I have a form which is filled in, and when it is submitted, the PHP just checks against a MySQL table to see if there are any matches.
Should I store the 'Search criteria' and do a new search every time the users click on one of their previous searches, or should I store the results? I would prefer to make a new search because new items may have been inserted since the last search!
If you need more input, just let me know and I will update this Q.
Thanks

Well... if you're basically talking about "saved searches", I'm doing something similar currently so that I just have a separate table where....
saved_search_id (primary) | user_id (foreign) | search_name | criteria1 | criteria2 | criteria3 ... etc
So basically I can now display to the user a list of saved searches they've created, and the table stores the criteria that were part of that search. I can then use those saved criteria to run a saved search anytime.
Does that help?

Use query-string parameters ($_GET) for the search form. Then the user can bookmark the search. If you want, you could create a bookmarking feature in your application, but there really is no need.
If you are concerned about performance, make sure that your database's cache settings are tuned correctly, and that you don't write to the table too often. MySQL will do a good job of caching then.

You have already said it: if users should see the new results of old queries, you'll have to store the search parameters somehow and redo the search when a user requests it.

Store the search criteria. This is quite obvious, because if the data changes, users would otherwise get old results. And consider the space the results might take after a while :)
I would also make sure to store the search criteria themselves, not the actual query. If you change the database, the stored searches will still work, because you have to update the query generation engine anyway, whereas you would most likely forget to update every stored query.
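As a rough sketch of "store the criteria, regenerate the query": the example below keeps each saved search as a JSON object of criteria and rebuilds the SQL every time the search is run, so ads inserted later show up automatically. It is Python with SQLite purely so that it runs standalone; the ads table, the criterion names, and the single JSON criteria column (instead of criteria1, criteria2, ... columns) are assumptions made for the example.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE ads (id INTEGER PRIMARY KEY, category TEXT, city TEXT, price INTEGER);
CREATE TABLE saved_searches (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    search_name TEXT NOT NULL,
    criteria    TEXT NOT NULL   -- JSON object, e.g. {"category": "cars"}
);
"""

# Hypothetical query generator: criterion name -> SQL condition it produces.
CONDITIONS = {
    "category": "category = ?",
    "city": "city = ?",
    "max_price": "price <= ?",
}


def save_search(conn, user_id, name, criteria):
    """Store the criteria (not the SQL) and return the saved search's id."""
    unknown = set(criteria) - set(CONDITIONS)
    if unknown:
        raise ValueError("unknown criteria: %s" % sorted(unknown))
    cursor = conn.execute(
        "INSERT INTO saved_searches (user_id, search_name, criteria) VALUES (?, ?, ?)",
        (user_id, name, json.dumps(criteria, sort_keys=True)),
    )
    return cursor.lastrowid


def run_saved_search(conn, search_id):
    """Rebuild the query from the stored criteria and run it against current data."""
    row = conn.execute(
        "SELECT criteria FROM saved_searches WHERE id = ?", (search_id,)
    ).fetchone()
    if row is None:
        raise LookupError("no saved search with id %r" % (search_id,))
    criteria = json.loads(row[0])
    sql = "SELECT id FROM ads"
    if criteria:
        sql += " WHERE " + " AND ".join(CONDITIONS[key] for key in criteria)
    sql += " ORDER BY id"
    return [r[0] for r in conn.execute(sql, list(criteria.values()))]
```

If the ads table is later restructured, only CONDITIONS has to change; the stored criteria stay valid, which is the advantage described above.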
