Storing users' data in lucene or querying rdbms?


I'm struggling with Lucene and am not sure which approach is better. I have users' profile data, and some of it (3-4 fields) is stored in Lucene. But in the query results I also need to show the user's age, name, etc.
I don't think it's reasonable to save all of these additional fields, which don't participate in the search, in Lucene, but querying the RDBMS will also take some time. So my question is: which approach is better?
Thanks.

Indexing all profile fields with Lucene gives end users a better search experience, since it searches over all fields and ranks the results appropriately. I don't know much about full-text search over multiple columns and ranking in an RDBMS, so in such cases I have always preferred Lucene.
You will also need to keep the index in sync with the RDBMS.

This blog post tries to give you tools to choose between a full text search engine and a database. A compromise is to index all searchable fields and store an id you can use to retrieve a record from the database using a database key.
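The compromise above can be sketched roughly as follows. This is only an illustration, not Lucene's API: a plain Python dictionary stands in for the search index, SQLite stands in for the RDBMS, and the `users` table, its columns, and `search_profiles` are hypothetical names. Only the searchable field is indexed; the display fields (name, age) are fetched by primary key afterwards.

```python
import sqlite3
from collections import defaultdict

# Hypothetical profile table: the database stays the source of truth.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, bio TEXT)"
)
db.executemany(
    "INSERT INTO users (id, name, age, bio) VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", 31, "php developer who likes lucene"),
        (2, "Bob", 45, "database administrator"),
        (3, "Carol", 27, "lucene and solr consultant"),
    ],
)

# Stand-in for the search index: term -> set of user ids.
# Only the searchable field (bio) is indexed; name and age are not.
index = defaultdict(set)
for user_id, bio in db.execute("SELECT id, bio FROM users").fetchall():
    for term in bio.lower().split():
        index[term].add(user_id)


def search_profiles(term):
    """Look up matching ids in the index, then fetch display fields by key."""
    ids = sorted(index.get(term.lower(), set()))
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = db.execute(
        f"SELECT id, name, age FROM users WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return rows.fetchall()


print(search_profiles("lucene"))
```

The second step is a primary-key lookup for one page of results, which is usually cheap; whether it is cheap enough compared with storing the fields in the index depends on the deployment.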

Apart from taking more disk space, using "stored" fields in the index does not impact query performance. I would go with that.

Related

Dealing with millions of data records in MySQL and PHP/Laravel

I have an activity records table named revisions (shown in the following image), built for a large learning management system. It mainly keeps a record of CRUD operations on tables (e.g. who did what to which object, and when).
This table may contain up to 3M records. I want to build search functionality for it on the front end with PHP/Laravel.
My question is: what should I consider when building high-performance search for tables with millions of records? What matters at the code level and at the database level, and are there third-party tools that help with this kind of problem?
I am experienced in building systems with PHP/Laravel, Python/Django, Ruby, etc., but I have never dealt with millions of records before, so please keep my experience level in mind.
Note: the search will be an advanced search, letting users search by different criteria and parameters: the object that was changed, who changed it, when it was changed, etc.
Let me know if my question still isn't clear.

I would recommend taking a look at Elasticsearch (https://www.elastic.co/products/elasticsearch) and saving your activity records to it whenever you save to the main database. Then you can easily search on any field. Elasticsearch stores schema-free JSON documents; if you prefer a more SQL-like approach, there is another search engine, Sphinx (http://sphinxsearch.com/).

There is no problem inserting a zillion rows into a table. Performance problems come when you try to do non-trivial SELECTs on the table. You mentioned "search"; you will have to limit what the 'users' can search for. But at least make a stab at what they might want to search for.
You mentioned "searching for an object", but I don't see a column called object. How many rows might there be for a given object? Do you need all the rows? Or selected ones? (An INDEX on object is likely to make the query efficient, regardless of table size.)
Third-party software sometimes gets in the way of dealing with really large tables. Beware.
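As a rough sketch of limiting what users can search for, the function below accepts only a fixed set of optional criteria and builds a parameterized query from them. The column names (`user_id`, `object_type`, `action`, `created_at`) and the index names are assumptions, since the real revisions table is only shown in an image, and SQLite is used here purely as a stand-in for MySQL.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical columns; the real revisions table may differ.
db.execute(
    """CREATE TABLE revisions (
           id INTEGER PRIMARY KEY,
           user_id INTEGER,
           object_type TEXT,
           action TEXT,
           created_at TEXT
       )"""
)
# One index per common search path, each ending in the date column
# so that "newest first" can be served from the index.
db.execute("CREATE INDEX idx_rev_object ON revisions (object_type, created_at)")
db.execute("CREATE INDEX idx_rev_user ON revisions (user_id, created_at)")
db.executemany(
    "INSERT INTO revisions (user_id, object_type, action, created_at)"
    " VALUES (?, ?, ?, ?)",
    [
        (1, "course", "create", "2020-01-01"),
        (2, "course", "update", "2020-01-05"),
        (1, "quiz", "delete", "2020-02-01"),
    ],
)


def search_revisions(user_id=None, object_type=None, since=None, limit=50):
    """Build the WHERE clause only from the criteria that were supplied."""
    clauses, params = [], []
    if user_id is not None:
        clauses.append("user_id = ?")
        params.append(user_id)
    if object_type is not None:
        clauses.append("object_type = ?")
        params.append(object_type)
    if since is not None:
        clauses.append("created_at >= ?")
        params.append(since)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = (
        "SELECT id, user_id, object_type, action FROM revisions "
        f"WHERE {where} ORDER BY created_at DESC LIMIT ?"
    )
    return db.execute(sql, params + [limit]).fetchall()


print(search_revisions(object_type="course"))
```

Only fixed column names ever reach the SQL text and all values are bound as parameters, so the set of possible queries (and the indexes they need) stays small and predictable.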

Autocomplete concept

I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google does, without having enough users to contribute to that data (the most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?

For now, use the static data that you already have for autocomplete.
Create another table in your database to store the actual user queries. The schema can be <queryID, query, count>, where count is incremented each time another user submits the same query (a kind of rank). Build an n-gram index over the queries (so that you can also autocomplete "Manchester United" when a person types just "United", i.e. not only from the start of the string) and simply return the top N sorted by count.
This table will gradually improve as your user base grows.
One more thing: the algorithm for this task is pretty simple. The real challenge is returning the suggestions within a fraction of a second. So when your query store grows, you can use a search engine like Solr or Sphinx to do the searching, which will return results quickly enough to render.
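A minimal sketch of the query-log table described above, using SQLite as a stand-in for the site's SQL database. The table name `search_queries` and the functions `record_query` and `suggest` are hypothetical, and a plain substring `LIKE` is swapped in for the n-gram index: it gives the same "match anywhere" behaviour on a small table but scans every row, which is the point at which the answer suggests moving to Solr or Sphinx.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE search_queries (
           query_id INTEGER PRIMARY KEY,
           query TEXT NOT NULL UNIQUE,
           count INTEGER NOT NULL
       )"""
)


def record_query(text):
    """Increment the counter for a query, inserting it on first sight."""
    normalized = " ".join(text.lower().split())
    cur = db.execute(
        "UPDATE search_queries SET count = count + 1 WHERE query = ?",
        (normalized,),
    )
    if cur.rowcount == 0:
        db.execute(
            "INSERT INTO search_queries (query, count) VALUES (?, 1)",
            (normalized,),
        )


def suggest(fragment, limit=5):
    """Return the most frequent stored queries containing the fragment."""
    # Note: '%' and '_' inside the fragment are not escaped in this sketch.
    pattern = "%" + fragment.lower() + "%"
    rows = db.execute(
        "SELECT query FROM search_queries WHERE query LIKE ? "
        "ORDER BY count DESC, query LIMIT ?",
        (pattern, limit),
    )
    return [row[0] for row in rows]


for q in ["Manchester United"] * 3 + ["manchester city"] * 2 + ["united airlines"]:
    record_query(q)

print(suggest("united"))
```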

You can use the Lucene search engine for this functionality. Refer to this link,
or you may also take a look at Lucene Solr Autocomplete...

Google has thousands of entries (and keeps collecting more), arranged by day, time, geolocation, language, and so on, and the set grows with what users enter. Whenever a user types a word, the system checks the table of the most used words for that location, day, and time, and if there is no answer, falls back to general words. So you should categorize every word entered by users, or build a general word-relation table in your database that points to the most suitable answer.

Yesterday I stumbled on something that answered my question. Google serves autocomplete suggestions as XML from the URL below, so it makes sense to use it if you have too few users to build your own keyword database:
http://google.com/complete/search?q=[keyword]&output=toolbar
Replacing [keyword] with some word returns suggestions for that word; then the task is just to parse the returned XML and format the output to suit your needs.

How do I get this lightning fast search?

I just came across this site: http://www.hittaplagget.se. If you enter the search word moo, the autosuggest pops up immediately.
But if you go to my site, http://storelocator.no, and use the same search phrase (in the "Search for brand" field), it takes a lot longer for autosuggest to suggest anything.
I know that we can only guess what type of technology they are using, but hopefully someone here can make a better educated guess than I can.
In my solution I only do a SELECT moo% FROM table and return the results.
I have not indexed my table yet, as there are only 7000 rows in it. But I'm thinking of indexing my tables using Lucene.
Can anyone suggest what I need to do in order to get equally fast autosuggest?

You must add an index on the column holding your search terms, even at 7000 rows; otherwise the database searches through the whole list every time. See http://dev.mysql.com/doc/refman/5.0/en/create-index.html.

Lucene is a full text search index and may or may not be what you're looking for. Lucene would find any occurrence of "moo" in the entire indexed column (e.g. Mootastic and Fantasticmoo) and does not necessarily speed up your search although it's faster than a where x like '%moo%' type of search.
As others have already pointed out a regular index (probably even unique?) is what you want if you're performing "starts with" type of searches.

You will need to table-scan the table, so I suggest:
- Don't put any rows in the table you don't need - for example, "inactive" records - keep them in a different table
- Don't put any columns in the table you don't need
You can achieve this by having a special "Search table" which just contains the rows/columns you're interested in, and updating it from the "Master table".
Table-scanning a 7000 row table should be extremely efficient if the rows are small; I understand from your problem domain that this will be the case.
But as others have pointed out - don't send the 7000 rows to the client-side when it doesn't need it.
A conventional index can optimise a LIKE 'someprefix%' into a range-scan, so it is probably helpful having one. If you want to search for the string in any part of the entry, it is going to be a table-scan (which should not be slow on such a tiny table!)
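To make the range-scan point concrete, here is a small sketch using SQLite as a stand-in for MySQL. The `brands` table, the index name, and the helper functions are hypothetical. Instead of relying on the optimizer to rewrite `LIKE 'moo%'`, the sketch writes the equivalent range condition by hand (`name >= 'moo' AND name < 'mop'`) and asks the database for its query plan. It assumes a non-empty prefix and a case-sensitive (binary) comparison.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE brands (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("CREATE INDEX idx_brands_name ON brands (name)")
db.executemany(
    "INSERT INTO brands (name) VALUES (?)",
    [("moods",), ("moon boot",), ("mop & co",), ("fantasticmoo",), ("adidas",)],
)


def prefix_bounds(prefix):
    """Return (low, high) such that low <= s < high matches 'prefix%'."""
    # The upper bound is the prefix with its last character incremented.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def suggest(prefix, limit=10):
    low, high = prefix_bounds(prefix)
    rows = db.execute(
        "SELECT name FROM brands WHERE name >= ? AND name < ? "
        "ORDER BY name LIMIT ?",
        (low, high, limit),
    )
    return [row[0] for row in rows]


def query_plan(prefix):
    """Return the planner's description of how the prefix query is executed."""
    low, high = prefix_bounds(prefix)
    rows = db.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM brands "
        "WHERE name >= ? AND name < ? ORDER BY name",
        (low, high),
    ).fetchall()
    return " ".join(row[-1] for row in rows)


print(suggest("moo"))
print(query_plan("moo"))
```

The plan should mention `idx_brands_name`, meaning only the matching slice of the index is read. A `'%moo%'` search has no such bounds, which is why it falls back to scanning every row.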

Is this a good way to use mysql?

I am building a web app, and I am thinking about how I should design the database.
The app will be fed keywords; it will then retrieve info for those keywords and save it into the database with a datestamp. The info will come from different sources, e.g. the number of results from Yahoo, Diggs from the last month that contain the keyword, etc.
So I was thinking that a simple way to do it would be to have one table with id and keyword columns where the keywords are stored, and another table for ALL the data with id (same as the keyword's), datestamp, data_name, and data_content.
Is this a good way to use MySQL, or could it somehow make queries slower? Should I build a table for each type of data I want to use? I am mostly looking for good performance in the application.
Another reason I would like to use only one table for the data is that I can easily add more data_name(s) without touching the db.

The id column in the second table (the one that holds the various information about each keyword) can be used as a foreign key referencing the id column of the first table.
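The two-table layout with a foreign key might look like the sketch below. SQLite stands in for MySQL, and the names `keyword_data`, `keyword_id`, `add_data`, and `latest` are illustrative choices rather than anything from the question. The composite index is an assumption about the most likely lookup ("latest value of one data_name for one keyword").

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
db.executescript(
    """
    CREATE TABLE keywords (
        id INTEGER PRIMARY KEY,
        keyword TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keyword_data (
        keyword_id INTEGER NOT NULL REFERENCES keywords(id),
        datestamp TEXT NOT NULL,
        data_name TEXT NOT NULL,
        data_content TEXT
    );
    -- Assumed access pattern: one keyword, one data_name, newest first.
    CREATE INDEX idx_keyword_data_lookup
        ON keyword_data (keyword_id, data_name, datestamp);
    """
)


def add_data(keyword, datestamp, data_name, data_content):
    """Store one measurement, creating the keyword row if needed."""
    db.execute("INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (keyword,))
    keyword_id = db.execute(
        "SELECT id FROM keywords WHERE keyword = ?", (keyword,)
    ).fetchone()[0]
    db.execute(
        "INSERT INTO keyword_data (keyword_id, datestamp, data_name, data_content)"
        " VALUES (?, ?, ?, ?)",
        (keyword_id, datestamp, data_name, data_content),
    )


def latest(keyword, data_name):
    """Return the most recent data_content for a keyword, or None."""
    row = db.execute(
        "SELECT d.data_content FROM keyword_data d "
        "JOIN keywords k ON k.id = d.keyword_id "
        "WHERE k.keyword = ? AND d.data_name = ? "
        "ORDER BY d.datestamp DESC LIMIT 1",
        (keyword, data_name),
    ).fetchone()
    return row[0] if row else None


add_data("php", "2010-01-01", "yahoo_results", "100")
add_data("php", "2010-02-01", "yahoo_results", "150")
print(latest("php", "yahoo_results"))
```

New data_name values need no schema change, which is the flexibility the question asks for; the trade-off is that every value is stored as text in one column.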

"Is this a good way to use mysql or could this in someway make queries slower or something? should I build tables for each type of data I want to use? I am mostly looking for a good performance on the application."
A lot of big MySQL players (Flickr, for example) use MySQL as a simple key-value (KV) store.
Furthermore, if you are concerned about performance, you should cache your data in memcached or Redis (nothing beats memory).
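The caching advice usually takes the form of a cache-aside lookup: check the cache first, and only go to the database on a miss. The sketch below uses an in-process dictionary with a time-to-live as a stand-in for memcached or Redis; `TTLCache`, `load_keyword_stats_from_db`, and the returned values are all hypothetical placeholders.

```python
import time


class TTLCache:
    """In-process stand-in for memcached/Redis with per-entry expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)


db_calls = {"count": 0}  # counts how often the slow path is taken


def load_keyword_stats_from_db(keyword):
    """Placeholder for the real (slow) MySQL query."""
    db_calls["count"] += 1
    return {"keyword": keyword, "yahoo_results": 123}


cache = TTLCache(ttl_seconds=60)


def get_keyword_stats(keyword):
    """Cache-aside: return the cached value, or load and cache it."""
    cached = cache.get(keyword)
    if cached is not None:
        return cached
    value = load_keyword_stats_from_db(keyword)
    cache.set(keyword, value)
    return value
```

Calling `get_keyword_stats` twice with the same keyword within the TTL hits the database only once. With a real memcached or Redis client the structure is the same, but values must be serialized and the cache is shared across processes.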

I have another recommendation: index your content. If you plan on storing content and searching it by keyword, use MySQL to store the details of each document (author, text, and other info) and use Lucene to create an index. Lucene was originally written for Java, but it has ports to many languages, and PHP is no exception.
You can use Zend Framework to manage Lucene indexes with very little effort; browse through the documentation or look for a tutorial online. The benefits of this recommendation are simple:
- You'll improve your search time drastically
- Your keyword matching will be more flexible, since Lucene gives you more powerful search
I hope this helps!
Best of luck!

Integrating search on a website where the backend is MYSQL

I have a location search website for a city. We started out by collecting data for all possible categories in the city, such as Schools, Colleges, Departmental Stores, etc., and stored each category in a separate table, since each entry had different details apart from its name, address and phone number.
We had to integrate search into the website so people could find information, so we built an index table in which we stored the categories, the related keywords for each category, and the table that must be fetched when that category is searched for. Later on we added searching on name and address as well, by adding another master table that gathers those fields from all the tables in one place. Now my doubts are the following:
The application design is improper, and we have written queries like select * from master where name like "%$input%" all over the place. Since our database is MySQL with PHP on the server side, is there any suggestion for improving the design of the system?
People want more features, like splitting the keywords and ranking results by relevance. Is there any ready-made framework that runs search on a database?
I tried using Full Text Search in MySQL and it seems effective to me. Is that enough?
Correct me if I am wrong: I had a look at Lucene and Google Custom Search, but don't they work by crawling existing webpages and building their own index? I have a collection of tables in a MySQL database on which I have to apply searching. What options do I have?

To address your points:
Using %input% is very bad. That will cause a full table scan every query. Under any amount of load or on even a remotely large dataset your DB server will choke.
An RDBMS alone is not a good solution for this. You are looking in the right place by seeking a separate solution for search. Something which can communicate well with your RDBMS is good; something that runs inside an RDBMS won't do what you need.
Full Text Search in MySQL is workable for very basic keyword searches, nothing more. Its usefulness is extremely limited: you need a highly predictable usage model to leverage the built-in searching. It is called "search", but it's not really search the way most people think of it. It does not compare with the quality of results we have come to expect from Google and Bing. In that sense of the word "search", it is something else, like Notepad vs Word: both are things to type in, but that's about it.
As for separate systems for handling search, Lucene is very good. Essentially, Lucene works however you want it to work: you can interact with it programmatically to insert indexable documents. Likewise, a Google Appliance (not Google Custom Search) can be given direct meta feeds that expose whatever you want indexed, such as data directly from a database.
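To illustrate what "inserting indexable documents programmatically" and "splitting the keywords and ranking by relevance" mean in practice, here is a toy inverted index. It is not Lucene's API and its scoring (number of distinct query terms matched) is far cruder than what Lucene does; `TinyIndex`, the document ids, and the sample rows are all made up for the example. The point is that rows from any table can be fed in directly, with no crawling involved.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in set(tokenize(text)):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Rank documents by how many distinct query terms they contain."""
        scores = Counter()
        for term in set(tokenize(query)):
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1
        # Highest score first; ties broken by document id for stable output.
        return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = TinyIndex()
# Document ids encode "table:primary key" so a hit can be fetched from MySQL.
index.add("schools:1", "St. Mary High School, Main Road")
index.add("colleges:7", "City College of Arts, Main Road")
index.add("stores:3", "Main Road Departmental Store")

print(index.search("main road store"))
```

The store matches all three query terms and ranks first, while the two entries that only share "main road" follow. Each hit carries enough information to load the full record from its own table.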

Take a look at sphinx: http://www.sphinxsearch.com/
Per their site:
How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
It's quite popular with a lot of people in the rails community right now, and they all rave about how awesome it is :)

We Keep Coding
