I have a dictionary (in the form of an SQL table) containing model numbers of mobile phones, and an article (or just a line) about mobile phones (as a string in PHP or C). I want to find out which mobile phone models are discussed in that article, but I don't want to do a brute-force search, i.e. look up each and every model name in the text one by one.
I was also thinking of maintaining a hash table of the entire dictionary and then matching it against the hashes of each and every word in the article, looking for collisions. But since the dictionary is very large, the memory overhead of this approach is too high.
Also, consider the case where there is no database at all, i.e. we have everything in the language scope only: the dictionary as an array and the text as a string.
You definitely need a FULLTEXT index on your article field and to perform searches with MATCH ... AGAINST:
SELECT * FROM your_table WHERE MATCH(article) AGAINST ('phonemodel');
An inverted index would help. See the Wikipedia article: Inverted index
Split your articles into tokens and keep only the tokens that are model names. From these you can build an index whose key is the model name and whose value is a list of the articles containing it.
You could also store additional information, such as the positions at which the model name appears in each article.
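As a rough illustration of that idea in PHP (the function and variable names are mine, and it assumes single-token model numbers; multi-word names would need phrase handling or one of the approaches below):

<?php
// Illustrative sketch only: build an in-memory inverted index mapping
// model name => [article_id => positions where the model name occurs].
function build_model_index(array $articles, array $models): array
{
    $modelSet = array_fill_keys(array_map('strtolower', $models), true);
    $index = [];
    foreach ($articles as $articleId => $text) {
        // Tokenise on anything that is not a letter, digit or dash.
        $tokens = preg_split('/[^a-z0-9\-]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $pos => $token) {
            if (isset($modelSet[$token])) {
                $index[$token][$articleId][] = $pos;
            }
        }
    }
    return $index;
}

// Usage:
// $index = build_model_index([7 => 'Hands-on with the Lumia-920 ...'], ['Lumia-920', 'N95']);
// print_r($index);   // ['lumia-920' => [7 => [3]]]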
If you are thinking of using C and performance is what you are after, I would suggest building a trie (http://en.wikipedia.org/wiki/Trie) over all the words in the articles. It's a little faster than hashing and consumes much less memory than a dictionary/hash table.
It's not easy to implement in C, but I'm sure you can find a ready-made one somewhere.
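For illustration, here is a minimal trie sketch in PHP using nested arrays (the same node layout ports to C with an array of child pointers per node). It's a sketch, not a tuned implementation:

<?php
// '$' marks the end of a complete word in the trie.
function trie_insert(array &$trie, string $word): void
{
    $node = &$trie;
    foreach (str_split(strtolower($word)) as $ch) {
        if (!isset($node[$ch])) {
            $node[$ch] = [];
        }
        $node = &$node[$ch];
    }
    $node['$'] = true;   // terminal marker
}

function trie_contains(array $trie, string $word): bool
{
    $node = $trie;
    foreach (str_split(strtolower($word)) as $ch) {
        if (!isset($node[$ch])) {
            return false;
        }
        $node = $node[$ch];
    }
    return isset($node['$']);
}

// Usage: insert every model name once, then test each word of the article.
$trie = [];
trie_insert($trie, 'N95');
var_dump(trie_contains($trie, 'n95'));   // bool(true)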
Good Luck (:
If you have huge data, then use one of these:
Sphinx
Lucene
A trie or DAWG (Directed Acyclic Word Graph) is an elegant solution, but also hard to implement and maintain. And MySQL FULLTEXT search is good, but not for large data sets.
Related
In CodeIgniter I am looking for a way to do some post processing on queries on a specific table/model. I can think of a number of ways of doing this, but I can't figure out any particularly nice way that would work well in the long run.
So what I am trying to do is something like this:
I have a table with a serial number column which is stored as an int (so it can be used as AI and PK, which might or might not be a great idea, but that's how it is right now anyway). In all circumstances where this serial number is used (in views, search queries, the real world, etc.) it is used with a three-letter prefix. I could add this in the view or wherever needed, but I guess my question is more about what would be the best design choice. Is there a good way to add a column ('ABC' + serial) after queries so that it is mostly transparent to the rest of the application? Perhaps something similar to CakePHP's afterFind() hook?
You can do that in the query itself:
SELECT CONCAT(prefix, serial_number) AS prefixed FROM table_name
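If you want this to stay transparent to the rest of the application, one option is to wrap it in the model so every caller receives the prefixed value. A rough CodeIgniter-style sketch (the table, column and prefix names are placeholders, not from the question):

<?php
// application/models/Serial_model.php -- illustrative sketch only.
class Serial_model extends CI_Model
{
    // Let MySQL build the prefixed value so callers never see the bare int.
    public function get_all()
    {
        $sql = "SELECT t.*, CONCAT('ABC', t.serial_number) AS display_serial
                FROM your_table t";
        return $this->db->query($sql)->result_array();
    }

    // Or post-process the rows in the model, afterFind()-style.
    public function get_all_post_processed()
    {
        $rows = $this->db->get('your_table')->result_array();
        foreach ($rows as &$row) {
            $row['display_serial'] = 'ABC' . $row['serial_number'];
        }
        return $rows;
    }
}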
I've got a requirement to encrypt personally identifiable information (PII) in an application DB. The application uses smart searches that rely on sounds-like, name-root and partial-word matching to find a name and address quickly.
If we encrypt those fields (the PII data encrypted at the application tier), the searches will be impacted by the volume of records, because we can't rely on SQL in the normal way and the search engine (in the application) would have to read all values, decrypt them, and do the searching itself.
Is there any easy way of solving this so we can always encrypt the PII data and also give our user base the fast search functionality?
We are using a PHP Web/App Tier (Zend Server and a SQL Server DB). The application does not currently use technology like Lucene etc.
Thanks
Cheers
Encrypting the data also makes it look a great deal like randomized bit strings. This precludes any operation that shortcuts searching via an index.
For some encrypted data, e.g. a Social Security number, you can store a hash of the number in a separate column, then index this hash field and search on the hash. This has limited utility, obviously, and is of no value in searches such as name LIKE 'ROB%'.
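For exact-match lookups that can look roughly like this (the table and column names and the encrypt_pii() helper are placeholders; the key point is that the lookup hash is deterministic, using the same key for every row, as noted further down):

<?php
// $pdo, $lookupKey and encrypt_pii() are assumed to exist elsewhere.
$ssn     = '123-45-6789';                               // example value
$ssnHash = hash_hmac('sha256', $ssn, $lookupKey);       // same $lookupKey for every row

// Store the encrypted SSN plus the deterministic hash used only for lookups.
$stmt = $pdo->prepare(
    'INSERT INTO patients (ssn_encrypted, ssn_hash) VALUES (:enc, :hash)');
$stmt->execute([':enc' => encrypt_pii($ssn), ':hash' => $ssnHash]);

// Search: hash the search term the same way and hit the index on ssn_hash.
$stmt = $pdo->prepare('SELECT * FROM patients WHERE ssn_hash = :hash');
$stmt->execute([':hash' => hash_hmac('sha256', $searchSsn, $lookupKey)]);
$rows = $stmt->fetchAll();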
"If your database is secured properly" may sound nice, but it is very difficult to achieve if the bad guys can break in and steal your servers or backups. And if it is truly a requirement (not just a negotiable, marketing-driven item), you are forced to comply.
You may be able to negotiate storing partial data unencrypted, e.g. the first 3 characters of the last name or suchlike, so that you can still have useful (if not perfect) indexing.
ADDED
I should have added that you might be allowed to hash part of a name field, and search on that hash -- assuming you are not allowed to store partial name unencrypted -- you lose usefulness again, but it may still be better than no index at all.
For this hashing to be useful, it cannot be seeded -- i.e., all records must hash based on the same seed (or no seed), or you will be stuck performing a table scan.
You could also create a covering index (still encrypted, of course); a table scan could then be considerably quicker due to the reduced I/O and memory required.
I'll try to write about this simply because often the crypto community can be tough to understand (I resisted the urge to insert a pun here).
A specific solution I have used which works nicely for names is to create index tables for things you wish to index and search quickly like last names, and then encrypt these index column(s) only.
For example, you could create a table where the key column contains one entry for every possible combination of characters A-Z in a 3-letter string (and include spaces for all but the first character). Like this:
A__
AA_
AAA
AAB
AAC
AAD
..
..
..
ZZY
ZZZ
Then, when you add a person to your database, you also add their ID to a second column in this index table, which is just a list of person IDs.
Example: In your patients table, you would have an entry for smith like this:
231 Smith John A 1/1/2016 .... etc
and this entry would be encrypted, perhaps all columns but the ID 231. You would then add this person to the index table:
SMH [342, 2342, 562, 12]
SMI [123, 175, 11, 231]
Now you encrypt this second column (the list of ID's). So when you search for a last name, you can type in 'smi' and quickly retrieve all of the last names that start with this letter combination. If you don't have the key, you will just see a cyphertext. You can actually create two columns in such a table, one for first name and one for last name.
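A rough PHP sketch of that lookup path (the table and column names, the storage format of the encrypted ID list, and the AES mode are all assumptions, not part of the answer):

<?php
// Illustrative lookup against an encrypted index table:
//   name_index(prefix CHAR(3) PRIMARY KEY, ids_encrypted TEXT)
// where ids_encrypted holds "base64(iv):base64(ciphertext)" of a JSON ID list.
function find_people_by_prefix(PDO $pdo, string $prefix, string $key): array
{
    $stmt = $pdo->prepare('SELECT ids_encrypted FROM name_index WHERE prefix = :p');
    $stmt->execute([':p' => strtoupper(substr($prefix . '  ', 0, 3))]); // pad to 3 chars
    $blob = $stmt->fetchColumn();
    if ($blob === false) {
        return [];
    }

    // Decrypt the ID list.
    [$iv, $cipher] = explode(':', $blob, 2);
    $json = openssl_decrypt($cipher, 'aes-256-cbc', $key, 0, base64_decode($iv));
    $ids  = json_decode($json, true) ?: [];
    if (!$ids) {
        return [];
    }

    // Fetch the matching (still encrypted) person rows by primary key.
    $in   = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $pdo->prepare("SELECT * FROM patients WHERE id IN ($in)");
    $stmt->execute(array_values($ids));
    return $stmt->fetchAll();
}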
This method is just as quick as a plaintext index and uses some of the same underlying principles. You can do the same thing with a soundex ('sounds like') by constructing a table with all possible soundex patterns as your left column, and person (patient?) Id's as another column. By creating multiple such indices you can develop a nice way to hone in on the name you are looking for.
You can also extend to more characters if you like, but obviously this lengthens your table by more than an order of magnitude for each extra letter. It does have the advantage of making your index more specific (not always what you want). Truthfully, any type of histogram where you can bin people using their names will work. I have seen this done with date of birth as well, and with anything else you need to search on.
A table like this suffers from some vulnerabilities, particularly that because the number of entries for certain buckets may be very short, it would be possible for an attacker to determine which names have no entries in the system. However, using a sort of random 'salt' in your index list can help with this. Other problems include the need to constantly update all of your indices every time values get updated.
But even so, this method creates a nicely encrypted system that goes beyond data-at-rest. Data-at-rest only protects you from attackers who cannot gain authorization to your systems, but this system provides a layer of protection against DBA's and other personnel who may need to work in the database but do not need (or want) to see the personal data contained within. They will just see ciphertext. So, an additional key is needed by the users or systems that actually need/want to access this information. Ashley Madison would have been wise to employ such a tactic.
Hope this helps.
Sometimes, "encrypt the data" really means "encrypt the data at rest". Which is to say that you can use Transparent Data Encryption to protect your database files, backups, and the like but the data is plainly viewable through querying. Find out if this would be sufficient to meet whatever regulations you're trying to satisfy and that will make your job a whole lot easier.
I'm programming a search engine for my website in PHP, SQL and jQuery. I have experience adding autocomplete with existing data in the database (i.e. searching article titles). But what if I want to use the most common search queries that users type, similar to what Google has, without having enough users to contribute to the creation of that data (the most common queries)? Is there some kind of open-source SQL table with autocomplete data in it, or something similar?
For now, use the static data that you already have for autocomplete.
Create another table in your database to store the actual user queries. The schema of the table can be <queryID, query, count>, where count is incremented each time the same query is submitted by another user (a kind of rank). Build an n-gram index over the queries (so that you can also autocomplete something like "Manchester United" when a person types just "United", i.e. not only prefix matches) and simply return the top N after sorting by count.
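A minimal sketch of the logging and lookup side (the table and column names are mine, and a simple LIKE '%...%' scan stands in for the real n-gram index):

<?php
// Record a submitted query (assumes a UNIQUE index on search_log.query).
function record_query(PDO $pdo, string $query): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO search_log (query, `count`) VALUES (:q, 1)
         ON DUPLICATE KEY UPDATE `count` = `count` + 1');
    $stmt->execute([':q' => mb_strtolower(trim($query))]);
}

// Return the top-N most popular stored queries containing the typed fragment.
function suggest(PDO $pdo, string $fragment, int $limit = 10): array
{
    $stmt = $pdo->prepare(
        'SELECT query FROM search_log
          WHERE query LIKE :frag
          ORDER BY `count` DESC
          LIMIT ' . (int) $limit);
    $stmt->execute([':frag' => '%' . mb_strtolower($fragment) . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}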
The above table will gradually keep improving as your user base grows.
One more thing: the algorithm for accomplishing your task is pretty simple. The real challenge lies in returning the data to be displayed in a fraction of a second. So when your query database/store grows, you can use a search engine like Solr/Sphinx to do the searching for you, which will be very fast at returning the results to be rendered.
You can use the Lucene search engine for this functionality,
or you may also have a look at Lucene/Solr autocomplete...
Google has (and keeps adding) thousands of entries, arranged according to day, time, geolocation, language, and so on, and the set grows with users' entries. Whenever a user types a word, the system checks the table of "most used words for that location + day + time" and, if there is no answer, falls back to "general words". So you should categorize every word entered by users, or build a general word-relation table in your database from which the most suitable search answer can be referenced.
Yesterday I stumbled on something that answered my question. Google draws autocomplete suggestions from this XML endpoint, so it is wise to use it if you have too few users to create your own database of keywords:
http://google.com/complete/search?q=[keyword]&output=toolbar
Just replace [keyword] with some word and it will give suggestions for that word; then the task is just to parse the returned XML and format the output to suit your needs.
I'm struggling with Lucene and not sure how best to do this: I've got users' profile data, some of which (3-4 fields) is stored in Lucene. But in the query results I also need to show the user's age/name/etc.
I don't think it's reasonable to store all of these fields (the additional ones, which do not participate in the search) in Lucene, but querying the RDBMS will also take some time, so my question is: what is the better way to do it?
Thanks.
Indexing all profile fields with Lucene gives a better search experience to end users, as it will search over all fields and do appropriate ranking. In an RDBMS I don't know of full-text search over multiple columns with ranking; in such cases I have always preferred Lucene.
You are, however, also required to keep the index in sync with the RDBMS.
This blog post tries to give you tools to choose between a full text search engine and a database. A compromise is to index all searchable fields and store an id you can use to retrieve a record from the database using a database key.
Apart from taking more disk space, using stored fields in the index does not impact query performance. I would go with that.
I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds etc) and find which keywords from the database are present in the text. The database of keywords is much larger than the text.
Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).
The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.
SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
Another approach would be to create a hash-table in memory (using something like memcache) and to check against it in the same manner.
Does anyone have experience with this kind of searching, and any suggestions on how to implement it better? I haven't tried any of these approaches yet; I'm just gathering ideas at this point.
The classic way of searching a text stream for multiple keywords is the Aho-Corasick finite automaton, which uses time linear in the text to be searched. You'll want minor adaptations to recognize strings only on word boundaries, or perhaps it would be simpler just to check the keywords found and make sure they are not embedded in larger words.
You can find an implementation in fgrep. Even better, Preston Briggs wrote a pretty nice implementation in C that does exactly the kind of keyword search you are talking about. (It searches programs for occurrences of 'interesting' identifiers.) Preston's implementation is distributed as part of the Noweb literate-programming tool. You could find a way to call this code from PHP, or you could rewrite it in PHP; the recognizer itself is about 220 lines of C, and the main program is another 135 lines.
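If you do decide to roll your own in PHP rather than call the C code, a compact, byte-oriented sketch of the automaton might look like the following (lowercased ASCII only, no word-boundary handling, so you would still post-check that matches are not embedded in larger words, as suggested above). This is illustrative, not a tuned implementation:

<?php
// Compact Aho-Corasick sketch: build the automaton once from the keyword
// list, then scan each incoming text in a single pass.
// Node layout: ['next' => [char => node id], 'fail' => node id, 'out' => [keywords]].
function ac_build(array $keywords): array
{
    $nodes = [['next' => [], 'fail' => 0, 'out' => []]];

    // 1. Build a trie of all keywords.
    foreach ($keywords as $kw) {
        $cur = 0;
        foreach (str_split(strtolower($kw)) as $ch) {
            if (!isset($nodes[$cur]['next'][$ch])) {
                $nodes[] = ['next' => [], 'fail' => 0, 'out' => []];
                $nodes[$cur]['next'][$ch] = count($nodes) - 1;
            }
            $cur = $nodes[$cur]['next'][$ch];
        }
        $nodes[$cur]['out'][] = $kw;
    }

    // 2. Breadth-first pass to compute failure links and merge outputs.
    $queue = array_values($nodes[0]['next']);
    while ($queue) {
        $id = array_shift($queue);
        foreach ($nodes[$id]['next'] as $ch => $child) {
            $queue[] = $child;
            $f = $nodes[$id]['fail'];
            while ($f !== 0 && !isset($nodes[$f]['next'][$ch])) {
                $f = $nodes[$f]['fail'];
            }
            $nodes[$child]['fail'] = $nodes[$f]['next'][$ch] ?? 0;
            $nodes[$child]['out']  = array_merge(
                $nodes[$child]['out'],
                $nodes[$nodes[$child]['fail']]['out']
            );
        }
    }
    return $nodes;
}

// Returns [keyword, start offset] pairs for every occurrence in $text.
function ac_search(array $nodes, string $text): array
{
    $hits = [];
    $cur  = 0;
    foreach (str_split(strtolower($text)) as $i => $ch) {
        while ($cur !== 0 && !isset($nodes[$cur]['next'][$ch])) {
            $cur = $nodes[$cur]['fail'];
        }
        $cur = $nodes[$cur]['next'][$ch] ?? 0;
        foreach ($nodes[$cur]['out'] as $kw) {
            $hits[] = [$kw, $i - strlen($kw) + 1];
        }
    }
    return $hits;
}

// Usage:
// $ac = ac_build(['apple', 'pineapple', 'pen']);
// print_r(ac_search($ac, 'A pineapple and a pen.'));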
All the proposed solutions, including Aho-Corasick, have these properties in common:
A preprocessing step that takes time and space proportional to the number of keywords in the database.
A search step that takes time and space proportional to the length of the text plus the number of keywords found.
Aho-Corasick offers considerably better constants of proportionality on the search step, but if your texts are small, this won't matter. In fact, if your texts are small and your database is large, you probably want to minimize the amount of memory used in the preprocessing step. Andrew Appel's DAWG data structure from the world's fastest scrabble program will probably do the trick.
In general,
a. break the text into words
b. convert words back to canonical root form
c. drop common conjunction words
d. strip duplicates
insert the words into a temporary table then do an inner join against the keywords table,
or (as you suggested) build the keywords into a complex query criteria
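A rough PHP sketch of steps a-d followed by the IN-clause variant (the stop-word list and the toy "stemming" are placeholders; a real implementation would use a proper stemmer):

<?php
// a-d: tokenize, drop common words, canonicalize, deduplicate.
function extract_candidate_words(string $text, array $stopWords): array
{
    $tokens = preg_split('/[^a-z0-9]+/', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY); // a.
    $tokens = array_diff($tokens, $stopWords);                       // c. drop common words
    $tokens = array_map(function ($w) {
        return preg_replace('/s$/', '', $w);                         // b. toy canonical root form
    }, $tokens);
    return array_values(array_unique($tokens));                      // d. strip duplicates
}

// Match the candidate words against the keywords table in one IN (...) query.
function match_keywords(PDO $pdo, array $words): array
{
    if (!$words) {
        return [];
    }
    $placeholders = implode(',', array_fill(0, count($words), '?'));
    $stmt = $pdo->prepare(
        "SELECT user_id, keyword FROM keywords WHERE keyword IN ($placeholders)");
    $stmt->execute($words);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}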
It may be worthwhile to cache a 3- or 4-letter hash array with which to pre-filter potential keywords; you will have to experiment to find the best tradeoff between memory size and effectiveness.
I'm not 100% clear on what you're asking, but maybe what you're looking for is an inverted index?
Update:
You can use an inverted index to match multiple keywords at once.
Split up the new document into tokens, and insert the tokens paired with an identifier for the document into the inverted index table. A (rather denormalized) inverted index table:
inverted_index
-----
document_id keyword
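Populating that table for a new document could look roughly like this in PHP (the tokenizer and the use of PDO are my assumptions; the table matches the schema above):

<?php
// Tokenize a new document and insert one (document_id, keyword) row
// per distinct token into the inverted index.
function index_document(PDO $pdo, int $documentId, string $text): void
{
    $tokens = array_unique(
        preg_split('/[^a-z0-9]+/', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY)
    );

    $stmt = $pdo->prepare(
        'INSERT INTO inverted_index (document_id, keyword) VALUES (:doc, :kw)');

    $pdo->beginTransaction();
    foreach ($tokens as $token) {
        $stmt->execute([':doc' => $documentId, ':kw' => $token]);
    }
    $pdo->commit();
}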
If you're searching for 3 keywords manually:
select document_id, count(*) from inverted_index
where keyword in ('keyword1', 'keyword2', 'keyword3')
group by document_id
having count(*) = 3
If you have a table of the keywords you care about, just use an inner join rather than an in() operation:
keyword_table
----
keyword othercols
select keyword_table.keyword, keyword_table.othercols from inverted_index
inner join keyword_table on keyword_table.keyword=inverted_index.keyword
where inverted_index.document_id=id_of_some_new_document
is any of this closer to what you want?
Have you considered graduating to a fulltext solution such as Sphinx?
I'm talking out of my hat here, because I haven't used it myself. But it's getting a lot of attention as a high-speed fulltext search solution. It will probably scale better than any relational solution you use.
Here's a blog about using Sphinx as a fulltext search solution in MySQL.
I would do 2 things here.
First (and this isn't directly related to the question), I'd break up and partition the user keywords by user: more tables with less data in each, ideally on different servers, with distributed lookups where slices or ranges of users live on different slices. I.e., all of user A's data exists on slice one, user B's on slice two, etc.
Second, I'd have some sort of in-memory hash table to determine the existence of keywords. This would likely be federated as well, to distribute the lookups: for n keyword-existence servers, hash the keyword, take it modulo n, and distribute those key ranges across all of the memcached servers. This quick check lets you ask "is keyword x being watched?": hash it, determine which server it lives on, then do the lookup and collect/aggregate the keywords being tracked.
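A sketch of that "hash it and determine what server it would live on" step (the host list, the key naming and the use of the Memcached extension are assumptions):

<?php
// Pick the keyword-existence server by hashing the keyword modulo
// the number of servers, then ask that server whether the keyword is watched.
function is_keyword_watched(string $keyword, array $servers): bool
{
    $keyword = strtolower($keyword);
    $slot    = abs(crc32($keyword)) % count($servers);   // deterministic server choice

    // In practice you would reuse a pooled/persistent connection per slot.
    $mc = new Memcached();
    $mc->addServer($servers[$slot][0], $servers[$slot][1]);

    // Assumes the watched-keyword set is maintained elsewhere as "kw:<keyword>" => 1.
    return (bool) $mc->get('kw:' . $keyword);
}

// Usage:
// $servers = [['10.0.0.1', 11211], ['10.0.0.2', 11211], ['10.0.0.3', 11211]];
// var_dump(is_keyword_watched('acme widget', $servers));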
At that point you'll at least know which keywords are being tracked and you can take your user slices and perform subsequent lookups to determine which users are tracking which keywords.
In short: SQL is not an ideal solution here.
I hacked up some code for scanning for multiple keywords using a DAWG (as suggested above, referencing the Scrabble paper), although I wrote it from first principles and I don't know whether it resembles the Aho-Corasick algorithm or not.
http://www.gtoal.com/wordgames/spell/multiscan.c.html
A friend made some hacks to my code after I first posted it on the wordgame programmers mailing list, and his version is probably more efficient:
http://www.gtoal.com/wordgames/spell/multidawg.c.html
Scales fairly well...
G