I've got a requirement to encrypt personally identifiable information (PII) in an application DB. The application uses smart searches that rely on sounds-like (phonetic), name-root and partial-word matching to find names and addresses quickly.
If we encrypt those fields (the PII data encrypted at the application tier), the searches will be impacted by the volume of records, because we can't rely on SQL in the normal way: the search engine (in the application) would have to read all values, decrypt them and run the searches itself.
Is there any easy way of solving this so we can always encrypt the PII data and also give our user base the fast search functionality?
We are using a PHP web/app tier (Zend Server) and a SQL Server DB. The application does not currently use technology like Lucene.
Thanks
Cheers
Encrypting the data also makes it look a great deal like randomized bit strings. This precludes any operation that shortcuts searching via an index.
For some encrypted data, e.g. a Social Security number, you can store a hash of the number in a separate column, then index this hash column and search for the hash. This has limited utility obviously, and is of no value in searches like name LIKE 'ROB%'.
"If your database is secured properly" may sound nice, but it is very difficult to achieve if the bad guys can break in and steal your servers or backups. And if it is truly a requirement (not just a negotiable, marketing-driven item), you are forced to comply.
You may be able to negotiate storing partial data unencrypted, e.g., the first three characters of the last name or suchlike, so that you can still have useful (if not perfect) indexing.
ADDED
I should have added that you might be allowed to hash part of a name field and search on that hash -- assuming you are not allowed to store a partial name unencrypted. You lose usefulness again, but it may still be better than no index at all.
For this hashing to be useful, it cannot be salted per record -- i.e., all records must hash with the same seed (or no seed), or you will be stuck performing a table scan.
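As an illustration, here is a minimal PHP sketch of the hash-column idea (the table, column and key names are hypothetical; PDO with the SQL Server driver is assumed, and encryptValue() stands in for whatever application-tier encryption you already have). The key point is that every row is hashed with the same application-wide key, so an ordinary index on the hash column supports exact-match lookups without decrypting anything:

    <?php
    // Hypothetical schema: patients(id, ssn_encrypted, ssn_hash), with an index on ssn_hash.
    // All rows MUST use the same $hashKey, otherwise lookups degrade to table scans.
    $pdo     = new PDO('sqlsrv:Server=localhost;Database=app', 'user', 'pass');
    $hashKey = getenv('PII_HASH_KEY');            // single application-wide key

    function piiHash(string $value, string $key): string {
        // Normalize first so '123-45-6789 ' and '123-45-6789' hash identically.
        return hash_hmac('sha256', preg_replace('/\s+/', '', $value), $key);
    }

    $ssn           = '123-45-6789';               // plaintext arriving at the app tier
    $ssnCiphertext = encryptValue($ssn);          // placeholder for your existing encryption

    // Store the ciphertext plus the deterministic hash.
    $ins = $pdo->prepare('INSERT INTO patients (ssn_encrypted, ssn_hash) VALUES (?, ?)');
    $ins->execute([$ssnCiphertext, piiHash($ssn, $hashKey)]);

    // Exact-match search hits the indexed hash column, never the ciphertext.
    $sel = $pdo->prepare('SELECT id FROM patients WHERE ssn_hash = ?');
    $sel->execute([piiHash('123-45-6789', $hashKey)]);

The same pattern works for a name prefix: hash strtoupper(substr($lastName, 0, 3)) into its own indexed column if you are not allowed to store that prefix in the clear.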
You could also create a covering index, still encrypted of course, but a table scan could be considerably quicker due to the reduced I/O and memory required.
I'll try to write about this simply because often the crypto community can be tough to understand (I resisted the urge to insert a pun here).
A specific solution I have used which works nicely for names is to create index tables for things you wish to index and search quickly like last names, and then encrypt these index column(s) only.
For example, you could create a table where the key column contains one entry for every possible combination of characters A-Z in a 3-letter string (and include spaces for all but the first character). Like this:
A__
AA_
AAA
AAB
AAC
AAD
..
..
..
ZZY
ZZZ
Then, when you add a person to your database, you add their ID to the matching row's second column, which is just a list of person IDs.
Example: In your patients table, you would have an entry for Smith like this:
231 Smith John A 1/1/2016 .... etc
and this entry would be encrypted, perhaps all columns but the ID 231. You would then add this person to the index table:
SMH [342, 2342, 562, 12]
SMI [123, 175, 11, 231]
Now you encrypt this second column (the list of IDs). So when you search for a last name, you can type in 'smi' and quickly retrieve all of the last names that start with this letter combination. Without the key, you will just see ciphertext. You can actually create two columns in such a table, one for first names and one for last names.
This method is just as quick as a plaintext index and uses some of the same underlying principles. You can do the same thing with a soundex ('sounds like') search by constructing a table with all possible soundex patterns as your left column and person (patient?) IDs as the other column. By creating multiple such indices you can develop a nice way to home in on the name you are looking for.
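A hedged PHP sketch of how such an index table might be read and updated at the application tier. encryptValue()/decryptValue() stand in for whatever encryption routine you already use, the table and column names are made up, and name_index is assumed to be pre-populated with every three-letter key as described above:

    <?php
    function indexKeyFor(string $name): string {
        // First three letters, upper-cased and space-padded, e.g. 'Smith' -> 'SMI', 'Ng' -> 'NG '.
        return str_pad(strtoupper(substr($name, 0, 3)), 3, ' ');
    }

    function addToNameIndex(PDO $pdo, string $lastName, int $personId): void {
        $key = indexKeyFor($lastName);
        $sel = $pdo->prepare('SELECT person_ids FROM name_index WHERE index_key = ?');
        $sel->execute([$key]);
        $enc = $sel->fetchColumn();
        $ids = $enc ? json_decode(decryptValue($enc), true) : [];   // decrypt the existing ID list
        $ids[] = $personId;
        $upd = $pdo->prepare('UPDATE name_index SET person_ids = ? WHERE index_key = ?');
        $upd->execute([encryptValue(json_encode($ids)), $key]);
    }

    function searchNameIndex(PDO $pdo, string $prefix): array {
        $sel = $pdo->prepare('SELECT person_ids FROM name_index WHERE index_key = ?');
        $sel->execute([indexKeyFor($prefix)]);
        $enc = $sel->fetchColumn();
        return $enc ? json_decode(decryptValue($enc), true) : [];   // e.g. [123, 175, 11, 231] for 'SMI'
    }

The lookup itself is an ordinary indexed equality search on index_key; only the ID-list payload is ciphertext.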
You can also extend to more characters if you like, but obviously this lengthens your table by more than an order of magnitude for each letter. It does have the advantage of making your index more specific (not always what you want). Truthfully, any kind of histogram where you can bin people using their names will work. I have seen this done with date of birth as well; anything you need to search on.
A table like this suffers from some vulnerabilities, particularly that because the number of entries for certain buckets may be very small, it would be possible for an attacker to determine which names have no entries in the system. However, using a sort of random 'salt' in your index list can help with this. Other problems include the need to constantly update all of your indices every time values get updated.
But even so, this method creates a nicely encrypted system that goes beyond data-at-rest. Data-at-rest encryption only protects you from attackers who cannot gain authorization to your systems, but this system provides a layer of protection against DBAs and other personnel who may need to work in the database but do not need (or want) to see the personal data contained within. They will just see ciphertext. So an additional key is needed by the users or systems that actually need to access this information. Ashley Madison would have been wise to employ such a tactic.
Hope this helps.
Sometimes, "encrypt the data" really means "encrypt the data at rest". Which is to say that you can use Transparent Data Encryption to protect your database files, backups, and the like but the data is plainly viewable through querying. Find out if this would be sufficient to meet whatever regulations you're trying to satisfy and that will make your job a whole lot easier.
Related
I'm trying to implement data anonymization in MySQL and PHP.
At the moment I'm separating the data by encrypting the foreign key/ID using the user's password and saving it in the 'user' account table. But I quickly realized that when a user is initially created and I insert the first data into the other tables, I can match them together by row count.
What I thought of doing is to randomly swap the user account details each time a new account is created, but this feels very inefficient.
I cannot find anything related online, not even a basic explanation of how one would properly achieve user data separation so that it is completely anonymized. Can anyone explain what goes into achieving data anonymization in an RDBMS architecture?
Thanks a lot in advance!
EDIT:
To be more clear, let's imagine I have two tables: one holding the user email and an encrypted unique foreign key (account-table); the other holding user preferences/info (this table will always hold one row per user).
Now let's say I add a new user to account-table and data to the user-preferences/info table. In reality, I can still tell by counting the table rows that this info is owned by that user.
I can't encrypt all of this data, because some of it might be public (anonymously). And even so, making the rows unrelated to each other still makes it harder for anyone who gets hold of this encrypted data to match it to any user.
I'm looking for complete anonymity and privacy not just through encryption, but through separation of user data. I want data to be completely private to the user, possibly without duplicating any of it in multiple places.
Would the random swap be the best approach in this case? (Copy a randomly picked user and swap/overwrite the new data into their original row.)
You need to look at differential privacy. The idea here is to preserve the original data in one record, but add carefully randomised data that looks very similar to it.
For example, imagine you were storing users' year of birth. If you add a single user record and an unrelated, separate single birth-year record, it's very likely (as you say) that you will be able to reverse the relationship and reassociate the two. However, you could add multiple records with randomised values clustered around the real value (but not exactly centred on it, as that's statistically reversible too). So user1, born in 1970, gets records for 1968, 1969, 1970 and 1971; user2, born in 1980, gets 1979, 1980, 1981 and 1982. You then can't tell exactly which record is correct, but on average the values are reasonably close. Note that this even works for a single record.
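A rough PHP sketch of that clustering idea (illustrative only; this is not a calibrated differential-privacy mechanism, just the decoy-record pattern described above):

    <?php
    // For each real birth year, store several records scattered around the true
    // value, with a random window start so the real value is not always centred.
    function decoyBirthYears(int $realYear, int $count = 4): array {
        $offset = random_int(-($count - 1), 0);   // window always contains the real year
        $years  = [];
        for ($i = 0; $i < $count; $i++) {
            $years[] = $realYear + $offset + $i;
        }
        shuffle($years);                          // storage order reveals nothing
        return $years;
    }

    // Example: a user born in 1970 might get [1968, 1971, 1969, 1970].
    print_r(decoyBirthYears(1970));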
But there is a further concern here – exactly how anonymous do you want records to be? The degree of anonymity you need may depend on the nature of the data you're processing. This simple example only looks at a single field - one that may indeed not allow reidentification when used alone, but might provide sufficient information when combined with other fields, even if they use a similar approach.
As you may gather, this is a difficult and subtle thing to design effectively – the algorithm for figuring out how much noise you need to add is something that has won mathematics medals!
Another approach is homomorphic encryption, which lets you keep the real data without knowing what it is: you can still do things like searching over it, but without actually being able to see the underlying values.
Since you're in PHP, you might find CipherSweet provides a useful toolkit.
Why do sites like YouTube, Imgur and most others use random characters as their content ids rather than just sequential numbers, like those created by auto-increment in MySQL?
To explain what I mean:
In the URL: https://www.youtube.com/watch?v=QMlXuT7gd1I
The QMlXuT7gd1I at the end indicates the specific video on that page, but I'm assuming that video also has a unique numeric id in the database. Why do they create and use this alphanumeric string rather than just use the video's database id?
I'm creating a site which identifies content in the URL like above, but I'm currently using just the DB id. I'm considering switching to random strings because all major sites do it, but I'd like to know why this is done before I implement it.
Thanks!
Some sites do that because of sharding.
When you have only one process (one server) writing, it is possible to generate auto-increment IDs without duplicates, but when you have multiple servers (with multiple processes) writing content, like YouTube, it's not possible to use an auto-increment ID anymore. The cost of synchronization to avoid duplication would be huge.
For example, if you read MongoDB's ObjectId documentation, you can see this structure for the ID:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
In the end, it's only 12 bytes. When you represent it in hexadecimal it looks like 24 characters, but that is only its displayed form.
Another advantage of this scheme is that the timestamp is included in the ID, so you can decode the ID to get the timestamp back.
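For example, in PHP you can recover the creation time from the hex form of an ObjectId by decoding its first 8 hex characters (the 4-byte timestamp):

    <?php
    $objectId  = '507f1f77bcf86cd799439011';        // example ObjectId string
    $timestamp = hexdec(substr($objectId, 0, 8));   // first 4 bytes = seconds since the epoch
    echo date('Y-m-d H:i:s', $timestamp), PHP_EOL;  // prints the creation time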
First, this is not a random string; it is a base conversion that depends on the ID. They go this way because an alphanumeric alphabet gives a bigger base.
Something like 99999999 could be 1NJCHR
Take a look here, and play with the bases, and learn more about it.
You will see it is much shorter. That is the only reason I can imagine someone would go this way, and it makes sense if you have IDs like 54389634589347534985348957863457438959734.
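In PHP you can see the effect with base_convert(), which handles bases up to 36 (digits plus letters). Larger alphabets (base 62, or the URL-safe base-64 style that YouTube appears to use) need a custom routine, but the principle is the same:

    <?php
    echo base_convert('99999999', 10, 36), PHP_EOL;   // 1njchr
    echo base_convert('1njchr', 36, 10), PHP_EOL;     // back to 99999999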
As self and Cameron commented/answered, there is a chance (especially for YouTube) that additional security parameters like time and length are calculated into it in some way, so that you are not able to guess an identifier.
In addition to Christian's answer above, using a base calculation, hashed value or other non-numeric identifier has the advantage of obscuring your db's size from competitors.
Even if you stayed with numeric and set your auto_increment to start at 50,000, increase by 50, etc., educated guesses can still be made at the db's size and growth. Non-numeric options don't eliminate this possibility, but they inhibit it to a certain extent.
There are significant chances of malicious input by end users, and by not exposing sequential IDs, users can't guess other IDs and thus can't guess how large the DB is. However, the other answers about base conversion explain this well.
I've been browsing the net trying to find a solution that will allow us to generate unique IDs in a regionally distributed environment.
I looked at the following options (among others):
SNOWFLAKE (by Twitter)
It seems like a great solution, but I just don't like the added complexity of having to manage another piece of software just to create IDs;
It lacks documentation at this stage, so I don't think it will be a good investment;
The nodes need to be able to communicate to one another using Zookeeper (what about latency / communication failure?)
UUID
Just look at it: 550e8400-e29b-41d4-a716-446655440000;
It's a 128-bit ID;
There have been some known collisions (depending on the version, I guess); see this post.
AUTOINCREMENT IN RELATIONAL DATABASE LIKE MYSQL
This seems safe, but unfortunately, we are not using relational databases (scalability preferences);
We could deploy a MySQL server for this like what Flickr does, but again, this introduces another point of failure / bottleneck. Also added complexity.
AUTOINCREMENT IN A NON-RELATIONAL DATABASE LIKE COUCHBASE
This could work since we are using Couchbase as our database server, but;
This will not work when we have more than one cluster in different regions; with latency issues and network failures, IDs will collide at some point, depending on the amount of traffic;
MY PROPOSED SOLUTION (this is what I need help with)
Let's say that we have clusters consisting of 10 Couchbase nodes and 10 application nodes in 5 different regions (Africa, Europe, Asia, America and Oceania). This is to ensure that content is served from a location closest to the user (to boost speed) and to ensure redundancy in case of disasters, etc.
Now, the task is to generate IDs that won't collide when the replication (and balancing) occurs, and I think this can be achieved in 3 steps:
Step 1
All regions will be assigned integer IDs (unique identifiers):
1 - Africa;
2 - America;
3 - Asia;
4 - Europe;
5 - Oceania.
Step 2
Assign an ID to every application node that is added to the cluster, keeping in mind that there may be up to 99,999 servers in one cluster (even though I doubt it; this is just a safety precaution). This will look something like this (fake IPs):
00001 - 192.187.22.14
00002 - 164.254.58.22
00003 - 142.77.22.45
and so forth.
Please note that all of these are in the same cluster, which means each region can have its own node 00001.
Step 3
For every record inserted into the database, an incremented ID will be used to identify it, and this is how it will work:
Couchbase offers an increment feature that we can use to create IDs internally within the cluster. To ensure redundancy, 3 replicas will be created within the cluster. Since these are in the same place, I think it is safe to assume that unless the whole cluster is down, one of the nodes responsible for this will be available; otherwise the number of replicas can be increased.
Bringing it all together
Say a user is signing up from Europe:
The application node serving the request will grab the region code (4 in this case), get its own ID (say 00005) and then get an incremented ID (1) from Couchbase (from the same cluster).
We end up with 3 components: 4, 00005, 1. Now, to create an ID from this, we can just join these components into 4.00005.1. To make it even better (I'm not too sure about this), we can concatenate (not add up) the components to end up with 4000051.
In code, this will look something like this:
$id = '4'.'00005'.'1';
NB: Not $id = 4+00005+1;.
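As a small sketch, the whole scheme collapses to one helper function. The counter value is assumed to come from Couchbase's increment operation and is passed in here so the sketch stays SDK-agnostic:

    <?php
    // Concatenate (not add) the region code, the zero-padded node ID and the
    // per-cluster counter value into a single string ID.
    function makeRecordId(int $regionCode, int $nodeId, int $counter): string {
        return $regionCode
             . str_pad((string)$nodeId, 5, '0', STR_PAD_LEFT)
             . $counter;
    }

    echo makeRecordId(4, 5, 1);   // "4000051" -> Europe, node 00005, counter 1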
Pros
IDs look better than UUIDs;
They seem unique enough. Even if a node in another region generated the same incremented ID and has the same node ID as the one above, we always have the region code to set them apart;
They can still be stored as integers (probably Big Unsigned integers);
It's all part of the architecture, no added complexities.
Cons
No sorting (or is there)?
This is where I need your input (most)
I know that every solution has flaws, and possibly more than what we see on the surface. Can you spot any issues with this whole approach?
Thank you in advance for your help :-)
EDIT
As #DaveRandom suggested, we can add the 4th step:
Step 4
We can just generate a random number and append it to the ID to prevent predictability. Effectively, you end up with something like this:
4000051357 instead of just 4000051.
I think this looks pretty solid. Each region maintains consistency, and if you use XDCR there are no collisions. INCR is atomic within a cluster, so you will have no issues there. You don't actually need the machine-code part of it: if all the app servers within a region are connected to the same cluster, it's unnecessary to infix the node ID (the 00005 part). If that is useful to you for other reasons (some sort of analytics), then by all means keep it, but it isn't necessary.
So it can simply be '4' . '1' (using your example).
Can you give me an example of what kind of "sorting" you need?
First: one downside of adding entropy (and I am not sure why you would need it) is that you cannot iterate over the ID collection as easily.
For example: if your IDs run from 1-100, which you will know from a simple GET on the counter key, you could assign tasks by group: this task takes 1-10, the next 11-20, and so on, and workers can execute in parallel. If you add entropy, you will need to use a Map/Reduce view to pull the collections down, so you lose the benefit of the key-value pattern.
Second: Since you are concerned with readability, it can be valuable to add a document/object type identifier as well, and this can be used in Map/Reduce Views (or you can use a json key to identify that).
Ex: 'u:' . '4' . '1'
If you are referring to IDs externally, you might want to obscure them in other ways. If you need an example, let me know and I can append my answer with something you could do.
#scalabl3
You are concerned about IDs for two reasons:
Potential for collisions in a complex network infrastructure
Appearance
Starting with the second issue, Appearance. While a UUID certainly isn't a great beauty when it comes to an identifier, there are diminishing returns as you introduce a truly unique number across a complex data center (or data centers) as you mention. I'm not convinced that there is a dramatic change in perception of an application when a long number versus a UUID is used for example in a URL to a web application. Ideally, neither would be shown, and the ID would only ever be sent via Ajax requests, etc. While a nice clean memorable URL is preferable, it's never stopped me from shopping at Amazon (where they have absolutely hideous URLs). :)
Even with your proposal, the identifiers, while shorter in character count than a UUID, are no more memorable than a UUID. So the appearance would likely remain debatable.
Talking about the first point: yes, there are a few cases where UUIDs have been known to generate conflicts. While that shouldn't happen in a properly configured architecture, I can see how it might happen (but I'm personally a lot less concerned about it).
So, if you're talking about alternatives, I've become a fan of the simplicity of the MongoDB ObjectId and its techniques for avoiding duplication when generating an ID. The full documentation is here. The quick relevant pieces are similar to your potential design in several ways:
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
The timestamp can often be useful for sorting. The machine identifier is similar to your application server having a unique ID. The process ID is just additional entropy, and finally, to prevent conflicts, there is a counter that is auto-incremented whenever the timestamp is the same as that of the last ObjectId generated (so that ObjectIds can be created rapidly). ObjectIds can be generated on the client or on the database. Further, ObjectIds take up fewer bytes than a UUID (but only 4 fewer). Of course, you could also skip the timestamp and drop another 4 bytes.
For clarification, I'm not suggesting you use MongoDB, but be inspired by the technique they use for ID generation.
So, I think your solution is decent (and maybe you want to be inspired by MongoDB's implementation of a unique ID) and doable. As to whether you need to do it, I think that's a question only you can answer.
Consider an application which accepts arbitrary-length text input from users, similar to Twitter 'tweets' but up to 1 MiB in size. Due to the distributed nature of the application the same text input may be delivered multiple times to any particular node. In order to prevent the same text from appearing twice in the index (based on Apache Solr), I am using an MD5 hash of the text as a unique key.
Unfortunately, Solr does not support an SQL-like "INSERT IGNORE"; as such, all duplicate documents replace the content of the original document. Since the user of the application can add additional fields, this replacement is problematic. In order to prevent it, I have two choices:
Before each insert, query the index for documents with the MD5 hashed unique key. If I get a result, then I know that the document already exists in the index. I found this approach to be too slow, probably because we are indexing a few hundred documents per minute.
Store the MD5 hash in an additional store, such as a flat file, MySQL, or elsewhere. This approach is the basis of this question.
What forms of data storage can handle a few hundred inserts per minute and quickly tell me whether a value already exists? I am testing with both MySQL (on a different spindle than the Solr index) and with flat files, using grep -w someHash hashes.txt and echo someHash >> hashes.txt. Both approaches seem to slow down as the index grows, but it will take a few days or weeks until I see whether either approach is feasible.
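To make the MySQL variant concrete, here is roughly what it boils down to (simplified sketch; the table name is illustrative, and the hash column carries a unique/primary key so the insert itself doubles as the existence check):

    <?php
    //   CREATE TABLE seen_hashes (hash CHAR(32) NOT NULL PRIMARY KEY);
    $pdo = new PDO('mysql:host=localhost;dbname=dedupe', 'user', 'pass');

    function isNewDocument(PDO $pdo, string $text): bool {
        $stmt = $pdo->prepare('INSERT IGNORE INTO seen_hashes (hash) VALUES (?)');
        $stmt->execute([md5($text)]);
        return $stmt->rowCount() === 1;   // 1 = newly inserted, 0 = hash already present
    }

    $incomingText = '...the arbitrary-length user text...';
    if (isNewDocument($pdo, $incomingText)) {
        // safe to send the document to Solr
    }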
What other methods of storing and checking the existence of a hash are possible? What fundamental issues might I run into with the MySQL and flat files approach? What would Knuth do?
On the Solr side, you can look at Deduplication and UpdateXmlMessages#Optional_attributes, which may serve the purpose.
I have a dictionary (in the form of an SQL table) containing model numbers of mobile phones, and an article (or just a line) about mobile phones (in the form of a string in PHP or C). I want to find out which mobile phone models are discussed in that article, but I don't want to do a brute-force search, i.e. searching for each and every model name in the text one by one.
I was also thinking of maintaining a hash table of the entire dictionary and then trying to match it against the hashes of each and every word in the article, looking for collisions. But since the dictionary is very large, the memory overhead of this approach is too high.
Also, what if there is no database at all, i.e. we have everything in language scope only: the dictionary as an array and the text as a string?
You definitely need a FULLTEXT index on your article field and should perform searches with MATCH ... AGAINST:
SELECT * FROM your_table WHERE MATCH(article) AGAINST('phonemodel');
An inverted index would help. Link: Inverted index
Split your articles into tokens and keep the tokens that are model names. You can then build an index where the key is the model name and the value is a list of articles.
Maybe you can also add extra information, such as the position at which the model name appears in the article.
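A minimal PHP sketch of that idea (the dictionary and articles below are placeholder data for illustration; each token is checked against the dictionary in O(1) instead of scanning every model name against the text):

    <?php
    $dictionary = ['GT-I9300', 'IPHONE5', 'LUMIA920'];        // model numbers from the SQL table
    $articles   = [
        101 => 'Comparing the GT-I9300 against the Lumia920 ...',
        102 => 'The iPhone5 review ...',
    ];

    $modelSet = array_fill_keys(array_map('strtoupper', $dictionary), true);
    $index    = [];                                            // model => list of article IDs

    foreach ($articles as $articleId => $text) {
        $tokens = preg_split('/[^A-Za-z0-9\-]+/', strtoupper($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($tokens) as $token) {
            if (isset($modelSet[$token])) {
                $index[$token][] = $articleId;
            }
        }
    }

    print_r($index);   // ['GT-I9300' => [101], 'LUMIA920' => [101], 'IPHONE5' => [102]]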
If you are thinking of using C and performance is what you wish for, I would suggest building a trie (http://en.wikipedia.org/wiki/Trie) over all the words in the articles. It's a little faster than hashing and consumes much less memory than a dictionary/hash table.
It's not easy to implement in C, but I'm sure you can find a ready-made one somewhere.
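Since the question also mentions PHP, here is an illustrative sketch of a trie in that language (not a production implementation); whichever side you load into it, article words or model numbers, lookups avoid rescanning the whole dictionary:

    <?php
    class Trie {
        private array $root = [];

        public function insert(string $word): void {
            $node = &$this->root;
            foreach (str_split(strtoupper($word)) as $ch) {
                $node = &$node[$ch];        // walk/create the child node
                if (!is_array($node)) {
                    $node = [];
                }
            }
            $node['#'] = true;              // end-of-word marker
        }

        public function contains(string $word): bool {
            $node = $this->root;
            foreach (str_split(strtoupper($word)) as $ch) {
                if (!isset($node[$ch])) {
                    return false;
                }
                $node = $node[$ch];
            }
            return isset($node['#']);
        }
    }

    $trie = new Trie();
    $trie->insert('GT-I9300');
    var_dump($trie->contains('GT-I9300'));   // bool(true)
    var_dump($trie->contains('GT-I93'));     // bool(false)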
Good Luck (:
If you have huge data, then use one of these:
Sphinx
Lucene
Tries/DAWGs (Directed Acyclic Word Graphs) are elegant solutions, but they are also hard to implement and maintain. And MySQL FULLTEXT search is good, but not for large data.