I am building a search feature where the search value should be compared with two columns in a DB table. One column is just a name, and the other is an encrypted value in the format "xxxxxx-xxxx" (digits only). The user should be able to search for just part of the stored string.
For the name comparison I use WHERE name LIKE '%search_value%', but I can't do that for the encrypted value.
Any ideas on a good way to do this comparison?
You can't use a wildcard search on encrypted values, because the encryption of 'a' is ENTIRELY and UTTERLY different from the encryption of 'bac'. There's no practical method of doing substring matching within an encrypted field. However, a simple direct equality test is doable. If you're using a DB-side function like MySQL's AES_ENCRYPT(), then you could do:
... WHERE
(name LIKE '%search%') OR
(cryptedfield = AES_ENCRYPT('search', 'key'))
For substring matching, you'd have to decrypt the field first:
... WHERE
(name LIKE '%search%') OR
(AES_DECRYPT(cryptedfield, 'key') LIKE '%search%')
Basically, if it needs to be encrypted, no part of the system should be able to search it. If it should be searchable, then it probably doesn't need to be encrypted.
Otherwise you are kind of defeating the purpose of encryption.
If you're looking for full-text search capability of encrypted data (without the database server being able to decrypt the messages), you're in academic research territory.
However, if you only need a limited subset of searching capabilities on encrypted data, you can use blind indexes constructed from the plaintext which can be used in SELECT queries.
So instead of:
SELECT *
FROM humans
WHERE name LIKE '%search_value%';
You could do this:
SELECT h.*
FROM humans h
JOIN humans_blind_indexes hb ON hb.human_id = h.id
WHERE hb.first_initial_last_name = $1 OR hb.first_name = $2 OR hb.last_name = $3
And then pass it three truncated hash function outputs, and you'll get your database records with matching ciphertexts.
This isn't just a theoretical remark: you can actually do this today with this open source library.
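A minimal sketch of how one such blind index value might be computed, assuming a keyed hash (HMAC-SHA256) over the normalized plaintext that is then truncated. The key, the normalization rule and the 4-byte truncation are illustrative assumptions, not the API of any particular library:

```python
import hashlib
import hmac

# Hypothetical secret, kept apart from the database (assumption of this sketch).
INDEX_KEY = b"replace-with-a-random-32-byte-key"

def blind_index(plaintext: str, key: bytes = INDEX_KEY, nbytes: int = 4) -> str:
    """Return a truncated keyed hash of the normalized plaintext as hex."""
    normalized = plaintext.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).digest()
    # Truncation deliberately allows some collisions, so the index leaks less.
    return digest[:nbytes].hex()

# Values to store next to the ciphertext, and to pass as $2 and $3 when searching.
first_name_index = blind_index("Alice")
last_name_index = blind_index("Smith")
```

Note that this only supports exact matches on whatever form was indexed (here, the whole lowercased name), not arbitrary substrings.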
I am currently working on a new web-based project with various types of entities. This service will be accessible through a REST API, and I'm thinking about endpoints like:
api.example.com/users/{user_id}
For this, I think that an auto-incremental ID for users will be a bad approach, since anybody can hit:
api.example.com/users/1, and then api.example.com/users/2, api.example.com/users/3, and so on.
Now I'm thinking of using a UUID, but I don't know if it's a good idea, because it's a VARCHAR(36). For this reason, I do something like this when I generate the user ID in the INSERT query (I'm using MySQL):
unhex(replace(uuid(),'-',''))
With this, I'm casting the UUID to binary and storing a BINARY(16) in the database instead.
And when I want to retrieve info from the database, I can use something like this:
SELECT hex(id), name FROM user;
and
SELECT hex(id), name FROM user WHERE hex(id) = '36c5f55620ef11e7b94d6c626d968e15';
So, I'm working with Hexadecimal form, but storing it in binary form.
Is this a good approach?
Almost there...
Indexes are your performance friend. Presumably id is indexed, then
WHERE id = function('...')
uses the index and goes straight to the row, but
WHERE function(id) = '...'
cannot use the index; instead, it scans all the rows, checking each one. For a large table, this is sloooow.
So...
Use this to store it:
INSERT INTO tbl (uuid, ...)
VALUES
(UNHEX(REPLACE(UUID(), '-', '')), ...);
And this to test for it:
SELECT ... WHERE
uuid = UNHEX(REPLACE('786f3c2a-21f6-11e7-9392-80fa5b3669ce', '-', ''));
If you choose to send (via REST) the 32 characters without the dashes, you can figure out that minor variation.
Since that gets tedious, build a pair of Stored Functions. Oh, I have one for you; see http://mysql.rjweb.org/doc.php/uuid .
That also discusses why UUIDs are inefficient for huge tables, and a possible way around it.
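The same idea can also be applied on the application side: convert the text form to 16 raw bytes before it reaches the query, so the column itself is never wrapped in a function. A minimal sketch, with SQLite standing in for MySQL purely so it runs anywhere; the helper names `uuid_to_bin` and `bin_to_uuid` are made up for this example:

```python
import sqlite3
import uuid

def uuid_to_bin(text: str) -> bytes:
    """Accept the dashed or undashed form and return the 16 raw bytes."""
    return uuid.UUID(text).bytes

def bin_to_uuid(raw: bytes) -> str:
    """Turn 16 raw bytes back into the dashed text form."""
    return str(uuid.UUID(bytes=raw))

# SQLite is only a stand-in for MySQL here (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id BLOB PRIMARY KEY, name TEXT)")
key = uuid_to_bin("786f3c2a-21f6-11e7-9392-80fa5b3669ce")
conn.execute("INSERT INTO user (id, name) VALUES (?, ?)", (key, "alice"))

# The column is compared as-is, so the primary-key index can be used.
row = conn.execute(
    "SELECT id, name FROM user WHERE id = ?",
    (uuid_to_bin("786f3c2a21f611e7939280fa5b3669ce"),),
).fetchone()
```

Converting the returned bytes with `bin_to_uuid(row[0])` gives back the dashed form for the REST response.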
Please be patient, I'm not an expert in cryptography. My question is probably very basic, but I googled a lot and I'm still confused.
In a PHP project, I need to encrypt/decrypt the data saved in the database. In a previous project I used the aes128 encryption and everything went well. But now I have a different need. I need to perform queries in the database using the operator LIKE. And obviously the encryption of a portion of a string is not included in the encryption of the whole string.
Googling around, I realized that maybe I have to use a symmetric-key algorithm (like Caesar's cipher). But I did a test with the php-encryption library (https://github.com/defuse/php-encryption) and I got the following result:
MAMMA = ÿŸNq!!83=S™÷á;Bª¯‚óØ š‹ æ%§0 %? _† Ÿ&0c—âÐÜÉ/:LSçï; յ嬣§.öÒ9
MAMMAMIA = (Ò Î{yG : [&¶›J'Õ6÷ííG£V{ÉsÙ=qÝ×.:ÍÔ j…Qž¹×j¶óóþ¡ÔnÛŠ *ån\hhN
The encryption of the first word is not contained in the encryption of the second. Evidently the symmetric algorithm is not right for my need.
What can I use to reach my goal using PHP? Thanks!
The easiest way is to use mysql encrypt/decrypt functionality and do both "on the fly".
To insert and encrypt data:
insert into mytable (secret) values (AES_ENCRYPT('SomeTextToHide','myPassword'));
To search for encrypted values using LIKE:
select * from mytable where AES_DECRYPT(secret,'myPassword') like '%text%';
The goal of being able to use a LIKE clause and its wildcards on securely encrypted data is a set of two mutually exclusive desires.
You can securely encrypt your data, OR you can use a LIKE clause with wildcards. You cannot do both at once, because the LIKE clause itself would bypass the encryption!
SELECT * WHERE data LIKE 'a%'
SELECT * WHERE data LIKE 'b%'
...
SELECT * WHERE data LIKE '_a%'
SELECT * WHERE data LIKE '_b%'
...
SELECT * WHERE data LIKE '__a%'
SELECT * WHERE data LIKE '__b%'
...
If you securely encrypt the data in the database, then you need to pull ALL of it back local, decrypt it all, and then do your wildcard searches on the decrypted data.
I'm using mysql database auto-increment as an order ID. When I display the order ID to the user, I want to somehow mask/obfuscate it.
Why?
So that, at first glance, it is obvious to admin users what the number refers to (orders start with 10, customers start with 20, etc.)
To hide, at first glance, that this is only my 4th order.
Based on this answer, I want the masked/obfuscated order ID to:
Be only numbers
Consistent length (if possible)
Not cause collisions
Be reversible so I can decode it and get the original ID
How would I achieve this in PHP? It doesn't have to be very complex, just enough that it's not obvious at first glance.
I think you can use XOR operator to hide "at first glance" for example (MySQL example):
(id*121) ^ 2342323
Where 2342323 and 121 are "magic" numbers - templates for the order number.
To reverse:
(OrderNum ^ 2342323)/121
An additional advantage in this case: you can validate an order number (to avoid spam or something like that in an online form) by checking that (OrderNum ^ 2342323) is divisible by 121 with no remainder.
SQLFiddle demo
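The same arithmetic as a small sketch outside the database, assuming non-negative integer IDs. The two constants are the "magic" numbers from the answer above; the function names are made up for this example:

```python
MULTIPLIER = 121   # "magic" numbers taken from the answer above
MASK = 2342323

def encode_order_id(order_id: int) -> int:
    """Multiply, then XOR with the mask to hide the raw sequence number."""
    return (order_id * MULTIPLIER) ^ MASK

def is_valid_order_number(order_number: int) -> bool:
    """A genuine order number must be a multiple of MULTIPLIER once unmasked."""
    return (order_number ^ MASK) % MULTIPLIER == 0

def decode_order_number(order_number: int) -> int:
    """Undo the XOR, then divide; reject numbers that were never issued."""
    if not is_valid_order_number(order_number):
        raise ValueError("not a valid order number")
    return (order_number ^ MASK) // MULTIPLIER
```

This hides the sequence only "at first glance" and does not give a fixed length, so it covers the reversibility and no-collision wishes but not the consistent-length one.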
A little bit late, but Optimus (https://github.com/jenssegers/optimus) does exactly what is asked for here.
$encoded = $optimus->encode(20); // 1535832388
$original = $optimus->decode(1535832388); // 20
Only the initial setup is a bit weird (you have to generate prime numbers).
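As far as I understand it, the library is based on Knuth's multiplicative hashing: multiply by a large odd number modulo a power of two, XOR with a random mask, and reverse the process with the modular inverse. A rough sketch of that idea, not the library's actual code; the constants below are sample values chosen for illustration:

```python
MAX_ID = 2**31 - 1                     # work with 31-bit non-negative integers
PRIME = 1580030173                     # sample odd multiplier (a large prime is expected)
RANDOM = 1163945558                    # sample random XOR mask, below MAX_ID
INVERSE = pow(PRIME, -1, MAX_ID + 1)   # modular inverse of PRIME, needs Python 3.8+

def encode(value: int) -> int:
    """Scramble an ID in the range 0..MAX_ID."""
    return ((value * PRIME) & MAX_ID) ^ RANDOM

def decode(value: int) -> int:
    """Recover the original ID from an encoded value."""
    return ((value ^ RANDOM) * INVERSE) & MAX_ID
```

Because the multiplier is odd, it always has an inverse modulo 2^31, which is what makes the mapping collision-free and reversible.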
Probably the simplest way is to just generate a long random string and use it instead of the auto-increment ID. Or maybe use it alongside the auto-increment ID. If the string is long enough and random enough, it will be unique for every record (think of GUIDs). Then you can display these to the user and not worry about anything.
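A minimal sketch of generating such a random identifier, assuming 16 random bytes are "long enough"; the function name is made up for this example:

```python
import secrets

def new_public_id(nbytes: int = 16) -> str:
    """Return a random URL-safe token; 16 bytes gives 128 bits of randomness."""
    return secrets.token_urlsafe(nbytes)
```

The token would be stored in its own column next to the auto-increment ID, ideally with a unique index, and looked up from there, since it cannot be decoded back into the ID.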
Would this help?
echo hexdec(uniqid());
Of course, you should store this value in the DB, in the same row as the order ID.
Just converting an ID into something like hex might not give you the result you want. Moreover, it's still easily guessable.
I would add an extra ID column (e.g. order_id) with a unique index. Then, on creation, use one of the following MySQL functions:
SHA1(CONCAT('ORDER', id))
MD5(CONCAT('ORDER', id))
SHA1(CONCAT('ORDER', id, customer_id))
MD5(CONCAT('ORDER', id, customer_id))
UUID()
-- try this in your mysql console
SELECT UUID(), SHA(CONCAT('ORDER',10)), SHA1(1);
You could (as in the example) add a simple text prefix like 'ORDER', or even combine them. However, I think UUID() would be easiest.
The implementation depends a bit on what you prefer: you could use a stored procedure or incorporate it in your model.
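If the token is generated in the model rather than in MySQL, a sketch along these lines would produce the same hex string as SHA1(CONCAT('ORDER', id, customer_id)); the function name is hypothetical:

```python
import hashlib

def order_token(order_id: int, customer_id: int) -> str:
    """Mirror SHA1(CONCAT('ORDER', id, customer_id)) on the application side."""
    payload = f"ORDER{order_id}{customer_id}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()
```

One caveat of plain concatenation: order 10 with customer 7 and order 1 with customer 07 would produce the same input, so adding a separator between the parts may be worth considering. The hash is also not reversible, so it has to be stored and looked up rather than decoded.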
I want to store a large number (thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4
(each line has additional data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
Folder1
Folder1/Folder2/Folder3
Folder3
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters on the left of the wildcard, and in my case it can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution, that could be used from PHP. And I am open: it can be a separate server, a PHP library using files with regex, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
You might want to use a multicore approach to run that search in a fraction of the time. For searching and matching I would recommend FPGAs, but that's probably the hardest way to do it. Consider this article using CUDA, where such searches run about 16x faster than usual. On multicore CPU systems you can use POSIX threads or a cluster of computers (MPI, for example), or you can call a Gearman service to run the searches using advanced algorithms.
Were it me, I'd store the key field twice: once forward and once reversed (see MySQL's REVERSE function). You can then search the index with LEFT(main_field) and LEFT(reversed_field). It won't help when you have a wildcard both in the middle of the string AND at the beginning (e.g. "*Folder1*Folder2"), but it will when you have a wildcard at the beginning or the end.
E.g. if you want to search for */Folder1, then search WHERE LEFT(reversed_field, 8) = '1redloF/';
For Folder1/*/FolderX, search WHERE LEFT(reversed_field, 8) = 'XredloF/' AND LEFT(main_field, 8) = 'Folder1/'
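To illustrate why the reversed copy helps, here is a small sketch in which a sorted list plays the role of a B-tree index: a prefix lookup is a cheap range scan, and a suffix lookup becomes a prefix lookup on the reversed strings. The sample keys and function name are made up for this example:

```python
from bisect import bisect_left

def prefix_matches(sorted_keys, prefix):
    """Range-scan a sorted list, the way an index can serve LIKE 'prefix%'."""
    start = bisect_left(sorted_keys, prefix)
    found = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        found.append(key)
    return found

keys = ["Folder1", "Folder1/Folder2", "Folder2/Folder1", "Folder3/Folder1"]
forward = sorted(keys)
backward = sorted(key[::-1] for key in keys)   # plays the role of the reversed column

# "Folder1/*": wildcard at the end, so use the forward list.
ends_open = prefix_matches(forward, "Folder1/")
# "*/Folder1": wildcard at the start, so search the reversed list and flip back.
starts_open = [k[::-1] for k in prefix_matches(backward, "/Folder1"[::-1])]
```

Here `ends_open` contains only "Folder1/Folder2", while `starts_open` contains "Folder2/Folder1" and "Folder3/Folder1".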
If your strings represent some kind of hierarchical structure (as your sample content suggests), though not "real" files, and since you say you are open to alternative solutions, why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob - thanks to the hierarchical file structure a glob search should be much faster than searching up all your database entries.
If needed you can match the results to your MySQL data - thanks to your MySQL index on the key this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only be competitive on a huge data set (but not too huge, as @Kyle mentioned) with a hierarchy that is deep rather than wide.
EDIT
Sorry, this would only work if the wildcards are in your search terms, not in the stored strings themselves.
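For the case where the wildcards are in the search terms, the file-based index could be sketched roughly as follows. One adaptation that is my own assumption: instead of an empty file named after the key, each key gets a directory containing an empty marker file, so that a key such as Folder1 can coexist with Folder1/Folder2 on disk.

```python
import tempfile
from pathlib import Path

MARKER = "_entry"   # hypothetical marker file name (assumption of this sketch)

def build_index(root: Path, keys) -> None:
    """Create one directory per key with an empty marker file inside."""
    for key in keys:
        entry_dir = root / key
        entry_dir.mkdir(parents=True, exist_ok=True)
        (entry_dir / MARKER).touch()

def find(root: Path, pattern: str):
    """Return the stored keys matching a glob pattern such as '*/Folder3'."""
    hits = root.glob(f"{pattern}/{MARKER}")
    return sorted(p.parent.relative_to(root).as_posix() for p in hits)

with tempfile.TemporaryDirectory() as tmp:
    index_root = Path(tmp) / "myindex"
    build_index(index_root, ["Folder1", "Folder1/Folder2", "Other/Folder3"])
    under_folder1 = find(index_root, "Folder1/*")
    any_folder3 = find(index_root, "*/Folder3")
```

The keys returned by `find` can then be matched against the MySQL table through the indexed key column.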
As the wildcards (*) are in your data and not in your queries I think you should start with breaking up your data into pieces. You should create an index-table having columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*" store a value of "Folder1/" to "wildcardEnd" and again assign the id of the value in the main table to the "dataGroup" field in above Table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume mysql is clever enough not to do the truncate for every row but to start with the first character and match it against every row (using B-Tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart", but reversed. To cite Missy Elliott: "Is it worth it, let me work it / I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store a value of "4redloF/" in "wildcardStart". Then a WHERE like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
Of course, you could do the REVERSE in your application logic already.
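The three single-record cases described so far (exact string, trailing wildcard, leading wildcard) can be sketched in application code like this. The rows mirror the index table columns; `str.startswith` plays the role of the `= LEFT(search, LENGTH(column))` comparison, and the sample rows are made up:

```python
# Each row mirrors the index table: (dataGroup, exactString, wildcardEnd, wildcardStart)
rows = [
    (1, "Folder1/Folder2", None, None),   # exact value
    (2, None, "Folder1/", None),          # "Folder1/*"
    (3, None, None, "4redloF/"),          # "*/Folder4", stored reversed
]

def matching_groups(search: str):
    """Return the dataGroup ids whose stored pattern matches the search string."""
    reversed_search = search[::-1]
    hits = []
    for data_group, exact, wildcard_end, wildcard_start in rows:
        if exact is not None and exact == search:
            hits.append(data_group)
        elif wildcard_end is not None and search.startswith(wildcard_end):
            hits.append(data_group)
        elif wildcard_start is not None and reversed_search.startswith(wildcard_start):
            hits.append(data_group)
    return hits
```

Patterns that need several index records are not covered by this sketch; they additionally require checking that every record of a dataGroup matched.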
Now about the tricky part. Something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now if you match something you have to take care that every index-record of a dataGroup gets returned for a complete match and that no overlapping occurs. This could also get solved in SQL but is beyond this question.
A database isn't the right tool for these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searches in memory. Load all the strings from the database (a few thousand strings is really no biggie), cache them, and run your search/match algorithm on them.
You will probably have to code the algorithm yourself, because the standard tools would be overkill for what you are trying to achieve, and there is no guarantee that they can do exactly what you need.
I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work until you get the regex right, but it will be the fastest way to go.
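A minimal sketch of that regex approach, assuming `*` is the only wildcard and that it may span any characters, including `/`. The stored keys are the sample content from the question; the function names are made up:

```python
import re

def wildcard_to_regex(pattern: str):
    """Compile a stored key where '*' means 'any run of characters'."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts), re.DOTALL)

STORED = ["Folder1", "Folder1/Folder2", "Folder1/*", "Folder1/Folder2/Folder3",
          "Folder2/Folder*", "*/Folder4", "*/Fo*4"]
# Compile once and cache, as suggested above.
COMPILED = [(key, wildcard_to_regex(key)) for key in STORED]

def matching_keys(candidate: str):
    """Return every stored key whose pattern matches the whole candidate."""
    return [key for key, regex in COMPILED if regex.fullmatch(candidate)]
```

For the question's inputs, "Folder1" matches only the stored key "Folder1", "Folder1/Folder2/Folder3" matches "Folder1/*" and its own exact key, and "Folder3" matches nothing.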
I suggest reading the keys and their associated payload into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped", then you can avoid the (slight additional) overhead of building a balanced tree. You can also avoid any tree maintenance code as, if I understand your problem correctly, the data will be changing frequently and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straightforward and much more efficient than just running a regex against a bunch of strings. You may even find while working it through that the wildcards in the tree lead to some shortcuts to prune the search space. A quick search shows lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, count(*) FROM your_sample_table group by folder_col do you get duplicate folder_col values (ie count(*) greater than 1)?
If not, that means you can produce an SQL that would generate a valid sphinx index (see http://sphinxsearch.com/).
I wouldn't recommend doing text search on a large collection of data in MySQL. You need a database to store the data, but that would be it. For searching, use a search engine like:
Solr (http://lucene.apache.org/solr/)
Elastic Search (http://www.elasticsearch.org/)
Sphinx (http://sphinxsearch.com/)
Those services will allow you to do all sorts of funky text searches (including wildcards) in the blink of an eye ;-)
I'm having a dilemma. I have a field hashedX that is a hashed/salted value and the salt is saved in the same row in the mysql database as is common practice.
hashedX saltX
------ ------
hashed1 ssai3
hashed2 woddp
hashed3 92ofu
When I receive inputX, I need to know if it matches any of the values in hashedX such as hashed1 hashed2 or hashed3. So typically I would take my input, hash/salt it, and compare it to the values of hashedX. Pseudo code:
$hashed_input = hash ($input with $salt );
select * from tablename where $hashed_input is hashedX
The problem is I don't know which saltX I need to even get to the $hashed_input before I can do any select.
I could go through the database rows one by one, try each row's salt on my input, then check whether the input hashed with that salt matches hashedX of that same row. If I have 100,000 records, my guess is that this would be painfully slow. I have no idea how slow, since I'm not that great at databases.
Is there a better way to do this, than selecting all rows, looping through them, using that row's salt to hash input, then comparing again to the hashed value in the db?
If it is possible (depends on your hash formula) define a MySQL User Defined Function database side for the hash formula (see CREATE FUNCTION). This way you will be able to get your results in one simple request:
SELECT hashedX, saltX FROM tablename WHERE UDFhash(input, saltX) = hashedX ;
You don't specify which hash algorithm you're using in PHP. MySQL supports MD5 and SHA1 hash algorithms as builtin functions:
SELECT ...
FROM tablename
WHERE SHA1(CONCAT(?, saltX)) = hashedX;
SHA2 algorithms are supported in MySQL 5.5, but this is only available in pre-beta release at this time. See http://dev.mysql.com/doc/refman/5.5/en/news-5-5-x.html for releases.
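For comparison, this is roughly what the query above does for every row, written out in application code. It assumes the stored hash is the lowercase hex SHA-1 of the input followed by the salt, as in SHA1(CONCAT(?, saltX)); the sample rows are made up:

```python
import hashlib

rows = [  # (hashedX, saltX) pairs as they might come back from the table
    (hashlib.sha1(b"secret-one" + b"ssai3").hexdigest(), "ssai3"),
    (hashlib.sha1(b"secret-two" + b"woddp").hexdigest(), "woddp"),
]

def find_match(input_value: str, rows):
    """Hash the input with each row's salt and compare, one hash per row."""
    for hashed, salt in rows:
        candidate = hashlib.sha1((input_value + salt).encode("utf-8")).hexdigest()
        if candidate == hashed:
            return hashed, salt
    return None
```

Either way the work is one hash per row; doing it inside the database mainly saves transferring all rows to the application.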
Is there a better way to do this, than selecting all rows, looping through them, using that row's salt to hash input, then comparing again to the hashed value in the db?
Yes. A much better way.
Typically a salt is only used to prevent exactly what you are trying to do. So either you don't want to use a salt, or you don't want to do this kind of lookup.
If you are checking an entered password against a given user account or object, you should reference the object on the same line that you have the salt and hashed salt+password. Require the account name / object to be referenced when the password is given, then look up the row corresponding to that account name and object and compare the password against that salt + hash.
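A minimal sketch of that flow: the account name selects exactly one row, and only that row's salt is used. The in-memory dictionary stands in for the table, and SHA-1 is used only to stay consistent with the examples in this thread; for real passwords a dedicated password-hashing function is preferable.

```python
import hashlib
import hmac

accounts = {  # account name -> (salt, stored hash); hypothetical sample data
    "alice": ("ssai3", hashlib.sha1(b"ssai3" + b"correct horse").hexdigest()),
}

def check_password(account: str, password: str) -> bool:
    """Look up one row by account name, then compare against its salted hash."""
    record = accounts.get(account)
    if record is None:
        return False
    salt, stored = record
    candidate = hashlib.sha1((salt + password).encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, stored)
```

With an index on the account name this is a single-row lookup plus one hash, regardless of how many records the table holds.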
If you are keeping a record of items that you've seen before, then you should just go with a hash, (or a bloom filter) and forget the salt, because it doesn't buy you anything.
If you're doing something new / creative, please describe what it is.