MySQL Unique hash insertion

So, imagine a MySQL table with a few simple columns, an auto increment, and a hash (varchar, UNIQUE).
Is it possible to give MySQL a query that will add a row and generate a unique hash without multiple queries?
Currently, the only way I can think of to achieve this is with a while loop, which I worry would become more and more processor intensive the more entries there are in the db.
Here's some pseudo-php, obviously untested, but gets the general idea across:
while(!query("INSERT INTO `table` (hash) VALUES ('".generate_hash()."');")){
//found conflict (the UNIQUE index rejected the value), try again.
}
In the above example, the hash column would be UNIQUE, and so the query would fail on a conflict. The problem is, say there are 500,000 entries in the db and I'm working off of a base36 hash generator with 4 characters. The likelihood of a conflict would be almost 1 in 3, and I definitely can't be running 160,000 queries. In fact, any more than 5 I would consider unacceptable.
So, can I do this with pure SQL? I would need to generate a base62, 6 char string (like: "j8Du7X", chars a-z, A-Z, and 0-9), and either update the last_insert_id with it, or even better, generate it during the insert.
I can handle basic CRUD with MySQL, but even JOINs are a little outside of my MySQL comfort zone, so excuse my ignorance if this is cake.
Any ideas? I'd prefer to use either pure MySQL or PHP & MySQL, but hell, if another language can get this done cleanly, I'd build a script and AJAX it too.
Thanks!

This is our approach for a similar project, where we wanted to generate unique coupon codes.
First, we used an AUTO_INCREMENT primary key. This ensures uniqueness and query speed.
Then, we created a base24 numbering system, using A, B, C, etc., without O and I, because someone might mistake them for 0 or 1.
Then we converted the auto-increment integer to our base24 number. For example, 0=A, 1=B, 28=BE, 1458965=EKNYF. We used base24 because a long base10 number needs fewer characters in base24.
Then we created a separate column in our table, coupon_code. This was not indexed.
We took the base24 number and inserted 3 random decoy characters into it, either digits or the letters I and O (which were not used in our base24). For example, EKNYF could turn into 1EKON6F or EK2NY3F9. This was our coupon code, and we stored it in our coupon_code column. It's unique and random.
So, when the user uses code EK2NY3F9, all we have to do is remove the unused characters (2, 3 and 9) and we get EKNYF, which we convert to 1458965. We just select the row with primary key 1458965 and then compare its coupon_code column with EK2NY3F9.
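The scheme above can be sketched in PHP roughly as follows. The function names are made up for illustration, and the decoy placement (any position, chosen at random) is an assumption; the answer does not say how positions were picked.

```php
<?php
// Alphabet from the answer: A-Z without I and O (24 letters).
const BASE24_ALPHABET = 'ABCDEFGHJKLMNPQRSTUVWXYZ';
// Decoy characters: digits plus the two letters left out of the alphabet.
const DECOY_CHARS = '0123456789IO';

// Convert a non-negative auto-increment id to base24.
function encodeBase24(int $id): string
{
    $out = '';
    do {
        $out = substr(BASE24_ALPHABET, $id % 24, 1) . $out;
        $id = intdiv($id, 24);
    } while ($id > 0);
    return $out;
}

// Convert a base24 string back to the id (input is not validated here).
function decodeBase24(string $code): int
{
    $id = 0;
    foreach (str_split($code) as $ch) {
        $id = $id * 24 + strpos(BASE24_ALPHABET, $ch);
    }
    return $id;
}

// Insert $count random decoy characters at random positions.
function addDecoys(string $code, int $count = 3): string
{
    for ($i = 0; $i < $count; $i++) {
        $decoy = substr(DECOY_CHARS, random_int(0, strlen(DECOY_CHARS) - 1), 1);
        $pos = random_int(0, strlen($code));
        $code = substr($code, 0, $pos) . $decoy . substr($code, $pos);
    }
    return $code;
}

// Remove every decoy character, leaving the base24 number.
function stripDecoys(string $coupon): string
{
    return preg_replace('/[0-9IO]/', '', $coupon);
}
```

Redeeming a code would then be decodeBase24(stripDecoys($input)) to get the primary key, followed by a comparison of the stored coupon_code with the full input.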
I hope this helps.

If your heart is set on using base-36 4 character hashes (hashspace is only 1679616), you could probably pre-generate a table of hashes that aren't already in the other table. Then finding a unique hash would be as simple as moving it from the "unused table" to the "used table" which is O(1).
If your table is conceivably 1/3 full you might want to consider expanding your hashspace since it will probably fill up in your lifetime. Once the space is full you will no longer be able to find unique hashes no matter what algorithm you use.
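To put rough numbers on this, here is a small calculation using the figures from the question (4 base36 characters, 500,000 rows). It assumes every retry draws an independent, uniformly random code, which is a simplification.

```php
<?php
// Figures from the question: 4 base36 characters, 500,000 rows stored.
$space = 36 ** 4;                 // 1,679,616 possible values
$used  = 500000;
$p     = $used / $space;          // chance that one random pick collides

$expectedAttempts = 1 / (1 - $p); // mean of a geometric distribution
$moreThanFive     = $p ** 5;      // chance the first five picks all collide

printf("collision chance per attempt: %.3f\n", $p);
printf("expected attempts per insert: %.2f\n", $expectedAttempts);
printf("chance of needing more than 5 attempts: %.4f\n", $moreThanFive);

// The 6-character base62 string asked about gives a far larger space.
$space62 = 62 ** 6;               // 56,800,235,584 possible values
printf("base62 x 6 collision chance at 500,000 rows: %.6f\n", $used / $space62);
```

Under these assumptions, a 1-in-3 collision rate means about 1.4 inserts per row on average, not hundreds of thousands of queries, and the larger base62 space makes a retry rare.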

What is this hash a hash of? It seems like you just want a randomly generated unique VARCHAR column? What's wrong with the auto increment?
Anyway, you should just use a bigger hash: find an MD5 function (if you're actually hashing something) or a UUID generator that produces more than 4 characters. And yes, you could use a while loop, but generate a value big enough that conflicts are incredibly unlikely.

As others have suggested, what's wrong with an autoinc field? If you want an alphanumeric value, then you could simply convert the int to an alphanumeric string in base 36. This could be implemented in almost any language.

Going with zneak's comment, why don't you use an autoincrement column? Save the hash in another (non-unique) field, and concatenate the id to it (dynamically). So you give a user [hash][id]. You can parse it out in pure SQL using the substring functions.
Since you have to have the hash, the user can't look at other records by incrementing the id.
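A minimal sketch of the [hash][id] idea, done in PHP instead of SQL substring functions. It assumes the hash part has a fixed width (8 characters here, an arbitrary choice) so the two parts can be split apart; the function names are hypothetical.

```php
<?php
const HASH_LENGTH = 8; // assumed fixed width of the random part

// Build the public token: fixed-width random part followed by the row id.
function buildToken(string $hash, int $id): string
{
    return $hash . $id;
}

// Split a token back into [hash, id]; returns null if it is malformed.
function parseToken(string $token): ?array
{
    $hash = substr($token, 0, HASH_LENGTH);
    $id   = substr($token, HASH_LENGTH);
    if (strlen($hash) !== HASH_LENGTH || !ctype_digit($id)) {
        return null;
    }
    return [$hash, (int) $id];
}

// After fetching the row by id, compare the stored hash with the token's.
function tokenMatchesRow(string $token, int $rowId, string $storedHash): bool
{
    $parts = parseToken($token);
    return $parts !== null
        && $parts[1] === $rowId
        && hash_equals($storedHash, $parts[0]);
}
```

The lookup stays a primary-key select; the hash only has to match the row it is attached to, so it does not need to be unique across the table.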

So, just in case someone runs across a similar issue: I'm using a UNIQUE field, and I'll be using a PHP hash function to insert the hashes. If the insert comes back with an error, I'll try again.
Hopefully, because of the low likelihood of conflict, it won't get slow.
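For anyone landing here, a rough sketch of that retry loop with a hard cap on attempts. The helper names are made up, and the $tryInsert callback stands in for whatever runs the INSERT; here an in-memory array plays the role of the UNIQUE column so the example runs without a database.

```php
<?php
// Hypothetical helper: random base62 string built with PHP's CSPRNG.
function randomBase62(int $length = 6): string
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $out .= $alphabet[random_int(0, 61)];
    }
    return $out;
}

// $tryInsert is an assumed callback that runs the INSERT and returns
// true on success or false when the UNIQUE index rejects the value.
function insertUniqueHash(callable $tryInsert, int $maxAttempts = 5): string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $hash = randomBase62();
        if ($tryInsert($hash)) {
            return $hash;
        }
    }
    throw new RuntimeException("No unique hash after $maxAttempts attempts");
}

// Stand-in for the database: an array acting as the UNIQUE column.
$taken = [];
$hash = insertUniqueHash(function (string $h) use (&$taken): bool {
    if (isset($taken[$h])) {
        return false; // duplicate: caller retries with a new value
    }
    $taken[$h] = true;
    return true;
});
```

The cap of 5 matches the limit mentioned in the question; hitting it would suggest the hash space is getting too full.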

You could also check the MySQL functions UUID() and UUID_SHORT(). UUID() returns a 36-character string and UUID_SHORT() a 64-bit unsigned integer, and both are designed to produce values that do not repeat, so you won't have to double-check whether your PHP-generated hash string already exists.
I think in several cases these functions can also fit your project's requirements. :-)

If you already have the table filled with some content, you can alter it with the following:
ALTER TABLE `page` ADD COLUMN `hash` char(64) AS (SHA2(`content`, 256)) AFTER `content`;
This adds a generated hash column right after the content column and fills in the hash for existing and new records alike, without any change to your INSERT statement.
If you add a UNIQUE index to the column (after having removed duplicates), an insert will only succeed if the content is not already in the table. This prevents duplicates.

Related

Column varchar and issue about index or fulltext?

Well, I have a varchar column for the password in my table, and in some scripts I make queries like:
length(column_varchar) < 10
My question is: if I put an index on this column, will it help? Or should I use fulltext in this case? Or do I not need an index?
Another question: do I need an index on every column that is used in a 'where' clause?
Thanks in advance.

Indexes are used to index content (the field value), not the length of the field, therefore no index can help in the above query. (N.B. you could have a separate field that holds the content length and index that separate field.) Also, the password should be stored in a hashed format, so all password lengths should be the same, or at least should not be a criterion for selection.
No, you should not index all columns that will be used in a where criteria. Selecting the optimal index structure is a complicated and very broad topic. Always consider the following points when trying to determine what fields (or combination of fields) to index:
Indexes speed up selects, but slow down data modification, since you have to update the index as well, not just the column's value.
MySQL typically uses only 1 index per table in a query (index merge optimizations aside).
MySQL uses the selectivity of the indexes to determine which one to use. A field that can have 2 values only (yes / no, true / false) is not selective enough, so do not trouble yourself with indexing it.
Always use the explain command to check which indexes your queries use.

You've got two questions here; in general you should split questions up.
Anyway, the first: "Will it help to index a column where you are doing a test for length?"
No, it won't. The only way you could improve the performance here would be to have an additional column that holds the length of the value in column_varchar and index that.
You wrote in comments that you are holding hashes, so the lengths will all be the same, so I have to guess that some passwords are null and so you don't hash them, or that you are migrating from not hashed to hashed.
The second question: should you index all fields in a where clause. This is not an automatic yes, which is why there are books written about query optimisation.
It depends on how much benefit you will get from the index, and that depends on the nature of the data.
The main trade off is between insert speed and query speed. Indexes slow inserts and speed up queries.
The next thing to consider is selectivity. If the column you are indexing has only three potential values, for example, the index is unlikely to narrow a search enough to be worth the cost of maintaining it.
In this specific case, you have evenly distributed data (because it is hashed), you have great selectivity (MD5 has few collisions) and you are expecting to query more often with a single term, so you should definitely be indexing this column.

creating unique user hash for autoincrement field

So in this app, we have a user id which is simple auto-increment primary key. Since we do not want to expose this at the client side, we are going to use a simple hash (encryption is not important, only obfuscation).
So when a user is added to the table we do uniqid() . user_id. This will guarantee that the user hash is random enough and always unique.
The question I have is: while inserting the record, we do not know the user id at that point (we cannot assume max(user_id) + 1, since there might be other inserts getting committed). So we are doing an insert, then getting the last_insert_id, then using that for the user_id, which adds an additional db query. So is there a better way to do this?

A few things before the actual answer: with latest version of MySQL which uses InnoDB as default storage engine - you always want an integer pk (or the famous auto_increment). Reasons are mostly performance. For more information, you can research on how InnoDB clusters records using PK and why it's so important. With that out of the way, let's consider our options for creating a unique surrogate key.
Option 1
You calculate it yourself, using PHP and information you obtained back from MySQL (the last_insert_id()), then you update the database back.
Pros: easy to understand by even novice programmers, produces short surrogate key.
Cons: extremely bad for concurrent access, you'll probably get clashes, and you never want to use PHP to calculate unique indices required by the database.
You don't want that option.
Option 2
Supply the uniqid() to your query, create an AFTER INSERT trigger that will concatenate uniqid() with the auto_increment.
Pros: easy to understand, produces short surrogate key.
Cons: requires you to create the trigger, implements magic that's not visible from the code directly which will definitely confuse a developer that inherits the project at some point - and from experience I would bet that bad things will happen
Option 3
Use universally unique identifiers or UUIDs (also known as GUIDs). Simply supply your query with surrogate_key = UUID() and MySQL does the rest.
Pros: always unique, no magic required, easy to understand.
Cons: none, unless the fact that it occupies 36 chars bothers you.
You want option 3.

Since we do not want to expose this at the client side
Simply don't.
In a well-designed database, users never need to see a primary-key value. In fact, a user need never know the primary key even exists.
From your question it seems you actually replace your normal auto-increment ID column with a surrogate id (If not skip to the last paragraph).
Try creating a column with another unique surrogate ID and use that on your frontend. You can keep your normal primary ids for relationships etc.
Remember one of the basic rules for primary keys:
The primary key must be compact and contain the fewest possible attributes.
Also, integer serials have the advantage of being simple to use and implement. Depending on the specific implementation of the serialization method, they are also quick to derive, as most databases just store the serial number in a fixed location. This means that instead of computing max(id)+1, the db already has the next value stored, which makes auto-increment fast.
So we are doing an insert then getting the last_insert_id then using that for the user_id, which adds an additional db query.
last_insert_id isn't actually a query; it is a value saved on your db connection when you performed the insert query.
If you already have a second column for your surrogate ID ignore all the above:
So we are doing an insert then getting the last_insert_id then using that for the user_id, which adds an additional db query. So is there a better way to do this?
No, you can only retrieve that uniqid by doing a query.
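// $link is assumed to be an open mysqli connection (the old mysql_* API was removed in PHP 7).
$res = mysqli_query($link, 'SELECT LAST_INSERT_ID() AS surrogate_id');
$row = mysqli_fetch_assoc($res);
$lastsurrogateid = $row['surrogate_id'];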
Anything else is making it more complicated than necessary.

MySQL insert unique technique

I have a PHP application that inserts data into MySQL, including a randomly generated unique value. The string will have about 1 billion possibilities, with probably no more than 1 or 2 million entries at any one time. Essentially, most combinations will not exist in the database.
I'm trying to find the least expensive approach to ensuring a unique value on insert. Specifically, my two options are:
1. Have a function that generates this unique ID. On each generation, test if the value exists in the database; if yes, re-generate; if no, return the value.
2. Generate a random string and attempt the insert. If the insert fails, test whether the error is 1062 (MySQL duplicate entry X for key Y), re-generate the key and insert with the new value.
Is it a bad idea to rely upon the MySQL error for re-trying the insert? As I see it, the value will probably be unique, and it seems the initial existence check (technique 1) would be unnecessary.
EDIT #1
I should have also mentioned, the value must be a 6 character length string, composed of either uppercase letters and/or numbers. They can't be incremental either - they should be random.
EDIT #2
As a side note, I'm trying to create a redemption code for a gift certificate that is difficult to guess. Using numbers and letters creates 36 possibilities for each character, instead of 10 for just numbers or 26 for just letters.
Here's a stripped-down version of the solution I created. The first value entered in the table is the primary key, which is auto incremented. affected_rows() will equal 1 if the insert is successful:
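// $db is assumed to be an open mysqli connection; build_code() is the code generator.
$code = build_code();
// "pk = pk" turns a duplicate into a no-op update, so affected_rows stays 0 and the loop retries.
while ($db->query("INSERT INTO certificates VALUES (NULL, '$code') ON DUPLICATE KEY UPDATE pk = pk") && $db->affected_rows == 0) {
    $code = build_code();
}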

Is it a bad idea to rely upon the MySQL error for re-trying the insert?
Nope. Go ahead and use it if you want. In fact, many people think that if you check and the value doesn't exist, then it's safe to insert. But unless you lock the table, it's always possible that another process might slip in and grab the id.
So go ahead and generate a random id if it suits your purpose. Just make sure you test your code so it properly handles dups. It might also be useful to log dups, just to ensure your assumptions about how unlikely dups are to occur are correct.
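A sketch of how the error check might look with PDO, assuming PDO::ERRMODE_EXCEPTION is enabled and a table certificates(pk AUTO_INCREMENT, code UNIQUE). The function names and the $buildCode generator are hypothetical.

```php
<?php
// MySQL reports a duplicate key as driver error 1062 (SQLSTATE 23000).
// PDO exposes it in errorInfo: [SQLSTATE, driver code, driver message].
function isDuplicateKeyError(PDOException $e): bool
{
    return isset($e->errorInfo[1]) && (int) $e->errorInfo[1] === 1062;
}

// Retry only on duplicates; rethrow anything else as a real failure.
function insertCertificate(PDO $pdo, callable $buildCode, int $maxAttempts = 5): string
{
    $stmt = $pdo->prepare('INSERT INTO certificates (code) VALUES (?)');
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $code = $buildCode();
        try {
            $stmt->execute([$code]);
            return $code;
        } catch (PDOException $e) {
            if (!isDuplicateKeyError($e)) {
                throw $e;
            }
            // Log the duplicate so the "dups are rare" assumption can be checked.
            error_log("Duplicate code $code on attempt $attempt");
        }
    }
    throw new RuntimeException("No unique code after $maxAttempts attempts");
}
```

Because the UNIQUE index does the checking inside the INSERT itself, there is no gap between "check" and "insert" for another process to slip into.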

Define your table with unique constraint:
http://dev.mysql.com/doc/refman/5.0/en/constraint-primary-key.html

Why not just use: "YourColName BIGINT AUTO_INCREMENT PRIMARY KEY" to ensure uniqueness?

How to encrypt primary keys in a consistent way

I've got a database table that has a simple incrementing integer as the primary key (1,2,3 etc). These numbers represent the alumni of a college, with personal information in the records. I've been asked to give each record a unique ID when displayed to the user (after they've queried the database) but the ID must not be the primary key, and it must be consistent.
If someone retrieved a record with the arbitrary ID 88gh344r, for example, then did another search and retrieved record 88gh344r again they need to be able to say "That's the same person". Since people need to be able to recognise the identifier from one search to the next, then the ID can't be long and complex.
I've thought of three approaches:
Create an extra table containing the primary key and a random sequence of numbers, and get the query to retrieve the random number equivalent of the primary key.
Encrypt the primary key using MySQL's SHA2 or AES, but these produce long letter and digit sequences.
Encrypt the Primary key on the fly in the query, using something like Base64 encryption in PHP.
Which of these is best, or have I missed a better approach?

I actually just wrote a short tutorial on a URL shortener that works on that basis, using a recid as the seed. You could use that function to create your lookup key and store it in the DB as the "reference" key. The code is here

If you're doing it to protect privacy, you're heading for a major fsckup. It won't take long for the lamest script kiddie to write a simple program which just tries every possible "hash", downloading your entire list.
You should look into proper access control so people can see only what they're allowed to see.

If I understand you correctly, your main goal is just not to reveal the primary keys, but use something else instead when communicating with the users.
Simplest way:
add a CHAR column to your table and choose the length you want the other identifiers to be, for example CHAR(16).
give UNIQUE index to that column, so that you won't have any duplicates.
for each row generate a secure *random* string of length 16 and update the row.
DO NOT hash the plain primary key. If the keys start from 1,2,3.. then everybody can match the id to the hash by just calculating hashes for 1,2,3 .... etc
Another problem is that if you for example already have 200 rows in the table and you add 1, then the attacker can automatically associate the primary key 201 to the random string that just appeared in the list.
On the other hand, why do you need to hide the primary keys in the first place. Maybe you should instead encrypt the personal user data in the columns?
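A short illustration of both points, as a sketch only. The first part shows why hashing a sequential key is guessable; the second generates a random 16-character value for the CHAR(16) column. Hex output is an assumption here, since the answer only asks for a secure random string; randomPublicId is a made-up name.

```php
<?php
// Why hashing the key fails: anyone can rebuild the mapping for sequential ids.
$guessed = [];
foreach (range(1, 3) as $id) {
    $guessed[md5((string) $id)] = $id;
}

// What the answer suggests instead: a value with no relation to the key.
// 8 bytes from PHP's CSPRNG, hex-encoded, fill a CHAR(16) column.
function randomPublicId(): string
{
    return bin2hex(random_bytes(8));
}

$publicId = randomPublicId();
```

The UNIQUE index then takes care of the very rare duplicate: if the update fails, generate another value and try again.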

you could do a base 36 encode on the userid*100 or something, for example.
userid 26=208
userid 3=8C
http://www.translatorscafe.com/cafe/units-converter/numbers/calculator/decimal-to-base-36
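In PHP this can be done with the built-in base_convert(), roughly like this (function names are made up). Note that it is plain obfuscation that anyone can reverse, and base_convert() can lose precision on very large numbers.

```php
<?php
// Multiply by 100, then write the result in base 36 (uppercase).
function encodeUserId(int $userId): string
{
    return strtoupper(base_convert((string) ($userId * 100), 10, 36));
}

// Reverse the steps: base 36 back to decimal, then divide by 100.
function decodeUserId(string $code): int
{
    return intdiv((int) base_convert($code, 36, 10), 100);
}

echo encodeUserId(26), "\n"; // 208
echo encodeUserId(3), "\n";  // 8C
```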

You can truncate a hash or an encrypted value to the desired length, but with both you risk a collision. With eight base-36 digits you have about a 50% chance of a collision if you have 2 million records. If you don't convert to base-36 and just take eight hexadecimal digits, you only need 80 thousand records.
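Those figures can be checked with the usual birthday approximation. This is a sketch that assumes the truncated values are uniformly distributed; recordsForCollision is a made-up name.

```php
<?php
// Birthday approximation: the number of records at which the chance of at
// least one collision reaches $p, among $space equally likely values,
// is about sqrt(2 * space * ln(1 / (1 - p))).
function recordsForCollision(float $space, float $p = 0.5): float
{
    return sqrt(2 * $space * log(1 / (1 - $p)));
}

$base36 = recordsForCollision(36 ** 8); // eight base-36 digits
$hex    = recordsForCollision(16 ** 8); // eight hexadecimal digits

printf("base36 x 8: about %d records\n", round($base36));
printf("hex x 8:    about %d records\n", round($hex));
```

This comes out at a little under 2 million records for base 36 and a little under 80 thousand for hexadecimal, in line with the numbers above.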
With random numbers you don't have this problem, because you can put a uniqueness constraint on the column and generate a new number if a collision occurs.

Don't make it too complicated.
Consider this:
The user is given an alternative mapping key. This can be a temporary session mapping and/or a secondary unique key (but is not the PK and may or may not be in the same table) and should, of course, be unique within the domain.
Access token, randomly generated per item but need not be unique, which is combined with a simple PK for the exposed id. This can be made to look pretty across values with appropriate transformation if desired. The access token may also be treated as a separate value.
I like the second approach. In both cases it is not the "direct PK" which is exposed although there is more coupling in the 2nd form.
Both of these will prevent someone from "knowing" the next key based off a sequence, but guessing/brute-force is tied to the size of the domain: as others have said, this shouldn't be used as the primary security layer.
Happy coding.

The short answer, which I found after a two-month search, is hashids, which you can install for many languages from https://hashids.org/. Its options are:
generate an obfuscated string from your number using a key called a salt.
define a minimum length for the string, for example 8 characters, which functions like base64_encode() do not offer.
decode the string back to the original number.
define your own alphabet, for example a-z only (note that the alphabet must contain at least 16 characters).
Note: for PHP it is recommended to enable the bcmath and GMP extensions on your host, but it works without GMP.

what is the best way to create random mysql ID using PHP

I'm building an application that needs a random unique ID for each user, not a sequence.
MySQL database:
ID Username
For my unique random ID, what is the best way to do that?

PHP provides a uniqid function, which might do the trick, I suppose.
Note it's returning a string, though, and not an integer.
Another idea would be to generate / use some GUID -- there are some proposals about that in the user notes of the manual page of uniqid.
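As one example of the GUID route, here is a sketch of a random (version 4) UUID generator in plain PHP. It is not taken from the manual's user notes and uuidV4 is a made-up name. Unlike uniqid(), which is time-based and which the manual says does not guarantee uniqueness, this draws all its bits from random_bytes().

```php
<?php
// Version 4 (random) UUID per RFC 4122, built from 16 CSPRNG bytes.
function uuidV4(): string
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set variant to RFC 4122
    // 32 hex characters split into 8 groups of 4, joined as 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

echo uuidV4(), "\n";
```

A UNIQUE index on the column is still a cheap safety net, even though duplicates are extremely unlikely.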

I would still have the normal auto-increment primary key to identify each row properly, it's just standard convention.
I'd then have another indexed column called 'user_id' or something and use uniqid(); for it.

MySQL provides a function called UUID():
http://dev.mysql.com/doc/refman/5.1/en/miscellaneous-functions.html#function_uuid
Documentation claims this:
A UUID is designed as a number that is globally unique in space and time. Two calls to UUID() are expected to generate two different values, even if these calls are performed on two separate computers that are not connected to each other.
This should cover your needs.
