PHP/MySQL - Hashing To BIGINT To Ensure No Duplicates


I'm currently scraping virtual currency transaction data off a webpage. The transactions consist of time/date, a description, price, and new balance.
Results are paginated. I can fetch 20 at a time. My goal is to have an accurate record of all entries in a separate database. There are a very large number of transactions occurring, and transactions can occur at any time, including between fetching different pages.
Time/date is measured to the minute, so multiple transactions can occur in the same minute. Descriptions can also be the same (for example the same item can be sold in the same quantity to the same person multiple times). Both price and balance could also overlap.
I am storing a timestamp, price, balance, and data which is parsed from the description in multiple fields. I need to be able to tell quickly whether an entry is already in the database. The best I can do is to ensure that each entry has a unique combination of time/date, description, price, and balance. The issue with composite keys is that I don't want to store the full description in the database. (This would double the database size.)
My solution that I came up with was to create a BIGINT hash based on those fields, which would be used as a UNIQUE field in the database. I found that the probability of a collision (based on the birthday attack formula) would be less than 1% for up to 61 million entries, which is a satisfactory probability, since the number of entries I'm planning to track is in the neighbourhood of 40k-2m.
My question is, based on my application and goals, which hashing algorithm would you recommend and how can I get the values from it in to a BIGINT size without losing any of the properties of the algorithm? The most important thing is to avoid collisions, as each one would affect the integrity of the data. Unless you have a better idea, my plan was to concatenate the data into a string (with separators between fields) then feed it into the function. Short code snippets are much appreciated!
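The birthday estimate referred to in the question can be reproduced with a few lines of PHP. This is a rough sketch using the standard approximation p = 1 - exp(-n(n-1)/(2N)); the function name collisionProbability is made up for illustration, and it assumes the hash output is spread uniformly over all 2^64 values.

```php
<?php
// Approximate probability that at least two of $n uniformly random values
// drawn from a space of $space possibilities are equal (birthday bound).
function collisionProbability(float $n, float $space): float
{
    // expm1() keeps precision when the probability is very small.
    return -expm1(-$n * ($n - 1) / (2 * $space));
}

$space = 2 ** 64; // evaluates to a float in PHP, which is fine for an estimate

printf("2 million rows:   %.2e\n", collisionProbability(2e6, $space));
printf("610 million rows: %.2e\n", collisionProbability(6.1e8, $space));
```

By this approximation a full 64-bit hash stays far below the 1% mark for the 40k-2m range in question; the 1% mark itself is only reached at roughly 6.1e8 rows.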

Because I don't care about security, I used SHA1. sha1() returns the 20-byte digest as a 40-character hexadecimal string. BIGINT is 8 bytes in size. Therefore, we need to truncate to 16 characters (since each character is half a byte in hex) using substr, and use base_convert to convert to base 10 for database storage. The result can exceed the signed BIGINT range, so the column needs to be BIGINT UNSIGNED.
function hashToBigInt ($string) {
return base_convert(substr(sha1($string), 0, 16), 16, 10);
}
Thanks everyone for all your help!
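One caveat with the snippet above: the PHP manual warns that base_convert() may lose precision on large numbers because it works through floats internally, so the low-order digits of a 16-digit hex value may not survive the conversion. A possible alternative, sketched here under the assumption of a 64-bit PHP build, reads the first 8 raw bytes of the digest directly as an integer. The function name hashToSignedBigInt and the sample row are hypothetical; the result can be negative, so it fits a signed BIGINT column rather than an unsigned one.

```php
<?php
// Sketch: first 8 bytes of the raw SHA-1 digest as one 64-bit integer.
// Assumes 64-bit PHP; values with the top bit set come back negative.
function hashToSignedBigInt(string $input): int
{
    $first8 = substr(sha1($input, true), 0, 8); // true = raw binary digest
    return unpack('J', $first8)[1];             // 'J' = 64-bit, big endian
}

// Fields are joined with a separator so that ("ab", "c") and ("a", "bc")
// do not produce the same input string. The sample values are made up.
$row = ['2012-09-22 14:05', 'Sold 3x Widget to Alice', '150', '12400'];
echo hashToSignedBigInt(implode("\x1f", $row)), "\n";
```

All 64 bits of the truncated digest are kept this way, at the cost of having to treat the stored number as an opaque key rather than a positive value.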

Related

Does hashing a random value plus an auto increment number ensure uniqueness?

I'm trying to generate a unique order number for my ecommerce application, this is my code:
<?php
$bytes = random_bytes(3);
$random_hash = bin2hex($bytes);
$order_num = $random_hash . "1";
echo strtoupper(hash('crc32b', $order_num));
The order number (1 in the example) is going to be an auto-increment value retrieved from MySQL.
Does this guarantee uniqueness?
I want a short, unique final value of at most 8-10 characters.
A numbers-only solution would be fine too.

As far as I know, most hash algorithms make no guarantee of when collisions might occur, so you're probably just as likely to get a collision with your proposed code as using the random part on its own.
If the auto-increment part is unique, and the random part is just to avoid guesses, you could just concatenate the two parts together (i.e. everything in your example before the hash call). That way if the same random number comes up twice, it will have different numbers on the end.
If that results in something too long, you could do something with base_convert or asc to convert the number into a shorter representation.
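The concatenation idea above could look roughly like this. It is only a sketch: the function name makeOrderNumber, the 2-byte random prefix and the choice of base 36 are assumptions made for illustration, not something the answer specifies.

```php
<?php
// Sketch: fixed-length random prefix + auto-increment id in base 36.
// Uniqueness comes only from the id; the prefix just makes guessing harder.
function makeOrderNumber(int $autoIncrementId): string
{
    $randomPart = bin2hex(random_bytes(2));                    // always 4 hex chars
    $idPart = base_convert((string) $autoIncrementId, 10, 36); // shorter than decimal
    return strtoupper($randomPart . $idPart);
}

echo makeOrderNumber(123456), "\n"; // 4 random hex chars followed by "2N9C"
```

Because the random prefix always has the same length, two different ids can never produce the same order number, even when the random part repeats.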

The hash function will not provide any uniqueness to the id, it only obfuscates the id a bit.
If you have, let's say, 100 possible values, you would get 100 possible hashes from them, no more. If an attacker wants to brute-force the hashes, he can pick the 100 possible hashes and try them.
In your case, with only 3 bytes of randomness (about 16.7 million combinations), you will get a duplicate long before all combinations have been used, so the same random value will come up much earlier than the size of the keyspace suggests.
There are two common approaches when it comes to unique ids:
You let the database automatically increment the id, which makes sure that the id is unique.
You generate a UUID (a globally unique id of 16 bytes), which offers such a huge keyspace that a duplicate is extremely unlikely. In practice one can neglect the possibility of duplicates.
The UUID has a lot of advantages and one disadvantage:
(+) UUIDs can work decentralized, e.g. in an offline scenario.
(+) One can generate the id before it is inserted into the database, so there is no need to wait until the row is created in the db.
(+) The ids are not deterministic, so an attacker cannot guess the next id.
(-) They use more storage space and are a bit slower when searching.
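For reference, a random (version 4) UUID can be produced in plain PHP roughly as follows. This is a minimal sketch, not a library-grade implementation; the function name uuidV4 is made up here, and random_bytes() requires PHP 7 or later.

```php
<?php
// Sketch: build a version 4 (random) UUID from 16 random bytes.
function uuidV4(): string
{
    $bytes = random_bytes(16);
    // Set the version nibble to 4 and the variant bits to binary 10.
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);
    // 32 hex chars, split into 8 groups of 4, laid out as 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

echo uuidV4(), "\n";
```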

MySQL RAND() how often can it be used? does it use /dev/random?

I have a table with few rows (50 at most), and I need to get a random row out of it. I can do that with
ORDER BY RAND() LIMIT 1
My main question: when I have 6k selects in 5 seconds, is RAND() still 'reliable'?
How is RAND() calculated, and can I re-seed it over time (say, every 5 seconds)?

The MySQL pseudo-random number generator is completely deterministic. The docs say:
RAND() is not meant to be a perfect random generator. It is a fast way to generate random numbers on demand that is portable between platforms for the same MySQL version.
It can't use /dev/random because MySQL is designed to work on a variety of operating systems, some of which don't have a /dev/random.
MySQL initializes a default seed at server startup, using the integer returned by time(0).
If you're interested in the source line, it's in the MySQL source in file sql/mysqld.cc, function init_server_components(). I don't think it ever re-seeds itself.
Then the subsequent "random" numbers are based solely on the seed. See source file mysys_ssl/my_rnd.cc, function my_rnd().
The best practice solution to your random-selection task, for both performance and quality of randomization, is to generate a random value between the minimum primary key value and maximum primary key value. Then use that random value to select a primary key in your table:
SELECT ... FROM MyTable WHERE id >= $random ORDER BY id LIMIT 1
The reason you'd use >= instead of = is that you might have gaps in the id due to rows being deleted or rolled back, or you might have other conditions in your WHERE clause so that you have gaps in between rows that match your conditions. The ORDER BY id makes sure that LIMIT 1 returns the row closest to the random value.
The disadvantages of this greater-than method:
Rows following such a gap have a higher chance of being chosen, and the larger the gap the greater the chance.
You need to know the MIN(id) and MAX(id) before you generate the random value.
Doesn't work as well if you need more than one random row.
Advantages of this method:
It's much faster than ORDER BY RAND(), even for a modest table size.
You can use a random function outside of SQL.
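To illustrate the selection rule and the gap bias listed among the disadvantages, here is a small pure-PHP simulation that applies the same rule to an array of ids instead of a table. The function names and the sample ids are hypothetical. It uses a greater-than-or-equal comparison so that both the lowest and the highest id stay reachable.

```php
<?php
// Return the first id that is >= $value, mirroring
// "WHERE id >= ? ORDER BY id LIMIT 1". $sortedIds must be ascending.
function firstIdAtOrAfter(array $sortedIds, int $value): ?int
{
    foreach ($sortedIds as $id) {
        if ($id >= $value) {
            return $id;
        }
    }
    return null;
}

// Pick a random value between the smallest and largest id, then resolve it.
function pickRandomId(array $sortedIds): int
{
    $random = random_int(min($sortedIds), max($sortedIds));
    return firstIdAtOrAfter($sortedIds, $random);
}

// ids 3..9 are missing, so id 10 is hit for 8 of the 10 possible random values.
$ids = [1, 2, 10];
$hits = array_fill_keys($ids, 0);
for ($i = 0; $i < 10000; $i++) {
    $hits[pickRandomId($ids)]++;
}
print_r($hits);
```

In a real setup the minimum and maximum would come from MIN(id) and MAX(id), and the lookup would be the SQL query itself.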

RAND is pseudorandom. Be careful using it for security stuff. I don't think your "choose one row randomly out of fifty" is for security, so you're probably OK.
It's pretty fast for a small table. It will be horrible for picking a random row out of a large table: it has to tag every row with a pseudorandom number and then sort them. For the application you're describing, @TheEwook's suggestion is exactly right; sorting even a small table more often than once a millisecond can swamp even powerful MySQL hardware.
Don't seed RAND, ever, unless you're testing and you want a repeatable sequence of random numbers for some kind of unit test. I learned this the hard way once when generating what I thought were hard-to-guess session tokens. The MySQL guys did a good job with RAND and you can trust them for the application you're talking about.
I think (not sure), if you don't seed it, it starts with a random seed from /dev/random.
If you need crypto-grade random numbers, read /dev/random yourself. But keep in mind that /dev/random can only generate a limited rate. /dev/urandom uses /dev/random to generate a faster rate, but isn't as high-grade in its entropy pool.

If your table is not too big (let's say max 1000 records) it doesn't really matter. But for big tables you must choose an alternative way.
This article may help you:
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

Is my random generation method flawed? (PHP)

I have a small internal project that inserts occasional entries into a MySQL database. I have a column named "idChar" where I set its value to a randomly generated string using 62 possible characters with a length of 31.
Today I discovered that a new entry just so happened to have the same exact idChar as an entry from several months ago. I am now checking for duplicate entries before saving them, but this made me think about the odds of this happening and I am curious to know if my implementation of generating these random keys is flawed. Getting a duplicate should be roughly 1 in 62^31 right?
function getCode($len)
{
//$len = 10;
$base='ABCDEFGHIJKLMNOPQRSTWXYZabcdefghijklmnopqrstwxyz123456789';
$max=strlen($base)-1;
$linkCode='';
mt_srand((double)microtime()*1000000);
while (strlen($linkCode)<$len+1)
$linkCode.=$base[mt_rand(0,$max)];
return $linkCode;
}
$idChar=getCode(30);
//code to insert into MySQL here

The odds of getting a duplicate would be calculated as per the birthday problem, because that's how you calculate the chance of collisions for the output of a function that produces a randomly chosen output from a discrete codomain. In effect you want to calculate the chance that among a pool of selections made randomly any two selections are the same.
You should also completely drop the mt_srand call as it is not necessary, and it's likely to provide worse seeds than what PHP will do automatically. Consider that the output of microtime (in my system at least) is like
0.29574400 1348356024
which means that you only have 1 million different seeds available as the last two digits of the float are always zeroes and the (double)microtime() cast completely ignores the seconds part (it would be a lousy seed anyway).
Since the random number generator produces the same sequence of random numbers whenever it is seeded with the same seed, in effect you only have 1 million possible random codes instead of 62^31 -- quite a decrease! The PHP 5.2.1 changelog note that identical seeds no longer produce the same sequence refers to a change of seeding algorithm between versions; within one PHP version the same seed still gives the same sequence.
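A version of the generator without manual seeding might look like the sketch below. It swaps in random_int() (PHP 7+), which draws from the operating system's CSPRNG; that substitution and the name randomCode are assumptions of this sketch, not part of the answer. Note also that the $base string in the question contains only 57 characters (U, V, u, v and 0 are missing), so the alphabet is spelled out in full here.

```php
<?php
// Sketch: random code from a 62-character alphabet, using random_int()
// and no manual seeding.
function randomCode(int $length): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
    $max = strlen($alphabet) - 1;
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code .= $alphabet[random_int(0, $max)];
    }
    return $code;
}

echo randomCode(31), "\n";
```

A UNIQUE index on the column is still worth keeping as a safety net, since it catches duplicates regardless of how the codes are generated.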

How many bytes are unique enough for twitter?

I don't want my database id's to be sequential, so I'm trying to generate uids with this code:
$bin = openssl_random_pseudo_bytes(12);
$hex = bin2hex($bin);
return base_convert($hex, 16, 36);
My question is: how many bytes would I need to make the ids unique enough to handle large amounts of records (like Twitter)?

Use PHP's uniqid(), with an added entropy factor. That'll give you plenty of room.

You might consider something like the way tinyurl and other shortening services work. I've used similar techniques, which guarantee uniqueness until all combinations are exhausted. So basically you choose an alphabet, and how many characters you want as a length. Let's say we use alphanumeric, upper and lower, so that's 62 characters in the alphabet, and let's do 5 characters per code. That's 62^5 = 916,132,832 combinations.
You start with your sequential database ID and multiply it by some prime number (choose one that's fairly large, like 2097593), making sure to wrap around if you exceed 62^5, and then convert that number to base-62 as per your chosen alphabet.
This makes each code look fairly random, yet because the prime shares no factor with 62^5, we're guaranteed not to hit the same number twice until we've used all codes. And it's very short.
You can use longer keys with a smaller alphabet, too, if length isn't a concern.
Here's a question I asked along the same lines: Tinyurl-style unique code: potential algorithm to prevent collisions
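The multiply-and-wrap scheme described above can be sketched as follows. The function name shortCode and the order of the alphabet are arbitrary choices for this example; it assumes a 64-bit PHP build and ids below 62^5.

```php
<?php
// Sketch of the multiply-and-wrap scheme. Assumes 64-bit PHP and
// 0 <= $id < 62^5. The alphabet order is an arbitrary choice.
function shortCode(int $id): string
{
    $alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    $length = 5;
    $space = 62 ** $length;   // 916,132,832 possible codes
    $multiplier = 2097593;    // shares no factor with 62, so no repeats

    $n = ($id * $multiplier) % $space;
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code = $alphabet[$n % 62] . $code; // prepend least significant digit
        $n = intdiv($n, 62);
    }
    return $code;
}

echo shortCode(1), "\n"; // 08ng9
echo shortCode(2), "\n";
```

The codes only look scrambled; anyone who knows the multiplier can recover the original id, so this hides the sequence from casual inspection rather than securing it.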

Assuming that openssl_random_pseudo_bytes may generate every possible value, N bytes will give you 2 ^ (N * 8) distinct values. For 12 bytes this is 7.923 * 10^28

Use MySQL's UUID() function:
insert into `database`(`unique`,`data`) values(UUID(),'Test');
If you're not using MySQL, search for "UUID" together with your database's name to find the equivalent.
Source: Wikipedia
In other words, only after generating 1 billion UUIDs every second for the next 100 years would the probability of creating just one duplicate be about 50%.

MySql speed of executing max(), min(), sum() on relatively large database

I have a relatively large database (130,000+ rows) of weather data, which is accumulating very fast (a new row is added every 5 minutes). On my website I publish min/max data for the day, and for the entire existence of my weather station (which is around 1 year).
Now I would like to know whether I would benefit from creating additional tables where these min/max data would be stored, rather than letting PHP run a MySQL query that searches for the day's min/max data and the min/max data for the entire existence of my weather station. Would a query for max(), min() or sum() (I need sum() to total rain accumulation for months) take that much longer than a simple query to a table that already holds those min, max and sum values?

That depends on whether your columns are indexed or not. In the case of MIN() and MAX() you can read the following in the MySQL manual:
MySQL uses indexes for these operations: To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a preprocessor that checks whether you are using WHERE key_part_N = constant on all key parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each MIN() or MAX() expression and replaces it with a constant.
In other words in case that your columns are indexed you are unlikely to gain much performance benefits by denormalization. In case they are NOT you will definitely gain performance.
As for SUM() it is likely to be faster on an indexed column but I'm not really confident about the performance gains here.
Please note that you should not be tempted to index your columns after reading this post. If you put indices your update queries will slow down!

Yes, denormalization should help performance a lot in this case.
There is nothing wrong with storing calculations for historical data that will not change in order to gain performance benefits.
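As an illustration of what maintaining such stored values involves, the sketch below folds each new reading into a per-day summary. The function name updateDailySummary and the keys min_temp, max_temp and rain_sum are hypothetical; in practice the summary would live in its own table and be updated whenever a reading is inserted.

```php
<?php
// Sketch: fold one new reading into a per-day summary row.
// The array keys (min_temp, max_temp, rain_sum) are hypothetical column names.
function updateDailySummary(?array $summary, float $temp, float $rain): array
{
    if ($summary === null) {
        // First reading of the day starts the summary.
        return ['min_temp' => $temp, 'max_temp' => $temp, 'rain_sum' => $rain];
    }
    return [
        'min_temp' => min($summary['min_temp'], $temp),
        'max_temp' => max($summary['max_temp'], $temp),
        'rain_sum' => $summary['rain_sum'] + $rain,
    ];
}

$summary = null;
foreach ([[12.5, 0.0], [9.0, 1.5], [15.25, 0.25]] as [$temp, $rain]) {
    $summary = updateDailySummary($summary, $temp, $rain);
}
print_r($summary);
```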

While I agree with RedFilter that there is nothing wrong with storing historical data, I don't agree with the performance boost you will get. Your database is not what I would consider a heavy use database.
One of the major advantages of databases is indexes. They use advanced data structures to make data access lightning fast. Just think, every primary key you have is an index. You shouldn't be afraid of them. Of course, it would probably be counterproductive to index all your fields, but that should never really be necessary. I would suggest researching indexes more to find the right balance.
As for the work done when a change happens, it is not that bad. An index is a tree-like representation of your field data. This is done to reduce a search down to a small number of near-binary decisions.
For example, think of finding a number between 1 and 100. Normally you would randomly stab at numbers, or you would just start at 1 and count up. This is slow. Instead, it would be much faster if you set it up so that you could ask if you were over or under when you choose a number. Then you would start at 50 and ask if you are over or under. Under, then choose 75, and so on till you found the number. Instead of possibly going through 100 numbers, you would only have to go through at most 7 numbers to find the correct one.
The problem here is when you add 50 numbers and make it out of 1 to 150. If you start at 50 again, your search is less optimized as there are 100 numbers above you. Your binary search is out of balance. So, what you do is rebalance your search by starting at the mid-point again, namely 75.
So the work a database does is just an adjustment to rebalance the mid-point of its index. It isn't actually a lot of work. If you are working on a database that is large and requires many changes a second, you would definitely need to have a strong strategy for your indexes. In a small database that gets very few changes like yours, it's not a problem.
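The 1-to-100 guessing game described above can be sketched in a few lines. This is a plain binary search used as an analogy only, with a hypothetical function name; real MySQL indexes are B-trees, which are more involved.

```php
<?php
// Count how many "over or under" guesses a halving search needs
// to find $target in the range $low..$high.
function guessesNeeded(int $target, int $low = 1, int $high = 100): int
{
    $guesses = 0;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        $guesses++;
        if ($mid === $target) {
            return $guesses;
        }
        if ($mid < $target) {
            $low = $mid + 1;  // guess was under, search the upper half
        } else {
            $high = $mid - 1; // guess was over, search the lower half
        }
    }
    return $guesses; // only reached if $target is outside the range
}

echo max(array_map('guessesNeeded', range(1, 100))), "\n"; // worst case: 7
```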
