Comparable hashes

I haven't been able to find an answer to my question.
I need a hashing method that generates hashes that can be compared with one another to find out how similar the inputs are.
Let's say I have two strings, "mother" and "father". When I compare the two hashes, the result should say that the strings are similar because both contain "ther".
Is there any hashing method that is able to do that?
Thank you.

PHP provides a function called similar_text, which calculates the similarity between two strings. You could also use the levenshtein function to calculate the edit distance between the two strings. While these aren't hashing functions, I think they provide the functionality you're after.
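As a minimal sketch of the two functions mentioned above, applied to the strings from the question (the variable names are only for illustration):

```php
<?php
$percent = 0.0;
// similar_text() returns the number of matching characters and writes a
// percentage into the optional third argument, which is passed by reference.
$common = similar_text('mother', 'father', $percent);

// levenshtein() returns the number of single-character edits needed to turn
// one string into the other.
$distance = levenshtein('mother', 'father');

printf("similar_text: %d common characters (%.1f%%)\n", $common, $percent); // 4 common characters (66.7%)
printf("levenshtein: %d edits\n", $distance);                               // 2 edits
```

Both functions work on the original strings, not on hashes of them, so the strings themselves have to be available at comparison time.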

I'm not sure if you were looking for an answer specific to your specific case of 2 words, but there are definitely hash-style functions that are useful for comparing parts of a whole. A Hash Tree is a perfect example of one such structure. Hash trees are used to compare parts of a chunk of data and they aggregate for comparison of the entire chunk of data.
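A rough sketch of the hash tree idea, simplified to two levels (leaf hashes plus one root hash). The function name chunkHashes, the chunk size and the sample strings are assumptions made for illustration, not part of any library:

```php
<?php
// Hypothetical helper: hash fixed-size chunks (the leaves), then hash the
// concatenated leaf hashes to get a single root hash for the whole input.
function chunkHashes(string $data, int $chunkSize = 4): array
{
    $leaves = array_map('sha1', str_split($data, $chunkSize));
    return ['root' => sha1(implode('', $leaves)), 'leaves' => $leaves];
}

$a = chunkHashes('the quick brown fox');
$b = chunkHashes('the quick brawn fox');

// Equal roots mean the inputs match; if the roots differ, comparing the
// leaves shows which chunks changed.
$changed = array_keys(array_diff_assoc($a['leaves'], $b['leaves']));

var_dump($a['root'] === $b['root']); // bool(false)
print_r($changed);                   // only chunk index 3 differs
```

This tells you which parts differ, but it only lines up when the change does not shift the chunk boundaries, so it is not a general similarity measure for strings like "mother" and "father".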
I'll also note that while others point out that most real-world hash functions will not allow any information about the input to be derived from the output, they are talking about a cryptographic hash function. The set of guarantees for a regular hash function is much less strict than that of a cryptographic hash function. For instance, in Java you can override .hashCode() and return 4 for every object. This is perfectly valid, but not very useful. It is valid because collisions are OK in general hash functions, but they are considered a failure in a cryptographic hash function.

I believe rot13, along with taking out all the vowels would qualify. Any real-world hash would not. That's kind of the point.

In short: there can't be, in the universal sense of the word.
This is why:
One of the main functions of a hash is compression - apart from trivial usage (such as "mother" and "father"), a hash will always be shorter than the hashed information. E.g. a SHA1 (or even MD5) hash, used as a quick check of whether a download of a 600MB ISO went through without corruption, will be much shorter than the file itself.
Another main function of a hash is (very high grade) obfuscation. Were this not so, hashing a salted password would do nothing (or at least much less) to protect against a dictionary attack, as similar passwords would result in similar hashes.
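To see the obfuscation point in practice, here is a small sketch that hashes the two words from the question with SHA-1 and counts how many hex digits differ. The helper name differingHexDigits is made up for this example:

```php
<?php
// Hypothetical helper: count the positions at which the SHA-1 digests of two
// strings have different hex digits (each digest is 40 hex digits long).
function differingHexDigits(string $a, string $b): int
{
    $x = sha1($a);
    $y = sha1($b);
    $diff = 0;
    for ($i = 0; $i < strlen($x); $i++) {
        if ($x[$i] !== $y[$i]) {
            $diff++;
        }
    }
    return $diff;
}

echo sha1('mother'), "\n";
echo sha1('father'), "\n";
echo differingHexDigits('mother', 'father'), " of 40 hex digits differ\n";
```

The shared "ther" leaves no visible trace in the digests, which is exactly why such a hash cannot be used to measure similarity.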

Related

Is it a good practice to compare the checksums of two complex objects instead of iterating?

Suppose you have two datasets and you need to make sure that they have not changed. For example, you have an array of objects in one hand and another array in the other, and you need to verify that both arrays are exactly the same.
Each array can contain any type of data: booleans, strings, objects, arrays, NULL, etc.
When comparing, the contents of both arrays should be exactly the same: same data types and same order.
Instead of iterating over the array contents with code that can compare different types of data, possibly recursively, I came up with a solution, and I would be grateful if you could shed some light on whether it has any downsides. PHP is the language, but I'm more interested in a language-neutral answer.
I serialized both datasets separately and calculated their md5 hashes. I chose md5 because it is available without external extensions or libraries and is quite fast. I am aware of the chance of a collision, and that md5 hashes are nowhere near cryptographically secure.
My question is:
Is this a widely used method to validate arbitrary types of data? Checking file checksums makes sense, but I have not personally used it to compare variables like this.
I'm mainly doing this to keep my code simple. A direct comparison is probably faster because it can stop as soon as it finds the first mismatch. In my case, the data is fairly small: about 5 KB as a serialized string.
Are there any other downsides that I should know of?
Thanks in advance.

If you're looking for changes in an array, I would actually recommend using crc32(). Like md5(), this function has been available in PHP since version 4 and requires no extra libraries. However, crc32() is actually meant for error checking and is quicker than md5(), which was designed as a cryptographic hash function and does more work per byte. The trade-off is that a 32-bit checksum collides far more easily than a 128-bit hash.
Especially since you want a language-agnostic answer, I would always choose CRC32 over MD5: it's much simpler to find libraries for, and it is much less computationally expensive, which makes it suitable for pretty much every application, even embedded devices.
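A minimal sketch of both approaches from this thread, serializing the data and then taking either an MD5 hash or a CRC32 checksum. The helper names fingerprint and checksum and the sample arrays are assumptions for illustration:

```php
<?php
// Hypothetical helpers: reduce any serializable value to a short fingerprint.
function fingerprint($value): string
{
    return md5(serialize($value));   // 128-bit hash as 32 hex characters
}

function checksum($value): int
{
    return crc32(serialize($value)); // 32-bit checksum
}

$a = [1, 'two', null, ['nested' => true]];
$b = [1, 'two', null, ['nested' => true]];
$c = ['1', 'two', null, ['nested' => true]]; // first element is a string here

var_dump(fingerprint($a) === fingerprint($b)); // bool(true)
var_dump(fingerprint($a) === fingerprint($c)); // bool(false): serialize() records types
var_dump(checksum($a) === checksum($b));       // bool(true)
```

For arrays of scalars and nested arrays, PHP's own `$a === $b` already checks for the same key/value pairs in the same order and of the same types, so the fingerprint mostly pays off when one side has to be stored or transmitted rather than kept in memory.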

Hash for GIF images

I need to know whether there is any way to get a unique hash from GIF images. I tried the SHA1 file function
sha1_file
but I don't know whether two different GIF images can ever result in the same hash.
Can that happen with SHA1? If so, is SHA2 or MD5 better? Or any other algorithm already implemented in PHP?
I know it also depends on file size, but the GIF images don't exceed 10 MB in any case.
I need recommendations for this problem. Best regards.

There is no hash function that creates different values for each and every set of images you provide. This should be obvious as your hash values are much shorter than the files themselves and therefore they are bound to drop some information on the way. Given a fixed set of images it is rather simple to produce a perfect hash function (e.g. by numbering them), but this is probably not the answer you are looking for.
On the other hand, you can use "perfect hashing", a two-step hashing scheme that guarantees O(1) access for a fixed set of keys, but as you are asking for a unique 'hash', that may also not be what you are looking for. Could you be a bit more specific about why you insist on the hash value being unique, and under what circumstances?

sha1_file is fine.
In theory you can run into two files that hash to the same value, but in practice it is so stupendously unlikely that you should not worry about it.
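For reference, a small sketch of hashing a file with sha1_file and, if a longer digest is preferred, with hash_file. The temporary file and its contents are stand-ins for a real GIF:

```php
<?php
// Stand-in for a real GIF: write a few bytes to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'gif');
file_put_contents($path, 'GIF89a example bytes');

$sha1   = sha1_file($path);           // 160-bit digest, 40 hex characters
$sha256 = hash_file('sha256', $path); // 256-bit digest, 64 hex characters

echo $sha1, "\n", $sha256, "\n";

unlink($path); // clean up the temporary file
```

Both functions read the file from disk, so file size affects the time taken but not the length of the result.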

Hash functions don't provide any guarantees about uniqueness. Patru explains why, very well - this is the pigeonhole principle, if you'd like to read up.
I'd like to talk about another aspect, though. While you won't get any theoretical guarantees, you get a practical guarantee. Consider this: SHA-256 generates hashes that are 256 bits long. That means there are 2^256 possible hashes it can generate. Assume further that the hashes it generates are distributed almost purely randomly (true for SHA-256). That means that if you generate a billion hashes a second, 24 hours a day, you'll have generated 31,536,000,000,000,000 hashes a year. A lot, right?
Divide 2^256 by that. That's ~10^60. If you walked linearly through all possible hashes, that's how many years it would take you to generate all possible hashes (pack a lunch). Divide that by two, that's... still ~10^60. That's how many years you'd have to work to have a greater than 50% chance of hitting one particular hash.
To put it another way, if you generate a billion hashes a second for a century, you'd have a 1/10^58 chance of hitting one particular hash. Until the sun burns out, 1/10^50. The chance that any two of the hashes you generated collide with each other is higher (the birthday bound), but still only about 1/10^40 after a century.
Those are damn fine chances.
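The arithmetic can be checked with a few lines of floating-point math. This is a back-of-the-envelope sketch that assumes a 256-bit hash with uniformly distributed output and a rate of a billion hashes per second; it distinguishes the chance of hitting one given hash from the chance that any two generated hashes collide (using the usual birthday approximation n^2 / (2N)):

```php
<?php
$space   = pow(2.0, 256);            // number of possible 256-bit hashes, as a float
$perYear = 1e9 * 60 * 60 * 24 * 365; // a billion hashes per second for one year

$yearsToEnumerate = $space / $perYear; // years needed to walk through every hash

$n               = $perYear * 100;           // hashes generated in a century
$hitOneGivenHash = $n / $space;              // chance of producing one particular hash
$anyTwoCollide   = ($n * $n) / (2 * $space); // birthday approximation n^2 / (2N)

printf("hashes per year:            %.4e\n", $perYear);
printf("years to enumerate:         %.2e\n", $yearsToEnumerate);
printf("hit a given hash (100 y):   %.2e\n", $hitOneGivenHash);
printf("any two collide (100 y):    %.2e\n", $anyTwoCollide);
```

Floats are accurate enough here because only the order of magnitude matters.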

Encryption - does it work this way or am I thinking wrong?

So I know that MD5's are technically a no-no in new applications, but I randomly had a thought of this:
Since
md5($password);
is insecure, wouldn't
md5(md5($password))
be a better alternative?
Would it keep getting more secure the more I use it? Say I made a function like this:
function ExtremeEncrypt($password)
{
    $encryptedpass = md5(sha1(md5(md5($password))));
    return $encryptedpass;
}
Would this function be a good alternative to, say, using a random salt for every account like vBulletin does?

Double hashing a string does nothing except limit your key space and make collisions more likely. Please don't do this. Double md5 hashing is actually less secure than a single hash with some attack vectors.
A better option would be to use the password_hash function in PHP 5.5 or later, or ircmaxell's password_compat library for earlier PHP versions.
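A minimal sketch of password_hash together with password_verify (the sample password is made up):

```php
<?php
// password_hash() generates a random salt itself and embeds it, together with
// the algorithm and cost, in the returned string.
$hash = password_hash('correct horse battery staple', PASSWORD_DEFAULT);

// Store $hash in the database; at login, check the submitted password against it.
var_dump(password_verify('correct horse battery staple', $hash)); // bool(true)
var_dump(password_verify('wrong guess', $hash));                  // bool(false)
```

Because the salt is random, hashing the same password twice gives two different strings, and both verify correctly.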

First off: hashing and encryption are not the same. A hash is a one-way function, while encryption expects that the data can be decrypted.
You should not try to invent your own solution when it comes to security. Since version 5.5, PHP has a native solution called Password Hashing. md5() is insecure and you should be aware of that.
If you are on a PHP version below 5.5, you should use a salt when hashing and storing your passwords.

You have lots of answers here and they are accurate but they don't really explain why.
MD5 is a hashing algorithm. What a hashing algorithm does is take a long piece of data and analyse it cryptographically in a way that creates a smaller piece of data. So from ABCDEFGHIJKLMNOPQRSTUVWXYZ, my custom hash algorithm might create a single-digit hash: 5.
When that is done, you lose information - ABCDEFGHIJKLMNOPQRSTUVWXYZ contains far more information than 5, and there is no way to translate back the other way.
The problem with hashing in a way that only allows an outcome of 0-9 (this is effectively a checksum) is that if you take two pieces of text, the chances are quite high that they will have the same hash. So maybe with my algorithm ZZZZZZZZZ will also produce a hash of 5. This is what is termed a hash collision.
Now what happens if I take the hash of my hash? Well, my starting point already carries very little information - the most it can possibly be is one of ten digits - so the chance of a collision is now exceedingly high. Suppose that when my hash algorithm runs on numbers, it returns 1 if the number is odd and 0 if it is even. If I have a hash of ABCDEFGHIJKLMNOPQRSTUVWXYZ which comes to 5, then I have a 10% chance of a collision. But if I make a hash of that hash, I will now have a 50% chance of a collision.
The trick of cryptography is hiding information in such an enormous possible space that it is unbelievably hard to find. The more you shrink that possible space, the less well hidden your information is.
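The shrinking output space can be played with in code. The two toy functions below are assumptions made up for this sketch (the answer never specifies its "custom hash algorithm"): one sums the byte values modulo 10, the other reduces a number to its parity.

```php
<?php
// Toy hash, invented for illustration: sum of the byte values, modulo 10.
function toyHash(string $s): int
{
    return array_sum(array_map('ord', str_split($s))) % 10;
}

// Second toy hash for numbers: 1 if odd, 0 if even.
function parityHash(int $n): int
{
    return $n % 2;
}

var_dump(toyHash('ABCDEFGHIJKLMNOPQRSTUVWXYZ')); // int(5)
var_dump(toyHash('A'));                          // int(5): a collision after one round
var_dump(toyHash('C'));                          // int(7): no collision yet...
var_dump(parityHash(toyHash('C')));              // int(1)
var_dump(parityHash(toyHash('A')));              // int(1): ...but one after the second round
```

Ten possible outputs shrink to two after the second round, so inputs that were still distinguishable after the first hash become indistinguishable.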

Short answer: No.
md5 is easy to break using brute-force. Adding additional layers of hashing only slows down a brute-force attack linearly.

First of all md5 isn't really encryption, because there isn't a decryption method to it. It's called hashing.
The standard practice is to salt your passwords:
$salt = bin2hex(random_bytes(16)); // some random/unique value; people often use user_id or a timestamp instead
$hashed_password = sha1($salt . $password);
Remember that you need to know the salt, hence usually it means storing it along with the hashed password.
You can have multiple salts, and arrange them however you like.

Is it wrong to use a hash for a unique ID?

I want to use a unique ID generated by PHP in a database table that will likely never have more than 10,000 records. I don't want the time of creation to be visible or use a purely numeric value so I am using:
sha1(uniqid(mt_rand(), true))
Is it wrong to use a hash for a unique ID? Don't all hashes lead to collisions or are the chances so remote that they should not be considered in this case?
A further point: if the number of characters to be hashed is less than the number of characters in a sha1 hash, won't it always be unique?

If you have 2 keys you will have a theoretical best case scenario of 1 in 2 ^ X probability of a collision, where X is the number of bits in your hashing algorithm. 'Best case' because the input usually will be ASCII which doesn't utilize the full charset, plus the hashing functions do not distribute perfectly, so they will collide more often than the theoretical max in real life.
To answer your final question:
"A further point: if the number of characters to be hashed is less than the number of characters in a sha1 hash, won't it always be unique?"
Yeah, that's true, sort of. But you would then have the other problem of generating unique keys of that size. The easiest way is usually a checksum, so just choose a large enough digest that the chance of a collision is small enough for your comfort.
As @wayne suggests, a popular approach is to concatenate microtime() to your random salt (and base64_encode the result, although encoding does not actually add entropy).
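For comparison, here is a sketch that generates 10,000 IDs with the expression from the question and with a swapped-in alternative that is not mentioned in this thread: random_bytes (PHP 7+), which draws directly from the operating system's random source. The variable names are illustrative:

```php
<?php
$fromQuestion = [];
$fromRandom   = [];

for ($i = 0; $i < 10000; $i++) {
    // The expression from the question: 40 hex characters.
    $fromQuestion[] = sha1(uniqid(mt_rand(), true));
    // Alternative (an assumption of this sketch): 128 random bits as 32 hex characters.
    $fromRandom[] = bin2hex(random_bytes(16));
}

// Count distinct values to confirm that no ID was repeated in this run.
printf("distinct sha1(uniqid()) IDs: %d\n", count(array_unique($fromQuestion)));
printf("distinct random_bytes IDs:   %d\n", count(array_unique($fromRandom)));
```

If a duplicate would be a real problem, a UNIQUE index on the database column turns the remote chance of a collision into an error you can catch and retry.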

How horrible would it be if two ended up the same? Murphy's Law applies - if a million to one, or even a 100,000:1 chance is acceptable, then go right ahead! The real chance is much, much smaller - but if your system will explode if it happens then your design flaw must be addressed first. Then proceed with confidence.
Here is a question/answer of what the probabilities really are: Probability of SHA1 Collisions

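Use sha1(time()) instead. Then you remove the random possibility of a repeating hash for as long as the time can be represented in fewer bits than the sha1 hash (likely longer than you will find a working PHP parser ;)). Note that time() has one-second resolution, so this only holds if you never generate two IDs within the same second.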

Computer random isn't actually random, you know?
The only true randomness you can obtain from a computer, supposing you are in a Unix environment, is from /dev/random, but reading it is a blocking operation that depends on user interactions like moving the mouse or typing on the keyboard. Reading from /dev/urandom is less safe, but it's probably better than using just ASCII characters, and it gives you an instantaneous response.

sha1($ipAddress.time())
Because it's impossible for anyone to use the same IP address at the same time.

Absolute fastest method of hashing a string in PHP

What is the absolute fastest way to hash a string in PHP?
I have read that md5 can be relatively slow, but I am unsure of the alternatives.
Basically, I have a function that I need to squeeze every last bit of performance out of, and within that function I have a string, say "yada yada yada", that I need hashed in some way so it becomes one string.
I should note that security is no issue here - I simply need a single unique string representation, as it's for a cache key.

The whole point of a hash is that it's -not- fast. The faster the hash is the faster it can be cracked.
By that logic, the less secure the hash is - the faster it'll be. If you're going to favour such logic I suggest you either stop what you're doing or use encryption instead.
In response to your update
It sounds like you may want a CRC. Again, it's worth mentioning that typically the faster the check is, the fewer possible values the algorithm can produce, and thus the less likely it is to be a "unique representation".
The associated PHP documentation can be found here: hash function with crc32/crc32b
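A minimal sketch of a CRC-based cache key using hash() with the crc32b algorithm. The function name cacheKey is an assumption for illustration:

```php
<?php
// Hypothetical helper: turn an arbitrary string into a short cache key.
function cacheKey(string $s): string
{
    return hash('crc32b', $s); // 32-bit checksum as 8 hex characters
}

$key = cacheKey('yada yada yada');
echo $key, "\n";

// The same input always yields the same key, which is what a cache needs.
var_dump($key === cacheKey('yada yada yada')); // bool(true)
```

With only 32 bits, two different strings will eventually share a key, so it is safer to store the original string alongside the cached value and compare it on a hit.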

Benchmarks. I seem to recall reading somewhere that this depends a lot on your version of Apache and PHP, can't remember where though. I'll post if I remember :)
