Which hashing method should I use for large text? - PHP / MYSQL

Most of the texts stored in my DB are between 1MB and 1.5MB in size. None is bigger than 1.5MB, because that's the limit I set.
Here are my needs:
I need it for reducing my MySQL database size
I need it to be as fast as possible
No security needed
It must just work correctly, so that string_1 and string_2 can never have the same hash
I use PHP and MySQL.

A hash is not reversible. You can make a 1.5MB text into a small string with the help of hashing, but you cannot convert the same hash back into the original text.
What you are looking for is a compression algorithm. You can make the files a lot smaller with compression, but it's unlikely to be as small as a hash.

I would suggest SHA1, as it is also in use by git and similar applications to identify strings.
See: https://en.wikipedia.org/wiki/Sha1
and: http://php.net/manual/en/function.hash.php
$hash = hash( 'sha1', $inputData );

Saving space
MySQL has built-in COMPRESS() and UNCOMPRESS() functions which will save space in your DB, as well as saving you from having to write extra PHP code.
Checking unique-ness
Instead of indexing TEXT columns [regardless of whether they're compressed or not] you can store and index 2 relatively small things that will, for all practical purposes, guarantee that the text is unique.
A hash of the data, MD5, SHA, whatever you want.
The length of the uncompressed data.
For most hashing functions you're more likely to get hit by a meteor than to have 2 identical hashes for different text strings, and having 2 different texts with both an identical length and an identical hash is less likely than getting hit by a meteor and lightning while winning three simultaneous lotteries.
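
The two values described above could be produced like this. This is only a minimal sketch: the helper name text_fingerprint() and the sample text are made up for illustration, and SHA1 is just one possible choice of hash.

```php
<?php
// Hypothetical helper: builds the two small values to store and index
// alongside the (possibly compressed) text.
function text_fingerprint(string $text): array
{
    return [
        'hash'   => hash('sha1', $text), // always 40 hex characters
        'length' => strlen($text),       // size in bytes of the uncompressed text
    ];
}

// Made-up sample data: two large texts that differ by a single character.
$base = str_repeat('lorem ipsum ', 100000);
$a = text_fingerprint($base);
$b = text_fingerprint($base . '!');

var_dump($a === text_fingerprint($base)); // same text gives the same fingerprint
var_dump($a['hash'] === $b['hash']);      // different text gives a different hash
```

Both values are small enough to index cheaply, which is the point of not indexing the TEXT column itself.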

I'm going to assume you want a compression algorithm to reduce the text size.
See http://php.net/manual/en/function.gzcompress.php.
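
A rough sketch of what that looks like, assuming the zlib extension is enabled; the sample text is made up, and real text will usually compress less well than this repetitive example.

```php
<?php
// Made-up sample text standing in for a large document.
$text = str_repeat("The quick brown fox jumps over the lazy dog.\n", 20000);

// Level 9 is the slowest but smallest; lower levels trade size for speed.
$compressed = gzcompress($text, 9);

// Unlike a hash, compression is reversible.
$restored = gzuncompress($compressed);

printf("original:   %d bytes\n", strlen($text));
printf("compressed: %d bytes\n", strlen($compressed));
var_dump($restored === $text);
```

The compressed value is binary, so it should go into a BLOB column rather than a TEXT column.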

Related

128 bit reversible encryptor/hasher to reduce DB size

Is there anything out there for PHP that can hash/encrypt a long string into a 128 bit string that can also be reversed?
I am trying to import hundreds of millions of strings into a MySQL DB, and the average string is over 100 characters. MD5 gets this down to 32 characters, which significantly reduces storage, but I cannot reverse it again in my application.
Does PHP have anything available that can handle this?

If I understand your question correctly, it seems to me you mix up hashing and compression quite a lot.
Most hash functions are not easily reversible, because that is not their purpose. There are infinitely many "Strings/ByteStreams/Numbers/..." that correspond to any given result of a hash function. As you may know, even images that are a few gigabytes big also give you an md5sum of 32 characters.
You cannot just magically map any String into a shorter String of fixed length and then magically pouff it back to its original String.
It may well be that some hash functions could be reversed very efficiently if you know that your target results have this or that property (in your case maybe a character length of 100-120), but I doubt it.
Or do I totally misunderstand and you just mean ASCII-Strings with the expression "128 bit string"?

No, you can't do this: Pigeonhole principle
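
To see the pigeonhole principle at work, here is a toy sketch. The function tiny_hash() is made up for illustration: it keeps only the first byte of an MD5, so it has just 256 possible outputs, and 257 distinct inputs are therefore guaranteed to contain a collision. A 128-bit hash behaves the same way, only with far more pigeonholes.

```php
<?php
// Hypothetical toy "hash": the first byte of an MD5, so only 256 possible outputs.
function tiny_hash(string $s): string
{
    return substr(md5($s), 0, 2); // 2 hex characters = 8 bits
}

$seen = [];
$collision = null;

// 257 distinct inputs into 256 possible outputs: a collision is guaranteed.
for ($i = 0; $i < 257; $i++) {
    $input = "string-$i";
    $h = tiny_hash($input);
    if (isset($seen[$h])) {
        $collision = [$seen[$h], $input, $h];
        break;
    }
    $seen[$h] = $input;
}

printf("'%s' and '%s' both map to %s\n", ...$collision);
```

Since two different inputs share one output, no function can map that output back to "the" original string.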

Can I use PHP base64_encode for image duplicate check?

base64_encode converts the binary data into characters like
9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8
Can I take some subset of these characters to check for duplicates? Can I do the same for videos?

Like the others have said, don't use Base64 as a means of comparing files. It would be much, much less expensive to use something like SHA1, particularly if you are using this for videos. See the sha1_file function.
For example if you already have a SHA1 sum, it is easy to compare:
if ($storedSHA1 === sha1_file($newImage)) {
    // ...some rejection code
}
I'd recommend creating a database table that stores the name, size and SHA1 of each file you upload. Then you can run a simple query to check if any of the records match. If you have a match in your database you know you have a duplicate.
See the MySQL query below.
SELECT SHA1_hash FROM Uploads
WHERE SHA1_hash = '<hashOfIncomingImage>';
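
Put together, the duplicate check might look roughly like this. It is a sketch only: the array $uploads stands in for the Uploads table, and register_upload() and the file names are made up for the example.

```php
<?php
// Hypothetical stand-in for the Uploads table: SHA1 hash => file name.
$uploads = [];

// Returns the name of an existing duplicate, or null after recording the new file.
function register_upload(array &$uploads, string $path, string $name): ?string
{
    $hash = sha1_file($path);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $path");
    }
    if (isset($uploads[$hash])) {
        return $uploads[$hash]; // same content was uploaded before
    }
    $uploads[$hash] = $name;
    return null;
}

// Demo with two temporary files that have identical content.
$first  = tempnam(sys_get_temp_dir(), 'up');
$second = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($first, 'same bytes');
file_put_contents($second, 'same bytes');

$r1 = register_upload($uploads, $first, 'a.jpg');  // null: content not seen before
$r2 = register_upload($uploads, $second, 'b.jpg'); // 'a.jpg': duplicate content
var_dump($r1, $r2);

unlink($first);
unlink($second);
```

With a real table you would replace the array lookup with the SELECT above and put an index on the hash column.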

No, don't. Use a digest for duplicate checking. SHA1 is a good enough choice. It has a small, constant footprint compared to base64. Base64 is good for transmitting or exchanging binary data, but that's all. In addition, base64 output is about 1/3 larger than the binary data.

Verifying that two files are identical using pure PHP?

You want to use a hash function for that, for example SHA1. It always returns a 40-character string which you can use to compare.
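
A possible sketch of such a comparison. The helper name files_identical() is made up, and the size check before hashing is an extra shortcut that is not part of the answer above.

```php
<?php
// Hypothetical helper: true when both files have the same content
// (up to the negligible chance of a SHA1 collision).
function files_identical(string $pathA, string $pathB): bool
{
    if (!is_readable($pathA) || !is_readable($pathB)) {
        throw new RuntimeException('Both files must exist and be readable');
    }
    // Files of different sizes can never be identical, and checking the
    // size is far cheaper than hashing.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }
    return hash_file('sha1', $pathA) === hash_file('sha1', $pathB);
}

// Demo with temporary files.
$a = tempnam(sys_get_temp_dir(), 'cmp');
$b = tempnam(sys_get_temp_dir(), 'cmp');
$c = tempnam(sys_get_temp_dir(), 'cmp');
file_put_contents($a, 'hello world');
file_put_contents($b, 'hello world');
file_put_contents($c, 'hello w0rld'); // same size, different content

$sameAB = files_identical($a, $b);
$sameAC = files_identical($a, $c);
var_dump($sameAB, $sameAC);

array_map('unlink', [$a, $b, $c]);
```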

sha1, crc32, and md5 how to read this data?

How can I decode the md5, crc32, and sha1 values? Below is the XML file, followed by the code I'm using to get the data so far.
<files>
<file name="AtTheInn-Germany-Morrow78Collection.mp3" source="original">
<format>VBR MP3</format>
<title>At the Inn - Germany - Morrow 78 collection</title>
<md5>056bbd63961450d9684ca54b35caed45</md5>
<creator>Germany</creator>
<album>Morrow 78 collection</album>
<mtime>1256879264</mtime>
<size>2165481</size>
<crc32>22bab6a</crc32>
<sha1>796fccc9b9dd9732612ee626c615050fd5d7483c</sha1>
<length>179.59</length>
</file>
And this is the code I'm using to get the title and album name. How can I make sense of the sha1 and md5? Any help in any direction would be appreciated, thanks.
<?php
$search = $_GET['sku'];
$catalogfile = $_GET['file'];
$directory = "feeds/";
$xmlfile = $directory . $catalogfile;
$xml = simplexml_load_file($xmlfile);
list($product) = $xml->xpath("//file[crc32 = '$search']");
echo "<head>";
echo "<title>$product->title</title>";

MD5, SHA-1, and CRC32 are hash functions. That means that they cannot be reversed.[1] You'd have more luck looking into that name attribute of the file tag.
[1] You can[2] brute-force them, but since they can represent variable-length data as a fixed-length piece of data, due to the pigeonhole principle and just plain probability, you're more likely to get something that's not the original input than the original input.
[2] It'll take forever for SHA-1, though.

Hash functions generate numbers that represent some arbitrary data. They can be used to verify whether the data has changed (a good hash function should produce a totally different hash even if only a single bit has changed).
Since you are turning an arbitrary amount of data into a fixed-size number, you lose information, which means that it's hard to reverse them. Technically there is an infinite number of possible inputs for any given hash, as the data can be any length. Even for limited data sizes it's still possible for multiple data values to produce the same hash; this is called a collision.
For some data sets (for example passwords) you can generate all possible combinations of data and check to see if they match a hash. If you do the generation at the same time as the checking, it's known as 'brute forcing'. You can also precompute and store all possible combinations (for a limited range, for example all dictionary words or all combinations of characters under a specific size), then look the hash up. A space-efficient form of such a precomputed table is known as a rainbow table, and it is useful for reversing multiple hashes.
It's good practice to store passwords as a hash rather than in plain text, but to ensure the passwords are hard to reverse you add a bit of random data to each one and store it along with the hash; this is known as salting. The salt makes precomputed tables useless and means each password has to be brute-forced separately.
In this case they are probably hashes of the mp3 file that is specified, there to verify file integrity and show any corruption that occurs during transfer (or storage). It won't be possible to reverse them, since you would have to generate all possible combinations of megabytes of data. But if you have the file itself there wouldn't be any reason to. You can confirm they are hashes of the file by running a checksum generating program on it.
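
If you do have the file, checking it against the listed values could look roughly like this. It is a sketch under assumptions: verify_file() is a made-up helper, 'crc32b' is assumed to be the CRC variant used in the feed, and the zero-padding assumes that the 7-digit crc32 value in the XML has simply lost a leading zero.

```php
<?php
// Hypothetical helper: checks a file against digests keyed by hash algorithm name.
function verify_file(string $path, array $expected): array
{
    $results = [];
    foreach ($expected as $algo => $digest) {
        $actual = hash_file($algo, $path);
        if ($actual === false) {
            throw new RuntimeException("Cannot hash $path with $algo");
        }
        // Assumption: a digest that is too short has lost its leading zeros.
        $digest = str_pad(strtolower(trim($digest)), strlen($actual), '0', STR_PAD_LEFT);
        $results[$algo] = ($actual === $digest);
    }
    return $results;
}

// Demo: a temporary file stands in for the mp3, with digests computed up front.
$content = 'pretend this is the mp3 data';
$path = tempnam(sys_get_temp_dir(), 'chk');
file_put_contents($path, $content);

$expected = [
    'md5'    => md5($content),
    'sha1'   => sha1($content),
    'crc32b' => ltrim(hash('crc32b', $content), '0'), // mimic a dropped leading zero
];

$intact = verify_file($path, $expected);

// Simulate corruption by appending one byte, then check again.
file_put_contents($path, 'X', FILE_APPEND);
$corrupted = verify_file($path, $expected);

var_dump($intact, $corrupted);
unlink($path);
```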

Is there a PHP function or method to do a checksum on data?

I have a MySQL database I am working on in PHP that will perform address verification from a daily data feed. We do address correction on our end, because we don't have control over the source of the feed.
I am trying to come up with a method to see if an address has been changed at the source. If it has changed, then an address verification would be performed in PHP on our MySQL database.
Rather than storing a copy of the old feed, I was thinking it might be better to compute a checksum of the fields from the feed and store it with each record. Then for each later feed it would check whether the checksum has changed. Is this the best method to do this? Might there be a PHP function to do all this already? What about something in MySQL? Thanks!

crc32 is probably what you want.
In php: crc32()
In Mysql CRC32()
crc32 is probably a better fit than SHA1 or MD5 for simple comparisons/data integrity.

PHP and MySQL both support the crc32 function, which is inexpensive to run; at least it is less expensive than a hash algorithm like MD5 or SHA.

There are various hash methods you can use; either the md5 or the sha ones will be OK. You will need to store the hash string in your database to compare against.
Ideally you'd want to do something like
if (sha1(strtoupper($list_of_values)) === $stored_hashstring) {
    // skip
} else {
    // update
}
Depending on the data you might need to add additional parsing on the strings, i.e. removing spaces, etc.
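
One way the normalizing and hashing could fit together, as a rough sketch. The helper address_checksum(), the sample addresses and the choice of separator are all assumptions for the example, and the fields must always be passed in the same order.

```php
<?php
// Hypothetical helper: one checksum over the address fields of a feed record.
function address_checksum(array $fields): string
{
    // Normalize: collapse runs of whitespace, trim, and ignore case.
    $normalized = array_map(
        fn($v) => strtoupper(trim(preg_replace('/\s+/', ' ', (string) $v))),
        $fields
    );
    // Join with the ASCII unit separator so that ['12 Main', 'St'] and
    // ['12', 'Main St'] do not end up as the same input.
    return sha1(implode("\x1F", $normalized));
}

// Made-up sample records.
$stored  = address_checksum(['12 Main St', 'Springfield', 'IL', '62701']);
$same    = address_checksum([' 12  main st', 'springfield', 'il', '62701']);
$changed = address_checksum(['14 Main St', 'Springfield', 'IL', '62701']);

var_dump($stored === $same);    // only formatting differs: skip
var_dump($stored === $changed); // street number changed: update
```

Storing the 40-character result with each record is enough to detect a change without keeping the old feed.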

What's a good hash to use between PHP and Python?

I have the luxury of starting from scratch, so I'm wondering what would be a good hash to use between PHP and Python.
I just need to be able to generate the same hash from the same text in each language.
From what I read, PHP's md5() isn't going to work nicely.

md5() always plays nicely - it always does the same thing because it is a standard hashing format.
The only tripping hazard is that some languages' default return format for an MD5 hash is a 32-byte ASCII string containing hexadecimal characters, while others use a 16-byte string containing a literal binary representation of the hash.
PHP's md5() by default returns a 32-byte string, but if you pass true as the second argument, it will return the 16-byte form instead. So as long as you know which version your other language uses (in your case Python), you just need to make sure that you get the correct format from PHP.
You may be better off using the 32-byte form anyway, depending on how your applications communicate. If you use a communication protocol based on plain text (such as HTTP) it is usually safer to use plain-text versions of anything. Binary, in this case, is smaller, but liable to get corrupted in transmission by badly written servers/clients.
The binary vs. ASCII problem applies to just about any hashing algorithm you can think of.
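
A short sketch of the two forms on the PHP side (the sample text is arbitrary):

```php
<?php
$text = 'hello world';

$hex = md5($text);       // 32 hexadecimal characters
$raw = md5($text, true); // 16 raw bytes

var_dump(strlen($hex), strlen($raw));
var_dump(bin2hex($raw) === $hex); // the raw form is the same hash, just not hex-encoded
```

On the Python side, hashlib.md5(data).hexdigest() corresponds to the hex form and hashlib.md5(data).digest() to the raw form. The other thing to watch is that both sides hash the same bytes, so the text must be encoded the same way (for example UTF-8) in both languages.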

What is it you want from the hash? (portability, security, performance....)
"From what I read, PHP's md5() isn't going to work nicely."
What did you read? Why won't it work?
"I just need to be able to generate the same hash from the same text in each language"
Since PHP provides crc32 (not secure at all), md5 and sha1 out of the box, it's not exactly a huge amount of testing you need to do. Of course, if portability is not an issue then there are the mcrypt and openssl APIs available. And the hash extension (originally from PECL, bundled with PHP since 5.1.2) gives you a huge choice.

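I suggest using sha1, as it is implemented out of the box in both and is less exposed to collision attacks than md5. See: http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities. Note that practical collisions have since been demonstrated for SHA-1 as well, so if deliberately crafted collisions matter to you, a SHA-2 function such as sha256 is the safer choice.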
