disagreeing md5 hashes of same string, with example - php

This is driving me crazy: my md5 hashes don't agree. I have this string:
The Combinations Generator is a tool that allows you to easily create
a series of combinations by selecting the related attributes. For
example, if you're selling t-shirts in three different sizes and two
different colors, the generator will create six combinations for you.
When I hash it on my computer using the md5 function (with php 5.5.0) it produces the following hash: 422f3f656e1a5f95e8b5cf7565d815b5
http://www.miraclesalad.com/webtools/md5.php agrees with my computer's result.
http://www.md5.cz/ disagrees with both my computer and miraclesalad.
This string/md5 pair was initially computed by another computer which also gives the same result as md5.cz.
I read about encoding issues (although the string doesn't contain any non-ASCII characters), so I tried the following code on my computer:
<?php
$str = "The Combinations Generator is a tool that allows you to easily create a series of combinations by selecting the related attributes. For example, if you're selling t-shirts in three different sizes and two different colors, the generator will create six combinations for you.";
echo "$str<BR/>";
echo md5($str)."<BR/>";
echo md5(utf8_encode($str))."<BR/>";
echo md5(utf8_decode($str))."<BR/>";
die();
The output is:
The Combinations Generator is a tool that allows you to easily create
a series of combinations by selecting the related attributes. For
example, if you're selling t-shirts in three different sizes and two
different colors, the generator will create six combinations for you.
422f3f656e1a5f95e8b5cf7565d815b5
422f3f656e1a5f95e8b5cf7565d815b5
422f3f656e1a5f95e8b5cf7565d815b5
So it is not about utf8.
Any idea what's happening?

My best guess is that it has something to do with the ' mark in the word "you're" and character encodings. If you remove that quote both sites report the same md5.

I tried feeding the string above incrementally to both sites you linked to in your question, and it turns out that the character breaking the generator at md5.cz is the apostrophe in "if you're selling t-shirts".
If you strip the string of special characters before feeding it to a hasher, possibly preserving the string's uniqueness using something like urlencode(), you should get matching hashes for any string.

The strings need to be exactly the same, including whitespace.
The sites are probably applying some transformation like trim() or stripslashes().
md5 returns the same value only if the strings are identical byte for byte.
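To illustrate how a seemingly harmless transformation changes the hash, here is a minimal sketch. Which transformation md5.cz actually applies is not known; addslashes() is used here only as a hypothetical example of a site escaping the apostrophe before hashing, and the sample string is made up for the demonstration.

```php
<?php
// Hypothetical example: the same text hashed raw and after common transformations.
$raw = "if you're selling t-shirts ";   // note the trailing space

$variants = array(
    'raw'        => $raw,
    'addslashes' => addslashes($raw),   // you're becomes you\'re
    'trim'       => trim($raw),         // drops the trailing space
);

$hashes = array();
foreach ($variants as $label => $text) {
    $hashes[$label] = md5($text);
    printf("%-10s %s  (%d bytes)\n", $label, $hashes[$label], strlen($text));
}
```

All three hashes differ, even though the variants look almost identical on screen. Comparing the byte counts is a quick way to spot that a site changed the input.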

md5 is md5. That's all there is to it. If you get different hashes from different (non-buggy) implementations, then you're feeding in different inputs. Remember that md5 is DESIGNED to produce wildly different outputs if the inputs are even slightly different. A single whitespace character (tab, line break, etc.) at the end of one of your test strings will totally trash your expected hash, because you've fed in a different input.
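One way to track down such an invisible difference is to compare the two inputs byte by byte. The sketch below uses a hypothetical helper, firstDifference(), and two made-up strings that differ only in a line break versus a space. The string in the question is displayed wrapped over several lines but hashed as a single line, so this kind of difference may be worth checking.

```php
<?php
/**
 * Returns the byte offset of the first difference between two strings,
 * or -1 if they are identical.
 */
function firstDifference($a, $b)
{
    $max = min(strlen($a), strlen($b));
    for ($i = 0; $i < $max; $i++) {
        if ($a[$i] !== $b[$i]) {
            return $i;
        }
    }
    // Equal up to the shorter length: identical, or one is a prefix of the other.
    return strlen($a) === strlen($b) ? -1 : $max;
}

$oneLine = "easily create a series";
$wrapped = "easily create\na series";   // same words, line break instead of a space

$pos = firstDifference($oneLine, $wrapped);
echo "md5 one line: ", md5($oneLine), "\n";
echo "md5 wrapped:  ", md5($wrapped), "\n";
echo "first difference at byte $pos: ",
     bin2hex($oneLine[$pos]), " vs ", bin2hex($wrapped[$pos]), "\n";
```

The hex dump shows 20 (space) against 0a (line feed), which is enough to produce two unrelated hashes.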

Related

create random short string [a-zA-Z0-9]+

What is the best way to create a short (6 chars), random string with a low collision probability? I need to create short links like bit.ly.
The problem of md5, sha1, uniqid etc. is that they don't generate uppercase characters, so I'm looking for a case-sensitive output to have a wider range of possible values...
I like to use Hashids for this kind of thing:
Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.
It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.
Hashids has been ported to many languages, including PHP.
(Note that, despite the name, Hashids is not a true hashing system since it is designed to be reversible.)
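For comparison, if plain random codes are enough, a library is not strictly needed. The following is a minimal sketch, not the Hashids approach: randomShortCode() is a hypothetical helper name, and random_int() requires PHP 7 or later.

```php
<?php
/**
 * Returns a random string of $length characters drawn from [a-zA-Z0-9].
 * Uses random_int(), which is available from PHP 7 onwards.
 */
function randomShortCode($length = 6)
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $last = strlen($alphabet) - 1;
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code .= $alphabet[random_int(0, $last)];
    }
    return $code;
}

echo randomShortCode(), "\n";
```

Six characters from a 62-character alphabet give 62^6 (about 5.7 × 10^10) possible codes. Random codes can still collide, so a unique index on the stored code plus a retry on conflict is a reasonable safeguard.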

Is there a php port of namick/obfuscate_id rails plugin ?

Please check this
https://github.com/namick/obfuscate_id
This plugin converts id 7000 to 5270192353
I tried https://github.com/ivanakimov/hashids.php and similar ones, but they convert IDs into a mix of letters and numbers (like yJJpo90). I don't want that. I want IDs converted into positive integers. Are there any PHP packages of this sort?
You can try Optimus id transformation:
With this library, you can transform your internal id's to obfuscated integers based on Knuth's integer hash. It is similar to Hashids, but will generate integers instead of random strings. It is also super fast.
https://github.com/jenssegers/optimus
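The idea behind this kind of integer obfuscation can be sketched in a few lines. This is an illustration of Knuth-style multiplicative hashing on 31-bit integers, not Optimus's actual code: the function names and both constants are assumptions chosen for the example, and 64-bit PHP is assumed so that the intermediate products do not overflow.

```php
<?php
// Sketch of Knuth-style multiplicative obfuscation on 31-bit integers.
// Assumes 64-bit PHP; all names and constants here are illustrative.
const MAX_ID     = 0x7FFFFFFF;   // 2^31 - 1, used as a bit mask
const MULTIPLIER = 1580030173;   // arbitrary odd constant (assumption)
const XOR_MASK   = 1163945558;   // arbitrary constant (assumption)

/** Modular inverse of an odd number modulo 2^31, via Newton iteration. */
function inverseMod2Pow31($a)
{
    $x = $a;                     // for odd $a, $a * $a = 1 (mod 8): 3 correct bits
    for ($i = 0; $i < 5; $i++) {
        $t = ($a * $x) & MAX_ID;
        $x = ($x * ((2 - $t) & MAX_ID)) & MAX_ID;   // each pass doubles the correct bits
    }
    return $x;
}

/** Maps an ID in [0, 2^31 - 1] to another integer in the same range. */
function obfuscateId($id)
{
    return (($id * MULTIPLIER) & MAX_ID) ^ XOR_MASK;
}

/** Reverses obfuscateId(). */
function revealId($value)
{
    return (($value ^ XOR_MASK) * inverseMod2Pow31(MULTIPLIER)) & MAX_ID;
}

$hidden = obfuscateId(7000);
echo "7000 -> $hidden -> ", revealId($hidden), "\n";
```

Any odd multiplier is invertible modulo 2^31, which is what makes the mapping reversible. This hides sequential IDs from casual inspection but is not encryption.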

Find common substrings from 2 separate sets

Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B.
Can you please help me in finding such an algorithm?
I normally program in PHP, a SQL-only solution would be even better, but any other language would go.
As Voitcus said before, you have to clean your data before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern, try to adjust them to it. Then you can look for duplicates.
Moreover, you should clean the data before persisting it, maybe into a separate column. You then don't have to do the cleaning while looking for duplicates, which avoids performance peaks.
Algorithms like levenshtein() or similar_text() in PHP do not fit this use case well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways, some regular expression would be the best, but see below.
Then, if possible, determine the country calling code from the user's country, if one is on record. If there is none, assume a default and prepend it to the string. The same probably applies to city (area) codes. You could also look at where the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
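The normalization described above might look like the following sketch. normalizePhone() is a hypothetical helper, and its rules are assumptions for illustration only: "00" is treated as the international prefix, and entries without it are assumed to belong to a default country code.

```php
<?php
/**
 * Reduces a free-form phone entry to digits with a country code.
 * Assumptions (not taken from real data): "00" is the international
 * prefix, and entries without it belong to $defaultCountryCode.
 */
function normalizePhone($raw, $defaultCountryCode = '1')
{
    $digits = preg_replace('/\D+/', '', $raw);   // keep digits only
    if ($digits === '') {
        return '';
    }
    if (strncmp($digits, '00', 2) === 0) {
        return (string) substr($digits, 2);      // already has a country code
    }
    return $defaultCountryCode . $digits;
}

echo normalizePhone('001-555-123456'), "\n";
echo normalizePhone('555-123456 call me in the evening'), "\n";
```

Both sample entries normalize to the same string and can then be compared with a plain equality check. Digits inside a free-text note (such as "call me after 6 p.m.") would still end up in the result, so notes need separate handling.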
The other way is to compare strings with the country (and city) code removed.
About searching for "the longest common substring": numbers filtered this way are directly comparable, but you might still need it, e.g. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, and nobody typed something like 555-SUPERMAN, which translates to 555-78737626, there is also the possibility of removing everything after the last alphanumeric character (and this character as well).
It is also possible to filter such data in the SQL statement. Consider something like SELECT ..., [your trimming function(phone_number)] AS trimmed_phone WHERE (trimmed_phone contains non-numeric characters) GROUP BY trimmed_phone. If the trimming function removes only whitespace and special dividers such as -, +, . (commonly used in Germany), and perhaps others, this query returns all trimmed phone numbers that still contain non-numeric characters. Take a look at the results, which will probably be mostly digits and letters. How many of them are there? Do they have something in common? Are there typical phrases you can filter out too?
If such a query does not return very many rows, maybe it is easier just to fix them by hand?
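For the original question (the longest common substring between any string in set A and any string in set B), a brute-force sketch is to strip non-digits and run the classic dynamic-programming comparison on every pair. The function names are hypothetical, and the approach is quadratic per pair, so it suits small sets of phone numbers per profile rather than whole-database scans.

```php
<?php
/** Longest common substring of two strings (classic DP, O(n*m) time). */
function longestCommonSubstring($a, $b)
{
    $best = '';
    $prev = array_fill(0, strlen($b) + 1, 0);
    for ($i = 1; $i <= strlen($a); $i++) {
        $curr = array_fill(0, strlen($b) + 1, 0);
        for ($j = 1; $j <= strlen($b); $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the common run ending at the previous characters.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > strlen($best)) {
                    $best = substr($a, $i - $curr[$j], $curr[$j]);
                }
            }
        }
        $prev = $curr;
    }
    return $best;
}

/** Best common digit run between any entry of $setA and any entry of $setB. */
function longestCommonAcrossSets(array $setA, array $setB)
{
    $best = '';
    foreach ($setA as $rawA) {
        $a = preg_replace('/\D+/', '', $rawA);
        foreach ($setB as $rawB) {
            $b = preg_replace('/\D+/', '', $rawB);
            $common = longestCommonSubstring($a, $b);
            if (strlen($common) > strlen($best)) {
                $best = $common;
            }
        }
    }
    return $best;
}

$profileA = array('001-555-123456', '555-999000');
$profileB = array('555-123456 call me in the evening', '444-000111');
echo longestCommonAcrossSets($profileA, $profileB), "\n";
```

The length of the result can then feed into the existing matching score, for example by treating a common run of seven or more digits as a strong hint. That threshold is an assumption to tune against real data.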

Auto-generate Base62 numbers without vowels (PHP)?

This may end up being a trivial question - I know I'm going to need to do this soon for an app I'm working on, but haven't really worked on it myself yet - I'm really just floating it to see if there's an obvious method I'm missing.
Basically, what I need is to generate a sequence of numbers using a-z, A-Z, 0-9, except without vowels. There is a small chance I will need to make it unpredictable, so being able to generate out of order is a bonus.
I'm initially thinking for each new one to just work forward from the last no-vowel match until I find the next one (or generate random numbers until I get one I don't have already in the case of unpredictable values), but is there a better way? Perhaps a baseX number system obj that allows you to specify the allowable characters?
Using PHP/MySQL if it matters.
There's a function in an answer of mine here that can convert from any base to any other and which lets you customize the digit pool; it also works on arbitrary-sized input. You can generate a sequence in base 10 and convert to whatever you need.
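As a rough illustration of the idea (not the function referred to above), the following sketch converts ordinary integers to and from a custom digit pool. The helper names are hypothetical, intdiv() requires PHP 7 or later, and unlike the linked function this version is limited to native PHP integers.

```php
<?php
/** Encodes a non-negative integer using the characters of $alphabet as digits. */
function encodeWithAlphabet($number, $alphabet)
{
    $base = strlen($alphabet);
    $out = '';
    do {
        $out = $alphabet[$number % $base] . $out;
        $number = intdiv($number, $base);
    } while ($number > 0);
    return $out;
}

/** Reverses encodeWithAlphabet(); assumes every character is in $alphabet. */
function decodeWithAlphabet($text, $alphabet)
{
    $base = strlen($alphabet);
    $number = 0;
    for ($i = 0; $i < strlen($text); $i++) {
        $number = $number * $base + strpos($alphabet, $text[$i]);
    }
    return $number;
}

// 0-9, a-z, A-Z with the ten vowels removed: 52 digits remain.
$noVowels = str_replace(
    str_split('aeiouAEIOU'),
    '',
    '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
);

for ($n = 48; $n < 56; $n++) {
    echo $n, ' => ', encodeWithAlphabet($n, $noVowels), "\n";
}
```

Counting upward in base 10 and encoding each value yields the sequence in order. For unpredictable output, one option is to encode values from a shuffled or obfuscated counter instead.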

Convert Chinese Pinyin with accents to numerical form

I'm looking to convert Pinyin where the tone marks are written with accents (e.g.: Nín hǎo) to Pinyin written in numerical/ASCII form (e.g.: Nin2 hao1).
Does anyone know of any libraries for this, preferably PHP? Or know Chinese/Pinyin well enough to comment?
I started writing one myself that was rather simple, but I don't speak Chinese and don't fully understand the rules of when words should be split up with a space.
I was able to write a translator that converts:
Nín hǎo. Wǒ shì zhōng guó rén ==> Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
But how do you handle words like the following - do they get split up with a space into multiple words, or do you interject the tone numbers within the word (if so, where?) :
huā shíjiān, wèishénme, yuèláiyuè, shēngbìng, etc.
The problem with parsing pinyin without a space separating each word is that there will be ambiguity. Take, for instance, the name of an ancient Chinese capital, 长安: Cháng'ān (notice the disambiguating apostrophe). If we strip out the apostrophe, however, this can be interpreted in two ways: Chán gān or Cháng ān. A Chinese speaker would tell you that the second is far more likely, depending on the context of course, but there's no way your computer can do that.
Assuming no ambiguity, and that all input are valid, the way I would do it would look something like this:
Create accent folding function
Create an array of valid pinyin (You should take it from the Wikipedia page for pinyin)
Match each word to the list of valid pinyin
Check ahead to the next word when there is ambiguity about the possibility of the last character belonging to the next word, such as:
shēngbìng
    ^ Does this 'g' belong to the next word?
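The matching step in the list above could be sketched with greedy maximum matching. This is an illustration under stated assumptions: the input is already accent-folded ASCII, splitSyllables() is a hypothetical helper, and the syllable table is a tiny sample rather than the full list of valid pinyin.

```php
<?php
/**
 * Splits an accent-free pinyin word into syllables by greedy maximum matching.
 * Returns null if some part of the word is not covered by the table.
 */
function splitSyllables($word, array $syllables)
{
    $known = array_flip($syllables);
    $maxLen = max(array_map('strlen', $syllables));
    $parts = array();
    $pos = 0;
    while ($pos < strlen($word)) {
        $found = null;
        // Try the longest candidate first, then progressively shorter ones.
        for ($len = min($maxLen, strlen($word) - $pos); $len > 0; $len--) {
            $candidate = substr($word, $pos, $len);
            if (isset($known[$candidate])) {
                $found = $candidate;
                break;
            }
        }
        if ($found === null) {
            return null;
        }
        $parts[] = $found;
        $pos += strlen($found);
    }
    return $parts;
}

// Tiny illustrative table; a real one would hold all valid pinyin syllables.
$table = array('sheng', 'shen', 'bing', 'wei', 'me', 'yue', 'lai');
print_r(splitSyllables('shengbing', $table));
print_r(splitSyllables('weishenme', $table));
```

Because the longest match wins, the 'g' in shengbing is assigned to the first syllable. A greedy split never backtracks, so it can fail or choose the less likely reading for genuinely ambiguous input such as the Cháng'ān case.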
Anyway, the correct positioning of the numerical representation of the tones, and the correct numerals to represent each accent, are covered fairly well in this section of the Wikipedia article on pinyin: http://en.wikipedia.org/wiki/Pinyin#Numerals_in_place_of_tone_marks. You might also want to have a look at how IMEs do their job.
Spacing should stay the same, but you got the tone numbering wrong.
Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2.
wèishénme becomes wei4shen2me.
Remove diacritical marks by mapping "āáǎà" to "a", etc.
Using simple maximum matching algorithm, split compounds into syllables (there are only 418 or so Mandarin syllables).
Append numbers (you have to remember what kind of mark you removed) and join syllables back into compounds.
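The first and last of these steps can be sketched as follows for the simple case where every syllable is already separated by spaces or punctuation, as in the question's own example. Compounds such as wèishénme would first need the splitting step, which is not included here. The function names are hypothetical, and the input is assumed to be UTF-8 with precomposed lowercase accented vowels.

```php
<?php
/** Maps each tone-marked vowel to array(plain vowel, tone number). */
function pinyinToneMap()
{
    $rows = array(
        'a' => array('ā', 'á', 'ǎ', 'à'),
        'e' => array('ē', 'é', 'ě', 'è'),
        'i' => array('ī', 'í', 'ǐ', 'ì'),
        'o' => array('ō', 'ó', 'ǒ', 'ò'),
        'u' => array('ū', 'ú', 'ǔ', 'ù'),
        'ü' => array('ǖ', 'ǘ', 'ǚ', 'ǜ'),
    );
    $map = array();
    foreach ($rows as $plain => $marked) {
        foreach ($marked as $index => $char) {
            $map[$char] = array($plain, $index + 1);   // tones are numbered 1 to 4
        }
    }
    return $map;
}

/** Converts ONE syllable: strips the tone mark and appends its number. */
function syllableToNumbered($syllable)
{
    foreach (pinyinToneMap() as $marked => $info) {
        if (strpos($syllable, $marked) !== false) {
            return str_replace($marked, $info[0], $syllable) . $info[1];
        }
    }
    return $syllable;   // neutral tone or no mark: left unchanged
}

/** Converts text in which every syllable is its own word. */
function pinyinToNumbered($text)
{
    return preg_replace_callback('/\p{L}+/u', function ($m) {
        return syllableToNumbered($m[0]);
    }, $text);
}

echo pinyinToNumbered('Nín hǎo. Wǒ shì zhōng guó rén'), "\n";
// prints: Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
```

Syllables without a tone mark are left unchanged, which matches the wei4shen2me example above where the neutral-tone "me" gets no number.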
