In PHP I am calculating the Levenshtein distance with the levenshtein() function. It works as expected for plain ASCII characters, but for characters with diacritics, as in this example:
echo levenshtein('à', 'a');
it returns "2". In this case only one replacement has to be done, so I expect it to return "1".
Am I missing something?
I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a French accented word:
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms
The default PHP levenshtein(), like many PHP functions, is not multibyte aware. So, when processing strings with Unicode characters, it compares them byte by byte. In UTF-8, 'à' is stored as two bytes, so turning it into 'a' costs two edits (one substitution and one deletion).
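To see the byte-level view for yourself, here is a minimal sketch. It assumes PHP 7.0+ (for the \u{...} escape), and the variable name is only for illustration:

```php
<?php
// 'à' (U+00E0), written as an escape so the file encoding does not matter.
$accented = "\u{00E0}";

echo strlen($accented), "\n";            // byte count of the UTF-8 string: 2
echo bin2hex($accented), "\n";           // the two bytes: c3a0
echo levenshtein($accented, 'a'), "\n";  // byte-wise distance: 2
```

One byte gets substituted and the other deleted, which is where the "2" comes from.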
There is no multibyte version (i.e. mb_levenshtein()) so you have two options:
1) Re-implement the function yourself, using mb_ functions. Possible example code from a Gist:
<?php
function levenshtein_php($str1, $str2){
$length1 = mb_strlen( $str1, 'UTF-8');
$length2 = mb_strlen( $str2, 'UTF-8');
if( $length1 < $length2) return levenshtein_php($str2, $str1);
if( $length1 == 0 ) return $length2;
if( $str1 === $str2) return 0;
$prevRow = range( 0, $length2);
$currentRow = array();
for ( $i = 0; $i < $length1; $i++ ) {
$currentRow=array();
$currentRow[0] = $i + 1;
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
for ( $j = 0; $j < $length2; $j++ ) {
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
$insertions = $prevRow[$j+1] + 1;
$deletions = $currentRow[$j] + 1;
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
$currentRow[] = min($insertions, $deletions, $substitutions);
}
$prevRow = $currentRow;
}
return $prevRow[$length2];
}
2) Convert your string's Unicode characters to ASCII. If you are specifically wanting to calculate Levenshtein differences from diacritic characters to non-diacritics, though, this is probably not what you want.
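As a variation on option 1, here is a rough sketch that splits each string into characters once with preg_split() and the /u modifier, instead of calling mb_substr() inside the loops. The function name levenshtein_chars is made up for this example, and it assumes both inputs are valid UTF-8:

```php
<?php
// Character-wise Levenshtein distance for UTF-8 strings (illustrative sketch).
function levenshtein_chars(string $a, string $b): int
{
    // Split into arrays of whole characters; fall back to [] on invalid UTF-8.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Distances from the empty prefix of $a to every prefix of $b.
    $prev = range(0, count($t));
    foreach ($s as $i => $cs) {
        $curr = [$i + 1];
        foreach ($t as $j => $ct) {
            $curr[] = min(
                $prev[$j + 1] + 1,                   // drop a character of $a
                $curr[$j] + 1,                       // add a character of $b
                $prev[$j] + ($cs === $ct ? 0 : 1)    // substitute (or match)
            );
        }
        $prev = $curr;
    }
    return $prev[count($t)];
}

echo levenshtein_chars("\u{00E0}", 'a'), "\n";        // 1
echo levenshtein_chars('notre', "n\u{00F4}tre"), "\n"; // 1
```

It is still plain PHP, so expect it to be much slower than the built-in C function on long strings.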
I'm iterating through each character in a string in PHP.
Currently I'm using direct access
$len=strlen($str);
for($i=0;$i<$len;$i++){
$char=$str[$i];
....
}
That got me pondering what is probably purely academic.
How does direct access work under the hood and is there a length of string that would see optimization in a character loop(micro though it may be) by splitting said string into an array and using the array's internal pointer to keep index location in memory?
TLDNR:
Would accessing each member of a 5 million item array be faster than accessing each character of a 5 million character string directly?
Accessing a string's bytes is faster by an order of magnitude. Why? A PHP string is stored as a contiguous run of bytes, so $str[$i] is a direct offset lookup: it goes right to the location it needs, reads one byte, and is done. Note that unless every character is single-byte, indexing the string this way gives you a byte, not necessarily a usable character.
When accessing a potentially multi-byte string (via mb_substr), a number of additional steps are needed on every call: in general the string has to be walked from the start to find where the requested character begins, then its byte length has to be worked out, and finally those bytes are copied and returned as the individual [possibly multi-byte] character.
So, I put together a simple test just to show that byte access is orders of magnitude faster (but will not give you a usable character if a multi-byte character sits at the given byte index). I grabbed the random character function from here ( Optimal function to create a random UTF-8 string in PHP? (letter characters only) ), then added the following:
$str = rand_str( 5000000, 5000000 );
$bStr = unpack('C*', $str);
$len = count($bStr)-1;
$i = 0;
$startTime = microtime(true);
while($i++<$len) {
$char = $str[$i];
}
$endTime = microtime(true);
echo '<pre>Array access: ' . $len . ' items: ', $endTime-$startTime, ' seconds</pre>';
$i = 0;
$len = mb_strlen($str)-1;
$startTime = microtime(true);
while($i++<$len) {
$char = mb_substr($str, $i, 1);
if( $i >= 100000 ) {
break;
}
}
$endTime = microtime(true);
echo '<pre>Substring access: ' . ($len+1) . ' (limited to ' . $i . ') items: ', $endTime-$startTime, ' seconds</pre>';
You will notice that the mb_substr loop I have restricted to 100,000 characters. Why? It just takes too darn long to run through all 5,000,000 characters!
What were my results?
Array access: 12670380 items: 0.4850001335144 seconds
Substring access: 5000000 (limited to 100000) items: 17.00200009346 seconds
Notice the string array access was able to get through all 12,670,380 bytes -- yep, 12.6 MILLION bytes from 5 MILLION characters [many were multi-byte] -- in just half a second, while the mb_substr loop, limited to 100,000 characters, took 17 seconds!
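If you do need whole characters rather than bytes, one way to avoid paying the mb_substr cost on every iteration is to split the string once and loop over the resulting array. A minimal sketch, assuming valid UTF-8 input (mb_str_split() does the same job on PHP 7.4+); the sample string is just an illustration:

```php
<?php
// "naïve café", with the accented letters written as escapes.
$str = "na\u{00EF}ve caf\u{00E9}";

// Split once into whole UTF-8 characters.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo strlen($str), "\n";    // 12 bytes
echo count($chars), "\n";   // 10 characters

foreach ($chars as $index => $char) {
    // $char is a complete character here, even when it spans two bytes.
    echo $index, ': ', $char, "\n";
}
```

The trade-off is memory: the array is far larger than the string itself, so for multi-megabyte inputs this may not be practical.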
The answer to your question is that your current method is highly likely the fastest way.
Why?
Since a string in PHP is just an array of bytes (with one byte per character only in single-byte encodings; in UTF-8 a character can take up to four bytes), there shouldn't be a theoretically faster form of array.
Moreover, any additional implementation of an array to which you'd copy the characters of your original string would add overhead and slow things down.
If your string is highly limited in its contents (for instance, only allowing 16 characters instead of 256), there may be faster implementations, but that seems like an edge case.
Quick answer (for non-multibyte strings which may have been what the OP was asking about, and useful to others as well): Direct access is still faster (by about a factor of 2). Here's the code, based on the accepted answer, but doing an apples-apples comparison of using substr() rather than mb_substr()
$str = base64_encode(random_bytes(4000000));
$len = strlen($str)-1;
$i = 0;
$startTime = microtime(true);
while($i++<$len) {
$char = $str[$i];
}
$endTime = microtime(true);
echo '<pre>Array access: ' . $len . ' items: ', $endTime-$startTime, ' seconds</pre>';
$i = 0;
$len = strlen($str)-1;
$startTime = microtime(true);
while($i++<$len) {
$char = substr($str, $i, 1);
}
$endTime = microtime(true);
echo '<pre>Substring access: ' . ($len) . ' items: ', $endTime-$startTime, ' seconds</pre>';
Note: used base64 coding of random numbers to create the random string, as rand_str was not a defined function. Maybe not exactly the most random, but certainly random enough for testing.
My results:
Array access: 5333335 items: 0.40552091598511 seconds
Substring access: 5333335 items: 0.87574410438538 seconds
Note: also tried to do a $chars = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY); and iterating through $chars. Not only was this slower, but it ran out of space with a 5,000,000 character string
I'm sorry if I'm asking the obvious, but I can't seem to find a working solution for a simple task. On the input I have a string, provided by a user, encoded with UTF-8 encoding. I need to sanitize it by removing all characters less than 0x20 (or space), except 0x9 (or tab).
The following works for ANSI strings, but not for UTF-8:
$newName = "";
$ln = strlen($name);
for($i = 0; $i < $ln; $i++)
{
$ch = substr($name, $i, 1);
$och = ord($ch);
if($och >= 0x20 ||
$och == 0x9)
{
$newName .= $ch;
}
}
It totally misses UTF-8 encoded characters and treats them as bytes. I keep finding posts where people suggest using mb_ functions, but that still doesn't help me. (For instance, I tried calling mb_strlen($name, "utf-8"); instead of strlen, but it still returns the length of the string in bytes instead of characters.)
Any idea how to do this in PHP?
PS. Sorry, my PHP is somewhat rusty.
If you use multibyte functions (mb_) then you have to use them for everything. In this example you should use mb_strlen() and mb_substr().
The reason it is not working is probably because you are using ord(). It only works with ASCII values:
ord
(PHP 4, PHP 5)
ord — Return ASCII value of character
...
Returns the ASCII value of the first character of string.
In other words, if you throw a multibyte character into ord() it will only use the first byte, and throw away the rest.
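A quick way to see this, as a small sketch assuming the mbstring extension and PHP 7.2+ (for mb_ord()); the variable name is only for illustration:

```php
<?php
// 'é' (U+00E9) is encoded in UTF-8 as the two bytes C3 A9.
$eAcute = "\u{00E9}";

echo ord($eAcute), "\n";              // 195 (0xC3, the first byte only)
echo mb_ord($eAcute, 'UTF-8'), "\n";  // 233 (0xE9, the actual code point)
```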
Wow, PHP is one messed up language. Here's what worked for me (but how much slower will this run for a longer chunk of text...):
function normalizeName($name, $encoding_2_use, $encoding_used)
{
//'$name' = string to normalize
// INFO: Must be encoded with '$encoding_used' encoding
//'$encoding_2_use' = encoding to use for return string (example: "utf-8")
//'$encoding_used' = encoding used to encode '$name' (can be also "utf-8")
//RETURN:
// = Name normalized, or
// = "" if error
$resName = "";
$ln = mb_strlen($name, $encoding_used);
if($ln !== false)
{
for($i = 0; $i < $ln; $i++)
{
$ch = mb_substr($name, $i, 1, $encoding_used);
$arp = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', $encoding_used));
if(count($arp) >= 1)
{
$och = intval($arp[1]); // unpack() numbers unnamed elements starting at 1, hence index 1
if($och >= 0x20 || $och == 0x9)
{
$ch2 = mb_convert_encoding('&#'.$och.';', $encoding_2_use, 'HTML-ENTITIES');
$resName .= $ch2;
}
}
}
}
return $resName;
}
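For what it's worth, a shorter alternative might be a single regular expression. This is only a sketch under the assumption that the input is valid UTF-8; the function name strip_control_chars is made up here. In UTF-8, bytes below 0x80 never occur inside a multi-byte sequence, so removing the control range cannot damage other characters:

```php
<?php
// Remove every character below 0x20 except tab (0x09); illustrative sketch.
function strip_control_chars(string $name): string
{
    // With the /u modifier preg_replace() returns null on invalid UTF-8.
    $clean = preg_replace('/[\x00-\x08\x0A-\x1F]/u', '', $name);
    return $clean === null ? '' : $clean;
}

// Keeps the tab and the accented letter, drops 0x01 and the newline.
var_dump(strip_control_chars("a\x01b\tc\u{00E9}\n"));
```

Like normalizeName() above, it returns "" on error, which in this sketch means the input was not valid UTF-8.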
I'm looking for PHP code or a library that I can pass a text to and that will tell me:
Which encoding I need to use in order to send this text as an SMS (7, 8, or 16 bit)
How many SMS messages I will need to send this text (it must be smart enough to account for the "segmentation information", as in http://ozekisms.com/index.php?owpn=612)
Do you know of any code or library that will do this for me?
Again, I'm not looking to send or convert SMS messages, just to get information about the text.
Update:
Ok I did the below code and it seems to be working fine, let me know if you have better/optimized code/solution/lib
$text = '\#£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ -./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€' ; //"\\". //'"';//' ';
print $text . "\n";
print isGsm7bit($text). "\n";
print getNumberOfSMSsegments($text). "\n";
function getNumberOfSMSsegments($text,$MaxSegments=6){
/*
http://en.wikipedia.org/wiki/SMS
Larger content (concatenated SMS, multipart or segmented SMS, or "long SMS") can be sent using multiple messages,
in which case each message will start with a user data header (UDH) containing segmentation information.
Since UDH is part of the payload, the number of available characters per segment is lower:
153 for 7-bit encoding,
134 for 8-bit encoding and
67 for 16-bit encoding.
The receiving handset is then responsible for reassembling the message and presenting it to the user as one long message.
While the standard theoretically permits up to 255 segments,[35] 6 to 8 segment messages are the practical maximum,
and long messages are often billed as equivalent to multiple SMS messages. See concatenated SMS for more information.
Some providers have offered length-oriented pricing schemes for messages, however, the phenomenon is disappearing.
*/
$TotalSegment=0;
$textlen = mb_strlen($text);
if($textlen==0) return false; //I can see most mobile devices will not allow you to send empty sms, with this check we make sure we don't allow empty SMS
if(isGsm7bit($text)){ //7-bit
$SingleMax=160;
$ConcatMax=153;
}else{ //UCS-2 Encoding (16-bit)
$SingleMax=70;
$ConcatMax=67;
}
if($textlen<=$SingleMax){
$TotalSegment = 1;
}else{
$TotalSegment = ceil($textlen/$ConcatMax);
}
if($TotalSegment>$MaxSegments) return false; //SMS is very big.
return $TotalSegment;
}
function isGsm7bit($text){
// The GSM 7-bit alphabet starts with "@"; the backslash is listed explicitly at the front
$gsm7bitChars = "\\@£\$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€";
$textlen = mb_strlen($text, 'UTF-8');
for ($i = 0;$i < $textlen; $i++){
// Compare whole characters, not bytes, and test strictly against false (a match at position 0 is valid)
if (mb_strpos($gsm7bitChars, mb_substr($text, $i, 1, 'UTF-8'), 0, 'UTF-8') === false){return false;}
}
return true;
}
I'm adding some extra information here because the previous answer isn't quite correct.
These are the issues:
You need to pass the string's encoding explicitly to the mb_* functions; otherwise they fall back to the internal encoding, which may not match your data
In 7-bit GSM encoding, the Basic Charset Extended characters (^{}\[~]|€) require 14-bits each to encode, so they count as two characters each.
In UCS-2 encoding, you have to be wary of emoji and other characters outside the 16-bit BMP, because...
GSM with UCS-2 counts 16-bit characters, so if you have a 💩 character (U+1F4A9), and your carrier and phone sneakily support UTF-16 and not just UCS-2, it will be encoded as a surrogate pair of 16-bit characters in UTF-16, and thus be counted as TWO 16-bit characters toward your string length. mb_strlen will count this as a single character only.
How to count 7-bit characters:
What I've come up with so far is the following to count 7-bit characters:
// Internal encoding must be set to UTF-8,
// and the input string must be UTF-8 encoded for this to work correctly
protected function count_gsm_string($str)
{
// Basic GSM character set (one 7-bit encoded char each)
$gsm_7bit_basic = "@£\$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà";
// Extended set (requires escape code before character thus 2x7-bit encodings per)
$gsm_7bit_extended = "^{}\\[~]|€";
$len = 0;
for($i = 0; $i < mb_strlen($str); $i++) {
$c = mb_substr($str, $i, 1);
if(mb_strpos($gsm_7bit_basic, $c) !== FALSE) {
$len++;
} else if(mb_strpos($gsm_7bit_extended, $c) !== FALSE) {
$len += 2;
} else {
return -1; // cannot be encoded as GSM, immediately return -1
}
}
return $len;
}
How to count 16-bit characters:
Convert the string into a UTF-16 representation with mb_convert_encoding($str, 'UTF-16', 'UTF-8'). This preserves the emoji characters; do not convert into UCS-2, as that conversion is lossy with mb_convert_encoding.
Count the bytes with count(unpack('C*', $utf16str)) and divide by two to get the number of 16-bit code units that count toward the GSM multipart length.
*caveat emptor, a word on counting bytes:
Do not use strlen to count the number of bytes. While it may work, strlen could be overloaded with a multibyte-capable version in older PHP installations (the mbstring.func_overload setting, deprecated in PHP 7.2 and removed in PHP 8.0), and is also a candidate for API change in the future
Avoid mb_strlen($str, 'UCS-2'). While it does currently work, and will return, correctly, 2 for a pile of poo character (as it looks like two 16-bit UCS-2 characters), its stablemate mb_convert_encoding is lossy when converting from >16-bit to UCS-2. Who's to say that mb_strlen won't be lossy in the future?
Avoid mb_strlen($str, '8bit') / 2. It also currently works, and is recommended in a PHP docs comment as a way to count bytes. But IMO it suffers from the same issue as the above UCS-2 technique.
That leaves the safest current way (IMO) as unpacking into a byte array, and counting that.
So, what does this look like?
// Internal encoding must be set to UTF-8,
// and the input string must be UTF-8 encoded for this to work correctly
protected function count_ucs2_string($str)
{
$utf16str = mb_convert_encoding($str, 'UTF-16', 'UTF-8');
// 'C*' unpacks every byte as an unsigned 8-bit integer;
// which format you choose doesn't actually matter as long as you get one value per byte
$byteArray = unpack('C*', $utf16str);
return count($byteArray) / 2;
}
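To illustrate the surrogate-pair point, here is a standalone sketch of the same counting technique. The function name utf16_code_units is made up for this example; it assumes the mbstring extension and UTF-8 input, and uses 'UTF-16BE' explicitly so the byte order is unambiguous:

```php
<?php
// Count the 16-bit code units a UTF-8 string needs in UTF-16 (illustrative).
function utf16_code_units(string $utf8): int
{
    if ($utf8 === '') {
        return 0;
    }
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    // One array entry per byte; two bytes make one 16-bit unit.
    return intdiv(count(unpack('C*', $utf16)), 2);
}

$poo = "\u{1F4A9}"; // outside the BMP, so it needs a surrogate pair

echo mb_strlen($poo, 'UTF-8'), "\n";  // 1 character
echo utf16_code_units($poo), "\n";    // 2 code units
echo utf16_code_units('abc'), "\n";   // 3 code units
```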
Putting it all together:
function multipart_count($str)
{
$one_part_limit = 160; // use a constant i.e. GSM::SMS_SINGLE_7BIT
$multi_limit = 153; // again, use a constant
$max_parts = 3; // ... constant
$str_length = count_gsm_string($str);
if($str_length === -1) {
$one_part_limit = 70; // ... constant
$multi_limit = 67; // ... constant
$str_length = count_ucs2_string($str);
}
if($str_length <= $one_part_limit) {
// fits in one part
return 1;
} else if($str_length > ($max_parts * $multi_limit)) {
// too long
return -1; // or throw exception, or false, etc.
} else {
// divide the string length by multi_limit and round up to get number of parts
return ceil($str_length / $multi_limit);
}
}
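The segment arithmetic on its own can be sketched like this, using the 160/153 and 70/67 limits quoted above. The function sms_segments and its parameters are hypothetical; $units is assumed to be the already-counted length (septets for GSM 7-bit, 16-bit code units for UCS-2), and the maximum-parts check is left out:

```php
<?php
// Number of SMS parts for a message of $units length units (illustrative).
function sms_segments(int $units, bool $gsm7bit): int
{
    $singleLimit = $gsm7bit ? 160 : 70;  // fits in one message without a UDH
    $multiLimit  = $gsm7bit ? 153 : 67;  // per part once a UDH is needed

    if ($units <= $singleLimit) {
        return 1;
    }
    return (int) ceil($units / $multiLimit);
}

echo sms_segments(160, true), "\n";  // 1
echo sms_segments(161, true), "\n";  // 2
echo sms_segments(307, true), "\n";  // 3
echo sms_segments(71, false), "\n";  // 2
```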
Turned this into a library...
https://bitbucket.org/solvam/smstools
page 1 : 160 byte
page 2 : 146 byte
page 3 : 153 byte
page 4 : 153 byte
page 5 : 153 byte, ....
So regardless of language :
// strlen($text) show bytes
$count = 0;
$len = strlen($text);
if ($len > 306) {
$len = $len - 306;
$count = floor($len / 153) + 3;
} else if($len>160){
$count = 2;
}else{
$count = 1;
}
I've got a 6-digit number and a 31-digit number (e.g. "234536" & "201103231043330478311223582826") that I need to cram into the same 22-character alphanumeric field in an API using PHP. I tried converting each to base 32 (had to use a custom function as base_convert() doesn't handle big numbers well) and joining with a single-character delimiter, but that only gets me down to 26 characters. It's a REST API, so the characters need to be URI-safe.
I'd really like to do this without creating a database table cross referencing the two numbers with another reference value, if possible. Any suggestions?
Use a radix of 62 instead. That needs 3.35 characters for the former and 17.3 characters for the latter, which round up to 4 and 18, for a total of at most 22 characters.
>>> import math
>>> math.log(10**6)/math.log(62)
3.3474826039165504
>>> math.log(10**31)/math.log(62)
17.295326786902177
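In PHP, a base-62 conversion that avoids both base_convert() and any extension could look roughly like the sketch below. The function name decimal_to_base62 and the alphabet order are my own choices, and the input is assumed to be a non-negative decimal string. It does schoolbook long division on the digits:

```php
<?php
// Convert a non-negative decimal string to base 62 (illustrative sketch).
function decimal_to_base62(string $decimal): string
{
    $alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    $digits = ltrim($decimal, '0');
    if ($digits === '') {
        return '0';
    }
    $out = '';
    while ($digits !== '') {
        // Divide the decimal string by 62, one digit at a time.
        $quotient = '';
        $remainder = 0;
        for ($i = 0, $n = strlen($digits); $i < $n; $i++) {
            $current = $remainder * 10 + (int) $digits[$i];
            $quotient .= intdiv($current, 62);
            $remainder = $current % 62;
        }
        $out = $alphabet[$remainder] . $out;
        $digits = ltrim($quotient, '0');
    }
    return $out;
}

// Fixed widths (4 + 18) mean no delimiter is needed between the two parts.
$first  = str_pad(decimal_to_base62('234536'), 4, '0', STR_PAD_LEFT);
$second = str_pad(decimal_to_base62('201103231043330478311223582826'), 18, '0', STR_PAD_LEFT);
echo strlen($first . $second), "\n"; // 22
```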
You can write something like pack() that works with big numbers using bc. Here is my quick solution, it converts your second number in a 13-character string. Pretty nice !
<?php
$i2 = "201103231043330478311223582826";
function pack_large($i) {
$ret = '';
while(bccomp($i, 0) !== 0) {
$mod = bcmod($i, 256);
$i = bcsub($i, $mod);
$ret .= chr($mod);
$i = bcdiv($i, 256);
}
return $ret;
}
function unpack_large($s) {
$ret = '0';
$len = strlen($s);
for($i = $len - 1; $i >= 0; --$i) {
$add = ord($s[$i]);
$ret = bcmul($ret, 256);
$ret = bcadd($ret, $add);
}
return $ret;
}
var_dump($i2);
var_dump($pack = pack_large($i2));
var_dump(unpack_large($pack));
Sample output :
string(30) "201103231043330478311223582826"
string(13) "jàÙl¹9±̉"
string(47) "201103231043330478311223582826.0000000000000000"
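Since you need URL-friendly characters, use base64_encode on the packed string; this will give you a 20-character string (18 if you remove the padding). Note that standard base64 output can contain +, / and =, which are not URI-safe, so swap them for the URL-safe variants (- and _) and drop the padding.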
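A minimal sketch of a URL-safe base64 round trip; the two helper names are hypothetical, not built-in PHP functions:

```php
<?php
// Encode bytes as base64 with '-' and '_' instead of '+' and '/', no padding.
function base64url_encode(string $bytes): string
{
    return rtrim(strtr(base64_encode($bytes), '+/', '-_'), '=');
}

// Reverse of the above: restore the padding and the standard alphabet.
function base64url_decode(string $text): string
{
    $padded = str_pad($text, strlen($text) + (4 - strlen($text) % 4) % 4, '=');
    return base64_decode(strtr($padded, '-_', '+/'));
}

echo base64url_encode("\xfb\xff"), "\n";                    // -_8
echo strlen(base64url_encode(str_repeat("\x00", 13))), "\n"; // 18
```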
I need PHP function that will create 8 chars long [a-z] hash from any input string.
So e.g. when I'll submit "Stack Overflow" it will return e.g. "gdqreaxc" (8 chars [a-z] no numbers allowed)
Perhaps something like:
$hash = substr(strtolower(preg_replace('/[0-9_\/]+/','',base64_encode(sha1($input)))),0,8);
This produces a SHA1 hash, base-64 encodes it (giving us the full alphabet), removes non-alpha chars, lowercases it, and truncates it.
For $input = 'yar!';:
mwinzewn
For $input = 'yar!!';:
yzzhzwjj
So the spread seems pretty good.
This function will generate a hash containing evenly distributed characters [a-z]:
function my_hash($string, $length = 8) {
// Convert to a string which may contain only characters [0-9a-p]
$hash = base_convert(md5($string), 16, 26);
// Get part of the string
$hash = substr($hash, -$length);
// In rare cases it will be too short, add zeroes
$hash = str_pad($hash, $length, '0', STR_PAD_LEFT);
// Convert character set from [0-9a-p] to [a-z]
$hash = strtr($hash, '0123456789', 'qrstuvwxyz');
return $hash;
}
By the way, if this is important for you, for 100,000 different strings you'll have ~2% chance of hash collision (for a 8 chars long hash), and for a million of strings this chance rises up to ~90%, if my math is correct.
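Those figures can be checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1)/(2N)), where N = 26^8 is the number of possible hashes. A small sketch, assuming the hash is uniformly distributed and a 64-bit PHP build; the function name is made up:

```php
<?php
// Approximate probability of at least one collision among $items hashes
// drawn uniformly from $space possible values (birthday approximation).
function collision_probability(int $items, float $space): float
{
    return 1.0 - exp(-($items * ($items - 1)) / (2.0 * $space));
}

$space = (float) (26 ** 8); // 8 characters from [a-z]

printf("%.3f\n", collision_probability(100000, $space));  // roughly 0.02
printf("%.3f\n", collision_probability(1000000, $space)); // roughly 0.9
```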
function md5toabc($myMD5)
{
$newString = "";
for ($i = 0; $i < 16; $i+=2)
{
//add the first val of 0-15 to the second val of 0-15 for a range of 0-30
$myintval = hexdec(substr($myMD5, $i, 1)) +
hexdec(substr($myMD5, $i+1, 1));
// mod by 26 and add 97 to get to the lowercase ascii range
$newString .= chr(($myintval%26) + 97);
}
return $newString;
}
Note this introduces bias to various characters, but do with it what you will.
(Like when you roll two dice, the most common value is a 7 combined...) plus the modulo, etc...
one can give you a good a-p{8} (but not a-z) by using and modifying (the output of) a well known algo:
function mini_hash( $string )
{
$h = hash( 'crc32' , $string );
for($i=0;$i<8;$i++) {
$h[$i] = chr(97+hexdec($h[$i])); // 0-15 maps to a-p
}
return $h;
}
interesting set of constraints you posted there
how about
substr(preg_replace('/[0-9]/', '', md5($mystring)), 0, 8);
(note that stripping the digits can leave fewer than 8 letters)
you could add a bit more entropy by doing a
$myString = str_replace("1", "g", $myString);
$myString = str_replace("2", "h", $myString);
$myString = str_replace("3", "i", $myString);
etc. on the md5() output instead of stripping the digits.