Is there an encoding function in PHP that encodes strings so that the output contains only letters and numbers? I would use base64, but its output still includes some characters that are not alphanumeric.
You could use base32 (code easy to google), which is sort of a standard alternative to base64. Or resort to bin2hex() and pack("H*",$hex) to reverse. Hex encoding however leads to size doubling.
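As a rough sketch of the hex route mentioned above: bin2hex() produces only the characters 0-9 and a-f, and either pack("H*", ...) or hex2bin() (PHP 5.4+) reverses it. The sample string is only an illustration.

```php
<?php
// Hex-encode a string so the result contains only 0-9 and a-f.
$original = 'hello, world!';
$hex = bin2hex($original);

// Two equivalent ways to reverse the encoding.
$viaPack = pack('H*', $hex);
$viaHex2bin = hex2bin($hex);

echo $hex, "\n";
echo strlen($original), ' bytes -> ', strlen($hex), " hex characters\n";
```

The output is always exactly twice as long as the input, which is the size doubling mentioned above.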
Short answer is no. base64 uses a reduced set of output chars compared with uuencode and was intended to solve most character conversion issues, but it still isn't URL-safe: its alphabet includes '+' and '/', and it pads with '='.
But the mechanism is trivial and easily adapted. I'd suggest having a look at base32 encoding: same idea as base64, but each output char carries 5 bits instead of 6 (hence a 32 char alphabet is all that's required). Use something different for the padding char, though ('=' is not URL-safe).
A quick google found this
Any of the hash functions (md5, sha1, etc.) output will only consist of hexadecimal digits but that's not exactly 'encoding'.
You could write your own base-62 encoder/decoder using a-z/A-Z/0-9. If you encode byte by byte you'd need 2 digits for every byte though (62^2 = 3844 covers all 256 values), so it is no more compact than hex.
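For what it's worth, a base-62 encoder can also treat the whole input as one big number instead of going byte by byte, which gives shorter output. Below is a minimal sketch of that idea using plain long division; the function names and the alphabet order are my own assumptions, and leading zero bytes are not preserved, so treat it as a starting point only.

```php
<?php
// Assumed alphabet order: digits, then upper case, then lower case.
const BASE62_ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Convert a big number, given as a list of digits (most significant first),
// from one base to another by repeated long division.
function convert_base_sketch(array $digits, int $from, int $to): array
{
    $result = [];
    while (count($digits) > 0) {
        $remainder = 0;
        $quotient = [];
        foreach ($digits as $d) {
            $acc = $remainder * $from + $d;
            $q = intdiv($acc, $to);
            $remainder = $acc % $to;
            // Skip leading zeros of the quotient.
            if ($q > 0 || count($quotient) > 0) {
                $quotient[] = $q;
            }
        }
        array_unshift($result, $remainder);
        $digits = $quotient;
    }
    return $result;
}

function base62_encode_sketch(string $bytes): string
{
    if ($bytes === '') {
        return '';
    }
    $out = '';
    foreach (convert_base_sketch(array_values(unpack('C*', $bytes)), 256, 62) as $d) {
        $out .= BASE62_ALPHABET[$d];
    }
    return $out;
}

function base62_decode_sketch(string $text): string
{
    if ($text === '') {
        return '';
    }
    $digits = [];
    foreach (str_split($text) as $ch) {
        $pos = strpos(BASE62_ALPHABET, $ch);
        if ($pos === false) {
            throw new InvalidArgumentException("Not a base-62 character: $ch");
        }
        $digits[] = $pos;
    }
    return pack('C*', ...convert_base_sketch($digits, 62, 256));
}

echo base62_encode_sketch('hi'), "\n";
```

Because the input is read as a number, "\0abc" and "abc" encode to the same text; prefix a length or a non-zero marker byte if that matters for your data.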
I wrote this to use letters, numbers and dashes.
I'm sure you can improve it to take out the dashes:
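function pj_code($str) {
    $enc = '';
    $len = strlen($str);
    // Walk the string backwards; pj_decode() reverses the order again.
    while ($len--) {
        $enc .= base_convert((string) ord($str[$len]), 10, 36) . '-';
    }
    // Drop the trailing dash so decoding does not see an empty chunk.
    return rtrim($enc, '-');
}
function pj_decode($str) {
    $dec = '';
    if ($str === '') {
        return $dec;
    }
    $ords = explode('-', $str);
    $c = count($ords);
    while ($c--) {
        $dec .= chr((int) base_convert($ords[$c], 36, 10));
    }
    return $dec;
}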
You can use the basic md5() hash function, whose output contains only alphanumeric characters.
I was searching for a proper way in PHP to detect whether a string is entirely within the BMP (Basic Multilingual Plane), but I found nothing. Even mb_check_encoding and mb_detect_encoding do not offer any help in this particular case.
So I wrote my own code:
<?php
function is_bmp($string) {
$str_ar = mb_str_split($string);
foreach ($str_ar as $char) {
/*Check if there's any character's code point outside the BMP range*/
if (mb_ord($char) > 0xFFFF)
return false;
}
return true;
}
/*String containing non-BMP Unicode characters*/
$string = '😈blah blah';
var_dump(is_bmp($string));
?>
Outputs:
bool(false)
Now my question is:
Is there a better approach, and are there any flaws in mine?
If you have a correctly UTF-8 encoded input string, you can just check its bytes to figure out whether it has symbols outside the BMP or not.
Literally, you need to detect whether the input string contains any symbol whose codepoint is greater than 0xFFFF (i.e. longer than 16 bits).
Note on how UTF-8 encoding works:
Codepoints 0 thru 0x7F are encoded as is, in one byte.
All other codepoints have a first byte in the range 0xC0...0xFF, which also encodes how many additional bytes follow, and additional bytes in the range 0x80...0xBF.
To encode code points 0x10000 and greater, UTF-8 requires a sequence of 4 bytes, and the first byte of that sequence will be 0xF0 or greater. In all other cases the whole string will contain only bytes less than 0xF0.
In short, your task is just to find out whether the binary representation of the string contains any byte in the range 0xF0...0xFF.
function is_bmp($string) {
    // No lead byte 0xF0..0xFF means no 4-byte sequence, so everything is in the BMP.
    return preg_match('#[\xF0-\xFF]#', $string) === 0;
}
OR
even simpler (but probably slower), you can use the ability of PCRE to work with UTF-8 sequences (see option PCRE_UTF8):
function is_bmp($string) {
    // No code point above U+FFFF means the string is entirely in the BMP.
    return preg_match('#[^\x00-\x{FFFF}]#u', $string) === 0;
}
var_dump(
!preg_match('/[^\x0-\x{ffff}]/u', '😈blah blah')
);
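To compare the two approaches side by side, here is a small sketch. The helper names are made up for the illustration, both helpers assume the input is valid UTF-8, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Byte-level check: only 4-byte sequences (code points >= U+10000)
// start with a lead byte in the range 0xF0-0xFF.
function is_bmp_bytes(string $utf8): bool
{
    return preg_match('/[\xF0-\xFF]/', $utf8) === 0;
}

// Code-point check: let PCRE decode the UTF-8 (the /u modifier).
function is_bmp_codepoints(string $utf8): bool
{
    return preg_match('/[^\x{0}-\x{FFFF}]/u', $utf8) === 0;
}

$samples = [
    'blah blah',            // ASCII only
    "\u{0416}\u{20AC}",     // Cyrillic Zhe and the euro sign, both in the BMP
    "\u{1F608}blah blah",   // emoji outside the BMP
];
foreach ($samples as $s) {
    var_dump(is_bmp_bytes($s), is_bmp_codepoints($s));
}
```

Both helpers should agree on every valid UTF-8 string; they can differ on malformed input, where preg_match() with /u returns false instead of 0 or 1.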
I have a large string $string that, when passed to md5(), gives me
c4ca4238a0b923820dcc509a6f75849b
The length is 32 and I want to reduce it, so
base64_encode(md5($string, true));
xMpCOKC5I4INzFCab3WEmw==
Removing the trailing == gives me a string of length 22.
Are there any better algorithms?
I am not sure you realised that md5 is a hash function, and therefore irreversible. If you do not care about reversibility, you could just as well trim the md5 hash (or any hash of your liking*) down to an arbitrary number of characters. All this would do is increase the likelihood of collision (I feel this does not produce a uniform distribution though).
If you are looking for a reversible (ie. non-destructive) compression, then do not reinvent the wheel. Use the built-in functions, such as gzdeflate() or gzcompress(), or other similar functions.
*Here is a list of hash functions (wikipedia) along with the size of their output.
I suppose the smallest possible "hash function" would be a parity bit :)
One way to look at it is as a change of base: instead of printing the 128-bit digest in hexadecimal (as md5() does by default), print the same value in base64.
A hexadecimal character carries 4 bits and a base64 character carries 6 bits, so every 3 hexadecimal characters make up 2 base64 characters.
To perform the conversion by hand, you can do the following:
Split the string into 3 character chunks (the last chunk of a 32 character md5 has only 2 characters; pad it with a trailing 0).
Read each chunk as a 12-bit number: multiply the first digit by 256, the second by 16, and add the third (keeping in mind that a-f = 10-15).
Split that number into two 6-bit values (the quotient and the remainder of dividing by 64) and match each one to the base64 scheme using the table from here: https://en.wikipedia.org/wiki/Base64
This will result in a 22 character base64 string with the same value as the hexadecimal representation of the md5 string, which is the same length you get from base64_encode(md5($string, true)) once the padding is removed.
Theoretically, you could do the same for any base. If we had a way to encode base128 strings in ASCII, we could end up with a 19 character string (128 / 7, rounded up). However, as the printable character set is limited, I think base64 is the highest base that is commonly used.
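Here is a rough sketch of converting a hex digest to base64 by hand, assuming the usual bit widths (4 bits per hex character and 6 bits per base64 character, so 3 hex characters become 2 base64 characters). The function name is made up for the illustration; in practice base64_encode(md5($string, true)) does the same job.

```php
<?php
// Convert a hex string to unpadded base64, 3 hex characters (12 bits) at a time.
function hex_to_base64_sketch(string $hex): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    $out = '';
    foreach (str_split($hex, 3) as $chunk) {
        // Right-pad a short final chunk with zero bits.
        $value = hexdec(str_pad($chunk, 3, '0'));
        $out .= $alphabet[$value >> 6] . $alphabet[$value & 0x3F];
    }
    // Keep only as many characters as the input bits actually need.
    return substr($out, 0, (int) ceil(strlen($hex) * 4 / 6));
}

$hex = md5('1');
echo $hex, "\n";
echo hex_to_base64_sketch($hex), "\n";
echo rtrim(base64_encode(md5('1', true)), '='), "\n";
```

The last two lines should print the same 22 character string, which shows that 22 is the floor for a full 128-bit digest in base64.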
The shorter the string you want, the smaller the number of possible combinations.
Total number of possibilities with repetition
Total possibilities = n^r
Since we are dealing with base64 as the printable output, we only have 64 characters
n = 64
If you are looking at 22 letters in length
n^r = 64^22 = 5,444,517,870,735,015,415,413,993,718,908,291,383,296 possibilities
Back to your question: are there any better algorithms?
Truncate the output of a good hash to the length you want, since the number of possibilities (and hence the collision risk) is fixed by the length.
$string = "the fox jumps over the lazy brown dog";
echo truncateHash($string, 8);
Output
9TWbFjOl
Function Used
function truncateHash($str, $length) {
$hash = hash("sha256", $str, true);
return substr(base64_encode($hash), 0, $length);
}
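To put numbers on the collision risk of a truncated hash, here is a small sketch using the n^r count from above together with the standard birthday approximation p ≈ 1 - exp(-k(k-1) / (2N)). It assumes the truncated output is uniformly distributed, and the function names are only illustrative.

```php
<?php
// Number of distinct strings of a given length over an alphabet (n^r).
function keyspace_sketch(int $alphabetSize, int $length): float
{
    return (float) ($alphabetSize ** $length);
}

// Approximate probability that at least two of $items values collide
// when drawn uniformly from $keyspace possibilities.
function collision_probability_sketch(float $items, float $keyspace): float
{
    return 1.0 - exp(-$items * ($items - 1.0) / (2.0 * $keyspace));
}

$space = keyspace_sketch(64, 8); // 8 base64 characters = 48 bits
foreach ([1000.0, 1000000.0, 16777216.0] as $items) {
    printf("%.0f items: p = %.6f\n", $items, collision_probability_sketch($items, $space));
}
```

With 8 base64 characters the chance of a collision reaches roughly 40% at about 16.7 million items (2^24), so pick the length based on how many values you expect to store.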
This generates an even shorter string (though CRC32 is only a 32-bit checksum, so collisions are far more likely):
print base64_encode(hash("crc32b",$string,1));
output
qfQIdw==
Not sure if MD5 is the right choice for you, but I will assume that you have a reason to stick with this algorithm and are looking for a shorter representation. There are several possibilities to generate a shorter string with different alphabets:
Option 1: Binary string
The shortest possible form of an MD5 is its binary representation. To get such a string you can simply call:
$binaryMd5 = md5($input, true);
You can store this string in a database like any other string; it needs only 16 bytes. Just make sure you do proper escaping, either with mysqli_real_escape_string() or with parameterized queries (PDO).
Option 2: Base64 encoding
Base64 encoding will produce a string with this alphabet: [0-9 A-Z a-z + /] and uses '=' as padding. This encoding is very fast, but includes the sometimes unwanted characters '+/='.
$base64Md5 = base64_encode(md5($input, true));
The output length will be always 24 characters for the MD5 hash.
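A quick sketch to check that these representations all carry the same 16 bytes and to compare their lengths. The input value is arbitrary.

```php
<?php
$input = 'example input';

$hexMd5 = md5($input);                      // default hexadecimal form
$binaryMd5 = md5($input, true);             // raw 16-byte digest
$base64Md5 = base64_encode($binaryMd5);     // padded base64 form

printf("hex: %d, binary: %d, base64: %d\n",
    strlen($hexMd5), strlen($binaryMd5), strlen($base64Md5));

// Every form decodes back to the same raw digest.
var_dump(hex2bin($hexMd5) === $binaryMd5);
var_dump(base64_decode($base64Md5, true) === $binaryMd5);
```

The base64 form always ends in == for a 16-byte input, so stripping the padding leaves 22 characters.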
Option 3: Base62 encoding
The base62 encoding only uses the alphabet [0-9 A-Z a-z]. Such strings can be safely used for any purpose, like tokens in a URL, and they are very compact. I wrote a base62 encoder, which is able to convert binary strings to the base62 alphabet. It may not be the fastest possible implementation, but my goal was to write understandable code. The same class could be easily adapted to different alphabets.
$base62Md5 = StoBase62Encoder::base62encode(md5($input, true));
The output length will vary from 16 to 22 characters for the MD5 hash.
Base 91 looks like the most space efficient binary to ASCII printable encoding algorithm (which is what it seems you want).
I've not seen the PHP implementation, but if your software has to work with others I'd stick to Base 64; it's well-known, lightning fast, and available everywhere.
Firstly, to answer your question: Yes, there is a better algorithm (if with "better" you mean "shorter").
Use the hash() function (which has been part of the PHP core and enabled by default since PHP 5.1.2) with any of the adler32, fnv132, fnv1a32, crc32, crc32b or joaat algorithms. All of these produce a 32-bit value, i.e. 8 hexadecimal characters.
Without a more in-depth knowledge of your current situation, you might as well just pick whichever one you think sounds the coolest.
Here is an example:
hash('crc32b', $string)
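As a small sketch, this loops over the 32-bit algorithms that hash() offers and prints each digest. The input string is arbitrary, and I'm assuming a PHP version (5.4 or later) where all six names are available; hash_algos() tells you what your build supports.

```php
<?php
$string = 'the fox jumps over the lazy brown dog';
$algorithms = ['adler32', 'crc32', 'crc32b', 'fnv132', 'fnv1a32', 'joaat'];

$digests = [];
foreach ($algorithms as $algo) {
    // Each of these yields a 32-bit value, printed as 8 hex characters.
    $digests[$algo] = hash($algo, $string);
    echo str_pad($algo, 8), ' ', $digests[$algo], "\n";
}
```

Keep in mind that none of these are cryptographic hashes, and with only 32 bits you can expect collisions after a few tens of thousands of distinct inputs.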
I set up an online example you can play around with.
Secondly, I would like to point out that what you are asking is an almost exact duplicate of another question here on Stack Overflow.
I read from your post that you are searching for a hashing algorithm and not compression.
There are various standard hashing algorithms available in PHP. Have a look at PHP hashing functions.
Depending on what you want to hash there are different approaches. Be careful and calculate the average collision probability.
However it seems you are searching for a 'compression' which outputs the minimum possible size of chars for a given string. If you do, then have a look at Lempel–Ziv–Welch (php implementation) or others.
If I take the sha1 of "ABC", the result is the same in PHP and Node.js.
const crypto = require('crypto');
function sha1(input) {
    return crypto.createHash('sha1').update(input).digest('hex');
}
But if I hash something Cyrillic like "ЭЮЯЁ", the results differ.
How do I fix it?
The issue is likely that the character set/encodings aren't matching.
If the string in PHP is UTF-8 encoded, you can mirror that in Node.js by specifying 'utf8':
function sha1(input) {
return crypto.createHash('sha1').update(input, 'utf8').digest('hex');
};
> crypto.createHash('sha1').update('ЭЮЯЁ').digest('hex')
'da7f63ac9a3b5c67c8920871145cb5904f3df29a'
> crypto.createHash('sha1').update('ЭЮЯЁ', 'utf8').digest('hex')
'f78c3521413a8321231e35665f8c4a16550e182a'
'ABC' will have a better chance of matching because these are all ASCII characters and ASCII is a starting point for many other character sets. It's when you get beyond ASCII that you'll more often run into conflicts.
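The same effect can be reproduced on the PHP side alone. This sketch hashes the Cyrillic sample once as UTF-8 and once as Windows-1251; Windows-1251 is only my guess at a typical mismatching encoding, and the code assumes the mbstring extension is loaded.

```php
<?php
// "ЭЮЯЁ" written with code point escapes so the file encoding does not matter.
$utf8 = "\u{042D}\u{042E}\u{042F}\u{0401}";
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

// Same characters, different bytes, therefore different hashes.
printf("UTF-8:        %d bytes, sha1 %s\n", strlen($utf8), sha1($utf8));
printf("Windows-1251: %d bytes, sha1 %s\n", strlen($cp1251), sha1($cp1251));

// Pure ASCII text is byte-identical in both encodings.
$ascii = 'ABC';
$asciiCp1251 = mb_convert_encoding($ascii, 'Windows-1251', 'UTF-8');
var_dump(sha1($ascii) === sha1($asciiCp1251));
```

A hash function only ever sees bytes, so both sides have to agree on the encoding before hashing.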
What is the best way of generating a hash for the purpose of storing a session? I am looking for a lightweight, portable solution.
bin2hex(mcrypt_create_iv(22, MCRYPT_DEV_URANDOM));
mcrypt_create_iv will give you a random sequence of bytes (note that the mcrypt extension was deprecated in PHP 7.1 and removed in PHP 7.2).
bin2hex will convert it to ASCII text.
Example output:
d2c63a605ae27c13e43e26fe2c97a36c4556846dd3ef
Bear in mind that "best" is a relative term. You have a tradeoff to make between security, uniqueness and speed. The above example is good for 99% of the cases, though if you are dealing with particularly sensitive data, you might want to read about the difference between MCRYPT_DEV_URANDOM and MCRYPT_DEV_RANDOM.
Finally, there is a RandomLib "for generating random numbers and strings of various strengths".
Notice that so far I have assumed that you are looking to generate a random string, which is not the same as deriving a hash from a value. For the latter, refer to password_hash.
random_bytes() is available as of PHP 7.0 (or use this polyfill for 5.2 through 5.6). It is cryptographically secure (compared to rand() which is not) and can be used in conjunction with bin2hex(), base64_encode(), or any other function that converts binary to a string that's safe for your use case.
As a hexadecimal string
bin2hex() will result in a hexadecimal string that's twice as many characters as the number of random bytes (each hex character represents 4 bits while there are 8 bits in a byte). It will only include characters from abcdef0123456789 and the length will always be a multiple of 2 (regex: /^([a-f0-9]{2})*$/).
$random_hex = bin2hex(random_bytes(18));
echo serialize($random_hex);
s:36:"ee438d1d108bd818aa0d525602340e5d7036";
As a base64 string
base64_encode() will result in a string that's about 33% longer than the number of random bytes (each base64 character represents 6 bits while there are 8 bits in a byte). Its length will always be a multiple of 4, with = used to pad the end of the string and characters from the following list used to encode the data (excluding whitespace that I added for readability):
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
/+
To take full advantage of the space available, it's best to pass a multiple of 3 to random_bytes(). The resulting string will match /^([a-zA-Z0-9\/+=]{4})*$/, although = can only appear at the end as = or ==, and only when the number passed to random_bytes() is not a multiple of 3.
$random_base64 = base64_encode(random_bytes(18));
echo serialize($random_base64);
s:24:"ttYDDiGPV5K0MXbcfeqAGniH";
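A short sketch that checks the length rules described above; it assumes PHP 7.0+ for random_bytes(). The 16-byte case is there only to show the padding.

```php
<?php
// 18 is a multiple of 3, so the base64 form needs no padding.
$bytes = random_bytes(18);
$hex = bin2hex($bytes);
$base64 = base64_encode($bytes);

// 16 is not a multiple of 3, so the base64 form ends in "==".
$padded = base64_encode(random_bytes(16));

printf("hex:    %d chars\n", strlen($hex));
printf("base64: %d chars\n", strlen($base64));
printf("padded: %d chars, ends with %s\n", strlen($padded), substr($padded, -2));
```

If you need a strictly alphanumeric token, the hex form is the simplest choice, at the cost of the extra length.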
You can use PHP's built-in hashing functions, sha1 and md5. Choose one, not both.
One may think that using both, sha1(md5($pass)), would be a solution. Using both does not make your password more secure; it only adds redundant work and does not make much sense.
Take a look at PHP Security Consortium: Password Hashing they give a good article with weaknesses and improving security with hashing.
Nonce stands for "number used once". Nonces are used on requests to prevent unauthorized access: a secret key is sent with the request and checked each time your code is used.
You can check out more at PHP NONCE Library from FullThrottle Development
Maybe uniqid() is what you need?
uniqid — Generate a unique ID
You can use openssl_random_pseudo_bytes (available since PHP 5.3.0) to generate a pseudo-random string of bytes, and then convert the result to a printable string using one of these methods:
$bytes = openssl_random_pseudo_bytes(32);
$hash = base64_encode($bytes);
or
$bytes = openssl_random_pseudo_bytes(32);
$hash = bin2hex($bytes);
The first one will generate the shortest string, with numbers, lowercase, uppercase and some special characters (=, +, /). The second alternative will generate hexadecimal numbers (0-9, a-f)
Use random_bytes() if it's available!
$length = 32;
if (function_exists("random_bytes")) {
    // random_bytes() expects an int; each byte becomes two hex characters.
    $bytes = random_bytes((int) ceil($length / 2));
    $token = substr(bin2hex($bytes), 0, $length);
}
Check it on php.net
I personally use apache's mod_unique_id to generate a random unique number to store my sessions. It's really easy to use (if you use apache).
For nonce take a look here http://en.wikipedia.org/wiki/Cryptographic_nonce there's even a link to a PHP library.
I generally don't manage session ids manually. I've seen something along these lines recommended for mixing things up a bit, but I've never used it myself, so I can't attest to it being any better or worse than the default (note this is for use with auto-generated ids, not with manual management).
//md5 "emulation" using sha1
ini_set('session.hash_function', 1);
ini_set('session.hash_bits_per_character', 5);
Different people will have different best ways. But this is my way:
Download this rand-hash.php file :
http://bit.ly/random-string-generator
include() it in the php script that you are working with. Then, simply call
cc_rand() function. By default it will return a 6 characters long
random string that may include a-z, A-Z, and 0-9. You can pass
length to specify how many characters cc_rand() should return.
Example:
cc_rand() will return something like: 4M8iro
cc_rand(15) will return something similar to this: S4cDK0L34hRIqAS
Cheers!
In looking at URL safe base 64 encoding, I've found it to be a very non-standard thing. Despite the copious number of built in functions that PHP has, there isn't one for URL safe base 64 encoding. On the manual page for base64_encode(), most of the comments suggest using that function, wrapped with strtr():
function base64_url_encode($input)
{
return strtr(base64_encode($input), '+/=', '-_,');
}
The only Perl module I could find in this area is MIME::Base64::URLSafe (source), which performs the following replacement internally:
sub encode ($) {
my $data = encode_base64($_[0], '');
$data =~ tr|+/=|\-_|d;
return $data;
}
Unlike the PHP function above, this Perl version drops the '=' (equals) character entirely, rather than replacing it with ',' (comma) as PHP does. Equals is a padding character, so the Perl module replaces them as needed upon decode, but this difference makes the two implementations incompatible.
Finally, the Python function urlsafe_b64encode(s) keeps the '=' padding around, prompting someone to put up this function to remove the padding which shows prominently in Google results for 'python base64 url safe':
from base64 import urlsafe_b64encode, urlsafe_b64decode
def uri_b64encode(s):
    return urlsafe_b64encode(s).strip('=')
def uri_b64decode(s):
    return urlsafe_b64decode(s + '=' * (4 - len(s) % 4))
The desire here is to have a string that can be included in a URL without further encoding, hence the ditching or translation of the characters '+', '/', and '='. Since there isn't a defined standard, what is the right way?
There does appear to be a standard, it is RFC 3548, Section 4, Base 64 Encoding with URL and Filename Safe Alphabet:
This encoding is technically identical
to the previous one, except for the
62:nd and 63:rd alphabet character, as
indicated in table 2.
+ and / should be replaced by - (minus) and _ (understrike) respectively. Any incompatible libraries should be wrapped so they conform to RFC 3548.
Note that this requires that you URL encode the (pad) = characters, but I prefer that over URL encoding the + and / characters from the standard base64 alphabet.
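For PHP, a minimal sketch of the RFC alphabet could look like the following. The function names are hypothetical, and stripping the '=' padding (then restoring it on decode) is my own choice for the sketch so the result needs no URL encoding at all; keep the padding if the other side expects it.

```php
<?php
// Encode with the URL and filename safe alphabet ('-' and '_'), without padding.
function base64url_encode_sketch(string $data): string
{
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

// Decode; returns false if the input is not valid base64url.
function base64url_decode_sketch(string $text)
{
    // Restore the padding that the encoder stripped.
    $padded = $text . str_repeat('=', (4 - strlen($text) % 4) % 4);
    return base64_decode(strtr($padded, '-_', '+/'), true);
}

$raw = "\xfb\xff";                       // standard base64 would give "+/8="
$encoded = base64url_encode_sketch($raw);
echo $encoded, "\n";
var_dump(base64url_decode_sketch($encoded) === $raw);
```

Note the (4 - n % 4) % 4 on decode: without the outer modulo, an input whose length is already a multiple of 4 would get four extra '=' characters appended.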
I don't think there is a right or wrong. But the most popular encoding is
'+/=' => '-_.'
This is widely used by Google and Yahoo (they call it Y64). Most of the URL-safe encoders I have used in Java and Ruby support this character set.
I'd suggest running the output of base64_encode through urlencode. For example:
function base64_encode_url( $str )
{
return urlencode( base64_encode( $str ) );
}
If you're asking about the correct way, I'd go with proper URL-encoding as opposed to arbitrary replacement of characters. First base64-encode your data, then further encode special characters like "=" with proper URL-encoding (i.e. %<code>).
Why don't you try wrapping it in a urlencode()? Documentation here.