If I compute the SHA-1 of "ABC", the result is the same in PHP and Node.js.
const crypto = require('crypto');

function sha1(input) {
    return crypto.createHash('sha1').update(input).digest('hex');
}
But if I hash something Cyrillic, such as "ЭЮЯЁ", the results differ.
How can I fix it?
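The issue is most likely that the character encodings don't match. A hash is computed over bytes, so the same text encoded in two different ways produces two different digests.
If the string in PHP is UTF-8 encoded, you can mirror that in Node.js by specifying 'utf8' explicitly (older Node.js releases treated strings passed to update() as 'binary' by default; newer releases default to 'utf8'):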
function sha1(input) {
    return crypto.createHash('sha1').update(input, 'utf8').digest('hex');
}
> crypto.createHash('sha1').update('ЭЮЯЁ').digest('hex')
'da7f63ac9a3b5c67c8920871145cb5904f3df29a'
> crypto.createHash('sha1').update('ЭЮЯЁ', 'utf8').digest('hex')
'f78c3521413a8321231e35665f8c4a16550e182a'
'ABC' will have a better chance of matching because these are all ASCII characters and ASCII is a starting point for many other character sets. It's when you get beyond ASCII that you'll more often run into conflicts.
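The effect is easy to reproduce outside PHP and Node.js. Here is a minimal Python sketch (the helper name sha1_hex is made up for illustration, and Windows-1251 is only an assumed example of a non-UTF-8 Cyrillic charset) that hashes the same text under two encodings:

```python
import hashlib

def sha1_hex(text: str, encoding: str) -> str:
    """Encode the text to bytes with the given charset, then hash the bytes."""
    return hashlib.sha1(text.encode(encoding)).hexdigest()

# ASCII-only input: UTF-8 and Windows-1251 produce identical bytes.
ascii_same = sha1_hex("ABC", "utf-8") == sha1_hex("ABC", "cp1251")

# Cyrillic input: the byte sequences differ, so the digests differ too.
cyrillic_same = sha1_hex("ЭЮЯЁ", "utf-8") == sha1_hex("ЭЮЯЁ", "cp1251")

print(ascii_same, cyrillic_same)  # prints: True False
```

Whatever the language, both sides have to agree on the bytes before hashing.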
Related
In PHP, how can I convert UTF-8 to MUTF-8? I am hoping I can lazily get away with:
function utf8_to_mutf8(string $utf8): string {
    return str_replace("\x00", "\xC0\x80", $utf8);
}
Given that every byte of a multi-byte UTF-8 character has the high bit set, \x00 can never occur inside a multi-byte character, so the following should be completely unnecessary?
function utf8_to_mutf8(string $utf8): string {
    $old = mb_internal_encoding();
    mb_internal_encoding("UTF-8");
    $ret = mb_ereg_replace("\x00", "\xC0\x80", $utf8);
    mb_internal_encoding($old);
    return $ret;
}
Yes, "\x00" occurs only for code point U+0000 and never as part of any other code point. Only ASCII characters (U+0000 to U+007F, bits 00000000 to 01111111) are encoded as bytes with the highest bit clear. Such bytes can therefore also be used for synchronization when it is unclear where the next code point/character begins.
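The synchronization idea can be sketched in a few lines of Python. The function name next_boundary is hypothetical, and the sketch goes one step further than the text above: besides ASCII bytes (high bit clear), any byte that is not of the form 10xxxxxx also starts a character.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a UTF-8 character starts."""
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a character.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = "aЯb".encode("utf-8")           # b'a\xd0\xafb'
inside_ya = next_boundary(data, 2)     # index 2 is the second byte of 'Я'
at_lead = next_boundary(data, 1)       # index 1 is already a lead byte
```

Here inside_ya is 3 (the ASCII byte 'b') and at_lead stays at 1.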
Yes, str_replace() is enough, because it is binary safe, as stated in the docs. That is, it cares neither about the input's encoding nor about global settings.
If your goal is a byte sequence that never contains "\x00", this is the way to achieve it. Note that Java's Modified UTF-8 additionally encodes characters above U+FFFF as surrogate pairs, which this replacement does not cover.
Personally, I think null termination is outdated, and following the old Java workaround for that limitation comes with the same disadvantage of not being able to use "\x00" in the first place. You just end up undoing the modification again so that regular UTF-8 handling can deal with the data properly.
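For comparison, here is the same NUL replacement as a small Python sketch (function names are invented for this example). It covers only the U+0000 rule discussed above and assumes the input is valid UTF-8, in which the byte 0xC0 never occurs on its own, so the reverse replacement is unambiguous.

```python
def utf8_to_mutf8_nul(data: bytes) -> bytes:
    """Replace every NUL byte with the overlong two-byte form C0 80."""
    return data.replace(b"\x00", b"\xc0\x80")

def mutf8_nul_to_utf8(data: bytes) -> bytes:
    """Undo the replacement; valid UTF-8 never contains C0 80 by itself."""
    return data.replace(b"\xc0\x80", b"\x00")

sample = "a\x00Я\x00".encode("utf-8")
encoded = utf8_to_mutf8_nul(sample)

# Every byte of a multi-byte character has the high bit set, so none is 0x00.
high_bits_set = all(b >= 0x80 for ch in "Я你😀" for b in ch.encode("utf-8"))
```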
I'm trying to convert a string to UTF-8 in both Objective-C and PHP.
I get different results:
"\xd7\x91\xd7\x93\xd7\x99\xd7\xa7\xd7\x94" //Obj-C
"\u05d1\u05d3\u05d9\u05e7\u05d4" //PHP
Obj-C code:
const char *cData = [@"בדיקה" cStringUsingEncoding:NSUTF8StringEncoding];
PHP code:
utf8_encode('בדיקה')
This difference breaks my hash algorithm that follows.
How can I make the two strings encode the same way? Should I change the Objective-C code or the PHP code?
Go to http://www.utf8-chartable.de/unicode-utf8-table.pl
In the combo box switch to “U+0590 … U+05FF Hebrew”
Scroll down to “U+05D1” which is the rightmost character of your input string.
The third column shows the two UTF-8 bytes: “d7 91”
If you keep looking, you will see that the PHP and the Objective-C output are actually the same. The “problem” you are seeing is that PHP prints Unicode escapes (\u) while Objective-C prints byte-wise hexadecimal escapes (\x). Those are only visual representations of the strings; the bytes in memory are the same.
If your hash algorithm deals with bytes correctly, you should not see differences.
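This can be checked mechanically. The Python sketch below (variable names are illustrative) takes the two escaped forms from the question, encodes the \u form as UTF-8, and compares the bytes and their SHA-1 digests:

```python
import hashlib

# The byte-wise escapes shown for Objective-C.
objc_bytes = b"\xd7\x91\xd7\x93\xd7\x99\xd7\xa7\xd7\x94"

# The code point escapes shown for PHP, interpreted as text.
php_text = "\u05d1\u05d3\u05d9\u05e7\u05d4"

same_bytes = php_text.encode("utf-8") == objc_bytes
same_hash = (hashlib.sha1(php_text.encode("utf-8")).hexdigest()
             == hashlib.sha1(objc_bytes).hexdigest())

print(same_bytes, same_hash)  # prints: True True
```

If the hashes still differ in practice, one side is probably hashing the escaped representation rather than the raw bytes.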
What are you using to do the encoding on PHP? It looks like you're generating a UTF-16 string.
Try utf8_encode() and see if that gives better results.
At the moment, I don't understand why it is so important to use the mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?
PHP's standard string functions operate on bytes and are not multibyte-aware, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese, or Korean character takes more than one byte. For example, a typical English letter such as x takes 1 byte, whereas a Japanese, Chinese, or Korean character takes more than 1 byte. PHP's standard string functions treat every byte as one character, so comparing or measuring Japanese, Chinese, or Korean text with them will not work as expected. For example, "Hello World!" is 12 characters and 12 bytes in English, but its Japanese, Chinese, or Korean translation will occupy more bytes than it has characters.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
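A small Python sketch of that claim, working purely on bytes (the sample text is made up). Note that ï and é share the lead byte 0xC3, yet searching for the complete sequence of é cannot match inside ï:

```python
# Hypothetical sample text; \u escapes keep the code points unambiguous.
text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")

# Find and replace the full byte sequence of U+00E9, using byte functions only.
needle = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
byte_offset = data.find(needle)        # a byte offset, not a character index
replaced = data.replace(needle, b"e")

# The result is still valid UTF-8 and only the intended character changed.
result = replaced.decode("utf-8")
```

The guarantee holds for complete characters; searching for a lone byte or a partial sequence can of course land in the middle of a character.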
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
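The two notions of length can be put side by side in Python, where the roles of strlen() and mb_strlen() are played by len() on bytes and on text respectively. This is only an analogy; the helper names are invented for the example.

```python
def byte_length(text: str) -> int:
    """Number of bytes in the UTF-8 encoding (what a byte-oriented strlen sees)."""
    return len(text.encode("utf-8"))

def char_length(text: str) -> int:
    """Number of code points (what an encoding-aware length function sees)."""
    return len(text)

samples = ["Hello", "こんにちは", "e\u0301"]   # last one: 'e' + combining acute accent
lengths = [(byte_length(s), char_length(s)) for s in samples]
# lengths == [(5, 5), (15, 5), (3, 2)]
```

The last sample also shows that a code point count is not necessarily what a reader perceives as characters: the decomposed é counts as two code points.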
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
Summary:
No sane 'default' is possible as PHP strings do not contain the character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken: even if it looks nice at first, it will break your code in hard-to-debug ways.
multibyte => multi + byte.
1) The mbstring functions are used to work with strings in languages other than English.
2) The default PHP string functions only work properly with English (or closely related) text.
3) If you want to use strlen(), strpos(), strtoupper(), or str_replace() on such characters, use the multibyte-aware counterparts where they exist (mb_strlen(), mb_strpos(), mb_strtoupper()).
Suppose we need to apply string functions to "Hello"
in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते), or Gujarati (હેલો).
Different languages can have their own character sets,
so mbstring was introduced to handle text in various languages (Chinese, Japanese, etc.).
Raul González is a perfect example of why.
Suppose we need to shorten over-long user names for a MySQL database: we have a 10-character limit and the name Raul González.
The unit test below shows how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
Laravel and PHPUnit were used for illustration.
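The byte arithmetic behind that error can be reproduced without a database. In this Python sketch (variable and function names are illustrative, and the name is assumed to use the precomposed á, U+00E1), cutting at 10 bytes splits the two-byte á, while cutting at 10 characters does not:

```python
name = "Raul Gonz\u00e1lez"

# Byte-wise cut, like substr(): the 10th byte is only the first half of 'á'.
cut_bytes = name.encode("utf-8")[:10]
try:
    cut_bytes.decode("utf-8")
    byte_cut_is_valid = True
except UnicodeDecodeError:
    byte_cut_is_valid = False

# Character-wise cut, like mb_substr(): always lands on a character boundary.
cut_chars = name[:10]

def truncate_utf8(data: bytes, max_bytes: int) -> str:
    """Cut at a byte limit, dropping an incomplete trailing character."""
    return data[:max_bytes].decode("utf-8", errors="ignore")

safe = truncate_utf8(name.encode("utf-8"), 10)
```

The stray byte in cut_bytes is the same '\xC3' that MySQL complains about in the error message above.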
Is there an encoding function in PHP that encodes strings so that the output contains only letters and numbers? I would use base64, but its output still contains some non-alphanumeric characters.
You could use base32 (code easy to google), which is sort of a standard alternative to base64. Or resort to bin2hex() and pack("H*",$hex) to reverse. Hex encoding however leads to size doubling.
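The hex round trip and its size cost look like this in Python, where bytes.hex() and bytes.fromhex() play the roles of bin2hex() and pack("H*", ...). This is an analogy only, not the PHP code itself:

```python
data = "ЭЮЯЁ".encode("utf-8")      # 8 bytes

hex_text = data.hex()              # every byte becomes two hex digits
restored = bytes.fromhex(hex_text)

doubled = len(hex_text) == 2 * len(data)
only_alnum = hex_text.isalnum()
```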
Short answer is no. base64 uses a reduced set of output characters compared with uuencode and was intended to solve most character conversion issues, but it still isn't URL-safe (IIRC).
But the mechanism is trivial and easily adapted. I'd suggest having a look at base32 encoding: it works like base64 but carries 5 bits instead of 6 per output character (hence a 32-character alphabet is all that's required). You would need something different for the padding character, though ('=' is not URL-safe).
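A minimal Python sketch of that idea, using the standard library's base32 codec and simply dropping the '=' padding instead of substituting another character (one possible choice among several; the function names are hypothetical). The padding can be recomputed on decode because padded base32 output always has a length that is a multiple of 8.

```python
import base64

def b32_encode_alnum(data: bytes) -> str:
    """Base32-encode and strip the '=' padding, leaving only A-Z and 2-7."""
    return base64.b32encode(data).decode("ascii").rstrip("=")

def b32_decode_alnum(text: str) -> bytes:
    """Restore the padding to a multiple of 8 characters, then decode."""
    padding = "=" * (-len(text) % 8)
    return base64.b32decode(text + padding)

token = b32_encode_alnum("ЭЮЯЁ".encode("utf-8"))
```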
A quick google found this
Any of the hash functions (md5, sha1, etc.) output will only consist of hexadecimal digits but that's not exactly 'encoding'.
You could write your own base-62 encoder/decoder using a-z/A-Z/0-9. Encoding byte by byte, you'd need 2 digits for every byte though (62 values are too few for one digit, 62 × 62 = 3844 are enough), so it is no more compact than hex.
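One way to write such a base-62 codec is to treat the whole input as a single big integer rather than encoding character by character. The Python sketch below does that; the alphabet order, the function names, and the 0x01 sentinel byte (used so that leading zero bytes survive) are all choices made for this example, not a standard.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols

def b62_encode(data: bytes) -> str:
    """Encode bytes as base-62 text containing only 0-9, A-Z, a-z."""
    # Prefix a 0x01 sentinel so leading zero bytes are not lost in the integer.
    n = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def b62_decode(text: str) -> bytes:
    """Reverse b62_encode: rebuild the integer, then drop the sentinel byte."""
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]

token = b62_encode(b"\x00\x00hello")
```

Big-integer arithmetic makes this slow for large inputs, so it is best suited to short tokens.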
I wrote this to use letters, numbers and dashes.
I'm sure you can improve it to take out the dashes:
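function pj_code($str) {
    $enc = '';
    $len = strlen($str);
    // Walk the string backwards, writing each byte value in base 36.
    while ($len--) {
        $enc .= base_convert((string) ord($str[$len]), 10, 36) . '-';
    }
    // Drop the trailing dash so decoding does not see an empty field.
    return rtrim($enc, '-');
}
function pj_decode($str) {
    if ($str === '') {
        return '';
    }
    $dec = '';
    $ords = explode('-', $str);
    $c = count($ords);
    // Walk the fields backwards to restore the original byte order.
    while ($c--) {
        $dec .= chr((int) base_convert($ords[$c], 36, 10));
    }
    return $dec;
}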
You can use the basic md5 hash function, which outputs only alphanumeric characters.
In looking at URL safe base 64 encoding, I've found it to be a very non-standard thing. Despite the copious number of built in functions that PHP has, there isn't one for URL safe base 64 encoding. On the manual page for base64_encode(), most of the comments suggest using that function, wrapped with strtr():
function base64_url_encode($input)
{
    return strtr(base64_encode($input), '+/=', '-_,');
}
The only Perl module I could find in this area is MIME::Base64::URLSafe (source), which performs the following replacement internally:
sub encode ($) {
    my $data = encode_base64($_[0], '');
    $data =~ tr|+/=|\-_|d;
    return $data;
}
Unlike the PHP function above, this Perl version drops the '=' (equals) character entirely, rather than replacing it with ',' (comma) as PHP does. Equals is a padding character, so the Perl module replaces them as needed upon decode, but this difference makes the two implementations incompatible.
Finally, the Python function urlsafe_b64encode(s) keeps the '=' padding around, prompting someone to put up this function to remove the padding which shows prominently in Google results for 'python base64 url safe':
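from base64 import urlsafe_b64encode, urlsafe_b64decode

def uri_b64encode(s):
    # s is bytes; strip the trailing '=' padding
    return urlsafe_b64encode(s).rstrip(b'=')

def uri_b64decode(s):
    # restore the padding to a multiple of 4 before decoding
    return urlsafe_b64decode(s + b'=' * (-len(s) % 4))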
The desire here is to have a string that can be included in a URL without further encoding, hence the ditching or translation of the characters '+', '/', and '='. Since there isn't a defined standard, what is the right way?
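To make the incompatibility concrete, the following Python sketch imitates the three conventions described above on one input whose standard base64 form contains all three problem characters. The PHP and Perl behaviour is reproduced with str.translate(), as an approximation of strtr() and tr///d; the variable names are illustrative.

```python
import base64

data = b"\xfb\xff"
standard = base64.b64encode(data).decode("ascii")               # "+/8="

# PHP-style strtr(): '+' -> '-', '/' -> '_', '=' -> ','
php_style = standard.translate(str.maketrans("+/=", "-_,"))

# Perl-style tr|+/=|\-_|d: same mapping, but '=' is deleted
perl_style = standard.translate(str.maketrans("+/", "-_", "="))

# Python's urlsafe_b64encode(): same mapping, '=' kept
python_style = base64.urlsafe_b64encode(data).decode("ascii")

print(php_style, perl_style, python_style)  # prints: -_8, -_8 -_8=
```

All three agree on '-' and '_' and differ only in how they treat the padding.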
There does appear to be a standard: RFC 3548, Section 4, Base 64 Encoding with URL and Filename Safe Alphabet (RFC 3548 has since been obsoleted by RFC 4648, which defines the same alphabet in its Section 5):
This encoding is technically identical
to the previous one, except for the
62:nd and 63:rd alphabet character, as
indicated in table 2.
+ and / should be replaced by - (minus) and _ (understrike) respectively. Any incompatible libraries should be wrapped so they conform to RFC 3548.
Note that this requires that you URL encode the (pad) = characters, but I prefer that over URL encoding the + and / characters from the standard base64 alphabet.
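A sketch of that approach in Python: encode with the RFC alphabet, then percent-encode whatever is still unsafe, which in practice is only the '=' padding. The function names are hypothetical.

```python
import base64
from urllib.parse import quote, unquote

def rfc_b64_for_url(data: bytes) -> str:
    """RFC URL-safe alphabet, with the '=' padding percent-encoded."""
    token = base64.urlsafe_b64encode(data).decode("ascii")
    # '-' and '_' are unreserved in URLs, so only '=' is rewritten (as %3D).
    return quote(token, safe="")

def rfc_b64_from_url(value: str) -> bytes:
    """Undo the percent-encoding, then decode the URL-safe base64."""
    return base64.urlsafe_b64decode(unquote(value))

encoded = rfc_b64_for_url(b"\xfb\xff")   # "-_8%3D"
```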
I don't think there is a right or wrong answer, but the most popular mapping is
'+/=' => '-_.'
This is widely used by Google and Yahoo (they call it Y64). Most of the URL-safe encoders I have used in Java and Ruby support this character set.
I'd suggest running the output of base64_encode through urlencode. For example:
function base64_encode_url( $str )
{
    return urlencode( base64_encode( $str ) );
}
If you're asking about the correct way, I'd go with proper URL-encoding as opposed to arbitrary replacement of characters. First base64-encode your data, then further encode special characters like "=" with proper URL-encoding (i.e. %<code>).
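For comparison, here is what plain base64 followed by percent-encoding produces, sketched in Python with urllib.parse.quote standing in for PHP's urlencode() (an approximation: the two differ for spaces, which base64 output never contains). The function name is made up for the example.

```python
import base64
from urllib.parse import quote, unquote

def b64_then_urlencode(data: bytes) -> str:
    """Standard base64, then percent-encode every reserved character."""
    return quote(base64.b64encode(data).decode("ascii"), safe="")

encoded = b64_then_urlencode(b"\xfb\xff")            # "%2B%2F8%3D"
decoded = base64.b64decode(unquote(encoded))
```

The result is unambiguous and needs no custom alphabet, at the cost of three characters for every '+', '/' and '=' in the output.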
Why don't you try wrapping it in a urlencode()? Documentation here.