PHP ansi to utf8 and vice versa - php

I want to convert strings from ANSI to UTF-8 and vice versa.
example: to_ansi(1234)
expected result: NLKJ
example: to_utf8(NLKJ)
expected result: 1234
functions I currently have are:
function to_ansi($str)
{
$newString = "";
$reversedString = strrev($str);
for($i=0; $i < strlen($reversedString); $i++ ) {
$newString .= iconv(mb_detect_encoding(), 'UTF-8', chr(ord($reversedString[$i]) * 1.5));
}
return $newString;
}
function to_utf8($str)
{
$newString = "";
$reversedString = strrev($str);
for($i=0; $i < strlen($reversedString); $i++ ) {
$newString .= iconv(mb_detect_encoding(), 'UTF-8', chr(ord($reversedString[$i]) / 1.5));
}
return $newString;
}
Using the functions above, I get:
example: to_ansi(1234)
result: NLKI
example: to_utf8(NLKJ)
result: 1224
Actually, I'm just porting VBS to PHP, and the original functions are:
Function ToAnsi(ByVal strPassword As String) As String
Dim strLetter As String
Dim strRevPass As String
Dim strNewPass As String
strRevPass = strReverse(strPassword)
strNewPass = ""
For a = 1 To Len(strRevPass)
strLetter = Mid$(strRevPass, a, 1)
strNewPass = strNewPass & Chr((Asc(strLetter) * 1.5))
Next a
Text2.Text = strNewPass
End Function
Function ToUTF8(ByVal strPassword As String)
Dim strLetter As String
Dim strRevPass As String
Dim strNewPass As String
strRevPass = strReverse(strPassword)
strNewPass = ""
For a = 1 To Len(strRevPass)
strLetter = Mid$(strRevPass, a, 1)
strNewPass = strNewPass & Chr(Asc(strLetter) / 1.5)
Next a
txtText3.Text = strNewPass
End Function

Why do you think that 1234 in ANSI should be NLKJ in UTF-8?
The likely cause of your problem is rounding. You're multiplying and dividing by 1.5, and there are no fractions in character codes. For instance, the letter 'y' (ASCII 121) divided by 1.5 is 80 2/3, which is treated as 81. Going back, 81 * 1.5 = 121.5, which is treated as 122, resulting in a 'z'.
So it's hard to grasp what this code is meant to do. It certainly isn't a regular ANSI to UTF-8 conversion.
It seems to do some password hashing/encoding, but in a very insecure way. It's a very simple encoding algorithm that can be decoded just as easily, apart from the fact that it is inherently broken and mangles your data beyond repair.
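The mismatch in the question (NLKI instead of NLKJ) fits this rounding explanation: VB rounds a fractional value to the nearest integer, with ties going to the even neighbour, when it is passed to Chr, whereas PHP's chr() simply drops the fraction. Below is a minimal sketch of a port that mimics the VB rounding. The names vb_chr, vb_scramble and vb_unscramble are made up for illustration, and the iconv call is left out on the assumption that the routine only shuffles single-byte codes and is not a charset conversion.

```php
<?php
// Hypothetical helper: round half to even, as VB does when it coerces a
// Double to an integer, then turn the resulting code into a single byte.
function vb_chr(float $code): string
{
    return chr((int) round($code, 0, PHP_ROUND_HALF_EVEN));
}

// Port of ToAnsi: reverse the string, multiply each byte code by 1.5.
function vb_scramble(string $str): string
{
    $out = '';
    $rev = strrev($str);
    for ($i = 0, $n = strlen($rev); $i < $n; $i++) {
        $out .= vb_chr(ord($rev[$i]) * 1.5);
    }
    return $out;
}

// Port of ToUTF8: reverse the string, divide each byte code by 1.5.
function vb_unscramble(string $str): string
{
    $out = '';
    $rev = strrev($str);
    for ($i = 0, $n = strlen($rev); $i < $n; $i++) {
        $out .= vb_chr(ord($rev[$i]) / 1.5);
    }
    return $out;
}

echo vb_scramble('1234'), "\n";   // NLKJ
echo vb_unscramble('NLKJ'), "\n"; // 1234
```

Round trips are still not guaranteed for arbitrary input, as the 'y' example shows, and codes above 255 wrap around in chr() where VB would raise an error.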

Related

Why, when I convert a string into binary, does it miss the first zeros?

I am trying to convert any string into binary, but if the binary starts with zeros, they are not displayed. All my tests give me the binary value from the first 1 to the end. Here is my code:
$value = unpack('H*', $MESSAGE);
$binary .= base_convert($value[1], 16, 2);
For example, when I tried to convert the character "%", it displayed 100101 instead of 00100101.
Did I forget something?
Thanks.
Yacine
It is easy to see that the question boils down to the following:
Why does base_convert($value[1], 16, 2) not zero-pad the result?
The reason is that base_convert interprets the first argument as a number (not as a string of bytes, for example); it stops producing digits once the most significant bit has been emitted:
static char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
char buf[(sizeof(zend_ulong) << 3) + 1];
char *ptr, *end;
zend_ulong value;
if (Z_TYPE_P(arg) != IS_LONG || base < 2 || base > 36) {
return ZSTR_EMPTY_ALLOC();
}
value = Z_LVAL_P(arg);
end = ptr = buf + sizeof(buf) - 1;
*ptr = '\0';
do {
*--ptr = digits[value % base];
value /= base;
} while (ptr > buf && value);
return zend_string_init(ptr, end - ptr, 0);
(i.e. when the value becomes zero.) The behavior is correct, since any number of zeros can be put in front of the most significant bit without changing the value, e.g. 100101 is equal to 00100101.
The function does not have a parameter that affects the formatting of the result. So, in order to achieve the desired output, you need to use other function(s) such as sprintf.
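As a rough illustration of the sprintf route, the sketch below formats each byte separately with the %08b conversion, which pads every byte to eight binary digits. The function name str_to_bits is an assumption made for this example. Working byte by byte also sidesteps the precision loss that base_convert can show on long hex strings.

```php
<?php
// Hypothetical helper: render every byte of $s as exactly 8 binary digits.
function str_to_bits(string $s): string
{
    $bits = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        // %08b = binary, left-padded with zeros to a width of 8
        $bits .= sprintf('%08b', ord($s[$i]));
    }
    return $bits;
}

echo str_to_bits('%'), "\n"; // 00100101
```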

Levenshtein distance on diacritic characters

In PHP I am calculating Levenshtein distance using function levenshtein(). For simple characters it works as expected, but for diacritic characters like in example
echo levenshtein('à', 'a');
it returns "2". In this case only one replacement has to be done, so I expect it to return "1".
Am I missing something?
I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms
The default PHP levenshtein(), like many PHP functions, is not multibyte aware. So, when processing strings with Unicode characters, it handles each byte separately and changes two bytes.
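A short sketch of where the extra edit comes from, assuming the source file and strings are UTF-8: 'à' (U+00E0) is stored as two bytes, so the byte-wise function sees one replacement plus one deletion.

```php
<?php
$accented = "\u{E0}"; // 'à' encoded as UTF-8

echo strlen($accented), "\n";           // 2 (bytes, not characters)
echo bin2hex($accented), "\n";          // c3a0
echo levenshtein($accented, 'a'), "\n"; // 2
echo levenshtein('b', 'a'), "\n";       // 1 (both are single bytes)
```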
There is no multibyte version (i.e. mb_levenshtein()) so you have two options:
1) Re-implement the function yourself, using mb_ functions. Possible example code from a Gist:
<?php
function levenshtein_php($str1, $str2){
$length1 = mb_strlen( $str1, 'UTF-8');
$length2 = mb_strlen( $str2, 'UTF-8');
if( $length1 < $length2) return levenshtein_php($str2, $str1);
if( $length1 == 0 ) return $length2;
if( $str1 === $str2) return 0;
$prevRow = range( 0, $length2);
$currentRow = array();
for ( $i = 0; $i < $length1; $i++ ) {
$currentRow=array();
$currentRow[0] = $i + 1;
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
for ( $j = 0; $j < $length2; $j++ ) {
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
$insertions = $prevRow[$j+1] + 1;
$deletions = $currentRow[$j] + 1;
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
$currentRow[] = min($insertions, $deletions, $substitutions);
}
$prevRow = $currentRow;
}
return $prevRow[$length2];
}
2) Convert your string's Unicode characters to ASCII. If you are specifically wanting to calculate Levenshtein differences from diacritic characters to non-diacritics, though, this is probably not what you want.

How to implement a Longitudinal Redundancy Check (LRC/CRC8/XOR8) checksum in PHP?

I'm having real problems trying to implement a XOR8/LRC checksum in PHP, according to the algorithm present here: http://en.wikipedia.org/wiki/Longitudinal_redundancy_check
What I'm trying to do is, given any string calculate its LRC checksum.
For example, I know for sure this string:
D$1I 11/14/2006 18:15:00 1634146 3772376 3772344 3772312 3772294 1*
Has a hexadecimal checksum of 39 (including the last * char).
For anyone interested in the meaning of the string, it is a DART (Deep-ocean Assessment and Reporting of Tsunamis) message - http://nctr.pmel.noaa.gov/Dart/Pdf/dartMsgManual3.01.pdf.
I convert the string to a binary string with 1's and 0's. From there, I try to create a byte array and apply the algorithm to the byte array, but it's not working and I can't figure out why.
The function I'm using for converting to String to Binary String is:
function str2binStr($str) {
$ret = '';
for ($i = 0, $n = strlen($str); $i < $n; ++$i)
$ret .= str_pad(decbin(ord($str[$i])), 8, '0', STR_PAD_LEFT);
return $ret;
}
The function I'm using for converting from Binary String to Binary Array is:
function byteStr2byteArray($s) {
return array_slice(unpack("C*", "\0".$s), 1);
}
Finally, the LRC implementation I'm using, with bitwise operators, is:
function lrc($byteArr) {
$lrc = 0;
$byteArrLen = count($byteArr);
for ($i = 0; $i < $byteArrLen; $i++) {
$lrc = ($lrc + $byteArr[$i]) & 0xFF;
}
$lrc = (($lrc ^ 0xFF) + 1) & 0xFF;
return $lrc;
}
Then, we convert the final decimal result of the LRC checksum with dechex($checksum + 0), so we have the final hexadecimal checksum.
After all these operations, I'm not getting the expected result, so any help will be highly appreciated.
Thanks in advance.
Also, I can't make it work following the CRC8-Check in PHP answer.
I'm afraid that nobody on StackOverflow can help you, and here's why. This question was bugging me, so I went to the DART website you mentioned to take a look at their specs. Two problems became apparent:
The first one is that you have misunderstood part of their specs. Messages start with a carriage return (\r or 0x0D), and the asterisk * is not part of the checksum.
The second, bigger problem is that their specs contain several errors. Some of them may originate from bad copy/paste and/or an incorrect transformation from Microsoft .doc to PDF.
I have taken the time to inspect some of them, so it would be nice if you could contact the specs' authors or maintainers so they can fix or clarify them. Here is what I've found.
2.1.2 The message breakdown mentions C/I as message status even though it doesn't appear in the example message.
2.1.3 The checksum is wrong, off by 0x31 which corresponds to the character 1.
2.2.3 The six checksums are wrong, off by 0x2D which corresponds to the character -.
2.3.1.2 I think there's a <cr> missing between dev3 and tries
2.3.1.3 The checksum is off by 0x0D and there's no delimiter between dev3 and tries. The checksum would be correct if there was a carriage return between the dev3 value and the tries value.
2.3.2.2-3 Same as 2.3.1.2-3.
2.3.3.3 Wrong checksum again, and there's no delimiter before tries.
2.4.2 The message breakdown mentions D$2 = message ID which should be D$3 = message ID.
Here's the code I used to verify their checksums:
$messages = array(
"\rD\$0 11/15/2006 13:05:28 3214.2972 N 12041.3991 W* 46",
"\rD\$1I 11/14/2006 18:15:00 1634146 3772376 3772344 3772313 3772294 1* 39",
"\rD\$1I 11/14/2006 19:15:00 1634146 3772275 3772262 3772251 3772249 1* 38",
"\rD\$1I 11/14/2006 20:15:00 1634146 3772249 3772257 3772271 3772293 1* 3E",
"\rD\$1I 11/14/2006 21:15:00 1634146 3772315 3772341 3772373 3772407 1* 39",
"\rD\$1I 11/14/2006 22:15:00 1634146 3772440 3772472 3772506 3772540 1* 3C",
"\rD\$1I 11/14/2006 23:15:00 1634146 3772572 3772603 3772631 3772657 1* 3B",
"\rD\$2I 00 tt 18:32:45 ts 18:32:00 3772311\r00000063006201* 22",
"\rD\$2I 01 tt 18:32:45 ts 18:32:00 3772311\r000000630062706900600061005f005ffffafff9fff8fff8fff7fff6fff401* 21",
"\rD\$2I 02 tt 18:32:45 ts 18:32:00 3772335\rfffdfffafff7fff5fff1ffeeffea00190048ffe1ffddffdaffd8ffd5ffd101* 21"
);
foreach ($messages as $k => $message)
{
$pos = strpos($message, '*');
$payload = substr($message, 0, $pos);
$crc = trim(substr($message, $pos + 1));
$checksum = 0;
foreach (str_split($payload, 1) as $c)
{
$checksum ^= ord($c);
}
$crc = hexdec($crc);
printf(
"Expected: %02X - Computed: %02X - Difference: %02X - Possibly missing: %s\n",
$crc, $checksum, $crc ^ $checksum, addcslashes(chr($crc ^ $checksum), "\r")
);
}
For what it's worth, here's a completely unoptimized, straight-up implementation of the algorithm from Wikipedia:
$buffer = 'D$1I 11/14/2006 18:15:00 1634146 3772376 3772344 3772312 3772294 1*';
$LRC = 0;
foreach (str_split($buffer, 1) as $b)
{
$LRC = ($LRC + ord($b)) & 0xFF;
}
$LRC = (($LRC ^ 0xFF) + 1) & 0xFF;
echo dechex($LRC);
It results in 0x0E for the string from your example, so either I've managed to fudge the implementation or the algorithm that produced 0x39 is not the same.
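The two variants in this thread differ only in how the bytes are combined: the Wikipedia pseudocode adds the bytes modulo 256 and negates the sum, while the verification script above XORs them. A small side-by-side sketch, with function names chosen here purely for illustration:

```php
<?php
// Additive LRC: sum of all bytes modulo 256, then two's complement.
function lrc_sum(string $data): int
{
    $lrc = 0;
    for ($i = 0, $n = strlen($data); $i < $n; $i++) {
        $lrc = ($lrc + ord($data[$i])) & 0xFF;
    }
    return (($lrc ^ 0xFF) + 1) & 0xFF;
}

// XOR8: exclusive-or of all bytes.
function lrc_xor(string $data): int
{
    $lrc = 0;
    for ($i = 0, $n = strlen($data); $i < $n; $i++) {
        $lrc ^= ord($data[$i]);
    }
    return $lrc;
}

printf("%02X %02X\n", lrc_sum('AB'), lrc_xor('AB')); // 7D 03
```

Using printf with %02X instead of dechex keeps a leading zero for checksums below 0x10.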
I realize that this question is pretty old, but I had trouble figuring out how to do this. It's working now, so I figured I should paste the code. In my case, the checksum needs to be returned as an ASCII string.
public function getLrc($string)
{
$LRC = 0;
// Get hex checksum.
foreach (str_split($string, 1) as $char) {
$LRC ^= ord($char);
}
$hex = dechex($LRC);
// convert hex to string
$str = '';
for($i=0;$i<strlen($hex);$i+=2) $str .= chr(hexdec(substr($hex,$i,2)));
return $str;
}

How can I cram 6+31 numeric characters into 22 alphanumeric characters?

I've got a 6-digit number and a 31-digit number (e.g. "234536" & "201103231043330478311223582826") that I need to cram into the same 22-character alphanumeric field in an API using PHP. I tried converting each to base 32 (had to use a custom function as base_convert() doesn't handle big numbers well) and joining with a single-character delimiter, but that only gets me down to 26 characters. It's a REST API, so the characters need to be URI-safe.
I'd really like to do this without creating a database table cross referencing the two numbers with another reference value, if possible. Any suggestions?
Use a radix of 62 instead. That will get you 3.35 characters for the former and 17.3 characters for the latter, for an upper total of 22 characters.
>>> math.log(10**6)/math.log(62)
3.3474826039165504
>>> math.log(10**31)/math.log(62)
17.295326786902177
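A possible PHP sketch of the radix-62 idea, written without bcmath so it does not depend on any extension. Everything named here (BASE62_ALPHABET, dec_to_base62, base62_to_dec, pack_pair, unpack_pair) is invented for the example, as is the fixed 4 + 18 split, which relies on 62^4 exceeding 10^6 and 62^18 exceeding 10^31. The inputs are assumed to be non-negative decimal digit strings, and leading zeros in them are not preserved.

```php
<?php
// Alphabet order is an assumption: digits, then upper case, then lower case.
const BASE62_ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Decimal string -> base 62, by repeated long division of the digit string.
function dec_to_base62(string $dec): string
{
    $alphabet = BASE62_ALPHABET;
    $dec = ltrim($dec, '0');
    $out = '';
    while ($dec !== '') {
        $rem = 0;
        $quot = '';
        for ($i = 0, $n = strlen($dec); $i < $n; $i++) {
            $cur = $rem * 10 + (int) $dec[$i];
            $quot .= intdiv($cur, 62);   // always a single decimal digit
            $rem = $cur % 62;
        }
        $out = $alphabet[$rem] . $out;   // remainder is the next base-62 digit
        $dec = ltrim($quot, '0');
    }
    return $out === '' ? '0' : $out;
}

// Base 62 -> decimal string, by multiply-and-add on the digit string.
function base62_to_dec(string $s): string
{
    $dec = '0';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $carry = strpos(BASE62_ALPHABET, $s[$i]);
        if ($carry === false) {
            throw new InvalidArgumentException('not a base-62 digit');
        }
        for ($j = strlen($dec) - 1; $j >= 0; $j--) {
            $cur = (int) $dec[$j] * 62 + $carry;
            $dec[$j] = (string) ($cur % 10);
            $carry = intdiv($cur, 10);
        }
        if ($carry > 0) {
            $dec = $carry . $dec;
        }
    }
    return $dec;
}

// Fixed-width fields (4 + 18 = 22 characters), so no delimiter is needed.
function pack_pair(string $short, string $long): string
{
    return str_pad(dec_to_base62($short), 4, '0', STR_PAD_LEFT)
         . str_pad(dec_to_base62($long), 18, '0', STR_PAD_LEFT);
}

function unpack_pair(string $packed): array
{
    return [
        base62_to_dec(substr($packed, 0, 4)),
        base62_to_dec(substr($packed, 4)),
    ];
}

$packed = pack_pair('234536', '201103231043330478311223582826');
echo strlen($packed), "\n"; // 22
```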
You can write something like pack() that works with big numbers using bc. Here is my quick solution; it converts your second number into a 13-character string. Pretty nice!
<?php
$i2 = "201103231043330478311223582826";
function pack_large($i) {
$ret = '';
while(bccomp($i, 0) !== 0) {
$mod = bcmod($i, 256);
$i = bcsub($i, $mod);
$ret .= chr($mod);
$i = bcdiv($i, 256);
}
return $ret;
}
function unpack_large($s) {
$ret = '0';
$len = strlen($s);
for($i = $len - 1; $i >= 0; --$i) {
$add = ord($s[$i]);
$ret = bcmul($ret, 256);
$ret = bcadd($ret, $add);
}
return $ret;
}
var_dump($i2);
var_dump($pack = pack_large($i2));
var_dump(unpack_large($pack));
Sample output :
string(30) "201103231043330478311223582826"
string(13) "jàÙl¹9±̉"
string(47) "201103231043330478311223582826.0000000000000000"
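Since you need URL-friendly characters, use base64_encode on the packed string; this will give you a 20-character string (18 if you remove the padding). Note that standard base64 output can contain + and /, which would have to be swapped for - and _ to be URI-safe.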

How to increment letters like numbers in PHP?

I would like to write a function that takes in 3 characters, increments them, and returns the newly incremented characters as a string.
I know how to increase a single letter to the next one, but how would I know when to increase the second letter, then stop, and then increase the first letter again to get a sequential increase?
So if AAA is passed, return AAB. If AAZ is passed, return ABA (hard part).
I would appreciate help with the logic and with which PHP functions will be useful.
Even better, has someone done this already, or is there a class available to do this?
Thanks all for any help
Character/string increment works in PHP (though decrement doesn't)
$x = 'AAZ';
$x++;
echo $x; // 'ABA'
You can do it with the ++ operator.
$i = 'aaz';
$i++;
print $i;
aba
However this implementation has some strange things:
for($i = 'a'; $i < 'z'; $i++) print "$i ";
This will print out letters from a to y.
for($i = 'a'; $i <= 'z'; $i++) print "$i ";
This will print out letters from a to z, then continue with aa and end with yz.
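The loop runs on past z because <= compares strings lexicographically, so 'aa' still counts as less than or equal to 'z'. A small sketch of that comparison, plus one possible way to get exactly a to z with range() instead of relying on it:

```php
<?php
// Any multi-letter string whose first letter is below 'z' passes the test.
var_dump('aa' <= 'z'); // bool(true)
var_dump('yz' <= 'z'); // bool(true)
var_dump('za' <= 'z'); // bool(false), which is where the loop stops

// Iterating over a fixed alphabet avoids the comparison altogether.
$letters = range('a', 'z');
echo implode(' ', $letters), "\n";
echo count($letters), "\n"; // 26
```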
As proposed in PHP RFC: Strict operators directive
(currently Under Discussion):
Using the increment function on a string will throw a TypeError when strict_operators is enabled.
Whether or not the RFC gets merged, PHP will sooner or later go that direction of adding operator strictness. Therefore, you should not be incrementing strings.
a-z/A-Z ranges
If you know your letters will stay in range a-z/A-Z (not surpass z/Z), you can use the solution that converts letter to ASCII code, increments it, and converts back to letter.
Use ord() and chr():
$letter = 'A';
$letterAscii = ord($letter);
$letterAscii++;
$letter = chr($letterAscii); // 'B'
ord() converts the letter into ASCII num representation
that num representation is incremented
using chr() the number gets converted back to the letter
As discovered in comments, be careful. This iterates ASCII table so from Z (ASCII 90), it does not go to AA, but to [ (ASCII 91).
Going beyond z/Z
If you dare to go further and want z to become aa, this is what I came up with:
final class NextLetter
{
private const ASCII_UPPER_CASE_BOUNDARIES = [65, 91];
private const ASCII_LOWER_CASE_BOUNDARIES = [97, 123];
public static function get(string $previous) : string
{
$letters = str_split($previous);
$output = '';
$increase = true;
while (! empty($letters)) {
$letter = array_pop($letters);
if ($increase) {
$letterAscii = ord($letter);
$letterAscii++;
if ($letterAscii === self::ASCII_UPPER_CASE_BOUNDARIES[1]) {
$letterAscii = self::ASCII_UPPER_CASE_BOUNDARIES[0];
$increase = true;
} elseif ($letterAscii === self::ASCII_LOWER_CASE_BOUNDARIES[1]) {
$letterAscii = self::ASCII_LOWER_CASE_BOUNDARIES[0];
$increase = true;
} else {
$increase = false;
}
$letter = chr($letterAscii);
if ($increase && empty($letters)) {
$letter .= $letter;
}
}
$output = $letter . $output;
}
return $output;
}
}
I'm giving you also 100% coverage if you intend to work with it further. It tests against original string incrementation ++:
/**
* @dataProvider letterProvider
*/
public function testIncrementLetter(string $givenLetter) : void
{
$expectedValue = $givenLetter;
self::assertSame(++$expectedValue, NextLetter::get($givenLetter));
}
/**
* @return iterable<array<string>>
*/
public function letterProvider() : iterable
{
yield ['A'];
yield ['a'];
yield ['z'];
yield ['Z'];
yield ['aaz'];
yield ['aaZ'];
yield ['abz'];
yield ['abZ'];
}
To increment or decrement in the 7bits 128 chars ASCII range, the safest:
$CHAR = "l";
echo chr(ord($CHAR)+1)." ".chr(ord($CHAR)-1);
/* m k */
So it is normal to get a backtick when decrementing a: it sits just before a in the ASCII table.
Print the whole ascii range:
for ($i = 0;$i < 127;$i++){
echo chr($i);
}
/* !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ */
More infos about ANSI 7 bits ASCII: man ascii
To increment or decrement in the 8-bits extended 256 chars UTF-8 range.
This is where it starts to depend on the host machine's charset, but those charsets are all available on modern machines. From PHP, the safest option is to use the php-mbstring extension: https://www.php.net/manual/en/function.mb-chr.php
Extended ASCII (EASCII or high ASCII) character encodings are
eight-bit or larger encodings that include the standard seven-bit
ASCII characters, plus additional characters. https://en.wikipedia.org/wiki/Extended_ASCII
More info, as example: man iso_8859-9
ISO 8859-1 West European languages (Latin-1)
ISO 8859-2 Central and East European languages (Latin-2)
ISO 8859-3 Southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 West European languages (Latin-9)
ISO 8859-16 Romanian (Latin-10)
Example, we can find the € symbol in ISO 8859-7:
244 164 A4 € EURO SIGN
To increment or decrement in the 16 bits UTF-16 Unicode range:
Here is a way to generate the whole Unicode charset, by generating HTML entities and converting them to UTF-8.
for ($x = 0; $x < 262144; $x++){
echo html_entity_decode("&#".$x.";",ENT_NOQUOTES,"UTF-8");
}
Same stuff, but the range goes up to (16^4 * 4)!
echo html_entity_decode('&#33;',ENT_NOQUOTES,'UTF-8');
/* ! */
echo html_entity_decode('&#34;',ENT_NOQUOTES,'UTF-8');
/* " */
To retrieve the Unicode € symbol, using the base-10 decimal representation of the character:
echo html_entity_decode('&#8364;',ENT_NOQUOTES,'UTF-8');
/* € */
The same symbol, using the base16 hexadecimal representation:
echo html_entity_decode('&#'.hexdec("20AC").';',ENT_NOQUOTES,'UTF-8');
/* € */
The first 32 code points are reserved for control characters; they print as garbage (�����) but do have a meaning.
You are looking at a number representation problem. This is base 26 (or however many letters your alphabet has). Let's call the base b.
Assign a number to each letter in the alphabet (A=1, B=2, C=3).
Next, figure out your input "number": the representation "ABC" means A*b^2 + B*b^1 + C*b^0.
Use this formula to find the number (int). Increment it.
Next, convert it back to your number system: divide by b^2 to get the leftmost digit, divide the remainder (modulo) by b^1 for the middle digit, and divide that remainder (modulo) by b^0 for the last digit.
This might help: How to convert from base10 to any other base.
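The steps above could be sketched in PHP roughly as follows. This version deviates from the answer in one respect, stated here as an assumption: it numbers the letters from A=0 rather than A=1, so that plain base-26 arithmetic works, and it keeps the width of the input, wrapping ZZZ around to AAA. The function names are hypothetical.

```php
<?php
// Upper-case letters -> integer, reading the string as a base-26 number (A=0).
function letters_to_int(string $s): int
{
    if (!preg_match('/^[A-Z]+$/', $s)) {
        throw new InvalidArgumentException('expected upper-case A-Z only');
    }
    $n = 0;
    for ($i = 0, $len = strlen($s); $i < $len; $i++) {
        $n = $n * 26 + (ord($s[$i]) - ord('A'));
    }
    return $n;
}

// Integer -> fixed-width base-26 string; digits beyond $width are dropped.
function int_to_letters(int $n, int $width): string
{
    $out = '';
    for ($i = 0; $i < $width; $i++) {
        $out = chr(ord('A') + $n % 26) . $out;
        $n = intdiv($n, 26);
    }
    return $out;
}

// Convert, add one, convert back.
function next_code(string $s): string
{
    return int_to_letters(letters_to_int($s) + 1, strlen($s));
}

echo next_code('AAA'), "\n"; // AAB
echo next_code('AAZ'), "\n"; // ABA
```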
You could use the ASCII codes for alpha numerics. From there you increment and decrement to get the previous/next character.
You could split your string in single characters and then apply the transformations on these characters.
Just some thoughts to get you started.
<?php
$values[] = 'B';
$values[] = 'A';
$values[] = 'Z';
foreach($values as $value ){
if($value == 'Z'){
$value = '-1';
}
$op = ++$value;
echo $op;
}
?>
I have these methods in c# that you could probably convert to php and modify to suit your needs, I'm not sure Hexavigesimal is the exact name for these though...
#region Hexavigesimal (Excel Column Name to Number)
public static int FromHexavigesimal(this string s)
{
int i = 0;
s = s.Reverse();
for (int p = s.Length - 1; p >= 0; p--)
{
char c = s[p];
i += c.toInt() * (int)Math.Pow(26, p);
}
return i;
}
public static string ToHexavigesimal(this int i)
{
StringBuilder s = new StringBuilder();
while (i > 26)
{
int r = i % 26;
if (r == 0)
{
i -= 26;
s.Insert(0, 'Z');
}
else
{
s.Insert(0, r.toChar());
}
i = i / 26;
}
return s.Insert(0, i.toChar()).ToString();
}
public static string Increment(this string s, int offset)
{
return (s.FromHexavigesimal() + offset).ToHexavigesimal();
}
private static char toChar(this int i)
{
return (char)(i + 64);
}
private static int toInt(this char c)
{
return (int)c - 64;
}
#endregion
EDIT
I see by the other answers that in PHP you can use ++ instead, nice!
