Convert two strings to the same byte length - php

I have two strings in my PHP code: one is a parameter to my method and the other comes from an ini file.
The problem is that they do not compare as equal although they have the same content, probably due to encoding issues. var_dump reports the first string's length as 23 and the second string's length as 47 (see the output at the end of my question).
How can I make sure both are encoded the same way and end up with the same length, so that the comparison won't fail? Preferably, I would like them to be UTF-8 encoded.
For reference, this is an excerpt from the code:
static function getString($keyword, $file) {
    $lang_handle = parse_ini_file($file, true);
    var_dump($keyword);
    foreach ($lang_handle as $key => $value) {
        var_dump($key);
        if ($key == $keyword) {
            foreach ($value as $subkey => $subvalue) {
                var_dump("\t" . $subkey . " => " . $subvalue);
            }
        }
    }
}
with the following ini:
[clientcockpit/login.php]
header = "Kunden Login"
username = "Benutzername"
password = "Passwort"
forgot = "Passwort vergessen"
login = "Login"
When calling the method with getString("clientcockpit/login.php", "inifile.ini") the output is:
string 'clientcockpit/login.php' (length=23)
string '�c�l�i�e�n�t�c�o�c�k�p�i�t�/�l�o�g�i�n�.�p�h�p�' (length=47)

Your INI file seems to be in UTF-16 or a similar encoding that uses two bytes to represent a single character. I guess that the strange characters in your string are actually NUL bytes (\0).
PHP's Unicode support is quite poor and I guess that parse_ini_file() does not support multibyte encodings properly. It will treat the file as if it were encoded in an "ASCII-compatible" single-byte encoding, just looking for the special characters [ and ] to detect sections. As a result, the section keys will be corrupted: one byte that actually belongs to [ or ] will become part of the section key:
UTF-16: [c] (3 characters, 6 bytes)
For UTF-16BE (big endian):
Bytes: 00 5B 00 63 00 5D (6 bytes)
ASCII: \0 [ \0 c \0 ] (6 characters)
For UTF-16LE (little endian):
Bytes: 5B 00 63 00 5D 00 (6 bytes)
ASCII: [ \0 c \0 ] \0 (6 characters)
Assuming ASCII, instead of reading c, parse_ini_file() will read \0c\0 if the source file encoding is UTF-16.
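The 47-byte key from the question can be reproduced with a short sketch. It assumes the file is UTF-16LE and that the mbstring extension is available; the variable names are illustrative only, and the byte scan is just a rough stand-in for what a byte-oriented parser sees, not the actual parse_ini_file() implementation.

```php
<?php
// Assumption: the INI file is UTF-16LE; simulate one section line.
$utf16 = mb_convert_encoding('[clientcockpit/login.php]', 'UTF-16LE', 'UTF-8');

// A byte-oriented scan takes everything between the '[' and ']' bytes.
$start = strpos($utf16, '[') + 1;
$end   = strpos($utf16, ']');
$key   = substr($utf16, $start, $end - $start);

echo strlen($key), "\n";                 // 47
echo bin2hex(substr($key, 0, 3)), "\n";  // 006300 (NUL, 'c', NUL)

// Removing the NUL bytes recovers the intended 23-byte key.
$clean = str_replace("\0", '', $key);
var_dump($clean === 'clientcockpit/login.php'); // bool(true)
```

The leading NUL is the second byte of the [ character, which matches the var_dump output in the question.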
If you can control the format of your INI file, make sure to save it in UTF-8 or ISO-8859-1 encoding, using your favorite text editor.
Otherwise you will have to read the file contents using file_get_contents(), do the encoding conversion (e.g. using iconv()) and pass the result to parse_ini_string(). The drawback here is that you will have to detect or hardcode the original file encoding.
If the mb multibyte extension is available on your PHP installation, you can use mb_detect_encoding() and mb_convert_encoding() to do the conversion dynamically.
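A minimal sketch of that read, convert, parse sequence could look like the following. The function name parseIniAsUtf8 and the $fallbackEncoding parameter are hypothetical. Since encoding detection is heuristic, this version looks for a byte order mark first and only falls back to an assumed encoding when the data contains NUL bytes.

```php
<?php
// Hypothetical helper: read an INI file, convert it to UTF-8, then parse it.
function parseIniAsUtf8(string $file, string $fallbackEncoding = 'UTF-16LE')
{
    $raw = file_get_contents($file);
    if ($raw === false) {
        return false;
    }

    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        // UTF-16 little-endian byte order mark
        $raw = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    } elseif (strncmp($raw, "\xFE\xFF", 2) === 0) {
        // UTF-16 big-endian byte order mark
        $raw = mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    } elseif (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        // UTF-8 byte order mark: just strip it
        $raw = substr($raw, 3);
    } elseif (strpos($raw, "\0") !== false) {
        // Assumption: NUL bytes without a BOM mean the fallback encoding
        $raw = mb_convert_encoding($raw, 'UTF-8', $fallbackEncoding);
    }

    return parse_ini_string($raw, true);
}
```

With the file from the question, $lang_handle = parseIniAsUtf8($file); should then yield section keys that compare equal to the plain "clientcockpit/login.php" string.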

Try this:
$lang_handle = parse_ini_string(file_get_contents($file), true);

Related

PHP Turkish Characters to ASCII Giving Same Output

ord('Ö') gives 195, and ord('Ç') gives 195 too. I don't understand what the error is. Can you help me?
ord — Convert the first byte of a string to a value between 0 and 255
https://www.php.net/manual/en/function.ord.php
The question is: what is the charset of the source file?
Since neither 'Ö' nor 'Ç' is an ASCII symbol, each is represented as two bytes in UTF-8 encoding:
Ö - 0xC3 0x96
Ç - 0xC3 0x87
As you can see, both characters have the same first byte, 0xC3 (= 195 decimal).
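To see those bytes for yourself, something like this should work. It uses the "\u{...}" escape (PHP 7.0+) so the result does not depend on how the script file is saved; the variable names are only for illustration.

```php
<?php
$o = "\u{00D6}"; // Ö as UTF-8
$c = "\u{00C7}"; // Ç as UTF-8

foreach ([$o, $c] as $ch) {
    // bin2hex() shows every byte; ord() only looks at the first one.
    printf("%s: bytes=%s ord=%d\n", $ch, bin2hex($ch), ord($ch));
}
// Ö: bytes=c396 ord=195
// Ç: bytes=c387 ord=195
```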
So you need to decide which code you want to get.
For example, you can convert the UTF-8 string into Windows-1254:
print ord(iconv('UTF-8', 'Windows-1254', 'Ö')); // 214
print ord(iconv('UTF-8', 'Windows-1254', 'Ç')); // 199
Or you may want the Unicode code point. To get it, you can first convert the string to UTF-32 and then decode the 32-bit number:
function get_codepoint($utf8char) {
    $bin = iconv('UTF-8', 'UTF-32BE', $utf8char); // convert to UTF-32 big endian
    $a = unpack('Ncp', $bin); // unpack binary data
    return $a['cp']; // get the code point
}
print get_codepoint('Ö'); // 214
print get_codepoint('Ç'); // 199
Or, in PHP 7.2 and later, you can simply use mb_ord:
print mb_ord('Ö'); // 214
print mb_ord('Ç'); // 199

Strlen not returning the correct string length [duplicate]

This question already has answers here:
strlen() and UTF-8 encoding
(6 answers)
Closed 4 years ago.
I have a string with this content :
$myString = 'Câmara de Dirigentes Lojistas';
This string has 29 characters, but when I call strlen it returns 30! Even when I call var_dump($myString), this is the result:
114:string 'Câmara de Dirigentes Lojistas' (length=30)
What is going on here? Maybe the problem is related to the special character â?
That's the right behavior since you are using UTF-8 encoding.
Please see this note on strlen() documentation
Note:
strlen() returns the number of bytes rather than the number of characters in a string.
Your string contains a multi-byte character (â), and UTF-8 uses two bytes to represent it, so strlen() counts one byte more than the number of characters.
To have the right string length, you must use the mb_strlen() function:
mb_strlen("â"); // 1
strlen("â"); // 2
There are several definitions of the "length" of a string, because there are a variety of tricks used to represent the huge range of accented characters, variants, and non-alphabetic scripts used around the world.
The number of bytes the string takes up. This is the easiest to calculate, but not always what is expected. For instance, in UTF-16, every code point takes up either 2 or 4 bytes; in UTF-8, code points take up 1, 2, 3, or 4 bytes. This is what strlen and most PHP functions work with.
The number of "code points": separate symbols in the character set. This is the next easiest, and the next most common, but is generally a compromise between bytes and "graphemes" (see below) - there aren't many cases where it's particularly useful to count é as 2 "characters" just because it's represented with a combining diacritic. In PHP you can use mb_strlen to count these, telling it your string's character encoding.
The number of "graphemes": separate symbols a reader would recognise. This is the most intuitive meaning, but the hardest for a computer to define. In PHP you can use grapheme_strlen, as long as you have ensured your string is encoded as UTF-8.
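The three definitions can be compared on a single string. This sketch assumes the mbstring extension is loaded; grapheme_strlen() comes from the intl extension, so the call is guarded. The variable names are illustrative.

```php
<?php
// "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\u{0301}";

$bytes      = strlen($s);             // 3: one byte for 'e', two for U+0301
$codePoints = mb_strlen($s, 'UTF-8'); // 2
// Only available with the intl extension; null means "could not check".
$graphemes  = function_exists('grapheme_strlen') ? grapheme_strlen($s) : null;

printf(
    "bytes=%d code points=%d graphemes=%s\n",
    $bytes,
    $codePoints,
    var_export($graphemes, true)
);
```

With intl installed, the grapheme count should come out as 1, which is what a reader would call the length of "é".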
The character â is not part of ASCII, and UTF-8 encodes it as two bytes (0xC3 0xA2). strlen() counts bytes, which is why it gives 30 and not 29.
To fix this, use mb_strlen() and pass the encoding:
$myString = 'Câmara de Dirigentes Lojistas';
echo mb_strlen($myString, 'UTF-8');
NOTE: If mb_strlen is undefined, you will have to enable the mbstring extension in your PHP settings.
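Interestingly, the character â also exists in ISO-8859-1 (Latin-1), where it is represented by a single byte. You can try it with this code:
$str = utf8_decode('Câmara de Dirigentes Lojistas');
echo 'length is ' . strlen($str);
That will output: length is 29. (Note that utf8_decode() is deprecated as of PHP 8.2; mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8') does the same conversion.)
So the point is that PHP strings are plain byte sequences: strlen() counts bytes, and a character outside the 128-character ASCII table takes more than one byte when the source file is saved as UTF-8.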
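The same comparison can be written without utf8_decode(). This is only a sketch that assumes mbstring is available; the string is built with a "\u{...}" escape so that it is UTF-8 no matter how the script file itself is saved.

```php
<?php
$utf8   = "C\u{00E2}mara de Dirigentes Lojistas";            // â as UTF-8 (2 bytes)
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // â as 1 byte

echo strlen($utf8), "\n";             // 30 bytes
echo strlen($latin1), "\n";           // 29 bytes
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 29 characters
```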

ord() doesn't work with utf-8

According to ISO 8859-1,
the € symbol has decimal value 128.
My default php script encoding is
echo mb_internal_encoding(); //ISO-8859-1
So now in PHP
echo chr(128); //Output exactly what i want '€'
But
echo ord('€'); //opposite it returns 226, it should be 128
Why is that?
This requires PHP 7.2.0 or later.
mb_ord()
You can now use mb_ord().
Example: echo mb_ord('€','UTF-8');
See also mb_chr(), which returns the UTF-8 representation of a decimal code point. Example: echo mb_chr(2048,'UTF-8');
The best practice is to be universal, save all your PHP scripts as UTF-8 (see #deceze).
According to Wikipedia and FileFormat,
ISO-8859-1 doesn't have the Euro symbol at all
ISO-8859-15 has it at codepoint 164 (0xA4)
Windows-1252 has it at codepoint 128 (0x80)
Unicode has the Euro symbol at codepoint 8364 (0x20AC)
UTF-8 encodes that as 0xE2 0x82 0xAC. The first byte E2 is 226 in decimal.
So it seems your source file is encoded in UTF-8 (and ord() only returns the first byte), whereas your output is in Windows-1252.
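The numbers above can be checked with a few lines. This is a sketch that assumes PHP 7.2+ with the mbstring extension; the variable names are illustrative.

```php
<?php
$euro = "\u{20AC}"; // U+20AC as UTF-8, independent of how this file is saved

echo bin2hex($euro), "\n";          // e282ac
echo ord($euro), "\n";              // 226: the first byte only
echo mb_ord($euro, 'UTF-8'), "\n";  // 8364: the Unicode code point

// Re-encode to Windows-1252, where the Euro sign is a single byte.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');
echo ord($cp1252), "\n";            // 128
```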
echo ord('€'); //opposite it returns 226, it should be 128
Your .php file is saved as UTF-8 (you saved it as UTF-8 in your text editor when you saved the file to disk). The string literal in there contains the bytes E2 82 AC; visualised it's something like this:
echo ord("\xE2\x82\xAC");
Open the file in a hex editor for real clarity.
ord only returns a single integer in the range of 0 - 255. Your string literal contains three bytes, for which ord would need to return three integers, which it won't. It returns only the first one, which is 226.
Save the file in different encodings in your text editor and you'll see different results.
This PHP function returns the decimal code point of the first character in a UTF-8 string.
If the number is lower than 128, the character is encoded in 1 octet.
Else if the number is lower than 2048, the character is encoded in 2 octets.
Else if the number is lower than 65536, the character is encoded in 3 octets.
Else if the number is lower than 1114112, the character is encoded in 4 octets.
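function ord_utf8($s) {
    // Pad with NUL bytes so all four offsets exist, even for short input.
    $b = unpack('C4', str_pad($s, 4, "\0"));
    if ($b[1] < 128) return $b[1]; // 1 octet (ASCII)
    if ($b[1] > 239 && $b[2] > 127 && $b[3] > 127 && $b[4] > 127) // 4 octets
        return (7 & $b[1]) << 18 | (63 & $b[2]) << 12 | (63 & $b[3]) << 6 | 63 & $b[4];
    if ($b[1] > 223 && $b[2] > 127 && $b[3] > 127) // 3 octets
        return (15 & $b[1]) << 12 | (63 & $b[2]) << 6 | 63 & $b[3];
    if ($b[1] > 193 && $b[2] > 127) // 2 octets
        return (31 & $b[1]) << 6 | 63 & $b[2];
    return 0; // not a valid UTF-8 lead sequence
}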
echo ord_utf8('€');
// Outputs 8364, so this character is encoded in 3 octets
You can check the result in https://eval.in/748181 …
The ord_utf8 function is the inverse of chr_utf8 (which returns one UTF-8 character for a decimal number):
function chr_utf8($n,$f='C*'){
    return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
        ($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
        ($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
for($test=1;$test<1114111;$test++)
    if (ord_utf8(chr_utf8($test))!==$test)
        die('Error found');
echo 'No error';
// Output No error

Real binary write PHP

How do I do something as simple as (in PHP) this code in C:
char buffer[5] = "testing";
FILE* file2 = fopen("data2.bin", "wb");
fwrite(buffer, sizeof buffer, 1, file2);
fclose(file2);
Whenever I try to write a binary file in PHP, it doesn't write in real binary.
Example:
$ptr = fopen("data2.bin", 'wb');
fwrite($ptr, "testing");
fclose($ptr);
I found on internet that I need to use pack() to do this...
What I expected:
testing\9C\00\00
or
7465 7374 696e 679c 0100 00
What I got:
testing412
Thanks
You're making the classic mistake of confusing data with the representation of that data.
Let's say you have a text file. If you open it in Notepad, you'll see the following:
hello
world
This is because Notepad assumes the data is ASCII text. So it takes every byte of raw data, interprets it as an ASCII character, and renders that text to your screen.
Now if you go and open that file with a hex editor, you'll see something entirely different [1]:
68 65 6c 6c 6f 0d 0a 77 6f 72 6c 64 hello..world
That is because the hex editor instead takes every byte of the raw data, and displays it as a two-character hexadecimal number.
[1] Assuming Windows \r\n line endings and ASCII encoding.
So if you're expecting hexadecimal ASCII output, you need to convert your string to its hexadecimal encoding before writing it (as ASCII text!) to the file.
In PHP, what you're looking for is the bin2hex function which "Returns an ASCII string containing the hexadecimal representation of str." For example:
$str = "Hello world!";
echo bin2hex($str); // output: 48656c6c6f20776f726c6421
Note that the "wb" mode argument doesn't cause any special behavior. It guarantees binary output, not hexadecimal output. I cannot stress enough that there is a difference. The only thing the b really does, is guarantee that line endings will not be converted by the library when reading/writing data.
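If the goal is instead the raw bytes from the hex dump in the question, pack() can build them. The sketch below assumes that the trailing 9c 01 00 00 was meant to be the number 412 stored as an unsigned 32-bit little-endian integer. That is a guess based on the expected output, and the temp file is only there for illustration.

```php
<?php
// Assumption: 412 should be written as an unsigned 32-bit little-endian integer.
$payload = 'testing' . pack('V', 412);

$path = tempnam(sys_get_temp_dir(), 'bin'); // hypothetical output file
$fp = fopen($path, 'wb');
fwrite($fp, $payload);
fclose($fp);

// bin2hex() is used here only to display the bytes that were written.
echo bin2hex(file_get_contents($path)), "\n"; // 74657374696e679c010000
unlink($path);
```

The file itself contains 11 raw bytes; the hexadecimal text exists only in the echo output.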

Why call mb_convert_encoding to sanitize text?

This is in reference to this (excellent) answer. It states that the best solution for escaping input in PHP is to call mb_convert_encoding followed by htmlentities.
But why exactly would you call mb_convert_encoding with the same to and from parameters (UTF-8)?
Excerpt from the original answer:
Even if you use htmlspecialchars($string) outside of HTML tags, you are still vulnerable to multi-byte charset attack vectors.
The most effective you can be is to use a combination of mb_convert_encoding and htmlentities as follows.
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
Does this have some sort of benefit I'm missing?
Not all binary data is valid UTF-8. Invoking mb_convert_encoding with the same from/to encodings is a simple way to ensure that one is dealing with a correctly encoded string for the given encoding.
A way to exploit the omission of UTF-8 validation is described in section 6 (Security Considerations) of RFC 2279:
Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.
This may be more easily understood by examining the binary representation:
110xxxxx 10xxxxxx # header bits used by the encoding
11000000 10101110 # C0 AE
         00101110 # 2E the '.' character
In other words: (C0 AE - header-bits) == '.'
As the quoted text points out, C0 AE is not a valid UTF8 octet sequence, so mb_convert_encoding would have removed it from the string (or translated it to '.', or something else :-).
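A quick way to see this in practice is sketched below, assuming the mbstring extension. What exactly replaces the invalid bytes depends on the mbstring substitute character setting, so the example prints the result rather than assuming it.

```php
<?php
$input = "/\xC0\xAE./"; // the illegal octet sequence from the RFC example

var_dump(mb_check_encoding($input, 'UTF-8')); // bool(false)

// Same from/to encoding: invalid sequences are replaced during conversion.
$clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');

var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
var_dump(strpos($clean, '/../') !== false);   // bool(false)
echo bin2hex($clean), "\n";                   // inspect the replacement bytes
```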
