How to convert binary string to normal string in PHP

Description of the problem
I am trying to import email content into a database table. Sometimes, I get an SQL error while inserting a message. I found that it fails when the message content is a binary string instead of a string.
For example, I get this in the console if I print a message that is imported successfully (truncated):
However, I get this with a problematic import:
I found out that if I use the function utf8_encode, I can import it into SQL successfully. The problem is that it "breaks" the accented characters in messages that previously imported successfully:
What I have tried
Detect whether the string is binary with ctype_print: it returned false for both the non-binary and the binary string. Otherwise I could have called utf8_encode only when the string was binary.
Use unpack: did not work.
Detect the string encoding with mb_detect_encoding: it returns UTF-8 for both.
Use iconv: failed with iconv(): Detected an illegal character in input string.
Cast the content to a string using (string) / settype($html, 'string').
Question
How can I transform the binary string into a normal string so that I can import it into my database without breaking accented characters in other imports?

This is pretty late, but for anyone else reading... Apparently the b prefix is meaningless in PHP, it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
What encodings did you pass to iconv()? This is the correct solution but you have to give it the correct first argument, which depends on your input. In my example I use "LATIN1" because that turned out to be the correct way to interpret my input but your use case may vary.
You can use mb_check_encoding() to check if it is valid UTF-8 or not. This returns a boolean.
Assuming the question is really something like "how to convert extended ascii string to valid utf-8 string in PHP" - Here is how I did it in my application:
// Convert only when the input is not already valid UTF-8.
if (!mb_check_encoding($string, 'UTF-8')) {
    $string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}
The "TRANSLIT" part tells it to attempt transliteration, that's optional for you. The "IGNORE" will prevent it from throwing Detected an illegal character in input string if it does detect one; instead the character will just get ignored, meaning, removed. Your use case may not need either of these.
When you're debugging, I recommend just using "UTF-8" as the second argument so you can see what it's doing. It's useful to see whether it throws an error. In my case, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN1") and it threw the illegal character error on an accented character. That error went away once I passed it the correct encoding.
By the way, mb_detect_encoding() was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of unpack("C*", $string) to see what exact bytes were in there. That's more debugging advice than solution but worth mentioning in case it helps.
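As a minimal sketch of that debugging step (the sample string and variable names are made up for illustration, and Latin-1 is only assumed to be the right source encoding here, which will not hold for every input), dumping the bytes makes the encoding question concrete:

```php
<?php
// Hypothetical input: "café" encoded as Latin-1, so é is the single byte 0xE9.
$string = "caf\xE9";

// unpack("C*") returns the bytes as unsigned integers, indexed from 1.
$bytes = unpack("C*", $string);

// Hex is usually easier to compare against encoding tables.
$hex = implode(' ', array_map(fn (int $b): string => sprintf('%02X', $b), $bytes));

// A lone 0xE9 is not valid UTF-8, so the check fails and the conversion runs.
$isUtf8 = mb_check_encoding($string, 'UTF-8');
$converted = $isUtf8 ? $string : iconv('ISO-8859-1', 'UTF-8', $string);
```

Here $hex comes out as 63 61 66 E9. A byte such as E9 sitting alone between ASCII letters is a strong hint that the input is a single-byte encoding rather than UTF-8.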

Related

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character coding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character coding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
    return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced
as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with the byte x80. Instead, you entered the € character, which is U+20AC in Unicode and is expressed as the three bytes xE2 x82 xAC in UTF-8. mb_convert_encoding maps that character to the single byte x80 in the CP1252 code page, which is why var_dump reports a length of 17. A lone x80 byte is not valid UTF-8, so whatever displays the output as UTF-8 shows the Unicode replacement character U+FFFD (�) in its place.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
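To make the "strings are just byte arrays" point concrete, here is a small sketch (variable names are mine, and the bytes are written as escapes so the result does not depend on how the source file is saved) that inspects the bytes directly instead of relying on what the terminal or browser renders:

```php
<?php
// The euro sign as UTF-8 (three bytes) and as a single CP1252 byte.
$euroUtf8 = "\xE2\x82\xAC";
$euroCp1252 = "\x80";

$utf8Bytes = strlen($euroUtf8);             // byte count, not character count
$utf8Chars = mb_strlen($euroUtf8, 'UTF-8'); // character count when read as UTF-8
$utf8Hex = bin2hex($euroUtf8);
$cp1252Hex = bin2hex($euroCp1252);

// The single byte 0x80 is not a valid UTF-8 sequence on its own, which is
// why a UTF-8 terminal or browser cannot render it as a character.
$cp1252IsValidUtf8 = mb_check_encoding($euroCp1252, 'UTF-8');
```

bin2hex() and strlen() report what is actually stored, regardless of the encoding the viewer assumes.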
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters from a string to the PHP hex representation "\xnn" and leave correct UTF8 characters.
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you Emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
foreach (str_split($s) as $c) {
    // Escape every byte above 0x7F; plain ASCII passes through unchanged.
    $buf .= ord($c) >= 0x80 ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
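The loop above escapes every byte over x7F, including bytes that belong to valid UTF-8 characters. If the goal is closer to the original question (keep well-formed UTF-8, escape only the stray bytes), something along these lines might work. This is only a sketch: escapeInvalidUtf8 is a name I made up, it relies on the mbstring extension, and it assumes that "valid" means whatever mb_check_encoding() accepts.

```php
<?php
// Hypothetical helper: keeps well-formed UTF-8 sequences and writes every
// other non-ASCII byte as a literal \xnn escape.
function escapeInvalidUtf8(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len;) {
        $byte = ord($s[$i]);
        if ($byte < 0x80) {           // plain ASCII is copied unchanged
            $out .= $s[$i];
            $i++;
            continue;
        }
        // Sequence length implied by the lead byte (0 = cannot start a character).
        $n = $byte >= 0xF0 ? 4 : ($byte >= 0xE0 ? 3 : ($byte >= 0xC2 ? 2 : 0));
        $seq = $n > 0 ? substr($s, $i, $n) : '';
        if ($n > 0 && strlen($seq) === $n && mb_check_encoding($seq, 'UTF-8')) {
            $out .= $seq;             // complete, valid character: keep it
            $i += $n;
        } else {
            $out .= '\x' . bin2hex($s[$i]); // stray byte: escape just this one
            $i++;
        }
    }
    return $out;
}
```

With this, "The price is 15 \x80" becomes the same 20-character escaped string as before, while a properly encoded € (xE2 x82 xAC) is passed through untouched.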

PHP string encoding is not recognized by strpos()?

I have a binary Word .doc that looks something like this in string format:
þÿÿÿÿÿÿÿppp„±¶g œÙ Text in word doc here I'm interested in [|`ñÿ|Standard1$S_HmHnHsHtHOJPJQJCJEH567>
When I echo that string, I can see all the text I'm interested in between the unrecognized characters (I'm not worried about those, since I only want the text). The issue is that PHP does not seem to recognize it as a string, so I cannot search it: strpos(), strchr() and mb_strpos() all return nothing. No -1, no error in the PHP error log, just nothing.
However, when I call gettype() I get string. I suspect this is an encoding issue, but mb_detect_encoding returns UTF-8. I have tried converting it to several different encodings, to no avail.
How can I get PHP to search this string? I understand that parsing a Word .doc is a more complex issue, but for my purposes the plain text I'm interested in is in the binary data. Does anyone have any experience with this?
Thank you :)
Since your string seems to be binary and you are only interested in the text, a quick solution would be to use filter_var to strip the non-printable and non-ASCII characters from the string. Try using this before searching:
$clean_string = filter_var($str, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
Notice the part "Standard1$". Inside a double-quoted string, PHP treats the $ as the start of a variable name instead of a literal character.
Check here:
<?php
$s = "þÿÿÿÿÿÿÿppp„±¶g œÙ Text in word doc here I'm interested in [|`ñÿ|Standard1$S_HmHnHsHtHOJPJQJCJEH567>";
$s2 = strpos($s, "interested");
echo $s2;
?>
You might want to put a backslash before that $ sign, or use a single-quoted string.
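One more thing that may explain the "returns nothing" symptom: strpos() returns false (not -1) when there is no match, and echoing false prints an empty string. A minimal sketch, using a made-up stand-in for the document contents:

```php
<?php
// Hypothetical stand-in for the binary document: single quotes keep "$" literal.
$doc = "\xFE\xFF\xFF" . 'Text in word doc here I\'m interested in [|Standard1$S_Hm';

$found = strpos($doc, 'interested');   // int byte offset on success
$missing = strpos($doc, 'not there');  // false on failure, never -1

// Compare with === false: a match at offset 0 is also falsy.
$report = static fn ($pos): string => $pos === false ? 'not found' : "found at byte $pos";
```

Using var_dump() instead of echo while debugging makes the difference between false, 0 and an empty result visible.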

Why is my PHP urlencode not functioning as examples on internet?

Why does my urlencode() produce something different than I expected?
This might be my expectations being wrong but then I would be even more puzzled.
example
urlencode("ä");
expectations = returns %C3%A4
reality = returns %E4
Where have I gone wrong in my expectations? It seems to be linked to encoding, but I'm not very familiar with what I should do or use.
Should I change something on my server so that the function uses the right encoding?
urlencode encodes the raw bytes in your string into a percent-encoded representation. If you expect %C3%A4 that means you expect the UTF-8 byte representation of "ä". If you get %E4 that means your string is actually encoded in ISO-8859-1 instead.
Encode your string in UTF-8 to get the expected result. How to do this depends on where this string comes from. If it's a string literal in your source code file, save the file as UTF-8 in your text editor. If it comes from a database, see UTF-8 all the way through.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
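A short sketch of the difference described above. The bytes are written as escapes so that the result does not depend on how the source file is saved, and the mb_convert_encoding() call assumes the input really is ISO-8859-1:

```php
<?php
// "ä" as raw bytes in each encoding.
$latin1 = "\xE4";
$utf8 = "\xC3\xA4";

$fromLatin1 = urlencode($latin1); // "%E4"
$fromUtf8 = urlencode($utf8);     // "%C3%A4"

// Re-encoding the ISO-8859-1 bytes as UTF-8 before urlencode() gives the
// result the question expected.
$fixed = urlencode(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'));
```

urlencode() itself is doing the same thing in both cases; only the input bytes differ.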

How to check if a string can safely be converted in another character set without loss?

Is it possible, prior to converting a string from a charset to another, to know whether this conversion will be lossless?
If I try to convert a UTF-8 string to latin1, for example, the chars that can't be converted are replaced by ?. Checking for ? in the result string to find out if the conversion was lossless is obviously not a choice.
The only solution I can see right now is to convert back to the original charset, and compare to the original string:
function canBeSafelyConverted($string, $fromEncoding, $toEncoding)
{
    $encoded = mb_convert_encoding($string, $toEncoding, $fromEncoding);
    $decoded = mb_convert_encoding($encoded, $fromEncoding, $toEncoding);
    // Strict comparison: with ==, numeric strings such as "1e3" and "1000" compare equal.
    return $decoded === $string;
}
This is just a quick&dirty one though, that may come with unexpected behaviours at times, and I guess there might be a cleaner way to do this with mbstring, iconv, or any other library.
An alternative way is to set up your own error handler with set_error_handler(). If you use iconv() on the string, it will raise a notice if the string cannot be fully converted, which you can catch there and react to in your code.
Or you could just count the number of question marks before and after encoding. Or call iconv() with //IGNORE and count the number of characters.
None of these suggestions is much more elegant than yours, but they get rid of the double processing.
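A rough sketch of the error-handler idea, in case it helps. tryIconv is a hypothetical helper name, and which inputs actually trigger the notice may depend on the iconv implementation PHP was built against, so treat this as a starting point rather than a guarantee:

```php
<?php
// Hypothetical helper: returns the converted string, or null when iconv()
// reports a problem (via a notice/warning or a false return value).
function tryIconv(string $s, string $from, string $to): ?string
{
    // Turn any PHP error raised during the conversion into an exception.
    set_error_handler(static function (int $errno, string $errstr): bool {
        throw new ErrorException($errstr, 0, $errno);
    });
    try {
        $result = iconv($from, $to, $s);
        return $result === false ? null : $result;
    } catch (ErrorException $e) {
        return null;
    } finally {
        // Always put the previous handler back, even on failure.
        restore_error_handler();
    }
}
```

With glibc's iconv, a character that has no equivalent in the target encoding (for example € going to ISO-8859-1) should be reported the same way as malformed input. Other implementations may substitute a placeholder instead, so this is worth verifying on the target platform.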

Handling Multibyte characters in php

I am working on a PHP-based MIME parser. If the body contains a string like Iñtërnâtiônàlizætiøn, we see that it is getting converted into Iñtërnâtiônàlizætiøn. Can somebody suggest how (with which functions) to handle such strings?
So we are doing the following
Using Zend Library connecting to the IMAP server
$mail = new Zend_Mail_Storage_Imap($params);
Read the message using
$message = $mail->getMessage($i);
in the loop.
When we print the $message we see the string e.g. Iñtërnâtiônàlizætiøn printed as Iñtërnâtiônà lizætiøn.
What I need to know is whether there is some way to retain the original string. This is just one example; we may run into other multi-byte characters, so we want to know how to handle this generically.
There's no specific function for that, you simply need to treat the string in the encoding it's in. A string is just a blob of bytes, it gets turned into characters by whatever is interpreting those bytes as text. And that something needs to use the correct encoding for that, otherwise those bytes are not interpreted as the characters they were supposed to be. See Handling Unicode Front To Back In A Web App for a rundown of the common pitfalls.
As mentioned in the comment, you can use the PHP mb_* functions to work with multibyte characters. Here is just an example to detect the encoding of a string:
$s="Iñtërnâtiônàlizætiøn";
echo mb_detect_encoding($s); //UTF-8
Then you can work with this: use utf8_decode($s) or any of the mb_ functions to convert the string to your desired encoding.
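The garbled output in the question is what UTF-8 bytes look like when each byte is read as a character of a single-byte encoding and then re-encoded. A minimal sketch reproducing the effect on one character (ISO-8859-1 is assumed as the wrong interpretation here; where exactly this happens in the Zend/IMAP setup is not clear from the question):

```php
<?php
// "ñ" as UTF-8 bytes.
$utf8 = "\xC3\xB1";

// Treating those bytes as ISO-8859-1 and converting to UTF-8 encodes each
// byte separately: 0xC3 becomes "Ã" and 0xB1 becomes "±".
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The reverse conversion restores the original bytes, provided nothing else
// has altered the string in between.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
```

This is the same conversion that utf8_encode() performs, which is why applying it to text that is already UTF-8 produces this kind of output.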
