Reading Encoded File in PHP - php

I have a text file that displays as below when I open it in Sublime Text:
But when I fread the file and echo each line in php, I get garbled characters like this:
I read Joel's post on encoding and understood the basics. Also, when I run mb_detect_encoding, it detects the string as UTF-8. What I do not understand is what to do with that information: how do I use the detected encoding to display the string, or convert it, so that it shows readable characters like in the first picture?
And why does it display garbled characters when it is already UTF-8? Is PHP using a different encoding to read the file? Does a PHP string have to be in UTF-8 or ASCII, or does it not matter as long as I specify what it is?
I would really appreciate it if someone could help me understand the idea! Thanks.
EDIT:
Pedro Lobito and Peter's suggestions worked.
$file = file_get_contents($bl_file);
$content = unpack("H*", $file);
But if someone could explain why I have to do it this way, that would still help me understand it!

But if someone could explain why I have to do it this way, that would still help me understand it!
Because it's a binary file. Sublime shows you a human-readable hex representation of the raw binary values.
When you call file_get_contents, you read the file into a string of raw bytes (think '0101010').
When you call unpack("H*", $file) (H is for hex), you're telling PHP that you want to see your binary data as a human-readable, hex-encoded representation of the byte stream. (You can tell it's hex when you see the letters A-F.)
HEX encoding is far more readable than binary, that's why Sublime uses it. Also, I once saw a man who can code in raw binary. I was scared.
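A minimal sketch of the steps described above, assuming a made-up temporary file that holds five arbitrary bytes. It shows that unpack returns an array (the hex string sits at index 1), that bin2hex yields the same string directly, and how to space the digits out into the kind of view a hex-aware editor shows.

```php
<?php
// Hypothetical sample data: five raw bytes written to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'bin');
file_put_contents($path, "\x00\x01\xFF\x41\x42");

$raw = file_get_contents($path);      // binary-safe string, 5 bytes long

$unpacked = unpack('H*', $raw);       // array with a single element at index 1
$hex = $unpacked[1];                  // lowercase hex digits, two per byte
$same = bin2hex($raw);                // same result without the array

// Group the digits per byte for readability.
$dump = implode(' ', str_split($hex, 2));

unlink($path);
echo $dump, PHP_EOL;
```

Printing $dump is safe on any terminal because it only contains hex digits and spaces, unlike the raw bytes in $raw.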
When you echo them, you're just streaming raw binary values to the console. They have no special meaning, so the console (or browser) shows them as control characters and other garbage, which has no meaning to the human eye.
So, if you open this file with another text editor, it will:
a) show garbage (mcedit)
b) show garbage and tell you that it's a binary file (vim, gedit)
Sublime tricked you into thinking it's a text file by being too friendly.
If you echo binary files to your command prompt or shell, it can destroy your data. Never do this, because the shell may interpret raw binary data as a command and run it.
If you echo a binary file that contains something like this:
rm -rf ~/[bytecode_For_NewLine_Here],
you can delete the contents of your home folder on Linux.

Related

How can I reproducibly represent a non-UTF8 string in PHP (Browser)

I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?
I would like to illustrate the problem with an example.
$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');
var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol. The string is only generated here with mb_convert_encoding for test purposes. Here the character encoding is known. In practice, it comes from imports, e.g. with file_get_contents(), and the character encoding is unknown.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.
Another variant is to output all characters in hexadecimal PHP notation.
function strToHex2($str) {
    return '\x' . rtrim(chunk_split(strtoupper(bin2hex($str)), 2, '\x'), '\x');
}
echo strToHex2($stringWin1252);
Output:
\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80
This variant is well suited for purely binary data, but quite large and difficult to read for general texts.
My question in other words:
How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" and leave correct UTF8 characters untouched?
I'm going to start with the question itself:
How can I reproducibly represent a non-UTF8 string in PHP (Browser)
The answer is very simple, just send the correct encoding in an HTML tag or HTTP header.
But that wasn't really your question. I'm actually not 100% sure what the true question is, but I'm going to try to follow what you wrote.
I received a string with an unknown character encoding via import.
That's really where we need to start. If you have an unknown string, then you really just have binary data. If you can't determine what those bytes represent, I wouldn't expect the browser or anyone else to figure it out either. If you can, however, determine what those bytes represent, then once again, send the correct encoding to the client.
How can I display such a string in the browser so that it can be reproduced as PHP code?
You are round-tripping here, which is asking for problems. The only safe and sane answer is Unicode with one of the officially supported encodings such as UTF-8, UTF-16, etc.
The string delivered with var_export does not match the original. All unrecognized characters are replaced by the � symbol.
The string you entered as a sample did not end with the byte x80. Instead, you entered the € character, which is U+20AC in Unicode and is expressed as the three bytes xE2 x82 xAC in UTF-8. mb_convert_encoding maps it to the single byte x80 of the CP1252 codepage, which is why var_dump reports a length of 17. A lone x80 byte is not valid UTF-8, so wherever the output is interpreted as UTF-8, the Unicode replacement character U+FFFD is shown in its place.
The string is only generated here with mb_convert_encoding for test purposes.
Even if this is just for testing purposes, it is still messing with the data, and the previous paragraph is important to understand.
Here the character coding is known. In practice, it comes from imports e.g. with file_get_contents() and the character coding is unknown.
We're back to arbitrary bytes at this point. You can either have PHP guess, or if you have a corpus of known data you could build some heuristics.
The output with an improved var_export that I expect looks like this:
"The price is 15 \x80"
Both var_dump and var_export are intended to show you quite literally what is inside the variable, and changing them would have a giant BC problem. (There actually was an RFC for making a new dumping function but I don't think it did what you want.)
In PHP, strings are just byte arrays so calling these functions dumps those byte arrays to the stream, and your browser or console or whatever takes the current encoding and tries to match those bytes to the current font. If your font doesn't support it, one of the replacement characters is shown. (Or, sometimes a device tries to guess what those bytes represent which is why you see € or similar.) To say that again, your browser/console does this, PHP is not doing that.
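To make the "strings are just byte arrays" point concrete, here is a small sketch. The two literals are assumptions chosen to mirror the example from the question; the € is written as explicit bytes so the result does not depend on how the source file itself is saved. It needs the mbstring extension.

```php
<?php
$utf8   = "The price is 15 \xE2\x82\xAC"; // "€" as its three UTF-8 bytes
$cp1252 = "The price is 15 \x80";         // "€" as the single CP1252 byte

// strlen() counts bytes, not characters.
$lenUtf8   = strlen($utf8);
$lenCp1252 = strlen($cp1252);

// mb_strlen() counts characters once it is told the encoding.
$charsUtf8 = mb_strlen($utf8, 'UTF-8');

// PHP stores both strings without complaint; only a validity check
// reveals that the second one is not well-formed UTF-8.
$okUtf8   = mb_check_encoding($utf8, 'UTF-8');
$okCp1252 = mb_check_encoding($cp1252, 'UTF-8');

printf("%d bytes / %d chars, %d bytes\n", $lenUtf8, $charsUtf8, $lenCp1252);
var_dump($okUtf8, $okCp1252);
```

PHP echoes both strings byte for byte; whether a glyph or a replacement symbol appears is decided by whatever decodes the output.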
My approach to the solution is to find all non-UTF8 characters
That's probably not what you want. First, it assumes that the characters are UTF-8, which you said was not an assumption that you can make. Second, if a file actually has byte sequences that aren't valid UTF-8, you probably have a broken file.
How can I change all non-UTF8 characters in a string to the PHP hex representation "\xnn" and leave correct UTF8 characters untouched?
The real solution is to use Unicode all the way through your application and to enforce an encoding whenever you store/output something. This also means that when viewing this data that you have a font capable of showing those code points.
When you ingest data, you need to get it to this sane point first, and that's not always easy. Once you are in Unicode, however, you should (mostly) be safe. (For "mostly", I'm looking at you, emojis!)
But how do you convert? That's the hard part. This answer shows how to manually convert CP1252 to UTF-8. Basically, repeat with each code point that you want to support.
If you don't want to do that, and you really want to have the escape sequences, then I think I'd inspect the string byte by byte, and anything over x7F gets escaped:
$s = "The price is 15 \x80";
$buf = '';
// Walk the string byte by byte and escape every byte above 0x7F as \xnn.
foreach (str_split($s) as $c) {
    $buf .= ord($c) >= 0x80 ? '\x' . bin2hex($c) : $c;
}
var_dump($buf);
// string(20) "The price is 15 \x80"
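The byte-by-byte loop also escapes bytes that belong to valid UTF-8 characters. If the goal is the one stated in the question (keep well-formed UTF-8, escape everything else), one possible sketch is a byte-mode regular expression that matches runs of well-formed UTF-8 and falls back to a single byte otherwise. The function name escapeInvalidUtf8 is made up for this illustration, and the pattern is an assumption based on the well-formed byte ranges in the Unicode standard.

```php
<?php
// Hypothetical helper: keeps valid UTF-8, turns every other byte into \xNN.
function escapeInvalidUtf8(string $s): string
{
    // Group 1: a run of well-formed UTF-8 sequences.
    // Group 2: any single byte that did not fit into such a run.
    // No /u modifier, so PCRE works on raw bytes.
    $pattern = '/((?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})+)|(.)/s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[2]) && $m[2] !== '') {
            return '\x' . strtoupper(bin2hex($m[2])); // stray byte
        }
        return $m[1];                                 // valid UTF-8, unchanged
    }, $s);

    return $result ?? $s; // fall back to the input if PCRE reports an error
}

echo escapeInvalidUtf8("The price is 15 \x80"), PHP_EOL;
```

This sketch does not escape backslashes, double quotes, or dollar signs already present in the input, so the output is not guaranteed to round-trip as a PHP double-quoted literal without that extra step.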

file_get_contents returns bizarre characters from raw text file

This is very bizarre. I have a .txt file on my Windows server. I'm using file_get_contents to retrieve it, but the first several characters show up as a diamond with a question mark inside. I've tried recreating the file from scratch and I get the same result. What's really bizarre is that other files don't have this issue.
Also, if I put a * at the start of the file it seems to fix it, but if I try to open the file and do it with PHP it's still messed up.
The start of the file in question begins with: Trinity Cannon - that's a direct copy and paste from the text file. I've tried re-typing it and the first few characters are always that diamond with a question mark.
$myfile='C:\\inetpub\\wwwroot\\fastpitchscores\\data\\2020.txt';
$fh = file_get_contents($myfile);
echo $fh; // Trinity Cannon
echo $fh[0]; // �
It sounds like whatever editor you used to originally create the file added a UTF byte order mark (BOM) at the beginning of the file.
You typically can't edit the BOM from within an editor. If your editor has encoding conversion functionality, try converting to ASCII. For example, in Notepad++ use Encoding->Encode in ANSI.
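If re-saving the file is not an option, the mark can also be removed after reading. This is a minimal sketch that assumes the file starts with the UTF-8 BOM (bytes EF BB BF); a UTF-16 file would start with FF FE or FE FF and would need a real conversion instead. The helper name stripUtf8Bom is hypothetical, and the literal stands in for the result of file_get_contents($myfile).

```php
<?php
// Hypothetical helper: drops a leading UTF-8 byte order mark if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

// Stand-in for file_get_contents($myfile) on a BOM-prefixed file.
$raw = "\xEF\xBB\xBFTrinity Cannon";

// Inspecting the first bytes as hex shows what the editor hides.
$firstBytes = bin2hex(substr($raw, 0, 3));

$clean = stripUtf8Bom($raw);
echo $firstBytes, ' -> ', $clean[0], PHP_EOL;
```

Indexing with $fh[0] returns a single byte, so on the original string it yields only the first third of the BOM, which on its own is not a displayable character.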

PHP upload text file encoding check and manipulation

I have a standard file upload where the user is supposed to upload a text file. But "text file" is not equal to "text file". The same file can have different encodings: UTF-8, UTF-7, UTF-16, UTF-32, ASCII, and ANSI.
To be more clear, I noticed that some encodings cannot represent all the characters that another encoding can.
Four questions:
1. Which encoding is "the most complete", i.e. the one that any other encoding can be converted into without losing content?
2. How do I check whether the file is a text file and not a binary?
3. How do I check whether the content of the text file is Base64 encoded or not?
4. If the uploaded encoding is not "the most complete", how do I change the encoding "on the fly" to "the most complete" encoding (see question 1)?
I do not want to flood this post with the whole code, so let's assume I have the form with action="upload.php"; now comes the part where I need to check the above.
$target_dir = "uploads/";
$target_file = $target_dir . basename($_FILES["fileToUpload"]["name"]);
[...]
// this is the check after the upload
if (isset($_POST["submit"])) {
    // check 1: what encoding has been uploaded?
    // check 2: is the file a text file and not a binary?
    // check 3: is the content of the file Base64-encoded text?
}
// if the encoding is different to the "most preferred" change the encoding to the "most preferred"
[...]
Can you please help quickly?
Which encoding is "the most complete", i.e. the one that any other encoding can be converted into without losing content?
Unicode. Choose any of the common encodings of the Unicode standard, like UTF-8 or UTF-16. The de facto standard on the internet is UTF-8.
How do I check whether the file is a text file and not a binary?
There is no real difference between the two. Text files also just contain binary data; it just so happens that this binary data, interpreted in the right encoding, results in human-readable text.
You can try to check whether the file contains a lot of "control characters" or NUL bytes or such, then it may not be text.
You can try confirming whether the file is valid in any of your expected encodings. Have a list of supported/expected encodings at hand and check against that list. Note though that any random binary garbage is "valid" in any single byte encoding like ISO-8859-1...
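Both checks can be sketched roughly as follows. The function names, the 10% control-character threshold, and the candidate list are assumptions made for illustration, not established rules. The NUL-byte test will also flag UTF-16 and UTF-32 text as binary, so those would need to be tested before it. Requires mbstring and PHP 7.4 or later.

```php
<?php
// Hypothetical heuristic: NUL bytes or many control characters suggest binary.
function looksBinary(string $data): bool
{
    if ($data === '') {
        return false;
    }
    if (strpos($data, "\0") !== false) {
        return true;
    }
    // Control characters other than tab, line feed and carriage return.
    $controls = preg_match_all('/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]/', $data);
    return $controls / strlen($data) > 0.1; // arbitrary 10% threshold
}

// Returns the encodings from $expected in which $data is well-formed.
function candidateEncodings(string $data, array $expected): array
{
    return array_values(array_filter(
        $expected,
        fn (string $enc): bool => mb_check_encoding($data, $enc)
    ));
}

$expected = ['UTF-8', 'ISO-8859-1'];
print_r(candidateEncodings("caf\xE9", $expected));
```

As the answer warns, a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so its presence in the result says very little.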
check if the content of the text file is base64 encoded or not?
Try to decode it as Base64. If it decodes properly, then it was probably Base64 encoded. If it can't be decoded due to bad/malformed characters, then it probably wasn't. However, this can easily yield false positives, as simple short text sequences may look like Base64 encoded text.
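A sketch of that check, using the strict mode of base64_decode and a round-trip comparison. The helper name looksLikeBase64 is hypothetical, and stripping whitespace first is an assumption about how the uploaded files are wrapped.

```php
<?php
// Hypothetical helper: true if $s decodes strictly and re-encodes to itself.
function looksLikeBase64(string $s): bool
{
    $s = preg_replace('/\s+/', '', $s);   // ignore line wrapping
    if ($s === '' || strlen($s) % 4 !== 0) {
        return false;
    }
    $decoded = base64_decode($s, true);   // strict: false on invalid characters
    return $decoded !== false && base64_encode($decoded) === $s;
}

var_dump(looksLikeBase64(base64_encode('hello world')));
var_dump(looksLikeBase64('hello world!'));
var_dump(looksLikeBase64('test'));
```

The last call illustrates the false positives mentioned above: the plain word "test" happens to be valid Base64, so the function accepts it.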
If the uploaded encoding is not "the most complete", how do I change the encoding "on the fly" to "the most complete" encoding (see question 1)?
If it's not UTF-8 encoded, convert it to UTF-8... from its original encoding...
How do you know its original encoding? You don't. You can guess. Again, have a list of encodings at hand and check them off one by one, using the one that seems most likely.
This doesn't sound very sane to you? Well, that's because it isn't.
Trying to handle unknown encodings is a nightmare you best try to avoid outright.
There is no right answer. There will be false positives. You cannot be sure without having a human confirm the result. If you have a text file in an unknown encoding, try to interpret it in all known encodings, rule out the ones in which it cannot be decoded correctly, and let a human pick the best result. There are libraries which implement such guessing/detection logic, probably paired with statistical text analysis to guesstimate the likelihood of decoded text being actual text, but be aware that all such libraries fundamentally can only provide you with a best guess.
Or know what the encoding is to begin with. From meta data, or by having a human tell you.
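The guess-and-convert approach described above might look like this in its simplest form. The function name toUtf8 and the list of legacy candidates are assumptions; the result is only a best guess and, as noted, cannot replace knowing the encoding. Requires mbstring.

```php
<?php
// Hypothetical helper: returns UTF-8 text, or null if no candidate fits.
function toUtf8(string $data, array $legacy = ['Windows-1252', 'ISO-8859-1']): ?string
{
    // Well-formed UTF-8 is taken at face value and left untouched.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // Otherwise guess among the legacy candidates (strict mode).
    $guess = mb_detect_encoding($data, $legacy, true);
    if ($guess === false) {
        return null;
    }
    return mb_convert_encoding($data, 'UTF-8', $guess);
}

echo bin2hex(toUtf8("caf\xE9")), PHP_EOL;
```

Checking for valid UTF-8 first avoids the most common accident, which is converting text that is already UTF-8 a second time.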
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Why is my PHP urlencode not functioning as examples on internet?

Why does my urlencode() produce something different than I expected?
This might be my expectations being wrong but then I would be even more puzzled.
example
urlencode("ä");
expectations = returns %C3%A4
reality = returns %E4
Where have I gone wrong in my expectations? It seems to be linked to encoding, but I'm not very familiar with what I should do or use.
Should I change something on my server so that the function uses the right encoding?
urlencode encodes the raw bytes in your string into a percent-encoded representation. If you expect %C3%A4 that means you expect the UTF-8 byte representation of "ä". If you get %E4 that means your string is actually encoded in ISO-8859-1 instead.
Encode your string in UTF-8 to get the expected result. How to do this depends on where this string comes from. If it's a string literal in your source code file, save the file as UTF-8 in your text editor. If it comes from a database, see UTF-8 all the way through.
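A short sketch of the difference. The character is written as explicit bytes so the result does not depend on how the source file is saved; the conversion step assumes the input really is ISO-8859-1 and requires mbstring.

```php
<?php
$latin1 = "\xE4";      // "ä" as one ISO-8859-1 byte
$utf8   = "\xC3\xA4";  // "ä" as two UTF-8 bytes

// urlencode() percent-encodes whatever bytes it is given.
$fromLatin1 = urlencode($latin1);
$fromUtf8   = urlencode($utf8);

// Converting to UTF-8 first gives the result the question expected.
$converted = urlencode(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'));

echo $fromLatin1, ' ', $fromUtf8, ' ', $converted, PHP_EOL;
```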
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

How to convert unknown/mixed encoding file to UTF-8

I am retrieving an XML file from a remote service which is supposed to be UTF-8, as the header is <?xml version="1.0" encoding="UTF-8"?>. However, certain parts of it are apparently not UTF-8: when I load it into PHP's XMLReader extension, it throws some sort of "Not UTF-8 as expected" error when parsing certain parts of the document (parts that look like they have been copy-pasted directly from MS Word).
I am looking for ideas to solve this error. Is there some program I can use to "fix" the file of any non-UTF-8 encodings? A PHP solution or any other solution will do.
Depending on which encoding you are converting from, the utf8_encode function is a quick and easy way to get UTF-8-safe strings, but it only works for ISO-8859-1 input. Also, your text must not already be UTF-8, otherwise there is a good chance you will end up with garbled text. (Note that utf8_encode is deprecated as of PHP 8.2.)
See the man page for more info:
// Usage can be as simple as this.
$name = utf8_encode($contact['name']);
On the other hand, if you need to convert from any other encoding, you may have to look into the iconv() function.
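For a file that is mostly UTF-8 with a few stray legacy bytes, one possible sketch is to convert only the lines that fail a UTF-8 check. The helper name repairMixedUtf8 and the Windows-1252 fallback are assumptions (chosen because the question mentions text pasted from MS Word). A line that mixes valid multi-byte UTF-8 with legacy bytes would still come out garbled. Requires mbstring.

```php
<?php
// Hypothetical helper: re-encodes only the lines that are not valid UTF-8.
function repairMixedUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            // Assumption: anything that is not UTF-8 is in $fallback.
            $lines[$i] = mb_convert_encoding($line, 'UTF-8', $fallback);
        }
    }
    return implode("\n", $lines);
}

// First line is valid UTF-8, second line has Windows-1252 curly quotes.
$mixed = "caf\xC3\xA9\n\x93Word\x94";
echo bin2hex(repairMixedUtf8($mixed)), PHP_EOL;
```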
Good-luck
