On my site I allow direct text file uploads. These files are then stored on the server and displayed on the website. I use UTF-8 on the site.
Now I run into trouble when people upload non-UTF-8 files which contain special chars, such as é.
I've been doing some testing. Made 2 text files, both containing the same word fiancée. One encoded UTF-8 and one encoded ISO 8859-2.
The UTF-8 one uploads fine and shows the text correctly, but the ISO 8859-2 one shows as fianc�e.
Now I've tried to detect the encoding of the uploaded file content with mb_detect_encoding, but whatever file I throw at it, it always detects UTF-8.
I noticed that I can use utf8_encode to convert the ISO 8859-2 files to valid UTF-8, but this only works on non-UTF files. And as I currently cannot detect non-UTF files, I cannot use the utf8_encode function, as it messes up valid UTF-8 files.
Hope this makes sense :)
So my question is: how can I detect files that are for sure not UTF-8 encoded to start with, so that I can use the utf8_encode function on them?
You cannot. Welcome to encodings.
Seriously though, files are just binary blobs. The bits and bytes in the file could mean anything at all; it could be images, CAD data or, perhaps, text. It depends on how you interpret the bytes. For text files that specifically means with which encoding you interpret them. There's nothing in the files themselves that tells you the correct encoding, you have to know it. Typically you want to know it from metadata accompanying the file. In the case of random user uploads though, there is no metadata, and/or it wouldn't be reliable. So you cannot "know".
The next step would be to guess, but that is obviously not foolproof. You can rule out certain encodings, for example if a file does not validate as UTF-8 (mb_check_encoding($data, 'UTF-8') == false), then it cannot be UTF-8. However, any single byte encoding will validate as any other single byte encoding. It's impossible to distinguish ISO-8859-1 from ISO-8859-2 this way, the bytes are equally valid in both. It's just that the characters that show up may not be the ones you want. To detect that automatically you need a statistical language analyser which can tell you that this character probably shouldn't show up in that word for it to be grammatical. Obviously for that to work you need to know the language used in the file, or you need to detect that first… And even then this is hardly foolproof.
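A minimal sketch of the ruling-out step, assuming the mbstring extension is available. The helper name couldBeUtf8 and the sample bytes are illustrative and not part of the original answer. It shows that UTF-8 can be excluded, while a single byte such as 0xB1 converts without complaint from either ISO-8859-1 or ISO-8859-2 and merely yields a different character.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8, i.e. UTF-8 cannot be ruled out.
function couldBeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

$utf8File   = "fianc\xC3\xA9e"; // "fiancée" saved as UTF-8
$latin2File = "fianc\xE9e";     // "fiancée" saved as ISO-8859-2 (0xE9 is é there)

var_dump(couldBeUtf8($utf8File));   // bool(true)
var_dump(couldBeUtf8($latin2File)); // bool(false)

// The same byte is valid in both single-byte encodings but stands for different characters.
$asLatin1 = mb_convert_encoding("\xB1", 'UTF-8', 'ISO-8859-1'); // U+00B1 (plus-minus sign)
$asLatin2 = mb_convert_encoding("\xB1", 'UTF-8', 'ISO-8859-2'); // U+0105 (a with ogonek)
var_dump($asLatin1 === $asLatin2);  // bool(false)
```

Neither conversion reports an error, which is why validation alone cannot choose between the two single-byte encodings.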
The sanest way is to ask the user. Accept the upload, perhaps do some upfront testing on which encodings can be ruled out, then ask the user which of a bunch of possible encodings the file is in. Present them the result, what the file looks like when interpreted as the chosen encoding, let the user confirm that it looks alright. Many decent text editors do this when you open a file with an ambiguous encoding.
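The "ask the user" flow could be sketched roughly like this, again assuming mbstring. The function name previewEncodings and the candidate list are assumptions made for illustration; a real application would choose candidates that fit its audience.

```php
<?php
// Hypothetical helper: decode the same bytes under each candidate encoding so that
// a human can pick the version that reads correctly.
function previewEncodings(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // Drop candidates that can be ruled out because the bytes are invalid in them.
        if (!mb_check_encoding($bytes, $encoding)) {
            continue;
        }
        $previews[$encoding] = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }
    return $previews;
}

$upload = "fianc\xE9e \xB1"; // sample bytes standing in for an uploaded file
foreach (previewEncodings($upload, ['UTF-8', 'ISO-8859-1', 'ISO-8859-2']) as $encoding => $text) {
    echo $encoding, ': ', $text, PHP_EOL; // show each preview and let the user confirm one
}
```

UTF-8 drops out automatically for this sample, and the user is left to choose between the two remaining previews.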
I need to determine the character encoding of the contents of a .csv file.
Every snippet that I have seen do this uses file_get_contents(), however I can't use that because the file is too large to store in a variable (server memory limit exhausted).
How can I determine the character encoding of a file? Can I just get the first x characters and check them? Would that guarantee that my whole file is that encoding?
Alternatively, can I simply convert the entire csv to UTF-8 without knowing the current file encoding?
No, you can't determine the encoding from just the first x characters. You can guess it, and the guess may be wrong. The file may be UTF-8 but contain no multi-byte UTF-8 sequences within the first x characters. It may also be in another ASCII-compatible encoding whose non-ASCII bytes appear only after character x.
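If the memory limit is the obstacle, the whole file can still be checked in chunks. The following is a sketch only, assuming mbstring; fileIsValidUtf8 and the chunk size are made-up names and values. It can rule UTF-8 out for the entire file, but it cannot identify any other encoding.

```php
<?php
// Hypothetical helper: validate a file as UTF-8 chunk by chunk so memory use stays bounded.
function fileIsValidUtf8(string $path, int $chunkSize = 65536): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $carry = ''; // bytes of a multi-byte sequence cut off by the chunk boundary
    try {
        while (!feof($handle)) {
            $data = $carry . (string) fread($handle, $chunkSize);
            $length = strlen($data);
            $valid = false;
            // A UTF-8 sequence is at most 4 bytes, so at most 3 trailing bytes can be incomplete.
            for ($cut = 0; $cut <= min(3, $length); $cut++) {
                if (mb_check_encoding(substr($data, 0, $length - $cut), 'UTF-8')) {
                    $carry = substr($data, $length - $cut); // retry these with the next chunk
                    $valid = true;
                    break;
                }
            }
            if (!$valid) {
                return false;
            }
        }
    } finally {
        fclose($handle);
    }
    return $carry === ''; // leftover bytes at end of file mean a truncated sequence
}
```

A true result means only that the file is well-formed UTF-8; a false result says nothing about which encoding it actually uses.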
No, you can't convert a file without knowing the current file encoding.
You can go straight to the conversion, as you said, using iconv (http://php.net/manual/en/function.iconv.php#49434)
'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, 1864.
You have missing metadata and are proposing to put in values whether they are right or not.
Only the author/sender can tell you, perhaps via some standard, specification, convention, agreement or communication. A common method of communication when transferring data via HTTP is the Content-Type header.
Unfortunately, inadequate communication of metadata for text files and streams is too common in our industry. It stems from the 1970s and 80s when text files were converted to the local character encoding upon receipt. That doesn't apply anymore and nothing really took its place.
Non-answer:
Conversion from ISO-8859-1 will never fail because it uses all 256 byte values, in any sequence.
Conversion to any current Unicode encoding (including UTF-8) will never fail because all of them support the whole Unicode character set, and Unicode includes every computerized character you are likely to see today.
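To make the "garbage in" point concrete, here is a small sketch assuming the mbstring extension. It shows that converting from ISO-8859-1 succeeds for every possible byte, and that success says nothing about whether the result is right.

```php
<?php
// Every one of the 256 byte values is a defined ISO-8859-1 character, so this
// conversion has no way to fail, whatever the bytes really are.
$allBytes = '';
for ($i = 0; $i < 256; $i++) {
    $allBytes .= chr($i);
}
$converted = mb_convert_encoding($allBytes, 'UTF-8', 'ISO-8859-1');
var_dump(mb_check_encoding($converted, 'UTF-8')); // bool(true)

// "Never fails" is not "correct": feed it bytes that were already UTF-8 and the
// result is valid UTF-8 holding the wrong characters (double encoding).
$alreadyUtf8   = "\xC3\xA9"; // é in UTF-8
$doubleEncoded = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), PHP_EOL; // c383c2a9
```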
But wait, there is more needed metadata in the case of CSV:
line ending (arguably detectable)
field separator (arguably detectable)
quoting scheme, including escaping
presence of header row
and, finally, the datatype of each column.
And, keep in mind, if you were to guess any of this, and the data source is updatable, today's guess might not work tomorrow.
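As an illustration of what "arguably detectable" means for the field separator, here is a rough sketch. The function guessSeparator and its candidate list are hypothetical, and the result is a guess that can be wrong for exactly the reasons given above.

```php
<?php
// Hypothetical helper: parse a sample line with each candidate separator and keep
// the one that yields the most fields.
function guessSeparator(string $sampleLine, array $candidates = [',', ';', "\t", '|']): string
{
    $best = $candidates[0];
    $bestCount = 0;
    foreach ($candidates as $separator) {
        // str_getcsv honours quoting, so separators inside quoted fields are not counted.
        $fields = str_getcsv($sampleLine, $separator, '"', '\\');
        if (count($fields) > $bestCount) {
            $best = $separator;
            $bestCount = count($fields);
        }
    }
    return $best;
}

echo guessSeparator('id;name;"Doe, Jane";city'), PHP_EOL; // ;
```

A single-column file, or a line in which two candidates occur equally often, defeats this kind of guess, so the user should confirm it.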
I have an open source PHP website and I intend to modify/translate (mostly constant strings) it so it can be used by Japanese users.
The original code is PHP+MySQL+Apache and written in English with charset=utf-8
I want to change, for example, the word "login" into Japanese counterpart "ログイン" etc
I am not sure whether I have to save the PHP code in utf-8 format (just like Python)?
I only have experience with Python, so what other issues should I take care of?
If it's in the file, then yes, you will need to save the file as UTF-8.
If it's in the database, you do not need to save the PHP file as UTF-8.
In PHP, strings are basically just binary blobs. You will need to save the file as UTF-8 so the correct bytes are read in. In theory, if you saved the raw bytes in an ANSI file, it would still be output to the browser correctly, just your editor would not display it correctly, and you would run the risk of your editor manipulating it incorrectly.
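Also, when handling non-ANSI strings, you'll need to be careful to use the multi-byte versions of string manipulation functions. Byte-oriented functions such as strlen and substr count bytes rather than characters and can cut a UTF-8 sequence in half. (str_replace, by contrast, is safe on valid UTF-8, because one character's byte sequence never occurs inside another's.)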
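A short sketch of the difference, assuming the mbstring extension and a source file saved as UTF-8; the variable names are illustrative.

```php
<?php
$login = 'ログイン'; // 4 characters, 12 bytes when the file is saved as UTF-8

echo strlen($login), PHP_EOL;             // 12 (bytes)
echo mb_strlen($login, 'UTF-8'), PHP_EOL; // 4 (characters)

// Byte-based substr() can cut a character in half and leave invalid UTF-8 ...
$broken = substr($login, 0, 4);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// ... while mb_substr() cuts on character boundaries.
$firstTwo = mb_substr($login, 0, 2, 'UTF-8');  // the first two characters
```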
If the file contains UTF-8 characters then save it with UTF-8. Otherwise you can save it in any format. One thing you should be aware of is that the PHP interpreter does not support the UTF-8 byte order mark so make sure you save it without that.
I'm sorry you have to use PHP after using Python.
PHP has no concept of character sets: all strings are binary, even in parsed php code, so if you include a UTF-8 multibyte character in a php string, make sure the bytes in the code file are UTF-8 bytes.
You will need to be extremely careful with the use of string functions at all levels of your application. You also need to make sure your MySQL connection is set to use UTF-8 (using SET NAMES or the charset dsn parameter in later versions of PDO), and that your mysql string column datatypes use utf-8 storage.
I am using flash to read contents from a UTF8 page, which has unicode in it.
The problem is that when Flash loads the data, it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. The reason that you are seeing characters that possibly substitute non-printable characters or invalid / missing glyphs could be that you set System.useCodepage to true - if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML sometimes carry an encoding declaration that does not correspond to the actual payload (example XML declaration: <?xml version="1.0" encoding="utf-8"?> can be attached to any XML regardless of the actual encoding of the document). To check, read the text as a ByteArray and look at the bytes. If the first (most significant) bit of every byte is 0, the text is plain ASCII, which is also valid UTF-8. UTF-8 does set that bit, but only in well-formed multi-byte sequences: a lead byte starting with 110, 1110 or 11110, followed by the matching number of continuation bytes starting with 10. Single-byte encodings use the same bit for individual national characters, so text in such an encoding usually fails that pattern.
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up and insert traces and/or log messages to see where the conversion fails. Make sure your XML content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding. Editing PHP files in simple text editors often results in source files saved in a Windows or Mac encoding, which will then break your character encoding. Also, verify the HTTP request/response headers to see if there is an encoding mismatch.
I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get an unknown character set, and it has strange characters (like english quotes), is there a standard way to convert them to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason why you saw so many complicated solutions for this problem is that by definition it is not solvable. Encoding text to bytes is deterministic, but the reverse is ambiguous: the bytes alone do not tell you which encoding produced them.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) that are sent to the server. This is a valid EBCDIC string that represents the letters C and d. This is also a valid ANSI string that represents the 2 characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
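The byte values in this example can be reproduced with a few lines of PHP, assuming the mbstring extension and taking "ANSI" to mean Windows-1252; the variable names are illustrative.

```php
<?php
$sent = "\xC3\x84"; // the two bytes a UTF-8 page submits for Ä

// Read correctly as UTF-8: one character.
echo mb_strlen($sent, 'UTF-8'), PHP_EOL; // 1

// Misread as Windows-1252 and transcoded to UTF-8: two characters, Ã and „.
$misread = mb_convert_encoding($sent, 'UTF-8', 'Windows-1252');
echo mb_strlen($misread, 'UTF-8'), PHP_EOL; // 2
echo bin2hex($misread), PHP_EOL;            // c383e2809e
```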
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because the encoding cannot be determined from the bytes alone, you cannot expect a trivial method to exist for recovering from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
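function forceToUtf8($string) {
    // Already valid UTF-8: return it unchanged.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    // Otherwise guess among assumed candidates (strict mode); this is a guess, not a detection.
    $from = mb_detect_encoding($string, ['Windows-1252', 'ISO-8859-1'], true);
    return $from === false ? false : mb_convert_encoding($string, 'UTF-8', $from);
}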
If I'm not wrong, there is a function called utf8_encode, which converts ISO-8859-1 to UTF-8... it works well EXCEPT if the input is already UTF-8.
http://php.net/manual/en/function.utf8-encode.php
Quick question, how can I make this valid :
if($this->datos->bathrooms == "1½"){$select1 = JText::_( 'selected="selected"' );}
The ½ doesn't seem to be recognized. I tried to write it as an HTML entity (such as &frac12;), but then it looks for the entity text literally, and not the ½ sign. Any ideas?
As many others have noted, you have a character encoding problem, most likely. I'm not sure what encodings PHP supports but you need to take the whole picture into account. For this example I'm assuming your PHP script is responding to a FORM post.
Some app (yours, most likely) writes some HTML which is encoded using some encoding and sent to the browser. Common choices are ISO-8859-1 and UTF-8. You should always use UTF-8 if you can. Note: it's not the default for the web (sadly).
The browser downloads this html and renders the page. Browsers use Unicode internally, mostly, or some superset. The user submits a form. The data in that form is encoded, usually with the same encoding that the page was sent in. So if you send UTF-8 it gets sent back to you as UTF-8.
PHP reads the bytes of the incoming request and sets up its internal variables. This is where you might have a problem, if it is not picking the right encoding.
You are doing a string comparison, which decomposes to a byte comparison, but the bytes that make up the characters depend on the encoding used. As Peter Bailey wrote,
In ISO-8859-1 this character is encoded as 0xBD
In UTF-8 this character is encoded as 0xC2BD
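The two byte sequences, and the failed comparison they cause, can be checked with a short sketch. It assumes the mbstring extension; $incoming is a stand-in for a value that arrived in ISO-8859-1 and is not a name from the original code.

```php
<?php
$halfUtf8   = "\u{00BD}";                                             // ½ as UTF-8
$halfLatin1 = mb_convert_encoding($halfUtf8, 'ISO-8859-1', 'UTF-8');  // ½ as ISO-8859-1

echo bin2hex($halfUtf8), PHP_EOL;   // c2bd
echo bin2hex($halfLatin1), PHP_EOL; // bd

// Same character, different bytes: a byte-wise comparison sees two different strings.
var_dump('1' . $halfUtf8 === '1' . $halfLatin1); // bool(false)

// One possible fix: normalise the incoming value to UTF-8 before comparing.
$incoming   = '1' . $halfLatin1;
$normalised = mb_convert_encoding($incoming, 'UTF-8', 'ISO-8859-1');
var_dump($normalised === '1' . $halfUtf8);       // bool(true)
```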
You need to verify the text encoding along each step of the way to make sure it is happening as you expect. You can verify the data sent to the browser by changing the encoding from the browser's auto-detected encoding to something else to see how the page changes.
If your data is not coming from the browser, but rather from the DB, you need to check the encodings between your app and the DB.
Finally, I'd suggest that it's impractical to use a string like 1½ as a key for comparison as you are. I'd recommend using 1.5 and detecting that at display time, then changing how the data is displayed only. Advantages: you can order the results by number of bathrooms if the value is numeric as opposed to a string, etc. Plus you avoid bugs like this one.
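A sketch of that suggestion: keep the stored value numeric and produce the "½" only when rendering. The function formatBathrooms is hypothetical and assumes values are whole or half numbers.

```php
<?php
// Hypothetical formatter: 1.5 is stored and compared as a number; the fraction
// sign only appears in the displayed text.
function formatBathrooms(float $count): string
{
    $whole = (int) floor($count);
    $hasHalf = ($count - $whole) === 0.5;
    $text = $whole > 0 ? (string) $whole : '';
    if ($hasHalf) {
        $text .= "\u{00BD}"; // ½, emitted as UTF-8
    }
    return $text === '' ? '0' : $text;
}

$bathrooms = 1.5;
$selected = ($bathrooms === 1.5) ? 'selected="selected"' : ''; // numeric comparison, no encoding involved
echo formatBathrooms($bathrooms), PHP_EOL;
```

Because the comparison is numeric, it no longer depends on the encoding of the file, the form or the database.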
The character you are looking for is the Unicode character Vulgar Fraction One Half
There are a multitude of ways to make sure you are displaying this character properly, all of which depend on the encoding of your data. By looking here we can see that
In ISO-8859-1, a popular western encoding, this character is encoded as BD
In UTF-8, a popular international encoding, this character is encoded as C2 BD
What this means is that if your PHP file is UTF-8 encoded, but you are sending this to the browser as ISO-8859-1 (or the other way around), the character will not render properly.
As others have posted, you can also use the HTML Entity for this character which will be character-encoding agnostic and will always render (in HTML) properly, regardless of the output encoding.
Try comparing it with "1½"
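Use the PHP chr function to create the character by its hex 0xBD or dec 189 (this matches only if the data is ISO-8859-1 encoded; in UTF-8 the character is the two bytes 0xC2 0xBD):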
if($this->datos->bathrooms == "1".chr(189)){$select1 = JText::_( 'selected="selected"' );}