I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file, it'll appear as like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but it was to no avail, it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
Related
I need to determine the character encoding of the contents of a .csv file.
Every snippet that I have seen do this uses file_get_contents(), however I can't use that because the file is too large to store in a variable (server memory limit exhausted).
How can I determine the character encoding of a file? Can I just get the first x characters and check them? Would that guarantee that my whole file is that encoding?
Alternatively, can I simply convert the entire csv to UTF-8 without knowing the current file encoding?
No, you can't determine the encoding with just the first x characters. You can guess it, and the guess may be wrong. The file may be UTF-8 but not contain UTF-8 before x characters. If may contain another encoding that is compatible with ASCII, bot only after character x.
No, you can't convert a file without knowing the current file encoding.
You can go straight to the conversion, as you said, using iconv (http://php.net/manual/en/function.iconv.php#49434)
'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, 1864.
You have missing metadata and are proposing to put in values whether they are right or not.
Only the author/sender can tell you, perhaps via some standard, specification, convention, agreement or communication. A common method of communication when transferring data via HTTP is the Content-Type header.
Unfortunately, inadequate communication of metadata for text files and streams is too common in our industry. It stems from the 1970s and 80s when text files were converted to the local character encoding upon receipt. That doesn't apply anymore and nothing really took its place.
Non-answer:
Conversion from ISO-8859-1 will never fail during conversion because it uses all 256 bytes values in any sequence.
Conversion to any current Unicode encoding (including UTF-8) will never fail because all of them support the whole Unicode character set, and Unicode includes every computerized character you are likely to see today.
But wait, there is more needed metadata in the case of CSV:
line ending (arguably detectable)
field separator (arguably detectable)
quoting scheme, including escaping
presence of header row
and, finally, the datatype of each column.
And, keep in mind, if you were to guess any of this, and the data source is updatable, today's guess might not work tomorrow.
Example :
$fire = '🔥';
I know PHP 5+ supports this functionality natively but is it best practice or should I be storing them using their codepoints instead and if so, why?
As far as your editor and the PHP compiler are concerned, it's all just text, and '🔥' is no different from 'fire' or 'Φωτιά'.
When PHP runs, it will read the bytes in from the file and put them in memory, without caring what they mean. This leads to the most likely problem you'll have: if you save the file in your text editor as UTF-16, and then echo the string to a browser telling it that it's UTF-8, the browser won't show the right thing. But that's easily avoided by making sure your editor always uses UTF-8, and your output headers tell the browser that's what you're using.
If you don't trust your editor to do that, and you're running PHP7, you could write it in the escaped notation "\u{1f525}", but when it runs, the same bytes will end up in memory.
You might have similar problems if you send the text elsewhere - to a database, for instance - and that somewhere else doesn't know to handle it as UTF-8. How you write the string in your source file won't make any difference to that, though, that's just a case of making sure everything is configured to match.
Note: you don't actually have to use UTF-8 for this, you could use UTF-16, or some other encoding, as long as you're consistent; but UTF-8 is by far the most common these days, particularly on the web.
Any idea why this is happening?
It looks to be happening mainly with apostrophes and hyphens. Any ideas if I can fix this? I pull the data from my database and print it to the page like:
<div class="block">
<?=$details['agenda'] ?>
</div>
As other commenters may have mentioned, this is a character encoding problem. If you're lucky, you can force your HTML page to render in UTF-8 and that will resolve it.
Unfortunately, if you're not lucky, you'll discover that the characters are stored in the database in the wrong encoding. Or maybe the database converts them. Or maybe the character encoding data has been destroyed along the path! There's no way of knowing in advance where those characters have been damaged.
The best way I know to fix problems like this is to force every step along your path to follow UTF-8 content encoding. For example, you probably go through steps like this:
Content author writes a document in Microsoft Word containing "SmartQuotes"
Content author copies-and-pastes into the edit box of a content management system.
Content management system saves to the database.
Database may or may not store data in Unicode internally - make sure you use nvarchar (or whatever unicode type your database supports).
Reading from the database may need to scan for characters.
However, it's very tricky to fix this! A long time ago, I used to have a habit of writing "detect-and-fix" routines like this:
$smartquotes = array("”", "“");
str_replace($smartquotes, '"', $mytext);
Of course you know what the problem is - I'd keep discovering new characters I had to fix. Microsoft Word likes to do tons of unusual characters - copyright, registration marks, apostrophes, hyphens, and so on. I'd keep adding to this function, over and over, until I went crazy. So nowadays I just go through my entire content delivery path and force everything to obey UTF-8 rules; that seems to resolve it in most cases.
Good luck!
I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get an unknown character set, and it has strange characters (like english quotes), is there a standard way to convert them to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason why you saw so many complicated solutions for this problem is because by definition it is not solvable. The process of encoding a string of text is non-deterministic.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) that are sent to the server. This is a valid EBCDIC string that represents the letter C and d. This is also a valid ANSI string that represents the 2 characters à and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because character encoding is non-deterministic, you cannot expect that there exist a trivial method to uncover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurance of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters à and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This roughy is the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable will its results be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
function forceToUtf8($string) {
if (!mb_check_encoding($string)) {
return false;
}
return mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));
}
If I'm not wrong, there is something called utf8encode... it works well EXCEPT if you are already in utf8
http://php.net/manual/en/function.utf8-encode.php
Probably a problem many of you have encountered some day earlier, but i'm having problems with rendering of special characters in Flash (as2 and as3).
So my question is: What is the proper and fool-proof way to display characters like ', ", ë, ä, etc in a flash textfield? The data is collected from a php generated xml file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which i've tried already) but I have yet to find a solid solution.
Just setting the header to UTF-8 won't work, it's a bit like changing the covers on a book from english to french and expecting the contents to change with it.
What you need to to is to make sure your text is UTF-8 from beginning to end, store it as that in the database, if you can't do that, make sure you encode your output properly.
If you get all those steps down it should all work just fine in flash, assuming you've got the proper glyphs embedded unless you're using a system font.
AS2 has a setting called useSystemCodepage, this may seem to solve the problem, but will likely make it break even more for users on different codepages, try to avoid this unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)
I think that it's enough for you to put this in the xml head
<?xml version="1.0" encoding="UTF-8"?>
If your special characters are a part of Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper unicode text.
Some fonts don't neccessarily include all the unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.