I am running a PHP web application which accepts a file from the user, appends some data to it, and provides the user with a new file to download.
Occasionally I get files which contain invisible control characters such as a BOM or a zero-width no-break space (a plain text editor does not show them, but when checked with the 'less' command or in the 'vi' editor, they show up as <U+200F>, <U+FEFF>, <U+0083> etc.), and that causes issues with our processing. Currently, I have a list of a few such code points which I remove from the file using 'sed' before processing it (below is the command I use). Then I also use 'iconv' to convert non-UTF files to UTF-8.
exec("sed -i -E 's/\xE2\x80\x8F|\xC2\x81|\xE2\x80\x8B|\xE2\x80\x8E|\xEF\xBB\xBF|\xC2\xAD|\xC2\x89|\xC2\x83|\xC2\x87|\xC2\x82//g' 'my_file_path'"); // -E so that | works as alternation
But the list of such characters keeps growing, and when they are not handled properly, they cause the file encoding to be detected as 'unknown-8bit', which is not correct and shows up as corrupted content. Now I am looking for a solution which is efficient and does not require me to look up the code table.
How should I do this so that it automatically handles every such code point in the file and doesn't need me to maintain a list of codes to replace? I am open to a Perl/Python/Bash script solution as well.
P.S. I need to support all languages (not just US-ASCII or extended ASCII) and I also don't want any data loss.
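Something generic along these lines is what I'm after; a rough sketch of the idea (the Windows-1252 fallback for non-UTF-8 input is an assumption, not a requirement):
$text = file_get_contents($path);            // $path: the uploaded file
if (!mb_check_encoding($text, 'UTF-8')) {
    // Not valid UTF-8; convert from an assumed single-byte encoding first.
    $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}
// Strip every Unicode "format" character (category Cf: U+FEFF, U+200F, U+200B,
// U+00AD, ...) and every control character (Cc) except tab, LF and CR.
$clean = preg_replace('/(?![\t\n\r])[\p{Cf}\p{Cc}]/u', '', $text);
file_put_contents($path, $clean);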
I have code:
$mpdf = new mPDF();
$mpdf->WriteHTML('some html text');
return $mpdf->Output("123!##$%^&*()_+<><?:}{P}" . '.pdf', 'I');
But when I save the document, the filename displays ----- instead of the symbols <>?:.
Can it be fixed?
First of all, this question has nothing to do with PDF generation. You want to create a file system object with a name that includes characters that have a special meaning in some shells:
< is the input redirection operator
> is the output redirection operator
? is the any character wildcard
: is the Windows drive letter separator
And you want to accomplish it through an additional layer you don't have control over (I assume a web browser).
Some file systems (not all) treat object names as raw byte strings and do not impose any conditions. I recall being able to create files on an old Unix box with names that contained a * character and a line feed, after I read a book that explained such a thing was possible. However, a file name goes through several software layers, many of which actually need to understand the name, and some of them may impose additional restrictions on top of those of the file system itself. So, even if you manage to create the file, you might not be able to read it back later.
For this reason, the browser actively removes problematic characters. In some cases, it might be overzealous (: is safe on Unix) but it just tries to prevent potential issues (e.g. the Unix file is emailed or copied to a Windows share) and there's nothing you can do on the server to avoid that.
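Since the stripping happens on the client anyway, a pragmatic fix is to sanitise the name yourself before handing it to the browser, so the result is at least predictable. A sketch (using a dash as the replacement and treating Windows as the strictest target are assumptions):
$name = '123!##$%^&*()_+<><?:}{P}';
// Replace the Windows-reserved characters and control bytes with a safe dash.
$safe = preg_replace('/[<>:"\/\\\\|?*\x00-\x1F]/', '-', $name);
$mpdf->Output($safe . '.pdf', 'I');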
On my site I allow direct text file uploads. These files are then stored on the server and displayed on the website. I use UTF-8 on the site.
Now I run into trouble when people upload non-UTF-8 files which contain special chars, such as é.
I've been doing some testing. I made two text files, both containing the same word, fiancée: one encoded as UTF-8 and one as ISO 8859-2.
The UTF-8 one uploads fine and shows the text correctly, but the ISO 8859-2 one shows up as fianc�e.
Now I've tried to detect the uploaded file's encoding with mb_detect_encoding, but whatever file I throw at it, it always detects UTF-8.
I noticed that I can use utf8_encode to convert the ISO 8859-2 files to valid UTF-8, but this only works on non-UTF files. And as I currently cannot detect non-UTF files, I cannot use the utf8_encode function, as it messes up valid UTF-8 files.
Hope this makes sense :)
So my question is, how can I detect files that are for sure not UTF-8 encoded to start with, so that I can use the utf8_encode function on them.
You cannot. Welcome to encodings.
Seriously though, files are just binary blobs. The bits and bytes in the file could mean anything at all; it could be images, CAD data or, perhaps, text. It depends on how you interpret the bytes. For text files, that specifically means the encoding with which you interpret them. There's nothing in the files themselves that tells you the correct encoding; you have to know it. Typically you want to know it from metadata accompanying the file. In the case of random user uploads, though, there is no metadata, and/or it wouldn't be reliable. So you cannot "know".
The next step would be to guess, but that is obviously not foolproof. You can rule out certain encodings: for example, if a file does not validate as UTF-8 (mb_check_encoding($data, 'UTF-8') == false), then it cannot be UTF-8. However, any single-byte encoding will validate as any other single-byte encoding. It's impossible to distinguish ISO-8859-1 from ISO-8859-2 this way; the bytes are equally valid in both. It's just that the characters that show up may not be the ones you want. To detect that automatically you need a statistical language analyser which can tell you that this character probably shouldn't show up in that word for it to be grammatical. Obviously, for that to work you need to know the language used in the file, or you need to detect that first… And even then this is hardly foolproof.
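To make the rule-out step concrete, a minimal sketch (treating everything that fails the UTF-8 check as ISO-8859-2 is an assumption that only holds for a known set of inputs like the test files above):
$data = file_get_contents($uploadedFile);   // $uploadedFile: hypothetical path to the upload
if (!mb_check_encoding($data, 'UTF-8')) {
    // Not valid UTF-8; fall back to an assumed single-byte encoding.
    $data = mb_convert_encoding($data, 'UTF-8', 'ISO-8859-2');
}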
The sanest way is to ask the user. Accept the upload, perhaps do some upfront testing to rule out certain encodings, then ask the user which of a bunch of possible encodings the file is in. Present them with the result, i.e. what the file looks like when interpreted as the chosen encoding, and let the user confirm that it looks alright. Many decent text editors do this when you open a file with an ambiguous encoding.
I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file, it'll appear like this:
this is the beginning of the file, \xE2\xA0
I tried using a regex editor to remove it, but to no avail; it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those characters appear because something is wrong with the character set handling on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command-line tools like file to examine what character set a text file is stored in, and iconv to convert text files from one character set to another.
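Both tools can also be driven from PHP with exec(); a sketch, where the path and the source charset are assumptions:
exec("file -bi " . escapeshellarg($path), $out);  // $out[0] ends up like "text/plain; charset=iso-8859-1"
exec("iconv -f ISO-8859-1 -t UTF-8 -o " . escapeshellarg($path . ".utf8") . " " . escapeshellarg($path));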
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
I've got a MySQL database table with an ISO-8859-1 encoded text field containing user names. When I export that to a text file using PHP, I get a normal text file saved on the client computer. When I open it in Word or Excel on a Windows system, it looks good. When I open it on a Mac using Word or Excel, the high-ASCII characters are wrong.
I know this is due to the Mac using MacRoman and Windows using ISO-8859-1. My question is: how can I write a text file that will open up on both platforms and look good on both?
Is there some XML variant that I can wrap around the text that will clue Word in to the fact that it's ISO-8859-1 encoded? What magic dust can I sprinkle on a TXT file to clue the OS in to the fact that it's using another encoding scheme?
...I get a normal text file saved on the client computer
You actually get text in a specific encoding. Let's assume it's ISO-8859-1.
I know this is due to the Mac using MacRoman and Windows using ISO-8859-1. My question is how can I write a text file that will open up on both platforms and look good on both?
The software that opens a text document must know the charset encoding. Sometimes it can guess using heuristics, sometimes it will not try to guess (and will use its own default), and sometimes you can make it ask you which encoding to use.
There is no general method that guarantees every user will open it with the correct encoding, as long as we are speaking of pure text files. In some other formats (e.g. HTML) the encoding can be specified as part of the document itself.
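As an illustration of that last point, the PHP export could ship the data as minimal HTML instead of a bare .txt, so the charset declaration travels with the file (the file name and the $names variable are hypothetical):
header('Content-Type: text/html; charset=ISO-8859-1');
header('Content-Disposition: attachment; filename="users.html"');
echo '<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"></head><body><pre>';
echo htmlspecialchars($names, ENT_QUOTES, 'ISO-8859-1'); // $names: the exported user names
echo '</pre></body></html>';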
I have a php script which accesses a MSSQL2005 database, reads some data from it and sends the results in a mail.
There are special characters in both some column names and in the fields itself.
When I access the script through my browser (webserver iis), the query is executed correctly and the contents of the mail are correctly (for my audience) encoded.
However, when I execute php from the console, the query fails (due to the special characters in the column names). If I replace the special characters in the query with calls to chr() and the character code in latin-1, the query gets executed correctly, but the results are also encoded in latin-1 and therefore not displayed correctly in the mail.
Why is PHP/the MSSQL driver/… using a different encoding in the two scenarios? Is there a way around it?
If you wonder, I need the console because I want to schedule the script using SQLAgent (or taskmanager or whatever).
Depending on the type of characters you have in your database, it might be a console limitation, I guess. If you type chcp in the console, you'll see the active code page, which might be something like CP437, also known as extended ASCII. If you have characters outside this code page, like in UTF-8, you might run into problems. You can change the current active code page by typing chcp 65001 to switch to UTF-8.
You might also want to change the default Raster font to Lucida Console, depending on the required characters, as not all fonts support extended characters (right-click on the command prompt window's title, then Properties, Font).
As already said, PHP's Unicode support is not ideal, but you can manage in PHP 5 with a few well-placed calls to utf8_decode. The secret of character encoding is to understand what the current encoding is at every step of the chain: the database, the database connection, the bytes in your PHP variable, your output to the console screen, your email's body encoding, your email client, and so on...
For everything that has special characters, these days something like UTF-8 is often recommended. Make sure everything along the way is set to UTF-8 and convert only where necessary.
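A rough sketch of that advice; the Windows-1252 source charset and the variable names are assumptions, and the exact connection options depend on your MSSQL driver:
mb_internal_encoding('UTF-8');               // make PHP's mb_* functions default to UTF-8
$body = $row['text'];                        // hypothetical column fetched from MSSQL
if (!mb_check_encoding($body, 'UTF-8')) {
    $body = mb_convert_encoding($body, 'UTF-8', 'Windows-1252'); // assumed DB charset
}
$headers = "MIME-Version: 1.0\r\nContent-Type: text/plain; charset=UTF-8\r\n";
// Encode the subject as a MIME encoded-word so non-ASCII survives the mail headers.
mail($to, '=?UTF-8?B?' . base64_encode($subject) . '?=', $body, $headers);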
PHP's poor support for the non-English world is well known. I've never used a database with characters outside the basic ASCII realm, but obviously you already have a workaround, and it seems you just have to live with it.
If you wanted to take it a step further, you could:
1. Write an array that contains all the special chars and their chr() equivalents
2. foreach over the array and str_replace on the query (see the sketch below)
But if the query is hardcoded, I guess what you have is fine. Also, make sure you are using the latest PHP, at least 4.4.x; there's always a chance this was fixed, but I skimmed the 4.x.x release notes and I don't see anything that relates to your problem.
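A minimal sketch of those two steps (the characters and code points are made-up examples):
// Map each special character, as typed in the PHP source, to its Latin-1 byte.
$special = array('é' => chr(0xE9), 'ä' => chr(0xE4), 'ß' => chr(0xDF));
$query = "SELECT ... FROM ..."; // the hardcoded query containing special column names
foreach ($special as $char => $latin1) {
    $query = str_replace($char, $latin1, $query);
}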
The thing to remember about PHP strings is that they are streams of bytes. If you want to get the data in the correct character set (for whatever you are doing), you have to do this explicitly through some kind of function or filter. It's all pretty low-level.
Depending on your setup, you may need to know the internal character set of the strings in the database, but at the very least you need to know what character set the database is sending to PHP (because, remember, to PHP it's just a stream of bytes).
Then you have to know the target character set (and possibly specify it, which you really should do anyway). For example, say that you are getting UTF-8 from the database but wish to send Latin-1 (and therefore base64- or quoted-printable-encode it, declared via 'Content-Transfer-Encoding'):
$send_string = base64_encode(utf8_decode($database_string));
Of course, in this case you'd have to know that all the UTF-8 characters exist in the Latin-1 character set, and you probably wouldn't really want base64 (PHP unfortunately does not have a good quoted-printable encoding function, though curiously, it does have one for decoding), and if you aren't talking about UTF-8 <=> Latin-1 you'll want to whip out the mbstring functions instead.
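For encodings other than the UTF-8 <=> Latin-1 pair, the mbstring/iconv equivalents look like this (ISO-8859-2 is just an example target):
// mbstring: convert between arbitrary encodings, then transfer-encode.
$send_string = base64_encode(mb_convert_encoding($database_string, 'ISO-8859-2', 'UTF-8'));
// iconv: the //TRANSLIT suffix approximates characters missing from the target set.
$send_string = base64_encode(iconv('UTF-8', 'ISO-8859-2//TRANSLIT', $database_string));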
As far as the console goes, you'd have to know what PHP is getting when you type special characters in from the console, which probably depends on the shell and/or PHP settings. But remember that PHP only understands strings as byte, byte, byte, and you should be able to work it out.