How to format incoming email text for HTML display


I've set up a script that processes incoming emails and creates blog entries on Blogger. I'm using PEAR's Mail_Mime libs (for now) to read the incoming message. The messages often have characters in them that cannot be read by browsers--this happens most often when people use Outlook or cut/paste from MS Word.
So the output at the other end is something like this:
Here is a test post with “quotes” and ‘apostrophes�for what it�s worth, it also has dashes�and other strange formatting cut and paste from MS Word.
You can also see the output in the wild.
It's not hard to fix any specific instance, but each client (hotmail, gmail, outlook, etc) seems to handle things just a bit differently. Mail_Mime only seems to munge the output and, if I turn off Mail_Mime's parsing and try to translate the encoded characters myself using mb_convert_encoding or some manual simulation of this, it's even worse.
Please note that this is not going to be solved by selecting the right encoding type and using decode/encode/convert functions. The incoming formats vary from Windows-1252 to UTF-8 to just about anything else mail clients can think of.
Has anyone scripted this before that could save me some time by offering up a sample or advice on the best approach? I've tried all the simple answers and done plenty of experimenting, so please don't bother responding unless you've dealt with a similar issue successfully or have a deep understanding of encoding issues.

The only way to do this, I'm afraid, is to follow the specs: read the Content-Type MIME header, pick up the charset (it will look like Content-Type: text/plain; charset="us-ascii"), convert the body from that charset to UTF-8, and of course make sure your output on the web is sent as UTF-8 with the right headers.
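The steps above can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the helper names charsetFromContentType() and bodyToUtf8() are made up for illustration, the us-ascii fallback is an assumption, and the conversion assumes the mbstring extension is available and recognises the declared charset.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type header value.
function charsetFromContentType(string $header, string $default = 'us-ascii'): string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)"?/i', $header, $m)) {
        return strtolower($m[1]);
    }
    return $default; // assumption: fall back to us-ascii when no charset is declared
}

// Hypothetical helper: convert an already decoded body to UTF-8 using the declared charset.
function bodyToUtf8(string $body, string $charset): string
{
    if ($charset === 'utf-8' || $charset === 'us-ascii') {
        return $body; // these bytes are already valid UTF-8
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

$header = 'Content-Type: text/plain; charset="Windows-1252"';
$body   = "it\x92s"; // 0x92 is the curly apostrophe in Windows-1252

$charset = charsetFromContentType($header);
echo $charset, "\n";                             // windows-1252
echo bin2hex(bodyToUtf8($body, $charset)), "\n"; // 6974e2809973
```

A charset that mbstring does not know would still need its own handling (for example a fallback to iconv), which is left out here.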

To solve this problem, and get my message into valid UTF-8 that is readable from a browser, I found this PHP lib, ConvertCharset by Mikolaj Jedrzejak, which worked on almost everything. It still had issues with a specific symbol (=A0) when converting from Windows-1252 or iso-8859-1. So I converted this character manually before setting the code loose.
Here's what it looks like overall:
// decode using Mail_Mime
require 'Mail.php';
require 'Mail/mime.php';
require 'Mail/mimeDecode.php';
$params['include_bodies'] = true;
$params['decode_bodies'] = true; // this decodes it!
$params['decode_headers'] = true;
$decoder = new Mail_mimeDecode($input);
$mime = $decoder->decode($params);
// too much work to put in this example
$charset = ...; //do some magic with $mime->parts to get the character set
$text = ...; //do some magic with $mime->parts to get the text
// fix the =A0 byte (a non-breaking space in Windows-1252 and iso-8859-1);
// it's already been decoded by Mail_Mime, so we need the actual byte now
// this has to be done before trying to convert to UTF-8, and only for
// non-UTF-8 input, because in UTF-8 the byte 0xA0 can be part of a multi-byte character
// convert to UTF-8 using ConvertCharset
require 'ConvertCharset.class.php';
if( strtolower($charset) != 'utf-8' ) {
$text = str_replace(chr(0xA0), '', $text);
$converter = new ConvertCharset($charset, 'utf-8', false);
$text = $converter->Convert($text);
}
Then everything is spiffy. It even does the infamous Iñtërnâtiônàlizætiøn conversion, as well as accepting French, Spanish, and pastes directly from MS Word :)

Related

PHP's base64_decode returns wrong result working in IIS

I'm trying to show a jpg that was previously encoded in a WCF web service using:
<?php
require_once '../inc/config.php';
[...]
header("Content-type: image/jpg");
echo base64_decode($doc['BDATA']);
But I'm getting a
Can't display the image because it contains errors.
I've decoded the base64 string in this web app www.opinionatedgeek.com/dotnet/tools/base64decode/ and the result is right, but different from the one I'm getting with base64_decode, which is wrong.
Edit: I have two environments using the same code: Test and Production. It works fine in Test, but not in Production, so I suspect a configuration problem.
I'm working with PHP 5.5.9 in Microsoft IIS.
An example of a string that base64_decode isn't decoding well:
/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCAABAAEDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD9/KKKKAP/2Q==
Any ideas?
Edit 2: If I comment this line
require_once '../inc/config.php';
and copy the code from config.php to my actual file, it works fine. What could be happening?

From the base64_decode manual comments:
php <= 5.0.5's base64_decode( $string ) will assume that a space is
meant to be a + sign where php >= 5.1.0's base64_decode( $string )
will no longer make that assumption
To fix this behavior, try this code:
$encodedData = str_replace(' ','+',$encodedData);
$decodedData = base64_decode($encodedData);
Since this is not your case, you should check this answer.
Everything works fine for me here (on WAMP).
EDIT:
As discussed in the conversation below, many things can corrupt the output before your header is sent. For example, if your file is encoded as UTF-8, you should save it as UTF-8 without BOM (you can do this with Notepad++). If you use FTP, also make sure that your client didn't add any characters to your file. Apart from that, everything should work fine.
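One way to test the BOM theory is to look at the first three bytes of the included file. The sketch below is only an illustration: hasUtf8Bom() is a made-up helper, and the temporary files stand in for a file such as config.php.

```php
<?php
// Hypothetical helper: report whether a file starts with the UTF-8 byte-order mark (EF BB BF).
function hasUtf8Bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $prefix = fread($handle, 3);
    fclose($handle);
    return $prefix === "\xEF\xBB\xBF";
}

// Demonstration with two temporary files standing in for an included PHP file.
$withBom = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($withBom, "\xEF\xBB\xBF<?php // config\n");
$withoutBom = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($withoutBom, "<?php // config\n");

var_dump(hasUtf8Bom($withBom));    // bool(true)
var_dump(hasUtf8Bom($withoutBom)); // bool(false)

unlink($withBom);
unlink($withoutBom);
```

If an included file has a BOM, those three bytes are sent to the browser before the image data, which would be enough to break a binary response.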

base64 encoding is not completely standardised.
Some implementations use different characters, so you'll have to replace those characters before you run your decode.
further details
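As an illustration of replacing characters before decoding, here is a minimal sketch for one common variant, the URL-safe alphabet that uses - and _ instead of + and /. The helper name decodeBase64Any() is made up, and this is only one of several variants; it is not known whether it applies to the string in the question.

```php
<?php
// Hypothetical helper: accept both the standard and the URL-safe base64 alphabet.
function decodeBase64Any(string $encoded)
{
    // Map the URL-safe characters back to the standard alphabet.
    $standard = strtr($encoded, '-_', '+/');
    // URL-safe input often omits padding, so restore it to a multiple of 4.
    $padding = (4 - strlen($standard) % 4) % 4;
    // Strict mode returns false if any invalid character remains.
    return base64_decode($standard . str_repeat('=', $padding), true);
}

echo bin2hex(decodeBase64Any('+/8=')), "\n"; // fbff
echo bin2hex(decodeBase64Any('-_8')), "\n";  // fbff
```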

Send SMS containing 'æøå' with AT commands

I'm making an SMS sending function for a project I'm working on. The code works just fine, but
when I send the letters 'æ-ø-å-Æ-Ø-Å' they turn into 'f-x-e-F-X-E'.
How do I change the encoding so that I can send these letters?
This is my code:
<?php
include "php_serial.class.php";
$html = $_POST['msg'];
$serial = new phpSerial;
$serial->deviceSet("/dev/cu.HUAWEIMobile-Modem");
$serial->deviceOpen();
$serial->sendMessage("ATZ\n\r");
// Wait and read from the port
var_dump($serial -> readPort());
$serial->sendMessage("ATE0\n\r");
// Wait and read from the port
var_dump($serial -> readPort());
// To write into
$serial->sendMessage("AT+cmgf=1;+cnmi=2,1,0,1,0\n\r");//
$serial->sendMessage("AT+cmgs=\"+45{$_POST['number']}\"\n\r");
$serial->sendMessage("{$html}\n\r");
$serial->sendMessage(chr(26));
//wait for modem to send message
sleep(3);
$read=$serial->readPort();
$serial->deviceClose();
$read = preg_replace('/\s+/', '', $read);
$read = substr($read, -2);
if($read == "OK") {
header("location: index.php?send=1");
} else {
header("location: index.php?send=2");
}
?>

First of all, you must seriously redo your AT command handling:
Read and parse every single response line given back from the modem until you get a final result code, for every single command line invocation, no exceptions whatsoever. See this answer for more details.
For AT+CMGS specifically you also MUST wait for the "\r\n> " response before sending data; see this answer for more details.
Now to answer your question about æøå turning into fxe: this is a classic case of stripping the most significant bit of ISO 8859-1 encoded characters (which I had almost forgotten about). This is probably caused by the default character encoding, but since you should always be explicit and set the character encoding you want to use in any case, investigating that further is of no value. The character encoding used for strings in AT commands is controlled by the AT+CSCS command (see this answer for more details); run AT+CSCS=? to get a list of options.
Based on your information you seem to be using ISO 8859-1, so running AT+CSCS="8859-1" will stop the MSB from being zeroed. You might be satisfied with just that, but I strongly recommend using UTF-8 instead; it is just so vastly superior to 8859-1.
If all of that fails, I am quite sure that at least one of the GSM or IRA encodings supports the æøå characters, but then you have to do some very custom translation, since those characters will have binary values very different from what is common in text elsewhere.
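The stripped-MSB explanation is easy to check without a modem. The sketch below clears the top bit of each ISO 8859-1 byte; stripHighBit() is a made-up name and only simulates what the modem appears to be doing.

```php
<?php
// Clear the most significant bit of every byte, as a 7-bit channel would.
function stripHighBit(string $latin1): string
{
    $out = '';
    for ($i = 0, $n = strlen($latin1); $i < $n; $i++) {
        $out .= chr(ord($latin1[$i]) & 0x7F);
    }
    return $out;
}

// "æøåÆØÅ" written as ISO 8859-1 bytes, so the example does not depend on
// the encoding this source file is saved in.
$latin1 = "\xE6\xF8\xE5\xC6\xD8\xC5";

echo stripHighBit($latin1), "\n"; // fxeFXE
```

For example, æ is 0xE6 in ISO 8859-1, and 0xE6 with the top bit cleared is 0x66, which is the letter f.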

PHP: string breaks at special character

I wrote a small PHP script which does a "branding" on an existing PDF file. This means that on every page I put a string like "belongs to " at a specific position. For this I use Zend_Pdf from the Zend Framework.
Because the script is used in a German-speaking area, one line uses the special character "ö" ("Gehört zu ").
On my local machine (Windows, XAMPP) the script worked fine, but after moving it to my hoster's webspace (some Linux), the string breaks at "ö". That means only "Geh" appears in my PDF.
The code is this:
if (substr($file, strlen($file) - 4) === '.pdf') {
$name = $user->GetName;
$fontSize = 12;
$xTextPos = 100;
$yTextPos = 10;
set_include_path(dirname(__FILE__)); // set include_path for external library Zend Framework
require_once('Zend' .DS . 'Pdf.php');
$pdf = Zend_Pdf::load($file);
$font = Zend_Pdf_Font::fontWithName(Zend_Pdf_Font::FONT_HELVETICA);
$branding = 'Gehört zu ' . $name; // German for: 'Belongs to ', problem with 'ö'
foreach ($pdf->pages as &$page) {
$page->setFont($font, $fontSize);
$page->drawText($branding, $xTextPos, $yTextPos);
}
}
I guess the problem is related to some kind of default charset or language setting of the PHP environment. So I searched here and tried out:
$branding = utf8_encode('Gehört zu ') . $name;
...and I made some experiments with functions like html_entity_decode, but nothing helped, so I decided to stop groping in the dark and open my own question.
Looking forward to any hints. Thank you in advance for your help!
EDIT: Meanwhile I found the same (?) problem, solved on a German forum. But if I do it like they say...
$branding = mb_convert_encoding('Gehört zu ', 'ISO-8859-1') . $name;
... the resulting branding in the PDF is "Gehrt zu ". The "ö" is skipped now.
For this I found another hint on the Zend issue tracker.
To sum up, I can drop all the UTF-8 things and concentrate on Latin-1, AKA ISO 8859-1.
I still don't understand why the code worked on my Windows + XAMPP and now crashes on my hoster's Linux.

Your guess is right, the problem is related to encoding. Where exactly the encoding is messed up is hard to say from afar. I'm assuming you work not only with Zend_Pdf, but also have the MVC in place (meaning a complete Zend_Application).
You should check if your application serves pages as UTF-8, by setting:
resources.view.encoding = "UTF-8"
and also placing the appropriate meta-tags in your layout/view.
Depending on what Editor you use, your files may be encoded in a different encoding. You can use Notepad++ on Windows to check your file-encoding and for converting it to UTF-8 (don't just set the encoding to UTF-8, this might mess up your file!) if necessary. I recommend using Eclipse with text file encoding set to "UTF-8" (Preferences > General > Workspace) to make sure your code files are encoded in UTF-8.
Now comes the crucial part:
Zend_Pdf_Page::drawText(string $text, float $x, float $y, string $charEncoding)
See that last argument... set it. If you're lucky, you can skip the previous stuff and just set the encoding there.
edit: I missed something: database connections. You should check the encoding there too. I frequently work with MS SQL Server, which uses Latin-1 internally; not setting driver_options.CharacterSet can mess things up pretty badly too. This might be relevant if you have something like Gehört zu: Günther, where the name Günther is fetched from the db.

The result also depends on the encoding of the source file.
If your file is saved as UTF-8, for example, and you use utf8_encode("ö"), then you encode to UTF-8 something that is already UTF-8.
So you may want to check what your file encoding is and what your PDF library requires, then apply the right transformation.
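A minimal sketch of applying one explicit transformation, assuming the source text is UTF-8 and the target is ISO 8859-1, and assuming the mbstring extension is available. Naming both encodings avoids depending on the environment's internal encoding, which is what a two-argument mb_convert_encoding() call falls back to.

```php
<?php
// "Gehört zu " written as explicit UTF-8 bytes, so the example behaves the same
// whatever encoding this source file is saved in.
$utf8 = "Geh\xC3\xB6rt zu ";

// Name both encodings: target first, then source.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 476568c3b67274207a7520
echo bin2hex($latin1), "\n"; // 476568f67274207a7520
```

The two-byte sequence c3 b6 becomes the single byte f6, which is "ö" in ISO 8859-1.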

Mysql insert text data truncated by weird character encodings

I'm importing data from a CSV file that comes from Excel, but I can't seem to insert my data correctly. This data contains French accented characters, and if I open the CSV with OpenOffice (I don't use Excel) I just select UTF-8 and the data gets converted and shown fine.
If I read that into PHP memory, mb_detect_encoding reports the strings as UTF-8. I connect to a database and specify UTF-8 for all charsets using:
mysql_query('SET character_set_results = "utf8", character_set_client = "utf8", character_set_connection = "utf8", character_set_database = "utf8", character_set_server = "utf8"');
And I can certify that my database contains only UTF-8 fields and tables.
What happens is that my content gets truncated at the first accented character, but it seems that only happens in my PHP script. If I output all my data to the browser and copy the INSERT statement, it inserts the whole data.
There might be something going on between PHP and the browser output, but I can certify that it's not in the programming of the script... So far I have been able to work around this issue by running all my data through htmlentities, but the problem is that my search engine goes crazy because of that...
Any explanation or fix you can offer would be really appreciated...
EDIT #1:
I searched for the default Excel encoding of CSV data and found out it was CP1252. I tried using iconv('CP1252', 'UTF-8//TRANSLIT', $data) and now the accented characters seem to fit. I'm going to try it everywhere in my script to see if all my accented character issues are fixed, and post the solution if so...

After countless tries, I was able to fix all my encoding problems, although I still don't know why some of them happen. I hope this will help someone else later:
function fixEncoding($data){
//Replace
return iconv('CP1252', 'UTF-8//TRANSLIT', $data);
}
I now use this function to recode my strings correctly. It seems that Excel saves data as CP1252 and NOT UTF-8.
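One caveat with converting unconditionally is that input which is already UTF-8 would be encoded twice. A guarded variant could look like the sketch below; fixEncodingIfNeeded() is a made-up name, it assumes both the mbstring and iconv extensions are available, and it assumes that anything that is not valid UTF-8 is CP1252.

```php
<?php
// Hypothetical variant of fixEncoding(): convert only when the input is not already valid UTF-8.
function fixEncodingIfNeeded(string $data): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave it alone
    }
    // Assumption: anything else is CP1252, as with the Excel export described above.
    $converted = iconv('CP1252', 'UTF-8//TRANSLIT', $data);
    return $converted === false ? $data : $converted;
}

echo bin2hex(fixEncodingIfNeeded("\xE9t\xE9")), "\n";         // c3a974c3a9 ("été" from CP1252)
echo bin2hex(fixEncodingIfNeeded("\xC3\xA9t\xC3\xA9")), "\n"; // c3a974c3a9 (already UTF-8)
```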
Furthermore, there seems to be a bug with accented characters at the start of a string in a CSV when you use fgetcsv, so I had to forgo fgetcsv and create an alternative method. Since I'm not on PHP 5.3, I don't have str_getcsv, which might have fixed my issue; I'm not sure. I even looked for ports, but nothing seems to exist that works correctly.
This is my solution; although very ugly, it works for me:
function fgetcsv2($filepointer, $maxlen, $sep, $enc){
$data = fgets($filepointer, $maxlen);
if($data === false){
return false;
}
$data = explode($sep, $data);
return $data;
}
Good luck to all who get similar problems

I also had to work on such a project and, seriously, PHPExcel was my savior and spared me a lot of headaches.
P.S. : also, there is this link to help you getting started (in french).

I have just had a similar problem, and although I tested the $value using mb_detect_encoding and it said it was UTF-8, it still truncated the data.
Not knowing what to convert from, I couldn't use the iconv function mentioned above.
However I forced it to UTF-8 using utf8_encode($value) and everything works fine now.

Which encoding are you using for your tables?
mb_detect_encoding is not 100% correct all the time, and no encoding detector can ever be.

fwrite() and UTF8

I am creating a file using PHP's fwrite() and I know all my data is in UTF-8 (I have done extensive testing on this: when saving the data to the db and outputting it on a normal web page, everything works fine and reports as UTF-8), but I am being told the file I am outputting contains non-UTF-8 data :( Is there a command in bash (CentOS) to check the format of a file?
When using vim it shows the content as:
Donâ~#~Yt do anything .... Itâ~#~Ys a
great site with
everything....Weâ~#~Yve only just
launched/
Any help would be appreciated: Either confirming the file is UTF8 or how to write utf8 content to a file.
UPDATE
To clarify how I know my data is in UTF-8, I have done the following:
The DB is set to utf8. When saving data to the database I run this first:
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "UTF-8", $enc);
Just before I run fwrite I check the data with the code below. Note that each piece of data returns 'IS utf-8':
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'NOT UTF-8';
else print 'IS utf-8';
Thanks!

If you know the data is in UTF-8, then you want to set up the file header.
I wrote a solution in answer to another thread.
The solution is the following: since the UTF-8 byte-order mark is \xef\xbb\xbf, we should add it to the start of the document.
<?php
function writeStringToFile($file, $string){
$f=fopen($file, "wb");
$string="\xEF\xBB\xBF".$string; // this is what makes the magic: prepend the BOM to the content, not to the file name
fputs($f, $string);
fclose($f);
}
?>
You can adapt it to your code, basically you just want to make sure that you write a UTF8 file (as you said you know your content is UTF8 encoded).

fwrite() itself is documented as binary-safe, but the stream it writes to may not be. That means that your data, be it correctly encoded or not, might get mangled by the underlying routines if the file was opened in text mode.
To be on the safe side, you should use fopen() with the binary mode flag, b. fwrite() will then save your string data "as-is", which in PHP means binary data, because strings in PHP are binary strings.
Background: Some systems distinguish between text and binary data. The binary flag explicitly tells PHP on such systems to use binary output. When you deal with UTF-8, you should take care that the data does not get mangled; handling the string data as binary data prevents that.
However, if the UTF-8 encoding of the data is not actually preserved as you describe in your question, then your encoding is already broken, and binary-safe handling will simply keep it broken. With the binary flag you at least ensure that it is not the fwrite() part of your application that is breaking things.
As another answer here rightly points out, you cannot know the encoding if you only have the data. However, you can check whether the data validates as UTF-8, which gives you at least some chance to check the encoding. I've posted a PHP function that does this in a UTF-8 related question, so it might be of use if you need to debug things: Answer to: SimpleXML and Chinese. Look for can_be_valid_utf8_statemachine, that's the name of the function.
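The function mentioned above is not reproduced here. As a stand-in, the sketch below uses mb_check_encoding() (assuming the mbstring extension) and contrasts it with the length comparison from the question; looksMultibyte() and isValidUtf8() are made-up names.

```php
<?php
// The length comparison from the question: it only detects multi-byte characters.
function looksMultibyte(string $data): bool
{
    return strlen($data) !== mb_strlen($data, 'UTF-8');
}

// A stricter check: is the byte sequence well-formed UTF-8?
function isValidUtf8(string $data): bool
{
    return mb_check_encoding($data, 'UTF-8');
}

$samples = [
    'ascii'  => 'plain text',       // valid UTF-8, no multi-byte characters
    'utf8'   => "Don\xE2\x80\x99t", // valid UTF-8 with a right single quotation mark
    'cp1252' => "Don\x92t",         // Windows-1252 bytes, not valid UTF-8
];

foreach ($samples as $label => $data) {
    printf(
        "%-7s multibyte=%s valid=%s\n",
        $label,
        var_export(looksMultibyte($data), true),
        var_export(isValidUtf8($data), true)
    );
}
```

Plain ASCII is valid UTF-8 even though the length comparison finds no multi-byte characters, so a 'NOT UTF-8' result from that comparison does not prove the data is in some other encoding. Neither check can tell whether valid UTF-8 has been encoded twice.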

//add BOM to fix UTF-8 in Excel
fputs($fp, $bom =( chr(0xEF) . chr(0xBB) . chr(0xBF) ));
I find this piece works for me :)

The problem is your data is double encoded. I assume your original text is something like:
Don’t do anything
with ’, i.e., not the straight apostrophe, but the right single quotation mark.
If you write a PHP script with this content and encoded in UTF-8:
<?php
//File in UTF-8
echo utf8_encode("Don’t"); //this will double encode
You will get something similar to your output.
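The double encoding can be reproduced, and undone, byte for byte. This sketch uses mb_convert_encoding() with ISO-8859-1 as the source, which performs the same conversion as utf8_encode(); it assumes the mbstring extension and that the text was double encoded exactly once.

```php
<?php
$original = "Don\xE2\x80\x99t"; // "Don’t" as UTF-8 bytes

// Treating UTF-8 bytes as ISO-8859-1 and converting again is what utf8_encode() does.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reversing that single step restores the original bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n"; // 446f6ee2809974
echo bin2hex($double), "\n";   // 446f6ec3a2c280c29974
echo bin2hex($repaired), "\n"; // 446f6ee2809974
```

The three bytes of the quotation mark each become a two-byte sequence, which shows up as "â" followed by two control characters, much like the vim output in the question.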

$handle = fopen($file,"w");
fwrite($handle, pack("CCC",0xef,0xbb,0xbf));
fwrite($handle,$data); // $data holds the UTF-8 content to write (not the file name)
fclose($handle);

I know all my data is in UTF8 - wrong.
Encoding is not the format of a file. So check the charset in the headers of the page you are taking the data from:
header("Content-type: text/html; charset=utf-8");
And check whether the data really is in a multi-byte encoding:
if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'not UTF-8';
else print 'utf-8';

There are a couple of possible reasons:
First, the information you get from the database may not be UTF-8.
If you are sure it is, use this. I always use it and it works:
$file= fopen('../logs/logs.txt','a');
fwrite($file,PHP_EOL."_____________________output_____________________".PHP_EOL);
fwrite($file,print_r($value,true));

The only thing I had to do was add a UTF-8 BOM to the CSV. The data was correct, but the file reader (an external application) couldn't read the file properly without the BOM.

Try this simple method: add the following to the top of the page, before the <body> tag:
<head>
<meta charset="utf-8">
</head>
