Odd encoding issue after UTF-8 straightens "most" things out - php

OK, so we have a script that takes emails sent to Thunderbird, converts part of each message to HTML, and saves it to a MySQL database. Every file and every part written is set to UTF-8. On my end of the work, the CRM (written in PHP 5.3, with Chrome and Firefox as the expected browsers), I pull the message along with other info and display something resembling Gmail, but as a "task list" for our employees.
The problem I'm having, if you haven't guessed already, is that some customer emails are obviously using different encodings. As a result, some (not all, and certainly not the majority) of the emails don't show all characters correctly.
At first I used utf8_encode to get the email messages to look right. This helps with most messages coming from the database, but a few still slip by with bad characters.
In the DB these "bad apostrophes" appear as ’, but after utf8_encode they come through as �??. I've tried various encoding things to guess and change as needed, however, this tends to hurt the vast majority of the other emails.
Any suggestions, on one end of the pipe or the other, for how I might get these few emails to match everything else, or how I might at least create a preg_replace filter at the end?
Update:
It seems even the emails with bad characters are passed to the final PHP as UTF-8, according to mb_detect_encoding. This is before any extra encoding. iconv does detect the ones that have problems, but that gives me no way to fix them: it puts a PHP error box on the screen instead of simply returning FALSE as documented, so this doesn't seem to be a solution either.

The problem is that you don't know the encoding of the mail. utf8_encode encodes only from ISO-8859-1 to UTF-8. So you could try to get the encoding with mb_detect_encoding and then convert to UTF-8 with iconv.
EDIT: You could also try to read the Content-Type's charset of the mail.
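A minimal sketch of that idea, assuming PHP 7.1+ with mbstring available. The function names and the Windows-1252 fallback are illustrative assumptions, not something from the question. It only converts a body that is not already valid UTF-8, so the mail that already displays correctly is left alone:

```php
<?php
// Hypothetical helpers; the names and the Windows-1252 fallback are assumptions.

// Pull the charset parameter out of a Content-Type header value, if present.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return $m[1];
    }
    return null;
}

// Convert a message body to UTF-8 without touching text that is already valid UTF-8.
function toUtf8(string $body, ?string $declaredCharset = null): string
{
    if (mb_check_encoding($body, 'UTF-8')) {
        return $body; // already UTF-8; converting again would garble it
    }
    // Trust the declared charset if there is one, otherwise guess Windows-1252.
    $from = $declaredCharset ?? 'Windows-1252';
    return mb_convert_encoding($body, 'UTF-8', $from);
}
```

Usage would look something like toUtf8($rawBody, charsetFromContentType($contentTypeHeader)), where both variables stand for whatever the importing script has on hand. A declared charset that mbstring does not recognise would still need its own handling.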

Found My Answer!
Let me start by thanking Sebastián Grignoli for creating this VERY handy class. I ended up working it into my final solution.
Second, I added the class to CodeIgniter. For any of you using CI, this is an easy implementation. Simply create a file in application/libraries named Encoding.php (yes, with the capital E). Then copy the code into that file, but comment out (or remove) namespace ForceUTF8 on line 40.
My end result looks something like:
echo(Encoding::fixUTF8(utf8_decode($msgHTML)));
I'm still double checking, but thus far, I've yet to find one single error!
If I do find another encoding issue after this, I'll make sure to update.
SO Question I found that helped.
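For readers wondering why the stored text looks like ’: that pattern is consistent with a UTF-8 apostrophe (bytes E2 80 99) having been read as Windows-1252 and encoded to UTF-8 a second time. The sketch below is not how the ForceUTF8 class works internally; it is a hypothetical helper, assuming mbstring, that peels off one such layer and only accepts the result when it round-trips cleanly:

```php
<?php
// Hypothetical helper, not part of ForceUTF8: undo one layer of
// "UTF-8 read as Windows-1252, then encoded as UTF-8 again".
function undoDoubleEncoding(string $s): string
{
    // Map each character back to the single byte it stood for.
    $candidate = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

    // Accept the result only if nothing was lost on the way (round-trip check),
    // something actually changed, and the peeled bytes are valid UTF-8.
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252');
    if ($roundTrip === $s && $candidate !== $s && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $s; // leave correctly encoded text alone
}
```

Text that is already correct, such as a properly encoded apostrophe or accented letter, fails the validity check and is returned unchanged. A string that legitimately contains a sequence like ’ would be altered, so this is a heuristic rather than a guarantee.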

Related

Can't get rid of diamonds with question marks in the middle since php 5.6 switch

Rackspace upgraded their servers to php 5.6/apache 2.4 and ever since then, I have had several sites show these strange characters. I have gone all through google to apply patches/fixes but absolutely nothing is working. Here is an example of one: believerschallenge.com/index.php
Any help would be GREATLY appreciated!
While it's not immediately obvious which characters those are supposed to be, I can think of two ways to try to address this. The first is to check the source where these characters are appearing and remove / alter the characters that aren't being rendered correctly.
The second is more complicated but probably better in the long run, and it involves checking the various things detailed in this question:
I need help fixing Broken UTF8 encoding
In particular, this one is probably the cause:
Change your PHP default charset to utf-8:
ini_set("default_charset", 'utf-8');
See UTF-8 all the way through
You probably have latin1-encoded bytes in your text and have not specified that the connection between the client and the server is utf8. Both should be changed to get rid of the black-diamond-with-question-mark.
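A rough sketch of what "UTF-8 all the way through" can look like on the PHP side, assuming PDO with the MySQL driver. The buildMysqlDsn helper, host and database names are placeholders for illustration; the charset DSN parameter itself is a standard PDO MySQL option:

```php
<?php
// Tell PHP to label its output as UTF-8.
ini_set('default_charset', 'UTF-8');
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: build a DSN that asks for a UTF-8 connection,
// so bytes are not treated as latin1 between PHP and MySQL.
function buildMysqlDsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Example connection (credentials are placeholders):
// $pdo = new PDO(buildMysqlDsn('localhost', 'example_db'), $user, $password);
```

If the stored bytes are already latin1, changing the connection charset alone will not repair them; the data itself would also need converting, as the answer above notes.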

Error on PHP Site when Copying/Paste from Outlook into Internet Explorer

Some of our users are experiencing a problem after copying and pasting text from MS Outlook into a text area box on our PHP site (running in IE, seems to work fine in other browsers). Specifically, the contents are apparently pasted properly, but when the data is passed back to the server and stored in the PostgreSQL database, no data is actually stored in the database (I'm about to check to see if the PHP is even receiving it in the $_POST variable, I'll post an update when I've done that).
It sounds like a problem with rich-text formatting or perhaps the encoding of what is pasted.
Does anyone know of a solution that we can apply to the PHP site to enforce that the text area only accept plain text (or automatically convert it) for IE?
Thanks!
Update: Sadly, I cannot reproduce the bug on IE 6, 7 or 8 using Outlook Express. Perhaps this is user error...I'll update with more info when I figure out what the actual problem is.
This might happen when some symbols copied are high-ASCII characters, and there is a mismatch with encodings you are working with. Make sure the page, your program, and your database use the same encoding (e.g. all use UTF-8, or whatever you use). I've encountered weird problems (empty strings, cut-off at the instance, etc) with inserting data that has characters like these.
But of course, check that you're actually getting the data to your program in the first place :)
Try calling strip_tags when pulling out from $_POST.
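As a small illustration of that suggestion, here is a hypothetical helper (the function and field names are assumptions) that reads one field from the posted data and reduces it to plain text:

```php
<?php
// Hypothetical helper: return one posted field as plain text with markup removed.
function plainTextField(array $post, string $field): string
{
    // Treat a missing or non-string value as empty rather than raising a notice.
    $raw = (isset($post[$field]) && is_string($post[$field])) ? $post[$field] : '';
    return trim(strip_tags($raw));
}

// Typical call inside the form handler:
// $notes = plainTextField($_POST, 'notes');
```

This only removes tags. It does not change the character encoding, so the encoding checks mentioned above still apply.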

"Smart Quotes" not displaying properly in email from phpmailer

I'm dealing with a LAMP web server. I have forms that users use to submit text that is stored in a text field in mysql. Often this text is copied and pasted from Microsoft Office products, so I'm getting a lot of smart quotes and emdashes. These characters display properly if I retrieve them from the database and display them on the webpage, but where I'm running into trouble is sending the text in an email using the phpmailer class. I get stuff that looks like this: – (where it should be an emdash).
One thing that may be important: If I pull up a console in mysql and select a field that has an emdash or smart quote in it, it will display on my console incorrectly: –, however, as stated above, if my php page (using PDO) selects the field and displays it, it will display correctly in a browser (as an emdash in this case).
I'm not sure if there's a way to select a character set in phpmailer, (maybe it's a simple setting somewhere?) or if there is a better way around this problem. I think I should be clear, though, that "search and replace smart quotes and emdashes with their regular equivalents" is NOT the answer I'm looking for (hopefully that's not the only solution).
I found this information:
My php webpage: utf-8
mysql client encoding: latin1
mysql server encoding: latin1
phpmailer character set: iso-8859-1
Character set can be switched in phpmailer with the following code:
$myMail->CharSet = "UTF-8";
This solved my issue. Typographic quotes and double dashes show up in my emails from phpmailer as expected now. This may have been a sorta noobish question (blush). Thanks, Col. Shrapnel for prompting me to look into what encoding all the pieces of the puzzle were using. I'd vote you up but don't have the reputation.
For anyone interested in homework, this link really helped me understand the basics of encoding:
http://www.joelonsoftware.com/articles/Unicode.html
The PEAR Mail_MIME package lets you do this via http://pear.php.net/manual/en/package.mail.mail-mime.get.php I am pretty certain I have used this feature before, but not positive.
You may also need to run things through iconv to normalize the character sets to a single one, if there are multiple data sources.

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using domDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the stuff displayed on the page (which had been displaying correctly), not in the separate ICS file (which was writing incorrectly), is incorrect. As an example: it turns "Inđija" to "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
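To make that concrete, the sketch below reproduces the "Inđija" case. It uses iconv for the same ISO-8859-1 to UTF-8 conversion that utf8_encode performs, and a preg_match check with the u modifier as a way to test for valid UTF-8 when mbstring is not installed. The helper names are made up for this example:

```php
<?php
// Same conversion utf8_encode() performs (ISO-8859-1 to UTF-8), done with iconv.
function latin1ToUtf8(string $s): string
{
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

// Validity check that works without mbstring: the /u modifier makes
// preg_match fail on a subject that is not valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$name = "In\xC4\x91ija"; // "Inđija" as UTF-8 bytes

// Converting text that is already UTF-8 treats each byte as its own
// Latin-1 character, which is where the mojibake comes from.
$doubleEncoded = latin1ToUtf8($name);

// Guarded version: convert only when the input is not valid UTF-8.
$safe = isValidUtf8($name) ? $name : latin1ToUtf8($name);
```

Here $doubleEncoded holds the bytes for "InÄ", an invisible control character, then "ija", while $safe is the original string unchanged.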
To help with your first problem, before this, I'd want to know what sort of "special" characters are being messed up in the original problem, and what way are they being messed up?

E-mails sent through php5+htmlMimeMail are being received with random characters replaced with =

I'm currently using PHP5 with htmlMimeMail 5 (http://www.phpguru.org/static/mime.mail.html) to send HTML e-mail communications. A number of recipients are seeing random characters replaced with equals signs, e.g.:
"Good mor=ing. Our school is sending our newsletter= and information through a company called..."
Have set e-mail text, HTML, and header encoding to UTF-8. The template files loaded by PHP for the e-mail (just include()'d text/HTML with a few php tags in them) are both encoded in UTF-8.
The interesting thing is that I can't duplicate the problem on any of my e-mail clients, and can't find any information by searching yahoo/googlies that would point me at the problem!!
Try sending with 8-bit encoding:
$message->setTextEncoding(new EightBitEncoding());
$message->setHTMLEncoding(new EightBitEncoding());
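One possible reading of this advice, which the thread does not confirm, is that the stray = signs come from quoted-printable encoding: long lines are wrapped with a trailing = as a "soft line break", and a client or relay that mishandles those breaks can leave = in the visible text. The sketch below uses PHP's built-in quoted_printable_encode and quoted_printable_decode (not htmlMimeMail) to show the soft breaks appearing and being removed again; the sample sentence is adapted from the question:

```php
<?php
// A line long enough that quoted-printable must wrap it.
$line = rtrim(str_repeat('Good morning. Our school is sending our newsletter. ', 3));

// Encoding inserts "=" followed by CRLF wherever a long line is wrapped.
$encoded = quoted_printable_encode($line);

// A conforming client removes the soft breaks and restores the original text.
$decoded = quoted_printable_decode($encoded);

echo $encoded, "\n";
```

With 8-bit transfer encoding no such wrapping markers are added, which may be why switching encodings is suggested here.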
I had a similar issue, but mine was a little different. Since I stumbled upon this thread looking for the answer and it helped me find it, I thought I may as well post this related answer here.
In my case, special characters were getting messed up in emails even though mb_detect_encoding reported the text strings being sent as "UTF-8", and they looked fine when I echoed them.
So I had to use the functions
$message->setTextCharset('UTF-8')
and
$message->setHTMLCharset('UTF-8')
I suspect your problem is related to older versions of Exchange. Equal signs at end of line:
It may not be the quoted printable thing with high/low order characters or the encoding. Also, elsewhere on that page it says:
NOTE: A bug ("feature"?) in Exchange may cause line feeds to be replaced with equal signs when rich text mail is disabled.
