How to encode spaces in E-Mail headers?

I want to send an e-mail using the mail() function in PHP. The "From" part of the header should contain a name with a space:
name with space <mail@server.tld>
Unfortunately, some mail clients cannot handle the spaces. That's why e.g. Thunderbird adds quotes to the name:
"name with space" <mail@server.tld>
That works fine until you add special characters like ÄÖÜ, since they need to be encoded. The following would not work:
"name with ÄÖÜ and space" <mail@server.tld>
That's why I tried the function mb_encode_mimeheader:
echo mb_encode_mimeheader("name with"." ÄÖÜ"." and space", "ISO-8859-1", "Q");
# result:
# name with =?ISO-8859-1?Q?=C3=84=C3=96=C3=9C=20and=20space?=
That still does not work, since the spaces before the first occurrence of the special characters are still in the string. The correct result should be:
=?ISO-8859-1?Q?name=20with=20=C3=84=C3=96=C3=9C=20and=20space?=
Is there a function in PHP that can handle this? Or should I use a mixture of quotes and mb_encode_mimeheader? Or is there a different way to handle spaces in mail headers? To be honest, I did not understand the meaning of the different kinds of whitespace mentioned in RFC 822.

You don't need quotes. RFC 2047 encoding (i.e. mb_encode_mimeheader) handles spaces as well.
(Although for the record, just plain spaces are completely unproblematic; they are not the reason some clients use quoting, which is often completely redundant anyway. So the result with the spaces is actually not incorrect at all.)
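As a minimal sketch of what that looks like in practice, the whole display name, spaces included, can be packed into a single RFC 2047 encoded word by hand. The helper name encode_display_name is made up for illustration, and the source file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical helper, not part of PHP: wraps a UTF-8 display name in a
// single RFC 2047 "B" (base64) encoded word when it is not plain ASCII.
function encode_display_name(string $name): string
{
    // Letters, digits and spaces can go into the header as they are.
    if (preg_match('/^[A-Za-z0-9 ]*$/', $name) === 1) {
        return $name;
    }
    return '=?UTF-8?B?' . base64_encode($name) . '?=';
}

$from = 'From: ' . encode_display_name('name with ÄÖÜ and space') . ' <mail@server.tld>';
echo $from, "\n";
```

An encoded word may be at most 75 characters long, so this single-word form only suits short names; longer text has to be split into several encoded words.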

As the IETF specifications say, email headers may only contain ASCII characters, so anything else in a header has to be encoded into ASCII first.
However, judging by the newer specifications coming from the Internet Engineering Task Force, you may soon not need to encode and decode your headers at all.

Related

â€™ in PHP is converting to &rsquo; when using mb_convert_encoding in Outlook subject

I have a mail() function set up in PHP. When sending a test message to my own address, I noticed the subject was converting my ' into â€™.
$subject="Please provide an updated copy of your company's certification";
result: Please provide an updated copy of your companyâ€™s certification.
I followed "Getting â€™ instead of an apostrophe(') in PHP" and added mb_convert_encoding, but now I am getting &rsquo; instead of '.
$subjectBad="Please provide an updated copy of your company's certification";
$subject= mb_convert_encoding($subjectBad, "HTML-ENTITIES", 'UTF-8');
result: Please provide an updated copy of your company&rsquo;s certification.
It comes through fine to my personal email, so is there a way to properly display a ' in Outlook's subject, or am I at the whim of whatever their system settings are?

Whatever you used to type the subject did not use a simple apostrophe ', which has the same representation across virtually all single-byte encodings and UTF-8. Instead it used a "fancy" right single quote ’, which is represented differently in single-byte encodings and in UTF-8.
mb_convert_encoding() is converting to an HTML entity because you are literally telling it to, and email headers are not HTML, so it's going to display as the literal string &rsquo;. Of the common single-byte encodings, the one that has "smart quotes" is Microsoft's cp1252 (ISO-8859-1 does not), and that is still the wrong answer for email headers.
The simplest answer is: Don't do that. Use a normal apostrophe. Everyone hates dealing with "smart" quotes.
The more complex answer is that email headers MUST be 7-bit safe "ASCII" text, and anything else requires additional handwaving. Ideally you should be using a proper email library that handles this, and the dozens of other annoyances that will malform your emails and impact deliverability.
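If you're dead-set on eroding your sanity and using mail() directly, then you're going to want to properly encode your subject line and use an explicitly-defined character set, which you should be doing anyway. Note that inside an RFC 2047 "Q" encoded word a space has to be written as _ (or =20), and a literal _ or ? has to be escaped. E.g.:
$subject = 'Please provide an updated copy of your company’s certification';
// Q encoding: quoted-printable, plus "_" for spaces and escaped "_" and "?"
$encoded = str_replace(['_', '?', ' '], ['=5F', '=3F', '_'], quoted_printable_encode($subject));
var_dump(
sprintf('=?UTF-8?Q?%s?=', $encoded)
);
Output:
string(82) "=?UTF-8?Q?Please_provide_an_updated_copy_of_your_company=E2=80=99s_certification?="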
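Note that the encoded word above is 82 characters long, while RFC 2047 caps a single encoded word at 75 characters. A rough sketch of splitting a long subject into several base64 encoded words follows. The name encode_header_words and the 45-byte chunk size are illustrative choices, and the mbstring extension is assumed to be available:

```php
<?php
// Hypothetical helper: splits UTF-8 text into base64 encoded words of at most
// 72 characters each (45 bytes -> 60 base64 chars + 12 chars of wrapping).
function encode_header_words(string $text, int $chunkBytes = 45): string
{
    $words = [];
    $length = strlen($text);
    for ($offset = 0; $offset < $length; $offset += strlen($chunk)) {
        // mb_strcut() cuts by bytes but never through a multi-byte character.
        $chunk = mb_strcut($text, $offset, $chunkBytes, 'UTF-8');
        if ($chunk === '') {
            break; // chunk size too small to hold one character
        }
        $words[] = '=?UTF-8?B?' . base64_encode($chunk) . '?=';
    }
    // CRLF plus a space folds the header line between encoded words.
    return implode("\r\n ", $words);
}

$subject = 'Please provide an updated copy of your company’s certification';
echo 'Subject: ', encode_header_words($subject), "\n";
```

A mail library does this kind of folding for you, which is one more reason to prefer one over calling mail() directly.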

How can I sanitize a string while maintaining all non-English alphabet support

Generally, I would strip all characters that are not English using something like :
$file = filter_var($file, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH );
however, I am tired of not providing support for user input from other languages which may be in the form of an uploaded file (the filename may be in Cyrillic or Chinese, or Arabic, etc) or a form field, or even content from a WYSIWYG.
The examples for sanitizing data with regards to this, come in one of two forms
Those that strip all chars which are non-English
Those that convert all chars which are non-English to English letter substitutes.
The problem with this practice, is that you end up with a broken framework that pretends it supports multiple languages however it really doesn't aside from maybe displaying labels or content to them in their language.
There are a number of attacks which take advantage of unicode/utf-8/utf-16/etc support passing null bytes and so on, so it is clear that not sanitizing the data is not an option.
Is there any way to clean up a variable from arbitrary commands while maintaining the full alphabets/chars of these other languages, but stripping out (in a generic manner) all possible non-printable chars, chars that have nulls in them as part of the char, and other such exploits while maintaining the integrity of the actual characters the user input ? The above command is perfect and does everything exactly as it should, however it would be super cool if there were a way to expand that to allow support for all languages.

Null bytes are not(!) part of any multi-byte UTF-8 character; a 0x00 byte in UTF-8 is only ever the NUL character itself, which you can reject explicitly. So assuming you use UTF-8 internally, all you need to do is verify that the passed variables are valid UTF-8. There's no need to support UTF-16, for example, because you, as the author of the API or form in question, define the correct encoding, and you can limit yourself to UTF-8. Further, "unicode" is also not an encoding you need to support, simply because it is not an encoding. Rather, Unicode is a standard and the UTF encodings are part of it.
Now, back to PHP: the function you are looking for is mb_check_encoding(). Error handling is simple: if any parameter doesn't pass that test, you reply with a "bad request" response. No need to try to guess what the user might have wanted.
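A minimal sketch of that check follows; is_clean_utf8 is a made-up helper name. Since mb_check_encoding() accepts a NUL byte as valid UTF-8, this version also rejects control characters explicitly, which suits single-line fields such as names but would be too strict for multi-line text:

```php
<?php
// Hypothetical helper: valid UTF-8 and free of control characters.
function is_clean_utf8(string $value): bool
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        return false;
    }
    // \p{Cc} = control characters (NUL, tab, line breaks, ...)
    return preg_match('/\p{Cc}/u', $value) === 0;
}

foreach (['логотип.png', "\xC3\x28", "foo\0bar"] as $input) {
    if (!is_clean_utf8($input)) {
        // In a web context this is where http_response_code(400) would go.
        echo "rejected\n";
        continue;
    }
    echo "accepted: $input\n";
}
```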
While the question doesn't specifically ask this, here are some examples and how they should be handled on input:
non-UTF-8 bytes: Reject with 400 ("bad request").
strings containing path elements (like ../): Accept.
filename (not file path) containing path elements (like ../): Reject with 400.
filenames شعار.jpg, 标志.png or логотип.png: Accept.
filename foo <0> bar.jpg: Accept.
number abc: Reject with 400.
number 1234: Accept.
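The type checks in this list could be sketched as follows. The function names and the exact rules (for instance, that a number means a non-empty string of ASCII digits) are assumptions made for illustration:

```php
<?php
// A filename must be a single path component: no separators, no NUL byte,
// and not one of the special names "." or "..".
function is_plain_filename(string $name): bool
{
    if ($name === '' || $name === '.' || $name === '..') {
        return false;
    }
    // Character class: forward slash, backslash, NUL
    return preg_match('#[/\\\\\x00]#', $name) === 0;
}

// A number here is one or more ASCII digits and nothing else.
function is_number(string $value): bool
{
    return preg_match('/^[0-9]+$/D', $value) === 1;
}

var_dump(is_plain_filename('شعار.jpg'));        // accepted
var_dump(is_plain_filename('../etc/passwd'));   // rejected
var_dump(is_plain_filename('foo <0> bar.jpg')); // accepted
var_dump(is_number('abc'));                     // rejected
```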
Here's how to handle them for different outputs:
non-UTF-8 bytes: Can't happen, they were rejected before.
filename containing path elements: Can't happen, they were rejected before.
filenames شعار.jpg, 标志.png or логотип.png in HTML: Use verbatim if the HTML encoding is UTF-8, replace as HTML entities when using default ISO8859-1.
filenames شعار.jpg, 标志.png or логотип.png in Bash: Use verbatim, assuming the filesystem's encoding is UTF-8.
filenames شعار.jpg, 标志.png or логотип.png in SQL: Probably just quote, depends on the driver, DB, tables etc. Consult the manual.
filename foo <0> bar.jpg in HTML: Escape as "foo &lt;0&gt; bar.jpg". Maybe use "&nbsp;" for the spaces.
filename foo <0> bar.jpg in Bash: Quote or escape " ", "<" and ">" with backslashes.
filename foo <0> bar.jpg in SQL: Just quote.
number abc: Can't happen, they were rejected before.
number 1234 in HTML: Use verbatim.
number 1234 in Bash: Use verbatim (not sure).
number 1234 in SQL: Use verbatim.
The general procedure should be:
Define your internal types (string, filename, number) and reject anything that doesn't match. These types create constraints (filename doesn't include path elements) and offer guarantees (filename can be appended to a directory to form a filename inside that directory).
Use a template library (Mustache comes to mind) for HTML.
Use a DB wrapper library (PDO, Propel, Doctrine) for SQL.
Escape shell parameters. I'm not sure which way to go here, but I'm sure you will find proper ways.
Escaping is not a single defined procedure but a family of procedures. The actual escaping algorithm used depends on the target context. Contrary to what you wrote ("escaping will also screw up the names"), the exact opposite should be the case! Basically, it makes sure that a string containing a less-than sign in XML remains a string containing a less-than sign and doesn't turn into a malformed XML snippet. To achieve that, escaping converts strings so that any character that would normally be interpreted as something other than plain text, like the space character in the shell, loses that special interpretation.
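For two of these target contexts PHP ships ready-made escaping functions. A small sketch using the example filename from above (for SQL, a prepared statement with bound parameters plays the same role); the ls command is only a placeholder:

```php
<?php
$filename = 'foo <0> bar.jpg';

// HTML context: escape markup characters; the name itself stays intact.
$html = htmlspecialchars($filename, ENT_QUOTES, 'UTF-8');
echo $html, "\n"; // foo &lt;0&gt; bar.jpg

// Shell context: quote the whole argument instead of escaping single characters.
$command = 'ls -l -- ' . escapeshellarg($filename);
echo $command, "\n";
```

The same input yields two different escaped forms, which is the point: the value is stored unescaped and escaped only at the moment it is written into a particular output.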

Strip Base64 strings from long text

I really wonder if I'm the first one asking this question, or am I just too blind to find anything about it...
I have a longer text and I want to strip base64-encoded strings from it:
I am a text and have some lines with some content
There are more than one line but sometimes I have
aSBhbSBhIG5vcm1hbCB0ZXh0IHRoYXQgd2FzIGNvZ
GVkIGluIGJhc2UgNjQgYW5kIG5vdyBpIHdhcyB0cmFu
c2xhdGVkIGJhY2sgdG8gYmxhbmsgdGV4dGZvcm1hd
C4gaSB0aGFuayB5b3UgZm9yIHBheWluZyBhdHRlbnRp
b24uIGJ5ZQ==
and this is what I want to strip / extract by using php
As you can see there is base64 encoded data in the text and I want to extract/strip these lines.
I already tried a lot of regex samples from SO, something like
$regex = '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#m';
preg_match($regex, $content, $output_array );
but this did not solve anything...
What I need is a regex that only selects the base64 strings...
Is this even possible? I mean, is base64 selectable by regex? I guess :)
EDIT: String-Source is the content of an email
EDIT2: I guess the best approach for this case would be to track strings that have more than one uppercase character, can contain numbers, and have no whitespace. But regex is not my daily bread :D

First of all: You can not reliably do this!
Why?
Simple: the reason base64 is so great in some cases is that it encodes all data using "standard" characters, the same ones that are used in normal texts, sentences, and yes, even words.
Background
Is "Hello" a base64-encoded string? Well, yes, in the sense that it consists only of characters from the base64 alphabet. Decoding it would probably return a lot of gibberish, but it looks like base64.
Therefore, you can only decide on a length beyond which you consider a run of characters without any whitespace to be base64-encoded. Of course, in languages such as German you may have quite some trouble here, as there are compound nouns such as "Bäckerfachverkäuferinnenhosenherstellungsautomatenzuliefererdienst" (just made that up).
Workaround
So you have to decide on the length yourself, and then you can go with this:
[a-zA-Z0-9\+\/\=]{20,}
Also see the example here: https://regex101.com/r/uK5gM1/1
I considered "20" to be the minimum length for "base64 encoded stuff" here, but as said, it is up to you. Also, as a small side note, the = is not really encoded content but fill bytes, but I still added it to the regex.
Edit: Gnah.. you can even see in my example that I did not catch the last line :) When changing the number to 12 it works fine here, but there may be words with more than 12 characters ... so - as said, not really reliably possible in this manner.
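One way around the short last line, sketched here under the assumption that the text uses LF line endings: require the minimum length only for the first line of a block and let any directly following lines that consist purely of base64 characters join it. This is still a heuristic, since a one-word line right after the block would be swallowed too:

```php
<?php
$content = implode("\n", [
    'There are more than one line but sometimes I have',
    'aSBhbSBhIG5vcm1hbCB0ZXh0IHRoYXQgd2FzIGNvZ',
    'GVkIGluIGJhc2UgNjQgYW5kIG5vdyBpIHdhcyB0cmFu',
    'c2xhdGVkIGJhY2sgdG8gYmxhbmsgdGV4dGZvcm1hd',
    'C4gaSB0aGFuayB5b3UgZm9yIHBheWluZyBhdHRlbnRp',
    'b24uIGJ5ZQ==',
    'and this is what I want to strip / extract by using php',
]);

// First line: at least 20 base64 characters. Following lines: any length,
// as long as they hold only base64 characters plus optional "=" padding.
$pattern = '~^[A-Za-z0-9+/]{20,}={0,2}$(?:\R[A-Za-z0-9+/]+={0,2}$)*~m';

preg_match($pattern, $content, $match);
$block = $match[0];                               // the five base64 lines
$stripped = preg_replace($pattern, '', $content); // the text without them

// Remove the line breaks before decoding the extracted block.
$decoded = base64_decode(preg_replace('/\s+/', '', $block));
echo $decoded, "\n";
```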

For the snippet in the example /^\w{53}$/gm does the job. If you can rely on length of course.
EDIT:
Considering circumstances and updates, I would go with /\n([\w=\n]{50,})\n/gs but without metadata it may be tricky to guess mime-type of the decoded stuff, and almost impossible to restore filenames etc.

Am I decoding quoted printable bounce messages containing a VERP-encoded addresses correctly in PHP?

Thanks to SO, I was able to save myself some time by avoiding writing my own function to decode quoted-printable emails and instead use PHP's quoted_printable_decode() function.
However, I very quickly ran into a problem. I have a bounce notification email that I need to decode and display in the browser and the body of the email contains the original email headers, which includes the original source address, which was VERP-encoded so that I could associate the bounce back with the corresponding user in the database.
The body of the bounce notification email includes this snippet:
------ This is a copy of the message, including all the headers. ------
Return-path: <blah+user=af.com@mydomain.com>
The problem is that quoted_printable_decode() uses the '=' character as a special character to indicate that it might have to do some special decoding. In the case of '=af' (\x3D6166) it decides to translate it into \xAF, which is the Unicode code point for the Macron character. When I later run this through htmlentities() this gets converted to the appropriate HTML code for the Macron character, so I end up with this output in the browser:
Return-path: <blah+user¯.com@mydomain.com>
Of course, this doesn't happen for all sequences starting with '=', only the ones that PHP decides it can convert into meaningful Unicode code points. The alternative imap_qprint() exhibits the same behaviour.
Oh, and I'm running PHP 5.3.8.
Am I doing something wrong or is this just how quoted_printable_decode() is supposed to work?

After a little more research I realised that the only two escape sequences in the encoded emails were equals signs at the end of each line to identify where a line feed had been removed and '=3D' to denote where real equals signs were supposed to be.
So I simply replaced the call to quoted_printable_decode() with this one-liner:
$decoded_body = str_replace( array( "=\n", '=3D' ), array( '', '=' ), $email_body );
I realise this won't deal with any Unicode data that could theoretically appear in the email, but given that I'm controlling the content of the messages that are bouncing back, I can pretty much guarantee that they won't. So keep that in mind if you decide to use this solution yourself.
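A slightly more tolerant sketch of the same idea, wrapped in a function (the name decode_minimal_qp is made up). Mail usually arrives with CRLF line endings, so this variant treats both "=\r\n" and "=\n" as soft line breaks, and it leaves every other "=" sequence, including "=af", untouched:

```php
<?php
// Hypothetical helper: undoes only the two escapes described above.
function decode_minimal_qp(string $body): string
{
    // "=" at the end of a line is a soft line break (CRLF or bare LF).
    $body = preg_replace('/=\r?\n/', '', $body);
    // "=3D" is an escaped equals sign.
    return str_replace('=3D', '=', $body);
}

$raw = "X-Test: a=3Db=\r\nc\r\nReturn-path: <blah+user=af.com@mydomain.com>";
echo decode_minimal_qp($raw), "\n";
```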

Illegal non-standard quotes in XML

I'm allowing some user input on my website, that later is read in XML. Every once in a while I get these weird single or double quotes like this ”’. These are directly copied from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my xml. htmlentities did not seem to touch them.
Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.
EDIT- I forgot to clarify these quotes are not being used in attributes, but in the following way:
<SomeTag>User’s Input</SomeTag>

Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:
<?xml version="1.0" encoding="UTF-8"?>
There may also be a UTF-8 option in the parser's API.
Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!
Edit 2: Apparently, those quotes aren't even legal in UTF-8, so ignore what I said above. Instead, you might find what you're looking for here, where a similar problem is being discussed.

Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".
If you need to get rid of them, a simple global replace using a text editor will do the job fine.
But you might try to work out first why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply some unwanted transcoding is happening somewhere along the line).

If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:
$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;
For me gives:
&rdquo;&rsquo;
whereas
$html = htmlentities( '”’' );
echo $html;
gets confused:
â??â??
If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
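One caveat, since the target here is XML rather than HTML: XML itself only predefines the entities &lt;, &gt;, &amp;, &quot; and &apos;, so named HTML entities such as &rsquo; are not available unless a DTD declares them. A sketch of an alternative, assuming the input is valid UTF-8, is to leave the curly quotes untouched, escape only the markup characters, and declare the encoding:

```php
<?php
$input = 'User’s <Input>';

// ENT_XML1 limits escaping to what XML needs; non-ASCII text passes through.
$escaped = htmlspecialchars($input, ENT_QUOTES | ENT_XML1, 'UTF-8');

$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . '<SomeTag>' . $escaped . '</SomeTag>';
echo $xml, "\n";
```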

Stay away from Microsoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart quotes".
These quote characters are truly non-standard and never made it into the official Latin-1 character set. All the MS Office apps "helpfully" replace standard quote characters with these abominations.
Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips and regexes to get rid of these.

Use
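$s = 'User’s Input';
// The /u modifier makes PCRE treat pattern and subject as UTF-8, so the
// multi-byte quotes are matched as whole characters rather than as bytes.
$descriptfix = preg_replace('/[“”]/u', '"', $s);
$descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix);
// Function calls are not interpolated inside strings, so concatenate instead.
echo '<SomeTag>' . htmlspecialchars($descriptfix) . '</SomeTag>';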
