Can't understand why Zend_Mail::addHeader() strips newlines


(Since this is my first SO question, let me just say I hope it's not too Zend-specific. As far as I can tell this shouldn't be a problem. Although I could have posted it in a Zend-specific forum, I feel like I'm at least as likely to get a good answer here, especially since the answer might involve MIME-related issues that transcend Zend Framework. I'm basically trying to understand whether the issue I'm facing should be considered a ZF bug, or if I'm misunderstanding something or misusing it.)
I've been using Zend_Mail to build up a MIME message that gets sent through SendGrid, an email distribution service. Their platform allows you to send emails through their SMTP server, but gives added features when you use a special header (X-SMTPAPI) whose value is a JSON-encoded string of proprietary parameters, which can get quite long.
Eventually, the header I was passing got too long (I think >1000 chars), and I got errors. I was confused because I knew that it was getting passed through PHP's native wordwrap() function before I passed the value to Zend_Mail::addHeader(), so I thought line length should never be a problem.
It turns out that addHeader() strips newlines very deliberately, and with no particular explanation by way of comments.
// In Zend_Mail::addHeader()
$value = $this->_filterOther($value);
// In Zend_Mail::_filterOther()
$rule = array("\r" => '',
              "\n" => '',
              "\t" => '',
             );
return strtr($data, $rule);
Ok, this seemed reasonable at first -- maybe ZF wants full control of formatting and line-wrapping. The next method called in Zend_Mail::addHeader() is
$value = $this->_encodeHeader($value);
This method encodes the value (either quoted-printable or base64 as appropriate) and chunks it into lines of appropriate length, but only if it contains "non-printable characters", as determined by Zend_Mime::isPrintable($value).
Looking into that method, newlines (\n) are indeed considered non-printable characters! So if only they hadn't been stripped out of the string in the previous method call, the long header would get encoded as QP and chunked into 72-char lines, and everything would work fine. In fact, I did a test where I commented out the call to _filterOther(), and the long header gets encoded and goes through with no problem. But now I've just made a careless hack to ZF without really understanding the purpose behind the line I removed, so this can't be a long-term solution.
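The unfolding effect described above can be reproduced outside of ZF. This is only a minimal sketch, not ZF code: filterOther() is a stand-in that copies the rule quoted from Zend_Mail::_filterOther(), while longestLine() and the sample payload are made up for illustration.

```php
<?php
// Stand-in for Zend_Mail::_filterOther(), using the rule quoted above.
function filterOther(string $data): string
{
    return strtr($data, array("\r" => '', "\n" => '', "\t" => ''));
}

// Hypothetical helper: length of the longest physical line in a value.
function longestLine(string $value): int
{
    return max(array_map('strlen', explode("\n", $value)));
}

// Hypothetical payload: a long JSON-like string that contains spaces.
$payload = '{"category": "newsletter", "to": ['
         . implode(', ', array_fill(0, 80, '"user@example.com"'))
         . ']}';

// Wrap first, as the question does before calling addHeader().
$wrapped = wordwrap($payload, 72, "\n");

echo longestLine($wrapped), "\n";              // no line is longer than 72
echo longestLine(filterOther($wrapped)), "\n"; // one single line again, well over 1000
```

Once the newlines are removed the value no longer contains anything that the printable check would flag, which matches the behaviour described in the question.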
My medium-term solution has been to extend Zend_Mail and create a new method, addHeaderForceEncode(), which will always encode the value of the header, and thus always chunk it into short lines. But I'm still not satisfied because I don't understand why that _filterOther() call was necessary in the first place -- maybe I shouldn't be working around it at all.
Can anyone explain to me why this newline-stripping behaviour exists? It seems to inevitably lead to situations where a header can get too long if it doesn't contain any "non-printable characters" other than newlines.
I've done a bunch of different searches on this subject and looked through some ZF bug reports, but haven't seen anyone talking about this. Surprisingly it seems to be a really obscure issue. FYI I'm working with ZF 1.11.11.
Update: In case anyone wants to follow the ZF issue I opened about this, here it is: Zend_Mail::addHeader() UNfolds long headers, then throws exception

You're probably running into a few things. Per RFC 2821, text lines in SMTP can't exceed 1000 characters:
text line
The maximum total length of a text line including the <CRLF> is
1000 characters (not counting the leading dot duplicated for
transparency). This number may be increased by the use of SMTP
Service Extensions.
A header can't contain bare newlines, so that's probably why Zend is stripping them. For long headers, it's common to insert a line break (CRLF in SMTP) followed by a tab or space to "wrap" them.
Excerpt from RFC 822:
Each header field can be viewed as a single, logical line of
ASCII characters, comprising a field-name and a field-body.
For convenience, the field-body portion of this conceptual
entity can be split into a multiple-line representation; this
is called "folding". The general rule is that wherever there
may be linear-white-space (NOT simply LWSP-chars), a CRLF
immediately followed by AT LEAST one LWSP-char may instead be
inserted.
I would say that the _encodeHeader() function should possibly look at line length, and if the header is longer than some magic value, do the "wrap and tab" to have it span multiple lines.
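As a rough illustration of the "wrap" idea, here is a minimal sketch of folding and unfolding in plain PHP. foldHeaderValue() and unfoldHeaderValue() are hypothetical helpers, not part of Zend_Mail, and the sketch assumes the value contains spaces to break at.

```php
<?php
// Hypothetical helper, not part of Zend_Mail: fold a header value at spaces.
function foldHeaderValue(string $value, int $width = 76): string
{
    // wordwrap() replaces the space at each break point with the break string,
    // so "\r\n " yields CRLF followed by one space (a continuation line).
    return wordwrap($value, $width, "\r\n ", false);
}

// Reverse operation: drop every CRLF that is followed by a space or tab.
function unfoldHeaderValue(string $folded): string
{
    return preg_replace('/\r\n(?=[ \t])/', '', $folded);
}

// Made-up value: 50 words separated by single spaces.
$value  = implode(' ', array_fill(0, 50, 'parameter'));
$folded = foldHeaderValue($value);

echo $folded, "\r\n";
```

Continuation lines come out one character longer than $width because of the leading space, and a value without any whitespace (compact JSON, for instance) cannot be folded this way at all.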

Related

Google translate API - too many characters being sent - how to debug?

I am using this script (https://github.com/viniciusgava/google-translate-php-client) to pass translation of text to Google. The text is generated through a script I coded myself.
The problem is that, Google is saying I'm passing 1-3k more characters than I actually am. I have gone through and made sure the loops are tight, commented out any sections that could possibly be leaking to test, but no matter what, it's passing too many characters.
Also, I tested doing this:
$text = (strlen($string) > 2000) ? substr($string, 0, 2000) . '...' : $string;
Even so, Google is saying I sent 3,300 characters.
My question is, how can I debug what's actually being sent to Google so I can identify the "leak"? How can I retrieve a list of everything that has been sent?
Note: The script I use for character counting (http://www.javascriptkit.com/script/script2/charcount.shtml) already includes whitespaces, html and punctuation when counting the amount of characters.
Thank you for the help.
Update:
I ran a test completely separate from my script, I input 550 words in raw $string = "words"; format... ran the translation, and Google is reporting 1000! So the same thing is happening... I'm wondering if it's something to do with the script itself.
Update 2:
I ran a test using the official GTranslate API script, with 60 characters, Google recorded it as 100 characters. So something is happening that's doubling my characters, even on the official script.

Escaping DOI links in php - when esc_url() is not enough

I am writing php code that generates html that contains links to documents via their DOI. The links should point to https://doi.org/ followed by the DOI of the document.
As the result is a URL, I thought I could simply use the esc_url() function (which comes from WordPress, not PHP itself) like in
echo '<a href="' . esc_url('https://doi.org/' . $doi) . '">' . esc_url('https://doi.org/' . $doi) . '</a>';
as this is what one is supposed to use in text nodes, attribute nodes or anywhere else. Unfortunately things apparently aren't that easy...
The problem is that DOIs can contain all sorts of special characters that are apparently not handled correctly by esc_url(). A nice example of such a DOI is
10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P
which is supposed to link to
https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P
With $doi equal to this DOI the above code however produces a link that is displayed and links to https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5493::AID-PROP4933.0.CO;2-P.
This leads me to the question: If esc_url() is obviously not a one-size-fits-all, no-brainer solution to escaping URLs, then what should I use? For this case I can get the result I want with
esc_url(htmlspecialchars('https://doi.org/' . $doi))
but is this really the right way™ of doing it? Does this have any other unwanted side effects? If not, then why does esc_url() not also escape < and >? Would esc_html() be better than htmlspecialchars()? If so, should I nest it into an esc_url()?
I am aware that there are many articles on escaping urls in php on stackoverflow, but I couldn't find one that addresses the issues of < and > signs.

I'm no PHP expert, but I do know about DOIs, and SICIs can be really annoying.
URL-encoding and HTML encoding are separate things, so it makes sense to think about them separately. You must escape the angle-brackets to make correct HTML. As for the URL-escaping, you should also do this because there are other characters that might break URLs (such as the # character, which also pops up from time to time).
So I would recommend:
'https://doi.org/' . htmlspecialchars(urlencode($doi))
Which will give you:
Click here
Note the order of function application, and the fact that you don't want to encode the https://doi.org resolver!
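Put together, the two-step escaping might look like the sketch below. doiLink() is a hypothetical helper, and it swaps in rawurlencode() for urlencode() (the two differ mainly in how spaces are encoded, which does not matter for this DOI).

```php
<?php
// Hypothetical helper: build an HTML link to the doi.org resolver.
function doiLink(string $doi, string $label): string
{
    // Step 1: percent-encode the DOI only, never the resolver prefix.
    $url = 'https://doi.org/' . rawurlencode($doi);

    // Step 2: HTML-escape whatever goes into the attribute and the text node.
    return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
         . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . '</a>';
}

$doi  = '10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P';
$link = doiLink($doi, 'Click here');

echo $link, "\n";
```

After percent-encoding, the angle brackets appear as %3C and %3E, so nothing is left in the URL for the HTML layer to misread.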
To the above "dipshit decision" comment... it's certainly inconvenient. But SICIs were around before DOIs and it's one of those annoying things we've had to live with ever since!

Combine strpos with regex

Here's the skinny. I'm using imap to get into Gmail, and using that for database entries. I'm having quite the headache of it too. The main issue is that I'm getting all sorts of random '='s in my body. I can get around most of it, but the headache comes from primarily one source. I'm isolating out JUST the reply, and the email body is similar to this.
<div dir="ltr">Quick Quick! He's Drowning!!!!!!</div><div class="gm=
ail_extra"><br clear="all"><div>Thank you<div>Daniel Jenkins</div><div>Te=
chnical Assistant</div><div><a href="[url]" target="_blank">=
[work]</a><br>
</div><div>[phone number here]</div></div>
<br><br>
Now, I don't need the email signature, I just need the part before it. What I'm trying to do is strpos the <div class="gmail_extra line, but it's a moving target because of the =. It's been after the a, the l, the g, etc. Is there a way to strpos(<div class=g[=]?m[=]?a[=]?i[=]?l[=]?)?

The ending = is a soft return / newline in quoted-printable encoding, just use:
$string = quoted_printable_decode($string);
... which will also take care of other unexpected differences between encoded body & actual content. After that you should have nice predictable HTML (which you would run through a parser rather than trying to split it with a regex).
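A minimal sketch of that approach follows. The $raw body is a made-up sample in quoted-printable form (not the asker's real message), and it still uses strpos() as in the question just to show that the marker stops moving once the body is decoded.

```php
<?php
// Made-up quoted-printable body: "=3D" encodes "=", and a "=" at the end
// of a line is a soft line break that the decoder removes.
$raw = "<div dir=3D\"ltr\">Quick Quick! He's Drowning!</div><div class=3D\"gm=\r\n"
     . "ail_extra\"><br clear=3D\"all\"><div>Thank you</div></div>";

$html = quoted_printable_decode($raw);

// With the soft breaks gone, the marker is a fixed string again.
$marker = '<div class="gmail_extra"';
$pos    = strpos($html, $marker);
$reply  = ($pos === false) ? $html : substr($html, 0, $pos);

echo $reply, "\n";
```

For anything beyond a quick cut at a known marker, an HTML parser is the sturdier choice, as the answer says.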

HTML safe wrapping of long lines

I'm having problems sending HTML emails with long lines of text. The WYSIWYG editor (FCKEditor 2.5) used on the site keeps removing all the \n characters on certain browsers, including IE and Chrome. The result is an email with a single huge line of text. This wouldn't be a problem if it wasn't for email clients that wrap lines of over 998 characters by inserting ! \n in it. Of course, these almost always end up in the most unfortunate places, breaking HTML tags and looking nasty in the content itself.
My initial solution was to add a line feed after every HTML tag or every 900 to 990 characters. This is the regex I ended up with:
return preg_replace("/(<\/[^\>]+>|<[^\>]+\/>|>[^<]{900,990}\s)(\n)*/","$1\n",$str);
However, when there are lines that don't contain any tags at all, the whitespace matching part is never triggered. But if I remove the > from its beginning, it starts breaking tags.
Is there a better way than regex to do this, or can this regex be healed?
EDIT: The 1000 character line length limit is defined in RFC 821.

Following my comment, I'm posting this as I have been able to run a test.
tidy::repairString should do the job just fine, better than any regex solution.
$content = "<html>......</html>";
$oTidy = new tidy();
$content = $oTidy->repairString($content,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);
Adapt the Charset parameter (3rd) to your needs.
The clean option is unneeded for this, I was wrong in my comment.

If I understand everything correctly, you don't need to concern yourself with lines that don't contain HTML at all - these can be left to be handled by email clients.

XML parser error: entity not defined

I have searched stackoverflow on this problem and did find a few topics, but I feel like there isn't really a solid answer for me on this.
I have a form that users submit and the field's value is stored in a XML file. The XML is set to be encoded with UTF-8.
Every now and then a user will copy/paste text from somewhere and that's when I get the "entity not defined" error.
I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
From what I gather, there's a few options I've seen:
I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
I can place the code in question within a CDATA section.
I can include these entities within the XML file.
What I'm doing with the XML file is that the user can enter content into a form, it gets stored in a XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).
Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?
Thanks,
Ryan
UPDATE
I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!
Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. It turns out, while taking a closer look, that the PHP warnings always referenced data from the TinyMCE enhanced textareas. Later I noticed on a PC that all the characters were taken out (because it couldn't read them), but on a Mac you could see little square boxes referencing the unicode number of that character. The reason it showed up in squares on a Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF to prevent other parsing errors (which is somehow also related to TinyMCE).
The solution to all this was quite simple:
I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.
I guess the only thing I don't understand is why the characters still show up when placed in textboxes, because nothing converts them to UTF, but with TinyMCE it was a problem.

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:
Before passing the html-fragment to SimpleXMLElement constructor I decoded it by using html_entity_decode.
Then further encoded it using utf8_encode().
$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>';
$xmlHeader = new SimpleXMLElement($headerDoc);
Now the above code does not throw any undefined entity errors.

You could HTML-parse the text and have it re-escaped with the respective numeric entities only (like: &nbsp; → &#160;). In any case, simply using un-sanitized user input is a bad idea.
All of the numeric entities are allowed in XML, only the named ones known from HTML do not work (with the exception of &amp;, &quot;, &lt;, &gt;, &apos;).
Most of the time though, you can just write the actual character (&ouml; → ö) to the XML file so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!) this is your safest bet.
Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.
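The "write the actual character" route can be sketched in a few lines of PHP. toXmlSafeText() is a hypothetical helper name, and the sketch assumes the input and the XML file are both UTF-8.

```php
<?php
// Hypothetical helper: turn pasted HTML-ish text into XML-safe character data.
function toXmlSafeText(string $userInput): string
{
    // Decode named HTML entities (&nbsp;, &ouml;, ...) into real UTF-8 characters.
    $decoded = html_entity_decode($userInput, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Re-escape only the characters that XML itself treats as markup.
    return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$safe = toXmlSafeText('Gr&ouml;&szlig;e&nbsp;&lt; 5 &amp; more');

echo '<temp>' . $safe . '</temp>', "\n";
```

The result contains only entity references that are predefined in XML, so a parser such as SimpleXML has no undefined entities left to complain about.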

1. I can find and replace all [&nbsp;?] and swap them out with [&#160;?] or an actual space.
This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.
2. I can place the code in question within a CDATA section.
In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
3. I can include these entities within the XML file.
You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.
One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.
5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent;
In other words, create a div somewhere (probably display=none, except for debugging). Say you then have a javascript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in javascript you do
myDiv.innerHTML = myField.value;
which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.
Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.
Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.

Use "htmlentities()" with flag "ENT_XML1": htmlentities($value, ENT_XML1);
If you use "SimpleXMLElement" class:
$SimpleXMLElement->addChild($name, htmlentities($value, ENT_XML1));

If you want to convert all characters, this may help you (I wrote it a while back) :
http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml
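// Requires the mbstring extension (mb_ord(), PHP 7.2+).
function _convertAlphaEntitysToNumericEntitys($entity) {
    // Decode to UTF-8; unknown entity names come back unchanged, so leave those alone.
    $char = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    if ($char === $entity[0])
        return $entity[0];
    // ord() only sees the first byte of a multi-byte character; mb_ord() returns the code point.
    return '&#'.mb_ord($char, 'UTF-8').';';
}
$content = preg_replace_callback(
    '/&([\w\d]+);/i',
    '_convertAlphaEntitysToNumericEntitys',
    $content);
function _convertAsciOver127toNumericEntitys($entity) {
    if(($asciCode = mb_ord($entity[0], 'UTF-8')) > 127)
        return '&#'.$asciCode.';';
    else
        return $entity[0];
}
// The /u modifier makes the pattern match whole UTF-8 characters instead of single bytes.
$content = preg_replace_callback(
    '/[^\x00-\x7F]/u',
    '_convertAsciOver127toNumericEntitys', $content);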

This question is a general problem for any language that parses XML or JSON (so, basically, every language).
The above answers are for PHP, but a Perl solution would be as easy as...
use HTML::Entities;
my $excluderegex =
'^\n\x20-\x20' . # Don't Encode Spaces
'\x30-\x39' . # Don't Encode Numbers
'\x41-\x5a' . # Don't Encode Capitalized Letters
'\x61-\x7a' ; # Don't Encode Lowercase Letters
# in case anything is already encoded
$value = HTML::Entities::decode_entities($value);
# encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);
