I'm building a PHP email mailbox script.
How would I make HTML emails display cleanly, as they do in Gmail/Hotmail?
If I just echo it out it affects the whole page layout.
I could use iframes but surely that isn't the best solution.
If you are looking for the 'best solution', get on board with an existing open source email project that is already doing what you are trying to do. Maintaining your own email renderer that is safe against script injection and other attacks will simply be too much work for one person.
One example: https://github.com/afterlogic/webmail-lite
Another: http://trac.roundcube.net/
You get the benefit of other developers who use the library maintaining the code base, so if something is broken, all you have to do is pull the latest update (hopefully) and you get the fix. If you find something that needs improving, you can fix it or build it, and make the code better for everyone. I'm really just pitching open source libraries here, but in any commercial context, building your own email renderer without a big team is a bad idea.
As Marc B stated, I believe an iframe would be your best bet... but please realize that if you just dump any email HTML code you risk exposing yourself to viruses, Trojans, and malicious HTML/JavaScript code. You're opening Pandora's box on your computer unless you find a good way to sandbox/strip that HTML.
Here's a simple Regex to clean JavaScript at least :
"(?s)<script.*?(/>|</script>)"
Consider using an HTML Tidy library (e.g. PHP's Tidy extension).
You can pass the text through the library to get well formatted html.
A good practice would be to define a CSS standard behaviour for most tags in the div you're using.
Create a DIV container that you assign width (and height if needed) to, and make sure you add an overflow property to match your design. This should keep your email HTML from interfering with your layout.
UPDATE
A DIV container still assures you that you can constrain the size of the display box and with appropriate CSS acts similar to an iframe without all the baggage.
If you are worried about the code in the email, strip_tags would seem a better solution than the regex. You can define a list of tags to leave alone and still be confident of stripping the rest.
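To make the strip_tags suggestion concrete, here is a minimal sketch. The helper name keepBasicTags() and the list of allowed tags are assumptions for illustration only. Note that strip_tags leaves the attributes of allowed tags untouched and keeps the text that was inside removed tags, so it is not a complete XSS defence on its own.

```php
<?php
// Sketch only: keepBasicTags() and its tag list are hypothetical choices.
function keepBasicTags(string $html): string
{
    // Tags named in the second argument survive; every other tag is removed.
    // Attributes on surviving tags (including onclick etc.) are NOT removed.
    return strip_tags($html, '<p><b><i><a><br>');
}

$dirty = '<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>';
echo keepBasicTags($dirty);
```

In this example the script tags disappear, but the onclick attribute on the paragraph is still present in the result, which is why the other answers recommend a real sanitizer or a sandbox on top of this.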
Related
I'm currently working on an email template building website in PHP (LAMP to be specific) that allows users to paste in their HTML email code and then send it off to their customers.
Obviously, when handling this kind of data I need to implement some kind of XSS security. I've scoured the net for weeks trying to find solutions to this and found very few good methods, and they don't really work for full HTML documents (which is what I'd be dealing with).
These are the solutions I found and why they don't work for me:
HTMLPurifier:
I think this is the obvious choice for most because it's got the best security and is up to date with industry standards. Although its main use is supposed to be for HTML fragments/small snippets, I thought I'd give it a go.
The first issue I ran into was that the head tags (and anything inside them) were getting stripped. The head is quite essential in HTML emails so I had to find a way around this... unfortunately, the only fix I could find was to separate the head from the rest of the email and run each part separately through HTMLPurifier.
I've yet to try this because it seems very hacky, but it seems to be the only way to achieve what I'm after. I'm also not sure how good HTMLPurifier is at finding XSS in CSS. On top of all that, it doesn't do well in terms of performance, being such a large library.
HTMLawed:
HTMLawed seemed to be another great option but a few things swayed me from using it.
A) Compared to HTMLPurifier, this seems to be less secure. HTMLawed has several documented security issues at the moment. It's also not widely used yet which is more worrying (only used by about 10 registered companies).
B) It's released under the GPL/LGPL license, which effectively means I can't use it on my website unless I'm willing to let people use my service for free.
C) From what I've seen of people talking about it, it seems to strip a lot of tags unless it's heavily configured. I can't have much say here because I've not tried it but that also raises security concerns for me - what if I miss something? what if I can't configure it to keep the elements I want? etc.
These are my questions to you:
Are there any better alternatives to the ones listed above?
Is it possible to code this myself or is that too ambitious and too insecure?
How do the larger email companies tackle this issue (mailchimp, activecampaign, sendinblue, etc.)?
It seems you are sending HTML content, so you cannot simply filter it: you have to store the HTML in your database as it is. If you run it through an aggressive XSS filter, the HTML will no longer work properly. Webmail services such as Gmail, Yahoo and Roundcube all disable JavaScript by default.
If you are using a WYSIWYG editor like CKEditor, it automatically removes all <script> tags and certain unknown attributes. You can still configure what it accepts and what it removes via CKEditor.config().
If your PHP code cannot insert the HTML into your database because of special characters, use SQL prepared statements, or encode the HTML input with base64_encode() and decode it when you use it in mail() or PHPMailer::Body.
I have a database of emails I've collected from our gmail account which I'm trying to render out to an internal page.
This is working, however the occasional email comes in that causes problems because of missing/not closed tags. There might be some CSS thrown in there that I don't want rendered on the whole page.
I could use iFrames, but they seem outdated, and just not the right approach.
What would the suggested method be to render blocks of HTML from the database without them affecting the rest of the page?
Firstly, you need to load and interpret that HTML into something not broken. To do this, you use a DOM parser. http://php.net/manual/en/domdocument.loadhtml.php
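As a rough illustration of that first step, the sketch below lets DOMDocument repair unclosed tags and then returns only the contents of the body. The helper name repairHtml() is an assumption. Also be aware that loadHTML treats input as ISO-8859-1 unless the markup declares a charset, so real emails may need extra encoding handling that is not shown here.

```php
<?php
// Sketch only: repairHtml() is a hypothetical helper, not part of PHP.
function repairHtml(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    // Real-world email markup triggers many parser warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser wraps fragments in <html><body>; emit only what is inside <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo repairHtml('<p>Unclosed <b>bold');
```

This only makes the markup well formed. It does not remove scripts or styles, so it is a first step before filtering, not a replacement for it.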
I could use iFrames, but they seem outdated, and just not the right approach.
No, this is wrong. Until we have good shadow DOM support (and likely even after), iframes are the right way to isolate something in its own context. Make sure you use the sandbox attribute.
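One possible way to do that from PHP is to put the email markup into the iframe's srcdoc attribute and leave the sandbox attribute empty, which applies all sandbox restrictions (including blocking scripts). The helper name sandboxedFrame() and the inline style are assumptions for this sketch; you could just as well point src at a separate URL that serves the message.

```php
<?php
// Sketch only: sandboxedFrame() is a hypothetical helper name.
function sandboxedFrame(string $emailHtml): string
{
    // Escape the markup so it cannot break out of the attribute value.
    $srcdoc = htmlspecialchars($emailHtml, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // An empty sandbox attribute enables every restriction.
    return '<iframe sandbox="" srcdoc="' . $srcdoc . '"'
        . ' style="width:100%;border:0"></iframe>';
}

echo sandboxedFrame('<p>Hello</p><script>alert(1)</script>');
```

The email's CSS stays inside the frame, so it cannot touch the surrounding page layout. Filtering the HTML first is still advisable.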
Note that you could do this without iframes, but it's going to be a lot more work.
But curious how something like 'Google' can render it without affecting the whole page.
Google doesn't just accept anything that comes through that e-mail, and neither should you, even if you use the iframe method. You need to make a whitelist of what elements you will support, and filter out everything outside of that. Next, you need to figure out what CSS properties you will support. Finally, you need to transform that whole DOM document into something useful and output it as HTML. Check out HTML Purifier for your whitelisting.
None of this is an easy task. You're stepping into an awful lot of hassle. There is no real standard for HTML e-mail. Each provider and mail client has a different set of what they support, and with varying results.
For the CSS part, you could, for example, give the div that holds the email content an id and then prepend that id as an ancestor to all of the email's CSS selectors.
You may also check that the tags are closed correctly. Both of these ideas take some processing power, but not much, and you can avoid recalculating by preprocessing the email once and storing the result.
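Here is a very rough sketch of the selector-prefixing idea. The function name scopeCss() and the #email-body id are assumptions, and the regex only copes with plain rule sets: it will mangle @media blocks, comments containing braces, and selectors such as html or body, so treat it as an illustration rather than something to ship.

```php
<?php
// Sketch only: scopeCss() is a hypothetical helper for simple, flat CSS.
function scopeCss(string $css, string $scope): string
{
    // Capture everything before each "{" (the selector list) and prefix
    // every comma-separated selector with the scoping id.
    $result = preg_replace_callback(
        '~([^{}]+)\{~',
        function (array $m) use ($scope): string {
            $selectors = array_map('trim', explode(',', $m[1]));
            $scoped = array_map(
                function (string $s) use ($scope): string {
                    return $scope . ' ' . $s;
                },
                $selectors
            );
            return implode(', ', $scoped) . ' {';
        },
        $css
    );
    return $result ?? '';
}

echo scopeCss('p, .title { color: red; } a:hover { color: blue; }', '#email-body');
```

With the email wrapped in a div carrying that id, the prefixed rules only match elements inside the email.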
iFrames I wouldn't say are dated and I would consider them a valid output in your case.
But curious how something like 'Google' can render it without affecting the whole page.
If you inspect, they use iFrames.
100% agree with @brad on running it through a parser and then outputting it to an iframe.
I have a problem with PHP and JavaScript/CSS.
I have a database table that holds descriptions of articles, and I want to echo those descriptions. Unfortunately, many of them include JavaScript or CSS (followed by the article text), so when I use echo it shows all of that code and then the text. Is there any way to hide the JavaScript/CSS part and show only the text? For example with str_replace and a regular expression? If so, can somebody show me what it should look like?
Thanks for the help, and let me know if you need more info (code etc.).
Use HTMLPurifier - it will remove the scripts, CSS and any harmful content from your articles. Since this is a CPU-intensive operation, it's better to run each article through HTMLPurifier before saving it in the database than to run it each time you show the article.
If you're trying to remove tags from a user's post, you can call strip_tags. This will get rid of css links, script tags, etc. It will not get rid of the style attribute, but if you get rid of div, span, p, etc. that won't matter -- there will be no tag for it to reside on.
As has been stated by others, it is generally best to sanitize your input (data from user before it goes into the DB), than it is to sanitize your output.
If you're simply trying to hide the JS and CSS from users, you can use Packer with base 62 encoding to obfuscate the JavaScript from less-savvy users. The JS will still work but will look like gibberish. Be aware that more knowledgeable users can attempt to de-obfuscate the code, so any critical security risks in the JS still exist. Don't think any JS that accesses your databases directly will be safe; instead remove database access from the JavaScript for security. If the JS is just there to do fancy things like move elements around the page, it's probably fine to just obfuscate it.
Only consider this if YOU have complete control and awareness of all JS included with the articles. If this is something your anonymous or otherwise not 120% trusted users can upload, you need to kill that functionality and use HTML Purifier to remove any JS they might add. It is not safe to output user-entered JS, for you or your users.
For the CSS, I'm not sure why you want to hide it, and CSS can't be obfuscated quite like JS can; the styles will still be in plain English, best you can do is butcher the class/id names and whitespace; outputting CSS that YOU generated isn't a real security risk though, and even if people reverse engineer it I wouldn't be that afraid.
Again, if this is something anonymous/non trusted users can ADD to your site on their own, you don't want this at all, so remove the ability to upload CSS with an article using the HTML Purifier Darhazer mentioned.
You can try the following regex to remove the script and css:
"<script[\d\D]*?>[\d\D]*?</script>"
"<style[\d\D]*?>[\d\D]*?</style>"
It should help, but it cannot remove all scripts, for example inline handlers like onclick="javascript:alert(1)".
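To show how those two patterns could be applied in PHP for the "show only the text" case, here is a small sketch. The function name articleText() is an assumption. It removes script and style blocks together with their contents and then calls strip_tags, so the inline-handler problem disappears only because all remaining tags are dropped as well.

```php
<?php
// Sketch only: articleText() is a hypothetical helper name.
function articleText(string $html): string
{
    // Remove <script>...</script> and <style>...</style> including their bodies.
    $html = preg_replace('~<script[\d\D]*?>[\d\D]*?</script>~i', '', $html) ?? '';
    $html = preg_replace('~<style[\d\D]*?>[\d\D]*?</style>~i', '', $html) ?? '';

    // Drop every remaining tag, leaving plain text.
    return trim(strip_tags($html));
}

$row = '<style>p{color:red}</style><script>var a = 1;</script>'
     . '<p onclick="x()">Article text</p>';
echo articleText($row); // prints "Article text"
```

If you need to keep some markup instead of plain text, the regexes alone are not enough and a whitelist-based filter is the safer route.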
I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.
I found a method built in Python and I was wondering if there is anything like this in PHP.
The concept is rather simple: use
information about the density of text
vs. HTML code to work out if a line of
text is worth outputting. (This isn’t
a novel idea, but it works!) The basic
process works as follows:
Parse the HTML code and keep track of the number of bytes processed.
Store the text output on a per-line, or per-paragraph basis.
Associate with each text line the number of bytes of HTML required to
describe it.
Compute the text density of each line by calculating the ratio of text
to bytes.
Then decide if the line is part of the content by using a neural network.
You can get pretty good results just
by checking if the line’s density is
above a fixed threshold (or the
average), but the system makes fewer
mistakes if you use machine learning -
not to mention that it’s easier to
implement!
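A minimal PHP sketch of the fixed-threshold variant of the process quoted above might look like the following. It is a simplification, not a port of the Python code: it treats each physical line of the source as a unit, uses strip_tags instead of a real parser, and skips the neural network entirely. The function name denseLines() and the 0.5 threshold are assumptions.

```php
<?php
// Sketch only: denseLines() and the default threshold are hypothetical.
function denseLines(string $html, float $threshold = 0.5): array
{
    $kept = [];
    foreach (preg_split('/\R/', $html) as $line) {
        $totalBytes = strlen($line);
        if ($totalBytes === 0) {
            continue;
        }
        // Text bytes are what remains once the markup is stripped.
        $text = trim(strip_tags($line));
        $density = strlen($text) / $totalBytes;

        // Keep lines where text dominates over markup.
        if ($text !== '' && $density >= $threshold) {
            $kept[] = $text;
        }
    }
    return $kept;
}

$page = "<div id=\"nav\"><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>\n"
      . "<p>This paragraph is mostly text, so its density is high.</p>";
print_r(denseLines($page));
```

Link-heavy navigation lines score low because most of their bytes are markup, while paragraph lines score high. Minified pages that put everything on one line would need a per-paragraph split instead.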
Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
UPDATE 2
DEMO: http://so.lucafilosofi.com/find-important-text-in-arbitrary-html-using-php/
Tested on a random list of blogs taken from the Technorati Top 100 and Best Blogs of 2010.
many blogs make use of a CMS;
the blogs' HTML structure is the same most of the time.
avoid common selectors like #sidebar, #header, #footer, #comments, etc.
avoid any widget by tag name: script, iframe
clear well-known content like:
/\d+\scomment(?:[s])/im
/(read the rest|read more).*/im
/(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
/[^a-z0-9]+/im
search for well-known classes and ids:
typepad.com .entry-content
wordpress.org .post-entry .entry .post
movabletype.com .post
blogger.com .post-body .entry-content
drupal.com .content
tumblr.com .post
squarespace.com .journal-entry-text
expressionengine.com .entry
gawker.com .post-body
Ref: The blog platforms of choice among the top 100 blogs
$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
// find() expects a selector string, so join the list with commas
$doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(',', $selectors))->children('p,div');
search based on a common HTML structure that looks like this:
<div>
<h1|h2|h3|h4|a />
<p|div />
</div>
$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');
DOMDocument can be used to parse HTML documents, which can then be queried through PHP.
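For instance, a query for a well-known content class could be sketched with DOMDocument and DOMXPath as below. The function name textByClass() is an assumption, and the class name is interpolated into the XPath expression, so it is assumed to come from your own code and not from user input.

```php
<?php
// Sketch only: textByClass() is a hypothetical helper name.
function textByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the space-separated class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$page = '<div id="header">Site name</div>'
      . '<div class="post entry-content"><p>Main article text.</p></div>'
      . '<div id="footer">Copyright</div>';
print_r(textByClass($page, 'entry-content'));
```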
Edit: wikied
I worked on a similar project a while back. It's not as complex as the Python script but it will do a good job. Check out the Simple HTML PHP Parser
http://simplehtmldom.sourceforge.net/
Depending on your HTML structure and if you have id's or classes in place you can get a little complicated and use preg_match() to specifically get any information between a certain start and end tag. This means that you should know how to write regular expressions.
You can also look into a browser emulation PHP class. I've done this for page scraping and it works well enough depending on how well formatted the DOM is. I personally like SimpleBrowser
http://www.simpletest.org/api/SimpleTest/WebTester/SimpleBrowser.html
I have developed an HTML parser and filter PHP package that can be used for that purpose.
It consists of a set of classes that can be chained together to perform a series of parsing, filtering and transformation operations in HTML/XML code.
It was meant to deal with real-world pages, so it can cope with malformed tags and data structures and preserve as much of the original document as possible.
One of the filter classes it comes with can do DTD validation. Another can discard insecure HTML tags and CSS to prevent XSS attacks. Another can simply extract all document links.
All those filter classes are optional. You can chain them together the way you want, if you need any at all.
So, to solve your problem, I do not think there is already a specific solution for that in PHP anywhere, but a special filter class could be developed for it. Take a look at the package. It is thoroughly documented.
If you need help, just check my profile and mail me, and I may even develop a filter that does exactly what you need, possibly inspired by solutions that exist for other languages.
Tumblr and other blogging websites allow people to post embed codes for videos from YouTube and other video networks.
But how do they keep only the Flash object code and remove any other HTML or scripts? They even have automated checks that inform you when something is not a valid video code.
Is this done using regular expressions? And is there a PHP class to do that?
Thanks
Generally speaking, using regex is not a good way to deal with HTML : HTML is not regular enough for regular expressions : there are too many variations permitted in the standards... And browsers even accept HTML that's not valid !
In PHP, as your question is tagged as php, a great solution that exists to filter user input is the HTMLPurifier tool.
A couple of interesting things are :
It allows you to specify which specific tags are allowed
For each tag, you can define which specific attributes are allowed
Basically, the idea is to only keep what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).
And if you only specify a list of tags and attributes that can do no harm, only those will be kept -- and the risks of injections are lowered a lot.
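To illustrate the white-list principle itself (this is not HTML Purifier's API and not a substitute for it), here is a small DOM-based sketch. The function name allowlistFilter(), the shape of the $allowed array, and the list of accepted link schemes are all assumptions. Elements that are not listed are removed together with their contents, and attributes that are not listed for a tag are dropped.

```php
<?php
// Sketch only: allowlistFilter() is a hypothetical helper, far less thorough
// than a real sanitizer. $allowed maps tag name => list of allowed attributes.
function allowlistFilter(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Our wrapper is the first div in document order.
    $root = $doc->getElementsByTagName('div')->item(0);
    if ($root === null) {
        return '';
    }
    // Snapshot the live node list before modifying the tree.
    foreach (iterator_to_array($root->getElementsByTagName('*'), false) as $el) {
        $tag = strtolower($el->nodeName);
        if (!isset($allowed[$tag])) {
            if ($el->parentNode !== null) {
                $el->parentNode->removeChild($el); // drop the tag and its contents
            }
            continue;
        }
        foreach (iterator_to_array($el->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            $value = trim((string) $attr->nodeValue);
            $badLink = $name === 'href'
                && !preg_match('~^(https?:|mailto:)~i', $value);
            if (!in_array($name, $allowed[$tag], true) || $badLink) {
                $el->removeAttribute($attr->nodeName);
            }
        }
    }
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$dirty = '<p onclick="x()">Hi <a href="javascript:alert(1)" title="t">link</a>'
       . '<script>alert(1)</script></p>';
echo allowlistFilter($dirty, ['p' => [], 'a' => ['href', 'title']]);
```

Anything the list does not mention is gone by default, which is the property a black-list can never give you.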
Quoting HTMLPurifier's home page :
HTML Purifier is a standards-compliant
HTML filter library written in PHP.
HTML Purifier will not only remove
all malicious code (better known as
XSS) with a thoroughly audited,
secure yet permissive whitelist, it
will also make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Yes, another great thing is that the code you get as output is valid.
Of course, this will only allow you to clean / filter / purify the HTML input ; it will not allow you to validate that the URL used by the user is both :
correct ; i.e. points to a real content
"OK" as defined by your website ; i.e. for example no nudity, ...
About the second point, there's not much one can do about it : the best solution will be to either :
Have a moderator accept / reject the contents before they're put online
Give the website's users a way to flag some content as inappropriate, so a moderator takes actions.
Basically, to check the content itself of the video, there is not much choice but have a human being say "ok" or "not ok".
About the first point, though, there's hope : some services that host content have APIs that you might want / be able to use.
For instance, Youtube provides an API -- see Developer's Guide: PHP.
In your case, the "Retrieving a specific video entry" section looks promising : if you send an HTTP request to a URL that looks like this :
http://gdata.youtube.com/feeds/api/videos/videoID
(Replacing "videoID" by the ID of the video, of course)
You'll get some ATOM feed if the video is valid ; and "Invalid id" if it's not
This might help you validate at least some URL to contents -- even if you'll have to develop some specific code for each possible content-hosting service that your users like...
Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)
The best solution to extract a portion of data from an HTML string is generally to :
Load the HTML using a DOM parser ; DOMDocument::loadHTML is generally pretty helpful, here
Go through the document using DOM methods ; either, depending on your situation :
DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name ; might be great to iterate over all <object> or <embed> tags, for instance
Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
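As a sketch of those steps, the function below loads the pasted embed code with DOMDocument::loadHTML, walks the embed tags with getElementsByTagName, and pulls an identifier out of the src URL. The function name extractYoutubeId(), the accepted host names, and the assumption that the ID is 11 characters after /v/ are illustrative guesses about the old Flash embed format, not something taken from YouTube's documentation.

```php
<?php
// Sketch only: extractYoutubeId() and its URL assumptions are hypothetical.
function extractYoutubeId(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('embed') as $embed) {
        $src  = $embed->getAttribute('src');
        $host = parse_url($src, PHP_URL_HOST);
        $path = (string) parse_url($src, PHP_URL_PATH);

        // Only accept known hosts, then read the ID from a /v/<id> path.
        if (in_array($host, ['www.youtube.com', 'youtube.com'], true)
            && preg_match('~^/v/([A-Za-z0-9_-]{11})~', $path, $m)) {
            return $m[1];
        }
    }
    return null;
}

$pasted = '<object width="425" height="344">'
        . '<param name="movie" value="http://www.youtube.com/v/dQw4w9WgXcQ&amp;hl=en">'
        . '<embed src="http://www.youtube.com/v/dQw4w9WgXcQ&amp;hl=en"'
        . ' type="application/x-shockwave-flash" width="425" height="344"></embed>'
        . '</object>';
var_dump(extractYoutubeId($pasted));
```

Once you have the ID, you can validate it against the provider and regenerate the embed markup yourself instead of echoing what the user pasted.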
Take a look at htmlpurifier to start.
http://htmlpurifier.org/
I have implemented an algorithm for this for the company I work for. It works just fine, BUT it was quite complicated to implement.
I would definitely check out HTMLPurifier to see if that works in an easy way for you. If you insist on doing it the old-school way like I did, these are the basic steps:
1. First off: get friendly with stripos().
2. You have to write a recursive function to identify the start and stop tags for the widget. That includes all combinations of <embed></embed> or <embed/> (self-closing) or <object></object> ... or <object><params>...<embed/></object>
3. After this, you have to parse out all attributes and params.
4. Now, all <object> tags should have <param> tags as child elements. You have to parse all of these to get all the data you need for finally generating a new embed or object tag. The params and attributes that hold width, height and data source are especially important.
5. You don't know if the attributes are enclosed in single or double quotes, so your code has to be lenient here. You also don't know if the code is valid or well formed, so it should be able to handle nested embed/object tags, embed tags that are not closed correctly, etc. As it is user-generated content, you can't really know or trust the input. You will see that there are lots of combinations.
6. If you manage to parse the embedded element with all its attributes (or the object element and its child params), the whitelisting of domains is easy...
My code ended up being about 800 lines, which is quite large, and it was filled with recursive methods for finding the correct start and end tags etc. My algorithm also removed all the SEO text that is often included in the cut-and-paste embed code, like links back to the site holding the widget.
It's a good exercise, but if I were you... don't start walking down this road.
Recommendation: try to find something ready-made and open source!
This will never be safe. Browsers have those funny little functionalities that help people display content of their pages even if html is messy. There are endless opportunities to get something through :)
check here to see the tip of the iceberg
What you need to do is use a single input for just the link, plus additional inputs for width and height, and filter those. THEN generate the object tag yourself.
This might be safe.
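A sketch of that approach could look like this. The function name buildEmbed(), the host list, the size limits, and the exact shape of the generated object tag are all assumptions. The point is only that the user supplies three simple values and the markup comes from your own template.

```php
<?php
// Sketch only: the host list and size limits below are hypothetical choices.
const ALLOWED_VIDEO_HOSTS = ['www.youtube.com', 'player.vimeo.com'];

function buildEmbed(string $url, $width, $height): ?string
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    $host   = parse_url($url, PHP_URL_HOST);

    // Reject anything that is not http(s) on a known video host.
    if (!in_array($scheme, ['http', 'https'], true)
        || !in_array($host, ALLOWED_VIDEO_HOSTS, true)) {
        return null;
    }

    // Force the dimensions into integers within a sane range.
    $w = max(100, min(1280, (int) $width));
    $h = max(100, min(720, (int) $height));

    return sprintf(
        '<object type="application/x-shockwave-flash" data="%s" width="%d" height="%d"></object>',
        htmlspecialchars($url, ENT_QUOTES, 'UTF-8'),
        $w,
        $h
    );
}

var_dump(buildEmbed('http://www.youtube.com/v/abc', '640', '9999'));
var_dump(buildEmbed('javascript:alert(1)', 640, 480));
```

A null result can be turned into the "this is not a valid video code" message for the user.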
http://php.net/manual/en/function.strip-tags.php
and allow certain tags?
The most simple and elegant solution: Allowing HTML and Preventing XSS # shiflett.org.
Using all sorts of "HTML purifier" is more than pointless. Sorry but I don't get people who like to use these bloated libraries when a much simpler solution is in hand.
If you're looking to make your site "safe" from vulnerabilities, a white list approach is the (only) way to go. I would recommend safely escaping all user generated content, and white listing only markup you know is safe and works on your site. This means not only <B> tags, but also the flash embeddings.
For example, if you want to allow any youtube to be embedded, write a validation RegEx that looks for the embed code they generate. Refuse to accept any others (or simply display it as escaped markup). This is testable. Forget all this parsing nonsense.
If you also want to add vimeo videos, then look at the embed code they provide and accept that as well.
Ugh? I know this seems like a pain, but in reality it's much easier to write than some algorithm that tries to detect "bad" content in some sort of generic fashion.
After getting the simple version of the algorithm working, you could go back and make it nicer. You could "provisionally" accept content with URLs, scripts, etc. that don't pass your white list, and have an admin process to add approved regexes to your output escaping routine. This way legitimate users aren't left out in the cold, but you don't open yourself up to attacks of this nature.