I have a database of emails I've collected from our gmail account which I'm trying to render out to an internal page.
This is working; however, the occasional email comes in that causes problems because of missing or unclosed tags. There might also be some CSS thrown in that I don't want applied to the whole page.
I could use iFrames, but they seem outdated, and just not the right approach.
What would the suggested method be to render blocks of HTML from the database, but without them affecting the rest of the page?
First, you need to load and parse that HTML into something that isn't broken. To do this, use a DOM parser: http://php.net/manual/en/domdocument.loadhtml.php
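For example, a minimal sketch of that first step, assuming the raw email body is already in a variable pulled from your table:

<?php
// Load the possibly-broken email HTML and let libxml repair unclosed tags.
$raw = $emailBodyFromDb;            // assumption: the HTML pulled from your database

libxml_use_internal_errors(true);   // silence warnings about the bad markup
$doc = new DOMDocument();
$doc->loadHTML($raw);               // the parser closes and normalises broken tags
libxml_clear_errors();

$clean = $doc->saveHTML();          // well-formed HTML you can now post-process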
I could use iFrames, but they seem outdated, and just not the right approach.
No, this is wrong. Until we have good shadow DOM support (and likely even after), iframes are the right way to isolate something in its own context. Make sure you use the sandbox attribute.
Note that you could do this without iframes, but it's going to be a lot more work.
But curious how something like 'Google' can render it without affecting the whole page.
Google doesn't just accept anything that comes through that e-mail, and neither should you, even if you use the iframe method. You need to make a whitelist of what elements you will support, and filter out everything outside of that. Next, you need to figure out what CSS properties you will support. Finally, you need to transform that whole DOM document into something useful and output it as HTML. Check out HTML Purifier for your whitelisting.
None of this is an easy task. You're stepping into an awful lot of hassle. There is no real standard for HTML e-mail. Each provider and mail client has a different set of what they support, and with varying results.
For the CSS part, you could, for example, add an id to the div you put the email content in and then add that id as the parent of all the CSS selectors.
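A naive sketch of that idea; the scopeCss() helper and the #email-content id are made up for illustration, and the regex misses @media blocks and only prefixes the first selector of a comma-separated list:

<?php
// Prefix every selector in the email's CSS with the wrapper div's id so the
// rules cannot leak onto the rest of the page.
function scopeCss($css, $wrapperId = '#email-content')   // hypothetical helper
{
    return preg_replace('/(^|\})\s*([^{}]+)\s*\{/', '$1 ' . $wrapperId . ' $2 {', $css);
}

// $emailCss and $emailHtml are assumed to be extracted from the stored message
echo '<style>' . scopeCss($emailCss) . '</style>';
echo '<div id="email-content">' . $emailHtml . '</div>';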
You can also check that the tags are closed correctly. Both of these ideas take some processing power, but not much, and you can avoid recalculating by preprocessing the email and storing the result.
I wouldn't say iframes are dated, and I would consider them a valid option in your case.
But curious how something like 'Google' can render it without affecting the whole page.
If you inspect, they use iFrames.
100% agree with #brad about running it through a parser, then outputting it to an iframe.
Related
I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add tags to the <head> section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic as I need to embed the output HTML into a new <div> in order to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like for the most part a lot of the formatting issues I keep running into are largely arbitrary. I've tried using PHP Tidy to clean up the HTML output, but that also only works for some pages and not many others. I've got a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, in your HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong, and, as you mentioned, styles as well. You're probably also changing the dimensions the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site. Then you can focus on what you need to do for your applet rather than a never ending list of fiddly html issues.
I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JS scripts in the webpage complicate things. I finally gave up on the idea of displaying the original format completely and did it with a workaround:
Select only the headers, links, lists, and paragraphs you are interested in.
Add the domain path of your own site to the links.
You can wrap the headers, links, etc. in your own class.
Display it
In your case you want to select text and store it, which is another topic. What I did is parse the HTML in two levels, and then it is easy to do the selection. Keep in mind that IE and Firefox/Chrome need to be dealt with separately.
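A rough sketch of that workaround with DOMDocument; the domain, the class name, and the list of tags are just example choices:

<?php
// $html is assumed to hold the page fetched with cURL.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// 2. add a domain path to the links (the exact rewrite scheme here is an assumption)
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if ($href !== '' && $href[0] === '/') {
        $link->setAttribute('href', 'http://www.your-own-site.example' . $href);
    }
}

// 1. + 3. keep only the interesting elements and wrap them in your own class
$output = '';
foreach (array('h1', 'h2', 'h3', 'p', 'ul') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $node->setAttribute('class', 'studyContent');   // hypothetical class name
        $output .= $doc->saveHTML($node);
    }
}

// 4. display it inside your own wrapper
echo '<div onclick="parent.selectionFunction()" id="studyContent">' . $output . '</div>';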
I'm building a PHP email mailbox script.
How would I make HTML emails display cleanly, as they do in Gmail/Hotmail?
If I just echo it out it affects the whole page layout.
I could use iframes but surely that isn't the best solution.
If you are looking for the 'best solution' get on board with another open source email library that is doing the same thing you are. Maintaining an email renderer on your own that is safe against script injection and other hacks will simply be too much work for one person.
One example: https://github.com/afterlogic/webmail-lite
Another: http://trac.roundcube.net/
You get the benefit of other developers who use the library maintaining the code base, so if something is broken, all you have to do is pull the latest update (hopefully) and you get the fix. If you find something that needs improving, you can fix it or build it, and make the code better for everyone. I'm really just pitching open source libraries here; however, in any commercial context, building your own email renderer without a big team is a bad idea.
As Marc B stated, I believe an iframe would be your best bet... but please realize that if you just dump in any email HTML code, you risk exposing yourself to viruses, Trojans, and malicious HTML/JavaScript code. You're opening Pandora's box on your computer unless you find a good way to sandbox/strip that HTML.
Here's a simple regex to at least clean out JavaScript:
"(?s)<script.*?(/>|</script>)"
Consider the use of an HTML Tidy library (e.g. PHP Tidy).
You can pass the text through the library to get well formatted html.
A good practice would be to define standard CSS behaviour for most tags in the div you're using.
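A small sketch using PHP's tidy extension, assuming it is installed; the config options shown are just one reasonable choice:

<?php
$config = array(
    'show-body-only' => true,   // emit just the fragment, no <html>/<head> wrapper
    'clean'          => true,   // tidy up presentational markup
    'output-xhtml'   => true,
);
// repair unclosed tags and return a well-formed fragment you can drop into your page
$fixed = tidy_repair_string($emailHtml, $config, 'utf8');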
Create a DIV container that you assign width (and height if needed) to, and make sure you add an overflow property to match your design. This should keep your email HTML from interfering with your layout.
UPDATE
A DIV container still assures you that you can constrain the size of the display box and with appropriate CSS acts similar to an iframe without all the baggage.
If you are worried about the code in the email, strip_tags would seem a better solution than the regex. You can define a list of tags to leave alone and still be confident of stripping the rest.
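A minimal strip_tags() sketch with a hypothetical whitelist; note that strip_tags() leaves attributes such as style and onclick on the tags it keeps, so it is not a full sanitiser on its own:

<?php
// keep a short list of harmless tags and drop everything else
$allowed = '<p><br><b><i><strong><em><a><ul><ol><li>';
$safe = strip_tags($emailHtml, $allowed);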
I have a problem with PHP and JavaScript/CSS.
I have a database with a table. The table has descriptions of articles. I want to echo the descriptions of the articles from the database. Unfortunately, many of them have JavaScript or CSS included (and then some article text), so when I use echo, it shows all of that code (and the text after it). Is there any way to not show the JavaScript/CSS part and show only the text? For example, with str_replace and a regular expression? If yes, can somebody show me how it should look?
Thanks for the help, and let me know if you need more info (code, etc.).
Use HTMLPurifier - it will remove the scripts, CSS, and any harmful content from your articles. Since it is a CPU-intensive operation, it's better to run the article through HTMLPurifier before saving it in the database than to run it each time you show the article.
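A minimal sketch of that, assuming the standalone HTMLPurifier.auto.php include (the include path depends on how you installed the library):

<?php
require_once 'HTMLPurifier.auto.php';

$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// purify once, at save time, and store the cleaned description in the database
$cleanDescription = $purifier->purify($rawDescription);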
If you're trying to remove tags from a user's post, you can call strip_tags. This will get rid of css links, script tags, etc. It will not get rid of the style attribute, but if you get rid of div, span, p, etc. that won't matter -- there will be no tag for it to reside on.
As has been stated by others, it is generally best to sanitize your input (data from user before it goes into the DB), than it is to sanitize your output.
If you're trying to simply hide the JS and CSS from users, you can use Packer with base 62 encoding to obfuscate the JavaScript from less-savvy users. The JS will still work but will look like gibberish. Be aware that more knowledgeable users can attempt to de-obfuscate the code, so any critical security risks in the JS still exist. Don't think any JS that accesses your databases directly will be safe; instead, remove database access from the JavaScript for security. If the JS just does fancy things like move elements around the page, it's probably fine to just obfuscate it.
Only consider this if YOU have complete control and awareness of all JS included with the articles. If this is something your anonymous or otherwise not-120%-trusted users can upload, you need to kill that functionality and use HTML Purifier to remove any JS they might add. It is not safe to output user-entered JS, for you or your users.
For the CSS, I'm not sure why you want to hide it, and CSS can't be obfuscated quite like JS can; the styles will still be in plain English; the best you can do is butcher the class/id names and whitespace. Outputting CSS that YOU generated isn't a real security risk, though, and even if people reverse engineer it I wouldn't be that afraid.
Again, if this is something anonymous/non-trusted users can ADD to your site on their own, you don't want this at all, so remove the ability to upload CSS with an article, using the HTML Purifier Darhazer mentioned.
You can try the following regexes to remove the script and CSS:
"<script[\d\D]*?>[\d\D]*?</script>"
"<style[\d\D]*?>[\d\D]*?</style>"
It should help, but it cannot remove all scripts, for example inline handlers like onclick="javascript:alert(1)".
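A sketch of applying those two patterns with preg_replace(), made case-insensitive so <SCRIPT> and <STYLE> are caught too:

<?php
$patterns = array(
    '~<script[\d\D]*?>[\d\D]*?</script>~i',
    '~<style[\d\D]*?>[\d\D]*?</style>~i',
);
// as noted above, inline handlers such as onclick will survive this
$text = preg_replace($patterns, '', $description);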
I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using PHP's explode() function to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), the scraper may collect wrong data. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data in my database, to avoid storing wrong data?
I don't think you have any clean solution if you are scraping a page whose content changes.
I have developed several Python scrapers and I know how frustrating it can be when a site makes just a subtle change to its layout.
You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you can isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the DB.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID or anything else you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
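A small sketch of such constraints using filter_var(); the helper and field names are made up:

<?php
// hypothetical helper: refuse to store anything that does not look like a
// valid URL plus a positive integer id
function looksLikeValidRow($url, $id)
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false
        && filter_var($id, FILTER_VALIDATE_INT, array('options' => array('min_range' => 1))) !== false;
}

if (!looksLikeValidRow($scrapedUrl, $scrapedId)) {
    // the structure probably changed: skip the insert and raise an alert instead
}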
It depends on the site, but you could count the number of page elements in the scraped page, like div, class, and style tags, and then, by comparing these totals against those of later scrapes, detect whether the page structure has changed.
A similar process could be used for the CSS file, where the names of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
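A rough sketch of the CSS side of that; the regex is deliberately simple and the variable names are placeholders:

<?php
// pull every class and id name out of the stylesheet
preg_match_all('/[.#][A-Za-z_-][\w-]*/', $cssFileContents, $matches);
$selectors = array_unique($matches[0]);

// compare against the list stored from the previous scrape
if (array_diff($selectors, $storedSelectors)) {
    // new classes or ids appeared: the site's structure has probably changed
}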
Speaking out of my ass here, but it's possible you might want to look at some Document Object Model PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you want to know changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with new one.
There are a lot of ways you can do it:
SAX parser
DOM parser, etc.
I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
Or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.
First, in some cases you may want to compare a hash of the original HTML to a hash of the new HTML. MD5 and SHA1 are two popular hashes. This may or may not be appropriate in all circumstances, but it is something you should be familiar with. It will tell you if anything has changed - content, tags, or otherwise.
To understand if the structure has changed you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order then you would have to capture a tree of the tags and do a comparison to see if the tags occur in the same order. This is going to be very specific to what you want to achieve.
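A minimal sketch of both checks, the quick hash and the tag histogram; variable names are placeholders:

<?php
// 1. quick check: has anything at all changed since the last scrape?
$hashChanged = sha1($newHtml) !== $storedHash;

// 2. structural check: count every tag name and compare the histograms
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($newHtml);

$histogram = array();
foreach ($doc->getElementsByTagName('*') as $node) {
    $name = $node->nodeName;
    $histogram[$name] = isset($histogram[$name]) ? $histogram[$name] + 1 : 1;
}

// loose comparison so the key order does not matter
$structureChanged = $histogram != $storedHistogram;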
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.
Tumblr and other blogging websites allow people to post embed codes for videos from YouTube and other video networks.
But how do they filter only the Flash object code and remove any other HTML or scripts? They even have automated code that informs you when something is not a valid video code.
Is this done using regex? And is there a PHP class to do that?
Thanks
Generally speaking, using regex is not a good way to deal with HTML: HTML is not regular enough for regular expressions; there are too many variations permitted in the standards... and browsers even accept HTML that's not valid!
In PHP, as your question is tagged as php, a great solution that exists to filter user input is the HTMLPurifier tool.
A couple of interesting things are:
It allows you to specify which specific tags are allowed
For each tag, you can define which specific attributes are allowed
Basically, the idea is to only keep what you specify (white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).
And if you only specify a list of tags and attributes that can do no harm, only those will be kept -- and the risks of injections are lowered a lot.
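A small configuration sketch; the tag and attribute list here is only an illustration:

<?php
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// white-list: everything not listed here is removed
$config->set('HTML.Allowed', 'p,b,i,em,strong,a[href],img[src|alt]');
// for the video use case, HTML Purifier also has HTML.SafeObject and
// HTML.SafeEmbed directives to let vetted <object>/<embed> markup through

$purifier = new HTMLPurifier($config);
$safeHtml = $purifier->purify($userSubmittedHtml);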
Quoting HTML Purifier's home page:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Yes, another great thing is that the code you get as output is valid.
Of course, this will only allow you to clean / filter / purify the HTML input; it will not allow you to validate that the URL used by the user is both:
correct; i.e. points to real content
"OK" as defined by your website ; i.e. for example no nudity, ...
About the second point, there's not much one can do: the best solution will be to either:
Have a moderator accept / reject the contents before they're put online
Give the website's users a way to flag some content as inappropriate, so a moderator can take action.
Basically, to check the content of the video itself, there is not much choice but to have a human being say "ok" or "not ok".
About the first point, though, there's hope : some services that host content have APIs that you might want / be able to use.
For instance, Youtube provides an API -- see Developer's Guide: PHP.
In your case, the Retrieving a specific video entry section looks promising: if you send an HTTP request to a URL that looks like this:
http://gdata.youtube.com/feeds/api/videos/videoID
(Replacing "videoID" by the ID of the video, of course)
You'll get an Atom feed if the video is valid, and "Invalid id" if it's not.
This might help you validate at least some URLs to content, even if you'll have to develop some specific code for each possible content-hosting service that your users like...
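A sketch of that check, using the gdata URL from above; the error suppression and the id extraction are assumptions for illustration:

<?php
$videoId = $extractedVideoId;   // hypothetical: the id pulled out of the submitted embed code

// the API returns an Atom feed for a valid video and an error for an invalid id
$feed = @file_get_contents('http://gdata.youtube.com/feeds/api/videos/' . urlencode($videoId));

$videoExists = ($feed !== false);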
Now, to extract the identifier of the video from your HTML string... If you're thinking about using regex, you are wrong ;-)
The best solution to extract a portion of data from an HTML string is generally to:
Load the HTML using a DOM parser; DOMDocument::loadHTML is generally pretty helpful here
Go through the document using DOM methods; either, depending on your situation:
DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name; might be great to iterate over all <object> or <embed> tags, for instance
Or, if you need something more complex, you could do an XPath query, using the DOMXPath class and its DOMXPath::query method.
And using DOM will also allow you to modify the HTML document using a standard API -- which might help, in case you want to add some message next to the video, or any other thing like that.
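A minimal sketch of that DOM approach; the variable names are placeholders:

<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($userSubmittedHtml);

// collect the candidate video URLs from <embed src="..."> and <object data="...">
$urls = array();
foreach ($doc->getElementsByTagName('embed') as $embed) {
    $urls[] = $embed->getAttribute('src');
}
foreach ($doc->getElementsByTagName('object') as $object) {
    $urls[] = $object->getAttribute('data');
}
// $urls can now be checked against the services you support (e.g. extract the videoID)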
Take a look at htmlpurifier to start.
http://htmlpurifier.org/
I have implemented an algorithm for this for the company I work for. It works just fine. BUT, it was quite complicated to implement.
I would definitely check out HTMLPurifier to see if that works in an easy way for you. If you insist on doing it the old-school-way like I did, this is the basic steps:
1. First off: get friendly with stripos().
2. You have to write a recursive function to identify the start and stop tags for the widget, covering all combinations of <embed></embed> or <embed/> (self-closing) or <object></object> ... or <object><params>...<embed/></object>.
3. After this, you have to parse out all attributes and params.
4. Now, all <object> tags should have <param> tags as child elements. You have to parse all of these to get all the data you need for finally generating a new embed or object tag. Especially the params and attributes that hold width, height, and the data source are important.
5. You don't know whether the attributes are enclosed by single or double quotes, so your code has to be lenient about this. Also, you don't know whether the code is valid or well formed, so it should be able to handle nested embed/object tags, embed tags that are not closed correctly, etc. As it is user-generated content, you can't really know and trust the input. You will see that there are lots of combinations.
6. If you manage to parse the embedded element with all its attributes (or the object element and its child params), the whitelisting of domains is easy...
My code ended up being about 800 lines, which is quite large, and it was filled with recursive methods finding the correct start and end tags, etc. My algorithm also removed all the SEO text that is often included in the cut-and-paste embed code, like links back to the site hosting the widget.
It's a good exercise, but if I were you... don't start walking this road.
Recommendation: try to find something ready-made and open source!
This will never be safe. Browsers have those funny little functionalities that help display a page's content even if the HTML is messy. There are endless opportunities to get something through :)
check here to see the tip of the iceberg
What you need to do is use a single input for just the link, plus additional inputs for width and height, and filter those. THEN generate the object tag yourself.
This might be safe.
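A rough sketch of that idea; the YouTube URL pattern, the size limits, and the old-style /v/ player URL are all assumptions for illustration:

<?php
$videoUrl = $_POST['video_url'];            // single input: just the link
$width    = (int) $_POST['width'];          // additional, filtered inputs
$height   = (int) $_POST['height'];

if (preg_match('~^https?://(www\.)?youtube\.com/watch\?v=([\w-]{11})~', $videoUrl, $m)
    && $width > 0 && $width <= 1280 && $height > 0 && $height <= 720) {
    // generate the object tag yourself instead of accepting pasted markup
    $embedSrc = 'http://www.youtube.com/v/' . $m[2];
    echo '<object type="application/x-shockwave-flash" width="' . $width . '" height="' . $height
        . '" data="' . htmlspecialchars($embedSrc) . '"></object>';
} else {
    echo 'This is not a valid video link.';
}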
http://php.net/manual/en/function.strip-tags.php
and allow certain tags?
The most simple and elegant solution: Allowing HTML and Preventing XSS # shiflett.org.
Using all sorts of "HTML purifier" is more than pointless. Sorry, but I don't get people who like to use these bloated libraries when a much simpler solution is at hand.
If you're looking to make your site "safe" from vulnerabilities, a whitelist approach is the (only) way to go. I would recommend safely escaping all user-generated content, and whitelisting only markup you know is safe and works on your site. This means not only <B> tags, but also the flash embeddings.
For example, if you want to allow any YouTube video to be embedded, write a validation regex that looks for the embed code they generate. Refuse to accept any others (or simply display it as escaped markup). This is testable. Forget all this parsing nonsense.
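A rough sketch of such a validation regex for an old-style YouTube <object> embed; the pattern is only an illustration and has to be kept in sync with the markup the service actually generates:

<?php
$youtubePattern = '~^<object width="\d{2,4}" height="\d{2,4}">.*?'
                . 'youtube\.com/v/[\w-]{11}.*?</object>$~s';

if (preg_match($youtubePattern, trim($userSubmittedEmbed))) {
    $approved = $userSubmittedEmbed;                     // matches the known-good shape
} else {
    $approved = htmlspecialchars($userSubmittedEmbed);   // refuse it: display escaped instead
}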
If you also want to add vimeo videos, then look at the embed code they provide and accept that as well.
Ugh? I know this seems like a pain, but in reality it's much easier to write than some algorithm that tries to detect "bad" content in some sort of generic fashion.
After getting the simple version of the algorithm working, you could go back and make it nicer. You could "provisionally" accept content with URLs, scripts, etc. that don't pass your white list, and have an admin process to add approved regexes to your output escaping routine. This way legitimate users aren't left out in the cold, but you don't open your self up to attacks of this nature.