PHP Tidy removes tags - php

I am making use of the following HTML tags and when I pass it through tidy and view the HTML output those tags have been removed. I have had a look at the list of config options but I can't find one that prevents this from happening.
Tidy removes: <unsubscribe> and <webversion>.
How can I get it to keep HTML tags like these?

PHP Tidy is aimed at correcting HTML, and those tags aren't valid HTML. With the right Tidy configuration you may be able to declare them as additional tags.
If I guessed correctly, those should be block-level elements; Tidy's new-blocklevel-tags option is the one that adds them, and the full list of configuration options covers the other cases (such as new-inline-tags).
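As a rough sketch of that configuration, assuming the tidy extension is installed: the sample markup below is made up, and the fallback branch only exists so the snippet still runs without tidy.

```php
<?php
// Made-up sample input containing the two custom elements from the question.
$html = '<p>Hello</p><unsubscribe>Leave the list</unsubscribe><webversion>View online</webversion>';

$config = [
    'new-blocklevel-tags' => 'unsubscribe, webversion', // declare the custom elements
    'show-body-only'      => true,                      // return only the fragment
];

// Assumption: without the tidy extension the input is passed through untouched.
$clean = extension_loaded('tidy')
    ? tidy_repair_string($html, $config, 'utf8')
    : $html;

echo $clean;
```

If the custom elements appear inside paragraphs rather than on their own, new-inline-tags is probably the better fit.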

Those aren't valid HTML tags, so Tidy will remove them. You might have to massage the text before/after to 'hide' the tags, by changing them to [unsubscribe] and [webversion], possibly, then change back to the <> versions afterwards. Another option would be to process the file as XML, which allows arbitrary tags of that sort. However, you'd have to be producing valid XHTML to begin with, or tidy could nuke other parts of your document.
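The masking round trip could look something like this. The helper names maskTags and unmaskTags are hypothetical, and the sketch only handles tags without attributes.

```php
<?php
// Hypothetical helpers; $tags lists the custom element names to protect.
function maskTags(string $html, array $tags): string
{
    $names = implode('|', array_map('preg_quote', $tags));
    // <unsubscribe> becomes [unsubscribe], </unsubscribe> becomes [/unsubscribe]
    return preg_replace('~<(/?(?:' . $names . '))>~i', '[$1]', $html);
}

function unmaskTags(string $html, array $tags): string
{
    $names = implode('|', array_map('preg_quote', $tags));
    // Reverse the substitution once tidy has finished.
    return preg_replace('~\[(/?(?:' . $names . '))\]~i', '<$1>', $html);
}

$tags     = ['unsubscribe', 'webversion'];
$original = '<p>Bye <unsubscribe>click</unsubscribe></p>';
$masked   = maskTags($original, $tags);
// ... run $masked through tidy here ...
$restored = unmaskTags($masked, $tags);
```

Any literal text such as [unsubscribe] that was already in the document would also be turned into a tag on the way back, so pick placeholders that cannot occur naturally.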

Related

How to Secure Data Submitted Through CKEditor

I am using CKEditor on my site to let users post comments. CKEditor has many buttons for composing a comment. Suppose a user makes their comment bold and italic, like this:
This is comment
And CKEditor will output the following HTML:
<i><strong>This is comment</strong></i>
Now, if I store this HTML in the MySQL database and output it on the web page as is, without wrapping it in htmlspecialchars(), then the comment is shown bold and italic, which is what I want.
On the other hand, if I wrap the comment in htmlspecialchars() and display it on the web page, it is shown as
<i><strong>This is comment</strong></i>
But I do not want it shown like this; I want the user's formatting. If I do not wrap it in htmlspecialchars(), though, it is risky and can allow XSS attacks and other security problems.
How can I achieve both goals?
(1). Keep the User Formatting
(2). Also Secure the HTML Contents
You need to draw up a whitelist of what elements and attributes you want to allow your users to include (eg allow <strong> but not <script>; allow <a href> but not <div onmouseover>), and then enforce it by parsing the input, removing all elements and attributes that don't fit your pattern, and serialising the results back into HTML.
This is a hard job that cannot be done with a few simple regexes or strip_tags (which is NOT an adequate solution for XSS even if it did fit your needs). You would be well advised to use an existing library to do it - HTML Purifier is one such for PHP.
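To see why strip_tags alone is not enough, here is a small demonstration (the input string is made up): attributes on allowed tags pass through untouched, so an event handler survives.

```php
<?php
// Made-up hostile input: an allowed tag carrying an event-handler attribute.
$input = '<strong onmouseover="alert(1)">hi</strong><script>alert(2)</script>';

// Only <strong> is allowed, but strip_tags does not inspect attributes.
$out = strip_tags($input, '<strong>');

echo $out;
```

The script element is removed, yet the onmouseover attribute is still there, which is exactly the kind of gap a whitelist-based sanitizer such as HTML Purifier is meant to close.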
I think you are looking for strip_tags. It removes all HTML and PHP tags from the string except the ones you explicitly allow, such as <strong> and <i>.
<?php
$str = "<i><strong>this is a comment</strong></i><script>here is script</script>";
echo $str = strip_tags($str,"<i><strong>");
?>
php.net documentation for strip_tags
The strip_tags function has an option to allow specific tags; see php.net for more on strip_tags. You must strip unwanted or disallowed tags; if you don't, the page may be vulnerable to injected JavaScript too.
Use htmlspecialchars when you are storing and htmlspecialchars_decode when you are displaying. This will help you keep the formatting of user-formatted content.
Two options spring to mind. First of all you can strip out all HTML and use a BB code parser to allow the user to post BB tags, rather than HTML - http://php.net/manual/en/book.bbcode.php
Secondly, you could strip out all HTML except a few tags. I don't know of any parser that does that personally, however I have seen it in action on sites before (Murphy's law I can't find any right now). You should be able to achieve this with a sophisticated enough RegEx replacement check.
Use this before printing it back on screen:
function html_escape($raw_input)
{
    return htmlspecialchars($raw_input, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

Ensure valid XHTML from a string in PHP

I'm using the XHTML Transitional doctype for displaying content in a browser. But before the content is displayed, it is passed through an XML parser (DOMDocument) for final touches before being output to the browser.
I use a custom designed CMS for my website, that allows me to make changes to the site. I have a module that allows me to display HTML scripts on my website in a way similar to WordPress widgets.
The problem i am facing right now is that I need to make sure any code provided through this module should be in a valid XHTML format or else the module will need to convert the code to valid XHTML. Currently if a portion of the input code is not XHTML compliant then my XML parser breaks and throws warnings.
What I am looking for is a solution that encodes the entities present in the URLs and text portions of the input provided via TextArea control. For example the following string will break the parser giving entity reference error:
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
Also the following line would cause same error:
<a href="http://www.somesite.com">Books & Cool stuff<a/>
P.S. If I use htmlentities or htmlspecialchars, they also convert the angle brackets of tags, which is not what I want. I just need the URLs and text portions of the string to be escaped/encoded.
Any help would be greatly appreciated.
Thanks and regards,
Waqar Mushtaq
What you'd need to do is generate valid XHTML in the first place. All your attributes must be htmlentitied.
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
should be
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&amp;sumthing"></script>
and
Books & Cool stuff
should be
Books &amp; Cool stuff
It's not easy to always generate valid XHTML. If at all possible I would recommend you find some other way of doing the post processing.
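If the input cannot be fixed at the source, one possible stopgap is to escape only the bare ampersands before handing the string to the XML parser. The helper name below is hypothetical, and the sketch leaves anything that already looks like an entity alone.

```php
<?php
// Hypothetical helper: escapes "&" unless it already starts an entity
// such as &amp;, &#38; or &#x26;.
function escapeBareAmpersands(string $markup): string
{
    return preg_replace(
        '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)/',
        '&amp;',
        $markup
    );
}

$script = '<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>';
$fixed  = escapeBareAmpersands($script);

echo $fixed, "\n";
echo escapeBareAmpersands('Books & Cool stuff'), "\n";
```

This does not help with named HTML entities that XML does not define, and it does nothing about unclosed tags, so it is a patch rather than a replacement for generating valid XHTML.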
As already suggested in a quick comment, you can solve the problem quite comfortably with the PHP tidy extension.
To convert a HTML fragment - even a good tag soup - into something DomDocument or SimpleXML can deal with, you can use something like the following:
$config = array(
    'output-xhtml' => 1,
    'show-body-only' => 1
);
$fragment = tidy_repair_string($html, $config);
$xhtml = sprintf("<body>%s</body>", $fragment);
Example: Format tag soup HTML as valid XHTML with tidy_repair_string.
Tidy has many options; the two used here are the ones needed for fragments (show-body-only) and XHTML compatibility (output-xhtml).
The only problem left now is that this XHTML fragment can contain entities that DomDocument or SimpleXML do not understand, for example &nbsp;. This and others are undefined in XML.
As far as DomDocument is concerned (you wrote you use it), it supports loading html instead of xml as well which deals with those entities:
$dom = new DomDocument;
$dom->loadHTML($xhtml);
Example: Loading HTML with DomDocument
HTML Tidy is a computer program and a library whose purpose is to fix invalid HTML and to improve the layout and indent style of the resulting markup.
http://tidy.sourceforge.net/
Examples of bad HTML it is able to fix:
Missing or mismatched end tags, mixed up tags
Adding missing items (some tags, quotes, ...)
Reporting proprietary HTML extensions
Change layout of markup to predefined style
Transform characters from some encodings into HTML entities

Outputting a snippet of arbitrary HTML document content without open tags

For example, a snippet of 50 characters. Problem is, of course, closing any opened tags. What's a good way to do this? Or else to make things easier, what's a good way to completely skim off all HTML content from the snippet?
You can strip out all HTML tags, etc. via the strip_tags() function, which is (being realistic) probably the best way to go, as otherwise you'll most likely end up with more tags than actual content.
For example:
$first50Chars = substr(trim(strip_tags($longString)), 0, 50);
If tags are generally allowed in the text (for example, if text inside <b> is meant to be rendered bold), then the strip_tags() function looks like the easiest way to remove tags from the snippet.
If tags are generally not allowed in the text (for example, "<b>" must simply be displayed as "<b>"), then you can use the htmlentities() function.
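If the markup should be kept rather than stripped, one way to close the tags left open by truncation is to let DOMDocument repair the cut fragment. This is a sketch under a few assumptions: the helper name htmlSnippet is hypothetical, the dom extension is available, the input is ASCII, and markup characters count toward the length.

```php
<?php
// Hypothetical helper: cut the markup, then let the HTML parser close
// whatever elements the cut left open.
function htmlSnippet(string $html, int $length): string
{
    $cut = substr($html, 0, $length); // may end inside an element

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $cut . '</div>');    // wrapper so we can find the fragment again
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = '';
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $node) {
        $out .= $doc->saveHTML($node); // serialising emits the missing end tags
    }
    return $out;
}

$snippet = htmlSnippet('<p>Some <b>bold text</b> and more</p>', 15);
echo $snippet;
```

A cut that lands in the middle of a tag (for example inside an attribute) depends on how the parser recovers, so cutting on text boundaries is safer when that matters.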

htmlentities displaying html safely

I have data coming in from an RSS feed. I want to be safe and use htmlentities, but if I do and there is HTML in the feed, the page ends up full of visible markup mixed in with the content. I don't mind the formatting the RSS offers and would be glad to use it as long as I can display it safely. I'm after the content of the feed, but I also want it formatted decently (if there is a break tag, paragraph or div). Does anyone know a way?
Do you want to protect from XSS in the feed? If so, you'll need an HTML sanitizer to run on the HTML prior to displaying it:
HTMLSanitizer
HTMLPurifier
If you just want to escape whatever is there, just call htmlspecialchars() on it. But any HTML will appear as escaped text...
You can use the strip_tags function and specify the allowed tags in there:
echo strip_tags($content, '<p><a>');
This way any tag not specified in allowed tags will be removed.
You can transform the HTML into Markdown and then back again using various libraries.

Error Tolerant HTML/XML/SGML parsing in PHP

I have a bunch of legacy documents that are HTML-like. As in, they look like HTML, but have additional made up tags that aren't a part of HTML
<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>
I need to parse these files. PHP is the only tool available. The documents don't come close to being well-formed XML.
My original thought was to use the loadHTML methods on PHP's DOMDocument. However, these methods choke on the made-up HTML tags, and will refuse to parse the string/file.
$oDom = new DomDocument();
$oDom->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
//gives us
DOMDocument::loadHTML() [function.loadHTML]: Tag pseud-template invalid in Entity, line: 1 occured in ....
The only solution I've been able to come up with is to pre-process the files with string replacement functions that will remove the invalid tags and replace them with a valid HTML tag (maybe a span with an id of the tag name).
Is there a more elegant solution? A way to let DOMDocument know about additional tags to consider as valid? Is there a different, robust HTML parsing class/object out there for PHP?
(if it's not obvious, I don't consider regular expressions a valid solution here)
Update: The information in the fake tags is part of the goal here, so something like Tidy isn't an option. Also, I'm after something that does some level, if not all, of the well-formedness cleanup for me, which is why I was looking at DomDocument's loadHTML method in the first place.
You can suppress warnings with libxml_use_internal_errors, while loading the document. Eg.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
If, for some reason, you need access to the warnings, use libxml_get_errors.
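A minimal sketch of collecting the warnings instead of discarding them, assuming the dom and libxml extensions; which messages are reported depends on the libxml2 version in use, so the list may be empty.

```php
<?php
$previous = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings

$doc = new DOMDocument();
$doc->loadHTML('<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>');

// Copy whatever the parser recorded into a plain array of messages.
$problems = [];
foreach (libxml_get_errors() as $error) {
    $problems[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

// The made-up element is still present in the parsed tree.
$fake = $doc->getElementsByTagName('pseud-template')->item(0);
echo $fake->textContent, "\n";
```

Restoring the previous setting, rather than hard-coding false, keeps the snippet from changing error handling for the rest of the application.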
I wonder if passing the "bad" HTML through HTML Tidy might help as a first pass? Might be worth a look, if you can get the document to be well formed, maybe you could load it as a regular XML file with DomDocument.
@Twan
You don't need a DTD for DOMDocument to parse custom XML. Just use DOMDocument->load(), and as long as the XML is well-formed, it can read it.
Once you get the files to be well-formed, that's when you can start looking at XML parsers; before that you're S.O.L. Like Alejo said, you could look at HTML Tidy, but it looks like that's specific to HTML, and I don't know how it would go with your custom elements.
I don't consider regular expressions a valid solution here
Until you've got well-formedness, that might be your only option. Once you get the documents to that stage, then you're in the clear with the DOM functions.
Take a look at the Parser in the PHP Fit port. The code is clean and was originally designed for loading the dirty HTML saved by Word. It's configured to pull tables out, but can easily be adapted.
You can see the source here:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/Parser.phps
The unit test will show you how to use it:
http://gerd.exit0.net/pat/PHPFIT/PHPFIT-0.1.0/test/parser.phps
My quick and dirty solution to this problem was to run a loop that matches my list of custom tags with a regular expression. The regexp doesn't catch tags that have another inner custom tag inside them.
When there is a match, a function to process that tag is called and returns the "processed HTML". If that custom tag was inside another custom tag, then the parent becomes childless because actual HTML was inserted in place of the child, so it will be matched by the regexp and processed at the next iteration of the loop.
The loop ends when there are no childless custom tags to be matched. Overall it's iterative (a while loop) and not recursive.
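The loop described above might be sketched like this. The tag names (box, note), the handlers and the function name expandCustomTags are all hypothetical, and the pattern only covers tags without attributes.

```php
<?php
// Hypothetical custom tags and the functions that turn them into real HTML.
$handlers = [
    'box'  => fn(string $inner): string => '<div class="box">' . $inner . '</div>',
    'note' => fn(string $inner): string => '<em>' . $inner . '</em>',
];

function expandCustomTags(string $html, array $handlers): string
{
    $names = implode('|', array_map('preg_quote', array_keys($handlers)));
    // Match a custom tag whose body contains no other custom opening tag.
    $pattern = '~<(' . $names . ')>((?:(?!<(?:' . $names . ')>).)*?)</\1>~s';

    do {
        // Each pass replaces the currently childless custom tags.
        $html = preg_replace_callback(
            $pattern,
            fn(array $m): string => $handlers[$m[1]]($m[2]),
            $html,
            -1,
            $count
        );
    } while ($count > 0); // stop once a pass finds nothing to replace

    return $html;
}

$result = expandCustomTags('<box>a <note>b</note> c</box>', $handlers);
echo $result;
```

A handler whose output itself contains a custom tag would keep the loop going, so handlers should emit plain HTML only.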
@Alan Storm
Your comment on my other answer got me to thinking:
When you load an HTML file with DOMDocument, it appears to do some level of cleanup re: well-formedness, BUT requires all your tags to be legit HTML tags. I'm looking for something that does the former, but not the latter. (Alan Storm)
Run a regex (sorry!) over the tags, and when it finds one which isn't a valid HTML element, replace it with a valid element that you know doesn't exist in any of the documents (blink comes to mind...), and give it an attribute value with the name of the illegal element, so that you can switch it back afterwards. eg:
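// swap the made-up tag for a placeholder element that remembers its name
$code = str_replace("<pseudo-tag>", "<blink rel=\"pseudo-tag\">", $code);
$code = str_replace("</pseudo-tag>", "</blink>", $code);
// and then back again...
$code = preg_replace('~<blink rel="(.*?)">(.*?)</blink>~s', '<$1>$2</$1>', $code);
That is only a rough sketch (it ignores attributes and nested fake tags), but you get the general idea?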
