WMD markdown editor - HTML to Markdown conversion

I am using wmd markdown editor on a project and had a question:
When I post the form containing the markdown text area, it (as expected) posts html to the server. However, say something fails server-side validation and I need to send the user back to edit their entry: is there any way to refill the textarea with just the markdown and not the html? As I have it set up, the server only has access to the post data (which is in the form of html), so I can't think of a way to do this. Any ideas? Preferably a non-javascript based solution.
Update: I found an html to markdown converter called markdownify. I guess this might be the best solution for displaying the markdown back to the user...any better alternatives are welcome!
Update 2: I found this post on SO and I guess there is an option to send the data to the server as markdown instead of html. Are there any downsides to simply storing the data as markdown in the database? What about displaying it back to the user (outside of an editor)? Maybe it would be best to post both versions (html AND markdown) to the server...
SOLVED: I can simply use php markdown to convert the markdown to html serverside.

I would suggest that you simply send and store the text as Markdown, which seems to be what you have settled on already. IMO, storing the text as Markdown is better because you can safely strip all HTML tags out without worrying about loss of formatting. This makes your code safer, because an XSS attack becomes harder to pull off (it may still be possible; I am only saying that this part will be safer).
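A minimal sketch of how the form could be refilled once the Markdown itself is what gets posted. The field name `body`, the `validate_entry()` helper and its length rule are assumptions made up for illustration; they are not part of WMD or PHP Markdown.

```php
<?php
// Hypothetical validation rule, for illustration only.
function validate_entry(string $markdown): array
{
    $errors = [];
    if (strlen(trim($markdown)) < 10) {
        $errors[] = 'Entry is too short.';
    }
    return $errors;
}

// Rebuild the textarea from the raw Markdown the user submitted.
// htmlspecialchars() keeps literal HTML in the text from breaking the form.
function render_textarea(string $markdown): string
{
    return '<textarea name="body">'
        . htmlspecialchars($markdown, ENT_QUOTES, 'UTF-8')
        . '</textarea>';
}

$submitted = "*hi* <b>"; // stands in for $_POST['body']
$errors = validate_entry($submitted);
if ($errors) {
    // Validation failed: hand the user their own Markdown back.
    echo render_textarea($submitted), "\n";
}
```

Because the server never has to turn HTML back into Markdown, no converter is needed on the way back to the editor.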

One thing to consider is that WMD appears to have certain different edge cases from certain server-side Markdown implementations. I've definitely seen some quirks in the previews here that have shown up differently after submission (I believe one such case was attempting to escape a backtick surrounded by backticks). By sending the converted preview over the wire, you can ensure that the preview is accurate.
I'm not saying that should make your decision, but it's something to consider.

Try out Pandoc. It's a little more comprehensive and reliable than Markdownify.

The HTML you are seeing is just a preview, so it's not a good idea to store that in the database, as you will run into issues when you try to edit. It's also not a good idea to store both versions (markdown and HTML), as the HTML is just an interpretation and you will have the same problems of editing and keeping both versions in sync.
So the best idea is to store the markdown in the db and then convert it server side before displaying.
You can use PHP Markdown for this purpose. However, its output is not a 100% match for what you are seeing on the javascript side and may need some tweaking.
The version that the Stack Exchange network is using is a C# implementation and there should be a python implementation you downloaded with the version of wmd you have.
The one thing I tweaked was the way new lines were rendered so I changed this in markdown.php to convert some new lines into <br> starting from line 626 in the version I have:
var $span_gamut = array(
    #
    # These are all the transformations that occur *within* block-level
    # tags like paragraphs, headers, and list items.
    #
    # Process character escapes, code spans, and inline HTML
    # in one shot.
    "parseSpan" => -30,
    # Process anchor and image tags. Images must come first,
    # because ![foo][f] looks like an anchor.
    "doImages" => 10,
    "doAnchors" => 20,
    # Make links out of things like `<http://example.com/>`
    # Must come after doAnchors, because you can use < and >
    # delimiters in inline links like [this](<url>).
    "doAutoLinks" => 30,
    "encodeAmpsAndAngles" => 40,
    "doItalicsAndBold" => 50,
    "doHardBreaks" => 60,
    "doNewLines" => 70,
);
function runSpanGamut($text) {
    #
    # Run span gamut tranformations.
    #
    foreach ($this->span_gamut as $method => $priority) {
        $text = $this->$method($text);
    }
    return $text;
}
function doNewLines($text) {
    return nl2br($text);
}
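To see what the extra doNewLines step does to a span of text, here is nl2br() on its own. This only illustrates the built-in function; it does not reproduce the rest of the PHP Markdown pipeline, and the sample string is made up.

```php
<?php
// nl2br() inserts a <br /> before every newline and keeps the newline itself.
$text = "first line\nsecond line";
$html = nl2br($text);

echo $html, "\n";
```

Note that this breaks on every single newline inside a span, which is broader than Markdown's own rule of two trailing spaces for a hard break.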

Related

XSS vulnerabilities still exist even after using HTML Purifier

I'm testing one of my web applications using Acunetix. To protect this project against XSS attacks, I used HTML Purifier. This library is recommended by most PHP developers for this purpose, but my scan results show that HTML Purifier cannot protect us from XSS attacks completely. The scanner found two ways of attack by sending different harmful inputs:
1<img sRc='http://attacker-9437/log.php? (See HTML Purifier result here)
1"onmouseover=vVF3(9185)" (See HTML Purifier result here)
As you can see from the results, HTML Purifier could not detect such attacks. I don't know whether there is a specific option in HTML Purifier to solve such problems, or whether it really is unable to detect these methods of XSS attack.
Do you have any idea? Or any other solution?

(This is a late answer since this question is becoming the place duplicate questions are linked to, and previously some vital information was only available in comments.)
HTML Purifier is a contextual HTML sanitiser, which is why it seems to be failing on those tasks.
Let's look at why in some detail:
1<img sRc='http://attacker-9437/log.php?
You'll notice that HTML Purifier closed this tag for you, leaving only an image injection. An image is a perfectly valid and safe tag (barring, of course, current image library exploits). If you want it to throw away images entirely, consider adjusting the HTML Purifier whitelist by setting HTML.Allowed.
That the image from the example is now loading a URL that belongs to an attacker, thus giving the attacker the IP of the user loading the page (and nothing else), is a tricky problem that HTML Purifier wasn't designed to solve. That said, you could write a HTML Purifier attribute checker that runs after purification, but before the HTML is put back together, like this:
// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$image = $htmlDef->addBlankElement('img');
// HTMLPurifier_AttrTransform_CheckURL is a custom class you've supplied,
// and checks the URL against a white- or blacklist:
$image->attr_transform_post[] = new HTMLPurifier_AttrTransform_CheckURL();
The HTMLPurifier_AttrTransform_CheckURL class would need to have a structure like this:
class HTMLPurifier_AttrTransform_CheckURL extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        // Nothing to check if the tag carries no src attribute.
        if (!isset($attr['src'])) {
            return $attr;
        }
        $destination = $attr['src'];
        if (is_malicious($destination)) {
            // ^ is_malicious() is something you'd have to write
            $this->confiscateAttr($attr, 'src');
        }
        return $attr;
    }
}
Of course, it's difficult to do this 'right':
if this is a live check with some web-service, this will slow purification down to a crawl
if you're keeping a local cache you run risk of having outdated information
if you're using heuristics ("that URL looks like it might be malicious based on indicators x, y and z"), you run risk of missing whole classes of malicious URLs
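As one possible shape for is_malicious(), here is a sketch of the simplest local variant: a fixed allowlist of image hosts. The host names are placeholders, and treating anything unparseable or relative as untrusted is a design choice of this sketch, not something HTML Purifier prescribes.

```php
<?php
// Hypothetical allowlist; replace the placeholder hosts with your own.
const TRUSTED_IMAGE_HOSTS = ['example.com', 'images.example.com'];

function is_malicious(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        // No host (relative or unparseable URL): untrusted in this sketch.
        return true;
    }
    // Anything not explicitly listed is rejected.
    return !in_array(strtolower($host), TRUSTED_IMAGE_HOSTS, true);
}

var_dump(is_malicious('http://attacker-9437/log.php?'));
var_dump(is_malicious('http://example.com/img.jpg'));
```

An allowlist sidesteps the outdated-cache and heuristic problems listed above, at the cost of rejecting every image host you have not approved.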
1"onmouseover=vVF3(9185)"
HTML Purifier assumes the context your HTML is set in is a <div> (unless you tell it otherwise by setting HTML.Parent).
If you just feed it an attribute value, it's going to assume you're going to output this somewhere so the end-result looks like this:
...
<div>1"onmouseover=vVF3(9185)"</div>
...
That's why it appears to not be doing anything about this input - it's harmless in this context. You might even not want to strip this information in that context. I mean, we're talking about this snippet here on stackoverflow, and that's valuable (and not causing a security problem).
Context matters. Now, if you instead feed HTML Purifier this snippet:
<div class="1"onmouseover=vVF3(9185)"">foo</div>
...suddenly you can see what it's made to do:
<div class="1">foo</div>
Now it's removed the injection, because in this context, it would have been malicious.
What to use HTML Purifier for and what not
So now you're left to wonder what you should be using HTML Purifier for, and when it's the wrong tool for the job. Here's a quick run-down:
you should use htmlspecialchars($input, ENT_QUOTES, 'utf-8') (or whatever your encoding is) if you're outputting into a HTML document and aren't interested in preserving HTML at all - HTML Purifier would be unnecessary overhead there, and it'll let some things through
you should use HTML Purifier if you want to output into a HTML document and allow formatting, e.g. if you're a message board and you want people to be able to format their messages using HTML
you should use htmlspecialchars($input, ENT_QUOTES, 'utf-8') if you're outputting into a HTML attribute (HTML Purifier is not meant for this use-case)
You can find some more information about sanitising / escaping by context in this question / answer.
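For the attribute case, a short sketch of what htmlspecialchars() does to the scanner's second payload. The surrounding `<div class="...">` markup is just an example context.

```php
<?php
// The scanner's payload, destined for an attribute value.
$input = '1"onmouseover=vVF3(9185)"';

// ENT_QUOTES escapes both double and single quotes, so the value
// cannot close the attribute it is written into.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo '<div class="' . $escaped . '">foo</div>', "\n";
```

The quotes arrive in the browser as `&quot;`, so the whole payload stays inside the class attribute instead of becoming an onmouseover handler.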

All the HTML purifier seems to be doing, from the brief look that I gave, was HTML encode certain characters such as <, > and so on. However there are other means of invoking JS without using the normal HTML characters:
javascript:prompt(1) // In image tags
src="http://evil.com/xss.html" // In iFrame tags
Please review comments (by @pinkgothic) below.
Points below:
This would be HTML injection which does effectively lead to XSS. In this case, you open an <img> tag, point the src to some non-existent file which in turn raises an error. That can then be handled by the onerror handler to run some JavaScript code. Take the following example:
<img src=x onerror=alert(document.domain)>
The entry point for this is generally accompanied by prematurely closing another tag on an input. For example (URL decoded for clarity):
GET /products.php?type="><img src=x onerror=prompt(1)> HTTP/1.1
This, however, is easily mitigated by HTML-escaping meta-characters (i.e. <, >).
Same as above, except this could be closing off an HTML attribute instead of a tag and inserting its own attribute. Say you have a page where you can upload the URL for an image:
<img src="$USER_DEFINED">
A normal example would be:
<img src="http://example.com/img.jpg">
However, inserting the above payload, we cut off the src attribute which points to a non-existent file and inject an onerror handler:
<img src="1"onerror=alert(document.domain)">
This executes the same payload mentioned above.
Remediation
This is heavily documented and tested in multiple places, so I won't go into detail. However, the following two articles are great on the subject and will cover all your needs:
https://www.acunetix.com/websitesecurity/cross-site-scripting/
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet

Do you bother with markup formatting?

I am a front end guy who is getting more and more into scripting and that being the case, I like my regurgitated markup to kind of look nice.
I ran a loop over some database values for a list and while most sites would just show a big old concatenated slew of <LI> tags back to back, I kind of like them \r\n distanced with proper \t tabbing. Weird thing is, the first list member renders like LI> rather than <LI> about 1 out of 5 page serves.
Anyone seen this? Should I not bother? Am I formatting the loops badly? Here's an example:
while ($whatever = mysql_fetch_array($blah_query)){
echo "\t\t\t\t\t\t";
echo "<li>\n";
echo "\t\t\t\t\t\t";
echo '<a href="#'.$whatever['name'].'" id="category_id_'.$whatever['id'].'">';
echo ucfirst($whatever['name']);
echo "</a>\n\t\t\t\t\t\t</li>\n";
}

It looks as if the goal is for the page source to come out with the proper indentation, at least for now while you debug, so that it is easier to read?
while ($whatever = mysql_fetch_array($blah_query)){
echo "\t\t\t\t\t\t";
echo "<li>\n";
echo "\t\t\t\t\t\t";
echo '<a href="#'.$whatever['name'].'" id="category_id_'.$whatever['id'].'">';
echo ucfirst($whatever['name']);
echo "</a>\n\t\t\t\t\t\t</li>\n";
}
Since you're using PHP to echo the HTML, just type it the way you want to see it in the page source:
while ($whatever = mysql_fetch_array($blah_query)) {
    // When you want a new line, just hit enter. PHP will echo the line breaks and indentation too.
    // Function calls are not evaluated inside quotes, so ucfirst() has to be concatenated in.
    echo '
                        <li>
                            ' . ucfirst($whatever['name']) . '
                        </li>
';
}
This is how I would do it, so that there is a line break every time, including the first time, in case there is a left-over "</div>" or some other closing tag without a line break after it.
It will output a cleaner list item that is tabbed in, with the line breaks in place.
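Another way to keep the indentation in one place is to build each item from an indent string. This is a sketch under assumptions: the `$rows` array stands in for the database result, and `render_items()` is a helper name invented here.

```php
<?php
// Stand-in for the rows the database query would return.
$rows = [
    ['id' => 1, 'name' => 'books'],
    ['id' => 2, 'name' => 'music'],
];

function render_items(array $rows, int $depth): string
{
    $indent = str_repeat("\t", $depth); // one place to change the nesting level
    $out = '';
    foreach ($rows as $row) {
        // Escape the value before it goes into the attribute and the link text.
        $name = htmlspecialchars($row['name'], ENT_QUOTES, 'UTF-8');
        $out .= $indent . "<li>\n"
            . $indent . "\t" . '<a href="#' . $name . '" id="category_id_' . (int) $row['id'] . '">'
            . ucfirst($name) . "</a>\n"
            . $indent . "</li>\n";
    }
    return $out;
}

echo render_items($rows, 6);
```

Returning a string instead of echoing piece by piece also makes the output easy to inspect when an item renders oddly.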

Removing spaces between code can significantly decrease the sizes of files especially if your code is of significant length. By removing any indenting and minimising spaces within files, you can maximise connection speeds to your site by delivering the requested pages considerably faster than if you were indenting. This adds up if your website is receiving any reasonable amount of traffic, as each page served may be made more efficient by removing 5-10kb of spacing. In the long run, if you're serving users pages regularly, the added network strain can be minimised by ensuring your code uses as little of the space as possible.
That said, if you happen to be developing in a private environment, it's good practice to use indenting for debugging purposes. The style of the code allows you to follow its logic and flow, whereas minified code lacks legibility.

Typically, removing the spaces between elements is a way to 'save bandwidth' for high-traffic sites. It is akin to minifying JavaScript or CSS. If you are still in 'testing/development' mode, then sure, indent it, so you can see if you are making mistakes. However, in any production environment with appreciable traffic, you should 'minify' your HTML too.
This is not just to cut back on the monetary cost of bandwidth; it cuts the cost in system resources as well. It takes a little longer to send a 45k file (with spaces) than a 29k file (without spaces). The server can therefore push the smaller file out faster, which frees up an open connection sooner, which means it can accept a new incoming connection sooner. There are lots of talks dedicated to this idea of minification. Minification, coupled with compression, is one of the main ways pages meet the high standards now expected for load speed. The less you send out, the faster you can do so, and the more people you can get it to.
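A rough sketch of the idea, assuming the markup has no `<pre>` blocks or inline elements whose spacing matters (collapsing whitespace there would change how the page renders). The function name is made up for this example.

```php
<?php
function collapse_markup_whitespace(string $html): string
{
    // Drop whitespace that sits only between a closing '>' and the next '<'.
    return trim(preg_replace('/>\s+</', '><', $html));
}

$pretty = "<ul>\n\t<li>one</li>\n\t<li>two</li>\n</ul>\n";
$small  = collapse_markup_whitespace($pretty);

echo $small, "\n";
echo strlen($pretty), ' -> ', strlen($small), " bytes\n";
```

If the server already compresses responses, runs of tabs and newlines compress very well, so it is worth measuring the compressed sizes before deciding how much this step buys you.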

I am like you. Everything must be clean and tidy.
I would recommend using XSL as a templating engine. This automatically makes all your HTML properly formatted if you set formatOutput = true.
I use that setting for my local copy, but for the live copy I set XSL to use no formatting and no extra white space. This returns all the HTML on one line, which saves roughly 20-30% of the HTML file size. So you save bandwidth and get quicker load times. It is probably slightly quicker for browsers to render too.
See:
http://www.php.net/xsl
$xsl->preserveWhiteSpace = false;
$xsl->formatOutput = true; // or false to get everything on one line
Just looking at my code the above is what I use to either set to indent nicely, or output all on one line.
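The two properties shown are DOMDocument properties, so this sketch assumes the answer's `$xsl` variable is a DOMDocument. It only demonstrates the switch between indented and single-line output; no stylesheet is involved, and `render_list()` is a name invented for the example.

```php
<?php
function render_list(bool $pretty): string
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = false; // set before loading so existing whitespace is dropped
    $doc->formatOutput = $pretty;     // true: indented, false: one line
    $doc->loadXML('<ul><li>one</li><li>two</li></ul>');

    // Serialise only the root element, without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

echo render_list(true), "\n\n";  // development: indented
echo render_list(false), "\n";   // live: single line
```

A single flag, for example one read from an environment setting, can then decide which form each deployment serves.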

How to convert HTML into XHTML [duplicate]

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
PHP library for converting HTML4 to XHTML?
Is there any ready made function in PHP to achieve this? Basically I'm taking HTML data from Smarty template and want to convert it into XHTML through coding.

$filename = 'template.php'; // filepath to file
// All options : http://tidy.sourceforge.net/docs/quickref.html
$options = array('output-xhtml' => true, 'clean' => true, 'wrap-php' => true);
$tidy = new tidy(); // create new instance of Tidy
$tidy->parseFile($filename, $options); // open file
$tidy->cleanRepair(); // process with specified options
copy($filename, $filename . '.bak'); // backup current file
file_put_contents($filename, $tidy); // overwrite current file with XHTML version
I don't have a Smarty template file to test this on, but give it a try and see if it works correctly in converting one. Backup your files as always when running something of this nature. Test out on sample files first.

The problem is that you do not have an html file to work with. You have a php template written in the programming language "smarty" that is not markup, even though it contains blocks of markup. You're looking for a magic wand and no such wand exists.
If it were purely html, you could probably use DOMDocument to read the files into a DOM structure and generate xhtml, but that is simply not going to work with the raw template source files. You could potentially write a parser that reads the smarty tpl files, looks for the html snippets and tries to load them into DOMDocument objects.
With that said, I have to ask first -- why you really want to convert to xhtml when xhtml is basically a failed standard that is obsolete at this point in time, and secondarily, if you have some legitimate reason for wanting to forge ahead, why you can't use some regex search and replace snippets that change the doctypes and some regex based searches to look for tags that lack the end tags, and the other relatively minor tweaks needed. The differences between html and xhtml can be boiled down to a handful of rules that are pretty easy to understand.
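For the pure-HTML case mentioned above, a sketch of the DOMDocument route. It is meant for HTML that has already been rendered, not for Smarty template source, and the wrapper `<div>` and function name are choices made for this example.

```php
<?php
function html_fragment_to_xhtml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper gives the fragment one known parent
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $xhtml = '';
    foreach ($wrapper->childNodes as $child) {
        // saveXML() serialises with XML rules: tags closed, empty elements self-closed.
        $xhtml .= $doc->saveXML($child);
    }
    return $xhtml;
}

echo html_fragment_to_xhtml('<p>one<br>two<img src="a.png" alt="">'), "\n";
```

The parser closes the unclosed `<p>` and the serialiser writes `<br>` and `<img>` in self-closing form, which covers the most common differences between the two syntaxes.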

In answer to your original question: sort of. Core PHP -> DOM, SimpleXML, SPL = templating engine. That's why (and how) templating engines such as Smarty exist.
Re: installing Tidy as suggested in comments,
Tidy has a prerequisite lib. If you don't already have it:
http://php.net/manual/en/tidy.installation.php
To use Tidy, you will need libtidy installed, available on the tidy homepage:
http://tidy.sourceforge.net/.
To enable, you will need to recompile PHP and include it in your config flags:
"This extension is bundled with PHP 5 and greater, and is installed
using the --with-tidy configure option."
So, get your existing config flags:
php -i | grep config
and add --with-tidy.
However, this is probably the wrong approach. It does not solve your actual problem (outputting XHTML instead of HTML) - it fixes Smarty's problem. Recompiling PHP to add an extension so you can use it to fix a templating engine's doctype shortcomings probably means you should consider using a different templating engine, if possible. Recompiling is fairly drastic, and it adds a lot of overhead for what you get, which amounts to a hacky band-aid workaround that retroactively repairs broken output.
PEAR's HTML_Template_PHPTAL is probably the best solution to your problem, and the closest answer to your original question.
And if PHPTAL doesn't quite cut it, there are at least 5 others available as PEAR libs to choose from.
pear install http://phptal.org/latest.tar.gz
Or it's been ported to Git:
git clone git://github.com/pornel/PHPTAL
A cursory google search: http://webification.com/best-php-template-engines
HTH

Changing/deleting html from file_get_contents

I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But tumblr has added a script that activates each time you enter a password-field. So my question is:
Can I remove certain parts of what file_get_contents returns? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
edit:
I managed to do it the simple way, by skipping the first 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, null, 766);

After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is kind of frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance to time spent ratio.
Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction) and then turn it back into a string and echo it. This can be more convenient to work with if your requirements are not dead simple and is easier to maintain, but it typically requires more code to be written.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), then don't be tempted by the string functions; go with the second approach.
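A sketch of the second approach using PHP's bundled DOM classes. The `id="login"` selector and the sample page are assumptions; you would need to look at the real markup to find what identifies the element to drop.

```php
<?php
function remove_nodes(string $html, string $xpath): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($doc);
    // Copy the matches first so removing nodes cannot disturb the iteration.
    foreach (iterator_to_array($finder->query($xpath)) as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

// Stand-in for the string returned by file_get_contents().
$page = '<html><body><div id="login"><script>track()</script></div><p>Post text</p></body></html>';
echo remove_nodes($page, '//div[@id="login"]');
```

Unlike skipping a fixed number of characters, this keeps working if the markup before the unwanted element changes length.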

At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php

You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.

"Safe" markdown processor for PHP?

Is there a PHP implementation of markdown suitable for using in public comments?
Basically it should only allow a subset of the markdown syntax (bold, italic, links, block-quotes, code-blocks and lists), and strip out all inline HTML (or possibly escape it?)
I guess one option is to use the normal markdown parser, and run the output through an HTML sanitiser, but is there a better way of doing this..?
We're using PHP Markdown Extra for the rest of the site, so we'd already have to use a secondary parser (the non-"Extra" version, since things like footnote support are unnecessary). It also seems nicer to parse only the *bold* text and have everything else escaped to &lt;a href="etc"&gt;, than to generate <b>bold</b> text and try to strip the bits we don't want.
Also, on a related note, we're using the WMD control for the "main" site, but for comments, what other options are there? WMD's javascript preview is nice, but it would need the same "neutering" as the PHP markdown processor (it can't display images and so on, otherwise someone will submit and their working markdown will "break")
Currently my plan is to use the PHP-markdown -> HTML sanitiser method, and edit WMD to remove the image/heading syntax from showdown.js - but it seems like this has been done countless times before.
Basically:
Is there a "safe" markdown implementation in PHP?
Is there a HTML/javascript markdown editor which could have the same options easily disabled?
Update: I ended up simply running the markdown() output through HTML Purifier.
This way the Markdown rendering is separate from output sanitisation, which is much simpler (two mostly-unmodified code bases), more secure (you're not trying to do both rendering and sanitisation at once), and more flexible (you can have multiple sanitisation levels, say a more lax configuration for trusted content and a much more stringent one for public comments).

PHP Markdown has a sanitizer option, but it doesn't appear to be advertised anywhere. Take a look at the top of the Markdown_Parser class in markdown.php (starts on line 191 in version 1.0.1m). We're interested in lines 209-211:
# Change to `true` to disallow markup or entities.
var $no_markup = false;
var $no_entities = false;
If you change those to true, markup and entities, respectively, should be escaped rather than inserted verbatim. There doesn't appear to be any built-in way to change those (e.g., via the constructor), but you can always add one:
function do_markdown($text, $safe=false) {
    $parser = new Markdown_Parser;
    if ($safe) {
        $parser->no_markup = true;
        $parser->no_entities = true;
    }
    return $parser->transform($text);
}
Note that the above function creates a new parser on every run rather than caching it like the provided Markdown function (lines 43-56) does, so it might be a bit on the slow side.

JavaScript Markdown Editor Hypothesis:
Use a JavaScript-driven Markdown Editor, e.g., based on showdown
Remove all icons and visual clues from the Toolbar for unwanted items
Set up a JavaScript filter to clean-up unwanted markup on submission
Test and harden all JavaScript changes and filters locally on your computer
Mirror those filters in the PHP submission script, to catch same on the server-side.
Remove all references to unwanted items from Help/Tutorials
I've created a Markdown editor in JavaScript, but it has enhanced features. That took a big chunk of time and SVN revisions. But I don't think it would be that tough to alter a Markdown editor to limit the HTML allowed.

How about running htmlspecialchars on the user entered input, before processing it through markdown? It should escape anything dangerous, but leave everything that markdown understands.
I'm trying to think of a case where this wouldn't work but can't think of anything off hand.
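Two cases are worth checking before relying on this. The sketch below only shows what htmlspecialchars() does to the raw input; how a given Markdown parser then treats the result is an assumption you would need to test against the parser you use. The helper name is made up.

```php
<?php
function escape_before_markdown(string $input): string
{
    return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

// Inline HTML is neutralised, as intended.
echo escape_before_markdown('<script>alert(1)</script>'), "\n";

// The '>' that starts a Markdown blockquote is escaped as well.
echo escape_before_markdown('> a quoted line'), "\n";

// A javascript: link contains nothing for htmlspecialchars() to escape.
echo escape_before_markdown('[click](javascript:alert(1))'), "\n";
```

So blockquotes may stop working, and link targets would still need a separate check, which is one reason to sanitise the rendered HTML instead.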
