How to convert HTML into XHTML [duplicate]

How to convert HTML into XHTML [duplicate] - php

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
PHP library for converting HTML4 to XHTML?
Is there any ready made function in PHP to achieve this? Basically I'm taking HTML data from Smarty template and want to convert it into XHTML through coding.

$filename = 'template.php'; // filepath to file
// All options : http://tidy.sourceforge.net/docs/quickref.html
$options = array('output-xhtml' => true, 'clean' => true, 'wrap-php' => true);
$tidy = new tidy(); // create new instance of Tidy
$tidy->parseFile($filename, $options); // open file
$tidy->cleanRepair(); // process with specified options
copy($filename, $filename . '.bak'); // backup current file
file_put_contents($filename, $tidy); // overwrite current file with XHTML version
I don't have a Smarty template file to test this on, but give it a try and see if it works correctly in converting one. Backup your files as always when running something of this nature. Test out on sample files first.

The problem is that you do not have an html file to work with. You have a php template written in the programming language "smarty" that is not markup, even though it contains blocks of markup. You're looking for a magic wand and no such wand exists.
If it was purely html, then you could probably use Domdocument to read the files into a Dom structure and generate xhtml, but that is simply not going to work with the pure source files, although you could potentially write a parser to read the smarty tpl files, look for the html snippets and try and load them into Domdocument objects.
With that said, I have to ask first -- why you really want to convert to xhtml when xhtml is basically a failed standard that is obsolete at this point in time, and secondarily, if you have some legitimate reason for wanting to forge ahead, why you can't use some regex search and replace snippets that change the doctypes and some regex based searches to look for tags that lack the end tags, and the other relatively minor tweaks needed. The differences between html and xhtml can be boiled down to a handful of rules that are pretty easy to understand.

In answer to your original question: sort of. Core PHP -> DOM, SimpleXML, SPL = templating engine. That's why (and how) templating engines such as Smarty exist.
Re: installing Tidy as suggested in comments,
Tidy has a prerequisite lib. If you don't already have it:
http://php.net/manual/en/tidy.installation.php
To use Tidy, you will need libtidy installed, available on the tidy homepage »
http://tidy.sourceforge.net/.
To enable, you will need to recompile PHP and include it in your config flags:
"This extension is bundled with PHP 5 and greater, and is installed
using the --with-tidy configure option."
So, get your existing config flags:
php -i | grep config
and add --with-tidy.
However, this is probably the wrong approach. It does not solve your actual problem (outputting XHTML instead of HTML) - it fixes Smarty's problem. Recompiling PHP to add an extension so you can use it to fix a templating engine's doctype shortcomings probably means you should consider using a different templating engine, if possible. That's sort of drastic (and adds a lot of overhead for what you get, which amounts to for a hacky non-solution bandaid workaround retroactively repairing broken output.)
PEAR's HTML_Template_PHPTAL is probably the best solution to your problem, and the closest answer to your original question.
And if PHPTAL doesn't quite cut it, there are at least 5 others available as PEAR libs to choose from.
pear install http://phptal.org/latest.tar.gz
Or it's been ported to Git:
git clone git://github.com/pornel/PHPTAL
A cursory google search: http://webification.com/best-php-template-engines
HTH

Related

How do I configure emacs for proper PHP development?

My current setup in emacs for PHP development has a variety of shortcomings. I often use a mixed mode of html and php. I want the mode to be able to recognize what context I'm in and format appropriately. I am especially interested in appropriate tabbing. This is the most important feature for me. Correct coloring would be nice, but if it messes up once in a while that's ok.
I am currently using multi-web-mode and the default php-mode in Emacs 24.3 on MacOS X.
One of the most frustrating problems is incorporating the heredoc syntax: echo <<< My current system doesn't recognize that this syntax needs to be NOT tabbed. I typically get warnings like this:
Indentation fails badly with mixed HTML/PHP in the HTML part in
plaín `php-mode'. To get indentation to work you must use an
Emacs library that supports 'multiple major modes' in a buffer.
Parts of the buffer will then be in `php-mode' and parts in for
example `html-mode'. Known such libraries are:
mumamo, mmm-mode, multi-mode
You have these available in your `load-path':
mumamo
I've already tried using mumao/nxhtml but that didn't give me the results I wanted. In some ways it was worse. I'd really appreciate any tips people have for getting a working php development environment setup for emacs.

I use web-mode (http://web-mode.org/) for mixed HTML/PHP files and php-mode for pure PHP files. The latest version of php-mode also recommended web-mode for mixed HTML/PHP files: https://github.com/ejmr/php-mode#avoid-html-template-compatibility.
Unlike other modes like mmm-mode, mumamo or multi-web-mode that try to apply different behaviors to different parts of a buffer, web-mode is aware of all the available syntax/template that can be mixed with HTML. You can also use web-mode for mixed HTML files/templates such as Twig, Django, ERB... In fact I use web-mode for anything involve HTML.
There is a catch for PHP template though: Other template systems has different file extension so it is easy to switch the mode automatically, but PHP templates usually use the same .php extension; so I have to make it switch by folders or sometimes manually invoke M-x web-mode. Here's my current configuration:
(defun add-auto-mode (mode &rest patterns)
(mapc (lambda (pattern)
(add-to-list 'auto-mode-alist (cons pattern mode)))
patterns))
(add-auto-mode 'web-mode
"*html*" "*twig*" "*tmpl*" "\\.erb" "\\.rhtml$" "\\.ejs$" "\\.hbs$"
"\\.ctp$" "\\.tpl$" "/\\(views\\|html\\|templates\\)/.*\\.php$")
BTW, try to separate your PHP files and templates and keep the mixed HTML/PHP file as simple as possible (refactor long PHP blocks into functions in a pure file). The code will be easier to read/follow.

Mediawiki tag extension - chained tags do not get processed

I'm trying to develop a simple Tag Extension for Mediawiki. So far I'm basically outputing the input as it comes. The problem arises when there are chained tags. For instance, for this example:
function efSampleParserInit( Parser &$parser ) {
$parser->setHook( 'sample', 'efSampleRender' );
return true;
}
function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
return "hello ->" . $input . "<- hello";
}
If I write this in an article:
This is the text <sample type="1">hello my <sample type="2">brother</sample> John</sample>
Only the first sample tag gets processed. The other one isn't. I guess I should work with the $parser object I receive so I return the parsed input, but I don't know how to do it.
Furthermore, Mediawiki's reference is pretty much non existant, it would be great to have something like a Doxygen reference or something.

Use $parser->recursiveTagParse(), as shown at Manual:Tag_extensions#How do I render wikitext in my extension?.
It is kind of a clunky interface, and not very well documented. The underlying reason why such a seemingly natural thing to do is so tricky to accomplish is that it sort of goes against the original design intent of tag extensions — they were originally conceived as low-level filters that take in raw unparsed text and spit out HTML, completely bypassing normal parsing. So, for example, if you wanted to include some content written in Markdown (such as a StackOverflow post) on a wiki page, the idea was that you could install a suitable extension and then write
<markdown>
**Look,** here's some Markdown text!
</markdown>
on the page, and the MediaWiki parser would leave everything between the <markdown> tags alone and just hand it over to the extension for parsing.
Of course, it turned out that most people who wrote MediaWiki tag extensions didn't really want to replace the parser, but just to apply some tweaks to it. But the way the tag extension interface was set up, the only way to do that was to call the parser recursively. I've sometimes thought it would be nice to add a new parser extension type to MediaWiki, something that looked like tag extensions but didn't interrupt normal parsing in such a drastic manner. Alas, my motivation and copious free time hasn't so far been sufficient to actually do something about it.

Why use an XML parser?

I'm a somewhat experienced PHP scripter, however I just dove into parsing XML and all that good stuff.
I just can't seem to wrap my head around why one would use a separate XML parser instead of just using the explode function, which seems to be just as simple. Here's what I've been doing (assuming there is a valid XML file at the path xml.php):
$contents = file_get_contents("xml.php");
$array1 = explode("<a_tag>", $contents);
$array2 = explode("</a_tag>", $array1[1]);
$data = $array2[0];
So my question is, what is the practical use for an XML parser if you can just separate the values into arrays and extract the data from that point?
Thanks in advance! :)

Excuse me for not going into details but for starters try parsing
$contents = '<a xmlns="urn:something">
<a_tag>
<b>..</b>
<related>
<a_tag>...</a_tag>
</related>
</a_tag>
<foo:a_tag xmlns:foo="urn:something">
<![CDATA[This is another <a_tag> element]]>
</foo:a_tag>
</a>';
with your explode-approach. When you're done we can continue with some trickier things ;-)

In a nutshell, its consistency. Before XML came into wide use there were numerous undocumented formats for keeping information in files. One of the motivators behind XML was to create a well defined, standard document format. With this well defined format in place, a general set of parsing tools could be developed that would work consistently on documents so long as the documents adhered to the aforementioned well defined format.
In some specific cases, your example code will work. However, if the document changes
...
<!-- adding an attribute -->
<a_tag foo="bar">Contents of the Tag</a_tag>
...
...
<!-- adding a comment to the contents -->
<a_tag>Contents <!-- foobar --> of the Tag</a_tag>
...
Your parsing code will probably break. Code written using a correctly defined XML parser will not.

XML parsers:
Handle encoding
May have xpath support
Allow you to easily modify and save the XML; append/remove child nodes, add/remove attributes, etc.
Don't need to load the whole file into memory (except from DOM parsers)
Know about namespaces
...

How would you explode the same file if a_tag had an attribute?
explode("<a_tag>" ... will work differently than explode("<a_tag attr='value'>" ..., after all.
XML Parsers understand the XML specification. Explode can only handle the simplest of cases, and will most likely fail in a lot of instances of that case.

Using a proven XML parsing method will make the code more maintainable and easy to read. It will also make it more easily adaptable should the schema change, and it can make it easier to determine error conditions. XPath and XSLT exist for a reason, they are proven ways to deal with XML data in a sensible, legible manner. I'd suggest you use whichever is applicable in your given situation. Remember, just because you think you're only writing code for one specific purpose, you never know what a piece of well-written code could evolve into.

WMD markdown editor - HTML to Markdown conversion

I am using wmd markdown editor on a project and had a question:
When I post the form containing the markdown text area, it (as expected) posts html to the server. However, say upon server-side validation something fails and I need to send the user back to edit their entry, is there anyway to refill the textarea with just the markdown and not the html? Since as I have it set up, the server only has access to the post data (which is in the form of html) so I can't seem to think of a way to do this. Any ideas? Preferably a non-javascript based solution.
Update: I found an html to markdown converter called markdownify. I guess this might be the best solution for displaying the markdown back to the user...any better alternatives are welcome!
Update 2: I found this post on SO and I guess there is an option to send the data to the server as markdown instead of html. Are there any downsides to simply storing the data as markdown in the database? What about displaying it back to the user (outside of an editor)? Maybe it would be best to post both versions (html AND markdown) to the server...
SOLVED: I can simply use php markdown to convert the markdown to html serverside.

I would suggest that you simply send and store the text as Markdown. This seems to be what you have settled on already. IMO, storing the text as Markdown will be better because you can safely strip all HTML tags out without worrying about loss of formatting - this makes your code safer, because it will be harder to use a XSS attack (although it may still be possible though - I am only saying that this part will be safer).

One thing to consider is that WMD appears to have certain different edge cases from certain server-side Markdown implementations. I've definitely seen some quirks in the previews here that have shown up differently after submission (I believe one such case was attempting to escape a backtick surrounded by backticks). By sending the converted preview over the wire, you can ensure that the preview is accurate.
I'm not saying that should make your decision, but it's something to consider.

Try out Pandoc. It's a little more comprehensive and reliable than Markdownify.

The HTML you are seeing is just a preview, so it's not a good idea to store that in the database as you will run into issues when you try to edit. It's also not a good idea to store both versions (markdown and HTML) as the HTML is just an interpretation and you will have the same problems of editing and keeping both versions in synch.
So the best idea is to store the markdown in the db and then convert it server side before displaying.
You can use PHP Markdown for this purpose. However this is not 100% perfect conversion of what you are seeing on the javascript side and may need some tweaking.
The version that the Stack Exchange network is using is a C# implementation and there should be a python implementation you downloaded with the version of wmd you have.
The one thing I tweaked was the way new lines were rendered so I changed this in markdown.php to convert some new lines into <br> starting from line 626 in the version I have:
var $span_gamut = array(
#
# These are all the transformations that occur *within* block-level
# tags like paragraphs, headers, and list items.
#
# Process character escapes, code spans, and inline HTML
# in one shot.
"parseSpan" => -30,
# Process anchor and image tags. Images must come first,
# because ![foo][f] looks like an anchor.
"doImages" => 10,
"doAnchors" => 20,
# Make links out of things like `<http://example.com/>`
# Must come after doAnchors, because you can use < and >
# delimiters in inline links like [this](<url>).
"doAutoLinks" => 30,
"encodeAmpsAndAngles" => 40,
"doItalicsAndBold" => 50,
"doHardBreaks" => 60,
"doNewLines" => 70,
);
function runSpanGamut($text) {
#
# Run span gamut tranformations.
#
foreach ($this->span_gamut as $method => $priority) {
$text = $this->$method($text);
}
return $text;
}
function doNewLines($text) {
return nl2br($text);
}

"Safe" markdown processor for PHP?

Is there a PHP implementation of markdown suitable for using in public comments?
Basically it should only allow a subset of the markdown syntax (bold, italic, links, block-quotes, code-blocks and lists), and strip out all inline HTML (or possibly escape it?)
I guess one option is to use the normal markdown parser, and run the output through an HTML sanitiser, but is there a better way of doing this..?
We're using PHP markdown Extra for the rest of the site, so we'd already have to use a secondary parser (the non-"Extra" version, since things like footnote support is unnecessary).. It also seems nicer parsing only the *bold* text and having everything escaped to <a href="etc">, than generating <b>bold</b> text and trying to strip the bits we don't want..
Also, on a related note, we're using the WMD control for the "main" site, but for comments, what other options are there? WMD's javascript preview is nice, but it would need the same "neutering" as the PHP markdown processor (it can't display images and so on, otherwise someone will submit and their working markdown will "break")
Currently my plan is to use the PHP-markdown -> HTML santiser method, and edit WMD to remove the image/heading syntax from showdown.js - but it seems like this has been done countless times before..
Basically:
Is there a "safe" markdown implementation in PHP?
Is there a HTML/javascript markdown editor which could have the same options easily disabled?
Update: I ended up simply running the markdown() output through HTML Purifier.
This way the Markdown rendering was separate from output sanitisation, which is much simpler (two mostly-unmodified code bases) more secure (you're not trying to do both rendering and sanitisation at once), and more flexible (you can have multiple sanitisation levels, say a more lax configuration for trusted content, and a much more stringent version for public comments)

PHP Markdown has a sanitizer option, but it doesn't appear to be advertised anywhere. Take a look at the top of the Markdown_Parser class in markdown.php (starts on line 191 in version 1.0.1m). We're interested in lines 209-211:
# Change to `true` to disallow markup or entities.
var $no_markup = false;
var $no_entities = false;
If you change those to true, markup and entities, respectively, should be escaped rather than inserted verbatim. There doesn't appear to be any built-in way to change those (e.g., via the constructor), but you can always add one:
function do_markdown($text, $safe=false) {
$parser = new Markdown_Parser;
if ($safe) {
$parser->no_markup = true;
$parser->no_entities = true;
}
return $parser->transform($text);
}
Note that the above function creates a new parser on every run rather than caching it like the provided Markdown function (lines 43-56) does, so it might be a bit on the slow side.

JavaScript Markdown Editor Hypothesis:
Use a JavaScript-driven Markdown Editor, e.g., based on showdown
Remove all icons and visual clues from the Toolbar for unwanted items
Set up a JavaScript filter to clean-up unwanted markup on submission
Test and harden all JavaScript changes and filters locally on your computer
Mirror those filters in the PHP submission script, to catch same on the server-side.
Remove all references to unwanted items from Help/Tutorials
I've created a Markdown editor in JavaScript, but it has enhanced features. That took a big chunk of time and SVN revisions. But I don't think it would be that tough to alter a Markdown editor to limit the HTML allowed.

How about running htmlspecialchars on the user entered input, before processing it through markdown? It should escape anything dangerous, but leave everything that markdown understands.
I'm trying to think of a case where this wouldn't work but can't think of anything off hand.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.