PHP syntax highlighting [closed] - php

I'm searching for a PHP syntax highlighting engine that can be customized (i.e. I can provide my own tokenizers for new languages) and that can handle several languages simultaneously (i.e. on the same output page). This engine has to work well together with CSS classes, i.e. it should format the output by inserting <span> elements that are adorned with class attributes. Bonus points for an extensible schema.
I am not looking for a client-side syntax highlighting script (JavaScript).
So far, I'm stuck with GeSHi. Unfortunately, GeSHi fails abysmally for several reasons. The main reason is that the different language files define completely different, inconsistent styles. I've worked hours trying to refactor the different language definitions down to a common denominator but since most definition files are in themselves quite bad, I'd finally like to switch.
Ideally, I'd like to have an API similar to CodeRay, Pygments or the JavaScript dp.SyntaxHighlighter.
Clarification:
I'm looking for code highlighting software written in PHP, not software for highlighting PHP (since I need to use it from inside PHP).

Since no existing tool satisfied my needs, I wrote my own. Lo and behold:
Hyperlight
Usage is extremely easy: just use
<?php hyperlight($code, 'php'); ?>
to highlight code. Writing new language definitions is relatively easy, too – using regular expressions and a powerful but simple state machine. By the way, I still need a lot of definitions so feel free to contribute.

[I marked this answer as Community Wiki because you're specifically not looking for Javascript]
http://softwaremaniacs.org/soft/highlight/ is a JavaScript syntax highlighting library that supports PHP along with the following languages:
Python, Ruby, Perl, PHP, XML, HTML, CSS, Django, Javascript, VBScript, Delphi, Java, C++, C#, Lisp, RenderMan (RSL and RIB), Maya Embedded Language, SQL, SmallTalk, Axapta, 1C, Ini, Diff, DOS .bat, Bash
It uses <span class="keyword"> style markup.
It has also been integrated in the dojo toolkit (as a dojox project: dojox.lang.highlight)
Though it is not the most popular way to run a web server, JavaScript is not strictly limited to the client side: server-side JavaScript engine/platform combinations exist too.

I found this simple generic syntax highlighter written in PHP here and modified it a bit:
<?php
/**
 * Original => http://phoboslab.org/log/2007/08/generic-syntax-highlighting-with-regular-expressions
 * Usage => `echo SyntaxHighlight::process('source code here');`
 */
class SyntaxHighlight {
    public static function process($s) {
        // ENT_COMPAT encodes double quotes but leaves single quotes alone,
        // which is what the string patterns below expect
        $s = htmlspecialchars($s, ENT_COMPAT);
        // Workaround for escaped backslashes
        $s = str_replace('\\\\', '\\\\<e>', $s);
        $tokens = array(); // This array will be filled from the regexp callback
        // Comments/Strings (preg_replace_callback() stands in for the
        // /e modifier, which was removed in PHP 7)
        $s = preg_replace_callback(
            '/(
                \/\*.*?\*\/|
                \/\/.*?\n|
                \#.[^a-fA-F0-9]+?\n|
                &lt;\!\-\-[\s\S]+\-\-&gt;|
                (?<!\\\)&quot;.*?(?<!\\\)&quot;|
                (?<!\\\)\'(.*?)(?<!\\\)\'
            )/isx',
            function ($m) use (&$tokens) {
                return self::replaceId($tokens, $m[1]);
            },
            $s
        );
        $regexp = array(
            // Punctuations
            '/([\-\!\%\^\*\(\)\+\|\~\=\`\{\}\[\]\:\"\'<>\?\,\.\/]+)/'
            => '<span class="P">$1</span>',
            // Numbers (also look for Hex)
            '/(?<!\w)(
                (0x|\#)[\da-f]+|
                \d+|
                \d+(px|em|cm|mm|rem|s|\%)
            )(?!\w)/ix'
            => '<span class="N">$1</span>',
            // Make the bold assumption that an
            // all uppercase word has a special meaning
            '/(?<!\w|>|\#)(
                [A-Z_0-9]{2,}
            )(?!\w)/x'
            => '<span class="D">$1</span>',
            // Keywords
            '/(?<!\w|\$|\%|\#|>)(
                and|or|xor|for|do|while|foreach|as|return|die|exit|if|then|else|
                elseif|new|delete|try|throw|catch|finally|class|function|string|
                array|object|resource|var|bool|boolean|int|integer|float|double|
                real|string|array|global|const|static|public|private|protected|
                published|extends|switch|true|false|null|void|this|self|struct|
                char|signed|unsigned|short|long
            )(?!\w|=")/ix'
            => '<span class="K">$1</span>',
            // PHP/Perl-Style Vars: $var, %var, #var
            '/(?<!\w)(
                (\$|\%|\#)(\->|\w)+
            )(?!\w)/ix'
            => '<span class="V">$1</span>'
        );
        $s = preg_replace(array_keys($regexp), array_values($regexp), $s);
        // Paste the comments and strings back in again
        $s = str_replace(array_keys($tokens), array_values($tokens), $s);
        // Delete the "Escaped Backslash Workaround Token" (TM)
        // and replace tabs with four spaces.
        $s = str_replace(array('<e>', "\t"), array('', '    '), $s);
        return '<pre><code>' . $s . '</code></pre>';
    }
    // Regexp callback to replace every comment or string with a uniqid and save
    // the matched text in an array.
    // This way, strings and comments are stripped out and won't be processed
    // by the other expressions searching for keywords etc.
    private static function replaceId(&$a, $match) {
        $id = "##r" . uniqid() . "##";
        // String or Comment? (the input is already HTML-escaped, so an HTML
        // comment starts with "&lt;!--")
        if(substr($match, 0, 2) == '//' || substr($match, 0, 2) == '/*' || substr($match, 0, 2) == '##' || substr($match, 0, 7) == '&lt;!--') {
            $a[$id] = '<span class="C">' . $match . '</span>';
        } else {
            $a[$id] = '<span class="S">' . $match . '</span>';
        }
        return $id;
    }
}
?>
Demo: http://phpfiddle.org/lite/code/1sf-htn
Update
I just created a PHP port of my own JavaScript generic syntax highlighter here → https://github.com/taufik-nurrohman/generic-syntax-highlighter/blob/master/generic-syntax-highlighter.php
How to use:
<?php require 'generic-syntax-highlighter.php'; ?>
<pre><code><?php echo SH('<div class="foo"></div>'); ?></code></pre>

It might be worth looking at PEAR's Text_Highlighter (documentation)
I think it won't by default output html exactly how you want it, but it does provide extensive capabilities for customisation (i.e. you can create different renderers/parsers)

I had exactly the same problem, but as I was very short on time and needed really good code coverage, I decided to write a PHP wrapper around the Pygments library.
It's called PHPygmentizator. It's really simple to use, and I wrote a very basic manual. As PHP is primarily a web development language, I structured it around that fact and made it very easy to integrate into almost any kind of website.
It supports configuration files and if that isn't enough and somebody needs to modify stuff in the process it also fires events.
Demo of how it works can be found on basically any post of my blog which contains source code, this one for example.
With default config you can just provide it a string in this format:
Any text here.
[pygments=javascript]
var a = function(ar1, ar2) {
return null;
}
[/pygments]
Any text.
So it highlights code between tags (tags can be customized in configuration file) and leaves the rest untouched.
Additionally, I have already written a syntax recognition library (it uses an algorithm that would probably be classified as Bayesian probability) which automatically recognizes the language a code block is written in, and which can easily be hooked to one of PHPygmentizator's events to provide automatic language recognition. I will probably make it public some time this week, since I need to tidy up the structure a bit and write some basic documentation. If you supply it with enough "learning" data it recognizes languages amazingly well; I even tested minified JavaScript and languages with similar keywords and structures, and it has never made a mistake.

Another option is to use the GPL Highlight GUI program by Andre Simon which is available for most platforms. It converts PHP (and other languages) to HTML, RTF, XML, etc. which you can then cut and paste into the page you want. This way, the processing is only done once.
The HTML is also CSS based, so you can change the style as you please.
Personally, I use dp.SyntaxHighlighter, but that uses client side Javascript, so it doesn't meet your needs. It does have a nice Windows Live plugin though which I find useful.

A little late to chime in here, but I've been working on my own PHP syntax highlighting library. It is still in its early stages, but I am using it for all code samples on my blog.
Just checked out Hyperlight. It looks pretty cool, but it is doing some pretty crazy stuff. Nested loops, processing line by line, etc. The core class is over 1000 lines of code.
If you are interested in something simple and lightweight check out Nijikodo:
http://www.craigiam.com/nijikodo

Why not use PHP's built-in syntax highlighter?
http://php.net/manual/en/function.highlight-string.php
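A minimal sketch of that suggestion, assuming the code to highlight is already in a string (the `$source` variable and the sample snippet are illustrative). Passing `true` as the second argument makes `highlight_string()` return the markup instead of printing it. Note that it only understands PHP and colors tokens with inline `style` attributes rather than CSS classes, so it does not cover the multi-language and class-based requirements from the question.

```php
<?php
// Illustrative input; any PHP source (including the opening tag) works.
$source = '<?php echo "Hello, world"; ?>';

// Second argument true: return the highlighted HTML instead of printing it.
$html = highlight_string($source, true);

echo $html;
```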

PHP Prettify works fine so far, and offers more customization than highlight_string.

Krijn Hoetmer's PHP Highlighter provides a completely customizable PHP class to highlight PHP syntax. The HTML it generates validates under a strict doctype and is completely stylable with CSS.

Related

Perf. issue / Too much calls to string manipulation functions

This question is about optimizing a part of a program that I add to many projects as a common tool.
This 'template parser' takes a text pattern containing HTML code (or anything else) with several specific tags, and replaces these tags with developer-supplied values when rendered.
The few classes involved do a great job and work as expected; when needed, they make it possible to isolate design elements and easily adapt or replace design blocks.
The patterns I use look like this (nothing exceptional, I admit):
<table class="{class}" id="{id}">
<block_row>
<tr>
<block_cell>
<td>{content}</td>
</block_cell>
</tr>
</block_row>
</table>
(The example code below consists of adapted extracts.)
The parsing does things like this:
// Variables are sorted by position in pattern string
// Position is read once and stored in cache to avoid
// multiple calls to strpos or str_replace
foreach ($this->aVars as $oVar) {
    $sString = substr($sString, 0, $oVar->start) .
        $oVar->value .
        substr($sString, $oVar->end);
}
// Once pattern loaded, blocks look like --¤(<block_name>)¤--
foreach ($this->aBlocks as $sName => $oBlock) {
    $sBlockData = $oBlock->parse();
    $sString = str_replace('--¤(' . $sName . ')¤--', $sBlockData, $sString);
}
Using the class instance, I call methods like 'addBlock' or 'setVar' to fill my pattern with data.
This system has several disadvantages, among them the many objects held in memory (one for each block instance) and the many calls to string manipulation functions during parsing (preg_replace in the past, now just a bunch of substr and pals).
The program I'm working on makes heavy use of these templates, and they are about to reach their limits.
My question is the following (no need for code, just ideas or a lead to follow):
Should I conclude that I've overused this approach and try to arrange things so that I don't need so many calls to these templates (for instance by improving caching, or by using only simple view scripts)?
Do you know a technical solution for feeding a structure with data that would not consume as many resources as the one I wrote? While writing this I thought of XSLT: would it be suitable, and if so, could it improve performance?
Thanks in advance for your advice.
Use the XDebug extension to profile your code and find out exactly which parts of the code are taking the most time.

PHP library for HTML output

Is there a standard output library that "knows" that PHP outputs to HTML?
For instance:
1. var_dump - this should be wrapped in <pre> or maybe in a table if the variable is an array
2. a version of echo that adds a "<br/>\n" in the end
3. Somewhere in the middle of PHP code I want to add an H3 title:
.
?><h3><?= $title ?></h3><?
Out of php and then back in. I'd rather write:
tag_wrap($title, 'h3');
or
h3($title);
Obviously I can write a library myself, but I would prefer to use a conventional way if there is one.
Edit
3 Might not be a good example - I don't get much for using alternative syntax and I could have made it shorter.
1 and 2 are useful for debugging and quick testing.
I doubt that anyone would murder me for using some high-level html emitting functions of my own making when it saves a lot of writing.
In regards to #1, try xdebug's var_dump override, if you control your server and can install PHP extensions. The remote debugger and performance tools provided by xdebug are great additions to your arsenal. If you're looking only for pure PHP code, consider Kint or dBug to supplement var_dump.
In regards to #2 and #3, you don't need to do this. Rather, you probably shouldn't do this.
PHP makes a fine HTML templating language. Trying to create functions to emit HTML is going to lead you down a horrible road of basically implementing the DOM in a horribly awkward and backwards way. Considering how horribly awkward the DOM already is, that'll be quite an accomplishment. The future maintainers of your code are going to want to murder you for it.
There is no shame in escaping out of PHP to emit large blocks of HTML. Escaping out to emit a single tag, though, is completely silly. Don't do that, and don't create functions that do that. There are better ways.
First, don't forget that print and echo aren't functions, they're built in to the language parser. Because they're special snowflakes, they can take a list without parens. This can make some awkward HTML construction far less awkward. For example:
echo '<select name="', htmlspecialchars($select_name), '">';
foreach($list as $key => $value) {
    echo '<option value="',
        htmlspecialchars($key),
        '">',
        htmlspecialchars($value),
        '</option>';
}
echo '</select>';
Next, PHP supports heredocs, a method of creating a double-quoted string without the double-quotes:
$snippet = <<<HERE
<h1>$heading</h1>
<p>
<span class="aside">$aside_content</span>
$actual_content
</p>
HERE;
With these two tools in your arsenal, you may find yourself breaking out of PHP far less frequently.
While there is a case for helper functions (there are only so many ways you can build a <select>, for example), you want to use these carefully and create them to reduce copy and paste, not simply to create them. The people that will be taking care of the code you're writing five years from now will appreciate you for it.
You should use a PHP template engine and just separate the presentation and logic entirely. It makes no sense for an educated programmer to try to create a library like that.

Mediawiki tag extension - chained tags do not get processed

I'm trying to develop a simple Tag Extension for MediaWiki. So far I'm basically outputting the input as it comes. The problem arises when there are chained tags. For instance, for this example:
function efSampleParserInit( Parser &$parser ) {
    $parser->setHook( 'sample', 'efSampleRender' );
    return true;
}
function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
    return "hello ->" . $input . "<- hello";
}
If I write this in an article:
This is the text <sample type="1">hello my <sample type="2">brother</sample> John</sample>
Only the first sample tag gets processed. The other one isn't. I guess I should work with the $parser object I receive so I return the parsed input, but I don't know how to do it.
Furthermore, MediaWiki's reference is pretty much nonexistent; it would be great to have something like a Doxygen reference.
Use $parser->recursiveTagParse(), as shown at Manual:Tag_extensions#How do I render wikitext in my extension?.
It is kind of a clunky interface, and not very well documented. The underlying reason why such a seemingly natural thing to do is so tricky to accomplish is that it sort of goes against the original design intent of tag extensions — they were originally conceived as low-level filters that take in raw unparsed text and spit out HTML, completely bypassing normal parsing. So, for example, if you wanted to include some content written in Markdown (such as a StackOverflow post) on a wiki page, the idea was that you could install a suitable extension and then write
<markdown>
**Look,** here's some Markdown text!
</markdown>
on the page, and the MediaWiki parser would leave everything between the <markdown> tags alone and just hand it over to the extension for parsing.
Of course, it turned out that most people who wrote MediaWiki tag extensions didn't really want to replace the parser, but just to apply some tweaks to it. But the way the tag extension interface was set up, the only way to do that was to call the parser recursively. I've sometimes thought it would be nice to add a new parser extension type to MediaWiki, something that looked like tag extensions but didn't interrupt normal parsing in such a drastic manner. Alas, my motivation and copious free time hasn't so far been sufficient to actually do something about it.
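Applied to the hook from the question, the recursiveTagParse() suggestion might look roughly like the sketch below. It only makes sense inside MediaWiki, which supplies the Parser and PPFrame types; the only change from the question's code is that the tag body is parsed before it is wrapped, and the `$inner` variable name is illustrative.

```php
<?php
// Sketch only: this runs inside MediaWiki, which supplies Parser and PPFrame.
function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
    // Let the parser process the tag body first, so nested <sample> tags
    // (and other wikitext) are expanded before we wrap the result.
    $inner = $parser->recursiveTagParse( $input, $frame );
    return "hello ->" . $inner . "<- hello";
}
```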

WMD markdown editor - HTML to Markdown conversion

I am using wmd markdown editor on a project and had a question:
When I post the form containing the markdown text area, it (as expected) posts HTML to the server. However, say something fails server-side validation and I need to send the user back to edit their entry: is there any way to refill the textarea with just the markdown and not the HTML? As I have it set up, the server only has access to the post data (which is in the form of HTML), so I can't think of a way to do this. Any ideas? Preferably a non-JavaScript-based solution.
Update: I found an html to markdown converter called markdownify. I guess this might be the best solution for displaying the markdown back to the user...any better alternatives are welcome!
Update 2: I found this post on SO and I guess there is an option to send the data to the server as markdown instead of html. Are there any downsides to simply storing the data as markdown in the database? What about displaying it back to the user (outside of an editor)? Maybe it would be best to post both versions (html AND markdown) to the server...
SOLVED: I can simply use php markdown to convert the markdown to html serverside.
I would suggest that you simply send and store the text as Markdown. This seems to be what you have settled on already. IMO, storing the text as Markdown will be better because you can safely strip all HTML tags out without worrying about loss of formatting - this makes your code safer, because it will be harder to mount an XSS attack (although it may still be possible; I am only saying that this part will be safer).
One thing to consider is that WMD appears to have certain different edge cases from certain server-side Markdown implementations. I've definitely seen some quirks in the previews here that have shown up differently after submission (I believe one such case was attempting to escape a backtick surrounded by backticks). By sending the converted preview over the wire, you can ensure that the preview is accurate.
I'm not saying that should make your decision, but it's something to consider.
Try out Pandoc. It's a little more comprehensive and reliable than Markdownify.
The HTML you are seeing is just a preview, so it's not a good idea to store that in the database, as you will run into issues when you try to edit. It's also not a good idea to store both versions (markdown and HTML), as the HTML is just an interpretation and you will have the same problems of editing and keeping both versions in sync.
So the best idea is to store the markdown in the db and then convert it server side before displaying.
You can use PHP Markdown for this purpose. However this is not 100% perfect conversion of what you are seeing on the javascript side and may need some tweaking.
The version that the Stack Exchange network is using is a C# implementation and there should be a python implementation you downloaded with the version of wmd you have.
The one thing I tweaked was the way new lines were rendered so I changed this in markdown.php to convert some new lines into <br> starting from line 626 in the version I have:
var $span_gamut = array(
    #
    # These are all the transformations that occur *within* block-level
    # tags like paragraphs, headers, and list items.
    #
    # Process character escapes, code spans, and inline HTML
    # in one shot.
    "parseSpan" => -30,
    # Process anchor and image tags. Images must come first,
    # because ![foo][f] looks like an anchor.
    "doImages" => 10,
    "doAnchors" => 20,
    # Make links out of things like `<http://example.com/>`
    # Must come after doAnchors, because you can use < and >
    # delimiters in inline links like [this](<url>).
    "doAutoLinks" => 30,
    "encodeAmpsAndAngles" => 40,
    "doItalicsAndBold" => 50,
    "doHardBreaks" => 60,
    "doNewLines" => 70,
);
function runSpanGamut($text) {
    #
    # Run span gamut transformations.
    #
    foreach ($this->span_gamut as $method => $priority) {
        $text = $this->$method($text);
    }
    return $text;
}
function doNewLines($text) {
    return nl2br($text);
}

"Safe" markdown processor for PHP?

Is there a PHP implementation of markdown suitable for using in public comments?
Basically it should only allow a subset of the markdown syntax (bold, italic, links, block-quotes, code-blocks and lists), and strip out all inline HTML (or possibly escape it?)
I guess one option is to use the normal markdown parser, and run the output through an HTML sanitiser, but is there a better way of doing this..?
We're using PHP Markdown Extra for the rest of the site, so we'd already have to use a secondary parser (the non-"Extra" version, since things like footnote support are unnecessary). It also seems nicer to parse only the *bold* text and have everything else escaped to &lt;a href="etc"&gt;, rather than generating <b>bold</b> text and trying to strip the bits we don't want.
Also, on a related note, we're using the WMD control for the "main" site, but for comments, what other options are there? WMD's javascript preview is nice, but it would need the same "neutering" as the PHP markdown processor (it can't display images and so on, otherwise someone will submit and their working markdown will "break")
Currently my plan is to use the PHP-markdown -> HTML sanitiser method, and edit WMD to remove the image/heading syntax from showdown.js - but it seems like this has been done countless times before.
Basically:
Is there a "safe" markdown implementation in PHP?
Is there a HTML/javascript markdown editor which could have the same options easily disabled?
Update: I ended up simply running the markdown() output through HTML Purifier.
This way the Markdown rendering was separate from output sanitisation, which is much simpler (two mostly-unmodified code bases), more secure (you're not trying to do both rendering and sanitisation at once), and more flexible (you can have multiple sanitisation levels, say a more lax configuration for trusted content, and a much more stringent version for public comments).
PHP Markdown has a sanitizer option, but it doesn't appear to be advertised anywhere. Take a look at the top of the Markdown_Parser class in markdown.php (starts on line 191 in version 1.0.1m). We're interested in lines 209-211:
# Change to `true` to disallow markup or entities.
var $no_markup = false;
var $no_entities = false;
If you change those to true, markup and entities, respectively, should be escaped rather than inserted verbatim. There doesn't appear to be any built-in way to change those (e.g., via the constructor), but you can always add one:
function do_markdown($text, $safe=false) {
    $parser = new Markdown_Parser;
    if ($safe) {
        $parser->no_markup = true;
        $parser->no_entities = true;
    }
    return $parser->transform($text);
}
Note that the above function creates a new parser on every run rather than caching it like the provided Markdown function (lines 43-56) does, so it might be a bit on the slow side.
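If the per-call construction turns out to matter, one possible variation is to keep one parser per mode in a static array, in the same spirit as the caching the provided Markdown function does. This is only a sketch: `do_markdown_cached` is a made-up name, and it assumes markdown.php has been loaded so that Markdown_Parser, its `no_markup` and `no_entities` properties, and `transform()` are available as described above.

```php
<?php
// Hypothetical helper: caches one Markdown_Parser per mode ("safe" or "raw").
// Assumes markdown.php (which defines Markdown_Parser) is already loaded.
function do_markdown_cached($text, $safe = false) {
    static $parsers = array();
    $key = $safe ? 'safe' : 'raw';
    if (!isset($parsers[$key])) {
        $parser = new Markdown_Parser;
        // In safe mode, escape markup and entities instead of passing them through.
        $parser->no_markup = $safe;
        $parser->no_entities = $safe;
        $parsers[$key] = $parser;
    }
    return $parsers[$key]->transform($text);
}
```

Keeping the two modes in separate parser objects avoids flipping the flags back and forth on a shared instance between calls.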
JavaScript Markdown Editor Hypothesis:
Use a JavaScript-driven Markdown Editor, e.g., based on showdown
Remove all icons and visual clues from the Toolbar for unwanted items
Set up a JavaScript filter to clean-up unwanted markup on submission
Test and harden all JavaScript changes and filters locally on your computer
Mirror those filters in the PHP submission script, to catch same on the server-side.
Remove all references to unwanted items from Help/Tutorials
I've created a Markdown editor in JavaScript, but it has enhanced features. That took a big chunk of time and SVN revisions. But I don't think it would be that tough to alter a Markdown editor to limit the HTML allowed.
How about running htmlspecialchars on the user entered input, before processing it through markdown? It should escape anything dangerous, but leave everything that markdown understands.
I'm trying to think of a case where this wouldn't work but can't think of anything off hand.
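A quick way to probe that idea is to look at what htmlspecialchars() does to Markdown's own syntax. The sketch below uses only built-in PHP and made-up sample inputs. It suggests two things to check before relying on this approach: the `>` that starts a block quote gets escaped, and a link with a `javascript:` URL passes through unchanged, so what the Markdown processor does with such links still matters.

```php
<?php
// Made-up sample inputs for illustration.
$quote = '> quoted text';
$link  = '[click me](javascript:alert(1))';

// The block-quote marker is escaped, so Markdown no longer sees a leading ">".
echo htmlspecialchars($quote), "\n"; // &gt; quoted text

// Nothing in this link is an HTML special character, so it is left as-is.
echo htmlspecialchars($link), "\n";  // [click me](javascript:alert(1))
```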
