Text-to-HTML converter for PHP - php

What text to HTML converter for PHP would you recommend?
One example would be Markdown, which is used here at SO. The user just types some text into the text box with some natural formatting: Enter at the end of a line, an empty line at the end of a paragraph, asterisk-delimited bold text, etc. This syntax is then converted to HTML tags.
Simplicity is the main feature we are looking for. There do not need to be many features, but the basic ones that are there should be very intuitive (automatic URL conversion to links, emoticons, paragraphs).
A big plus would be a WYSIWYG editor for it. A half-WYSIWYG one, just like here at SO, would be even better.
Extra points would be if it would fit with Zend Framework well.

Take your pick at http://en.wikipedia.org/wiki/Lightweight_markup_language.
As for Markdown, there's one PHP parser that I've been using called PHP Markdown, and I especially like the Extra extension.
I have actually taken a stab at extending it with my own (undocumented) features. It's available on GitHub (remember that it's the extra branch I've fixed, not master), if you're interested. I've intended to make it a 'proper fork' for a while, but that's another, largely off-topic, story.

The Zend Framework has a WYSIWYG editor bundled with its Dojo integration.
http://framework.zend.com/manual/en/zend.dojo.form.html#zend.dojo.form.elements.editor
... Bring on the extra points!

There's always Textile. It is widely implemented and has a few basic similarities with Markdown. However, I have never seen a WYSIWYG editor for Textile.

You might find upflow useful.

If you want WYSIWYG, I'm a big fan of FCKeditor. It converts user input to HTML before submitting the form, not after, but has a nice PHP library for using it, and a PHP connector for handling file uploading/browsing (along with several other languages).
If you want something that can be read as plain-text but output as HTML, I vote for Markdown.

I will stick with my original idea of adopting Texy.
None of the products mentioned here actually beats it. I had a problem with Texy's syntax, but it seems to be quite standard and is present in other products too.
It is very lightweight, supports a very natural syntax and has a great "half" WYSIWYG editor, Texyla (the wiki is in Czech only).

Related

PHP Basic BBcode

I've managed to successfully integrate BBCode, but I was wondering: if I wanted to dynamically list all the allowed/accepted BBCode, how would I be able to do that? (It can be tedious to write it out manually, and if the BBCode ever changed I'd have to update the list.)
I currently have a BBCode() function, which contains 2 arrays, one which contains the regexes and the other which contains the replacements (HTML), and I return a preg_replace() of the regex array with the replacement (HTML) array.
Cheers and looking forward to your inputs!
Consider using a different markup language like Textile or Markdown. Simply saying that you support Markdown or Textile is decent enough; they're so widely used that users could easily look up the markup for them online.
Textile's syntax hasn't been updated since 2006, so it will likely remain very solid for years to come. Markdown's syntax hasn't been updated since 2004.
Both provide excellent PHP libraries:
http://michelf.com/projects/php-markdown/
http://textile.thresholdstate.com/
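If you do stay with BBCode, the listing asked about in the question can be derived from the same data that drives preg_replace(). The following is only a sketch under assumptions: the function names (bbcode_tags, bbcode, bbcode_help), the three tags and the 'example' field are hypothetical, not taken from the asker's code. The idea is to keep pattern, replacement and a human-readable example in one table instead of two parallel arrays.

```php
<?php
// Hypothetical tag table: pattern, HTML replacement and a user-facing
// example live together, so the help text cannot drift from the parser.
function bbcode_tags(): array
{
    return [
        'b' => [
            'pattern'     => '~\[b\](.*?)\[/b\]~is',
            'replacement' => '<strong>$1</strong>',
            'example'     => '[b]bold[/b]',
        ],
        'i' => [
            'pattern'     => '~\[i\](.*?)\[/i\]~is',
            'replacement' => '<em>$1</em>',
            'example'     => '[i]italic[/i]',
        ],
        'url' => [
            // Only http/https targets are accepted in this sketch.
            'pattern'     => '~\[url=(https?://[^\]\s"]+)\](.*?)\[/url\]~is',
            'replacement' => '<a href="$1">$2</a>',
            'example'     => '[url=http://example.com]link text[/url]',
        ],
    ];
}

// Convert BBCode to HTML using the table above.
function bbcode(string $text): string
{
    $tags = bbcode_tags();
    // Escape first, so only tags from the table can produce HTML.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return preg_replace(
        array_column($tags, 'pattern'),
        array_column($tags, 'replacement'),
        $escaped
    );
}

// List every accepted tag, e.g. for a "formatting help" box.
function bbcode_help(): array
{
    return array_column(bbcode_tags(), 'example');
}
```

Adding a tag then means adding one entry to the table; both the parser and the help listing pick it up.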

Markup filter wanted for a public website

Developing a community site where everyone can post text,
I'm looking for a markup filter:
What is not part of the markup must be escaped (htmlspecialchars()) as it is.
Should turn URL-s automatically into links
Should support some form of basic markups (bold, image, url, pre, list)
Should have a simple parser, that turns user input text into HTML
Content on the site is public to everyone, so XSS must not be allowed to happen.
What do you suggest? What markup language in the first place? BBCode? Wiki? Markdown? Are there any complete API-s with good examples?
PHP is available on the server side. If there is a WYSIWYG-like textarea in addition (like here on SO) that would be a fantastic bonus!
BBCode is old and very verbose (pretty much HTML), but both CKEditor and TinyMCE support it.
Wiki syntax is somewhat confusing to new users and you have to override the CamelCased words.
Markdown seems to be the de facto standard of today's web applications, and StackOverflow uses it. There is a very good PHP implementation. I'm not sure about rich-text editors, but StackOverflow uses WMD Editor.
Also, check out the Wikipedia entry on Lightweight Markup Languages.
I think I'm going to go with BBCode via NBBC: http://nbbc.sourceforge.net/
Great list of tags supported, auto-detects complicated link, configurable, slim implementation.
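Whichever markup language is chosen, the requirements in the question boil down to "escape everything, then re-introduce a small whitelist". Here is a minimal sketch of that order of operations. It is not how NBBC or any Markdown library works internally; the function name filter_markup and the *bold* syntax are assumptions made for illustration.

```php
<?php
// Hypothetical filter: escape first, then add back a tiny whitelist.
function filter_markup(string $input): string
{
    // 1. Escape everything; nothing the user typed survives as raw HTML.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // 2. Re-introduce basic markup (assumed syntax: *bold*).
    $html = preg_replace('~\*([^*\n]+)\*~', '<strong>$1</strong>', $html);

    // 3. Turn bare http/https URLs into links. The lookahead stops the URL
    //    before an escaped quote or angle bracket, while still allowing
    //    &amp; inside query strings. Trailing punctuation is not trimmed.
    $html = preg_replace(
        '~\bhttps?://(?:(?!&(?:quot|#039|lt|gt);)[^\s<])+~i',
        '<a href="$0" rel="nofollow">$0</a>',
        $html
    );

    // 4. Keep the user's line breaks.
    return nl2br($html);
}
```

Because the escaping happens before any tag is generated, the only HTML in the output is what steps 2 to 4 emit. A real filter would need more rules (lists, images, pre), which is where an existing library earns its keep.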

Text Parser with PHP, like Instapaper

I'm trying to write a text parser with PHP, like Instapaper did. What I want to do is; get a webpage and parse it in text-only mode.
It's simple to get the webpage with cURL and strip the HTML tags. But every webpage has some common areas, like header, navigation, sidebar, footer, banners etc. I only want to get the article in text mode and exclude all other parts. It's also simple to exclude those parts if I know the "id" or "class" info. But I'm trying to automate this process and apply it to any page, like Instapaper.
I get all the content between the <body> tags, but I don't know how to exclude the header, sidebar or footer and get only the main article body. I have to develop some logic to get only the main article part.
It's not important for me to find the exact code. It would also be useful to understand how to exclude unnecessary parts as I can try to write my own code with PHP. It would also be useful if there any examples in other languages.
Thanks for helping.
You might try looking at the algorithms behind this bookmarklet, Readability. It has a decent success rate for extracting content from among all the web page rubbish.
A friend of mine made it, which is why I'm recommending it: I know it works, and I'm aware of the many techniques he's using to parse the data. You could apply these techniques to what you're asking.
You can take a look at the source of Goose. It already does a lot of this, like Instapaper's text extraction:
https://github.com/jiminoc/goose/wiki
Have a look at the ExtractContent code from Shuyo Nakatani.
See original Ruby source http://rubyforge.org/projects/extractcontent/ or a port of it to Perl http://metacpan.org/pod/HTML::ExtractContent
You really should consider using a HTML parser for this. Gather similar pages and compare the DOM trees to find the differing nodes.
This article provides a comparison of different approaches. The Java library boilerpipe was rated highly. At the boilerpipe site you can find the author's scientific paper, which compares it to other algorithms.
Not all algorithms suit all purposes. The biggest application of such tools is just getting the raw text to index for a search engine, the idea being that you don't want search results to be messed up by adverts. Such extractions can be destructive, meaning that they won't give you "the best reading area", which is what people want with Instapaper or Readability.
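To give a feel for the kind of heuristic these tools build on, here is a deliberately small sketch in PHP using the DOM extension. It is not the Readability, Goose or boilerpipe algorithm; it only illustrates one common idea: score each container by the text in its direct paragraphs, and penalise link text, since navigation and footers are mostly links. The function name extract_main_text and the scoring formula are assumptions.

```php
<?php
// Hypothetical heuristic: the container whose direct <p> children carry
// the most non-link text is taken to be the article body.
function extract_main_text(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Real-world pages are rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $bestScore = 0;
    $bestParagraphs = [];

    foreach ($xpath->query('//div | //td | //article | //section') as $container) {
        $score = 0;
        $paragraphs = [];

        // Only direct children, so an outer wrapper does not win by default.
        foreach ($xpath->query('./p', $container) as $p) {
            $text = trim($p->textContent);
            $linkChars = 0;
            foreach ($xpath->query('.//a', $p) as $a) {
                $linkChars += strlen(trim($a->textContent));
            }
            $score += strlen($text) - $linkChars;
            $paragraphs[] = $text;
        }

        if ($score > $bestScore) {
            $bestScore = $score;
            $bestParagraphs = $paragraphs;
        }
    }

    return implode("\n\n", $bestParagraphs);
}
```

Pages that do not wrap their text in paragraph elements defeat this sketch entirely, which is one reason the real tools combine several signals.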

Markdown vs. HTML in a CMS

I'm working on a fairly large CMS-like app that includes a forum, wiki pages, etc. What would you choose between Markdown and HTML? I'm concerned about usability and the fact that non-techie people will use this.
Markdown has a very simple syntax but few users know it
with HTML you can use a WYSIWYG editor but they are often terrible
I vote for Markdown.
I picked up Markdown in maybe 5 minutes in writing my first response here. Later I learned more than what I picked up here, but I'd think this to be rather standard.
Markdown is much simpler to get good markup out of, and if you're worried about speed just cache the resulting output.
Markdown is often better, and more easily understood, in plain text than HTML is in a WYSIWYG editor. Also, no-script friendly.
And if you've got a user who wants an embedded object, just drop the HTML code from that YouTube video in and it'll get carried over.
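The "just cache the resulting output" advice can be sketched as a small file cache keyed on a hash of the source text. Everything here is an assumption for illustration: the function name render_cached, the cache directory layout, and the $render callable, which stands in for whatever Markdown library is used.

```php
<?php
// Hypothetical cache wrapper: $render is any callable that turns
// Markdown source into HTML; $cacheDir must exist and be writable.
function render_cached(string $source, callable $render, string $cacheDir): string
{
    // Key on the content hash, so editing the text automatically
    // produces a new cache entry instead of serving stale HTML.
    $file = $cacheDir . '/' . sha1($source) . '.html';

    if (is_file($file)) {
        return file_get_contents($file);
    }

    $html = $render($source);
    file_put_contents($file, $html, LOCK_EX);

    return $html;
}
```

Storing the rendered HTML in a second database column next to the Markdown source works just as well; the point is that conversion happens once per edit, not once per page view.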
If usability is an issue, and the target audience is non-geeks, WYSIWYG wins over Markdown. People are used to the toolbars with formatting buttons, but Markdown is a completely unknown markup language to most people (even "markup language" is completely unknown!).
I've had to explain a Markdown-lookalike wiki syntax to non-geeks at work, and they don't love it. When you want to write something, you want to write something, not look up weird ASCII syntax. Try not to interrupt the users' flow.
I would find a good WYSIWYG editor, like the one in WordPress (TinyMCE). It works ok.
If you wanted to use Markdown and a WYSIWYG editor you can use something like WMD Editor which (I am 99% sure) is what is used here at StackOverflow.
The benefit of using something like this is that your non-tech users get their WYSIWYG editor, your techie users get their Markdown love and you get clean markup. Another added side effect is it may actually teach end users Markdown from using it (or at least in an ideal world...)
WMD Editor also has an instant preview (which you can see when writing posts on StackOverflow), which will show users how changing the Markdown changes the look of their text.
We store XHTML in the database, validated against a restricted XHTML schema. The front-end is either a WYSIWYG editor (for the staff who know how to deal with its quirks) or a plain-text box (for the users, with automatic link detection etc.). We can convert the content back and forth, although the plain-text box loses formatting, so we do not depend on a specific UI. If we needed more than this, I would add another converter from XHTML to markdown.
I prefer Markdown with flat-file CMS, like Grav or other.
It's simpler to change styles, but not any HTML content. And you get one killer feature: you can use git for website content. You can even create branches with "holiday" content.
Actually Markdown is simpler for non-tech people.

Best way to handle LARGE strings of text going into a database?

I have built a number of solutions in the past in which people enter data via a webform, validation checks are applied, regex in some cases and everything gets stored in a database. This data is then used to drive output on other pages.
I have a special case here where a user wants to copy/paste HUGE amounts of text (multiple paragraphs with various headers, links and etc throughout) -- what is the best way to handle this before it goes into a database to provide the best output when it needs to come back out?
So far the best I have come up with is sticking all the output from these fields in PRE tags and using regex to add links where appropriate. I have a database put together with a list of special keywords that need to be bold or have other styles applied to them which works fine. So I can make this work using these approaches but it just seems to me that there is probably a much more graceful way of doing it.
Nicholas
There are a lot of ways you could format the text for output. You could simply use pre tags as you mentioned (if you are worried about wrapping, the CSS white-space property does also support the pre-wrap value, but browser support for this is currently sketchy at best).
There are also a large number of markup languages you could use for more advanced formatting options (some of which are listed here). Stack Overflow itself uses Markdown, which I personally enjoy using very much.
However, as the data is being pasted from another source, a markup language may interfere with the formatting of the text - in which case you could roll your own solution, perhaps using regular expressions and functions like htmlentities and nl2br.
Whatever you decide, I would recommend storing the input in its original form in the database so you can retroactively amend your formatting routines at any time.
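As a rough sketch of the roll-your-own route: keep the pasted text untouched in the database and apply formatting only when it is displayed. The function name format_pasted_text and the "blank line means new paragraph" rule are assumptions; htmlentities and nl2br are the functions mentioned above.

```php
<?php
// Hypothetical output formatter: takes the raw text exactly as stored
// and returns escaped HTML with paragraphs and line breaks preserved.
function format_pasted_text(string $raw): string
{
    // Normalise line endings from Windows and old Mac pastes.
    $text = str_replace(["\r\n", "\r"], "\n", trim($raw));
    if ($text === '') {
        return '';
    }

    // Assumption: one or more blank lines separate paragraphs.
    $paragraphs = preg_split('~\n{2,}~', $text);

    $html = [];
    foreach ($paragraphs as $paragraph) {
        // Escape first, then turn single newlines into <br /> tags.
        $html[] = '<p>' . nl2br(htmlentities($paragraph, ENT_QUOTES, 'UTF-8')) . '</p>';
    }

    return implode("\n", $html);
}
```

Since the stored value is never modified, link detection or keyword highlighting can be added to this function later and will apply to old entries as well.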
If you're expecting a good deal of formatting, you should probably go with a WYSIWYG editor. These editors provide Word-like toolbars which produce (hopefully) valid (X)HTML markup that can be stored directly in a text field in your database. Here are a couple of examples:
FCKeditor - Massive amount of options/tools
Tinymce - A nice alternative.
Markdown - What stackoverflow.com uses
Both FCKeditor and Tinymce have been thoroughly tested and have proven to be reliable. I don't have any experience with markdown but it seems solid.
I've always hated 'forum' formatting tags like [code], [link], etc. Stackoverflow and others have shown that providing an open WYSIWYG editor is safe, reliable, and very easy to use. Just take the output it gives you, run it through some sort of escape function to check for any kind of injection, XSS, etc. and store it in a text field.
