Building a markup parser in PHP

I have created a very simple markup parser in PHP. However, it currently uses str_replace to switch between markup and HTML. How can I make a "code" box of sorts (it will eventually use GeSHi) that leaves its contents untouched?
Right now, the following markup: [code][b]Some bold text[/b][/code] winds up rendered as a code box containing <b>Some bold text</b>.
I need some advice: which option is best?
Have it check each word individually and parse it only if it is not inside a [code] box.
Leave it as is, so users are unable to post markup inside [code].
Create another type of code box specifically for HTML markup, and have [code] automatically revert any < or > to [ and ].
Is there maybe even another option? This is a bit tougher than I thought it would be...
EDIT: Is it even worth adding a code box type thing to this parser? I mean, I see how it could be useful, but it is a rather large amount of effort for a small result.

Why would you reinvent the wheel?
There are plenty of markup parsers already.
Anyway, just str_replace won't help much. You'd have to learn regular expressions and as they say, now you've got two problems ;)

You could break the input down into multiple strings before using str_replace. Split the string on the [code] and [/code] tags, saving each code box in a separate string and noting where it sat in the original. Then run str_replace on the rest of the string and do whatever parsing you like on the code box strings. Finally, reinsert the parsed code boxes and display the result.
Just a word of warning though: turning user input into HTML for display strikes me as inherently dangerous. I'd recommend a large amount of input sanitization and checking before converting it to HTML for redisplay.
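The split-and-reinsert idea above could look roughly like the following. This is only a sketch: the tag table (MARKUP_MAP) and the function name render_markup are made up for illustration, and the <pre><code> wrapper is an assumption about how the code box should be rendered. Everything is escaped with htmlspecialchars before any markup is converted, which covers the sanitization warning.

```php
<?php
// Hypothetical tag table; the real parser's replacements may differ.
const MARKUP_MAP = [
    '[b]' => '<b>', '[/b]' => '</b>',
    '[i]' => '<i>', '[/i]' => '</i>',
];

function render_markup(string $input): string
{
    // Split on [code]...[/code], keeping each code block (delimiters included) as its own piece.
    $pieces = preg_split('#(\[code\].*?\[/code\])#s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($pieces as $index => $piece) {
        // Escape first so user-supplied HTML can never reach the page.
        $safe = htmlspecialchars($piece, ENT_QUOTES, 'UTF-8');
        if ($index % 2 === 1) {
            // Odd indexes are the captured code blocks: drop "[code]" and "[/code]", keep the rest as typed.
            $out .= '<pre><code>' . substr($safe, 6, -7) . '</code></pre>';
        } else {
            // Even indexes are ordinary text: apply the usual str_replace pass.
            $out .= str_replace(array_keys(MARKUP_MAP), array_values(MARKUP_MAP), $safe);
        }
    }
    return $out;
}

echo render_markup('[b]bold[/b] [code][b]Some bold text[/b][/code]');
// prints: <b>bold</b> <pre><code>[b]Some bold text[/b]</code></pre>
```

The contents of the code box never pass through str_replace, so a highlighter such as GeSHi could later be applied to that piece alone.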

HTML beautifier is pretty sweet: http://pear.php.net/package/PHP_Beautifier. They have a decorator class as well that would probably suit your needs.

To be clear, your problem is in two parts. The first part is the need for a lexical analyzer to break your "code" into the keywords (tokens) of your "language." Once you have a lexical analyzer, you then need a parser. A parser is code that accepts the tokens of your language one at a time and processes them in a logical manner (usually by recursive descent).
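As a rough illustration of that two-part split, here is a minimal sketch that handles only [b] and [i]. The function names tokenize and parse are invented for this example, and the choice to close unclosed tags implicitly is an assumption, not something the question specifies.

```php
<?php
// Lexer: split the input into tag tokens such as "[b]" or "[/b]" and plain-text tokens.
function tokenize(string $input): array
{
    return preg_split('#(\[/?(?:b|i)\])#', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Parser: consume tokens one at a time and recurse whenever an opening tag is seen.
function parse(array $tokens, int &$pos = 0, ?string $until = null): string
{
    $html = '';
    while ($pos < count($tokens)) {
        $token = $tokens[$pos++];
        if ($token === $until) {
            return $html;                        // the matching close tag ends this level
        }
        if (preg_match('#^\[(b|i)\]$#', $token, $m)) {
            $inner = parse($tokens, $pos, "[/{$m[1]}]");
            $html .= "<{$m[1]}>{$inner}</{$m[1]}>";
        } else {
            // Plain text, or a stray close tag: emit it escaped.
            $html .= htmlspecialchars($token, ENT_QUOTES, 'UTF-8');
        }
    }
    return $html;                                // input ended: unclosed tags are closed implicitly
}

echo parse(tokenize('[b]bold [i]both[/i][/b] plain'));
// prints: <b>bold <i>both</i></b> plain
```

Because each opening tag gets its own recursive call, the output is always properly nested, which a flat str_replace pass cannot guarantee.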

Related

How to properly format text retrieved from a website?

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website with a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once it is displayed in an Android TextView, the text is formatted all wrong: spread out, uneven and very disorderly.
Also worth mentioning: for various reasons, I cannot suggest that the company modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough to break down into meaningful blocks (header, body and unrelated/unimportant stuff), then you could apply restyling principles to those. A simple solution is to just strip all the tags, get only the text nodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach should work.
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on its own line, as if it forced a line break. Detecting these situations isn't impossible but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
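The "strip the tags and stitch the text nodes together" idea can be sketched with PHP's DOM extension as below. The function name extract_text is invented, and joining the pieces with a single space is an assumption: it ignores exactly the block/inline distinction described above, so words split across inline tags would come out with a space in the middle.

```php
<?php
function extract_text(string $html): string
{
    libxml_use_internal_errors(true);            // tolerate malformed markup quietly
    $doc = new DOMDocument();
    // The leading XML declaration is a common trick to make loadHTML read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $parts = [];
    // Visit text nodes in document order, skipping script and style contents.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        // Collapse runs of whitespace, including no-break spaces (U+00A0).
        $text = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $node->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}

echo extract_text('<p>Hello&nbsp;&nbsp;world</p><div><span>second</span> line</div>');
```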
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with php (that was really easy to do) and then displaying the retrieved strings into formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.
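For reference, stripping "some tags" on the PHP side might be as small as the sketch below. Which tags to keep is purely an assumption here, and keep_basic_tags is a made-up name; note that strip_tags leaves the attributes of the allowed tags in place.

```php
<?php
function keep_basic_tags(string $html): string
{
    // Turn no-break spaces (raw or as an entity) into ordinary spaces.
    $html = str_replace(["\u{00A0}", '&nbsp;'], ' ', $html);
    // Remove every tag except the ones listed in the second argument.
    return strip_tags($html, '<p><br><b><i><ul><li>');
}

echo keep_basic_tags('<div><p>Hi&nbsp;<span style="x">there</span></p></div>');
// prints: <p>Hi there</p>
```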

Why use dom to parse webpages instead of regex?

I've been searching for questions about finding contents in a page, and a lot of answers recommend using the DOM when parsing web pages instead of regex. Why is that? Does it improve the processing time or something?
A DOM parser is actually parsing the page.
A regular expression is searching for text, not understanding the HTML's semantic meaning.
It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
So, instead of using the wrong tool for the job (a text parsing tool for a structured document) use the right tool for the job (an HTML parser for parsing HTML).
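To make the XPath comparison concrete, here is a small sketch that collects link targets with PHP's DOM extension. The function name find_links is invented for the example; the XPath expression //a[@href] says "every a element that has an href attribute" without caring how the tag was written.

```php
<?php
function find_links(string $html): array
{
    libxml_use_internal_errors(true);            // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ((new DOMXPath($doc))->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Upper-case tags, single quotes, extra attributes and commented-out links are all handled by the parser.
print_r(find_links('<p><a href="/one">1</a> <A HREF=\'/two\' class=x>2</A><!-- <a href="/no"> --></p>'));
```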
I can't hear that "HTML is not a regular language ..." anymore. Regular expressions (as used in today's languages) also aren't regular.
The simple answer is:
A regular expression is not a parser, it describes a pattern and it will match that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course regexes can be part of a parser, I don't know, but I assume nearly every parser will use regexes internally to find certain sub patterns.
If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you will not be able to create this pattern, because it's practically impossible to cover all the corner cases, or dependencies like "find all links, but only if they are green and not pink".
In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it easy to access all links, or all the elements of a certain table, or ...
To my mind, it's safer to use regexes on pages where you don't control the content: the HTML might not be well formed, and then a DOM parser can fail.
Edit:
Well, considering what I just read, you should probably use regexes only for very small things, like getting all the links in a document, etc.

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal english text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder: < and > need no escaping in PCRE, but the / in a closing tag does if / is also your pattern delimiter).
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
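The first two steps above might be sketched like this. The helper name replace_nth is made up; the point is only that the regex does the matching while plain PHP does the counting.

```php
<?php
// Replace only the $n-th match of $pattern; the counting is done in PHP, not in the regex.
function replace_nth(string $pattern, string $replacement, string $subject, int $n): string
{
    $seen = 0;
    return preg_replace_callback($pattern, function (array $m) use (&$seen, $n, $replacement) {
        $seen++;
        // Return the replacement for the n-th match and the original text for all others.
        return $seen === $n ? $replacement : $m[0];
    }, $subject);
}

echo replace_nth('/\band\b/', 'AND', 'a and b and c and d', 2);
// prints: a and b AND c and d
```

Swapping in a tag pattern for the third step is then a one-line change, with all the usual caveats about matching HTML with a regex.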
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context-free language, and as such can only be properly parsed with a context-free parser.
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
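For the "5th <DIV> block" in the question, an XPath-based version might look like the sketch below, using PHP's bundled DOM extension rather than phpQuery or QueryPath. The function name replace_nth_div is invented, and counting nested divs in document order is an assumption about what "5th" should mean. Note that saveHTML returns a full document, so a doctype and html/body wrappers are added around fragments.

```php
<?php
function replace_nth_div(string $html, int $n, string $replacement): string
{
    libxml_use_internal_errors(true);            // accept sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // (//div)[n] selects the n-th div in document order, nested ones included.
    $target = (new DOMXPath($doc))->query("(//div)[$n]")->item(0);
    if ($target !== null) {
        // Empty the element, then put the replacement text inside it.
        while ($target->firstChild) {
            $target->removeChild($target->firstChild);
        }
        $target->appendChild($doc->createTextNode($replacement));
    }
    return $doc->saveHTML();
}

$page = '<div>1</div><div>2<div>3</div></div><div>4</div><div>5</div><div>6</div>';
echo replace_nth_div($page, 5, 'xxxxxxx');
```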
OK, a few ground rules.
Don't post a question like that: wrapping the whole question in <pre> formatting will only keep people away.
Regular expressions are awesome!
If you want to consider other options, look at how to read HTML as an XML document and parse it using XPath.
@tobyodavies is pretty much correct; I'll include an answer in case you want to do it anyway.
Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this:
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER);
Suppose you only need the 5th one. $matches[$i][0] is the whole string of match number $i + 1, because the array is zero-indexed, so the 5th match is $matches[4]:
if (count($matches) >= 5)
{
    // Full match, including the <div> tags
    $myMatch = $matches[4][0];
    // Captured text between the tags
    $matchedText = $matches[4][1];
}
Good luck in your efforts...

preg_match_all Problems

I'm trying to match a string that contains HTML code that contains parameters to a function in Javascript.
There are several of these functions found in the string containing the HTML code.
changeImg('location','size');
Let's say that I want to grab the location within the single quotes, how would I go about doing this? There are more than one instance in the string.
Thanks in advance.
This is a fairly common question on SO and the answer is always the same: regular expressions are a poor tool for parsing HTML. Use an XML or HTML parser. That's what they're for. Take a look at Parse HTML With PHP And DOM for an example and Parsing Html The Cthulhu Way for a bit of background.
Parsing JavaScript is even harder, as it can appear inside <script> tags and attributes, so at the very least you'd need to get every <script> tag and parse its contents, as well as every element and parse its event handlers (onclick, etc.).
I'm reminded of this quote:
"Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems." -- Jamie Zawinski
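The approach described in this answer (let an HTML parser find the places where JavaScript lives, then look inside only those) might be sketched as follows. The function name find_change_img_locations is invented, only onclick handlers are checked, and the final pattern assumes both arguments are simple single-quoted literals as in the question; it is a narrow text match, not a JavaScript parser.

```php
<?php
function find_change_img_locations(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Gather JavaScript source: <script> bodies plus onclick handler attributes.
    $sources = [];
    foreach ((new DOMXPath($doc))->query('//script | //@onclick') as $node) {
        $sources[] = $node->nodeValue;
    }

    // Within that JavaScript only, match changeImg('location','size') and capture the first argument.
    $locations = [];
    foreach ($sources as $js) {
        if (preg_match_all("/changeImg\(\s*'([^']*)'\s*,\s*'[^']*'\s*\)/", $js, $m)) {
            $locations = array_merge($locations, $m[1]);
        }
    }
    return $locations;
}

$html = '<a href="#" onclick="changeImg(\'img/a.png\',\'large\');">A</a>'
      . '<script>changeImg(\'img/b.png\', \'small\');</script>';
print_r(find_change_img_locations($html));
```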

PHP regex to get contents of a specific span element

I need some help ... I'm a bit (read total) n00b when it comes to regular expressions, and need some help writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?
There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot with trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
(Bad, not working) example which shows why you should not use regex for parsing html.
/<span lang="en">(.*)<\/span>/
Will output:
English Content</span><span lang="fr">French content
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
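For the span example in the question, a parser-based sketch using PHP's bundled DOM extension could look like this. The function name span_text_for_lang and the guard on $lang are illustrative assumptions; the guard is there only because the value is placed directly into the XPath string.

```php
<?php
function span_text_for_lang(string $html, string $lang): ?string
{
    // Hypothetical guard: only simple language codes are allowed into the query string.
    if (!preg_match('/^[A-Za-z-]+$/', $lang)) {
        return null;
    }

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Select the first <span> whose lang attribute equals $lang.
    $span = (new DOMXPath($doc))->query("//span[@lang='$lang']")->item(0);
    return $span === null ? null : $span->textContent;
}

$source = '<span lang="en">English Content</span><span lang="fr">French content</span>';
echo span_text_for_lang($source, 'fr');
// prints: French content
```

Unlike the greedy regex above, the match stops at the element boundary because the parser knows where each span ends.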
There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...
