Strip only valid html - php

I'm trying to strip HTML tags from a piece of text. However the trouble is that whatever I use - regex, strip_tags etc.. Comes up across the same problem: It will also strip text which is not HTML but looks like it.
Some <foo#bar.com> Content--> Some Content
Some <Content which looks like this --> Some
Is there a way I can get around this?

A fully correct solution would be a full-fledged HTML parser. See this legendary question for a full discussion.
A simple 80% solution would be to look for all known tags and strip them.
RegExp('</?(a|b|blockquote|cite|dd|dl|dt|...|u)\b.*?>')
The code would be more readable if you use an array of tags and build expressions as you loop through them. It will not handle comments nicely, so if you need more than hack quality, don't do it with a hack approach. If you need correctness, use an actual HTML parser (e.g. DOMDocument in PHP).

Have you tried the HTML purifier library? You can configure it to strip all tags out, I've found the library very reliable.

Related

Using php to get all translatable text from a website/html-page

I'm trying to set up a translation tool to translate websites. What I want to do is import html-code and get all translatable texts from that site.
One idea would be to use strip_tags, but it would ignore strings that could be translated such as alt-texts, title-texts and probably others that I don't have on my mind yet. Is there a clean way to do this?
In this case you need to parse HTML and extract text yourself. As you, probably, already know, parsing HTML with regular expressions is A Bad Idea (tm). SO, the only right solution is to parse DOM of the document. On this step you are free to use any tools including standard DOMDocument class.
If you are looking for some libraries or scripts to help, i would suggest to look on html2text which could be used commercially. As i see, it doesn't support attributes for <img> tags, but it's very easy to fix (use <a> tag as example).
If you are looking for some automated text extraction, then you should definitely look on something like Bolierpipe.
I would personally use the DOM Crowler component from Symfony2, which is a nice wrapper around php DOM functions and start from there.

Regex (or better suggestion) on html with correct nesting

I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.
I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
I'm looking a solution to this. Ideally an efficient one. I've seen solutions that involve matching on start and end tags separately and keeping track of their index in the content to work out which tags go together but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).
The solution must be PHP only as this is the language I have to work with. I'm parsing html snippets (think body sections from a wordpress blog and you're not too far off). If there is a better than regex solution, I'm all ears!
UPDATE:
Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow which is why the title specifically mentions better solutions.
FURTHER UPDATE:
I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document or is going to add <head> etc... when I get the html back out, it's not an acceptable solution.
As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.
Relevant questions
Robust and Mature HTML Parser for PHP
How do you parse and process HTML/XML in PHP?
Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.

Alternative of html purifier

I want to accept to accept the html input from user and post it on my site also want to make sure that it don't create problem with my site template due to dirty html code.
I was using html purifier in the past but Html purifier is not working on one of my server. So I am searching for best alternative.
Which is purely written in php.
which can fix the dirty html code like
</div> it is dirty code as div is closed without opening.
Simple solution without third-party libraries: create a DOMDocument and call loadHTML on it with your input. Surrounded the input with <html> and <body> tags if you are only parsing a little snippet. You'll probably want to suppress warnings too, as you'll get them spat out for common bad HTML.
Then simply walk over the resulting document tree, removing any elements and attributes you've not included in a known-good list. You should also check allowed URL attributes to ensure they use known-good schemes like http:, and not potentially troublesome schemes like javascript:. If you want to go the extra mile you can check that only allowed combinations of elements are nested inside each other (this is easier the smaller number of elements you're allowing).
Finally, serialise the snippet's node again using saveHTML. Because you're creating new markup from a DOM, not maintaining the original—potentially malformed—markup, that's a whole class of odd-markup injection techniques you're blocking.
You can try PHP Tidy, which is the Tidy library in PHP.
I believe Tidy will help close your tags, but it isn't as comprehensive as HTML Purifier which can remove valid but unwanted tags or attributes (i.e. JavaScript onclick events, that kind of thing).
Be aware that Tidy requires libtidy to be installed on your server, so it's not just straight PHP.
I know Pádraic Brady has been working on an alternative to HTML Purifier for Zend Framework, though I think its just experimental code at this time
http://framework.zend.com/wiki/pages/viewpage.action?pageId=25002168
http://github.com/padraic/wibble
Do also consider HTMLawed at https://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/
From that page;
use to filter, secure & sanitize HTML in blog comments or forum posts, generate XML-
compatible feed items from web-page excerpts, convert HTML to XHTML, pretty-print
HTML, scrape web-pages, reduce spam, remove XSS code, etc.
Note that Tidy/HTML Tiday is NOT a anti XSS solution. It is a clean and repair utility which allows you to clean HTML, XHTML, and XML markup.
HTMLawed is a 55kb single php file whilst HTML Purifer is a 3 MB folder.

How to parse HTML for minification in PHP?

I'm looking to write an algorithm to compress HTML output for a CMS I'm writing in PHP, written with the CodeIgniter framework.
I was thinking of trying to remove whitespace between any angle brackets, except the <script>, <pre>, and <style> elements, and simply ignoring those elements for simplicity. I should clarify that this is whitespace between consecutive tags, with no text between them.
How should I go about parsing the HTML to find the whitespace I want to remove?
Edit:
To start off, I want to remove all tab characters that are not in <pre> tags. This can be done with regex, I'm sure, but what are the alternatives?
Don't. Whitespace is negligible. Better to be using output compression, with zlib or here for example
Is there something wrong with the existing HTML minification solutions?
Minify does HTML (as well as CSS and JS).
(That second link goes to the source code, which comments the steps it takes - should be a good leg up if you did want to create your own - it's BSD licensed.)
Also, as Pete says, you'll benefit much more by using gzip compression for your HTML (and CSS/JS/etc), and wont get tripped up by problems such as Gordon mentioned in his comment.

PHP regex to get contents of a specific span element

I need some help ... I'm a bit (read total) n00b when it comes to regular expressions, and need some help writing one to find a specific piece of text contained within a specific HTML tag from PHP.
The source string looks like this:
<span lang="en">English Content</span><span lang="fr">French content</span> ... etc ...
I'd like to extract just the text of the element for a specific language.
Can anyone help?
There are plenty of HTML parsers available for PHP. I suggest you check out one of those, (for example: PHP Simple HTML DOM Parser).
Shooting yourself in the foot with trying to read HTML with regex is a lot easier than you think, and a lot harder to avoid than you wish (especially when you don't know regex thoroughly, and your input is not guaranteed to be 100% clean HTML).
(Bad, not working) example which shows why you should not use regex for parsing html.
/<span lang="en">(.*)<\/span>/
Will output:
English Content</span><span lang="fr">French content
More stuff to read:
Parsing: Beyond Regex
For-the-2,295,485th-time-DO-NOT-PARSE-HTML-WITH-REGULAR-EXPRESSIONS
There's this most awesome class that lets you do SQL-like queries on HTML pages. It might be worth a look:
HTML SQL
I've used it a bunch and I love it.
Hope that helps...

Categories