Using htmlentities with BBCode - php

What I am trying to achieve is a sound method for using BBCode where all other data is passed through htmlentities(). I think this should be possible. I was thinking along the lines of exploding around the [] symbols, but there may be a better way.
Any ideas?

htmlentities() does not parse. Rather, it encodes data so it can be safely displayed in an HTML document.
Your code will follow these steps:
1. Parse the BBCode (by some mechanism); don't do any escaping yet, just parse the input text into tags. The output of the parser step is a tree structure, consisting of nodes that represent block tags and nodes that represent plain text (the text between the tags).
2. Render the tree to your output format (HTML). At this point, you escape the plain text in your data structure using htmlentities.
Your rendering function will be recursive. Some pseudo-functions that specify the relationship:
render( x : plain text ) = htmlentities(x)
render( x : bold tag ) = "<b>" . render( get_contents_of ( x )) . "</b>"
render( x : quote tag ) = "<blockquote>" .
render( get_contents_of( x )) .
"</blockquote>"
...
render( x : anything else) = "<b>Invalid tag!</b>"
So you see, the htmlentities only comes into play when you're rendering your output to HTML, so the browser does not get confused if your plain-text is supposed to contain special characters such as < and >. If you were rendering to plain text, you wouldn't use the function call at all, for example.
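The parse-then-render idea above can be sketched roughly as follows. This is only a minimal illustration, not a full BBCode implementation: the function names (bb_parse, bb_parse_tokens, bb_render) are made up for this example, only [b], [i] and [quote] are recognised, and unknown or mismatched tags are simply treated as plain text rather than reported as invalid.

```php
<?php
// Step 1 (hypothetical helper): turn a flat token list into a tree.
// Text nodes are strings; tag nodes are ['tag' => name, 'children' => [...]].
function bb_parse_tokens(array $tokens, int &$pos, ?string $open = null): array
{
    $children = [];
    while ($pos < count($tokens)) {
        $tok = $tokens[$pos++];
        if (preg_match('/^\[(b|i|quote)\]$/', $tok, $m)) {
            // Opening tag: everything up to its closing tag becomes a child list.
            $children[] = ['tag' => $m[1], 'children' => bb_parse_tokens($tokens, $pos, $m[1])];
        } elseif ($open !== null && $tok === "[/$open]") {
            return $children; // matching closing tag ends this node
        } else {
            $children[] = $tok; // plain text (stray closing tags end up here too)
        }
    }
    return $children; // unclosed tags are closed implicitly at end of input
}

function bb_parse(string $input): array
{
    // Split on the known tags but keep them in the token list.
    $tokens = preg_split('/(\[\/?(?:b|i|quote)\])/', $input, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $pos = 0;
    return bb_parse_tokens($tokens, $pos);
}

// Step 2: render the tree; escaping happens only here, and only for text nodes.
function bb_render(array $nodes): string
{
    $map  = ['b' => 'b', 'i' => 'i', 'quote' => 'blockquote'];
    $html = '';
    foreach ($nodes as $node) {
        if (is_string($node)) {
            $html .= htmlentities($node, ENT_QUOTES, 'UTF-8');
        } else {
            $el    = $map[$node['tag']];
            $html .= "<$el>" . bb_render($node['children']) . "</$el>";
        }
    }
    return $html;
}

echo bb_render(bb_parse('[b]1 < 2[/b] and <script>'));
// prints: <b>1 &lt; 2</b> and &lt;script&gt;
```

Because the escaping lives in the text branch of the renderer, user-supplied angle brackets never reach the browser as markup, while the tags generated from BBCode do.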

Related

php tcpdf writeHTML() fail when using <sometext>

I am using tcpdf to export some data to pdf and it works almost fine.
If my content has <some text> in it (and some text IS NOT an html tag) the pdf loses the styling up to this point and is corrupted.
How can I fix this?
TCPDF is very picky about input HTML. It supports a limited subset of HTML and all other tags are stripped. Last time I checked, TCPDF uses strip_tags to remove the unsupported tags (and anything that looks like tags). As such, there's a solid chance that depending on the particulars of your input it may remove more than just the tag-looking text in some cases.
Let's take this example:
<p>This is <some text></p>
In the browser <some text> would be treated like a tag with an empty "text" attribute, and not be visible to the user. TCPDF is indeed basically treating it the same way. I looked for a reliable automated fix for this, but most HTML parsers treat it the same way. The Tidy extension outright removes it for instance.
If you need those greater-than and less-than symbols in your text, they need to be encoded as HTML entities. For example, it needs to be passed to TCPDF like this:
<p>This is &lt;some text&gt;</p>
One possible solution is to run expected plaintext (that is, sources that aren't expected to contain any HTML) through htmlspecialchars before adding it to your HTML. This will encode those special HTML characters into entities.
For example, a user's form entry into a field:
$someusertext = $_POST['somefield'];
$html = '<p>'.htmlspecialchars($someusertext, ENT_COMPAT, 'UTF-8').'</p>';

HTML string from API not being converted to HTML rendered entities

I'm receiving a chunk of HTML via API call and trying to put that HTML into my template for it to render. But, instead of rendering, it's being printed out as if it were a string.
Example of HTML string from API:
\u0026lt;p\u0026gt;\u0026lt;strong\u0026gt;Hello World\u0026lt;/strong\u0026gt;\u0026lt;/p\u0026gt;
Then in the controller I convert the string to HTML entities
$content = htmlspecialchars_decode($response['content']);
The issue I'm having is that in my view, the HTML is printed (tags and all) instead of being rendered as HTML:
In view code:
<?= $content ?>
End result:
<p><strong>Hello World</strong></p>
How can I get this HTML chunk to render in my view?
Your data looks double-encoded. Try:
$content = htmlspecialchars_decode(htmlspecialchars_decode($response['content']));
From the looks of your string, you don't need htmlspecialchars_decode(), since you don't have HTML special characters directly in it.
I suspect you are getting your data in JSON format. A JSON-encoded value can have special characters converted to Unicode escape sequences such as \u0026 for & (for example, the PHP json_encode() function does that when used with the JSON_HEX_* options).
Try this:
$content = json_decode('"' . $response['content'] . '"');
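A rough sketch of what the two decoding steps do to the sample string from the question. Note that json_decode() only resolves the \u0026 escapes; the result still contains entities such as &lt;, so for this particular sample one htmlspecialchars_decode() pass still appears to be needed afterwards. The variable names are just for illustration, and decoded HTML should only be echoed unescaped if the API is trusted.

```php
<?php
// Sample value copied from the question (single quotes keep the backslashes literal).
$raw = '\u0026lt;p\u0026gt;\u0026lt;strong\u0026gt;Hello World\u0026lt;/strong\u0026gt;\u0026lt;/p\u0026gt;';

// Step 1: wrap the value in quotes so it is a JSON string literal, then decode it.
// Each \u0026 becomes "&". This fails (returns null) if the value contains an
// unescaped double quote.
$entities = json_decode('"' . $raw . '"');

// Step 2: turn &lt; and &gt; back into real angle brackets.
$content = htmlspecialchars_decode($entities);

echo $content; // <p><strong>Hello World</strong></p>
```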

Stripping input to complete plain text

I am currently finalising the code for my comment system, and I want it to work a little like Stack Overflow does with its posts. I would like my users to be able to use bold, italic and underscore only, and to do that I would use the following:
_ Text _ * BOLD * -Italic-
Now, firstly I would like to know a way of stripping a comment completely clean of any tags, html entities and such, so for example, if a user was to use any html / php tags, they would be removed from the input.
I am currently using strip_tags, but that can leave the output looking quite nasty. Even if an abusive or blatant XSS/injection attempt has been made, I would still like the plain text to be output in full and not chopped up, as strip_tags seems to make an absolute mess of that.
What I will then do, is replace the asterisks with bold html tags, and so on AFTER stripping the content clean of html tags.
How do people suggest I do this? Currently this is the comment sanitize function:
function cleanNonSQL( $str )
{
return strip_tags( stripslashes( trim( $str ) ) );
}
PHP tags are surrounded by <? and ?>, or maybe <% and %> on some ages-old installations, so removing PHP tags can be managed with a regex:
$cleaned=preg_replace('/\<\?.*?\?\>/', '', $dirty);
$cleaned=preg_replace('/\<\%.*?\%\>/', '', $cleaned);
Next you take care of the HTML tags. These are surrounded by < and >. Again you can do this with a regex:
$cleaned=preg_replace('/\<.*?\>/','',$cleaned);
This will transform
$dirty='blah blah blah <?php echo $this; ?> foo foo foo <some> html <tag> and <another /> bar bar';
into (note the doubled spaces left behind where tags were removed)
$cleaned='blah blah blah  foo foo foo  html  and  bar bar';
You could try using regular expressions to strip the tags, such as:
preg_replace("/\<(.+?)\>/", '', $str);
Not sure if that's what you're looking for, but it will remove anything inside < and >. You can also make it a little more foolproof by requiring the first character after the < to be a letter.
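The "first character must be a letter" variant might look something like the sketch below. The function name strip_taglike is invented for this example, and the pattern is only an approximation of what a tag looks like; text such as "a <b and c> d" would still be removed.

```php
<?php
// Hypothetical helper: only strip sequences that start like a real tag,
// i.e. "<" or "</" immediately followed by a letter.
function strip_taglike(string $str): string
{
    return preg_replace('/<\/?[a-zA-Z][^>]*>/', '', $str);
}

echo strip_taglike('mood :<. but <b>your</b> blog :>');
// prints: mood :<. but your blog :>
```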
The correct way is not to delete html tags from your user's comment, but to tell the browser that the following text should not be interpreted as HTML, Javascript, whatever. Imagine someone wants to post example code like we do here on stackoverflow. If you just bluntly remove any parts of a comment that seem to be code, you will mess up the user's comment.
The solution is to use htmlentities which will escape symbols used for html markup in the comment so that it will actually show up as just text in the browser.
For example, the browser will interpret a < as the beginning of an HTML tag. If you just want the browser to display a <, you have to write &lt; in the source code. htmlentities will convert all the relevant symbols into their HTML entities for you.
Longer Example
echo htmlentities("<b>this text should not be bold</b><?php echo PHP_SELF;?>");
Outputs
&lt;b&gt;this text should not be bold&lt;/b&gt;&lt;?php echo PHP_SELF;?&gt;
The browser will output
<b>this text should not be bold</b><?php echo PHP_SELF;?>
Consider the following real-life example with the solution you accepted. Imagine a user writing this comment:
i'm in a bad mood today :<. but your blog made me really happy :>
You will now do your preg_replace("/\<(.+?)\>/", '', $comment); on the text and it will remove half the comment:
i'm in a bad mood today :
If that's what you wanted, never mind this answer. If you don't, use htmlentities.
If you want to save the comment as a file and not have the server interpret PHP code inside it, save it with an extension like '.html' or '.txt', so that the web server won't call the PHP interpreter in the first place. There is usually no need to escape PHP code.
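For the comment system described in the question, one way to combine the two ideas is to escape everything first and only then turn the markers into tags. The following is a rough sketch under a few assumptions: the function name format_comment is invented, the marker-to-tag mapping (* for bold, _ for underline, - for italic) is taken from the question, and the patterns are deliberately simple.

```php
<?php
// Hypothetical helper: escape first, then convert the markers from the question.
function format_comment(string $raw): string
{
    // After this call the text contains no raw <, >, & or quotes.
    $safe = htmlspecialchars(trim($raw), ENT_QUOTES, 'UTF-8');

    // *text* becomes bold.
    $safe = preg_replace('/\*(.+?)\*/', '<b>$1</b>', $safe);
    // _text_ becomes underlined; the lookarounds skip snake_case_words.
    $safe = preg_replace('/(?<![A-Za-z0-9])_(.+?)_(?![A-Za-z0-9])/', '<u>$1</u>', $safe);
    // -text- becomes italic; the lookarounds skip hyphenated-words.
    $safe = preg_replace('/(?<![A-Za-z0-9])-(.+?)-(?![A-Za-z0-9])/', '<i>$1</i>', $safe);

    return $safe;
}

echo format_comment('*hi* <b>x</b> -there-');
// prints: <b>hi</b> &lt;b&gt;x&lt;/b&gt; <i>there</i>
```

Since the only tags in the output are the ones this function adds itself, anything the user typed as HTML shows up as visible text instead of being removed or interpreted.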

How to deal with html excerpt?

It introduced a lot of non-closed tags like below:
<div>
<table>...</table>
The </div> is truncated by code like this:
(strlen($row['body']) > 200 ? substr($row['body'],0,200) . '...' : $row['body'])
And the layout of the whole page is broken. How do I deal with it?
Assuming that $row['body'] contains HTML that you want to truncate to 200 visible characters:
Strip out HTML Tags
This is the quickest fix but may not be what you want:
$body= strip_tags($row['body']);
echo(strlen($body) > 200 ? substr($body,0,200) . '...' : $body);
Parse HTML and truncate text
Using PHP's DOMDocument class you can parse the HTML, count the length of the text inside the tags, and remove everything after the character limit from the HTML contained in $row['body'] while retaining well-formed HTML.
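A minimal sketch of the DOMDocument approach, assuming UTF-8 input and that the dom and mbstring extensions are available. The function names (truncate_html, truncate_node) and the wrapper markup are invented for this example; whitespace between tags counts toward the limit, and a stray closing </div> in the input would confuse the wrapper.

```php
<?php
// Hypothetical helper: walk the tree, shorten the text node that crosses the
// limit, and delete every node that comes after it.
function truncate_node(DOMNode $node, int &$remaining): void
{
    // Copy the child list first because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child); // past the limit: drop the node entirely
        } elseif ($child instanceof DOMText) {
            $len = mb_strlen($child->data, 'UTF-8');
            if ($len > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8') . '...';
                $remaining = 0;
            } else {
                $remaining -= $len;
            }
        } elseif ($child->hasChildNodes()) {
            truncate_node($child, $remaining);
        }
    }
}

function truncate_html(string $html, int $limit): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup without warnings
    // The meta tag tells the parser the input is UTF-8; the wrapper div gives
    // the walk a single known root.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div>' . $html . '</div>');
    libxml_clear_errors();

    $root = $doc->getElementsByTagName('div')->item(0);
    $remaining = $limit;
    truncate_node($root, $remaining);

    // Serialise only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncate_html('<div><b>Hello world</b> and more</div>', 5);
```

Because whole nodes are removed rather than bytes cut from a string, every tag that remains in the output still has its closing tag.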
Here is an example of how to create a post preview or excerpt that contains valid HTML. It parses the HTML using PHP's DOMDocument as leepowers suggested:
http://bizzybytes.com/html-excerpt-php
I assume you left them out for brevity, but I see no <tr> or <td> tags.
It should be:
<div>
<table><tr><td>...</td></tr></table>
Also use the following, since you may have HTML embedded in your $row['body']:
htmlspecialchars(strlen($row['body']) > 200 ? substr($row['body'],0,200) . '...' : $row['body'])
DC

PHP: HTML markup problem while displaying trimmed HTML markups

I am using a Richtext box control to post some data in one page.
and I am saving the data to my db table with the HTML mark up Ex : This is <b >my bold </b > text
I am displaying the first 50 characters of this column on another page. If I save a sentence of more than 50 characters with the bold tag applied, and then trim it on the other page to take the first 50 characters, I lose the closing b tag (</b>), so the bold gets applied to the rest of the content on that page.
How can I solve this? How can I check which open tags are not closed? Is there an easy way to do this in PHP? Is there a function that removes all of my HTML tags / markup and gives me the sentence as plain text?
http://php.net/strip_tags
The strip_tags function will remove any tags you might have.
Yes
$textWithoutTags = strip_tags($html);
I generally use HTML::Truncate for this. Of course, being a Perl module, you won't be able to use it directly in your PHP - but the source code does show a working approach (which is to use an HTML parser).
An alternative approach might be to truncate as you are doing at the moment, and then try to fix the result using Tidy.
If you want the HTML tags to remain, but be closed properly, see PHP: Truncate HTML, ignoring tags. Otherwise, read on:
strip_tags will remove HTML tags, but not HTML entities (such as &amp;), which could still cause problems if truncated.
To handle entities as well, one can use html_entity_decode to decode entities after stripping tags, then trim, and finally reencode the entities with htmlspecialchars:
$text = "1 &lt; 2\n";
print $text;
print htmlspecialchars(substr(html_entity_decode(strip_tags($text), ENT_QUOTES), 0, 3));
(Note use of ENT_QUOTES to actually convert all entities.)
Result:
1 &lt; 2
1 &lt;
Footnote: The above only works for entities that can be decoded to ISO-8859-1. If you need support for international characters, you should already be working with UTF-8 encoded strings, and simply need to specify that in the call to html_entity_decode.
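For UTF-8 strings, the same strip, decode, trim, re-encode sequence could be sketched like this. The helper name plain_excerpt is made up for the example, and it assumes the mbstring extension is available so that the cut falls between characters rather than in the middle of a multi-byte sequence.

```php
<?php
// Hypothetical helper; assumes UTF-8 input and the mbstring extension.
function plain_excerpt(string $html, int $limit): string
{
    // Strip tags, then decode entities so each one counts as a single character.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    // Cut by characters (not bytes), then re-encode the HTML special characters.
    return htmlspecialchars(mb_substr($text, 0, $limit, 'UTF-8'), ENT_QUOTES, 'UTF-8');
}

echo plain_excerpt('<b>Caf&eacute; &lt; th&eacute;</b>', 6);
// prints: Café &lt;
```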
