I am using tcpdf to export some data to pdf and it works almost fine.
If my content has <some text> in it (and some text IS NOT an html tag) the pdf loses the styling up to this point and is corrupted.
How can I fix this?
TCPDF is very picky about input HTML. It supports a limited subset of HTML and all other tags are stripped. Last time I checked, TCPDF uses strip_tags to remove the unsupported tags (and anything that looks like tags). As such, there's a solid chance that depending on the particulars of your input it may remove more than just the tag-looking text in some cases.
Let's take this example:
<p>This is <some text></p>
In the browser <some text> would be treated like a tag with an empty "text" attribute, and not be visible to the user. TCPDF is indeed basically treating it the same way. I looked for a reliable automated fix for this, but most HTML parsers treat it the same way. The Tidy extension outright removes it for instance.
If you need those greater-than and less-than symbols in your text, they need to be encoded as HTML entities. For example, it needs to be passed to TCPDF like this:
<p>This is &lt;some text&gt;</p>
One possible solution is to run expected plaintext (that is, sources that aren't expected to contain any HTML) through htmlspecialchars before adding it to your HTML. This will encode those special HTML characters into entities.
For example, a user's form entry into a field:
$someusertext = $_POST['somefield'];
$html = '<p>'.htmlspecialchars($someusertext, ENT_COMPAT, 'UTF-8').'</p>';
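To make the effect of the encoding visible, here is a minimal sketch that prints the string TCPDF would receive. The helper name plainTextToParagraph is hypothetical and only wraps the call shown above.

```php
<?php
// Hypothetical helper: wraps untrusted plain text in a paragraph before it is
// concatenated into the HTML handed to TCPDF.
function plainTextToParagraph(string $text): string
{
    // ENT_COMPAT encodes &, <, > and double quotes; single quotes are left alone.
    return '<p>' . htmlspecialchars($text, ENT_COMPAT, 'UTF-8') . '</p>';
}

echo plainTextToParagraph('This is <some text>'), "\n";
// prints: <p>This is &lt;some text&gt;</p>
```

Only the plain-text part goes through the helper; markup you write yourself stays unencoded.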
Is there a simple php script ignoring html content in a database and not loading it using php?
Like: don't load images, or anchors, or elements with class=""...
Best Regards
Use the strip_tags('your content here') function. It removes all HTML tags from your content and gives plain-text output.
http://php.net/manual/en/function.strip-tags.php
I think you're looking for strip_tags(). It removes HTML tags from a text. You can also specify list of tags to keep.
Have to say, parsing text content from a HTML page requires more complex operation.
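As a rough illustration of both uses mentioned above (removing everything, or keeping a list of tags), assuming a made-up input string:

```php
<?php
$html = '<p>Hello <b>world</b> <img src="x.png"> <a href="/page">link</a></p>';

// Remove every tag, keeping only the text between them.
$plain = strip_tags($html);

// Keep <p> and <b>, drop everything else. The second argument lists allowed tags.
$partial = strip_tags($html, '<p><b>');

echo $plain, "\n";   // prints: Hello world  link
echo $partial, "\n"; // prints: <p>Hello <b>world</b>  link</p>
```

Note that strip_tags() leaves the attributes of allowed tags untouched, so the allow-list on its own is not a security filter.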
You can use strip_tags():
This function tries to return a string with all NUL bytes, HTML and
PHP tags stripped from a given str. It uses the same tag stripping
state machine as the fgetss() function.
For example, a snippet of 50 characters. Problem is, of course, closing any opened tags. What's a good way to do this? Or else to make things easier, what's a good way to completely skim off all HTML content from the snippet?
You can strip out all HTML tags, etc. via the strip_tags() function, which is (being realistic) probably the best way to go, as otherwise you'll most likely end up with more tags than actual content.
For example:
$first50Chars = substr(trim(strip_tags($longString)), 0, 50);
If tags are generally allowed in the text (for example, if text inside <b> must be rendered in bold), then strip_tags() looks like the easiest way to remove tags from the snippet.
If tags are generally not allowed in the text (for example, "<b>" must simply be displayed as "<b>"), then you can use the htmlentities() function.
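The two cases could be sketched like this. Both function names are hypothetical, the code assumes UTF-8 input and the mbstring extension, and it uses htmlspecialchars() in place of htmlentities() for the second case, since both encode < and > the same way.

```php
<?php
// Case 1: markup is meaningful in the source, so drop it before cutting.
function snippetFromHtml(string $html, int $length = 50): string
{
    return mb_substr(trim(strip_tags($html)), 0, $length, 'UTF-8');
}

// Case 2: the source is plain text, so cut first and encode afterwards.
// Encoding after the cut means an entity such as &lt; is never split in half.
function snippetFromPlainText(string $text, int $length = 50): string
{
    $cut = mb_substr(trim($text), 0, $length, 'UTF-8');
    return htmlspecialchars($cut, ENT_QUOTES, 'UTF-8');
}

echo snippetFromHtml('<p>Some <b>bold</b> text that keeps going</p>', 14), "\n";
// prints: Some bold text
echo snippetFromPlainText('Use <b> for bold text', 7), "\n";
// prints: Use &lt;b&gt;
```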
I am developing an MVC application in PHP that uses XML and XSLT to render the views. It needs to fully support UTF-8. I also use MySQL, correctly configured for UTF-8. My problem is the following.
I have an <input type="text"/> with a value like àáèéìíòóùú"><'##~!¡¿?. This is processed before being added to the database: I use mysql_real_escape_string($_POST["name"]) and then run a MySQL INSERT. This adds a backslash \ before " and '.
The MySQL database has DEFAULT CHARACTER SET utf8 and COLLATE utf8_spanish_ci. The table field is a normal VARCHAR.
Then I have to print this in an XML document that will be transformed with XSLT. I can use PHP in the XML, so I echo it with <?php echo TexUtils::obtainSqlText($value_obtained_from_sql); ?>. For now the obtainSqlText() function returns the value unchanged; it is a placeholder until I decide on the final processing.
One of the first things that I will need for the selected input is to convert > and < to &gt; and &lt; because this will generate problems with start/end tags. This will be done with <?php htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?>. This will also convert & to &amp;, " to &quot; and ' to &#039;. This is a big problem: XSLT starts to fail because it doesn't recognize all HTML special characters.
There is another problem. So far I've talked about the àáèéìíòóùú"><'##~!¡¿? input, but I will also have text from a CKEditor <textarea /> whose value will look like:
<p>
àáèéìíòóùú"><'##~!¡¿?
</p>
How should I handle this? To start with, if I want to print this second value correctly I will need to use <xsl:value-of select="value" disable-output-escaping="yes" />. Will "><' print right?
So what I am really looking for is how to process these values and how to print them. Do I need one approach for a value coming from a VARCHAR that doesn't allow HTML and another for a TEXT (for example) that does? Will I need to use disable-output-escaping="yes" every time?
I also want to know whether doing this really protects me against XSS attacks.
Thank you in advance!
This will be done with <?php htmlspecialchars($string, ENT_QUOTES, "UTF-8"); ?>.
Fine.
This is a big problem: XSLT starts to fail because it doesn't recognize all HTML special characters.
It shouldn't fail on htmlspecialchars() output, ever. &amp; is a predefined entity in XML and &#039; is a character reference which is always allowed. htmlspecialchars() should produce XML-compatible output, unlike the usually-a-mistake htmlentities(). What is the error you are seeing?
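A quick way to check this claim yourself is to feed the escaped value to an XML parser. The sketch below assumes the DOM extension is available and that the source file is saved as UTF-8; the <value> element name is made up for the test.

```php
<?php
$value = 'àáèéìíòóùú"><\'##~!¡¿?';

// htmlspecialchars() with ENT_QUOTES only emits &amp; &lt; &gt; &quot; and &#039;,
// all of which an XML parser understands without a DTD.
$escaped = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
$xml = '<?xml version="1.0" encoding="UTF-8"?><value>' . $escaped . '</value>';

$doc = new DOMDocument();
$loaded = $doc->loadXML($xml);

var_dump($loaded);                                        // bool(true): well-formed
var_dump($doc->documentElement->textContent === $value);  // bool(true): text round-trips
```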
àáèéìíòóùú"><'##~!¡¿?
Urgh, an HTML rich text editor produced that invalid markup? What a dodgy editor.
If you have to allow users to input arbitrary HTML, it's going to need some processing. Unless you really trust those users, you'll need a purifier (to stop them using dangerous scripting elements and XSS-ing each other), and a tidier (to remove malformed markup either due to crap rich-text-editor output or deliberate sabotage). If you intend to put the content directly into XML you will also need it to convert to XHTML output and replace HTML entity references.
A simple way to do this in PHP would be DOMDocument->loadHTML followed by a walk of the DOM tree removing all but known-good elements/attributes/URL-schemes, followed by DOMDocument->saveXML.
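A minimal sketch of that walk follows. It is an illustration under stated assumptions, not a vetted purifier: the two whitelists and both function names are hypothetical, every attribute is dropped (which sidesteps the URL-scheme check rather than implementing it), and the prepended meta tag is an assumption used to make libxml read the input as UTF-8.

```php
<?php
// Hypothetical whitelists for illustration only.
const ALLOWED_TAGS = ['p', 'br', 'b', 'i', 'em', 'strong', 'ul', 'ol', 'li'];
const DROPPED_WITH_CONTENT = ['script', 'style'];

function cleanChildren(DOMNode $node): void
{
    // Iterate over a copy because the loop changes the child list.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMText) {
            continue; // plain text is kept as is
        }
        if (!($child instanceof DOMElement)) {
            $node->removeChild($child); // comments, processing instructions
            continue;
        }
        $name = strtolower($child->nodeName);
        if (in_array($name, DROPPED_WITH_CONTENT, true)) {
            $node->removeChild($child); // remove the element and its content
            continue;
        }
        cleanChildren($child);
        if (in_array($name, ALLOWED_TAGS, true)) {
            // This sketch drops every attribute, which also removes any URL.
            while ($child->attributes->length > 0) {
                $child->removeAttributeNode($child->attributes->item(0));
            }
        } else {
            // Unknown element: keep its already cleaned children, drop the wrapper.
            while ($child->firstChild !== null) {
                $node->insertBefore($child->firstChild, $child);
            }
            $node->removeChild($child);
        }
    }
}

function htmlToSafeXhtml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    cleanChildren($body);

    // Serialize each child as XML so empty elements come out as <br/>.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo htmlToSafeXhtml('<p onclick="x()">Hi <script>alert(1)</script><span>there</span><br></p>'), "\n";
// expected: <p>Hi there<br/></p>
```

For anything exposed to untrusted users, a maintained purifier library is a safer choice than a hand-rolled walk like this one.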
Will "><' print right?
Well, it'll print as in your example, yes. But that's equally invalid as both HTML and XML.
I need to take two text blocks with html tags and render a comparison - merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I try to throw text with html tags in it, it gets UGLY. Because of the word and character-based compare algorithms the class uses, html tags get broken and I end up with ugly stuff like <p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I've been working on this for weeks :[
This is the best solution I could think of: find/replace each type of html tag with 1 special non-standard character like the apple logo (opt shift k), render the comparison with this kind of primitive markdown, then revert the non-standard characters back into tags. Any feedback?
Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there's an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)
The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
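As a sketch of that idea, the hypothetical function below splits a block into tokens where each tag is a single token. It assumes reasonably well-formed markup, and a diff engine that accepts arrays of tokens in place of lines.

```php
<?php
// Hypothetical tokenizer: each HTML tag becomes one token, and so does each
// run of non-whitespace text. A stray "<" is kept as its own token.
function tokenizeHtml(string $html): array
{
    preg_match_all('/<[^>]+>|[^<\s]+|</u', $html, $matches);
    return $matches[0];
}

print_r(tokenizeHtml('<p>Hello <b class="x">big</b> world</p>'));
// Tokens: <p>, Hello, <b class="x">, big, </b>, world, </p>
```

Diffing the two token arrays then can only add or remove a whole tag, never part of one.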
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
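The placeholder approach with the extra spaces could look roughly like this. Both function names are hypothetical, the code assumes PHP 7.2+ with mbstring (for mb_chr), and it uses the Private Use Area starting at U+E000, which gives room for 6400 distinct tags.

```php
<?php
// Replace each distinct tag with one Private Use Area character, padded with
// spaces so a word-based diff treats it as a separate word.
function tagsToPlaceholders(string $html, array &$map): string
{
    return preg_replace_callback('/<[^>]+>/u', function (array $m) use (&$map): string {
        $tag = $m[0];
        if (!isset($map[$tag])) {
            $map[$tag] = mb_chr(0xE000 + count($map), 'UTF-8');
        }
        return ' ' . $map[$tag] . ' ';
    }, $html);
}

// Reverse the substitution, removing the padding spaces again.
function placeholdersToTags(string $text, array $map): string
{
    $pairs = [];
    foreach ($map as $tag => $char) {
        $pairs[' ' . $char . ' '] = $tag;
    }
    return strtr($text, $pairs);
}

$map = [];
$encoded = tagsToPlaceholders('<p>Hi <b>you</b></p>', $map);
echo placeholdersToTags($encoded, $map), "\n";
// prints: <p>Hi <b>you</b></p>
```

When diffing two versions, pass the same $map to both calls so identical tags get identical placeholders.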
I'm surprised nobody mentioned HTMLDiff, based on MediaWiki's Visual Diff. Give it a try; I was looking for something like this and found it pretty useful.
What about using an html tidier / formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
Try running your HTML blocks through this function first:
htmlentities();
That should convert all of your "<"'s and ">"'s into their corresponding codes, perhaps fixing your problem.
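// Example (assumes the PEAR Text_Diff package is installed):
require_once 'Text/Diff.php';
require_once 'Text/Diff/Renderer.php';
$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";
// Below code adapted from http://www.go4expert.com/forums/showthread.php?t=4189.
// Not sure if/how it works exactly
// Text_Diff compares arrays of lines, so split each encoded string on newlines.
$diff = new Text_Diff('auto', array(
    explode("\n", htmlentities($html_1)),
    explode("\n", htmlentities($html_2)),
));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);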
I am using a rich text box control to post some data on one page,
and I am saving the data to my db table with the HTML markup. Ex: This is <b >my bold </b > text
I am displaying the first 50 characters of this column on another page. If I save a sentence (with more than 50 chars) with the bold tag applied, then when I trim it on the other page (to take the first 50 chars) I lose the closing b tag (</b>), so the bold gets applied to the rest of the content on that page.
How can I solve this? How can I check which open tags are not closed? Is there an easy way to do this in PHP? Is there a function to remove all the HTML tags / markup and give me the sentence as plain text?
http://php.net/strip_tags
the strip_tags function will remove any tags you might have.
Yes
$textWithoutTags = strip_tags($html);
I generally use HTML::Truncate for this. Of course, being a Perl module, you won't be able to use it directly in your PHP - but the source code does show a working approach (which is to use an HTML parser).
An alternative approach, might be to truncate as you are doing at the moment, and then try to fix it using Tidy.
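For a rough idea of how tag-aware truncation can work in PHP, here is a minimal sketch. It uses a regex tokenizer and a stack of open tags, not a real HTML parser and not Tidy, so it assumes well-nested markup. The function name is hypothetical, mbstring is assumed, and entities are counted character by character, so an entity can still be cut in half.

```php
<?php
// Hypothetical helper: cut HTML after $limit visible characters and close
// any tags that are still open at the cut.
function truncateHtml(string $html, int $limit): string
{
    $voidTags = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $open = [];   // stack of currently open tag names
    $out = '';
    $count = 0;   // visible characters emitted so far

    // Tokens are either a whole tag or a run of text.
    preg_match_all('/<[^>]+>|[^<]+/u', $html, $tokens);
    foreach ($tokens[0] as $token) {
        if ($token[0] === '<') {
            if (preg_match('/^<\s*(\/?)\s*([a-z][a-z0-9]*)/i', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open); // matching closing tag
                    }
                } elseif (!in_array($name, $voidTags, true) && substr($token, -2) !== '/>') {
                    $open[] = $name;      // opening tag that needs a close
                }
            }
            $out .= $token;
            continue;
        }
        $remaining = $limit - $count;
        $length = mb_strlen($token, 'UTF-8');
        if ($length >= $remaining) {
            $out .= mb_substr($token, 0, $remaining, 'UTF-8');
            break; // limit reached
        }
        $out .= $token;
        $count += $length;
    }

    // Close whatever is still open, innermost first.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }
    return $out;
}

echo truncateHtml('This is <b >my bold text</b > and more', 15), "\n";
// prints: This is <b >my bold</b>
```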
If you want the HTML tags to remain, but be closed properly, see PHP: Truncate HTML, ignoring tags. Otherwise, read on:
strip_tags will remove HTML tags, but not HTML entities (such as &amp;), which could still cause problems if truncated.
To handle entities as well, one can use html_entity_decode to decode entities after stripping tags, then trim, and finally reencode the entities with htmlspecialchars:
$text = "1 &lt; 2\n";
print $text;
print htmlspecialchars(substr(html_entity_decode(strip_tags($text), ENT_QUOTES), 0, 3));
(Note use of ENT_QUOTES to actually convert all entities.)
Result:
1 &lt; 2
1 &lt;
Footnote: The above only works for entities that can be decoded to ISO-8859-1. If you need support for international characters, you should already be working with UTF-8 encoded strings, and simply need to specify that in the call to html_entity_decode.
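A UTF-8 variant of the same steps might look like the sketch below. The function name is hypothetical, and mb_substr() replaces substr() so that multi-byte characters are not cut in the middle, which assumes the mbstring extension.

```php
<?php
// Hypothetical helper: strip tags, decode entities, cut, then re-encode,
// with UTF-8 named explicitly at every step.
function truncateEncoded(string $html, int $length): string
{
    $plain = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    $short = mb_substr($plain, 0, $length, 'UTF-8');
    return htmlspecialchars($short, ENT_QUOTES, 'UTF-8');
}

print truncateEncoded("caf&eacute; &lt; th&eacute; &amp; more\n", 6);
// prints: café &lt;
```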