How to show a comparison of 2 html text blocks - php

I need to take two text blocks with html tags and render a comparison - merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I try to throw text with html tags in it, it gets UGLY. Because of the word and character-based compare algorithms the class uses, html tags get broken and I end up with ugly stuff like <p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I've been working on this for weeks :[
This is the best solution I could think of: find/replace each type of html tag with one special non-standard character like the Apple logo (opt shift k), render the comparison with this kind of primitive markup, then revert the non-standard characters back into tags. Any feedback?

Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there's an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)

The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
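To make the "tags as atomic tokens" idea concrete, here is a minimal PHP sketch. The function names (tokenizeHtml, diffTokens, renderHtmlDiff) are made up for illustration, and the diff is a plain longest-common-subsequence table, not what Text_Diff does internally. It assumes well-formed markup in which a literal < inside text is already escaped as &lt;.

```php
<?php
// Split HTML into tokens: every tag is one token, as is every word
// and every run of whitespace.
function tokenizeHtml(string $html): array
{
    return preg_split('/(<[^>]+>|\s+)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Longest-common-subsequence diff over two token arrays.
// Returns a list of [op, token] pairs where op is '=', '+' or '-'.
function diffTokens(array $old, array $new): array
{
    $n = count($old);
    $m = count($new);
    // $lcs[$i][$j] = LCS length of $old[$i..] and $new[$j..]
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($old[$i] === $new[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }
    $ops = [];
    $i = $j = 0;
    while ($i < $n && $j < $m) {
        if ($old[$i] === $new[$j]) {
            $ops[] = ['=', $old[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $ops[] = ['-', $old[$i++]];
        } else {
            $ops[] = ['+', $new[$j++]];
        }
    }
    while ($i < $n) { $ops[] = ['-', $old[$i++]]; }
    while ($j < $m) { $ops[] = ['+', $new[$j++]]; }
    return $ops;
}

// Only text tokens are wrapped in <ins>/<del>. Tags always pass through
// whole: unchanged and added tags are kept, removed tags are dropped.
function renderHtmlDiff(string $old, string $new): string
{
    $out = '';
    foreach (diffTokens(tokenizeHtml($old), tokenizeHtml($new)) as [$op, $tok]) {
        $isTag = $tok[0] === '<';
        if ($op === '=') {
            $out .= $tok;
        } elseif ($isTag) {
            $out .= ($op === '+') ? $tok : '';
        } else {
            $out .= ($op === '+') ? "<ins>$tok</ins>" : "<del>$tok</del>";
        }
    }
    return $out;
}

echo renderHtmlDiff('<p>Hello world</p>', '<p>Hello <b>brave</b> world</p>');
// prints: <p>Hello <b><ins>brave</ins></b><ins> </ins>world</p>
```

Because a tag can never be split, the output keeps the new version's markup intact. Showing removed tags or changed attributes would need extra handling that this sketch leaves out.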

I'm surprised nobody has mentioned HTMLDiff, which is based on MediaWiki's Visual Diff. Give it a try; I was looking for something similar and found it pretty useful.

What about running an html tidier / formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow.

A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
The following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.

Try running your HTML blocks through this function first:
htmlentities();
That should convert all of your "<"'s and ">"'s into their corresponding codes, perhaps fixing your problem.
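//Example:
require_once 'Text/Diff.php';
require_once 'Text/Diff/Renderer.php';
$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";
//Below code adapted from http://www.go4expert.com/forums/showthread.php?t=4189.
//Not sure if/how it works exactly
//Text_Diff compares arrays of lines, so each encoded string is split on newlines first
$diff = new Text_Diff('auto', array(explode("\n", htmlentities($html_1)), explode("\n", htmlentities($html_2))));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);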

Related

Replace only on lines which contain no html tags

My RegEx knowledge is still limited, so I hope the example below demonstrates my problem. I parse a text and render HTML. My current problem is adding paragraph markup to every line of text that has no markup of its own and ends with a line ending.
An example text:
<h1>Header</h1>\nA simple text with less of words. Yes much more lines.\n<h2>Tests</h2>\nThe solution is still active in his tests.\n
I would like to wrap each line in simple paragraph markup (<p> before and </p> after), unless the line already has markup or is an empty line, like ''.
The result for the example should look like:
<h1>Header</h1>\n<p>A simple text with less of words. Yes much more lines.</p>\n<h2>Tests</h2>\n<p>The solution is still active in his tests.</p>\n
What I tried
My current RegEx handles that, but it fails when a line is empty or when an empty line follows a tag, like </code>\n.
'#(?![a-z][0-9]).(.*\n)#'
I also tried a negative lookahead for the closing bracket of an HTML tag, like #(?!\>).(.*\n)#.
Online test
https://regex101.com/r/khYWy4/2
Use another tool if you can!
Depending on how you are going to use this, I would recommend that you find a solution which is not based on regex. This task is better solved by iterating over the lines in a proper script or program, perhaps the one which generates the html in the first place, and injecting the tags you need.
Having said that, I appreciate that sometimes there is no optimal solution.
My attempt to solve your case
I have updated your example with a substitution which does seem to do what you want.
\n([^<>\n;]+?)\n
Substitute with
\n<p>\1</p>\n
The updated example:
https://regex101.com/r/khYWy4/3
Be aware of a few things here:
I ignore any lines which already contain any html tags.
I ignore any lines which contain a semicolon, to avoid tags in your code block.
Disclaimer!
These simple skips were made just to make your example work. Depending on what your other cases look like, I cannot guarantee that this will work for a larger set of data.
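For comparison, the line-by-line approach recommended at the top of this answer could look roughly like the PHP sketch below. The function name wrapBareLines is made up, and it assumes that empty lines should be left alone and that "markup" means any tag anywhere on the line.

```php
<?php
// Wrap every non-empty line that contains no tag in <p>...</p>.
// Lines that already contain a tag, and empty lines, pass through unchanged.
function wrapBareLines(string $text): string
{
    $lines = explode("\n", $text);
    foreach ($lines as $i => $line) {
        $hasTag = preg_match('/<[a-z\/!][^>]*>/i', $line) === 1;
        if (trim($line) !== '' && !$hasTag) {
            $lines[$i] = '<p>' . $line . '</p>';
        }
    }
    return implode("\n", $lines);
}

$input = "<h1>Header</h1>\nA simple text with less of words. Yes much more lines.\n"
       . "<h2>Tests</h2>\nThe solution is still active in his tests.\n";
echo wrapBareLines($input);
```

Lines inside a multi-line <pre> or <code> block would still get wrapped here; handling them needs a flag that remembers whether such a block is currently open.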

html visual editing like whatsapp in php

A recent version of WhatsApp introduced a little bit of message styling. Suppose we want to write something like this:
input This is a ~statement~ which has styling in it
output This is a statement which has styling in it
Even Stack Overflow has this kind of minimal styling, and it looks great. We want to implement this in our platform, so that teachers giving remarks to students can use ol, ul, bold and italic. We also want to make sure they cannot use traditional html tags, because whenever a tag changes we would have to make changes too. Instead we like the approach where you add special characters around a word and turn them into whatever you want in the output.
I don't know the specific term for this type of editing, so please excuse that.
language: since our platform already runs on php, we would like to implement this in php
thought process: we thought it might be possible with regex, but we don't know how to implement ol and ul, and we are not sure that regex is the correct method
why not allow traditional html tags:
Not all of the teachers know traditional html tags
We want to keep our application secure
Take a look at the cebe/markdown library on GitHub.
Here are some examples:
// traditional markdown and parse full text
$parser = new \cebe\markdown\Markdown();
$parser->parse($markdown);
// use github markdown
$parser = new \cebe\markdown\GithubMarkdown();
$parser->parse($markdown);
// use markdown extra
$parser = new \cebe\markdown\MarkdownExtra();
$parser->parse($markdown);
// parse only inline elements (useful for one-line descriptions)
$parser = new \cebe\markdown\GithubMarkdown();
$parser->parseParagraph($markdown);
You can use regular expressions like this:
/~([\w]*)~/
With preg_replace() function you can replace the content between ~ symbols with all you need. For example:
https://regex101.com/r/vD8wI4/2
Note the substitution tab, where I replace ~text~ with <pre>text</pre>.
The same technique is applicable to bold, italic, etc.:
Bold:
/\*([\w]*)\*/
Italic:
/_([\w]*)_/
Etc.
Good luck.
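Putting the regex idea together with the ol/ul and security requirements from the question, a rough PHP sketch could look like this. The marker choices (*bold*, _italic_, ~strike~, "- " for bullets, "1. " for numbered items) and the function name renderRemark are assumptions for illustration, not any standard. The input is escaped with htmlspecialchars() before any styling is applied, so raw html typed by a teacher is displayed literally instead of being interpreted.

```php
<?php
// Convert a small WhatsApp-like syntax to HTML.
function renderRemark(string $text): string
{
    // Escape first, so any HTML typed by the user is shown literally.
    $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    $inline = [
        '/\*([^*\n]+)\*/' => '<b>$1</b>',
        '/_([^_\n]+)_/'   => '<i>$1</i>',
        '/~([^~\n]+)~/'   => '<del>$1</del>',
    ];

    $html = '';
    $openList = null; // 'ul', 'ol', or null when not inside a list
    foreach (explode("\n", $text) as $line) {
        if (preg_match('/^- (.+)$/', $line, $m)) {
            $type = 'ul';
        } elseif (preg_match('/^\d+\. (.+)$/', $line, $m)) {
            $type = 'ol';
        } else {
            $type = null;
        }
        // Close the current list when the list type changes or the list ends.
        if ($openList !== null && $openList !== $type) {
            $html .= "</$openList>\n";
            $openList = null;
        }
        if ($type !== null && $openList === null) {
            $html .= "<$type>\n";
            $openList = $type;
        }
        $body = preg_replace(array_keys($inline), array_values($inline),
            $type !== null ? $m[1] : $line);
        $html .= $type !== null ? "<li>$body</li>\n" : "$body<br>\n";
    }
    if ($openList !== null) {
        $html .= "</$openList>\n";
    }
    return $html;
}

echo renderRemark("Good *work* on the ~draft~ essay\n- fix _spelling_\n- add sources");
```

Unlike [\w]*, the character classes used here allow spaces between the markers. The underscore rule will also fire on things like snake_case_names, so a real implementation would want stricter word-boundary checks, or a full Markdown parser as suggested above.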

Simple BBparser in PHP that lets you replace content outside tags

I'm trying to parse strings that represent source code, something like this:
[code lang="html"]
<div>stuff</div>
[/code]
<div>stuff</div>
As you can see from my previous 20 questions, I tried to do it with PHP's regex functions, but ran into many problems, especially when the string is very big...
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
What I need it to do is:
be able to convert all content from within [code] tags with html entities
be able to run some kind of a filter (a callback function of mine) only on content outside of the [code] tags
thank you
edit:
I ended up using this:
convert all <pre> and <code> tags to [pre] and [code]:
str_replace(array('<pre>', '</pre>', '<code>', '</code>'), array('[pre]', '[/pre]', '[code]', '[/code]'), $content);
get contents from between [code]..[/code] and [pre]...[/pre] and do the html entity conversion
preg_replace_callback('/(.?)\[(pre|code)\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)/s', 'self::specialchars', $content);
(i stole this pattern from wordpress shortcode functions :)
store the entity converted content in a temporary array variable, and replace the one from $content with a unique ID
I can now safely run my filter on $content, because there's no code in it, just the ID (this filter does a strip_tags on the entire text and converts stuff like http://blabla.com to links)
replace the unique IDs from $content with the converted code blocks from the array variable
do you think it's ok?
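For readers following along, the steps above can be sketched in PHP roughly as follows. This is a simplified illustration and not the exact code from the question: it handles only [code] blocks, uses a simpler pattern than the WordPress one, and the function name and the @@CODEBLOCK_n@@ placeholder format are invented here.

```php
<?php
// Protect [code]...[/code] blocks, filter the rest, then restore the blocks.
// $filter is any callable applied to the text outside the code blocks.
function filterOutsideCode(string $content, callable $filter): string
{
    $blocks = [];
    // Step 1: swap each code block for a unique placeholder and store the
    // entity-encoded body under that placeholder.
    $content = preg_replace_callback(
        '/\[code\b[^\]]*\](.*?)\[\/code\]/s',
        function (array $m) use (&$blocks): string {
            $id = sprintf('@@CODEBLOCK_%d@@', count($blocks));
            $blocks[$id] = '<pre><code>'
                . htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8')
                . '</code></pre>';
            return $id;
        },
        $content
    );
    // Step 2: the text now holds no code, so the filter can run safely.
    $content = $filter($content);
    // Step 3: put the converted code blocks back.
    return $blocks ? strtr($content, $blocks) : $content;
}

$input = "[code lang=\"html\"]<div>stuff</div>[/code]\n<div>stuff</div>";
echo filterOutsideCode($input, 'strip_tags');
```

The weak spot of this approach is the placeholder itself: if a user can type text that looks like a placeholder, or if the filter alters it, the restore step breaks. A random per-request token in the placeholder reduces that risk.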
HTML Purifier http://htmlpurifier.org/
But you are facing the same issues as in your 20 previous questions.
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
There's the BBCode PECL extension, but you'd need to compile it.
There's also PEAR's HTML_BBCodeParser, though I can't vouch for how effective it is.
There are also a few elsewhere, but I think they're all pretty rigid.
I don't believe that either of those does what you're looking for with regard to having a callback for tag contents (and @webarto is totally correct that HTMLPurifier is the right tool to use when processing the contents). You might have to write your own here. I've previously written about my experiences doing the same, which you might find helpful.

Recursive Contents of HTML tag using regex

I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]*+(?:(?!</?blockquote>)<|(?R))*+)*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]*+(?:(?!</?blockquote>)<|(?R))*+)*+)</blockquote>~';
function blockquoteCallback($matches) {
    // $matches[1] is the content of one blockquote: handle nested blockquotes first,
    // then indent the result. doIndent() is your own indenting function (not shown).
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote or anywhere else.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type 2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that regular expressions can't keep a count,
i.e. they can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to keep a count somehow, maybe with a stack or a tree.
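As a rough illustration of "keeping a count", the PHP sketch below splits the string on blockquote and br tags and tracks the nesting depth in a plain integer. The function name indentBlockquotes is made up. It assumes well-formed HTML with attribute-free tags, and it reads the requirement as "start a new line indented by five spaces per nesting level", which is my interpretation of the question.

```php
<?php
// Flatten <blockquote> nesting into indented plain text by counting depth.
function indentBlockquotes(string $html, string $indent = '     '): string
{
    // The captured delimiters (the tags) are returned between the text pieces.
    $parts = preg_split('~(</?blockquote>|<br\s*/?>)~i', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE);
    $depth = 0;
    $out = '';
    foreach ($parts as $part) {
        $tag = strtolower($part);
        if ($tag === '<blockquote>') {
            $depth++;
            $out .= "\n" . str_repeat($indent, $depth);
        } elseif ($tag === '</blockquote>') {
            $depth = max(0, $depth - 1);
            $out .= "\n" . str_repeat($indent, $depth);
        } elseif ($depth > 0 && preg_match('~^<br\s*/?>$~i', $part)) {
            // A line break inside a blockquote: new line at the current depth.
            $out .= "\n" . str_repeat($indent, $depth);
        } else {
            $out .= $part; // plain text, or a <br> outside any blockquote
        }
    }
    return $out;
}

echo indentBlockquotes('Login:<blockquote>Username: u<br>Password: p</blockquote>Thank you.');
```

Breaks outside a blockquote are left as <br>, so the PDF class can still deal with them however it already does.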

Text Search - Highlighting the Search phrase

What would be the best way to highlight the searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable.
And I want to highlight the searched term, excluding text within tags.
For example, if the user searches for "img", the img tag should be ignored but
the phrase "img" within the text should be highlighted.
Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
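A minimal PHP sketch of that DOM approach might look like the following. It is not a translation of the findText function mentioned above; the function name highlightTerm and the choice of a <mark> wrapper are assumptions made for illustration. Only text nodes are visited, so tag names and attribute values are never touched.

```php
<?php
// Wrap each occurrence of $term found in text nodes in <mark>.
function highlightTerm(string $html, string $term): string
{
    if ($term === '') {
        return $html;
    }
    $doc = new DOMDocument();
    // @ silences libxml warnings about markup it does not recognise.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    $pattern = '/(' . preg_quote($term, '/') . ')/i';

    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // term not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                // Odd indexes hold the captured matches.
                $mark = $doc->createElement('mark');
                $mark->appendChild($doc->createTextNode($part));
                $fragment->appendChild($mark);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
    return $doc->saveHTML();
}

$page = '<html><body><p>An img tag: <img src="img.png" alt="x"> shows an img.</p></body></html>';
echo highlightTerm($page, 'img');
```

Because the term is inserted with createTextNode(), it is escaped automatically, which avoids the HTML-injection problem described above. Note that loadHTML() guesses the encoding; for non-ASCII documents you would need to declare the charset in the markup or convert the input first.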
Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
private String highlight(String search, String html) {
    // Pattern.quote stops regex metacharacters in the search term from being interpreted
    String term = java.util.regex.Pattern.quote(search);
    return html.replaceAll("(>[^<]*)(" + term + ")([^>]*<)", "$1<em>$2</em>$3");
}
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain it is that you ensure your term sits between two tags and is thus not itself a tag or part of a tag attribute.
var highlight = function(what){
var html = document.body.innerHTML,
word = "(" + what + ")",
match = new RegExp(word, "gi");
html = html.replace(match, "<span style='background-color: red'>$1</span>");
document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, screwing up your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
The best way is probably to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/
There is a free JavaScript library that might help you out: http://scott.yang.id.au/code/se-hilite/
You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it with the server-side language itself, which may be php, java or any other language.
This way you would have only the result strings without html and without parsing overhead.
