Modify html2text library - php

I have been using the Html2Text library by Jon Abernathy for a while, but now I have a new requirement. Usually the library formats the output text according to the input HTML; for example, <b>i am bold</b> will become "I AM BOLD".
Requirement: I now need the library to leave untouched anything it finds of the form ${this should be not touched}$, including the $ signs and braces.
Question: Is there a better way (perhaps passing parameters) of achieving the above behavior than directly modifying the library code?

You don't need to modify the HTML or the library. Just add this to your stylesheet:
b { text-transform:uppercase; }
Now the content of your <b> tags will be displayed in upper case as well as bold.
If you have specific <b> tags that need to be upper-cased and others that don't, then use a class to differentiate them, and change the above CSS to reference the class name.

Related

html visual editing like whatsapp in php

A recent version of WhatsApp introduced a little bit of message styling. Suppose we want to write something like this:
Input: This is a ~statement~ which has styling in it
Output: This is a statement which has styling in it (with the word "statement" styled)
Even Stack Overflow has this kind of minimal styling, which looks great. We want to implement this in our platform, where teachers giving remarks to students can use ol, ul, bold and italic. We also want to make sure they cannot use traditional HTML tags, because when a tag changes we would have to make changes too. Instead, we like the approach where you put a special character around a word and render it however you want in the output.
I don't know the specific term for this type of editing, so please excuse that.
Language: since our platform already runs on PHP, we would like to implement this in PHP.
Thought process: we thought it might be possible with regex, but we don't know how to implement ol and ul that way, and we are not sure regex is the correct method.
Why not allow traditional HTML tags:
Not all of the teachers know traditional HTML tags
We want to keep our application secure
Take a look at the cebe/markdown library on GitHub.
Here are some examples:
// traditional markdown and parse full text
$parser = new \cebe\markdown\Markdown();
$parser->parse($markdown);
// use github markdown
$parser = new \cebe\markdown\GithubMarkdown();
$parser->parse($markdown);
// use markdown extra
$parser = new \cebe\markdown\MarkdownExtra();
$parser->parse($markdown);
// parse only inline elements (useful for one-line descriptions)
$parser = new \cebe\markdown\GithubMarkdown();
$parser->parseParagraph($markdown);
You can use regular expressions like this:
/~([\w]*)~/
With the preg_replace() function you can replace the content between ~ symbols with whatever you need. For example:
https://regex101.com/r/vD8wI4/2
Note the substitution tab, where I replace ~text~ with <pre>text</pre>.
The same technique is applicable to bold, italic, etc.:
Bold:
/\*([\w]*)\*/
Italic:
/_([\w]*)_/
Etc.
Good luck.
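The regex approach above can be sketched as one small function. This is only a minimal sketch: the name renderInlineMarkup() and the choice of <b>, <i> and <s> as output tags are assumptions made for illustration, not something the question or answer specifies. It escapes the input first, so any HTML a teacher types is shown as text rather than interpreted, and it uses (.+?) instead of [\w]* so that a styled phrase may contain spaces.

```php
<?php
// Hypothetical helper: converts *bold*, _italic_ and ~strike~ markers to HTML.
function renderInlineMarkup(string $text): string
{
    // Escape first so raw HTML typed by the user is displayed, not interpreted.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Marker pattern => replacement. (.+?) is lazy and, unlike [\w]*, matches spaces.
    $rules = [
        '/\*(.+?)\*/' => '<b>$1</b>',
        '/_(.+?)_/'   => '<i>$1</i>',
        '/~(.+?)~/'   => '<s>$1</s>',
    ];

    return preg_replace(array_keys($rules), array_values($rules), $html);
}

echo renderInlineMarkup('This is a ~statement~ with *bold text* and <b>raw html</b>');
// prints: This is a <s>statement</s> with <b>bold text</b> and &lt;b&gt;raw html&lt;/b&gt;
```

Note that this simple version also treats the underscores in something like snake_case_name as italic markers, and it does not handle ol or ul. For lists, a real Markdown parser such as the one suggested above is probably the safer route.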

Simple BBparser in PHP that lets you replace content outside tags

I'm trying to parse strings that represent source code, something like this:
[code lang="html"]
<div>stuff</div>
[/code]
<div>stuff</div>
As you can see from my previous 20 questions, I tried to do it with PHP's regex functions, but ran into many problems, especially when the string is very big...
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
What I need it to do is:
be able to convert all content from within [code] tags with html entities
be able to run some kind of a filter (a callback function of mine) only on content outside of the [code] tags
thank you
Edit:
I ended up using this:
1. Convert all <pre> and <code> tags to [pre] and [code]:
str_replace(array('<pre>', '</pre>', '<code>', '</code>'), array('[pre]', '[/pre]', '[code]', '[/code]'), $content);
2. Get the contents from between [code]...[/code] and [pre]...[/pre] and do the HTML entity conversion:
preg_replace_callback('/(.?)\[(pre|code)\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)/s', 'self::specialchars', $content);
(I stole this pattern from the WordPress shortcode functions :)
3. Store the entity-converted content in a temporary array variable, and replace the one in $content with a unique ID.
4. I can now safely run my filter on $content, because there's no code in it, just the ID (this filter does a strip_tags on the entire text and converts stuff like http://blabla.com to links).
5. Replace the unique IDs in $content with the converted code blocks from the array variable.
Do you think it's OK?
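The placeholder approach described in the edit above can be sketched roughly as follows. This is a simplified illustration, not the asker's actual code: the function name filterOutsideCode(), the @@BLOCK_n@@ token format and the simpler regex (used instead of the WordPress-derived pattern) are all assumptions. It also assumes the token text never appears in real input and that [code] blocks are not nested.

```php
<?php
// Hypothetical helper: entity-encode [code]/[pre] bodies, filter everything else.
function filterOutsideCode(string $content, callable $filter): string
{
    $blocks = [];

    // 1. Swap each [code]...[/code] or [pre]...[/pre] block for a unique token.
    $content = preg_replace_callback(
        '/\[(code|pre)\b[^\]]*\](.*?)\[\/\1\]/s',
        function (array $m) use (&$blocks): string {
            $token = sprintf('@@BLOCK_%d@@', count($blocks)); // assumed absent from real input
            $blocks[$token] = '<' . $m[1] . '>'
                . htmlspecialchars($m[2], ENT_QUOTES, 'UTF-8')
                . '</' . $m[1] . '>';
            return $token;
        },
        $content
    );

    // 2. Only tokens and ordinary text remain, so the filter cannot touch the code.
    $content = $filter($content);

    // 3. Put the encoded blocks back in place of their tokens.
    return strtr($content, $blocks);
}

$input = '<i>intro</i> [code lang="html"]<div>stuff</div>[/code] <div>stuff</div>';
echo filterOutsideCode($input, 'strip_tags');
// prints: intro <code>&lt;div&gt;stuff&lt;/div&gt;</code> stuff
```

Any callable can be passed as the filter; strip_tags is used here only because the edit above mentions it.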
HTML Purifier http://htmlpurifier.org/
But you are facing same issues just like in your 20 previous questions.
Do you guys know a BB parser class written in PHP that I can use for this, instead of regexes?
There's the BBCode PECL extension, but you'd need to compile it.
There's also PEAR's HTML_BBCodeParser, though I can't vouch for how effective it is.
There are also a few elsewhere, but I think they're all pretty rigid.
I don't believe that either of those does what you're looking for, with regard to having a callback for tag contents (and @webarto is totally correct that HTML Purifier is the right tool to use when processing the contents). You might have to write your own here. I've previously written about my experiences doing the same, which you might find helpful.

Text Search - Highlighting the Search phrase

What is the best way to highlight the searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable,
and I want to highlight the searched term, excluding text inside tags.
For example, if the user searches for "img", the img tag should be ignored, but the
phrase "img" within the text should be highlighted.
Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
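A PHP version of that idea might look roughly like the sketch below. It is not a translation of the linked findText function (which is not shown here). The function name highlightTerm() and the choice of <mark> as the wrapper element are assumptions. It relies only on the standard DOMDocument and DOMXPath classes, searches text nodes only, and skips the contents of script and style elements. Input with non-ASCII characters may need extra encoding handling that is left out here.

```php
<?php
// Hypothetical helper: wraps matches of $term in <mark>, touching text nodes only.
function highlightTerm(string $html, string $term): string
{
    if ($term === '') {
        return $html; // nothing to search for
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ hides warnings about imperfect real-world markup

    $xpath   = new DOMXPath($doc);
    $pattern = '/(' . preg_quote($term, '/') . ')/i';

    // Snapshot the text nodes first, because the loop below modifies the tree.
    $textNodes = iterator_to_array(
        $xpath->query('//body//text()[not(ancestor::script) and not(ancestor::style)]')
    );

    foreach ($textNodes as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) { // odd indexes hold the captured matches
                $mark = $doc->createElement('mark');
                $mark->appendChild($doc->createTextNode($part));
                $fragment->appendChild($mark);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    return $doc->saveHTML();
}

echo highlightTerm(
    '<html><body><p>An img tag: <img src="img.png" alt="x"></p></body></html>',
    'img'
);
```

Because the search runs on text nodes, the img element and its src attribute are never candidates for a match; only the word "img" in the paragraph text gets wrapped.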
Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
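private String highlight(String search, String html) {
    // Pattern.quote() makes regex metacharacters in the search term match literally
    String term = java.util.regex.Pattern.quote(search);
    return html.replaceAll("(>[^<]*)(" + term + ")([^>]*<)", "$1<em>$2</em>$3");
}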
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain it is that you ensure your term sits between two tags and thus is not itself a tag or part of a tag attribute.
var highlight = function (what) {
    var html = document.body.innerHTML,
        word = "(" + what + ")",
        match = new RegExp(word, "gi");
    html = html.replace(match, "<span style='background-color: red'>$1</span>");
    document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, screwing up your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
Best way probably is to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/
There is a free JavaScript library that might help you out: http://scott.yang.id.au/code/se-hilite/
You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it, using the server-side language itself, which may be PHP, Java or any other language.
This way you would have only the result strings without html and without parsing overhead.

PHP: Filter specific html tags out of a given text

I googled a lot, since this kind of problem has been asked about often in the past, but I didn't find anything that matches my needs.
I have a html formatted text from a form. Just like this:
Hey, I am just some kind of <strong>formatted</strong> text!
Now, I want to strip all HTML tags that I don't allow. PHP's built-in strip_tags() function does that very well.
But I want to go a step further: I want to allow some tags only inside, or only outside, certain other tags. I also want to define my own XML tags.
Another example:
I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>
Now, I want the <strong/> inside of <book/> to be stripped, but the <strong>Hi!</strong> can stay the way it is.
So, I want to define some rules of what I allow or don't allow, and want to have any filter do the rest.
Is there any easy way to do that? Regexes aren't what I'm looking for, since they can't parse HTML properly.
Regards, Jan Oliver
Don't think there is such a thing, I think not even HTML Purifier does that.
I suggest you parse the XHTML by hand using something like Simple HTML Dom.
Use a second argument to strip_tags, which is allowable tags.
$text = strip_tags($text, '<book><myxml:tag>');
I don't think there's a way to only strip certain tags if they're not inside other tags, without using regex.
Also, it's not that regex is bad at parsing HTML; it's that it's slow compared to the alternatives. But parsing isn't what you're doing here anyway. You're going through the string and removing things you don't want. And for your complex requirement I think your only option is to use regex.
To be completely honest I think you should decide which tags are allowable and which aren't. Whether or not they are inside of other tags shouldn't matter at all. It's markup, not a script.
The second argument shows that you can allow some tags:
string strip_tags ( string $str [, string $allowable_tags ] )
From php.net
I wrote my own Filter class based on the DOM classes of PHP. Look here: XHTMLFilter class
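A DOM-based filter along those lines could be sketched as below. This is not the XHTMLFilter class mentioned above (its code is not shown here). The function name unwrapInside() and the <root> wrapper are assumptions made for this illustration. It uses PHP's XML parser because custom tags like <book> are fine there, which means the input has to be well-formed, and the tag names passed in are assumed to be trusted since they go straight into an XPath expression.

```php
<?php
// Hypothetical helper: removes every $tag element nested inside a $container
// element but keeps its text; the same tag elsewhere is left alone.
function unwrapInside(string $markup, string $container, string $tag): string
{
    $doc = new DOMDocument();
    // The <root> wrapper gives the fragment a single root element (sketch assumption).
    $doc->loadXML('<root>' . $markup . '</root>');

    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query("//{$container}//{$tag}")) as $node) {
        // Move the children up in front of the node, then drop the empty node.
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    // Serialize the children of <root>, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo unwrapInside(
    'I am a custom xml tag: <book><strong>Hello!</strong></book>. Ok... <strong>Hi!</strong>',
    'book',
    'strong'
);
// prints: I am a custom xml tag: <book>Hello!</book>. Ok... <strong>Hi!</strong>
```

A rule set could then be a plain list of (container, tag) pairs applied one after another, with strip_tags() handling the tags that are never allowed anywhere.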

How to show a comparison of 2 html text blocks

I need to take two text blocks with html tags and render a comparison - merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I try to throw text with html tags in it, it gets UGLY. Because of the word and character-based compare algorithms the class uses, html tags get broken and I end up with ugly stuff like <p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I've been working on this for weeks :[
This is the best solution I could think of: find/replace each type of HTML tag with one special non-standard character like the Apple logo (opt shift k), render the comparison with this kind of primitive markdown, then revert the non-standard characters back into tags. Any feedback?
Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there's an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)
The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
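The "tags as atomic tokens" idea can be sketched with a small tokenizer. The function name tokenizeHtml() is hypothetical, and the regex assumes tags contain no literal > inside attribute values. Whether the resulting token array can be fed to your diff engine directly depends on the engine; one that compares arrays of lines could, for instance, be given one token per "line".

```php
<?php
// Hypothetical helper: splits HTML into tokens where each tag is one token
// and the text between tags is split into words.
function tokenizeHtml(string $html): array
{
    // The capture group keeps the tags in the result instead of discarding them.
    $chunks = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($chunks as $chunk) {
        if ($chunk[0] === '<') {
            $tokens[] = $chunk; // a whole tag, never split further
        } else {
            // Plain text: one token per word.
            foreach (preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $tokens[] = $word;
            }
        }
    }
    return $tokens;
}

print_r(tokenizeHtml('<p>Hello <span class="new">brave</span> world</p>'));
// tokens: <p>, Hello, <span class="new">, brave, </span>, world, </p>
```

Since a tag is always a single token, a diff over these tokens can mark a tag as added or removed but can never cut one in half, which is what produced the broken </</span>p> style output in the question.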
I wonder that nobody mentioned HTMLDiff based on MediaWiki's Visual Diff. Give it a try, I was looking for something like you and found it pretty useful.
What about using an html tidier / formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow
A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
Try running your HTML blocks through this function first:
htmlentities();
That should convert all of your "<"'s and ">"'s into their corresponding codes, perhaps fixing your problem.
// Example:
require_once 'Text/Diff.php';
require_once 'Text/Diff/Renderer.php';
$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";
// Below code adapted from http://www.go4expert.com/forums/showthread.php?t=4189.
// Not sure if/how it works exactly
// Text_Diff compares arrays of lines, so each escaped string is wrapped in an array.
$diff = new Text_Diff('auto', array(array(htmlentities($html_1)), array(htmlentities($html_2))));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);
