A recent version of WhatsApp introduced a little bit of message styling. Suppose we want to write something like this:
Input: This is a ~statement~ which has styling in it
Output: This is a statement which has styling in it (with "statement" struck through)
Even Stack Overflow has this kind of minimal styling, which looks great. We want to implement this on our platform, where teachers giving remarks to students can use ordered lists, unordered lists, bold, and italic. We also want to make sure they cannot use traditional HTML tags, because whenever a tag changes we would have to change our code. Instead, we like the approach where you add a special character around a word and it is rendered the way you want in the output.
I don't know the specific term for this type of editing, so please excuse the vague wording.
Language: our platform already runs on PHP, so we would like to implement this in PHP.
Thought process: we thought it might be possible with regex, but we don't know how to implement ol and ul that way, and we are not sure regex is the right method.
Why not allow traditional HTML tags?
Not all of the teachers know traditional HTML tags.
We want to keep our application secure.
Take a look at the cebe/markdown library on GitHub.
Here are some examples:
// traditional markdown and parse full text
$parser = new \cebe\markdown\Markdown();
$parser->parse($markdown);
// use github markdown
$parser = new \cebe\markdown\GithubMarkdown();
$parser->parse($markdown);
// use markdown extra
$parser = new \cebe\markdown\MarkdownExtra();
$parser->parse($markdown);
// parse only inline elements (useful for one-line descriptions)
$parser = new \cebe\markdown\GithubMarkdown();
$parser->parseParagraph($markdown);
You can use regular expressions like this:
/~([\w]*)~/
With the preg_replace() function you can replace the content between the ~ symbols with whatever you need. For example:
https://regex101.com/r/vD8wI4/2
Note the substitution tab, where I replace ~text~ with <pre>text</pre>.
The same technique is applicable to bold, italic, etc.:
Bold:
/\*([\w]*)\*/
Italic:
/_([\w]*)_/
Etc.
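As a rough sketch of how these pieces could fit together in PHP: escape the input first so raw HTML tags are never interpreted, apply the inline patterns, then build lists line by line. The function name renderRemark is made up, and using "- " and "1. " as list markers is my own assumption. Note that \w only matches letters, digits, and underscores, so the patterns above cover a single word; the character classes below also allow spaces inside a styled span.

```php
<?php
// Hypothetical helper: converts a small set of WhatsApp-style markers to HTML.
function renderRemark(string $text): string
{
    // Escape first so any HTML typed by the user is shown literally.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Inline markers: *bold*, _italic_, ~strikethrough~.
    $inline = [
        '/\*([^*\n]+)\*/' => '<b>$1</b>',
        '/_([^_\n]+)_/'   => '<i>$1</i>',
        '/~([^~\n]+)~/'   => '<del>$1</del>',
    ];
    $safe = preg_replace(array_keys($inline), array_values($inline), $safe);

    // Lists (assumed syntax): "- item" lines form a <ul>, "1. item" lines an <ol>.
    $html = '';
    $open = null; // tag name of the list that is currently open, if any
    foreach (explode("\n", $safe) as $line) {
        if (preg_match('/^- (.+)$/', $line, $m)) {
            $tag = 'ul';
        } elseif (preg_match('/^\d+\. (.+)$/', $line, $m)) {
            $tag = 'ol';
        } else {
            $tag = null;
        }
        // Close the open list when the line type changes.
        if ($open !== null && $open !== $tag) {
            $html .= "</$open>\n";
            $open = null;
        }
        if ($tag === null) {
            $html .= $line . "\n";
            continue;
        }
        if ($open === null) {
            $html .= "<$tag>\n";
            $open = $tag;
        }
        $html .= '<li>' . $m[1] . "</li>\n";
    }
    if ($open !== null) {
        $html .= "</$open>\n";
    }
    return $html;
}
```

Calling renderRemark("This is a ~statement~") wraps the word in a del element. A word like my_var_name would also be italicized by the underscore rule, so a real implementation would probably want stricter word-boundary checks or a proper Markdown library.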
Good luck.
I have been using Html2Text library by Jon Abernathy for a while but now I have a new requirement. Usually the library formats the output text according to the input html like: <b>i am bold</b> will become "HELLO I AM BOLD".
Requirement: I now need the library to leave the text untouched when it finds something like ${this should be not touched}$, including the $ signs and braces.
Question: Is there a better way (perhaps passing parameters) in achieving the above behavior instead of directly modifying the library code?
You don't need to modify the HTML or the library. Just add this to your stylesheet:
b { text-transform:uppercase; }
Now the content of your <b> tags will be displayed in upper case as well as bold.
If you have specific <b> tags that need to be upper-cased and others that don't, then use a class to differentiate them, and change the above CSS to reference the class name.
I am programmatically cleaning up some basic grammar in comments and other user submitted content. Capitalizing I, the first letter of sentence, etc. The comments and content are mixed with HTML as users have some options in formatting their text.
This is actually proving to be a bit more challenging than expected, especially for someone new to PHP and regex.
Is there a function like ucfirst that will ignore HTML to help capitalize sentences?
Also, any links or tutorials on cleaning up text like this in html, would be appreciated. Please leave anything you feel would help in the comments. thanks!
EDIT:
Sample Text:
<div><p>i wuz walkin thru the PaRK and found <strong>ur dog</strong>. <br />i hoPe to get a reward.<br /> plz call or text 7zero4 8two8 49 sevenseven</div>
I need for it to be (ultimately)
<div><p>I was walking through the park and found <strong>your dog</strong>.</p><p>I hope to get a reward.</p><p>Please call or text (704) 828-4977.</p></div>
I know this is going a little farther than the intended question, but my thought was to do this incrementally. ucfirst() is just one of many functions I was using to do one small cleanup at a time per scan. Even if I had to run the text 100 times through the filter, this runs on a cron run when the site has no traffic. I wish there was a discussion forum where this could continue as obviously there would be some great ideas on continuing the approach. Any thoughts on how to approach this as an overall project by all means please leave a comment.
I guess in the spirit of the question itself. ucfirst then would not be the best function for this as it could not take an argument list of things to ignore. A flag IGNORE_HTML would be great!
Given this is a PHP question, then the DOM parser recommended below sounds like the best answer? Thoughts?
You can also add a CSS pseudo-element to your desired elements like this:
div:first-letter {
text-transform: uppercase;
}
But you will probably need to change the way you print out your sentences (if you are printing them all in one huge tag), since CSS cannot detect the start of a new sentence inside a single tag.
You should probably use a DOM parser (either the built-in one, DOMDocument, or for example PHP Simple HTML DOM Parser, which is really easy to use).
Walk through all of the text nodes in your HTML and perform the clean-up with preg_replace_callback, ucfirst and a regular expression like this one:
'/(\s*)([^.?!]*)/'
This will match a string of whitespace, and then as many non-sentence-ending-punctuation characters as possible. The actual sentence (starting with a letter, unless your sentence starts with ", which complicates things a bit) will then be found in the second capturing group; the first one holds the leading whitespace.
But from your question, I suppose you are already doing something like the latter and your code is just choking on HTML tags. Here is some example code to get all text nodes with the second DOM parser I linked:
require 'simple_html_dom.php';
$html = new simple_html_dom();
$html->load($fullHtmlStr);
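foreach ($html->find('text') as $textNode) {
    // Assigning to the loop variable would not change the document,
    // so write the cleaned text back through the node instead.
    $textNode->innertext = cleanupFunction($textNode->innertext);
}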
$cleanedHtmlStr = $html->save();
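The same idea with the built-in DOM extension might look roughly like this. It is only a sketch: capitalizeSentencesInHtml is a made-up name, the wrapper div with id "root" is my own device for getting the fragment back out, and it handles plain ASCII letters only. It carries a "sentence start" flag from one text node to the next so that text inside an inline tag in the middle of a sentence is not capitalized.

```php
<?php
// Hypothetical helper: capitalizes sentence starts in text nodes only.
function capitalizeSentencesInHtml(string $fragment): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy user HTML
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $fragment . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $upper = fn (array $m): string => $m[1] . strtoupper($m[2]);
    $atStart = true; // true while the next letter begins a sentence

    foreach ($xpath->query('//div[@id="root"]//text()') as $node) {
        $text = $node->data;
        if ($atStart) {
            // First letter of the node, allowing leading whitespace.
            $text = preg_replace_callback('/^(\s*)([a-z])/', $upper, $text);
        }
        // Letters that follow sentence-ending punctuation inside the node.
        $text = preg_replace_callback('/([.?!]\s+)([a-z])/', $upper, $text);
        if (trim($text) !== '') {
            // Remember whether this node ended a sentence for the next node.
            $atStart = (bool) preg_match('/[.?!]\s*$/', $text);
        }
        $node->data = $text;
    }

    // Serialize only the children of the wrapper, not the whole document.
    $out = '';
    foreach ($xpath->query('//div[@id="root"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Spelling fixes such as "wuz" to "was" could be plugged into the same loop as further replacements on $text.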
In HTML this will be very difficult to do, as you will be building some kind of HTML parser. My suggestion would be to clean up the text before it is transformed into HTML, at the moment you pull it out of the database. Or even better, clean up the database once.
This should do it:
function html_ucfirst($s) {
return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
return $c[1].ucfirst(array_pop($c));
}, $s);
}
Converts
<b>foo</b> to <b>Foo</b>,
<div><p>test</p></div> to <div><p>Test</p></div>,
but also bar to Bar.
Edit: According to your detailed question, you probably want to apply this function to each sentence. You will have to parse the text first (e.g. splitting by periods).
I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]*+(?:(?!<blockquote>)|(?R))*+)*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]*+(?:(?!<blockquote>)|(?R))*+)*+)</blockquote>~';
function blockquoteCallback($matches) {
return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockQuoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote or anywhere else.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type - 2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that regular expressions can't keep a count,
i.e. they can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to somehow keep a count, maybe with a stack or a tree.
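One way to keep that count in PHP, as a minimal sketch: split the string on the blockquote and br tags and track the nesting depth while walking the tokens. The name indentBlockquotes is made up, and I am assuming the goal is a newline followed by five spaces per nesting level, with no attributes on the tags and well-formed input as stated in the question.

```php
<?php
// Hypothetical helper: indents blockquote content by 5 spaces per nesting level.
function indentBlockquotes(string $html): string
{
    // Split on <blockquote>, </blockquote> and <br>, keeping the tags as tokens.
    $tokens = preg_split('~(</?blockquote>|<br\s*/?>)~i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $depth = 0; // current blockquote nesting level
    $out = '';
    foreach ($tokens as $token) {
        $lower = strtolower($token);
        if ($lower === '<blockquote>') {
            $depth++;
            $out .= "\n" . str_repeat(' ', 5 * $depth);
        } elseif ($lower === '</blockquote>') {
            $depth = max(0, $depth - 1);
            $out .= "\n" . str_repeat(' ', 5 * $depth);
        } elseif ($depth > 0 && preg_match('~^<br\s*/?>$~i', $token)) {
            // A line break inside a quote starts a new indented line.
            $out .= "\n" . str_repeat(' ', 5 * $depth);
        } else {
            // Plain text, or a <br> outside any blockquote, is kept as-is.
            $out .= $token;
        }
    }
    return $out;
}
```

Breaks outside of any blockquote are left alone, so the PDF class can still handle them however it does today.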
What is the best way to highlight a searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable,
and I want to highlight the searched term while excluding text inside tags.
For example, if the user searches for "img", the img tag should be ignored, but the
phrase "img" within the text should be highlighted.
Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
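A minimal PHP sketch of that approach, under a few assumptions of mine: highlightTerm is a made-up name, matches are wrapped in a mark element, and text inside script and style elements is skipped. Only text nodes are touched, so the search term can never match inside a tag name or attribute.

```php
<?php
// Hypothetical helper: wraps matches of $term in <mark>, in text nodes only.
function highlightTerm(string $html, string $term): string
{
    if ($term === '') {
        return $html; // nothing to search for
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate real-world markup
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    // Remove the prolog again so it does not show up in the output.
    foreach (iterator_to_array($doc->childNodes) as $child) {
        if ($child->nodeType === XML_PI_NODE) {
            $doc->removeChild($child);
        }
    }

    $xpath = new DOMXPath($doc);
    $pattern = '/(' . preg_quote($term, '/') . ')/i';
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';
    // Copy the node list first because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        // With one capture group, odd-indexed parts are the matches.
        $parts = preg_split($pattern, $node->data, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $mark = $doc->createElement('mark');
                $mark->appendChild($doc->createTextNode($part));
                $fragment->appendChild($mark);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }
    return $doc->saveHTML();
}
```

Searching for "img" in a paragraph that contains both the word and an img element should wrap only the word, while the element and its src attribute stay untouched.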
Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
private String highlight(String search,String html) {
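// Pattern.quote keeps regex metacharacters in the search term literal.
return html.replaceAll("(>[^<]*)(" + java.util.regex.Pattern.quote(search) + ")([^>]*<)", "$1<em>$2</em>$3");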
}
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain it is that you ensure your term sits between two tags and is thus not itself a tag or part of a tag attribute.
var highlight = function(what){
var html = document.body.innerHTML,
word = "(" + what + ")",
match = new RegExp(word, "gi");
html = html.replace(match, "<span style='background-color: red'>$1</span>");
document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, breaking your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
The best way is probably to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/
There is a free JavaScript library that might help you out: http://scott.yang.id.au/code/se-hilite/
You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it in the server-side language itself, which may be PHP, Java, or any other language.
This way you would have only the result strings without html and without parsing overhead.
I need to take two text blocks with html tags and render a comparison - merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I try to throw text with html tags in it, it gets UGLY. Because of the word and character-based compare algorithms the class uses, html tags get broken and I end up with ugly stuff like <p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I've been working on this for weeks :[
This is the best solution I could think of: find/replace each type of HTML tag with one special non-standard character like the Apple logo (opt shift k), render the comparison with this kind of primitive markdown, then revert the non-standard characters back into tags. Any feedback?
Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there's an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)
The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
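To illustrate the "tags as atomic tokens" idea, here is a small sketch of a tokenizer. The name tokenizeHtml is made up, and it assumes there is no ">" inside attribute values. Each tag becomes one token, and the text in between is split into words and whitespace runs.

```php
<?php
// Hypothetical tokenizer: every HTML tag is one token; text is split into
// words and runs of whitespace.
function tokenizeHtml(string $html): array
{
    return preg_split(
        '~(<[^>]+>|\s+)~',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Example: tags such as <b> and </b> stay whole instead of being cut up.
print_r(tokenizeHtml('<p>Hello <b>big</b> world</p>'));
```

The resulting arrays for the old and new versions could then be handed to any diff routine that compares arrays of items rather than characters, so a tag is either kept, added, or removed as a whole.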
I'm surprised nobody mentioned HTMLDiff, based on MediaWiki's Visual Diff. Give it a try; I was looking for something like this and found it pretty useful.
What about using an html tidier / formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow
A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
The following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
Try running your HTML blocks through this function first:
htmlentities();
That should convert all of your "<"'s and ">"'s into their corresponding codes, perhaps fixing your problem.
// Example:
$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";
// Below code taken from http://www.go4expert.com/forums/showthread.php?t=4189.
// Not sure if/how it works exactly
// "&new" is no longer valid PHP, so plain "new" is used here.
$diff = new Text_Diff(htmlentities($html_1), htmlentities($html_2));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);