PHP: htmlentities/strip_tags

I've been rewriting my website lately and added a syntax highlighter so that I can post code snippets. Before, all I did was run htmlentities() on the string so that it would be safe and not break anything, but now that I have to use a <pre> to highlight code, htmlentities() effectively removes the syntax highlighting from the page. I've been trying to come up with a function that will just perform htmlentities() on anything between two tags (<entitiesparse> </entitiesparse>), but nothing seems to work. Does anyone know of a function that can either:
a) Set it to htmlentities() everything but specific tags (like strip_tags())
OR
b) Only htmlentities() things in certain tags (As mentioned above)

You only need to apply htmlentities() to the raw content. So you can apply htmlentities() to the raw content (the article text) and then invoke a function to add syntax highlighting after that. So long as you check that your syntax highlighting code cannot introduce unexpected nasties, you don't need to call htmlentities() again.
And if you're saying that you use the a element to highlight code, I strongly suggest you use the code element instead, which is designed to provide markup for lines or blocks of programming code. The a element should only be used as an anchor for a hyperlink.
For instance, you could use
<code class="highlighted-code">/* line of code here */</code>
Then you could use a cascading style sheet to provide background colour for any element of type code with class equal to "highlighted-code", for instance:
code.highlighted-code {background-color: yellow}
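For option (b) in the question, one possible approach is to escape only the text between the marker tags and leave everything else alone. The following is a minimal sketch, not a definitive implementation: the function name escapeMarkedCode is hypothetical, the <entitiesparse> marker comes from the question, and the highlighted-code class comes from the example above. It assumes the markers are never nested.

```php
<?php
// Escape only what sits between the marker tags, then wrap it in <code>.
// escapeMarkedCode() is a hypothetical helper name used for illustration.
function escapeMarkedCode(string $html): string
{
    $result = preg_replace_callback(
        '~<entitiesparse>(.*?)</entitiesparse>~s',
        function (array $m): string {
            // Escape the raw snippet once; the wrapper added here is trusted markup.
            $safe = htmlentities($m[1], ENT_QUOTES, 'UTF-8');
            return '<code class="highlighted-code">' . $safe . '</code>';
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo escapeMarkedCode('<p>Intro</p><entitiesparse><b>bold</b> & "quoted"</entitiesparse>');
```

Because the escaping happens before the wrapper is added, the wrapper itself is never escaped, which is the ordering the answer above recommends.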

Related

PHPCS warning for wordpress escaping function

Could anyone please help me resolve the issue shown in the picture?

The message shown for that line of code means exactly what it says. If what it means by an "escaping function" is not clear, you can search the Web for lots of information about that. It's a fundamental concept in programming, and in website programming in general. You are going to need to understand it to write good, usable, robust webapps / websites.
Another place to start learning about what the message is telling you is noted right in the message: the WordPress Developer Handbook.

When you print any content as output without properly escaping it, phpcs generates an error saying "Output should be run through an escaping function". Here is some information on how to escape properly.
esc_html() escapes content for use inside an HTML element. It will NOT render HTML contained in the value: any tags are converted to entities and shown as literal text rather than interpreted as markup.
<span><?php echo esc_html($title); ?></span>
esc_attr() is used to escape values placed inside HTML tag attributes.
<div id="<?php echo esc_attr($variableId); ?>"></div>
wp_kses() is for content whose HTML tags must be kept. You pass an array of allowed HTML tags and attributes, and those allowed tags are not stripped when printing. In the example below only the id attribute is allowed on span, so the class attribute is removed.
$title = '<span id="testid" class="className">This is a test content</span>';
echo wp_kses($title, ['span' => ['id' => []]]);
So if your output needs to contain an HTML tag such as svg, you can use wp_kses() with that tag in the allowed list so that the required tag is not wiped out by the escaping, and this will fix the phpcs output error.

What exactly is meant by 'line feeds' in HTML and PHP? How are they added in HTML and PHP code?

I was reading the PHP Manual and came across the following paragraph:
Line feeds have little meaning in HTML, however it is still a good
idea to make your HTML look nice and clean by putting line feeds in. A
linefeed that follows immediately after a closing ?> will be removed
by PHP. This can be extremely useful when you are putting in many
blocks of PHP or include files containing PHP that aren't supposed to
output anything. At the same time it can be a bit confusing. You can
put a space after the closing ?> to force a space and a line feed to
be output, or you can put an explicit line feed in the last echo/print
from within your PHP block.
I have the following questions about the paragraph above:
What exactly is meant by 'line feeds' in HTML?
How do I add them to the HTML code as well as PHP code and make them visible in a web browser? What HTML entities/tags/characters are used to achieve this?
Is the meaning of 'line feed' the same in HTML and PHP? If not, how does the meaning differ between the two contexts?
Why does the PHP manual say this in the very first line of the paragraph? What does it mean by the sentence below?
"Line feeds have little meaning in HTML"
How can it be useful to remove a linefeed that follows immediately after a closing ?> tag when someone is putting in many blocks of PHP, or include files containing PHP, that aren't supposed to output anything?
Could someone please clear up these doubts in simple, easy-to-understand language? Suitable working code examples would be a great help in understanding the concept more clearly.
Thank you.

What does exactly mean by 'Line feeds' in HTML?
It is a general computing term.
The character (0x0a in ASCII) which advances the paper by one line in a teletype or printer, or moves the cursor to the next line on a display.
— source: Wiktionary
How to add them to the HTML code
Press the enter key on your keyboard. Note that (with a couple of exceptions like <pre>) all whitespace characters are interchangeable in HTML. A new line will be treated as a space.
as well as PHP code
Ditto … or you could use the escape sequence \n inside a string literal.
and make visible in a web browser?
The material you quoted is talking about making source code look nice. You generally don't want line feed characters to be visible in a browser.
You could use a <pre> element instead.
Outside of <pre> elements (and the CSS setting they have by default) you can use a space instead of a new line for the same effect in HTML.
What HTML entities/tags/characters are used to achieve this?
… but the advice given in the last sentence of the material you quoted is probably a better approach.

'Line feed' simply means a new line in both HTML and PHP; only the syntax is different.
In HTML, you can use the <br> or <br/> tag for a line break. This tag starts a new line in the rendered output when the page is displayed in the browser.
Take the following example of the <br> tag:
<html> <body>
<p> To break lines<br>in a text,<br/>use the br element. </p>
</body> </html>
Output:
To break lines
in a text,
use the br element.
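In PHP, you can use "\n" for a line feed. It must be written inside a double-quoted string; in a single-quoted string, '\n' is a literal backslash followed by the letter n.
If you are echoing a string in PHP, then instead of writing
echo "New \nLine";
you can use the nl2br() function to get a line break in the browser, like: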
echo nl2br("New \nLine");
Output:
New
Line
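To make the difference concrete, here is a small sketch contrasting double-quoted and single-quoted strings and showing what nl2br() actually emits. The helper name textToHtml is hypothetical; escaping with htmlentities() before nl2br() is an assumption about how the text would be displayed, not something the answer above prescribes.

```php
<?php
// Double quotes: \n is a real line feed character (0x0a).
$double = "New \nLine";
// Single quotes: a literal backslash followed by the letter n.
$single = 'New \nLine';

// Hypothetical helper: escape the text first, then turn each line feed
// into a <br /> tag so the break becomes visible in the browser.
function textToHtml(string $text): string
{
    return nl2br(htmlentities($text, ENT_QUOTES, 'UTF-8'));
}

echo textToHtml($double), "\n"; // a <br /> is inserted before the line feed
echo textToHtml($single), "\n"; // unchanged: there is no line feed to convert
```

Note that nl2br() keeps the original line feed and inserts the <br /> tag in front of it, so the HTML source stays readable as well.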

Recursive Contents of HTML tag using regex

I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.

You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.

~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~';
// doIndent() is your own function that indents the matched text; it is not defined here.
function blockquoteCallback($matches) {
    // Handle nested blockquotes first, then indent this level.
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote tags.
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
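A self-contained sketch of the recursive approach might look like the following. Everything here is an assumption layered on the idea above: the names BLOCKQUOTE_PATTERN, indentLines and expandBlockquotes are hypothetical, the pattern treats any "<" that does not start a blockquote tag (for example <br>) as ordinary content, and the indentation rule (each <br> becomes a line break and every line gets five leading spaces per nesting level) is one reading of the requirement in the question.

```php
<?php
// Content of a blockquote is plain text, a "<" that does not begin a
// blockquote tag, or a nested blockquote matched through (?R).
const BLOCKQUOTE_PATTERN = '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~';

// Hypothetical helper: split on <br> or line feeds and indent each line by five spaces.
function indentLines(string $text): string
{
    $lines = preg_split('~<br\s*/?>|\n~i', trim($text, "\n"));
    $indented = array_map(fn(string $line): string => '     ' . $line, $lines);
    return "\n" . implode("\n", $indented) . "\n";
}

// Replace every blockquote, innermost first, with its indented text.
function expandBlockquotes(string $html): string
{
    $result = preg_replace_callback(
        BLOCKQUOTE_PATTERN,
        fn(array $m): string => indentLines(expandBlockquotes($m[1])),
        $html
    );
    return $result ?? $html; // null signals a regex error; return the input unchanged
}

echo expandBlockquotes(
    'To login:<blockquote>Username: someUsername<br>Password: somePassword</blockquote>Thank you.'
);
```

Because the inner call runs before the outer indentation is applied, a doubly nested blockquote ends up indented by ten spaces. Like the pattern above, this sketch assumes well-formed HTML and blockquote tags without attributes.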

Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type 2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that regular expressions can't keep a count,
i.e. they can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to somehow keep a count, maybe with a stack or a tree.

Text Search - Highlighting the Search phrase

What is the best way to highlight the searched phrase within an HTML document?
I have the complete HTML document as a large string in a variable,
and I want to highlight the searched term, excluding text within tags.
For example, if the user searches for "img", the img tag should be ignored but
the phrase "img" within the text should be highlighted.

Don't use regex.
Because regex cannot parse HTML (or even come close), any attempt to mess around with matching words in an HTML string risks breaking words that appear in markup. A badly-implemented HTML regex hack can even leave you with HTML-injection vulnerabilities which an attacker may be able to leverage to do cross-site-scripting.
Instead you should parse the HTML and do the searches on the text content only.
If you can accept a solution that adds the highlighting from JavaScript on the client side, this is really easy because the browser will already have parsed the HTML into a bunch of DOM objects you can manipulate. See eg. this question for a client-side example.
If you have to do it with PHP that's a bit more tricky. The simple solution would be to use DOMDocument::loadHTML and then translate the findText function from the above example into PHP. At least the DOM methods used are standardised so they work the same.
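A rough PHP sketch of that DOM-based approach is shown below. It is not a translation of the findText function mentioned above; the function name highlightTerm, the choice of <em> as the wrapper, and the decision to skip text inside script and style elements are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: wrap each occurrence of $term in <em>, touching only text nodes.
function highlightTerm(string $html, string $term): string
{
    if ($term === '') {
        return $html;
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);               // tolerate imperfect markup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // the prolog hints at UTF-8 input
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }

    $xpath = new DOMXPath($doc);
    // Snapshot the text nodes first, because the tree is modified in the loop.
    $textNodes = iterator_to_array(
        $xpath->query('//body//text()[not(ancestor::script) and not(ancestor::style)]')
    );
    $pattern = '~(' . preg_quote($term, '~') . ')~i';

    foreach ($textNodes as $node) {
        // Odd indexes hold the matched term, even indexes the text around it.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue;
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            $text = $doc->createTextNode($part);
            if ($i % 2 === 1) {
                $em = $doc->createElement('em');
                $em->appendChild($text);
                $fragment->appendChild($em);
            } else {
                $fragment->appendChild($text);
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialise only the children of <body>, dropping the wrapper the parser added.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightTerm('<p>An img tag: <img src="img.png" alt="img"> here</p>', 'img');
```

Since only text nodes are rewritten, tag names and attribute values such as src="img.png" are never touched, which addresses the "img" example in the question. The serialised markup may differ cosmetically from the input because it has been through the parser.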

Edit: This was tagged as Java before, so this answer might not be applicable.
This is quick and dirty but it might work for you, or at least be a starting point
private String highlight(String search,String html) {
return html.replaceAll("(>[^<]*)("+search+")([^>]*<)","$1<em>$2</em>$3");
}
This requires testing, and I make no guarantees that it's correct, but the simplest way to explain how it works is that it ensures your term appears between two tags and is thus not itself a tag or part of a tag attribute.

var highlight = function(what){
    var html = document.body.innerHTML,
        word = "(" + what + ")",
        match = new RegExp(word, "gi");
    html = html.replace(match, "<span style='background-color: red'>$1</span>");
    document.body.innerHTML = html;
};
highlight('ll');
This would highlight any occurrence of 'll'.
Be careful when calling highlight() with < or > or any tag name: it would replace those too, breaking your markup. You might work around that by reading innerText instead of innerHTML, but that way you'll lose the markup information.
The best way is probably to implement a parser routine yourself.
Example: http://www.jsfiddle.net/DRtVn/

There is a free JavaScript library that might help you out: http://scott.yang.id.au/code/se-hilite/

You must be using some server side language to render the search results on the webpage.
So the best way I can think of is to highlight the word while rendering it using the server-side language itself, which may be PHP, Java or any other language.
This way you would have only the result strings without html and without parsing overhead.

How to show a comparison of 2 html text blocks

I need to take two text blocks with html tags and render a comparison - merge the two text blocks and then highlight what was added or removed from one version to the next.
I have used the PEAR Text_Diff class to successfully render comparisons of plain text, but when I try to throw text with html tags in it, it gets UGLY. Because of the word and character-based compare algorithms the class uses, html tags get broken and I end up with ugly stuff like <p><span class="new"> </</span>p>. It slaughters the html.
Is there a way to generate a text comparison while preserving the original valid html markup?
Thanks for the help. I've been working on this for weeks :[
This is the best solution I could think of: find/replace each type of html tag with 1 special non-standard character like the apple logo (opt shift k), render the comparison with this kind of primitive markdown, then revert the non-standard characters back into tags. Any feedback?

Simple Diff, by Paul Butler, looks as though it's designed to do exactly what you need: http://github.com/paulgb/simplediff/blob/5bfe1d2a8f967c7901ace50f04ac2d9308ed3169/simplediff.php
Notice in his php code that there's an html wrapper: htmlDiff($old, $new)
(His blog post on it: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/)

The problem seems to be that your diff program should be treating existing HTML tags as atomic tokens rather than as individual characters.
If your engine has the ability to limit itself to working on word boundaries, see if you can override the function that determines word boundaries so it recognizes and treats HTML tags as a single "word".
You could also do as you are saying and create a lookup dictionary of distinct HTML tags that replaces each with a distinct unused Unicode value (I think there are some user-defined ranges you can use). However, if you do this, any changes to markup will be treated as if they were a change to the previous or following word, because the Unicode character will become part of that word to the tokenizer. Adding a space before and after each of your token Unicode characters would keep the HTML tag changes separate from the plain text changes.
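As a rough illustration of the "tags as atomic tokens" idea, the sketch below splits an HTML string into a flat list in which every tag is one token and every whitespace-separated word is another. The function name tokenizeHtml is hypothetical, and the sketch assumes that attribute values never contain a ">" character. The resulting arrays could then be handed to whichever diff engine accepts pre-tokenized input; no particular library API is implied.

```php
<?php
// Hypothetical helper: each HTML tag becomes a single token, each word another.
function tokenizeHtml(string $html): array
{
    $tokens = preg_split(
        '~(<[^>]+>)|\s+~',                             // captured: a whole tag; uncaptured: whitespace
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY // keep the tags, drop empty pieces
    );
    return $tokens === false ? [] : $tokens;
}

print_r(tokenizeHtml('<p>Hello <b>big</b> world</p>'));
```

With this tokenization a changed tag shows up as its own inserted or deleted token instead of being fused with the neighbouring word, which is the effect the spacing trick above is meant to achieve.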

I wonder that nobody mentioned HTMLDiff based on MediaWiki's Visual Diff. Give it a try, I was looking for something like you and found it pretty useful.

What about using an html tidier / formatter on each block first? This will create a standard "structure" which your diff might find easier to swallow

A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.

Try running your HTML blocks through this function first:
htmlentities();
That should convert all of your "<"s and ">"s into their corresponding entities, perhaps fixing your problem.
//Example:
$html_1 = "<html><head></head><body>Something</body></html>";
$html_2 = "<html><head></head><body><p id='abc'>Something Else</p></body></html>";
//Below code taken from http://www.go4expert.com/forums/showthread.php?t=4189.
//Not sure if/how it works exactly
//Note: the old "&new" syntax was removed in PHP 7, so plain "new" is used here.
$diff = new Text_Diff(htmlentities($html_1), htmlentities($html_2));
$renderer = new Text_Diff_Renderer();
echo $renderer->render($diff);
