I'm trying to support multiple languages on my site. Some of the content that needs translating will contain special characters like Ç. I could use htmlentities to convert that into an entity reference (&Ccedil;). However, what if I need to translate a string that has markup:
"<p>Hello, world with Ç</p>"
If I use htmlentities, the < and > would be converted, too. I don't want to break down the string into tags and non-tag parts, then apply htmlentities only to the non-tag parts. That'll be too messy and tedious.
A workaround: pass your string to the following function and work with the returned string.
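function unicode_escape_sequences($str){
    // json_encode() escapes every non-ASCII character as \uXXXX
    $working = json_encode($str);
    // Rewrite each \uXXXX escape as the numeric entity &#xXXXX;
    // (characters outside the BMP come out as surrogate pairs and are not handled here)
    $working = preg_replace('/\\\\u([0-9a-f]{4})/', '&#x$1;', $working);
    // Decoding undoes the remaining JSON escaping (quotes, slashes)
    return json_decode($working);
}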
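Another way to get the same effect, sketched here as an alternative to the JSON round trip rather than as part of the original answer: encode everything with htmlentities, then decode only the markup characters again with htmlspecialchars_decode. The function name encode_non_ascii is hypothetical.

```php
<?php
// Hypothetical helper: turn non-ASCII characters into named entities
// while leaving the markup characters (<, >, &) as they were.
function encode_non_ascii(string $html): string
{
    // Step 1: encode everything that has a named entity, including <, > and &.
    $encoded = htmlentities($html, ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Step 2: decode only &lt;, &gt; and &amp; again, restoring the tags.
    return htmlspecialchars_decode($encoded, ENT_NOQUOTES);
}

echo encode_non_ascii('<p>Hello, world with Ç</p>'), "\n";
// <p>Hello, world with &Ccedil;</p>
```

Entities that were already present in the input, such as &amp;lt;, survive the round trip unchanged because the second step decodes exactly one level of escaping.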
Related
I have some text that I will be saving to my DB. The text may look something like this: Welcome & This is a test paragraph. When I save this text to my DB after processing it using htmlspecialchars() and htmlentities() in PHP, the sentence will look like this: Welcome &amp; This is a test paragraph.
When I retrieve and display the same text, I want it to be in the original format. How can I do that?
This is the code that I use:
$text= htmlspecialchars(htmlentities($_POST['text']));
$text= mysqli_real_escape_string($conn,$text);
There are two problems.
First, you are double-encoding HTML characters by using both htmlentities and htmlspecialchars. Both of those functions do the same thing, but htmlspecialchars only does it with a subset of characters that have HTML character entity equivalents (the special ones.) So with your example, the ampersand would be encoded twice (since it is a special character), so what you would actually get would be:
$example = 'Welcome & This is a test paragraph';
$example = htmlentities($example);
var_dump($example); // 'Welcome &amp; This is a test paragraph'
$example = htmlspecialchars($example);
var_dump($example); // 'Welcome &amp;amp; This is a test paragraph'
Decide which one of those functions you need to use (probably htmlspecialchars will be sufficient) and use only one of them.
Second, you are using these functions at the wrong time. htmlentities and htmlspecialchars will not do anything to "sanitize" your data for input into your database. (Not saying that's what you're intending, as you haven't mentioned this, but many people do seem to try to do this.) If you want to protect yourself from SQL injection, bind your values to prepared statements. Escaping it as you are currently doing with mysqli_real_escape_string is good, but it isn't really sufficient.
htmlspecialchars and htmlentities have specific purposes: to convert characters in strings that you are going to output into an HTML document. Just wait to use them until you are ready to do that.
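A minimal sketch of that advice, assuming the raw text is stored unchanged and escaped only when it is printed. The helper name e is hypothetical, and the database step is left out so the example runs on its own. The last lines show how text that was already double-encoded before storage can be recovered by decoding twice.

```php
<?php
// Hypothetical helper: escape a raw string exactly once, at output time.
function e(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Store this value as-is (via a prepared statement); do not encode it first.
$raw = 'Welcome & This is a test paragraph';

// Encode only when writing it into an HTML page.
echo e($raw), "\n"; // Welcome &amp; This is a test paragraph

// Rows saved with the old double-encoding code can be repaired by decoding twice.
$stored    = htmlspecialchars(htmlentities($raw)); // what the old code produced
$once      = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
$recovered = html_entity_decode($once, ENT_QUOTES, 'UTF-8');
```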
I made this function
function echoSanitizer($var)
{
$var = htmlspecialchars($var, ENT_QUOTES);
$var = nl2br($var, false);
$var = str_replace(array("\\r\\n", "\\r", "\\n"), "<br>", $var);
$var = htmlspecialchars_decode($var);
return stripslashes($var);
}
Would it be safe from xss attacks?
htmlspecialchars to take away html tags
nl2br for the new lines
str_replace to convert the \r\n to <br>
htmlspecialchars_decode to convert back the original characters
stripslashes to STRIPSLASHES
Why do I need all of that? Because I want to preview what the users typed in, and I want a WYSIWYG view for them. Some of the input comes from a textarea box and I want the line breaks to be preserved, so nl2br is needed.
Mainly I'm asking about htmlspecialchars_decode because it's new to me. Is it safe? As a whole, is the function I made safe if I use it to display user input?
(No database involved in this scenario.)
In your case htmlspecialchars_decode() makes the function unsafe. Users must not be allowed to insert an unescaped < character, because that allows them to create arbitrary tags (and filtering/blacklisting is a cat and mouse game you can't win).
At the very minimum < must be escaped as &lt;.
If you only allow plain text with newlines, then:
nl2br(htmlspecialchars($text_with_newlines, ENT_QUOTES));
is safe to output in HTML (except inside <script> or attributes that expect JavaScript or URLs such as onclick and href (in the latter case somebody could use javascript:… URL)).
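As a sketch, the preview function from the question could be reduced to those two calls. The name preview_plain_text is hypothetical, and the caveats above about script blocks and JavaScript or URL attributes still apply.

```php
<?php
// Hypothetical replacement for echoSanitizer(): plain text in, safe HTML out.
function preview_plain_text(string $text): string
{
    // Escape first so user-supplied < and > can never form tags.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Then add <br> before each newline; these are the only tags in the output.
    return nl2br($escaped, false);
}

echo preview_plain_text("<script>alert(1)</script>\nline two"), "\n";
```

The order matters: calling nl2br first and escaping afterwards would turn the inserted line-break tags into visible text.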
If you want to allow users to use HTML tags, but not exploit your page, then the correct function to do this won't fit in a StackOverflow post (it is thousands of lines long and requires a full HTML parser, processing of URLs and CSS, etc.). You'll have to use something heavyweight like HTMLPurifier.
I have a custom forum in which I employ htmlentities so users aren't able to post malicious code(html/js). Anyway, as I am pulling posts from the database, I use str_replace in order to show certain html elements <, >, &, etc.. is there any harm in doing this? Will it cause side effects/html to render?
User posts data
Data is escaped for mysql, written to DB
User makes request for data
Data is encoded for display (aggressively with htmlentities or htmlspecialchars, or some subset of allowed characters. You could do this with str_replace, but there are better utilities).
Use strip_tags to avoid any html / js / php.
It has some options to allow any tags you want like this:
strip_tags($text, '<p><a>');
strip_tags, as stated in the documentation, does not sanitise what it leaves behind: attributes on the tags you allow (such as inline JavaScript in onclick or onmouseover) pass through untouched, so it isn't a good idea. A common solution is to use BBCode instead, for which many libraries exist, or you can define your own markup and then use preg_replace to substitute in safe HTML.
Here's a quick sample:
$safe_output = htmlspecialchars($output);
$find = array("'\[b\](.*?)\[/b\]'is");
$replace = array("<strong>\\1</strong>");
$result = preg_replace($find, $replace, nl2br($safe_output));
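The same idea wrapped in a function and extended to a second tag, as a rough sketch. The function name render_bbcode and the [i] rule are assumptions added for illustration; a real BBCode library handles nesting and malformed input far more carefully.

```php
<?php
// Hypothetical helper: escape all user input, then re-introduce a small,
// fixed set of tags from BBCode-style markers.
function render_bbcode(string $input): string
{
    // Everything the user typed becomes inert text first.
    $safe = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Only these markers are turned into HTML; the tags contain no attributes.
    $rules = [
        '#\[b\](.*?)\[/b\]#is' => '<strong>$1</strong>',
        '#\[i\](.*?)\[/i\]#is' => '<em>$1</em>',
    ];
    $html = preg_replace(array_keys($rules), array_values($rules), $safe);

    return nl2br($html, false);
}

echo render_bbcode('[b]bold[/b] and [i]italic[/i] <script>'), "\n";
```

Because the generated tags carry no attributes, there is no place for inline JavaScript to survive.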
I need to replace characters in a string with their HTML coding.
Ex. The "quick" brown fox, jumps over the lazy (dog).
I need to replace the quotation marks with &quot; and replace the brackets with &#40; and &#41;.
I have tried str_replace, but I can only get 1 character to be replaced. Is there a way to replace multiple characters using str_replace? Or is there a better way to do this?
Thanks!
I suggest using the function htmlentities().
Have a look at the Manual.
PHP has a number of functions to deal with this sort of thing:
Firstly, htmlentities() and htmlspecialchars().
But as you already found out, they won't deal with ( and ) characters, because these are not characters that ever need to be rendered as entities in HTML. I guess the question is why you want to convert these specific characters to entities? I can't really see a good reason for doing it.
If you really do need to do it, str_replace() will do multiple string replacements, using arrays in both the search and replace parameters:
$output = str_replace(array('(',')'), array('&#40;','&#41;'), $input);
You can also use the strtr() function in a similar way (note that the input string comes first):
$conversions = array('('=>'&#40;', ')'=>'&#41;');
$output = strtr($input, $conversions);
Either of these would do the trick for you. Again, I don't know why you'd want to though, because there's nothing special about ( and ) brackets in this context.
While you're looking into the above, you might also want to look up get_html_translation_table(), which returns an array of entity conversions as used in htmlentities() or htmlspecialchars(), in a format suitable for use with strtr(). You could load that array and add the extra characters to it before running the conversion; this would allow you to convert all the normal entity characters at the same time.
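A short sketch of that approach, using the sentence from the question. The function name encode_with_extras is hypothetical; the table is the one htmlspecialchars() uses with ENT_QUOTES, with the two bracket entries added by hand.

```php
<?php
// Hypothetical helper: htmlspecialchars() conversions plus ( and ).
function encode_with_extras(string $input): string
{
    // Maps ", &, ', < and > to their entities.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Extra characters requested in the question.
    $table['('] = '&#40;';
    $table[')'] = '&#41;';

    // strtr() never re-scans text it has already replaced,
    // so the & inside the new entities is not encoded a second time.
    return strtr($input, $table);
}

echo encode_with_extras('The "quick" brown fox, jumps over the lazy (dog).'), "\n";
// The &quot;quick&quot; brown fox, jumps over the lazy &#40;dog&#41;.
```

This is also why strtr() is a better fit here than a chain of str_replace() calls, where a later replacement could touch the output of an earlier one.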
I would point out that if you serve your page with the UTF8 character set, you won't need to convert any characters to entities (except for the HTML reserved characters <, > and &). This may be an alternative solution for you.
You also asked in a separate comment about converting line feeds. These can be converted with PHP's nl2br() function, but could also be done using str_replace() or strtr(), so could be added to a conversion array with everything else.
I'm having some problems using strip_tags PHP function when the string contains 'less than' and 'greater than' signs. For example:
If I do:
strip_tags("<span>some text <5ml and then >10ml some text </span>");
I'll get:
some text 10ml some text
But, obviously I want to get:
some text <5ml and then >10ml some text
Yes, I know that I could use &lt; and &gt;, but I don't have the chance to convert those characters into HTML entities, since the data is already stored as you can see in my example.
What I'm looking for is a clever way to parse the HTML in order to get rid of only the actual HTML tags.
Since TinyMCE was used to generate that data, I know which actual HTML tags could be used in any case, so a strip_tags($string, $black_list) implementation would be more useful than strip_tags($string, $allowable_tags).
Any thoughts?
As a wacky workaround you could filter non-html brackets with:
$html = preg_replace("# <(?![/a-z]) | (?<=\s)>(?![a-z]) #exi", "htmlentities('$0')", $html);
Apply strip_tags() afterwards. Note that this only works for your specific example and similar cases. It's a regular expression with some heuristics, not artificial intelligence that can tell HTML tags apart from unescaped angle brackets with another meaning.
If you want to have "greater than" and "less than" signs, you need to escape them:
> is &gt;
< is &lt;
See e.g. this: http://www.w3schools.com/html/html_entities.asp
Instead of strip_tags(), just use htmlspecialchars().
http://php.net/manual/en/function.htmlspecialchars.php
Following up on the accepted answer that uses a heuristic function to try to remove tags while sparing < and > signs, here is a version that uses preg_replace_callback, as the /e modifier in preg_replace is now deprecated:
function HTMLToString($string){
    // Encode brackets that do not look like part of a tag, strip the real tags, then decode
    return htmlspecialchars_decode(strip_tags(preg_replace_callback(
        "# <(?![/a-z]) | (?<=\s)>(?![a-z]) #xi",
        function ($matches){
            return htmlentities($matches[0]);
        },
        $string
    )));
}