I'm storing HTML and text data in my database table in its raw form - however I am having a slight problem in getting it to output correctly. Here is some sample data stored in the table AS IS:
<p>Professional Freelance PHP & MySQL developer based in Manchester.
<br />Providing an unbeatable service at a competitive price.</p>
To output this data I do:
echo $row['details'];
And this outputs the data correctly, however when I do a W3C validator check it says:
character "&" is the first character of a delimiter but occurred as data
So I tried using htmlentities and htmlspecialchars but this just causes the HTML tags to output on the page.
What is the correct way of doing this?
Use &amp; instead of &.
What you want to do is use the php function htmlentities()...
It will convert your input into html entities, and then when it is outputted it will be interpreted as HTML and outputted as the result of that HTML...For example:
$mything = "<b>BOLD & BOLD</b>";
//normally would throw an error if not converted...
//lets convert!!
$mynewthing = htmlentities($mything);
Now, just insert $mynewthing to your database!!
htmlentities is basically a superset of htmlspecialchars, and htmlspecialchars also replaces < and >.
Actually, what you are trying to do is to fix invalid HTML code, and I think this needs an ad-hoc solution:
$row['details'] = preg_replace("/&(?![#0-9a-z]+;)/i", "&amp;", $row['details']);
This is not a perfect solution, since it will fail for strings like: someone&son; (with a trailing ;), but at least it won't break existing HTML entities.
However, if you have decision power over how the data is stored, please enforce that the HTML code stored in the database is correct.
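As a rough illustration of the regex approach above, here is a minimal sketch. The function name escape_bare_ampersands and the sample string are made up for the example; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper: escape "&" only when it does not already start an entity.
function escape_bare_ampersands(string $html): string
{
    // The negative lookahead skips things like &copy; or &#169;
    // preg_replace returns null on a regex error, so fall back to the input.
    return preg_replace('/&(?![#0-9a-z]+;)/i', '&amp;', $html) ?? $html;
}

$stored = '<p>PHP & MySQL developer &copy; 2010, see &#169; too</p>';
echo escape_bare_ampersands($stored), "\n";
// prints: <p>PHP &amp; MySQL developer &copy; 2010, see &#169; too</p>
```

The tags are left alone and existing entities are not double-encoded, which is the whole point of not running htmlspecialchars over stored markup.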
In my projects I use an XSLT parser, so I had to change to (e.g.). But this is the safest way I found...
here is my code
$html = trim(addslashes(htmlspecialchars(
html_entity_decode($_POST['html'], ENT_QUOTES, 'UTF-8'),
ENT_QUOTES, 'UTF-8'
)));
And when you read from DB, don't forget to use stripslashes();
$html = stripslashes($mysq_row['html']);
I have a form where I can also write HTML tags. I must save this textarea preserving every single HTML tag. So here's the code:
foreach($_POST["comment"] AS $key => $value)
{
mysql_query("UPDATE comments SET title= '".$value["title"]."', comment = '".$value['comment']."' WHERE id = '".$value["id"]."'");
}
When I try to save this:
<b>Hello</b>
In MySQL I get this result:
&lt;b&gt;Hello&lt;/b&gt;
I must keep every single HTML tag as it is. If I write <b> I must save exactly <b> in the database. I tried escaping, HTML entities, quotes, strip slashes (...) but this guy keeps saving everything in the wrong way.
p.s. Before you ask: yes, the description field is of TEXT type with UTF-8 encoding.
Have you tried using http://php.net/manual/en/function.htmlspecialchars-decode.php on the mentioned value? This should do exactly what you're asking.
try running:
$sStr = '&lt;b&gt;Hello&lt;/b&gt;';
echo htmlspecialchars_decode($sStr);
And it will be decoded properly. Feeding that to the database stores the correct value.
Also, though this is more of a side note, you really shouldn't save POST data without validating the input. I assume this is just a quick example and not production code? Just a suggestion, though.
You need to escape the entry. If you are using the mysql method, you need the mysql_escape_string function like:
$string = mysql_escape_string("<br>Hello</br>");
I am attempting to write html data from a mysql database to a document using php. My code is below:
$content = html_entity_decode($dataToLoad['Text']);
echo $content;
$dataToLoad['Text'] contains this text data from the database:
&lt;div&gt;stuffInDiv&lt;/div&gt;
What I would like to happen is for this text to be written as an actual div element in the document, but instead it is being written as a string. How can I force php to write it as an element?
Update for clarity:
To clarify, my issue isn't with decoding the html entities in the database, it's with writing the decoded html to the document. When I do:
echo $content;
where $content contains
<div>stuffInDiv</div>
I get the string "<div>stuffInDiv</div>" when really what I want to have is a div containing the string "stuffInDiv"
The Answer
It's possible your data has been encoded twice. Try echo $content; and then go to View Source in your browser. If it starts with &lt;div&gt;, then you'd need to run html_entity_decode twice. There is rarely a good reason, however, to store the HTML in the database with the entities encoded. It'd make more sense to store it raw and encode it when need be (e.g. if the code were placed into a textarea).
$content = html_entity_decode(html_entity_decode($dataToLoad['Text']));
echo $content;
The Reasoning
The reason is that the raw data in your database looks like this:
&amp;lt;div&amp;gt;stuffInDiv&amp;lt;/div&amp;gt;
Your browser would print this on the screen:
&lt;div&gt;stuffInDiv&lt;/div&gt;
The first time you run html_entity_decode, it does exactly that, i.e. it replaces &amp; with the & character (&amp; is the code for the ampersand).
This produces:
&lt;div&gt;stuffInDiv&lt;/div&gt;
The web page spits out the encoded entities, i.e.:
<div>stuffInDiv</div>
Running html_entity_decode a second time would replace &lt; with < (less than sign), &gt; with > (greater than sign), etc.
This produces:
<div>stuffInDiv</div>
Which would be outputted to your page as:
stuffInDiv
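To see the two decoding passes side by side, here is a small sketch. The sample value is assumed to be what a double-encoded row would contain; the variable names are just for illustration.

```php
<?php
// Assumed database value: the HTML was entity-encoded twice before storage.
$stored = '&amp;lt;div&amp;gt;stuffInDiv&amp;lt;/div&amp;gt;';

// First pass: &amp; becomes &, leaving singly-encoded entities.
$once = html_entity_decode($stored);
echo $once, "\n";   // prints: &lt;div&gt;stuffInDiv&lt;/div&gt;

// Second pass: &lt; and &gt; become < and >, giving real markup.
$twice = html_entity_decode($once);
echo $twice, "\n";  // prints: <div>stuffInDiv</div>
```

If one pass already gives you markup, the data was only encoded once and the second call is unnecessary.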
Your Database Setup
As a note to your database:
When storing information in the database, do not encode the HTML at all. Unless HTML is being outputted onto a web page, it is no different from any other string and you shouldn't treat it differently. So if you were adding data to a table in your database that contains code, just do something like this:
INSERT INTO `my_content` (`name`, `content`) VALUES ("My Page", "<div>stuffInDiv</div>");
If you were obtaining this data from a textarea, use:
$connection->query('INSERT INTO `my_content` (`name`, `content`) VALUES ("'.$connection->real_escape_string($_POST['name']).'", "'.$connection->real_escape_string($_POST['content']).'");');
Without doing anything to manipulate the value of $_POST['content']. If you need to place that data back into the textarea (say, editing a page):
$result = $connection->query('SELECT `content` FROM `my_content` WHERE `name` = "'.$connection->real_escape_string($_GET['edit_page']).'"');
if ($row = $result->fetch_assoc()) {
print '<textarea name="content">'.htmlentities($row['content']).'</textarea>';
}
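Why encode when putting the raw value back into a textarea? A minimal sketch, assuming a stored value that happens to contain a closing textarea tag (the sample content is invented for the example):

```php
<?php
// Assumed raw content as stored in the database, unencoded.
$content = '<p>Hi</p></textarea><script>alert(1)</script>';

// Encode only at the point of output into the HTML page.
$field = '<textarea name="content">'
    . htmlspecialchars($content, ENT_QUOTES, 'UTF-8')
    . '</textarea>';

echo $field, "\n";
// Only one literal closing tag remains: the one we wrote ourselves.
echo substr_count($field, '</textarea>'), "\n"; // prints: 1
```

The browser shows the original markup inside the box for editing, and the form posts it back in raw form, so nothing needs decoding before the next save.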
You can replace html entities with str_replace... (but surely there's an easier way)
$ar = get_html_translation_table();
$dataToLoad['Text'] = '&lt;div&gt;stuffInDiv&lt;/div&gt;';
echo str_replace($ar, array_keys($ar), $dataToLoad['Text']);
I have a form with 2 textareas; the first one allows user to send HTML Code, the second allows to send CSS Code. I have to verify with a PHP function, if the language is correct.
If the language is correct, for security, I have to check that there is no PHP code or SQL injection or whatever.
What do you think ? Is there a way to do that ?
Where can I find this kind of function ?
Is "HTML Purifier" http://htmlpurifier.org/ a good solution ?
If you have to validate the data to insert it into the database, then you just have to use the mysql_real_escape_string() function before inserting it into the db.
//Safe database insertion
mysql_query("INSERT INTO table(column) VALUES('".mysql_real_escape_string($_POST['field'])."')");
If you want to output the data to the end user as plain text, then you have to escape all HTML-sensitive chars with htmlspecialchars(). If you want to output it as HTML, then you have to use the HTML Purifier tool.
//Safe plain text output
echo htmlspecialchars($data, ENT_QUOTES);
//Safe HTML output
$data = purifyHtml($data); //Or however it is specified in the purifier documentation
echo $data; //Safe html output
for something primitive you can use regex, BUT it should be noted using a parser to fully-exhaust all possibilities is recommended.
/(<\?(?:php)?(.*)\?>)/i
Example: http://regexr.com?2t3e5 (change the &lt; in the expression back to a < and it will work (for some reason regexr changes it to HTML formatting))
EDIT
/(<\?(?:php)?(.*)(?:\?>|$))/i
That's probably better so they can't place php at the end of the document (as PHP doesn't actually require a terminating character)
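A quick sketch of how that second pattern could be used as a check. The function name contains_php_tag is hypothetical, and the s modifier is my own addition (an assumption, not part of the answer) so that . also matches newlines in multi-line input.

```php
<?php
// Hypothetical helper built around the edited pattern above.
// Primitive by design: it also flags things like <?xml ... ?>.
function contains_php_tag(string $input): bool
{
    return preg_match('/(<\?(?:php)?(.*)(?:\?>|$))/is', $input) === 1;
}

var_dump(contains_php_tag('<p>plain html</p>'));        // bool(false)
var_dump(contains_php_tag('<p><?php echo 1; ?></p>'));  // bool(true)
var_dump(contains_php_tag("<?php\nphpinfo();"));        // bool(true), no closing tag needed
```

This only tells you that something tag-like is present; as noted, a real parser is the better tool if you need to be exhaustive.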
The SHJS syntax highlighter for JavaScript has files with regular expressions for the languages it highlights: http://shjs.sourceforge.net/lang/. You can check how SHJS parses code.
HTMLPurifier is the recommended tool for cleaning up HTML. And as luck has it, it also includes CSSTidy and can sanitize CSS as well.
... that there is not PHP code or SQL Injection or whatever.
You are basing your question on a wrong premise. While HTML can be cleaned, this is no safeguard against other exploits. PHP "tags" are most likely to be filtered out. But if you are doing something else weird (include-ing or eval-ing the content partially), that's no real help.
And SQL exploits can only be prevented by meticulously using the proper database escape functions. There is no magic solution to that.
Yes. htmlpurifier is a good tool to remove malicious scripts and validate your HTML. I didn't think it did CSS, but apparently it works with CSS too. Thanks Briedis.
Ok thanks you all.
Actually, I realize that I needed human validation. Users can post HTML + CSS, and I can verify in PHP that the language and the syntax are correct, but that doesn't stop people from posting an iframe, an HTML redirect, or a big black div that takes up the whole screen.
:-)
I want to display on screen data sent by the user,
remembering it can contain dangerous code, it is best to clean this data with html entities.
Is there a better way to do html entities, besides this:
$name = clean($name, 40);
$email = clean($email, 40);
$comment = clean($comment, 40);
and this:
$data = array("name", "email", "comment");
function confHtmlEnt($data)
{
return htmlentities($data, ENT_QUOTES, 'UTF-8');
}
$cleanPost = array_map('confHtmlEnt', $_POST);
if so, how, and how does my wannabe structure
for html entities look?
Thank you for not flaming the newb :-).
"Clean POST", the only problem is you might not know in what context will your data appear. I have a Chat server now that works via browser client and a desktop client and both need data in a different way. So make sure you save the data as "raw" as possible into the DB and then worry about filtering it on output.
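A small sketch of why the raw value is worth keeping: the same stored string needs different treatment per output context. The browser and desktop client here are stand-ins for the two clients mentioned above, and the sample string is invented.

```php
<?php
// Assumed raw value exactly as saved in the database.
$raw = 'Tom & "Jerry" <3';

// Browser client: escape for HTML at output time.
$forHtml = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo $forHtml, "\n";  // prints: Tom &amp; &quot;Jerry&quot; &lt;3

// Desktop client: send JSON, which has its own escaping rules.
$forJson = json_encode(['message' => $raw]);
echo $forJson, "\n";  // prints: {"message":"Tom & \"Jerry\" <3"}
```

Had the value been entity-encoded before saving, the desktop client would receive &amp; and &lt; as literal text.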
Do not encode everything in $_POST/$_GET. HTML-escaping is an output-encoding issue, not an input-checking one.
Call htmlentities (or, usually better, htmlspecialchars) only at the point where you're taking some plain text and concatenating or echoing it into an HTML page. That applies whether the text you are using comes from a submitted parameter, or from the database, or somewhere else completely. Call mysql_real_escape_string only at the point you insert plain text into an SQL string literal.
It's tempting to shove all that escaping stuff in its own box at the top of the script and then forget about it. But text preparation really doesn't work like that, and if you pretend it does you'll find your database irreparably full of double-encoded crud, backslashes on your HTML page and security holes you didn't spot because you were taking data from a source other than the (encoded) parameters.
You can make the burden of remembering to mysql_real_escape_string go away by using mysqli's parameterised queries or another higher-level data access layer. You can make the burden of typing htmlspecialchars every time less bothersome by defining a shorter-named function for it, eg.:
<?php
function h($s) {
echo(htmlspecialchars($s, ENT_QUOTES));
}
?>
<h1> Blah blah </h1>
<p>
Blah blah <?php h($title); ?> blah.
</p>
or using a different templating engine that encodes HTML by default.
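For the parameterised-query suggestion, here is a minimal sketch. The answer names mysqli; I'm using PDO with an in-memory SQLite database instead, purely so the example runs without a MySQL server (this assumes the pdo_sqlite extension is available). The comments table and its column are hypothetical.

```php
<?php
// Assumption: PDO + in-memory SQLite stands in for the real MySQL connection.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, comment TEXT)');

// Raw user text: quotes, ampersands and tags, stored untouched.
$raw = "<b>Hello</b> & it's \"raw\"";

// The placeholder keeps the value out of the SQL string, so no manual escaping.
$insert = $pdo->prepare('INSERT INTO comments (comment) VALUES (:comment)');
$insert->execute([':comment' => $raw]);

$stored = $pdo->query('SELECT comment FROM comments')->fetchColumn();

// HTML-escape only now, at the point of output.
echo htmlspecialchars($stored, ENT_QUOTES, 'UTF-8'), "\n";
```

What comes back from the database is byte-for-byte what went in, and the escaping happens once, where the text meets the page.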
If you wish to convert the five special HTML characters to their equivalent entities, use the following method:
function filter_HTML($mixed)
{
return is_array($mixed)
? array_map('filter_HTML',$mixed)
: htmlspecialchars($mixed,ENT_QUOTES);
}
That would work for both UTF-8 or single-byte encoded string.
But if the string is UTF-8 encoded, make sure to filter out any invalid characters sequence, prior to using the filter_HTML() function:
function make_valid_UTF8($str)
{
return iconv('UTF-8','UTF-8//IGNORE',$str);
}
Also see: http://www.phpwact.org/php/i18n/charsets#character_sets_character_encoding_issues
You need to clean every element before displaying it. I usually do it with a function and an array, like your second example.
If you use a framework with a template engine, there is quite likely a possibility to auto-encode strings. Apart from that, what's simpler than calling a function and getting the entity-"encoded" string back?
Check out the filter libraries in php, in particular filter_input_array.
filter_input_array(INPUT_POST, FILTER_SANITIZE_SPECIAL_CHARS);
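filter_input_array reads from the real request, so it is awkward to try from the command line. As a sketch of the same filter, here it is applied with filter_var_array to a hand-made array standing in for $_POST (the field values are invented):

```php
<?php
// Assumed stand-in for $_POST; in a real request use filter_input_array(INPUT_POST, ...).
$fakePost = [
    'name'    => '<b>Bob</b>',
    'email'   => 'bob@example.com',
    'comment' => 'Fish & "chips"',
];

// Passing a single filter constant applies it to every element of the array.
$clean = filter_var_array($fakePost, FILTER_SANITIZE_SPECIAL_CHARS);

print_r($clean);
```

This filter encodes the special characters as numeric entities, so the angle brackets, ampersand and quotes no longer appear literally in the result. The earlier caveat still applies: do this for output, not before storing.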
This will not validate because of the output from print_r. Is it not supposed to be used "on a site", or does one have to format it in a certain way?
<?php
$stuff1 = $_POST["stuff1"];//catch variables
$stuff2 = $_POST["stuff2"];
$stuff3 = $_POST["stuff3"];
$myStuff[0] = $stuff1;//put into array
$myStuff[1] = $stuff2;
$myStuff[2] = $stuff3;
print_r($myStuff);
?>
print_r() is mainly designed as a helpful tool for developers, not for actual production use in a manner that end-users would see. Thus, you shouldn't really be trying to validate it - if you're at the stage where you're trying to get stuff to validate, you shouldn't be using print_r anyway.
The validator can't distinguish the output of print_r() from the surrounding HTML structure; it simply parses the whole character stream. If the output of your print_r() contains characters that have a special meaning in HTML (apparently < and > in your case), the validator must assume that they belong to the HTML structure, not the text data. You have to mark them as "no, this is just text data, not a control character" for HTML parsers. One way to do this is to send entities instead of the "real" character itself, e.g. &lt; instead of <
The function htmlspecialchars() takes care of those characters that always have a special meaning in (x)html.
You might also want to enclose the output in a <pre>....</pre> element to keep the formatting of print_r().
echo '<pre>', htmlspecialchars(print_r($myStuff, true)), "</pre>\n";
A plain print_r outputs text, so there's no reason for it not to affect validation. To print it out nicely formatted on an HTML page, use a <pre>:
$printout = print_r($my_var, true); // true: return the output as a string instead of printing it
echo "<pre>$printout</pre>";
If you don't want to display it, but only to see it as a developer, place it in an HTML comment (<!-- any text -->).
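Putting the two answers together, a small debugging helper might look like the sketch below. The name debug_dump is made up; it just combines print_r's return mode, htmlspecialchars and a <pre> wrapper.

```php
<?php
// Hypothetical helper: dump any value as validator-friendly HTML.
function debug_dump($value): string
{
    // print_r(..., true) returns the text; htmlspecialchars makes < > & safe.
    return '<pre>' . htmlspecialchars(print_r($value, true), ENT_QUOTES, 'UTF-8') . "</pre>\n";
}

$myStuff = ['<b>bold</b>', 'a & b', 'plain'];
echo debug_dump($myStuff);
```

The array contents show up as literal text inside the <pre>, with print_r's indentation preserved, and the page around it stays valid.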