I have the following function that I use in a PHP application to remove white space and line breaks from the source of a page.
It's based on some examples I have read on Stack Overflow, with some amends to handle JS and HTML comments. Note: I've not used an exisiting library because I wanted something simple without all the additional features that others include and with this code I have fine-grained control over what is stripped and what is not.
protected function MinifyHTML($str) {
$str = preg_replace("/(?<!\S)\/\/\s*[^\r\n]*/", "", $str); // strip JS/CSS comments
$str = preg_replace("/<!--(.*)-->/Uis", "", $str); // strip HTML comments
$protected_parts = array('<pre>,</pre>','<textarea>,</textarea>','<,>');
$extracted_values = array();
$i = 0;
foreach ($protected_parts as $part) {
$finished = false;
$search_offset = $first_offset = 0;
$end_offset = 1;
$startend = explode(',', $part);
if (count($startend) === 1) $startend[1] = $startend[0];
$len0 = strlen($startend[0]); $len1 = strlen($startend[1]);
while ($finished === false) {
$first_offset = strpos($str, $startend[0], $search_offset);
if ($first_offset === false) $finished = true;
else {
$search_offset = strpos($str, $startend[1], $first_offset + $len0);
$extracted_values[$i] = substr($str, $first_offset + $len0, $search_offset - $first_offset - $len0);
$str = substr($str, 0, $first_offset + $len0).'$$#'.$i.'$$'.substr($str, $search_offset);
$search_offset += $len1 + strlen((string)$i) + 5 - strlen($extracted_values[$i]);
++$i;
}
}
}
$str = preg_replace("/\s/", " ", $str);
$str = preg_replace("/\s{2,}/", " ", $str);
$replace = array('> <'=>'><', ' >'=>'>','< '=>'<','</ '=>'</');
$str = str_replace(array_keys($replace), array_values($replace), $str);
for ($d = 0; $d < $i; ++$d)
$str = str_replace('$$#'.$d.'$$', $extracted_values[$d], $str);
return $str;
}
However if I get a scenario like:
Link Link
It will remove that space between the two anchor tags.
I've added '</a> <a' to my $protected_parts in an attempt to stop this, but it still strips out the space between them. So I end up with LinkLink in the source which isn't what I want.
The same also happens with:
<p>This is <span class="">some</span> <span class="">styled</span> text.</p>
Also it seems the protected_parts arn't working as my textareas are being minified too so all the content inside them is compressed down into one line...
Any ideas on the fixes? I've also not been able to find alternatives to use instead that don't implement caching, gzipping and other features I don't want. I purely want a simple solution that strips spaces, line breaks and comments and that's it.
UPDATED 2014/02/25 (late):
Here's another workaround. Instead of touching $protected_parts I'm just adding another replace operation at the end that adds a space after every </a> -- again a workaround, but this shouldn't screw up any of your original operability, and the penalty this time is only one space character after every anchor tag, not bad. Here it is: http://phpfiddle.org/main/code/5qj-13z
UPDATED 2014/02/25:
I added '</a> ' to $protected_parts and it does not strip the space. I threw it into phpfiddle over here, http://phpfiddle.org/lite/code/dms-cud. This is only a workaround for a few lines of synethetic-emulated HTML... I'm not sure what kind of organic code you're running through your function. Obviously this workaround is not a universal fix either.
Original
I added '</a>',' <a ', to $protected_parts and it does not strip the space. I threw it into phpfiddle over here, http://phpfiddle.org/lite/code/ztz-5hf.
Your function is scary to me, but I like some of the basic functionality, like stripping HTML, JS and CSS comments. I'd still recommend using an apache extension or library. Using other people's open source code is the most powerful witchcraft a programmer can yield. :)
Im building a public forum from scratch, and im fine tuning, and testing everything now. Im currently stuck at the function that strips all html tags expect those i use for insering youtube-videoes, and bold/italic tags so that the user atleast has some way of styling their posts. My problem, is that when i use the nl2br2() function for filtering my post-string, it dosnt strip the html-tags from the string, it works fine if i remove nl2br2() ..? My theory is that the strip_tags() function also strips the native system line breaks \n and \r, so that nl2br2() haven't got any line break to convert. Im actually pretty sure, that's the problem! How can i make those two functions work together? Is there any alternatives to strip_tags()? Or can you somehow tell the function, to stop stripping those linebreaks!!? Its really annoying, been spending lots of hours today trying to figure this out :D any help is much apreaciated!
THIS DIDN'T WORKD:
function nl2br2($string) {
$string = str_replace(array("\r\n", "\r", "\n"), "<br />", $string);
return $string;
}
$str = "$_POST[indlaeg]";
mysql_real_escape_string($str); // PROTECT FROM SQL INJECTIONS THROUGH SINGLE QUOTES ''
strip_tags($str, '<b><i><a><video><br>'); // REMOVE ALL TAGS EXPECT
$str = nl2br2($str); // CONVERT LINE BREAKS TO <br>
THIS DIDN'T WORK EITHER:
$str = mysql_real_escape_string(strip_tags(nl2br2($_POST['indlaeg']), '<b><i><a><video><br>'));
THIS WORKED!!!!
function html2txt($document){
$search = array('#<script[^>]*?>.*?</script>#si', // Strip out javascript
'#<[\/\!]*?[^<>]*?>#si', // Strip out HTML tags
'#<style[^>]*?>.*?</style>#siU', // Strip style tags properly
'#<![\s\S]*?--[ \t\n\r]*>#' // Strip multi-line comments including CDATA );
$text = preg_replace($search, '', $document);
return $text;
}
$str = "$_POST[indlaeg]";
$str = html2txt($str);
$str = nl2br2($str);
The html2txt() function is sent from heaven! It strips ALL evil-minded tags! Including the single quotes '' that hackers like to use for SQL injection :D
PROBLEM SOLVED!
You’re applying three functions to your string – mysql_real_escape_string, strip_tags and nl2br2. The order should be reversed because mysql_real_escape_string adds a backslash before \n and \r, making the string unable to be processed by nl2br2. If you apply nl2br2 first, strip_tags next and mysql_real_escape_string last, no such problems should arise.
Replace these four lines
$str = "$_POST[indlaeg]";
mysql_real_escape_string($str); // PROTECT FROM SQL INJECTIONS THROUGH SINGLE QUOTES ''
strip_tags($str, '<b><i><a><video><br>'); // REMOVE ALL TAGS EXPECT
$str = nl2br2($str); // CONVERT LINE BREAKS TO <br>
with
$str = $_POST['indlaeg'];
$str = nl2br2($str); // CONVERT LINE BREAKS TO <br>
$str = strip_tags($str, '<b><i><a><video><br>'); // REMOVE ALL TAGS EXCEPT A FEW
$str = mysql_real_escape_string($str); // PROTECT FROM SQL INJECTIONS THROUGH SINGLE QUOTES ''
i want to know how to keep all whitespaces of a text area in php (for send to database), and then echo then back later. I want to do it like stackoverflow does, for codes, which is the best approach?
For now i using this:
$text = str_replace(' ', '&nbs p;', $text);
It keeps the ' ' whitespaces but i won't have tested it with mysql_real_escape and other "inject prevent" methods together.
For better understanding, i want to echo later from db something like:
function jack(){
var x = "blablabla";
}
Thanks for your time.
Code Blocks
If you're trying to just recreate code blocks like:
function test($param){
return TRUE;
}
Then you should be using <pre></pre> tags in your html:
<pre>
function test($param){
return TRUE;
}
</pre>
As plain html will only show one space even if multiple spaces/newlines/tabs are present. Inside of pre tags spaces will be shown as is.
At the moment your html will look something like this:
function test($param){
return TRUE;
}
Which I would suggest isn't desirable...
Escaping
When you use mysql_real_escape you will convert newlines to plain text \n or \r\n. This means that your code would output something like:
function test($param){\n return TRUE;\n}
OR
<pre>function test($param){\n return TRUE;\n}</pre>
To get around this you have to replace the \n or \r\n strings to newline characters.
Assuming that you're going to use pre tags:
echo preg_replace('#(\\\r\\\n|\\\n)#', "\n", $escapedString);
If you want to switch to html line breaks instead you'd have to switch "\n" to <br />. If this were the case you'd also want to switch out space characters with - I suggest using the pre tags.
try this, works excellently
$string = nl2br(str_replace(" ", " ", $string));
echo "$string";
THE PROBLEM: I need a XML file "full encoded" by UTF8; that is, with no entity representing symbols, all symbols enconded by UTF8, except the only 3 ones that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And, I need a build-in function that do it fast: to transform entities into real UTF8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); at PMC/journals, for example, have 2.8 MILLION of scientific articles enconded with a special XML DTD (knowed also as JATS format)... To process as "usual XML-UTF8-text" we need to change from numeric entity to UTF8 char.
THE ATTEMPTED SOLUTION: the natural function to this task is html_entity_decode, but it destroys the XML code (!), transforming the reserved 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world! Let A<B and A=∬dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF8, and the XML-reserved lt not. The XML text will be (after transformed),
$xmlFrag = '<p>Hello world! Let A<B and A=∬dxdy</p>';
The text "A<B" needs an XML-reserved character, so MUST stay as A<B.
Frustrated solutions
I try to use html_entity_decode for solve (directly!) the problem... So, I updated my PHP to v5.5 to try to use the ENT_XML1 option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
// as I expected
Perhaps another question is, "WHY there are no other option to do what I expected?" -- it is important for many other XML applications (!), not only for me.
I not need a workaround as answer... Ok, I show my ugly function, perhaps it helps you to understand the problem,
function xml_entity_decode($s) {
// here an illustration (by user-defined function)
// about how the hypothetical PHP-build-in-function MUST work
static $XENTITIES = array('&','>','<');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
//$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
} // you see? not need a benchmark:
// it is not so fast as direct use of html_entity_decode; if there
// was an XML-safe option was ideal.
PS: corrected after this answer. Must be ENT_HTML5 flag, for convert really all named entities.
This question is creating, time-by-time, a "false answer" (see answers). This is perhaps because people not pay attention, and because there are NO ANSWER: there are a lack of PHP build-in solution.
... So, lets repeat my workaround (that is NOT an answer!) to not create more confusion:
The best workaround
Pay attention:
The function xml_entity_decode() below is the best (over any other) workaround.
The function below is not an answer to the present question, it is only a workwaround.
function xml_entity_decode($s) {
// illustrating how a (hypothetical) PHP-build-in-function MUST work
static $XENTITIES = array('&','>','<');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
}
To test and to demonstrate that you have a better solution, please test first with this simple benckmark:
$countBchMk_MAX=1000;
$xml = file_get_contents('sample1.xml'); // BIG and complex XML string
$start_time = microtime(TRUE);
for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){
$A = xml_entity_decode($xml); // 0.0002
/* 0.0014
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$A = $doc->saveXML();
*/
}
$end_time = microtime(TRUE);
echo "\n<h1>END $countBchMk_MAX BENCKMARKs WITH ",
($end_time - $start_time)/$countBchMk_MAX,
" seconds</h1>";
Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:
$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
I had the same problem because someone used HTML templates to create XML, instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to $amp;, however unlikely its presence in the source XML.
Note: I'm assuming default encoding is UTF-8
// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&"
// will remain "&" whereas "€" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>€&foo Ç</Foo>") . "\n";
/* <Foo>€&foo Ç</Foo> */
Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&"
// will remain "&" whereas "€" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>€&foo Ç</Foo>") . "\n";
echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&foo Ç</Foo> */
In your case you want it the other way around. Encode numbered entities as UTF-8:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&"
// will remain "&" whereas "€" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>€&foo Ç</Foo>") . "\n";
// Encodes (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&foo Ç</Foo> */
Benchmark
I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
Your method
php -r '$q=["&",">","<"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é & ∬</Foo>
=====
Time taken: 2.0397531986237
My method
php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 4.045273065567
My method (with unicode to numbered entity):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#; ∬</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.4407880306244
My method (with numbered entity to unicode):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>€&foo Ç é #_x_amp#;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
public function entity_decode($str, $charset = NULL)
{
if (strpos($str, '&') === FALSE)
{
return $str;
}
static $_entities;
isset($charset) OR $charset = $this->charset;
$flag = is_php('5.4')
? ENT_COMPAT | ENT_HTML5
: ENT_COMPAT;
do
{
$str_compare = $str;
// Decode standard entities, avoiding false positives
if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
{
if ( ! isset($_entities))
{
$_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));
// If we're not on PHP 5.4+, add the possibly dangerous HTML 5
// entities to the array manually
if ($flag === ENT_COMPAT)
{
$_entities[':'] = ':';
$_entities['('] = '(';
$_entities[')'] = '&rpar';
$_entities["\n"] = '&newline;';
$_entities["\t"] = '&tab;';
}
}
$replace = array();
$matches = array_unique(array_map('strtolower', $matches[0]));
for ($i = 0; $i < $c; $i++)
{
if (($char = array_search($matches[$i].';', $_entities, TRUE)) !== FALSE)
{
$replace[$matches[$i]] = $char;
}
}
$str = str_ireplace(array_keys($replace), array_values($replace), $str);
}
// Decode numeric & UTF16 two byte entities
$str = html_entity_decode(
preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
$flag,
$charset
);
}
while ($str_compare !== $str);
return $str;
}
For those coming here because your numeric entity in the range 128 to 159 remains as numeric entity instead of being converted to a character:
echo xml_entity_decode('');
//Output instead expected €
This depends on PHP version (at least for PHP >=5.6 the entity remains) and on the affected characters. The reason is that the characters 128 to 159 are not printable characters in UTF-8. This can happen if the data to be converted mix up windows-1252 content (where is the € sign).
Try this function:
function xmlsafe($s,$intoQuotes=1) {
if ($intoQuotes)
return str_replace(array('&','>','<','"'), array('&','>','<','"'), $s);
else
return str_replace(array('&','>','<'), array('&','>','<'), html_entity_decode($s));
}
example usage:
echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';
also: https://stackoverflow.com/a/9446666/2312709
this code used in production seem that no problems happened with UTF-8
just a quick question about Regular expressions: Will this code work for any grooming I will need to do? (i.e. Can this be inputted into a database and be safe?)
function markdown2html($text) {
$text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
// Strong Emphasis
$text = preg_replace('/__(.+?)__/s', '<strong>$1</strong>', $text);
$text = preg_replace('/\*\*(.+?)\*\*/s', '<strong>$1</strong>', $text);
// Underline
$text = preg_replace('/_([^_]+)_/', '<p style="text-decoration: underline;">$1</p>', $text);
//Italic
$text = preg_replace('/\*([^\*]+)\*/', '<em>$1</em>', $text);
// Windows to Unix
$text = str_replace('\r\n', '\n', $text);
// Macintosh to Unix
$text = str_replace('\r', '\n', $text);
//Paragraphs
$text = '<p>' . str_replace("\n\n", '</p><p>', $text) . '</p>';
$text = str_replace("\n", '<br />', $text);
// [Linked Text](Url)
$text = preg_replace('/\[([^\]]+)]\(([a-z0-9._~:\/?##!$&\'()*+,;=%]+)\)/i', '$1', $text);
return $text;
}
No, absolutely not.
Your code has nothing to do with SQL -- it does not modify ' or \ characters at all. Commingling the formatting functionality of this function with SQL escaping is silly.
Your code may also introduce HTML injection in some situations -- I'm particularly suspicious of the URL linking regex. Without a proper parser involved, I would not trust it an inch.
No, the data can not assured to be safe after passing through that function.
You need to either escape sql-sensitive characters or use PDO/Mysqli. Preapared statements are much more handy anyway.
Don't use the old way of hacking together a query, ie:
$query = 'select * from table where col = '.$value;
You're just asking for trouble there.
A couple of things jumped out at me:
I believe that the first two regexs ('/__(.+?)__/s' and the corresponding one for *) handle ___word___ and ***word*** incorrectly –– they will treat the third character as part of the word, so you will get *word* (where the first * is bold and the trailing one is not) instead of word.
On the third one ('/_([^_]+)_/'), is it really appropriate for
do _not_ do that
to turn into
do <p style="text-decoration: underline;">not</p> do that
?
Of course I’m not saying that it’s OK to use if you fix these issues.