Removing inline styles using php [duplicate] - php

I am using PHP to output some rich text. How can I strip out the inline styles completely?
The text will be pasted straight from MS Word or OpenOffice into a textarea that uses TinyMCE, a rich-text editor which lets you add basic HTML formatting to the text.
However, I want to remove the inline styles from the tags (see below) but preserve the tags themselves.
<p style="margin-bottom: 0cm;">A patrol of Zograth apes came round the corner, causing Rosette to pull Rufus into a small alcove, where she pressed her body against his. “Sorry.” She said, breathing warm air onto the shy man's neck. Rufus trembled.</p>
<p style="margin-bottom: 0cm;"> </p>
<p style="margin-bottom: 0cm;">Rosette checked the coast was clear and pulled Rufus out of their hidey hole. They watched as the Zograth walked down a corridor, almost out of sight and then collapsed next to a phallic fountain. As their bodies hit the ground, their guns clattered across the floor. Rosette stopped one with her heel and picked it up immediately, tossing the other one to Rufus. “Most of these apes seem to be dying, but you might need this, just to give them a helping hand.”</p>

I quickly put this together, but for 'inline styles' (!) you will need something like
$text = preg_replace('#(<[a-z ]*)(style=("|\')(.*?)("|\'))([a-z ]*>)#', '\\1\\6', $text);

Here is a preg_replace solution I derived from Crozin's answer. This one allows for attributes before and after the style attribute fixing the issue with anchor tags.
$value = preg_replace('/(<[^>]*) style=("[^"]+"|\'[^\']+\')([^>]*>)/i', '$1$3', $value);
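As a quick check of the pattern above, the sketch below runs it against a paragraph and against an anchor that has attributes on both sides of style. The sample strings and the helper name strip_inline_styles are made up for illustration. Note that this pattern would not fully handle a tag carrying more than one style attribute, and it skips an empty style="".

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function strip_inline_styles(string $html): string
{
    return preg_replace('/(<[^>]*) style=("[^"]+"|\'[^\']+\')([^>]*>)/i', '$1$3', $html);
}

$paragraph = '<p style="margin-bottom: 0cm;">Rufus trembled.</p>';
$anchor    = '<a href="page.html" style="color: red;" title="More">read on</a>';

echo strip_inline_styles($paragraph), "\n"; // <p>Rufus trembled.</p>
echo strip_inline_styles($anchor), "\n";    // <a href="page.html" title="More">read on</a>
```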

Use HtmlPurifier

You could use regular expressions:
$text = preg_replace('#<(.+?)style=(?:"|\')?[^"\']+(?:"|\')?(.*?)>#si', '<\\1\\2>', $text);

You can use: $content = preg_replace('/style=[^>]*/', '', $content);

You can also use PHP Simple HTML DOM Parser, as follows:
$html = str_get_html(SOME_HTML_STRING);
foreach ($html->find('*[style]') as $item) {
$item->style = null;
}
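If adding a third-party parser is not an option, PHP's bundled DOM extension can do something similar. The following is only a sketch: remove_style_attributes is a hypothetical helper name, the wrapper <div> is an assumption used to give the fragment a single root, and text containing non-ASCII characters may need an explicit encoding hint before loadHTML() reads it correctly.

```php
<?php
// Sketch: remove every style attribute using PHP's bundled DOM extension.
// remove_style_attributes() is a hypothetical helper, not a library function.
function remove_style_attributes(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);           // pasted Word markup is rarely valid
    $doc->loadHTML('<div>' . $html . '</div>'); // wrapper gives the fragment one root
    libxml_clear_errors();

    // Select every element that carries a style attribute and drop it.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@style]') as $element) {
        $element->removeAttribute('style');
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_style_attributes('<p style="margin-bottom: 0cm;">One</p><p class="x" style="color: red">Two</p>');
```

Unlike the regex answers, this leaves other attributes (class in the example) untouched without having to anticipate where style appears inside the tag.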

Couldn't you just use strip_tags and leave in the tags you want eg <p>, <strong> etc?

Why don't you just overwrite the tags. So you will have clean tags without inline styling.

I found this class very useful for doing strip attributes (especially where there's crazy MS Word formatting all through the text):
http://semlabs.co.uk/journal/php-strip-attributes-class-for-xml-and-html

I needed to remove the style attribute from img tags and solved it with this code:
$text = preg_replace('#(<img (.*) style=("|\')(.*?)("|\'))([a-z ]*)#', '<img \\2\\6', $text);
echo $text;

Related

Properly rendering stored HTML

A part of my site allows users to create comments in a text box to be stored in an SQL database. Because a lot of people copy/paste things in from word or other places, I have to keep <p> and <br> tags to keep formatting, and also <a> tags to let users create their own links. Everything else gets stripped out. I was accomplishing this like so:
$text = strip_tags( $text, '<br><a><p>' );
But today a user came to me and told me they lost a large portion of their text because they typed an arrow <- for visual effect. So now I know strip_tags() removes everything after a <.
I can accomplish a similar effect with preg_replace like so:
preg_replace('/((?!<((\/)?p|br|a))<[^>]*>)/', "", $text);
But this still has the downside of only working if the tag spans one line (I think), leaving in HTML comments and probably a few other things that I'm not aware of. What are my options? Is there a catch-all solution? A library I can use? I mostly work alone, so I'm not really aware of industry standards.
Use HTML Purifier. It cleans the submitted HTML and removes unwanted code; for example, if a user adds a script tag that could harm your website (an XSS attack), HTML Purifier removes it before the content is saved. It also completes broken HTML: if a user enters <strong>gamer ... without closing the tag, it closes the tag and outputs cleaner HTML.
"I can accomplish a similar effect with preg_replace... But this still has the downside of only working if the tag spans one line (I think)." Not really! You can use pattern modifiers to make PHP regular expressions span multiple lines. Consider the example below with a multiline HTML string:
<?php
// $s IS A MULTILINE HTML SNIPPET CONTAINING THE FOLLOWING HTML TAGS
// <div>, <a>, <blockquote>, <em>, <strong>, <span>, <br />
$s = "<div class='one'>
<a href='/link.php'>
<blockquote>
There is real Power in the Hearts of men: not just Power but
\"something so much powerful than Power\" that Power itself begs to \"power down\".
</blockquote>
</a>
<p class='lv'>
This Power is not in the Head nor in the Intellect nor in the Skills of Man...
<em class='em1'>but in the deep recess of the Human Heart...</em>
and it speaks volumes yet only very few understand its language -
<strong>The Language of Love</strong>
- The Greatest Power You can have.... The Power to which nothing is Impossible!!!
</p>
<br />
<span>Do you know this Power? <--</span>
<strong>Do you Speak Love???</strong>
</div>";
// THIS CONCISE REGEX PATTERN REMOVES ALL HTML TAGS WITHIN THE MULTILINE STRING
// EXCEPT FOR TAGS LIKE: <a> <p> <br />
// IT WOULD ALSO LEAVE <- OR <-- OR <------ UNTOUCHED
$r = preg_replace("#<(?!\/[ap]|[ap\-]|br).*?>#si", "", $s);
echo ($r);
If you view the page source, you will see that all HTML tags except <br>, <p> and <a> were stripped out, while symbols like <-- were left alone. The source would look something like this:
<a href='/link.php'>
There is real Power in the Hearts of men: not just Power but
"something so much powerful than Power" that Power itself begs to "power down".
</a>
<p class='lv'>
This Power is not in the Head nor in the Intellect nor in the Skills of Man...
but in the deep recess of the Human Heart...
and it speaks volumes yet only very few understand its language -
The Language of Love
- The Greatest Power You can have.... The Power to which nothing is Impossible!!!
</p>
<br />
Do you know this Power? <--
Do you Speak Love???
Cheers and Good-Luck...
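One caveat with the pattern above: its lookahead tests only the first character or two after <, so it also spares any tag whose name merely starts with a or p (such as <pre> or <address>), and .*? will happily swallow a bare < followed later by a >. A slightly stricter variant is sketched below. The helper name and sample input are made up, and this is still only an approximation rather than a sanitizer: attributes such as onclick on the kept tags are not touched.

```php
<?php
// Hypothetical helper: strip every tag except <a>, <p> and <br>.
function strip_all_but_a_p_br(string $html): string
{
    // (?!/?(?:a|p|br)\b) : refuse to match when the tag name is exactly a, p or br
    // [a-z/!]            : a real tag starts with a letter, "/" or "!", so "<-" is left alone
    // [^>]*>             : a negated class also matches newlines, so multi-line tags are covered
    return preg_replace('#<(?!/?(?:a|p|br)\b)[a-z/!][^>]*>#i', '', $html);
}

$input = "<div class='one'><p>Hi <-- there</p><pre>code</pre><br /><address\n class='x'>a</address></div>";
echo strip_all_but_a_p_br($input); // <p>Hi <-- there</p>code<br />a
```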
If your case is as simple as the one shown in your question, I wouldn't reach for an external library like HTML Purifier.
strip_tags() has its own way of recognising tags. One case in which it does not treat a < as the start of a tag is when the < is followed by a space. By space I mean any character from 0x09 to 0x0d as well as 0x20 (this is how the internal isSpace() function works when called from php_strip_tags_ex()).
So a workaround is to put one of those space characters between the < and - characters and then revert it after calling strip_tags(). You should handle not only a < followed by -, but any < followed by a character matching [^a-zA-Z!?\s] (that is, anything that is not a letter, a ! or ? mark, or whitespace; a < followed by whitespace is already fine).
I chose the carriage return \r (0x0D in hex) as my space character because it is more specific:
$text = preg_replace( "~<\r([^a-zA-Z!?\s])~", '<$1', strip_tags( preg_replace( '~<([^a-zA-Z!?\s])~', "<\r\$1", $text ), '<p><a><br>' ) );
I recommend encoding the data that the user submits rather than removing the tags you don't allow. This way you won't lose text that merely looks like a tag.
Please note that running a complex regular expression on a big string is not efficient.
Take the input from the user and encode it, so that instead of <p> you save &lt;p&gt;. You can then insert it into the page as HTML: it will display as text rather than being interpreted as actual tags, so you don't need to remove anything.
You can use htmlspecialchars(string) for this.
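A minimal sketch of that encoding approach follows. The sample input is invented for illustration; ENT_QUOTES and 'UTF-8' are choices made here, not something the answer specifies. Keep in mind the trade-off: the <p> tags are shown as literal text too, so this preserves the user's characters but not their formatting.

```php
<?php
$submitted = '<p>Do you know this Power? <--</p><script>alert(1)</script>';

// Encode on output: every <, >, & and quote becomes an entity.
$encoded = htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8');

echo $encoded;
// &lt;p&gt;Do you know this Power? &lt;--&lt;/p&gt;&lt;script&gt;alert(1)&lt;/script&gt;
```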

preg_replace issue with html elements

I'm having issues with my preg_match solution.
I have the following html code:
<h1> Text marking test</h1><b> Chicago</b> - This is the text. Can this problem be solved by you?
I also have almost similar content:
Chicago - This is the text. Can this issue be solved by you?
All multiple spaces are gone and Problem has turned into Issue
I want to mark:
Chicago - This is the text. Can this
be solved by you?
So I get this:
<h1> Text marking test</h1><div class="marked"><b> Chicago</b> - This is the text. Can this</div> problem <div class="marked">be solved by you?</div>
I have the following regular expression pattern which works:
$string = preg_replace( "/(?im)(<b>)*Chicago([\s,.!?:;'\"]|<([^>]+)>)*-([\s,.!?:;'\"]|<([^>]+)>)*This([\s,.!?:;'\"]|<([^>]+)>)*is([\s,.!?:;'\"]|<([^>]+)>)*the([\s,.!?:;'\"]|<([^>]+)>)*text([\s,.!?:;'\"]|<([^>]+)>)*Can([\s,.!?:;'\"]|<([^>]+)>)*this([\s,.!?:;'\"]|<([^>]+)>)*/", '<div class="marked">' .'${0}'.'</div>', $string);
The problem is that the <b> tag immediately before Chicago could be any tag with any attributes, and it is also optional.
Only that immediately preceding tag should be included, not any earlier tag before Chicago.
But somehow I constantly fail in my attempts.
Any help is greatly appreciated!
Maybe you can remove all the HTML tags before the text analysis by using "<[^>]*>" with replace_all, and then write a simpler regex for the text analysis.
I suggest using several small regexes instead of one big one; it makes it easier to locate a bug or update your program.
Edit: I had misread your question and deleted my answer, but upon reading it again I think it might offer you some pointers on how to proceed. I don't completely understand the question, so please pardon the unsatisfactory answer.
You want to strip the text of HTML tags, as well as of multiple spaces. I would tackle these things separately:
function clean_text($text) {
$text = strip_tags($text);
$text = preg_replace('/\s{2,}/', ' ', $text);
return $text;
}
Use built-in functions where possible – no sense in re-inventing the wheel, especially as usually a lot of thought went into the functions. As for the second part, we match two or more whitespace characters and replace them with one space only.

How to remove all HTML tag exclude some tags [duplicate]

I created a form and I want to use PHP to remove all HTML tags except a few (<b>, <strong>, <em>, <i>, <p>, <br>, <ul>, <li>, <ol>, and some other paragraph-formatting tags) when members click Submit, before the content is inserted into the database.
$content = $_POST['content'];
Thanks all for help.
I'm sorry if my english isn't good.
Is this what you are looking for?
$content=strip_tags($content,"<b><strong><em><i><p><br><ul><li><ol>");
The following should do it:
// tags separated by vertical bar
$strip_tags = "a|strong|em";
// target html
$html = '<em><a><b>hadf</em></b>';
// Regex is loose and works for closing/opening tags across multiple lines and
// is case-insensitive
// note: The *? makes the matching non-greedy
$clean_html = preg_replace("#<\s*\/?(".$strip_tags.")\s*[^>]*?>#im", '', $html);
// prints "<b>hadf</b>"
echo $clean_html;
Using strip_tags() might be dangerous as it won't have a look at the HTML attributes. So a malicious user could use this for cross site scripting (XSS) and maybe other attacks (as also noted in my comment to David Chen).
Instead I would suggest using an existing HTML filter such as http://htmlpurifier.org/, which is probably much more secure and better suited to this task.
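To make the attribute problem concrete, here is a small sketch with an invented input string. strip_tags() removes the disallowed <script> tags (but not the text between them) and keeps the allowed <p> together with whatever attributes it carries, including an event handler.

```php
<?php
$input = '<p onclick="alert(1)">Hi</p><script>steal()</script>';

// <p> is allowed, so it is kept together with all of its attributes.
$filtered = strip_tags($input, '<p>');

echo $filtered; // <p onclick="alert(1)">Hi</p>steal()
```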

preg_replace on text only and not inside href's

I have this code to do some ugly inline color formatting on HTML content.
But it breaks anything inside tags, such as links and emails.
I've figured out halfway how to prevent it from formatting links, but it still doesn't prevent the replacement when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What will be a better approach to only replace text, and prevent the replacing on links?
A better approach would be to use XML functions instead.
Your lookbehind assertion only tests one character, so it's insufficient to assert that a match lies outside of HTML tags. This is a case where regular expressions aren't the best option. You can, however, get an approximation like:
$text = preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>", $text);
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag, so it works best if $text = "<div>$text</div>" or something similar.
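Another way to keep a replacement out of attributes such as href, sketched here under assumptions, is to split the string into tags and text and only touch the text pieces. The helper name mark_outside_tags, the plain <span> wrapper, and the sample string are all made up for illustration; the lookbehind also excludes @ in case the address is written normally. It still treats anything between < and > as a tag, so it is not a substitute for a real parser.

```php
<?php
// Hypothetical helper: wrap $word in a <span>, but only in text between tags.
function mark_outside_tags(string $html, string $word): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/(?<![#@.])' . preg_quote($word, '/') . '(?!\/)/';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes are text, odd indexes are tags
            $parts[$i] = preg_replace($pattern, '<span>$0</span>', $part);
        }
    }
    return implode('', $parts);
}

$html = '<a href="http://mytext.com/page">visit mytext</a> or mail info#mytext.com';
echo mark_outside_tags($html, 'mytext');
// <a href="http://mytext.com/page">visit <span>mytext</span></a> or mail info#mytext.com
```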
[edited] Oh, I see you solved the href problem.
To solve your email problem, use str_replace to change the email address to [email_safeguard] before working on the text, and change it back when you're finished. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
// work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
That should do the trick. :)
But as people have mentioned before, you'd better avoid mixing HTML and regex, or you will suffer the wrath of Cthulhu.
see this instead

Find Links and Remove them from HTML

How can I look for links in HTML and remove them?
$html = '<p>Test Title 1</p>';
$html .= '<p>Test Title 2</p>';
$html .= '<p>Test Title 3</p>';
$match = '<a href="javascript:doThis(\'Test Title 2\')">';
I want to remove the anchor but display the text. see below.
Test Title 1
Test Title 2
Test Title 3
I've never used regular expressions before, but maybe I can avoid them as well. Let me know if I'm not clear.
Thanks
Mark
EDIT: It's not a client-side thing. I can't use JavaScript for this. I have a custom CMS and want to edit HTML stored in a database.
You may try the simplest thing:
echo strip_tags($html, '<p>');
This strips all tags except <p>
If you really like regexp:
echo preg_replace('=</?a(\s[^>]*)?>=ims', '', $html);
EDIT:
Delete the <a> tag AND its surrounding tags (the code gets messy and doesn't work with broken (X)HTML):
echo preg_replace('=<([a-z]+)[^>]*>\s*<a(\s[^>]*)?>(.*?)</a>\s*</\\1>=ims', '$3', $html);
However, if your problem is that complicated, I recommend that you try XPath.
You could see if Simple HTML DOM does the trick.
You might have some joy with Beautiful Soup - http://www.crummy.com/software/BeautifulSoup/ (Python HTML parsing / manipulation API)
sed -i -e 's/<a.*<\/a>//g' filename.html
Note that using regular expressions for hacking HTML is a... dubious proposition, but it might just work in practice ;-)
You can use
var foo = document.getElementsByTagName('a');
to fetch all the link tags. No need for regular expressions here...
EDIT: I'm just learning to read... ;) Go with PHP's DOM or XML abilities. It should be pretty easy using those.
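A rough sketch of the DOM route in PHP follows. unwrap_links is a hypothetical helper name, the wrapper <div> is an assumption used to give the fragment a single root, and the sample anchor is modeled on the $match string from the question. Each <a> is replaced by its own child nodes, so the link text stays in place.

```php
<?php
// Hypothetical helper: remove <a> elements but keep their contents.
function unwrap_links(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);           // tolerate imperfect stored HTML
    $doc->loadHTML('<div>' . $html . '</div>'); // wrapper gives the fragment one root
    libxml_clear_errors();

    // Copy the live node list first, because the loop modifies the tree.
    $links = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($links as $a) {
        // Move every child of the link in front of it, then drop the empty link.
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html  = '<p>Test Title 1</p>';
$html .= '<p><a href="javascript:doThis(\'Test Title 2\')">Test Title 2</a></p>';
echo unwrap_links($html);
```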
Open the HTML file in Microsoft Expression.
Press Ctrl+F and then choose to replace tags or tag attribute contents.
An easy and quick solution.
Thanks
Shomaail
