preg_replace HTML code in PHP

I want to remove strings like the one below from some HTML code:
<span style="font-size: 0.8px; letter-spacing: -0.8px; color: #ecf6f6">3</span>
so I came up with this regex:
$pattern = "/<span style=\"font-size: \\d(\\.\\d)?px; letter-spacing: -\\d(\\.\\d)?px; color: #\\w{6}\">\\w\\w?</span>/um";
However, the regex doesn't work. Can someone point out what I did wrong? I'm new to PHP.
When I test with a simpler regex it works, so the problem must be in the regex itself.
$str = $_POST["txtarea"];
$pattern = $_POST["regex"];
echo preg_replace($pattern, "", $str);

As much as I would advocate DOMDocument to do the job here, you would still need some regular expression down the line, so ...
The expression for the px numeric value can be simply [\d.-]+, since you're not trying to validate anything.
The contents of the span can be simplified to [^<]* (i.e. anything but an opening angle bracket):
$re = '/<span style="font-size: [\d.-]+px; letter-spacing: [\d.-]+px; color: #[0-9a-f]{3,6}">[^<]*<\/span>/';
echo preg_replace($re, '', $str);

Do not use regex for this problem. Use an HTML parser. Here is a solution in Python with BeautifulSoup, because I like this library for these tasks:
import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; the older "from BeautifulSoup import ..." form only works on Python 2
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()
soup = BeautifulSoup(content, 'html.parser')
# Remove every span whose style attribute matches the pattern
for span in soup.find_all('span', {'style': re.compile(r"font-size: \d(\.\d)?px; letter-spacing: -\d(\.\d)?px; color: #\w{6}")}):
    span.extract()
with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))
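For readers who want to stay in PHP, the same parser-based idea can be sketched with DOMDocument. This is only a minimal sketch under some assumptions: the function name remove_matching_spans is made up for illustration, the input is an HTML fragment, and the fragment is wrapped in a <div> so the parser has a single root element.

```php
<?php
// Hypothetical helper: removes <span> elements whose style attribute matches the pattern.
function remove_matching_spans(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    // Wrap the fragment in a <div> so there is exactly one root element (assumption of this sketch)
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $pattern = '~font-size: \d(\.\d)?px; letter-spacing: -\d(\.\d)?px; color: #\w{6}~';
    // Copy the live node list to an array first: removing nodes while iterating it can skip elements
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        if (preg_match($pattern, $span->getAttribute('style'))) {
            $span->parentNode->removeChild($span);
        }
    }

    // Serialize the children of the wrapper <div> back to a string
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = remove_matching_spans('<p>a<span style="font-size: 0.8px; letter-spacing: -0.8px; color: #ecf6f6">3</span>b</p>');
echo $clean;
```

The regex is still there, but it only ever sees the value of a style attribute, never the surrounding markup.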

You have a slash (/) in your closing tag (</span>), and the slash is also your pattern delimiter.
You need to escape it or use a different delimiter than the slash.
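As a small illustration of the second option, here is the question's pattern with ~ as the delimiter. The sample string and the variable names are made up for this sketch; single quotes are used so the backslashes do not need doubling.

```php
<?php
$str = 'before<span style="font-size: 0.8px; letter-spacing: -0.8px; color: #ecf6f6">3</span>after';

// With "~" as the delimiter, the "/" in </span> no longer needs escaping
$pattern = '~<span style="font-size: \d(\.\d)?px; letter-spacing: -\d(\.\d)?px; color: #\w{6}">\w\w?</span>~u';

$result = preg_replace($pattern, '', $str);
echo $result; // prints "beforeafter"
```

A "#" delimiter would be a poor choice here, because the pattern itself contains a literal # before the colour code.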

Related

How to remove whitespace in inline styles?

I have a PHP script which generates an HTML email. In order to optimise the size so it does not fall foul of Google's 102kB limit, I'm trying to squeeze as many unnecessary characters out of the code as possible.
I currently use Emogrifier to inline the CSS and then TinyMinify to minify.
The output from this still has spaces between properties and values in the inlined styles (e.g. style="color: #ffffff; font-weight: 16px").
I've developed the following regex to remove the extra whitespace, but it also affects the actual content (e.g. this & that becomes this &that):
$out = preg_replace("/(;|:)\s([a-zA-Z0-9#])/", "$1$2", $newsletter);
How can I modify this regex so that it is limited to inline styles, or is there a better approach?

There is no bulletproof way to avoid matching the payload (style="" can appear anywhere in the text) or actual CSS values (as in content: 'a: b'). Furthermore, also consider:
- shortening the values: red is shorter than #f00, which is shorter than #ff0000
- removing leading and trailing junk, such as whitespace and semicolons
- redesigning your HTML: e.g. using <ins> and <strong> can be effectively shorter than using inline CSS
One approach would be to match all inline style HTML attributes first and then operate on their content only, but you have to test for yourself how well this works:
$out = preg_replace_callback(
    '/( style=")([^"]*)("[ >])/', // Find all appropriate HTML attributes
    function ($aMatch) { // Per match
        // Kill any amount of any kind of spaces after colon or semicolon only
        $sInner = preg_replace(
            '/([;:])\\s*([a-zA-Z0-9#])/', // Escaping backslash in PHP string context
            '$1$2',
            $aMatch[2] // Second sub match
        );
        // Kill any amount of leading and trailing semicolons and/or spaces
        $sInner = preg_replace(
            array('/^\\s*;*\\s*/', '/\\s*;*\\s*$/'),
            '',
            $sInner
        );
        return $aMatch[1] . $sInner . $aMatch[3]; // New HTML attribute
    },
    $newsletter
);

You haven't provided sample input for us to use, but you have mentioned that you are dealing with HTML. This should sound alarm bells that using regex as a direct solution is ill-advised. When intending to process valid HTML, you should be using a DOM parser to isolate the style attributes.
Why shouldn't you use regex to isolate the inline style declarations? Simply put: regex is "DOM-unaware". It doesn't know when it is inside or outside of a tag (I'll provide a contrived monkeywrench in my demo to expose this vulnerability). Furthermore, using a DOM parser adds the benefit of correctly handling different types of quoting. While regex can be written to match balanced quoting, it adds considerable bloat (even when executed well) and damages the readability and maintainability of your script.
In my demo, I'll show how spaces after colons, semicolons, and commas can be simply and accurately purged after isolating true inline style declarations. I've gone that little bit further (since color hexcode condensing was mentioned on this page) to show how regex can be used to reduce some six-character hexcodes to three characters.
Code:
$html = <<<HTML
<div style='font-family: "Times New Roman", Georgia, serif; background-color: #ffffff; '>
<p>Some text
<span class="ohyeah" style="font-weight: bold; color: #ff6633 !important; border: solid 1px grey;">
Monkeywrench: style="padding: 3px;"
</span>
&
<strong style="text-decoration: underline; ">Underlined</strong>
</p>
<h1 style="margin: 1px 2px 3px 4px;">Heading</h1>
<span style="background-image: url('images/not_a_hexcode_ffffff.png'); ">Text</span>
</div>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('*') as $node) {
    $style = $node->getAttribute('style');
    if ($style) {
        $patterns = ['~[:;,]\K\s+~', '~#\K([\da-f])\1([\da-f])\2([\da-f])\3~i'];
        $replaces = ['', '\1\2\3'];
        $node->setAttribute('style', preg_replace($patterns, $replaces, $style));
    }
}
$html = $dom->saveHTML();
echo $html;
Output:
<div style='font-family:"Times New Roman",Georgia,serif;background-color:#fff;'>
<p>Some text
<span class="ohyeah" style="font-weight:bold;color:#f63 !important;border:solid 1px grey;">
Monkeywrench: style="padding: 3px;"
</span>
&
<strong style="text-decoration:underline;">Underlined</strong>
</p>
<h1 style="margin:1px 2px 3px 4px;">Heading</h1>
<span style="background-image:url('images/not_a_hexcode_ffffff.png');">Text</span>
</div>
The above snippet uses \K in the patterns to avoid the use of lookaround and excess capture groups.
I am not writing a pattern that removes the space before !important because I have read some (not so recent) posts that some browsers express buggy behavior without the space.

How to remove HTML tags inside of brackets[]?

I have a string like this:
[<span style="font-size: 12.1599998474121px; line-height: 15.8079996109009px;">heading </span>heading="h1"]Its a <span style="text-decoration: line-through;">subject</span>.[/<span style="font-size: 12.1599998474121px; line-height: 15.8079996109009px;">heading</span>]
I want to remove the HTML tags that are inside the brackets, using PHP (preg_replace or similar). The final string should look like this:
[heading heading="h1"]Its a <span style="text-decoration: line-through;">subject</span>.[/heading]
I searched a lot for a solution, but without success.

This should work for you.
Here I just apply strip_tags() to every bracketed part of your string and return the result:
echo $str = preg_replace_callback("/\[(.*?)\]/", function ($m) {
    return strip_tags($m[0]);
}, $str);

You can use a callback with the following regular expression and utilize strip_tags():
$str = preg_replace_callback('~\[[^]]*]~', function ($m) {
    return strip_tags($m[0]);
}, $str);

It really depends on how much you want to remove.
Example:
Pattern: '<.*?>'
Result: [heading heading="h1"]Its a subject.[/heading]
But judging from your question, you want to keep the HTML tags that are inside your heading. I don't understand which rule exactly this is based on. Why is this an exception?

You can use a single regex to get what you want:
$re = "#][^\[\]]*(*SKIP)(*F)|<\/?[a-z].*?>#si";
$str = "[<span style=\"font-size: 12.1599998474121px; line-height: 15.8079996109009px;\">heading </span>heading=\"h1\"]Its a <span style=\"text-decoration: line-through;\">subject</span>.[/<span style=\"font-size: 12.1599998474121px; line-height: 15.8079996109009px;\">heading</span>]";
$result = preg_replace($re, '', $str);
echo $result;
Output of the sample code:
[heading heading="h1"]Its a <span style="text-decoration: line-through;">subject</span>.[/heading]

Replacing a "span" tag with an "input" tag displays extra "<" and ">" when using preg_replace in PHP

I am trying to replace a string containing a span tag with an input tag, as follows.
Original string:
<span style="font-family: Times New Roman; font-size: 12pt;"><img width="56" height="25" src="image023.gif" style="vertical-align:middle"></span>
The string I want to end up with:
<input type="radio" value="1" name="choice"><img width="56" height="25" src="image023.gif" style="vertical-align:middle"></input>
My code is:
$oldstr1='<span style="font-family: Times New Roman; font-size: 12pt;">';
$oldstr2='</span>';
$newstr1='<input type="radio" value="1" name="choice">';
$newstr2="</input>";
// $str holds a larger block of HTML that contains the span shown above
while (preg_match($oldstr1, $str) && preg_match($oldstr2, $str)) {
    $str = preg_replace($oldstr1, $newstr1, $str, 1);
    $str = preg_replace($oldstr2, $newstr2, $str, 1);
}
return $str;
However, the output I am getting has extra "<" and ">" characters: a "<", then the radio button with the proper tags, and then an extra ">" at the end. Please suggest a fix.

The problem is in your patterns, $oldstr1 and $oldstr2.
@Flosi posted a correct answer, but here is an alternative solution: in your case you can use str_replace, which will be faster (no while loop, and you don't need to change your patterns):
$str = str_replace($oldstr1,$newstr1, $str);
$str = str_replace($oldstr2,$newstr2, $str);

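You didn't set your delimiters, and your strings are not properly escaped. Without explicit delimiters, PHP treats the leading < and trailing > of each string as a bracket-style delimiter pair, so only the text between them is matched and replaced; that is where the stray < and > come from. It works if you add delimiters and escape the slash, e.g.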
$oldstr1='/\<span style="font-family: Times New Roman; font-size: 12pt;"\>/';
$oldstr2='/\<\/span\>/';
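If the search strings are meant to be taken literally, the escaping can be delegated to preg_quote() instead of being done by hand. The following is a minimal sketch along those lines; the shortened <img> tag and the $pattern1/$pattern2 names are assumptions made for the example.

```php
<?php
$oldstr1 = '<span style="font-family: Times New Roman; font-size: 12pt;">';
$oldstr2 = '</span>';
$newstr1 = '<input type="radio" value="1" name="choice">';
$newstr2 = '</input>';

// preg_quote() escapes regex metacharacters; its second argument also escapes the chosen delimiter
$pattern1 = '/' . preg_quote($oldstr1, '/') . '/';
$pattern2 = '/' . preg_quote($oldstr2, '/') . '/';

// Shortened sample input (assumption for this sketch)
$str = $oldstr1 . '<img src="image023.gif">' . $oldstr2;

$str = preg_replace($pattern1, $newstr1, $str, 1);
$str = preg_replace($pattern2, $newstr2, $str, 1);
echo $str;
```

For purely literal replacements like this one, str_replace remains the simpler tool; preg_quote() is mainly useful when a literal string has to be embedded in a larger pattern.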

Try adding '/' delimiters to your old strings, like this:
$oldstr1='/<span style="font-family: Times New Roman; font-size: 12pt;">/';
$oldstr2='/<\/span>/';
EDIT: I guess that in your case it would be better to use @MarkS's answer and just do a plain replace instead of a regex.

Nested BBCode quotes: how to?

Hi, I'm using a pretty basic BBCode parser.
Could you guys help me with a problem of mine?
When, for example, this is written:
[quote=tanab][quote=1][code]a img{
text-decoration: none;
}[/code][/quote][/quote]
the output is this:
tanab said:
[quote=1]
a img{
text-decoration: none;
}
[/quote]
How would I go about fixing that? I'm really bad at the whole preg_replace stuff.
This is my parser:
function bbcode($input){
    $input = htmlentities($input);
    $search = array(
        '/\[b\](.*?)\[\/b\]/is',
        '/\[i\](.*?)\[\/i\]/is',
        '/\[img\](.*?)\[\/img\]/is',
        '/\[url=(.*?)\](.*?)\[\/url\]/is',
        '/\[code\](.*?)\[\/code\]/is',
        '/\[\*\](.*?)/is',
        '/\\t(.*?)/is',
        '/\[quote=(.*?)\](.*?)\[\/quote\]/is',
    );
    $replace = array(
        '<b>$1</b>',
        '<i>$1</i>',
        '<img src="$1">',
        '$2',
        '<div class="code">$1</div>',
        '<ul><li>$1</li></ul>',
        ' ',
        '<div class="quote"><div class="quote-writer">$1 said:</div><div class="quote-body">$2</div></div>',
    );
    return preg_replace($search,$replace,$input);
}

This could be adapted with a recursive regex:
'/\[quote=(.*?)\](((?R)|.*?)+)\[\/quote\]/is'
This will at least ensure that the output divs are not incorrectly nested, but you would still have to run the regex two or three times to catch all quote blocks.
Otherwise it would require rewriting your code with preg_replace_callback, which I won't showcase here, since this has come up a few dozen times already (try the site search!) and has been solved before.
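The "run it several times" idea can be sketched as a loop. Note that this is a different technique from the recursive pattern above: the regex below only matches innermost quotes (a body with no further opening [quote=...] tag) and is reapplied until nothing changes. The function name bbcode_quotes is hypothetical and not part of the original parser.

```php
<?php
// Hypothetical helper: converts nested [quote=...] blocks from the inside out.
function bbcode_quotes(string $input): string
{
    // The body may not contain another opening [quote=, so only innermost quotes match
    $pattern = '/\[quote=([^\]]*)\]((?:(?!\[quote=).)*?)\[\/quote\]/is';
    $replace = '<div class="quote"><div class="quote-writer">$1 said:</div>'
             . '<div class="quote-body">$2</div></div>';
    do {
        $result = preg_replace($pattern, $replace, $input, -1, $count);
        if ($result === null) {
            break; // regex error: keep the last good string
        }
        $input = $result;
    } while ($count > 0); // stop once a pass replaces nothing
    return $input;
}

$html = bbcode_quotes('[quote=tanab][quote=1]hi[/quote][/quote]');
echo $html;
```

Each pass converts one nesting level, so the loop runs once per level plus a final pass that finds nothing left to replace.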

How to extract image filename from style/background-image tag?

I found lots of posts about extracting a filename from an img tag, but none about extracting it from an inline CSS style attribute. Here's the source string:
<span style="width: 40px; height: 30px; background-image: url("./files/foo/bar.png");" class="bar">FOO</span>
What I want to get is bar.png.
I tried this:
$pattern = "/background-image: ?.png/";
preg_match($pattern, $string, $matches);
But this didn't work out.
Any help appreciated.

You need to read up about regular expressions.
"/background-image: ?.png/"
means "background-image:" followed optionally by a space, followed by any single character, followed (directly) by "png".
Exactly what you need depends on how much variation you need to allow for in the layout of the tag, but it will be something like
'/background-image\s*:\s*url\s*\(\s*"(?:[^"]*\/)?([^"\/]+)"/'
where each \s* allows optional whitespace, the literal parenthesis of url( is escaped as \(, and the capturing parentheses grab the last part of the path, which contains neither a slash nor a quote.
Generally, regexp is not a good tool for parsing HTML, but in this limited case it might be OK.

$string = '<span style="width: 40px; height: 30px; background-image: url("./files/foo/bar.png");" class="bar">FOO</span>';
$pattern = '/background-image:\s*url\(\s*([\'"]*)(?P<file>[^\1]+)\1\s*\)/i';
$matches = array();
if (preg_match($pattern, $string, $matches)) {
    echo $matches['file'];
}

Something along these lines:
$style = "width: 40px; height: 30px; background-image: url('./files/foo/bar.png');";
preg_match("/url[\s]*\(([\'\"])([^\'\"]+)([\'\"])\)/", $style, $matches);
var_dump($matches[2]);
It won't work for filenames that contain ' or ". It basically matches anything between the parentheses of url() that is not ' or ".
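Since the question asks for bar.png rather than the whole path, one option is to capture the url() argument and then hand it to basename(). This is a minimal sketch; the function name background_image_filename is made up, and it assumes the closing quote (if any) is immediately followed by optional whitespace and the closing parenthesis.

```php
<?php
// Hypothetical helper: returns the file name from a background-image url(), or null if none is found.
function background_image_filename(string $style): ?string
{
    // Group 1: optional opening quote; group 2: the URL; \1 requires the same quote (or none) to close
    $pattern = '/background-image\s*:\s*url\(\s*([\'"]?)(.*?)\1\s*\)/i';
    if (preg_match($pattern, $style, $m)) {
        return basename($m[2]); // strip the directory part
    }
    return null;
}

$file = background_image_filename('width: 40px; height: 30px; background-image: url("./files/foo/bar.png");');
echo $file; // prints "bar.png"
```

Splitting the job this way keeps the regex simple: it only has to find the URL, and the path handling is left to a built-in function.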
