I have a PHP script which generates an HTML email. In order to optimise the size and not fall foul of Google's 102kB limit, I'm trying to squeeze as many unnecessary characters out of the code as possible.
I currently use Emogrifier to inline the css and then TinyMinify to minify.
The output from this still has spaces between properties and values in the inlined styles (e.g. style="color: #ffffff; font-weight: 16px").
I've developed the following regex to remove the extra whitespace, but it also affects the actual content (e.g. this &amp; that becomes this &amp;that):
$out = preg_replace("/(;|:)\s([a-zA-Z0-9#])/", "$1$2", $newsletter);
How can I modify this regex so that it is limited to inline styles, or is there a better approach?
There is no bulletproof way to avoid matching the payload (style="" can appear anywhere in the text) or to avoid matching actual CSS values (as in content: 'a: b'). Furthermore, also consider:
shortening the values: red is shorter than #f00, which is shorter than #ff0000
removing leading and trailing cruft, such as whitespace and semicolons
redesigning your HTML: e.g. using <ins> and <strong> can be effectively shorter than using inline CSS
One approach would be to match all inline style HTML attributes first and then operate on their content only, but you have to test for yourself how well this works:
$out= preg_replace_callback
( '/( style=")([^"]*)("[ >])/' // Find all appropriate HTML attributes
, function( $aMatch ) { // Per match
// Kill any amount of any kind of spaces after colon or semicolon only
$sInner= preg_replace
( '/([;:])\\s*([a-zA-Z0-9#])/' // Escaping backslash in PHP string context
, '$1$2'
, $aMatch[2] // Second sub match
);
// Kill any amount of leading and trailing semicolons and/or spaces
$sInner= preg_replace
( array( '/^\\s*;*\\s*/', '/\\s*;*\\s*$/' )
, ''
, $sInner
);
return $aMatch[1]. $sInner. $aMatch[3]; // New HTML attribute
}
, $newsletter
);
You haven't provided sample input for us to use, but you have mentioned that you are dealing with html. This should sound alarm bells that using regex as a direct solution is ill-advised. When intending to process valid html, you should be using a dom parser to isolate the style attributes.
Why shouldn't you use regex to isolate the inline style declarations? Simply put: regex is "dom-unaware". It doesn't know when it is inside or outside of a tag (I'll provide a contrived monkeywrench in my demo to express this vulnerability). Furthermore, using a dom parser adds the benefit of correctly handling different types of quoting. While regex can be written to match/acknowledge balanced quoting, it adds considerable bloat (even when executed well) and damages the readability and maintainability of your script.
In my demo, I'll show how spaces after colons, semicolons, and commas can be simply/accurately purged after isolating true inline style declarations. I've gone that little bit farther (since color hexcode condensing was mentioned on this page) to show how regex can be used to reduce some six character hexcodes to three characters.
Code:
$html = <<<HTML
<div style='font-family: "Times New Roman", Georgia, serif; background-color: #ffffff; '>
<p>Some text
<span class="ohyeah" style="font-weight: bold; color: #ff6633 !important; border: solid 1px grey;">
Monkeywrench: style="padding: 3px;"
</span>
&
<strong style="text-decoration: underline; ">Underlined</strong>
</p>
<h1 style="margin: 1px 2px 3px 4px;">Heading</h1>
<span style="background-image: url('images/not_a_hexcode_ffffff.png'); ">Text</span>
</div>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('*') as $node) {
$style = $node->getAttribute('style');
if ($style) {
$patterns = ['~[:;,]\K\s+~', '~#\K([\da-f])\1([\da-f])\2([\da-f])\3~i'];
$replaces = ['', '\1\2\3'];
$node->setAttribute('style', preg_replace($patterns, $replaces, $style));
}
}
$html = $dom->saveHtml();
echo $html;
Output:
<div style='font-family:"Times New Roman",Georgia,serif;background-color:#fff;'>
<p>Some text
<span class="ohyeah" style="font-weight:bold;color:#f63 !important;border:solid 1px grey;">
Monkeywrench: style="padding: 3px;"
</span>
&
<strong style="text-decoration:underline;">Underlined</strong>
</p>
<h1 style="margin:1px 2px 3px 4px;">Heading</h1>
<span style="background-image:url('images/not_a_hexcode_ffffff.png');">Text</span>
</div>
The above snippet uses \K in the patterns to avoid the use of lookaround and excess capture groups.
I am not writing a pattern that removes the space before !important because I have read some (not so recent) posts that some browsers express buggy behavior without the space.
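For what it's worth, the whitespace stripping, the hex condensing, and the earlier suggestion of trimming stray semicolons could be folded into one helper and called on each style value inside the loop. This is only a sketch: minifyStyle() is a hypothetical name, and the \b after the hex pattern is my own assumption, added so that eight-digit colors such as #ffffff00 are left alone.

```php
<?php
// Hypothetical helper: minify the value of a single style attribute.
function minifyStyle(string $style): string
{
    // Drop whitespace that follows a colon, semicolon, or comma.
    $style = preg_replace('~[:;,]\K\s+~', '', $style);
    // Condense #aabbcc to #abc when each pair repeats the same hex digit.
    $style = preg_replace('~#([\da-f])\1([\da-f])\2([\da-f])\3\b~i', '#$1$2$3', $style);
    // Strip leading and trailing semicolons and whitespace.
    return trim($style, "; \t\n\r");
}

echo minifyStyle('color: #ffffff; font-weight: bold; '), "\n";
```

Like any regex working on raw CSS text, this would still alter quoted values such as content: 'a: b', so test it against your own templates.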
I am trying to remove the signature of an email before inserting the message into a database. The signature is enclosed in a special tag, <signature>, to help strip it out.
The following only works if the signature is condensed onto a single line, not when it is spread over several lines with whitespace.
$msgeBody = preg_replace('#(<signature>).*?(</signature>)#', '$1$2', $msgeBody);
I have tried approaches found online for removing the whitespace between these tags before applying the line above, but with no success. How can I do this? Here is the sample text spread over lines:
<signature><p><span style="font-weight: bold;">Gerald Sugan</span><br>
Travel Consultant<br>
<span style="font-size: 18px; font-family: 'Courier New'; font-weight: bold;">Sugan Enterprises Inc</span></p>
</signature>
The question "PHP's preg_replace regex that matches multiple lines" is not a duplicate of this one: I could not see how to apply those solutions here, and I think the solution found below is different.
You can use DOMDocument:
$mail= <<<'EOD'
<body>
blah blah blah
<signature><p><span style="font-weight: bold;">Gerald Sugan</span><br>
Travel Consultant<br>
<span style="font-size: 18px; font-family: 'Courier New'; font-weight: bold;">Sugan Enterprises Inc</span></p>
</signature>
blah blah blah
</body>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($mail, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach (iterator_to_array($dom->getElementsByTagName('signature')) as $node) { // copy the live node list so that removals don't skip nodes
$node->parentNode->removeChild($node);
}
echo $dom->saveHTML();
Here is a simple regex that matches your signature: <signature>[\S\s]*<\/signature>
\S : Matches anything other than a space, tab or newline.
\s : Matches any space, tab or newline character.
* : Matches zero or more of the preceding token (here, any character at all).
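As a rough sketch of how this might be applied in PHP: the s modifier makes . match newlines too, which does the same job as [\S\s], and the lazy .*? keeps each match to a single signature block. The sample body below is made up for illustration.

```php
<?php
// Made-up sample: a signature block spread over several lines.
$msgeBody = "Hello,\n<signature><p>Gerald Sugan<br>\nTravel Consultant</p>\n</signature>\nBye";

// # is the delimiter, so the / in </signature> needs no escaping.
// The s modifier lets . match newlines; .*? stops at the first closing tag.
$clean = preg_replace('#<signature>.*?</signature>#s', '', $msgeBody);

echo $clean;
```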
Try using trim(), a function that removes whitespace or any characters you specify:
http://www.w3schools.com/php/func_string_trim.asp
Explode would separate the signature from the email body and is quite a short piece of code but you would need to get rid of the last left-over tag.
To answer the original query: chop($yourString, ' ') removes trailing spaces from $yourString (chop() is an alias of rtrim(), so it does not touch whitespace inside the string). Reference: http://php.net/manual/en/function.chop.php
Your email is held in a variable called $msgeBody, so split it at "signature" and trim off the remaining tag.
$msgeBody = explode("signature", $msgeBody);
$msgeBody = rtrim($msgeBody[0], "<");
Clean up $msgeBody before putting it in your database.
Using $msgeBody = explode("signature", $msgeBody); leaves the first < from "signature" on the end of the first part - the body of the email - which would be in array position $msgeBody[0].
str_replace('<','', $msgeBody[0]); would also remove that <, but if you have other tags in $msgeBody[0] it would strip the < from those too.
rtrim($msgeBody[0], "<"); should remove it better.
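substr() also has possibilities (http://php.net/manual/en/function.substr.php) when combined with strpos(), which would find the first occurrence of '<signature>'.
rtrim($msgeBody,'<signature>'); might also chop it off, but note that rtrim() treats its second argument as a list of characters rather than a literal string, and Mariano's caveat about multiple signatures applies. Not tested.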
strip_tags($msgeBody, ''); will get rid of all the tags in case that could be used. (You put any tags you want to keep in the '' - as in '<br />' for example.)
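A minimal sketch of the strpos()/substr() idea, assuming (like the explode() approach) that everything from the opening tag onwards can be thrown away. cutAtSignature() is a hypothetical name, not an existing function.

```php
<?php
// Hypothetical helper: keep only the text before the first <signature> tag.
function cutAtSignature(string $body): string
{
    $pos = strpos($body, '<signature>');
    // strpos() returns false when the tag is absent; leave the body untouched then.
    return $pos === false ? $body : substr($body, 0, $pos);
}

echo cutAtSignature("Dear all,\nsee you Monday.\n<signature>Gerald</signature>");
```

Unlike explode("signature", ...), this does not leave a stray < behind and does not split on the word "signature" appearing in the body text.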
I am trying to replace a string having span tag with the input tag as follows
original string:
<span style="font-family: Times New Roman; font-size: 12pt;"><img width="56" height="25" src="image023.gif" style="vertical-align:middle"></span>
the string i want to change:
<input type="radio" value="1" name="choice"><img width="56" height="25" src="image023.gif" style="vertical-align:middle"></input>
mycode is:
$oldstr1='<span style="font-family: Times New Roman; font-size: 12pt;">';
$oldstr2='</span>';
$newstr1='<input type="radio" value="1" name="choice">';
$newstr2="</input>";
$str = '...'; // a superset of HTML content that contains the span I mentioned
while (preg_match($oldstr1, $str) && preg_match($oldstr2, $str)) {
$str = preg_replace($oldstr1,$newstr1, $str, 1);
$str = preg_replace($oldstr2,$newstr2, $str, 1);
}
return $str;
However, the output I am getting has extra "<" and ">" characters in it: a "<", then the radio button with proper tags, and then an extra ">" at the end. Please suggest a fix.
The problem is in your patterns, $oldstr1 and $oldstr2.
@Flosi posted a correct answer, but here is an alternative solution: in your case you can use str_replace, which will be faster (no while loop, and you don't need to change your patterns):
$str = str_replace($oldstr1,$newstr1, $str);
$str = str_replace($oldstr2,$newstr2, $str);
You didn't set your delimiters, and your strings are not properly escaped. It works if you do that, e.g.
$oldstr1='/\<span style="font-family: Times New Roman; font-size: 12pt;"\>/';
$oldstr2='/\<\/span\>/';
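This also explains the stray characters: PHP accepts < and > as a bracket-style delimiter pair, so the original patterns were read as span style="..." and /span, and only that inner text was replaced while the surrounding < and > stayed behind. If you want to keep preg_replace, a sketch along these lines lets preg_quote() do the escaping for you. The shortened img tag in $str is just sample input.

```php
<?php
$oldstr1 = '<span style="font-family: Times New Roman; font-size: 12pt;">';
$oldstr2 = '</span>';
$newstr1 = '<input type="radio" value="1" name="choice">';
$newstr2 = '</input>';

// Sample input built from the strings above.
$str = $oldstr1 . '<img src="image023.gif">' . $oldstr2;

// preg_quote() escapes regex metacharacters; its second argument also escapes the delimiter.
$pattern1 = '/' . preg_quote($oldstr1, '/') . '/';
$pattern2 = '/' . preg_quote($oldstr2, '/') . '/';

$str = preg_replace($pattern1, $newstr1, $str, 1);
$str = preg_replace($pattern2, $newstr2, $str, 1);

echo $str;
```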
Try to add '/' to your old string. Like this:
$oldstr1='/<span style="font-family: Times New Roman; font-size: 12pt;">/';
$oldstr2='/<\/span>/';
EDIT: I guess that in your case it would be better to use @MarkS's answer and just do a plain replace instead of using regex.
I want to remove string like below from a html code
<span style="font-size: 0.8px; letter-spacing: -0.8px; color: #ecf6f6">3</span>
so I came up with regex.
$pattern = "/<span style=\"font-size: \\d(\\.\\d)?px; letter-spacing: -\\d(\\.\\d)?px; color: #\\w{6}\">\\w\\w?</span>/um";
However, the regex doesn't work. Can someone point out what I did wrong? I'm new to PHP.
When I tested with a simple regex it worked, so the problem lies with this regex.
$str = $_POST["txtarea"];
$pattern = $_POST["regex"];
echo preg_replace($pattern, "", $str);
As much as I would advocate DOMDocument to do the job here, you would still need some regular expression down the line, so ...
The expression for the px numeric value can be simply [\d.-]+, since you're not trying to validate anything.
The contents of the span can be simplified to [^<]* (i.e. anything but an opening bracket):
$re = '/<span style="font-size: [\d.-]+px; letter-spacing: [\d.-]+px; color: #[0-9a-f]{3,6}">[^<]*<\/span>/';
echo preg_replace($re, '', $str);
Do not use regex for this problem. Use an html parser. Here is a solution in python with BeautifulSoup, because I like this library for these tasks:
import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "from BeautifulSoup import ..." is version 3
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()
soup = BeautifulSoup(content, 'html.parser')
style_re = re.compile(r"font-size: \d(\.\d)?px; letter-spacing: -\d(\.\d)?px; color: #\w{6}")
# Remove every span whose style attribute matches the pattern
for span in soup.find_all('span', {'style': style_re}):
    span.extract()
with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))
You have a slash ( / ) in your ending tag (the closing span).
You need to escape it or use a delimiter other than the slash.
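For example, a sketch with ~ as the delimiter, keeping the rest of the pattern from the question as it was (single quotes are used here so the backslashes don't need doubling):

```php
<?php
$str = 'a<span style="font-size: 0.8px; letter-spacing: -0.8px; color: #ecf6f6">3</span>b';

// With ~ as the delimiter, the / in </span> no longer ends the pattern early.
$pattern = '~<span style="font-size: \d(\.\d)?px; letter-spacing: -\d(\.\d)?px; color: #\w{6}">\w\w?</span>~u';

$result = preg_replace($pattern, '', $str);
echo $result;
```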
I found lots of posts regarding extracting a filename from an img tag, but none from a CSS inline style attribute. Here's the source string:
<span style="width: 40px; height: 30px; background-image: url("./files/foo/bar.png");" class="bar">FOO</span>
What I want to get is bar.png.
I tried this:
$pattern = "/background-image: ?.png/";
preg_match($pattern, $string, $matches);
But this didn't work out.
Any help appreciated.
You need to read up about regular expressions.
"/background-image: ?.png/"
means "background-image:" followed optionally by a space, followed by any single character, followed (directly) by "png".
Exactly what you need depends on how much variation you need to allow for in the layout of the tag, but it will be something like
'/background-image\s*:\s*url\s*\(\s*".*\/([^\/"]+)"/'
where each "\s*" is optional whitespace, the literal parenthesis of url( is escaped as "\(", and the capturing parentheses grab the last path segment: something that contains neither a slash nor a quote.
Generally, regexp is not a good tool for parsing HTML, but in this limited case it might be OK.
$string = '<span style="width: 40px; height: 30px; background-image: url("./files/foo/bar.png");" class="bar">FOO</span>';
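$pattern = '/background-image:\s*url\(\s*([\'"]?)(?P<file>.+?)\1\s*\)/i'; // \1 is not a backreference inside [^...], so match lazily up to the closing quote instead
$matches = array();
if (preg_match($pattern, $string, $matches)) {
    echo basename($matches['file']); // basename() strips the directory part, leaving bar.png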
}
Something along these lines:
$style = "width: 40px; height: 30px; background-image: url('./files/foo/bar.png');";
preg_match("/url[\s]*\(([\'\"])([^\'\"]+)([\'\"])\)/", $style, $matches);
var_dump($matches[2]);
It won't work for filenames that contain ' or ". It basically matches anything between the parentheses of url() that is not ' or ".
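Since the question asks for bar.png rather than the whole path, one possible way to finish the job is to pass the captured path through basename(). A small sketch, using the single-quoted style string from above and a lazy match with an optional quote (an assumption on my part about which url() forms you need to handle):

```php
<?php
$style = "width: 40px; height: 30px; background-image: url('./files/foo/bar.png');";
$file = null;

// Group 1 is the optional quote, group 2 the path; \1 requires the same quote to close it.
if (preg_match('/url\(\s*([\'"]?)(.+?)\1\s*\)/i', $style, $m)) {
    // basename() drops the directory part of the captured path.
    $file = basename($m[2]);
}

echo $file;
```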
PHP: how can I search through this string in such a way that, when I find class="font8text">N</span>, it gives me 'EARLL', which is in the next <span>?
<div align="left" style=";">
<span style="width:15px; padding:1px; border:1pt solid #999999; background-color:#CCFFCC; text-align:center;" class="font8text">Y</span>
<span style="text-align:left; white-space:nowrap;" class="font8text">DINNIMAN</span>
</div>
<div align="left" style="background-color:#F8F8FF;">
<span style="width:15px; padding:1px; border:1pt solid #999999; background-color:#FFCCCC; text-align:center;" class="font8text">N</span>
<span style="text-align:left; white-space:nowrap;" class="font8text">EARLL</span>
</div>
Use a DOM-parser like: http://simplehtmldom.sourceforge.net/
As mentioned (a countless number of times), regex is not a good way to parse HTML. Actually, you can't really parse HTML with regex: HTML is not a regular language. You can only extract bits, and even that data is (in most cases) very unreliable.
It's better to use a DOM parser, because a parser that turns the HTML into a document makes it easier to traverse.
Example:
include_once('simple_html_dom.php');
$dom = str_get_html('<html>...'); // str_get_html() parses a string; file_get_html() expects a file path or URL
foreach($dom->find("div.head div.fact p.fact") as $element)
    die($element->innertext);
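If you would rather avoid a third-party library, the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath. The HTML below is a trimmed copy of the markup in the question (style attributes dropped for brevity), and the XPath expression is my guess at what "the next span after N" should mean.

```php
<?php
$html = <<<'HTML'
<div align="left">
<span class="font8text">Y</span>
<span class="font8text">DINNIMAN</span>
</div>
<div align="left">
<span class="font8text">N</span>
<span class="font8text">EARLL</span>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Find each font8text span whose text is N, then step to its next sibling span.
$query = '//span[@class="font8text"][normalize-space()="N"]/following-sibling::span[1]';
$names = [];
foreach ($xpath->query($query) as $span) {
    $names[] = trim($span->textContent);
}

print_r($names);
```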
I think you're better off using strpos and substr in conjunction with each other.
Example:
$str = '...'; // insert your string here
$_find = 'class="font8text">'; // set the search text
$start = strpos($str, $_find) + strlen($_find); // find the start of the text, offset by the length of the search text
$len = strpos($str, '<', $start) - $start; // find the end, then subtract the start to get the length
$text = substr($str, $start, $len); // result
This would do it:
/class="font8text">N.*?class="font8text">(.*?)</m
EARLL would be in the first match group. Try it on Rubular.
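One caveat when moving this from Rubular to PHP: Rubular uses Ruby syntax, where m lets . match newlines, whereas in PHP that job is done by the s modifier (m only changes how ^ and $ behave). A sketch of the PHP version, with a shortened sample string:

```php
<?php
// Shortened sample: the two spans sit on separate lines, as in the question.
$str = "<span class=\"font8text\">N</span>\n<span style=\"text-align:left;\" class=\"font8text\">EARLL</span>";

$name = null;
// The s modifier lets .*? cross the line break between the two spans.
if (preg_match('/class="font8text">N.*?class="font8text">(.*?)</s', $str, $m)) {
    $name = $m[1];
}

echo $name;
```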