Remove CSS rules dependent on RegEx Classes - php

Intro:
I'm fairly new to RegEx, so bear with me here. We have a client with an extremely large CSS file, verging on 27k lines in total: roughly 20k lines are pure CSS and the rest is written in SCSS. I am attempting to cut this down and, despite having used more than the allotted hours on it, I found the problem extremely interesting, so I wrote a little PHP script to do this for me! Unfortunately it's not quite there, because the RegEx is being a little troublesome.
Context
remove.txt - Text file containing selectors, line by line that are redundant on our site and can be removed.
main.scss - The big SASS file.
PHP script - Basically reads the remove.txt file line by line, finds the selector in the main.scss file and adds a "UNUSED" string before each selector, so I can go down line by line and remove the rule.
Issue
So the main reason this is troublesome is because we have to account for lots of occurrences at the start of the CSS rules and towards the end as well. For example -
Example scenarios of .foo-bar (bold indicates what should match) -
.foo-bar {}
.foo-bar, .bar-foo {}
.foo-bar .bar-foo {}
.boo-far, .foo-bar {}
.foo-bar,.bar-foo {}
.bar-foo.foo-bar {}
PHP Script
<?php
$unused = 'main.scss';
if ($file = fopen("remove.txt", "r")) {
    // Stop an endless loop if file doesn't exist
    if (!$file) {
        die('plz no loops');
    }
    // Begin looping through redundant selectors line by line
    while (!feof($file)) {
        $line = trim(fgets($file));
        // Apply the regex to the selector
        $line = $line.'([\s\S][^}]*})';
        // Apply the global operators
        $line = '/^'.$line.'/m';
        // Echo the output for reference and debugging
        echo ('<p>'.$line.'</p>');
        // Find the rule, append it with UNUSED at the start
        $dothings = preg_replace($line,'UNUSED $0',file_get_contents($unused), 1);
    }
    fclose($file);
} else {
    echo ('<p>failed</p>');
}
?>
RegEx
From the above you can gather my RegEx will be -
/^REDUNDANTRULE([\s\S][^}]*})/m
It currently has a hard time dealing with the indentation that typically occurs within media queries, and also with rules where other selectors are applied alongside it.
From this I tried adding to the start (To accommodate for whitespace and when the selector is used in a longer version of the selector) -
^[0a-zA-Z\s]
And also adding this to the end (to accommodate for commas separating selectors)
\,
Could any RegEx/PHP wizards point me in the right direction? Thank you for reading regardless!
Thanks @ctwheels for the fantastically explained answer. I encountered a couple of other issues, one being that full stops in the redundant selectors I read in were not being escaped. I've now updated my script to escape them before doing the find and replace, as seen below. This is now my most up-to-date, working script -
<?php
$unused = 'main.scss';
if ($file = fopen("remove.txt", "r")) {
    if (!$file) {
        die('plz no loops');
    }
    while (!feof($file)) {
        $line = trim(fgets($file));
        if( strpos( $line, '.' ) !== false ) {
            echo ". found in $line, escaping characters";
            $line = str_replace('.', '\.', $line);
        }
        $line = '/(?:^|,\s*)\K('.$line.')(?=\s*(?:,|{))/m';
        echo ('<p>'.$line.'</p>');
        var_dump(preg_match_all($line, file_get_contents($unused)));
        $dothings = preg_replace($line,'UNUSED $0',file_get_contents($unused), 1);
        var_dump(
            file_put_contents($unused,
                $dothings
            )
        );
    }
    fclose($file);
} else {
    echo ('<p>failed</p>');
}
?>

Answer
Brief
Based on the examples you provided, the following regex will work; however, it will not work for all CSS rules. If you add more cases, I can update the regex to accommodate those other situations.
Code
See regex in use here
Regex
(?:^|,\s*)\K(\.foo-bar)(?=\s*(?:,|{))
Replacement
UNUSED $1
Note: The multiline m flag is used.
Usage
The following script is generated by regex101 (by clicking on code generator in regex101): Link here
$re = '/(?:^|,\s*)\K(\.foo-bar)(?=\s*(?:,|{))/m';
$str = '.foo-bar {}
.foo-bar, .bar-foo {}
.foo-bar .bar-foo {}
.boo-far, .foo-bar {}
.foo-bar,.bar-foo {}
.bar-foo.foo-bar {}';
$subst = 'UNUSED $1';
$result = preg_replace($re, $subst, $str);
echo "The result of the substitution is ".$result;
Results
Input
.foo-bar {}
.foo-bar, .bar-foo {}
.foo-bar .bar-foo {}
.boo-far, .foo-bar {}
.foo-bar,.bar-foo {}
.bar-foo.foo-bar {}
Output
UNUSED .foo-bar {}
UNUSED .foo-bar, .bar-foo {}
.foo-bar .bar-foo {}
.boo-far, UNUSED .foo-bar {}
UNUSED .foo-bar,.bar-foo {}
.bar-foo.foo-bar {}
Explanation
(?:^|,\s*) Match either of the following
    ^ Assert position at the start of the line
    ,\s* Comma character , literally, followed by any number of whitespace characters
\K Resets starting point of the reported match (any previously consumed characters are no longer included in the final match)
(\.foo-bar) Capture into group 1: The dot character . literally, followed by foo-bar literally
(?=\s*(?:,|{)) Positive lookahead ensuring what follows matches the following
    \s* Any whitespace character any number of times
    (?:,|{) Match either of the following
        , Comma character , literally
        { Left curly bracket { literally
Edit
The following regex is an update from the previous one and moves \s* outside the first group to match the possibility of whitespace after the caret ^ as well.
(?:^|,)\s*\K(\.foo-bar)(?=\s*(?:,|{))
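As a rough sketch of how the updated regex could be wrapped up for the script in the question: the helper name markUnused is made up for illustration, and preg_quote() is swapped in for the manual str_replace('.', '\.', ...) so that other special characters in a selector get escaped too. It assumes the same /m flag and the same "UNUSED $1" replacement as above.

```php
<?php
// Hypothetical helper: prefix each matching occurrence of $selector with "UNUSED ".
function markUnused(string $css, string $selector): string
{
    // preg_quote() escapes regex metacharacters; "/" is the delimiter used below.
    $quoted = preg_quote($selector, '/');
    $pattern = '/(?:^|,)\s*\K(' . $quoted . ')(?=\s*(?:,|\{))/m';
    return preg_replace($pattern, 'UNUSED $1', $css);
}

// Indented rule inside a media query, a descendant selector, and a selector after a comma.
$css = "@media screen {\n    .foo-bar {}\n    .bar-foo .foo-bar {}\n}\n.boo-far,\n.foo-bar {}";
echo markUnused($css, '.foo-bar');
```

The indented rule and the one after the comma get marked, while the descendant selector .bar-foo .foo-bar is left alone, which matches the behaviour in the Results section.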

Related

Erasing C comments with preg_replace

I need to erase all comments in $string which contains data from some C file.
The thing I need to replace looks like this:
something before that shouldnt be replaced
/*
* some text in between with / or * on many lines
*/
something after that shouldnt be replaced
and the result should look like this:
something before that shouldnt be replaced
something after that shouldnt be replaced
I have tried many regular expressions but neither work the way I need.
Here are some latest ones:
$string = preg_replace("/\/\*(.*?)\*\//u", "", $string);
and
$string = preg_replace("/\/\*[^\*\/]*\*\//u", "", $string);
Note: the text is in UTF-8, the string can contain multibyte characters.
You would also want to add the s modifier to tell the regex that .* should include newlines. I always think of s to mean "treat the input text as a single line"
So something like this should work:
$string = preg_replace("/\\/\\*(.*?)\\*\\//us", "", $string);
Example: http://codepad.viper-7.com/XVo9Tp
Edit: Added extra escape slashes to the regex as Brandin suggested because he is right.
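To see what the s modifier changes, here is a small sketch with made-up input. It uses ~ as the delimiter purely so the slashes don't need escaping; the pattern is otherwise the same lazy one as above.

```php
<?php
$source = "int a; /* first\n * comment */ int b; /* second */ int c;";

// Without "s", "." stops at newlines, so only the single-line comment is removed.
$withoutS = preg_replace('~/\*.*?\*/~u', '', $source);

// With "s", "." also matches newlines, so both comments are removed.
$withS = preg_replace('~/\*.*?\*/~us', '', $source);

echo $withoutS, "\n---\n", $withS, "\n";
```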
I don't think a regexp is a good fit here. What about writing a very small parser to remove the comments? I haven't done any PHP coding for a long time, so I will just try to give you the idea as a simple algorithm. I haven't tested this; it's only meant to convey the idea:
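buf = new String() // holds the source code without comments
pos = 0
while(string[pos] != EOF) {
    // a comment starts with the two characters "/*", not with any "/"
    if(string[pos] == '/' && string[pos + 1] == '*') {
        pos += 2; // skip the opening "/*"
        while(string[pos] != EOF)
        {
            if(string[pos] == '*' && string[pos + 1] == '/') {
                pos += 2; // skip the closing "*/"
                break;
            }
            pos++;
        }
    } else {
        buf.append(string[pos++]); // ordinary character, keep it
    }
}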
where:
string is the C source code
buf a dynamic allocated string which expands as needed
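The same idea could look roughly like this in PHP. This is only a sketch: stripBlockComments is a made-up name, it handles /* ... */ blocks only, and it does not try to recognise comment markers inside string literals. Working byte by byte is fine for UTF-8 input here, because the bytes of "/" and "*" never occur inside a multibyte sequence.

```php
<?php
// Hypothetical helper: remove /* ... */ blocks by scanning the string once.
function stripBlockComments(string $source): string
{
    $out = '';
    $len = strlen($source);
    $pos = 0;
    while ($pos < $len) {
        if (substr($source, $pos, 2) === '/*') {
            // Jump past the closing "*/"; an unterminated comment swallows the rest.
            $end = strpos($source, '*/', $pos + 2);
            $pos = ($end === false) ? $len : $end + 2;
        } else {
            $out .= $source[$pos++];
        }
    }
    return $out;
}

echo stripBlockComments("before\n/*\n * some / text *\n */\nafter");
```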
It is very hard to do this perfectly without ending up writing a full C parser.
Consider the following, for example:
// Not using /*-style comment here.
// This line has an odd number of " characters.
while (1) {
printf("Wheee!
(*\/*)
\\// - I'm an ant!
");
/* This is a multiline comment with a // in, and
// an odd number of " characters. */
}
So, from the above, we can see that our problems include:
multiline quote sequences should be ignored within doublequotes. Unless those doublequotes are part of a comment.
single-line comment sequences can be contained in double-quoted strings, and in multiline strings.
Here's one possibility to address some of those issues, but far from perfect.
// Remove "-strings, //-comments and /*block-comments*/, then restore "-strings.
// Based on regex by mauke of Efnet's #regex.
$file = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);
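A quick run of that pattern on made-up input shows the trick: a double-quoted string is captured in group 1 and put back by '\1', while both comment styles are replaced with nothing.

```php
<?php
$file = 'printf("a // b"); // trailing' . "\n" . '/* block */ return 0;';

// Same pattern and replacement as above: strings survive, comments vanish.
$file = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);

echo $file;
```

The // inside the string literal is preserved because the string alternative matches first at that position.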
try this:
$string = preg_replace("#\/\*\n?(.*)\*\/\n?#ms", "", $string);
Use # as regexp boundaries; change that u modifier with the right ones: m (PCRE_MULTILINE) and s (PCRE_DOTALL).
Reference: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
It is important to note that my regexp does not find more than one "comment block"... Use of "dot match all" is generally not a good idea.
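The limitation with more than one comment block comes from the greedy .*, which runs from the first /* to the last */. A small sketch with made-up input, comparing it with the lazy .*? form:

```php
<?php
$source = "/* one */ keep_me(); /* two */";

// Greedy: one match spanning both comments, so the code in between is lost.
$greedy = preg_replace('#/\*.*\*/#s', '', $source);

// Lazy: each comment is matched separately.
$lazy = preg_replace('#/\*.*?\*/#s', '', $source);

var_dump($greedy, $lazy);
```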

regular expression to find all starting brace has an ending brace

I need a regular expression to check that every opening brace has a matching closing brace.
Suppose
([(([[(([]))]]))]) -- this one will return true. but
[](()()[[]])[][[([]) --- this one will return false
for this, I've tried below:-
function check($str) {
$output = "";
$pattern = "/(\{[^}]*)([^{]*\})/im";
preg_match($pattern, $str, $match);
print_r($match[0]);
}
assert(check("[](()()[[]])[][[([])") === FALSE);
any help please...
The easiest way to do this (in my opinion) would be to implement a stack data structure and pass through your string. Essentially something like so:
Traverse the string left to right
If you find an opening parenthesis, add it to the stack
else (you find a closing parenthesis) make sure that the top most item in the stack is the same type of parenthesis (so make sure that if you found a }, the top most item in the stack is {). This should help scenarios where you have something like so: ({)}. If it matches, pop from the stack.
If you repeat the above operation throughout the entire string, you should end up with an empty stack. This would mean that you have managed to match all open parenthesis with a close parenthesis.
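The steps above can be sketched in PHP like this. The function name isBalanced is made up for illustration, and characters that are not brackets are simply ignored, which is an assumption rather than something the question specifies.

```php
<?php
// Hypothetical helper implementing the stack approach described above.
function isBalanced(string $str): bool
{
    $pairs = [')' => '(', ']' => '[', '}' => '{'];
    $stack = [];
    foreach (str_split($str) as $char) {
        if (in_array($char, $pairs, true)) {
            // Opening bracket: push it.
            $stack[] = $char;
        } elseif (isset($pairs[$char])) {
            // Closing bracket: the top of the stack must be the matching opener.
            if (array_pop($stack) !== $pairs[$char]) {
                return false;
            }
        }
    }
    // Anything left on the stack is an unclosed opener.
    return $stack === [];
}

var_dump(isBalanced('([(([[(([]))]]))])'));   // bool(true)
var_dump(isBalanced('[](()()[[]])[][[([])')); // bool(false)
```

Note that this version returns true for an empty string, since there is nothing left unmatched.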
You can use this:
$pattern = '~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~';
(?1) is a reference to the subpattern (not the matched content) in the capture group 1. Since I put it in the capture group 1 itself, I obtain a recursion.
I added anchors for the start ^ and the end $ of the string to be sure to check all the string.
Note: If you need to check a string that contains not only brackets, you can replace each (?1)* with:
(?>[^][}{)(]++|(?1))*
Note 2: If you want an empty string to return true, you must replace the last quantifier + with *
Working example:
function check($str, $display = false) {
    if (preg_match('~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~', $str, $match)) {
        if ($display) echo $match[0];
        return true;
    }
    elseif (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
        if ($display) echo "The recursion limit has been reached\n";
        return -1;
    } else return false;
}
assert(check(')))') === false);
check('{[[()()]]}', true);
var_dump(check('{[[()()]]}'));
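For the first note (strings that contain more than just brackets), the substitution could be assembled like this. It is a sketch only: checkWithText is a made-up name, and the pattern is built by replacing each (?1)* with the fragment given in the note.

```php
<?php
// Hypothetical variant of check() that tolerates other characters inside the brackets.
function checkWithText(string $str): bool
{
    // Either a run of non-bracket characters or a nested bracket group, repeated.
    $inner = '(?>[^][}{)(]++|(?1))*';
    $pattern = '~^(\(' . $inner . '\)|\[' . $inner . ']|{' . $inner . '})+$~';
    return preg_match($pattern, $str) === 1;
}

var_dump(checkWithText('{a[b(c)]d}')); // bool(true)
var_dump(checkWithText('{a[b(c]d}'));  // bool(false)
```

Text outside the outermost brackets is still rejected with this pattern, so 'f(x)' returns false.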

PHP - preg_replace not matching multiple occurrences

Trying to replace a string, but it seems to only match the first occurrence, and if I have another occurrence it doesn't match anything, so I think I need to add some sort of end delimiter?
My code:
$mappings = array(
'fname' => $prospect->forename,
'lname' => $prospect->surname,
'cname' => $prospect->company,
);
foreach($mappings as $key => $mapping) if(empty($mapping)) $mappings[$key] = '$2';
$match = '~{(.*)}(.*?){/.*}$~ise';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
// $source = 'Hello {fname}Default{/fname}';
$text = preg_replace($match, '$mappings["$1"]', $source);
So if I use the $source that's commented, it matches fine, but if I use the one currently in the code above where there's 2 matches, it doesn't match anything and I get an error:
Message: Undefined index: fname}Default{/fname} {lname
Filename: schedule.php(62) : regexp code
So am I right in saying I need to provide an end delimiter or something?
Thanks,
Christian
Apparently your regexp matches fname}Default{/fname} {lname instead of Default.
As I mentioned here use {(.*?)} instead of {(.*)}.
{ has special meaning in regexps so you should escape it \\{.
And I recommend using preg_replace_callback instead of the e modifier (you have more flow control and syntax highlighting, and it's impossible to force your program to execute malicious code).
Last mistake you're making is not checking whether the requested index exists. :)
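To make the greedy problem visible, here is a small sketch that runs the original pattern (minus the e modifier) and a lazy, unanchored variant against the same source string. The lazy pattern is only for illustration and does not yet check that the closing tag matches the opening one.

```php
<?php
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';

// Greedy first group: it runs on to the "}" of "{lname}".
preg_match('~{(.*)}(.*?){/.*}$~is', $source, $greedy);

// Lazy groups and no "$" anchor: each tag pair is matched on its own.
preg_match_all('~\{(.*?)\}(.*?)\{/.*?\}~is', $source, $lazy);

var_dump($greedy[1]); // the "index" from the error message
var_dump($lazy[1], $lazy[2]);
```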
My solution would be:
<?php
class A { // Of course with better class name :)
    public $mappings = array(
        'fname' => 'Tested'
    );

    public function callback( $match)
    {
        if( isset( $this->mappings[$match[1]])){
            return $this->mappings[$match[1]];
        }
        return $match[2];
    }
}

$a = new A();
$match = '~\\{([^}]+)\\}(.*?)\\{/\\1\\}~is';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
echo preg_replace_callback( $match, array($a, 'callback'), $source);
This results into:
[vyktor@grepfruit tmp]$ php stack.php
Hello Tested Last
Your regular expression is anchored to the end of the string, so your closing {/whatever} must be the last thing in your string. Also, since your opening and closing tags are simply .*, there's nothing in there to make sure they match up. What you want is to make sure that your closing tag matches your opening one - using a backreference like {(.+)}(.*?){/\1} will make sure they're the same.
I'm sure there's other gotchas in there - if you have control over the format of strings you're working with (IE - you're rolling your own templating language), I'd seriously consider moving to a simpler, easier to match format. Since you're not 'saving' the default values, having enclosing tags provides you with no added value but makes the parsing more complicated. Just using $VARNAME would work just as well and be easier to match (\$[A-Z]+), without involving back-references or having to explicitly state you're using non-greedy matching.
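A minimal sketch of that simpler $VARNAME format, assuming upper-case names and a made-up $mappings array. Leaving unknown names untouched is just one possible choice for the fallback.

```php
<?php
$mappings = ['FNAME' => 'Tested']; // hypothetical data
$source = 'Hello $FNAME $LNAME';   // single quotes, so PHP does not interpolate

$text = preg_replace_callback(
    '/\$([A-Z]+)/',
    function (array $m) use ($mappings) {
        // Known names are substituted; unknown ones are left as they were.
        return $mappings[$m[1]] ?? $m[0];
    },
    $source
);

echo $text;
```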

Regex to match specific string not enclosed by another, different specific string

I need a regex to match a string not enclosed by another different, specific string. For instance, in the following situation it would split the content into two groups: 1) The content before the second {Switch} and 2) The content after the second {Switch}. It wouldn't match the first {Switch} because it is enclosed by {my_string}'s. The string will always look like shown below (i.e. {my_string}any content here{/my_string})
Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
{Switch}
More content
So far I've gotten what is below which I know isn't very close at all:
(.*?)\{Switch\}(.*?)
I'm just not sure how to use the [^] (not operator) with a specific string versus different characters.
It really seems you're trying to use a regular expression to parse a grammar - something that regular expressions are really bad at doing. You might be better off writing a parser to break down your string into the tokens that build it, and then processing that tree.
Perhaps something like http://drupal.org/project/grammar_parser might help.
Try this simple function:
function find_content($doc) {
    $temp = $doc;
    preg_match_all('~{my_string}.*?{/my_string}~is', $temp, $x);
    $i = 0;
    while (isset($x[0][$i])) {
        $temp = str_replace($x[0][$i], "{REPL:$i}", $temp);
        $i++;
    }
    $res = explode('{Switch}', $temp);
    foreach ($res as &$part)
        foreach($x[0] as $id=>$content)
            $part = str_replace("{REPL:$id}", $content, $part);
    return $res;
}
Use it this way
$content_parts = find_content($doc); // $doc is your input document
print_r($content_parts);
Output (your example)
Array
(
[0] => Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
[1] =>
More content
)
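For comparison, a different technique can do the same split in one pattern: PCRE's backtracking verbs (*SKIP)(*F) discard a {my_string}...{/my_string} block whenever one is found, so only a {Switch} outside such a block is used as the split point. This is a sketch under the question's assumption that the blocks are always well formed; splitOnSwitch is a made-up name.

```php
<?php
// Hypothetical helper: split on {Switch}, ignoring any {Switch} inside {my_string}...{/my_string}.
function splitOnSwitch(string $doc): array
{
    // First alternative: match a whole block, then force a failure and skip past it.
    // Second alternative: a {Switch} that was not swallowed by a block.
    $pattern = '~\{my_string\}.*?\{/my_string\}(*SKIP)(*F)|\{Switch\}~s';
    return preg_split($pattern, $doc);
}

print_r(splitOnSwitch("A {my_string}x {Switch} y{/my_string} B {Switch} C"));
```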
You can try positive lookahead and lookbehind assertions (http://www.regular-expressions.info/lookaround.html)
It might look something like this:
$content = 'string of text before some random content switch text some more random content string of text after';
$before = preg_quote('string of text before', '/');
$switch = preg_quote('switch text', '/');
$after = preg_quote('string of text after', '/');
// Lazy groups with an optional "switch" part, so the two chunks land in separate groups
if( preg_match('/(?<=' . $before . ')(.*?)(?:' . $switch . '(.*?))?(?=' . $after . ')/', $content, $matches) ) {
    // $matches[1] == ' some random content '
    // $matches[2] == ' some more random content '
}
$regex = '~(?:(?!\{my_string\})(.*?))(\{Switch\})(?:(.*?)(?!\{my_string\}))~';
/* if "my_string" and "Switch" aren't wrapped by "{" and "}" just remove "\{" and "\}" */
$yourNewString = preg_replace($regex,"$1",$yourOriginalString);
This might work. Can't test it now, but I'll update later!
I don't know if this is what you're looking for, but to negate more than one character, the regex syntax is:
(?!yourString)
and it is called "negative lookahead assertion".
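A tiny illustration of a negative lookahead on its own, with made-up input: foo is matched only where it is not immediately followed by bar.

```php
<?php
// PREG_OFFSET_CAPTURE records where each match starts.
preg_match_all('/foo(?!bar)/', 'foobar foobaz foo', $m, PREG_OFFSET_CAPTURE);

// Collect just the offsets of the matches.
$offsets = array_column($m[0], 1);

print_r($offsets);
```

The first foo (offset 0) is skipped because bar follows it; the ones at offsets 7 and 14 are matched.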
/Edit:
This should work and return true:
$stringMatchesYourRulesBoolean = preg_match('~(.*?)('.$my_string.')(.*?)(?<!'.$my_string.') ?('.$switch.') ?(?!'.$my_string.')(.*?)('.$my_string.')(.*?)~',$yourString);
Have a look at PHP PEG. It is a little parser written in PHP. You can write your own grammar and parse it. It's going to be very simple in your case.
The grammar syntax and the way of parsing is all explained in the README.md
Extracts from the readme:
token* - Token is optionally repeated
token+ - Token is repeated at least one
token? - Token is optionally present
Tokens may be :
- bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
- literals, surrounded by `"` or `'` quote pairs. No escaping support is provided in literals.
- regexs, surrounded by `/` pairs.
- expressions - single words (match \w+)
Sample grammar: (file EqualRepeat.peg.inc)
class EqualRepeat extends Packrat {
/* Any number of a followed by the same number of b and the same number of c characters
* aabbcc - good
* aaabbbccc - good
* aabbc - bad
* aabbacc - bad
*/
/*Parser:Grammar1
A: "a" A? "b"
B: "b" B? "c"
T: !"b"
X: &(A !"b") "a"+ B !("a" | "b" | "c")
*/
}

Testing for a backslash in a string

So I'm writing a PHP script that will read in a CSS file, then put the comments and actual CSS in separate arrays. The script will then build a page with the CSS and comments all nicely formatted.
The basic logic for the script is this:
Read in a new line
If it starts with a forward slash or
ends with an opening bracket, set a
bool for CSS or comments to true
Add that line to the appropriate
element in the appropriate array
If the last character is a backslash
(end of a comment) or the first
character is a closing bracket (end
of a CSS tag), set necessary bool to
false
Rinse, repeat
If someone sees an error in that, feel free to point it out, but I think it should do what I want.
The tricky part is the last if statement, checking if the last character is a backslash. Right now I have:
if ($line{(strlen($line) - 3)} == "\\") {do stuff}
where $line is the last line read in from the file. Not entirely sure why I have to go back 3 characters, but I'm guessing it's because there's a newline at the end of each string when reading it in from a file. However, this if statement is never true, even though there are definitely lines which end with slashes. This
echo "<br />str - 3: " . $line{(strlen($line)-3)};
even returns a backslash, yet the if statement is never triggered.
That would be because $line{(strlen($line) - 3)} in your if statement is returning one backslash, while the if statement is looking for two. Try using
substr($line, -2)
instead. (You might have to change it to -3. The reason for this is because the newline character might be included at the end of the string.)
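One way to avoid guessing between -2 and -3 is to strip the line ending before looking at the last characters. The sketch below assumes CSS comments close with the two characters */ and that lines come from fgets(); endsComment is a made-up name.

```php
<?php
// Hypothetical helper: does this line (as returned by fgets()) end a CSS comment?
function endsComment(string $line): bool
{
    // fgets() keeps the line ending, so drop "\n" / "\r\n" and trailing blanks first.
    $trimmed = rtrim($line);
    return substr($trimmed, -2) === '*/';
}

var_dump(endsComment("end of comment */\r\n")); // bool(true)
var_dump(endsComment(".foo {\n"));              // bool(false)
```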
@mcritelli: CSS comments look like /* comment */ though, so just searching for a backslash won't tell you if it's starting or ending the comment. Here's a very basic script I tested which loops through a 'line' and can do something at the beginning and end of a comment --
<?php
$line = "/* test rule */";
$line .= ".test1 { ";
$line .= " text-decoration: none; ";
$line .= "}/* end of test rule */";
// Stop one character early so the two-character window never runs past the end
for ($i = 0; $i < strlen($line) - 1; $i++)
{
    $pair = substr($line, $i, 2);
    if ($pair == "/*")
    {
        // start of a comment, do something
    }
    elseif ($pair == "*/")
    {
        // end of a comment, do something
    }
}
?>
