replace multiple spaces, tabs and newlines into one space except commented text - php

I need to replace multiple spaces, tabs and newlines with a single space, except inside commented text, in my HTML.
For example the following code:
<br/> <br>
<!--
this is a comment
-->
<br/> <br/>
should turn into
<br/><br><!--
this is a comment
--><br/><br/>
Any ideas?

The new solution
After thinking a bit, I came up with the following solution with pure regex. Note that this solution will delete the newlines/tabs/multi-spaces instead of replacing them:
$new_string = preg_replace('#(?(?!<!--.*?-->)(?: {2,}|[\r\n\t]+)|(<!--.*?-->))#s', '$1', $string);
echo $new_string;
Explanation
(? # If
(?!<!--.*?-->) # There is no comment
(?: {2,}|[\r\n\t]+) # Then match 2 spaces or more, or newlines or tabs
| # Else
(<!--.*?-->) # Match and group it (group #1)
) # End if
So basically, when the current position is not the start of a comment, the regex tries to match spaces/tabs/newlines. If it finds them, group 1 is not set, so the match is replaced with an empty string (which deletes the whitespace). If there is a comment, it is captured in group 1 and replaced by itself (lol).
The old solution
I came up with a new strategy; this code requires PHP 5.3+:
$new_string = preg_replace_callback('#(?(?!<!--).*?(?=<!--|$)|(<!--.*?-->))#s', function($m){
if(!isset($m[1])){ // If group 1 does not exist (the comment)
return preg_replace('#\s+#s', ' ', $m[0]); // Then replace with 1 space
}
return $m[0]; // Else return the matched string
}, $string);
echo $new_string; // Output
Explaining the regex:
(? # If
(?!<!--) # Lookahead if there is no <!--
.*? # Then match anything (ungreedy) until ...
(?=<!--|$) # Lookahead, check for <!-- or end of string
| # Or
(<!--.*?-->) # Match and group a comment, this will make for us a group #1
)
# The s modifier is to match newlines with . (dot)
Note: What you are asking for and the expected output you provided contradict each other a bit. Anyway, if you want to remove the whitespace instead of replacing it with one space, change '#\s+#s', ' ', $m[0] to '#\s+#s', '', $m[0].

It's much simpler to do this in several runs (as is done for instance in php markdown).
Step1: preg_replace_callback() all comments with something unique while keeping their original values in a keyed array -- ex: array('comment_placeholder:' . md5('comment') => 'comment', ...)
Step2: preg_replace() white spaces as needed.
Step3: str_replace() comments back where they originally were using the keyed array.
The approach you're leaning towards (splitting the string and only processing the non-comment parts) works fine too.
There almost certainly is a means to do this with pure regex, using ugly look-behinds, but not really recommended: the regex might yield backtracking related errors, and the comment replacement step allows you to process things further if needed without worrying about the comments themselves.
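The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code used by PHP Markdown; the function name and the exact placeholder format are assumptions made for this example, and it assumes the placeholder text never occurs in the input itself.

```php
<?php
// Hypothetical helper: collapse whitespace everywhere except inside HTML comments.
function collapse_whitespace_outside_comments(string $html): string
{
    $comments = [];

    // Step 1: swap every comment for a unique, whitespace-free placeholder.
    $masked = preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m) use (&$comments): string {
            // Identical comments share one key, which is harmless here.
            $key = 'comment_placeholder:' . md5($m[0]);
            $comments[$key] = $m[0];
            return $key;
        },
        $html
    );

    // Step 2: collapse whitespace; the comments are safely out of the way.
    $masked = preg_replace('/\s+/', ' ', $masked);

    // Step 3: put the original comments back.
    return str_replace(array_keys($comments), array_values($comments), $masked);
}

echo collapse_whitespace_outside_comments(
    "<br/>  <br>\n<!--\nthis  is a comment\n-->\n<br/>\t<br/>"
);
```

Because the comments are masked before step 2, any further processing can be added there without worrying about touching comment text.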

I’d do the following:
split the input into comment and non-comment parts
do replacement on the non-comment parts
put everything back together
Example:
$parts = preg_split('/(<!--(?:(?!-->).)*-->)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => &$part) {
if ($i % 2 === 0) {
// non-comment part
$part = preg_replace('/\s+/', ' ', $part);
} else {
// comment part
}
}
$output = implode('', $parts);

You can use this:
$pattern = '~\s*+(<br[^>]*>|<!--(?>[^-]++|-(?!->))*-->)\s*+~';
$replacement = '$1';
$result = preg_replace($pattern, $replacement, $subject);
This pattern captures br tags and comments and matches any whitespace around them. Each match is then replaced by the capture group alone.
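As a quick check, here is that pattern applied to the sample from the question. The variable names are only for illustration; note that the pattern only knows about br tags and comments, so whitespace around other tags would be left alone.

```php
<?php
$pattern = '~\s*+(<br[^>]*>|<!--(?>[^-]++|-(?!->))*-->)\s*+~';
$subject = "<br/> <br>\n<!--\nthis is a comment\n-->\n<br/> <br/>";

// Whitespace around each br tag and comment is dropped; the comment body is untouched.
$result = preg_replace($pattern, '$1', $subject);
echo $result;
```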

Related

How to replace all occurrences of a character except the first one in PHP using a regular expression?

Given an address stored as a single string with newlines delimiting its components like:
1 Street\nCity\nST\n12345
The goal would be to replace all newline characters except the first one with spaces in order to present it like:
1 Street
City ST 12345
I have tried methods like:
[$street, $rest] = explode("\n", $input, 2);
$output = "$street\n" . preg_replace('/\n+/', ' ', $rest);
I have been trying to achieve the same result using a one liner with a regular expression, but could not figure out how.
I would suggest not solving this with complicated regex but keeping it simple like below. You can split the string with a \n, pop out the first split and implode the rest with a space.
<?php
$input = explode("\n","1 Street\nCity\nST\n12345");
$input = array_shift($input) . PHP_EOL . implode(" ", $input);
echo $input;
You could use a regex trick here by reversing the string, and then replacing every occurrence of \n provided that we can lookahead and find at least one other \n:
$input = "1 Street\nCity\nST\n12345";
$output = strrev(preg_replace("/\n(?=.*\n)/", " ", strrev($input)));
echo $output;
This prints:
1 Street
City ST 12345
You can use a lookbehind pattern to ensure that the matching line is preceded with a newline character. Capture the line but not the trailing newline character and replace it with the same line but with a trailing space:
preg_replace('/(?<=\n)(.*)\n/', '$1 ', $input)
Demo: https://onlinephp.io/c/5bd6d
You can use an alternation pattern that matches either the first two lines or a newline character, capture the first two lines without the trailing newline character, and replace the match with what's captured and a space:
preg_replace('/(^.*\n.*)\n|\n/', '$1 ', $input)
Demo: https://onlinephp.io/c/2fb2f
Here is another method that needs no regex; it works as long as the input always has exactly four lines:
$string = explode("\n", "1 Street\nCity\nST\n12345");
echo $string[0] . "<br>"; // use "\n" instead of "<br>" for plain-text output
echo $string[1] . " " . $string[2] . " " . $string[3];

Replacing space indentation with tabs

I am looking to replace each group of 4 spaces at the start of a line with a tab, but leave everything after the first text character untouched.
My initial regex of / {4}+/ (or /[ ]{4}+/ for the sake of readability) worked, but it obviously replaces every run of four spaces, wherever it occurs.
$string = '        this is some text -->    <-- are these tabs or spaces?';
$string .= "\n    and this is another line singly indented";
// 4 spaces, a tab, then 4 spaces
$string .= "\n    \t    and this is third line with tabs and spaces";
$pattern = '/[ ]{4}+/';
$replace = "\t";
$new_str = preg_replace( $pattern , $replace , $string );
echo '<pre>'. $new_str .'</pre>';
This is an example of what I had originally. With this regex the conversion works perfectly, except that the 4 spaces between the --> and <-- are also replaced by a tab. I want the text after the indentation to stay unaltered.
My best effort so far has been (^) start of line, ([ ]{4}+) the pattern, (.*?[;\s]*) anything up to the first non-space \s
$pattern = '/^[ ]{4}+.*?[;\s]*/m';
which almost works, except that the indentation is now lost. Can anybody help me understand what I am missing here?
[edit]
For clarity, what I am trying to do is change the indentation at the start of the text from spaces to tabs. I really don't understand why this is confusing to anybody.
To be as clear as possible (using the value of $string above):
First line has 8 spaces at the start, some text with 4 spaces in the middle.
I am looking for 2 tabs at the start and no change to spaces in the text.
Second line has 4 spaces at the start.
I am looking to have only 1 tab at the start of the line.
Third line has 4 spaces, 1 tab and 4 spaces.
I am looking to have 3 tabs at the start of the line.
If you're not a regular expression guru, this will probably make most sense to you and be easier to adapt to similar use cases (this is not the most efficient code, but it's the most "readable" imho):
// replace all regex matches with the result of applying
// a given anonymous function to a $matches array
function spaces2tabs($s_with_spaces) {
    // before anything else, replace existing tabs with 4 spaces
    // to permit homogeneous translation
    $s_with_spaces = str_replace("\t", '    ', $s_with_spaces);
    return preg_replace_callback(
        '/^([ ]+)/m',
        function ($ms) {
            // $ms[0] - is full match
            // $ms[1] - is first (...) group from regex
            // ...here you can add extra logic to handle
            // leading spaces not multiple of 4
            return str_repeat("\t", (int) floor(strlen($ms[1]) / 4));
        },
        $s_with_spaces
    );
}
// example (using dots to make spaces visible for explaining)
$s_with_spaces = <<<EOS
no indent
....4 spaces indent
........8 spaces indent
EOS;
$s_with_spaces = str_replace('.', ' ', $s_with_spaces);
$s_with_tabs = spaces2tabs($s_with_spaces);
If you want a performant but hard to understand or tweak one-liner instead, the solutions in the comments from the regex-gurus above should work :)
P.S. In general preg_replace_callback (and its equivalent in JavaScript) is a great "swiss army knife" of structured text processing. I have, shamefully, even written parsers for mini-languages using it ;)
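One possible way to fill in the "extra logic" mentioned in the callback's comments is sketched below. It keeps any leftover spaces (when the indentation is not a multiple of 4) after the tabs instead of dropping them. The function name and the $width parameter are hypothetical, and a tab is simply counted as $width spaces regardless of its column, as in the answer above.

```php
<?php
// Hypothetical helper: convert leading indentation to tabs, keeping the remainder as spaces.
function indent_to_tabs(string $text, int $width = 4): string
{
    return preg_replace_callback(
        '/^[ \t]+/m', // leading spaces and tabs of every line
        function (array $m) use ($width): string {
            // Normalise existing tabs to spaces so the run can be measured.
            $spaces = str_replace("\t", str_repeat(' ', $width), $m[0]);
            $len = strlen($spaces);
            // Full groups become tabs; what is left over stays as spaces.
            return str_repeat("\t", intdiv($len, $width)) . str_repeat(' ', $len % $width);
        },
        $text
    );
}

$in = "        a    b\n    c\n    \t    d\n      e";
echo indent_to_tabs($in);
```

Spaces inside the text (such as the four between "a" and "b") are never touched, because the pattern is anchored to the start of each line.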
The way I would do it is this.
$str = "...";
$pattern = '/\G {4}/'; // \G anchors each match to the end of the previous one, so only leading groups of 4 spaces match
$replace = "\t";
$multiStr = explode("\n", $str);
foreach ($multiStr as &$line) {
    $line = str_replace("\t", "    ", $line);
    $line = preg_replace($pattern, $replace, $line);
}
unset($line); // break the reference left over from the loop
$results = implode("\n", $multiStr);
Please check the code thoroughly, as I wrote it quickly and from intuition.
I can't run a PHP server to test it :( but it should help you resolve this problem.

Regex rules in an array

Maybe this cannot be solved the way I want, but maybe you guys can help me.
I have a lot of malformed words in the names of my products.
Some of them have a leading ( and a trailing ), or only one of these; the same goes for the / and " signs.
What I do is explode the product name by spaces and examine the words.
I want to replace these characters with nothing. But a hard drive could be named 40GB ATA 3.5" hard drive. I need to process every word, but I cannot use the same method for 3.5" as for () or //, because the quote in 3.5" is valid.
So I only need to remove the quotes when there is one at the start of the string AND one at the end of the string.
$cases = [
'(testone)',
'(testtwo',
'testthree)',
'/otherone/',
'/othertwo',
'otherthree/',
'"anotherone',
'anothertwo"',
'"anotherthree"',
];
$patterns = [
'/^\(/',
'/\)$/',
'~^/~',
'~/$~',
//Here is what I cannot figure out: how to add the rule for `"`
];
$result = preg_replace($patterns, '', $cases);
This works well, but can it be done in one preg_replace()? If yes, can somebody help me out with the pattern(s) for the quotes?
The result for the quotes should be this:
'"anotherone', //no quote at the end, so leave the leading one
'anothertwo"', //no quote at the start, so leave the trailing one
'anotherthree', //there are quotes at the start and end, so remove them.
You may use another approach: rather than define an array of patterns, use one single alternation based regex:
preg_replace('~^[(/]|[/)]$|^"(.*)"$~s', '$1', $s)
Details:
^[(/] - a literal ( or / at the start of the string
| - or
[/)]$ - a literal ) or / at the end of the string
| - or
^"(.*)"$ - a " at the start of the string, then any 0+ characters (due to /s option, the . matches a linebreak sequence, too) that are captured into Group 1, and " at the end of the string.
The replacement pattern is $1 that is empty when the first 2 alternatives are matched, and contains Group 1 value if the 3rd alternative is matched.
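To see how the single pattern behaves on the cases from the question, here is a small check. The variable names are just for illustration; preg_replace accepts an array as the subject and returns an array of results.

```php
<?php
$cases = [
    '(testone)', '(testtwo', 'testthree)',
    '/otherone/', '/othertwo', 'otherthree/',
    '"anotherone', 'anothertwo"', '"anotherthree"',
];

// One pattern: strip a leading ( or /, a trailing ) or /,
// and strip quotes only when they wrap the whole string.
$result = preg_replace('~^[(/]|[/)]$|^"(.*)"$~s', '$1', $cases);

print_r($result);
```

A lone quote, as in 3.5", is left alone because the third alternative requires a quote at both ends.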
Note: In case you need to replace until no match is found, use a preg_match with preg_replace together (see demo):
$s = '"/some text/"';
$re = '~^[(/]|[/)]$|^"(.*)"$~s';
$tmp = '';
while (preg_match($re, $s) && $tmp != $s) {
$tmp = $s;
$s = preg_replace($re, '$1', $s);
}
echo $s;
This works:
preg_replace(['~^[/(]?(.+?)[/)]?$~', '~^"(.+)"$~'], '$1', $string)

How to manipulate a string so I can make implicit multiplication explicit in a math expression?

I want to manipulate a string like "...4+3(4-2)-...." to become "...4+3*(4-2)-....", but of course it should recognize any number, d, followed by a '(' and change it to 'd*('. I also want to change ')(' to ')*(' at the same time if possible. It would be nice if support for constants like pi or e could be added too.
For now, I just do it this stupid way:
private function make_implicit_multiplication_explicit($string)
{
$i=1;
if(strlen($string)>1)
{
while(($i=strpos($string,"(",$i))!==false)
{
if(strpos("0123456789",substr($string,$i-1,1))!==false) // !== false, because strpos() returns 0 for the digit "0"
{
$string=substr_replace($string,"*(",$i,1);
$i++;
}
$i++;
}
$string=str_replace(")(",")*(",$string);
}
return $string;
}
But I believe this could be done much more nicely with preg_replace or some other regex function? Those manuals are really cumbersome to grasp, though, I think.
Let's start by what you are looking for:
either of the following: ((a|b) will match either a or b)
any number, \d
the character ): \)
followed by (: \(
Which creates this pattern: (\d|\))\(. But since you want to modify the string and keep both parts, you can group the \( which results in (\() making it worse to read but better to handle.
Now everything left is to tell how to rearrange, which is simple: \\1*\\2, leaving you with code like this
$regex = "/(\d|\))(\()/";
$replace = "\\1*\\2";
$new = preg_replace($regex, $replace, $test);
To see that the pattern actually matches all cases, see this example.
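The question also asks about constants such as pi or e. One way the same pattern could be extended is sketched below; the function name is hypothetical, and treating only the standalone words "pi" and "e" as constants is an assumption for this example (so something like 2pi( is not handled).

```php
<?php
// Hypothetical helper: insert "*" between a left operand and an opening parenthesis.
function make_multiplication_explicit(string $expr): string
{
    // Group 1: a digit, a closing parenthesis, or the standalone word "pi" or "e".
    // Group 2: the opening parenthesis that follows it.
    return preg_replace('/(\d|\)|\bpi|\be)(\()/', '$1*$2', $expr);
}

echo make_multiplication_explicit('4+3(4-2)-(5)(3)'), "\n";
echo make_multiplication_explicit('pi(2+1)+e(3)(4)+exp(2)'), "\n";
```

The \b word boundary keeps function names such as exp( from being mistaken for the constant e.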
To recognize any number followed by a ( OR a combination of a )( and place an asterisk in between them, you can use a combination of lookaround assertions.
echo preg_replace("/
(?<=[0-9)]) # look behind to see if there is: '0' to '9', ')'
(?=\() # look ahead to see if there is: '('
/x", '*', '(4+3(4-2)-3)(2+3)');
The positive lookbehind asserts that what precedes is either a digit or a right parenthesis, while the positive lookahead asserts that what follows is a left parenthesis.
Another option is to use the \K escape sequence in place of the lookbehind. \K resets the starting point of the reported match: any previously consumed characters are no longer included (it throws away everything matched up to that point).
echo preg_replace("/
[0-9)] # any character of: '0' to '9', ')'
\K # resets the starting point of the reported match
(?=\() # look ahead to see if there is: '('
/x", '*', '(4+3(4-2)-3)(2+3)');
Your php code should be,
<?php
$mystring = "4+3(4-2)-(5)(3)";
$regex = '~\d+\K\(~';
$replacement = "*(";
$str = preg_replace($regex, $replacement, $mystring);
$regex1 = '~\)\K\(~';
$replacement1 = "*(";
echo preg_replace($regex1, $replacement1, $str); //=> 4+3*(4-2)-(5)*(3)
?>
Explanation:
~\d+\K\(~ this would match the one or more numbers followed by a (. Because of \K it excludes the \d+
Again it replaces the matched part with *( which in turn produces 3*( and the result was stored in another variable.
\)\K\( Matches )( and excludes the first ). This would be replaced by *( which in turn produces )*(
Silly method :^ )
$value = '4+3(4-2)(1+2)';
$search = ['1(', '2(', '3(', '4(', '5(', '6(', '7(', '8(', '9(', '0(', ')('];
$replace = ['1*(', '2*(', '3*(', '4*(', '5*(', '6*(', '7*(', '8*(', '9*(', '0*(', ')*('];
echo str_replace($search, $replace, $value);

Condition inside regex pattern

I would like to remove any extra whitespace from my code; I'm parsing a docblock. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.
Example, I use this to remove extra whitespace:
$string = preg_replace('/[ ]{2,}/', '', $string);
But I would like to keep whitespace within <code></code>
This code/string:
This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1'
));
</code>
Should be transformed into:
This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1'
));
</code>
How can I do this?
You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy groups and replacing each group with itself. In your case you only need one:
$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);
How does it work?
(<code>.*?</code>) - Match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but can be complicated if needed.
^ + - match and remove spaces at the beginnings of lines.
[ ]+$ - match and remove spaces at the ends of lines (written without the brackets in the pattern above).
( ) + - match two or more spaces in the middle of lines, and capture the first one into the second group, $2.
The replace string, $1$2 will keep <code> blocks and the first space if captured, and remove anything else it matches.
Things to remember:
If $1 or $2 didn't capture, it will be replaced with an empty string.
Alternations (a|b|c) work from left to right - when it makes a match it is satisfied, and doesn't try matching again. That is why ^ +| +$ must be before ( ) +.
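A small run of that pattern on a made-up string shows the skip-group idea in action. The input below is only an illustration, and the pattern is written in single quotes here so that PHP does not try to interpolate anything.

```php
<?php
$str = "  This  is   text  \n<code>\n    keep   this\n</code>";

// Group 1 re-inserts <code> blocks unchanged; group 2 keeps one space
// out of every run in the middle of a line; everything else matched is dropped.
$clean = preg_replace('~(<code>.*?</code>)|^ +| +$|( ) +~smi', '$1$2', $str);

echo $clean;
```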
Working example: http://ideone.com/HxbaV
When parsing markup with PHP and regex, the preg_replace_callback() function combined with the (?R), (?1), (?2)... recursive expressions, make for a very powerful tool indeed. The following script handles your test data quite nicely:
<?php // test.php 20110312_2200
function clean_non_code(&$text) {
$re = '%
# Match and capture either CODE into $1 or non-CODE into $2.
( # $1: CODE section (never empty).
<code[^>]*> # CODE opening tag
(?R)+ # CODE contents w/nested CODE tags.
</code\s*> # CODE closing tag
) # End $1: CODE section.
| # Or...
( # $2: Non-CODE section (may be empty).
[^<]*+ # Zero or more non-< {normal*}
(?: # Begin {(special normal*)*}
(?!</?code\b) # If not a code open or close tag,
< # match non-code < {special}
[^<]*+ # More {normal*}
)*+ # End {(special normal*)*}
) # End $2: Non-CODE section
%ix';
$text = preg_replace_callback($re, '_my_callback', $text);
if ($text === null) exit("PREG Error!\nTarget string too big.");
return $text;
}
// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
if ($matches[1]) {
return $matches[1];
}
// Collapse multiple tabs and spaces into a single space.
$matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
// Trim each line
$matches[2] = preg_replace('/^ /m', '', $matches[2]);
$matches[2] = preg_replace('/ $/m', '', $matches[2]);
return $matches[2];
}
// Create some test data.
$data = "This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1',
'key42' => '<code>
Pay no attention to this. It has been proven over and
over again that it is <code> unpossible </code>
to parse nested stuff with regex! </code>'
));
</code>";
// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");
// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);
// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;
// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
$size, $time, ($size / $time)/1024.);
?>
Here are the script benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)
'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'
Note that this solution does not handle CODE tags having <> angle brackets in their attributes (probably a very rare edge case), but the regex could be easily modified to handle these as well. Note also that the maximum string length will depend upon the composition of the string content (i.e. Big CODE blocks reduce the maximum input string length.)
p.s. Note to SO staff. The <!-- language: lang-none --> doesn't work.
What you will want is to parse it using some form of HTML parser.
For example, you could iterate through all elements ignoring code elements with DOMDocument and strip whitespace from their text nodes.
Alternatively, read the file with file() so you have an array of lines, and step through each line, stripping whitespace if it is outside of a code element.
To determine if you are in a code element, look for the starting tag <code> and set a flag which says in code element mode. You can then skip these lines. Reset the flag when you encounter </code>. You could take into account nesting by having its state stored as an integer, even though nested code elements are not the smartest idea (why would you nest them)?
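The line-by-line idea with a nesting counter might look something like this. It is only a sketch under simplifying assumptions: the function name is made up, tags are counted per line, and any line that contains a code tag (opening or closing) is left untouched rather than being split at the tag.

```php
<?php
// Hypothetical helper: collapse whitespace on lines that are outside <code> elements.
function collapse_outside_code(string $text): string
{
    $depth = 0; // how many <code> elements are currently open
    $out = [];
    foreach (explode("\n", $text) as $line) {
        $opens  = preg_match_all('~<code\b[^>]*>~i', $line);
        $closes = preg_match_all('~</code\s*>~i', $line);
        if ($depth === 0 && $opens === 0) {
            // Outside any code element: collapse runs of spaces/tabs and trim the line.
            $line = trim(preg_replace('/[ \t]{2,}/', ' ', $line));
        }
        // Update the nesting level after handling the line.
        $depth = max(0, $depth + $opens - $closes);
        $out[] = $line;
    }
    return implode("\n", $out);
}

echo collapse_outside_code("  a   b  \n<code>\n    x   y\n</code>\n  c    d");
```

Storing the state as an integer rather than a boolean is what lets this cope with nested code elements.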
Mario came up with this before me.
Parsing HTML with regexes is a bad idea.
RegEx match open tags except XHTML self-contained tags
Use something like Zend_DOM to parse HTML and extract parts of it you need to replace spaces in.
