I would like to remove any extra whitespace from my code; I'm parsing a docblock. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.
For example, I use this to remove extra whitespace:
$string = preg_replace('/[ ]{2,}/', '', $string);
But I would like to keep whitespace within <code></code>
This code/string:
This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1'
));
</code>
Should be transformed into:
This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1'
));
</code>
How can I do this?
You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy groups and replacing each group with itself. In your case you only need one:
$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);
How does it work?
(<code>.*?</code>) - Match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but can be complicated if needed.
^ + - match and remove spaces on beginnings of lines.
[ ]+$ - match and remove spaces on ends of lines.
( ) + match two or more spaces in the middle of lines, and capture the first one to the second group, $2.
The replace string, $1$2 will keep <code> blocks and the first space if captured, and remove anything else it matches.
Things to remember:
If $1 or $2 didn't capture, it will be replaced with an empty string.
Alternations (a|b|c) work from left to right - when it makes a match it is satisfied, and doesn't try matching again. That is why ^ +| +$ must be before ( ) +.
Working example: http://ideone.com/HxbaV
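As a small illustration of the skip-group idea described above, here is a sketch that runs the same pattern on a made-up input. The sample string and the variable names are assumptions for demonstration only; they are not the asker's real data.

```php
<?php
// Hypothetical sample: extra spaces outside <code>, indentation inside it.
$str = "This  is   some text  \n"
     . "   This is also some text\n"
     . "<code>\n"
     . "    User::setup(array(\n"
     . "        'key1'  =>  'value1'\n"
     . "    ));\n"
     . "</code>";

// Group 1 keeps <code> blocks untouched; group 2 keeps one space
// out of every run of spaces in the middle of a line.
$out = preg_replace('~(<code>.*?</code>)|^ +| +$|( ) +~smi', '$1$2', $str);

echo $out, "\n";
```

The text lines come out trimmed and collapsed, while everything between <code> and </code> keeps its original spacing.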
When parsing markup with PHP and regex, the preg_replace_callback() function combined with the (?R), (?1), (?2)... recursive expressions, make for a very powerful tool indeed. The following script handles your test data quite nicely:
<?php // test.php 20110312_2200
function clean_non_code(&$text) {
$re = '%
# Match and capture either CODE into $1 or non-CODE into $2.
( # $1: CODE section (never empty).
<code[^>]*> # CODE opening tag
(?R)+ # CODE contents w/nested CODE tags.
</code\s*> # CODE closing tag
) # End $1: CODE section.
| # Or...
( # $2: Non-CODE section (may be empty).
[^<]*+ # Zero or more non-< {normal*}
(?: # Begin {(special normal*)*}
(?!</?code\b) # If not a code open or close tag,
< # match non-code < {special}
[^<]*+ # More {normal*}
)*+ # End {(special normal*)*}
) # End $2: Non-CODE section
%ix';
$text = preg_replace_callback($re, '_my_callback', $text);
if ($text === null) exit("PREG Error!\nTarget string too big.");
return $text;
}
// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
if ($matches[1]) {
return $matches[1];
}
// Collapse multiple tabs and spaces into a single space.
$matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
// Trim each line
$matches[2] = preg_replace('/^ /m', '', $matches[2]);
$matches[2] = preg_replace('/ $/m', '', $matches[2]);
return $matches[2];
}
// Create some test data.
$data = "This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1',
'key42' => '<code>
Pay no attention to this. It has been proven over and
over again that it is <code> unpossible </code>
to parse nested stuff with regex! </code>'
));
</code>";
// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");
// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);
// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;
// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
$size, $time, ($size / $time)/1024.);
?>
Here are the script benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)
'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'
Note that this solution does not handle CODE tags having <> angle brackets in their attributes (probably a very rare edge case), but the regex could be easily modified to handle these as well. Note also that the maximum string length will depend upon the composition of the string content (i.e. Big CODE blocks reduce the maximum input string length.)
What you will want is to parse it using some form of HTML parser.
For example, you could iterate through all elements ignoring code elements with DOMDocument and strip whitespace from their text nodes.
Alternatively, read the file with file() so you have an array of lines, and step through each line stripping whitespace if outside of a code element.
To determine if you are in a code element, look for the starting tag <code> and set a flag which says in code element mode. You can then skip these lines. Reset the flag when you encounter </code>. You could take into account nesting by having its state stored as an integer, even though nested code elements are not the smartest idea (why would you nest them)?
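The flag-based approach above might look roughly like the following. This is only a sketch under some assumptions: the function name stripOutsideCode is made up, the tags are lowercase and sit on their own lines, and the lines carry no trailing newline (for example, read with file() and FILE_IGNORE_NEW_LINES).

```php
<?php
// Hypothetical helper: strips whitespace only on lines outside <code> elements.
function stripOutsideCode(array $lines): array
{
    $depth = 0;   // nesting depth of <code> elements (an integer, not a boolean flag)
    $out   = [];

    foreach ($lines as $line) {
        $opens  = substr_count($line, '<code>');
        $closes = substr_count($line, '</code>');

        if ($depth === 0 && $opens === 0) {
            // Outside any code element: trim the line and collapse runs of spaces/tabs.
            $line = preg_replace('/[ \t]{2,}/', ' ', trim($line));
        }

        // Update the depth after handling the line, never dropping below zero.
        $depth = max(0, $depth + $opens - $closes);
        $out[] = $line;
    }

    return $out;
}

$result = stripOutsideCode([
    '  This  is   text ',
    '<code>',
    '    a  =>  b',
    '</code>',
    ' after  code',
]);
print_r($result);
```

Lines that contain a tag are left alone entirely, which keeps the sketch simple but means text sharing a line with <code> is not cleaned.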
Mario came up with this before me.
Parsing HTML with regexes is a bad idea.
RegEx match open tags except XHTML self-contained tags
Use something like Zend_DOM to parse HTML and extract parts of it you need to replace spaces in.
Thanks for your help.
My goal is to use preg_replace with a pattern to remove some very simple strings.
Using only preg_replace, on this string or others, I need to remove ANY content between <tag and the next > symbol. The pattern is simple:
$x = '#<\w+(\s+[^>]*)>#is';
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
preg_match_all($x, $s, $Q);
print_r($Q[1]);
[1] => Array
(
[0] => class="td1"
[1] => class="td2"
)
This works great!
Now I try to remove those strings using the same pattern:
$new_string = '';
$Q = preg_replace($x, "\\1$new_string", $s);
print_r($Q);
The result is completely different.
What is wrong with my use of preg_replace?
Using only preg_replace(), how can I remove these strings?
(We could use foreach(...) to remove each string, but where is the error in my code?)
The result I expect for this input:
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
is this output:
$Q = 'DATA<td>111</td><td>222</td>DATA';
Let's break down your RegEx, #<\w+(\s+[^>]*)>#is, and see if that helps.
# // Start delimiter
< // Literal `<` character
\w+ // One or more word-characters, a-z, A-Z, 0-9 or _
( // Start capturing group
\s+ // One or more spaces
[^>]* // Zero or more characters that are not the literal `>`
) // End capturing group
> // Literal `>` character
# // End delimiter
is // Ignore case and `.` matches all characters including newline
Given the input DATA<td class="td1">DATA this matches <td class="td1"> and captures class="td1". The difference between match and capture is very important.
When you use preg_match you'll see the entire match at index 0, and any subsequent captures at incrementing indexes.
When you use preg_replace the entire match will be replaced. You can use the captures, if you so choose, but you are replacing the match.
I'm going to say that again: whatever you pass as the replacement string will replace the entirety of the found match. If you say $1 or \\1, you are saying replace the entire match with just the capture.
Going back to the sample after the breakdown, using $1 is the equivalent of calling:
str_replace('<td class="td1">', ' class="td1"', $string);
which you can see here: https://3v4l.org/ZkPFb
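To make the match-versus-capture distinction concrete, here is a short sketch using the pattern and sample string from the breakdown above. The variable names $m and $replaced are just illustrative.

```php
<?php
$pattern = '#<\w+(\s+[^>]*)>#is';
$string  = 'DATA<td class="td1">DATA';

// preg_match: index 0 holds the whole match, index 1 the first capture.
preg_match($pattern, $string, $m);
var_dump($m[0]); // the full tag, including < and >
var_dump($m[1]); // only the attributes, with the leading space

// preg_replace: the WHOLE match is replaced by the replacement string,
// so using '$1' leaves only the captured attributes behind.
$replaced = preg_replace($pattern, '$1', $string);
var_dump($replaced);
```

Note that the capture starts with a space, because \s+ sits inside the parentheses.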
To your question "how to change [0] by $new_string", you are doing it correctly, it is your RegEx itself that is wrong. To do what you are trying to do, your pattern must capture the tag itself so that you can say "replace the HTML tag with all of the attributes with just the tag".
As one of my comments noted, this is where you'd invert the capturing. You aren't interested in capturing the attributes; you are throwing those away. Instead, you are interested in capturing the tag itself:
$string = 'DATA<td class="td1">DATA';
$pattern = '#<(\w+)\s+[^>]*>#is';
echo preg_replace($pattern, '<$1>', $string);
Demo: https://3v4l.org/oIW7d
Maybe this cannot be solved the way I want, but perhaps you can help me.
I have a lot of malformed words in my product names.
Some of them have a leading ( and a trailing ), or only one of these; the same goes for the / and " signs.
What I do is explode the product name by spaces and examine each word.
I want to replace these characters with nothing. But a hard drive could be named 40GB ATA 3.5" hard drive. I need to process every word, but I cannot use the same method for 3.5" as for () or //, because 3.5" is valid.
So I only need to remove the quotes when there is one at the start of the string AND one at the end of the string.
$cases = [
'(testone)',
'(testtwo',
'testthree)',
'/otherone/',
'/othertwo',
'otherthree/',
'"anotherone',
'anothertwo"',
'"anotherthree"',
];
$patterns = [
'/^\(/',
'/\)$/',
'~^/~',
'~/$~',
//Here is what I can not imagine, how to add the rule for `"`
];
$result = preg_replace($patterns, '', $cases);
This works well, but can it be done in one preg_replace()? If yes, can somebody help me with the pattern(s) for the quotes?
Result for quotes should be this:
'"anotherone', //no quote at end leave the leading
'anothertwo"', //no quote at start leave the trailing
'anotherthree', //there are quotes on start and end so remove them.
You may use another approach: rather than define an array of patterns, use one single alternation based regex:
preg_replace('~^[(/]|[/)]$|^"(.*)"$~s', '$1', $s)
See the regex demo
Details:
^[(/] - a literal ( or / at the start of the string
| - or
[/)]$ - a literal ) or / at the end of the string
| - or
^"(.*)"$ - a " at the start of the string, then any 0+ characters (due to /s option, the . matches a linebreak sequence, too) that are captured into Group 1, and " at the end of the string.
The replacement pattern is $1 that is empty when the first 2 alternatives are matched, and contains Group 1 value if the 3rd alternative is matched.
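As a quick check, the single pattern can be run against the $cases array from the question. This is a sketch; the variable name $result mirrors the question's own code.

```php
<?php
$cases = [
    '(testone)',
    '(testtwo',
    'testthree)',
    '/otherone/',
    '/othertwo',
    'otherthree/',
    '"anotherone',
    'anothertwo"',
    '"anotherthree"',
];

// One alternation: leading ( or /, trailing ) or /, or a fully quoted string.
$result = preg_replace('~^[(/]|[/)]$|^"(.*)"$~s', '$1', $cases);

print_r($result);
```

Only the string quoted on both sides loses its quotes; a lone leading or trailing " is kept, so a value like 3.5" would stay intact.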
Note: In case you need to replace until no match is found, use a preg_match with preg_replace together (see demo):
$s = '"/some text/"';
$re = '~^[(/]|[/)]$|^"(.*)"$~s';
$tmp = '';
while (preg_match($re, $s) && $tmp != $s) {
$tmp = $s;
$s = preg_replace($re, '$1', $s);
}
echo $s;
This works:
preg_replace(['~^[/(]?(.+?)[/)]?$~', '~^"(.+)"$~'], '$1', $string)
I would like to inject some code after X paragraphs, and this is pretty easy with PHP.
public function inject($text, $paragraph = 2) {
$exploded = explode("</p>", $text);
if (isset($exploded[$paragraph])) {
$exploded[$paragraph] = '
MYCODE
' . $exploded[$paragraph];
return implode("</p>", $exploded);
}
return $text;
}
But I don't want to inject my code inside a <table>, so how can I avoid this?
Thanks
I'm sometimes a bit crazy, sometimes I go for patterns that are lazy, but this time I'm going for something hazy.
$input = 'test <table><p>wuuut</p><table><p>lolwut</p></table></table> <p>foo bar</p> test1 <p>baz qux</p> test3'; # Some input
$insertAfter = 2; # Insert after N p tags
$code = 'CODE'; # The code we want to insert
$regex = <<<'regex'
~
# let's define something
(?(DEFINE)
(?P<table> # To match nested table tags
<table\b[^>]*>
(?:
(?!</?table\b[^>]*>).
|
(?&table)
)*
</table\s*>
)
(?P<paragraph> # To match nested p tags
<p\b[^>]*>
(?:
(?!</?p\b[^>]*>).
|
(?&paragraph)
)*
</p\s*>
)
)
(?&table)(*SKIP)(*FAIL) # Let's skip table tags
|
(?&paragraph) # And match p tags
~xsi
regex;
$output = preg_replace_callback($regex, function($m)use($insertAfter, $code){
static $counter = 0; # A counter
$counter++;
if($counter === $insertAfter){ # Should I explain?
return $m[0] . $code;
}else{
return $m[0];
}
}, $input);
var_dump($output); # Let's see what we've got
Online regex demo
Online php demo
References:
Reference - What does this regex mean?
What does the "[^][]" regex mean?
Verbs that act after backtracking and failure
Is there a way to define custom shorthands in regular expressions?
EDIT: It was late last night.
The PREG_SPLIT_DELIM_CAPTURE was neat but I am now adding a better idea (Method 1).
Also improved Method 2 to replace the strstr with a faster substr
METHOD 1: preg_replace_callback with (*SKIP)(*FAIL) (better)
Let's do a direct replace on the text that is certifiably table-free using a callback to your inject function.
Here's a regex to match table-free text:
$regex = "~(?si)(?!<table>).*?(?=<table|</table)|<table.*?</table>(*SKIP)(*FAIL)~";
In short, this either matches text that is a complete non-table or matches a complete table and fails.
Here's your replacement:
$injectedString = preg_replace_callback($regex,
function($m){return inject($m[0]);},
$data);
Much shorter!
And here's a demo of $regex showing you how it matches elements that don't contain a table.
$text = "<table> to
</table>not a table # 1<table> to
</table>NOT A TABLE # 2<table> to
</table>";
$regex = "~(?si)(?!<table>).*?(?=<table|</table)|<table.*?</table>(*SKIP)(*FAIL)~";
$a = preg_match_all($regex,$text,$m);
print_r($m);
The output: Array ( [0] => Array ( [0] => not a table # 1 [1] => NOT A TABLE # 2 ) )
Of course, if the HTML is not well formed and $data starts in the middle of a table, all bets are off. If that's a problem let me know and we can work on the regex.
METHOD 2
Here is the first solution that came to mind.
In short, I would look at using preg_split with the PREG_SPLIT_DELIM_CAPTURE flag.
The basic idea is to isolate the tables using a special preg_split, and to perform your injections on the elements that are certifiably table-free.
A. Step 1: split $data using an unusual delimiter: your delimiter will be a full table sequence: from <table to </table>
This is achieved with a delimiter specified by a regex pattern such as (?s)<table.*?</table>
Note that I am not closing <table in case you have a class there.
So you have something like
$tableseparator = preg_split( "~(?s)(<table.*?</table>)~", $data, -1, PREG_SPLIT_DELIM_CAPTURE );
The benefit of this PREG_SPLIT_DELIM_CAPTURE flag is that the whole delimiter, which we capture thanks to the parentheses in the regex pattern, becomes an element in the array, so that we can isolate the tables without losing them. [See demo of this at the bottom.] This way, your string is broken into clean "table-free" and "is-a-table" pieces.
B. Step 2: Iterate over the $tableseparator elements. For each element, do a
if(substr($tableseparator[$i],0,6)=="<table")
If <table is found, leave the element alone (don't inject). If it isn't found, that element is clean, and you can do your inject() magic on it.
C. Step 3: Put the elements of $tableseparator back together (implode just like you do in your inject function).
So you have a two-level explosion and implosion, first with preg_split, second with your explode!
Sorry that I don't have time to code everything in detail, but I'm certain that you can figure it out. :)
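For what it's worth, the three steps of Method 2 could be put together roughly as follows. This is a sketch under stated assumptions: inject() is rewritten here as a plain function standing in for the asker's method, injectOutsideTables is a made-up name, tables are assumed not to be nested, and the code is injected into every table-free piece that has enough paragraphs.

```php
<?php
// Stand-in for the asker's inject() method, as a plain function (assumption).
function inject(string $text, int $paragraph = 2, string $code = 'MYCODE'): string
{
    $exploded = explode('</p>', $text);
    if (isset($exploded[$paragraph])) {
        $exploded[$paragraph] = $code . $exploded[$paragraph];
        return implode('</p>', $exploded);
    }
    return $text;
}

// Hypothetical wrapper implementing steps A, B and C.
function injectOutsideTables(string $data, int $paragraph = 2): string
{
    // Step 1: split on whole tables; the capture group keeps them in the array.
    $pieces = preg_split('~(<table.*?</table>)~si', $data, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Step 2: inject only into pieces that do not start with <table.
    foreach ($pieces as $i => $piece) {
        if (strncasecmp($piece, '<table', 6) !== 0) {
            $pieces[$i] = inject($piece, $paragraph);
        }
    }

    // Step 3: put everything back together.
    return implode('', $pieces);
}

$data = '<p>a</p><table><tr><td><p>x</p><p>y</p><p>z</p></td></tr></table>'
      . '<p>b</p><p>c</p><p>d</p>';
echo injectOutsideTables($data), "\n";
```

In this run the paragraphs inside the table are ignored, and MYCODE lands after the second paragraph of the last table-free piece.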
preg_split with PREG_SPLIT_DELIM_CAPTURE demo
Here's a demo of how the preg_split works:
$text = "Hi#There##Oscar####";
$regex = "~(#+)~";
$a = preg_split($regex,$text,-1,PREG_SPLIT_DELIM_CAPTURE);
print_r($a);
The Output: Array ( [0] => Hi [1] => # [2] => There [3] => ## [4] => Oscar [5] => #### [6] => )
See how in this example the delimiters (the # sequences) are preserved? You have surgically isolated them but not lost them, so you can work on the other strings then put everything back together.
I need to replace multiple spaces, tabs and newlines with one space, except inside commented text in my HTML.
For example the following code:
<br/> <br>
<!--
this is a comment
-->
<br/> <br/>
should turn into
<br/><br><!--
this is a comment
--><br/><br/>
Any ideas?
The new solution
After thinking a bit, I came up with the following solution with pure regex. Note that this solution will delete the newlines/tabs/multi-spaces instead of replacing them:
$new_string = preg_replace('#(?(?!<!--.*?-->)(?: {2,}|[\r\n\t]+)|(<!--.*?-->))#s', '$1', $string);
echo $new_string;
Explanation
(? # If
(?!<!--.*?-->) # There is no comment
(?: {2,}|[\r\n\t]+) # Then match 2 spaces or more, or newlines or tabs
| # Else
(<!--.*?-->) # Match and group it (group #1)
) # End if
So basically when there is no comment it will try to match spaces/tabs/newlines. If it does find it then group 1 wouldn't exist and there will be no replacements (which will result into the deletion of spaces...). If there is a comment then the comment is replaced by the comment (lol).
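A small run of the conditional pattern on a made-up input may help to see what it does and does not touch. The sample string is an assumption for illustration.

```php
<?php
// Hypothetical input: double spaces, newlines, a comment, and one single space.
$string = "<br/>  <br>\n<!--\n  keep   this\n-->\n<p>a b</p>";

$new_string = preg_replace(
    '#(?(?!<!--.*?-->)(?: {2,}|[\r\n\t]+)|(<!--.*?-->))#s',
    '$1',
    $string
);

echo $new_string, "\n";
```

The comment survives byte for byte, runs of two or more spaces and all newlines outside it are deleted, and the single space in "a b" is left alone because the pattern only targets two or more spaces.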
Online demo
The old solution
I came up with a new strategy; this code requires PHP 5.3+:
$new_string = preg_replace_callback('#(?(?!<!--).*?(?=<!--|$)|(<!--.*?-->))#s', function($m){
if(!isset($m[1])){ // If group 1 does not exist (the comment)
return preg_replace('#\s+#s', ' ', $m[0]); // Then replace with 1 space
}
return $m[0]; // Else return the matched string
}, $string);
echo $new_string; // Output
Explaining the regex:
(? # If
(?!<!--) # Lookahead if there is no <!--
.*? # Then match anything (ungreedy) until ...
(?=<!--|$) # Lookahead, check for <!-- or end of string
| # Or
(<!--.*?-->) # Match and group a comment, this will make for us a group #1
)
# The s modifier is to match newlines with . (dot)
Online demo
Note: What you are asking for and what you have provided as expected output are somewhat contradictory. Anyway, if you want to remove the whitespace instead of replacing it with 1 space, just edit the code from '#\s+#s', ' ', $m[0] to '#\s+#s', '', $m[0].
It's much simpler to do this in several runs (as is done for instance in php markdown).
Step1: preg_replace_callback() all comments with something unique while keeping their original values in a keyed array -- ex: array('comment_placeholder:' . md5('comment') => 'comment', ...)
Step2: preg_replace() white spaces as needed.
Step3: str_replace() comments back where they originally were using the keyed array.
The approach you're leaning towards (splitting the string and only processing the non-comment parts) works fine too.
There almost certainly is a means to do this with pure regex, using ugly look-behinds, but not really recommended: the regex might yield backtracking related errors, and the comment replacement step allows you to process things further if needed without worrying about the comments themselves.
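The three-step placeholder approach described above could be sketched like this. The function name collapseOutsideComments and the exact placeholder format are assumptions; the sketch also assumes the placeholder text never occurs in the input itself.

```php
<?php
// Hypothetical helper: protect comments, collapse whitespace, restore comments.
function collapseOutsideComments(string $html): string
{
    $stash = [];

    // Step 1: swap every comment for a unique, whitespace-free placeholder.
    $html = preg_replace_callback('/<!--.*?-->/s', function (array $m) use (&$stash): string {
        // The counter comes before the hash so that no key is a prefix of another.
        $key = 'comment_placeholder:' . count($stash) . ':' . md5($m[0]);
        $stash[$key] = $m[0];
        return $key;
    }, $html);

    // Step 2: collapse whitespace in what is left.
    $html = preg_replace('/\s+/', ' ', $html);

    // Step 3: put the original comments back.
    return str_replace(array_keys($stash), array_values($stash), $html);
}

$input = "<br/>  <br>\n<!--\n  keep   this\n-->\n<br/>";
echo collapseOutsideComments($input), "\n";
```

Because the comments are out of the way during step 2, any further processing can be added there without special-casing them.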
I’d do the following:
split the input into comment and non-comment parts
do replacement on the non-comment parts
put everything back together
Example:
$parts = preg_split('/(<!--(?:(?!-->).)*-->)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => &$part) {
if ($i % 2 === 0) {
// non-comment part
$part = preg_replace('/\s+/', ' ', $part);
} else {
// comment part
}
}
unset($part); // break the reference left over from the foreach
$output = implode('', $parts);
You can use this:
$pattern = '~\s*+(<br[^>]*>|<!--(?>[^-]++|-(?!->))*-->)\s*+~';
$replacement = '$1';
$result = preg_replace($pattern, $replacement, $subject);
This pattern captures br tags and comments, and matches spaces around. Then it replaces the match by the capture group.
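Here is a short run of that pattern on the sample from the question, as a sketch; the $subject string is reconstructed from the question's example and may differ slightly from the original spacing.

```php
<?php
$subject = "<br/> <br>\n<!--\n this is a comment\n-->\n<br/> <br/>";

$pattern     = '~\s*+(<br[^>]*>|<!--(?>[^-]++|-(?!->))*-->)\s*+~';
$replacement = '$1';

// Whitespace around each br tag or comment is part of the match but not of
// the capture, so it disappears; whitespace inside the comment is untouched.
$result = preg_replace($pattern, $replacement, $subject);

echo $result, "\n";
```

Note that this only strips whitespace adjacent to br tags and comments; other tags would need to be added to the alternation.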
I have keyword "bolding" function here:
$ignore = array ('div', 'class', 'high', 'light', 'highlight');
$keywords = explode(' ', $qsvarus);
$title[$key] = preg_replace('/'.implode('|', $keywords).'/i', '<b>$0</b>', $title[$key]);
$infoo[$key] = preg_replace('/'.implode('|', $keywords).'/i', '<b>$0</b>', $infoo[$key]);
The problem is that it sometimes catches some of my HTML tags. How can I tell it to ignore everything shorter than 3 letters, as well as the specific words from the $ignore array, when wrapping $keywords in <b></b>?
I would simply loop around the keywords array first, removing any matches in the ignore array (use in_array) and also any keywords that are 3 characters or less.
Then if you want to make sure you are not in a tag, the following should suffice:
/\b(keyword|keyword|keyword)\b(?![^<]*[>])/
I've basically appended a negative lookahead:
(?![^<]*[>])
The regex will match as long as that lookahead for a closing html > not preceded by an opening tag < doesn't match. If you do get a closing tag, you should be able* to assume that you are inside a tag.
Putting that back into the preg_replace:
preg_replace('/\b('.implode('|', $keywords).')\b(?![^<]*[>])/i', '<b>$1</b>', $subject);
This does assume that there isn't an unencoded greater than (>)
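Putting the filtering step and the lookahead together might look like the sketch below. The function name boldKeywords and the $minLength parameter are assumptions; the minimum length of 3 follows the question and can be adjusted. The sketch also runs each keyword through preg_quote(), which the original snippet does not do, and assumes the $ignore list is lowercase.

```php
<?php
// Hypothetical helper: bold keywords that are long enough, not ignored,
// and not located inside an HTML tag.
function boldKeywords(string $subject, string $query, array $ignore, int $minLength = 3): string
{
    $keywords = [];
    foreach (preg_split('/\s+/', trim($query)) as $word) {
        // Drop short words and words from the ignore list.
        if (strlen($word) >= $minLength && !in_array(strtolower($word), $ignore, true)) {
            $keywords[] = preg_quote($word, '/'); // escape regex metacharacters
        }
    }
    if ($keywords === []) {
        return $subject; // nothing left to highlight
    }

    // The negative lookahead rejects matches followed by ">" with no "<" in between.
    $pattern = '/\b(' . implode('|', $keywords) . ')\b(?![^<]*>)/i';
    return preg_replace($pattern, '<b>$1</b>', $subject);
}

$ignore = ['div', 'class', 'high', 'light', 'highlight'];
echo boldKeywords('<div class="highlight">High class light bulbs</div>', 'class bulbs of', $ignore), "\n";
echo boldKeywords('<a title="bulbs">bulbs</a>', 'bulbs', $ignore), "\n";
```

In the second call only the text node is wrapped; the occurrence inside the title attribute is skipped by the lookahead.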