Regular Expressions: how to do "option split" replaces - php

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the latter option, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
this works for part1 of the task. but before that I think I should do the option split, my best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [ or ]
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace('/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/', '$1', $oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
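For reference, the repeat-until-stable loop above can be sketched in PHP as follows. This is only a sketch under assumptions: the function name stripBrackets is made up for illustration, and instead of comparing old and new text it uses the optional count argument of preg_replace to detect a pass that changed nothing.

```php
<?php
// Hypothetical helper (name chosen for this example): applies the pattern
// from this answer repeatedly until a pass makes no replacement.
function stripBrackets($text)
{
    $pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        // $count receives the number of replacements made in this pass
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$input = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
echo stripBrackets($input), "\n";
// test1:link test2:silver test3:out1insideout2 test4:this|not
```

The first pass resolves the innermost groups, the second removes the outer pair of test3, and the third finds nothing and ends the loop.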

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:link test2:si[lv]er
test3:out1in[si]deout2 test4:this|not
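A rough PHP sketch of using this pattern in a loop is shown here. The helper name keepLastOption is invented for the example. In a single-quoted PHP string the pattern can be pasted as-is, because only \\ and \' are treated as escapes there.

```php
<?php
// Hypothetical helper: keeps replacing each innermost [[...]] group with
// its last option (capture group 1) until nothing matches any more.
function keepLastOption($text)
{
    // Pattern from this answer, unchanged
    $pattern = '/\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]/';
    do {
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$input = "test1:[[link]] test2:[[gold|si[lv]er]]\n"
       . "test3:[[out1[[in[si]de]]out2]] test4:this|not";
echo keepLastOption($input), "\n";
```

Single brackets such as [lv] survive because the lookahead only rejects a character when it starts a [[ or ]] pair.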

Why try to do it all in one go? Remove the [[ and ]] first and then deal with the options; do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/'; // captures "name:value" pairs separated by spaces
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
// Only split on | when the value is a bracketed option group
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match);
$match = $opt[count($opt)-1]; // keep the last option
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
$result = array();
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];
echo implode(' ', $result);

This is C# using only verbatim (non-escaped) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}

Related

Erasing C comments with preg_replace

I need to erase all comments in $string which contains data from some C file.
The thing I need to replace looks like this:
something before that shouldnt be replaced
/*
* some text in between with / or * on many lines
*/
something after that shouldnt be replaced
and the result should look like this:
something before that shouldnt be replaced
something after that shouldnt be replaced
I have tried many regular expressions but none of them works the way I need.
Here are some latest ones:
$string = preg_replace("/\/\*(.*?)\*\//u", "", $string);
and
$string = preg_replace("/\/\*[^\*\/]*\*\//u", "", $string);
Note: the text is in UTF-8, the string can contain multibyte characters.
You would also want to add the s modifier to tell the regex that .* should include newlines. I always think of s to mean "treat the input text as a single line"
So something like this should work:
$string = preg_replace("/\\/\\*(.*?)\\*\\//us", "", $string);
Example: http://codepad.viper-7.com/XVo9Tp
Edit: Added extra escape slashes to the regex as Brandin suggested because he is right.
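To illustrate what the s modifier changes, here is a small sketch. The sample text is made up, and ~ is used as the delimiter purely so the slashes need no escaping.

```php
<?php
$source = "before\n/*\n * some text with / or * on many lines\n */\nafter\n";

// Without s, "." stops at newlines, so the multi-line comment never matches.
$withoutS = preg_replace('~/\*.*?\*/~u', '', $source);

// With s (PCRE_DOTALL), "." also matches newlines and the comment is removed.
$withS = preg_replace('~/\*.*?\*/~us', '', $source);

var_dump($withoutS === $source);        // bool(true)
var_dump($withS === "before\n\nafter\n"); // bool(true)
```

The newline that followed the comment is left in place, which is why two line breaks remain between "before" and "after".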
I don't think a regexp is a good fit here. What about writing a very small parser to remove the comments? I haven't done PHP coding for a long time, so I will just give you the idea (a simple algorithm). I haven't tested this; it is only meant to convey the idea:
buf = new String() // holds the source code without comments
pos = 0
while(string[pos] != EOF) {
// a comment starts with the two characters "/*"
if(string[pos] == '/' && string[pos + 1] == '*') {
pos += 2;
while(string[pos] != EOF)
{
if(string[pos] == '*' && string[pos + 1] == '/') {
pos += 2; // skip the closing "*/"
break;
}
pos++;
}
continue;
}
buf.append(string[pos++]);
}
where:
string is the C source code
buf a dynamic allocated string which expands as needed
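A possible PHP rendering of this scanner idea is sketched here. It is not the answerer's code: the function name stripBlockComments is hypothetical, and it uses strpos to jump to the closing */ rather than stepping one character at a time. It works on bytes, which is safe for UTF-8 input because / and * never occur inside a multibyte sequence. It does not know about string literals or // comments.

```php
<?php
// Hypothetical helper: copies everything except /* ... */ blocks.
function stripBlockComments($src)
{
    $out = '';
    $len = strlen($src);
    $pos = 0;
    while ($pos < $len) {
        // Start of a block comment?
        if ($src[$pos] === '/' && $pos + 1 < $len && $src[$pos + 1] === '*') {
            $end = strpos($src, '*/', $pos + 2);
            if ($end === false) {
                break; // unterminated comment: drop the rest of the input
            }
            $pos = $end + 2; // continue right after the closing */
            continue;
        }
        $out .= $src[$pos];
        $pos++;
    }
    return $out;
}

echo stripBlockComments("int a; /* first\n * second */ int b;"), "\n";
// int a;  int b;
```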
It is very hard to do this perfectly without ending up writing a full C parser.
Consider the following, for example:
// Not using /*-style comment here.
// This line has an odd number of " characters.
while (1) {
printf("Wheee!
(*\/*)
\\// - I'm an ant!
");
/* This is a multiline comment with a // in, and
// an odd number of " characters. */
}
So, from the above, we can see that our problems include:
Multiline-comment sequences should be ignored within double quotes, unless those double quotes are themselves part of a comment.
Single-line comment sequences can appear inside double-quoted strings and inside multiline comments.
Here's one possibility to address some of those issues, but far from perfect.
// Remove "-strings, //-comments and /*block-comments*/, then restore "-strings.
// Based on regex by mauke of Efnet's #regex.
$file = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);
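A quick way to see what that one-liner does is to run it on a small made-up input. Strings are matched first and put back through \1, while both comment styles are replaced by nothing. The sample line is invented for illustration, and escaped quotes inside strings or character literals are not handled by this pattern.

```php
<?php
// Made-up sample: a string containing "//", a trailing // comment,
// a block comment, and a string that merely looks like a comment.
$file = 'printf("a // b"); // trailing' . "\n" . '/* block */x = "/* kept */";';

$clean = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);

echo $clean, "\n";
// printf("a // b"); 
// x = "/* kept */";
```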
try this:
$string = preg_replace("#\/\*\n?(.*)\*\/\n?#ms", "", $string);
Use # as the regexp delimiter, and replace the u modifier with the appropriate ones: m (PCRE_MULTILINE) and s (PCRE_DOTALL).
Reference: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
It is important to note that my regexp does not handle more than one "comment block": because (.*) is greedy, it removes everything from the first /* to the last */, including any code in between. Use of "dot match all" is generally not a good idea.

PHP - Regex for a string of special characters

Morning SO. I'm trying to determine whether or not a string contains a list of specific characters.
I know I should be using preg_match for this, but my regex knowledge is woeful and I have been unable to glean any information from other posts around this site, since most of them just want to limit strings to a-z, A-Z and 0-9. But I do want some special characters to be allowed, for example: ! # £ and others not in the below string.
Characters to be matched on: # $ % ^ & * ( ) + = - [ ] \ ' ; , . / { } | \ " : < > ? ~
private function containsIllegalChars($string)
{
return preg_match([REGEX_STRING_HERE], $string);
}
I originally wrote the matching in Javascript, which just looped through each letter in the string and then looped through every character in another string until it found a match. Looking back, i can't believe i even attempted to use such an archaic method. With the advent of json (and a rewrite of the application!), i'm switching the match to php, to return an error message via json.
I was hoping a regex guru could assist with converting the above string to a regex string, but any feedback would be appreciated!
A regexp is not required for a "list of disallowed characters".
You may have a look at strpbrk. It should do the job you need.
Here's an example of usage
$tests = array(
"Hello I should be allowed",
"Aw! I'm not allowed",
"Geez [another] one",
"=)",
"<WH4T4NXSS474K>"
);
$illegal = "#$%^&*()+=-[]';,./{}|:<>?~";
foreach ($tests as $test) {
echo $test;
echo ' => ';
echo (false === strpbrk($test, $illegal)) ? 'Allowed' : "Disallowed";
echo PHP_EOL;
}
http://codepad.org/yaJJsOpT
return preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string);
$pattern = preg_quote('#$%^&*()+=-[]\';,./{}|\":<>?~', '#');
var_dump(preg_match("#[{$pattern}]#", 'hello world')); // int(0)
var_dump(preg_match("#[{$pattern}]#", 'he||o wor|d')); // int(1)
var_dump(preg_match("#[{$pattern}]#", '$uper duper')); // int(1)
Likely, you can cache the $pattern, depending on your implementation.
(Though looking outside of regular expressions, you're best off with strpbrk as mentioned here too)
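Putting the preg_quote idea into the function from the question could look roughly like this. It is a sketch, not a drop-in: it is written as a plain function rather than a private method, the pattern is cached in a static variable as one way of doing the caching mentioned above, and the result is converted to a boolean.

```php
<?php
function containsIllegalChars($string)
{
    static $pattern = null; // built once, then reused
    if ($pattern === null) {
        // Characters from the question; \\ is a backslash and \' a quote
        $illegal = '#$%^&*()+=-[]\\\';,./{}|":<>?~';
        $pattern = '#[' . preg_quote($illegal, '#') . ']#';
    }
    return preg_match($pattern, $string) === 1;
}

var_dump(containsIllegalChars('hello world')); // bool(false)
var_dump(containsIllegalChars('he||o wor|d')); // bool(true)
var_dump(containsIllegalChars('$uper duper')); // bool(true)
```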
I think what you're looking for can be greatly simplified by including the characters that you want to allow like so:
preg_match('/[^\w!#£]/', $string)
Here's a quick breakdown of what's happening:
[^...] = matches any character that is not listed inside the brackets
\w = letters, digits and the underscore
! # £ = the list of characters you would also like to allow
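One caveat with this whitelist: £ is a multibyte character in UTF-8, so the u modifier is needed for the class to treat it as a single character. A hedged sketch, assuming the input is valid UTF-8 (the function name hasDisallowedChars is made up, and \w is spelled out as A-Za-z0-9_ so the result does not depend on how \w treats non-ASCII letters):

```php
<?php
// Hypothetical helper: true if $string contains anything outside the
// allowed set. Assumes $string is valid UTF-8; on invalid input
// preg_match returns false and this function reports false.
function hasDisallowedChars($string)
{
    // \x{A3} is the pound sign; with the u modifier it is one code point
    return preg_match('/[^A-Za-z0-9_!#\x{A3}]/u', $string) === 1;
}

var_dump(hasDisallowedChars('Hello!'));      // bool(false)
var_dump(hasDisallowedChars("\xC2\xA35"));   // bool(false), "£5" in UTF-8
var_dump(hasDisallowedChars('a<b'));         // bool(true)
var_dump(hasDisallowedChars('Hello World')); // bool(true), space is not allowed
```

As with the original pattern, whitespace is rejected unless it is added to the class.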

preg_match_all syntax problem

Having trouble with preg_match syntax.
Within a page I need to find anything like
$first = '/>http:\/\/www.(.*?)\/(.*?)\</';
$second = '/="http:\/\/www.(.*?)\/(.*?)"/';
How could I combine the two?
Something like
$regex = '/(?="|>)http:\/\/www.(.*?)/(.*?)(?"|\<)/';
Sorry not very good at this.
This looks about right to me:
/(?:="|>)http:\/\/www\.(.*?)\/(.*?)["<]/i
Notice a few minor corrections: Your non-capturing group syntax was a little off (it should be (?:pattern) instead of (?pattern)), and you also needed to escape the . and /.
I'm also not sure the (.*?)\/(.*?) is doing exactly what you think it is; I'd probably just replace that with (.*?) unless you want to require a / character.
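For what it's worth, here is how that combined pattern might be used with preg_match_all. The sample markup is made up, and ~ is used as the delimiter only to avoid escaping the slashes; otherwise the pattern is the same.

```php
<?php
$html = '<tag asdf="http://www.some.com/directory"/>' . "\n"
      . '<dadr>http://www.adif.com/dir</dadr>';

$regex = '~(?:="|>)http://www\.(.*?)/(.*?)["<]~i';

preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

$pairs = array();
foreach ($matches as $m) {
    // $m[1] is the host part after "www.", $m[2] the path after the first /
    $pairs[] = array($m[1], $m[2]);
}
print_r($pairs);
```

Note that the closing class ["<] accepts either terminator regardless of how the URL started, so it is slightly more permissive than the two original patterns.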
Here is a funny thought.
Use /(?:(=")|>)http:\/\/www\.(.*?)\/(.*?)(?(1)"|<)/sg using a looping find next search. Extracting variables $2 and $3 each time. This uses a conditional.
Or, use /(?|(?<==")http:\/\/www\.(.*?)\/(.*?)(?=")|(?<=>)http:\/\/www\.(.*?)\/(.*?)(?=<))/sg in a match all. This uses branch reset. The array will accumulate as pairs ($cnt++ % 2).
Depends on what you mean by combining.
A perl test case:
use strict;
use warnings;
my $str = '
<tag asdf="http://www.some.com/directory"/>
<dadr>http://www.adif.com/dir</dadr>
';
while ( $str =~ /(?:(=")|>)http:\/\/www\.(.*?)\/(.*?)(?(1)"|<)/sg )
{
print "'$2' '$3'\n";
}
print "--------------\n";
my @parts = $str =~ /(?|(?<==")http:\/\/www\.(.*?)\/(.*?)(?=")|(?<=>)http:\/\/www\.(.*?)\/(.*?)(?=<))/sg;
my $cnt = 0;
for (@parts)
{
print "'$_' ";
if ($cnt++ % 2) {
print "\n";
}
}
__END__
Output:
'some.com' 'directory'
'adif.com' 'dir'
--------------
'some.com' 'directory'
'adif.com' 'dir'
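Since the question is about PHP, the conditional variant can be tried there as well. A sketch using the same test data (the g flag has no PHP equivalent; preg_match_all plays that role, and ~ is used as delimiter to avoid escaping slashes):

```php
<?php
$str = "\n<tag asdf=\"http://www.some.com/directory\"/>\n"
     . "<dadr>http://www.adif.com/dir</dadr>\n";

// (?(1)"|<) means: if group 1 (=") took part in the match, require a
// closing quote, otherwise require <
$regex = '~(?:(=")|>)http://www\.(.*?)/(.*?)(?(1)"|<)~s';

preg_match_all($regex, $str, $matches, PREG_SET_ORDER);

$lines = array();
foreach ($matches as $m) {
    $lines[] = "'{$m[2]}' '{$m[3]}'";
}
echo implode("\n", $lines), "\n";
// 'some.com' 'directory'
// 'adif.com' 'dir'
```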

PHP Formatting Regex - BBCode

To be honest, I suck at regex so much, I would use RegexBuddy, but I'm working on my Mac and sometimes it doesn't help much (for me).
Well, for what I need to do is a function in php
function replaceTags($n)
{
$n = str_replace("[[", "<b>", $n);
$n = str_replace("]]", "</b>", $n);
}
Although this is a bad example in case someone didn't close the tag by using ]] or [[, anyway, could you help with regex of:
[[ ]] = Bold format
** ** = Italic format
(( )) = h2 heading
Those are all I need, thanks :)
P.S - Is there any software like RegexBuddy available for Mac (Snow Leopard)?
function replaceTags($n)
{
$n = preg_replace("/\[\[(.*?)\]\]/", "<strong>$1</strong>", $n);
$n = preg_replace("/\*\*(.*?)\*\*/", "<em>$1</em>", $n);
$n = preg_replace("/\(\((.*?)\)\)/", "<h2>$1</h2>", $n);
return $n;
}
I should probably provide a little explanation: Each special character is preceded by a backslash so it's not treated as regex instructions ("[", "(", etc.). The "(.*?)" captures all characters between your delimiters ("[[" and "]]", etc.). What's captured is then output in the replacements string in place of "$1".
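A short usage sketch of the function above, with made-up input, shows both the normal case and what happens when a tag is never closed. Single-quoted patterns are used here, which behave the same as the double-quoted ones for these expressions.

```php
<?php
function replaceTags($n)
{
    $n = preg_replace('/\[\[(.*?)\]\]/', '<strong>$1</strong>', $n);
    $n = preg_replace('/\*\*(.*?)\*\*/', '<em>$1</em>', $n);
    $n = preg_replace('/\(\((.*?)\)\)/', '<h2>$1</h2>', $n);
    return $n;
}

echo replaceTags('[[bold]] **italic** ((Heading))'), "\n";
// <strong>bold</strong> <em>italic</em> <h2>Heading</h2>

echo replaceTags('[[never closed'), "\n";
// [[never closed
```

An unclosed [[ is simply left alone, which avoids the dangling opening tag that the str_replace version would produce.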
The same reason you can't do this with str_replace() applies to preg_replace() as well. Tag-pair style parsing requires a lexer/parser if you want to yield 100% accuracy and cover for input errors.
Regular expressions can't handle unclosed tags, nested tags, that sort of thing.
That all being said, you can get 50% of the way there with very little effort.
$test = "this is [[some]] test [[content for **you** to try, ((does [[it]])) **work?";
echo convertTags( $test );
// only handles validly formatted, non-nested input
function convertTags( $content )
{
return preg_replace(
array(
"/\[\[(.*?)\]\]/"
, "/\*\*(.*?)\*\*/"
, "/\(\((.*?)\)\)/"
)
, array(
"<strong>$1</strong>"
, "<em>$1</em>"
, "<h2>$1</h2>"
)
, $content
);
}
Modifiers could help too :)
http://lv.php.net/manual/en/reference.pcre.pattern.modifiers.php
U (PCRE_UNGREEDY) This modifier
inverts the "greediness" of the
quantifiers so that they are not
greedy by default, but become greedy
if followed by ?. It is not compatible
with Perl. It can also be set by a
(?U) modifier setting within the
pattern or by a question mark behind a
quantifier (e.g. .*?).
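To make the quoted description concrete, here is a small comparison on a made-up string: a greedy .* swallows everything up to the last ]], while the U modifier (or .*? without it) stops at the first one.

```php
<?php
$text = '[[a]] and [[b]]';

// Greedy by default: one match spanning both tags
$greedy = preg_replace('/\[\[(.*)\]\]/', '<strong>$1</strong>', $text);

// U inverts greediness, so .* now behaves like .*?
$ungreedy = preg_replace('/\[\[(.*)\]\]/U', '<strong>$1</strong>', $text);

// Equivalent without the modifier
$lazy = preg_replace('/\[\[(.*?)\]\]/', '<strong>$1</strong>', $text);

echo $greedy, "\n";   // <strong>a]] and [[b</strong>
echo $ungreedy, "\n"; // <strong>a</strong> and <strong>b</strong>
echo $lazy, "\n";     // <strong>a</strong> and <strong>b</strong>
```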

regex question redux regarding definition list

Trying to figure out a way to throw out attributes in this data that do not have any values. Thanks for helping.
My current regex code, thanks to Tomalak, looks like this:
Regex find
([^=|]+)=([^|]+)(?:\||$)
Regex replace
<dt>$1</dt><dd>$2</dd>
Data looks like this
Bristle Material=|Wire Material=Steel|Dia.=4 in|Grit=|Bristle Diam=|Wire Size=0.0095 in|Arbor Diam=|Arbor Thread - TPI or Pitch=1/2 - 3/8 in|No. of Knots=|Face Width=1/2 in|Face Plate Thickness=7/16 in|Trim Length=7/8 in|Stem Diam=|Speed=6000 rpm [Max]|No. of Rows=|Color=|Hub Material=|Structure=|Tool Shape=|Applications=Cleaning rust, scale and dirt, Light Deburring, Edge Blending, Roughening for adhesion, Finish preparation prior to plating or painting|Applicable Materials=|Type=|Used With=Straight Grinders, Bench/Pedestal Grinders, Right Angle Grinders|Packing Type=|Quantity=1 per pack|Wt.=
End result should look like this
<dt>Wire Material</dt><dd>Steel</dd><dt>Dia.</dt><dd>4 in</dd><dt>Wire Size</dt><dd>0.0095 in</dd>
Not this
Bristle Material=|<dt>Wire Material</dt><dd>Steel</dd><dt>Dia.</dt><dd>4 in</dd>Grit=|Bristle Diam=|<dt>Wire Size</dt><dd>0.0095 in
Here is how you can do it in PHP without using regular expressions:
$parts_list = explode('|', "Bristle Material=|Wire M....");
$parts = "";
foreach( $parts_list as $part ){
$p = explode('=', $part);
if(!empty($p[1])) $parts .= "<dt>$p[0]</dt>\n<dd>$p[1]</dd>\n";
}
echo $parts;
And here is how you can do it with regular expressions:
$parts = preg_replace(
array('/([^=|]*)=(?:\||$)/','/([^=|]*)=([^|]+)(?:\||$)/'),
array('', '<dt>$1</dt><dd>$2</dd>'),
$inputString
);
echo $parts;
Update
This is using a special replace feature of the PHP preg_replace which takes an array of regex expressions, and an array of replacement strings. The array() syntax of the function basically equates to this:
If I can match this: /([^=|]*)=(?:\||$)/ then replace it with an empty string.
If I can match this: /([^=|]*)=([^|]+)(?:\||$)/ then replace it with <dt>$1</dt><dd>$2</dd>
To test it in a Regex editor, you would run the first expression first (/([^=|]*)=(?:\||$)/) then run the second expression on the result of the first expression.
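The two-step behaviour can be checked on a shortened version of the data. In this sketch the input is cut down to four attributes for readability, and the intermediate result is computed separately only to make the first step visible.

```php
<?php
$inputString = 'Bristle Material=|Wire Material=Steel|Dia.=4 in|Grit=';

// Step 1 alone: drop attributes that have no value
$afterFirst = preg_replace('/([^=|]*)=(?:\||$)/', '', $inputString);

// Both steps, applied in array order by a single preg_replace call
$parts = preg_replace(
    array('/([^=|]*)=(?:\||$)/', '/([^=|]*)=([^|]+)(?:\||$)/'),
    array('', '<dt>$1</dt><dd>$2</dd>'),
    $inputString
);

echo $afterFirst, "\n"; // Wire Material=Steel|Dia.=4 in|
echo $parts, "\n";      // <dt>Wire Material</dt><dd>Steel</dd><dt>Dia.</dt><dd>4 in</dd>
```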
([^=|]*)=([^|]*)(?:\||$)
To skip the ones without a value, try this:
(?:[^=|]*=|([^=|]*)=([^|]+))(?:\||$)
Looks like you want preg_match_all here rather than preg_replace:
preg_match_all('~([^|]+)=([^|\s][^|]*)~', $str, $matches, PREG_SET_ORDER);
foreach($matches as $match)
echo "<dt>{$match[1]}</dt><dd>{$match[2]}</dd>\n";
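Run against a shortened, made-up slice of the data, the matching approach behaves as sketched here. The output is collected in a string instead of echoed directly so it can be inspected.

```php
<?php
$str = 'Bristle Material=|Wire Material=Steel|Dia.=4 in|Grit=';

// The value must start with a character that is neither | nor whitespace,
// so attributes with an empty value never match.
preg_match_all('~([^|]+)=([^|\s][^|]*)~', $str, $matches, PREG_SET_ORDER);

$html = '';
foreach ($matches as $match) {
    $html .= "<dt>{$match[1]}</dt><dd>{$match[2]}</dd>\n";
}
echo $html;
// <dt>Wire Material</dt><dd>Steel</dd>
// <dt>Dia.</dt><dd>4 in</dd>
```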
