PHP Formatting Regex - BBCode

To be honest, I'm so bad at regex that I would normally use RegexBuddy, but I'm working on my Mac and sometimes it doesn't help much (for me).
What I need is a function in PHP:
function replaceTags($n)
{
    $n = str_replace("[[", "<b>", $n);
    $n = str_replace("]]", "</b>", $n);
    return $n;
}
This is a bad example, though, because it breaks if someone doesn't close a tag with ]] or [[. Anyway, could you help with regexes for:
[[ ]] = Bold format
** ** = Italic format
(( )) = h2 heading
Those are all I need, thanks :)
P.S - Is there any software like RegexBuddy available for Mac (Snow Leopard)?

function replaceTags($n)
{
$n = preg_replace("/\[\[(.*?)\]\]/", "<strong>$1</strong>", $n);
$n = preg_replace("/\*\*(.*?)\*\*/", "<em>$1</em>", $n);
$n = preg_replace("/\(\((.*?)\)\)/", "<h2>$1</h2>", $n);
return $n;
}
I should probably provide a little explanation: each special character ("[", "(", etc.) is preceded by a backslash so it's not treated as a regex instruction. The "(.*?)" captures all characters between your delimiters ("[[" and "]]", etc.), matching as few as possible. What's captured is then output in the replacement string in place of "$1".

The same reason you can't do this with str_replace() applies to preg_replace() as well. Tag-pair style parsing requires a lexer/parser if you want to yield 100% accuracy and cover for input errors.
Regular expressions can't handle unclosed tags, nested tags, that sort of thing.
That all being said, you can get 50% of the way there with very little effort.
$test = "this is [[some]] test [[content for **you** to try, ((does [[it]])) **work?";
echo convertTags( $test );
// only handles validly formatted, non-nested input
function convertTags( $content )
{
return preg_replace(
array(
"/\[\[(.*?)\]\]/"
, "/\*\*(.*?)\*\*/"
, "/\(\((.*?)\)\)/"
)
, array(
"<strong>$1</strong>"
, "<em>$1</em>"
, "<h2>$1</h2>"
)
, $content
);
}

Modifiers could help too :)
http://lv.php.net/manual/en/reference.pcre.pattern.modifiers.php
U (PCRE_UNGREEDY): This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
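To make the effect of the U modifier concrete, here is a small sketch. The sample string and variable names are made up for illustration; it compares a greedy pattern, the same pattern with U, and the lazy .*? form.

```php
<?php
// Hypothetical sample input containing two bold spans.
$text = 'a [[b]] c [[d]]';

// Greedy: .* runs to the last "]]", so both spans are merged into one.
$greedy = preg_replace('/\[\[(.*)\]\]/', '<strong>$1</strong>', $text);

// U modifier: the same .* now matches as little as possible.
$ungreedy = preg_replace('/\[\[(.*)\]\]/U', '<strong>$1</strong>', $text);

// Lazy quantifier: same result as U, without the modifier.
$lazy = preg_replace('/\[\[(.*?)\]\]/', '<strong>$1</strong>', $text);

echo $greedy, PHP_EOL;   // a <strong>b]] c [[d</strong>
echo $ungreedy, PHP_EOL; // a <strong>b</strong> c <strong>d</strong>
echo $lazy, PHP_EOL;     // a <strong>b</strong> c <strong>d</strong>
```

Since U flips every quantifier in the pattern, writing .*? only where it is needed is usually easier to read.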

Related

Erasing C comments with preg_replace

I need to erase all comments in $string which contains data from some C file.
The thing I need to replace looks like this:
something before that shouldnt be replaced
/*
* some text in between with / or * on many lines
*/
something after that shouldnt be replaced
and the result should look like this:
something before that shouldnt be replaced
something after that shouldnt be replaced
I have tried many regular expressions, but none of them works the way I need.
Here are some latest ones:
$string = preg_replace("/\/\*(.*?)\*\//u", "", $string);
and
$string = preg_replace("/\/\*[^\*\/]*\*\//u", "", $string);
Note: the text is in UTF-8, the string can contain multibyte characters.

You would also want to add the s modifier to tell the regex that . should match newlines too. I always think of s as meaning "treat the input text as a single line".
So something like this should work:
$string = preg_replace("/\\/\\*(.*?)\\*\\//us", "", $string);
Example: http://codepad.viper-7.com/XVo9Tp
Edit: Added extra escape slashes to the regex as Brandin suggested because he is right.
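As a quick check of what the s modifier changes, here is a sketch with a made-up multi-line input. It uses ~ as the delimiter (my choice, to avoid escaping the slashes) and runs the same pattern with and without s.

```php
<?php
// Hypothetical input: one block comment spread over several lines.
$source = "before\n/*\n * some text / or *\n */\nafter\n";

// Without s, "." stops at newlines, so the multi-line comment never matches.
$withoutS = preg_replace('~/\*.*?\*/~u', '', $source);

// With s, "." also matches newlines and the whole comment is removed.
$withS = preg_replace('~/\*.*?\*/~us', '', $source);

var_dump($withoutS === $source);        // bool(true): nothing was removed
var_dump($withS === "before\n\nafter\n"); // bool(true)
```

Note that the line break after the closing */ stays in place, which is why an empty line is left behind.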

I don't think a regexp is a good fit here. What about writing a very small parser to remove the comments? I haven't written PHP for a long time, so I will just give you the idea as a simple algorithm. I haven't tested this; it's only meant to show the idea:
buf = new String() // holds the source code without comments
buf_index = 0
pos = 0
while(string[pos] != EOF) {
    if(string[pos] == '/' && string[pos + 1] == '*') {
        pos += 2; // skip the opening "/*"
        while(string[pos] != EOF)
        {
            if(string[pos] == '*' && string[pos + 1] == '/') {
                pos += 2; // skip the closing "*/"
                break;
            }
            pos++;
        }
        continue; // re-check: EOF or another comment may follow directly
    }
    buf[buf_index++] = string[pos++];
}
where:
string is the C source code
buf a dynamic allocated string which expands as needed
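For reference, the same idea could look roughly like this in PHP. This is only a sketch: the function name stripBlockComments is hypothetical, and like the pseudocode it knows nothing about string literals or // comments. It scans byte by byte, which is safe for UTF-8 input because / and * are single-byte ASCII characters.

```php
<?php
// Hypothetical helper: removes /* ... */ block comments with a simple scan.
// String literals and // comments are NOT treated specially.
function stripBlockComments(string $source): string
{
    $buf = '';               // the source code without comments
    $len = strlen($source);
    $pos = 0;
    while ($pos < $len) {
        if (substr($source, $pos, 2) === '/*') {
            // Jump just past the closing "*/", or to the end if it is missing.
            $end = strpos($source, '*/', $pos + 2);
            $pos = ($end === false) ? $len : $end + 2;
            continue;
        }
        $buf .= $source[$pos++];
    }
    return $buf;
}

echo stripBlockComments("x = a / b; /* gone */ y = 1;"), PHP_EOL;
// x = a / b;  y = 1;
```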

It is very hard to do this perfectly without ending up writing a full C parser.
Consider the following, for example:
// Not using /*-style comment here.
// This line has an odd number of " characters.
while (1) {
printf("Wheee!
(*\/*)
\\// - I'm an ant!
");
/* This is a multiline comment with a // in, and
// an odd number of " characters. */
}
So, from the above, we can see that our problems include:
multiline comment sequences should be ignored within double quotes, unless those double quotes are themselves part of a comment.
single-line comment sequences can be contained in double-quoted strings, and in multiline strings.
Here's one possibility to address some of those issues, but far from perfect.
// Remove "-strings, //-comments and /*block-comments*/, then restore "-strings.
// Based on regex by mauke of Efnet's #regex.
$file = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);
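Here is a small sketch of how that replacement behaves, using a made-up one-line input. Double-quoted strings are captured in group 1 and put back, while both kinds of comment are replaced with nothing because group 1 is empty for them. As far as I can tell, a string containing an escaped quote (\") would still confuse it, since "[^"]*" stops at the first quote it sees.

```php
<?php
// Hypothetical input: a string literal containing "//", a real // comment,
// and a block comment.
$file = "printf(\"a // b\"); // real comment\n/* block */ x";

// Same pattern as above: strings are restored via \1, comments are dropped.
$clean = preg_replace('{("[^"]*")|//[^\n]*|(/\*.*?\*/)}s', '\1', $file);

var_dump($clean === "printf(\"a // b\"); \n x"); // bool(true)
```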

try this:
$string = preg_replace("#\/\*\n?(.*)\*\/\n?#ms", "", $string);
Use # as the regexp delimiter, and replace that u modifier with m (PCRE_MULTILINE) and s (PCRE_DOTALL).
Reference: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
It is important to note that my regexp does not find more than one "comment block": because (.*) is greedy, with s set it matches from the first /* to the last */ in the string, so any code between two comments is removed as well. Use of "dot match all" is generally not a good idea.

PHP - Regex for a string of special characters

Morning SO. I'm trying to determine whether or not a string contains any of a list of specific characters.
I know I should be using preg_match for this, but my regex knowledge is woeful and I have been unable to glean any information from other posts around this site, since most of them just want to limit strings to a-z, A-Z and 0-9. I do want some special characters to be allowed, for example ! # £ and others not in the string below.
Characters to be matched on: # $ % ^ & * ( ) + = - [ ] \ ' ; , . / { } | \ " : < > ? ~
private function containsIllegalChars($string)
{
return preg_match([REGEX_STRING_HERE], $string);
}
I originally wrote the matching in JavaScript, which just looped through each letter in the string and then looped through every character in another string until it found a match. Looking back, I can't believe I even attempted to use such an archaic method. With the advent of JSON (and a rewrite of the application!), I'm switching the match to PHP, to return an error message via JSON.
I was hoping a regex guru could assist with converting the above string to a regex string, but any feedback would be appreciated!

A regexp is not mandatory for checking a "list of disallowed characters".
You may have a look at strpbrk. It should do the job you need.
Here's an example of usage:
$tests = array(
"Hello I should be allowed",
"Aw! I'm not allowed",
"Geez [another] one",
"=)",
"<WH4T4NXSS474K>"
);
$illegal = "#$%^&*()+=-[]';,./{}|:<>?~";
foreach ($tests as $test) {
echo $test;
echo ' => ';
echo (false === strpbrk($test, $illegal)) ? 'Allowed' : "Disallowed";
echo PHP_EOL;
}
http://codepad.org/yaJJsOpT

return preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string);
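Dropped into a function, that one-liner could be used as in the sketch below. The standalone function (instead of the private method from the question) and the strict boolean return are my assumptions; preg_match() itself returns 1, 0 or false.

```php
<?php
// Sketch: true if $string contains at least one of the disallowed characters.
function containsIllegalChars(string $string): bool
{
    // In a single-quoted PHP string, \\\\ becomes \\ in the pattern,
    // which the regex engine reads as one literal backslash.
    return preg_match('/[#$%^&*()+=\-\[\]\';,.\/{}|":<>?~\\\\]/', $string) === 1;
}

var_dump(containsIllegalChars('Hello world!')); // bool(false)
var_dump(containsIllegalChars('he||o wor|d'));  // bool(true)
var_dump(containsIllegalChars('back\\slash'));  // bool(true)
```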

$pattern = preg_quote('#$%^&*()+=-[]\';,./{}|\":<>?~', '#');
var_dump(preg_match("#[{$pattern}]#", 'hello world')); // int(0)
var_dump(preg_match("#[{$pattern}]#", 'he||o wor|d')); // int(1)
var_dump(preg_match("#[{$pattern}]#", '$uper duper')); // int(1)
You can likely cache the $pattern, depending on your implementation.
(Though looking outside of regular expressions, you're best off with strpbrk, as mentioned here too.)

I think what you're looking for can be greatly simplified by including the characters that you want to allow like so:
preg_match('/[^\w!#£]/', $string)
Here's a quick breakdown of what's happening:
[^...] = matches any character that is not listed inside the brackets
\w = letters, digits and the underscore
! # £ = the list of characters you would also like to allow
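A sketch of the allow-list approach as a function follows. Two things here are my assumptions, not part of the answer above: the u modifier, so that £ is treated as one UTF-8 character rather than two separate bytes, and the space added to the class, without which ordinary sentences would be rejected. The function name is hypothetical.

```php
<?php
// Sketch: true if $string contains anything outside the allowed set.
// Assumes this file and the input are UTF-8 encoded.
function containsDisallowedChars(string $string): bool
{
    // Allowed: \w (letters, digits, underscore), space, ! # and £.
    return preg_match('/[^\w !#£]/u', $string) === 1;
}

var_dump(containsDisallowedChars('Hello world! #1 £5')); // bool(false)
var_dump(containsDisallowedChars('a<b'));                // bool(true)
```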

html_entity_decode in specific regular expression for a preg_replace

I have these preg_replace patterns and replacements:
$patterns = array(
"/<br\W*?\/>/",
"/<strong>/",
"/<*\/strong>/",
"/<h1>/",
"/<*\/h1>/",
"/<h2>/",
"/<*\/h2>/",
"/<em>/",
"/<*\/em>/",
'/(?:\<code*\>([^\<]*)\<\/code\>)/',
);
$replacements = array(
"\n",
"[b]",
"[/b]",
"[h1]",
"[/h1]",
"[h2]",
"[/h2]",
"[i]",
"[/i]",
'[code]***HTML DECODE HERE***[/code]',
);
In my string I want to html_entity_decode the content between these tags:
<code> &lt; &gt; </code>, but keep my array structure for preg_replace,
so this: <code> &lt; &gt; </code> will become this: [code] < > [/code]
Any help would be much appreciated, thanks!

You cannot encode this in the replacement string. As PoloRM suggested, you could use preg_replace_callback specifically for your last replacement instead:
function decode_html($matches)
{
return '[code]'.html_entity_decode($matches[1]).'[/code]';
}
$str = '<code> &lt; &gt; </code>';
$str = preg_replace_callback('/(?:\<code*\>([^\<]*)\<\/code\>)/', 'decode_html', $str);
Equivalently, using create_function (deprecated as of PHP 7.2 and removed in PHP 8.0):
$str = preg_replace_callback(
'/(?:\<code*\>([^\<]*)\<\/code\>)/',
create_function(
'$matches',
'return \'[code]\'.html_entity_decode($matches[1]).\'[/code]\';'
),
$str
);
Or, as of PHP 5.3.0:
$str = preg_replace_callback(
'/(?:\<code*\>([^\<]*)\<\/code\>)/',
function ($matches) {
return '[code]'.html_entity_decode($matches[1]).'[/code]';
},
$str
);
But note that in all three cases, your pattern is not really optimal. Firstly, you don't need to escape those < and > (but that is just for readability). Secondly, your first * allows infinite repetition (or omission) of the letter e. I suppose you wanted to allow attributes. Thirdly, you cannot include other tags within your <code> (because [^<] will not match them). In this case maybe you should go with ungreedy repetition instead (I also changed the delimiter for convenience):
~(?:<code[^>]*>(.*?)</code>)~
As you can already see, this is still far from perfect (in terms of correctly matching the HTML in the first place). Hence, the obligatory reminder: don't use regex to parse HTML. You will be much better off using a DOM parser. PHP ships with a built-in one, and there is also a very convenient-to-use 3rd-party one.
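Putting the improved pattern together with the callback might look like the sketch below. The wrapper name codeToBbcode and the added s modifier (so that a <code> block may span several lines) are my own assumptions.

```php
<?php
// Sketch: convert <code ...>...</code> to [code]...[/code], decoding entities.
function codeToBbcode(string $html): string
{
    return preg_replace_callback(
        '~<code[^>]*>(.*?)</code>~s',
        function (array $matches): string {
            return '[code]' . html_entity_decode($matches[1]) . '[/code]';
        },
        $html
    );
}

echo codeToBbcode('<code class="php"> &lt;b&gt; &amp; </code>'), PHP_EOL;
// [code] <b> & [/code]
```

The other tag replacements can stay in the plain preg_replace arrays; only the <code> rule needs the callback.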

Check out this:
http://www.php.net/manual/en/function.preg-replace-callback.php
You can create a callback function that applies the html_entity_decode functionality on your match.

PHP regex optimize

I've got a regular expression that matches everything between < and > (as in <anything>), and I'm using this:
'#<([\w]+)>#'
today, but I believe there might be a better way to do it?
/ Tobias

\w doesn't match everything like you said, by the way, just [a-zA-Z0-9_]. Assuming you were using "everything" in a loose manner and \w is what you want, you don't need square brackets around the \w. Otherwise it's fine.

If "anything" is "anything except a > char", then you can:
#<([^>]+)>#
Testing will show if this performs better or worse.
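For comparison, here is a minimal sketch of that pattern with preg_match_all. The sample string is made up for illustration.

```php
<?php
// Hypothetical sample input with two <...> groups.
$string = 'abcd<xyz>ab<c>d';

// Group 1 collects whatever sits between each "<" and the next ">".
preg_match_all('#<([^>]+)>#', $string, $matches);

print_r($matches[1]); // contains "xyz" and "c"
```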
Also, are you sure that you need to optimize? Does your original regex do what it should?

You would be better off using PHP string functions for this task. It should be faster and is not too complex.
For example:
$string = "abcd<xyz>ab<c>d";
$curr_offset = 0;
$matches = array();
$opening_tag_pos = strpos($string, '<', $curr_offset);
while($opening_tag_pos !== false)
{
    $curr_offset = $opening_tag_pos;
    $closing_tag_pos = strpos($string, '>', $curr_offset);
    if ($closing_tag_pos === false) break; // unclosed "<": stop instead of looping forever
    $matches[] = substr($string, $opening_tag_pos+1, ($closing_tag_pos-$opening_tag_pos-1));
    $curr_offset = $closing_tag_pos;
    $opening_tag_pos = strpos($string, '<', $curr_offset);
}
/*
$matches = Array ( [0] => xyz [1] => c )
*/
Of course, if you are trying to parse HTML or XML, use an HTML or XML parser instead.

That looks alright. What's not optimal about it?
You may also want to consider something other than regex if you're trying to parse HTML:
RegEx match open tags except XHTML self-contained tags

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the last option, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
This works for part 1 of the task, but before that I think I should do the option split. My best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [ or ]
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # Closing ]]
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace(regexp,replace,$oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
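In PHP, the repeat-until-stable loop could be sketched as follows. The function name resolveOptions is hypothetical; instead of comparing the old and new text, it uses the optional count argument of preg_replace to detect when a pass changed nothing.

```php
<?php
// Sketch: strip [[...]] groups, keeping the last |-separated option,
// working from the innermost brackets outwards.
function resolveOptions(string $text): string
{
    $pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        // $count receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$text = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
echo resolveOptions($text), PHP_EOL;
// test1:link test2:silver test3:out1insideout2 test4:this|not
```

Each successful replacement removes at least the four bracket characters, so the loop always terminates.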

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:link test2:si[lv]er
test3:out1in[si]deout2 test4:this|not

Why try to do it all in one go? Remove the [[ and ]] first and then deal with the options; do it in two lines of code.
When trying to get something going, favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
    if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
        $opt = explode('|', $match);
        $match = $opt[count($opt)-1];
    }
    $m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only verbatim (non-escaped) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}
