I need an regular expression to find all starting brace has an ending brace.
Suppose
([(([[(([]))]]))]) -- this one will return true. but
[](()()[[]])[][[([]) --- this one will return false
for this, I've tried below:-
function check($str) {
$output = "";
$pattern = "/(\{[^}]*)([^{]*\})/im";
preg_match($pattern, $str, $match);
print_r($match[0]);
}
assert(check("[](()()[[]])[][[([])") === FALSE);
any help please...
The easiest way to do this (in my opinion) would be to implement a stack data structure and pass through your string. Essentially something like so:
Traverse the string left to right
If you find an opening parenthesis, add it to the stack
else (you find a closing parenthesis) make sure that the top most item in the stack is the same type of parenthesis (so make sure that if you found a }, the top most item in the stack is {). This should help scenarios where you have something like so: ({)}. If it matches, pop from the stack.
If you repeat the above operation throughout the entire string, you should end up with an empty stack. This would mean that you have managed to match all open parenthesis with a close parenthesis.
You can use this:
$pattern = '~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~';
(?1) is a reference to the subpattern (not the matched content) in the capture group 1. Since I put it in the capture group 1 itself, I obtain a recursion.
I added anchors for the start ^ and the end $ of the string to be sure to check all the string.
Note: If you need to check a string that contains not only brackets, you can replace each (?1)* with:
(?>[^][}{)(]++|(?1))*
Note 2: If you want that an empty string return true, you must replace the last quantifier + with *
Working example:
function check($str, $display = false) {
if (preg_match('~^(\((?1)*\)|\[(?1)*]|{(?1)*})+$~', $str, $match)) {
if ($display) echo $match[0];
return true;
}
elseif (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
if ($display) echo "The recursion limit has been reached\n";
return -1;
} else return false;
}
assert(check(')))') === false);
check('{[[()()]]}', true);
var_dump(check('{[[()()]]}'));
Related
I need to parse a string that contains some parentheses disposed recursively, but i'm having trouble with determining priority of parentheses.
For exemple, I have the string
$truth = "((A^¬B)->C)";
and I need to return what is between the parentheses. I've already done it with the following regex:
preg_match_all("~\((.*?)\)~", $truth, $str);
But the problem is that it returns what is between the first "(" and the first ")", which is
(A^¬B
Instead of this, i need it to 'know' where the parentheses closes correctly, in order to return
(A^¬B)->C
How can I return this respecting the priority order? Thanks!
The main problem you have right now is the ? non-greedy bit. If you change that to just .+ greedy it will match what you want.
$truth = "((A^¬B)->C)";
preg_match('/\(.+\)/', $truth, $match);
Try it
Output
(A^¬B)->C
If you want to match the inner pair you can use a recursive subpattern:
$truth = "((A^¬B)->C)";
preg_match('/\(([^()]+|(?0))\)/', $truth, $match);
Try It online
Output
A^¬B
If you need to go further then that you can make a lexer/parser. I have some examples here:
https://github.com/ArtisticPhoenix/MISC/tree/master/Lexers
For your sample string, something like this will recursively give you the contents of the parentheses. It works by forcing the parentheses matched to be the outermost pair by using ^[^(]* and [^)]*$ at each end of the regex.
$truth = "((A^¬B)->C)";
while (strpos($truth, '(') !== false) {
preg_match("~^[^(]*\((.*?)\)[^)]*$~", $truth, $str);
$truth = $str[1];
echo "$truth\n";
}
Output
(A^¬B)->C
A^¬B
Note however this will not correctly parse a string such as (A+B)-(C+D). If that could be your scenario, this answer might help.
Demo on 3v4l.org
I have this in a function which is supposed to replace any sequence of parentheses with what is enclosed in it like (abc) becomes abc any where it appears even recursively because parens can be nested.
$return = preg_replace_callback(
'|(\((.+)\))+|',
function ($matches) {
return $matches[2];
},
$s
);
when the above regex is fed this string "a(bcdefghijkl(mno)p)q" as input it returns "ap)onm(lkjihgfedcbq". This shows the regex is matched once. What can I do to make it continue to match even inside already made matches and produce this `abcdefghijklmnopq'"
To match balanced parenthetical substrings you may use a well-known \((?:[^()]++|(?R))*\) pattern (described in Matching Balanced Constructs), inside a preg_replace_callback method, where the match value can be further manipulated (just remove all ( and ) symbols from the match that is easy to do even without a regex:
$re = '/\((?:[^()]++|(?R))*\)/';
$str = 'a(bcdefghijkl(mno)p)q((('; // Added three ( at the end
$result = preg_replace_callback($re, function($m) {
return str_replace(array('(',')'), '', $m[0]);
}, $str);
echo $result; // => abcdefghijklmnopq(((
See the PHP demo
To get overlapping matches, you need to use a known technique, capturing inside a positive lookahead, but you won't be able to perform two operations at once (replacing and matching), you can run matching first, and then replace:
$re = '/(?=(\((?:[^()]++|(?1))*\)))/';
$str = 'a(bcdefghijkl(mno)p)q(((';
preg_match_all($re, $str, $m);
print_r($m[1]);
// => Array ( [0] => (bcdefghijkl(mno)p) [1] => (mno) )
See the PHP demo.
Try this one,
preg_match('/\((?:[^\(\)]*+|(?0))*\)/', $str )
https://regex101.com/r/NsQSla/1
It will match everything inside of the ( ) as long as they are matched pairs.
Example
(abc) (abc (abc))
will have the following matches
Match 1
Full match 0-5 `(abc)`
Match 2
Full match 6-17 `(abc (abc))`
It is slightly unclear exactly what the postcondition of the algorithm is supposed to be. It seems to me that you are wanting to strip out matching pairs of ( ). The assumption here is that unmatched parentheses are left alone (otherwise you just strip out all of the ('s and )'s).
So I guess this means the input string a(bcdefghijkl(mno)p)q becomes abcdefghijklmnopq but the input string a(bcdefghijkl(mno)pq becomes a(bcdefghijklmnopq. Likewise an input string (a)) would become a).
It may be possible to do this using pcre since it does provide some non-regular features but I'm doubtful about it. The language of the input strings is not regular; it's context-free. What #ArtisticPhoenix's answer does is match complete pairs of matched parentheses. What it does not do is match all nested pairs. This nested matching is inherently non-regular in my humble understanding of language theory.
I suggest writing a parser to strip out the matching pairs of parentheses. It gets a little wordy having to account for productions that fail to match:
<?php
// Parse the punctuator sub-expression (i.e. anything within ( ... ) ).
function parse_punc(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$inner = parse_punc_seq($tokens,$iter);
if (!isset($tokens[$iter]) || $tokens[$iter] != ')') {
// Leave unmatched open parentheses alone.
$inner = "($inner";
}
$iter += 1;
return $inner;
}
// Parse a sequence (inside punctuators).
function parse_punc_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter];
if ($tok == ')') {
return;
}
$iter += 1;
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_punc_seq($tokens,$iter);
return $tok;
}
// Parse a sequence (outside punctuators).
function parse_seq(array $tokens,&$iter) {
if (!isset($tokens[$iter])) {
return;
}
$tok = $tokens[$iter++];
if ($tok == '(') {
$tok = parse_punc($tokens,$iter);
}
$tok .= parse_seq($tokens,$iter);
return $tok;
}
// Wrapper for parser.
function parse(array $tokens) {
$iter = 0;
return strval(parse_seq($tokens,$iter));
}
// Grab input from stdin and run it through the parser.
$str = trim(stream_get_contents(STDIN));
$tokens = preg_split('/([\(\)])/',$str,-1,PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
var_dump(parse($tokens));
I know this is a lot more code than a regex one-liner but it does solve the problem as I understand it. I'd be interested to know if anyone can solve this problem with a regular expression.
I'm using the code at the bottom to grab parameters from a wordpress shortcode. The shortcode itself looks like this:
[FLOWPLAYER=http://www.tvovermind.com/wp-content/uploads/2013/01/pll-316-21.jpg|http://www.tvovermind.com/wp-content/uploads/2013/01/PLL316_fv2.h264HD-Clip2.flv,440,280]
Or
[FLOWPLAYER=http://www.tvovermind.com/wp-content/uploads/2013/01/pll-316-21.jpg|http://www.tvovermind.com/wp-content/uploads/2013/01/PLL316_fv2.h264HD-Clip2.flv,440,280,false]
What I would like to have happen is that if the extra parameter (false/true) is missing then that match becomes "false", however with the current code if the parameter is missing a match is never made. Any ideas?
function legacy_hook($content){
$regex = '/\[FLOWPLAYER=([a-z0-9\:\.\-\&\_\/\|]+)\,([0-9]+)\,([0-9]+)\,([a-z0-9\:\.\-\&\_\/\|]+)\]/i';
$matches = array();
preg_match_all($regex, $content, $matches);
if($matches[0][0] != '') {
foreach($matches[0] as $key => $data) {
$content = str_replace($matches[0][$key], flowplayer::build_player($matches[2][$key], $matches[3][$key], $matches[1][$key],$matches[4][$key]),$content);
}
}
return $content;
}
your regex is looking for the last comma to be there and one or more of the characters in the last set of brackets. Something like
/\[FLOWPLAYER=([a-z0-9\:\.\-\&\_\/\|]+)\,([0-9]+)\,([0-9]+)(\,[a-z]+)?\]/i
only issue is you'll get the comma in the match too.
might be what you're after, then you have to test for the last match being present. preg_match_all returns the number of matches so you might be able to use that, or you could do an inline if...
(count($matches) > 4 ? $matches[4][$key] : false)
You can add OR at the end of your expression
(,true|,false|$)
I didn't check does it work but you get the idea.
I want to see if a variable contains a value that matches one of a few values in a hard-coded list. I was using the following, but recently found that it has a flaw somewhere:
if (preg_match("/^(init)|(published)$/",$status)) {
echo 'found';
} else {
echo 'nope';
}
I find that if the variable $status contains the word "unpublished" there is still a match even though 'unpublished' is not in the list, supposedly because the word 'unpublished' contains the word 'published', but I thought the ^ and $ in the regular expression are supposed to force a match of the whole word. Any ideas what I'm doing wrong?
Modify your pattern:
$pattern = "/^(init|published)$/";
$contain = "unpublished";
echo preg_match($pattern, $contain) ? 'found' : 'nope' ;
This pattern says our string must be /^init$/ or /^published$/, meaning those particular strings from start to finish. So substrings cannot be matched under these constraints.
In this case, regex are not the right tool to use.
Just put the words you want your candidate to be checked against in an array:
$list = array('init', 'published');
and then check:
if (in_array($status, $list)) { ... }
^ matches the beginning of a string. $ matches the end of a string.
However, regexes are not a magic bullet that need to get used at every opportunity.
In this case, the code you want is
if ( $status == 'init' || $status == 'published' )
If you are checking against a list of values, create an array based on those values and then check to see if the key exists in the array.
I am trying to validate a Youtube URL using regex:
preg_match('~http://youtube.com/watch\?v=[a-zA-Z0-9-]+~', $videoLink)
It kind of works, but it can match URL's that are malformed. For example, this will match ok:
http://www.youtube.com/watch?v=Zu4WXiPRek
But so will this:
http://www.youtube.com/watch?v=Zu4WX£&P!ek
And this wont:
http://www.youtube.com/watch?v=!Zu4WX£&P4ek
I think it's because of the + operator. It's matching what seems to be the first character after v=, when it needs to try and match everything behind v= with [a-zA-Z0-9-]. Any help is appreciated, thanks.
To provide an alternative that is larger and much less elegant than a regex, but works with PHP's native URL parsing functions so it might be a bit more reliable in the long run:
$url = "http://www.youtube.com/watch?v=Zu4WXiPRek";
$query_string = parse_url($url, PHP_URL_QUERY); // v=Zu4WXiPRek
$query_string_parsed = array();
parse_str($query_string, $query_string_parsed); // an array with all GET params
echo($query_string_parsed["v"]); // Will output Zu4WXiPRek that you can then
// validate for [a-zA-Z0-9] using a regex
The problem is that you are not requiring any particular number of characters in the v= part of the URL. So, for instance, checking
http://www.youtube.com/watch?v=Zu4WX£&P!ek
will match
http://www.youtube.com/watch?v=Zu4WX
and therefore return true. You need to either specify the number of characters you need in the v= part:
preg_match('~http://youtube.com/watch\?v=[a-zA-Z0-9-]{10}~', $videoLink)
or specify that the group [a-zA-Z0-9-] must be the last part of the string:
preg_match('~http://youtube.com/watch\?v=[a-zA-Z0-9-]+$~', $videoLink)
Your other example
http://www.youtube.com/watch?v=!Zu4WX£&P4ek
does not match, because the + sign requires that at least one character must match [a-zA-Z0-9-].
Short answer:
preg_match('%(http://www.youtube.com/watch\?v=(?:[a-zA-Z0-9-])+)(?:[&"\'\s])%', $videoLink)
There are a few assumptions made here, so let me explain:
I added a capturing group ( ... ) around the entire http://www.youtube.com/watch?v=blah part of the link, so that we can say "I want get the whole validated link up to and including the ?v=movieHash"
I added the non-capturing group (?: ... ) around your character set [a-zA-Z0-9-] and left the + sign outside of that. This will allow us to match all allowable characters up to a certain point.
Most importantly, you need to tell it how you expect your link to terminate. I'm taking a guess for you with (?:[&"\'\s])
?) Will it be in html format (e.g. anchor tag) ? If so, the link in href will obviously end with a " or '.
?) Or maybe there's more to the query string, so there would be an & after the value of v.
?) Maybe there's a space or line break after the end of the link \s.
The important piece is that you can get much more accurate results if you know what's surrounding what you are searching for, as is the case with many regular expressions.
This non-capturing group (in which I'm making assumptions for you) will take a stab at finding and ignoring all the extra junk after what you care about (the ?v=awesomeMovieHash).
Results:
http://www.youtube.com/watch?v=Zu4WXiPRek
- Group 1 contains the http://www.youtube.com/watch?v=Zu4WXiPRek
http://www.youtube.com/watch?v=Zu4WX&a=b
- Group 1 contains http://www.youtube.com/watch?v=Zu4WX
http://www.youtube.com/watch?v=!Zu4WX£&P4ek
- No match
a href="http://www.youtube.com/watch?v=Zu4WX&size=large"
- Group 1 contains http://www.youtube.com/watch?v=Zu4WX
http://www.youtube.com/watch?v=Zu4WX£&P!ek
- No match
The "v=..." blob is not guaranteed to be the first parameter in the query part of the URL. I'd recommend using PHP's parse_url() function to break the URL into its component parts. You can also reassemble a pristine URL if someone began the string with "https://" or simply used "youtube.com" instead of "www.youtube.com", etc.
function get_youtube_vidid ($url) {
$vidid = false;
$valid_schemes = array ('http', 'https');
$valid_hosts = array ('www.youtube.com', 'youtube.com');
$valid_paths = array ('/watch');
$bits = parse_url ($url);
if (! is_array ($bits)) {
return false;
}
if (! (array_key_exists ('scheme', $bits)
and array_key_exists ('host', $bits)
and array_key_exists ('path', $bits)
and array_key_exists ('query', $bits))) {
return false;
}
if (! in_array ($bits['scheme'], $valid_schemes)) {
return false;
}
if (! in_array ($bits['host'], $valid_hosts)) {
return false;
}
if (! in_array ($bits['path'], $valid_paths)) {
return false;
}
$querypairs = explode ('&', $bits['query']);
if (count ($querypairs) < 1) {
return false;
}
foreach ($querypairs as $querypair) {
list ($key, $value) = explode ('=', $querypair);
if ($key == 'v') {
if (preg_match ('/^[a-zA-Z0-9\-_]+$/', $value)) {
# Set the return value
$vidid = $value;
}
}
}
return $vidid;
}
Following regex will match any youtube link:
$pattern='#(((http(s)?://(www\.)?)|(www\.)|\s)(youtu\.be|youtube\.com)/(embed/|v/|watch(\?v=|\?.+&v=|/))?([a-zA-Z0-9._\/~#&=;%+?-\!]+))#si';