PHP extend regex to accept round brackets - php

I have an existing regex that checks whether a string is already wrapped in either quotes or square brackets. If it is not,
I wrap that string in quotes.
My existing regex is as follows:
if (!preg_match('/^["\[].*["\]]$/', $filter)) {
$filter = '%22' . $filter . '%22';
}
Now I want to extend this regex so that it also treats a string wrapped in parentheses as already wrapped.
For parentheses, my $filter value would be something like (123 456).
Can anyone help me extend this regex?

I think regular expressions are good for complex string matching, but for a simple match like this one, why use a regex? Matching takes at least O(n) time, where n is the length of the string, and with a backtracking engine some patterns can take exponential time. If you just check the first and last characters of the string with a plain if, it takes O(1). That said, here is a regex solution.
/^(?:".*"|\[.*\]|\(.*\))$/
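As a sketch of how an alternation-based pattern could be plugged into the code from the question: the helper name isWrapped() and the sample strings are made up for illustration, and the pattern is one possible formulation rather than the only one.

```php
<?php
// Hypothetical helper: true when $filter is wrapped in "...", [...] or (...).
function isWrapped(string $filter): bool
{
    // One alternative per delimiter pair, anchored at both ends of the string.
    return preg_match('/^(?:".*"|\[.*\]|\(.*\))$/s', $filter) === 1;
}

$filter = '(123 456)';
if (!isWrapped($filter)) {
    // Same wrapping as in the question: %22 is a URL-encoded double quote.
    $filter = '%22' . $filter . '%22';
}
```

Because both anchors apply to every alternative, a string such as "abc"def or abc(def) is not treated as wrapped.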

Regular expressions are a powerful tool, but they are not a Swiss Army knife. There are problems that simply cannot be solved using regex, and there are problems that can be solved using regex but for which a simpler approach produces code that is easier to read and understand. This is such a problem.
Let's reformulate the problem. If the first character of the string is " and its last character is also ", then the string is wrapped in quotes and needs no further processing. The same applies when the first character is [ and the last one is ], or ( and ).
The first character of a string stored in the variable $filter is $filter[0]. The last one is $filter[-1]. Concatenate them into a new string and search for it in a list of the accepted pairs of quotes and brackets:
if (! in_array($filter[0].$filter[-1], ['""', '[]', '()'])) {
// the string is not enclosed in quotes, square brackets or parentheses
// do something with it (enclose it, etc)
}
If you are using PHP 5 (any version) or PHP 7.0 then you are out of luck (and out of PHP updates, btw) and you cannot use $filter[-1] (because this functionality has been introduced in PHP 7.1).
The PHP function substr() comes to the rescue.
substr($filter, -1) does the same thing as $filter[-1] (it returns the last character of $filter) and works in all PHP versions.
There are two corner cases to consider:
When $filter is '"' (a string of exactly one character that is a double quote), the code above will report it as enclosed in quotes when, in fact, it is not.
When $filter is '' (the empty string), the code produces two warnings (but does not report it as being enclosed in quotes).
Both cases can be easily solved by adding a check of the string's length to avoid running the other test if the string is too short:
if (strlen($filter) < 2 || ! in_array($filter[0].$filter[-1], ['""', '[]', '()'])) {
// the string is not enclosed in quotes, square brackets or parentheses
// do something with it (enclose it, etc)
}
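Putting the pieces together, a minimal sketch could look like this. The function name wrapInQuotesIfNeeded() is hypothetical, and the %22 wrapping is taken from the question; substr() is used instead of negative offsets.

```php
<?php
// Hypothetical helper combining the length check and the first/last test.
// substr() avoids negative string offsets, which need PHP 7.1 or later.
function wrapInQuotesIfNeeded(string $filter): string
{
    $pairs = ['""', '[]', '()'];
    // Strings shorter than two characters cannot be enclosed in anything.
    if (strlen($filter) < 2
        || !in_array(substr($filter, 0, 1) . substr($filter, -1), $pairs, true)) {
        return '%22' . $filter . '%22';
    }
    return $filter;
}
```

The length test runs first, so the substr() calls are never evaluated for the empty string or a one-character string.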

Related

Simple Regex NOT on multidimensional JSON string

I will provide this simple example of a JSON string that covers most of my actual cases:
"time":1430702635,\"id\":\"45.33\",\"state\":2,"stamp":14.30702635,
I'm trying to do a preg_replace on the numbers in the string to enclose them in quotes, except for numbers whose key is quoted with escaped quotes, like \"state\":2 in my string.
My regex so far is
preg_replace('/(?!(\\\"))(\:)([0-9\.]+)(\,)/', '$2"$3"$4',$string);
The resulting string I'm trying to obtain leaves the \"state\" value unquoted, skipped by the regex because \" comes right before the :digit,
"time":"1430702635",\"id\":\"45.33\",\"state\":2,"stamp":"14.30702635",
Why is the \"state\" number also replaced?
Tried on https://regex101.com/r/xI1zI4/1 also ..
New edit:
So from what I tried,
(?!\\")
is not working !!
If I'm allowed, I will leave this unanswered in case someone else does know why.
My solution was to use this regex, instead of NOT, I went for yes ..
$string2 = preg_replace('/(\w":)([0-9\.]+)(,)/', '$1"$2"$3',$string);
Thank you.
(?!\\") is a negative lookahead, which generally isn't useful at the very beginning of a regular expression. In your particular regex, it has no effect at all: the expression (?!(\\\"))(\:) means "empty string not followed by backslash-quote, then a colon", which is equivalent to just trying to match a colon by itself.
I think what you were trying to accomplish is a negative lookbehind, which has a slightly different syntax in PCRE: (?<!\\"). Making this change seems to match what you want: https://regex101.com/r/xI1zI4/2
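As a sketch of the lookbehind in PHP, using the sample string from the question (the variable names are illustrative; a nowdoc keeps the backslashes in the input literal):

```php
<?php
// Sample input from the question, kept literal with a nowdoc.
$string = <<<'JSON'
"time":1430702635,\"id\":\"45.33\",\"state\":2,"stamp":14.30702635,
JSON;

// (?<!\\") is a negative lookbehind: the colon must not be preceded by \".
// Inside a single-quoted PHP string, \\\\ yields the two backslashes PCRE needs.
$result = preg_replace('/(?<!\\\\")(:)([0-9.]+)(,)/', '$1"$2"$3', $string);

echo $result, PHP_EOL;
```

The \"state\":2 pair is left alone because its colon is preceded by \", while the values of "time" and "stamp" are quoted.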

Regex that will match each specific tag that contains ../

I'm trying to find a regex that will match each specific tag that contains ../.
I had it matching when each element was on its own line. But then there was an instance where my HTML rendered on one line causing the regex to match the whole line:
<body><img src="../../../img.png"><img src="../../img.png"><img src="../../img.png"><img src="..//../img.png"><img src="..../../img.png">
Here was the regex that I was using
<.*[\.]{2}[\/].*>
You need to make sure to match only one tag per match.
Using a negative character class like below will accomplish that.
<[^>]*\.\./[^>]*>
< = start of tag
[^>]* = any number of characters that aren't >, since > would end the tag
\.\./ = "../" with escapes for the . characters
[^>]* = same as above
> = end of tag
It appears you might be doing this to prevent path parenting. You should know that for a URL attribute in an HTML tag, the following tags are considered "equivalent":
<img src="../foo.jpg">
<img src="%2e%2e%2ffoo.jpg">
<img src="&#46;&#46;&#47;foo.jpg">
That's because the src attribute goes through HTML entity un-escaping, and then URL un-escaping (in that order) before being used. As a result, there are 5,832 different ways to write '../' into an HTML tag's path attribute (18 ways to write each of the 3 characters, so 18 x 18 x 18).
Making a regex to match any of these encodings of ../ is more difficult, but still possible.
(\.|&#46;|(%|&#37;)(2|&#50;)([Ee]|&#69;|&#101;)){2}(/|&#47;|(%|&#37;)(2|&#50;)([Ff]|&#70;|&#102;))
For reference:
&#46; = . HTML escape sequence
&#47; = / HTML escape sequence
%2E or %2e = . URL escape sequence
%2F or %2f = / URL escape sequence
&#37; = % HTML escape sequence
&#50; = 2 HTML escape sequence
&#69; = E HTML escape sequence
&#101; = e HTML escape sequence
&#70; = F HTML escape sequence
&#102; = f HTML escape sequence
You can see why people usually say it's better to use a real HTML parser, instead of regex!
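A different route, sketched here as an alternative to matching every encoding with a regex, is to decode an attribute value in the order described above (HTML entities first, then URL escapes) and only then look for ../. The helper name containsParentReference() is hypothetical, and extracting the attribute value from the tag is assumed to happen elsewhere.

```php
<?php
// Hypothetical helper: decode the attribute value (HTML entities first,
// then URL escapes) and look for a literal "../" in the result.
function containsParentReference(string $attributeValue): bool
{
    $htmlDecoded = html_entity_decode($attributeValue, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $urlDecoded = rawurldecode($htmlDecoded);
    return strpos($urlDecoded, '../') !== false;
}

var_dump(containsParentReference('%2e%2e%2ffoo.jpg'));
```

This only covers the two decoding steps named above; it does not parse the HTML itself.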
Anyway, assuming you need this, and a full HTML parser isn't feasible, here's the version of <[^>]*[="'/]\.\./[^>]*> that also catches HTML and URL escaping:
<[^>]*[="'/](\.|&#46;|(%|&#37;)(2|&#50;)([Ee]|&#69;|&#101;)){2}(/|&#47;|(%|&#37;)(2|&#50;)([Ff]|&#70;|&#102;))[^>]*>
The regex matches the whole line because it is greedy. Try the approach that @Avinash Raj suggested in the comments.
To get the regexp you want I will try to follow a step by step approach:
First, we need some regex that matches the beginning and end of the tag. But we must be careful, as the tag end character > is allowed in single- and double-quoted strings. We first construct the regexp that matches these single/double-quoted strings: ([^"'>]|"[^"]*"|'[^']*')* (a sequence of: a character that is neither a quote (single or double) nor the tag end character, or a single-quoted string, or a double-quoted string)
Now, modify it to match a single quoted string or a double quoted string that includes a ../: ([^"'>]|"[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')* (we can simplify it, eliminating the last * operator, as we will match the whole string with only one matching ../ inside, and we can eliminate the first option, as we will have the ../ seq inside quoted strings). We get to: ("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')
To get a string matching a sequence including at least one of the second strings, we concatenate the first regex at the beginning and at the end, and the other in the middle. We get to: ([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*
Now, we only need to surround this regexp with the needed sequences first <[iI][mM][gG][ \t\n], and after >, getting to:
<[iI][mM][gG][ \t\n]([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*>
This is the regexp we need. If we extract the content of the second group ($2, \2, etc.), we get the attribute value (quotes included) that contains the ../ string.
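A minimal sketch of running this regexp from PHP follows. The sample HTML is made up for illustration, ~ is used as the delimiter so the slashes need no special treatment, and a nowdoc avoids escaping the two kinds of quotes.

```php
<?php
// The regexp from above, kept literal in a nowdoc with ~ as delimiter.
$pattern = <<<'REGEX'
~<[iI][mM][gG][ \t\n]([^"'>]|"[^"]*"|'[^']*')*("[^"]*\.\.\/[^"]*"|'[^']*\.\.\/[^']*')([^"'>]|"[^"]*"|'[^']*')*>~
REGEX;

// Hypothetical input: one tag with a > inside a quoted attribute and a ../ path,
// and one tag without any parent reference.
$html = <<<'HTML'
<p>x</p><img alt="a > b" src='../up.png'><img src="ok.png">
HTML;

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[0] is the whole tag, $match[2] the quoted value containing ../
    echo $match[0], ' => ', $match[2], PHP_EOL;
}
```

Only the first img tag is reported, and the > inside its alt attribute does not end the match early.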
Don't try to simplify this further: > characters are allowed inside single- and double-quoted strings, " is allowed in single-quoted strings, and ' is allowed in double-quoted strings. As someone explained in another answer to this question, you cannot be greedy (using .* inside, which eats as much input as possible before matching). This regexp also needs to match multiline tags, as these could be part of your input file. If you have a well-formed HTML file, you'll have no problem with this regexp.
And a final remark: an HTML tag is defined by a grammar that is regular (it is only a regular subset of the full HTML syntax), so it can be parsed with a regex (the same is not true for the complete HTML language). A regex is far more efficient and less resource-consuming than a full HTML parser. The caveats are that you have to write it (and write it well), whereas HTML parsers are easy to find and save you that work; on the other hand, you only have to write it once. Regexp parsing is a one-pass process whose cost grows (for this example, at least) linearly with the length of the input text. You'll be advised against this by people who simply don't know how to write the right regexp or don't know how to determine whether a grammar is regular.
Note:
This regexp will also match commented-out tags. If you don't want to match commented <img> tags, you'll have to extend your regexp a little or use two passes: eliminate comments first, and then parse tags (the regexp that only recognizes uncommented tags is far more complicated than this one). Also, look below for more difficulties you may run into when eliminating parent directory references.
Note 2:
As I have read in your comments to some answers, the problem you want to solve (eliminating .. references in HTML/XML sources) is not regular. The reason is that you can have . and .. references embedded in the path strings. Normally, one first eliminates the /. or ./ components of the path, getting a path without . (current directory) references. Once you have this, you have to eliminate a/.. references, where a is anything other than .. itself. This leads to eliminating occurrences of a/.., a/b/../.., etc. But the language a^i b^i is not regular (as can be shown with the pumping lemma), and you'll need a context-free grammar.
Note 3:
If you limit the number of a/b/c/../../.. levels to some maximum bound, you can still find a regexp that matches this kind of string, but there will always be an example that exceeds the bound and breaks your regexp. Remember, you first have to eliminate the single dot . path components (as you can have something like a/b/./././c/./d/.././e/f/.././../...). You first eliminate the single dot components, leading to: a/b/c/d/../e/f/../../../... Then you proceed by pairs of <non ..>/.., getting a/b/c/[d/..]/e/f/../../../.. to a/b/c/e/[f/..]/../../.. -> a/b/c/[e/..]/../.. -> a/b/[c/..]/.. -> a/[b/..] -> a (to be precise, you ought to check that the first component of each pair actually exists before eliminating it), and if you end up with an empty path, you have to change it to . for it to be usable.
I have code to do this process, but it's embedded in some bigger program. If you are interested, you can access this code. (look at the rel_path() routine here)
You cannot eliminate a .. element at the beginning of a path (or, more precisely, one that has no <non ..> counterpart), as it refers to something outside the tree, making the reference dependent on the external structure of the tree.
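The elimination process described in these notes can be sketched with a stack. This is not the rel_path() routine mentioned above; the function name normalizeRelativePath() is hypothetical, it assumes relative paths only, and it is purely lexical, so it does not check that the eliminated components exist.

```php
<?php
// Hypothetical helper: lexical clean-up of a relative path.
// It never touches the filesystem, so it cannot verify that components exist.
function normalizeRelativePath(string $path): string
{
    $stack = [];
    foreach (explode('/', $path) as $part) {
        if ($part === '' || $part === '.') {
            continue; // drop empty and current-directory components
        }
        if ($part === '..' && $stack !== [] && end($stack) !== '..') {
            array_pop($stack); // cancel the preceding <non ..> component
            continue;
        }
        $stack[] = $part; // an ordinary name, or a ".." with no counterpart
    }
    // An empty result is replaced by "." so that it stays usable.
    return $stack === [] ? '.' : implode('/', $stack);
}

echo normalizeRelativePath('a/b/c/d/../e/f/../../../..'), PHP_EOL;
```

A .. that has no counterpart stays at the front of the result, which matches the restriction in the last paragraph above.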

Regular expression doesn't quite work

I have created a Regular Expression (using php) below; which must match ALL terms within the given string that contains only a-z0-9, ., _ and -.
My expression is: '~(?:\(|\s{0,},\s{0,})([a-z0-9._-]+)(?:\s{0,},\s{0,}|\))$~i'.
My target string is: ('word', word.2, a_word, another-word).
Expected terms in the results are: word.2, a_word, another-word.
I am currently getting: another-word.
My Goal
I am detecting a MySQL function from my target string, this works fine. I then want all of the fields from within that target string. It's for my own ORM.
I suppose there could be a situation where by further parenthesis are included inside this expression.
From what I can tell, you have a list of comma-separated terms and wish to find only the ones which satisfy [a-z0-9._\-]+. If so, this should be correct (it returns the correct results for your example at least):
'~(?<=[,(])\\s*([a-z0-9._-]+)\\s*(?=[,)])~i'
The main issues were:
$ at the end, which was anchoring the query to the end of the string
When matching all, you continue from the end of the previous match. This means that if you consume a comma or closing parenthesis at the end of one match, it is no longer available to match at the beginning of the next one. I've solved this with a lookbehind ((?<=...)) and a lookahead ((?=...)).
Your backslashes need to be double escaped since the first one may be stripped by PHP when parsing the string.
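As a sketch, the pattern can be run with preg_match_all() against the target string from the question. The pattern is written in a single-quoted string here, where \s reaches PCRE unchanged; the variable names are illustrative.

```php
<?php
$target = "('word', word.2, a_word, another-word)";

// The lookbehind and lookahead check the separators without consuming them,
// so each comma stays available to the next match.
preg_match_all('~(?<=[,(])\s*([a-z0-9._-]+)\s*(?=[,)])~i', $target, $matches);

$fields = $matches[1]; // contents of the first capture group of each match
print_r($fields);
```

The quoted term 'word' is skipped because the apostrophe is not in the character class.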
EDIT: Since you said in a comment that some of the terms may be strings that contain commas you will first want to run your input through this:
$input = preg_replace('~(\'([^\']+|(?<=\\\\)\')+\'|"([^"]+|(?<=\\\\)")+")~', '"STRING"', $input);
which should replace all strings with '"STRING"', which will work fine for matching the other regex.
Maybe using a regex is overkill. For this kind of text you can just remove the parentheses and explode the string by comma.
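A minimal sketch of that approach, assuming the terms contain no commas and no nested parentheses (the final filter, which drops quoted strings, is an addition for illustration):

```php
<?php
$target = "('word', word.2, a_word, another-word)";

// Strip the surrounding parentheses, split on commas, trim each term.
$terms = array_map('trim', explode(',', trim($target, '()')));

// Keep only the terms made of a-z, 0-9, ".", "_" and "-".
$fields = array_values(array_filter($terms, function ($term) {
    return preg_match('/^[a-z0-9._-]+\z/i', $term) === 1;
}));

print_r($fields);
```

If quoted strings may contain commas, this simple split is no longer enough.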

php equals regular expression

I know I can use preg_match but I was wondering if php had a way to evaluate to a regular expression like:
if(substr($example, 0, 1) == /\s/){ echo 'whitespace!'; }
PHP does not have first-class regular expressions.
You will need to use the functions provided by the default PCRE extension. Sorry. It's a backslash-escaping nightmare, but it's all we've got.
(There's also the POSIX regex extension, deprecated since PHP 5.3 and removed in PHP 7.0, but you should not use it any longer. It is slower, less featureful, and most important, it isn't Unicode-safe. Modern PCRE versions understand Unicode very well, even if PHP itself is ignorant about it.)
With regard to the backslash-escaping nightmare, you can keep the horror to a minimum by using single quotes to enclose the string containing the regex instead of doubles, and picking an appropriate delimiter. Compare:
"/^http:\\/\\/www.foo.bar\\/index.html\\?/"
versus
'!^http://www.foo.bar/index.html\?!'
Inside single quotes, you only need to backslash-escape backslashes and single quotes, and picking a different delimiter avoids needing to escape the delimiter inside the regex.
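A quick sketch showing that the two spellings compile to the same match; the sample URL is made up for illustration.

```php
<?php
$url = 'http://www.foo.bar/index.html?page=2';

// Double quotes and "/" as delimiter: every slash in the URL needs \\/.
$doubleQuoted = "/^http:\\/\\/www.foo.bar\\/index.html\\?/";

// Single quotes and "!" as delimiter: only the literal "?" is escaped.
$singleQuoted = '!^http://www.foo.bar/index.html\?!';

var_dump(preg_match($doubleQuoted, $url), preg_match($singleQuoted, $url));
```

In both patterns the dots are left unescaped, as in the comparison above, so they match any character.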
:)
if(substr($example, 0, 1) == " "){ echo 'whitespace!';}
You should not be using regexp when it is not needed.
There would also be the microoptimization option:
if (strstr(" \t\r\n", $example[0])) {
The older {0} syntax for getting the first character did the same thing as [0], but it was deprecated in PHP 7.4 and removed in PHP 8.0. strstr simply checks whether the character is contained in the list of whitespace characters. Another option would be strspn, at least in your example case.
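For comparison, here is a sketch of the check written with preg_match() and with strspn(). Both helper names are hypothetical, and the strspn() variant only knows the four whitespace characters listed in its mask.

```php
<?php
// Hypothetical helper: regex version, \s covers the usual whitespace characters.
function startsWithWhitespaceRegex(string $s): bool
{
    return preg_match('/^\s/', $s) === 1;
}

// Hypothetical helper: strspn() counts how many of the first $length characters
// (here 1, starting at offset 0) belong to the given mask.
function startsWithWhitespaceSpn(string $s): bool
{
    return strspn($s, " \t\r\n", 0, 1) === 1;
}

if (startsWithWhitespaceSpn(" example")) {
    echo 'whitespace!';
}
```

Both versions return false for the empty string, so no separate length check is needed.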

How do I match a square bracket literal using RegEx?

What's the regex to match a square bracket? I'm using \\] in a pattern in eregi_replace, but it doesn't seem to be able to find a ]...
\] is correct, but note that PHP itself ALSO has \ as an escape character, so you might have to use \\] (or a different kind of string literal).
Works flawlessly:
<?php
$hay = "ab]cd";
echo eregi_replace("\]", "e", $hay);
?>
Output:
abecd
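Since eregi_replace() was removed in PHP 7.0, the same test can be sketched with preg_replace(), which needs pattern delimiters (the variable names are illustrative):

```php
<?php
$hay = "ab]cd";

// Escaped closing bracket outside a character class.
$escaped = preg_replace('/\]/', 'e', $hay);

// The same bracket inside a character class.
$inClass = preg_replace('/[\]]/', 'e', $hay);

// An opening bracket is escaped the same way.
$open = preg_replace('/\[/', 'e', "ab[cd");

echo $escaped, PHP_EOL;
```

Single-quoted strings are used so that the backslash reaches the regex engine unchanged.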
There are two ways of doing this:
/ [\]] /x;
/ \] /x;
You may consider the latter the better option, and indeed I would use it in simpler regexps, but I consider the former the better option for larger regexps. Consider the following:
/ (\w*) ( [\d\]] ) /x;
/ (\w*) ( \d | \] ) /x;
In this example, the former is my preferred solution. It does a better job of combining the separate entities, which may each match at the given location. It may also have some speed benefits, depending on implementation.
Note: This is in Perl syntax, partly to ensure proper highlighting.
In PHP you may need to double up on the back-slashes.
"[\\]]" and "\\]"
You don't need to escape it: if isolated, a ] is treated as a regular character.
Tested with eregi_replace and preg_replace.
[ is another beast, you have to escape it. Looks like single and double quotes, single or double escape are all treated the same by PHP, for both regex families.
Perhaps your problem is elsewhere in your expression, you should give it in full.
In .NET you escape special characters by adding a backslash (\) in front of them, so [ becomes \[.
Though since you normally do this in string literals you would either have to do something like this;
@"\["
or something like this;
"\\["
Your problem may come from the fact that you are using eregi_replace with the first parameter enclosed in single quotes:
'\['
In double quotes, though, it could work depending on the context, since that changes the way the parameter is passed to the function (single quotes pass the string almost without interpretation, hence the need to double the "\" character).
Here, if "\[" is interpreted as an escape sequence, you still need to double the "\".
Note: based on your comment, you may try the regex
<\s*(?:br|p)\s*\/?\s*\>\s*\[
in order to detect a [ right after a <br> or a <p>
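A sketch of that pattern with preg_match(): the sample strings are made up, ~ is used as the delimiter so the / needs no escape, and the i flag is an added assumption so that <BR> is accepted too.

```php
<?php
// Matches a "[" that follows a <br> or <p> tag, allowing optional whitespace
// and an optional "/" before the closing ">".
$pattern = '~<\s*(?:br|p)\s*/?\s*>\s*\[~i';

$samples = ['<br /> [note]', '<p>[x]', '<BR>[y]', '<div>[z]', '<pre>[w]'];

foreach ($samples as $sample) {
    echo $sample, ' => ', preg_match($pattern, $sample), PHP_EOL;
}
```

The last two samples do not match because the tag name is neither br nor p.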
