How do you create a string to match a regex? - php

I need to write documentation for a text-formatting syntax. I know the regexes that are used to format the text, but I don't know how to come up with an example string for each regex.
This one should be an internal link:
'{\[((?:\#|/)[^ ]*) ([^]]*)\]}'
Can anyone create an example that would match this, and maybe explain how they got it? I got stuck at '?'.
I have never used this metacharacter at the beginning of a group; usually I use it to mark that a literal may be absent or appear exactly once.
Thanks

(?:...) has the same grouping effect as (...), but without "capturing" the contents of the group; see http://php.net/manual/en/regexp.reference.subpatterns.php.
So, (?:\#|/) means "either # or /".
I'm guessing you know that [^ ]* means "zero or more characters that aren't spaces", and that [^]]* means "zero or more characters that aren't right square brackets".
Putting it together, one possible string is this:
'{[/abcd asdfasefasdc]}'
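A minimal PHP sketch for checking candidate strings against the pattern from the question. The sample strings are made up for illustration, and the outer { and } are assumed to be the PCRE delimiters rather than part of the text to match:

```php
<?php
// Pattern from the question; the outer { } act as PCRE delimiters in PHP.
$pattern = '{\[((?:\#|/)[^ ]*) ([^]]*)\]}';

// Hypothetical sample strings, invented for this illustration.
$samples = [
    '[/abcd asdfasefasdc]',
    '[#top Back to top]',
    '[abcd no leading hash or slash]',
];

$results = [];
foreach ($samples as $sample) {
    // Store both capture groups on a match, or null when nothing matches.
    $results[$sample] = preg_match($pattern, $sample, $m) === 1
        ? [$m[1], $m[2]]
        : null;
}

var_dump($results);
```

The first two samples match because the character right after [ is / or #; the third does not, so its entry stays null.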

See Open source RegexBuddy alternatives and Online regex testing for some helpful tools. It's easiest to have a regex explained by them first. I used YAPE here:
NODE EXPLANATION
----------------------------------------------------------------------
\[ '['
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\# '#'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
[^ ]* any character except: ' ' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
' '
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[^]]* any character except: ']' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\] ']'
----------------------------------------------------------------------
This is under the presumption that { and } in your example are the regex delimiters.
You can just read through the list of explanations and come up with a possible source string such as:
[#NOSPACE NOBRACKET]

I think this is a good post to help with designing regexes. While it's fairly easy to write a
general regex to match a string, sometimes it's helpful to look at it in reverse after
it's designed. Sometimes it is necessary to see what bizarre things will match.
When mixing in a lot of metacharacters as literals, it's fairly important to format
this kind of regex for ease of reading and to avoid errors.
Here are some samples in Perl, which was easier (for me) to prototype in.
my @samps = (
    '{[/abcd asdfasefasdc]}',
    '{[# ]}',
    '{[# /# \/]}',
    '{[/# {[
| /# {[#\/} ]}',
);
# Print the whole match and both capture groups for every sample that matches.
for (@samps) {
    if (m~\{\[([#/][^ ]*) ([^]]*)\]\}~)
    {
        print "Found: '$&'\ngrp1 = '$1'\ngrp2 = '$2'\n===========\n\n";
    }
}
__END__
Expanded
\{\[
(
[#/][^ ]*
)
[ ]
(
[^\]]*
)
\]\}
Output
Found: '{[/abcd asdfasefasdc]}'
grp1 = '/abcd'
grp2 = 'asdfasefasdc'
===========
Found: '{[# ]}'
grp1 = '#'
grp2 = ''
===========
Found: '{[# /# \/]}'
grp1 = '#'
grp2 = '/# \/'
===========
Found: '{[/# {[
| /# {[#\/} ]}'
grp1 = '/# {[
|'
grp2 = '/# {[#\/} '
===========

Related

PHP autodetect translatables / detect piece of code by regex

I have a multilanguage site that stores its translatables in a default.php file, which is filled with an array containing all the keys.
I would prefer to generate it automatically.
I already have a (singleton) class that is able to detect all my files by type (controller, action, view, model, etc.).
I would like to detect any piece of code whose format is like this:
$this->translate('[a-zA-Z]');
$view->translate('[a-zA-Z]');
getView()->translate('[a-zA-Z]');
throw new Exception('[a-zA-Z]');
addMessage(array('message' => '[a-zA-Z]'));
However, a match must be filtered out when the string starts with or contains:
$this->translate('((0-9)+_)[a-zA-Z]');
$this->translate('[a-zA-Z]' . $* . '[a-zA-Z]'); // Only a variable in the middle must be filtered; one at the beginning or end is still allowed
Of course, [a-zA-Z] is just an example regex.
As I said, I already have a class that detects certain files. This class also makes use of Reflection (or in this case Zend Reflection, as I'm using Zend). However, I could not see a way to reflect a function using regex.
The action will run from a cron job and from a manually called action, so it is not a big issue if the memory used is a bit 'too' large.
Description
[$]this->translate[(]'((?:[^'\\]|\\.|'')*)'[)];
This regular expression will do the following:
matches code blocks starting with $this->translate(' through the closing ');
places the value inside the ' quotes into capture group 1
avoids messy edge cases where the substring may contain what looks like a closing '); when in reality the characters are escaped.
Example
Live Demo
https://regex101.com/r/eC5xQ6/
Sample text
$This->Translate('(?:Droids\');{2}');
$NotTranalate('fdasad');
$this->translate('[a-zA-Z]');
Sample Matches
MATCH 1
1. [17-33] `(?:Droids\');{2}`
MATCH 2
1. [79-87] `[a-zA-Z]`
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
[$] any character of: '$'
----------------------------------------------------------------------
this->translate 'this->translate'
----------------------------------------------------------------------
[(] any character of: '('
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^'\\] any character except: ''', '\\'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
\\ '\'
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
'' '\'\''
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[)] any character of: ')'
----------------------------------------------------------------------
; ';'
----------------------------------------------------------------------
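As a rough sketch, the expression can be run from PHP with preg_match_all to collect the keys. The ~ delimiters and the i flag are assumptions added here; the flag is what lets the capitalised $This->Translate line in the sample text match:

```php
<?php
// Pattern from the answer, wrapped in ~ delimiters. The "i" flag is an
// assumption: without it the capitalised first sample line would not match.
$pattern = <<<'REGEX'
~[$]this->translate[(]'((?:[^'\\]|\\.|'')*)'[)];~i
REGEX;

// Sample text from the answer, kept verbatim in a nowdoc so that no
// backslash or dollar sign is interpreted by PHP.
$source = <<<'CODE'
$This->Translate('(?:Droids\');{2}');
$NotTranalate('fdasad');
$this->translate('[a-zA-Z]');
CODE;

preg_match_all($pattern, $source, $matches);

// Capture group 1 holds the text between the quotes of each call.
$keys = $matches[1];
print_r($keys);
```

The escaped quote in the first call stays inside the captured key, and the $NotTranalate line is skipped.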

Regex for validating numbers preceding with two dashes (hyphens)

I got stuck on a regexp to validate only numbers from 1 to 10 that may have two dashes (hyphens) before them, for example:
--9
or
--10
or
--1
but not
--11 or not --0
I tried what seems to me like everything, for example:
/(-\-\[1-10])/
What is wrong?
EDIT 1:
Thanks a lot for so many working examples!!
What if I also wanted to validate two numbers, with one before all of this? For example:
8--10 but not 0--10 or not 11--11
I tried this but it didn't work:
/--([1-9]|10:[1-9]|10)\b/
EDIT 2:
Oh, this one works, finally:
/^(10|[1-9])--(10|[1-9])$/
Have a try with:
/\b(?:[1-9]|10)--(?:[1-9]|10)\b/
Changed according to the OP's edit.
Explanation:
The regular expression:
(?-imsx:\b(?:[1-9]|10)--(?:[1-9]|10)\b)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[1-9] any character of: '1' to '9'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
10 '10'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
-- '--'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[1-9] any character of: '1' to '9'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
10 '10'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
I guess this will fit
/\-\-([1-9]|10)\b/
if you don't want to capture your number, add ?: :
/\-\-(?:[1-9]|10)\b/
Outside a character class, you don't need to escape hyphens. Also, your character class [1-10] will only match 1 and 0, because [1-10] is equal to [10] and that will only match 1 and 0. Try this regex:
/^--(10|[1-9])$/
The correct regex is
/--([1-9]|10)\b/
You're incorrectly escaping the first [ of your character class as \[. The character class used is incorrect as well. It would be treated as a character class with members 1 to 1 and a 0 i.e. [10] which means it matches either 0 or 1.
Also, the hyphens - don't need to be escaped outside a character class []. To validate the numbers that come before the hyphens as well use
/\b([1-9]|10)--([1-9]|10)\b/
When you write [1-10], it means the characters 1 to 1 plus the character 0. It is as if you had written [0-1].
In fact, in your case, it would be better to test the cases --1 to --9 and the case --10 separately, with something like: /^(--10|--[1-9])$/
You can test your regex on http://myregexp.com/
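The anchored patterns from this thread can also be checked with a small PHP sketch. The helper name matchesPattern is made up for this example; the inputs are the ones mentioned in the question:

```php
<?php
// Anchored pattern from the answers: "--" followed by a number from 1 to 10.
$single = '/^--(10|[1-9])$/';
// Pattern from the asker's second edit: two such numbers around "--".
$pair = '/^(10|[1-9])--(10|[1-9])$/';

// Hypothetical helper: true when $input matches $pattern.
function matchesPattern(string $pattern, string $input): bool
{
    return preg_match($pattern, $input) === 1;
}

foreach (['--1', '--9', '--10', '--0', '--11'] as $input) {
    printf("%-7s %s\n", $input, matchesPattern($single, $input) ? 'valid' : 'invalid');
}
foreach (['8--10', '0--10', '11--11'] as $input) {
    printf("%-7s %s\n", $input, matchesPattern($pair, $input) ? 'valid' : 'invalid');
}

// [1-10] is the range 1-1 plus the character 0, so it matches a single
// character that is either 1 or 0, never the number 10.
var_dump(matchesPattern('/^[1-10]$/', '0')); // bool(true)
var_dump(matchesPattern('/^[1-10]$/', '7')); // bool(false)
```

Putting 10 before [1-9] in the alternation is not strictly required once the pattern is anchored, but it keeps the intent obvious.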

Understanding Pattern in preg_match_all() Function Call

I am trying to understand how preg_match_all() works and when looking at the documentation on the php.net site, I see some examples but am baffled by the strings sent as the pattern parameter. Is there a really thorough, clear explanation out there? For example, I don't understand what the pattern in this example means:
preg_match_all("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x",
"Call 555-1212 or 1-800-555-1212", $phones);
or this:
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);
I've taken an introductory class on PHP, but never saw anything like this. Some clarification would be appreciated.
Thanks!
Those aren't "PHP patterns", those are Regular Expressions. Instead of trying to explain what has been explained before a thousand times in this answer, I'll point you to http://regular-expressions.info for information and tutorials.
You are looking for this,
PHP PCRE Pattern Syntax
PCRE Standard syntax
Note that first one is a subset of second one.
Also have a look at YAPE, which for example gives this nice textual explanation for your first regex:
(?x-ims:\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4})
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?x-ims: group, but do not capture (disregarding
whitespace and comments) (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n):
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
)? end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
\)? ')' (optional (matching the most amount
possible))
----------------------------------------------------------------------
(?(1) if back-reference \1 matched, then:
----------------------------------------------------------------------
[\-\s] any character of: '\-', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| else:
----------------------------------------------------------------------
succeed
----------------------------------------------------------------------
) end of conditional on \1
----------------------------------------------------------------------
\d{3} digits (0-9) (3 times)
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The pattern you write about is a mini-language of its own called Regular Expressions. It specializes in finding patterns in strings, doing replacements, etc., for everything that follows some sort of pattern.
More specifically, it's a Perl Compatible Regular Expression (PCRE).
The handbook for that language is not available on the PHP manual website; you can find it here: PCRE Manpage.
A well-made step-by-step introduction is on the Regular Expressions Info Website.
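To see what the two patterns from the question actually capture, it can help to run them and print the results. This is only a sketch: the patterns are written as single-quoted strings here (an editorial choice, so that PHP leaves the backslashes alone), and the comments describe what each capture group is assumed to be for:

```php
<?php
// Example 1: the /x flag makes PCRE ignore the literal spaces in the pattern.
// Group 1 is an optional three-digit area code; the conditional (?(1) ...)
// only demands a separator when that group took part in the match.
$phonePattern = '/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x';
preg_match_all($phonePattern, 'Call 555-1212 or 1-800-555-1212', $phones);

// $phones[0] lists the full matches, $phones[1] the area codes.
print_r($phones[0]);

// Example 2: \2 is a backreference to the tag name captured by group 2,
// so the closing tag must use the same name as the opening tag.
$html = '<b>bold text</b><a href=howdy.html>click me</a>';
preg_match_all('/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/', $html, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $matches is one complete match.
foreach ($matches as $m) {
    echo "tag={$m[2]} content={$m[3]}\n";
}
```

The first call finds 555-1212 and 800-555-1212; the second finds one match per tag pair.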

What does this `#((?<=\?)|&)openid\.[^&]+#` regexp mean?

So I am trying to read some PHP code... I found this line:
$uri = rtrim(preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $_SERVER['REQUEST_URI']), '?');
What does it mean? And if (as it seems to me) it just returns the 'file name', why is it so complicated?
The purpose of that line is to remove values like openid.something=value from the request URI.
There are tools out there to translate regex into prose, with an aim to help you understand what a regex is trying to match. For example, when yours is passed to such a tool the description comes back as:
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\? '?'
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
& '&'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
openid 'openid'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
[^&]+ any character except: '&' (1 or more times
(matching the most amount possible))
As the above says, the regex looks for a ? or & followed by openid., followed by anything that is not &. The resulting match will include the preceding & if there is one, but not the ?, since a look-behind was used for the latter.
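A small sketch makes the effect visible. The function name stripOpenidParams and the sample URIs are hypothetical; the function only wraps the line from the question so that it can be fed test strings instead of $_SERVER['REQUEST_URI']:

```php
<?php
// Hypothetical wrapper around the line from the question.
function stripOpenidParams(string $requestUri): string
{
    // Remove every openid.* parameter, then drop a dangling "?" at the end.
    return rtrim(preg_replace('#((?<=\?)|&)openid\.[^&]+#', '', $requestUri), '?');
}

// Made-up request URIs for illustration.
$onlyOpenid = stripOpenidParams('/index.php?openid.mode=cancel');
$trailing   = stripOpenidParams('/index.php?page=2&openid.mode=id_res&openid.sig=abc');
$leading    = stripOpenidParams('/index.php?openid.mode=id_res&page=2');

echo $onlyOpenid, "\n"; // /index.php
echo $trailing, "\n";   // /index.php?page=2
echo $leading, "\n";    // /index.php?&page=2
```

The last case shows a quirk of the pattern: when an openid parameter comes first and other parameters follow, the ? is kept and the following & is left behind.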

How does this regex divide text into sentences?

I know this regex divides a text into sentences. Can someone help me understand how?
/(?<!\..)([\?\!\.])\s(?!.\.)/
You can use YAPE::Regex::Explain to decipher Perl regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
(?<! look behind to see if there is not:
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
) end of look-behind
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\?\!\.] any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
There is the Regular Expression Analyzer, which does much the same as what toolic already suggested, but is completely web-based.
(? # Find a group (don't capture)
< # before the following regular expression
! # that does not match
\. # a literal "."
. # followed by 1 character
) # (End look-behind group)
( # Start a group (capture it to $1)
[\?\!\.] # Containing any one of the characters in the following set "?!."
) # End group $1
\s # followed by a whitespace character " ", \t, etc.
(? # Followed by a group (don't capture)
# after the preceding regular expression
! # that does not have
. # 1 character
\. # followed by a literal "."
) # (End look-ahead group)
The first part (?<!\..) is a negative look-behind. It specifies a pattern which invalidates the match. In this case it's looking for two characters--the first a period and the other one any character.
The second part is a standard capture/group, which could be better expressed: ([?!.]) (you don't need the escapes in the class brackets), that is a sentence ending punctuation character.
The next part is a single (??) white-space character: \s
And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.
This should work, relatively well. But I don't think I would recommend it. I don't see what the coder was getting at trying to make sure that just a period wasn't the second most recent character, or that it wasn't the second one to come.
I mean if you are looking to split on terminal punctuation, why don't you want to guard against the same class being two-back or two-ahead? Instead it relies on periods not being there. Thus a more regular expression would be:
/(?<![?!.].)([?!.])\s(?!.[?!.])/
Portions:
([\?\!\.])\s: split on an ending character (., ! or ?) that is followed by a whitespace character (space, tab, newline)
(?<!\..): where the two characters before this 'ending character' aren't a . followed by anything
(?!.\.): after the whitespace character, any character directly followed by a . isn't allowed.
Those look-ahead ((?!) and look-behind ((?<!) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).
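A short PHP sketch of the split, assuming the regex is passed to preg_split (the question does not say how it is used) and using a made-up sample sentence:

```php
<?php
// Regex from the question.
$pattern = '/(?<!\..)([\?\!\.])\s(?!.\.)/';

// Made-up sample text containing the abbreviation "e.g.".
$text = 'I like e.g. apples. Do you? Yes!';

// Plain split: the punctuation and the whitespace after it are consumed.
$sentences = preg_split($pattern, $text);
print_r($sentences);

// PREG_SPLIT_DELIM_CAPTURE keeps the captured punctuation as separate items.
$withPunctuation = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($withPunctuation);
```

The period that ends "e.g." is not treated as a sentence end, because the two characters before it are ".g", which is exactly what the look-behind rejects.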
