Regex to match uri after mod_rewrite - php

I'm looking for a PHP PCRE regex to match URIs that have been rewritten with Apache's mod_rewrite module. The URIs look as follows:
/param1/param2/param3/param4
The rules for the URI are:
must contain at least one /;
the params must only allow letters, numbers, - and _;
there must be zero or more instances of the first two rules;

/^\/[a-zA-Z0-9_\-\/]+$/
I am assuming that it must start with a / (hence the ^ anchor) and that something like this should not match: /param1/param2/param3/param4*

How about:
if (preg_match('~^(?:/[\w-]+)+/?$~', $string)) {
    // do stuff
}
Explanation:
The regular expression:
(?-imsx:^(?:/[\w-]+)+/?$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                      the beginning of the string
----------------------------------------------------------------------
  (?:                    group, but do not capture (1 or more times
                         (matching the most amount possible)):
----------------------------------------------------------------------
    /                    '/'
----------------------------------------------------------------------
    [\w-]+               any character of: word characters (a-z,
                         A-Z, 0-9, _), '-' (1 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
  )+                     end of grouping
----------------------------------------------------------------------
  /?                     '/' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
  $                      before an optional \n, and the end of the
                         string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
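As a quick check of the pattern explained above, here is a minimal sketch in Python rather than PHP. It assumes Python's re module handles this particular pattern the same way PCRE does; re.ASCII is added so that \w stays limited to a-z, A-Z, 0-9 and _, and the helper name is_clean_uri is made up for illustration.

```python
import re

# Same pattern as in the answer. re.ASCII keeps \w to [A-Za-z0-9_],
# because Python would otherwise also accept non-ASCII letters.
URI_RE = re.compile(r'^(?:/[\w-]+)+/?$', re.ASCII)


def is_clean_uri(uri):
    """Hypothetical helper: True if uri is one or more /segment parts."""
    return URI_RE.match(uri) is not None


if __name__ == "__main__":
    for candidate in ("/param1/param2/param3/param4",
                      "/param1/param2/param3/param4*",
                      "param1/param2",
                      "/a//b"):
        print(candidate, is_clean_uri(candidate))
```

Only the first candidate is accepted: the second ends in a character outside the class, the third does not start with a slash, and the fourth contains an empty segment.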

Related

Regular Expression to match suspect words on a string

I am developing a "word filter" class in PHP that, among other things, needs to catch purposely misspelled words. These words are entered by the user as part of a sentence. Here is a simple example of a sentence entered by a user:
I want a coke, sex, drugs and rock'n'roll
The example above is a common phrase, written correctly. My class will find the suspect words sex and drugs and everything will be fine.
But I expect that the user will try to hinder detection by writing things a little differently. In fact there are many ways to write the same word so that it is still readable to some people. For example, the word sex may be written as s3x or 5ex or 53x or s e x or s 3 x or s33x or 5533xxx or ss 33 xxx and so on.
I know the basics of regular expressions and tried the pattern below:
/(\b[\w][\w .'-]+[\w]\b)/g
Because of
\b word boundary
[\w] The word can start with one letter or one digit...
[\w .'-] ... followed by any letter, digit, space, dot, quotes or dash...
+ ... one or more times...
[\w] ... ending with one letter or one digit.
\b word boundary
This works partially.
If the sample phrase was written as I want a coke, 5 3 x, druuu95 and r0ck'n'r011 I get 3 matches:
I want a coke
5 3 x
druuu95 and r0ck'n'r011
What I need is 8 matches
I
want
a
coke
5 3 x
druuu95
and
r0ck'n'r011
In short, I need a regular expression that gives me each word of a sentence, even if the word begins with a digit, contains a variable number of digits, spaces, dots, dashes and quotes, and ends with a letter or digit.
Any help will be appreciated.
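The three-match behaviour described in the question can be reproduced with a small sketch, written here in Python instead of PHP on the assumption that this pattern behaves the same in both engines. The function name greedy_chunks is invented for the example. Because the middle character class includes the space, one match can run across several words until it hits a character outside the class, such as the comma.

```python
import re

# The asker's pattern without the /g flag; findall already returns
# every non-overlapping match.
PATTERN = re.compile(r"(\b[\w][\w .'-]+[\w]\b)")


def greedy_chunks(sentence):
    """Hypothetical helper: return the chunks the pattern captures."""
    return PATTERN.findall(sentence)


if __name__ == "__main__":
    sample = "I want a coke, 5 3 x, druuu95 and r0ck'n'r011"
    for chunk in greedy_chunks(sample):
        print(repr(chunk))
```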
Description
Typically good words are 2 or more letters long (with the exception of I and a) and do not contain numbers. This expression isn't flawless, but does help illustrate why doing this type of language matching is absurdly difficult because it's an arms race between creative people trying to express themselves without getting caught, and the development team who is trying to catch flaws.
(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)
This regular expression will do the following:
find all acceptable words
find all the rest and store them in Capture Group 1
Example
Live Demo
https://regex101.com/r/cL2bN1/1
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  \s+                    whitespace (\n, \r, \t, \f, and " ") (1
                         or more times (matching the most amount
                         possible))
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  \A                     the beginning of the string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[#'"[({]?                any character of: '#', ''', '"', '[', '(',
                         '{' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  (?:                    group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}             any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                  whitespace (\n, \r, \t, \f, and " ")
                         (1 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
  ){3}                   end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  [a-zA-Z'-]{2,}         any character of: 'a' to 'z', 'A' to
                         'Z', ''', '-' (at least 2 times
                         (matching the most amount possible))
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  [ia]                   any character of: 'i', 'a'
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  i                      'i'
----------------------------------------------------------------------
  [nst]                  any character of: 'n', 's', 't'
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  o                      'o'
----------------------------------------------------------------------
  [fnr]                  any character of: 'f', 'n', 'r'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[?!.,;:'")}\]]?          any character of: '?', '!', '.', ',', ';',
                         ':', ''', '"', ')', '}', '\]' (optional
                         (matching the most amount possible))
----------------------------------------------------------------------
(?=                      look ahead to see if there is:
----------------------------------------------------------------------
  (?:                    group, but do not capture:
----------------------------------------------------------------------
    \s                   whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    \Z                   before an optional \n, and the end of
                         the string
----------------------------------------------------------------------
  )                      end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  (?:                    group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}             any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                  whitespace (\n, \r, \t, \f, and " ")
                         (1 or more times (matching the most
                         amount possible))
----------------------------------------------------------------------
  ){3}                   end of grouping
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  .*?                    any character except \n (0 or more times
                         (matching the least amount possible))
----------------------------------------------------------------------
  \b                     the boundary between a word char (\w)
                         and something that is not a word char
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------

PHP Regex on CD Tracklists

I'm using preg_match to format tracklists so the track number, title and duration are separated into their own cells in a table:
<td>01</td><td>Track Title</td><td>01:23</td>
The problem is that the tracks themselves can take any of the following forms (the leading zeros on the track numbers and durations are not always present):
01. Track Title (01:23)
01. Track Title 01:23
01. Track Title
1 Track Title (01:23)
1 Track Title 01:23
1 Track Title
The following only works on tracks with a timestamp:
/([0-9]+)\.?[\s+](.*)[\s+](\(?[0-5]?[0-9]:[0-5][0-9]\)?)/
So I added ? to the timestamp:
/([0-9]+)\.?[\s+](.*)[\s+](\(?[0-5]?[0-9]:[0-5][0-9]\)?)?/
This then works for tracks without a timestamp, but tracks with a timestamp end up with the timestamp stuck with the title, like so:
<td>01</td><td>Track Title 01:23</td><td></td>
EDIT: The tracklists are plaintext and are being pulled from an SQL table before parsing.
Try this:
/^([0-9]+)\.?[\s]+(.*)([\s]+(\(?[0-5]?[0-9]:[0-5][0-9]\)?))?$/U
Note that I used the ungreedy pattern modifier U so that each quantifier matches the shortest possible string, and I anchored the pattern to the beginning and end of the string.
By default, regular expressions are greedy, so the part that matches the title, .*, eats the rest of the string, because the last part with the duration is optional.
Use the /U modifier to turn on ungreedy behavior; look for PCRE_UNGREEDY at
http://us1.php.net/manual/en/reference.pcre.pattern.modifiers.php
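To illustrate what the U modifier does to the pattern above, here is a rough emulation in Python. Python's re module has no equivalent flag, so this sketch assumes the same effect can be had by writing every quantifier in its lazy form by hand (+? for +, ?? for ?, *? for *). The function name split_track is made up for the example.

```python
import re

# PCRE's U modifier flips the default greediness of every quantifier.
# Assumption: spelling each quantifier out in its lazy form gives the
# same behaviour in Python, which has no such flag.
UNGREEDY_RE = re.compile(
    r'^([0-9]+?)\.??[\s]+?(.*?)([\s]+?(\(??[0-5]??[0-9]:[0-5][0-9]\)??))??$'
)


def split_track(line):
    """Hypothetical helper: return (number, title, duration-or-None)."""
    m = UNGREEDY_RE.match(line)
    if m is None:
        return None
    # Group 3 is the whitespace plus duration; group 4 is the duration
    # itself, still wrapped in parentheses when the input had them.
    number, title, _, duration = m.groups()
    return number, title, duration


if __name__ == "__main__":
    for line in ("01. Track Title (01:23)",
                 "1 Track Title 01:23",
                 "1 Track Title"):
        print(split_track(line))
```

Because the title group is lazy and the pattern is anchored at both ends, the title stops as soon as the remainder can be read as an optional duration followed by the end of the string.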
How about:
^([0-9]+)\.?\s+(.*?)(?:\(?([0-5]?[0-9]:[0-5][0-9])\)?)?$
explanation:
The regular expression:
(?-imsx:^([0-9]+)\.?\s+(.*?)(?:\(?([0-5]?[0-9]:[0-5][0-9])\)?)?$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                      the beginning of the string
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    [0-9]+               any character of: '0' to '9' (1 or more
                         times (matching the most amount
                         possible))
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  \.?                    '.' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
  \s+                    whitespace (\n, \r, \t, \f, and " ") (1 or
                         more times (matching the most amount
                         possible))
----------------------------------------------------------------------
  (                      group and capture to \2:
----------------------------------------------------------------------
    .*?                  any character except \n (0 or more times
                         (matching the least amount possible))
----------------------------------------------------------------------
  )                      end of \2
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    \(?                  '(' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
    (                    group and capture to \3:
----------------------------------------------------------------------
      [0-5]?             any character of: '0' to '5' (optional
                         (matching the most amount possible))
----------------------------------------------------------------------
      [0-9]              any character of: '0' to '9'
----------------------------------------------------------------------
      :                  ':'
----------------------------------------------------------------------
      [0-5]              any character of: '0' to '5'
----------------------------------------------------------------------
      [0-9]              any character of: '0' to '9'
----------------------------------------------------------------------
    )                    end of \3
----------------------------------------------------------------------
    \)?                  ')' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
  $                      before an optional \n, and the end of the
                         string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
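The pattern explained above can be tried on the sample track lines with a short sketch, written in Python on the assumption that the pattern behaves the same as under PCRE. The names parse_track and to_row are invented for illustration. One detail worth noticing: the lazy title group keeps the space that precedes the duration, so the sketch trims it.

```python
import html
import re

# The second answer's pattern, unchanged.
TRACK_RE = re.compile(
    r'^([0-9]+)\.?\s+(.*?)(?:\(?([0-5]?[0-9]:[0-5][0-9])\)?)?$'
)


def parse_track(line):
    """Hypothetical helper: return (number, title, duration) or None."""
    m = TRACK_RE.match(line)
    if m is None:
        return None
    number, title, duration = m.groups()
    # The title may end with the space before the duration; strip it.
    return number, title.rstrip(), duration or ''


def to_row(line):
    """Hypothetical helper: render one track as table cells."""
    parsed = parse_track(line)
    if parsed is None:
        return None
    return ''.join('<td>%s</td>' % html.escape(cell) for cell in parsed)


if __name__ == "__main__":
    for line in ("01. Track Title (01:23)", "1 Track Title"):
        print(to_row(line))
```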

Trying to do preg_match as well as match entire string length

Is there a way to incorporate preg_match with total string length? I need to be able to match alphanumeric, with single underscores inside the string, with a total string length <= n.
Currently what I'm working with is this:
preg_match('/^[A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$/',$string) && (strlen($string) <= 10)
I have played around with this for too long, trying to incorporate the entire thing into preg_match, so just tacked on the && strlen, but I'm sure there is a better way to do this.
Have a try with:
preg_match('/^(?=[A-Za-z0-9]*(?:_[A-Za-z0-9]+)*$).{1,10}$/', $string)
edit according to comments:
/^(?=[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$).{5,25}$/
explanation:
The regular expression:
(?-imsx:^(?=[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$).{5,25}$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                      the beginning of the string
----------------------------------------------------------------------
  (?=                    look ahead to see if there is:
----------------------------------------------------------------------
    [A-Za-z0-9]+         any character of: 'A' to 'Z', 'a' to
                         'z', '0' to '9' (1 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
    (?:                  group, but do not capture (0 or more
                         times (matching the most amount
                         possible)):
----------------------------------------------------------------------
      _                  '_'
----------------------------------------------------------------------
      [A-Za-z0-9]+       any character of: 'A' to 'Z', 'a' to
                         'z', '0' to '9' (1 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
    )*                   end of grouping
----------------------------------------------------------------------
    $                    before an optional \n, and the end of
                         the string
----------------------------------------------------------------------
  )                      end of look-ahead
----------------------------------------------------------------------
  .{5,25}                any character except \n (between 5 and 25
                         times (matching the most amount possible))
----------------------------------------------------------------------
  $                      before an optional \n, and the end of the
                         string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Infos on look around
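The look-ahead in the edited pattern checks the shape of the whole string, and .{5,25} then checks its length. A minimal sketch in Python (assumed to behave like PCRE for this pattern; the name is_valid_name is made up) shows both conditions rejecting input independently:

```python
import re

# The look-ahead validates the structure up to the end of the string;
# .{5,25} then enforces the overall length.
NAME_RE = re.compile(r'^(?=[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$).{5,25}$')


def is_valid_name(s):
    """Hypothetical helper: True if s has the right shape and length."""
    return NAME_RE.match(s) is not None


if __name__ == "__main__":
    for candidate in ("abc_def", "abcd", "abc__def", "_abcde", "a" * 26):
        print(candidate, is_valid_name(candidate))
```

Only "abc_def" passes: "abcd" is too short, "abc__def" has a double underscore, "_abcde" starts with an underscore, and the last candidate is 26 characters long.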

Currency Regular Expression match with numeric range

I'm using this regular expression:
^\$?([0-9]{1,3},([0-9]{3},)*[0-9]{3}|[0-9]+)(.[0-9][0-9])?$
This regex matches dollar currency amounts, with or without comma separators.
I want to match only amounts between 1000 and 2000 in currency format.
Example:
Match $1,500.00 $2000.0 $1100.20 $1000
Don't match $1,000.0000 $3,000 $2000.1 $4,000 $2500.50
[$]\([1][[:digit:]]\{3\}\|2000\)\(,[[:digit:]]+\|\)
This expression defines the following language:
a dollar sign followed by either a 1 and three more digits, or 2000,
optionally followed by a comma and one or more digits.
How about:
/^\$(?:1,?\d{3}(?:\.\d\d?)?|2000(?:\.00?)?)$/
explanation:
NODE                     EXPLANATION
----------------------------------------------------------------------
^                        the beginning of the string
----------------------------------------------------------------------
\$                       '$'
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  1                      '1'
----------------------------------------------------------------------
  ,?                     ',' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
  \d{3}                  digits (0-9) (3 times)
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    \.                   '.'
----------------------------------------------------------------------
    \d                   digits (0-9)
----------------------------------------------------------------------
    \d?                  digits (0-9) (optional (matching the
                         most amount possible))
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
 |                       OR
----------------------------------------------------------------------
  2000                   '2000'
----------------------------------------------------------------------
  (?:                    group, but do not capture (optional
                         (matching the most amount possible)):
----------------------------------------------------------------------
    \.                   '.'
----------------------------------------------------------------------
    0                    '0'
----------------------------------------------------------------------
    0?                   '0' (optional (matching the most
                         amount possible))
----------------------------------------------------------------------
  )?                     end of grouping
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
$                        before an optional \n, and the end of the
                         string
----------------------------------------------------------------------
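The pattern from this answer can be checked against the question's own examples with a small sketch, written in Python on the assumption that it behaves like PCRE here. re.ASCII restricts \d to 0-9, and the name in_range is invented for the example.

```python
import re

# The answer's pattern; re.ASCII limits \d to the digits 0-9.
AMOUNT_RE = re.compile(
    r'^\$(?:1,?\d{3}(?:\.\d\d?)?|2000(?:\.00?)?)$', re.ASCII
)


def in_range(amount):
    """Hypothetical helper: True for $1000 to $2000 in the given format."""
    return AMOUNT_RE.match(amount) is not None


if __name__ == "__main__":
    should_match = ["$1,500.00", "$2000.0", "$1100.20", "$1000"]
    should_not = ["$1,000.0000", "$3,000", "$2000.1", "$4,000", "$2500.50"]
    for amount in should_match + should_not:
        print(amount, in_range(amount))
```

Note that the second alternative has no optional comma, so $2,000 written with a separator is not accepted by the pattern as given.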

How does this regex divide text into sentences?

I know this regex divides a text into sentences. Can someone help me understand how?
/(?<!\..)([\?\!\.])\s(?!.\.)/
You can use YAPE::Regex::Explain to decipher Perl regular expressions:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(?<!\..)([\?\!\.])\s(?!.\.)/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(?<!\..)([\?\!\.])\s(?!.\.))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (?<!                   look behind to see if there is not:
----------------------------------------------------------------------
    \.                   '.'
----------------------------------------------------------------------
    .                    any character except \n
----------------------------------------------------------------------
  )                      end of look-behind
----------------------------------------------------------------------
  (                      group and capture to \1:
----------------------------------------------------------------------
    [\?\!\.]             any character of: '\?', '\!', '\.'
----------------------------------------------------------------------
  )                      end of \1
----------------------------------------------------------------------
  \s                     whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?!                    look ahead to see if there is not:
----------------------------------------------------------------------
    .                    any character except \n
----------------------------------------------------------------------
    \.                   '.'
----------------------------------------------------------------------
  )                      end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
There is also the Regular Expression Analyzer, which does much the same as what toolic already suggested, but is completely web-based.
(?         # Find a group (don't capture)
<          # before the following regular expression
!          # that does not match
\.         # a literal "."
.          # followed by 1 character
)          # (End look-behind group)
(          # Start a group (capture it to $1)
[\?\!\.]   # Containing any one of the characters in the following set "?!."
)          # End group $1
\s         # followed by a whitespace character " ", \t, etc.
(?         # Followed by a group (don't capture)
           # after the preceding regular expression
!          # that does not have
.          # 1 character
\.         # followed by a literal "."
)          # (End look-ahead group)
The first part (?<!\..) is a negative look-behind. It specifies a pattern which invalidates the match. In this case it's looking for two characters--the first a period and the other one any character.
The second part is a standard capture/group, which could be better expressed: ([?!.]) (you don't need the escapes in the class brackets), that is a sentence ending punctuation character.
The next part is a single (??) white-space character: \s
And the last part is a negative look-ahead: (?!.\.). Again it is guarding against the case of a single character followed by a period.
This should work, relatively well. But I don't think I would recommend it. I don't see what the coder was getting at trying to make sure that just a period wasn't the second most recent character, or that it wasn't the second one to come.
I mean if you are looking to split on terminal punctuation, why don't you want to guard against the same class being two-back or two-ahead? Instead it relies on periods not being there. Thus a more regular expression would be:
/(?<![?!.].)([?!.])\s(?!.[?!.])/
Portions:
([\?\!\.])\s: split on an ending character (., ! or ?) that is followed by a whitespace character (space, tab, newline)
(?<!\..): the two characters before this ending character must not be a . followed by any character
(?!.\.): after the whitespace character, any character directly followed by a . isn't allowed.
Those look-ahead ((?!) & look-behind ((?<!) assertions mainly seem to prevent splitting on (whitespaced?) abbreviations (q. e. d. etc.).
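To see those assertions at work, here is a minimal sketch in Python (the question's regex is Perl-style; this assumes Python's re treats it the same way, which should hold because the look-behind has a fixed width of two characters). The function name split_sentences is invented. Since the punctuation sits in a capture group, re.split returns it as a separate item, so the sketch glues it back onto the preceding sentence.

```python
import re

# The pattern from the question, with the redundant escapes inside the
# character class dropped as one of the answers suggests.
SENTENCE_END = re.compile(r'(?<!\..)([?!.])\s(?!.\.)')


def split_sentences(text):
    """Hypothetical helper: split text and keep the end punctuation."""
    parts = SENTENCE_END.split(text)
    # parts alternates: sentence body, captured punctuation, body, ...
    sentences = []
    for i in range(0, len(parts), 2):
        punct = parts[i + 1] if i + 1 < len(parts) else ''
        sentences.append(parts[i] + punct)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(
            "We tested it, e.g. on lists. It worked! Done?"):
        print(sentence)
```

In the sample, "e.g." is not split because the look-behind sees ".g" before the final period, while a spaced abbreviation such as "q. e. d." is protected by the look-ahead, which sees "e." and "d." after the whitespace.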
