stackers!
I have been trying to figure this out for some time but no luck.
(.*?(?:\.|\?|!))(?: |$)
The above pattern captures each sentence in a paragraph, breaking after the ending punctuation.
Example:
1. Today is the greatest. You are the greatest.
The match comes back with three results:
Match {
1.
Today is the greatest.
You are the greatest.
}
However, I am trying to get it not to break when a number is followed by a period, and would like to see the following match instead:
Match {
1.Today is the greatest.
You are the greatest.
}
Thanks for your help in advance
Use
.*?[.?!](?=(?<!\d\.)\s+|\s*$)
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
.*?                      any character except \n (0 or more times
                         (matching the least amount possible))
--------------------------------------------------------------------------------
[.?!]                    any character of: '.', '?', '!'
--------------------------------------------------------------------------------
(?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
    \.                       '.'
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                           or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  |                        OR
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                           or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of
                           the string
--------------------------------------------------------------------------------
)                        end of look-ahead
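As a rough illustration, the pattern could be applied in PHP as sketched below. The sample string assumes the input in the question started with "1. ", and the trim() call is my own addition: every match after the first begins with the whitespace that the lookahead checked but did not consume.

```php
<?php
// Assumed sample input, reconstructed from the matches shown in the question.
$text = '1. Today is the greatest. You are the greatest.';

// Whitespace only counts as a sentence break when the text right before it
// is not a digit followed by a period, so "1." no longer ends a sentence.
$pattern = '/.*?[.?!](?=(?<!\d\.)\s+|\s*$)/';

preg_match_all($pattern, $text, $matches);

// Matches after the first start with the separating space, so trim each one.
$sentences = array_map('trim', $matches[0]);

print_r($sentences);
```

With this input the result should hold two entries, "1. Today is the greatest." and "You are the greatest.".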
I'm trying to match a group that only matches when the first non-spacing character preceding the match is NOT an alphanumeric character.
The regex I've tried consumes the spaces first with \s* and then looks behind to check for \w:
(?<!\w)\s*\({\w+}\)
Success
Input: this will = ({match})
Expected: ({match})
Actual: ({match})
Failure, still matches while preceded by alphanumeric (ignoring spaces)
Input: this should = not ({match})
Expected: -
Actual: ({match})
Using \s+ instead of \s* solved it partially but now it requires at least one space which is not desired!
(?<!\w)\s+\({\w+}\)
I've been looking around the internet but cannot solve the problem. Anyone?
Use this solution (a mix of #horcrux and #Wiktor Stribizew suggestions):
<?php
$regex = '/(?<![\w\s])\s*(\({\w+}\))/';
$string = 'this will = ({match})
this should = not ({match})';
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
See regex proof.
Results:
array(1) {
[0]=>
string(9) "({match})"
}
See PHP proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
  [\w\s]                   any character of: word characters (a-z,
                           A-Z, 0-9, _), whitespace (\n, \r, \t,
                           \f, and " ")
--------------------------------------------------------------------------------
)                        end of look-behind
--------------------------------------------------------------------------------
\s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                         more times (matching the most amount
                         possible))
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
  \(                       '('
--------------------------------------------------------------------------------
  {                        '{'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  }                        '}'
--------------------------------------------------------------------------------
  \)                       ')'
--------------------------------------------------------------------------------
)                        end of \1
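For what it's worth, here is a small sketch of why the pattern from the question still matched. With (?<!\w)\s* the engine may start the match directly before the ( character. The character before that position is a space, not a word character, so the lookbehind passes and \s* simply matches nothing. Putting \s into the lookbehind class rules out that starting position. The variable names below are only illustrative.

```php
<?php
$original = '/(?<!\w)\s*\({\w+}\)/';        // pattern from the question
$fixed    = '/(?<![\w\s])\s*(\({\w+}\))/';  // pattern from the answer

$inputs = [
    'this will = ({match})',
    'this should = not ({match})',
];

$results = [];
foreach ($inputs as $input) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $results[$input] = [
        'original' => preg_match($original, $input),
        'fixed'    => preg_match($fixed, $input),
    ];
}

print_r($results);
```

The original pattern should report 1 for both inputs, while the fixed one should report 1 for the first input and 0 for the second.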
If there has to be a first non-spacing character present, you could match it and use \K to clear the match buffer.
[^\w\s]\h*\K\({\w+}\)
The pattern matches
[^\w\s] Match a single char other than a word char or whitespace char
\h*\K Match 0+ horizontal whitespace chars and forget what is matched so far
\({\w+}\) Match 1+ word chars between ({ and })
Regex demo | Php demo
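A minimal sketch of the \K variant in PHP, assuming the same two inputs as in the question plus a third one of my own where nothing precedes the token. Because the pattern has to consume a non-word, non-space character first, the third input is not expected to match, which is the main difference from the lookbehind version.

```php
<?php
$pattern = '/[^\w\s]\h*\K\({\w+}\)/';

$inputs = [
    'this will = ({match})',        // preceded by "="
    'this should = not ({match})',  // preceded by a word
    '({match})',                    // nothing precedes it (my own test case)
];

$found = [];
foreach ($inputs as $input) {
    // \K resets the reported start of the match, so $m[0] holds only the token.
    $found[] = preg_match($pattern, $input, $m) ? $m[0] : null;
}

var_dump($found);
```

Only the first input should yield ({match}); the other two entries should be null.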
I have a file, smime.p7m, with many content parts. One or more of these parts look like this:
--_3821f5f5-222-4a90-82e0-d8922ee62cc8_
Content-Type: application/pdf;
name="001235_0001.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="001235_0001.pdf"
JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=
--------------ms021111111111111111111107--
Is there a way, for example with a regex, to get the filename (if it's a PDF) and the Base64 code below it? The file may contain more than one PDF.
The filename is not the problem; I get it with filename="(.*).pdf". But I don't know how to get the Base64 code after the filename.
Base64 consists of the characters A-Z and a-z, the digits 0-9, and the symbols + and /. It can also have one or two = at the end and can be split across several lines.
if (preg_match('/filename=\"(?P<filename>[^"]*?\.pdf)\"\s*(?P<base64>([A-Za-z0-9+\/]+\s*)+=?=?)/', $s, $regres)) {
    print("FileName: {$regres['filename']}\n");
    print("Base64: {$regres['base64']}\n");
}
Use
(?im)^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)
See proof
PHP:
preg_match_all('/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im', $str, $matches);
Explanation
--------------------------------------------------------------------------------
(?im)                    set flags for this block
                         (case-insensitive) (with ^ and $ matching
                         start and end of line) (with . not
                         matching \n) (matching whitespace and #
                         normally)
--------------------------------------------------------------------------------
^                        the beginning of a "line"
--------------------------------------------------------------------------------
filename="               'filename="'
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
  [^"]*                    any character except: '"' (0 or more
                           times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \.                       '.'
--------------------------------------------------------------------------------
  pdf                      'pdf'
--------------------------------------------------------------------------------
)                        end of \1
--------------------------------------------------------------------------------
"                        '"'
--------------------------------------------------------------------------------
\R+                      any line break sequence (1 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
(                        group and capture to \2:
--------------------------------------------------------------------------------
  .+                       any character except \n (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
    \R                       any line break sequence
--------------------------------------------------------------------------------
    .+                       any character except \n (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )+                       end of grouping
--------------------------------------------------------------------------------
)                        end of \2
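A short sketch of how this pattern behaves, using a made-up, shortened MIME part (the payload is just Base64 for "hello world"). Note that group 2 keeps going until the first empty line, so this assumes a blank line or the end of the input follows the payload, and that the payload spans at least two lines.

```php
<?php
// Hypothetical, shortened input; "aGVsbG8gd29ybGQ=" decodes to "hello world".
$str = "Content-Disposition: attachment;\n"
     . "filename=\"sample.pdf\"\n"
     . "\n"
     . "aGVsbG8g\n"
     . "d29ybGQ=\n"
     . "\n"
     . "--boundary--\n";

preg_match_all(
    '/^filename="([^"]*\.pdf)"\R+(.+(?:\R.+)+)/im',
    $str,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $match) {
    // $match[1] is the filename, $match[2] the consecutive non-empty lines after it.
    printf("%s => %s\n", $match[1], str_replace("\n", '', $match[2]));
}
```

Here the loop should print a single line pairing sample.pdf with the joined payload.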
I gather that this task is not about validation at all, and solely focuses on data extraction -- this makes sharpening the regex logic unnecessary.
You only need a pattern that matches filename=" at the start of a line, captures the quote-wrapped substring (so long as it ends in .pdf), and then, after any number of whitespace characters, captures all characters until one or two = are encountered.
Using greedy negated character classes allows the regex engine to move quickly. The m pattern modifier tells the regex engine that the ^ meta character (not the ^ used inside of square brackets) may match the start of a line in addition to the start of the string.
Perhaps you'd like to generate an associative array where the keys are the filename strings and the encoded strings are the values. array_column() does a snappy job of setting that up when there are qualifying matches.
Code: (Demo)
var_export(
    preg_match_all(
        '~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m',
        $fileContents,
        $out,
        PREG_SET_ORDER
    )
    ? array_column($out, 2, 1)
    : "no pdf's found"
);
Output:
array (
'001235_0001' => 'JVBERi0xLjMNCjMgMCBvYmoNCjw8DQogIC9UeXBlIC9YT2JqZWN0DQogIC9TdWJ0eXBlIC9J
bWFnZQ0KICAvRmlsdGVyIC9EQ1REZWNvZGUNCiAgL1dpZHRoIDI0MDkNCiAgL0hlaWdodCAz
AF6UAFACZoAUUAFABQA1TQAuaADGKAFoASgBaACgBKADpTAQnApAJ0oAdQAdKAD2oAXpQA3p
.........................................
0oAU9KAFHFABQAnSgBOaAFoAKACgAoAWgAoATGOlAAKAFoATpQAYoAO9AC0AFACZ7UAGKAFo
ZPi1JZBodj7GEjdqgELTq0RC7xeSu1yv+dwEltQFPoSMGcbiTf0cGyzbreEAAAAAAAA=',
)
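If the goal is to get the actual file contents, the captured strings could be decoded in a follow-up step, roughly as sketched below. The input is a made-up, shortened sample, and the $decoded array is my own addition. Keep in mind that the pattern relies on the = padding to find the end of the payload, so a payload that happens to need no padding would not be delimited correctly.

```php
<?php
// Hypothetical, shortened input; the payload is Base64 for "hello world".
$fileContents = "filename=\"sample.pdf\"\n\naGVsbG8g\nd29ybGQ=\n--boundary--\n";

$decoded = [];
if (preg_match_all('~^filename="([^"]+)\.pdf"\s*([^=]+={1,2})~m', $fileContents, $out, PREG_SET_ORDER)) {
    foreach (array_column($out, 2, 1) as $name => $encoded) {
        // Strip the line breaks inside the payload, then decode in strict mode.
        $binary = base64_decode(preg_replace('/\s+/', '', $encoded), true);
        if ($binary !== false) {
            $decoded[$name . '.pdf'] = $binary;
        }
    }
}

var_export($decoded);
```

From there, each entry could be written to disk with file_put_contents() if needed.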
I am developing a "word filter" class in PHP that, among other things, needs to capture purposely misspelled words. These words are entered by the user as a sentence. Here is a simple example of a sentence entered by a user:
I want a coke, sex, drugs and rock'n'roll
The above example is a common phrase, written correctly. My class will find the suspect words sex and drugs, and everything will be fine.
But I suppose that the user will try to hinder detection by writing the words a little differently. In fact, there are many ways to write the same word so that it is still readable to certain people. For example, the word sex may be written as s3x or 5ex or 53x or s e x or s 3 x or s33x or 5533xxx or ss 33 xxx and so on.
I know the basics of regular expressions and tried the pattern below:
/(\b[\w][\w .'-]+[\w]\b)/g
Because of:
\b          word boundary
[\w]        The word can start with one letter or one digit...
[\w .'-]    ... followed by any letter, digit, space, dot, quotes or dash...
+           ... one or more times...
[\w]        ... ending with one letter or one digit.
\b          word boundary
This works partially.
If the sample phrase was written as I want a coke, 5 3 x, druuu95 and r0ck'n'r011 I get 3 matches:
I want a coke
5 3 x
druuu95 and r0ck'n'r011
What I need is 8 matches
I
want
a
coke
5 3 x
druuu95
and
r0ck'n'r011
In short, I need a regular expression that gives me each word of a sentence, even if the word begins with a digit, contains a variable number of digits, spaces, dots, dashes and quotes, and ends with a letter or digit.
Any help will be appreciated.
Description
Typically, good words are 2 or more letters long (with the exception of I and a) and do not contain numbers. This expression isn't flawless, but it does help illustrate why this type of language matching is absurdly difficult: it's an arms race between creative people trying to express themselves without getting caught and the development team that is trying to catch flaws.
(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)
This regular expression will do the following:
find all acceptable words
find all the rest and store them in Capture Group 1
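A rough sketch of how the group 1 captures might be collected in PHP. The i modifier is an assumption on my part (the pattern lists [ia] in lowercase only, so an uppercase "I" would otherwise not count as an acceptable word), and the shortened sample sentence and the $suspects array are mine as well. Group 1 also picks up empty and whitespace-only strings between tokens, so those are filtered out here.

```php
<?php
// The pattern from above in ~ delimiters; the i modifier is an assumption.
$pattern = <<<'REGEX'
~(?:\s+|\A)[#'"[({]?(?!(?:[a-z]{2}\s+){3})(?:[a-zA-Z'-]{2,}|[ia]|i[nst]|o[fnr])[?!.,;:'")}\]]?(?=(?:\s|\Z))|((?:[a-z]{2}\s+){3}|.*?\b)~i
REGEX;

$sentence = 'I want a coke, s3x';

preg_match_all($pattern, $sentence, $matches, PREG_SET_ORDER);

$suspects = [];
foreach ($matches as $match) {
    // Group 1 is missing for accepted words and may be empty or
    // whitespace-only for the gaps between tokens, so skip those.
    $candidate = trim($match[1] ?? '');
    if ($candidate !== '') {
        $suspects[] = $candidate;
    }
}

print_r($suspects);
```

For this sentence only s3x should end up in $suspects; the suspect tokens would still need to be checked against a word list in a separate step.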
Example
Live Demo
https://regex101.com/r/cL2bN1/1
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                           or more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  |                        OR
----------------------------------------------------------------------
  \A                       the beginning of the string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[#'"[({]?                any character of: '#', ''', '"', '[', '(',
                         '{' (optional (matching the most amount
                         possible))
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ")
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  ){3}                     end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
(?:                      group, but do not capture:
----------------------------------------------------------------------
  [a-zA-Z'-]{2,}           any character of: 'a' to 'z', 'A' to
                           'Z', ''', '-' (at least 2 times
                           (matching the most amount possible))
----------------------------------------------------------------------
  |                        OR
----------------------------------------------------------------------
  [ia]                     any character of: 'i', 'a'
----------------------------------------------------------------------
  |                        OR
----------------------------------------------------------------------
  i                        'i'
----------------------------------------------------------------------
  [nst]                    any character of: 'n', 's', 't'
----------------------------------------------------------------------
  |                        OR
----------------------------------------------------------------------
  o                        'o'
----------------------------------------------------------------------
  [fnr]                    any character of: 'f', 'n', 'r'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
[?!.,;:'")}\]]?          any character of: '?', '!', '.', ',', ';',
                         ':', ''', '"', ')', '}', '\]' (optional
                         (matching the most amount possible))
----------------------------------------------------------------------
(?=                      look ahead to see if there is:
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    |                        OR
----------------------------------------------------------------------
    \Z                       before an optional \n, and the end of
                             the string
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  (?:                      group, but do not capture (3 times):
----------------------------------------------------------------------
    [a-z]{2}                 any character of: 'a' to 'z' (2 times)
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ")
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  ){3}                     end of grouping
----------------------------------------------------------------------
  |                        OR
----------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w)
                           and something that is not a word char
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
Suppose there is WordPress shortcode content like the following:
Some content here
[shortcode_1 attr1="val1" attr2="val2"]
[shortcode_2 attr3="val3" attr4="val4"]
Some text
[/shortcode_2]
[/shortcode_1]
Some more content here
My question is: if I match the shortcode pattern so that I get the output [shortcode_1]....[/shortcode_1], can I also get [shortcode_2]...[/shortcode_2] using the same regex pattern in the same run, or do I have to run it again on the output of the first run?
Description
You could just create a couple of capture groups: one for the entire match, and a second for the subordinate match. Of course, this approach does have its limitations and can get hung up on some pretty complex edge cases.
(\[shortcode_1\s[^\]]*].*?(\[shortcode_2\s.*?\[\/shortcode_2\]).*?\[\/shortcode_1\])
Examples
Live Demo
https://regex101.com/r/bQ0vV2/1
Sample Text
[shortcode_1 attr1="val1" attr2="val2"]
[shortcode_2 attr3="val3" attr4="val4"]
Some text
[/shortcode_2]
[/shortcode_1]
Sample Matches
Capture group 1 gets the shortcode_1
Capture group 2 gets the shortcode_2
1. [0-139] `[shortcode_1 attr1="val1" attr2="val2"]
[shortcode_2 attr3="val3" attr4="val4"]
Some text
[/shortcode_2]
[/shortcode_1]`
2. [45-123] `[shortcode_2 attr3="val3" attr4="val4"]
Some text
[/shortcode_2]`
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
(                        group and capture to \1:
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  shortcode_1              'shortcode_1'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  [^\]]*                   any character except: '\]' (0 or more
                           times (matching the most amount
                           possible))
----------------------------------------------------------------------
  ]                        ']'
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    \[                       '['
----------------------------------------------------------------------
    shortcode_2              'shortcode_2'
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    .*?                      any character (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
    \[                       '['
----------------------------------------------------------------------
    \/                       '/'
----------------------------------------------------------------------
    shortcode_2              'shortcode_2'
----------------------------------------------------------------------
    \]                       ']'
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  shortcode_1              'shortcode_1'
----------------------------------------------------------------------
  \]                       ']'
----------------------------------------------------------------------
)                        end of \1
----------------------------------------------------------------------
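In PHP this might look roughly like the sketch below. The s modifier is an assumption on my part: the explanation above reads .*? as "any character", which in PCRE only includes line breaks when s is set. The $outer and $inner variable names are illustrative.

```php
<?php
$content = <<<'TXT'
Some content here
[shortcode_1 attr1="val1" attr2="val2"]
[shortcode_2 attr3="val3" attr4="val4"]
Some text
[/shortcode_2]
[/shortcode_1]
Some more content here
TXT;

// Same pattern as above; the s modifier (assumed) lets . match line breaks.
$pattern = '~(\[shortcode_1\s[^\]]*].*?(\[shortcode_2\s.*?\[\/shortcode_2\]).*?\[\/shortcode_1\])~s';

$outer = null;
$inner = null;
if (preg_match($pattern, $content, $m)) {
    $outer = $m[1]; // the whole [shortcode_1]...[/shortcode_1] block
    $inner = $m[2]; // the nested [shortcode_2]...[/shortcode_2] block
}

echo $outer, "\n---\n", $inner, "\n";
```

Both blocks come out of a single preg_match() call, which is the point of nesting the second capture group inside the first.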
I'm using preg_match to format tracklists so the track number, title and duration are separated into their own cells in a table:
<td>01</td><td>Track Title</td><td>01:23</td>
The problem is that the tracks themselves can take any of the following forms (the leading zeros on the track numbers and durations are not always present):
01. Track Title (01:23)
01. Track Title 01:23
01. Track Title
1 Track Title (01:23)
1 Track Title 01:23
1 Track Title
The following only works on tracks with a timestamp:
/([0-9]+)\.?[\s+](.*)[\s+](\?[0-5]?[0-9]:[0-5][0-9]\)?)/
So I added ? to the timestamp:
/([0-9]+)\.?[\s+](.*)[\s+]((\?[0-5]?[0-9]:[0-5][0-9]\)?)?/
This then works for tracks without a timestamp, but tracks with a timestamp end up with the timestamp stuck with the title, like so:
<td>01</td><td>Track Title 01:23</td><td></td>
EDIT: The tracklists are plaintext and are being pulled from an SQL table before parsing.
Try this:
/^([0-9]+)\.?[\s]+(.*)([\s]+(\(?[0-5]?[0-9]:[0-5][0-9]\)?))?$/U
Note that I used the ungreedy pattern modifier U to match the smallest possible string, and I anchored the pattern to the beginning and end of the string.
By default, regular expression quantifiers are greedy, so the .* that matches the title eats the rest of the string, because the last part with the duration is optional.
Use the U modifier to turn on ungreedy behavior; look for PCRE_UNGREEDY on
http://us1.php.net/manual/en/reference.pcre.pattern.modifiers.php
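To make the difference visible, here is a small comparison that runs the suggested pattern with and without U on one of the sample lines. This is only a sketch; the variable names are mine.

```php
<?php
$track = '01. Track Title (01:23)';

// The suggested pattern without a modifier; "U" is appended for the second run.
$base = '/^([0-9]+)\.?[\s]+(.*)([\s]+(\(?[0-5]?[0-9]:[0-5][0-9]\)?))?$/';

preg_match($base, $track, $greedy);         // default, greedy quantifiers
preg_match($base . 'U', $track, $ungreedy); // PCRE_UNGREEDY

var_dump($greedy[2]);   // the title group swallows the duration
var_dump($ungreedy[2]); // the title only
var_dump($ungreedy[4]); // the duration, parentheses included
```

In the greedy run the optional duration group does not take part in the match at all, so $greedy has no index 4.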
How about:
^([0-9]+)\.?\s+(.*?)(?:\(?([0-5]?[0-9]:[0-5][0-9])\)?)?$
explanation:
The regular expression:
(?-imsx:^([0-9]+)\.?\s+(.*?)(?:\(?([0-5]?[0-9]:[0-5][0-9])\)?)?$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \.?                      '.' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    \(?                      '(' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    (                        group and capture to \3:
----------------------------------------------------------------------
      [0-5]?                   any character of: '0' to '5' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
      [0-9]                    any character of: '0' to '9'
----------------------------------------------------------------------
      :                        ':'
----------------------------------------------------------------------
      [0-5]                    any character of: '0' to '5'
----------------------------------------------------------------------
      [0-9]                    any character of: '0' to '9'
----------------------------------------------------------------------
    )                        end of \3
----------------------------------------------------------------------
    \)?                      ')' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
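Put together, the pattern could be used along these lines to build the table cells. parseTrack() and trackRow() are hypothetical helper names, not something from the question. The rtrim() is there because the lazy title group keeps the space in front of the duration, and the ?? fallback covers lines without a duration, where PHP leaves the third group out of the result.

```php
<?php
// parseTrack() and trackRow() are hypothetical helpers for illustration.
function parseTrack(string $line): ?array
{
    $pattern = '/^([0-9]+)\.?\s+(.*?)(?:\(?([0-5]?[0-9]:[0-5][0-9])\)?)?$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }
    return [
        'number'   => $m[1],
        'title'    => rtrim($m[2]), // drop the space left before the duration
        'duration' => $m[3] ?? '',  // group 3 is absent when there is no duration
    ];
}

function trackRow(array $track): string
{
    return sprintf(
        '<td>%s</td><td>%s</td><td>%s</td>',
        htmlspecialchars($track['number']),
        htmlspecialchars($track['title']),
        htmlspecialchars($track['duration'])
    );
}

$lines = ['01. Track Title (01:23)', '1 Track Title 01:23', '1 Track Title'];
foreach ($lines as $line) {
    $track = parseTrack($line);
    if ($track !== null) {
        echo trackRow($track), "\n";
    }
}
```

The first line should come out as <td>01</td><td>Track Title</td><td>01:23</td>, and the last one with an empty duration cell.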