Regex for parsing figure references

I'm trying to create a regex to parse figure references inside a text. It must match at least these cases:
Fig* 1, 2 and 3 (any quantity of numbers, not only three)
Fig* 1-3
Fig* 1 and 2
Fig* 1
Fig* 1 to 4
So I tried the following regex:
(Fig[a-zA-Z.]*)(\s(\d(,|\s)* )+|\d\s|and\s\d|\s\d-\d|\s\d)*
The best result would be to get the numbers separately, but with the whole match I can just clean up the result and parse the numbers myself.
But I can't seem to match the "1 to 4" case. Also, this regex doesn't seem optimized at all. Any ideas?
Here is a sample: http://www.phpliveregex.com/p/3Zj

Try this:
(Fig.*) ((\d( to | and |-)\d)|\d)|(\d,\d and \d)

You can use this pattern:
(Fig(?:ures?|s\.)) (\d+(?:(?:-|, | (?:and|to) )\d+)*)
If you need more flexibility, you can replace spaces with \h+ or \h*
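
As a minimal sketch of how this pattern could be used from PHP to get the numbers separately: the helper name find_figure_refs and the sample sentence are made up for illustration, and the pattern is the one above unchanged (so it only recognizes "Figure", "Figures" and "Figs.").

```php
<?php
// Hypothetical helper: collects every figure reference in $text and
// returns the label plus the individual numbers it mentions.
function find_figure_refs(string $text): array
{
    // Pattern from the answer above, unchanged.
    $pattern = '/(Fig(?:ures?|s\.)) (\d+(?:(?:-|, | (?:and|to) )\d+)*)/';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $refs = [];
    foreach ($matches as $m) {
        // Group 2 holds the whole number list, e.g. "1, 2 and 3".
        preg_match_all('/\d+/', $m[2], $nums);
        $refs[] = [
            'label'   => $m[1],
            'numbers' => array_map('intval', $nums[0]),
        ];
    }
    return $refs;
}

print_r(find_figure_refs('See Figs. 1, 2 and 3 and Figure 1 to 4.'));
```

Note that a range such as "1 to 4" comes back as its two endpoints; expanding it into 1, 2, 3, 4 would be a separate step.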

edit:
I see my previous regex didn't work.
Attempting redemption, I offer two alternatives that do work:
1.
Using multi-line mode. This uses the \G anchor, which provides a way
to get aligned, trimmed output suitable for an array.
# '/(^Fig[a-zA-Z.]*\h+|(?!^)\G)(?(?<=\d)\h*,\h*)(\d+)(?|\h*(-)\h*(\d+)|\h+(and)\h+(\d+)|\h+(to)\h+(\d+))?/'
( # (1 start)
^ Fig [a-zA-Z.]* \h+ # Fig's
| # or,
(?! ^ ) # Start at the end of last match
\G
) # (1 end)
(?(?<= \d ) # Conditional, if previous digit
\h* , \h* # Require a comma
) # End conditional
( \d+ ) # (2), Digit
(?| # Branch reset (optionally, one of the (-|and|to) \d forms)
\h*
( - ) # (3), '-'
\h*
( \d+ ) # (4), Digit
| \h+
( and ) # (3), 'and'
\h+
( \d+ ) # (4), Digit
| \h+
( to ) # (3), 'to'
\h+
( \d+ ) # (4), Digit
)?
Perl test case
$/ = undef;
$str = <DATA>;
while ($str =~ /(^Fig[a-zA-Z.]*\h+|(?!^)\G)(?(?<=\d)\h*,\h*)(\d+)(?|\h*(-)\h*(\d+)|\h+(and)\h+(\d+)|\h+(to)\h+(\d+))?/mg)
{
length($1) ?
print "'$1'\t'$2'\t'$3'\t'$4'\n" :
print "'$1'\t\t'$2'\t'$3'\t'$4'\n" ;
}
__DATA__
Figs. 1, 2, 3 and 4
Figures 1, 2
Figs. 1 and 2
Figure 1-3
Figure 1 to 3
Figure 1
Output >>
'Figs. ' '1' '' ''
'' '2' '' ''
'' '3' 'and' '4'
'Figures ' '1' '' ''
'' '2' '' ''
'Figs. ' '1' 'and' '2'
'Figure ' '1' '-' '3'
'Figure ' '1' 'to' '3'
'Figure ' '1' '' ''
2. Using multi-line mode. This matches the entire line: capture group 1 contains the 'Fig' label and
group 2 contains all the number forms.
# '/^(Fig[a-zA-Z.]*\h+)((?(?<=\d)\h*,\h*|\d+(?:\h*-\h*\d+|\h+and\h+\d+|\h+to\h+\d+)?)+)\h*$/'
^
( Fig [a-zA-Z.]* \h+ ) # (1), Fig's
( # (2 start), All the num's
(?(?<= \d ) # Conditional, if previous digit
\h* , \h* # Require a comma
| # or
\d+ # Require a digit
(?: # (and optionally, one of the \d (-|and|to) \d forms)
\h* - \h* \d+
| \h+ and \h+ \d+
| \h+ to \h+ \d+
)?
)+ # End conditional, do many times
) # (2 end)
\h*
$
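
Since the question is about PHP and the test case above is Perl, here is a rough PHP port of alternative 1. It is only a sketch: the function name figure_rows is hypothetical, and it assumes the input uses plain LF line endings.

```php
<?php
// Hypothetical PHP port of the Perl test case for alternative 1.
// Each row is [label, first number, connector, second number].
function figure_rows(string $text): array
{
    $re = '/(^Fig[a-zA-Z.]*\h+|(?!^)\G)(?(?<=\d)\h*,\h*)(\d+)'
        . '(?|\h*(-)\h*(\d+)|\h+(and)\h+(\d+)|\h+(to)\h+(\d+))?/m';
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        // Trailing groups that took no part in the match are absent
        // from $m, so default them to empty strings.
        $rows[] = [$m[1], $m[2], $m[3] ?? '', $m[4] ?? ''];
    }
    return $rows;
}

foreach (figure_rows("Figs. 1, 2, 3 and 4\nFigure 1-3") as $row) {
    echo implode("\t", $row), "\n";
}
```

The `?? ''` defaults are the part that differs from Perl: with PREG_SET_ORDER, preg_match_all leaves out optional groups at the end of the pattern when they did not participate.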

Related

Regex validation for North American phone numbers

I am having trouble finding a pattern that would detect the following
909-999-9999
909 999 9999
(909) 999-9999
(909) 999 9999
999 999 9999
9999999999
\A[(]?[0-9]{3}[)]?[ ,-][0-9]{3}[ ,-][0-9]{3}\z
I tried it, but it doesn't work for all the instances. I was thinking I could divide the problem by putting each character into an array and then checking it, but then the code would be too long.

Your last group has 4 digits, but the regex specifies 3.
You also need to apply a ? quantifier (0 or 1 occurrence) to the separators, since they are optional.
Use
^[(]?[0-9]{3}[)]?[ ,-]?[0-9]{3}[ ,-]?[0-9]{4}$
See the demo here
PHP demo:
$re = "/\A[(]?[0-9]{3}[)]?[ ,-]?[0-9]{3}[ ,-]?[0-9]{4}\z/";
$strs = array("909-999-9999", "909 999 9999", "(909) 999-9999", "(909) 999 9999", "999 999 9999","9999999999");
$vals = preg_grep($re, $strs);
print_r($vals);
And another one:
$re = "/\A[(]?[0-9]{3}[)]?[ ,-]?[0-9]{3}[ ,-]?[0-9]{4}\z/";
$str = "909-999-9999";
if (preg_match($re, $str, $m)) {
echo "MATCHED!";
}
BTW, optional ? subpatterns usually perform better than alternations.

Try this regex:
^(?:\(\d{3}\)|\d{3})[- ]?\d{3}[- ]?\d{4}$
Explaining:
^ # from start
(?: # one of
\(\d{3}\) # '(999)' sequence
| # OR
\d{3} # '999' sequence
) #
[- ]? # may exist space or hyphen
\d{3} # three digits
[- ]? # may exist space or hyphen
\d{4} # four digits
$ # end of string
Hope it helps.
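
A small sketch of how that pattern might be wrapped and checked in PHP. The function name is_nanp_number and the two negative samples are assumptions added for illustration; the positive samples are the ones from the question.

```php
<?php
// Hypothetical wrapper around the pattern explained above.
function is_nanp_number(string $s): bool
{
    // Either "(999)" or "999", then 3 digits, then 4 digits,
    // each optionally preceded by a space or a hyphen.
    return preg_match('/^(?:\(\d{3}\)|\d{3})[- ]?\d{3}[- ]?\d{4}$/', $s) === 1;
}

$samples = [
    '909-999-9999', '909 999 9999', '(909) 999-9999',
    '(909) 999 9999', '999 999 9999', '9999999999',
    '(909 999-9999',   // unbalanced parenthesis
    '909-999-999',     // last group too short
];
foreach ($samples as $s) {
    echo str_pad($s, 16), is_nanp_number($s) ? 'valid' : 'invalid', "\n";
}
```

Because the parentheses are matched as a pair inside the alternation, an unbalanced "(909 999-9999" is rejected. One caveat: $ also accepts a single trailing newline, so \z can be swapped in if that matters.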

Split a string that contains decimals instead of integers

I split a string '3(1-5)' like this:
$pattern = '/^(\d+)\((\d+)\-(\d+)\)$/';
preg_match($pattern, $string, $matches);
But I need to do the same thing for decimals, i.e. '3.5(1.5-4.5)'.
And what do I have to do, if the user writes '3,5(1,5-4,5)'?
Output of '3.5(1.5-4.5)' should be:
$matches[1] = 3.5
$matches[2] = 1.5
$matches[3] = 4.5

You can use the following regular expression.
$pattern = '/^(\d+(?:[.,]\d+)?)\(((?1))-((?1))\)$/';
The first capturing group ( ... ) matches the following pattern:
( # group and capture to \1:
\d+ # digits (0-9) (1 or more times)
(?: # group, but do not capture (optional):
[.,] # any character of: '.', ','
\d+ # digits (0-9) (1 or more times)
)? # end of grouping
) # end of \1
Afterwards we look for an opening parenthesis and then recurse (match/capture) the 1st subpattern, followed by a hyphen (-), and then recurse (match/capture) the 1st subpattern again, followed by a closing parenthesis.
Code Demo
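
For the comma case in the question, one possible approach (a sketch, not the only way) is to match with the pattern above and then normalize the decimal separator before casting. The function name split_range is hypothetical.

```php
<?php
// Hypothetical helper: returns [value, low, high] as floats,
// or null when the input does not have the "a(b-c)" shape.
function split_range(string $s): ?array
{
    // (?1) re-uses the first group's sub-pattern for the two range ends.
    $pattern = '/^(\d+(?:[.,]\d+)?)\(((?1))-((?1))\)$/';
    if (preg_match($pattern, $s, $m) !== 1) {
        return null;
    }
    // Accept both "3.5" and "3,5" by turning the comma into a dot.
    $toFloat = fn(string $n): float => (float) str_replace(',', '.', $n);
    return array_map($toFloat, array_slice($m, 1));
}

var_dump(split_range('3.5(1.5-4.5)'));
var_dump(split_range('3,5(1,5-4,5)'));
```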

This pattern should help:
^(\d+\.?\,?\d+)\((\d+\,?\.?\d+)\-(\d+\.?\,?\d+)\)$

RegExp Language Routing

Good Day,
I am in need of a Route-RegExp based on client-language for a website.
It should be like this:
Relative URL / Route:
/(No-Language) -> /?lng=(someDefaultLanguage)
/(No-Language)/ -> /?lng=(someDefaultLanguage)
/lngCode/page -> /page/?lng=lngCode
/lngCode/page/ -> /page/?lng=lngCode
/lngCode/pageL1/pageL2 -> /pageL1/pageL2/?lng=lngCode
/language/page?param=Value -> /page/?lng=lngCode&param=Value
(Notice the trailing slashes on some lines)
Tree structure is, ...well infinite :)
There are cases with single and multiple URL-Params.
I'm absolutely no regex wizard; I managed this result in, uhm, ...hours:
/^\/([a-z]{2})(?:(.*[^\?])|^$)((?:[\/\?]).*|^$)/
Please don't ask me what I was trying to route there. I am sooooo new to regex.
Thank you in advance
--
Edit for clarification (I hope):
Basically it is this concept (it is internal routing, not redirection, in case I didn't mention it):
The language-parameter (as directory-style) must be grabbed from the 1st url and attached as a real parameter named, "lng". The directory-parameter should disappear.
If there are already other parameters, they need to be attached as well (?/&-case).
If there is no language given (=default-language), there is no directory-style-parameter in the url. Would be nice if still a ?lng=en parameter can be attached.
Examples:
localhost/blogpage/coolentry (default language)
localhost/de/blogpage/coolentry
localhost/es/blogpage/coolentry
localhost/blogpage/ -> localhost/blogpage/?lng=en
localhost/de/blogpage/ -> localhost/blogpage/?lng=de
localhost/blogpage/coolentry/ -> localhost/blogpage/coolentry/?lng=en
localhost/de/blogpage/coolentry/ -> localhost/blogpage/coolentry/?lng=de
localhost/de/blogpage/coolentry/?entryPage=1 -> localhost/blogpage/coolentry/?lng=de&entryPage=1
It gets routed always with a real language parameter.
I have as well edited the first post, there was a confusing typo in it.

Sorry for the delay, @hcm.
Well, I gotta tell ya. I spent 10 minutes writing the original regex.
Tested in Perl, everything worked great.
Then I went to dump it into an online PHP tester and got this "Undefined offset" error/warning.
Capture groups 2, 3 and 4 are optional, so I used (?: ( capture ) )?, but PHP is such a mess
that you can't even test for an undefined group.
Update
@hcm - OK, figured it out. Those online testers don't translate CRLFs to LFs.
In multi-line mode, $ is the boundary before a newline or at the end of the string.
^ is after a newline or at the beginning of the string, which is no problem.
So $ won't match before a CR, only before an LF. A workaround is probably to use the
\R any-linebreak construct, but that is not a boundary; it consumes an actual character.
What I did to cure this was to use (?: $ | (?= \r) ) outside of assertions, and
(?: $ | \r ) inside assertions. This cures all problems.
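
To make the CRLF issue concrete, here is a tiny sketch. It assumes PCRE's usual default newline convention (LF), and the sample text is made up.

```php
<?php
// Two lines terminated with CRLF, as an online tester might submit them.
$text = "one\r\ntwo\r\n";

// Plain $ in multi-line mode: fails, because a CR sits between the
// word and the LF that $ is looking for.
$plain = preg_match_all('/^\w+$/m', $text, $a);

// Workaround from above: accept either $ or a position followed by CR.
$fixed = preg_match_all('/^\w+(?:$|(?=\r))/m', $text, $b);

echo "plain: $plain, fixed: $fixed\n";
print_r($b[0]);
```

The lookahead keeps the CR out of the match itself, which is why the captured words come back without trailing whitespace.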
After reading your message, I've changed the regex so that every part is optional, but
still positional.
The 4 optional parts are as follows.
1. Before the lang code.
2. The lang code.
3. After the lang code.
4. The parameters.
No part will run over the other.
All leading /'s are taken out of each part (not part of the match),
while internal slashes are left in place.
All that's left is to construct the new url as you wish.
Let me know how this turns out or if you need a little tweak.
PHP code:
// Go to this website:
// http://writecodeonline.com/php/
// Cut & paste this into the code box, hit run.
$text = '
invalid
/
/de
/de/coolentry
localhost/
localhost/blogpage
localhost/blogpage/
localhost/blogpage/de/
/localhost/blogpage/coolentry/famous invalid
/root/blog/page/cool/entry/?entryPage=1&var1=A&var2=B
localhost/blogpage/coolentry/
localhost/blogpage/de/coolentry/
localhost/blogpage/de/coolentry/
localhost/blogpage/de/coolentry/?entryPage=1
localhost/blogpage/coolentry/?entryPage=2
';
$str = preg_replace_callback('~^(?![^\S\r\n]*(?:\r|$))(?|(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$)))/?((?:(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$))|\?|/[^\S\r\n]*(?:\r|$))\S)*)|())(?|/([a-z]{2})(?=/|[^\S\r\n]*(?:\r|$))|())(?|/((?:(?!/[^\S\r\n]*(?:\r|$))[^?\s])+)|())(?|/\?((?:(?!/[^\S\r\n]*(?:\r|$))\S)*)|())/?[^\S\r\n]*(?:$|(?=\r))~m',
function( $matches )
{
///////////////// URL //////////////////
$url = '';
// Before lang code -- Group 1
if ( $matches[1] != '' ) {
$url .= '/' . $matches[1];
}
// After lang code -- Group 3
if ( $matches[3] != '' ) {
$url .= '/' . $matches[3];
}
///////////////// PARAMS //////////////////
$params = '/?lng=';
// Lang code -- Group 2
if ( $matches[2] != '' ) {
$params .= $matches[2];
}
else {
$params .= 'en'; // No lang given, set default
}
// Other params
if ( $matches[4] != '') {
$params .= '&' . $matches[4];
}
///////////////// Check there is a Url //////////////////
if ( $url == '' ) { // No url given, set a default
$url = '/language'; // 'language', 'localhost', etc...
}
///////////////// Put the pieces back together //////////////////
$NewURL = $url . $params;
return $NewURL;
},
$text);
print $str;
output:
invalid
/language/?lng=en
/language/?lng=de
/coolentry/?lng=de
/localhost/?lng=en
/localhost/blogpage/?lng=en
/localhost/blogpage/?lng=en
/localhost/blogpage/?lng=de
/localhost/blogpage/coolentry/famous invalid
/root/blog/page/cool/entry/?lng=en&entryPage=1&var1=A&var2=B
/localhost/blogpage/coolentry/?lng=en
/localhost/blogpage/coolentry/?lng=de
/localhost/blogpage/coolentry/?lng=de
/localhost/blogpage/coolentry/?lng=de&entryPage=1
/localhost/blogpage/coolentry/?lng=en&entryPage=2
Regex
# '~^(?![^\S\r\n]*(?:\r|$))(?|(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$)))/?((?:(?!/[a-z]{2}[^\S\r\n]*(?:/|(?:\r|$))|\?|/[^\S\r\n]*(?:\r|$))\S)*)|())(?|/([a-z]{2})(?=/|[^\S\r\n]*(?:\r|$))|())(?|/((?:(?!/[^\S\r\n]*(?:\r|$))[^?\s])+)|())(?|/\?((?:(?!/[^\S\r\n]*(?:\r|$))\S)*)|())/?[^\S\r\n]*(?:$|(?=\r))~m'
^ # BOL
(?! # Not a blank line, remove to generate a total default url
[^\S\r\n]*
(?: \r | $ )
)
(?| # BEFORE lang code
(?!
/ [a-z]{2} [^\S\r\n]* # not lang code
(?:
/
| (?: \r | $ )
)
)
/? # strip leading '/'
( # (1 start)
(?:
(?!
/ [a-z]{2} [^\S\r\n]* # not lang code
(?:
/
| (?: \r | $ )
)
|
\? # not parms
|
/ [^\S\r\n]* # not final slash
(?: \r | $ )
)
\S
)*
) # (1 end)
|
( ) # (1)
)
(?| # LANG CODE
/ # strip leading '/'
( [a-z]{2} ) # (2)
(?=
/
| [^\S\r\n]*
(?: \r | $ )
)
|
( ) # (2)
)
(?| # AFTER lang code
/ # strip leading '/'
( # (3 start)
(?:
(?! # not final slash
/ [^\S\r\n]*
(?: \r | $ )
)
[^?\s] # not parms
)+
) # (3 end)
|
( ) # (3)
)
(?| # PARAMETERS
/ \? # strip leading '/?'
( # (4 start)
(?:
(?! # not final slash
/ [^\S\r\n]*
(?: \r | $ )
)
\S
)*
) # (4 end)
|
( ) # (4)
)
/?
[^\S\r\n]* # EOL
(?:
$
| (?= \r )
)
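
For comparison, if only a single relative URL has to be rewritten at a time (as in the original question), a much smaller sketch is possible with parse_url plus one short regex instead of the big multi-line pattern above. This is a different technique from the answer's, under stated assumptions: the name route_language is hypothetical, language codes are exactly two lowercase letters in the first path segment, and the default language is 'en'.

```php
<?php
// Hypothetical helper: moves a leading two-letter language segment
// into a ?lng= parameter and keeps any existing query parameters.
function route_language(string $url, string $default = 'en'): string
{
    $path  = (string) parse_url($url, PHP_URL_PATH);
    $query = (string) parse_url($url, PHP_URL_QUERY);

    $lang = $default;
    // A language code is "/xx" followed by "/" or the end of the path.
    if (preg_match('~^/([a-z]{2})(?=/|$)~', $path, $m)) {
        $lang = $m[1];
        $path = (string) substr($path, 3);   // drop the "/xx" prefix
    }

    // Always end the path with exactly one trailing slash.
    $target = rtrim($path, '/') . '/?lng=' . $lang;
    if ($query !== '') {
        $target .= '&' . $query;
    }
    return $target;
}

echo route_language('/de/blogpage/coolentry/?entryPage=1'), "\n";
echo route_language('/blogpage/'), "\n";
```

Like the regex above, this treats any two-letter first segment as a language code, so a real page named "/go/" would need a whitelist of known codes.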

How can I optimize this regular expression?

I have a regular expression, and I would like to ask whether it is possible to simplify it?
preg_match_all('/([0-9]{2}\.[0-9]{2}\.[0-9]{4}) (([01]?[0-9]|2[0-3])\:[0-5][0-9]\:[0-5][0-9]?) поступление на сумму (\d+) WM([A-Z]) от корреспондента (\d+)/', $message->getMessageBody(), $info);

I think this is the best you can do:
preg_match_all('/((?:\d\d\.){2}\d{4}) (([01]?\d|2[0-3])(:[0-5]\d){1,2}) поступление на сумму (\d+) WM([A-Z]) от корреспондента (\d+)/', $message, $info);
Unless you don't need those exact words in there. Then you could:
preg_match_all('/((?:\d\d\.){2}\d{4}) (([01]?\d|2[0-3])(:[0-5]\d){1,2})\D+(\d+) WM([A-Z])\D+(\d+)/', $message, $info);

You can start by using free-spacing mode and some comments (which will help your and everyone else's understanding, and that makes simplifying easier). Note that you'll have to put literal spaces in character classes ([ ]) now, though:
/
( # group 1
[0-9]{2}\.[0-9]{2}\.[0-9]{4}
# match a date
)
[ ]
( # group 2
( # group 3
[01]?[0-9] # match an hour from 0 to 19
| # or
2[0-3] # match an hour from 20 to 23
)
\:
[0-5][0-9] # minutes
\:
[0-5][0-9]? # seconds
)
[ ]поступление[ ]на[ ]сумму[ ]
# literal text
(\d+) # a number into group 4
[ ]WM # literal text
([A-Z]) # a letter into group 5
[ ]от[ ]корреспондента[ ]
# literal text
(\d+) # a number into group 6
/x
Now we can't simplify the part at the end - unless you don't want to capture the parenthesised things, in which case you can simply omit most of the parentheses.
You can slightly shorten the expression by using \d as a substitute for [0-9], in which case \d\d is even shorter than \d{2}.
Next, there is no need to escape colons.
And finally, there seems to be something odd with your seconds. If you want to allow single-digit seconds, make the [0-5] optional, and not the \d after it:
/
( # group 1
\d\d\.\d\d\.\d{4}
# match a date
)
[ ]
( # group 2
( # group 3
[01]?\d # match an hour from 0 to 19
| # or
2[0-3] # match an hour from 20 to 23
)
:
[0-5]\d # minutes
:
[0-5]?\d # seconds
)
[ ]поступление[ ]на[ ]сумму[ ]
# literal text
(\d+) # a number into group 4
[ ]WM # literal text
([A-Z]) # a letter into group 5
[ ]от[ ]корреспондента[ ]
# literal text
(\d+) # a number into group 6
/x
I don't think it will get much simpler than that.
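
As a sketch of how the free-spacing version could be used from PHP: the function name parse_payment, the array keys and the sample message are invented for illustration, and the u modifier is added on the assumption that the source file and the message are UTF-8. The comments inside the pattern avoid the / character, because PHP would treat it as the closing delimiter.

```php
<?php
// Hypothetical wrapper around the shortened free-spacing pattern.
function parse_payment(string $message): ?array
{
    $pattern = '/
        (\d\d\.\d\d\.\d{4})        # group 1: date
        [ ]
        (                          # group 2: whole time
          ([01]?\d|2[0-3])         # group 3: hour
          :[0-5]\d                 # minutes
          :[0-5]?\d                # seconds
        )
        [ ]поступление[ ]на[ ]сумму[ ]
        (\d+)                      # group 4: amount
        [ ]WM([A-Z])               # group 5: letter after WM
        [ ]от[ ]корреспондента[ ]
        (\d+)                      # group 6: correspondent
    /xu';

    if (preg_match($pattern, $message, $m) !== 1) {
        return null;
    }
    return [
        'date'   => $m[1],
        'time'   => $m[2],
        'amount' => (int) $m[4],
        'type'   => $m[5],
        'from'   => $m[6],
    ];
}

// Made-up sample message in the expected format.
print_r(parse_payment('01.02.2013 14:05:09 поступление на сумму 150 WMZ от корреспондента 123456789012'));
```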

How to translate a math expression in a string to an integer

For example I have a statement:
$var = '2*2-3+8'; //variable type is string
How can I make it evaluate to 9?

From this page, a very awesome (simple) calculation validation regular expression, written by Richard van Velzen. Once you have that, and it matches, you can rest assured that you can use eval over the string. Always make sure the input is validated before using eval!
<?php
$regex = '{
\A # the absolute beginning of the string
\h* # optional horizontal whitespace
( # start of group 1 (this is called recursively)
(?:
\( # literal (
\h*
[-+]? # optionally prefixed by + or -
\h*
# A number
(?: \d* \. \d+ | \d+ \. \d* | \d+) (?: [eE] [+-]? \d+ )?
(?:
\h*
[-+*/] # an operator
\h*
(?1) # recursive call to the first pattern.
)?
\h*
\) # closing )
| # or: just one number
\h*
[-+]?
\h*
(?: \d* \. \d+ | \d+ \. \d* | \d+) (?: [eE] [+-]? \d+ )?
)
# and the rest, of course.
(?:
\h*
[-+*/]
\h*
(?1)
)?
)
\h*
\z # the absolute ending of the string.
}x';
$var = '2*2-3+8';
if( 1 === preg_match( $regex, $var ) ) { // 1 = match; 0 = no match; false = regex error
$answer = eval( 'return ' . $var . ';' );
echo $answer;
}
else {
echo "Invalid calculation.";
}

What you have to do is find or write a parser function that can properly read equations and actually calculate the outcome. In a lot of languages this can be implemented with a stack; you should look at things like infix and postfix parsers.
Hope this helps.
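
To illustrate the parser idea without eval, here is a minimal recursive-descent sketch (the call stack plays the role of the explicit stack). It is an assumption-laden example, not a complete calculator: the class name ExprParser is made up, and it only handles numbers, + - * /, unary signs and parentheses.

```php
<?php
// Hypothetical minimal evaluator; the grammar is:
//   expression := term (('+' | '-') term)*
//   term       := factor (('*' | '/') factor)*
//   factor     := number | '(' expression ')' | ('+' | '-') factor
final class ExprParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(string $expr)
    {
        // Numbers become one token; every other non-space character is
        // its own token and is rejected later if it is not allowed.
        preg_match_all('~\d+(?:\.\d+)?|\S~', $expr, $m);
        $this->tokens = $m[0];
    }

    public function evaluate(): float
    {
        $value = $this->expression();
        if ($this->pos < count($this->tokens)) {
            throw new InvalidArgumentException('Unexpected token');
        }
        return $value;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function expression(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op  = $this->tokens[$this->pos++];
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op  = $this->tokens[$this->pos++];
            $rhs = $this->factor();
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    private function factor(): float
    {
        $token = $this->peek();
        $this->pos++;
        if ($token === '(') {
            $value = $this->expression();
            if ($this->peek() !== ')') {
                throw new InvalidArgumentException('Missing closing parenthesis');
            }
            $this->pos++;
            return $value;
        }
        if ($token === '-') {
            return -$this->factor();
        }
        if ($token === '+') {
            return $this->factor();
        }
        if ($token !== null && is_numeric($token)) {
            return (float) $token;
        }
        throw new InvalidArgumentException('Unexpected token');
    }
}

echo (new ExprParser('2*2-3+8'))->evaluate(), "\n";
```

The result is a float; cast it with (int) if an integer is required. Anything outside the small grammar raises an InvalidArgumentException instead of being executed.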

$string_with_expression = '2+2';
eval('$eval_result = ' . $string_with_expression . ';');
$eval_result then holds the value you need.

There is the intval function,
but you can't apply it directly to $var.
For a parser, check this answer.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
