I'm going to have to learn regular expressions soon, but for now I just need to know what a check for the "50.080215,14.393983" GPS format should look like. Thanks, Mike.
You want to find two decimal numbers separated by a comma (and maybe whitespace?) in a string?
$pattern = "/(?P<lat>(\d+(\.\d+)?)),\s*(?P<lon>(\d+(\.\d+)?))/";
This assumes that the fractional part of each number is optional and places no constraint on the number of digits of precision. Depending on your input this may match more often than you want; with a better specification a tighter pattern can be constructed. For example, if we specify that latitude runs from -90 to 90 inclusive, longitude runs from -180 to 180 inclusive, and both may have up to 6 digits of precision, we can construct this pattern:
$pattern = "/(?P<lat>-?([1-8]?\d|90)(\.\d{1,6})?),\s*(?P<lon>-?(1[0-7]\d|180|[1-9]?\d)(\.\d{1,6})?)/";
There is a slight bug in this pattern: it will match "90.999999,180.999999", which is outside the hypothetical spec. Correcting this is left as an exercise for the reader.
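One way to sidestep the boundary bug is to let the regex check only the shape of the input and do the range check numerically. Here is a minimal sketch of that idea in Python rather than PHP (the named-group syntax is the same); the function name `parse_gps` is made up for illustration, and it validates a whole string instead of searching inside a longer one.

```python
import re

# Shape check only: optional minus sign, 1-2 (lat) or 1-3 (lon) integer digits,
# and an optional fraction of up to 6 digits. The range is checked afterwards.
PAIR = re.compile(
    r"(?P<lat>-?[0-9]{1,2}(?:\.[0-9]{1,6})?),\s*"
    r"(?P<lon>-?[0-9]{1,3}(?:\.[0-9]{1,6})?)"
)

def parse_gps(text):
    """Return (lat, lon) as floats, or None if text is not a valid pair.

    parse_gps is a hypothetical helper name, not part of any library.
    """
    m = PAIR.fullmatch(text.strip())
    if m is None:
        return None
    lat, lon = float(m.group("lat")), float(m.group("lon"))
    # Numeric range check: this is what rejects 90.999999,180.999999.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon
```

With this split, the pattern stays readable and the limits live in one obvious place.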
A user can define the format of an identifier in my system, and this is stored in the database as a regex string (for example, "/^\d{6}$/", or a more complicated example of "/^[A-Z]{2}\d{8}$/").
Can anyone suggest how I can calculate the maximum length of the string that the given regex can match (thanks #Ulver)?
Many thanks for reading!
This answer assumes 5 things:
1. The expressions are simple, as per your examples.
2. You do not have * or + operators in your expression.
3. You do not have patterns of the type foo{n,}, where n is some positive integer value.
4. Each expression starts with ^ and ends with $.
5. Each term is followed by the number of times you expect to match it.
To calculate the number of characters they match, you could go through the expression and look for 2 patterns:
{n}, which translates to match exactly n times. In this case, extract n.
{n,m}, which translates to match at least n times and at most m times. In this case, extract m.
Once you have all the n and m values, simply add them together.
Some more details on the assumptions:
1. As expressions get more complicated, you will need to keep track of more constructs. For instance, ^[A-Z]{2}$ means match exactly 2 upper-case letters, so the length of the match is 2. On the other hand, foo{2} matches fooo, because the quantifier applies only to the last o; without anchors it is also found inside afooo and foooobar, so the pattern alone does not bound the length of the string it is found in. Also, (abc){2} means match abc twice, so here you would need to multiply n (the value in the braces) by the length of whatever lies within the parentheses that precede it, if any. Of course, groups can also be nested.
2. The * and + operators denote 0 or more and 1 or more respectively. Thus there is, theoretically, no limit on the length of what is matched.
3. Similar to point 2, {n,} means match at least n times. Thus, there is no upper limit.
4. Similar to point 1, without the ^ and $ anchors, an expression can match inside any longer string. The expression foo can match afoo, foobar, foooooooooooooooooooooooo and so on.
5. I made this assumption for reasons similar to point 1. You could enhance your application to look for [] pairs and count each as 1 character, but I think you would hit other caveats.
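The approach above can be sketched roughly as follows, in Python for brevity. This is only an illustration under the five assumptions listed: `max_match_length` is a made-up name, a term is taken to be an escape like \d, a character class like [A-Z], or a single literal, and anything else (groups, alternation, *, +, ?, {n,}) is rejected rather than guessed at.

```python
import re

# One term: an escape (\d), a character class ([A-Z]) or a single literal,
# optionally followed by a {n} or {n,m} quantifier.
TERM = re.compile(
    r"(?:\\.|\[[^\]]+\]|[^*+?{}()|\[\]])"
    r"(?:\{(\d+)(?:\s*(,)\s*(\d*))?\})?"
)

def max_match_length(pattern):
    """Upper bound on the length of a string matched by a simple anchored regex."""
    # Strip the / delimiters if the stored string includes them.
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        pattern = pattern[1:-1]
    if not (pattern.startswith("^") and pattern.endswith("$")):
        raise ValueError("pattern must start with ^ and end with $")
    body = pattern[1:-1]
    total, pos = 0, 0
    while pos < len(body):
        m = TERM.match(body, pos)
        if m is None:
            raise ValueError("unsupported construct at position %d" % pos)
        low, comma, high = m.groups()
        if low is None:
            total += 1              # no quantifier: the term matches once
        elif comma is None:
            total += int(low)       # {n}: exactly n times
        elif high == "":
            raise ValueError("{n,} has no upper limit")
        else:
            total += int(high)      # {n,m}: at most m times
        pos = m.end()
    return total
```

For the two examples in the question this gives 6 and 10 respectively; patterns using * or + raise an error instead of returning a misleading number.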
Here is my regex to validate a phone number.
((^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{5}\)?[\s-]?\d{4,5}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$)|(\(?[2-9][0-8][0-9]\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}))|(?:\((\+?\d+)?\)|(\+\d{0,3}))? ?\d{2,3}([-\.]?\d{2,3} ?){3,4}
Here is the link for regex check http://regex101.com/r/xO4aU4
It validates UK and US numbers. The lower bound on the length of the number is 7 and the upper bound is not restricted.
Can I restrict it so that a number shorter than 7 or longer than 14 is not matched at all?
(\+44)?\s?\(?0?\d{1,5}\)?\s\d{1,7}\s{0,1}\d{0,6}(?:\s-\s|\s)\s{0,2}\d{0,6}|(\+44)?\s?\(?\d{1,5}\)?\s\d{1,7}\s{0,1}\d{0,4}\s{0,1}\d{0,4}|(\+44)?\s?\(\d{1,5}\)\s?\d{3,7}\s?\d{0,4}\s?\d{0,4}|\d{4,5}\s*\d{3,5}\s\d{3,4}
That is a regex I use for UK phone numbers (landlines). It is used for screen-scraping sites, so it is probably a little more robust and matches edge cases (such as people who write +44(0)1772 99 33 66). It is used coupled with string length checks and doesn't account for extension numbers, but you should put extension numbers in a separate field anyway.
I have no idea about US numbers, so sorry, I can't help there!
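To illustrate what "coupled with string length checks" could look like, here is a small Python sketch. It assumes the 7 to 14 limit from the question refers to the number of digits, and `PHONE_SHAPE` is a deliberately permissive stand-in, not either of the patterns above; swap in the real pattern where it is used.

```python
import re

# Hypothetical, permissive stand-in for the real phone pattern:
# an optional leading +, then digits, whitespace, parentheses and hyphens.
PHONE_SHAPE = re.compile(r"\+?[0-9\s()-]+")

def is_plausible_phone(candidate, min_digits=7, max_digits=14):
    """Accept the candidate only if it has the right shape AND digit count."""
    if PHONE_SHAPE.fullmatch(candidate) is None:
        return False
    # Count digits only, so spaces and brackets do not affect the limit.
    digit_count = sum(ch.isdigit() for ch in candidate)
    return min_digits <= digit_count <= max_digits
```

Keeping the length rule outside the regex means the pattern does not need to encode every combination of group lengths that adds up to 7 to 14.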
I am using some data which gives paths for Google Maps either as a path or as a set of two latitudes and longitudes. I have stored both values as a BLOB in a MySQL database, but I need to detect the values which are not paths when they come out in the result. In an attempt to do this, I have saved them in the BLOB in the following format:
array(lat,lng+lat,lng)
I am using preg_match to find these results, but I haven't managed to get any to work. Here are the regex codes I have tried:
^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}[1-9\.\,\+]{1*}[\)]{1}^
^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}(\-?\d+(\.\d+)?),(\-?\d+(\.\d+)?)\+(\-?\d+(\.\d+)?),(\-?\d+(\.\d+)?)[\)]{1}^
Regex confuses me sometimes (as it is doing now). Can anyone help me out?
Edit:
The lat can be 2 digits followed by a decimal point and 8 more digits, and the lng can be 3 digits followed by a decimal point and 8 more digits. Both can be positive or negative.
Here are some example lat lngs:
51.51160000,-0.12766000
-53.36442000,132.27519000
51.50628000,0.12699000
-51.50628000,-0.12699000
So a full match would look like:
array(51.51160000,-0.12766000+-53.36442000,132.27519000)
Further Edit
I am using the preg_match() php function to match the regex.
Here are some pointers for writing regex:
If you have a single possibility for a character, for example, the a in array, you can indeed write it as [a]; however, you can also write it as just a.
If you are looking to match exactly one of something, you can indeed write it as a{1}, however, you can also write it as just a.
Applying this lots, your example of ^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}[1-9\.\,\+]{1*}[\)]{1}^ reduces to ^array\([1-9\.\,\+]{1*}\)^ - that's certainly an improvement!
Next, numbers may also include 0's, as well as 1-9. In fact, \d - any digit - is usually used instead of 1-9.
You are using ^ as the delimiter - usually that is /; I didn't recognize it at first. I'm not sure what you can use for the delimiter, so, just in case, I'll change it to the usual /. This makes the above regex /array\([\d\.\,\+]{1*}\)/.
To match one or more of a character or character set, use +, rather than {1*}. This makes your query /array\([\d\.\,\+]+\)/
Then, to collect the resulting numbers (assuming you want only the part between the brackets), put it in (non-escaped) brackets, thus: /array\(([\d\.\,\+]+)\)/ - you would then need to split them, first by +, then by ,. Alternatively, if there are exactly two lat,lng pairs, you might want: /array\(([\d\.]+),([\d\.]+)\+([\d\.]+),([\d\.]+)\)/ - this will return 4 values, one for each number; the additional stuff (+, ,) will already be removed, because it is not in (unescaped) brackets ().
Edit: If you want negative lats and longs (and why wouldn't you?) you will need \-? (a "literal -", rather than part of a range) in the appropriate places; the ? makes it optional (i.e. 0 or 1 dashes). For example, /array\((\-?[\d\.]+),(\-?[\d\.]+)\+(\-?[\d\.]+),(\-?[\d\.]+)\)/
You might also want to check out http://regexpal.com - you can put in a regex and a set of strings, and it will highlight what matches/doesn't match. You will need to exclude the delimiter / or ^.
Note that this is a little fast and loose; it would also match array(5,0+0,1...........). You can nail it down a little more, for example, by using (\-?\d*\.\d+)\) instead of (\-?[\d\.]+)\) for the numbers; that will match (0 or 1 literal -) followed by (0 or more digits) followed by (exactly one literal dot) followed by (1 or more digits).
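Putting the pieces together, the tightened number pattern can be used four times to pull out both pairs. The sketch below is in Python rather than PHP (the regex itself is the same, minus the / delimiters), and `parse_pair` is a hypothetical helper name.

```python
import re

# Optional minus, optional integer digits, exactly one dot, 1+ fraction digits.
NUM = r"(-?\d*\.\d+)"
PATH = re.compile(r"array\(" + NUM + "," + NUM + r"\+" + NUM + "," + NUM + r"\)")

def parse_pair(value):
    """Return ((lat1, lng1), (lat2, lng2)) or None if value is not in the format."""
    m = PATH.fullmatch(value)
    if m is None:
        return None
    lat1, lng1, lat2, lng2 = (float(g) for g in m.groups())
    return (lat1, lng1), (lat2, lng2)
```

A value that comes back as None can then be treated as a real path rather than a pair of points.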
This is the regex I made:
array\((-*\d+\.\d+),(-*\d+\.\d+)\+(-*\d+\.\d+),(-*\d+\.\d+)\)
This also breaks the four numbers into groups so you can get the individual numbers.
You will note the repeated pattern of
(-*\d+\.\d+)
Explanation:
-* means 0 or more matches of the - sign (so the - sign is optional; note that it also accepts several in a row, whereas -? would allow at most one)
\d+ means 1 or more matches of a digit
\. means a literal period (decimal point)
\d+ means 1 or more matches of a digit
The whole thing is wrapped in parentheses to make it a captured group.
I've learnt a bit of basic regex to get my feet wet, but it's all still a bit too complicated for me. I need to take a set of user-inputted coordinates in decimal degrees (example):
$latitude = -42.323432;
$longitude = 176.232123;
and check whether they're valid using the preg_match() function in PHP. It seems simple, but I can't for the life of me write a regular expression that ensures no bad data gets through. I'll check the northing and easting separately, so this preg_match() will be run twice in a foreach loop.
I think I've figured out all the necessary conditions:
The first character can either be a minus, a plus, or a number. The minuses and pluses are optional.
The total count of numbers before the decimal point can be 1 to 3, but not 0 or above 3.
Therefore the decimal point must come directly after those 1 to 3 digits. (2.2332, -123.422)
There must be EXACTLY one decimal point in the whole string, and there can be 0 OR 1 minuses or pluses in the whole string.
I want at least 3 decimal places of precision AFTER the decimal point. There is no maximum limit (I'll simply round it to 6 dp myself)
If there are any characters besides numbers, a decimal point, and an optional plus and minus, reject it.
After this though, I'm stuck! Any help would be appreciated in writing the regex expression. Thanks...
Let's take a stab at it:
/^[+\-]?[0-9]{1,3}\.[0-9]{3,}\z/
Broken down:
^ - start of string
[+\-]? - zero or one from set of `+` and `-`
[0-9]{1,3} - 1 to 3 digits
\. - decimal point
[0-9]{3,} - 3 or more digits
\z - end of string
(note, this is untested ;))
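For anyone who wants to try the pattern quickly, here is a rough equivalent in Python. Python's re module does not accept \z in every version, so this sketch uses `fullmatch` to get the same "whole string" behaviour. The `limit` argument (90 for latitude, 180 for longitude) is an extra range check that the question did not ask for, and `is_valid_coordinate` is a made-up name.

```python
import re

# Optional sign, 1 to 3 digits, a decimal point, 3 or more digits.
COORD = re.compile(r"[+-]?[0-9]{1,3}\.[0-9]{3,}")

def is_valid_coordinate(text, limit):
    """Check the format first, then the magnitude against limit (an assumption)."""
    if COORD.fullmatch(text) is None:
        return False
    return abs(float(text)) <= limit
```

Calling it once with limit 90 for the latitude and once with limit 180 for the longitude covers both values.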
I'm looking for a function (ideally in PHP) which returns true if there exists a string that matches both regexpA and regexpB.
Example 1:
$regexpA = '[0-9]+';
$regexpB = '[0-9]{2,3}';
hasRegularsIntersection($regexpA,$regexpB) returns TRUE because '12' matches both regexps.
Example 2:
$regexpA = '[0-9]+';
$regexpB = '[a-z]+';
hasRegularsIntersection($regexpA,$regexpB) returns FALSE because no string of digits ever matches a pattern of letters.
Thanks for any suggestions how to solve this.
Henry
For regular expressions that are actually regular (i.e. don't use irregular features like back references) you can do the following:
Transform the regexen into finite automata (the algorithm for that can be found here (chapter 9), for example).
Build the intersection of the automata. You have a state for each pair in the Cartesian product of the states of the two automata, and you transition between these states according to the original automata's transition rules. E.g. if you're in state x1y2 and you get the input a, and the first automaton has a transition x1->x4 for input a and the second has y2->y3 for input a, you transition into state x4y3.
Check whether there's a path from the start state to an accepting state in the new automaton (a state is accepting only if both of its component states are accepting). If there is, the two regexen intersect; otherwise they don't.
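Steps 2 and 3 can be sketched like this, in Python for brevity. The automata are written out by hand as dictionaries (this sketch skips step 1, the regex-to-automaton conversion), the input alphabet is simplified to the two symbol classes "digit" and "letter", and the regexes are treated as anchored, i.e. they must match the whole string. All names here are made up for the illustration.

```python
from collections import deque

def intersects(dfa_a, dfa_b):
    """Return True if some string is accepted by both deterministic automata."""
    start = (dfa_a["start"], dfa_b["start"])
    seen = {start}
    queue = deque([start])
    # Breadth-first search over the reachable part of the product automaton.
    while queue:
        a, b = queue.popleft()
        if a in dfa_a["accept"] and b in dfa_b["accept"]:
            return True
        # Follow only the symbols on which both automata can move.
        for symbol, next_a in dfa_a["delta"].get(a, {}).items():
            next_b = dfa_b["delta"].get(b, {}).get(symbol)
            if next_b is None:
                continue
            pair = (next_a, next_b)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False

# Hand-built automata for the regexes in the question.
DIGITS_PLUS = {"start": 0, "accept": {1},                      # [0-9]+
               "delta": {0: {"digit": 1}, 1: {"digit": 1}}}
DIGITS_2_3 = {"start": 0, "accept": {2, 3},                    # [0-9]{2,3}
              "delta": {0: {"digit": 1}, 1: {"digit": 2}, 2: {"digit": 3}}}
LETTERS_PLUS = {"start": 0, "accept": {1},                     # [a-z]+
                "delta": {0: {"letter": 1}, 1: {"letter": 1}}}
```

Here `intersects(DIGITS_PLUS, DIGITS_2_3)` finds the accepting pair after two digits, while `intersects(DIGITS_PLUS, LETTERS_PLUS)` cannot leave the start pair at all. The product states are built lazily, so only reachable pairs are ever visited.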
Theory.
Java library.
Usage:
/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
    // Parse each regex and convert it to a finite automaton.
    Automaton automaton1 = new RegExp(regex1).toAutomaton();
    Automaton automaton2 = new RegExp(regex2).toAutomaton();
    // The regexes are orthogonal if no string is accepted by both automata.
    return automaton1.intersection(automaton2).isEmpty();
}
A regular expression specifies a finite state machine that can recognize a potentially infinite set of strings. The set of strings may be infinite but the number of states must be finite, so you can examine the states one by one.
Taking your second example: In the first expression, to get from state 0 to state 1, the string must start with a digit. In the second expression, to get from state 0 to state 1, the string must start with a letter. So you know that there is no string that will get you from state 0 to state 1 in BOTH expressions. You have the answer.
Taking the first example: You know that if the string starts with a digit you can get from state 0 to state 1 with either regular expression. So now you can eliminate state 0 for each, and just answer the question for each of the two (now one state smaller) finite-state-machines.
This looks a lot like the well-known "halting problem", which as you know is unsolvable in the general case for a Turing machine or equivalent. But in fact the halting problem IS solvable for a finite-state machine, simply because there are a finite number of states.
I believe you could solve this with a non-deterministic FSM. If your regex had only one transition from each state to the next, a deterministic FSM could solve it. But a regular expression means that, for instance, if you are in state 2, then if the character is a digit you go to state 3, else if the character is a letter you go to state 4.
So here's what I would do:
Solve it for the subset of FSM's that have only one transition from one state to the next. For instance a regex that matches both "Bob" and "bob", and a second regex that matches only "bob" and "boB".
See if you can implement the solution in a finite state machine. I think this should be possible. The input to a state is a pair representing a transition for one FSM and a transition for the second one. For instance: State 0: If (r1, r2) is (("B" or "b"), "b") then State 1. State 1: If (r1, r2) is (("o"), ("o")) then state 2. etc.
Now for the more general case, where for instance state two goes back to state two or an earlier state; for example, regex 1 recognizes only "meet" but regex 2 recognizes "meeeet" with an unlimited number of e's. You would have to reduce them to regex 1 recognizing "t" and regex 2 recognizing "t". I think this may be solvable by a non-deterministic FSM.
That's my intuition anyway.
I don't think it's NP-complete, just because my intuition tells me you should be able to shorten each regex by one state with each step.
It is possible. I encountered it once with Pellet OWL reasoner while learning semantic web technologies.
Here is an example that shows how regular expressions can be parsed into a tree structure. You could then (in theory) parse your two regular expressions into trees and see if one tree is a subset of the other, i.e. if one tree can be found within the other tree's nodes.
If it is found, then the other regular expression will match (not only, but also) a subset of what the first regular expression will match.
It is not much of a solution, but maybe it'll help you.