Regex to get street number with spaces - php

I'm still a regex newbie.
I have "nicely formatted" addresses and the data source will only give me nice Australian addresses.
I've got this far:
~([\w\d\-\/\.]*)\s*([\w\d '\-\.\ ()]+)~
Given the address,
123/500-550 Main Street
It will give me two groups (which is what I want):
123/500-550
Main Street
But I'm stuck on trying to accommodate spaces like:
123 / 500-550 Main Street
123 / 500-550 Main Street
123 / 500 - 550 Main Street
Can I maybe use a ^ and lookahead to detect the start of the street name like [\w\d '\-\.\ ()]+ and then get everything to left of it? If so, how?
https://regex101.com/r/kG32Sz/1

You can allow whitespace in the number part (and drop letters from it, by the way), then use a positive lookahead to detect where the street name starts:
([\d\-\/\.\s]*)(?=\s+\w)\s+([\w\d '\-\.\ ()]+)
Demo
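As a rough sketch of how that pattern could be used from PHP: the helper name splitAddress and the ~ delimiters are assumptions made for illustration, and the quote inside the character class is escaped only because of the PHP string literal.

```php
<?php
// Pattern from the answer above, wrapped in ~ delimiters.
const ADDRESS_PATTERN = '~([\d\-\/\.\s]*)(?=\s+\w)\s+([\w\d \'\-\.\ ()]+)~';

// Hypothetical helper: returns [number part, street part], or null on no match.
function splitAddress(string $address): ?array
{
    if (preg_match(ADDRESS_PATTERN, $address, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

$samples = ['123/500-550 Main Street', '123 / 500 - 550 Main Street'];
foreach ($samples as $address) {
    [$number, $street] = splitAddress($address);
    echo $number, ' | ', $street, PHP_EOL;
}
```

The greedy number group first swallows the trailing space, then backtracks one character so that the lookahead can see whitespace followed by a word character.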

Though generally not advisable, you could use
^ # start of line
(?P<street_number>[-/\d\h]+)\h+ # capture -, /, \d and \h => street_number
(?P<street_name>[A-Z][\w\h]+) # capture an UPPERCASE letter,
# followed by \w and \h => street_name
$ # end of line
See a demo on regex101.com (and mind the modifiers: the inline comments require x, and matching line by line requires m).
Better use a library with previously known street names instead (or query it with the expression above, that is).
You may add other allowed characters in the class, such as [-'.\w\h]. Note that most characters do not need to be escaped within a class.
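A minimal sketch of the verbose pattern in PHP, assuming a single address per string (so only the x modifier is used). The constant and function names are hypothetical.

```php
<?php
// Verbose pattern from the answer above; the x flag permits whitespace and # comments.
const STREET_PATTERN = '~
    ^                                  # start of string
    (?P<street_number>[-/\d\h]+)\h+    # digits, -, / and horizontal whitespace
    (?P<street_name>[A-Z][\w\h]+)      # uppercase letter, then word chars and spaces
    $                                  # end of string
~x';

// Hypothetical helper: returns the two named groups, or null on no match.
function parseStreet(string $address): ?array
{
    if (preg_match(STREET_PATTERN, $address, $m) !== 1) {
        return null;
    }
    return [
        'street_number' => $m['street_number'],
        'street_name'   => $m['street_name'],
    ];
}

var_dump(parseStreet('123 / 500 - 550 Main Street'));
var_dump(parseStreet('12 main street')); // lowercase street name: no match
```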

Related

preg - Difference between Search Patterns with [] and without

It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The Pattern should start with a word boundary, but why is the result different, when I put it in []{1} ??
The first one works like I expected, but the second not. The problem is, that I want to put more into the [], so that the pattern can start with a word boundary OR a small character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word-break (\b) at the beginning, sometimes not. (But I need the word-break.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers where the Name has a word-break at the beginning. (In this example Schubert, Reuschenbach, Spenner) but of course not Stroinski. So, I try this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
Does not work. Even if I remove the 0-9 and only put [\b]{1} at the beginning, it doesn't find any hit.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.
The [\b] is a character class that only matches a backspace char (\u0008).
See PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} is the same as . on its own: the {1} limiting quantifier is redundant, and writing it only makes sense when your patterns are built dynamically from variables.
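The difference can be checked directly. The sketch below also shows one possible workaround, which is an assumption and not part of the answer: an alternation between the \b assertion and a lookbehind for a digit. The name pattern is simplified to ASCII letters, and RACER_PATTERN and findRacers are hypothetical names.

```php
<?php
// Inside a character class, \b means the backspace character (0x08).
$classMatchesBackspace = preg_match('/[\b]/', "\x08"); // 1
$classBeforeLetter     = preg_match('/[\b]b/', 'a b'); // 0: there is no backspace before "b"
$assertionBeforeLetter = preg_match('/\bb/', 'a b');   // 1: zero-width word boundary

// Assumed workaround: word boundary OR "preceded by a digit".
const RACER_PATTERN = '#(?:\b|(?<=[0-9]))([A-Z][a-z]+)\s([A-Z]{2,40})#';

function findRacers(string $text): array
{
    preg_match_all(RACER_PATTERN, $text, $hits, PREG_SET_ORDER);
    return array_map(fn(array $hit) => $hit[1] . ' ' . $hit[2], $hits);
}

$text = "Martin SCHUBERT / Lukas REUSCHENBACH155\n151Kostja STROINSKI / Kai SPENNER";
print_r(findRacers($text));
```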

RegEx to extract city and state from string AND know when someone leaves out the state part

I have the following code:
preg_match("/^(.+)[,\\s]+(.+?)\s*(\d{5})?$/", trim($searchbox), $matches);
list($arr['add'], $arr['city'], $arr['state']) = $matches;
$citystr = trim(str_replace(',', '', $arr['city']));
$statestr = trim($arr['state']);
This works great when someone types in "Granite Bay, CA"; however, I would like to modify it to catch when someone leaves out the ", CA" part. So if someone only types "granite Bay", the code above takes "Bay" as the state, and that's no good. It also fails if someone adds a zip to the end, like "Granite Bay, CA 00000".
Are there any modifications to this regex that I can make to avoid both of these scenarios?
TIA
Yes, you can build a less permissive/more detailed pattern:
^\h*([^,\s]+(?:\h+[^,\s]+)*+)\h*(?:,\h*([A-Z]+))?\h*(\d{5})?\h*$
demo
([^,\s]+(?:\h+[^,\s]+)*+) captures the city name as something that neither starts nor ends with whitespace and that may consist of several parts.
(?:,\h*([A-Z]+))? makes all the state part optional. Note that I have chosen only uppercase letters for the state, but you can also make it case insensitive, it doesn't matter since the important point is the comma.
As an aside, if you want to be sure of what a user enters, use one field per piece of information (one for the city, one for the state, one for the zip code).
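A minimal sketch of how this pattern might replace the preg_match/list code from the question. The helper name parseLocation and the use of null for missing parts are assumptions.

```php
<?php
// Pattern from the answer above, with / delimiters.
const CITY_STATE_PATTERN = '/^\h*([^,\s]+(?:\h+[^,\s]+)*+)\h*(?:,\h*([A-Z]+))?\h*(\d{5})?\h*$/';

// Hypothetical helper: a null field means the user left that part out.
function parseLocation(string $input): ?array
{
    if (preg_match(CITY_STATE_PATTERN, $input, $m) !== 1) {
        return null;
    }
    // Trailing unmatched groups are absent from $m; unmatched middle groups are ''.
    return [
        'city'  => $m[1],
        'state' => ($m[2] ?? '') !== '' ? $m[2] : null,
        'zip'   => ($m[3] ?? '') !== '' ? $m[3] : null,
    ];
}

foreach (['Granite Bay, CA', 'granite Bay', 'Granite Bay, CA 00000'] as $input) {
    var_dump(parseLocation($input));
}
```

One caveat of the pattern as written: without a comma, a trailing zip such as "Granite Bay 00000" is absorbed into the city group, because the possessive city part also accepts digits.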
You could go for:
^ # start of the string
(?P<town>[A-Z][^,]+) # uppercase, followed by not a comma
(?> # an atomic group (it does not capture either)
,\h*\K # a comma, horizontal whitespace, \K
(?P<state>[A-Z]{2}) # two UPPERCASE letters
)? # make the whole group optional
See a demo on regex101.com.
To be sure, you'll likely need some database of towns and states to check against, though (the above expression allows XY for a state as well), or, as @Casimir points out, use a separate field for each piece of information.

Parse Fixed Width File PHP Regular Expression

I have a export file that contains data of fixed widths:
Name (6)
Gender (6)
Phone Number (12) - Includes a space
Data.txt
DanielMale  (07654) 521254
Lisa  Female(16545) 654456
Sarah Female(54656) 4896546
I need to extract the name and gender data including any spaces if the data doesn't fit the data width.
The brackets for the phone number need to be ignored. (How do you ignore items in regular expressions?)
I currently have the following regular expression, which pulls out the people's names. I thought I could simply add the bit on the right to make it pull out the genders, but this doesn't work. Where am I going wrong?
/(?<name>.{6}+) (?<gender>.{6}+)/
I need the data to look like this at the end.
^ = space
Daniel
Male^^
07654 521254
Lisa^^
Female
16545 654456
Sarah^
Female
54656 4896546
This should catch all four fields:
/^(?<name>.{6})(?<gender>.{6})\((?<prefix>[^\)]+)\)\ (?<number>.+)/
The first ^ means match from the start.
{6}: match the preceding pattern exactly 6 times (the + after it only makes the quantifier possessive, which is redundant for a fixed count)
\( and \): match brackets (without escaping they would mean boundaries of a subpattern)
[^\)]+ means "everything before the first closing bracket"
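A short sketch of that pattern applied line by line, assuming each row keeps its space padding (6 characters for the name, 6 for the gender). The helper name parseRow and the combined phone field are illustrative choices.

```php
<?php
// Pattern from the answer above; the brackets are matched but left out of the groups.
const ROW_PATTERN = '/^(?<name>.{6})(?<gender>.{6})\((?<prefix>[^\)]+)\)\ (?<number>.+)/';

// Hypothetical helper: returns the padded fields, or null if the row does not fit.
function parseRow(string $line): ?array
{
    if (preg_match(ROW_PATTERN, $line, $m) !== 1) {
        return null;
    }
    return [
        'name'   => $m['name'],
        'gender' => $m['gender'],
        'phone'  => $m['prefix'] . ' ' . $m['number'],
    ];
}

$rows = [
    'DanielMale  (07654) 521254',
    'Lisa  Female(16545) 654456',
    'Sarah Female(54656) 4896546',
];
foreach ($rows as $line) {
    var_dump(parseRow($line));
}
```

"Ignoring" the brackets here simply means matching them outside any capturing group, so they are consumed without appearing in the result.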

Regex optional groups

I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are less than 4 trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (correctly enclosing each paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above is assuming you were putting your regex between two / "delimiters" characters (out of conventional habit).
To dive a little deeper into the rabbit-hole, one should note that in PHP a regular expression is enclosed in a pair of "delimiter" characters, so that one can add modifiers after the closing one (like case-insensitive etc.).
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
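To illustrate the delimiter point together with the optional groups, here is a small sketch using ~ as the delimiter. It keeps the first paragraph mandatory, and the function name extractTrivia and the sample text are made up for the example.

```php
<?php
// ~ delimiters: the / characters in the closing tags need no escaping.
const TRIVIA_PATTERN = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';

// Hypothetical helper: returns between zero and four trivia items.
function extractTrivia(string $html): array
{
    if (preg_match(TRIVIA_PATTERN, $html, $m) !== 1) {
        return [];
    }
    // Drop the full match; unmatched trailing groups are absent, empty ones are filtered out.
    $items = array_filter(array_slice($m, 1), fn(string $s) => $s !== '');
    return array_values($items);
}

$twoItems = '<h5>Trivia</h5><p>Supports Southampton FC.</p><p>Listens to soul.</p>';
print_r(extractTrivia($twoItems));
```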

PHP regular expressions (phonenumber)

I'm having some trouble with a regular expression for phone numbers. I am trying to create a regex that is as broad as possible for European phone numbers. The phone number can start with a + or with two leading 0's, followed by a number between 0 and 40. This prefix is not required, however, so the first part can also be omitted. After that, it should all be digits, in groups of at least two, with a whitespace or a - in between the groups.
The regex I have put together can be found below.
/((\+|00)+[0-4]+[0-9]+)?([ -]?[0-9]{2,15}){1,5}/
This should match the following structures
0031 34-56-78
0032123456789
0033 123 456 789
0034-123-456-789
+35 34-56-78
+36123456789
+37 123 456 789
+38-123-456-789
...
According to my JavaScript test, it also matches:
+32 a54b 67-0:
So I must have made a mistake somewhere, but I really can't see it. Any help would be appreciated.
The problem is that you don't use the anchors ^ and $ to mark the start and end of the string, so the regex will find a match anywhere in the string.
/^((\+|00)+[0-4]+[0-9]+)?([ -]?[0-9]{2,15}){1,5}$/
Adding anchors will do the trick. More about these meta characters can be found here.
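The effect of the anchors can be sketched in PHP as follows. The constant and function names are hypothetical; the patterns are the ones from the question and the answer.

```php
<?php
// The question's pattern, without and with the ^ ... $ anchors.
const PHONE_LOOSE    = '/((\+|00)+[0-4]+[0-9]+)?([ -]?[0-9]{2,15}){1,5}/';
const PHONE_ANCHORED = '/^((\+|00)+[0-4]+[0-9]+)?([ -]?[0-9]{2,15}){1,5}$/';

// Hypothetical helper: true only if the whole input looks like a phone number.
function isPhoneNumber(string $input): bool
{
    return preg_match(PHONE_ANCHORED, $input) === 1;
}

$bad = '+32 a54b 67-0:';
var_dump(preg_match(PHONE_LOOSE, $bad) === 1); // true: the substring "32" is enough
var_dump(isPhoneNumber($bad));                 // false: the whole string must match
var_dump(isPhoneNumber('0033 123 456 789'));   // true
```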
Try this; it may help you (ereg() was removed in PHP 7, so preg_match() is used here instead):
if (preg_match('/^((\([0-9]{3}\) ?)|([0-9]{3}-))?[0-9]{3}-[0-9]{4}$/', $var))
{
    $valid = true;
}
Put ^ at the beginning of the RegExp and $ at the end.
