I am trying to build a pattern to match all counties from a sentence
eg.
"ABCD XYZ County Herefordshire or Co.Kent or London County"
((co(unty)?\s)|(co\.\s?))?(?P<county>[a-z]{4,})(\scounty)?
But the above pattern will also return "ABCD", because both expressions around "county" are optional.
Do I have to use two separate regular expressions or is there any way around it?
EDIT
What I am trying to do is get all the counties from a sentence. I consider a word a county name if it is followed by "county" or preceded by any of "co.", "co ", "county ". Multiple expressions like that, separated by " or ", are allowed. Once matched, the next step would be to remove the whole expression, e.g. "Co.London", from the original string.
EDIT 2
OK, sorry for the confusion; I know my question isn't clear. What I am trying to do is:
1. User enters something like 'ABCD County XYZ or Co.London or Kent County or county Herefordshire'
2. I want to get anything that is any of: "co.word", "co word", "county word" or "word county". So ideally I should get this: 'ABCD County,County XYZ,Co.London,Kent County,county Herefordshire'
3. I remove 'county' or 'co' etc from matched expression and check each against list of counties I have. If word is a county name I want to remove the whole expression from the original query.
You can do what you're looking for by first matching the group that has it before the text you're matching, and then matching it when it's after it. That explanation is probably unclear, so let me illustrate it this way:
You want to match foo that's either before or after bar:
(bar)foo|foo(bar)
of course in this case the parentheses are not required, but it's to illustrate that it's a group.
In your case, if I'm understanding it correctly, you'd need the following:
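((co(unty)?\s)|(co\.\s?))(?P<county>[a-z]{4,})|(?P<county2>[a-z]{4,})(\scounty)
or with reduced amount of parentheses (the second named group is called county2 because most engines reject two groups with the same name):
(co(unty)?\s|co\.\s?)(?P<county>[a-z]{4,})|(?P<county2>[a-z]{4,})\scounty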
I'm not quite sure what the (?P is supposed to mean though. Regex101 doesn't recognise it either.
In reply to Johannes' comment, what you could do is only match words starting with an uppercase letter:
([Cc]o(unty|\.)? ?)([A-Z]\w+)|([A-Z]\w+) [Cc]ounty
That would also match a word that is capitalised only because it starts a sentence, though, so you could prevent it from matching that via:
([Cc]o(unty|\.)? ?)([A-Z]\w+)|((?<![.!?] |.\n)[A-Z]\w+) [Cc]ounty
Then again, if the county name is at the start of the sentence, it won't be matched either, but that's a trade-off you're going to have to choose. Regex can't make a distinction between a county name and a regular word at the start of a sentence.
Demo of the last mentioned regex.
Update per your comments: You can match every word that is followed or preceded by one of the named keywords (including ones that are not necessarily county names) by using the following:
((?<=county\s)|(?<=co\s)|(?<=co\.))(?P<county>[a-z]{4,})|(?P<county2>[a-z]{4,})(?=\scounty)
demo.
That uses lookarounds (lookbehinds and a lookahead), so it only matches the actual word, not the word "county". You could therefore even omit the named capturing groups and use the list of matches directly, instead of filtering it down to the named groups. As you can see in the demo, the only text actually matched is the text you're looking for.
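The lookaround pattern above can be tried on the sentence from the question. The following is a minimal sketch in Python, whose re module accepts the same (?P<name>...) syntax; KNOWN_COUNTIES and candidate_counties are hypothetical names made up for illustration, and the county list is a stand-in for the asker's real one.

```python
import re

# Same pattern as above; the outer group is made non-capturing for clarity.
CANDIDATE = re.compile(
    r"(?:(?<=county\s)|(?<=co\s)|(?<=co\.))(?P<county>[a-z]{4,})"
    r"|(?P<county2>[a-z]{4,})(?=\scounty)",
    re.IGNORECASE,
)

# Hypothetical stand-in for the real list of counties.
KNOWN_COUNTIES = {"london", "kent", "herefordshire"}


def candidate_counties(text):
    """Return every word that follows or precedes one of the keywords."""
    return [m.group("county") or m.group("county2")
            for m in CANDIDATE.finditer(text)]


query = "ABCD County XYZ or Co.London or Kent County or county Herefordshire"
candidates = candidate_counties(query)
confirmed = [c for c in candidates if c.lower() in KNOWN_COUNTIES]
print(candidates)  # ['ABCD', 'London', 'Kent', 'Herefordshire']
print(confirmed)   # ['London', 'Kent', 'Herefordshire']
```

Note that "XYZ" is skipped because the pattern requires at least four letters, and "ABCD" is only dropped by the lookup against the county list, not by the regex.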
I'm in need of a PHP regular expression to capture the first initial and last name of people listed in a text document, but only when the sentence or line contains one of a few keywords (from, with, of, and, as, observed). My current attempt captures list items, i.e. "A. General" or "B. Issues", because it doesn't seem to care about what's in front of the names.
I've been using preg_match_all() with hopes of it returning an array of names (first initial, last name).
Example text
"from J. Smith and B. Miller"
"as T. Baker observed M. Kelly"
"We inquired with B. Brown, T. Stark and J. Maddox."
I've tried
$regex = "/[from|with|of|and|as|observed|,|.]\s+([A-Z]. \w+)/";
$regex = "/((from|with|of|and|as|observed|,|.)\s+([A-Z]. \w+))/";
$regex = "/\b(from|with|of|and|as|observed|,|.)\s+([A-Z].\ \w+)/";
$regex = "/\b(from|with|of|and|as|observed|,|.|\b)\s+([A-Z].\ \w+)/";
I cannot make it only capture when the word list is before the names. I can't use ^ to check 'starts with'. I'm horrible at regex and guess until it works. I feel the solution requires some sort of look-behind assertion, though I'm not sure how it works.
Output
Should be an array
[ 'J. Smith', 'B. Miller' ]
[ 'T. Baker', 'M. Kelly' ]
[ 'B. Brown', 'T. Stark', 'J. Maddox' ]
UPDATE
Final Regexp
$regex = "/\b(?:from|with|of|and|as|observed|,)\s+([A-Z].\ \w+)/";
Seems to work with the few documents I have. Thanks everyone!!
You can use this modified version of your third regex :
\b(?:from|with|of|and|as|observed|,)\s+([A-Z]\.\ \w+)
You need to escape . in the first group or it will accept any character. (Not relevant after the edit.) The same goes for the . after [A-Z], which is escaped above.
PHP has no \g flag; preg_match_all() already finds every occurrence of the pattern, and you will be able to access the results in $matches[1].
(The added ?: in the first group prevents it from being captured. You can remove it if you need to know the keyword, but then the results will be stored in $matches[2].)
Edit : Removed \. in first group to not match end of sentences (see author comment).
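To see how the keyword-prefixed pattern behaves on the example text, here is a minimal sketch. It is written in Python rather than PHP purely for illustration (re.findall plays the role of preg_match_all), the dot after the initial is escaped, and NAME_AFTER_KEYWORD and find_names are hypothetical names.

```python
import re

# Keyword (or comma), whitespace, then "X. Surname" captured in group 1.
NAME_AFTER_KEYWORD = re.compile(
    r"\b(?:from|with|of|and|as|observed|,)\s+([A-Z]\. \w+)"
)


def find_names(line):
    """Return the initial-plus-surname strings that follow a keyword."""
    return NAME_AFTER_KEYWORD.findall(line)


print(find_names("from J. Smith and B. Miller"))
# ['J. Smith', 'B. Miller']
print(find_names("as T. Baker observed M. Kelly"))
# ['T. Baker', 'M. Kelly']
print(find_names("We inquired with B. Brown, T. Stark and J. Maddox."))
# ['B. Brown', 'T. Stark', 'J. Maddox']
print(find_names("A. General"))
# []
```

The list item "A. General" is not captured because nothing from the keyword group precedes it.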
You can try looking for a capital letter followed by a dot and a word
[A-Z]\.\s\w+
I think this should work
/(?!^from|with|of|and|as|observed|\s)([A-Z]{1,}\.\s\w*)/g
Where
(?! ... ) = a negative lookahead: it asserts that the text at the current position does not start with any of the listed alternatives (including \s, a space), and it consumes no characters.
^ = matches the beginning of the line/sentence/string
Then the second group matches one or more capital letters ([A-Z]{1,}), then a dot \., a space \s and the word \w*
The /g at the end stands for "global search" (PHP has no g flag; preg_match_all() plays that role there)
https://regexr.com/3pa9o
OK, I've worked with RegEx numerous times but this is one of the things I honestly can't get my head around. And it looks as if I'm missing something rather simple...
So, let's say we want to match "AB" or "AC". In other words, "A" followed by either "B" OR "C".
This would be expressed like A[BC] or A[B|C] or A(B|C) and so on.
Now, what if A,B,C are not just single letters but sub-expressions?
Please, have a look at this example here (well, I admit it doesn't look that... simple! lol) : http://regexr.com?382a4
I'm trying to match capital = (and its variations) followed by either :
Pattern 1
Pattern 2
Why is it that using the | operator only works on the latter part (my regex also matches "Pattern 2" withOUT preceding capital =). Please note that I've also tried using positive look-arounds, but without any success.
Any ideas?
Your original regex could be summarized as:
capital = (ABC)|(DEF)
This matches either "capital = ABC" or just "DEF" on its own, because | splits the whole expression. Add an extra pair of () that wraps the | clause properly.
Demo here
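The difference between the two groupings can be checked quickly. A minimal sketch in Python, using ABC and DEF as stand-ins for the real sub-patterns as in the summary above; the names loose and tight are made up for illustration.

```python
import re

# Alternation has the lowest precedence, so "loose" means:
#   "capital = ABC"   OR   "DEF"
loose = re.compile(r"capital = (ABC)|(DEF)")
# Wrapping the alternatives keeps the prefix mandatory for both.
tight = re.compile(r"capital = (ABC|DEF)")

for text in ("capital = ABC", "capital = DEF", "DEF"):
    print(text, bool(loose.fullmatch(text)), bool(tight.fullmatch(text)))
# capital = ABC True True
# capital = DEF False True
# DEF True False
```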
I suppose this regexp:
capital = (ABC|XYZ)
should work (if I understood your request correctly...)
Actually [B|C] is incorrect, (B|C) is correct.
Character classes
In RegEx jargon [] is called a character class and it is used to represent one (single) character according to the options listed between the brackets.
In your case [B|C] matches either B or | or C. We can correct this by using [BC] to match either B or C. This matches exactly one character either B or C.
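A short check of the character-class behaviour described above, as a sketch in Python (the variable names are arbitrary):

```python
import re

pipe_class = re.compile(r"A[B|C]")   # class contains B, | and C
plain_class = re.compile(r"A[BC]")   # class contains only B and C

print(bool(pipe_class.fullmatch("A|")))    # True: | is a literal inside []
print(bool(plain_class.fullmatch("A|")))   # False
print(bool(plain_class.fullmatch("AB")))   # True
print(bool(plain_class.fullmatch("ABC")))  # False: a class matches one character
```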
Capturing groups
In RegEx jargon () is called a capturing group. It is used to create boundaries between adjacent groups and whatever it matches will be present in the output array of a preg_match or as a variable in preg_replace.
Within that group you can use the | operator to specify that you want to match either whatever's before or whatever's after the operator.
This can be used to match strings with more than one character such as (Ana|Maria) or various structures such as ([a-zA-Z]+|[0-9]+).
You can also use the | outside of a capturing group such as (group-1)|(group-2) and you can also use subgrouping such as ((group-1)|(group-2)).
I am trying to create a regular expression which searches a huge document for a person's full name. In the text the name can be written in full, or the first names can be abbreviated to a single letter, abbreviated to a letter followed by a dot, or omitted. For instance, my search for ALBERTO JORGE ALONSO CALEFACCION is now:
preg_match('/([;:.,&\s\xc2\-(){}!"\'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"\'<>]{1})/i', $text, $match);
Between the first names and last names an asterisk (*) can be present.
This works when all the first names are present in some form. But I don't know how to extend the expression for the case where first names are omitted. Can you help me?
Let's start by simplifying what you have;
start:
/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"'<>]{1})/i
as I said in my comment, \b is "word break", so you can simplify a lot of that:
/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
(added bonus: it won't match the characters either side now, and it will match at the start and end of the text)
Next, you can use the ? token for the dots (which should be escaped by the way; . is special and means "match anything")
/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Finally, to actually answer your question, you have 2 choices. Either make the entire bracketed name optional, or add a new blank option. The first is the most flexible, since we'll need to cope with the whitespace too:
/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Note that if you're reading the matched parts you'll need to update your indices. Also note that this fixed an issue where omitting the second name (JORGE) still required an extra space.
This will match things like A. J. ALONSO CALEFACCION, A. ALONSO CALEFACCION and ALONSO CALEFACCION, but not J. ALONSO CALEFACCION (it's only a small tweak if you do want that)
Breaking up that final string for clarity:
/\b
(
(ALBERTO|A\.?)[\s\xc2-]+
(
(JORGE|J\.?)[\s\xc2,]+
)?
)?
(ALONSO)[\s\xc2*-]+
(CALEFACCION)
\b/i
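The broken-up layout above can be used as-is if the engine's extended/verbose mode is switched on (the x flag in PCRE, re.VERBOSE in Python), since that mode ignores whitespace in the pattern. A minimal sketch in Python, testing the variants mentioned earlier; NAME is a hypothetical variable name and fullmatch is used so that partial matches do not count.

```python
import re

NAME = re.compile(
    r"""
    \b
    (                               # group 1: optional first names
        (ALBERTO|A\.?)[\s\xc2-]+    # group 2: first name or initial
        (                           # group 3: optional second name
            (JORGE|J\.?)[\s\xc2,]+  # group 4: second name or initial
        )?
    )?
    (ALONSO)[\s\xc2*-]+             # group 5
    (CALEFACCION)                   # group 6
    \b
    """,
    re.IGNORECASE | re.VERBOSE,
)

for text in (
    "A. J. ALONSO CALEFACCION",
    "A. ALONSO CALEFACCION",
    "ALONSO CALEFACCION",
    "ALBERTO JORGE ALONSO * CALEFACCION",
    "J. ALONSO CALEFACCION",
):
    print(text, "->", bool(NAME.fullmatch(text)))
```

Only the last line prints False: a lone second initial is not accepted. With a plain search instead of fullmatch, that string would still yield "ALONSO CALEFACCION" as a partial match.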
Finally, it's an odd thought, but you could change the names which can be initials to be in this form: (A(LBERTO|\.|)), which means you're not repeating the initials (a potential source of mistakes)
I have a string that contains 5 words. One of the words in the string is a Ham Radio Call Sign and can be any one of the thousands of call signs in the US. In order to extract the Call Sign from the string I need to use the patterns below. The Call Sign I need to extract can be in any of the 5 positions in the string. The number is never the first character and never the last character. The string is actually put together from an array, since it is originally read from a text file.
$string = $word[1] $word[2] $word[3] etc....
So the search can be either done on the whole string or each piece of the array.
Patterns:
1 Number and 3 Letters Example: AB4C A4BC
1 Number and 4 Letters Example: A4BCD
1 Number and 5 Letters Example: AB4CDE
I have tried everything I can think of and searched until I can't search any more. I am sure I am overthinking this.
A two-step regular expression like this would do it:
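$str = "hello A4AB there BC5AD";
$signs = array();
// Step 1: whole words of 4 to 6 capitals/digits that start and end with a letter
preg_match_all('/\b[A-Z][A-Z\d]{2,4}[A-Z]\b/', $str, $possible_signs);
foreach ($possible_signs[0] as $possible_sign) {
    // Step 2: keep only candidates that contain exactly one digit
    if (preg_match('/^\D+\d\D+$/', $possible_sign)) {
        $signs[] = $possible_sign;
    }
}
print_r($signs); // Array ([0] => A4AB [1] => BC5AD)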
Explanation
This is a regular expression approach, using two patterns. I don't think it could be done with one and still satisfy the exact requirements of the matching rules.
The first pattern enforces the following requirements:
substring starts and ends with a capital letter
substring contains only other capital letters or numbers between the first and last letter
substring is, overall, not more than 6 characters long
What I can't do in that same pattern, for complex REGEX reasons I won't go into (unless someone knows a way and can correct me), is enforce that only one number is contained.
@jeroen's answer does enforce this in a single pattern, but in turn does not enforce the correct length of the substring. Either way, we need a second pattern.
So after grabbing the initial matches, we loop over the results. We then apply each to a second pattern that enforces simply that there is only one number in the substring.
If so, we green-light the substring and it's added to the $signs array.
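The answer above invites a correction on the single-pattern point. One possible approach, offered as a sketch under the question's stated rules rather than a definitive answer, is to let a lookahead check the overall length (4 to 6 characters) while the main pattern enforces "letters, one digit, letters". It is shown in Python, but lookaheads work the same way in PHP's PCRE; CALL_SIGN and find_call_signs are hypothetical names.

```python
import re

# (?=[A-Z\d]{4,6}\b)  -> the whole word is 4 to 6 capitals/digits
# [A-Z]+\d[A-Z]+      -> exactly one digit, never first and never last
CALL_SIGN = re.compile(r"\b(?=[A-Z\d]{4,6}\b)[A-Z]+\d[A-Z]+\b")


def find_call_signs(text):
    """Return every word in text that has the call-sign shape."""
    return CALL_SIGN.findall(text)


print(find_call_signs("hello A4AB there BC5AD"))       # ['A4AB', 'BC5AD']
print(find_call_signs("AB4CDE AB12CD 4ABC ABC4 ABCD4EFG"))  # ['AB4CDE']
```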
Hope this helps.
It depends on what the other words can contain, but you could use a regular expression like:
#\b[a-z]+\d[a-z]+\b#i
                    ^ case insensitive
                 ^^ a word boundary
           ^^^^^^ One or more letters
         ^^ One number
You can make it more restrictive by using {1,3} instead of + for the letters so that you have a sequence of 1 to 3 letters.
The complete expression would be something like:
$success = preg_match('#\b[a-z]+\d[a-z]+\b#i', $input_string, $matches);
where $matches[0] will contain the matched value, see the manual.
I am making a search criterion where it should search for inputted value (at least partial search).
An example is given below.
It only finds words with an EXACT (case-sensitive) match, but I want it to match both upper and lower case.
$rr="In";
$matched_list =array('India','Pakistan','Ausis');
$m=preg_grep('/'.$rr.'/', $matched_list);
print_r($m);
It finds only "India" and not "india".
What should I do to make it find "india" as well?
Thanks in advance..
Very easy, just add an i (for case insensitive) after the closing /:
$m=preg_grep('/'.$rr.'/i', $matched_list);
Also, as a minor note: your expression "In" would also match "China" and other words that contain it somewhere in the middle. If this isn't intended, you'll have to tell it to look at the beginning only:
$rr="^In"; // ^ will match the beginning of the string or a line (depending on the settings)
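Both suggestions together (the case-insensitive flag plus anchoring at the start) can be sketched as follows. This is Python rather than PHP, purely for illustration; grep_prefix is a hypothetical helper, and re.escape plays the role that preg_quote() would in PHP, so that user input containing regex metacharacters is treated literally.

```python
import re


def grep_prefix(term, items):
    """Return the items that start with term, ignoring case."""
    # re.escape keeps characters such as "." or "(" in the input literal.
    pattern = re.compile("^" + re.escape(term), re.IGNORECASE)
    return [item for item in items if pattern.search(item)]


countries = ["India", "Pakistan", "Ausis", "China", "india"]
print(grep_prefix("In", countries))  # ['India', 'india']
print(grep_prefix("in", countries))  # ['India', 'india']
```

Without the ^ anchor, "China" would be returned as well, since it contains "in".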