I need a PHP regular expression to capture the first initial and last name of people listed in a text document, but only when the sentence or line contains one of a few keywords (from, with, of, and, as, observed). My current attempt also captures list items, e.g. "A. General" or "B. Issues", because it doesn't seem to care about what's in front of the names.
I've been using preg_match_all() in the hope of it returning an array of names (first initial, last name).
Example text
"from J. Smith and B. Miller"
"as T. Baker observed M. Kelly"
"We inquired with B. Brown, T. Stark and J. Maddox."
I've tried
$regex = "/[from|with|of|and|as|observed|,|.]\s+([A-Z]. \w+)/";
$regex = "/((from|with|of|and|as|observed|,|.)\s+([A-Z]. \w+))/";
$regex = "/\b(from|with|of|and|as|observed|,|.)\s+([A-Z].\ \w+)/";
$regex = "/\b(from|with|of|and|as|observed|,|.|\b)\s+([A-Z].\ \w+)/";
I cannot make it only capture when the word list is before the names. I can't use ^ to check 'starts with'. I'm horrible at regex and guess until it works. I feel the solution requires some sort of look-behind assertion, though I'm not sure how it works.
Output
Should be an array
[ 'J. Smith', 'B. Miller' ]
[ 'T. Baker', 'M. Kelly' ]
[ 'B. Brown', 'T. Stark', 'J. Maddox' ]
UPDATE
Final Regexp
$regex = "/\b(?:from|with|of|and|as|observed|,)\s+([A-Z]\.\ \w+)/";
Seems to work with the few documents I have. Thanks everyone!!
You can use this modified version of your third regex:
\b(?:from|with|of|and|as|observed|,)\s+([A-Z]\.\ \w+)
An unescaped . matches any character, so the dot after the initial is escaped as \. here. (The first group originally contained a dot as well; that is no longer relevant after the edit below.)
PHP has no g flag: preg_match_all() already finds every occurrence of the pattern, and you will be able to access the names in $matches[1].
(The added ?: in the first group prevents it from being captured. You can remove it if you need to know the keyword, but then the names will be stored in $matches[2].)
Edit : Removed \. in first group to not match end of sentences (see author comment).
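For anyone who wants to try this quickly, here is a minimal sketch of the pattern used with preg_match_all() on the sample sentences from the question. The helper name extractNames() is made up for illustration, and the dot after the initial is escaped.

```php
<?php
// extractNames() is a hypothetical helper, not part of the original answer.
function extractNames(string $line): array
{
    // Keyword (or comma), whitespace, then "X. Lastname" captured in group 1.
    $regex = '/\b(?:from|with|of|and|as|observed|,)\s+([A-Z]\.\ \w+)/';
    preg_match_all($regex, $line, $matches);
    return $matches[1];
}

$lines = [
    'from J. Smith and B. Miller',
    'as T. Baker observed M. Kelly',
    'We inquired with B. Brown, T. Stark and J. Maddox.',
    'A. General',
];

foreach ($lines as $line) {
    echo implode(', ', extractNames($line)), "\n";
}
```

The list item "A. General" yields an empty array because no keyword precedes it.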
You can try looking for a capital letter followed by a dot and a word
[A-Z]\.\s\w+
I think this should work
/(?!^from|with|of|and|as|observed|\s)([A-Z]{1,}\.\s\w*)/g
Where
(?!...) = a negative lookahead: it asserts that the text at the current position does not start with any of the listed alternatives (including \s, a whitespace character), and it consumes no characters itself.
^ = matches the beginning of the line/string
The second group then matches one or more capital letters [A-Z]{1,}, followed by a dot \., a whitespace character \s and the word \w*
The /g at the end stands for "global search" on regexr; PHP has no such flag, and preg_match_all() does that job instead.
https://regexr.com/3pa9o
Related
So I am stuck. I have looked at tons of answers here, but none seems to resolve my last problem.
Through an API with JSON, I receive an equipment list in camelcase format. I cannot change that.
I need this camelcase to be translated into normal language.
So far I have gotten most words separated through:
$string = "SomeEquipmentHere";
$spaced = preg_replace('/([A-Z])/', ' $1', $string);
var_dump($spaced);
string ' Some Equipment Here' (length=20)
$trimmed = trim($spaced);
var_dump($trimmed);
string 'Some Equipment Here' (length=19)
Which is working fine, but some of the equipment names contain abbreviations:
"ABSBrakes" - this would require ABS to be separated from Brakes
I can't check for several uppercase letters next to each other, since that would keep ABS and Brakes together. There are more like these, e.g. "CDRadio".
So what I want is the output to be:
"ABS Brakes"
Is there a way to format it so that, if there are uppercase letters next to each other, a space is only added before the last uppercase letter of that sequence?
I am not strong in regex.
EDIT
Both contributions are awesome - people coming here later should read both answers
The last remaining problems are the following patterns:
"ServiceOK" becomes "Service O K"
"ESP" becomes "ES P"
A string consisting purely of an uppercase abbreviation is handled by a function that counts lowercase letters; if there are none, it skips the preg_replace().
But as Flying wrote in the comments on his answer, there could potentially be a lot of instances not covered by his regex, and an answer could be impossible - I don't know if this could be a challenge for the regex.
Possibly by adding some "If there is not a lowercase after the uppercase, there should not be inserted a space" rule
Here is a single-call pattern that doesn't use any anchors, capture groups, or references in the replacement string: /(?:[a-z]|[A-Z]+)\K(?=[A-Z]|\d+)/
Pattern&Replace Demo
Code: (Demo)
$tests = [
'SomeEquipmentHere',
'ABSBrakes',
'CDRadio',
'Valve14',
];
foreach ($tests as $test) {
echo preg_replace('/(?:[a-z]|[A-Z]+)\K(?=[A-Z]|\d+)/',' ',$test),"\n";
}
Output:
Some Equipment Here
ABS Brakes
CD Radio
Valve 14
This is a better method because there is nothing to mop up. If there are new strings to consider (that break my method), please leave them in a comment so that I can update my pattern.
Pattern Explanation:
/ #start the pattern
(?:[a-z] #match 1 lowercase letter
| #or
[A-Z]+) #1 or more uppercase letters
\K #restart the fullstring match (forget the past)
(?=[A-Z] #look-ahead for 1 uppercase letter
| #or
\d+) #1 or more digits
/ #end the pattern
Edit:
There are some other patterns that may provide better accuracy including:
/(?:[a-z]|\B[A-Z]+)\K(?=[A-Z]\B|\d+)/
Granted, the above pattern will not properly handle ServiceOK
Demo Link Word Boundaries Link
or this pattern with an anchor:
/(?!^)(?=[A-Z][a-z]+|(?<=\D)\d)/
The above pattern will accurately split: SomeEquipmentHere, ABSBrakes, CDRadio, Valve14, ServiceOK, ESP as requested by the OP.
Demo Link
*Note: Pattern accuracy can be improved as more sample strings are provided.
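As a quick check, here is a small sketch that runs the anchored pattern over all six sample strings. The wrapper name splitCamelCase() is hypothetical. Note that ServiceOK and ESP come back unchanged, which is in line with the OP's proposed rule of not inserting a space when no lowercase letter follows the uppercase one.

```php
<?php
// splitCamelCase() is a made-up wrapper around the anchored pattern above.
function splitCamelCase(string $s): string
{
    // Insert a space at every non-initial position that is followed by
    // "Uppercase + lowercase(s)", or by a digit preceded by a non-digit.
    return preg_replace('/(?!^)(?=[A-Z][a-z]+|(?<=\D)\d)/', ' ', $s) ?? $s;
}

$tests = ['SomeEquipmentHere', 'ABSBrakes', 'CDRadio', 'Valve14', 'ServiceOK', 'ESP'];

foreach ($tests as $test) {
    echo splitCamelCase($test), "\n";
}
// Some Equipment Here
// ABS Brakes
// CD Radio
// Valve 14
// ServiceOK
// ESP
```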
Here is how it can be solved:
$tests = [
'SomeEquipmentHere',
'ABSBrakes',
'CDRadio',
'Valve14',
];
foreach ($tests as $test) {
echo trim(preg_replace('/\s+/', ' ', preg_replace('/([A-Z][a-z]+)|([A-Z]+(?=[A-Z]))|(\d+)/', '$1 $2 $3', $test)));
echo "\n";
}
Related test on regex101.
UPDATE: Added example for additional question
I have a regex in PHP that replaces everything I don't want with spaces
/[^a-z0-9\p{L}]/siu
But there is this one exception, I want to keep punctuations for abbreviations.
Example:
F.B.I.Federal.Bureau.of.Investigation => 'F B I Federal Bureau of Investigation'
S.W.A.T.Team => 'S W A T Team'
Should be:
F.B.I.Federal.Bureau.of.Investigation => 'F.B.I. Federal Bureau of Investigation'
S.W.A.T.Team => 'S.W.A.T. Team'
PHP code:
$s = "F.B.I.Federal.Bureau.of.Investigation";
return preg_replace('/[^a-z0-9\p{L}]/siu', " ", $s);
So the logic is that it should check the second char of the first match, and if it's a '.' char, not replace it.
I'm not sure if this is possible with regex; if not, I would appreciate an alternative in PHP.
Actually, there are many types of abbreviations, and as Jon Stirling says, there is no really 100% working solution here since you need a whole list of possible abbreviations to filter out. You may have a peek at some fancy regex solution by #ndn and grab the pattern part related to abbreviations there.
If you need to only handle patterns like in the question, you may consider using
'~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u'
or - if D.Word should also be treated as an abbreviation:
'~(\b(?:\p{Lu}\.)+)|[^0-9\p{L}]~u'
and replace with '$1 '. See the regex demo.
Pattern details:
(\b(?:\p{Lu}\.)+) - Group 1 (later referenced with the $1 backreference): 1 or more consecutive occurrences of any Unicode uppercase letter with a dot after it
| - or
[^0-9\p{L}] - any char that is neither an ASCII digit nor a Unicode letter.
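A minimal sketch of the {2,} variant applied to the two strings from the question follows. The function name cleanKeepingAbbreviations() is invented for the example. When the second alternative matches, $1 is empty, so the character is simply replaced by a space.

```php
<?php
// cleanKeepingAbbreviations() is a hypothetical name used only for this sketch.
function cleanKeepingAbbreviations(string $s): string
{
    // Group 1 keeps runs like "F.B.I."; everything else that is not a
    // digit or letter becomes a space via the '$1 ' replacement.
    return preg_replace('~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u', '$1 ', $s) ?? $s;
}

echo cleanKeepingAbbreviations('F.B.I.Federal.Bureau.of.Investigation'), "\n";
// F.B.I. Federal Bureau of Investigation
echo cleanKeepingAbbreviations('S.W.A.T.Team'), "\n";
// S.W.A.T. Team
```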
And here is a variant of a regex with #ndn's abbreviations:
'~\b((?:[Ee]tc|St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd|pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|\p{Lu}(?:\.\p{Lu})+)\.)|[^0-9\p{L}]~'
See the regex demo.
If you do not want to remove -, ( and ), just make sure to add them to the negated character class, replace [^0-9\p{L}] with [^0-9\p{L}()-].
Feel free to update by adding more abbreviations or enhance by shrinking the alternatives.
I want to wrap the 2nd to last letter in a span with class "testing2" and the last letter in a span with class "testing". I got how to do the last letter, but what about the 2nd to last letter?
echo preg_replace('/(.)$/', '<span class="testing">\1</span>', $title)
This regex will find the last 2 letters in a string and capture them separately.
.*([a-zA-Z])([a-zA-Z]).*$
Demo: https://regex101.com/r/yR1sT8/1
PHP Usage:
$string = 'aaffffs3.4.4asdf234f4f3_+!#>,3';
preg_match('/.*([a-zA-Z])([a-zA-Z]).*$/', $string, $letters);
print_r($letters);
Output:
Array
(
[0] => aaffffs3.4.4asdf234f4f3_+!#>,3
[1] => d
[2] => f
)
...or...
$string = 'aaffffs3.4.4asdf234f4f3_+!#>,3';
echo preg_replace('/.*([a-zA-Z])([a-zA-Z]).*$/', '<span class="testing2">$1</span><span class="testing">$2</span>',$string);
Output:
<span class="testing2">d</span><span class="testing">f</span>
If you don't care whether the last characters are letters and just want any character, then this is much easier: (.)(.)$.
Possible alternative: https://regex101.com/r/yR1sT8/2
Update:
To keep the previous values as well we just need to add additional capture groups.
$string = 'aaffffs3.4.4asdf234f4f3_+!#>,3';
echo preg_replace('/(.*)([a-zA-Z])([a-zA-Z])(.*)$/', '$1<span class="testing2">$2</span><span class="testing">$3</span>$4',$string);
Output:
aaffffs3.4.4as<span class="testing2">d</span><span class="testing">f</span>234f4f3_+!#>,3
Additional:
The () is a capture group. Anything inside it is grouped, which can be used in a number of ways. For example, say you wanted to evaluate a sentence and you didn't care about the word that started it; you could do:
`(?:The)? wolf was walking down the street`
Here The is grouped and the ? makes that whole word optional. The ?: makes the group non-capturing, so $1 wouldn't be present here. The $1, $2, etc. are numbered in the order the groups appear in the regex. You can read more about capture groups here, http://www.regular-expressions.info/refcapture.html and
http://www.rexegg.com/regex-capture.html. Depending on the language, the reference to the captured value may be \1.
Simplest would be an assertion instead of just $:
(?=.$|$)
This would lead the (.) to match at the end and one letter before that.
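Since that pattern matches the second-to-last and the last character as two separate matches, one way to give them different classes is a callback that counts the matches. This is only a sketch: wrapLastTwo() is a made-up helper, and it assumes a single-byte title with at least two characters.

```php
<?php
// wrapLastTwo() is hypothetical; it assumes $title has at least two characters.
function wrapLastTwo(string $title): string
{
    $classes = ['testing2', 'testing']; // second-to-last first, then last
    $i = 0;
    return preg_replace_callback(
        '/(.)(?=.$|$)/',
        function (array $m) use (&$i, $classes): string {
            // The first match is the second-to-last character, the second is the last.
            return '<span class="' . $classes[$i++] . '">' . $m[1] . '</span>';
        },
        $title
    );
}

echo wrapLastTwo('Title'), "\n";
// Tit<span class="testing2">l</span><span class="testing">e</span>
```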
I am trying to build a pattern to match all counties from a sentence
eg.
"ABCD XYZ County Herefordshire or Co.Kent or London County"
((co(unty)?\s)|(co\.\s?))?(?P<county>[a-z]{4,})(\scounty)?
But above pattern will also return "ABCD" as both expressions around "county" are optional.
Do I have to use two separate regular expressions or is there any way around it?
EDIT
What I am trying to do is get all the counties from a sentence. I consider a word a county name if it is followed by "county" or preceded by any of "co.", "co ", "county ". Multiple expressions like that, separated by " or ", are allowed. Once matched, the next step would be to remove the whole expression, e.g. "Co.London", from the original string.
EDIT 2
OK, sorry for the confusion; I know my question isn't clear. What I am trying to do is:
1. User enters something like 'ABCD County XYZ or Co.London or Kent County or county Herefordshire'
2. I want to get anything that is any of: "co.word" or "co word" or "county word" or "word county" So ideally I should get this: 'ABCD County,County XYZ,Co.London,Kent County,county Herefordshire'
3. I remove 'county' or 'co' etc from matched expression and check each against list of counties I have. If word is a county name I want to remove the whole expression from the original query.
You can do what you're looking for by first matching the group that has it before the text you're matching, and then matching it when it's after it. That explanation is probably unclear, so let me illustrate it this way:
You want to match foo that's either before or after bar:
(bar)foo|foo(bar)
of course in this case the parentheses are not required, but it's to illustrate that it's a group.
In your case, if I'm understanding it correctly, you'd need the following:
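((co(unty)?\s)|(co\.\s?))(?P<county>[a-z]{4,})|(?P<county2>[a-z]{4,})(\scounty)
or with reduced amount of parentheses (the second named group is called county2 because PCRE rejects two groups with the same name unless duplicate names are explicitly enabled):
(co(unty)?\s|co\.\s?)(?P<county>[a-z]{4,})|(?P<county2>[a-z]{4,})\scounty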
I'm not quite sure what the (?P is supposed to mean though. Regex101 doesn't recognise it either.
In reply to Johannes' comment, what you could do is only match words starting with an uppercase letter:
([Cc]o(unty|\.)? ?)([A-Z]\w+)|([A-Z]\w+) [Cc]ounty
That would also match it if the word is uppercase because it's the start of a sentence, though, so you could prevent it from matching that via:
([Cc]o(unty|\.)? ?)([A-Z]\w+)|((?<![.!?] |.\n)[A-Z]\w+) [Cc]ounty
then again, if the county name is the start of the sentence, it won't match it again, but that's something you're going to have to choose between. Regex can't make a distinction between a county name and a regular word at the start of a sentence.
Demo of the last mentioned regex.
Update per your comments: You can match every word that is followed or preceded by one of the named keywords (including ones that are not necessarily county names) by using the following:
((?<=county\s)|(?<=co\s)|(?<=co\.))(?P<county>[a-z]{4,})|(?P<county2>[a-z]{4,})(?=\scounty)
demo.
That uses lookbehinds, so only matches the actual word, not the word "county", so you could even omit the named capturing group, and directly use the list of matches, instead of filtering it to just the named capturing groups. As you can see in the demo, the only actual text matched is the text you're looking for.
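Here is a rough sketch of how that could look with preg_match_all(), using the sample input from the question. The i flag is my assumption (the sample text mixes upper and lower case), and findCountyCandidates() is a made-up name. Note that XYZ is not returned because the pattern requires at least four letters.

```php
<?php
// findCountyCandidates() is hypothetical; the /i flag is an assumption.
function findCountyCandidates(string $text): array
{
    $pattern = '/((?<=county\s)|(?<=co\s)|(?<=co\.))(?P<county>[a-z]{4,})'
             . '|(?P<county2>[a-z]{4,})(?=\scounty)/i';
    preg_match_all($pattern, $text, $matches);
    // The full matches are already just the candidate words.
    return $matches[0];
}

$input = 'ABCD County XYZ or Co.London or Kent County or county Herefordshire';
print_r(findCountyCandidates($input));
// ABCD, London, Kent, Herefordshire
```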
I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match-anything expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's an alternative, more complicated version that's more performant then I'll take that as well, as I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of up to 3 letters left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want to follow all spaces between those with ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
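Put together, the two steps can be sketched as a single function. The name relaxShortWords() is made up for illustration, and \w is used as the word character class as in the lines above.

```php
<?php
// relaxShortWords() is a hypothetical wrapper around the two replacements.
function relaxShortWords(string $str): string
{
    // Step 1: collapse short words between two long words into (.*).
    $str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
    // Step 2: every remaining space gets an optional marker.
    return preg_replace('/ /', ' ?', $str);
}

echo relaxShortWords('walk to the beach'), "\n";
// walk(.*)beach
echo relaxShortWords('on the beach'), "\n";
// on ?the ?beach
echo relaxShortWords('let us walk on the beach there now go'), "\n";
// let ?us ?walk(.*)beach(.*)there ?now ?go
```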