It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The Pattern should start with a word boundary, but why is the result different, when I put it in []{1} ??
The first one works like I expected, but the second not. The problem is, that I want to put more into the [], so that the pattern can start with a word boundary OR a small character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word-break (\b) at the beginning, sometimes not. (But i need the word-break.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers where the Name has a word-break at the beginning. (In this example Schubert, Reuschenbach, Spenner) but of course not Stroinski. So, I try this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
Does not work. Even if i remove the 0-9 and only put [\b]{1} at the beginning it doesn't find any hit.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.
The [\b] is a character class that only matches a backspace char (\u0008).
See PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} = ., the {1} limiting quantifier is always redundant and only makes sense when your patterns are built dynamically from variables.
So i am stuck - I have looked at tons of answers in here, but none seems to resolve my last problem.
Through an API with JSON, I receive an equipment list in a camelcase format. I can not change that.
I need this camelcase to be translated into normal language -
So far i have gotten most words seperated through:
$string = "SomeEquipmentHere";
$spaced = preg_replace('/([A-Z])/', ' $1', $string);
var_dump($spaced);
string ' Some Equipment Here' (length=20)
$trimmed = trim($spaced);
var_dump($trimmed);
string 'Some Equipment Here' (length=19)
Which is working fine - But in some of the equipments consists of abbreviations
"ABSBrakes" - this would require ABS and separated from Brakes
I can't check for several uppercases next to each other since it will then keep ABS and Brakes together - there are more like these, ie: "CDRadio"
So what is want is the output to be:
"ABS Brakes"
Is there a way to format it so, if there is uppercases next to eachother, then only add a space before the last uppercase letter of that sequence?
I am not strong in regex.
EDIT
Both contributions are awesome - people coming here later should read both answers
The last problems to consists are the following patterns :
"ServiceOK" becomes "Service O K"
"ESP" becomes "ES P"
The pattern only consisting of a pure uppercased abbreviation is fixed by a function counting lowercase letter, if there is none, it will skip over the preg_replace().
But as Flying wrote in the comments on his answer, there could potentially be a lot of instances not covered by his regex, and an answer could be impossible - I don't know if this could be a challenge for the regex.
Possibly by adding some "If there is not a lowercase after the uppercase, there should not be inserted a space" rule
Here is a single-call pattern that doesn't use any anchors, capture groups, or references in the replacement string: /(?:[a-z]|[A-Z]+)\K(?=[A-Z]|\d+)/
Pattern&Replace Demo
Code: (Demo)
$tests = [
'SomeEquipmentHere',
'ABSBrakes',
'CDRadio',
'Valve14',
];
foreach ($tests as $test) {
echo preg_replace('/(?:[a-z]|[A-Z]+)\K(?=[A-Z]|\d+)/',' ',$test),"\n";
}
Output:
Some Equipment Here
ABS Brakes
CD Radio
Valve 14
This is a better method because there is nothing to mop up. If there are new strings to consider (that break my method), please leave them in a comment so that I can update my pattern.
Pattern Explanation:
/ #start the pattern
(?:[a-z] #match 1 lowercase letter
| #or
[A-Z]+) #1 or more uppercase letters
\K #restart the fullstring match (forget the past)
(?=[A-Z] #look-ahead for 1 uppercase letter
| #or
\d+) #1 or more digits
/ #end the pattern
Edit:
There are some other patterns that may provide better accuracy including:
/(?:[a-z]|\B[A-Z]+)\K(?=[A-Z]\B|\d+)/
Granted, the above pattern will not properly handle ServiceOK
Demo Link Word Boundaries Link
or this pattern with an anchor:
/(?!^)(?=[A-Z][a-z]+|(?<=\D)\d)/
The above pattern will accurately split: SomeEquipmentHere, ABSBrakes, CDRadio, Valve14, ServiceOK, ESP as requested by the OP.
Demo Link
*Note: Pattern accuracy can be improved as more sample strings are provided.
Here is how it can be solved:
$tests = [
'SomeEquipmentHere',
'ABSBrakes',
'CDRadio',
'Valve14',
];
foreach ($tests as $test) {
echo trim(preg_replace('/\s+/', ' ', preg_replace('/([A-Z][a-z]+)|([A-Z]+(?=[A-Z]))|(\d+)/', '$1 $2 $3', $test)));
echo "\n";
}
Related test on regex101.
UPDATE: Added example for additional question
I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are less than 4 trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (correctly enclosing each paragraph in a optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above is assuming you were putting your regex between two / "delimiters" characters (out of conventional habit).
To dive a little deeper into the rabbit-hole, one should note that in php the first and last character of a regular expression is usually a "delimiter", so one can add modifiers at the end (like case-insensitive etc).
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Use the Dom is a good option too.
I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need an regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match any expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's an alternative more complicated version that's more performant then I'll take that as well as I eventually anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only less-than-3-letter words left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want to append all spaces between those with ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
Just starting to explore the 'wonders' of regex. Being someone who learns from trial and error, I'm really struggling because my trials are throwing up a disproportionate amount of errors... My experiments are in PHP using ereg().
Anyway. I work with first and last names separately but for now using the same regex. So far I have:
^[A-Z][a-zA-Z]+$
Any length string that starts with a capital and has only letters (capital or not) for the rest. But where I fall apart is dealing with the special situations that can pretty much occur anywhere.
Hyphenated Names (Worthington-Smythe)
Names with Apostophies (D'Angelo)
Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
Joint Names (Ben & Jerry)
Maybe there's some other way a name can be that I'm no thinking of, but I suspect if I can get my head around this, I can add to it. I'm pretty sure there will be instances where more than one of these situations comes up in one name.
So, I think the bottom line is to have my regex also accept a space, hyphens, ampersands and apostrophes - but not at the start or end of the name to be technically correct.
This regex is perfect for me.
^([ \u00c0-\u01ffa-zA-Z'\-])+$
It works fine in php environments using preg_match(), but doesn't work everywhere.
It matches Jérémie O'Co-nor so I think it matches all UTF-8 names.
Hyphenated Names (Worthington-Smythe)
Add a - into the second character class. The easiest way to do that is to add it at the start so that it can't possibly be interpreted as a range modifier (as in a-z).
^[A-Z][-a-zA-Z]+$
Names with Apostophies (D'Angelo)
A naive way of doing this would be as above, giving:
^[A-Z][-'a-zA-Z]+$
Don't forget you may need to escape it inside the string! A 'better' way, given your example might be:
^[A-Z]'?[-a-zA-Z]+$
Which will allow a possible single apostrophe in the second position.
Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
Here I'd be tempted to just do our naive way again:
^[A-Z]'?[- a-zA-Z]+$
A potentially better way might be:
^[A-Z]'?[- a-zA-Z]( [a-zA-Z])*$
Which looks for extra words at the end. This probably isn't a good idea if you're trying to match names in a body of extra text, but then again, the original wouldn't have done that well either.
Joint Names (Ben & Jerry)
At this point you're not looking at single names anymore?
Anyway, as you can see, regexes have a habit of growing very quickly...
THE BEST REGEX EXPRESSIONS FOR NAMES:
I will use the term special character to refer to the following three characters:
Dash -
Hyphen '
Dot .
Spaces and special characters can not appear twice in a row (e.g.: -- or '. or .. )
Trimmed (No spaces before or after)
You're welcome ;)
Mandatory single name, WITHOUT spaces, WITHOUT special characters:
^([A-Za-z])+$
Sierra is valid, Jack Alexander is invalid (has a space), O'Neil is invalid (has a special character)
Mandatory single name, WITHOUT spaces, WITH special characters:
^[A-Za-z]+(((\'|\-|\.)?([A-Za-z])+))?$
Sierra is valid, O'Neil is valid, Jack Alexander is invalid (has a space)
Mandatory single name, optional additional names, WITH spaces, WITH special characters:
^[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*$
Jack Alexander is valid, Sierra O'Neil is valid
Mandatory single name, optional additional names, WITH spaces, WITHOUT special characters:
^[A-Za-z]+((\s)?([A-Za-z])+)*$
Jack Alexander is valid, Sierra O'Neil is invalid (has a special character)
SPECIAL CASE
Many modern smart devices add spaces at the end of each word, so in my applications I allow unlimited number of spaces before and after the string, then I trim it in the code behind. So I use the following:
Mandatory single name + optional additional names + spaces + special characters:
^(\s)*[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*(\s)*$
Add your own special characters
If you wish to add your own special characters, let's say an underscore _ this is the group you need to update:
(\'|\-|\.)
To
(\'|\-|\.|\_)
PS: If you have questions comment here and I will receive an email and respond ;)
While I agree with the answers saying you basically can't do this with regex, I will point out that some of the objections (internationalized characters) can be resolved by using UTF strings and the \p{L} character class (matches a unicode "letter").
security tip: make sure to validate the size of the string before this step to avoid DoS attack that will bring down your system by sending very long charsets.
Check this out:
^(([A-Za-z]+[,.]?[ ]?|[a-z]+['-]?)+)$
You can test it here : https://regex101.com/r/mS9gD7/46
I don't really have a whole lot to add to a regex that takes care of names because there are already some good suggestions here, but if you want a few resources for learning more about regular expressions, you should check out:
Regex Library's Cheat
Sheet
Another cheat sheet
A regex tutorial on the DevNetwork
forums: Part 1 and Part 2
PHP builder's tutorial
And if you ever need to do regex for
JavaScript (it's a little
different flavor), try JavaScript Kit,
or this resource, or Mozilla's
reference
I second the 'give up' advice. Even if you consider numbers, hyphens, apostrophes and such, something like [a-zA-Z] still wouldn't catch international names (for example, those having šđčćž, or Cyrillic alphabet, or Chinese characters...)
But... why are you even trying to verify names? What errors are you trying to catch? Don't you think people know to write their name better than you? ;) Seriously, the only thing you can do by trying to verify names is to irritate people with unusual names.
Basically, I agree with Paul... You will always find exceptions, like di Caprio, DeVil, or such.
Remarks on your message: in PHP, ereg is generally seen as obsolete (slow, incomplete) in favor of preg (PCRE regexes).
And you should try some regex tester, like the powerful Regex Coach: they are great to test quickly REs against arbitrary strings.
If you really need to solve your problem and aren't satisfied with above answers, just ask, I will give a go.
This worked for me:
+[a-z]{2,3} +[a-z]*|[\w'-]*
This regex will correctly match names such as the following:
jean-claude van damme
nadine arroyo-rodriquez
wayne la pierre
beverly d'angelo
billy-bob thornton
tito puente
susan del rio
It will group "van damme", "arroyo-rodriquez" "d'angelo", "billy-bob", etc. as well as the singular names like "wayne".
Note that it does not test that the grouped stuff is actually a valid name. Like others said, you'll need a dictionary for that. Also, it will group numbers, so if that's an issue you may want to modify the regex.
I wrote this to parse names for a MapReduce application. All I wanted was to extract words from the name field, grouping together the del foo and la bar and billy-bobs into one word to make the key-value pair generation more accurate.
^[A-Z][a-zA-Z '&-]*[A-Za-z]$
Will accept anything that starts with an uppercase letter, followed by zero or more of any letter, space, hyphen, ampersand or apostrophes, and ending with a letter.
See this question for more related "name-detection" related stuff.
regex to match a maximum of 4 spaces
Basically, you have a problem in that, there are effectively no characters in existence that can't form a legal name string.
If you are still limiting yourself to words without ä ü æ ß and other similar non-strictly-ascii characters.
Get yourself a copy of UTF32 character table and realise how many millions of valid characters there are that your simple regex would miss.
To add multiple dots in the username use this Regex:
^[a-zA-Z][a-zA-Z0-9_]*\.?[a-zA-Z0-9_\.]*$
String length can be set separately.
You can easily neutralize the whole matter of whether letters are upper or lowercase -- even in unexpected or uncommon locations -- by converting the string to all upper case using strtoupper() and then checking it against your regex.
/([\u00c0-\u01ffa-zA-Z'\-]+[ ]?[*]?[\u00c0-\u01ffa-zA-Z'\-]*)+/;
Try this . You can also force to start with char using ^,and end with char using $
To improve on daan's answer:
^([\u00c0-\u01ffa-zA-Z]+\b['\-]{0,1})+\b$
only allows a single occurances of hyphen or apostrophy within a-z and valid unicode chars.
also does a backtrack to make sure there is no hyphen or apostrophes at the end of the string.
^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$
For complete details, please visit THIS post. This regex doesn't allow ampersands.
if you add spaces then "He went to the market on Sunday" would be a valid name.
I don't think you can do this with a regex, you cannot easily detect names from a chunk of text using a regex, you would need a dictionary of approved names and search based on that. Any names not on the list wouldn't be detected.
I have used this, because name can be the part of file-patch.
//http://support.microsoft.com/kb/177506
foreach(array('/','\\',':','*','?','<','>','|') as $char)
if(strpos($name,$char)!==false)
die("Not allowed char: '$char'");
I ran into this same issue, and like many others that have posted, this isn't a 100% fool proof expression, but it's working for us.
/([\-'a-z]+\s?){2,4}/
This will check for any hyphens and/or apostrophes in either the first and/or last name as well as checking for a space between the first and last names. The last part is a little magic that will check for between 2 and 4 names. If you tend to have a lot of international users that may have 5 or even 6 names, you can change that to 5 or 6 and it should work for you.
i think "/^[a-zA-Z']+$/" is not enough it will allow to pass single letter we can adjust the range by adding {4,20} which means the range of letters are 4 to 20.
I've come up with this RegEx pattern for names:
/^([a-zA-Z]+[\s'.]?)+\S$/
It works. I think you should use it too.
It matches only names or strings like:
Dr. Shaquil O'Neil Armstrong Buzz-Aldrin
It won't match strings with 2 or more spaces like:
John Paul
It won't match strings with ending spaces like:
John Paul
The text above has an ending space. Try highlighting or selecting the text to see the space
Here's what I use to learn and create your own regex patterns:
RegExr: Leanr, Build and Test RegEx
Try this: /^([A-Z][a-z]([ ][a-z]+)([ '-]([&][ ])?[A-Z][a-z]+)*)$/
Demo: http://regexr.com/3bai1
Have a nice day !
you can use this below for names
^[a-zA-Z'-]{3,}\s[a-zA-Z'-]{3,}$
^ start of the string
$ end of the string
\s space
[a-zA-Z'-\s]{3,} will accept any name with a length of 3 characters or more, and it include names with ' or - like jean-luc
So in our case it will only accept names in 2 parts separated by a space
in case of multiple first-name you can add a \s
^[a-zA-Z'-\s]{3,}\s[a-zA-Z'-]{3,}$
Following Regex is simple and useful for proper names (Towns, Cities, First Name, Last Name) allowing all international letters omitting unicode-based regex engine.
It is flexible - you can add/remove characters you want in the expression (focusing on characters you want to reject rather than include).
^(?:(?!^\s|[ \-']{2}|[\d\r\n\t\f\v!"#$%&()*+,\.\/:;<=>?#[\\\]^_`{|}~€‚ƒ„…†‡ˆ‰‹‘’“”•–—˜™›¡¢£¤¥¦§¨©ª«¬®¯°±²³´¶·¸¹º»¼½¾¿×÷№′″ⁿ⁺⁰‱₁₂₃₄]|\s$).){1,50}$
Regex matches: from 1 to 50 international letters separated by single delimiter (space -')
Regex rejects: empty prefix/suffix, consecutive delimiters (space - '), digits, new line, tab, limited list of extended ASCII characters
Demo
This is what I use for full name:
$pattern = "/^((\p{Lu}{1})\S(\p{Ll}{1,20})[^0-9])+[-'\s]((\p{Lu}{1})\S(\p{Ll}{1,20}))*[^0-9]$/u";
Supports all languages
Common names("Jane Doe", "John Doe")
Usefull for composed names("Marie-Josée Côté-Rochon", "Bill O'reilly")
Excludes digits(0-9)
Only excepts uppercase at beginning of names
First and last names from 2-21 characters
Adding trim() to remove whitespace
Does not except("John J. William", "Francis O'reilly Jr. III")
Must use full names, not: ("John", "Jane", "O'reilly", "Smith")
Edit:
It seems that both [^0-9] in the pattern above was matching at least a fourth digit/letter in each of either first and/or last names.
Therefore names of three letters/digits could not be matched.
Here is the edited regular expression:
$pattern = "/^(\p{Lu}{1}\S\p{Ll}{1,20}[-'\s]\p{Lu}{1}\S\p{Ll}{1,20})+([^\d]+)$/u";
Give up. Every rule you can think of has exceptions in some culture or other. Even if that "culture" is geeks who like legally change their names to "37eet".
Try this regex:
^[a-zA-Z'-\s\.]{3,20}\s[a-zA-Z'-\.]{3,20}$
Aomine's answer was quite helpful, I tweaked it a bit to include:
Names with dots (middle): Jane J. Samuels
Names with dots at the end: John Simms Snr.
Also the name will accept minimum 2 letters, and a min. of 2 letters for surname but no more than 20 for each (so total of 40 characters)
Successful Test cases:
D'amalia Jones
David Silva Jnr.
Jay-Silva Thompson
Shay .J. Muhanned
Bob J. Iverson