Explode and Implode in APL

How could functions similar to PHP's explode and implode be implemented with APL?
I tried to work it out myself and came up with a solution which I'm posting below. I'd like to see other ways that this might be solved.

Pé, the quest for "short" and/or "elegant" solutions to standard problems in APL is older than PHP and even older than newer terminology such as "explode" and "implode" (I think, but I must admit I do not know how old these terms really are...). Anyway, the early APL guys used the term "idiom" for such "solutions to standard problems that fit in one line of APL".
And for some reason, the Finns were especially creative and even started producing a list of these in order to make it easy for newbies. And I find this stuff still useful after 20yrs of doing APL. It is called "FinnAPL" - the Finnish APL idiom library and you can browse it here: https://aplwiki.com/wiki/FinnAPL_idiom_library (BTW, the whole APL Wiki might be interesting to read...)
You may, however, need to be creative with your wording in order to find solutions ;)
And one warning: FinnAPL only works with "classic" (non-nested) data structures (nested matrices came with "APL2", which is standard these days), so some of the ways they handle data might no longer be "state-of-the-art". For example, back in the "old times", CAT, BIRD and DOG would have been represented as a 3x4 array, so "implode" of a string array was as simple as ,array,delimiter (but you then had the challenge of removing the blanks which were inserted for padding).
Anyway, I'm not sure why I wrote all this - just a few thoughts which came to mind when thinking about my start with APL ;-)
Ok, let me also look at the question. When your delimiter is a single character, the APL2-ish idiomatic way of handling this would be something like this:
⎕ml←3 ⍝ "migration-level" (only Dyalog APL) to ensure APL2-compatibility
s←' '
A←s,'BIRD',s,'CAT',s,'DOG' ⍝ note that the delimiter is also used as the 1st char!
exploded_string←1↓¨(+\A=s)⊂A ⍝ explode
imploded←∊s,¨exploded_string
A≡imploded ⍝ test for successful round-trip; should return 1

Explode:
Given the following text string and delimiter string:
F←'CAT BIRD DOG'
B←' '
Explode can be accomplished as follows:
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P[2] ⍝ returns BIRD
Limitations:
PHP's explode function returns an empty string element when two delimiters are adjacent to each other. The code above simply ignores that and treats the two delimiters as if they were one.
The code above also does nothing to handle overlapping delimiters. This is most likely to occur if repeated characters are used for the delimiter. For example:
F←'CATaaaBIRDaaDOG'
B←'aa'
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P ⍝ returns CAT BIRD DOG
However, the expected result would be CAT aBIRD DOG. The code doesn't recognize 'aaa' as the delimiter followed by 'a'. Rather, it treats it as two overlapping delimiters, which end up functioning as a single delimiter. Another example would be 'tat' as the delimiter, in which case any occurrence of 'tatat' in the string would have the same problem.
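For comparison, the left-to-right, non-overlapping scan that produces the "expected" results can be sketched in a few lines of Python (used here only as neutral reference code; the name explode_ref is made up for this illustration). Each delimiter is consumed as soon as it is found, so 'aaa' splits as 'aa' followed by 'a', and adjacent delimiters yield an empty string:

```python
def explode_ref(delimiter, text):
    """Split text on non-overlapping occurrences of delimiter, scanning left to right."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    parts = []
    start = 0
    while True:
        hit = text.find(delimiter, start)
        if hit == -1:
            # No further delimiter: the remainder is the last piece.
            parts.append(text[start:])
            return parts
        parts.append(text[start:hit])
        # Resume after the consumed delimiter, so overlaps are never re-used.
        start = hit + len(delimiter)


print(explode_ref('aa', 'CATaaaBIRDaaDOG'))
print(explode_ref('tat', 'CATtatatatBIRDtatDOG'))
```

For a non-empty delimiter this behaves like Python's built-in text.split(delimiter), which makes it a convenient oracle when testing an APL version.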
Overlapping Delimiters:
I have an alternative for the possibility of a single overlap:
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
The third line of code eliminates any string positions that occur within a distance of S-1 characters from any delimiter position before it. As I said, this only solves the problem for a single overlap. If there are two or more overlaps, the first is recognized as a delimiter, and all the rest are ignored. Here's an example of two overlaps:
F←'CATtatatatBIRDtatDOG'
B←'tat'
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
P ⍝ returns CAT atatBIRD DOG
The expected result was 'CAT a BIRD DOG,' but it is unable to recognize the final 'tat' as a delimiter because of the overlap. Such a situation would be rare except when repeated characters are used. If the delimiter is 'aa', then 'aaaa' would be considered a double overlap, and only the first delimiter would be recognized.
Implode:
Much simpler:
P←'CAT' 'BIRD' 'DOG'
B←'-'
(⍴,B)↓∊B,¨P
It returns 'CAT-BIRD-DOG' as expected.

An interesting alternative for implode can be accomplished with reduction:
p←'cat' 'bird' 'dog'
↑{⍺,'-',⍵}/p
cat-bird-dog
This technique does not need to explicitly reference the shape of the delimiter.
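For readers more used to mainstream languages, roughly the same fold can be written in Python. This is only an analogy: functools.reduce folds from the left while APL's / folds from the right, but string concatenation gives the same result either way.

```python
from functools import reduce

words = ['cat', 'bird', 'dog']

# Fold the list pairwise, inserting the delimiter between neighbours.
joined = reduce(lambda left, right: left + '-' + right, words)
print(joined)
```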
And an interesting alternative to explode can be done with n-wise reduction:
f←'CATtatBIRDtatDOG'
b←'tat'
b{(~(-⍴⍵)↑(⍴⍺)∨/⍺⍷⍵)⊂⍵}f
CAT BIRD DOG

Related

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.

When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
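A minimal sketch of the two-step approach, written in Python rather than PHP for illustration (the function name parse_header is hypothetical). The regex is the one above; the second step is a plain split on commas with whitespace trimmed, and an empty list inside the parentheses yields no items:

```python
import re

def parse_header(text):
    """Return (version, items) or None if the text does not match."""
    match = re.search(r"SomeText/([\d.]*) \(([^)]*)\)", text)
    if match is None:
        return None
    version, inner = match.groups()
    # Step two: split the captured list and drop empty pieces.
    items = [part.strip() for part in inner.split(",") if part.strip()]
    return version, items


print(parse_header("SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)"))
```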

Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101

Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.

I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
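The likely reason is that a capture group inside a quantifier does not produce one capture per iteration: each pass overwrites the previous one, so only the last item survives. A small Python sketch shows the effect (the pattern is a slight variant of the shortened one above, with ) excluded from the item class); the engine used by PHP behaves the same way in this respect.

```python
import re

text = "(debug, OS X 10.11.2, Macbook Pro Retina)"

# One capture group, repeated: the group is overwritten on every iteration.
repeated = re.search(r"\((?:([^,)]+),?\s?)*\)", text)
last_item = repeated.group(1)

# Collecting every item needs repeated matching instead of a repeated group.
items = [piece.strip() for piece in re.findall(r"[^,()]+", text)]

print(last_item)
print(items)
```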

This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

php preg_replace remove thousand separator in a string

I have some long articles, and I want to remove only the thousands separators, not the other commas.
$str = "Last month's income is 1,022 yuan, not too bad.";
//=>Last month's income is 1022 yuan, not too bad.
preg_replace('#(\d)\,(\d)#i','???',$str);
How should I write the regex pattern? Thanks

If the simplified rule "Match any comma that lies directly between digits" is good enough for you, then
preg_replace('/(?<=\d),(?=\d)/','',$str);
should do.
You could improve it by making sure that exactly three digits follow:
preg_replace('/(?<=\d),(?=\d{3}\b)/','',$str);
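The same lookaround pattern can be tried outside PHP. The sketch below uses Python's re module, which accepts this pattern unchanged; the helper name strip_thousands is made up for the example.

```python
import re

def strip_thousands(text):
    """Remove commas that sit between a digit and a group of exactly three digits."""
    return re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)


print(strip_thousands("Last month's income is 1,022 yuan, not too bad."))
```

Because the lookahead demands three digits followed by a word boundary, a list such as "1,2,3" is left alone.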

If you have a look at the preg_replace documentation you can see that you can write captures back in the replacement string using $n:
preg_replace('#(\d),(\d)#','$1$2',$str);
Note that there is no need to escape the comma, or to use i (as there are no letters in the pattern).
An alternative (and probably more efficient) way is to use lookarounds. These are not included in the match, so they don't have to be written back:
preg_replace('#(?<=\d),(?=\d)#','',$str);

The first (\d) is represented by $1, the second (\d) by $2. Therefore the solution is to use something like this:
preg_replace('#(\d)\,(\d)#','$1$2',$str);
Actually it would be better to have 3 numbers behind the comma to avoid causing havoc in lists of numbers:
preg_replace('#(\d)\,(\d{3})#','$1$2',$str);

PHP preg_split result not correct

I am trying to learn regex in PHP and messing around with the preg_split function.
It doesn't appear to be correct though, or my understanding is completely wrong.
The test code I am using is:
$string = "test ing ";
var_dump(preg_split('/t/', $string));
I would expect to get an array like the following:
[0] => "es" [1] => " ing "
but the following is being returned:
[0] => "" [1] => "es" [2] => " ing "
Why is there an empty string at the start?
I understand that I can use the PREG_SPLIT_NO_EMPTY flag to filter this, but it shouldn't be there to begin with. Should it?

Why shouldn't it? This is exactly how it works. The semantics of a split operation are that you have a string of this format:
value-delimiter-value-delimiter-value-...-delimiter-value
(Note that it is starting and ending with a value, not a delimiter.)
So if your string starts with a delimiter, it is absolutely valid to assume that there is an empty value before that delimiter (since the delimiter is supposed to split something into two). You wouldn't generally want to reject the empty string between two consecutive ts either, would you?
And this is exactly what PREG_SPLIT_NO_EMPTY is for. You use it whenever you do want to get rid of those empty strings.
As a simple example of why you would want the default behavior, just think of CSV files. You want to split a line at (for example) ;. You usually also want to allow for empty values. Now if the value in your first column was empty (meaning the line will start with ;), and you chopped that first empty string away completely, then suddenly all indices in the resulting array would correspond to different columns. This is why you want to keep those empty strings as well. In many cases you know how many delimiters there are, and hence how many values, and you want to be able to identify which value belongs at which position, even if some of them are empty.
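The same semantics show up in other languages' split functions. As an illustration in Python (not PHP), the first call mirrors the question's example, and the second shows how dropping empty strings shifts the column positions of a CSV-like row:

```python
import re

# Splitting on "t" keeps the empty value before the leading delimiter.
parts = re.split("t", "test ing ")
print(parts)

# A row whose first column is empty.
row = ";b;c".split(";")
# Filtering empties loses the information about which column each value was in.
compact = [value for value in row if value != ""]
print(row, compact)
```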

It's working 100% correctly. The first character is a 't', so it's splitting on that 't' first. Before the first 't' there is nothing, so the resulting array starts with an empty string entry.

It's happening because of the t at the beginning of your string. If you don't use the PREG_SPLIT_NO_EMPTY option, preg_split will treat an empty string as a valid split.
Think of it this way: Everywhere preg_split sees a t, it chops the string into two chunks: the chunk before the t, and the chunk after it. Even if one of the chunks doesn't have anything in it, it still counts. That piece is just an empty string.
For some applications, this would be perfectly useful -- for example, say you wanted to replace each t with something, but the replacement was too complicated to just use preg_replace. The language wants you to be able to choose, so it keeps the empty split unless you explicitly tell it not to with PREG_SPLIT_NO_EMPTY.

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In English, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";

This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In English, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/ - but that's still the same solution, just in a more efficiently-processed form.)
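The character-by-character comparison mentioned above might look roughly like this. It is a sketch in Python rather than PHP, with hypothetical function names, and it assumes every line has the same length as the reference line:

```python
def common_positions(a, b):
    """Count positions at which the two strings hold the same character."""
    return sum(x == y for x, y in zip(a, b))


def find_matching_lines(line, candidates, min_common=2):
    """Keep the candidates sharing at least min_common positions with line."""
    return [c for c in candidates if common_positions(line, c) >= min_common]


lines = ["BCEG", "ADFG", "BCFG", "BDFG", "BDEH"]
print(find_matching_lines("ACFG", lines))
```

Changing the threshold or the line length needs no change to the code, which is the scalability the regex version lacks.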

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
static $file = null;
if( !$file) $file = file("BigFileOfLines.txt");
$search = str_split($line);
foreach($file as $l) {
$test = str_split($l);
$matches = count(array_intersect($search,$test));
if( $matches >= 2) // define number of matches required here - optionally make it an argument
return true;
}
// no matches
return false;
}

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
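Generating that alternation does not have to be done by hand. The following Python sketch (the name build_pattern is hypothetical) enumerates every combination of positions to keep and replaces the rest with a dot, so it also covers longer lines or a higher threshold. Note that the number of alternatives grows combinatorially with line length.

```python
import re
from itertools import combinations

def build_pattern(line, min_common=2):
    """Build an alternation matching lines that share min_common positions with line."""
    alternatives = []
    for keep in combinations(range(len(line)), min_common):
        # Keep the chosen positions literally, wildcard all the others.
        alternatives.append(
            "".join(re.escape(ch) if i in keep else "." for i, ch in enumerate(line))
        )
    return "^(?:" + "|".join(alternatives) + ")$"


pattern = build_pattern("ACFG")
subject = "BCEG\nADFG\nBCFG\nBDFG\nBDEH\nACFG"
matches = re.findall(pattern, subject, flags=re.MULTILINE)
print(pattern)
print(matches)
```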
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
function findMatchingLines($line, $subject) {
// Note: PCRE does not accept an unbounded quantifier inside a lookbehind, so treat this pattern as an illustration of the idea.
$regex = '/(?<=^([AB])([CD])([EF])([GH])[.\n]+)'
. '(\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)/m';
$matchingLines = array();
preg_match_all($regex, $line . "\n" . $subject, $matchingLines);
return $matchingLines;
}
What this function does is pre-pend your input string with the line you want to match against, then uses a pattern that compares each line after the first line (that's the + after [.\n] working) back to the first line's 4 characters.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).

People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In English, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D anywhere in the line, OR E or | or F anywhere in the line, OR G or | or H in the last position.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternates between everything around it, so ^ belongs only to the first alternative and $ only to the last. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/
as compared to:
/^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for FG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
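/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for at least two occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit. The negated classes also exclude \n so that a match cannot run across lines.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
// sprintf avoids PHP reading {$num} as variable interpolation, which would drop the braces
return sprintf('/^[^%1$s\n]*+(?:[%1$s][^%1$s\n]*+){%2$d,}$/m', $line, $num);
}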

Regex for names

Just starting to explore the 'wonders' of regex. Being someone who learns from trial and error, I'm really struggling because my trials are throwing up a disproportionate amount of errors... My experiments are in PHP using ereg().
Anyway. I work with first and last names separately but for now using the same regex. So far I have:
^[A-Z][a-zA-Z]+$
Any length string that starts with a capital and has only letters (capital or not) for the rest. But where I fall apart is dealing with the special situations that can pretty much occur anywhere.
Hyphenated Names (Worthington-Smythe)
Names with Apostrophes (D'Angelo)
Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
Joint Names (Ben & Jerry)
Maybe there's some other way a name can be that I'm not thinking of, but I suspect if I can get my head around this, I can add to it. I'm pretty sure there will be instances where more than one of these situations comes up in one name.
So, I think the bottom line is to have my regex also accept a space, hyphens, ampersands and apostrophes - but not at the start or end of the name to be technically correct.

This regex is perfect for me.
^([ \u00c0-\u01ffa-zA-Z'\-])+$
It works fine in php environments using preg_match(), but doesn't work everywhere.
It matches Jérémie O'Co-nor so I think it matches all UTF-8 names.

Hyphenated Names (Worthington-Smythe)
Add a - into the second character class. The easiest way to do that is to add it at the start so that it can't possibly be interpreted as a range modifier (as in a-z).
^[A-Z][-a-zA-Z]+$
Names with Apostrophes (D'Angelo)
A naive way of doing this would be as above, giving:
^[A-Z][-'a-zA-Z]+$
Don't forget you may need to escape it inside the string! A 'better' way, given your example might be:
^[A-Z]'?[-a-zA-Z]+$
Which will allow a possible single apostrophe in the second position.
Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
Here I'd be tempted to just do our naive way again:
^[A-Z]'?[- a-zA-Z]+$
A potentially better way might be:
^[A-Z]'?[-a-zA-Z]+( [a-zA-Z]+)*$
Which looks for extra words at the end. This probably isn't a good idea if you're trying to match names in a body of extra text, but then again, the original wouldn't have done that well either.
Joint Names (Ben & Jerry)
At this point you're not looking at single names anymore?
Anyway, as you can see, regexes have a habit of growing very quickly...

THE BEST REGEX EXPRESSIONS FOR NAMES:
I will use the term special character to refer to the following three characters:
Hyphen -
Apostrophe '
Dot .
Spaces and special characters can not appear twice in a row (e.g.: -- or '. or .. )
Trimmed (No spaces before or after)
You're welcome ;)
Mandatory single name, WITHOUT spaces, WITHOUT special characters:
^([A-Za-z])+$
Sierra is valid, Jack Alexander is invalid (has a space), O'Neil is invalid (has a special character)
Mandatory single name, WITHOUT spaces, WITH special characters:
^[A-Za-z]+(((\'|\-|\.)?([A-Za-z])+))?$
Sierra is valid, O'Neil is valid, Jack Alexander is invalid (has a space)
Mandatory single name, optional additional names, WITH spaces, WITH special characters:
^[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*$
Jack Alexander is valid, Sierra O'Neil is valid
Mandatory single name, optional additional names, WITH spaces, WITHOUT special characters:
^[A-Za-z]+((\s)?([A-Za-z])+)*$
Jack Alexander is valid, Sierra O'Neil is invalid (has a special character)
SPECIAL CASE
Many modern smart devices add spaces at the end of each word, so in my applications I allow unlimited number of spaces before and after the string, then I trim it in the code behind. So I use the following:
Mandatory single name + optional additional names + spaces + special characters:
^(\s)*[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*(\s)*$
Add your own special characters
If you wish to add your own special characters, let's say an underscore _ this is the group you need to update:
(\'|\-|\.)
To
(\'|\-|\.|\_)
PS: If you have questions comment here and I will receive an email and respond ;)

While I agree with the answers saying you basically can't do this with regex, I will point out that some of the objections (internationalized characters) can be resolved by using UTF strings and the \p{L} character class (matches a unicode "letter").
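As a rough illustration of the "any Unicode letter" idea: Python's built-in re module does not support \p{L}, but the class [^\W\d_] is a common stand-in for it on Unicode strings. The sketch below is an assumption-laden example (the name looks_like_name and the choice of allowed separators are mine), not a complete name validator:

```python
import re

# "Word character that is neither a digit nor an underscore", i.e. a letter.
LETTER = r"[^\W\d_]"

# Runs of letters, optionally joined by a single space, apostrophe or hyphen.
NAME_RE = re.compile(rf"{LETTER}+(?:[ '-]{LETTER}+)*")

def looks_like_name(value):
    return NAME_RE.fullmatch(value) is not None
```

With this pattern, accented Latin and Cyrillic names pass, while digits and leading or trailing separators are rejected.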

security tip: make sure to validate the length of the string before this step, to avoid a DoS attack that brings down your system by sending very long strings.
Check this out:
^(([A-Za-z]+[,.]?[ ]?|[a-z]+['-]?)+)$
You can test it here : https://regex101.com/r/mS9gD7/46

I don't really have a whole lot to add to a regex that takes care of names because there are already some good suggestions here, but if you want a few resources for learning more about regular expressions, you should check out:
Regex Library's Cheat Sheet
Another cheat sheet
A regex tutorial on the DevNetwork forums: Part 1 and Part 2
PHP builder's tutorial
And if you ever need to do regex for JavaScript (it's a little different flavor), try JavaScript Kit, or this resource, or Mozilla's reference

I second the 'give up' advice. Even if you consider numbers, hyphens, apostrophes and such, something like [a-zA-Z] still wouldn't catch international names (for example, those having šđčćž, or Cyrillic alphabet, or Chinese characters...)
But... why are you even trying to verify names? What errors are you trying to catch? Don't you think people know to write their name better than you? ;) Seriously, the only thing you can do by trying to verify names is to irritate people with unusual names.

Basically, I agree with Paul... You will always find exceptions, like di Caprio, DeVil, or such.
Remarks on your message: in PHP, ereg is generally seen as obsolete (slow, incomplete) in favor of preg (PCRE regexes).
And you should try some regex tester, like the powerful Regex Coach: they are great to test quickly REs against arbitrary strings.
If you really need to solve your problem and aren't satisfied with above answers, just ask, I will give a go.

This worked for me:
+[a-z]{2,3} +[a-z]*|[\w'-]*
This regex will correctly match names such as the following:
jean-claude van damme
nadine arroyo-rodriquez
wayne la pierre
beverly d'angelo
billy-bob thornton
tito puente
susan del rio
It will group "van damme", "arroyo-rodriquez" "d'angelo", "billy-bob", etc. as well as the singular names like "wayne".
Note that it does not test that the grouped stuff is actually a valid name. Like others said, you'll need a dictionary for that. Also, it will group numbers, so if that's an issue you may want to modify the regex.
I wrote this to parse names for a MapReduce application. All I wanted was to extract words from the name field, grouping together the del foo and la bar and billy-bobs into one word to make the key-value pair generation more accurate.

^[A-Z][a-zA-Z '&-]*[A-Za-z]$
Will accept anything that starts with an uppercase letter, followed by zero or more of any letter, space, hyphen, ampersand or apostrophes, and ending with a letter.
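A quick way to check a pattern like this against the cases from the question is to run it over a handful of samples. The sketch below does so in Python (the sample list is mine); the pattern itself is unchanged:

```python
import re

NAME_RE = re.compile(r"^[A-Z][a-zA-Z '&-]*[A-Za-z]$")

samples = [
    "Worthington-Smythe",
    "D'Angelo",
    "Van der Humpton",
    "Ben & Jerry",
    "Smythe-",   # ends with a hyphen
    "-Smythe",   # starts with a hyphen
    "A",         # the pattern needs at least two characters
]

results = {name: bool(NAME_RE.match(name)) for name in samples}
for name, ok in results.items():
    print(name, ok)
```

One consequence worth noticing is that single-letter names are rejected, because the pattern requires both a leading capital and a final letter.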

See this question for more "name-detection" related stuff:
regex to match a maximum of 4 spaces
Basically, you have a problem in that there are effectively no characters in existence that can't form a legal name string.
If you are still limiting yourself to words without ä ü æ ß and other similar non-strictly-ASCII characters, get yourself a copy of the UTF-32 character table and realise how many valid characters there are that your simple regex would miss.

To add multiple dots in the username use this Regex:
^[a-zA-Z][a-zA-Z0-9_]*\.?[a-zA-Z0-9_\.]*$
String length can be set separately.

You can easily neutralize the whole matter of whether letters are upper or lowercase -- even in unexpected or uncommon locations -- by converting the string to all upper case using strtoupper() and then checking it against your regex.

/([\u00c0-\u01ffa-zA-Z'\-]+[ ]?[*]?[\u00c0-\u01ffa-zA-Z'\-]*)+/;
Try this. You can also force the match to start at the beginning of the string with ^ and to finish at the end of the string with $.

To improve on daan's answer:
^([\u00c0-\u01ffa-zA-Z]+\b['\-]{0,1})+\b$
This only allows a single hyphen or apostrophe at a time between runs of a-z and valid Unicode characters.
It also backtracks to make sure there is no hyphen or apostrophe at the end of the string.

^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$
For complete details, please visit THIS post. This regex doesn't allow ampersands.
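To see what this pattern accepts, here is a small Python check. The helper name and the sample strings are made up for illustration.

```python
import re

# Pattern from the answer above, copied unchanged.
NAME_RE = re.compile(
    r"^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$"
)

def is_name(text):
    """Return True when the whole string fits the pattern."""
    return NAME_RE.match(text) is not None

# Illustrative inputs only.
for sample in ["Martin Luther King, Jr.", "Henry VIII", "Smith & Wesson", "john smith"]:
    print(sample, is_name(sample))
```

The optional last group is what lets a Roman-numeral suffix such as "VIII" through, and the ampersand case fails as stated.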

If you add spaces, then "He went to the market on Sunday" would be a valid name.
I don't think you can do this with a regex: you cannot easily detect names in a chunk of text that way. You would need a dictionary of approved names and search based on that. Any names not on the list wouldn't be detected.
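A minimal sketch of the dictionary approach, in Python. The KNOWN_NAMES set is a made-up stand-in for a real list of approved names, and the word-splitting regex is an assumption chosen for illustration.

```python
import re

# Hypothetical dictionary of approved names (lowercase for comparison).
KNOWN_NAMES = {"wayne", "susan", "tito"}

def find_names(text):
    """Return the words in text that appear in KNOWN_NAMES, in order."""
    words = re.findall(r"[A-Za-z'-]+", text)
    return [word for word in words if word.lower() in KNOWN_NAMES]

print(find_names("He went to the market on Sunday with Susan and Tito"))
```

As the answer says, anything missing from the list (here, every name except three) goes undetected.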

I have used this because the name can be part of a file path.
// http://support.microsoft.com/kb/177506
foreach (array('/', '\\', ':', '*', '?', '<', '>', '|') as $char) {
    if (strpos($name, $char) !== false) {
        die("Not allowed char: '$char'");
    }
}
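For comparison, the same check can be sketched in Python. It uses the character list from the snippet above; the function name is made up, and it returns the offending characters instead of aborting.

```python
# Characters rejected by the snippet above.
FORBIDDEN = set('/\\:*?<>|')

def forbidden_chars(name):
    """Return the forbidden characters found in name, in order of appearance."""
    return [ch for ch in name if ch in FORBIDDEN]

# Illustrative inputs only.
print(forbidden_chars("John/Doe"))
print(forbidden_chars("Jane Doe"))
```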

I ran into this same issue and, like many others who have posted, I'll say up front that this isn't a 100% foolproof expression, but it's working for us.
/([\-'a-z]+\s?){2,4}/
This will check for any hyphens and/or apostrophes in either the first and/or last name as well as checking for a space between the first and last names. The last part is a little magic that will check for between 2 and 4 names. If you tend to have a lot of international users that may have 5 or even 6 names, you can change that to 5 or 6 and it should work for you.
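One caveat is worth checking before relying on the {2,4} count. Because the space is optional (\s?), the repetition can also be satisfied by splitting a single word into two pieces, and without ^ and $ the pattern only has to match somewhere in the string. The Python sketch below anchors the pattern with fullmatch; the sample strings are made up.

```python
import re

# Pattern from the answer above, without the surrounding slashes.
PATTERN = r"([\-'a-z]+\s?){2,4}"

def whole_match(text):
    """Return True when the entire string matches the pattern."""
    return re.fullmatch(PATTERN, text) is not None

print(whole_match("billy-bob thornton"))  # two parts separated by a space
print(whole_match("thornton"))            # a single word also passes
print(whole_match("a"))                   # one letter cannot be split in two
```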

I think "/^[a-zA-Z']+$/" is not enough, since it lets a single letter pass. We can restrict the length by replacing + with {4,20}, which requires between 4 and 20 characters.
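A short Python comparison of the two forms, assuming the {4,20} quantifier takes the place of the +. The sample strings are made up.

```python
import re

LOOSE = re.compile(r"^[a-zA-Z']+$")         # one or more characters
BOUNDED = re.compile(r"^[a-zA-Z']{4,20}$")  # between 4 and 20 characters

# Illustrative inputs only.
for sample in ["A", "Anna", "O'Neil"]:
    print(sample, bool(LOOSE.match(sample)), bool(BOUNDED.match(sample)))
```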

I've come up with this RegEx pattern for names:
/^([a-zA-Z]+[\s'.]?)+\S$/
It works. I think you should use it too.
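It is meant to match only names or strings like:
Dr. Shaquil O'Neil Armstrong Buzz-Aldrin
As printed, though, the pattern allows only one separator character between runs of letters and has no hyphen in its separator class, so it rejects the "Dr. " and "Buzz-Aldrin" parts of that example while accepting "Shaquil O'Neil Armstrong".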
It won't match strings with 2 or more consecutive spaces like:
"John  Paul"
It won't match strings with trailing spaces like:
"John Paul "
The text above has a trailing space before the closing quote.
Here's what I use to learn and create regex patterns:
RegExr: Learn, Build and Test RegEx

Try this: /^([A-Z][a-z]([ ][a-z]+)([ '-]([&][ ])?[A-Z][a-z]+)*)$/
Demo: http://regexr.com/3bai1
Have a nice day!

You can use the pattern below for names:
^[a-zA-Z'-]{3,}\s[a-zA-Z'-]{3,}$
^ start of the string
$ end of the string
\s a whitespace character
[a-zA-Z'-]{3,} accepts any name of 3 or more characters, including names with ' or - like jean-luc
So in our case it will only accept names in 2 parts separated by a space.
If there can be multiple first names, you can add \s to the first character class:
^[a-zA-Z'-\s]{3,}\s[a-zA-Z'-]{3,}$

The following regex is simple and useful for proper names (towns, cities, first names, last names). It allows all international letters without requiring a Unicode-aware regex engine.
It is flexible - you can add/remove characters you want in the expression (focusing on characters you want to reject rather than include).
^(?:(?!^\s|[ \-']{2}|[\d\r\n\t\f\v!"#$%&()*+,\.\/:;<=>?#[\\\]^_`{|}~€‚ƒ„…†‡ˆ‰‹‘’“”•–—˜™›¡¢£¤¥¦§¨©ª«¬®¯°±²³´¶·¸¹º»¼½¾¿×÷№′″ⁿ⁺⁰‱₁₂₃₄]|\s$).){1,50}$
Regex matches: from 1 to 50 international letters separated by single delimiter (space -')
Regex rejects: empty prefix/suffix, consecutive delimiters (space - '), digits, new line, tab, limited list of extended ASCII characters

This is what I use for full name:
$pattern = "/^((\p{Lu}{1})\S(\p{Ll}{1,20})[^0-9])+[-'\s]((\p{Lu}{1})\S(\p{Ll}{1,20}))*[^0-9]$/u";
Supports any script with upper- and lowercase letters (via \p{Lu} and \p{Ll})
Common names ("Jane Doe", "John Doe")
Useful for composed names ("Marie-Josée Côté-Rochon", "Bill O'reilly")
Excludes digits (0-9)
Only accepts uppercase at the beginning of names
First and last names from 2-21 characters
Add trim() to remove surrounding whitespace
Does not accept ("John J. William", "Francis O'reilly Jr. III")
Must use full names, not: ("John", "Jane", "O'reilly", "Smith")
Edit:
It seems that each [^0-9] in the pattern above was consuming an extra character, so at least a fourth letter/digit was required in each of the first and last names.
Therefore names of three letters/digits could not be matched.
Here is the edited regular expression:
$pattern = "/^(\p{Lu}{1}\S\p{Ll}{1,20}[-'\s]\p{Lu}{1}\S\p{Ll}{1,20})+([^\d]+)$/u";

Give up. Every rule you can think of has exceptions in some culture or other. Even if that "culture" is geeks who like to legally change their names to "37eet".

Try this regex:
^[a-zA-Z'-\s\.]{3,20}\s[a-zA-Z'-\.]{3,20}$
Aomine's answer was quite helpful; I tweaked it a bit to include:
Names with dots (middle): Jane J. Samuels
Names with dots at the end: John Simms Snr.
Also, as written, the name and the surname must each be at least 3 characters and no more than 20 (so up to 40 characters in total, plus the separating space).
Successful Test cases:
D'amalia Jones
David Silva Jnr.
Jay-Silva Thompson
Shay .J. Muhanned
Bob J. Iverson
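One detail to be aware of in the second character class: between ' and \. the hyphen forms a range, so the class also accepts characters such as ( ) * + and , which sit between the apostrophe and the dot in ASCII. The Python sketch below compares it with a variant that has the hyphen moved to the end; that variant is an assumption about the intent, not part of the answer above.

```python
import re

# Second character class from the pattern above, copied unchanged.
SURNAME_CLASS = re.compile(r"[a-zA-Z'-\.]{3,20}")
# Assumed intent: letters, apostrophe, dot and a literal hyphen only.
TIGHTER_CLASS = re.compile(r"[a-zA-Z'.-]{3,20}")

# Illustrative inputs only.
for sample in ["Jones", "Jnr.", "a+b", "(x)"]:
    print(sample,
          bool(SURNAME_CLASS.fullmatch(sample)),
          bool(TIGHTER_CLASS.fullmatch(sample)))
```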
