Parse Fixed Width File PHP Regular Expression

Parse Fixed Width File PHP Regular Expression - php

I have a export file that contains data of fixed widths:
Name (6)
Gender (6)
Phone Number (12) - Includes a space
Data.txt
DanielMale (07654) 521254
Lisa Female(16545) 654456
Sarah Female(54656) 4896546
I need to extract the name and gender data including any spaces if the data doesn't fit the data width.
The brackets for the phone number need to be ignored. (How do you ignore items in regular expressions?
I currently have the following regular expression, that pulls out the people's names. I thought I could simply add the bit on the white to make it pull out the Genders, but this doesn't work. Where am I going wrong?
/(?<name>.{6}+) (?<gender>.{6}+)/
I need the data to look like this at the end.
^ = space
Daniel
Male^^
07654 521254
Lisa^^
Female
16545 654456
Sarah^
Female
54656 4896546

This should catch all four fields:
/^(?<name>.{6})(?<gender>.{6})\((?<prefix>[^\)]+)\)\ (?<number>.+)/
The first ^ means match from the start.
{6}: match 6 times the pattern (the plus sign is redundant here)
\( and \): match brackets (without escaping they would mean boundaries of a subpattern)
[^\)]+ means "everything before the first closing bracket

Related

How to extract text from multiple lines including the first and last word?

I am trying to extract part of a long text, such as information about caring for a plant. The text contains paragraphs and blank lines. I am not able to capture the specific text I want, the second problem is that the last word isn't showing in the extracted text, and the last problem is when my search starts at the beginning of the line.
I tried searching for the text I want to extract by using a word that isn't at the beginning of the line, it worked except that the end of the desired text is missing a word, and if that word is on new line, it won't show any results at all.
I was using https://scriptun.com/tools/php/preg_match for testing
//The first word to start the search is 'How to'. And I want to capture it as well
// The second word where the text I want ends is '(optional):'
'/(?=How to).*?\s(?=\(optional\):)/'
The sample text I am using to test is:
//Text comes before this..
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
//And more text goes here
I want to extract all the text from the word 'How to' ending with '(optional):'. Regardless of how many lines or paragraphs are in between
The expected extracted text:
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
Thank you

That's pretty easy. You can use the following pattern:
https://regex101.com/r/TjE2x8/2
Pattern: ^How to[\w\W]+?\(optional\):$

Pattern: ^How to(?:.|\R)*optional\):$
demo on regex101
Explanation:
^ match the first instance where How to appears at the beginning of the line
(?: ) non capturing group. We need it because of the following OR instruction which is the pipe |. But we don't need to capture the contents. That's why we use ?: after the first parenthesis.
. every character
| or
\R every kind of new line
* make sure to capture zero to every instance of the group
optional\):$ match the word optional with parenthesis (escaped, because it is not an instruction) \) and a colon : at the very end of the text $
Pattern 2: /^How to.*optional\):$/ms
demo on regex101
This pattern is even simpler, but requires the m and s flag to be set in order to match multiline and the . character class to match new lines.

PHP preg_replace: find string part not starting with an exclamation point

I am working on some very messy Excel sheets, and trying to use PHP to find clues..
I have a MySQL database with all formulas from an excel document, and as usual, the cellnames from the current sheet do not have a "sheetname!" in front of it. To make it searchable (and find dead-routes in the formulas) I like to replace all formulas in the database with their sheetname as prefix.
Example:
=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12
The database contains the name of the current sheet, and I like to change the formula above with that sheetname (let's call it "sheet_turnover").
=+(sheet_factory_costs!A17 / sheet_employees!D23)+sheet_turnover!T12+sheet_turnover!W12
I try this in PHP with preg_replace, and I think I need the following rules:
Find one or two letters, directly followed by a number. This is always a cell-adress within formulas.
When there is a ! on the position before, there is already a sheetname. So I am only looking for the letters and numbers NOT starting with an exclamation point.
The problem seems to be that the ! is also a special sign within patterns. Even if I try to escape it, it does not work:
$newformula =
preg_replace('/(?<\!)[A-Z]{1,2}[0-9]/',
'lala',
$oldformula);
(lala is my temporary marker to see if it is selecting the right cell-adresses)
(and yes, the lala is only places over the first number, but that's no issue right now)
(and yes, all Excel $..$.. (permanent) markers have already been replaced. No need to build that in the formula)

Your negative lookbehind is corrupt, you need to define it as (?<!!). However, you also need to use either a word boundary before it, or a (?<![A-Z]) lookbehind to make sure you have no other letters before the [A-Z]{1,2}.
So, you may use
'~\b(?<!!)[A-Z]{1,2}[0-9]~'
See the regex demo. Replace with sheet_turnover!$0 where $0 is the whole match value.
Details
\b - a word boundary (it is necessary, or name!AA11 would still get matched)
(?<!!) - no ! immediately to the left of the current location
[A-Z]{1,2} - 1 or 2 letters
[0-9] - a digit.
Another approach is match and skip "wrong" contexts and then match and keep the "right" ones:
'~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~'
See this regex demo.
Here, \w+![A-Z]{1,2}[0-9](*SKIP)(*F)| part matches 1 or more word chars, then 1 or 2 uppercase ASCII letters and then a digit, and (*SKIP)(*F) will omit the match and will make the engine proceed looking for matches after the end of the previous match.

PHP regex match multiple pieces

I am new to regex and I know the basics of how to pull out one sub string from a given string but I am struggling to get out multiple parts that I need. I am wondering if someone could help me with this simple example and then I work my way from there. Take this string:
LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL
The parts in italics are the parts of the string that will change and bold will stay the same in every string. The parts I need out are:
First team id which in this case is LMJ, this will always start the string and be 3 uppercase letters, ^[A-Z]{3}?
The Neu part which could be one of 3 strings, Neu, Off, Def, [Neu|Off|Def]?
The second team part which will come always after the word Zone -, [A-Z]{3}?
Need the numeric part of the string after the first #. This could be 1 or 2 digits [0-9]{1,2}?
5.Third team part same as 3 except will appear after vs, [A-Z]{3}?
Same as 4 need numeric part after 2nd #, [0-9]{1,2}?
I would like to put that all together into one regex is that possible?

Everything inside square brackets is a so-called character class: it matches only a single character. so, [Neu|Off|Def] means: exactly one of the characters N, e, u, |, O, f or D (repetitions are ignored)
What you want is a capture group: (Neu|Off|Def)
Putting it together:
^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$
(This assumes you're not interested in the "LEIGH" and "ONEIL" parts, and these are always in upper case letters)

The regex should be something like;
'/([A-Z]{3})\ won\ (Neu|Off|Def)\.\ Zone\ -\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)\ vs\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)/'
() are used for capturing the different parts.
This is not tested properly.

How check different spellings of a persons full name

I try to create a regular expression with searches in a huge document for a persons full name. In the text the name can be written in full, or the first names can be either abbreviated to a single letter or a letter followed by a dot or omitted. For instance my search for _ALBERTO JORGE ALONSO CALEFACCION_now is:
preg_match('/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+
(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION))([;:.,&\s\xc2(){}
!"'<>]{1})/i', $text, $match);
Between the first names and last names an asterisk (*) can be present.
This is working for the case all first names are at least present some way. But I don't know to extend the expression when first names are omitted. Can you help me?

Let's start by simplifying what you have;
start:
/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"'<>]{1})/i
as I said in my comment, \b is "word break", so you can simplify a lot of that:
/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
(added bonus: it won't match the characters either side now, and it will match at the start and end of the text)
Next, you can use the ? token for the dots (which should be escaped by the way; . is special and means "match anything")
/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Finally, to actually answer your question, you have 2 choices. Either make the entire bracketed name optional, or add a new blank option. The first is the most flexible, since we'll need to cope with the whitespace too:
/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Note that if you're reading the matched parts you'll need to update your indices. Also note that this fixed an issue where omitting the second name (JORGE) still required an extra space.
This will match things like A. J. ALONSO CALEFACCION, A. ALONSO CALEFACCION and ALONSO CALEFACCION, but not J. ALONSO CALEFACCION (it's only a small tweak if you do want that)
Breaking up that final string for clarity:
/\b
(
(ALBERTO|A\.?)[\s\xc2-]+
(
(JORGE|J\.?)[\s\xc2,]+
)?
)?
(ALONSO)[\s\xc2*-]+
(CALEFACCION)
\b/i
Finally, it's an odd thought, but you could change the names which can be initials to be in this form: (A(LBERTO|\.|)), which means you're not repeating the initials (a potential source of mistakes)

Regex for names

Just starting to explore the 'wonders' of regex. Being someone who learns from trial and error, I'm really struggling because my trials are throwing up a disproportionate amount of errors... My experiments are in PHP using ereg().
Anyway. I work with first and last names separately but for now using the same regex. So far I have:
^[A-Z][a-zA-Z]+$
Any length string that starts with a capital and has only letters (capital or not) for the rest. But where I fall apart is dealing with the special situations that can pretty much occur anywhere.
Hyphenated Names (Worthington-Smythe)
Names with Apostophies (D'Angelo)
Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
Joint Names (Ben & Jerry)
Maybe there's some other way a name can be that I'm no thinking of, but I suspect if I can get my head around this, I can add to it. I'm pretty sure there will be instances where more than one of these situations comes up in one name.
So, I think the bottom line is to have my regex also accept a space, hyphens, ampersands and apostrophes - but not at the start or end of the name to be technically correct.

This regex is perfect for me.
^([ \u00c0-\u01ffa-zA-Z'\-])+$
It works fine in php environments using preg_match(), but doesn't work everywhere.
It matches Jérémie O'Co-nor so I think it matches all UTF-8 names.

Hyphenated Names (Worthington-Smythe)
Add a - into the second character class. The easiest way to do that is to add it at the start so that it can't possibly be interpreted as a range modifier (as in a-z).
^[A-Z][-a-zA-Z]+$
Names with Apostophies (D'Angelo)
A naive way of doing this would be as above, giving:
^[A-Z][-'a-zA-Z]+$
Don't forget you may need to escape it inside the string! A 'better' way, given your example might be:
^[A-Z]'?[-a-zA-Z]+$
Which will allow a possible single apostrophe in the second position.
Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
Here I'd be tempted to just do our naive way again:
^[A-Z]'?[- a-zA-Z]+$
A potentially better way might be:
^[A-Z]'?[- a-zA-Z]( [a-zA-Z])*$
Which looks for extra words at the end. This probably isn't a good idea if you're trying to match names in a body of extra text, but then again, the original wouldn't have done that well either.
Joint Names (Ben & Jerry)
At this point you're not looking at single names anymore?
Anyway, as you can see, regexes have a habit of growing very quickly...

THE BEST REGEX EXPRESSIONS FOR NAMES:
I will use the term special character to refer to the following three characters:
Dash -
Hyphen '
Dot .
Spaces and special characters can not appear twice in a row (e.g.: -- or '. or .. )
Trimmed (No spaces before or after)
You're welcome ;)
Mandatory single name, WITHOUT spaces, WITHOUT special characters:
^([A-Za-z])+$
Sierra is valid, Jack Alexander is invalid (has a space), O'Neil is invalid (has a special character)
Mandatory single name, WITHOUT spaces, WITH special characters:
^[A-Za-z]+(((\'|\-|\.)?([A-Za-z])+))?$
Sierra is valid, O'Neil is valid, Jack Alexander is invalid (has a space)
Mandatory single name, optional additional names, WITH spaces, WITH special characters:
^[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*$
Jack Alexander is valid, Sierra O'Neil is valid
Mandatory single name, optional additional names, WITH spaces, WITHOUT special characters:
^[A-Za-z]+((\s)?([A-Za-z])+)*$
Jack Alexander is valid, Sierra O'Neil is invalid (has a special character)
SPECIAL CASE
Many modern smart devices add spaces at the end of each word, so in my applications I allow unlimited number of spaces before and after the string, then I trim it in the code behind. So I use the following:
Mandatory single name + optional additional names + spaces + special characters:
^(\s)*[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*(\s)*$
Add your own special characters
If you wish to add your own special characters, let's say an underscore _ this is the group you need to update:
(\'|\-|\.)
To
(\'|\-|\.|\_)
PS: If you have questions comment here and I will receive an email and respond ;)

While I agree with the answers saying you basically can't do this with regex, I will point out that some of the objections (internationalized characters) can be resolved by using UTF strings and the \p{L} character class (matches a unicode "letter").

security tip: make sure to validate the size of the string before this step to avoid DoS attack that will bring down your system by sending very long charsets.
Check this out:
^(([A-Za-z]+[,.]?[ ]?|[a-z]+['-]?)+)$
You can test it here : https://regex101.com/r/mS9gD7/46

I don't really have a whole lot to add to a regex that takes care of names because there are already some good suggestions here, but if you want a few resources for learning more about regular expressions, you should check out:
Regex Library's Cheat
Sheet
Another cheat sheet
A regex tutorial on the DevNetwork
forums: Part 1 and Part 2
PHP builder's tutorial
And if you ever need to do regex for
JavaScript (it's a little
different flavor), try JavaScript Kit,
or this resource, or Mozilla's
reference

I second the 'give up' advice. Even if you consider numbers, hyphens, apostrophes and such, something like [a-zA-Z] still wouldn't catch international names (for example, those having šđčćž, or Cyrillic alphabet, or Chinese characters...)
But... why are you even trying to verify names? What errors are you trying to catch? Don't you think people know to write their name better than you? ;) Seriously, the only thing you can do by trying to verify names is to irritate people with unusual names.

Basically, I agree with Paul... You will always find exceptions, like di Caprio, DeVil, or such.
Remarks on your message: in PHP, ereg is generally seen as obsolete (slow, incomplete) in favor of preg (PCRE regexes).
And you should try some regex tester, like the powerful Regex Coach: they are great to test quickly REs against arbitrary strings.
If you really need to solve your problem and aren't satisfied with above answers, just ask, I will give a go.

This worked for me:
+[a-z]{2,3} +[a-z]*|[\w'-]*
This regex will correctly match names such as the following:
jean-claude van damme
nadine arroyo-rodriquez
wayne la pierre
beverly d'angelo
billy-bob thornton
tito puente
susan del rio
It will group "van damme", "arroyo-rodriquez" "d'angelo", "billy-bob", etc. as well as the singular names like "wayne".
Note that it does not test that the grouped stuff is actually a valid name. Like others said, you'll need a dictionary for that. Also, it will group numbers, so if that's an issue you may want to modify the regex.
I wrote this to parse names for a MapReduce application. All I wanted was to extract words from the name field, grouping together the del foo and la bar and billy-bobs into one word to make the key-value pair generation more accurate.

^[A-Z][a-zA-Z '&-]*[A-Za-z]$
Will accept anything that starts with an uppercase letter, followed by zero or more of any letter, space, hyphen, ampersand or apostrophes, and ending with a letter.

See this question for more related "name-detection" related stuff.
regex to match a maximum of 4 spaces
Basically, you have a problem in that, there are effectively no characters in existence that can't form a legal name string.
If you are still limiting yourself to words without ä ü æ ß and other similar non-strictly-ascii characters.
Get yourself a copy of UTF32 character table and realise how many millions of valid characters there are that your simple regex would miss.

To add multiple dots in the username use this Regex:
^[a-zA-Z][a-zA-Z0-9_]*\.?[a-zA-Z0-9_\.]*$
String length can be set separately.

You can easily neutralize the whole matter of whether letters are upper or lowercase -- even in unexpected or uncommon locations -- by converting the string to all upper case using strtoupper() and then checking it against your regex.

/([\u00c0-\u01ffa-zA-Z'\-]+[ ]?[*]?[\u00c0-\u01ffa-zA-Z'\-]*)+/;
Try this . You can also force to start with char using ^,and end with char using $

To improve on daan's answer:
^([\u00c0-\u01ffa-zA-Z]+\b['\-]{0,1})+\b$
only allows a single occurances of hyphen or apostrophy within a-z and valid unicode chars.
also does a backtrack to make sure there is no hyphen or apostrophes at the end of the string.

^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$
For complete details, please visit THIS post. This regex doesn't allow ampersands.

if you add spaces then "He went to the market on Sunday" would be a valid name.
I don't think you can do this with a regex, you cannot easily detect names from a chunk of text using a regex, you would need a dictionary of approved names and search based on that. Any names not on the list wouldn't be detected.

I have used this, because name can be the part of file-patch.
//http://support.microsoft.com/kb/177506
foreach(array('/','\\',':','*','?','<','>','|') as $char)
if(strpos($name,$char)!==false)
die("Not allowed char: '$char'");

I ran into this same issue, and like many others that have posted, this isn't a 100% fool proof expression, but it's working for us.
/([\-'a-z]+\s?){2,4}/
This will check for any hyphens and/or apostrophes in either the first and/or last name as well as checking for a space between the first and last names. The last part is a little magic that will check for between 2 and 4 names. If you tend to have a lot of international users that may have 5 or even 6 names, you can change that to 5 or 6 and it should work for you.

i think "/^[a-zA-Z']+$/" is not enough it will allow to pass single letter we can adjust the range by adding {4,20} which means the range of letters are 4 to 20.

I've come up with this RegEx pattern for names:
/^([a-zA-Z]+[\s'.]?)+\S$/
It works. I think you should use it too.
It matches only names or strings like:
Dr. Shaquil O'Neil Armstrong Buzz-Aldrin
It won't match strings with 2 or more spaces like:
John Paul
It won't match strings with ending spaces like:
John Paul
The text above has an ending space. Try highlighting or selecting the text to see the space
Here's what I use to learn and create your own regex patterns:
RegExr: Leanr, Build and Test RegEx

Try this: /^([A-Z][a-z]([ ][a-z]+)([ '-]([&][ ])?[A-Z][a-z]+)*)$/
Demo: http://regexr.com/3bai1
Have a nice day !

you can use this below for names
^[a-zA-Z'-]{3,}\s[a-zA-Z'-]{3,}$
^ start of the string
$ end of the string
\s space
[a-zA-Z'-\s]{3,} will accept any name with a length of 3 characters or more, and it include names with ' or - like jean-luc
So in our case it will only accept names in 2 parts separated by a space
in case of multiple first-name you can add a \s
^[a-zA-Z'-\s]{3,}\s[a-zA-Z'-]{3,}$

Following Regex is simple and useful for proper names (Towns, Cities, First Name, Last Name) allowing all international letters omitting unicode-based regex engine.
It is flexible - you can add/remove characters you want in the expression (focusing on characters you want to reject rather than include).
^(?:(?!^\s|[ \-']{2}|[\d\r\n\t\f\v!"#$%&()*+,\.\/:;<=>?#[\\\]^_`{|}~€‚ƒ„…†‡ˆ‰‹‘’“”•–—˜™›¡¢£¤¥¦§¨©ª«¬®¯°±²³´¶·¸¹º»¼½¾¿×÷№′″ⁿ⁺⁰‱₁₂₃₄]|\s$).){1,50}$
Regex matches: from 1 to 50 international letters separated by single delimiter (space -')
Regex rejects: empty prefix/suffix, consecutive delimiters (space - '), digits, new line, tab, limited list of extended ASCII characters
Demo

This is what I use for full name:
$pattern = "/^((\p{Lu}{1})\S(\p{Ll}{1,20})[^0-9])+[-'\s]((\p{Lu}{1})\S(\p{Ll}{1,20}))*[^0-9]$/u";
Supports all languages
Common names("Jane Doe", "John Doe")
Usefull for composed names("Marie-Josée Côté-Rochon", "Bill O'reilly")
Excludes digits(0-9)
Only excepts uppercase at beginning of names
First and last names from 2-21 characters
Adding trim() to remove whitespace
Does not except("John J. William", "Francis O'reilly Jr. III")
Must use full names, not: ("John", "Jane", "O'reilly", "Smith")
Edit:
It seems that both [^0-9] in the pattern above was matching at least a fourth digit/letter in each of either first and/or last names.
Therefore names of three letters/digits could not be matched.
Here is the edited regular expression:
$pattern = "/^(\p{Lu}{1}\S\p{Ll}{1,20}[-'\s]\p{Lu}{1}\S\p{Ll}{1,20})+([^\d]+)$/u";

Give up. Every rule you can think of has exceptions in some culture or other. Even if that "culture" is geeks who like legally change their names to "37eet".

Try this regex:
^[a-zA-Z'-\s\.]{3,20}\s[a-zA-Z'-\.]{3,20}$
Aomine's answer was quite helpful, I tweaked it a bit to include:
Names with dots (middle): Jane J. Samuels
Names with dots at the end: John Simms Snr.
Also the name will accept minimum 2 letters, and a min. of 2 letters for surname but no more than 20 for each (so total of 40 characters)
Successful Test cases:
D'amalia Jones
David Silva Jnr.
Jay-Silva Thompson
Shay .J. Muhanned
Bob J. Iverson

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.