A test of preg_match is successful but preg_split fails - php

I am trying to test a means by which I can break apart a single string containing multiple records about scholarly publications. There is nothing so convenient as a meaningful delimiter separating one record from the next. But I believe it could be accomplished, given the pattern that each record ends with a date followed by a comma and a space (unless no additional records follow, in which case it is merely ended with the date), such as "YYYY-MM-DD, ".
I have begun with a simple test involving a string, and confirming that the regular expression recognizes the pattern I am looking for:
$date="2012-09-12, ";
if (preg_match("/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]), $/",$date))
{
echo("yes");
}else{
echo("no");
}
However, when I try to take it to the next step by using a sample of real data and preg_split(), the split isn't working. I cannot understand why this simple test, modeled on Example 1 in the manual, fails to split the string:
<?php
$pubs="L.J. Santodonato, Y. Zhang, M. Feygenson, C.M. Parish, M.C. Gao, R.J. Weber, J.C. Neuefeind, Z. Tang, P.K. Liaw~Deviation from high-entropy configurations in the atomic distributions of a multi-principal-element alloy.~NATURE COMMUNICATIONS~6~2015~~~~0~~0~~2015-11-21, S. Liu, M.C. Gao, P.K. Liaw, Y. Zhang~Microstructures and mechanical properties of AlxCrFeNiTi 0.25 alloys.~JOURNAL OF ALLOYS AND COMPOUNDS~619~2015~610~~~0~~0~~2015-11-21";
$pubsArray = preg_split("/^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]), $/", $pubs);
print_r($pubsArray);
?>
Data matching the same pattern is found within the example string $pubs, but all I ever get back is an array with a single element containing the full string. I have run out of ideas as to what to try next, and would be grateful for any suggestions.

But I believe it could be accomplished, given the pattern that each record ends with a date followed by a comma and a space (unless no additional records follow, in which case it is merely ended with the date), such as "YYYY-MM-DD, ".
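Your preg_split() call fails because the pattern is anchored: ^ and $ only let it match when the entire subject string is a date followed by a comma and a space. That was true for your test string, but it is not true for $pubs.
Since you are only splitting on the occurrence of a date, not validating it, a simple pattern such as /\d{4}(-\d{2}){2}/ is enough; there is no need to enumerate the valid months and days.
To split the string after each date, use the following regex.
Regex: /(?<=\d{4}(-\d{2}){2}),\s*/ matches a comma and any following whitespace, but only where they are immediately preceded by a date. The date sits in a lookbehind, so it is not consumed and stays at the end of each record, as I suppose you want to keep the date of publication.
PHP code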
<?php
$pubs="L.J. Santodonato, Y. Zhang, M. Feygenson, C.M. Parish, M.C. Gao, R.J. Weber, J.C. Neuefeind, Z. Tang, P.K. Liaw~Deviation from high-entropy configurations in the atomic distributions of a multi-principal-element alloy.~NATURE COMMUNICATIONS~6~2015~~~~0~~0~~2015-11-21, S. Liu, M.C. Gao, P.K. Liaw, Y. Zhang~Microstructures and mechanical properties of AlxCrFeNiTi 0.25 alloys.~JOURNAL OF ALLOYS AND COMPOUNDS~619~2015~610~~~0~~0~~2015-11-21";
$pubsArray = preg_split("/(?<=\d{4}(-\d{2}){2}),\s*/", $pubs);
print_r($pubsArray);
?>
Regex101 Demo
Ideone Demo
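To see the anchored and the unanchored pattern side by side, here is a minimal sketch. The shortened $pubs string and the variable names $anchored and $records are made up for illustration; real records would carry more ~-separated fields.

```php
<?php
// Shortened, hypothetical sample: two records, each ending in a date.
$pubs = "A. Author~First title.~JOURNAL A~2015-11-21, B. Writer, C. Scribe~Second title.~JOURNAL B~2015-11-22";

// Anchored: ^ and $ demand that the WHOLE subject is "date, ",
// so nothing matches and the string comes back as a single element.
$anchored = preg_split('/^\d{4}-\d{2}-\d{2}, $/', $pubs);

// Unanchored with a lookbehind: split on ", " only when a date precedes it.
// The comma between the authors "B. Writer, C. Scribe" is left alone.
$records = preg_split('/(?<=\d{4}-\d{2}-\d{2}),\s*/', $pubs);

echo count($anchored), "\n";
print_r($records);
```

Each element of $records can then be broken into fields with explode('~', $record).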

Related

PHP: Split a string at the first period that isn't the decimal point in a price or the last character of the string

I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
The pattern first tries to match your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?, and discards it with (*SKIP)(*F); otherwise, it matches a non-final . with \.(?!\s*$) (a dot still counts as final when only whitespace follows it).
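A minimal sketch of that split applied to two of the example inputs from the question. The variable names are illustrative, the limit of 2 is my addition to stop after the first qualifying dot, and the source file is assumed to be saved as UTF-8 because of the u modifier.

```php
<?php
// Pattern from the answer: prices are matched and skipped, other dots split.
$pattern = '~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';

$noSplit = 'This string should not split as the only periods that appear are here £19.99 and also at the end.';
$split   = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';

// Limit 2: split at the first qualifying dot only.
$partsA = preg_split($pattern, $noSplit, 2);
$partsB = preg_split($pattern, $split, 2);

echo count($partsA), "\n"; // one element means no split happened
echo $partsB[0], "\n";     // text before the first qualifying dot
```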
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
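The matching approach could be used roughly as follows. This is a sketch with illustrative variable names ($before, $after). Note that the pattern also matches when the only non-price dot is the final one, in which case Group 2 is empty, so callers may want to treat an empty $after as "no split".

```php
<?php
$string  = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$pattern = '~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su';

$before = null;
$after  = null;

if (preg_match($pattern, $string, $match)) {
    $before = $match[1];        // Group 1: everything up to the first qualifying dot
    $after  = ltrim($match[2]); // Group 2: the rest, without the leading space
}

echo $before, "\n";
echo $after, "\n";
```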
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the indexes of the different sentences in the string.
It takes ?, ! and ... into account too. Besides the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' and trim each piece to get your groups.
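The two steps might look like this in PHP. It is a sketch: $line, $version and $items are illustrative names, and the empty-list check is my assumption about how "0 matches" should be represented.

```php
<?php
$line = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

$version = null;
$items   = [];

// Step 1: one simple regex for the version and the raw list.
if (preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $line, $m)) {
    $version = $m[1];
    // Step 2: split the list; an empty pair of parentheses yields no items.
    $items = $m[2] === '' ? [] : array_map('trim', explode(',', $m[2]));
}

echo $version, "\n";
print_r($items);
```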
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G(?!^)) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended; (?!^) keeps it from matching at the start of the string
[(,]? * matches an optional opening parenthesis or comma followed by any amount of space
\K resets the beginning of the reported match
[^,)(]+ matches the wanted characters, which are anything except ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
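A sketch of how the capture-group variant could be consumed with preg_match_all(). The loop and the variable names are illustrative; the trim() is there because [^(]* also picks up the space before the opening parenthesis.

```php
<?php
$line    = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';
$pattern = '~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~';

$version = null;
$items   = [];

preg_match_all($pattern, $line, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    if (isset($m[2]) && $m[2] !== '') {
        $items[] = $m[2];        // a list entry glued to the previous match via \G
    } else {
        $version = trim($m[1]);  // the entry point: "SomeText/<version> ("
    }
}

echo $version, "\n";
print_r($items);
```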
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

PHP regex match multiple pieces

I am new to regex and I know the basics of how to pull out one sub string from a given string but I am struggling to get out multiple parts that I need. I am wondering if someone could help me with this simple example and then I work my way from there. Take this string:
LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL
The parts in italics are the parts of the string that will change and bold will stay the same in every string. The parts I need out are:
1. First team id, which in this case is LMJ. This will always start the string and be 3 uppercase letters, ^[A-Z]{3}?
2. The Neu part, which could be one of 3 strings: Neu, Off, Def, [Neu|Off|Def]?
3. The second team part, which will always come after the word Zone -, [A-Z]{3}?
4. The numeric part of the string after the first #. This could be 1 or 2 digits, [0-9]{1,2}?
5. Third team part, same as 3 except it will appear after vs, [A-Z]{3}?
6. Same as 4: the numeric part after the 2nd #, [0-9]{1,2}?
I would like to put that all together into one regex. Is that possible?
Everything inside square brackets is a so-called character class: it matches only a single character. So [Neu|Off|Def] means: exactly one of the characters N, e, u, |, O, f or D (repetitions are ignored).
What you want is an alternation inside a capture group: (Neu|Off|Def)
Putting it together:
^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$
(This assumes you're not interested in the "LEIGH" and "ONEIL" parts, and these are always in upper case letters)
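For illustration, the combined pattern can be run with preg_match() like this. It is a sketch; the keys of $parts are made-up labels, not something from the question.

```php
<?php
$line    = 'LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL';
$pattern = '/^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$/';

$parts = null;

if (preg_match($pattern, $line, $m)) {
    // Hypothetical labels for the six capture groups.
    $parts = [
        'winner'      => $m[1],
        'zone'        => $m[2],
        'team1'       => $m[3],
        'team1Number' => $m[4],
        'team2'       => $m[5],
        'team2Number' => $m[6],
    ];
}

print_r($parts);
```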
The regex should be something like;
'/([A-Z]{3})\ won\ (Neu|Off|Def)\.\ Zone\ -\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)\ vs\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)/'
() are used for capturing the different parts.
This is not tested properly.

How to check different spellings of a person's full name

I am trying to create a regular expression which searches a huge document for a person's full name. In the text the name can be written in full, or the first names can be abbreviated to a single letter, abbreviated to a letter followed by a dot, or omitted. For instance, my search for ALBERTO JORGE ALONSO CALEFACCION is now:
preg_match('/([;:.,&\s\xc2\-(){}!"\'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"\'<>]{1})/i', $text, $match);
Between the first names and last names an asterisk (*) can be present.
This works as long as all first names are present in some form. But I don't know how to extend the expression for the case where first names are omitted. Can you help me?
Let's start by simplifying what you have;
start:
/([;:.,&\s\xc2\-(){}!"'<>]{1})(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)([;:.,&\s\xc2(){}!"'<>]{1})/i
As I said in my comment, \b is a "word boundary", so you can simplify a lot of that:
/\b(ALBERTO|A.|A)[\s\xc2-]+(JORGE|J.|J)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
(added bonus: it won't match the characters either side now, and it will match at the start and end of the text)
Next, you can use the ? token for the dots (which should be escaped by the way; . is special and means "match anything")
/\b(ALBERTO|A\.?)[\s\xc2-]+(JORGE|J\.?)?[\s\xc2,]+(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Finally, to actually answer your question, you have 2 choices. Either make the entire bracketed name optional, or add a new blank option. The first is the most flexible, since we'll need to cope with the whitespace too:
/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i
Note that if you're reading the matched parts you'll need to update your indices. Also note that this fixed an issue where omitting the second name (JORGE) still required an extra space.
This will match things like A. J. ALONSO CALEFACCION, A. ALONSO CALEFACCION and ALONSO CALEFACCION, but not J. ALONSO CALEFACCION (it's only a small tweak if you do want that)
Breaking up that final string for clarity (split across lines like this, it would need the x modifier to run):
/\b
(
    (ALBERTO|A\.?)[\s\xc2-]+
    (
        (JORGE|J\.?)[\s\xc2,]+
    )?
)?
(ALONSO)[\s\xc2*-]+
(CALEFACCION)
\b/i
Finally, it's an odd thought, but you could change the names which can be initials to be in this form: (A(LBERTO|\.|)), which means you're not repeating the initials (a potential source of mistakes)
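A quick way to check the final single-line pattern against a few spellings. This is a sketch: the sample strings and the $found array are made up for illustration.

```php
<?php
$pattern = '/\b((ALBERTO|A\.?)[\s\xc2-]+((JORGE|J\.?)[\s\xc2,]+)?)?(ALONSO)[\s\xc2*-]+(CALEFACCION)\b/i';

// Hypothetical sample spellings; the last one should not match at all.
$samples = [
    'ALBERTO JORGE ALONSO CALEFACCION',
    'A. J. ALONSO CALEFACCION',
    'A ALONSO * CALEFACCION',
    'ALONSO CALEFACCION',
    'ALBERTO JORGE SMITH',
];

$found = [];
foreach ($samples as $text) {
    // Store the matched substring, or null when the name is not found.
    $found[$text] = preg_match($pattern, $text, $m) ? $m[0] : null;
}

print_r($found);
```

Because the pattern is not anchored, a text such as J. ALONSO CALEFACCION still produces a match on the ALONSO CALEFACCION part alone, so inspect $m[0] if the distinction matters.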

Regex problem: Can't match a variable length pattern

I have a problem with regex, using preg_match_all(), to match something of a variable length.
What I am trying to match is the traffic condition after the word 'Congestion'. What I came up with is this regex pattern:
Congestion\s*:\s*(?P<congestion>.*)
It would however, extract the first instance all the way to the end of the entire subject, since .* would match everything. But that's not what I want though, I would like it to match separately as 3 instances.
Now since the words behind Congestion could be of variable length, I can't really predict how many words and spaces are in between to come up with a stricter \w*\s*\w* match etc.
Any clues on how I can proceed from here?
Highway : Highway 26
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow from Smith St to Alice Springs St
Highway : Princes Highway
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection
Highway : Eastern Freeway
Datetime : 18-Oct-2010 05:19 PM
Congestion : Traffic is slow from Prince St to Queen St
EDIT FOR CLARITY
These very nicely formatted texts here, are actually received via a very poorly formatted html email. It contains random line breaks here and there eg "Congestion : Traffic\n is slow from Prince\nSt to Queen St".
So while processing the emails, I stripped off all the html codes and the random line breaks, and json_encode() them into one very long single-line string with no line break...
By default, PCRE treats the subject string as a single line: ^ and $ match only at the very start and end of the string. You can use the “m” (PCRE_MULTILINE) flag to change that behaviour, so that they match at the start and end of every line. Because . does not match a newline by default, .* then stops at the end of each line:
preg_match_all('/^Congestion\s*:\s*(?P<congestion>.*)$/m', $subject, $matches);
There are two things to notice: first, the pattern was modified to include line-begin (^) and line-end ($) markers. Secondly, the pattern now carries the m modifier. Note that this only helps while the subject still contains its line breaks.
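A minimal sketch of the multiline approach, assuming the subject still has one field per line (which, per the question's edit, is not the case once the line breaks have been stripped). The shortened $subject is illustrative.

```php
<?php
// Hypothetical subject that still has one field per line.
$subject = "Highway : Highway 26\n"
         . "Datetime : 18-Oct-2010 05:18 PM\n"
         . "Congestion : Traffic is slow from Smith St to Alice Springs St\n"
         . "Highway : Eastern Freeway\n"
         . "Datetime : 18-Oct-2010 05:19 PM\n"
         . "Congestion : Traffic is slow from Prince St to Queen St";

// With the m modifier, ^ and $ match at every line start and end.
preg_match_all('/^Congestion\s*:\s*(?P<congestion>.*)$/m', $subject, $matches);

print_r($matches['congestion']);
```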
You can try a minimal match:
Congestion\s*:\s*(?P<congestion>.*?)
This would result in returning zero characters in the named group 'congestion' unless you could match something immediately after the congestion string.
So, this could be fixed if "Highway" always starts the traffic condition records:
Congestion\s*:\s*(?P<congestion>.*?)Highway\s*:
If this works (I have not checked it), then the first records are matched but the last record is not! This could be easily fixed by appending the text 'Highway :' at the end of the input string.
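As a variation on the same idea (my assumption, not part of the answer above), a lookahead can accept either the next 'Highway :' or the end of the string, so nothing has to be appended to the input. A sketch on the single-line form of the data:

```php
<?php
// The question's data flattened into one line, as described in its edit.
$subject = 'Highway : Highway 26 Datetime : 18-Oct-2010 05:18 PM '
         . 'Congestion : Traffic is slow from Smith St to Alice Springs St '
         . 'Highway : Princes Highway Datetime : 18-Oct-2010 05:18 PM '
         . 'Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection '
         . 'Highway : Eastern Freeway Datetime : 18-Oct-2010 05:19 PM '
         . 'Congestion : Traffic is slow from Prince St to Queen St';

// Lazy .*? stops as soon as the lookahead sees the next record or the end.
$pattern = '/Congestion\s*:\s*(?P<congestion>.*?)(?=\s*Highway\s*:|$)/s';

preg_match_all($pattern, $subject, $matches);

print_r($matches['congestion']);
```

This relies on 'Highway :' never appearing inside a congestion description.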
Congestion\s*:\s*Traffic is\s*(?P<c1>[^\n]*)\s*from\s*(?P<c2>[^\n]*)\s*to\s*(?P<c3>[^\n]*)$
