Separate (ir)regular expression - php

I have a string:
A12B34C10G34LongerLongerEven LongerA57
Is there any way to separate the above using regular expressions to the form of:
A12,B34,C10,G34,Longer,Longer,Even Longer,A57
So, separated by commas. I would be grateful for any help. Thanks.

This gives what you need:
<?php
$str = "A12B34C10G34LongerLongerEven LongerA57";
echo preg_replace('/([^\s])([A-Z])/', '\1,\2', $str), "\n";
// OUTPUT: A12,B34,C10,G34,Longer,Longer,Even Longer,A57

preg_replace ('/\B([A-Z])/',',$1',$string);
Inserts a comma before any capital letter that is not on a word boundary.
My assumption is that the input data can consist of capital letters followed by numbers and capitalized words that may or may not be separated by spaces.
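To see how the \B rule compares with the non-whitespace lookbehind used in other answers, here is a small Python sketch. The two helper names are made up for illustration. Both give the same result on the question's input, but they differ when a capital letter follows punctuation such as #.

```python
import re

def comma_before_nonboundary_caps(text):
    # \B before a capital letter means the previous character is a word character
    return re.sub(r'\B([A-Z])', r',\1', text)

def comma_after_nonspace(text):
    # Lookbehind variant: any non-whitespace character may precede the capital
    return re.sub(r'(?<=\S)([A-Z])', r',\1', text)

sample = 'A12B34C10G34LongerLongerEven LongerA57'
print(comma_before_nonboundary_caps(sample))
print(comma_after_nonspace(sample))

# The two rules disagree when punctuation precedes the capital letter
print(comma_before_nonboundary_caps('A12C10#G34'))
print(comma_after_nonspace('A12C10#G34'))
```

Which variant is right depends on whether characters like # can occur in the real data.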

import re
ss = ' \tA12B34C10#G34LongerVery LongerEven LongerA57 \n'
# Python 3: print is a function, and the patterns are raw strings
print('%r\n%r\n\n%r' %
(
    #good 1
    re.sub(r'(?<=\S)(?=[A-Z])', ',', ss),
    #good 2
    ','.join(
        re.findall(r'(\s*[A-Z].+?\s*)(?=(?<=\S)[A-Z]|\s*\Z)', ss)
    ),
    #bad (written at first)
    ','.join(
        re.findall(r'(?<!\s)([A-Z].+?)(?<!\s)(?![^A-Z])', ss)
    )
))
result
' \tA12,B34,C10#,G34,Longer,Very Longer,Even Longer,A57 \n'
' \tA12,B34,C10#,G34,Longer,Very Longer,Even Longer,A57 \n'
'B34,C10#,G34,Longer,Very Longer,Even Longer'
The first solution is as close as possible to the original idea of inserting commas.
(?<=\S) is mandatory in this solution because each comma must be inserted between two characters (correction from DJV).
(?<!\s) would also match at the beginning of the string, so a comma would be prepended at the very first position.
In a first draft, I had written the second solution as
# bad
','.join(re.findall( '(?<!\s)([A-Z].+?)(?<!\s)(?![^A-Z])', ss) )
or
# bad
','.join(re.findall( '(?<!\s)([A-Z].+?)(?<!\s)(?=[A-Z]|\Z)', ss) )
where (?![^A-Z]) or (?=[A-Z]|\Z) were meant to treat the end of the string as a possible end of a matching portion.
Then I realized that there are problems if whitespace is present at the beginning or the end of the string. The code above shows which ones: the leading A12 and the trailing A57 are lost.
To prevent these problems, use good solution number 2. But it is more complicated and harder to arrive at, so good solution number 1 is clearly my preferred one.

Try this :
$in = 'A12B34C10G34LongerLongerEven LongerA57';
$output = trim(preg_replace('/([^\s])([A-Z])/', "$1,$2", $in),",");
echo $output;
output : A12,B34,C10,G34,Longer,Longer,Even Longer,A57

Assuming you want to add a ',' in front of each uppercase character that is not preceded by a space, here is a simple Python regex + sub way of doing it.
import re
string = 'A12B34C10G34LongerLongerEven LongerA57'
re.sub(r'(?<=[^ ])([A-Z])', lambda x: ',' + x.group(0), string)
outputs:
'A12,B34,C10,G34,Longer,Longer,Even Longer,A57'
The regex uses a lookbehind to check for a non-space character, and the match itself is an uppercase letter. A ',' is then prepended to this uppercase letter.

You could use this, assuming you won't get a comma anywhere in $in:
explode(",", preg_replace('/([^\s])([A-Z]+)/', "$1,$2", $in));
I don't really know Python, but the base regex is the same.

Related

PHP: Split a string at the first period that isn't the decimal point in a price or the last character of the string

I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
The first alternative is your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?, which is matched and then discarded with (*SKIP)(*F). Otherwise, the second alternative, \.(?!\s*$), matches a non-final . (one that is not at the end of the string, even if trailing whitespace follows).
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
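Python's built-in re module has no (*SKIP)(*F) verbs, so the following is only a rough sketch of the same "match the price first, otherwise take the dot" idea using a capturing group. The function name split_at_first_dot is made up for illustration, and the price pattern is copied from the question.

```python
import re

# The price alternative is tried first, so the decimal point of a price is
# consumed as part of the price. Only a dot matched by the second alternative
# (not followed by end of string, ignoring trailing whitespace) is captured.
PRICE_OR_DOT = re.compile(
    r'£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|(\.)(?!\s*$)'
)

def split_at_first_dot(text):
    """Return (head, tail) around the first qualifying dot, or None."""
    for m in PRICE_OR_DOT.finditer(text):
        if m.group(1) is not None:  # a real dot, not part of a price
            return text[:m.start(1)], text[m.end(1):]
    return None

example = ('There is a price in this string £19.99, but it should only '
           'split at this point. As I want it to ignore periods in a price')
print(split_at_first_dot(example))
```

Returning None for the "no split" case mirrors the n/a output of Example 1 in the question.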
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the indexes of the different sentences in the string.
It takes ?, ! and ... into account too. Besides the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?

Using regex to extract first half of string

I have variable strings like the below:
The.Test.String.A01Y18.123h.WIB-DI.DO5.1.K.314-ECO
The.Regex.F05P78.123h.WIB-DI.DO5.1.K.314-EYT
Word.C05F78.342T.DSW-RF.EF5.2.F.342-DDF
I would like to extract this part of these strings dynamically in PHP. I was looking at using regex but haven't had much success:
The.Test.String.A01Y18
The.Regex.F05P78
Word.C05F78
And ultimately to:
The Test String A01Y18
The Regex F05P78
Word C05F78
The first part of the text will be variable in length and will separate each word with a period. The next part will always be the same length with the pattern:
One letter, 2 numbers, one letter, 2 numbers (e.g. C05F78)
Anything in the string after that is what I would like to remove.
that's it
$x=array(
"The.Test.String.A01Y18.123h.WIB-DI.DO5.1.K.314-ECO",
"The.Regex.F05P78.123h.WIB-DI.DO5.1.K.314-EYT",
"Word.C05F78.342T.DSW-RF.EF5.2.F.342-DDF"
);
for ($i=0, $tmp_count=count($x); $i<$tmp_count; ++$i) {
echo str_replace(".", " ", preg_replace("/^(.+?)([a-z]{1}[0-9]{2}[a-z]{1}[0-9]{2})\..+$/i", "\\1\\2", $x[$i]))."<br />";
}
Using this regular expression should work, replacing each of your strings with the first capturing group:
^((?:\w+\.)+\w\d{2}\w\d{2}).*
See demo at http://regex101.com/r/fR3pM6
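For comparison, here is a hedged Python sketch of the same extraction. It assumes the code is always letter, 2 digits, letter, 2 digits, as the question states, so it uses [A-Za-z] where the pattern above uses the broader \w. The name title_prefix is made up for illustration.

```python
import re

# One or more dot-terminated words, then letter + 2 digits + letter + 2 digits.
PREFIX = re.compile(r'^((?:\w+\.)+[A-Za-z]\d{2}[A-Za-z]\d{2})')

def title_prefix(name):
    """Return the leading words and code with dots turned into spaces, or None."""
    m = PREFIX.match(name)
    return m.group(1).replace('.', ' ') if m else None

for name in [
    'The.Test.String.A01Y18.123h.WIB-DI.DO5.1.K.314-ECO',
    'The.Regex.F05P78.123h.WIB-DI.DO5.1.K.314-EYT',
    'Word.C05F78.342T.DSW-RF.EF5.2.F.342-DDF',
]:
    print(title_prefix(name))
```

Note that this sketch requires at least one word before the code, so a string that starts directly with the code yields None.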
This is valid too:
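preg_match('/.*[\w\d]{6}/', $stringVariable, $match)
.* greedily consumes the string and then backtracks to the last run of six consecutive word characters ([\w\d]{6}, where \d is redundant because \w already includes digits). This works for the samples only because nothing after the code contains six word characters in a row.
Result:
Match 1: The.Test.String.A01Y18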
Match 2: The.Regex.F05P78
Match 3: Word.C05F78

preg_replace or regex string translation

I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match-any expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's an alternative, more complicated version that's more performant then I'll take that as well, as I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of up to 3 letters left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want a ? appended to every space between those words, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
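The two steps of the updated solution can be sketched in Python as follows. This is only an illustration of the same two replacements; the function name to_loose_pattern is made up, and \w is kept as the stand-in for "word character" as in the answer.

```python
import re

def to_loose_pattern(phrase):
    # Step 1: a long word, any short words after it, and the space before the
    # next long word collapse into "<long word>(.*)"
    collapsed = re.sub(r'(\w{4,})(?: \w{1,3})* (?=\w{4,})', r'\1(.*)', phrase)
    # Step 2: every space that is left becomes an optional space
    return collapsed.replace(' ', ' ?')

for phrase in [
    'walk to the beach',
    'on the beach',
    'let us walk on the beach now go',
    'let us walk on the beach there now go',
]:
    print(phrase, '=>', to_loose_pattern(phrase))
```

Step 2 uses a plain string replacement because no regex is needed once only single spaces remain.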

Regular expression to match an exact number of occurrence for a certain character

I'm trying to check if a string has a certain number of occurrence of a character.
Example:
$string = '123~456~789~000';
I want to verify if this string has exactly 3 instances of the character ~.
Is that possible using regular expressions?
Yes
/^[^~]*~[^~]*~[^~]*~[^~]*$/
Explanation:
^ ... $ means the whole string in many regex dialects
[^~]* a string of zero or more non-tilde characters
~ a tilde character
The string can have as many non-tilde characters as necessary, appearing anywhere in the string, but must have exactly three tildes, no more and no less.
As a single character is technically a substring, and the task is to count its occurrences, I suppose the most efficient approach is to use a dedicated PHP function, substr_count:
$string = '123~456~789~000';
if (substr_count($string, '~') === 3) {
// string is valid
}
Obviously, this approach won't work if you need to count the number of pattern matches (for example, while you can count the number of '0' characters in your string with substr_count, you had better use preg_match_all to count digits).
Yet for this specific question it should be faster overall, as substr_count is optimized for one specific goal, counting substrings, whereas preg_match_all is more general-purpose.
I believe this should work for a variable number of characters:
^(?:[^~]*~[^~]*){3}$
The advantage here is that you just replace 3 with however many you want to check.
To make it more efficient, it can be written as
^[^~]*(?:~[^~]*){3}$
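As a rough Python sketch of the parameterised form, the pattern can be built for any single character and count. The helper name has_exactly is made up for illustration, and the character is passed through re.escape so that regex metacharacters are treated literally.

```python
import re

def has_exactly(text, char, count):
    """True if the single character `char` occurs exactly `count` times in `text`."""
    c = re.escape(char)
    # Same shape as ^[^~]*(?:~[^~]*){3}$, with the character and count filled in
    pattern = rf'[^{c}]*(?:{c}[^{c}]*){{{count}}}'
    return re.fullmatch(pattern, text) is not None

print(has_exactly('123~456~789~000', '~', 3))
print(has_exactly('123~456', '~', 3))
# A plain count gives the same answer for a literal character
print('123~456~789~000'.count('~') == 3)
```

For a literal character the plain count is simpler; the regex form mainly pays off when the check is part of a larger pattern.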
This is what you are looking for:
EDIT based on comment below:
<?php
$string = '123~456~789~000';
$total = preg_match_all('/~/', $string);
echo $total; // Shows 3

PHP preg_match with regex: only single hyphens and spaces between words continue

I was trying to write a regex that allows single hyphens and single spaces only within words, but not at the beginning or at the end of the words.
I thought I had this sorted out from the answer I got yesterday, but I just realised there is a small error which I don't quite understand.
Why won't it accept inputs like these,
'forum-category-b forum-category-a'
'forum-category-b Counter-terrorism'
'forum-category-a Preventing'
'forum-category-a Preventing Violent'
'forum-category-a International-Research-and-Publications'
'International-Research-and-Publications forum-category-b forum-category-a'
but it takes,
'forum-category-b'
'Counter-terrorism forum-category-a'
'Preventing forum-category-a'
'Preventing Violent forum-category-a'
'International-Research-and-Publications forum-category-b'
Why is that? How can I fix it? Below is the regex with the initial test, but ideally it should accept all the input combinations above,
$aWords = array(
'a',
'---stack---over---flow---',
' stack over flow',
'stack-over-flow',
'stack over flow',
'stacoverflow'
);
foreach($aWords as $sWord) {
if (preg_match('/^(\w+([\s-]\w+)?)+$/', $sWord)) {
echo 'pass: ' . $sWord . "\n";
} else {
echo 'fail: ' . $sWord . "\n";
}
}
and it should reject inputs like these:
---stack---over---flow---
stack-over-flow- stack-over-flow2
stack over flow
Thanks.
Your pattern does not do what you want. Let's break it apart:
^(\w+([\s-]\w+)?)+$
It matches strings that consist solely of one or more sequences of the pattern:
\w+([\s-]\w+)?
...which is a sequence of word characters, followed optionally by one other sequence of word characters, separated by one space or dash character.
In other words, your pattern searches for strings like:
xxx-xxxyyy-yyyzzz zzz
...but you intended to write a pattern that would find:
xxx-xxxxxx-xxxxxx yyy
In your examples, this one is matched:
Counter-terrorism forum-category-a
...but it is interpreted as the following sequence:
(Counter(-terroris)) (m( foru)) (m(-categor)) (y(-a))
As you can see, the pattern did not really find the words you are looking for.
This example is not matched:
forum-category-a Preventing Violent
...since the pattern cannot form groups of "word characters, space-or-dash, word-characters" when it encounters a single word character followed by space or dash:
(forum(-categor)) (y(-a)) <Mismatch: Found " " but expected "\w">
If you would add another character to "forum-category-a", say "forum-category-ax", it would match again, since it could split at the "ax":
(forum(-categor)) (y(-a)) (x( Preventin)) (g( Violent))
What you are actually interested in is a pattern like
^(\w+(-\w+)*)(\s\w+(-\w+)*)*$
...which would find a sequence of words that may contain dashes, separated by spaces:
(forum(-category)(-a)) ( Preventing) ( Violent)
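A quick way to check the corrected pattern is to run it over a few inputs. The sketch below does this in Python with non-capturing groups and re.fullmatch in place of the ^ and $ anchors. The name is_valid and the choice of sample strings are mine.

```python
import re

# Words that may contain single inner dashes, separated by single whitespace.
WORDS = re.compile(r'\w+(?:-\w+)*(?:\s\w+(?:-\w+)*)*')

def is_valid(text):
    return WORDS.fullmatch(text) is not None

for sample in [
    'forum-category-a Preventing Violent',
    'International-Research-and-Publications forum-category-b forum-category-a',
    '---stack---over---flow---',
    'stack-over-flow- stack-over-flow2',
    'stack  over  flow',  # two spaces between words
]:
    print('pass' if is_valid(sample) else 'fail', repr(sample))
```

Because each separator is a literal dash or a single whitespace character between runs of \w, the pattern has little room for ambiguous backtracking.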
By the way, I tested this using a Python script, and while trying to match your pattern against the example string "International-Research-and-Publications forum-category-b forum-category-a", the regular expression engine seemed to run into an infinite loop (most likely catastrophic backtracking caused by the nested quantifiers)...
import re
expr = re.compile(r'^(\w+([\s-]\w+)?)+$')
expr.match('International-Research-and-Publications forum-category-b forum-category-a')
The part of your pattern ([\s-]\w+)? is the issue. It only allows for one repetition (because of the trailing ?). Try changing the last ? to * and see if that helps.
Nope, I still believe that's the problem. The original pattern is looking for "word" or "word[space_hyphen]word" repeated 1+ times. Which is weird because the pattern should fall within another match. But switching the question mark worked for me.
There should be only one answer to this problem:
/^((?<=\w)[ -]\w|[^ -])+$/
There is only one rule as stated, \w[ -]\w, and that's it. It applies at per-character granularity and cannot be anything else. Add [^ -] for the rest.
