How would I go about splitting the word:
oneTwoThreeFour
into an array so that I can get:
one Two Three Four
with preg_match ?
I tried this, but it just gives the whole word:
$words = preg_match("/[a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/", $string, $matches);
You can use preg_split as:
$arr = preg_split('/(?=[A-Z])/',$str);
I'm basically splitting the input string just before each uppercase letter. The regex used, (?=[A-Z]), is a lookahead that matches the position just before an uppercase letter.
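A minimal sketch of that lookahead split. The helper name splitBeforeUppercase is hypothetical, and PREG_SPLIT_NO_EMPTY is added here as an assumption so that input starting with a capital letter does not produce an empty first element:

```php
<?php
// Hypothetical helper: split a camelCase word just before every uppercase letter.
function splitBeforeUppercase(string $str): array
{
    // (?=[A-Z]) is zero-width, so no characters are removed by the split.
    // PREG_SPLIT_NO_EMPTY drops any empty piece (e.g. when $str starts with a capital).
    return preg_split('/(?=[A-Z])/', $str, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitBeforeUppercase('oneTwoThreeFour'));
```

Because the match has zero width, every character of the input ends up in exactly one piece.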
You can also use preg_match_all as:
preg_match_all('/((?:^|[A-Z])[a-z]+)/',$str,$matches);
Explanation:
( - Start of capturing parenthesis.
(?: - Start of non-capturing parenthesis.
^ - Start anchor.
| - Alternation.
[A-Z] - Any one capital letter.
) - End of non-capturing parenthesis.
[a-z]+ - One or more lowercase letters.
) - End of capturing parenthesis.
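To see how the pieces of that pattern fit together, here is a small sketch run on the sample word from the question; the variable names are only illustrative:

```php
<?php
$str = 'oneTwoThreeFour';

// Each match is either the leading lowercase run (via ^) or one capital
// followed by lowercase letters.
$count = preg_match_all('/((?:^|[A-Z])[a-z]+)/', $str, $matches);

// $matches[0] holds the full matches; $matches[1] holds capture group 1,
// which is identical here because the group wraps the whole pattern.
print_r($matches[0]);
```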
I know that this is an old question with an accepted answer, but IMHO there is a better solution:
<?php // test.php Rev:20140412_0800
$ccWord = 'NewNASAModule';
$re = '/(?#! splitCamelCase Rev:20140412)
# Split camelCase "words". Two global alternatives. Either g1of2:
(?<=[a-z]) # Position is after a lowercase,
(?=[A-Z]) # and before an uppercase letter.
| (?<=[A-Z]) # Or g2of2; Position is after uppercase,
(?=[A-Z][a-z]) # and before upper-then-lower case.
/x';
$a = preg_split($re, $ccWord);
$count = count($a);
for ($i = 0; $i < $count; ++$i) {
printf("Word %d of %d = \"%s\"\n",
$i + 1, $count, $a[$i]);
}
?>
Note that this regex, (like codaddict's '/(?=[A-Z])/' solution - which works like a charm for well formed camelCase words), matches only a position within the string and consumes no text at all. This solution has the additional benefit that it also works correctly for not-so-well-formed pseudo-camelcase words such as: StartsWithCap and: hasConsecutiveCAPS.
Input:
oneTwoThreeFour
StartsWithCap
hasConsecutiveCAPS
NewNASAModule
Output:
Word 1 of 4 = "one"
Word 2 of 4 = "Two"
Word 3 of 4 = "Three"
Word 4 of 4 = "Four"
Word 1 of 3 = "Starts"
Word 2 of 3 = "With"
Word 3 of 3 = "Cap"
Word 1 of 3 = "has"
Word 2 of 3 = "Consecutive"
Word 3 of 3 = "CAPS"
Word 1 of 3 = "New"
Word 2 of 3 = "NASA"
Word 3 of 3 = "Module"
Edited: 2014-04-12: Modified regex, script and test data to correctly split: "NewNASAModule" case (in response to rr's comment).
While ridgerunner's answer works great, it seems not to work with all-caps substrings that appear in the middle of the string. I use the following, and it seems to deal with these just fine:
function splitCamelCase($input)
{
return preg_split(
'/(^[^A-Z]+|[A-Z][^A-Z]+)/',
$input,
-1, /* no limit for replacement count */
PREG_SPLIT_NO_EMPTY /*don't return empty elements*/
| PREG_SPLIT_DELIM_CAPTURE /*don't strip anything from output array*/
);
}
Some test cases:
assert(splitCamelCase('lowHigh') == ['low', 'High']);
assert(splitCamelCase('WarriorPrincess') == ['Warrior', 'Princess']);
assert(splitCamelCase('SupportSEELE') == ['Support', 'SEELE']);
assert(splitCamelCase('LaunchFLEIAModule') == ['Launch', 'FLEIA', 'Module']);
assert(splitCamelCase('anotherNASATrip') == ['another', 'NASA', 'Trip']);
A functionized version of @ridgerunner's answer.
/**
* Converts a camelCase string to have spaces between each word.
* @param string $camelCaseString
* @return string
*/
function fromCamelCase($camelCaseString) {
$re = '/(?<=[a-z])(?=[A-Z])/x';
$a = preg_split($re, $camelCaseString);
// implode() takes the separator first; the reversed order is no longer accepted in PHP 8
return implode(' ', $a);
}
$string = preg_replace( '/([a-z0-9])([A-Z])/', "$1 $2", $string );
The trick is a repeatable pattern: wherever a lowercase letter or digit ($1) is immediately followed by an uppercase letter ($2), the pair is replaced with "$1 $2", which puts a space between the two.
For example:
In helloWorld, $1 matches "o" and $2 matches "W", so you get "hello World". HelloWorld is handled the same way and again gives "Hello World". Then you can lowercase the result, uppercase the first word, explode it on the space, or use a _ or some other character to keep the words separate.
Short and simple.
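A short sketch of that replacement, followed by the explode step mentioned above. The helper name spaceOutCamelCase is hypothetical:

```php
<?php
// Hypothetical helper: insert a space wherever a lowercase letter or digit
// is directly followed by an uppercase letter.
function spaceOutCamelCase(string $string): string
{
    return preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', $string);
}

$spaced = spaceOutCamelCase('helloWorld');
$words  = explode(' ', $spaced); // split on the inserted spaces
```

Note that only a lowercase-to-uppercase boundary gets a space, so a run of capitals stays together as one word.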
When determining the best pattern for your project, you will need to consider the following pattern factors:
1. Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
2. Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
3. Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
4. Readability -- the pattern should be kept as simple as possible
The above factors also happen to be in the hierarchical order that I strive to obey. In other words, it doesn't make much sense to me to prioritize 2, 3, or 4 when 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.
Capture Groups and Lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability which can be associated with pattern brevity.
Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:
preg_split() patterns:
/^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
/(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
/[^A-Z]+\K(?=[A-Z])/ (43 steps)
/(?=[A-Z])/ (50 steps)
/(?=[A-Z]+)/ (50 steps)
/([a-z]{1})[A-Z]{1}/ (53 steps)
/([a-z0-9])([A-Z])/ (68 steps)
/(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)
preg_match_all() patterns:
/[A-Z]?[a-z]+/ (14 steps)
/((?:^|[A-Z])[a-z]+)/ (35 steps)
I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array, in other words, all of the fullstring matches will be in the [0] subarray; if there is a capture group used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.
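The difference in output shape can be sketched as follows, using two of the patterns listed above on the OP's sample input (variable names are illustrative):

```php
<?php
$str = 'oneTwoThreeFour';

// preg_match_all() fills a 2-dimensional array: index 0 holds the full matches.
preg_match_all('/[A-Z]?[a-z]+/', $str, $matches);

// preg_split() returns the pieces directly as a flat array.
$pieces = preg_split('/(?=[A-Z])/', $str);
```

With preg_match_all() the words are one level down, in $matches[0]; with preg_split() the return value is already the list of words.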
Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.
New Extended Battery of Test Strings:
oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain
Suitable preg_split() patterns:
/[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)
Suitable preg_match_all() pattern:
/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
Finally, my recommendations based on my pattern principles / factor hierarchy. Also, I recommend preg_split() over preg_match_all() (despite the preg_match_all() patterns having fewer steps) as a matter of directness to the desired output structure. (Of course, choose whatever you like.)
Code: (Demo)
$noAcronyms = 'oneTwoThreeFour';
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);
Code: (Demo)
$withAcronyms = 'newNASAModule';
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);
I took cool guy Ridgerunner's code (above) and made it into a function:
echo deliciousCamelcase('NewNASAModule');
function deliciousCamelcase($str)
{
$formattedStr = '';
$re = '/
(?<=[a-z])
(?=[A-Z])
| (?<=[A-Z])
(?=[A-Z][a-z])
/x';
$a = preg_split($re, $str);
$formattedStr = implode(' ', $a);
return $formattedStr;
}
This will return: New NASA Module
Another option is matching /[A-Z]?[a-z]+/ - if you know your input is in the right format, it should work nicely.
[A-Z]? would match an uppercase letter (or nothing). [a-z]+ would then match all following lowercase letters, until the next match.
Working example: https://regex101.com/r/kNZfEI/1
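A small sketch of why the "right format" condition matters. The helper name matchCamelWords is hypothetical; the second call shows that an all-caps run is silently skipped by this pattern:

```php
<?php
// Hypothetical helper wrapping the pattern from this answer.
function matchCamelWords(string $str): array
{
    preg_match_all('/[A-Z]?[a-z]+/', $str, $matches);
    return $matches[0];
}

print_r(matchCamelWords('oneTwoThreeFour')); // well-formed camelCase
print_r(matchCamelWords('newNASAModule'));   // "NASA" never matches and is dropped
```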
You can split on a "glide" from lowercase to uppercase thus:
$parts = preg_split('/([a-z])([A-Z])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
//PREG_SPLIT_DELIM_CAPTURE to also return bracketed things; both letters are captured so that the uppercase one is not lost
var_dump($parts);
Annoyingly you will then have to rebuild the words from each corresponding pair of items in $parts
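One possible way to do that rebuilding, as a sketch: it assumes both the lowercase and the uppercase letter are captured, so that neither is lost in the split. The function name splitOnGlide is hypothetical.

```php
<?php
// Hypothetical helper: split on each lowercase-to-uppercase "glide" and
// stitch the captured letters back onto the neighbouring pieces.
function splitOnGlide(string $input): array
{
    // Result layout: [piece, lower, Upper, piece, lower, Upper, piece, ...]
    $parts = preg_split('/([a-z])([A-Z])/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $words   = [];
    $current = $parts[0];
    for ($i = 1; $i + 2 < count($parts); $i += 3) {
        $words[] = $current . $parts[$i];            // lowercase letter ends the current word
        $current = $parts[$i + 1] . $parts[$i + 2];  // uppercase letter starts the next word
    }
    $words[] = $current;

    return $words;
}

print_r(splitOnGlide('oneTwoThreeFour'));
```

A zero-width lookaround split avoids this stitching step entirely, which is why the other answers tend to prefer it.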
Hope this helps
First of all codaddict thank you for your pattern, it helped a lot!
I needed a solution that also works when a one-letter word such as 'a' is present:
e.g. thisIsACamelcaseSentence.
I found a solution using a two-step preg_match_all and made a function with some options:
/*
* input: 'thisIsACamelCaseSentence' output: 'This Is A Camel Case Sentence'
* options $case: 'allUpperCase'[default] >> 'This Is A Camel Case Sentence'
* 'allLowerCase' >> 'this is a camel case sentence'
* 'firstUpperCase' >> 'This is a camel case sentence'
* @return string
*/
function camelCaseToWords($string, $case = null){
$case = $case ?? 'allUpperCase';
// Find first occurrences of two capitals
preg_match_all('/((?:^|[A-Z])[A-Z]{1})/',$string, $twoCapitals);
// Split them with the 'zzzzzz' string. e.g. 'AZ' turns into 'AzzzzzzZ'
foreach($twoCapitals[0] as $match){
$firstCapital = $match[0];
$lastCapital = $match[1];
$temp = $firstCapital.'zzzzzz'.$lastCapital;
$string = str_replace($match, $temp, $string);
}
// Now split words
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $string, $words);
$output = "";
$i = 0;
foreach($words[0] as $word){
switch($case){
case 'allUpperCase':
$word = ucfirst($word);
break;
case 'allLowerCase':
$word = strtolower($word);
break;
case 'firstUpperCase':
($i == 0) ? $word = ucfirst($word) : $word = strtolower($word);
break;
}
// remove the 'zzzzzz' from a word if it contains it
$word = str_replace('zzzzzz','', $word);
$output .= $word." ";
$i++;
}
// drop the trailing space added after the last word
return rtrim($output);
}
Feel free to use it, and in case there is an 'easier' way to do this in one step please comment!
Full function based on @codaddict's answer:
function splitCamelCase($str) {
$splitCamelArray = preg_split('/(?=[A-Z])/', $str);
return ucwords(implode(' ', $splitCamelArray));
}
Related
I have some string, for example:
cats, e.g. Barsik, are funny. And it is true. So,
And I want to get as result:
cats, e.g. Barsik, are funny.
My try:
mb_ereg_search_init($text, '((?!e\.g\.).)*\.[^\.]');
$match = mb_ereg_search_pos();
But it gets position of second dot (after word "true").
How to get desired result?
Since a naive approach works for you, I am posting an answer. However, please note that detecting a sentence end is a very difficult task for a regex, and although it is possible to some degree, an NLP package should be used for that.
Having said that, I suggested using
'~(?<!\be\.g)\.(?=\s+\p{Lu})~ui'
The regex matches any dot (\.) that is not preceded with a whole word e.g (see the negative lookbehind (?<!\be\.g)), but that is followed with 1 or more whitespaces (\s+) followed with 1 uppercase Unicode letter \p{Lu}.
See the regex demo
The case insensitive i modifier does not impact what \p{Lu} matches.
The ~u modifier is required since you are working with Unicode texts (like Russian).
To get the index of the first occurrence, use the preg_match function with the PREG_OFFSET_CAPTURE flag. Here is a slightly simplified version of the regex you supplied in the comments:
preg_match('~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu', $text, $match, PREG_OFFSET_CAPTURE);
Note that the negative lookbehinds are checked one after another at the same location in the string, so you do not have to group them inside another lookaround. See the regex demo.
IDEONE demo:
$re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
$str = "cats, e.g. Barsik, are funny. And it is true. So,";
preg_match($re, $str, $match, PREG_OFFSET_CAPTURE);
echo $match[0][1];
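Building on the offset returned above, a sketch of how the first sentence itself could be cut out. The variable $firstSentence is illustrative, and the i modifier is left out here on the assumption that only uppercase sentence starts should count:

```php
<?php
$str = "cats, e.g. Barsik, are funny. And it is true. So,";

// A dot that does not end "e.g" and is followed by whitespace plus an uppercase letter.
$re = '~(?<!\be\.g)\.(?=\s+\p{Lu})~u';

$firstSentence = null;
if (preg_match($re, $str, $match, PREG_OFFSET_CAPTURE)) {
    // $match[0][1] is the byte offset of the matched dot; +1 keeps the dot itself.
    $firstSentence = substr($str, 0, $match[0][1] + 1);
}

echo $firstSentence;
```

The offset is counted in bytes, so for multibyte text substr() is the matching choice here rather than mb_substr().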
Here are two approaches to get substring from start to second last . position of the initial string:
using strrpos and substr functions:
$str = 'cats, e.g. Barsik, and e.g. Lusya are funny. And it is true. So,';
$len = strlen($str);
$str = substr($str, 0, (strrpos($str, '.', strrpos($str, '.') - $len - 1) - $len) + 1);
print_r($str); // "cats, e.g. Barsik, and e.g. Lusya are funny."
using array_reverse, str_split and array_search functions:
$str = 'cats, e.g. Barsik, and e.g. Lusya are funny. And it is true. So,';
$parts = array_reverse(str_split($str));
$pos = array_search('.', $parts) + 1;
$str = implode("", array_reverse(array_slice($parts, array_search('.', array_slice($parts, $pos)) + $pos)));
print_r($str); // "cats, e.g. Barsik, and e.g. Lusya are funny."
I'm trying to create a regex, without success.
This is what I get as the input string:
String A: "##(ABC 50a- {+} UDF 69,22g,-) {*} 3##"
String B: "##ABC 0,10,- DEF {/} 9 ABC {*} UHG 3-##"
And this is what I need the regex to produce:
Result A: "(50+69,22)*3"
Result B: "0,10/9*3"
I just can't get the number extraction combined with the operation symbols.
This is what i got:
'/[^0-9\+\-\*\/\(\)\.]/'
Thankful for every help.
One simple solution consists of getting rid of everything you don't want.
So replace this:
\{(.+?)\}|[^0-9,{}()]+|(?<!\d),|,(?!\d)
With $1.
Simple enough:
$input = "(ABC 50a- {+} UDF 69,22g,-) {*} 3";
$output = preg_replace('#\{(.+?)\}|[^0-9,{}()]+|(?<!\d),|,(?!\d)#', '$1', $input);
\{(.+?)\} part matches everything inside {...} and outputs it (it gets replaced by $1)
[^0-9,{}()]+ gets rid of every character not belonging to the ones we're trying to keep (it's replaced with an empty string)
(?<!\d),|,(?!\d) throws out commas which are not part of a number
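The three parts can be checked against both sample strings from the question with a sketch like this one. The helper name extractExpression is hypothetical, and ~ is used as the pattern delimiter only as a matter of taste:

```php
<?php
// Hypothetical helper applying the replacement described above.
function extractExpression(string $input): string
{
    $pattern = '~\{(.+?)\}|[^0-9,{}()]+|(?<!\d),|,(?!\d)~';
    // $1 is empty for every alternative except the first, so those matches are deleted.
    return preg_replace($pattern, '$1', $input);
}

echo extractExpression('##(ABC 50a- {+} UDF 69,22g,-) {*} 3##'), "\n";
echo extractExpression('##ABC 0,10,- DEF {/} 9 ABC {*} UHG 3-##'), "\n";
```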
Unfortunately, I can't say much else without a better spec.
A good start would be to write down in words the patterns that you want to match. For instance, you've said that you know the operations are inside {}, but that doesn't appear anywhere in your first attempt at a regex.
You can also break it down into separate sections, and then build it up later. So for instance you might say:
if you see parentheses, keep them in the final answer
a number is made up either of digits...
...or digits followed by a comma and more digits
an operation is always in curly braces, and is either +, -, *, or /
everything else should be thrown away
Given the above list:
matching parentheses is easy: [()]
matching a digit can be done with [0-9] or \d; at least one is +; so "digits" is \d+
comma digits is easy: ,\d+; make it optional with ? and you get \d+(,\d+)?
any of four operations is just [+*/-]; escape the / and - to get [+*\/\-]
don't forget that { and } have special meanings in regexes, so need to be escaped as \{ and \}; our list of operations in braces becomes: \{[+*\/\-]\}
Now we have to put it together; one way would be to use preg_match_all to find all occurrences of any of those patterns, in order, and then we can stick them back together. So our regex is just "this or this or this or this":
/[()]|\d+(,\d+)?|\{[+*\/\-]\}/
I haven't tested this, but given the explanation of how I arrived at it, hopefully you can figure out how to test parts of it and tweak it if necessary.
I'm not good at regex, but I found another approach:
Do an EXTRA check of the input before running eval!
$string = "(ABC 50a- {+} UDF 69,22g) {*} 3";
$new ='';
$string = str_split($string);
foreach($string as $char) {
if(!ctype_alnum($char) || ctype_digit($char) ){
//keep digits and non-alphanumeric characters such as {, ( etc.; letters are dropped
$new .=$char;
}
}
//echo $new; will output -> ( 50- {+} 69,22) {*} 3
//remove the brackets although you could put it in the if statement ...
$new = str_replace(array('{','}'),array('',''), $new);
//floating point numbers use dot not comma
$new = str_replace(',','.', $new);
$p = eval('return '.$new.';');
print $p; // -57.66
Used: ctype_digit, ctype_alnum, eval, str_split, str_replace
P.S: I assumed that the minus before the base operation is taken into account.
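As for the extra check recommended before eval, one possible sketch is a character whitelist. The function name looksLikeArithmetic is hypothetical, and this is only an assumption about what "extra check" could mean: it rejects letters, quotes, $ and ;, but it does not guarantee that the expression is well-formed, so eval can still throw a ParseError or DivisionByZeroError.

```php
<?php
// Hypothetical guard: accept only digits, arithmetic operators, parentheses,
// dots and whitespace. Anything else is rejected before it reaches eval.
function looksLikeArithmetic(string $expr): bool
{
    return preg_match('~^[0-9+\-*/().\s]+$~', $expr) === 1;
}

$expr   = '( 50- +  69.22) * 3';
$result = looksLikeArithmetic($expr) ? eval('return ' . $expr . ';') : null;
```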
Just a quick try before leaving the office ;-)
$data = array(
"(ABC 50a- {+} UDF 69,22g) {*} 3",
"ABC 0,10- DEF {/} 9 ABC {*} UHG 3-"
);
foreach($data as $d) {
echo $d . " = " . extractFormula($d) . "\n";
}
function extractFormula($string) {
$regex = '/([()])|([0-9]+(,[0-9]+)?)|\{([+\*\/-])\}/';
preg_match_all($regex, $string, $matches);
$formula = implode(' ', $matches[0]);
$formula = str_replace(array('{', '}'), '', $formula);
return $formula;
}
Output:
(ABC 50a- {+} UDF 69,22g) {*} 3 = ( 50 + 69,22 ) * 3
ABC 0,10- DEF {/} 9 ABC {*} UHG 3- = 0,10 / 9 * 3
If someone would like to fiddle around with the code, here is a live example: http://sandbox.onlinephpfunctions.com/code/373d76a9c0948314c1d164a555bed847f1a1ed0d
For example, I have an article that should be split at sentence boundaries such as ".", "?", "!" and ":".
But as we all know, both preg_split and explode remove the delimiter.
Any help would be really appreciated!
EDIT:
I can only come up with the code below, it works great though.
$content=preg_replace('/([\.\?\!\:])/',"\\1[D]",$content);
Thank you!!! Everyone. It is only five minutes for getting 3 answers! And I must apologize for not being able to see the PHP manual carefully before asking question. Sorry.
I feel this is worth adding. You can keep the delimiter in the "after" string by using regex lookahead to split:
$input = "The address is http://stackoverflow.com/";
$parts = preg_split('#(?=http://)#', $input);
// $parts[1] is "http://stackoverflow.com/"
And if the delimiter is of fixed length, you can keep the delimiter in the "before" part by using lookbehind:
$input = "The address is http://stackoverflow.com/";
$parts = preg_split('#(?<=http://)#', $input);
// $parts[0] is "The address is http://"
This solution is simpler and cleaner in most cases.
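Applied to the sentence-boundary case from the question, the lookbehind variant might look like this sketch (the sample text and variable names are made up for illustration):

```php
<?php
$text = 'One. Two? Three';

// (?<=[.?!:]) matches the empty position right after a delimiter,
// so each delimiter stays at the end of the piece before the split.
$sentences = preg_split('/(?<=[.?!:])/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

The whitespace after each delimiter stays at the start of the next piece, so a trim() per element may be wanted afterwards.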
You can set the flag PREG_SPLIT_DELIM_CAPTURE when using preg_split and capture the delimiters too. Then you can take each pair of 2n and 2n+1 and put them back together:
$parts = preg_split('/([.?!:])/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
$sentences = [];
for ($i = 0, $n = count($parts) - 1; $i <= $n; $i += 2) {
$sentences[] = $parts[$i] . ($parts[$i+1] ?? '');
}
Note that the delimiter pattern must be wrapped in a capturing group; otherwise the delimiters won't be captured.
preg_split with PREG_SPLIT_DELIM_CAPTURE flag
For example
$parts = preg_split("/([\.\?\!\:])/", $string, -1, PREG_SPLIT_DELIM_CAPTURE);
Try T-Regx
<?php
$parts = pattern('([.?!:])')->split($string);
Parsing English sentences has a lot of nuance and fringe cases. This makes crafting a perfect parser very difficult to do. It is important to have sufficient test cases using your real project data to make sure that you are covering all scenarios.
There is no need to use lookarounds or capture groups for this task. You simply match the punctuation symbol(s), then forget them with \K, then match one or more whitespace characters that occurs between sentences. Using the PREG_SPLIT_NO_EMPTY flag prevents creating empty elements if your string starts with or ends with characters that satisfy the pattern.
Code: (Demo)
$str = 'Heading: This is a string. Very exciting! What do you think? ...one more thing, this is cool.';
var_export(
preg_split('~[.?!:]+\K\s+~', $str, 0, PREG_SPLIT_NO_EMPTY)
);
Output:
array (
0 => 'Heading:',
1 => 'This is a string.',
2 => 'Very exciting!',
3 => 'What do you think?',
4 => '...one more thing, this is cool.',
)
Here is my concern,
I have a string and I need to extract characters two by two.
$str = "abcdef" should return array('ab', 'bc', 'cd', 'de', 'ef'). I want to use preg_match_all instead of loops. Here is the pattern I am using.
$str = "abcdef";
preg_match_all('/[\w]{2}/', $str);
The thing is, it returns Array('ab', 'cd', 'ef'). It misses 'bc' and 'de'.
I have the same problem if I want to extract a certain number of words
$str = "ab cd ef gh ij";
preg_match_all('/([\w]+ ){2}/', $str); // returns array('ab cd', 'ef gh'), I'm also missing the last part
What am I missing? Or is it simply not possible to do so with preg_match_all?
For the first problem, what you want to do is match overlapping string, and this requires zero-width (not consuming text) look-around to grab the character:
/(?=(\w{2}))/
The regex above will capture the match in the first capturing group.
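A minimal sketch of reading the overlapping pairs out of that capturing group (variable names are illustrative):

```php
<?php
$str = 'abcdef';

// The lookahead consumes nothing, so a match is attempted at every position;
// the pair itself is read from capture group 1.
preg_match_all('/(?=(\w{2}))/', $str, $matches);
$pairs = $matches[1];

print_r($pairs);
```

$matches[0] contains only empty strings here, since the overall match is zero-width.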
For the second problem, it seems that you also want overlapping string. Using the same trick:
/(?=(\b\w+ \w+\b))/
Note that \b is added to check the boundary of the word. Since the match does not consume text, the next match will be attempted at the next index (which is in the middle of the first word), instead of at the end of the 2nd word. We don't want to capture from middle of a word, so we need the boundary check.
Note that \b's definition is based on \w, so if you ever change the definition of a word, you need to emulate the word boundary with look-ahead and look-behind with the corresponding character set.
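The same idea for overlapping word pairs, as a sketch on the sample string from the question:

```php
<?php
$str = 'ab cd ef gh ij';

// \b keeps a match from starting or ending in the middle of a word.
preg_match_all('/(?=(\b\w+ \w+\b))/', $str, $matches);
$wordPairs = $matches[1];

print_r($wordPairs);
```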
In case you need a non-regex solution, try this:
<?php
$str = "abcdef";
$len = strlen($str);
$arr = array();
for($count = 0; $count < ($len - 1); $count++)
{
$arr[] = $str[$count].$str[$count+1];
}
print_r($arr);
?>
See Codepad.
I want to split a string on several chars (being +, ~, > and #), but I want those chars to be part of the returned parts.
I tried:
$parts = preg_split('/\+|>|~|#/', $input, PREG_SPLIT_DELIM_CAPTURE);
The result is only 2 parts where there should be 5 and the split-char isn't part of part [1].
I also tried:
$parts = preg_split('/\+|>|~|#/', $input, PREG_SPLIT_OFFSET_CAPTURE);
The result is then 1 part too few (4 instead of 5) and the last part contains a split-char.
Without flags in preg_split, the result is almost perfect (as many parts as there should be) but all the split-chars are gone.
Example:
$input = 'oele>boele#4 + key:type:id + *~the end'; // spaces should be ignored
$output /* should be: */
array( 'oele', '>boele', ' #4 ', '+ key:type:id ', '+ *', '~the end' );
Is there a spl function or flag to do this or do I have to make one myself =(
$parts = preg_split('/(?=[+>~#])/', $input);
Since you want to have the delimiters to be part of the next split piece, your split point is right before the delimiter and this can be easily done using positive look ahead.
(?= : Start of positive lookahead
[+>~#] : character class to match any of your delimiters.
) : End of look ahead assertion.
Effectively you are asking preg_split to split the input string at points just before delimiters.
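Run against the input from the question, the lookahead split can be sketched like this:

```php
<?php
$input = 'oele>boele#4 + key:type:id + *~the end';

// Split at the empty position before each delimiter; the delimiter
// therefore becomes the first character of the following part.
$parts = preg_split('/(?=[+>~#])/', $input);

print_r($parts);
```

Spaces are left untouched by this split, so they remain inside the parts.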
You're missing a value for the limit parameter, which is why it's returning fewer parts than you expected. Try:
$parts = preg_split('/\+|>|~|#/', $input, -1, PREG_SPLIT_OFFSET_CAPTURE);
Well, I had the same problem in the past. You have to wrap your regexp in parentheses (a capturing group) and pass the flag as the fourth argument; then it should work:
$parts = preg_split('/(\+|>|~|#)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
and here is it explained: http://www.php.net/manual/en/function.preg-split.php#94238
Ben is correct.
Just to add to his answer, PREG_SPLIT_DELIM_CAPTURE is a constant with a value of 2, so when it is passed in the limit position you get at most 2 parts; similarly, PREG_SPLIT_OFFSET_CAPTURE has a value of 4.
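The effect of the misplaced flag can be reproduced with a short sketch (variable names $wrong and $right are illustrative):

```php
<?php
$input = 'oele>boele#4 + key:type:id + *~the end';

// Flag passed in the limit position: PREG_SPLIT_DELIM_CAPTURE is the integer 2,
// so this call is read as "limit = 2, no flags".
$wrong = preg_split('/\+|>|~|#/', $input, PREG_SPLIT_DELIM_CAPTURE);

// Limit given explicitly (-1 = no limit) and the delimiters wrapped in a group,
// so that PREG_SPLIT_DELIM_CAPTURE has something to capture.
$right = preg_split('/(\+|>|~|#)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($wrong);
print_r($right);
```

In $right the delimiters come back as separate elements between the parts, so they still need to be joined onto their neighbours if that is the desired shape.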