What is the regex of extracting single letter or two letters? - php

There are two string
$str = "Calcium Plus Non Fat Milk Powder 1.8kg";
$str2 = "Super Dry Diapers L 54pcs";
I use
preg_match('/(?P<name>.*) (?P<total_weight>\b[0-9]*\.?[0-9]+)(?P<total_weight_unit>.*)/', $str, $m);
to extract $str and $str2 is similar way.
However I want to extract them such that I know it is weight(i.e. kg, g, etc) or it is portion(i.e. pcs, cans).
How can I do this??

If you want to capture number and unit for pieces and weight at the same time, try this:
$number_pattern="(\d+(?:\.\d+))"; #a sequence of digit with optional fractional part
$weight_unit_pattern="(k?g|oz)"; # kg, g or oz (add any other measure with '|measure'
$number_of_pieces_pattern="(\d+)\s*(pcs)"; # capture the number of pieces
$pattern="/(?:$number_pattern\s*$weight_unit_pattern)|(?:$number_pattern\s*$number_of_pieces_pattern)/";
preg_match_all($pattern,$str1,$result);
#now you should have a number and a unit

maybe
$str = "Calcium Plus Non Fat Milk Powder 1.8kg";
$str2 = "Super Dry Diapers L 54pcs";
$pat = '/([0-9.]+).+/';
preg_match_all($pat, $str2, $result);
print_r($result);

I would suggest ([0-9]+)([^ |^<]+) or ([0-9]+)(.{2,3})

I think you are looking for this code:
preg_match('/(?P<name>.*) (?P<total_weight>\b[0-9]*(\.?[0-9]+)?)(?P<total_weight_unit>.*)/', $str, $m);
I've added parentheses which bounds fractional part. Question mark (?) means zero or one match.

Related

php regex replace single capital with space capital [duplicate]

How would I go about splitting the word:
oneTwoThreeFour
into an array so that I can get:
one Two Three Four
with preg_match ?
I tired this but it just gives the whole word
$words = preg_match("/[a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/", $string, $matches)`;
You can use preg_split as:
$arr = preg_split('/(?=[A-Z])/',$str);
See it
I'm basically splitting the input string just before the uppercase letter. The regex used (?=[A-Z]) matches the point just before a uppercase letter.
You can also use preg_match_all as:
preg_match_all('/((?:^|[A-Z])[a-z]+)/',$str,$matches);
Explanation:
( - Start of capturing parenthesis.
(?: - Start of non-capturing parenthesis.
^ - Start anchor.
| - Alternation.
[A-Z] - Any one capital letter.
) - End of non-capturing parenthesis.
[a-z]+ - one ore more lowercase letter.
) - End of capturing parenthesis.
I know that this is an old question with an accepted answer, but IMHO there is a better solution:
<?php // test.php Rev:20140412_0800
$ccWord = 'NewNASAModule';
$re = '/(?#! splitCamelCase Rev:20140412)
# Split camelCase "words". Two global alternatives. Either g1of2:
(?<=[a-z]) # Position is after a lowercase,
(?=[A-Z]) # and before an uppercase letter.
| (?<=[A-Z]) # Or g2of2; Position is after uppercase,
(?=[A-Z][a-z]) # and before upper-then-lower case.
/x';
$a = preg_split($re, $ccWord);
$count = count($a);
for ($i = 0; $i < $count; ++$i) {
printf("Word %d of %d = \"%s\"\n",
$i + 1, $count, $a[$i]);
}
?>
Note that this regex, (like codaddict's '/(?=[A-Z])/' solution - which works like a charm for well formed camelCase words), matches only a position within the string and consumes no text at all. This solution has the additional benefit that it also works correctly for not-so-well-formed pseudo-camelcase words such as: StartsWithCap and: hasConsecutiveCAPS.
Input:
oneTwoThreeFour
StartsWithCap
hasConsecutiveCAPS
NewNASAModule
Output:
Word 1 of 4 = "one"
Word 2 of 4 = "Two"
Word 3 of 4 = "Three"
Word 4 of 4 = "Four"
Word 1 of 3 = "Starts"
Word 2 of 3 = "With"
Word 3 of 3 = "Cap"
Word 1 of 3 = "has"
Word 2 of 3 = "Consecutive"
Word 3 of 3 = "CAPS"
Word 1 of 3 = "New"
Word 2 of 3 = "NASA"
Word 3 of 3 = "Module"
Edited: 2014-04-12: Modified regex, script and test data to correctly split: "NewNASAModule" case (in response to rr's comment).
While ridgerunner's answer works great, it seems not to work with all-caps substrings that appear in the middle of sentence. I use following and it seems to deal with these just alright:
function splitCamelCase($input)
{
return preg_split(
'/(^[^A-Z]+|[A-Z][^A-Z]+)/',
$input,
-1, /* no limit for replacement count */
PREG_SPLIT_NO_EMPTY /*don't return empty elements*/
| PREG_SPLIT_DELIM_CAPTURE /*don't strip anything from output array*/
);
}
Some test cases:
assert(splitCamelCase('lowHigh') == ['low', 'High']);
assert(splitCamelCase('WarriorPrincess') == ['Warrior', 'Princess']);
assert(splitCamelCase('SupportSEELE') == ['Support', 'SEELE']);
assert(splitCamelCase('LaunchFLEIAModule') == ['Launch', 'FLEIA', 'Module']);
assert(splitCamelCase('anotherNASATrip') == ['another', 'NASA', 'Trip']);
A functionized version of #ridgerunner's answer.
/**
* Converts camelCase string to have spaces between each.
* #param $camelCaseString
* #return string
*/
function fromCamelCase($camelCaseString) {
$re = '/(?<=[a-z])(?=[A-Z])/x';
$a = preg_split($re, $camelCaseString);
return join($a, " " );
}
$string = preg_replace( '/([a-z0-9])([A-Z])/', "$1 $2", $string );
The trick is a repeatable pattern $1 $2$1 $2 or lower UPPERlower UPPERlower etc....
for example
helloWorld = $1 matches "hello", $2 matches "W" and $1 matches "orld" again so in short you get $1 $2$1 or "hello World", matches HelloWorld as $2$1 $2$1 or again "Hello World". Then you can lower case them uppercase the first word or explode them on the space, or use a _ or some other character to keep them separate.
Short and simple.
When determining the best pattern for your project, you will need to consider the following pattern factors:
Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
Readability -- the pattern should be keep as simple as possible
The above factors also happen to be in the hierarchical order that strive to obey. In other words, it doesn't make much sense to me to prioritize 2, 3, or 4 when 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.
Capture Groups and Lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability which can be associated with pattern brevity.
Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:
preg_split() patterns:
/^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
/(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
/[^A-Z]+\K(?=[A-Z])/ (43 steps)
/(?=[A-Z])/ (50 steps)
/(?=[A-Z]+)/ (50 steps)
/([a-z]{1})[A-Z]{1}/ (53 steps)
/([a-z0-9])([A-Z])/ (68 steps)
/(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)
preg_match_all() patterns:
/[A-Z]?[a-z]+/ (14 steps)
/((?:^|[A-Z])[a-z]+)/ (35 steps)
I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array, in other words, all of the fullstring matches will be in the [0] subarray; if there is a capture group used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.
Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.
New Extended Battery of Test Strings:
oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain
Suitable preg_split() patterns:
/[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)
Suitable preg_match_all() pattern:
/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
Finally, my recommendations based on my pattern principles / factor hierarchy. Also, I recommend preg_split() over preg_match_all() (despite the patterns having less steps) as a matter of directness to the desired output structure. (of course, choose whatever you like)
Code: (Demo)
$noAcronyms = 'oneTwoThreeFour';
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);
Code: (Demo)
$withAcronyms = 'newNASAModule';
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);
I took cool guy Ridgerunner's code (above) and made it into a function:
echo deliciousCamelcase('NewNASAModule');
function deliciousCamelcase($str)
{
$formattedStr = '';
$re = '/
(?<=[a-z])
(?=[A-Z])
| (?<=[A-Z])
(?=[A-Z][a-z])
/x';
$a = preg_split($re, $str);
$formattedStr = implode(' ', $a);
return $formattedStr;
}
This will return: New NASA Module
Another option is matching /[A-Z]?[a-z]+/ - if you know your input is on the right format, it should work nicely.
[A-Z]? would match an uppercase letter (or nothing). [a-z]+ would then match all following lowercase letters, until the next match.
Working example: https://regex101.com/r/kNZfEI/1
You can split on a "glide" from lowercase to uppercase thus:
$parts = preg_split('/([a-z]{1})[A-Z]{1}/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
//PREG_SPLIT_DELIM_CAPTURE to also return bracketed things
var_dump($parts);
Annoyingly you will then have to rebuild the words from each corresponding pair of items in $parts
Hope this helps
First of all codaddict thank you for your pattern, it helped a lot!
I needed a solution that works in case a preposition 'a' exists:
e.g. thisIsACamelcaseSentence.
I found the solution in doing a two step preg_match and made a function with some options:
/*
* input: 'thisIsACamelCaseSentence' output: 'This Is A Camel Case Sentence'
* options $case: 'allUppercase'[default] >> 'This Is A Camel Case Sentence'
* 'allLowerCase' >> 'this is a camel case sentence'
* 'firstUpperCase' >> 'This is a camel case sentence'
* #return: string
*/
function camelCaseToWords($string, $case = null){
isset($case) ? $case = $case : $case = 'allUpperCase';
// Find first occurances of two capitals
preg_match_all('/((?:^|[A-Z])[A-Z]{1})/',$string, $twoCapitals);
// Split them with the 'zzzzzz' string. e.g. 'AZ' turns into 'AzzzzzzZ'
foreach($twoCapitals[0] as $match){
$firstCapital = $match[0];
$lastCapital = $match[1];
$temp = $firstCapital.'zzzzzz'.$lastCapital;
$string = str_replace($match, $temp, $string);
}
// Now split words
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $string, $words);
$output = "";
$i = 0;
foreach($words[0] as $word){
switch($case){
case 'allUpperCase':
$word = ucfirst($word);
break;
case 'allLowerCase':
$word = strtolower($word);
break;
case 'firstUpperCase':
($i == 0) ? $word = ucfirst($word) : $word = strtolower($word);
break;
}
// remove te 'zzzzzz' from a word if it has
$word = str_replace('zzzzzz','', $word);
$output .= $word." ";
$i++;
}
return $output;
}
Feel free to use it, and in case there is an 'easier' way to do this in one step please comment!
Full function based on #codaddict answer:
function splitCamelCase($str) {
$splitCamelArray = preg_split('/(?=[A-Z])/', $str);
return ucwords(implode($splitCamelArray, ' '));
}

Extract last section of string

I have a string like this:
[numbers]firstword[numbers]mytargetstring
I would like to know how is it possible to extract "targetstring" taking account the following :
a.) Numbers are numerical digits for example, my complete string with numbers:
12firstword21mytargetstring
b.) Numbers can be any digits, for example above are two digits each, but it can be any number of digits like this:
123firstword21567mytargetstring
Regardless of the number of digits, I am only interested in extracting "mytargetstring".
By the way "firstword" is fixed and will not change with any combination.
I am not very good in Regex so I appreciate someone with strong background can suggest how to do this using PHP. Thank you so much.
This will do it (or should do)
$input = '12firstword21mytargetstring';
preg_match('/\d+\w+\d+(\w+)$/', $input, $matches);
echo $matches[1]; // mytargetstring
It breaks down as
\d+\w+\d+(\w+)$
\d+ - One or more numbers
\w+ - followed by 1 or more word characters
\d+ - followed by 1 or more numbers
(\w+)$ - followed by 1 or more word characters that end the string. The brackets mark this as a group you want to extract
preg_match("/[0-9]+[a-z]+[0-9]+([a-z]+)/i", $your_string, $matches);
print_r($matches);
You can do it with preg_match and pattern syntax.
$string ='2firstword21mytargetstring';
if (preg_match ('/\d(\D*)$/', $string, $match)){
// ^ -- end of string
// ^ -- 0 or more
// ^^ -- any non digit character
// ^^ -- any digit character
var_dump($match[1]);}
Try it like,
print_r(preg_split('/\d+/i', "12firstword21mytargetstring"));
echo '<br/>';
echo 'Final string is: '.end(preg_split('/\d+/i', "12firstword21mytargetstring"));
Tested on http://writecodeonline.com/php/
You don't need regex for that:
for ($i=strlen($string)-1; $i; $i--) {
if (is_numeric($string[$i])) break;
}
$extracted_string = substr($string, $i+1);
Above it's probably the faster implementation you can get, certainly faster than using regex, which you don't need for this simple case.
See the working demo
your simple solution is here :-
$keywords = preg_split("/[\d,]+/", "hypertext123language2434programming");
echo($keywords[2]);

Split text into words & numbers with unicode support (preg_split)

I'm trying to split (with preg_split) a text with a lot of foreign chars and digits into words and numbers with length >= 2 and without ponctuation.
Now I have this code but it only split into words without taking account digits and length >= 2 for all.
How can I do please?
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
$splitted = preg_split('#\P{L}+#u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Expected result should be : array('abc', '字化け', 'efg', 'Yukarda', 'mavi', 'gök', 'asağıda', 'yağız', 'yer', 'yaratıldıkta', '1998', 'siejės', 'Ton', 'pate', 'dėina', 'bandomkojė', 'бойынша', 'бірінші', 'орында', 'тұр', '79.65', 'айына', '41');
NB : already tried with these docs link1 & link2 but i can't get it works :-/
Use preg_match_all instead, then you can check the length condition (that is hard to do with preg_split, but not impossible):
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
preg_match_all('~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u',$text,$matches);
print_r($matches);
explanation:
p{L}{2,}+ # letter 2 or more times
| # OR
\d{2,}+ # digit 2 or more times
(?>\.\d++)? # can be a decimal number
| # OR
\d\.\d++ # single digit MUST be followed by at least a decimal
# (length constraint)
With a little hack to match digits separated by dot before matching only digits as part of the word:
preg_match_all("#(?:\d+\.\d+|\w{2,})#u", $text, $matches);
$splitted = $matches[0];
http://codepad.viper-7.com/X7Ln1V
Splitting CJK into "words" is kind of meaningless. Each character is a word. If you use whitespace the you split into phrases.
So it depends on what you're actually trying to accomplish. If you're indexing text, then you need to consider bigrams and/or CJK idioms.

Find first instance of character, then stop at space?

I think I need to use some kind of regex but struggling...
I have a string e.g.
the cat sat on the mat and $10 was all it cost
I want to return
$10
And is there a universal name for currency codes so I could return £10 if it was
the cat sat on the mat and £10 was all it cost
Or a way to add more characters to the expression
If you want to match all currency codes, use the following regex:
/\p{Sc}\d+(\.\d+)?\b/u
explanation:
/ # regex delimiter
\p{Sc} # a currency symbol
\d+ # 1 or more digit
(\.\d+)? # optionally followed by a dot and one or more digit
\b # word boundary
/ # regex delimiter
u # unicode
Have a look at this site to see the meaning of \p{Sc} (Currency Symbol)
You can use
/(\$.*?) /
(note there is a space after the closing parenthesis)
If you want to add more symbols, then use brackets:
$str = 'the cat sat on the mat and £10 was all it cost';
$matches = array();
preg_match( '/([\$£].*?) /', $str, $matches );
This will work if the currency symbol precedes the value, and if there is a space following the value. You might want to check for more general cases, such as the value being at the end of a sentence with no trailing space etc.
$string = 'the cat sat on the mat and $10 was all it cost';
$found = preg_match_all('/[$£]\d*/',$string,$results);
if ($found)
var_dump($results);
This may works for you
$string = "the cat sat on the mat and $10 was all it cost";
preg_match("/ ([\$£\]{1})([0-9]+)/", $string, $matches);
echo "<pre>";
print_r($matches);

Regular expression to match hyphenated words

How can I extract hyphenated strings from this string line?
ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service
I just want to extract "ADW-CFS-WE" from it but has been very unsuccessful for the past few hours. I'm stuck with this simple regEx "(.*)" making the all of the string stated about selected.
You can probably use:
preg_match("/\w+(-\w+)+/", ...)
The \w+ will match any number of alphanumeric characters (= one word). And the second group ( ) is any additional number of hyphen with letters.
The trick with regular expressions is often specificity. Using .* will often match too much.
$input = "ADW-CFS-WE X-Y CI SLA Def No SLANAME CI Max Outage Service";
preg_match_all('/[A-Z]+-[A-Z-]+/', $input, $matches);
foreach ($matches[0] as $m) {
echo $matches . "\n";
}
Note that this solutions assumes that only uppercase A-Z can match. If that's not the case, insert the correct character class. For example, if you want to allow arbitrary letters (like a and Ä), replace [A-Z] with \p{L}.
Just catch every space free [^\s] words with at least an '-'.
The following expression will do it:
<?php
$z = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service";
$r = preg_match('#([^\s]*-[^\s]*)#', $z, $matches);
var_dump($matches);
The following pattern assumes the data is at the beginning of the string, contains only capitalized letters and may contain a hyphen before each group of one or more of those letters:
<?php
$str = 'ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service';
if (preg_match('/^(?:-?[A-Z]+)+/', $str, $matches) !== false)
var_dump($matches);
Result:
array(1) {
[0]=>
string(10) "ADW-CFS-WE"
}

Categories