Regular expression to match hyphenated words

Regular expression to match hyphenated words - php

How can I extract hyphenated strings from this string line?
ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service
I just want to extract "ADW-CFS-WE" from it but has been very unsuccessful for the past few hours. I'm stuck with this simple regEx "(.*)" making the all of the string stated about selected.

You can probably use:
preg_match("/\w+(-\w+)+/", ...)
The \w+ will match any number of alphanumeric characters (= one word). And the second group ( ) is any additional number of hyphen with letters.
The trick with regular expressions is often specificity. Using .* will often match too much.

$input = "ADW-CFS-WE X-Y CI SLA Def No SLANAME CI Max Outage Service";
preg_match_all('/[A-Z]+-[A-Z-]+/', $input, $matches);
foreach ($matches[0] as $m) {
echo $matches . "\n";
}
Note that this solutions assumes that only uppercase A-Z can match. If that's not the case, insert the correct character class. For example, if you want to allow arbitrary letters (like a and Ä), replace [A-Z] with \p{L}.

Just catch every space free [^\s] words with at least an '-'.
The following expression will do it:
<?php
$z = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service";
$r = preg_match('#([^\s]*-[^\s]*)#', $z, $matches);
var_dump($matches);

The following pattern assumes the data is at the beginning of the string, contains only capitalized letters and may contain a hyphen before each group of one or more of those letters:
<?php
$str = 'ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service';
if (preg_match('/^(?:-?[A-Z]+)+/', $str, $matches) !== false)
var_dump($matches);
Result:
array(1) {
[0]=>
string(10) "ADW-CFS-WE"
}

Related

php regex replace single capital with space capital [duplicate]

How would I go about splitting the word:
oneTwoThreeFour
into an array so that I can get:
one Two Three Four
with preg_match ?
I tired this but it just gives the whole word
$words = preg_match("/[a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/", $string, $matches)`;

You can use preg_split as:
$arr = preg_split('/(?=[A-Z])/',$str);
See it
I'm basically splitting the input string just before the uppercase letter. The regex used (?=[A-Z]) matches the point just before a uppercase letter.

You can also use preg_match_all as:
preg_match_all('/((?:^|[A-Z])[a-z]+)/',$str,$matches);
Explanation:
( - Start of capturing parenthesis.
(?: - Start of non-capturing parenthesis.
^ - Start anchor.
| - Alternation.
[A-Z] - Any one capital letter.
) - End of non-capturing parenthesis.
[a-z]+ - one ore more lowercase letter.
) - End of capturing parenthesis.

I know that this is an old question with an accepted answer, but IMHO there is a better solution:
<?php // test.php Rev:20140412_0800
$ccWord = 'NewNASAModule';
$re = '/(?#! splitCamelCase Rev:20140412)
# Split camelCase "words". Two global alternatives. Either g1of2:
(?<=[a-z]) # Position is after a lowercase,
(?=[A-Z]) # and before an uppercase letter.
| (?<=[A-Z]) # Or g2of2; Position is after uppercase,
(?=[A-Z][a-z]) # and before upper-then-lower case.
/x';
$a = preg_split($re, $ccWord);
$count = count($a);
for ($i = 0; $i < $count; ++$i) {
printf("Word %d of %d = \"%s\"\n",
$i + 1, $count, $a[$i]);
}
?>
Note that this regex, (like codaddict's '/(?=[A-Z])/' solution - which works like a charm for well formed camelCase words), matches only a position within the string and consumes no text at all. This solution has the additional benefit that it also works correctly for not-so-well-formed pseudo-camelcase words such as: StartsWithCap and: hasConsecutiveCAPS.
Input:
oneTwoThreeFour
StartsWithCap
hasConsecutiveCAPS
NewNASAModule
Output:
Word 1 of 4 = "one"
Word 2 of 4 = "Two"
Word 3 of 4 = "Three"
Word 4 of 4 = "Four"
Word 1 of 3 = "Starts"
Word 2 of 3 = "With"
Word 3 of 3 = "Cap"
Word 1 of 3 = "has"
Word 2 of 3 = "Consecutive"
Word 3 of 3 = "CAPS"
Word 1 of 3 = "New"
Word 2 of 3 = "NASA"
Word 3 of 3 = "Module"
Edited: 2014-04-12: Modified regex, script and test data to correctly split: "NewNASAModule" case (in response to rr's comment).

While ridgerunner's answer works great, it seems not to work with all-caps substrings that appear in the middle of sentence. I use following and it seems to deal with these just alright:
function splitCamelCase($input)
{
return preg_split(
'/(^[^A-Z]+|[A-Z][^A-Z]+)/',
$input,
-1, /* no limit for replacement count */
PREG_SPLIT_NO_EMPTY /*don't return empty elements*/
| PREG_SPLIT_DELIM_CAPTURE /*don't strip anything from output array*/
);
}
Some test cases:
assert(splitCamelCase('lowHigh') == ['low', 'High']);
assert(splitCamelCase('WarriorPrincess') == ['Warrior', 'Princess']);
assert(splitCamelCase('SupportSEELE') == ['Support', 'SEELE']);
assert(splitCamelCase('LaunchFLEIAModule') == ['Launch', 'FLEIA', 'Module']);
assert(splitCamelCase('anotherNASATrip') == ['another', 'NASA', 'Trip']);

A functionized version of #ridgerunner's answer.
/**
* Converts camelCase string to have spaces between each.
* #param $camelCaseString
* #return string
*/
function fromCamelCase($camelCaseString) {
$re = '/(?<=[a-z])(?=[A-Z])/x';
$a = preg_split($re, $camelCaseString);
return join($a, " " );
}

$string = preg_replace( '/([a-z0-9])([A-Z])/', "$1 $2", $string );
The trick is a repeatable pattern $1 $2$1 $2 or lower UPPERlower UPPERlower etc....
for example
helloWorld = $1 matches "hello", $2 matches "W" and $1 matches "orld" again so in short you get $1 $2$1 or "hello World", matches HelloWorld as $2$1 $2$1 or again "Hello World". Then you can lower case them uppercase the first word or explode them on the space, or use a _ or some other character to keep them separate.
Short and simple.

When determining the best pattern for your project, you will need to consider the following pattern factors:
Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
Readability -- the pattern should be keep as simple as possible
The above factors also happen to be in the hierarchical order that strive to obey. In other words, it doesn't make much sense to me to prioritize 2, 3, or 4 when 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.
Capture Groups and Lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability which can be associated with pattern brevity.
Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:
preg_split() patterns:
/^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
/(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
/[^A-Z]+\K(?=[A-Z])/ (43 steps)
/(?=[A-Z])/ (50 steps)
/(?=[A-Z]+)/ (50 steps)
/([a-z]{1})[A-Z]{1}/ (53 steps)
/([a-z0-9])([A-Z])/ (68 steps)
/(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)
preg_match_all() patterns:
/[A-Z]?[a-z]+/ (14 steps)
/((?:^|[A-Z])[a-z]+)/ (35 steps)
I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array, in other words, all of the fullstring matches will be in the [0] subarray; if there is a capture group used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.
Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.
New Extended Battery of Test Strings:
oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain
Suitable preg_split() patterns:
/[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)
Suitable preg_match_all() pattern:
/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
Finally, my recommendations based on my pattern principles / factor hierarchy. Also, I recommend preg_split() over preg_match_all() (despite the patterns having less steps) as a matter of directness to the desired output structure. (of course, choose whatever you like)
Code: (Demo)
$noAcronyms = 'oneTwoThreeFour';
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);
Code: (Demo)
$withAcronyms = 'newNASAModule';
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);

I took cool guy Ridgerunner's code (above) and made it into a function:
echo deliciousCamelcase('NewNASAModule');
function deliciousCamelcase($str)
{
$formattedStr = '';
$re = '/
(?<=[a-z])
(?=[A-Z])
| (?<=[A-Z])
(?=[A-Z][a-z])
/x';
$a = preg_split($re, $str);
$formattedStr = implode(' ', $a);
return $formattedStr;
}
This will return: New NASA Module

Another option is matching /[A-Z]?[a-z]+/ - if you know your input is on the right format, it should work nicely.
[A-Z]? would match an uppercase letter (or nothing). [a-z]+ would then match all following lowercase letters, until the next match.
Working example: https://regex101.com/r/kNZfEI/1

You can split on a "glide" from lowercase to uppercase thus:
$parts = preg_split('/([a-z]{1})[A-Z]{1}/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
//PREG_SPLIT_DELIM_CAPTURE to also return bracketed things
var_dump($parts);
Annoyingly you will then have to rebuild the words from each corresponding pair of items in $parts
Hope this helps

First of all codaddict thank you for your pattern, it helped a lot!
I needed a solution that works in case a preposition 'a' exists:
e.g. thisIsACamelcaseSentence.
I found the solution in doing a two step preg_match and made a function with some options:
/*
* input: 'thisIsACamelCaseSentence' output: 'This Is A Camel Case Sentence'
* options $case: 'allUppercase'[default] >> 'This Is A Camel Case Sentence'
* 'allLowerCase' >> 'this is a camel case sentence'
* 'firstUpperCase' >> 'This is a camel case sentence'
* #return: string
*/
function camelCaseToWords($string, $case = null){
isset($case) ? $case = $case : $case = 'allUpperCase';
// Find first occurances of two capitals
preg_match_all('/((?:^|[A-Z])[A-Z]{1})/',$string, $twoCapitals);
// Split them with the 'zzzzzz' string. e.g. 'AZ' turns into 'AzzzzzzZ'
foreach($twoCapitals[0] as $match){
$firstCapital = $match[0];
$lastCapital = $match[1];
$temp = $firstCapital.'zzzzzz'.$lastCapital;
$string = str_replace($match, $temp, $string);
}
// Now split words
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $string, $words);
$output = "";
$i = 0;
foreach($words[0] as $word){
switch($case){
case 'allUpperCase':
$word = ucfirst($word);
break;
case 'allLowerCase':
$word = strtolower($word);
break;
case 'firstUpperCase':
($i == 0) ? $word = ucfirst($word) : $word = strtolower($word);
break;
}
// remove te 'zzzzzz' from a word if it has
$word = str_replace('zzzzzz','', $word);
$output .= $word." ";
$i++;
}
return $output;
}
Feel free to use it, and in case there is an 'easier' way to do this in one step please comment!

Full function based on #codaddict answer:
function splitCamelCase($str) {
$splitCamelArray = preg_split('/(?=[A-Z])/', $str);
return ucwords(implode($splitCamelArray, ' '));
}

get the portion of a string between two positions with php

I have a string like "some words 12345cm some more words"
and I want to extract the 12345cm bit from that string. So I get the position of the first number:
$position_of_first_number = strcspn( "some words 12345cm some more words" , '0123456789' );
Then the position of the first space after $position_of_first_number
$position_of_space_after_numbers = strpos("some words 12345cm some more words", " ", $position_of_first_number);
Then I want to have a function which return the portion of the string between $position_of_first_number and $position_of_space_after_numbers.
How do I do it?

You can use the substr function. Note that it takes a starting position and a length, which you can calculate as the difference between the start and end positions.

Since you are looking for a pattern like blank-digits-letters-blank, I would recommend a regular expression using preg_match:
$s = "some words 12345cm some more words";
preg_match("/\s(?P<result>\d+[^\W\d_]+)\s/", $s, $matches);
echo $matches["result"];
12345cm
Explaining the pattern:
"/.../" limits the pattern in PHP
\s matches any whitespace character
(?P<name>...) names the following pattern
\d+ matches 1 or more digits
[^\W\d_]+ matches 1 or more Unicode-letters (i.e. any character that is not a non-alphanumeric character; see this answer)

Make two simple regex's into one

I am trying to make a regex that will look behind .txt and then behind the "-" and get the first digit .... in the example, it would be a 1.
$record_pattern = '/.txt.+/';
preg_match($record_pattern, $decklist, $record);
print_r($record);
.txt?n=chihoi%20%283-1%29
I want to write this as one expression but can only seem to do it as two. This is the first time working with regex's.

You can use this:
$record_pattern = '/\.txt.+-(\d)/';
Now, the first group contains what you want.

Your regex would be,
\.txt[^-]*-\K\d
You don't need for any groups. It just matches from the .txt and upto the literal -. Because of \K in our regex, it discards the previously matched characters. In our case it discards .txt?n=chihoi%20%283- string. Then it starts matching again the first digit which was just after to -
DEMO
Your PHP code would be,
<?php
$mystring = ".txt?n=chihoi%20%283-1%29";
$regex = '~\.txt[^-]*-\K\d~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch;
}
?> //=> 1

Extract last section of string

I have a string like this:
[numbers]firstword[numbers]mytargetstring
I would like to know how is it possible to extract "targetstring" taking account the following :
a.) Numbers are numerical digits for example, my complete string with numbers:
12firstword21mytargetstring
b.) Numbers can be any digits, for example above are two digits each, but it can be any number of digits like this:
123firstword21567mytargetstring
Regardless of the number of digits, I am only interested in extracting "mytargetstring".
By the way "firstword" is fixed and will not change with any combination.
I am not very good in Regex so I appreciate someone with strong background can suggest how to do this using PHP. Thank you so much.

This will do it (or should do)
$input = '12firstword21mytargetstring';
preg_match('/\d+\w+\d+(\w+)$/', $input, $matches);
echo $matches[1]; // mytargetstring
It breaks down as
\d+\w+\d+(\w+)$
\d+ - One or more numbers
\w+ - followed by 1 or more word characters
\d+ - followed by 1 or more numbers
(\w+)$ - followed by 1 or more word characters that end the string. The brackets mark this as a group you want to extract

preg_match("/[0-9]+[a-z]+[0-9]+([a-z]+)/i", $your_string, $matches);
print_r($matches);

You can do it with preg_match and pattern syntax.
$string ='2firstword21mytargetstring';
if (preg_match ('/\d(\D*)$/', $string, $match)){
// ^ -- end of string
// ^ -- 0 or more
// ^^ -- any non digit character
// ^^ -- any digit character
var_dump($match[1]);}

Try it like,
print_r(preg_split('/\d+/i', "12firstword21mytargetstring"));
echo '<br/>';
echo 'Final string is: '.end(preg_split('/\d+/i', "12firstword21mytargetstring"));
Tested on http://writecodeonline.com/php/

You don't need regex for that:
for ($i=strlen($string)-1; $i; $i--) {
if (is_numeric($string[$i])) break;
}
$extracted_string = substr($string, $i+1);
Above it's probably the faster implementation you can get, certainly faster than using regex, which you don't need for this simple case.
See the working demo

your simple solution is here :-
$keywords = preg_split("/[\d,]+/", "hypertext123language2434programming");
echo($keywords[2]);

REGEXR help - how to extract a year from a string

I have a year listed in my string
$s = "Acquired by the University in 1988";
In practice, that could be anywhere in this single line string. How do I extract it using regexr? I tried \d and that didn't work, it just came up with an error.
Jason
I'm using preg_match in LAMP 5.2

You need a regex to match four digits, and these four digits must comprise a whole word (i.e. a string of 10 digits contains four digits but is not a year.) Thus, the regex needs to include word boundaries like so:
if (preg_match('/\b\d{4}\b/', $s, $matches)) {
$year = $matches[0];
}

Try this code:
<?php
$s = "Acquired by the University in 1988 year.";
$yr = preg_replace('/^[^\d]*(\d{4}).*$/', '\1', $s);
var_dump($yr);
?>
OUTPUT:
string(4) "1988"
However this regex works with an assumption that 4 digit number appears just once in the line.

Well, you could use \d{4}, but that will break if there's anything else in the string with four digits.
Edit:
The problem is that, other than the four numeric characters, there isn't really any other identifying information (as, according to your requirements, the number can be anywhere in the string), so based on what you've written, this is probably the best that you can do outside of range checking the returned value.
$str = "the year is 1988";
preg_match('/\d{4}/', $str, $matches);
var_dump($matches);

/(^|\s)(\d{4})(\s|$)/gm
Matches
Acquired by the University in 1988
The 1945 vintage was superb
1492 columbus sailed the ocean blue
Ignores
There were nearly 10000 people there!
Member ID 45678
Phone Number 951-555-2563
See it in action at http://refiddle.com/10k

preg_match('/(\d{4})/', $string, $matches);

For a basic year match, assuming only one year
$year = false;
if(preg_match("/\d{4}/", $string, $match)) {
$year = $match[0];
}
If you need to handle the posibility of multiple years in the same string
if(preg_match_all("/\d{4}/", $string, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
$year = $match[0];
}
}

/(?<!\d)\d{4}(?!\d)/ will match only 4-digit numbers that do not have digits before or after them.
(?<!\d) and (?!\d) are look-behind and look-ahead (respectively) assertions that ensure that a \d does not occur before or after the main part of the RE.
It may in practice be more sensible to use \b instead of the assertions; this will ensure that the beginning and end of the year occur at a "word boundary". So then "1337hx0r" would be appropriately ignored.
If you are only for looking for years within the past century or so, you could use
/\b(19|20)\d{2}\b/

Also if your string is something like that :
$date = "20044Q";
You can use below code to extract year from any string.
preg_match('/(?:(?:19|20)[0-9]{2})/', $date, $matches);
echo $matches[0];

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regular expression to match hyphenated words - php

You can probably use: preg_match("/\w+(-\w+)+/", ...) The \w+ will match any number of alphanumeric characters (= one word). And the second group ( ) is any additional number of hyphen with letters. The trick with regular expressions is often specificity. Using .* will often match too much.

Just catch every space free [^\s] words with at least an '-'. The following expression will do it: <?php $z = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service"; $r = preg_match('#([^\s]-[^\s])#', $z, $matches); var_dump($matches);

Related

php regex replace single capital with space capital [duplicate]

get the portion of a string between two positions with php

Make two simple regex's into one

Extract last section of string

REGEXR help - how to extract a year from a string

Categories

Resources

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regular expression to match hyphenated words - php

You can probably use: preg_match("/\w+(-\w+)+/", ...) The \w+ will match any number of alphanumeric characters (= one word). And the second group ( ) is any additional number of hyphen with letters. The trick with regular expressions is often specificity. Using .* will often match too much.

Just catch every space free [^\s] words with at least an '-'. The following expression will do it: <?php $z = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service"; $r = preg_match('#([^\s]*-[^\s]*)#', $z, $matches); var_dump($matches);

Related

php regex replace single capital with space capital [duplicate]

get the portion of a string between two positions with php

Make two simple regex's into one

Extract last section of string

REGEXR help - how to extract a year from a string

Categories

Resources

Just catch every space free [^\s] words with at least an '-'. The following expression will do it: <?php $z = "ADW-CFS-WE CI SLA Def No SLANAME CI Max Outage Service"; $r = preg_match('#([^\s]-[^\s])#', $z, $matches); var_dump($matches);