I am trying to learn regex. I have the string:
$x = "5ft2inches";
How can I read [5,2] into an array using a regex?
If you are assuming that the string will be of the form "{number}ft{number}inches" then you can use preg_match():
preg_match('/(\d+)ft(\d+)inches/', $string, $matches);
(\d+) will match a string of one or more digits. The parentheses tell preg_match() to place the matched numbers into the $matches variable (the third argument to the function). The function will return 1 if it made a match, or 0 if it didn't.
Here is what $matches looks like after a successful match:
Array
(
[0] => 5ft2inches
[1] => 5
[2] => 2
)
The entire matched string is the first element, then the parenthesized matches follow. So to make your desired array:
$array = array($matches[1], $matches[2]);
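Putting those pieces together, a minimal sketch might look like the following. The ^ and $ anchors and the (int) casts are additions assumed here for illustration; they are not part of the answer above.

```php
<?php
$x = "5ft2inches";
$array = array();

// preg_match() returns 1 on a match and 0 when nothing matched.
if (preg_match('/^(\d+)ft(\d+)inches$/', $x, $matches) === 1) {
    // $matches[0] is the whole match; the captured groups start at index 1.
    $array = array((int) $matches[1], (int) $matches[2]);
}

print_r($array);
```

Checking the return value first avoids reading $matches[1] when the string was not in the expected form.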
Assuming PHP, any reason no one has suggested split?
$numbers = preg_split('/[^0-9]+/', $x, -1, PREG_SPLIT_NO_EMPTY);
In Perl:
#!/usr/bin/perl
use strict;
use warnings;
my $x = "5ft2inches";
my %height;
@height{qw(feet inches)} = ($x =~ /^([0-9]+)ft([0-9]+)inches$/);
use Data::Dumper;
print Dumper \%height;
Output:
$VAR1 = {
'feet' => '5',
'inches' => '2'
};
Or, using split:
@height{qw(feet inches)} = split /ft|inches/, $x;
The regular expression is simply /[0-9]+/ but how to get it into an array depends entirely on what programming language you're using.
With Regular Expressions, you can either extract your data in a contextless way, or a contextful way.
I.e., if you match any run of digits: (\d+) (NB: this assumes that your language honors \d as the shortcut for 'any digit')
You can then extract each group, but you won't know whether your string was actually "5ft2inches", "5 2inches" or "29Cabbages1Fish4Cows".
If you add context: (\d+)ft(\d+)inches
You know for sure what you've extracted (Because otherwise you'd not get a match) and can refer to each group in turn to get the feet and inches.
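The difference between the two styles can be sketched in PHP as follows. The variable names and the anchored form of the contextful pattern are assumptions made for this illustration.

```php
<?php
$inputs = array("5ft2inches", "29Cabbages1Fish4Cows");
$results = array();

foreach ($inputs as $input) {
    // Contextless: grabs every run of digits, whatever surrounds it.
    preg_match_all('/\d+/', $input, $loose);

    // Contextful: only succeeds when the digits sit in the expected frame.
    $strict = preg_match('/^(\d+)ft(\d+)inches$/', $input, $m) === 1
        ? array($m[1], $m[2])
        : null;

    $results[$input] = array('loose' => $loose[0], 'strict' => $strict);
}

print_r($results);
```

The contextless pattern happily returns three numbers for the cabbages string, while the contextful one reports no match at all.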
If you're not always going to have a pair of numbers to extract, you'll need to make the various components optional. Check out This Regular Expression Cheat Sheet (his other cheat sheets are nifty too) for more info.
You don't mention the language you are using, so here is the general solution: You don't "extract" the numbers, you replace everything except numbers with an empty string.
In C#, this would look like
string numbers = Regex.Replace(dirtyString, "[^0-9]", "");
Have to watch out for double digit numbers.
/\d{1,2}/
might work a little better for feet and inches. The max value of '2' should be upped to whatever is appropriate.
use `/[\d]/`
I am trying to split a string in php, which looks like this:
ABCDE1234ABCD1234ABCDEF1234
Into an array of string which, in this case, would look like this:
ABCDE1234
ABCD1234
ABCDEF1234
So the pattern is "an undefined number of letters, and then 4 digits, then an undefined number of letters and 4 digits etc."
I'm trying to split the string using preg_split like this:
$pattern = "#[0-9]{4}$#";
preg_split($pattern, $stringToSplit);
And it returns an array containing the full string (not split) in the first element.
I'm guessing the problem here is my regex as I don't fully understand how to use them, and I am not sure if I'm using it correctly.
So what would be the correct regex to use?
You don't want preg_split, you want preg_match_all:
$str = 'ABCDE1234ABCD1234ABCDEF1234';
preg_match_all('/[a-z]+[0-9]{4}/i', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
array(3) {
[0]=>
string(9) "ABCDE1234"
[1]=>
string(8) "ABCD1234"
[2]=>
string(10) "ABCDEF1234"
}
}
PHP uses PCRE-style regexes which let you do lookbehinds. You can use this to see if there are 4 digits "behind" you. Combine that with a lookahead to see if there's a letter ahead of you, and you get this:
(?<=\d{4})(?=[a-z])
Notice the dotted lines on the Debuggex Demo page. Those are the points you want to split on.
In PHP this would be:
var_dump(preg_split('/(?<=\d{4})(?=[a-z])/i', 'ABCDE1234ABCD1234ABCDEF1234'));
Use the principle of contrast:
\D+\d{4}
# requires at least one non digit
# followed by exactly four digits
See a demo on regex101.com.
In PHP this would be:
<?php
$string = 'ABCDE1234ABCD1234ABCDEF1234';
$regex = '~\D+\d{4}~';
preg_match_all($regex, $string, $matches);
?>
See a demo on ideone.com.
I'm no good at regex so here is the road less traveled:
<?php
$s = 'ABCDE1234ABCD1234ABCDEF1234';
$nums = range(0,9);
$num_hit = 0;
$i = 0;
$arr = array();
foreach(str_split($s) as $v)
{
if(isset($nums[$v]))
{
++$num_hit;
}
if(!isset($arr[$i]))
{
$arr[$i] = '';
}
$arr[$i].= $v;
if($num_hit === 4)
{
++$i;
$num_hit = 0;
}
}
print_r($arr);
First, why is your attempted pattern not delivering the desired output? Because the $ anchor tells the function to explode the string by using the final four digits as the "delimiter" (characters that are consumed while dividing the string into separate parts).
Your result:
array (
0 => 'ABCDE1234ABCD1234ABCDEF', // an element of characters before the last four digits
1 => '', // an empty element containing the non-existent characters after the four digits
)
In plain English, to fix your pattern, you must:
Not consume any characters while exploding and
Ensure that no empty elements are generated.
My snippet is at the bottom of this post.
Second, there seems to be some debate about what regex function to use (or even if regex is a preferable tool).
My stance is that using a non-regex method will require a long-winded block of lines which will be equally if not more difficult to read than a regex pattern. Using regex lets you generate your result in one line, and not in an unsightly fashion. So let's dispose of iterated sets of conditions for this task.
Now the critical concern is whether this task is simply "extracting" data from a consistent and valid string (case "A"), or "validating AND extracting" data from a string (case "B") because the input cannot be 100% trusted to be consistent/correct.
In case A, you needn't concern yourself with producing valid elements in the output, so preg_split() or preg_match_all() are good candidates.
In case B, preg_split() would not be advisable, because it only hunts for delimiting substrings -- it remains ignorant of all other characters in the string.
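For case B, one possible approach is to validate the whole string before extracting anything. The sketch below is only an illustration: splitCodes() is a hypothetical helper name, and returning null for invalid input is an assumed convention, not something from the question.

```php
<?php
// splitCodes() is a hypothetical helper name used only for this sketch.
function splitCodes($input)
{
    // Validate the whole string first: one or more "letters + 4 digits" groups.
    if (preg_match('/^(?:[a-z]+\d{4})+\z/i', $input) !== 1) {
        return null; // reject anything that does not fit the format
    }
    // The string is known to be valid, so extraction cannot skip anything.
    preg_match_all('/[a-z]+\d{4}/i', $input, $matches);
    return $matches[0];
}

var_export(splitCodes('ABCDE1234ABCD1234ABCDEF1234'));
var_export(splitCodes('ABCDE12345ABCD'));
```

The second call returns null because the fifth digit and the trailing letters do not fit the expected repetition.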
Assuming this task is case A, then a decision is still pending about the better function to call. Well, both functions generate an array, but preg_match_all() creates a multidimensional array while you desire a flat array (like preg_split() provides). This means you would need to add a new variable to the global scope ($matches) and append [0] to the array to access the desired fullstring matches. To someone who doesn't understand regex patterns, this may border on the bad practice of using "magic numbers".
For me, I strive to code for Directness and Accuracy, then Efficiency, then Brevity and Clarity. Since you're not likely to notice any performance drops while performing such a small operation, efficiency isn't terribly important. I just want to make some comparisons to highlight the cost of a pattern that leverages only look-arounds or a pattern that misses an opportunity to greedily match predictable characters.
/(?<=\d{4})(?=[a-z])/i 79 steps (Demo)
~\d{4}\K~ 25 steps (Demo)
/[a-z]+[0-9]{4}\K/i 13 steps (Demo)
~\D+[0-9]{4}\K~ 13 steps (Demo)
~\D+\d{4}\K~ 13 steps (Demo)
FYI, \K is a metacharacter that means "restart the fullstring match", in other words "forget/release all previously matched characters up to this point". This effectively ensures that no characters are lost while splitting.
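To see the effect of \K in isolation, here is a small sketch using preg_match() rather than preg_split(). The variable names are made up for this illustration.

```php
<?php
// Without \K the letters are part of the reported match.
preg_match('~\D+\d{4}~', 'ABCDE1234', $plain);

// With \K the match start is reset, so only what follows \K is reported.
preg_match('~\D+\K\d{4}~', 'ABCDE1234', $reset);

var_export($plain[0]);
var_export($reset[0]);
```

When \K sits at the very end of the pattern, as in the split patterns above, the reported match is empty, so preg_split() cuts the string at that position without removing any characters.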
Suggested technique: (Demo)
var_export(
preg_split(
'~\D+\d{4}\K~', // pattern
'ABCDE1234ABCD1234ABCDEF1234', // input
0, // make unlimited explosions
PREG_SPLIT_NO_EMPTY // exclude empty elements
)
);
Output:
array (
0 => 'ABCDE1234',
1 => 'ABCD1234',
2 => 'ABCDEF1234',
)
Another regex question. I use PHP, and have a string: fdjkaljfdlstopfjdslafdj. You see there is a stop in the middle. I just want to replace everything else, excluding that stop. I tried to use [^stop], but it also excludes the s near the end of the string.
My Solution
Thanks for everyone’s help here.
I also figured out a solution using pure regex (I mean within the scope of my regex knowledge; PCRE verbs are too advanced for me). But it needs 2 steps. I don’t want to mix PHP functions in, because sometimes the job is outside of coding, e.g. multi-renaming files in Total Commander.
Let’s see the string: xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq. For example, I want to keep all foos in this string, and replace every other non-foo run with ---.
First, I need to find all the non-foo between each foo: (?<=foo).+?(?=foo).
The string will turn into xxxfoo---foo---foo---foolsjunegpq, leaving only the non-foo text at both ends.
Then use [^-]+(?=foo)|(?<=foo)[^-]+.
This time: ---foo---foo---foo---foo---. All words but foo have been turned into ---.
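For readers who do want to try the two steps in PHP, they can be reproduced with two preg_replace() calls. This is only a sketch; preg_replace() stands in here for the find-and-replace of whatever tool is being used.

```php
<?php
$s = 'xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq';

// Step 1: replace everything that sits between two foos.
$step1 = preg_replace('/(?<=foo).+?(?=foo)/', '---', $s);

// Step 2: replace the leftover text before the first and after the last foo.
$step2 = preg_replace('/[^-]+(?=foo)|(?<=foo)[^-]+/', '---', $step1);

echo $step1, "\n", $step2, "\n";
```

Note that step 2 relies on the original text containing no - characters of its own.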
i just dont want to include "stop"...
You can skip it by using the PCRE verbs (*SKIP)(*F). Try it like this:
stop(*SKIP)(*F)|.
Demo at regex101
or sequence: (stop)(*SKIP)(*F)|(?:(?!(?1)).)+
or for words: stop(*SKIP)(*F)|\w+
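A quick way to check the first variant is to replace every matched character with a dash. The short input string is made up for this sketch.

```php
<?php
// "stop" is matched and then deliberately failed, so the regex engine
// jumps past it; every other single character is matched by the dot.
$masked = preg_replace('/stop(*SKIP)(*F)|./', '-', 'abstopcd');

echo $masked, "\n";
```

Everything except the protected word is replaced, and the word itself is left untouched.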
[^stop] doesn't mean any text that is NOT stop. It just means any single character that is not one of the 4 characters inside [...], which in this case are s, t, o and p.
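A small sketch makes this visible: deleting everything that matches [^stop] keeps every s, t, o and p in the string, wherever they occur. The variable names are chosen for this illustration.

```php
<?php
$s = 'fdjkaljfdlstopfjdslafdj';

// Remove every character that is NOT one of s, t, o, p.
$kept = preg_replace('/[^stop]/', '', $s);

// The order of characters inside the class is irrelevant.
$same = preg_replace('/[^pots]/', '', $s);

echo $kept, "\n";
```

The stray s from later in the string survives as well, which is exactly the behaviour described in the question.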
Better to split on the text you don't want to match:
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';
php> $arr = preg_split('/stop/', $s);
php> print_r($arr);
Array
(
[0] => fdjkaljfdl
[1] => fjdslafdj
[2] => foobar
)
You can generalize this to any pattern:
(?<neg>stop)(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|(?&neg))
Demo
Just put the pattern you don't want in the neg group.
This regex will try to do the following for any character position:
Match the pattern you don't want. If it matches, discard it with (*SKIP)(*FAIL) and restart another match at this position.
If the pattern you don't want doesn't match at a particular position, then match anything, until either:
You reach the end of the input string (\Z)
Or the pattern you don't want immediately follows the current matching position ((?&neg))
This approach is slower than a manually tuned expression. You could get better performance at the cost of repeating yourself, which avoids the recursion:
stop(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|stop)
But of course, the best approach would be to use the features provided by your language: match the string you don't want, then use code to discard it and keep everything else.
In PHP, you can use the PREG_OFFSET_CAPTURE flag to tell the preg_match_all function to provide you the offsets of each match.
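A minimal sketch of that idea, assuming the unwanted pattern is the literal word stop: collect the offset of each match, then keep the text between them. The variable names are made up for this example.

```php
<?php
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';

// With PREG_OFFSET_CAPTURE each match is an array of (text, byte offset).
preg_match_all('/stop/', $s, $m, PREG_OFFSET_CAPTURE);

$kept = array();
$pos = 0;
foreach ($m[0] as list($text, $offset)) {
    if ($offset > $pos) {
        $kept[] = substr($s, $pos, $offset - $pos); // text before this match
    }
    $pos = $offset + strlen($text); // continue after the unwanted match
}
if ($pos < strlen($s)) {
    $kept[] = substr($s, $pos); // text after the last match
}

print_r($kept);
```

For a fixed delimiter like this one, preg_split() gives the same pieces more directly; the offsets become useful when you also need to know where each piece came from.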
I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.
Just use PHP's preg_split(), which returns an array of elements similarly to explode() but with RegEx.
Split the string on [ or ] and then remove the last element (which is an empty string) of the returned array, $tokens.
EDIT: Also, remove the 3rd element with array_splice($array, int $offset, int $length), since this item is also an empty string.
The regex /[\[\]]/ just means match any [ or ] character
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )
Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match() and the values you want would be extracted into the referenced array passed to the third argument (like this). But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want - preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
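As a sketch of that preg_match() usage, with array_shift() added here to drop the unwanted first element (that step and the variable names are assumptions of this example):

```php
<?php
$str = 'authors[0][system:id]';
$parts = array();

if (preg_match('/([a-z]+)\[([^\]]+)\]\[([^\]]+)\]/', $str, $m) === 1) {
    array_shift($m); // discard $m[0], the whole matched string
    $parts = $m;
}

print_r($parts);
```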
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position - with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily PCRE knows this and provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.
See it working
I would like to split a string in PHP containing quoted and unquoted substrings.
Let's say I have the following string:
"this is a string" cat dog "cow"
The split array should look like this:
array (
[0] => "this is a string"
[1] => "cat"
[2] => "dog"
[3] => "cow"
)
I'm struggling a bit with regex and I'm wondering if it is even possible to achieve with just one regex/preg_split-Call...
The first thing I tried was:
[[:blank:]]*(?=(?:[^"]*"[^"]*")*[^"]*$)[[:blank:]]*
But this splits only array[0] and array[3] correctly - the rest is split on a per-character basis.
Then I found this link:
PHP preg_split with two delimiters unless a delimiter is within quotes
(?=(?:[^"]*"[^"]*")*[^"]*$)
This seems to me like a good starting point. However, the result in my example is the same as with the first regex.
I tried combining both - first the one for quoted strings and then a second sub-regex which should omit quoted strings (therefore the [^"]):
(?=(?:[^"]*"[^"]*")*[^"]*$)|[[:blank:]]*([^"].*[^"])[[:blank:]]*
Therefore 2 questions:
Is it even possible to achieve what I want with just one regex/preg_split-Call?
If yes, I would appreciate a hint on how to assemble the regex correctly
Since matches cannot overlap, you could use preg_match_all like this:
preg_match_all('/"[^"]*"|\S+/', $input, $matches);
Now $matches[0] should contain what you are looking for. The regex will first try to match a quoted string, and then stop. If that doesn't do it it will just collect as many non-whitespace characters as possible. Since alternations are tried from left to right, the quoted version takes precedence.
EDIT: This will not get rid of the quotes though. To do this, you could use capturing groups:
preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);
Now $matches[1] will contain exactly what you are looking for. The (?| is there so that both capturing groups end up at the same index.
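Run against the string from the question, the branch-reset version can be checked like this (a small sketch; only the input string is taken from the question):

```php
<?php
$input = '"this is a string" cat dog "cow"';

// (?| ... ) is a branch-reset group: both alternatives capture into group 1.
preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);

print_r($matches[1]);
```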
EDIT 2: Since you were asking for a preg_split solution, that is also possible. We can use a lookahead, that asserts that the space is followed by an even number of quotes (up until the end of the string):
$result = preg_split('/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/', $input);
Of course, this will not get rid of the quotes, but that can easily be done in a separate step.
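One way that separate step could look is sketched below. The lookahead used here carries a trailing [^"]* so that an unquoted word at the very end of the input is also split off, and trimming with trim() is just one assumed way of removing the quotes.

```php
<?php
$input = '"this is a string" cat dog "cow"';

// Split on whitespace that is followed by an even number of quotes.
$parts = preg_split('/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/', $input);

// Separate step: strip the surrounding double quotes from each piece.
$clean = array_map(function ($part) {
    return trim($part, '"');
}, $parts);

print_r($clean);
```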
Let's say for an instance I have this string:
var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;
How should I match all the initializations assigned as integer? Here's my current try, but I don't understand why it's matching everything.
$pattern = '/var (.*=\d)/';
preg_match_all($pattern,$page,$matches);
EDIT: I'm trying to match each initialization:
1 => a=23434
2 => bc=3434
and so on...
EDIT: Here's an update on my try:
$pattern = '/[^v^a^r] (.*=\d+),/';
preg_match_all($pattern,$page,$matches);
0 => 'var a=23434,bc=3434,erd=5656,'
1 => 'a=23434,bc=3434,erd=5656'
The function is using "greedy" matching. You don't want that. In PHP, you can either follow your wildcard with a ? to specify non-greedy matching, as in:
$pattern = '/var (.*?=\d)/';
or using the U flag as documented here, as in:
$pattern = '/var (.*=\d)/U';
which will make all wildcards use non-greedy matching.
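The difference is easiest to see on a shortened version of the input. The following is only a sketch; the variable names are made up for this comparison.

```php
<?php
$page = "var a=23434,bc=3434;";

preg_match('/var (.*=\d)/', $page, $greedy);   // .* runs to the LAST "=digit"
preg_match('/var (.*?=\d)/', $page, $lazy);    // .*? stops at the FIRST "=digit"
preg_match('/var (.*=\d)/U', $page, $flagged); // U flips the default greediness

echo $greedy[1], "\n", $lazy[1], "\n", $flagged[1], "\n";
```

Because the pattern ends in a single \d, the lazy versions also stop after the first digit of the value.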
EDIT: Also, since you're including "var", you would probably need to change it to
$pattern = '/var (.*?=\d)*/';
or
$pattern = '/var (.*=\d)*/U';
to match any number of (.*=\d) patterns.
EDIT: Update per discussion:
PHP
$page = "var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;";
$pattern = '/([a-zA-Z]+=\d+)/';
preg_match_all($pattern,$page,$matches);
print_r($matches[1]);
Produces
Array
(
[0] => a=23434
[1] => bc=3434
[2] => erd=5656
[3] => wer=4554
)
Note: This filters out the entries that have the RHS enclosed in single quotes. If you don't want that, let us know.
EDIT #2: My answer to your question exceeded the size of the comment box so I edited my answer.
The [a-zA-Z] expression matches only alphabetical characters of either case. Note that the updated code also removed the "ungreedy" modifier, so we actually want it to be greedy now. And since we want it to be greedy, the . will "eat" too much. Go ahead, play around with the code and see what happens when you change it to .*; it is a good opportunity to get more familiar with regex.
Since the . "eats" too much, we need to restrict it from matching all characters to matching the ones we want. We could have used something like
$pattern = '/([^\s,]*=\d+)/';
where the [^\s,]* would match any number of non-whitespace, non-comma characters. This would also have worked for your test cases.
But in this case, we can say confidently what the characters we want to include are, so instead of "blacklisting" characters, we'll "whitelist" them. In this case we specify that we want to match any alphabetical character of either case.
As is the case with many things, especially in programming, there are many ways to skin a cat. There are a number of alternative regex patterns that would have also worked for your test cases. It's up to you to understand the limits of each, how they will perform on edge cases, and how maintainable they are, and make a decision.
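As one example of such an edge case, the two patterns above start to disagree as soon as a variable name contains a digit or an underscore. The input below is hypothetical and was chosen only to show the difference.

```php
<?php
// Hypothetical input with a name containing a digit and an underscore.
$page = "var x_1=5,y=6;";

preg_match_all('/([a-zA-Z]+=\d+)/', $page, $white); // whitelist: letters only
preg_match_all('/([^\s,]*=\d+)/', $page, $black);   // blacklist: no spaces/commas

print_r($white[1]);
print_r($black[1]);
```

The whitelist pattern silently skips x_1, while the blacklist pattern picks it up; which behaviour is right depends on what names your real input can contain.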
You don't have to use regex:
$string = rtrim(substr($string, 4), ';'); // remove the leading 'var ' and the trailing ';'
$pairs = explode(',', $string); // split using the comma
foreach ($pairs as $pair) {
    list($key, $value) = explode('=', $pair, 2);
    // explode() returns strings, so is_int() would always be false here
    if (ctype_digit($value)) {
        // this is an integer
    } else {
        // not an integer
    }
}
Try this regex
$pattern = '/([a-zA-Z]+=\d+)/';