I am trying to split a string in php, which looks like this:
ABCDE1234ABCD1234ABCDEF1234
into an array of strings which, in this case, would look like this:
ABCDE1234
ABCD1234
ABCDEF1234
So the pattern is "an undefined number of letters, and then 4 digits, then an undefined number of letters and 4 digits etc."
I'm trying to split the string using preg_split like this:
$pattern = "#[0-9]{4}$#";
preg_split($pattern, $stringToSplit);
And it returns an array containing the full string (not split) in the first element.
I'm guessing the problem here is my regex as I don't fully understand how to use them, and I am not sure if I'm using it correctly.
So what would be the correct regex to use?
You don't want preg_split, you want preg_match_all:
$str = 'ABCDE1234ABCD1234ABCDEF1234';
preg_match_all('/[a-z]+[0-9]{4}/i', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
array(3) {
[0]=>
string(9) "ABCDE1234"
[1]=>
string(8) "ABCD1234"
[2]=>
string(10) "ABCDEF1234"
}
}
PHP uses PCRE-style regexes which let you do lookbehinds. You can use this to see if there are 4 digits "behind" you. Combine that with a lookahead to see if there's a letter ahead of you, and you get this:
(?<=\d{4})(?=[a-z])
Notice the dotted lines on the Debuggex Demo page. Those are the points you want to split on.
In PHP this would be:
var_dump(preg_split('/(?<=\d{4})(?=[a-z])/i', 'ABCDE1234ABCD1234ABCDEF1234'));
Use the principle of contrast:
\D+\d{4}
# requires at least one non digit
# followed by exactly four digits
See a demo on regex101.com.
In PHP this would be:
<?php
$string = 'ABCDE1234ABCD1234ABCDEF1234';
$regex = '~\D+\d{4}~';
preg_match_all($regex, $string, $matches);
?>
See a demo on ideone.com.
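To get the flat list the question asks for, read index 0 of $matches, which holds the full-pattern matches. A minimal sketch (the variable name $parts is only for illustration):

```php
<?php
$string = 'ABCDE1234ABCD1234ABCDEF1234';

// preg_match_all() fills $matches with one sub-array per capture group;
// index 0 always holds the full-pattern matches.
preg_match_all('~\D+\d{4}~', $string, $matches);

// $parts is an illustrative name, not part of the original answer.
$parts = $matches[0];
print_r($parts);
```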
I'm no good at regex so here is the road less traveled:
<?php
$s = 'ABCDE1234ABCD1234ABCDEF1234';
$nums = range(0,9);
$num_hit = 0;
$i = 0;
$arr = array();
foreach(str_split($s) as $v)
{
if(isset($nums[$v]))
{
++$num_hit;
}
if(!isset($arr[$i]))
{
$arr[$i] = '';
}
$arr[$i].= $v;
if($num_hit === 4)
{
++$i;
$num_hit = 0;
}
}
print_r($arr);
First, why is your attempted pattern not delivering the desired output? Because the $ anchor tells the function to explode the string by using only the final four digits as the "delimiter" (characters that are consumed while dividing the string into separate parts).
Your result:
array (
0 => 'ABCDE1234ABCD1234ABCDEF', // an element of characters before the last four digits
1 => '', // an empty element containing the non-existent characters after the four digits
)
In plain English, to fix your pattern, you must:
Not consume any characters while exploding and
Ensure that no empty elements are generated.
My snippet is at the bottom of this post.
Second, there seems to be some debate about what regex function to use (or even if regex is a preferable tool).
My stance is that a non-regex method requires a long-winded block of lines which is equally, if not more, difficult to read than a regex pattern. Regex lets you generate the result in one line, and not in an unsightly fashion. So let's dispose of iterated sets of conditions for this task.
Now the critical concern is whether this task is simply "extracting" data from a consistent and valid string (case "A"), or if it is "validating AND extracting" data from a string (case "B") because the input cannot be 100% trusted to be consistent/correct.
In case A, you needn't concern yourself with producing valid elements in the output, so preg_split() or preg_match_all() are good candidates.
In case B, preg_split() would not be advisable, because it only hunts for delimiting substrings -- it remains ignorant of all other characters in the string.
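For case B, one possible approach is to validate the whole string first and only then extract. The sketch below assumes "valid" means the input consists entirely of groups of letters followed by four digits; the helper name extractChunks() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns the list of chunks, or null when the input
 * is not made up entirely of "letters followed by four digits" groups.
 */
function extractChunks(string $input): ?array
{
    // Validate the whole string first. The D modifier stops $ from
    // matching before a trailing newline.
    if (!preg_match('/^(?:[a-z]+\d{4})+$/iD', $input)) {
        return null;
    }

    // The input is known to be well formed, so plain extraction is safe.
    preg_match_all('/[a-z]+\d{4}/i', $input, $matches);

    return $matches[0];
}

var_export(extractChunks('ABCDE1234ABCD1234ABCDEF1234'));
```

With this shape, a string such as 'ABCDE12345ABCD1234' (five digits in the first group) yields null instead of a partially correct array.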
Assuming this task is case A, then a decision is still pending about the better function to call. Well, both functions generate an array, but preg_match_all() creates a multidimensional array while you desire a flat array (like preg_split() provides). This means you would need to add a new variable to the global scope ($matches) and append [0] to the array to access the desired fullstring matches. To someone who doesn't understand regex patterns, this may border on the bad practice of using "magic numbers".
For me, I strive to code for Directness and Accuracy, then Efficiency, then Brevity and Clarity. Since you're not likely to notice any performance drops while performing such a small operation, efficiency isn't terribly important. I just want to make some comparisons to highlight the cost of a pattern that leverages only look-arounds or a pattern that misses an opportunity to greedily match predictable characters.
/(?<=\d{4})(?=[a-z])/i 79 steps (Demo)
~\d{4}\K~ 25 steps (Demo)
/[a-z]+[0-9]{4}\K/i 13 steps (Demo)
~\D+[0-9]{4}\K~ 13 steps (Demo)
~\D+\d{4}\K~ 13 steps (Demo)
FYI, \K is a metacharacter that means "restart the fullstring match", in other words "forget/release all previously matched characters up to this point". This effectively ensures that no characters are lost while splitting.
Suggested technique: (Demo)
var_export(
preg_split(
'~\D+\d{4}\K~', // pattern
'ABCDE1234ABCD1234ABCDEF1234', // input
0, // make unlimited explosions
PREG_SPLIT_NO_EMPTY // exclude empty elements
)
);
Output:
array (
0 => 'ABCDE1234',
1 => 'ABCD1234',
2 => 'ABCDEF1234',
)
Related
We are developing a small module for one of our clients. In this module, we have to validate input values with a predefined pattern. So, we are using a regular expression to achieve this.
We are trying to create a regular expression but we have failed each time.
Values are allowed in the format of:
1#4#5#654#12....
where 0<=n<=100000000000
Some examples of input strings are given below:
1#23#567#3#98 - Valid string
#1#45#21#4# - Invalid
67##78#56#09# - Invalid
0#0#0 - Valid
Only positive numbers and the hash sign are allowed in the above values. The string should start and end with a number. Only one hash is allowed between two numbers. There are no restrictions for the length of the string.
Can anyone please share the regular expression which would match valid strings?
Thanks in advance!
^(10{11}|\d{1,11})(#(10{11}|\d{1,11}))+$
This should ensure the pattern does not match any numbers that exceed One Hundred Billion.
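A quick way to check this pattern is to run it against the strings from the question plus two boundary values. This is only a sketch: the last two samples and the D modifier (which stops $ from matching before a trailing newline) are additions made for illustration.

```php
<?php
// Pattern from the answer above; the D modifier is an addition for this sketch.
$regex = '/^(10{11}|\d{1,11})(#(10{11}|\d{1,11}))+$/D';

$samples = [
    '1#23#567#3#98',   // valid
    '#1#45#21#4#',     // invalid: leading and trailing hash
    '67##78#56#09#',   // invalid: double hash and trailing hash
    '0#0#0',           // valid
    '100000000000#5',  // valid: exactly one hundred billion
    '100000000001#5',  // invalid: twelve digits, above the limit
];

$results = [];
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($regex, $sample);
    echo $sample, ' => ', $results[$sample], PHP_EOL;
}
```

Note that the trailing + requires at least one hash, so a lone number such as 5 does not match this pattern.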
$strings = array(
'67##78#56#09#32',
'1#23#567#3#98',
'#1#45#21#4#',
'67##78#56#09#',
'0#0#0'
);
$regex = '/^\d+(#\d+)*$/';
foreach ($strings as $string) {
var_dump(preg_match($regex, $string));
}
The output would be:
int(0)
int(1)
int(0)
int(0)
int(1)
If you need to restrict the values to between 0 and 100 billion, you can use the (10{11}|\d{1,11}) grouping (same as @Flosculus). Next, you can recurse this subpattern with the help of \g<1> to make the pattern shorter (in PCRE only).
Thus, if you allow digit-only input, use this regex:
^(10{11}|\d{1,11})(?:#\g<1>)*$
^
Or if there must be at least one hash, use this regex:
^(10{11}|\d{1,11})(?:#\g<1>)+$
^
PHP demo:
$re = '/^(10{11}|\d{1,11})(?:#\g<1>)+$/';
$str = array("1#23#567#3#98", "0#0#0", "#1#45#21#4#", "67##78#56#09#");
$n = preg_grep($re, $str);
print_r($n);
Returns:
[0] => 1#23#567#3#98
[1] => 0#0#0
I have a regular expression that I'm using in PHP:
$word_array = preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path), NULL, PREG_SPLIT_NO_EMPTY
);
It works great. It takes a chunk of URL parameters like:
/2009/06/pagerank-update.html
and returns an array like:
array(4) {
[0]=>
string(4) "2009"
[1]=>
string(2) "06"
[2]=>
string(8) "pagerank"
[3]=>
string(6) "update"
}
The only thing I need is for it to also not return strings that are less than 3 characters. So the "06" string is garbage and I'm currently using an if statement to weed them out.
The magic of the split. My original assumption was technically not correct (albeit a solution easier to come to). So let's check your split pattern:
(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
I re-arranged it a bit. The outer parenthesis is not necessary and I moved the single characters into a character class at the end:
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
So much for tidying up. Let's call this pattern the split pattern, s for short, and define it as a named subpattern.
You want to match all parts that are not of those characters from the split-at pattern and at minimum three characters.
I could achieve this with the following pattern, including support of the correct split sequences and unicode support.
$pattern = '/
(?(DEFINE)
(?<s> # define subpattern which is the split pattern
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\\/._=?&%+-] # a little bit optimized with a character class
)
)
(?:(?&s)) # consume the subpattern (URL starts with \/)
\K # capture starts here
(?:(?!(?&s)).){3,} # ensure this is not the skip pattern, take 3 characters minimum
/ux';
Or in smaller:
$path = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject = urldecode($path);
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
The same principle can be used with preg_split as well. It's a little bit different:
$pattern = '/
(?(DEFINE) # define subpattern which is the split pattern
(?<s>
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\/._=?&%+-]
)
)
(?:(?!(?&s)).){3,}(*SKIP)(*FAIL) # three or more is okay
|(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT) # two or one is none
|(?&s) # split # split, at least
/ux';
Usage:
$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
These routines work as asked for. But this does have its price with performance. The cost is similar to the old answer.
Related questions:
Antimatch with Regex
Split string by delimiter, but not if it is escaped
Old answer, doing a two-step processing (first splitting, then filtering)
Because you are using a split routine, it will split - regardless of the length.
So what you can do is filter the result. You can do that again with a regular expression (preg_filter), for example one that drops everything shorter than three characters:
$word_array = preg_filter(
'/^.{3,}$/', '$0',
preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path),
-1, // no limit (passing NULL here is deprecated as of PHP 8.1)
PREG_SPLIT_NO_EMPTY
)
);
Result:
Array
(
[0] => 2009
[2] => pagerank
[3] => update
)
I'm guessing you're building a URL router of some kind.
Detecting which parameters are useful and which are not should not be part of this code. It may vary per page whether a short parameter is relevant.
In this case, couldn't you just ignore the element at index 1? Your page (or 'handler') should know which parameters it wants to be called with, so it should do the triage.
I would think that if you were trying to derive meaning from the URLs, you would actually want to write clean URLs in such a way that you don't need a complex regex to derive the value.
In many cases this involves using server redirect rules and a front controller or request router.
So what you build are clean URLs like
/value1/value2/value3
Without any .html,.php, etc. in the URL at all.
It seems to me that you are not adequately addressing the problem at the point of entry into the system (i.e. the web server), which would make your URL parsing as simple as it should be.
How about trying preg_match_all() instead of preg_split()?
The pattern (using the Assertions):
/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu
The function call:
$pattern = '/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu';
$subject = '/2009/06/pagerank-update.html';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
You can try the function here: functions-online.com/preg_match_all.html
Hope this helps
Don't use a regex to break apart that path. Just use explode.
$dirs = explode( '/', urldecode($path) );
Then, if you need to break apart an individual element of the array, do that, like on your "pagerank-update" element at the end.
EDIT:
The key is that you have two different problems. First you want to break apart the path elements on slashes. Then, you want to break up the filename into smaller parts. Don't try to cram everything into one regex that tries to do everything.
Three discrete steps:
$dirs = explode...
Weed out arguments < 3 chars
Break up file argument at the end
It is far clearer if you break up your logic into discrete logical chunks rather than trying to make the regex do everything.
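The three steps above can be sketched as follows. The ordering here is an assumption: the filename is broken up before the short pieces are weeded out, so that its parts go through the same length filter. Splitting the filename on - and _ and dropping the extension with pathinfo() are also illustrative choices, not something the answer prescribes.

```php
<?php
$path = '/2009/06/pagerank-update.html';

// Step 1: break the path apart on slashes, discarding empty segments.
$dirs = array_values(array_filter(explode('/', urldecode($path)), 'strlen'));

// Step 2: take the file argument off the end, drop its extension,
// and break it up on dashes and underscores.
$file = (string) array_pop($dirs);
$fileWords = preg_split('/[-_]/', pathinfo($file, PATHINFO_FILENAME), -1, PREG_SPLIT_NO_EMPTY);

// Step 3: weed out anything shorter than three characters.
$words = array_values(array_filter(
    array_merge($dirs, $fileWords),
    function ($word) {
        return strlen($word) >= 3;
    }
));

print_r($words);
```

Each step can be changed on its own, for example to keep short segments for pages where they are meaningful.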
I have a PHP array of strings. The strings are supposed to represent PIN codes which are of 6 digits like:
560095
Having a space after the first 3 digits is also considered valid e.g. 560 095.
Not all array elements are valid. I want to filter out all invalid PIN codes.
Yes you can make use of regex for this.
PHP has a function called preg_grep to which you pass your regular expression and it returns a new array with entries from the input array that match the pattern.
$new_array = preg_grep('/^\d{3} ?\d{3}$/',$array);
Explanation of the regex:
^ - Start anchor
\d{3} - 3 digits. Same as [0-9][0-9][0-9]
? - optional space (there is a space before ?)
If you want to allow any number of any whitespace between the groups
you can use \s* instead
\d{3} - 3 digits
$ - End anchor
Yes, you can use a regular expression to make sure there are 6 digits with or without a space.
A neat tool for playing with regular expressions is RegExr... here's what RegEx I came up with:
^[0-9]{3}\s?[0-9]{3}$
It matches the beginning of the string ^, then any three numbers [0-9]{3} followed by an optional space \s? followed by another three numbers [0-9]{3}, followed by the end of the string $.
Passing the array into the PHP function preg_grep along with the regex will return a new array containing only the matching entries (their original keys are preserved).
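As a rough illustration of that call, with sample data made up for this sketch:

```php
<?php
// Made-up sample input; only the first two entries are valid PIN codes.
$array = ['560095', '560 095', '56009', '5600 95', 'abc123'];

// preg_grep() keeps the entries that match and preserves their keys.
$valid = preg_grep('/^[0-9]{3}\s?[0-9]{3}$/', $array);

print_r($valid);
```

Because the keys are preserved, wrap the call in array_values() if a re-indexed list is needed.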
If you just want to iterate over the valid responses (loop over them), you could always use a RegexIterator:
$regex = '/^\d{3}\s?\d{3}$/';
$it = new RegexIterator(new ArrayIterator($array), $regex);
foreach ($it as $valid) {
//Only matching items will be looped over, non-matching will be skipped
}
It has the benefit of not copying the entire array (it computes the next one when you want it). So it's much more memory efficient than doing something with preg_grep for large arrays. But it also will be slower if you iterate multiple times (but for a single iteration it should be faster due to the memory usage).
If you want to get an array of the valid PIN codes, use codaddict's answer.
You could also, at the same time as filtering only valid PINs, remove the optional space character so that all PINs become 6 digits by using preg_filter:
$new_array = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);
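A short sketch of what that call produces, using made-up sample data:

```php
<?php
// Made-up sample input for illustration.
$array = ['560095', '560 095', '12345', '560-095'];

// preg_filter() returns only the subjects that matched, after replacement.
// Here the two captured groups are glued together, dropping the space.
$normalized = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);

print_r($normalized);
```

Entries that do not match are left out of the result rather than being returned unchanged, which is what distinguishes preg_filter from preg_replace.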
The best answer might depend on your situation, but if you wanted to do a simple and low cost check first...
$item = str_replace( " ", "", $var );
if ( strlen( $item ) !== 6 ){
echo 'fail early';
}
Following that, you could equally go on and do some type checking, as long as valid numbers do not start with a 0, in which case it might be more difficult.
If you don't fail early, then go on with the regex solutions already posted.
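One way to combine the early checks with a regex is sketched below. The helper name isValidPin() is hypothetical, and ctype_digit() is used as the "type check" on the assumption that the PIN stays a string, which sidesteps the leading-zero concern.

```php
<?php
/**
 * Hypothetical helper combining the cheap checks with a final regex check.
 */
function isValidPin(string $var): bool
{
    // Fail early: without spaces the value must be exactly 6 characters.
    $item = str_replace(' ', '', $var);
    if (strlen($item) !== 6) {
        return false;
    }

    // ctype_digit() inspects the string itself, so leading zeros survive.
    if (!ctype_digit($item)) {
        return false;
    }

    // The checks above would still accept "5 6 0 0 9 5",
    // so confirm the actual layout with the regex.
    return preg_match('/^\d{3} ?\d{3}$/D', $var) === 1;
}

var_dump(isValidPin('560 095'));
```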
I am trying to learn regex. I have the string:
$x = "5ft2inches";
How can I read [5,2] into an array using a regex?
If you are assuming that the string will be of the form "{number}ft{number}inches" then you can use preg_match():
preg_match('/(\d+)ft(\d+)inches/', $string, $matches);
(\d+) will match a string of one or more digits. The parentheses tell preg_match() to place the matched numbers into the $matches variable (the third argument to the function). The function returns 1 if it made a match, or 0 if it didn't.
Here is what $matches looks like after a successful match:
Array
(
[0] => 5ft2inches
[1] => 5
[2] => 2
)
The entire matched string is the first element, then the parenthesized matches follow. So to make your desired array:
$array = array($matches[1], $matches[2]);
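Put together, a minimal sketch might look like this. The ^ and $ anchors and the integer cast are additions for illustration; they assume the whole string must have this form and that numeric values are wanted.

```php
<?php
$x = '5ft2inches';

if (preg_match('/^(\d+)ft(\d+)inches$/', $x, $matches)) {
    // Drop the full match at index 0 and cast the two captures to integers.
    $array = array_map('intval', array_slice($matches, 1));
} else {
    // No match: fall back to an empty array.
    $array = [];
}

print_r($array);
```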
Assuming PHP, any reason no one has suggested split?
$numbers = preg_split('/[^0-9]+/', $x, -1, PREG_SPLIT_NO_EMPTY);
In Perl:
#!/usr/bin/perl
use strict;
use warnings;
my $x = "5ft2inches";
my %height;
@height{qw(feet inches)} = ($x =~ /^([0-9]+)ft([0-9]+)inches$/);
use Data::Dumper;
print Dumper \%height;
Output:
$VAR1 = {
'feet' => '5',
'inches' => '2'
};
Or, using split:
@height{qw(feet inches)} = split /ft|inches/, $x;
The regular expression is simply /[0-9]+/ but how to get it into an array depends entirely on what programming language you're using.
With Regular Expressions, you can either extract your data in a contextless way, or a contextful way.
I.e., if you match for any digits: (\d+) (NB: Assumes that your language honors \d as the shortcut for 'any digit')
You can then extract each group, but you might not know that your string was actually "5 2inches" instead of "6ft2inches" OR "29Cabbages1Fish4Cows".
If you add context: (\d+)ft(\d+)inches
You know for sure what you've extracted (Because otherwise you'd not get a match) and can refer to each group in turn to get the feet and inches.
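The difference can be seen by running both styles over a well-formed and a malformed input. This is a sketch in PHP (the question does not name a language); the array names are illustrative.

```php
<?php
$inputs = ['5ft2inches', '29Cabbages1Fish4Cows'];

$contextless = [];
$contextful = [];

foreach ($inputs as $input) {
    // Contextless: grab every run of digits, whatever surrounds it.
    preg_match_all('/\d+/', $input, $all);
    $contextless[$input] = $all[0];

    // Contextful: only succeed when the digits sit in the expected frame.
    $contextful[$input] = preg_match('/(\d+)ft(\d+)inches/', $input, $m)
        ? [$m[1], $m[2]]
        : null;
}

var_export($contextless);
var_export($contextful);
```

The contextless pattern happily returns 29, 1 and 4 for the second input, while the contextful pattern reports no match for it.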
If you're not always going to have a pair of numbers to extract, you'll need to make the various components optional. Check out This Regular Expression Cheat Sheet (his other cheat sheets are nifty too) for more info.
You don't mention the language you are using, so here is the general solution: You don't "extract" the numbers, you replace everything except numbers with an empty string.
In C#, this would look like
string numbers = Regex.Replace(dirtyString, "[^0-9]", "");
Have to watch out for double digit numbers.
/\d{1,2}/
might work a little better for feet and inches. The max value of '2' should be upped to whatever is appropriate.
use `/[\d]/`
I want to discard all remaining characters in a string as soon as one of several unwanted characters is encountered.
As soon as a blacklisted character is encountered, the string before that point should be returned.
For instance, if I have an array:
$chars = array("a", "b", "c");
How would I go through the following string...
log dog hat bat
...and end up with:
log dog h
The strcspn function is what you are looking for.
<?php
$mask = "abc";
$string = "log dog hat bat";
$result = substr($string,0,strcspn($string,$mask));
var_dump($result);
?>
There is certainly nothing wrong with Vinko's answer and I might be more inclined to recommend that technique in a professional script because regex is likely to perform slower, but purely for a point of difference for researchers, regex could be used.
For the record, to convert the array of ['a', 'b', 'c'] to abc, just call implode($array) -- an empty glue string is not necessary.
Code: (Demo) -- split in half on first occurrence of a|b|c, then access first element
echo preg_split('~[abc]~', $string, 2)[0];
Code: (Demo) -- match leading substring of non-a|b|c characters, then access first element
echo preg_match('~^[^abc]+~', $string, $match) ? $match[0] : '';
I should state that if any of your blacklisted characters have special meaning to the regex engine while inside of a character class, then they will need to be escaped.
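One way to handle that escaping is preg_quote(), which also escapes the characters that are special inside a character class (such as ^, ], - and the backslash). A minimal sketch, with a blacklist chosen purely to exercise those characters:

```php
<?php
// Illustrative blacklist made of characters that are special in a character class.
$chars = ['^', ']', '-'];
$string = 'log dog h-t b^t';

// Escape the blacklist (and the ~ delimiter) before building the class.
$class = preg_quote(implode($chars), '~');

// Split once at the first blacklisted character and keep the leading part.
$result = preg_split('~[' . $class . ']~', $string, 2)[0];

echo $result;
```

If no blacklisted character occurs, preg_split() returns the whole string as the only element, so index 0 is still the right thing to read.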