I'm writing a PHP function to extract numeric ids from a string like:
$test = '123_123_Foo';
At first I took two different approaches, one with preg_match_all():
$test2 = '123_1256_Foo';
preg_match_all('/[0-9]{1,}/', $test2, $matches);
print_r($matches[0]); // Result: 'Array ( [0] => 123 [1] => 1256 )'
and the other with preg_replace() and explode():
$test = preg_replace('/[^0-9_]/', '', $test);
$output = array_filter(explode('_', $test));
print_r($output); // Results: 'Array ( [0] => 123 [1] => 1256 )'
Either of them works well as long as the string does not contain mixed letters and numbers, like:
$test2 = '123_123_234_Foo2';
The predictable result is Array ( [0] => 123 [1] => 123 [2] => 234 [3] => 2 )
So I wrote another regex to get rid of mixed strings:
$test2 = preg_replace('/([a-zA-Z]{1,}[0-9]{1,}[a-zA-Z]{1,})|([0-9]{1,}[a-zA-Z]{1,}[0-9]{1,})|([a-zA-Z]{1,}[0-9]{1,})|([0-9]{1,}[a-zA-Z]{1,})|[^0-9_]/', '', $test2);
$output = array_filter(explode('_', $test2));
print_r($output); // Results: 'Array ( [0] => 123 [1] => 123 [2] => 234 )'
The problem is evident too: more complicated patterns like Foo2foo12foo1 would pass the filter. And here's where I got a bit stuck.
Recap:
Extract a variable amount of chunks of numbers from a string.
The string contains at least 1 number, and may contain other numbers
and letters separated by underscores.
Only numbers not preceded or followed by letters must be extracted.
Only the numbers in the first half of the string matter.
Since only the first half is needed, I decided to split at the first occurrence of a letter or a mixed number-letter chunk with preg_split():
$test2 = '123_123_234_1Foo2';
$output = preg_split('/([0-9]{1,}[a-zA-Z]{1,})|[^0-9_]/', $test2, 2);
preg_match_all('/[0-9]{1,}/', $output[0], $matches);
print_r($matches[0]); // Results: 'Array ( [0] => 123 [1] => 123 [2] => 234 )'
My question is whether there is a simpler, safer or more efficient way to achieve this result.
If I understand your question correctly, you want to split an underscore-delimited string, and filter out any substrings that are not numeric. If so, this can be achieved without regex, with explode(), array_filter() and ctype_digit(); e.g:
<?php
$str = '123_123_234_1Foo2';
$digits = array_filter(explode('_', $str), function ($substr) {
return ctype_digit($substr);
});
print_r($digits);
This yields:
Array
(
[0] => 123
[1] => 123
[2] => 234
)
Note that ctype_digit():
Checks if all of the characters in the provided string are numerical.
So $digits is still an array of strings, albeit numeric.
Hope this helps :)
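If, as the question's recap says, only the leading run of numeric chunks matters, the loop can stop at the first chunk that is not purely digits. A minimal sketch under that assumption; the name leadingIds() is made up for illustration and is not part of the answer above:

```php
<?php
// Hypothetical helper: collect digit-only chunks until the first non-numeric one.
function leadingIds(string $str, string $delim = '_'): array
{
    $ids = [];
    foreach (explode($delim, $str) as $chunk) {
        // ctype_digit() is false for mixed chunks and for empty strings
        if (!ctype_digit($chunk)) {
            break; // the first non-numeric chunk ends the "numeric half"
        }
        $ids[] = $chunk;
    }
    return $ids;
}

$ids = leadingIds('123_123_234_1Foo2_99');
print_r($ids); // the trailing 99 is ignored because it comes after 1Foo2
```

Unlike array_filter(), this also stops at an empty chunk (for example a doubled underscore), which may or may not be what you want.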
Getting just the numeric part of the string after the explode
$test2 = "123_123_234_1Foo2";
$digits = array_filter(explode('_', $test2 ), 'is_numeric');
var_dump($digits);
Result
array(3) { [0]=> string(3) "123" [1]=> string(3) "123" [2]=> string(3) "234" }
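One thing to keep in mind when choosing between this and ctype_digit(): is_numeric() also accepts signed, decimal and exponent notation, so the two filters can disagree. A small comparison on made-up sample chunks:

```php
<?php
// Sample chunks chosen for illustration only.
$chunks = ['123', '1e3', '1.5', '-5', 'Foo'];

// array_values() reindexes the filtered arrays from 0.
$numeric = array_values(array_filter($chunks, 'is_numeric'));
$digits  = array_values(array_filter($chunks, 'ctype_digit'));

var_dump($numeric); // keeps '123', '1e3', '1.5' and '-5'
var_dump($digits);  // keeps only '123'
```

If the ids are always plain digit runs, ctype_digit() is probably the stricter and safer check.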
Use strtok
Regex isn't a magic bullet, and there are FAR simpler fixes for your problem, especially considering you're trying to split on a delimiter.
Any of the following approaches would be cleaner, and more maintainable, and the strtok() approach would probably perform better:
Use explode to create and loop through an array, checking each value.
Use preg_split to do the same, but with a more adaptable approach.
Use strtok, as it is designed exactly for this use-case.
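Basic example for your case:
function strGetInts(string $str, string $delim) {
    // The first call takes the string and the delimiter characters
    $word = strtok($str, $delim);
    while (false !== $word) {
        // Tokens are always strings, so check for digit-only content
        // (is_integer() would never be true here)
        if (ctype_digit($word)) {
            yield (int) $word;
        }
        // Subsequent calls take only the delimiter characters
        $word = strtok($delim);
    }
}
$test2 = '123_1256_Foo';
foreach (strGetInts($test2, '_-') as $key) {
    print_r($key);
}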
Note: the second argument to strtok is a string in which EVERY character acts as a delimiter. Thus, my example splits the input on underscores or dashes.
Additional Note: If and only if the string only needs to be split on a single delimiter (underscore only), a method using explode will likely result in better performance. For such a solution, see the other answer in this thread: https://stackoverflow.com/a/46937452/1589379 .
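The preg_split option from the list above could look roughly like this. It is only a sketch: the helper name splitInts() and the sample input are invented for illustration.

```php
<?php
// Hypothetical helper: split on runs of "_" or "-" and keep the integer chunks.
function splitInts(string $str): array
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading or doubled delimiters
    $words = preg_split('/[_-]+/', $str, -1, PREG_SPLIT_NO_EMPTY);

    // Keep digit-only chunks, reindex, and cast each one to int
    return array_map('intval', array_values(array_filter($words, 'ctype_digit')));
}

$ints = splitInts('123_1256-77__Foo_9bar');
var_dump($ints); // 9bar and Foo are skipped because they are not digit-only
```

The character class is the adaptable part: adding another delimiter only means adding another character to it.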
I have a string like this:
2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
As you can see, there are semicolons between values. I want to split this string only at the semicolons that come before 7-digit values, so I should have this:
>2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
>4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;
>7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
The only thing that I can think of is explode(';',$string) but this returns this:
>2234323,23,23,44,433;
>3,23,44,433;
>23,23,44,433;
>23,23,44,433
>4534453,23,23,44,433;
>3,23,44,433;
>23,23,44,433;23,23,44,433;
>7545455,23,23,44,433;
>3,23,44,433;23,23,44,433;
>23,23,44,433
Is there any fast method to split a string with this format based on the ";" before 7-digit values?
You can use preg_split for that:
$s = '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433';
var_dump(preg_split('/(;\d{7},)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE));
Your output will be
array(5) {
[0] =>
string(58) "2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433"
[1] =>
string(9) ";4534453,"
[2] =>
string(50) "23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433"
[3] =>
string(9) ";7545455,"
[4] =>
string(50) "23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433"
}
I think that the next thing (combine the 1st and 2nd and then 3rd and 4th elements) is not a big deal :)
Let me know if you still have problems here.
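For completeness, the recombination step mentioned above might be sketched like this. The input is a shortened version of the question's string, and the variable names are only illustrative:

```php
<?php
// Shortened sample in the same format as the question's data.
$s = '2234323,23,23,44,433;3,23,44,433;4534453,23,23,44,433;3,23,44,433;7545455,23,23,44,433;23,23,44,433';
$parts = preg_split('/(;\d{7},)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE);

// $parts[0] is the first record; after it, captured delimiters and remainders alternate.
$records = [$parts[0]];
for ($i = 1; $i + 1 < count($parts); $i += 2) {
    // Drop the leading ";" of the delimiter, keep the 7-digit value and its comma.
    $records[] = ltrim($parts[$i], ';') . $parts[$i + 1];
}
print_r($records);
```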
You could do a find and replace on numbers that are seven digits long, to insert a token that you can use to split. The output may need a little extra filtering to get to your desired format.
<?php
$in =<<<IN
2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
IN;
$out = preg_replace('/([0-9]{7})/', "#$1", $in);
$out = explode('#', $out);
$out = array_filter($out);
var_export($out);
Output:
array (
1 => '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;',
2 => '4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
',
3 => '7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
)
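The extra filtering could be as small as stripping line breaks and trailing semicolons from each chunk. A sketch on a shortened sample, assuming PHP 7.4+ for the arrow function:

```php
<?php
// Shortened sample with the same quirks: a line break after ";" and one without.
$in = "2234323,23,44;3,23,44;4534453,23,44;\n3,23,44\n7545455,23,44;23,44";

// Same steps as above: mark each 7-digit number, split on the marker, drop empties.
$out = array_filter(explode('#', preg_replace('/([0-9]{7})/', '#$1', $in)));

// Remove embedded line breaks and any trailing semicolon, then reindex from 0.
$clean = array_values(array_map(
    fn (string $chunk): string => rtrim(preg_replace('/\R/', '', $chunk), ';'),
    $out
));
var_export($clean);
```

This assumes "#" never occurs in the data; any character that cannot appear in the input would work as the marker.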
Your input structure seems a little unstable, but once it is stabilized, just use preg_split() to match (and consume) semicolons that are immediately followed by exactly 7 digits. \b is a word boundary to ensure that there is no 8th digit.
Code: (Demo)
$string = <<<STR
2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
STR;
$string = preg_replace('/;?\R/', ';', $string); // I don't know if this is actually necessary for your real project
var_export(
preg_split('/;(?=\d{7}\b)/', $string)
);
Output:
array (
0 => '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
1 => '4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
2 => '7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
)
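To see what the \b contributes, here is a small comparison on a made-up string that contains an 8-digit value:

```php
<?php
// Invented sample: one 7-digit value and one 8-digit value after semicolons.
$s = 'a,1;1234567,2;12345678,3';

$withBoundary    = preg_split('/;(?=\d{7}\b)/', $s);
$withoutBoundary = preg_split('/;(?=\d{7})/', $s);

var_export($withBoundary);    // 2 pieces: the 8-digit value does not start a new one
var_export($withoutBoundary); // 3 pieces: the 8-digit value is treated as a split point
```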
I've been wrapping my head around this for days now, but nothing seems to give the desired result.
Example:
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
Desired result:
array(
[0] => Some Words
[1] => Other Words
[2] => More Words
[3] => Dash-Binded-Word
)
I was able to get this all working using preg_match_all, but then the "Dash-Binded-Word" was broken up as well. Trying to match it with surrounding spaces didn't work, as it would break all the words except the dash-bound ones.
The preg_match_all statement I used (which broke up the dash bound words too) is this:
preg_match_all('#\(.*?\)|\[.*?\]|[^?!\-|\(|\[]+#', $var, $array);
I'm certainly no expert on preg_match or preg_split, so any help here would be greatly appreciated.
You can use a simple preg_match_all:
\w+(?:[- ]\w+)*
See demo
\w+ - 1 or more alphanumeric or underscore
(?:[- ]\w+)* - 0 or more sequences of...
[- ] - a hyphen or space (you may change space to \s to match any whitespace)
\w+ - 1 or more alphanumeric or underscore
IDEONE demo:
$re = '/\w+(?:[- ]\w+)*/';
$str = "Some Words - Other Words (More Words) Dash-Binded-Word";
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Result:
Array
(
[0] => Some Words
[1] => Other Words
[2] => More Words
[3] => Dash-Binded-Word
)
You can split by:
/\s*(?<!\w(?=.\w))[\-[\]()]\s*/
Explanation:
The match is attempted against the character class [\-[\]()] (matches any of those characters). You could also add any char you want to that character class.
It's using a negative lookbehind (?<!\w) for the condition: "not preceded by a word character".
It also has a lookahead (?=.\w) nested inside that lookbehind, so a split is rejected only when the delimiter is both preceded by a word character and followed by one. That is what keeps Dash-Binded-Word in one piece.
\s* at the beginning and the end trims surrounding whitespace.
Code:
$input_line = "Some Words - Other Words (More Words) Dash-Binded-Word";
$result = preg_split("/\s*(?<!\w(?=.\w))[\-[\]()]\s*/", $input_line);
var_dump($result);
Output:
array(4) {
[0]=>
string(10) "Some Words"
[1]=>
string(11) "Other Words"
[2]=>
string(10) "More Words"
[3]=>
string(16) "Dash-Binded-Word"
}
Capturing parens
As stated in another comment, if you want to also capture parentheses:
$result = preg_split("/\s*(?:(?<!\w)-(?!\w)|(\(.*?\)|\[.*?]))\s*/", $input_line, -1, PREG_SPLIT_DELIM_CAPTURE);
Modifying the input string to suit any particular exploding technique would be indirect and indicate that a suboptimal exploding technique is being used.
The truth is, your required logic can be boiled down to: "explode on each sequence of non-word characters that have a length of 2 or more". This is what that pattern looks like with preg_split().
Code: (Demo)
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
var_export(preg_split('~\W{2,}~', $var));
Output:
array (
0 => 'Some Words',
1 => 'Other Words',
2 => 'More Words',
3 => 'Dash-Binded-Word',
)
It doesn't get any simpler than that.
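One caveat worth checking against your real data: a single non-word character at the very start or end of the string is not a run of two, so it stays attached to its piece. A sketch on a made-up variant of the input, with an optional trim step (assumes PHP 7.4+ for the arrow function):

```php
<?php
// Variant of the sample that ends with a closing parenthesis.
$var = 'Some Words - Other Words (More Words)';

$raw = preg_split('~\W{2,}~', $var);
// The last element of $raw is 'More Words)' because the final ")" stands alone.

// Optional cleanup: strip leftover brackets, dashes and spaces from each piece.
$pieces = array_map(fn (string $p): string => trim($p, '()[] -'), $raw);
var_export($pieces);
```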
Try this (a combination of str_replace and explode). It is not optimal but may work for this case:
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
$arr = Array(" - ", " (", ") ");
$var2 = str_replace($arr, "|", $var);
$final = explode('|', $var2);
var_dump($final);
Output:
array(4) { [0]=> string(10) "Some Words" [1]=> string(11) "Other Words" [2]=> string(10) "More Words" [3]=> string(16) "Dash-Binded-Word" }
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
$var=preg_replace('/[^A-Za-z\-]/', ' ', $var);
$var=str_replace('-', ' ', $var); // Replaces all hyphens with spaces.
print_r (explode(" ",preg_replace('!\s+!', ' ', $var))); //replaces all multiple spaces with one and explode creates array split where there is space
OUTPUT :-
Array ( [0] => Some [1] => Words [2] => Other [3] => Words [4] => More [5] => Words [6] => Dash [7] => Binded [8] => Word )