Divide PHP string into arrayed substrings based on "divider" characters - php

What I'd like to do is split a PHP string into a set of sub-strings grouped into arrays based on "divider" characters that begin those sub-strings. The characters *, ^, and % are reserved as divider characters. So if I have the string "*Here's some text^that is meant*to be separated^based on where%the divider characters^are", it should be split up and placed in arrays like so:
array(2) {
[0] => "*Here's some text"
[1] => "*to be separated"
}
array(3) {
[0] => "^that is meant"
[1] => "^based on where"
[2] => "^are"
}
array(1) {
[0] => "%the divider characters"
}
I'm totally lost on this one. Does anyone know how to implement this?

You don't ask for $matches[0] so unset it if you want:
preg_match_all('/(\*[^*^%]+)|(\^[^*^%]+)|(%[^*^%]+)/', $string, $matches);
$matches = array_map('array_filter', $matches);
print_r($matches);
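The array_filter() call removes the empty strings from the capture-group sub-arrays, giving the arrays shown in the question. Note that array_filter() preserves keys, so wrap it in array_values() if you need each sub-array indexed from 0.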
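As a minimal sketch of the approach above (the variable name $groups and the array_values() reindexing are additions for illustration, not part of the original answer), the full-match entry can be dropped and each group reindexed like this:

```php
<?php
$string = "*Here's some text^that is meant*to be separated^based on where%the divider characters^are";

// One capture group per divider character; [^*^%]+ consumes text up to the next divider.
preg_match_all('/(\*[^*^%]+)|(\^[^*^%]+)|(%[^*^%]+)/', $string, $matches);

// $matches[0] holds the full matches, which the question does not ask for.
unset($matches[0]);

// Drop the empty strings left by groups that did not take part in a match,
// then reindex each sub-array from 0.
$groups = array_map(function (array $group) {
    return array_values(array_filter($group));
}, $matches);

print_r($groups);
```

The outer keys stay 1, 2 and 3 (one per capture group), because array_map() preserves keys when it is given a single array.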

Related

PHP how to avoid mixed letter-number when extracting chunks of numbers from string

I'm writing a PHP function to extract numeric ids from a string like:
$test = '123_123_Foo';
At first I took two different approaches, one with preg_match_all():
$test2 = '123_1256_Foo';
preg_match_all('/[0-9]{1,}/', $test2, $matches);
print_r($matches[0]); // Result: 'Array ( [0] => 123 [1] => 1256 )'
and the other with preg_replace() and explode():
$test = preg_replace('/[^0-9_]/', '', $test);
$output = array_filter(explode('_', $test));
print_r($output); // Results: 'Array ( [0] => 123 [1] => 1256 )'
Both work well as long as the string does not contain mixed letters and numbers, like:
$test2 = '123_123_234_Foo2';
The evident result is Array ( [0] => 123 [1] => 1256 [2] => 2 )
So I wrote another regex to get rid of mixed strings:
$test2 = preg_replace('/([a-zA-Z]{1,}[0-9]{1,}[a-zA-Z]{1,})|([0-9]{1,}[a-zA-Z]{1,}[0-9]{1,})|([a-zA-Z]{1,}[0-9]{1,})|([0-9]{1,}[a-zA-Z]{1,})|[^0-9_]/', '', $test2);
$output = array_filter(explode('_', $test2));
print_r($output); // Results: 'Array ( [0] => 123 [1] => 1256 )'
The problem is evident too: more complicated patterns like Foo2foo12foo1 would pass the filter. And here's where I got a bit stuck.
Recap:
Extract a variable amount of chunks of numbers from string.
The string contains at least 1 number, and may contain other numbers
and letters separated by underscores.
Only numbers not preceded or followed by letters must be extracted.
Only the numbers in the first half of the string matter.
Since only the first half is needed, I decided to split at the first occurrence of a letter or a mixed number-letter chunk with preg_split():
$test2 = '123_123_234_1Foo2';
$output = preg_split('/([0-9]{1,}[a-zA-Z]{1,})|[^0-9_]/', $test2, 2);
preg_match_all('/[0-9]{1,}/', $output[0], $matches);
print_r($matches[0]); // Results: 'Array ( [0] => 123 [1] => 123 [2] => 234 )'
My question is whether there is a simpler, safer or more efficient way to achieve this result.
If I understand your question correctly, you want to split an underscore-delimited string and filter out any substrings that are not numeric. If so, this can be achieved without regex, using explode(), array_filter() and ctype_digit(), e.g.:
<?php
$str = '123_123_234_1Foo2';
$digits = array_filter(explode('_', $str), function ($substr) {
return ctype_digit($substr);
});
print_r($digits);
This yields:
Array
(
[0] => 123
[1] => 123
[2] => 234
)
Note that ctype_digit():
Checks if all of the characters in the provided string are numerical.
So $digits is still an array of strings, albeit numeric.
Hope this helps :)
Getting just the numeric part of the string after the explode
$test2 = "123_123_234_1Foo2";
$digits = array_filter(explode('_', $test2 ), 'is_numeric');
var_dump($digits);
Result
array(3) { [0]=> string(3) "123" [1]=> string(3) "123" [2]=> string(3) "234" }
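One caveat worth checking before choosing between the two callbacks: is_numeric() accepts more than plain runs of digits (signs, decimals, exponents), while ctype_digit() accepts only digits. The sample string below is made up for illustration and is not from the question:

```php
<?php
// Hypothetical input mixing plain digit chunks with other "numeric-looking" chunks.
$parts = explode('_', '123_1e3_4.5_-7_234_1Foo2');

// is_numeric() keeps anything PHP can interpret as a number.
$numeric = array_values(array_filter($parts, 'is_numeric'));

// ctype_digit() keeps only chunks made up entirely of the digits 0-9.
$digits = array_values(array_filter($parts, 'ctype_digit'));

var_dump($numeric); // "123", "1e3", "4.5", "-7", "234"
var_dump($digits);  // "123", "234"
```

If the ids are always unsigned integers, ctype_digit() is probably the stricter and safer choice.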
Use strtok
Regex isn't a magic bullet, and there are FAR simpler fixes for your problem, especially considering you're trying to split on a delimiter.
Any of the following approaches would be cleaner, and more maintainable, and the strtok() approach would probably perform better:
Use explode to create and loop through an array, checking each value.
Use preg_split to do the same, but with a more adaptable approach.
Use strtok, as it is designed exactly for this use-case.
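Basic example for your case:
function strGetInts(string $str, string $delim) {
    $word = strtok($str, $delim);
    while (false !== $word) {
        // strtok() returns strings, so is_integer() would always be false here;
        // ctype_digit() checks that the token consists of digits only.
        if (ctype_digit($word)) {
            yield (int) $word;
        }
        $word = strtok($delim);
    }
}
$test2 = '123_1256_Foo';
foreach (strGetInts($test2, '_-') as $key) {
    print_r($key);
}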
Note: the second argument to strtok is a string containing ANY delimiter to split the string on. Thus, my example will split the string into tokens separated by underscores or dashes.
Additional Note: If and only if the string only needs to be split on a single delimiter (underscore only), a method using explode will likely result in better performance. For such a solution, see the other answer in this thread: https://stackoverflow.com/a/46937452/1589379 .

Split string on semicolons immediately followed by 7 digital characters

I have a string like this:
2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
As you see, there are semicolons between values. I want to split this string only at the semicolons that come before a 7-digit value, so I should have this:
>2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
>4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;
>7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
The only thing that I can think of is explode(';',$string), but this returns this:
>2234323,23,23,44,433;
>3,23,44,433;
>23,23,44,433;
>23,23,44,433
>4534453,23,23,44,433;
>3,23,44,433;
>23,23,44,433;23,23,44,433;
>7545455,23,23,44,433;
>3,23,44,433;23,23,44,433;
>23,23,44,433
Is there any fast method to split string with this format based on the ";" before 7 digits values?
You can use preg_split for that:
$s = '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433';
var_dump(preg_split('/(;\d{7},)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE));
Your output will be
array(5) {
[0] =>
string(58) "2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433"
[1] =>
string(9) ";4534453,"
[2] =>
string(50) "23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433"
[3] =>
string(9) ";7545455,"
[4] =>
string(50) "23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433"
}
I think that the next step (combining elements [1] and [2], and then [3] and [4]) is not a big deal :)
Let me know if you still have problems here.
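The recombination step could be sketched as follows. This is only one way to do it; the $parts and $rows names and the loop are assumptions added for illustration, not part of the answer above:

```php
<?php
$s = '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433';

// Odd indexes hold the captured delimiters (";1234567,"), even indexes the text between them.
$parts = preg_split('/(;\d{7},)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE);

// The first chunk is already complete.
$rows = [$parts[0]];

// Glue each captured delimiter (minus its leading semicolon) onto the chunk that follows it.
for ($i = 1; $i + 1 < count($parts); $i += 2) {
    $rows[] = ltrim($parts[$i], ';') . $parts[$i + 1];
}

print_r($rows);
```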
You could do a find and replace on numbers that are seven digits long, to insert a token that you can use to split. The output may need a little extra filtering to get to your desired format.
<?php
$in =<<<IN
2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
IN;
$out = preg_replace('/([0-9]{7})/', "#$1", $in);
$out = explode('#', $out);
$out = array_filter($out);
var_export($out);
Output:
array (
1 => '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;',
2 => '4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
',
3 => '7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
)
Your input structure seems a little unstable, but once it is stabilized, just use preg_split() to match (and consume) semicolons that are immediately followed by exactly 7 digits. The lookahead (?=...) checks the digits without consuming them, and \b is a word boundary to ensure that there is no 8th digit.
Code: (Demo)
$string = <<<STR
2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433;4534453,23,23,44,433;
3,23,44,433;23,23,44,433;23,23,44,433
7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433
STR;
$string = preg_replace('/;?\R/', ';', $string); // I don't know if this is actually necessary for your real project
var_export(
preg_split('/;(?=\d{7}\b)/', $string)
);
Output:
array (
0 => '2234323,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
1 => '4534453,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
2 => '7545455,23,23,44,433;3,23,44,433;23,23,44,433;23,23,44,433',
)
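To see what the \b contributes, here is a small sketch on a made-up string (not the asker's data) that contains an 8-digit value. The semicolon in front of it should not be treated as a split point:

```php
<?php
// Hypothetical input: 12345678 has eight digits, 7654321 has seven.
$string = '1234567,1;2,3;12345678,4;7654321,5';

// The lookahead requires exactly seven digits followed by a word boundary,
// so ";12345678" is skipped and only ";7654321" is a split point.
$rows = preg_split('/;(?=\d{7}\b)/', $string);

var_export($rows);
```

Without the \b, the pattern would also split before 12345678, because its first seven digits satisfy \d{7}.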

PHP: How to split a string by dash and everything between brackets. (preg_split or preg_match)

I've been wrapping my head around this for days now, but nothing seems to give the desired result.
Example:
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
Desired result:
array(
[0] => Some Words
[1] => Other Words
[2] => More Words
[3] => Dash-Bound-Word
)
I was able to get this all working using preg_match_all, but then the "Dash-Bound-Word" was broken up as well. Trying to match it with surrounding spaces didn't work as it would break all the words except the dash bound ones.
The preg_match_all statement I used (which broke up the dash bound words too) is this:
preg_match_all('#\(.*?\)|\[.*?\]|[^?!\-|\(|\[]+#', $var, $array);
I'm certainly no expert on preg_match, preg_split so any help here would be greatly appreciated.
You can use a simple preg_match_all:
\w+(?:[- ]\w+)*
\w+ - 1 or more alphanumeric or underscore
(?:[- ]\w+)* - 0 or more sequences of...
[- ] - a hyphen or space (you may change space to \s to match any whitespace)
\w+ - 1 or more alphanumeric or underscore
IDEONE demo:
$re = '/\w+(?:[- ]\w+)*/';
$str = "Some Words - Other Words (More Words) Dash-Binded-Word";
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Result:
Array
(
[0] => Some Words
[1] => Other Words
[2] => More Words
[3] => Dash-Binded-Word
)
You can split by:
/\s*(?<!\w(?=.\w))[\-[\]()]\s*/
Explanation:
The match is attempted against the character class [\-[\]()] (matches any of those characters). You could also add any char you want to that character class.
It uses a negative lookbehind (?<!\w...) for the condition: "not preceded by a word character".
The lookbehind also contains a nested lookahead (?=.\w) that narrows the condition: the split is rejected only when the delimiter is both preceded by a word character and followed (after the delimiter itself, matched by the .) by another word character, as in Dash-Binded.
\s* at the beginning and the end is to trim whitespace.
Code:
$input_line = "Some Words - Other Words (More Words) Dash-Binded-Word";
$result = preg_split("/\s*(?<!\w(?=.\w))[\-[\]()]\s*/", $input_line);
var_dump($result);
Output:
array(4) {
[0]=>
string(10) "Some Words"
[1]=>
string(11) "Other Words"
[2]=>
string(10) "More Words"
[3]=>
string(16) "Dash-Binded-Word"
}
Capturing parens
As stated in another comment, if you want to also capture parentheses:
$result = preg_split("/\s*(?:(?<!\w)-(?!\w)|(\(.*?\)|\[.*?]))\s*/", $input_line, -1, PREG_SPLIT_DELIM_CAPTURE);
Modifying the input string to suit any particular exploding technique would be indirect and indicate that a suboptimal exploding technique is being used.
The truth is, your required logic can be boiled down to: "explode on each sequence of non-word characters that have a length of 2 or more". This is what that pattern looks like with preg_split().
Code: (Demo)
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
var_export(preg_split('~\W{2,}~', $var));
Output:
array (
0 => 'Some Words',
1 => 'Other Words',
2 => 'More Words',
3 => 'Dash-Binded-Word',
)
It doesn't get any simpler than that.
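One edge case to be aware of, shown here on a made-up variation of the input rather than the asker's string: if the text ends with a closing bracket, that single trailing non-word character is not a run of two or more, so it stays attached to the last element. A hedged sketch of one possible workaround is to trim the delimiter characters from the ends first:

```php
<?php
// Hypothetical input that ends with a closing parenthesis.
$var = "Some Words - Other Words (More Words)";

// The lone ")" at the end is a single non-word character, so it is not split off.
$raw = preg_split('~\W{2,}~', $var);

// Trimming spaces, brackets and dashes from both ends avoids the leftover.
$clean = preg_split('~\W{2,}~', trim($var, " ()[]-"));

var_export($raw);
var_export($clean);
```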
Try this (a combination of str_replace and explode). It is not optimal but may work for this case:
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
$arr = Array(" - ", " (", ") ");
$var2 = str_replace($arr, "|", $var);
$final = explode('|', $var2);
var_dump($final);
Output:
array(4) { [0]=> string(10) "Some Words" [1]=> string(11) "Other Words" [2]=> string(10) "More Words" [3]=> string(16) "Dash-Binded-Word" }
$var = "Some Words - Other Words (More Words) Dash-Binded-Word";
$var=preg_replace('/[^A-Za-z\-]/', ' ', $var);
$var=str_replace('-', ' ', $var); // Replaces all hyphens with spaces.
print_r (explode(" ",preg_replace('!\s+!', ' ', $var))); //replaces all multiple spaces with one and explode creates array split where there is space
OUTPUT :-
Array ( [0] => Some [1] => Words [2] => Other [3] => Words [4] => More [5] => Words [6] => Dash [7] => Binded [8] => Word )

Print out certain text from string with preg_match

I want to get a certain string from a .torrent name but I'm only getting this from it:
array
0 => string 'e' (length=1)
What have I done wrong? This is the preg_match I use:
preg_match('/[S(0-9)E(0-9)]/i', 'True.Blood.S04E12.SWESUB.PDTV.XviD-DSMEDiA', $matches);
Thanks in advance.
Remove the square brackets and put them around the numbers and add + (meaning 1 or more) after them. This way you get the entire S##E## string, plus the numbers separately:
preg_match('/S([0-9]+)E([0-9]+)/i', 'True.Blood.S04E12.SWESUB.PDTV.XviD-DSMEDiA', $matches);
print_r($matches);
/* output:
Array
(
[0] => S04E12
[1] => 04
[2] => 12
)
*/
You could also replace [0-9] with \d.
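To illustrate why the original attempt returned only 'e': square brackets define a character class, so [S(0-9)E(0-9)] matches a single character that is S, E, a digit or a parenthesis, and with the i flag the first such character is the 'e' in "True". A short sketch comparing both patterns (the variable names are chosen here for illustration):

```php
<?php
$name = 'True.Blood.S04E12.SWESUB.PDTV.XviD-DSMEDiA';

// Character class: matches exactly one character from the set, case-insensitively.
preg_match('/[S(0-9)E(0-9)]/i', $name, $single);

// Literal S and E with capture groups around the digit runs.
preg_match('/S(\d+)E(\d+)/i', $name, $parts);

print_r($single); // the lone 'e' from "True"
print_r($parts);  // the full S04E12 plus season and episode numbers
```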
I would recommend using:
preg_match('/S[0-9]{1,2}E[0-9]{1,2}/i', 'True.Blood.S04E12.SWESUB.PDTV.XviD-DSMEDiA', $matches);
It gets out this:
array(1) { [0]=> string(6) "S04E12" }
The following regex will return the appropriate string.
/* Pattern: /\d{2}E\d{2}/ */
preg_match_all('/\d{2}E\d{2}/', '{{your data}}', $arr, PREG_PATTERN_ORDER);
/*Result*/
Array
(
[0] => Array
(
[0] => 04E12
)
)

Regular Expressions: get what is outside of the brackets

I'm using PHP and I have text like:
first [abc] middle [xyz] last
I need to get what's inside and outside of the brackets. Searching in StackOverflow I found a pattern to get what's inside:
preg_match_all('/\[.*?\]/', $m, $s)
Now I'd like to know the pattern to get what's outside.
Regards!
You can use preg_split for this as:
$input ='first [abc] middle [xyz] last';
$arr = preg_split('/\[.*?\]/',$input);
print_r($arr);
Output:
Array
(
[0] => first
[1] => middle
[2] => last
)
This allows some surrounding spaces in the output. If you don't want them you can use:
$arr = preg_split('/\s*\[.*?\]\s*/',$input);
preg_split splits the string based on a pattern. The pattern here is [ followed by anything followed by ]. The regex to match anything is .*. Because [ and ] are regex metacharacters used for character classes, we need to escape them to match them literally, which gives \[.*\]. By default .* is greedy and will try to match as much as possible; in this case it would match abc] middle [xyz. To avoid this we make it non-greedy by appending a ?, which gives \[.*?\]. Since our definition of "anything" here actually means anything other than ], we can also use \[[^]]*?\]
EDIT:
If you want to extract words that are both inside and outside the [], you can use:
$arr = preg_split('/\[|\]/',$input);
which splits the string on a [ or a ].
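Putting the pieces of this answer together, a minimal sketch might look like the following. The variable names and the capture group used to collect the inside parts are illustrative choices, not the only way to do it:

```php
<?php
$input = 'first [abc] middle [xyz] last';

// Outside: split on each bracketed part, swallowing the surrounding whitespace.
$outside = preg_split('/\s*\[.*?\]\s*/', $input);

// Inside: capture the text between each pair of brackets.
preg_match_all('/\[(.*?)\]/', $input, $m);
$inside = $m[1];

// Both, in document order: split on either bracket and trim around it.
$all = preg_split('/\s*[\[\]]\s*/', $input);

print_r($outside);
print_r($inside);
print_r($all);
```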
$inside = '\[.+?\]';
$outside = '[^\[\]]+';
$or = '|';
preg_match_all(
"~ $inside $or $outside~x",
"first [abc] middle [xyz] last",
$m);
print_r($m);
or less verbose
preg_match_all("~\[.+?\]|[^\[\]]+~", $str, $matches)
Use preg_split instead of preg_match.
preg_split('/\[.*?\]/', 'first [abc] middle [xyz] last');
Result:
array(3) {
[0]=>
string(6) "first "
[1]=>
string(8) " middle "
[2]=>
string(5) " last"
}
Everyone says that you should use preg_split, but only one person replied with an expression that meets your needs, and I think it is a little too verbose (though that answer has since been updated to address this).
This expression is what most of the replies have stated.
/\[.*?\]/
But that only prints out
Array
(
[0] => first
[1] => middle
[2] => last
)
You stated you wanted what's inside and outside the brackets, so an update would be to split on either bracket with a character class:
/[\[\]]/
This gives you:
Array
(
[0] => first
[1] => abc
[2] => middle
[3] => xyz
[4] => last
)
But as you can see, it is capturing whitespace as well, so let's go a step further and get rid of that:
/\s*[\[\]]\s*/
This will give you a desired result:
Array
(
[0] => first
[1] => abc
[2] => middle
[3] => xyz
[4] => last
)
This, I think, is the expression you're looking for.
