Find common chars in array of strings, in the right order

Find common chars in array of strings, in the right order - php

I spent days working on a function to get common chars in an array of strings, in the right order, to create a wildcard.
Here is an example to explain my problem. I made about 3 functions, but I always have a bug when the absolute position of each letter is different.
Let's assume "+" is the "wildcard char":
Array(
0 => '48ca135e0$5',
1 => 'b8ca136a0$5',
2 => 'c48ca13730$5',
3 => '48ca137a0$5');
Should return :
$wildcard='+8ca13+0$5';
In this example, the tricky thing is that $array[2] as 1 char more than others.
Other example :
Array(
0 => "case1b25.occHH&FmM",
1 => "case11b25.occHH&FmM",
2 => "case12b25.occHH&FmM",
3 => "case20b25.occHH&FmM1");
Should return :
$wildcard='case+b25.occHH&FmM+';
In this example, the tricky parts are :
- Repeating chars, such as 1 -> 11 in the "to delete" part, and c -> cc in the common part
- The "2" char in $array[2] & [3] in the "to delete" part is not in the same position
- The "1" char at the end of the last string
I really need help because I can't find a solution to this function and it is a main part of my application.
Thanks in advance, don't hesitate to ask questions, I will answer as fast as possible.
Mykeul

Seems you want to create something like regular expression out of set of example strings.
This might be quite tricki in general. Found this link, not sure if it's relevant:
http://scholar.google.com/scholar?hl=en&rlz=1B3GGGL_enEE351EE351&q=%22regular%20expression%20by%20example%22&oq=&um=1&ie=UTF-8&sa=N&tab=ws
On the other hand, if you need only one specific wildcard meaning "0 or more characters", then it should be much easier. Levenshtein distance algorithm computes similarity between 2 strings. Normally only result is needed, but in your case the places of differences are important. You also need to adapt this for N strings.
So I recommend to study this algorithm and hopefully you'll get some ideas how to solve your problem (at least you'll get some practice with text algorithms and dynamic programming).
Heres algorithm in PHP:
_http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#PHP
You might want also to search for PHP implementations of "diff".
http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/

Main code:
Step 1: Sort strings by length, shortest to longest, into array[]
Step 2: Compare string in array[0] and array[1] to get $temp_wildcard
Step 3: Compare string in array[2] with $temp_wildcard to create new $temp_wildcard
Step 4: Continue comparing each string with $temp_wildcard - the last $wildcard is your $temp_wildcard
OK, so now we're down to the problem of how to compare two strings to return your wildcard string.
Subroutine code:
Compare strings character-by-character, substituting wildcards into your return value when the comparison doesn't match.
To handle the problem of different lengths, run this comparison an extra time for each character that the second string is longer with an offset. (Compare string1[x] to string2[x+offset].) For each returned string, count the number of wildcard characters. The subroutine should return the answer with the fewest number of wildcard characters.
Good luck!

Related

What can NOT be described by a PCRE regex?

I am using a lot of regular expressions and stumbled over the question what actually can not be described by a regex.
First example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each subpart consisting only of the character X or O is longer than the previsous sequence of the other character.
Can anybody explain what is the formal limit of a regex? I know this might be a rather academic question but I'm a curious person ;-)
Edit
As I am a php guy I am especially interested in regex described by PCRE standard as described here: http://php.net/manual/en/reference.pcre.pattern.syntax.php
I know that PCRE allows a lot of things that are not part of the original regular expressions like back references.
Mathcing of balanced parentheses seems to be one example that can not be matched by regular expressions in general but it can be matched using PCRE (see http://sandbox.onlinephpfunctions.com/code/fd12b580bb9ad7a19e226219d5146322a41c6e47 for live example):
$data = array('()', '(())', ')(', '(((()', '(((((((((())))))))))', '()()');
$regex = '/^((?:[^()]|\((?1)\))*+)$/';
foreach($data as $d) {
echo "$d matched by regex: " . (preg_match($regex, $d) ? 'yes' : 'no') . "\n";
}

First example that came to my mind was matching a string like XOOXXXOOOOXXXXX.... This would be a string consisting of an alternating sequence of X's and O's where each subpart consisting only of the character X or O is longer than the previsous sequence of the other character.
Yes, that can be done.
To match a non-empty sequence of x's, followed by a greater number of o's, we can use an approach similar to that of the balanced-parentheses regex:
(x(?1)?o)o+
To match a string of x's and o's such that any sequence of x's is followed by a longer sequence of o's (except optionally at the very end), we can build on pattern #1:
^o*(?:(x(?1)?o)o+)*x*$
And of course, we'll also need a variant of pattern #2 with x's and o's flipped:
^x*(?:(o(?1)?x)x+)*o*$
To match a string of x's and o's that meet both of the above criteria, we can convert pattern #2 to a positive lookahead assertion, and renumber the capture-group in pattern #3:
^(?=o*(?:(x(?1)?o)o+)*x*$)x*(?:(o(?2)?x)x+)*o*$
As for the main question . . . I'm confident that a PCRE can match any context-free language, since the support for (?n) outside of the nth capture-group means that you can basically create a subroutine for each of your non-terminals. For example, this context-free grammar:
S → aTb
S → ε
T → cSd
T → eTf
can be written as:
capture-group #1 (S) → (a(?2)b|)
capture-group #2 (T) → (c(?1)d|e(?2)f)
To assemble that into a single regex, we can just concatenate them all, but appending {0} after all but the start non-terminal, and then add ^ and $:
^(a(?2)b|)(c(?1)d|e(?2)f){0}$
But as you saw from your first example, PCREs can match some non-context-free languages as well. (Another example is anbncn, which is a classic example of a non-context-free language. You can match it with PCRE by combining a PCRE for anbncm with a PCRE for ambncn using a forward lookahead assertion. Although the intersection of two regular languages is necessarily regular, the intersection of two context-free languages is not necessarily context-free; but the intersection of the languages defined by two PCREs can be defined by a PCRE.)

The set of all languages that can be recognized by a regular expression is called, not surprisingly, "regular languages".
The next most complicated languages are the context-free languages. They cannot be parsed by any regular expression. The standard example is "all balanced parentheses" -- so "()()" and "(())" but not "(()".
Another good example of a context-free language is HTML.

I don't have definitive evidence that any of these are impossible with things like recursion, balancing groups, self-referencing groups, and appending text to the string being tested. I would be glad to be proven wrong on any or all of these, as I would learn something!
It's pretty bad at math.
For example, I do not believe it is possible using PCRE, to detect a sequence of numbers that is ascending: that is, to match "1 2 7 97 315 316..." but not "127 97 315 316..."
I'm not sure it's possible to even match a sequence contiguously increasing from 1, like "1 2 3...", without exhaustively listing all possibilities like /1( 2( 3(...)?)?)?/ up to the max length you wish to check.
Thee are hacks to make it work by adding known text to the string under test (eg http://www.rexegg.com/regex-trick-line-numbers.html works by adding a series of numbers to the end of the file). But as raw regex, simple math is only possible by brute-forcing.
Another example which I believe it will fail at is "match any sequence which sums to N".
So for N=4, it should match 4, 3 1, 1 3, 2 2, 1 1 1 1, 2 1 1, 1 2 1, 1 1 2, 1 1 1 1, which looks like you could brute-force it, until you realize it also has to match 4 -12 11 0 1.
In the same manner, I don't think you could have it analyze an equation using SI units, and verify whether the units balanced on both sides of the equation. For example, "10N=2kg*5ms^-1". Never mind checking the values, just checking the units are correct.
Then there're all the classes of problems that no computer program can currently accomplish, such as "check if a string has been correctly title cased in English" which would require a context-sensitive natural language parser to correctly detect the different senses of "like" in "Time Flies like an Arrow But Fruit Flies Like a Banana".

Explode and Implode in APL

How could functions similar to PHP's explode and implode be implemented with APL?
I tried to work it out myself and came up with a solution which I'm posting below. I'd like to see other ways that this might be solved.

Pé, the quest for "short" and/or "elegant" solutions to standard-problems in APL is older than PHP and even older than new terminology, such as "explode", "implode" (I think - but I must admit I do not know how old these terms really are...). Anyway, the early APL guys used the term "idiom" for such "solutions to standard problems that fit in one line of APL".
And for some reason, the Finns were especially creative and even started producing a list of these in order to make it easy for newbies. And I find this stuff still useful after 20yrs of doing APL. It is called "FinnAPL" - the Finnish APL idiom library and you can browse it here: https://aplwiki.com/wiki/FinnAPL_idiom_library (BTW, the whole APL Wiki might be interesting to read...)
You may, however, need to be creative with your wording in order to find solutions ;)
And one warning: FinnAPL only works with "classic" (non-nested) data-structures (nested matrices came with "APL2" which is standard these days), so some of the ways they handle data might no longer be "state-of-the-art". (i.e. back in the "old times", CAT BIRD and DOG would have been represented as a 3x4 array, so "implode" of string-array was a simple as ,array,delimeter (but you then had the challenge to remove blanks which were inserted for padding.
Anyway, I'm not sure why I wrote all this - just a few thoughts which came to mind when thinking about my start with APL ;-)
Ok, let me also look at the question. When your delimeter is a single character the APL2ish-idiomatic way of handling this would be something like this:
⎕ml←3 ⍝ "migration-level" (only Dyalog APL) to ensure APL2-compatibility
s←' '
A←s,'BIRD',s,'CAT',s,'DOG' ⍝ note that delimeter also used as 1st char!
exploded_string←1↓¨(+\A=s)⊂A ⍝ explode
imploded←∊s,¨exploded_string
A≡imploded ⍝ test for successfull round-trip should return 1

Explode:
Given the following text string and delimiter string:
F←'CAT BIRD DOG'
B←' '
Explode can be accomplished as follows:
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P[2] ⍝ returns BIRD
Limitations:
PHP's explode function returns a null array value when two delimiters are adjacent to each other. The code above simply ignores that and treats the two delimiters as if they were one.
The code above also does nothing to handle overlapping delimiters. This is most likely to occur if repeated characters are used for the delimiter. For example:
F←'CATaaaBIRDaaDOG'
B←'aa'
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P ⍝ returns CAT BIRD DOG
However, the expected result would be CAT aBIRD DOG because it doesn't recognize 'aaa' as the delimiter followed by 'a.' Rather, it treats it as two overlapping delimiters, which end up functioning as a single delimiter. Another example would be 'tat' as the delimiter, in which case, any occurence in the string of 'tatat' would have the same problem.
Overlapping Delimiters:
I have an alternative for the possibility of a single overlap:
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
The third line of code eliminates any string positions that occur within a distance of S-1 characters from any delimiter position before it. As I said, this only solves the problem for a single overlap. If there are two or more overlaps, the first is recognized as a delimiter, and all the rest are ignored. Here's an example of two overlaps:
F←'CATtatatatBIRDtatDOG'
B←'tat'
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
P ⍝ returns CAT atatBIRD DOG
The expected result was 'CAT a BIRD DOG,' but it is unable to recognize the final 'tat' as a delimiter because of the overlap. Such a situation would be rare except when repeated characters are used. If the delimiter is 'aa', then 'aaaa' would be considered a double overlap, and only the first delimiter would be recognized.
Implode:
Much simpler:
P←'CAT' 'BIRD' 'DOG'
B←'-'
(⍴,B)↓∊B,¨P
It returns 'CAT-BIRD-DOG' as expected.

An interesting alternative for implode can be accomplished with reduction:
p←'cat' 'bird' 'dog'
↑{⍺,'-',⍵}/p
cat-bird-dog
This technique does not need to explicitly reference the shape of the delimiter.
And an interesting alternative to explode can be done with n-wise reduction:
f←'CATtatBIRDtatDOG'
b←'tat'
b{(~(-⍴⍵)↑(⍴⍺)∨/⍺⍷⍵)⊂⍵}f
CAT BIRD DOG

How to match those numbers?

I have an array of numbers, for example:
10001234
10002345
Now I have a number, which should be matched against all of those numbers inside the array. The number could either be 10001234 (which would be easy to match), but it could also be 100001234 (4 zeros instead of 3) or 101234 (one zero instead of 3) for example. Any combination could be possible. The only fixed part is the 1234 at the end.
I cant get the last 4 chars, because it can also be 3 or 5 or 6 ..., like 1000123456.
Whats a good way to match that? Maybe its easy and I dont see the wood for the trees :D.
Thanks!

if always the first number is one you can use this
$Num=1000436346;
echo(int)ltrim($Num."","1");
output:
436346

$number % 10000
Will return the remainder of dividing a number by 10000. Meaning, the last four digits.

The question doesn't make the criteria for the match very clear. However, I'll give it a go.
First, my assumptions:
The number always starts with a 1 followed by an unknown number of 0s.
After that, we have a sequence of digits which could be anything (but presumably not starting with zero?), which you want to extract from the string?
Given the above, we can formulate an expression fairly easily:
$input='10002345';
if(preg_match('/10+(\d+)/',$input,$matches)) {
$output = $matches[1];
}
$output now contains the second part of the number -- ie 2345.
If you need to match more than just a leading 1, you can replace that in the expression with \d to match any digit. And add a plus sign after it to allow more than one digit here (although we're still relying on there being at least one zero between the first part of the number and the second).
$input='10002345';
if(preg_match('/\d+0+(\d+)/',$input,$matches)) {
$output = $matches[1];
}

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG":
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terrible well if later the code is modified to match say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";

This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.E|..G))|(?:.C(?:E|.G))|(?:..EG)$/ - but that's still the same solution, just in a more efficiently-processed form.)

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
static $file = null;
if( !$file) $file = file("BigFileOfLines.txt");
$search = str_split($line);
foreach($file as $l) {
$test = str_split($l);
$matches = count(array_intersect($search,$test));
if( $matches > 2) // define number of matches required here - optionally make it an argument
return true;
}
// no matches
return false;
}

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
function findMatchingLines($line, $subject) {
$regex = "/(?<=^([AB])([CD])([EF])([GH])[.\n]+)"
+ "(\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)/m";
$matchingLines = array();
preg_match_all($regex, $line + "\n" + $subject, $matchingLines);
return $matchingLines;
}
What this function does is pre-pend your input string with the line you want to match against, then uses a pattern that compares each line after the first line (that's the + after [.\n] working) back to the first line's 4 characters.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).

People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the highest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D in the first position, OR E or | or F in the first position, OR G or '|orH` in the first position.
This is because [A|B] means a character class with one of the three given characters (including the |. And because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternate between everything around it. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very Strictly speaking, and picking up from #Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(A(C|.E|..G))|(.C(E)|(.G))|(..EG)$/
as compared to:
/^(AC|A.E|A..G|.CE|.C.G|..EG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (or which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test, `EG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately, means to fail, you would have test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: On additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
/^[^ACFG]*+(?:[ACFG][^ACFG]*+){2}$/m
It looks for two occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
return "/^[^$line]*+(?:[$line][^$line]*+){$num}$/m";
}

Trying to split a string in 3 variables, but a little more tricky - PHP

Having pretty much covered the basics in PHP, I decided to challenge myself and make a simple calculator. After some attempts I figured it out, but I'm not entirely content with it. I want to make it more user friendly and have the calculator input in just one box, very much like google search.
So one would simply type: 5+2
and recieve 7.
How would I split the string "5+2" into three variables so that the math functions can convert the numbers into integers and recognize the operator, as well as accounting for the possibility of someone using spaces between the values as well?
Would you explode the string? But what would you explode it with if there are no spaces?
I've also stumbled upon the preg_split function, but I can't seem to wrap my head around or know if it's suitable to solve this problemt. What method would be the best option for this?

$calc = "5* 2+ 53";
$calc = preg_replace('/(\s*)/','',$calc);
print_r(preg_split('/([\x28-\x2B\x2D\x2F])/',$calc,-1,PREG_SPLIT_DELIM_CAPTURE));
That's my bid, resulting in
Array
(
[0] => 5
[1] => *
[2] => 2
[3] => +
[4] => 53
)

You may need to use some clever regex to split it something like:
$myOutput = split("(-?[0-9]+)|([+-*/]{1})|(-?[0-9]+)");
I haven't tested that - just an semi-psuedo-ish example sorry :-> just trying to highlight that you will need to remember that your - (minus) operator can appear at the start of an integer to make it a negative number so you could end up with problems with things like -1--21 which is valid but makes your regex rules more complicated.

You will have to split the string using regular expressions.
For example a simple regex for 5+2 would be:
\d\+\d
Check out this link. You can create and validate your regular expressions there. For a calculator it will not be that difficult.

You've got the right idea with preg_split. It would work something like this:
$values = preg_split("/[\s]+/", "76 + 23");
The resulting array will contain values that are NOT whitespace:
Values should look like this:
$values[0]: "76"
$values[1]: "+"
$values[2]: "23"
the "/[\s]+/" is a regular expression pattern that matches any whitespace characters one or more times. Howver, if there are no whitespaces at all, preg_split will just return the original "5+2" as a single string in the first element of the array. i.e.:
$values[0] = "5+2"

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.