Standard algorithm to tokenize a string, keep delimiters (in PHP)

I want to split an arithmetic expression into tokens, to convert it into RPN.
Java has the StringTokenizer, which can optionally keep the delimiters. That way, I could use the operators as delimiters. Unfortunately, I need to do this in PHP, which has strtok, but that throws away the delimiters, so I need to brew something myself.
This sounds like a classic textbook example for Compiler Design 101, but I'm afraid I'm lacking some formal education here. Is there a standard algorithm you can point me to?
My other options are to read up on Lexical Analysis or to roll up something quick and dirty with the available string functions.

This might help.
Practical Uses of Tokenizer

As often, I would just use a regular expression to do this:
$expr = '(5*(7 + 2 * -9.3) - 8 )/ 11';
$tokens = preg_split('/([*\/^+-]+)\s*|([\d.]+)\s*/', $expr, -1,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$tts = print_r($tokens, true);
echo "<pre>x=$tts</pre>";
It needs a little more work to accept numbers with exponent (like -9.2e-8).
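One way to handle exponents, as a rough sketch rather than a finished lexer: try to match a whole number, including an optional exponent, before trying the operators, so that the sign inside e-8 is not split off. The function name tokenizeExpr and the exact pattern are assumptions made for illustration. A leading minus is still returned as a separate token.

```php
<?php
// Hypothetical helper, not part of the answer above.
function tokenizeExpr(string $expr): array
{
    // Number alternatives come first: digits with an optional fraction and
    // exponent, or a leading-dot fraction. Single-character operators and
    // parentheses come last.
    $pattern = '/(\d+(?:\.\d*)?(?:[eE][+-]?\d+)?|\.\d+(?:[eE][+-]?\d+)?|[*\/^+()-])/';
    $pieces = preg_split($pattern, $expr, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    // Whitespace between tokens survives the split, so drop those pieces.
    return array_values(array_filter($pieces, fn($p) => trim($p) !== ''));
}
```

With the input '2 * -9.2e-8' this should return the four tokens 2, *, - and 9.2e-8.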

OK, thanks to PhiLho, my final code is this, should anyone need it. It's not even really dirty. :-)
static function rgTokenize($s)
{
    $rg = array();
    // remove whitespace
    $s = preg_replace("/\s+/", '', $s);
    // split at numbers, identifiers, function names and operators
    $rg = preg_split('/([*\/^+\(\)-])|(#\d+)|([\d.]+)|(\w+)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    // find right-associative '-' and put it as a sign onto the following number
    for ($ix = 0, $ixMax = count($rg); $ix < $ixMax; $ix++) {
        if ('-' == $rg[$ix]) {
            if (isset($rg[$ix - 1]) && self::fIsOperand($rg[$ix - 1])) {
                continue;
            } else if (isset($rg[$ix + 1]) && self::fIsOperand($rg[$ix + 1])) {
                $rg[$ix + 1] = $rg[$ix].$rg[$ix + 1];
                unset($rg[$ix]);
            } else {
                throw new Exception("Syntax error: Found right-associative '-' without operand");
            }
        }
    }
    $rg = array_values($rg);
    echo join(" ", $rg)."\n";
    return $rg;
}
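The sign-folding step depends on fIsOperand, which is not shown here. As a self-contained sketch of the same idea, assuming that a '-' acts as a sign whenever it starts the expression or follows an operator or an opening parenthesis (mergeUnaryMinus is a hypothetical name, and is_numeric stands in for the operand check):

```php
<?php
// Hypothetical helper: folds a sign-position '-' into the number after it.
function mergeUnaryMinus(array $tokens): array
{
    $out = [];
    $n = count($tokens);
    for ($i = 0; $i < $n; $i++) {
        $tok = $tokens[$i];
        $prev = $out ? $out[count($out) - 1] : null;
        // Sign position: nothing precedes the '-', or an operator or '(' does.
        $signPosition = $prev === null
            || in_array($prev, ['+', '-', '*', '/', '^', '('], true);
        if ($tok === '-' && $signPosition
            && isset($tokens[$i + 1]) && is_numeric($tokens[$i + 1])) {
            $out[] = '-' . $tokens[$i + 1];
            $i++; // skip the number that was just consumed
            continue;
        }
        $out[] = $tok;
    }
    return $out;
}
```

Unlike the method above, this sketch leaves a dangling '-' in place instead of throwing, so a later parsing stage would have to report it.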

Related

PHP Sum of two numbers resulting in a large numbers with a + symbol [duplicate]

Ok, so PHP isn't the best language to be dealing with arbitrarily large integers in, considering that it only natively supports 32-bit signed integers. What I'm trying to do though is create a class that could represent an arbitrarily large binary number and be able to perform simple arithmetic operations on two of them (add/subtract/multiply/divide).
My target is dealing with 128-bit integers.
There's a couple of approaches I'm looking at, and problems I see with them. Any input or commentary on what you would choose and how you might go about it would be greatly appreciated.
Approach #1: Create a 128-bit integer class that stores its integer internally as four 32-bit integers. The only problem with this approach is that I'm not sure how to go about handling overflow/underflow issues when manipulating individual chunks of the two operands.
Approach #2: Use the bcmath extension, as this looks like something it was designed to tackle. My only worry in taking this approach is the scale setting of the bcmath extension, because there can't be any rounding errors in my 128-bit integers; they must be precise. I'm also worried about being able to eventually convert the result of the bcmath functions into a binary string (which I'll later need to shove into some mcrypt encryption functions).
Approach #3: Store the numbers as binary strings (probably LSB first). Theoretically I should be able to store integers of any arbitrary size this way. All I would have to do is write the four basic arithmetic functions to perform add/sub/mult/div on two binary strings and produce a binary string result. This is exactly the format I need to hand over to mcrypt as well, so that's an added plus. This is the approach I think has the most promise at the moment, but the one sticking point I've got is that PHP doesn't offer me any way to manipulate the individual bits (that I know of). I believe I'd have to break it up into byte-sized chunks (no pun intended), at which point my questions about handling overflow/underflow from Approach #1 apply.
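On the overflow worry in Approaches #1 and #3: the usual schoolbook technique is to add chunk by chunk and carry any excess into the next chunk. A minimal sketch, assuming the numbers are equal-length byte strings stored most significant byte first (for LSB-first storage only the loop direction would change) and that a carry out of the top byte is simply dropped. The name addBytes is made up for illustration.

```php
<?php
// Hypothetical helper: adds two equal-length big-endian byte strings,
// wrapping around modulo 2^(8 * length).
function addBytes(string $a, string $b): string
{
    $len = strlen($a);
    if ($len !== strlen($b)) {
        throw new InvalidArgumentException('operands must have the same length');
    }
    $result = str_repeat("\0", $len);
    $carry = 0;
    // Walk from the least significant byte (end of string) to the most significant.
    for ($i = $len - 1; $i >= 0; $i--) {
        $sum = ord($a[$i]) + ord($b[$i]) + $carry; // at most 255 + 255 + 1
        $result[$i] = chr($sum & 0xFF);            // low 8 bits stay in this byte
        $carry = $sum >> 8;                        // 0 or 1 moves to the next byte
    }
    return $result; // a final carry is discarded (fixed-width wraparound)
}
```

For example, bin2hex(addBytes(hex2bin('00ff'), hex2bin('0001'))) gives 0100. Because each partial sum is at most 511, it never comes near PHP's native integer limit, which is why working one byte at a time sidesteps the overflow question.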

The PHP GMP extension will be better for this. As an added bonus, you can use it to do your decimal-to-binary conversion, like so:
gmp_strval(gmp_init($n, 10), 2);

There are already various classes available for this so you may wish to look at them before writing your own solution (if indeed writing your own solution is still needed).

As far as I can tell, the bcmath extension is the one you'll want. The data in the PHP manual is a little sparse, but you ought to be able to set the precision to exactly what you need with the bcscale() function, or with the optional third parameter of most of the other bcmath functions. I'm not too sure about the binary strings, but a bit of googling suggests you ought to be able to do that with the pack() function.

I implemented the following PEMDAS-compliant BC evaluator, which may be useful to you.
function BC($string, $precision = 32)
{
if (extension_loaded('bcmath') === true)
{
if (is_array($string) === true)
{
if ((count($string = array_slice($string, 1)) == 3) && (bcscale($precision) === true))
{
$callback = array('^' => 'pow', '*' => 'mul', '/' => 'div', '%' => 'mod', '+' => 'add', '-' => 'sub');
if (array_key_exists($operator = current(array_splice($string, 1, 1)), $callback) === true)
{
$x = 1;
$result = @call_user_func_array('bc' . $callback[$operator], $string);
if ((strcmp('^', $operator) === 0) && (($i = fmod(array_pop($string), 1)) > 0))
{
$y = BC(sprintf('((%1$s * %2$s ^ (1 - %3$s)) / %3$s) - (%2$s / %3$s) + %2$s', $string = array_shift($string), $x, $i = pow($i, -1)));
do
{
$x = $y;
$y = BC(sprintf('((%1$s * %2$s ^ (1 - %3$s)) / %3$s) - (%2$s / %3$s) + %2$s', $string, $x, $i));
}
while (BC(sprintf('%s > %s', $x, $y)));
}
if (strpos($result = bcmul($x, $result), '.') !== false)
{
$result = rtrim(rtrim($result, '0'), '.');
if (preg_match(sprintf('~[.][9]{%u}$~', $precision), $result) > 0)
{
$result = bcadd($result, (strncmp('-', $result, 1) === 0) ? -1 : 1, 0);
}
else if (preg_match(sprintf('~[.][0]{%u}[1]$~', $precision - 1), $result) > 0)
{
$result = bcmul($result, 1, 0);
}
}
return $result;
}
return intval(version_compare(call_user_func_array('bccomp', $string), 0, $operator));
}
$string = array_shift($string);
}
$string = str_replace(' ', '', str_ireplace('e', ' * 10 ^ ', $string));
while (preg_match('~[(]([^()]++)[)]~', $string) > 0)
{
$string = preg_replace_callback('~[(]([^()]++)[)]~', __FUNCTION__, $string);
}
foreach (array('\^', '[\*/%]', '[\+-]', '[<>]=?|={1,2}') as $operator)
{
while (preg_match(sprintf('~(?<![0-9])(%1$s)(%2$s)(%1$s)~', '[+-]?(?:[0-9]++(?:[.][0-9]*+)?|[.][0-9]++)', $operator), $string) > 0)
{
$string = preg_replace_callback(sprintf('~(?<![0-9])(%1$s)(%2$s)(%1$s)~', '[+-]?(?:[0-9]++(?:[.][0-9]*+)?|[.][0-9]++)', $operator), __FUNCTION__, $string, 1);
}
}
}
return (preg_match('~^[+-]?[0-9]++(?:[.][0-9]++)?$~', $string) > 0) ? $string : false;
}
It deals with rounding errors automatically; just set the precision to the number of digits you need.

Regex expression for matching all duplicate substrings of any length

Let's say we have a string: "abcbcdcde"
I want to identify all substrings that are repeated in this string using regex (i.e. no brute-force iterative loops).
For the above string, the result set would be: {"b", "bc", "c", "cd", "d"}
I must confess that my regex is far more rusty than it should be for someone with my experience. I tried using a backreference, but that'll only match consecutive duplicates. I need to match all duplicates, consecutive or otherwise.
In other words, I want to match any character(s) that appears for the >= 2nd time. If a substring occurs 5 times, then I want to capture each of occurrences 2-5. Make sense?
This is my pathetic attempt thus far:
preg_match_all( '/(.+)(.*)\1+/', $string, $matches ); // Way off!
I tried playing with look-aheads but I'm just butchering it. I'm doing this in PHP (PCRE) but the problem is more or less language-agnostic. It's a bit embarrassing that I'm finding myself stumped on this.

Your problem is recursi ... you know what, forget about recursion! =p it wouldn't really work well in PHP and the algorithm is pretty clear without it as well.
function find_repeating_sequences($s)
{
    $res = array();
    while ($s) {
        $i = 1; $pat = $s[0];
        while (false !== strpos($s, $pat, $i)) {
            $res[$pat] = 1;
            // expand pattern and try again
            $pat .= $s[$i++];
        }
        // move the string forward
        $s = substr($s, 1);
    }
    return array_keys($res);
}
Out of interest, I wrote Tim's answer in PHP as well:
function find_repeating_sequences_re($s)
{
    $res = array();
    preg_match_all('/(?=(.+).*\1)/', $s, $matches);
    foreach ($matches[1] as $match) {
        $length = strlen($match);
        if ($length > 1) {
            for ($i = 0; $i < $length; ++$i) {
                for ($j = $i; $j < $length; ++$j) {
                    $res[substr($match, $i, $j - $i + 1)] = 1;
                }
            }
        } else {
            $res[$match] = 1;
        }
    }
    return array_keys($res);
}
I've let them fight it out in a small benchmark of 800 bytes of random data:
$data = base64_encode(openssl_random_pseudo_bytes(600));
Each code is run for 10 rounds and the execution time is measured. The results?
Pure PHP - 0.014s (10 runs)
PCRE - 40.86s <-- ouch!
It gets weirder when you look at 24k bytes (or anything above 1k really):
Pure PHP - 4.565s (10 runs)
PCRE - 0.232s <-- WAT?!
It turns out that the regular expression broke down after 1k characters and so the $matches array was empty. These are my .ini settings:
pcre.backtrack_limit => 1000000 => 1000000
pcre.recursion_limit => 100000 => 100000
It's not clear to me how a backtrack or recursion limit would have been hit after only 1k of characters. But even if those settings are "fixed" somehow, the results are still obvious, PCRE doesn't seem to be the answer.
I suppose writing this in C would speed it up somewhat, but I'm not sure to what degree.
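A side note on the empty $matches: when PCRE gives up, preg_match_all() returns false rather than a match count, and preg_last_error() reports the reason. A hedged sketch of how the regex variant could surface that instead of silently returning nothing (find_repeats_checked is a hypothetical name, and which error code actually shows up depends on the PCRE build and settings):

```php
<?php
// Hypothetical wrapper: same lookahead pattern, but a PCRE failure is reported.
function find_repeats_checked(string $s): array
{
    $count = preg_match_all('/(?=(.+).*\1)/', $s, $matches);
    if ($count === false) {
        // Typical codes include PREG_BACKTRACK_LIMIT_ERROR and PREG_JIT_STACKLIMIT_ERROR.
        throw new RuntimeException('PCRE failed with error code ' . preg_last_error());
    }
    // One capture per start position; remove duplicates and reindex.
    return array_values(array_unique($matches[1]));
}
```

For the question's string "abcbcdcde" this returns bc, c, cd and d, the same set as the Python example in Tim's answer.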
Update
With some help from hakre's answer I put together an improved version that increases performance by ~18% after optimizing the following:
Remove the substr() calls in the outer loop to advance the string pointer; this was a left over from my previous recursive incarnations.
Use the partial results as a positive cache to skip strpos() calls inside the inner loop.
And here it is, in all its glory (:
function find_repeating_sequences3($s)
{
    $res = array();
    $p = 0;
    $len = strlen($s);
    while ($p != $len) {
        $pat = $s[$p]; $i = ++$p;
        while ($i != $len) {
            if (!isset($res[$pat])) {
                if (false === strpos($s, $pat, $i)) {
                    break;
                }
                $res[$pat] = 1;
            }
            // expand pattern and try again
            $pat .= $s[$i++];
        }
    }
    return array_keys($res);
}

You can't get the required result in a single regex because a regex will match either greedily (finding bc...bc) or lazily (finding b...b and c...c), but never both. (In your case, it does find c...c, but only because c is repeated twice.)
But once you've found a repeated substring of length > 1, it logically follows that all the smaller "substrings of that substring" must also be repeated. If you want to get them spelled out for you, you need to do this separately.
Taking your example (using Python because I don't know PHP):
>>> results = set(m.group(1) for m in re.finditer(r"(?=(.+).*\1)", "abcbcdcde"))
>>> results
{'d', 'cd', 'bc', 'c'}
You could then go and apply the following function to each of your results:
def substrings(s):
    return [s[start:stop] for start in range(len(s)-1)
            for stop in range(start+1, len(s)+1)]
For example:
>>> substrings("123456")
['1', '12', '123', '1234', '12345', '123456', '2', '23', '234', '2345', '23456',
'3', '34', '345', '3456', '4', '45', '456', '5', '56']

The closest I can get is /(?=(.+).*\1)/
The purpose of the lookahead is to allow the same characters to be matched more than once (for instance, c and cd). However, for some reason it doesn't seem to be getting the b...

Interesting question. I basically took the function in Jack's answer and tried to see whether the number of tests could be reduced.
I first tried searching only half the string, but it turned out that building the search pattern with substr each time was far too expensive. Appending one character per iteration, as Jack's answer does, appears to be much better. Then I ran out of time, so I could not look into it further.
However, while looking for such an alternative implementation, I at least found that some of the differences in the algorithm I had in mind could be applied to Jack's function as well:
There is no need to cut off the beginning of the string in each outer iteration, as the search is already done with offsets.
If the rest of the subject to search for repetitions is shorter than the repetition needle, you do not need to search for the needle.
If the needle has already been searched for, you don't need to search for it again.
Note: This is a memory trade-off. If you have many repetitions, you will use a similar amount of memory. However, if you have few repetitions, then this variant uses more memory than before.
The function:
function find_repeating_sequences($string) {
    $result = array();
    $start = 0;
    $max = strlen($string);
    while ($start < $max) {
        $pat = $string[$start];
        $i = ++$start;
        while ($max - $i > 0) {
            $found = isset($result[$pat]) ? $result[$pat] : false !== strpos($string, $pat, $i);
            if (!$result[$pat] = $found) break;
            // expand pattern and try again
            $pat .= $string[$i++];
        }
    }
    return array_keys(array_filter($result));
}
So just see this as an addition to Jack's answer.

simplest, shortest way to count capital letters in a string with php?

I am looking for the shortest, simplest and most elegant way to count the number of capital letters in a given string.

function count_capitals($s) {
    return mb_strlen(preg_replace('![^A-Z]+!', '', $s));
}

$str = "AbCdE";
preg_match_all("/[A-Z]/", $str); // 3
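Note that [A-Z] only counts ASCII capitals. If accented or other non-ASCII capitals matter, one option is the Unicode property \p{Lu} together with the u modifier. This is a sketch that assumes the input is valid UTF-8; countUpper is a hypothetical name.

```php
<?php
// Hypothetical helper: counts characters in the Unicode "uppercase letter" category.
function countUpper(string $s): int
{
    // With the /u modifier the subject is treated as UTF-8 and \p{Lu}
    // matches any uppercase letter, not just A-Z.
    $n = preg_match_all('/\p{Lu}/u', $s);
    if ($n === false) {
        // preg_match_all() fails on malformed UTF-8 when /u is used.
        throw new InvalidArgumentException('input is not valid UTF-8');
    }
    return $n;
}
```

countUpper('AbCdE') gives 3, and so does a string such as "\u{C0}b\u{C7}dE", where the first and third letters are accented capitals.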

George Garchagudashvili's solution is amazing, but it fails if the lowercase letters contain diacritics or accents.
So I made a small fix to his version so that it also works with accented lowercase letters:
public static function countCapitalLetters($string){
    $lowerCase = mb_strtolower($string);
    return strlen($lowerCase) - similar_text($string, $lowerCase);
}
You can find this method and lots of other string common operations at the turbocommons library:
https://github.com/edertone/TurboCommons/blob/70a9de1737d8c10e0f6db04f5eab0f9c4cbd454f/TurboCommons-Php/src/main/php/utils/StringUtils.php#L373
EDIT 2019
The method to count capital letters in turbocommons has evolved to a method that can count upper case and lower case characters on any string. You can check it here:
https://github.com/edertone/TurboCommons/blob/1e230446593b13a272b1d6a2903741598bb11bf2/TurboCommons-Php/src/main/php/utils/StringUtils.php#L391
Read more info here:
https://turbocommons.org/en/blog/2019-10-15/count-capital-letters-in-string-javascript-typescript-php
And it can also be tested online here:
https://turbocommons.org/en/app/stringutils/count-capital-letters

I'd give another solution, maybe not elegant, but helpful:
$mixed_case = "HelLo wOrlD";
$lower_case = strtolower($mixed_case);
$similar = similar_text($mixed_case, $lower_case);
echo strlen($mixed_case) - $similar; // 4

It's not the shortest, but it is arguably the simplest as a regex doesn't have to be executed. Normally I'd say this should be faster as the logic and checks are simple, but PHP always surprises me with how fast and slow some things are when compared to others.
function capital_letters($s) {
    $u = 0;
    $d = 0;
    $n = strlen($s);
    for ($x=0; $x<$n; $x++) {
        $d = ord($s[$x]);
        if ($d > 64 && $d < 91) {
            $u++;
        }
    }
    return $u;
}
echo 'caps: ' . capital_letters('HelLo2') . "\n";

What's the cleanest way to convert a 5-7 digit number into xxx/xxx/xxx format in php?

I have sets of 5, 6 and 7 digit numbers. I need them to be displayed in the 000/000/000 format. So for example:
12345 would be displayed as 000/012/345
and
9876543 would be displayed as 009/876/543
I know how to do this in a messy way, involving a series of if/else statements and strlen functions, but there has to be a cleaner way involving regex that I'm not seeing.

sprintf and modulo is one option
function formatMyNumber($num)
{
    return sprintf('%03d/%03d/%03d',
        $num / 1000000,
        ($num / 1000) % 1000,
        $num % 1000);
}

$padded = str_pad($number, 9, '0', STR_PAD_LEFT);
$split = str_split($padded, 3);
$formatted = implode('/', $split);

You asked for a regex solution, and I love playing with them, so here is a regex solution!
I show it for educational (and fun) purpose only, just use Adam's solution, clean, readable and fast.
function FormatWithSlashes($number)
{
    return substr(preg_replace('/(\d{3})?(\d{3})?(\d{3})$/', '$1/$2/$3',
            '0000' . $number),
        -11, 11);
}
$numbers = Array(12345, 345678, 9876543);
foreach ($numbers as $val)
{
    $r = FormatWithSlashes($val);
    echo "<p>$r</p>";
}

OK, people are throwing stuff out, so I will too!
number_format would be great, because it accepts a thousands separator, but it doesn't do padding zeroes like sprintf and the like. So here's what I came up with for a one-liner:
function fmt($x) {
    return substr(number_format($x+1000000000, 0, ".", "/"), 2);
}

Minor improvement to PhiLho's suggestion:
You can avoid the substr by changing the regex to:
function FormatWithSlashes($number)
{
    return preg_replace('/^0*(\d{3})(\d{3})(\d{3})$/', '$1/$2/$3',
        '0000' . $number);
}
I also removed the ? after each of the first two capture groups because, when given a 5, 6, or 7 digit number (as specified in the question), this will always have at least 9 digits to work with. If you want to guard against the possibility of receiving a smaller input number, run the regex against '000000000' . $number instead.
Alternately, you could use
substr('0000' . $number, -9, 9);
and then splice the slashes in at the appropriate places with substr_replace, which I suspect may be the fastest way to do this (no need to run regexes or do division), but that's really just getting into pointless optimization, as any of the solutions presented will still be much faster than establishing a network connection to the server.
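For completeness, the substr_replace idea could look roughly like this. It is only a sketch of the approach just described, not a benchmarked recommendation, and formatWithSubstr is a made-up name.

```php
<?php
// Hypothetical helper: pad, keep the last 9 digits, then insert the slashes.
function formatWithSubstr(int $number): string
{
    // '0000' is enough padding for the 5-7 digit inputs from the question.
    $s = substr('0000' . $number, -9, 9);
    // A length of 0 makes substr_replace insert instead of overwrite.
    // Insert the right-hand slash first so the left offset stays valid.
    $s = substr_replace($s, '/', 6, 0);
    $s = substr_replace($s, '/', 3, 0);
    return $s;
}
```

formatWithSubstr(12345) gives 000/012/345 and formatWithSubstr(9876543) gives 009/876/543.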

This would be how I would write it if using Perl 5.10 .
use 5.010;
sub myformat(_;$){
    # prepend with zeros
    my $_ = 0 x ( 9-length($_[0]) ) . $_[0];
    my $join = $_[1] // '/'; # using the 'defined or' operator `//`
    # m// in a list context returns ($1,$2,$3,...)
    join $join, m/ ^ (\d{3}) (\d{3}) (\d{3}) $ /x;
}
Tested with:
$_ = 11111;
say myformat;
say myformat(2222);
say myformat(33333,';');
say $_;
returns:
000/011/111
000/002/222
000;033;333
11111
Back-ported to Perl 5.8:
sub myformat(;$$){
    local $_ = @_ ? $_[0] : $_;
    # prepend with zeros
    $_ = 0 x ( 9-length($_) ) . $_;
    my $join = defined($_[1]) ? $_[1] :'/';
    # m// in a list context returns ($1,$2,$3,...)
    join $join, m/ ^ (\d{3}) (\d{3}) (\d{3}) $ /x;
}

Here's how I'd do it in python (sorry I don't know PHP as well). I'm sure you can convert it.
def convert(num): #num is an integer
    a = str(num)
    s = "0"*(9-len(a)) + a
    return "%s/%s/%s" % (s[:3], s[3:6], s[6:9])
This just pads the number to have length 9, then splits the substrings.
That being said, it seems the modulo answer is a bit better.
