RegEx vs. Manually Parsing String (PHP Performance)

Is there any problem (performance-wise) with manually parsing a string as follows, as opposed to using Regular Expressions or the built in string replacement functions?
for ($i=0;$i<strlen($string);$i++) {
$thisChar = $string[$i];
//do more stuff
}
Thanks!

There are things that are done more efficiently with custom code than with regex.
As long as both have the same big-O complexity and you're not handling humongous strings, readability and maintainability should be an equally important or even more important argument.
For the actual performance just do a benchmark to compare the two solutions.
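A minimal benchmark sketch for that comparison. The task (counting one character), the function names and the test string are all assumptions made up for illustration; substitute the real work and real data. Note that the loop caches strlen() in a variable instead of calling it on every iteration as the loop in the question does.

```php
<?php
// Hypothetical task: count how often one character occurs in a string.

// Manual scan; strlen() is hoisted out of the loop condition.
function count_manual(string $string, string $char): int
{
    $count = 0;
    $length = strlen($string);
    for ($i = 0; $i < $length; $i++) {
        if ($string[$i] === $char) {
            $count++;
        }
    }
    return $count;
}

// Same task through the regex engine.
function count_regex(string $string, string $char): int
{
    return preg_match_all('~' . preg_quote($char, '~') . '~', $string);
}

// Same task through a built-in string function.
function count_builtin(string $string, string $char): int
{
    return substr_count($string, $char);
}

// Run a callable $runs times and return the elapsed seconds.
function time_it(callable $fn, int $runs = 10): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Made-up sample data; use representative input for a real measurement.
$subject = str_repeat('lorem ipsum dolor sit amet ', 5000);

foreach (['count_manual', 'count_regex', 'count_builtin'] as $name) {
    $seconds = time_it(function () use ($name, $subject) {
        return $name($subject, 'o');
    });
    printf("%-14s %.4f s\n", $name, $seconds);
}
```

The absolute numbers depend on the PHP version and the machine, so only the relative ordering on your own data is meaningful.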

Related

Hypothetical concatenation predicament

So I am working on a simple micro language/alternative syntax for PHP.
Its syntax takes a lot from JavaScript and CoffeeScript including a few of my own concepts. I have hand written the parser (no parser generator used) in PHP to convert the code into PHP then execute it. It is more of a proof of concept/learning tool rather than anything else but I'd be lying if I said I didn't want to see it used on an actual project one day.
Anyway here is a little problem I have come across that I thought I would impose on you great intellects:
As you know in PHP the period ( . ) is used for string concatenation. However in JavaScript it is used for method chaining.
Now one thing that annoys me in PHP is having to use that bloody arrow (->) for my method chains, so I went the JavaScript way and implemented the period (.) for use with objects.
(I think you can see the problem already)
Because I'm currently only writing a 'dumb' parser that merely does a huge search and replace, there is no way to distinguish whether a period (.) is being used for concatenation or for method chaining.
"So if you are trying to be like JavaScript, just use the addition (+) operator Franky!", I hear you scream. Well I would but because the addition (+) operator is used for math in PHP I would merely be putting myself in the same situation.
Unless I can make my parser smart enough (with a crap load of work) to know that when the addition (+) operator is working with integers then don't convert it into a period (.) for concatenation I am pretty much screwed.
But here is the cool thing: because this is pretty much a new language, I don't have to use the period or addition operator for concatenation.
So my question is: If I was to decide to introduce a new method of string concatenation, what character would make the most sense?

Does it have to be one character? .. could work!
Any myriad of combinations, like ~~ or >: even!

If you don't want to use + or ., then I would recommend ^ because that's used in some other languages for string concatenation and I don't believe that it's used for anything in PHP.
Edit: It's been pointed out that it's used for XOR. One option would be to use ^ anyway since bitwise XOR is not commonly used and then to map something else like ^^ to XOR. Another option would be to use .. for concatenation. The problem is that the single characters are mostly taken.
Another option would be to use +, but map it to a function which concatenates when one argument is a string and adds otherwise. In order to not break things which rely on strings which are numbers being treated as their values, we should probably treat numeric strings as numbers for these purposes. Here's the function that I would use.
function smart_add($arg1, $arg2) {
    // is_numeric() is a plain function in PHP, not a method on the value
    if (is_numeric($arg1) && is_numeric($arg2)) {
        return $arg1 + $arg2;
    } else {
        return $arg1 . $arg2;
    }
}
Then a + b + c + d just gets turned into smart_add(smart_add(smart_add(a,b),c),d)
This may not be perfect in all cases, but it should work pretty well most of the time and has clear rules for use.
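To illustrate the nesting, here is a small sketch that folds any number of operands from left to right. smart_add_all() is a hypothetical helper name, and smart_add() is repeated (using the function-call form is_numeric($x)) so that the snippet runs on its own.

```php
<?php
// Add when both operands are numeric, concatenate otherwise.
function smart_add($arg1, $arg2)
{
    if (is_numeric($arg1) && is_numeric($arg2)) {
        return $arg1 + $arg2;
    }
    return $arg1 . $arg2;
}

// Left fold: smart_add_all(a, b, c, d) computes
// smart_add(smart_add(smart_add(a, b), c), d).
function smart_add_all(...$operands)
{
    $first = array_shift($operands);
    return array_reduce($operands, 'smart_add', $first);
}

var_dump(smart_add_all(1, 2, 3));       // int(6)
var_dump(smart_add_all('foo', 'bar'));  // string(6) "foobar"
var_dump(smart_add_all(1, 2, 'px'));    // string(3) "3px"
var_dump(smart_add_all('1', '2'));      // int(3)
```

Because the fold runs left to right, the result depends on operand order: smart_add_all('a', 1, 2) gives "a12", while smart_add_all(1, 2, 'a') gives "3a".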

So my question is: If I was to decide to introduce a new method of
string concatenation, what character would make the most sense?
As you're well aware, you'll need to choose a character that is not already used as one of PHP's operators. Since string concatenation is a common operation, I would try to avoid characters that require pressing SHIFT to type, as those will be a hindrance.
Instead of trying to assign one character for string concatenation (as most are already in use), perhaps you should define your own syntax for string concatenation (or any other operation you need to overwrite with a different operator), as a shorthand operator (sort of). Something like:
[$string, $string]
Should be easy to pick up by a parser and form the resulting concatenated string.
Edit: I should also note that whether you're using literal strings or variables, there's no way (as far as I know) to confuse this syntax with any other PHP functionality, since a comma inside the brackets is invalid for array access. (This only holds for PHP versions before 5.4; from 5.4 on, [...] is also the short array syntax.) So, all of the following would still be recognized as string concatenation and not something else in PHP.
["stack", "overflow"]
["stack", $overflow]
[$stack, $overflow]
Edit: Since this conflicts with JSON notation, the following alternative variations exist:
Changing the delimiter
Omitting the delimiter
Example:
[$stack $overflow $string $concatenation] // Use nothing (but really need space)

Validate user inputted PHP code before passing it to eval()

Before passing a string to eval() I would like to make sure the syntax is correct and allow:
Two functions: a() and b()
Four operators: /*-+
Brackets: ()
Numbers: 1.2, -1, 1
How can I do this, maybe it has something to do with PHP Tokenizer?
I'm actually trying to make a simple formula interpreter so a() and b() will be replaced by ln() and exp(). I don't want to write a tokenizer and parser from scratch.

As far as validation is concerned, the following character tokens are valid:
operator: [/*+-]
funcs: (a\(|b\()
brackets: [()]
numbers: \d+(\.\d+)?
space: [ ]
A simple validation could then check whether the input string consists only of combinations of these patterns. Because the funcs token is quite precise and does not clash much with the other tokens, this validation should be fairly stable without having to implement any syntax/grammar yet:
$tokens = array(
'operator' => '[/*+-]',
'funcs' => '(a\(|b\()',
'brackets' => '[()]',
'numbers' => '\d+(\.\d+)?',
'space' => '[ ]',
);
$pattern = '';
foreach($tokens as $token)
{
$pattern .= sprintf('|(?:%s)', $token);
}
$pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));
echo $pattern;
The input validates only if the whole string matches the token-based pattern. It might still be syntactically wrong PHP, but you can ensure it is built only from the specified tokens:
~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~
If you build the pattern dynamically - as in the example - you're able to modify your language tokens later on more easily.
Additionally, this can be the first step towards your own tokenizer / lexer. The token stream can then be passed on to a parser, which can syntactically validate and interpret it. That's the part user187291 wrote about.
As an alternative to writing a full lexer and parser, if you need to validate the syntax you can formulate your grammar in terms of tokens as well and then run a regex-based grammar check on the token representation of the input.
The tokens are the words you use in your grammar. You will need to describe parentheses and function definitions more precisely than in the tokens above, and the tokenizer should follow clearer rules about which token supersedes another. The concept is outlined in another question of mine. It also uses regex for grammar formulation and syntax validation, but it still does not parse. In your case eval would be the parser you're making use of.
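The token whitelist can be wrapped into a small check function, sketched below. The name is_allowed_formula() is hypothetical. Two details differ from the pattern shown above and are my own assumptions: the inner groups are non-capturing, and the D modifier is added so that $ does not also match before a trailing newline.

```php
<?php
// Hypothetical helper: true when $input consists only of whitelisted tokens.
function is_allowed_formula(string $input): bool
{
    $tokens = array(
        'operator' => '[/*+-]',
        'funcs'    => '(?:a\(|b\()',
        'brackets' => '[()]',
        'numbers'  => '\d+(?:\.\d+)?',
        'space'    => '[ ]',
    );

    // Wrap every token in its own group and join them as alternatives.
    $alternatives = array();
    foreach ($tokens as $token) {
        $alternatives[] = sprintf('(?:%s)', $token);
    }
    $pattern = sprintf('~^(?:%s)*$~D', implode('|', $alternatives));

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $input) === 1;
}

var_dump(is_allowed_formula('a(1.2) + b(3) / 2')); // bool(true)
var_dump(is_allowed_formula('echo 1'));            // bool(false)
var_dump(is_allowed_formula('+++a())))333((('));   // bool(true)
```

The last call shows the limit of this approach: the input is made of allowed tokens only, so it passes, even though it is not a well-formed expression.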

Parser generators have indeed already been written for PHP, and "LIME" in particular comes with the typical "calculator" example, which would be an obvious starting point for your "mini language": http://sourceforge.net/projects/lime-php/
It's been years since I last played with LIME, but it was already mature & stable then.
Notes:
1) Using a full-on parser generator gives you the advantage of avoiding PHP eval() entirely if you wish - you can make LIME emit a parser which effectively provides an "eval" function for expressions written in your mini language (with validation baked in). This gives you the additional advantage of allowing you to add support for new functions, as needed.
2) It may seem like overkill at first to use a parser generator for such an apparently small task, but once you get the examples working you'll be impressed by how easy it is to modify and extend them. And it's very easy to underestimate the difficulty of writing a bug-free parser (even a "trivial" one) from scratch.

Yes, you need the Tokenizer, or something similar, but it's only part of the story. A tokenizer (more commonly called a "lexer") can only read and recognize the elements of an expression; it has no means to detect that something like "foo()+*bar)" is invalid. You need the second part, called a parser, which arranges the tokens into a kind of tree (called an "AST") or provides an error message when it fails to do so. Ironically, once you've got a tree, "eval" is not needed anymore: you can evaluate your expression directly from the tree.
I would recommend that you write a parser by hand, because it's a very useful exercise and a lot of fun. Recursive descent parsers are quite easy to program.
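A minimal recursive descent sketch for the formula language from the question. The class name FormulaParser and the grammar in the comment are assumptions derived from the question's whitelist, with a() mapped to log() (natural logarithm) and b() to exp(). Instead of building a tree first, it evaluates while parsing, which keeps the example short.

```php
<?php
// Assumed grammar:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := '-' factor | NUMBER | ('a' | 'b') '(' expr ')' | '(' expr ')'
final class FormulaParser
{
    private $tokens;
    private $pos = 0;

    public function __construct(string $input)
    {
        // Lexer: numbers, the names a/b, symbols; any other character
        // becomes a token of its own so the parser can reject it.
        preg_match_all('~\d+(?:\.\d+)?|[ab]|[-+*/()]|\S~', $input, $matches);
        $this->tokens = $matches[0];
    }

    public function parse(): float
    {
        $value = $this->expr();
        if ($this->peek() !== null) {
            throw new InvalidArgumentException('Unexpected token: ' . $this->peek());
        }
        return $value;
    }

    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function expect(string $token): void
    {
        if ($this->peek() !== $token) {
            throw new InvalidArgumentException("Expected '$token'");
        }
        $this->pos++;
    }

    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->tokens[$this->pos++];
            $right = $this->term();
            $value = ($op === '+') ? $value + $right : $value - $right;
        }
        return $value;
    }

    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->tokens[$this->pos++];
            $right = $this->factor();
            $value = ($op === '*') ? $value * $right : $value / $right;
        }
        return $value;
    }

    private function factor(): float
    {
        $token = $this->peek();
        if ($token === null) {
            throw new InvalidArgumentException('Unexpected end of input');
        }
        $this->pos++;
        if ($token === '-') {
            return -$this->factor();
        }
        if (is_numeric($token)) {
            return (float) $token;
        }
        if ($token === 'a' || $token === 'b') {
            $this->expect('(');
            $arg = $this->expr();
            $this->expect(')');
            return $token === 'a' ? log($arg) : exp($arg);
        }
        if ($token === '(') {
            $value = $this->expr();
            $this->expect(')');
            return $value;
        }
        throw new InvalidArgumentException("Unexpected token: $token");
    }
}

var_dump((new FormulaParser('1 + 2 * 3'))->parse());   // float(7)
var_dump((new FormulaParser('(1 + 2) * 3'))->parse()); // float(9)
```

Invalid input such as "foo()+*bar)" ends in an InvalidArgumentException instead of reaching eval(). Division by zero is not handled in this sketch.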

You could use token_get_all(), inspect each token, and abort at the first invalid token.
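A sketch of that idea, assuming the tokenizer extension is available. The function name has_only_allowed_tokens() and the whitelist are made up for the question's case. It only checks which tokens occur, not whether they form a valid expression, and PHP's number tokens are broader than the format in the question (they also cover forms such as 0x1F or 1e5).

```php
<?php
// Hypothetical whitelist check built on token_get_all().
function has_only_allowed_tokens(string $formula): bool
{
    $allowedSymbols = ['(', ')', '+', '-', '*', '/'];
    $allowedFunctions = ['a', 'b'];

    // token_get_all() only tokenizes PHP code after an opening tag.
    foreach (token_get_all('<?php ' . $formula) as $token) {
        if (is_string($token)) {
            // Single-character tokens come back as plain strings.
            if (!in_array($token, $allowedSymbols, true)) {
                return false;
            }
            continue;
        }
        list($id, $text) = $token;
        switch ($id) {
            case T_OPEN_TAG:
            case T_WHITESPACE:
            case T_LNUMBER:
            case T_DNUMBER:
                break;
            case T_STRING:
                // Identifiers: only the whitelisted function names pass.
                if (!in_array($text, $allowedFunctions, true)) {
                    return false;
                }
                break;
            default:
                // Variables, keywords, strings, comments and so on.
                return false;
        }
    }
    return true;
}

var_dump(has_only_allowed_tokens('a(1.2) + b(-1)')); // bool(true)
var_dump(has_only_allowed_tokens('echo 1'));         // bool(false)
var_dump(has_only_allowed_tokens('system("ls")'));   // bool(false)
```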

hakre's answer, using regexes, is a nice solution, but it is a wee bit complicated. Handling a whitelist of functions also becomes rather messy, and if this does go wrong it could have a very nasty effect on your system.
Is there a reason you don't use the javascript 'eval' instead?

PHP dealing with huge string

I have to replace xmlns with ns in my incoming XML in order to fix SimpleXMLElement's xpath() function. Most functions do not have a performance problem, but there always seems to be an overhead as the string grows.
E.g. preg_replace on a 2 MB string takes 50ms to process, even if I limit the replaces to 1 and the replace is done at the very beginning.
If I substr the first few characters and just replace that part, it is slightly faster, but that is not really what I want.
Is there any PHP method that would perform better in my problem? And if there is no option, could a simple php extension help, that just does Replace => SimpleXMLElement in C?

If you know exactly where the offending "x", "m" and "l" are, you can just use something like $xml[$x_pos] = ' '; $xml[$m_pos] = ' '; $xml[$l_pos] = ' ' to transform them into spaces. Or transform them into ns___ (where _ = space).

You're always going to get an overhead when trying to do this: you're dealing with a char array and trying to replace multiple matching elements of the array (i.e. words).
50ms is not much of an overhead, unless (as I suspect) you're trying to do this in a loop?

50ms sounds pretty reasonable to me, for something like this. The requirement itself smells of something being wrong.
Is there any particular reason that you're using regular expressions? Why do people keep jumping to the overkill regex solution?
There is a bog-standard string replace function called str_replace that may do what you want in a fraction of the time (though whether this is right for you depends on how complex your search/replace is).
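If only the first occurrence has to change, one way to skip the regex engine is strpos() followed by substr_replace(), sketched below. replace_first() is a hypothetical helper and the sample XML is made up. It still produces a new copy of the string, so whether it beats preg_replace on a 2 MB input would need to be measured.

```php
<?php
// Replace only the first occurrence of $search, without the regex engine.
function replace_first(string $subject, string $search, string $replace): string
{
    $pos = strpos($subject, $search);
    if ($pos === false) {
        return $subject; // nothing to replace
    }
    return substr_replace($subject, $replace, $pos, strlen($search));
}

// Made-up sample document.
$xml = '<root xmlns="http://example.com/ns"><item/></root>';
echo replace_first($xml, 'xmlns=', 'ns='), "\n";
// <root ns="http://example.com/ns"><item/></root>
```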

From the PHP source, as we can see, for example here:
http://svn.php.net/repository/php/php-src/branches/PHP_5_2/ext/standard/string.c
I don't see any copies, but I'm no expert in C. On the other hand, there are many convert-to-string calls, which at first sight could copy values. If they do copy values, then we are in trouble here.
Only if we are in trouble:
Try to reinvent the str_replace wheel with character-by-character processing. For example, take the string $somestring = "somevalue". In PHP we can access its characters by index: echo $somestring[0] gives "s" and echo $somestring[2] gives "m" (the older $somestring{0} form was removed in PHP 8). I'm not sure about this approach, but it is an option if the official implementations don't use references, as they should.

php - Is strpos the fastest way to search for a string in a large body of text?

if (strpos(htmlentities($storage->getMessage($i)),'chocolate'))
Hi, I'm using gmail oauth access to find specific text strings in email addresses. Is there a way to find text instances quicker and more efficiently than using strpos in the above code? Should I be using a hash technique?

According to the PHP manual, yes: strpos() is the quickest way to determine if one string contains another.
Note:
If you only want to determine if a particular needle occurs within haystack,
use the faster and less memory intensive function strpos() instead.
This is quoted time and again in any php.net article about other string comparators (I pulled this one from strstr())
Although there are two changes that should be made to your statement.
if (strpos($storage->getMessage($i),'chocolate') !== FALSE)
This is because if(0) evaluates to false (and therefore doesn't run), while strpos() can return 0 if the needle is at the very beginning (position 0) of the haystack. Also, removing htmlentities() will make your code run a lot faster. All that htmlentities() does is replace certain characters with their appropriate HTML equivalent. For instance, it replaces every & with &amp;
As you can imagine, checking every character in a string individually and replacing many of them takes extra memory and processor power. Not only that, but it's unnecessary if you plan on just doing a text comparison. For instance, compare the following statements:
strpos('Billy & Sally', '&'); // 6
strpos('Billy &amp; Sally', '&'); // 6
strpos('Billy & Sally', 'S'); // 8
strpos('Billy &amp; Sally', 'S'); // 12
Or, in the worst case, you may even cause something true to evaluate to false.
strpos('<img src...', '<'); // 0
strpos('&lt;img src...','<'); // FALSE
In order to circumvent this you'd end up using even more HTML entities.
strpos('&lt;img src...', '&lt;'); // 0
But this, as you can imagine, is not only annoying to code but gets redundant. You're better off excluding HTML entities entirely. Usually HTML entities are only used when you're outputting text, not when comparing.
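The two points of this answer (the strict comparison and the effect of htmlentities() on positions) can be checked with a short script. The sample strings are made up for illustration.

```php
<?php
$plain   = 'Billy & Sally';
$encoded = htmlentities($plain); // "Billy &amp; Sally"

var_dump(strpos($plain, 'S'));   // int(8)
var_dump(strpos($encoded, 'S')); // int(12)

$html = '<img src="x.png">';
var_dump(strpos($html, '<'));               // int(0)
var_dump(strpos(htmlentities($html), '<')); // bool(false)

// A loose check treats position 0 as "not found"; the strict check does not.
var_dump((bool) strpos($html, '<'));    // bool(false)
var_dump(strpos($html, '<') !== false); // bool(true)
```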

strpos is likely to be faster than preg_match and the alternatives in this case. The best idea would be to run some benchmarks of your own with real example data and see what works best for your needs, although that may be overdoing it. Don't worry too much about performance until it starts to become a problem.

strpos() returns the start position of the first occurrence of the string. If there is no match it returns false (not null), so is_null() cannot be used for this check; compare strictly against false instead.
if (strpos($storage->getMessage($i), 'chocolate') !== false) {}

PHP preg_match Math Function

I'm writing a script that will allow a user to input a string that is a math statement, to then be evaluated. However, I have hit a roadblock: I cannot figure out how, using preg_match, to disallow statements that have variables in them.
Using this, $calc = create_function("", "return (" . $string . ");" ); $calc();, allows users to input a string that will be evaluated, but it crashes whenever something like echo 'foo'; is put in place of the variable $string.
I've seen this post, but it does not allow for math functions inside the string, such as $string = 'sin(45)';.

For a stack-based parser implemented in PHP that uses Dijkstra's shunting-yard algorithm to convert infix to postfix notation, with support for functions with a varying number of arguments, you can look at the source of the PHPExcel calculation engine (which does not use eval).
Also have a look at the responses to this question
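For orientation, here is a much reduced sketch of the shunting-yard idea. It is not PHPExcel's code: the function names to_postfix() and eval_postfix() are hypothetical, and the sketch only handles numbers, the four basic left-associative operators and parentheses (no functions, no unary minus).

```php
<?php
// Convert an infix token list to postfix (RPN) with the shunting-yard algorithm.
function to_postfix(array $tokens): array
{
    $precedence = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];
    $output = [];
    $stack = [];

    foreach ($tokens as $token) {
        if (is_numeric($token)) {
            $output[] = $token;
        } elseif (isset($precedence[$token])) {
            // Pop operators of higher or equal precedence first.
            while ($stack && isset($precedence[end($stack)])
                && $precedence[end($stack)] >= $precedence[$token]) {
                $output[] = array_pop($stack);
            }
            $stack[] = $token;
        } elseif ($token === '(') {
            $stack[] = $token;
        } elseif ($token === ')') {
            while ($stack && end($stack) !== '(') {
                $output[] = array_pop($stack);
            }
            if (!$stack) {
                throw new InvalidArgumentException('Unbalanced parentheses');
            }
            array_pop($stack); // discard the "("
        } else {
            throw new InvalidArgumentException("Unknown token: $token");
        }
    }
    while ($stack) {
        $top = array_pop($stack);
        if ($top === '(') {
            throw new InvalidArgumentException('Unbalanced parentheses');
        }
        $output[] = $top;
    }
    return $output;
}

// Evaluate a postfix token list with a value stack.
function eval_postfix(array $postfix): float
{
    $stack = [];
    foreach ($postfix as $token) {
        if (is_numeric($token)) {
            $stack[] = (float) $token;
            continue;
        }
        if (count($stack) < 2) {
            throw new InvalidArgumentException('Malformed expression');
        }
        $right = array_pop($stack);
        $left = array_pop($stack);
        switch ($token) {
            case '+': $stack[] = $left + $right; break;
            case '-': $stack[] = $left - $right; break;
            case '*': $stack[] = $left * $right; break;
            case '/': $stack[] = $left / $right; break;
        }
    }
    if (count($stack) !== 1) {
        throw new InvalidArgumentException('Malformed expression');
    }
    return array_pop($stack);
}

// Simple lexer: numbers, or any single non-space character.
preg_match_all('~\d+(?:\.\d+)?|\S~', '3 + 4 * (2 - 1)', $m);
echo implode(' ', to_postfix($m[0])), "\n"; // 3 4 2 1 - * +
echo eval_postfix(to_postfix($m[0])), "\n"; // 7
```

Input such as echo 'foo'; never reaches an evaluator here, because every token that is not a number, operator or parenthesis raises an exception.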

How complex of a math function do you need to allow? If you only need basic math, then you might be able to get away with only allowing whitespace + the characters 0123456789.+/-* or some such.
In general, however, using the language's eval-type capabilities to just do math is probably a bad idea.

Something like:
^([\d\(\)\+\-*/ ,.]|sin\(|cos\(|sqrt\(|...)+$
would allow only numbers, brackets, math operations and provided math functions. But it won't check if provided expression is valid, so something like +++sin()))333((( would be still accepted.

I wonder if this class would help you? Found that doing a search on Google for "php math expressions".
