I'd like to know which string matching algorithm str_replace uses. I looked through the PHP source code; the function is in ext\standard\string.c, and I found php_char_to_str_ex. Can anyone tell me which algorithm that function implements (that is, which algorithm str_replace is built on)?
I want to implement a highlighting program that uses the Sunday algorithm (a very fast algorithm, or so I'm told).
I thought str_replace might fit my goal, so I tried to analyse it, but my C is poor. Please help.
Short answer: it's just a simple brute-force search.
The str_replace function is really just a forwarder to php_str_replace_common. And for the simple case where the subject is not an array, that in turn calls php_str_replace_in_subject. And again, when the search parameter is just a string, and it's more than 1 character, that calls php_str_to_str_ex.
Looking at the php_str_to_str_ex implementation, there are various special cases that are handled.
If the search string and the replacement string are the same length, it makes the memory handling easier, because you know the result string is going to be the same size as the source string.
If the search string is longer than the source string, you know it's never going to find anything so you can simply return the source string unchanged.
If the search string length is identical to the source string length, then it's just a straight comparison.
But for the most part, it comes down to repeatedly calling php_memnstr to find the next match, and replacing that match with memcpy.
As for the php_memnstr implementation, that just calls C's memchr repeatedly to try and match the first character of the search string, and then memcmp to see if the rest of the string matches.
There's no fancy preprocessing of the search string to optimise repeated searches. It is just a straightforward brute-force search.
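To make the memchr + memcmp loop concrete, here is a rough PHP sketch of the same idea. It is not the C code from the PHP source; bruteForceFind is a hypothetical name, strpos on a single byte stands in for memchr, and substr_compare stands in for memcmp.

```php
<?php
// Hypothetical helper mirroring the memchr + memcmp loop described above.
// Returns the byte offset of the first match at or after $offset, or -1.
// Assumes $offset is non-negative.
function bruteForceFind(string $haystack, string $needle, int $offset = 0): int
{
    $hayLen = strlen($haystack);
    $needleLen = strlen($needle);
    if ($needleLen === 0 || $needleLen > $hayLen) {
        return -1;
    }
    $last = $hayLen - $needleLen; // last offset where a match can still start
    while ($offset <= $last) {
        // "memchr" step: jump to the next occurrence of the needle's first byte
        $pos = strpos($haystack, $needle[0], $offset);
        if ($pos === false || $pos > $last) {
            return -1;
        }
        // "memcmp" step: compare the whole needle at that position
        if (substr_compare($haystack, $needle, $pos, $needleLen) === 0) {
            return $pos;
        }
        $offset = $pos + 1; // no match here, resume one byte later
    }
    return -1;
}
```

Nothing is precomputed from the needle, which is exactly why it counts as brute force.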
I should add, that even when the subject is an array, and there would be an advantage to preprocessing the search string, the code doesn't do anything different. It just calls php_str_replace_in_subject for each string in the array.
Yes, as of now (March 2015) I see in the PHP source code that the str_replace() function relies on the Sunday string matching algorithm.
str_replace() uses the zend_memnstr_ex_pre() and zend_memnstr_ex() functions (from zend_operators.c), which use the Sunday algorithm.
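For anyone who wants to see the idea rather than read the C, the Sunday (Quick Search) algorithm can be sketched in PHP roughly like this. It is a generic illustration, not a port of zend_memnstr_ex(); sundaySearch is a hypothetical name.

```php
<?php
// Generic sketch of Sunday's Quick Search; returns the match offset or -1.
function sundaySearch(string $haystack, string $needle): int
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n) {
        return -1;
    }
    // Preprocessing: for each byte in the needle, how far to shift the window
    // so that the byte's last occurrence lines up with the byte just past it.
    $shift = [];
    for ($i = 0; $i < $m; $i++) {
        $shift[$needle[$i]] = $m - $i;
    }
    $pos = 0;
    while ($pos <= $n - $m) {
        if (substr_compare($haystack, $needle, $pos, $m) === 0) {
            return $pos;
        }
        if ($pos + $m >= $n) {
            break; // no byte after the window, so no further shift is possible
        }
        // Look at the byte just past the window; bytes that do not occur
        // in the needle allow a jump of the full needle length plus one.
        $next = $haystack[$pos + $m];
        $pos += $shift[$next] ?? ($m + 1);
    }
    return -1;
}
```

The shift table is the preprocessing step that the brute-force version lacks.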
Related
I have a long list of words that I want to search for in a large string. There are about 500 words and the string is usually around 500K in size.
PCRE throws an error saying preg_match_all: Compilation failed: regular expression is too large at offset 704416
Is there an alternative to this? I know I can recompile PCRE with a higher internal linkage size, but I want to avoid messing around with server packages.
Perhaps you might consider tokenizing your input string instead, and then simply iterating through each token and seeing if it's one of the words you're looking for?
Could you approach the problem from the other direction?
1. Use regex to clean up your 500K of HTML and pull out all the words into a big-ass array. Something like \b(\w+)\b.. (sorry haven't tested that).
2. Build a hash table of the 500 words you want to check. Assuming case doesn't matter, you would lowercase (or uppercase) all the words. The hash table could store integers (or some more complex object) to keep track of matches.
3. Loop through each word from (1), lowercase it, and then match it against your hashtable.
4. Increment the item in your hash table when it matches.
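The steps above could look something like this in PHP. It is only a sketch: countKeywordHits is a made-up name, a plain PHP array serves as the hash table, and the \w+ tokenizer is an assumption about what counts as a word.

```php
<?php
// Sketch of the tokenize-then-look-up approach; names are hypothetical.
function countKeywordHits(string $text, array $keywords): array
{
    // Hash table keyed by lowercased keyword, holding a match counter
    $counts = [];
    foreach ($keywords as $word) {
        $counts[strtolower($word)] = 0;
    }
    // Pull every word out of the text in a single pass
    preg_match_all('/\b\w+\b/', $text, $matches);
    foreach ($matches[0] as $token) {
        $token = strtolower($token);
        if (isset($counts[$token])) { // constant-time lookup per token
            $counts[$token]++;
        }
    }
    return $counts;
}
```

This way the cost grows with the size of the text, not with the number of keywords, and no giant regex has to be compiled.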
You can try re2.
One of its strengths is that it uses automata theory to guarantee that the regex runs in time linear in the size of its input.
You can use str_word_count, or explode the string on whitespace (or whatever delimiter makes sense for the context of your document), then filter the results against your keywords.
$allWordsArray = str_word_count($content, 1);
$matchedWords = array_filter($allWordsArray, function($word) use ($keywordsArray) {
return in_array($word, $keywordsArray);
});
This assumes PHP 5.3+ for the closure; in earlier versions of PHP, create_function can be substituted.
For a small project of my own, I'm writing a parser that parses event logs from a certain application. Normally I'd have little issue with handling such a thing, but the problem is that strings from these logs do not always have the same parameters. For example, one such string could be:
DD/MM HH:MM:SS.MSEC TYPE_OF_EVENT SOURCE, SOURCE_FLAGS, TARGET, TARGET_FLAGS, PARAM1
On another occasion, the string could have a whole series of parameters: one event has as many as 27 of them, another has 16. Reading through the documentation, there is some logic to the parameters; for example, the 17th parameter will always hold an integer. While that is good, unfortunately the 17th parameter might be the 7th thing in the string. The only things that are really constant in every string are the timestamp and the first 6 parameters.
How would I go about parsing strings like these? I'm sorry if my question is a tad unclear; I find it difficult to word my problem.
Ok, followup for my comment up at the top.
If the log's format is "constant" based on the TYPE_OF_EVENT field, you'll just have to do some simple pre-parsing, after which the rest should follow easily.
read a line
extract the universally common fields: timestamp, type of event, source/target
based on type_of_event, do further analysis
switch (event type) {
case 'a': parse out 'a' event parameters
case 'b': parse out 'b' event parameters
default: log unknown event type for future analysis
}
and so on.
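In PHP that pre-parse-then-dispatch idea might be sketched as below. The event name EVENT_A, the array keys, and the meaning of the fifth parameter are all invented for illustration; only the overall line layout comes from the question.

```php
<?php
// Sketch only: parseLogLine, EVENT_A and the array keys are hypothetical.
function parseLogLine(string $line): ?array
{
    // Common prefix: date, time, event type, then the comma-separated rest
    if (!preg_match('#^(\d+/\d+) (\d+:\d+:\d+\.\d+)\s+(\w+)\s+(.*)$#', $line, $m)) {
        return null; // not a line in the expected format
    }
    $params = array_map('trim', explode(',', $m[4]));
    $event = [
        'date'         => $m[1],
        'time'         => $m[2],
        'type'         => $m[3],
        'source'       => $params[0] ?? null,
        'source_flags' => $params[1] ?? null,
        'target'       => $params[2] ?? null,
        'target_flags' => $params[3] ?? null,
    ];
    // Event-specific parameters depend on the type
    switch ($event['type']) {
        case 'EVENT_A': // assumed: the first extra parameter is an integer
            $event['amount'] = isset($params[4]) ? (int) $params[4] : null;
            break;
        default: // unknown type: keep the raw extras for later analysis
            $event['extra'] = array_slice($params, 4);
    }
    return $event;
}
```

Each new event type then only needs its own case, and unknown types are kept rather than lost.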
I would use a different logging solution, or find a way to modify it so that you have empty place holders, item,,item3,,,item6 etc.
Just my opinion without knowing too much about this app - this app doesn't sound too good. I usually judge apps by factors like this: if there is no good reason for the log file to be non-standardized, then what do you think the rest of the code looks like? :)
That's not an input that can be "parsed" as such, because there are no fixed keywords to look out for. But regular expressions seem sufficient to extract and split up the contents.
http://regular-expressions.info/ has a good introduction, and https://stackoverflow.com/questions/89718/is-there-anything-like-regexbuddy-in-the-open-source-world lists a few cool tools that help in designing regular expressions.
In your case you would need \d+ for matching decimals, use delimiters literally, and you can probably get away with .*? separated by the , comma delimiters to find the individual parts. Maybe:
preg_match('#(\d+/\d+) (\d+:\d+:\d+\.\d+) (\w+) (.*?),(.*),(.*),...#');
If there is a variable length of attributes, then you should prefer two regexps (though it can be done in one). First get the .* remainder of each line, then split it afterwards.
How about splitting the string on the ", " separator and putting everything in an array? That way you'll have a numeric index to check whether a parameter exists or not.
I am building a check in PHP to detect whether a filename makes sense or is completely random. I'm using regular expressions.
A valid filename = sample-image-25.jpg
A random filename = 46347sdga467234626.jpg
I want to check if the filename makes sense or not, if not, I want to alert the user to fix the filename before continuing.
Any help?
I'm not really sure that's possible because I'm not sure it's possible to define "random" in a way the computer will understand sufficiently well.
"umiarkowany" looks random, but it's a perfectly valid word I pulled off the Polish Wikipedia page for South Korea.
My advice is to think more deeply about why this design detail is important, and look for a more feasible solution to the underlying problem.
That would take far too much work. You would have to build a huge array of the most-used words (like a dictionary) and check whether most of the words in the filename (perhaps separated by - or _) are in it, and it will still be full of bugs.
Basically you will need:
explode()
implode()
array_search() or in_array()
Take the string and look for a glue character such as "_" or "-" with preg_match(); if there is one, explode the string into an array and compare that array with the dictionary array.
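A very rough sketch of that dictionary check follows. The function name, the 0.5 threshold, and the choice to ignore purely numeric parts are all assumptions, and it uses preg_split in place of explode so that both glue characters are handled at once.

```php
<?php
// Sketch only: looksMeaningful and the 0.5 threshold are hypothetical.
function looksMeaningful(string $filename, array $dictionary, float $threshold = 0.5): bool
{
    // Drop the extension, then split on the usual glue characters
    $base = pathinfo($filename, PATHINFO_FILENAME);
    $parts = preg_split('/[-_]+/', strtolower($base), -1, PREG_SPLIT_NO_EMPTY);
    // Ignore purely numeric parts such as the "25" in sample-image-25
    $words = array_filter($parts, function ($part) {
        return !preg_match('/^\d+$/', $part);
    });
    if (count($words) === 0) {
        return false;
    }
    // Flip the dictionary so each lookup is a cheap isset()
    $known = array_flip(array_map('strtolower', $dictionary));
    $hits = 0;
    foreach ($words as $word) {
        if (isset($known[$word])) {
            $hits++;
        }
    }
    return $hits / count($words) >= $threshold;
}
```

It will still misjudge real words that are missing from the dictionary, so treat the result as a hint to the user, not a verdict.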
Or, since almost every word alternates vowels and consonants, you could write a huge script that checks whether most of the words in the filename look "not random". But the problem will be the same: why do you need that? Look for a more flexible solution.
Notice:
Consider that even a simple-and-friendly-file.png could be the result of a string generator.
Good luck with that.
I wouldn't call myself a master regarding regex, i pretty much just know the basics. I've been playing around with it, but i can't seem to get the desired result. So if someone would help me, i would really appreciate it!
I'm trying to check whether unwanted words exist in a string. I'm working on a math project, and i'm gonna be using eval() to calculate the string, so i need to make sure it's safe.
The string may contain (just for example now, i'll add more functions later) the following words: (read the comments)
floor() // spaces or numbers are allowed between the () chars. If possible, i'd also like to allow other math functions inside, so it'd look like: floor( floor(8)*1 ).
It may contain any digit, any math sign (+ - * /) and dots/commas (,.) anywhere in the string
Just to be clear, here's another example: If a string like this is passed, i do not want it to pass:
9*9 + include('somefile') / floor(2) // Just a random example on something that's not allowed
Now that i think about it, it looks kind of complicated. I hope you can at least give me some hints.
Thanks in advance,
-Anthony
Edit: This is a bit off-topic, but if you know a better way of calculating math functions, please suggest it. I've been looking for a safe math class/function that calculates an input string, but i haven't found one yet.
Please do not use eval() for this.
My standard answer to this question whenever it crops up:
Don't use eval (especially if the formula contains user input) or reinvent the wheel by writing your own formula parser.
Take a look at the evalMath class on PHPClasses. It should do everything that you want in a nice safe sandbox.
To rephrase your problem, you want to allow only a specific set of characters, plus certain predefined words. The alternation operator (pipe symbol) is your friend in this case; anchor the pattern so that the whole string has to match:
^([0-9\+\-\*\/\.\,\(\) ]|floor|ceil|other|functions)*$
Of course, using eval is inherently dangerous, and it is difficult to guarantee that this regex will offer full protection in a language with syntax as expansive as PHP.
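As a sketch of how such a whitelist check could be wired up in PHP (isAllowedExpression is a hypothetical name, and the function list is cut down to floor and ceil for brevity):

```php
<?php
// Sketch only: accepts digits, math signs, dots, commas, parentheses,
// spaces, and the words "floor" and "ceil"; rejects everything else.
function isAllowedExpression(string $expr): bool
{
    // \A and \z anchor the match to the entire string, so one stray
    // character anywhere makes the check fail.
    $pattern = '/\A(?:[0-9+\-*\/.,() ]|floor|ceil)*\z/';
    return preg_match($pattern, $expr) === 1;
}
```

With this, floor( floor(8)*1 ) passes and the include('somefile') example is rejected. Note that it only checks the vocabulary, not the syntax: an empty string or unbalanced parentheses also pass, so it does not make eval() safe on its own.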
I have two strings and I would like to mix the characters from each string into one bigger string, how can I do this in PHP? I can swap chars over but I want something more complicated since it could be guessed.
And please don't say md5() is enough and irreversible. :)
$string1 = '9cb5jplgvsiedji9mi9o6a8qq1';//session_id()
$string2 = '5d41402abc4b2a76b9719d911017c592';//md5()
Thank you for any help.
EDIT: Ah sorry Rob. It would be great if there is a solution where it was just a function I could pass two strings to, and it returned a string.
The returned string must contain both of the previous strings. Not just a concatenation, but the characters of each string are mingled into one bigger one.
If you want to make a tamper-proof string which is human readable, add a secure hash to it. MD5 is indeed falling out of favour, so try sha1. For example
$salt="secret";
$hash=sha1($string1.$string2.$salt);
$separator="_";
$str=$string1.$separator.$string2.$separator.$hash;
If you want a string which cannot be read by humans, encrypt it - check out the mcrypt extension which offers a variety of options.
Use one of the SHA variants with the hash() function. SHA-2 (for example sha256) should be sufficient and certainly much better than anything you could come up with.
Unless I am missing something, if you're wanting to combine those values into a unique value, why not do sha1($string1 . $string2);
I'm guessing you want something reversible, so you can get these values back out. A quick-and-dirty technique for obscuring these two strings further would be to base64-encode them:
base64_encode($string1 . $string2);
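One catch with plain concatenation is that you can no longer tell where the first string ends. A small sketch that keeps it reversible is below; packStrings and unpackStrings are made-up names, and wrapping the pair in JSON before encoding is my assumption, not part of the suggestion above.

```php
<?php
// Sketch only: this is obfuscation, not encryption.
function packStrings(string $a, string $b): string
{
    // JSON keeps the boundary between the two strings unambiguous
    return base64_encode(json_encode([$a, $b]));
}

function unpackStrings(string $packed): ?array
{
    $json = base64_decode($packed, true); // strict mode: false on bad input
    if ($json === false) {
        return null;
    }
    $pair = json_decode($json, true);
    return (is_array($pair) && count($pair) === 2) ? $pair : null;
}
```

Anyone who recognises base64 can undo it, so it hides the values only from a casual glance.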
Thank you everyone. I completely forgot about the SHA1 - got too into solving a problem that I forgot what else was out there. :)
Well, if not md5(), then sha1(). :)
Anyway, the possibilities to mangle are endless, pick your poison.
What I would do, if I really wanted to do something like that (which can be useful occasionally), is add another element, chosen at random, shuffle the md5 string by it, and write the random element into the result too.
For example, for each character of the md5 string, pick a random 2-digit number, split it into its two digits, then append the 1st digit to the resulting string and prepend the 2nd digit to it.
I stumbled upon someplace where something of that kind was done today. I was trying to find some reference to a particular phone number - whether it appears anywhere on the country-local inet or not.
I visited a popular classified ads site, which gives phone numbers of advertisers and you have the option, when you are looking at a particular ad, to find all ads with the same phone number. Now, what they did, however, was that they encoded search string, so you are not searching for ?phone=123123, but something like ?phone==FFYx23=.
If they hadn't done that, I would be able to find out for my own purposes, rather than checking on ads, IF user with phone 123123 has posted any ads on the site.
If you are looking to verify message integrity and authenticity with hashing - you might want to look at HMAC - there are plenty of implementations in PHP using both SHA1 and MD5:
http://en.wikipedia.org/wiki/HMAC
EDIT: In fact, PHP now has a function for this:
http://us3.php.net/manual/en/function.hash-hmac.php
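A minimal sketch of how hash_hmac could be used for the two strings from the question follows. The function names, the "_" separator, and the choice of sha256 are assumptions; the separator is assumed not to be the last thing an attacker controls, since the hex MAC itself never contains "_".

```php
<?php
// Sketch only: signPair/verifyPair are hypothetical helpers.
function signPair(string $a, string $b, string $key): string
{
    $payload = $a . '_' . $b;
    // Append a keyed hash so that any change to the payload is detectable
    return $payload . '_' . hash_hmac('sha256', $payload, $key);
}

function verifyPair(string $signed, string $key): bool
{
    $pos = strrpos($signed, '_'); // the MAC is everything after the last "_"
    if ($pos === false) {
        return false;
    }
    $payload = substr($signed, 0, $pos);
    $mac = substr($signed, $pos + 1);
    // hash_equals compares in constant time to avoid timing leaks
    return hash_equals(hash_hmac('sha256', $payload, $key), $mac);
}
```

Unlike the salted sha1 approach, the key here is mixed in by the HMAC construction itself, so there is no need to invent a concatenation scheme for the secret.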