Large regex patterns: PCRE won't do it - php

I have a long list of words that I want to search for in a large string. There are about 500 words and the string is usually around 500K in size.
PCRE throws an error saying preg_match_all: Compilation failed: regular expression is too large at offset 704416
Is there an alternative to this? I know I can recompile PCRE with a higher internal linkage size, but I want to avoid messing around with server packages.

Perhaps you might consider tokenizing your input string instead, and then simply iterating through each token and seeing if it's one of the words you're looking for?

Could you approach the problem from the other direction?
1. Use a regex to clean up your 500K of HTML and pull out all the words into a big-ass array. Something like \b(\w+)\b (sorry, haven't tested that).
2. Build a hash table of the 500 words you want to check. Assuming case doesn't matter, you would lowercase (or uppercase) all the words. The hash table could store integers (or some more complex object) to keep track of matches.
3. Loop through each word from (1), lowercase it, and then match it against your hash table.
4. Increment the item in your hash table when it matches.
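The steps above can be sketched roughly as follows. This is only an illustration: the function name countKeywordHits and the sample data are made up, the scalar type hints assume PHP 7+, and \w without the /u modifier only treats ASCII letters, digits and underscore as word characters.

```php
<?php
// Hypothetical helper: count how often each keyword occurs in $text.
function countKeywordHits(string $text, array $keywords): array
{
    // Hash table keyed by lowercased keyword, every counter starting at 0.
    $counts = array_fill_keys(array_map('strtolower', $keywords), 0);

    // Pull out every run of word characters; no giant alternation is needed.
    preg_match_all('/\w+/', $text, $matches);

    foreach ($matches[0] as $word) {
        $word = strtolower($word);
        // isset() on an array key is a hash lookup, not a scan of the list.
        if (isset($counts[$word])) {
            $counts[$word]++;
        }
    }
    return $counts;
}

$counts = countKeywordHits('<p>Red fish, blue fish, RED sky</p>', ['red', 'fish', 'green']);
// $counts is ['red' => 2, 'fish' => 2, 'green' => 0]
```

The pattern stays tiny no matter how many keywords there are, so the PCRE size limit never comes into play.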

You can try re2.
One of its strengths is that it uses automata theory to guarantee that the regex runs in time linear in the size of its input.

You can use str_word_count, or explode the string on whitespace (or whatever delimiter makes sense for the context of your document), then filter the results against your keywords.
$allWordsArray = str_word_count($content, 1);
$matchedWords = array_filter($allWordsArray, function($word) use ($keywordsArray) {
    return in_array($word, $keywordsArray);
});
This assumes PHP 5.3+ for the closure, but create_function can be substituted in earlier versions of PHP.

Related

What's the best approach to find words from a set of words in a string?

I must detect the presence of some words (even polyrematic, like in "bag of words") in a user-submitted string.
I need to find the exact word, not part of it, so the strstr/strpos/stripos family is not an option for me.
My current approach (PHP/PCRE regex) is the following:
\b(first word|second word|many other words)\b
Is there any other better approach? Am I missing something important?
Words are about 1500.
Any help is appreciated
A regular expression the way you're demonstrating will work. It may be challenging to maintain if the list of words grows long or changes.
The method you're using will work in the event that you need to look for phrases with spaces and the list doesn't grow much.
If there are no spaces in the words you're looking for, you could split the input string on space characters (\s+, see https://www.php.net/manual/en/function.preg-split.php ), then check to see if any of those words are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This will be a bit more code, but less regex maintenance, so ymmv based on your application.
If the set has spaces, consider using a trie instead. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie
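To give an idea of what a trie buys you here, below is a minimal sketch that folds a phrase list into a trie and then emits a more compact regex from it. It is not the API of the linked library; buildTrie and trieToRegex are hypothetical names, and the sketch assumes ASCII phrases because str_split works byte by byte.

```php
<?php
// Hypothetical helper: insert each phrase into a nested-array trie.
function buildTrie(array $phrases): array
{
    $trie = [];
    foreach ($phrases as $phrase) {
        if ($phrase === '') {
            continue; // nothing to insert
        }
        $node = &$trie;
        foreach (str_split($phrase) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node[''] = true; // empty key marks "a phrase ends here"
        unset($node);
    }
    return $trie;
}

// Hypothetical helper: turn a trie node into a regex fragment with shared prefixes.
function trieToRegex(array $node): string
{
    $alternatives = [];
    $isEnd = false;
    foreach ($node as $char => $child) {
        if ($char === '') {
            $isEnd = true;
            continue;
        }
        // Digit characters come back as integer keys, hence the cast.
        $alternatives[] = preg_quote((string) $char, '/') . trieToRegex($child);
    }
    if ($alternatives === []) {
        return '';
    }
    if (count($alternatives) === 1 && !$isEnd) {
        return $alternatives[0];
    }
    // A phrase may stop at this node, so everything after it is optional.
    return '(?:' . implode('|', $alternatives) . ')' . ($isEnd ? '?' : '');
}

$regex = '/\b' . trieToRegex(buildTrie(['bag', 'bag of words', 'bat'])) . '\b/i';
preg_match_all($regex, 'A bag of words, not a bagel or a bat.', $m);
// $m[0] is ['bag of words', 'bat']
```

Shared prefixes are written once instead of once per phrase, which is why this kind of pattern tends to be smaller than a flat first|second|third alternation.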

"Regular Expression is too large" error in PHP

I am working on a relatively complex, and very large regular expression. It is currently 41,127 characters, and may grow somewhat as additional cases may be added. I am starting to get this error in PHP:
preg_match_all(): Compilation failed: regular expression is too large at offset 41123
Is there a way to increase the size limit? The following settings suggested elsewhere did NOT work because these apply to size of data and NOT the regex size:
ini_set("pcre.backtrack_limit", "100000000");
ini_set("pcre.recursion_limit", "100000000");
Alternatively, is there a way to define a "sub-pattern variable" within the regex that could be repeated at various places within the regex? (I am not talking about repetition using * or +, or even repeating a matched group with \1.) I am currently using PHP variables containing sub-patterns that are repeated in a few places within the regex, but this leads to expansion of the regex BEFORE it is passed on to the PCRE functions.
This is a complex regular expression, and cannot be replaced by simpler keyword-searching using strpos or similar as suggested at this link.
I would prefer to avoid splitting this into sub-expressions at | and trying to match the sub-expressions separately, because the reduction in size would be modest (there are only 2 or 3 of top-level |), and this would complicate further development.
Depending on the application, valid solutions are:
Shorten the Regular Expression by using DEFINE for any redundant sub-expressions (see below).
Increase the max limit on regex size by re-compiling PHP (see drew010's great answer). Although this may not be available in all environments or may create compatibility issues if changing servers.
Split your regular expression at | and process the resulting sub-expressions separately. If the regex is essentially numerous keywords separated by |, then converting to a strtok or a loop with strpos may be a better & faster choice.
Use other language / regex engine such as C++/Boost, although I did not verify this.
Solution to my specific problem: As per Mario's comment, using the (?(DEFINE)...) construct for some of the sub-expressions that were re-used several times reduced my regex size from 41,127 characters down to "only" 4,071, and this was an elegant solution to get rid of the error “Regular Expression is too large.”
See: (?(DEFINE)...) syntax reference at rexegg.com
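For readers who have not seen the construct, here is a small sketch of how (?(DEFINE)...) lets a sub-pattern be written once and reused. The IPv4 pattern and the group name octet are made up for illustration and are not taken from the original 41,127-character regex.

```php
<?php
// Named sub-patterns are declared once inside (?(DEFINE)...) and never match
// on their own; each (?&octet) call reuses the definition without copying it.
$pattern = '/
    (?(DEFINE)
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )
    )
    \b (?&octet) (?: \. (?&octet) ){3} \b
/x';

preg_match_all($pattern, 'hosts: 10.0.0.1, 192.168.1.254, 999.1.1.1', $m);
// $m[0] is ['10.0.0.1', '192.168.1.254']
```

With plain PHP string interpolation the octet sub-pattern would be pasted into the regex at every use; here it appears once, which is where the size reduction comes from.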
I don't disagree with the comments that there might be a better way to do this, but I will answer the question here.
You can increase the maximum size of the regex but only by recompiling PHP yourself. Because of this, your code is not at all portable and if you are using pre-compiled binaries, you are out of luck.
That said, I would suggest finding an alternative for matching.
See pcre_internal.h for comments.
PCRE keeps offsets in its compiled code as 2-byte quantities
(always stored in big-endian order) by default. These are used, for
example, to link from the start of a subpattern to its alternatives
and its end. The use of 2 bytes per offset limits the size of the
compiled regex to around 64K, which is big enough for almost
everybody. However, I received a request for an even bigger limit.
For this reason, and also to make the code easier to maintain, the
storing and loading of offsets from the byte string is now handled
by the macros that are defined here.
The macros are
controlled by the value of LINK_SIZE. This defaults to 2 in the
config.h file, but can be overridden by using -D on the command line.
This is automated on Unix systems via the "configure" command.
So you can either edit ext/pcre/pcrelib/config.h from the PHP source distribution to increase the size limit, or specify it when compiling ./configure -DLINK_SIZE=4
EDIT: If you are trying to match/parse HTML, I would recommend using DOMDocument to parse the HTML and then walk the DOM tree or build an XPATH to find what you're looking for.
Have you tried array_chunk to split your array, and then running preg_match_all inside a foreach()? I was using the exact same code with an array of 40k+ elements. I went through the solutions above, but they didn't solve my "Compilation failed: regular expression is too large at offset" problem. Then I split my 40k+ array into 4 arrays of 1k elements, ran preg_match_all on each one inside a foreach(), and voila! It worked.
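A rough sketch of that chunking idea, assuming the pattern is a plain alternation of literal words. The function name matchInChunks and the default chunk size of 500 are arbitrary choices for illustration; the right size depends on how long the words are.

```php
<?php
// Hypothetical helper: match a long word list in batches so that no single
// compiled pattern grows past the PCRE size limit.
function matchInChunks(string $text, array $words, int $chunkSize = 500): array
{
    $found = [];
    foreach (array_chunk($words, $chunkSize) as $chunk) {
        // Escape each word so regex metacharacters are matched literally.
        $quoted = array_map(function ($w) {
            return preg_quote($w, '/');
        }, $chunk);
        $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
        if (preg_match_all($pattern, $text, $m)) {
            $found = array_merge($found, $m[0]);
        }
    }
    return $found;
}

$hits = matchInChunks('The quick brown fox jumps', ['fox', 'quick', 'dog', 'the'], 2);
// $hits is ['quick', 'fox', 'The']
```

Note that the results come back grouped by chunk rather than in the order they appear in the text, and the text is scanned once per chunk.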

Which string matching algorithm does PHP's str_replace use?

Today I need to know which string matching algorithm str_replace uses. I analysed the PHP source code; this function is in ext\standard\string.c, where I found php_char_to_str_ex. Can anyone tell me which algorithm this function implements (that is, which algorithm str_replace is built on)?
I want to write a highlighting program that uses the Sunday algorithm (a very fast algorithm, and I am told it is the one to use).
I thought str_replace might fit my goals, so I analysed it, but my C is poor, so please help me.
Short answer: it's just a simple brute-force search.
The str_replace function is really just a forwarder to php_str_replace_common. And for the simple case where the subject is not an array, that in turn calls php_str_replace_in_subject. And again, when the search parameter is just a string, and it's more than 1 character, that calls php_str_to_str_ex.
Looking at the php_str_to_str_ex implementation, there are various special cases that are handled.
If the search string and the replacement string are the same length, it makes the memory handling easier, because you know the result string is going to be the same size as the source string.
If the search string is longer than the source string, you know it's never going to find anything so you can simply return the source string unchanged.
If the search string length is identical to the source string length, then it's just a straight comparison.
But for the most part, it comes down to repeatedly calling php_memnstr to find the next match, and replacing that match with memcpy.
As for the php_memnstr implementation, that just calls C's memchr repeatedly to try and match the first character of the search string, and then memcmp to see if the rest of the string matches.
There's no fancy preprocessing of the search string to optimise repeated searches. It is just a straightforward brute-force search.
I should add, that even when the subject is an array, and there would be an advantage to preprocessing the search string, the code doesn't do anything different. It just calls php_str_replace_in_subject for each string in the array.
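To make the memchr-then-memcmp strategy concrete, here is a sketch of the same idea in PHP rather than C. It is an illustration of the approach, not a transcription of php_memnstr; naiveFind is a hypothetical name, strpos with a one-character needle stands in for memchr, and substr_compare stands in for memcmp.

```php
<?php
// Hypothetical helper: brute-force search, returning the match offset or -1.
function naiveFind(string $haystack, string $needle, int $offset = 0): int
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n || $offset < 0) {
        return -1;
    }
    $pos = $offset;
    while ($pos <= $n - $m) {
        // Step 1 (memchr): jump to the next occurrence of the first byte.
        $pos = strpos($haystack, $needle[0], $pos);
        if ($pos === false || $pos > $n - $m) {
            return -1;
        }
        // Step 2 (memcmp): compare the whole needle at that position.
        if (substr_compare($haystack, $needle, $pos, $m) === 0) {
            return $pos;
        }
        $pos++; // no match here; resume scanning one byte further on
    }
    return -1;
}

$at = naiveFind('abracadabra', 'cad');
// $at is 4
```

Nothing is precomputed from the needle, so every failed comparison advances the search by just one byte.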
Yes, as of now (March 2015) I see in the PHP source code that the str_replace() function relies on the Sunday string matching algorithm.
str_replace() function uses zend_memnstr_ex_pre() and zend_memnstr_ex() functions (from zend_operators.c file) that use Sunday algorithm.
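For anyone who wants to implement it themselves, the Sunday (quick search) algorithm can be sketched in PHP as below. This is a plain textbook version written for illustration, not a port of zend_memnstr_ex, and sundayFind is a hypothetical name.

```php
<?php
// Hypothetical helper: Sunday search, returning the match offset or -1.
function sundayFind(string $haystack, string $needle): int
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n) {
        return -1;
    }

    // Preprocessing: for every byte in the needle, the distance from its
    // last occurrence to one past the end of the needle.
    $shift = [];
    for ($i = 0; $i < $m; $i++) {
        $shift[$needle[$i]] = $m - $i;
    }

    $pos = 0;
    while ($pos <= $n - $m) {
        if (substr_compare($haystack, $needle, $pos, $m) === 0) {
            return $pos;
        }
        if ($pos + $m >= $n) {
            break; // no byte after the window, so nothing left to try
        }
        // Look at the byte just past the current window. If it is not in the
        // needle at all, the window can jump completely past it.
        $next = $haystack[$pos + $m];
        $pos += $shift[$next] ?? ($m + 1);
    }
    return -1;
}

$at = sundayFind('hello world', 'world');
// $at is 6
```

In the example the first window is "hello", the byte after it is a space that does not occur in "world", and so the search jumps six bytes at once and lands directly on the match.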

Detect random strings

I am building a script in PHP to detect whether a filename makes sense or is completely random. I'm using regular expressions.
A valid filename = sample-image-25.jpg
A random filename = 46347sdga467234626.jpg
I want to check if the filename makes sense or not, if not, I want to alert the user to fix the filename before continuing.
Any help?
I'm not really sure that's possible because I'm not sure it's possible to define "random" in a way the computer will understand sufficiently well.
"umiarkowany" looks random, but it's a perfectly valid word I pulled off the Polish Wikipedia page for South Korea.
My advice is to think more deeply about why this design detail is important, and look for a more feasible solution to the underlying problem.
That would take way too much work. You would have to build a huge array of the most-used words (like a dictionary) and check whether most of the words inside the filename (maybe separated by - or _) are in it, and it would still be full of bugs.
Basically you will need:
explode()
implode()
array_search() or in_array()
Take the string and look for a glue character like "_" or "-" with preg_match(); if there are any, explode the string into an array and compare that array with the dictionary array.
Or, since almost every word alternates vowels and consonants, you could write a huge script that checks whether most of the words inside the filename look "not random". But the problem remains the same: why do you need that? Look for a more flexible solution.
Notice:
Consider that even a simple-and-friendly-file.png could be the result of a string generator.
Good luck with that.
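For what it is worth, the dictionary idea could be sketched like this. It uses preg_split in place of the explode() and in_array() calls listed above; the function name looksMeaningful, the tiny dictionary and the 0.5 threshold are all made up for illustration, and all of the caveats above still apply.

```php
<?php
// Hypothetical helper: guess whether a filename is made of dictionary words.
function looksMeaningful(string $filename, array $dictionary, float $threshold = 0.5): bool
{
    // Drop the extension, then split on the usual "glue" characters.
    $base = pathinfo($filename, PATHINFO_FILENAME);
    $parts = preg_split('/[-_\s]+/', strtolower($base), -1, PREG_SPLIT_NO_EMPTY);

    // Purely numeric parts such as "25" say nothing either way, so skip them.
    $words = array_filter($parts, function ($p) {
        return !preg_match('/^\d+$/', $p);
    });
    if (count($words) === 0) {
        return false;
    }

    $known = array_flip($dictionary); // word => index, for hash lookups
    $hits = 0;
    foreach ($words as $word) {
        if (isset($known[$word])) {
            $hits++;
        }
    }
    // "Meaningful" if at least $threshold of the words are in the dictionary.
    return $hits / count($words) >= $threshold;
}

$dictionary = ['sample', 'image', 'photo'];
$ok = looksMeaningful('sample-image-25.jpg', $dictionary);     // true
$bad = looksMeaningful('46347sdga467234626.jpg', $dictionary); // false
```

A real word that happens to be missing from the dictionary would be flagged as random, which is exactly the weakness pointed out above.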

Seeking elegant way to remove any instances of 544 words from a string

I need to remove any instances of 544 full-text stopwords from a user-entered search string, then format it to run a partial match full-text search in boolean mode.
input: "new york city", output: "+york* +city*" ("new" is a stopword).
I have an ugly solution that works: explode the search string into an array of words, look up each word in the array of stopwords, unset them if there is a match, implode the remaining words and finally run a regex to add the boolean mode formatting. There has to be a more elegant solution.
My question has 2 parts.
1) What do you think is the cleanest way to do this?
2) I solved part of the problem using a huge regex but this raised another question.
EDIT: This actually works. I'm embarrassed to say that the memory issue I was having (and believed was my regex) was actually generated later in the code due to the huge number of matches after filtering out stopwords.
$tmp = preg_replace('/(\b('.implode('|',$stopwords).')\b)+/','',$this->val);
$boolified = preg_replace('/([^\s]+)/','+$1*',$tmp);
Build a trie (prefix tree) from the 544 words and just walk through it with the input string letter by letter, jumping back to the root of the tree at the beginning of every new word. When you find a match at the end of a word, remove it. This is O(n) over the length of the input string if the word list remains static.
Split the search string into an array of words, and then:
do array_diff() with the stopwords array
or make stopwords a hash and use hash lookups (if isset($stopwords[$word]) then...)
or keep stopwords sorted and use binary search for each word
It's hard to say which is going to be fastest; you might want to profile each option (and if you do, please share the results!)
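As a sketch of the hash-lookup option, combined with the boolean-mode formatting from the question: the function name boolify and the three-word stopword list are placeholders, and the sketch assumes matching should be case-insensitive.

```php
<?php
// Hypothetical helper: drop stopwords, then format for a boolean-mode search.
function boolify(string $search, array $stopwords): string
{
    // Flip the list so each stopword becomes an array key (a hash lookup).
    $stop = array_flip(array_map('strtolower', $stopwords));

    $words = preg_split('/\s+/', strtolower(trim($search)), -1, PREG_SPLIT_NO_EMPTY);

    $terms = [];
    foreach ($words as $word) {
        if (!isset($stop[$word])) {
            $terms[] = '+' . $word . '*';
        }
    }
    return implode(' ', $terms);
}

$query = boolify('new york city', ['new', 'the', 'a']);
// $query is '+york* +city*'
```

This does the filtering and the formatting in one pass and needs no regex built from the stopword list.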
