Any faster, simpler alternative to php preg_match

I am using CakePHP 1.3 and I have a textarea where users submit articles. On submit, I want to search the article for certain keywords and add the respective tags to the article.
I was thinking of preg_match, but the preg_match pattern has to be a string, so I would have to loop through a (big) array.
Is there an easier way to plug the keywords array into the pattern?
I appreciate all your help.
Thanks.

I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach($arr as $word){
if(in_array($word, $keywords)){
if(isset($tracker[$word]))
$tracker[$word]++;
else
$tracker[$word] = 1;
}
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: For the fastest lookup, the keyword array should be an inverted index (keywords as keys). in_array() does a linear search through the values, so the loop above is not as fast as it could be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value
Then you would use isset($keywords[$word]) instead of in_array() to check whether the word matches a keyword, which gives you an O(1) lookup on average.
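A minimal sketch of the isset()-based lookup described in the edit, reusing the sample keywords and article from this answer. Using array_flip() to build the inverted index from a plain list is an assumption for illustration; any array with the keywords as keys works the same way.

```php
<?php
// Build the inverted index: each keyword becomes an array key.
$keywords = array_flip(array("computer", "dog", "sandwich"));
$article  = "This is a test using your computer when your dog is being a dog";

$tracker = array();
foreach (explode(" ", strtolower($article)) as $word) {
    // isset() on a key is a hash lookup, so no scan over the keyword list.
    if (isset($keywords[$word])) {
        $tracker[$word] = isset($tracker[$word]) ? $tracker[$word] + 1 : 1;
    }
}

print_r($tracker);
```

Note that explode(" ", ...) leaves punctuation attached to words, so real article text would probably need a tokenizer such as str_word_count() or preg_split() instead.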

If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
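As a rough sketch of that loop (the keyword list and article text are made up for illustration), using stripos(), the case-insensitive sibling of strpos():

```php
<?php
// Hypothetical sample data, not from the question.
$keywords = array("computer", "dog", "sandwich");
$article  = "My Dog ate the sandwich";

$tags = array();
foreach ($keywords as $keyword) {
    // stripos() returns false when there is no match; position 0 is a valid
    // match, so the comparison must be strict.
    if (stripos($article, $keyword) !== false) {
        $tags[] = $keyword;
    }
}

print_r($tags);
```

One caveat with this approach: it matches substrings, so "dog" would also be found inside "dogma".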

Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try a different approach, such as splitting the text into words and checking whether each word is interesting or not. I would make use of str_word_count(), using something like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
This splits the text into words and counts occurrences. Then you can check with isset($words_freq[$keyword]) (the words are the array keys, so in_array() would search the counts instead) or with array_intersect(array_keys($words_freq), $my_keywords).
If, as I guess, you don't care about the case of the keywords, you can strtolower() the whole text before splitting it into words.
Of course, the only way to determine which approach is best is to set up some tests, running the various search functions against a "representative" and fairly long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
EDIT: I cleaned up the code a bit and added a missing semicolon :)
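The word-splitting approach from this answer could look roughly like the sketch below. The sample text and the $my_keywords list are assumptions; array_intersect_key() is used here as one possible way to keep the counts of the matching keywords in a single call.

```php
<?php
// Hypothetical sample data for illustration.
$text        = strtolower('This is my Dog, my dog uses a computer.');
$my_keywords = array('computer', 'dog', 'sandwich');

// word => number of occurrences
$words_freq = array_count_values(str_word_count($text, 1));

// Keep only the entries whose key is one of the keywords.
$found = array_intersect_key($words_freq, array_flip($my_keywords));

print_r($found);
```

The keys of $found are the tags to apply, and the values tell you how often each keyword appeared.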

If you want to look for multiple words from an array, then combine said array into a regular expression:
$regex_array = implode("|", array_map("preg_quote", $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The array_map and preg_quote part is optional, only needed if the $array might contain special characters.
Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
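A self-contained sketch of the combined-pattern idea, with made-up keywords and text. Passing '/' as the second argument to preg_quote() (so the delimiter is escaped too) and the /i flag are additions for illustration, not part of the answer above.

```php
<?php
// Hypothetical sample data; "c++" shows why escaping matters.
$keywords = array("computer", "dog", "c++");
$src      = "My dog broke the computer while I was learning C++.";

// Escape regex metacharacters and the "/" delimiter in every keyword.
$quoted = array_map(function ($keyword) {
    return preg_quote($keyword, '/');
}, $keywords);

$regex = '/(' . implode('|', $quoted) . ')/i';
preg_match_all($regex, $src, $tags);

// $tags[1] holds the captured keywords in the order they appear.
$found = array_values(array_unique(array_map('strtolower', $tags[1])));
print_r($found);
```

Without word boundaries this also matches inside longer words ("dog" in "dogma"); wrapping the group in \b...\b helps for keywords that start and end with word characters.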

strtr()
If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.
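A small illustration of the two properties quoted above (longest keys first, no re-scanning of replaced text). The strings are made up for the example.

```php
<?php
// The replaced text "Hello" is not searched again, so it does not become "Bye".
$noRescan = strtr("Hi all", array("Hi" => "Hello", "Hello" => "Bye"));

// The longer key "very slow" is tried before the shorter key "slow".
$longestFirst = strtr("very slow", array("slow" => "fast", "very slow" => "faster"));

echo $noRescan, "\n", $longestFirst, "\n";
```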

Add tags manually? Just like we add tags here at SO.

Related

How to replace substrings using randomly selected array elements?

Just a comment:
In a previous question, I needed to swap words between column 0 and column 1 (0 to 1, or 1 to 0) depending on which values appear in the $myVar variable. The solution posted by a user here on Stack Overflow was ideal:
$myWords=array(
array('funny','sad'),
array('fast','slow'),
array('beautiful','ugly'),
array('left','right'),
array('5','five'),
array('strong','weak')
);
// prepare values from $myWords for use with strtr()
$replacements=array_combine(array_column($myWords,0),array_column($myWords,1))+
array_combine(array_column($myWords,1),array_column($myWords,0));
echo strtr($myVar,$replacements);
Inputs/Outputs:
$myVar='I was beautiful and strong when I was 5 now I\'m ugly and weak';
//outputs: I was ugly and weak when I was five now I'm beautiful and strong
This Question:
But what about cases where each array has more than two options to swap between? How can I make the system replace a matched word with any of the other options in its group, without the risk of echoing back the very key/word from $myVar that triggered the replacement?
$myWords=array(
array('funny','sad','very sad','hyper funny'),
array('fast','slow','faster','very slow'),
array('beautiful','Studious man','great beauty','very hardworking'),
);
$myVar = 'That man is really fast and very hardworking';
How can I make the system choose among the other options while excluding from the random pick (rand, mt_rand, etc.) the keys that triggered the replacement (fast, very hardworking), so that there is no risk of $myVar staying unchanged?
Possible expected output:
$myVar = 'That man is really faster and Studious man';
fast must not be replaced by fast and very hardworking must not be replaced by very hardworking.

I think this is what you are trying to do...
Check the string for a random selection from each subarray.
If there is a match, you want to replace it with any of the other words in the same subarray.
If there is no match, just move on to the next subarray.
$myWords=array(
array('funny','sad','very sad','hyper funny'),
array('fast','slow','faster','very slow'),
array('beautiful','Studious man','great beauty','very hardworking'),
);
foreach($myWords as $words){
// randomize the subarray
shuffle($words);
// pipe-together the words and return just one match
// note: an earlier version used '/\b\K'.implode('|',$words).'\b/', a bad pattern (incorrect use of \b)
if(preg_match('/\K\b(?:'.implode('|',$words).')\b/',$myVar,$out)){
// generate "replace_pair" from matched word and a random remaining subarray word
// replace and preserve the new sentence
$myVar=strtr($myVar,[$out[0]=>current(array_diff($words,$out))]);
}
}
echo $myVar;
If:
$myVar='That man is really fast and very hardworking';
Then the output could be any of the following and more:
That man is really faster and great beauty
That man is really slow and Studious man
etc...
Effectively, no matter what random replacement happens, the output will never be the same as the input.
This is the preg_match_all() version:
$myWords=array(
array('funny','sad','very sad','hyper funny'),
array('fast','slow','faster','very slow'),
array('beautiful','Studious man','great beauty','very hardworking'),
);
$myVar='The slow epic fail was both sad and funny';
foreach($myWords as $words){
$replacepairs=[]; // clear array
// note: an earlier version used '/\b\K'.implode('|',$words).'\b/', a bad pattern (incorrect use of \b)
if(preg_match_all('/\K\b(?:'.implode('|',$words).')\b/',$myVar,$out)){ // match all occurrences
foreach($out[0] as $w){
$remaining=array_diff($words,[$w]); // find the remaining valid replacement words
shuffle($remaining); // randomize the remaining replacement words
$replacepairs[$w]=current($remaining); // pluck first value from remaining words
}
$myVar=strtr($myVar,$replacepairs); // use replacepairs on sentence
}
}
echo $myVar;
Possible outputs:
The faster epic fail was both hyper funny and hyper funny
The very slow epic fail was both very sad and hyper funny
The fast epic fail was both funny and very sad
etc...
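As an alternative sketch (not the method used in the answer above), the same swap can be done in a single pass with preg_replace_callback(), which avoids a word inserted for one group being matched again later. The function name swapWords() is hypothetical, and the word groups are the ones from the question.

```php
<?php
$myWords = array(
    array('funny', 'sad', 'very sad', 'hyper funny'),
    array('fast', 'slow', 'faster', 'very slow'),
    array('beautiful', 'Studious man', 'great beauty', 'very hardworking'),
);

// Hypothetical helper: replace every listed word with a random other
// member of its own group.
function swapWords($text, array $groups)
{
    // Map each word to the remaining words of its group.
    $alternatives = array();
    foreach ($groups as $group) {
        foreach ($group as $word) {
            $alternatives[$word] = array_values(array_diff($group, array($word)));
        }
    }

    // Put longer words first so "very slow" is tried before "slow".
    $words = array_keys($alternatives);
    usort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $words);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';

    return preg_replace_callback($pattern, function ($match) use ($alternatives) {
        $options = $alternatives[$match[0]];
        return $options[array_rand($options)]; // never the matched word itself
    }, $text);
}

$myVar  = 'That man is really fast and very hardworking';
$result = swapWords($myVar, $myWords);
echo $result;
```

Because the matched word is removed from its own list of options before the random pick, the output can never equal the input when at least one listed word is present.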

PHP & word counting from string

Trying to take numbers from a text file and see how many times they occur.
I've gotten to the point where I can print all of them out, but I want to display just the number once, and then maybe the occurrences after them (ie: Key | Amount; 317 | 42).
Not looking for an Answer per se, all learning is good, but if you figure one out for me, that would be awesome as well!

preg_match_all will return the number of matches against a string.
$count = preg_match_all("#$key#", $string);
print "{$key} - {$count}";

So if you're already extracting the data you need, you can do this using a (fairly) simple array:
$counts = array();
foreach ($keysFoundFromFile as $key) {
if (!isset($counts[$key])) $counts[$key] = 0; // initialise on first sight to avoid an undefined-index notice
$counts[$key]++;
}
print_r($counts);
If you're already looping to extract the keys from the file, then you can simply assign them directly to the $counts array without making a second loop.
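If the keys are already collected in an array, PHP's built-in array_count_values() does the same counting in one call. A minimal sketch, with a made-up $keysFoundFromFile array standing in for the numbers extracted from the file:

```php
<?php
// Hypothetical data standing in for the numbers read from the text file.
$keysFoundFromFile = array('317', '42', '317', '8', '317');

// value => number of occurrences (numeric strings become integer keys)
$counts = array_count_values($keysFoundFromFile);

foreach ($counts as $key => $amount) {
    echo "$key | $amount\n";
}
```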

I think you're looking for the function substr_count().
Just a heads up though: if you're looking for "123" and the text contains "4512367", it will count that as a match because "123" is part of it. The alternative would be using a regex with word boundaries:
$count = preg_match_all('|\b'. preg_quote($num) .'\b|', $text);
(preg_quote() for good practice, \b for word boundaries so we can be assured that it's not a number embedded in another number.)
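A short comparison of the two counts, on a made-up string, to show the difference the word boundaries make:

```php
<?php
// Hypothetical sample text: "123" appears twice on its own and once inside "4512367".
$text = "123 4512367 123 912";
$num  = "123";

// Counts every occurrence, including the one embedded in "4512367".
$loose = substr_count($text, $num);

// Counts only occurrences that stand alone as a whole number.
$strict = preg_match_all('/\b' . preg_quote($num, '/') . '\b/', $text);

echo "substr_count: $loose, with word boundaries: $strict\n";
```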

What is the best way to unscramble words with PHP?

I have a word list and I want to unscramble words using this word list, in PHP.
It seems to me that PHP doesn't have a built-in function that does this. So could someone please suggest a good algorithm to do this, or at least point me in the right direction?
EDIT: edited to add example
So basically, what I'm talking about is I have a list of words:
apple
banana
orange
Then, I'm given a bunch of jumbled letters.
pplea
nanaba
eroang

Given a dictionary of known words:
foreach ($list as $word)
{
if (count_chars($scrambled_word,1) == count_chars($word,1))
echo "$word\n";
}
Edit: A simple optimization would be to move the count_chars($scrambled_word,1) call outside the loop since it never changes:
$letters = count_chars($scrambled_word,1);
foreach ($list as $word)
{
if ($letters == count_chars($word,1))
echo "$word\n";
}

Warning: I rarely use PHP, so this is dealing only with a general algorithm that should work in almost any language, not anything specific to PHP.
Presumably you have a word in which the letters have been rearranged, and you want to find what word(s) could be made from those letters.
If that's correct, the general idea is fairly simple: take a copy of your word list, and sort the letters in each word into alphabetical order. Put the sorted and unsorted versions of each word side by side, and sort the whole thing by the sorted words (but keeping each unsorted word along with its sorted version). You may want to collapse duplicates together, so that (for example) instead of {abt : bat} and {abt : tab}, you have: {abt: bat, tab}
Then, to match up a scrambled word, sort its letters in alphabetical order. Look for matches in your dictionary (since it's sorted, you can use a binary search). When you find a match, the result is the word (or words) associated with that sorted letter group. Using the example above, if the scrambled word was "tba", you'd sort it to get "abt", then look up "abt" to get "bat" and "tab".
Edit: As @Moron pointed out in the comments, sorting and binary search aren't really crucial points in themselves. The basic points are to turn all equivalent inputs into identical keys, then use some sort of fast lookup by key to find the word(s) for that key.
Sorting the letters in each word is one easy way to turn equivalent inputs into identical keys. Sorting the list and doing a binary search is one easy way to do fast lookups by key.
In both cases, there are quite a few alternatives. I'm not at all sure the alternatives are likely to improve performance a lot, but they certainly could.
Just for example, instead of a pure binary search you could have a second level of index that told you where the keys starting with 'a' were, the keys starting with 'b', and so on. Given that a couple of extremely frequently-used letters are near the beginning of the alphabet ('e' and 'a', for example) you might be better off sorting the words so that relatively uncommon letters ('q', 'z', etc.) are toward the front of the key, and the most commonly used letters are at the end. This would give that first lookup based on the initial character the greatest discrimination.
On the sort/binary search side, there are probably more alternatives, and probably better arguments to be made in favor of using something else. Hash tables typically allow lookups in (nearly) constant time. Tries can reduce storage substantially, especially when many words share a common prefix. The only obvious disadvantage is that the code for either one is probably more work (though PHP's array type is hash-based, so you could probably use it quite nicely).
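Since PHP arrays are hash-based, the "identical keys plus fast lookup" idea can be sketched as below. The helper name letterKey() and the word list are assumptions for illustration.

```php
<?php
// Hypothetical helper: sort a word's letters to get a key shared by all its anagrams.
function letterKey($word)
{
    $letters = str_split(strtolower($word));
    sort($letters);
    return implode('', $letters);
}

// Build the index once: sorted letters => list of dictionary words.
$index = array();
foreach (array('bat', 'tab', 'apple', 'banana', 'orange') as $word) {
    $index[letterKey($word)][] = $word;
}

// Look up a scrambled word by its key.
$scrambled = 'tba';
$key       = letterKey($scrambled);
$matches   = isset($index[$key]) ? $index[$key] : array();

print_r($matches);
```

Building the index costs one pass over the dictionary; each lookup afterwards only sorts the letters of the scrambled word and does one array access.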

It is possible to unscramble in O(log p + n) where
p = size of dictionary
n = length of word to be unscrambled
Assume a constant, c, the most occurrences of some letter within any word plus 1.
Assume a constant, k, the number of letters in the alphabet.
Assume a constant, j, the most number of words that can share the same hash or letter-sorted version.
Initialization of O(p) space:
1. Using the dictionary, D, create an associated list of letter-sorted words, L, which will be of size at most p since each word has exactly one sorted version.
2. Associate another column to L with a numerical hash of integers which can range [0, c^k-1].
3. For each word in L, generate its hash with the following function:
hash(word) = 0 if word is empty or (c^i + hash(remaining substring of the word))
where i is the zero-based alphabet index of the first letter.
Algorithm:
1. In O(n), determine hash, h, of the letter sorted version of the word in question.
2. In O(log p), search for the hash in L.
3. In O(n), list j associated words of length n.

Try these
http://www.php.net/manual/en/function.similar-text.php
http://www.php.net/manual/en/function.soundex.php
http://www.php.net/manual/en/function.levenshtein.php

The slow option would be to generate all permutations of the letters in a scrambled word, then probe them via pspell_check().
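If, however, you can use a raw dictionary text file, then the best option is to just use a regular expression to scan it:
$dict = file_get_contents("words.txt"); // one word per line
$n = strlen($word);
// Build the pattern by concatenation: single quotes do not interpolate $word or $n.
// Note: this matches any line made of $n letters taken from $word, so words that
// repeat a letter differently can show up as false positives.
$pattern = '/^[' . preg_quote($word, '/') . ']{' . $n . '}$/im';
if (preg_match($pattern, $dict, $match)) {
print $match[0];
}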
I'm quite certain PCRE is significantly faster at searching for permutations than PHP and the guessing approach.

Make use of PHP's array functions since they can solve this for you.
$words = array('hello', 'food', 'stuff', 'happy', 'fast');
$scrambled_word = 'oehll';
foreach ($words as $word)
{
// Same length?
if (strlen($scrambled_word) === strlen($word))
{
// Convert to an array and match
if( ! array_diff(str_split($word), str_split($scrambled_word)))
{
print "Your word is: $word";
}
}
}
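Basically, you look for something of the same length, then you ask PHP whether every letter of the word also appears in the scrambled word. Note that array_diff() ignores how often a letter occurs, so words with repeated letters can produce false matches.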

If you have a really large list of words and want this unscramble operation to be fast, I'd try putting the word list into a database. Next add a field to your word list table that is the sum of the ascii values of the word, and then add an index on this ascii sum.
Whenever you want to retrieve a list of possible matches just search the word table for ascii sums that match the sum of the scrambled letters. Keep in mind that you may have a few false matches so you'll have to compare all of the matched words to ensure they contain only the letters of your scrambled word (but the result set should be pretty small).
If you don't want to use a database you could implement the same basic idea using a file, just sort the list by the sum value for faster retrieval of all matches.
Example Data assumes all lowercase (a=97, b=98, c=99, ...)
bat => 311,
cat => 312, ...
Example PHP function to compute the sum for a word
function asciiSum($word) {
$characters = str_split(strtolower($word));
$sum = 0;
foreach($characters as $character) {
$sum += ord($character);
}
return $sum;
}
Even faster: add another field to the database that represents the string length, then you can search for words based on an ascii sum and a string length which would further reduce the number of false matches you would need to check for.
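An in-memory sketch of the sum-plus-length idea, using a plain array in place of the database table. The word list is made up, and the "sum:length" bucket key is an assumption for illustration; the final count_chars() comparison is the check that weeds out false matches such as "sac".

```php
<?php
function asciiSum($word)
{
    return array_sum(array_map('ord', str_split(strtolower($word))));
}

// Bucket the dictionary by "sum:length" (stands in for the indexed columns).
$index = array();
foreach (array('bat', 'tab', 'cat', 'sac') as $word) {
    $index[asciiSum($word) . ':' . strlen($word)][] = $word;
}

$scrambled  = 'tba';
$bucketKey  = asciiSum($scrambled) . ':' . strlen($scrambled);
$candidates = isset($index[$bucketKey]) ? $index[$bucketKey] : array();

// Verify each candidate: same letters with the same frequencies.
$matches = array();
foreach ($candidates as $candidate) {
    if (count_chars($candidate, 1) == count_chars($scrambled, 1)) {
        $matches[] = $candidate;
    }
}

print_r($matches);
```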

Trying to split a string in 3 variables, but a little more tricky - PHP

Having pretty much covered the basics in PHP, I decided to challenge myself and make a simple calculator. After some attempts I figured it out, but I'm not entirely content with it. I want to make it more user friendly and have the calculator input in just one box, very much like google search.
So one would simply type: 5+2
and receive 7.
How would I split the string "5+2" into three variables so that the math functions can convert the numbers into integers and recognize the operator, as well as accounting for the possibility of someone using spaces between the values as well?
Would you explode the string? But what would you explode it with if there are no spaces?
I've also stumbled upon the preg_split function, but I can't seem to wrap my head around it or tell whether it's suitable for solving this problem. What method would be the best option for this?

$calc = "5* 2+ 53";
$calc = preg_replace('/(\s*)/','',$calc);
print_r(preg_split('/([\x28-\x2B\x2D\x2F])/',$calc,-1,PREG_SPLIT_DELIM_CAPTURE));
That's my bid, resulting in
Array
(
[0] => 5
[1] => *
[2] => 2
[3] => +
[4] => 53
)

You may need to use some clever regex to split it something like:
$myOutput = preg_split('/(-?[0-9]+|[-+*\/])/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY); // $input is the expression string
I haven't tested that; it's just a semi-pseudocode example, sorry :-> I'm just trying to highlight that your - (minus) operator can also appear at the start of an integer to make it a negative number, so you could end up with problems with things like -1--21, which is valid but makes your regex rules more complicated.

You will have to split the string using regular expressions.
For example a simple regex for 5+2 would be:
\d\+\d
Check out this link. You can create and validate your regular expressions there. For a calculator it will not be that difficult.

You've got the right idea with preg_split. It would work something like this:
$values = preg_split("/[\s]+/", "76 + 23");
The resulting array will contain values that are NOT whitespace:
Values should look like this:
$values[0]: "76"
$values[1]: "+"
$values[2]: "23"
The "/[\s]+/" is a regular expression pattern that matches one or more whitespace characters. However, if there is no whitespace at all, preg_split will just return the original "5+2" as a single string in the first element of the array, i.e.:
$values[0] = "5+2"
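Since the question is about exactly two numbers and one operator, another option is preg_match() with three capture groups and optional whitespace between them. This is only a sketch under that assumption: the function name calculate() is hypothetical, it handles integers only, and it returns null for anything it cannot parse.

```php
<?php
// Hypothetical helper: parse "<int> <operator> <int>" and compute the result.
function calculate($input)
{
    // Optional sign on each number, optional spaces around the operator.
    $pattern = '/^\s*(-?\d+)\s*([-+*\/])\s*(-?\d+)\s*$/';
    if (!preg_match($pattern, $input, $m)) {
        return null; // not a simple two-operand expression
    }

    $left     = (int) $m[1];
    $operator = $m[2];
    $right    = (int) $m[3];

    switch ($operator) {
        case '+':
            return $left + $right;
        case '-':
            return $left - $right;
        case '*':
            return $left * $right;
        case '/':
            return $right === 0 ? null : $left / $right; // avoid division by zero
    }
    return null;
}

echo calculate("5+2"), "\n";
echo calculate(" 5 * 2 "), "\n";
```

Anchoring the pattern with ^ and $ means the first number may carry a sign and the character after it is always read as the operator, which is how an input like -1--21 is disambiguated.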

PHP Create List with Replaced characters

I am trying to create a list of obscene words for a filter. I want to generate the list from arrays of characters and their replacements; for example, 'A' can be replaced with '4', or 'E' with '3'.
So I have an array for every character in the alphabet listing the different ways to replace it, e.g. $e = array("e" => "3");. I also have an array of obscene words. I need to print all the obscene words and then their variants wherever the letters match. Example:
Hello
He11o
He1lo
H3llo
H31lo
H311o
Every variation. How would I go about doing this? Any help would be much appreciated.

Sounds like a job for regular expressions to me.
preg_replace('/H[e3][l1]{2}[0o]/i','H****',$textstr);
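Rather than writing each pattern by hand, the pattern can be built from a substitution map. A rough sketch, where the function name wordToPattern() and the one-replacement-per-letter map are assumptions for illustration:

```php
<?php
// Hypothetical substitution map: letter => look-alike character.
$substitutions = array('e' => '3', 'l' => '1', 'o' => '0');

// Hypothetical helper: turn a word into a pattern with one character class per letter.
function wordToPattern($word, array $substitutions)
{
    $pattern = '';
    foreach (str_split(strtolower($word)) as $char) {
        $class = preg_quote($char, '/');
        if (isset($substitutions[$char])) {
            $class .= preg_quote($substitutions[$char], '/');
        }
        $pattern .= '[' . $class . ']';
    }
    return '/' . $pattern . '/i';
}

$pattern = wordToPattern('Hello', $substitutions);
echo preg_replace($pattern, 'H****', 'Say H31lo to him'), "\n";
```

This matches every variant without having to generate and store the full list of spellings.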
