Split strings into Dictionary words

I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt contains the dictionary words.
b.txt contains the strings: one per line, without spaces, made from a..z characters only.

Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
It takes some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are PHP implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
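The position-indexed check described above can be sketched roughly as follows. This is only an illustration: the function name coversWholeString and the shape of $matches (a list of [word, position] pairs) are assumptions about what an Aho-Corasick implementation might return, not the API of any particular library.

```php
<?php
// Hypothetical helper. $matches is assumed to be the matcher output:
// a list of [word, position] pairs, in any order.
function coversWholeString(array $matches, int $length): bool
{
    // Hash map of match lengths, indexed by start position.
    $byPosition = [];
    foreach ($matches as [$word, $position]) {
        $byPosition[$position][] = strlen($word);
    }

    // $reachable[$p] is set when the first $p characters split cleanly into words.
    $reachable = [0 => true];
    for ($p = 0; $p < $length; $p++) {
        if (empty($reachable[$p]) || empty($byPosition[$p])) {
            continue; // no word chain ends here, or no word starts here
        }
        foreach ($byPosition[$p] as $len) {
            $reachable[$p + $len] = true;
        }
    }
    return !empty($reachable[$length]);
}
```

Because every position is visited once and each match is followed at most once, this avoids re-exploring the same suffix, which is where a naive recursive search can blow up.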

This problem can be solved using dynamic programming, based on the following recurrence:
f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j, i)) } for each j = 1,...,i
Here f(i) is true when the first i characters of s can be split into dictionary words, and substring(a, b) takes the characters from index a up to, but not including, index b.
First, load your file into a dictionary (a hash set), then evaluate the formula bottom-up.
Pseudocode is something like:
check(word):
    f = new boolean[word.length() + 1]
    f[0] = true
    for i from 1 to word.length():
        f[i] = false
        for j from 0 to i-1:
            if f[j] AND dictionary.contains(word.substring(j, i)):
                f[i] = true
    return f[word.length()]
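The recurrence translates fairly directly into PHP. The following is a minimal sketch, not a tuned implementation: the function name isMadeOfWords, the representation of the dictionary as array keys, and the $maxWordLength cut-off (which merely limits how far back the inner loop looks) are all assumptions made for the example.

```php
<?php
// $dictionary is assumed to be a set with the words as array keys, for example
// array_flip(file('a.txt', FILE_IGNORE_NEW_LINES)).
function isMadeOfWords(string $s, array $dictionary, int $maxWordLength): bool
{
    $n = strlen($s);
    $f = array_fill(0, $n + 1, false);
    $f[0] = true; // the empty prefix is always splittable

    for ($i = 1; $i <= $n; $i++) {
        // Only look back as far as the longest dictionary word.
        for ($j = max(0, $i - $maxWordLength); $j < $i; $j++) {
            if ($f[$j] && isset($dictionary[substr($s, $j, $i - $j)])) {
                $f[$i] = true;
                break;
            }
        }
    }
    return $f[$n];
}

$dictionary = array_flip(['this', 'sentence', 'was', 'made', 'from', 'english', 'words', 'pure']);
$maxLen = max(array_map('strlen', array_keys($dictionary)));
var_dump(isMadeOfWords('thissentencewasmadefromenglishwords', $dictionary, $maxLen)); // bool(true)
```

Using isset() on array keys keeps each dictionary lookup at roughly constant time, whereas in_array() would scan the whole word list on every check.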

I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
'otherword',
'word1andother',
'word1',
'word1word2',
'word1word3',
'word1word2word3'
);
$wordList = array(
'word1',
'word2',
'word3'
);
$results = array();
function onlyListedWords($word, $wordList) {
    if (in_array($word, $wordList)) {
        return true;
    }
    $length = strlen($word);
    // Try every listed word that is a proper prefix of $word.
    // If the rest cannot be split, keep going and try a longer prefix.
    for ($i = 1; $i < $length; $i++) {
        if (in_array(substr($word, 0, $i), $wordList)
            && onlyListedWords(substr($word, $i), $wordList)) {
            return true;
        }
    }
    return false;
}
foreach ($wordsToCheck as $word) {
    if (onlyListedWords($word, $wordList)) {
        $results[] = $word;
    }
}
var_dump($results);
?>

Related

Is it possible to use Knuth-Morris-Pratt Algorithm for string matching on text to text?

I have KMP code in PHP that can match a word against a text. I wonder whether I can use the KMP algorithm to match one text against another. Is that possible, and if so, how can I use it to find the matching strings between two texts?
Here's the core of the KMP algorithm:
<?php
class KMP{
    function KMPSearch($p,$t){
        $result = array();
        $pattern = str_split($p);
        $text = str_split($t);
        $prefix = $this->preKMP($pattern);
        // KMP string matching
        $i = $j = 0;
        $num=0;
        while($j<count($text)){
            while($i>-1 && $pattern[$i]!=$text[$j]){
                // on a mismatch, fall back using the prefix table
                $i = $prefix[$i];
            }
            $i++;
            $j++;
            if($i>=count($pattern)){
                // full match: record the start position of the match,
                // then use the prefix table to shift the pattern to the right
                $result[$num++]=$j-count($pattern);
                $i = $prefix[$i];
            }
        }
        return $result;
    }
    // Build the prefix table
    function preKMP($pattern){
        $i = 0;
        $j = $prefix[0] = -1;
        while($i<count($pattern)){
            while($j>-1 && $pattern[$i]!=$pattern[$j]){
                $j = $prefix[$j];
            }
            $i++;
            $j++;
            // compare the characters themselves; past the end of the pattern there is nothing to compare
            if(isset($pattern[$i]) && $pattern[$i]==$pattern[$j]){
                $prefix[$i]=$prefix[$j];
            }else{
                $prefix[$i]=$j;
            }
        }
        return $prefix;
    }
}
?>
I include this class in my index.php when I want to find a word in a text.
These are the steps I want my code to perform:
(1). I input text 1.
(2). I input text 2.
(3). Text 1 becomes the set of patterns (every single word in text 1 is treated as a pattern).
(4). The code finds every pattern from text 1 in text 2.
(5). Finally, the code shows the percentage of similarity.
I hope you can help me or teach me. I've been searching for the answer everywhere but haven't found it yet.

If you just need to find all words that are present in both texts, you don't need any string search algorithm to do it. You can add all words from the first text to a hash table, iterate over the words of the second text, and add those that are in the hash table to the output list.
You can use a trie instead of a hash table if you want a linear time complexity in the worst case, but I'd get started with a hash table because it's easy to use and is likely to be good enough for practical purposes.
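A rough sketch of the hash table idea in PHP follows. The helper names tokenize and wordSimilarity are made up for the example, and the similarity measure (the share of distinct words from text 1 that also occur in text 2) is an assumption, since the question does not define how the percentage should be computed.

```php
<?php
// Hypothetical helper: lowercases the text and splits on anything that is
// not an ASCII letter or digit (an assumption about what counts as a word).
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Percentage of distinct words from $text1 that also appear in $text2.
function wordSimilarity(string $text1, string $text2): float
{
    $patterns = array_flip(tokenize($text1)); // hash table keyed by word
    if (count($patterns) === 0) {
        return 0.0;
    }
    $found = [];
    foreach (tokenize($text2) as $word) {
        if (isset($patterns[$word])) {
            $found[$word] = true; // remember each shared word once
        }
    }
    return 100.0 * count($found) / count($patterns);
}

echo wordSimilarity('the quick brown fox', 'the lazy fox'), "%\n";
```

PHP arrays are hash tables, so each isset() lookup is roughly constant time and the whole comparison is linear in the combined length of the two texts.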

What is the fastest way to check amount of specific chars in a string in PHP?

I need to check whether the number of characters from a specific set in a string is higher than some limit. What is the fastest way to do that?
For example, I have a long string "some text & some text & some text + a lot more + a lot more ... etc." and I need to check whether it contains more than 3 of these symbols: [&,.,+]. As soon as I encounter the 4th occurrence of one of these characters, I just need to return false and stop the loop. I am thinking of writing a simple function for that, but I wonder whether there is a native PHP function that does it. I need something that will not waste time parsing the string to the end, because the string may be quite long. So I think regular expressions and functions like count_chars are not suited for this kind of job...
Any suggestions?

I don't know about a native method, I think count_chars is probably as close as you're going to get. However, rolling a custom solution would be relatively simple:
$str = 'your text here';
$chars = ['&', '.', '+'];
$count = array_fill_keys($chars, 0); // start every counter at zero so that += works
$length = strlen($str);
$limit = 3;
for ($i = 0; $i < $length; $i++) {
if (in_array($str[$i], $chars)) {
$count[$str[$i]] += 1;
if ($count[$str[$i]] > $limit) {
break;
}
}
}
Where the data is actually coming from might also make a difference. For example, if it's from a file then you could take advantage of fread's 2nd parameter to only read x number of bytes at a time within a while loop.
Finding the fastest way might be too broad of a question as PHP has a lot of string related functions; other solutions might use strstr, strpos, etc...
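As one possible variation on the early-exit loop, the scan between special characters can be delegated to the native strcspn function, which returns the length of the leading segment that contains none of the given characters. This is only a sketch: the function name exceedsLimit is made up, and unlike the loop above it counts all listed characters together rather than per character.

```php
<?php
// Returns true as soon as more than $limit characters from $chars are seen.
// $chars is a plain string of characters to look for, e.g. '&.+'.
function exceedsLimit(string $str, string $chars, int $limit): bool
{
    $count = 0;
    $offset = 0;
    $length = strlen($str);
    while ($offset < $length) {
        // Jump over the run of characters that are not in $chars.
        $offset += strcspn($str, $chars, $offset);
        if ($offset >= $length) {
            break; // reached the end without another hit
        }
        $count++;
        if ($count > $limit) {
            return true; // stop early; the rest of the string is never read
        }
        $offset++; // step past the character that was just counted
    }
    return false;
}

var_dump(exceedsLimit('some text & some text & some text + a lot more.', '&.+', 3));
```

Whether this beats a regular expression depends on the input, so it is worth benchmarking against real data before settling on it.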

I haven't benchmarked the other solutions, but str_replace (http://php.net/manual/en/function.str-replace.php) with an array of search characters will be fast. Its optional fourth parameter receives the number of replacements performed. Check that number:
str_replace(['&', '.', '+'], '', $subject, $count);
if ($count > $number) {
    // too many occurrences
}

Well, all my assumptions were wrong, and real tests crushed my expectations. The regular expression runs 2 to 7 times faster (depending on the string) than a hand-written function with a simple character-checking loop.
The code:
// self-made function:
function chk_occurs($str,$chrs,$limit){
$r=false;
$count = 0;
$length = strlen($str);
for($i=0; $i<$length; $i++){
if(in_array($str[$i], $chrs)){
$count++;
if($count>$limit){
$r=true;
break;
}
}
}
return $r;
}
// RegExp i've used for tests:
preg_match('/([&\\.\\+]|[&\\.\\+][^&\\.\\+]+?){3,}?/',$str);
Of course it is faster because it is a single call to a native function, but even the same regex wrapped in a function runs 2 to ~4.8 times faster.
// RegExp wrapped into a function:
// note that $chrs must be a string here (e.g. '&.+'), not an array
function chk_occurs_preg($str,$chrs,$limit){
    $chrs=preg_quote($chrs,'/');
    return preg_match('/(['.$chrs.']|['.$chrs.'][^'.$chrs.']+?){'.$limit.',}?/',$str);
}
P.S. I didn't measure CPU time; I only measured wall time via microtime(true) over a 200k-iteration loop, but that's enough for me.

Password Validation With Multiple Rules

I'm attempting to write a regex in PHP that validates the following:
At least 10 chars
Has at least 2 Upper-case characters
Has at least 2 Numbers OR Symbols
I've looked at just about every reference I can find but, to no avail.
I guess I can test individually, but that makes me very sad :(
Can someone please help? (And send me to a spot where I can learn in plain English Reg Ex?)

This picture is worth more than 1000 words
(and that's a lot of entropy)
(image via XKCD)
With this in mind you might want to consider dropping rules 2 & 3 if password length is higher than X (say.. 20) or increase the minimum to at least 16 characters (as the only rule).
As for your requirement:
As opposed to having one big, ugly, hard-to-maintain, advanced RegExp you might want to break the problem in smaller parts and tackle each bit separately using dedicated functions.
For this you could look at ctype_* functions, count_chars() and MultiByte String Functions.
Now the ugly:
This advanced RegEx will return true or false according to your rules:
preg_match('/^(?=.{10,}$)(?=.*?[A-Z].*?[A-Z])(?=.*?([\x20-\x40\x5b-\x60\x7b-\x7e\x80-\xbf]).*?(?1).*?$).*$/',$string);
Test demo here: http://regex101.com/r/qE9eB2
1st part (LookAhead) : (?=.{10,}$) will check string length and continue if it has at least 10 characters. You could drop this and do a check with strlen() or even better mb_strlen().
2nd part (also a LookAhead): (?=.*?[A-Z].*?[A-Z]) will check for the presence of 2 UPPERCASE characters. You could also do a $upper=preg_replace('/[^A-Z]/','',$string) instead and check that $upper contains at least two characters.
3rd LookAhead uses a character class: [\x20-\x40\x5b-\x60\x7b-\x7e\x80-\xbf] with hex escaped character ranges for digits and common symbols (pretty much all the symbols one could find on an average keyboard). You could also do a $sym=preg_replace('/[a-zA-Z]/','',$string) instead, which strips the letters, and check that $sym contains at least two characters. Note: to make it shorter I used a recursive group (?1) to avoid repeating the same character class.
For learning, the most comprehensive RegExp reference I know of is: regular-expressions.info
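To illustrate the "smaller parts" suggestion, here is a minimal sketch that checks each rule separately. The function name isValidPassword is made up, and treating every character that is not an ASCII letter as a "number or symbol" is an assumption about rule 3.

```php
<?php
function isValidPassword(string $password): bool
{
    // Rule 1: at least 10 characters (strlen counts bytes; mb_strlen would
    // be the choice for multibyte input).
    if (strlen($password) < 10) {
        return false;
    }
    // Rule 2: at least 2 upper-case ASCII letters.
    if (preg_match_all('/[A-Z]/', $password) < 2) {
        return false;
    }
    // Rule 3: at least 2 numbers or symbols, taken here to mean
    // anything that is not an ASCII letter.
    if (preg_match_all('/[^a-zA-Z]/', $password) < 2) {
        return false;
    }
    return true;
}

var_dump(isValidPassword('ABcdefgh12'));
```

Each rule can then be changed, reordered, or reported to the user on its own, which is hard to do with a single combined expression.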

You can use lookaheads to make sure that what you are looking for is contained appropriately.
/(?=.*[A-Z].*[A-Z])(?=.*[^a-zA-Z].*[^a-zA-Z]).{10,}/

I have always preferred good old procedural code for handling stuff like this. Regular expressions can be useful but they can also be a little cumbersome, especially for code maintenance and quick scanning (regular expressions are not exactly examples of readability).
function strContains($string, $contains, $n = 1, $exact = false) {
$length = strlen($string);
$tally = 0;
for ($i = 0; $i < $length; $i++) {
if (strpos($contains, $string[$i]) !== false) {
$tally++;
}
}
return ($exact ? $tally == $n : $tally >= $n);
}
function validPassword($password) {
if (strlen($password) < 10) {
return false;
}
$upperChars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$upperCount = 2;
if (strContains($password, $upperChars, $upperCount) === false) {
return false;
}
$numSymChars = '0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';
$numSymCount = 2;
if (strContains($password, $numSymChars, $numSymCount) === false) {
return false;
}
return true;
}

How to parse a word/phrase with 2 words with dictionary database (in PHP)

I want to parse a sentence into words but some sentences have two words that can be combined into one and result in a different meaning.
For example:
Eminem is a hip hop star.
If I parse it by splitting the words by space I will get
Eminem
is
a
**hip**
**hop**
star
but I want something like this:
Eminem
is
a
**hip hop**
star
This is just an example; there might be some other word combinations listed as a word in a dictionary.
How can I parse this easily?
I have a dictionary in a MySQL database. Is there any API to do this?

No APIs that I know of. However, you could try the SQL LIKE clause.
$words = explode(' ', 'Eminem is a hip hop star');
$len = count($words);
$fixed = array();
for($x = 0; $x < $len; $x++) {
    //Combine current and next word (the last word has no successor)
    $combined = isset($words[$x+1]) ? $words[$x].' '.$words[$x+1] : null;
    //LIKE 'hip %' will match hip hop
    $q = mysql_query("SELECT word FROM dict WHERE word LIKE '".$words[$x]." %'");
    $found = false;
    while( $result = mysql_fetch_array($q)) {
        if($result['word'] == $combined) { //Phrase is in dictionary
            $found = true;
            break;
        }
    }
    if($found) {
        $fixed[] = $combined;
        $x++; //Skip the word that was just merged
    } else { //Phrase isn't in dictionary
        $fixed[] = $words[$x];
    }
}
*Please excuse my lack of PDO. I'm lazy right now.
EDIT: I've done some thinking. While the code above isn't optimal, the optimized version I've come up with probably can't do very much better. The fact of the matter is regardless of how you approach the problem, you will need to compare every word in your input sentence to your dictionary and perform additional computations. I see two approaches you can take depending on hardware limits.
Both of these methods assume a dict table with (example) structure:
+--+-----+------+
|id|first|second|
+--+-----+------+
|01|hip  |hop   |
+--+-----+------+
|02|grade|school|
+--+-----+------+
Option 1: Your webserver has lots of available RAM (and a decent processor)
The idea here is to completely bypass the database layer by caching the dictionary in PHP's memory (with APC or memcache, the latter if you plan to run on several servers). This will place all the load on your webserver; however, it could be significantly faster, since accessing cached data in RAM is much faster than querying your DB.
(Again, I've left out PDO and Sanitization for simplicity's sake)
// Step One: Cache Dictionary..the entire dictionary
// This could be run on server start-up or before every user input
if(!apc_exists('words')) {
    $words = array();
    $q = mysql_query('SELECT first, second FROM dict');
    while($res = mysql_fetch_row($q)) { //Numeric keys only: [first, second]
        $words[] = $res;
    }
    apc_store('words', $words); //APC serializes arrays itself; you could use memcache if you want
}
// Step Two: Compare cached dictionary to user input
$data = explode(' ', 'Eminem is a hip hop star');
$words = apc_fetch('words');
$count = count($data);
for($x = 0; $x < $count - 1; $x++) { //Stop one short: the last word has no successor
    foreach($words as $word) { //Match against each pair
        if($data[$x] == $word[0] && $data[$x+1] == $word[1]) {
            $data[$x] .= ' '.$word[1];
            array_splice($data, $x+1, 1); //Remove the word that was merged in
            $count--;
            break;
        }
    }
}
Option 2: Fast SQL Server
The second option involves querying each of the words in the input text from the SQL server. For example, for the sentence "Eminem is hip hop" you would create a query that looked like SELECT * FROM dict WHERE (first = 'Eminem' && second = 'is') || (first = 'is' && second = 'hip') || (first = 'hip' && second = 'hop'). Then to fix the array of words you would simply loop through MySQL's results and fuse the appropriate words together. If you are willing to take this route, it might be more efficient to cache commonly used words and fix them before querying the database. This way you can eliminate conditions from your query.
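The fusing step could look roughly like this once the matching pairs are known. This is a sketch under assumptions: the function name fusePairs is made up, and $pairs stands in for whatever the query returns, reshaped into a set keyed by "first second".

```php
<?php
// $pairs is assumed to be a set keyed by "first second", for example built
// from the rows returned by the dictionary query.
function fusePairs(array $words, array $pairs): array
{
    $fixed = [];
    $count = count($words);
    for ($x = 0; $x < $count; $x++) {
        $hasNext = $x + 1 < $count;
        if ($hasNext && isset($pairs[$words[$x] . ' ' . $words[$x + 1]])) {
            $fixed[] = $words[$x] . ' ' . $words[$x + 1];
            $x++; // skip the word that was just merged
        } else {
            $fixed[] = $words[$x];
        }
    }
    return $fixed;
}

$pairs = ['hip hop' => true, 'grade school' => true];
print_r(fusePairs(explode(' ', 'Eminem is a hip hop star'), $pairs));
```

Keying the set by the joined phrase makes each lookup a single isset() call, so the pass over the sentence stays linear in the number of words.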

How to get a random value from 1~N but excluding several specific values in PHP?

rand(1,N) but excluding array(a,b,c,..),
is there already a built-in function that I don't know or do I have to implement it myself(how?) ?
UPDATE
A qualifying solution should perform well whether the excluded array is large or small.

No built-in function, but you could do this:
function randWithout($from, $to, array $exceptions) {
    sort($exceptions); // lets us use break; in the foreach reliably
    $number = rand($from, $to - count($exceptions)); // or mt_rand()
    foreach ($exceptions as $exception) {
        if ($number >= $exception) {
            $number++; // make up for the gap
        } else /*if ($number < $exception)*/ {
            break;
        }
    }
    return $number;
}
That's off the top of my head, so it could use polishing - but at least you can't end up in an infinite-loop scenario, even hypothetically.
Note: The function breaks if $exceptions exhausts your range - e.g. calling randWithout(1, 2, array(1,2)) or randWithout(1, 2, array(0,1,2,3)) will not yield anything sensible (obviously), but in that case, the returned number will be outside the $from-$to range, so it's easy to catch.
If $exceptions is guaranteed to be sorted already, sort($exceptions); can be removed.
Eye-candy: Somewhat minimalistic visualisation of the algorithm.

I don't think there's such a function built in; you'll probably have to code it yourself.
To code this, you have two solutions:
- Use a loop to call rand() or mt_rand() until it returns a valid value. This means calling rand() several times in the worst case, but it should work OK if N is big and you don't have many forbidden values.
- Build an array that contains only legal values and use array_rand to pick one value from it. This will work fine if N is small.

Depending on exactly what you need, and why, this approach might be an interesting alternative.
$numbers = array_diff(range(1, N), array(a, b, c));
// Either (not a real answer, but could be useful, depending on your circumstances)
shuffle($numbers); // $numbers is now a randomly-sorted array containing all the numbers that interest you
// Or:
$x = $numbers[array_rand($numbers)]; // $x is now a random number selected from the set of numbers you're interested in
So, if you don't need to generate the set of potential numbers each time, but are generating the set once and then picking a bunch of random number from the same set, this could be a good way to go.

The simplest way...
<?php
function rand_except($min, $max, $excepting = array()) {
$num = mt_rand($min, $max);
return in_array($num, $excepting) ? rand_except($min, $max, $excepting) : $num;
}
?>

What you need to do is calculate an array of skipped locations, so that you can pick a random position in a contiguous array of length M = N - (number of exceptions) and easily map it back to the original array with holes. This requires time and space proportional to the size of the exceptions array. I don't know PHP from a hole in the ground, so forgive the textual semi-pseudocode example.
Make a new array Offset[] the same length as the Exceptions array.
In Offset[i], store the first index in the imagined non-holey array that would have skipped i elements in the original array.
Now, to pick a random element, select a random number r in 0..M, where M is the number of remaining elements.
Find i such that Offset[i] <= r < Offset[i+1]; this is easy with a binary search.
Return r + i.
Now, that is just a sketch: you will need to deal with the ends of the arrays, whether things are indexed from 0 or 1, and all that jazz. If you are clever, you can actually compute the Offset array on the fly from the original; it is a bit less clear that way, though.
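One way the Offset idea might look in PHP is sketched below. The indexing is pinned down by assumption: values run from $from to $to inclusive, the exceptions are integers inside that range and do not exhaust it, and $offsets[$i] is taken to be the smallest compacted value that has at least $i + 1 exceptions at or below its mapped result. The function names are made up for the example.

```php
<?php
// Build the offset table from the (possibly unsorted) exceptions.
function buildOffsets(array $exceptions): array
{
    sort($exceptions);
    $offsets = [];
    foreach (array_values(array_unique($exceptions)) as $i => $e) {
        $offsets[] = $e - $i; // non-decreasing, so it can be binary searched
    }
    return $offsets;
}

// Map a value $r from the compacted range back to the range with holes.
function mapToValue(int $r, array $offsets): int
{
    $lo = 0;
    $hi = count($offsets);
    while ($lo < $hi) { // find how many offsets are <= $r
        $mid = intdiv($lo + $hi, 2);
        if ($offsets[$mid] <= $r) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }
    return $r + $lo; // shift by the number of exceptions skipped
}

function randomExcluding(int $from, int $to, array $exceptions): int
{
    $offsets = buildOffsets($exceptions);
    // Assumes at least one allowed value remains in the range.
    return mapToValue(mt_rand($from, $to - count($offsets)), $offsets);
}

echo randomExcluding(1, 10, [2, 3, 7]), "\n";
```

The table costs one sort up front; after that each draw is a single random number plus a binary search, so the cost per draw grows only logarithmically with the number of exceptions.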

Maybe it's too late to answer, but I came up with this piece of code when trying to fetch random data from a database by random ID while excluding some numbers.
$excludedData = array(); // The numbers you want to exclude
$maxVal = $this->db->count_all_results("game_pertanyaan"); // The maximum number, based on my database
$randomNum = rand(1, $maxVal); // Initial draw (you could arguably fold this into the while condition; it's up to you)
while (in_array($randomNum, $excludedData)) {
    $randomNum = rand(1, $maxVal); // Draw again until the number is not excluded
}
$randomNum; // Your random number, excluding the numbers you chose

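This is a compact way to do it, and it also lets you draw several distinct values at once. Note that it builds the whole range in memory, so it suits ranges that are not too large: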
$all = range($Min,$Max);
$diff = array_diff($all,$Exclude);
shuffle($diff );
$data = array_slice($diff,0,$quantity);
