Improving the speed and efficiency of this PHP spellchecker - php

I built a simple PHP spellchecker and suggestions application that uses PHP's similar_text() and levenshtein() functions to compare words from a dictionary that is loaded into an array.
How it works:
First I load the contents of the dictionary into an array.
I split the user's input into words and spell check each of the words.
I spell check by checking if the word is in the array that is the dictionary.
If it is, then I echo a congratulations message and move on.
If not, I iterate through the dictionary array, comparing each word in it with the assumed misspelling.
If the inputted word, in lower-case and without punctuation, is 90% or more similar to a word in the dictionary array, then I copy that word from the dictionary array into an array of suggestions.
If no suggestions were found using the 90% or higher similarity comparison, then I use levenshtein() to do a more liberal comparison and add suggestions to the suggestions array.
Then I iterate through the suggestions array and echo each suggestion.
I noticed that this is running slowly. Slow enough to notice. And I was wondering how I could improve the speed and efficiency of this spell checker.
Any and all changes, improvements, suggestions, and code are welcome and appreciated.
Here is the code:
<?php
function addTo($line) {
return strtolower(trim($line));
}
$words = array_map('addTo', file('dictionary.txt'));
$words = array_unique($words);
function checkSpelling($input, $words) {
$suggestions = array();
if (in_array($input, $words)) {
echo "you spelled the word right!";
}
else {
foreach($words as $word) {
$percentageSimilarity = 0.0;
$input = preg_replace('/[^a-z0-9 ]+/i', '', $input);
similar_text(strtolower(trim($input)), strtolower(trim($word)), $percentageSimilarity);
if ($percentageSimilarity >= 90 && $percentageSimilarity<100) {
if(!in_array($word, $suggestions)){
array_push($suggestions, $word);
}
}
}
if (empty($suggestions)) {
foreach($words as $word) {
$input = preg_replace('/[^a-z0-9 ]+/i', '', $input);
$levenshtein = levenshtein(strtolower(trim($input)), strtolower(trim($word)));
if ($levenshtein <= 2 && $levenshtein>0) {
if(!in_array($word, $suggestions)) {
array_push($suggestions, $word);
}
}
}
}
echo "Looks like you spelled that wrong. Here are some suggestions: <br />";
foreach($suggestions as $suggestion) {
echo "<br />".$suggestion."<br />";
}
}
}
if (isset($_GET['check'])) {
$input = trim($_GET['check']);
$sentence = '';
if (stripos($input, ' ') !== false) {
$sentence = explode(' ', $input);
foreach($sentence as $item){
checkSpelling($item, $words);
}
}
else {
checkSpelling($input, $words);
}
}
?>
<!Doctype HTMl>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Spell Check</title>
</head>
<body>
<form method="get">
<input type="text" name="check" autocomplete="off" autofocus />
</form>
</body>
</html>

Levenshtein over a large list will be pretty processor intensive. Right now, if you mistyped refridgerator, it would calculate the edit distance to cat, dog and pimple.
To pare the list down before going into the levenshtein loop, you could match against a precalculated metaphone or soundex key for each of your dictionary entries. This would give you a much shorter list of likely suggestions; you could then use levenshtein and similar_text as a means of ranking the short list of matches.
Another thing that may help you out is to cache your results. I would venture to guess that most of the misspellings are going to be common.
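As a rough illustration of the caching idea, here is a minimal sketch that memoizes suggestions per misspelling in a plain array. The helper name cachedSuggestions() and the $compute callback are assumptions made for this example, not part of the original code; a real setup would probably persist the cache between requests (a file, a database table, or a shared memory cache).

```php
<?php
// Hypothetical helper: remember the suggestions for each distinct input so
// the expensive dictionary scan runs only once per misspelling.
function cachedSuggestions(string $input, array &$cache, callable $compute): array
{
    $key = strtolower($input);
    if (!isset($cache[$key])) {
        // Cache miss: run the slow lookup once and store its result.
        $cache[$key] = $compute($key);
    }
    return $cache[$key];
}

// Usage: $compute stands in for the slow dictionary comparison.
$cache = [];
$compute = function (string $word): array {
    return ['refrigerator']; // placeholder result for the sketch
};
$first  = cachedSuggestions('refridgerator', $cache, $compute);
$second = cachedSuggestions('Refridgerator', $cache, $compute); // served from $cache
```

The cache is keyed on the lower-cased input, so repeated misspellings that differ only in case share one entry.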
The following implementation doesn't deal with the paring of the data effectively, but it should give you some guidelines on how to avoid computing the levenshtein distance against the entire dictionary for each word.
First thing you are going to want to do is append the metaphone results to each of your word entries.
This would be a serviceable way to do that:
<?php
$dict = fopen("dictionary-orig.txt", "r");
$keyedDict = fopen("dictionary.txt", "w");
while (($line = fgets($dict)) !== false){
$line = trim(strtolower($line));
fputcsv($keyedDict, array($line,metaphone($line)));
}
fclose($dict);
fclose($keyedDict);
?>
Along with this, you are going to need something that can read the dictionary into an array:
<?php
function readDictionary($file){
$dict = fopen($file, "r");
$words = array();
while($line = fgetcsv($dict)){
$words[$line[0]] = $line[1];
}
return $words;
}
function checkSpelling($input, $words){
if(array_key_exists($input, $words)){
return;
}
else {
// sanitize the input
$input = preg_replace('/[^a-z0-9 ]+/i', '', $input);
// get the metaphone key for the input
$inputkey = metaphone($input);
echo $inputkey."<br/>";
$suggestions = array();
foreach($words as $word => $key){
// get the similarity between the keys
$percentageSimilarity = 0;
similar_text($key, $inputkey, $percentageSimilarity);
if($percentageSimilarity > 90){
$suggestions[] = array($word, levenshtein($input, $word));
}
}
// rank the suggestions
usort($suggestions, "rankSuggestions");
return $suggestions;
}
}
if(isset($_GET['check'])){
$words = readDictionary("dictionary.txt");
$input = trim($_GET['check']);
$sentence='';
$sentence = explode(' ', $input);
print "Searching Words ".implode(",", $sentence);
foreach($sentence as $item){
$suggestionsArray = checkSpelling($item, $words);
if (is_array($suggestionsArray)){
echo $item, " not found, maybe you meant";
var_dump($suggestionsArray);
} else {
echo "found $item";
}
}
}
function rankSuggestions($a, $b){
return $a[1]-$b[1];
}
?>
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Spell Check</title>
</head>
<body>
<form method="get">
<input type="text" name="check" autocomplete="off" autofocus />
</form>
</body>
</html>
The simplest way to do actual paring of the data would be to split your dictionary into multiple files partitioned by something like the first character in the string. Something along the lines of dict.a.txt, dict.b.txt, dict.c.txt etc.
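Another way to pare the data, sketched here under assumptions, is to group the dictionary by metaphone key in memory, so that only words sharing the input's key are ever passed to levenshtein(). The function names buildBuckets() and suggest() and the default distance limit of 2 are choices made for this example. The trade-off is that a misspelling whose metaphone key differs from the intended word's key will get no suggestions.

```php
<?php
// Hypothetical helper: group dictionary words by their metaphone key.
function buildBuckets(array $words): array
{
    $buckets = [];
    foreach ($words as $word) {
        $buckets[metaphone($word)][] = $word;
    }
    return $buckets;
}

// Hypothetical helper: only words in the input's bucket are compared,
// then ranked by edit distance (closest first).
function suggest(string $input, array $buckets, int $maxDistance = 2): array
{
    $ranked = [];
    foreach ($buckets[metaphone($input)] ?? [] as $candidate) {
        $distance = levenshtein($input, $candidate);
        // Distance 0 means the word is spelled correctly, so skip it.
        if ($distance > 0 && $distance <= $maxDistance) {
            $ranked[$candidate] = $distance;
        }
    }
    asort($ranked);
    return array_keys($ranked);
}

// Usage: "kat" shares a metaphone key with "cat" but not with the others.
$buckets = buildBuckets(['cat', 'dog', 'pimple']);
print_r(suggest('kat', $buckets));
```

The buckets could be built once and stored (for example with serialize() or var_export()) instead of being rebuilt on every request.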

Related

extract pure text strings from php scripts for translation

I have a project that contains a large number of PHP files. The former programmer wrote all the texts in English directly in the source files, mixed with HTML code, and I now need to translate them. Going through the files manually and extracting all the texts into one language file is a huge pain. Is there a free tool that extracts all the text, converts it to e.g. variables in the source files, and produces one big file of text variables for simple translation?
Many thanks.
P.S. I would like to automate this work rather than do it manually file by file.
Examples of code in php files:
<?php
echo "Hi back, " . $user;
?>
<center class="title">No list(s) available.</center>
<tr id="exp<?php echo $r; ?>" class="me" onmouseover="dis('<?php echo $u; ?>');"> <td>This is new statement</td></tr>
This function will help you in some cases; it returns the plain text between > and <.
Before you start, you need to replace
' (quotation)
with
\' (backslash quotation)
$text = '
<?php
echo "Hi back, " . $user;
?>
<center class="title">No list(s) available.</center>
<tr id="exp<?php echo $r; ?>" class="me" onmouseover="dis(\'<?php echo $u; ?>\');"> <td>This is new statement</td></tr>
';
The function is:
function getSentences($string){
$arr = array();
$parts = explode(">", $string);
if(count($parts) > 2){
$pattern = "/\>(.*?)</";
foreach($parts as $part){
$part = ">" . $part;
preg_match($pattern, trim($part), $matches);
if(!empty($matches[1]) AND $matches[1] != " "){
if(preg_match('/^[a-zA-Z0-9]/', $matches[1])){
$arr[] = $matches[1];
}
}
}
}else{
$pattern = "/\>(.*?)</";
preg_match($pattern, $string, $matches);
// only record a match if the pattern actually captured something
if(!empty($matches[1])){
$arr[] = $matches[1];
}
}
return $arr;
}
Call the function with:
print_r(getSentences($text));
The output will be something like this:
Array ( [0] => No list(s) available. [1] => This is new statement )
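The function above only sees text between > and <, so string literals inside PHP code (such as "Hi back, ") are missed. One possible complement, sketched here as an assumption and not as a finished tool, is PHP's built-in tokenizer: token_get_all() reports constant string literals and inline HTML separately. The function name extractStrings() is made up for this example, and the tokenizer extension must be enabled.

```php
<?php
// Hypothetical helper: collect constant string literals and the visible
// text of inline HTML from a PHP source string.
function extractStrings(string $source): array
{
    $found = [];
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            continue; // single-character tokens such as ";" or "."
        }
        [$id, $text] = $token;
        if ($id === T_CONSTANT_ENCAPSED_STRING) {
            // Literal is kept exactly as written, including its quotes.
            $found[] = $text;
        } elseif ($id === T_INLINE_HTML) {
            $plain = trim(strip_tags($text));
            if ($plain !== '') {
                $found[] = $plain;
            }
        }
    }
    return $found;
}

// Usage: single quotes keep $user from being interpolated in this sample.
$sample = '<?php echo "Hi back, " . $user; ?><center class="title">No list(s) available.</center>';
print_r(extractStrings($sample));
```

Literals that contain interpolated variables are tokenized differently and are not collected by this sketch, and array keys or SQL fragments will also show up, so the result still needs a manual review.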

Filter out a specific string within a txt file

I have this search script:
$search = $_GET["search"];
$logfile = $_GET['logfile'];
$file = fopen($logfile, "r");
?>
<head>
<title>Searching: <?php echo $search ?></title>
</head>
<?php
while( ($line = fgets($file)) !== false) {
if(stristr($line,$search)) {
// case insensitive
echo "<font face='Arial'> $line </font><hr><p>";
}
}
I want to filter out a specific string when searching for something in the txt file.
For example, the text file consists of this:
http://test.com/?id=2022458&pid=41&user=Ser_Manji
Ser_manji said "hello"
Ser_manju left the game
When you search for "Ser_manji", for instance, I want to filter out this string:
http://test.com/?id=2022458&pid=41&user=Ser_Manji
But still display these two lines:
Ser_manji said "hello"
Ser_manju left the game
I hope this is possible. I tried altering it myself so that it would reject any line that contained "test.com", but that didn't work.
This should work for you:
Just get your file into an array with file(). And use strpos() to check if the search needle is in the line and if not display the line.
<?php
$search = $_GET["search"];
$logfile = $_GET['logfile'];
$lines = file($logfile, FILE_SKIP_EMPTY_LINES | FILE_IGNORE_NEW_LINES);
?>
<head>
<title>Searching: <?php echo $search ?></title>
</head>
<?php
foreach($lines as $line) {
if(strpos($line, $search) === FALSE) {
echo "<font face='Arial'>$line</font><hr><p>";
}
}
?>
You just need to modify your if condition like so:
if (stristr($line, $search) && strpos($line,'test.com') === false)
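The same condition can be wrapped in a small function so it is easy to try out on sample lines. This is only a sketch: the name filterLines() and the $exclude parameter are assumptions for illustration, and both needles are expected to be non-empty.

```php
<?php
// Hypothetical helper: keep lines that contain $search (case-insensitive)
// but drop any line that also contains $exclude.
function filterLines(array $lines, string $search, string $exclude): array
{
    $kept = [];
    foreach ($lines as $line) {
        if (stripos($line, $search) !== false && stripos($line, $exclude) === false) {
            $kept[] = $line;
        }
    }
    return $kept;
}

// Usage with the sample log lines from the question.
$lines = [
    'http://test.com/?id=2022458&pid=41&user=Ser_Manji',
    'Ser_manji said "hello"',
    'Ser_manju left the game',
];
print_r(filterLines($lines, 'ser_manj', 'test.com'));
```

Note that a search for the full name "Ser_manji" would not match the "Ser_manju" line, so the usage above searches for the shared prefix.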
I suppose you need to filter the logs by a specific username. That is more involved than finding the right PHP function.
So you have your search query: $search = $_GET['search'], which is a username.
You have your log file: $file = fopen($logfile, 'r').
Please note: You use GET parameters to get the filename but your example link http://test.com/?id=2022458&pid=41&user=Ser_Manji doesn't contain any &logfile=logs.txt. I suppose you know what you're doing.
Now, if your log structure is {username} {action}, then we know that a space separates the username from the action. We can use explode: $clues = explode(' ', $line); and now $username = $clues[0] and $action = $clues[1].
So if ($username == $search) echo $action
Keep it simple and clean.
$search = $_GET["search"];
$logfile = $_GET['logfile'];
$file = fopen($logfile, "r");
while ($line = fgets($file)) {
$clues = explode(' ', $line);
$username = $clues[0];
$action = $clues[1];
if ($username == $search) {
echo $action;
}
}
You can test this with http://test.com?search=user_1234&logfile=logs.txt if you are looking for user_1234 inside the logs.txt file, and so on.
If you want to match text (case insensitive) only at the beginning of the line, you could consider using a case insensitive and anchored regular expression, for filtering on a textfile ideally with the preg_grep function on array (e.g. via file) or with a FilterIterator on SplFileObject.
// regular expression pattern to match string at the
// beginning of the line (^) case insensitive (i).
$pattern = sprintf('~^%s~i', preg_quote($search_term, '~'));
For the array variant:
$result = preg_grep($pattern, file($logfile));
foreach ($result as $line) {
... // $line is each grep'ed (found) line
}
With the iterators it's slightly different:
$file = new SplFileObject($logfile);
$filter = new RegexIterator($file, $pattern);
foreach ($filter as $line) {
... // $line is each filtered (found) line
}
The iterators give you a more object-oriented approach; the array variant perhaps feels more straightforward. Both variants operate with PCRE regular expressions, which is the standard regular expression dialect in PHP.

filtering bad words from text

This function filters emails from the text and returns the matched patterns:
function parse($text, $words)
{
$resultSet = array();
foreach ($words as $word){
$pattern = 'regex to match emails';
preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE );
$this->pushToResultSet($matches);
}
return $resultSet;
}
In a similar way, I want to match bad words in the text and return them as $resultSet.
Here is the code to filter bad words:
$badwords = array('shit', 'fuck'); // Here we can use all bad words from database
$text = 'Man, I shot this f*ck, sh/t! fucking fu*ker sh!t f*cking sh\t ;)';
echo "filtered words <br>";
echo $text."<br/>";
$words = explode(' ', $text);
foreach ($words as $word)
{
$bad= false;
foreach ($badwords as $badword)
{
if (strlen($word) >= strlen($badword))
{
$wordOk = false;
for ($i = 0; $i < strlen($badword); $i++)
{
if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i]))
{
$wordOk = true;
break;
}
}
if (!$wordOk)
{
$bad= true;
break;
}
}
}
echo $bad ? 'beep ' : ($word . ' '); // Here $bad words can be returned and replace with *.
}
This replaces bad words with beep.
But I want to push the matched bad words to $this->pushToResultSet() and return them, as in the first (email filtering) code.
Can I do this with my bad-word filtering code?
Roughly converting David Atchley's answer to PHP, does this work as you want it to?
$blocked = array('fuck','shit','damn','hell','ass');
$text = 'Man, I shot this f*ck, damn sh/t! fucking fu*ker sh!t f*cking sh\t ;)';
$matched = preg_match_all("/(".implode('|', $blocked).")/i", $text, $matches);
$filter = preg_replace("/(".implode('|', $blocked).")/i", 'beep', $text);
var_dump($filter);
var_dump($matches);
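To get closer to the result-set shape from the question, the matching step could be wrapped in a function that returns each hit together with its offset. This is a minimal sketch: the name findBlockedWords() is an assumption, preg_quote() is added so that words containing regex metacharacters cannot break the pattern, and like the snippet above it only finds literal occurrences, not obfuscated spellings such as f*ck.

```php
<?php
// Hypothetical helper: return every blocked word found in $text as a
// [matched text, byte offset] pair, in order of appearance.
function findBlockedWords(string $text, array $blocked): array
{
    if ($blocked === []) {
        return []; // an empty alternation would match everywhere
    }
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $blocked);
    $pattern = '/(' . implode('|', $escaped) . ')/i';
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Usage
print_r(findBlockedWords('well damn, what the Hell', ['damn', 'hell']));
```

The returned array could then be passed to something like pushToResultSet(), as in the email example from the question.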
JSFiddle for working example.
Yes, you can match bad words (saving for later), replace them in the text and build the regex dynamically based on an array of bad words you're trying to filter (you might store it in DB, load from JSON, etc.). Here's the main portion of the working example:
var blocked = ['fuck','shit','damn','hell','ass'],
matchBlocked = new RegExp("("+blocked.join('|')+")", 'gi'),
text = $('.unfiltered').text(),
matched = text.match(matchBlocked),
filtered = text.replace(matchBlocked, 'beep');
Please see the JSFiddle link above for the full working example.

How to parse PHP code in a PHP array?

I am using this piece of code on a site of mine.
If there is PHP code in the array and you echo it, it does not run.
Here is the piece of code:
function spin($var){
$words = explode("{",$var);
foreach ($words as $word)
{
$words = explode("}",$word);
foreach ($words as $word)
{
$words = explode("|",$word);
$word = $words[array_rand($words, 1)];
echo $word." ";
}
}
}
$text = "example.com is {the best forum|a <? include(\"myfile.php\");?>Forum|a wonderful Forum|a perfect Forum} {123|some other sting}";
spin($text);
The file that needs to be included, "myfile.php", is not included, and the PHP code is output as plain text instead. Why is that? How can I solve this problem?
I believe that you will want to run the include statement through eval(). However note that:
"The eval() language construct is very dangerous because it allows execution of arbitrary PHP code. Its use thus is discouraged. If you have carefully verified that there is no other option than to use this construct, pay special attention not to pass any user provided data into it without properly validating it beforehand." (PHP.net)
SOURCE: http://php.net/manual/en/function.eval.php
You might try the following:
<?php
function spin($var)
{
$words = explode("{",$var);
foreach ($words as $word)
{
$words = explode("}",$word);
foreach ($words as $word)
{
$words = explode("|",$word);
$word = $words[array_rand($words, 1)];
if ( preg_match( "/\<\? include\(\\\"([A-Za-z\.]+)\\\"\)\;\?\>/", $word ) )
{
$file = preg_replace( "/^.*\<\? include\(\\\"([A-Za-z\.]+)\\\"\)\;\?\>.*\$/", "\$1", $word );
$pre = preg_replace( "/^(.*)\<\? include\(\\\"[A-Za-z\.]+\\\"\)\;\?\>.*\$/", "\$1", $word );
$post = preg_replace( "/^.*\<\? include\(\\\"[A-Za-z\.]+\\\"\)\;\?\>(.*)\$/", "\$1", $word );
echo $pre;
include( $file );
echo $post;
} else {
// no include directive in this variant, so print it unchanged
echo $word." ";
}
}
}
}
$text = "example.com is {the best forum|a <? include(\"myfile.php\");?>Forum|a wonderful Forum|a perfect Forum} {123|some other sting}";
spin($text);
?>
My suggestion takes a slightly different approach:
function spin($var){
$words = explode("{",$var);
foreach ($words as $word)
{
$words = explode("}",$word);
foreach ($words as $word)
{
$words = explode("|",$word);
$word = $words[array_rand($words, 1)];
if(str_replace(" ","",$word) == 'thisparam'){
echo 'a';
include("myfile.php");
echo 'Forum';
}else{
echo $word." ";
}
}
}
}
$text = "example.com is {the best forum| thisparam |a wonderful Forum|a perfect Forum} {123|some other sting}";
spin($text);
Here thisparam is a placeholder in $text that triggers the if statement.
I wrap $word in str_replace() to strip the spaces so that the comparison matches the exact word.
Well it is just a string of text after all. The echo will just output the text...
I suggest you look to make use of eval http://php.net/manual/en/function.eval.php
I can't really tell why you wish to do this, though. Whenever I need to use eval and friends, I stop to think "Should I be doing this?"
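If the goal is only to drop the output of a known file into the spun text, one alternative to eval() is to capture the include with output buffering and restrict it to a list of permitted files. The sketch below assumes a hypothetical helper named renderInclude() and an $allowed list that you maintain yourself; it is not taken from any of the answers above.

```php
<?php
// Hypothetical helper: run an included file and return what it printed,
// but only if the file is on an explicit allow-list.
function renderInclude(string $file, array $allowed): string
{
    if (!in_array($file, $allowed, true)) {
        return ''; // refuse anything that was not explicitly permitted
    }
    ob_start();          // capture everything the include prints
    include $file;
    return ob_get_clean();
}

// Usage (assumes myfile.php exists next to this script):
// $fragment = renderInclude('myfile.php', ['myfile.php']);
// echo 'a ' . $fragment . 'Forum';
```

The captured string can be substituted for a placeholder in the text before or after the spin step, so no PHP code ever has to live inside the string itself.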

How to search a multidimensional array using GET

Hey guys, I've had a lot of help from everyone here and I really appreciate it! I'm trying to create a text file search engine and I think I'm on the final stretch now. All I need to do now is search the multi-dimensional array I've created for a word submitted by a form and grabbed with GET, and return the results in highest-to-lowest order (TF-IDF will come later). I can perform a simple search on the content variable (see $new_content in the code), which is not really what I want, but not on the $index array.
Here is my code:
<?php
$starttime = microtime();
$startarray = explode(" ", $starttime);
$starttime = $startarray[1] + $startarray[0];
if(isset($_GET['search']))
{
$searchWord = $_GET['search'];
}
else
{
$searchWord = null;
}
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Untitled Document</title>
</head>
<body>
<div id="wrapper">
<div id="searchbar">
<h1>PHP Search</h1>
<form name='searchform' id='searchform' action='<?php echo $_SERVER['PHP_SELF']; ?>' method='get'>
<input type='text' name='search' id='search' value='<?php echo $_GET['search']; ?>' />
<input type='submit' value='Search' />
</form>
<br />
<br />
</div><!-- close searchbar -->
<?php
include "commonwords.php";
$index = array();
$words = array();
// All files with a .txt extension
// Alternate way would be "/path/to/dir/*"
foreach (glob("./files/*.txt") as $filename) {
// Includes the file based on the include_path
$content = file_get_contents($filename, true);
$pat[0] = "/^\s+/";
$pat[1] = "/\s{2,}/";
$pat[2] = "/\s+\$/";
$rep[0] = "";
$rep[1] = " ";
$rep[2] = "";
$new_content = preg_replace("/[^A-Za-z0-9\s\s+]/", "", $content);
$new_content = preg_replace($pat, $rep, $new_content);
$new_content = strtolower($new_content);
preg_match_all('/\S+/',$new_content,$matches,PREG_SET_ORDER);
foreach ($matches as $match) {
if (!isset($words[$filename][$match[0]]))
$words[$filename][$match[0]]=0;
$words[$filename][$match[0]]++;
}
foreach ($commonWords as $value)
if (isset($words[$filename][$value]))
unset($words[$filename][$value]);
$results = 0;
$totalCount = count($words[$filename]);
// And another item to the list
$index[] = array(
'filename' => $filename,
'word' => $words[$filename],
'all_words_count' => $totalCount
);
}
echo '<pre>';
print_r($index);
echo '</pre>';
if(isset($_GET['search']))
{
$endtime = microtime();
$endarray = explode(" ", $endtime);
$endtime = $endarray[1] + $endarray[0];
$totaltime = $endtime - $starttime;
$totaltime = round($totaltime,5);
echo "<div id='timetaken'><p>This page loaded in $totaltime seconds.</p></div>";
}
?>
</div><!-- close wrapper -->
</body>
</html>
foreach ($index as $result)
if (array_key_exists($searchWord,$result['word']))
echo "Found ".$searchWord." in ".$result['filename']." ".$result['word'][$searchWord]." times\r\n";
As an aside, I would highly recommend only searching the files if the search term has been filled rather than searching with every refresh to the page.
Also, some other things to keep in mind:
- Make sure you declare variables before using them (such as your $pat and $rep variables, should be $pat = Array(); before using it).
- You do the right thing at the top and check for the existence of a $searchWord but keep referencing the $_GET['search']; I would advise continuing to use $searchWord and checking against is_null($searchWord) throughout the page instead of using $_GET. It's good practice to not just output those variables on the page without an integrity check.
- Also, it may be more useful to check if the $searchWord (or words) are in the $commonWords, then process the file. Could take some time off the search if there are a lot of files or big files with a lot of words. I also don't fully understand why you're storing all words when you are only looking for keywords, but if this gets too big you'll be hitting a memory limit in the near future.
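Since the question also asks for results ordered from highest to lowest, the loop above can be extended to collect the hits and sort them by count. This is a sketch that assumes the $index structure built in the question ('filename' and 'word' keys); the function name searchIndex() is made up for the example, and the spaceship operator requires PHP 7 or later.

```php
<?php
// Hypothetical helper: find $searchWord in every indexed file and return
// the hits sorted by occurrence count, highest first.
function searchIndex(array $index, string $searchWord): array
{
    $needle = strtolower($searchWord); // the index stores lower-cased words
    $hits = [];
    foreach ($index as $entry) {
        if (isset($entry['word'][$needle])) {
            $hits[] = [
                'filename' => $entry['filename'],
                'count'    => $entry['word'][$needle],
            ];
        }
    }
    usort($hits, function ($a, $b) {
        return $b['count'] <=> $a['count']; // descending by count
    });
    return $hits;
}

// Usage with a tiny hand-made index
$index = [
    ['filename' => './files/a.txt', 'word' => ['php' => 2, 'array' => 1], 'all_words_count' => 2],
    ['filename' => './files/b.txt', 'word' => ['php' => 5], 'all_words_count' => 1],
    ['filename' => './files/c.txt', 'word' => ['search' => 3], 'all_words_count' => 1],
];
print_r(searchIndex($index, 'PHP'));
```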
