PHP/mysql array search algorithm

I'd like to be able to use PHP to search an array (or better yet, a column of a MySQL table) for a particular string. However, my goal is for it to return the string it finds along with the number of matching characters (in the right order), or some other measure of how reasonable the search result is, so that I can decide whether to display the top result by default or offer the user the top few as options.
I know I can do something like
$citysearch = mysql_query(" SELECT city FROM $table WHERE city LIKE '$city' ");
but I can't figure out a way to determine how accurate it is.
The goal would be:
a) find "Milwaukee" if the search term were "milwakee" or something similar.
b) if the search term were "west", return things like "West Bend" and "Westmont".
Anyone know a good way to do this?

You should check out full text searching in MySQL. Also check out Zend's port of the Apache Lucene project, Zend_Search_Lucene.

More searching led me to the Levenshtein distance and then to similar_text, which proved to be the best way to do this.
similar_text("input string", "match against this", $pct_accuracy);
compares the two strings and stores the similarity, as a percentage, in the third argument ($pct_accuracy here). The Levenshtein distance counts how many single-character insertions, deletions, or replacements are needed to get from one string to the other, and PHP's levenshtein() allows each operation to be weighted differently (e.g. you can make it cost more to replace a character than to delete one). It's apparently faster but less accurate than similar_text. Other posts I've read elsewhere have mentioned that for strings of fewer than 10000 characters, there's no functional difference in speed.
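As a rough illustration of the two functions described above, here is a minimal sketch that compares one misspelled input against one candidate. The sample strings and variable names are assumptions made up for the example; the weighted call relies on levenshtein()'s optional insertion, replacement, and deletion costs.

```php
<?php
// Sample strings are assumptions for illustration only.
$input = 'milwakee';
$candidate = 'milwaukee';

// similar_text() returns the number of matching characters and
// writes the similarity percentage into its third argument.
$common = similar_text($input, $candidate, $percent);

// levenshtein() returns the number of single-character edits.
$distance = levenshtein($input, $candidate);

// Optional costs, in this order: insertion, replacement, deletion.
// Here an insertion is made three times as expensive as the others.
$weighted = levenshtein($input, $candidate, 3, 1, 1);

printf(
    "common=%d percent=%.1f distance=%d weighted=%d\n",
    $common,
    $percent,
    $distance,
    $weighted
);
```

A higher percentage from similar_text() means a closer match, while a lower Levenshtein distance means a closer match, so the two scores sort in opposite directions.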
I ended up using a modified version of something I found to make it work. This ends up saving the top 3 results (except in the case of an exact match).
$input = strtolower($_POST["searchcity"]); // lowercase, to match the lowercased city names below
// leader and two runners-up; initialised so nothing is undefined on the first pass
$closest = $runner1 = $runner2 = null;
$closestid = $runner1id = $runner2id = null;
$accuracy = $runner1acc = $runner2acc = 0;
// $allcities is a result set with the columns (id, city), e.g. from mysql_query()
while ($cityarr = mysql_fetch_row($allcities)) {
    $cityid = $cityarr[0];
    $cityname = $cityarr[1];
    $city = strtolower($cityname);
    similar_text($input, $city, $tempacc); // $tempacc receives the similarity in percent
    // check for an exact match
    if ($tempacc >= 100) {
        // closest word is this one (exact match)
        $closest = $cityname;
        $closestid = $cityid;
        $accuracy = 100;
        break;
    }
    if ($tempacc >= $accuracy) { // at least as accurate as the current leader
        // everyone moves down one place
        $runner2 = $runner1;
        $runner2id = $runner1id;
        $runner2acc = $runner1acc;
        $runner1 = $closest;
        $runner1id = $closestid;
        $runner1acc = $accuracy;
        $closest = $cityname;
        $closestid = $cityid;
        $accuracy = $tempacc;
    } elseif ($tempacc >= $runner1acc) { // new 2nd place
        $runner2 = $runner1;
        $runner2id = $runner1id;
        $runner2acc = $runner1acc;
        $runner1 = $cityname;
        $runner1id = $cityid;
        $runner1acc = $tempacc;
    } elseif ($tempacc >= $runner2acc) { // new 3rd place
        $runner2 = $cityname;
        $runner2id = $cityid;
        $runner2acc = $tempacc;
    }
}
echo "Input word: " . htmlspecialchars($input) . "\n<BR>";
if ($accuracy == 100) {
    echo "Exact match found: $closestid $closest\n";
} elseif ($accuracy > 70) { // for high accuracies, assumes that it's correct
    echo "We think you meant $closestid $closest ($accuracy)\n";
} else {
    echo "Did you mean:<BR>";
    echo "$closestid $closest? ($accuracy)<BR>\n";
    echo "$runner1id $runner1 ($runner1acc)<BR>\n";
    echo "$runner2id $runner2 ($runner2acc)<BR>\n";
}

This can be very complicated, and I am not personally aware of any good 3rd party libraries although I'm sure they exist. Others may be able to suggest some canned solutions, though.
I have written something similar from scratch a few times in the past. If you go down that route, it is probably not something you'd want to do in PHP by itself as every query would involve getting all of the records and performing your calculations on them. It will almost certainly involve creating a set of index tables that meet your specifications.
For instance, you would have to come up with rules for how you imagine that "Milwaukee" could end up spelled "milwakee." My solution to this was to do vowel compression and duplication compression (not sure if these are actually search terms). So, milwaukee would be indexed as:
milwaukee
m_lw__k__
m_lw_k_
When a search query came in for "milwaukee", I would run the same process on the text input, and then search the index table with:
SELECT cityId,
COUNT(*)
FROM myCityIndexTable
WHERE term IN ('milwaukee', 'm_lw__k__', 'm_lw_k_')
GROUP BY cityId
When a search query came in for "milwakee", I would run the same process on the text input, and then search the index table with:
SELECT cityId,
COUNT(*)
FROM myCityIndexTable
WHERE term IN ('milwakee', 'm_lw_k__', 'm_lw_k_')
GROUP BY cityId
In the case of Milwaukee (spelled correctly), it would return "3" for the count.
In the case of Milwakee (spelled incorrectly), it would return "1" for the count: only the fully compressed m_lw_k_ term matches, because neither the raw spelling nor m_lw_k__ (one vowel in the middle instead of two) is in the index.
If you sort the results by the count, you end up meeting one of your rules: a correctly spelled "Milwaukee" scores higher as a match than the misspelled "Milwakee", yet both still find the city.
If you want to build this system in a generic way (as hinted by your use of $table in the query) then you'd probably need another mapping table somewhere in there to map your terms to the appropriate table.
I'm not suggesting this is the best (or even a good) way to go about this, just something I've done in the past that might prove useful to you if you plan to try and do this without a third party solution.
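To make the vowel compression and duplication compression steps more concrete, here is a minimal sketch of how the index terms could be generated in PHP. The function name index_terms() and the choice of "_" as the placeholder follow the example above but are otherwise assumptions; a real index would live in a table rather than in arrays.

```php
<?php
// Hypothetical helper: builds the three index terms for one name.
function index_terms(string $name): array
{
    $raw = strtolower($name);
    // Vowel compression: every vowel becomes a placeholder.
    $vowelCompressed = preg_replace('/[aeiou]/', '_', $raw);
    // Duplication compression: runs of the same character collapse to one.
    $duplicateCompressed = preg_replace('/(.)\1+/', '$1', $vowelCompressed);

    return [$raw, $vowelCompressed, $duplicateCompressed];
}

$indexed = index_terms('Milwaukee'); // what would be stored for the city
$query = index_terms('milwakee');    // what the user typed

// The overlap plays the role of COUNT(*) in the SQL query.
$score = count(array_intersect($indexed, $query));

print_r($indexed);
print_r($query);
echo "score: $score\n";
```

Running the same function over both the stored names and the user input keeps the two sides consistent, which is the property the index-table approach depends on.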

The most maddening result with LIKE comes from a pattern such as "%man": it will return every "woman" in the table as well!
For a listing, a not-too-bad solution may be to keep shortening the search needle. In your case a match will come up once the search string is as short as "milwa".
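The needle-shortening idea can be sketched as follows. This is only an illustration on an in-memory array: the function name shrinking_prefix_search(), the minimum length of 3, and the sample cities are assumptions, and the byte-based string functions assume plain ASCII names.

```php
<?php
// Hypothetical helper: shortens the needle one character at a time
// until at least one city starts with it (case-insensitive).
function shrinking_prefix_search(array $cities, string $needle, int $minLength = 3): array
{
    for ($len = strlen($needle); $len >= $minLength; $len--) {
        $prefix = substr($needle, 0, $len);
        $hits = array_values(array_filter($cities, function ($city) use ($prefix) {
            return strncasecmp($city, $prefix, strlen($prefix)) === 0;
        }));
        if ($hits) {
            return ['prefix' => $prefix, 'matches' => $hits];
        }
    }

    return ['prefix' => '', 'matches' => []];
}

$cities = ['Milwaukee', 'West Bend', 'Westmont', 'Madison'];
$result = shrinking_prefix_search($cities, 'milwakee');
print_r($result);
```

With a database, the same loop would issue a LIKE 'prefix%' query per step instead of filtering an array, so a sensible minimum length matters to keep the number of queries small.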

Related

PHP compare strings in loop with similar_text, but only show if match is unique

I have an array with movie titles. I want to match those movie titles with movies I have in a database, and I do the comparison with PHP's similar_text() function. That works fine so far, but sometimes there is more than one 100% match for a movie title. In that case, I want to exclude the title from the comparison because I will do the match for it manually later. I only want to work with matches where I can be sure it's the right one.
To show the matches with 100% and also those over 80%, I use a switch/case. To ignore the titles with multiple matches, I tried to find them with a regular expression and then an if statement. But my approach is not working: it still shows multiple 100% matches.
Here is my approach so far:
$movies = array( /* full of movie titles */ );
foreach ( $movies as $movie ) {
$sim = similar_text($movie->title, $db_title, $perc);
switch($perc) {
case $perc == '100 %':
$preg = preg_match('|^(?=[^%s].*?[%s][^%s]*$)[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $perc);
if ( $preg === 0) {
echo "$movie->title: $perc %\n";
break;
} else {
echo "no usable match!\n";
break;
}
case $perc >= '80 %':
echo "$movie->title: $perc %\n";
break;
case $perc <= '80 %':
break;
}
}
Google didn't really help me. You find numerous questions about similar_text itself, but not how to handle the case when multiple strings have a 100% match.
Does anyone have an idea?
Thank you so much!
Edit (clarification of the usecase): I have a large array with many movie names (and some other stuff related to the movies) in it. I also have a database with many movies in it. Because I don't have a unique identifier to make a match between the movies from the array and the ones in the db, I want to match the titles with similar_text. But sometimes a movie title exists 3 times with an exact 100% match. I want to find those cases, so I can handle those later and just work with unique 100% matches.
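One possible way to express the use case from the edit above, sketched under assumptions: collect all database titles that reach 100% for a given movie title and accept the match only when exactly one is found. The function name unique_exact_match() and the sample titles are made up for illustration, and the database is stood in for by a plain array.

```php
<?php
// Hypothetical helper: returns the single database title that matches
// at 100 %, or null when there is no such title or more than one.
function unique_exact_match(string $title, array $dbTitles): ?string
{
    $exact = [];
    foreach ($dbTitles as $dbTitle) {
        similar_text($title, $dbTitle, $percent);
        if ($percent >= 100.0) {
            $exact[] = $dbTitle;
        }
    }

    return count($exact) === 1 ? $exact[0] : null;
}

// Sample data, assumed for the example: "Alien" exists twice.
$dbTitles = ['Alien', 'Alien', 'Aliens', 'Blade Runner'];

var_dump(unique_exact_match('Alien', $dbTitles));        // ambiguous
var_dump(unique_exact_match('Blade Runner', $dbTitles)); // unique
```

Titles for which the function returns null can be put aside for manual matching, which separates the counting from the output logic instead of mixing both in one switch.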

Stripping . from text box's

Hello, I did use the search before posting this.
I'm new to PHP/MySQL and have been doing soooo much reading. I have been able to make a game that a few friends are playing. It's like a PvP game.
Anyway, one of the people playing found a way to glitch buying and selling units by putting a . in front of the value. I do have a protect feature for stripping illegal characters:
function protect($string) {
return mysql_real_escape_string(strip_tags(addslashes($string)));
}
This works for other characters but not with a "." character. I'm not asking for someone to do it for me, I just want to be pointed in the right direction.
But just in case someone asks, here is the code I'm using:
if(isset($_POST['buy'])){
$sword = protect($_POST['sword']);
$shield = protect($_POST['shield']);
$gold_needed = (10 * $sword) + (10 * $shield);
if($sword < 0 || $shield < 0){
output("You must buy a positive number of weapons!");
}elseif($stats['gold'] < $gold_needed){
output("You do not have enough gold!");
}else{
$weapon['sword'] += $sword;
$weapon['shield'] += $shield;
$update_weapons = mysql_query("UPDATE `weapon` SET
`sword`='".$weapon['sword']."',
`shield`='".$weapon['shield']."'
WHERE `id`='".$_SESSION['uid']."'") or die(mysql_error());
$stats['gold'] -= $gold_needed;
$update_gold = mysql_query("UPDATE `stats` SET `gold`='".$stats['gold']."'
WHERE `id`='".$_SESSION['uid']."'") or die(mysql_error());
include("update_stats.php");
output("You have bought weapons!");
}
If anyone could give me a hand I would greatly appreciate it.
I did find something: "string functions, substr replace and str replace",
but can I use two functions in one query? Sorry, I'm new.
EDIT***
Here is the query posted in update_stats
$update_stats = mysql_query("UPDATE `stats` SET
`income`='".$income."',`farming`='".$farming."',
`attack`='".$attack."',`defense`='".$defense."'
WHERE `id`='".$_SESSION['uid']."'") or die(mysql_error());

one of the people playing found a way to glitch buying and selling units by putting a . in front of the value
Well, you've not disclosed EXACTLY what the vulnerability is, but I'll hazard a guess that by input of a decimal value they run around your pricing/math? So, a number of possibilities, I should think?
if (substr($string, 0, 1) == ".") {
//return false, warn, etc.
}
That could go in your "protect" function.
Likewise, you could use intval() or even is_numeric() ... here I just add it to the assignment:
$sword = protect(intval($_POST['sword']));
You could also play with a regular expression. I'm assuming $value to be numeric? How many digits max? I've used 5:
if (preg_match("%\.\d{1,5}%", $sword)) { //this guy's playing w/us
die("Go away, bad hax0rz! :-P");
}
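As an alternative to filtering out the "." specifically, the quantity could be accepted only when it consists of digits and nothing else. The following is a minimal sketch of that idea; the function name parse_quantity() and the nine-digit limit are assumptions for the example, not part of the original code.

```php
<?php
// Hypothetical helper: returns the quantity as an int, or null when the
// raw form value is anything other than a plain run of digits.
function parse_quantity($raw): ?int
{
    // ctype_digit() is false for "", ".5", "-3", "1e3" and " 5".
    if (!is_string($raw) || !ctype_digit($raw)) {
        return null;
    }
    // Assumed upper bound so the cast cannot overflow an integer.
    if (strlen($raw) > 9) {
        return null;
    }

    return (int) $raw;
}

foreach (['5', '.5', '-3', '10.5', ''] as $raw) {
    var_dump(parse_quantity($raw));
}
```

In the buy handler this would replace the protect() call for the numeric fields, with a null result treated as invalid input before any gold is calculated.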

Small scope/huge frustration mystery: buggy array behavior in php

So crazy! I have a bug that's 100% reproducible, it happens in only a few lines of code, yet I cannot for the life of me determine what the problem is.
My project is a workout maker, and the mystery involves two functions:
get_pairings: It makes a set of $together_pairs (easy) and $mixed_pairs (annoying), and combines them into $all_pairs, used to make the workout.
make_mixed_pairs: this has different logic depending on whether it's a partner vs solo workout. Both cases return a set of $mixed_pairs (in the same exact format), called by the function above.
The symptoms/clues:
The case of the solo workout is fine, $all_pairs will only contain $mixed_pairs (because as it's defined, $together_pairs are only for partner workouts)
In the case of a partner workout, when I combine the two sets in get_pairings(), $all_pairs only successfully gets the first set I give it! (If I swap those lines at step 2 and add $together_pairs first, $all_pairs contains only those. If I do $mixed_pairs first, $all_pairs contains only that).
Then if I uncomment that second-to-last line in make_mixed_pairs() just for troubleshooting to see what happens, then $all_pairs does successfully include exercises from both sets!
That suggests the problem is something I'm doing wrong in making the arrays in make_mixed_pairs(), but I confirmed that the resulting format is identical in both cases.
Anyone see what else I could be missing? I've been narrowing down this bug for 4 hours so far- I can't make it any smaller, and I can't see what's wrong :(
Update: I updated the for loop in make_mixed_pairs() to stop at $mixed_pair_count - 1 (instead of just $mixed_pair_count), and now I sometimes get one single 'together_pair' mixed in the $all_pairs results; the same damn one each time, weirdly. Though it's not 'fixed', because again when I change the order that I add the two sets in get_pairings, when I add $together_pairs first, then $all_pairs is ENTIRELY those- it's so strange...
Here are the functions: first get_pairings (relevant part is right before and after step 2):
/**
* Used in make_workout.php: take the user's available resources, and return valid exercises
*/
function get_pairings($exercises, $count, $outdoor_partner_workout)
{
// 1. Prep our variables, and put exercises into the appropriate buckets
$mixed_exercises = array();
$together_pairs = array();
$mixed_pairs = array();
$all_pairs = array();
$selected_pairs = array();
// Sort the valid exercises: self_pairing exercises go as they are, with extra
// array for consistent formatting. Mixed ones go into $mixed_exercises array
// for more specialized pairing in make_mixed_pairs
foreach($exercises as $exercise)
{
if ($exercise['self_pairing'])
{
$pair = array($exercise);
array_push($together_pairs, [$pair]);
}
else
{
$this_exercise = array($exercise);
array_push($mixed_exercises, $this_exercise);
}
}
// Now get the mixed_pairs
$mixed_pairs = make_mixed_pairs($mixed_exercises, $outdoor_partner_workout);
// 2. combine together into one set, and select random pairs for the workout
// Add both sets to the array of all pairs (to pick from afterward)
$all_pairs += $mixed_pairs;
$all_pairs += $together_pairs;
// Now let's choose at random our desired # of pairs, and save them in $selected_pairs
$pairing_keys = array_rand($all_pairs, $count);
foreach($pairing_keys as $key)
{
array_push($selected_pairs, $all_pairs[$key]);
}
// Finally, shuffle it so we don't always see the self-pairs first
shuffle($selected_pairs);
return $selected_pairs;
}
And the other one- make_mixed_pairs: there are two cases, the first is complicated (and shows the bug) and the second is simple (and works):
/**
* Used by get_pairings: in case of a partner workout that has open space (where
* one person can travel to a point while the other does an exercise til they return)
* we'll pair exercises in a special way. (If not, fine to grab random pairs)
*/
function make_mixed_pairs($mixed_exercises, $outdoor_partner_workout)
{
$mixed_pairs = array();
// When it's an outdoor partner workout, we want to pair travelling with stationary
// put them into arrays and then we'll make pairs using one from each
if ($outdoor_partner_workout)
{
$mixed_travelling = array();
$mixed_stationary = array();
foreach($mixed_exercises as $exercise)
{
if ($exercise[0]['travelling'])
{
array_push($mixed_travelling, $exercise);
}
else
{
array_push($mixed_stationary, $exercise);
}
}
shuffle($mixed_travelling);
shuffle($mixed_stationary);
// determine the smaller set, and pair exercises that many times
$mixed_pair_count = min(count($mixed_travelling), count($mixed_stationary));
for ($i=0; $i < $mixed_pair_count; $i++)
{
$this_pair = array($mixed_travelling[$i], $mixed_stationary[$i]);
array_push($mixed_pairs, $this_pair); // problem is adding them here- we get only self_pairs
}
}
// Otherwise we can just grab pairs from mixed_exercises
else
{
// shuffle the array so it's in random order, then chunk it into pairs
shuffle($mixed_exercises);
$mixed_pairs = array_chunk($mixed_exercises, 2);
}
// $mixed_pairs = array_chunk($mixed_exercises, 2); // when I replace it with this, it works
return $mixed_pairs;
}

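Oh for Pete's sake: I mentioned this to a friend, who told me that the array union operator (+) in PHP doesn't do what I expected, and that I should use array_merge instead. For keys that exist in both arrays, + keeps the element from the left-hand array and ignores the one on the right. Both of my sets are numerically indexed from 0, so nearly every key collided and only the first set survived.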
I replaced these lines:
$all_pairs += $together_pairs;
$all_pairs += $mixed_pairs;
with this:
$all_pairs = array_merge($together_pairs, $mixed_pairs);
And now it all works.
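The difference between the two operations can be seen in a small sketch. The array contents are placeholder strings assumed for the example; the real arrays hold exercise pairs.

```php
<?php
// Placeholder data: both arrays are numerically indexed from 0.
$mixedPairs = ['mixed A', 'mixed B'];
$togetherPairs = ['together A', 'together B', 'together C'];

// Union: keys 0 and 1 already exist on the left, so only key 2
// is taken from the right-hand array.
$union = $mixedPairs + $togetherPairs;

// array_merge() renumbers numeric keys and appends everything.
$merged = array_merge($mixedPairs, $togetherPairs);

print_r($union);
print_r($merged);
```

This also fits the symptom of a single 'together' pair occasionally showing up: it is the one element whose key did not exist in the first array.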

PHP performant search a text for given usernames

I am currently dealing with a performance issue where I cannot find a way to fix it. I want to search a text for usernames mentioned with the # sign in front. The list of usernames is available as PHP array.
The problem is usernames may contain spaces or other special characters. There is no limitation for it. So I can't find a regex dealing with that.
Currently I am using a function which gets the whole line after the # and checks char by char which usernames could match for this mention, until there is just one username left which totally matches the mention. But for a long text with 5 mentions it takes several seconds (!!!) to finish. For more than 20 mentions the script runs endlessly.
I have some ideas, but I don't know if they may work.
Going through username list (could be >1.000 names or more) and search for all #Username without regex, just string search. I would say this would be far more inefficient.
Checking with JavaScript, while the username is being written, whether a space or punctuation mark is inside the username, and then surrounding it with quotation marks, like #"User Name". I don't like that idea; it looks dirty to the user.
Don't start with one character, but maybe 4, and if there is no match, go back. So the same principle as in sorting algorithms: divide and conquer. Could be difficult to implement and will maybe lead to nothing.
How does Facebook or twitter and any other site do this? Are they parsing the text directly while typing and saving the mentioned usernames directly in the stored text of the message?
This is my current function:
$regular_expression_match = '~(?:^|\\s)#(.+?)(?:\n|$)~'; // "~" as delimiter, because "#" occurs inside the pattern
$matches = false;
$offset = 0;
while (preg_match($regular_expression_match, $post_text, $matches, PREG_OFFSET_CAPTURE, $offset))
{
$line = $matches[1][0];
$search_string = substr($line, 0, 1);
$filtered_usernames = array_keys($user_list);
$matched_username = false;
// Loop, make the search string one by one char longer and see if we have still usernames matching
while (count($filtered_usernames) > 1)
{
$filtered_usernames = array_filter($filtered_usernames, function ($username_clean) use ($search_string, &$matched_username) {
$search_string = utf8_clean_string($search_string);
if (strlen($username_clean) == strlen($search_string))
{
if ($username_clean == $search_string)
{
$matched_username = $username_clean;
}
return false;
}
return (substr($username_clean, 0, strlen($search_string)) == $search_string);
});
if ($search_string == $line)
{
// We have reached the end of the line, so stop
break;
}
$search_string = substr($line, 0, strlen($search_string) + 1);
}
// If there is still one in filter, we check if it is matching
$first_username = reset($filtered_usernames);
if (count($filtered_usernames) == 1 && utf8_clean_string(substr($line, 0, strlen($first_username))) == $first_username)
{
$matched_username = $first_username;
}
// We can assume that $matched_username is the longest matching username we have found due to iteration with growing search_string
// So we use it now as the only match (Even if there are maybe shorter usernames matching too. But this is nothing we can solve here,
// This needs to be handled by the user, honestly. There is a autocomplete popup which tells the other, longer fitting name if the user is still typing,
// and if he continues to enter the full name, I think it is okay to choose the longer name as the chosen one.)
if ($matched_username)
{
$startpos = $matches[1][1];
// We need to get the endpos, cause the username is cleaned and the real string might be longer
$full_username = substr($post_text, $startpos, strlen($matched_username));
while (utf8_clean_string($full_username) != $matched_username)
{
$full_username = substr($post_text, $startpos, strlen($full_username) + 1);
}
$length = strlen($full_username);
$user_data = $user_list[$matched_username];
$mentioned[] = array_merge($user_data, array(
'type' => self::MENTION_AT,
'start' => $startpos,
'length' => $length,
));
}
$offset = $matches[0][1] + strlen($search_string);
}
Which way would you go? The problem is the text will be displayed often and parsing it every time will consume a lot of time, but I don't want to heavily modify what the user had entered as text.
I can't find out what's the best way, and even why my function is so time consuming.
A sample text would be:
Okay, #Firstname Lastname, I mention you!
Listen #[TEAM] John, you are a team member.
#Test is a normal name, but #Thât♥ should be tracked too.
And see #Wolfs garden! I just mean the Wolf.
Usernames in that text would be
Firstname Lastname
[TEAM] John
Test
Thât♥
Wolf
So yes, there is clearly no way for me to know where a name ends. The only hard boundary is the newline.

I think the main problem is that you can't distinguish usernames from the surrounding text. It's a bad idea to look up maybe thousands of usernames in a text, and it can lead to further problems, for example that John is part of [TEAM] John or JohnFoo...
You need a way to separate the usernames from the other text. Assuming that you're using UTF-8, you could wrap each username in invisible characters: a zero-width space (U+200B, bytes \xE2\x80\x8B) in front and a zero-width non-joiner (U+200C, bytes \xE2\x80\x8C) behind.
The usernames can then be extracted quickly and with little effort and, if needed, still be verified against the database.
$txt = "
Okay, #\xE2\x80\x8BFirstname Lastname\xE2\x80\x8C, I mention you!
Listen #\xE2\x80\x8B[TEAM] John\xE2\x80\x8C, you are a team member.
#\xE2\x80\x8BTest\xE2\x80\x8C is a normal name, but
#\xE2\x80\x8BThât♥\xE2\x80\x8C should be tracked too.
And see #\xE2\x80\x8BWolfs\xE2\x80\x8C garden! I just mean the Wolf.";
// extract usernames
if(preg_match_all('~#\xE2\x80\x8B\K.*?(?=\xE2\x80\x8C)~s', $txt, $out)){
print_r($out[0]);
}
Array
(
[0] => Firstname Lastname
[1] => [TEAM] John
[2] => Test
[3] => Thât♥
[4] => Wolfs
)
echo $txt;
Okay, #Firstname Lastname, I mention you!
Listen #[TEAM] John‌, you are a team member.
#Test‌ is a normal name, but
#Thât♥‌ should be tracked too.
And see #Wolfs‌ garden! I just mean the Wolf.
You could use any characters you like as delimiters, as long as they are unlikely to occur elsewhere in the text.
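The wrapping step itself is not shown above, so here is a minimal sketch of how a mention could be wrapped when the post is saved and extracted again when it is displayed. The constant and function names (MENTION_OPEN, MENTION_CLOSE, wrap_mention(), extract_mentions()) are assumptions for the example; the delimiters are the same two zero-width characters.

```php
<?php
const MENTION_OPEN = "\xE2\x80\x8B";  // U+200B zero-width space
const MENTION_CLOSE = "\xE2\x80\x8C"; // U+200C zero-width non-joiner

// Hypothetical helper: called once, when the mention is inserted.
function wrap_mention(string $username): string
{
    return '#' . MENTION_OPEN . $username . MENTION_CLOSE;
}

// Hypothetical helper: returns every delimited username in the text.
function extract_mentions(string $text): array
{
    $pattern = '~#' . MENTION_OPEN . '(.*?)' . MENTION_CLOSE . '~su';
    preg_match_all($pattern, $text, $matches);

    return $matches[1];
}

$text = 'Listen ' . wrap_mention('[TEAM] John') . ', and see '
    . wrap_mention('Wolf') . 's garden!';

print_r(extract_mentions($text));
```

Because the name boundaries are stored with the text, displaying a post needs one regular expression pass instead of a comparison against the whole user list.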

comparing variables in PHP with different language files

I have a select menu for users.
It is populated by PHP variables, which are all language variables. For example:
$word = $lang['word'];
$select = array($word);
Therefore, the select menu options will change based on the language the user has chosen. I need to be able to compare users' selections to each other. For example:
if($user1word == $user2word) ...
But because of the language files, this doesn't work. Obviously "one" != "Uno" even though they're the same.
My first fix was to change everything to a numeric value before posting it to the database. Example:
if($_POST['word'] == $lang['word']) { $userWord = 1; }
This worked perfectly for all words except those that contained special characters (å, æ, é...), and nothing I did could resolve this (I tried normalizer; language-specific accept-char onchange events for the form; utf8_encode). It was hopeless.
Currently everything saves to the database as text, dependent on the language the user is in. So if "Language" is an option, but you're in Norwegian, it saves as "Språk".
I need a simple solution that doesn't crush my mind - I am new to PHP.

Currently everything saves to the database as text, dependent on the language the user is in.
This is a design flaw in my opinion. Ideally your data would be as agnostic as possible to language and translations would be performed just for the UI with tools like gettext. Typically items like select values would be stored with keys or IDs.

Use a map that is based on a common index. This index can be English, Spanish, or numeric; I'd suggest numeric. Store the numeric index in your database.
A step in the right direction:
$lang['en'][0] = 'Hello';
$lang['de'][0] = 'Hallo';
$lang['es'][0] = 'Hola';
$lang['en'][1] = 'Sup?';
$lang['de'][1] = 'Wie gehts?';
$lang['es'][1] = 'Que pasa?';
$userLang = 'en';
// show the select
echo '<select name="word">';
foreach ( $lang[$userLang] as $index => $word ) {
echo '<option value="'.$index.'">'.$word.'</option>';
}
echo '</select>';
// show the selected word:
echo 'You chose to say "'.$lang[$userLang][$_POST['word']].'".';
// compare the word to what is in the db
if ( $_POST['word'] === $dbRow['word'] ) {
// expression matches!
// assume a column in the db "language" describes the language the user chose, e.g. 'en', 'de', or 'es'
echo 'You previously chose "'.$lang[$dbRow['language']][$dbRow['word']].'" in the language "'.$dbRow['language'].'".';
}
Depending on how you want to organize it, you may favor grouping by phrase instead of grouping by language, i.e.:
$lang[0]['en'] = 'Hello';
$lang[0]['de'] = 'Hallo';
$lang[0]['es'] = 'Hola';
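To tie this back to the original problem of comparing two users' selections, here is a small sketch that stores only the numeric index per user and compares those. The user arrays, the Norwegian entries, and the helper name label_for() are assumptions for illustration.

```php
<?php
// Translation map grouped by language, keyed by a numeric index.
$lang = [
    'en' => [0 => 'Hello', 1 => 'Language'],
    'no' => [0 => 'Hei', 1 => 'Språk'],
];

// Hypothetical helper: looks up the label to display for one user.
function label_for(array $lang, string $language, int $index): string
{
    return $lang[$language][$index];
}

// What would be stored in the database: the index, never the text.
// Form values arrive as strings, so they are cast before storing.
$user1 = ['language' => 'en', 'word' => (int) '1'];
$user2 = ['language' => 'no', 'word' => (int) '1'];

$sameChoice = $user1['word'] === $user2['word'];

echo label_for($lang, $user1['language'], $user1['word']), "\n";
echo label_for($lang, $user2['language'], $user2['word']), "\n";
var_dump($sameChoice);
```

The comparison never touches the translated text, so special characters such as å or æ no longer play any part in it.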
