I am trying to build an 'analysis' feature for my translation software. A translator will be able to analyze a project which checks for similarities in a glossary.
On a project with 10,000 rows (each row contains a source text between 1-500 characters) with a glossary containing 25,000 terms, my current analysis algorithm takes a RIDICULOUS amount of time. I need to get this down to a couple of minutes maximum.
My algorithm looks something like this (I removed code that doesn't effect performance):
foreach($rows as $row){ //10,000 rows
$source = $rows["source"];
$matchPercent = $glossary->findMatch($source); //This line of code is extremely slow
$matchPercents[$matchPercent]++;
}
//Now I have an array of all the matching percentages and how many rows fall into each percentage match
public function findMatch($source)
{
$highestMatchPercent = 0;
foreach ($this->terms as $term) { //25,000 terms
$matchPercent = 0;
similar_text(strtolower($source), strtolower($term), $matchPercent);
$matchPercent = floor($matchPercent);
if ($matchPercent > $highestMatchPercent) $highestMatchPercent = $matchPercent;
if ($highestMatchPercent == 100) return $highestMatchPercent; //Added effeciency
}
return $highestMatchPercent;
}
How can I achieve similar results and speed this process up?
I've tried levenshtein, but it's max character limit is a problem.
Related
I'm working on a 'thought' function for a game i'm working on -- it pulls random strings from an XML file, combines them and makes them 'funny'. However, i'm running into a small issue in that the same couple of items keep getting selected each time.
The two functions I am using are
function randRoller($number)
{
mt_srand((microtime()*time())/3.145);
$x = [];
for($i = 0; $i < 100; $i++)
{
#$x = mt_rand(0,$number);
}
return mt_rand(0,$number);
}
/* RETRIEVE ALL RELEVANT DATA FROM THE XML FILE */
function retrieveFromXML($node)
{
$node = strtolower($node);
$output = [];
$n = substr($node,0,4);
#echo $node;
foreach($this->xml->$node->$n as $data)
{
$output[] = $data->attributes();
}
$count = count($output)-1;
$number = $this->randRoller($count);
return $output[$number];
}
Granted, the "randRoller" function is sorta defunct now because the orginal version I had (Which 'rolled' ten numbers from the count, and then selected the one which got the most number of dice) didn't work as planned.
I've tried everything i can think of to get better results && have googled my brains out to fix it. but still am getting the same repetitive results.
Don't use mt_srand() unless you know what you are doing, since it is called automatically. See the note on http://php.net/manual/en/function.mt-srand.php:
Note: There is no need to seed the random number generator with srand() or mt_srand() as this is done automatically.
Remove (all) the mt_srand() call(s).
So crazy! I have a bug that's 100% reproducible, it happens in only a few lines of code, yet I cannot for the life of me determine what the problem is.
My project is a workout maker, and the mystery involves two functions:
get_pairings: It makes a set of $together_pairs (easy) and $mixed_pairs (annoying), and combines them into $all_pairs, used to make the workout.
make_mixed_pairs: this has different logic depending on whether it's a partner vs solo workout. Both cases return a set of $mixed_pairs (in the same exact format), called by the function above.
The symptoms/clues:
The case of the solo workout is fine, $all_pairs will only contain $mixed_pairs (because as it's defined, $together_pairs are only for partner workouts)
In the case of the a partner workout, when I combine the two sets in get_pairings(), $all_pairs only successfully gets the first set I give it! (If I swap those lines at step 2 and add $together_pairs first, $all_pairs contains only those. If I do $mixed_pairs first, $all_pairs contains only that).
Then if I uncomment that second-to-last line in make_mixed_pairs() just for troubleshooting to see what happens, then $all_pairs does successfully include exercises from both sets!
That suggests the problem is something I'm doing wrong in making the arrays in make_mixed_pairs(), but I confirmed that the resulting format is identical in both cases.
Anyone see what else I could be missing? I've been narrowing down this bug for 4 hours so far- I can't make it any smaller, and I can't see what's wrong :(
Update: I updated the for loop in make_mixed_pairs() to stop at $mixed_pair_count - 1 (instead of just $mixed_pair_count), and now I sometimes get one single 'together_pair' mixed in the $all_pairs results; the same damn one each time, weirdly. Though it's not 'fixed', because again when I change the order that I add the two sets in get_pairings, when I add $together_pairs first, then $all_pairs is ENTIRELY those- it's so strange...
Here are the functions: first get_pairings (relevant part is right before and after step 2):
/**
* Used in make_workout.php: take the user's available resources, and return valid exercises
*/
function get_pairings($exercises, $count, $outdoor_partner_workout)
{
// 1. Prep our variables, and put exercises into the appropriate buckets
$mixed_exercises = array();
$together_pairs = array();
$mixed_pairs = array();
$all_pairs = array();
$selected_pairs = array();
// Sort the valid exercises: self_pairing exercises go as they are, with extra
// array for consistent formatting. Mixed ones go into $mixed_exercises array
// for more specialized pairing in make_mixed_pairs
foreach($exercises as $exercise)
{
if ($exercise['self_pairing'])
{
$pair = array($exercise);
array_push($together_pairs, [$pair]);
}
else
{
$this_exercise = array($exercise);
array_push($mixed_exercises, $this_exercise);
}
}
// Now get the mixed_pairs
$mixed_pairs = make_mixed_pairs($mixed_exercises, $outdoor_partner_workout);
// 2. combine together into one set, and select random pairs for the workout
// Add both sets to the array of all pairs (to pick from afterward)
$all_pairs += $mixed_pairs;
$all_pairs += $together_pairs;
// Now let's choose at random our desired # of pairs, and save them in $selected_pairs
$pairing_keys = array_rand($all_pairs, $count);
foreach($pairing_keys as $key)
{
array_push($selected_pairs, $all_pairs[$key]);
}
// Finally, shuffle it so we don't always see the self-pairs first
shuffle($selected_pairs);
return $selected_pairs;
}
And the other one- make_mixed_pairs: there are two cases, the first is complicated (and shows the bug) and the second is simple (and works):
/**
* Used by get_pairings: in case of a partner workout that has open space (where
* one person can travel to a point while the other does an exercise til they return)
* we'll pair exercises in a special way. (If not, fine to grab random pairs)
*/
function make_mixed_pairs($mixed_exercises, $outdoor_partner_workout)
{
$mixed_pairs = array();
// When it's an outdoor partner workout, we want to pair travelling with stationary
// put them into arrays and then we'll make pairs using one from each
if ($outdoor_partner_workout)
{
$mixed_travelling = array();
$mixed_stationary = array();
foreach($mixed_exercises as $exercise)
{
if ($exercise[0]['travelling'])
{
array_push($mixed_travelling, $exercise);
}
else
{
array_push($mixed_stationary, $exercise);
}
}
shuffle($mixed_travelling);
shuffle($mixed_stationary);
// determine the smaller set, and pair exercises that many times
$mixed_pair_count = min(count($mixed_travelling), count($mixed_stationary));
for ($i=0; $i < $mixed_pair_count; $i++)
{
$this_pair = array($mixed_travelling[$i], $mixed_stationary[$i]);
array_push($mixed_pairs, $this_pair); // problem is adding them here- we get only self_pairs
}
}
// Otherwise we can just grab pairs from mixed_exercises
else
{
// shuffle the array so it's in random order, then chunk it into pairs
shuffle($mixed_exercises);
$mixed_pairs = array_chunk($mixed_exercises, 2);
}
// $mixed_pairs = array_chunk($mixed_exercises, 2); // when I replace it with this, it works
return $mixed_pairs;
}
Oh for Pete's sake: I mentioned this to a friend, who told me that union is flukey in php, and that I should use array_merge instead.
I replaced these lines:
$all_pairs += $together_pairs;
$all_pairs += $mixed_pairs;
with this:
$all_pairs = array_merge($together_pairs, $mixed_pairs);
And now it all works
UPDATE: I think the cakePhp updateAll is the problem. If i uncomment the updateAll and pr the results i get in 1-2 seconds so many language Detections like in 5 minutes!!!! I only must update one row and can determine that row with author and title... is there a better and faster way???
I'm using detectlanguage.com in order to detect all english texts in my sql database. My Database consists of about 500.000 rows. I tried many things to detect the lang of all my texts faster. Now it will take many days... :/
i only send 20% of the text (look at my code)
i tried to copy my function and run the function many times. the copied code shows the function for all texts with a title starting with A
I only can run 6 functions at the same time... (localhost)... i tried a 7th function in a new tab, but
Waiting for available socket....
public function detectLanguageA()
{
set_time_limit(0);
ini_set('max_execution_time', 0);
$mydatas = $this->datas;
$alldatas = $mydatas->find('all')->where(['SUBSTRING(datas.title,1,1) =' => 'A'])->where(['datas.lang =' => '']);
foreach ($alldatas as $row) {
$text = $row->text;
$textLength = round(strlen($text)*0.2);
$text = substr($text,0,$ltextLength);
$title = $row->title;
$author = $row->author;
$languageCode = DetectLanguage::simpleDetect($text);
$mydatas->updateAll(
['lang' => $languageCode], // fields
['author' => $author,'textTitle' => $title]); // conditions*/
}
}
I hope some one has a idea for my problem... Now the language detection for all my texts will take more than one week :/ :/
My computer runs over 20 hours with only little interruptions... But i only detected the language of about 13.000 texts... And in my database are 500.000 texts...
Now i tried sending texts by batch, but its also to slow... I always send 20 texts in one Array and i think thats the maximum...
Is it possible that the cakePhp 3.X updateAll-function makes it so slowly?
THE PROBLEM WAS THE CAKEPHP updateAll
Now i'm using: http://book.cakephp.org/3.0/en/orm/saving-data.html#updating-data with a for loop and all is fast and good
use Cake\ORM\TableRegistry;
$articlesTable = TableRegistry::get('Articles');
for ($i = 1; $i < 460000; $i++) {
$oneArticle = $articlesTable->get($i);
$languageCode = DetectLanguage::simpleDetect($oneArticle->lyrics);
$oneArticle->lang = $languageCode;
$articlesTable->save($oneSong);
}
I've got a problem, and I'll try and describe it in as simplest terms as possible.
Using a combination of PHP and MySQL I need to fulfil the following logic problem, this is a simplified version of what is required, but in a nutshell, the logic is the same.
Think boxes. I have lots of small boxes, and one big box. I need to be able to fill the large box using lots of little boxes.
So lets break this down.
I have a table in MySQL which has the following rows
Table: small_boxes
id | box_size
=============
1 | 100
2 | 150
3 | 200
4 | 1000
5 | 75
..etc
This table can run up to the hundreds, with some boxes being the same size
I now have a requirement to fill one big box, as an example, of size 800, with all the combinations of small_boxes as I find in the table. The big box can be any size that the user wishes to fill.
The goal here is not efficiency, for example, I don't really care about going slightly under, or slightly over, just showing the different variations of boxes that can possibly fit, within a tolerance figure.
So if possible, I'd like to understand how to tackle this problem in PHP/MySQL. I'm quite competent at both, but the problem lies in how I approach this.
Examples would be fantastic, but I'd happily settle for a little info to get me started.
You should probably look into the glorious Knapsack problem
https://codegolf.stackexchange.com/questions/3731/solve-the-knapsack-problem
Read up on this.
Hopefully you've taking algebra 2..
Here is some PHP code that might help you out:
http://rosettacode.org/wiki/Knapsack_problem/0-1#PHP
Thanks to maxhd and Ugo Meda for pointing me in the right direction!
As a result I've come to something very close to what I need. I'm not sure if this even falls into the "Knapsack problem", or whichever variation thereof, but here's the code I've come up with. Feel free to throw me any constructive criticism!
In order to try and get some different variants of boxes inside the knapsack, I've removed the largest item on each main loop iteration, again, if there's a better way, let me know :)
Thanks!
class knapsack {
private $items;
private $knapsack_size;
private $tolerance = 15; //Todo : Need to make this better, perhaps a percentage of knapsack
private $debug = 1;
public function set_knapsack_size($size){
$this->knapsack_size = $size;
}
public function set_items($items){
if(!is_array($items)){
return false;
}
//Array in the format of id=>size, ordered by largest first
$this->items = $items;
}
public function set_tolerance($tolerance){
$this->tolerance = $tolerance;
}
private function remove_large_items(){
//Loop through each of the items making sure we can use this sized item in the knapsack
foreach($this->items as $list_id=>$list){
//Lets look ahead one, and make sure it isn't the last largest item, we will keep largest for oversize.
if($list["size"] > $this->knapsack_size && (isset($this->items[$list_id+1]) && $this->items[$list_id+1]["size"] > $this->knapsack_size)){
unset($this->items[$list_id]);
}else{
//If we ever get here, simply return true as we can start to move on
return true;
}
}
return true;
}
private function append_array($new_data,$array){
if(isset($array[$new_data["id"]])){
$array[$new_data["id"]]["qty"]++;
}else{
$array[$new_data["id"]]["qty"] = 1;
}
return $array;
}
private function process_items(&$selected_items,$knapsack_current_size){
//Loop the list of items to see if we can fit it in the knapsack
foreach($this->items as $list){
//If we can fit the item into the knapsack, lets add it to our selected_items, and move onto the next item
if($list["size"] <= $knapsack_current_size){
$this->debug("Knapsack size is : ".$knapsack_current_size." - We will now take ".$list["size"]." from it");
$selected_items = $this->append_array($list,$selected_items);
$knapsack_current_size -= $list["size"];
//Lets run this method again, start recursion
$knapsack_current_size = $this->process_items($selected_items,$knapsack_current_size);
}else{
//Lets check if we can fit a slightly bigger item into the knapsack, so we can eliminate really small items, within tolerance
if(($list["size"] <= $knapsack_current_size + $this->tolerance) && $knapsack_current_size > 0){
$this->debug("TOLERANCE HIT : Knapsack size is : ".$knapsack_current_size." - We will now take ".$list["size"]." from it");
$selected_items = $this->append_array($list,$selected_items);
$knapsack_current_size -= $list["size"];
}
}
//Lets see if we have to stop the recursion
if($knapsack_current_size < 0){
return $knapsack_current_size;
}
}
}
private function debug($message=""){
if(!$this->debug){
return false;
}
echo $message."\n";
}
public function run(){
//If any of the variables have not been set, return false
if(!is_array($this->items) || !$this->knapsack_size){
return false;
}
//Lets first remove any items that may be too big for the knapsack
$this->remove_large_items();
//Lets now check if we still have items in the array, just incase the knapsack is really small
if(count($this->items) == 0){
return false;
}
//Now that we have a good items list, and we have no items larger than the knapsack, lets move on.
$variants = array();
foreach($this->items as $list_id=>$list){
$this->debug();
$this->debug("Finding variants : ");
$selected_items = array();
$this->process_items($selected_items,$this->knapsack_size);
$variants[] = $selected_items;
//Remove the largest variant, so we get a new set of unique results
unset($this->items[$list_id]);
}
return $variants;
}
}
$products = array(
array("id"=>1,"size"=>90),
array("id"=>2,"size"=>80),
array("id"=>3,"size"=>78),
array("id"=>4,"size"=>66),
array("id"=>5,"size"=>50),
array("id"=>6,"size"=>42),
array("id"=>7,"size"=>36),
array("id"=>8,"size"=>21),
array("id"=>9,"size"=>19),
array("id"=>10,"size"=>13),
array("id"=>11,"size"=>7),
array("id"=>12,"size"=>2),
);
$knapsack = new knapsack();
$knapsack->set_items($products);
$knapsack->set_knapsack_size(62);
$result = $knapsack->run();
var_dump($result);
I'd like to be able to use php search an array (or better yet, a column of a mysql table) for a particular string. However, my goal is for it to return the string it finds and the number of matching characters (in the right order) or some other way to see how reasonable the search results are, so then I can make use of that info to decide if I want to display the top result by default or give the user options of the top few.
I know I can do something like
$citysearch = mysql_query(" SELECT city FROM $table WHERE city LIKE '$city' ");
but I can't figure out a way to determine how accurate it is.
The goal would be:
a) find "Milwaukee" if the search term were "milwakee" or something similar.
b) if the search term were "west", return things like "West Bend" and "Westmont".
Anyone know a good way to do this?
You should check out full text searching in MySQL. Also check out Zend's port of the Apache Lucene project, Zend_Search_Lucene.
More searching led me to the Levenshtein distance and then to similar_text, which proved to be the best way to do this.
similar_text("input string", "match against this", $pct_accuracy);
compares the strings and then saves the accuracy as a variable. The Levenshtein distance determines how many delete, insert, or replace functions on a single character it would need to do to get from one string to the other, with an allowance for weighting each function differently (eg. you can make it cost more to replace a character than to delete a character). It's apparently faster but less accurate than similar_text. Other posts I've read elsewhere have mentioned that for strings of fewer than 10000 characters, there's no functional difference in speed.
I ended up using a modified version of something I found to make it work. This ends up saving the top 3 results (except in the case of an exact match).
$input = $_POST["searchcity"];
$accuracy = 0;
$runner1acc = 0;
$runner2acc = 0;
while ($cityarr = mysql_fetch_row($allcities)) {
$cityname = $cityarr[1];
$cityid = $cityarr[0];
$city = strtolower($cityname);
$diff = similar_text($input, $city, $tempacc);
// check for an exact match
if ($tempacc == '100') {
// closest word is this one (exact match)
$closest = $cityname;
$closestid = $cityid;
$accuracy = 100;
break;
}
if ($tempacc >= $accuracy) { // more accurate than current leader
$runner2 = $runner1;
$runner2id = $runner1id;
$runner2acc = $runner1acc;
$runner1 = $closest;
$runner1id = $closestid;
$runner1acc = $accuracy;
$closest = $cityname;
$closestid = $cityid;
$accuracy = $tempacc;
}
if (($tempacc < $accuracy)&&($tempacc >= $runner1acc)) { // new 2nd place
$runner2 = $runner1;
$runner2id = $runner1id;
$runner2acc = $runner1acc;
$runner1 = $cityname;
$runner1id = $cityid;
$runner1acc = $tempacc;
}
if (($tempacc < $runner1acc)&&($tempacc >= $runner2acc)) { // new 3rd place
$runner2 = $cityname;
$runner2id = $cityid;
$runner2acc = $tempacc;
}
}
echo "Input word: $input\n<BR>";
if ($accuracy == 100) {
echo "Exact match found: $closestid $closest\n";
} elseif ($accuracy > 70) { // for high accuracies, assumes that it's correct
echo "We think you meant $closestid $closest ($accuracy)\n";
} else {
echo "Did you mean:<BR>";
echo "$closestid $closest? ($accuracy)<BR>\n";
echo "$runner1id $runner1 ($runner1acc)<BR>\n";
echo "$runner2id $runner2 ($runner2acc)<BR>\n";
}
This can be very complicated, and I am not personally aware of any good 3rd party libraries although I'm sure they exist. Others may be able to suggest some canned solutions, though.
I have written something similar from scratch a few times in the past. If you go down that route, it is probably not something you'd want to do in PHP by itself as every query would involve getting all of the records and performing your calculations on them. It will almost certainly involve creating a set of index tables that meet your specifications.
For instance, you would have to come up with rules for how you imagine that "Milwaukee" could end up spelled "milwakee." My solution to this was to do vowel compression and duplication compression (not sure if these are actually search terms). So, milwaukee would be indexed as:
milwaukee
m_lw__k__
m_lw_k_
When the search query came in for "milwaukee", I would run the same process on the text input, and then run a search on the index table for:
SELECT cityId,
COUNT(*)
FROM myCityIndexTable
WHERE term IN ('milwaukee', 'm_lw__k__', 'm_lw_k_')
When the search query came in for "milwakee", I would run the same process on the text input, and then run a search on the index table for:
SELECT cityId,
COUNT(*)
FROM myCityIndexTable
WHERE term IN ('milwaukee', 'm_lw_k__', 'm_lw_k_')
In the case of Milwaukee (spelled correctly), it would return "3" for the count.
In the case of Milwakee (spelled incorrectly) ,it would return "2" for the count (since it would not match the m_lw__k__ pattern as it only had one vowel in the middle).
If you sort the results based on the count, you would end up meeting one of your rules, that "Milwaukee" would end up being sorted higher as a possible match than "Milwakee."
If you want to build this system in a generic way (as hinted by your use of $table in the query) then you'd probably need another mapping table somewhere in there to map your terms to the appropriate table.
I'm not suggesting this is the best (or even a good) way to go about this, just something I've done in the past that might prove useful to you if you plan to try and do this without a third party solution.
Most maddening result with LIKE is this one "%man" this will return all woman in file!
In case of listing perhaps a not too bad solution is to keep on shortening the searching needle. In your case a match will come up when your searching $ is as short as "milwa".