php (fuzzy) search matching - php

If anyone has ever submitted a story to Digg, it checks whether or not the story has already been submitted, I assume by a fuzzy search.
I would like to implement something similar and want to know if they are using an open-source PHP class.
Soundex isn't doing it; sentences/strings can be up to 250 characters in length.

Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization.) However, you can certainly apply the algorithm to small data sets.
To expand on how you can create a server meltdown: PHP has a couple of built-in functions that measure the "distance" between strings, levenshtein and similar_text.
Dummy data (pretend they're news headlines):
$titles = <<<EOF
Apple
Apples
Orange
Oranges
Banana
EOF;
$titles = explode("\n", $titles);
At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines, you will get a 5 x 5 matrix (25 entries.) That's where the CPU and memory sink goes in.
That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:
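$matches = array();
foreach ( $titles as $title ) {
    $matches[$title] = array();
    // Compare this headline against every headline, including itself.
    foreach ( $titles as $compare_to ) {
        $matches[$title][$compare_to] = levenshtein( $compare_to, $title );
    }
    // Sort each row so the closest matches come first.
    asort( $matches[$title], SORT_NUMERIC );
}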
At this point what you basically have is a matrix with "text distances." In concept (not in real data) it looks sort of like this table below. Note how there is a set of 0 values that go diagonally - that means that in the matching loop, two identical words are -- well, identical.
          Apple   Apples   Orange   Oranges   Banana
Apple       0       1        5        6         6
Apples      1       0        6        5         6
Orange      5       6        0        1         5
Oranges     6       5        1        0         5
Banana      6       6        5        5         0
The actual $matches array looks sort of like this (truncated):
Array
(
    [Apple] => Array
        (
            [Apple] => 0
            [Apples] => 1
            [Orange] => 5
            [Banana] => 6
            [Oranges] => 6
        )
    [Apples] => Array
        (
            ...
Anyhow, it's up to you to (by experimentation) determine what a good numerical distance cutoff might mostly match - and then apply it. Otherwise, read up on sphinx-search and use it - since it does have PHP libraries.
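As a rough sketch of applying such a cutoff: the helper below compares one new headline against the existing ones and keeps only those within a maximum distance. The function name findSimilarTitles and the cutoff of 2 are made up for illustration; a sensible cutoff has to come from experimenting with your own data.

```php
<?php
// Hypothetical helper: return the existing titles whose Levenshtein distance
// to $newTitle is at most $maxDistance, closest first.
function findSimilarTitles(string $newTitle, array $titles, int $maxDistance): array
{
    $similar = [];
    foreach ($titles as $title) {
        $distance = levenshtein($newTitle, $title);
        if ($distance <= $maxDistance) {
            $similar[$title] = $distance;
        }
    }
    asort($similar, SORT_NUMERIC);
    return $similar;
}

$titles = ['Apple', 'Apples', 'Orange', 'Oranges', 'Banana'];
$result = findSimilarTitles('Apple', $titles, 2); // cutoff of 2 is an assumption
print_r($result);
```

This only computes one row of the matrix per submission, so the cost grows linearly with the number of stored headlines rather than quadratically.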
Orange you glad you asked about this?

I would suggest taking the user-submitted URLs and storing them in multiple parts: domain name, path and query string. Use the PHP parse_url() function to derive the parts of the submitted URL.
Index at least the domain name and path. Then, when a user submits a new URL, search your database for a record matching the domain and path. Since the columns are indexed, you first filter out all records that are not in the same domain and then search through the remaining records. Depending on your dataset, this should be faster than simply indexing the entire URL. Make sure your WHERE clause is set up in the right order.
If that does not meet your needs, I would suggest trying Sphinx. Sphinx is an open source SQL full-text search engine that is far faster than MySQL's built-in full-text search. It supports stemming and some other nice features.
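A minimal sketch of the splitting step, assuming you store the parts in three columns. The helper name splitSubmittedUrl, the array keys and the example URL are hypothetical; only parse_url() comes from the suggestion above.

```php
<?php
// Hypothetical helper: split a submitted URL into the parts to store and index.
function splitSubmittedUrl(string $url): ?array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // malformed URL, or no host to index
    }
    return [
        'domain' => strtolower($parts['host']), // host names are case-insensitive
        'path'   => $parts['path'] ?? '/',
        'query'  => $parts['query'] ?? '',
    ];
}

$parts = splitSubmittedUrl('http://Example.com/news/story?id=42');
print_r($parts);
```

Lower-casing the domain before storing it is a design choice made here so that the same site always lands on the same index entry.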
http://sphinxsearch.com/
You could also take the title or text content of the user's submission, run it through a function to generate keywords, and search the database for existing records with those or similar keywords.

You could (depending on the size of your dataset) use MySQL's FULLTEXT search, look for items that have a high score and fall within a certain timeframe, and suggest these to the user.
More about score here: MySQL Fulltext Search Score Explained

Related

PHP shift scheduling - sequentially looping through a multidimensional array with numerous subarrays

I am trying to design a shift scheduling program. There are about a dozen potential users (+/- a few), all with specific vacation requirements. There are 4 types of shifts (role1, role2, role3, role4); each role must be covered by a person, a single person cannot cover multiple roles, and some users prefer certain shift sequences (eg. role1-role2-role4-role3 vs. role1-role2-role3-role4), though ultimately the shift sequences can be rearranged if it can't be accommodated within the schedule. Some users can't work certain shifts at all. Each user also has a maximum number of shifts - overall, there isn't much flexibility within the system, so I suspect only a limited number of possible valid solutions.
I've created an array that takes into account mandatory user vacations, so for each calendar date I know which users are in theory available for that shift (at baseline):
[2020-07-21] => Array
    (
        [0] => Array
            (
                [role1] => userA
                [role2] => userB
                [role3] => userC
                [role4] => userD
            )
        [1] => Array
            (
                [role1] => userA
                [role2] => userB
                [role3] => userD
                [role4] => userC
            )
        [2] => Array
            (
                [role1] => userA
                [role2] => userC
                [role3] => userB
                [role4] => userD
            )
The array is generated by the code below. This code hasn't had issues; it's when I try to call print_r(cartesian($solutionsArray)); that I run out of memory.
// Use cartesian product to pick one value from each role array, and then exclude arrays that don't have at least 4 unique users
foreach ($globalAvailabilityCalendarArray[$globalAvailabilityCalendarDate] as $role => $availableUsersArray) {
    $cartesianProducts = $utilities->cartesian($globalAvailabilityCalendarArray[$globalAvailabilityCalendarDate]);
    foreach ($cartesianProducts as $cartesianProduct) {
        if (count(array_unique(array_values($cartesianProduct))) !== count($this->roles)) {
            continue;
        } else {
            $solutions[] = $cartesianProduct;
        }
    }
}
$solutionsArray[$globalAvailabilityCalendarDate] = array_unique($solutions, SORT_REGULAR);
I have a few problems though:
For this time period, there are 63 work days to take into account. Some days only have 4 people available so the possible combinations are small, but some days may have 10+ available people which means hundreds of possible valid combinations. However, the scheduling requirements of each day depend on what combination has been selected for previous days so it doesn't go over an individual worker's maximum number of shifts. Thus, I am trying to sequentially iterate over each subarray until I find the first possible valid solution (and if possible, other valid solutions for users to compare against) - (eg. [2020-07-01][0] ... [2020-07-02][1]... then [2020-07-01][0] ... [2020-07-02][2]). I have tried calculating cartesian product (Finding cartesian product with PHP associative arrays) which worked for finding available combinations for a single day, but my script runs out of memory when applying it to the whole calendar as there are just too many possible sequences. Is there a less memory intensive alternative?
With my array structure, how can I prioritize shift sequences so that users have a good ratio of shifts (target is role1+role2+role4 / role3 = 2.5) and preferred shift sequence (eg. role1-role2-role3-role4 and avoiding sequential busy shifts like role2-role2-role2-role2-role1)?
If the only problem you have is memory usage when using the cartesian function then you could search for another way to do the cartesian product without consuming that much memory. After a quick search I found a possible solution you could use. https://github.com/bpolaszek/cartesian-product
According to its documentation, you can use it like:
print_r(cartesian_product($solutionsArray)->asArray());
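Independently of that library, one way to keep memory down is to generate the combinations lazily instead of building the whole product up front. Below is a minimal sketch using a PHP generator; cartesianLazy and the sample $day data are hypothetical and not taken from the question's code or from the library above.

```php
<?php
// Lazily yield every combination that picks one value per key of $sets.
// Only the combination currently being built is held in memory.
function cartesianLazy(array $sets): Generator
{
    if ($sets === []) {
        yield [];
        return;
    }
    $firstKey = array_keys($sets)[0];
    $rest = array_slice($sets, 1, null, true); // keep the original keys
    foreach ($sets[$firstKey] as $value) {
        foreach (cartesianLazy($rest) as $combo) {
            yield [$firstKey => $value] + $combo;
        }
    }
}

// Made-up availability for a single day with two roles.
$day = [
    'role1' => ['userA', 'userB'],
    'role2' => ['userB', 'userC'],
];

$valid = [];
foreach (cartesianLazy($day) as $combo) {
    // Same rule as in the question: nobody may cover two roles at once.
    if (count(array_unique($combo)) === count($day)) {
        $valid[] = $combo;
    }
}
print_r($valid);
```

Because the loop consumes one combination at a time, you can also break out as soon as the first valid schedule is found rather than enumerating everything.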

PHP - Check duplicated array

I have for example one array:
1,2,3,4,5,6,7,8,9
Now this array gets sorted into three arrays (randomly)
For example:
1,2,3
4,5,6
7,8,9
Now I have to create 3 arrays again with the same numbers (1-9).
But the new arrays should not pair up any numbers that were together in a past array:
1,3,5 (incorrect, 3 has already been with 1)
1,5,7 (correct, all numbers are new to each other)
Now I found a way to detect this using loops; below you see part of the code.
$temp_plr holds the newly generated random numbers (a 3-number array).
$team_check is one of the previous 3-number arrays.
This check is executed until it finds a new combination that didn't show up before.
It works, but it is really slow sometimes, which makes the browser time out, or it just loops forever.
If you need more explanation please tell me.
if ((in_array($temp_plr[0], $team_check) && in_array($temp_plr[1], $team_check))
    || (in_array($temp_plr[0], $team_check) && in_array($temp_plr[2], $team_check))
    || (in_array($temp_plr[1], $team_check) && in_array($temp_plr[2], $team_check))
) {
    $okey = false;
}
$temp_plr includes 3 values and $team_check also includes 3 values.
image of the end result I have made with this code:
http://i57.tinypic.com/9hqv4w.png
As you can see, all numbers in each 3-number team are different from each other in each round.
It would be easier to look at it differently.
Get your first trio of arrays as you are doing already.
Then, instead of just generating a new trio and seeing if it fits the rules, make it fit the rules.
To do this, build your new trio by picking one number from the first row of the old trio, then one from the second row, then one from the third.
For instance, if your first trio is:
1 2 3
4 5 6
7 8 9
You can generate a new trio by picking a random number from each row:
1 5 9
2 4 7
3 6 8
This is guaranteed to fit the rule, and is capable of generating all such results from any given "old trio".
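The row-picking idea can be sketched roughly as follows. The function name nextRound is made up, and the sketch assumes all teams have the same size; it only guarantees the rule against the immediately preceding round.

```php
<?php
// Hypothetical helper: build the next round so that every new team takes
// exactly one member from each team of the previous round.
function nextRound(array $previousTeams): array
{
    $newTeams = [];
    foreach ($previousTeams as $team) {
        shuffle($team); // random order within the old team, re-indexed from 0
        foreach ($team as $slot => $member) {
            // The i-th pick from each old team goes to new team i.
            $newTeams[$slot][] = $member;
        }
    }
    return $newTeams;
}

$round1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];
$round2 = nextRound($round1);
foreach ($round2 as $team) {
    echo implode(' ', $team), "\n";
}
```

Since no new team ever receives two members from the same old row, there is nothing to retry and the loop always finishes in a single pass.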

sorted php array "introspection"

For some reason, I have a sorted php array:
"$arr_questions" = Array [6]
0 Array [6]
1 Array [6]
2 Array [6]
3 Array [6]
4 Array [6]
5 Array [6]
each of the positions is another array. This time it is associative. See position [0]:
0 = Array [6]
question_id 40
question La tercera pregunta del mundo
explanation
choices Array [3]
correct 0
answer 1
Without looping over my array, is there any way to directly access this position 0, knowing just one of its properties?
Example: imagine I have to change some property of the array element whose "question_id" property is 40. That is the only thing I know. I don't know whether that element is going to be in the first, the second, or some other position. And, for example, imagine I want to change its "answer" property to 2.
How can I directly access that position without looping over the whole array? I mean, I don't want to do this:
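foreach ($arr_questions as &$question) { // by reference, so the change is kept
    if ($question["question_id"] == 40) {
        $question["answer"] = 2; // assign the new answer
    }
}
unset($question); // break the reference left over from the loop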
A PHP array lets you access any value directly by its key.
That is actually a big deal, because in many other languages array indices must always be integers.
PHP arrays, however, work mostly like dictionaries in other languages, so your keys can also be of other data types, such as strings.
So, if you want to be able to access a question and you know its ID, you should construct the array with question_id as the key of each entry.
If you can't do that, don't panic.
In the end you will have to make some kind of search, that's true.
But hey, then you have two cases:
a) Your array is big. Wow, in that case, you should run an optimized sorting algorithm, such as mergesort or quicksort, so that you can order your data quickly and then have it already sorted by the field you want.
b) Your array is not so big. I think in that case it's no big deal, and sorting can slow your application down more than it should. If you want to be quicker, cache the results of sorting the questions (if possible) or refactor the array construction so it uses your wanted key as the array index.
As a side note, you can't look things up without spending either some CPU time or some RAM, and usually you can trade one for the other.
I mean, if you store just one array indexed by question_id, then you can look up a question_id in O(1) + O(array-access) time. If O(array-access) is constant, then you can get to things in O(1). That means constant time, and it is as fast as it can get.
However, if you need other kinds of searches, you can end up with O(n * log(n)) or O(n²) time complexity.
But if you stored one array for each way you need to look things up, you would need only O(1) time to access each of them. You would, however, need O(n) space (where n here is the number of fields you want direct access by).
That would increase the time to build the arrays (by a constant factor).
With your situation it's not possible without a loop, but if you change your array structure to this:
array(
39 => array(...),
40 => array(...)
)
Here 39 and 40 are your question_id values, so you can access each question directly without any loop.
If you want or have to keep that structure, then just write a function that takes the array, the associative key and the value you are looking for as parameters, searches the array and returns the index it found, so you are not forced to write this loop over and over.
No, there is no way to access that element without looping over your array. You might abstract that search into a helper function, however.
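Both suggestions from the answers above can be sketched in a few lines. The sample data (question 39 in particular) and the helper name findIndexBy are made up for illustration. Note that array_column() and array_search() still walk the array internally, so this hides the loop rather than removing it.

```php
<?php
// Sample data; question 39 and its values are made up for illustration.
$arr_questions = [
    ['question_id' => 39, 'question' => 'Example question', 'answer' => 0],
    ['question_id' => 40, 'question' => 'La tercera pregunta del mundo', 'answer' => 1],
];

// Option 1: re-index once by question_id, then access entries directly.
$byId = array_column($arr_questions, null, 'question_id');
$byId[40]['answer'] = 2;

// Option 2: keep the structure and hide the search in a helper.
// Assumes $rows is a 0-based list and every row contains $field.
function findIndexBy(array $rows, string $field, $value): ?int
{
    $position = array_search($value, array_column($rows, $field), true);
    return $position === false ? null : $position;
}

$index = findIndexBy($arr_questions, 'question_id', 40);
if ($index !== null) {
    $arr_questions[$index]['answer'] = 2;
}
```

Option 1 pays the re-indexing cost once and makes every later lookup a direct key access, which is the better trade if you look up questions repeatedly.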

searching and sorting through huge array of latitude and longitude

so I have an array of latitude/longitude (it's fake latitude/longitude as you can see, but just to illustrate the point & the original array size is MUCH larger than this):
<?php
$my_nodes = array(
1=> array(273078.139,353257.444),
2=> array(273122.77,352868.571),
3=> array(272963.687,353782.863),
4=> array(273949.566,353370.127),
5=> array(274006.13,352910.551),
6=> array(273877.095,353829.704),
7=> array(271961.898,353388.245),
8=> array(272839.07,354303.863),
9=> array(273869.141,354417.432),
10=> array(273207.173,351797.405),
11=> array(274817.901,353466.462),
12=> array(274862.533,352958.718),
13=> array(272034.812,351852.642),
14=> array(274128.978,354676.828),
15=> array(271950.85,354370.149),
16=> array(275087.902,353883.617),
17=> array(275545.711,352969.325)));
?>
I want to be able to find the closest node (in this case a node is either 1, 2, 3, 4, 5, ...) for a given latitude X and longitude Y. I know the easiest way to do this is a for loop that computes an error margin for each node (abs(latitude_X - latitude_X_array) + abs(longitude_Y - longitude_Y_array)), but this will be very inefficient as the size of the array grows.
I was thinking of doing a binary search, but a binary search needs the array to be sorted first, and it's hard to sort latitude/longitude pairs when in the end we're looking for the CLOSEST latitude/longitude in the array for a given lat X, long Y. What approach should I take here?
UPDATE:
Mark has a valid point, this data could be stored in a database. However, how do I get such info from the db if I want the closest one?
Have a read of this article which explains all about finding the closest point using latitude and longitude from records stored in a database, and also gives a lot of help on how to make it efficient.... with PHP code examples.
I had a similar problem when I wanted to re-sample a large number of lat/long points to create a heightfield grid. The simplest way I found was like this:
divide the lat/long space up into a regular grid
create a bucket for each grid square
go through the list, adding each point to the bucket for grid square it falls in
then find the grid square your X,Y point falls in, and search outwards from there
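The four steps above can be sketched roughly like this. The function names buildGrid and nearestNode, the cell size of 500 and the query point are assumptions for illustration; the node coordinates are taken from the question. Distances are plain Euclidean, which fits the question's flat fake coordinates but not real latitude/longitude on a sphere.

```php
<?php
// Put every node id into the bucket of the grid square it falls in.
function buildGrid(array $nodes, float $cellSize): array
{
    $grid = [];
    foreach ($nodes as $id => [$x, $y]) {
        $key = (int) floor($x / $cellSize) . ':' . (int) floor($y / $cellSize);
        $grid[$key][] = $id;
    }
    return $grid;
}

// Search the query's own square first, then ring by ring outwards.
function nearestNode(array $nodes, array $grid, float $cellSize, float $x, float $y): ?int
{
    if ($nodes === []) {
        return null;
    }
    $cx = (int) floor($x / $cellSize);
    $cy = (int) floor($y / $cellSize);
    $best = null;
    $bestDist = INF;
    for ($ring = 0; ; $ring++) {
        // Nodes in ring r are at least (r - 1) * cellSize away, so stop
        // once that lower bound exceeds the best distance found so far.
        if (($ring - 1) * $cellSize > $bestDist) {
            break;
        }
        for ($dx = -$ring; $dx <= $ring; $dx++) {
            for ($dy = -$ring; $dy <= $ring; $dy++) {
                if (max(abs($dx), abs($dy)) !== $ring) {
                    continue; // inner squares were handled by earlier rings
                }
                foreach ($grid[($cx + $dx) . ':' . ($cy + $dy)] ?? [] as $id) {
                    $dist = hypot($nodes[$id][0] - $x, $nodes[$id][1] - $y);
                    if ($dist < $bestDist) {
                        $best = $id;
                        $bestDist = $dist;
                    }
                }
            }
        }
    }
    return $best;
}

$my_nodes = [
    1 => [273078.139, 353257.444],
    2 => [273122.77, 352868.571],
    4 => [273949.566, 353370.127],
    7 => [271961.898, 353388.245],
];
$grid = buildGrid($my_nodes, 500.0); // cell size is an assumption
echo nearestNode($my_nodes, $grid, 500.0, 273100.0, 353200.0), "\n";
```

The grid only needs to be built once; the cell size is worth tuning so that a typical bucket holds a handful of nodes, since very small cells mean many empty rings to scan.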
I'm assuming you're storing your data in a DB table like this?
id | lat | long | data
------------------------------------------------
1 | 123.45 | 234.56 | A description of the item
2 | 111.11 | 222.22 | A description of another item
In this case you can use SQL to narrow your result set down.
If you want to find items close to grid ref 20,40, you can run the following query (long is a reserved word in MySQL, hence the backticks):
SELECT *
FROM locations
WHERE lat BETWEEN 19 AND 21
AND `long` BETWEEN 39 AND 41
This will return all the items in a 2x2 square around your specified grid ref.
Several databases also provide spatial datatypes (MySQL and Postgres both do) and they might be worth investigating for this job. I don't, however, have experience with such things so I'm afraid I couldn't help with those.
To sort a multidimensional array in PHP you'd have to iterate over all elements and compare two at a time. For an array of size n that takes O(n log n) comparisons. Finding the closest node in the sorted array then needs O(log n) distance calculations.
If you iterate over all elements, calculate the distance to the target node and remember the closest element, you'd be done with O(n) distance calculations.
Assuming that comparing two nodes means comparing both lat and lon values and thus costs 2 operations, and further assuming that calculating the distance between two nodes costs 3 operations, you end up with
roughly 2n log n + 3 log n operations for sort plus binary search and 3n for the naive search.
So for a single lookup the naive search is cheaper; sorting only pays off when you run many lookups against the same sorted array, because each further lookup then costs about 3 log n instead of 3n.
Depending on the distribution of your nodes it could be even faster to sort the list into buckets. During filling the buckets you could also throw away all nodes that would go in a bucket that could never hold the closest node. I can explain this in more detail if you want.

Display word count / tag cloud in proportion

This is weird, so be patient while I try to explain.
Basic problem: I have a massive string -- it can be of varying lengths depending on the user. My job is to acquire this massive string depending on the user, then send it off to the other piece of software to make a tag cloud. If life were easy for me, I could simply send the whole thing. However, the tag cloud software will only accept a string that is 1000 words long, so I need to do some work on my string to send the most important words.
My first thought was to count each occurrence of the words, and throw all this into an array with each word's count, then sort.
array(517) (
"We" => integer 4
"Five" => integer 1
"Ten's" => integer 1
"best" => integer 2
"climbing" => integer 3
(etc...)
From here, I create a new string and spit out each word times its count. Once the total string hits 1000 words long, I stop. This creates a problem.
Let's say the word "apple" shows up 900 times, and the word "cat" shows up 100 times. The resulting word cloud would consist of only two words.
My idea is to somehow spit out the words at some sort of ratio to the other words. My attempts so far have failed on different data sets where the ratio is not great -- especially when there are a lot of words at "1", thus making the GCD very low.
I figure this is a simple math problem I can't get my head around, so I turn to the oracle that is stackoverflow.
thanks in advance.
Count all words, then do this for each word in your array:
floor(count_of_the_word * (1000/number_of_total_words))
This will result in a maximum of 1000 words, and every word's count is scaled down by the same proportion.
So having 50 times cat, 100 times gozilla, 4000 times looser, 4000 times bush and 1000 times george will first result in
array(
cat[50]
gozilla[100]
looser[4000]
bush[4000]
george[1000]
)
after looping and converting the numbers you will get this:
array(
cat[5]
gozilla[10]
looser[437]
bush[437]
george[109]
)
resulting in 998 total words
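The same calculation as a small sketch. The helper name scaleWordCounts is made up, and leaving the counts untouched when they already fit the limit is an assumption added here; the formula itself is the one given above, written with integer division.

```php
<?php
// Hypothetical helper: scale word counts down so their sum fits within $limit.
function scaleWordCounts(array $counts, int $limit = 1000): array
{
    $total = array_sum($counts);
    if ($total <= $limit) {
        return $counts; // already short enough (assumption: never scale up)
    }
    $scaled = [];
    foreach ($counts as $word => $count) {
        // Integer form of floor($count * ($limit / $total)).
        $scaled[$word] = intdiv($count * $limit, $total);
    }
    return $scaled;
}

$counts = ['cat' => 50, 'gozilla' => 100, 'looser' => 4000, 'bush' => 4000, 'george' => 1000];
$scaled = scaleWordCounts($counts);
print_r($scaled);
```

Be aware that any word whose count is below total/1000 scales to 0 and drops out of the cloud, so with many rare words you may want to give each kept word a minimum of 1.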
