I have an online shop application and a database of around 1000 items:
ITEM {
    categories  // up to 5 out of 60
    types       // up to 2 out of 10
    styles      // up to 2 out of 10
    rating      // 0-5
}
Now I want to create an item-to-item comparison with predefined conditions:
- At least one common category: += 25 points
- At least one common type: += 25 points
- If the first item has no styles: += 0 points
- If no styles in common: -= 10 points
- For each point of rating difference: -= 5 points
And store the result in a table as item_to_item_similarity.score.
Now I made the whole thing with nice and shiny PHP functions and classes,
and a function to calculate and update all the relations.
In the test with 20 items, all went well.
But when I increased the test data to 1000 items, resulting in 1000x1000 relations,
the server started complaining about script timeouts and running out of memory :)
Indexes, transactions and pre-loading some of the data helped me half the way.
Is there a smarter way to compare and evaluate this type of data?
I was thinking of representing the related categories, styles etc.
as sets of IDs, possibly as a binary mask, so that they can be easily compared
(even in SQL?) without the need to create classes and loop through arrays millions of times.
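The bitmask idea works out nicely: "at least one in common" becomes a single bitwise AND instead of nested array loops. A quick Python sketch of the scoring rules, assuming hypothetical per-item bitmask fields (each bit is one category/type/style ID):

```python
def similarity(a, b):
    """Score two items per the rules above; a and b are dicts with
    'categories', 'types', 'styles' bitmasks and a numeric 'rating'."""
    score = 0
    if a["categories"] & b["categories"]:   # at least one common category
        score += 25
    if a["types"] & b["types"]:             # at least one common type
        score += 25
    if a["styles"] == 0:                    # first item has no styles: +0
        pass
    elif not (a["styles"] & b["styles"]):   # no styles in common
        score -= 10
    score -= 5 * abs(a["rating"] - b["rating"])  # rating difference
    return score

item_a = {"categories": 0b00101, "types": 0b01, "styles": 0b10, "rating": 4}
item_b = {"categories": 0b00100, "types": 0b10, "styles": 0b10, "rating": 2}
# common category (+25), no common type, common style (no penalty), |4-2|*5 = -10
print(similarity(item_a, item_b))  # → 15
```

The same AND trick works directly in SQL (e.g. `WHERE a.categories & b.categories <> 0`), avoiding per-pair object construction entirely.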
I know this isn't the best, but what about the following:
You have your table which links the two items, holds a timestamp, and has their score. This table will hold the 1,000,000 records.
You have a CRON script, which runs every 15 mins.
First time the cron runs, it creates the 1,000,000 rows; no scores are calculated yet. This can be detected by counting rows in the table: if count == 0 then it's the first run.
On the second and subsequent runs, it selects 1000 records ordered by timestamp (so that it always picks the 1000 oldest), calculates their scores and updates their timestamps.
Leave this to run in the background every 15 mins or so. It will take about 10 days in total to calculate all the scores.
Whenever you update a product, you need to reset the timestamp on the linking table, so that when the cron runs it recalculates the score for all rows that mention that item.
When you create a new product, you must create the linking rows: one row pairing it with each other item.
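The batching scheme above can be sketched with an in-memory SQLite table (the table name follows the question's item_to_item_similarity; the column names and the tiny batch size are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE item_to_item_similarity
                (item_a INTEGER, item_b INTEGER, score INTEGER,
                 updated_at INTEGER)""")

# First run: seed the pair rows with no score and an "ancient" timestamp.
items = range(1, 6)
pairs = [(a, b, None, 0) for a in items for b in items if a != b]
conn.executemany("INSERT INTO item_to_item_similarity VALUES (?,?,?,?)", pairs)

def compute_score(a, b):
    return 25  # placeholder for the real comparison logic

def run_batch(now, batch_size=4):
    """One cron tick: recompute the batch_size oldest rows and bump their
    timestamps so they rotate to the back of the queue."""
    rows = conn.execute("""SELECT rowid, item_a, item_b
                           FROM item_to_item_similarity
                           ORDER BY updated_at LIMIT ?""",
                        (batch_size,)).fetchall()
    for rowid, a, b in rows:
        conn.execute("""UPDATE item_to_item_similarity
                        SET score = ?, updated_at = ?
                        WHERE rowid = ?""", (compute_score(a, b), now, rowid))

run_batch(now=1)
done = conn.execute("""SELECT COUNT(*) FROM item_to_item_similarity
                       WHERE score IS NOT NULL""").fetchone()[0]
print(done)  # → 4
```

Resetting a row's `updated_at` to 0 (as when a product changes) naturally pushes it to the front of the next batch.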
Personally, I'd consider using a different method altogether; there are plenty of algorithms out there, you just have to find one which applies to this scenario. Here is one example:
How to find "related items" in PHP
Also, here is the Jaccard index written in PHP, which may be more efficient than your current method:
https://gist.github.com/henriquea/540303
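For reference, the Jaccard index itself is just |A ∩ B| / |A ∪ B|, a few lines in any language. A Python sketch over category-ID sets:

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0          # convention: two empty sets are 0 (or 1, your call)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # → 0.5
```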
I have a dataset in DynamoDB whose primary key is user ID, and timestamp is one of the data attributes. I want to run a purge query on this table, deleting everything whose timestamp is older than 1 week.
I do not want to eat up all my write capacity units per second. I would ideally want a rate-limited delete operation (in PHP); otherwise, for a dataset that's 10s of GBs in size, it will stop other writes.
I was wondering whether a global secondary index on timestamp (+ user ID) would help reduce the rows to be scanned. But again, I'd not want to thrash the table such that other writes start failing.
Can someone provide rate-limiting insert/delete example code and references for this in PHP?
You can create a global secondary index:
timestampHash (number, between 1 and 100)
timestamp (number)
Whenever you create/update your timestamp, also set the timestampHash attribute as a random number between 1 to 100. This will distribute the items in your index evenly. You need this hash because to do a range query on a GSI, you need a hash. Querying by user id and timestamp doesn't seem to make sense because that will only return one item every time and you will have to loop over all your users (assuming there is one item per user id).
Then you can run a purger that, for each of the 100 timestampHash values, queries for all items with a timestamp older than 1 week. Between each run you can wait 5 minutes, or however long you think is appropriate, depending on the number of items you need to purge.
You can use BatchWriteItem to leverage the API's multithreading to delete concurrently.
In pseudocode it looks like this:
while (true) {
    for (int i = 1; i <= 100; i++) {
        records = dynamo.query(timestampHash = i, timestamp < Date.now() - 1 week);
        dynamo.batchWriteItem(records, DELETE);
    }
    sleep(5 minutes);
}
You can also catch ProvisionedThroughputExceededException and do an exponential back-off, so that if you do exceed the throughput you will stop and wait until your throughput recovers.
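The back-off loop can be sketched generically; the exception type and the delete function here are stand-ins for illustration, not real AWS SDK names:

```python
import time

def delete_with_backoff(batch, delete_fn, max_retries=5, base_delay=0.1):
    """Try delete_fn(batch); on a throughput error, wait 2^attempt * base
    seconds before retrying (exponential back-off)."""
    for attempt in range(max_retries):
        try:
            return delete_fn(batch)
        except RuntimeError:  # stand-in for ProvisionedThroughputExceededException
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("gave up after %d retries" % max_retries)

calls = []
def flaky_delete(batch):
    """Simulated endpoint: fails twice with a throughput error, then succeeds."""
    calls.append(batch)
    if len(calls) < 3:
        raise RuntimeError("throughput exceeded")
    return "ok"

print(delete_with_backoff([1, 2, 3], flaky_delete))  # → ok
```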
Another way is to structure your tables by time:
TABLE_08292016
TABLE_09052016
TABLE_09122016
All your data for the week of 08/28/2016 will go into TABLE_08292016. Then at the end of every week you can just drop the table.
I'm in the process of revising a PHP page that displays all of our items with various statistics regarding each one. We're looking at a period running from the first of the month one year ago up to yesterday. I've managed to get a basic version of the script working, but it performs poorly. The initial implementation of this script (not my revision) retrieved all sales records and item information at once, then used the resulting records to create objects (not mysql_fetch_objects). These were then stored in an array that used hard-coded values to access the object's attributes. The way this is all set up is fairly confusing and doesn't easily lend itself to reusability. It is, however, significantly faster than my implementation since it only calls the database once or twice.
My implementation utilizes three calls. The first obtains basic report information needed to create DateTime objects for the report's range (spanning the first of the month twelve months ago up to yesterday's date). This is, however, all it's used for and I really don't think I even need to make this call. The second retrieves all basic information for items included in the report. This comes out to 854 records. The last select statement retrieves all the sales information for these items, and last I checked, this returned over 6000 records.
What I tried to do was select only the records pertinent to the current item in the loop, represented by the following.
foreach($allItems as $item){
    // Display properties of item here
    // db_controller is a special class used to handle DB manipulation
    $query = "SELECT * FROM sales_records WHERE ...";
    $result = $db_controller->select($query);
    // Use sales records to calculate report values
}
This is the problem. Calling the database for each and every item is extremely time-consuming and drastically impacts performance. What's returned is simply the sums of quantities sold in each month within the timeframe specified earlier in the script, along with the resulting sales amounts. At maximum, each item will only have 13 sales records, ranging from 2015/1 to 2016/1 for example.
However, I'm not sure if performing a single fetch for all these sales records before the loop will help performance, the reason being that I would then have to search through the result array for the first instance of a sales record pertaining to the current item.
What can I do to alleviate this issue? Since this is a script important to the company's operations, I want to be sure that my script performs just as quickly as the old script or, at the very least, only slightly slower. My results are accurate but just painfully slow.
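One common fix for exactly this situation (a sketch with made-up field names, not the actual schema): fetch all sales rows in a single query, then bucket them by item ID in a hash map, so the per-item lookup inside the loop is O(1) instead of a query or a linear scan:

```python
from collections import defaultdict

# Pretend this came from one SELECT covering the whole report range.
sales_rows = [
    {"item_id": 7, "month": "2015-01", "qty": 3},
    {"item_id": 7, "month": "2015-02", "qty": 1},
    {"item_id": 9, "month": "2015-01", "qty": 5},
]

# Bucket once, up front.
sales_by_item = defaultdict(list)
for row in sales_rows:
    sales_by_item[row["item_id"]].append(row)

# Inside the per-item loop, no query and no scan needed:
total = sum(r["qty"] for r in sales_by_item[7])
print(total)  # → 4
```

In PHP the equivalent is grouping the fetched rows into an associative array keyed by item ID before the `foreach`.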
I've got a table with 1000 recipes in it, each recipe has calories, protein, carbs and fat values associated with it.
I need to figure out an algorithm in PHP that will allow me to specify value ranges for calories, protein, carbs and fat as well as dictating the number of recipes in each permutation. Something like:
getPermutations($recipes, $lowCal, $highCal, $lowProt, $highProt, $lowCarb, $highCarb, $lowFat, $highFat, $countRecipes)
The end goal is allowing a user to input their calorie/protein/carb/fat goals for the day (as a range, 1500-1600 calories for example), as well as how many meals they would like to eat (count of recipes in each set) and returning all the different meal combinations that fit their goals.
I've tried this previously by populating a table with every possible combination (see: Best way to create Combination of records (Order does not matter, no repetition allowed) in mySQL tables ) and querying it with the range limits, however that proved not to be efficient as I end up with billions of records to scan through and it takes an indefinite amount of time.
I've found some permutation algorithms that are close to what I need, but don't have the value-range constraint for calories/protein/carbs/fat that I'm looking for (see: Create fixed length non-repeating permutation of larger set). I'm at a loss at this point when it comes to this type of logic/math, so any help is MUCH appreciated.
Based on some comment clarification, I can suggest one way to go about it. Specifically, this is my "try the simplest thing that could possibly work" approach to a problem that is potentially quite tricky.
First, the tricky part is that the sum of all meals has to be in a certain range, but SQL does not have a built-in feature that I'm aware of that does specifically what you want in one pass; that's ok, though, as we can just implement this functionality in PHP instead.
So let's say you request 5 meals that will total 2000 calories; we leave the other variables aside for simplicity, but they will work the same way. We then calculate that the 'average' meal is 2000/5 = 400 calories, but obviously any one meal could be over or under that amount. I'm no dietician, but I assume you'll want no meal that takes up more than 1.25x-2x the average meal size, so we can restrict our initial query to this amount.
$maxCalPerMeal = ($highCal / $countRecipes) * 1.5;
$mealPlanCaloriesRemaining = $highCal; # more on this one in a minute
We then request 1 random meal which is less than $maxCalPerMeal, and 'save' it as our first meal. We then subtract its actual calorie count from $mealPlanCaloriesRemaining. We now recalculate:
$maxCalPerMeal = ($highCal / $countRecipesRemaining) * 1.5; # 1.5 being a maximum deviation-from-average multiple
Now the next query will ask for a random meal that is less than both $maxCalPerMeal AND $mealPlanCaloriesRemaining, AND NOT one of the meals you already have saved in this particular meal plan option (thus ensuring unique meals - no mac'n'cheese for breakfast, lunch, and dinner!). We update the variables as before, and repeat until you reach the end. For the last meal requested we don't care about the average and its associated multiple; thanks to the compound query you'll get what you want anyway and don't need to complicate your control loops.
Assuming the worst case with the 5 meal 2000 calorie max diet:
Meal 1: 600 calories
Meal 2: 437
Meal 3: 381
Meal 4: 301
Meal 5: 281
Or something like that; in most cases you'll get something a bit nicer and more random, but in the worst case it still works! Now this actually just plain works for the usual case. Adding more maximums, like for fat and protein, is easy, so let's deal with the lows next.
All we need to do to support "minimum calories per day" is add another set of averages, as such:
$minCalPerMeal = ($lowCal / $countRecipes) * 0.5; # this time our multiplier is less than one; as we allow meals to be bigger than average, we must allow them to be smaller as well
And you restrict the query to being greater than this calculated minimum, recalculating with each loop, and happiness naturally ensues.
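A Python sketch of the loop described above (the recipe data and the 1.5/0.5 multipliers are illustrative; the pool is passed pre-ordered here for a deterministic demo, where real use would shuffle it first as the ORDER BY RAND() stand-in):

```python
def pick_meals(pool, low_cal, high_cal, count):
    """Greedily fill `count` slots from pool, keeping each meal within the
    per-slot min/max bounds and the running calorie budget."""
    chosen, rem_low, rem_high = [], low_cal, high_cal
    for slots_left in range(count, 0, -1):
        if slots_left == 1:                 # last meal: must fit the remainder
            lo, hi = max(rem_low, 0), rem_high
        else:
            lo = (rem_low / slots_left) * 0.5
            hi = min((rem_high / slots_left) * 1.5, rem_high)
        pick = next((r for r in pool
                     if r not in chosen and lo <= r["cal"] <= hi), None)
        if pick is None:
            return None                     # degenerate case: caller retries
        chosen.append(pick)
        rem_low -= pick["cal"]
        rem_high -= pick["cal"]
    return chosen

recipes = [{"name": "r%d" % i, "cal": c}
           for i, c in enumerate([250, 300, 350, 400, 450, 500, 600])]
plan = pick_meals(recipes, low_cal=1500, high_cal=2000, count=5)
print(sum(m["cal"] for m in plan))  # → 1750
```

Returning `None` on an unfillable last slot is the degenerate case discussed below; a small retry loop around `pick_meals` with a fresh shuffle covers it.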
Finally we must deal with the degenerate case: what if, using this method, you end up needing a meal that is too small or too big to fill the last slot? Well, you can handle this in a number of ways. Here's what I'd recommend.
The easiest is just returning less than the desired amount of meals, but this might be unacceptable. You could also have special low calorie meals that, due to the minimum average dietary content, would only be likely to be returned if someone really had to squeeze in a light meal to make the plan work. I rather like this solution.
The second easiest is to throw out the meal plan you have so far and regenerate from scratch; it might work this time, or it might not, so you'll need a control loop to make sure you don't get into an infinite work-intensive loop.
The least easy requires a max-iteration control loop again, but here you use a specific strategy to try to get a more acceptable meal plan. In this you take the optional meal with the highest value that exceeds your dietary limits and throw it out, then try pulling a smaller meal, perhaps one that is no greater than the newly calculated average. It might make the plan as a whole work, or you might go over value elsewhere, forcing you back into a loop that could be unresolvable; or it might just take a few dozen iterations to get one that works.
Though this sounds like a lot when writing it out, even a very slow computer should be able to churn out hundreds of thousands of suggested meal plans every few seconds without pausing. Your database will be under very little strain even if you have millions of recipes to choose from, and the meal plans you return will be as random as it gets. It would also be easy to make certain multiple suggested meal plans are not duplicates with a simple comparison and another call or two for an extra meal plan to be generated - without fear of noticeable delay!
By breaking things down to small steps with minimal mathematical overhead a daunting task becomes manageable - and you don't even need a degree in mathematics to figure it out :)
(As an aside, I think you have a very nice website built there, so no worries!)
In PHP - how do I display 5 results from a possible 50 randomly, but ensure all results are displayed an equal amount?
For example table has 50 entries.
I wish to show 5 of these randomly with every page load but also need to ensure all results are displayed rotationally an equal number of times.
I've spent hours googling for this but can't work it out - would very much like your help please.
Please scroll down to "biased randomness" if you don't want to read it all.
In MySQL you can just use SELECT * FROM table ORDER BY RAND() LIMIT 5.
What you want just does not work. It's logically contradictory.
You have to understand that complete randomness, by definition, means equal distribution only after an infinite period of time.
The longer the interval of selection, the more even the distribution.
If you MUST have an even distribution of selection within, for example, every 24h interval, you cannot use a random algorithm. It is by definition contradictory.
It really depends on what your goal is.
You could, for example, take some element at random and then lower the probability of the same element being re-chosen on the next run. This way you have a heuristic that gives you a more even distribution after a shorter amount of time. But it's not random. Well, certain parts are.
You could also randomly select from your database, mark the elements as selected, and now select only from those not yet selected. When no element is left, reset all.
Very trivial but might do your job.
You can also do something like that with timestamps to make the distribution a bit more elegant.
This could probably look like ORDER BY RAND()*((timestamp-min(timestamp))/(max(timestamp)-min(timestamp))) DESC or something like that. Basically you could normalize the timestamp of selection of an entry using the time-interval window, so it becomes something between 0 and 1, and then multiply it by rand(); then fresh stuff is 50% less likely to be selected, with 50% randomness... I'm not sure about the formula above, I just typed it down; it's probably wrong, but the principle works.
I think what you want is generally referred to as "biased randomness". There are a lot of papers on that and some articles on SO, for example here:
Biased random in SQL?
Copy the 50 results to some temporary place (file, database, whatever you use). Then every time you need random values, select 5 random values from the 50 and delete them from your temporary data set.
Once your temporary data set is empty, create a new one copying the original again.
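That copy-and-exhaust scheme can be sketched like so: over ten page loads of 5, every one of the 50 entries appears exactly once before any repeats:

```python
import random

entries = list(range(50))   # stand-in for the 50 table rows
unseen = []                 # the temporary "not yet shown" pool

def pick_five():
    """Return 5 entries, refilling and reshuffling the pool when exhausted."""
    global unseen
    if len(unseen) < 5:     # pool exhausted: start a fresh cycle
        unseen = entries[:]
        random.shuffle(unseen)
    picked, unseen = unseen[:5], unseen[5:]
    return picked

shown = [e for _ in range(10) for e in pick_five()]
print(sorted(shown) == entries)  # → True
```

Note the `< 5` check quietly discards leftovers when the pool size isn't a multiple of 5; carrying them into the next cycle is an easy variation.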
I'd like to populate the homepage of my user-submitted-illustrations site with the "hottest" illustrations uploaded.
Here are the measures I have available:
- How many people have favourited that illustration (the votes table includes the date voted)
- When the illustration was uploaded (the illustration table has a date created)
- Number of comments, not so good as comments max out at about 10 total at the moment (the comments table has the comment date)
I have searched around, but I don't want user authority to play a part, and most algorithms include that.
I also need to find out if it's better to do the calculation in the MySQL that fetches the data or if there should be a PHP/cron method every hour or so.
I only need 20 illustrations to populate the home page. I don't need any sort of paging for this data.
How do I weight age against votes? Surely a site with fewer submissions needs less weight on date added?
Many sites that use some type of popularity ranking do so by using a standard algorithm to determine a score and then decaying eternally over time. What I've found works better for sites with less traffic is a multiplier that gives a bonus to new content/activity - it's essentially the same, but the score stops changing after a period of time of your choosing.
For instance, here's a pseudo-example of something you might want to try. Of course, you'll want to adjust how much weight you're attributing to each category based on your own experience with your site. Comments are rare, but take more effort from the user than a favorite/vote, so they probably should receive more weight.
score = (votes / 10) + comments
age = UNIX_TIMESTAMP() - UNIX_TIMESTAMP(date_created)
if(age < 86400) score = score * 1.5
This type of approach would give a bonus to new content uploaded in the past day. If you wanted to approach this in a similar way only for content that had been favorited or commented on recently, you could just add some WHERE constraints on your query that grabs the score out from the DB.
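The pseudo-example above, as a runnable sketch (the weights are the ones suggested; 86400 seconds is one day):

```python
import time

def hotness(votes, comments, created_ts, now=None):
    """Comments weigh 10x a vote; content under a day old gets a 1.5x bonus."""
    score = votes / 10 + comments
    age = (now if now is not None else time.time()) - created_ts
    if age < 86400:
        score *= 1.5
    return score

now = 1_000_000
print(hotness(votes=40, comments=2, created_ts=now - 3600, now=now))    # → 9.0
print(hotness(votes=40, comments=2, created_ts=now - 200000, now=now))  # → 6.0
```

Because the score stops changing after the first day, it can be computed once by a cron job and stored in an indexed column, exactly as suggested below.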
There are actually two big reasons NOT to calculate this ranking on the fly.
Requiring your DB to fetch all of that data and do a calculation on every page load just to reorder items results in an expensive query.
Probably a smaller gotcha, but if you have a relatively small amount of activity on the site, small changes in the ranking can cause content to move pretty drastically.
That leaves you with either caching the results periodically or setting up a cron job to update a new database column holding this score you're ranking by.
Obviously there is some subjectivity in this - there's no one "correct" algorithm for determining the proper balance - but I'd start out with something like votes per unit age. MySQL can do basic math so you can ask it to sort by the quotient of votes over time; however, for performance reasons, it might be a good idea to cache the result of the query. Maybe something like
SELECT images.url FROM images ORDER BY (SELECT COUNT(*) FROM votes WHERE votes.image_id = images.id) / TIMESTAMPDIFF(SECOND, images.date, NOW()) DESC LIMIT 20
but my SQL is rusty ;-)
Taking a simple average will, of course, bias in favor of new images showing up on the front page. If you want to remove that bias, you could, say, count only those votes that occurred within a certain time limit after the image being posted. For images that are more recent than that time limit, you'd have to normalize by multiplying the number of votes by the time limit then dividing by the age of the image. Or alternatively, you could give the votes a continuously varying weight, something like exp(-time(vote) + time(image)). And so on and so on... depending on how particular you are about what this algorithm will do, it could take some experimentation to figure out what formula gives the best results.
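The continuously varying weight exp(-time(vote) + time(image)) mentioned above can be sketched like this; each vote counts for less the longer after upload it arrives (the one-day tau is an illustrative choice controlling how fast the weight falls off):

```python
import math

def weighted_votes(image_ts, vote_timestamps, tau=86400.0):
    """Sum of exponentially decayed vote weights: a vote cast at upload
    time counts 1.0, one cast tau seconds later counts exp(-1), etc."""
    return sum(math.exp(-(t - image_ts) / tau) for t in vote_timestamps)

image_ts = 0
votes = [0, 86400, 2 * 86400]   # at upload, one day later, two days later
w = weighted_votes(image_ts, votes)
print(round(w, 3))  # → 1.503
```

Ranking by this weighted sum removes the need for a hard cutoff window, since old votes fade smoothly instead of dropping out all at once.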
I've no useful ideas as far as the actual algorithm is concerned, but in terms of implementation I'd suggest caching the result somewhere, with a periodic update; if the computation results in an expensive query, you probably don't want to slow your response times.
Something like:
(count favorited + k) / (time since last activity)
The higher k is, the less weight the number of people having favorited it has.
You could also change the time to something like the time it first appeared plus the time of the last activity; this would ensure that older illustrations vanish over time.