I have a dataset in DynamoDB whose primary key is the user ID, and the timestamp is one of the data attributes. I want to run a purge query on this table that deletes everything whose timestamp is older than 1 week.
I do not want to eat up all the write units per second. I would ideally want a rate-limited delete operation (in PHP); otherwise, for a dataset that is tens of GBs in size, it will block other writes.
I was wondering whether a global secondary index on timestamp (+ user ID) would help reduce the rows to be scanned. But again, I would not want to thrash the table so hard that other writes start failing.
Can someone provide rate-limited insert/delete example code and references for this in PHP?
You can create a global secondary index:
timestampHash (number, between 1 and 100)
timestamp (number)
Whenever you create/update your timestamp, also set the timestampHash attribute to a random number between 1 and 100. This will distribute the items in your index evenly. You need this hash because to do a range query on a GSI you need a hash key. Querying by user ID and timestamp doesn't make sense, because that would only return one item each time and you would have to loop over all your users (assuming there is one item per user ID).
Then you can run a purger that issues one query per timestampHash value (100 in total), fetching all items with a timestamp older than 1 week. Between runs you can wait 5 minutes, or however long you think is appropriate, depending on the number of items you need to purge.
You can use BatchWriteItem to leverage the API's multithreading to delete concurrently.
In pseudocode it looks like this:
while (true) {
    for (int i = 1; i <= 100; i++) {
        records = dynamo.query(timestampHash = i, timestamp < Date.now() - 1 week);
        dynamo.batchWriteItem(records, DELETE);
    }
    sleep(5 minutes);
}
You can also catch ProvisionedThroughputExceededException and do an exponential back-off, so that if you do exceed the throughput you stop and wait until it recovers.
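A rough PHP sketch of that loop using the AWS SDK for PHP v3; the table name, index name, region and the fixed 30-second back-off are assumptions, and a production purger would also need to handle query pagination and UnprocessedItems:

require 'vendor/autoload.php';

use Aws\DynamoDb\DynamoDbClient;
use Aws\DynamoDb\Exception\DynamoDbException;

$dynamo = new DynamoDbClient(['region' => 'us-east-1', 'version' => 'latest']);
$cutoff = time() - 7 * 24 * 3600; // purge everything older than 1 week

for ($hash = 1; $hash <= 100; $hash++) {
    $result = $dynamo->query([
        'TableName'                 => 'MyTable',
        'IndexName'                 => 'timestampHash-timestamp-index',
        'KeyConditionExpression'    => 'timestampHash = :h AND #ts < :cutoff',
        'ExpressionAttributeNames'  => ['#ts' => 'timestamp'], // "timestamp" is a reserved word
        'ExpressionAttributeValues' => [
            ':h'      => ['N' => (string) $hash],
            ':cutoff' => ['N' => (string) $cutoff],
        ],
        'ProjectionExpression'      => 'userId',
    ]);

    // BatchWriteItem accepts at most 25 delete requests per call.
    foreach (array_chunk($result['Items'], 25) as $chunk) {
        $requests = [];
        foreach ($chunk as $item) {
            $requests[] = ['DeleteRequest' => ['Key' => ['userId' => $item['userId']]]];
        }

        try {
            $dynamo->batchWriteItem(['RequestItems' => ['MyTable' => $requests]]);
        } catch (DynamoDbException $e) {
            if ($e->getAwsErrorCode() === 'ProvisionedThroughputExceededException') {
                sleep(30); // back off and let the table recover
            } else {
                throw $e;
            }
        }

        sleep(1); // crude rate limiting between batches
    }
}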
Another way is to structure your tables by time.
TABLE_08292016
TABLE_09052016
TABLE_09122016
All your data for the week of 08/28/2016 will go into TABLE_08292016. Then at the end of every week you can just drop the table.
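If you go this route, the weekly cleanup is a single API call. A minimal sketch with the AWS SDK for PHP (the table suffix is hard-coded here; compute it from your own naming scheme):

use Aws\DynamoDb\DynamoDbClient;

$dynamo = new DynamoDbClient(['region' => 'us-east-1', 'version' => 'latest']);

// Drop the table whose retention week has passed, e.g. once the week of 09/05 begins.
$dynamo->deleteTable(['TableName' => 'TABLE_08292016']);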
I have an online shop application and a database of around 1000 ITEMS.
ITEM {
    categories / up to 5 out of 60
    types      / up to 2 out of 10
    styles     / up to 2 out of 10
    rating     / 0-5
}
Now I want to create an item-to-item comparison with predefined conditions:
- At least one common category += 25points
- At least one common type += 25p.
- If first item has no styles += 0p.
- If no styles in common -= 10p.
- For each point in rating difference -= 5p.
And store the result in a table, as item_to_item_similarity.score.
I made the whole thing with nice and shiny PHP functions and classes,
and a function to calculate and update all the relations.
In the test with 20 items, all went well.
But when I increased the test data to 1000 items, resulting in 1000x1000 relations,
the server started complaining about script timeouts and running out of memory :)
Indexes, transactions and pre-loading some of the data helped me halfway.
Is there a smarter way to compare and evaluate this type of data?
I was thinking of representing the related categories, styles, etc.
as sets of IDs, possibly in some binary mask, so that they can be easily compared
(even in SQL?) without the need to create classes and loop through arrays millions of times.
I know this isn't the best, but what about the following:
You have your table which links the two items, has a timestamp, and holds their score. This table will hold the 1,000,000 records.
You have a CRON script which runs every 15 minutes.
The first time the cron runs, it creates the 1,000,000 rows; no scores are calculated yet. You can detect this by counting the rows in the table: if the count == 0, it's the first run.
On the second and subsequent runs, it selects 1000 records, calculates their scores and updates the timestamp. It should select the records ordered by timestamp, so that it picks the 1000 oldest ones.
Leave this to run in the background every 15 minutes or so. It will take around 10 days in total to calculate all the scores.
Whenever you update a product, you need to reset the date on the linking table, so that when the cron runs it recalculates the score for all rows that mention that item.
When you create a new product, you must create the linking rows, i.e. add a row for every other item.
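A minimal sketch of the recurring cron step, assuming PDO, a linking table item_to_item_similarity(item_a, item_b, score, updated_at) and an existing calculate_score() function (all names hypothetical):

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

// Pick the 1000 stalest pairs.
$pairs = $pdo->query(
    'SELECT item_a, item_b FROM item_to_item_similarity
     ORDER BY updated_at ASC LIMIT 1000'
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare(
    'UPDATE item_to_item_similarity
     SET score = :score, updated_at = NOW()
     WHERE item_a = :a AND item_b = :b'
);

foreach ($pairs as $pair) {
    // calculate_score() stands in for your existing PHP scoring logic.
    $score = calculate_score($pair['item_a'], $pair['item_b']);
    $update->execute([':score' => $score, ':a' => $pair['item_a'], ':b' => $pair['item_b']]);
}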
Personally, I'd consider using a different method altogether; there are plenty of algorithms out there, you just have to find one that applies to this scenario. Here is one example:
How to find "related items" in PHP
Also, here is the Jaccard index written in PHP, which may be more efficient than your current method:
https://gist.github.com/henriquea/540303
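For reference, the Jaccard index itself is only a few lines of PHP. A sketch (not the linked gist), assuming each item's categories (or types/styles) are plain arrays of IDs:

function jaccard(array $a, array $b): float
{
    $intersection = count(array_intersect($a, $b));
    $union = count(array_unique(array_merge($a, $b)));
    return $union === 0 ? 0.0 : $intersection / $union;
}

echo jaccard([1, 4, 7], [4, 7, 9]); // 0.5 (two shared IDs out of four distinct ones)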
I want to block all requests by IP for, let's say, 10 minutes, if there have been 5 incorrect attempts within any 5-minute interval.
I'm thinking of a way to store this data such that it hurts performance as little as possible.
In particular, how should I design the DB table and store the data?
If I make a fixed table with IP as the primary key, e.g. with MySQL:
ip int(10) unsigned,
attempts int(5),
lastaccess timestamp default current_timestamp,
primary key (ip)
Then the attempts would just keep accumulating, with no notion of the 5-minute window...
On the other hand, if I log every incorrect attempt with a timestamp, e.g.:
ip int(10) unsigned,
lastaccess timestamp default current_timestamp,
primary key (ip,lastaccess)
And then count back over the 5-minute interval, the table could grow very large with all this data and slow the system down... It would also require maintenance.
So, could you advise something more convenient?
I would store the IP address and the timestamps of the last x attempts. Either in a database, a memcached type of store or possibly just a number of flat files, depending on how much traffic you anticipate.
If a database, you can easily query for something like COUNT(timestamps) GROUP BY ip WHERE timestamp [within last 5 minutes] and occasionally clean the database with a simple DELETE WHERE timestamp [over 5 minutes ago]. The cleaning could happen in a cron job or every x requests in a garbage collection kind of system.
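A sketch of the database variant, assuming PDO and a hypothetical login_attempts(ip, attempted_at) table:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Has this IP failed 5 or more times in the last 5 minutes?
$stmt = $pdo->prepare(
    'SELECT COUNT(*) FROM login_attempts
     WHERE ip = :ip AND attempted_at > NOW() - INTERVAL 5 MINUTE'
);
$stmt->execute([':ip' => $_SERVER['REMOTE_ADDR']]);

if ($stmt->fetchColumn() >= 5) {
    http_response_code(429);
    exit('Too many failed attempts, try again later.');
}

// Occasional garbage collection (from cron, or on a small fraction of requests).
$pdo->exec('DELETE FROM login_attempts WHERE attempted_at < NOW() - INTERVAL 10 MINUTE');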
If something like memcached or a flat file, store the timestamps in a FIFO array, i.e. a simple array(123456..., 123456..., ...) which you keep truncating.
Just a suggestion: I would recommend Redis in case you are worried that the table could grow very large with all this data.
Generate a unique ID per user (most probably the IP, but consider users behind the same network: for example, users from one organization will have the same IP on all outgoing requests, even from different machines) and use the STRING data type with the unique ID as the key and a counter (the number of attempts) as the value.
One use case of Strings from the docs:
Use Strings as atomic counters using commands in the INCR family:
INCR, DECR, INCRBY.
Also, the String data type supports expiry, so every key you generate can be given an expiry of 5 minutes, after which it self-destructs. You can just read the counter value to decide whether to block the user or not, and you no longer need to worry about the number of records growing as the day goes on.
Redis keeps all your data in memory, so I think you might get some performance improvement as well.
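A minimal sketch of that counter, assuming the phpredis extension (a Predis client looks almost identical):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key = 'failed_logins:' . $_SERVER['REMOTE_ADDR'];

$attempts = $redis->incr($key);   // atomic counter per IP
if ($attempts === 1) {
    $redis->expire($key, 300);    // the key self-destructs after 5 minutes
}

if ($attempts > 5) {
    http_response_code(429);
    exit('Too many failed attempts, try again later.');
}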
I have a person's username, and he is allowed ten requests per day. Every day the requests go back to 0, and old data is not needed.
What is the best way to do this?
This is the way that comes to mind, but I am not sure if it's the best way
(two fields, today_date, request_count):
Query the DB for the date of last request and request count.
Get result and check if it was today.
If today, check the request count; if it is less than 10, run an update query to increment the count.
If not today, update the DB with today's date and count = 1.
Is there another way with fewer DB queries?
I think your solution is good. It is possible to reset the count on a daily basis too. That will allow you to skip a column, but you do need to run a cron job. If there are many users that won't have any requests at all, it is needless to reset their count each day.
But whichever you pick, both solutions are very similar in performance, data size and development time/complexity.
Use just one column, request_count. Query this column and update it; as far as I know, with stored procedures this may be possible in one single query, and even if not, it will be just two. Then create a cron job that calls a script that resets the column to 0 every day at 00:00.
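One way to get the check and the increment into a single statement without a stored procedure is a conditional UPDATE; a sketch assuming PDO and a users(username, request_count) table:

// Only increments while the user is below the limit.
$stmt = $pdo->prepare(
    'UPDATE users SET request_count = request_count + 1
     WHERE username = :u AND request_count < 10'
);
$stmt->execute([':u' => $username]);

if ($stmt->rowCount() === 0) {
    // Nothing was updated: the user is at the limit (or does not exist).
    exit('Daily limit reached.');
}

// Cron job at 00:00:  UPDATE users SET request_count = 0;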
To spare you some requests to the DB, define:
the maximum number of requests per day allowed.
the first day available to your application (date offset).
Then add a requestcount field to the database per user.
On the first request get the count from the db.
The count is always the day number multiplied by (the maximum requests per day + 1), plus the actual requests made by that user:
day * (max + 1) + n
So if, on the first read for a request, the count from the DB is already at today's maximum, block.
Otherwise, if it is lower than the current day's base, reset it to the current day's base (in the PHP variable),
and count up. Store this value back in the DB.
This is one read operation, and in case the request is still valid, one write operation to the DB per request.
There is no need to run a cron job to clean this up.
That's actually the same as you propose in your question, but the day information is part of the counter value. So you can do more with one value at once, while counting up with +1 per request still works for the block.
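A rough PHP sketch of that counter logic; the offset date is a placeholder, and $storedCount is the single value read from the DB for this user:

$max    = 10;                                      // requests per day allowed
$offset = strtotime('2016-01-01');                 // the application's date offset
$day    = (int) floor((time() - $offset) / 86400); // day number since the offset
$base   = $day * ($max + 1);                       // today's counter base

if ($storedCount >= $base + $max) {
    exit('Daily limit reached.');   // all of today's requests already used
}
if ($storedCount < $base) {
    $storedCount = $base;           // first request today: jump to today's base
}
$storedCount++;                     // count this request
// ...write $storedCount back to the DB (the single write per request)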
You have to take into account that each user may be in a different time zone than your server, so you can't just store the count or use the "day * max" trick. Try to get the user's time offset; then the start of the user's day can be stored in your "quotas" database. In MySQL, that would look like:
`start_day`=ADDTIME(CURDATE()+INTERVAL 0 SECOND,'$offsetClientServer')
Then simply look at this time the next time you check the quota. The quota check can all be done in one query.
I need to create an invoice number in format:
CONSTANT_STRING/%d/mm/yyyy
mm - Month (two digits)
yyyy - Year
Now, %d is the number of the invoice within the given month. In other words, this number is reset every month.
Currently I check in the database what the highest number in the current month is, then increment it and save the row.
I need the whole number to be unique. However, it sometimes happens that it gets duplicated (two users save at the same time).
Any suggestions?
Put a unique index on the field and catch the database error when trying to save the second instance. Also, defer getting the value until the last possible moment.
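A sketch of the catch-the-error part, assuming PDO with ERRMODE_EXCEPTION and a UNIQUE index on invoices.invoice_number (hypothetical names):

$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

try {
    $stmt = $pdo->prepare('INSERT INTO invoices (invoice_number) VALUES (:n)');
    $stmt->execute([':n' => $invoiceNumber]);
} catch (PDOException $e) {
    if ((string) $e->getCode() === '23000') {
        // Duplicate key: someone else saved this number first, so rebuild it and retry.
    } else {
        throw $e;
    }
}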
One solution is SELECT ... FOR UPDATE, which locks the row until you update it, but this can cause deadlocks in a serious multitasking application.
The best way is to fetch the number and increment it in a transaction and then start the work.
This way, the row is not locked for long.
Look into BEGIN WORK and COMMIT.
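What that might look like with PDO and a hypothetical per-month counter table invoice_counters(year, month, last_number), assuming InnoDB and that a row for the current month already exists:

$pdo->beginTransaction();

// Lock the counter row for this month; concurrent requests wait here until COMMIT.
$stmt = $pdo->prepare(
    'SELECT last_number FROM invoice_counters WHERE year = :y AND month = :m FOR UPDATE'
);
$stmt->execute([':y' => date('Y'), ':m' => date('n')]);
$next = (int) $stmt->fetchColumn() + 1;

$pdo->prepare('UPDATE invoice_counters SET last_number = :n WHERE year = :y AND month = :m')
    ->execute([':n' => $next, ':y' => date('Y'), ':m' => date('n')]);

$pdo->commit(); // lock released; the number is reserved for this invoice

$invoiceNumber = sprintf('CONSTANT_STRING/%d/%s/%s', $next, date('m'), date('Y'));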
Use the primary key (preferably an INT) of the invoice table or assign a unique number to each invoice, e.g. via uniqid.
PS: If you are using uniqid, you can increase the uniqueness by setting the more_entropy parameter to true.
Set the ID all in one query:
$query = 'INSERT INTO table (invoice_number)
          VALUES (CONCAT(\''.CONSTANT.'\', \'/\',
                  (SELECT COUNT(*) + 1 AS current_id
                     FROM table
                    WHERE MONTH(entry_date) = \''.date('n').'\'
                      AND YEAR(entry_date) = \''.date('Y').'\'),
                  \'/\', \''.date('m/Y').'\'))';
I have about 10,000 products in the product table. I want to retrieve one of those items and display it in a section of a web page, and it should stay the same for that particular day. Something like "Product of the day".
For example, if today I get product_id 100, then all visitors should see that product today. Tomorrow it may fetch any other random valid primary key, say 1289, and visitors get product 1289 all day tomorrow.
Any ideas/suggestions?
Thanks for your help.
SELECT id
FROM products
ORDER BY
RAND(UNIX_TIMESTAMP(CURRENT_DATE()))
LIMIT 1
Maybe you can store the id of the item of the day in a table in the database?
How about creating a cache file and invalidating it at midnight?
The benefit of this is you don't make unnecessary calls to your DB as you're only checking the timestamp on the cache file - only once per day do you make DB requests to populate a new cache file.
You don't need a CRON job for this:
if (!file_exists($potdCacheFile) || date('Y-m-d', filemtime($potdCacheFile)) !== date('Y-m-d')) {
    file_put_contents($potdCacheFile, generate_from_db()); // regenerate once per day
}
echo file_get_contents($potdCacheFile);
This will mean only the first visitor of the day to your website will trigger the regeneration, and every subsequent visitor will have a fast loading cache file served to them.
The idea is pretty simple:
Set up a table called ProductOfTheDay with a product ID and a date field.
When a user visits the product-of-the-day page, check the date field.
If it is today's date, show the product.
If it is not, randomly pick a new product and save it to the table.
It's not that complex an operation.
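A sketch of that check, assuming PDO and a hypothetical product_of_the_day(day, product_id) table (a UNIQUE key on day guards against two first visitors racing):

$productId = $pdo->query('SELECT product_id FROM product_of_the_day WHERE day = CURDATE()')
                 ->fetchColumn();

if ($productId === false) {
    // Nothing picked for today yet: choose a random product and remember it.
    $productId = $pdo->query('SELECT id FROM products ORDER BY RAND() LIMIT 1')->fetchColumn();
    $pdo->prepare('INSERT INTO product_of_the_day (day, product_id) VALUES (CURDATE(), :id)')
        ->execute([':id' => $productId]);
}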
SELECT id
FROM products
ORDER BY (id + RAND(UNIX_TIMESTAMP(CURRENT_DATE()))) MOD some_reasonable_value
LIMIT 1
You can start random number generators with a seed value.
Make the seed value be the day (21st) + month (10) + year (2009), so today's seed is 2040.
You will get the same random number all day, and tomorrow a different one. This is more how it works in .NET: the random function takes a min and max value (your min and max ID values) plus an optional seed, and returns a number. For the same seed you get the same random number. It's possible that changing the max and min affects the number generated. You would have to look up how PHP works.
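In PHP the same idea works with mt_srand() and mt_rand(); a sketch, assuming product IDs are roughly contiguous between $minId and $maxId:

mt_srand((int) date('Ymd'));                   // e.g. 20091021, the same seed all day
$productOfTheDayId = mt_rand($minId, $maxId);  // identical for every visitor today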
total = SELECT COUNT(id) FROM products;
day_product = SELECT id FROM products WHERE id = (UNIX_TIMESTAMP(CURRENT_DATE()) MOD total) LIMIT 1;
See also this question.