So this one has kinda given me a mental run around. I have Rows, Users, and RowActivity models. Each time a user interacts with a row, it generates a single line in the row activity table saying what status the user changed. Each row can have 3-8 activity entries with a single user reference per line. So my question is this: I want to analyze the rows that particular users have interacted with. So I'd like to search for user A and get all the rows they've touched. But the only way to do that is to query the RowActivity table.
So the query would essentially be this:
Row::whereHas('rowActivity')->whereWithin(Collection of RowActivity, user_id = requested-user)->get();
I know that I can go the long way and query the RowActivity::where('user_id', $requestedUser) and then get all the other row activities based on the related row_id of that those results, but I feel there's a clean way that I can't figure out.
Just for clarification, the row activities are used to generate reports about which users changed the status of the row and how long the duration between the changes are. So I need to get all the activities associated with a row as well as all the rows that a user has touched so that I can analyze how long their portion of that interaction took.
If I need to do this as a multilevel query, so be it, I don't have an aversion to that, I just want to make sure my queries are pristine. If I did it as multi-level, it would look something like this:
$ra = RowActivity::where('user_id', $requestedUser)->pluck('row_id');
$rows = Row::with('rowActivity')->find($ra); //get all rows and their associated activities based on the plucked row id from query 1
You can do that within whereHas(). I hope that's what you want
Row::whereHas('rowActivity', function($query) use($requestedUser) {
$query->where('user_id', $requestedUser);
})->get();
Related
i have a piece of code here
$shorturls = ShortUrl::withCount("clicks")->where([["url_id", "=", $id->id], ["user_id", "=", Auth::id()]])
->orderBy("clicks_count", "desc")
->paginate(10);
this query runs in 6000ms (1 million rows of data)
when i comment orderBy it will around 300-500ms
(Shorturl model has many clicks)
i want a way to have a field in my short url named clicks_count to make this query faster.
Since you order by the clicks_count, mysql has to count the clicks for all rows in ShortUrl (1 million) before it can order.
Not just the 10 paginated.
You could:
Make sure that the ShortUrl<->clicks relationsship has correct indexes in the db. By looking at the query I would guess the field in the clicks table that should be indexed would be named "url_id".
Even though it is indexed it could still take some time. So another ide could be to denormalize the count, and then for each click, you increment a field in the short_urls table. That way it should not have to count on read.
If that not helps, then please provide your table structure including indexes for both short_urls and clicks tables.
I'm trying to write a Laravel eloquent statement to do the following.
Query a table and get all the ID's of all the duplicate rows (or ideally all the IDs except the ID of the first instance of the duplicate).
Right now I have the following mysql statement:
select `codes`, count(`codes`) as `occurrences`, `customer_id` from `pizzas`
group by `codes`, `customer_id`
having `occurrences` > 1;
The duplicates are any row that shares a combination of codes and customer_id, example:
codes,customer_id
183665A4,3
183665A4,3
183665A4,3
183665A4,3
183665A4,3
I'm trying to delete all but 1 of those.
This is returning a set of the codes, with their occurrences and their customer_id, as I only want rows that have both.
Currently I think loop through this, and save the ID of the first instance, and then call this again and delete any without that ID. This seems not very fast, as there's about 50 million rows so each query takes forever and we have multiple queries for each duplicate to delete.
// get every order that shares the same code and customer ID
$orders = Order::select('id', 'codes', DB::raw('count(`codes`) as `occurrences`'), 'customer_id')
->groupBy('codes')
->groupBy('customer_id')
->having('occurrences', '>', 1)
->limit(100)
->get();
// loop through those orders
foreach ($orders as $order)
{
// find the first order that matches this duplicate set
$first_order = Order::where('codes', $order->codes)
->where('customer_id', $order->customer_id)
->first();
// delete all but the first
Order::where('codes', $order->codes)
->where('customer_id', $order->customer_id)
->where('id', '!=', $first_order->id)
->delete();
}
There has got to be a more efficient way to track down all rows that share the same code and customer_id, and delete all the duplicates but keep the first instance, right? lol
I'm thinking maybe if I can add a fake column to the results that is an array of every ID, I could at least then remove the first ID and delete the others.
Don't involve PHP
This seems not very fast
The logic in the question is inherently slow because it's lots of queries and for each query there's:
DB<->PHP network roundtrip
PHP ORM logic/overhead
Given the numbers in the question, the whole code needs calling up to 10k times (if there are exactly 2 occurrences for every one of those 2 million duplicate records), for arguments sake let's say there are 1k sets of duplicates, overall that's:
1,000 queries finding duplicates
100,000 queries finding the first record
100,000 delete queries
201,000 queries is a lot and the php overhead makes it an order of magnitude slower (a guess, based on experience).
Do it directly on the DB
Just eliminating php/orm/network (even if it's on the same machine) time would make the process markedly faster, that would involve writing a procedure to mimic the php logic in the question.
But there's a simpler way, the specifics depend on the circumstances. In comments you've said:
The table is 140GB in size
It contains 50 million rows
Approx 2 million are duplicate records
There isn't enough free space to make a copy of the table
Taking these comments at face value the process I suggest is:
Ensure you have a functional DB backup
Before doing anything make sure you have a functional DB backup. If you manage to make a mistake and e.g. drop the table - be sure you can recover without loss of data.
You'll be testing this process on a copy of the database first anyway, right :) ?
Create a table of "ids to keep" and populate it
This is a permutation of removing duplicate with a unique index:
CREATE TABLE ids_to_keep (
id INT PRIMARY KEY,
codes VARCHAR(50) NOT NULL, # use same schema as source table
customer_id INT NOT NULL, # use same schema as source table
UNIQUE KEY derp (codes,customer_id)
);
INSERT IGNORE INTO ids_to_keep
SELECT id, codes, customer_id from pizzas;
Mysql will silently drop the rows conflicting with the unique index, resulting in a table with one id per codes+customer_id tuple.
If you don't have space for this table - make room :). It shouldn't be too large; 140GB and 50M rows means each row is approx 3kb - this temporary table will likely require single-digit % of the original size.
Delete the duplicate records
Before executing any expected-to-be-slow query use EXPLAIN to check if the query will complete in a reasonable amount of time.
To run as a single query:
DELETE FROM
pizzas
WHERE
id NOT IN (SELECT id from ids_to_keep);
If you wish to do things in chunks:
DELETE FROM
pizzas
WHERE
id BETWEEN (0,10000) AND
id NOT IN (SELECT id from ids_to_keep);
Cleanup
Once the table isn't needed any more, get rid of it:
DROP TABLE ids_to_keep;
Make sure this doesn't happen again
To prevent this happening again, add a unique index to the table:
CREATE UNIQUE INDEX ON pizzas(codes, customer_id);
Try this one it will keep only the duplicate and non-duplicate id lastest id:
$deleteDuplicates = DB::table('orders as ord1')
->join('orders as ord2', 'ord1.codes', '<', 'ord2.codes')
->where('ord1.codes', '=', 'ord2.codes') ->delete();
I'm new to sql & php and unsure about how to proceed in this situation:
I created a mysql database with two tables.
One is just a list of users with their data, each having a unique id.
The second one awards certain amounts of points to users, with relevant columns being the user id and the amount of awarded points. This table is supposed to get new entries regularly and there's no limit to how many times a single user can appear in it.
On my php page I now want to display a list of users sorted by their point total.
My first approach was creating a "points_total" column in the user table, intending to run some kind of query that would calculate and update the correct total for each user every time new entries are added to the other table. To retrieve the data I could then use a very simple query and even use sql's sort features.
However, while it's easy to update the total for a specific user with the sum where function, I don't see a way to do that for the whole user table. After all, plain sql doesn't offer the ability to iterate over each row of a table, or am I missing a different way?
I could probably do the update by going over the table in php, but then again, I'm not sure if that is even a good approach in the first place, because in a way storing the point data twice (the total in one table and then the point breakdown with some additional information in a different table) seems redundant.
A different option would be forgoing the extra column, and instead calculating the sums everytime the php page is accessed, then doing the sorting stuff with php. However, I suppose this would be much slower than having the data ready in the database, which could be a problem if the tables have a lot of entries?
I'm a bit lost here so any advice would be appreciated.
To get the total points awarded, you could use a query similar to this:
SELECT
`user_name`,
`user_id`,
SUM(`points`.`points_award`) as `points`,
COUNT(`points`.`points_award`) as `numberOfAwards`
FROM `users`
JOIN `points`
ON `users`.`user_id` = `points`.`user_id`
GROUP BY `users`.`user_id`
ORDER BY `users`.`user_name` // or whatever users column you want.
My situations is this... I have a table of opportunities that is sorted. We have a paid service that will allow people to view the opportunities on the website any time. However we want an unpaid view that will show a random %/# of opportunities, which will always be the same. The opportunities are sorted out by dates; e.g. they will expire and be removed from the list, and a new one should be on the free search. However the only problem is that they will always have to show the same opportunity. (For example, I can't just pick random rows because it will cycle through them if they keep refreshing, and likewise can't just take the ones about to expire or furthest form expiry because people still end up seeing the entire list.
My only solution thus far is to add an extra column to the table to mark that it is open display. Then to count them on display, and if we are missing rows then to randomly select a few more. Below is a mock up...
SELECT count(id) as total FROM opportunities WHERE display_status="open" LIMIT 1000;
...
while(total < requiredNumber) {
UPDATE opportunities SET display_status="open" WHERE display_status="private" ORDER BY random() LIMIT (required-total);
}
Can anyone think of a better way to solve this problem, preferably one that does not leave me adding another column to the table, and possible conflicts if many people load the page at a single time. One final note as well, it can't be a random set number of them (e.g. pick one, skip a few, take the next).
Any thought/comments would be very helpful,
Thanks.
One way to make sure that a user only sees the same set of random rows is to feed the random number generator a seed that is linked to that user (such as their user_id). That means every user gets a random ordering of rows but it's always the same random ordering for each user.
Your code would be something:
SELECT ...
FROM ...
WHERE ...
ORDER BY random(<user id>)
LIMIT <however many>
Note: as Twelfth pointed out, as new rows are created, they will get new order values and may end up in your random selection.
I'm the type that doesn't like to lose information...including what random rows someone got to see. However I do not like the modification of your existing table idea...
Create a second table as randon_rows or something to that extent to save the ID's of the user and the ID's of the random records they got to see. Inner join to the table whenever you need to find those same rows again. You can also put expirey dates and the sort in the table as well, so the user isn't perma stuck with the same 10 rows.
I have to create a like system (the name won't be "like", Facebook owns it).
So I imagined two ways to store these likes in my database and I want to know, which way is the better for a very high-traffic site.
Create table comment_likes with "id", "comment_id", "user_id" cells. In comments table store the "like_count", so I don't need to count it when I need to write it out. But likes are easy to do thing, so people will create a lots of them and if I need to list a specified comment's likes, I need to read the whole comment_likes table and found all the user_ids. This could be millions of rows in the future. If 1000 user will do it in the same time, my system will die.
My second thought was, to store likes in comments table. create a cell named "likes" with a list of user_ids like this: 1#34#21#56#....
So when somebody like/unlike a comment just CONCAT or REPLACE his/her id in this cell with a #. When I need to list specified comment just explode this list at #-s.
I think 2nd could be faster and smarter, but what do you think about this?
The first option is much better, because you have the benefits of a relational setup. For example: What if you want to get the comments from the database userId x has liked? With the first setup this is a fast and simple query. In the second case you would have to use a LIKE, which is much slower and inaccurate. (Imagine the userId is 1, and the likes field in the comments table contains #10 - it would return the comment if you would use LIKE '%1%').
And even for a high traffic site; just using an index on commentId would make this a fast operation.
So go for the first option.
If you really doubt the speed of the first option, you could create a "cache" field in the comments table in which you count the amount of likes, so you don't have to perform a subquery to select the like count.