We have a table in our database that tracks user behavior: basically, every page a user views is recorded.
The table has the following columns:
id | user_id | user_ip | page | created_on
When a user visits the site from a PC and opens, say, a specific article, the system saves "/article/specific/slug" under "page". However, if the user opens the same page from the mobile version of the website, it saves "http://m.website.com/article/specific/slug" instead.
We are looking to change this.
We have added a new field to the database as an enum (pc, m), and from now on we want to always save "/article/specific/slug" under "page", regardless of device.
One issue is that we have 30 million existing records that need to be converted.
That means a query that checks whether "http://m.website.com" is present, updates the field by removing the "http://m.website.com" prefix, and sets the "device" field to "m".
Can someone please help with this?
Query:
update visits_table
set
page=replace(page, 'http://m.website.com', ''),
device='m'
where
page like 'http://m.website.com%';
To go through 30 mil rows you'll have to... go through 30 mil rows. So you'll either run the above query as-is:
- when the site is down for maintenance
- when the site has low traffic (e.g. early morning hours)
- whenever you want, but expect some stress on the MySQL server until it's done (it may take a while)
Otherwise, if your ids are incremental, you can update in batches by splitting the query into many queries, e.g.:
update ... where ... and id between 1 and 1000000;
update ... where ... and id between 1000001 and 2000000;
update ... where ... and id between 2000001 and 3000000;
...
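If you'd rather not paste dozens of those statements by hand, a small script can generate the batches. A minimal sketch, assuming a PDO connection in $pdo (the batch size is an arbitrary choice):

$batchSize = 1000000;
$maxId = (int) $pdo->query("SELECT MAX(id) FROM visits_table")->fetchColumn();

$stmt = $pdo->prepare(
    "UPDATE visits_table
     SET page = REPLACE(page, 'http://m.website.com', ''),
         device = 'm'
     WHERE page LIKE 'http://m.website.com%'
       AND id BETWEEN :start AND :end"
);

for ($start = 1; $start <= $maxId; $start += $batchSize) {
    $stmt->execute(array(':start' => $start, ':end' => $start + $batchSize - 1));
    // optionally sleep() here to give other queries room to breathe
}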
Let's say that I have 3 different headlines for an article:
"Man Bites Dog"
"This Man Unhinged His Jaw as He Approached A Dog, What Happens Next Will Shock You!"
"Only 90's Kids Will Remember That Time a Man Bit a Dog"
I want to use PHP to randomly display one of these three headlines based on the current user (so they're not getting new headlines each time they refresh), then record the number of clicks for each version of the headline via SQL where I get something similar to:
USER | HEADLINE | CLICK?
1 | 1 | No
2 | 3 | Yes
3 | 2 | Yes
4 | 3 | No
5 | 2 | Yes
6 | 1 | No
Specifically, I'd like advice about:
- Retrieving some sort of variable that's unique to the user (IP address, maybe?)
- Randomly assigning a number (1-3, in the example) based on that unique user variable.
- Displaying different text based on the assigned number.
I can figure out the SQL stuff once I figure this part out. I appreciate any advice you can provide.
You have three problems here:
How to identify a user consistently
How to count user clicks (actions)
How to get the resulting statistics
I assume that showing different subjects on the same page is not a problem here.
Problem 1
Basically you can use an IP address, but it is not a constant identifier for a user. For example, if a user is on a mobile phone and walking around, he can switch between towers or lose the connection and re-establish it with a different IP.
There are many ways to identify a user on the web, but no way to identify a user with 100% certainty without authorization (an active action performed by the user).
For example, you can set a cookie for the user containing a generated ID. When you have set the cookie and the user comes back, you will know who it is and can do what you need.
On the topic of user uniqueness, you can also read this article: browser uniqueness.
Also, if you use a cookie, you can easily store the subject id there for your task. If you won't use cookies, I recommend MongoDB for this kind of task (many objects with small amounts of data that must be read from and inserted into the db very fast, with no updates in your case).
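A minimal sketch of the cookie approach (the cookie name and one-year lifetime are my own arbitrary choices):

if (isset($_COOKIE['visitor_id'])) {
    $visitorId = $_COOKIE['visitor_id'];
} else {
    // random_bytes() needs PHP 7+; use uniqid('', true) on older versions
    $visitorId = bin2hex(random_bytes(16));
    // setcookie() must be called before any output is sent
    setcookie('visitor_id', $visitorId, time() + 365 * 24 * 3600, '/');
}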
Problem 2
You showed a table with 3 fields: ID, the title used, and whether the title was clicked.
With this kind of table you will lose all non-unique clicks (when a user clicks a subject twice, comes back tomorrow, or refreshes the target page multiple times).
I suggest the following kind of table instead:
ID - some unique id; an auto increment field works well here
Date - the measurement period (daily, hourly, or something like that)
SubjectID - id of the subject that was shown
UniqueClicks - count of distinct users who clicked the subject
Clicks - total count of clicks on the subject
This way you will have data aggregated per time period and can easily display it in an admin panel.
But we still have the problem of collecting this data, and the solution depends on the number of users. If there are more than 1000 clicks per minute, I think you need some kind of logging system. For example, you can append all data to a file named 'clickLog-' . date('Ymd_H') . '.log' in some fixed format, for example:
clientId;SubjectId;
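Appending a click to that hourly file from PHP could look like this (a sketch; $visitorId and $subjectId are assumed to come from the surrounding code):

$line = $visitorId . ';' . $subjectId . ';' . PHP_EOL;
file_put_contents('clickLog-' . date('Ymd_H') . '.log', $line, FILE_APPEND | LOCK_EX);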
When the hour ends you can aggregate this data with a shell script or your own code and put it into the db:
cat clickLog-20160907_12.log | sort -u | awk -F';' '{print $2}' | sort | uniq -c
After this command you will have 2 columns of data: the first is the count of unique clicks and the second is the subject id.
By modifying this script you can get total clicks: just remove the sort -u step.
Also, if you have several subject ids, you can handle them with a for loop.
For example, a bash script for unique clicks could look like this:
for i in subj1 subj2 subj3; do
    # log lines end with ";", so match ";<subject>;" at end of line
    uniqClicks=$(cat clickLog-20160907_12.log |
        grep ';'$i';$' |
        sort -u |
        wc -l);
    clicks=$(cat clickLog-20160907_12.log |
        grep ';'$i';$' |
        wc -l);
    # save data here
done
After these manipulations you will have aggregated data ready for calculations, plus the source data for any future processing (if needed).
Your db will also stay small and fast, with all the source data stored in files.
Problem 3
If you implement the solution from the Problem 2 section, all the queries for getting statistics will be so simple that your database will execute them very fast.
For example you can run this query in PostgreSQL:
SELECT
SubjectId,
sum(uniqueClicks) AS uniqueClicks,
sum(clicks) AS clicks
FROM
statistic_table
WHERE
Date BETWEEN '2016-09-01 00:00:00' and '2016-09-08 00:00:00'
GROUP BY
SubjectId
ORDER BY
sum(uniqueClicks) DESC
In this case, with 3 subject ids and hourly aggregation, you will get 504 new rows per week (3 subjects * 24 hours * 7 days), which is a really small amount of data for a database.
Alternatives
You can also use Google Analytics for all the calculations. But in this case you need to do some other steps, most of which are configuration steps needed to enable the Google Analytics monitoring scripts on your site. Once you have that, you can easily configure goal support and simply attach the subject id as additional data using the GA script API.
You can use the user's IP or, if the user is registered on the site, the user id.
For the second part you can use PHP's mt_rand() function:
mt_rand(min, max) -> if you want a number between 1 and 3, use mt_rand(1, 3);
Then use an array to store the three different headlines and use the randomly generated number to access the array.
Better yet, generate a number between 0 and 2, because arrays start at 0.
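Putting that together, a minimal sketch ($visitorId stands for whatever unique user value you settle on; deriving the index from it with crc32() instead of mt_rand() is my own suggestion to keep the headline stable across refreshes, as the question requires):

$headlines = array(
    "Man Bites Dog",
    "This Man Unhinged His Jaw as He Approached A Dog, What Happens Next Will Shock You!",
    "Only 90's Kids Will Remember That Time a Man Bit a Dog"
);

// mt_rand(0, 2) gives a fresh pick on every request; hashing the
// user's unique value gives the same pick for the same user every time
$index = abs(crc32($visitorId)) % 3;

echo $headlines[$index];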
I am curious what path I should take to accomplish the following. I want multiple computers at one location to be able to view and make changes to data inside a MySQL DB with a web browser. I don't have extensive knowledge in this area, but from what I remember this was very difficult if not impossible.
Example: Let's say I have a record for John and I want 2 computers to be able to edit John's record. Please note that the computers will not be editing the same portion of John's record: let's say one computer is changing a status from "needs to be called" to "called" and the other computer is changing a status from "needs to be ordered" to "ordered".
I want a solution that could natively handle this.
My current knowledge is building web interfaces with PHP and SQL. I would like to use these languages as I have some prior knowledge.
So my question: Is this possible? If so, exactly how would it work (flow of info)?
There are several ways you can accomplish this. There are already some great PHP database editing packages out there (phpMyAdmin, for example).
To handle this in code, though, you can use transactions (depending on what flavor of SQL you're using, this is done differently).
One of the easier ways to ensure that people's changes don't clash with one another is simply to add an additional WHERE clause to your statement.
Let's say you have a user record whose ID is 4, and you want to update the last name from Smith to Bill.
Instead of writing
UPDATE users SET lastName='Bill' WHERE id='4'
You would add in:
UPDATE users SET lastName='Bill' WHERE id='4' AND lastName='Smith'
That way, if someone else updates the last name field while you're working on it, your query will match no rows; you can detect that, have the user re-enter the data, and thus fake a transaction.
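A sketch of that check, assuming a mysqli connection in $db and the users table from the example:

$stmt = $db->prepare("UPDATE users SET lastName = ? WHERE id = ? AND lastName = ?");
$stmt->bind_param('sis', $newLastName, $id, $oldLastName);
$stmt->execute();

if ($stmt->affected_rows === 0) {
    // someone else changed the row first: reload it and ask the
    // user to re-enter their edit
}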
Use Transactions. Updating a single record at the exact same time isn't really supported, but applying one transaction followed immediately by another certainly is. This is native to MySQL.
START TRANSACTION;
SELECT @A:=SUM(salary) FROM table1 WHERE type=1;
UPDATE table2 SET summary=@A WHERE type=1;
COMMIT;
One other option is the old desktop approach, which is to control the flow of modifications almost manually. I will show how:
Say you have a client table with the fields id, firstname, lastname, age. In order to control updates by multiple users, you add a version integer default 0 field to this table.
When you populate the form object for a user, you also store the version of the row the user selected.
So let's assume your client table looks like this:
id | firstname | lastname | age | version
1 | Tomas | Luv | 20 | 0
2 | Lucas | Duh | 22 | 0
3 | Christian | Bah | 30 | 0
When the user selects the client with id=1, the version of this row is, at that moment, 0. Then the user updates the lastname of this client to Bob and submits it.
Here comes the magic:
Create a trigger (before update) that checks the current version of the record against the version the user previously selected, something like this (written from memory, so double-check the syntax):
DELIMITER $$
CREATE TRIGGER check_client_version BEFORE UPDATE ON client
FOR EACH ROW
BEGIN
    IF NEW.version != OLD.version THEN
        -- someone already modified the row since it was read
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Record was already modified';
    ELSE
        SET NEW.version = OLD.version + 1;
    END IF;
END$$
DELIMITER ;
In the application you check whether the update raised this error and inform the user that someone else already changed the record he was trying to change.
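For example, assuming PDO in exception mode (PDO::ERRMODE_EXCEPTION), the application side could look like this sketch: send back the version the user originally read and catch the error raised by the trigger:

try {
    $stmt = $pdo->prepare(
        "UPDATE client SET lastname = :lastname, version = :version WHERE id = :id"
    );
    $stmt->execute(array(':lastname' => 'Bob', ':version' => 0, ':id' => 1));
} catch (PDOException $e) {
    // the trigger signalled SQLSTATE 45000: someone saved first
    echo "Someone else already changed this record. Please reload and try again.";
}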
So with the given example it would go like this:
1 - User A selects row 1 and starts editing it
2 - At the same time, user B selects row 1 and saves it before user A
3 - User A tries to save his modifications and gets the error from the application
In this scenario both user A and user B read the row with version 0, but when user B saves the record its version becomes 1, so when user A tries to save, the check trigger makes the update fail.
The problem with this approach is that you need a before update trigger on every table in your model, or at least on the ones you are concerned with.
Consider the following table rows:
ID | First Name | Last Name | Email | Age
1 | John | Smith | john@smith.com | 23
2 | Mohammad | Naji | me@naji.com | 26
When an update occurs, e.g. the email of an account is changed, how can I detect what the change was?
I need to bold the changes for website admins.
The current database schema doesn't support this, because I don't store previous revisions of the row.
Please advise me on the lowest-cost solution for my situation.
You can create a function in PHP and use it to update the data:
function update_row($new_row, $id)
Parameters:
$new_row is an associative array: array("column name" => new column value)
$id - id of the row to update
The function works like this:
Select current row with id = $id into $old_row
Compare old and new rows and get the columns updated:
$columns_updated = array();
foreach ($new_row as $key => $value) {
    if ($old_row[$key] != $value) {
        array_push($columns_updated, $key);
    }
}
update row where id=$id to $new_row
return $columns_updated
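A fuller sketch of the whole function, assuming a PDO connection in $pdo and a table named users (both placeholders). Note that the column names come from $new_row's keys, so they must not be user-supplied:

function update_row(PDO $pdo, array $new_row, $id)
{
    // select the current row into $old_row
    $stmt = $pdo->prepare("SELECT * FROM users WHERE id = ?");
    $stmt->execute(array($id));
    $old_row = $stmt->fetch(PDO::FETCH_ASSOC);

    // compare old and new values column by column
    $columns_updated = array();
    foreach ($new_row as $key => $value) {
        if ($old_row[$key] != $value) {
            $columns_updated[] = $key;
        }
    }

    // update only the columns that actually changed
    if ($columns_updated) {
        $assignments = array();
        $params = array();
        foreach ($columns_updated as $col) {
            $assignments[] = "`$col` = ?";
            $params[] = $new_row[$col];
        }
        $params[] = $id;
        $sql = "UPDATE users SET " . implode(', ', $assignments) . " WHERE id = ?";
        $pdo->prepare($sql)->execute($params);
    }

    return $columns_updated;
}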
You'll be unable to track changes unless you make some sort of change to the schema. At the very least you'll need a table (or tables) that do that for you.
You can either
a) Explicitly track changes as updates are made, by modifying the code that makes them. These updates could happen in many places, so this is likely to be time consuming.
b) Track the changes by implementing a MySQL trigger on the database that automatically copies the old version to your new tables each time a row is updated (a sketch follows at the end of this answer).
In either case, you'll need to query both the current table and the changes table to check for changes you need to highlight.
You'll also need to determine at what point a change no longer needs to be highlighted. Simply deleting the old row from your changes table will remove the change, but you'll need to decide when that should be done. You could use a MySQL event to cull the changes on a regular basis, or you could tie this maintenance to some other trigger or action in your system.
Implementation details will need to be decided based on your system and expectations.
Using triggers and events has the advantage that the tracking can be confined to the database, except where the changes need to be highlighted.
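For option (b), the trigger could be along these lines (a sketch: users_history and all column names are placeholders for your own schema):

$pdo->exec("
    CREATE TRIGGER users_before_update
    BEFORE UPDATE ON users
    FOR EACH ROW
    INSERT INTO users_history (user_id, first_name, last_name, email, age, changed_on)
    VALUES (OLD.id, OLD.first_name, OLD.last_name, OLD.email, OLD.age, NOW())
");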
I want to improve the speed of a notification board. It retrieves data from the event table.
At the moment the events MySQL table looks like this:
id | event_type | who_added_id | date
In the events table I store one row with the information regarding a particular event. Each time user A asks for new notifications, the query runs through the table and checks whether the notifications added by user B are relevant to him (they have to be friends, members of the same groups, or have previously chatted).
The events table has become big, and because of the bulky query the page loads slowly.
I'm thinking of changing this design entirely: instead of adding one event row and then comparing whether the event suits each user or not, I would add as many rows as there are interested users. I would change the events table structure as follows:
id | event_type | who_added_id | forwho_id | date
Now, if user B creates an event which interests 50 other members, I create 50 rows with the same information, and in the 'forwho_id' field I record each of the 50 members who must get this notification.
I think the query will become much simpler and it will take less time to run.
How do you think:
1. Is this a good approach to storing this kind of data, or should duplicate data be avoided at any cost?
2. How will the events table behave if the number of interested users is not 50 but hundreds?
Thank you for reading this and I hope I made myself understandable.
Duplicated data is not "bad", and it's not to be "avoided at all cost".
What is "bad" is uncontrolled redundancy, and the kind of problems that come up when the logical data model isn't third normal form. It is acceptable and expected that an implementation will deviate from a logical data model, and introduce redundancy for performance.
Your revised design looks appropriate for your needs.
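A sketch of the fan-out write, assuming a PDO connection in $pdo and that the list of interested users has already been resolved (friends, shared groups, previous chats) when the event is created:

$stmt = $pdo->prepare(
    "INSERT INTO events (event_type, who_added_id, forwho_id, date)
     VALUES (:type, :who, :forwho, NOW())"
);
foreach ($interestedUserIds as $forwhoId) {
    $stmt->execute(array(':type' => $eventType, ':who' => $whoAddedId, ':forwho' => $forwhoId));
}

With an index on forwho_id, reading notifications for a user becomes a cheap indexed lookup instead of the bulky query.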
In my requirements, every user on the website can see a score attached to other users. It gets calculated from their profile parameters. My score for someone else will be one value, but their score for me will be another.
What I have done so far
A table in the MySQL database like so:
UserID1 | UserID2 | Score | Last_Updated
1 | 2 | 45 | 1235686744
2 | 1 | 24 | 1235645332
When a user views someone's page, my score class checks whether the record for this pair exists in the database and, if not, calculates and records it. This works fine, because no one will look at absolutely every user page on the site.
Now I need to pull users and sort them by score. So I thought I could create a cron job and run it every night, so it updates the scores in the database and creates them for every pair of users, both ways.
The problem is that I am planning a system for over 500,000 users, and I am worried it will bring my database down and create a huge database. For 500,000 users we are talking about 250 billion records... :/
Does anyone know any other way of building this feature? Maybe calculation on the fly... or any other way?
If I were in your situation I would do the calculation on the fly: generate the scores using your function and store the values in the database at that point. That way, whenever any user visits any page, the scores are updated. This is an incremental approach, rather than trying to run the function on every single possible combination at once. Plus, no more database disaster :)
If you have a page that ranks all the users by score, it would be much simpler to use pagination with the ORDER BY and LIMIT/OFFSET features of SQL queries instead of fetching all users at once.
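A sketch of such a paginated ranking, assuming a PDO connection in $pdo; the table name scores is a placeholder, since the question doesn't name it ($page is 1-based, 50 rows per page):

$perPage = 50;
$offset  = ($page - 1) * $perPage;

$stmt = $pdo->prepare(
    "SELECT UserID2, Score
     FROM scores
     WHERE UserID1 = :viewer
     ORDER BY Score DESC
     LIMIT :limit OFFSET :offset"
);
$stmt->bindValue(':viewer', $viewerId, PDO::PARAM_INT);
$stmt->bindValue(':limit',  $perPage,  PDO::PARAM_INT);
$stmt->bindValue(':offset', $offset,   PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);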