I have a database in Google BigQuery with millions of rows (more than 2 million new rows every day) containing my user activities.
I created a PHP program to get insights from this database. It runs many queries to show statistics such as data per day, per hour, and many more.
I have two cases with two problems:
1. I look up user activities for dates between 2016-11-01 and 2016-11-10, and then I need to break the data down for 2016-11-05 only (basically a subset of the first query's result). The data needs to be classified per day, per hour, per user type, and so on. Right now I run many queries against the database to group the data and do the manipulation. For example, I run "SELECT * FROM user_activities WHERE date>='2016-11-01' AND date<='2016-11-10' GROUP BY date,hour", and when I need to break down the data for 2016-11-05 only I re-run the query: "SELECT * FROM user_activities WHERE date='2016-11-05' GROUP BY date,hour".
2. Sometimes I need to query the data with a different parameter, for example the user activities between 2016-11-01 and 2016-11-10 that contain activity "A", and then I need to switch to activity "B". I have a column that identifies the type of activity the user performed. Right now I run a query like "SELECT * FROM user_activities WHERE activities like 'A'", and when the activity type changes I run a new query: "SELECT * FROM user_activities WHERE activities like 'B'".
So my question is:
Because my database is so large, and because my PHP program runs insight queries so frequently, the cost of data management and processing has become very high. For cases like 1 and 2, is there an alternative, such as caching in PHP, that would reduce the number of database requests?
In just 1-2 days my BigQuery requests can add up to terabytes of data. I'm afraid this is not cost-efficient.
So far I have tried these solutions:
I take the raw data from my database, cache it in PHP, and do the
data manipulation manually. For example, I run "SELECT * FROM
user_activities WHERE date>='2016-11-01' AND date<='2016-11-10'" and
then group by hour, by user type, or by user activity manually and
sequentially in PHP functions. But because my data contains millions
of rows, the process takes too long and is not efficient.
I take the raw data from my database, insert it into a temporary
table, and then manipulate the data by querying the temporary table.
But this is not efficient either, because inserting millions of rows
takes too long.
Do you have any suggestions on how I can optimize this?
Implement partitioned tables, as has been recommended to you.
If you have one single big table with 5TB of data and no partitions, your costs are high.
With partitioned tables, a query that filters on the partition date scans only the partitions for those days, not the whole table: just a fraction of it, such as 10GB or less. And you pay only for that.
You can save a query result directly into a table instead of re-importing it as you describe, and then query only that smaller table for further aggregation.
Try not to use 'SELECT *'; instead, select only the columns you need in your output.
If the data is small enough and you run lots of small queries on it, you may want to take it out of BQ, store it in ElasticSearch or MySQL, and run the queries from there.
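The "save the result into a smaller table and aggregate from there" idea can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python with an in-memory SQLite database standing in for BigQuery, and the table `activity_summary` and the functions `build_hourly_summary` and `per_hour` are hypothetical names, not part of the original question. The point is that the expensive range scan runs once, and both the single-day breakdown (case 1) and the change of activity type (case 2) are then answered from the small summary.

```python
import sqlite3


def build_hourly_summary(conn, start, end):
    """Run the expensive range aggregation once and keep the small result.

    SQLite stands in for BigQuery here; `activity_summary` is a made-up name.
    """
    conn.execute("DROP TABLE IF EXISTS activity_summary")
    conn.execute(
        "CREATE TABLE activity_summary "
        "(date TEXT, hour INTEGER, activities TEXT, events INTEGER)"
    )
    conn.execute(
        """
        INSERT INTO activity_summary
        SELECT date, hour, activities, COUNT(*)
        FROM user_activities
        WHERE date >= ? AND date <= ?
        GROUP BY date, hour, activities
        """,
        (start, end),
    )


def per_hour(conn, day=None, activity=None):
    """Answer follow-up questions from the summary, not from the raw table."""
    clauses, params = [], []
    if day is not None:
        clauses.append("date = ?")
        params.append(day)
    if activity is not None:
        clauses.append("activities = ?")
        params.append(activity)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = (
        "SELECT date, hour, SUM(events) FROM activity_summary"
        + where
        + " GROUP BY date, hour ORDER BY date, hour"
    )
    return conn.execute(sql, params).fetchall()
```

After one call to `build_hourly_summary(conn, "2016-11-01", "2016-11-10")`, calls such as `per_hour(conn, day="2016-11-05")` or `per_hour(conn, activity="B")` touch only the summary rows.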
I'm new to sql & php and unsure about how to proceed in this situation:
I created a mysql database with two tables.
One is just a list of users with their data, each having a unique id.
The second one awards certain amounts of points to users, with relevant columns being the user id and the amount of awarded points. This table is supposed to get new entries regularly and there's no limit to how many times a single user can appear in it.
On my php page I now want to display a list of users sorted by their point total.
My first approach was creating a "points_total" column in the user table, intending to run some kind of query that would calculate and update the correct total for each user every time new entries are added to the other table. To retrieve the data I could then use a very simple query and even use sql's sort features.
However, while it's easy to update the total for a specific user with the sum where function, I don't see a way to do that for the whole user table. After all, plain sql doesn't offer the ability to iterate over each row of a table, or am I missing a different way?
I could probably do the update by going over the table in php, but then again, I'm not sure if that is even a good approach in the first place, because in a way storing the point data twice (the total in one table and then the point breakdown with some additional information in a different table) seems redundant.
A different option would be forgoing the extra column and instead calculating the sums every time the php page is accessed, then doing the sorting in php. However, I suppose this would be much slower than having the data ready in the database, which could be a problem if the tables have a lot of entries?
I'm a bit lost here so any advice would be appreciated.
To get the total points awarded, you could use a query similar to this:
SELECT
`users`.`user_name`,
`users`.`user_id`,
SUM(`points`.`points_award`) AS `points`,
COUNT(`points`.`points_award`) AS `numberOfAwards`
FROM `users`
JOIN `points`
ON `users`.`user_id` = `points`.`user_id`
GROUP BY `users`.`user_id`
ORDER BY `users`.`user_name` -- or whatever users column you want
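On the question's worry that plain SQL cannot update the total for every user at once: a correlated subquery in a single UPDATE does this without looping in PHP. The following is a minimal sketch, run through Python's sqlite3 module so it can be tried in isolation. The column names follow the question and the query above; `refresh_totals` and `leaderboard` are hypothetical helper names. `COALESCE` gives users with no awards a total of 0 instead of NULL.

```python
import sqlite3

# One statement refreshes the stored total for every user.
REFRESH_TOTALS = """
UPDATE users
SET points_total = (
    SELECT COALESCE(SUM(points_award), 0)
    FROM points
    WHERE points.user_id = users.user_id
)
"""


def refresh_totals(conn):
    """Recompute points_total for all users from the points table."""
    conn.execute(REFRESH_TOTALS)


def leaderboard(conn):
    """Return (user_name, points_total) sorted by total, highest first."""
    return conn.execute(
        "SELECT user_name, points_total FROM users "
        "ORDER BY points_total DESC, user_name"
    ).fetchall()
```

Whether to store the total at all is the trade-off the question already describes: the stored column is redundant and must be refreshed after inserts, while the aggregate query above is always current.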
Here is something that struck me, and I wanted to know whether I was right or whether it could be done better. I am currently running the PHP part on GAE and using Amazon RDS, since it is cheaper than Google Cloud SQL, and also since PHP on GAE does not have a native API for Datastore. I know there is a workaround, but this is simpler, and I bet a lot of others would rather have their GAE app sync with their existing DB than move the whole thing.
I run two queries
1. First query. This is a join statement that runs when the page loads
$STH = $DBH->prepare("SELECT .....a few selected columns with time conversion.....
List of Associates.Supervisor FROM Box Scores INNER JOIN
List of Associates ON Box Scores.Initials = List of
Associates.Initials WHERE str_to_date(Date, '%Y-%m-%d') BETWEEN
'{$startDate}' AND '{$endDate}' AND Box Scores.Initials LIKE
'{$initials}%' AND List of Associates.Supervisor LIKE'{$team}%'
GROUP BY Login");
I do some calculations on the result and then display it as a table, with each username as a link:
echo("<td >$row[0]</td>");
When someone clicks on this link, it calls another PHP script via AJAX to display the output, and that script runs the second query.
2. Second query. This time I am getting everything.
$STH = $DBH->prepare("SELECT * FROM `Box Scores` INNER JOIN `List of Associates` ON
`Box Scores`.`Initials` = `List of Associates`.`Initials`
WHERE str_to_date(`Date`, '%Y-%m-%d') BETWEEN '{$startDate}' AND '{$endDate}'
AND `V2 Box Scores`.`Initials` LIKE '{$Agent}%'
AND `List of Associates`.`Supervisor` LIKE '{$team}%'");
The output I display in a small pop up as a light box after formatting the output as a table.
I find the first query to be faster, which got me thinking about whether I should change the second one to make it faster.
Would selecting only the needed columns make it faster? Or should I do a SELECT * FROM as the first query, save the whole result to a unique file in a Google bucket, and then make the corresponding SELECT calls against that file?
I am trying to make it scale, so that it does not slow down when the query has to go through tens of thousands of rows in the DB. The above queries are executed using PDO (PHP Data Objects).
So what are your thoughts?
Amazon Redshift stores each column in a separate partition, an approach known as a columnar database or vertical partitioning. This results in some unusual performance characteristics.
For instance, I have run a query like this on a table with hundreds of millions of rows, and it took about a minute to return:
select *
from t
limit 10;
On the other hand, a query like this would return in a few seconds:
select count(*), count(distinct field)
from t;
This takes some getting used to. But you should explicitly limit the columns you refer to in the query to get the best performance on Amazon (and other columnar databases). Each additional referenced column requires reading that column's data from disk into memory.
Limiting the number of columns also reduces the I/O needed to send results to the application. This can be significant if you store wide-ish data in some of the columns and don't use it.
What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don't want to hit the DB for every page view. Currently there are ~45k rows returned by this query.
Here are some of the approaches I've considered:
Cache the row count and update it every X minutes
Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
Get the row count once and store it in the user's session
Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:
Multi-column indexes: I should have done this first, before attempting a materialized view.
Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table, and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write-contended mytable down to a minimum, and the pagination queries could hammer away on the atbat materialized view. I've simplified the above, leaving out the actual manipulations, which are unimportant.
Memcache: I then created a wrapper around my database connection to cache these paginated results in memcache. This was a huge performance win. However, it was still not good enough.
Batch generation: I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all the pages, from the oldest changed record to the most recent record, onto the webserver's filesystem. With a bit of mod_rewrite, I could check whether the page existed on disk and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE count(*): When using MyISAM tables, COUNT didn't create a problem when I was able to reduce the amount of read-write contention on the table. If I were doing InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
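The trigger-maintained counter described for InnoDB can be sketched roughly as below. This is an assumption-laden illustration, not the answerer's code: it uses SQLite trigger syntax through Python's sqlite3 module so it can run standalone (MySQL trigger syntax differs slightly), and `row_counts`, `install_counter`, and `cached_count` are made-up names.

```python
import sqlite3

# Side table holding the count, seeded once, then kept current by triggers.
COUNTER_DDL = """
CREATE TABLE IF NOT EXISTS row_counts (
    table_name TEXT PRIMARY KEY,
    n INTEGER NOT NULL
);
INSERT OR REPLACE INTO row_counts
    SELECT 'mytable', COUNT(*) FROM mytable;
CREATE TRIGGER IF NOT EXISTS mytable_count_ins AFTER INSERT ON mytable
BEGIN
    UPDATE row_counts SET n = n + 1 WHERE table_name = 'mytable';
END;
CREATE TRIGGER IF NOT EXISTS mytable_count_del AFTER DELETE ON mytable
BEGIN
    UPDATE row_counts SET n = n - 1 WHERE table_name = 'mytable';
END;
"""


def install_counter(conn):
    """Create the counter table, seed it, and attach the +1/-1 triggers."""
    conn.executescript(COUNTER_DDL)


def cached_count(conn):
    """Read the row count from the side table instead of COUNT(*)."""
    row = conn.execute(
        "SELECT n FROM row_counts WHERE table_name = 'mytable'"
    ).fetchone()
    return row[0]
```

Reading `cached_count(conn)` is a single primary-key lookup, at the price of one extra write per INSERT or DELETE on mytable.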
RE page-pickers (thumbwheels): When I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch generating the pages, I was using temporary tables, so computing the thumbwheel was no problem. A lot of the thumbwheel calculation became simpler because it turned into a predictable filesystem pattern that only needed the largest page number. The smallest page number was always 1.
Windowed thumbwheel: The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, so long as you know your maximum number of pages.
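To illustrate why the windowed thumbwheel needs no queries, here is a small sketch in Python. The function name `page_window` and the `radius` parameter are hypothetical; the only inputs are the current page and the largest page number.

```python
def page_window(current, last_page, radius=1):
    """Return the page numbers to show around `current`.

    The window is clamped to the range 1..last_page, so no database
    access is needed once the maximum page number is known.
    """
    first = max(1, current - radius)
    last = min(last_page, current + radius)
    return list(range(first, last + 1))
```

With page 5 of 10 and the default radius, this yields pages 4, 5, and 6, matching the `<< 4 [5] 6 >>` example; the `<<` and `>>` links can be shown whenever the window does not already touch page 1 or the last page.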
My suggestion is to ask MySQL for one row more than you need in each query, and to decide based on the number of rows in the result set whether or not to show the next-page link.
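A minimal sketch of the "one extra row" idea, written in Python against SQLite so it runs standalone. The function `fetch_page` and the columns `id` and `title` are assumptions for illustration; the same LIMIT/OFFSET shape applies in MySQL.

```python
import sqlite3


def fetch_page(conn, page, per_page=10):
    """Fetch one page plus a flag saying whether a next page exists.

    Asks for per_page + 1 rows: if the extra row comes back, there is
    at least one more page, so no COUNT(*) is needed.
    """
    offset = (page - 1) * per_page
    rows = conn.execute(
        "SELECT id, title FROM mytable ORDER BY id LIMIT ? OFFSET ?",
        (per_page + 1, offset),
    ).fetchall()
    has_next = len(rows) > per_page
    return rows[:per_page], has_next
```

The extra row is dropped before display, so the page always shows at most `per_page` rows.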
MySQL has a specific mechanism to compute the number of rows a query would have returned without the LIMIT clause: SQL_CALC_FOUND_ROWS together with FOUND_ROWS().
MySQL is quite good in optimizing LIMIT queries.
That means it picks appropriate join buffer, filesort buffer etc just enough to satisfy LIMIT clause.
Also note that with 45k rows you probably don't need exact count. Approximate counts can be figured out using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) *
(
SELECT COUNT(*)
FROM mytable
) / 1000
FROM (
SELECT 1
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
LIMIT 1000
) AS q
, which is much more efficient in MyISAM.
If you give an example of your complex query, probably I can say something more definite on how to improve its pagination.
I'm by no means a MySQL expert, but perhaps giving up the COUNT(*) and going ahead with COUNT(id)?
I am building a fairly large statistics system, which needs to allow users to request statistics for a given set of filters (e.g. a date range).
e.g. This is a simple query that returns 10 results, including the player_id and amount of kills each player has made:
SELECT player_id, SUM(kills) as kills
FROM `player_cache`
GROUP BY player_id
ORDER BY kills DESC
LIMIT 10
OFFSET 30
The above query will offset the results by 30 (i.e. the 4th 'page' of results, at 10 per page). When the user then selects the 'next' page, it will use OFFSET 40 instead of 30.
My problem is that nothing is cached: even though the LIMIT/OFFSET pair is applied to the same dataset, the SUM() is performed all over again just to offset the results by 10 more.
The above example is a simplified version of a much bigger query which just returns more fields, and takes a very long time (20+ seconds, and will only get longer as the system grows).
So I am essentially looking for a solution to speed up the page load, by caching the state before the LIMIT/OFFSET is applied.
You can of course use caching, but I would recommend caching the result yourself rather than relying on MySQL to cache the query.
But first things first, make sure that a) you have the proper indexing on your data, b) that it's being used.
If this does not work, as group by tends to be slow with large datasets, you need to put the summary data in a static table/file/database.
There are several techniques/libraries etc that help you perform server side caching of your data. PHP Caching to Speed up Dynamically Generated Sites offers a pretty simple but self explanatory example of this.
Have you considered periodically running your long query and storing all the results in a summary table? The summary table can be quickly queried because there are no JOINs and no GROUPings. The downside is that the summary table is not up-to-the-minute current.
I realize this doesn't address the LIMIT/OFFSET issue, but it does fix the issue of running a difficult query multiple times.
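A rough sketch of the summary-table approach for the `player_cache` query from the question, in Python with SQLite as a stand-in for MySQL. The table `player_kills_summary` and the functions `rebuild_summary` and `top_players` are hypothetical names; `rebuild_summary` is what a periodic job (cron, for instance) would call.

```python
import sqlite3


def rebuild_summary(conn):
    """Recompute the aggregate once; run this periodically, not per page view."""
    conn.execute("DROP TABLE IF EXISTS player_kills_summary")
    conn.execute(
        "CREATE TABLE player_kills_summary AS "
        "SELECT player_id, SUM(kills) AS kills "
        "FROM player_cache GROUP BY player_id"
    )
    # Index the sort column so ORDER BY ... LIMIT stays cheap.
    conn.execute(
        "CREATE INDEX idx_summary_kills ON player_kills_summary (kills)"
    )


def top_players(conn, page, per_page=10):
    """Page through the precomputed totals; no SUM() or GROUP BY here."""
    return conn.execute(
        "SELECT player_id, kills FROM player_kills_summary "
        "ORDER BY kills DESC, player_id LIMIT ? OFFSET ?",
        (per_page, (page - 1) * per_page),
    ).fetchall()
```

As the answer notes, pages read from the summary are only as fresh as the last rebuild.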
Depending on how often the data is updated, data-warehousing is a straightforward solution to this. Basically you:
Build a second database (the data warehouse) with a similar table structure
Optimise the data warehouse database for getting your data out in the shape you want it
Periodically (e.g. overnight each day) copy the data from your live database to the data warehouse
Make the page get its data from the data warehouse.
There are different optimisation techniques you can use, but it's worth looking into:
Removing fields which you don't need to report on
Adding extra indexes to existing tables
Adding new tables/views which summarise the data in the shape you need it.
I have an array of user ids in a query from Database A, Table A (AA).
I have the main user database in Database B, Table A (BA).
For each user id returned in my result array from AA, I want to retrieve the first and last name of that user id from BA.
Different user accounts control each database. Unfortunately each login cannot have permissions to each database.
Question: How can I retrieve the firsts and lasts with the least amount of queries and / or processing time? With 20 users in the array? With 20,000 users in the array? Any order of magnitude higher, if applicable?
Using php 5 / mysql 5.
As long as the databases are on the same server just use a cross database join. The DB login being used to access the data will also need permissions on both databases. Something like:
SELECT AA.userID, BA.first, BA.last
FROM databaseA.tableA AA
INNER JOIN databaseB.tableA BA ON AA.userID = BA.userID
In response to comments:
I don't believe I read the part about multiple logins correctly, sorry. You cannot use two different MySQL logins on one connection. If you need to do multiple queries, you really only have three options. A) Loop through the first result set and run one query per row. B) Run a query which uses a WHERE clause with userID IN (#firstResultSet) and pass in the first result set. C) Select everything out of the second DB and join the two in code.
None of those three options is very good, so I would ask: why can't you change user permissions on one of the two DBs? I would also ask why you would need to select the names and IDs of 20,000 users. Unless this is some type of data dump, I would be looking for a different way to display the data, one that is both easier to use and less query intensive.
All that said, which option you choose will depend on a variety of circumstances. With a low number of records, under 1,000, I would use option B. With a higher number of records, I would probably use option C and try to place the two result sets into something that can be joined (such as using array_combine).
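Option B can be sketched as follows, with two separate connections standing in for the two logins. This is an illustration in Python with SQLite, not PHP/MySQL; the table name `users`, the helpers `chunked` and `fetch_names`, and the chunk size of 500 are all assumptions. Splitting the ID list into chunks keeps each IN list bounded, which matters once the array grows toward 20,000 entries.

```python
import sqlite3


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_names(conn_b, user_ids, chunk_size=500):
    """Look up first/last names in database B for IDs taken from database A.

    One query per chunk, with bound placeholders instead of string
    concatenation of the IDs.
    """
    names = {}
    for chunk in chunked(list(user_ids), chunk_size):
        placeholders = ",".join("?" * len(chunk))
        sql = (
            "SELECT userID, first, last FROM users "
            "WHERE userID IN (" + placeholders + ")"
        )
        for user_id, first, last in conn_b.execute(sql, chunk):
            names[user_id] = (first, last)
    return names
```

For 20 IDs this is a single query against B; for 20,000 it is 40 queries at the default chunk size, still far fewer than one query per user as in option A.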
I think the key here is that it should be possible in two database calls.
Your first one to get the id's from database A and the second one to pass them to database B.
I don't know mysql, but in sqlserver I'd use the xml datatype and pass all of the ids into a statement using that. Before the xml datatype I'd have built up some dynamic SQL with the id's in an IN statement.
SELECT UserId FROM DatabaseA.TableA
Loop through id's and build up a comma separated string.
"SELECT FirstName, Surname FROM DataBaseB.TableA WHERE UserId IN(" + stringId + ")"
The problem with this is that with 20,000 id's you may have some performance issues with the amount of data you are sending. This is where I'd use the XML datatype, so maybe look at what alternatives mysql has for passing lists of ids.