PHP - Query on page with many requests

I have around 700-800 visitors on my home page at all times (according to analytics) and a lot of hits in general. I want to show live statistics about my users and other data on my homepage, so I have this:
$stmt = $dbh->prepare("
SELECT
count(*) as totalusers,
sum(cashedout) cashedout,
(SELECT sum(value) FROM xeon_stats_clicks
WHERE typ='1') AS totalclicks
FROM users
");
$stmt->execute();
$stats=$stmt->fetch();
Which I then use as $stats["totalusers"] etc.
table.users has `22210` rows, with an index on `id, username, cashedout`; `table.xeon_stats_clicks` has indexes on `value` and `typ`
However, whenever I enable the above query my website instantly becomes very slow. As soon as I disable it, the load time drops drastically.
How else can this be done?

You should not do it that way. You will eventually exhaust your precious DB resources, as you are now experiencing. A better approach is to run a separate cron job at a 30-second or 1-minute interval and write the result to a file:
file_put_contents('stats.txt', $stats["totalusers"]);
and then on your mainpage
<span>current users :
<b><?php echo file_get_contents('stats.txt'); ?></b>
</span>
The beauty is that the server will cache this small file, so until stats.txt is changed, reads are served from cache.
Example: saving/loading JSON via a file:
$test = array('test' => 'qwerty');
file_put_contents('test.txt', json_encode($test));
echo json_decode(file_get_contents('test.txt'))->test;
will output qwerty. Replace $test with $stats, and read a value back like this:
echo json_decode(file_get_contents('stats.txt'))->totalclicks;
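Putting the pieces together, a minimal sketch of the cron script (assuming a PDO connection; the DSN, credentials and paths are placeholders you would adjust):
<?php
// update_stats.php - run from cron, e.g. once a minute
$dbh = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');

$stmt = $dbh->query("
    SELECT
        COUNT(*) AS totalusers,
        SUM(cashedout) AS cashedout,
        (SELECT SUM(value) FROM xeon_stats_clicks WHERE typ='1') AS totalclicks
    FROM users
");
$stats = $stmt->fetch(PDO::FETCH_ASSOC);

// write atomically so a page load never sees a half-written file
file_put_contents('stats.txt.tmp', json_encode($stats));
rename('stats.txt.tmp', 'stats.txt');
A crontab entry such as * * * * * php /path/to/update_stats.php would refresh it every minute.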

From what I can tell, there is nothing about this query that is specific to any user on the site. So if you have this query being executed for every user that makes a request, you are making thousands of identical queries.
You could do a sort of caching like so:
Create a table that basically looks like the output of this query.
Make a PHP script that just executes this query and updates the aforementioned table with the latest result.
Execute this PHP script as a cron job every minute to update the stats.
Then the query that gets run for every request can be real simple, like:
SELECT totalusers, cashedout, totalclicks FROM stats_table
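A rough sketch of that setup; $dbh stands for a PDO connection, and the stats_table layout below is just one possible shape:
// one-time setup: a single-row cache table shaped like the query output
$dbh->exec("
    CREATE TABLE IF NOT EXISTS stats_table (
        id          TINYINT PRIMARY KEY,
        totalusers  INT,
        cashedout   DECIMAL(12,2),
        totalclicks INT
    )
");

// cron job, e.g. once a minute: recompute and overwrite the cached row
$dbh->exec("
    REPLACE INTO stats_table (id, totalusers, cashedout, totalclicks)
    SELECT 1,
           COUNT(*),
           SUM(cashedout),
           (SELECT SUM(value) FROM xeon_stats_clicks WHERE typ='1')
    FROM users
");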

From the query, I can't see any real reason to use a subquery there, since it doesn't use any data from the users table, and it's likely what is slowing things down: if memory serves, it will query the xeon_stats_clicks table once for every row in your users table (which is a lot of rows by the looks of things).
Try doing it as two separate queries rather than one.
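For instance, a sketch using the same PDO handle as in the question:
// aggregate the users table on its own...
$stmt = $dbh->query("SELECT COUNT(*) AS totalusers, SUM(cashedout) AS cashedout FROM users");
$stats = $stmt->fetch(PDO::FETCH_ASSOC);

// ...and the clicks table separately, instead of as a nested subquery
$stmt = $dbh->query("SELECT SUM(value) AS totalclicks FROM xeon_stats_clicks WHERE typ='1'");
$stats['totalclicks'] = $stmt->fetchColumn();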

Related

MySQL Select Query 17000 rows slow

The problem is that when I display about 17000 records on the page, it takes a long time, and when I open this page in a couple of new tabs, they only load once the first page has finished loading.
That doesn't seem right, because web applications should be able to serve requests concurrently.
I don't understand why this is happening.
I use indexes to make things load faster, but as far as I can tell they don't help. If I load the data in phpMyAdmin with a SELECT * FROM ... LIMIT 30000, it takes 10 minutes and then ends in a server error. I don't know why that is.
How can I increase the speed of inserting, reading and writing data?
This page selects 2002 comments (data sets) and is already slow
https://www.prodigy-official.de/punity/questions/show?question=11
This page selects 17000 rows and does not load at all...
https://www.prodigy-official.de/punity/questions/show?question=10
Why?
I use InnoDB as Engine.
I use a Virtual Machine.
Infos:
http://prntscr.com/pm8e9j
Glancing through the code, it seems like you display a question, plus all its comments?
$question_id = $_GET['question'];
// first query: the top-level comments for the question
SELECT id,username_id,question_id,comment,comment_date
FROM `user.comments`
WHERE `question_id` = $question_id
AND is_reply_from_comment_id = '0'
ORDER BY id
// then one extra query per top-level comment, to fetch its replies
foreach result
{
    SELECT id,username_id,comment,comment_date
    FROM `user.comments`
    WHERE is_reply_from_comment_id = $comment_id
    ORDER BY id
}
For the sanity of your users, you must put a limit on the number of comments displayed all at once.
Replacing INDEX(is_reply_from_comment_id) with INDEX(is_reply_from_comment_id, question_id) will help the first SELECT without hurting the second.
Do you understand that the schema limits the table to only 32K rows? (It sounds like you will soon hit that limit.)
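A sketch of those two suggestions; $dbh stands for whatever PDO handle the page already uses, the LIMIT of 50 is arbitrary, and the name of the existing single-column index is whatever SHOW CREATE TABLE reports:
// one-time maintenance: swap the single-column index for the composite one
$dbh->exec("ALTER TABLE `user.comments`
            DROP INDEX is_reply_from_comment_id,
            ADD INDEX idx_reply_question (is_reply_from_comment_id, question_id)");

// per request: cap the number of top-level comments fetched at once
$stmt = $dbh->prepare("
    SELECT id, username_id, question_id, comment, comment_date
    FROM `user.comments`
    WHERE question_id = ?
      AND is_reply_from_comment_id = '0'
    ORDER BY id
    LIMIT 50
");
$stmt->execute([(int) $_GET['question']]);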

Performance issue: Load data infile is making select query slow

I have a program which scans Twitter, Facebook and Google+ 24 hours a day. For each user a search list is run, and the results are inserted, 100 at a time (the function runs in a loop until there are no further results), with
Yii::app()->db->createCommand(
"LOAD DATA INFILE '/var/tmp/inboxli_user".$user.".txt'
INTO TABLE inbox
FIELDS TERMINATED BY ',$%'
LINES STARTING BY 'thisisthebeginningxxx'
(created_on, created_at, tweet, tweet_id, profile_image,
twitter_user_id, screenname, followers, lang, tags, type,
positive_score, readme, answered, deleted, searchlist_id,
handled_by, used_as_newsitem, user_id)
" )->execute();
into the database, in order to keep the load on the server as small as possible. However, while my functions are doing the bulk insert, my SELECT queries run very slowly. Normally the inbox loads within 1.5 seconds, but while the insertion is running it sometimes takes 20 seconds for a page to open.
My question: how can I optimize this so that inserts and selects can use the database at the same time without slowing things down?
Get off MyISAM! Use InnoDB; it does a much better job of not locking out other actions.
LOAD DATA is very efficient; increase the batch size from 100 to, say, 500.
What indexes do you have? Let's see SHOW CREATE TABLE. DROP any unnecessary indexes; this will speed up the LOAD.
Consider turning off the Query cache.
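A minimal sketch of the first and third suggestions, using the same Yii connection as the question (run once as maintenance, not inside the import loop):
// move the table off MyISAM so readers are no longer blocked by the bulk load
Yii::app()->db->createCommand("ALTER TABLE inbox ENGINE=InnoDB")->execute();

// list the current indexes before deciding which ones to drop
$ddl = Yii::app()->db->createCommand("SHOW CREATE TABLE inbox")->queryRow();
print_r($ddl);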
Well, first you should make sure you have indexed your table correctly. See How does database indexing work?
That will speed up the select statements considerably.
Second, you could split your file into multiple chunks, so that the database server clears its caches and logs for each new file you load.
See: https://www.percona.com/blog/2008/07/03/how-to-load-large-files-safely-into-innodb-with-load-data-infile/

Improve SQL query results when running php app on GAE and DB on Amazon RDS

Here is something that hit me, and I wanted to know if I am right or if it could be done better. I am currently running the PHP part on GAE and use Amazon RDS, since it is cheaper than Google Cloud SQL, and also since PHP on GAE does not have a native API for Datastore. I know there is a workaround, but this is simpler, and I bet a lot of others would rather have their GAE app talk to their existing DB than move everything over.
I run two queries
1. The first is a join statement that runs when the page loads:
$STH = $DBH->prepare("SELECT .....a few selected colmns with time coversion.....
List of Associates.Supervisor FROM Box Scores INNER JOIN
List of Associates ON Box Scores.Initials = List of
Associates.Initials WHERE str_to_date(Date, '%Y-%m-%d') BETWEEN
'{$startDate}' AND '{$endDate}' AND Box Scores.Initials LIKE
'{$initials}%' AND List of Associates.Supervisor LIKE'{$team}%'
GROUP BY Login");
I do some calculations on what I get back and then display it as a table, with each username as a link:
echo("<td >$row[0]</td>");
When someone clicks on such a link, it calls another PHP script via AJAX to display the output, and that is where I run the second query.
2. The second query. This time I am getting everything:
$STH = $DBH->prepare("SELECT * FROM `Box Scores` INNER JOIN `List of Associates` ON
`Box Scores`.`Initials` = `List of Associates`.`Initials`
WHERE str_to_date(`Date`, '%Y-%m-%d') BETWEEN '{$startDate}' AND '{$endDate}'
AND `V2 Box Scores`.`Initials` LIKE '{$Agent}%'
AND `List of Associates`.`Supervisor` LIKE '{$team}%'");
I format that output as a table and display it in a small lightbox popup.
I find the first query to be faster, so it got me thinking: should I do something to the second one to make it faster?
Would selecting only the needed columns make it faster? Or should I do a SELECT * like the first query, save the whole result to a unique file in a Google Cloud Storage bucket, and then make the corresponding lookups against that file?
I am trying to make it scale and not slow down when the query has to go through tens of thousands of rows in the DB. The above queries are executed using PDO (PHP Data Objects).
So, what are your thoughts?
Amazon Redshift stores each column in a separate partition, something called a columnar database or vertical partitioning. This results in some unusual performance issues.
For instance, I have run a query like this on a table with hundreds of millions of rows, and it took about a minute to return:
select *
from t
limit 10;
On the other hand, a query like this would return in a few seconds:
select count(*), count(distinct field)
from t;
This takes some getting used to. But, you should explicitly limit the columns you refer to in the query to get the best performance on Amazon (and other columnar databases). Each additional referenced column requires reading in that data from disk to memory.
Limiting the number of columns also reduces the I/O between the database and the application. This can be significant if you are storing wide-ish data in some of the columns and you don't use that data.
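In practice that means trimming the second query down to the columns the popup actually displays, and binding the values instead of interpolating them; the column list below is a placeholder for whatever the popup really needs:
$STH = $DBH->prepare("
    SELECT `Box Scores`.`Initials`, `Box Scores`.`Date`, `List of Associates`.`Supervisor`
    FROM `Box Scores`
    INNER JOIN `List of Associates`
        ON `Box Scores`.`Initials` = `List of Associates`.`Initials`
    WHERE str_to_date(`Date`, '%Y-%m-%d') BETWEEN :start AND :end
      AND `Box Scores`.`Initials` LIKE :agent
      AND `List of Associates`.`Supervisor` LIKE :team
");
$STH->execute([
    ':start' => $startDate,
    ':end'   => $endDate,
    ':agent' => $Agent . '%',
    ':team'  => $team . '%',
]);
$rows = $STH->fetchAll(PDO::FETCH_ASSOC);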

Optimizing queries for content popularity by hits

I've done some searching for this but haven't come up with anything, maybe someone could point me in the right direction.
I have a website with lots of content in a MySQL database and a PHP script that loads the most popular content by hits. It does this by logging each content hit in a table along with the access time. Then a select query is run to find the most popular content in the past 24 hours, 7 days or at most 30 days. A cronjob deletes anything older than 30 days from the log table.
The problem I'm facing now is that as the website grows, the log table has 1m+ hit records and it is really slowing down my select query (10-20 s). At first I thought the problem was a join I had in the query to get the content title, URL, etc. But now I'm not sure, as in testing, removing the join does not speed up the query as much as I thought it would.
So my question is: what is best practice for this kind of popularity storing/selecting? Are there any good open source scripts for this? Or what would you suggest?
Table schema
"popularity" hit log table
nid | insert_time | tid
nid: Node ID of the content
insert_time: timestamp (2011-06-02 04:08:45)
tid: Term/category ID
"node" content table
nid | title | status | (there are more but these are the important ones)
nid: Node ID
title: content title
status: is the content published (0=false, 1=true)
SQL
SELECT node.nid, node.title, COUNT(popularity.nid) AS count
FROM `node` INNER JOIN `popularity` USING (nid)
WHERE node.status = 1
AND popularity.insert_time >= DATE_SUB(CURDATE(),INTERVAL 7 DAY)
GROUP BY popularity.nid
ORDER BY count DESC
LIMIT 10;
We've just come across a similar situation and this is how we got around it. We decided we didn't really care about what exact 'time' something happened, only the day it happened on. We then did this:
Every record has a 'total hits' record which is incremented every time something happens
A logs table records these 'total hits' per record, per day (in a cron job)
By selecting the difference between two given dates in this log table, we can deduce the 'hits' between two dates, very quickly.
The advantage of this is that the size of your log table is only NumRecords * NumDays, which in our case is very small. Also, any queries on this log table are very quick.
The disadvantage is you lose the ability to deduce hits by time of day but if you don't need this then it might be worth considering.
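A sketch of that layout; $dbh stands for a PDO connection, daily_hits and node_counters are hypothetical names, and node_counters(nid, total_hits) is the per-record running total mentioned above:
// hypothetical daily log table: one row per content item per day
$dbh->exec("
    CREATE TABLE IF NOT EXISTS daily_hits (
        nid        INT NOT NULL,
        stat_date  DATE NOT NULL,
        total_hits INT NOT NULL,
        PRIMARY KEY (nid, stat_date)
    )
");

// daily cron job: snapshot each item's running total
$dbh->exec("
    INSERT INTO daily_hits (nid, stat_date, total_hits)
    SELECT nid, CURDATE(), total_hits FROM node_counters
    ON DUPLICATE KEY UPDATE total_hits = VALUES(total_hits)
");

// hits in the last 7 days = difference between two snapshots
$stmt = $dbh->query("
    SELECT t.nid, t.total_hits - w.total_hits AS hits_last_7_days
    FROM daily_hits t
    JOIN daily_hits w
      ON w.nid = t.nid
     AND w.stat_date = t.stat_date - INTERVAL 7 DAY
    WHERE t.stat_date = CURDATE()
    ORDER BY hits_last_7_days DESC
    LIMIT 10
");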
You actually have two problems to solve further down the road.
One, which you've yet to run into but which you might hit earlier than you want, is going to be insert throughput within your stats table.
The other, which you've outlined in your question, is actually using the stats.
Let's start with insert throughput.
Firstly, in case you're doing so, don't track statistics on pages that could use caching. Use a PHP script that advertises itself as an empty JavaScript file, or as a one-pixel image, and include it on the pages you're tracking. Doing so allows you to readily cache the remaining content of your site.
In the telco business, rather than doing actual inserts for call billing, records are placed in memory and periodically synced to disk. Doing so allows them to manage gigantic throughput while keeping the hard drives happy.
To proceed similarly on your end, you'll need an atomic operation and some in-memory storage. Here's some memcache-based pseudo-code for doing the first part...
For each page, you need a Memcache variable. In Memcache, increment() is atomic, but add(), set(), and so forth aren't. So you need to be wary of not mis-counting hits when concurrent processes add the same page at the same time:
$ns = $memcache->get('stats-namespace');
while (!$memcache->increment("stats-$ns-$page_id")) {
$memcache->add("stats-$ns-$page_id", 0, 1800); // garbage collect in 30 minutes
$db->upsert('needs_stats_refresh', array($ns, $page_id)); // engine = memory
}
Periodically, say every 5 minutes (configure the timeout accordingly), you'll want to sync all of this to the database, without any possibility of concurrent processes affecting each other or existing hit counts. For this, you increment the namespace before doing anything (this gives you a lock on existing data for all intents and purposes), and sleep a bit so that existing processes that reference the prior namespace finish up if needed:
$ns = $memcache->get('stats-namespace');
$memcache->increment('stats-namespace');
sleep(60); // allow concurrent page loads to finish
Once that is done, you can safely loop through your page ids, update the stats accordingly, and clean up the needs_stats_refresh table. The latter only needs two fields (page_id int pkey, ns_id int). There's a bit more to it than simple select, insert, update and delete statements run from your scripts, however, so continuing...
As another replier suggested, it's quite appropriate to maintain intermediate stats for your purpose: store batches of hits rather than individual hits. At most, I'm assuming you want hourly or quarter-hourly stats, so it's fine to deal with subtotals that are batch-loaded every 15 minutes.
Even more importantly for your sake, since you're ordering posts using these totals, you want to store the aggregated totals and have an index on them. (We'll get to where to store them further down.)
One way to maintain the totals is to add a trigger which, on insert or update to the stats table, will adjust the stats total as needed.
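For example, a single-statement trigger along these lines could do it; the table and column names (stats, stat_totals, weekly_total, hits, page_id) follow this answer's naming and are otherwise assumptions, and $dbh stands for a PDO connection:
// keep the aggregated weekly total in sync as batched hits are inserted
$dbh->exec("
    CREATE TRIGGER stats_after_insert
    AFTER INSERT ON stats
    FOR EACH ROW
        UPDATE stat_totals
        SET weekly_total = weekly_total + NEW.hits
        WHERE page_id = NEW.page_id
");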
When doing so, be especially wary about dead-locks. While no two $ns runs will be mixing their respective stats, there is still a (however slim) possibility that two or more processes fire up the "increment $ns" step described above concurrently, and subsequently issue statements that seek to update the counts concurrently. Obtaining an advisory lock is the simplest, safest, and fastest way to avoid problems related to this.
Assuming you use an advisory lock, it's perfectly OK to use total = total + subtotal in the UPDATE statement.
While on the topic of locks, note that updating the totals will require an exclusive lock on each affected row. Since you're ordering by them, you don't want them processed all in one go because it might mean keeping an exclusive lock for an extended duration. The simplest here is to process the inserts into stats in smaller batches (say, 1000), each followed by a commit.
For intermediary stats (monthly, weekly), add a few boolean fields (bit or tinyint in MySQL) to your stats table. Have each of these store whether the row is to be counted towards the monthly, weekly, daily stats, etc. Place a trigger on them as well, so that they increase or decrease the applicable totals in your stat_totals table.
As a closing note, give some thoughts on where you want the actual count to be stored. It needs to be an indexed field, and the latter is going to be heavily updated. Typically, you'll want it stored in its own table, rather than in the pages table, in order to avoid cluttering your pages table with (much larger) dead rows.
Assuming you did all the above your final query becomes:
select p.*
from pages p join stat_totals s using (page_id)
order by s.weekly_total desc limit 10
It should be plenty fast with the index on weekly_total.
Lastly, let's not forget the most obvious of all: if you're running these same total/monthly/weekly/etc queries over and over, their result should be placed in memcache too.
You can add indexes and try tweaking your SQL, but the real solution here is to cache the results.
You should really only need to calculate the last 7/30 days of traffic once daily, and you could do the past 24 hours hourly.
Even if you did it once every 5 minutes, that's still a huge saving over running the (expensive) query for every hit of every user.
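A sketch of that kind of result caching with the Memcached extension; $dbh stands for a PDO connection, and the key name and the 5-minute TTL are arbitrary:
$memcache = new Memcached();
$memcache->addServer('127.0.0.1', 11211);

$top10 = $memcache->get('popular-7d');
if ($top10 === false) {
    // cache miss: run the expensive query once, reuse the result for 5 minutes
    $stmt = $dbh->query("
        SELECT node.nid, node.title, COUNT(popularity.nid) AS count
        FROM `node` INNER JOIN `popularity` USING (nid)
        WHERE node.status = 1
          AND popularity.insert_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
        GROUP BY popularity.nid
        ORDER BY count DESC
        LIMIT 10
    ");
    $top10 = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $memcache->set('popular-7d', $top10, 300);
}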
RRDtool
Many tools/systems do not build their own logging and log aggregation but use RRDtool (round-robin database tool) to efficiently handle time-series data. RRDtool also comes with a powerful graphing subsystem, and (according to Wikipedia) there are bindings for PHP and other languages.
From your questions I assume you don't need any special and fancy analysis and RRDtool would efficiently do what you need without you having to implement and tune your own system.
You can do some aggregation in the background, for example with a cron job. Some suggestions (in no particular order) that might help:
1. Create a table with hourly results. This means you can still create the statistics you want, but you reduce the amount of data to about 24*7*4 = 672 records per page per month.
Your table could be something along the lines of this:
hourly_results (
nid integer,
start_time datetime,
amount integer
)
After you have parsed the raw hits into your aggregate table, you can more or less delete them (see the sketch after this list).
2. Use result caching (memcache, APC)
You can easily store the results (which should not change every minute, but rather every hour?) either in a memcache database (which, again, you can update from a cronjob), in the APC user cache (which you can't update from a cronjob), or via file caching by serializing objects/results if you're short on memory.
3. Optimize your database
10 seconds is a long time. Try to find out what is happening with your database. Is it running out of memory? Do you need more indexes?
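As a sketch of point 1 above, the hourly roll-up and clean-up could look like this, run from a cron job; $dbh stands for a PDO connection, popularity is the raw hit table from the question, and hourly_results is the table sketched above:
// roll the fully elapsed hours of raw hits up into hourly_results...
$dbh->exec("
    INSERT INTO hourly_results (nid, start_time, amount)
    SELECT nid,
           DATE_FORMAT(insert_time, '%Y-%m-%d %H:00:00') AS hour_start,
           COUNT(*)
    FROM popularity
    WHERE insert_time < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
    GROUP BY nid, hour_start
");

// ...then delete the raw rows that have just been aggregated
$dbh->exec("
    DELETE FROM popularity
    WHERE insert_time < DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
");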

Pagination Strategies for Complex (slow) Datasets

What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don't want to hit the DB for every page view. Currently there are ~45k rows returned by this query.
Here are some of the approaches I've considered:
Cache the row count and update it every X minutes
Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
Get the row count once and store it in the user's session
Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:
Multi-column indexes. I should have done this first, before attempting a materialized view.
Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table, and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write-contended mytable down to a minimum, and the pagination queries could hammer away at the atbat materialized view. I've simplified the above, leaving out the actual manipulation, which isn't important here.
Memcache. I then created a wrapper around my database connection to cache these paginated results in memcache. This was a huge performance win. However, it was still not good enough.
Batch generation. I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all the pages, from the oldest changed record to the most recent, onto the webserver's filesystem. With a bit of mod_rewrite, I could check whether a page existed on disk and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE count(*): When using MyISAM tables, COUNT didn't create a problem once I was able to reduce the amount of read-write contention on the table. If I were using InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
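A sketch of such a trigger pair; $dbh stands for a PDO connection, and row_counts plus mytable are placeholder names:
// adjacent single-row counter table, seeded once with the current count
$dbh->exec("CREATE TABLE IF NOT EXISTS row_counts (tbl VARCHAR(64) PRIMARY KEY, cnt BIGINT NOT NULL)");
$dbh->exec("INSERT IGNORE INTO row_counts VALUES ('mytable', (SELECT COUNT(*) FROM mytable))");

// keep it current with +1 / -1 triggers
$dbh->exec("CREATE TRIGGER mytable_ai AFTER INSERT ON mytable
            FOR EACH ROW UPDATE row_counts SET cnt = cnt + 1 WHERE tbl = 'mytable'");
$dbh->exec("CREATE TRIGGER mytable_ad AFTER DELETE ON mytable
            FOR EACH ROW UPDATE row_counts SET cnt = cnt - 1 WHERE tbl = 'mytable'");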
RE page-pickers (thumbwheels): When I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch-generating the pages, I was using temporary tables, so computing the thumbwheel was no problem. A lot of the thumbwheel calculation was simplified because it became a predictable filesystem pattern that only needed the largest page number; the smallest page number was always 1.
Windowed thumbwheel: The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, as long as you know your maximum number of pages.
My suggestion is to ask MySQL for one row more than you need in each query, and decide based on the number of rows in the result set whether or not to show the next-page link.
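For example (a sketch; the SELECT below stands in for the real, more complex query, which only needs its LIMIT changed):
$perPage = 25;
$page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$offset  = ($page - 1) * $perPage;

$stmt = $dbh->prepare("SELECT id, title FROM mytable ORDER BY id LIMIT ? OFFSET ?");
$stmt->bindValue(1, $perPage + 1, PDO::PARAM_INT); // ask for one extra, sentinel row
$stmt->bindValue(2, $offset, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

$hasNextPage = count($rows) > $perPage;
if ($hasNextPage) {
    array_pop($rows); // the sentinel row is not displayed
}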
MySQL has a specific mechanism to compute an approximated count of a result set without the LIMIT clause: FOUND_ROWS().
MySQL is quite good in optimizing LIMIT queries.
That means it picks appropriate join buffer, filesort buffer etc just enough to satisfy LIMIT clause.
Also note that with 45k rows you probably don't need an exact count. Approximate counts can be figured out using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) *
(
SELECT COUNT(*)
FROM mytable
) / 1000
FROM (
SELECT 1
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
LIMIT 1000
) AS sample
This is much more efficient in MyISAM.
If you give an example of your complex query, probably I can say something more definite on how to improve its pagination.
I'm by no means a MySQL expert, but perhaps giving up the COUNT(*) and going ahead with COUNT(id)?
