Scalable chat room using PHP/MySQL? - php

Forgive the lack of a question.
I'm trying to build a website with the same functions as a chat room. The expectation of 5-50 viewers in each room (with thousands of rooms) is very real, and only about 1% of each room would actually be chatting.
I've had some ideas, but everything I've come up with seems like it would require a crazy amount of processing power... What would be an efficient way to do this?

There are specific programs designed for this purpose (ircd, see http://www.atheme.org/project/charybdis and similar.) However, if you really wish to reinvent the wheel, you will likely want a hosting solution that has a decent amount of physical RAM, and shared memory extensions (ex: APC.)
Shared memory functionality (APC in this case) will be the fastest way to keep everyone's conversations in sync, without the hard drive spinning up too much or MySQL spiraling out of control. You should be able to accommodate hundreds of concurrent requests this way without the server breaking a sweat, since it doesn't tax MySQL. It reads almost directly off the RAM chips.
You can key-store individual channels for conversations (ex: "channel-#welcome") and poll them directly via AJAX. See apc_store, apc_add and apc_fetch for more information.
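For illustration, a minimal sketch of such an APC-backed channel buffer (the function names, message cap, and key scheme are assumptions, and the fetch-modify-store cycle is not atomic, so a busy room would need locking or per-message keys):
<?php
// Shared-memory message buffer for one chat channel, polled via AJAX.
define('MAX_MESSAGES', 200);            // assumed cap so RAM use stays bounded

function channel_key($channel) {
    return 'channel-' . $channel;       // e.g. "channel-#welcome"
}

// Append a message to the channel's buffer in APC.
function post_message($channel, $nick, $text) {
    $key = channel_key($channel);
    $messages = apc_fetch($key);
    if ($messages === false) {
        $messages = array();
    }
    $messages[] = array('ts' => microtime(true), 'nick' => $nick, 'text' => $text);
    $messages = array_slice($messages, -MAX_MESSAGES);
    apc_store($key, $messages);
}

// Return messages newer than $since; the AJAX poller passes its last-seen timestamp.
function fetch_messages($channel, $since = 0) {
    $messages = apc_fetch(channel_key($channel));
    if ($messages === false) {
        return array();
    }
    return array_values(array_filter($messages, function ($m) use ($since) {
        return $m['ts'] > $since;
    }));
}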
Even if you end up storing conversations in MySQL for whatever reason, it's still preferable to use some kind of memory caching for reading, since that takes tremendous load off of the database server.
If you do it this way, it's best to make your tables InnoDB, since InnoDB uses row-level locking and won't lock the whole table during writes. Using APC, your limiting factor will be the amount of RAM and the length of the conversations you intend to keep in the shared buffer.

You've asked a really broad question, but:
Store each message as a row in your database, and use AJAX to reload the chat window content with the last few messages, e.g.
SELECT * FROM `chat_messages` WHERE `room_id` = 'ID' ORDER BY `id` DESC LIMIT 100
This will select the 100 most recent messages for the chat room. Loop over the results and display the messages as you want.
If your database user has permissions to create tables, you could also dynamically create a table for each chat room (which would be a lot faster performance-wise).
You'd then simply have an input or textarea in a form, that when submitted, inserts a new row to the database (which will show up to everyone next time the chat window is reloaded).
Another, more optimised approach would be to return only new messages on each query: store the timestamp of each message in the database, keep the timestamp of the last request locally in JavaScript, then use a query like:
SELECT * FROM `chat_messages` WHERE `room_id` = 'ID' AND `timestamp` > 'LAST_REQUEST' ORDER BY `id` DESC LIMIT 100
Then append the result to the chat window, rather than replacing it.
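A minimal sketch of such an endpoint (PDO; the user/message column names and a UNIX-timestamp `timestamp` column are assumptions, and rows are returned oldest-first so the client can append them in order):
<?php
// AJAX endpoint: return only the messages posted since the client's last request.
$pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'pass');

$roomId      = (int) $_GET['room_id'];
$lastRequest = isset($_GET['since']) ? (int) $_GET['since'] : 0;

$stmt = $pdo->prepare(
    'SELECT `id`, `user`, `message`, `timestamp`
       FROM `chat_messages`
      WHERE `room_id` = :room AND `timestamp` > :since
   ORDER BY `id` ASC
      LIMIT 100'
);
$stmt->execute(array(':room' => $roomId, ':since' => $lastRequest));

// The client appends these rows to the chat window and remembers the newest
// timestamp for its next poll.
header('Content-Type: application/json');
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));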

Related

How to determine the cause of AJAX delay on my site?

I have an AJAX search facility on my website. When I search for something on the live site, the page freezes for a short period until the results come back (the tables have no more than 20 entries); nothing else on the website is clickable, although it doesn't freeze the computer and I can still click other tabs in the browser.
I am using this query in MySQL/InnoDB, which takes 0.031 sec to run:
select * from members m where
memberID LIKE '%$keyword%' OR
memberName LIKE '%$keyword%' AND
memberTypeID=2;
I think it is partly related to the connection to the server and partly to the server's performance. How can I improve this?
I use Bootstrap pagination to put all the data in a paginated table that has search, sort, page, and entries-per-page options, all of which are done on the client side.
There is no exact method to determine how powerful your server should be, but you can always make predictions.
Solution 1:
If you are using complex database queries you should go for a dedicated powerful server.
Solution 2: You could use an asynchronous AJAX call to the database so other things keep loading and the page doesn't hang.
More about AJAX call: Get data from mysql database using php and jquery ajax
Mostly it depends on two things.
Size of the website (each time someone opens the website, your bandwidth is consumed).
Memory used by SQL queries (suppose you have 100,000 records in a table; a simple SELECT query won't consume that much, but if your query is complex, contains joins, and is not optimized properly, then it may consume a lot).
In my experience:
A 2 MB website with 4 tables of approximately 50,000 records each can work fine on shared hosting if you have fewer than 50 users an hour.
Anything above that and you need a virtual private server. (This is not fixed; a few hosts provide very powerful shared hosting, but the price is also high.)
In your case I guess you are using free hosting or your localhost.

How to build a proper Database for a traffic analytics system?

How should I build a proper structure for an analytics service? Currently I have one table that stores data about every user who visits a page carrying my client's ID, so that later my clients will be able to see the statistics for a specific date.
I've thought about it a bit today and I'm wondering: let's say I have 1,000 users and each has around 1,000 impressions on their sites daily, which means I get 1,000,000 (1M) new records every day in a single table. How will it perform after 2 months or so (when the table reaches 60 million records)?
I just think that after some time it will have so many records that the PHP queries to pull out the data will be really heavy, slow, and resource-intensive. Is that true, and how can I prevent it?
A friend of mine is working on something similar and he is going to create a new table for every client. Is this the correct way to go?
Thanks!
The problem you are facing is an I/O-bound system. One million records a day is roughly 12 write queries per second. That's achievable, but pulling data out while writing at the same time will make your system bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing, such as: use an appropriate database engine (InnoDB and not MyISAM), make sure you have a fast enough HDD subsystem (RAID, not single regular drives, since they can and will fail at some point), design your database optimally, inspect your queries with EXPLAIN to see where you might have gone wrong with them, and maybe even use a different storage engine - personally, I'd use TokuDB if I were you.
And also, I sincerely hope you'll be doing your querying, sorting, and filtering on the database side and not on the PHP side.
Consider this link to the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
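A rough sketch of that log-then-batch approach (the log path, table, and column names are illustrative):
<?php
// On each impression: append one line to a log file instead of hitting the database.
function log_impression($clientId, $pageId) {
    $line = implode("\t", array(time(), (int) $clientId, (int) $pageId)) . "\n";
    file_put_contents('/var/log/myapp/impressions.log', $line, FILE_APPEND | LOCK_EX);
}

// Cron job (e.g. nightly, or whenever traffic is low): rotate the log, then bulk-insert it.
function flush_impressions(PDO $pdo) {
    $log = '/var/log/myapp/impressions.log';
    if (!file_exists($log)) {
        return;
    }
    $batch = $log . '.' . date('YmdHis');
    rename($log, $batch);                       // new impressions go to a fresh file

    $stmt = $pdo->prepare(
        'INSERT INTO impressions (visited_at, client_id, page_id) VALUES (?, ?, ?)'
    );
    $pdo->beginTransaction();                   // one transaction instead of a million tiny ones
    foreach (file($batch, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $stmt->execute(explode("\t", $line));
    }
    $pdo->commit();
    unlink($batch);
}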
You could normalize the impressions data like this:
Client Table
{
ID
Name
}
Pages Table
{
ID
Page_Name
}
PagesClientsVisits Table
{
ID
Client_ID
Page_ID
Visits
}
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages)
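The write path for that schema is then a single upsert per impression; a sketch (assuming a UNIQUE key on (Client_ID, Page_ID)):
<?php
// Record one impression: insert the counter row if missing, otherwise bump it.
function record_impression(PDO $pdo, $clientId, $pageId) {
    $stmt = $pdo->prepare(
        'INSERT INTO PagesClientsVisits (Client_ID, Page_ID, Visits)
         VALUES (:client, :page, 1)
         ON DUPLICATE KEY UPDATE Visits = Visits + 1'
    );
    $stmt->execute(array(':client' => $clientId, ':page' => $pageId));
}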
Having a table with 60 million records can be OK. That is what a database is for. But you should be careful about how many fields you have in the table, and also about what datatype (and therefore size) each field has.
You will be creating some kind of reports from the data. Think about what data you really need for those reports. For example, you might only need the number of visits per user on every page. A simple count would do the trick.
What you can also do is generate the reports every night and delete the raw data afterwards.
So, read and think about it.

MySQL count rows using filters on a high-traffic database

Let's say you have a search form with multiple select fields. A user selects an option from a dropdown, but before he submits the data I need to display the count of the matching rows in the database.
So let's say the site has at least 300k (300,000) visitors a day, and a user selects options from the form at least 40 times per visit; that would mean 12M AJAX requests + 12M count queries on the database, which seems a bit too much.
The question is how one can implement a fast count (using PHP (Zend Framework) and MySQL) so that the additional 12M queries on the database won't affect the load of the site.
One solution would be to have a table that stores all combinations of select fields and their respective counts (when a product is added to or deleted from the products table, the count table would be updated). This is not such a good idea, though, when for 8 filters (select options) out of 43 there would be 8M+ rows to insert and manage.
Any other thoughts on how to achieve this?
p.s. I don't need code examples but the idea itself that would work in this scenario.
I would probably have a pre-calculated table, as you suggest yourself. What is important is that you have a smart mechanism for two things:
Easily querying which entries are affected by which change.
Having a unique lookup field for an entire form request.
The 8M entries wouldn't be very significant if you have solid keys, as you would only require a direct lookup.
I would go through the trouble of writing specific updates for this table in all the places where it is necessary. Even with the high volume of changes, this is still efficient. If done correctly you will know which rows you need to update or invalidate when inserting/updating/deleting a product.
Sidenote:
Based on your comment: if you need to add code in eight places to cover all the spots where something can be deleted, it might be a good time to refactor and centralize some code.
There are a few scenarios:
MySQL has a query cache; you don't have to bother with your own caching if the table isn't updated that frequently.
99% of users won't care how many results matched; they just need the top few records.
Use EXPLAIN - EXPLAIN will report how many rows the query is expected to match. It is not 100% precise, but it should be good enough to act as a rough row count.
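For instance, a minimal sketch of reading that estimate from PHP (PDO; the table, columns, and filter values here are only placeholders):
<?php
// Rough row count via EXPLAIN: read the optimizer's "rows" estimate instead of
// running a full COUNT(*). Not exact, but cheap.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$plan = $pdo->query(
    'EXPLAIN SELECT id FROM products WHERE category_id = 12 AND price < 100'
)->fetch(PDO::FETCH_ASSOC);
$estimate = (int) $plan['rows'];   // one plan row for a single-table query
echo 'Roughly ' . $estimate . ' matching rows';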
Not really what you asked for, but since you have a lot of options and want to count the items available based on the options you should take a look at Lucene and its faceted search. It was made to solve problems like this.
If you do not need up-to-date information from the search, you can use a queue system to push updates and inserts to Lucene every now and then (so you don't have to bother Lucene with a couple of thousand updates and inserts every day).
You really only have three options, and no amount of searching is likely to reveal a fourth:
Count the results manually. O(n) with the total number of the results at query-time.
Store and maintain counts for every combination of filters. O(1) to retrieve the count, but requires O(2^n) storage and O(2^n) time to update all the counts when records change.
Cache counts, only calculating them (per #1) when they're not found in the cache. O(1) when data is in the cache, O(n) otherwise.
It's for this reason that systems that have to scale beyond the trivial - that is, most of them - either cap the number of results they'll count (eg, items in your GMail inbox or unread in Google Reader), estimate the count based on statistics (eg, Google search result counts), or both.
I suppose it's possible you might actually require an exact count for your users, with no limitation, but it's hard to envisage a scenario where that might actually be necessary.
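A minimal sketch of option 3 (cache the count, fall back to counting on a miss), assuming the Memcached extension; the key scheme, 60-second TTL, and products table are illustrative:
<?php
// Option 3: O(1) cached count when possible, O(n) COUNT(*) only on a cache miss.
function filtered_count(PDO $pdo, Memcached $cache, array $filters) {
    ksort($filters);                                 // stable key regardless of filter order
    $key = 'count:' . md5(serialize($filters));

    $count = $cache->get($key);
    if ($count !== false) {
        return (int) $count;                         // cache hit
    }

    // Cache miss: run the expensive count once, then remember it briefly.
    $where  = array();
    $params = array();
    foreach ($filters as $column => $value) {
        $where[]  = "`$column` = ?";                 // $column must come from a whitelist
        $params[] = $value;
    }
    $sql  = 'SELECT COUNT(*) FROM products' . ($where ? ' WHERE ' . implode(' AND ', $where) : '');
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $count = (int) $stmt->fetchColumn();

    $cache->set($key, $count, 60);                   // counts may be up to 60s stale
    return $count;
}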
I would suggest a separate table that caches the counts, combined with triggers.
In order for it to be fast you make it a memory table and you update it using triggers on the inserts, deletes and updates.
For example (note that OPTION is a reserved word in MySQL, hence the backticks):
CREATE TABLE counts (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `option` INT NOT NULL,
  user_id INT NOT NULL,
  rowcount INT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE KEY user_option (user_id, `option`) USING HASH
) ENGINE = MEMORY;
DELIMITER $$
CREATE TRIGGER ai_tablex_each AFTER UPDATE ON tablex FOR EACH ROW
BEGIN
  IF (old.`option` <> new.`option`) OR (old.user_id <> new.user_id) THEN
    UPDATE counts c SET c.rowcount = c.rowcount - 1
      WHERE c.user_id = old.user_id AND c.`option` = old.`option`;
    INSERT INTO counts (rowcount, user_id, `option`)
      VALUES (1, new.user_id, new.`option`)
      ON DUPLICATE KEY UPDATE rowcount = rowcount + 1;
  END IF;
END $$
DELIMITER ;
Selection of the counts will be instant, and the updates in the trigger should not take very long either because you're using a memory table with hash indexes which have O(1) lookup time.
Links:
Memory engine: http://dev.mysql.com/doc/refman/5.5/en/memory-storage-engine.html
Triggers: http://dev.mysql.com/doc/refman/5.5/en/triggers.html
A few things you can easily optimise:
Cache all you can allow yourself to cache. The options for your dropdowns, for example, do they need to be fetched by ajax calls? This page answered many of my questions when I implemented memcache, and of course memcached.org has great documentation available too.
Serve anything that can be served statically. I.e., options that don't change frequently could be dumped to a flat file as an array by a cron job every hour, for example, and included by the script at runtime (see the sketch below).
MySQL with default configuration settings is often sub-optimal for any serious application load and should be tweaked to fit the needs of the task at hand. Maybe look into the MEMORY engine for high-performance read access.
You can have a look at these 3 great-but-very-technical posts on materialized views, as a matter of fact that whole blog is truly a goldmine of performance tips for mysql.
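As a rough sketch of the flat-file idea above (paths, the options query, and the cron schedule are assumptions):
<?php
// Cron script (e.g. hourly): dump the dropdown options to an includable PHP file.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$options = $pdo->query('SELECT id, label FROM filter_options')->fetchAll(PDO::FETCH_ASSOC);
file_put_contents(
    '/var/cache/myapp/options.php',
    '<?php return ' . var_export($options, true) . ';'
);

// At runtime: no database hit at all.
$options = include '/var/cache/myapp/options.php';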
Good luck!
Presumably you're using AJAX to make the call to the back end that you're talking about. Use some kind of cached flat file as an intermediate for the data. Set an expiry time of 5 seconds, or whatever is appropriate. Name the data file after the query key=value string. In the AJAX request, if the data file is older than your cooldown time, refresh it; if not, use the value stored in the data file.
Also, you might be underestimating the strength of the mysql query cache mechanism. If you're using mysql query cache, I doubt there would be any significant performance dip over doing it the way I just described. If the query was being query cached by mysql then virtually the only slowdown effect would be from the network layer between your application and mysql.
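A minimal sketch of that intermediate cache (the key scheme and 5-second cooldown follow the answer; $computeCount stands in for whatever actually queries MySQL):
<?php
// Flat-file count cache keyed by the query string, refreshed after a short cooldown.
function cached_count(array $params, callable $computeCount, $ttl = 5) {
    ksort($params);
    $file = sys_get_temp_dir() . '/count_' . md5(http_build_query($params)) . '.txt';

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return (int) file_get_contents($file);       // still fresh, reuse it
    }

    $count = (int) $computeCount($params);           // expensive path: hit MySQL
    file_put_contents($file, (string) $count, LOCK_EX);
    return $count;
}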
Consider what role replication can play in your architecture. If you need to scale out, you might consider replicating your tables from InnoDB to MyISAM. The MyISAM engine automatically maintains a table count if you are doing count(*) queries. If you are doing count(col) where queries, then you need to rely heavily on well designed indices. In that case your count queries might take shape like so:
alter table A add index ixA (a, b);
select count(a) from A use index (ixA) where a=1 and b=2;
I feel crazy for suggesting this as it seems that no-one else has, but have you considered client-side caching? JavaScript isn't terrible at dealing with large lists, especially if they're relatively simple lists.
I know that your ideal is that you have a desire to make the numbers completely accurate, but heuristics are your friend here, especially since synchronization will never be 100% -- a slow connection or high latency due to server-side traffic will make the AJAX request out of date, especially if that data is not a constant. IF THE DATA CAN BE EDITED BY OTHER USERS, SYNCHRONICITY IS IMPOSSIBLE USING AJAX. IF IT CANNOT BE EDITED BY ANYONE ELSE, THEN CLIENT-SIDE CACHING WILL WORK AND IS LIKELY YOUR BEST OPTION. Oh, and if you're using some sort of port connection, then whatever is pushing to the server can simply update all of the other clients until a sync can be accomplished.
If you're willing to do that form of caching, you can also cache the results on the server too and simply refresh the query periodically.
As others have suggested, you really need some sort of caching mechanism on the server side. Whether it's a MySQL table or memcache, either would work. But to reduce the number of calls to the server, retrieve the full list of cached counts in one request and cache that locally in javascript. That's a pretty simple way to eliminate almost 12M server hits.
You could probably even store the count information in a cookie which expires in an hour, so subsequent page loads don't need to query again. That's if you don't need real time numbers.
Many of the latest browsers also support local storage, which doesn't get passed to the server with every request like cookies do.
You can fit a lot of data into a 1-2K json data structure. So even if you have thousands of possible count options, that is still smaller than your typical image. Just keep in mind maximum cookie sizes if you use cookie caching.

Scalably processing large amounts of complicated database data in PHP, many times a day

I'm soon to be working on a project that poses a problem for me.
It's going to require, at regular intervals throughout the day, processing tens of thousands of records, potentially over a million. Processing is going to involve several (potentially complicated) formulas and the generation of several random factors, writing some new data to a separate table, and updating the original records with some results. This needs to occur for all records, ideally, every three hours. Each new user to the site will be adding between 50 and 500 records that need to be processed in such a fashion, so the number will not be steady.
The code hasn't been written, yet, as I'm still in the design process, mostly because of this issue. I know I'm going to need to use cron jobs, but I'm concerned that processing records of this size may cause the site to freeze up, perform slowly, or just piss off my hosting company every three hours.
I'd like to know if anyone has any experience or tips on similar subjects? I've never worked at this magnitude before, and for all I know, this will be trivial to the server and not pose much of an issue. As long as ALL records are processed before the next three hour period occurs, I don't care if they aren't processed simultaneously (though, ideally, all records belonging to a specific user should be processed in the same batch), so I've been wondering if I should process in batches every 5 minutes, 15 minutes, hour, whatever works, and how best to approach this (and make it scalable in a way that is fair to all users)?
Below I am going to describe how I would approach this problem (though it will cost you money and may not be the desired solution):
You should use a VPS (a quick listing of some cheap VPSes). But I guess you should do some more research to find the best VPS for your needs, if you want to achieve your task without pissing off your hosting company (I am sure you will).
You should not use cron jobs but instead use a message queue like, for example, beanstalkd to queue up your messages (tasks) and do the processing offline. When using a message queue you could also throttle your processing if needed.
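As a rough illustration of that producer/worker split (a sketch assuming the pda/pheanstalk client; the calls follow its 3.x API and may differ in newer versions, and process_user_records is a made-up placeholder for your actual processing):
<?php
require 'vendor/autoload.php';
use Pheanstalk\Pheanstalk;

// Producer (inside the web request): enqueue work instead of doing it inline.
$queue = new Pheanstalk('127.0.0.1');
$queue->useTube('record-processing')
      ->put(json_encode(array('user_id' => 42)));   // payload: which user's records to process

// Worker (a separate long-running process, possibly on another VPS):
$queue->watch('record-processing');
while (true) {
    $job  = $queue->reserve();                      // blocks until a job is available
    $task = json_decode($job->getData(), true);
    process_user_records($task['user_id']);         // placeholder for the heavy formulas/updates
    $queue->delete($job);
}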
Not really necessary, but I would tackle it in this way.
If performance were really a key issue I would have (at least) two VPS instances: one VPS instance to handle the HTTP requests from the users visiting your site, and one VPS instance to do the offline processing you desire. This way your users/visitors will not notice any heavy offline processing you are doing.
I also would probably not use PHP for the offline processing because of its blocking nature. I would use something like node.js for this kind of processing, because nothing blocks in node.js, which is going to be a lot faster.
I also would probably not store the data in a relational database but use the lightning fast redis as a datastore. node_redis is a blazingly fast client for node.js
The problem with many updates on MySQL tables that are used on a website is that updating data kills your query cache. This means your site will slow down significantly, even after your update is complete.
A solution we have used before, is to have two MySQL databases (on different servers too, in our case). Only one of them is actively used by the web server. The other is just a fallback and is used for these kind of updates. The two servers replicate their data to one another.
The solution:
Replication is stopped.
The website is told to use Database1.
The large updates you mention are run on Database2.
Many commonly used queries are executed once on Database2 to warm up the query cache.
The server is told to use Database2.
Replication is started again. Database2 is now used mainly for reading (by both the website and the replication), so there isn't much delay on the websites.
It could be done using many servers, where each server handles X records/hour. The more records you will be processing in the future, the more servers you will need; otherwise you might end up with a million records being processed while the previous 2-3 (or even 4) processing runs are still not finished...
You might want to consider what kind of database to use. Maybe a relational database isn't the best for this?
The only way to find out is to actually run some benchmarks simulating what you're going to do, though.
In this situation I would consider using Gearman (which also has a PHP extension but can be used with many languages)
Do it all server side using a stored procedure that selects subsets of data then processes the data internally.
Here's an example that uses a cursor to select ranges of data:
drop procedure if exists batch_update;
delimiter #
create procedure batch_update
(
in p_from_id int unsigned, -- range of data to select for each batch
in p_to_id int unsigned
)
begin
declare v_id int unsigned;
declare v_val double(10,4);
declare v_done tinyint default 0;
declare v_cur cursor for select id, val from foo where id between p_from_id and p_to_id;
declare continue handler for not found set v_done = 1;
start transaction;
open v_cur;
repeat
fetch v_cur into v_id, v_val;
-- do work...
if v_val < 0 then
update foo set...
else
insert into foo...
end if;
until v_done end repeat;
close v_cur;
commit;
end #
delimiter ;
call batch_update(1,10000);
call batch_update(10001, 20000);
call batch_update(20001, 30000);
If you can avoid using cursors at all - great, but the main point of my suggestion is about moving the logic from your application tier back into the data tier. I suggest you create a prototype stored procedure in your database and then perform some benchmarks. If the procedure executes in a few seconds then I don't see you having many issues, especially if you're using InnoDB tables with transactions.
Here's another example which may prove of interest, although it works on a much larger dataset of 50+ million rows:
Optimal MySQL settings for queries that deliver large amounts of data?
Hope this helps :)

MySQL vs Web Server for processing data

I was wondering whether it's faster to process data in MySQL or in a server language like PHP or Python. I'm sure native operations like ORDER BY will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (where ties return multiple entries as having the same rank):
Sample SQL
SELECT t1.TORCH_ID,
       t1.distance AS thisscore,
       (SELECT COUNT(DISTINCT t2.distance) + 1
          FROM torch_info t2
         WHERE t2.distance > t1.distance) AS rank
FROM torch_info t1
ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figuring out the rank in PHP on the web server.
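For concreteness, the PHP-side ranking would look roughly like this (a sketch; it assigns dense ranks so that ties share a rank, matching the COUNT(DISTINCT ...) logic of the sample SQL, and the column names follow that query):
<?php
// Assign dense ranks to rows already sorted by distance descending.
// $rows is assumed to come from: SELECT TORCH_ID, distance FROM torch_info ORDER BY distance DESC
function add_ranks(array $rows) {
    $rank = 0;
    $previous = null;
    foreach ($rows as &$row) {
        if ($row['distance'] !== $previous) {   // new distinct score => next rank
            $rank++;
            $previous = $row['distance'];
        }
        $row['rank'] = $rank;                   // ties keep the same rank
    }
    unset($row);
    return $rows;
}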
Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. If you don't have that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.
In general, the answer to the "Should I process data in the database, or on the web server question" is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just SELECT * WHERE x = 3
As you pointed out, the database is optimized for operation on its data, using indexes, etc.
The speed of the count is going to depend on which DB storage engine you are using and the size of the table. Though I suspect that nearly every count and rank done in mySQL would be faster than pulling that same data into PHP memory and doing the same operation.
Ranking is based on count and order. So if you can do those operations faster, then rank will obviously be faster.
A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that mySQL is faster than server side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update that as each track record comes in. This will result in a lot of minor hits to your database, versus a number of "heavier" hits to your database.
So let's say you have 10,000 records, 1,000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB doing 100 updates in which 10% of them hit every record (9,999) than have the ranking query get hit 1,000 times a day.
My two cents.
If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times better performance. (I'm assuming you're using an ODBC DSN here in your tests.)
