I have a curious question...
I wanted to know how to maintain chat data in a database.
I have been using a PHP/MySQL application that stores users' chat data in a database.
Now my question is: if the chat data grows to, say, millions of records, how should I store it? Does MySQL support that, or does it have limitations or become a burden?
Take the example of Gmail chat: I can chat without limit and can also retrieve all my previous chat data. How is that possible?
Can anyone answer this question of mine?
Chat history isn't really that heavyweight. If I assume around 100 bytes per message, 6 messages per minute, and 5 hours per day (that is a very talkative chatter), every single day, as a worst case, that works out to roughly 63 MB per user per year (!).
That means with 1 million such talkative chatters (very improbable) you would need around 60 TB of data storage.
Since this is a worst-case calculation, I would start off with at most 1 TB of storage, set up the database, and see how things go. It is highly improbable for a very young service to grow that fast.
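For what it's worth, here is the back-of-the-envelope arithmetic spelled out in PHP (the per-message numbers are just the assumptions from above, not measurements):

<?php
// Rough worst-case storage estimate; all inputs are assumptions.
$bytesPerMessage   = 100;
$messagesPerMinute = 6;
$hoursPerDay       = 5;

$bytesPerUserPerYear = $bytesPerMessage * $messagesPerMinute * 60 * $hoursPerDay * 365;
echo round($bytesPerUserPerYear / (1024 ** 2)) . " MB per user per year\n";        // ~63 MB

$talkativeUsers = 1000000;
echo round($talkativeUsers * $bytesPerUserPerYear / (1024 ** 4)) . " TB total\n";  // ~60 TB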
Also, I would personally not recommend using a Windows system for something like this unless you know very well what you're doing. MySQL on a Debian system will store billions of records, and probably do so faster due to fewer OS-level limitations (see the MySQL documentation for details; there should be a section about the limitations on Windows).
MySQL will happily store millions, even billions of records, but some of the numeric types won't be big enough: see the MySQL documentation for the maxima of the numeric types. Given those limits, it is better to use BIGINT UNSIGNED for, e.g., auto-increment fields.
Performance may become a problem for large tables, but that can mostly be addressed with indexes (for what it's worth, I've seen performance start to degrade somewhere around the 100 GB mark in a similar situation).
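As an illustration only (table and column names are made up, not a prescription), a chat-log table along those lines created from PHP with PDO might look like this:

<?php
// Hypothetical chat-log table: BIGINT UNSIGNED auto-increment key plus an index
// on the columns you will actually filter and sort on (user + timestamp).
$pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'password');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS chat_messages (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id BIGINT UNSIGNED NOT NULL,
        peer_id BIGINT UNSIGNED NOT NULL,
        sent_at DATETIME        NOT NULL,
        body    TEXT            NOT NULL,
        INDEX idx_user_time (user_id, sent_at)
    ) ENGINE=InnoDB
");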
Google has vast amounts of custom storage designed for its own requirements. What I suggest is that you pin down your requirements more concretely and then determine the platform you need.
I am planning to offer a simple phrase-lookup web service. The number of phrases is around 3-5 million entries, and each entry is less than 50 characters. The service is quite simple: if a phrase given by the user is in the list, return true; otherwise return false. So on the server side, should I simply put the entries in a MySQL database and query it whenever a user phrase comes in? Is there any other implementation that is more efficient, i.e. faster, lighter on memory, more scalable in terms of query volume, etc.? Or should I load it all into memory?
Any feedback, ideas, opinions would be appreciated!
MySQL should handle 5 million entries without any trouble.
If I were you, I'd build the database and test it locally with 5 million and with 50 million rows of pseudorandom data.
If you're able to simulate a lot of concurrent connections, then do that. If not, just roll it out to, say, 100 interested users.
Keep records. Use analytics.
Check out Memcached and Redis. Easy to use and fast.
It really depends on how many concurrent users you'll have.
If you have lots of requests per second, keep the data in memory (it looks like you would need plenty of it), with memcached for example.
Else, use a database.
Or use both: if the data is not in the cache (memcached), load it from the database and keep it in memory for future requests.
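A minimal cache-aside sketch of that with the Memcached extension and PDO (the table and key names are made up):

<?php
// Check memcached first, fall back to MySQL on a miss, then cache the answer.
function phraseExists(Memcached $mc, PDO $pdo, string $phrase): bool
{
    $key = 'phrase:' . md5($phrase);

    $cached = $mc->get($key);
    if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
        return (bool) $cached;                      // cache hit (true or false)
    }

    $stmt = $pdo->prepare('SELECT 1 FROM phrases WHERE phrase = ? LIMIT 1');
    $stmt->execute([$phrase]);
    $exists = (bool) $stmt->fetchColumn();

    $mc->set($key, $exists ? 1 : 0, 3600);          // remember the answer for an hour
    return $exists;
}

Set up the client once with $mc = new Memcached(); $mc->addServer('127.0.0.1', 11211); and reuse it across requests.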
I'm working on a full text index system for a project of mine. As one part of the process of indexing pages it splits the data into a very, very large number of very small pieces.
I have gotten the size of the pieces down to a constant 20-30 bytes, and it could be less; the actual data is basically two 8-byte integers and a float.
Because of the scale I'm aiming for and the number of pieces this creates, I'm looking for an alternative to MySQL, which has shown significant issues at data volumes well below my goal.
My current thinking is that a key-value store would be the best option for this and I have adjusted my code accordingly.
I have tried a number of them, but for some reason they all seem to scale even worse than MySQL.
I'm looking to store on the order of hundreds of millions or billions or more key-value pairs so I need something that won't have a large performance degradation with size.
I have tried memcachedb, membase, and mongo and while they were all easy enough to set up, none of them scaled that well for me.
membase had the most issues, due to the number of keys required and the limited memory available. Write speed is very important here, as this is a close-to-even read/write workload: I write an item once, then read it back a few times and store it for an eventual update.
I don't need much performance on deletes and I would prefer something that can cluster well as I'm hoping to eventually have this able to scale across machines but it needs to work on a single machine for now.
I'm also hoping to make this project easy to deploy, so an easy setup would be much better. The project is written in PHP, so it needs to be easily accessed from PHP.
I don't need rows or other higher-level abstractions; they are mostly useless in this case. Based on some of my other tests, I have already reworked the code down to a key-value store, and that seems likely to be the fastest, since I only have two values to retrieve from a row keyed off a third, so there is little extra work in using a key-value store. Does anyone know of any easy-to-use projects that can scale like this?
I am using this store to hold individual sets of three numbers (the sizes are based on how they were stored in MySQL; that may not hold for other storage engines): two 8-byte integers, one for the ID of the document and one for the ID of the word, and a float representing the proportion of the document that the word makes up (the number of times the word appears divided by the number of words in the document). The index for this data is the word ID and the range the document ID falls into; every time I retrieve this data, it will be all of the results for a given word ID. I currently turn the word ID, the range, and a counter for that word/range combination each into binary representations of the numbers and concatenate them to form the key, along with a two-digit number that says which value I am storing under that key: the document ID or the float value.
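To make that concrete, the key construction is along these lines (a simplified sketch; the exact byte widths in this example are illustrative):

<?php
// Key = word id + document-id range + per-range counter, each packed as an
// unsigned 64-bit big-endian integer ('J', PHP 5.6.3+), plus a 2-digit suffix
// naming which field is stored under the key.
function makeKey(int $wordId, int $docRange, int $counter, int $field): string
{
    return pack('JJJ', $wordId, $docRange, $counter) . sprintf('%02d', $field);
}

$keyForDocId = makeKey(42, 7, 1, 1);   // value stored: the document id
$keyForFloat = makeKey(42, 7, 1, 2);   // value stored: the word's proportion of the document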
Performance measurement was somewhat subjective: I watched the output of the processes putting data into or pulling data out of the storage to see how fast documents were being processed, kept refreshing my statistics counters (which track more accurate figures on how fast the system is working), and compared the differences between the storage methods.
You would need to provide some more data about what you really want to do...
Depending on how you define fast and large-scale, you have several options:
memcache
redis
voldemort
riak
and so on... the list gets pretty long.
Edit 1:
Per the comments on this post, I would say take a look at Cassandra or Voldemort. Cassandra isn't a simple KV store per se, since you can store much more complex objects than just K -> V.
If you want to use Cassandra from PHP, take a look at phpcassa. But Redis is also a good option if you set up a replica.
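For what it's worth, talking to Redis from PHP is only a few lines with the phpredis extension (host, port, and key are placeholders):

<?php
// Minimal phpredis usage: store and read back a key-value pair.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$redis->set('word:42:doc:1337', '0.0172');   // write once...
$value = $redis->get('word:42:doc:1337');    // ...read back whenever needed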
Here are a few products and ideas that weren't mentioned above:
OrientDB - this is a graph/document database, but you can use it to store very small "documents" - it is extremely fast, highly scalable, and optimized to handle vast amounts of records.
Berkeley DB - Berkeley DB is a key-value store used at the heart of a number of graph and document databases - supposedly has a SQLite-compatible API that works with PHP.
shmop - Shared memory operations might be one possible approach, if you're willing to do some dirty work. If your records are small and have a fixed size, this might work for you: use a fixed record size and pad with zeroes (a minimal sketch appears at the end of this answer).
handlersocket - this has been in development for a long time, and I don't know how reliable it is. It basically lets you use MySQL at a "lower level", almost like a key/value-store. Because you're bypassing the query parser etc. it's much faster than MySQL in general.
If you have a fixed record-size, few writes and lots of reads, you may even consider reading/writing to/from a flat file. Likely nowhere near as fast as reading/writing to shared memory, but it may be worth considering. I suggest you weigh all the pros/cons specifically for your project's requirements, not only for products, but for any approach you can think of. Your requirements aren't exactly "mainstream", and the solution may not be as obvious as picking the right product.
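Here is a minimal sketch of the fixed-record-size shared-memory idea (the key, sizes, and field layout are made up for illustration):

<?php
// Fixed-size records in shared memory: two unsigned 32-bit ints + one double = 16 bytes.
const RECORD_SIZE = 16;
const MAX_RECORDS = 1000000;

$shm = shmop_open(0x0ff1ce, 'c', 0644, RECORD_SIZE * MAX_RECORDS);

function writeRecord($shm, int $index, int $docId, int $wordId, float $proportion): void
{
    shmop_write($shm, pack('NNd', $docId, $wordId, $proportion), $index * RECORD_SIZE);
}

function readRecord($shm, int $index): array
{
    return unpack('Ndoc/Nword/dprop', shmop_read($shm, $index * RECORD_SIZE, RECORD_SIZE));
}

The same fixed-size layout would also work for a flat file with fseek()/fread(), just with file locking instead of shared memory.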
I'm building a news-rating script for a website that has a lot of users. I'm trying to make this website as efficient as possible, and now I'm wondering what would be the most efficient way to keep track of the votes. Of course I don't want users to vote more than once.
My first thought was to store it in my MySQL database, but I'm worried this would have a negative influence on my website's speed because this table would get quite big.
Would storing it in a database still be the best solution, or are there better solutions?
If you plan on having > 1,000,000 records, you should make sure the table's structure is efficient (which shouldn't be hard for your example) and that you index it correctly.
Memcached would be the simplest way to implement caching and is easy to scale if your site grows and more servers are necessary.
With a properly indexed vote table, you can keep reasonable performance regardless of how large your table is (of course, beyond a certain point, your tables will be too large to fit in cache, but that would involve having a very large number of users and items).
Add in some per-user caching (on the client, in $_SESSION, or using memcached) and you can get a very fast "no" response.
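A minimal sketch of such a vote table and insert (names and types are made up; adapt to your schema):

<?php
// The composite primary key both prevents double voting and makes the
// "has this user already voted on this article?" lookup an index hit.
$pdo = new PDO('mysql:host=localhost;dbname=news', 'user', 'password');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS votes (
        user_id    INT UNSIGNED NOT NULL,
        article_id INT UNSIGNED NOT NULL,
        vote       TINYINT      NOT NULL,
        created_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, article_id)
    ) ENGINE=InnoDB
");

// INSERT IGNORE silently drops a second vote by the same user on the same article.
$userId = 123; $articleId = 456;   // placeholders
$stmt = $pdo->prepare('INSERT IGNORE INTO votes (user_id, article_id, vote) VALUES (?, ?, ?)');
$stmt->execute([$userId, $articleId, 1]);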
Since you can't use memcached, I would say this: a decent database server (decent hardware + a decent DB implementation) should be able to handle this quite well. A single table with a physical index on article id and a second column representing the vote will handle a few googillion (yes, I made up the word) articles easily :P
Rationale:
Database servers maintain statistics -- read: self-tuning -- and only hot items (index + row entries) remain in memory.
Moral:
Don't worry about such things unless they become a problem -- i.e., if your company is the size of Facebook, then I would worry.
Memcached would be a very good way to do this. You need to synchronize from memcached to MySQL once in a while (I would do this with a pull model, using a cron script on your MySQL server).
Did you see this?
http://destiney.com/php#Destiney_rated_images
Demo here: http://ratedsite.com/
I was wondering whether it's faster to process data in MySQL or in a server-side language like PHP or Python. I'm sure native operations like ORDER BY will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (with ties returning multiple entries as having the same rank)?
Sample SQL
SELECT t1.TORCH_ID,
       t1.distance AS thisscore,
       (SELECT COUNT(DISTINCT t2.distance) + 1
          FROM torch_info t2
         WHERE t2.distance > t1.distance) AS rank
  FROM torch_info t1
 ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figuring out the rank in PHP on the web server.
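For concreteness, the PHP version I have in mind would be something like this (a sketch, assuming the rows come back already ordered, using the column names from the sample above):

<?php
// Compute ranks (ties share a rank) from rows already ordered by distance DESC.
// $pdo is an existing PDO connection.
$stmt = $pdo->query('SELECT TORCH_ID, distance FROM torch_info ORDER BY distance DESC');

$rank = 0;
$prev = null;
$ranked = [];
foreach ($stmt as $row) {
    if ($row['distance'] !== $prev) {   // new distinct score => next rank
        $rank++;
        $prev = $row['distance'];
    }
    $ranked[] = ['TORCH_ID' => $row['TORCH_ID'], 'rank' => $rank];
}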
Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. Without that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time the database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.
In general, the answer to the "Should I process data in the database, or on the web server question" is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data-transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just run SELECT * FROM your_table WHERE x = 3 instead.
As you pointed out, the database is optimized for operation on its data, using indexes, etc.
The speed of the count is going to depend on which DB storage engine you are using and the size of the table. That said, I suspect that nearly every count and rank done in MySQL would be faster than pulling that same data into PHP memory and doing the same operation there.
Ranking is based on counting and ordering, so if the database can do those operations faster, the rank calculation will obviously be faster too.
A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that mySQL is faster than server side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update it as each track record comes in. This results in a lot of minor hits to your database, versus a smaller number of "heavier" hits.
So let's say you have 10,000 records, 1,000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB do 100 updates a day, even if 10% of them touch nearly every record (9,999 rows), than have the ranking query run 1,000 times a day.
My two cents.
If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times better performance. (I'm assuming you're using an ODBC DSN in your tests.)
I've been coding PHP for a while now and have a pretty firm grip on it; MySQL, well, let's just say I can make it work.
I'd like to make a stats script to track the stats of other websites similar to the obvious statcounter, google analytics, mint, etc.
I, of course, would like to code this properly, and I don't see MySQL liking 20,000,000 to 80,000,000 inserts per day (roughly 925 inserts per second at the high end).
I've been doing some research, and it looks like I should store each visit ("entry") in a CSV or some other form of flat file and then import the data I need from it.
Am I on the right track here? I just need a push in the right direction, the direction being a way to take in 1,000 pseudo-"MySQL" inserts per second and the proper way of doing it.
Example Insert: IP, time(), http_referer, etc.
I need to collect this data for the day, and then at the end of the day, or in certain intervals, update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that of course, just trying to give a visualization since I'm horrible at explaining things.
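Something along these lines is what I mean by the end-of-day update (table and column names here are just placeholders):

<?php
// End-of-day cron job: count the day's unique visitors from the raw hit log
// and upsert a single summary row (daily_stats.stat_date is a unique key).
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'password');

$pdo->exec("
    INSERT INTO daily_stats (stat_date, unique_hits)
    SELECT DATE(hit_time), COUNT(DISTINCT ip)
      FROM raw_hits
     WHERE hit_time >= CURDATE() - INTERVAL 1 DAY
       AND hit_time <  CURDATE()
     GROUP BY DATE(hit_time)
    ON DUPLICATE KEY UPDATE unique_hits = VALUES(unique_hits)
");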
If anyone can help me, I'm a great coder, I would be more than willing to return the favor.
We tackled this at the place I've been working for the last year or so, over the summer. We didn't require much granularity in the information, so what worked very well for us was coalescing the data over different time periods. For example, we'd keep a single day's worth of real-time stats; after that it'd be pushed into some daily sums, and then off into a monthly table.
This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem is inserts however.
Start by setting out some sample schema in terms of EXACTLY what information you need to keep, and in doing so, you'll guide yourself (through revisions) to what will work for you.
Another note on the vast number of inserts: we had talked through the idea of dumping real-time statistics into a little daemon which would store up to an hour's worth of data and then, outside of real time, inject it into the database before the next hour was up. Just a thought.
For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propagate back up the line.
You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. This is something the database provides. Unfortunately, it comes at a price, one of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.
MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it", and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues, because it lets one dedicated thread in MySQL do the insert rather than whichever one you connect to. Unfortunately, it only works with MyISAM tables.
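A minimal sketch of what that looks like from PHP (the table and column names are made up; note that INSERT DELAYED was later deprecated and is ignored by newer MySQL versions):

<?php
// The query call returns as soon as MySQL queues the row; a handler thread
// performs the actual write when the (MyISAM) table is free.
$mysqli = new mysqli('localhost', 'user', 'password', 'stats');

$ip      = $mysqli->real_escape_string($_SERVER['REMOTE_ADDR']);
$referer = $mysqli->real_escape_string($_SERVER['HTTP_REFERER'] ?? '');

$mysqli->query("INSERT DELAYED INTO hits (ip, hit_time, referer)
                VALUES ('$ip', NOW(), '$referer')");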
Another solution, and a more general one, is to have a logging daemon that accepts your logging information and just enqueues it to wherever it has to go. The trick to making this fast is the enqueue step. This is the sort of solution syslogd would provide.
In my opinion it's a good idea to stick with MySQL for registering the visits, because it provides tools to analyze your data. To decrease the load, I would make the following suggestions.
Make a fast collecting table, with no indexes except the primary key, MyISAM, one row per hit.
Make a normalized data structure for the hits and move the records into it once a day.
This gives you a smaller performance hit for logging and a well-indexed, normalized structure for querying/analyzing.
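A rough sketch of that two-stage layout (all names here are invented; the normalized side is only hinted at):

<?php
// Stage 1: a bare MyISAM collecting table, cheap to insert into.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'password');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS hits_incoming (
        id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ip       VARBINARY(16)   NOT NULL,
        referer  VARCHAR(255)    NOT NULL,
        hit_time DATETIME        NOT NULL
    ) ENGINE=MyISAM
");

// Stage 2 (nightly cron): copy into the normalized, indexed tables, then empty
// the collector. In practice you would RENAME TABLE to swap in a fresh collector
// first, so no hits arrive between the copy and the truncate.
$pdo->exec("INSERT INTO hits_archive (ip, referer, hit_time)
            SELECT ip, referer, hit_time FROM hits_incoming");
$pdo->exec("TRUNCATE TABLE hits_incoming");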
Presuming that your MySQL server is on a different physical machine from your web server, then yes, it probably would be a bit more efficient to log the hit to a file on the local filesystem and then push those entries to the database periodically.
That would add some complexity, though. Have you tested or considered testing it with regular queries? I.e., increment a counter using an UPDATE query (because you don't need each entry in a separate row). You may find that this doesn't slow things down as much as you had thought, though obviously if you are pushing 80,000,000 page views a day you probably don't have much wiggle room at all.
You should be able to get that kind of volume quite easily, provided that you do some stuff sensibly. Here are some ideas.
You will need to partition your audit table on a regular (hourly? daily?) basis, if nothing else so that you can drop old partitions to manage space sensibly. DELETEing 10M rows is not cool.
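As an illustration of that (the names and dates are placeholders, and the exact partitioning scheme is up to you), dropping a whole day then becomes a quick metadata operation instead of a massive DELETE:

<?php
// Range-partitioned audit table; the primary key must include the partition column.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'password');

$pdo->exec("
    CREATE TABLE hits (
        id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        ip       VARBINARY(16)   NOT NULL,
        hit_time DATETIME        NOT NULL,
        PRIMARY KEY (id, hit_time)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(hit_time)) (
        PARTITION p20120101 VALUES LESS THAN (TO_DAYS('2012-01-02')),
        PARTITION p20120102 VALUES LESS THAN (TO_DAYS('2012-01-03')),
        PARTITION pmax      VALUES LESS THAN MAXVALUE
    )
");

// Reclaiming an old day's space:
$pdo->exec("ALTER TABLE hits DROP PARTITION p20120101");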
Your web servers (as you will be running quite a large farm, right?) will probably want to do the inserts in large batches, asynchronously. You'll have a daemon process which reads the flat-file logs on each web-server machine and batches them up. This is important for InnoDB performance and to avoid auditing slowing down the web servers. Moreover, if your database is unavailable, your web servers need to continue servicing web requests and still have them audited (eventually).
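A minimal sketch of such a batched flush (column names are placeholders; $pdo is an existing connection inside the daemon):

<?php
// Insert many buffered hits in one statement instead of one round trip per hit.
function flushBatch(PDO $pdo, array $rows): void   // $rows: [[ip, time, referer], ...]
{
    if (!$rows) {
        return;
    }
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO hits (ip, hit_time, referer) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows));
}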
As you're collecting large volumes of data, some summarisation is going to be required in order to report on it at a sensible speed - how you do this is very much a matter of taste. Make sensible summaries.
InnoDB engine tuning - you will need to tune the InnoDB engine quite significantly; in particular, have a look at the variables controlling its use of disc flushing. Writing out the log on each commit is not going to be cool (unless maybe it's on an SSD - if you need performance AND durability, consider an SSD for the logs) :) Ensure your buffer pool is big enough. Personally I'd use the InnoDB plugin and the file-per-table option, but you could also use MyISAM if you fully understand its characteristics and limitations.
I'm not going to explain any of the above further: if you have the developer skills on your team to build an application of that scale, you'll either know what it means or be capable of finding out.
Provided you don't have too many indexes, 1000 rows/sec is not unrealistic with your data sizes on modern hardware; we insert that many sometimes (and probably have a lot more indexes).
Remember to performance test it all on production-spec hardware (I don't really need to tell you this, right?).
I think that using MySQL is overkill for the task of collecting the logs and summarizing them. I'd stick to plain log files in your case. That does not provide the full power of a relational database, but it's quite enough to generate summaries. A simple lock-append-unlock file operation on a modern OS is seamless and instant. By contrast, using MySQL for the same simple operation loads the CPU and can lead to swapping and other scalability headaches.
Mind the storage as well. With plain text files you'll be able to store years of logs for a highly loaded website, given the current HDD price/capacity ratio and the compressibility of plain-text logs.
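For reference, the lock-append-unlock write is essentially a one-liner in PHP (the path and the logged fields are placeholders):

<?php
// Append one hit per line; LOCK_EX serialises concurrent writers.
$line = implode("\t", [
    $_SERVER['REMOTE_ADDR'],
    time(),
    $_SERVER['HTTP_REFERER'] ?? '-',
]) . "\n";

file_put_contents('/var/log/mysite/hits-' . date('Y-m-d') . '.log', $line,
                  FILE_APPEND | LOCK_EX);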