Best implementation of web service with a huge list lookup

I am planning to offer a simple phrase lookup web service. The number of phrases is around 3-5 million, and each entry is less than 50 characters. The service is quite simple: if a phrase given by the user is in the list, return true; otherwise return false. So on the server side, should I simply put the entries in a MySQL database and query it whenever a user phrase comes in? Is there any other implementation that is more efficient, i.e. faster, using less memory, more scalable in terms of query volume, etc.? Or should I load the list into memory?
Any feedback, ideas, opinions would be appreciated!

MySQL should handle 5 million entries without any trouble.
If I were you, I'd build the database and test it locally with 5 million and with 50 million rows of pseudorandom data.
If you're able to simulate a lot of concurrent connections, then do that. If not, just roll it out to, say, 100 interested users.
Keep records. Use analytics.

Check out Memcached and Redis. Easy to use and fast.

It really depends on how many concurrent users you'll have.
If you have lots of requests per second, keep the data in memory (it looks like you will need a lot of it), in memcached for example.
Otherwise, use a database.
Or use both: if the data is not in the cache (memcached), load it from the database and keep it in memory for future requests.
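
The "use both" idea is usually called cache-aside. A minimal sketch follows, under loud assumptions: a plain dict stands in for memcached, an in-memory SQLite table stands in for MySQL, and the names PhraseLookup, build_phrase_db and db_hits are made up for illustration.

```python
import sqlite3


def build_phrase_db(phrases):
    """Create a table whose primary key (and therefore index) is the phrase."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    conn.execute("CREATE TABLE phrases (phrase TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO phrases (phrase) VALUES (?)",
                     [(p,) for p in phrases])
    conn.commit()
    return conn


class PhraseLookup:
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}    # stand-in for memcached: phrase -> bool
        self.db_hits = 0   # counts how often the database was queried

    def exists(self, phrase):
        # 1. Try the cache first.
        if phrase in self.cache:
            return self.cache[phrase]
        # 2. On a miss, ask the database (an indexed point lookup).
        self.db_hits += 1
        row = self.conn.execute(
            "SELECT 1 FROM phrases WHERE phrase = ?", (phrase,)).fetchone()
        found = row is not None
        # 3. Remember the answer, including negative answers.
        self.cache[phrase] = found
        return found


lookup = PhraseLookup(build_phrase_db(["hello world", "good morning"]))
print(lookup.exists("hello world"), lookup.exists("no such phrase"))
```

Caching negative answers as well matters here, since a lookup service may see many phrases that are not in the list. A real cache would also need an eviction or expiry policy, which this sketch leaves out.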

Related

Best practice for high-volume transactions with real time balance updates

I currently have a MySQL database which deals with a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the click balance by 1 (there is actually more processing depending on the event) for each of the user, the sub-affiliate and the affiliate. Currently I do it very simply: once I receive the event, I run sequential queries in PHP. I read the balance of the user, increment it by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as close to real time as possible. The metrics at the sub-affiliate and affiliate level are less important; the closer they are to real time the better, but I think a 5 minute delay might be OK.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives for redesigning the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, and I currently wrap each cycle of changes to click balances in an SQL transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates to the database itself by using stored procedures. I am considering adding a separate database; maybe Postgres will be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like hadoop with parquet (or Apache Kudu?) and just add more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!

My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have the other task summarize the table to do mass UPDATEs of the counters. When there is a burst of traffic, it becomes more efficient, thereby not keeling over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
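
The linked write-up is not reproduced here, so the following is only one reading of the ping-pong idea, sketched with SQLite standing in for MySQL. The table names (staging, staging_spare, staging_full, balances) and function names are assumptions. In MySQL the flip would typically be a single atomic RENAME TABLE statement rather than the two ALTERs used below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL server
conn.executescript("""
    CREATE TABLE staging       (user_id INTEGER NOT NULL);
    CREATE TABLE staging_spare (user_id INTEGER NOT NULL);
    CREATE TABLE balances (user_id INTEGER PRIMARY KEY,
                           clicks  INTEGER NOT NULL);
""")


def record_click(user_id):
    """Ingestion path: a cheap append to the current staging table."""
    conn.execute("INSERT INTO staging (user_id) VALUES (?)", (user_id,))


def flip_and_summarize():
    """Processor: flip the tables, then fold the full one into the counters."""
    conn.commit()
    # Flip: writers continue into the empty table under the same name.
    conn.execute("ALTER TABLE staging RENAME TO staging_full")
    conn.execute("ALTER TABLE staging_spare RENAME TO staging")
    # Summarize: one UPDATE per user instead of one per click.
    rows = conn.execute(
        "SELECT user_id, COUNT(*) FROM staging_full GROUP BY user_id"
    ).fetchall()
    for user_id, n in rows:
        conn.execute("INSERT OR IGNORE INTO balances (user_id, clicks) "
                     "VALUES (?, 0)", (user_id,))
        conn.execute("UPDATE balances SET clicks = clicks + ? "
                     "WHERE user_id = ?", (n, user_id))
    # Empty the processed table and keep it as the spare for the next flip.
    conn.execute("DELETE FROM staging_full")
    conn.execute("ALTER TABLE staging_full RENAME TO staging_spare")
    conn.commit()


for uid in (1, 1, 2):
    record_click(uid)
flip_and_summarize()
print(conn.execute("SELECT user_id, clicks FROM balances").fetchall())
```

The more clicks pile up between flips, the more of them collapse into a single UPDATE per user, which is why this kind of design gets more efficient under a burst.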

If I were you, I would use Redis as in-memory storage and increment your metrics there. It's very fast and reliable, and you can also read from it. Then create a cron job that saves the data into the MySQL DB.

Is your web tier doing the number crunching as it receives and processes the HTTP request? If so, the very first thing you will want to do is move this to a work queue and process these events asynchronously. I believe you hint at this in your Item 3.
There are many solutions, and choosing one is outside the scope of this answer, but some packages to consider:
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage, it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... The answer to each points you in a different direction.
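
To illustrate the shape of the work-queue approach, here is a minimal in-process sketch. It uses Python's standard queue and threading modules as a stand-in for a real broker such as the ones listed above; handle_request, worker and balances are hypothetical names. Note that an in-memory queue like this loses its contents if the process dies, so it does not meet the "never lose an event" requirement on its own; a persistent broker would.

```python
import queue
import threading
from collections import Counter

events = queue.Queue()   # stand-in for Gearman, RabbitMQ, SQS, ...
balances = Counter()     # stand-in for the balance storage


def handle_request(user_id):
    """Web tier: enqueue the event and return immediately."""
    events.put(user_id)


def worker():
    """Background consumer: does the number crunching off the request path."""
    while True:
        user_id = events.get()
        if user_id is None:      # sentinel value: shut down
            events.task_done()
            break
        balances[user_id] += 1   # the "real" processing would go here
        events.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

for uid in ("alice", "bob", "alice"):
    handle_request(uid)
events.join()                    # wait until the queue has been drained
print(dict(balances))
```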

This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. They do something like sharding, but instead of putting data in separate databases, they split it and replicate it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you. And it is transparent to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.

How to maintain chat data?

I have a curious question...
I wanted to know how to maintain chat data in a database.
I have been using a PHP/MySQL application that stores users' chat data in a database.
My question is: if the chat data grows to, say, some millions of records, how should I store it? Does MySQL support that, or does it have any limitations or overhead?
Take the example of Gmail chat. I can chat without limit and can also retrieve all my previous chat data. How is that possible?
Can anyone answer this question of mine?

Chat history isn't really that heavyweight. If I assume around 100 bytes per message, 6 messages per minute, and 5 hours per day, every day (that is a very talkative chatter, though), then as a worst case that gives about 61MB per user per year (!).
That means with 1 million talkative chatters (very improbable) you would need around 58TB of data storage.
Given that this is a worst-case calculation, I would start off with a maximum of 1TB of storage, set up the database, and see how things go. It is highly improbable for a very young service to grow that fast.
Also, I would personally not recommend using a Windows system for something like this, unless you know very well what you're doing. MySQL on a Debian distribution will store billions of records, and will probably do this faster due to fewer OS-level limitations (see the MySQL documentation for details; there should be a section about the limitations on Windows).
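
The estimate can be re-derived in a few lines. This sketch assumes 365 days per year and binary units (MiB, TiB), neither of which the answer states, so the results land slightly above the quoted figures but in the same range.

```python
BYTES_PER_MESSAGE = 100
MESSAGES_PER_MINUTE = 6
HOURS_PER_DAY = 5
DAYS_PER_YEAR = 365      # assumption: the answer does not say which it used
USERS = 1_000_000

bytes_per_user_year = (BYTES_PER_MESSAGE * MESSAGES_PER_MINUTE
                       * 60 * HOURS_PER_DAY * DAYS_PER_YEAR)
mib_per_user_year = bytes_per_user_year / 2**20
tib_total = bytes_per_user_year * USERS / 2**40

print(f"{mib_per_user_year:.1f} MiB per user per year")
print(f"{tib_total:.1f} TiB for {USERS:,} users")
```

Under these assumptions the result is roughly 63 MiB per user per year and roughly 60 TiB for a million users.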

MySQL will happily store millions, even billions of records; but some of the numeric types won't be enough: see this for the maxima of numeric types. As you can see, it would be better to use BIGINT UNSIGNED for e.g. autoincrement fields.
Performance may become a problem for large tables, but that can be mostly solved with indexes (meaning "I've seen performance decrease somewhere around the 100GB mark in a similar situation").

Google has vast amounts of custom storage designed by it for its requirements. What I suggest is you determine your requirements more concretely and determine the platform you need.

Efficient rating system

I'm building a news rating script for a website that has a lot of users. I'm trying to make this website as efficient as possible, and now I'm wondering what would be the most efficient way to keep track of the votes. Of course, I don't want users to vote more than once.
My first thought was to store the votes in a MySQL database, but I'm worried this would have a negative impact on my website's speed because this table would get quite big.
Would storing it in a database still be the best solution, or are there any better solutions?

If you plan on having > 1,000,000 records you should make sure the table's structure is efficient (which shouldn't be hard for your example) and that you index it correctly.
Memcached would be the simplest way to implement caching and is easy to scale if your site grows and more servers are necessary.

With a properly indexed vote table, you can keep reasonable performance regardless of how large your table is (of course, beyond a certain point, your tables will be too large to fit in cache, but that would involve having a very large number of users and items).
Add in some per-user caching (on the client, in $_SESSION, using memcached) and you can get quite a fast "no" response time.
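
As a sketch of what "properly indexed" could mean for a vote table: a composite primary key on (user, article) both speeds up the "has this user already voted?" check and makes double voting impossible at the database level. SQLite stands in for MySQL here, and the table, column and function names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute("""
    CREATE TABLE votes (
        user_id    INTEGER NOT NULL,
        article_id INTEGER NOT NULL,
        vote       INTEGER NOT NULL,          -- e.g. +1 or -1
        PRIMARY KEY (user_id, article_id)     -- one vote per user per article
    )
""")
# Second index so that per-article totals do not scan the whole table.
conn.execute("CREATE INDEX votes_by_article ON votes (article_id)")


def cast_vote(user_id, article_id, vote):
    """Return True if the vote was stored, False if the user already voted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO votes VALUES (?, ?, ?)",
                         (user_id, article_id, vote))
        return True
    except sqlite3.IntegrityError:
        return False


def score(article_id):
    """Sum of all votes for an article (0 if nobody voted)."""
    return conn.execute(
        "SELECT COALESCE(SUM(vote), 0) FROM votes WHERE article_id = ?",
        (article_id,)).fetchone()[0]


print(cast_vote(1, 10, +1), cast_vote(1, 10, +1), score(10))
```

Letting the unique key reject the duplicate avoids a separate "check, then insert" round trip, which would also be open to a race between two simultaneous requests.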

Since you can't use memcached, I would say this: a decent database server (decent hardware + decent DB implementation) should be able to handle this quite well. A single table with a physical index on article-id and a second entry representing the vote will handle a few googillion (yes, I made up the word) articles easily :P
Rationale :
Database servers maintain statistics -- read: self-tuning -- and only hot items (index + row-entries) remain in-memory.
Moral:
Don't worry about such things unless they become a problem -- i.e., if your company were the size of Facebook, I would worry.

Memcached would be a very good way to do this. You need to synchronize from memcached once in a while (I would do this using the pull model, with a cron script on your MySQL server).

Did you see this?
http://destiney.com/php#Destiney_rated_images
Demo here: http://ratedsite.com/

MySQL vs Web Server for processing data

I was wondering if it's faster to process data in MySQL or in a server language like PHP or Python. I'm sure native functions like ORDER will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (including ties, with multiple entries returned as having the same rank)?
Sample SQL
SELECT TORCH_ID,
distance AS thisscore,
(SELECT COUNT(distinct(distance))+1 FROM torch_info WHERE distance > thisscore) AS rank
FROM torch_info ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figuring out the rank in PHP on the web server.
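
For comparison, the application-side alternative could look roughly like the sketch below. It computes the same kind of rank as the sample SQL (ties share a rank, and the next distinct value gets the next rank, often called a dense rank). It is written in Python rather than PHP, and dense_rank is a made-up helper name.

```python
def dense_rank(rows):
    """rows: iterable of (torch_id, distance) pairs.

    Returns a list of (torch_id, distance, rank), highest distance first.
    Equal distances share a rank; ranks have no gaps.
    """
    ranked = []
    rank = 0
    previous = None
    for torch_id, distance in sorted(rows, key=lambda r: r[1], reverse=True):
        if distance != previous:   # a new distinct value starts a new rank
            rank += 1
            previous = distance
        ranked.append((torch_id, distance, rank))
    return ranked


print(dense_rank([("a", 5), ("b", 9), ("c", 5), ("d", 1)]))
```

This is a single sort plus one pass, whereas the sample SQL runs a correlated subquery per row; on the other hand, it requires pulling every row out of the database first.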

Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. If you don't have that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time the database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.

In general, the answer to the "Should I process data in the database, or on the web server?" question is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just SELECT * WHERE x = 3.
As you pointed out, the database is optimized for operation on its data, using indexes, etc.

The speed of the count is going to depend on which DB storage engine you are using and the size of the table. That said, I suspect that nearly every count and rank done in MySQL would be faster than pulling the same data into PHP memory and doing the same operation there.

Ranking is based on count, order. So if you can do those functions faster, then rank will obviously be faster.

A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that MySQL is faster than server-side code.
Another consideration is how often this SQL will be called. You may find it easier to create a rank column and update it as each track record comes in. This will result in a lot of minor hits to your database, versus a number of "heavier" hits to your database.
So let's say you have 10,000 records, 1,000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB doing 100 updates, of which 10% hit every record (9,999), than have the ranking query get hit 1,000 times a day.
My two cents.

If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times faster performance. (I'm assuming you're using an ODBC DSN in your tests.)

PHP/MYSQL - Pushing it to the limit?

I've been coding PHP for a while now and have a pretty firm grip on it. MySQL, well, let's just say I can make it work.
I'd like to make a stats script to track the stats of other websites, similar to the obvious StatCounter, Google Analytics, Mint, etc.
I, of course, would like to code this properly, and I don't see MySQL liking 20,000,000 to 80,000,000 inserts daily (roughly 925 inserts per second at the upper end).
I've been doing some research and it looks like I should store each visit, or "entry", in a CSV or some other form of flat file and then import the data I need from it.
Am I on the right track here? I just need a push in the right direction, the direction being a way to absorb 1,000 pseudo "MySQL" inserts per second and the proper way of doing it.
Example Insert: IP, time(), http_referer, etc.
I need to collect this data for the day, and then at the end of the day, or in certain intervals, update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that of course, just trying to give a visualization since I'm horrible at explaining things.
If anyone can help me, I'm a great coder, I would be more than willing to return the favor.

We tackled this over the summer at the place where I've been working for the last year or so. We didn't require much granularity in the information, so what worked very well for us was coalescing data by different time periods. For example, we'd have a single day's worth of real-time stats; after that it'd be pushed into some daily sums, and then off into a monthly table.
This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem is inserts however.
Start by setting out some sample schema in terms of EXACTLY what information you need to keep, and in doing so, you'll guide yourself (through revisions) to what will work for you.
Another note on the vast number of inserts: we had talked through the idea of dumping real-time statistics into a little daemon which would store up to an hour's worth of data and then, outside the real-time path, inject that into the database before the next hour was up. Just a thought.
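
The coalescing described in this answer can be sketched as a roll-up job that folds one day's raw hits into a single summary row per site and then discards the raw rows. SQLite stands in for MySQL, and the schema (hits_today, daily_stats) and the roll_up function are assumptions, not the answerer's actual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
    CREATE TABLE hits_today (site_id INTEGER NOT NULL,
                             ip      TEXT    NOT NULL,
                             day     TEXT    NOT NULL);
    CREATE TABLE daily_stats (site_id    INTEGER NOT NULL,
                              day        TEXT    NOT NULL,
                              hits       INTEGER NOT NULL,
                              unique_ips INTEGER NOT NULL,
                              PRIMARY KEY (site_id, day));
""")


def roll_up(day):
    """Summarize one day's raw hits, then drop the raw rows."""
    with conn:  # both statements commit together or not at all
        conn.execute("""
            INSERT INTO daily_stats (site_id, day, hits, unique_ips)
            SELECT site_id, day, COUNT(*), COUNT(DISTINCT ip)
            FROM hits_today
            WHERE day = ?
            GROUP BY site_id, day
        """, (day,))
        conn.execute("DELETE FROM hits_today WHERE day = ?", (day,))


conn.executemany("INSERT INTO hits_today VALUES (?, ?, ?)", [
    (1, "10.0.0.1", "2024-01-01"),
    (1, "10.0.0.1", "2024-01-01"),
    (1, "10.0.0.2", "2024-01-01"),
    (2, "10.0.0.3", "2024-01-01"),
])
roll_up("2024-01-01")
print(conn.execute("SELECT * FROM daily_stats ORDER BY site_id").fetchall())
```

This is where the loss of granularity comes from: once the raw rows are deleted, only the counts that were chosen up front can be reported.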

For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propagate back up the line.
You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. This is something the database provides. Unfortunately, it comes at a price, one of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.
MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it" and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues because it lets one thread in MySQL do the insert, not whichever you connect to. Unfortunately, it only works with MyISAM tables.
Another solution, which is a more general solution to the problem, is to have a logging daemon that accepts your logging information and just en-queues it to wherever it has to go. The trick to making this fast is the en-queueing step. This is the sort of solution syslogd would provide.

In my opinion it's a good thing to stick to MySQL for registering the visits, because it provides tools to analyze your data. To decrease the load I would have the following suggestions.
Make a fast collecting table with no indexes except the primary key, MyISAM, one row per hit.
Make a normalized data structure for the hits and move the records once a day to that database.
This gives you a smaller performance hit for logging and a well indexed normalized structure for querying/analyzing.

Presuming that your MySQL server is on a different physical machine to your web server, then yes it probably would be a bit more efficient to log the hit to a file on the local filesystem and then push those to the database periodically.
That would add some complexity though. Have you tested or considered testing it with regular queries? I.e., increment a counter using an UPDATE query (because you don't need each entry in a separate row). You may find that this doesn't slow things down as much as you had thought, though obviously if you are pushing 80,000,000 page views a day you probably don't have much wiggle room at all.

You should be able to get that kind of volume quite easily, provided that you do some stuff sensibly. Here are some ideas.
You will need to partition your audit table on a regular (hourly, daily?) basis, if nothing else only so you can drop old partitions to manage space sensibly. DELETEing 10M rows is not cool.
Your web servers (as you will be running quite a large farm, right?) will probably want to do the inserts in large batches, asynchronously. You'll have a daemon process on each web server machine which reads the flat-file logs and batches them up. This is important for InnoDB performance and to avoid auditing slowing down the web servers. Moreover, if your database is unavailable, your web servers need to continue servicing web requests and still have them audited (eventually).
As you're collecting large volumes of data, some summarisation is going to be required in order to report on it at a sensible speed - how you do this is very much a matter of taste. Make sensible summaries.
InnoDB engine tuning - you will need to tune the InnoDB engine quite significantly - in particular, have a look at the variables controlling its use of disc flushing. Writing out the log on each commit is not going to be cool (maybe unless it's on an SSD - if you need performance AND durability, consider an SSD for the logs) :) Ensure your buffer pool is big enough. Personally I'd use the InnoDB plugin and the file per table option, but you could also use MyISAM if you fully understand its characteristics and limitations.
I'm not going to further explain any of the above as if you have the developer skills on your team to build an application of that scale anyway, you'll either know what it means or be capable of finding it out.
Provided you don't have too many indexes, 1000 rows/sec is not unrealistic with your data sizes on modern hardware; we insert that many sometimes (and probably have a lot more indexes).
Remember to performance test it all on production-spec hardware (I don't really need to tell you this, right?).
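
The batching daemon mentioned above could, in very reduced form, look like the following sketch: read a flat-file log and insert it in fixed-size batches, one transaction per batch. SQLite stands in for MySQL/InnoDB, and the tab-separated log format, the audit table and the function names are all assumptions.

```python
import io
import sqlite3
from itertools import islice


def batches(lines, size):
    """Yield lists of up to `size` items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_log(conn, log_file, batch_size=1000):
    """Insert 'ip<TAB>timestamp<TAB>referer' lines, one commit per batch."""
    inserted = 0
    for chunk in batches(log_file, batch_size):
        rows = [line.rstrip("\n").split("\t") for line in chunk]
        with conn:  # one transaction (one log flush) per batch, not per row
            conn.executemany(
                "INSERT INTO audit (ip, ts, referer) VALUES (?, ?, ?)", rows)
        inserted += len(rows)
    return inserted


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute("CREATE TABLE audit (ip TEXT, ts INTEGER, referer TEXT)")
log = io.StringIO("".join(
    f"10.0.0.{i}\t{1700000000 + i}\thttp://example.com/\n" for i in range(5)))
print(load_log(conn, log, batch_size=2))
```

Committing once per batch rather than once per row is the point: it amortizes the cost of flushing the log to disc, which ties in with the tuning note above. A real daemon would also need to remember how far into the file it has read so that a restart neither skips nor repeats lines.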

I think that using MySQL is overkill for the task of collecting the logs and summarizing them. I'd stick to plain log files in your case. They do not provide the full power of a relational database management system, but they are quite enough to generate summaries. A simple lock-append-unlock file operation on a modern OS is seamless and instant. By contrast, using MySQL for the same simple operation loads the CPU and may lead to swapping and other scalability hell.
Mind the storage as well. With plain text files you'll be able to store years of logs from a highly loaded website, taking into account the current HDD price/capacity ratio and the compressibility of plain text logs.
