I'm working on a full-text index system for a project of mine. As one part of the process of indexing pages it splits the data into a very large number of very small pieces.
I have gotten each piece down to a constant 20-30 bytes, and it could be less; each one is basically two 8-byte integers and a float that make up the actual data.
Because of the scale I'm aiming for and the number of pieces this creates, I'm looking for an alternative to MySQL, which has shown significant issues at data-set sizes well below my goal.
My current thinking is that a key-value store would be the best option for this and I have adjusted my code accordingly.
I have tried a number of them, but for some reason they all seem to scale even worse than MySQL.
I'm looking to store on the order of hundreds of millions or billions (or more) of key-value pairs, so I need something whose performance won't degrade badly as the data grows.
I have tried memcachedb, membase, and mongo and while they were all easy enough to set up, none of them scaled that well for me.
membase had the most issues due to the number of keys required and the limited memory available. Write speed is very important here, as this is close to an even read/write workload: I write a value once, then read it back a few times and store it for eventual update.
I don't need much performance on deletes, and I would prefer something that clusters well, as I'm hoping to eventually scale this across machines, but it needs to work on a single machine for now.
I'm also hoping to make this project easy to deploy, so an easy setup would be much better. The project is written in PHP, so it needs to be easily accessible from PHP.
I don't need rows or other higher-level abstractions; they are mostly useless in this case. I have already adapted the code from some of my other tests to work against a plain key-value store, and that seems likely to be the fastest, as I only retrieve two values per row, keyed off a third, so there is little extra work in using a key-value store. Does anyone know any easy-to-use projects that can scale like this?
I am using this store to hold individual sets of three numbers (the sizes are based on how they were stored in MySQL; that may not hold for other storage engines): two 8-byte integers, one for the ID of the document and one for the ID of the word, plus a float for the proportion of the document that the word makes up (the number of times the word appears divided by the number of words in the document). The data is indexed by the word ID and the range the document ID falls into; every time I retrieve this data it will be all of the results for a given word ID. I currently turn the word ID, the range, and a counter for that word/range combination each into binary representations and concatenate them to form the key, along with a two-digit suffix saying which value I am storing for that key: the document ID or the float.
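For illustration, a minimal sketch of that key scheme using PHP's pack() for the binary encoding (the function and variable names here are made up for the example, not my actual code):

    // Build a binary key from word ID, document-ID range, and a per-word/range counter,
    // plus a two-character suffix selecting which value is stored under the key.
    function makeKey(int $wordId, int $range, int $counter, string $field): string
    {
        // 'J' packs an unsigned 64-bit integer in big-endian order (PHP >= 5.6.3).
        return pack('J', $wordId) . pack('J', $range) . pack('J', $counter) . $field;
    }

    // $kv stands in for whatever key-value client is being tested.
    $documentId = 123456;
    $proportion = 0.0123;
    $kv->set(makeKey(42, 7, 0, '01'), pack('J', $documentId)); // '01' => document ID
    $kv->set(makeKey(42, 7, 0, '02'), pack('G', $proportion)); // '02' => float ('G' = big-endian float)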
Performance measurement was somewhat subjective: I watched the output of the processes putting data into and pulling data out of the storage to see how fast they were processing documents, and I also rapidly refreshed my statistics counters, which track more accurate throughput numbers, and compared the differences while I was using each storage method.
You would need to provide some more data about what you really want to do...
Depending on how you define "fast" and "large scale", you have several options:
memcache
redis
voldemort
riak
and so on... the list gets pretty big.
Edit 1:
Per this post's comments I would say take a look at Cassandra or Voldemort. Cassandra isn't a simple KV store per se, since you can store much more complex objects than just K -> V.
If you want to use Cassandra from PHP, take a look at phpcassa. But Redis is also a good option if you set up a replica.
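For the "all results for a given word ID" access pattern described above, a minimal Redis sketch using the phpredis extension might look like this (the key layout is just an illustration, not a recommendation of a specific schema):

    // Assumes the phpredis extension; the key layout is only an illustration.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $wordId = 42;
    $range  = 7;
    $documentId = 123456;
    $proportion = 0.0123;

    // Store one (document, proportion) pair under a hash keyed by word ID + range.
    $redis->hSet("word:$wordId:$range", (string)$documentId, (string)$proportion);

    // Retrieving all results for that word ID / range is then a single hash read.
    $results = $redis->hGetAll("word:$wordId:$range");   // [docId => proportion, ...]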
Here are a few products and ideas that weren't mentioned above:
OrientDB - this is a graph/document database, but you can use it to store very small "documents" - it is extremely fast, highly scalable, and optimized to handle vast amounts of records.
Berkeley DB - Berkeley DB is a key-value store used at the heart of a number of graph and document databases - supposedly has a SQLite-compatible API that works with PHP.
shmop - Shared memory operations might be one possible approach, if you're willing to do some dirty work. If your records are small and have a fixed size, this might work for you - using a fixed record size and padding with zeroes.
handlersocket - this has been in development for a long time, and I don't know how reliable it is. It basically lets you use MySQL at a "lower level", almost like a key/value-store. Because you're bypassing the query parser etc. it's much faster than MySQL in general.
If you have a fixed record-size, few writes and lots of reads, you may even consider reading/writing to/from a flat file. Likely nowhere near as fast as reading/writing to shared memory, but it may be worth considering. I suggest you weigh all the pros/cons specifically for your project's requirements, not only for products, but for any approach you can think of. Your requirements aren't exactly "mainstream", and the solution may not be as obvious as picking the right product.
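As a rough illustration of the fixed-record-size idea with a flat file (a sketch only; the 20-byte layout of two 64-bit integers plus a float is taken from the question, while the file name and helper functions are made up):

    // Fixed-size records: 8-byte doc ID + 8-byte word ID + 4-byte float = 20 bytes
    // ('G' is a 4-byte big-endian float on typical builds).
    const RECORD_SIZE = 20;

    function writeRecord($fh, int $recordNo, int $docId, int $wordId, float $proportion): void
    {
        fseek($fh, $recordNo * RECORD_SIZE);
        fwrite($fh, pack('JJG', $docId, $wordId, $proportion));
    }

    function readRecord($fh, int $recordNo): array
    {
        fseek($fh, $recordNo * RECORD_SIZE);
        return unpack('JdocId/JwordId/Gproportion', fread($fh, RECORD_SIZE));
    }

    $fh = fopen('index.dat', 'c+b');          // create if missing, binary, read/write
    writeRecord($fh, 0, 123456, 42, 0.0123);
    print_r(readRecord($fh, 0));              // ['docId' => 123456, 'wordId' => 42, ...]
    fclose($fh);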
I'm designing a small ecommerce website with a mysql database. I wanted to keep the URL clean without any hard coded product id.
So given a path http://example.com/shop/{product-name}
I opted to convert the {product-name} to a crc32 checksum in PHP and store it in the product table.
When a request is received for a product page, the product name is converted to a crc32 checksum and the table is queried for matching rows. I did this only for the product pages and blog pages.
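The lookup flow looks roughly like this (a simplified sketch, not my exact code; $pdo stands for a PDO connection, and the table and column names are placeholders):

    // $pdo is assumed to be an existing PDO connection; names are illustrative.
    $productName = 'jelly-donut';                       // taken from the request path
    $checksum    = crc32($productName);                 // 32-bit integer

    $stmt = $pdo->prepare('SELECT * FROM products WHERE checksum = ? LIMIT 1');
    $stmt->execute([$checksum]);
    $product = $stmt->fetch(PDO::FETCH_ASSOC);          // false if no match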
My question
So far it's working well on localhost. Will this be a scalable solution once traffic increases? Any way to test this?
Will indexing the checksum column help for Select queries?
checksum INT UNIQUE NOT NULL
I read that INSERT statements will take a performance hit. But assuming inserts are occasional (2-3 per week for the blog, or new products once every 3 months perhaps), 2-3 seconds is acceptable. How bad could it get?
Is storing the checksum value as a binary better? Consider the additional task of converting the checksum to binary before every request.
I'm designing a small ecommerce website with a mysql database. I wanted to keep the URL clean without any hard coded product id.
So given a path http://example.com/shop/{product-name}
So, you mean https://example.com/shop/jelly-donut and https://example.com/shop/coffee, for example. Excellent. Good search engine optimization.
I opted to convert the {product-name} to a crc32 checksum in PHP and store it in the product table.
This approach has several problems.
There are potential collisions in such a short checksum. More than one product name could easily map to the same 32-bit value; by the birthday paradox, the odds of at least one collision reach roughly 50% once you have around 77,000 names. It may not happen until you're long gone.
It's unnecessary for performance unless you have many millions of rows in your product table. Lookups on indexed varchar() columns are almost as fast as lookups on integer columns.
It's a programming hassle.
It's a maintenance hassle long after your programming is done.
MySQL, like other database systems, is built for quick SELECT lookups on various kinds of data. Thousands of programmer-years (truly!) have gone into making this kind of thing fast. If you think you need to improve on those programmers with a trivial optimization, with respect, you're wrong. Certainly in this case.
Will this be a scalable solution once traffic increases?
Yes, it will. But so will a lookup on your product name.
Will indexing the checksum column help for Select queries?
Yes, you should index any set of columns used in SELECT queries. Indexing is a bit of an art; check out https://use-the-index-luke.com. Once you've correctly indexed a table with fewer than a million rows, SELECT statements should be fast and scalable.
If you don't index the table correctly, your queries will be very slow no matter what datatypes you're looking up.
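For example, indexing the product name directly would let you skip the checksum column entirely (a sketch; the table and column names are assumed, not taken from your schema):

    // A sketch: index the varchar name column and look products up by name directly.
    $pdo->exec('CREATE UNIQUE INDEX idx_products_name ON products (name)');

    $stmt = $pdo->prepare('SELECT * FROM products WHERE name = ? LIMIT 1');
    $stmt->execute(['jelly-donut']);
    $product = $stmt->fetch(PDO::FETCH_ASSOC);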
I read insert statement will take a performance hit.
A trivial hit. A few milliseconds per insert, at most.
Is storing the checksum value as a binary better?
You'll save tens of microseconds on each SELECT if you use binary. Tens of microseconds do not matter.
Consider the additional task of converting the checksum to binary before every request.
The only thing that counts here is your programming and maintenance time.
Keep your system simple. Don't borrow trouble, especially imagined trouble about the difference between searches on different kinds of data types.
I am currently working on a PHP application (pre-release).
Background
We have a table in our MySQL database which is expected to grow extremely large - it would not be unusual for a single user to own 250,000 rows in this table. Each row in the table has an amount and a date, among other things.
Furthermore, this particular table is read from (and written to) very frequently - on the majority of pages. Given that each row has a date, I'm using GROUP BY on the date to minimise the size of the result-set returned by MySQL - rows falling in the same year can now be collapsed into just one total.
However, a typical page will still have a result-set between 1000-3000 results. There are also places where many SUM()'s are performed, totalling many tens - if not hundreds - of thousands of rows.
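To give an idea, the grouping I'm doing looks roughly like this (a simplified sketch; the table and column names here are placeholders, not the real schema):

    // A sketch: collapse per-row amounts into one total per year for a given user.
    $stmt = $pdo->prepare(
        'SELECT YEAR(entry_date) AS yr, SUM(amount) AS total
           FROM entries
          WHERE user_id = ?
          GROUP BY YEAR(entry_date)'
    );
    $stmt->execute([$userId]);
    $totalsByYear = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);   // e.g. [2023 => 1234.56, ...]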
Trying MySQL
On a usual page, MySQL was usually taking around 600-900ms. Using LIMIT and offsets wasn't helping performance, and the data has been heavily normalised, so it doesn't seem like further normalisation would help.
To make matters worse, there are parts of the application which require the retrieval of 10,000-15,000 rows from the database. The results are then used in a calculation by PHP and formatted accordingly. Given this, the performance of MySQL wasn't acceptable.
Trying MongoDB
I have converted the table to MongoDB, and its speed is faster - it usually takes around 250ms to retrieve 2,000 documents. However, the $group command in the aggregation pipeline - needed to aggregate fields by the year they fall in - slows things down. Unfortunately, keeping a running total and updating it whenever a document is removed/updated/inserted is also out of the question, because although we can use a yearly total for some parts of the app, in other parts the calculations require that each amount falls on a specific date.
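For context, the aggregation looks roughly like this with the mongodb/mongodb library (again a simplified sketch; the collection and field names are placeholders):

    // A sketch using the mongodb/mongodb library; names are placeholders.
    $collection = (new MongoDB\Client)->mydb->entries;

    $cursor = $collection->aggregate([
        ['$match' => ['user_id' => $userId]],
        ['$group' => [
            '_id'   => ['$year' => '$entry_date'],   // group by calendar year
            'total' => ['$sum' => '$amount'],
        ]],
    ]);

    foreach ($cursor as $doc) {
        // $doc->_id is the year, $doc->total is the summed amount for that year
    }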
I've also considered Redis, although I think the complexity of the data is beyond what Redis was designed for.
The Final Straw
On top of all of this, speed is important. So performance is up there in terms of priorities.
Questions:
What is the best way to store data which is frequently read/written and rapidly growing, with the knowledge that most queries will retrieve a very large result-set?
Is there another solution to the problem? I'm totally open to suggestions.
I'm a little stuck at the moment; I haven't been able to retrieve such a large result-set in an acceptable amount of time. It seems most datastores are great for small retrieval sizes - even on large amounts of data - but I haven't been able to find anything on retrieving large amounts of data from an even larger table/collection.
I only read the first two lines, but you are using aggregation (GROUP BY) and then expecting it to just work in realtime?
I will say that you seem new to the internals of databases - not to undermine you, but to try and help you.
The group operator in both MySQL and MongoDB is in-memory. In other words, it takes whatever data structure you provide, whether it be an index or a document (row), and it will go through each row/document, taking the field and grouping it up.
This means that you can speed it up in both MySQL and MongoDB by making sure you are using an index for the grouping, but still this only goes so far, even with housing the index in your direct working set in MongoDB (memory).
In fact, using LIMIT with an OFFSET as well is probably just slowing things down even further, frankly, since after writing out the set MySQL then needs to query again to get your answer.
Once done, it will write out the result: MySQL writes it out to a result set (memory and IO being used here), and MongoDB replies inline if you have not set $out, the maximum size of the inline output being 16MB (the maximum size of a document).
The final point to take away here is: aggregation is horrible
There is no silver bullet that will save you here. Some databases will attempt to boast about their speed etc., but the fact is that most big aggregators use something called "pre-aggregated reports". You can find a quick introduction in the MongoDB documentation: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/
This means that you put the effort of aggregating and grouping onto some other process which can do it easily enough, allowing your reading thread - the one that needs to be realtime - to do its thang in realtime.
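A minimal sketch of that pattern with the mongodb/mongodb PHP library, keeping a per-user yearly total up to date on every write (the collection and field names are assumptions, not anything from the question):

    // A sketch: maintain a pre-aggregated yearly total alongside the raw entries.
    $totals = (new MongoDB\Client)->mydb->yearly_totals;

    function recordAmount(MongoDB\Collection $totals, $userId, DateTimeInterface $date, float $amount): void
    {
        // Upsert the running total for this user/year; the realtime reader never aggregates.
        $totals->updateOne(
            ['user_id' => $userId, 'year' => (int)$date->format('Y')],
            ['$inc' => ['total' => $amount]],
            ['upsert' => true]
        );
    }

    // Reading the pre-aggregated totals is then a plain indexed find:
    $doc = $totals->findOne(['user_id' => $userId, 'year' => 2024]);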
I am creating an app that stores new information every week, consisting of 10 twelve-digit integers for each of millions of unique URLs. I need to extract the information for a particular week, or for a range of weeks, for a given URL. I am going to use MySQL as the database.
Tip: to simplify, grouping the URLs by domain will reduce the amount of data to be processed while querying.
I need advice about structuring the database for fast querying that takes minimal processing power and disk space.
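For concreteness, the kind of layout I have in mind looks something like this (a rough sketch only; the table, column, and variable names are placeholders):

    // A sketch only: one row per (URL, week); the ten 12-digit values fit in BIGINT.
    $pdo->exec('
        CREATE TABLE url_weekly_stats (
            url_id INT UNSIGNED NOT NULL,        -- points at a separate urls table
            week   SMALLINT UNSIGNED NOT NULL,   -- e.g. weeks since some epoch
            v1 BIGINT NOT NULL,  v2 BIGINT NOT NULL,  v3 BIGINT NOT NULL,  v4 BIGINT NOT NULL,
            v5 BIGINT NOT NULL,  v6 BIGINT NOT NULL,  v7 BIGINT NOT NULL,  v8 BIGINT NOT NULL,
            v9 BIGINT NOT NULL,  v10 BIGINT NOT NULL,
            PRIMARY KEY (url_id, week)           -- covers "given URL, given week range"
        )
    ');

    // Fetching a week range for one URL then hits the primary key directly:
    $stmt = $pdo->prepare(
        'SELECT * FROM url_weekly_stats WHERE url_id = ? AND week BETWEEN ? AND ?'
    );
    $stmt->execute([$urlId, $fromWeek, $toWeek]);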
Since no-one else has had a go, here's my advice.
To make a start, ignore 'fast querying that takes optimal processing power and disk space.' Looking for that at the start won't get you anywhere. Design and create a sensible database to meet your functional requirements. Bung in random data until you've got approximately the volume you expect. Run queries against it and time them.
If your database is normalised properly, the disc space it takes will also be approximately minimised. Queries may be slow: use execution plans to see why they're slow, and add indexes to help their performance. Once you get acceptable performance, you're there.
The main point is a standard saying: don't optimise until you know you have a problem and you've measured it.
I am working on an application using a memcache pool (5 servers) and some processing nodes. I have two different possible approaches and I was wondering if you have any comments on how they compare performance-wise (speed primarily):
I extract a big chunk of data from memcache once per request, iterate over it and discard the bits I don't need for the particular request.
I extract small bits from memcached, and only the ones I need, i.e. I extract the value of a and, based on the value of a, extract the value of either b or c, then use this combination to find the next key I want to extract.
The difference between the two is that the number of memcached lookups (against a pool of servers) is lower in option 1, but the size of the response increases. Has anyone seen benchmarking reports on this?
Unfortunately I can't use a better key based on the request directly, as I don't have enough memcache capacity to support all possible combinations of values, so I have to construct some of it at run time.
Thanks
You would have to benchmark for your own setup. The parts that would matter would be the time spent on:
requesting a large amount of data from memcache + retrieving it + extracting data from the response
sending several requests to memcache + retrieving the data
Basically, the first thing you have to measure is how large the overhead of interacting with your cache pool is. And there is that small matter of how this whole thing will react when load increases: what might be fast now can turn out to be a terrible decision later, when the users start pouring in.
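A rough timing sketch for comparing the two approaches, assuming the PHP Memcached extension (the keys, server list, and the getMulti batching are purely illustrative):

    // A rough sketch using the PHP Memcached extension; keys and servers are illustrative.
    $mc = new Memcached();
    $mc->addServers([['cache1', 11211], ['cache2', 11211]]);   // ... list all 5 servers

    // Approach 1: one big blob, filter what you need in PHP.
    $t = microtime(true);
    $blob = $mc->get('request:big_blob');
    $approach1 = microtime(true) - $t;

    // Approach 2: several small dependent lookups; getMulti batches the independent ones.
    $t = microtime(true);
    $a = $mc->get('key:a');
    $rest = $mc->getMulti($a === 'x' ? ['key:b'] : ['key:c']);
    $approach2 = microtime(true) - $t;

    printf("big blob: %.4fs, small keys: %.4fs\n", $approach1, $approach2);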
This kinda depends on your definition of "large chunk". Are we talking megabytes here, or an array with 100 keys? You also have to consider that PHP still needs to process that information.
There are two things you can do at this point:
take a hard look at how you are storing the information. Maybe you can cut it down to two small requests: one to retrieve the specific data for the conditions, and another to get the conditional information.
set up your own benchmark for your server. Some random article on the web will not be relevant to your system architecture.
I know this is not the answer you wanted to hear, but that's my two cents .. here ya go.
So I am making a small multiplayer game and I am using PHP as the backend. I basically need to SET and GET a lot of positions of objects; well, one object is one player that has an X/Y position in this case.
I don't need to do it in realtime, but perhaps every 5-20 seconds, since it's turn-based. I don't mind if I lose data, since positions will be set again by the clients every now and then.
I was thinking of doing this with memcached, or redis. Basically each player would be a "key" and this key would contain an object with some relevant information, the most important thing being the X/Y positions.
Perhaps I am going about this the wrong way, but this approach seems very easy to do; however, I am not sure how well it would work, since I don't have a lot of experience with either of these solutions.
I should add that we are talking about perhaps 10 players here, hence 10 objects with x/y positions that need updating every now and then.
Can it be done like this, or is there a better solution than memcached/redis? If not, which of these two would be better performance-wise? From what I understand it's almost the same thing, just that redis offers some more functionality (which may not necessarily be needed).
Oh and yes I am also using APC with php obviously. Thanks!
With just 10 objects in the entire data model, I would store them all as a serialized array under a single key. The serialization time will pale in comparison to the memcached call, so you may as well minimize the number of reads and writes to one.
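A minimal sketch of that, assuming the PHP Memcached extension (which serializes the array for you; the key and field names are just for illustration):

    // A sketch with the PHP Memcached extension; it serializes the array automatically.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    // Write: all 10 players go into one value under one key.
    $positions = [
        'player1' => ['x' => 12, 'y' => 7],
        'player2' => ['x' => 3,  'y' => 19],
        // ...
    ];
    $mc->set('player_positions', $positions, 60);   // short TTL; losing it is fine here

    // Read: one call returns everything; update one player and write the array back.
    $positions = $mc->get('player_positions') ?: [];
    $positions['player1'] = ['x' => 13, 'y' => 7];
    $mc->set('player_positions', $positions, 60);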
I just checked out the redis online demo, and it looks pretty neat. Thanks for the link. I can't speak to which is better, but memcached in PHP is proven and mature so you can't go wrong there.
Redis is cheapest on resources, especially the 32-bit version: e.g. if you use less than 2 GB of cache memory, which I believe is the case here, run 32-bit Redis even if your server is 64-bit.