I work for a small company who wants to expand an existing system but by doing so there are also some issues.
The system itself is used to store images and video's.
Always being available
So we talked to our host and they recommended us to use Ceph and Cassandra. Now I did some research on both of them and I really like the idea of Ceph - but Cassandra.. well, it would take a while for the existing system to be adapted to it.
The reason they recommended Cassandra was so our database would always be available. Now - the DB won't increase massively, it would only be used to keep some user-info, image-tags and other small meta-data.
Another issue is that many queries use "like" in order to find the tags. CQL does not support this.
Now we don't have any developers who have knowledge of Cassandra so it might take some time to take used to that.
My question
Is there an alternative for Cassandra, preferred a relational database (so not NoSQL)
which is still highly available (like when one server goes down,
another takes over).
In case there isn't - how long would it take to get used to Cassandra's query language for relatively inexperienced developers (~ 1 year of experience), including the knowhow of how to adapt a system to it.
Just in case I didn't do my research properly, is Cassandra even the system we are looking for, the DB is pretty much only used for storage and some small functions.
In case you recommend a NoSQL language, what other options do I have of searching and finding data (as we now do by using " ..where x like 'something'")
Another issue is that many queries use "like" in order to find the tags. CQL does not support this.
It does, since Cassandra 3.5 with SASI secondary index
Now we don't have any developers who have knowledge of Cassandra so it might take some time to take used to that.
It's not a problem, you have hours of free online video and training at Datastax Academy
In case there isn't - how long would it take to get used to Cassandra's query language for relatively inexperienced developers (~ 1 year of experience), including the knowhow of how to adapt a system to it.
Just watch the data modeling videos, you should grasp quickly the fundamentals ideas behind Cassandra data model.
Just in case I didn't do my research properly, is Cassandra even the system we are looking for, the DB is pretty much only used for storage and some small functions.
If you're looking for extreme high availability (0 downtime) Cassandra is for you.
If you can handle a downtime of some minutes, there are also other systems that can be a good fit.
Related
I've done some research on HTML5 local storage, and it seems plausible that I could mirror a MySQL database's structure for use in an application that needs a lot of data for only one person.
Why would I do this? In my spare time, I'm a web game developer: PHP, MySQL, and all the technologies to dress it up. So far I've built databases that support many players, but my games are intended to be "single-player with multi capabilities". And for games that are only intended to be played single-player, there's no point in even having a DB connection unless they're saving to a web server!
I want to achieve a single-player mode that will never touch my database, and will be available offline. However, the code behind all of this is still going to be making SQL queries. Ideally I imagine that I could set up a sort of abstraction layer of local storage that would respond to queries.
And in short, I'm wondering what's out there. Searching local storage and HTML5 will give you endless posts about the technologies, but I'm not certain that my idea here will work well, or should even be attempted. Likewise there could be frameworks out there already that handle this with ease. I've found nothing yet.
update: The deprecation of web SQL database worries me. It looked very appealing for my situation; as it uses SQL, modifying my queries shouldn't be so difficult. Now with a push toward IndexedDB, I'm not certain it will be as easy.
I've read your question a few times now and continue to wonder 'Why would you do this' ;-) The question brings up more questions for me, so...
What do you mean by 'a lot of data'? Ultimately speaking, is the sql metaphor appropriate in this case? Abstraction layer that can respond to 'sql-like' queries is interesting in and of itself, but sounds extremely complex. Could a simpler solution do the job, like a JSON object? What about persisting the data in cases when users clean cache, history, re-install the browser, etc? Even complex javascript literal or JSON object could provide very straight forward means of occasional persistance/backup and recovery. (It's interesting that just released PostgreSql 9.2 includes JSON as datatype. Could it be that SQL and NoSql will gravitate toward some common ground?) Sorry if this comes off as more of a commentary than an answer, your question comes off as being on a higher plane.
EDIT
googling 'javascript sql interpreter' turns up some interesting stuff:
http://www.terminally-incoherent.com/blog/2009/05/19/sql-emulation-tool-in-javascript-part-2/#comments
https://github.com/forward/sql-parser#readme
Generating a JavaScript SQL parser for SQLite3 (with Lemon? ANTLR3?)
http://jsdb.sourceforge.net/demo.html
I've been working on a new site of mine for a couple of days now which will be retrieving almost all of its most used content from a MySql database. Seeming as the Database and website is still under development the tables are really small at the moment and speed is of no concern yet.
But you know what they say, a little bit of hard work now saves you a headache later on.
Now I'm only 17, the only database I've ever been taught was through Microsoft Access, and we were practically given the database completed - we learned up to 3NF, but that was about it.
I remember reading once when I was looking to pull data (randomly) out of a database how large databases were taking several seconds/minutes to complete a single query, so this just got me thinking. In a fraction of a second I can submit a search to google, google processes the query and returns the result, and then my browser renders it - all done in the blink of an eye. And google has billions of records to search through. And they're also doing this for millions of users simultaneously.
I'm thinking, how do they do it? I know that they have huge data centers, but still.
I realize that it probably comes down to the design of the database, how it's been optimized, and obviously the configuration. And I guess that's my question really. Could someone please tell me how to design high performance databases for millions/billions of rows (yes, I'm being optimistic), and possibly point me towards some good reading material to help me learn further?
Also, all my queries are done via PHP, if that's at all relevant to any answers.
The blog http://highscalability.com/ has some good articles and pointers to how companies handle large problems.
Specifically related to MySQL, you can Google for craigslist.org's use of MySQL.
http://www.slideshare.net/jzawodn/mysql-and-search-at-craigslist
First the good news... MySQL scales well (depending on the hardware) to at least hundreds of millions of rows.
Once you get to a certain point, a single database server will have trouble managing the load. That's when you get into the realm of partitioning or sharding... spreading the load across multiple database servers using any one of a number of different schemes (e.g. putting unrelated tables on different servers, spreading a single table across multiple servers e.g. by using the ID or date range as a partitioning key).
SQL does shard, but is not fundamentally designed to shard well. There's a whole category of storage alternatives collectively referred to as NoSQL that are designed to solve that very problem (MongoDB, Cassandra, HBase are a few).
When you use SQL at very large scale, you run into any number of issues such as making data model changes across a DB server farm, trouble keeping up with data backups, etc. That's a very complex topic, and people that solve it well are rare. For a glimpse at the issues, have a look at http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-death/
When selecting a database platform for a specific project, benchmark the solution early and often to understand whether or not it will meet the performance requirements that you envision. Having a framework to do that will help you learn about scalability, and will help you decide whether to invest effort in improving the data storage part of your solution, and will help you know where best to invest your time.
No one can tell how to design databases. It comes after much reading and many hour working on them. A good design is product of many many years doing them though. As you've only seen Access you got no knowledge of databases. Search through Amazon.com and you'll get tons of titles. For someone that's starting, anyone will do it.
I mean no disrespect. I've been there and I'm also tutor of some people learning programming/database design. I do know that there's no silver bullet or shortcuts for the work you have ahead.
If you intend to work with high performance database, you should have something in mind. The design of them in per application. A good design depends on learning more and more how the app's users interact with the system, the usage patterns, etc. The things you'll learn from books will give you options, using them will depend heavily on the scenario.
Good luck!
It doesn't all come down to the design of the database, though that is indeed a big part of it. The guys who made Google are geniouses, and if I'm not completely wrong about Google you won't be able to find out exactly how they do what they do. Also, I know that years back they had more than 10,000 computers processing queries, and today they probably have many more. I also suspect them for caching most of the recent/popular keywords. And all the websites have been indexed and analyzed using an unknown algorithm which will make sure the computers don't have to look through all the words on every page.
In fact, Google crawls the entire internet around every 14 days, so when you do a search you do not search the entire internet. Your search gets broken down into keywords and then these keywords are used to narrow the number of relevant pages - and I'm pretty sure all pages have already been analyzed for important and/or relevant keywords before you even thought of visiting google.com.
Have a look at this question.
Have a look into Sphinx server.
http://sphinxsearch.com/
Craigslist uses that for their search engine. Basically, you give it a source and it indexes whatever you want (mysql database/table, text files, etc.). If it works for craigslist, it should work for you.
What is the most efficient way to search through so many characters? What do you think?
Let's say website built in PHP and MySQL.
What should I learn to be able to build this as much efficiently as possible? Are there any algorythms I should learn or something?
Text indexing algorithm
Google uses a custom-made database solution called BigTable, http://en.wikipedia.org/wiki/Big_table, which is run linked over hundreds of servers all over the world. So they're fast because they wrote the software specifically to be fast, and set up the hardware in such a way that they could squeeze the most out of it.
You can get to a decent set with PHP and MySQL, but once you start dealing with very large data sets, MySQL, and any other generic database, will start to buckle under the stress. If you want to learn more about this, a good place to start is to do a search for concurrency in database design (briefly explained in http://en.wikipedia.org/wiki/Concurrency_control amongst others), which is a topic way too large to cover in a stackoverflow reply =)
Google goes beyond simply optimizing the databases and the code. They also do a lot of distributed programming. While the exact mechanisms they use to power systems such as Gmail are guarded secrets, it is known that they have entire farms of computers networked, each working on parts of the index at any given time, rather than just one server.
For MySQL, look at the Full-Text Search Functions.
This is assuming your content is stored in the database (such as in a CMS).
I guess this question has been asked a lot around. I know Rails can scale because I have worked on it and it's awesome. And there is not much doubt about that as far as PHP frameworks are concerned.
I don't want to know which frameworks are better.
How much is difference in cost of scaling Rails vs other frameworks (PHP, Python) assuming a large app with 1 million visits per month?
This is something I get asked a lot. I can explain to people that "Rails does scale pretty well" but in the long run, what are the economics?
If somebody can provide some metrics, that'd be great.
One major factor in this is that isn't affected by choice of framework is database access. No matter what approach you take, you likely put data in a relational database. Then the question is how efficiently you can get the data out of the database. This primarily depends on the RDBMS (Oracle vs. Postgres vs. MySQL), and not on the framework - except that some data mapping library may make inefficient use of SQL.
For the pure "number of visits" parameter, the question really is how fast your HTML templating system works. So the question is: how many pages can you render per second? I would make this the primary metrics to determine how good a system would scale.
Of course, different pages may have different costs; for some, you can use caching, but not for others. So in measuring scalability, split your 1 million visits into cheap and expensive pages, and measure them separately. Together, they should give a good estimate of the load your system can take (or the number of systems you need to satisfy demand).
There is also the issue of memory usage. If you have the data in SQL, this shouldn't matter - but with caching, you may also need to consider scalability wrt. main memory usage.
IMHO I don't think the cost of scaling is going to be any different between those three because none of them have "scalability batteries" included. I just don't see any huge architectural differences between those three choices that would cause a significant difference in scaling.
In other words, your application architecture is going to dominate how the application scales regardless of which of the three languages.
If you need memory caching you're going to at least use memcached (or something similar which will interface with all three languages). Maybe you help your scalability using nginx to serve directly from memcache, but that's obviously not going to change the performance of php/perl/python/ruby.
If you use MySQL or Postgresql you're still going to have to design your database correctly for scaling regardless of your app language, and any tool you use to start clustering / mirroring is going to be outside of your app.
I think in terms of memory usage Python (with mod_wsgi daemon mode) and Ruby (enterprise ruby with passenger/mod_rack) have pretty decent footprints at least comparable to PHP under fcgi and probably better than PHP under mod_php (i.e. apache MPM prefork + php in all the apache processes sucks a lot of memory).
Where this question might be interesting is trying to compare those 3 languages vs. something like Erlang where you (supposedly) have cheap built-in scalability automatically in all Erlang processes, but even then you'll have a RDBMS database bottleneck unless your app nicely fits into one of the Erlang database ways of doing things, e.g. couchdb.
Unfortunately, I do not know any cost comparison of Rails, PHP and Django. But I do know a cost comparison of some Java Web Frameworks: Spring(9k$), Wicket(59k$) , GWT(6k$) and JSF(49k$).
You have the original conference here:
http://www.devoxx.com/display/DV11/WWW++World+Wide+Wait++A+Performance+Comparison+of+Java+Web+Frameworks
There you have a post about it(shorter):
http://blog.websitesframeworks.com/2013/03/web-frameworks-benchmarks-192/
If you want to try to infere about this cost. You could cross this data with the bencharmk proposed by Techempower:
http://www.techempower.com/benchmarks/
So, if you know Spring is pretty cheap compare to Wicket and Django/Rails have worse marks in benchmarks than Wicket and Spring, it probably means they will represent a higher cost.
Hope it helps.
On a site with a reasonable amount of traffic , would it matter if the application/business logic is written as stored procedures ,triggers and views , instead of inside the PHP code itself?
What would be the best way to go keeping scalability in mind.
I can't provide you statistics, but unless you plan to change PHP for another language in the future, i can say keeping the business logic in PHP is more "scalability friendly".
Its always easier and cheaper to solve web server load problems than having them in the database. Your database will always need to be lighting quick and just throwing mirrors at it won't solve the problem. The more database slaves you have, the more writes you have to do.
In my experience, you should put business logic in PHP code rather than move it onto the database. Assuming your database is on a separate server, you don't want your database to be busy calculating formulas when requests come in.
Keep your database lightning fast to handle selects, inserts and updates.
I think you will have far better scalibility keeping database code in the database where it can be performance tuned as the number of records gets larger. You will also have better data integrity which is critical to the data even being useful. You don't see a lot of terrabyte sized relational dbs with all their code in the application.
Read some books on database performance tuning and then decide if you want to risk your company's data on application code.
There are several things to consider when trying to decide whether to place the business logic in the database or in the application code.
Will the same database be accessed
from different websites / web
applications? Will the sites /
applications be written in the same
language or in a different language?
If the database will be used from a single site, and the site is written in a single language then this becomes a non-issue. Otherwise, you'll need to consider the added complexity of stored procedures, triggers, etc vs trying to maintain database access logic etc in multiple code bases.
What are relational databases in
general good for and what is MySQL
good for specifically? What is PHP
best at?
This consideration is fairly straight-forward. Relational databases across the board and specifically in any variant of SQL are going to do a great job at inserting, updating, and deleting data. Generally they also handle ATOMIC transactions well. However, most variants of SQL (including MySQL) are not good at complex calculations, on-the-fly date handling, file system access etc.
PHP on the other hand is very fast at handling calculations, dates, file system accesses. By taking a little time you can even design your PHP code to work in such a way that records are only retrieved once and then stored when necessary.
What are you most familiar /
comfortable with using?
Obviously it tends to make more sense to use the tool with which you are most familiar.
As a last point consider that just because a drill can be used to cut sheet rock or because a hammer can be used to drive a screw doesn't mean that they should be used for these things. Sometimes I think that programmers do more potential damage by trying to make more powerful tools that do everything rather than making simpler tools that do one thing really, really well.
A well done PHP application should be enought, but keep in mind that it also requires you to do the less calls to the database you can. Store values you'll need later in PHP, shorten queries, cache, etc.
MySQL optimization is always a must, as it will also decrease the amount of databse calls by PHP, and thus getting a better performance. Therefore, there's no way you can't think of stored procedures, etc, if your aim is to increase performance. But MySQL by itself would't be enought if your PHP code isn't well done (lots of unecessary database calls), that's why I think PHP must be well coded, keeping in mind the hole process while developing it, so that unecessary stuff doesn't get in the way. Cache for instance, in "duet" with proper MySQL, is a great boost on performance.
My POV, even not having much experience in developing large applications is to write business logic in the DB for some reasons:
1 - Maintainability, I think that languages deprecate functions and changes many other things in a short time period, so if PHP changes version, you'll need to adapt your code to the new version
2 - DBs tends to be more language stable, so when a new version of a RDBMS comes out, it usually doesn't change many things in the way you write your queries or SPs, or it even doesn't change. Writing your logic in DB will reduce code adaptation because of a new DB version
3 - A RDBMS is more likely to be alive for a long period rather than a programming language. Also, as your data is critical, there is a big worry from the RDBMS developers for automatic migration of your whole data to the new RDBMS version, including your SPs. When clipper died, there were no ways to migrate systems to a new programming language, they had to be completely rewritten.
4 - If you think someday to change completely the language you are writing the application for some reason(language death, for example), the only thing to be rewritten will be the presentation and the SP calls, not business logic.
I'd like to know from other people here if what I pointed out makes sense, and if not, why. I'm on the same situation as Sabeen Malik, I'm thinking to begin my first huge project and I'm tending towards SPs because of what I wrote. So it's time to correct my POV if it's not so correct.
MySQL sucks at using advanced DB techniques, it's simple and fast. PHP, being a dynamic language, makes processing data very easy. Therefore, it usually makes sense to use PHP.