Does Hadoop come handy in my project? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
A couple of days ago I was asked by my company to gather requirements for starting a project. The project is an e-book store. The idea is simple, but the total amount of data is about 4 TB and the number of files is around 500,000.
As my team members use PHP and MySQL, I looked around Apache for big data tools. Naturally I came across Apache Hadoop and MySQL Cluster. But after several days of digging on Google, I'm now completely confused! I have these questions:
Is this amount of data (4-5 TB) even considered big data? (Some sources said that at least 5 TB of data should use Hadoop; others said big data for Hadoop means zettabytes and petabytes.)
Does Hadoop ship with its own special database, or should it be used with MySQL etc.?
Does Hadoop work only on a cluster, or does it work on a single-node server as well?
As I came across these terms only very recently, some or all of my questions may be really silly... But I'll be really grateful if you have other suggestions for this type of project.

Here are my short answers
Is this amount of data (4-5 TB) even considered big data? (Some sources said that at least 5 TB of data should use Hadoop; others said big data for Hadoop means zettabytes and petabytes.)
Yes and no. For certain use cases this is not big enough, while for others it is. Questions that should be asked and answered:
Is this data growing? What is the rate of growth?
Are you going to run analytics on this data from time to time?
Does Hadoop ship with its own special database, or should it be used with MySQL etc.?
Partly. Hadoop ships with HDFS, a distributed file system rather than a database. It can store flat files and can be treated as a data repository, but that may not be the best solution. You may want to look at NoSQL databases like Cassandra, HBase, or MongoDB.
Does Hadoop work only on a cluster, or does it work on a single-node server as well?
Technically, yes, Hadoop can run on a single node in pseudo-distributed or standalone mode. But that is used only for learning or for testing during development. For any production environment you should think of Hadoop clusters spanning multiple VMs. The minimum I have seen in production was 6 VMs.
As such, 5 TB is not a very big volume for a relational DB that supports clustering. But the cost of supporting a relational DB rises steeply with capacity, while with Hadoop and just HDFS the cost is very low; adding Cassandra or HBase does not change that much. But remember: simply using Hadoop, you are looking at a high-latency system. If your expectation is that Hadoop will answer your queries in real time, please look for other solutions. (For a query like "list all books checked out to Xyz", just get it from the DB; don't use Hadoop for that query.)
Overall, my suggestion is to take a crash course on Hadoop from YouTube or Cloudera, gain some understanding of what Hadoop is and what it is not, and then decide. Your questions give the impression that you have a long learning curve ahead, and it is worth taking on that challenge.

This should be a comment, but it is too long.
Hadoop is a framework for writing parallel software, originally developed largely at Yahoo. It is loosely based on the MapReduce framework developed at Google in the early 2000s, which in turn was inspired by the map and reduce primitives of functional languages such as Lisp. You can think of Hadoop as a bunch of libraries that run either on hardware you own or on hardware in the cloud. These libraries provide a programming interface to Java and to other languages. It allows you to take advantage of a cluster of processors and disks (with HDFS). Its major features are scalability and fault tolerance, both very important for large data problems.
Hadoop implements a programming methodology built around a parallel implementation of map-reduce. That was the original application. Nowadays, lots of things are built on Hadoop. You should start with the Apache project description and the Wikipedia page to learn more.
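To make the map-reduce idea concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration, and a real Hadoop job would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, which a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the list of values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data is big", "data about books"]))
```

Because each call to `map_phase` depends only on its own document, and each key in `reduce_phase` is handled independently, both steps can be spread over many machines. That independence is what the framework exploits.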
Several databases support interfaces to Hadoop (Aster Data comes to mind). Often when one thinks of "databases" and "Hadoop", one is thinking of Pig or Hive or some related open-source project.
As for your question: if your data conforms naturally to a relational database (tables with columns connected by keys), then use a relational database. If you need fast performance on web applications with hierarchical data, then learn about NoSQL solutions such as MongoDB. If your data has a complex structure and requires scalability, and you have programming skills on your team, then think about a Hadoop-based component to the solution. And for a large project, multiple technologies are often needed for different components: real-time operations using NoSQL, reporting using SQL, ad hoc querying using a combination of SQL and Hadoop (for instance).

Related

Mongo, Express vs mySQL

I'm currently considering what to use for building a piece of software. The system needs to handle complexity like:
- User Management (ex: Trainer Login - Client login)
- Different dashboards (Depending on user profile)
- Workout Builder (Trainer must be able to create workout programs and send(email) and attach (Client can see workout program in system) the program to a client)
- Diet Plans (much like the above)
- Workout Library
- Booking/Calendar (Client should be able to book a trainer)
- Training Logs etc...
As you can see, there would be a lot of relations/bindings, personalization (dashboards), etc. I think you get the idea :) However, I'm a frontend developer. I do have PHP and MySQL experience (although from a long time ago). So the question is: is it possible to build this system completely with, for example, Angular, Express, Mongo and Node? Or would I have to depend on a database system like MySQL and use, for example, PHP for the system?
Thanks in advance for any answers :)

In my opinion, if your hands-on experience with PHP and MySQL is good enough, you should go ahead and build your application with PHP and MySQL, with MongoDB as an additional database.
I understand that the MEAN stack can power your complete app, but the development time would be longer. From what I have seen while using MongoDB over petabytes of data, MongoDB is great for storing complex data in a flat architecture at massive scale. But just like all databases, MongoDB has certain constraints.
You should go with MySQL for your usual login credentials and minor activities, and use MongoDB for storing diet plans and workout libraries, because that gives you the flexibility of a varying document structure and high availability. Over time you will find MongoDB easier to work with than MySQL.
Using the MEAN stack is great, but I now prefer a mixed architecture of MySQL, MongoDB, and PostgreSQL. If you are going to use a framework, it will probably have ACL support built in or available as an add-on, and that could help you with building user permissions and roles.
Also, if you are using MongoDB, make sure you code according to the storage engine (MMAPv1 or WiredTiger). I had to do a major recoding because of the storage engine changes. Just a heads up!

Yes, it is possible to build this on a pure JavaScript stack like MEAN: MongoDB, Express, Angular, Node.js.
Everything that MySQL does, MongoDB can do also. The question is only in proper database design and performance for specific use cases.

Good NoSQL database for Write-Read Intensive Site

OK, I have a small messaging site for my client. It is more of a post-comment system (created in PHP). Now my client wants a system that can comment on an existing comment, plus some features like liking and tagging. Another thing is that the existing system is heavily used by my client in his company, as they use it like a Skype chat (that makes it write-read intensive). My client wants to use open source software as much as possible, so I used MySQL Community Edition.
Enough about my story... I spent a week researching NoSQL databases and found them right for my requirements, as my client wants to keep adding features (which means adding columns and tables from time to time). These are the NoSQL database systems that caught my eye (if you can suggest other NoSQL database systems, that's OK too):
MongoDB
CouchDB
Redis
Now my question is: which of the three is good for my situation? I also read some bad things about these three NoSQL databases:
MongoDB is crappy in its 2.x version
CouchDB is slow (my client doesn't want slow)
Redis is memory-based, so it only writes to disk at certain intervals; if the system crashes in the middle of an interval, the data is lost
I would like some opinions about this, and any advice that can help me cope with my upcoming situation.

MongoDB is a popular solution to this, and my personal preference. The great thing about Mongo (besides being schemaless) is that you can have nested/embedded documents. So for example, you can have a comment which has an array of sub-comments which each have their own arrays of sub-comments. I don't know of any other datastore that has that feature. It's also fast.
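As a rough illustration of the nested-document idea, here is what such a comment tree could look like as a plain Python dictionary, which is close to how a MongoDB document is represented in client code. The field names (`author`, `text`, `likes`, `tags`, `replies`) and the helper functions are hypothetical, and nothing here talks to a real database.

```python
def new_comment(author, text):
    """Build one comment document; 'replies' holds embedded sub-comments."""
    return {"author": author, "text": text, "likes": 0, "tags": [], "replies": []}


def add_reply(comment, author, text):
    """Embed a reply directly inside its parent comment and return it."""
    reply = new_comment(author, text)
    comment["replies"].append(reply)
    return reply


def count_comments(comment):
    """Count a comment plus all of its nested replies, at any depth."""
    return 1 + sum(count_comments(reply) for reply in comment["replies"])


if __name__ == "__main__":
    root = new_comment("alice", "Release is scheduled for Friday.")
    first = add_reply(root, "bob", "Is the build green?")
    add_reply(first, "alice", "Yes, as of this morning.")
    first["likes"] += 1
    first["tags"].append("build")
    print(count_comments(root))
```

One caveat worth checking before committing to this shape: MongoDB limits a single document to 16 MB, so a very busy thread may be better stored as separate comment documents that reference their parent.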
CouchDB has some nice features, but Mongo is so similar and much better.
Redis is very different from the other two. It's used mostly as an alternative to memcached, so it's primarily used for temporary data, although it also has some nice pubsub features built in. A lot of people use both MongoDB and Redis, but for different things.

Php/MySQL to ASP.NET/SQL Server, Suggest if its worth the trouble

We have been using PHP/MySQL for our web application, which has been growing a lot. The database is around 4-5 GB, and one of the tables sometimes reaches 2 GB, slowing things down whenever a query against that table runs.
Should we just try to optimize, or are we using MySQL above its limit? Will switching our web app to .NET/SQL Server resolve the issues?

You're going to get a lot of very passionate responses to this.
PHP is, from a code and performance standpoint, very similar to classic ASP. ASP.NET v1 was, according to many, many benchmarks available via your favorite search engine, 3x-5x faster than classic ASP. Draw your own conclusions.
I feel that MSSQL is a superior database solution. If you're stuck with open source, at least look at Postgres. It's less popular but very powerful.
To answer your real question: performance is a function of your toolset and platform choice, but also of developer skill and project structure. I've seen far more projects that could benefit from some healthy refactoring and optimization than I have that are limited by the platform in which they are written. It is rarely worthwhile to rewrite a large application in a completely different language. Instead, I would focus on improving your existing codebase, and looking for ways to incrementally upgrade to a platform like ASP.NET.

Also keep in mind that switching will require you to move to IIS on Windows Server, and there will most likely be more cost involved. There are a lot of considerations when thinking about a switch like this.
I say if the application calls for it, work it out.

You certainly aren't using MySQL above its limit. But you should consider benchmarking your database queries on MSSQL to see if you notice a huge improvement.
There are many factors involved here: your code base, database optimisations and changes to table structure, server spec... They all contribute independently.
Are there any particularly slow queries, or is it running slow across the application? Can any caching be implemented here? Do you have proper indexes?
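To show what checking for proper indexes can look like, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. The `orders` table, its columns, and the index name are invented for the example. MySQL has its own `EXPLAIN` statement for the same purpose, though its output format differs.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def demo():
    """Compare the plan for the same query before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 100, i * 1.5) for i in range(1000)],
    )
    sql = "SELECT total FROM orders WHERE customer_id = ?"

    before = query_plan(conn, sql, (42,))  # no index yet: full table scan
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = query_plan(conn, sql, (42,))   # the planner can now use the index

    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before:", before)
    print("after: ", after)
```

If the plan for a slow query on the 2 GB table shows a full scan on a column used in `WHERE` or `JOIN`, adding an index there is usually a far cheaper first step than changing platforms.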

What is optimal hardware configuration for heavy load LAMP application

I need to run a Linux-Apache-PHP-MySQL application (the Moodle e-learning platform) for a large number of concurrent users; I am aiming for 5000 users. By concurrent I mean that 5000 people should be able to work with the application at the same time. "Work" means not only database reads but writes as well.
The application is not very typical, since it does a lot of inserts/updates on the database, so caching techniques are not helping too much. We are using the InnoDB storage engine. In addition, the application is not written with performance in mind. For instance, one Apache thread usually occupies about 30-50 MB of RAM.
I would be grateful for information on what hardware is needed to build a scalable configuration that is able to handle this kind of load.
Right now we are using two HP DLG 380 servers with two 4-core processors each, which are able to handle a much lower load (typically 300-500 concurrent users). Is it reasonable to invest in boxes of this kind and build a cluster from them, or is it better to go with more high-end hardware?
I am particularly curious about:
- how many and how powerful servers are needed (number of processors/cores, size of RAM)
- what network equipment should be used (what kind of switches, network cards)
- any other hardware, like particular disk storage solutions, etc., that is needed
Another thing is how to put everything together, that is, what the optimal architecture is. Clustering with MySQL is rather hard (people are complaining about MySQL Cluster, even here on Stack Overflow).

Once you get past the point where a couple of physical machines aren't giving you the peak load you need, you probably want to start virtualising.
EC2 is probably the most flexible solution at the moment for the LAMP stack. You can set up their VMs as if they were physical machines, cluster them, spin them up as you need more compute-time, switch them off during off-peak times, create machine images so it's easy to system test...
There are various solutions available for load-balancing and automated spin-up.
If you can make your app fit, you can get use out of their non-relational database engine as well. At very high loads, relational databases (and MySQL in particular) don't scale effectively. The peak load of SimpleDB, BigTable and similar non-relational databases can scale almost linearly as you add hardware.
Moving away from a relational database is a huge step, though. I can't say I've ever needed to do it myself.

I'm not so sure about hardware, but from a software point-of-view:
With an efficient data layer that caches objects and collections returned from the database, I'd say a standard master-slave configuration would work fine. Route all writes to a beefy master and all reads to slaves, adding more slaves as required.
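The routing rule described above can be sketched in a few lines. This is only an illustration: `ReadWriteRouter` is a hypothetical helper, the "connections" are plain strings standing in for real database handles, and the statement check is deliberately naive.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across replicas (round-robin)."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, replicas):
        self.master = master
        # With no replicas configured, every statement falls back to the master.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def pick(self, sql):
        """Return the connection that should execute this SQL statement."""
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb in self.WRITE_VERBS or self._replicas is None:
            return self.master
        return next(self._replicas)


if __name__ == "__main__":
    router = ReadWriteRouter("master", ["replica-1", "replica-2"])
    print(router.pick("INSERT INTO grades VALUES (1, 90)"))
    print(router.pick("SELECT * FROM grades"))
    print(router.pick("SELECT * FROM courses"))
```

One thing to keep in mind with this layout is replication lag: a read sent to a replica right after a write may not see that write yet, so reads that must be fresh are often sent to the master as well.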
Cache data as objects returned from your data-mapper/ORM, not as HTML, and use Memcached as your caching layer. If you update an object, write to the DB and update it in Memcached; the IdentityMap pattern works best for this. You'll probably need quite a few Memcached instances, although you could get away with running these on your web servers.
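A minimal sketch of the "write to the DB and update the cache" step follows. It is a simple write-through cache rather than a full IdentityMap implementation, and both the database and Memcached are replaced by plain dictionaries so that it runs on its own. `CachedRepository` and its methods are hypothetical names.

```python
class CachedRepository:
    """Read through a cache; on save, write to the database and refresh the cache."""

    def __init__(self, database, cache):
        self.database = database  # dict standing in for MySQL
        self.cache = cache        # dict standing in for Memcached
        self.db_reads = 0         # counts how often we had to hit the database

    def get(self, key):
        """Return the cached object if present, otherwise load and cache it."""
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def save(self, key, value):
        """Write to the database first, then update the cached copy."""
        self.database[key] = value
        self.cache[key] = value


if __name__ == "__main__":
    repo = CachedRepository({"user:1": {"name": "Piotr"}}, {})
    repo.get("user:1")  # first read goes to the database
    repo.get("user:1")  # second read is served from the cache
    repo.save("user:1", {"name": "Piotr", "course": "Moodle 101"})
    print(repo.get("user:1"), repo.db_reads)
```

Updating the cache on every save keeps readers from seeing stale objects, at the cost of one extra cache write per database write.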
We could never get MySQL clustering to work properly.
Be careful with the SQL queries you write and you should be fine.

Piotr, have you tried asking this question on moodle.org yet? There are a couple of similarly scoped installations whose staff members currently answer questions there.
Also, depending on what your timeframe for deployment is, you might want to check out the Moodle 2.0 line rather than the Moodle 1.9 line. It looks like there are a bunch of good fixes for some of the issues with Moodle's architecture in that version.
Also: memcached rocks for this. PHP acceleration rocks for this. Server Fault is probably the better Stack Exchange site for this question, though.

Which is faster, python webpages or php webpages? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Which is faster, python webpages or php webpages?
Does anyone know how the speed of Pylons (or any of the other frameworks) compares to a similar website made with PHP?
I know that serving a Python-based webpage via CGI is slower than PHP because of its long startup every time.
I enjoy using Pylons and I would still use it if it were slower than PHP. But if Pylons were faster than PHP, I could maybe, hopefully, eventually convince my employer to allow me to convert the site over to Pylons.

It sounds like you don't want to compare the two languages, but that you want to compare two web systems.
This is tricky, because there are many variables involved.
For example, Python web applications can take advantage of mod_wsgi to talk to web servers, which is faster than any of the typical ways that PHP talks to web servers (even mod_php ends up being slower if you're using Apache, because Apache can only use the Prefork MPM with mod_php rather than a multi-threaded MPM like Worker).
There is also the issue of code compilation. As you know, Python compiles source code to byte code and caches it in .pyc files for imported modules, recompiling only when the source file changes. Therefore, after the first import of a Python module, the compilation step is skipped and the interpreter simply loads the cached .pyc file. Because of this, one could argue that Python has a native advantage over PHP. However, optimizers and caching systems can be installed for PHP websites (my favorite is eAccelerator) to much the same effect.
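The byte-code caching step can be observed directly with the standard library. The sketch below writes a throwaway module to a temporary directory and compiles it with `py_compile`; the module name `hello_mod.py` and the helper `compile_to_pyc` are made up for the example.

```python
import py_compile
import tempfile
from pathlib import Path


def compile_to_pyc(source_text):
    """Write source to a temporary module and return the path of its .pyc file."""
    workdir = Path(tempfile.mkdtemp())
    source_file = workdir / "hello_mod.py"
    source_file.write_text(source_text)
    # py_compile.compile returns the path of the byte-code file it wrote,
    # by default inside a __pycache__ directory next to the source.
    pyc_path = py_compile.compile(str(source_file), doraise=True)
    return Path(pyc_path)


if __name__ == "__main__":
    pyc = compile_to_pyc("GREETING = 'hello'\n")
    print(pyc.name, pyc.stat().st_size, "bytes")
```

The interpreter does the same thing automatically the first time a module is imported, and reuses the cached file on later imports as long as the source has not changed.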
In general, enough tools exist that each language can do pretty much everything the other can do. Of course, as others have mentioned, there's more than just speed involved in the business case for switching languages. We have an app written in OCaml at my current employer, which turned out to be a mistake because the original author left the company and nobody else wants to touch it. Similarly, the PHP web community is much larger than the Python web community; website hosting services are more likely to offer PHP support than Python support; etc.
But back to speed. You must recognize that the question of speed here involves many moving parts. Fortunately, many of these parts can be independently optimized, affording you various avenues to seek performance gains.

There's no point in attempting to convince your employer to port from PHP to Python, especially not for an existing system, which is what I think you implied in your question.
The reason for this is that you already have a (presumably) working system, with an existing investment of time and effort (and experience). To discard this in favour of a trivial performance gain (not that I'm claiming there would be one) would be foolish, and no manager worth his salt ought to endorse it.
It may also create a problem with maintainability, depending on who else has to work with the system, and their experience with Python.

I would assume that PHP (>5.5) is faster and more reliable for complex web applications because it is optimized for website scripting.
Many of the benchmarks you will find on the net are only made to prove that the author's favoured language is better. But you cannot compare two languages with a mathematical task run X times. For a real benchmark you need two comparable frameworks with hundreds of classes/files and a web application serving 100 clients at once.

PHP and Python are similar enough not to warrant any kind of switching.
Any performance improvement you might get from switching from one language to the other would be vastly outweighed by what you gain from not spending money on converting the code (you don't code for free, right?) and just buying more hardware instead.

It's about the same. The difference shouldn't be large enough to be the reason to pick one or the other. Don't try to compare them by writing your own tiny benchmarks ("hello world") because you will probably not have results that are representative of a real web site generating a more complex page.

If it ain't broke don't fix it.
Just write a quick test, but bear in mind that each language will be faster than the other with certain functions.

You need to be able to make a business case for switching, not just that "it's faster". If a site built on technology B costs 20% more in developer time for maintenance over a set period (say, 3 years), it would likely be cheaper to add another webserver to the system running technology A to bridge the performance gap.
Just saying "we should switch to technology B because technology B is faster!" doesn't really work.
Since Python is far less ubiquitous than PHP, I wouldn't be surprised if hosting, developer, and other maintenance costs for it (long term) would have it fit this scenario.

An IS organization would not ponder this unless availability was becoming an issue.
If that is the case, look into replication, load balancing and lots of RAM.

The only right answer is "It depends". There are a lot of variables that can affect performance, and you can optimize many things in either situation.

I had to come back to web development at my new job, and if not for Pylons/Python, maybe I would have chosen to live in the jungle instead :) In my subjective opinion, PHP is for kindergarten. I did it in my 3rd year of uni and, I believe, many self-respecting (or over-estimating) software engineers will not want to be bothered with PHP code.
Why did my employers agree? We (the team) just switched to Python, and they did not have much to say. The website still is and will be PHP, but we are developing other applications, including web applications, in Python. Advantages of Pylons? You can integrate your Python libraries into the web app, and that is, IMHO, a huge advantage.
As for performance, we are still having troubles.
