Situation:
For a web shop, I want to build paged product lists - and filters on these lists - using Elasticsearch. I want to entirely bypass the PHP/MySQL server on which the application runs and communicate with Elasticsearch directly from the customer's browser through AJAX calls. Advantages are:
A large portion of the load on the PHP/MySQL server will be handled by the ES cluster instead
CDN opportunities (scaling!)
Problem:
This approach would take a massive load off our backend server, but it creates a few new issues. Anonymous users will generate lots of requests, and we need some control over those:
Traffic control:
How do we defend against malicious users making lots of calls and scanning/downloading our entire product catalogue that way (e.g. competitors scraping pricing information)?
How can I block IPs that have been identified (somehow) as behaving badly?
Access control:
How to make sure the frontend can only make the queries we want to allow?
How to make sure customers only see a selection of the result fields and can't get any data out of ES that's not intended for them?
It's essential not to have a single machine somewhere taking care of all this, because that would just recreate a single point responsible for handling everything. I want to take real advantage of the ES cluster without introducing middleware that has to deal with the scaling problem as well.
We also don't want to be fully dependent on a third party; we're looking for a solution with some flexibility regarding the partners we work with (e.g. the ability to switch between Elastic and AWS).
Possible solutions or partial solutions:
I've been looking at a few 'Elasticsearch as a service' options, but I'm not confident about their quality or whether they can solve the issues mentioned above:
www.elastic.co/found: their premium solution has a 'Shield' service, which does not seem to cover all of the cases mentioned above (only IP blocking, as far as I can tell). There is, however, a custom plugin (https://github.com/floragunncom/search-guard) that can filter result fields and provides a way to do user management etc. This seems like a reasonable option, but it is expensive and ties the application to the 'found' product; we should be able to switch partners should the need arise.
The Amazon AWS Elasticsearch service has basic IAM support, and it's possible to put CloudFront in front of it, but it does not provide any access control beyond that.
Installing a separate L7 application filtering solution for detecting scrapers etc.
Question:
Is there anyone out there who has this type of approach working and found a good setup that tackles all of these issues?
The first thing I would recommend is restricting access to your Elasticsearch instance behind a security group and only allowing the application server's IP address to reach the ports it actually needs: 9200 and 9300 are the ports used by Elasticsearch (HTTP API and node transport), and port 22 only needs to be open for SSH administration.
As for protecting against scraping, there is no absolute solution. However, if your aim is simply to limit the load these scrapers put on your application server and ES instance, you can take a look at https://github.com/davedevelopment/stiphle, which is aimed at rate limiting users. The example on its page limits users to 5 requests a second, which seems very reasonable for the average user and could be lowered even further if need be, making scraping a time-consuming effort.
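As a rough illustration of that kind of throttle (not stiphle's actual API; just a minimal per-IP counter using the APCu extension, with made-up key names and limits):

```php
<?php
// Minimal per-IP rate limiter: allow at most $limit requests per $window seconds.
// Assumes the APCu extension; with multiple web servers, use memcached/Redis instead.
function allowRequest($ip, $limit = 5, $window = 1)
{
    $key = 'rate:' . $ip . ':' . (int) floor(time() / $window);

    apcu_add($key, 0, $window);  // create the counter if it doesn't exist yet
    $count = apcu_inc($key);     // atomically increment it

    return $count !== false && $count <= $limit;
}

if (!allowRequest($_SERVER['REMOTE_ADDR'])) {
    http_response_code(429);     // Too Many Requests
    exit;
}
```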
Related
I'm building a web server for my application (PHP) with EB (Elastic Beanstalk).
I'm getting confused about scaling triggers.
I know what they are and how they work.
I'd like to know what the best configuration for a web server is.
My application is RESTful, and the server runs the backend.
It only returns JSON data from the database (it doesn't work with images or anything like that),
so I think it'll use more RAM than CPU.
What kind of configuration do you use on your servers?
NetworkIn or NetworkOut? How do I measure what my server can handle?
My current configuration:
Environment type: Load balanced, auto scaling
Number instances: 1 - 10
Scale based on Average CPUUtilization
Add instance when > 60
Remove instance when < 20
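For reference, here is a sketch of roughly how that configuration can be expressed in an .ebextensions config file; the namespaces and option names are the standard Elastic Beanstalk auto scaling settings, and the values simply mirror the list above:

```yaml
# .ebextensions/autoscaling.config -- sketch only; values mirror the settings above
option_settings:
  aws:autoscaling:asg:
    MinSize: 1
    MaxSize: 10
  aws:autoscaling:trigger:
    MeasureName: CPUUtilization
    Statistic: Average
    Unit: Percent
    UpperThreshold: 60
    UpperBreachScaleIncrement: 1
    LowerThreshold: 20
    LowerBreachScaleIncrement: -1
```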
This depends entirely on your specific scenario, so my configuration may not be ideal for yours. Still, stick to some conventional rules when doing this. If I were you, I would first check the availability of the configuration: whether instances are marked healthy once they are launched, and whether the auto scaling group launches and removes instances correctly when the CloudWatch triggers fire. Sometimes this is a tradeoff around the CPU percentage, and you need to adjust the thresholds up or down; if they are not set correctly, your auto scaling group ends up launching and removing instances constantly.
Also, check whether scaling up is a better choice than scaling out in your scenario; sometimes it is simply better to use more powerful instances than to add more small ones.
If you stick to these rules of thumb, you can be reasonably sure your configuration is a stable one.
In terms of security, if this is a web server application, consider whether you need an extra tier such as a WAF layer: whether you want it as a separate layer, in a separate VPC that receives traffic, analyses it and then forwards it to a private ELB in the peered VPC, or simply combined with your instances.
Also consider using HTTP/HTTPS listeners on the ELB rather than TCP: with HTTP/HTTPS listeners the ELB terminates the client connection and opens a separate connection to the backend instances, which shields them from SYN flood attacks.
CloudFront can help as well, since it scales with traffic, reducing the risk of DoS-induced unavailability for your application. There are many other tricks you can pick up from the documentation and from http://en.clouddesignpattern.org/index.php/Main_Page.
Good luck!
I'm trying to figure out whether my current approach will lead me into performance issues in the future, before developing further with this design, and whether there are better ways of doing this. I think this makes the most sense if I provide some context on the design first:
Current Design
I currently have my environment designed with two separate servers, let's call them frontend and backend.
Frontend
This server is open to the world. Customers access this site to view our product, make purchases, and will soon be able to view their account related information.
Backend
This server is where all information is held in a database.
Communication
The only way that the frontend currently needs access to the backend, is when the user authenticates with their license and downloads our product. To do this, the frontend calls a PHP script, which sends a JSON request to the backend server via curl_exec. The response from the backend tells the frontend how to handle that download request (e.g. license invalid).
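As a sketch, that call looks roughly like this on the frontend (the endpoint URL, payload fields, and response fields are made up for illustration):

```php
<?php
// Frontend-side call to the backend licence API (hypothetical endpoint and fields).
$payload = json_encode(['license' => $licenseKey, 'product' => $productId]);

$ch = curl_init('https://backend.internal/api/license/check');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,   // return the response instead of printing it
    CURLOPT_TIMEOUT        => 5,
]);

$response = curl_exec($ch);
curl_close($ch);

$result = json_decode($response, true);
if ($result === null || empty($result['valid'])) {
    // e.g. licence invalid -> deny the download
}
```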
Reasoning
The reason for this design is to avoid exposing the backend details to the user. Client-side, all the user sees is a request being sent to the frontend server. If the frontend server is ever compromised, anyone reading through how the frontend is built has no access to the backend DB, unless they know exactly what parameters to send to the backend API. Even then, it only gives access to a very low subset of information, depending on what the API exposes.
The Problem
The only time this cross-server communication happens right now is when a user tries to download our products using their license details, so the traffic through this API between the two servers is relatively low.
My concerns are that I want to build a user "control panel". From here they can log in with their license/account, they can view their active licenses, access details on previous orders they made, etc. This already means all these pieces of information are only available through the backend, so I'll need to expose them through the API - which is fine. The issue here is that every request the user makes through the control panel (even just refreshing the page) will build up a lot of traffic between both servers.
Questions
From the experience of developers here, is this communication design scalable? I'm worried I'm building around a bottleneck, which will just result in a slow user interface, since the frontend would end up waiting on a lot of requests it tunnelled through to the backend.
What are your thoughts? Has anyone faced a similar challenge? How did you overcome that challenge? What is the best practice to achieving this kind of requirement? I hope this question doesn't come across as too vague.
I would love to hear other answers but I will share my thoughts.
First, let's call your servers:
Application Server
Database Server
It seems that you are worried about creating a bottleneck due to an increase in the amount of database queries. Since you mentioned that these queries would execute after a page refresh, it's clear that you are not using a cache of any sort. If you could cache the database queries and invalidate the cache only if the data has changed (i.e. the user's actions cause the data to change, so the cache should be cleared) then you will increase performance drastically.
If anyone gains access to your application server, they will most likely be able to access the database server with the user that you've allowed the application server to use. You should give this user as few permissions as necessary to use the API. Still, they may be able to access a lot, depending on what your API allows and what you have cached on the application server.
Take a look at Laravel's cache API, which allows you to use your cache in place of a database query. If the cache entry does not exist, the database query is executed and its result cached; you then delete the relevant cache entries based on user actions. You can also asynchronously re-cache database requests so you don't delay a response to the client when the data is not needed for that response.
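A rough sketch of that pattern with Laravel's Cache facade (the Order model, key name, and TTL are made up; the point is "serve from cache, fall back to the query, forget the key when the data changes"):

```php
<?php
use Illuminate\Support\Facades\Cache;

// Serve the user's orders from cache; run the query only on a cache miss.
// TTL is in seconds on recent Laravel versions (minutes on older ones).
$orders = Cache::remember("user:{$userId}:orders", 600, function () use ($userId) {
    return Order::where('user_id', $userId)->get();
});

// Later, when the user places a new order, invalidate the cached copy.
Cache::forget("user:{$userId}:orders");
```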
I hope this helps.
UPDATE:
After discussing with you further, I better understand your dilemma. You are trying to increase the security of your application by requiring all API calls to go through an extra step of being initiated after a POST request. I agree that this is going to be a bottleneck as the application scales since you won't be able to take advantage of caching and every page request will result in database queries.
What I have done in a similar case is to separate the application server and database server, except the database server is literally only a database server without any logic/scripts. PHP, for example, is not even installed on the database server. Database servers and application servers are only connected via private networking, so database servers are only accessible via the application server. A restricted user has been set up to use the remote database.
Since my database queries take a lot of time, I cache as much as possible.
Also consider using https://cloudflare.com. It is a reverse proxy in front of the application server, which adds another layer between the client (browser) and your application server. This way, only Cloudflare talks to your application server, and only your application server has access to your database server via the restricted database user you create.
I'm no expert on databases, but using prepared statements would help you a lot. They are more secure, and the best part is:
"Bound parameters minimize bandwidth to the server as you need send only the parameters each time, and not the whole query"
Hope it helps!
I am building a web-application and have a couple of quick questions. From what I learnt, one should not worry about scalability when initially building the app and should only start worrying when the traffic increases. However, this being my first web-application, I am not quite sure if I should take an approach where I design things in an ad-hoc manner and later "fix" them. I have been reading stories about how people start off with an app that gets millions of users in a week or two. Not that I will face the same situation but I can't help but wonder, how do these people do it?
Currently, I bought a shared hosting account on Lunarpages and that got me started in building and testing the application. However, I am interested in learning how to build the same application in a scalable-manner using the cloud, for instance, Amazon's EC2. From my understanding, I can see a couple of components:
There is a load balancer that first receives requests and then decides where to route each request
This request is then handled by a server replica that then processes the request and updates (if required) the database and sends back the response to the client
If a similar request comes in, then a caching mechanism like memcached kicks in and returns objects from the cache (see the sketch after this list)
A black box that handles database replication
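To make the caching step concrete, here is a small sketch using PHP's Memcached extension (the key name and the DB helper are made up):

```php
<?php
// Check memcached before hitting the database; cache the result on a miss.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$key = 'product:' . $productId;
$product = $memcached->get($key);

if ($product === false && $memcached->getResultCode() === Memcached::RES_NOTFOUND) {
    $product = loadProductFromDatabase($productId); // hypothetical DB lookup
    $memcached->set($key, $product, 300);           // cache for 5 minutes
}
```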
Specifically, I am trying to do the following:
Setting up a load balancer (my homework revealed that HAProxy is one such load balancer)
Setting up replication so that databases can be synchronized
Using memcached
Configuring Apache to work with multiple web servers
Partitioning the application to use Amazon EC2 and Amazon S3 (my application is something that will need a great deal of storage)
Finally, how can I avoid burning myself when using Amazon services? Because this is just a learning phase, I can probably do with 2-3 servers, a simple load balancer, and replication, but I want to avoid accidentally paying loads of money.
I am able to find resources on individual topics but am unable to find something that starts off from the big picture. Can someone please help me get started?
Personally, I think you should be considering how your app will scale initially - as otherwise you'll run into problems down the line.
I'm not saying you need to build it initially as a multi-server system, but if you think you'll need to do it later, be mindful of the concerns now.
In my experience, this includes things like:
Sessions. Unless you use 'sticky' load balancing, you will have to have some way of sharing session state between servers. This probably means storing session data either on shared storage or in a DB.
File uploads and replication. If you allow users to upload files, or you have a CMS that allows you to upload images/documents, it needs to cater for the fact that these files will also need to find their way onto other nodes in your cluster. However, if you've gone down the shared storage route mentioned above, this should cover it.
DB scalability. If you're using traditional DB servers, you might want to think about how you'll implement scalability at that level. This may mean coding your app so you use one connection string for reads and another for writes (a minimal sketch follows this list). Then you are free to implement replication, with one master node handling the inserts/updates and cascading the changes to read-only nodes that handle the bulk of the read work.
Middleware. You might even want to go down the route of implementing some kind of message-oriented middleware solution to completely hand off business logic functions - this will give you a great deal of flexibility in how you scale the business logic layer in the future, although initially it is a lot of complication and work for not much payoff.
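Here is the bare-bones read/write split mentioned above (host names and credentials are placeholders; a real setup would add failover and connection reuse):

```php
<?php
// One connection string for writes (master), another for reads (replica).
function dbWrite()
{
    return new PDO('mysql:host=db-master.internal;dbname=app', 'app_rw', 'secret');
}

function dbRead()
{
    return new PDO('mysql:host=db-replica.internal;dbname=app', 'app_ro', 'secret');
}

// Writes go to the master...
dbWrite()->prepare('UPDATE users SET last_login = NOW() WHERE id = ?')->execute([$userId]);

// ...while the bulk of the read traffic goes to the replica(s).
$stmt = dbRead()->prepare('SELECT id, email FROM users WHERE id = ?');
$stmt->execute([$userId]);
$user = $stmt->fetch(PDO::FETCH_ASSOC);
```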
Have you considered playing around with VMs first? You can run 2-3 VMs on your local machine and set them up like you would actual servers, they just won't be able to handle real traffic levels. If all you're looking for is the learning experience, it might be an ideal way to go about it.
I am developing an iPhone app and would like to create some sort of RESTful API so different users of the app can share information/data. To create a community of sorts.
Say my app is some sort of game, and I want the user to be able to post their highscore on a global leaderboard as well as maintain a list of friends and see their scores. My app is nothing like this but it shows the kind of collective information access I need to implement.
The way I could implement this is to set up a PHP and MySQL server and have a php script that interacts with the database and mediates the requests between the DB and each user on the iPhone, by taking a GET request and returning a JSON string.
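That mediating script can be very small. A sketch for the leaderboard example (table and parameter names are made up):

```php
<?php
// leaderboard.php -- minimal GET endpoint returning JSON from MySQL.
header('Content-Type: application/json');

$pdo = new PDO('mysql:host=localhost;dbname=game', 'app', 'secret');

// Cast and cap the page size so it is safe to interpolate into the query.
$limit = isset($_GET['limit']) ? max(1, min((int) $_GET['limit'], 100)) : 10;

$stmt = $pdo->query("SELECT player, score FROM highscores ORDER BY score DESC LIMIT $limit");
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));
```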
Is this a good way to do it? Seems to me like using PHP is a slow way to implement this as opposed to say a compiled language. I could be very wrong though. I am trying to keep my hosting bills down because I plan to release the app for free. I do recognise that an implementation that performs better in terms of CPU cycles and RAM usage (e.g. something compiled written in say C#?) might require more expensive hosting solutions than say a LAMP server so might actually end up being more expensive in terms of $/request.
I also want my implementation to be scalable in the rare case that a lot of people start using the app. Does the usage volume shift the performance/$ ratio towards a different implementation? I.e. if I have 1k request/day it might be cheaper to use PHP+MySQL, but 1M requests/day might make using something else cheaper?
To summarise, how would you implement a (fairly simple) remote database that would be accessed remotely using HTTP(S) in order to minimise hosting bills? What kind of hosting solution and what kind of platform/language?
UPDATE: per Karl's suggestion I tried: Ruby (language) + Sinatra (framework) + Heroku (app hosting) + Amazon S3 (static file hosting). To anyone reading this who might have the same dilemma I had, this setup is amazing: effortlessly scalable (to "infinity"), affordable, easy to use. Thanks Karl!
Can't comment on DB specifics yet because I haven't implemented that yet although for my simple query requirements, CouchDB and MongoDB seem like good choices and they are integrated with Heroku.
Have you considered using Sinatra and hosting it on Heroku? This is exactly what Sinatra excels at (REST services). And hosting with Heroku may be free, depending on the amount of data you need to store. Just keep all your supporting files (images, javascript, css) on S3. You'll be in the cloud and flying in no time.
This may not fit with your PHP desires, but honestly, it doesn't get any easier than Sinatra.
It comes down to a tradeoff between cost and experience.
If you have the expertise, I would definitely look into some form of cloud-based infrastructure, something like Google App Engine. Which cloud platform you go with depends on what experience you have with different languages (App Engine only works with Python/Java, for example). Generally though, scalable cloud platforms have more "gotchas" and need more know-how, because they are specifically tuned for high-end scalability (and thus require knowledge of enterprise-level concepts in some cases).
If you want to be up and running as quickly and simply as possible I would personally go for a CakePHP install. Setup the model data to represent the basic entities you are managing, then use CakePHP's wonderful convention-loving magic to expose CRUD updates on these models with ease!
The technology you use to implement the REST services will have a far less significant impact on performance and hosting costs than the way you use HTTP. Learning to take advantage of HTTP is far more than simply learning how to use GET, PUT, POST and DELETE.
Use whatever server side technology you already know and spend some quality time reading RFC2616. You'll save yourself a ton of time and money.
In your case it's the database server that's accessed on each request, so even if you use a compiled language (say C# or Java) it won't matter much (unless you are doing some data transformation or processing).
So the DB server has to scale well; your choice of language and DB should be configured well for the host OS.
In short, PHP+MySQL is fine if you are sending/receiving JSON strings and storing/retrieving them in the DB with minimal data processing.
Later, if your app gets popular and your data doesn't require frequent updates, you can move it to a highly scalable database like MongoDB (which is JSON friendly).
I have a dedicated server and I need to build a new version of my personal PHP5 CMS for my customers. Setting aside the question of whether I should consider using open source, I need your opinions regarding CMS architecture.
The first approach (since the server is completely under my control) is to build a centralized system that can support multiple sites from a single administration panel. The basic idea is that I can log in as a super user, create a new site (technically this creates a new web root and a new database, and maybe some other things), assign modules and plug-ins to a specific customer, or develop new ones if needed. If a customer logs in to this panel, they see and can manage only their own site's content.
I have seen such a system (it was custom built); it's very nice that bug fixes and new features affect all customers instantly, without having to patch every CMS installation, some of which might live on other hosting servers...
The negative aspect I can see is scalability: what if I need to add a second server? How do I merge them while maintaining a single core?
The second approach is the classical one: a stand-alone CMS for every customer.
Which way would you go, and why?
Thank you for your time.
If you were to have one central system for all clients, scalability could become easier. You can have one big database server and several identical web servers (probably behind a load balancer), and that way you don't have to worry about dividing the clients up into different servers. Your resources are pooled so if one client has a day with heavy traffic it can be taken up by several servers, rather than bringing one server (and all other clients' sites on it) to its knees.
You can get PHP sessions to work across multiple servers either by using 'sticky sessions' on your load-balancing configuration, or by getting PHP to store the data somewhere accessible to all servers (e.g. a database).
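As a sketch, pointing PHP's session handler at a shared memcached instance looks roughly like this (requires the memcached extension; the host name is a placeholder):

```php
<?php
// Store sessions in a shared memcached instance so any web server can read them.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'sessions.internal:11211');

session_start();
$_SESSION['last_seen'] = time(); // now visible to every server behind the load balancer
```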
Keeping the web application files synchronised to one code base shouldn't be too difficult, there are tools like rsync that you could use to help you.
It really depends on the types of sites. That said, I would suggest that you consider using version control software to manage multiple installations. In practice, this can give you the same benefits as a centralised approach, while giving you the freedom to postpone updating a single site (or a number of sites).