My client has a host of Facebook pages that have become very successful. In order to move away from big brother Facebook my client wishes to create a large dynamic site that incorporates the more successful parts of the Facebook empire.
One of my client's spin-off sites has already been created and is getting a lot of traffic. I'm not sure exactly how much, but it hit 90 GB in a month and the allocated space needed to be increased.
In any case, my client has dreamed up a massive website with its own community, aiming to bring everything under one banner. However, I am concerned that it will get thrashed: bottlenecks, long load times, etc.
My questions:
Will a managed dedicated server be able to handle a potentially large amount of traffic?
Is it better to create the various parts of the empire with their own separate hosting and domains (normal hosting or VPS), or to keep them all under the one hood (i.e. using sub-domains)?
If they were all together, would that be better for SEO and easier to manage? If they are separate, they may be quicker, but would that require some sort of Passport-style user system so people can log into any of the websites with the same details?
What's the best way to implement a Passport-style user system? Do you connect remotely to the other databases? Run a regular cron job that updates each user's details on each domain? Or make a cURL request to the other sites whenever there is new data?
Any other pros/cons to keeping all the sections together or separating them?
A large site like Facebook manages to keep everything under the one root, while sites like eBay have separate domain names but let you use the same login across all of them.
I'm not sure what the best option is and would appreciate any guidance.
It is a very general question but to give some hints:
1. Measure, measure and measure again. Know which parts are used heavily and which are not.
2. Fix things and go back to 1.
Really: without knowing what takes lots of time, what is used most heavily, etc., you cannot say anything useful.
VPS or dedicated server is not the right question. You start with: what do I have to do for the users? Then: how am I going to do that? (For example: in the database, in scripts, in a message queue.) Only then do you see how much hardware you need.
One or multiple domains doesn't really matter, with one exception: if you have lots of static content, it might be interesting to serve it through a CDN like Amazon's. Read for example http://highscalability.com/blog/2011/12/27/plentyoffish-update-6-billion-pageviews-and-32-billion-image.html, where you can read about what is possible with a CDN.
In general, serving static content from a separate static domain is useful; most other parts of a site don't really need their own domain, so those you could keep under one domain.
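For example, here is a tiny sketch of what that can look like in PHP templates; the host static.example.com and the helper name are invented for illustration:

    <?php
    // Minimal sketch: point all static assets at a dedicated static/CDN
    // domain while the application itself stays on the main domain.
    // The host "static.example.com" is a placeholder.
    function asset_url($path)
    {
        return 'https://static.example.com/' . ltrim($path, '/');
    }

    // In a template:
    echo '<img src="' . asset_url('images/logo.png') . '" alt="logo">';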
Related
I'm creating a simple browser game with online transactions, but I'm wondering: "How can I guarantee that my site won't go down with too many players accessing it?"
I'm asking because I'll pay digital influencers to do the marketing, so I suppose many people will access it...
Should I contract a VPS and run the backend with Node.js, or will pure PHP do a good job of keeping the site up?
Site stability has a lot of different factors. Two main points to consider:
If your site is static HTML and JS files, using a CDN like Cloudflare will provide very strong protection against the site ever going down.
Assuming there's a heavier lift than static files (like DB calls and server-side processing), this ultimately comes down to two factors:
The specs of your server (e.g. ram, CPUs)
The efficiency of your code
Books can be written about how hardware and code can be improved. Ultimately, releasing it into the wild will show you how they handle the load. Good monitoring software (like AppOptics) can give you insight into when you're getting close to a limit and need to upgrade hardware or optimize code.
Practically speaking, if you're not expecting a giant load on day one (which, unless you have a fantastic marketing channel or a lot of followers, you likely won't have), you should be more concerned with building something of value than optimizing it. Optimizing comes later.
I'm currently building a Facebook-like chatbox, and I have encountered several considerations and problems along the way.
I have been googling for useful resources all along, like simple chatbox examples or tutorials online.
My goal is to build one just like the Facebook/Gmail chatbox and CometChat. I know it's hard and there is a lot to scale behind the scenes, but all I want is to build it as simply as possible and to figure out how the Facebook/Gmail chatboxes implement their chat functionality.
Progress:
I have finished the Facebook-like chatbox structure: a sidebar on the right displaying online friends I can chat with, and a popup chatbox at the bottom that can be expanded and minimized.
I have also finished simple chatting based on a MySQL database.
There's a table with four columns, 'sender', 'receiver', 'message' and 'time', for storing conversations.
My chatbox works this way:
1. The user sends a message; my front-end JavaScript fetches the text the user typed and sends it to a PHP file on the server via Ajax.
2. The backend PHP file stores this message in MySQL.
3. The front-end calls an update function every 3 seconds to check whether the other party has sent anything new, and shows it in the chatbox. (A minimal sketch of the two server-side pieces follows below.)
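For reference, a minimal sketch of steps 2 and 3 on the server side, assuming the four-column table above plus an auto-increment id column and placeholder database credentials; real code would validate input and authenticate the sender instead of trusting the request data:

    <?php
    // send.php -- step 2: store an incoming message.
    // Assumes: messages(id AUTO_INCREMENT, sender, receiver, message, time)
    $pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'pass');
    $stmt = $pdo->prepare(
        'INSERT INTO messages (sender, receiver, message, time)
         VALUES (?, ?, ?, NOW())'
    );
    $stmt->execute([$_POST['sender'], $_POST['receiver'], $_POST['message']]);

    <?php
    // fetch.php -- step 3: polled by the front-end every 3 seconds with the
    // id of the last message it has already displayed.
    $pdo = new PDO('mysql:host=localhost;dbname=chat', 'user', 'pass');
    $stmt = $pdo->prepare(
        'SELECT id, sender, message, time FROM messages
         WHERE receiver = ? AND id > ? ORDER BY id'
    );
    $stmt->execute([$_GET['me'], (int) $_GET['last_id']]);
    header('Content-Type: application/json');
    echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));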
I'm not sure this is a good approach in the long run, and I'm really concerned about it.
As the user base grows and grows, I will have to think of ways to scale it well, or my database and server will be overwhelmed and users might experience high latency when conversations update.
Is BigTable the right way to do this if you have millions of users online?
How does Facebook store its users' text messages and chat history in the backend?
How does a chat app like WhatsApp store its text messages?
Is it possible to let users chat directly with one another without storing state on the server?
If I want to implement chat history functionality in my chatbox, what is a good way to do it?
I am thinking the server could create a .txt file for each conversation in its file system, with a database column storing the file path. I know it's possible to do it this way, but is it a good and correct way to handle chat history?
I know this could be a huge, detailed application.
I'm not asking for a detailed implementation, but for the big picture and the concepts behind building it!
Thank you!
That's a good question and here's an attempt at answering it.
I believe you are thinking about scalability a bit too early. Your IM app might never reach the number of users at which it stops performing well. Consider enhancing your small product and scaling as you go, as much as is needed.
Disk I/O is one of the issues you will face when scaling your web application. Storing conversations directly on disk as .txt files might not be a reliable solution.
Push your technology stack to its limits before considering changing it or switching to something else. I assume you are using a relational database for storage (since you mentioned columns and rows, which is not conclusive but still a hint). There are other options out there with good benchmark results, at the expense of various other compromises; NoSQL (the family BigTable belongs to) is one of them. Relational databases are great and have long been the industry standard, but there are now alternative solutions that are quite promising.
Look into NoSQL document-based data storage solutions such as MongoDB, CouchDB or even Cassandra, among many others. There is a considerable amount of information about the performance of each under specific circumstances and situations. Choose what is best for the problem at hand, not what is most fashionable or hyped.
Another option would be to outsource your scalability problems to a third-party provider such as Firebase. In that case all you have to worry about is your product, not what's happening under the hood.
Store only the data that you need and archive or dismiss what you don't.
With scalability there are generally two broad categories: horizontal and vertical scaling.
Horizontal: adding more nodes to your system, i.e. more server instances to handle the extra load. There are many cloud providers out there that make this kind of scaling very cheap and nearly instantaneous.
Vertical: adding more resources to the node your app currently runs on, in addition to using technologies that let you take full advantage of those resources. This optimization happens at the level of the instance's resources (CPU, RAM, disk space, etc.), your data storage, your programming language of choice, the algorithms you use, and so on. You might conclude that PHP and MySQL aren't the tools for this job, but that's arguable.
Distributed-systems architects and programmers also take advantage of other (faster) programming languages at runtime (such as C, C++ or even Java) to speed up certain tasks. Look into how you can dissect your application into smaller decoupled modules/components that can run independently. (But I'm not sure you will ever reach this stage with an IM client unless it becomes as popular as WhatsApp or Facebook chat.)
I advise you to grab and read a couple of books about scaling web applications and leveraging cloud computing. Study scalable architectures and design your application around them, according to your business logic.
This is a very broad and complex topic, I'm sure others might have additional interesting insight on the matter.
For a homework project, I'm creating a PHP-driven website whose main function is aggregating news about various university courses.
The main problem is this: (almost) every course has its own website. These are usually plain HTML or built with some simple free CMS.
As a student participating in 6-7 courses, you go through 6-7 websites almost every day, checking whether there is any news. The idea behind the project is that you don't have to do that; instead, you just check the aggregation site.
My idea is the following: each time a student logs in, go through his course list. For every course, fetch its website (recursively, like with wget) and compute a hash value of it. If the hash differs from the one stored in the database, we know the site has changed, and we notify the student.
So, what do you think: is this a reasonable way to achieve the functionality?
And if yes, what is (technically) the best way to go about it? I was checking php_curl, but I don't know if it can fetch a website recursively.
Furthermore, there's a slight problem: I have somewhat limited resources, only a few MB of quota on the public (university) server. However, if that's a big problem, I could use a separate hosting solution.
Thanks :)
Just use file_get_contents, or cURL if you absolutely have to (in case you need cookies).
You can use your hashing trick to check for modifications, but it's not very elegant. What you really want to know is when the page was last changed. I doubt this information is on the website itself, but maybe they offer an RSS feed, or some web service or API you can use for this purpose.
Don't worry about doing recursive requests. Just make a new request each time.
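For illustration, a minimal sketch of both approaches, with a made-up URL and function name; a failed fetch is simply treated as "unchanged" here:

    <?php
    // Fetch one page (no recursion) and see whether it differs from the
    // hash we stored last time. Returns the new hash via $newHash.
    function page_changed($url, $storedHash, &$newHash)
    {
        $html = file_get_contents($url);
        if ($html === false) {
            return false; // fetch failed; treat as unchanged for now
        }
        $newHash = sha1($html);
        return $newHash !== $storedHash;
    }

    // If the server sends a Last-Modified header, that is cheaper than
    // downloading and hashing the whole page:
    $headers = get_headers('http://example.com/course-page.html', 1);
    if (isset($headers['Last-Modified'])) {
        echo 'Last changed: ' . $headers['Last-Modified'];
    }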
"When all else fails, build a scraper"
I've been given the task of connecting multiple sites of the same client into a single network, so I would like to hear architectural advice on connecting these sites into a single community.
These sites include:
1. An Invision Power Board forum (the most important site)
2. Three custom-made CMSs (changes to the code are allowed)
3. One Drupal site
4. Three or four WordPress blogs
Requirements are as follows:
1. Connecting all users of all sites into a single administrable entity, with the ability to change permissions, ban users, etc.
2. Later on, building on this implementation, I have to add a "Facebook-like" chat that will be available to all users regardless of where they log in.
I have a few ideas in mind on how to go about this, but I would like to hear from people with more experience and expertise than myself.
Cheers!
You're going to have one hell of a time. Each of those site platforms has a very disparate user architecture: there is no way to "connect" them all together fluidly without numerous codebase changes. You're looking at making deep changes to each of those platforms to communicate with a central database, likely modifying thousands (if not tens of thousands) of lines of code.
On top of the obvious (massive) changes to all of the platforms, you're going to have to worry about updates: what happens when a new version of WordPress is released? You'd likely have to re-apply all of your code changes manually (since you can't just drop in the update). You'd also have to make sure that all of the code changes are compatible with your current database. God forbid one of the platforms starts storing user information differently: you'd have to make more massive code changes. This just isn't maintainable.
Your alternative (and best bet) is to have some sort of synchronization job that runs every hour or so: iterate through each user in each database and check whether the user both exists and is up to date in the other databases. If not, push the changes out. The problem with this approach is that it will get significantly slower as you gain more and more users.
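Something along these lines, where every connection string, table and column name is invented for illustration, since each platform stores users differently:

    <?php
    // Hypothetical hourly cron job (sketch only): push users changed since
    // the last run from a central database into each platform's database.
    // All connection details, table and column names are invented; every
    // platform (IPB, Drupal, WordPress...) needs its own mapping code
    // where marked below.
    $central = new PDO('mysql:host=localhost;dbname=central', 'user', 'pass');
    $sites = [
        new PDO('mysql:host=forum.example.com;dbname=ipb', 'user', 'pass'),
        new PDO('mysql:host=blog.example.com;dbname=wp', 'user', 'pass'),
    ];

    $lastRun = '2012-06-01 00:00:00'; // normally read from a state file/table

    $changed = $central->prepare(
        'SELECT id, email, password_hash, banned
         FROM users WHERE updated_at > ?'
    );
    $changed->execute([$lastRun]);

    foreach ($changed->fetchAll(PDO::FETCH_ASSOC) as $user) {
        foreach ($sites as $site) {
            // Upsert into the remote user table; a real job would translate
            // the row into each platform's own schema here.
            $stmt = $site->prepare(
                'REPLACE INTO users (id, email, password_hash, banned)
                 VALUES (?, ?, ?, ?)'
            );
            $stmt->execute([
                $user['id'], $user['email'],
                $user['password_hash'], $user['banned'],
            ]);
        }
    }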
Perhaps another alternative is to simply offer a custom OpenID implementation. I believe that Drupal and WordPress both have OpenID plugins you can take advantage of. This way, you could give your users a pseudo-single-sign-on service across your sites. The downside is that users could opt not to use it.
Good luck
I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university.
The spider will look for about 17,000 values on the browse function of blogger and (anonymously) save certain ones if they fit the right criteria.
I've been running the spider (written in PHP) and it works fine, but I don't want to have my IP blacklisted or anything like that. Does anyone have any knowledge on enterprise sites and the restrictions they have on things like this?
Furthermore, if there are restrictions in place, is there anything I can do to circumvent them? At the moment all I can think of to help slightly is adding a random delay between calls to the site (between 0 and 5 seconds), or running the script through random proxies to disguise the requests.
Having to resort to methods like those above makes me feel as if I'm doing the wrong thing. I would be annoyed if they were to block me, because blogger.com is owned by Google and their main product is a web spider; admittedly, their spider does not send all its requests to just one website.
It's likely they have some kind of restriction, and yes, there are ways to circumvent it (bot farms and random proxies, for example), but none of them would be exactly legal, nor very feasible technically :)
If you are accessing Blogger, can't you log in with an API key and query the data directly anyway? It would be more reliable and less trouble-prone than scraping their pages, which may be prohibited anyway and lead to trouble once the number of requests is big enough that they start to care. Google is very generous with the amount of traffic allowed per API key.
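To illustrate, here is a rough sketch of pulling posts through the Blogger API (v3) with a key; the blog id and key are placeholders, and the exact endpoint and response fields should be checked against Google's current documentation:

    <?php
    // Sketch of querying the Blogger API (v3) with an API key instead of
    // scraping. The blog id and key below are placeholders.
    $blogId = '1234567890';
    $apiKey = 'YOUR_API_KEY';

    $url = "https://www.googleapis.com/blogger/v3/blogs/$blogId/posts"
         . "?key=$apiKey";
    $data = json_decode(file_get_contents($url), true);

    if (!empty($data['items'])) {
        foreach ($data['items'] as $post) {
            echo $post['title'], "\n";
        }
    }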
If all else fails, why not write an e-mail to them? Google has a reputation for being friendly towards academic projects, and they might well grant you more traffic if needed.
Since you are writing a spider, make sure it reads the robots.txt file and acts accordingly. Also, one of the conventions of HTTP is not to open more than 2 concurrent connections to the same server. Don't worry, Google's servers are really powerful. If you only read pages one at a time, they probably won't even notice. If you add a 1-second interval, it will be completely harmless.
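As an illustration, a simplified sketch of such a polite fetch loop; the robots.txt handling here is deliberately naive (it ignores User-agent sections and Crawl-delay), and the paths are examples:

    <?php
    // Simplified "polite" fetch loop: sequential requests, a fixed
    // 1-second pause, and a very naive robots.txt check. A real crawler
    // should use a proper robots.txt parser.
    function allowed_by_robots($host, $path)
    {
        $robots = @file_get_contents("http://$host/robots.txt");
        if ($robots === false) {
            return true; // no robots.txt: assume allowed
        }
        foreach (explode("\n", $robots) as $line) {
            if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)
                && strpos($path, $m[1]) === 0) {
                return false;
            }
        }
        return true;
    }

    $paths = ['/feeds/posts/default', '/some/other/page']; // example paths
    foreach ($paths as $path) {
        if (!allowed_by_robots('www.blogger.com', $path)) {
            continue;
        }
        $body = file_get_contents('http://www.blogger.com' . $path);
        // ... process $body ...
        sleep(1); // one request at a time, one second apart
    }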
On the other hand, using a botnet or another distributed approach is considered harmful behavior, because it looks like a DDoS attack. You really shouldn't be thinking in that direction.
If you want to know for sure, write an e-mail to blogger.com and ask them.
You could route the requests through Tor; you would get a different IP each time, at a performance cost.
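A sketch of that, assuming a Tor client is running locally with its SOCKS proxy on the default port 9050:

    <?php
    // Sketch: sending a request through a local Tor client. Tor normally
    // exposes a SOCKS5 proxy on 127.0.0.1:9050 (the port may differ), and
    // circuits -- and therefore exit IPs -- change over time.
    $ch = curl_init('http://example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050');
    curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    $body = curl_exec($ch);
    curl_close($ch);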