I need to crawl a website at a rate of, let's say, 8 pages per minute. Now, I would like the requests I make to the remote server to be uniformly distributed over the minute, so that I don't harm the server I am requesting from.
How can I maintain a uniform time difference, in seconds, between two consecutive requests? What is the best way to do this?
There are really two separate issues here. Let's tackle them separately:
FIRST QUESTION
I need to crawl a website at a rate, let's say, 8 pages per
minute....so that it doesn't harm the server it is requesting to.
Paraphrase: I want to not send more than 8 requests per minute, because I want to be nice to the remote server.
For this answer, there is a related Stack Overflow question about rate limiting using PHP and cURL.
SECOND QUESTION
I wish the requests which I make to the remote server to be uniformly
distributed over the minute....How can I maintain a uniform time
difference in seconds between two consecutive requests
Paraphrase: I want to have the same amount of time in between each query.
This is a different question from the first one, and trickier. To do this, you will need to use a clock to track the time before and after each request, and keep a running average of how long a request takes so you can adjust how much you sleep and/or how often you call get(). You will also have to take into account how long each request takes (what if you hit an extremely laggy connection that slows you down so much you're only doing 3 or 4 requests per minute...).
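One minimal way to sketch that pacing approach in PHP (fetch_page() here is a hypothetical stand-in for whatever request code you actually use):

```php
<?php
// Sketch only: space request starts evenly at 60 / 8 = 7.5 seconds.
// fetch_page() is a hypothetical placeholder for your real HTTP call.
function fetch_page(string $url): void
{
    // e.g. file_get_contents($url) or a cURL request
}

function paced_crawl(array $urls, float $perMinute = 8.0): void
{
    $interval = 60.0 / $perMinute; // seconds between request starts
    foreach ($urls as $url) {
        $start = microtime(true);
        fetch_page($url);
        $elapsed = microtime(true) - $start;
        if ($elapsed < $interval) {
            usleep((int)(($interval - $elapsed) * 1000000)); // microseconds
        }
        // If the request itself took longer than $interval, we continue
        // immediately; the overall rate only drops, which is the safe
        // direction for being nice to the server.
    }
}
```

Note that this keeps the *starts* evenly spaced; a run of slow responses will still lower your effective rate, which is exactly the averaging problem described above.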
I personally don't think this is actually what you need to do "so that it doesn't harm the server".
Here's why: usually rate limits are set as an "upper bound per smallest time slice". So "8 requests per minute" means they can all come at once within the minute, just not more than 8 per minute. There is no expectation by the rate limiter that they'll be uniformly distributed over the minute. If they did want that, they'd have said "one request every 7.5 seconds".
Related
As part of a Laravel-based app, I am trying to write a PHP script that fetches certain constantly-updated data from across the web about certain products, books to be exact.
The problem:
Books are identified by ISBN, a 10-digit identifier. The first 9 digits can each be 0-9, while the last digit can be 0-9 or X. However, the last digit is a check digit calculated from the first 9 digits, so there is really only one possible value for the last place.
That being the case, we arrive at:
10*10*10*10*10*10*10*10*10*1 = 1,000,000,000
numerically correct ISBNs. I can do a little better than that if I limit my search to English books, as they would contain only a 0 or a 1 as the first digit. Thus I would get:
2*10*10*10*10*10*10*10*10*1 = 200,000,000
numerically correct ISBNs.
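The check-digit claim above can be made concrete. This is the standard ISBN-10 algorithm, sketched in PHP:

```php
<?php
// The standard ISBN-10 check digit: weight the first nine digits
// 10 down to 2, then choose the tenth digit so the weighted sum is a
// multiple of 11 (a value of 10 is written as 'X').
function isbn10_check_digit(string $first9): string
{
    $sum = 0;
    for ($i = 0; $i < 9; $i++) {
        $sum += (int)$first9[$i] * (10 - $i);
    }
    $check = (11 - ($sum % 11)) % 11;
    return $check === 10 ? 'X' : (string)$check;
}

echo isbn10_check_digit('030640615'), "\n"; // prints "2" (ISBN 0-306-40615-2)
echo isbn10_check_digit('097522980'), "\n"; // prints "X" (ISBN 0-9752298-0-X)
```

Enumerating candidates this way (generate the first nine digits, derive the tenth) already guarantees you never request a numerically invalid ISBN.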
Now for each ISBN I have 3 http requests that are required to fetch the data, each taking roughly 3 seconds to complete. Thus:
3 seconds * 3 requests * 200,000,000 ISBNs = 1,800,000,000 seconds
1,800,000,000 seconds / 60 seconds / 60 minutes / 24 hours / 365 days = ~57 years
Hopefully in 57 years' time there won't be such a thing as a book anymore, and this algorithm will be obsolete.
Actually, since the data I am concerned with is constantly changing, for this algorithm to be useful it would have to complete each pass within just a few days (2 - 7 days is ideal).
Thus the problem is how to optimize this algorithm to bring its runtime down from 57 years, to just one week?
Potential Solutions:
1) The very first thing you will notice is that while there are 200,000,000 possible ISBNs, nowhere near that many real ISBNs exist, which means this algorithm would spend most of its time making http requests for false ISBNs (I could move to the next ISBN after the first failed http request, but that alone would not bring the runtime down significantly). Thus solution 1 would be to get/buy/download a database which already contains the list of ISBNs in use, significantly bringing down the number of ISBNs to search.
My issue with solution 1 is that new books are constantly being published, and I hope to pick up new books when the algorithm runs again. Using a database of existing books would only be good for books up to the date the database was created. (A potential fix would be a service that constantly updates its database and lets me download it once a week, but that seems unlikely, and besides, I was really hoping to solve this problem through programming!)
2) While this algorithm takes forever to run, most of the time it is actually just sitting idle waiting for an http response. Thus one option would seem to be to use threads.
If we do the math, I think the equation would look like this:
(numISBNs/numThreads)*secondsPerISBN = totalSecondsToComplete
If we isolate numThreads:
numThreads = (numISBNs * secondsPerISBN) / totalSecondsToComplete
If our threshold is one week, then:
totalSecondsToComplete = 7 days * 24 hrs * 60 min * 60 sec = 604,800 seconds
numISBNs = 200,000,000
secondsPerISBN = 3 requests * 3 seconds = 9
numThreads = (200,000,000 * 9) / 604,800
numThreads = ~2,976
So roughly 3,000 threads would have to run concurrently for this to work. Is that a reasonable number of threads to run on, say, a DigitalOcean server? My Mac right now says it is running over 2,000 threads, so it could be that this number is actually manageable.
My Question(s):
1) Is ~3,000 a reasonable number of threads to run on a DigitalOcean server?
2) Is there a more efficient way to perform this algorithm asynchronously, given that each http request is completely independent of the others? What is the best way to keep the CPU busy while waiting for all the http requests to return?
3) Is there a specific service I should be looking in to for this that may help achieve what I am looking for?
Keep a DB of ISBNs and continue to crawl to keep it updated, similar to what Google does with web pages.
Analyze the ISBN generation logic and avoid fetching ISBNs that are not possible.
At the crawling level, you can not only split the work across threads, but also across multiple servers, each with access to a DB server that is dedicated to the DB and not weighed down by the crawling.
You could also use some kind of web cache if it improves performance, for instance Google's cache or the Internet Archive.
3 seconds is a lot for a web service; are you sure there is no service that answers in less time? It's worth searching for one.
If you manage to list all books published up to a certain date, you can then crawl only books newer than that date by finding a source for just those; this refresh would be much faster than searching for every book.
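On the "crawl in parallel without one thread per request" point: PHP's curl_multi API lets a single process drive many HTTP requests concurrently. This is only a sketch; the timeout is an arbitrary placeholder and you would pass your own list of ISBN lookup URLs:

```php
<?php
// Sketch: drive many HTTP requests concurrently from one PHP process
// with curl_multi, instead of one thread per request.
function fetch_all(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // placeholder timeout
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Pump the transfers until none are still running; curl_multi_select()
    // waits for socket activity instead of busy-looping.
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);
        }
    } while ($running > 0);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Because the process is I/O-bound rather than CPU-bound, a few hundred concurrent transfers per process goes a long way before you need more servers.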
I have a doubt regarding multiple ajax calls.
Consider I have 100 ajax calls to make. If I used a single sub domain it is taking 30 sec to finish. But if I use 2 sub domains it is taking 20 secs & if I use 3 sub domains it is taking 18 secs.
All the Ajax calls are dynamic. The time to finish a call is a max of 3 sec.
Each call needs to communicate with the db. Previously I had a single db for all 3 sub-domains. Now I have created 3 different databases.
My goal is to get them all finished in 10 secs.
Any suggestions please.
KR
If the sub-domains are served by the same Apache server, performance will be a bit slower or almost the same, because Apache needs to serve more virtual hosts.
So the right choice would be to group your requests into one, or to use WebSocket to communicate with the server in real time.
The problem you're describing is the limit on simultaneous connections a browser opens per hostname. If you have many calls to one server, some of them have to wait until the others are finished (which causes delays). If you distribute resources between servers, you get around this per-server limit and they run simultaneously. However, for small amounts of data it is usually wiser to just merge the requests and send them as one package, as otherwise you lose time on each request's round trip, on repeated useless headers, and on opening connections.
Check here for the current limits per browser. They might not be strict in implementing those limits, though.
http://www.browserscope.org/?category=network
I'm attempting to create a procedure that will run on the server every 1 min (more or less). I know I could achieve this with a cronjob, but I'm concerned that with about 1000 tasks (1000 users that the system would need to check every 1 min), it might kill the server.
This system is supposed to sync data from the Google AdWords API and do something with it. For example, it should read a campaign from Google and do something every 1000 impressions or clicks. So obviously I need to keep a connection to the AdWords API open to see the stats in real time. Imagine this procedure running for 1000 registered users.
What technology should I use in my case when I need to run a heavy loop every 1 min?
Thanks a lot,
Oron
Ideally, if you're using a distributed workflow, you are better off servicing multiple users this way.
While there are many technologies at your disposal, it's difficult to pinpoint any specific one that will be useful when you haven't given sufficient information.
There are 2 fundamental issues with what you are trying to do, the first one more severe than the other.
AdWords API doesn't give you real-time stats, the reports are usually delayed by 3 hours. See http://adwordsapi.blogspot.com/2011/06/statistics-in-reports.html for additional background information.
What you can do is pick a timeframe (e.g. once every 2 hours) for which you want to run reports, and then stick to that schedule. To give you a rough estimate: if you run 10 reports in parallel and assume it takes 10 seconds to download and parse each one, you get a throughput of 1 report per second, so you could refresh 1000 accounts in about 20-30 minutes. (The timing strictly depends on your internet connection speed, the load on the AdWords API servers, how big your clients are, what columns, segmentation, and date ranges you request the data for, etc. For big clients, a single report could easily take several minutes.)
Even if you wanted to go much faster, AdWords API won't allow you to do so. It has a rate limiting mechanism that will limit your API call rate if you try to make too many calls in a very small period. See http://adwordsapi.blogspot.com/2010/06/better-know-error-rateexceedederror.html for details.
My advice is to ask this question on the official forum for the AdWords API - https://groups.google.com/forum/?fromgroups#!forum/adwords-api. You will definitely find users who have tackled similar issues before.
You can save a variable containing a timestamp of the last run of your loop.
Now, when a user visits your page, check whether that timestamp is older than 1 minute.
If it is older, run your loop and check all users.
This way your loop runs only when there are users on your site, which saves a lot of server performance.
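A minimal sketch of that idea, assuming a plain file as the timestamp store (a DB column would work the same way; the file name is made up for illustration):

```php
<?php
// Run $loop at most once per $intervalSec, triggered by page visits.
// $stateFile holds the Unix timestamp of the last successful run.
function maybe_run_loop(string $stateFile, callable $loop, int $intervalSec = 60): bool
{
    $last = is_file($stateFile) ? (int)file_get_contents($stateFile) : 0;
    if (time() - $last < $intervalSec) {
        return false; // ran recently, skip this visit
    }
    file_put_contents($stateFile, (string)time());
    $loop(); // check all users, sync stats, etc.
    return true;
}
```

One caveat worth knowing: two simultaneous visitors can both pass the check before either writes the timestamp, so for anything non-idempotent you would add a lock (e.g. flock() on the state file).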
Background:
2 minutes before every hour, the server stops access to the site, returning a busy screen while it processes data received in the previous hour. This can take less than two minutes, in which case it sleeps until the two minutes are up. If it takes longer than two minutes, it runs as long as it needs to and then returns. The block flag is contained in its own table, with one field and one value in that field.
Currently the user is only informed of the block when (s)he tries to perform an action (click a link, send a form, etc.). I was planning to update the code to bring down a lightbox with the blocking message automatically, via the BlockUI jQuery plugin.
There are basically 2 methods I can see to achieve my aim:
Polling every N seconds (via PeriodicalUpdater or similar)
Long polling (Comet)
You can reduce server load for option 1 by checking the local time and starting the polling loop only once it gets close to the blocking time. This can be made more accurate by sending the local time to the server and returning the difference mod 60. It still has 100+ people querying the server, though, which causes an additional hit on the db.
Option 2 is the more attractive choice. It removes the repeated hits on the webserver, but doesn't alleviate the repeated checks on the db. However, option 2 is not really available to Apache 2.0 runners like us, and even though we own our server, none of us are web admins and we don't want to break it - people pay real money to play, so if it isn't broke, don't fix it (hence why we are still running PHP4/MySQL3).
Because of the problems with option 2 we are back with option 1 - sub-optimal.
So my question is really two-fold:
Are there any other possibilities I've missed?
Is long polling really such a problem at this size? I understand it doesn't scale, but I am more concerned about the level at which it starves Apache of threads. Also, are there any options you can adjust in Apache so it scales slightly further?
Can you just send the page how much time is left before the server starts processing data received in the previous hour? Let's say that when sending the HTML you record that the server will start processing after 1 min. Then create a JS timer that triggers after that 1 min and shows the lightbox.
The alternative I see is to get the processing done faster, so there is less downtime from the users' perspective. To do that I would use a distributed system, such as Hadoop, for the actual data processing behind the hourly update. Then use whichever method is most appropriate for that short downtime to update the page.
My site runs one PHP process per open window/tab, which runs for a maximum of 1 minute and returns notifications, chat messages, and people going online or offline. When JavaScript gets the output, it calls the same PHP process again, and so on.
This is like Facebook chat.
But it seems to consume too much CPU while it runs. Do you have any idea how Facebook handles this problem? What do they do so their processes don't eat too much CPU and bring their servers down?
My process has a while(true) loop with a sleep(1) at the end. Inside the loop, it checks for notifications, checks whether any of the currently online people went offline or changed status, reads unread messages, etc.
Let me know if you need more info about how my process works.
Would calling other PHP scripts via system() (and waiting for their output) alleviate this?
I ask because that would make other processes check the notifications and flush when finished, while the main PHP script just collects the results.
Thank you.
I think your main problem here is parallelism. Apache and PHP do not excel at tasks like this, where 100+ users each hold an open HTTP request.
If in your while(true) you spend 0.1 seconds on CPU-bound work (checking status changes or other useful things) and 1 second on the sleep, this results in a CPU load of roughly 100% as soon as you have 10 users online in the chat. So in order to serve more users with THIS model of chat, you would have to optimize the work done in each while(true) cycle and/or raise the sleep interval from 1 second to 3 or higher.
I had the same problem in an HTTP-based chat system I wrote many years ago, where at some point too many parallel MySQL SELECTs were slowing down the chat, creating heavy load on the system.
What I did was implement a fast "ring buffer" for messages and status information in shared memory (SysV back in the day - today I would probably use APC or memcached). All operations read and write the buffer, and the buffer itself gets periodically "flushed" into the database for persistence (but a lot less often than once per second per user). If no persistence is needed, you can of course omit the database backend.
I was able to increase the number of users I could serve by roughly 500% that way.
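A rough sketch of that ring-buffer idea with APCu (assuming the apcu extension is available; the "chat:" key scheme and buffer size are made up for illustration):

```php
<?php
// Keep hot chat state in shared memory (APCu) and only periodically
// persist it to MySQL, instead of one SELECT per user per second.
function push_message(string $room, array $msg, int $keep = 100): void
{
    $key = "chat:$room";
    $buf = apcu_fetch($key) ?: [];
    $buf[] = $msg;
    if (count($buf) > $keep) {
        array_shift($buf); // ring-buffer behaviour: drop the oldest entry
    }
    apcu_store($key, $buf);
}

function recent_messages(string $room): array
{
    return apcu_fetch("chat:$room") ?: [];
}
// A counter in the chat loop (or a cron job) would flush the buffer to
// the database far less often than once per user per second.
```

The same pattern works with memcached; the point is that the per-second polling hits memory, not the database.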
BUT as soon as you have solved this issue, you will be faced with another: available system memory (100+ Apache processes at ~5MB each - fun) and process context-switching overhead. The more active processes you have, the more time your operating system will spend on the overhead of assigning "fair enough" CPU slots, AFAIK.
You'll see it is very hard to scale efficiently with Apache and PHP alone for your use case. There are open-source tools, client- and server-based, that help, though. One I remember places a server in front of Apache and queues messages internally, while keeping a very efficient multi-socket connection to the JavaScript clients, making real "push" events possible. Unfortunately I do not remember any names, so you'll have to research, or hope the Stack Overflow community brings in what my brain has already discarded ;)
Edit:
Hi Nuno,
the comment field allows too few characters, so I reply here.
Let's get back to the 10 users in parallel:
10 * 0.1 seconds of CPU time per cycle (assumed) is roughly 1 s of combined CPU time over a period of 1.1 seconds (1 second sleep + 0.1 seconds execution). That is 1 / 1.1, which I would boldly round to 100% CPU utilization, even though it is "only" 90.9%.
If the same 10 * 0.1 s of CPU time is "stretched" over a period of not 1.1 seconds but 3.1 (3 seconds sleep + 0.1 seconds execution), the calculation is 1 / 3.1 = 32%.
And it is logical: if your checking cycle queries your backend a third as often, you have only a third of the load on your system.
Regarding the shared memory: the name might imply otherwise, but if you use good IDs for your cache areas, like one ID per conversation or user, you will have private areas within the shared memory. Database tables also rely on you providing good IDs to separate private data from public information, so those should be around already :)
I would also not "split" any further. The fewer PHP processes you have to "juggle" in parallel, the easier it is for your systems and for you. Unless you see that it makes absolute sense because one type of notification takes a lot more querying resources than another and you want different refresh times, or something like that. But even this can be decided inside the while cycle: a user's "away" status could be checked every 30 seconds, while the messages he might have written get checked every 3. No reason to create more cycles - just use different counter variables, or the right divisor in a modulo operation.
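The counter/modulo idea can be sketched like this (the two callables stand in for the hypothetical check routines; the real loop would sleep(1) per tick instead of running a fixed number of ticks):

```php
<?php
// One loop, several refresh rates via modulo on a tick counter.
// One tick corresponds to one second of sleep in the real cycle.
function run_cycle(int $ticks, callable $checkMessages, callable $checkAway): void
{
    for ($t = 1; $t <= $ticks; $t++) {
        if ($t % 3 === 0) {
            $checkMessages(); // messages: every 3 ticks
        }
        if ($t % 30 === 0) {
            $checkAway();     // away status: every 30 ticks
        }
        // sleep(1); // in the real loop
    }
}

$m = 0; $a = 0;
run_cycle(30, function () use (&$m) { $m++; }, function () use (&$a) { $a++; });
echo "$m $a\n"; // prints "10 1"
```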
The inventor of PHP said that he believes man is too limited to control parallel processes :)
Edit 2
OK, let's build a formula. We have these variables:
duration of execution (e)
duration of sleep (s)
duration of one cycle (c)
number of concurrent users (u)
CPU load (l)
c = e + s
l = u * e / c   # total CPU time demanded per cycle by u concurrent users, divided by the cycle length
l = u * e / (e + s)
For 30 users, ASSUMING that you have 0.1 s execution time and 1 second of sleep:
l = 30 * 0.1 / (0.1 + 1)
l = 2.73
l = 273% CPU utilization (aka you need 3 cores :P)
Exceeding the capabilities of your CPU means that cycles will run longer than you intend; the overall response time will increase (and the CPU runs hot).
PHP's sleep() and system() calls are blocking. What you really need to research is pcntl_fork(). Fortunately, I had these problems over a decade ago, and you can look at most of my code.
I needed a PHP application that could connect to multiple IRC servers, sit in unlimited IRC chatrooms, moderate, interact with, and receive commands from people. All this and more was done in a process-efficient way.
You can check out the entire project at http://sourceforge.net/projects/phpegg/ The code you want is in source/connect.inc.
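For reference, a minimal pcntl_fork() sketch of the parent/child pattern (requires the pcntl extension, CLI only; run_in_child() is a name made up for this example, not part of the linked project):

```php
<?php
// Fork a child to run blocking work (e.g. one IRC connection) while the
// parent process stays free to service its own loop.
function run_in_child(callable $work): void
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException('fork failed');
    }
    if ($pid === 0) {
        $work();   // child process: do the blocking work here
        exit(0);   // never fall through into the parent's code path
    }
    pcntl_waitpid($pid, $status); // parent: reap the child when it finishes
}
```

A real multi-server bot would fork once per connection and skip the immediate pcntl_waitpid(), reaping finished children asynchronously (e.g. via a SIGCHLD handler) instead.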