As part of a Laravel-based app, I am trying to write a PHP script that fetches constantly updated data from across the web about certain products (books, to be exact).
The problem:
Books are identified by ISBN, a 10-digit identifier. The first 9 digits can be 0-9, while the last digit can be 0-9 or X. However, the last digit is a check digit calculated from the first 9 digits, so there is really only one possible value for the last place.
That being the case, we arrive at:
10*10*10*10*10*10*10*10*10*1 = 1,000,000,000
numerically correct ISBNs. I can do a little better than that if I limit my search to English books, as they would contain only a 0 or a 1 as the first digit. Thus I would get:
2*10*10*10*10*10*10*10*10*1 = 200,000,000
numerically correct ISBNs.
Now, each ISBN requires 3 HTTP requests to fetch the data, each taking roughly 3 seconds to complete. Thus:
3 seconds * 3 requests * 200,000,000 ISBNs = 1,800,000,000 seconds
1,800,000,000 seconds / 60 / 60 / 24 / 365 = ~57 years
Hopefully, in 57 years' time, there won't be such a thing as a book anymore, and this algorithm will be obsolete.
Actually, since the data I am concerned with is constantly changing, for this algorithm to be useful it would have to complete each pass within just a few days (2 - 7 days is ideal).
Thus the problem is: how do I optimize this algorithm to bring its runtime down from 57 years to just one week?
Potential Solutions:
1) The very first thing you will notice is that while there are 200,000,000 possible ISBNs, nowhere near that many real ISBNs exist, which means the algorithm would spend most of its time making HTTP requests for ISBNs that don't correspond to any book (I could move on to the next ISBN after the first failed HTTP request, but that alone will not bring the runtime down enough). Solution 1 would therefore be to get/buy/download a database which already contains the list of ISBNs in use, significantly reducing the number of ISBNs to search.
My issue with solution 1 is that new books are constantly being published, and I hope to pick up new books when the algorithm runs again. A database of existing books would only be good for books published up to the date the database was created. (A potential fix would be a service that constantly updates its database and lets me download it once a week, but that seems unlikely, and besides, I was really hoping to solve this problem through programming!)
2) While this algorithm takes forever to run, most of that time is actually spent sitting idle waiting for an HTTP response. Thus one option would seem to be to use threads.
If we do the math, I think the equation would look like this:
(numISBNs/numThreads)*secondsPerISBN = totalSecondsToComplete
If we isolate numThreads:
numThreads = (numISBNs * secondsPerISBN) / totalSecondsToComplete
If our threshold is one week, then:
totalSecondsToComplete = 7 days * 24 hrs * 60 min * 60 sec = 604,800 seconds
numISBNs = 200,000,000
secondsPerISBN = 3 (this assumes the 3 requests for an ISBN are issued in parallel; if they run sequentially it is 9 seconds per ISBN, and the thread count below roughly triples)
numThreads = (200,000,000 * 3) / 604,800
numThreads = ~992
So roughly 992 threads would have to run concurrently for this to work. Is that a reasonable number of threads to run on, say, a DigitalOcean server? My Mac right now says it is running over 2,000 threads, so the number may actually be manageable.
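For what it's worth, PHP would not necessarily need one OS thread per in-flight request: a single process can keep many transfers open at once with curl_multi (Guzzle's async request pool wraps the same mechanism). Here is a minimal sketch of that idea; the lookup URL is a placeholder, not one of the real endpoints:
<?php
// Fetch a whole batch of ISBN lookups concurrently in one process.
function fetch_batch(array $isbns): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($isbns as $isbn) {
        // Placeholder URL; substitute the real data sources here.
        $ch = curl_init("https://example.com/lookup?isbn=" . urlencode($isbn));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$isbn] = $ch;
    }
    // Drive all transfers together; the process waits on sockets,
    // not on one blocking request at a time.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running && $status === CURLM_OK);
    $results = [];
    foreach ($handles as $isbn => $ch) {
        $results[$isbn] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
Calling fetch_batch() with, say, 200 ISBNs at a time keeps one worker busy; several such workers (or servers) can then share the keyspace between them.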
My Question(s):
1) Is 992 a reasonable number of threads to run on a DigitalOcean server?
2) Is there a more efficient way to perform this algorithm asynchronously, given that each HTTP request is completely independent of the others? What is the best way to keep the CPU busy while waiting for all the HTTP requests to return?
3) Is there a specific service I should be looking into that may help achieve what I am after?
Keep a DB of ISBNs and continue to crawl to keep it updated, similar to what Google does with web pages.
Analyze the ISBN generation logic and avoid fetching ISBNs that are not possible (see the check-digit sketch after this list).
At the crawling level, you can split the work not only across multiple threads but also across multiple servers, each with access to the DB server, which stays dedicated to the DB and is not weighed down by the crawling.
You could also use some kind of web cache if it improves performance, for instance the Google cache or the Web Archive.
3 seconds is a lot for a web service; are you sure there is no service that answers in less time? It is worth searching for one.
If you manage to list all books published up to a certain date, you can then crawl only books newer than that date by finding a source that lists just those; such a refresh would be much faster than searching for every book.
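On the ISBN-generation point above: the ISBN-10 check digit can be computed directly from the first 9 digits (weights 10 down to 2, modulo 11), so the crawler can generate only structurally valid ISBNs rather than probing the final character. A small sketch of that standard calculation, nothing project-specific:
<?php
// Compute the ISBN-10 check digit for a 9-digit prefix so that only
// structurally valid ISBNs are ever requested.
function isbn10_check_digit(string $first9): string
{
    $sum = 0;
    for ($i = 0; $i < 9; $i++) {
        // The first nine digits are weighted 10 down to 2.
        $sum += (10 - $i) * (int) $first9[$i];
    }
    $check = (11 - ($sum % 11)) % 11;
    return $check === 10 ? 'X' : (string) $check;
}

// Example: the prefix 030640615 gives check digit 2, i.e. ISBN 0306406152.
echo isbn10_check_digit('030640615'); // prints 2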
Related
We have a PHP/MySQL/Apache web app which holds a rating system. From time to time we do full recalculations of the ratings, which means about 500 iterations of calculation, each taking 4-6 minutes and each depending on the results of the previous iteration (i.e., parallel solutions are not possible). The time is mostly taken by MySQL queries and by loops over each rated player (about 100,000 players per iteration, and the complex logic linking players rules out parallelization there as well).
The problem is that when we start the recalculation in the plain old way (one PHP POST request), it dies after about 30-40 minutes (which gives only 10-15 completed iterations). The question of why it dies, and other optimization issues, are out of scope for now: the logic is too complex, and it needs to be refactored or maybe even rewritten in another language/infrastructure, but we have no resources (time/people) for that at the moment. We just need to make things work in the least annoying way.
So, the question: what is the best way to organize such a recalculation so that the site admin can start it with one click, forget about it for a day, and it still gets the job done?
I found a few pieces of advice on the web for similar problems, but no silver bullet:
move the iterations (and therefore the timeouts) from server to client by using AJAX requests instead of one plain old PHP request - this could make the browser freeze (and AJAX's async nature is a poor fit for sequential iterations);
make PHP start a backend service which does the work (like advised here) - it sounds like a lot of work and I have no idea how to implement it (a rough sketch of this approach follows below).
So, I humbly ask for any advice possible in such a situation.
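For what it's worth, the "backend service" option usually amounts to little more than a CLI script started outside Apache, so the web server's request timeout no longer applies and the admin's click only launches it. A rough sketch, with hypothetical file and table names:
<?php
// recalc_worker.php - run from the command line (cron, nohup, or exec()),
// never through Apache, so the 30-40 minute web timeout is irrelevant.
// The recalc_status table below is hypothetical; adapt to the real schema.
set_time_limit(0); // no execution limit in CLI mode
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

for ($iteration = 1; $iteration <= 500; $iteration++) {
    // ... the existing per-iteration rating calculation goes here ...

    // Record progress so an admin page can poll how far the run has gotten.
    $db->prepare('UPDATE recalc_status SET last_iteration = ?, updated_at = NOW()')
       ->execute([$iteration]);
}
The one-click part then becomes something along the lines of exec('nohup php recalc_worker.php > /dev/null 2>&1 &'); in the admin controller, or a cron entry that checks a "recalculation requested" flag once a minute.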
I need to crawl a website at a rate of, let's say, 8 pages per minute. I want the requests I make to the remote server to be uniformly distributed over the minute, so that it doesn't harm the server it is requesting from.
How can I maintain a uniform time difference in seconds between two consecutive requests? What is the best way to do this?
There are really two separate issues here. Let's tackle them separately:
FIRST QUESTION
I need to crawl a website at a rate, lets say, 8 pages per
minute....so that it doesn't harm the server it is requesting to.
Paraphrase: I want to not send more than 8 requests per minute, because I want to be nice to the remote server.
For this answer, there is a related Stack Overflow question about rate limiting using PHP and Curl.
SECOND QUESTION
I wish the requests which I make to the remote server to be uniformly
distributed over the minute....How can I maintain a uniform time
difference in seconds between two consecutive requests
Paraphrase: I want to have the same amount of time in between each query.
This is a different question than the first one, and trickier. To do it, you will need a clock to track the time before and after each request, and you will have to keep adjusting how long you sleep (and/or how often you call get()) based on how long each request took. You also have to account for request duration itself: what if you hit an extremely laggy connection that drags you down to only 3 or 4 requests per minute?
I personally don't think this is actually what you need to do "so that it doesn't harm the server".
Here's why: usually rate limits are set as an "upper bound per smallest time slice". So "8 requests per minute" means all 8 can come at once within the minute, just not more than 8 in that minute. The rate limiter does not expect them to be uniformly distributed over the minute; if they wanted that, they would have said "one request every five seconds".
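If uniform spacing is still wanted despite that, the usual approach is to time each request and sleep only for the remainder of the interval. A minimal sketch; $urls and fetch_page() here are placeholders standing in for the real crawling code:
<?php
// Keep the *start* of consecutive requests roughly 7.5 seconds apart
// (8 per minute), regardless of how long each request itself takes.
$interval = 60 / 8;

foreach ($urls as $url) {
    $start = microtime(true);
    fetch_page($url);                        // the actual cURL call (placeholder)
    $elapsed = microtime(true) - $start;
    if ($elapsed < $interval) {
        usleep((int) round(($interval - $elapsed) * 1000000));
    }
    // A request slower than $interval simply starts the next one immediately,
    // which is the "laggy connection" case mentioned above.
}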
I'm attempting to create a procedure that will run on the server every minute (more or less). I know I could achieve this with a cron job, but I'm concerned that with about 1,000 tasks (1,000 users the system would need to check every minute), it might kill the server.
This system is supposed to sync data from the Google AdWords API and do something with it. For example, it should read a campaign from Google and, after every 1,000 impressions or clicks, do something. So obviously I need an ongoing connection to the AdWords API to see the stats in real time. Imagine this procedure having to run for 1,000 registered users.
What technology should I use when I need to run a heavy loop like this every minute?
Thanks a lot,
Oron
Ideally, if you're using a distributed workflow, you are better off servicing multiple users this way.
While there are many technologies at your disposal, it's difficult to pinpoint any specific one that will be useful when you haven't given sufficient information.
There are 2 fundamental issues with what you are trying to do, the first one more severe than the other.
The AdWords API doesn't give you real-time stats; reports are usually delayed by about 3 hours. See http://adwordsapi.blogspot.com/2011/06/statistics-in-reports.html for additional background.
What you can do is pick a timeframe (e.g. once every 2 hours) for which you want to run reports, and then stick to that schedule. To give you a rough estimate: if you run 10 reports in parallel and each takes about 10 seconds to download and parse, that is a throughput of roughly 1 report per second, so you could refresh 1,000 accounts in only 20-30 minutes. The timing depends heavily on your internet connection speed, the load on the AdWords API servers, how big your clients are, and which columns, segmentation, and date ranges you request; for big clients a single report can easily take several minutes.
Even if you wanted to go much faster, the AdWords API won't let you. It has a rate-limiting mechanism that will throttle your calls if you make too many in a short period. See http://adwordsapi.blogspot.com/2010/06/better-know-error-rateexceedederror.html for details.
My advice is to ask this question on the official AdWords API forum - https://groups.google.com/forum/?fromgroups#!forum/adwords-api. You will definitely find users who have tackled similar issues before.
You can save a variable which contains a timestamp of the last run of your loop.
Now, when a user visits your page, check whether that timestamp is older than 1 minute.
If it is older, run your loop and check all users.
That way your loop runs only when there are users on your site, which saves a lot of server resources.
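In other words, a "poor man's cron". A minimal sketch, using a file for the timestamp (a DB row or cache key works the same way); run_user_checks() stands in for the real loop:
<?php
// Run the heavy loop at most once per minute, piggy-backing on page visits.
$stampFile = __DIR__ . '/last_run.txt';
$lastRun   = is_file($stampFile) ? (int) file_get_contents($stampFile) : 0;

if (time() - $lastRun >= 60) {
    file_put_contents($stampFile, (string) time()); // claim this run first
    run_user_checks();                              // the heavy per-user loop (placeholder)
}
Note this is not race-proof: two visits arriving in the same instant could both trigger a run, so a lock file or conditional DB update is worth adding. It also never runs while nobody visits the site, which is the trade-off of this approach.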
I have a table of more than 15,000 feeds, and it's expected to grow. What I am trying to do is fetch new articles using SimplePie, synchronously, and store them in a DB.
Now I have run into a problem: since the number of feeds is high, my server stops responding and I am no longer able to fetch feeds. I have also implemented some caching, and I fetch odd and even feeds at different time intervals.
What I want to know is whether there is any way of improving this process, maybe by fetching feeds in parallel. Or perhaps someone can give me a pseudo-algorithm for it.
15,000 Feeds? You must be mad!
Anyway, a few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but ensuring you have a decent amount of time to work in is a start.
Track Last Check against Feed URLs
Maybe add a field to each feed, last_check, and set that field to the date/time of the last successful pull for that feed.
Process Smaller Batches
Better to run smaller batches more often. Think of it as the PHP equivalent of putting your eggs in more than one basket. With the last_check field above, it is easy to identify the feeds that have gone longest without an update and to set a threshold for how often to process them.
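As a rough illustration of combining last_check with small batches (the table and column names here are assumptions, not an existing schema):
<?php
// Pick the 100 feeds that have gone longest without a successful pull.
$pdo  = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');
$rows = $pdo->query(
    'SELECT id, url FROM feeds
     WHERE last_check IS NULL OR last_check < NOW() - INTERVAL 30 MINUTE
     ORDER BY last_check ASC
     LIMIT 100'
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $feed) {
    // ... fetch and store the feed with SimplePie here ...
    $pdo->prepare('UPDATE feeds SET last_check = NOW() WHERE id = ?')
        ->execute([$feed['id']]);
}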
Run More Often
Set up a cron job and process, say, 100 records every 2 minutes or something like that.
Log and Review your Performance
Keep logfiles and record stats: how many records were processed, how long it had been since they were last processed, how long the script took. These metrics will let you tweak the batch sizes, cron settings, time limits, etc., to ensure the maximum number of checks is performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased volumes, and it forms a strong foundation for any further maintenance tasks you might be looking at down the track.
fetch new articles using simplepie, synchronously
What do you mean by "synchronously"? Do you mean consecutively in the same process? If so, this is a very dumb approach.
You need a way of sharding the data so it runs across multiple processes. Doing this declaratively based on, say, the modulus of the feed id or the hash of the URL is not a good solution - one slow URL would cause multiple feeds to be held up.
A better solution would be to start up multiple threads/processes which would each:
lock list of URL feeds
identify the feed with the oldest expiry date in the past which is not flagged as reserved
flag this record as reserved
unlock the list of URL feeds
fetch the feed and store it
remove the reserved flag on the list for this feed and update the expiry time
Note that if there are no expired records at step 2, the worker should unlock the table; what happens next depends on whether you run the threads as daemons (in which case they should implement an exponential back-off, e.g. sleeping for 10 seconds and doubling up to 320 seconds on consecutive empty iterations) or as batches (in which case they should simply exit).
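A rough PHP/MySQL sketch of one such worker is below; it uses a transaction with SELECT ... FOR UPDATE as the "lock", and assumes an InnoDB feeds table with reserved and expires_at columns (names are illustrative):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');

while (true) {
    $pdo->beginTransaction();                              // lock the list
    $feed = $pdo->query(
        'SELECT id, url FROM feeds
         WHERE reserved = 0 AND expires_at < NOW()
         ORDER BY expires_at ASC LIMIT 1 FOR UPDATE'
    )->fetch(PDO::FETCH_ASSOC);

    if (!$feed) {                                          // nothing expired
        $pdo->commit();
        break;  // or sleep with exponential back-off if running as a daemon
    }

    $pdo->prepare('UPDATE feeds SET reserved = 1 WHERE id = ?')
        ->execute([$feed['id']]);
    $pdo->commit();                                        // unlock the list

    // ... fetch the feed and store it ...

    $pdo->prepare('UPDATE feeds SET reserved = 0,
                       expires_at = NOW() + INTERVAL 30 MINUTE
                   WHERE id = ?')
        ->execute([$feed['id']]);
}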
Thank you for your responses. I apologize for replying a little late; I got busy with this problem and later forgot about this post.
I have been researching this a lot and faced a lot of problems. You see, 15,000 feeds every day is not easy.
Maybe I am MAD! :) But I did solve it.
How?
I wrote my own algorithm. And yes, it's written in PHP/MySQL. I basically implemented a simple weighted machine-learning algorithm: it learns the posting times of each feed and then estimates the next polling time for that feed, which I save in my DB.
And since it's a learning algorithm, it improves with time. Of course, there are 'misses', but those misses are at least better than crashing servers. :)
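The published algorithm itself is not reproduced here; purely as an illustration of the general idea (not the author's actual method), an exponentially weighted average of observed posting intervals could drive the next polling time along these lines:
<?php
// Illustration only. Keep a smoothed estimate of how often a feed posts and
// schedule the next poll from that estimate; recent behaviour counts more.
function next_poll_delay(float $avgInterval, float $observedInterval, float $alpha = 0.3): float
{
    return (1 - $alpha) * $avgInterval + $alpha * $observedInterval;
}

// A feed we thought posted every 6 hours just posted after 2 hours:
// the estimate moves toward polling it more often.
echo next_poll_delay(6 * 3600, 2 * 3600); // 17280 seconds, about 4.8 hours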
I have also written a paper on this, which was published in a local computer science journal.
Also, regarding the performance gain, I am seeing a 500% to 700% improvement in speed compared to sequential polling.
How is it going so far?
I have a DB that has grown to TBs in size. I am using MySQL. Yes, I am facing performance issues with MySQL, but they're not severe. Most probably I will move to some other DB or add sharding to my existing one.
Why I chose PHP?
Simple, because I wanted to show people that PHP and MySQL are capable of such things! :)
I'm currently building a user panel which will scrape daily information using curl. For each URL it will INSERT a new row to the database. Every user can add multiple URLs to scrape. For example: the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the curl scraping - with a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about the MySQL database: with 5,000 new rows a day, the database will be huge after a single month.
If you're wondering, I'm building a statistics service which will show users the daily growth of their pages (not talking about traffic), so as I understand it I need to insert one new value per URL per day.
Any suggestions will be appreciated.
5,000 x 365 is only about 1.8 million rows a year... nothing to worry about for the database. If you want, you can put the data into MongoDB (you'll need a 64-bit OS); that will make it easier to expand and shuffle the load across multiple machines when you need to.
If you want to run curl non-stop from cron until it is finished, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script which sleeps a few seconds between each curl pull. If each scrape takes 2 seconds, you could scrape 43,200 pages per 24-hour period. If you slept 4 seconds between each 2-second pull, you could do 14,400 pages per day (5k is roughly a third of 14.4k, so with a 4-second sleep between 2-second scrapes you should be done in about 8-9 hours).
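A minimal sketch of the sleep-between-pulls version; $urls is assumed to be loaded from the database beforehand:
<?php
// Roughly one 2-second scrape followed by a 4-second pause,
// i.e. about one page every 6 seconds.
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);       // the ~2 second pull
    curl_close($ch);
    // ... INSERT the scraped values into the database here ...
    sleep(4);                     // be gentle on the remote servers
}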
This seems very doable on a minimal VPS for the first year, or at least the first 6 months. After that, you can think about using more machines.
(Edit: also, if you're worried about space, you can store the scraped page source gzipped as a binary blob.)
I understand that each customer's pages need to be checked at the same time each day to keep the growth stats accurate. But do all customers need to be checked at the same time? I would divide the customers into chunks based on their ids. That way you could update each customer at the same time every day without having to do them all at once.
For the database-size problem I would do two things. First, use partitions to break the data into manageable pieces. Second, if a value did not change from one day to the next, I would not insert a new row for that page; when presenting the data, I would extrapolate the missing values. Unless all you are storing is small bits of text, in which case I'm not sure the number of rows is going to be that big a problem, provided you use proper indexing and pagination for queries.
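As a sketch of the "only insert when the value changed" idea (the table, columns, and variables here are assumptions):
<?php
// Compare against the most recent stored value before writing a new row.
$prev = $pdo->prepare(
    'SELECT value FROM page_stats
     WHERE page_id = ? ORDER BY recorded_at DESC LIMIT 1');
$prev->execute([$pageId]);
$last = $prev->fetchColumn();

if ($last === false || (int) $last !== (int) $newValue) {
    $pdo->prepare('INSERT INTO page_stats (page_id, value, recorded_at)
                   VALUES (?, ?, NOW())')
        ->execute([$pageId, $newValue]);
}
// When reading the data back, carry the last known value forward for days
// with no row - the "extrapolate for presentation" step mentioned above.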
Edit: adding a bit of an example
function do_curl($start_index, $stop_index) {
    // Query for all pages with ids between start index and stop index
    $query = "SELECT * FROM db_table WHERE id >= $start_index AND id <= $stop_index";
    // ... run $query against the DB, then loop over the matching pages ...
    for ($i = $start_index; $i <= $stop_index; $i++) {
        // do the curl request for page $i here
    }
}
The URLs would look roughly like:
http://xxx.example.com/do_curl?start_index=1&stop_index=10
http://xxx.example.com/do_curl?start_index=11&stop_index=20
The best way to deal with the growing number of pages is perhaps to write a single cron script that generates the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
Use multi curl, and properly optimise (not simply normalise) your database design. If I were to run this cron job, I would spend time studying whether it is possible to do the work in chunks. Regarding hardware, start with an average configuration, keep monitoring it, and add CPU or memory as needed. Remember, there is no silver bullet.