How to optimize speed for multiple cURL GET requests in PHP? - php

I'm connecting to an API with PHP through a cURL GET and I receive a JSON with almost 5000 orders. For every order I make another cURL GET and receive the order details (basically two nested foreach loops). After that I make some inserts and updates in the database (basic stuff) with Laravel.
The big problem is that for those 5000 orders I have a loading time of almost one hour. I need to do this with a cron every night (and for more than 5000 orders).
I have a cloud solution with 2 GB of memory and 2 processor cores.
I also tried Zebra cURL, but I cannot use it for a cURL request made inside another cURL request.

If 2% of the daily jobs (1 of 50 organisations) takes almost 4% of a day (one hour), you definitely need parallel processing.
There are several solutions for that:
Run a cron job for each organisation.
Process entries for multiple companies in 1 multi-curl call.
Use multiple servers.
I would probably use the first one: you could have a cron job check (every minute?) which organisations still need to be processed and which are being processed at the moment, and pick one if there are any left.
That way each job would still take 1 hour, but all of them would be processed within a span of about 2 hours each night.
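For reference, here is a minimal sketch of the multi-cURL idea (option 2): fetch the order details in parallel batches instead of one by one. The endpoint URL, batch size and the array of order ids are assumptions for illustration, not taken from the question.
<?php
// Fetch order details in parallel batches of 10 using curl_multi.
function fetch_orders_parallel(array $orderIds, $batchSize = 10) {
    $results = [];
    foreach (array_chunk($orderIds, $batchSize) as $chunk) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($chunk as $id) {
            $ch = curl_init('https://api.example.com/orders/' . $id); // placeholder URL
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$id] = $ch;
        }
        // Run all handles in this batch until every transfer has finished.
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);
        } while ($running > 0);
        foreach ($handles as $id => $ch) {
            $results[$id] = json_decode(curl_multi_getcontent($ch), true);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}
With a batch size of 10, the 5000 detail requests run roughly 10 at a time instead of strictly one after another, which is usually where most of the hour goes.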

Related

Laravel run multiple scheduled tasks

I currently have a scheduled console command that runs every 5 minutes without overlap like this:
$schedule->command('crawler')
         ->everyFiveMinutes()
         ->withoutOverlapping()
         ->sendOutputTo('../_laravel/storage/logs/scheduler-log.txt');
It works great, but I currently have about 220 pages that take about 3 hours to finish in 5-minute increments, because I force it to crawl only 10 pages at each interval since each page takes 20-30 seconds to crawl due to various factors. Each page is a record in the database. If I end up with 10,000 pages to crawl, this method would not work because it would take more than 24 hours, and each page is supposed to be re-crawled once a day.
My vendor allows up to 10 concurrent requests (or more with higher plans), so what's the best way to run the crawls concurrently? If I just duplicate the scheduler code, does it run the same command twice, or 10 times if I duplicate it 10 times? Would that cause any issues?
And then I need to pass parameters to the command such as 1, 2, 3, etc., which I could use to determine which pages to crawl, i.e. 1 would be records 1-10, 2 would be the next records 11-20, and so on.
Using this StackOverflow answer, I think I know how to pass it along, like this:
$schedule->command('crawler --sequence=1')
But how do I read that parameter within my Command class? Does it just become a regular PHP variable, i.e. $sequence?
It is better to use a queue for job processing:
On cron, add all jobs to the queue.
Run multiple queue workers, which will process the jobs in parallel.
Tip: this happened to us.
It can happen that a previously added job has not finished yet when cron adds the same task to the queue again (queues are processed sequentially). To protect yourself, record in the database when each task last completed, so you know whether it is safe to enqueue it again (or whether it was seriously delayed).
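As a rough illustration of that queue approach (the class, model and chunk size are assumptions, and this assumes a Laravel version where jobs have a static dispatch() helper; on older versions you would call dispatch(new CrawlPages(...)) instead):
<?php
// app/Jobs/CrawlPages.php - one queued job per chunk of pages (illustrative names).
namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

class CrawlPages implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public $pageIds;

    public function __construct(array $pageIds)
    {
        $this->pageIds = $pageIds;
    }

    public function handle()
    {
        foreach ($this->pageIds as $id) {
            // crawl the page for this record and update it in the database
        }
    }
}

// In app/Console/Kernel.php: the nightly cron only enqueues work, it does not crawl.
$schedule->call(function () {
    \App\Page::pluck('id')->chunk(10)->each(function ($ids) {
        \App\Jobs\CrawlPages::dispatch($ids->all());
    });
})->daily();
Then start several workers in parallel (php artisan queue:work, usually kept alive by Supervisor); the vendor's limit of 10 concurrent requests maps naturally to running at most 10 workers, and withoutOverlapping is no longer needed because each chunk is its own job.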
I found this in the documentation; I hope it is what you're looking for:
Retrieving Input
While your command is executing, you will obviously need to access the
values for the arguments and options accepted by your application. To
do so, you may use the argument and option methods:
Retrieving The Value Of A Command Argument
$value = $this->argument('name');
Retrieving All Arguments
$arguments = $this->argument();
Retrieving The Value Of A Command Option
$value = $this->option('name');
Retrieving All Options
$options = $this->option();
source
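To directly answer the question above: the value does not become a plain $sequence variable; you read it with $this->option(). A minimal sketch, assuming Laravel 5.1+ signature syntax (in older versions you would declare the option in getOptions() and use fire() instead of handle()):
<?php
namespace App\Console\Commands;

use Illuminate\Console\Command;

class Crawler extends Command
{
    // --sequence defaults to 1; run it as: php artisan crawler --sequence=2
    protected $signature = 'crawler {--sequence=1}';
    protected $description = 'Crawl one slice of the pages table';

    public function handle()
    {
        $sequence = (int) $this->option('sequence');

        // e.g. sequence 1 => records 1-10, sequence 2 => records 11-20, ...
        $start = ($sequence - 1) * 10 + 1;
        $stop  = $sequence * 10;

        $this->info("Crawling records {$start} to {$stop}");
    }
}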

Scale multiple requests to different services

I have a service where I need to query 40 external services (APIs) to get information for each user request. For example, a user searches for some information, my service asks 40 external partners for it, aggregates the results in one DB (MySQL) and displays them to the user.
At the moment I have a multi-cURL solution with 10 partner requests in flight at a time; when one partner request finishes, the software adds another partner from the remaining 30 to the multi-cURL queue, until all 40 requests are done and the results are in the DB.
The problem with this solution is that it cannot scale across many servers. I want a solution where I can fire all 40 requests at once, for example divided over 2-3 servers, and wait only as long as the slowest partner takes to deliver its results. That means if the slowest partner takes 10 seconds, I will have the results of all 40 partners in 10 seconds. With multi-cURL I run into trouble when there are more than 10-12 requests at a time.
What kind of solution can you offer me that uses as few resources as possible, can run many processes on one server, and is scalable? My software is written in PHP, so I need something that connects well to PHP via a framework or an API.
I hope you understand my problem and my needs. Please ask if something is not clear.
One possible solution would be to use a message queue system like beanstalkd, Apache ActiveMQ, memcacheQ etc.
A high level example would be:
User makes request to your service for information
Your service adds the requests to the queue (presumably one for each of the 40 services you want to query)
One or more job servers continuously poll the queue for work
A job server gets a message from the queue to do some work, adds the data to the DB and deletes the item from the queue.
In this model, since the one task of performing 40 requests is now distributed and no longer part of a single "process", the next part of the puzzle is figuring out how to mark a set of work as completed. That may or may not be difficult, depending on your data and your application. One option is another cache/DB row holding a counter set to the number of jobs a particular request needs in order to complete; as each queue worker finishes a request, it decrements the counter by 1. Once the counter reaches 0, you know the request has been completed. If you do that, you need to make sure the counter actually reaches 0 and doesn't get stuck for some reason.
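A minimal sketch of that counter idea using a plain MySQL row and PDO (the table, column and variable names are made up for illustration):
<?php
// Seed the counter when the user's search request comes in.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$requestId = 123; // id of the user's search request (illustrative)

$pdo->prepare('INSERT INTO request_progress (request_id, remaining) VALUES (?, ?)')
    ->execute([$requestId, 40]);

// Each queue worker, after storing its partner's result, decrements atomically:
$pdo->prepare('UPDATE request_progress SET remaining = remaining - 1 WHERE request_id = ?')
    ->execute([$requestId]);

// Whoever is waiting for the aggregated result polls until the counter hits zero:
$stmt = $pdo->prepare('SELECT remaining FROM request_progress WHERE request_id = ?');
$stmt->execute([$requestId]);
$done = ((int) $stmt->fetchColumn() === 0);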
That's one way at least, hope that helps you a little or opens the door for more ideas.

Which real-time technology/technique is the best for running a heavy loop every 1 minute?

I'm attempting to create a procedure that will run on the server every minute (more or less). I know I could achieve it with a cron job, but I'm concerned: say I have about 1000 tasks (1000 users that the system would need to check every minute), wouldn't it kill the server?
This system is supposed to sync data from the Google AdWords API and do something with it. For example, it should read a campaign from Google and do something every 1000 impressions or clicks. So obviously I need to keep querying the AdWords API to see the stats in real time. Imagine this procedure needing to run for 1000 registered users.
What technology should I use when I need to run a heavy loop every minute?
Thanks a lot,
Oron
Ideally, if you're using a distributed workflow, you are better off servicing multiple users this way.
While there are many technologies at your disposal, it's difficult to pinpoint any specific one that will be useful when you haven't given sufficient information.
There are 2 fundamental issues with what you are trying to do, the first one more severe than the other.
The AdWords API doesn't give you real-time stats; reports are usually delayed by about 3 hours. See http://adwordsapi.blogspot.com/2011/06/statistics-in-reports.html for additional background information.
What you can do is pick a timeframe (e.g. once every 2 hours) for which you want to run reports, and then stick to that schedule. To give a rough estimate: if you run 10 reports in parallel and it takes 10 seconds to download and parse each one, you get a throughput of about 1 report per second, so you could refresh 1000 accounts in 20-30 minutes. The actual timing depends on your internet connection speed, the load on the AdWords API servers, how big your clients are, which columns, segmentation and date ranges you request, and so on; for big clients a single report can easily take several minutes.
Even if you wanted to go much faster, the AdWords API won't allow you to: it has a rate-limiting mechanism that throttles your calls if you make too many in a short period. See http://adwordsapi.blogspot.com/2010/06/better-know-error-rateexceedederror.html for details.
My advice is to ask this question on the official forum for the AdWords API - https://groups.google.com/forum/?fromgroups#!forum/adwords-api. You will definitely find users who have tackled similar issues before.
You can save a variable which contains a timestamp of the last run of your loop.
Now, when a user visits your page, check whether that timestamp is older than 1 minute.
If it is older, run your loop and check all users.
This way your loop runs only when there are users on your site, which saves a lot of server performance.
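A minimal file-based sketch of that idea (the file path and 60-second window are illustrative; in practice you would also want a lock or a DB row so two simultaneous visitors can't both trigger the loop):
<?php
// Poor man's cron: run the heavy loop at most once a minute, piggybacking on page views.
$stampFile = __DIR__ . '/last_run.txt';
$lastRun = is_file($stampFile) ? (int) file_get_contents($stampFile) : 0;

if (time() - $lastRun >= 60) {
    file_put_contents($stampFile, time()); // record this run first
    // ... run your loop here and check all users ...
}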

Multiple time-critical background tasks

I'm new to PHP, so I need some guidance as to which would be the simplest and/or most elegant solution to the following problem:
I'm working on a project which has a table with as many as 500,000 records. At user-specified periods, a background task must be started which invokes a command-line application on the server that does the magic. The problem is that every minute or so I need to check all 500,000 records (and counting) to see whether something needs to be done.
As the title says, it is time-critical: a maximum delay of 1 minute is allowed between the time expected by the user and the time the task is actually executed; of course, the less delay, the better.
Thus far I can only think of a rather dirty option: a simple utility app that runs on the server and, every minute, makes multiple requests to the server, for example:
check records between 1 and 100,000;
check records between 100,000 and 200,000;
etc. you get the point;
and the server basically starts a task for each bulk of 100,000 records or fewer. But it seems to me that there must be a faster approach, something similar to Facebook's notifications.
Additional info:
server is Windows 2008
using apache + php
EDIT 1
users have an average of 3 tasks per day at roughly 6-8 hour intervals
more than half of the tasks end up scheduled for the same time at least once per day [!]
Any suggestion is highly appreciated!
The easiest approach would be to use a persistent task that runs the whole time and receives notifications about records that need to be processed. It could then process them immediately or, if a record needs to be processed at a certain time, sleep until either that time is reached or another notification arrives.
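As a rough sketch of that persistent task (table, column and executable names are assumptions): a long-running CLI script that keeps polling for due records, launches the command-line application for each, and sleeps briefly in between, which keeps the delay well under the 1-minute tolerance.
<?php
// Long-running worker; start it once (e.g. as a Windows service or via Task Scheduler).
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

while (true) {
    // Pick up every record whose scheduled time has arrived and that is still pending.
    $stmt = $pdo->query('SELECT id, payload FROM tasks WHERE run_at <= NOW() AND done = 0');
    foreach ($stmt as $task) {
        // Launch the command-line application that "does the magic" for this record.
        exec('myapp.exe ' . escapeshellarg($task['payload']));
        $pdo->prepare('UPDATE tasks SET done = 1 WHERE id = ?')->execute([$task['id']]);
    }
    sleep(5); // check again in a few seconds
}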
I think I gave this question more than enough time. I will stick with a utility application (sitting on the server) that makes requests to a URL accessible only from the server's IP, which will spawn a new thread for each task if multiple tasks need to be executed at the same time. It's not really scalable, but it will have to do for now.

How can I make my curl-based URL monitoring service lightweight to run?

I'm currently building a user panel which will scrape daily information using cURL. For each URL it will INSERT a new row into the database. Every user can add multiple URLs to scrape. For example: the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the cURL scraping - with a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about the MySQL database: with 5,000 new rows a day, the database will be huge after a single month.
In case you're wondering, I'm building a statistics service which will show the daily growth of their pages (not talking about traffic), so as I understand it I need to insert a new value per user per day.
Any suggestions will be appreciated.
5,000 x 365 is only about 1.8 million rows... nothing to worry about for the database. If you want, you can put the data into MongoDB (needs a 64-bit OS); this will let you expand and shuffle load across multiple machines more easily when you need to.
If you want to run cURL non-stop from a cron until it finishes, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script which sleeps a few seconds between each cURL pull. If each scrape takes 2 seconds, that allows you to scrape 43,200 pages per 24-hour period. If you slept 4 seconds between each 2-second pull, that would let you do 14,400 pages per day; 5k is roughly a third of 14.4k, so you would be done in about a third of a day (8-9 hours) with a 4-second sleep between 2-second scrapes.
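A sketch of that "sleep between pulls" variant (the $urls list and sleep time are placeholders): at roughly 2 seconds per fetch plus a 4-second pause, 5,000 URLs finish in about 8-9 hours.
<?php
// Throttled scrape loop: one URL at a time, with a pause so the box stays responsive.
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... INSERT the values scraped from $html for this URL into the database ...

    sleep(4); // spread the load across the day
}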
This seems very doable on a minimal VPS machine for the first year, or at least for the first 6 months. Then you can think about adding more machines.
(Edit: also, if you are worried about space, you can store the scraped page source gzipped.)
I understand that each customer's pages need to be checked at the same time each day to make the growth stats accurate. But do all customers need to be checked at the same time? I would divide my customers into chunks based on their ids. That way you could still update each customer at the same time every day, but without having to do them all at once.
For the database size problem I would do two things. First, use partitions to break the data up into manageable pieces. Second, if a value did not change from one day to the next, I would not insert a new row for that page; when processing the data for presentation, I would then extrapolate the missing values. Unless all you are storing is small bits of text - then I'm not sure the number of rows will be that big a problem, provided you use proper indexing and pagination for queries.
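A sketch of the "only insert when the value changed" idea (table, column and variable names are assumptions; $pdo, $pageId and $newValue are assumed to exist):
<?php
// Only insert a new daily row when the value actually changed.
$stmt = $pdo->prepare('SELECT value FROM page_stats WHERE page_id = ?
                       ORDER BY recorded_at DESC LIMIT 1');
$stmt->execute([$pageId]);
$lastValue = $stmt->fetchColumn(); // false if there is no previous row

if ($lastValue === false || $lastValue != $newValue) {
    $pdo->prepare('INSERT INTO page_stats (page_id, value, recorded_at) VALUES (?, ?, NOW())')
        ->execute([$pageId, $newValue]);
}
// When presenting the data, carry the last stored value forward over the skipped days.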
Edit: adding a bit of an example
function do_curl($start_index, $stop_index) {
    // Do a query here to get all pages with ids between start index and stop index
    $query = "SELECT * FROM db_table WHERE id >= $start_index AND id <= $stop_index";
    for ($i = $start_index; $i <= $stop_index; $i++) {
        // do the cURL request for page $i here
    }
}
The URLs would look roughly like:
http://xxx.example.com/do_curl?start_index=1&stop_index=10
http://xxx.example.com/do_curl?start_index=11&stop_index=20
The best way to deal with the growing database size is perhaps to write a single cron script that generates the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
Use multi-cURL, and properly optimise (not simply normalise) your database design. If I were running this cron job, I would spend time studying whether the work can be done in chunks. Regarding hardware, start with an average configuration, keep monitoring it, and add CPU or memory as needed. Remember, there is no silver bullet.
