I am trying to create a website monitoring webapp using PHP. At the minute I'm using curl to collect headers from different websites and update a MySQL database when a website's status changes (e.g. if a site that was 'up' goes 'down').
I'm using curl_multi (via the Rolling Curl X class, which I've adapted slightly) to process 20 sites in parallel (which seems to give the fastest results) and CURLOPT_NOBODY to make sure only headers are collected, and I've tried to streamline the script to make it as fast as possible.
It is working OK and I can process 40 sites in approx. 2-4 seconds. My plan has been to run the script via cron every minute... so it looks like I will be able to process about 600 websites per minute. Although this is fine at the minute, it won't be enough in the long term.
So how can I scale this? Is it possible to run multiple cron jobs in parallel, or will this run into bottlenecking issues?
Off the top of my head I was thinking that I could maybe break the database into groups of 400 and run a separate script for these groups (e.g. ids 1-400, 401-800, 801-1200 etc. could run separate scripts) so there would be no danger of database corruption. This way each script would be completed within a minute.
However, it feels like this might not work, since the one script running curl_multi seems to max out performance at 20 requests in parallel. So will this work, or is there a better approach?
Yes, the simple solution is to use the same PHP CLI script and pass it two arguments, i.e. the min and max IDs of the range of database records (each containing one site's information) to process.
Ex. crontab list
* * * * * php /user/script.php 1 400
* * * * * php /user/script.php 401 800
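For illustration, a minimal sketch of what such a CLI script could look like using plain curl_multi (the table and column names sites/id/url/status, the credentials, and the batch size are assumptions, not something from the question):

<?php
// check_sites.php -- usage: php check_sites.php 1 400
$min = isset($argv[1]) ? (int) $argv[1] : 0;
$max = isset($argv[2]) ? (int) $argv[2] : 0;

$pdo  = new PDO('mysql:host=localhost;dbname=monitor', 'user', 'pass');
$stmt = $pdo->prepare('SELECT id, url, status FROM sites WHERE id BETWEEN ? AND ?');
$stmt->execute([$min, $max]);
$sites  = $stmt->fetchAll(PDO::FETCH_ASSOC);
$update = $pdo->prepare('UPDATE sites SET status = ? WHERE id = ?');

// 20 parallel header-only requests per batch, as in the question.
foreach (array_chunk($sites, 20) as $batch) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($batch as $site) {
        $ch = curl_init($site['url']);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,   // headers only
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 10,
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[] = ['ch' => $ch, 'site' => $site];
    }

    do {                                      // run the whole batch to completion
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $h) {
        $code   = curl_getinfo($h['ch'], CURLINFO_HTTP_CODE);
        $status = ($code >= 200 && $code < 400) ? 'up' : 'down';
        if ($status !== $h['site']['status']) {
            $update->execute([$status, $h['site']['id']]);   // only write on a status change
        }
        curl_multi_remove_handle($mh, $h['ch']);
        curl_close($h['ch']);
    }
    curl_multi_close($mh);
}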
Or, using a single script, you can use multi-threading (multi-threading in PHP with pthreads). In that case, the cron interval should be based on a benchmark of how long it takes to complete all 800 sites.
Ref: How can one use multi threading in PHP applications
E.g. if the multithreaded script completes all 800 sites in 3 minutes, then set the interval to */3.
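For reference, a minimal pthreads sketch (this assumes a thread-safe (ZTS) CLI build of PHP with the pthreads extension installed; the SiteCheck class is made up for illustration):

<?php
class SiteCheck extends Thread
{
    private $url;
    public $httpCode;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function run()
    {
        $ch = curl_init($this->url);
        curl_setopt($ch, CURLOPT_NOBODY, true);   // headers only
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_exec($ch);
        $this->httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
    }
}

$threads = [];
foreach (['http://example.com', 'http://example.org'] as $url) {
    $t = new SiteCheck($url);
    $t->start();          // run() executes in its own thread
    $threads[] = $t;
}
foreach ($threads as $t) {
    $t->join();           // wait for every check to finish
    echo $t->httpCode, PHP_EOL;
}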
Related
I am using Scrapy and scrapyd to crawl some content. I have 28 crawlers that run, but only 8 at a time. Each crawler takes from 10 minutes to several hours to complete. So I'm looking for a way to order them correctly, in order to minimize the time the server is active.
I already gather information on how long each crawl takes, so it's only the minimization problem, or how to formulate it.
The script is started using PHP, so the solutions should preferably run in PHP.
The best way I've found is to set them up as cron jobs to execute at specific times. I have around 30 cron jobs configured to start at various times, meaning you can set a specific time per scrape.
Executing a PHP command by cron job at 5pm every day:
0 17 * * * php /opt/test.php
If you execute the scrapy Python command via cron job, it's:
0 17 * * * cd /opt/path1/ && scrapy crawl site1
If you're using virtualenv for your Python, then it's:
0 17 * * * source /opt/venv/bin/activate && cd /opt/path1/ && scrapy crawl site1
Sorry to disappoint you, but there's nothing clever here and no real minimization problem in what you describe, because you don't state any dependencies between crawling jobs. Independent jobs will take roughly TOTAL_TIME/THROUGHPUT regardless of how you order them.
scrapyd will start processing the next job as soon as one finishes. The "8 at a time" isn't some kind of fixed bucket, so there's no combinatorial/dynamic-programming problem here. Just throw all 28 jobs at scrapyd and let it run. When you poll and find it idle, you can shut down your server.
You might get some small benefit from scheduling the longest jobs first: you can squeeze a few tiny jobs into the idle slots while the last few long jobs finish. But unless you're in some pathological case, that benefit shouldn't be major.
Note also that this number "8" - I guess enforced by max_proc_per_cpu and/or max_proc - is somewhat arbitrary. Unless that's the number where you hit 100% CPU or something, maybe a larger number would be better suited.
If you want major benefits, find the 2-3 largest jobs and find a way to cut them in half, e.g. if you're crawling a site with vehicles, split the single crawl in two: one for cars and one for motorbikes. This is usually possible and will yield more significant benefits than reordering. For example, if your longest job takes 8 hours and the next longest takes 5, splitting the longest into two 4-hour crawls makes the 5-hour one the bottleneck, potentially saving your server 3 hours.
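Since the crawls are started from PHP anyway, throwing all 28 jobs at scrapyd up front (longest first, if you want that small benefit) is just a loop against its schedule.json endpoint. A sketch, assuming scrapyd on localhost:6800 and made-up project/spider names and duration estimates:

<?php
// Estimated durations in minutes (made-up numbers); sort longest first so the
// small jobs fill the idle slots at the end.
$spiders = ['site1' => 480, 'site2' => 300, 'site3' => 45, 'site4' => 10];
arsort($spiders);

foreach (array_keys($spiders) as $spider) {
    $ch = curl_init('http://localhost:6800/schedule.json');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query([
            'project' => 'myproject',   // assumed project name
            'spider'  => $spider,
        ]),
        CURLOPT_RETURNTRANSFER => true,
    ]);
    echo curl_exec($ch), PHP_EOL;   // scrapyd queues the job and runs max_proc of them at a time
    curl_close($ch);
}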
I'm connecting to an API with PHP through a cURL GET and I receive a JSON response with almost 5000 orders. For every order I make another cURL GET to receive the order details (basically two foreach loops). After that I make some inserts and updates in the database (basic stuff) with Laravel.
The big problem is that for those 5000 orders I have a loading time of almost one hour. I need to do this with a cron every night (and for more than 5000 orders).
I have a cloud solution with 2 GB of memory and a 2-core processor.
I also tried Zebra cURL, but I cannot use it for a curl request inside another curl request.
If 2% of the daily jobs (1 of 50 organisations) takes almost 4% of a day, you definitely need parallel processing.
There are several solutions for that:
Run a cron job for each organisation.
Process entries for multiple companies in 1 multi-curl call.
Use multiple servers.
I would probably use the first one: you could have a cron job check (every minute?) which organisations still need to be processed and which organisations are being processed at the moment, and pick one if there are any left.
That way each job would still take 1 hour, but all of them would be processed within a time span of 2 hours each night.
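A sketch of that per-minute cron job, assuming an organisations table with a status column used as a claim (all table/column names here are placeholders, not something from the question):

<?php
// pick_org.php -- run every minute from cron. Claims one unprocessed
// organisation atomically so parallel runs never pick the same one.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$pdo->beginTransaction();
$row = $pdo->query(
    "SELECT id FROM organisations WHERE status = 'pending' ORDER BY id LIMIT 1 FOR UPDATE"
)->fetch(PDO::FETCH_ASSOC);

if ($row === false) {       // nothing left to do tonight
    $pdo->commit();
    exit;
}

$claim = $pdo->prepare("UPDATE organisations SET status = 'processing' WHERE id = ?");
$claim->execute([$row['id']]);
$pdo->commit();

processOrders((int) $row['id']);   // your existing API calls + Laravel inserts/updates for one organisation

$done = $pdo->prepare("UPDATE organisations SET status = 'done' WHERE id = ?");
$done->execute([$row['id']]);

function processOrders($orgId)
{
    // placeholder for the existing per-organisation logic
}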
What is the best way to perform cron-job automation for multiple users?
Example:
A cron job needs to run every 10 minutes and call a PHP script that connects to an external API (via curl) and collects data (site visitors and other data) a user has received on an external web property. We need to check via the API every 10 minutes whether there is any new data available for that user and fetch it -- and that for EACH user in the web app.
Such a cron-job PHP call to the API usually takes 1-3 seconds per user, but could occasionally take 30 seconds or more to complete for one user (in exceptional situations).
Question...
What is the best way to perform this procedure and collect external data like that for MULTIPLE users? Hundreds, even thousands of users?
For each user, we need to check for data every 10 minutes.
Originally I was thinking of calling 10 users in a row in a loop with one cron-job call, but since each user's collection can take 30 seconds... for 10 users a loop could take several minutes and... the script could time out? Correct?
Do you have some tips and suggestions on how to perform this procedure for many users most efficiently? Should separate cron jobs be used for each user? Instead of a loop?
Thank you!
=== EDIT ===
Let's say one PHP script can call the API for 10 users within 1 minute... Could I create 10 cron-jobs that essentially call the same PHP script simultaneously, but each one collecting a different batch of 10 users? This way I could potentially get data for 100 users within one minute? No?
It could look like this:
/usr/local/bin/php -q get_data.php?users_group=1
/usr/local/bin/php -q get_data.php?users_group=2
/usr/local/bin/php -q get_data.php?users_group=3
and so on...
Is this going to work?
=== NOTE ===
Each user has a unique Access Key with the external API service, so one API call can only be for one user at a time. But the API could receive multiple simultaneous calls for different users at once.
If it takes 30 seconds per user and you have more than 20 users, you won't finish before you need to start again. I would consider using Gearman or another job server to handle each of these requests in an async way. Gearman can also wait for jobs to complete, so you should be able to loop over all the requests you need to make and then wait for them to finish. You can probably accomplish the same thing with PHP's pthreads implementation, but that's going to be significantly more difficult.
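For example, a minimal Gearman sketch (assuming the PECL gearman extension and a gearmand server on localhost; the function name fetch_user_data and the worker body are illustrative, not from the question):

// dispatch.php -- run from cron every 10 minutes
$client = new GearmanClient();
$client->addServer();                                     // defaults to 127.0.0.1:4730
$userIds = range(1, 1000);                                // in practice, load the user IDs from your DB
foreach ($userIds as $id) {
    $client->addTask('fetch_user_data', (string) $id);    // queue one task per user
}
$client->runTasks();                                      // blocks until every task has finished

// worker.php -- keep several of these running (e.g. under supervisord)
$worker = new GearmanWorker();
$worker->addServer();
$worker->addFunction('fetch_user_data', function (GearmanJob $job) {
    $userId = (int) $job->workload();
    // make the curl call to the external API with this user's access key here
});
while ($worker->work());

Running several workers gives you the parallelism; the dispatching script itself finishes as soon as the last task completes.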
I want to accomplish the following behavior in PHP:
1 - Script gets called with parameters
2 - I initiate a thread for a long-running operation
3 - Script should return control to the caller
4 - Thread executes until it's finished
Is this behavior possible? What I am seeing now is that the script won't return until the thread has finished executing, which makes sense, as the execution of the thread would probably die if the script stopped executing. But is there no way to stop blocking the client so they can go about their business? Am I stuck using some exec() call to get this behavior? Is there a way to get this done with threading only? I'd like to avoid using exec if possible.
So if someone calls my script from a browser, it should just return immediately, and the long-running process should keep executing until it's done.
Thanks
Daniel
Yes, it's possible. Call your PHP script via AJAX, and create multiple instances of the AJAX function dynamically. See the attached screenshots. When I compared the results of running a single function versus 24 instances, my data was processed about 15x faster. I am trying to populate a MySQL table with about 30 million records, and each record involves calculating the distance in miles from the city center based on lat/lng, so it's no walk in the park. You can see what I'm averaging in the screenshots:
Screenshots: http://gaysugardaddyfinder.com/screen2.PNG and http://gaysugardaddyfinder.com/screen.png
This may be a glorious hack or what not - but it sure worked great for me.
My server is a 72-core Xeon setup with 64 GB of RAM.
Here's what I'm trying to accomplish in high-level pseudocode:
query db for a list of names (~100)
for each name (using php) {
    query a 3rd party site for xml based on the name
    parse/trim the data received
    update my db with this data
    wait 15 seconds (the 3rd party site has restrictions and I can only make 4 queries / minute)
}
So this was running fine. The whole script took ~25 minutes (99% of the time was spent waiting 15 seconds after every iteration). My web host then made a change so that scripts will time out after 70 seconds (understandable). This completely breaks my script.
I assume I need to use cron jobs or the command line to accomplish this. I only understand the basic use of cron jobs. Any high-level advice on how to split this work up across cron jobs? I am not sure how a cron job could work through a dynamic list.
Cron itself has no idea of your list or of what has already been done, but you can use two kinds of cron jobs.
The first cron job - which runs, for example, once a day - could add your 100 items to a job queue.
The second cron job - which runs, for example, once every minute during a certain period - can check whether there are items in the queue, execute one (or a few), and remove them from the queue.
Note that both cron jobs are just triggers to start a PHP script; you have two different scripts, one to fill the queue and one to process part of the queue, so almost everything is still done in PHP.
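A sketch of those two scripts, assuming a simple job_queue table with name and done columns (all table/column names and the API URL are placeholders):

// fill_queue.php -- cron, once a day: push the ~100 names into the queue
$pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$names = $pdo->query('SELECT name FROM names')->fetchAll(PDO::FETCH_COLUMN);
$ins   = $pdo->prepare('INSERT INTO job_queue (name, done) VALUES (?, 0)');
foreach ($names as $name) {
    $ins->execute([$name]);
}

// process_queue.php -- cron, every minute: handle up to 3 items per run, which
// keeps each run around 30-45 seconds (under the 70-second limit) and
// under the 4-requests-per-minute restriction
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$jobs = $pdo->query('SELECT id, name FROM job_queue WHERE done = 0 LIMIT 3')
            ->fetchAll(PDO::FETCH_ASSOC);
$mark = $pdo->prepare('UPDATE job_queue SET done = 1 WHERE id = ?');

foreach ($jobs as $i => $job) {
    $xml = file_get_contents('https://thirdparty.example/api?name=' . urlencode($job['name']));
    // ... parse/trim $xml and update your own tables here ...
    $mark->execute([$job['id']]);
    if ($i < count($jobs) - 1) {
        sleep(15);        // stay within 4 queries per minute
    }
}

With ~100 names and 3 per run, the queue drains in a bit over half an hour, roughly the same as the original 25-minute script.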
In short, there is not much that is different. Instead of executing the script via mod_php or FastCGI, you are going to execute it via the command line: php /path/to/script.php.
Because this is a different environment than HTTP, some things obviously don't work: sessions, cookies, and GET and POST variables. Output gets sent to stdout instead of the browser.
You can pass arguments to your script by using $argv.
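For example:

<?php
// Called as: php /path/to/script.php 50
$batchSize = isset($argv[1]) ? (int) $argv[1] : 10;
echo "Processing $batchSize names this run\n";   // printed to stdout, not to a browser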