I am attempting to scrape a large amount of data from a website (probably about 50 million records). The website uses $_GET, so it is simply a matter of generating a list of links, each of which collects a bit of the data.
I have one script that generates a list of links on the screen. The links all call the same PHP script, each passing a different search value. I then use the Chrome "LinkClump" extension to open all the links in separate tabs simultaneously (right-click and drag across the links).
I open 26 tabs at once, but the called PHP scripts do not all start. A write to a log shows that only 6 ever run at once; the next one will not start until one of the others has finished. Is there any way to get more than 6 running at once?
Here is the relevant snippet of code in the 26 worker scripts that does the search. I simply pass a different $value to each one:
// simple_html_dom: fetch the search results page and walk the output table
$html = file_get_html("http://website.com/cgi-bin/Search?search=$value");
foreach ($html->find('table[cellpadding="3"]') as $e) {
    foreach ($e->find('tr') as $f) {
        $colval = 0;
        foreach ($f->find('td[class="output"]') as $g) {
            // ... read each output cell and write the record to the log/database ...
        }
    }
}
To check whether it was Apache or simple_html_dom that was throttling the connections, I wrote another tiny script that simply did a sleep(10), with a write to the log before and after. Once again only 6 would execute at once, so I concluded it must be Apache.
Is there some ini setting that I can change in my script to force more to run at once please?
I noticed this comment in another posting at Simultaneous Requests to PHP Script:
"If the requests come from the same client AND the same browser most browsers will queue the requests in this case, even when there is nothing server-side producing this behaviour."
I am using Chrome.
Browsers typically limit the number of concurrent connections to a single domain. Each successive tab opened after this limit has been reached will have to wait until an earlier one has completed.
A common trick to bypass this behaviour is to spread the resources over several subdomains. So, currently you're sending all your requests to website.com. Change your code to send six requests each to, say, sub1.website.com, sub2.website.com, etc. You'll need to set these up on your DNS and web server, obviously. Provided your PHP script exists on each subdomain you should be able to run more connections concurrently.
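Something along these lines could generate the spread-out links (a rough sketch; the worker script name scrape.php and the sub1–sub4 hosts are placeholders, assuming each subdomain points at the same document root):

<?php
// Hypothetical sketch: spread the generated links across several subdomains so the
// browser's per-host connection limit applies to each subdomain separately.
$values = range('A', 'Z');          // the 26 search values, for example
$subdomains = 4;                    // how many subdomains you have set up

foreach ($values as $i => $value) {
    $host = 'sub' . (($i % $subdomains) + 1) . '.website.com';
    $url  = 'http://' . $host . '/scrape.php?search=' . urlencode($value);
    echo '<a href="' . $url . '">' . htmlspecialchars($value) . '</a><br>' . "\n";
}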
I found the answer here: Max parallel http connections in a browser?
It is a browser issue. It indicates that Firefox allows the limit to be increased so I will try that.
For the benefit of others, here is what you have to do to allow Firefox to open more than 6 simultaneous connections to the one host. It is slightly different from the above post.
1. Enter about:config
2. Accept the warranty warning
3. Find network.http.max-persistent-connections-per-server and change it from 6 to whatever value you need.
You can now run more scripts on that host from separate tabs.
If this is useful information please up-vote question. I need to get rid of negative reputation.
Related
My father wants to build a newsletter dispatch system that provides customized fields as part of a tailor-made application. Users can use special variables in the text to insert the recipient's name (among other things).
The last HTML form, which asks for all the data in the email, inserts that data as well as the set of recipients into the database. Then the user is redirected to the worker script.
In the worker script (let's call it worker.php) he roughly has the following:
# Get current job from the database.
# Pop off the first recipient in the list.
# Retrieve additional data about that recipient from the database.
# Generate and send email.
# Store truncated list of recipients in database.
if ($work_left) {
    header('Location: worker.php');
} else {
    header('Location: done.php');
}
The worker only does a single work-item in order to dodge the PHP time limit. The system is to be deployed on a shared host which might have the most arcane php.ini settings.
It works: the work-items are handled and the number of recipients in the database shrinks. The unforeseen problem is that the browser eventually runs into a timeout and cancels the connection. The PHP script is then cancelled and no more work is done. The process is easily restarted by pointing the browser back to worker.php, but this is something that the end user should not have to do.
A quick search on this site gave me the ignore_user_abort function, which looks promising for dodging the browser timeout. I fear that it does not solve the problem in this situation: the browser will close the connection at some point, the currently running PHP script will finish and then tell the browser to reload worker.php, but the browser is not listening any more, so progress still stops. This is an improvement, as it does not stop mid-transaction, but it is not a solution.
Another idea we had was replacing the redirect with one to worker2.php, a PHP file that just contains a redirect back to worker.php. This might count as enough progress for the browser that it keeps loading and does not bump into the timeout (hopefully the timeout is per URL, at least?).
If that does not work either, an HTML redirect with <meta> might be another option. worker.php would then actually finish loading and the browser would be able to complete the request. The <meta> would then redirect to worker.php again to do the next work-item. This last option has the disadvantage that it still depends on the browser staying open.
In the very best case he is looking for a solution which would run through, once it is started. The browser could timeout, the user could close the window and the script would still run through and send all the emails. Is it possible to generate a worker such that it is immune to PHP execution time limit and browser timeouts?
Maybe instead of relying on a fancy background script on shared hosting, just use crontab (even if your hosting provider doesn't give you one, there are services that let you simulate cron), firing a script, say, once per minute. The script checks whether there is work in the queue table, picks some of it (as much as your server will allow you to process) and marks the processed tasks as completed (or removes them from the table). You can then provide an endpoint the sender can hit to see his progress (count all tasks for the user and those marked as completed).
I don't think you would be able to use RabbitMQ + supervisord or something similar on shared hosting, but I also don't think that using a background PHP script is a good idea :)
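A minimal sketch of such a cron-fired worker, assuming a jobs table holding the subject/body and a recipients table with a done flag (all table, column and placeholder names here are illustrative, not taken from the question):

<?php
// Run from cron, e.g. once per minute: * * * * * php /path/to/send_batch.php
$db = new PDO('mysql:host=localhost;dbname=newsletter', 'user', 'pass');

// Load the current job (subject and body template stored by the HTML form).
$job = $db->query('SELECT subject, body FROM jobs ORDER BY id DESC LIMIT 1')
          ->fetch(PDO::FETCH_ASSOC);

// Pick a batch small enough for the shared host to handle in one run.
$stmt = $db->query('SELECT id, email, name FROM recipients WHERE done = 0 LIMIT 50');

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $recipient) {
    // Fill in the custom fields, e.g. a {name} placeholder in the body text.
    $body = str_replace('{name}', $recipient['name'], $job['body']);
    mail($recipient['email'], $job['subject'], $body);

    // Mark the recipient as completed so progress can be reported.
    $db->prepare('UPDATE recipients SET done = 1 WHERE id = ?')
       ->execute(array($recipient['id']));
}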
I have a webpage that when users go to it, multiple (10-20) Ajax requests are instantly made to a single PHP script, which depending on the parameters in the request, returns a different report with highly aggregated data.
The problem is that a lot of the reports require heavy SQL calls to get the necessary data, and in some cases, a report can take several seconds to load.
As a result, because one client is sending multiple requests to the same PHP script, you end up seeing the reports slowly load on the page one at a time. In other words, the generating of the reports is not done in parallel, and thus causes the page to take a while to fully load.
Is there any way to get around this in PHP and make it possible for all the requests from a single client to a single PHP script to be processed in parallel so that the page and all its reports can be loaded faster?
Thank you.
As far as I know, it is possible to do multi-threading in PHP.
Have a look at pthreads extension.
What you could do is have the report-generation part/function of the script executed in parallel. This will make sure that each report runs in a thread of its own, and you will get your results much sooner. Also, set the maximum number of concurrent threads to 10 or fewer so that it doesn't become a resource hog.
Here is a basic tutorial to get you started with pthreads.
And a few more examples which could be of help (Notably the SQLWorker example in your case)
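A rough sketch of what that could look like with a pthreads Pool (assuming a thread-safe (ZTS) PHP build with the pthreads extension loaded; the ReportTask class, connection details and query are purely illustrative):

<?php
// Each report becomes a task object; the Pool runs them on up to 10 worker threads.
class ReportTask extends Threaded
{
    private $reportId;

    public function __construct($reportId)
    {
        $this->reportId = $reportId;
    }

    public function run()
    {
        // Every thread needs its own DB connection – connections cannot be shared.
        $db = new mysqli('localhost', 'user', 'pass', 'reports');
        $result = $db->query('SELECT 1 /* heavy aggregation for report ' . $this->reportId . ' */');
        // ... build the report from $result and store it where the page can fetch it ...
        $db->close();
    }
}

$pool = new Pool(10);                       // cap concurrency at 10 threads
foreach (range(1, 20) as $reportId) {
    $pool->submit(new ReportTask($reportId));
}
$pool->shutdown();                          // wait for all submitted tasks to finish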
Server setup
This is more of a server configuration issue and depends on how PHP is installed on your system: If you use php-fpm you have to increase the pm.max_children option. If you use PHP via (F)CGI you have to configure the webserver itself to use more children.
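For php-fpm, that means raising pm.max_children in the pool configuration (often /etc/php-fpm.d/www.conf); the values below are only illustrative starting points:

pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15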
Database
You also have to make sure that your database server allows that many concurrent processes to run. It won’t do any good if you have enough PHP processes running but half of them have to wait for the database to notice them.
In MySQL, for example, the setting for that is max_connections.
Browser limitations
Another problem you’re facing is that browsers won’t do 10-20 parallel requests to the same hosts. It depends on the browser, but to my knowledge modern browsers will only open 2-6 connections to the same host (domain) simultaneously. So any more requests will just get queued, regardless of server configuration.
Alternatives
If you use MySQL, you could try to merge all your calls into one request and use parallel SQL queries using mysqli::poll().
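A minimal sketch of that approach, assuming mysqlnd is available (MYSQLI_ASYNC requires it); the queries, table names and connection details are placeholders:

<?php
// Fire all queries asynchronously, one connection per query, then poll for results.
$queries = array(
    'orders'    => 'SELECT COUNT(*) FROM orders',
    'revenue'   => 'SELECT SUM(total) FROM orders',
    'customers' => 'SELECT COUNT(*) FROM customers',
);

$links = array();
foreach ($queries as $name => $sql) {
    $link = new mysqli('localhost', 'user', 'pass', 'reports');
    $link->query($sql, MYSQLI_ASYNC);          // returns immediately
    $links[spl_object_hash($link)] = array('link' => $link, 'name' => $name);
}

$pending = count($links);
while ($pending > 0) {
    $read = $error = $reject = array();
    foreach ($links as $info) {
        $read[] = $error[] = $reject[] = $info['link'];
    }
    if (!mysqli::poll($read, $error, $reject, 1)) {
        continue;                               // nothing ready yet, poll again
    }
    foreach ($read as $link) {
        if ($result = $link->reap_async_query()) {
            $row = $result->fetch_row();
            echo $links[spl_object_hash($link)]['name'] . ': ' . $row[0] . "\n";
            $result->free();
        }
        $pending--;
    }
}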
If that’s not possible you could try calling child processes or forking within your PHP script.
Of course PHP can execute multiple requests in parallel, if it runs under a web server like Apache or Nginx. (PHP's built-in dev server is single-threaded, but that should only be used for development anyway.) If you are using PHP's file-based sessions, however, access to the session is serialized, i.e. only one script can have the session file open at any time. Solution: fetch the information you need from the session at script start, then close the session.
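A minimal sketch of that pattern, at the top of the report script (the session key is illustrative):

session_start();
// Copy whatever the script needs out of the session...
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;
// ...then release the session lock so parallel requests from the same client aren't serialized.
session_write_close();

// Long-running report generation continues here without blocking the other Ajax calls.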
I'm making an app for my website that lets visitors look at the camera.
I own a PT (pan-tilt) camera which can be operated via URLs.
I want the camera to move randomly to a different position at predefined times (like every 5 seconds), in the background, so it will move without any operator, but I can't figure out how to make it move automatically.
The manufacturer works with CGI commands like:
myip:myport/decoder_control.cgi?command=39&user=user&pwd=password
(this code makes it go to preset 1).
How can I make the camera move with this command using server-side PHP, moving it every 5 seconds?
Running the CGI script from PHP.
You can perform an HTTP request from PHP that loads the URL corresponding to the command, causing the camera to change position. Some ways of achieving this:
Using the function http_get: PHP: http_get – Manual.
Using cURL.
Using file_get_contents for very basic requests: question on SO.
If you just need to perform a GET request, and the response is empty (e.g. you just need to check for the 200 OK code) or contains some very simple data (e.g. a string), then file_get_contents is more than enough.
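For the preset command from the question, that could look roughly like this (myip, myport and the credentials are the placeholders from the question):

<?php
$url = 'http://myip:myport/decoder_control.cgi?command=39&user=user&pwd=password';

$response = @file_get_contents($url);   // typically just a 200 OK with little or no body
if ($response === false) {
    // request failed: camera unreachable, wrong credentials, etc.
}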
If you don't have any background on how HTTP requests work, Wikipedia could be a good introduction; especially if later on you have more complex CGI commands to send to your PT Cam.
Make the camera move every 5 sec.
This is a completely different matter. The problem here is running PHP code periodically and automatically.
You could schedule the PHP script to be executed using a cron job (Cron, crontab), and this question explains how. BUT cron's minimal time resolution is one minute; besides, moving a camera every 5 seconds doesn't really sound like scheduling a job, it sounds more like something that should be handled by a system service.
What you could do is move the camera from the PHP script users load to watch: store the last update time in a file/database, and if the elapsed time is >5s, run the CGI command.
This would keep your camera still unless someone is actually watching. Other problems might arise: for example, what if many users are visiting the same page and your server serves the requests simultaneously? You might get several consecutive commands sent to the camera. Moreover, while the users are watching, staying on your PHP page, you must again find a way of moving the camera every 5 seconds.
A possible solution.
Create a PHP script that, when loaded, runs the CGI command only if at least 5s have passed since the last call (by storing the time of the last call); see the sketch below.
Create a client page for your users that, via JavaScript, loads that PHP script every 5s. Look for "JavaScript GET request"; you will find enough information to fill a book.
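A minimal sketch of the first script, assuming the time of the last call is kept in a plain file next to the script; the list of preset command numbers is something you would fill in from the camera's documentation:

<?php
// move_camera.php – only fires the CGI command if at least 5 seconds have passed.
$stateFile = __DIR__ . '/last_move.txt';
$now  = time();
$last = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;

if ($now - $last >= 5) {
    file_put_contents($stateFile, $now, LOCK_EX);   // record this call first

    // command=39 is "go to preset 1" from the question; add the other preset
    // command numbers from your camera's documentation to pick one at random.
    $presets = array(39);
    $command = $presets[array_rand($presets)];

    @file_get_contents("http://myip:myport/decoder_control.cgi?command=$command&user=user&pwd=password");
}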
Again, this would generate a lot of traffic on your web server, just for those five seconds of panning. My suggestion is that the movement should be handled by some server-side program, not a script.
I'm currently running a Linux based VPS, with 768MB of Ram.
I have an application which collects details of domains and then connects to a service via cURL to retrieve details of the pagerank of these domains.
When I run a check on about 50 domains, it takes the remote page about 3 minutes to load with all the results before the script can parse the details and return them to my script. This causes a problem, as nothing else seems to function until the script has finished executing, so users on the site just get a spinner / 'ball of death' while waiting for pages to load.
(The remote page retrieves the domain details and updates itself via AJAX, but the cURL request doesn't (rightfully) return the page until loading is complete.)
Can anyone tell me if I'm doing anything obviously wrong, or if there is a better way of doing it. (There can be anything between 10 and 10,000 domains queued, so I need a process that can run in the background without affecting the rest of the site)
Thanks
A more sensible approach would be to "batch process" the domain data via the use of a cron triggered PHP cli script.
As such, once you'd inserted the relevant domains into a database table with a "processed" flag set as false, the background script would then:
Scan the database for domains that aren't marked as processed.
Carry out the CURL lookup, etc.
Update the database record accordingly and mark it as processed.
...
To ensure no overlap with an already executing batch-processing script, you could invoke the PHP script from cron only every five minutes and (within the PHP script itself) check how long the script has been running at the start of the "scan" stage, exiting if it's been running for four minutes or longer. (You might want to adjust these figures, but hopefully you can see where I'm going with this.)
By using this approach, you'll be able to leave the background script running indefinitely (as it's invoked via cron, it'll automatically start after reboots, etc.) and simply add domains to the database/review the results of processing, etc. via a separate web front end.
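A minimal sketch of such a batch script (table/column names and the lookup URL are placeholders, not the actual service):

<?php
// batch.php – invoked from cron, e.g. */5 * * * * php /path/to/batch.php
$started = time();
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Stop after ~4 minutes so two cron invocations never overlap.
while (time() - $started < 240) {
    $row = $db->query('SELECT id, domain FROM domains WHERE processed = 0 LIMIT 1')
              ->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        break;                                  // queue is empty
    }

    // Carry out the cURL lookup against the (placeholder) pagerank service.
    $ch = curl_init('http://rank-service.example.com/check?domain=' . urlencode($row['domain']));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    curl_close($ch);

    // Update the record and mark it as processed.
    $db->prepare('UPDATE domains SET processed = 1, rank_data = ? WHERE id = ?')
       ->execute(array($result, $row['id']));
}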
This isn't the ideal solution, but if you need to trigger this process based on a user request, you can add the following at the end of your script.
set_time_limit(0);   // remove PHP's execution time limit for this request
flush();             // push any buffered output back to the browser
This will allow the PHP script to continue running while output is returned to the user. But seriously, you should use batch processing. It will give you much more control over what's going on.
Firstly I'm sorry, but I'm an idiot! :)
I've loaded the site in another browser (FF) and it loads fine.
It seems Chrome puts some sort of lock on a domain when it's waiting for a server response, and I was testing the script manually through a browser.
Thanks for all your help and sorry for wasting your time.
CJ
While I agree with others that you should consider processing these tasks outside of your webserver, in a more controlled manner, I'll offer an explanation for the "server standstill".
If you're using native php sessions, php uses an exclusive locking scheme so only a single php process can deal with a given session id at a time. Having a long running php script which uses sessions can certainly cause this.
You can search for combinations of terms like:
php session concurrency lock session_write_close()
I'm sure it's been discussed many times here. I'm too lazy to search for you. Maybe someone else will come along and make an answer with bulleted lists and pretty hyperlinks in exchange for stackoverflow reputation :) But not me :)
good luck.
I'm not sure how your code is structured but you could try using sleep(). That's what I use when batch processing.
I know PHP scripts can run concurrently, but I notice that when I run a script and then run another script after it, the second one waits for the first to finish before it does what it has to do.
Is there an Apache config I have to change or do I use a different browser or something?!!
Thanks all
EDIT
If I go to a different browser, I can access the page/PHP script I need! Is the number of requests being limited by the browser? By Apache?
Two things come to mind:
HTTP 1.1 has a recommended maximum of 2 simultaneous connections to any given domain. This would explain the problem where you can only open one page on a site at a time unless you switch browsers.
Edit: This is usually enforced by the client, not the server.
If your site uses sessions, you'll find out that only one page will load at a time...
...unless you call session_write_close(), at which point the second page can now open the now-unlocked session file.
If you run php under fastcgi you can probably avoid this restriction.
I haven't had this problem with Apache/PHP. The project I work on runs 3 iframes, each of which is a PHP script, plus the main page, and Apache handles all 4 at the same time.
You may want to check your Apache configuration and see how the threading is set up.
This article may help:
http://www.devside.net/articles/apache-performance-tuning