I'll start with what my program does. The index function of my controller takes an array of URLs and keywords and stores them in the DB. The crawlLink method then takes all the keywords and URLs: every URL is searched for all the keywords, and the sublinks of every URL are extracted, stored in the DB, and searched for the keywords as well. Keywords are searched in each link using the search method, and the sublinks are extracted from the URLs using the extract_links function. Both search and extract_links call a function named get_web_page, which fetches the complete content of a page using cURL. get_web_page is used in search to get the page content so the keywords can be extracted from it, and in extract_links to extract links only from valid page content.
Now crawlLink calls the search function twice: once to extract keywords from the domain links and a second time to extract keywords from the sublinks, so get_web_page ends up being called three times. It takes roughly 5 minutes to fetch the contents of around 150 links, and since it is called three times the total processing time is about 15 minutes, during which nothing else can be done. I therefore want to run this process in the background and show its status while it is processing. extract_links and get_web_page are included in the controller using include_once.
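Roughly, the flow looks like this (just a simplified sketch; the DB reads and writes are omitted):

// Simplified sketch of the flow described above (DB storage omitted).
function crawlLink($urls, $keywords)
{
    foreach ($urls as $url) {
        search($url, $keywords);              // keywords searched in the domain link
        $sublinks = extract_links($url);      // sublinks extracted (and stored in DB)
        foreach ($sublinks as $sublink) {
            search($sublink, $keywords);      // keywords searched in each sublink
        }
    }
}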
The get_web_page function is as follows:
function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle compressed
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}
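For reference, a typical call looks like this (example.com is just a placeholder):

// Example usage: the returned array contains the curl_getinfo() fields
// plus 'errno', 'errmsg' and 'content'.
$page = get_web_page('http://www.example.com/');
if ($page['errno'] === 0 && $page['http_code'] == 200) {
    $html = $page['content']; // raw HTML that gets searched for keywords / sublinks
}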
An input of URLs and keywords submitted once by the user can be considered a task. A task can be started, and it will then run in the background; at the same time another task can be defined and started. Each task will have a status like "To Do", "In Progress", "Pending", "Done", etc. The Simple Task Board by Oscar Dias is exactly how I want the tasks to be displayed.
I have read about so many ways to run a function in the background that I am now in a dilemma about which approach to adopt. I read about exec, pcntl_fork, Gearman and others, but they all need the CLI, which I don't want to use. I tried installing Gearman with Cygwin but got stuck in the Gearman installation because it cannot find libevent; I installed libevent separately but it still doesn't work. Since Gearman needs the CLI anyway, I dropped it. I don't want to use cron either. I just want to know which approach will be best in my scenario.
I am using PHP 5.3.8 | CodeIgniter 2.1.3 | Apache 2.2.21 | MySQL 5.5.16 | Windows 7 64-bit
Your problem is Windows.
Windows is simply not very good at running background tasks and cron jobs - there are tools you can find, but they are limited.
However, are you sure you even need this on Windows? Most production servers run Linux, so why not just test on Windows and move over to Linux when you deploy?
--
The second part is the command line - you need it if you want to start a new process (which you do), but it isn't really very scary. CodeIgniter makes it quite simple:
http://ellislab.com/codeigniter/user-guide/general/cli.html
You can run the process with nohup or a cron job. Please go through the links below:
nohup: run PHP process in background
Running a php5 background process under Linux
https://nsaunders.wordpress.com/2007/01/12/running-a-background-process-in-php/
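For example, a controller could hand the heavy work off to a background process like this (a rough Linux-only sketch; the "crawler" controller and "run_task" method names are placeholders):

// Sketch (Linux): launch the CodeIgniter CLI route in the background and
// return to the user immediately. Controller/method names are placeholders.
$taskId = 42; // id of the task row saved by index()
exec(sprintf('nohup php index.php crawler run_task %d > /dev/null 2>&1 &', $taskId));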
The approach I was trying didn't seem possible to implement on Windows. Many of the methods listed in the answers have either been removed or modified. I then moved on to a workaround using AJAX.
I execute the controller method as an AJAX request and assign it a count, which increments with each new AJAX request. A request can be aborted, though the processing continues on the server; in my project the results matter even if they are incomplete. And if the browser stays open, the request may complete and the user can see the full result later.
When the processing of a task is stopped, a CANCELLED icon is shown along with a link to a result page that displays the results generated before the task was cancelled. On AJAX failure or success I send the task count that the client originally sent back from the server to the client, so results are displayed for a unique task and don't get mixed up.
However, there is no tracking of how far a task has progressed, and the execution time cannot be determined. So this approach works for me, but it has drawbacks. The main aim was that the user should not have to wait while a task is in progress, and that is more or less achieved by the above workaround.
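In outline, the controller method behind each AJAX request looks something like this (a simplified sketch; the names are placeholders and the actual crawling/DB code is omitted):

// Simplified sketch of the AJAX endpoint (CodeIgniter controller method).
public function process_task()
{
    // count sent by the client with this AJAX request
    $count = (int) $this->input->post('task_count');

    // ... long-running crawl for this task runs here, results stored in DB ...

    // send the same count back so the client can match the response to its task
    echo json_encode(array('task_count' => $count, 'status' => 'Done'));
}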
Related
We recently started our first TYPO3 10 project and are currently struggling with a custom import script that pushes data to Algolia. Basically, everything works fine, but there is an issue with FAL images, specifically when they need to be processed.
From the logs I found something called DeferredBackendImageProcessor, but the docs don't mention it, or I'm not looking for the right thing - I'm not sure.
Apparently, images within the backend environment are no longer processed right away. There is something called a "processingUrl" which has to be called once for the image to be processed.
I tried calling that URL with cURL, but it has no effect. However, if I open that link in a browser where I am logged into the TYPO3 backend, then the image is processed.
I'm kind of lost here, as I need the images to be processed within the import script, which runs via the scheduler from the backend (manually, not via cron).
This is the function where the problem occurs; sadly, the cURL part has no effect here.
protected function processImage($image, $imageProcessingConfiguration)
{
    if ($image) {
        $scalingOptions = array(
            'width' => 170
        );
        $result = $this->contentObject->getImgResource('fileadmin/' . $image, $scalingOptions);
        if (isset($result[3]) && $result[3]) {
            // $result[3] is the "processingUrl" - calling it via cURL has no effect
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $result[3]);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            $output = curl_exec($ch);
            curl_close($ch);
            return '/fileadmin' . $result['processedFile']->getIdentifier();
        }
    }
    return '';
}
$result[3] is the processing URL. An example of the URL:
domain.com/typo3/index.php?route=%2Fimage%2Fprocess&token=6cbf8275c13623a0d90f15165b9ea1672fe5ad74&id=141
So my question is, how can I process the image from that import script?
I am not sure if there is a more elegant solution, but you could disable the deferred processing during your jobs:
$processorConfiguration = $GLOBALS['TYPO3_CONF_VARS']['SYS']['fal']['processors'];
unset($GLOBALS['TYPO3_CONF_VARS']['SYS']['fal']['processors']['DeferredBackendImageProcessor']);

// ... do the processing here - the LocalImageProcessor will be used ...

$GLOBALS['TYPO3_CONF_VARS']['SYS']['fal']['processors'] = $processorConfiguration;
References:
https://github.com/TYPO3/TYPO3.CMS/blob/10.4/typo3/sysext/core/Classes/Resource/Processing/ProcessorRegistry.php
https://github.com/TYPO3/TYPO3.CMS/blob/10.4/typo3/sysext/core/Configuration/DefaultConfiguration.php#L284
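Applied to the processImage() method from the question, that could look like this (untested sketch):

// Untested sketch: temporarily drop the deferred processor so that
// getImgResource() falls back to the LocalImageProcessor, then restore it.
$processorConfiguration = $GLOBALS['TYPO3_CONF_VARS']['SYS']['fal']['processors'];
unset($GLOBALS['TYPO3_CONF_VARS']['SYS']['fal']['processors']['DeferredBackendImageProcessor']);

$result = $this->contentObject->getImgResource('fileadmin/' . $image, $scalingOptions);

$GLOBALS['TYPO3_CONF_VARS']['SYS']['fal']['processors'] = $processorConfiguration;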
I am running an IIS 8 / PHP web server and am attempting to write a so-called 'proxy script' as a means of fetching HTTP content and loading it onto an HTTPS page.
Although the script does run successfully (outputting whatever the HTTP page sends) in some cases - for example, Google.com, Amazon.com, etc. - it does not work in fetching my own website and a few others.
Here is the code of proxy.php:
<?php
$url = $_GET['url'];
echo "FETCHING URL<br/>"; // displays this no matter what URL I enter

$ctx_array = array('http' =>
    array(
        'method'  => 'GET',
        'timeout' => 10,
    )
);
$ctx = stream_context_create($ctx_array);

$output = file_get_contents($url, false, $ctx); // times out for certain requests
echo $output;
When I set $_GET['url'] to http://www.ucomc.net, the script fails. With most other URLs, it works fine.
I have checked other answers on here and other places but none of them describe my issue, nor do the solutions offered solve it.
I've seen suggestions for similar problems that involve changing the user agent, but when I do this it not only fails to solve the existing problem but prevents other sites from loading as well. I do not want to rely on third-party proxies (I don't trust the free ones, don't want to deal with their query limits, and don't want to pay for the expensive ones).
Turns out it was just a problem with the firewall. Testing the script in a PHP sandbox worked fine, so I just had to modify the outgoing connection settings in the server firewall to let the request through.
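For anyone debugging something similar: instead of waiting silently for the timeout, the underlying reason can be surfaced with a small, optional change to the script above (a sketch):

// Optional debugging aid: report why file_get_contents() failed.
$output = @file_get_contents($url, false, $ctx);
if ($output === false) {
    $err = error_get_last();
    echo 'Request failed: ' . ($err !== null ? $err['message'] : 'unknown error');
} else {
    echo $output;
}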
I want to make a platform to get the disk space usage of several servers.
How can I do it?
$df = disk_free_space("/");
I want this line of code to be executed on all my servers.
It's not easy, and maybe impossible, with only PHP. The best solution in my opinion is to send a request with cURL to each server, which then returns its disk space usage.
One way is to use the ssh2_* functions of PHP to log in to each server and run df /.
Another way is to create a web page on each server that runs disk_free_space('/') and echoes the result. Then you can use file_get_contents("http://servername/disk_free.php") to query each server.
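A rough sketch of the first (SSH) approach; it requires the ssh2 PECL extension, and the host, user and password below are placeholders:

// Sketch: get the free disk space of a remote server over SSH.
$connection = ssh2_connect('server1.com', 22);
ssh2_auth_password($connection, 'username', 'password');

$stream = ssh2_exec($connection, 'df -k /');
stream_set_blocking($stream, true);
echo stream_get_contents($stream); // second line holds used/available 1K-blocks for /
fclose($stream);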
On each server, you can add a page that returns the free space:
http://server1.com/GetFreeSpace.php
http://server2.com/GetFreeSpace.php
http://server3.com/GetFreeSpace.php
GetFreeSpace.php
<?php
echo disk_free_space("/");
You can use cURL to query several servers
<?php
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL            => 'http://server1.com/GetFreeSpace.php',
    CURLOPT_RETURNTRANSFER => true
));
echo curl_exec($ch);
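Looping the same request over all servers could look like this (a sketch; the hostnames are placeholders):

<?php
// Sketch: collect the reported free space from every server.
$servers   = array('server1.com', 'server2.com', 'server3.com');
$freeSpace = array();

foreach ($servers as $server) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => 'http://' . $server . '/GetFreeSpace.php',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10
    ));
    $freeSpace[$server] = curl_exec($ch); // false on failure
    curl_close($ch);
}

print_r($freeSpace);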
There are many ways to do this. cURL is an option; you can also use cron on each server to run at a specific time and push the value to a central web server, or you can use SSH to get the same info.
I'm trying to POST some data (a JSON string) from a PHP script to a Java server (all written by myself) and get the response back.
I tried the following code:
$url = "http://localhost:8000/hashmap";
$opts = array('http' =>
    array(
        'method'  => 'POST',
        'content' => $JSONDATA,
        'header'  => "Content-Type: application/x-www-form-urlencoded"
    )
);
$st = stream_context_create($opts);
echo file_get_contents($url, false, $st);
Now, this code actually works (I get the right answer back), but file_get_contents hangs for 20 seconds every time it is executed (I printed the time before and after the call). The operations performed by the server complete in a very short time, and I'm sure it's not normal to wait this long for the response.
Am I missing something?
Possibly a badly misconfigured server that doesn't send the correct Content-Length while using HTTP/1.1, so the client keeps the connection open waiting for more data until the read times out.
Either fix the server or request the data as HTTP/1.0.
Try adding Connection: close and Content-Length: strlen($JSONDATA) headers to the $opts.
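That would make the context options look roughly like this (a sketch):

// Sketch: force HTTP/1.0 (or keep 1.1 but close the connection) and send an
// explicit Content-Length so the client doesn't sit waiting for the timeout.
$opts = array('http' => array(
    'method'           => 'POST',
    'content'          => $JSONDATA,
    'protocol_version' => 1.0,
    'header'           => "Content-Type: application/x-www-form-urlencoded\r\n" .
                          "Connection: close\r\n" .
                          "Content-Length: " . strlen($JSONDATA)
));
$st = stream_context_create($opts);
echo file_get_contents($url, false, $st);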
Also, if you want to avoid relying on extensions, have a look at this class I wrote some time ago to perform HTTP requests using PHP core only. It works on PHP 4 (which is why I wrote it) and PHP 5, and the only extension it ever requires is OpenSSL, and only if you want to make an HTTPS request. It's documented(ish) in comments at the top.
It supports all sorts of stuff - GET, POST, PUT and more, including file uploads, cookies and automatic redirect handling. I have used it quite a lot on a platform I work with regularly that is stuck on PHP/4.3.10, and it works beautifully... even if I do say so myself...
Given a list of URLs, I would like to check that each URL:
Returns a 200 OK status code
Returns a response within X amount of time
The end goal is a system that is capable of flagging URLs as potentially broken so that an administrator can review them.
The script will be written in PHP and will most likely run on a daily basis via cron.
The script will be processing approximately 1000 URLs at a go.
Question has two parts:
Are there any big gotchas with an operation like this? What issues have you run into?
What is the best method for checking the status of a url in PHP considering both accuracy and performance?
Use the PHP cURL extension. Unlike fopen(), it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save you a ton of bandwidth, since you don't have to download the entire body of the page.
As a starting point you could use some function like this:
function is_available($url, $timeout = 30) {
    $ch = curl_init(); // get cURL handle

    // set cURL options
    $opts = array(
        CURLOPT_RETURNTRANSFER => true,     // do not output to browser
        CURLOPT_URL            => $url,     // set URL
        CURLOPT_NOBODY         => true,     // do a HEAD request only
        CURLOPT_TIMEOUT        => $timeout  // set timeout
    );
    curl_setopt_array($ch, $opts);

    curl_exec($ch); // do it!
    $retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK
    curl_close($ch); // close handle

    return $retval;
}
However, there is a ton of possible optimizations: you might want to re-use the cURL instance and, if checking more than one URL per host, even re-use the connection.
Oh, and this code checks strictly for HTTP response code 200. It does not follow redirects (302) - but there is also a cURL option for that.
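A sketch of what the handle re-use could look like (with CURLOPT_FOLLOWLOCATION thrown in, if you want redirect targets to count as reachable):

// Sketch: re-use one cURL handle for many URLs instead of creating a new one each time.
function check_urls(array $urls, $timeout = 30) {
    $results = array();
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY         => true,     // HEAD request only
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_FOLLOWLOCATION => true,     // treat 301/302 targets as reachable
        CURLOPT_MAXREDIRS      => 5
    ));

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_exec($ch);
        $results[$url] = (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
    }

    curl_close($ch);
    return $results;
}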
Look into cURL. There's a library for PHP.
There's also an executable version of cURL so you could even write the script in bash.
I actually wrote something in PHP that does this over a database of 5k+ URLs. I used the PEAR class HTTP_Request, which has a method called getResponseCode(). I just iterate over the URLs, passing them to getResponseCode and evaluate the response.
However, it doesn't work for FTP addresses, URLs that don't begin with http or https (unconfirmed, but I believe it's the case), or sites with invalid security certificates (where a 0 is returned). A 0 is also returned for server-not-found (there's no status code for that).
And it's probably easier than cURL: you include a few files and use a single function to get an integer code back.
fopen() supports http URIs.
If you need more flexibility (such as timeout), look into the cURL extension.
Seems like it might be a job for curl.
If you're not stuck on PHP, Perl's LWP might be an answer too.
You should also be aware of URLs returning 301 or 302 HTTP responses which redirect to another page. Generally this doesn't mean the link is invalid. For example, http://amazon.com returns 301 and redirects to http://www.amazon.com/.
Just returning a 200 response is not enough; many valid links will continue to return "200" after they change into porn / gambling portals when the former owner fails to renew.
Domain squatters typically ensure that every URL in their domains returns 200.
One potential problem you will undoubtedly run into is when the box this script is running on loses access to the Internet... you'll get 1000 false positives.
It would probably be better for your script to keep some kind of history and only report a failure after 5 consecutive days of failures.
Also, the script should check itself in some way (for example, by checking a known-good website such as Google) before continuing with the standard checks.
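For example (a sketch that reuses the is_available() helper from the answer above; google.com is just one possible known-good site):

// Sketch: sanity-check connectivity before flagging anything as broken.
if (!is_available('http://www.google.com', 10)) {
    // We probably lost Internet access - skip this run instead of
    // recording 1000 false positives.
    exit("Connectivity check failed, aborting this run.\n");
}
// ... continue with the standard URL checks ...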
You only need a bash script to do this. Please check my answer on a similar post here. It is a one-liner that reuses HTTP connections to dramatically improve speed, retries n times for temporary errors and follows redirects.