I want to fetch several pages with curl_exec. The first page comes back normally, but all the others return a 302 header. What is the reason?
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, ROOT_URL);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($curl); // here good content
curl_close($curl);
preg_match_all('/href="(\/users\/[^"]+)"[^>]+>\s*/i', $content, $p);
for ($j=0; $j<count($p[1]); $j++){
$new_curl = curl_init();
curl_setopt($new_curl, CURLOPT_URL, NEW_URL.$p[1][$j]);
curl_setopt($new_curl, CURLOPT_RETURNTRANSFER, 0);
$content = curl_exec($new_curl); // here 302
curl_close($new_curl);
preg_match('/[^#]+#[^"]+/i', $content, $p2);
}
Something like this.
You probably want to provide a sample of your code so we can see if you're omitting something.
A 302 response code typically indicates that the server is redirecting you to a different location (given in the Location response header). Depending on which options you set, cURL can either follow the redirect automatically, or you can watch for the 302 response and request the new location yourself.
Here is how you would get CURL to follow the redirects (where $ch is the handle to your curl connection):
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);// allow redirects
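If you would rather watch for the 302 and fetch the new location yourself, a minimal sketch could look like this. The helper names extract_location and fetch_following_redirects are made up for illustration, and the sketch assumes the server sends an absolute URL in the Location header.

```php
<?php
// Hypothetical helper: pull the Location header out of a raw header block.
function extract_location(string $rawHeaders): ?string
{
    if (preg_match('/^Location:[ \t]*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: fetch $url and follow up to $maxHops redirects by hand.
// Returns the body of the final response, or false on failure.
function fetch_following_redirects(string $url, int $maxHops = 5)
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, true);          // keep headers in the output
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // redirects are handled below
        $response = curl_exec($ch);
        if ($response === false) {
            curl_close($ch);
            return false;
        }
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
        curl_close($ch);

        $headers = substr($response, 0, $headerSize);
        $body = substr($response, $headerSize);
        $next = extract_location($headers);
        if ($code < 300 || $code >= 400 || $next === null) {
            return $body; // not a redirect (or nowhere to go)
        }
        $url = $next; // assumption: Location is an absolute URL
    }
    return false; // too many redirects
}
```

In most cases CURLOPT_FOLLOWLOCATION is simpler; doing it by hand is mainly useful when you want to inspect or log each hop.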
You can use curl_multi, which is faster because it fetches all the URLs in parallel.
You can use it like this:
//Initialize
$curlOptions = array(CURLOPT_RETURNTRANSFER => 1); // Add whatever else you need.
$curlHandle1 = curl_init($url1);
curl_setopt_array($curlHandle1, $curlOptions);
$curlHandle2 = curl_init($url2);
curl_setopt_array($curlHandle2, $curlOptions);
$multi = curl_multi_init();
curl_multi_add_handle($multi, $curlHandle1);
curl_multi_add_handle($multi, $curlHandle2);
//Run Handles
$running = null;
do {
$status = curl_multi_exec($multi, $running);
} while ($status == CURLM_CALL_MULTI_PERFORM);
while ($running && $status == CURLM_OK) {
if (curl_multi_select($multi) != -1) {
do {
$status = curl_multi_exec($multi, $running);
} while ($status == CURLM_CALL_MULTI_PERFORM);
}
}
//Retrieve Results
$response1 = curl_multi_getcontent($curlHandle1);
$status1 = curl_getinfo($curlHandle1);
$response2 = curl_multi_getcontent($curlHandle2);
$status2 = curl_getinfo($curlHandle2);
You can find more information at http://www.php.net/manual/en/function.curl-multi-exec.php
Check out Example #1 there.
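For more than two URLs the same idea can be wrapped in a loop. This is only a sketch: multi_fetch is a made-up name, and the option set is an assumption you should adapt. It reports the HTTP status per URL, where 0 means the transfer did not complete.

```php
<?php
// Hypothetical helper: fetch every URL in $urls in parallel.
// Returns ['code' => HTTP status, 'body' => string] keyed like $urls.
function multi_fetch(array $urls, array $extraOptions = []): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        // Caller-supplied options win over the defaults.
        curl_setopt_array($ch, $extraOptions + [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true, // avoids bare 302 responses
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is running any more.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for activity, at most 1 second
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = [
            'code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
            'body' => curl_multi_getcontent($ch),
        ];
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Usage would be something like $pages = multi_fetch($listOfUrls); followed by a foreach over $pages.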
In the code below, taken from http://php.net/manual/en/function.curl-multi-init.php, how can I run some code (e.g. sleep(5)) before the second request is made, that is, before curl makes the request to Twitter?
Regards
<?php
// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();
// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "https://www.google.com");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "https://twitter.com");
curl_setopt($ch2, CURLOPT_HEADER, 0);
//create the multiple cURL handle
$mh = curl_multi_init();
//add the two handles
curl_multi_add_handle($mh,$ch1);
curl_multi_add_handle($mh,$ch2);
$active = null;
//execute the handles
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
//close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>
I'm no PHP guy or competent programmer for that matter :D Now that disclaimer is out there, here's my solution.
There's probably a much cleaner way to do this but I have limited knowledge of PHP and how to extend classes. For that reason, I decided to use the built-in process control extensions and create a helper function to handle the curl process. I'm sure there are much better programmers out there ready to provide a much cleaner solution though.
<?php
// Helper function
function async_curl($url,$delay){
sleep($delay);
echo "FORK: Getting $url after $delay seconds\n";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
// Mute the return for demonstration purposes.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
}
$urls = array("http://google.com","http://twitter.com","http://www.facebook.com");
foreach($urls as $url){
// Generate random timeout for demonstration purposes.
$delay = rand(1,20);
// Create a forked child process for each URL
$pid = pcntl_fork();
// Exit if fork failed
if ($pid == -1) {
exit("Error, failed to create a child process for the URL: $url");
// Create a single child process to call the helper function
} else if ($pid == 0) {
echo "MAIN: Forking process for $url\nPID: " .getmypid() . "\tDelay: $delay\n";
async_curl($url,$delay);
exit();
}
}
// Wait for all forked processes to complete before exiting.
while (($pid = pcntl_waitpid(0, $status)) > 0) {
echo "MAIN: Process $pid completed\n";
}
?>
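Another option that stays inside curl_multi, offered only as a sketch: handles can be added to a multi handle while it is already running, so a request can simply be added once its delay has passed. The names split_due and fetch_with_delays and the schedule format are made up for illustration; duplicate URLs would overwrite each other in the result.

```php
<?php
// Hypothetical helper: split a schedule into entries that are due and entries
// still waiting. Each entry is ['url' => string, 'delay' => seconds after start].
function split_due(array $schedule, float $elapsed): array
{
    $due = [];
    $waiting = [];
    foreach ($schedule as $entry) {
        if ($entry['delay'] <= $elapsed) {
            $due[] = $entry;
        } else {
            $waiting[] = $entry;
        }
    }
    return [$due, $waiting];
}

// Hypothetical helper: start each request once its delay has passed.
// Returns the response bodies keyed by URL.
function fetch_with_delays(array $schedule): array
{
    $mh = curl_multi_init();
    $start = microtime(true);
    $inFlight = 0;
    $bodies = [];
    while ($schedule || $inFlight > 0) {
        [$due, $schedule] = split_due($schedule, microtime(true) - $start);
        foreach ($due as $entry) {
            $ch = curl_init($entry['url']);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_PRIVATE, $entry['url']); // tag the handle with its URL
            curl_multi_add_handle($mh, $ch);
            $inFlight++;
        }
        curl_multi_exec($mh, $running);
        // Collect every transfer that has finished so far.
        while ($done = curl_multi_info_read($mh)) {
            $ch = $done['handle'];
            $bodies[curl_getinfo($ch, CURLINFO_PRIVATE)] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $inFlight--;
        }
        if ($running) {
            curl_multi_select($mh, 0.1); // short wait so delayed entries start on time
        } else {
            usleep(100000); // nothing running yet, just wait for the next delay
        }
    }
    curl_multi_close($mh);
    return $bodies;
}
```

For the example in the question the schedule would hold Google with a delay of 0 and Twitter with a delay of 5.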
I have an API written in PHP that sends 10 requests with CURL.
The problem is that when I send an HTTP request to the API, I get the response right away, although the server hasn't finished working (getting the responses for all 10 requests).
I can't use ignore_user_abort() because I need to know exactly the time that the API finished.
How can I notify the connection "hey, wait for the script to finish working"?
Important note: if I use sleep() the connection holds.
Here's my code: gist
This is just an example to show how ob_start works.
echo "hello";
ob_start(); // output buffering starts here
echo "hello1";
// run all curl requests here
// $allRequestsCompleted is a placeholder for your own completion check
if ($allRequestsCompleted)
{
ob_end_flush();
}
Without your code to refer to, I can only show how ob_start is used. You will have to adapt this code to your requirements.
$handlers = [];
$mh = curl_multi_init();
ob_start(); // output buffering starts here
foreach($query->fetchAll() as $domain){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://'.$domain['name']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $DEFAULT_REQUEST_TIMEOUT);
curl_setopt($ch, CURLOPT_TIMEOUT, $DEFAULT_REQUEST_TIMEOUT);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 2);
curl_multi_add_handle($mh, $ch);
$handlers[] = ['ch'=>$ch, 'domain_id'=>$domain['domain_id']];
echo $domain['name'];
}
// Execute the handles
$active = null;
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
// Wait for activity on any curl-connection
if (curl_multi_select($mh) == -1) {
usleep(1);
}
// Continue to exec until curl is ready to
// give us more data
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
// Extract the content
$values = [];
foreach($handlers as $key => $handle){
// Check for errors
echo $key.'. result: ';
$curlError = curl_error($handle['ch']);
if($curlError == ""){
$res = curl_multi_getcontent($handle['ch']);
echo 'done';
}
else {
echo "Curl error on handle $key: $curlError".' <br />';
}
// Remove and close the handle
curl_multi_remove_handle($mh, $handle['ch']);
curl_close($handle['ch']);
}
// Clean up the curl_multi handle
curl_multi_close($mh);
ob_end_flush() ; // output flushed here
Source - http://php.net/manual/en/function.ob-start.php
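I use this pattern on my website:
ob_start();
// your header script
// your page script
// your footer script
ob_end_flush();
Note that ob_start() takes an optional callback rather than an identifier, and ob_end_flush() takes no arguments. No identifier is needed anyway: output buffers nest, so even though my script contains another
ob_start()
each ob_end_flush() or ob_end_clean() call closes only the most recently opened buffer.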
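To see how several ob_start() calls interact, here is a small sketch. Output buffers in PHP form a stack, so each closing call affects only the innermost open buffer. The variable names are just for illustration.

```php
<?php
ob_start();               // outer buffer
echo "header ";
ob_start();               // inner buffer, stacked on top of the outer one
echo "inner";
$inner = ob_get_clean();  // closes only the inner buffer and returns its content
echo "footer";
$outer = ob_get_clean();  // closes the outer buffer

// $inner is "inner", $outer is "header footer"
```

Nothing is sent to the client here until you echo $outer yourself.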
The XML contains around 50,000 different URLs that I am trying to gather data from, in order to insert into or update my database.
Currently I am using the code below, which sort of works but times out because of the large amount of data being processed. How can I improve its performance?
URLs.xml (up to 50,000 loc's)
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>http://url.com/122122-rob-jones?</loc>
<lastmod>2014-05-05T07:12:41+08:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.9</priority>
</url>
</urlset>
index.php
<?php
include 'config.php';
include 'custom.class.php';
require_once('SimpleLargeXMLParser.class.php');
$custom = new custom();
$xml = dirname(__FILE__)."/URLs.xml";
// create a new object
$parser = new SimpleLargeXMLParser();
// load the XML
$parser->loadXML($xml);
$parser->registerNamespace("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
$array = $parser->parseXML("//urlset:url/urlset:loc");
for ($i=0, $n=count($array); $i<$n; $i++){
$FirstURL=$array[$i];
$URL = substr($FirstURL, 0, strpos($FirstURL,'?')) . "/";
$custom->infoc($URL);
}
custom.class.php (included bits)
<?php
public function load($url, $postData='')
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
if($postData != '') {
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
}
curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
public function infoc($url) {
$get_tag = $this->load($url);
// Player ID
$playeridTAG = '/<input type="text" id="player-(.+?)" name="playerid" value="(.+?)" \/>/';
preg_match($playeridTAG, $get_tag, $playerID);
// End Player ID
// Full Name
preg_match("/(.+?)-(.+?)\//",$url, $title);
$fullName = ucwords(preg_replace ("/-/", " ", $title[2]));
// End Full Name
// Total
$totalTAG = '/<li>
<span>(.+?)<\/span><span class="none"><\/span> <label>Total<\/label>
<\/li>/';
preg_match($totalTAG, $get_tag, $total);
// End Total
$query = $db->query('SELECT * FROM playerblank WHERE playerID = '.$playerID[1].'');
if($query->num_rows > 0) {
$db->query('UPDATE playerblank SET name = "'.$fullName.'", total = "'.$total[1].'" WHERE playerID = '.$playerID[1].'') or die(mysqli_error($db));
echo "UPDATED ".$playerID[1]."";
}
else {
$db->query('INSERT INTO playerblank SET playerID = '.$playerID[1].', name = "'.$fullName.'", total = "'.$total[1].'"') or die(mysqli_error($db));
echo "Inserted ".$playerID[1]."";
}
}
?>
Gathering each URL (loc) from the XML file is no problem, it's when trying to gather data using cURL for each URL that I am struggling to do without having to wait a very long time.
Try using curl_multi. The PHP documentation has a good example:
// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();
// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "http://lxr.php.net/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
//create the multiple cURL handle
$mh = curl_multi_init();
//add the two handles
curl_multi_add_handle($mh,$ch1);
curl_multi_add_handle($mh,$ch2);
$active = null;
//execute the handles
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
if (curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
//close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
Try working with an offline copy of the XML file and delete the URLs that have already been inserted or updated. Then start the script again, and repeat until the offline file has no URLs left. After that, fetch a new copy of the XML file if needed.
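A variation on this idea, as a sketch only: instead of deleting entries from the XML copy, record each finished URL in a separate file and skip those on the next run. The function names and the one-URL-per-line file format are assumptions made for illustration.

```php
<?php
// Hypothetical helper: read the set of URLs that were already processed.
function load_done(string $doneFile): array
{
    if (!is_file($doneFile)) {
        return [];
    }
    $lines = file($doneFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return array_fill_keys($lines, true);
}

// Hypothetical helper: remember that $url has been processed.
function mark_done(string $doneFile, string $url): void
{
    file_put_contents($doneFile, $url . "\n", FILE_APPEND | LOCK_EX);
}

// Hypothetical helper: the URLs from $allUrls that still need processing.
function pending_urls(array $allUrls, string $doneFile): array
{
    $done = load_done($doneFile);
    return array_values(array_filter($allUrls, function ($url) use ($done) {
        return !isset($done[$url]);
    }));
}
```

In the loop from the question you would iterate over pending_urls($array, $doneFile) and call mark_done() after each successful $custom->infoc($URL), so a run that times out can simply be restarted.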
The problem is in the "load" function: it blocks execution until a single URL is ready, while you could easily load several URLs at the same time. Here is an explanation of the idea. The best way to improve performance is to load several (10-20) URLs in parallel and add a new one on the fly whenever one of the previous ones finishes. ParallelCurl will do the trick, something like:
require_once('parallelcurl.php');
// $max_requests = 10 or more, try to pick best value manually
$parallel_curl = new ParallelCurl($max_requests, $curl_options);
// $array - 50000 urls
$in_urls = array_splice($array, 0, $max_requests);
foreach ($in_urls as $url) {
$parallel_curl->startRequest($url, 'on_request_done');
}
function on_request_done($content, $url, $ch, $search) {
// $array and $parallel_curl live in the global scope
global $array, $parallel_curl;
// here you can parse $content and save data to DB
// and add next url for loading
$next_url = array_shift($array);
if($next_url) {
$parallel_curl->startRequest($next_url, 'on_request_done');
}
}
// This should be called when you need to wait for the requests to finish.
$parallel_curl->finishAllRequests();
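If you prefer not to pull in a library, the same rolling-window idea can be sketched with the built-in curl_multi functions. This is an illustration under assumptions, not ParallelCurl's implementation: rolling_fetch and the callback signature are made up here.

```php
<?php
// Hypothetical helper: fetch $urls with at most $window transfers in flight.
// $onDone($url, $body, $curlResult) runs once per finished URL;
// $curlResult is CURLE_OK (0) when the transfer itself succeeded.
function rolling_fetch(array $urls, int $window, callable $onDone): void
{
    $mh = curl_multi_init();
    $queue = array_values($urls);
    $inFlight = 0;

    // Take the next URL off the queue (if any) and start its transfer.
    $startNext = function () use (&$queue, &$inFlight, $mh): void {
        if (!$queue) {
            return;
        }
        $url = array_shift($queue);
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_PRIVATE, $url); // remember which URL this handle serves
        curl_multi_add_handle($mh, $ch);
        $inFlight++;
    };

    for ($i = 0; $i < $window; $i++) {
        $startNext();
    }

    while ($inFlight > 0) {
        $status = curl_multi_exec($mh, $running);
        if ($status !== CURLM_OK && $status !== CURLM_CALL_MULTI_PERFORM) {
            break;
        }
        // Hand over every finished transfer and refill the window.
        while ($done = curl_multi_info_read($mh)) {
            $ch = $done['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $body = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $inFlight--;
            $onDone($url, $body, $done['result']);
            $startNext();
        }
        if ($running) {
            curl_multi_select($mh, 0.5); // wait for activity instead of spinning
        }
    }
    curl_multi_close($mh);
}
```

The callback is where you would parse the page and write to the database; a window of 10 to 20 is a reasonable starting point to tune from.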
I'm having trouble creating multiple xml requests using php's curl_multi_exec.
The problem is that the do...while loop containing the curl_multi_exec command runs only once and then quits.
Resources Used:
http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/
http://php.net/manual/en/function.curl-multi-exec.php/
http://www.rustyrazorblade.com/2008/02/curl_multi_exec/
Take a look at my code:
//Multi handle curl initialization
$mh = curl_multi_init();
//set url
$url = 'my_url';
foreach($latLng as $id => $l) {
$ch[$id] = curl_init();
//$request previously set
//Initialize and set options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $request);
//add to multi_handle
curl_multi_add_handle($mh, $ch[$id]);
}
//Execute the handles
$running = null;
do {
$mrc = curl_multi_exec($mh, $running);
$ready=curl_multi_select($mh);
echo "Ran once\n";
} while ($mrc == CURLM_CALL_MULTI_PERFORM && $ready > 0);
while ($active && $mrc == CURLM_OK) {
if ($curl_multi_select($mh) != -1) {
do {
$mrc = curl_multi_exec($mh, $running);
echo "Ran again\n";
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
foreach ($mh as $c) {
// HTTP response code
$code = curl_getinfo($c, CURLINFO_HTTP_CODE);
// cURL error number
$curl_errno = curl_errno($c);
// cURL error message
$curl_error = curl_error($c);
// output if there was an error
if ($curl_error) {
echo("*** cURL error: ($curl_errno) $curl_error\n");
}
}
//get content and remove handles
foreach ($ch as $c) {
$result[] = curl_multi_getcontent($c);
curl_multi_remove_handle($mh, $c);
}
print_r($result);
//Close curl
curl_multi_close($mh);
}
I know the request is valid because I receive the correct return data when I perform a single curl execution. The problem lies with the curl_multi_exec().
The output I am receiving is "Ran once" followed by the empty arrays of the curl_multi_getcontent() calls. See below:
Ran once
Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
)
Any help is greatly appreciated.
You're not setting the curl options correctly.
Currently you're setting options on $ch, which is your array. You need to set them on the current curl handle, which in your loop is $ch[$id]:
//Initialize and set options
curl_setopt($ch[$id], CURLOPT_URL, $url);
curl_setopt($ch[$id], CURLOPT_HEADER, 0);
curl_setopt($ch[$id], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch[$id], CURLOPT_POST, 1);
curl_setopt($ch[$id], CURLOPT_POSTFIELDS, $request);
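One way to make this mistake harder to repeat is to build each handle in a small function, so the options can only ever be applied to a single handle. A sketch, with made-up names (make_post_handle, build_multi) and the assumption that each request differs only in its POST body:

```php
<?php
// Hypothetical helper: create one fully configured POST handle.
function make_post_handle(string $url, string $body)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_HEADER => false,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $body,
    ]);
    return $ch;
}

// Hypothetical helper: one handle per request body, all added to one multi handle.
// Returns [$mh, $handles] with $handles keyed like $bodies.
function build_multi(string $url, array $bodies): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($bodies as $id => $body) {
        $handles[$id] = make_post_handle($url, $body);
        curl_multi_add_handle($mh, $handles[$id]);
    }
    return [$mh, $handles];
}
```

After this you run the usual curl_multi_exec loop on $mh and read the results from $handles.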
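Change this:
foreach ($mh as $c) {
$code = curl_getinfo($c, CURLINFO_HTTP_CODE);
to:
for($i=1;$i<=count($array);$i++){
$code = curl_multi_getcontent($ch[$i]);
assuming $array is the array holding your multiple URLs. $mh is the multi handle, not an array of handles, so it cannot be iterated with foreach; the individual handles are in $ch.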
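If you still want the per-request status that the original loop was after, a sketch like the following loops over the array of handles instead of the multi handle. collect_results is a made-up name; the transfer result codes are read from curl_multi_info_read, and the HTTP status from each handle.

```php
<?php
// Hypothetical helper: gather status and body for every handle in $handles
// (keyed by your own ids) after the curl_multi_exec loop has finished.
function collect_results($mh, array $handles): array
{
    // Transfer-level result codes come from the multi handle's message queue.
    $transferResult = [];
    while ($info = curl_multi_info_read($mh)) {
        foreach ($handles as $id => $h) {
            if ($h === $info['handle']) {
                $transferResult[$id] = $info['result']; // CURLE_OK (0) on success
            }
        }
    }

    $out = [];
    foreach ($handles as $id => $h) {
        $out[$id] = [
            'http_code' => curl_getinfo($h, CURLINFO_HTTP_CODE),
            'curl_result' => $transferResult[$id] ?? null, // null: no message seen
            'body' => curl_multi_getcontent($h),
        ];
        curl_multi_remove_handle($mh, $h);
    }
    return $out;
}
```

With the code from the question this would be called as collect_results($mh, $ch) right before curl_multi_close($mh).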
I am trying to log in to a site and then call numerous URLs to get the source and scrape it for images. It works fine using regular curl, but when I try to use multi_curl I get back the exact same response for every URL. So that I only have to log in once, I am reusing the curl resource (this works fine with regular curl), and I think this may be the reason why it returns the same response.
Does anyone know how to use multi_curl but authenticate first?
Here is the code I am using:
<?php
// LICENSE: PUBLIC DOMAIN
// The author disclaims copyright to this source code.
// AUTHOR: Shailesh N. Humbad
// SOURCE: http://www.somacon.com/p539.php
// DATE: 6/4/2008
// index.php
// Run the parallel get and print the total time
$s = microtime(true);
// Define the URLs
$urls = array(
"http://localhost/r.php?echo=request1",
"http://localhost/r.php?echo=request2",
"http://localhost/r.php?echo=request3"
);
$pg = new ParallelGet($urls);
print "<br />total time: ".round(microtime(true) - $s, 4)." seconds";
// Class to run parallel GET requests and return the transfer
class ParallelGet
{
function __construct($urls)
{
// Create get requests for each URL
$mh = curl_multi_init();
$count = 0;
$ch = curl_init();
foreach($urls as $i => $url)
{
$count++;
if($count == 1)
{
// SET URL FOR THE POST FORM LOGIN
curl_setopt($ch, CURLOPT_URL, 'https://www.example.com/login.php');
// ENABLE HTTP POST
curl_setopt ($ch, CURLOPT_POST, 1);
// SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'user=myuser&password=mypassword');
// IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES
curl_setopt ($ch, CURLOPT_COOKIEJAR, realpath($_SERVER['DOCUMENT_ROOT']) . '/cookie.txt');
# Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
// EXECUTE 1st REQUEST (FORM LOGIN)
curl_exec ($ch);
}
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_COOKIEFILE, realpath($_SERVER['DOCUMENT_ROOT']) . '/cookie.txt');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$ch_array[$i] = $ch;
curl_multi_add_handle($mh, $ch_array[$i]);
}
// Start performing the request
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
// Loop and continue processing the request
while ($runningHandles && $execReturnValue == CURLM_OK) {
// Wait forever for network
$numberReady = curl_multi_select($mh);
if ($numberReady != -1) {
// Pull in any new data, or at least handle timeouts
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
}
}
// Check for any errors
if ($execReturnValue != CURLM_OK) {
trigger_error("Curl multi read error $execReturnValue\n", E_USER_WARNING);
}
// Extract the content
foreach($urls as $i => $url)
{
// Check for errors
$curlError = curl_error($ch_array[$i]);
if($curlError == "") {
$res[$i] = curl_multi_getcontent($ch_array[$i]);
} else {
print "Curl error on handle $i: $curlError\n";
}
// Remove and close the handle
curl_multi_remove_handle($mh, $ch_array[$i]);
curl_close($ch_array[$i]);
}
// Clean up the curl_multi handle
curl_multi_close($mh);
// Print the response data
print_r($res);
}
}
?>
You need to enable cookies in curl as well. Look it up in the documentation, and don't forget to create the cookie file (an empty file) with read and write permission for curl.
$cookie = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
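Putting the pieces together for the login case, here is a sketch under assumptions: the helper names are made up, and the login form fields are whatever your site expects. The important detail is that libcurl writes the cookie jar when the login handle is released, so that handle should be closed before the parallel handles are created.

```php
<?php
// Hypothetical helper: log in once, writing the session cookies to $cookieFile.
function login_once(string $loginUrl, string $postFields, string $cookieFile): bool
{
    $ch = curl_init($loginUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $postFields,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR => $cookieFile, // cookies are saved here on cleanup
    ]);
    $ok = curl_exec($ch) !== false;
    curl_close($ch);
    unset($ch); // release the handle so the jar is flushed to disk
    return $ok;
}

// Hypothetical helper: one handle per URL, each reading cookies from $cookieFile.
function make_cookie_handles(array $urls, string $cookieFile): array
{
    $handles = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_COOKIEFILE => $cookieFile, // send the saved session cookies
        ]);
        $handles[$i] = $ch;
    }
    return $handles;
}
```

The handles returned by make_cookie_handles() can then be added to a multi handle exactly as in the question, after login_once() has returned true.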