Scraping Book Prices - php

I'm trying to write a scraping app, and I'm running into problems. My PHP cURL code isn't pulling up the pages with the book prices; it returns the web root of the domain instead.
I'm trying to search the site by ISBN.
I've been bashing my head against the wall for days. Any help will be most appreciated!
Code:
<form method="post" for="new-search" name="SearchTerm" class='form-validate' id="SearchTerm" action="index.php">
<textarea rows="3" name="SearchTerm" id="SearchTerm" cols="40" class="validate-required error"></textarea><div class="error" id="SearchTerm-error">
<br>
<button class="search primary" type="submit">continue</button>
</form>
<?php
/*
echo("<pre>");print_r($_GET);echo("</pre>");
echo("<pre>");print_r($_POST);echo("</pre>");
*/
$isbn = $_POST['SearchTerm'];
$userAgent = 'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16';
$fields = array(
'url' => ("http://www.bookleberry.com/Search/SearchKeyword"),
'qurl' => ("http://www.bookleberry.com/Search/SearchKeyword/" . $_POST['SearchTerm']),
'SearchTerm' => ($_POST['SearchTerm']),
'Page' => ('1'),
'class' => ('textfield validate-required'),
'for' => ('new-search'),
'result-count' => ('1'),
'status' => 'success',
);
$SearchTerm = ($fields['SearchTerm']);
$url = ($fields['url']);
$Page = ($fields['Page']);
echo("<pre>");
print_r($fields);
echo("</pre>");
if ($isbn != NULL){
//open connection
$ch = curl_init($url);
//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_HEADER, $userAgent);
curl_setopt($ch, CURLOPT_URL, $url);
echo "before curl_exec:<br>";
echo "curl_errno=". curl_errno($ch) ."<br>";
echo "curl_error=". curl_error($ch) ."<br>";
curl_setopt($ch,CURLOPT_POST,count($fields));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "?SearchTerm=$SearchTerm");
curl_setopt($ch, CURLOPT_HTTPGET, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 9999999);
curl_setopt($ch,CURLOPT_HTTPHEADER,array (
"Accept: application/json"
));
$info = curl_getinfo($ch);
//execute post
$result = curl_exec($ch);
print $result;
print "<pre>\n";
print_r(curl_getinfo($ch)); // get error info
?>

Don't hurt your head, use it!
Install Fiddler.
Make the request in the browser and look in Fiddler to see exactly what is sent, including all headers, cookies, and form variables.
Make the same request with your code and examine it in Fiddler again.
Compare the two requests and adjust your script to remove the differences.
Repeat.
It also helps to install Firebug. Copying an element's XPath from it and using that in a PHP DOMXPath query makes scraping fun and easy!
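As a rough sketch of the XPath approach mentioned above: the markup, class names, and XPath expression here are made up for illustration and are not taken from the real site, so substitute whatever the copied XPath actually gives you.

```php
<?php
// Hypothetical markup standing in for a fetched search-result page.
$html = '<html><body>
  <div class="book"><span class="title">Example Book</span><span class="price">$12.50</span></div>
  <div class="book"><span class="title">Another Book</span><span class="price">$8.99</span></div>
</body></html>';

// Collect the text of every price node matched by the XPath expression.
function extractPrices(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $prices = [];
    foreach ($xpath->query('//div[@class="book"]/span[@class="price"]') as $node) {
        $prices[] = trim($node->textContent);
    }
    return $prices;
}

print_r(extractPrices($html));
```

In a real scraper, $html would be the string returned by curl_exec() with CURLOPT_RETURNTRANSFER enabled.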

Related

Slack Incoming Webhook with PHP form

I'm trying to create a PHP script that automatically pushes the text from a <textarea> in my web form to a Slack channel.
HTML:
<form action="http://main.xfiddle.com/<?php echo pf_file('g7f-ds0'); ?>" method="post" id="myform" name="myform">
<textarea name="text" id="" rows="3" cols="30">
</textarea> <br /><br />
<button id="mysubmit" type="submit" name="submit">Submit</button><br /><br /></form>
I managed to write a PHP script that posts hard coded message to Slack like this:
<?php
//API Url
$url = 'https://hooks.slack.com/services/T02NZ01FU/B08TTAPGE/000000000000000000';
//Initiate cURL.
$ch = curl_init($url);
//The JSON data.
$payload = array(
’text' => 'Testing text with PHP'
);
//Encode the array into JSON.
$jsonDataEncoded = json_encode($payload);
//Tell cURL that we want to send a POST request.
curl_setopt($ch, CURLOPT_POST, 1);
//Attach our encoded JSON string to the POST fields.
curl_setopt($ch, CURLOPT_POSTFIELDS, $jsonDataEncoded);
//Set the content type to application/json
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
//Execute the request
$result = curl_exec($ch);
?>
But for some reason, when I try to read the text from <textarea name="text" rows="3" cols="30"></textarea> and save it into a variable, it doesn't work. I add this to the beginning of the PHP script to set the text variable:
if(isset($_POST['submit']))
$textdata = $_POST['text'];
and then change the $payload to
'text' => $textdata
A simple example of how to use slack incoming webhook with curl
<?php
define('SLACK_WEBHOOK', 'https://hooks.slack.com/services/xxx/yyy/zzz');
function slack($txt) {
$msg = array('text' => $txt);
$c = curl_init(SLACK_WEBHOOK);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, array('payload' => json_encode($msg)));
curl_exec($c);
curl_close($c);
}
?>
Snippet taken from here
There are two likely issues here.
The PHP formatting in your post is incorrect.
Replace ’text' => 'Testing text with PHP' with
'text' => 'Testing text with PHP'
Your cURL request is probably not set up correctly either. The likely cause is that no trusted SSL certificates are configured; check curl_error($ch) after curl_exec() to confirm.
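Putting the pieces of this thread together, a minimal sketch might look like the following. The webhook URL is a placeholder, and the function names buildSlackPayload and postToSlack are invented for this example; only the 'text' field name and the JSON content type come from the code in the question.

```php
<?php
// Placeholder webhook URL; substitute your own.
const SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/xxx/yyy/zzz';

// Build the JSON body for a plain text message (note the straight quotes).
function buildSlackPayload(string $text): string
{
    return json_encode(['text' => trim($text)]);
}

// Send the message and return [httpCode, responseBody, curlError]
// so that failures are visible instead of silent.
function postToSlack(string $url, string $text): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildSlackPayload($text));
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    $code  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$code, $body, $error];
}

// Only contact Slack when the form was actually submitted with some text.
if (($_SERVER['REQUEST_METHOD'] ?? '') === 'POST' && trim($_POST['text'] ?? '') !== '') {
    [$code, $body, $error] = postToSlack(SLACK_WEBHOOK_URL, $_POST['text']);
    echo $error !== '' ? "cURL error: $error" : "Slack answered $code: $body";
}
```

Printing the cURL error and the HTTP code is the quickest way to tell a certificate problem apart from a malformed payload.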

PHP curl not sending post params

I have an online form that submits data via HTTP POST.
If no data is sent in the POST, it returns the following error message:
We are sorry, the form that you have submitted is invalid or no longer exists.
I am using the following code to send the HTTP POST data with cURL, but I always get that error message as output.
Did I miss anything? I printed the cURL request, and the POST parameters are not being sent. How can I fix this?
My code is given below:
function curlUsingPost($url, $data)
{
$content_type = 'application/x-www-form-urlencoded';
if(empty($url) OR empty($data))
{
return 'Error: invalid Url or Data';
}
//url-ify the data for the POST
$fields_string = '';
foreach($data as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
$fields_string = http_build_query($data).'\n';
echo $fields_string;
//open connection
$ch = curl_init();
//set the url, number of POST vars, POST data
// curl_setopt($ch, CURLOPT_HTTPHEADER, ['Accept: ' . $content_type]);
curl_setopt($ch, CURLOPT_HTTPHEADER,array($content_type));
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,true);
curl_setopt( $curl_handle, CURLOPT_HTTPHEADER, array( 'Expect:' ) );
curl_setopt($ch,CURLOPT_POSTFIELDS,$fields_string);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,10); # timeout after 10 seconds, you can increase it
//curl_setopt($ch,CURLOPT_HEADER,false);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); # Set curl to return the data instead of printing it to the browser.
curl_setopt($ch, CURLOPT_USERAGENT , "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"); # Some server may refuse your request if you dont pass user agent
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$info = curl_getinfo($ch);
print_r($info );
//execute post
$result = curl_exec($ch);
//close connection
curl_close($ch);
return $result;
}
$url = 'office.promptip.com:8443/his1.max/Services/Webform.aspx?request=submitwebform&token=202D36560E444E466658461615120F75411B09634C696C6F79454E795E7C7A5154505645414F456C584C17101B443D431D0B654D66686D7C454E795E2D7E5152565341404145625E4213141E4273421E5E66486B666F2A464B7A5B79785F54520645444F15635C454516134422421A0D664A3D68687C454E79587F7D5F52575714424043635E1611181D16704D1D0E654F69696F2A474A7A0A78785807525E45414E406350424016134422421A0D374C6F686C7C44482F59787D5E530150104218426D584316181A4177411B0F';
//echo $url;
$data = [
'C0IFirstName' => 'John Doe',
'C1ILastName' => 'Doe John',
'C2IPosition' =>'Father-Guardian',
'U3I80'=>'8129020464',
'U4I150'=>'johndoe#johndoe.com',
'U5I79'=>'This is the message buddy',
'U6I102' => 'Ann',
'U7I105' =>'Doe',
'U8I52' =>'FS1',
'C9IClientId'=> rand ( 0,100000 ),
'U10I21'=>'Website-Enquiry Form'
];
echo curlUsingPost($url,$data);
Solution
Add this to your curl request:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
And replace &amp; with & in the URL you're requesting:
$url = 'office.promptip.com:8443/his1.max/Services/Webform.aspx?request=submitwebform&token=202D36560E444E466658461615120F75411B09634C696C6F79454E795E7C7A5154505645414F456C584C17101B443D431D0B654D66686D7C454E795E2D7E5152565341404145625E4213141E4273421E5E66486B666F2A464B7A5B79785F54520645444F15635C454516134422421A0D664A3D68687C454E79587F7D5F52575714424043635E1611181D16704D1D0E654F69696F2A474A7A0A78785807525E45414E406350424016134422421A0D374C6F686C7C44482F59787D5E530150104218426D584316181A4177411B0F';
Explanation
First, you're requesting the wrong URL, which was probably copied from some HTML source. Normally an &amp; entity would be replaced with a & character by the HTML interpreter, but cURL does not do this, so you have to replace the entity manually.
Also, the server returns
We are sorry, the form that you have submitted is invalid or no
longer exists.
in the response body for each and every request. For a successful request, however, it also provides a Location header. Normally a browser would follow this location, make a second request, and display the success message. You have to tell cURL to do exactly the same with curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true).
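Both fixes can be sketched together as follows. The example.com URL and the function name postFollowingRedirects are assumptions for illustration; html_entity_decode() is one way to undo the entity if you would rather not edit the URL by hand.

```php
<?php
// Hypothetical URL as it might look when copied straight out of HTML source.
$copied = 'https://example.com/Webform.aspx?request=submitwebform&amp;token=ABC123';

// Turn HTML entities such as &amp; back into the literal characters.
$url = html_entity_decode($copied);
echo $url, PHP_EOL;

// POST the data and follow the Location header the way a browser would.
function postFollowingRedirects(string $url, array $data)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5); // guard against redirect loops
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch); // body of the final response, or false on failure
}
```

With the redirect followed, the returned body should be the page behind the Location header rather than the generic error text.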

How to make Curl post Using Php involving cookies

I am developing a script that uses PHP cURL to send SMS messages via http://www.gysms.com/freesms.php
The page stores a PHPSESSID cookie, and a hidden field named token is also passed when the form is posted.
I have written a script involving two cURL requests. The first request parses the page and obtains the token value.
Here is the code for that:
<?php
$phone = '9197xxxxxxx';
$msg = 'Hi this is curlpost';
$get_cookie_page = 'http://www.gysms.com/freesms.php';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $get_cookie_page);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$sabin = curl_exec($ch);
$html=explode('<input type="hidden" name="trigger" value="',$sabin);
$html=explode('"/>',$html[1]);
//store the token value to $html[0]
?>
Curl post is done using the following code:
<?php
$fields = array(
'trigger'=>urlencode($html[0]), //token value
'number'=>urlencode($phone), //phone no
'message'=>urlencode($msg) //message
);
//posting curl request
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
rtrim($fields_string,'&');
$url = 'http://www.gysms.com/freesms.php';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
//execute post
$result = curl_exec($ch);
echo $result;
curl_close($ch);
?>
The SMS is not sent when I use the above code.
If the SMS is sent, the page should show that the SMS was sent to the number.
I don't know where I went wrong. Please help, I am new to PHP.
Finally, this attempt is only for my educational purposes.
Here is some code I came up with that worked. Hope it helps. Some explanations and feedback about your code follow.
<?php
$number = '14155556666';
$message = 'This is my text in all its glory.';
$url = 'http://www.gysms.com/freesms.php';
$cookieFile = tempnam(sys_get_temp_dir(), 'SMS');
$userAgent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0';
if (strlen($message) > 100) {
die('Message length cannot exceed 100 characters.');
}
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // empty user agents probably not accepted
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1); // enable this - they check referer on POST
$html = curl_exec($ch);
// <input type="hidden" name="trigger" value="CXXrtmqVC7KbUnJ22UBodFy1kBj4ign5PsQ3qNR91nH2055307b4xP4"/>
if (!preg_match('/name=.trigger.\s+value=.([^\'"]+)/i', $html, $trigger)) {
die('Failed to locate hidden input value');
}
sleep(5); // without a slight delay, i often would not receive sms
$trigger = $trigger[1];
// build array of post values - all are important
$post = array('number' => $number,
'trigger' => $trigger,
'message' => $message,
'remLen' => 100 - strlen($message),
$trigger => 'Send Message');
// switch request to POST, use http_build_query to encode post data for us
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
$html = curl_exec($ch);
if (strpos($html, '<b>Message sent to</b>') !== false) {
echo "Message sent!";
} else {
echo "<b>Message not sent :(</b><br /><br />";
echo $html;
}
I think you may have had trouble for several reasons:
A User-Agent should be specified in the request; they seem to reject requests that leave it empty.
I used http_build_query to build the POST string (personal preference).
You were missing two fields in the request: remLen, and a field named after the trigger value, which acts as the submit button.
I often would not receive the messages if I didn't sleep for a few seconds between getting the trigger value and sending the message.
In most of the cases where I didn't get the message, the page still showed "Message sent to phone #" even though the SMS never arrived. Once I combined all the right things (sleep time, user agent, valid POST fields), I would see the success message and also receive the SMS.
I think the most critical thing missing from your code was that the first request, where you grab the trigger value, also sets a cookie (PHPSESSID) that you are required to capture. Without sending that cookie on the POST request, it was probably an automatic reject.
To get around this, make sure you capture cookies on the first request as well as on subsequent requests. I chose to reuse the same cURL handle for both requests. You don't have to do it that way, but you would have to use the same cookie file and cookie jar for both requests.
Hope that helps.
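If the regular expression feels fragile, the hidden field can also be read with DOMXPath. This is only a sketch: the function name extractHiddenInput is invented here, and it assumes the field name contains no quote characters.

```php
<?php
// Return the value of <input type="hidden" name="$name">, or null if absent.
function extractHiddenInput(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings silently; scraped HTML is seldom valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//input[@type="hidden"][@name="%s"]', $name));
    return $nodes->length > 0 ? $nodes->item(0)->getAttribute('value') : null;
}

// Sample markup shaped like the hidden input shown in the code above.
$sample = '<form><input type="hidden" name="trigger" value="CXXrtmqVC7Kb"/></form>';
var_dump(extractHiddenInput($sample, 'trigger'));
```

The returned value would then go into the POST array exactly as $trigger does in the answer's code.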

cURL response is 200, but it doesn't really post the values

I have the following code for cURL using PHP;
$product_id_edit="Playful Minds (1062)";
$item_description_edit="TEST";
$rank_edit="0";
$price_type_edit="2";
$price_value_edit="473";
$price_previous_value_edit="473";
$active_edit="1";
$platform_edit="ios";
//set POST variables
$url = 'https://www.domain.com/adm_test/phpgen/offline_items.php?operation=insert';
$useragent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1';
$fields = array(
'product_id_edit'=>urlencode($product_id_edit),
'item_description_edit'=>urlencode($item_description_edit),
'rank_edit'=>urlencode($rank_edit),
'price_type_edit'=>urlencode($price_type_edit),
'price_value_edit'=>urlencode($price_value_edit),
'price_previous_value_edit'=>urlencode($price_previous_value_edit),
'active_edit'=>urlencode($active_edit),
'platform_edit'=>urlencode($platform_edit)
);
$fields_string="";
//url-ify the data for the POST
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
rtrim($fields_string,'&');
//open connection
$ch = curl_init();
//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
//add useragent
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt ($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch,CURLOPT_POSTFIELDS,$fields_string);
curl_setopt($ch,CURLOPT_POST,count($fields));
//execute post
$result = curl_exec($ch);
if(curl_errno($ch)){
print "" . curl_error($ch);
}else{
//print_r($result);
}
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
//echo "HTTP Response Code: " . curl_error($ch);
echo $httpCode;
//close connection
curl_close($ch);
When I print $httpCode I get 200, which I presume is OK from what I have read in the manual pages. However, when I check the site, the POSTed values do not exist.
Does this have something to do with cross-domain restrictions, since I am not posting from the same domain? I'm running it from 127.0.0.1/site/scrpt.php. And how can I get response code 200 if the request was not successful?
I also tried to provoke a 404 by removing part of the request URL, and it did return a 404, so I assume cURL is working properly.
Does having "?operation=insert" in the URL https://www.domain.com/adm_test/phpgen/offline_items.php?operation=insert have something to do with it?
Let's presume (though it is not implied) that I'm on another site and want to post values into a form on another website, sort of like a robot. My objective does not involve any evil intentions. If this is not doable, I would have to enter thousands of lines of information by hand.
Also, I don't need a response from the server (but if one is available, that's just fine).
The operation should be passed with CURLOPT_POSTFIELDS, along with the other parameters.
Cross-domain issues only arise in browsers. Your code is server-side PHP, so this should not be a problem.
Not sure if this is the solution or the problem is different, but this line:
rtrim($fields_string,'&');
Should be this:
$fields_string = rtrim($fields_string,'&');
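To illustrate the point about rtrim(), here is a small sketch using a few of the field names from the question with made-up values. It also shows http_build_query(), which avoids the manual loop altogether.

```php
<?php
// Field names from the question; the values are made up for the example.
$fields = [
    'rank_edit' => '0',
    'platform_edit' => 'ios',
    'item_description_edit' => 'TEST & more',
];

// Manual approach: rtrim() returns the trimmed string and leaves its
// argument untouched, so the result has to be assigned back.
$manual = '';
foreach ($fields as $key => $value) {
    $manual .= $key . '=' . urlencode($value) . '&';
}
$manual = rtrim($manual, '&');

// http_build_query() does the encoding and the joining in one step.
$built = http_build_query($fields);

echo $manual, PHP_EOL, $built, PHP_EOL;
```

Note that the raw values are passed in here; calling urlencode() on them first, as the question does, and then encoding again would double-encode them.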
curl_setopt($ch,CURLOPT_POST,TRUE);
CURLOPT_POST is a boolean flag that tells cURL to use POST; it is not a count of values.
A 200 code indicates that the connection was set up correctly and a response was received from the server, but it does not mean that the requested action was carried out.
Print $result after the request to see the response from the web server.
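One way to act on this is to check the response body as well as the status code. The sketch below is an assumption-heavy illustration: the function name requestSucceeded and the marker text are invented, and you would need to look at what the real page prints after a successful insert.

```php
<?php
// Treat the request as successful only if the status is 2xx AND the body
// contains a marker that the target page is known to print on success.
function requestSucceeded(int $httpCode, string $body, string $marker): bool
{
    return $httpCode >= 200 && $httpCode < 300 && strpos($body, $marker) !== false;
}

// Hypothetical responses: both are 200, but only one reports the insert.
var_dump(requestSucceeded(200, '<p>Record inserted</p>', 'Record inserted'));
var_dump(requestSucceeded(200, '<form>Please log in</form>', 'Record inserted'));
```

In the question's script, $httpCode and $result would be the first two arguments.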

How CURL Login with Captcha and Session

define('COOKIE', './cookie.txt');
define('MYURL', 'https://register.pandi.or.id/main');
function getUrl($url, $method='', $vars='', $open=false) {
$agents = 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16';
$header_array = array(
"Via: 1.1 register.pandi.or.id",
"Keep-Alive: timeout=15,max=100",
);
static $cookie = false;
if (!$cookie) {
$cookie = session_name() . '=' . time();
}
$referer = 'https://register.pandi.or.id/main';
$ch = curl_init();
if ($method == 'post') {
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "$vars");
}
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header_array);
curl_setopt($ch, CURLOPT_USERAGENT, $agents);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 5);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
$buffer = curl_exec($ch);
if (curl_errno($ch)) {
echo "error " . curl_error($ch);
die;
}
curl_close($ch);
return $buffer;
}
function save_captcha($ch) {
$agents = 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.204 Safari/534.16';
$url = "https://register.pandi.or.id/jcaptcha";
static $cookie = false;
if (!$cookie) {
$cookie = session_name() . '=' . time();
}
$ch = curl_init(); // Initialize a CURL session.
curl_setopt($ch, CURLOPT_URL, $url); // Pass URL as parameter.
curl_setopt($ch, CURLOPT_USERAGENT, $agents);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return stream contents.
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1); // We'll be returning this
$data = curl_exec($ch); // // Grab the jpg and save the contents in the
curl_close($ch); // close curl resource, and free up system resources.
$captcha_tmpfile = './captcha/captcha-' . rand(1000, 10000) . '.jpg';
$fp = fopen($tmpdir . $captcha_tmpfile, 'w');
fwrite($fp, $data);
fclose($fp);
return $captcha_tmpfile;
}
if (isset($_POST['captcha'])) {
$id = "yudohartono";
$pw = "mypassword";
$postfields = "navigation=authenticate&login-type=registrant&username=" . $id . "&password=" . $pw . "&captcha_response=" . $_POST['captcha'] . "press=login";
$url = "https://register.pandi.or.id/main";
$result = getUrl($url, 'post', $postfields);
echo $result;
} else {
$open = getUrl('https://register.pandi.or.id/main', '', '', true);
$captcha = save_captcha($ch);
$fp = fopen($tmpdir . "/cookie12.txt", 'r');
$a = fread($fp, filesize($tmpdir . "/cookie12.txt"));
fclose($fp);
<form action='' method='POST'>
<img src='<?php echo $captcha ?>' />
<input type='text' name='captcha' value=''>
<input type='submit' value='proses'>
</form>";
if (!is_readable('cookie.txt') && !is_writable('cookie.txt')) {
echo "cookie fail to read";
chmod('../pandi/', '777');
}
}
This is cookie.txt:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
register.pandi.or.id FALSE / FALSE 0 JSESSIONID 05CA8241C5B76F70F364CA244E4D1DF4
After I submit the form, it just displays:
HTTP/1.1 200 OK Date: Wed, 27 Apr 2011 07:38:08 GMT Server: Apache-Coyote/1.1 X-Powered-By: Servlet 2.4; Tomcat-5.0.28/JBoss-4.0.0 (build: CVSTag=JBoss_4_0_0 date=200409200418) Content-Length: 0 Via: 1.1 register.pandi.or.id Content-Type: text/plain X-Pad: avoid browser bug
If not that, I get the error "Captcha invalid".
The login to pandi always fails.
What is wrong in my script?
I don't want to break the captcha. I want to display it on my web page and let the user type it in, so that users can register dotID domains from my site automatically.
A captcha is intended to differentiate between humans and robots (programs). Seems like you are trying to log in with a program. The captcha seems to do its job :).
I don't see a legal way around.
It happens because
you took the captcha image from the first getUrl call (the first curl_exec) and processed that captcha, but to submit the answer you call getUrl again (another curl_exec), which loads a new page with a new captcha.
So you are submitting the answer to the old captcha against the new one. I had the same problem and resolved it.
A captcha is a dynamic image created by the server when you hit the page, and it changes every time the page is loaded. You must extract the captcha from the page, solve it, and then submit the login for that same page load.
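A sketch of one way to keep the captcha and the later login in the same remote session: give each visitor a private cookie jar and use it for every request. The function names cookieJarFor and remoteRequest are invented for this example, and whether the site accepts this flow is an assumption.

```php
<?php
// Map the visitor's own PHP session ID to a private cookie-jar path, so the
// captcha fetch and the later login POST reuse the same remote session cookie.
function cookieJarFor(string $sessionId): string
{
    return sys_get_temp_dir() . '/curl_cookies_' . hash('sha256', $sessionId) . '.txt';
}

// Perform a GET, or a POST when $postBody is given, using that jar.
function remoteRequest(string $url, string $jar, ?string $postBody = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $jar); // read cookies from earlier requests
    curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);  // write cookies back when the handle is freed
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    if ($postBody !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    }
    return curl_exec($ch);
}

// Intended flow (not executed here):
//   $jar = cookieJarFor(session_id());
//   1. When rendering the form: remoteRequest($loginPageUrl, $jar), then save
//      the result of remoteRequest($captchaUrl, $jar) as the image to show.
//   2. When the form is submitted: remoteRequest($loginUrl, $jar, $postFields)
//      WITHOUT fetching the login page or the captcha again first.
```

The key point is step 2: no new page load happens between showing the image and posting the answer, so the answer still matches the captcha the server has on record.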
This is possible with a headless browsing solution, e.g. zombie.js or coffee.js on Node. It may also be possible to extract the "image" from the captcha and, using image recognition, "read" the image and convert it to text, which is then posted with the form.
As of today, the only surefire method to "trick" a captcha is to use headless browsing.
Yes, Andro Selva is right. The second request returns a new captcha. The captcha is loaded once by the getUrl function and a second time by the save_captcha function, so these are two different images.
It should do something like this:
Download the captcha image before closing the cURL handle and before posting, and make the script wait until you provide the captcha answer. I would use preg_match. It will require some JavaScript as well.
If the captcha image is generated by JavaScript, you need to execute that JavaScript with the same cookie or token. In this situation, the easier solution is to record the headers with a tool such as the LiveHTTPHeaders add-on for Mozilla Firefox.
I do not know how to do it with PHP; you have to get the captcha and find a way to solve it. There are many algorithms that can do it for you, but if you are willing to use Java, I have already adapted the source code from this link to solve the captcha, and it works very well for a lot of captcha systems.
So you could try to implement your own captcha solver, which will take a lot of time; try to find an existing implementation for PHP; or, IMHO the best option, use the JDownloader code base.
