I am trying to load an url using cURL. But here I am facing problem with browser cookies. The webpage is asking me to enable browser cookies. My code is like this.
public function execute() {
// Set two default options, and merge any extra ones in
if(!isset($this->options[CURLOPT_TIMEOUT])) $this->options[CURLOPT_TIMEOUT] = 45;
if(!isset($this->options[CURLOPT_COOKIESESSION])) $this->options[CURLOPT_COOKIESESSION] = TRUE;
if(!isset($this->options[CURLOPT_COOKIEJAR])) $this->options[CURLOPT_COOKIEJAR] = 'cookie.txt';
if(!isset($this->options[CURLOPT_RETURNTRANSFER])) $this->options[CURLOPT_RETURNTRANSFER] = TRUE;
if(!isset($this->options[CURLOPT_FOLLOWLOCATION])) $this->options[CURLOPT_FOLLOWLOCATION] = TRUE;
if(!isset($this->options[CURLOPT_USERAGENT])) $this->options[CURLOPT_USERAGENT] = "Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0";
if(!isset($this->options[CURLOPT_AUTOREFERER])) $this->options[CURLOPT_AUTOREFERER] = TRUE;
if(!isset($this->options[CURLOPT_CONNECTTIMEOUT])) $this->options[CURLOPT_CONNECTTIMEOUT] = 15;
if(!isset($this->options[CURLOPT_MAXREDIRS])) $this->options[CURLOPT_MAXREDIRS] = 4;
if(!isset($this->options[CURLOPT_HEADER])) $this->options[CURLOPT_HEADER] = FALSE;
if(!isset($this->options[CURLOPT_SSL_VERIFYPEER])) $this->options[CURLOPT_SSL_VERIFYPEER] = FALSE;
if(!isset($this->options[CURLOPT_FAILONERROR])) $this->options[CURLOPT_FAILONERROR] = FALSE;
if(!isset($this->options[CURLOPT_ENCODING])) $this->options[CURLOPT_ENCODING] = '';
$this->options();
$return = curl_exec($this->session);
// Request failed
if($return === FALSE){
$this->error_code = curl_errno($this->session);
$this->error_string = curl_error($this->session);
curl_close($this->session);
$this->session = NULL;
return $return;
// Request successful
} else {
$this->info = curl_getinfo($this->session);
curl_close($this->session);
$this->session = NULL;
return $return;
}
}
But still facing the same problem, please help me in that.
I know, this is purely restricted by the website
"Before you can move on, please activate your browser cookies. "
I think you forget about CURLOPT_COOKIE, try to set it on TRUE
Related
I would like to apologize in advance if this is a stupid question, but I am a junior developer starting a new job (and yes, very afraid of making a huge mistake). Most of my expertise is in Python, SQL, JavaScript, CSS and HTML. However, in my job I've been tasked with deactivating cookies in their website (they have to because of privacy laws in Europe). Some of the pages' backends are written in javascript and I was able to find the cookies and deactivate them, but some are written in php. I can tell what the code is and what it does, but since I've never dealt with php before, I'm not sure if I should just delete the script or if I should modify it in any way. Any help or advice will be greatly appreciated. This is the code (it is in its own file):
<?php
// Real-time Data Aggregation (RDA)
// error_reporting( E_ALL );
// ini_set('display_errors', 1);
class RDA {
private $session_cookie = '';
private $log_site = '';
private $config = array();
private $raw_payload = '';
private $payload = array();
private $publish_path_map = array();
public function __construct($config){
$this->config = $config;
}
public function process(){
$this->raw_payload = file_get_contents('php://input');
if(!$this->is_json($this->raw_payload)){
echo 'Expected payload was not provided. Script has been aborted.';
return;
}
$this->payload = json_decode($this->raw_payload);
if(array_key_exists('passed_through_rda', $this->payload) && $this->payload->passed_through_rda == 'true') return; // If this had previously passed through a RDA script so let's abort to prevent recursion.
if($this->is_test_payload()) return; // When the Test button is clicked from account settings simply echo back the payload and abort.
$this->send_next_webhook_request(); // forward payload to another webhook listener.
if($this->payload->finished != 'true') return; // we only want to react when the event has finished and not when it has been started.
$this->set_publish_path_map(); // sets up an index of publish paths to use as reference to prevent publish recursion.
foreach($this->config['actions'] as $action){
if(!$this->payload_contains_trigger_path($action)) continue; // payload does not contain trigger path so end execution.
$this->authenicate();
$this->publish($action);
}
$this->log_request();
}
private function authenicate(){
if($session_cookie != '') return; // session cookie was already created so exit authenication.
$endpoint = $this->config['ouc_base_url'] . '/authentication/login';
$config = array(
'skin' => $this->config['skin'],
'account' => $this->config['account'],
'username' => $this->config['username'],
'password' => $this->config['password']
);
$post_fields = http_build_query($config);
$cURLConnection = curl_init($endpoint);
curl_setopt($cURLConnection, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($cURLConnection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($cURLConnection, CURLOPT_HEADER, true);
$api_response = curl_exec($cURLConnection);
$header = curl_getinfo( $cURLConnection );
curl_close($cURLConnection);
$header_content = substr($api_response, 0, $header['header_size']);
$pattern = "#Set-Cookie:\\s+(?<cookie>[^=]+=[^;]+)#m";
preg_match_all($pattern, $header_content, $matches);
$this->session_cookie = implode("; ", $matches['cookie']);
}
private function publish($action){
$endpoint = '/files/publish';
$config = array(
'site' => $action['site'],
'path' => $action['publish_path'],
'include_scheduled_publish' => 'true',
'include_checked_out' => 'true'
);
$this->log_site = $action['site']; // set a site to use to create log files if logging is turned on.
$this->send($endpoint, $config);
}
private function set_publish_path_map(){
foreach($this->config['actions'] as $action){
$this->publish_path_map[$action['site'] . $action['publish_path']] = 1;
}
}
private function log_request(){
if($this->config['log'] != 'true' || $this->log_site == '') return; // don't log when logging turned or if log_site not set
$log_id = uniqid();
$endpoint = '/files/save';
$config = array(
'site' => $this->log_site,
'path' => $this->config['config_file'], // uses the config PCF to do a "save as" to a log file
'new_path' => $this->get_root_relative_folderpath() . '_log/' . $log_id . '.txt',
'text' => $this->raw_payload
);
$this->send($endpoint, $config);
}
private function send_next_webhook_request(){
$next_webhook_url = trim($this->config['next_webhook_url']);
if($next_webhook_url == '') return; // next_webhook_url not entered so just return.
$this->payload->passed_through_rda = 'true';
$connection = curl_init($next_webhook_url);
curl_setopt($connection, CURLOPT_POSTFIELDS, json_encode($this->payload, JSON_UNESCAPED_SLASHES));
curl_setopt($connection, CURLOPT_RETURNTRANSFER, true);
$api_response = curl_exec($connection);
curl_close($connection);
}
private function send($endpoint, $config){
$endpoint = $this->config['ouc_base_url'] . $endpoint;
$post_fields = http_build_query($config);
$connection = curl_init($endpoint);
curl_setopt($connection, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($connection, CURLOPT_RETURNTRANSFER, true);
curl_setopt($connection, CURLOPT_COOKIE, $this->session_cookie);
$api_response = curl_exec($connection);
curl_close($connection);
}
private function payload_contains_trigger_path($action){
$site = $action['site'];
$success = array(); // the success node in the webhook payload contains files that were published.
if(!array_key_exists($site, $this->payload->success)) return false; // no success array so just return false.
$success = $this->payload->success->{$site};
$published_paths = array();
foreach($success as $i){
if(!array_key_exists($site . $i->path, $this->publish_path_map)) $published_paths[] = $i->path; // only include paths that aren't also publish targets configured in this script to avoid publish recursion.
}
$trigger_paths = $action['trigger_path'];
$trigger_paths = explode(',', $trigger_paths);
foreach($trigger_paths as $trigger_path){
$trigger_path = trim($trigger_path);
$trigger_path = preg_replace('/(.)[\/]+$/', '$1', $trigger_path); // removes trailing slash unless the value is the string length is 1, for instance: '/'
if($trigger_path == '') continue;
foreach($published_paths as $path){
if($this->starts_with($path, $trigger_path)) return true;
}
}
return false;
}
private function is_test_payload(){
$account = $this->payload->account;
if($account == '<account name>'){ // This is the account name value used by the test http request.
echo $this->raw_payload;
return true;
}
return false;
}
private function is_json($string){
if(trim($string) == '') return false;
json_decode($string);
return (json_last_error() == JSON_ERROR_NONE);
}
private function starts_with($string, $startString){
$len = strlen($startString);
return (substr($string, 0, $len) === $startString);
}
private function get_root_relative_folderpath(){
$result = $this->get_root_relative_filepath();
$result = str_replace('\\', '/', $result);
$result = preg_replace('/[^\/]+$/', '', $result);
return $result;
}
private function get_root_relative_filepath(){
$result = str_replace($_SERVER['DOCUMENT_ROOT'], '', $_SERVER['SCRIPT_FILENAME']);
return $result;
}
}
?>
For clarification: they have a service that manages cookies and they were able to turn those off, but there are a number of cookies that are persisting, and they are being generated by scripts leftover from years ago (I have no idea who wrote this code, or how old it is) and they need to be deleted. I just want to make sure that if I delete something it won't cause other bugs on the website
Method to close all the cookies and sessions
i think you have start the sessions session_start()
session_start();
you can read the documentation here
//http://php.net/manual/en/function.setcookie.php#73484
To destory and close the sessions try the code below
the below method will help to unset the cookies serving in your php program
if (isset($_SERVER['HTTP_COOKIE'])) {
$cookies = explode(';', $_SERVER['HTTP_COOKIE']);
foreach($cookies as $cookie) {
$parts = explode('=', $cookie);
$name = trim($parts[0]);
setcookie($name, '', time()-1000);
setcookie($name, '', time()-1000, '/');
}
}
session_destroy();
I'm trying to create a simple script that'll let me know if a website is based off WordPress.
The idea is to check whether I'm getting a 404 from a URL when trying to access its wp-admin like so:
https://www.audi.co.il/wp-admin (which returns "true" because it exists)
When I try to input a URL that does not exist, like "https://www.audi.co.il/wp-blablabla", PHP still returns "true", even though Chrome, when pasting this link to its address bar returns 404 on the network tab.
Why is it so and how can it be fixed?
This is the code (based on another user's answer):
<?php
$file = 'https://www.audi.co.il/wp-blabla';
$file_headers = #get_headers($file);
if(!$file_headers || strpos($file_headers[0], '404 Not Found')) {
$exists = "false";
}
else {
$exists = "true";
}
echo $exists;
You can try to find the wp-admin page and if it is not there then there's a good change it's not WordPress.
function isWordPress($url)
{
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER , 1 );
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
// grab URL and pass it to the browser
curl_exec($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
// close cURL resource, and free up system resources
curl_close($ch);
if ( $httpStatus == 200 ) {
return true;
}
return false;
}
if ( isWordPress("http://www.example.com/wp-admin") ) {
// This is WordPress
} else {
// Not WordPress
}
This may not be one hundred percent accurate as some WordPress installations protect the wp-admin URL.
I'm probably late to the party but another way you can easily determine a WordPress site is by crawling the /wp-json. If you're using Guzzle by PHP, you can do this:
function isWorpress($url) {
try {
$http = new \GuzzleHttp\Client();
$response = $http->get(rtrim($url, "/")."/wp-json");
$contents = json_decode($response->getBody()->getContents());
if($contents) {
return true;
}
} catch (\Exception $exception) {
//...
}
return false;
}
I make a loop until 500000 and it stop at 65536, dies this php limit or system limit? Anyway to avoid it?
Btw, it's just an ordinary for loop.
for($i=0; $i<500000; $i++){}
real code, I care about the load, but my concern is not that one.
class ECStun{
var $sp = 1;
var $d = 5;
var $url = 'http://example.com/';
var $html = '';
var $invalid = '404';
var $data = array();
var $db = false; // db connection
var $dbhost = 'localhost';
var $dbuser = 'root';
var $dbpass = '';
var $dbname = 'scrape';
var $tbname = 'ecstun'; // this table will be created automatically by the script.
var $proxies = array(
);
var $startat = 0;
function init(){
$this->initDB();
$this->startat = microtime(true);
$x = rand(0, count($this->proxies)-1);
print('start scraping using proxy : '.$this->proxies[$x]."\n");
for($i = $this->sp; $i <= $this->d; $i++){
$url = $this->url.'ES'.$i;
if(!$this->isSaved($url)){ // skip is already saved to DB
$link = curl_init();
if(count($this->proxies) > 0){
curl_setopt($link, CURLOPT_PROXY, $this->proxies[$x]);
}
curl_setopt($link, CURLOPT_URL, $url);
curl_setopt($link, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17)');
curl_setopt($link, CURLOPT_RETURNTRANSFER, true);
curl_setopt($link, CURLOPT_FOLLOWLOCATION, true);
$this->html = curl_exec($link);
$code = curl_getinfo($link, CURLINFO_HTTP_CODE);
curl_close($link);
if($code == 200 || $code == 301){
$this->parse($url);
}else{
/*
** this block of script will make sure if IP got banned then delete the banned ip
** and then check if there is another remaining IPs we can use
** if no more IPs in $proxies then exit the script with information of latest ES
*/
//unset($this->proxies[$x]);
array_splice($this->proxies, $x, 1);
if(count($this->proxies) > 0){
$this->sp = $i; // if banned then set starting point to the latest ES to try again using different ip
$this->init();
}else{
exit('LAST ES: ES'.$i);
}
}
}
}
//$this->initDB();
}
}
$ecs = new ECStun;
if(isset($_GET['verify_proxy'])){ // browser only
$ecs->verifyProxy();
}else{
$ecs->init();
}
65535 is a magical number. From wikipedia:
65535 occurs frequently in the field of computing because it is the
highest number which can be represented by an unsigned 16-bit binary
number.
Try this:
var_dump(PHP_INT_MAX);
It should be the 65535 to confirm your error. Try upgrading to 32 bit or more.
I use this function to make cURL requests:
function curl_request($options) //single custom cURL request.
{
$ch = curl_init();
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_COOKIEJAR] = 'cookies.txt';
$options[CURLOPT_COOKIEFILE] = 'cookies.txt';
$options[CURLINFO_HEADER_OUT] = true;
$options[CURLOPT_VERBOSE] = true;
$options[CURLOPT_RETURNTRANSFER] = true;
$options[CURLOPT_CONNECTTIMEOUT] = 5;
$options[CURLOPT_TIMEOUT] = 5;
curl_setopt_array($ch, $options);
$response = curl_exec($ch);
curl_close($ch);
return $response;
}
The script hangs sometimes, but not always, on the $response = curl_exec($ch) line. This occurs even when the PHP script is set with infinite timeout (on the client side, Firebug takes this as "Aborted"). There is nothing in the error log.. It just doesn't get past that line when it hangs.
What could be going on? Any suggestions?
The issue seems to have been the resources of the server. When I switched to a better web host with a higher bandwidth limit things worked fine.
Could you please tell me , is there any limitation to send a request using multi_curl.
When I tried to send a request more than 200 , it was getting timeout.
see the below code ..............
.........................................
foreach($newUrlArry as $url){
$gatherUrl[] = $url['url'];
}
/*...................Array slice----------------------*/
$totalUrlRequest = count($gatherUrl);
if($totalUrlRequest > 10){
$offset = 10;
$index = 0;
$matchedAnchors = array();
$dom = new DOMDocument;
$NoOfTilesRequest = ceil($totalUrlRequest/$offset);
for($sl = 0; $sl<$NoOfTilesRequest;$sl++){
$output = array_slice($gatherUrl, $index, $offset);
$index = $offset+$index;
$responseAction = $this->multiRequestAction($output);
$k=0;
foreach($responseAction as $responseHtml){
#$dom->loadHTML($responseHtml);
$documentLinks = $dom->getElementsByTagName("a");
$chieldFlag = false;
for($i=0;$i<$documentLinks->length;$i++) {
$documentLink = $documentLinks->item($i);
if ($documentLink->hasAttribute('href') AND substr($documentLink->getAttribute('href'), 0, strlen($match)) == $match) {
$description = $documentLink->childNodes;
foreach($description as $words) {
$name = trim($words->nodeName);
if($name == 'em' || $name == 'b' || $name=="span" || $name=="p") {
if(!empty($words->nodeValue)) {
$matchedAnchors[$sl][$k]['anchor'] = trim($words->nodeValue);
$matchedAnchors[$sl][$k]['img'] = 0;
if($documentLink->hasAttribute('rel'))
$matchedAnchors[$sl][$k]['rel'] = 'Y';
else
$matchedAnchors[$sl][$k]['rel'] = 'N';
$chieldFlag = true;
break;
}
}
elseif($name == 'img' ) {
$alt= $words->getAttribute('alt');
if(!empty($alt)) {
$matchedAnchors[$sl][$k]['anchor'] = trim($words->getAttribute('alt'));
$matchedAnchors[$sl][$k]['img'] = 1;
if($documentLink->hasAttribute('rel'))
$matchedAnchors[$sl][$k]['rel'] = 'Y';
else
$matchedAnchors[$sl][$k]['rel'] = 'N';
$chieldFlag = true;
break;
}
}
}
if(!$chieldFlag){
$matchedAnchors[$sl][$k]['anchor'] = $documentLink->nodeValue;
$matchedAnchors[$sl][$k]['img'] = 0;
if($documentLink->hasAttribute('rel'))
$matchedAnchors[$sl][$k]['rel'] = 'Y';
else
$matchedAnchors[$sl][$k]['rel'] = 'N';
}
}
}$k++;
}
}
}
Both #Phliplip & #lunixbochs have mentioned common cURL pitfalls (max execution time & denied by the target server.)
When sending that many cURL requests to the same server I try to "be nice" and place voluntarily sleep periods so I don't bombard the host. For a low-end site, 1000+ requests could be like a mini DDOS!
Here's code that's worked for me. I it used to scrape a client's product data from their old site, since the data was locked in a proprietary database system with NO export function.
<?php
header('Content-type: text/html; charset=utf-8', true);
set_time_limit(0);
$urls = array(
'http://www.example.com/cgi-bin/product?id=500',
'http://www.example.com/cgi-bin/product?id=501',
'http://www.example.com/cgi-bin/product?id=502',
'http://www.example.com/cgi-bin/product?id=503',
'http://www.example.com/cgi-bin/product?id=504',
);
$i = 0;
foreach($urls as $url){
echo $url."\n";
$curl = curl_init($url);
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($curl, CURLOPT_TIMEOUT, 25 );
$html = curl_exec($curl);
$html = #mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');
curl_close($curl);
// now do something with info returned by curl
$i++;
if($i%10==0){
sleep(20);
} else {
sleep(2);
}
}
?>
The main features are:
no max execution time
voluntary sleep-ing
new curl init & exec for each request.
In my experience, going to sleep() will stop servers from denying you.
However if by "different different server" you mean that you are sending a small number of requests a large number of servers, for example:
$urls = array(
'http://www.example-one.com/',
'http://www.example-two.com/',
'http://www.example-three.com/',
'http://www.example-four.com/',
'http://www.example-five.com/',
'http://www.example-six.com/'
);
And you are using set_time_limit(0); then something then an error may be causing your code to die; try
ini_set('display_errors',1);
error_reporting(E_ALL);
And tell us the error message you are getting.
PHP doesn't place a restriction on the number of connections using curl_multi_init, but memory usage and time limits will be an issue.
Check your memory_limit setting in your php.ini and try to increase it to see if that helps you.