PHP scrapy cURL request not working - php

I am implementing a scrapy spider to crawl a website that contains real estate offers. The site shows the real estate agent's telephone number, which can be retrieved by an AJAX POST request.
To get a phone number I have to take the ID from the URL, then get the csrfToken from the page source, and send it by POST to a special URL that contains the ID. This method worked well, but since yesterday it has stopped working.
My code:
$urlSite = "https://www.otodom.pl/mazowieckie/oferta/piekne-mieszkanie-na-mokotowie-do-wynajecia-ID3ezHA.html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_URL, $urlSite);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
preg_match("/csrfToken = '(.+?)'/", $result, $output_array);
preg_match("/ID(.+?).html/", $urlSite, $output_array_id);
$token = $output_array[1];
$id = $output_array_id[1];
$url = "https://www.otodom.pl/ajax/mazowieckie/misc/contact/phone/" . $id . "/";
$headers = [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding: gzip, deflate, br',
'Accept-Language: pl,en-US;q=0.8,en;q=0.6,ru;q=0.4',
'Cache-Control: no-cache',
'Content-Type: application/x-www-form-urlencoded; charset=UTF-8',
'Content-Length: 74',
'Host: www.otodom.pl',
'Referer: ' . $urlSite,
'User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'
];
$data = array(
'CSRFToken' => $token
);
$data_string = http_build_query($data);
$ch = curl_init();
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$phone = utf8_decode(curl_exec($ch));
curl_close($ch);
echo $phone;
Please help me. I have been working on this for a few hours and got nowhere.

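{"status":"error","message":"Spróbuj wykonać operację ponownie. Jeśli to nie pomoże, sprawdź czy masz włączoną obsługę JavaScript w przeglądarce."}
(In English: "Try the operation again. If that does not help, check that JavaScript is enabled in your browser.")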
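A few things in the question's code can be tightened independently of the JavaScript issue: the dot in the ID regex is unescaped, Content-Length is hardcoded to 74, compressed responses are requested but never decoded, and the GET and the POST do not share cookies. The sketch below is only an illustration of those fixes, not a confirmed solution. The URL pattern and token regex are taken from the question; reusing one handle with an in-memory cookie engine is my assumption about what the server may expect. If the site now builds the token in JavaScript, this will still fail.

```php
<?php
/** Pull the listing ID out of an offer URL such as ...-ID3ezHA.html */
function extractListingId(string $url): ?string
{
    // The dot before "html" is escaped so it only matches a literal dot.
    return preg_match('/ID([^.\/]+)\.html/', $url, $m) ? $m[1] : null;
}

/** Pull the CSRF token out of the page source (pattern from the question). */
function extractCsrfToken(string $html): ?string
{
    return preg_match("/csrfToken = '(.+?)'/", $html, $m) ? $m[1] : null;
}

/** GET the offer page, then POST the token on the same handle so cookies are kept. */
function fetchPhone(string $offerUrl): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => '', // empty string: enable in-memory cookies
        CURLOPT_ENCODING       => '', // accept and decode supported compression
    ]);

    // Step 1: load the offer page and read the token from it.
    curl_setopt($ch, CURLOPT_URL, $offerUrl);
    $html  = curl_exec($ch);
    $id    = extractListingId($offerUrl);
    $token = is_string($html) ? extractCsrfToken($html) : null;
    if ($id === null || $token === null) {
        curl_close($ch);
        return null;
    }

    // Step 2: POST the token; cURL computes Content-Length itself.
    curl_setopt_array($ch, [
        CURLOPT_URL        => 'https://www.otodom.pl/ajax/mazowieckie/misc/contact/phone/' . $id . '/',
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query(['CSRFToken' => $token]),
        CURLOPT_REFERER    => $offerUrl,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    return is_string($response) ? $response : null;
}
```

Calling fetchPhone($urlSite) returns the raw response body, or null if the ID or token could not be found.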
As I mentioned in my comment, you need JavaScript in order to get the phone number. One way to achieve this is to use Selenium. Here's a Python example:
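import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
geckodriver = 'C:/path_to/geckodriver.exe'
# Selenium 4 API; older releases used executable_path= and find_element_by_class_name()
driver = webdriver.Firefox(service=Service(executable_path=geckodriver))
driver.get("https://www.otodom.pl/mazowieckie/oferta/piekne-mieszkanie-na-mokotowie-do-wynajecia-ID3ezHA.html")
# Click the element that reveals the phone number, then give the AJAX call time to finish
driver.find_element(By.CLASS_NAME, "phone-spoiler").click()
time.sleep(2)
print(driver.find_element(By.CLASS_NAME, "phone-number").text)
# 515 174 616
driver.quit()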
Notes:
1 - Install Selenium:
pip install selenium
2 - Download the geckodriver
3 - Replace C:/path_to with the path where you saved geckodriver.exe.
4 - Add C:/path_to to your PATH environment variable.
5 - Restart your system.
6 - Run python name_of_script.py and the phone number will be displayed.
The steps above assume that you're using a Windows machine.

Related

PHP cURL function returns 403 or 503 on websites that are working & public

I have a cURL function in PHP which I use to request the title and description of websites. I've tested it with more than 1,000 different websites (using my existing bookmarks list) and it works great, but some websites don't work because cURL returns a status code of either 403 or 503. For example, CodePen pens such as https://codepen.io/vrugtehagel/pen/eYJjYNm return a 503 or sometimes a 403 error.
This is the cURL function with the options that I have set up:
$ch = curl_init();
$header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-GB,en-US;q=0.9,en;q=0.8";
$header[] = "Pragma: no-cache";
// Set agent as MacBook Chrome
$user_agent_chrome = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36';
// Set cURL Options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent_chrome);
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Staging
// Execute & fetch page string data
$url_data = curl_exec($ch);
// Flag if cURL was terminated
$curl_err = curl_error($ch) ? true : false;
// Get the status (404 | 200 | ... )
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// Get final URL (if redirection happened)
$effective_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
// Terminate connection
curl_close($ch);
Since these websites are publicly accessible and they work on social media sites, I'm wondering what could be set up wrong. I've searched a lot and tried other methods. Could it be a cookie, or a particular referrer? Does anyone have a clue, or maybe you have your own well-tested cURL method/options that you can share with us?
Is there an ultimate "should work everywhere" cURL example somewhere on the Web we don't know about?
I've been trying to get around this problem for almost four months now, always giving up because I can't figure out why only a few minor websites are not working. I'd appreciate it if anyone can help.
Thanks in advance!
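One way to narrow this down is to look at the response headers that come back with the 403/503, since the function already sets CURLOPT_HEADER. The sketch below splits the raw response using CURLINFO_HEADER_SIZE and checks the Server header. The Cloudflare check is only a guess at one common cause (a CDN or bot-protection layer rejecting non-browser clients); the function names are made up for illustration.

```php
<?php
/** Split a raw response fetched with CURLOPT_HEADER into [headers, body]. */
function splitResponse(string $raw, int $headerSize): array
{
    return [substr($raw, 0, $headerSize), substr($raw, $headerSize)];
}

/** Parse the last header block (the one after any redirects) into a lowercase-keyed map. */
function parseHeaders(string $headerText): array
{
    $blocks  = preg_split('/\r?\n\r?\n/', trim($headerText));
    $headers = [];
    foreach (preg_split('/\r?\n/', (string) end($blocks)) as $line) {
        // Status lines such as "HTTP/1.1 503 ..." contain no colon and are skipped.
        if (strpos($line, ':') !== false) {
            [$name, $value] = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
    }
    return $headers;
}

/** Heuristic only: a 403/503 served by Cloudflare often means a bot challenge. */
function looksLikeBotBlock(int $status, array $headers): bool
{
    $server = strtolower($headers['server'] ?? '');
    return in_array($status, [403, 503], true) && strpos($server, 'cloudflare') !== false;
}
```

In the function from the question this would be called before curl_close($ch), for example: [$rawHeaders, $body] = splitResponse($url_data, curl_getinfo($ch, CURLINFO_HEADER_SIZE)); followed by looksLikeBotBlock($status, parseHeaders($rawHeaders)). If it returns true, changing headers alone is unlikely to help, because such pages usually expect a browser that runs JavaScript.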

Instagram API retrieve the code using PHP

I am trying to use the Instagram API, but it's really not easy.
According to the API documentation, a code must be retrieved in order to get an access token and then make requests to the Instagram API.
But after a few tries, I haven't succeeded.
I have already configured the settings at https://www.instagram.com/developer
I call the URL api.instagram.com/oauth/authorize/?client_id=[CLIENT_ID]&redirect_uri=[REDIRECT_URI]&response_type=code with cURL, but I don't get the redirect URI with the code in the response.
Can you help me, please? ;)
I would recommend using one of the existing PHP Instagram client libraries, such as https://github.com/cosenary/Instagram-PHP-API
I did this not too long ago, here's a good reference:
https://auth0.com/docs/connections/social/instagram
Let me know if it helps!
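For context on why the cURL call returns no code: in the OAuth authorization-code flow the code is appended to the redirect URI only after a user has logged in and approved the app in a browser, so a plain server-side request to the authorize URL just gets the login page. A minimal sketch of the two pieces you control is below. The function names are hypothetical; the URL and parameters are the ones from the question.

```php
<?php
/** Build the authorization URL that the user must open in a browser. */
function buildAuthorizeUrl(string $clientId, string $redirectUri): string
{
    // http_build_query URL-encodes the redirect URI for us.
    return 'https://api.instagram.com/oauth/authorize/?' . http_build_query([
        'client_id'     => $clientId,
        'redirect_uri'  => $redirectUri,
        'response_type' => 'code',
    ], '', '&');
}

/** On the redirect_uri page: read the code from the query string, or null on error. */
function readAuthorizationCode(array $query): ?string
{
    // OAuth 2 reports a denied or failed request through an "error" parameter.
    if (isset($query['error'])) {
        return null;
    }
    $code = $query['code'] ?? null;
    return is_string($code) && $code !== '' ? $code : null;
}
```

Typical use: redirect the visitor with header('Location: ' . buildAuthorizeUrl($id, $uri)); and, in the script that serves the redirect URI, call readAuthorizationCode($_GET).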
I wrote this code for a use case like yours. I hope it has no errors. Here is the code; I explain how it works below.
$authorization_url = "https://api.instagram.com/oauth/authorize/?client_id=".$instagram_client_id."&redirect_uri=".$your_website_redirect_uri."&response_type=code";
$username='ig_username';
$password='ig_password';
$_defaultHeaders = array(
'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Accept-Encoding: ',
'Connection: keep-alive',
'Upgrade-Insecure-Requests: 1',
'Cache-Control: max-age=0'
);
$ch = curl_init();
$cookie='/application/'.strtoupper(VERSI)."instagram_cookie/instagram.txt";
// First request: a plain GET to the authorization URL, using the $ch handle created above
curl_setopt( $ch, CURLOPT_POST, 0 );
curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
// $token is optional; in the original class it was a property
if(!empty($token)){
$_defaultHeaders[] = "Authorization: ".$token;
}
curl_setopt( $ch, CURLOPT_HTTPHEADER, $_defaultHeaders);
curl_setopt( $ch, CURLOPT_HEADER, true);
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_COOKIEFILE,getcwd().$cookie );
curl_setopt( $ch, CURLOPT_COOKIEJAR, getcwd().$cookie );
curl_setopt( $ch, CURLOPT_URL, $authorization_url );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
$result = curl_exec($ch);
// Final URL after redirects, i.e. the login page
$redirect_uri = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
$form = explode('login-form',$result)[1];
$form = explode("action=\"",$form)[1];
// vd('asd',$form);
$action = substr($form,0,strpos($form,"\""));
// vd('action',$action);
$csrfmiddlewaretoken = explode("csrfmiddlewaretoken\" value=\"",$form);
$csrfmiddlewaretoken = substr($csrfmiddlewaretoken[1],0,strpos($csrfmiddlewaretoken[1],"\""));
//finish getting parameter
$post_param['csrfmiddlewaretoken']=$csrfmiddlewaretoken;
$post_param['username']=$username;
$post_param['password']=$password;
//format instagram cookie from vaha's answer https://stackoverflow.com/questions/26003063/instagram-login-programatically
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $result, $matches);
$cookieFileContent = '';
foreach($matches[1] as $item)
{
$cookieFileContent .= "$item; ";
}
$cookieFileContent = rtrim($cookieFileContent, '; ');
$cookieFileContent = str_replace('sessionid=; ', '', $cookieFileContent);
$cookie=getcwd().'/application/'.strtoupper(VERSI)."instagram_cookie/instagram.txt";
$oldContent = file_get_contents($cookie);
$oldContArr = explode("\n", $oldContent);
if(count($oldContArr))
{
foreach($oldContArr as $k => $line)
{
if(strstr($line, '# '))
{
unset($oldContArr[$k]);
}
}
$newContent = implode("\n", $oldContArr);
$newContent = trim($newContent, "\n");
file_put_contents(
$cookie,
$newContent
);
}
// end format
$useragent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0";
$arrSetHeaders = array(
'origin: https://www.instagram.com',
'authority: www.instagram.com',
'upgrade-insecure-requests: 1',
'Host: www.instagram.com',
"User-Agent: $useragent",
'content-type: application/x-www-form-urlencoded',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Accept-Encoding: deflate, br',
"Referer: $redirect_uri",
"Cookie: $cookieFileContent",
'Connection: keep-alive',
'cache-control: max-age=0',
);
$ch = curl_init();
// $cookie is already an absolute path (built with getcwd() above)
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $arrSetHeaders);
curl_setopt($ch, CURLOPT_URL, $this->base_url.$action);
curl_setopt($ch, CURLOPT_REFERER, $redirect_uri);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_param));
sleep(5);
$page = curl_exec($ch);
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $page, $matches);
$cookies = array();
foreach($matches[1] as $item) {
parse_str($item, $cookie1);
$cookies = array_merge($cookies, $cookie1);
}
var_dump($page);
Step 1:
We need to get to the login page first.
We can reach it with a cURL GET request with CURLOPT_FOLLOWLOCATION set to true, so that we are redirected to the login page. We request our application's Instagram authorization URL:
$authorization_url = "https://api.instagram.com/oauth/authorize/?client_id=".$instagram_client_id."&redirect_uri=".$your_website_redirect_uri."&response_type=code";
This is step one of the flow in the Instagram documentation.
The first GET request gives us the response page and its final URI, which we store in $redirect_uri. It is needed later as the Referer header when we send the HTTP POST for the login.
After getting the login page, we need to format the cookie. For this I used some code from vaha's answer (linked in the code comment above).
Step 2:
After we get the login page, we extract the form's action URL and the value of the hidden csrfmiddlewaretoken input.
With those, we build the POST parameters for the login.
We must set the redirect URI, and not forget the cookie jar and the other header settings shown in the code above. After the login POST succeeds, Instagram will call your redirect URI, for example https://www.yourwebsite.com/save_instagram_code. There you must use or save your Instagram code to get the access token, using cURL again (I only explain how to get the code :D).
I wrote this in a short time. I'll update it with tested, working code when I have time. Feel free to suggest an edit with working code or a better explanation.
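The chained explode() calls in Step 2 raise an undefined-offset error as soon as the markup changes. As a rough alternative, the extraction could be wrapped in a function that returns null when something is missing. This is only a sketch: it assumes the token sits in an input named csrfmiddlewaretoken inside a form with a quoted action attribute, as in the code above, and the live page may look different.

```php
<?php
/**
 * Extract the login form's action URL and CSRF token from a page.
 * Returns ['action' => ..., 'token' => ...] or null if either is missing.
 */
function extractLoginForm(string $html): ?array
{
    // First <form ...> that has an action attribute, together with its contents.
    if (!preg_match('/<form\b[^>]*\baction="([^"]*)"[^>]*>(.*?)<\/form>/is', $html, $form)) {
        return null;
    }
    // The hidden token input inside that form (attribute order does not matter).
    if (!preg_match('/<input\b[^>]*\bname="csrfmiddlewaretoken"[^>]*>/i', $form[2], $input)) {
        return null;
    }
    if (!preg_match('/\bvalue="([^"]*)"/i', $input[0], $value)) {
        return null;
    }
    return [
        // Attribute values are HTML-escaped in the source, e.g. &amp; for &.
        'action' => html_entity_decode($form[1], ENT_QUOTES | ENT_HTML5),
        'token'  => $value[1],
    ];
}
```

With the variables from the answer, $form = extractLoginForm($result); gives $form['action'] and $form['token'], or null when the login page did not come back as expected.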

Using Curl from SSL server to download xml feed?

When I try this code on another server it works properly, but when I run it on a server where SSL is "installed", var_dump gives me an empty string.
$feedUrl = 'https://api.pinnaclesports.com/v1/feed?sportid=29&leagueid=1980-1977-1957-1958-1983-2421-2417-2418-2419-1842-1843-2436-2438-2196-2432-2036-2037-1928-1817-2386-2592-2081';
// Set your credentials here, format = clientid:password from your account.
$credentials = base64_encode("password");
// Build the header, the content-type can also be application/json if needed
$header[] = 'Content-length: 0';
$header[] = 'Content-type: application/xml';
$header[] = 'Authorization: Basic ' . $credentials;
// Set up a CURL channel.
$httpChannel = curl_init();
// Prime the channel
curl_setopt($httpChannel, CURLOPT_URL, $feedUrl);
curl_setopt($httpChannel, CURLOPT_RETURNTRANSFER, true);
curl_setopt($httpChannel, CURLOPT_HTTPHEADER, $header);
curl_setopt($httpChannel, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)' );
// Unless you have all the CA certificates installed in your trusted root authority, this should be left as false.
curl_setopt($httpChannel, CURLOPT_SSL_VERIFYPEER, false);
// This fetches the initial feed result. Next we will fetch the update using the fdTime value and the last URL parameter
$initialFeed = curl_exec($httpChannel);
//var_dump($initialFeed);
I already have a script on this SSL server that downloads CSV files from another URL, and it works normally, so I think the problem is in my headers. But then how does the same code work on other servers?
Try this. It basically says to do:
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_CAINFO, getcwd() . "/CAcerts/BuiltinObjectToken-EquifaxSecureCA.crt");
Or try this
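Before changing SSL options, it may help to see why the result is empty. curl_exec() returns false on failure, and curl_errno()/curl_error() say what went wrong (a certificate problem, for instance). A small diagnostic sketch is below. The function names are made up, and the client ID and password are placeholders in the clientid:password format mentioned in the question's comment.

```php
<?php
/** Build a Basic auth header from "clientid:password". */
function basicAuthHeader(string $clientId, string $password): string
{
    return 'Authorization: Basic ' . base64_encode($clientId . ':' . $password);
}

/** Fetch a URL and return the body together with cURL's own error details. */
function fetchWithDiagnostics(string $url, array $headers): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_SSL_VERIFYPEER => true, // keep verification on while diagnosing
        CURLOPT_SSL_VERIFYHOST => 2,
    ]);
    $body   = curl_exec($ch);
    $result = [
        'body'   => $body === false ? null : $body,
        'errno'  => curl_errno($ch),   // 0 means the transfer itself succeeded
        'error'  => curl_error($ch),   // human-readable reason, '' if none
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
    curl_close($ch);
    return $result;
}
```

For example, var_dump(fetchWithDiagnostics($feedUrl, [basicAuthHeader('clientid', 'password')])); shows whether the empty result is a cURL error, an HTTP error status, or a genuinely empty body.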

Scrape website with javascript using cURL

I am trying to scrape data from this website:
http://ntthnue.edu.vn/tracuudiem
When I submit the form with 'TS4740' in the SBD field, I successfully get the result. However, it does not work when I run this PHP cURL code:
<?php
function getData($id) {
$url = 'http://ntthnue.edu.vn/tracuudiem';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, ['sbd' => $id]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
echo getData('TS4740');
I just get the old page. Can anybody explain why? Thank you!
Make sure you add all the necessary headers and input data. The server that processes this request can do all kinds of checks to see whether it's a "valid" form request. As such, you need to make the request look as close to a regular browser request as possible.
Use tools like Chrome Dev Tools to see both the request and response headers that are sent between the server and your browser, to better understand what your cURL setup should look like. You can also use an app like Postman to make simulating the request easy and to see what works and what doesn't.
Working example:
<?php
function getData($id) {
$url = 'http://ntthnue.edu.vn/tracuudiem';
$ch = curl_init($url);
$postdata = 'namhoc=2015-2016&kythi_name=Tuy%E1%BB%83n+sinh+v%C3%A0o+l%E1%BB%9Bp+10&hoten=&sbd='.$id.'&btnSearch=T%C3%ACm+ki%E1%BA%BFm';
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Origin: http://ntthnue.edu.vn',
'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36',
'Content-Type: application/x-www-form-urlencoded',
'Referer: http://ntthnue.edu.vn/tracuudiem',
));
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
echo getData('TS4740');
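The hand-encoded $postdata string in the working example could also be built with http_build_query(), which URL-encodes each value, including $id. This is just a sketch of that variant: the field names and values are decoded from the string above, and the function name is made up.

```php
<?php
/** Build the form body for the score lookup; every value is URL-encoded. */
function buildSearchBody(string $id): string
{
    return http_build_query([
        'namhoc'     => '2015-2016',
        'kythi_name' => 'Tuyển sinh vào lớp 10',
        'hoten'      => '',          // empty string is kept as "hoten="
        'sbd'        => $id,
        'btnSearch'  => 'Tìm kiếm',
    ], '', '&');
}
```

In getData() this would replace the literal string: curl_setopt($ch, CURLOPT_POSTFIELDS, buildSearchBody($id));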

Getting Data from Bitsnoop API

I am using the following piece of code to get tracker data (converted from JSON to PHP) and find the sum total of the number of seeders from the BitSnoop API:
$hash = "98C5C361D0BE5F2A07EA8FA5052E5AA48097E7F6";
if(!function_exists("curl_init")) die("cURL extension is not installed");
$url = "http://bitsnoop.com/api/trackers.php?hash=" . $hash . "&json=1";
echo $url;
$ch=curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$r=curl_exec($ch);
curl_close($ch);
$myarr = json_decode($r,true);
print_r($myarr);
But the script is not able to retrieve ANY data from the URL.
Chrome's view-source works on the page, but other ways of retrieving the page source, such as viewsource.in or i-tools, don't seem to retrieve any data from the URL either.
Could anyone explain why this is so?
And please provide an alternative way to accomplish the retrieval.
Thanks in advance !
You should pretend to be a legit browser:
$hash = "98C5C361D0BE5F2A07EA8FA5052E5AA48097E7F6";
if(!function_exists("curl_init")) die("cURL extension is not installed");
$url = "http://bitsnoop.com/api/trackers.php?hash=" . $hash . "&json=1";
$headers = array(
'Host: bitsnoop.com',
'Connection: keep-alive',
'Cache-Control: max-age=0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36',
'Accept-Encoding: deflate,sdch',
'Accept-Language: ru,en-US;q=0.8,en;q=0.6');
echo $url;
$ch=curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$r=curl_exec($ch);
curl_close($ch);
$myarr = json_decode($r,true);
print_r($myarr);
It's also a good idea to test curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); however, I did not check whether they use an HTTP redirect.
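Once the request returns something, it is worth checking that it really is JSON before summing anything, because json_decode() silently returns null for an HTML error page. A rough sketch is below. The key name passed to sumField() is hypothetical, since the actual field names in the BitSnoop response are not shown here.

```php
<?php
/** Decode a JSON response, or return null if it is missing or not valid JSON. */
function decodeJsonOrNull($raw): ?array
{
    if (!is_string($raw) || $raw === '') {
        return null; // curl_exec() returned false or an empty body
    }
    $data = json_decode($raw, true);
    return (json_last_error() === JSON_ERROR_NONE && is_array($data)) ? $data : null;
}

/** Sum one numeric field across all rows, skipping rows that lack it. */
function sumField(array $rows, string $field): int
{
    $total = 0;
    foreach ($rows as $row) {
        if (is_array($row) && isset($row[$field]) && is_numeric($row[$field])) {
            $total += (int) $row[$field];
        }
    }
    return $total;
}
```

With the variables from the answer: $rows = decodeJsonOrNull($r); and, if it is not null, sumField($rows, 'seeders'), where 'seeders' stands in for whatever key the API actually uses.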
