I had a simple parser for an external site that's required to confirm that the link a user submitted leads to an account that user owns (by parsing a link to their profile from the linked page). It worked for a good long while with just this WordPress function:
function fetch_body_url($fetch_link){
    $response = wp_remote_get($fetch_link, array('timeout' => 120));
    return wp_remote_retrieve_body($response);
}
But then the website changed something in their Cloudflare defense, and now all I get is Cloudflare's "Please wait..." page with no way past it.
Thing is, I don't even need this done automatically - if there were a captcha, the user could've completed it. But it shows nothing other than the endlessly spinning "Checking your browser".
I googled a bunch of curl examples, and the best I could get so far is this:
<?php
$url='https://ficbook.net/authors/1000'; //random profile from the requested website
$agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, 'https://facebook.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
$response = curl_exec($ch);
curl_close($ch);
echo '<textarea>'.$response.'</textarea>';
?>
Yet it still returns the browser-check screen. Adding a random free proxy doesn't seem to work either - or maybe I wasn't lucky enough to find a working one (or couldn't figure out how to plug it in correctly in this case). Is there any way around it? Or perhaps there is some other way to check whether a specific keyword/link is present on the page?
Ok, I've spent most of the day on this problem, and it seems like I've got it more or less sorted. Not exactly the way I expected, but hey, it works... sort of.
Instead of solving this on the server side, I ended up looking for a way to parse it on my own PC (it has better uptime than my hosting's server anyway). It turns out there are plenty of ready-to-use open-source scrapers, including ones that know how to bypass Cloudflare being extra defensive for no good reason.
A solution for Python dummies like myself:
1. Install Anaconda if you don't have Python installed yet.
2. In cmd, type pip install cloudscraper
3. Open Spyder (it comes along with Anaconda) and paste this:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://your-parse-target/").text)
Save it anywhere and hit the Run button to test. If it works, you'll see your data in the console window of the same app.
Replace print with whatever you're going to do with that data.
For my specific case it also required installing mysql-connector-python and enabling remote access to the MySQL database (and my hosting had that available for free all this time, huh?). So instead of directly verifying that the user owns the profile they submitted, there's now a queue - which isn't perfect, but oh well, they'll have to wait.
First, the user's request is saved to MySQL. My local Python script checks that table every now and then to see if anything is in line to be verified, fetches the page's content and saves it back to MySQL. Then the old PHP parser does its job like before, but from a MySQL fetch instead of the actual website. Roughly like the sketch below.
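A minimal sketch of the server-side half of that queue, assuming a WordPress context and a hypothetical wp_profile_check_queue table (the table and column names are made up for illustration; the local Python script is what fills in page_html):
<?php
// Hypothetical queue table: id, profile_url, page_html, status ('pending' | 'fetched').
function queue_profile_check($profile_url) {
    global $wpdb;
    // 1. Save the user's request; the local Python script will pick it up later.
    $wpdb->insert('wp_profile_check_queue', array(
        'profile_url' => esc_url_raw($profile_url),
        'status'      => 'pending',
    ));
    return $wpdb->insert_id;
}

function verify_profile_from_queue($queue_id, $expected_link) {
    global $wpdb;
    // 2. Once the Python script has stored the page, parse it like before -
    //    just from MySQL instead of hitting the site directly.
    $html = $wpdb->get_var($wpdb->prepare(
        "SELECT page_html FROM wp_profile_check_queue WHERE id = %d AND status = 'fetched'",
        $queue_id
    ));
    if ($html === null) {
        return null; // still waiting in the queue
    }
    return strpos($html, $expected_link) !== false;
}
?>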
Perhaps there are better solutions that don't require resorting to measures like a separate local parser, but maybe this will help someone running into a similar issue.
Related
I'm running into an issue with cURL while getting customer review data from Google (without the API). My cURL request used to work just fine, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document has moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone have an idea of how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above curl option is what tells curl to follow redirects. However, I am not sure whether what is returned will be of much use for the specific URL you are trying to fetch. By adding the above option you will obtain the HTML source of the final page Google redirects to. But this page contains scripts that, when executed, load the map and other content that is ultimately displayed in your browser. So if you need to fetch data from what is subsequently loaded by JavaScript, you will not find it in the returned results. Instead you should look into using a tool like Selenium with PHP (you might take a look at this post); a rough sketch follows.
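As a minimal sketch of that Selenium route, assuming a Selenium server running locally on port 4444 and the php-webdriver package installed via Composer (both are assumptions, not part of the original setup):
<?php
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;

// Connect to the (assumed) locally running Selenium server.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());

// Load the page in a real browser so its JavaScript actually runs.
$driver->get('https://www.google.com/maps?cid=4493464801819550785');
sleep(5); // crude wait for the dynamic content; an explicit wait condition would be cleaner

// Now the source includes whatever the scripts rendered.
$html = $driver->getPageSource();
$driver->quit();

echo $html;
?>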
I am just playing around trying to learn PHP and decided to write a PHP page that could pull info from the League of Legends boards. The problem I am having is that the site needs me to log in first. I've tried just
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://forums.euw.leagueoflegends.com/board');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_REFERER, "http://leagueoflegends.com");
$html = curl_exec($ch);
curl_close($ch);
echo $html;
and I have tried
file_get_contents('http://forums.euw.leagueoflegends.com/board/')
but every time I get nowhere. I was hoping that being logged in in another tab would allow me to get the source of pages on the forums, but that doesn't seem to be the case. I honestly don't even know where to go from here or what I should be searching for to give me a clue. Normally I like to post a little more info, but like I said, I'm trying to learn PHP; I seem to learn best by just jumping in.
First, good luck on your path of learning PHP! Curl is mighty powerful, but lately I've been using Guzzle instead (guzzlephp.org) for its ease of use.
Most sites that have login mechanisms do in fact use sessions or cookies to identify users, so you are on the right path. What you have above will simply retrieve the main board page. From there, you'll submit a second curl request to log in. The login page is:
https://account.leagueoflegends.com/login
That actually pops up a modal window, though, and uses a captcha. You'll submit the following form fields:
username
password
recaptcha_response_field
to: https://account.leagueoflegends.com/auth
Since this has a captcha, your best bet may be to log in as yourself, export your cookie data for this domain, and see if you can reuse it in your script (sketched below). It'll expire at some point, so this won't be fully automated.
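A minimal sketch of reusing an exported cookie file, assuming you've saved the browser's cookies for the domain in Netscape/curl format as exported_cookies.txt (the file name is just an example):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://forums.euw.leagueoflegends.com/board');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
// Send the cookies you exported from your own logged-in browser session.
curl_setopt($ch, CURLOPT_COOKIEFILE, 'exported_cookies.txt');
// Let curl keep any refreshed cookies for later requests in this script.
curl_setopt($ch, CURLOPT_COOKIEJAR, 'exported_cookies.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// If the cookies are still valid, $html should now be the logged-in view of the board.
echo $html;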
I have a small web page that, every day, displays a one word answer - either Yes or No - depending on some other factor that changes daily.
Underneath this, I have a Facebook like button. I want this button to post, in the title/description, either "Yes" or "No", depending on the verdict that day.
I have set up the OG metadata dynamically, using PHP to echo the correct string into og:title etc. But Facebook caches the value, so someone sharing my page on Tuesday can easily end up posting the wrong content to Facebook.
I have confirmed this is the issue using the Facebook Object Debugger. As soon as I force a refresh, all is well. I attempted to automate this using curl, but that doesn't seem to work.
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, "http://developers.facebook.com/tools/lint/?url={http://ispizzahalfprice.com}");
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
Am I missing some easy fix here? Or do I need to re-evaluate my website structure to achieve what I am looking for (e.g. use two separate pages)?
Here's the page in case it's useful: http://ispizzahalfprice.com
Using two separate URLs would be the safe bet. As you have observed, Facebook does quite heavy caching of URL scrapes. You've also seen that you, as the admin of the app, can flush and refresh Facebook's cache by pulling the page through the debugger again.
Using two URLs would solve this issue because Facebook could cache the results all they want: there would still be a separate URL for "yes" and one for "no". Something along these lines:
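A minimal sketch of what that could look like, assuming hypothetical /yes and /no pages on your site and a made-up get_daily_verdict() helper (the paths and the helper are illustrations, not your actual code); the Facebook JS SDK is assumed to already be on the page since you have a like button:
<?php
// Each verdict gets its own URL, so Facebook can cache each one forever.
$verdict   = get_daily_verdict(); // returns 'yes' or 'no' (hypothetical helper)
$share_url = 'http://ispizzahalfprice.com/' . ($verdict === 'yes' ? 'yes' : 'no');
?>
<!-- Point the like button at the verdict-specific URL instead of the homepage -->
<div class="fb-like" data-href="<?php echo htmlspecialchars($share_url); ?>" data-layout="button_count"></div>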
I'm trying to log in to an external webpage using a PHP script with cURL. I'm new to cURL, so I feel like I'm missing a lot of pieces. I found a few examples and modified them to allow access to HTTPS pages. Ultimately, my goal is to be able to log in to the page and download a .csv by following a specified link once logged in. So far, what I have is a script that tests logging in to the page; the script is shown below:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.websiteurl.com/login');
curl_setopt($ch, CURLOPT_POSTFIELDS,'Email='.urlencode($login_email).'&Password='.urlencode($login_pass).'&submit=1');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
curl_setopt($ch, CURLOPT_REFERER, "https://www.websiteurl.com/login");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$output = curl_exec($ch);
I have a few questions. First, is there a reason this does not redirect on its own? The only way for me to view the contents of the page is to
echo $output
even though CURLOPT_RETURNTRANSFER and CURLOPT_FOLLOWLOCATION are both set to True.
Second, the URL of the page stays at "localhost/folderName/test.php" instead of going to the actual website. Can anyone explain why this happens? Because the script doesn't actually redirect to a logged-in webpage, I can't seem to do anything that I need to do.
Does my issue have to do with cookies? My cookies.txt file is in the same folder as my .php script (I'm using WampServer, btw). Should it be located elsewhere?
Once I'm able to fix these two issues, it seems that all I need to do is follow the link that starts the download process for the .csv file.
Thanks for any help, much appreciated!
Answering part of your question:
From http://php.net/manual/en/function.curl-setopt.php :
CURLOPT_RETURNTRANSFER TRUE to return the transfer as a string of the
return value of curl_exec() instead of outputting it out directly.
In other words, it's doing exactly what you described: it returns the response as a string, and you echo it to see it. As requested...
----- EDIT-----
As for the second part of your question - when I change the last three lines of the script to
$output = curl_exec($ch);
header('Location:'.$website);
echo $output;
The address of the page as displayed changes to $website - which in my case is the variable I use to store my equivalent of your 'https://www.websiteurl.com/login'.
I am not sure that is what you wanted to do - because I'm not sure I understand what your next steps are. If you were getting redirected by the login site, wouldn't the new address be part of the header that is returned? And wouldn't you need to extract that address in order to perform the next request (wget or whatever) in order to download the file you wanted to get?
To do so, you need to set CURLOPT_HEADER to TRUE.
You can get the URL where you ended up from
$last_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
(see cURL , get redirect url to a variable ).
The same link also has a useful script for completely parsing the header information (returned when CURLOPT_HEADER == true); it's in the answer by nico limpica. Roughly along these lines:
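As a minimal sketch of both ideas together (splitting the headers off the body via CURLINFO_HEADER_SIZE and reading the final URL via CURLINFO_EFFECTIVE_URL) - just an illustration, not the exact code from that post:
$ch = curl_init('https://www.websiteurl.com/login');
curl_setopt($ch, CURLOPT_HEADER, true);          // include the response headers in the output
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$response = curl_exec($ch);

// Split the headers from the body using the reported header size.
$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$headers     = substr($response, 0, $header_size);
$body        = substr($response, $header_size);

// The URL curl actually ended up on after all redirects.
$last_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

echo $last_url;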
Bottom line: CURL gets the information that your browser would have received if you had pointed it to a particular site; that doesn't mean your browser behaves as though you pointed it to that site...
Having trouble with Google blocking my IPs when querying Google for content matches. I've got 300 private IPs and have no trouble connecting to Google with a desktop app (with the same IPs) that performs a similar function. Yet when I crank it up on my server using cURL, my IPs get temporarily blocked - "your machine may be sending automated queries, please try again in 30 secs". So there must be a footprint somewhere.
Anyhow, here's my code:
function file_get_contents_curl($url, $proxy = true) {
    global $proxies;
    App::import('Vendor', 'proxies');

    $proxies = $this->shuffle_assoc($proxies);
    $proxy_ip = $proxies[array_rand($proxies, 1)]; //proxy IP here
    $proxy = $proxy_ip.':60099';
    $loginpassw = 'myusername:mypassword'; //proxy login and password here

    $ch = curl_init();
    if ($proxy) {
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        //curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);
        curl_setopt($ch, CURLOPT_PROXYTYPE, 'HTTP');
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_PROXYUSERPWD, $loginpassw);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)');
    }
    curl_setopt($ch, CURLOPT_HEADER, 1);
    #curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    //Set curl to return the data instead of printing it to the browser.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    //echo $data;
    curl_close($ch);
    return $data;
}
And I've verified that the IP is being set and that I'm connecting through the proxy.
Anyone got any ideas?
I tried with SOCKS5, but no difference. The trouble with the Google API is that you only get 100 queries per day.
HTTP proxies as well as SOCKS proxies can be used; there is no difference when scraping Google results.
There are multiple possible reasons why you get detected:
1) Your proxies are of bad quality or shared (maybe without your knowledge)
2) Your proxies are all in only one or two subnets / too sequential
3) You query Google too fast or too often
You should not query Google with one IP more than about 20 times per hour; that's just a rough value that works and doesn't get punished by the search engine.
So you should implement a delay based on your proxy count, along the lines of the sketch below.
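A minimal sketch of that kind of throttling, assuming $proxies is the same proxy list used above and $urls is whatever list of queries you loop over (both are placeholders; the 20-per-hour figure is just the rough value mentioned):
// Rough rule of thumb: at most ~20 Google queries per hour per IP.
$requests_per_hour_per_ip = 20;
$proxy_count = count($proxies);

// Rotating through all proxies, the pool can sustain roughly 20 * N requests per hour,
// i.e. one request about every 3600 / (20 * N) seconds.
$delay_seconds = (int) ceil(3600 / ($requests_per_hour_per_ip * max(1, $proxy_count)));

foreach ($urls as $url) {
    $data = file_get_contents_curl($url);
    // ... process $data ...
    sleep($delay_seconds); // keep each IP under the per-hour budget
}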
But if reason 1) or 2) is true, then even that won't help; you'll need another IP solution.
Check out the Google Rank Scraper (http://google-rank-checker.squabbel.com/); it's a free PHP project for scraping Google and includes proper delay routines you could use in your own code.
Also, the caching functions might prove useful for you, as you don't want to query Google more often than required.
And not to forget:
If you get detected, make your script STOP automatically!
You just cause trouble by going on; detection means you did something wrong.
HTTP proxies don't guarantee your privacy. You may try to use SOCKS instead.
But you'd be better off using the Google API. A rough sketch:
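As a minimal sketch of that API route, assuming the Custom Search JSON API with a hypothetical API key and search engine ID (the 100 free queries per day mentioned above refers to this API; the credentials and query are placeholders):
$api_key = 'YOUR_API_KEY';          // hypothetical credentials
$cx      = 'YOUR_SEARCH_ENGINE_ID'; // custom search engine ID
$query   = urlencode('some phrase to check');

$url = "https://www.googleapis.com/customsearch/v1?key={$api_key}&cx={$cx}&q={$query}";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

// The response is JSON; the matching results live under the "items" key.
$results = json_decode($json, true);
print_r(isset($results['items']) ? $results['items'] : array());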