I have a small web page that, every day, displays a one-word answer, either Yes or No, depending on some other factor that changes daily.
Underneath this, I have a Facebook Like button. I want this button to post, in the title/description, either "Yes" or "No", depending on the verdict that day.
I have set up the OG metadata dynamically, using PHP to echo the correct string into og:title etc. But Facebook caches the value, so someone sharing my page on Tuesday can easily end up posting the wrong content to Facebook.
I have confirmed this is the issue by using the Facebook object debugger. As soon as I force a refresh, all is well. I attempted to automate this using curl, but this doesn't seem to work.
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, "http://developers.facebook.com/tools/lint/?url={http://ispizzahalfprice.com}");
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
Am I missing some easy fix here? Or do I need to re-evaluate my website structure to achieve what I am looking for (e.g. use two separate pages)?
Here's the page in case it's useful: http://ispizzahalfprice.com
Using two separate URLs would be the safe bet. As you have observed, Facebook caches URL scrapes quite heavily. You've also seen that you, as the admin of the app, can flush and refresh Facebook's cache by pulling the page through the debugger again.
Using two URLs would solve this issue because Facebook could cache the results all they want: there would still be one URL for "yes" and a separate one for "no".
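A minimal sketch of the two-URL idea. The paths /yes and /no are assumptions made for illustration (they are not on the original page), and the boolean verdict is assumed to be computed elsewhere. Each verdict gets its own og:url, so a cached scrape of one URL can never show the other answer.

```php
<?php
// Hypothetical paths: "/yes" and "/no" are assumptions for illustration.
const BASE_URL = 'http://ispizzahalfprice.com';

// Return the URL that should be shared for the given verdict.
function verdictUrl(bool $isHalfPrice): string
{
    return BASE_URL . ($isHalfPrice ? '/yes' : '/no');
}

// Build the Open Graph tags for one verdict page.
function ogTags(bool $isHalfPrice): string
{
    $title = $isHalfPrice ? 'Yes' : 'No';
    $url = htmlspecialchars(verdictUrl($isHalfPrice), ENT_QUOTES);

    return '<meta property="og:url" content="' . $url . '" />' . "\n"
         . '<meta property="og:title" content="' . $title . '" />' . "\n";
}

// Print the tags for a "Yes" day.
echo ogTags(true);
```

The Like button on the main page would then point at verdictUrl() for the current day rather than at the bare domain.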
I'm running into an issue with cURL while getting customer review data from Google (without the API). Previously my cURL request worked just fine, but it seems Google now redirects all requests to a cookie consent page.
Below you'll find my current code:
$ch = curl_init('https://www.google.com/maps?cid=4493464801819550785');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
print_r($result);
$result now just prints "302 Moved. The document had moved here."
I also tried setting curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0); but that didn't help either.
Does anyone have an idea how to overcome this? Can I programmatically deny (or accept) Google's cookies somehow? Or maybe there is a better way of handling this?
What you need is the following:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
The above option tells curl to follow redirects. However, I am not sure whether what is returned will be of much use for the specific URL you are trying to fetch. With this option set, you will obtain the HTML source of the final page Google redirects to. But that page contains scripts that, when executed, load the map and the other content that is ultimately displayed in your browser. So if you need data that is loaded afterwards by JavaScript, you will not find it in the returned result. Instead you should look into using a tool like Selenium with PHP (you might take a look at this post).
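A small sketch of a fetch helper built around that option. The function name and the shape of the returned array are assumptions for illustration; the curl options themselves are standard. It also reports the final URL, which helps to confirm where the redirects ended up.

```php
<?php
// Fetch a URL and follow any redirects; returns status, final URL and body.
// The function name and result keys are hypothetical, chosen for this sketch.
function fetchFollowingRedirects(string $url, int $maxRedirects = 10): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow 3xx responses
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects); // guard against redirect loops
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $body = curl_exec($ch);
    $result = [
        'ok'        => $body !== false,
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'body'      => $body === false ? '' : $body,
        'error'     => curl_error($ch),
    ];

    // The handle is released when $ch goes out of scope at the end of the function.
    return $result;
}
```

Calling fetchFollowingRedirects() with the Maps URL from the question and inspecting 'final_url' should show whether the request ended on the consent page or on the page you wanted.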
I had a simple parser for an external site, needed to confirm that the link a user submitted leads to an account that user owns (by parsing a link to their profile from the linked page). It worked for a good long while with just this WordPress function:
function fetch_body_url($fetch_link){
$response = wp_remote_get($fetch_link, array('timeout' => 120));
return wp_remote_retrieve_body($response);
}
But then the website changed something in their Cloudflare protection, and now this returns Cloudflare's "Please wait..." page with no way to get past it.
The thing is, I don't even need it done automatically: if there were a captcha, the user could complete it. But it shows nothing other than an endlessly spinning "checking your browser".
I googled a bunch of curl examples, and the best I could get so far is this:
<?php
$url='https://ficbook.net/authors/1000'; //random profile from requested website
$agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_REFERER, 'https://facebook.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
$response = curl_exec($ch);
curl_close($ch);
echo '<textarea>'.$response.'</textarea>';
?>
Yet it still returns the browser check screen. Adding a random free proxy doesn't seem to work either, or maybe I wasn't lucky enough to find a working one (or couldn't figure out how to insert it correctly in this case). Is there any way around it? Or perhaps there is some other way to see whether there is a specific keyword/link on the page?
OK, I've spent most of the day on this problem, and it seems I've got it more or less sorted. Not exactly the way I expected, but hey, it works... sort of.
Instead of solving this on the server side, I ended up looking for a solution to parse it on my own PC (it has better uptime than my hosting's server anyway). It turns out there are plenty of ready-to-use open source scrapers, including ones that know how to bypass Cloudflare being extra defensive for no good reason.
Solution for Python dummies like myself:
1. Install Anaconda if you don't have Python installed yet.
2. In cmd, type pip install cloudscraper
3. Open Spyder (it comes with Anaconda) and paste this:
import cloudscraper
scraper = cloudscraper.create_scraper()
print(scraper.get("https://your-parse-target/").text)
4. Save it anywhere and hit the run button to test. If it works, your data appears in the console window of the same app.
5. Replace print with whatever you're going to do with that data.
In my specific case I also had to install mysql-connector-python and enable remote access to the MySQL database (and my hosting had it available for free all this time, huh?). So instead of directly verifying that the user is the owner of the profile they entered, there's now a queue, which isn't perfect, but oh well, they'll have to wait.
First, the user's request is saved to MySQL. My local Python script checks that table every now and then to see if anything is waiting to be verified. It gets the page's content and saves it back to MySQL. Then the old PHP parser does its job like before, but on the content fetched from MySQL instead of from the actual website.
Perhaps there are better solutions that don't require resorting to measures like creating a separate local parser, but maybe this will help someone running into a similar issue.
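For the last step, checking the saved page content for a link to the user's profile, a rough sketch could look like this. The function name and the example URLs are made up for illustration, and the regular expression is a simplification: a real HTML parser would be more robust against unusual markup.

```php
<?php
// Return true if $html contains an <a href="..."> pointing at $profileUrl.
// Hypothetical helper; trailing slashes are ignored when comparing.
function pageLinksTo(string $html, string $profileUrl): bool
{
    // Collect every quoted href value found inside an <a ...> tag.
    if (!preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/is', $html, $matches)) {
        return false;
    }

    $wanted = rtrim($profileUrl, '/');
    foreach ($matches[2] as $href) {
        // Decode entities such as &amp; before comparing.
        if (rtrim(html_entity_decode($href), '/') === $wanted) {
            return true;
        }
    }
    return false;
}

// Example with made-up markup and URLs.
$html = '<p><a class="u" href="https://example.com/users/42/">me</a></p>';
var_dump(pageLinksTo($html, 'https://example.com/users/42'));
```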
I am making a plugin for WordPress, and lately I have been having some problems; I'm not sure whether they are plugin related or not.
The plugin is made to pull videos and their description, tags, thumbnail, etc.
When I type a search term into my plugin and hit enter, the code goes to the YouTube search page, searches for videos and pulls data from the results.
The problem is that videos are not pulled every time I search. Sometimes it works and sometimes it doesn't, regardless of whether the search terms are the same.
Here's an example of the code. It's a bit long, so I'll just set the search terms in a variable instead of a form.
$searchterms = 'funny cat singing';
$get_search = rawurlencode($searchterms);
$searchterm = str_replace('+', ' ', $get_search);
$getfeed = curl_init();
curl_setopt($getfeed, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($getfeed, CURLOPT_URL, 'https://www.youtube.com/results?search_query='.$searchterm.'');
curl_setopt($getfeed, CURLOPT_RETURNTRANSFER, true);
curl_setopt($getfeed, CURLOPT_CONNECTTIMEOUT, 20);
$str = curl_exec($getfeed);
curl_close($getfeed);
$feedURL = str_get_html($str);
foreach($feedURL->find('ol[id="search-results"] li') as $video) {
// get info like thumb, time, etc.
}
So, as I said, sometimes I get the videos updated and sometimes I don't.
How can I record actions in a log file so that I know what's happening when I press search?
Something like
Pulling videos
Search terms: https://www.youtube.com/results?search_query=funny+cat+singing
And then, if I get a response from YouTube, something like
Page found, pulling videos.
Or if the page is not found
Page not found, didn't get response from youtube.
If the page is found, then the next step is to see whether the search term actually returns something, etc.
If I just knew the basics of how to start with logging, I could customize it later based on what info I need to log.
Any advice?
You may try out one of these two tutorials
http://www.devshed.com/c/a/php/logging-with-php/
http://www.hotscripts.com/blog/php-error-log-file-archive/
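As a starting point, here is a minimal file logger sketched in plain PHP. The function name logMessage and the log file location are assumptions for illustration; the messages are the ones from the question. $str stands in for the value returned by curl_exec().

```php
<?php
// Append one timestamped line to a log file; returns false if the write fails.
function logMessage(string $logFile, string $message): bool
{
    $line = '[' . date('Y-m-d H:i:s') . '] ' . $message . PHP_EOL;
    // FILE_APPEND keeps earlier entries; LOCK_EX avoids interleaved writes.
    return file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX) !== false;
}

// Hypothetical log location; pick a directory the web server can write to.
$logFile = sys_get_temp_dir() . '/video-search.log';
$searchUrl = 'https://www.youtube.com/results?search_query=funny+cat+singing';

logMessage($logFile, 'Pulling videos');
logMessage($logFile, 'Search terms: ' . $searchUrl);

// Placeholder for the result of curl_exec(); false means the request failed.
$str = false;
if ($str === false || $str === '') {
    logMessage($logFile, "Page not found, didn't get response from youtube.");
} else {
    logMessage($logFile, 'Page found, pulling videos.');
}
```

In the plugin, the same logMessage() calls would go right after curl_exec() and inside the foreach loop, so that an empty result list can be told apart from a failed request.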
I am just playing around trying to learn PHP and decided to write a PHP page that could pull info from the leagueoflegends boards. The problem I am having is that the site needs me to log in first. I've tried just
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://forums.euw.leagueoflegends.com/board');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_REFERER, "http://leagueoflegends.com");
$html = curl_exec($ch);
curl_close($ch);
echo $html;
and I have tried
file_get_contents('http://forums.euw.leagueoflegends.com/board/')
but every time I get nowhere. I was hoping that being logged in on another tab would allow me to get the source of pages on the forums, but that doesn't seem to be the case. I honestly don't even know where to go from here or what I should be searching for to give me a clue. Normally I like to post a little more info, but like I said, I am trying to learn PHP; I seem to learn best by just jumping in.
First, good luck on your path of learning PHP! Curl is mighty powerful, but lately I've been using Guzzle instead (guzzlephp.org) for its ease of use.
Most sites that have login mechanisms do in fact use sessions or cookies to identify users, so you are on the right path. What you have above will simply retrieve the main board page. From here, you'll submit a second curl request to log in. The login page there is:
https://account.leagueoflegends.com/login
That actually pops up a modal window though and uses a captcha. You'll submit the following form fields:
username
password
recaptcha_response_field
to: https://account.leagueoflegends.com/auth
Since this has a captcha, your best bet may be to log in as yourself in a browser, export your cookie data for this domain, and see if you can reuse it in your script. The cookies will expire at some point, so this won't be fully automated.
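A sketch of reusing exported cookies, under the assumption that you have copied the cookie names and values out of your browser by hand. The cookie names below are placeholders, not the site's real cookie names, and both function names are made up for this example. CURLOPT_COOKIE sets the contents of the Cookie request header.

```php
<?php
// Turn ['name' => 'value', ...] into a Cookie header value.
// Values are sent verbatim, exactly as copied from the browser.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Fetch a page while presenting the exported cookies; returns the body or false.
function fetchWithCookies(string $url, array $cookies)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIE, buildCookieHeader($cookies));
    return curl_exec($ch);
}

// Placeholder names and values; replace them with what your browser shows.
$cookies = ['session_id' => 'PASTE_VALUE_HERE', 'remember' => '1'];
echo buildCookieHeader($cookies), PHP_EOL;
```

If the fetched page still shows the login form, the session has most likely expired and the cookies need to be exported again.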
I am trying to CURL this URL so that it automatically adds a product to a basket
http://www.juno.co.uk/cart/add/440551/01/
When I follow the URL in the browser, it adds the product to the basket.
When I CURL it, it doesn't add it.
This is my CURL code:
$url = "http://www.juno.co.uk/cart/add/440551/01/";
$c = curl_init();
curl_setopt($c, CURLOPT_URL,"$url");
$file_path = 'cookies.txt';
curl_setopt($c,CURLOPT_POST,true);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 50);
curl_setopt($c,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($c, CURLOPT_RETURNTRANSFER,1);
curl_setopt($c, CURLOPT_COOKIEJAR, $file_path);
$complete = curl_exec($c);
curl_close($c);
Any ideas? CURL is definitely set up on my server as I am successfully using it for other scripts.
You can see the output here http://soundshelter.net/addjuno.php?id=440551 - it is redirecting to the page that I expect it to (i.e. adding the item to basket) but I do not want to redirect the user to this page - only ping the page so that the item is added to basket but the user remains on my page. Any ideas?
Thanks in advance
The cart (or something about it: id, contents, etc.) is stored in a session. You would have to create a custom function to which you can pass the id of the cart, and then you could update it.
EDIT:
If this were possible, it would be a security risk (adding items to anybody's cart?).
The user is identified by a session id. You would need to "steal" it from your visitor and call the URL via curl as if you were the user (I think you can create cookies for the curl session and set the session id). But of course this is very similar to stealing cookie/session data, and there are defenses against it.
In my opinion, the only possible solution is if juno.co.uk has a public API for such operations.
Answer may be as simple as you shouldn't need to POST, that might be causing problems since you aren't sending/specifying any data. What I mean is to comment out that line:
//curl_setopt($c,CURLOPT_POST,true);
sidebar: Can you show the output that you do get?