Delay the response by some seconds when scraping a website

I am working on a PHP scraping server, so I have a list of websites to loop over, and I fetch the content of each page to extract the data I want.
The problem is that some sites are not fully returned; as far as I can see, some data appears only after the page has fully loaded.
I tried both of these methods, but I can't get the full page.
First method:
$opts = array('http' =>
array(
'method' => 'GET',
'timeout' => 10
) );
$context = stream_context_create($opts);
$html = file_get_contents('some url',false,$context);
echo $html;
Second method:
$html = implode('',file('some url'));
echo $html;
I just want to return the content of the page 1 or 2 seconds after the page has loaded.
For example, with this URL I can't get the search results, only this:
Résultats
News Photos Vidéos Tags Filtre par date
Précédente Suivante

Things are not as they seem.
Waiting longer will not help, because file_get_contents() and file() only download the HTML and never run the page's JavaScript. The URL you actually want to hit is
https://api.swiftype.com/api/v1/public/engines/search.json, because on load the web page makes a JSON request to that URL.
To that URL you have to POST the following JSON:
$search = array(
    "engine_key" => "naxCjQ58frTkB_diETvu",
    "page" => 1,
    "q" => "kardas",
    "per_page" => 12,
    "sort_direction" => "",
    "filters" => array("page" => array("category" => "News")),
    "facets" => array("page" => array("0" => "tag"))
);
A quick guide:
"page" is the number of the results page you want to get,
"q" is the term you want to search for,
"per_page" is the number of entries returned per page; try some values, 12 is the default,
the rest you will have to figure out yourself.
A code example that works:
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch,CURLOPT_URL,"https://api.swiftype.com/api/v1/public/engines/search.json");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,json_encode($search));
curl_setopt($ch,CURLOPT_POST, true);
curl_setopt($ch,CURLOPT_HTTPHEADER, array('Content-Type: application/json; charset=utf-8'));
curl_setopt($ch,CURLOPT_HEADER, 0);
$data = curl_exec($ch);
curl_close($ch);
And to check the results:
print_r(json_decode($data));
This is great: it is as if they handed you an API on a plate...
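To fetch more than one page of results, the payload above can be generated per page. The following is only a minimal sketch under assumptions: build_search_payload() and fetch_search_page() are hypothetical names, and the field names are simply copied from the $search array above rather than taken from any documentation.

```php
<?php
// Hypothetical helper: builds the JSON body for one page of results.
// Field names are copied from the $search array in the answer.
function build_search_payload(string $engineKey, string $query, int $page = 1, int $perPage = 12): string
{
    return json_encode(array(
        'engine_key' => $engineKey,
        'page' => $page,
        'q' => $query,
        'per_page' => $perPage,
        'sort_direction' => '',
        'filters' => array('page' => array('category' => 'News')),
        'facets' => array('page' => array('0' => 'tag')),
    ));
}

// Hypothetical helper: POSTs one payload and returns the decoded response,
// or null when the request fails or the body is not a JSON array/object.
function fetch_search_page(string $payload): ?array
{
    $ch = curl_init('https://api.swiftype.com/api/v1/public/engines/search.json');
    curl_setopt_array($ch, array(
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $payload,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER => array('Content-Type: application/json; charset=utf-8'),
        CURLOPT_TIMEOUT => 10,
    ));
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $decoded = json_decode($body, true);
    return is_array($decoded) ? $decoded : null;
}
```

Usage would be a plain loop, for example calling fetch_search_page(build_search_payload($key, 'kardas', $page)) for $page = 1, 2, 3 and stopping at the first null or empty result.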

Related

Unable to get Healthline search results with PHP

I am trying to run a script that searches Healthline with a query string and determines whether there are any search results, but I can't get the page contents when I send the query string to the page. To search for something on their site, you go to https://www.healthline.com/search?q1=search+string.
Here is what I tried:
$healthline_url = 'https://www.healthline.com/search';
$search_string = 'ashwaganda';
$postdata = http_build_query(
array(
'q1' => $search_string
)
);
$opts = array('http' =>
array(
'method' => 'POST',
'header' => 'Content-type: application/x-www-form-urlencoded',
'content' => $postdata
)
);
$stream = stream_context_create($opts);
$theHtmlToParse = file_get_contents($healthline_url, false, $stream);
print_r($theHtmlToParse);
I also tried to just add the query string to the url and skip the stream, amongst other variations, but I'm running out of ideas. This also didn't work:
$healthline_url = 'https://www.healthline.com/search';
$search_string = 'ashwaganda';
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Content-Type: text/xml; charset=utf-8"
)
);
$stream = stream_context_create($opts);
$theHtmlToParse = file_get_contents($healthline_url.'?q1='.urlencode($search_string), false, $stream);
print_r($theHtmlToParse);
Any suggestions?
EDIT: I changed the url in case someone wants to look at the search page. Also fixed the query string. Still doesn't work.
In response to Ken Lee, I did try the following cURL script that also just returns the page without search results:
$healthline_url = 'https://www.healthline.com/search?q1=ashwaganda';
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $healthline_url);
$data = curl_exec($ch);
curl_close($ch);
print_r($data);

Healthline does not load the search results directly. Its search index is stored in Algolia, and the page makes extra JavaScript calls to retrieve the results. Therefore you cannot see the search results with file_get_contents().
To see the search results, you need a browser simulator that acts as a JavaScript-capable browser and properly runs the page.
For PHP developers, you may try php-webdriver to control browsers through WebDriver (e.g. Selenium, Chrome + chromedriver, Firefox + geckodriver).
Update: Didn't know that the target site is Healthline. Updated the answer once I found out.

extract reCaptcha from web page to be completed externally via cURL and then return results to view page

I am creating a web scraper for personal use that scrapes car dealership sites based on my input, but several of the sites that I am attempting to collect data from are blocked by a redirect to a captcha page. The site I am currently scraping with cURL returns this HTML:
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
I am using this to scrape the page:
<?php
function web_scrape($url)
{
$ch = curl_init();
$imei = "013977000272744";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
'imei' => $imei,
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch); // close the handle before returning; code after return never runs
return $server_output;
}
echo web_scrape($url);
?>
To reiterate what I want to do: I want to extract the reCAPTCHA from this page, so that when I view the page details on an external site I can fill in the reCAPTCHA there and then scrape the page that was originally requested.
Any response would be great!

DataDome currently uses reCAPTCHA v2 and GeeTest captchas, so this is what your script should do:
1. Navigate to the redirection https://geo.captcha-delivery.com/captcha/?initialCid=….
2. Detect what type of captcha is used.
3. Obtain a token for this captcha using any captcha solving service, such as Anti Captcha.
4. Submit the token and check whether you were redirected to the target page.
5. Sometimes the target page contains an iframe with the address https://geo.captcha-delivery.com/captcha/?initialCid=.. , so you need to repeat from step 2 in this iframe.
I’m not sure whether the steps above can be done with PHP, but you can do them with a browser automation engine like Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must if you want to build pro scrapers, and it is worth investing some time in YouTube lessons.
Here’s a script which does all steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js
You’ll need a proxy to bypass GeeTest protection.
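Before any of these steps, a cURL-based scraper has to notice that it received the block page rather than the real content. The following is a minimal sketch of that check only. It assumes the block page always carries a "var dd={...}" object with single-quoted keys and values, as in the HTML shown in the question; parse_datadome_block() is a hypothetical name, and the page format may change at any time.

```php
<?php
// Hypothetical helper: returns the fields of the "dd" object found in a
// block page (cid, hsh, t, host, ...), or null when the HTML has no such
// object and is therefore probably the real page.
function parse_datadome_block(string $html): ?array
{
    // Grab everything between the braces of "var dd={...}".
    if (!preg_match('/var\s+dd\s*=\s*\{(.*?)\}/s', $html, $m)) {
        return null;
    }
    // Collect every 'key':'value' pair inside the object.
    preg_match_all("/'([^']+)'\s*:\s*'([^']*)'/", $m[1], $pairs, PREG_SET_ORDER);
    $fields = array();
    foreach ($pairs as $pair) {
        $fields[$pair[1]] = $pair[2];
    }
    return $fields;
}
```

A scraper could call this on every response and switch to the browser-based flow only when the result is not null.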

Based on the high demand for code, here is my upgraded scraper, which got past this specific issue. However, my attempt to obtain the captcha did not work, and I still have not worked out how to obtain it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the magic comes from. It gets past every piece of security carsales.com.au can throw at me.
function get_web_page( $url ) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // return the page as a string instead of printing it
        CURLOPT_HEADER => false, // don't include response headers in the output
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // accept all encodings cURL supports
        CURLOPT_USERAGENT => "spider", // user agent string sent to the site
        CURLOPT_AUTOREFERER => true, // set the Referer header on redirect
        CURLOPT_CONNECTTIMEOUT => 120, // connection timeout in seconds
        CURLOPT_TIMEOUT => 120, // timeout for the whole request in seconds
        CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
        CURLOPT_SSL_VERIFYPEER => false // disables SSL certificate checks (insecure)
    );
    $ch = curl_init( $url ); // create a cURL handle for the URL
    curl_setopt_array( $ch, $options ); // apply all the options above to the handle
    $content = curl_exec( $ch ); // run the request; holds the page HTML, or false on failure
    $err = curl_errno( $ch ); // numeric cURL error code, 0 when there was no error
    $errmsg = curl_error( $ch ); // human-readable error message, empty when there was no error
    $header = curl_getinfo( $ch ); // array of transfer details (final URL, HTTP status code, timings, ...)
    curl_close( $ch ); // close the handle
    $header['errno'] = $err; // add the error code to the returned array
    $header['errmsg'] = $errmsg; // add the error message to the returned array
    $header['content'] = $content; // add the page HTML to the returned array
    return $header; // transfer details, error info and content together
}
// Using the function we just made, fetch the URL generated by the form.
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = $response_dev['content']; // keep only the page HTML; the rest is for debugging when the site causes an issue

cURL using info from mySQL, then storing the cURL'ed info

I'm programming in PHP.
An article I have found useful so far was mainly about how to cURL through one site with a lot of information, but what I really need is how to cURL multiple sites with not much information each: a few lines, as a matter of fact!
Also, the article focuses mainly on storing the result in a txt file on the FTP server, but I have loaded around 900 addresses into MySQL and want to load them from there and enrich the table with the information behind the links, which I provide below.
We have some open public libraries with addresses and information about these and an API.
Link to the main site:
The function I would like to use: http://dawa.aws.dk/adresser/autocomplete?q=
SQL Structure:
Data example: http://i.imgur.com/jP1J26U.jpg
For example, this address: Dornen 2 6715 Esbjerg N (called AdrName in the database).
http://dawa.aws.dk/adresser/autocomplete?q=Dornen%202%206715%20Esbjerg%20N
This will give me the following output (which I want to store in the AdrID in the database):
[
{
"tekst": "Dornen 2, Tarp, 6715 Esbjerg N",
"adresse": {
"id": "0a3f50b8-d085-32b8-e044-0003ba298018",
"href": "http://dawa.aws.dk/adresser/0a3f50b8-d085-32b8-e044-0003ba298018",
"vejnavn": "Dornen",
"husnr": "2",
"etage": null,
"dør": null,
"supplerendebynavn": "Tarp",
"postnr": "6715",
"postnrnavn": "Esbjerg N"
}
}
]
How do I store it all in a blob, as seen in the SQL structure?

If you want to make a cURL request in PHP, use this function:
function curl_download($Url){
// is cURL installed yet?
if (!function_exists('curl_init')){
die('Sorry cURL is not installed!');
}
// OK cool - then let's create a new cURL resource handle
$ch = curl_init();
// Now set some options (most are optional)
// Set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);
// Set a referer
curl_setopt($ch, CURLOPT_REFERER, "http://www.example.org/yay.htm");
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
// Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
And then you call it using
print curl_download('http://dawa.aws.dk/adresser/autocomplete?q=Melvej');
Or you can decode the JSON directly into a PHP object:
$jsonString=curl_download('http://dawa.aws.dk/adresser/autocomplete?q=Melvej');
var_dump(json_decode($jsonString));

The data you download is JSON, so you can store it in a varchar column rather than a blob.
Also, the site with the API does not seem to care about the HTTP referrer, user agent, etc., so you can use file_get_contents() in place of cURL.
So simply get all the results from your db, iterate over them, making a call to the api, and update the appropriate row with the correct data:
//get all the rows from your database
$addresses = DB::exec('SELECT * FROM addresses'); //i dont know how you actually access your db, this is just an example
foreach($addresses as $address){
$searchTerm = $address['AdrName'];
$addressId = $address['Vid'];
//download the json
$apidata = file_get_contents('http://dawa.aws.dk/adresser/autocomplete?q=' . urlencode($searchTerm));
//save back to db
DB::exec('UPDATE addresses SET AdrID=? WHERE Vid=?', [$apidata, $addressId]); // store the JSON in AdrID for the row identified by Vid
//if you want to access the data, you can use json_decode:
$data = json_decode($apidata);
echo $data[0]->tekst; //outputs Dornen 2, Tarp, 6715 Esbjerg N
}
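If only the address id is wanted in AdrID rather than the whole JSON document, it can be pulled out of the response first. This is a minimal sketch, assuming the response has the shape shown in the question (a list whose entries have an "adresse" object with an "id"); extract_address_id() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the "id" of the first match in an
// autocomplete response, or null when the JSON is invalid or has no match.
function extract_address_id(string $json): ?string
{
    $results = json_decode($json, true);
    if (!is_array($results) || !isset($results[0]['adresse']['id'])) {
        return null;
    }
    return (string) $results[0]['adresse']['id'];
}
```

Inside the loop, extract_address_id($apidata) could then be stored instead of $apidata, and rows where it returns null could be skipped or logged for a manual check.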

How to call posts from PHP

I have a website that uses the WP Super Cache plugin. I need to recycle the cache once a day, and then I need to call 5 posts (URL addresses) so that WP Super Cache puts these posts into the cache again (caching is quite time-consuming, so I'd like to have the posts precached before users come, so they don't have to wait).
On my hosting I can use CRON, but only for 1 call per hour, and I need to call 5 different URLs at once.
Is it possible to do that? Maybe create one HTML page with these 5 posts in iframes? Would something like that work?
Edit: Shell is not available, so I have to use PHP scripting.

The easiest way to do it in PHP is to use file_get_contents() (fopen() also works), if the HTTP stream wrapper is enabled on your server:
<?php
$postUrls = array(
'http://my.site.here/post1',
'http://my.site.here/post2',
'http://my.site.here/post3',
'http://my.site.here/post4',
'http://my.site.here/post5',
);
foreach ($postUrls as $url) {
// Get the post as a user would
$text = file_get_contents($url);
// Here you can check if the request was successful
// For example, use strpos() or regex to find a piece of text you expect
// to find in the post
// Replace 'copyright bla, bla, bla' with a piece of text you display
// in the footer of your site
if (strpos($text, 'copyright bla, bla, bla') === FALSE) {
echo('Retrieval of '.$url." failed.\n");
}
}
If file_get_contents() fails to open the URLs on your server (some ISPs restrict this behaviour), you can try to use cURL:
function curl_get_contents($url)
{
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_CONNECTTIMEOUT => 30, // timeout in seconds
CURLOPT_RETURNTRANSFER => TRUE, // tell curl to return the page content instead of just TRUE/FALSE
));
$text = curl_exec($ch);
curl_close($ch);
return $text;
}
Then use the function curl_get_contents() listed above instead of file_get_contents().
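If the five requests should really run at the same time rather than one after another, PHP's curl_multi functions can drive them concurrently. The following is a minimal sketch, not a tested production script: warm_urls() is a hypothetical name, the 50-second timeout is an assumption, and the URLs must be unique because they are used as array keys.

```php
<?php
// Sketch only: requests several URLs concurrently and returns the HTTP
// status code per URL (0 means the request did not complete).
function warm_urls(array $urls, int $timeout = 50): array
{
    if ($urls === array()) {
        return array(); // nothing to do
    }
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true, // keep the page body out of the output
            CURLOPT_TIMEOUT => $timeout,    // generous, since uncached pages are slow
        ));
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $codes = array();
    foreach ($handles as $url => $ch) {
        $codes[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $codes;
}
```

Called from the CRON page as warm_urls($postUrls), it returns an array such as URL => 200 that can be checked or logged. For only five posts the sequential loop is simpler and usually sufficient.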

An example using PHP without building a cURL request.
Using PHP's shell_exec(), you can have an extremely light function like so:
$siteList = array("http://url1", "http://url2", "http://url3", "http://url4", "http://url5");
foreach ($siteList as $site) {
    // escapeshellarg() keeps special characters in the URL from being interpreted by the shell
    $request = shell_exec('wget '.escapeshellarg($site));
}
Of course, this is not the most concise answer and not always a good solution. Also, if you actually want anything from the response, you will have to handle it differently than with cURL, but it is a low-impact option.

Thanks to Arkascha's tip, I created a PHP page that I call from CRON. This page contains a simple function using cURL:
function cache_it($Url){
if (!function_exists('curl_init')){
die('No cURL, sorry!');
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50); //higher timeout needed for cache to load
curl_exec($ch); //dont need it as output, otherwise $output = curl_exec($ch);
curl_close($ch);
}
cache_it('http://www.mywebsite.com/url1');
cache_it('http://www.mywebsite.com/url2');
cache_it('http://www.mywebsite.com/url3');
cache_it('http://www.mywebsite.com/url4');

curl is not working in Yii

When I use cURL in my core PHP file, it works fine and I get the expected result. My core PHP code is:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "http://stage.auth.stunnerweb.com/index.php?r=site/getUser");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($curl);
echo $data; // here I get the proper response
In the code above I am calling the getUser function, and I get a response from it.
But my problem is that when I use this same code in any Yii controller (I tried SiteController and Controller), it does not work:
public function beforeAction()
{
if(!Yii::app()->user->isGuest)
{
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL ,"http://stage.auth.stunnerweb.com/index.php?r=site/kalpit");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($curl);
echo $data;
}
else
return true;
}
Can't we use cURL like this in Yii?
Can you please suggest how to use cURL in Yii?
Thanks in advance.

Better to use yii-curl.
Setup instructions
Place Curl.php into the protected/extensions folder of your project.
In main.php, add the following to 'components':
'curl' => array(
    'class' => 'ext.Curl',
    'options' => array(/* additional curl options */)
),
Usage
To GET a page with default params:
$output = Yii::app()->curl->get($url, $params);
// $output will contain the result of the query
// $params - query that'll be appended to the url
To POST data to a page:
$output = Yii::app()->curl->post($url, $data);
// $data - data that will be POSTed
To PUT data:
$output = Yii::app()->curl->put($url, $data, $params);
// $data - data that will be sent in the body of the PUT
To set options before GET or POST or PUT:
$output = Yii::app()->curl->setOption($name, $value)->get($url, $params);
// $name & $value - CURL options
$output = Yii::app()->curl->setOptions(array($name => $value))->get($url, $params);
// pass key value pairs containing the CURL options

You are running your code inside a beforeAction() method which is not supposed to render any data at all. On top of that, you do not let the method return anything if the current user is a guest. Please read the API docs concerning this.
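The two points can be illustrated with a small stand-alone sketch. It is not Yii code: StubController stands in for Yii's base controller so the example runs on its own, and DemoController, the injected $fetch callable, the $isGuest flag, and canRun() are all hypothetical. In a real application the check would still be Yii::app()->user->isGuest and the fetch would be the cURL call from the question.

```php
<?php
// Stand-in for the framework's base controller (assumption for this sketch).
class StubController
{
    protected function beforeAction($action)
    {
        return true;
    }
}

class DemoController extends StubController
{
    public $remoteData = null; // filled in beforeAction(), shown later by a view
    private $fetch;            // hypothetical callable: URL in, response body out
    private $isGuest;          // stands in for Yii::app()->user->isGuest

    public function __construct(callable $fetch, bool $isGuest)
    {
        $this->fetch = $fetch;
        $this->isGuest = $isGuest;
    }

    protected function beforeAction($action)
    {
        if (!$this->isGuest) {
            // Store the response instead of echoing it from this hook.
            $this->remoteData = ($this->fetch)('http://stage.auth.stunnerweb.com/index.php?r=site/getUser');
        }
        // Return a boolean on every path so the action can run.
        return parent::beforeAction($action);
    }

    // Helper for this sketch only: exposes the hook's return value.
    public function canRun(string $action): bool
    {
        return $this->beforeAction($action);
    }
}
```

The action or view can then output $this->remoteData, and both logged-in users and guests get a boolean back from beforeAction().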
