First, let me say that I have read numerous scraping threads here and none have helped me. I also searched around the internet for days, and now that I am getting close to the wire I am hoping someone can shed some light on this for me.
I am using PHP Simple HTML DOM Parser to scrape some data from a page. The URL I am working with serves dynamic content, and I cannot get anything to work to pull that content in. I need to scrape the plain text from <tr id="0" class="ui-widget-content jqgrow ui-row-ltr" role="row"> through <tr id="9" class="ui-widget-content jqgrow ui-row-ltr" role="row">; I feel that once I get one to work, I can get the others. Because this info is not actually on the page when it loads, but rather comes into the fold after the page loads, I am in a rut.
With that said, here is what I have tried:
echo file_get_html('http://sheriffclevelandcounty.com/p2c/jailinmates.aspx')->plaintext;
The above will show me everything BUT the info I need.
I also tried using the parser's IMDb example, modified to my needs; this is it:
// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
$scraped_page = curl("http://sheriffclevelandcounty.com/p2c/jailinmates.aspx"); // Downloading the jail inmates page to the variable $scraped_page
$scraped_data = scrape_between($scraped_page, '<table id="tblII" class="ui-jqgrid-btable" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_tblII" style="width: 456px;">', '</table>'); // Scraping the downloaded data in $scraped_page for content between the jqGrid table's opening tag and </table>
echo $scraped_data; // Echoing the scraped data
Of course neither of these works, so my question is: how do I use PHP Simple HTML DOM Parser to get dynamic content that is loaded after the page loads? Is it possible, or am I completely on the wrong track here?
You need the dynamic data that populates the jqGrid. That data comes from a POST endpoint which returns it directly in its response, so you can call that endpoint instead of the page:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://sheriffclevelandcounty.com/p2c/jqHandler.ashx?op=s");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
    'rows' => 10000, // Here you can specify how many records you want
    't' => 'ii'
));
$output = curl_exec($ch);
curl_close($ch);
echo "<pre>";
print_r(json_decode($output));
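If you want to work with the result rather than just printing it, a minimal sketch along these lines could walk the decoded records. This assumes the endpoint returns the conventional jqGrid JSON shape with a top-level rows array; the field names are an assumption, so check the print_r output first.
$decoded = json_decode($output, true); // true = decode to associative arrays
if (is_array($decoded) && isset($decoded['rows'])) { // 'rows' is the conventional jqGrid key - verify it against the real response
    foreach ($decoded['rows'] as $row) {
        print_r($row); // each $row should be one inmate record
    }
}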
Related
I am creating a web scraper for personal use that scrapes car dealership sites based on my input, but several of the sites I am attempting to collect data from are blocked by a redirecting captcha page. The current site I am scraping with cURL returns this HTML:
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
I am using this to scrape the page:
<?php
function web_scrape($url)
{
    $ch = curl_init();
    $imei = "013977000272744";
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
    curl_setopt($ch, CURLOPT_POSTFIELDS, array(
        'imei' => $imei,
    ));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $server_output = curl_exec($ch);
    curl_close($ch); // close the handle before returning so this line is actually reached
    return $server_output;
}
echo web_scrape($url);
?>
And to reiterate what I want to do: I want to collect the ReCaptcha from this page, so that when I want to view the page details on an external site I can fill in the ReCaptcha there and then scrape the page originally entered.
Any response would be great!
Datadome is currently utilizing Recaptcha v2 and GeeTest captchas, so this is what your script should do:
Navigate to the redirect https://geo.captcha-delivery.com/captcha/?initialCid=….
Detect what type of captcha is used.
Obtain token for this captcha using any captcha solving service like Anti Captcha.
Submit the token, and check whether you were redirected to the target page.
Sometimes the target page contains an iframe with the address https://geo.captcha-delivery.com/captcha/?initialCid=.. , in which case you need to repeat from step 2 inside this iframe.
I'm not sure whether the steps above can be done with PHP, but you can do it with browser automation engines like Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must if you want to build professional scrapers; it's worth investing some time in YouTube lessons.
Here’s a script which does all steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js
You’ll need a proxy to bypass GeeTest protection.
Based on the high demand for code, HERE is my upgraded scraper that bypassed this specific issue. However, my attempt to obtain the captcha did not work, and I still have not solved how to obtain it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the Magic comes from. It bypasses ever peice of security carsales.com.au can throw at me
function get_web_page( $url ) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_SSL_VERIFYPEER => false // Disabled SSL Cert checks
);
$ch = curl_init( $url ); //initiate the Curl program that we will use to scrape data off the webpage
curl_setopt_array( $ch, $options ); //set the data sent to the webpage to be readable by the webpage (JSON)
$content = curl_exec( $ch ); //creates function to read pages content. This variable will be used to hold the sites html
$err = curl_errno( $ch ); //errno function that saves all the locations our scraper is sent to. This is just for me so that in the case of a error,
//I can see what parts of the page has it seen and more importantly hasnt seen
$errmsg = curl_error( $ch ); //check error message function. for example if I am denied permission this string will be equal to: 404 access denied
$header = curl_getinfo( $ch ); //the information of the page stored in a array
curl_close( $ch ); //Closes the Curler to save site memory
$header['errno'] = $err; //sending the header data to the previously made errno, which contains a array path of all the places my scraper has been
$header['errmsg'] = $errmsg; //sending the header data to the previously made error message checker function.
$header['content'] = $content; //sending the header data to the previously made content checker that will be the variable holder of the webpages HTML.
return $header; //Return all the pages data and my identifying functions in a array. To be used in the presentation of the search results.
};
//using the function we just made, we use the url genorated by the form to get a developer view of the scraping.
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = end($response_dev); //takes only the end of the developer response because the rest is for my eyes only in the case that the site runs into a issue
I am trying to scrape torrentz2.eu search results using old (dead) torrentz.eu scraper code.
When I run http://localhost/jits/torz/api.php?key=kabali
it shows me a warning and a null value:
Notice: Undefined variable: results_urls in /Applications/XAMPP/xamppfiles/htdocs/jits/torz/api.php on line 59
null
Why? Can anybody tell me what's wrong with the code? Here it is:
<?php
$t = $_GET['key'];
// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
?>
<?php
// Defining the basic scraping function
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
?>
<?php
$url = "https://torrentz2.eu/search?f=$t"; // Assigning the URL we want to scrape to the variable $url
$results_page = curl($url); // Downloading the results page using our curl() function
//var_dump($results_page);
//die();
$results_page = scrape_between($results_page, "<dl><dt>", "<a href=\"http://www.viewme.com/search?q=$t\" title=\"Web search results on ViewMe\">"); // Scraping out only the middle section of the results page that contains our results
$separate_results = explode("</dd></dl>", $results_page); // Exploding the results into separate parts in an array
// For each separate result, scrape the URL
foreach ($separate_results as $separate_result) {
    if ($separate_result != "") {
        $results_urls[] = scrape_between($separate_result, "\">", "<b>"); // Scraping each result's URL and adding it to our URL array
    }
}
//print_r($results_urls); // Printing out our array of URLs we've just scraped
if ($_GET["key"] === null) {
    echo "Keyword Missing ";
} else if (isset($_GET["key"])) {
    echo json_encode($results_urls);
}
?>
For the old torrentz.eu scraper code, see: GIT repo
First, you get the notice "Undefined variable: results_urls" because $results_urls is used without being defined first. Define it, then use it.
Do something like:
// $results_urls defined here:
$results_urls = [];
// For each separate result, scrape the URL
foreach ($separate_results as $separate_result) {
    if ($separate_result != "") {
        $results_urls[] = scrape_between($separate_result, "\">", "<b>"); // Add each scraped URL to our URL array
    }
}
Secondly, null is printed because $results_urls never gets populated: $separate_results is not being populated correctly and contains just a single empty value.
I debugged further and found that the value of $results_page is false. So whatever you are trying to do in the scrape_between function is not working as expected. Fix that function.
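To narrow down where it breaks, a small sketch of my own (assuming the curl() and scrape_between() functions defined above) is to test each stage separately before slicing:
// Debugging sketch: check each stage of the pipeline on its own
$results_page = curl($url);
if ($results_page === false || $results_page === "") {
    die("curl() failed - check the URL, your network, and SSL settings");
}
if (stristr($results_page, "<dl><dt>") === false) {
    die("start marker <dl><dt> not found - the site's markup may have changed");
}
// Only proceed to scrape_between() once both checks pass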
The following PHP code using foreach does not seem to work. I believe it has to do with "<a href='/$value/access'>".
I've shared the entire codebase.
Does anyone know what is wrong with my statement?
// include/functions.php
<?php
// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
?>
// code.php
<?php
// Include functions
include_once("include/functions.php");
// Set URL
$url = "https://www.instituteforsupplymanagement.org/ismreport/mfgrob.cfm";
$source = curl($url);
// Collect dataset
$arr = array("PMI", "New Orders");
foreach ($arr as $value) {
    $data = scrape_between($source, "<strong>$$value", "</tr>");
    print_r($data);
}
?>
Within a double-quoted string, $$value is not expanded as a variable variable: only $value is interpolated and the first $ stays a literal character, so you end up with strings like '$PMI'. Why not just assign your variable variable to a temp variable:
$start = $$value;
$data = scrape_between($source,"<strong>$start","</tr>");
Also, I don't know how you plan to use "New Orders" as a variable name, what with the space and all.
Maybe you're just trying to use the strings "PMI" and "New Orders" in the string, in which case just drop the extra $ (see the sketch below).
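For completeness, here is a minimal sketch of that last interpretation (an assumption on my part, reusing the asker's own scrape_between() and $source): interpolate the labels themselves rather than variable variables.
$arr = array("PMI", "New Orders");
foreach ($arr as $value) {
    // A single $ interpolates the label string itself, e.g. "<strong>PMI"
    $data = scrape_between($source, "<strong>$value", "</tr>");
    print_r($data);
}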
I am trying to perform a cURL request within a form. Basically I have three pages: monitoringform.php, results.php and curl.php. When the user enters a MAC address into the form and presses submit, I would like the MAC address value to be processed through curl.php, the result obtained, and then posted to results.php. The cURL function works fine, but I cannot get it to POST to results.php. While I was playing around I noticed the IP sometimes shows up, but it is incorrect. Could it be old data from the session? Any help is greatly appreciated!
monitoringform.php
<?php session_start(); ?>
<h3>Monitoring Request Form</h3>
<form name="form1" id="form1" action="" method="post">
<label for="regularInput">Modem MAC</label>
<input type="text" name="modemmac" id="modemmac" />
<button type="submit" name="submit">Submit Form</button>
</form>
results.php
<?php session_start(); ?>
<?php
include "curl.php";
$mactosearch = $_POST['modemmac'];
$mymessage = $_GET["message"];
if ($mymessage == "success") {
    echo "<b>Modem IP Address:</b> $modemip <br />";
} else {
    echo "You are getting error";
}
?>
curl.php
<?php
// Defining the basic cURL function
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
$scraped_page = curl("http://localhost?q_mac=$mactosearch&fields=only:ip");
$modemip = scrape_between($scraped_page, "<ip>", "</ip>");
?>
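One wiring issue is visible in the snippets above (an observation, not a tested fix): the form's empty action posts back to monitoringform.php rather than results.php, and curl.php reads $mactosearch before results.php assigns it, because the include runs first. A minimal sketch of results.php with the order corrected:
<?php
// results.php - sketch: assign the MAC before including curl.php,
// and point the form's action attribute at this script
session_start();
$mactosearch = $_POST['modemmac']; // must exist before curl.php uses it
include "curl.php"; // curl.php can now build its URL from $mactosearch
echo "<b>Modem IP Address:</b> $modemip <br />";
?>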
I am taking part in a beauty competition, and I require that I be nominated.
The nomination form requires my details, and my nominators details.
My nominators may have a problem switching back and forth between my email containing my details and the nomination form, which may discourage them from filling in the form in the first place.
The solution I came up with is to create an HTML page (which I have 100% control over) that already contains my pre-filled details, so that the nominators do not get confused filling in my details; all I have to do is ask them for their own details.
Now I want my HTML form to pass the details on to another website (the competition organiser's website) and have the form automatically filled in, so that all the nominators have to do is click submit on the competition's website. I have absolutely no control over the competition's website, so I cannot add or change any programming code there.
How can I pass the data from my own HTML page (100% under my control) to a third-party PHP page?
Any examples of coding are appreciated.
Thank you xx
The same origin policy makes this impossible unless the competition organiser were to grant you permission using CORS (in which case you could load their site in a frame and modify it using JavaScript to manipulate its DOM … in supporting browsers).
The form they are using submits the form data to a mailing script which is secured by checking the referer (at least). You could use something like cURL in PHP to spoof the referer, like this (not tested):
function get_web_page( $url, $curl_data )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // return web page
        CURLOPT_HEADER => false, // don't return headers
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", // who am i
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 120, // timeout on response
        CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
        CURLOPT_POST => 1, // i am sending post data
        CURLOPT_POSTFIELDS => $curl_data, // these are my post vars
        CURLOPT_SSL_VERIFYHOST => 0, // don't verify ssl
        CURLOPT_SSL_VERIFYPEER => false, // don't verify the peer's certificate
        CURLOPT_REFERER => "http://fashionawards.com.mt/nominationform.php", // the spoofed referer
        CURLOPT_VERBOSE => 1 // verbose output for debugging
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $err = curl_errno($ch);
    $errmsg = curl_error($ch);
    $header = curl_getinfo($ch);
    curl_close($ch);
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
$curl_data = "nameandsurname_nominator=XXXX&id_nominator=XXX.....etc....";
$url = "http://www.logix.com.mt/cgi-bin/FormMail.pl";
$response = get_web_page($url, $curl_data);
print '<pre>';
print_r($response);
print '</pre>';
In the line where it says $curl_data = "nameandsurname_nominator=XXXX&id_nominator=XXX.....etc...."; you can set the post variables according to their names in the original form.
Thus you could make your own form submit to their mailing script and have some of the fields populated with what you need...
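As a side note, rather than hand-assembling the query string, PHP's http_build_query() will build and URL-encode it for you (a sketch; the field names are just the placeholders from the example above):
$curl_data = http_build_query(array(
    'nameandsurname_nominator' => 'XXXX', // named exactly as in the original form
    'id_nominator' => 'XXX',
    // ...the remaining fields from the original form...
));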
BEWARE: You may easily get disqualified or run into legal troubles for using such techniques! The recipient may very easily notice that the form has been compromised!