something is not working in CURL scraping


I am trying to scrape torrentz2.eu search results using old (dead) torrentz.eu scraper code.
When I run http://localhost/jits/torz/api.php?key=kabali
it shows me a notice and a null value:
Notice: Undefined variable: results_urls in /Applications/XAMPP/xamppfiles/htdocs/jits/torz/api.php on line 59
null
Why? Can anybody tell me what's wrong with the code?
Here is the code:
<?php
$t= $_GET['key'];
// Defining the basic cURL function
function curl($url) {
// Assigning cURL options to an array
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0", // Setting the useragent
CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
);
$ch = curl_init(); // Initialising cURL
curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
?>
<?php
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
?>
<?php
$url = "https://torrentz2.eu/search?f=$t"; // Assigning the URL we want to scrape to the variable $url
$results_page = curl($url); // Downloading the results page using our curl() function
//var_dump($results_page);
//die();
$results_page = scrape_between($results_page, "<dl><dt>", "<a href=\"http://www.viewme.com/search?q=$t\" title=\"Web search results on ViewMe\">"); // Scraping out only the middle section of the results page that contains our results
$separate_results = explode("</dd></dl>", $results_page); // Exploding the results into separate parts in an array
// For each separate result, scrape the URL
foreach ($separate_results as $separate_result) {
if ($separate_result != "") {
$results_urls[] = scrape_between($separate_result, "\">", "<b>"); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
}
}
//print_r($results_urls); // Printing out our array of URLs we've just scraped
if($_GET["key"] === null) {
echo "Keyword Missing ";
} else if(isset($_GET["key"])) {
echo json_encode($results_urls);
}
?>
For reference, the old torrentz.eu scraper code comes from a Git repo.

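First, you get the notice "Undefined variable: results_urls" because $results_urls is only created inside the loop. If the loop never appends anything, the variable does not exist when it is passed to json_encode(). Define it first and then use it.
Do something like this: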
// $results_urls defined here:-
$results_urls = [];
// For each separate result, scrape the URL
foreach ($separate_results as $separate_result) {
if ($separate_result != "") {
$results_urls[] = scrape_between($separate_result, "\">", "<b>"); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
}
}
Second, null is printed because $results_urls is never populated: $separate_results is not populated correctly and contains just one value, which is empty.
I debugged further and found that $results_page is false. stristr() returns false when its input is false (a failed download) or when the start marker is not found, so whatever you are trying to do in scrape_between() is not working as expected. Check the markers against the HTML that torrentz2.eu actually returns and fix the function.
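To see which marker is missing, a stricter variant of the helper can make the failure explicit. This is only a sketch: scrape_between_checked is a hypothetical name, not part of the original scraper, and it returns null instead of false or an empty string when either marker is absent.

```php
<?php
// Hypothetical variant of scrape_between() that reports missing markers
// instead of silently returning false or an empty string.
function scrape_between_checked(string $data, string $start, string $end): ?string
{
    $from = stripos($data, $start);     // case-insensitive, like the original
    if ($from === false) {
        return null;                    // start marker not in the page
    }
    $from += strlen($start);            // skip past the start marker itself
    $to = stripos($data, $end, $from);  // look for the end marker after it
    if ($to === false) {
        return null;                    // end marker not in the page
    }
    return substr($data, $from, $to - $from);
}

// Small self-contained check with made-up HTML.
$html = '<dl><dt><a href="/abc">Title <b>kabali</b></a></dt></dl>';
var_dump(scrape_between_checked($html, '">', '<b>'));      // text between the markers
var_dump(scrape_between_checked($html, '<table>', '<b>')); // NULL: start marker missing
```

With this variant the calling code can test for null and stop early, rather than exploding an empty string into an array with one empty element.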

Related

How to use foreach to loop through a file with variable variable?

The following PHP code using foreach does not seem to work. I believe it has to do with "<a href='/$value/access'>".
I've shared the entire codebase.
Does anyone know what is wrong with my statement?
// include/functions.php
<?php
// Defining the basic cURL function
function curl($url) {
// Assigning cURL options to an array
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
);
$ch = curl_init(); // Initialising cURL
curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
?>
// code.php
<?php
//include functions
include_once("include/functions.php");
// Set URL
$url = "https://www.instituteforsupplymanagement.org/ismreport/mfgrob.cfm";
$source = curl($url);
// Collect dataset
$arr = array("PMI","New Orders");
foreach ($arr as $value) {
$data = scrape_between($source,"<strong>$$value","</tr>");
print_r($data);
}
?>

Within a double-quoted string, the first $ is treated as a literal character and only $value is interpolated, so you end up with strings like '$PMI'. Why not just assign your variable variable to a temp variable:
$start = $$value;
$data = scrape_between($source,"<strong>$start","</tr>");
Also, I don't know how you plan to use "New Orders" as a variable name, what with the space and all.
Maybe you're trying to use "PMI" and "New Orders" themselves in the string, in which case just drop the extra $.
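The difference is easy to check in isolation. A minimal sketch, using the same $value as the question; the variable names $with_extra_dollar and $without are made up for the illustration:

```php
<?php
$value = "PMI";

// In double quotes, "$$value" is a literal "$" followed by the value of $value.
$with_extra_dollar = "<strong>$$value";

// Dropping the extra "$" interpolates just the value.
$without = "<strong>$value";

var_dump($with_extra_dollar);
var_dump($without);
```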

Unable to log in with CURL... But do not know why

Here's my problem. A few months ago, I wrote a PHP script to log in to my account on a website. I was using cURL to log in and everything was fine. Then they updated the website, and now I am no longer able to log in. The problem does not seem to be with cURL itself, as I do not get any error from cURL; it is the website that tells me the login failed.
Here's my script:
<?php
require('simple_html_dom.php');
//Getting the website main page
$url = "http://www.kijiji.ca/h-ville-de-quebec/1700124";
$main = file_get_html($url);
$links = $main -> find('a');
//Finding the login page
foreach($links as $link){
if($link -> innertext == "Ouvrir une session"){
$page = $link;
}
}
$to_go = "http://www.kijiji.ca/".$page->href;
//Getting the login page
$main = file_get_html($to_go);
$form = $main -> find("form");
//Parsing the page for the login form
foreach($form as $f){
if($f -> id == "login-form"){
$cform = $f;
}
}
$form = str_get_html($cform);
//Getting my post data ready
$postdata = "";
$tot = count($form->find("input"));
$count = 0;
/*I've got here a foreach loop to find all the inputs in the form. As there are hidden input for security, I make my script look for all the input and get the value of each, and then add them in my post data. When the name of the input is emailOrNickname or password, I enter my own info there, then it gets added to the post data*/
foreach($form -> find("input") as $input){
$count++;
$postdata .= $input -> name;
$postdata .= "=";
if($input->name == "emailOrNickname"){
$postdata.= "my email address ";
}else if($input->name == "password"){
$postdata.= "my password";
}else{
$postdata .= $input -> value;
}
if($count<$tot){
$postdata .= "&";
}
}
//Getting my curl session
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => $to_go,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $postdata,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_COOKIESESSION => true,
CURLOPT_COOKIEJAR => 'cookie.txt'
));
$result = curl_exec ($ch);
curl_close ($ch);
echo $result;
?>
Neither cURL nor PHP returns any error. In fact, the script returns a page from the website, but that page tells me an error occurred, as if some POST data were missing.
What do you think could cause that? Could it be some missing curl_setopt options? I have no idea; do you?

$form = $main->find("form") matches every form on the page, and the first match
is <form id="SearchForm" action="/b-search.html">, not the login form.
Select the login form by its id instead: $form = $main->find('#login-form', 0)

Most likely the problem is that the site (server) checks cookies. This process mainly consists of two phases:
1) When you visit the site for the first time on some page, e.g. the login page, the server sets cookies with some data.
2) On each subsequent page visit or POST request, the server checks the cookies it has set.
So you have to reproduce this process in your script, which means you have to use cURL to get every page from the site, including the login page, which should be fetched with cURL, not file_get_html.
Furthermore, you have to set both CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE to the same absolute path ('cookie.txt' is a relative path) on each request. This is necessary to enable automatic cookie handling (session maintenance) across the entire series of requests (including redirects) the script performs.
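A rough sketch of what that could look like. The helper name cookie_session_options is hypothetical, and the field names emailOrNickname and password are taken from the question; whether the site accepts the request also depends on the hidden form fields, which are not shown here.

```php
<?php
// Hypothetical helper: builds cURL options that share one cookie file,
// so the GET of the login page and the later POST use the same session.
function cookie_session_options(string $url, string $cookieFile, ?string $postBody = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here for each request
    ];
    if ($postBody !== null) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = $postBody;
    }
    return $options;
}

$cookieFile = __DIR__ . '/cookie.txt'; // absolute path, as recommended above

// http_build_query() URL-encodes the values, which manual "name=value&" concatenation does not.
$postBody = http_build_query([
    'emailOrNickname' => 'my email address',
    'password'        => 'my password',
]);

$getOptions  = cookie_session_options('http://www.kijiji.ca/', $cookieFile);
$postOptions = cookie_session_options('http://www.kijiji.ca/', $cookieFile, $postBody);

// Usage (not executed here): pass each array to curl_setopt_array() on its own
// handle, run the GET first, close the handle, then run the POST.
```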

PHP cURL within a form

I am trying to perform a cURL request within a form. I have three pages: monitoringform.php, results.php and curl.php. When the user enters a MAC address into the form and presses submit, I would like the MAC address to be processed by curl.php and the result to be shown in results.php. The cURL function works fine, but I cannot get the form to POST to results.php. While I was playing around I noticed that the IP sometimes shows up, but it is incorrect. Could it be old data from the session? Any help is greatly appreciated!
monitoringform.php
<?php session_start(); ?>
<h3>Monitoring Request Form</h3>
<form name="form1" id="form1" action="" method="post">
<label for="regularInput">Modem MAC</label>
<input type="text" name="modemmac" id="modemmac" />
<button type="submit" name="submit">Submit Form</button>
</form>
results.php
<?php session_start(); ?>
<?php
include "curl.php";
$mactosearch = $_POST['modemmac'];
$mymessage=$_GET["message"];
if($mymessage=="success")
{
echo "<b>Modem IP Address:</b> $modemip <br />";
}
else{ echo "You are getting error";}
?>
curl.php
<?php
// Defining the basic cURL function
function curl($url) {
// Assigning cURL options to an array
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
);
$ch = curl_init(); // Initialising cURL
curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$scraped_page = curl("http://localhost?q_mac=$mactosearch&fields=only:ip");
$modemip = scrape_between($scraped_page, "<ip>", "</ip>");
}
?>

How to scrape dynamic data with PHP Simple HTML DOM Parser [duplicate]

This question already has answers here:
Scrape web page data generated by javascript
First, let me say that I have read numerous "scraping" threads on here and none have helped me. I have also searched the internet for days, and now that I am getting down to the wire I am hoping someone can shed some light on this for me.
I am using PHP Simple HTML DOM Parser to scrape some data from a page. The URL I am working with serves dynamic content, and I cannot get anything to work to pull that content in. I need to scrape the plain text from <tr id="0" class="ui-widget-content jqgrow ui-row-ltr" role="row"> to <tr id="9" class="ui-widget-content jqgrow ui-row-ltr" role="row">. I feel that once I get one to work I can get the others. Because this info is not actually in the page when it is loaded, but only arrives after the page loads, I am stuck.
With that said, here is what I have tried:
echo file_get_html('http://sheriffclevelandcounty.com/p2c/jailinmates.aspx')->plaintext;
The above shows me everything BUT the info I need.
I also tried the IMDb example from the plugin, modified to my needs:
// Defining the basic cURL function
function curl($url) {
// Assigning cURL options to an array
$options = Array(
CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
);
$ch = curl_init(); // Initialising cURL
curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$scraped_page = curl("http://sheriffclevelandcounty.com/p2c/jailinmates.aspx"); // Downloading the jail inmates page to variable $scraped_page
$scraped_data = scrape_between($scraped_page, '<table id="tblII" class="ui-jqgrid-btable" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_tblII" style="width: 456px;">', '</table>'); // Scraping downloaded data in $scraped_page for content between the opening table tag and </table>
echo $scraped_data; // Echoing the scraped data
Of course neither of these works, so my question is: how do I use the PHP Simple HTML DOM Parser to get dynamic content that is loaded after page load? Is it possible, or am I completely on the wrong track here?

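I understand that you need the dynamic data that appears in the jqGrid. The grid is not part of the page's HTML; it is filled from a separate URL that is called with POST and returns the data in its response. You can request that URL directly: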
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://sheriffclevelandcounty.com/p2c/jqHandler.ashx?op=s");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_POST, 1);
curl_setopt($ch,CURLOPT_POSTFIELDS, array(
'rows'=>10000, //Here you can specify how many records you want
't'=>'ii'
));
$output = curl_exec($ch);
curl_close($ch);
echo "<pre>";
print_r(json_decode($output));
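Once the response is decoded, the rows can be looped over as plain arrays. The sketch below assumes the response follows jqGrid's usual JSON layout with a "rows" key, which I have not verified for this site. The helper name extract_rows, the sample payload and its "name" field are all made up for the illustration.

```php
<?php
// Hypothetical helper: decodes a jqGrid-style JSON response into row arrays.
// The "rows" key is an assumption based on jqGrid's default layout.
function extract_rows(string $json): array
{
    $decoded = json_decode($json, true); // true => associative arrays
    if (!is_array($decoded) || json_last_error() !== JSON_ERROR_NONE) {
        return []; // not valid JSON (for example, an HTML error page)
    }
    $rows = $decoded['rows'] ?? [];
    return is_array($rows) ? $rows : [];
}

// Made-up sample payload, only to show the shape being assumed.
$sample = '{"records": 2, "rows": [{"name": "A"}, {"name": "B"}]}';
foreach (extract_rows($sample) as $row) {
    echo $row['name'], "\n";
}
```

In the real script, $output from curl_exec() would take the place of $sample, and print_r() on the decoded data shows which keys the site actually uses.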

cURL - How to getting last redirect address

I wrote some code in PHP.
I want to get the last redirect address on this site:
fluege.de
I am sending this:
$dep= "sFlightInput[accDep]=ZRH";
$arr= "sFlightInput[accArr]=VIE";
$depregion= "sFlightInput[accDepRegion]=";
$arrregion= "sFlightInput[accArrRegion]=";
$multidep= "sFlightInput[accMultiAirportDep]=ZRH";
$multiarr= "sFlightInput[accMultiAirportArr]=ZRH";
$ftype = "sFlightInput[flightType]=RT";
$depcity = "sFlightInput[depCity]=Zürich+-+Flughafen+(ZRH)+-+Schweiz";
$arrcity = "sFlightInput[arrCity]=Wien+-+Internationaler+Flughafen+(VIE)+-+Österreich";
$sdate = "sFlightInput[departureDate]=29.03.2014";
$srange = "sFlightInput[departureTimeRange]=2";
$rdate ="sFlightInput[returnDate]=05.04.2014";
$rrange = "sFlightInput[returnTimeRange]=2";
$adt = "sFlightInput[paxAdt]=1";
$chd ="sFlightInput[paxChd]=0";
$inf = "sFlightInput[paxInf]=0";
$cabin = "sFlightInput[cabinClass]=Y";
$airline = "sFlightInput[depAirline]=";
$send = $dep.$arr.$depregion.$arrregion.$multidep.$multiarr.$ftype.$depcity.$arrcity.$sdate.$srange.$rdate.$rrange.$adt.$chd.$inf.$cabin.$airline;
I am calling it like this:
echo getLastEffectiveUrl("http://www.fluege.de/flight/wait/".$send);
And here is the function:
function getLastEffectiveUrl($url)
{
// initialize cURL
$curl = curl_init($url);
curl_setopt_array($curl, array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
));
// execute the request
$result = curl_exec($curl);
// fail if the request was not successful
if ($result === false) {
curl_close($curl);
return null;
}
// extract the target url
$redirectUrl = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);
curl_close($curl);
return $redirectUrl;
}
The code gives this URL:
www.fluege.de/wait/?accDep=&accArr=&accDepRegion=&accArrRegion=&accMultiAirportDep=&accMultiAirportArr=&flightType=RT&depCity=Z%FCrich+-+Flughafen+%28ZRH%29+-+Schweiz&arrCity=Wien+-+Internationaler+Flughafen+%28VIE%29+-+%D6sterreich&departureDate=04.04.2014&departureTimeRange=2&returnDate=20.04.2014&returnTimeRange=2&paxAdt=1&paxChd=0&paxInf=0&cabinClass=Y&depAirline=
But I need:
http://www.fluege.de/flight/encodes/sFlightInput/5f8ccad612bafb69e7693f04cfaf1458/ (etc)

The code you provided does not handle cookies, so if the site you are querying requires them, your code won't work.
I checked http://php.net/manual/en/function.curl-setopt.php: cURL only handles cookies once a cookie option is set. Setting CURLOPT_COOKIEFILE to an empty string keeps them in memory for the lifetime of the handle, and CURLOPT_COOKIEJAR writes them to a file. By adding the following line to the array passed to curl_setopt_array, cookies are kept in a temporary file:
CURLOPT_COOKIEJAR => tempnam(sys_get_temp_dir(), 'cookiejar'),
However, I did not get your specific case to work. I noticed that the URL you create contains neither a question mark nor & separators between the parameters, and that it does not redirect at all; it returns 200 OK. I checked this using the following shell command:
curl -LI 'http://www.fluege.de/flight/wait/sFlightInput\[accDep\]=ZRHsFlightInput\[accArr\]=VIEsFlightInput\[accDepRegion\]=sFlightInput\[accArrRegion\]=sFlightInput\[accMultiAirportDep\]=ZRHsFlightInput\[accMultiAirportArr\]=ZRHsFlightInput\[flightType\]=RTsFlightInput\[depCity\]=Zürich+-+Flughafen+(ZRH)+-+SchweizsFlightInput\[arrCity\]=Wien+-+Internationaler+Flughafen+(VIE)+-+ÖsterreichsFlightInput\[departureDate\]=29.03.2014sFlightInput\[departureTimeRange\]=2sFlightInput\[returnDate\]=05.04.2014sFlightInput\[returnTimeRange\]=2sFlightInput\[paxAdt\]=1sFlightInput\[paxChd\]=0sFlightInput\[paxInf\]=0sFlightInput\[cabinClass\]=YsFlightInput\[depAirline\]='
If it's unclear what the URL should look like, you should contact fluege.de to ask them how to use their API.
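If the parameters are meant to be an ordinary query string, http_build_query() can produce the separators and the encoding. This is only a sketch: that the endpoint expects the fields after a "?" is my assumption, and only a few of the fields from the question are shown.

```php
<?php
// A few of the fields from the question; the rest would be added the same way.
$params = [
    'sFlightInput' => [
        'accDep'        => 'ZRH',
        'accArr'        => 'VIE',
        'flightType'    => 'RT',
        'departureDate' => '29.03.2014',
        'returnDate'    => '05.04.2014',
    ],
];

// Produces sFlightInput%5BaccDep%5D=ZRH&sFlightInput%5BaccArr%5D=VIE&...
$query = http_build_query($params);

// Assumption: the endpoint takes these as a normal query string after "?".
$url = 'http://www.fluege.de/flight/wait/?' . $query;
echo $url, "\n";
```

The resulting $url can then be passed to getLastEffectiveUrl() unchanged.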
