Get data from XML sites - php

I want to get some data from a site that serves its pages as XML.
The problem is that I need to be logged in to it as a PublicUser, without a password.
I have tried:
$url = 'http://IP/wcd/system_counter.xml';
$content = file_get_contents($url);
echo $content;
But I only get this:
err En ReloginAttempt /wcd/index.html false
This is the XML used for logging in:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="top.xsl" type="text/xsl"?>
<MFP>
<SelNo>Auto</SelNo>
<LangNo>En</LangNo>
<Service><Setting><AuthSetting><AuthMode><AuthType>None</AuthType>
<ListOn>false</ListOn>
<PublicUser>true</PublicUser>
<BoxAdmin>false</BoxAdmin>
</AuthMode><TrackMode><TrackType>None</TrackType></TrackMode></AuthSetting>
<MiddleServerSetting><ControlList><ArraySize>0</ArraySize></ControlList><Screen>
<Id>0</Id></Screen></MiddleServerSetting>
</Setting></Service><LangDummy>false</LangDummy></MFP>
Is there a way to send the user as well when I request the XML info?

A plain file_get_contents($url) call sends an ordinary GET request, so it cannot submit the posted login information this page requires. One way to post it is cURL. Something along these lines:
$ch = curl_init($url); // The url you want to call
curl_setopt_array(
$ch, array(
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_POST => 1,
CURLOPT_POSTFIELDS => $login_xml, // The xml login in string form
)
);
//getting response from server
$response = curl_exec($ch);
echo curl_error($ch);
curl_close($ch);
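As a side note, file_get_contents can also send a POST body when it is given a stream context, so cURL is one option rather than the only one. The following is a minimal sketch, assuming the device accepts the login XML as a POST body. The helper name build_login_context, the Content-Type value and the shortened XML are illustrative assumptions, not taken from the device's documentation.

```php
<?php
// Hypothetical helper: wraps the login XML in a stream context so that
// file_get_contents() sends it as a POST body instead of a plain GET.
function build_login_context(string $loginXml)
{
    return stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: text/xml\r\n", // assumed content type
            'content' => $loginXml,
            'timeout' => 10,
        ),
    ));
}

// Shortened stand-in for the login XML shown in the question.
$login_xml = '<MFP><PublicUser>true</PublicUser></MFP>';
$context = build_login_context($login_xml);

// Usage (needs the real device address, so it is left commented out):
// $content = file_get_contents('http://IP/wcd/system_counter.xml', false, $context);
```

Whether the device then keeps you logged in for the next request (for example through a cookie) is something to check against its actual responses.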

Related

extract reCaptcha from web page to be completed externally via cURL and then return results to view page

I am creating a web scraper for personal use that scrapes car dealership sites based on my personal input, but several of the sites that I am attempting to collect data from are blocked by a redirect to a captcha page. The current site I am scraping with curl returns this HTML:
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
I am using this to scrape the page:
<?php
function web_scrape($url)
{
$ch = curl_init();
$imei = "013977000272744";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
'imei' => $imei,
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch); // close the handle before returning; code after return never runs
return $server_output;
}
echo web_scrape($url);
?>
To reiterate what I want to do: I want to collect the reCAPTCHA from this page so that, when I want to view the page details on an external site, I can fill in the reCAPTCHA on my external site and then scrape the page that was initially input.
Any response would be great!
Datadome is currently utilizing Recaptcha v2 and GeeTest captchas, so this is what your script should do:
1. Navigate to redirection https://geo.captcha-delivery.com/captcha/?initialCid=….
2. Detect what type of captcha is used.
3. Obtain token for this captcha using any captcha solving service like Anti Captcha.
4. Submit the token, check if you were redirected to the target page.
5. Sometimes target page contains an iframe with address https://geo.captcha-delivery.com/captcha/?initialCid=.. , so you need to repeat from step 2 in this iframe.
I’m not sure if the steps above can be done with PHP, but you can do it with browser automation engines like Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must if you want to build pro scrapers, and it is worth investing some time in YouTube lessons.
Here’s a script which does all steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js
You’ll need a proxy to bypass GeeTest protection.
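On the PHP side, the scraper can at least recognise when it has received the block page rather than the target page. A minimal sketch, assuming the block page always carries a `var dd={...}` object like the one in the question. The helper name extract_block_info is hypothetical, and the sketch only reads the values; it does not solve or bypass the captcha.

```php
<?php
// Hypothetical helper: returns the values of the `var dd={...}` object from a
// block page as an array, or null if the page does not contain such an object.
function extract_block_info(string $html): ?array
{
    if (!preg_match('/var\s+dd\s*=\s*(\{.*?\})/s', $html, $m)) {
        return null;
    }
    // The object uses single quotes, so swap them for double quotes before
    // decoding. This assumes the values themselves contain no quote characters.
    $info = json_decode(str_replace("'", '"', $m[1]), true);
    return is_array($info) ? $info : null;
}

// The script line from the block page shown in the question.
$html = "<script>var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}</script>";
$info = extract_block_info($html);
if ($info !== null) {
    echo "Blocked; captcha host: " . $info['host'] . "\n";
}
```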
Based on the high demand for code, here is my upgraded scraper that bypassed this specific issue. However, my attempt to obtain the captcha did not work, and I still have not solved how to obtain it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the magic comes from. It bypasses every piece of security carsales.com.au can throw at me
function get_web_page( $url ) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_SSL_VERIFYPEER => false // Disabled SSL Cert checks
);
$ch = curl_init( $url ); // create a cURL handle for the URL we want to scrape
curl_setopt_array( $ch, $options ); // apply the options defined above to the handle
$content = curl_exec( $ch ); // run the request; holds the page's HTML, or false on failure
$err = curl_errno( $ch ); // numeric cURL error code (0 means no error)
$errmsg = curl_error( $ch ); // human-readable cURL error message (empty string if there was none)
$header = curl_getinfo( $ch ); // array of transfer details (HTTP status code, final URL, timings, ...)
curl_close( $ch ); // close the handle and free its resources
$header['errno'] = $err; // add the error code to the returned array
$header['errmsg'] = $errmsg; // add the error message to the returned array
$header['content'] = $content; // add the page's HTML last, so end() on the array returns it
return $header; // return the transfer details, error info and content in one array
};
//Using the function we just made, fetch the URL generated by the form to get a developer view of the scraping.
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = end($response_dev); //takes only the last element (the page content); the rest is for my eyes only in case the site runs into an issue

PHP cURL HTTP GET XML Format

I have an application that has a Web Services RESTful API. When I make HTTP GET requests in the browser I get XML responses back.
When I make the same request using PHP I get the correct information but it is not formatted in XML and so I can't pass it to Simple XML.
Here's my code.
<?php
//Define user credentials to use with requests
$user = "user";
$passwd = "user";
//Define header array for cURL requests
$header = array('Content-Type: application/xml', 'Accept: application/xml');
//Define base URL
$url = 'http://192.168.0.100:8080/root/restful/';
//Define http request nouns
$ls = $url . "landscapes";
//Initialise cURL object
$ch = curl_init();
//Set cURL options
curl_setopt_array($ch, array(
CURLOPT_HTTPHEADER => $header, //Set http header options
CURLOPT_URL => $ls, //URL sent as part of the request
CURLOPT_HTTPAUTH => CURLAUTH_BASIC, //Set Authentication to BASIC
CURLOPT_USERPWD => $user . ":" . $passwd, //Set username and password options
CURLOPT_HTTPGET => TRUE //Set cURL to GET method
));
//Define variable to hold the returned data from the cURL request
$data = curl_exec($ch);
//Close cURL connection
curl_close($ch);
//Print results
print_r($data);
?>
Any thoughts or suggestions would be really helpful.
S
EDIT:
So this is the response I get from the PHP code:
0x100000rhel-mlsptrue9.2.3.0101
This is the response if I use the WizTools Rest Client or a browser.
<?xml version="1.0" encoding="UTF-16"?>
<landscape-response total-landscapes="1" xmlns="http://www.url.com/root/restful/schema/response">
<landscape>
<id>0x100000</id>
<name>rhel-mlsp</name>
<isPrimary>true</isPrimary>
<version>9.2.3.010</version>
</landscape>
</landscape-response>
As you can see the information is there but the PHP is not really presenting this in a useful way.
I was able to find the answer to this question so I thought I would share the code here.
//Initialise curl object
$ch = curl_init();
//Define curl options in an array
$options = array(CURLOPT_URL => "http://192.168.0.100/root/restful/<URI>",
CURLOPT_PORT => 8080,
CURLOPT_HTTPHEADER => array("Content-Type: application/xml"), //CURLOPT_HEADER is a boolean flag, so request headers go in CURLOPT_HTTPHEADER
CURLOPT_USERPWD => "<USER>:<PASSWD>",
CURLOPT_HTTPAUTH => CURLAUTH_BASIC,
CURLOPT_RETURNTRANSFER => TRUE
);
//Set options against curl object
curl_setopt_array($ch, $options);
//Assign execution of curl object to a variable
$data = curl_exec($ch);
//Close curl object
curl_close($ch);
//Pass results to the SimpleXMLElement function
$xml = new SimpleXMLElement($data);
print_r($xml);
As you can see the code is not all that different, the main thing was separating the port option out of the URL and into its own option.
Hopefully this helps someone else out!!!
S
Try this
$resp = explode("\n<?", $data);
$response = "<?{$resp[1]}";
$xml = new SimpleXMLElement($response);
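Does your code print anything at all? Try using echo $data and then view the page source (or press F12 and inspect the response in the browser's developer tools), because the browser hides the XML tags when it renders the output as HTML.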
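To illustrate why the tags seem to disappear, here is a minimal sketch that uses the sample response from the question as a string instead of a live request. The encoding declaration is changed to UTF-8 because the sample is a PHP string literal; that change, and the variable names, are assumptions for the example.

```php
<?php
// Sample response from the question, declared as UTF-8 here because it is a
// PHP string literal (the real service declares UTF-16).
$data = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<landscape-response total-landscapes="1" xmlns="http://www.url.com/root/restful/schema/response">
<landscape>
<id>0x100000</id>
<name>rhel-mlsp</name>
<isPrimary>true</isPrimary>
<version>9.2.3.010</version>
</landscape>
</landscape-response>
XML;

// Escape the markup so a browser shows the tags instead of hiding them.
echo htmlspecialchars($data), "\n";

// Parse the XML and read one element and one attribute.
$xml = new SimpleXMLElement($data);
$name = (string) $xml->landscape->name;
$total = (int) $xml['total-landscapes'];
echo $name, " (total landscapes: ", $total, ")\n";
```

With a real request, $data would be the string returned by curl_exec() when CURLOPT_RETURNTRANSFER is set.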

Api in PHP - Accept a curl request and process

Not sure if anyone can help me out with a question.
I had to write some PHP for the company I work for that lets us integrate with an API that accepts a JSON body. I used cURL, and the script is working great.
If I wanted to build another PHP page that would accept the request I'm sending, how would I go about this?
Say I wanted to allow someone to send this same request to me, and then wanted the info they sent to go into my database, how would I turn their request into PHP strings?
Here is the code I'm sending.
<?php
$json_string = json_encode(array("FirstName" => $name, "MiddleName" => " ", "LastName" => $last));
// echo $json_string;
// jSON URL which should be requested
$json_url = 'http://www.exampleurl.com';
// jSON String for request
// Initializing curl
$ch = curl_init( $json_url );
// Configuring curl options
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array(
'Accept: application/json;charset=utf-8',
'Content-Type: application/json;charset=utf-8',
'Expect: 100-continue',
'Connection: Keep-Alive') ,
CURLOPT_POSTFIELDS => $json_string
);
// Setting curl options
curl_setopt_array( $ch, $options );
// Getting results
$result = curl_exec($ch); // Getting jSON result string
echo $result;
$myArray = json_decode($result);
$action = $myArray->Action;
?>
To get the raw data from the POST that you would be receiving you would use $postData = file_get_contents('php://input');
http://php.net/manual/en/reserved.variables.post.php
Then you would json_decode() the contents of that POST, which turns the JSON into a PHP object (or into an associative array if you pass true as the second argument).
http://php.net/manual/en/function.json-decode.php
I did not fully understand your question. Maybe you are looking for a way to read the raw POST data? In that case open and read from the php://input stream (php://stdin is only meant for command-line scripts).
$input = fopen('php://input', 'r');
By the way read here ( http://php.net/manual/en/function.curl-setopt.php ) how to use CURLOPT_POSTFIELDS. This parameter can either be passed as a urlencoded string like 'para1=val1&para2=val2&...' or as an array with the field name as key and field data as value.
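Putting the answers above together, a receiving page could look roughly like this. It is a minimal sketch, assuming the sender posts the same FirstName, MiddleName and LastName fields as the script in the question. The function names parse_person and handle_request and the 'Accepted' and 'Rejected' values are hypothetical, and the database step is left as a comment.

```php
<?php
// Hypothetical helper: turns a raw JSON request body into an array of the
// three name fields, or returns null if the body is not usable.
function parse_person(string $raw): ?array
{
    $body = json_decode($raw, true);
    if (!is_array($body)
        || !isset($body['FirstName'], $body['LastName'])
        || !is_string($body['FirstName'])
        || !is_string($body['LastName'])) {
        return null;
    }
    $middle = $body['MiddleName'] ?? '';
    return array(
        'first'  => trim($body['FirstName']),
        'middle' => is_string($middle) ? trim($middle) : '',
        'last'   => trim($body['LastName']),
    );
}

// Hypothetical handler: returns the JSON reply for a raw request body.
function handle_request(string $raw): string
{
    $person = parse_person($raw);
    if ($person === null) {
        http_response_code(400); // tell the sender the body was invalid
        return json_encode(array('Action' => 'Rejected'));
    }
    // Insert $person into the database here, ideally with a prepared statement.
    return json_encode(array('Action' => 'Accepted'));
}

// On the receiving page, the raw body comes from php://input:
// echo handle_request((string) file_get_contents('php://input'));
```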

Getting JSON response with PHP

I'm trying to connect to an API that uses OAuth2 via PHP. The original plan was to use client-side JS, but that isn't possible with OAuth2.
I'm simply trying to get a share count from the API's endpoint, which is here:
https://api.bufferapp.com/1/links/shares.json?url=[your-url-here]&access_token=[your-access-key-here]
I do have a proper access_token that I am using to access the json file, that is working fine.
This is the code I have currently written, but I'm not even sure I'm on the right track.
// OAUTH2 ACCESS TOKEN FOR AUTHENTICATION
$key = '[my-access-key-here]';
// JSON URL TO BE REQUESTED
$json_url = 'https://api.bufferapp.com/1/links/shares.json?url=http://bufferapp.com&access_token=' . $key;
// GET THE SHARE COUNT FROM THE REQUEST
$json_string = '[shares]';
// INITIALIZE CURL
$ch = curl_init( $json_url );
// CONFIG CURL OPTIONS
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array('Content-type: application/json') ,
CURLOPT_POSTFIELDS => $json_string
);
// SETTING CURL OPTIONS
curl_setopt_array( $ch, $options );
// GET THE RESULTS
$result = curl_exec($ch); // Getting jSON result string
Like I said, I don't know if this is the best method - so I'm open to any suggestions.
I'm just trying to retrieve the share count with this PHP script, then with JS, spit out the share count where I need it on the page.
My apologies for wasting anyone's time. I have since been able to work this out. All the code is essentially the same. To test whether you're getting the correct response, just print it to the page.
<?php
// OAUTH2 ACCESS TOKEN FOR AUTHENTICATION
$key = '[your_access_key_here]';
// URL TO RETRIEVE SHARE COUNT FROM
$url = '[your_url_here]';
// JSON URL TO BE REQUESTED - API ENDPOINT
$json_url = 'https://api.bufferapp.com/1/links/shares.json?url=' . urlencode($url) . '&access_token=' . $key;
// GET THE SHARE COUNT FROM THE REQUEST
$json_string = '[shares]';
// INITIALIZE CURL
$ch = curl_init( $json_url );
// CONFIG CURL OPTIONS
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array('Content-type: application/json') ,
CURLOPT_POSTFIELDS => $json_string
);
// SETTING CURL OPTIONS
curl_setopt_array( $ch, $options );
// GET THE RESULTS
$result = curl_exec($ch); // Getting jSON result string
print $result;
?>
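For reference, the request URL can also be built with http_build_query, which URL-encodes both values, and the count can be read with json_decode. This is a minimal sketch: the helper names are hypothetical, and the "shares" key is an assumption about the response format that should be checked against what the API actually returns.

```php
<?php
// Hypothetical helper: builds the request URL with both query values URL-encoded.
function build_shares_url(string $pageUrl, string $accessToken): string
{
    $query = http_build_query(array(
        'url'          => $pageUrl,
        'access_token' => $accessToken,
    ), '', '&');
    return 'https://api.bufferapp.com/1/links/shares.json?' . $query;
}

// Hypothetical helper: pulls the count out of the JSON response. The "shares"
// key is an assumption about the response format; returns null if it is absent.
function extract_share_count(string $json): ?int
{
    $decoded = json_decode($json, true);
    if (is_array($decoded) && isset($decoded['shares'])) {
        return (int) $decoded['shares'];
    }
    return null;
}

echo build_shares_url('http://bufferapp.com', '[your_access_key_here]'), "\n";
```

The returned count can then be printed into the page or handed to JavaScript with json_encode.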

Caching CURL json data from server

I am trying to figure out how I would go about caching the data I am pulling from a web service JSON file onto my page so that I do not continually request this data and bring down the server.
I am currently pulling the JSON data like so:
// jSON URL which should be requested
$json_url = 'http://example.com/datastore.json?toolbar_id='.$persona['toolbar_id'].'';
// jSON String for request
$json_string = '[Json string? What is this]';
// Initializing curl
$ch = curl_init( $json_url );
// Configuring curl options
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array('Content-type: application/json') ,
CURLOPT_POSTFIELDS => $json_string
);
// Setting curl options
curl_setopt_array( $ch, $options );
// Getting results
$result = curl_exec($ch); // Getting jSON result string
$result = json_decode($result, true);
$result = $result[0];
From here I can pull the associative array results as I need them. But if I were to refresh the page, it would request the server info again. Any solutions?
You'd treat it like any other cache file:
Check if the cache file exists
If it does, check the filemtime() against the current time()
If it needs to be refreshed, make the cURL call and write the data to the cache file and carry on
If it does not need to be refreshed, simply return the data from the file to your variable.
It will be the same JSON regardless if PHP returns it via cURL or if PHP returns it via fread() on a cache file.
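The steps above can be sketched as follows. This is a minimal example under stated assumptions: the function name cached_fetch, the 300 second lifetime and the temp-file path are all illustrative, and a closure stands in for the cURL call from the question so that the sketch runs without a network.

```php
<?php
// Hypothetical helper: returns the cached data if the file is younger than
// $ttl seconds, otherwise calls $fetch, stores the result and returns it.
function cached_fetch(string $cacheFile, int $ttl, callable $fetch): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return (string) file_get_contents($cacheFile);
    }
    $data = $fetch();
    if (is_string($data) && $data !== '') {
        file_put_contents($cacheFile, $data, LOCK_EX);
        return $data;
    }
    // The fetch failed: fall back to a stale cache file if there is one.
    return is_file($cacheFile) ? (string) file_get_contents($cacheFile) : '';
}

// A unique name keeps this demo isolated; a real cache would use a fixed path.
$cacheFile = sys_get_temp_dir() . '/datastore_cache_' . uniqid() . '.json';
$calls = 0;
// The closure stands in for the cURL request from the question.
$fetch = function () use (&$calls) {
    $calls++;
    return '[{"toolbar_id": 1}]';
};

$first  = cached_fetch($cacheFile, 300, $fetch); // cache miss: calls $fetch
$second = cached_fetch($cacheFile, 300, $fetch); // cache hit: reads the file
$result = json_decode($second, true)[0];
unlink($cacheFile); // clean up the demo file
```

Since the cache key in the question depends on toolbar_id, a fixed path would probably need to include that id so that different toolbars do not share one cache file.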
