I'm using Goutte to make a webscraper.
For development, I've saved a .html document I'd like to traverse (so I'm not constantly making requests to the website). Here's what I have so far:
use Goutte\Client;
$client = new Client();
$html=file_get_contents('test.html');
$crawler = $client->request(null,null,[],[],[],$html);
Based on what I know, this should call request() in Symfony\Component\BrowserKit and pass in the raw body data. Here's the error message I'm getting:
PHP Fatal error: Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 7: Failed to connect to localhost port 80: Connection refused (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in C:\Users\Ally\Sites\scrape\vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.
If I were to just use DomCrawler, it's trivial to create a crawler from a string (see: http://symfony.com/doc/current/components/dom_crawler.html). I'm just unsure how to do the equivalent with Goutte.
Thanks in advance.
The tools you chose make real HTTP connections and are not suitable for what you want to do, at least not out of the box.
Option 1: Implement your own BrowserKit Client
All Goutte does is extend BrowserKit's Client and implement HTTP requests with Guzzle.
To implement your own client, all you need to do is extend Symfony\Component\BrowserKit\Client and provide the doRequest() method:
use Symfony\Component\BrowserKit\Client;
use Symfony\Component\BrowserKit\Request;
use Symfony\Component\BrowserKit\Response;

class FilesystemClient extends Client
{
    /**
     * @param object $request An origin request instance
     *
     * @return object An origin response instance
     */
    protected function doRequest($request)
    {
        $file = $this->getFilePath($request->getUri());

        if (!file_exists($file)) {
            return new Response('Page not found', 404, []);
        }

        $content = file_get_contents($file);

        return new Response($content, 200, []);
    }

    private function getFilePath($uri)
    {
        // convert an uri to a file path to your saved response
        // could be something like this:
        return preg_replace('#[^a-zA-Z_\-\.]#', '_', $uri).'.html';
    }
}

$client = new FilesystemClient();
$client->request('GET', '/test');
The client's request() method needs to accept real URIs, so you need to implement your own logic to convert a URI to a filesystem location.
Have a look at Goutte's Client for inspiration.
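The URI-to-path conversion mentioned above can be sketched on its own. This is only an illustration, not part of Goutte or BrowserKit: the function name uriToFilePath() and the fixtures directory are hypothetical, and unlike the pattern in the class above it also keeps digits, so that /page/1 and /page/2 do not end up in the same file.

```php
<?php

/**
 * Map an absolute URI to the file that holds its saved response.
 * Hypothetical helper for illustration; the directory layout is an assumption.
 */
function uriToFilePath(string $uri, string $baseDir = 'fixtures'): string
{
    // Replace every character that is not a letter, digit, underscore,
    // hyphen or dot with an underscore, then append ".html".
    $name = preg_replace('#[^a-zA-Z0-9_\-\.]#', '_', $uri);

    return $baseDir . '/' . $name . '.html';
}

// BrowserKit turns a relative URI such as "/test" into an absolute one
// (http://localhost/test by default), so that is what doRequest() receives.
echo uriToFilePath('http://localhost/test'), PHP_EOL;   // fixtures/http___localhost_test.html
echo uriToFilePath('http://localhost/page/2'), PHP_EOL; // fixtures/http___localhost_page_2.html
```

Saving test.html under the generated name is then enough for the client to find it.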
Option 2: Implement a custom Guzzle handler
Since Goutte uses Guzzle, you could provide your own Guzzle handler that loads responses from files instead of making real HTTP requests. Have a look at the handlers and middleware doc.
If you're just after caching responses so that you make fewer HTTP requests, Guzzle provides support for this already.
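A rough sketch of such a handler follows. It assumes Goutte 3.x (the Guzzle-based version that the error message in the question points to), Guzzle 6 or later, and Composer's autoloader. The function name createOfflineClient() is hypothetical, and every request is answered from the same file, which is a simplification.

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // assumption: dependencies installed with Composer

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Promise\FulfilledPromise;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Message\RequestInterface;

/**
 * Build a Goutte client whose Guzzle handler never touches the network.
 * Hypothetical helper: every request is served from $file.
 */
function createOfflineClient(string $file): Client
{
    // A Guzzle handler is a callable that takes a request and an options
    // array and returns a promise that resolves to a PSR-7 response.
    $fileHandler = function (RequestInterface $request, array $options) use ($file) {
        if (!is_file($file)) {
            return new FulfilledPromise(new Response(404, [], 'Page not found'));
        }

        return new FulfilledPromise(
            new Response(200, ['Content-Type' => 'text/html'], file_get_contents($file))
        );
    };

    $client = new Client();
    $client->setClient(new GuzzleClient(['handler' => HandlerStack::create($fileHandler)]));

    return $client;
}
```

With that in place, createOfflineClient(__DIR__ . '/test.html')->request('GET', 'http://localhost/test') should return a crawler for the saved page without opening a connection.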
Option 3: Use DomCrawler directly
new Crawler(file_get_contents('test.html'))
The only drawback is that you'll lose some of the convenience methods of the BrowserKit client, like click() or submit().
I use external libraries to communicate with external systems. Communication is via SOAP.
I pass the library a set of data in a simple way. It checks the data, sends the request, and receives the reply. It then converts the response into its own object and returns it.
How can I access the SoapClient object without making changes to the external library?
From this object, I need the original data: request, response, headers.
Is it possible to do?
EDIT:
A simple example of using one of the many external libraries:
class fedex {
    public function trackShipment($number)
    {
        $trackRequest = new TrackServiceTrackRequest();
        $trackRequest->WebAuthenticationDetail->UserCredential->Key = $this->getAccessNumber();
        $trackRequest->WebAuthenticationDetail->UserCredential->Password = $this->getAccountPassword();
        $trackRequest->ClientDetail->AccountNumber = $this->getAccountNumber();
        $trackRequest->ClientDetail->MeterNumber = $this->getMeterNumber();
        $trackRequest->SelectionDetails[0]->PackageIdentifier->Value = $number;

        $request = new TrackServiceRequest();

        return $request->getTrackReply($trackRequest);
    }
}

$fedex = new fedex();
$result = $fedex->trackShipment('123456789');
How can I get the original XML of the request, including the headers, that the library sent, without modifying the library?
The TrackServiceRequest() object does not expose the SoapClient() object.
I want to send a request with or without 'Token' as a header.
If the request has 'Token' as a header: if the user already has that item, it will return the item with the proper item_id for that specific user (based on the token), otherwise it will return null.
If the request doesn't have 'Token' as a header: it will return the item with that item_id.
I'm working with Zend Framework and in ItemResource I have this method:
public function fetch($id)
{
}
How can I check if my request has Token as a header or not and implement both cases inside fetch()?
With Laminas API Tools, it depends on whether you're using an RPC or a REST resource. I will explain which tools Laminas API Tools gives you to evaluate the received header data.
You don't have to reinvent the wheel, because Laminas API Tools already has the received headers at hand when you're in your fetch method.
Representational State Transfer (REST)
REST resources normally extend the \Laminas\ApiTools\Rest\AbstractResourceListener class. This class listens for \Laminas\ApiTools\Rest\ResourceEvent. Fortunately, this event provides you with a request object that also contains the received header data.
<?php
declare(strict_types=1);
namespace Marcel\V1\Rest\Example;

use Laminas\ApiTools\Rest\AbstractResourceListener;

class ExampleResource extends AbstractResourceListener
{
    public function fetch($id)
    {
        // requesting for an authorization header
        $token = $this->getEvent()->getRequest()->getHeader('Authorization', null);

        if ($token === null) {
            // header was not received
        }
    }
}
As you can see, the ResourceEvent returns a \Laminas\Http\Request instance when you call getRequest(). The request instance already contains all the request headers you've received. Just call getHeader() with the header name and, as the second parameter, a default value to return when the header was not set. The example reads the Authorization header; in your case, pass 'Token' instead. If the header was not sent, you'll get null as the result.
Remote Procedure Calls (RPC)
Since RPC requests are handled by an MVC controller class, you can get the request as easily as in a REST resource. Controller classes extend \Laminas\Mvc\Controller\AbstractActionController, which already contains a request instance.
<?php
declare(strict_types=1);
namespace Marcel\V1\Rpc\Example;

use Laminas\Mvc\Controller\AbstractActionController;

class ExampleController extends AbstractActionController
{
    public function exampleAction()
    {
        $token = $this->getRequest()->getHeader('Authorization', null);

        if ($token === null) {
            // token was not set
        }
    }
}
As you can see, getting header data in RPC requests is as easy as in resource listeners. The procedure is the same because a request instance is used here as well.
Conclusion
There is absolutely no need to code things that are already there. Just get the request instance from the event or the abstract controller and retrieve the header you want. Always keep in mind that there are security aspects, like CRLF injection, when dealing with raw data. The Laminas framework already handles all this for you.
Additionally, you can check all received headers by calling ->getHeaders() instead of ->getHeader($name, $default). You'll get a \Laminas\Http\Headers instance with all received headers.
You can get all HTTP header values with getallheaders(), or just get a specific value from $_SERVER['HTTP_XXX']. PHP upper-cases the header name and replaces dashes with underscores, so in your case replace XXX with TOKEN: $_SERVER['HTTP_TOKEN'].
Manual: https://www.php.net/manual/en/reserved.variables.server.php
public function fetch($id)
{
    // null when the client did not send a Token header
    $token = $_SERVER['HTTP_TOKEN'] ?? null;
    // do your business code
}
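The name mapping that PHP applies to incoming headers can be sketched as a small helper. The function name headerFromServer() is hypothetical; it takes the $_SERVER-style array as an argument only so that it can be tried without a web server.

```php
<?php

/**
 * Look up a request header in a $_SERVER-style array.
 * Hypothetical helper; returns null when the header was not sent.
 */
function headerFromServer(array $server, string $name): ?string
{
    // PHP exposes the header "Token" as HTTP_TOKEN and "X-Api-Key" as
    // HTTP_X_API_KEY: upper-cased, dashes replaced by underscores,
    // prefixed with HTTP_.
    $key = 'HTTP_' . strtoupper(str_replace('-', '_', $name));

    return isset($server[$key]) ? (string) $server[$key] : null;
}

// In a real request you would pass $_SERVER; fixed arrays are used here.
var_dump(headerFromServer(['HTTP_TOKEN' => 'abc123'], 'Token')); // string(6) "abc123"
var_dump(headerFromServer([], 'Token'));                          // NULL
```

Inside fetch() the null result is what lets you tell the two cases apart: a string means the Token header was sent, null means it was not.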
I have an application running on webserver A. I have a second application running on webserver B. Both webservers require a login. What I need to do is have a request to webserver A pass through to webserver B and return a file to the client without having the client log in to webserver B. (In other words, webserver B will be invisible to the client, and I will take care of the auth credentials in my request from A to B.) The code below is built on the Laravel framework, but I don't believe the answer needs to be Laravel-specific.
The code works but it is only returning the HEAD information of the file to the calling client. Not the file itself.
Any help will be greatly appreciated!
Controller:
public function getAudioFile(Request $request)
{
    // This is the id we are looking to pull
    $uid = $request->uniqueid;
    $audioServices = new AudioServices();

    return $audioServices->getWavFile($uid);
}
Service:
public function getWavFile(string $uniqueId)
{
    $client = new GuzzleHttp\Client(['verify' => false]);

    return $client->request('GET', $this->connectString.$uniqueId, ['auth' => ['username', 'password']]);
}
As mentioned by bishop, you can use the sink option from Guzzle to stream the response of a Guzzle request.
You can pass that stream to a response from your controller. I'm not sure whether Laravel has built-in stream support, but the underlying Symfony HttpFoundation components do. An example of its usage can be found in this tutorial.
If you prefer not to use the sink option from Guzzle, you can also use the body of the response itself, since it is a PSR-7 stream.
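A minimal sketch of that second approach, assuming Guzzle 6 or later and Symfony's HttpFoundation (both are available in a Laravel application, and a Laravel controller can return a Symfony response directly). The function name proxyFile() and its parameters are hypothetical.

```php
<?php

// Assumption: Composer's autoloader is already loaded, as in any Laravel app.
use GuzzleHttp\ClientInterface;
use Symfony\Component\HttpFoundation\StreamedResponse;

/**
 * Fetch $url from the upstream server and relay its body to the client.
 * Hypothetical helper; $auth is a [username, password] pair.
 */
function proxyFile(ClientInterface $client, string $url, array $auth): StreamedResponse
{
    // 'stream' => true asks Guzzle not to buffer the whole body up front.
    $upstream = $client->request('GET', $url, ['auth' => $auth, 'stream' => true]);
    $body = $upstream->getBody();

    return new StreamedResponse(
        function () use ($body) {
            // Relay the PSR-7 stream in chunks instead of loading it into memory.
            while (!$body->eof()) {
                echo $body->read(8192);
                flush();
            }
        },
        $upstream->getStatusCode(),
        ['Content-Type' => $upstream->getHeaderLine('Content-Type') ?: 'application/octet-stream']
    );
}
```

The service method would then return proxyFile($client, $this->connectString.$uniqueId, ['username', 'password']) rather than the raw Guzzle response, so the client receives the file body and not just the header information.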
I am using Slim on a REST API server. Some of the endpoints need to be blindly proxied to another server, and I am using Guzzle for this part. Most of the time it works to just use the Slim request as the Guzzle request (with some minor modifications such as the host, etc.).
<?php
use Psr\Http\Message\ServerRequestInterface as SlimRequest;
use Psr\Http\Message\ResponseInterface as SlimResponse;
use GuzzleHttp\Psr7\Request as GuzzleRequest;
use GuzzleHttp\Psr7\Response as GuzzleResponse;

$app->post('/bla/bla/bla', function (SlimRequest $slimRequest, SlimResponse $slimResponse) {
    $slimRequest = $slimRequest->withUri($slimRequest->getUri()->withHost('https://example.com'));
    $guzzleResponse = $this->httpClient->send($slimRequest);
});
One of my endpoints uses multipart content, and neither the files nor the POST content is being sent. As an alternative, I've tried the following, but without success.
<?php
use Psr\Http\Message\ServerRequestInterface as SlimRequest;
use Psr\Http\Message\ResponseInterface as SlimResponse;
use GuzzleHttp\Psr7\Request as GuzzleRequest;
use GuzzleHttp\Psr7\Response as GuzzleResponse;

$app->post('/bla/bla/bla', function (SlimRequest $slimRequest, SlimResponse $slimResponse) {
    $headers = array_intersect_key($slimRequest->getHeaders(), array_flip(["HTTP_CONNECTION", "CONTENT_LENGTH", "HTTP_ACCEPT", "HTTP_ACCEPT_ENCODING", "HTTP_ACCEPT_LANGUAGE", "CONTENT_TYPE"]));
    $guzzleRequest = new \GuzzleHttp\Psr7\Request($slimRequest->getMethod(), $slimRequest->getUri()->getPath(), $headers, $slimRequest->getBody());
    $guzzleResponse = $this->httpClient->send($guzzleRequest);
});
If necessary, I will resort to manually creating the multipart form. However, I expect there is a better way to do so, since both are PSR-7 compliant.
How should this best be accomplished?
PSR-7 Request objects are IMMUTABLE. That is, you cannot alter the values. Setting something new will return a new instance.
https://www.php-fig.org/psr/psr-7/
So, just change
$slimRequest->withUri($slimRequest->getUri()->withHost('https://example.com'));
to
$slimRequest = $slimRequest->withUri($slimRequest->getUri()->withHost('https://example.com'));
Also, note that $slimRequest->getUri()->withHost('https://example.com') returns a new Uri object, not a string. If you need the host as a string, call:
$slimRequest->getUri()->withHost('https://example.com')->getHost()
which will give you your string.
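The immutability rule described above can be illustrated without any library. The ImmutableUri class below is a hypothetical stand-in that only mimics the PSR-7 with*() convention; it is not a PSR-7 implementation.

```php
<?php

/** Hypothetical value object that mimics the PSR-7 "with*" convention. */
final class ImmutableUri
{
    private $host;

    public function __construct(string $host)
    {
        $this->host = $host;
    }

    public function getHost(): string
    {
        return $this->host;
    }

    public function withHost(string $host): self
    {
        $clone = clone $this; // the original instance is never modified
        $clone->host = $host;

        return $clone;
    }
}

$uri = new ImmutableUri('localhost');

$uri->withHost('example.com');        // return value discarded: no effect
echo $uri->getHost(), PHP_EOL;        // localhost

$uri = $uri->withHost('example.com'); // keep the returned instance
echo $uri->getHost(), PHP_EOL;        // example.com
```

The same applies to every with*() method on PSR-7 requests, responses and URIs: the call only matters if you use what it returns.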
$app->get('/', function ($request, $response) {
    // Initial page load.
    include 'body-index.php';
    return $response;
});
I have the code above in my /index.php. How would I then call and modify functions within body-index.php? As I'm learning MVC and frameworks on my own right now, I'd rather do it this way than break out of Slim and do a get('/body-index.php', ...) with the page code. Is this possible?
Thanks.
From Slim Framework documentation:
Most often, you’ll need to write to the PSR 7 Response object. You can write content to the StreamInterface instance with its write() method like this:
$body = $response->getBody();
$body->write('Hello');
You can also replace the PSR 7 Response object’s body with an entirely new StreamInterface instance. This is particularly useful when you want to pipe content from a remote destination (e.g. the filesystem or a remote API) into the HTTP response. You can replace the PSR 7 Response object’s body with its withBody(StreamInterface $body) method. Its argument MUST be an instance of \Psr\Http\Message\StreamInterface.
$newStream = new \GuzzleHttp\Psr7\LazyOpenStream('/path/to/file', 'r');
$newResponse = $oldResponse->withBody($newStream);
Source: Response - Slim Framework
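Applied to the question, one way to combine an included file with write() is to capture the file's output in a buffer first. This is a sketch under assumptions: renderTemplate() is a hypothetical helper, and the route in the trailing comment assumes the Slim 3 callback signature with $request and $response.

```php
<?php

/**
 * Run a PHP template and return what it prints as a string.
 * Hypothetical helper; $vars become local variables inside the template.
 */
function renderTemplate(string $path, array $vars = []): string
{
    // EXTR_SKIP keeps template variables from overwriting $path and $vars.
    extract($vars, EXTR_SKIP);

    ob_start();
    try {
        include $path;
    } finally {
        // Always close the buffer, even if the template throws.
        $output = ob_get_clean();
    }

    return $output;
}

// Inside a Slim route (assumed Slim 3 signature):
// $app->get('/', function ($request, $response) {
//     $html = renderTemplate(__DIR__ . '/body-index.php', ['title' => 'Home']);
//     $response->getBody()->write($html);
//     return $response;
// });
```

Functions defined in body-index.php become callable after the include, and values can be handed to the page through the $vars array.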