I have a PHP Goutte script that submits a couple of forms on a web page. Next, I want to take a screenshot of the page the crawler is on and save it to a folder.
<?php
require_once __DIR__.'/vendor/autoload.php';
use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client();
$crawler = $client->request('GET', 'https://login.siat.sat.gob.mx/nidp/idff/sso?id=mat-ptsc-totp&sid=10&option=credential&sid=10');
$form = $crawler->selectButton('Enviar')->form();
$crawler = $client->submit($form, array('Ecom_User_ID' => 'xxx', 'Ecom_Password' => 'xxx'));
$crawler = $client->request('GET', 'https://www.siat.sat.gob.mx/PTSC/');
echo $crawler->html();
Any ideas?
I am stuck with this error...
but the client is defined.
My code looks like this:
use Goutte\Client;
use Illuminate\Http\Request;
use GuzzleHttp\Client as GuzzleClient;
class WebScrapingController extends Controller
{
public function doWebScraping()
{
$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
'timeout' => 60,
'verify' => false
));
$goutteClient->setClient($guzzleClient);
$crawler = $goutteClient->request('GET', 'https://duckduckgo.com/html/?q=Laravel');
$crawler->filter('.result__title .result__a')->each(function ($node) {
dump($node->text());
});
}
}
I think the error comes from this line:
$goutteClient->setClient($guzzleClient);
goutte: "^4.0" guzzle: "7.0" Laravel Framework: "6.20.4"
This answer covers how to create an instance of the Goutte client, a simple PHP web scraper.
For version >= 4.0.0
Pass a Symfony HttpClient instance directly to the Goutte Client constructor. Goutte 4 extends Symfony's HttpBrowser, so setClient() no longer exists and a Guzzle client is not accepted by the constructor.
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;
$client = new Client(HttpClient::create(['timeout' => 60]));
// To skip TLS verification (the counterpart of Guzzle's 'verify' => false),
// add 'verify_peer' => false and 'verify_host' => false to the options array.
For version < 4.0.0 (i.e. from 0.1.0 to 3.3.1)
use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
$goutteClient = new Client();
$guzzleClient = new GuzzleClient(['timeout' => 60]);
$goutteClient->setClient($guzzleClient);
I have a problem submitting a form with Goutte. I'm trying to log in to an app, but every time I refresh I get a 404 error. The 404 comes from the app itself: I return html() as the final result to see what is happening. Maybe the 404 appears because my link is, for example:
http://localhost:8000/trylogin
Here is the code.
$client = new Client();
$crawler = $client->request('GET', 'https://app.productlistgenie.io/signin');
$form = $crawler->selectButton('LogIn')->form();
$crawler = $client->submit($form, array('email' => 'myemail#here.com', 'password' => 'mypasshere'));
$crawler->filter('.flash-error')->each(function ($node) {
print $node->text()."\n";
});
return $crawler->html();
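This is the working code. The label passed to selectButton() now matches the button on the page ('SIGN IN'), and the filter selector was updated: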
$client = new Client();
$crawler = $client->request('GET', 'https://app.productlistgenie.io/signin');
$form = $crawler->selectButton('SIGN IN')->form();
$crawler = $client->submit($form, array('email' => 'myemail#here.com', 'password' => 'mypasshere'));
$crawler->filter('.font-roboto-light.color-dark-white')->each(function ($node) {
print $node->text()."\n";
});
return $crawler->html();
I want to use Guzzle (cURL) to get the HTML content of another page inside my Laravel app.
The classic way would be:
$client = new Client();
$client = $client->request('GET', route('print.page'))->getBody();
The problem is that all these routes are auth-protected, so I only get the HTML of my login page.
I tried logging in again through Guzzle, but I don't think a double login is a good idea.
Is there a better way to get the HTML from this protected route?
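If you are calling this inside a controller and there is a currently authenticated user, get the session cookie name and the real session ID, then forward them with the request:
// At the top of the controller file:
// use GuzzleHttp\Client;
// use GuzzleHttp\Cookie\CookieJar;
// use Illuminate\Support\Facades\Session;
public function FooController()
{
$name = Session::getName();
$sessionId = $_COOKIE[$name];
// The second argument is the cookie domain; it must match the host of the requested route.
$cookieJar = CookieJar::fromArray([
$name => $sessionId,
], 'example.com');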
$client = new Client();
$body = $client->request( // changed the variable from $client to $body here
'GET',
route('print.page'),
['cookies' => $cookieJar]
)->getBody();
}
Mailchimp code:
$api = new MCAPI ('xxxxxxxxxxxx-us11')
This is my code:
$api = new MCAPI ('<?php echo $member[chimpapi]; ?>')
The second for the listID:
$api = new MCAPI ('<?php echo $member[chimplist]; ?>')
Of course this does not work. Any suggestions?
I am not familiar with the Mailchimp API, but this piece of code:
$api = new MCAPI ('xxxxxxxxxxxx-us11');
is already PHP code, so there is no need to put PHP tags inside it. Pass the variable directly:
$api = new MCAPI ($member['chimpapi']);
and
$api = new MCAPI ($member['chimplist']);
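To see why the original attempt fails, here is a minimal, dependency-free sketch. The `$member` array and its placeholder value are assumptions made for illustration; only the key name comes from the question. It compares the literal string that the quoted PHP tags produce with the value obtained by reading the array element directly.

```php
<?php
// Hypothetical member record: the key name is from the question,
// the value is a placeholder, not a real API key.
$member = ['chimpapi' => 'xxxxxxxxxxxx-us11'];

// Inside a quoted string, PHP tags are not executed.
// This is plain text, so MCAPI would receive the tag itself as the "key".
$wrong = '<?php echo $member[chimpapi]; ?>';

// Reading the array element directly yields the stored value.
$right = $member['chimpapi'];

// Show both values side by side.
var_dump($wrong);
var_dump($right);
```

Note that the key is quoted in the corrected version. An unquoted `chimpapi` outside a string is read as a constant name, which is an error in PHP 8.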
I am using the PHP Guzzle client to fetch a website and then process it with the Symfony 2.1 crawler.
I am trying to access a form, for example this test form:
http://de.selfhtml.org/javascript/objekte/anzeige/forms_method.htm
$url = 'http://de.selfhtml.org/javascript/objekte/anzeige/forms_method.htm';
$client = new Client($url);
$request = $client->get();
$request->getCurlOptions()->set(CURLOPT_SSL_VERIFYHOST, false);
$request->getCurlOptions()->set(CURLOPT_SSL_VERIFYPEER, false);
$response = $request->send();
$body = $response->getBody(true);
$crawler = new Crawler($body);
$filter = $crawler->selectButton('submit')->form();
var_dump($filter);die();
But I get the exception:
The current node list is empty.
So I am lost as to how to access the form.
Try using Goutte. It is a screen scraping and web crawling library built on top of the tools you are already using (Guzzle, Symfony2 Crawler). See the GitHub repo for more info.
Your code would look like this using Goutte:
<?php
use Goutte\Client;
$url = 'http://de.selfhtml.org/javascript/objekte/anzeige/forms_method.htm';
$client = new Client();
$crawler = $client->request('GET', $url);
$form = $crawler->selectButton('submit')->form();
$crawler = $client->submit($form, array(
'username' => 'myuser', // assuming you are submitting a login form
'password' => 'P#S5'
));
var_dump($crawler->count());
echo $crawler->html();
echo $crawler->text();
If you really need to setup the CURL options you can do it this way:
<?php
$url = 'http://de.selfhtml.org/javascript/objekte/anzeige/forms_method.htm';
$client = new Client();
$guzzle = $client->getClient();
$guzzle->setConfig(
array(
'curl.CURLOPT_SSL_VERIFYHOST' => false,
'curl.CURLOPT_SSL_VERIFYPEER' => false,
));
$client->setClient($guzzle);
// ...
UPDATE:
When using the DomCrawler I often get that same error. Most of the time it is because I'm not selecting the correct element on the page, or because the element doesn't exist. Instead of using:
$crawler->selectButton('submit')->form();
do the following:
$form = $crawler->filter('#signin_button')->form();
Here the filter method selects the element by ID ('#signin_button'), if it has one; you could also select it by class ('.signin_button').
The filter method requires the CssSelector component.
Also debug your form by printing out the HTML (echo $crawler->html();) to make sure that you are actually on the right page.
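As a debugging aid for the "current node list is empty" case, the sketch below lists the submit controls found in an HTML string, so you can see which ID, name, or label a selector could match. It deliberately uses PHP's built-in DOM extension rather than DomCrawler so that it runs without Composer packages. The helper name `listSubmitControls` and the sample markup are hypothetical; in practice you would pass in the output of `$crawler->html()`.

```php
<?php
// Hypothetical helper: returns id, name and label of every button-like control.
function listSubmitControls(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//button | //input[@type="submit" or @type="button" or @type="image"]'
    );

    $controls = [];
    foreach ($nodes as $node) {
        $controls[] = [
            'tag'   => $node->nodeName,
            'id'    => $node->getAttribute('id'),
            'name'  => $node->getAttribute('name'),
            // <button> carries its label as text, <input> in the value attribute.
            'label' => $node->nodeName === 'button'
                ? trim($node->textContent)
                : $node->getAttribute('value'),
        ];
    }

    return $controls;
}

// Sample markup, made up for illustration.
$html = '<form action="/signin" method="post">'
      . '<input type="text" name="email">'
      . '<input type="submit" id="signin_button" value="SIGN IN">'
      . '</form>';

$controls = listSubmitControls($html);
print_r($controls);
```

If the list comes back empty, the page you fetched probably is not the one you expected (for example a redirect or a JavaScript-rendered form), and no selector will find the button.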