I am trying to scrape data from a URL with PhantomJS and php-phantomjs, but my target page generates some of its data with ES6, which PhantomJS does not support yet, so I get errors like this in the console log:
ReferenceError: Can't find variable: Set
My code is:
use JonnyW\PhantomJs\Client;
$client = Client::getInstance();
$client->getEngine()->setPath('C:\\Users\\XXX\\Desktop\\bin\\phantomjs.exe');
$request = $client->getMessageFactory()->createRequest('example.com', 'GET');
$response = $client->getMessageFactory()->createResponse();
$client->send($request, $response);
var_dump($response->getConsole());
I searched a lot and found that PhantomJS will support ES6 in a new version (v2.5), and a beta has been released, but it doesn't work for me.
What can I do now? Is there any way to scrape this page?
While the future of PhantomJS is not yet certain, may I suggest another headless browser: Puppeteer. It is based on headless Google Chrome, with a dedicated team of Google engineers behind it.
There are already projects to control it from PHP; the most notable at the moment is puphpeteer*
__
* (notable in that it can not only make screenshots/PDFs but also offers JavaScript evaluation)
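As a sketch of what that looks like, the snippet below follows puphpeteer's documented API to render a page with JavaScript enabled and return the resulting HTML. It assumes Node.js is installed alongside PHP and the package has been added via `npm install @nesk/puphpeteer`; the target URL is a placeholder.

```php
<?php
// Sketch using nesk/puphpeteer (requires a Node.js runtime next to PHP).
// The URL below is a placeholder; substitute the page you want to scrape.
require 'vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;

$puppeteer = new Puppeteer;
$browser = $puppeteer->launch();

$page = $browser->newPage();
$page->goto('https://example.com');

// Full HTML after the page's JavaScript (including ES6) has executed
$html = $page->content();

$browser->close();
echo $html;
```

Because the page is rendered by a real Chrome engine, ES6 features like `Set` work out of the box.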
I need to download the Azure public Service Tags JSON from a PHP script. Unfortunately, manual steps are required for the download (AWS offers a direct download of the JSON file with their networks).
The download link is: https://www.microsoft.com/en-us/download/confirmation.aspx?id=57064
Hopefully someone has an idea of how to follow the download link.
Thanks in advance!
You could try the Service Tags - List REST API endpoint:
GET https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.Network/locations/{location}/serviceTags?api-version=2020-05-01
I suggest the REST API because you wanted to invoke it from PHP. You will have to pass the authorization header.
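A hedged sketch of that call from PHP with curl is below. The subscription ID and bearer token are placeholders you must supply yourself (the token can be obtained through any Azure AD OAuth 2.0 flow); only the URL shape comes from the endpoint above.

```php
<?php
// Sketch: calling the Service Tags - List endpoint from PHP with curl.
// The subscription ID and access token below are placeholders.

function serviceTagsUrl(string $subscriptionId, string $location): string {
    return "https://management.azure.com/subscriptions/{$subscriptionId}"
         . "/providers/Microsoft.Network/locations/{$location}"
         . "/serviceTags?api-version=2020-05-01";
}

$accessToken = 'YOUR_ACCESS_TOKEN'; // placeholder: obtain via an Azure AD OAuth flow

$ch = curl_init(serviceTagsUrl('00000000-0000-0000-0000-000000000000', 'eastus2'));
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('Authorization: Bearer ' . $accessToken),
));
$serviceTags = json_decode(curl_exec($ch), true);
curl_close($ch);
```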
Other programmatic approaches
Get-AzNetworkServiceTag PowerShell cmdlet:
$serviceTags = Get-AzNetworkServiceTag -Location eastus2
az network list-service-tags Azure CLI command:
az network list-service-tags --location <location> [--subscription <subscription>]
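Since the question is about PHP, one option is to shell out to the CLI and decode its JSON output. This is a hedged sketch: it assumes `az` is installed and logged in, and the `values`/`name` keys match the shape of the Service Tags JSON; verify them against your actual output.

```php
<?php
// Sketch: invoking the Azure CLI from PHP and decoding its JSON output.
// Assumes `az` is installed and authenticated on this machine.

function serviceTagNames(string $json): array {
    $data = json_decode($json, true);
    $names = array();
    // the "values"/"name" keys are assumed from the Service Tags JSON shape
    foreach ($data['values'] ?? array() as $tag) {
        $names[] = $tag['name'];
    }
    return $names;
}

$json = shell_exec('az network list-service-tags --location eastus2 --output json');
if ($json !== null) {
    print_r(serviceTagNames($json));
}
```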
If you can do this with proper APIs, as in Satya V's answer, that's a much better solution than web scraping. But if you don't have access to the APIs (it seems you need a subscriptionId to use them), you can grab the download link with the XPath //a[contains(.,'click here to download manually')].
This seems to work:
<?php

$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_URL => "https://www.microsoft.com/en-us/download/confirmation.aspx?id=57064",
    // setting CURLOPT_COOKIEFILE to an empty string enables the cookie engine.
    CURLOPT_COOKIEFILE => "",
    CURLOPT_RETURNTRANSFER => 1,
));
$html = curl_exec($ch);

$domd = new DOMDocument();
// @ suppresses the warnings libxml emits for malformed real-world HTML
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
$download_url = $xp->query("//a[contains(.,'click here to download manually')]")->item(0)->getAttribute("href");
//var_dump($download_url);

curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => 0,
    CURLOPT_URL => $download_url,
));
// with CURLOPT_RETURNTRANSFER disabled, the JSON is streamed straight to stdout
curl_exec($ch);
I want to scrape a site with the Symfony Panther package within a Laravel application. According to the documentation https://github.com/symfony/panther#a-polymorphic-feline I cannot use the HttpBrowser or HttpClient classes because they do not support JS.
Therefore I am trying to use the ChromeClient, which uses a local Chrome executable and a chromedriver binary shipped with the Panther package.
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com');
dd($crawler->html());
Unfortunately, I only receive the empty default chrome page as HTML:
<html><head></head><body></body></html>
Every approach to do something else with the $client or the $crawler instance leads to a "no nodes available" error.
Additionally, I tried the basic example from the documentation https://github.com/symfony/panther#basic-usage with the same result.
I'm using Ubuntu 18.04 Server under WSL on Windows and installed the google-chrome-stable deb package. This seemed to work, because after the installation the "binary not found" error no longer occurs.
I also tried to manually use the executable of the Windows host system, but this only opens an empty CMD window that reopens whenever it is closed; I have to kill the process via Task Manager.
Is this because the Ubuntu server does not have an X server available?
What can I do to receive any HTML?
So, I'm probably late, but I had the same problem, with a pretty easy solution: just build a simple crawler from the response content.
This one differs from the Panther DomCrawler, especially in its methods, but it is safer for evaluating HTML structures.
$client = Client::createChromeClient();
$client->request('GET', 'http://example.com');
$html = $client->getInternalResponse()->getContent();
$crawler = new Symfony\Component\DomCrawler\Crawler($html);
// you can use the following to get the whole HTML
$crawler->outerHtml();
// or specific parts
$crawler->filter('.some-class')->outerHtml();
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'http://example.com');

/**
 * Get all the HTML of the page
 */
$client->getCrawler()->html();

/**
 * For example, filter the field with ID "AuthenticationBlock" and get its text
 */
$loginUsername = $client->getCrawler()->filter('#AuthenticationBlock')->text();
I want to get all the item names and prices from this website. For example, I want to search for "apple":
https://redmart.com/search/apple
I use Goutte for scraping the website. This is the code so far to get all the item names in the list:
$client = new Client();
$crawler = $client->request('GET', 'https://redmart.com/search/apple');
$crawler->filter('h4 > a')->each(function ($node) {
    print $node->text()."\n";
});
But when I run the code, it prints nothing. How do I get all the item names and prices from the list?
The redmart.com website uses React to generate the content, so you cannot use a static scraper like Goutte. Instead, open the developer console in Firefox or Google Chrome and see what's going on.
In this case, a URL is requested (via AJAX) that returns JSON, which React then renders: https://api.redmart.com/v1.6.0/catalog/search?q=apple&pageSize=18&sort=1024&variation=BETA
With PHP, you just use json_decode on the response and you have everything you need.
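A hedged sketch of that approach: fetch the search endpoint directly and decode the JSON. The field names used here (`products`, `title`, `pricing.price`) are assumptions based on inspecting the response in the browser's dev tools; verify them against the actual payload.

```php
<?php
// Sketch: query the JSON API directly instead of scraping the rendered page.
// The products/title/pricing keys are assumed; check them in dev tools.

function extractItems(string $json): array {
    $data = json_decode($json, true);
    $items = array();
    foreach ($data['products'] ?? array() as $product) {
        $items[] = array(
            'name'  => $product['title'] ?? '',
            'price' => $product['pricing']['price'] ?? null,
        );
    }
    return $items;
}

$url = 'https://api.redmart.com/v1.6.0/catalog/search?q=apple&pageSize=18&sort=1024&variation=BETA';
$json = @file_get_contents($url);
if ($json !== false) {
    foreach (extractItems($json) as $item) {
        echo $item['name'], ' - ', $item['price'], PHP_EOL;
    }
}
```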
No need to scrape the site; you can just hit the website's REST API and use the JSON output. For example, this is the API for the apple listing:
https://api.redmart.com/v1.6.0/catalog/search?q=apple&pageSize=18&sort=1024&page=1&variation=BETA
I can't get the Rdio PHP API to work (i.e. retrieve any meaningful data). I need to write a simple function for fetching info about a track/album stored in a ShortCode.
I am using PHP version of this:
https://github.com/rdio/rdio-simple
And this is my PHP code:
require_once 'rdio.php';

$rdio = new Rdio(array("MyClientID", "MyClientSecret"));
$request = $rdio->call("getObjectFromShortCode", array("short_code" => "SomeShortCode"));
return $request;
However, when I do print_r() on the function output, there is no output at all.
I am a bit confused about how to initialize the API and get the response, and the Rdio developer documentation itself doesn't give me a clue.
Would you mind telling me how to properly retrieve data from the API using PHP?
The Rdio PHP library currently doesn't support OAuth 2.0. You can try using a generic OAuth 2.0 library to make requests.
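A hedged sketch of a generic OAuth 2.0 client-credentials token request with curl is below. The token endpoint URL is a placeholder (check Rdio's developer docs for the real one), and the client ID/secret are the same placeholders as in the question.

```php
<?php
// Sketch: generic OAuth 2.0 client-credentials token request.
// The endpoint is a placeholder; substitute the one from the provider's docs.

$ch = curl_init('https://services.rdio.com/oauth2/token'); // placeholder endpoint
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array('grant_type' => 'client_credentials')),
    // OAuth 2.0 allows sending the client credentials via HTTP Basic auth
    CURLOPT_USERPWD        => 'MyClientID:MyClientSecret',
));
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

$accessToken = $response['access_token'] ?? null;
// subsequent API calls would then send an "Authorization: Bearer ..." header
```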
A very open question on which I need some advice and, more importantly, pointers in the right direction.
I'm looking at using OpenStack for my private cloud (currently using VMware). The main aim is to be able to launch a new VM instance from within our web application, so this could be triggered via a PHP page to deploy a new Apache worker server, for example. The next aim is to develop our code to detect when server load is getting high, or when more worker servers are needed to perform a task, and auto-launch an instance.
I've been looking at the OpenStack API to see if this is the best approach, but also at Juju to see if you can use charms to do this, and whether the Juju API is best.
The aim is to get this working alongside VMware or to replace VMware entirely.
My current setup is running OpenStack on a laptop using Nova as the storage, so any help with the pointers would be great.
I know it's an open question.
Well, there is an SDK page listing many of the OpenStack API client SDKs that exist:
https://wiki.openstack.org/wiki/SDKs#PHP
Listed there are two PHP SDKs for OpenStack currently:
https://github.com/rackspace/php-opencloud
https://github.com/zendframework/ZendService_OpenStack
I wouldn't use Juju as an interface, and frankly I am not sure OpenStack is the right tool for what you are doing. But if you want to play with DevStack and get an idea, I think Rackspace's PHP client SDK is probably a good start. DevStack is not a bad way to get that experience either.
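The server-creation example below assumes an authenticated $compute service plus $ubuntu and $twoGbFlavor objects already exist. With rackspace/php-opencloud (1.x) that bootstrap looks roughly like this; the identity endpoint, credentials, region, and IDs are all placeholders for your own cloud, so treat this as a sketch rather than a drop-in.

```php
<?php
// Sketch: bootstrapping a php-opencloud compute service.
// Endpoint, credentials, region, image UUID and flavor ID are placeholders.
require 'vendor/autoload.php';

use OpenCloud\OpenStack;

$client = new OpenStack('https://identity.example.com/v2.0/', array(
    'username'   => 'demo',
    'password'   => 'secret',
    'tenantName' => 'demo-project',
));

$compute = $client->computeService('nova', 'RegionOne');

$ubuntu      = $compute->image('IMAGE_UUID');  // look up UUIDs via $compute->imageList()
$twoGbFlavor = $compute->flavor('FLAVOR_ID');  // or iterate $compute->flavorList()
```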
example of spinning up a server with php-opencloud:
$server = $compute->server();

try {
    $response = $server->create(array(
        'name'   => 'My lovely server',
        'image'  => $ubuntu,
        'flavor' => $twoGbFlavor
    ));
} catch (\Guzzle\Http\Exception\BadResponseException $e) {
    // No! Something failed. Let's find out:
    $responseBody = (string) $e->getResponse()->getBody();
    $statusCode   = $e->getResponse()->getStatusCode();
    $headers      = $e->getResponse()->getHeaderLines();

    echo sprintf("Status: %s\nBody: %s\nHeaders: %s", $statusCode, $responseBody, implode(', ', $headers));
}
This would be a polling function:
use OpenCloud\Compute\Constants\ServerState;

$callback = function($server) {
    if (!empty($server->error)) {
        var_dump($server->error);
        exit;
    } else {
        echo sprintf(
            "Waiting on %s/%-12s %4s%%",
            $server->name(),
            $server->status(),
            isset($server->progress) ? $server->progress : 0
        );
    }
};

$server->waitFor(ServerState::ACTIVE, 600, $callback);