How to set referer Header in Guzzle and get CDN Content - php

I want to scrape a website and am using Guzzle 7.4 and the Symfony DomCrawler.
I successfully retrieved the HTML, but the website uses a CDN to host some resources, and they are not loading because the Referer header is not sent with the requests for those resources.
Below is the code that retrieves the HTML:
<?php
require "vendor/autoload.php";
use Symfony\Component\DomCrawler\Crawler;
// Url
$url = 'scrapingdomain.com';
$headers = [
'referer' => 'examplescrapingdomain.com'
];
$client = new \GuzzleHttp\Client([
'headers' => $headers
]);
// go get the data from url
$response = $client->request('GET', $url);
$html = ''.$response->getBody();
$crawler = new Crawler($html);
echo $html;
?>
If I access the CDN directly and set the Referer header, I get a 200 response.
Code below:
<?php
require "vendor/autoload.php";
use Symfony\Component\DomCrawler\Crawler;
// Url
$url = 'examplecdnresource.com/Images.png';
$headers = [
'referer' => 'examplescrapingdomain.com'
];
$client = new \GuzzleHttp\Client([
'headers' => $headers
]);
// go get the data from url
$response = $client->request('GET', $url);
$html = ''.$response->getBody();
$crawler = new Crawler($html);
echo $html;
?>
I want to fetch the scrapingdomain.com page together with its resources and download the CDN-hosted images it references.

All I needed to do to get the CDN-hosted content referenced in the scraped HTML was to use the file_get_contents function with a stream context that sets the Referer header, and download the data that way rather than through Guzzle. With that I was getting the CSS and image files.
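A minimal sketch of that approach, assuming the CDN only checks the Referer header. The helper names refererContext and downloadWithReferer are hypothetical, and the URLs in the usage note are the placeholders from the question with an https:// scheme added.

```php
<?php
// Hypothetical helper: builds a stream context that sends a Referer header.
// The https:// wrapper reads its options from the 'http' key as well.
function refererContext(string $referer)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Referer: {$referer}\r\n",
        ],
    ]);
}

// Hypothetical helper: downloads one resource and saves it to a local file.
// Returns false if either the download or the write fails.
function downloadWithReferer(string $resourceUrl, string $referer, string $targetPath): bool
{
    $data = file_get_contents($resourceUrl, false, refererContext($referer));
    if ($data === false) {
        return false;
    }
    return file_put_contents($targetPath, $data) !== false;
}
```

It could be called as downloadWithReferer('https://examplecdnresource.com/Images.png', 'https://examplescrapingdomain.com/', __DIR__ . '/Images.png'), once per image or CSS URL extracted from the scraped HTML.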

Related

In PHP why can't I scrape this page with DomCrawler to get css selectors text?

This works on other pages, including this site's homepage, but I cannot use it on the pages I actually want, which are the site's pages for items it has for sale. I'm trying to get the price, shipping cost, title, description and other details. I'm using Symfony\Component\DomCrawler\Crawler and Symfony\Component\CssSelector, but they will not work with the page I am trying to scrape. Is there something else I should be doing, or what is the problem? I've tried Requests and Goutte and it still won't work. I am stuck; something is going on here that I do not understand.
Page i'm scraping https://app.cjdropshipping.com/product-detail.html?id=189F56FD-0D65-42C7-88A6-093FA8D7ADB0&push_id=&fromType=
The text I am getting in Chrome DevTools from the selector:
require "vendor/autoload.php";
require "vendor/rmccue/requests/library/Requests.php";
///require "vendor/rmccue/requests/library/Requests";
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\CssSelector\CssSelectorConverter;
use Symfony\Component\CssSelector;
$url = 'https://app.cjdropshipping.com/product-detail.html?id=189F56FD-0D65-42C7-88A6-093FA8D7ADB0&push_id=&fromType=';
$headers = array('Content-Type' => 'application/json');
$data = array('price' => '.pd-price-span1');
$response = Requests::post($url, $headers, json_encode($data));
//echo $response->p->text;
$crawler = new Crawler($response->body);
$filter = '#merch-price';
$catsHTML = $crawler
->filter($filter)
->each(function (Crawler $node) {
return $node->html();
});
var_dump($catsHTML);
//var_dump( $catsHTML);
//var_dump($response->body);

How to parse AJAX by cURL?

I am trying to parse this url:
https://halykbank.kz/presscenter/novosti
The news items are loaded by AJAX. In the Network tab I found the URL that should return the loaded news, but it returns only
{"result":false,"hint":"NO_AUTHORIZATION_DATA"}
Why am I getting this? Here is the URL that I think should return the loaded news:
https://backend.halykbank.kz/struct/category-items?categoryId=199&sort=position%20desc&offset=200&limit=100
You must send these headers:
Auth-User-Id, Auth-Token, Auth-Time
composer require guzzlehttp/guzzle:~6.0
$client = new GuzzleHttp\Client();
$res = $client->get('https://api.github.com/user', ['auth' => ['user', 'pass']]);
echo $res->getStatusCode(); // 200
$json_string = $res->getBody();
$json_obj = json_decode($json_string); // returns an object
$json_arr = json_decode($json_string,true); //return Array
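Since the question asks about cURL, here is a rough sketch of sending those three headers with PHP's cURL extension. The helper names authHeaders and fetchJson are hypothetical, and the header values are placeholders: the answer does not say how to obtain them, so they would have to be copied from a real request in the browser's Network tab.

```php
<?php
// Hypothetical helper: formats the three auth headers as "Name: value"
// strings, the form CURLOPT_HTTPHEADER expects.
function authHeaders(string $userId, string $token, string $time): array
{
    return [
        "Auth-User-Id: {$userId}",
        "Auth-Token: {$token}",
        "Auth-Time: {$time}",
    ];
}

// Hypothetical helper: performs a GET request with the given headers and
// decodes the JSON body into an array. Returns null on failure.
function fetchJson(string $url, array $headers)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : json_decode($body, true);
}
```

For example, fetchJson('https://backend.halykbank.kz/struct/category-items?categoryId=199&sort=position%20desc&offset=200&limit=100', authHeaders('PLACEHOLDER_ID', 'PLACEHOLDER_TOKEN', 'PLACEHOLDER_TIME')) would send the request from the question with the placeholder values.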

Upload an image in wordpress using REST API

Using the REST API I can create a post, get blog categories, etc., but I cannot upload an image. I am referring to
https://github.com/WP-API/client-php/blob/master/library/WPAPI/Media.php
and
http://wp-api.org/#entities_media-meta_width
and my code is
$data = array('file'=>$filePath,'is_image'=>true);
print_r($data);
$headers = array('Content-Type' => 'application/octet-stream');
$response = $this->api->post(WPAPI::ROUTE_MEDIA, $headers, $data);
They talk about $data in $response = $this->api->post(WPAPI::ROUTE_MEDIA, $headers, $data);
What key/value pairs are used in $data?
Finally, I fixed it myself:
specify the file path
put necessary headers
send post request
Here is the code:
$filePath = 'URL of image'; //right click your fav img and copy url.
$imageData = file_get_contents($filePath); //get image content
$headers = array('Content-Type' => 'application/json; charset=UTF-8',
'Content-Disposition' => 'attachment; filename=' . basename($filePath));
$response = $this->api->post('/wp-json/media', $headers, $imageData);
print_r($response);
Check the WordPress Media Gallery.
Enjoy!
Try this:
1. Open the Chrome browser.
2. Install a REST client app (https://chrome.google.com/webstore/search/rest%20client?utm_source=chrome-ntp-icon).
3. Open the app and use it to send the request.

"Invalid content-type specified" when uploading file to Apigility

I'm building an RPC API using Zend Framework 2 and Apigility by Zend Framework. For testing, I use the Chrome extension Postman REST Client.
I can do POST requests without problems when I use Postman.
But my code doesn't work.
$client = new \Zend\Http\Client();
$client->setUri($uri)
->setMethod('POST')
->setParameterPost(
array(
'file' => '/home/user/Downloads/file.csv'
)
);
$headers = new \Zend\Http\Headers();
$headers->addHeaders(array(
'Accept' => 'application/json;',
));
$client->setHeaders($headers);
$client->setStream();
$response = $client->send();
$file = '/home/user/Downloads/file.csv';
$file2 = '/home/user/Downloads/file2.csv';
copy($response->getStreamName(), $file);
$fp = fopen($file2, "w");
stream_copy_to_stream($response->getStream(), $fp);
$client->setStream($file);
$responce = $client->send();
echo $responce->getBody();
I tried passing a different Content-Type header, but that leads to a fatal error.
Which headers do I need to pass to make it work?
You have to set your Content-Type header to multipart/form-data. The Postman add-on you are using in Chrome does this automatically for file uploads.
So set your headers like this:
$headers->addHeaders(array(
'Accept' => 'application/json',
'Content-Type' => 'multipart/form-data'
));
If that doesn't work please be more specific about the fatal error you get.
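For comparison, here is a sketch of the same kind of upload using PHP's cURL extension instead of Zend\Http\Client (a swapped-in technique, not the code from the question). When CURLOPT_POSTFIELDS is given an array, cURL builds the multipart/form-data body and its Content-Type header itself, much like Postman does. The helper names uploadFields and uploadFile are hypothetical, the field name 'file' is taken from the question, and the 'text/csv' MIME type is an assumption.

```php
<?php
// Hypothetical helper: builds the POST fields for a multipart upload.
// CURLFile tells cURL to send the file's contents, not just its path.
function uploadFields(string $path): array
{
    return ['file' => new CURLFile($path, 'text/csv', basename($path))];
}

// Hypothetical helper: POSTs the file and returns the response body,
// or false if the request fails.
function uploadFile(string $uri, string $path)
{
    $ch = curl_init($uri);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => uploadFields($path), // array => multipart/form-data
        CURLOPT_HTTPHEADER     => ['Accept: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    return curl_exec($ch);
}
```

A call such as uploadFile($uri, '/home/user/Downloads/file.csv') would then send the CSV from the question as a file upload.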

PHP cURL HTTP GET XML Format

I have an application that has a Web Services RESTful API. When I make HTTP GET requests in the browser I get XML responses back.
When I make the same request using PHP I get the correct information but it is not formatted in XML and so I can't pass it to Simple XML.
Here's my code.
<?php
//Define user credentials to use with requests
$user = "user";
$passwd = "user";
//Define header array for cURL requests
$header = array('Content-Type:application/xml', 'Accept:application/xml');
//Define base URL
$url = 'http://192.168.0.100:8080/root/restful/';
//Define http request nouns
$ls = $url . "landscapes";
//Initialise cURL object
$ch = curl_init();
//Set cURL options
curl_setopt_array($ch, array(
CURLOPT_HTTPHEADER => $header, //Set http header options
CURLOPT_URL => $ls, //URL sent as part of the request
CURLOPT_HTTPAUTH => CURLAUTH_BASIC, //Set Authentication to BASIC
CURLOPT_USERPWD => $user . ":" . $passwd, //Set username and password options
CURLOPT_HTTPGET => TRUE //Set cURL to GET method
));
//Define variable to hold the returned data from the cURL request
$data = curl_exec($ch);
//Close cURL connection
curl_close($ch);
//Print results
print_r($data);
?>
Any thoughts or suggestions would be really helpful.
S
EDIT:
So this is the response I get from the PHP code:
0x100000rhel-mlsptrue9.2.3.0101
This is the response if I use the WizTools Rest Client or a browser.
<?xml version="1.0" encoding="UTF-16"?>
<landscape-response total-landscapes="1" xmlns="http://www.url.com/root/restful/schema/response">
<landscape>
<id>0x100000</id>
<name>rhel-mlsp</name>
<isPrimary>true</isPrimary>
<version>9.2.3.010</version>
</landscape>
</landscape-response>
As you can see the information is there but the PHP is not really presenting this in a useful way.
I was able to find the answer to this question so I thought I would share the code here.
//Initialise curl object
$ch = curl_init();
//Define curl options in an array
$options = array(CURLOPT_URL => "http://192.168.0.100/root/restful/<URI>",
CURLOPT_PORT => "8080",
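CURLOPT_HTTPHEADER => array("Content-Type: application/xml"), //request headers go in CURLOPT_HTTPHEADER; CURLOPT_HEADER only toggles response headers in the output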
CURLOPT_USERPWD => "<USER>:<PASSWD>",
CURLOPT_HTTPAUTH => CURLAUTH_BASIC,
CURLOPT_RETURNTRANSFER => TRUE
);
//Set options against curl object
curl_setopt_array($ch, $options);
//Assign execution of curl object to a variable
$data = curl_exec($ch);
//Close curl object
curl_close($ch);
//Pass results to the SimpleXMLElement function
$xml = new SimpleXMLElement($data);
print_r($xml);
As you can see, the code is not all that different; the main thing was separating the port out of the URL and into its own option.
Hopefully this helps someone else out!!!
S
Try this
$resp = explode("\n<?", $data);
$response = "<?{$resp[1]}";
$xml = new SimpleXMLElement($response);
Does your code print anything at all? Try using echo $data, but press F12 and view the result in the browser's developer tools.
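One possible reading of the output in the question is that the XML was returned intact and the browser simply hid the tags when rendering the page, leaving only the text. A small sketch of how to check that, using a simplified copy of the response from the question as sample data (the XML declaration and namespace are left out here for brevity, which is an assumption about what the parser needs):

```php
<?php
// Simplified sample based on the response shown in the question.
$data = '<landscape-response total-landscapes="1">'
      . '<landscape><id>0x100000</id><name>rhel-mlsp</name>'
      . '<isPrimary>true</isPrimary><version>9.2.3.010</version></landscape>'
      . '</landscape-response>';

// Escaping the markup makes the tags visible when the page is viewed in a browser.
$escaped = htmlspecialchars($data);
echo $escaped, PHP_EOL;

// The raw string can still be parsed and queried as XML.
$xml = new SimpleXMLElement($data);
$name = (string) $xml->landscape->name;   // text of the <name> element
$total = (int) $xml['total-landscapes'];  // attribute on the root element
echo $name, ' ', $total, PHP_EOL;
```

If the escaped output shows the full markup, the response was XML all along and can be handed to SimpleXMLElement directly.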
