I'm new to cURL.
I have been trying to scrape the contents of this Amazon link (i.e., the image, book title, author and price of the 20 books) into an HTML page. So far all I've managed is to print the page using the code below:
<?php
function curl($url) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true,   // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_URL            => $url,
    );
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$url = "http://www.amazon.in/gp/bestsellers/books/1318209031/ref=zg_bs_nav_b_2_1318203031";
$results_page = curl($url);
echo $results_page;
?>
I have tried using regex and failed; I have tried everything possible for six hours straight and got really tired, so I'm hoping to find a solution here. Just saying thanks isn't enough for a solution, but thank you in advance. :)
UPDATE: Found a really helpful site (click here) for beginners like me (without using cURL, though).
You really should be using the AWSECommerce API, but here's a way to leverage Yahoo's YQL service:
<?php
$query = sprintf(
    'http://query.yahooapis.com/v1/public/yql?q=%s',
    urlencode('SELECT * FROM html WHERE url = "http://www.amazon.in/gp/bestsellers/books/1318209031/ref=zg_bs_nav_b_2_1318203031" AND xpath=\'//div[@class="zg_itemImmersion"]\'')
);

$xml = new SimpleXMLElement($query, 0, true);

foreach ($xml->results->div as $product) {
    vprintf("%s\n", array(
        $product->div[1]->div[1]->a,
    ));
}
/*
Engineering Thermodynamics
A Textbook of Fluids Mechanics
The Design of Everyday Things
A Forest History of India
Computer Networking
The Story of Microsoft
Private Empire: ExxonMobil and Americ...
Project Management Metrics, KPIs, and...
Design and Analysis of Experiments: I...
IES - 2013: General English
Foundation of Software Testing: ISTQB...
Faster: 100 Ways to Improve your Digi...
A Textbook of Fluid Mechanics and Hyd...
Software Engineering for Embedded Sys...
Communication Skills for Engineers
Making Things Move DIY Mechanisms for...
Virtual Instrumentation Using Labview
Geometric Dimensioning and Tolerancin...
Power System Protection & Switchgear...
Computer Networks
*/
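If you'd rather skip YQL and parse the page locally using the curl() function from the question, a DOMXPath sketch could look like this (the zg_itemImmersion class and the inner structure are assumptions carried over from the XPath above, and Amazon's markup may have changed):
<?php
// Sketch: fetch the page with the question's curl() helper, then query it
// with DOMXPath instead of a regex.
$html = curl("http://www.amazon.in/gp/bestsellers/books/1318209031/ref=zg_bs_nav_b_2_1318203031");

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // the page will not be valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="zg_itemImmersion"]') as $item) {
    // Each item's text blob contains title, author and price; splitting
    // them apart would need further (assumed) selectors.
    echo trim(preg_replace('/\s+/', ' ', $item->textContent)), "\n";
}
?>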
First time asking this kind of question. I am working on a small internal app that works with an external API for provisioning purposes.
The code is essentially comprised of a series of forms where the user inputs data; this data is then sent to the API to register new customers.
This is quite a linear process, which I will try to explain:
Contact Creation: Basic customer info (email, address...).
Customer Creation: More advanced info; for a customer to be created, there must be a contactId belonging to this customer.
Subscriber Creation: Info linking the customer to the service acquired. The previous step has to be completed and a customerId must exist.
Service Creation: The last bits of advanced info about the service. Once again, a subscriberId needs to exist in order to link the service.
I've managed a more or less quick process with a few tweaks here and there, but the first step (Contact) has a method I can't seem to improve, which in turn causes this process to take up to a full minute!
Like the whole process described earlier, the creation of each one of these is very linear too.
The API documentation states that the results of any GET should be paginated to a max of 10 entries, but following that advice pushes the total time past a minute. Manual experiments showed that the best ratio is about 500 entries per page, and in some cases fetching the whole set of entries at once proved to be the fastest way, rather than going 10 by 10.
Since the contact email can't be duplicated, one of the first things to do is check the email provided in the form against all the emails already stored.
In order to provide the $page and $entries to the API call, I must first fetch the total number of contacts. This number appears in the response when calling the API to get the contacts. So the first method I use is:
function fetchTotalContacts($uri, $auth){
    $ch = curl_init();
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_URL            => $uri.'?page=1&rows=1',
        // Double quotes so $auth actually gets interpolated into the header.
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json', "Authorization: $auth")
    );
    curl_setopt_array($ch, $options);
    $response = curl_exec($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($http_code != 200) {
        echo "Error in fetchTotalContacts() - Code: $http_code | ";
    }
    curl_close($ch);
    $response = json_decode($response, true);
    $totalContacts = $response['total_count'];
    return $totalContacts;
}
Now, having $totalContacts, I can proceed and check whether the email has already been registered, and this is the step I suspect is responsible for the high execution time. This method searches the contacts and their emails; if it finds no match, it proceeds to create the contact with the data provided.
function checkDuplicatedEmail($uri, $totalContacts, $contactEmailArray, $auth, $contactEmail, $dataContact){
    $ch = curl_init();
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_URL            => $uri.'?page=1&rows='.$totalContacts,
        // Double quotes so $auth actually gets interpolated into the header.
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json', "Authorization: $auth")
    );
    curl_setopt_array($ch, $options);
    $customers = curl_exec($ch);
    $customers = json_decode($customers, true);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($http_code != 200) {
        echo "Error in checkDuplicatedEmail() - Code: $http_code | ";
    }
    curl_close($ch);
    /*
    foreach ($customers['_embedded']['ngcp:customercontacts'] as $customer) {
        $email = $customer['email'];
        array_push($contactEmailArray, $email);
    }
    if (in_array($contactEmail, $contactEmailArray)) {
        echo('The email provided has already been registered in the database');
        die();
    } else {
        $contactCreated = createContact($uri, $dataContact);
        return $contactCreated;
    }
    */
    $repeated = 0;
    for ($i = 0; $i < $totalContacts; $i++) {
        // Note: $customers, not $customer, is the decoded response.
        if ($contactEmail == $customers["_embedded"]["ngcp:customercontacts"][$i]["email"]) {
            $repeated += 1;
        }
    }
    if ($repeated > 0) {
        // die() takes a string directly; die(echo(...)) is a syntax error.
        die('The email provided has already been registered in the database');
    } else {
        $contactCreated = createContact($uri, $dataContact, $auth);
        return $contactCreated;
    }
}
As you can see, these are the quickest options I've found, and both make the whole process take 40s, which is still too much.
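(For reference, the live loop above can be written more compactly with array_column(), which at least avoids the indexing bug; a sketch, assuming the response shape used in the function:)
// Sketch: build the email list once and test membership directly.
$emails = array_column($customers['_embedded']['ngcp:customercontacts'], 'email');
if (in_array($contactEmail, $emails, true)) {
    die('The email provided has already been registered in the database');
}
If the API happens to support server-side filtering (say, an email query parameter; an assumption to verify in its docs), a single filtered GET would replace the whole scan.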
The response from createContact($uri, $dataContact, $auth); is a status code (400, 201...), so when I want to go to the next step I need to, again, search all the contacts to find the one I just created and get its id. Fortunately, here I can simply skip to the last 20 contacts (not the last one directly, so it can be used simultaneously without issues) and search there, which makes it really quick; but for the email there is no such shortcut, all the entries must be checked.
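(For illustration, the last-page lookup described above might be computed like this; a sketch, assuming the API's page/rows parameters behave as in the earlier calls:)
// Sketch: request only the final chunk of contacts.
$rows = 20;
$lastPage = max(1, (int) ceil($totalContacts / $rows));
$lastPageUrl = $uri . '?page=' . $lastPage . '&rows=' . $rows;
// The response would then be scanned for the new contact's email to
// recover its contactId.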
I don't know how to bring the time down here; the rest of the code consists of fetching the contactId and creating the customer, so there is not much to do there as it is now.
If any of you deem it necessary to see the rest of the page, I will update the post.
As a final reminder: I have manually tried different configurations of pages and entries, and for this page the fastest was 1 page with all the entries. I've also tried taking the for/foreach loop outside the method, to no avail. All advice is welcome.
Thanks for the help provided!
What is the most effective way of programmatically filling out an HTML form on a website, using data from a dataset (either CSV, JSON, or similar..) and then retrieving the results of that submitted form into another dataset? I would like to be able to do this multiple times, populating the form with different parameters each time, always retrieving those parameters from my input dataset.
I was reading about Selenium and HTMLUnit, which seem to do similar things. But they require installing dependencies and learning how to use them. Would it be overkill? Is there an easier way to do this by maybe writing my own script?
I tried writing a PHP cURL script, but it doesn't generate the headers or cookies that the request requires, so I'm not able to retrieve anything.
<?php
/**
 * Send a POST request using cURL
 * @param string $url     to request
 * @param array  $post    values to send
 * @param array  $options for cURL
 * @return string
 */
function curl_post($url, array $post = NULL, array $options = array())
{
    $defaults = array(
        CURLOPT_POST           => 1,
        CURLOPT_HEADER         => 0,
        CURLOPT_URL            => $url,
        CURLOPT_FRESH_CONNECT  => 1,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_FORBID_REUSE   => 1,
        CURLOPT_TIMEOUT        => 4,
        // Cast guards against $post being NULL.
        CURLOPT_POSTFIELDS     => http_build_query((array) $post)
    );

    $ch = curl_init();
    // Entries in $options take precedence over $defaults.
    curl_setopt_array($ch, ($options + $defaults));
    if ( ! $result = curl_exec($ch))
    {
        trigger_error(curl_error($ch));
    }
    curl_close($ch);
    return $result;
}
?>
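For what it's worth, here is a sketch of driving this from a CSV, with a cookie jar so any session cookies the site sets get replayed on later requests (the form URL, the CSV layout and the file names are placeholders, not from the original post):
<?php
// Sketch only: assumes curl_post() from above is loaded, and that the
// target form and CSV columns are adjusted to the real site.
$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');
$options = array(
    CURLOPT_COOKIEJAR  => $cookieJar,  // save cookies after each request
    CURLOPT_COOKIEFILE => $cookieJar,  // send them back on the next one
);

$in     = fopen('input.csv', 'r');
$out    = fopen('results.csv', 'w');
$header = fgetcsv($in);                // e.g. array('name', 'email')

while (($row = fgetcsv($in)) !== false) {
    $post = array_combine($header, $row);            // field name => value
    $html = curl_post('http://example.com/form', $post, $options);
    // Parsing the response is site-specific; store it raw for now.
    fputcsv($out, array($row[0], $html === false ? 'ERROR' : $html));
}

fclose($in);
fclose($out);
?>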
I'm not sure if that's the right approach.
Any tips/resources would be appreciated.
You can write this script with Selenium - it's just a browser driver, and it will fill in the form from the client side. If the page isn't very complicated, you can use the requests library in Python and send POST data directly to the final page. Requests is the faster option, and writing a script that sends POST data will need about 5 minutes of learning.
I want to use the Azure Computer Vision API to generate thumbnails for my WordPress site. I'm trying to make it work in PHP with wp_remote_post, but I don't know how to pass the parameters. It returns a thumbnail in really bad quality at the default 500x500px. Any ideas on how to resolve this issue?
function get_thumbnail($URL) // * * * * Azure Computer Vision API - v1.0 * * * *
{
    $posturl = 'https://api.projectoxford.ai/vision/v1.0/generateThumbnail';
    $request = wp_remote_post($posturl, array(
        'headers' => array(
            'Content-Type' => 'application/json',
            'Ocp-Apim-Subscription-Key' => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'),
        'body' => array('url' => $URL)
    ));
    if ( is_wp_error( $request ) )
    {
        $error_message = $request->get_error_message();
        return "Something went wrong: $error_message";
    }
    else
    {
        return $request['body'];
    }
}
EDIT 1
Thanks @Gary, you're right! Now the cropping is correct, but I've got a huge problem with the quality. I'm using a trial, but I see no info from Azure about downgrading thumbnail quality for trial users. They claim to deliver high-quality thumbnails, but if that's the standard, it's totally useless.
I must have overlooked something, I guess?
Of course, Gary; if I get no correct answer to my quality question, I will close the thread with your answer as correct.
According to the description of Get Thumbnail, the width, height and smartCropping should be set as request parameters, combined into the URL.
However, the second argument of wp_remote_post() does not accept URL parameters and will do nothing with them. So you need to build the full URL before passing it to wp_remote_post().
You can try to use add_query_arg() to combine your URL first:
$posturl = 'https://api.projectoxford.ai/vision/v1.0/generateThumbnail';
$posturl = add_query_arg( array(
    'width'         => 600,
    'height'        => 400,
    'smartCropping' => true
), $posturl );
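and then pass the combined URL to wp_remote_post(). Since the Content-Type is application/json, the body should presumably be JSON-encoded as well; for example (a sketch):
$request = wp_remote_post($posturl, array(
    'headers' => array(
        'Content-Type' => 'application/json',
        'Ocp-Apim-Subscription-Key' => 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'),
    // With a JSON content type, the body is encoded as JSON too.
    'body' => wp_json_encode(array('url' => $URL))
));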
I have a small element on my website that displays the validity of the current page's markup. At the moment, it is statically set as "HTML5 Valid", as I constantly check whether it is, in fact, HTML5 valid. If it's not then I fix any issues so it stays HTML5-valid.
I would like this element to be dynamic, though. So, is there any way to ping the W3C Validation Service with the current URL, receive the result and then plug the result into a PHP or JavaScript function? Does the W3C offer an API for this or do you have to manually code this?
Maintainer of the W3C HTML Checker (aka validator) here. In fact the checker does expose an API that lets you do, for example:
https://validator.w3.org/nu/?doc=https%3A%2F%2Fgoogle.com%2F&out=json
…which gives you the results back as JSON. There’s also a POST interface.
You can find more details here:
https://github.com/validator/validator/wiki/Service-»-HTTP-interface
https://github.com/validator/validator/wiki/Service-»-Input-»-POST-body
https://github.com/validator/validator/wiki/Service-»-Input-»-GET
https://github.com/validator/validator/wiki/Output-»-JSON
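For example, from PHP the GET interface can be driven with nothing more than file_get_contents() (a minimal sketch; the User-Agent string is a placeholder):
<?php
// Minimal sketch: validate a URL via the checker's GET interface.
$target = 'https://google.com/';
$api    = 'https://validator.w3.org/nu/?doc=' . urlencode($target) . '&out=json';

// Sending a descriptive User-Agent is good practice with this service.
$context = stream_context_create(array(
    'http' => array('user_agent' => 'my-validator-badge/0.1'),
));

$result = json_decode(file_get_contents($api, false, $context), true);

// Count only the messages whose type is "error".
$errors = array_filter($result['messages'], function ($m) {
    return $m['type'] === 'error';
});
echo $errors ? 'Invalid' : 'Valid';
?>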
They do not have an API that I am aware of.
As such, my suggestion would be:
Send a GET request to the result page (http://validator.w3.org/check?uri=) with your page's URL (using file_get_contents() or cURL), then parse the response for the "valid" message (with DOMDocument or a simple string search).
Note: this is a brittle solution, subject to breaking if anything changes on the W3C's side. However, it works, and the tool has been available for several years.
Also, if you truly want this on your live site I'd strongly recommend some kind of caching. Doing this on every page request is expensive. Honestly, this should be a development tool. Something that is run and reports the errors to you. Keep the badge static.
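A sketch of that string-search approach (the success message shown here is what the legacy service displayed at the time of writing, so verify it before relying on it):
// Sketch: legacy markup validator, simple string search on the result page.
$response = file_get_contents(
    'http://validator.w3.org/check?uri=' . urlencode('http://example.com/')
);
$isValid = ($response !== false)
    && (strpos($response, 'This document was successfully checked') !== false);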
Here is an example of how to use the W3C API to validate HTML in PHP:
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL            => "https://validator.w3.org/nu/?out=json",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING       => "",
    CURLOPT_MAXREDIRS      => 10,
    CURLOPT_TIMEOUT        => 30,
    CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST  => "POST",
    CURLOPT_POSTFIELDS     => '<... your html text to validate ...>',
    CURLOPT_HTTPHEADER     => array(
        "User-Agent: Any User Agent",
        "Cache-Control: no-cache",
        // The charset belongs inside the Content-type header.
        "Content-type: text/html; charset=utf-8"
    ),
));

$response = curl_exec($curl);
$err      = curl_error($curl);
curl_close($curl);

if ($err) {
    // handle error here
    die('sorry etc...');
}

$resJson = json_decode($response, true);
$resJson will look like this:
{
"messages": [
{
"type": "error",
"lastLine": 13,
"lastColumn": 110,
"firstColumn": 5,
"message": "Attribute “el” not allowed on element “link” at this point.",
"extract": "css\">\n <link el=\"stylesheet\" href=\"../css/plugins/awesome-bootstrap-checkbox/awesome-bootstrap-checkbox.min.css\">\n <",
"hiliteStart": 10,
"hiliteLength": 106
},
{
"type": "info",
"lastLine": 294,
"lastColumn": 30,
"firstColumn": 9,
"subType": "warning",
"message": "Empty heading.",
"extract": ">\n <h1 id=\"promo_codigo\">\n ",
"hiliteStart": 10,
"hiliteLength": 22
},....
Check https://github.com/validator/validator/wiki/Service-»-Input-»-POST-body for more details.
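To act on that structure, a short loop over the messages is enough; for example (a sketch):
foreach ($resJson['messages'] as $msg) {
    // Each entry carries a type (error/info), a location and a message.
    printf("%s at line %d: %s\n",
        $msg['type'],
        isset($msg['lastLine']) ? $msg['lastLine'] : 0,
        $msg['message']);
}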
I am trying to scrape an ASPX page with PHP cURL code; the page shows data page by page. Initially the page loads with the GET method, but when we select a page number from the drop-down, it submits the page using the POST method.
I want to fetch the data of a particular page number by passing POST fields to cURL, but I couldn't get it to work.
I have written a dummy code to get the records of the 5th page, but it always returns the results of the first page.
Sample code
$url = 'http://www.ticketalternative.com/SitePages/Search.aspx?catid=All&pattern=Enter%20Artist%2c%20Team%2c%20or%20Venue';
$file = file_get_contents($url);

//<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value=
preg_match_all("#<input.*?name=\"__VIEWSTATE\".*?value=\"(.*?)\".*?>.*?<input.*?name=\"__EVENTVALIDATION\".*?value=\"(.*?)\".*?>#mis", $file, $arr_viewstate);

$viewstate       = urlencode($arr_viewstate[1][0]);
$eventvalidation = urlencode($arr_viewstate[2][0]);

$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => true,     // include response headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "spider", // who am i
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 1120,     // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    CURLOPT_POST           => true,
    CURLOPT_VERBOSE        => true,
    CURLOPT_POSTFIELDS     => '__EVENTTARGET='.urlencode('ctl00$ContentPlaceHolder1$SearchResults1$SearchResultsGrid$ctl13$ctl05').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&__EVENTVALIDATION='.$eventvalidation.'&__LASTFOCUS='.urlencode('').'&ctl00$ContentPlaceHolder1$SearchResults1$SearchResultsGrid$ctl13$ctl05=4'
);

$ch = curl_init($url);
curl_setopt_array($ch, $options);
$result = curl_exec($ch);
curl_close($ch);

preg_match_all('/<a id=\".*?LinkToVenue\" href=\"(.*?)\">(.*?)<\/a>/ms', $result, $matches);
print_r($matches);
Can anybody help me out with this? Where am I going wrong? I think it's not working because the first time the page loads it uses the GET method, and as we follow the page links it uses POST.
How will I get the records of a particular page number?
Regards
I write scrapers in PHP sometimes, when a client requires it, but I would never attempt to scrape an ASP.NET site with PHP. For that you need Perl, Python or Ruby; all three have a Mechanize library that usually makes this easy.