I am trying to scrape an ASPX page that shows its data page by page, using PHP cURL. The page initially loads with a GET request, but when a page number is selected from the drop-down, the page submits itself with a POST request.
I want to fetch the data for a particular page number by passing POST fields to cURL, but I have not been able to.
I have written some dummy code to get the records of the 5th page, but it always returns the results of the first page.
Sample code
$url = 'http://www.ticketalternative.com/SitePages/Search.aspx?catid=All&pattern=Enter%20Artist%2c%20Team%2c%20or%20Venue';
$file=file_get_contents($url);
//<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value=
preg_match_all("#<input.*?name=\"__VIEWSTATE\".*?value=\"(.*?)\".*?>.*?<input.*?name=\"__EVENTVALIDATION\".*?value=\"(.*?)\".*?>#mis", $file, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1][0]);
$eventvalidation = urlencode($arr_viewstate[2][0]);
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => true, // include response headers in the output
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 1120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_POST => true,
CURLOPT_VERBOSE => true,
CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ctl00$ContentPlaceHolder1$SearchResults1$SearchResultsGrid$ctl13$ctl05').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&__EVENTVALIDATION='.$eventvalidation.'&__LASTFOCUS='.urlencode('').'&ctl00$ContentPlaceHolder1$SearchResults1$SearchResultsGrid$ctl13$ctl05=4');
$ch = curl_init($url);
curl_setopt_array($ch,$options);
$result = curl_exec($ch);
curl_close($ch);
preg_match_all('/<a id=\".*?LinkToVenue\" href=\"(.*?)\">(.*?)<\/a>/ms',$result,$matches);
print_r($matches);
Can anybody help me out with this? Where am I going wrong? I think it is not working because the page first loads with the GET method, while the page links use POST.
How can I get the records of a particular page number?
Regards
I sometimes write scrapers in PHP when a client requires it, but I would never attempt to scrape an ASP.NET site with PHP. For that you need Perl, Python, or Ruby. All three have a mechanize library that usually makes it easy.
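For anyone who still wants to try the postback from PHP, here is a minimal sketch of the two steps the question attempts by hand: collecting the hidden fields and building the POST body. The function names `extract_hidden_fields` and `build_postback_body` are made up for illustration, and the sketch assumes the page keeps its state in hidden inputs such as `__VIEWSTATE`. Whether the site also needs a session cookie or other fields is not known from the question.

```php
<?php
// Collect every hidden <input> on the page as a name => value array.
// ASP.NET pages usually keep __VIEWSTATE and __EVENTVALIDATION there.
function extract_hidden_fields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Build the body of a postback triggered by the control $target with $value.
function build_postback_body(array $hidden, string $target, string $value): string
{
    $fields = array_merge($hidden, array(
        '__EVENTTARGET'   => $target,
        '__EVENTARGUMENT' => '',
        $target           => $value,
    ));
    // http_build_query encodes both names and values, including the "$"
    // characters in ASP.NET control names.
    return http_build_query($fields);
}
```

The returned string could be passed as CURLOPT_POSTFIELDS in place of the hand-built string in the question. Since http_build_query does the encoding, the values should not be run through urlencode() first.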
Related
What is the most effective way of programmatically filling out an HTML form on a website, using data from a dataset (either CSV, JSON, or similar..) and then retrieving the results of that submitted form into another dataset? I would like to be able to do this multiple times, populating the form with different parameters each time, always retrieving those parameters from my input dataset.
I was reading about Selenium and HTMLUnit, which seem to do similar things. But they require installing dependencies and learning how to use them. Would it be overkill? Is there an easier way to do this by maybe writing my own script?
I tried writing a php curl script, but this one doesn't generate the headers or cookies that the request requires, so I'm not able to retrieve anything.
<?php
/**
* Send a POST request using cURL
* @param string $url to request
* @param array $post values to send
* @param array $options for cURL
* @return string
*/
function curl_post($url, array $post = array(), array $options = array())
{
$defaults = array(
CURLOPT_POST => 1,
CURLOPT_HEADER => 0,
CURLOPT_URL => $url,
CURLOPT_FRESH_CONNECT => 1,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_FORBID_REUSE => 1,
CURLOPT_TIMEOUT => 4,
CURLOPT_POSTFIELDS => http_build_query($post)
);
$ch = curl_init();
curl_setopt_array($ch, ($options + $defaults));
if( ! $result = curl_exec($ch))
{
trigger_error(curl_error($ch));
}
curl_close($ch);
return $result;
}
?>
I'm not sure if that's the right approach.
Any tips/resources would be appreciated.
You can write this script in Selenium: it's just a browser driver, so it will fill in the form from the client side. If the page isn't very complicated, you can use the Python requests library and send the POST data directly to the final page. Requests is faster, and writing a script that sends POST data takes about five minutes of learning.
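If you would rather stay in PHP, the "fill the form from a dataset" part can be sketched as below. This is only an illustration: `csv_to_rows` is a hypothetical helper, and it assumes the first CSV line holds the form's field names and that no quoted field contains a line break.

```php
<?php
// Turn CSV text into one name => value array per data row.
// Assumes the first line is a header containing the form's field names.
function csv_to_rows(string $csv): array
{
    $lines = preg_split('/\r\n|\r|\n/', trim($csv));
    $header = str_getcsv(array_shift($lines), ',', '"', '\\');
    $rows = array();
    foreach ($lines as $line) {
        if ($line === '') {
            continue; // skip blank lines
        }
        // array_combine fails if a row has a different number of columns
        // than the header, which surfaces malformed input early.
        $rows[] = array_combine($header, str_getcsv($line, ',', '"', '\\'));
    }
    return $rows;
}
```

Each row could then be passed to the `curl_post($url, $row)` function from the question, collecting the return values into a result array. If the target site needs cookies, the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options can be passed through `$options` to keep them between requests.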
Trying to help out someone who is trying to access an API using PHP. My ColdFusion code posts to the API fine, but we can't get the PHP to work. In CF the code uses URL params to send the data:
<cfhttp url="https://example.com/_api/proxyApi.cfc" method="post" result="httpResult" charset="UTF-8">
<cfhttpparam type="url" name="method" value="apiauth"/>
<cfhttpparam type="url" name="argumentCollection" value="#jsData#"/>
</cfhttp>
A dump of the resulting call from the API shows the variables in the URL like this:
method = apiauth is the main authorization function, and the JSON string in argumentCollection is then passed by apiauth to the correct function in the API.
From PHP, his cURL call posts the values as form data, not in the URL, and the API complains that the required information is missing because it's in the wrong scope. I've been trying to figure out how to make cURL use the URL scope instead:
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_URL => $target_url,
CURLOPT_POST => 1,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 2,
CURLOPT_AUTOREFERER => true,
CURLOPT_POSTFIELDS => array(
'method' => 'apiauth',
'argumentCollection' => $json
)
));
The same dump from the API shows the same data, but in the wrong scope:
It seems like if we can get the data in the right scope we'll make progress, but my PHP knowledge is dangerously limited.
You are sending an empty POST in your CF example.
<cfhttpparam type="url" is processed as a query string parameter, as in:
https://example.com/_api/proxyApi.cfc?method=apiauth&argumentCollection=...
Thus your dump of the URL scope (the key-value-paired query string) shows the data.
To put those parameters into your POST body, you would use:
<cfhttpparam type="formfield"
And then your FORM scope would show the data.
Your PHP cURL does the latter: it adds your parameters to the POST body.
If you want the cURL call to work like your example CF code, do this instead:
// add the parameters to the URL's query string
// start with & instead of ?, if the URL already contains a query string, see comment below snippet
$target_url .= '?'.'method=apiauth'.'&'.'argumentCollection='.urlencode($json);
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_URL => $target_url,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 2,
CURLOPT_AUTOREFERER => true
));
no query string in $target_url:
$target_url = 'https://example.com/_api/proxyApi.cfc';
$target_url .= '?'.'method=apiauth'.'&'.'argumentCollection='.urlencode($json);
query string in $target_url:
$target_url = 'https://example.com/_api/proxyApi.cfc?p=';
$target_url .= '&'.'method=apiauth'.'&'.'argumentCollection='.urlencode($json);
On a side note: You probably don't want to send JSON via query string as the query string has a limit of about 2000 chars (depends on browser and webserver). If your JSON is complex, your query string will be truncated and mess everything up. Use the POST body for this instead.
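The "? versus &" choice above can also be wrapped in a small helper. This is a sketch only; `append_query_params` is a made-up name, and it simply checks whether the URL already contains a "?".

```php
<?php
// Append parameters to a URL's query string, choosing "?" or "&"
// depending on whether the URL already has a query string.
function append_query_params(string $url, array $params): string
{
    // http_build_query URL-encodes the names and values.
    $query = http_build_query($params);
    if ($query === '') {
        return $url; // nothing to append
    }
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . $query;
}
```

With it, the snippet above would set `$target_url = append_query_params($target_url, array('method' => 'apiauth', 'argumentCollection' => $json));` without a separate urlencode() call.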
I've got two different remote forms which I need to submit data to. The first one is an http form, and it works just fine. Submit > redirect to result page > return response as variable.
The second one lives on an https page, and it just doesn't work, no matter what I try. So here's what I'm working with:
First form's form tag
<form method="post" name="entry_form" action="a_relative_page.asp?start=1">
Second form's form tag
<form method="post" novalidate enctype="multipart/form-data" action="https://asubdomain.formstack.com/forms/index.php" class="stuff" id="stuff1234567890">
Both buttons are completely unremarkable, with no fancy javascript, and look essentially like
<input type="submit">
And here's the PHP cURL request
$post_data = http_build_query(
array(
'lots_of' => 'values_here'
)
);
$url = 'https://asubdomain.formstack.com/forms/a_page_with_form';
$ch = curl_init($url);
$opts = array(
CURLOPT_SSL_VERIFYPEER => FALSE,
CURLOPT_SSL_VERIFYHOST => FALSE,
CURLOPT_UNRESTRICTED_AUTH => TRUE,
CURLOPT_VERBOSE => TRUE,
// Above options for debugging because I'm desperate
CURLOPT_CONNECTTIMEOUT => 30,
CURLOPT_USERAGENT => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_FOLLOWLOCATION => TRUE,
CURLOPT_POST => TRUE,
CURLOPT_POSTFIELDS => $post_data
);
curl_setopt_array($ch, $opts);
//Capture result and close connection
$result = curl_exec($ch);
$debug = curl_getinfo($ch);
curl_close($ch);
There's nothing out of the ordinary in curl_getinfo except the expected ["ssl_verify_result"]=> int(0), which I'm ignoring for debugging.
HTTP code is 200. If I echo $result, I can see that all form values are filled out, but the form never submitted and thus never redirected. If I then CLICK on submit, the form submits without issue and redirects as expected.
Again, this is not a problem with the first form, only the second. I'm guessing there's something about the underlying code on formstack that prevents cURL POST from submitting the form, but I can't seem to find it.
EDIT: I found the problem. There are two invisible elements on formstack forms. One of them is an input field called _submit which must be set to 1. The other is the form identifier, which is an integer.
I have a project where I need to do the following: retrieve some data from a form, fill the database, create a response for the user, and post the data to a third party. To give an example, it's like booking a ticket to a concert. The AJAX call: you buy the ticket, you receive a response (whether the purchase was successful), a PHP script sends data to the database, and someone may be notified that a new ticket was bought. Now, I need to pass data to that "someone", which is what I don't know how to do.
Or, like when someone posts a comment to my question on stackoverflow, I get a notification.
In my particular case, the user creates an event and receives a response, and I need certain parameters posted by the user to appear on a calendar. Importantly, I can hardly integrate the calendar with the script that retrieves the data. I would rather "forward" the data to the calendar, much like pushing notifications.
Can anyone please give me a clue about what I should use, or what I would need, in order to do the above?
The process will go like this:
AJAX
user----> php script->database
|_ calendar
So if I understand you correctly, you could post your data to the calendar via cURL:
$url = "http://www.your-url-to-the-calendar.com";
$postData = array(
"prop1" => "value1",
"prop2" => "value2",
"prop3" => "value3"
);
//urlify the data for the post
$data_string = http_build_query($postData);
//will output --> prop1=value1&prop2=value2&prop3=value3
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => $url,
CURLOPT_HEADER => false,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => false,
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $data_string
));
$result = curl_exec($ch);
If your third-party calendar does not require authentication, then this would be the best way to post it, if you cannot write to the database yourself.
When it requires authentication, you would first have to log in via cURL (send the credentials in a cURL POST, receive the cookies, then send the cookies with your data).
Hope this helps.
I have a small element on my website that displays the validity of the current page's markup. At the moment, it is statically set as "HTML5 Valid", as I constantly check whether it is, in fact, HTML5 valid. If it's not then I fix any issues so it stays HTML5-valid.
I would like this element to be dynamic, though. So, is there any way to ping the W3C Validation Service with the current URL, receive the result and then plug the result into a PHP or JavaScript function? Does the W3C offer an API for this or do you have to manually code this?
Maintainer of the W3C HTML Checker (aka validator) here. In fact the checker does expose an API that lets you do, for example:
https://validator.w3.org/nu/?doc=https%3A%2F%2Fgoogle.com%2F&out=json
…which gives you the results back as JSON. There’s also a POST interface.
You can find more details here:
https://github.com/validator/validator/wiki/Service-»-HTTP-interface
https://github.com/validator/validator/wiki/Service-»-Input-»-POST-body
https://github.com/validator/validator/wiki/Service-»-Input-»-GET
https://github.com/validator/validator/wiki/Output-»-JSON
They do not have an API that I am aware of.
As such, my suggestion would be:
Send a request (GET) to the result page (http://validator.w3.org/check?uri=) with your page's URL (using file_get_contents() or curl). Parse the response for the valid message (DOMDocument or simple string search).
Note: This is a brittle solution, subject to breaking if anything changes on W3C's side. However, it will work, and this tool has been available for several years.
Also, if you truly want this on your live site I'd strongly recommend some kind of caching. Doing this on every page request is expensive. Honestly, this should be a development tool. Something that is run and reports the errors to you. Keep the badge static.
Here is an example of how to use the W3C API to validate HTML in PHP:
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => "http://validator.w3.org/nu/?out=json",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => "POST",
CURLOPT_POSTFIELDS => '<... your html text to validate ...>',
CURLOPT_HTTPHEADER => array(
"User-Agent: Any User Agent",
"Cache-Control: no-cache",
"Content-Type: text/html; charset=utf-8"
),
));
$response = curl_exec($curl);
$err = curl_error($curl);
curl_close($curl);
if ($err) {
//handle error here
die('sorry etc...');
}
$resJson = json_decode($response, true);
The raw $response (before json_decode) will look like this:
{
"messages": [
{
"type": "error",
"lastLine": 13,
"lastColumn": 110,
"firstColumn": 5,
"message": "Attribute “el” not allowed on element “link” at this point.",
"extract": "css\">\n <link el=\"stylesheet\" href=\"../css/plugins/awesome-bootstrap-checkbox/awesome-bootstrap-checkbox.min.css\">\n <",
"hiliteStart": 10,
"hiliteLength": 106
},
{
"type": "info",
"lastLine": 294,
"lastColumn": 30,
"firstColumn": 9,
"subType": "warning",
"message": "Empty heading.",
"extract": ">\n <h1 id=\"promo_codigo\">\n ",
"hiliteStart": 10,
"hiliteLength": 22
},....
Check https://github.com/validator/validator/wiki/Service-»-Input-»-POST-body for more details.
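To turn that response into something a badge can use, the messages can be counted by type. The sketch below is an illustration, not part of the checker's API: `summarize_validator_messages` is a hypothetical helper, and it relies only on the `type` and `subType` keys visible in the sample output above.

```php
<?php
// Count errors and warnings in the checker's JSON output.
// In the sample above, warnings arrive as type "info" with subType "warning".
function summarize_validator_messages(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['messages']) || !is_array($data['messages'])) {
        throw new InvalidArgumentException('Unexpected validator response');
    }
    $summary = array('errors' => 0, 'warnings' => 0);
    foreach ($data['messages'] as $message) {
        $type = isset($message['type']) ? $message['type'] : '';
        $subType = isset($message['subType']) ? $message['subType'] : '';
        if ($type === 'error') {
            $summary['errors']++;
        } elseif ($type === 'info' && $subType === 'warning') {
            $summary['warnings']++;
        }
    }
    return $summary;
}
```

A page could then be treated as valid when `summarize_validator_messages($response)['errors'] === 0`. As the other answer suggests, caching this result rather than calling the checker on every page view is advisable.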