Want to send header with screen scraping - php

i am still stucked in screen scraping problem...
link : screen scraping in php problem
This problem was solved to little extent by using '&num=100' in google search query which decreased the no. of request 10 times.But captcha problem is still dere. So to overcome it i used...sleep(seconds) function.
Now the problem is I have to scrape it myself(these are orders).that means i dont want to use 'simple_html_dom.php' becuase catching warnings and error is difficult(for me) in this case.i m instructed to do it myself. so how i can i do it.i know to methods: 1. file_get_content() 2. curl.
But its very tedious work to fetch search for ur content and count rank simultaneously.as using regular exp to parse dom is HELL.read this link for convencing urself.link: RegEx match open tags except XHTML self-contained tags
Task to implemented :
catch captcha error(or warning) so i can stop furhter execution.
Have to use headers.so it seems to be genuine and valid humanable request to google.
simple_html_dom.php cant catch errors.it shows warning when captcha error occurs.How can i catch that warning?
Please help...its long working with this module.Please give suggestion to solve each and every problem related here.

Don't know about the first problem (captcha), but you can send headers easily with curl, for example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Charset: utf-8'));
And to set the user agent:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64; rv:2.2a1pre) Gecko/20110324 Firefox/4.2a1pre');

Related

Bittrex API With PHP

I'm trying to set up a bot for bittrex by using the bittrex api. I previously tried using python but had a hard time as the documentation was in php(https://bittrex.com/Home/Api), so I decided to switch to php. Im trying to create the bot but having a hard time starting. I pasted the initial code:
$apikey='xxx';
$apisecret='xxx';
$nonce=time();
$uri='https://bittrex.com/api/v1.1/market/getopenorders?
apikey='.$apikey.'&nonce='.$nonce;
$sign=hash_hmac('sha512',$uri,$apisecret);
$ch = curl_init($uri);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('apisign:'.$sign));
$execResult = curl_exec($ch);
$obj = json_decode($execResult);
And according to this video: (sorry I had to add space because it doesn't allow me to post more than 2 links with low rep)
https:// youtu.be/K0lDTK3D-74?t=5m30s
It should return this: (Same as Above)
http:// i.imgur.com/jCoAUT9.png
But when I try place the same thing in a php values, with my own api key and secret I just get a blank webpage with nothing on it. This is what my php file looks like(API key and secret removed for security reasons):
http://i.imgur.com/DYYoY0g.png
Any idea why this could be happening and how I could fix it?
Edit: No need for help anymore. I decided to go back to python and try to do it there and finally made it work :D
The video you're working from has faked their results. Their code doesn't do anything with the value of $obj, so I wouldn't expect anything to show up on the web page. (And definitely not with the formatting they show.)
If you're unfamiliar enough with PHP that this issue wasn't immediately apparent to you, this is probably a sign that you should step back and get more familiar with PHP before you continue -- especially if you're going to be running code that could make you lose a lot of money if it isn't working properly.
You need to echo your $obj or at least var_dump() it to see the content on a webpage.

How to do a HTTP (POST) Request in PHP without requiring any configuration

I am creating a PHP package that I want anyone to be able to use.
I've not done any PHP dev in a few years and I'm unfamiliar with pear and pecl.
The first part of my question is related to Pecl and Pear:
It seems to me that Pear and pecl are updating my computer, rather than doing anything to my code base, which leads me to the assumption that anything I do with them will also need to be duplicated by anyone wanting to use my package. Is that correct?
The 2nd part of my question is specific, I just want to do a simple HTTP (POST) request, and ideally I'd like to do it without any config required by those who use my package.
These are options I'm considering :
HTTPRequest seems like the perfect option, but it says "Fatal error: Uncaught Error: Class 'HttpRequest' not found" when I try and use it out of the box, and when I follow these instructions for installing it I get, "autoheader: error: AC_CONFIG_HEADERS not found in configure.in
ERROR: `phpize' failed" -- I don't want to debug something crazy like that in order to do a simple HTTP request, nor do I want someone using my package to have to struggle through something like that.
I've used HTTP_Request2 via a pear install and it works for me, but there is nothing added to my codebase at all, so presumably this will break for someone trying to use my package unless they follow the same install steps?
I know that I can use CURL but the syntax for that seems way over the top for such a simple action (I want my code to be really easy to read)
I guess I can use file_get_contents() .. is that the best option?
and perhaps I'll phrase the 2nd part of my question as :
Is there an approach that is considered best practice for (1) doing a HTTP request in PHP, and (2) for creating a package that is able to be easily used by anyone?
This really depends on what you need your request for. While it can be daunting when first learning it, I prefer to use cURL requests most of the time unless all I need to do is query the page with no headers. It becomes pretty readable once you get used to the syntax and the various options in my opinion. When all I need to do is query a page with no headers, I will usually use file_get_contents as this is a lot nicer looking and simpler. I also think most PHP developers can agree with me on this standpoint. I recommend using cURL requests as, when you need to set headers, they're very organized and more popular than messing with file_get_contents.
EDIT
When learning how to do cURL in PHP, the list of options on the documentation page is your friend! http://php.net/manual/en/function.curl-setopt.php
Here's an example of a simple POST request using PHP that will return the response text:
$data = array("arg1" => "val1", "arg2" => true); // POST data included in your query
$ch = curl_init("http://example.com"); // Set url to query
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST"); // Send via POST
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data)); // Set POST data
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return response text
curl_setopt($ch, CURLOPT_HEADER, "Content-Type: application/x-www-form-urlencoded"); // send POST data as form data
$response = curl_exec($ch);
curl_close($ch);

I can't get scraping from the usaspending.gov api to work

I'm getting errors while scraping data from usaspending.gov can I can't figure out why. I've checked that my php settings are all open and even setup a test scrape of another random site url.
I took another step to include options with the method and useragent.
I suspect it's timing out, but if that's not it, I'm not sure what else to try to get this to work. Every other url I try, I have no problem getting into. If anyone has any suggestions, I'd love to read them!!
Here's my sample code.
$opts = array(
'http'=>array(
'method'=>"GET",
'user_agent'=>"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8",
'timeout'=>60
)
);
$context = stream_context_create($opts);
$test = file_get_contents('http://www.usaspending.gov/fpds/fpds.php?state=MI&detail=c&fiscal_year=2013',false,$context);
I'll also add, I've tried this with fopen, file_get_contents, and simplexml_load_file with no luck. I've tried it with the extended options on fopen and file_get_contents, no change. I'm sure I'm missing something small, just can't figure out what it is.
Edit: Here's the error message
Warning: file_get_contents(http://www.usaspending.gov/fpds/fpds.php?state=MI&detail=c&fiscal_year=2013) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in...
Additionally, the link works I'm trying to open, if you copy/paste it into your browser, you should get the download.
After beating my head against this same wall for a while, I used a curl method (How to get the real URL after file_get_contents if redirection happens?) to find where the basic API URL was redirecting and that seems to be working now!
Instead of getting your same error message with:
file_get_contents(http://www.usaspending.gov/fpds/fpds.php?detail=c&fiscal_year=2013&state=AL&max_records=1000&records_from=0)
It is now working for me with:
file_get_contents(http://www.usaspending.gov/api/fpds_api_complete.php?fiscal_year=2013&vendor_state=AL&Contracts=c&sortby=OBLIGATED_AMOUNT%2Bdesc&records_from=0&max_records=20&sortby=OBLIGATED_AMOUNT+desc)
So pretty much using this as my base URL to access the API with more parameters added on (with the "Contracts" parameter replacing the original "detail" parameter):
http://www.usaspending.gov/api/fpds_api_complete.php?Contracts=c&sortby=OBLIGATED_AMOUNT%2Bdesc&sortby=OBLIGATED_AMOUNT+desc
I hope this helps, and works for you too!

Using of Wikipedia API with Rest Clients [duplicate]

This question already has answers here:
How to get results from the Wikipedia API with PHP?
(4 answers)
Closed 9 years ago.
I'm trying to get wikipedia pages (from particular category) using of MediaWiki. For this I'm following this tutorial Listing 3. Listing pages within a category. My question is: How to get Wikipedia pages without using of Zend Framework? And is there any Rest Clients based on php without need to install? Because Zend requires to install their package first and some configurations... and I don't want to do all this stuff.
After googling and some investigation I have found a tool called cURL, using of cURL with PHP can also buid a rest service. I really new in implementing rest services, but already tried to implement something in php:
<?php
header('Content-type: application/xml; charset=utf-8');
function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$wiki = "http://de.wikipedia.org/w/api.php?action=query&list=allcategories&acprop=size&acprefix=haut&format=xml";
$result = curl($wiki);
var_dump($result);
?>
But got the errors in the result. Could anyone to help with this?
UPDATE:
This page contains the following errors:
error on line 1 at column 1: Document is empty
Below is a rendering of the page up to the first error.
Sorry for taking so long to reply, but better late than never...
When I run your code on the command line, the output I get is:
string(120) "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
"
So it seems the problem is that you're bumping into Wikimedia bot User-Agent policy by not telling cURL to send a custom User-Agent header. To fix this, follow the advice given at the bottom of that page and add lines like the following into your script (alongside the other curl_setopt() calls):
$agent = 'ProgramName/1.0 (http://example.com/program; your_email#example.com)';
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
Ps. You probably also don't want to set an application/xml content type unless you're sure that the content actually is valid XML. In particular, the output of var_dump() will not be valid XML, even if the input is.
For testing and development, I'd suggest either running PHP from the command line or using the text/plain content type. Or, if you prefer, use text/html and encode your output with htmlspecialchars().
Ps. Made this a community wiki answer, since I realized that this question has already been asked and answered before.

Make cURL behave like exactly like form

I have a form on my site which sends data to some remote site - simple html form.
What I want to do is to use data user enters into form for statistical purposes.
So I instead of sending data to the remote page I send it first to my script which resends it the remote site.
The thing is I need it to behave in exact way the usual form would behave taking user to the remote site and displaying resources.
When I use this code it kinda works but not in the way I want it to:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $action);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
curl_close($ch);
Problem is that it displays response in the same script. For example if $action is for example:
somesite.com/processform.php and my script name is mysqcript.php it would display the response of "somesite.com/processform.php" inside "mysqcript.php" so all the relative links are not working.
How do I make it to send the user to "somesite.com/processform.php"? Same thing that pressing the button would do?
Leonti
I think you will have to do this on your end, as translating relative paths is the client's job. It should be simple: Just take the base directory of the request you made
http://otherdomain.com/my/request/path.php
and add it in front of every outgoing link that does not begin with "/" or a protocol ("http://", "ftp://").
Detecting all the outgoing links is hard, but I am 100% sure there are ready-made PHP classes that do that. Check for example this article and the getLinks() function in the user comments. I am not 100% sure whether this is what you need but it certainly goes to the right direction.
Here are a couple of possible solutions, which I post separately so they don't get mixed up with the one I recommend:
1 - keep using cURL, parse the response and add a <base/> tag to it. It should work for pretty much everything on that page.
<base href="http://realsite.com/form_url.php" />
2 - do not alter the submit URL. Submit the form to the real URL, but capture its content using some Javascript library (YUI does that) and send it to your script via XHR. It's still kind of hacky though.
There are several ways to do that. Here's one of the easiest: just use a 307 redirect.
header('Location: http://realsite.com/form_url.php', true, 307');
You can do your logging and stuff either before or after header() but if you do it after calling header() you will need to start your script with
ignore_user_abort(true);
Note that browsers are supposed to notify the user that their form is being redirected.

Categories