convert browser cookies and use in cURL crawls - php

First of all, the purpose of this is to spider one of our signed-in applications and get data about jobs run which I may not be able to get any other way.
I can log in via a browser, and also can inspect my cookies; how would I then take that (in a timely manner) and add that information to a cURL call so that I can use PHP to parse the return page (and links)?

cURL can read cookies from a Netscape cookie format file.
Since you can inspect the cookies, use this Netscape format cookie file generator to convert them to a file.
Then call curl_setopt($ch, CURLOPT_COOKIEFILE, 'file.txt');, where file.txt is the file you saved the cookies to, so that cURL reads the cookies from it.
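If you would rather build the file yourself than use a generator, the conversion can be sketched roughly as below. This is only a minimal sketch: the function name cookieHeaderToNetscape, the host app.example.com and the cookie values are made up for illustration, and it assumes host-only session cookies (expiry 0), since the cookie string copied from a browser does not carry domain, path or expiry.

```php
<?php
// Convert a browser "Cookie:" header string into Netscape cookie-file text.
// $domain, $path and $secure are supplied by the caller (assumptions here),
// because the raw header string does not carry them.
function cookieHeaderToNetscape(string $cookieHeader, string $domain, string $path = '/', bool $secure = false): string
{
    $lines = ['# Netscape HTTP Cookie File'];
    foreach (explode(';', $cookieHeader) as $pair) {
        $pair = trim($pair);
        if ($pair === '' || strpos($pair, '=') === false) {
            continue; // skip empty or malformed fragments
        }
        [$name, $value] = explode('=', $pair, 2);
        // Fields are tab-separated: domain, subdomain flag, path, secure, expiry, name, value.
        $lines[] = implode("\t", [
            $domain,
            'FALSE',                    // do not match subdomains
            $path,
            $secure ? 'TRUE' : 'FALSE', // send over HTTPS only?
            '0',                        // 0 = session cookie
            trim($name),
            trim($value),
        ]);
    }
    return implode("\n", $lines) . "\n";
}

// Hypothetical usage: the cookie string, host and URL are placeholders.
// $cookieFile = tempnam(sys_get_temp_dir(), 'cookies_');
// file_put_contents($cookieFile, cookieHeaderToNetscape('PHPSESSID=abc123; theme=dark', 'app.example.com'));
// $ch = curl_init('https://app.example.com/jobs');
// curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // read cookies from the file
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $html = curl_exec($ch);
```

The domain passed in has to match the host you request, otherwise cURL will not send the cookies.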

Related

How to get utma, utmz and similar cookies with php curl?

I noticed that my browser keeps different cookies than my curl cookie files:
"__utma=256586655.848991821.1337158982.1337158982.1337179787.2; __utmz=256586655.1337179787.2.2.utmcsr=login.example.com|utmccn=(referral)|utmcmd=referral|utmcct=/company.php; __utmc=256586655; PHPSESSID=8sedo85uc5rfpnluh06bdb0mk4"
And this is my curl based cookie.txt:
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
login.example.com FALSE / FALSE 0 PHPSESSID 8peqektoc5j3570h08efa6o3n2
So, how do I create the utma and utmz values, and what do those values stand for?
Those are Google Analytics cookies. Possibly, you're not telling curl to download third-party scripts referenced by the page.
Try wget with --page-requisites and --save-cookies and --load-cookies. It will download files used by the page, such as scripts.
Unfortunately, it still might not load Analytics if it's using the typical script-driven async loader, since JavaScript has to actually execute for that to work.
Those are Google Analytics cookies. Unless you pull and run those scripts (faking the JavaScript runtime environment of a browser), you won't get them. Fortunately, they're irrelevant to anything you actually want to do.

file_get_contents() vs. curl for invoking APIs with PHP

According to the description of the Google Custom Search API you can invoke it using the GET verb of the REST interface, like with the example:
GET https://www.googleapis.com/customsearch/v1?key=INSERT-YOUR-KEY&cx=017576662512468239146:omuauf_lfve&q=lectures
I set up my API key and custom search engine, and when I pasted my test query directly into my browser it worked fine, and I got the JSON file displayed to me.
Then I tried to invoke the API from my PHP code by using:
$json = file_get_contents("$url") or die("failed");
Where $url was the same one that worked on the browser, but my PHP code was dying when trying to open it.
After that I tried with curl, and it worked. The code was this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$body = curl_exec($ch);
Questions:
How come file_get_contents() didn't work and curl did?
Could I use fsocket for this as well?
Question 1:
First, check the allow_url_fopen ini setting; AFAIK it is the only reason file_get_contents() would fail here. The deprecated safe_mode may also cause this.
Oh, based on your comment, you have to add http:// to the URL when using it with filesystem functions. The scheme selects the wrapper that tells PHP to make an HTTP request; without it, the function assumes you want to open a local file named ./google.com (just like google.txt).
Question 2:
Yes, you can build almost any cURL request with sockets.
My personal opinion is that you should stick with cURL because:
timeout settings
handles all possible HTTP states
easy and detailed configuration (there is no need for detailed knowledge of HTTP headers)
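To illustrate the first two points, here is a small sketch of a cURL fetch with explicit timeouts that also reports the HTTP status. The function name fetchUrl and the default timeout values are assumptions chosen for the example, not anything from the question.

```php
<?php
// Fetch a URL with explicit timeouts and return status, body and error text.
function fetchUrl(string $url, int $connectTimeout = 5, int $timeout = 15): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow 3xx redirects
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout,        // seconds allowed for the whole transfer
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no HTTP response arrived
    $error  = curl_error($ch);                       // empty string on success
    curl_close($ch);

    return [
        'status' => $status,
        'body'   => $body === false ? null : $body,
        'error'  => $error,
    ];
}
```

A caller could then branch on $result['status'] (200, 403, 404 and so on) instead of only learning that the call "died".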
file_get_contents probably will rewrite your request after getting the IP, obtaining the same thing as:
file_get_contents("xxx.yyy.www.zzz/app1",...)
Many servers will deny you access if you go through IP addressing in the request.
With cURL this problem doesn't exist. It resolves the hostname but leaves the request as you set it, so the server is not rude in response.
This could be the "cause", too..
1) Why are you using the quotes when calling file_get_contents?
2) As was mentioned in the comment, file_get_contents requires allow_url_fopen to be enabled in your php.ini.
3) You could use fsockopen, but you would have to handle HTTP requests/responses manually, which would be to reinvent the wheel when you have cURL. The same goes for socket_create.
4) Regarding the title of this question: cURL is more customizable and more useful for complex HTTP transactions than file_get_contents. It should be mentioned, though, that stream contexts let you configure a lot of settings for your file_get_contents calls. Still, I think cURL is more complete, since it gives you, for instance, the possibility of working with multiple parallel handles.
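For comparison, points 2 and 4 could be combined in a sketch like the following: it checks allow_url_fopen first and then passes a stream context to file_get_contents. The function name getWithStreamContext, the 10-second timeout and the Accept header are illustrative assumptions.

```php
<?php
// Fetch a URL with file_get_contents() and a stream context.
// Returns the body, or null when URL wrappers are disabled or the request fails.
function getWithStreamContext(string $url, float $timeoutSeconds = 10.0): ?string
{
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null; // http:// and https:// wrappers are switched off in php.ini
    }
    $context = stream_context_create([
        'http' => [ // these options are used for https:// URLs as well
            'method'        => 'GET',
            'timeout'       => $timeoutSeconds, // read timeout in seconds
            'ignore_errors' => true,            // still return the body on 4xx/5xx
            'header'        => "Accept: application/json\r\n",
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

Note that for https:// URLs such as the Google API one, PHP also needs the openssl extension loaded, otherwise the https wrapper is not available.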

PHP CURL Cookie not kept when cookie called a second time

I have a PHP page that uses cURL to log in to another page, get the cookies, and then use them to call another page. From the new page, the PHP script can be called again to fetch the same page with different parameters. This all works on my free web hosting site. However, after I moved it to my client's site, it works for the first call (i.e. the cookie is created and used fine) but not when I call the page again with a new parameter (i.e. the cookie is not reused). The code runs inside WordPress and the setups are nearly identical (I copied the themes, plugins and DB from one site to the other). What could explain the difference, and how would I go about fixing it?
The only difference I can see at the moment is looking at the response from the web pages, the site that is not working has the cache-control set to no caching and age=0. Would this be the reason and if so how can I change this?
Try to manually assign a cookiejar / file to your curl operations:
$cookie_file = "/tmp/cookie/cookie1.txt";
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
curl will then read cookies from the cookie file before starting the request and will write the cookies it receives to the cookie jar when the handle is closed.
The path must be accessible and read/write-able by the user that PHP gets executed as. You should use a full path, not a relative one.
Edit: Marc B writes in PHP, Curl, curl_exec(), curl_close() and cookies that cookies are bound to the curl handle. So as long as you don't close the handle, curl should take care of the cookies.
So you might not need the cookiejar/file if both requests share the same curl handle.
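Both ideas (an explicit cookie jar and one shared handle) could be put together roughly like this. It is a sketch only: the helper names createSessionHandle and sessionRequest, the URLs and the form field names are hypothetical.

```php
<?php
// Configure one cURL handle so that cookies persist across requests.
// $cookieFile should be an absolute path in a directory writable by the PHP user.
function createSessionHandle(string $cookieFile)
{
    if (!is_writable(dirname($cookieFile))) {
        throw new RuntimeException('Cookie directory is not writable: ' . dirname($cookieFile));
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // load cookies and enable the cookie engine
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save cookies when the handle is closed
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return $ch;
}

// Run a request on an existing handle; the cookies stay attached to it.
function sessionRequest($ch, string $url, ?array $postFields = null)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    } else {
        curl_setopt($ch, CURLOPT_HTTPGET, true); // reset to GET after a POST
    }
    return curl_exec($ch);
}

// Hypothetical usage; URLs and field names are placeholders:
// $ch = createSessionHandle('/tmp/cookie/cookie1.txt');
// sessionRequest($ch, 'https://example.com/login', ['user' => 'me', 'pass' => 'secret']);
// $page = sessionRequest($ch, 'https://example.com/report?id=2');
// curl_close($ch);
```

The writability check is there because a jar that cannot be written fails silently, which would match "works on one host but not on the other".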

PHP curl post to login to wordpress

I am using php curl to login to wordpress behind-the-scenes as described here:
Wordpress autologin using CURL or fsockopen in PHP
However my script is not setting the cookies necessary to retain the wordpress session. Instead they are being sent back to my script and stored in cookies.txt.
Both the curl script and the wordpress login are on the same server in different directories.
Do I need to write another curl script to manually set the wordpress cookies? Is that possible?
If you're using the posted code as is, it won't work, because subsequent requests won't send the cookies back. Adding curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); should help, at least if the cookies actually get saved (otherwise look into file permissions), depending on your usage scenario.
You could also check out 10 awesome things to do with cURL for some neat examples on how to use curl (example 4 might just be what you are looking for).
BTW If this script is intended for multiple (concurrent) users, you shouldn't use a static filename, but create a temporary file for each user.
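A per-user cookie file could be created along these lines; the function name createCookieFileForUser and the file prefix are made up for the example.

```php
<?php
// Create a private cookie file for one visitor instead of a shared static name.
function createCookieFileForUser(): string
{
    // tempnam() picks a unique name and creates the (empty) file.
    $path = tempnam(sys_get_temp_dir(), 'curl_cookie_');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary cookie file');
    }
    return $path;
}

// Hypothetical usage:
// $cookie = createCookieFileForUser();
// curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
// curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
// ... run the requests, then remove the file with unlink($cookie);
```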
I needed the cookies to be sent to the browser not back to my curl script. The curl script was triggered by a php script running in the browser.
I solved the problem as follows:
Added these params to the curl object:
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'read_header');
Added a PHP function called read_header that parsed out the cookie data
Used setcookie() to manually set the cookies
If anyone wants the details please comment.
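The original read_header was not posted, so the following is only a guess at what the steps above might look like. It assumes that name and value are enough (Path, Expires and other attributes are dropped), that both scripts share a domain, and the helper forward_cookies is a hypothetical addition.

```php
<?php
// Collected "Set-Cookie" values from the remote response.
$remoteCookies = [];

// CURLOPT_HEADERFUNCTION callback: cURL calls it once per header line
// and expects the number of bytes handled as the return value.
function read_header($ch, string $header): int
{
    global $remoteCookies;
    if (stripos($header, 'Set-Cookie:') === 0) {
        // Keep only "name=value"; attributes such as Path or Expires follow the first ";".
        $cookie = trim(substr($header, strlen('Set-Cookie:')));
        $pair   = explode(';', $cookie, 2)[0];
        if (strpos($pair, '=') !== false) {
            [$name, $value] = explode('=', $pair, 2);
            // setcookie() URL-encodes values again, so decode them here.
            $remoteCookies[trim($name)] = urldecode(trim($value));
        }
    }
    return strlen($header);
}

// Call after curl_exec() and before any output is sent to the browser.
function forward_cookies(array $cookies, string $path = '/'): void
{
    foreach ($cookies as $name => $value) {
        setcookie($name, $value, 0, $path); // 0 = expires with the browser session
    }
}
```

After curl_exec($ch), calling forward_cookies($remoteCookies) would hand the cookies on to the visitor's browser.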

PHP: Curl can't grab a text-only page on my own site

I'm attempting to use curl inside php to grab a page from my own web server. The page is pretty simple, just has some plain text output. However, it returns 'null'. I can successfully retrieve other pages on other domains and on my own server with it. I can see it in the browser just fine, and I can grab it with command line wget just fine, it's just that when I try to grab that one particular page with curl, it simply comes up null. We can't use file_get_contents because our host has it disabled.
Why in the world would this different behavior be happening?
Found the issue. I was putting my url someplace that was not in curl_init(), and that place was truncating the query string. Once I moved it back to curl_init, it worked.
Try setting curl's user agent. Sometimes hosts will block "bots" by blocking things like wget or curl - but usually they do this just by examining the user agent.
You should check the output of curl_error() and also take a look at your logfiles for the http server.
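The suggestions above (set a user agent, check curl_error()) could look roughly like this. The function name fetchOrExplain and the user agent string are placeholders. Including the effective URL in the message is an extra assumption on my part; it would have exposed the truncated query string.

```php
<?php
// Fetch a page and throw with cURL's own diagnostics when it fails.
function fetchOrExplain(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)'): string
{
    $ch = curl_init($url); // pass the full URL, query string included
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $body = curl_exec($ch);
    if ($body === false) {
        $message = sprintf(
            'cURL error %d: %s (URL used: %s)',
            curl_errno($ch),
            curl_error($ch),
            curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) // the URL cURL actually requested
        );
        curl_close($ch);
        throw new RuntimeException($message);
    }
    curl_close($ch);
    return $body;
}
```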
