I am setting/storing cURL cookies with:
curl_setopt ($ch, CURLOPT_COOKIEJAR, $cookie);
And retrieving them / trying to set them in my browser with:
setcookie($cookie);
But what goes in between, please?
The cookie variable is defined like this:
$cookie = "cookie.txt";
Is there some way to parse the cookie file as an array?
I can't find any official library that can do it for you.
This question has a script to parse the cookie file. Pay attention to the answer about HttpOnly.
Or you might want to parse the cookies directly from the cURL response. In that case, check this question.
This is the way I did it:
$cookies = [];
$lines = file($cookiesFile);
foreach ($lines as $line) {
    // skip the comment and blank lines of the Netscape cookie-jar format
    if ($line[0] !== '#' && $line[0] !== "\n") {
        $tokens = explode("\t", $line);
        // field 5 is the cookie name, field 6 its value
        $cookies[$tokens[5]] = trim($tokens[6]);
    }
}
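To answer the "what goes in between" part: here is a minimal sketch (my own assumption of how the pieces fit together, reusing the $cookie = "cookie.txt" jar from the question) that parses the jar the same way and re-issues each cookie to the browser with setcookie():

$cookie = "cookie.txt"; // the file written by CURLOPT_COOKIEJAR
foreach (file($cookie) as $line) {
    // skip comment and blank lines of the Netscape cookie-jar format
    if ($line[0] === '#' || $line[0] === "\n") {
        continue;
    }
    $tokens = explode("\t", $line);
    if (count($tokens) >= 7) {
        // field 5 is the name, field 6 the value; setcookie() must run before any output
        setcookie($tokens[5], trim($tokens[6]));
    }
}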
I'm not sure if what I want to do is possible but here's the case.
Cookies are set on server A (first name, last name, etc).
I have a script on server A which gets the cookies, saves them into the db for future use, and finally displays them. Let's say the script is getCookies.php.
Here's the code:
include 'dbconnect.php';
$sessid = $_GET['sid'];
$un = $_COOKIE['un'];
$ul = $_COOKIE['ul'];
$up = $_COOKIE['up'];
$ue = $_COOKIE['ue'];
$idn = $_COOKIE['idn'];
if (!empty($un) || !empty($ul) || !empty($up) || !empty($ue) || !empty($idn)) { // Save log to Database
    $savedate = date('Y-m-d G:i');
    $q = "INSERT INTO cookiedb (sid, un, ul, up, ue, idn, savedate) VALUES ('$sessid', '$un', '$ul', '$up', '$ue', '$idn', '$savedate')";
    $rs = mysqli_query($con, $q);
}
echo "$un, $ul, $up, $ue, $idn";
The code above works if I access the script directly from the browser. However, if I access it from another server (server B) using cURL, the cookies don't seem to work. They're not being read or saved in the db, and I'm getting a blank response. I even used code like this suggestion I found here on Stack Overflow:
$url = "http://serverA.co.za/getCookie.php";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// get headers too with this line
curl_setopt($ch, CURLOPT_HEADER, 1);
$result = curl_exec($ch);
// get cookie
// multi-cookie variant contributed by #Combuster in comments
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $result, $matches);
$cookies = array();
foreach ($matches[1] as $item) {
    parse_str($item, $cookie);
    $cookies = array_merge($cookies, $cookie);
}
var_dump($matches);
...but this code does not work. Do you have any idea how I can get the values of those cookies? If cURL cannot be used here, are there any other ways? Thank you.
I need a code example and/or some guidance about fetching multiple URLs stored in a .txt file using cURL. Do I need to use a spider, or can I modify the code below, which works well for one URL?
<?php
$c = curl_init('http://www.example.com/robots.txt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);
?>
Your question is vague, but I will try to answer it with the information you provided.
I would use PHP's explode() function.
$lines = explode(PHP_EOL, $page);
foreach ($lines as $line) {
    $val = explode(':', $line);
    if (isset($val[1])) { // skip lines without a ':' separator
        echo $val[1];
    }
}
Something like this should do the job.
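If the goal is the original question, i.e. fetching several URLs listed one per line in a .txt file, a minimal sketch (assuming a file called urls.txt; the file name is just an example) would simply wrap the single-URL cURL snippet from the question in a loop:

<?php
// read one URL per line from urls.txt, skipping empty lines
$urls = array_filter(array_map('trim', file('urls.txt')));
foreach ($urls as $url) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    $page = curl_exec($c);
    curl_close($c);
    if ($page !== false) {
        // do something with $page here
        echo strlen($page) . " bytes fetched from $url\n";
    }
}
?>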
I'm in need of a function that tests a URL if it is redirected by whatever means.
So far, I have used cURL to catch header redirects, but there are obviously more ways to achieve a redirect.
Eg.
<meta http-equiv="refresh" content="0;url=/somewhere/on/this/server" />
or JS scripts
window.location = 'http://melbourne.ag';
etc.
I was wondering if anybody has a solution that covers them all. I'll keep working on mine and will post the result here.
Also, a quick way of parsing
<meta http-equiv="refresh"...
in PHP anyone?
I thought this would be included in PHP's native get_meta_tags() ... but I thought wrong :/
It can be done for markup languages (any simple markup parser will do), but it cannot be done in general for programming languages like JavaScript.
Redirection in a program in a Web document is equivalent to halting that program. You are asking for a program that is able to tell whether another, arbitrary program will halt. This is known in computer science as the halting problem, the first undecidable problem.
That is, you will only be able to tell correctly for a subset of resources whether redirection will occur.
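For the markup case, here is a minimal sketch (my own addition, not part of the answer above) that uses DOMDocument to look for a <meta http-equiv="refresh"> tag and pull out the target URL:

function findMetaRefresh($html) {
    // returns the redirect target from a <meta http-equiv="refresh"> tag, or null
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from sloppy markup
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('http-equiv')) === 'refresh') {
            $content = $meta->getAttribute('content'); // e.g. "0;url=/somewhere/on/this/server"
            if (preg_match('/url\s*=\s*(\S+)/i', $content, $m)) {
                return trim($m[1], "'\" ");
            }
        }
    }
    return null;
}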
Halfway there; I'll add the JS checks when I've written them...
function checkRedirect($url) {
    // returns the redirected URL or the original
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $out = curl_exec($ch);
    $out = str_replace("\r", "", $out);
    $headers_end = strpos($out, "\n\n");
    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }
    $headers = explode("\n", $out);
    foreach ($headers as $header) {
        if (strtolower(substr($header, 0, 10)) == "location: ") {
            $target = substr($header, 10);
            return $target;
        }
    }
    return $url;
}
Sort of a weird question.
From 4shared video site, I get the embed code like the following:
<embed src="http://www.4shared.com/embed/436595676/acfa8f75" width="420" height="320" allowfullscreen="true" allowscriptaccess="always"></embed>
Now, if I access the URL in that embed src, the video loads and the URL of the page changes to one that contains information about the video.
I am wondering if there is any way for me to access that info using PHP? I tried file_get_contents but it gives me lots of weird characters.
So, can I use PHP to load the embed url and get the information present in the address bar?
Thanks for all your help! :)
Yes, e.g. with PHP's cURL library. It gives you the redirect headers from the server, which contain the new/real URL of the video.
Here's some sample code:
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.4shared.com/embed/436595676/acfa8f75");
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
// we want to further handle the content, so return it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// grab URL and pass it to the browser
$result = curl_exec($ch);
// did we get a good result?
if (!$result)
die ("error getting url");
// if we got a redirection http-code, split the content in
// lines and search for the Location-header.
$location = null;
if ((int)(curl_getinfo($ch, CURLINFO_HTTP_CODE) / 100) == 3) {
    $lines = explode("\n", $result);
    foreach ($lines as $line) {
        list($head, $value) = explode(":", $line, 2);
        if ($head == 'Location') {
            $location = trim($value);
            break;
        }
    }
}
if ($location == null)
die("no redirect found in header");
// close cURL resource, and free up system resources
curl_close($ch);
// your location is now in here.
var_dump($location);
?>
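As a side note (my own addition, not part of the answer above): if all you need is the final URL, cURL can also follow the redirects itself and tell you where it ended up via CURLINFO_EFFECTIVE_URL:

<?php
// minimal alternative sketch: let cURL follow the redirects itself
$ch = curl_init("http://www.4shared.com/embed/436595676/acfa8f75");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location headers
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);         // we only care about the final URL
curl_exec($ch);
// the URL cURL ended up at after all redirects
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
var_dump($finalUrl);
?>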
I'd like to use PHP to crawl a document we have that has about 6 or 7 thousand href links in it. What we need is what is on the other side of the link which means that PHP would have to follow each link and grab the contents of the link. Can this be done?
Thanks
Sure, just grab the content of your starting URL with a function like file_get_contents (http://nl.php.net/file_get_contents), find URLs in the content of that page using a regular expression, grab the contents of those URLs, and so on.
The regexp will be something like:
$regexUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
Once you harvest the links, you can use cURL or file_get_contents (in a locked-down environment, file_get_contents may not be allowed to fetch over HTTP, though).
I just have a SQL table of all the links I have found, and if they have been parsed or not.
I then use Simple HTML DOM to parse the oldest added page, although since it tends to run out of memory with large pages (500 kB+ of HTML) I use a regex for some of it*. For every link I find, I add it to the SQL database as needing parsing, along with the time I found it.
The SQL database prevents the data being lost on an error, and as I have 100,000+ links to parse, I do it over a long period of time.
I am unsure, but have you checked the user agent of file_get_contents()? If they aren't your pages and you make thousands of requests, you may want to change the user agent, either by writing your own HTTP downloader or using one from a library (I use the one in Zend Framework), though cURL etc. work fine. Using a custom user agent allows the admin looking over the logs to see information about your bot. (I tend to put the reason why I am crawling and a contact address in mine.)
*The regex I use is:
'/<a[^>]+href="([^"]+)"[^"]*>/is'
A better solution (from Gumbo) could be:
'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'
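Regarding the custom user agent mentioned above, a minimal sketch with cURL (the UA string and URL are just examples of mine) might look like this:

$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// identify your bot and give the site admin a way to contact you
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0 (+http://example.com/bot; crawling for link research)');
$page = curl_exec($ch);
curl_close($ch);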
The PHP Snoopy library has a bunch of built in functions to accomplish exactly what you are looking for.
http://sourceforge.net/projects/snoopy/
You can download the page itself with Snoopy, and then it has another function to extract all the URLs on that page. It will even correct the links so that they are full-fledged URIs (i.e. not just relative to the domain/directory the page resides on).
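A minimal sketch of that usage (assuming Snoopy.class.php is available on your include path):

require 'Snoopy.class.php';
$snoopy = new Snoopy();
// fetchlinks() downloads the page and fills $snoopy->results with the absolute URLs found in it
if ($snoopy->fetchlinks('http://www.example.com/')) {
    foreach ($snoopy->results as $link) {
        echo $link . "\n";
    }
}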
You can try the following; see this thread for more details.
<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5) {
    // remember visited URLs across recursive calls so each one is crawled only once
    static $seen = array();
    if (($depth == 0) or (in_array($url, $seen))) {
        return;
    }
    $seen[] = $url;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    if ($result) {
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $href = $match[1];
            // turn relative links into absolute ones before recursing
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}\n";
}
crawl_page("http://www.sitename.com/", 3);
?>
I suggest that you take the HTML document with your 6000 URLs, parse the URLs out of it, and loop through the list you've got. In your loop, get the contents of the current URL using file_get_contents (for this purpose you don't really need cURL when file_get_contents is enabled on your server), parse out the contained URLs again, and so on.
It would look something like this:
<?php
function getUrls($url) {
    $doc = file_get_contents($url);
    $pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    preg_match_all($pattern, $doc, $urls);
    // $urls[0] holds the full matches, i.e. the URLs themselves
    return $urls[0];
}
$urls = getUrls("your_6k_file.html");
foreach ($urls as $url) {
    $moreUrls = getUrls($url);
    // do something with $moreUrls
}
?>