1) Instead of http://webiste.com/filename, I want to read the URLs from a .txt file that contains 20374 rows, each row holding a different website,
for example:
website1.com
website2.com
website3.com
etc.
2) fetch each of them individually with cURL
3) extract the needed data via preg_match
4) and save the final results to my MySQL database
Below is the code that I am using at the moment. Please advise what needs to be added to achieve this goal.
function curl($url)
{
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Cookie: ddosdefend=1d4607e3ac67b865e6c7263260c34e888cae7c56"));
curl_setopt($ch, CURLOPT_USERAGENT, $agent); // The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$test = curl('http://webiste.com/filename');
// The pattern captures the iframe's src attribute; $matches[1] exists only when it matches.
if (preg_match('/<iframe class="metaframe rptss" src="(.*?)"/', $test, $matches)) {
$test2 = $matches[1];
}
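One way the four steps could fit together is sketched below. It is only a sketch under assumptions: the helper names (read_url_list, extract_iframe_src, process_urls, make_mysql_saver), the table `results` and its columns `url` and `iframe_src` are all made up for illustration, and rows without a scheme are assumed to be plain http. The fetcher is passed in as a callable so that the curl() function above can be plugged in unchanged.

```php
<?php
// Read one URL per line, skipping blank lines.
function read_url_list(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    $urls = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        // Rows like "website1.com" have no scheme, so add one (assumption: plain http).
        if (!preg_match('#^https?://#i', $line)) {
            $line = 'http://' . $line;
        }
        $urls[] = $line;
    }
    return $urls;
}

// Return the iframe src from the page, or null when the pattern does not match.
function extract_iframe_src(string $html): ?string
{
    if (preg_match('/<iframe class="metaframe rptss" src="(.*?)"/', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Fetch every URL with $fetch and hand each extracted value to $save.
// Returns the number of rows passed to $save.
function process_urls(array $urls, callable $fetch, callable $save): int
{
    $saved = 0;
    foreach ($urls as $url) {
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // curl_exec() returns false on failure
        }
        $src = extract_iframe_src($html);
        if ($src !== null) {
            $save($url, $src);
            $saved++;
        }
    }
    return $saved;
}

// Hypothetical saver: the table `results` (url, iframe_src) is an assumption.
function make_mysql_saver(PDO $pdo): callable
{
    $stmt = $pdo->prepare('INSERT INTO results (url, iframe_src) VALUES (?, ?)');
    return function (string $url, string $src) use ($stmt): void {
        $stmt->execute([$url, $src]);
    };
}

// Possible usage (file name and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
// process_urls(read_url_list('sites.txt'), 'curl', make_mysql_saver($pdo));
```

With around 20000 sites, a sequential loop like this will probably take a while, so it is likely better run from the command line than from a web request.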
The following curl call succeeds every time if and only if $data is printed after the call. In that case curl_getinfo() returns
[content_type] => text/html; charset=UTF-8
If $data is not printed, the call sometimes returns the same result as above and sometimes returns $data as "Loading...", which means that the page has not finished loading yet. In that case curl_getinfo() returns
[content_type] => text/html
Furthermore, when using print_r($data), I can see the output of print_r(curl_getinfo($ch)); on my website being updated several times while the curl call is performed. What... The.... F?
(the setopt list has grown larger as I'm trying to find a solution LOL)
Ooh.. yeah, even if I print $data after it has been returned to the caller and stored in another variable, curl succeeds every time.
Is this normal behaviour? I don't want to print_r($data)!
Is it possible that the URL I'm retrieving contains JavaScript which gets run when I "print" it on my website? Why does it work occasionally without the print_r($data)? Ref: is-there-a-way-to-let-curl-wait-until-the-pages-dynamic-updates-are-done
Edit: Until further notice, I've put the curl call in a while loop that checks whether the downloaded size is above a certain threshold. I've capped the loop at 10 iterations, and so far that is enough, i.e. it manages to download the content of interest. The time consumed is barely noticeable.
function curl_get_contents($url) {
global $dbg;
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
//curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch,CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
$data = curl_exec($ch);
if ($dbg) {
print_r(curl_getinfo($ch)); // This one gets refreshed if print_r($data) used below
if(curl_errno($ch)){
echo 'Curl error: ' . curl_error($ch);
} else {
echo "ALL GOOD <br>";
}
}
curl_close($ch);
//echo $data; // If I do this...
//print_r($data); // ... or this. curl is success 100%.
return $data;
}
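The retry workaround described in the edit could look roughly like the sketch below. The function name fetch_until_complete, the byte threshold and the pause length are assumptions for illustration; the fetcher is passed in as a callable so curl_get_contents() above can be used as-is.

```php
<?php
// Call $fetch($url) up to $maxAttempts times until the body is at least
// $minBytes long. Returns the body, or null if no attempt was large enough.
function fetch_until_complete(
    string $url,
    callable $fetch,
    int $minBytes,
    int $maxAttempts = 10,
    int $pauseMicros = 200000 // pause between attempts; 200 ms is an arbitrary choice
): ?string {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $data = $fetch($url);
        if (is_string($data) && strlen($data) >= $minBytes) {
            return $data;
        }
        if ($attempt < $maxAttempts) {
            usleep($pauseMicros);
        }
    }
    return null;
}

// Possible usage (the threshold of 2000 bytes is a guess, tune it for the page):
// $data = fetch_until_complete($url, 'curl_get_contents', 2000);
```

Checking the size is only a heuristic. If the "Loading..." placeholder is filled in by JavaScript, no number of retries from cURL will execute that script.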
I'm new to PHP and cURL. I'm calling a web service that returns values in headers and a JSON object.
The returned string has ONLY the JSON object, with no other information, so it looks like
[{
"time":"2017-07-18T17:02:57.759Z",
"trade_id":18237183,
"price":"2302.53000000",
"size":"0.03310247",
"side":"sell"
}]
From my understanding, the string should also contain the response headers and other information (or am I misunderstanding this?).
code:
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "https://api.gdax.com/products/BTC-USD/trades");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//enable headers
curl_setopt($ch, CURLOPT_HEADER, 1);
// set user agent
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string
$output = curl_exec($ch);
// ONLY DISPLAYS THE JSON OBJECT, NO INFORMATION ON TOP OF THE STRING
echo $output;
// close curl resource to free up system resources
curl_close($ch);
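When CURLOPT_HEADER is enabled, the headers come back at the front of the same string as the body. A minimal sketch of separating the two is shown below, assuming the header length is taken from curl_getinfo($ch, CURLINFO_HEADER_SIZE) before the handle is closed. The function name split_response is made up for illustration.

```php
<?php
// Split a raw response (as returned by curl_exec() with CURLOPT_HEADER enabled)
// into a header map and the body. $headerSize is the value of
// curl_getinfo($ch, CURLINFO_HEADER_SIZE).
function split_response(string $raw, int $headerSize): array
{
    $headerText = substr($raw, 0, $headerSize);
    $body = (string) substr($raw, $headerSize);
    $headers = [];
    foreach (explode("\r\n", trim($headerText)) as $line) {
        // The status line ("HTTP/1.1 200 OK") has no colon and is skipped.
        if (strpos($line, ':') !== false) {
            [$name, $value] = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);
        }
    }
    return ['headers' => $headers, 'body' => $body];
}

// Possible usage with the handle from the code above:
// $output = curl_exec($ch);
// $parts  = split_response($output, curl_getinfo($ch, CURLINFO_HEADER_SIZE));
// $trades = json_decode($parts['body'], true);
```

Header names are lowercased here so lookups do not depend on how the server capitalises them; a header that appears more than once keeps only its last value in this sketch.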
I would like to open all the page IDs of the website, starting with http://website.com/page.php?id=1 and ending with id=1000,
extract the data via preg_match, and record it somewhere, either in a .txt or a .sql file.
Below is the curl function I'm using at the moment. Please kindly advise the full code that will get this job done.
function curl($url)
{
$POSTFIELDS = 'name=admin&password=guest&submit=save';
$reffer = "http://google.com/";
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$cookie_file_path = "C:/Inetpub/wwwroot/spiders/cookie/cook"; // Please set your Cookie File path. This file must have CHMOD 777 (Full Read / Write Option).
$ch = curl_init(); // Initialize a CURL session.
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $url); // The URL to fetch. You can also set this when initializing a session with curl_init().
curl_setopt($ch, CURLOPT_USERAGENT, $agent); // The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_POST, 1); //TRUE to do a regular HTTP POST. This POST is the normal application/x-www-form-urlencoded kind, most commonly used by HTML forms.
curl_setopt($ch, CURLOPT_POSTFIELDS,$POSTFIELDS); //The full data to post in a HTTP "POST" operation.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
curl_setopt($ch, CURLOPT_REFERER, $reffer); //The contents of the "Referer: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path); // The name of the file containing the cookie data. The cookie file can be in Netscape format, or just plain HTTP-style headers dumped into a file.
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path); // The name of a file to save all internal cookies to when the connection closes.
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
You can try it with the function file_put_contents and a loop that calls your function.
$file = "data.txt";
$website_url = "http://website.com/page.php?id=";
// Append the raw page for each id from 1 to 1000 to the file
for ($i = 1; $i <= 1000; $i++) {
    file_put_contents($file, curl($website_url . $i), FILE_APPEND);
}
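If only the preg_match result should be recorded rather than the whole page, the loop could be extended along these lines. This is a sketch: the function name scrape_id_range and the tab-separated output format are assumptions, and the pattern is expected to contain one capture group.

```php
<?php
// Visit $baseUrl . $id for each id in [$first, $last], apply $pattern to each page,
// and append "id<TAB>match" lines to $outFile. Returns how many pages matched.
function scrape_id_range(
    string $baseUrl,
    int $first,
    int $last,
    string $pattern,
    callable $fetch,
    string $outFile
): int {
    $matched = 0;
    for ($id = $first; $id <= $last; $id++) {
        $html = $fetch($baseUrl . $id);
        // Skip failed downloads and pages where the pattern does not match.
        if (is_string($html) && preg_match($pattern, $html, $m)) {
            file_put_contents($outFile, $id . "\t" . $m[1] . "\n", FILE_APPEND);
            $matched++;
        }
    }
    return $matched;
}

// Possible usage (the <title> pattern is only an example):
// scrape_id_range('http://website.com/page.php?id=', 1, 1000,
//     '/<title>(.*?)<\/title>/', 'curl', 'data.txt');
```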
I'm using the following code:
$agent= 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, "www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$output = curl_exec($ch);
echo $output;
But it redirects to something like this:
http://localhost/aide.do?sht=_aide_cookies_
instead of to the page at the URL.
Can anyone help me solve my problem, please?
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
http://docs.php.net/function.curl-setopt says:
CURLOPT_FOLLOWLOCATION
TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
If the problem is only URL redirection, see the following code. I've documented it so that you can use it directly. Two main cURL options control URL redirection (CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS):
// create a new cURL resource
$ch = curl_init();
// The URL to fetch. This can also be set when initializing a session with curl_init().
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
// The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0");
// TRUE to include the header in the output.
curl_setopt($ch, CURLOPT_HEADER, false);
// TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// The maximum amount of HTTP redirections to follow. Use this option alongside CURLOPT_FOLLOWLOCATION.
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// grab URL and pass it to the output variable
$output = curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
// Print the output from our variable to the browser
print_r($output);
The above code handles the URL redirection issue, but it doesn't deal with cookies (your localhost URL seems to be dealing with cookies). If you wish to deal with cookies from the cURL resource, then you may have to give the following cURL options a look:
CURLOPT_COOKIE
CURLOPT_COOKIEFILE
CURLOPT_COOKIEJAR
For further details please follow the following link:
http://docs.php.net/function.curl-setopt
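For the cookie side, a minimal sketch of combining the redirect options with a cookie file might look like this. The function name redirect_and_cookie_options and the cookie file location are assumptions; the CURLOPT_* constants are the standard ones from the cURL extension.

```php
<?php
// Build cURL options that follow redirects and keep cookies in $cookieFile,
// so a cookie set on the first response is sent again on the redirected request.
function redirect_and_cookie_options(string $url, string $cookieFile, int $maxRedirects = 10): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => $maxRedirects,
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies back when the handle is closed
    ];
}

// Possible usage (the cookie file path is a placeholder and must be writable):
// $ch = curl_init();
// curl_setopt_array($ch, redirect_and_cookie_options(
//     'http://www.example.com/', sys_get_temp_dir() . '/cookies.txt'));
// $output = curl_exec($ch);
// curl_close($ch);
```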
I want to download a page from the web. This is allowed when you use an ordinary browser like Firefox, but when I use "file_get_contents" the server refuses and replies that it understands the request but doesn't allow such downloads.
So what can I do? I think I saw in some scripts (in Perl) a way to make your script look like a real browser by setting a user agent and cookies, which makes the server think that your script is a real web browser.
Does anyone have an idea how this can be done?
Use CURL.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// set the UA
curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');
// Alternatively, lie, and pretend to be a browser
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
(From http://uk.php.net/manual/en/curl.examples-basic.php)
Yeah, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to grind the content into a usable form.
function __construct($useragent,$url)
{
$this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
$this->url=$url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent); // send the full UA string built above
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about malformed HTML
$this->xpath = new DOMXPath($dom);
}
...
public function displayResults($site)
{
    $data = $this->path[0]->length;
    for ($i = 0; $i < $data; $i++)
    {
        $delData = $this->path[0]->item($i);
        // setting the href and title properties
        $urlSite = $delData->getElementsByTagName('a')->item(0)->getAttribute('href');
        $titleSite = $delData->getElementsByTagName('a')->item(0)->nodeValue;
        // setting the saves and additional data
        $saves = $delData->getElementsByTagName('span')->item(0)->nodeValue;
        if ($saves == NULL)
        {
            $saves = 0;
        }
        // build the array
        $this->newSiteBookmark[$i]['source'] = 'delicious.com';
        $this->newSiteBookmark[$i]['url'] = $urlSite;
        $this->newSiteBookmark[$i]['title'] = $titleSite;
        $this->newSiteBookmark[$i]['saves'] = $saves;
    }
}
The latter is part of a class that scrapes data from delicious.com. Not very legal though.
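For anyone who wants the DOMDocument and DOMXPath part in isolation, here is a small self-contained sketch that pulls every link out of an HTML string. The function name extract_links and the returned array shape are assumptions for illustration; it uses libxml_use_internal_errors() rather than @ to keep parse warnings quiet.

```php
<?php
// Return every link in $html as ['url' => ..., 'title' => ...].
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    // Collect parse warnings from sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Select only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'url'   => $a->getAttribute('href'),
            'title' => trim($a->textContent),
        ];
    }
    return $links;
}

// Possible usage with the string returned by curl_exec():
// foreach (extract_links($html) as $link) {
//     echo $link['url'], ' ', $link['title'], "\n";
// }
```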
This answer takes your comment on Rich's answer into account.
The site is probably checking whether you are a real user by looking at the HTTP referer or the User-Agent string. Try setting these for your curl handle:
//pretend you already came from their site
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
//pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
Another way to do it (though others have pointed out a better way) is to use PHP's fopen() function, like so:
$handle = fopen("http://www.example.com/", "r"); // open the specified URL for reading
It's especially useful if cURL isn't available.
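If you go the fopen() or file_get_contents() route, the User-Agent and Referer can still be set through a stream context. A minimal sketch follows, assuming allow_url_fopen is enabled on the server; the function name browser_like_context and the default timeout are made up for illustration.

```php
<?php
// Build a stream context that sends a chosen User-Agent and a Referer header.
// Works with fopen() and file_get_contents() on http:// and https:// URLs.
function browser_like_context(string $userAgent, string $referer, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Referer: $referer\r\n",
            'timeout'    => $timeoutSeconds,
        ],
    ]);
}

// Possible usage:
// $context = browser_like_context('My App (http://www.example.com/)', 'http://www.example.com/');
// $html = file_get_contents('http://www.example.com/', false, $context);
```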