I am building a simple web spider using built-in PHP cURL multi. It works great. Here is the basic implementation:
<?php
$remainingTargets = ...;
$concurrency = 30;
$multiHandle = curl_multi_init();
$targets = [];

while (count($targets) < $concurrency && count($remainingTargets) > 0) {
    $target = array_shift($remainingTargets);
    $alreadyChecked = ...;
    if ($alreadyChecked !== false) {
        continue;
    }
    $curl = curl_init($target);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 4);
    curl_setopt($curl, CURLOPT_TIMEOUT, 5);
    curl_multi_add_handle($multiHandle, $curl);
    $targets[$target] = $curl;
}

// Run loop for downloading
$running = null;
do {
    curl_multi_exec($multiHandle, $running);
} while ($running);

// Harvest results
foreach ($targets as $target => $curl) {
    $html = curl_multi_getcontent($curl);
    curl_multi_remove_handle($multiHandle, $curl);
    // Process this page
}
curl_multi_close($multiHandle);

// If done show results, or continue processing queue...
What I want to know is: can the harvesting be done inside the "run loop" itself? I imagine that would free up resources sooner and perform better. It seems like I want a C-style select, but curl_multi_select does not return a specific resource.
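For illustration, here is a rough (untested) sketch of what harvesting inside the run loop might look like, using curl_multi_select to wait for activity and curl_multi_info_read to pull finished handles; variable names are reused from the code above:
<?php
$running = null;
do {
    curl_multi_exec($multiHandle, $running);
    // Block until at least one handle has activity (or the 1s timeout passes)
    curl_multi_select($multiHandle, 1.0);

    // Drain every transfer that has finished so far
    while ($info = curl_multi_info_read($multiHandle)) {
        $done = $info['handle'];
        $html = curl_multi_getcontent($done);
        // Process this page here, then release the handle right away
        curl_multi_remove_handle($multiHandle, $done);
        curl_close($done);
        // A replacement handle from $remainingTargets could be added here
        // to keep the pool at $concurrency.
    }
} while ($running);

curl_multi_close($multiHandle);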
Related
I am trying to bombard my site with 50-100+ simultaneous cURL connections.
I have built a list of URLs in an array (the post IDs are incremental), so the same URL is never requested twice, since caching is enabled on the server.
I want to do this purely in PHP cURL, as I am familiar with PHP.
I have looked on Stack Overflow and found PHP: Opening URLs concurrently to simulate a DOS attack?, but there is no cURL solution there that does what I am trying to do. I don't want to use third-party tools: some are for Windows, some are for Linux, some are outdated, and I am more familiar with PHP.
Right now my code waits for the current connections to close before starting new ones. I don't want to wait for open connections to close; I want to keep starting new connections.
Here is my code so far.
<?php
$my_site_url = 'https://example.com/item/post_id';
$threads = 80;
$all_post_ids = array();

// Get all ids into the array
for ($int = 50; $int <= 200000; $int++) {
    $all_post_ids[] = $int;
}
//print_r($all_post_ids);

// Make chunks from the big array
foreach (array_chunk($all_post_ids, $threads) as $post_ids_chunk) {
    $ch = curl_multi_init();
    foreach ($post_ids_chunk as $i => $each_id) {
        $url = str_replace("post_id", $each_id, $my_site_url);
        echo $url . "\n";
        $conn[$i] = curl_init($url);
        curl_setopt($conn[$i], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($conn[$i], CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($conn[$i], CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($conn[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36');
        curl_setopt($conn[$i], CURLOPT_HEADER, 0);
        curl_multi_add_handle($ch, $conn[$i]);
    }
    do {
        $status = curl_multi_exec($ch, $active);
    } while ($status === CURLM_CALL_MULTI_PERFORM || $active);
    foreach ($post_ids_chunk as $i => $each_id) {
        //$header[$i] = curl_getinfo($conn[$i]);
        //var_dump($header[$i]);
        //echo $header[$i]['http_code'];
        //$received_data[$i] = curl_multi_getcontent($conn[$i]);
        curl_multi_remove_handle($ch, $conn[$i]);
    }
    curl_multi_close($ch);
}
So how can I avoid waiting for open connections to close and start new ones, say every second or as soon as possible?
What am I missing here?
Thanks for your time.
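For comparison, a rolling pool might look roughly like the following untested sketch: instead of draining a whole chunk before starting the next one, each finished handle is replaced immediately. It reuses $my_site_url, $all_post_ids and $threads from the code above.
<?php
// Untested sketch of a rolling pool: each finished handle is swapped for a
// fresh one from the queue, so the pool stays full the whole time.
$mh = curl_multi_init();
$queue = $all_post_ids;
$active_handles = 0;

$add = function () use (&$queue, &$active_handles, $mh, $my_site_url) {
    if (empty($queue)) {
        return;
    }
    $id = array_shift($queue);
    $curl = curl_init(str_replace('post_id', $id, $my_site_url));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $curl);
    $active_handles++;
};

// Prime the pool with $threads handles
for ($i = 0; $i < $threads; $i++) {
    $add();
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh, 1.0);

    // Remove every handle that has finished and top the pool back up
    while ($info = curl_multi_info_read($mh)) {
        $done = $info['handle'];
        // curl_multi_getcontent($done) is available here if the body is needed
        curl_multi_remove_handle($mh, $done);
        curl_close($done);
        $active_handles--;
        $add();
    }
} while ($running || $active_handles > 0 || !empty($queue));

curl_multi_close($mh);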
I finally got my script to work, but the search (run via AJAX) takes a long time. Basically, given a keyword it searches the results page and captures the title, URL, and thumbnail of every video. The problem came when I also needed the tags inside each video: to capture them I had to visit each video page, and the only way I could think of was to add a loop inside the loop that processes the found videos, that is:
For each video found -> capture title, thumbnail, URL -> with the captured URL -> go to that URL and capture its tags.
The code I used is basically the following. I need to know whether there is any way to speed up the searches, either by optimizing the code or by using a different approach:
My parse function:
<?php
// Requires the Simple HTML DOM parser (simple_html_dom.php)
require_once 'simple_html_dom.php';

function dlPage($href) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    // Send the Accept-Language header (CURLOPT_HTTPHEADER, not CURLOPT_HEADER)
    curl_setopt($curl, CURLOPT_HTTPHEADER, array("Accept-Language: en-US"));
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $str = curl_exec($curl);
    curl_close($curl);

    // Create a DOM object
    $dom = new simple_html_dom();
    // Load HTML from a string
    $dom->load($str);
    return $dom;
}
?>
My script:
$buscartag = str_replace(' ', '+', $_POST['buscartag']);
$urlparse = "https://example.com/?k=" . $buscartag;
$paginas = rand(0, 50);
$html = dlPage($urlparse . "&p=" . $paginas);
$counter = 0;

foreach ($html->find('div.video-box') as $videos) {
    if ($videos) {
        $titulo = $videos->find('div.video-box>p[!class]>a[!class]', 0)->attr['title'];
        $pathvideo = str_replace('_', '', $videos->attr['id']);
        $link = "https://www.example.com/" . $pathvideo . "/";
        $thumb = $videos->find('div.thumb', 0)->innertext;

        // HERE MY SECOND LOOP, FOR THE TAGS!!!
        $gettags2 = array();
        $html_tags = file_get_html($link);
        foreach ($html_tags->find('a.nu') as $gettags) {
            $gettags2[] = $gettags->innertext;
        }

        if (!empty($titulo) && !empty($link) && !empty($pathvideo) && !empty($thumb)) {
            $counter++;
            // here will echo all variables
        }
    }
}
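One possible way to speed this up would be to fetch all the video pages in parallel with curl_multi and only then parse the tags, rather than making one blocking file_get_html() call per video inside the loop. A rough, untested sketch (fetchAll() and $videoLinks are names introduced here purely for illustration):
<?php
// Untested sketch: download every video page in parallel, then parse the tags
// from the returned strings with Simple HTML DOM.
function fetchAll(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh, 1.0);
    } while ($running);

    $pages = array();
    foreach ($handles as $key => $ch) {
        $pages[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $pages;
}

// First pass: collect $videoLinks[$pathvideo] = $link for every video box,
// then fetch them all at once and parse the tags:
// $pages = fetchAll($videoLinks);
// foreach ($pages as $key => $htmlString) {
//     $dom = new simple_html_dom();
//     $dom->load($htmlString);
//     foreach ($dom->find('a.nu') as $tag) {
//         $gettags2[$key][] = $tag->innertext;
//     }
// }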
I am currently trying to manipulate the DOM through PHP to extract the view count from a Facebook video page. The code below was working until a little while ago, but now it no longer finds the node that contains the view count. That information is inside a div with id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of a Facebook video page?
function _callCurl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    $http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array(
        $http,
        $response,
    );
}

function test()
{
    $url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
    $request = _callCurl($url);
    if ($request[0] == 200) {
        $dom = new DOMDocument();
        @$dom->loadHTML($request[1]);
        $elm = $dom->getElementById('fbPhotoPageMediaInfo');
        if (isset($elm->nodeValue)) {
            $views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
        } else {
            $views = null;
        }
    } else {
        echo "Error!";
    }
    return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() $request you can see that it's returning a 302 code (redirect) rather than a 200 (OK).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
Personally I prefer to use Simplehtmldom.
FB, like other high-traffic sites, updates its source to help prevent scraping. You may have to adjust your node search in the future.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/

function Scrape_FB_Views($url) {
    if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
        // Create DOM from URL
        $html = file_get_html($url);
        if ($html) {
            if (($html->find('span[class=fcg]', 3))) { // 4th instance of span with fcg class
                $text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
                $result = preg_replace('/[^0-9]/', '', $text); // replace all non-numeric characters
            } else {
                $result = "Node is no longer valid.";
            }
        } else {
            $result = "Could not get HTML.";
        }
    } else {
        $result = "URL is invalid.";
    }
    return $result;
}

$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>
I'm using PHP and cURL to scrape a web page, but this web page is poorly designed (as in, no classes or ids on the tags), so I need to search for specific text, go to the tag holding it (e.g. a <p>), then move to the next sibling (the next <p>) and get its text.
There are various things I need to get from the page, some of which are the text inside an <a onclick="get this stuff here">. So basically I feel that I need to use cURL to pull the source code into a PHP variable, and then use PHP to parse through it and find the parts I need.
Does this sound like the best method? Does anyone have pointers, or can someone demonstrate how I can put the source code from cURL into a variable?
Thanks!
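For the last part, here is a minimal, untested sketch of pulling the page source into a variable with cURL and then using DOMXPath to find a <p> by its text and read the one after it; the URL and search text are placeholders, not taken from the real site:
<?php
// Fetch the page into a string with cURL, then walk siblings with DOMXPath.
$ch = curl_init('https://website.com/page1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);                           // @ silences warnings from sloppy markup
$xpath = new DOMXPath($dom);

// Find the <p> whose text contains the label, then take its next <p> sibling
$nodes = $xpath->query('//p[contains(., "Some label")]/following-sibling::p[1]');
if ($nodes->length > 0) {
    echo trim($nodes->item(0)->textContent);
}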
EDIT (Working/Current Code) -----------
<?php
class Scrape
{
    public $cookies = 'cookies.txt';
    private $user = null;
    private $pass = null;

    /* Data generated from cURL */
    public $content = null;
    public $response = null;

    /* Links */
    private $url = array(
        'login'  => 'https://website.com/login.jsp',
        'submit' => 'https://website.com/LoginServlet',
        'page1'  => 'https://website.com/page1',
        'page2'  => 'https://website.com/page2',
        'page3'  => 'https://website.com/page3'
    );

    /* Fields */
    public $data = array();

    public function __construct($user, $pass)
    {
        $this->user = $user;
        $this->pass = $pass;
    }

    public function login()
    {
        $this->cURL($this->url['login']);
        if ($form = $this->getFormFields($this->content, 'login'))
        {
            $form['login'] = $this->user;
            $form['password'] = $this->pass;
            // echo "<pre>".print_r($form,true);exit;
            $this->cURL($this->url['submit'], $form);
            //echo $this->content;//exit;
        }
        //echo $this->content;//exit;
    }

    // NEW TESTING
    public function loadPage($page)
    {
        $this->cURL($this->url[$page]);
        echo $this->content;//exit;
    }

    /* Scan for form */
    private function getFormFields($data, $id)
    {
        if (preg_match('/(<form.*?name=.?'.$id.'.*?<\/form>)/is', $data, $matches)) {
            $inputs = $this->getInputs($matches[1]);
            return $inputs;
        } else {
            return false;
        }
    }

    /* Get Inputs in form */
    private function getInputs($form)
    {
        $inputs = array();
        $elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);
        if ($elements > 0) {
            for ($i = 0; $i < $elements; $i++) {
                $el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);
                if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $name)) {
                    $name = $name[1];
                    $value = '';
                    if (preg_match('/value=(?:["\'])?([^"\']*)/i', $el, $value)) {
                        $value = $value[1];
                    }
                    $inputs[$name] = $value;
                }
            }
        }
        return $inputs;
    }

    /* Perform curl function to specific URL provided */
    public function cURL($url, $post = false)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
        // "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($ch, CURLOPT_VERBOSE, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookies);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookies);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
        curl_setopt($ch, CURLOPT_TIMEOUT, 120);
        if ($post) //if post is needed
        {
            curl_setopt($ch, CURLOPT_POST, 1);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
        }
        curl_setopt($ch, CURLOPT_URL, $url);
        $this->content = curl_exec($ch);
        $this->response = curl_getinfo($ch);
        $this->url['last_url'] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        curl_close($ch);
    }
}

$sc = new Scrape('user','pass');
$sc->login();
$sc->loadPage('page1');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page2');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page3');
echo "<h1>TESTTESTEST</h1>";
(note: credit to @Ramz, "scrape a website with secured login")
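As an aside, the regex-based form scan above could also be done with DOMDocument, which tends to survive markup changes better. A rough, untested sketch of an equivalent standalone helper (the name and structure are not from the original code):
<?php
// Hypothetical DOM-based alternative to the regex getFormFields()/getInputs()
// pair: returns name => value for every input in the form whose name or id
// matches $id, or false if no such form exists.
function getFormFieldsDom($html, $id)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);                 // suppress warnings from messy markup
    $xpath = new DOMXPath($dom);

    $inputs = array();
    $query = "//form[@name='$id' or @id='$id']//input";
    foreach ($xpath->query($query) as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $inputs[$name] = $input->getAttribute('value');
        }
    }
    return $inputs ?: false;
}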
You can divide your problem into several parts.
1. Retrieving the data from the data source.
For that, you can use cURL or file_get_contents(), depending on your requirements. Code examples are everywhere: http://php.net/manual/en/function.file-get-contents.php and http://php.net/manual/en/curl.examples-basic.php
2. Parsing the retrieved data.
For that, I would start by looking into the "PHP Simple HTML DOM Parser". You can use it to extract data from an HTML string: http://simplehtmldom.sourceforge.net/ (a small sketch follows this list).
3. Building and generating the output.
This is simply a question of what you want to do with the data you have extracted. For example, you can print it, reformat it, or store it in a database or file.
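A small, untested sketch of steps 1 and 2 together, using file_get_contents() and Simple HTML DOM; the URL and the 'table tr' selector are placeholders, and simple_html_dom.php is assumed to be available:
<?php
require_once 'simple_html_dom.php';

$htmlString = file_get_contents('https://website.com/page1');   // 1. retrieve
$dom = str_get_html($htmlString);                                // 2. parse

foreach ($dom->find('table tr') as $row) {                       // 3. use the data
    echo trim($row->plaintext) . "\n";
}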
I suggest you use a ready-made scraper. I use Goutte (https://github.com/FriendsOfPHP/Goutte), which allows me to load website content and traverse it the same way you would with jQuery. That is, if I want the content of <div id="content">, I use $client->filter('#content')->text().
It even allows me to find and 'click' on links and submit forms to retrieve and process the content.
It makes life so much easier than using cURL or file_get_contents() and working your way through the HTML manually.
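For illustration, the Goutte flow described above might look roughly like this untested sketch (Goutte installed via Composer; the URL, selector and link text are placeholders):
<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://website.com/page1');

// Read a node's text the way you would select it with jQuery
$content = $crawler->filter('#content')->text();
echo $content;

// Find a link by its text, 'click' it, and work with the next page
$link = $crawler->selectLink('Next')->link();
$crawler = $client->click($link);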
I have Server A and Server B, which exchange some data. On a user request, Server A pulls data from Server B using a simple file_get_contents() call with some params, so that Server B can do all the work (database queries etc.) and return the results to A, which formats them and shows them to the user. Everything is in PHP.
Now I am interested in the fastest way to do this. I ran some tests: the average time for a response from Server B is about 0.2 sec. Of that, roughly 0.1 sec is Server B's own processing time (pulling data, calling a few databases, etc.), which means the average transfer time for about 50 kB is 0.1 sec. (The servers are NOT on the same network.)
Should I try:
cURL instead of file_get_contents?
Doing the whole thing with sockets? (I have never worked with sockets in PHP, but I suppose it can be done fairly easily, and that way I could skip the web server.)
Or something else entirely?
I think time could be gained by shortening connection establishment, since right now every request initiates a new connection (I mean each separate file_get_contents() call, or am I wrong?).
Please give me your advice on which directions to try, or if you have a better solution I am listening.
Curl:
function curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
Sockets:
function sockets($host) {
    $fp = fsockopen("www.".$host, 80, $errno, $errstr, 30);
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.".$host."\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $f = '';
    while (!feof($fp)) {
        $f .= fgets($fp, 1024);
    }
    fclose($fp);
    return $f;
}
file_get_contents
function fgc($url) {
    return file_get_contents($url);
}
Multicurl
function multiRequest($data, $nobody = false, $options = array(), $oneoptions = array())
{
    $curls = array();
    $result = array();
    $mh = curl_multi_init();
    foreach ($data as $id => $d)
    {
        $curls[$id] = curl_init();
        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curls[$id], CURLOPT_URL, $url);
        curl_setopt($curls[$id], CURLOPT_HEADER, 0);
        curl_setopt($curls[$id], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curls[$id], CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($curls[$id], CURLOPT_USERAGENT, "Mozilla/5.0(Windows;U;WindowsNT5.1;ru;rv:1.9.0.4)Gecko/2008102920AdCentriaIM/1.7Firefox/3.0.4");
        //curl_setopt($curls[$id], CURLOPT_COOKIEJAR, 'cookies.txt');
        //curl_setopt($curls[$id], CURLOPT_COOKIEFILE, 'cookies.txt');
        //curl_setopt($curls[$id], CURLOPT_NOBODY, $nobody);
        if (!empty($options))
        {
            curl_setopt_array($curls[$id], $options);
        }
        if (!empty($oneoptions[$id]))
        {
            curl_setopt_array($curls[$id], $oneoptions[$id]);
        }
        if (is_array($d))
        {
            if (!empty($d['post']))
            {
                curl_setopt($curls[$id], CURLOPT_POST, 1);
                curl_setopt($curls[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }
        curl_multi_add_handle($mh, $curls[$id]);
    }
    $running = null;
    do
    {
        curl_multi_exec($mh, $running);
    }
    while ($running > 0);
    foreach ($curls as $id => $content)
    {
        $result[$id] = curl_multi_getcontent($content);
        //echo curl_multi_getcontent($content);
        curl_multi_remove_handle($mh, $content);
    }
    curl_multi_close($mh);
    return $result;
}
Tests:
$url = 'example.com';
$start = microtime(1);
for ($i = 0; $i < 100; $i++)
    curl($url);
$end = microtime(1);
echo "Curl:".($end-$start)."\n";

$start = microtime(1);
for ($i = 0; $i < 100; $i++)
    fgc("http://$url/");
$end = microtime(1);
echo "file_get_contents:".($end-$start)."\n";

$start = microtime(1);
for ($i = 0; $i < 100; $i++)
    sockets($url);
$end = microtime(1);
echo "Sockets:".($end-$start)."\n";

$start = microtime(1);
for ($i = 0; $i < 100; $i++)
    $arr[] = $url;
multiRequest($arr);
$end = microtime(1);
echo "MultiCurl:".($end-$start)."\n";
?>
Results:
Curl: 5.39667105675
file_get_contents: 7.99799394608
Sockets: 2.99629592896
MultiCurl: 0.736907958984
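On the connection-reuse point raised in the question, here is a rough, untested sketch of keeping a single cURL handle open across calls so the connection to Server B is not re-established for every request; the URLs are placeholders:
<?php
// Reusing one handle lets cURL keep the connection alive between calls,
// avoiding a new TCP/TLS handshake for every request to Server B.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

function fetch($ch, $url)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    return curl_exec($ch);   // cURL reuses the open connection when it can
}

$a = fetch($ch, 'http://serverB.example/data?x=1');
$b = fetch($ch, 'http://serverB.example/data?x=2');
curl_close($ch);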
What is the fastest way?
Get your data on a flash drive.
Now seriously.
Come on, it's the network that's slow. You cannot make it faster.
To make Server A respond faster, DO NOT request data from Server B. That's the only way.
You can replicate your data, or cache it, or just abandon such a clumsy setup altogether.
But as long as you have to make a network lookup on every user request, it WILL be slow, regardless of the method you use. It is not the method, it is the medium. Isn't that obvious?
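As a rough, untested sketch of the caching idea (the cache path, TTL and URL are placeholders):
<?php
// Keep Server B's response on Server A's disk for a short TTL so repeated
// requests skip the network entirely.
function cached_fetch($url, $ttl = 60)
{
    $file = sys_get_temp_dir() . '/b_cache_' . md5($url);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);   // fresh enough: serve from local disk
    }
    $data = file_get_contents($url);       // otherwise go over the network
    if ($data !== false) {
        file_put_contents($file, $data);
    }
    return $data;
}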
You can try a different approach: mount the remote filesystem on the local machine. You can do that with sshfs, so you also get the additional security of an encrypted connection.
It may even be more efficient, since PHP will not have to deal with connection negotiation and establishment.