PHP cURL to get dynamic content - php
I am trying to use cURL and PHP to scrape proxies off a webpage. However, when I use cURL, all I get in $content is the CSS. The page uses WordPress, so it loads its content dynamically, but I haven't found anything that helps me download the dynamic content. When I use wget on Linux, the page downloads fine.
//$source1 = file_get_contents('');
$source1 = get_data("");
$array = array();
$source1 = preg_grep('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5}\b/', $array);
// download webpage
function get_data($url) {
    $options = array(
        CURLOPT_RETURNTRANSFER => 1, // return web page
        CURLOPT_HEADER => true, // include response headers in the output
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20080311 Firefox/", // who am i
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 120, // timeout on response
        CURLOPT_MAXREDIRS => 50, // stop after 50 redirects
    );
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
My output:
string(203221) "HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Expires: Wed, 06 Feb 2013 22:09:23 GMT
Date: Wed, 06 Feb 2013 22:09:23 GMT
Cache-Control: private, max-age=0
Last-Modified: Wed, 06 Feb 2013 20:39:30 GMT
ETag: "c6675d47-80ec-48ee-9c0f-613c9419f172"
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 47132
Server: GSE
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html dir='ltr' xmlns='' xmlns:b='' xmlns:data='' xmlns:expr=''>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<script type="text/javascript">(function() { var a=window,b="jstiming",d="tick";var e=function(c){this.t={};this.tick=function(c,p,h){h=void 0!=h?h:(new Date).getTime();this.t[c]=[h,p]};this[d]("start",null,c)},f=new e;a.jstiming={Timer:e,load:f};if(a.performance&&a.performance.timing){var g=a.performance.timing,j=a[b].load,k=g.navigationStart,l=g.responseStart;0<k&&l>=k&&(j[d]("_wtsrt",void 0,k),j[d]("wtsrt_","_wtsrt",l),j[d]("tbsd_","wtsrt_"))}
try{var m=null;,j&&0<k&&(j[d]("_tbnd",void 0,,j[d]("tbnd_","_tbnd",k)));null==m&&a.gtbExternal&&(m=a.gtbExternal.pageT());null==m&&a.external&&(m=a.external.pageT,j&&0<k&&(j[d]("_tbnd",void 0,a.external.startE),j[d]("tbnd_","_tbnd",k)));m&&(a[b].pt=m)}catch(n){};a.tickAboveFold=function(c){var i=0;if(c.offsetParent){do i+=c.offsetTop;while(c=c.offsetParent)}c=i;750>=c&&a[b].load[d]("aft")};var q=!1;function r(){q||(q=!0,a[b].load[d]("firstScrollTime"))}a.addEventListener?a.addEventListener("scroll",r,!1):a.attachEvent("onscroll",r);
<meta content='true' name='MSSmartTagsPreventParsing'/>
<meta content='blogger' name='generator'/>
<link href='' rel='icon' type='image/x-icon'/>
<link href='' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" title="New Fresh Proxies - Atom" href="" />
<link rel="alternate" type="application/rss+xml" title="New Fresh Proxies - RSS" href="" />
<link rel="" type="application/atom+xml" title="New Fresh Proxies - Atom" href="" />
<link rel="EditURI" type="application/rsd+xml" title="RSD" href="" />
<link rel="openid.server" href="" />
<link rel="openid.delegate" href="" />
<!--[if IE]> <script> (function() { var html5 = ("abbr,article,aside,audio,canvas,datalist,details," + "figure,footer,header,hgroup,mark,menu,meter,nav,output," + "progress,section,time,video").split(','); for (var i = 0; i < html5.length; i++) { document.createElement(html5[i]); } try { document.execCommand('BackgroundImageCache', false, true); } catch(e) {} })(); </script> <![endif]-->
<title>New Fresh Proxies</title>
<link type='text/css' rel='stylesheet' href='//' />
<link type="text/css" rel="stylesheet" href="//"/>
input.span-1, textarea.span-1, input.span-2, textarea.span-2, input.span-3, textarea.span-3, input.span-4, textarea.span-4, input.span-5, textarea.span-5, input.span-6, textarea.span-6, input.span-7, textarea.span-7, input.span-8, textarea.span-8, input.span-9, textarea.span-9, input.span-10, textarea.span-10, input.span-11, te...
cURL won't be able to get it directly, since it doesn't execute JavaScript. But if the content comes from an Ajax request, you can make a request to that endpoint directly.
Use your browser's dev tools or Firebug to see which requests the page makes.
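If the network tab shows the proxy list arriving from a separate request, the same cURL approach can be pointed at that URL instead of the page itself. A minimal sketch, assuming a placeholder endpoint (`https://example.com/ajax/proxies` is not taken from the page in question) and two made-up helper names, `build_endpoint_options` and `fetch_endpoint`:

```php
<?php
// Hypothetical helper: cURL options for calling an endpoint found in the
// browser's network tab. The extra header mimics a typical Ajax request.
function build_endpoint_options(string $referer): array {
    return array(
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_ENCODING       => "",    // accept any encoding cURL supports
        CURLOPT_TIMEOUT        => 30,    // give up after 30 seconds
        CURLOPT_REFERER        => $referer,
        CURLOPT_HTTPHEADER     => array('X-Requested-With: XMLHttpRequest'),
    );
}

// Hypothetical helper: fetch the endpoint and return its body, or null on failure.
function fetch_endpoint(string $endpoint, string $referer): ?string {
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, build_endpoint_options($referer));
    $body = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return $body === false ? null : $body;
}

// Placeholder URLs; substitute the request you actually see in the dev tools.
// $body = fetch_endpoint('https://example.com/ajax/proxies', 'https://example.com/');
```

Whether the referer or the `X-Requested-With` header is needed depends on the site; many endpoints respond without them.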
A couple of things:
Where is your 'output' coming from? I see no output statements in your code ...
I also think your preg_grep statement is incorrect. You're searching a blank array and saving the result to the variable you just pulled your data into. Try:
$array = preg_grep('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5}\b/', $source1);
When I run the code and dump $source1['content'] directly after the get_data call, I get a crap-ton of IP addresses ...
It seems to me like either a timeout or a problem with your regexp.
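To narrow down which of the two it is, it may help to check the transfer result before running the pattern. A rough sketch, assuming the array shape returned by `get_data()` in the question (the `errno`, `errmsg` and `content` keys); the name `extract_proxies` is made up for illustration. It runs `preg_match_all` on the content string, because `preg_grep` only filters array elements and returns each matching element whole:

```php
<?php
// Hypothetical helper: pull ip:port pairs out of a get_data()-style result.
function extract_proxies(array $result): array {
    // A non-zero errno means the transfer itself failed (28 is cURL's timeout code).
    if (!empty($result['errno'])) {
        throw new RuntimeException('cURL error ' . $result['errno'] . ': ' . $result['errmsg']);
    }
    $content = isset($result['content']) ? (string) $result['content'] : '';
    // Same pattern as in the question; preg_match_all collects every match in the string.
    preg_match_all('/\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}\b/', $content, $matches);
    return $matches[0];
}

// Example with a hand-written result array instead of a live request.
$sample = array(
    'errno'   => 0,
    'errmsg'  => '',
    'content' => '<p>10.0.0.1:8080<br/>192.168.1.5:3128</p>',
);
print_r(extract_proxies($sample));
```

If the exception fires, the problem is on the transfer side; if the returned array is empty even though the content contains addresses, the pattern is the place to look.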
Why not stick to file_get_contents like you tried in the first place?
$content = file_get_contents('');
preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);
print_r($matches[0]); // full matches (ip or ip:port)
This will print out a list of IPs:
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
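The pattern above is fairly loose: `\d+` also accepts out-of-range values such as `999.1.1.1`. If that matters, one possible follow-up is to filter the matches afterwards. A sketch, where `filter_proxies` is a hypothetical helper and the sample text is invented:

```php
<?php
// Hypothetical helper: keep only entries that are a valid IPv4 address,
// optionally followed by a port between 1 and 65535.
function filter_proxies(array $candidates): array {
    $valid = array();
    foreach ($candidates as $candidate) {
        $parts = explode(':', $candidate, 2);
        // filter_var returns false when the string is not a valid IPv4 address.
        if (filter_var($parts[0], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
            continue;
        }
        if (isset($parts[1])) {
            $port = (int) $parts[1];
            if ($port < 1 || $port > 65535) {
                continue;
            }
        }
        $valid[] = $candidate;
    }
    return $valid;
}

// Invented sample text, not output from the page in the question.
$content = 'ok 10.0.0.1:8080, bad 999.1.1.1:80, bad 10.0.0.2:70000, ok 10.0.0.3';
preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);
print_r(filter_proxies($matches[0]));
```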
Hope that helps.
