How can I make this PHP script run faster/asynchronously? - php

I have a pastebin scraper script, which is designed to find leaked emails and passwords, to make a website like HaveIBeenPwned.
Here is what my script is doing:
- Scraping Pastebin links from https://psbdmp.ws/dumps
- Getting a random proxy using this Random Proxy API (because Pastebin bans your IP if you hammer too many requests): https://api.getproxylist.com/proxy
- Doing a CURL request to the Pastebin links, then doing a preg_match_all to find all the email addresses and passwords in the format email:password.
The actual script seems to be working alright, but it isn't optimized, and after some time it gives me a 524 timeout error, which I suspect is because of all those cURL requests. Here is my code:
api.php
function comboScrape_CURL($url) {
    // Get a random proxy (decode the API response into plain variables
    // rather than assigning properties to an undeclared object)
    $json = file_get_contents("https://api.getproxylist.com/proxy");
    $decoded = json_decode($json);
    $proxy = $decoded->ip . ':' . $decoded->port;
    // Crawl the paste through the proxy
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0); // 0 so response headers don't pollute the scraped body
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);
    // helper (not shown) that runs the email:password preg_match_all
    comboScrape('email:pass', $curl_scraped_page);
}
index.php
require('api.php');

$expression = "/(?:https\:\/\/pastebin\.com\/\w+)/";
// pages '' (front page) through 20 of https://psbdmp.ws/dumps
$extension = ['','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20'];

foreach ($extension as $pge_number) {
    $dumps = file_get_contents("https://psbdmp.ws/dumps/".$pge_number);
    preg_match_all($expression, $dumps, $urls);
    $codes = str_replace('https://pastebin.com/', '', $urls[0]);
    foreach ($codes as $code) {
        comboScrape_CURL("https://pastebin.com/raw/".$code);
    }
}

524 timeout error - err, it seems you're running PHP behind a web server (Apache? nginx? lighttpd? IIS?). Don't do that; run your code from php-cli instead. php-cli can run indefinitely and never times out.
because Pastebin bans your IP if you hammer too many requests - buy a pastebin.com pro account instead ( https://pastebin.com/pro ). It costs about $50 (or $20 around Christmas and Black Friday), is a lifetime account with a one-time payment, and gives you access to the scraping API ( https://pastebin.com/doc_scraping_api ). With the scraping API you can fetch about 1 paste per second, or 86,400 pastes per day, without getting IP banned.
and because of pastebin.com's rate limits, there is no need to do this asynchronously with multiple connections (it's possible, but not worth the hassle. if you actually needed to do that however, you'd have to use the curl_multi API)
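For reference, a minimal curl_multi sketch (my own illustration, not code from the question): comboScrape_multi() is a hypothetical helper that fetches a batch of paste URLs concurrently; the proxy handling and the comboScrape() call from the question are omitted for brevity.
<?php
// Sketch: fetch several URLs concurrently with curl_multi, a few at a time.
function comboScrape_multi(array $urls, int $concurrency = 5) {
    $results = [];
    $mh = curl_multi_init();
    foreach (array_chunk($urls, $concurrency) as $chunk) {
        $handles = [];
        foreach ($chunk as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        // Drive all transfers in this chunk to completion.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh); // wait for activity instead of busy-looping
            }
        } while ($running && $status === CURLM_OK);
        foreach ($handles as $url => $ch) {
            $results[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
    }
    curl_multi_close($mh);
    return $results; // e.g. feed each page into comboScrape('email:pass', $page)
}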

Related

PHP Curl Multi (curl_multi) and proxies

So,
I have an issue where I need to make multiple HTTP requests (I mean hundreds, even thousands) using PHP, and these requests need to be run through a proxy to protect the server info/IP.
The problem is that doing these requests with plain PHP cURL works fine but takes a long time. If I use PHP curl_multi, it takes a lot less time. But if I introduce the PROXY into this, curl_multi takes significantly longer than not using curl_multi (one cURL request after another).
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_URL, $_REQUEST['url']);
// Proxy
$proxyStr = 'user:pass@123.123.123.123:80'; // username:password@ip:port
curl_setopt($ch, CURLOPT_PROXY, $proxyStr);
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP); // use the constant, not the string 'HTTP'
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
$req = "something= somethingelse \r\n"; // request that needs to be sent
// Note: CURLOPT_CUSTOMREQUEST replaces the HTTP method keyword; a request body
// would normally be sent with CURLOPT_POSTFIELDS instead.
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $req); // Query
$result = curl_exec($ch);
// This is some sample code; the curl_multi version uses curl_multi_init() and curl_multi_exec().
// Here is sample code for curl_multi; just introduce the proxy options in the middle of it:
// http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/
My question is: what am I doing wrong? curl_multi with a proxy takes a HUGE amount of time longer, regardless of whether I run 2 at a time or 5 at a time. But running them through curl_multi 1 at a time is quicker (not quick enough, but a LOT quicker than 2 or more).
As a poor man's multithreading, I added a worker script to the server which pretty much takes an HTTP POST and does a single cURL request on the data (the POST contains the proxy, so this worker uses the proxy). From the main server, I use curl_multi to POST details to the worker script (about 2-5 requests at a time). This method works "better", but not by a huge amount. E.g. if I use curl_multi to send 10 requests to the worker 1 at a time, it finishes in ~10 secs. If I send 10 requests 5 at a time, it finishes in ~8-9 secs. So it's definitely faster, but not as fast as I need it to be.
Does anyone have any ideas/suggestions on why this is the case?
Is there something about PHP curl_multi and proxies that I'm missing? I assumed that multiple PHP requests coming INTO a server would execute in parallel (which is why I was referring to curl_multi'ing into the worker script that does the proxy request as a poor man's multithreading). For anyone suggesting pthreads: would that actually solve the problem? (It is not practical for me to install pthreads, but from what I can see, I doubt it would.)
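For illustration only (my sketch, not code from the post): attaching the proxy to every handle in a curl_multi batch, with connect/transfer timeouts so one stalled proxy connection cannot hold up the rest. fetch_all_via_proxy() and its arguments are made-up names.
<?php
function fetch_all_via_proxy(array $urls, $proxyStr) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_PROXY, $proxyStr);          // e.g. 'user:pass@123.123.123.123:80'
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP); // constant, not the string 'HTTP'
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);        // slow proxy handshakes often dominate
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running && $status === CURLM_OK);
    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}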

Diagnosing bottlenecks when fetching data from API

I am running a dedicated server that fetches data from an API server. My machine runs on a Windows Server 2008 OS.
I use PHP's cURL functions to fetch the data via HTTP requests (through a proxy). The function I've created for that:
function get_http($url)
{
    // NOTE: this reads and splits the proxy list on every single call
    $proxy_file = file_get_contents("proxylist.txt");
    $proxy_file = explode("\n", $proxy_file);
    $how_Many_Proxies = count($proxy_file);
    $which_Proxy = rand(0, $how_Many_Proxies - 1); // -1 so the index stays in range
    $proxy = $proxy_file[$which_Proxy];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);
    return $curl_scraped_page;
}
I then save the result in the MySQL database, using this simple code that I run in 20-40-60-100 parallel copies with cURL (after a certain number, adding more doesn't increase performance, and I wonder where the bottleneck is):
function retrieveData($id)
{
    $the_data = get_http("http://api-service-ip-address/?id=$id");
    return $the_data;
}

$ids_List = file_get_contents("the-list.txt");
$ids_List = explode("\n", $ids_List);

for ($a = 0; $a < 50; $a++)
{
    $array[$a] = retrieveData($ids_List[$a]); // wrapper builds the API URL from the id
}
for ($b = 0; $b < 50; $b++)
{
    $insert_Array[] = "('$ids_List[$b]', NULL, '$array[$b]')";
}
$insert_Array = implode(',', $insert_Array);
$sql = "INSERT INTO `the_data` (`id`, `queue_id`, `data`) VALUES $insert_Array;";
mysql_query($sql);
After many optimizations, I am stuck at retrieving/fetching/saving around 23 rows of data per second.
The MySQL table is pretty simple and looks like this:
id | queue_id(AI) | data
Keep in mind, that the database doesn't seem to be the bottleneck. When I check the CPU usage, the mysql.exe process barely ever goes over 1%.
I fetch the data via 125 proxies. I've decreased the amount to 20 for the test and it DIDN'T make any difference ( suggesting that the proxies are not the bottleneck? - because I get the same performance when using 5 times less of them? )
So if the MySQL and Proxies are not the cause of the limit, what else can be it and how can I find out?
So far, the optimizations I've made:
- replaced file_get_contents with cURL functions for retrieving the HTTP data
- replaced the https:// URL with an http:// one (is this faster?)
- indexed the table
- replaced the API domain name with a pure IP address (so DNS lookup time isn't a factor)
- used only private proxies that have low latency
My questions:
- What may be the possible cause of the performance limit?
- How do I find the reason for the limit?
- Can this be caused by some TCP/IP limitation or a poorly configured Apache/Windows setup?
The API itself is really fast and serves many times more queries to other people, so I don't believe the API is what's holding things back.
You are reading the proxy file every time you call the cURL function. I recommend moving the read operation outside the function: read the proxies once and store them in an array so you can reuse them.
Use the CURLOPT_TIMEOUT cURL option to define a fixed amount of time for your cURL execution (for example 3 seconds). It will help you debug whether the cURL operation itself is the issue.
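A minimal sketch of both suggestions combined, assuming the same proxylist.txt file as above; get_http_with_proxy() is an illustrative name, not the poster's function.
<?php
// Load and split the proxy list once, outside the request loop.
$proxies = array_filter(array_map('trim', file('proxylist.txt')));

function get_http_with_proxy($url, array $proxies)
{
    $proxy = $proxies[array_rand($proxies)]; // random proxy from the pre-loaded list
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3); // give up quickly on unreachable proxies
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // cap the whole transfer
    $page = curl_exec($ch);
    curl_close($ch);
    return $page;
}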

Proxy blocking file_get_contents

In my business I have to use Google Maps in my application (to calculate distances).
We currently use a configuration script for the proxy.
In my application I query Google Maps with file_get_contents().
$url = 'http://maps.google.com/maps/api/directions/xml?language=fr&origin='.$adresse1.'&destination='.$adresse2.'&sensor=false';
$xml = file_get_contents($url);
$root = simplexml_load_string($xml);
$distance = $root->route->leg->distance->value;
$duree = $root->route->leg->duration->value;
$etapes = $root->route->leg->step;
return array(
    'distanceEnMetres' => $distance,
    'dureeEnSecondes' => $duree,
    'etapes' => $etapes,
    'adresseDepart' => $root->route->leg->start_address,
    'adresseArrivee' => $root->route->leg->end_address
);
}
But with the proxy I get an unknown host error. (I tested at home, and the code works fine.) I wanted to know if there is a way to take the proxy into account, identifying myself to it the same way I do when I browse the web?
You can do this with cURL. It's a bit more verbose than a simple call to file_get_contents(), but a lot more configurable:
$url = 'http://maps.google.com/maps/api/directions/xml?language=fr&origin='.$adresse1.'&destination='.$adresse2.'&sensor=false';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_PROXY, ''); // your proxy address (and optional :port)
curl_setopt($handle, CURLOPT_PROXYUSERPWD, ''); // credentials in username:password format (if required)
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($handle);
curl_close($handle);
//continue logic using $xml as before
Google restricts the number of queries per time period for its API.
First, you should cache results on your side, and add a pause between queries:
https://developers.google.com/maps/documentation/business/articles/usage_limits
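A rough sketch of that idea, assuming a simple file cache; the cache directory and the function name are hypothetical.
<?php
function get_directions_cached($url, $cacheDir = '/tmp/gmaps-cache', $pauseSeconds = 1)
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $cacheFile = $cacheDir . '/' . md5($url) . '.xml';
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile); // served from cache, no API call
    }
    $xml = file_get_contents($url);
    if ($xml !== false) {
        file_put_contents($cacheFile, $xml);
    }
    sleep($pauseSeconds); // simple pause between real API calls
    return $xml;
}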
Of course you can also pass a "context" to make file_get_contents() recognize a proxy. Please see my own question and my own answer to that question here
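For reference, a minimal sketch of that stream-context approach, reusing the $url built in the snippet above; the proxy address and credentials are placeholders.
<?php
$context = stream_context_create([
    'http' => [
        'proxy'           => 'tcp://192.168.0.1:8080', // your proxy host:port
        'request_fulluri' => true,                     // many HTTP proxies require the full URI
        'header'          => 'Proxy-Authorization: Basic ' . base64_encode('user:password'),
    ],
]);
$xml = file_get_contents($url, false, $context);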

php curl check if url is reachable before query

We're having problems with an api we are using.
Here is the code we're using (naming no names on the api front)
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://apiurl.com/whatever/api/we/call');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$ch_output = curl_exec($ch);
curl_close($ch);
The response times out, but only after a very long wait. This hideously slows down our web app, and further code then breaks because of the bad return value. That part I can fix; the response timeout, however, I don't know how to fix. Is there any way to quickly check whether a URL is "responding" (e.g. something like ping in a terminal) before trying to do a cURL request?
Thank you.
Do you mean using curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, NUMERIC_TIMEOUT_VALUE); to set the timeout?
Your best option would be to set the timeouts on cURL to a more acceptable level. There are several timeout options available for DNS lookup, connect timeout, transfer timeout, etc. More information is available here: http://php.net/manual/en/function.curl-setopt.php
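A short sketch of those options applied to the code in the question; the timeout values are illustrative, not recommendations.
<?php
$ch = curl_init('http://apiurl.com/whatever/api/we/call');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2); // max seconds to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 5);        // max seconds for the whole request
$ch_output = curl_exec($ch);
if ($ch_output === false) {
    // curl_errno($ch) tells you why it failed (28 means the request timed out)
    error_log('API call failed: ' . curl_error($ch));
}
curl_close($ch);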

Effective way of fetching web pages

I have to fetch multiple web pages, let's say 100 to 500. Right now I am using curl to do so.
function get_html_page($url) {
//create curl resource
$ch = curl_init();
//set url
curl_setopt($ch, CURLOPT_URL, $url);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
//$output contains the output string
$html = curl_exec($ch);
//close curl resource to free up system resources
curl_close($ch);
return $html;
}
My major concern is the total time my script takes to fetch all these web pages. I know the time taken depends mostly on my internet speed, and hence the majority of the time is spent in the $html = curl_exec($ch); call.
I was thinking that instead of creating and destroying a cURL instance again and again for each and every web page, I could create it only once, reuse it for every page, and destroy it at the end. Something like:
<?php
function get_html_page($ch, $url) {
    // set the URL for this request; the other options stay as configured below
    curl_setopt($ch, CURLOPT_URL, $url);
    // $html contains the output string
    $html = curl_exec($ch);
    return $html;
}
// create curl resource once
$ch = curl_init();
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
.
.
.
<fetch web pages using get_html_page()>
.
.
.
// close curl resource to free up system resources
curl_close($ch);
?>
Will it make any significant difference in the total time taken? If there is a better approach, please let me know about that too.
How about trying to benchmark it? It may be more efficient to do it the second way, but I don't think it will add up to much. I'm sure your system can create and destroy curl instances in microseconds. It has to initiate the same HTTP connections each time either way, too.
If you were running many of these at the same time and were worried about system resources, not time, it might be worth exploring. As you noted, most of the time spent doing this will be waiting for network transfers, so I don't think you'll notice a change in overall time with either method.
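A rough way to settle it is to time both variants with microtime(); here is a sketch of the reused-handle version, where $urls is a placeholder array of pages to fetch.
<?php
$start = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url); // only the URL changes between requests
    $html = curl_exec($ch);
    // ... process $html ...
}
curl_close($ch);
printf("Reused handle: %.2f seconds\n", microtime(true) - $start);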
For web scraping I would use: YQL + JSON + XPath. You'd implement it using cURL.
I think you'll save a lot of resources.
