Retrieving web content multiple times simultaneously in PHP

I use cURL to retrieve web content from another site, but there are two problems.
First, it takes about 4 seconds on average to retrieve the content, and I don't know if there is any way to reduce that. Second, sometimes the remote server returns a short error message instead of the full content.
I was thinking it would be very efficient to send 3 requests simultaneously, then check the first response received to see whether it is an error, and check the second response only if the first one was an error. That way there would be no need to wait a full 4 seconds for a retry if the first response was an error. But I don't know if this is possible.
It would be much appreciated if anyone could write some simple code to send multiple requests simultaneously.
Here is my initial code:
<?php
$then = microtime(true);
$url='http://www.example.com/StockInformationHandler.ashx?{%22Type%22:%22getstockprice%22,%22la%22:%22En%22,%22arr%22:%22IRO1LKGH0001,IRO1MADN0001,IRO1KAVR0001,IRO1BHMN0001,IRO1PNBA0001,IRO3ASPZ0001,IRO1IKCO0001,IRO1BANK0001,IRO1IKHR0001,IRO1BDAN0001,IRO1SHND0001,IRO1NBEH0001,IRO1PIAZ0001,IRO1RSAP0001,IRO3DTDZ0001,IRO1TSBE0001,IRO1NAFT0001,IRO1PKHA0001,IRO1AZAB0001,IRO3BMAZ0001,IRO1BANS0001,IRO1BAFG0001,IRO1TAYD0001,IRO1GDIR0001,IRO1SPDZ0001,IRO1NALM0001,IRO1TOSA0001,IRO1BSDR0001,IRO1ZMYD0001,IRO1SBEH0001,IRO3SMBZ0001,IRO1PASN0001,IRO1SPAH0001,IRO1PNES0001,IRO1ALIR0001,IRO1MKBT0001,IRO1FKHZ0001,IRO1RENA0001,IRO1SSHR0001,IRO1PKER0001,IRO1SAHD0001,IRO1BTEJ0001,IRO1DADE0001,IRO1PARK0001,IRO1SKRN0001,IRO1FOLD0001,IRO3KHMZ0001,IRO1NASI0001,IRO3FAYZ0001,IRO1ALMR0001,IRO1NSTH0001,IRO1BMLT0001,IRO1TKSM0001,IRO1AYEG0001,IRO1GTSH0001,IRO1COMB0001,IRO3IMFZ0001,IRO1INDM0001,IRO1DSOB0001,IRO3NPSZ0001,IRO1BPAS0001,IRO1PKOD0001,IRO1OFST0001,IRO3NOLZ0001,IRO1RIIR0001,IRO1ATDM0001,IRO1SNMA0001,IRO1VSIN0001,IRO1MNGZ0001,IRO3PSKZ0001,IRO3KRMZ0001,IRO1AMIN0001,IRO1PRDZ0001,IRO1GLOR0001,IRO1SAJN0001,IRO3BGHZ0001,IRO1PAKS0001,IRO1SIPA0001,IRO1PLKK0001,IRO1KSHJ0001,IRO3BDYZ0001,IRO1LEAB0001,IRO1KHSH0001,IRO1KRIR0001,IRO1PKLJ0001,IRO1HTOK0001,IRO1BPST0001,IRO1TAMI0001,IRO1DTIP0001,IRO1SAND0001,IRO1SHPZ0001,IRO1SHKR0001,IRO1CHAR0001,IRO1ALBZ0001,IRO1LMIR0001,IRO1TRIR0001,IRO3ETLZ0001,IRO1CRBN0001,IRO1SAKH0001,IRO3MSZ93011,IRO1NSAZ0001,IRO3BLSZ0001,IRO1MSMI0001,IRO1TMEL0001,IRO1BALB0001,IRO1LZIN0001,IRO3RFNZ0001,IRO1ABAD0001,IRO1LSMD0001,IRO1DPAK0001,IRO1PLAK0001,IRO1MAVA0001,IRO1GSBE0001,IRO3BIPZ0001,IRO1PSIR0001,IRO1BALI0001,IRO1LIRZ0001,IRO3PKSH0001,IRO1NOVN0001,IRO3GRDZ0001,IRO3PMRZ0001,IRO1ROZD0001,IRO1PABD0001,IRO1SSEP0001,IRO1SGRB0001,IRO1NSPS0001,IRO3ARFZ0001,IRO1GNBO0001,IRO1KHAZ0001,IRO1KIMI0001,IRO1SSIN0001,IRO1PETR0001,IRO1BSTE0001,IRO3DSOZ0001,IRO1TSHE0001,IRO1NMOH0001,IRO1SBAH0001,IRO1TMKH0001,IRO3MNOZ0001,IRO1MAPN0001,IRO1SGAZ0001,IRO1MAGS0001,IRO3PRZZ0001,IRO1HSHM0001,IRO3MSZ
93021,IRO1KGND0001,IRO1SGOS0001,IRO1SISH0001,IRO1SKER0001,IRO1KCHI0001,IRO3MRJZ0001,IRO1FRVR0001,IRO1GHEG0001,IRO1KRTI0001,IRO1RINM0001,IRO1GHAT0001,IRO1PSHZ0001,IRO1PNTB0001,IRO1KNRZ0001,IRO1BOTA0001,IRO1GOLG0001,IRT3SATF0001,IRO3AFRZ0001,IRR3ASPZ0101,IRO7ARMP0001,IRO3ZF090001,IRT3SSAF0001,IRT3CASF0001,IRO1ASIA0001,IRO1CONT0001,IRO7BHEP0001,IRO1BAKH0001,IRO1KALZ0001,IRO1KBLI0001,IRO1TRNS0001,IRO1KTAK0001,IRO3BSMZ0001,IRR3BSMZ0101,IRO1SWIC0001,IRO1LAPS0001,IRO1JOSH0001,IRO1MOTJ0001,IRR1MOTJ0101,IRO7MILP0001,IRO1NIRO0001,IRO7BHPP0001,IRO3ZF180001,IRO1ARTA0001,IRO1IPAR0001,IRO1YASA0001,IRO1PASH0001,IRO1TAIR0001,IRO3ZF340001,IRO3ZF140001,IRO7IPTP0001,IRR7IPTP0101,IRO3ZF280001,IRO1DRKH0001,IRO3ZF040001,IRO7SHIP0001,IRO1BARZ0001,IRO1PLST0001,IRO1GAZL0001,IRO1PTAP0001,IRO7HPKP0001,IRO1PIRN0001,IRO7GSIP0001,IRO1MOZI0001,IRO3MSZ93031,IRO3MSZ93041,IRO3MSZ93051,IRO3MSZ93061,IRO3MSZ93071,IRO3MSZ93081,IRO3MSZ93091,IRO3MSZ93101,IRO3MSZ93111,IRO3MSZ93121,IRO3MSZ94011,IRO3MSZ94021,IRO3MSZ94031,IRO3MSZ94041,IRO3MSZ94051,IRO1FROZ0001,IRO1GSKE0001,IRO7NARP0001,IRO1TKNO0001,IRO1MNMH0001,IRO3TORZ0001,IRO1SESF0001,IRO3PMTZ0001,IRO1PMSZ0001,IRO3OSHZ0001,IRO3SGDZ0001,IRO3CGRZ0001,IRO1OFRS0001,IRO1MSKN0001,IRO3PJMZ0001,IRO7BPRP0001,IRO1FIBR0001,IRO1KMSH0001,IRO1KSKA0001,IRO3ZF160001,IRO1NEOP0001,IRO1HJPT0001,IRO3KHZZ0001,IRO7RAHP0001,IRO1HFRS0001,IRO1KFJR0001,IRO7BHKP0001,IRO1AZIN0001,IRO1ATIR0001,IRO1SZPO0001,IRO1RTIR0001,IRO1RADI0001,IRR1CHAR0101,IRO3ZF030001,IRO1FNAR0001,IRO7SDRP0001,IRO1IDOC0001,IRO1KFAN0001,IRO1GOST0001,IRO1LENT0001,IRO1MHKM0001,IRO1MSTI0001,IRO1MNSR0001,IRR1IKCO0101,IRO1MESI0001,IRO1DABO0001,IRO1DOSE0001,IRO1DALZ0001,IRO1PDRO0001,IRO1TMVD0001,IRO1THDR0001,IRO1DJBR0001,IRO7DHVP0001,IRO1DAML0001,IRO1DRZK0001,IRO1DZAH0001,IRO1DSIN0001,IRO1DDPK0001,IRO1ABDI0001,IRO1DFRB0001,IRO1FTIR0001,IRO1DKSR0001,IRO1EXIR0001,IRO1DLGM0001,IRO7PDHP0001,IRO1IRDR0001,IRO3ZOBZ0001,IRO1INFO0001,IRO1EPRS0001,IRO1TKIN0001,IRO1RKSH0001,IRO3ZF120001,IRO3ZF240001,IRO3PZGZ0001,IRO7ZNJP0001,IRO3Z
AGZ0001,IRO1AZRT0001,IRO1SDAB0001,IRO1SADB0001,IRO1SURO0001,IRO7IRNP0001,IRO7CBBP0001,IRO1SBOJ0001,IRO3SBZZ0001,IRO1SBHN0001,IRO1PRMT0001,IRO1STEH0001,IRO7SASP0001,IRO1SKHS0001,IRO1SKAZ0001,IRO7SEKP0001,IRO3DBRZ0001,IRO1SDST0001,IRO1SDOR0001,IRO1SROD0001,IRR1SSHR0101,IRO1SIMS0001,IRO1SEFH0001,IRO1SSOF0001,IRB3SSHZ9241,IRO1SFRS0001,IRO1SFKZ0001,IRO1FRDO0001,IRO1SFAS0001,IRO1SFNO0001,IRO1SGEN0001,IRO3SKBZ0001,IRO1SKOR0001,IRO1SMAZ0001,IRO3IBKZ0001,IRO7IENP0001,IRO1SSNR0001,IRO1SHZG0001,IRO1SHGN0001,IRO3SYSZ0001,IRO1SEIL0001,IRO1AMLH0001,IRO3PNLZ0001,IRO1BMPS0001,IRO1PPAM0001,IRO3PTRZ0001,IRO1THSH0001,IRO1TOPI0001,IRO1DODE0001,IRO1SHRG0001,IRO1ZNGN0001,IRO1SEPP0001,IRO7STNP0001,IRO1TSAL0001,IRO1SHSI0001,IRO1PESF0001,IRO1PFRB0001,IRO1SHFS0001,IRR1SHFS0101,IRO1PFAN0001,IRO7PKBP0001,IRO1KAFF0001,IRO1NKOL0001,IRO7SHLP0001,IRO1NPRS0001,IRO1VASH0001,IRT1CSNF0001,IRO1KLBR0001,IRO1BENN0001,IRO1LPAK0001,IRO1MINO0001,IRO1CHCH0001,IRO1KDPS0001,IRO1DMOR0001,IRO1SLMN0001,IRO1SPKH0001,IRO1SPPE0001,IRO1SHAD0001,IRO7MINP0001,IRO1GORJ0001,IRO1GCOZ0001,IRO1MRGN0001,IRO1MRAM0001,IRO1RNAB0001,IRO1NOSH0001,IRO1KIVN0001,IRO1MARK0001,IRO1KSIM0001,IRO1ALTK0001,IRO1SAMA0001,IRO1BAHN0001,IRO1BMAS0001,IRO1BIRI0001,IRO1SPTA0001,IRO1JAMD0001,IRO1FAJR0001,IRO1JSHO0001,IRO1FKAS0001,IRO1FRIS0001,IRO1TFKR0001,IRO3ZF100001,IRO3KZIZ0001,IRO1SEPA0001,IRO1LSDD0001,IRO1SORB0001,IRO1SOLI0001,IRO1LAMI0001,IRO7FANP0001,IRO1NGFO0001,IRO1FVAN0001,IRO1FAIR0001,IRO1GPRS0001,IRO1GPSH0001,IRO3CHRZ0001,IRO1GGAZ0001,IRO1GSHI0001,IRO1GHND0001,IRO3GHSZ0001,IRO1GESF0001,IRO1GMRO0001,IRO1GNJN0001,IRO1ABGN0001,IRO3ZF200001,IRT3SSKF0001,IRO7PKZP0001,IRO1KESF0001,IRO1BAMA0001,IRO1ITAL0001,IRO1IRGC0001,IRO1KPRS0001,IRO1CHML0001,IRO1CHIR0001,IRO1KHFZ0001,IRO1DMVN0001,IRO1TSRZ0001,IRO1ROOI0001,IRO1SINA0001,IRR1SINA0101,IRO1ARDK0001,IRO1PSER0001,IRO1KSAD0001,IRO3KSGZ0001,IRO1TBAS0001,IRO1SHQZ0001,IRO1ALVN0001,IRO1NILO0001,IRO1BHSM0001,IRO1SHMD0001,IRO1VARZ0001,IRO3KBCZ0001,IRO1ASAL0001,IRO1AZMA0001,IRO1PELC0001,IRO1PYAM0001,IRO
1JJNM0001,IRO1LKPS0001,IRO1SRMA0001,IRO1KMOA0001,IRO1IAGM0001,IRO3KARZ0001,IRO1BMEL0001,IRO7PMMP0001,IRO3ZF220001,IRR3KHMZ0101,IRO3MIHZ0001,IRO1BROJ0001,IRO1PTOS0001,IRO3ZF060001,IRO1MRIN0001,IRO7BNOP0001,IRO7NIRP0001,IRO3ZF080001,IRO1HMRZ0001,IRO7VHYP0001,IRR1ALBZ0101,IRO1OIMC0001,IRO1TAZB0001,IRO7KARP0001,IRO1BIME0001,IRO1BPAR0001,IRO1DARO0001,IRO1TGOS0001,IRO1TOKA0001,IRO3BKHZ0001,IRO3BMDZ0001,IRO3ZMNZ0001,IRO1SSAP0001,IRO1SDID0001,IRO1SKBV0001,IRR1SKBV0101,IRO7SNAP0001,IRO7SHOP0001,IRO1GBEH0001,IRO7TKDP0001,IRO1KRAF0001,IRO7KOSP0001,IRO3IRNZ0001,IRO7BVMP0001,IRO1MELT0001,IRO1GMEL0001,IRO1SNRO0001,IRO1NIKI0001,IRR3ETLZ0101,IRR1BARZ0101,IRO1KVRZ0001,IRR1TRIR0101,IRR3ZNDZ0101,IRR3ZNDZ0101,IRR1KSKA0101,IRR1DALZ0101,IRO3BLKZ0001,IRR1TKIN0101,IRO1KHOC0001,IRO1CIDC0001,IRR1PNTB0101,IRR1PRDZ0101,IRR3PTRZ0101,IRO3LIAZ0001,IRR1GNJN0101,IRO3TBSZ0001,IRR1PKER0101,IRR1SGAZ0101,IRO1BVMA0001,IRO1MOBN0001,IRR1BMEL0101,IRO3FOHZ0001,IRR1BPAS0101,IRO3BHLZ0001,,%22}';
$ch=curl_init();
$timeout=15;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// Get URL content
$lines_string=curl_exec($ch);
curl_close($ch);
$now = microtime(true);
echo sprintf("Elapsed: %f", $now-$then);
echo $lines_string;
echo "\r\n";
?>
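The idea described in the question (fire several identical requests at once and keep the first response that is not an error) can be sketched with PHP's curl_multi_* functions. This is only a sketch under assumptions: fetch_first_good() and looks_like_error() are hypothetical names, and the "shorter than 100 bytes means error" rule is a guess that should be replaced with a check matching the real error message.

```php
<?php
// Hypothetical check: treats a missing or very short body as the "short error message".
// The 100-byte threshold is an assumption; adapt it to the real error format.
function looks_like_error($body, $minLength = 100)
{
    return !is_string($body) || strlen($body) < $minLength;
}

// Sends $copies identical requests in parallel and returns the first body that
// passes looks_like_error(), or null if every request failed.
function fetch_first_good($url, $copies = 3, $timeout = 15)
{
    $mh = curl_multi_init();
    $handles = array();
    for ($i = 0; $i < $copies; $i++) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    $result = null;
    do {
        $status = curl_multi_exec($mh, $running);
        // Look at every transfer that has finished so far.
        while ($result === null && ($done = curl_multi_info_read($mh)) !== false) {
            if ($done['result'] === CURLE_OK) {
                $body = curl_multi_getcontent($done['handle']);
                if (!looks_like_error($body)) {
                    $result = $body; // first usable response wins
                }
            }
        }
        if ($result === null && $running > 0) {
            // Sleep until there is activity on a connection (or 1 second passes).
            if (curl_multi_select($mh, 1.0) === -1) {
                usleep(100000);
            }
        }
    } while ($result === null && $running > 0 && $status === CURLM_OK);

    // Removing the handles aborts any transfers that are still in flight.
    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $result;
}

// Usage, with $url built as in the question:
// $lines_string = fetch_first_good($url, 3, 15);
```

Note that this does not make a single request faster; it only avoids waiting for a sequential retry, at the cost of sending three times the load to the remote server.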

Related

Display Curl Response Sequential one by one

I have a simple cURL POST request script:
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_POST, 1);
$result = curl_exec($ch);
The problem I am facing is that when I process a large quantity of data, for example 60 requests, whether the POST data comes from the database or from a form, the output is not displayed until all 60 requests have completed.
I want the output to be displayed one response at a time as each request finishes, not all at once at the end, e.g.
Echo 1st Response
Echo 2nd Response
Echo 3rd Response
What you are asking for does not fit how a single request normally works.
By default, PHP and the web server typically buffer the output: the server assembles the reply first and only then transfers it, so the result of each echo is not sent as it happens.
If you want to assemble the document piece by piece, use AJAX or any other request/reply mechanism to fill in each div or field separately.
An example would be:
"Base document" -> [N] divs or fields. Each field requests its own processing.
When the base document receives the data for a field, it places it in the related div.
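The server side of that per-field approach could look roughly like the sketch below: a small endpoint that handles exactly one item per request, so the base document can ask for items 0, 1, 2, ... separately and fill in each div as its reply arrives. The names post_one(), build_reply(), process.php and the item parameter are all hypothetical, not part of the original script.

```php
<?php
// Sends one POST request and returns the response body, or null on failure.
function post_one($url, $body)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    $result = curl_exec($ch);
    return $result === false ? null : $result;
}

// Builds the JSON reply for a single item. $sender is passed in so that the
// network call can be swapped out (for example when testing).
function build_reply(array $items, $index, $url, $sender)
{
    if (!isset($items[$index])) {
        return json_encode(array('ok' => false, 'error' => 'unknown item'));
    }
    $result = call_user_func($sender, $url, $items[$index]);
    return json_encode(array(
        'ok'     => $result !== null,
        'item'   => $index,
        'result' => $result,
    ));
}

// In a hypothetical process.php, with $bodies loaded from the db or the form:
// header('Content-Type: application/json');
// echo build_reply($bodies, (int) $_GET['item'], $url, 'post_one');
```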

How to generate XML from multiple cURL calls (with PHP)?

Hi guys.
I'm having serious trouble trying to solve this.
The scenario:
Here at work we use the Vulnerability Management tool QualysGuard.
Skipping all technical details, this tool basically detects vulnerabilities in all servers and for each vulnerability in each server it creates a Ticket Number.
From the UI I can access all these tickets and download a CSV file with all of them.
The other way of doing it is by using the API.
The API uses some cURL calls to access the database and retrieve the info that I specify in the parameters.
The method:
I'm using a script like this to get the data:
<?php
$username="myUserName";
$password="myPassword";
$proxy= "myProxy";
$proxyauth = 'myProxyUser:myProxyPassword';
$url="https://qualysapi.qualys.com/msp/ticket_list.php?"; //This is the official script, provided by Qualys, for doing this task.
$postdata = "show_vuln_details=0&SINCE_TICKET_NUMBER=1&CURRENT_STATE=Open&ASSET_GROUPS=All";
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyauth);
curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt ($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt ($ch, CURLOPT_POST, 1);
$result = curl_exec ($ch);
$xml = simplexml_load_string($result);
?>
The script above works fine. It connects to the API, passes some parameters to it, and ticket_list.php returns XML with everything I need.
The Problems:
1-) This script only allows a limit of 1000 results in the XML file it returns.
If my request has generated more than 1000 results, the script creates a TAG like this, at the end of the XML:
<TRUNCATION last="5066">Truncated after 1000 records</TRUNCATION>
In this case, I would need to execute another cURL call, with the parameters below:
$postdata = "show_vuln_details=0&SINCE_TICKET_NUMBER=5066&CURRENT_STATE=Open&ASSET_GROUPS=All";
2-) There are approximately 300,000 tickets in Qualys' database (cloud), and I need to download all of them and insert them into MY database, which is used by an application that I'm creating. This application has some forms, which are filled in by the user, and a bunch of queries are run against the database.
The question:
What would be the best way for me to do the task above?
I've got some ideas, but I'm at a complete loss.
I thought:
1-) Create a function that does the call above, parses the XML and, if the tag TRUNCATION exists, gets its value and calls itself again, recursing until a result without the TRUNCATION tag comes back.
The problem with this one is that I wasn't able to merge the XML results of each call, and I'm not sure whether it would cause memory issues, since it would need nearly 300 cURL calls. This script would be run automatically from the server's crontab outside business hours.
2-) Instead of retrieving all the data, I make the forms that I've mentioned post the data to the script and make the cURL calls with the parameters that the user POSTed. But again I'm not sure if that would be good, since I would still need to do multiple calls, depending on the parameters that the user sends.
3-) This is a crazy one: Use some sort of Macro software to record me while I log in the UI, go to the page where the tickets are located, click the download button, check the CSV option and click to download again. Then, export this script to some language like python or java, create a task in the cronTab and create a script that parses the CSV downloaded and inserts the data to the database. (Crazy or not? =P )
Any help is very welcome. Maybe the answer is right before my eyes and I haven't seen it yet.
Thanks in advance!
I believe the proper way would involve a queue worker. However, if I were you, I'd make the script grab 5 of these XML files per execution: grab one, insert the rows, free it from memory, repeat. Then I'd test it by running it a few times manually to see what execution time and memory it requires. Once you have a good idea of the execution time and can see that memory will not be a problem, schedule a cron job at an interval a little under double that time. If all goes well, there should be about a minute between runs and you can have everything in your DB within an hour.
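The "grab one, insert rows, free, repeat" loop can be written iteratively instead of recursively, which also avoids having to merge the XML results. The sketch below assumes only what the question states (a TRUNCATION tag whose last attribute is the next SINCE_TICKET_NUMBER); next_since() and fetch_all_tickets() are hypothetical names, $fetchPage stands for a wrapper around the cURL call from the question, and $storePage for your own database insert.

```php
<?php
// Returns the value to use as the next SINCE_TICKET_NUMBER, or null if the
// page was not truncated (no TRUNCATION tag present).
function next_since(SimpleXMLElement $xml)
{
    $nodes = $xml->xpath('//TRUNCATION');
    if (!$nodes) {
        return null;
    }
    $last = (string) $nodes[0]['last'];
    return $last === '' ? null : (int) $last;
}

// Fetches up to $maxPages pages, handing each one to $storePage before the
// next is requested, so only one page is held in memory at a time.
// Returns null when everything was fetched, otherwise the ticket number to
// resume from on the next run.
function fetch_all_tickets($fetchPage, $storePage, $since = 1, $maxPages = 5)
{
    $pages = 0;
    while ($since !== null && $pages < $maxPages) {
        $xml = simplexml_load_string(call_user_func($fetchPage, $since));
        if ($xml === false) {
            break; // unparsable reply: stop and resume from $since later
        }
        call_user_func($storePage, $xml); // e.g. insert the rows into the database
        $pages++;
        $since = next_since($xml);
        unset($xml); // release the page before fetching the next one
    }
    return $since;
}
```

The value returned by fetch_all_tickets() would need to be stored somewhere (a file or a table row) so that the next cron run can pass it back in as $since.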

Does curl_exec() retry if timeout?

In my application I have to make a POST call to a webservice. They send me an XML response, basically saying "Accepted" or "Refused".
Last week I had an issue with one of these calls: I received a "Refused" response while their backend was telling me this request had been accepted.
I asked them what happened and they told me they received 2 requests (with the same ID - a parameter I send to them). First one was "Refused", second one was "Accepted".
I investigated: in my code, if I receive a "Refused" response, I log it, I update my database, and that's it. I do not try again.
The only remaining suspect would be the PHP cURL functions.
The day the problem occurred, the webservice took an unusually long time to respond (20 seconds).
Could cURL have made several calls? There is no retry option in the PHP function (or I didn't find it), but I'd rather ask here to be sure.
Here is my curl code.
$ch = curl_init();
$myArrayWithDatas = array( '...' );
$httpQueryFields = http_build_query($myArrayWithDatas);
$url = "https://www.webservice.com/api";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $httpQueryFields);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if (empty($response)) {
// I log an error
// (no trace of that in my logs)
} else {
// I log the XML response
// (one "Refused" response logged)
}
curl_close($ch);
Is there any case where this code could send 2 or more requests to the $url?
curl_exec() will only make one call.
Are you running your code via a cron job or scheduled task? If so, maybe your code was launched twice, which would explain why two calls were made.
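If overlapping runs turn out to be the cause, one common guard is an exclusive lock file taken at the start of the script. This is only a sketch: acquire_run_lock() and release_run_lock() are hypothetical helper names, and the lock path in the usage comment is an assumption.

```php
<?php
// Tries to take an exclusive, non-blocking lock on $path. Returns the open
// file handle on success, or null if another run already holds the lock.
function acquire_run_lock($path)
{
    $fp = fopen($path, 'c'); // create the file if missing, never truncate it
    if ($fp === false) {
        return null;
    }
    if (!flock($fp, LOCK_EX | LOCK_NB)) {
        fclose($fp);
        return null;
    }
    return $fp; // keep this handle open for as long as the run lasts
}

// Releases the lock taken by acquire_run_lock().
function release_run_lock($fp)
{
    flock($fp, LOCK_UN);
    fclose($fp);
}

// Usage at the top of the scheduled script:
// $lock = acquire_run_lock('/tmp/webservice-call.lock');
// if ($lock === null) {
//     exit; // a previous run is still in progress
// }
// ... make the cURL call ...
// release_run_lock($lock);
```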

Increase CURL speed php

I am using an API provided by flipkart.com that allows me to search and get the results returned as JSON.
The code I am using is:
$time_start = microtime(true); // start the timer used for the "Process Time" output below
$snapword = $_GET['p'];
$snapword = str_replace(' ','+',$snapword);
$headers = array(
'Fk-Affiliate-Id: myaffid',
'Fk-Affiliate-Token: c0f74c4esometokesndad68f50666'
);
$pattern = "#\(.*?\)#";
$snapword = preg_replace($pattern,'',$snapword);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://affiliate-api.flipkart.net/affiliate/search/json?query='.$snapword.'&resultCount=5');
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_USERAGENT,'php');
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$snapdeal = curl_exec($ch);
curl_close($ch);
$time_end = microtime(true);
$time = $time_end - $time_start;
echo "Process Time: {$time}";
The time it takes is: Process Time: 5.3794288635254
That is far too long. Any ideas on how to reduce it?
Use curl_getinfo() to retrieve more accurate information. It also shows how much time was spent resolving DNS, connecting, and so on.
You can see the exact time taken for each step with the following keys:
CURLINFO_TOTAL_TIME - Total transaction time in seconds for last transfer
CURLINFO_NAMELOOKUP_TIME - Time in seconds until name resolving was complete
CURLINFO_CONNECT_TIME - Time in seconds it took to establish the connection
CURLINFO_PRETRANSFER_TIME - Time in seconds from start until just before file transfer begins
CURLINFO_STARTTRANSFER_TIME - Time in seconds until the first byte is about to be transferred
CURLINFO_SPEED_DOWNLOAD - Average download speed
CURLINFO_SPEED_UPLOAD - Average upload speed
$info = curl_getinfo($ch); // call this before curl_close($ch)
echo $info['connect_time']; // same value as CURLINFO_CONNECT_TIME; array keys are lowercase, without the CURLINFO_ prefix
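Because these timers are cumulative (each is measured from the start of the transfer), subtracting neighbouring values shows where the 5 seconds actually go. A small sketch, where timing_breakdown() is a hypothetical helper and the phase labels are my own:

```php
<?php
// Splits cURL's cumulative timers into per-phase durations, in seconds.
// $info is the array returned by curl_getinfo($ch).
function timing_breakdown(array $info)
{
    return array(
        // Time spent resolving the host name.
        'dns'      => $info['namelookup_time'],
        // Time spent opening the TCP connection.
        'connect'  => $info['connect_time'] - $info['namelookup_time'],
        // Protocol setup after connecting (includes the TLS handshake for https).
        'setup'    => $info['pretransfer_time'] - $info['connect_time'],
        // Time spent waiting for the server to produce the first byte.
        'wait'     => $info['starttransfer_time'] - $info['pretransfer_time'],
        // Time spent downloading the response body.
        'download' => $info['total_time'] - $info['starttransfer_time'],
    );
}

// Usage, before curl_close($ch):
// print_r(timing_breakdown(curl_getinfo($ch)));
```

A large 'wait' value points at the API itself, while a large 'dns' or 'connect' value points at name resolution or the network path.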
Most probably, the API is slow.
You could try switching to a faster DNS server (on Linux: /etc/resolv.conf).
Other than that, there is not much you can do.
I would check your server's connection speed from a terminal/console window, since it greatly affects the time it takes to access a resource on the web. Also consider the response time of the resource itself, as the remote page needs to gather the requested information and send it back.
I would also consider fetching and saving as much of the information you need as possible with a cron job late at night, so that you don't have to handle this upfront.

PHP cURL crawler doesn't fetch all data

I'm trying to write my first crawler by using PHP with cURL library. My aim is to fetch data from one site systematically, which means that the code doesn't follow all hyperlinks on the given site but only specific links.
The logic of my code is to go to the main page, get the links for several categories and store them in an array. Once that is done, the crawler goes to those category pages and checks whether the category has more than one page. If so, it stores the subpages in another array. Finally, I merge the arrays to get all the links for the pages that need to be crawled and start fetching the required data.
I call the below function to start a cURL session and fetch data to a variable, which I pass to a DOM object later and parse it with Xpath. I store cURL total_time and http_code in a log file.
The problem is that the crawler runs for 5-6 minutes then stops and doesn't fetch all required links for sub-pages. I print content of arrays to check result. I can't see any http error in my log, all sites give a http 200 status code. I can't see any PHP related error even if I turn on PHP debug on my localhost.
I assume that the site blocks my crawler after a few minutes because of too many requests, but I'm not sure. Is there any way to get more detailed debug output? Do you think PHP is adequate for this type of activity? I want to use the same mechanism to fetch content from more than 100 other sites later on.
My cURL code is as follows:
function get_url($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
$info = curl_getinfo($ch);
$logfile = fopen("crawler.log","a");
echo fwrite($logfile,'Page ' . $info['url'] . ' fetched in ' . $info['total_time'] . ' seconds. Http status code: ' . $info['http_code'] . "\n");
fclose($logfile);
curl_close($ch);
return $data;
}
// Start crawling the main page.
$site2crawl = 'http://www.site.com/';
$dom = new DOMDocument();
@$dom->loadHTML(get_url($site2crawl)); // @ suppresses warnings about malformed HTML
$xpath = new DomXpath($dom);
Use set_time_limit() to extend the amount of time your script can run for. That is why you are getting "Fatal error: Maximum execution time of 30 seconds exceeded" in your error log.
Do you need to run this on a server? If not, you should try the CLI version of PHP, which is exempt from common restrictions such as the execution time limit.
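On the question of more detailed debugging: the get_url() function above only logs the HTTP status, so a transfer that fails at the cURL level (timeout, reset connection) leaves no trace. A possible variant is sketched below; describe_fetch() and get_url_logged() are hypothetical names, and the 60-second overall timeout is an assumption.

```php
<?php
// Remove the execution time limit for this run (0 means no limit).
set_time_limit(0);

// Builds one log line from what cURL reported for a finished transfer.
function describe_fetch($data, array $info, $errno, $error)
{
    if ($data === false) {
        return sprintf("FAILED %s: cURL error %d (%s)\n", $info['url'], $errno, $error);
    }
    return sprintf(
        "Page %s fetched in %.3f seconds, %d bytes. Http status code: %d\n",
        $info['url'],
        $info['total_time'],
        strlen($data),
        $info['http_code']
    );
}

// Variant of get_url() that also logs cURL-level failures and the body size.
function get_url_logged($url, $logPath = 'crawler.log')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60); // assumed overall limit per page
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    $line = describe_fetch($data, curl_getinfo($ch), curl_errno($ch), curl_error($ch));
    file_put_contents($logPath, $line, FILE_APPEND);
    return $data; // false on failure, so callers can skip or retry the page
}
```

Logging the byte count makes it easier to spot pages that return status 200 with an unexpectedly small body, which is one way a site might respond when it starts throttling a crawler.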
