I wrote PHP code that hits a URL through each proxy in a list. The hit is generated successfully and I get HTML output, but this HTML does not display properly in the browser. I want the exact HTML returned through the proxy. Does anybody know how to do this? Please give me some idea. Here is the code I am using:
<?php
$curl    = curl_init();
$timeout = 30;
$proxies = file("proxy.txt");
$r = "https://www.abcdefgth.com";
for ($x = 0; $x < 2000; $x++) {
    // reset the execution time limit for each request
    set_time_limit(30);
    echo $proxies[$x];
    curl_setopt($curl, CURLOPT_URL, $r);
    // strip whitespace from the proxy entry before using it
    curl_setopt($curl, CURLOPT_PROXY, preg_replace('/\s+/', '', $proxies[$x]));
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5");
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_REFERER, "http://google.com/");
    $text = curl_exec($curl);
    echo "Hit Generated:";
}
?>
A quick look into the documentation of the function you use would have answered your question:
http://php.net/manual/en/function.curl-exec.php clearly states, right in the "Return Values" section, that the function returns a boolean unless you specify the CURLOPT_RETURNTRANSFER option, which you did not do in your code.
So try adding
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
followed by an attempt to actually output the result you receive in $text, which you also forgot.
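For reference, here is a minimal sketch of the corrected loop, keeping your variable names (the proxy file and target URL are your placeholders):
<?php
$curl    = curl_init();
$timeout = 30;
// trim newlines and skip blanks when reading the proxy list
$proxies = file("proxy.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$r = "https://www.abcdefgth.com";
foreach ($proxies as $proxy) {
    set_time_limit(30);
    curl_setopt($curl, CURLOPT_URL, $r);
    curl_setopt($curl, CURLOPT_PROXY, trim($proxy));
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    // return the response as a string instead of a boolean
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $text = curl_exec($curl);
    // actually output the HTML that came back
    echo $text;
}
curl_close($curl);
?>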
I need your help: can anyone explain why my code doesn't find the privacy <a> tag on the site zoho.com?
My code finds the "privacy" link fine on other sites, but not on zoho.com.
I use the Symfony DomCrawler component: https://symfony.com/doc/current/components/dom_crawler.html
// Imprint Check //
use Symfony\Component\DomCrawler\Crawler;

function findPrivacy($domain) {
    $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
    $curl = curl_init($domain);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($curl, CURLOPT_USERAGENT, $ua);
    $data = curl_exec($curl);

    $crawler = new Crawler($data);
    $nodeValues = $crawler->filter('a')->each(function ($node) {
        // cast to string: attr() returns null when the attribute is missing
        $href = (string) $node->attr('href');
        return str_contains($href, 'privacy-policy') || str_contains($href, 'privacy');
    });
    return $nodeValues;
}
If you look at the source code of zoho.com, you will see the footer is empty. But on the rendered site, the footer isn't empty if you scroll down.
How can I find this privacy link?
Your script cannot find what is not there. If you load the zoho.com page in a browser and look at the source code, you will notice that the word "privacy" is not even present. It is likely that the footer containing the link to the privacy policy is loaded asynchronously, which plain PHP cannot handle.
EDIT: by "asynchronously loaded" I mean using something like AJAX, which runs client-side. Since PHP runs server-side only, it cannot execute the JavaScript required to load the footer containing the link to the privacy policy.
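If you really need JavaScript-injected content, one option is to render the page in a headless browser first. A minimal sketch using symfony/panther (an assumption on my part, it is not part of your current setup; it needs composer require symfony/panther plus a local chromedriver):
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Panther\Client;

// Launch headless Chrome so client-side JavaScript actually runs.
$client  = Client::createChromeClient();
$crawler = $client->request('GET', 'https://www.zoho.com');

// The crawler now sees the fully rendered DOM, footer included.
$found = $crawler->filter('a')->each(function ($node) {
    return str_contains((string) $node->attr('href'), 'privacy');
});

var_dump(in_array(true, $found, true)); // true if any privacy link exists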
What I'm trying to do is run a search on Amazon with a random keyword and scrape maybe the first 10 results. The issue is that when I print the HTML result I get nothing, it's just blank. My code looks OK to me, and I have used cURL in the past and never come across this. My code:
<?php
include_once("classes/simple_html_dom.php");

function get_random_keyword() {
    $f_contents = file("keywords.txt");
    return $f_contents[rand(0, count($f_contents) - 1)];
}

function getHtml($page) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $page);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
    $html = curl_exec($ch);
    print "html -> " . $html;
    curl_close($ch);
    return $html;
}

$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
?>
Ideally I would have preferred to use the API, but from what I understand you need 3 sales before you are granted access. Can anyone see any issues? I'm not sure what else to check; any help is appreciated.
Amazon is returning the response encoded in gzip. You need to decode it:
$html = getHtml("https://www.amazon.co.uk/s?k=" . get_random_keyword());
echo gzdecode($html);
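Alternatively, you can let cURL handle the decompression itself: setting CURLOPT_ENCODING to an empty string advertises every encoding your cURL build supports and transparently decodes the response, so no gzdecode() call is needed. A sketch of the extra line inside getHtml():
// Ask for (and transparently decode) gzip/deflate responses.
curl_setopt($ch, CURLOPT_ENCODING, '');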
I have a PHP script which downloads all of the content at a URL via cURL:
<?php
function get_curl_output($link)
{
    $channel = curl_init();
    curl_setopt($channel, CURLOPT_URL, $link);
    curl_setopt($channel, CURLOPT_HEADER, 0);
    curl_setopt($channel, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($channel, CURLOPT_CONNECTTIMEOUT, 10000);
    curl_setopt($channel, CURLOPT_TIMEOUT, 10000);
    curl_setopt($channel, CURLOPT_VERBOSE, true);
    curl_setopt($channel, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219');
    curl_setopt($channel, CURLOPT_FOLLOWLOCATION, true);
    $output = curl_exec($channel);
    if (curl_errno($channel)) {
        file_put_contents('curlerror.txt', curl_error($channel) . PHP_EOL, FILE_APPEND);
    }
    curl_close($channel);
    return $output;
}
?>
<?php
function downloader($given_url) {
    // Download the content of every URL in $given_url.
    // If $given_url is an array with 10 URLs inside it, download all of their content.
    foreach ($given_url as $link) {
        $content_stored_here = get_curl_output($link);
        // ...and put that content into a file
    }
}
?>
Now everything goes fine as long as there is no connection loss or IP change. However, my connection randomly gets a new IP address after a few hours, as I don't have a static IP address.
I use mod_php in Apache with the WinNT MPM threaded worker.
Once I get the new IP address, my code stops working but throws no errors.
EDIT: I wrote the same program in C++ (changing some function names and tweaking compiler and linker settings); the C++ version also stops in the middle of the program once I get a new IP address or lose the connection.
Any insights on this?
You don't need such huge timeouts; when there is no connection, cURL keeps trying to connect for up to 10000 seconds according to your code. Set it to something more reasonable, just 10, for example.
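A sketch of more sensible values (10 seconds to establish the connection, 60 for the whole transfer; tune both to your needs):
// Give up connecting after 10 s; abort the entire transfer after 60 s.
curl_setopt($channel, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($channel, CURLOPT_TIMEOUT, 60);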
This was working up until quite recently and I cannot seem to crack the case.
If you manually visit the URL that the script hits, the results are there, but if I do it in code, I have an issue.
You can see in my output test that I am no longer getting any output.
Any ideas?
<?php
//$ticker = urldecode($_GET["ticker"]);
$ticker = 'HYG~FBT~';
echo $ticker;
$tickerArray = preg_split("/\~/", $ticker);

// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.batstrading.com/market_data/symbol_data/csv/");
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
echo "<br><br>OUTPUT TEST: " . ($output);

// split the CSV into lines
$splitOutput = preg_split('/[\r\n]+/', $output);
for ($ii = 0; $ii < sizeof($tickerArray); $ii++) {
    $i = 0;
    $matchSplit[$ii] = -1;
    while ($i < sizeof($splitOutput) && $matchSplit[$ii] == -1) {
        $splitOutput2 = preg_split("/\,/", $splitOutput[$i]);
        if ($i > 0) {
            // first CSV field is the symbol name; skip the header row
            if (strcasecmp($splitOutput2[0], strtoupper($tickerArray[$ii])) == 0) {
                $matchSplit[$ii] = $splitOutput[$i] . "#";
            }
        }
        $i++;
    }
    if ($matchSplit[$ii] == -1) {
        echo "notFound#";
    } else {
        echo $matchSplit[$ii];
    }
}
//echo ($output);
curl_close($ch);
?>
I added a user agent to your script and it seems to work fine here:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.batstrading.com/market_data/symbol_data/csv/");
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15); //time out of 15 seconds
$output = curl_exec($ch);
curl_close($ch);
// Then your output parsing code
The output I get:
HYG~FBT~
OUTPUT TEST: Name,Volume,Ask Size,Ask Price,Bid Size,Bid Price,Last Price,Shares Matched,Shares Routed
SPY,35641091,0,0.0,0,0.0,134.38,34256509,1384582
BAC,22100508,0,0.0,0,0.0,7.78,20407265,1693243
QQQ,12085707,0,0.0,0,0.0,62.65,11703725,381982
XLF,11642347,0,0.0,0,0.0,14.47,11429581,212766
VXX,9838310,0,0.0,0,0.0,28.2,9525266,313044
EEM,9711498,0,0.0,0,0.0,43.28,9240820,470678
IWM,8272528,0,0.0,0,0.0,81.19,7930349,342179
AAPL,6145951,0,0.0,0,0.0,498.24,4792854,1353097
It is also good practice to close the cURL handle once you are done. I believe that might also play a part in your issues.
If you are still getting issues, check to see if the server the script runs on can access that site.
Update
Upon further investigation, here's what I believe is the root of the problem.
The problem lies with the provider of the CSV file. Perhaps due to issues on their end, the CSV is generated but contains only the headers. There were instances where there was indeed data in it.
The data is only available during set hours of the day.
In any case, fetching the empty file means the parser prints "notFound#", leading us to assume there was an issue with cURL.
So my suggestion is to add a further check to the script to verify that the CSV file actually contains data, and is not a file containing just the headings.
Finally, setting a timeout for cURL should help, as the CSV takes a while to be generated by the provider.
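A minimal sketch of such a check, assuming the first line of the CSV is always the header row:
// Treat anything beyond the header row as data; bail out otherwise.
$lines = preg_split('/[\r\n]+/', trim($output));
if (count($lines) < 2) {
    die("CSV contained only headers (or nothing) - try again later.");
}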
The problem is, I get some parts of the content but not the user reviews. With Firebug I can see the content, but when I check the source code there is no such content inside the HTML tags / no matching HTML tags. Here is my code:
<?php
// Headers
include('simple_html_dom.php');

function getPage($page, $redirect = 0, $cookie_file = '')
{
    $ch = curl_init();
    $headers = array("Content-type: application/json");
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    if ($redirect) {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    }
    curl_setopt($ch, CURLOPT_URL, $page);
    if ($cookie_file != '') {
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
    }
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6');
    $return = curl_exec($ch);
    curl_close($ch);
    return $return;
} // EO Fn

// Source
$url = 'http://www.vitals.com/doctor/profile/1982660171/reviews/1982660171';

// Parsing ...
$contents = getPage($url, 1, 'cookies.txt');
$html = str_get_html($contents);

// Output
echo $html->outertext;
?>
Can anyone please help me? What should I do to get the whole page so that I can grab the reviews?
They're just stored as JSON in a <script> block towards the top of the page. Parse it out with a regex or Simple HTML DOM and run it through json_decode().
var json = {"provider":{"id":"1982660171","display_name":"Stephen R Guy, MD","last_name":"Guy","first_name":"Stephen","middle_name":"Russell","master_name":"Stephen_Guy","degree_types":"MD","familiar_name":"Stephen","years_experience":"27","birth_year":"1956","birth_month":"5","birth_day":"23","gender":"M","is_limited":"false","url_deep":"http:\/\/www.vitals.com\/doctor\/profile\/1982660171\/Stephen_Guy","url_public":"http:\/\/www.vitals.com\/doctors\/Dr_Stephen_Guy.html","status_code":"A","client_ids":"1","quality_indicator_set":[{"type":"quality-indicator\/consumer-feedback","count":"2","suboverall_set":[{"name_short":"Promptness","overall":"3"},{"name_short":"Courteous Staff","overall":"4"},{"name_short":"Bedside Manner","overall":"4"},{"name_short":"Spends Time with Me","overall":"4"},{"name_short":"Follow Up","overall":"4"}],"name":"Consumer Reviews","overall":"4.0","measure_set":[{"feedback_response_id":"1756185","input_source_ids":"{0}","date":"1301544000","value":"4","scale":{"best":"1","worst":"4"},"review":{"type":"review\/consumer","comment":"I will never birth with another dr. Granted that's not saying much as I don't like dr's but I actually find him as valuable as the midwives who I adore. I liked Horlacher but when Kitty left I followed the midwives and then followed again....Dr. Guy is GREAT. I honestly don't know who I'd rather support me at my birth; Margie and Lisa or Dr. Guy. ....I wonder if I can just get all of them.Guy's great. Know what you want. Tell him. Be strong and he'll support you.I give him 10 stars. Oh...my baby's 3 years old now. He's GREAT! ","date":"1301544000"},"sub_measure":[{"name":"Waiting time during a visit","name_short":"Promptness","value":"3","scale":{"best":"4","worst":"1"}},{"name":"Courtesy and professionalism of office staff ","name_short":"Courteous Staff","value":"4","scale":{"best":"4","worst":"1"}},{"name":"Bedside manner (caring)","name_short":"Bedside Manner","value":"4","scale":{"best":"4","worst":"1"}},{"name":"Spending enough time with me","name_short":"Spends Time with Me","value":"4","scale":{"best":"4","worst":"1"}},{"name":"Following up as needed after my visit","name_short":"Follow Up","value":"4","scale":{"best":"4","worst":"1"}}]},{"feedback_response_id":"420734","input_source_ids":"{76}","link":"http:\/\/local.yahoo.com\/info-15826842-guy-stephen-r-md-university-women-s-health-center-dayton","date":"1142398800","value":"4","scale":{"best":"1","worst":"4"},"review":{"type":"review\/consumer","comment":"Excellent Doctor: I really like going to this office. They are truely down to earth people and talk my \"non-medical\" language. I have been using thier office since 1997 and they have seen me through 2 premature pregnancies!","date":"1142398800"}}],"wait_time":"50"}]}};
But again, make sure you have permissions to do this...
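A hedged sketch of that extraction (the "var json = ...;" pattern is taken from the snippet above; adjust the regex if the page layout changes):
<?php
// Pull the inline "var json = {...};" assignment out of the raw HTML
// and decode it. $contents is the page HTML fetched by getPage().
if (preg_match('/var\s+json\s*=\s*(\{.*?\});/s', $contents, $m)) {
    $data = json_decode($m[1], true);
    // Walk the decoded structure for the consumer review comments.
    foreach ($data['provider']['quality_indicator_set'] as $indicator) {
        foreach ($indicator['measure_set'] ?? [] as $measure) {
            if (isset($measure['review']['comment'])) {
                echo $measure['review']['comment'], PHP_EOL;
            }
        }
    }
}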