How can I scrape a site using an iPad User-Agent?
I have this code below using cURL in PHP, which outputs the page source, but I still can't find the tags. On an iPad, or in Safari using an iPad User-Agent, the tags display when the site is loaded.
Thanks!
<?php
$useragent= "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10')";
$ch = curl_init ("http://www.cbsnews.com/video/watch/?id=7370279n&tag=mg;mostpopvideo");
curl_setopt ($ch, CURLOPT_USERAGENT, $useragent); // set user agent
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, true);
// curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
echo $output = curl_exec ($ch);
curl_close($ch);
?>
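One thing worth checking first: cURL only returns the initial HTML the server sends, so if the tags are injected client-side by JavaScript, no User-Agent string will make them appear in the output. A minimal sketch of that check (the '<video' needle is an assumption; substitute whatever tag you are actually looking for):

$output = curl_exec($ch);
// If the tag is absent from the raw HTML, it is being added by JavaScript
// after page load, and cURL alone cannot retrieve it.
if (strpos($output, '<video') === false) {
    echo "Tag not present in raw HTML\n";
}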
Try using curl from the command line, driven by a Perl script such as this:
my $ua = "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10";
my $curl = "curl -A '$ua'";
my $server = "http://www.cbsnews.com";
my $startpage = "$server/video/watch/?id=7370279n&tag=mg;mostpopvideo";
my $path = "/path/to/download/to";
open(f, "$curl -L $startpage |") or die "Cannot open website: $!";
while (<f>)
{
if (/<a\s+[^>]*href=\"$server\/([^\"\/])*\"/)
{
my $file = $2;
system("$curl -e $startpage $server/$file > $path/$file");
next;
}
if (/<a\s+[^>]*href=\"$server\/([^\"]+)\/([^\"\/])*\"/)
{
my $folder = $1;
my $file = "$folder/$2";
system("mkdir -p $path/$folder");
system("$curl -e $startpage $server/$file > $path/$file");
next;
}
}
close(f);
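For reference, the flags used above map onto PHP cURL options: -A is CURLOPT_USERAGENT, -L is CURLOPT_FOLLOWLOCATION, and -e is CURLOPT_REFERER. A rough sketch of the per-file download step in PHP, if you would rather stay in one language (it assumes the $useragent from the question and the $server, $startpage, $path, and $file values from the script above):

// Download one linked file with the same User-Agent and Referer as the Perl version.
$ch = curl_init($server . '/' . $file);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);   // -A
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // -L
curl_setopt($ch, CURLOPT_REFERER, $startpage);     // -e
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
file_put_contents($path . '/' . $file, curl_exec($ch));
curl_close($ch);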
I am trying to build a function that renders sitemap links and recursively follows the links of any inner sitemaps. It works well, but not for all links: some links (with the same syntax) fail and respond with errors.
function download_page($path){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $path);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);      // fail on HTTP status >= 400
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the body as a string
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
        'Content-type: application/xml'
    ]);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $retValue = curl_exec($ch);
    curl_close($ch);
    return $retValue;
}
function getAllLinks($sitemapUrl) {
    $links = array();
    $i = 0;
    // $context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36')));
    // $xml = file_get_contents($sitemapUrl, false, $context);
    $sitemap = $this->download_page($sitemapUrl);
    // dd($sitemap);
    // Load the sitemap XML file
    $sitemapXml = new \SimpleXMLElement($sitemap);
    // $sitemapXml = simplexml_load_file($sitemap);
    // $sitemapXml = simplexml_load_string($sitemap);
    // Loop through the <url> and <sitemap> elements
    foreach ($sitemapXml->children() as $child) {
        if ($child->getName() === 'url') {
            $i++;
            $links[$i]['url'] = (string)$child->loc;
            $links[$i]['lastmod'] = (string)$child->lastmod;
        }
        elseif ($child->getName() === 'sitemap') {
            $links = array_merge($links, $this->getAllLinks((string)$child->loc));
        }
    }
    return $links;
}
In the comments you can see the multiple methods I tried.
Example for working link : https://rulepingpong.com/sitemap_index.xml
Example for not working link: https://majesticgaragedoorfl.com/sitemap_index.xml
I am getting the error "String could not be parsed as XML".
I am really lost
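One common cause of "String could not be parsed as XML" is that the server does not actually return XML to a non-browser client: some hosts answer with an HTML error page, or send a gzip-compressed body that SimpleXMLElement cannot read. A sketch of a check worth running before parsing; download_page_debug is a hypothetical variant of download_page above, and the error-page/compression causes are assumptions about the failing host, not confirmed:

function download_page_debug($path) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $path);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_ENCODING, '');   // accept and transparently decode gzip/deflate
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($status !== 200) {
        // Log what actually came back instead of XML
        error_log("Sitemap fetch returned HTTP $status: " . substr((string)$body, 0, 200));
        return false;
    }
    return $body;
}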
I have this code:
<?php
// Pool of browser User-Agent strings; one is picked at random per request.
$ua = array(
    "Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; XH; rv:8.578.498) fr, Gecko/20121021 Camino/8.723+ (Firefox compatible)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
    "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (X11; U; Linux i686; fr-fr) AppleWebKit/525.1+ (KHTML, like Gecko, Safari/525.1+) midori/1.19",
    "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
    "Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
);
$uar = array_rand($ua);   // index of the randomly chosen User-Agent

$url  = "sometestserverisetup";
$ip   = '127.0.0.1';
$port = '9051';                  // Tor control port
$auth = 'mypwwhateveritis';
$command = 'signal NEWNYM';      // ask Tor for a fresh circuit

// Talk to the Tor control port to get a new identity.
$fp = fsockopen($ip, $port, $error_number, $err_string, 10);
if (!$fp) {
    echo "ERROR: $error_number : $err_string";
    return false;
} else {
    fwrite($fp, "AUTHENTICATE \"" . $auth . "\"\n");
    $received = fread($fp, 512);
    fwrite($fp, $command . "\n");
    $received = fread($fp, 512);
}
fclose($fp);

// Fetch the page through the Tor SOCKS proxy with the random User-Agent.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, "127.0.0.1:9050");
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_USERAGENT, $ua[$uar]);
$response = curl_exec($ch);
echo $response;
?>
Everything works fine with my test site, and it displays correctly. However, certain sites (google.com, amazon.com, youtube, facebook) only display a blank page when I echo the response.
Is there some curl_setopt option that needs to be enabled for these pages to display properly?
Looking at a var_dump(curl_getinfo($ch)); after calling curl_exec can be helpful.
I tested your code and found that in some cases the sites send a 302 redirect response with a Location header, which results in an empty body even though the request itself succeeded.
Adding
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
made every site you mentioned return a response in my tests. Depending on what you are doing (searches, logins, form submissions), you will probably find redirects are common, so you need to tell cURL to follow them with that option.
Beyond that, you can set CURLOPT_HEADER to true to inspect the response headers, in addition to using curl_getinfo to make sure the connection was successful (either through Tor or to the site).
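A sketch of that combination, assuming $ch is the handle from the question:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow 3xx redirects
curl_setopt($ch, CURLOPT_HEADER, 1);           // prepend response headers to the output
$response = curl_exec($ch);
$info = curl_getinfo($ch);
// http_code, redirect_count, and url show where the request actually ended up.
echo "HTTP {$info['http_code']} after {$info['redirect_count']} redirect(s), final URL {$info['url']}\n";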
I am using cURL to load a web page and return it on a page on my server. However, when the page is returned, no images show, because they are linked using href="/image.png" etc. Is there a way, using cURL, to prepend the site URL to any link that starts with href="/?
function pull_html($url, $device)
{
    $ch = curl_init();
    if ($device == 'iPhone') {
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25');
    } elseif ($device == 'iPad') {
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25');
    }
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    // Store the result so the handle can be closed before returning.
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
Use a simple str_replace()
return str_replace('href="/', 'href="' . $url . '/', $html);
http://php.net/manual/en/function.str-replace.php
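If the images are referenced with src attributes as well as href, the same call can cover both. A sketch, assuming $url is the bare site origin with no trailing slash or path:

// Rewrite root-relative href and src attributes in one pass.
return str_replace(
    array('href="/', 'src="/'),
    array('href="' . $url . '/', 'src="' . $url . '/'),
    $html
);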
I am trying to download the source code of web pages using cURL in PHP, but it only works for a few pages; for the rest, the downloaded file is empty.
I googled it but I'm not finding a solution.
My source code is:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $strurl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_USERAGENT, 'CURL via PHP');
$out = curl_exec($ch);
$fp = fopen('f1.html', 'w');
fwrite($fp, $out);
fclose($fp);
curl_close($ch);
What options do I need to add? Where am I wrong?
Please help.
Try setting a user-agent that suggests you're a browser. Some servers will block curl/wget/etc.
For example: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.152 Safari/537.22
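In your code that is a one-line change, using the example string above (any current browser User-Agent works):

// Present a browser User-Agent instead of 'CURL via PHP'.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.152 Safari/537.22');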
Thanks for looking at my question.
I want to get the mobile version of a page using either file_get_contents() or cURL. I know it can be done by modifying the HTTP headers in the request. Can you please give me a simple example of how to do so?
Thanks again!
Regards,
Sanket
As an alternative, file_get_contents and stream_context_create can also be used:
$opts = array('http' =>
    array(
        'header' => 'User-agent: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3B48b Safari/419.3',
    )
);
$context = stream_context_create($opts);
$result = file_get_contents($url, false, $context);
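Note that this relies on PHP's HTTP stream wrapper, so the allow_url_fopen setting must be enabled in php.ini for file_get_contents() to fetch remote URLs.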
Is this what you are looking for?
curl -A "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3" http://example.com/your-url
You need to set the user agent string:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3B48b Safari/419.3');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
curl_close($ch);
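A quick way to verify the header is actually being sent is to point the same code at an endpoint that echoes the request back, for example httpbin.org/user-agent:

// Fetch an echo service; the response body contains the User-Agent we sent.
$ch = curl_init('https://httpbin.org/user-agent');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3B48b Safari/419.3');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);   // e.g. {"user-agent": "Mozilla/5.0 (iPhone; ..."}
curl_close($ch);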