Send request with user IP when scraping data in PHP

I am stuck on a problem. I have a URL with a geo-location restriction: it can only be viewed from Europe or the USA, and my location is Asia. I want to extract all hrefs from the URL.
I am currently using cURL, but the problem is that it sends the server's IP address, and I want the request to be made with the user's IP address in order to track which links the user has visited. If you can guide me on how to send the request with the user's IP address, and without using cURL, I'll be grateful.
Following is the source code. The URL which I am accessing is:
http://partnerads.ysm.yahoo.com/ypa/?ct=2&c=000000809&u=http%3A%2F%2Ftrouve.autocult.fr%2F_test.php%3Fq%3Dtarif%2520skoda%2520superb%2520combi&r=&w=1&tv=&tt=&lo=&ty=&ts=1458721731523&ao=&h=1&CoNo=3292b85181511c0a&dT=1&er=0&si=p-Autocult_FRA_SERP_2%3A600x796
<?php
include_once 'simple_html_dom.php';

$html = file_get_html('iframe.html');

// find each iframe within the HTML document
foreach ($html->find('iframe') as $iframe) {
    $src = $iframe->getAttribute('src'); // src extracted

    $ch = curl_init(); // initialise a cURL handle
    // Set any other cURL options that are required
    curl_setopt($ch, CURLOPT_HEADER, TRUE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_URL, $src);
    $results = curl_exec($ch); // Execute the cURL request
    //echo curl_error($ch);
    curl_close($ch); // Close the handle

    // Collect every href in the response. (A while (preg_match(...)) loop over an
    // unchanging string never terminates, so preg_match_all is used instead.)
    if (preg_match_all('/<a[^>]+href=([\'"])(.+?)\1[^>]*>/i', $results, $matches)) {
        foreach ($matches[2] as $href) {
            // captured group that is actually the URL you are searching for
            echo $href.'<br><br>';
        }
    }
}

You can use a proxy:
$ip = '100.100.100.100:234'; // example $ip
curl_setopt($ch, CURLOPT_PROXY, $ip);
Without curl:
$aContext = array(
    'http' => array(
        'proxy' => 'tcp://'.$ip,
        'request_fulluri' => true,
    ),
);
$cxContext = stream_context_create($aContext);
$sFile = file_get_contents("http://www.google.com", false, $cxContext);
If you're looking for proxies, there are some addresses that are easy to scrape:
'http://proxylist.hidemyass.com/',
'http://ipaddress.com/proxy-list/',
'http://nntime.com/proxy-ip-'.$i.'.htm',
'http://www.proxylisty.com/ip-proxylist-'.$i
over 2000 IPs
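To tie the two pieces together, here is a minimal sketch of the curl-free approach: route file_get_contents through a proxy via a stream context and pull the hrefs with preg_match_all, as in the question. The proxy address and the $src value are placeholders, not working values.
<?php
// Hypothetical proxy in an allowed region (replace with a real, working proxy)
$proxy = 'tcp://100.100.100.100:234';

// Replace with the iframe src extracted in the question's loop
$src = 'https://example.com/';

$context = stream_context_create(array(
    'http' => array(
        'proxy'           => $proxy,
        'request_fulluri' => true,
        'user_agent'      => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36',
    ),
));

// Fetch the geo-restricted page through the proxy, without curl
$html = file_get_contents($src, false, $context);

// Extract every href, as in the question
if ($html !== false && preg_match_all('/<a[^>]+href=([\'"])(.+?)\1[^>]*>/i', $html, $matches)) {
    foreach ($matches[2] as $href) {
        echo $href.'<br>';
    }
}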

Related

simple_html_dom: 403 Access denied

I implemented this function in order to parse HTML pages using two different "methods".
As you can see, both use the very handy class called simple_html_dom.
The difference is that the first method also uses cURL to load the HTML, while the second does not.
Both methods work fine on a lot of pages, but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM($url, $method)
{
    echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
    $time_start = microtime(true);

    switch ($method) {
        case 'curl':
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($curl, CURLOPT_HEADER, false);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($curl, CURLOPT_URL, $url);
            curl_setopt($curl, CURLOPT_REFERER, $url);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
            $str = curl_exec($curl);
            curl_close($curl);
            // Create a DOM object
            $html = new simple_html_dom();
            // Load HTML from a string
            $html->load($str);
            break;
        case 'simple_html_dom':
            $html = new simple_html_dom();
            $html->load_file($url);
            break;
    }

    $collection = $html->find('h1');
    foreach ($collection as $x => $x_value) {
        echo 'x = '.$x.' => value = '.$x_value.'<br>';
    }
    $html->save('result.htm');
    $html->clear();

    $time_end = microtime(true);
    echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom.
You can remove the simple_html_dom part of the code and leave only the cURL part, which I assume is the source of the problem.
There are lots of reasons why cURL may not work on a page.
First of all, I can see you added
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try adding CURLOPT_SSL_VERIFYHOST set to false.
Secondly, check your cURL version and see whether it is too old.
Third, if none of the above works, you may want to enable cookies; it is possible that disabled cookies cause the website to detect that a machine, not a real person, is sending the request.
Lastly, if all of the above attempts fail, try another library or even file_get_contents.
cURL is not your only option, although it is the most powerful one.
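As a rough illustration of those suggestions, here is a minimal sketch (not the original poster's code) of a cURL setup with host verification relaxed and a cookie jar enabled; the cookie file path is an arbitrary choice for the example.
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://fr.shopping.rakuten.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
// Relax SSL checks, as suggested above (not recommended for production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
// Persist cookies between requests so the site sees a normal browsing session
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');   // where cookies are saved
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');  // where cookies are read from
$str = curl_exec($curl);
curl_close($curl);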

Manipulate dom with php to scrape data

I am currently trying to manipulate the DOM through PHP to extract the view count from a Facebook video page. The code below was working until a while ago, but now it no longer finds the node that contains the view count. This information is inside a div with id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of a Facebook video page?
private function _callCurl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    $http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array(
        $http,
        $response,
    );
}
function test()
{
    $url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
    $request = callCurl($url);
    if ($request[0] == 200) {
        $dom = new DOMDocument();
        @$dom->loadHTML($request[1]); // @ suppresses warnings about malformed HTML
        $elm = $dom->getElementById('fbPhotoPageMediaInfo');
        if (isset($elm->nodeValue)) {
            $views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
        } else {
            $views = null;
        }
    } else {
        echo "Error!";
    }
    return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() on $request you can see that it's giving you a 302 code (redirect) rather than a 200 (ok).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
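For clarity, here is a minimal sketch of the adjusted cURL options that diagnosis implies, following the redirect and using the simpler User-Agent string; the remaining options are carried over from the question's _callCurl.
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.facebook.com/TaylorSwift/videos/10153665021155369/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Follow the 302 instead of stopping at it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// The simpler User-Agent that Facebook accepted in this case
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
$http     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);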
Personally, I prefer to use Simplehtmldom.
FB, like other high-traffic sites, updates its source to help prevent scraping. You may have to adjust your node search in the future.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/

function Scrape_FB_Views($url)
{
    if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
        // Create DOM from URL
        $html = file_get_html($url);
        if ($html) {
            if ($html->find('span[class=fcg]', 3)) { // 4th instance of span with fcg class
                $text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
                $result = preg_replace('/[^0-9]/', '', $text); // strip all non-numeric characters
            } else {
                $result = "Node is no longer valid.";
            }
        } else {
            $result = "Could not get HTML.";
        }
    } else {
        $result = "URL is invalid.";
    }
    return $result;
}

$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>

RSS-Feed returns an empty string

I have a news portal that displays RSS feed items. Approximately 50 sources are read, and it works very well.
Only one source always returns an empty string. The W3C RSS Validator can read the feed, and even my program Vienna receives data.
What can I do?
Here is my simple code:
$link = 'http://blog.bosch-si.com/feed/';
$response = file_get_contents($link);
if ($response !== false) {
    var_dump($response);
} else {
    echo 'Error ';
}
The server serving that feed expects a User Agent to be set. You apparently don't have a User Agent set in your php.ini, nor do you set it in the call to file_get_contents.
You can either set the User Agent for this particular request through a stream context:
echo file_get_contents(
    'http://blog.bosch-si.com/feed/',
    false,
    stream_context_create(
        array(
            'http' => array(
                'user_agent' => 'php'
            )
        )
    )
);
Or globally for any http calls:
ini_set('user_agent', 'php');
echo file_get_contents($link);
Both will give you the desired result.
The blog http://blog.bosch-si.com/feed/ requires some headers to fetch content from the website, so it is better to use cURL for this.
See below solution:
<?php
$link = 'http://blog.bosch-si.com/feed/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $link);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: blog.bosch-si.com', 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it immediately
$result = curl_exec($ch);
if (!$result) {
    echo curl_error($ch);
}
curl_close($ch);
echo $result;

file_get_contents from specific URL

I have an API Key that verifies the request URL
If I do
echo file_get_contents('http://myfilelocation.com/?apikey=1234');
RESULT : this api key is not authorized for this domain
However, if I request the same URL from within an iframe:
RESULT : this api key is authorized
Obviously, the server I'm getting the JSON data from is working properly, because the iframe outputs the proper information. So how can I make sure that PHP sends the request with the proper domain and URL settings?
Using file_get_contents, I always get back that the API key is not authorized, even though I'm running the PHP script from the authorized domain.
Try this PHP code:
<?php
$options = array(
    'http' => array(
        'method' => "GET",
        'header' => "Host: myfilelocation.com\r\n". // don't forget to replace with your domain
                    "Accept-language: en\r\n".
                    "User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10\r\n"
    )
);
$context = stream_context_create($options);
$file = file_get_contents("http://myfilelocation.com/?apikey=1234", false, $context);
?>
file_get_contents doesn't send any referrer information, and the API may need it. This may help you:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://myfilelocation.com/?apikey=1234');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_REFERER, 'http://autorized-domain.here');
$html = curl_exec($ch);
echo $html;
?>

How to get redirecting url link with php from bit.ly

I'm trying to get the destination URLs behind bit.ly redirects. I've tried opening bit.ly links with file_get_contents, but it already returns the content of the redirected site; how can I get that site's URL instead?
I was unaware of the bit.ly API; here is the raw way to do it:
$context = array(
    'http' => array(
        'method' => 'GET',
        'max_redirects' => 1,
    ),
);
// @ suppresses the "redirection limit reached" warning; the headers are still captured
@file_get_contents('http://bit.ly/cmUTtb', null, stream_context_create($context));
// index 6 happens to be the Location header for this particular response
echo 'Redirect to: ' . str_replace('Location: ', '', $http_response_header[6]);
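Since the position of the Location header inside $http_response_header can vary between responses, a slightly more defensive variant (a sketch along the same lines, not part of the original answer) scans the captured headers for it:
<?php
$context = stream_context_create(array(
    'http' => array(
        'method'        => 'GET',
        'max_redirects' => 1, // don't follow the redirect, just record it
    ),
));
@file_get_contents('http://bit.ly/cmUTtb', false, $context);

// Look for the Location header instead of assuming its index
foreach ($http_response_header as $header) {
    if (stripos($header, 'Location:') === 0) {
        echo 'Redirect to: ' . trim(substr($header, strlen('Location:')));
        break;
    }
}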
You can query bit.ly's API (documentation) for the long URL. You will need your username and API key (which can be found on your account page).
$endpoint = 'http://api.bit.ly/v3/expand?';
$params = array(
    'shortUrl' => 'http://bit.ly/aUmUDq',
    'login'    => 'your_bitly_username',
    'apiKey'   => 'your_api_key',
    'format'   => 'txt'
);
$api_url = $endpoint . http_build_query($params);
echo file_get_contents($api_url);
Use curl, which will not follow redirects by default.
see https://stackoverflow.com/a/41680608/7426396
I implemented the following to get, for each line of a plain text file (one shortened URL per line), the corresponding redirect URL:
<?php
// input: text file with one bitly-shortened URL per line
$plain_urls = file_get_contents('in.txt');
$bitly_urls = explode("\r\n", $plain_urls);

// output: where should we write
$w_out = fopen("out.csv", "a+") or die("Unable to open file!");

foreach ($bitly_urls as $bitly_url) {
    $c = curl_init($bitly_url);
    curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36');
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($c, CURLOPT_HEADER, 1);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 20);
    // curl_setopt($c, CURLOPT_PROXY, 'localhost:9150');
    // curl_setopt($c, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
    $r = curl_exec($c);

    // get the redirect url:
    $redirect_url = curl_getinfo($c)['redirect_url'];
    curl_close($c); // release the handle before the next iteration

    // write output as csv
    $out = '"'.$bitly_url.'";"'.$redirect_url.'"'."\n";
    fwrite($w_out, $out);
}
fclose($w_out);
Have fun and enjoy!
pw
