I am trying to build a site scraper. It works fine on my local machine, but when I run the same code on my server, I get a 403 Forbidden error.
I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:
Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40
The line of code triggering it is:
$url="http://www.example.com/viewProperty.html?id=".$id;
$html=file_get_html($url);
I have checked php.ini on the server and allow_url_fopen is On. Using curl might be a possible solution, but I would like to know where I am going wrong.
I know this is quite an old thread, but I thought I'd share some ideas.
If you get no content when accessing a web page, the site most likely does not want you to fetch it this way. So how does it tell that a script, not a human, is accessing the page? Generally, by the User-Agent header in the HTTP request sent to the server.
So to make the website treat the script like a browser, you must change the User-Agent header on the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by a common web browser.
Some common browser user agents are listed below:
Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
etc...
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
)
);
echo file_get_contents("https://www.google.com", false, $context);
This piece of code fakes the user agent and sends the request to https://www.google.com.
References:
stream_context_create
Cheers!
This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.
It could be that it blocks PHP scripts to prevent scraping, or your IP if you have made too many requests.
You should probably talk to the administrator of the remote server.
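To see what the remote server actually answers, it can help to read the status code instead of letting file_get_contents raise a warning. Below is a minimal sketch, assuming the standard "ignore_errors" HTTP context option and the $http_response_header variable that PHP fills in after an HTTP request; the function names statusCodeFromHeaders and fetchWithStatus are made up for illustration.

```php
<?php
// Hypothetical helper: extracts the numeric status from headers shaped like
// PHP's $http_response_header (status lines look like "HTTP/1.1 403 Forbidden").
function statusCodeFromHeaders(array $headers): ?int
{
    // After redirects there can be several status lines; keep the last one.
    $code = null;
    foreach ($headers as $header) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $header, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical wrapper: fetches a URL and returns the status with the body,
// even when the server answers with a 4xx or 5xx code.
function fetchWithStatus(string $url): array
{
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $context);
    // PHP populates $http_response_header in the scope of the caller.
    $headers = $http_response_header ?? [];
    return ['status' => statusCodeFromHeaders($headers), 'body' => $body];
}
```

If the status is 403 on the server but 200 from your local machine with the same headers, that points towards the server's IP being blocked rather than the script itself.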
Add this after you include simple_html_dom.php:
ini_set('user_agent', 'My-Application/2.5');
You can change it like this in the parser class, from line 35 onward:
function curl_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
function file_get_html()
{
$dom = new simple_html_dom;
$args = func_get_args();
$dom->load(call_user_func_array('curl_get_contents', $args), true);
return $dom;
}
Have you tried another site?
It seems that the remote server has some type of blocking. It may be by user agent; if that is the case, you can try using curl to simulate a web browser's user agent like this:
$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
Write this in simple_html_dom.php; it worked for me:
function curl_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$data = curl_exec($ch); // run the request once; a second curl_exec() would fetch the page again
curl_close($ch);
return $data;
}
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
$dom = new simple_html_dom;
$args = func_get_args();
$dom->load(call_user_func_array('curl_get_contents', $args), true);
return $dom;
//$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
}
I realize this is an old question, but...
Just setting up my local sandbox on Linux with PHP 7 and ran across this. When scripts are run from the terminal, PHP reads the php.ini for the CLI. I found that the "user_agent" option was commented out there. I uncommented it and set a Mozilla user agent, and now it works.
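To check which php.ini is in play and what user agent PHP will send by default, something like the following can be run both from the terminal and through the web server. It is only a sketch; the user agent value set at the end is a placeholder, not a recommendation.

```php
<?php
// Show which php.ini this SAPI (CLI or web server) actually loaded.
echo 'Loaded php.ini: ', var_export(php_ini_loaded_file(), true), PHP_EOL;

// An empty value here means PHP sends no User-Agent header by default.
echo 'user_agent: ', var_export(ini_get('user_agent'), true), PHP_EOL;

// Setting it at runtime affects later file_get_contents() calls in this script.
// The value below is only a placeholder.
ini_set('user_agent', 'Mozilla/5.0 (X11; Linux x86_64)');
```

The CLI and the web server often load different php.ini files, so the two outputs can differ.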
Did you check the permissions on your file? I set 777 on my file (on localhost, obviously) and that fixed the problem.
You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was open the website in the browser and copy any extra information that was sent in the HTTP request.
$context = stream_context_create(
array(
"http" => array(
'method' => "GET",
// Each header must sit on a single line and end with \r\n.
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
"accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
"accept-language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n" .
// file_get_contents() does not decompress the body, so a compressed response arrives as raw bytes.
"accept-encoding: gzip, deflate, br\r\n"
)
)
);
In my case, the server was rejecting the HTTP 1.0 protocol via its .htaccess configuration. It seems file_get_contents was using HTTP 1.0.
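If the protocol version is the cause, the HTTP stream context has a "protocol_version" option that can be set to 1.1. A minimal sketch follows; the helper name http11Context is made up, and sending "Connection: close" is a common precaution with HTTP/1.1 so the read does not wait on a kept-alive connection.

```php
<?php
// Hypothetical helper: builds a stream context that requests HTTP/1.1.
function http11Context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'protocol_version' => 1.1,
            // HTTP/1.1 connections are persistent by default; ask the server
            // to close the connection once the response has been sent.
            'header' => "User-Agent: $userAgent\r\nConnection: close\r\n",
        ],
    ]);
}
```

It would be passed as the third argument, for example file_get_contents($url, false, http11Context('My-Application/2.5')).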
Use the code below.
If you use file_get_contents:
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
));
=========
If you use curl:
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'); // value only; curl adds the "User-Agent: " header name itself
How do I turn this user agent string:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36
into variables using the CodeIgniter class?
$browser = $this->agent->browser();
$browser_version = $this->agent->version();
$platform = $this->agent->platform();
Just load the user_agent library and use its methods:
$this->load->library('user_agent');
$browser = $this->agent->browser();
$browser_version = $this->agent->version();
$platform = $this->agent->platform();
So here is the problem:
I've written some PHP code to register page views (with a lot of help from Stack Overflow). I specifically want to avoid using cookies for this. I would also prefer not to use an SQL database if a well-working solution is possible without one.
To deal with browser behaviour like prefetching, I am trying to filter out the extra page views with an if/elseif/else statement.
The problem in practice is that page views are sometimes written twice to the log file, so either the write happens twice or there is a timing issue between the if-statement and the rest of the code.
Here is the code I have:
<?php
/*set variables for log file */
$useragnt = $_SERVER['HTTP_USER_AGENT']; //get user agent
$ipaddrs = $_SERVER['REMOTE_ADDR']; //get ipaddress
date_default_timezone_set('Europe/Copenhagen'); //set the timezone before the first date() call
$filenameLog = "besog/" . date("Y-m-d") . "LOG.txt";
$infoToLog = $ipaddrs . "\t" . $useragnt . "\t" . date('H:i:s') . "\n";
$file_arr = file($filenameLog);
$last_row = $file_arr[count($file_arr) - 1];
$arr = explode( "\t", $last_row);
$tidForSidsteLogLinje = strtotime($arr[2]);
$tidNu = strtotime(date('H:i:s'));
//write ip, useragent and time of page view to log file logfil, but only if the same visitor has not viewed the page within the last 10 seconds
if ($arr[0] == $ipaddrs and $arr[1] == $useragnt and $tidNu - $tidForSidsteLogLinje > 10){
//write ip and user agent to textfile
$file = fopen($filenameLog, "a+");
fwrite($file, $infoToLog);
fclose($file);
}
elseif ($arr[0] == $ipaddrs and $arr[1] == $useragnt and $tidNu - $tidForSidsteLogLinje < 10){
die;
}
else {
//Write ip and user agent to textfile
$file = fopen($filenameLog, "a+");
fwrite($file, $infoToLog);
fclose($file);
}
?>
Here are examples of the duplicate entries in the log (I have masked some of the ipaddresses):
xxx.x.95.240 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko 12:52:33
xx.xxx.229.91 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 12:52:45
xx.xxx.229.91 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 12:52:45
xxx.xx.154.83 ServiceTester/4.4.64.1514 12:53:03
xxx.xx.91.126 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5 12:53:05
xx.xxx.35.3 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 12:53:09
xxx.xxx.130.34 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko 12:53:56
xxx.xxx.130.34 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko 12:53:56
xx.xxx.211.101 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 12:54:11
x.xxx.54.4 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/601.6.17 (KHTML, like Gecko) Version/9.1.1 Safari/601.6.17 12:54:33
If my if-statements were working as intended, it should not be possible to see duplicate lines like the ones above.
How do I improve the code to eliminate these duplicate entries?
Any help or suggestions are much appreciated!
We use a complex website visitor tracking/logging system in our application.
I would recommend that you store these values in a database and set the IP address field as unique.
You can set a cookie ID like
Cookie::set('__id', time());
and go like
if (isset($_COOKIE['__id'])) {
// With MySQL you go like this. Escape or bind these values in real code; they come from the client.
$db->Execute("INSERT IGNORE INTO VisitorTable(hash, ip,..)
VALUES('{$_COOKIE['__id']}', '{$_SERVER['REMOTE_ADDR']}')"); // plus HTTP_USER_AGENT, referrer and any other information that you want to store
}
This way the visitor only exists once in your list. See INSERT IGNORE for more.
Now you can easily make another function to save the pages the user visits.
In a script that gets executed every time, you go like:
$db->Execute("INSERT INTO VisitorActivity (visitorID,page....) VALUES ('{$_COOKIE['__id']}', '{$_SERVER['..']}')");
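For the file-based approach in the question, without cookies or SQL, one likely cause of the doubled lines is that two requests arrive almost simultaneously, both read the log before either writes, and both pass the check. A minimal sketch of making the check and the write one atomic step with flock() follows. The function name logPageView and its parameters are made up for illustration, and it scans back through all entries inside the time window rather than only the last line.

```php
<?php
// Hypothetical helper: returns true if the view was logged, false if it was
// treated as a repeat from the same IP and user agent within $window seconds.
function logPageView(string $logFile, string $ip, string $userAgent, int $now, int $window = 10): bool
{
    // 'c+' opens for reading and writing and creates the file without truncating it.
    $fh = fopen($logFile, 'c+');
    if ($fh === false) {
        return false;
    }
    // Exclusive lock: other requests wait here until this one has finished.
    flock($fh, LOCK_EX);

    $duplicate = false;
    $lines = preg_split('/\R/', stream_get_contents($fh), -1, PREG_SPLIT_NO_EMPTY);
    // Walk backwards through the log until entries are older than the window.
    for ($i = count($lines) - 1; $i >= 0; $i--) {
        $parts = explode("\t", $lines[$i]);
        if (count($parts) < 3) {
            continue;
        }
        // The log stores only H:i:s, so resolve it against the current day.
        $loggedAt = strtotime($parts[2], $now);
        if ($loggedAt === false || $now - $loggedAt > $window) {
            break;
        }
        if ($parts[0] === $ip && $parts[1] === $userAgent) {
            $duplicate = true;
            break;
        }
    }

    if (!$duplicate) {
        fseek($fh, 0, SEEK_END);
        fwrite($fh, $ip . "\t" . $userAgent . "\t" . date('H:i:s', $now) . "\n");
    }
    flock($fh, LOCK_UN);
    fclose($fh);
    return !$duplicate;
}
```

It could be called as logPageView($filenameLog, $_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'], time()). Reading the whole file on every request is fine for a small daily log but would need rethinking for a busy site.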
Below is the form I'm using to submit a new wheel into a DB on my local XAMPP server. The problem is that the POST method doesn't work. I had a similar problem with a modal, which I sorted out by using $_GET instead. I changed a few php.ini settings (max_upload and another one) after reading about them in other questions, but they didn't seem to solve the issue, so I changed them back.
When I use the submit form below, the array printed is just Array(), without a single value. At the very least this should prompt my error check to print an error.
<!-- wheels form -->
<div class="text-center">
<form class="form-inline" action="wheels.php" method="post">
<div class="form-group">
<label for="wheelName">Add a new wheel:</label>
<input name="wheelName" type="text" id="wheelName" class="form-control" value="<?=((isset($_POST['wheelName']))?$_POST['wheelName']:''); ?> "><!-- shorthand if/else-->
<label for="code">Stockcode</label>
<input name="code" type="text" id="code" class="form-control" value="<?=((isset($_POST['code']))?$_POST['code']:''); ?> "><!-- shorthand if/else-->
<input type="submit" name="add_submit" value="add wheel" class="btn btn-success">
</div>
</form>
</div><hr>
Standard PHP SQL input and checks. The information is being pulled correctly, as I have it displaying in a table further down my page. But the $_POST variable appears to be completely empty. Has anyone had this problem and managed to sort it? I'm assuming it has to do with my PHP setup or .htaccess, as someone else had an issue with that.
<?php
require_once '../core/init.php';
include 'includes/header.php';
include 'includes/navigation.php';
// Get wheels from DB
$sql = "SELECT * FROM wheels ORDER BY part_no";
$result = $db->query($sql);
$errors = array();
// edit wheel
if(isset($_GET['edit']) && !empty($_GET['edit'])) {
$edit_id = (int)$_GET['edit'];
$edit_id = sanitize($edit_id);
//$sql2 = "SELECT * FROM wheels WHERE recid = '$edit_id'";
//$edit_result = $db->query($sql2);
//$eWheel = mysqli_fetch_assoc($edit_result);
}
// Delete wheel
if(isset($_GET['delete']) && !empty($_GET['delete'])) {
$delete_id = (int)$_GET['delete'];
$delete_id = sanitize($delete_id);
//$sql = "DELETE FROM wheels WHERE recid = '$delete_id'";
//$db->query($sql);
//header('Location: wheels.php');
}
// If add wheel form submitted
if(isset($_POST['add_submit'])) {
$wheel = sanitize($_POST['wheelName']);
$stockCode = sanitize($_POST['code']); // the form field is named "code"
// check if wheel is blank
if($_POST['wheelName'] == '') {
$errors[] .= 'Must enter wheel';
}
// if wheel exists in db
$sql = "SELECT * FROM wheels WHERE stockcode = '$stockCode'";
$result = $db->query($sql);
$count = mysqli_num_rows($result);
if($count > 0) {
$errors[] .= 'that wheel already exists.';
}
// display errors
if(!empty($errors)) {
echo displayErrors($errors);
} else {
//Add wheels to DB // incomplete unsure if feature needed at this stage.
// $sql = "INSERT INTO wheels (wheelName, stockCode, ID ETC ETC VALUES Etc ETC
// $db->query($sql);
// header ('location : wheels.php ');
}
}
var_dump($_POST);
echo file_get_contents("php://input");
?>
Here is the log
::1 - - [23/Apr/2016:14:12:32 +1200] "GET / HTTP/1.1" 302 - "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:32 +1200] "GET /dashboard/ HTTP/1.1" 200 6904 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:32 +1200] "GET /dashboard/stylesheets/normalize.css HTTP/1.1" 200 6876 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:32 +1200] "GET /dashboard/stylesheets/all.css HTTP/1.1" 200 481308 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:32 +1200] "GET /dashboard/javascripts/modernizr.js HTTP/1.1" 200 51365 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:33 +1200] "GET /dashboard/javascripts/all.js HTTP/1.1" 200 189003 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:33 +1200] "GET /dashboard/images/xampp-logo.svg HTTP/1.1" 200 5427 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:33 +1200] "GET /dashboard/images/bitnami-xampp.png HTTP/1.1" 200 22133 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:33 +1200] "GET /dashboard/images/fastly-logo.png HTTP/1.1" 200 1770 "http://localhost/dashboard/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
::1 - - [23/Apr/2016:14:12:33 +1200] "GET /dashboard/images/social-icons.png HTTP/1.1" 200 3361 "http://localhost/dashboard/stylesheets/all.css" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
Well, I think you should place the <?php line at the very top of the file, before the "Get wheels from DB" comment.
And instead of print_r($_POST); use var_dump($_POST);
Also, in the section "if wheel exists in db", your SQL query is never using the $stockCode variable. You're missing the $; it should be $sql = "SELECT * FROM wheels WHERE stockcode = '$stockCode'";
And drop the dot before the equals sign for the errors. That's for strings; for an array, $errors[] = 'that wheel already exists.'; should work fine.
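The point about $errors[] can be sketched like this. The helper wheelFormErrors and the 'Must enter stockcode' message are made up for illustration; the keys 'wheelName' and 'code' are the name attributes of the inputs in the form, which is what PHP uses as keys in $_POST.

```php
<?php
// Hypothetical helper: collects validation messages for the add-wheel form.
// It reads the same keys as the name attributes of the form inputs.
function wheelFormErrors(array $post): array
{
    $errors = [];
    if (trim($post['wheelName'] ?? '') === '') {
        $errors[] = 'Must enter wheel'; // plain "=" appends a new element
    }
    if (trim($post['code'] ?? '') === '') {
        $errors[] = 'Must enter stockcode'; // hypothetical message
    }
    return $errors;
}
```

Calling wheelFormErrors($_POST) with an empty $_POST returns both messages, which makes an empty submission visible instead of silently doing nothing.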
I have a huge text file that contains some information I need to insert into a database.
Here is one of the rows of the text file:
77.242.22.86 - - [10/Jul/2013:14:07:26 +0200] "GET /web/img/theimage.jpg HTTP/1.1" 304 - "http://www.mywebiste.com/web/" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36 AlexaToolbar/alxg-3.1"
I am using some php functions like:
$myFile = "testFile.txt";
$fh = fopen($myFile, 'r');
This only reads the text file. As I said, I need to get the data and insert it into the database. Here is an example of how I need the row to be split:
77.242.22.86
[10/Jul/2013:14:07:26 +0200]
GET
/web/img/theimage.jpg
HTTP/1.1
304
http://www.mywebiste.com/web/
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36 AlexaToolbar/alxg-3.1
Please help me to resolve this problem. Thank you.
$test = '77.242.22.86 - - [10/Jul/2013:14:07:26 +0200] "GET /web/img/theimage.jpg HTTP/1.1" 304 - "http://www.mywebiste.com/web/" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36 AlexaToolbar/alxg-3.1"';
function tokenizer($test) {
// Normalise the line: lowercase it and strip commas, hyphens and double quotes.
$test = trim(strtolower($test));
$res = str_replace(",", "", $test);
$res = str_replace("-", "", $res);
$res = str_replace('"', "", $res);
// Split on whitespace, dropping the empty pieces left by the removed characters.
$result = preg_split('/\s+/', $res, -1, PREG_SPLIT_NO_EMPTY);
foreach ($result as $word) {
echo $word . "<hr/>";
}
return $result; // contains the single words
}
$words = tokenizer($test);
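Splitting on spaces breaks the user agent into many pieces, so an alternative is to match the whole row with one regular expression. The sketch below assumes every row follows the Apache "combined" log layout shown in the question; the function name parseAccessLogLine and the array keys are made up for illustration, and the timestamp is returned without its square brackets.

```php
<?php
// Hypothetical helper: splits one Apache "combined" access log line into
// named fields, or returns null when the line does not match the layout.
function parseAccessLogLine(string $line): ?array
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }
    return [
        'ip'         => $m[1],
        'time'       => $m[2],
        'method'     => $m[3],
        'path'       => $m[4],
        'protocol'   => $m[5],
        'status'     => (int) $m[6],
        'bytes'      => $m[7], // "-" when no body was sent
        'referer'    => $m[8],
        'user_agent' => $m[9],
    ];
}
```

For the huge file, the lines can be read one at a time with fgets() on the handle from fopen() and passed to this function, so the whole file never has to sit in memory. Rows that return null can be logged and skipped.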