php scraper scripts need to be changed

This script harvests links from a seed URL and only prints them in the command shell (or browser) rather than saving them anywhere. I want the script to store its output in a .txt file in the folder where the script resides. What would be an efficient way to do that? Please give me some hints.
<?php
# Initialization
include("LIB_http.php"); // http library
include("LIB_parse.php"); // parse library
include("LIB_resolve_addresses.php"); // address resolution library
include("LIB_exclusion_list.php"); // list of excluded keywords
include("LIB_simple_spider.php"); // spider routines used by this app.
set_time_limit(3600); // Don't let PHP timeout
$SEED_URL = "http://www.schrenk.com"; // First URL spider downloads
$MAX_PENETRATION = 1; // Set spider penetration depth
$FETCH_DELAY = 1; // Wait one second between page fetches
$ALLOW_OFFISTE = false; // Don't allow spider to roam from the SEED_URL's domain
$spider_array = array();
# Get links from $SEED_URL
echo "Harvesting Seed URL \n";
$temp_link_array = harvest_links($SEED_URL);
$spider_array = archive_links($spider_array, 0, $temp_link_array);
# Spider links in remaining penetration levels
for($penetration_level=1; $penetration_level<=$MAX_PENETRATION; $penetration_level++)
{
$previous_level = $penetration_level - 1;
for($xx=0; $xx<count($spider_array[$previous_level]); $xx++)
{
unset($temp_link_array);
$temp_link_array = harvest_links($spider_array[$previous_level][$xx]);
echo "Level=$penetration_level, xx=$xx of ".count($spider_array[$previous_level])." <br>\n";
$spider_array = archive_links($spider_array, $penetration_level, $temp_link_array);
}
}
?>

Use PHP's file_put_contents function with the FILE_APPEND flag, so each call adds to the file instead of overwriting it.
$file = 'file_name.txt';
file_put_contents($file, $text_to_write_to_file, FILE_APPEND);
Ref: http://www.php.net/manual/en/function.file-put-contents.php

I would recommend first creating a variable to hold the output. At the top of the script (under the $spider_array = array(); line) add:
$output = "";
Then change all the lines with echo to use $output .= instead.
This collects everything that would have been sent to the screen or browser in the $output variable.
Now at the bottom of the script, after everything has been scraped and the spider is finished, save the output to a file:
$filename = date('Y_m_d_H_i_s') . '.txt';
$filepath = dirname(__FILE__);
file_put_contents($filepath . '/' . $filename, $output);
This should save the output in a file in the same folder as the script, with a date/time file name. (This code was written using examples from php.net; the exact implementation may need a bit of debugging, but it should get you close enough.)
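If the goal is to keep the harvested links themselves rather than the progress messages, one option is to write $spider_array to disk once the spider finishes. The sketch below assumes $spider_array maps each penetration level to a list of URL strings, as the loops in the question suggest; the helper name save_links and the file name links.txt are made up for illustration.

```php
<?php
// Hypothetical helper: flattens a level-indexed link array and writes
// one URL per line. Returns the number of unique links written.
function save_links(array $spider_array, string $path): int
{
    $links = [];
    foreach ($spider_array as $level_links) {
        foreach ($level_links as $link) {
            $links[$link] = true; // array keys de-duplicate the URLs
        }
    }
    $lines = array_keys($links);
    $text = $lines ? implode("\n", $lines) . "\n" : '';
    // LOCK_EX guards against two runs writing at the same time
    if (file_put_contents($path, $text, LOCK_EX) === false) {
        throw new RuntimeException("Cannot write to $path");
    }
    return count($lines);
}
```

At the bottom of the spider script this could be called as save_links($spider_array, __DIR__ . '/links.txt'); so the file lands next to the script.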

Related

PHP File Handling (Download Counter) Reading file data as a number, writing it as that plus 1

I'm trying to make a download counter in a website for a video game in PHP, but for some reason, instead of incrementing the contents of the downloadcount.txt file by 1, it takes the number, increments it, and appends it to the end of the file. How could I just make it replace the file contents instead of appending it?
Here's the source:
<?php
ob_start();
$newURL = 'versions/v1.0.0aplha/Dungeon1UP.zip';
//header('Location: '.$newURL);
//increment download counter
$file = fopen("downloadcount.txt", "w+") or die("Unable to open file!");
$content = fread($file,filesize("downloadcount.txt"));
echo $content;
$output = (int) $content + 1;
//$output = 'test';
fwrite($file, $output);
fclose($file);
ob_end_flush();
?>
The number in the file is supposed to increase by one every time, but instead, it gives me numbers like this: 101110121011101310111012101110149.2233720368548E+189.2233720368548E+189.2233720368548E+18

As correctly pointed out in one of the comments, for your specific case you can use fseek ( $file, 0 ) right before writing, such as:
fseek ( $file, 0 );
fwrite($file, $output);
Or, even simpler, you can call rewind($file) before writing. This ensures that the next write happens at byte 0, i.e. the start of the file. Without it, the file pointer sits at the end of whatever you just read, which is why the new number gets appended.
Also check the mode: "w+" opens the file for reading and writing but truncates it to zero length first, so the previous count is gone before you can read it. If you do not want to reset the contents, open the file in read/write mode without truncation, which is "r+" on your fopen, such as:
fopen("downloadcount.txt", "r+")
Just make sure the file exists first, because "r+" does not create it!
Please see fopen modes here:
https://www.php.net/manual/en/function.fopen.php
And working code here:
https://bpaste.net/show/iasj

It will be much simpler to use file_get_contents/file_put_contents:
// update with more precise path to file:
$content = file_get_contents(__DIR__ . "/downloadcount.txt");
echo $content;
$output = (int) $content + 1;
// by default `file_put_contents` overwrites file content
file_put_contents(__DIR__ . "/downloadcount.txt", $output);

The appended digits may simply be a typecasting problem, but I would not encourage keeping counts in a file. To count the downloads of a file, it is better to update a database row inside a transaction so that concurrent requests are handled properly; doing it with a file could compromise accuracy.
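If a database is not an option, file locking can reduce the concurrency risk mentioned above. The following is only a sketch under that assumption: the helper name increment_counter is hypothetical, and it relies on flock(), which is advisory and only protects against other code that also takes the lock.

```php
<?php
// Hypothetical helper: increments the number stored in $path and
// returns the new value, holding an exclusive lock while it works.
function increment_counter(string $path): int
{
    // 'c+' opens for reading and writing, creates the file if it is
    // missing, and does not truncate it
    $fh = fopen($path, 'c+');
    if ($fh === false) {
        throw new RuntimeException("Unable to open $path");
    }
    try {
        if (!flock($fh, LOCK_EX)) {
            throw new RuntimeException("Unable to lock $path");
        }
        $count = (int) stream_get_contents($fh) + 1;
        rewind($fh);        // go back to byte 0
        ftruncate($fh, 0);  // drop the old digits
        fwrite($fh, (string) $count);
        fflush($fh);        // flush before releasing the lock
        flock($fh, LOCK_UN);
    } finally {
        fclose($fh);
    }
    return $count;
}
```

A call such as increment_counter(__DIR__ . '/downloadcount.txt') would then replace the fopen/fread/fwrite sequence from the question.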

You can read the content and check whether the file has data. If not, initialise the count to 0, and then just replace the content.
$fileContent = file_get_contents("downloadcount.txt");
$content = (!empty($fileContent) ? (int) $fileContent : 0);
$content++;
file_put_contents('downloadcount.txt', $content);
To verify, check $content or look directly at the contents of the file.

PHP - Executing PHP from a string

This one is a bit of a weird one. I've created a function designed to select a template and either include it or parse the %0, %1, %3 etc. variables. This is the current function:
if(!fopen($tf,"r")){
$this->template("error",array("404"));
}
$th = fopen($tf,"r");
$t = fread($th, filesize($tf) );
$i=0;
for($i;$i<count($params);$i++){
$i2 = '%' . $i;
$t = str_replace($i2,$params[$i],$t);
}
echo $t . "\n";
fclose($th);
Here $tf is the relative path to my template file. My issue is that I need to execute the PHP inside these files while at the same time being able to replace the string variables %0, %1, etc.
How could I go about attempting this?

Like I said in my comment, I think a template engine like Smarty would probably serve you better, but here's how I'd do it with output buffering rather than eval().
Something like this:
ob_start();
include "your_template_file.php";
$contents = ob_get_contents(); // will contain the output of the PHP in the file
ob_end_clean();
// process your str_replace() variables out of $contents
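Putting the output-buffering idea together with the %0, %1 placeholders from the question might look like the sketch below. The function name render_template is hypothetical, and strtr() is swapped in for the str_replace() loop because it tries the longest keys first, which avoids %1 matching inside %10.

```php
<?php
// Hypothetical helper: runs the PHP in $file, captures what it prints,
// then substitutes %0, %1, ... with the entries of $params.
function render_template(string $file, array $params): string
{
    if (!is_readable($file)) {
        throw new RuntimeException("Template not found: $file");
    }
    ob_start();
    try {
        include $file;
        $contents = ob_get_contents();
    } finally {
        ob_end_clean(); // always close the buffer, even on errors
    }
    $map = [];
    foreach (array_values($params) as $i => $value) {
        $map['%' . $i] = (string) $value;
    }
    // strtr() prefers the longest matching key at each position
    return $map ? strtr($contents, $map) : $contents;
}
```

Note that the included template runs inside the function's scope, so it can see $file and $params but not the caller's variables.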

php process image files created with complicated url syntax

I have a script on my server that dynamically creates images of chess diagrams:
<img src = "ChessImager/ChessImager.php?fen=r3k2r/1pqb2pp/pnn2p2/4p3/4Q3/NBP1B3/P4PPP/R2R2K1&square_size=45&ds_color=(143,188,143)&ls_color=(232,223,192)">
But the resulting image files are nearly 30k each, which is too big. I want to use pngnq http://pngnq.sourceforge.net/ to shrink them. I present them in a slideshow at http://communitychessclub.com. I want a new PHP script that creates the images from ChessImager.php and saves each of these diagram image files (~50) under a new filename like 'game1234.png'; I'll then batch pre-process them (not in real time) with pngnq. I have a file 'Forsyth.csv' which lists the data:
r1bqk2r/1p2bp1p/p2pnp2/4pN1Q/2B1P3/2N5/PP3PPP/R2R2K1|1256
r3k2r/1pqb2pp/pnn2p2/4p3/4Q3/NBP1B3/P4PPP/R2R2K1|1255
4rrk1/ppp3pp/2n4q/3p4/3P4/1NP1PpPP/PP3Q1K/R4R2|1253
rn2kb1r/1q1p2p1/p3p3/1p2N1Bp/2p1P2P/2P4Q/PP3PP1/3R1RK1|1252
I use this:
<?php
$text = file('Forsyth.csv');
foreach ($text as $line) {
    $token = explode("|", $line);
    print "\n";
    $fen = $token[0];
    $game_num = $token[1];
    $phrase = "games/game$game_num.php";
    echo "<li> <img src=\"ChessImager/ChessImager.php?fen=$fen&square_size=45&ds_color=(143,188,143)&ls_color=(232,223,192)\" ></li>";
}
?>
Any ideas?
Update: this is posted at http://communitychessclub.com/produce.php
<?php
$text = file('Forsyth.csv');
foreach ($text as $line) {
    $token = explode("|", $line);
    print "\n";
    $fen = $token[0];
    $game_num = $token[1];
    print "\n";
    $goat = "diagrams/game$game_num.png";
    $src = "ChessImager/ChessImager.php?fen=$fen&square_size=45&ds_color=(143,188,143)&ls_color=(232,223,192)";
    echo "<li><img src = \"$src\"></li>";
}
?>
Any ideas?

ChessImager.php or one of its includes must have an imagepng($image) line near the end that sends the generated PNG image to the web browser. If your question is how to save that data to disk instead, you can modify the script so that it writes the image data to a file:
imagepng($image, $filename);
where $filename is something unique that you can generate from the arguments passed to the script. For example:
$filename = md5($fen).".png";
Wherever you decide to have the script save the files, you'll need to make sure that you (or the web server, if you're running it in a browser) have permission to write to that folder.
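If modifying ChessImager.php is not desirable, another route is to request each diagram over HTTP and save the response under the game number. This is a rough sketch, not the asker's or ChessImager's own method: the three function names are hypothetical, the base URL you pass in is an assumption, and file_get_contents() on a URL only works when allow_url_fopen is enabled.

```php
<?php
// Splits one "FEN|game number" line from Forsyth.csv.
function parse_forsyth_line(string $line): ?array
{
    $token = explode('|', trim($line));
    if (count($token) !== 2 || $token[0] === '' || $token[1] === '') {
        return null; // skip blank or malformed lines
    }
    return ['fen' => $token[0], 'game_num' => $token[1]];
}

// Builds the ChessImager URL with the query string properly encoded.
function diagram_url(string $base_url, string $fen): string
{
    return $base_url . '?' . http_build_query([
        'fen'         => $fen,
        'square_size' => 45,
        'ds_color'    => '(143,188,143)',
        'ls_color'    => '(232,223,192)',
    ]);
}

// Fetches every diagram and saves it as $out_dir/game<number>.png.
// Returns how many files were written.
function save_diagrams(string $csv_path, string $base_url, string $out_dir): int
{
    $saved = 0;
    $lines = file($csv_path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    foreach ($lines as $line) {
        $row = parse_forsyth_line($line);
        if ($row === null) {
            continue;
        }
        $png = file_get_contents(diagram_url($base_url, $row['fen']));
        if ($png !== false
            && file_put_contents("$out_dir/game{$row['game_num']}.png", $png) !== false) {
            $saved++;
        }
    }
    return $saved;
}
```

A run might look like save_diagrams('Forsyth.csv', 'http://communitychessclub.com/ChessImager/ChessImager.php', 'diagrams'), after which pngnq can be applied to the diagrams folder in a batch.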

Split big files using PHP

I want to split huge files (to be specific, tar.gz files) into multiple parts from PHP code. The main reason for doing this is PHP's 2 GB limit on 32-bit systems.
So I want to split big files into multiple parts and process each part separately.
Is this possible? If yes, how?

My comment was voted up twice, so maybe my guess was onto something :P
If on a unix environment, try this...
exec('split -d -b 2048m file.tar.gz pieces');
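With -d the suffixes are numeric, so your pieces should be pieces00, pieces01, etc. (see the split manual page). Note that a 2048m piece is exactly 2 GiB, one byte more than a signed 32-bit integer can count, so a smaller size such as 1024m may be safer on a 32-bit system.
You could get the number of resulting pieces by using stat() in PHP to get the file size and then doing the simple math: (int) ceil($stat['size'] / (2048*1024*1024)). The parentheses around the divisor matter, and ceil() counts the final partial piece.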
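The piece-count arithmetic can be sketched as follows. The helper name piece_count is hypothetical, and floats are used on the assumption that the code may run on a 32-bit build, where integer arithmetic on sizes of 2 GiB and above overflows. Keep in mind that PHP's own documentation warns that filesize() and stat() may return unexpected results for files larger than 2 GB on such platforms, so the size itself may need to come from elsewhere.

```php
<?php
// Hypothetical helper: how many pieces `split -b` produces for a file.
// Floats are used so the arithmetic does not overflow a 32-bit integer.
function piece_count(float $file_size, float $piece_size): int
{
    // Divide by the whole piece size, then round up so the last,
    // shorter piece is counted too
    return (int) ceil($file_size / $piece_size);
}

$piece_size = 2048.0 * 1024 * 1024; // 2048m, as passed to split
echo piece_count(5.0 * 1024 * 1024 * 1024, $piece_size), "\n"; // 5 GiB file: prints 3
```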

A simple method (if using Linux based server) is to use the exec command and to run the split command:
exec('split Large.tar.gz -b 4096k SmallParts'); // 4MB parts
/*
split        : the application
Large.tar.gz : the source file
-b 4096k     : the split size
SmallParts   : the output filename prefix
*/
See here for more details: http://www.computerhope.com/unix/usplit.htm
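Or you can use csplit (http://www.computerhope.com/unix/ucsplit.htm), which splits on context patterns; it is therefore aimed at text files and expects at least one pattern argument after the file name: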
exec('csplit -k -s -f part_ -n 3 LargeFile.tar.gz');
PHP runs in a single thread, and the only way to get more parallelism is to create child processes by forking.
This is not resource friendly. I would suggest looking into a language that can do this quickly and effectively, such as node.js.
Just install node on the server and then create a small script, called node_split for instance, that does the job on its own.
Whatever you choose, I strongly advise against doing the splitting in PHP itself; use exec to let the host operating system do it.

HJSPLIT
http://www.hjsplit.org/php/

PHP itself might not be able to do this.
If you can figure out how to do it from your computer's command line,
you should then be able to execute the same commands using exec().

function split_file($source, $targetpath='/split/', $lines=1000){
    $i = 0;       // lines collected for the current part
    $j = 1;       // number of the part being written
    $date = date("m-d-y");
    $buffer = '';
    $handle = fopen($_SERVER['DOCUMENT_ROOT'].$source, "r");
    if (!$handle) {
        echo "Cannot open file ($source)";
        return;
    }
    while (true) {
        $row = fgets($handle);
        if ($row !== false) {
            $buffer .= $row;
            $i++;
        }
        // write a part when it is full, or at end of file if lines remain
        if ($i >= $lines || ($row === false && $buffer !== '')) {
            $fname = $_SERVER['DOCUMENT_ROOT'].$targetpath."part_".$date.$j.".txt";
            if (file_put_contents($fname, $buffer) === false) {
                echo "Cannot write to file ($fname)";
                //exit;
            }
            $j++;
            $buffer = '';
            $i = 0;
            // $lines += 10; // optionally grow the part size after each part
        }
        if ($row === false) {
            break; // end of file (or read error)
        }
    }
    fclose($handle);
}

$handle = fopen('source/file/path','r');
$f = 1; //new file number
while(!feof($handle))
{
$newfile = fopen('newfile/path/'.$f.'.txt','w'); //create new file to write to with file number
for($i = 1; $i <= 5000; $i++) //for 5000 lines
{
$import = fgets($handle);
//print_r($import);
fwrite($newfile,$import);
if(feof($handle))
{break;} //If file ends, break loop
}
fclose($newfile);
$f++; //Increment newfile number
}
fclose($handle);

If you want to split files which are already on the server, you can do it: simply use the file functions fopen, fread, fwrite and fseek to read and write part of the file.
If you want to split files which are uploaded from the client, I am afraid you cannot.
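For files already on the server, a byte-based split with fopen, fread and fwrite could look like the sketch below. The function name split_by_bytes and the .partN naming are assumptions for illustration. It reads in small chunks so memory use stays low, and it never calls filesize(), which can be unreliable for files over 2 GB on 32-bit builds.

```php
<?php
// Hypothetical helper: copies $source into numbered part files of at
// most $part_size bytes each and returns the list of part paths.
function split_by_bytes(string $source, string $target_dir, int $part_size, int $chunk_size = 8192): array
{
    $in = fopen($source, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $source");
    }
    $parts = [];
    $out = null;    // handle of the part currently being written
    $written = 0;   // bytes written to the current part
    while (!feof($in)) {
        // never read past the end of the current part
        $data = fread($in, min($chunk_size, $part_size - $written));
        if ($data === false || $data === '') {
            break;
        }
        if ($out === null) {
            // open parts lazily so no empty trailing part is created
            $path = $target_dir . '/' . basename($source) . '.part' . count($parts);
            $out = fopen($path, 'wb');
            if ($out === false) {
                fclose($in);
                throw new RuntimeException("Cannot create $path");
            }
            $parts[] = $path;
        }
        fwrite($out, $data);
        $written += strlen($data);
        if ($written >= $part_size) {
            fclose($out);
            $out = null;
            $written = 0;
        }
    }
    if ($out !== null) {
        fclose($out);
    }
    fclose($in);
    return $parts;
}
```

Concatenating the parts in order reproduces the original file, so each part can be processed or transferred separately and reassembled later.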

This would probably be possible in PHP, but PHP was built for web development, and trying to do this whole operation in one request will result in the request timing out.
You could, however, use another language such as Java or C# to build a background process that you notify from PHP to perform the operation. You could even run it from PHP, depending on the security settings on the host.

Splits are named as filename.part0 filename.part1 ...
<?php
function fsplit($file,$buffer=1024){
//open file to read (binary-safe)
$file_handle = fopen($file,'rb');
//get file size
$file_size = filesize($file);
//no of parts to split, rounded up so the last partial part is counted
$parts = ceil($file_size / $buffer);
//store all the file names
$file_parts = array();
//path to write the final files
$store_path = "splits/";
//name of input file
$file_name = basename($file);
for($i=0;$i<$parts;$i++){
//read buffer sized amount from file
$file_part = fread($file_handle, $buffer);
//the filename of the part
$file_part_path = $store_path.$file_name.".part$i";
//open the new file [create it] to write (binary-safe)
$file_new = fopen($file_part_path,'wb');
//write the part of file
fwrite($file_new, $file_part);
//add the name of the file to part list [optional]
array_push($file_parts, $file_part_path);
//close the part file handle
fclose($file_new);
}
//close the main file handle
fclose($file_handle);
return $file_parts;
}
?>

Caching HTML output with PHP

I would like to create a cache for the PHP pages on my site. I found many solutions, but what I want is a script that can generate an HTML page from my database. For example:
I have a page for categories which grabs all the categories from the DB, so the script should be able to generate an HTML page such as my-categories.html. Then if I choose a category I should get a my-x-category.html page, and so on for other categories and subcategories.
I can see that some web sites have URLs like: wwww.the-web-site.com/the-page-ex.html
even though they are dynamic.
Thanks a lot for your help.

Check the ob_start() function:
ob_start();
echo 'some_output';
$content = ob_get_contents();
ob_end_clean();
echo 'Content generated :'.$content;
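Building on output buffering, a page can be rendered once and the captured HTML stored in a file such as my-categories.html. The sketch below is one way to do that under simple assumptions: the helper name cached_output is hypothetical, freshness is decided only by file age, and the callable stands in for whatever code queries the database and echoes the page.

```php
<?php
// Hypothetical helper: returns the cached HTML if it is younger than
// $max_age seconds; otherwise runs $render, captures its output with
// output buffering, and stores it in $cache_file.
function cached_output(string $cache_file, int $max_age, callable $render): string
{
    if (is_file($cache_file) && time() - filemtime($cache_file) < $max_age) {
        return file_get_contents($cache_file);
    }
    ob_start();
    try {
        $render();
        $html = ob_get_contents();
    } finally {
        ob_end_clean();
    }
    // Write to a temporary file first, then rename, so readers never
    // see a half-written cache file
    $tmp = $cache_file . '.' . uniqid('', true) . '.tmp';
    if (file_put_contents($tmp, $html) !== false) {
        rename($tmp, $cache_file);
    }
    return $html;
}
```

A page script could then do echo cached_output(__DIR__ . '/cache/my-categories.html', 3600, $render); where $render is the function that prints the category list.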

You can get URLs like that using URL rewriting. E.g., for Apache, see mod_rewrite:
http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html
You don't actually need to create the files. You could create them, but it's more complicated, as you need to decide when to update them if the data changes.

In my opinion this is the best solution. I use it to cache a JSON file for my Android app, and it can easily be used in other PHP files.
It reduced the file size from ~1 MB to ~163 KB (gzip).
Create a cache folder in your directory.
Then create a cache_start.php file and paste in this code:
<?php
header("HTTP/1.1 200 OK");
//header("Content-Type: application/json");
header("Content-Encoding: gzip");
$cache_filename = basename($_SERVER['PHP_SELF']) . "?" . $_SERVER['QUERY_STRING'];
$cache_filename = "./cache/".md5($cache_filename);
$cache_limit_in_mins = 60; // one hour (converted to seconds below)
if (file_exists($cache_filename))
{
$secs_in_min = 60;
$diff_in_secs = (time() - ($secs_in_min * $cache_limit_in_mins)) - filemtime($cache_filename);
if ( $diff_in_secs < 0 )
{
print file_get_contents($cache_filename);
exit();
}
}
ob_start("ob_gzhandler");
?>
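Create cache_end.php and paste in this code:
<?php
$content = gzencode(ob_get_contents());
ob_end_clean();
// store the compressed bytes, because cache_start.php prints the cached
// file as-is after sending the Content-Encoding: gzip header
$file = fopen ( $cache_filename, 'w' );
fwrite ( $file, $content );
fclose ( $file );
echo $content;
?>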
Then create, for example, index.php (the file you want to cache):
<?php
include "cache_start.php";
echo "Hello Compress Cache World!";
include "cache_end.php";
?>

Manual caching (creating the HTML and saving it to a file) may not be the most efficient way, but if you want to go down that path I recommend the following (ripped from a simple test app I wrote to do this):
$cache_filename = basename($_SERVER['PHP_SELF']) . "?" . $_SERVER['QUERY_STRING'];
$cache_limit_in_mins = 60 * 32; // this forms 32hrs
// check if we have a cached file already
if ( file_exists($cache_filename) )
{
$secs_in_min = 60;
$diff_in_secs = (time() - ($secs_in_min * $cache_limit_in_mins)) - filemtime($cache_filename);
// check if the cached file is older than our limit
if ( $diff_in_secs < 0 )
{
// it isn't, so display it to the user and stop
print file_get_contents($cache_filename);
exit();
}
}
// create an array to hold your HTML output, this is where you generate your HTML
$output = array();
$output[] = '<table>';
$output[] = '<tr>';
// etc
// Save the output as manual cache
$file = fopen ( $cache_filename, 'w' );
// implode() takes the separator first; the reversed order was removed in PHP 8
fwrite ( $file, implode('', $output) );
fclose ( $file );
print implode('', $output);

I use APC for all my PHP caching (on an Apache server)

If you're not opposed to frameworks, try using the Zend Framework's Zend_Cache. It's pretty flexible and (unlike some of the framework modules) easy to implement.

You can use Cache_Lite from PEAR.
Details here:
http://mahtonu.wordpress.com/2009/09/25/cache-php-output-for-high-traffic-websites-pear-cache_lite/

I was thinking about this in terms of load on the database, charges for data bandwidth, and loading speed. I have some pages which are unlikely to change in years (I know it is easy to use a CMS based on a database). Unlike in the US, the cost of bandwidth here can be high. Does anybody have views on whether to create static HTML pages or dynamic ones (PHP, ASP.NET)?
Links to the pages would be stored in a database anyway.
