I would like to ask about known PHP libraries that could help me parse *.txt files into sentences.
I have to parse very large text files, so I decided to write a stream parser (sentence by sentence).
I thought it would be nice to iterate over a file by sentences, something like:
foreach (new SentenceIterator("./data/huge.txt") as $sentence)
{
    // do something...
}
The main idea is that the file should not be loaded into memory completely.
What I have tried:
$f = fopen("./data/huge.txt", "r");
$dataBytes = 64;
$buffer = '';
while (!feof($f))
{
    $data = fread($f, $dataBytes);
    $dotPosition = strpos($data, '.');
    if (false !== $dotPosition)
    {
        $sentence = $buffer . substr($data, 0, $dotPosition);
        // correct cursor position
        fseek($f, -1 * $dotPosition, SEEK_CUR);
        // clear buffer
        $buffer = '';
        continue;
    }
    $buffer .= $data;
}
But in this case I get corrupted (truncated) sentences.
Could someone suggest an existing library, or how to fix my code?
Thanks in advance.
Sorry for the inconvenience.
After some digging I found a solution in the SPL:
SplFileObject, which implements RecursiveIterator and SeekableIterator (and therefore Iterator) and lets you read a file line by line.
The updated, working code is:
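$file = new SplFileObject('./data/test.txt');
// SKIP_EMPTY only works as expected together with READ_AHEAD
$file->setFlags(SplFileObject::READ_AHEAD | SplFileObject::DROP_NEW_LINE | SplFileObject::SKIP_EMPTY);
$buffer = '';
foreach ($file as $lineNumber => $line)
{
    // join lines with a space so words from adjacent lines do not run together
    $buffer .= ('' === $buffer ? '' : ' ') . $line;
    // a line may contain several sentences, so emit every complete one
    while (false !== ($dotPos = strpos($buffer, '.')))
    {
        $sentence = substr($buffer, 0, $dotPos + 1);
        echo trim($sentence) . "\n";
        // keep only the text after the dot for the next sentence
        $buffer = substr($buffer, $dotPos + 1);
    }
}
// print whatever is left after the last dot
if ('' !== trim($buffer))
{
    echo trim($buffer) . "\n";
}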
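The line-based version still depends on the file having line breaks; a file that is one giant line would end up in memory anyway. Here is a rough sketch of the same idea built on fread() and a generator, so it can be used in a foreach as in the question. The function name readSentences, the chunk size, and the rule that a sentence ends at '.', '!' or '?' are my own assumptions; abbreviations and decimals such as 3.14 will be split as well.

```php
<?php
/**
 * Yields sentences from a file without loading it into memory.
 * Hypothetical helper: the name, chunk size and terminator set are assumptions.
 */
function readSentences(string $path, int $chunkSize = 8192): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $buffer = '';
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            if ($chunk === false) {
                break; // read error
            }
            // Treat line breaks as spaces so wrapped sentences stay in one piece.
            $buffer .= str_replace(["\r\n", "\n", "\r"], ' ', $chunk);
            // strcspn() gives the offset of the first terminator, or the
            // buffer length when there is none yet.
            while (($pos = strcspn($buffer, '.!?')) < strlen($buffer)) {
                $sentence = trim(substr($buffer, 0, $pos + 1));
                $buffer = substr($buffer, $pos + 1);
                if ($sentence !== '') {
                    yield $sentence;
                }
            }
        }
        // Text after the last terminator is returned as a final fragment.
        $rest = trim($buffer);
        if ($rest !== '') {
            yield $rest;
        }
    } finally {
        fclose($handle);
    }
}
```

Usage would then look like foreach (readSentences('./data/huge.txt') as $sentence) { ... }, with only the current chunk and the unfinished sentence held in memory.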
I want to read a file line by line, but without completely loading it in memory.
My file is too large to open in memory; if I try to do so, I always get out-of-memory errors.
The file size is 1 GB.
You can use the fgets() function to read the file line by line:
$handle = fopen("inputfile.txt", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        // process the line read.
    }
    fclose($handle);
}
if ($file = fopen("file.txt", "r")) {
    while (!feof($file)) {
        $line = fgets($file);
        if ($line === false) {
            break; // end of file or read error
        }
        # do something with the $line
    }
    fclose($file);
}
You can use the object-oriented interface for a file, SplFileObject: http://php.net/manual/en/splfileobject.fgets.php (PHP 5 >= 5.1.0)
<?php
$file = new SplFileObject("file.txt");
// Loop until we reach the end of the file.
while (!$file->eof()) {
    // Echo one line from the file.
    echo $file->fgets();
}
// Unset the file to call __destruct(), closing the file handle.
$file = null;
If you want to use foreach instead of while when opening a big file, you probably want to encapsulate the while loop inside a Generator to avoid loading the whole file into memory:
/**
 * @return Generator
 */
$fileData = function() {
    $file = fopen(__DIR__ . '/file.txt', 'r');
    if (!$file) {
        return; // die() is a bad practice, better to use return
    }
    while (($line = fgets($file)) !== false) {
        yield $line;
    }
    fclose($file);
};
Use it like this:
foreach ($fileData() as $line) {
    // $line contains current line
}
This way you can process individual file lines inside the foreach().
Note: generators require PHP >= 5.5
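The closure above hard-codes the path, and fclose() is only reached when the loop runs to the end. A sketch of a reusable variant follows; the function name readLines is made up for illustration. The try/finally is there so the handle is also closed when the caller breaks out of the foreach early, since the finally block runs when the suspended generator is destroyed.

```php
<?php
// Hypothetical helper: yields the lines of a file one at a time,
// with the trailing line ending removed.
function readLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return; // nothing to iterate over
    }
    try {
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        // Runs at normal completion and when the generator is discarded early.
        fclose($handle);
    }
}
```

It is used the same way: foreach (readLines('file.txt') as $line) { ... }.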
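There is a file() function that returns an array of the lines contained in the file. Note that it reads the whole file into memory at once, so it only suits files that fit in memory.
foreach (file('myfile.txt') as $line) {
    echo $line . "\n";
}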
None of the other responses gave the obvious answer.
PHP has a neat streaming delimiter parser, stream_get_line(), made for exactly this purpose.
$fp = fopen("/path/to/the/file", "r");
while (($line = stream_get_line($fp, 1024 * 1024, "\n")) !== false) {
    echo $line;
}
fclose($fp);
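Since the delimiter is arbitrary, the same call can split on something other than a newline. A small sketch, assuming that every sentence ends with a plain '.' and that none is longer than the limit passed as the second argument; the helper name splitOnDots is hypothetical.

```php
<?php
// Hypothetical helper: yields the text between '.' delimiters from an open stream.
function splitOnDots($fp, int $maxLength = 65536): Generator
{
    while (($part = stream_get_line($fp, $maxLength, '.')) !== false) {
        // The delimiter itself is not part of the returned string.
        $part = trim($part);
        if ($part !== '') {
            yield $part;
        }
    }
}
```

A piece longer than $maxLength is returned in several parts, so the limit should be chosen generously.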
Use buffering techniques to read the file.
$filename = "test.txt";
$source_file = fopen($filename, "r") or die("Couldn't open $filename");
while (!feof($source_file)) {
    $buffer = fread($source_file, 4096); // use a buffer of 4KB
    // $old and $new are the search and replacement strings, defined elsewhere
    $buffer = str_replace($old, $new, $buffer);
    // ... write or process $buffer here
}
fclose($source_file);
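One caveat with replacing inside fixed-size blocks: an occurrence of the search string that straddles two blocks is never seen by str_replace(). A sketch of one way around that is to hold back the last strlen($old) - 1 bytes of each block and prepend them to the next one. The helper name replaceInStream and its parameters are my own, not from any library.

```php
<?php
// Hypothetical helper: copies $in to $out, replacing $old with $new,
// including occurrences that are split across two reads.
function replaceInStream($in, $out, string $old, string $new, int $chunkSize = 4096): void
{
    if ($old === '') {
        throw new InvalidArgumentException('$old must not be empty');
    }
    $keep = strlen($old) - 1; // longest prefix of $old that can end a block
    $carry = '';
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break; // read error
        }
        $raw = $carry . $chunk;
        $offset = 0;
        // Replace every occurrence that lies completely inside $raw.
        while (($pos = strpos($raw, $old, $offset)) !== false) {
            fwrite($out, substr($raw, $offset, $pos - $offset) . $new);
            $offset = $pos + strlen($old);
        }
        // Only the last $keep bytes could still be the start of a match.
        $tailStart = max($offset, strlen($raw) - $keep);
        fwrite($out, substr($raw, $offset, $tailStart - $offset));
        $carry = substr($raw, $tailStart);
    }
    fwrite($out, $carry); // nothing can follow, so flush the held-back bytes
}
```

Both arguments are open stream handles, for example the large source file opened with 'r' and a new output file opened with 'w'.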
foreach (new SplFileObject(__FILE__) as $line) {
    echo $line;
}
One of the popular solutions to this question leaves the newline character at the end of each line. That is easily fixed with a simple str_replace.
$handle = fopen("some_file.txt", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        $line = str_replace("\n", "", $line);
    }
    fclose($handle);
}
This is how I handle very big files (tested with up to 100 GB). It is also faster than fgets().
$block = 1024 * 1024; // 1 MB, or anything larger than twice the HDD block size
if ($fh = fopen("file.txt", "r")) {
    $left = '';
    while (!feof($fh)) { // read the file
        $temp = fread($fh, $block);
        $fgetslines = explode("\n", $temp);
        $fgetslines[0] = $left . $fgetslines[0];
        // the last element may be an incomplete line, so carry it over to the next block
        if (!feof($fh)) $left = array_pop($fgetslines);
        foreach ($fgetslines as $k => $line) {
            // do something with $line
        }
    }
    fclose($fh);
}
Be careful with the 'while (!feof ...) fgets()' pattern: fgets() can hit an error (returning false) and the loop then runs forever without reaching the end of the file. codaddict was closest to being correct, but when your 'while fgets' loop ends, check feof(); if it is not true, you had an error.
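A minimal sketch of that check; the function name countLinesChecked is only for illustration. fgets() returns false both at the end of the file and on a read error, and feof() is what tells the two apart.

```php
<?php
// Hypothetical helper: counts the lines of an open file and fails loudly
// if the loop stopped for any reason other than end of file.
function countLinesChecked($handle): int
{
    $count = 0;
    while (($line = fgets($handle)) !== false) {
        $count++;
    }
    // false from fgets() without EOF means the read itself failed.
    if (!feof($handle)) {
        throw new RuntimeException('fgets() failed before reaching the end of the file');
    }
    return $count;
}
```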
SplFileObject is useful when it comes to dealing with large files.
function parse_file($filename)
{
    try {
        $file = new SplFileObject($filename);
    } catch (Exception $exception) {
        // RuntimeException if the file cannot be opened, LogicException if it is a directory
        die('SplFileObject : '.$exception->getMessage());
    }
    while ($file->valid()) {
        $line = $file->fgets();
        //do something with $line
    }
    //don't forget to free the file handle.
    $file = null;
}
<?php
echo '<meta charset="utf-8">';
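$k = 1;
$f = 1;
$fp = fopen("texttranslate.txt", "r");
while (!feof($fp)) {
    $contents = '';
    // collect up to 1500 lines per output file
    for ($i = 1; $i <= 1500; $i++) {
        // read each line once; calling fgets() twice would skip every other line
        $line = fgets($fp);
        if ($line === false) {
            break;
        }
        echo $k.' -- '. $line .'<br>'; $k++;
        $contents .= $line;
    }
    echo '<hr>';
    file_put_contents('Split/new_file_'.$f.'.txt', $contents); $f++;
}
fclose($fp);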
?>
A function that reads the file and returns it as an array of 4 KB blocks:
function read_file($filename = ''){
    $buffer = array();
    $source_file = fopen($filename, "r") or die("Couldn't open $filename");
    while (!feof($source_file)) {
        $buffer[] = fread($source_file, 4096); // use a buffer of 4KB
    }
    fclose($source_file);
    return $buffer;
}
<?php
error_reporting(E_ALL);
ini_set('display_errors' ,1);
//expression to be found in file name
$find = '.5010.';
//directory name
//we will store renamed files here
$dirname = '5010';
if(!is_dir($dirname))
    mkdir($dirname, 0777);
//read all files from a directory
//skip directories
$directory_with_files = './';
$dh = opendir($directory_with_files);
$files = array();
while (false !== ($filename = readdir($dh)))
{
    if(in_array($filename, array('.', '..')) || is_dir($filename))
        continue;
    $files[] = $filename;
}
//iterate collected files
foreach($files as $file)
{
    //check if file name is matching $find
    if(stripos($file, $find) !== false)
    {
        //open file
        $handle = fopen($file, "r");
        if ($handle)
        {
            //read file, line by line
            while (($line = fgets($handle)) !== false)
            {
                //find REF line
                $refid = 'REF*2U*';
                if(stripos($line, $refid) !== false)
                {
                    //glue reference numbers
                    //check if reference number is not empty
                    $refnumber = str_replace(array($refid, '~'), array('', ''), $line);
                    if($refnumber != '')
                    {
                        $refnumber = '_'. $refnumber .'_';
                        $filerenamed = str_replace($find, $refnumber, $file);
                        copy($file, $dirname . '/' . $filerenamed);
                    }
                    echo $refnumber . "\n";
                }
            }
            //close file
            fclose($handle);
        }
    }
}
?>
I have this code. The output should replace ".5010." in the file name with the reference number, but when I run it, the name is only shown up to the reference number and the rest of the file name is missing. I checked in PuTTY and it turns out there is a "?" after the reference number. Is there any way I can fix this?
For example, my file is 4867586.5010.476564.ed
After the code executes and reads the file, the output should be 4867586_SMIL01_476564.ed, but instead it is 4867586_SMIL01
And when I checked in PuTTY, the file name was 4867586_SMIL01?_476564.ed
The ? in the file name indicates that there is a non-printable character somewhere in the reference number line.
This is most likely a line-ending character, though it could be something else.
If it is a line ending, that can be solved by changing the line:
$refnumber = str_replace(array($refid, '~'), array('', ''), $line);
to
$refnumber = str_replace(array($refid, '~'), array('', ''), $line);
$refnumber = trim($refnumber); // remove any whitespace or line endings.
If it is something else, you'll need to sanitize your $refnumber variable, for example with one of the file name sanitizer functions available online.
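As a rough illustration of both cases in one place, here is a sketch that trims the value and then drops every byte outside printable ASCII. The helper name extractRefNumber is hypothetical, and restricting the result to printable ASCII is an assumption that fits reference numbers like SMIL01 but would be too strict for non-ASCII names.

```php
<?php
// Hypothetical helper: pulls the reference number out of a "REF*2U*...~" line.
function extractRefNumber(string $line, string $refid = 'REF*2U*'): string
{
    $ref = str_replace([$refid, '~'], '', $line);
    // trim() removes surrounding whitespace and line endings; the regex then
    // removes any remaining byte that is not printable ASCII.
    return (string) preg_replace('/[^\x20-\x7E]/', '', trim($ref));
}
```

With this, the line "REF*2U*SMIL01~" followed by a carriage return and newline yields SMIL01 with nothing trailing.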
I want to read a file line by line and append each line to a variable until its length reaches 1000 bytes. The file is relatively large,
so this is what I am doing:
if(file_exists($file))
{
    $fh = fopen($file, "r");
    while(!feof($fh) or strlen($chunk) < 10001)
    {
        $line = fgets($fh, 1000);
        $chunk = $chunk."**".$line;
    }
}
The issue is: how do I store each chunk in an array element until I reach the end of the file?
What about this:
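if (file_exists($file))
{
    $fh = fopen($file, "r");
    $chunks = array();
    $chunk = '';
    while (($line = fgets($fh)) !== false)
    {
        // add line to the buffer
        $chunk .= $line;
        // once the buffer reaches 1000 bytes, store it and start a new one
        if (strlen($chunk) >= 1000)
        {
            $chunks[] = $chunk;
            $chunk = '';
        }
    }
    // store the last, partially filled chunk
    if ($chunk !== '') $chunks[] = $chunk;
    fclose($fh);
}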
? Or am I missing something?
In PHP, when you append to a file, the data is written at the end of the existing file.
How do we prepend to a file, i.e. write at the beginning of it?
I have tried the rewind($handle) function, but it seems to overwrite the existing content instead of inserting before it.
Any ideas?
$prepend = 'prepend me please';
$file = '/path/to/file';
$fileContents = file_get_contents($file);
file_put_contents($file, $prepend . $fileContents);
The file_get_contents solution is inefficient for large files. This solution may take longer, depending on the amount of data that needs to be prepended (more is actually better), but it won't eat up memory.
<?php
$cache_new = "Prepend this"; // this gets prepended
$file = "file.dat"; // the file to which $cache_new gets prepended
$handle = fopen($file, "r+");
$len = strlen($cache_new);
$final_len = filesize($file) + $len;
$cache_old = fread($handle, $len);
rewind($handle);
$i = 1;
// shift the file forward one $len-sized block at a time
while (ftell($handle) < $final_len) {
    fwrite($handle, $cache_new);
    $cache_new = $cache_old;
    $cache_old = fread($handle, $len);
    fseek($handle, $i * $len);
    $i++;
}
fclose($handle);
?>
$filename = "log.txt";
$file_to_read = @fopen($filename, "r");
$old_text = @fread($file_to_read, 1024); // max 1024 bytes; anything beyond that is lost
@fclose($file_to_read);
$file_to_write = fopen($filename, "w");
fwrite($file_to_write, "new text".$old_text);
fclose($file_to_write);
Another (rough) suggestion:
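$tempFile = tempnam('/tmp/dir', 'prepend'); // tempnam() needs a directory and a prefix; this prefix is arbitrary
$fhandle = fopen($tempFile, 'w');
fwrite($fhandle, 'string to prepend');
$oldFhandle = fopen('/path/to/file', 'r');
// fread() returns an empty string at end of file, not false, so test feof() instead
while (!feof($oldFhandle)) {
    $buffer = fread($oldFhandle, 10000);
    fwrite($fhandle, $buffer);
}
fclose($fhandle);
fclose($oldFhandle);
rename($tempFile, '/path/to/file');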
This has the drawback of using a temporary file, but is otherwise pretty efficient.
When using fopen(), the mode determines where the file pointer is placed (i.e. at the beginning or the end of the file).
$afile = fopen("file.txt", "r+");
'r' Open for reading only; place the file pointer at the beginning of the file.
'r+' Open for reading and writing; place the file pointer at the beginning of the file.
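Keep in mind that placing the pointer at the beginning does not make writes insert data: bytes written in 'r+' mode replace whatever is already at that position. A quick demonstration on a scratch file (the temp file and its contents are made up for the example):

```php
<?php
// Create a scratch file with known contents.
$path = tempnam(sys_get_temp_dir(), 'rplus');
file_put_contents($path, '0123456789');

$handle = fopen($path, 'r+'); // pointer starts at offset 0
fwrite($handle, 'abc');       // overwrites "012", nothing is shifted
fclose($handle);

$result = file_get_contents($path);
unlink($path);
echo $result, "\n"; // prints abc3456789
```

So 'r+' on its own only helps when the new text may overwrite the start of the file; a true prepend still needs the old content to be moved or rewritten.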
$file = fopen('filepath.txt', 'r+') or die('Error');
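$txt = "\n".$string;
fwrite($file, $txt);
fclose($file);
This writes a line break followed by your string at the start of the text file, so the next time you write to it you replace that blank line with a new blank line and your string. Keep in mind that 'r+' overwrites the bytes already at that position rather than shifting them.
This is the simplest trick I know of.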