Read SELECTED contents from a large text file (varying length text)


I'm looking to read the contents of a file between two tags in a large text file (I can't read the whole file at once because of memory restrictions imposed by my server provider). The file has around 500,000 lines of text.
This ( PHP: Read Specific Line From File ) isn't an option (I don't think), as the text I need to read varies in length and spans multiple lines (anywhere from 20 to 5000 lines).
I am planning to use fopen, fread (read only) and fclose to read the file contents. I have experience of using these functions already.
I am looking to read all the contents in a selected part of the file, i.e.
Example file contents:
<<TAGNAME-1>>AAAA AAAA AAAA<<//TAGNAME-1>>
<<TAGNAME-2>>TEXT TEXT TEXT<<//TAGNAME-2>>
I want to select the text "AAAA AAAA AAAA" between <<TAGNAME-1>> and <<//TAGNAME-1>> when TAGNAME-1 is passed as a variable in my script.
How could I go about selecting all the text between the two tags that I require (and ignoring the remainder of the file)? I am able to create the two tags where required in my PHP script; my issue is implementing this with the fread function.

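You could grep the text file, which would return only the lines containing a matching tag.
$tagnum = 2; // tag number to look for
$pattern = "<<TAGNAME-";
$searchstr = $pattern . $tagnum . ">>"; // full opening tag, so TAGNAME-2 does not also match TAGNAME-20
$fpath = "testtext.txt"; // path to the text file
// -F treats the tag as a fixed string; escapeshellarg() protects shell metacharacters such as "<"
exec('grep -inF ' . escapeshellarg($searchstr) . ' ' . escapeshellarg($fpath), $lines);
echo implode("\n", $lines); // the return value of exec() holds only the last line of output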
Where $tagnum defines which tag to search for. I've tested it in my sandbox and it works as expected. Note that grep returns each matching line in full, up to the newline, so text that continues on later lines is not included.
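For tagged text that spans many lines, reading line by line with fgets keeps memory use proportional to the selected section rather than to the whole file. The following is a minimal sketch, not a tested production routine: readTaggedSection is a hypothetical helper name, and it assumes the tags are written exactly as <<NAME>> and <<//NAME>> and that neither tag is split across a line break.

```php
<?php
// Hypothetical helper: returns the text between <<$tag>> and <<//$tag>>,
// or null if the file cannot be opened or the section is not found.
function readTaggedSection(string $path, string $tag): ?string
{
    $open  = "<<{$tag}>>";
    $close = "<<//{$tag}>>";

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    $buffer = '';
    $inside = false;
    $result = null;

    // fgets() holds only one line in memory at a time.
    while (($line = fgets($handle)) !== false) {
        if (!$inside) {
            $start = strpos($line, $open);
            if ($start === false) {
                continue; // still before the section: skip this line
            }
            $inside = true;
            // Keep only what follows the opening tag on this line.
            $line = (string) substr($line, $start + strlen($open));
        }

        $end = strpos($line, $close);
        if ($end !== false) {
            // Closing tag found: finish and stop reading the file.
            $result = $buffer . substr($line, 0, $end);
            break;
        }

        $buffer .= $line;
    }

    fclose($handle);
    return $result;
}
```

Called as readTaggedSection('testtext.txt', 'TAGNAME-1') on the example file from the question, this should return AAAA AAAA AAAA. Reading stops at the closing tag, so the rest of the file is never loaded.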

Related

Binary-safe write to a file with PHP to create a DBF file

I need to split a big DBF file using PHP functions. For example, if I have 1000 records, I have to create 2 files with 500 records each.
I do not have any dbase extension available, nor can I install one, so I have to work with basic PHP functions. Using the basic fread function I'm able to read and parse the file correctly, but when I try to write a new DBF I run into problems.
As I have understood, the DBF file is structured as a 2-line file: the first line contains file info and header info, and it's in binary. The second line contains the data and it's plain text. So I thought I could simply write a new binary file replicating the first line and manually add the first records to the first file and the other records to the other file.
This is the code I use to parse the file, and it works nicely:
$fdbf = fopen($_FILES['userfile']['tmp_name'],'r');
$fields = array();
$buf = fread($fdbf,32);
$header=unpack( "VRecordCount/vFirstRecord/vRecordLength", substr($buf,4,8));
$goon = true;
$unpackString='';
while ($goon && !feof($fdbf)) { // read fields:
    $buf = fread($fdbf,32);
    if (substr($buf,0,1)==chr(13)) {
        $goon=false; // end of field list
    } else {
        $field=unpack( "a11fieldname/A1fieldtype/Voffset/Cfieldlen/Cfielddec", substr($buf,0,18));
        $unpackString.="A$field[fieldlen]$field[fieldname]/";
        array_push($fields, $field);
    }
}
fseek($fdbf, 0);
$first_line = fread($fdbf, $header['FirstRecord']+1);
fseek($fdbf, $header['FirstRecord']+1); // move back to the start of the first record (after the field definitions)
$first_line is the variable that contains the header data, but when I try to write it to a new file something goes wrong and the row isn't written exactly as it was read. This is the code I use for writing:
$handle_log = fopen($new_filename, "wb");
fwrite($handle_log, $first_line, strlen($first_line) );
fwrite($handle_log, $string );
fclose($handle_log);
I've tried adding the b flag to the fopen mode parameter, as suggested, to open the file in binary mode. I've also followed a suggestion to pass the exact length of the string to avoid characters being stripped, but without success: none of the files written are in valid DBF format. What can I do to achieve my goal?

"As I have understood, the DBF file is structured as a 2-line file: the first line contains file info and header info, and it's in binary. The second line contains the data and it's plain text."
Well, it's a bit more complicated than that.
See here for a full description of the dbf file format.
So it would be best if you could use a library to read and write the dbf files.
If you really need to do this yourself, here are the most important parts:
DBF is a binary file format, so you have to read and write it as binary. For example, the number of records is stored as a 32-bit integer, which can contain zero bytes.
Be careful with text-oriented string functions on that binary data. PHP's own strlen() is binary-safe and counts null bytes (unless mbstring.func_overload remaps it to a multibyte version), but anything that trims, re-encodes, or otherwise treats the data as text can change bytes such as the zeros in that 32-bit integer.
If you split the file (the records), you'll have to adjust the record count in the header.
When splitting the records, keep in mind that each record is preceded by one extra byte: a space (0x20) if the record is not deleted, or an asterisk (0x2A) if it is. For example, with 4 fields of 10 bytes each, every record is 41 bytes long. That value is also available in the header: bytes 10-11 hold the number of bytes per record as a 16-bit number, least significant byte first.
The file could end with the end-of-file marker 0x1A, so you'll have to check for that as well.
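The points above can be sketched roughly as follows. This is an illustration under stated assumptions, not a complete DBF writer: writeDbfSlice is a hypothetical helper, and it assumes the header layout used in the question's own unpack() call (record count in bytes 4-7, header length in bytes 8-9, record length in bytes 10-11, all little-endian). It copies the header byte for byte, patches only the record count, and appends the 0x1A marker.

```php
<?php
// Hypothetical helper: copies $count records, starting at record index $start,
// from the DBF file $src into a new DBF file $dst. Returns the number written.
function writeDbfSlice(string $src, string $dst, int $start, int $count): int
{
    $in  = fopen($src, 'rb');
    $out = fopen($dst, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open DBF file');
    }

    // Bytes 4-7: record count, 8-9: header length, 10-11: record length
    // (assumed layout, matching the unpack() call in the question).
    $info  = unpack('Vcount/vheaderLen/vrecordLen', substr(fread($in, 32), 4, 8));
    $count = max(0, min($count, $info['count'] - $start));

    // Copy the header unchanged, then patch the record count in place.
    rewind($in);
    $header = fread($in, $info['headerLen']);
    fwrite($out, substr_replace($header, pack('V', $count), 4, 4));

    // Each record is recordLen bytes, including its leading deletion flag.
    fseek($in, $info['headerLen'] + $start * $info['recordLen']);
    for ($i = 0; $i < $count; $i++) {
        fwrite($out, fread($in, $info['recordLen']));
    }

    fwrite($out, "\x1A"); // end-of-file marker
    fclose($in);
    fclose($out);
    return $count;
}
```

Splitting a 1000-record file into two halves would then be two calls, for example writeDbfSlice($src, 'part1.dbf', 0, 500) and writeDbfSlice($src, 'part2.dbf', 500, 500), where the output file names are placeholders.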

Parse CSV content

This must be relatively easy, but I'm struggling to find a solution. I receive data using proprietary network protocol with encryption and at the end the entire received content ends up in a variable. The content is actually that of a CSV file - and I need to parse this data.
If this were a regular file on disk, I could use fgetcsv; if I could somehow break the content into individual records, I could use str_getcsv - but how can I break this file into records? Simple reading until a newline will not work, because CSV can contain values with line breaks in them. Below is an example set of data:
ID,SLN,Name,Address,Contract no
123,102,Market 1a,"Main street, Watertown, MA, 02471",16
125,97,Sinthetics,"Another address,
Line 2
City, NY 10001",16
167,105,"Progress, ahead",,18
All of this data is held inside one variable - and I need to parse it.
Of course, I can always write this data into a temporary file on disk, then read and parse it using fgetcsv, but that seems extremely inefficient to me.

If fgetcsv works for you, consider this:
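// Each fopen("php://temp") creates a new, separate stream, so write to and read from the same handle.
$stream = fopen("php://temp","r+");
fwrite($stream,$your_data_here);
rewind($stream);
// $result = fgetcsv($stream); ...
For more on php://temp, see the PHP manual page on the php:// wrappers.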
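As a fuller illustration of the same idea, here is a sketch that turns CSV text held in a variable into rows keyed by the header line. parseCsvString is a hypothetical helper name, and the sketch assumes every record has the same number of fields as the header (array_combine fails otherwise). Because fgetcsv does the splitting, quoted values containing line breaks stay in one field.

```php
<?php
// Hypothetical helper: parses CSV text held in a string into rows keyed by the header line.
function parseCsvString(string $csv): array
{
    // php://temp stays in memory until it grows past a size threshold.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $csv);
    rewind($stream);

    $header = null;
    $rows = [];
    while (($fields = fgetcsv($stream, 0, ',', '"', '\\')) !== false) {
        if ($fields === [null]) {
            continue; // fgetcsv reports a blank line as an array holding a single null
        }
        if ($header === null) {
            $header = $fields; // first record names the columns
            continue;
        }
        $rows[] = array_combine($header, $fields);
    }

    fclose($stream);
    return $rows;
}
```

With the sample data from the question, the second data row's Address value would come back as one string containing its two embedded line breaks.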

Concatenate RTF files with PHP without the header

I have some RTF files generated by users with Microsoft Word. I need to be able to concatenate these files, and the resulting file should still be readable by LibreOffice. I'm using LibreOffice to convert the resulting file into a PDF file.
To concatenate two files, my application removes the last character of the first file and the first character of the other file. The file headers are not removed (I'm not talking about page headers).
For some reason, LibreOffice does not like the headers inserted by Microsoft Word. But it works fine if I open these files with Wordpad and save them.
Another way to remove these headers is to convert these files into RTF before I concatenate them. That way I can convert to PDF, but LibreOffice makes a serious mess of my tabs when I convert my files to RTF.
So how can I remove the headers through PHP without messing up the tabs? Or do you have another way to get the same result?
Edit :
In a nutshell, I must be able to concatenate these files so that LibreOffice can open the result, and my tabs must still display nicely in Microsoft Word.
As you can guess, users don't want to use Wordpad, and my customer's IT department has to comply with that wish (office politics).
UPDATE :
I have to do the merging first, because of business rules. The files are merged, then my users can modify the result using Word (no problems here). Then they ask their boss to validate it. If the boss agrees to validate, the RTF file becomes a PDF file.
UPDATE 2 :
I have the beginning of a solution. If the RTF file starts with plain text or a picture, you have to remove everything up to \pard. But this does not work if your file starts with a tab.
UPDATE 3 :
If you want to support tabs too, you have to remove everything up to \pard or \trowd. I'm going to post the complete solution once I have working code. This will work fine as long as you don't need colours and all your files use the same font (because we don't remove the RTF headers of the first file).

If the limitations with the 'pure RTF' approach come back to bite you, you could use LibreOffice to convert your RTF files to docx, then use a tool to merge the docx files.
There are such tools for .NET and Java (such as our MergeDocx product); I'm not sure what you'll find for PHP.

I succeeded in building reliable code that makes it possible to manipulate RTF files created with Microsoft Word. It works as long as you only need text, pictures and tabs, and don't need fancy things such as color. Color works for text, but beyond that ...
$content = "";
// stristr() returns the haystack from the first occurrence of the needle to the end, or false.
// Single quotes keep the backslash literal. In double quotes "\t" becomes a tab character,
// which is why "\trowd" was never detected.
$tmp_pard = stristr($RTFstring, '\pard');
$tmp_tab = stristr($RTFstring, '\trowd');
if ($tmp_pard !== false || $tmp_tab !== false) {
    // Pick the longer remainder: it starts at whichever of \pard or \trowd occurs first.
    if (strlen((string) $tmp_pard) > strlen((string) $tmp_tab)) {
        // "{" is prepended so the concatenation code still works. Only the headers are removed.
        $content = "{" . $tmp_pard;
    } else {
        $content = "{" . $tmp_tab;
    }
} else {
    $content = $RTFstring;
}
return $content;

compare 4 or more files

Is there a command-line utility or a PHP/Python script that generates an HTML diff so that 4 or more files can be compared at once?
Each of my files has at most 10k lines.
Note: these are plain text files, not HTML. They contain only A-Za-z0-9=., . and no HTML tags.

It depends what type of data you're comparing/analyzing.
The basic solution is:
file_get_contents gives you the file data as strings
strcmp does a "binary-safe compare" of the data
You will probably want to explode() your data to delimit it somehow, and compare sections of the data.
Another option is to delimit, loop through, and make a "comparison coefficient" which would indicate to what degree the files deviate from a norm. For example, File 1 has cc=3, file 4 has cc=8. File 4 would be a closer match.
A final problem you'll run into is the memory limit on the server computer. You can change this in php.ini.
//EDIT
Just noticed the diff tag, but I'll leave this up anyway in case it helps somehow.
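One possible reading of the "comparison coefficient" idea, sketched under loud assumptions: deviationFromBaseline is a hypothetical helper, the first file is arbitrarily treated as the norm, and lines are compared by position only, so a single inserted line shifts everything after it. It is not a real diff and produces no HTML; it only gives a rough score per file.

```php
<?php
// Hypothetical helper: for each file, counts the line positions whose content
// differs from the baseline file at the same position.
function deviationFromBaseline(string $baselinePath, array $otherPaths): array
{
    $base = file($baselinePath, FILE_IGNORE_NEW_LINES);
    $scores = [];

    foreach ($otherPaths as $path) {
        $lines = file($path, FILE_IGNORE_NEW_LINES);
        $max = max(count($base), count($lines));
        $differing = 0;

        for ($i = 0; $i < $max; $i++) {
            // A line missing from either file counts as a difference.
            if (($base[$i] ?? null) !== ($lines[$i] ?? null)) {
                $differing++;
            }
        }
        $scores[$path] = $differing;
    }

    return $scores;
}
```

A score of 0 means the file matches the baseline line for line; larger scores mean more positions differ.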

Problems with replacing text in a text file

I have the following scenario.
Every time my page loads I create a file. The file contains two tags, {theme}{/theme} and {layout}{/layout}. Every time I choose a certain layout or theme, the tags should be replaced with {layout}layout{/layout} and {theme}theme{/theme}.
My issue is what happens after I run the following code:
if (!file_exists($_SESSION['file'])) {
    $fh = fopen($_SESSION['file'],"w");
    fwrite($fh,"{theme}{/theme}\n");
    fwrite($fh,"{layout}{/layout}");
    fclose($fh);
}
$handle = fopen($_SESSION['file'],'r+');
if ($_REQUEST['theme']) {
    $theme = ($_REQUEST['theme']);
    // Replace the theme tag in the cache file so the choice is remembered
    while ($line=fgets($handle)) {
        $line = preg_replace("/{theme}.*{\/theme}/","{theme}".$theme."{/theme}",$line);
        fwrite($handle, $line);
    }
}
My output looks as follows
{theme}{/theme}
{theme}green{/theme}
And it needs to look like this
{theme}green{/theme}
{layout}layout1{/layout}

I rarely use random-access file operations and prefer to read the whole file as text and write it back, so I might be wrong here. But as far as I can see, you read the first line (so the pointer is then at the beginning of the second line). Then you write '{theme}green{/theme}' into the file, so it overwrites the text at that position (the second line).
In this case (as your data is small), you are better off reading the whole file, changing it as a string, and writing it back.
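A minimal sketch of that read, change, write-back approach, assuming the file is small enough to hold in memory. setTagValue is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: replaces whatever sits between {tag} and {/tag} with $value.
function setTagValue(string $path, string $tag, string $value): void
{
    // Read the whole file as one string.
    $contents = file_get_contents($path);

    $quoted  = preg_quote($tag, '/');
    $pattern = '/\{' . $quoted . '\}.*?\{\/' . $quoted . '\}/s';

    // A callback avoids "$1" or backslashes in $value being treated as references.
    $contents = preg_replace_callback(
        $pattern,
        function () use ($tag, $value) {
            return '{' . $tag . '}' . $value . '{/' . $tag . '}';
        },
        $contents
    );

    // Write the changed string back, replacing the old contents.
    file_put_contents($path, $contents, LOCK_EX);
}
```

In the question's script this might be called as setTagValue($_SESSION['file'], 'theme', $theme), and the same helper would cover the layout tag.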
