So I have this script that parses through a file and inserts data into my DB. But now the files I'm looking at have a bunch of random text (basically a key for the rest of the log) before the stuff that I need. So I need to find the timestamp in the file and then operate as I usually would from there, ignoring everything before the timestamp. The logs look like this:
(Blah blah, random stuff)
'2004-05-12 15:45:00',0,0,0,141713,,123.288,122.449,123.2...
And here's my code for once I 'hit' the timestamp, storing the values in an array:
// read and store the values in an array
$i = 0;
$val = array();
while (($buffer = gzgets($fp, 8192)) !== false)
{
    $val[$i] = $buffer;
    $i++;
}

$qry = "insert into afeed
        set time_stamp='".$val[0]."',
        error_value='".$val[1]."',
        firstThing='".$val[2]."',
        ...
        otherstuff='".$val[12]."',
        lastThing='".$val[13]."'";
So I could either look for the timestamp and start from there, or, since all the useless stuff at the beginning is the same every time, I was thinking I could find out its size and 'skip that many bytes'.
Is any of this possible?
Both of your potential solutions are possible. The "skip that many bytes" solution would probably be the most straightforward. If you know ahead of time that the length of the useless stuff is going to be, for example, 64 bytes at the beginning of every file, you can use something like gzseek($fp, 64) to move the file pointer past the useless stuff.
Since you said the useless stuff is the same every time, this should work. If the useless stuff is not a predetermined length, this method could still be used once you determined how long the useless stuff is (I'd need to see what the useless stuff is to decide on the best method for doing this).
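For illustration, here's a minimal sketch of the gzseek() approach. The filename and the 64-byte preamble length are just example values; you'd measure the real length from your own files.
// Minimal sketch: skip a known-length preamble, then read lines as before.
// 'yourlog.gz' and the 64-byte length are example values, not from the original post.
$preambleLength = 64;
$fp = gzopen('yourlog.gz', 'r');
gzseek($fp, $preambleLength); // jump straight past the useless stuff

$i = 0;
$val = array();
while (($buffer = gzgets($fp, 8192)) !== false) {
    $val[$i] = $buffer;
    $i++;
}
gzclose($fp);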
Related
I am writing some json results in files in PHP on shared hosting (fwrite).
Then I read those files to extract json results (file_get_contents).
It happens sometimes (maybe one time out of more than a thousand) that when I read this file it appears truncated: I can only read a multiple of the first 32768 bytes of the file.
I added some code to copy the file I am reading whenever the JSON string is not valid, and I then get two different files: the original one was correctly written, as it contains a valid JSON string, while the copy contains only the beginning of the original and has a size of x*32768 bytes.
Would you have any idea of what could be the problem and how to solve this? (I don't know how to investigate further)
Thank you
Without example code it is impossible to give a 'fix my code' answer, but when doing this sort of file write/read programming you should follow a simple process (which, from the description, is missing one fairly critical step!).
First, write to a TEMP file. You are writing to a file already, but it is important here to write to a TEMP file; otherwise you could have race conditions.
An easy way to do that in PHP:
$yourData = "whateverYourDataIs....";
$goodfilename = 'whateverYourGoodFileNameIsForYourData.json';
$tempfilename = 'tempfile' . time(); // MANY ways to do this (lots of SO posts on it) - just get a unique name every time you write ('unique' may not be needed if you only occasionally write, but it is a good safety measure to avoid collisions, and time() works for many programs).

// Now, write your data to $tempfilename (file_put_contents opens, writes and closes for you).
$written = file_put_contents($tempfilename, $yourData);
if ($written === false) {
    // the write failed, so do whatever 'error' handling you may need
    // since it failed there should be no file, but it's not a bad idea to attempt to delete it
    unlink($tempfilename);
}
else {
    // the write succeeded, so let's do a 'sanity check' on the file to make sure it is good JSON
    // (this is a 'paranoid' check, but "better safe than sorry", right?)
    if (json_decode(file_get_contents($tempfilename)) !== null) {
        // we know the file is good JSON, so now RENAME (this is really fast, so collisions are almost impossible)
        // NOTE: see http://php.net/manual/en/function.rename.php comments for some potential challenges and workarounds if you have trouble with rename.
        rename($tempfilename, $goodfilename);
    }
    // Now, the GOOD file will contain your new data - and those read issues are gone!
    // (though never say 'never' - it may be possible, but very unlikely!)
}
This may or may not be your issue directly, and you will have to adapt it to fit your code, but as a safety factor (and a good way to avoid collisions) it should give you ~100% read success, which I believe is what you are after.
If this doesn't help, then some actual code will be needed to provide a more complete answer.
As suggested by @UlrichEckhardt's comment, it was due to a read/write concurrency problem: I was trying to read a file that was still being written. I solved this by just waiting before trying to read the file again.
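For reference, the wait-and-retry approach could look roughly like this; the filename, attempt count and delay are assumptions, not values from the original post.
// Rough sketch of wait-and-retry: if the JSON doesn't decode, the file is
// probably still being written, so pause briefly and read it again.
// Filename, attempt count and delay are placeholder assumptions.
$filename = 'results.json';
$data = null;
for ($attempt = 0; $attempt < 5; $attempt++) {
    $contents = file_get_contents($filename);
    $data = json_decode($contents, true);
    if ($data !== null) {
        break; // got valid JSON
    }
    usleep(200000); // wait 200 ms before retrying
}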
In PHP, I use fopen( ), fgets( ), and fclose( ) to read a file line by line. It works well. But I have a script (being run from the CLI) that has to process three hundred 5GB text files. That's approximately 3 billion fgets( ). So it works well enough but at this scale, tiny speed savings will add up extremely fast. So I'm wondering if there are any tricks to speed up the process?
The only potential thing I thought of was getting fgets() to read more than one line at once. It doesn't look like it supports that, but I could in theory do, let's say, 20 consecutive $line[] = fgets($file); calls and then process the array. That's not quite the same thing as reading multiple lines in one command, so it may not have any effect. But I do know that queuing my MySQL inserts and sending them as one giant insert (another trick I'm going to implement in this script after more testing and benchmarking) will save a lot of time.
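(For what it's worth, the batched-insert trick looks roughly like this; the table name, column names, batch size and the $mysqli connection below are made-up placeholders, not part of the original script.)
// Rough sketch of queuing rows and sending one multi-row INSERT.
// $mysqli, the table/column names and the 1000-row batch size are assumptions.
$rows = array();
foreach ($lines as $line) { // $lines = whatever you parsed out of the file
    $parts = explode(',', trim($line));
    $rows[] = "('" . $mysqli->real_escape_string($parts[0]) . "','"
            . $mysqli->real_escape_string($parts[1]) . "')";
    if (count($rows) >= 1000) { // flush every 1000 rows
        $mysqli->query("INSERT INTO my_table (col_a, col_b) VALUES " . implode(',', $rows));
        $rows = array();
    }
}
if ($rows) { // flush whatever is left
    $mysqli->query("INSERT INTO my_table (col_a, col_b) VALUES " . implode(',', $rows));
}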
Update 4/13/19
Here is the solution I went with. Originally I had a much more complicated method of slicing off the end of each read, but then I realized you can do it much simpler.
$read_length = 131072; // 128 KB per read (see the benchmarking note below)
$index_file = fopen( $path_to_file, "r" );
$chunk = "";
while ( !feof($index_file) )
{
    $chunk .= fread($index_file, $read_length);
    $payload_lines = explode("\n", $chunk);
    if ( !feof($index_file) )
    {
        // put the last (possibly partial) line back into the buffer for the next read
        $chunk = array_pop($payload_lines);
    }
    // ... process $payload_lines here ...
}
fclose($index_file);
Of course PHP has a function for everything. So I break every read into an array of lines, and array_pop() the last item in the array back onto the beginning of the 'read buffer'. That last piece is probably a partial line, but not necessarily. Either way, it goes back in and gets processed with the next loop (unless we're done with the file, in which case we don't pop it).
The only thing you have to watch out for is a line so long that a single read won't capture the whole thing. But if you know your data, that probably won't be a problem. For me, I'm parsing a json-ish file, and since I'm reading 128 KB at a time, there are always many line breaks in my read.
Note: I settled on 128 KB by doing a million benchmarks and finding the size my server processes the absolute fastest. This parsing function will run 300 times so every second I save, saves me 5 minutes of total runtime.
One possible approach that might be faster would be to read large chunks of the file in with fread(), split them by newlines and then process the lines. You'd have to take into account that the chunks may sever lines, and you'd have to detect this and glue them back together.
Generally speaking the larger the chunk you can read in one go the faster your process should become. Within the limits of your available memory.
From fread() docs:
Note that fread() reads from the current position of the file pointer. Use ftell() to find the current position of the pointer and rewind() to rewind the pointer position.
Hey guys, I've seen a lot of options for fread() (which requires a file, or writing to memory), but I am trying to invalidate an input based on a string that has already been accepted (unknown format). I have something like this:
if (FALSE !== str_getcsv($this->_contents, "\n"))
{
    foreach (preg_split("/\n/", $this->_contents) AS $line)
    {
        $data[] = explode(',', $line);
    }
    print_r($data); die;
    $this->_format = 'csv';
    $this->_contents = $this->trimContents($data);
    return true;
}
Which works fine on a real CSV or a CSV-filled variable, but when I try to pass it garbage to invalidate it, something like https://www.gravatar.com/avatar/625a713bbbbdac8bea64bb8c2a9be0a4 (which is garbage, since it's a PNG), it believes it's CSV anyway and keeps on chugging along until the program chokes. How can I fix this? I have not seen any CSV validators that are not at least several classes deep; is there a simple three- or four-liner to (in)validate?
is there a simple three- or four-liner to (in)validate?
Nope. CSV is so loosely defined - it has no telltale signs like header bytes, and there isn't even a standard for what character is used for separating columns! - that there technically is no way to tell whether a file is CSV or not - even your PNG could technically be a gigantic one-column CSV with some esoteric field and line separator.
For validation, look at what purpose you are using the CSV files for and what input you are expecting. Are the files going to contain address data, separated into, say, 10 columns? Then look at the first line of the file, and see whether enough columns exist, and whether they contain alphanumeric data. Are you looking for a CSV file full of numbers? Then parse the first line, and look for the kinds of values you need. And so on...
If you have an idea of the kinds of CSVs likely to make it to your system, you could apply some heuristics -- at the risk of not accepting valid CSVs. For instance, you could look at line length, consistency of line length, special characters, etc...
If all you are doing is checking for the presence of commas and newlines, then any sufficiently large, random file will likely have those and thus pass such a CSV test.
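As a rough illustration of that kind of heuristic (the expected column count and the printable-character check below are assumptions you would tune to your own data):
// Rough heuristic sketch, not a real CSV validator. The expected column
// count (10) and the printable-character check are tunable assumptions.
function looksLikeCsv($contents, $expectedColumns = 10) {
    $firstLine = strtok($contents, "\r\n");
    if ($firstLine === false) {
        return false;
    }
    // Reject lines full of binary / non-printable bytes (e.g. a PNG header).
    if (preg_match('/[^\x20-\x7E]/', $firstLine)) {
        return false;
    }
    $fields = str_getcsv($firstLine);
    return count($fields) === $expectedColumns;
}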
I have a CSV file with records sorted on the first field. I managed to write a function that does a binary search through that file, using fseek() for random access.
However, this is still a pretty slow process, since when I seek to some file position I actually need to look left for a \n character, so I can make sure I'm reading a whole line (once the whole line is read, I can check the first field value mentioned above).
Here is the function that returns the line containing the character at position $x:
function fgetLineContaining( $fh, $x ) {
    if ( $x > 125145411 ) // 125145411 is the last pos in my file
        return "";
    // now go as far left as possible, until a newline is found
    // or the beginning of the file
    $c = "";
    while ( $x > 0 && $c != "\n" && $c != "\r" ) {
        fseek($fh, $x);
        $x--;             // go left in the file
        $c = fgetc($fh);
    }
    $x += 2;              // skip the newline char
    fseek($fh, $x);
    return fgets($fh, 1024); // return the line from here until \n
}
While this is working as expected, I have to say that my CSV file has ~1.5 million lines, and these left-seeks are slowing things down quite a bit.
Is there a better way to seek to the line containing position x inside a file?
Also, it would be much better if an object of a class could be saved to a file without serializing it, thus enabling reading of the file object-by-object. Does PHP support that?
Thanks
I think you really should consider using SQLite or MySQL again (like others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Were you using bulk queries? Were you using prepared statements? Did the SQL process have enough RAM to store its indexes?
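For example, with SQLite an indexed lookup through a prepared statement would look something like this (the database file, table and column names are placeholders):
// Sketch of an indexed SQLite lookup with a prepared statement.
// 'records.db', the table and the column names are placeholder assumptions.
$db = new SQLite3('records.db');
$db->exec('CREATE INDEX IF NOT EXISTS idx_records_key ON records(key_field)');

$stmt = $db->prepare('SELECT * FROM records WHERE key_field = :key');
$stmt->bindValue(':key', $searchValue, SQLITE3_TEXT);
$result = $stmt->execute();
$row = $result->fetchArray(SQLITE3_ASSOC);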
One thing you can possibly try to speed things up under the current algorithm is to load the (~100MB?) file onto a RAM disc. No matter what you choose to do, either CSV or SQLite, this WILL help speed things up, especially if hard drive seek time is your bottleneck.
You could possibly even read the whole file into PHP arrays (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups.
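If the file really does fit in memory, a sketch of that idea, combined with a binary search on the sorted first field, could look like this (the filename and the comma separator are assumptions):
// Illustrative only: load the sorted CSV into an array and binary-search
// the first field. Assumes the whole file fits comfortably in memory.
$lines = file('data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

function findByFirstField(array $lines, $needle) {
    $lo = 0;
    $hi = count($lines) - 1;
    while ($lo <= $hi) {
        $mid = (int)(($lo + $hi) / 2);
        $key = strtok($lines[$mid], ','); // first CSV field
        $cmp = strcmp($key, $needle);
        if ($cmp === 0) {
            return $lines[$mid]; // found the record
        } elseif ($cmp < 0) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return null; // not found
}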
Also, one thing to keep in mind: PHP isn't exactly fast at doing low-level things. You might want to consider moving away from PHP in favor of C or C++.
What is the best way to write to files in a large PHP application? Let's say there are lots of writes needed per second. What is the best way to go about this?
Could I just open the file and append the data, or should I open, lock, write and unlock?
What will happen if the file is being worked on and other data needs to be written? Will that activity be lost, or will it be saved? And if it is saved, will it halt the application?
If you have been, thank you for reading!
Here's a simple example that highlights the danger of simultaneous writes:
<?php
for ($i = 0; $i < 100; $i++) {
    $pid = pcntl_fork();
    // only spawn more children if we're not a child ourselves
    if (!$pid)
        break;
}

$fh = fopen('test.txt', 'a');

// The following is a simple attempt to get multiple threads to start at the same time.
$until = round(ceil(time() / 10.0) * 10);
echo "Sleeping until $until\n";
time_sleep_until($until);

$myPid = posix_getpid();
// create a line starting with pid, followed by 10,000 copies of
// a "random" char based on pid.
$line = $myPid . str_repeat(chr(ord('A') + $myPid % 25), 10000) . "\n";

for ($i = 0; $i < 1; $i++) {
    fwrite($fh, $line);
}

fclose($fh);
echo "done\n";
If appends were safe, you should get a file with 100 lines, all of which are roughly 10,000 chars long and begin with an integer. And sometimes, when you run this script, that's exactly what you'll get. Sometimes, however, a few appends will conflict and the file will get mangled.
You can find corrupted lines with grep '^[^0-9]' test.txt
This is because file append is only atomic if:
You make a single fwrite() call
and that fwrite() is smaller than PIPE_BUF (somewhere around 1-4k)
and you write to a fully POSIX-compliant filesystem
If you make more than a single call to fwrite during your log append, or you write more than about 4k, all bets are off.
Now, as to whether or not this matters: are you okay with having a few corrupt lines in your log under heavy load? Honestly, most of the time this is perfectly acceptable, and you can avoid the overhead of file locking.
I have a high-performance, multi-threaded application where all threads write (append) to a single log file. So far I have not noticed any problems with that: each thread writes multiple times per second and nothing gets lost. I think just appending to a huge file should be no issue. But if you want to modify already existing content, especially with concurrency, I would go with locking, otherwise a big mess can happen...
If concurrency is an issue, you should really be using databases.
If you're just writing logs, maybe you should take a look at the syslog() function, since syslog provides an API.
You could also delegate writes to a dedicated backend and do the job in an asynchronous manner.
These are my 2p.
Unless a unique file is needed for a specific reason, I would avoid appending everything to a huge file. Instead, I would wrap the file by time and size. A couple of configuration parameters (wrap_time and wrap_size) could be defined for this.
Also, I would probably introduce some buffering to avoid waiting for the write operation to complete.
PHP is probably not the best-suited language for this kind of operation, but it is still possible.
Use flock()
See this question
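A minimal sketch of a locked append (the filename and $logLine are placeholders):
// Minimal sketch of an append guarded by flock(); filename and $logLine are placeholders.
$fh = fopen('app.log', 'a');
if ($fh && flock($fh, LOCK_EX)) { // blocks until we get an exclusive lock
    fwrite($fh, $logLine);
    fflush($fh);                  // push the write out before releasing the lock
    flock($fh, LOCK_UN);
}
if ($fh) {
    fclose($fh);
}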
If you just need to append data, PHP should be fine with that, as the filesystem should take care of simultaneous appends.