So I have a LOT (millions) of records which I am trying to process. I've tried MongoDB and Neo4j and both simply grind my dual core ubuntu box to a halt.
I am wondering (and I don't believe there is) if there is any way to store PHP arrays in a file but only load one array into memory. So for example:
<?php
$loaded = array('hello','world');
$ignore_me = array('please','ignore');
$ignore_me2 = array('please','ignore','again');
?>
So effectively I could call the $loaded array but the others aren't loaded into memory (even though they're in the same file)? I know about fread/fopen but that tends to be where the file is a general block of text.
If (as I suspect) the answer is no - how would something like a NoSQL database not need to a) create a file per record and b) load everything into memory?? I know Neo4j uses Java but PHP should be able to match that!!
Did you consider Relational Databases such as Mysql, PostgreSql, MS Sql server?
I see that you tried MongoDB, an object-oriented database, and Neo4J, a node-oriented database.
I know that NoSQL is a great trend, but I tried NoSQL with my collections of millions of records and it performs so bad that I switched back to Relational SQL.
If you still insist to go with NoSQL, try Redis and Memcached, they are in-memory databases.
You could use PHP Streams to read/write to a file.
Read file and convert to array
$content = file_get_contents('/tmp/file.csv'); //file.csv contains a,b,c
$csv = implode(","$content); //csv is now array('a', 'b', 'c');
Write to file
$line = array("a", "b", "c");
// create stream
$file = fopen("/tmp/file.csv","w");
// add line as csv
fputcsv($file, $line);
// close stream
fclose($file);
You can also loop and add lines to a csv and loop and retrieve the lines.
https://secure.php.net/manual/en/function.fputcsv.php
You can retrieve multiple lines with fgetcsv which keeps a pointer of the next line to access as well
https://secure.php.net/manual/en/function.fgetcsv.php
Related
I generate JSON files which I load into datatables, and these JSON files can contain thousands of rows from my database. To generate them, I need to loop through every row in the database and add each database row as a new row in the JSON file. The problem I'm running into is this:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 262643 bytes)
What I'm doing is I get the JSON file with file_get_contents($json_file) and decode it into an array then I add a new row to the array, then encode the array back into JSON and export it to the file with file_put_contents($json_file).
Is there a better way to do this? Is there a way I can prevent the memory increasing with each loop iteration? Or is there a way I can clear the memory before it reaches the limit? I need the script to run to completion, but with this memory problem it barely gets up to 5% completion before crashing.
I can keep rerunning the script and each time I rerun it, it adds more rows to the JSON file, so if this memory problem is unavoidable, is there a way to automatically rerun the script numerous times until its finished? For example could I detect the memory usage, and detect when its about to reach the limit, then exit out of the script and restart it? I'm on wpengine so they won't allow security risky functions like exec().
So I switched to using CSV files and it solved the memory problem. The script runs vastly faster too. JQuery DataTables doesn't have built in support for CSV files, so I wrote a function to convert the CSV file to JSON:
public function csv_to_json($post_type) {
$data = array(
"recordsTotal" => $this->num_rows,
"recordsFiltered" => $this->num_rows,
"data"=>array()
);
if (($handle = fopen($this->csv_file, 'r')) === false) {
die('Error opening file');
}
$headers = fgetcsv($handle, 1024, "\t");
$complete = array();
while ($row = fgetcsv($handle, 1024, "\t")) {
$complete[] = array_combine($headers, $row);
}
fclose($handle);
$data['data'] = $complete;
file_put_contents($this->json_file,json_encode($data,JSON_PRETTY_PRINT));
}
So the result is I create a CSV file and a JSON file much faster than creating a JSON file alone, and there are no issues with memory limits.
Personally as I said in the comments, I would use CSV files. They have several advantages.
you can read / write one line at a time so you only manage the memory for one line
you can just append new data into the file.
PHP has plenty of built in support using either the fputcsv() or SPL file objects.
you can load them directly into the database using using "Load Data Infile"
http://dev.mysql.com/doc/refman/5.7/en/load-data.html
The only cons are
keep the same schema through the whole file
no nested data structures
The issue with Json, is ( as far as I know ) you have to keep the whole thing in memory as a single data set. Therefor you cannot stream it ( line for line ) like a normal text file. There is really no solution beside limiting the size of the json data, which may or may not even be easy to do. You can increase the memory some, but that is just a temporary fix if you expect the data to continue to grow.
We use CSV files in a production environment and I regularly deal with datasets that are 800k or 1M rows. I've even seen one that was 10M rows. We have a single table of 60M rows ( MySql ) that is populated from CSV uploads. So it will work and be robust.
If your set on Json, then I would just come up with a fixed number of rows that works and design your code to only run that many rows at a time. It's impossible for me to guess how to do that without more details.
I have an XML data source URL from where I am reading the data using fread. It contains student information from which I am extracting the Grades and compiling them in an array.
The problem is when I run this script locally, it works fine and all the grades are correctly listed/collected in array. However, when I run this script on a shared server, I get some incorrectly read grades in addition the normal grade names, for example, "ergarten". The complete grade name "Kindergarten" is also recorded in the array which means that there a problem in reading only specific elements.
The first suspect I have in mind is fread byte length. I have changed it to 8192 but without luck.
Here is the relevant code chunk from the php file:
if (!($xml_parser = xml_parser_create())) die("Couldn't create parser.");
xml_set_element_handler( $xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler( $xml_parser, "characterDataHandler");
while( $data = fread($fp, 8192)){
if(!xml_parse($xml_parser, $data, feof($fp))) {
break;}}
xml_parser_free($xml_parser);
Any thoughts?
I found the problem and fixed it myself.
The problem was that in the loop where the data was being read in chunks using fread, I was simultaneously converting that data using the XML parser and that was causing the problem since the streams of data do not always have a full tags. I removed the parser from that point to run it only when all the data has been read by the script.
That solved the problem.
I have a custom CakePHP shopping cart application where I’m trying to create a CSV file that contains a row of data for each transaction. I’m running into memory problems when having PHP create the CSV file at once by compiling the relevant data in the MySql databse. Currently the CSV file contains about 200 rows of data.
Alternatively, I’ve considered creating the CSV in a piecemeal process by appending a row of data to the file every time a transaction is made using: fopen($mFile.csv, 'a');
My developers are saying that I will still run into memory issues with this approach when the CSV file gets too large as PHP will read the whole file into memory. Is this the case? When using the append mode will PHP attempt to read the whole file into memory? If so, can you recommend a better approach?
Thanks in advance,
Ben
I ran the following script for a few minutes, and generated a 1.4gb file, well over my php memory limit. I also read from the file without issue. If you are running into memory issues it is probably something else that is causing the problem.
$fp = fopen("big_file.csv","a");
for($i = 0; $i < 100000000; $i++)
{
fputcsv($fp , array("val1","val2","val3","val4","val5","val6","val7","val8","val9"));
}
can't you just export from the db like so:
SELECT list_fields INTO OUTFILE '/tmp/result.text'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM test_table;
I have a csv file with records being sorted on the first field. I managed to generate a function that does binary search through that file, using fseek for random access through file.
However, this is still a pretty slow process, since when I seek some file position, I actually need to look left, looking for \n characted, so I can make sure I'm reading a whole line (once whole line is read, I can check for first field value mentioned above).
Here is the function that returns a line that contains character at position x:
function fgetLineContaining( $fh, $x ) {
if( $x 125145411) // 12514511 is the last pos in my file
return "";
// now go as much left as possible, until newline is found
// or beginning of the file
while( $x > 0 && $c != "\n" && $c != "\r") {
fseek($fh, $x);
$x--; // go left in the file
$c = fgetc( $fh );
}
$x+=2; // skip newline char
fseek( $fh, $x );
return fgets( $fh, 1024 ); // return the line from the beginning until \n
}
While this is working as expected, I have to sad that my csv file has ~1.5Mil lines, and these left-seeks are slowing thins down pretty much.
Is there a better way to seek a line containing position x inside a file?
Also, it would be much better if object of a class could be saved to a file without serializing it, thus enabling reading of a file object-by-object. Does php support that?
Thanks
I think you really should consider using SQLite or MySQL again (like others have suggested in the comments). Most of the suggestions about pre-calculating indexes are already implemented "properly" in these SQL engines.
You said the speed wasn't good enough in SQL. Did you have the fields indexed properly? How were you querying the data? Where you using bulk queries, where you using prepared statements? Did the SQL process have enough ram to store it's indexes in RAM?
One thing you can possibly try to speed under the current algorithm is to load the (~100MB ?) file onto a RAM disc. No matter what you chose to do, either CVS or SQLite, this WILL help speed things up, especially if the hard drive seek time is your bottleneck.
You could possibly even read the whole file into PHP array's (assuming your computer has enough RAM for that). That would allow you to do your search via index ($big_array[$offset]) lookups.
Also one thing to keep in mind, PHP isn't exactly super fast at doing low level things fast. You might want to consider moving away from PHP in favor of C or C++.
I have a PHP script that builds a binary search tree over a rather large CSV file (5MB+). This is nice and all, but it takes about 3 seconds to read/parse/index the file.
Now I thought I could use serialize() and unserialize() to quicken the process. When the CSV file has not changed in the meantime, there is no point in parsing it again.
To my horror I find that calling serialize() on my index object takes 5 seconds and produces a huge (19MB) text file, whereas unserialize() takes unbearable 27 seconds to read it back. Improvements look a bit different. ;-)
So - is there a faster mechanism to store/restore large object graphs to/from disk in PHP?
(To clarify: I'm looking for something that takes significantly less than the aforementioned 3 seconds to do the de-serialization job.)
var_export should be lots faster as PHP won't have to process the string at all:
// export the process CSV to export.php
$php_array = read_parse_and_index_csv($csv); // takes 3 seconds
$export = var_export($php_array, true);
file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');
Then include export.php when you need it:
include 'export.php';
Depending on your web server set up, you may have to chmod export.php to make it executable first.
Try igbinary...did wonders for me:
http://pecl.php.net/package/igbinary
First you have to change the way your program works. divide CSV file to smaller chunks. This is an IP datastore i assume. .
Convert all IP addresses to integer or long.
So if a query comes you can know which part to look.
There are <?php ip2long() /* and */ long2ip(); functions to do this.
So 0 to 2^32 convert all IP addresses into 5000K/50K total 100 smaller files.
This approach brings you quicker serialization.
Think smart, code tidy ;)
It seems that the answer to your question is no.
Even if you discover a "binary serialization format" option most likely even that would be to slow for what you envisage.
So, what you may have to look into using (as others have mentioned) is a database, memcached, or on online web service.
I'd like to add the following ideas as well:
caching of requests/responses
your PHP script does not shutdown but becomes a network server to answer queries
or, dare I say it, change the data structure and method of query you are currently using
i see two options here
string serialization, in the simplest form something like
write => implode("\x01", (array) $node);
read => explode() + $node->payload = $a[0]; $node->value = $a[1] etc
binary serialization with pack()
write => pack("fnna*", $node->value, $node->le, $node->ri, $node->payload);
read => $node = (object) unpack("fvalue/nre/nli/a*payload", $data);
It would be interesting to benchmark both options and compare the results.
If you want speed, writing to or reading from the file system in less than optimal.
In most cases, a database server will be able to store and retrieve data much more efficiently than a PHP script that is reading/writing files.
Another possibility would be something like Memcached.
Object serialization is not known for its performance but for its ease of use and it's definitely not suited to handle large amounts of data.
SQLite comes with PHP, you could use that as your database. Otherwise you could try using sessions, then you don't have to serialize anything, you just saving the raw PHP object.
What about using something like JSON for a format for storing/loading the data? I have no idea how fast the JSON parser is in PHP, but it's usually a fast operation in most languages and it's a lightweight format.
http://php.net/manual/en/book.json.php