BigQuery PHP API - large query result memory bloat, even with paging

I am running a range of queries in BigQuery and exporting them to CSV via PHP. There are reasons why this is the easiest method for me (multiple queries dependent on variables within an app).
I am struggling with memory issues when the result set is larger than about 100 MB. Memory usage appears to grow in line with the result set, which I thought paging would avoid. Here is my code:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query, ['maxResults' => 5000]);

$FH = fopen($storagepath, 'w');

$rows = $queryResults->rows();
foreach ($rows as $row) {
    fputcsv($FH, $row);
}

fclose($FH);
The $queryResults->rows() function returns a Google Iterator which uses paging to scroll through the results, so I do not understand why memory usage grows as the script runs.
Am I missing a way to discard previous pages from memory as I page through the results?
UPDATE
I have noticed that since upgrading to v1.4.3 of the BigQuery PHP API, memory usage caps out at 120 MB for this process, even when the result set grows far beyond that (currently processing a 1 GB result set). Still, 120 MB seems too much. How can I identify and fix where this memory is being used?
UPDATE 2
The 120 MB works out to roughly 24 KB per row of maxResults per page; e.g. adding 1000 rows to maxResults adds about 24 MB of memory. So my question is now: why does one row of data use 24 KB in the Google Iterator, and is there a way to reduce this? The data itself is < 1 KB per row.

Answering my own question
The extra memory is used by a load of PHP type mapping and other data-structure information that comes alongside the data from BigQuery. Unfortunately I couldn't find a way to reduce the memory usage below roughly 24 KB per row, multiplied by the page size. If someone finds a way to reduce the bloat that comes along with the data, please post below.
However, thanks to one of the comments, I realized you can extract a query result directly to CSV in a Google Cloud Storage bucket. This is really easy:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query);

// The query results are written to a temporary destination table; look it up from the job info
$qJobInfo = $queryResults->job()->info();
$dataset = $bq->dataset($qJobInfo['configuration']['query']['destinationTable']['datasetId']);
$table = $dataset->table($qJobInfo['configuration']['query']['destinationTable']['tableId']);

// Extract that table straight to a CSV file in Cloud Storage
$extractJob = $table->extract('gs://mybucket/' . $filename . '.csv');
$table->runJob($extractJob);
However, this still didn't solve my issue, as my result set was over 1 GB, so I had to make use of the sharded export feature by adding a wildcard to the destination URI.
$extractJob = $table->extract('gs://mybucket/'.$filename.'*.csv');
This created ~100 shards in the bucket. These need to be recomposed using gsutil compose <shard filenames> <final filename>. However, gsutil only lets you compose 32 files at a time. Given that I will have a variable number of shards, often above 32, I had to write some code to clean them up.
//Save the extract job as a variable
$eJob = $table->runJob($extractJob);
$eJobInfo = $eJob->info();

//This bit of the job info tells you how many shards were created
$eJobFiles = $eJobInfo['statistics']['extract']['destinationUriFileCounts'][0];

$composedFiles = 0; $composeLength = 0; $subfile = 0; $fileString = "";

while (($composedFiles < $eJobFiles) && ($eJobFiles > 1)) {
    while (($composeLength < 32) && ($composedFiles < $eJobFiles)) {
        // The sharded export appends a 12-digit, zero-padded sequence number to the filename,
        // so build a string of up to 32 such filenames at a time
        $fileString .= "gs://bucket/$filename" . str_pad($composedFiles, 12, "0", STR_PAD_LEFT) . ".csv ";
        $composedFiles++;
        $composeLength++;
    }
    $composeLength = 0;
    // Compose this batch of up to 32 shards into a numbered subfile
    system("gsutil compose $fileString gs://bucket/" . $filename . "-" . $subfile . ".csv");
    $subfile++;
    $fileString = "";
}
if ($eJobFiles > 1) {
    // Compose all the subfiles into the final CSV
    system('gsutil compose gs://bucket/' . $filename . '-* gs://bucket/' . $filename . '.csv');
}
Note: in order to give my Apache user access to gsutil I had to allow it to create a .config directory in the web root. Ideally you would use the Cloud Storage PHP client rather than shelling out to gsutil, but I didn't want the code bloat.
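For reference, here is a rough, untested sketch of how that compose step could be done with the google/cloud-storage PHP client instead of gsutil, using its Bucket::objects() and Bucket::compose() methods. The bucket name mirrors the example above, and it assumes the final compose is over 32 or fewer subfiles:

use Google\Cloud\Storage\StorageClient;

$storage = new StorageClient();
$bucket = $storage->bucket('mybucket');

// List the shard objects by prefix and batch them into groups of up to 32
$shards = iterator_to_array($bucket->objects(['prefix' => $filename]));
$batches = array_chunk($shards, 32);

// Compose each batch into a numbered subfile, mirroring the gsutil loop above
$subfiles = [];
foreach ($batches as $i => $batch) {
    $subfiles[] = $bucket->compose($batch, $filename . '-' . $i . '.csv');
}

// Compose the subfiles into the final CSV (assumes 32 or fewer subfiles)
$bucket->compose($subfiles, $filename . '.csv');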
If anyone has a better answer please post it.
Is there a way to get smaller output from the BigQuery library than 24 KB per row?
Is there a more efficient way to clean up a variable number of shards?

Related

Alpha Vantage client too slow

I have this very simple PHP call to the Alpha Vantage API to fill a table (or list) with NASDAQ stock prices:
<?php
function get_price($commodity = "")
{
    $url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=' . $commodity . '&outputsize=full&apikey=myKey';
    $obj = json_decode(file_get_contents($url), true);
    $date = $obj['Meta Data']['3. Last Refreshed'];
    $result = $obj['Time Series (Daily)']['2018-03-23']['4. close'];
    $rd_result = round($result, 2);
    echo $result;
}
?>
<?php
get_price("XOM");
get_price("AAPL");
get_price("MSFT");
get_price("CVX");
get_price("CAT");
get_price("BA");
?>
And it works, but it is just so slow. It can take over 30 seconds to load, while the JSON file from Alpha Vantage itself loads in a fraction of a second.
Does anyone know where I am going wrong?
This is what I did when the API took time to reply. My solution is written in C#, but the logic would be the same.
string[] AlphaVantageApiKey = { "RK*********", "B2***********", "4FD*********QN", "7S3Z*********FRX", "U************I3" };
int ApiKeyValue = 0;

foreach (var stock in listOfStocks)
{
    DataTable dtResult = DataRetrival.GetIntradayStockFeedForSelectedStockAs(stock.Symbol.Trim().ToUpper(), ApiKeyValue);
    ApiKeyValue = (ApiKeyValue == 4) ? 0 : ApiKeyValue + 1;
}
I use 5 to 6 different API keys when querying data and loop through them, one per call, thereby reducing the load on any one particular key.
I observed that this improved my performance a lot: it takes me less than a minute to get intraday data for 50 stocks.
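For comparison, a minimal PHP sketch of the same key-rotation idea applied to the question's daily-series call; the key values and the symbol list are placeholders:

$apiKeys = ['KEY_1', 'KEY_2', 'KEY_3', 'KEY_4', 'KEY_5']; // placeholder keys
$keyIndex = 0;

foreach (['XOM', 'AAPL', 'MSFT', 'CVX', 'CAT', 'BA'] as $symbol) {
    $url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
         . '&symbol=' . urlencode($symbol)
         . '&apikey=' . $apiKeys[$keyIndex];

    $data = json_decode(file_get_contents($url), true);
    // ... use $data ...

    // Rotate to the next key so no single key takes every request
    $keyIndex = ($keyIndex + 1) % count($apiKeys);
}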
Another way you can improve your performance is to use outputsize=compact, which returns only the latest 100 data points in the time series.
UPDATE: Batch Stock Quotes
You might want to consider using this type of query as well: multiple stock quotes, all in one call.
Also, using the full output size pulls data from the past 20 years, where applicable. Take that out of your query and let the API fall back to its condensed default.
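Putting those suggestions together, a hedged rewrite of the original function might look like the sketch below: compact output instead of the full 20-year series, and the latest close returned rather than echoed. The function name and error handling are illustrative only:

function get_latest_close(string $symbol, string $apiKey): ?float
{
    $url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
         . '&symbol=' . urlencode($symbol)
         . '&outputsize=compact'   // only the latest ~100 data points
         . '&apikey=' . $apiKey;

    $json = file_get_contents($url);
    if ($json === false) {
        return null;               // network failure
    }

    $obj = json_decode($json, true);
    $series = $obj['Time Series (Daily)'] ?? [];
    $latest = reset($series);      // the first entry is the most recent trading day
    return isset($latest['4. close']) ? round((float) $latest['4. close'], 2) : null;
}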
EDIT: According to the above, you should make changes to your query. But it can also be an issue with your server. I tested this for a use case I am working on and it takes me a few seconds to get the data, albeit I am only pulling it for one stock symbol on a page at a time.
Try increasing your memory limit if things are too slow for your liking.
<?php
ini_set('memory_limit','500M'); // or your desired limit
?>
Also, if you have shared hosting, that might be the problem. However, I do not know enough about your server to answer that fully.

PHP - How to append to a JSON file

I generate JSON files which I load into datatables, and these JSON files can contain thousands of rows from my database. To generate them, I need to loop through every row in the database and add each database row as a new row in the JSON file. The problem I'm running into is this:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 262643 bytes)
What I'm doing is getting the JSON file with file_get_contents($json_file), decoding it into an array, adding a new row to the array, then encoding the array back into JSON and writing it out with file_put_contents($json_file).
Is there a better way to do this? Is there a way I can prevent the memory increasing with each loop iteration? Or is there a way I can clear the memory before it reaches the limit? I need the script to run to completion, but with this memory problem it barely gets up to 5% completion before crashing.
I can keep rerunning the script, and each time I rerun it, it adds more rows to the JSON file. So if this memory problem is unavoidable, is there a way to automatically rerun the script numerous times until it's finished? For example, could I monitor memory usage, detect when it's about to reach the limit, then exit the script and restart it? I'm on WP Engine, so they won't allow security-risky functions like exec().
So I switched to using CSV files and it solved the memory problem. The script runs vastly faster too. jQuery DataTables doesn't have built-in support for CSV files, so I wrote a function to convert the CSV file to JSON:
public function csv_to_json($post_type) {
    $data = array(
        "recordsTotal" => $this->num_rows,
        "recordsFiltered" => $this->num_rows,
        "data" => array()
    );

    if (($handle = fopen($this->csv_file, 'r')) === false) {
        die('Error opening file');
    }

    // The first row holds the column headers (the file is tab-delimited)
    $headers = fgetcsv($handle, 1024, "\t");

    $complete = array();
    while ($row = fgetcsv($handle, 1024, "\t")) {
        $complete[] = array_combine($headers, $row);
    }
    fclose($handle);

    $data['data'] = $complete;
    file_put_contents($this->json_file, json_encode($data, JSON_PRETTY_PRINT));
}
So the result is I create a CSV file and a JSON file much faster than creating a JSON file alone, and there are no issues with memory limits.
Personally, as I said in the comments, I would use CSV files. They have several advantages:
you can read/write one line at a time, so you only ever manage the memory for one line
you can simply append new data to the file
PHP has plenty of built-in support via either fputcsv() or the SPL file objects
you can load them directly into the database using "LOAD DATA INFILE"
http://dev.mysql.com/doc/refman/5.7/en/load-data.html
The only cons are:
you must keep the same schema through the whole file
no nested data structures
The issue with JSON is (as far as I know) that you have to keep the whole thing in memory as a single data set, so you cannot stream it (line by line) like a normal text file. There is really no solution besides limiting the size of the JSON data, which may or may not even be easy to do. You can increase the memory somewhat, but that is only a temporary fix if you expect the data to continue to grow.
We use CSV files in a production environment, and I regularly deal with datasets of 800k or 1M rows. I've even seen one that was 10M rows. We have a single table of 60M rows (MySQL) that is populated from CSV uploads. So it will work and be robust.
If you're set on JSON, then I would just come up with a fixed number of rows that works and design your code to only run that many rows at a time. It's impossible for me to guess how to do that without more details.
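For illustration, a minimal sketch of the line-at-a-time CSV approach described above, appending each database row as it is fetched; the PDO handle, query, and file name are placeholders:

// Open (or create) the CSV in append mode so each run only adds new rows
$fh = fopen('export.csv', 'a');

$stmt = $pdo->query('SELECT id, title, price FROM products');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    fputcsv($fh, $row);   // only one row is ever held in memory
}

fclose($fh);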

Parsing large XML data

I am trying to parse XML files to store data into a database. I have written the code below in PHP, and I can run it successfully.
But the problem is that it requires around 8 minutes to read a complete file (which is around 30 MB), and I have to parse around 100 files each hour.
So obviously my current code is of no use to me. Can anybody advise a better solution, or should I switch to another language?
What I get from the net is that I could do it with Perl/Python or something called XSLT (which I am not so sure about, frankly).
$xml = new XMLReader();
$xml->open($file);

while ($xml->name === 'node1') {
    $node = new SimpleXMLElement($xml->readOuterXML());
    foreach ($node->node2 as $node2) {
        //READ
    }
    $xml->next('node1');
}

$xml->close();
Here's an example of my script I used to parse the WURFL XML database found here.
I used the ElementTree module for Python and wrote out a JavaScript Array - although you can easily modify my script to write a CSV of the same (Just change the final 3 lines).
import xml.etree.ElementTree as ET

tree = ET.parse('C:/Users/Me/Documents/wurfl.xml')
root = tree.getroot()

dicto = {}  # to store the data

for device in root.iter("device"):  # parse out the device objects
    dicto[device.get("id")] = [0, 0, 0, 0]  # set up a list to store the needed variables
    for child in device:  # iterate through each device
        if child.get("id") == "product_info":  # find the product_info id
            for grand in child:
                if grand.get("name") == "model_name":  # and the model_name id
                    dicto[device.get("id")][0] = grand.get("value")
                    dicto[device.get("id")][3] += 1
        elif child.get("id") == "display":  # and the display id
            for grand in child:
                if grand.get("name") == "physical_screen_height":
                    dicto[device.get("id")][1] = grand.get("value")
                    dicto[device.get("id")][3] += 1
                elif grand.get("name") == "physical_screen_width":
                    dicto[device.get("id")][2] = grand.get("value")
                    dicto[device.get("id")][3] += 1
    if not dicto[device.get("id")][3] == 3:  # make sure I had enough
        # otherwise it's an incomplete dataset
        del dicto[device.get("id")]

arrays = []
for key in dicto.keys():  # sort this all into another list
    arrays.append(key)
arrays.sort()  # and sort it alphabetically

with open('C:/Users/Me/Documents/wurfl1.js', 'w') as new:  # now to write it out
    for item in arrays:
        new.write('{\n id:"'+item+'",\n Product_Info:"'+dicto[item][0]+'",\n Height:"'+dicto[item][1]+'",\n Width:"'+dicto[item][2]+'"\n},\n')
Just counted this as I ran it again - took about 3 seconds.
In Perl you could use XML::Twig, which is designed to process huge XML files (bigger than can fit in memory)
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $file = shift @ARGV;

XML::Twig->new( twig_handlers => { 'node1/node2' => \&read_node } )
         ->parsefile( $file );

sub read_node
  { my( $twig, $node2 ) = @_;
    # your code, the whole node2 string is $node2->sprint
    $twig->purge;    # if you want to reduce memory footprint
  }
You can find more info about XML::Twig at xmltwig.org
In the case of Python, I would recommend using lxml.
As you are having performance problems, I would recommend iterating through your XML and processing it part by part; this saves a lot of memory and is likely to be much faster.
On an old server I can read a 10 MB XML file within 3 seconds; your situation might be different.
About iterating with lxml: http://lxml.de/tutorial.html#tree-iteration
Review this line of code:
$node = new SimpleXMLElement($xml->readOuterXML());
The documentation for readOuterXML() has a comment noting that it sometimes attempts to resolve namespaces and so on. In any case, I would suspect a big performance problem here.
Consider using readInnerXML() if you can.
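As an alternative sketch (not from the answer above, just a commonly used XMLReader pattern): expand() hands back only the current element as a DOM node, so it never has to be re-serialised to a string and re-parsed by SimpleXMLElement. The node names mirror the question's example.

$xml = new XMLReader();
$xml->open($file);

// Advance to the first <node1> element
while ($xml->read() && $xml->name !== 'node1');

$doc = new DOMDocument();
while ($xml->name === 'node1') {
    // expand() returns a DOM node for the current element only
    $node = simplexml_import_dom($doc->importNode($xml->expand(), true));
    foreach ($node->node2 as $node2) {
        // process $node2 here
    }
    $xml->next('node1');   // jump to the next sibling <node1>
}

$xml->close();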

PHP Using fgetcsv on a huge csv file

Using fgetcsv, can I somehow do a destructive read, where rows I've already read and processed are discarded, so that if I don't make it through the whole file in the first pass I can come back and pick up where I left off before the script timed out?
Additional Details:
I'm getting a daily product feed from a vendor that comes across as a 200mb .gz file. When I unpack the file, it turns into a 1.5gb .csv with nearly 500,000 rows and 20 - 25 fields. I need to read this information into a MySQL db, ideally with PHP so I can schedule a CRON to run the script at my web hosting provider every day.
I have a hard timeout on the server set to 180 seconds by the hosting provider, and max memory utilization limit of 128mb for any single script. These limits cannot be changed by me.
My idea was to grab the information from the .csv using the fgetcsv function, but I'm expecting to have to take multiple passes at the file because of the 3-minute timeout. I was thinking it would be nice to whittle away at the file as I process it, so I wouldn't need to spend cycles skipping over rows that were already processed in a previous pass.
From your problem description it really sounds like you need to switch hosts. Processing a 2 GB file with a hard time limit is not a very constructive environment. Having said that, deleting read lines from the file is even less constructive, since you would have to rewrite the entire 2 GB to disk minus the part you have already read, which is incredibly expensive.
Assuming you save how many rows you have already processed, you can skip rows like this:
$alreadyProcessed = 42; // for example

$i = 0;
while ($row = fgetcsv($fileHandle)) {
    if ($i++ < $alreadyProcessed) {
        continue;
    }
    ...
}
However, this means you're reading the entire 2 GB file from the beginning each time you go through it, which in itself already takes a while and you'll be able to process fewer and fewer rows each time you start again.
The best solution here is to remember the current position of the file pointer, for which ftell is the function you're looking for:
$lastPosition = file_get_contents('last_position.txt');

$fh = fopen('my.csv', 'r');
fseek($fh, $lastPosition);

while ($row = fgetcsv($fh)) {
    ...
    file_put_contents('last_position.txt', ftell($fh));
}
This allows you to jump right back to the last position you were at and continue reading. You obviously want to add a lot of error handling here, so you're never in an inconsistent state no matter which point your script is interrupted at.
You can avoid the timeout and memory errors to some extent by reading the file like a stream: read it line by line and insert each line into the database (or process it accordingly). That way, only a single line is held in memory on each iteration. Note: don't try to load a huge CSV file into an array, as that really would consume a lot of memory.
if (($handle = fopen("yourHugeCSV.csv", 'r')) !== false)
{
    // Get the first row (header)
    $header = fgetcsv($handle);

    // Loop through the file line-by-line
    while (($data = fgetcsv($handle)) !== false)
    {
        // Process your data
        unset($data);
    }

    fclose($handle);
}
I think a better solution (it would be phenomenally inefficient to continuously rewind and rewrite an open file stream) would be to track the file position of each record read (using ftell) and store it with the data you've read; then, if you have to resume, just fseek to the last position.
You could try loading the file directly using MySQL's LOAD DATA INFILE (which will likely be a lot faster), although I've had problems with this in the past and ended up writing my own PHP code.
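A rough PDO sketch of that LOAD DATA approach, assuming a products table, placeholder credentials, and that local-infile is enabled on both the client and the server:

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,   // required for LOAD DATA LOCAL
]);

$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/feed.csv'
    INTO TABLE products
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
");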
I have a hard timeout on the server set to 180 seconds by the hosting provider, and max memory utilization limit of 128mb for any single script. These limits cannot be changed by me.
What have you tried?
The memory limit can be imposed by means other than the php.ini file, but I can't imagine how anyone could actually prevent you from using a different execution time (even if ini_set is disabled, from the command line you could run php -d max_execution_time=3000 /your/script.php or php -c /path/to/custom/inifile /your/script.php).
Unless you are trying to fit the entire data file into memory, there should be no issue with a memory limit of 128 MB.

PHP - parallel processes or one process after another when writing a file

I am exporting data to CSV. After 25,000 records, memory is exhausted.
Increasing the memory limit is OK.
If I have 100,000 rows, can I write it as 4 processes?
Write the first 25,000 rows, then the next 25,000, then the next...
Is this possible in a CSV export?
Will this have any advantage, or is it the same as exporting the whole data set in one go? Does multi-processing or parallel processing have any advantage here?
Well, this depends on how you're generating the CSV.
Assuming that you're doing it as the result of a database query (or some other import), you could try streaming instead of building and then returning.
Basically, you turn off output buffering first:
while (ob_get_level() > 0) {
    ob_end_flush();
}
Then, when you're building it, echo it out row by row:
foreach ($rows as $row) {
    echo '"' . $row[0] . '","' . $row[1] . '"' . "\n";
}
That way, you're not using too much memory in PHP.
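As a variation on the same streaming idea (my sketch, not part of the original answer), fputcsv() can write each row straight to the output stream, which avoids hand-building the quoted line:

$out = fopen('php://output', 'w');   // write directly to the response body
foreach ($rows as $row) {
    fputcsv($out, $row);             // one row at a time, nothing accumulated in memory
}
fclose($out);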
You could also write the data to a temporary file, and then stream that file back:
$file = tmpfile();

foreach ($rows as $row) {
    fputcsv($file, $row);
}

rewind($file);
fpassthru($file); // Sends the file to the client
fclose($file);
But again, it all depends on what you're doing. It sounds to me like you're building the CSV in a string (which is eating all your memory). That's why I suggested these two options...
The problem is that if you fork the process, you have to worry about cleaning up its children, and you're still using the same total amount of memory. Ultimately you're limited by the machine's memory, but if you don't want to conditionally increase PHP's memory_limit based on the number of iterations, then forking may be the way to go.
If you compiled PHP with --enable-pcntl and --enable-sigchild, you're good to go; otherwise, you won't be able to fork the process. One workaround would be to have a master script that delegates the execution of other scripts, but if you're using backticks or shell_exec() or exec() (or anything similar), it starts to get sloppy and you'll have to take a lot of steps to ensure that your commands cannot be tainted/exploited.
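If the pcntl route is available, a rough sketch of splitting the 100,000-row export into four child processes might look like this; fetch_rows() is a hypothetical helper that reads a given offset/limit from the data source, and the chunk sizes mirror the question:

$chunks = [[0, 25000], [25000, 25000], [50000, 25000], [75000, 25000]];
$pids = [];

foreach ($chunks as $i => [$offset, $limit]) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die('fork failed');
    }
    if ($pid === 0) {                            // child process
        $rows = fetch_rows($offset, $limit);     // hypothetical data-access helper
        $fh = fopen("export_part_$i.csv", 'w');
        foreach ($rows as $row) {
            fputcsv($fh, $row);
        }
        fclose($fh);
        exit(0);                                 // the child must exit, or it will keep looping and forking
    }
    $pids[] = $pid;                              // parent records the child and continues
}

// Reap the children so they don't become zombies
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}
// The four part files can then be concatenated in order.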
