Export complete table to csv from cassandra - php

I have a very large table in Cassandra (~500 million rows) and I want to export all rows for some columns to a file. I tried this using the COPY command:
COPY keyspace.table (id, value) TO 'filepath' WITH DELIMITER=',';
but it took ~12 hours to complete the export. Is there any way this could be done faster?
If exporting only some columns is a problem, it wouldn't be a problem to export all the data instead. The important thing is that I need a way to get all entries so I can process them afterwards.
The other question is, is it possible to process this export in PHP just with the DataStax PHP driver?

COPY ... TO ... is not a good idea to use on a large amount of data.
is it possible to process this export in PHP just with the DataStax PHP driver
I did a CSV export from Cassandra with the help of the DataStax Java driver, but PHP should follow the same algorithm. According to the documentation you can easily run a query and print the output. Take pagination into account as well.
You can convert an array to CSV with the help of the fputcsv function.
So, the simplest example would be:
<?php
$cluster   = Cassandra::cluster()                // connects to localhost by default
                ->build();
$keyspace  = 'system';
$session   = $cluster->connect($keyspace);       // create session, optionally scoped to a keyspace
$statement = new Cassandra\SimpleStatement(      // also supports prepared and batch statements
    'SELECT keyspace_name, columnfamily_name FROM schema_columnfamilies'
);
$future = $session->executeAsync($statement);    // fully asynchronous and easy parallel execution
$result = $future->get();                        // wait for the result, with an optional timeout
// Here you can print CSV headers.
foreach ($result as $row) {                      // results and rows implement Iterator, Countable and ArrayAccess
    // Here you can print CSV values
    // printf("The keyspace %s has a table called %s\n", $row['keyspace_name'], $row['columnfamily_name']);
}
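Tying those two pieces together, a minimal sketch could look like this: stream the rows to a file with fputcsv() while paging through the result set. The file path, table and column names, and the page size are placeholders, and the paging calls (page_size, isLastPage(), nextPage()) follow the 1.x DataStax PHP driver documentation, so check them against your driver version.
<?php
$cluster   = Cassandra::cluster()->build();
$session   = $cluster->connect('keyspace');
$statement = new Cassandra\SimpleStatement('SELECT id, value FROM table');

$fp = fopen('/tmp/export.csv', 'w');             // output file (assumed path)
fputcsv($fp, array('id', 'value'));              // CSV header

$result = $session->execute($statement, new Cassandra\ExecutionOptions(array(
    'page_size' => 5000                          // fetch 5000 rows at a time
)));

while (true) {
    foreach ($result as $row) {
        fputcsv($fp, array($row['id'], $row['value']));
    }
    if ($result->isLastPage()) {
        break;
    }
    $result = $result->nextPage();               // synchronously fetch the next page
}
fclose($fp);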

The short answer is yes, there are faster ways to do this.
The how is a longer answer. If you are going to be saving these rows to a file on a regular basis, you might want to use Apache Spark. Depending on how much memory is on your Cassandra nodes, you can bring a simple 500 million row table scan plus write-to-file down to under an hour.

There are some options which can give you a fast and reliable turnaround:
Hive [ my preferred one, run SQL-like queries ]
Shark/Beeline [ run SQL-like queries ]
Spark [ fast for data-related computation, but not the best option for your use case ]
For PHP [Hive PHP Client]:
<?php
// set THRIFT_ROOT to php directory of the hive distribution
$GLOBALS['THRIFT_ROOT'] = '/lib/php/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';
// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();
// run queries, metadata calls etc
$client->execute('SELECT * from src');
var_dump($client->fetchAll());
$transport->close();
Ref: https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-PHP

Related

How to dump MySQL table to a file then read it and use it in place of the DB itself?

Because a provider I use has quite unreliable MySQL servers, which are down at least once per week and impact one of the sites I made, I want to prevent outages in the following way:
dump the MySQL table to a file;
in case the connection with the SQL server fails, read the file instead of the server until the server is back.
This will avoid outages from the user experience point of view.
In fact things are not as easy as they seem, so I am asking for your help.
What I did is save the data in JSON format.
But this has issues because a lot of the data in the DB is stored "in the clear", including escaped complex URLs with long argument strings, which cause problems during the JSON decoding.
CSV and TSV do not work correctly either.
CSV is delimited by commas or semicolons, and those are present in the original content taken from the DB.
TSV leaves double quotes that cannot be removed without going into the record's fields to eliminate them.
Then I tried to serialize each record read from the DB, store it, and retrieve it by unserializing it.
But the result is a bit catastrophic, because all the records are stored in the file, yet when I retrieve them only one is returned. Then something blocks the functioning of the program (code below):
require_once('variables.php');
require_once("database.php");

$file = "database.dmp";
$myfile = fopen($file, "w") or die("Unable to open file!");

$sql = mysql_query("SELECT * FROM song ORDER BY ID ASC");
// output data of each row
while ($row = mysql_fetch_assoc($sql)) {
    // store the record into the file
    fwrite($myfile, serialize($row));
}
fclose($myfile);
mysql_close();

// Retrieving section
$myfile = fopen($file, "r") or die("Unable to open file!");
// Till the file is not ended, continue to check it
while (!feof($myfile)) {
    $record = fgets($myfile);      // get the record
    $row = unserialize($record);   // unserialize it
    print_r($row);                 // show if the variable has something on it
}
fclose($myfile);
I also tried uuencode and base64_encode, but those were worse choices.
Is there any way to achieve my goal?
Thank you very much in advance for your help
If you have your data layer well decoupled you can consider using SQLite as a fallback storage.
It's just a matter of adding one more abstraction layer, with the same code accessing the storage and switching the storage target when the primary one is unavailable.
-----EDIT-----
You could also try to think about some caching (json/html file?!) strategy returning stale data in case of mysql outage.
-----EDIT 2-----
If it's not too much effort, please consider playing with PDO. I'm quite sure you'll never look back, and believe me, it will help you structure your db calls with little pain when switching between storages.
Please take the following only as an example; there are much better ways to design this architectural part of the code.
Just a small and basic code to demonstrate you what I mean:
class StoragePersister
{
    private $driver = 'mysql';

    public function setDriver($driver)
    {
        $this->driver = $driver;
    }

    public function persist($data)
    {
        switch ($this->driver) {
            case 'mysql':
                $this->persistToMysql($data);
                break;
            case 'sqlite':
                $this->persistToSqlite($data);
                break;
        }
    }

    public function persistToMysql($data)
    {
        // query to MySQL
    }

    public function persistToSqlite($data)
    {
        // query to SQLite
    }
}

$storage = new StoragePersister;
$storage->setDriver('sqlite'); // eventually switch to sqlite
$storage->persist($somedata);  // this will call the right method based on the selected storage driver
-----EDIT 3-----
Please have a look at the "strategy" design pattern; I think it can help you better understand what I mean.
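As a rough sketch of the PDO idea from EDIT 2 (DSNs, credentials and the table name are placeholders): the only thing that changes between storages is the DSN, while the code that uses the connection stays identical.
<?php
function getConnection($driver = 'mysql')
{
    if ($driver === 'mysql') {
        return new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    }
    // fallback storage
    return new PDO('sqlite:' . __DIR__ . '/fallback.sqlite');
}

try {
    $pdo = getConnection('mysql');
} catch (PDOException $e) {
    $pdo = getConnection('sqlite'); // MySQL is down, switch to SQLite
}

$stmt = $pdo->query('SELECT * FROM song ORDER BY ID ASC');
foreach ($stmt as $row) {
    // same consuming code, whatever the storage driver
}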
After the SELECT you need to build a proper structure for re-inserting the data; then you can serialize it or do whatever you want.
For example:
For each row you could do something like: $sqls[] = "INSERT INTO `song` (field1, field2, .. fieldN) VALUES (field1_value, field2_value, ... fieldN_value);";
Then you could serialize this $sqls array, write it into a file, and when you need it you could read it, unserialize it and run the queries.
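A minimal sketch of that idea (column names are placeholders; the whole array of INSERT statements is serialized to one file in one go):
<?php
$sqls = array();
while ($row = mysql_fetch_assoc($result)) {
    $sqls[] = sprintf(
        "INSERT INTO `song` (field1, field2) VALUES ('%s', '%s');",
        mysql_real_escape_string($row['field1']),
        mysql_real_escape_string($row['field2'])
    );
}
file_put_contents('song.cache', serialize($sqls));

// Later, read the file back and replay the queries:
$sqls = unserialize(file_get_contents('song.cache'));
foreach ($sqls as $sql) {
    mysql_query($sql);
}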
Have you thought about caching your queries in a cache like APC? Also, you may want to use mysqli or PDO instead of mysql (the mysql extension is deprecated in the latest versions of PHP).
To answer your question, this is one way of doing it.
var_export will export the variable as valid PHP code
require will put the content of the array into the $rows variable (because of the return statement)
Here is the code :
$sql = mysql_query("SELECT * FROM song ORDER BY ID ASC");
$content = array();
// output data of each row
while ($row = mysql_fetch_assoc($sql)) {
// store the record into the file
$content[$row['ID']] = $row;
}
mysql_close();
$data = '<?php return ' . var_export($content, true) . ';';
file_put_contents($file, $data);
// Retrieving section
$rows = require $file;
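For the fallback scenario described in the question, the cache could then be used like this (a sketch; host and credentials are placeholders):
<?php
$file = 'database.dmp';
$link = @mysql_connect('dbhost', 'dbusername', 'dbpassword');
if ($link && mysql_select_db('dbname', $link)) {
    // query MySQL as usual and refresh the cache file with the code above
} else {
    $rows = require $file; // stale but usable data while the server is down
}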

Google Big Query + PHP -> How to fetch a large data set without running out of memory

I am trying to run a query in BigQuery/PHP (using google php SDK) that returns a large dataset (can be 100,000 - 10,000,000 rows).
$bigqueryService = new Google_BigqueryService($client);
$query = new Google_QueryRequest();
$query->setQuery(...);
$jobs = $bigqueryService->jobs;
$response = $jobs->query($project_id, $query);
// query is a synchronous function that returns the full dataset
The next step is to allow the user to download the result as a CSV file.
The code above will fail when the dataset becomes too large (memory limit).
What are my options to perform this operation with lower memory usage ?
(I figured one option is to save the results to another BigQuery table and then do partial fetches with LIMIT and OFFSET, but a better solution might be available.)
Thanks for the help
You can export your data directly from BigQuery:
https://developers.google.com/bigquery/exporting-data-from-bigquery
You can use PHP to run an API call that does the export (you don't need the bq tool).
You need to set the jobs configuration.extract.destinationFormat, see the reference.
Just to elaborate on Pentium10's answer
You can export up to a 1 GB file in JSON format.
Then you can read the file line by line, which will minimize the memory used by your application, and use json_decode on each line.
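For illustration, a minimal sketch of reading such a newline-delimited JSON export one row at a time (file names are placeholders):
<?php
$in  = fopen('export-000000000000.json', 'r'); // BigQuery JSON export: one JSON object per line
$out = fopen('export.csv', 'w');
while (($line = fgets($in)) !== false) {
    $row = json_decode($line, true);
    if ($row !== null) {
        fputcsv($out, array_values($row)); // only one row is in memory at a time
    }
}
fclose($in);
fclose($out);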
The suggestion to export is a good one, I just wanted to mention there is another way.
The query API you are calling (jobs.query()) does not return the full dataset; it just returns a page of data, which is the first 2 MB of the results. You can set the maxResults flag (described here) to limit this to a certain number of rows.
If you get back fewer rows than are in the table, you will get a pageToken field in the response. You can then fetch the remainder with the jobs.getQueryResults() API by providing the job ID (also in the query response) and the page token. This will continue to return new rows and a new page token until you get to the end of your table.
The example here shows code (in Java and Python) to run a query and fetch the results page by page.
There is also an option in the API to convert directly to CSV by specifying alt='csv' in the URL query string, but I'm not sure how to do this in PHP.
I am not sure if you are still using PHP, but the answer is:
$options = [
    'maxResults' => 1000,
    'startIndex' => 0
];
$jobConfig = $bigQuery->query($query);
$queryResults = $bigQuery->runQuery($jobConfig, $options);
foreach ($queryResults as $row) {
    // Handle rows
}
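To keep memory low when producing the CSV, the rows can be written out as they are iterated instead of being collected first. A sketch, assuming each $row is an associative array as in the snippet above:
<?php
$fp = fopen('php://output', 'w'); // or a temporary file
$first = true;
foreach ($queryResults as $row) {
    if ($first) {
        fputcsv($fp, array_keys($row)); // header from the first row's keys
        $first = false;
    }
    fputcsv($fp, array_values($row));
}
fclose($fp);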

Retrieving Azure Performance Counters

I am in the process of coding a cloud monitoring application and couldn't find any useful logic for getting performance counters (such as CPU utilization, disk utilization, RAM usage) in the Azure PHP SDK documentation.
Can anybody help?
define('PRODUCTION_SITE', false); // Controls connections to cloud or local storage
define('AZURE_STORAGE_KEY', '<your_storage_key>');
define('AZURE_SERVICE', '<your_domain_extension>');
define('ROLE_ID', $_SERVER['RoleDeploymentID'] . '/' . $_SERVER['RoleName'] . '/' . $_SERVER['RoleInstanceID']);
define('PERF_IN_SEC', 30); // How many seconds between times dumping performance metrics to table storage
/** Microsoft_WindowsAzure_Storage_Blob */
require_once 'Microsoft/WindowsAzure/Storage/Blob.php';
/** Microsoft_WindowsAzure_Diagnostics_Manager **/
require_once 'Microsoft/WindowsAzure/Diagnostics/Manager.php';
/** Microsoft_WindowsAzure_Storage_Table */
require_once 'Microsoft/WindowsAzure/Storage/Table.php';
if (PRODUCTION_SITE) {
    $blob = new Microsoft_WindowsAzure_Storage_Blob(
        'blob.core.windows.net',
        AZURE_SERVICE,
        AZURE_STORAGE_KEY
    );
    $table = new Microsoft_WindowsAzure_Storage_Table(
        'table.core.windows.net',
        AZURE_SERVICE,
        AZURE_STORAGE_KEY
    );
} else {
    // Connect to local Storage Emulator
    $blob = new Microsoft_WindowsAzure_Storage_Blob();
    $table = new Microsoft_WindowsAzure_Storage_Table();
}
$manager = new Microsoft_WindowsAzure_Diagnostics_Manager($blob);
//////////////////////////////
// Bring in global include file
require_once('setup.php');
// Performance counters to subscribe to
$counters = array(
    '\Processor(_Total)\% Processor Time',
    '\TCPv4\Connections Established',
);
// Retrieve the current configuration information for the running role
$configuration = $manager->getConfigurationForRoleInstance(ROLE_ID);
// Add each subscription counter to the configuration
foreach ($counters as $c) {
    $configuration->DataSources->PerformanceCounters->addSubscription($c, PERF_IN_SEC);
}
// These settings are required by the diagnostics manager to know when to transfer the metrics to the storage table
$configuration->DataSources->OverallQuotaInMB = 10;
$configuration->DataSources->PerformanceCounters->BufferQuotaInMB = 10;
$configuration->DataSources->PerformanceCounters->ScheduledTransferPeriodInMinutes = 1;
// Update the configuration for the current running role
$manager->setConfigurationForRoleInstance(ROLE_ID,$configuration);
///////////////////////////////////////
// Bring in global include file
//require_once('setup.php');
// Grab all entities from the metrics table
$metrics = $table->retrieveEntities('WADPerformanceCountersTable');
// Loop through metric entities and display results
foreach ($metrics as $m) {
    echo $m->RoleInstance . " - " . $m->CounterName . ": " . $m->CounterValue . "<br/>";
}
This is the code I crafted to extract the processor info.
UPDATE
Do take a look at the following blog post: http://blog.maartenballiauw.be/post/2010/09/23/Windows-Azure-Diagnostics-in-PHP.aspx. I realize that it's an old post but I think this should give you some idea about implementing diagnostics in your role running PHP. The blog post makes use of PHP SDK for Windows Azure on CodePlex which I think is quite old and has been retired in favor of the new SDK on Github but I think the SDK code on Github doesn't have diagnostics implemented (and that's a shame).
ORIGINAL RESPONSE
Since performance counters data is stored in Windows Azure Table Storage, you could simply use Windows Azure SDK for PHP to query WADPerformanceCountersTable in your storage account to fetch this data.
I have written a blog post about efficiently fetching diagnostics data sometime back which you can read here: http://gauravmantri.com/2012/02/17/effective-way-of-fetching-diagnostics-data-from-windows-azure-diagnostics-table-hint-use-partitionkey/.
Update
Looking at your code above and the source code for TableRestProxy.php, you could include a query as the 2nd parameter to your retrieveEntities call. You could do something like:
$query = "(CounterName eq '\Processor(_Total)\% Processor Time` or CounterName eq '\TCPv4\Connections Established')
$metrics = $table->retrieveEntities('WADPerformanceCountersTable', $query);
Please note that my knowledge about PHP is limited to none so the code above may not work. Also, please ensure to include PartitionKey in your query to avoid full table scan.
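On the PartitionKey point: in WADPerformanceCountersTable the PartitionKey is, as far as I understand, the UTC timestamp expressed in .NET ticks and prefixed with "0", so a time-bounded query could look roughly like the sketch below. Treat the tick arithmetic and the key format as assumptions and verify them against your table.
<?php
// .NET ticks are 100ns intervals since 0001-01-01; the Unix epoch is at 621355968000000000 ticks.
$minutesBack  = 10;
$ticks        = (time() - $minutesBack * 60) * 10000000 + 621355968000000000;
$partitionKey = '0' . $ticks;

$query   = "(PartitionKey gt '" . $partitionKey . "') and (CounterName eq '\Processor(_Total)\% Processor Time')";
$metrics = $table->retrieveEntities('WADPerformanceCountersTable', $query);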
Storage Analytics Metrics aggregates transaction data and capacity data for a storage account. Transactions metrics are recorded for the Blob, Table, and Queue services. Currently, capacity metrics are only recorded for the Blob service. Transaction data and capacity data is stored in the following tables:
$MetricsCapacityBlob
$MetricsTransactionsBlob
$MetricsTransactionsTable
$MetricsTransactionsQueue
The above tables are not displayed when a listing operation is performed, such as the ListTables method. Each table must be accessed directly.
When you retrieve metrics, use these tables.
Ex:
$metrics = $table->retrieveEntities('$MetricsCapacityBlob');
URL:
http://msdn.microsoft.com/en-us/library/windowsazure/hh343264.aspx

How to "release" memory in loop?

I have a script that is running on a shared hosting environment where I can't change the available amount of PHP memory. The script is consuming a web service via soap. I can't get all my data at once or else it runs out of memory so I have had some success with caching the data locally in a mysql database so that subsequent queries are faster.
Basically instead of querying the web service for 5 months of data I am querying it 1 month at a time and storing that in the mysql table and retrieving the next month etc. This usually works but I sometimes still run out of memory.
my basic code logic is like this:
connect to web service using soap;
connect to mysql database
query web service and store result in variable $results;
dump $results into mysql table
repeat steps 3 and 4 for each month of data
The same variables are used in each iteration, so I would assume that each batch of results from the web service would overwrite the previous one in memory. I tried using unset($results) in between iterations but that didn't do anything. I am outputting the memory used with memory_get_usage(true) each time, and with every iteration the memory used increases.
Any ideas how I can fix this memory leak? If I wasn't clear enough leave a comment and I can provide more details. Thanks!
***EDIT
Here is some code (I am using nusoap not the php5 native soap client if that makes a difference):
$startingDate = strtotime("3/1/2011");
$endingDate = strtotime("7/31/2011");
// connect to database
mysql_connect("dbhost.com", "dbusername" "dbpassword");
mysql_select_db("dbname");
// configure nusoap
$serverpath ='http://path.to/wsdl';
$client = new nusoap_client($serverpath);
// cache soap results locally
while($startingDate<=$endingDate) {
$sql = "SELECT * FROM table WHERE date >= ".date('Y-m-d', $startingDate)." AND date <= ".date('Y-m-d', strtotime($startingDate.' +1 month'));
$soapResult = $client->call('SelectData', $sql);
foreach($soapResult['SelectDateResult']['Result']['Row'] as $row) {
foreach($row as &$data) {
$data = mysql_real_escape_string($data);
}
$sql = "INSERT INTO table VALUES('".$row['dataOne']."', '".$row['dataTwo']."', '".$row['dataThree'].")";
$mysqlResults = mysql_query($sql);
}
$startingDate = strtotime($startingDate." +1 month");
echo memory_get_usage(true); // MEMORY INCREASES EACH ITERATION
}
Solved it. At least partially. There is a memory leak using nusoap. Nusoap writes a debug log to a $GLOBALS variable. Altering this line in nusoap.php freed up a lot of memory.
change
$GLOBALS['_transient']['static']['nusoap_base']->globalDebugLevel = 9;
to
$GLOBALS['_transient']['static']['nusoap_base']->globalDebugLevel = 0;
I'd prefer to just use php5's native soap client but I'm getting strange results that I believe are specific to the webservice I am trying to consume. If anyone is familiar with using php5's soap client with www.mindbodyonline.com 's SOAP API let me know.
Have you tried unset() on $startingDate and mysql_free_result() for $mysqlResults?
Also SELECT * is frowned upon even if that's not the problem here.
EDIT: Also free the SOAP result too, perhaps. Some simple stuff to begin with to see if that helps.
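For example, a rough sketch of freeing the big per-iteration data explicitly (gc_collect_cycles() needs PHP 5.3+); whether it actually helps depends on where the leak is:
<?php
while ($startingDate <= $endingDate) {
    $soapResult = $client->call('SelectData', $sql);
    // ... insert the rows into MySQL ...
    unset($soapResult);  // drop the reference to the large response array
    gc_collect_cycles(); // force collection of circular references
    echo memory_get_usage(true), "\n";
    $startingDate = strtotime($startingDate . ' +1 month');
}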

PHP + Mysql queries for a real Beginner

After years of false starts, I'm finally diving head first into learning to code PHP. After about 10 failed previous attempts to learn, it's getting exciting and finally going fairly well.
The project I'm using to learn with is for work. I'm trying to import 100+ fixed-width text files into a MySQL database.
So far so good
I'm getting comfortable with sql, and I'm learning some php tricks, but I'm not sure how to tie all the pieces together. The basic structure for what I want to do goes something like the following:
Name the text file I want to import
Do a LOAD DATA INFILE to import the data into a single field in a temporary table
Use substring() to separate the fixed width file into real columns
Remove lines I don't want (file identifiers, subtotals, etc....)
Add the data from the temp table to the main table
Drop the temp db and start again
As you can see in the attached code, things are working fine. It gets the new file, imports it into the temp table, removes unwanted lines and then moves the content to the final main database. Perfect.
Questions three
My two questions are:
Am I doing this 'properly'? When I want to run a pile of queries one after another, do I keep assigning mysql_query to random variables?
How would I go about automating the script to loop through every file there and import them? Rather than have to change the file name and run the script every time.
And, last, what PHP function would I use to 'select' the file(s) I want to import? You know, like attaching a file to an email -> Browse for file, upload it, and then run the script on it?
Sorry for this being an ultra-beginner question, but I'm having trouble seeing how all the pieces fit together. Specifically, I'm wondering how multiple SQL queries get strung together to form a script. The way I've done it below? Some other way?
Thanks x 100 for any insights!
Terry
<?php
// 1. Create db connection
$connection = mysql_connect("localhost","root","root") or die("DB connection failed:" . mysql_error());
// 2. Select the database
$db_select = mysql_select_db("pd",$connection) or die("Couldn't select the database:" . mysql_error());
?>
<?php
// 3. Perform db query
// Drop table import if it already exists
$q="DROP table IF EXISTS import";
//4. Make new import table with just one field
if ($newtable = mysql_query("CREATE TABLE import (main VARCHAR(700));", $connection)) {
echo "Table import made successfully" . "<br>";
} else{
echo "Table import was not made" . "<br>";
}
//5. LOAD DATA INFILE
$load_data = mysql_query("LOAD DATA INFILE '/users/terrysutton/Desktop/importmeMay2010.txt' INTO table import;", $connection) or die("Load data failed" . mysql_error());
//6. Cleanup unwanted lines
if ($cleanup = mysql_query("DELETE FROM import WHERE main LIKE '%GRAND%' OR main LIKE '%Subt%' OR main LIKE '%Subt%' OR main LIKE '%USER%' OR main LIKE '%DATE%' OR main LIKE '%FOR:%' OR main LIKE '%LOCATION%' OR main LIKE '%---%' OR `main` = '' OR `main` = '';")){
echo "Table import successfully cleaned up";
} else{
echo "Table import was not successfully cleaned up" . "<br>";
}
// 7. Next, make a table called "temp" to store the data before it gets imported to denominators
$temptable = mysql_query("CREATE TABLE temp
SELECT
SUBSTR(main,1,10) AS 'Unit',
SUBSTR(main,12,18) AS 'Description',
SUBSTR(main,31,5) AS 'BD Days',
SUBSTR(main,39,4) AS 'ADM',
SUBSTR(main,45,4) AS 'DIS',
SUBSTR(main,51,4) AS 'EXP',
SUBSTR(main,56,5) AS 'PD',
SUBSTR(main,100,5) AS 'YTDADM',
SUBSTR(main,106,5) AS 'YTDDIS',
SUBSTR(main,113,4) AS 'YTDEXP',
SUBSTR(main,118,5) AS 'YTDPD'
FROM import;");
// 8. Add a column for the date
$datecolumn = mysql_query("ALTER TABLE temp ADD Date VARCHAR(20) AFTER Unit;");
$date = mysql_query("UPDATE temp SET Date='APR 2010';");
// 8. Move data from the temp table to its final home in the main database
// Append data in temp table to denominator table
$append = mysql_query("INSERT INTO denominators SELECT * FROM temp;");
// 9. Drop import and temp tables to start from scratch.
$droptables = mysql_query("DROP TABLE import, temp;");
// 10. Next, rename the text file to be imported and do the whole thing over again.
?>
<?php
// 5. Close connection
mysql_close($connection);
?>
If you have access to the command line, you can do all your data loading right from the mysql command line. Further, you can automate the process by writing a shell script. Just because you can do something in PHP doesn't mean you should.
For instance, you can just install phpMyAdmin, create your tables on the fly, then use mysqldump to dump your database definitions to a file, like so:
mysqldump -u myusername -pmypassword mydatabase > mydatabase.backup.sql
later, you can then just reload the whole database
mysql -u myusername -pmypassword < mydatabase.backup.sql
It's cool that you are learning to do things in PHP, but focus on doing the stuff you will do in PHP regularly rather than doing RDBMS stuff in PHP, which is not where you should do it most of the time anyway. Build forms, and process the data. Learn how to build objects, and why you might want to do that. Head over and check out Symfony and Doctrine. Learn about the Front Controller pattern.
Also, look into PDO. It is very "bad form" to use the direct mysql_query() functions anymore.
Finally, PHP is great for templating and including disparate parts to form a cohesive whole. Practice making a left and top navigation html file. Figure out how you can include that one file on all your pages so that your same navigation shows up everywhere.
Then figure out how to look at variables like the page name and highlight the navigation tab you are on. Those are the things PHP is well suited for.
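Since PDO comes up here, a tiny sketch of what a parameterised query looks like with it (connection details and the query are placeholders):
<?php
$pdo = new PDO('mysql:host=localhost;dbname=pd', 'root', 'root');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('SELECT * FROM denominators WHERE Unit = ?');
$stmt->execute(array('SOMEUNIT')); // the value is safely bound, no manual escaping
foreach ($stmt as $row) {
    // each $row contains the columns of one matching record
}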
Why don't you load the files and process them in PHP, and use it to insert values in the actual table?
Ie:
$data = file_get_contents('somefile');
// process data here, say you dump it into a 2d array like
// $insert[$rows][$cols]
// then you can insert these into the db, one query per row
// (mysql_query() cannot run several statements in one call):
foreach ($insert as $row) {
    $query = "INSERT INTO table VALUES ('{$row[1]}', '{$row[2]}', '{$row[3]}')";
    mysql_query($query);
}
The purpose behind assigning mysql_query to a variable is so that you can get the data you were querying for. For any query other than a SELECT, it only returns true or false.
So in the case where you are using if ($var = mysql...), you do not need the variable assignment there at all, since the function already returns true or false.
Also, I feel like all your substring and data file processing would be much better done in PHP. You can look into the fopen function and the related functions on the left side of that page.
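To answer the automation part of the question, one way (a sketch; the directory path is an assumption) is to loop over the files with glob() and run the same import steps for each file:
<?php
// Reuse $connection from the script above.
foreach (glob('/users/terrysutton/Desktop/imports/*.txt') as $file) {
    mysql_query("DROP TABLE IF EXISTS import", $connection);
    mysql_query("CREATE TABLE import (main VARCHAR(700))", $connection);
    mysql_query("LOAD DATA INFILE '" . $file . "' INTO TABLE import", $connection);
    // ... then the cleanup, SUBSTR() split and INSERT INTO denominators steps from above ...
}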
