Statistical analysis on large data set to be published on the web - php

I have a non-computer-related data logger that collects data from the field. The data is stored as text files, and I manually lump the files together and organize them. The current format is one csv file per year per logger. Each file is around 4,000,000 lines x 7 loggers x 5 years = a lot of data. Some of the data is organized into bins (item_type, item_class, item_dimension_class), and other data is more unique, such as item_weight, item_color, date_collected, and so on.
Currently, I do statistical analysis on the data using a python/numpy/matplotlib program I wrote. It works fine, but the problem is, I'm the only one who can use it, since it and the data live on my computer.
I'd like to publish the data on the web using a postgres db; however, I need to find or implement a statistical tool that'll take a large postgres table, and return statistical results within an adequate time frame. I'm not familiar with python for the web; however, I'm proficient with PHP on the web side, and python on the offline side.
Users should be allowed to create their own histograms and data analyses. For example, one user could search for all blue items shipped between week x and week y, while another could pull the weight distribution of all items by hour across the whole year.
I was thinking of creating and indexing my own statistical tools, or automating the process somehow to emulate most queries, but that seemed inefficient.
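To make this concrete, the kind of thing I have in mind is pushing the aggregation into Postgres and only pulling the bucketed results back into PHP. Here is a minimal sketch; the items table, its columns, and the connection details are hypothetical placeholders:

    <?php
    // Sketch: compute a 20-bucket weight histogram inside Postgres and fetch
    // only the bucket counts into PHP. Table and column names are placeholders.
    $pdo = new PDO('pgsql:host=localhost;dbname=fielddata', 'user', 'pass');

    $sql = "SELECT width_bucket(item_weight, 0, 1000, 20) AS bucket,
                   count(*)                               AS n
            FROM   items
            WHERE  item_color = :color
              AND  date_collected BETWEEN :start AND :end
            GROUP  BY bucket
            ORDER  BY bucket";

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':color' => 'blue', ':start' => '2012-03-05', ':end' => '2012-03-11']);

    // At most ~20 rows come back to PHP regardless of table size; an index on
    // (item_color, date_collected) keeps the query itself fast.
    $histogram = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);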
I'm looking forward to hearing your ideas. Thanks.

I think you can fully utilize your current combination (python/numpy/matplotlib) if the number of users is not too big. I do some similar work, and my data size is a little more than 10 GB. The data is stored in a few sqlite files, and I use numpy to analyze it, PIL/matplotlib to generate chart files (png, gif), cherrypy as a web server, and mako as a template language.
If you need a more capable client/server database, you can migrate to postgresql, but you can still fully use your current programs if you go with a python web framework like cherrypy.

Related

Is there a downside to creating a table for each import to save time upon search?

I'm making a web application with a very specific, albeit fairly simple set of requirements.
The user must be able to upload 100+ spreadsheets, each containing 40-200k rows.
The user must be able to perform near real-time searches (< 0.5 seconds) on a specific spreadsheet import.
Because the application does not need to search all the spreadsheets a user has uploaded, but rather just a single spreadsheet designated beforehand, it seems that the most efficient approach would be to import each spreadsheet into its own table. That way a search only has to run through at most 200k hands, as opposed to running through a table of potentially millions of rows by way of a pivot table, for example. Am I thinking about this correctly?
My plan would be to create a table named Username_SpreadsheetName, import the spreadsheet into it, and simply select from it when it comes time to search.
I know this works, and it works well, but it seems really quick and dirty to me. Is there a cleaner way to achieve a similar level of efficiency that I'm not considering?
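For reference, here is roughly what the quick-and-dirty version looks like; the table and column names are just placeholders for illustration:

    <?php
    // Sketch of the per-import approach described above; names are placeholders.
    $pdo = new PDO('mysql:host=localhost;dbname=imports;charset=utf8mb4', 'user', 'pass');

    // On upload: one table per user + spreadsheet, indexed on the column we search by.
    $table = 'user123_spreadsheet42';   // derived from username + spreadsheet name
    $pdo->exec("CREATE TABLE `$table` (
                    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
                    search_key VARCHAR(64) NOT NULL,
                    payload    TEXT,
                    INDEX (search_key)
                )");
    // ...bulk-insert the 40-200k rows here (LOAD DATA INFILE or batched INSERTs)...

    // On search: only this one table is touched, so the index keeps it well under 0.5 s.
    $stmt = $pdo->prepare("SELECT * FROM `$table` WHERE search_key = :k");
    $stmt->execute([':k' => 'some value']);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);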

Should I migrate local JSON database to MariaDB?

Fictional story:
I have a car website containing 200,000+ car listings from around the United States. I get my data from two sources, CarSeats and CarsPro, which are updated nightly. Each source contains about 100,000 detailed listings in JSON format. The combined size of both feeds is about 8 GB, and I plan to incorporate more used-car sources in the near future.
The current JSON data contains everything I need to display car info, from car search to car purchase. However, the JSON DB is stored locally, and I use PHP's file_get_contents() to grab the appropriate metadata for each listing. It takes about 8 to 12 seconds to return 200 cars, which is not bad, but I know there is room for improvement.
My Question:
Will migrating my data from localized JSON files to MariaDB 10.1 be a best practice move? Is that the scalable alternative for the future?
What should my stack look like to improve speed and improve search capabilities?
Note:
Forge installs MariaDB on the instance you boot.
The 8GB is chunked by "car make" into over 20 different files. No individual file is larger than 400MB.
Currently using
Laravel 5.2
Forge
PHP 5.6.10
Servers from AWS
Will migrating my data from localized JSON files to MariaDB 10.1 be a best practice move? Is that the scalable alternative for the future?
What should my stack look like to improve speed and improve search capabilities?
Yes. The whole purpose of a database is to make the storage—and usage—of data like this easier in the long run.
Each time you load a JSON file in PHP, PHP has to parse the data, and I highly doubt 200,000 listings comprising 8 GB of data will ever work well as a file loaded into PHP memory from the file system. PHP will most likely die (i.e., hit its memory limit and throw an error) just attempting to load the file. Sorting and manipulating that data in PHP in that raw state is even less efficient.
Storing that JSON data in a database of some kind—MariaDB, MySQL, MongoDB, etc…—is the only practical and best practice way to handle something like this.
The main reason anyone would repeatedly load a local JSON file into PHP would be for small tests and development ideas. On a practical level it's inefficient, but when you are at an early stage of development and don't yet want to build a process for importing a large JSON file into an actual database, a small sample file of data can be useful for hashing out basic concepts and ideas.
But no "best practice" would ever recommend repeatedly reading a file of that size from the filesystem; it's honestly a very bad idea.
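As a rough illustration of what the nightly import could look like, looping over the per-make feed files and upserting them into a listings table; the table name, columns, and JSON field names are assumptions, not your actual feed format:

    <?php
    // Nightly import sketch: per-make JSON feed files -> MariaDB.
    // Table name, columns, and the JSON field names are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=cars;charset=utf8mb4', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $stmt = $pdo->prepare(
        "REPLACE INTO listings (listing_id, make, model, year, price, json_blob)
         VALUES (:id, :make, :model, :year, :price, :blob)"
    );

    foreach (glob('/feeds/*.json') as $file) {      // one file per car make, <400 MB each
        $cars = json_decode(file_get_contents($file), true);

        $pdo->beginTransaction();
        foreach ($cars as $car) {
            $stmt->execute([
                ':id'    => $car['id'],
                ':make'  => $car['make'],
                ':model' => $car['model'],
                ':year'  => $car['year'],
                ':price' => $car['price'],
                ':blob'  => json_encode($car),      // keep the full record for the detail page
            ]);
        }
        $pdo->commit();
    }
    // With indexes on make, price, etc., the search page becomes a simple
    // SELECT ... WHERE ... LIMIT 200 instead of decoding 8 GB of JSON per request.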
You could also look at Apache Solr, which will improve search and the handling of textual data.
A nice point is that you can still use file_get_contents() to send it queries, and its query results are in JSON format by default.
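For example, a query against a hypothetical "listings" core could look like this (the core name and field names are assumptions):

    <?php
    // Sketch: query a Solr core over HTTP and decode the JSON response.
    // The core name ("listings") and field names are assumptions.
    $params = http_build_query([
        'q'    => 'make:Honda AND price:[5000 TO 15000]',
        'rows' => 200,
        'wt'   => 'json',
    ]);
    $response = file_get_contents("http://localhost:8983/solr/listings/select?" . $params);
    $result   = json_decode($response, true);

    $cars = $result['response']['docs'];   // the matching listings as PHP arrays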

RRD Tool Structure and understanding

I am currently monitoring 5 different buildings; each building has around 300 rooms. Each room has 4 sensors: three that monitor temperature at different points of the room and one for the amount of power (kilowatts) the room is consuming.
I am currently polling every sensor every 15 minutes, which produces 576,000 entries per day, and the number of buildings I am monitoring is soon going to increase.
I am currently storing all the information in MySQL, I have a MySQL table for each sensor type so the tables are named 'power', 'temp1', 'temp2', 'temp3'. The columns within these tables are 'id', 'building_id', 'epoch', 'value'.
I then use this data to produce graphs with the Chart.js library and statistics such as the amount of power used per building within a certain time period, etc. I do all this using PHP.
I don't believe my MySQL database is going to be able to handle this without serious scaling and clustering.
I need to be able to view historic data for 5 years, although some of the granularity can be lost after a certain period of time.
I have been informed that RRD might be able to solve my problem and have done some research on it but still have some questions.
Will it still allow me to create my own graphs, specifically using the Chart.js library? If I can get time/value JSON data out of the RRD, this should be OK.
Also, how many different RRD files will I need to create? Would I need one per building? Per room? Per sensor? Is it still going to be easy to manage?
I have PHP scripts which run at 15-minute intervals that pull the data from the sensors using SNMP and then insert it into MySQL. If I can use the same scripts to also insert into the RRD, that would be great; from what I have seen, you can use PHP to insert into RRD, so that should be OK.
EDIT: I am currently reading http://michael.bouvy.net/blog/en/2013/04/28/graph-data-rrdtool-sensors-arduino/ which has started to answer some of my questions.
Whether you have one RRD file with 6000 metrics, or 5 files with 1200 metrics each, etc., depends on how you are managing the data.
Firstly, you should not group together metrics whose samples arrive at different points in time. So, if you sample one room at a time, you should probably have one RRD file per room (with 4 metrics in it). This will depend on what manages your sensors: whether you have one device per room or per building. Retrieving the data and graphing it works whether you have one file or a thousand (though the 'thousand' scenario works much better in the latest version of RRDTool).
Secondly, are you likely to add new data points (i.e., buildings or rooms)? You cannot (easily) add new metrics to an existing RRD file. So, if you expect to add a new building in the future, or to add or remove a room, then maybe one RRD per building or one per room would be better.
Without having any more information, I would guess that you'd be better off with one RRD per room (containing 4 metrics) and update them separately. Name the files according to the building and room IDs, and they can hold the power and 3 temperature values according to Epoch.
For graphing, RRDTool is of course capable of creating its own graphs by accessing the data directly. However, if you want to extract the data and put them into a graph yourself this is possible; the Xport function will allow you to pull out the necessary datapoints (possibly from multiple RRD files and with aggregation) which you can then pass to the graphing library of your choice. There is also the Fetch function if you want the raw data.
If your data samples are coming at 15-minute intervals, make sure you set your RRD Interval, Heartbeat and RRAs up correctly. In particular, the RRAs specify what aggregation is performed and how long data are kept at each granularity. The RRAs should in general correspond to the resolutions you expect to graph the data at (which is why people generally use 5min/30min/2h/1d, as these correspond nicely to daily, weekly, monthly and yearly graphs at a 400px width).
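As a rough sketch of what the per-room setup could look like from your existing 15-minute PHP poller; the file name, retention periods, and the poll_room_sensors() helper are hypothetical stand-ins for your own code:

    <?php
    // Sketch: one RRD per room holding 4 metrics at a 900 s (15 min) step.
    // Retention: 15-min samples for 1 year, hourly for 5 years, daily for 5 years.
    $rrd = '/var/rrd/building1_room042.rrd';

    if (!file_exists($rrd)) {
        exec('rrdtool create ' . escapeshellarg($rrd) . ' --step 900 ' .
             'DS:power:GAUGE:1800:0:U ' .
             'DS:temp1:GAUGE:1800:-50:100 ' .
             'DS:temp2:GAUGE:1800:-50:100 ' .
             'DS:temp3:GAUGE:1800:-50:100 ' .
             'RRA:AVERAGE:0.5:1:35040 ' .    // 15-min samples, 1 year
             'RRA:AVERAGE:0.5:4:43800 ' .    // 1-hour averages, 5 years
             'RRA:AVERAGE:0.5:96:1825');     // daily averages, 5 years
    }

    // Called from the existing SNMP poll, right after the MySQL insert.
    // poll_room_sensors() is a stand-in for your current SNMP code.
    list($power, $temp1, $temp2, $temp3) = poll_room_sensors('building1', 'room042');
    exec('rrdtool update ' . escapeshellarg($rrd) . " N:$power:$temp1:$temp2:$temp3");

    // For Chart.js: xport pulls time/value pairs back out (use --json on recent
    // RRDTool versions, or parse the default XML output on older ones).
    exec('rrdtool xport --start -1d --end now --json ' .
         'DEF:p=' . escapeshellarg($rrd) . ':power:AVERAGE XPORT:p:power', $out);
    $powerSeries = implode("\n", $out);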
You might want to take a look at time-series databases and test a few systems that have built-in visualization, an API that allows you to perform aggregations and PHP wrappers. Time-series databases are optimized for efficient storage of timestamped data and have built-in functionality for time-series transformations.
https://en.wikipedia.org/wiki/Time_series_database

Best way to deal with 900,000 record database and zip codes?

A company we do business with wants to give us a 1.2 gb CSV file every day containing about 900,000 product listings. Only a small portion of the file changes every day, maybe less than 0.5%, and it's really just products being added or dropped, not modified. We need to display the product listings to our partners.
What makes this more complicated is that our partners should only be able to see product listings available within a 30-500 mile radius of their zip code. Each product listing row has a field for the product's actual radius (some are only 30, some 100, some 500, etc.; 500 is the max). A partner in a given zip code is likely to only have 20 results or so, meaning that there's going to be a ton of unused data. We don't know all the partner zip codes ahead of time.
We have to consider performance, so I'm not sure what the best way to go about this is.
Should I have two databases: one with zip codes and latitude/longitude, using the Haversine formula for calculating distance, and the other the actual product database? And then what do I do: return all the zip codes within a given radius and look for a match in the product database? For a 500-mile radius that's going to be a ton of zip codes. Or should I write a MySQL function?
We could use Amazon SimpleDB to store the database...but then I still have this problem with the zip codes. I could make two "domains" as Amazon calls them, one for the products, and one for the zip codes? I don't think you can make a query across multiple SimpleDB domains, though. At least, I don't see that anywhere in their documentation.
I'm open to some other solution entirely. It doesn't have to be PHP/MySQL or SimpleDB. Just keep in mind our dedicated server is a P4 with 2 gb. We could upgrade the RAM, it's just that we can't throw a ton of processing power at this. Or even store and process the database every night on a VPS somewhere where it wouldn't be a problem if the VPS were unbearably slow while that 1.2 gb CSV is being processed. We could even process the file offline on a desktop computer and then remotely update the database every day...except then I still have this problem with zip codes and product listings needing to be cross-referenced.
You might want to look into PostgreSQL and PostGIS. It has features similar to MySQL's spatial indexing, without the need to use MyISAM (which, in my experience, tends to become corrupt, as opposed to InnoDB).
In particular, Postgres 9.1 allows k-nearest-neighbour search queries using GiST indexes.
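For instance, with the zip centroids and product locations stored as geography columns, the per-product radius check becomes a single query; the table and column names here are assumptions:

    <?php
    // Sketch: all products whose own radius reaches the partner's zip code.
    // Assumes a zips table (zip, geog) and a products table (geog, radius_miles);
    // names are illustrative. geography distances are in metres, hence * 1609.34.
    $pdo = new PDO('pgsql:host=localhost;dbname=catalog', 'user', 'pass');

    $sql = "SELECT p.*
            FROM   products p
            JOIN   zips     z ON z.zip = :partner_zip
            WHERE  ST_DWithin(p.geog, z.geog, p.radius_miles * 1609.34)";

    $stmt = $pdo->prepare($sql);
    $stmt->execute([':partner_zip' => '90210']);
    $products = $stmt->fetchAll(PDO::FETCH_ASSOC);
    // A GiST index on products.geog lets PostGIS prune most of the 900,000 rows
    // before doing any exact distance math.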
Well, that is an interesting problem indeed.
This is actually two issues: first, how you should index the database, and second, how you keep it up to date. The first you can achieve as you describe, but normalization may or may not be a problem, depending on how you are storing the zip code. This primarily comes down to what your data looks like.
As for the second one, this is more my area of expertise. You can have your client send you the CSV as they currently do, keep a copy of yesterday's file, and run the two through a diff utility, or you can leverage Perl, PHP, Python, Bash, or any other tool you have to find the lines that have changed. Pass those into a second step that updates your database. I have dealt with clients with issues along these lines, and scripting it away tends to be the best choice. If you need help organizing your script, that help is always available.
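A rough sketch of the scripted version in PHP, assuming the product ID is in the first CSV column (file paths and column layout are placeholders):

    <?php
    // Sketch: find added and dropped product IDs between yesterday's and today's
    // feed, holding only the IDs in memory rather than the full 1.2 GB of rows.
    // Assumes the product ID is the first CSV column.
    function idSet($path) {
        $ids = [];
        $fh  = fopen($path, 'r');
        while (($row = fgetcsv($fh)) !== false) {
            $ids[$row[0]] = true;            // keep only the key, not the whole row
        }
        fclose($fh);
        return $ids;
    }

    $yesterday = idSet('/data/products-yesterday.csv');
    $today     = idSet('/data/products-today.csv');

    $added   = array_diff_key($today, $yesterday);   // rows to insert
    $dropped = array_diff_key($yesterday, $today);   // rows to delete
    // With ~0.5% churn, only a few thousand rows ever have to touch the database.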

Storing And Displaying Stats

I am going to be writing some software in PHP to parse log files and aggregate the data, then display it in graphs (as in bar graphs, not vertices and edges).
Yeah, it's basically business intelligence software, which my company has an entire team for, but apparently they don't do a great job (10 minutes to load a page just doesn't do it).
Here is what I have to do:
The log files are data files which store the raw data from a stats server we have set up in our office (we send asynchronous calls to the stats server, kind of like Google Analytics). It stores the data in CSV format.
Write a script to parse the files and aggregate the data into a database (or I was thinking about redis).
There will be millions and millions of things to aggregate, so when displaying the stats it must be fast.
I know about OLAP for the DB, but if I want to go with redis, do you think it would scale for large volumes of data? To parse the files, do you think a PHP script would suffice, or should I go with something faster like C/C++?
Basically I would like to get some interesting ideas about different ways to accomplish my task. It must be fast and it must scale.
Any ideas?
It sounds like at the scales you're talking about, you need to separate the data aggregation and display. That is, you should have some process working to receive the log files when they're generated, parse them and insert the data into the database; that will be a long, complicated task. Then when a user wants to display a graph of the data, they can make a request to the PHP server, which will pull the data from the database and construct the display they want. In this way, your parsing is separated from your display request (although it's still serially dependent, your parsing can begin when the logfiles become available, and therefore, the lag involved in parsing them is hidden at display time).
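A minimal sketch of that split, with a hypothetical daily_stats table standing in for whatever aggregates you actually need (the table, its columns, the file path, and the CSV layout are all assumptions):

    <?php
    // aggregate.php -- the slow path, run from cron whenever new log files land.
    // daily_stats is assumed to have a unique key on (stat_date, metric).
    $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
    $stmt = $pdo->prepare(
        "INSERT INTO daily_stats (stat_date, metric, total)
         VALUES (:day, :metric, :total)
         ON DUPLICATE KEY UPDATE total = total + VALUES(total)"
    );

    $totals = [];
    $fh = fopen('/logs/latest.csv', 'r');            // placeholder path
    while (($row = fgetcsv($fh)) !== false) {
        list($timestamp, $metric, $value) = $row;    // assumed column layout
        $day = date('Y-m-d', (int) $timestamp);
        if (!isset($totals[$day][$metric])) {
            $totals[$day][$metric] = 0;
        }
        $totals[$day][$metric] += (float) $value;    // aggregate in memory first
    }
    fclose($fh);

    foreach ($totals as $day => $metrics) {
        foreach ($metrics as $metric => $total) {
            $stmt->execute([':day' => $day, ':metric' => $metric, ':total' => $total]);
        }
    }

    // display.php -- the fast path: pages only ever read the pre-aggregated rows,
    // e.g. SELECT stat_date, total FROM daily_stats WHERE metric = 'pageviews' ORDER BY stat_date.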
