PHP reading and plotting large datasets

I have 5-second data for a year in an ASCII file
each line is a reading...
timestamp, value
for 6 million lines
I want to display this data in a chart or multiple charts in a web browser
I considered a choice of 3 charts
1 - last 500 data points at maximum (5 second resolution)
2 - last 500 points at 15 min resolution
3 - all data at various resolutions
etc
being wary of a) the time to read the file, b) processing time, and c) the amount of data to download to the browser (and the time that takes) for JavaScript plotting
Can PHP do direct-access reads from a file?
More to the point, these big-dataset plotting problems must be quite common, so how do people get around them?

Yes, PHP can read files on the filesystem, and a database can be very helpful alongside reading the data from the file with PHP.
What would you like to do with the data - as in, what are you really looking for? Are you looking for peaks, averages, values greater than average etc.?
Generic analysis
Perhaps the answer is - I don't know; I want to look at the data and see what stands out. Fair enough. In that case you could have a web page that uses something like a stock chart. Show the last 1000 records.
To get the last 1000 records, you could use a command such as tail (on Linux-y systems) or PowerShell on Windows to get the last 1000 rows, then parse them with PHP, shove them into an array or object, and show them on screen using JavaScript or PHP charting tools.
When the user changes the selection, re-read the file and display the relevant records. This can be taxing because the file is reprocessed on every request.
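For the "last N records" read, PHP can seek backwards from the end of the file instead of scanning all 6 million lines. A minimal sketch (the function name and chunk size are just illustrative):

```php
<?php
// Read the last $n lines of a large file without loading it all into
// memory, by seeking backwards from the end in fixed-size chunks.
function tailLines(string $path, int $n, int $chunkSize = 4096): array
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return [];
    }
    fseek($fh, 0, SEEK_END);
    $pos = ftell($fh);
    $buffer = '';
    // Keep prepending chunks from the end until we have enough newlines
    // (or we reach the start of the file).
    while ($pos > 0 && substr_count($buffer, "\n") <= $n) {
        $read = min($chunkSize, $pos);
        $pos -= $read;
        fseek($fh, $pos);
        $buffer = fread($fh, $read) . $buffer;
    }
    fclose($fh);
    $lines = explode("\n", rtrim($buffer, "\n"));
    // Drop the possibly-partial first line and keep the last $n.
    return array_slice($lines, -$n);
}
```

Because it seeks from the end, the cost depends on how many lines you ask for, not on the size of the file.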
Non-PHP method
A faster non-PHP alternative would be to use an in-memory business intelligence tool like QlikView (free download for personal use, I think). The learning curve is not steep, ... and I have no affiliation with QlikView. Tableau and Spotfire are other tools that can be easy to use and can make analysis of large datasets relatively easy.
Specific analysis
If your interest is to find the number of days each month with sales of $1 million or above, you could do a single pass on the file, extract all lines with sales >= 1MM, and store them in an array of date and sales. Then pass through the array and output a file with Year, Month, Sales. That would be the pre-processing of the data.
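That pre-processing pass might look something like this in PHP (the file layout - one `date,sales` pair per line - and the $1MM threshold are assumptions for illustration):

```php
<?php
// One-pass pre-processing sketch: read "date,sales" lines, keep rows at or
// above the threshold, roll them up by Year-Month, and write a small
// summary CSV that the presentation layer can read cheaply.
function summarizeBigSales(string $inPath, string $outPath, float $threshold = 1000000.0): int
{
    $summary = [];
    $in = fopen($inPath, 'rb');
    while (($row = fgetcsv($in)) !== false) {
        [$date, $sales] = $row;
        if ((float)$sales < $threshold) {
            continue;
        }
        $key = substr($date, 0, 7); // "YYYY-MM"
        $summary[$key] = ($summary[$key] ?? 0) + (float)$sales;
    }
    fclose($in);
    ksort($summary);
    $out = fopen($outPath, 'wb');
    fputcsv($out, ['Year', 'Month', 'Sales']);
    foreach ($summary as $ym => $total) {
        [$y, $m] = explode('-', $ym);
        fputcsv($out, [$y, $m, $total]);
    }
    fclose($out);
    return count($summary); // number of Year-Month rows written
}
```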
Then, the web application or your presentation layer can pull this data and show information in bar charts or whatever else. Javascript charting libraries like amcharts, d3, highcharts etc. can be used, or PHP charting libraries like jpGraph and such can be used to read pre-processed data on the fly and show them.
If the data has to be looked at from multiple angles - such as a table showing the top 10 products sold, a scatter plot of # of orders vs. $ ordered, etc. - all of it could be shoved into a database and then pulled onto the screen. As Mark Baker commented, appropriate indexes will be necessary to pull the data efficiently.
Prepare specific datasets in batch
Some climate research centers in the US run programs that churn through millions of records, like you have, at night to create graphs, charts, maps etc. and then use web applications to display them. For example High Plains Regional Climate Center and Iowa Mesonet do that regularly. You could do something similar with PHP.
Databases are my favorite. I prefer to massage the textual data, eliminate what I don't want, and push the relevant material into a database. PHP can then use rollup, top-n, group-by, etc. within the database to extract data and present it on screen - primarily through a web interface.
If you have specific questions about toolset or so relevant to this question, feel free to comment. If a new question crops up in your mind, feel free to add a new question altogether to solicit diverse answers.

Related

Will using an array for map tiles in a web game in this manner be too resource intensive?

I'm working on a PHP-based browser RPG in which the player moves on a grid map. My maps are defined via a two dimensional matrix with each value being a 21 character string of letters and numbers. This string is an encoded value that tells the game what happens on that tile, and it gives me a wide range of options for additional features in the future. I considered using just integers for this value, but I decided that I wanted to retain some readability of the maps so I could still eyeball files.
I currently output the map with a map creation script into a .csv file; the player movement class then opens that map file, finds the player's current coordinates, and moves in the direction the player indicated.
My current map sizes top out at 200x300 tiles, each tile containing that grid's value (AAAAB111CCC2222223333); however, I would like to keep the ability to make these up to 1000x1000 in the future (larger than that would likely see maps for regions of an overall world).
I store the player coordinates in a user status database, so I will just read the lines of the .csv map that are within the range of the player's movement and prevent the entire map from being loaded into a variable every time a player moves. However, this could still be an 11x1000 grid with those 21-character codes in each tile. After finding the result of each movement I will unset the array.
Even with my precautionary measures, I am concerned that this will become too much of a resource burden in the future if many users are playing at the same time, and I wonder if I should store map information in a database instead.
The rest of my user and game data is stored in a few large databases. When I started working on my game, I didn't believe that these map grid data sets were complex enough to warrant putting them into their own database, and it seemed like using an array for movement would be easy ($location[$x][$y]). However, now I'm wondering if using .csv files will hurt my performance or not.
Reading from a CSV file will definitely be more work, especially if you increase the map size. The problem I see with loading only portions of the CSV is that you still have to parse the data to get to the player's x,y coordinates, and then (afterwards or at the same time) parse each string to find out whether it is included in the set near the player.
And as you say, you are expecting more players as time goes on. The number of reads needed to open and parse the file would grow quickly and cause problems unless you had some form of caching available.
So, I would say go with a database, and the sooner the better. And yes, using a CSV will hurt performance.
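To illustrate the database side: with the map in a `tiles` table, fetching just the window around the player becomes a single indexed query instead of a file parse. A sketch using SQLite via PDO (the table and column names are hypothetical):

```php
<?php
// Fetch only the tiles within $radius of the player's position, returning
// a $grid[$x][$y] => code array like the one the game already uses.
function tilesAround(PDO $db, int $x, int $y, int $radius = 5): array
{
    $stmt = $db->prepare(
        'SELECT x, y, code FROM tiles
         WHERE x BETWEEN :x1 AND :x2 AND y BETWEEN :y1 AND :y2'
    );
    $stmt->execute([
        ':x1' => $x - $radius, ':x2' => $x + $radius,
        ':y1' => $y - $radius, ':y2' => $y + $radius,
    ]);
    $grid = [];
    foreach ($stmt as $row) {
        $grid[(int)$row['x']][(int)$row['y']] = $row['code'];
    }
    return $grid;
}
```

With an index on (x, y), this stays fast no matter how large the map grows - which is exactly what the CSV approach can't offer.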

How could I build a statistical map?

I'm trying to figure out how to build a statistical map for my web app. Here's what I've got:
I have a MySQL database of zip codes, and each zip code has latitude & longitude.
I have users who have declared what zip code they live in.
I even have a haversine query which will show how many users exist within, for example, 25 miles of a given latitude/longitude, based on their zip code.
My question is this: Using this information, how could I approach building a statistical map for a web application using PHP?
I would be fine with using just a US map or even a North American map for now, but I'm just not sure how to build that map. Some options I've considered:
Show a colored dot on the map, larger or smaller depending on the number of users near that location. I'm not sure how to do this, though, especially if those dots were to overlap!
Show individual "pushpins" where the users are. Seems like this could get out of hand if my user base grows significantly.
So back to my question. If I had 300 users in Dallas, 4,000 in NYC, 45 users in Detroit, 403 in Chicago... how would I be able to represent that on a map -- and also how would I draw that map in a web application built on PHP?
You are trying to build a three-dimensional (probably even more dimensions) data display.
Your dimensions are:
X-Location
Y-Location
The value at every location
This really does not define anything about the visual appearance, though.
A simple approach might be to calculate the absolute number of users per state and then color each state on the map according to some scale. You might also calculate the percentage of users living in each state relative to the total and color by that instead.
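A sketch of that per-state rollup, assuming the zip code table also carries a state column (the table and column names are hypothetical; SQLite via PDO is used here only to keep the example self-contained - yours would be MySQL):

```php
<?php
// Count users per state by joining the users' declared zip codes against
// the zip code table, letting the database do the aggregation.
function usersPerState(PDO $db): array
{
    $sql = 'SELECT z.state, COUNT(*) AS users
            FROM users u
            JOIN zipcodes z ON z.zip = u.zip
            GROUP BY z.state';
    // FETCH_KEY_PAIR turns the two columns into a state => count map.
    return $db->query($sql)->fetchAll(PDO::FETCH_KEY_PAIR);
}
```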
A different approach would be to put a dot for every user on the map, and if a dot was already printed at that spot, change its color instead, e.g. make it brighter.
In the end, it really depends on what your actual data is and whether your approach to visualizing it reveals some significant information - but this can only be confirmed after you see it.
As you are looking for a web application, have you considered Google Maps? Your first option (sized dots) can be implemented using the MarkerClusterer library; there is a demo showing this. The data from your database can be loaded using AJAX.

Easiest and fastest way to template, possibly in a PDF

I have been looking extensively for a simple solution to a not-very-complicated problem.
I have a great deal of data in a sql database which needs to be printed (for example, each entry would have name, address, phone number, etc).
The vast majority of the data on the eventual printed page is static - there would only be a small handful of fields that need to be 'variables' in the 'template'. Quite beneficially, the areas that the variable data would be dropped into are fixed in both location and dimensions - so no adjustments to spacing are needed for the other static/redundant data on the page.
I would like to have some form of 'accounting' in the sense that, since the number of pages printed is going to be on the order of tens of thousands, I would like to know which sql entries have been printed so far.
I would not like to 'reinvent the wheel' and write a php front end which loops through arrays and deposits the sql data onto the right place on the page before or after it is rendered as pdf...
I would prefer to print directly from the server (*nix), and would be very enthusiastic if there is a way to do this without actually having to render tens of thousands of individual PDFs. With today's open source software packages, which route is the best to take?
(so far, it is looking like if there isn't a simple way, I am going to need to learn LaTeX, Cheetah, and some python)
Dabo's report writer is a banded reporting engine like Crystal, which takes as input a set of data (output of cur.fetchall(), for example) and a report template (xml string or file), and outputs a PDF or set of PDF's (it can output a stream of bytes instead of writing to a file directly, if desired).
Dabo's main purpose is a desktop-application framework on top of wxPython, but the reporting can be done on the web with no desktop interaction. Though it does help to design the reports on the desktop using the included report designer.
http://dabodev.com
There will be some installation hurdles and a learning curve, but you'll find this to be an easy task once you are ramped up.

Web Graphing Tool For Time scale data

I want to plot some data into a web graph control (preferably JavaScript or PHP). The data is collected regularly from a microcontroller, but the data collection interval is not uniform. For instance, I may collect 5 data points in one day and then 2 data points the next, at different intervals, etc...
Is there a graphing tool that can automatically create a linear datetime axis so that the data is represented properly on the graph?
jQuery's jqPlot does the trick.
You can use jqPlot as James Cotter says, but there are also:
Highcharts.js (apparently the best JS library out there, but restrictive license)
Google Charts Tools (generate a graph from a URL; it returns a PNG image)
Depending on your needs, I'd definitely use Google Charts Tools as it would decrease my bandwidth usage. However, if you need complex manipulation of your data once the chart is displayed, use Highcharts.js (or jqPlot).
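Whichever library you pick, the key to a proper datetime axis is sending explicit timestamps rather than evenly spaced categories, so uneven sampling intervals land at the right x positions. A PHP sketch that converts `timestamp,value` lines into the `[[time, value], ...]` JSON shape most JS date axes accept (the input format is an assumption):

```php
<?php
// Turn irregular "timestamp,value" readings into a JSON array of
// [millisecondsSinceEpoch, value] pairs for a JS charting library.
function readingsToJson(array $readings): string
{
    $points = [];
    foreach ($readings as $line) {
        [$ts, $value] = explode(',', trim($line), 2);
        // Milliseconds since epoch is the usual x-value for JS date axes.
        $points[] = [strtotime($ts) * 1000, (float)$value];
    }
    return json_encode($points);
}
```

The gaps between points then come out proportional to the real elapsed time, no matter how unevenly the microcontroller reported.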

Google Maps Overlays

I'm trying to find something, preferably F/OSS, that can generate a Google Maps overlay from KML and/or KMZ data.
We've got an event site we're working on that needed to accommodate ~16,000 place markers last year and will likely have at least that many again this year. Last year, the company that had done the site just fed the KML data directly to the gMaps API and let it place all of the markers client side. Obviously, that became a performance nightmare and tended to make older browsers "freeze" (or at least appear frozen for several minutes at a time).
Ideally this server side script would take the KML, the map's lat/lon center, and the map zoom level and appropriately merge all of the visible place markers into a single GIF or PNG overlay.
Any guidance or recommendations on this would be greatly appreciated.
UPDATE 10/8/2008 - Most of the information I've come across here and other places would seem to indicate that lessening the number of points on the map is the way to go (i.e. using one marker to represent several when viewing from a higher altitude/zoom level). While that's probably a good approach in some cases, it won't work here. We're looking for the visual impact of a US map with many thousand markers on it. One option I've explored is a service called PushPin, which when fed (presumably) KML will create, server side, an overlay that has all of the visible points (based on center lat/lon and zoom level) rendered onto a single image, so instead of performing several thousand DOM manipulations client side, we merge all of those markers into a single image server side and do a single DOM manipulation on the client end. The PushPin service is really slick and would definitely work if not for the associated costs. We're really looking for something F/OSS that we could run server side to generate that overlay ourselves.
You may want to look into something like Geoserver or Mapserver. They are Google map clones, and a lot more.
You could generate an overlay that you like, and Geoserver (I think Mapserver does as well) can give you KML, PDF, PNG, and other output to mix your maps, or you could generate the whole map by yourself, but that takes time.
Not sure why you want to go to a GIF/PNG overlay; you can do this directly in KML. I'm assuming that most of your performance problem was being caused by points outside the user's current view, i.e. the user is looking at New York but you have points in Los Angeles that are wasting memory because they aren't visible. If you really have 16,000 points that are all visible at once in a typical view, then yes, you'll need to pursue a different strategy.
If the above applies, the procedure would be as follows:
Determine the center & extent of the map
Given that you should be able to calculate the lat/long of the upper left and lower right corners of the map.
Iterate through your database of points and check each location against the two corners. Longitude needs to be greater (signed!) than the upper left longitude and less than the lower right longitude. Latitude needs to be less than the upper left latitude (signed!) and greater than the lower right latitude. Just simple comparisons, no fancy calculations required here.
Output the matching points to a temporary KML for the user.
You can feed KML directly into Google Maps and let it map it, or you can use the Javascript maps API to load the points via KML.
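The comparison-and-output steps above can be sketched in PHP like this (the point format and function name are illustrative, and the antimeridian edge case where west > east is ignored):

```php
<?php
// Keep only markers inside the current view (simple signed comparisons
// against the corner coordinates), then emit a minimal KML document.
// $points: [['name' => ..., 'lat' => ..., 'lng' => ...], ...]
function visibleKml(array $points, float $north, float $west, float $south, float $east): string
{
    $placemarks = '';
    foreach ($points as $p) {
        if ($p['lat'] <= $north && $p['lat'] >= $south
            && $p['lng'] >= $west && $p['lng'] <= $east) {
            $name = htmlspecialchars($p['name'], ENT_XML1);
            $placemarks .= "<Placemark><name>{$name}</name>"
                . "<Point><coordinates>{$p['lng']},{$p['lat']}</coordinates></Point>"
                . "</Placemark>\n";
        }
    }
    return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        . "<kml xmlns=\"http://www.opengis.net/kml/2.2\"><Document>\n"
        . $placemarks
        . "</Document></kml>\n";
}
```

The output can be served as the temporary per-user KML and handed straight to the Maps API.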
It might not solve your exact problem here, but for related issues you might also look into the Google Static Maps API. This allows you to create a static image file with placemarkers on it that will load very quickly, but won't have the interactivity of a regular Google map. Because of the way the API is designed, however, it can't handle anywhere near 16,000 points either so you'd still have to filter down to the view.
I don't know how far along you are with your project, but maybe you can take a look at GeoDjango? This modified Django release includes all kinds of tools to store locations, convert coordinates, and display maps the easy way. Of course you need some Python experience and a server to run it on, but once you've got the hang of Django it works fast and well.
If you just want a solution for your problem try grouping your results at lower zoom levels, a good example of this implementation can be found here.
This is a tough one. You can use custom tilesets with Google Maps, but you still need some way to generate the tiles (other than manually).
I'm afraid that's all I've got =/
OpenLayers is a great javascript frontend to multiple mapping services or your own map servers. Version 2.7 was just released, which adds some pretty amazing features and controls.
