Reducing graph data without losing graph shape - php

I have a dataset with 100 000 datapoints which I have to plot on a graph. The resulting graph will be about 500px wide, so for every pixel there will be about 200 datapoints, which seems quite unnecessary.
I need to find a way to get rid of the excess datapoints without losing the shape of the graph to speed up the rendering. Currently the rendering of all 100 000 points can take 10+ seconds as I'm also using anti-aliasing and other "effects".
I tried to approach this problem by just taking every 200th datapoint and plotting them, but this results in some of the more significant points being left out (think about spikes in the graph that I want to be able to show). I also thought of splitting the dataset into chunks of 200 datapoints, then taking the maximum value from every chunk, but that won't work either.
Is anyone aware of a method that would suit my needs here? The language I'm using is PHP, graph is created by GD and data is coming from MySQL, so optimizations to some of those are welcome.
The data is in this format:
Datetime Value
2005-01-30 00:00:00 35.30
2005-01-30 01:00:00 35.65
2005-01-30 02:00:00 36.15
2005-01-30 03:00:00 35.95
...
And the resulting graph currently looks like this:
[Image: graph sample, http://www.ulmanen.fi/stuff/graph-sample.png]

I know this question is quite old, but I had a very similar problem.
To reduce the number of points to display without affecting the shape of the graph, we use the Ramer-Douglas-Peucker algorithm. The difference in shape between the uncompressed graph and the one produced by this algorithm is unnoticeable.
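For reference, here is a minimal PHP sketch of Ramer-Douglas-Peucker. It is not the code we actually run; the function names and the `[x, y]` point layout are my own assumptions, and it assumes the points are already scaled to pixel coordinates so that `$epsilon` (the tolerance) is in pixels. It uses an explicit stack instead of recursion so that 100 000 points don't hit the recursion limit.

```php
<?php
/**
 * Perpendicular distance from point $p to the line through $a and $b.
 * Every point is an [x, y] pair (assumed layout).
 */
function perpendicularDistance(array $p, array $a, array $b): float
{
    $dx = $b[0] - $a[0];
    $dy = $b[1] - $a[1];
    $length = sqrt($dx * $dx + $dy * $dy);
    if ($length == 0.0) {
        // $a and $b coincide: fall back to the plain point-to-point distance.
        return sqrt(($p[0] - $a[0]) ** 2 + ($p[1] - $a[1]) ** 2);
    }
    return abs($dy * $p[0] - $dx * $p[1] + $b[0] * $a[1] - $b[1] * $a[0]) / $length;
}

/**
 * Ramer-Douglas-Peucker simplification.
 * Keeps the end points plus every point that lies further than $epsilon
 * from the line joining the end points of its current segment.
 */
function simplifyRdp(array $points, float $epsilon): array
{
    $points = array_values($points);
    $n = count($points);
    if ($n < 3) {
        return $points;
    }

    $keep = array_fill(0, $n, false);
    $keep[0] = true;
    $keep[$n - 1] = true;

    // Each stack entry is a segment [firstIndex, lastIndex] still to examine.
    $stack = [[0, $n - 1]];
    while ($stack) {
        [$first, $last] = array_pop($stack);

        // Find the interior point furthest from the segment's chord.
        $maxDist = 0.0;
        $index = -1;
        for ($i = $first + 1; $i < $last; $i++) {
            $d = perpendicularDistance($points[$i], $points[$first], $points[$last]);
            if ($d > $maxDist) {
                $maxDist = $d;
                $index = $i;
            }
        }

        // If it is far enough away, keep it and examine both halves.
        if ($index !== -1 && $maxDist > $epsilon) {
            $keep[$index] = true;
            $stack[] = [$first, $index];
            $stack[] = [$index, $last];
        }
    }

    $result = [];
    foreach ($points as $i => $point) {
        if ($keep[$i]) {
            $result[] = $point;
        }
    }
    return $result;
}
```

A larger `$epsilon` throws away more points; something around half a pixel to one pixel is a reasonable starting value to experiment with, since spikes taller than that always survive.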

It seems to me that 1 in 200 is pretty serious data loss, and if those 200 values that should be represented with one value on the graph aren't close enough to be meaningfully substituted with an average, you have yourself a problem. If an average isn't good enough, you must find a criterion to tell which data is more significant and should be included, and we can't help you with that because we don't know what kind of data it is, its statistical properties, or why any value would be more significant than another. With that additional information, maybe a more specific answer could be given.
EDIT: After looking at the graph, it seems that you need both the minimum and the maximum in a given interval, because the dark blue area is the range of values between those two, correct? Maybe you can take 100 values and make a graph from minimum, maximum, and average, so that every point in the graph is made with 6 instead of 200 values, or something like that.

Another approach that might work is splitting the graph up into 200-point bins and discarding all but the maximum, minimum, and median points in each interval. Each of the three points in the interval gets plotted at its original location, so the locations of the extreme values won't change. Using the median instead of the mean will probably work better for your data set because the maxima are much more extreme than the minima, which would cause the filtered graph to shift upwards if you used the mean.
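A rough PHP sketch of that binning idea, assuming the rows arrive as `[timestamp, value]` pairs in chronological order (the function name and that layout are assumptions, not something from the question):

```php
<?php
/**
 * Keep only the minimum, median and maximum point of every bin.
 * $points is a chronological list of [timestamp, value] pairs (assumed layout).
 * The surviving points keep their original timestamps.
 */
function reduceByBins(array $points, int $binSize): array
{
    $reduced = [];
    foreach (array_chunk(array_values($points), $binSize) as $bin) {
        // Sort the values but keep each value's position inside the bin as its key.
        $values = array_column($bin, 1);
        asort($values);
        $order = array_keys($values); // bin positions, smallest value first

        // Positions of min, median and max; for an even count the upper
        // of the two middle values is used as the "median".
        $picked = array_unique([
            $order[0],
            $order[intdiv(count($order), 2)],
            $order[count($order) - 1],
        ]);

        // Put the survivors back into chronological order.
        sort($picked);
        foreach ($picked as $position) {
            $reduced[] = $bin[$position];
        }
    }
    return $reduced;
}
```

With 200-point bins this leaves at most three points per bin, so 100 000 rows shrink to about 1 500 while every spike stays at its original x position.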

One approach to your problem is max-min decimation; I suggest you Google for a definition and an algorithm, as I don't have either to hand or I would share them with you.
Beyond that, I think you might use a low-pass (anti-aliasing) filter followed by simple decimation (i.e. throwing away excess points).
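As a sketch of the filter-then-decimate idea in PHP: the simplest low-pass filter is a box (moving-average) filter, and if its window is as wide as the decimation factor and one sample is kept per window, the whole thing reduces to averaging each block. The function name is an assumption, and a box filter is only one possible choice of low-pass filter.

```php
<?php
/**
 * Box-filter the series and keep one sample per window, which amounts to
 * replacing every block of $factor values with its mean.
 * $values is a plain list of numbers in chronological order.
 */
function lowPassDecimate(array $values, int $factor): array
{
    $out = [];
    foreach (array_chunk($values, $factor) as $block) {
        // The last block may be shorter than $factor; average what is there.
        $out[] = array_sum($block) / count($block);
    }
    return $out;
}
```

Keep in mind that any low-pass filter flattens short spikes, so on its own this trades away exactly the detail the question wants to keep.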

I think an ordinary average of each bunch of 200 points would be enough.

I don't know what your code/data source looks like, but is it possible to use DISTINCT in your MySQL SELECT statement to reduce the number of data points being brought back to your application?

Related

Rotating between URLs, how precise would randomly picking from an array be?

thanks for taking the time to read this.
My goal here is to rotate between links, anywhere from 1 link up to, let's say 4.
The easy way to do this, would be to make an array of the links and using php, pick one randomly to display.
While this is pretty easy, and quick to set up, it also has me worried a bit, because it's not really accurate, especially not on a small scale.
Giving you some numbers here, let's say my website gets anywhere from 3000 to 5000 unique impressions a day, how accurate would it be to randomly pick a link from an array for 2, 3 or 4 links to choose from?
If anyone else has an idea on how to make a system that rotates very accurately and evenly, let me know!
Thanks in advance to anyone that can help me out :)
Over a lengthy period of time with many impressions, most random functions would be evenly distributed. For a small distribution, the results may be noticeably skewed... but the more
But for perfectly even distribution, nothing beats a straight cafeteria-plate "next-up" array.
Either way, I think you will be satisfied.
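To make the two options concrete, here is a small PHP sketch. The URLs are placeholders and the function names are assumptions. The "next-up" version needs an impression counter that survives between requests; where that counter lives (a database row, a cache key) is left out here and would need to be incremented atomically.

```php
<?php
// Placeholder links, purely for illustration.
$links = [
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/c',
    'http://example.com/d',
];

/** Random rotation: evens out over many impressions, but not exactly. */
function pickRandom(array $links): string
{
    return $links[array_rand($links)];
}

/**
 * "Next-up" rotation: $impression is a running counter (0, 1, 2, ...)
 * that must be stored between requests. Every link gets exactly its share.
 */
function pickNextUp(array $links, int $impression): string
{
    return $links[$impression % count($links)];
}
```

With the counter approach, any number of impressions that is a multiple of the link count is split perfectly evenly; with random picking, the split is only approximately even and the relative deviation shrinks as impressions grow.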

using php to get two even columns of text

Does anyone know a clever way to create even columns of text using php?
So let's say I have a few paragraphs of text and I want to split this into two columns of even length (not string length, I'm talking even visible length).
At the moment I'm splitting based on word count, which (as you can imagine) isn't working too well. For instance, on one page I have a list (ul li style) which increases the line breaks but not the word count. What's happening is that the left column (with the list in it) is visibly longer than the right column (and if there was a list in the right-hand column then it would be the same the other way round).
So does anyone have a clever way to split text? For instance, from my knowledge of Objective-C, there is a "size that fits" function. I know how wide the columns are going to be, so is there any way to take that, and the string, and work out how high it's going to be? Then cut it in half? Or similar?
Thanks
ps: no css3 nonsense please, we're targeting browsers as far back as ie6 (shudder). :)
I know you're looking at a PHP solution but since the number of lines will depend on how it's rendered in the browser, you'll need to use some javascript.
You basically need to know the dimensions of the container the text is in and using the height divided by the text's line-height, you'll get the number of lines.
Here's a fiddle using jQuery: http://jsfiddle.net/bh8ZR/
There is not a lot of information here as to the source data. However, if you know that you have 20 lines of data, and want to split it, why not simply use an array of the display lines, then divide by two. Then you can take the first half of the PHP array and push it into the second column when you hit the limit of the first.
I think you're going to have trouble displaying these columns in a web browser and having a consistent look and feel because you're trying to apply simple programming logic to a visual layout. CSS and jQuery were designed to help layout issues. jQuery does have IE6 compatibility.
I really don't think you're going to find a magic bullet here if you have HTML formatting inside the data you're trying to display. The browser is going to render this based on a lot of variables. Page width, font size, etc. This is exactly why CSS and other layout styles are there, to handle this sort of formatting.
Is there any reason why you're not trying to solve this in the browser instead of PHP? IE6 to me is not a strong enough case not to do this where it belongs.
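If you do go the PHP route with an array of display lines, as suggested above, the split itself is short. This is only a sketch: the function name is an assumption, and it treats every array element as one rendered line of equal height, which is precisely what the browser does not guarantee.

```php
<?php
/**
 * Split a list of display lines into two columns of (nearly) equal length.
 * The first column gets the extra line when the count is odd.
 */
function splitIntoColumns(array $lines): array
{
    $half = (int) ceil(count($lines) / 2);
    return [
        array_slice($lines, 0, $half), // left column
        array_slice($lines, $half),    // right column
    ];
}
```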

Manipulating maps

Given a set of floorplans (in AutoCAD, SVG, or whatever format need be...), I would like to programmatically generate directions from point A to point B. Basically I would like to say: "How do I get from room 101 to room 143?" (or for triple bonus points, from room 101 to room 323). Anyone have any ideas how to go about this? I am pretty language agnostic at this point, although I know C(++), Erlang, PHP and Python the best. I do realize this is a tall order.
Thanks!
The general term for this is pathfinding. The problem has been studied extensively for 2D diagrams. I would break apart the problem into these sections:
Convert the CAD model of the floor into a simple model of rooms, doors, and hallways.
Run a pathfinding algorithm on that floor from source to destination, with constraints for human motion.
Convert the results to text directions (turn right, go straight, etc.). The addition of landmarks may be helpful.
For multiple floors, you could just use the one floor implementation and go from (e.g.) 104 to the 1st floor stairs, 3rd floor stairs to 311. The conversion of the CAD drawing to a semantically useful format seems like the most difficult step to me.
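The pathfinding step could look roughly like this in PHP, once the floorplan has been reduced to a graph. This is a sketch using Dijkstra's algorithm as one possible choice; the function name, the node names and the adjacency-array layout are all made up for illustration. Node names are deliberately non-numeric strings, because PHP silently turns numeric string keys such as "101" into integers.

```php
<?php
/**
 * Dijkstra's shortest path over a graph given as
 *   [node => [neighbour => cost, ...], ...]
 * Costs must be non-negative (e.g. walking distance in metres).
 * Returns the list of nodes from $source to $target, or null if unreachable.
 */
function shortestPath(array $graph, string $source, string $target): ?array
{
    $dist = [$source => 0];
    $prev = [];

    // SplPriorityQueue pops the highest priority first, so costs are negated.
    $queue = new SplPriorityQueue();
    $queue->insert($source, 0);

    while (!$queue->isEmpty()) {
        $node = $queue->extract();
        if ($node === $target) {
            break;
        }
        foreach ($graph[$node] ?? [] as $neighbour => $cost) {
            $candidate = $dist[$node] + $cost;
            if (!isset($dist[$neighbour]) || $candidate < $dist[$neighbour]) {
                $dist[$neighbour] = $candidate;
                $prev[$neighbour] = $node;
                $queue->insert($neighbour, -$candidate);
            }
        }
    }

    if (!isset($dist[$target])) {
        return null;
    }

    // Walk the predecessor links back from the target.
    $path = [$target];
    while (end($path) !== $source) {
        $path[] = $prev[end($path)];
    }
    return array_reverse($path);
}

// Hypothetical floor: two rooms joined by two hallway segments.
$floor = [
    'room_101' => ['hall_a' => 2],
    'hall_a'   => ['room_101' => 2, 'hall_b' => 1, 'room_143' => 5],
    'hall_b'   => ['hall_a' => 1, 'room_143' => 1],
    'room_143' => ['hall_a' => 5, 'hall_b' => 1],
];
$route = shortestPath($floor, 'room_101', 'room_143');
```

Stairs and lifts can be modelled as ordinary edges between nodes on different floors, so the same function covers the multi-floor case without a separate implementation.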
I know you want to use PHP, but I recommend Python and NetworkX. You have to convert your building into a set of (origin, destination, cost) edges and then run either a TSP (as mentioned by still standing) or A* or Dijkstra.
Read about the traveling salesman algorithm. There are an infinite number of paths from point A to point B. Are you looking for the shortest? What is your means of transport? Can you fly, or are you forced to walk or drive? These are factors in determining a solution.

pChart php graph generation

I want to know a way of inverting the y-axis using pChart. I want the y-axis to start at 0 and 1,2,3,4 etc down, rather than up if that makes sense. In the case of search engine rankings a lower number is better and I want the graph to reflect that by inverting the y-axis. Any ideas?
Search engines don't care if your y-axis is upside down, especially in a bitmap image file. You're just making your code less maintainable. Just use pChart the way it's intended.
My only suggestion is to make the rankings as negative numbers. It may not be a perfect representation, but it will invert the axis values, and should still be readable.

how to store and search mp3 by its content

I want to store multiple mp3 files and search them by giving some part of song, to detect which song it is.
I am thinking of storing all binary content in mysql and when I want to search for a specific song by content I will take some middle portion of song and actually match it with the binary data in MySQL.
My questions are:
Is this a reasonable way to find songs by their content?
Is it right to store the songs' content in the database or should I use the filesystem?
This is not going to work. MP3 is a "lossy" format. That means that it constantly alters subtle nuances of the music when encoding, thus producing totally different byte-wise data on almost every encoding for the same song.
Also, even in an uncompressed format like WAV, two identical recordings at different volumes will produce different byte data. So, it is impossible to compare music by comparing the byte values of the file's contents.
A binary comparison will work only for two exact identical copies of the same MP3 file. It won't even work anymore when you re-encode the same MP3 file with identical settings.
Comparing music is not a trivial matter. Several approaches exist, but to my knowledge none that can be used in PHP.
If you're lucky, there exists a web service that allows some kind of matching. Expect it to be commercial in some way, though - I doubt we are at the stage where this kind of thing can be used free of charge.
Is it a right way to find songs by content of song.
Only if you can be sure that the part you get as a search criterion will actually be an excerpt from that particular MP3 file... and that is very, very unlikely. If the part can be from a different source (i.e. a different recording of the same song, or just a differently compressed MP3), you'll have to use audio fingerprinting, which is vastly more complicated.
Is it right to store songs content in database or file store normally will work?
If you do simple binary matching, there is no point in using a database. If you have a more complex indexing technique (such as audio fingerprints) then using a database can make sense.
As others have pointed out - comparing MP3s by looking at the binary content of files is not going to work.
I wrote something like this in Java whilst at university for my final year project. I'd be more than happy to send you the source code. It dealt in relative similarities - "song X is more similar to song Y than it is to song Z", rather than matches, but it might be a step in the right direction.
And please, whatever you do, don't try and do this in PHP. The algorithm I used needed me to compute (if I remember correctly - I worked on this around 3 years ago) 30 30x30 matrices for each MP3 it analysed. Each song took around 30 seconds to process to a set of matrices on my clunky old machine (I'm sure my new PC could get the job done significantly quicker). Once I had those matrices for n songs a second step computed differences between each pair of songs, and a third step reduced those differences down to m-dimensional space. Each of these 3 steps takes a fair amount of horsepower, and PHP definitely isn't the right horse for the job.
What PHP might work for is a frontend - I ended up with a queryable web-app written in Ruby on Rails, where I had a simple backend which stored the co-ordinates of each song in m-dimensional space (I happened to choose m = 6) - given a particular song, or fragment, X, you could then compute songs within a certain "distance" of X.
NB. I should probably point out that all the code I wrote was basically just a wrapper around libraries others had written - which were by some smart people at a university in Austria - those libraries took two songs and generated the matrices - all I did was compute distances and map distances of lots of songs into m-dimensional space. Wish I was smart enough to have done the first bit too!
I don't fully understand what you're trying to do, but if you're going to index an MP3 collection, it's probably a better idea to store a hash (of sufficient length) rather than the actual file.
The problem is that the bytes don't give you any insight to the CONTENT of the file, i.e. the music in it. Even if you cut the metadata from the bytes to compare (to get rid of noise like changes in spelling/capitalisation of metadata), you only know something about the unique file itself. So you could compare two identical files (i.e. exact duplicates) for equality, but you couldn't compare any two random files for similarity.
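For what it's worth, indexing by hash is only a few lines of PHP. This sketch uses the built-in `hash_file()` with SHA-256; the function name and the shape of the returned array are assumptions. It hashes the whole file including metadata, so it finds exact duplicates and nothing else.

```php
<?php
/**
 * Group files by the SHA-256 hash of their contents.
 * Returns [hash => [path, path, ...]]; any hash with more than one path
 * marks a set of byte-for-byte identical files.
 */
function indexByHash(array $paths): array
{
    $index = [];
    foreach ($paths as $path) {
        $hash = hash_file('sha256', $path);
        if ($hash === false) {
            continue; // unreadable file: skip it
        }
        $index[$hash][] = $path;
    }
    return $index;
}
```

In a database you would store the 64-character hex hash in an indexed column next to the file path, rather than storing the file contents themselves.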
To search songs, you probably want to index their tags and focus on a nice, easy-to-use UI so users can look for them in flexible ways.
As said above, the same song will show different content bytes depending on the encoding.
However, one idea pointing in your direction, and I'm not sure how feasible it is, would be to index some patterns in a song that may uniquely identify it. For example, what do all Johnny Cash songs have in common? Volume, tone, a combination of them? And when you get a portion of content, you may extract that same pattern from it and match. That would be an interesting concept.
