PHP - how to find a temperature trend

I have an array of the last readings from a temperature sensor.
How do I find out whether the trend is going up or down?
I know I can compare the last values, but that is not a good idea, because the temperature tends to fluctuate.
So let's say we have the last readings in an array:
$temp_array = array(5.1, 5.5, 6, 5.9, 6.2, 6.1);
How do I answer the question: will the temperature go up or down, based on the last readings?
I am thinking of comparing the average of the last 3 readings with the average of the first 3.
The average of the first 3 is 5.533 and the average of the last 3 is 6.067, so I would say the trend is upward. But maybe there is a smarter way to do this?
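In PHP, what I have in mind looks roughly like this (the slice size of 3 is just taken from my example):
$temp_array = array(5.1, 5.5, 6, 5.9, 6.2, 6.1);
$n = 3; // how many readings to average at each end
$first_avg = array_sum(array_slice($temp_array, 0, $n)) / $n;
$last_avg = array_sum(array_slice($temp_array, -$n)) / $n;
$trend = ($last_avg > $first_avg) ? 'up' : (($last_avg < $first_avg) ? 'down' : 'flat');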

Calculating a trend depends on many factors and the context it is calculated in.
Your rate of measurement, for example, will dictate your solution: if you measure outside temperatures every hour, then looking at the last 3 measurements might make sense.
On the other hand, measuring every minute means you will get caught up in changing trends when even a cloud goes in front of the sun and you most likely don't want that (unless you try to predict solar-cell efficiency).
As you can see, the answer and algorithm to use can depend on a lot of factors.
Have a look at Excel or LibreOffice Calc and investigate the trend lines there. You'll see many types and levels of complexity.
Your proposal might work fine for your use-case, but it might also be problematic. Did you consider a missing or wrong measurement? For example: 8.1, 8.0, 0, 7.9, 7.8, 7.7. In this case you would predict the wrong trend.
Keeping things simple, I would do something like:
- Filter out large deviations (you know the domain you are in, high changes are unlikely)
- Compute the average
- Count how many measurements are below and above the average
Note, this is just a quick alternative, but it would make the prediction less biased toward recent measurements.
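A rough PHP sketch of that idea (the 1.5-degree jump threshold is just an assumption; pick whatever is plausible for your domain):
$temp_array = array(5.1, 5.5, 6, 5.9, 6.2, 6.1);
$max_jump = 1.5; // assumed limit: a reading that jumps more than this from the previous kept one is discarded
$filtered = array();
foreach ($temp_array as $t) {
    if (empty($filtered) || abs($t - end($filtered)) <= $max_jump) {
        $filtered[] = $t;
    }
}
$avg = array_sum($filtered) / count($filtered);
$above = count(array_filter($filtered, function ($t) use ($avg) { return $t > $avg; }));
$below = count(array_filter($filtered, function ($t) use ($avg) { return $t < $avg; }));
// One way to read the counts: if the most recent readings are the ones above the average,
// call the trend up; if they are the ones below, call it down.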
Good luck.

Related

Optimize PHP algorithm with huge number of threads?

As part of a Laravel-based app I am trying to write a PHP script that fetches certain constantly updated data from across the web about certain products, books to be exact.
The problem:
Books are identified by ISBN, a 10 digit identifier. The first 9 digits can be 0-9, while the last digit can be 0-9 or X. However, the last digit is a check-digit which is calculated based off the first 9 digits, thus there is really only 1 possible digit for the last place.
That being the case, we arrive at:
10*10*10*10*10*10*10*10*10*1 = 1,000,000,000
numerically correct ISBNs. I can do a little better than that if I limit my search to English books, as they would contain only a 0 or a 1 as the first digit. Thus I would get:
2*10*10*10*10*10*10*10*10*1 = 200,000,000
numerically correct ISBNs.
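For reference, the check digit can be derived from the first 9 digits roughly like this (a sketch; the helper name is arbitrary):
function isbn10_check_digit($first9) { // $first9 = string of the first 9 digits
    $sum = 0;
    for ($i = 0; $i < 9; $i++) {
        $sum += ($i + 1) * (int) $first9[$i]; // weights 1..9
    }
    $check = $sum % 11;
    return ($check === 10) ? 'X' : (string) $check; // 10 is written as 'X'
}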
Now for each ISBN I have 3 http requests that are required to fetch the data, each taking roughly 3 seconds to complete. Thus:
3 seconds * 3 requests * 200,000,000 ISBNs = 1,800,000,000 seconds
1,800,000,000 seconds / 60 seconds / 60 minutes / 24 hours / 365 days = ~57 years
Hopefully in 57 years' time there won't be such a thing as a book anymore, and this algorithm will be obsolete.
Actually, since the data I am concerned with is constantly changing, for this algorithm to be useful it would have to complete each pass within just a few days (2 - 7 days is ideal).
Thus the problem is how to optimize this algorithm to bring its runtime down from 57 years, to just one week?
Potential Solutions:
1) The very first thing that you will notice is that while there are 200,000,000 possible ISBNs, there are nowhere near as many real ISBNs in existence, which means the algorithm will spend most of its time making HTTP requests for false ISBNs (I could move on to the next ISBN after the first failed HTTP request, but that alone will not bring the runtime down significantly). Thus solution 1 would be to get/buy/download a database which already contains a list of ISBNs in use, significantly bringing down the number of ISBNs to search.
My issue with solution 1 is that new books are constantly being published, and I hope to pick up on new books when the algorithm runs again. Using a database of existing books would only be good for books up to the date the database was created. (A potential fix would be a service that constantly updates its database and lets me download it once a week, but that seems unlikely, and besides, I was really hoping to solve this problem through programming!)
2) While this algorithm takes forever to run, most of the time it is actually just sitting idly waiting for an http response. Thus one option would seem to be to use Threads.
If we do the math, I think the equation would look like this:
(numISBNs/numThreads)*secondsPerISBN = totalSecondsToComplete
If we isolate numThreads:
numThreads = (numISBNs * secondsPerISBN) / totalSecondsToComplete
If our threshold is one week, then:
totalSecondsToComplete = 7 days * 24 hrs * 60 min * 60 sec = 604,800 seconds
numISBNs = 200,000,000
secondsPerISBN = 3
numThreads = (200,000,000 * 3) / 604,800
numThreads = ~992
So 992 threads would have to run concurrently for this to work. Is that a reasonable number of threads to run on, say, a DigitalOcean server? My Mac right now says it is running over 2,000 threads, so this number may actually be manageable.
My Question(s):
1) Is 992 a reasonable number of threads to run on a DigitalOcean server?
2) Is there a more efficient way to asynchronously perform this algorithm as each http request is completely independent of any other? What is the best way to keep the CPU busy while waiting for all the http requests to return?
3) Is there a specific service I should be looking in to for this that may help achieve what I am looking for?
Keep a DB of ISBNs and continue to crawl to keep it updated, similar to what Google does with web pages.
Analyze the ISBN generation logic and avoid fetching ISBNs that are not possible.
At the crawling level, you can not only split the work across multiple threads, but also across multiple servers, each with access to the DB server, which is dedicated to the DB and not bogged down by the crawling (a concurrency sketch in PHP follows below).
You could also use some kind of web cache if it improves performance, for instance the Google cache or the Web Archive.
3 seconds is a lot for a web service; are you sure there is no service that answers in less time? It may be worth searching for one.
If you manage to list all books published up to a certain date, you can then crawl only new books from that date on by finding a source for just those; this refresh would be much faster than searching for every book.
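As a sketch of the concurrency point above: PHP can also issue many HTTP requests in parallel with curl_multi instead of real threads (the URLs and options here are placeholders):
$urls = array(/* one lookup URL per candidate ISBN in this batch */);
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for network activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $ch) {
    $body = curl_multi_getcontent($ch); // parse/store the response for this ISBN here
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);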

How to identify the bottlenecks with Xhprof?

I have an issue with a very slow API call and want to find out what it is caused by, using Xhprof: the default GUI and the callgraph. How should this data be analyzed?
What is the approach to find the places in the code, that should be optimized, and especially the most expensive bottlenecks?
Of all those columns, focus on the one called "IWall%", column 5.
Notice that send, doRequest, read, and fgets each have 72% inclusive wall-clock time.
What that means is if you took 100 stack samples, each of those routines would find itself on 72 of them, give or take, and I suspect they would appear together.
(Your graph should show that too.)
So since the whole thing takes 23 seconds, that means about 17 seconds are spent simply reading.
The only way you can reduce that 17 seconds is if you can find that some of the reading is unnecessary. Can you?
What about the remaining 28% (6 seconds)?
First, is it worth it?
Even if you could reduce that to zero (17 seconds total, which you can't), the speedup factor would be 1/(1-0.28) = 1.39, or 39%.
If you could reduce it by half (20 seconds total), it would be 1/(1-0.14) = 1.16, or 16%.
20 seconds versus 23, it's up to you to decide if it's worth the trouble.
If you decide it is, I recommend the random pausing method, because it doesn't flood you with noise.
It gets right to the heart of the matter, not only telling you which routines, but which lines of code, and why they are being executed.
(The why is most important, because you can't replace it if it's absolutely necessary.
With profilers, you tend to assume it is necessary, because you have no way to tell otherwise.)
Since you are looking for something taking about 14% of the time, you're going to have to examine 2/0.14 = 14 samples, on average, to see it twice, and that will tell you what it is.
Keep in mind that about 14 * 0.72 = 10 of those samples will land in fgets (and all its callers), so you can either ignore those or use them to make sure all that I/O is really necessary.
(For example, is it just possible that you're reading things twice, for some obscure reason like it was easier to do that way? I've seen that.)

Weighted voting algorithm

I'm looking for information on which voting algorithm will be best for me. I have a basic 'Up/Down' voting system where a user can only vote the product up or down. I would like to make it weighted so that a product that is a year old will not be held to the same standards as one that is brand new.
I'm thinking of an algorithm that takes the number of votes for each product in the last 30 days. However, this creates a drawback. I don't want votes older than 30 days to become meaningless, just not weighted as much as newer ones. Then possibly votes older than 90 days could be weighted even less than ones older than 30 days.
Is anyone aware of an existing algorithm that does this and, even better, can be calculated easily in PHP?
Google App Engine has a nice example that deals with votes that "decay" over time.
It's in Python, but it should fit your needs.
I think that given the simplicity of your requirement, the best course of action is to write this yourself.
Without knowing more, I think your challenge will be in deciding whether you save this data into your database in a pre-weighted format (e.g. "when vote is cast, give it $this_year + 1 points"), whether you calculate the weighting in your db query (e.g. order by a score that accounts for both upvotes and the date when a vote was cast), or whether you return all the needed data and deduce the weighting in PHP. The choice depends on what your app needs to do exactly and how much data there will be.
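As a rough illustration of the pre-weighted / PHP-side options, a piecewise weighting along the lines you describe could look like this (the 1.0 / 0.5 / 0.25 weights and the array keys are only assumptions):
function vote_weight($vote_timestamp) {
    $age_days = (time() - $vote_timestamp) / 86400;
    if ($age_days <= 30) {
        return 1.0;  // recent votes count fully
    } elseif ($age_days <= 90) {
        return 0.5;  // older votes count less
    }
    return 0.25;     // very old votes never become meaningless, just lighter
}
$score = 0;
foreach ($votes as $vote) { // each $vote: array('value' => +1 or -1, 'created_at' => unix timestamp)
    $score += $vote['value'] * vote_weight($vote['created_at']);
}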

Curve-fitting in PHP

I have a MySQL table called today_stats. It has Id, date and clicks columns. I'm trying to create a script to get the values and predict the next 7 days' clicks. How can I predict this in PHP?
Different types of curve fitting are described here:
http://en.wikipedia.org/wiki/Curve_fitting
Also: http://www.qub.buffalo.edu/wiki/index.php/Curve_Fitting
This has less to do with PHP, and more to do with math. The simplest way to calculate something like this is to take the average traffic for a given day over the past X weeks. You don't want to pull all the data, because of fads and page content changes.
So, for example, get the average traffic for each day over the last month. You'll be able to tell how accurate your estimates are by comparing them to actual traffic. If they aren't accurate at all, then try playing with the calculation (e.g., change the time period you're sampling from). Or maybe it's a good thing that your estimate is off: your site was just featured on the front page of the New York Times!
Cheers.
The algorithm you are looking for is called Least Squares
What you need to do is minimize the summed-up distances from each point to the function you will use to predict future values. To keep each distance positive, the square of the value is used rather than the absolute value. The sum of the squared differences has to be a minimum. By defining the function that makes up that sum, taking its derivative and solving the resulting equation, you will find the parameters of the function that will be CLOSEST to the statistical values from the past.
Programs like Excel (maybe OpenOffice Spreadsheet too) have a built-in function that does this for you, using polynomial functions to define the dependence.
Basically you should take time as the independent variable, and all the others as dependent values.
This approach is widespread in econometrics. This way, if you have a lot of statistical data from the past, the prediction for the next day will be quite accurate (you will also be able to determine the confidence interval - the possible error that may occur). The following days will be less and less accurate.
If you make different models for each day of week, include holidays and special days as variables, you will get a much higher precision.
This is the only RIGHT way to mathematically forecast future values. But from all this a question arises: Is it really worth it?
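If you do want to try it in PHP, a minimal least-squares fit of a straight line (clicks as a function of the day index) could look roughly like this; the sample values are placeholders for what you would load from today_stats, and the polynomial models mentioned above need more machinery:
$clicks = array(120, 132, 101, 134, 150, 160, 155); // example values, oldest day first
$n = count($clicks);
$sum_x = $sum_y = $sum_xy = $sum_xx = 0;
foreach ($clicks as $x => $y) { // $x = day index, $y = clicks on that day
    $sum_x += $x;
    $sum_y += $y;
    $sum_xy += $x * $y;
    $sum_xx += $x * $x;
}
$slope = ($n * $sum_xy - $sum_x * $sum_y) / ($n * $sum_xx - $sum_x * $sum_x);
$intercept = ($sum_y - $slope * $sum_x) / $n;
for ($day = $n; $day < $n + 7; $day++) {
    echo "Predicted clicks for day $day: " . round($intercept + $slope * $day) . "\n";
}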
Start off by connecting to the database and then retrieving the data for the previous x days.
Then you could attempt to make a line of best fit for the previous days and just extend it into the future. But depending on the application, a line of best fit isn't going to be good enough.
A simple approach would be to group by day and average the values. This can all be done in SQL.
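For instance (one interpretation, assuming the today_stats table from the question and averaging per weekday; connection details are placeholders):
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$rows = $pdo->query(
    "SELECT DAYOFWEEK(date) AS dow, AVG(clicks) AS avg_clicks
     FROM today_stats
     GROUP BY DAYOFWEEK(date)"
)->fetchAll(PDO::FETCH_ASSOC);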

A Digg-like rotating homepage of popular content, how to include date as a factor?

I am building an advanced image sharing web application. As you may expect, users can upload images and others can comment on them, vote on them, and favorite them. These events determine the popularity of the image, which I capture in a "karma" field.
Now I want to create a Digg-like homepage system, showing the most popular images. It's easy, since I already have the weighted karma score. I just sort on that in descending order to show the 20 most valued images.
The part that is missing is time. I do not want extremely popular images to always be on the homepage. I guess an easy solution is to restrict the result set to the last 24 hours. However, I'm also thinking that, in order to keep images rotating throughout the day, time could be some kind of variable whose offset influences the image's sorting.
Specific questions:
Would you recommend the easy scenario (just sort for best images within 24 hours) or the more sophisticated one (use datetime offset as part of the sorting)? If you advise the latter, any help on the mathematical solution to this?
Would it be best to run a scheduled service to mark images for the homepage, or would you advise a direct query (I'm using MySQL)?
As an extra note, the homepage should support paging and on a quiet day should include entries of days before in order to make sure it is always "filled"
I'm not asking the community to build this algorithm, just looking for some advice :)
I would go with a function that decreases the "effective karma" of each item after a given amount of time elapses. This is a bit like Eric's method.
Determine how often you want the "effective karma" to be decreased. Then multiply the karma by a scaling factor based on this period.
effective karma = karma * (1 - percentage_decrease)
where percentage_decrease is determined by your function. For instance, you could do
percentage_decrease = min(1, number_of_hours_since_posting / 24)
to make it so the effective karma of each item decreases to 0 over 24 hours. Then use the effective karma to determine what images to show. This is a bit more of a stable solution than just subtracting the time since posting, as it scales the karma between 0 and its actual value. The min is to keep the scaling at a 0 lower bound, as once a day passes, you'll start getting values greater than 1.
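In PHP that could look roughly like this (the timestamp field is an assumption):
function effective_karma($karma, $posted_at) { // $posted_at = unix timestamp of the upload
    $hours_since_posting = (time() - $posted_at) / 3600;
    $percentage_decrease = min(1, $hours_since_posting / 24);
    return $karma * (1 - $percentage_decrease); // decays to 0 over 24 hours
}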
However, this doesn't take into account popularity in the strict sense. Tim's answer gives some ideas into how to take strict popularity (i.e. page views) into account.
For your first question, I would go with the slightly more complicated method. You will want some "All time favorites" in the mix. But don't go by time alone, go by the number of actual views the image has. Keep in mind that not everyone is going to login and vote, but that doesn't make the image any less popular. An image that is two years old with 10 votes and 100k views is obviously more important to people than an image that is 1 year old with 100 votes and 1k views.
For your second question, yes, you want some kind of caching going on in your front page. That's a lot of queries to produce the entry point into your site. However, much like SO, your type of site will tend to draw traffic to inner pages through search engines .. so try and watch / optimize your queries everywhere.
For your third question, going by factors other than time (i.e. # of views) helps to make sure you always have a full and dynamic page. I'm not sure about paginating on the front page, leading people to tags or searches might be a better strategy.
You could just calculate an "adjusted karma" type field that would take the time into account:
adjusted karma = karma - number of hours/days since posted
You could then calculate and sort by that directly in your query, or you could make it an actual field in the database that you update via a nightly process or something. Personally I would go with a nightly process that updates it since that will probably make it easier to make the algorithm a bit more sophisticated in the future.
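Calculated directly in the query, that might look something like this (table and column names are assumptions):
$sql = "SELECT id, karma - TIMESTAMPDIFF(HOUR, created_at, NOW()) AS adjusted_karma
        FROM images
        ORDER BY adjusted_karma DESC
        LIMIT 20";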
I've found it: the lower bound of the Wilson score confidence interval for a Bernoulli parameter.
Look at this: http://www.derivante.com/2009/09/01/php-content-rating-confidence/
In the second example he explains how to use time as a "freshness factor".
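For reference, the lower bound itself can be computed in PHP along these lines (z = 1.96 for a 95% confidence level; the time/"freshness" weighting from the linked article is not included here):
function wilson_lower_bound($upvotes, $downvotes, $z = 1.96) {
    $n = $upvotes + $downvotes;
    if ($n == 0) {
        return 0;
    }
    $phat = $upvotes / $n;
    return ($phat + $z * $z / (2 * $n)
            - $z * sqrt(($phat * (1 - $phat) + $z * $z / (4 * $n)) / $n))
           / (1 + $z * $z / $n);
}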
