I have a MySQL table called today_stats. It has Id, date and clicks columns. I'm trying to write a script that reads these values and predicts the clicks for the next 7 days. How can I predict this in PHP?
Different types of curve fitting are described here:
http://en.wikipedia.org/wiki/Curve_fitting
Also: http://www.qub.buffalo.edu/wiki/index.php/Curve_Fitting
This has less to do with PHP, and more to do with math. The simplest way to calculate something like this is to take the average traffic for a given day of the week over the past X weeks. You don't want to pull all the data, because fads and page content change.
So, for example, get the average traffic for each day over the last month. You'll be able to tell how accurate your estimates are by comparing them to actual traffic. If they aren't accurate at all, then try playing with the calculation (e.g., change the time period you're sampling from). Or maybe it's a good thing that your estimate is off: your site was just featured on the front page of the New York Times!
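As a rough sketch of the weekday-average idea, assuming the rows from today_stats have already been loaded into an array keyed by date (the function name, the array shape and the four-week default are my own assumptions, not anything from the question):

```php
<?php
// Hypothetical helper: average the clicks per weekday over the last $weeks
// weeks and use those averages as the forecast for the next 7 days.
// $rows maps 'Y-m-d' date strings to click counts (assumed shape).
function forecastNextWeek(array $rows, string $today, int $weeks = 4): array
{
    $base = new DateTimeImmutable($today);
    $cutoff = $base->modify("-{$weeks} weeks");
    $sums = array_fill(0, 7, 0);
    $counts = array_fill(0, 7, 0);

    foreach ($rows as $date => $clicks) {
        $day = new DateTimeImmutable($date);
        if ($day < $cutoff || $day > $base) {
            continue; // outside the sampling window
        }
        $weekday = (int) $day->format('w'); // 0 = Sunday ... 6 = Saturday
        $sums[$weekday] += $clicks;
        $counts[$weekday]++;
    }

    $forecast = [];
    for ($i = 1; $i <= 7; $i++) {
        $next = $base->modify("+{$i} days");
        $weekday = (int) $next->format('w');
        // null means there was no history for that weekday in the window
        $forecast[$next->format('Y-m-d')] = $counts[$weekday] > 0
            ? $sums[$weekday] / $counts[$weekday]
            : null;
    }
    return $forecast;
}
```

Changing $weeks is the "play with the time period" knob mentioned above.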
Cheers.
The algorithm you are looking for is called Least Squares.
What you need to do is minimize the total distance from each data point to the function you will use to predict future values. So that each distance is always positive, the square of the difference is used rather than its absolute value, and the sum of those squared differences has to be a minimum. By writing out the function that makes up that sum, differentiating it with respect to each parameter, and solving the resulting equations, you will find the parameters for your function that bring it CLOSEST to the statistical values from the past.
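For the simplest case, a straight line y = intercept + slope * x, solving those equations gives a closed form. A minimal sketch in PHP, assuming the x values are day numbers and the y values are clicks (fitLine and the sample numbers are made up for illustration):

```php
<?php
// Fit y = intercept + slope * x by ordinary least squares.
function fitLine(array $xs, array $ys): array
{
    $n = count($xs);
    $sumX = array_sum($xs);
    $sumY = array_sum($ys);
    $sumXY = 0.0;
    $sumXX = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sumXY += $xs[$i] * $ys[$i];
        $sumXX += $xs[$i] * $xs[$i];
    }
    $denominator = $n * $sumXX - $sumX * $sumX;
    if ($n < 2 || $denominator == 0) {
        throw new InvalidArgumentException('Need at least two distinct x values.');
    }
    $slope = ($n * $sumXY - $sumX * $sumY) / $denominator;
    $intercept = ($sumY - $slope * $sumX) / $n;
    return ['slope' => $slope, 'intercept' => $intercept];
}

// Made-up history: clicks for days 1..5, then extend the line to days 6..12.
$line = fitLine([1, 2, 3, 4, 5], [10, 12, 14, 16, 18]);
$forecast = [];
for ($day = 6; $day <= 12; $day++) {
    $forecast[$day] = $line['intercept'] + $line['slope'] * $day;
}
```

A polynomial or a per-weekday model follows the same principle, just with more parameters to solve for.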
Programs like Excel (maybe OpenOffice Spreadsheet too) have a built-in function that does this for you, using polynomial functions to define the dependence.
Basically, you should take time as the independent variable and all the others as dependent variables.
This is called econometrics, because it's widespread in economics. This way, if you have a lot of statistical data from the past, the prediction for the next day will be quite accurate (you will also be able to determine the confidence interval, i.e. the possible error that may occur). Predictions for the following days will be less and less accurate.
If you make different models for each day of week, include holidays and special days as variables, you will get a much higher precision.
This is the only RIGHT way to mathematically forecast future values. But from all this a question arises: Is it really worth it?
Start off by connecting to the database and then retrieving the data for x days previously.
Then you could attempt to make a line of best fit for the previous days and then just use that and extend into the future. But depending on the application, a line of best fit isn't going to be good enough.
A simple approach would be to group by day and average each value. This can all be done in SQL.
I have an array of last readings from temperature sensor.
How can I find out whether the trend is going up or down?
I know I can compare against the last value, but this is not a good idea, because temperature tends to fluctuate.
So let's say we have last readings in array:
$temp_array = array(5.1, 5.5, 6, 5.9, 6.2, 6.1);
How can I answer the question: will the temperature rise or fall, based on the last readings?
I am thinking of comparing the average of the last 3 readings with the average of the first 3.
The average of the first 3 is 5.533, and the average of the latest 3 is 6.067, so I would say that the trend is up. But maybe there is a smarter way to do that?
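For reference, the comparison described above could look something like this in PHP (trendDirection and the 'up'/'down'/'flat' labels are just names picked for this sketch):

```php
<?php
// Compare the average of the first $window readings with the average of the
// last $window readings and report the direction.
function trendDirection(array $readings, int $window = 3): string
{
    if ($window < 1 || count($readings) < 2 * $window) {
        throw new InvalidArgumentException('Not enough readings for this window.');
    }
    $firstAvg = array_sum(array_slice($readings, 0, $window)) / $window;
    $lastAvg = array_sum(array_slice($readings, -$window)) / $window;

    if ($lastAvg > $firstAvg) {
        return 'up';
    }
    if ($lastAvg < $firstAvg) {
        return 'down';
    }
    return 'flat';
}

$temp_array = array(5.1, 5.5, 6, 5.9, 6.2, 6.1);
$trend = trendDirection($temp_array);
```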
Calculating a trend depends on many factors and the context it is calculated in.
For example, your rate of measurement will dictate your solution. If you measure outside temperatures every hour, then looking at the last 3 measurements might make sense.
On the other hand, measuring every minute means you will get caught up in changing trends when even a cloud goes in front of the sun and you most likely don't want that (unless you try to predict solar-cell efficiency).
As you can see, the answer and algorithm to use can depend on a lot of factors.
Have a look in Excel or LibreOffice Calc and investigate the trend lines there. You'll see many types of varying complexity.
Your proposal might work fine for your use-case, but might be problematic as well. Did you consider a missing or wrong measurement? Example: 8.1, 8.0, 0, 7.9, 7.8, 7.7 ? In this case you will predict the wrong trend.
Keeping things simple, I would do something like:
- Filter out large deviations (you know the domain you are in, high changes are unlikely)
- Calculate the average
- Count how many measurements are below and above the average
Note, this is just a quick alternative, but it would make the prediction less biased toward recent measurements.
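A sketch of those three steps follows. The answer does not say how to detect a "large deviation", so this assumes readings further than $maxDeviation from the median get dropped (that rule, the function name and the returned array keys are all assumptions). It also stops at the counts, since how to turn them into a verdict depends on your domain.

```php
<?php
// Step 1: drop readings far from the median (assumed outlier rule).
// Step 2: average what is left.
// Step 3: count readings below and above that average.
function summarizeReadings(array $readings, float $maxDeviation): array
{
    if (count($readings) === 0) {
        throw new InvalidArgumentException('No readings given.');
    }
    $sorted = $readings;
    sort($sorted);
    $n = count($sorted);
    $mid = intdiv($n, 2);
    $median = $n % 2 === 1
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

    $kept = [];
    foreach ($readings as $value) {
        if (abs($value - $median) <= $maxDeviation) {
            $kept[] = $value;
        }
    }
    if (count($kept) === 0) {
        throw new RuntimeException('Every reading was filtered out.');
    }

    $average = array_sum($kept) / count($kept);
    $below = 0;
    $above = 0;
    foreach ($kept as $value) {
        if ($value < $average) {
            $below++;
        } elseif ($value > $average) {
            $above++;
        }
    }
    return ['kept' => $kept, 'average' => $average, 'below' => $below, 'above' => $above];
}

// The faulty 0 reading from the example above is dropped before averaging.
$summary = summarizeReadings([8.1, 8.0, 0, 7.9, 7.8, 7.7], 2.0);
```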
Good luck.
I'm trying to write a program that relies on date ranges. I am trying to be able to alert users when there are holes in their ranges but I need a reliable way to find those, and to be able to handle them effectively.
My solution was to change any dates so that any day inserted into the app is rewritten so it is that day at noon. Here is the code for that:
public function reformDate($date){
    return strtotime(date("F j, Y", $date)." 12:00pm");
}
This would allow me to deal with a more regular and consistent dataset. Because I only had to see how many days they were apart, rather than seeing how many seconds they were apart and making a decision whether that time quantity represented an intentional gap or not...
I saw, however, when you put something in for today at noon, then if you put something tomorrow at noon, since the values are the same, and based on my restriction:
Select * from times where :date between start and end
It triggers a response. My solution for this was to just add one to the start variable and subtract one from the end variable, so I can easily check whether there is overlap by asking if the difference between the start of one and the end of another is more than 2.
Anyway, my question is: is this a good way to do this? I'm particularly worried about the number 2 - do I need to worry about using such small units of time (that is unix time, by the way). Alternately, should I create a test that if two time units overlap perfectly - they should be accepted?
Disclaimer: I'm fully aware that the best way to represent date/times is either Unix timestamps or PHP's DateTime class and Oracle's DATE data type.
With that out of the way, I'm wondering what the most appropriate data types are (in PHP, as well as Oracle) for storing just time data. I'm not interested in storing a date component; only the time.
For example, say I had an employee entity, for which I wanted to store his/her typical work schedule. This employee might work 8:00am - 5:00pm. There are no date components to these times, so what should be used to store them and represent them?
Options I have considered:
1. As strings, with a standard format (likely 24-hour HH:MM:SS+Z).
2. As numbers in the range 0 <= n < 24, with fractional parts representing minutes/seconds (not able to store timezone info?).
3. As PHP DateTime and Oracle DATE with normalized/unused date component, such as 0001-01-01.
4. Same as above, only using Unix timestamps instead (PHP integer and Oracle TIMESTAMP).
Currently I'm using #3 above, but it sort of irks me that it seems like I'm misusing these data types. However, it provides the best usability as far as I can tell. Comparisons and sorts all work as expected in both PHP and Oracle, timezone data can be maintained, and there's not really any special manipulation needed for displaying the data.
Am I overlooking anything, or is there a more appropriate way?
If you don't need the date at all, then what you need is the interval day data type. I haven't had the need to actually use that, but the following should work:
interval day(0) to second(6)
The option you use (3) is the best one.
Oracle has the following types for storing times and dates:
date
timestamp (with (local) time zone)
interval year to month
interval day to second
Interval data types are not an option for you, because you care when to start and when you finish. You could possibly use one date and one interval but this just seems inconsistent to me, as you still have one "incorrect" date.
All the other options you mentioned need more work on your side and probably also lead to decreased performance compared to the native date type.
More information on oracle date types: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946
I think that the most correct answer to this question is totally dependent on what you are planning to do with the data. If you are planning to do all your work in PHP and nothing in the database, the best way to store the data will be whatever is easiest for you to get the data in a format that assists you with what you are doing in PHP. That might indeed be storing them as strings. That may sound ghastly to a DBA, but you have to remember that the database is there to serve your application. On the other hand, if you are doing a lot of comparisons in the database with some fancy queries, be sure to store everything in the database in a format that makes your queries the most efficient.
If you are doing heavy calculations, such as working out hours worked, converting the times into a decimal format may make the calculations easier before a final conversion back to hours:minutes. A simple pair of functions can convert in both directions: when you get data from the database, convert it to decimal, do all your calculations, then run the result back through to convert it into a time format again.
Using Unix timestamps is handy when you are calculating dates, probably not so much when you are calculating times though. While there seem to be some positives to using them, such as very easily adding a timestamp to a timestamp, I have found that having to convert everything into timestamps to do calculations is pesky and annoying, so I would steer clear of this scenario.
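Something along these lines would do for the decimal round trip. It assumes times arrive as HH:MM (or HH:MM:SS) strings and that whole-minute precision is enough; both function names are made up for this sketch.

```php
<?php
// '08:30' -> 8.5 (seconds, if present, are ignored in this sketch)
function timeToDecimalHours(string $time): float
{
    $parts = array_map('intval', explode(':', $time));
    $hours = $parts[0];
    $minutes = $parts[1] ?? 0;
    return $hours + $minutes / 60;
}

// 8.5 -> '08:30' (rounded to the nearest minute)
function decimalHoursToTime(float $hours): string
{
    $totalMinutes = (int) round($hours * 60);
    return sprintf('%02d:%02d', intdiv($totalMinutes, 60), $totalMinutes % 60);
}

// Hours worked for the 8:00am - 5:00pm schedule from the question.
$worked = timeToDecimalHours('17:00') - timeToDecimalHours('08:00');
$asTime = decimalHoursToTime($worked);
```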
So, to sum up:
- If you want to easily store, but not manipulate, data, strings can be an effective method. They are easy to read and verify. For anything else, choose something else.
- Calculating as numbers makes for super easy calculations. Convert the time/date to a decimal, do all your heavy lifting, then revert to a real time format and store.
- Both PHP's DateTime and Oracle's DATE are handy, and there are some fantastic functions built into Oracle and PHP to manipulate the data, but even the best functions can be more difficult than adding some decimals together. I think that storing the data in the database in a date format is probably a safe idea, especially if you want to do calculations based on the columns within a query. What you plan to do with them inside PHP will determine how you use them.
I would rule option four out right off the bat.
Edit: I just had an interesting chat with a friend about time types. Another thing you should be aware of is that sometimes time based objects can cause more problems than they solve. He was looking into an application where we track delivery dates and times. The data was in fact stored in datetime objects, but here is the catch: truck delivery times are set for a particular day and a delivery window. An acceptable delivery is either on time, or up to an hour after the time. This caused some havoc when a truck was to arrive at 11:30pm and turned up 45 minutes later. While still within the acceptable window, it was showing up as being the next day. Another issue was at a distribution center which actually works on a 4:00AM starting 24 hour day. Setting up times worked for the staff - and consolidating it to payments revolving around a normal date proved quite a headache.
I am building an advanced image sharing web application. As you may expect, users can upload images and others can comment on them, vote on them, and favorite them. These events will determine the popularity of the image, which I capture in a "karma" field.
Now I want to create a Digg-like homepage system, showing the most popular images. It's easy, since I already have the weighted karma score. I just sort on that in descending order to show the 20 most valued images.
The part that is missing is time. I do not want extremely popular images to always be on the homepage. I guess an easy solution is to restrict the result set to the last 24 hours. However, I'm also thinking that in order to keep the images rotating throughout the day, time could be some kind of variable whose offset influences the image's sorting.
Specific questions:
Would you recommend the easy scenario (just sort for best images within 24 hours) or the more sophisticated one (use datetime offset as part of the sorting)? If you advise the latter, any help on the mathematical solution to this?
Would it be best to run a scheduled service to mark images for the homepage, or would you advise a direct query (I'm using MySQL)?
As an extra note, the homepage should support paging and on a quiet day should include entries from previous days in order to make sure it is always "filled".
I'm not asking the community to build this algorithm, just looking for some advice :)
I would go with a function that decreases the "effective karma" of each item after a given amount of time elapses. This is a bit like Eric's method.
Determine how often you want the "effective karma" to be decreased. Then multiply the karma by a scaling factor based on this period.
effective karma = karma * (1 - percentage_decrease)
where percentage_decrease is determined by your function. For instance, you could do
percentage_decrease = min(1, number_of_hours_since_posting / 24)
to make it so the effective karma of each item decreases to 0 over 24 hours. Then use the effective karma to determine what images to show. This is a bit more of a stable solution than just subtracting the time since posting, as it scales the karma between 0 and its actual value. The min is to keep the scaling at a 0 lower bound, as once a day passes, you'll start getting values greater than 1.
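In PHP, the two formulas above come out as something like this (the function name and the $decayHours parameter are just for this sketch; 24 matches the example):

```php
<?php
// Scale karma linearly down to 0 over $decayHours hours.
function effectiveKarma(float $karma, float $hoursSincePosting, float $decayHours = 24.0): float
{
    // min() caps the decrease at 100% once the decay period has passed
    $percentageDecrease = min(1.0, $hoursSincePosting / $decayHours);
    return $karma * (1.0 - $percentageDecrease);
}

$fresh = effectiveKarma(100, 0);    // full karma right after posting
$halfDay = effectiveKarma(100, 12); // half of it after 12 hours
$old = effectiveKarma(100, 48);     // nothing left after the first day
```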
However, this doesn't take into account popularity in the strict sense. Tim's answer gives some ideas into how to take strict popularity (i.e. page views) into account.
For your first question, I would go with the slightly more complicated method. You will want some "All time favorites" in the mix. But don't go by time alone, go by the number of actual views the image has. Keep in mind that not everyone is going to login and vote, but that doesn't make the image any less popular. An image that is two years old with 10 votes and 100k views is obviously more important to people than an image that is 1 year old with 100 votes and 1k views.
For your second question, yes, you want some kind of caching going on in your front page. That's a lot of queries to produce the entry point into your site. However, much like SO, your type of site will tend to draw traffic to inner pages through search engines, so try to watch and optimize your queries everywhere.
For your third question, going by factors other than time (i.e. # of views) helps to make sure you always have a full and dynamic page. I'm not sure about paginating on the front page, leading people to tags or searches might be a better strategy.
You could just calculate an "adjusted karma" type field that would take the time into account:
adjusted karma = karma - number of hours/days since posted
You could then calculate and sort by that directly in your query, or you could make it an actual field in the database that you update via a nightly process or something. Personally I would go with a nightly process that updates it since that will probably make it easier to make the algorithm a bit more sophisticated in the future.
I've found this: the lower bound of the Wilson score confidence interval for a Bernoulli parameter.
Look at this: http://www.derivante.com/2009/09/01/php-content-rating-confidence/
In the second example he explains how to use time as a "freshness factor".
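For anyone who just wants the formula, here is a sketch of the Wilson lower bound itself in PHP. It is written from the standard formula, not copied from the linked article, and it does not include the article's freshness factor. The function name and the default z = 1.96 (roughly 95% confidence) are assumptions.

```php
<?php
// Lower bound of the Wilson score interval for $positive successes out of
// $total trials, e.g. up-votes out of all votes.
function wilsonLowerBound(int $positive, int $total, float $z = 1.96): float
{
    if ($total === 0) {
        return 0.0; // no votes yet, so no evidence either way
    }
    $phat = $positive / $total; // observed fraction of positive votes
    $z2 = $z * $z;
    $centre = $phat + $z2 / (2 * $total);
    $margin = $z * sqrt(($phat * (1 - $phat) + $z2 / (4 * $total)) / $total);
    return ($centre - $margin) / (1 + $z2 / $total);
}

// Same 90% approval, but the item with more votes gets the higher score.
$fewVotes = wilsonLowerBound(9, 10);
$manyVotes = wilsonLowerBound(90, 100);
```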
As the title states, I want to get the difference (in seconds) between 2 (specifically between now and a date in the past) dates without using: strtotime, the Zend Framework or a PEAR package.
I don't want to get into the details of my reason but the gist of it is that I'm working with very old dates (and I do mean old, I'm talking before 0 A.D.).
It is preferred that the returned result be highly accurate down to the second of the textual timestamp given. The format to call the function should be similar to:
$bar = foo("YYYY-MM-DD HH:MM:SS", "AD"); // Where AD is Anno Domini
$baz = foo("YYYY-MM-DD HH:MM:SS", "BC"); // Where BC is Before Christ
The first person who offers a working solution that features:
High readability
No magic (ternary operators, etc.)
Will have their answer up-voted and accepted. Their name will be credited in the header of the source file which uses their code.
EDIT (Re: Fame):
Someone said having a name credited in the header looks bad and can be edited out. I'm talking about the header of the source file that utilizes the function I want. This isn't about "fame". Credit should be given where credit is due and I have no need to lie about who authored a work.
EDIT (Re: Accurateness):
No reason other than I want to keep with the "letter of the message" as best as I am able.
EDIT (Re: Magic):
Magic is different things to different people. In regards to the ternary operator, please respect my opinion as I respect yours. Thank you.
EDIT (Re: Old Dates and One Second Accuracy):
As a student of history, it makes sense to me. The desire for "one second accuracy" is not an absolute. Perfection, while attainable, is not required.
I'd suggest splitting each datetime into parts (year, month, day, hours, minutes, seconds). Then, for each part, subtract the less recent value from the more recent one (remembering that a BC date is effectively a negative number).
You'll never get it absolutely correct. You're going to have to consider leap years, and whether a century year is a leap year, the switch between Gregorian/Julian dates etc.
Plus I'd love to know your reasoning for the limitations and high accuracy requirement!
For all such matters see Calendrical Calculations (Google for it).
Oh, and there was no year 0 AD, the calendar went from 1BC to 1AD, or rather, we modern westerners define the calendar that way, at the time most of the world was using other systems.
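A sketch of the "split into parts" approach, with loud assumptions: it uses the proleptic Gregorian calendar throughout (so it ignores the Julian/Gregorian switch mentioned above), ignores time zones and leap seconds, needs 64-bit integers, and maps BC years to astronomical numbering (1 BC = 0, 2 BC = -1) to deal with the missing year 0. The day-count arithmetic is a well-known civil-calendar algorithm; every function name here is made up.

```php
<?php
// Division that rounds toward negative infinity (intdiv rounds toward zero).
function floorDiv(int $a, int $b): int
{
    return (int) floor($a / $b);
}

// Days between 1970-01-01 and the given proleptic Gregorian date.
function daysFromEpoch(int $year, int $month, int $day): int
{
    // Count the year as starting in March so the leap day falls at the end.
    if ($month <= 2) {
        $year = $year - 1;
        $shiftedMonth = $month + 9;
    } else {
        $shiftedMonth = $month - 3;
    }
    $era = floorDiv($year, 400);            // 400-year cycle index
    $yearOfEra = $year - $era * 400;        // 0..399
    $dayOfYear = intdiv(153 * $shiftedMonth + 2, 5) + $day - 1;
    $dayOfEra = $yearOfEra * 365 + intdiv($yearOfEra, 4)
        - intdiv($yearOfEra, 100) + $dayOfYear;
    return $era * 146097 + $dayOfEra - 719468;
}

// Seconds since 1970-01-01 00:00:00 for "YYYY-MM-DD HH:MM:SS" plus "AD" or "BC".
function toSeconds(string $timestamp, string $era): int
{
    list($date, $time) = explode(' ', $timestamp);
    list($year, $month, $day) = array_map('intval', explode('-', $date));
    list($hour, $minute, $second) = array_map('intval', explode(':', $time));
    if ($era === 'BC') {
        $year = 1 - $year; // 1 BC becomes 0, 2 BC becomes -1, and so on
    }
    $days = daysFromEpoch($year, $month, $day);
    return $days * 86400 + $hour * 3600 + $minute * 60 + $second;
}

// Difference in seconds between two such dates.
function secondsBetween(string $a, string $eraA, string $b, string $eraB): int
{
    return toSeconds($a, $eraA) - toSeconds($b, $eraB);
}
```

For example, secondsBetween('0001-01-01 00:00:00', 'AD', '0001-12-31 00:00:00', 'BC') gives one day's worth of seconds, since 1 BC runs straight into 1 AD.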
Or, make calls to on-line calculators such as this one and save yourself a lot of time.
Some languages and databases do date arithmetic, some don't. If you store your dates in a database, try PostgreSQL:
pg=> SELECT now() - 'January 8, 52 BC'::DATE;
-----------------------------
754835 days 20:27:31.223035
If you don't use a DB, then it gets a bit more problematic. PHP's date arithmetic is ... well, I'd rather not talk about it. Python's is very good, but its datetime type only starts at year 1 AD.
You might have to roll your own...
Why don't you subtract the timestamps?
$seconds = mktime(16, 59, 0, 8, 7, 2001) - mktime(16, 59, 0, 8, 7, 2000); // seconds between them