I have a SQL table with sales; it has data such as the time of sale, the day, etc.
Is it possible to predict sales for next month or next year, and also seasonal sales?
What kind of algorithm would I use here?
You're talking about predictive analytics. You could either roll your own regression-type algorithms, or use an API like Google's Prediction API: http://code.google.com/apis/predict/
One thing to keep in mind is that this is all predicated on past behavior really being indicative of future results. If you look at your sales over time, is there a statistical correlation between months or years of sales figures? If not, then you're not going to be successful with the predictions.
I am not aware of any functionality in MySQL for determining trends, but here is a good start; it includes several algorithms you can make use of.
http://en.wikipedia.org/wiki/Trend_estimation
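If you do roll your own, the simplest technique on that trend-estimation page is an ordinary least-squares line through your monthly totals, extrapolated forward. A minimal Python sketch of the idea (the sales figures are invented):

```python
# Fit a least-squares trend line to monthly sales totals, then do a
# naive extrapolation. Pure stdlib; figures below are invented.

def fit_trend(ys):
    """Return (slope, intercept) of the least-squares line through
    the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast(ys, steps):
    """Extrapolate the fitted line `steps` periods past the data."""
    slope, intercept = fit_trend(ys)
    return [intercept + slope * (len(ys) + k) for k in range(steps)]

monthly_sales = [100, 110, 125, 130, 150, 160]  # invented figures
print(forecast(monthly_sales, 3))
```

For seasonal sales you would typically remove a per-month average first and fit the trend to what remains (the seasonal-adjustment idea covered in the linked article), but the core fit-and-extrapolate step looks like the above.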
I am storing price history data for 3500 different stocks from 1970 to present (with a cron job running to update it every day).
What is the best way to store this data? It will be used to run calculations based on both daily data and weekly data. Currently I am storing it as:
stock_id, date, closing_price, high, low, open, volume
Since I want weekly price as well, should I make a separate table to store:
stock_id, week_end_date, weekly_closing_price, weekly_high, weekly_low, week_open_price, average_daily_volume, total_weekly_volume
Since this data is all calculable from the first table, is it necessary to store it again? The only reason I am considering it is that there are a LOT of rows of data to be running calculations.....
It depends on how much data you have and on what your other transactional requirements are.
It doesn't make sense to duplicate this data in your source/OLTP system if you have one. I'm a SQL Server programmer, not MySQL, but I imagine they have datepart functions like every other RDBMS, so determining a week number from a date is trivial.
When you get to OLAP or reporting, though, you may want to make another table with data at your week-level granularity. This will make reporting much faster, especially for things like aggregations which typically don't perform well when run against the output of a function.
Both these depend on the scale of your data. If you have hundreds of rows per day, it may not be worthwhile to do a materialized weekly table for that. If you have tens of thousands of records per day then the performance benefits will probably make it a reasonable option.
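For what it's worth, the "week number from a date is trivial" point holds in application code too; here's a quick Python sketch of grouping daily rows into ISO weeks (the sample rows are invented):

```python
from collections import defaultdict
from datetime import date

# Group daily volume rows by ISO (year, week); sample data is invented.
rows = [(date(2024, 1, 1), 100), (date(2024, 1, 3), 120), (date(2024, 1, 8), 90)]

weekly = defaultdict(int)
for day, amount in rows:
    iso = day.isocalendar()            # (year, week, weekday)
    weekly[(iso[0], iso[1])] += amount

print(dict(weekly))  # {(2024, 1): 220, (2024, 2): 90}
```

In MySQL the equivalent grouping key would be something like YEARWEEK(date), so the daily table alone is enough until the aggregation itself becomes the bottleneck.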
You ask if it's necessary? Who knows. That depends on how much disk space you have. However, what you are describing is an "old fashioned" aggregation table and is often used to improve reporting performance. When dealing with historical data, there's no need to recalculate things like weekly totals since the data doesn't change.
In fact, if I were doing this, I'd also define "monthly" and "annual" summary tables for more flexibility, especially for so much history. You can consider "standardizing" the data in such a way that each period is comparable. Calendar months and weeks have different numbers of trading days so things like "average daily volume" might be misleading.
If you really want to get fancy, do some research on ROLAP solutions. It's a very broad topic but you might find it useful.
Since this data is all calculable from the first table, is it necessary to store it again?
It's not necessary to summarize it and store it. You can just create a view that does all the summary calculations, and query the view.
If you're going to run reports over the full range of data a lot, though, it makes sense to summarize it once, and store the result. You're going to start with about 40 million rows. (3500 stocks * 43 years * about 265 days/year)
If I were in your shoes, I'd load the data, write the query for weekly prices, and test the performance. If it's too slow, insert the summary data into a table.
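One way to run that test without committing to a summary table first is to define the weekly rollup as a view. A sketch using SQLite from Python for brevity (MySQL's CREATE VIEW syntax is essentially the same; strftime here would become YEARWEEK there; sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE daily_prices (
    stock_id INTEGER, date TEXT, closing_price REAL,
    high REAL, low REAL, open REAL, volume INTEGER)""")

# The weekly rollup as a view: nothing is stored twice.
con.execute("""CREATE VIEW weekly_prices AS
    SELECT stock_id,
           strftime('%Y-%W', date)  AS week,
           MAX(high)                AS weekly_high,
           MIN(low)                 AS weekly_low,
           SUM(volume)              AS total_weekly_volume,
           AVG(volume)              AS average_daily_volume
    FROM daily_prices
    GROUP BY stock_id, week""")

con.executemany(
    "INSERT INTO daily_prices VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, "2013-01-07", 10.5, 10.9, 10.1, 10.2, 1000),   # Monday
     (1, "2013-01-08", 10.7, 11.0, 10.4, 10.5, 2000)])  # Tuesday, same week

print(con.execute("SELECT week, total_weekly_volume FROM weekly_prices").fetchall())
```

The weekly open and close take a bit more SQL (first/last trading day per week, e.g. via a correlated subquery), and that is exactly the kind of query worth timing before you decide to materialize.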
I am running a website that lets users contribute by uploading files on specific subjects. Right now my rating system is the worst possible (number of downloads of the file). Not only is this highly inaccurate in terms of quality control, but it also prevents new content from being listed on top anytime soon.
This is why I want to change my rating system so that users can up-/down-vote each item. However this should not be the only factor to display the popularity of such item. I would like to have older content to decrease in rating over time. Maybe I could even factor in the amount of downloads but to a very low percentage.
So, my questions are:
Which formula would you suggest under the assumption that there is 1 new upload every day?
How would you implement this in a php/mysql environment?
My problem is that right now I am simply sorting by the downloads column in the database. How can I sort a query by a factor that is calculated externally (in PHP), or do I have to update a column in my table with the rating factor each time someone loads the site in their browser?
(Please excuse any mistakes, I am not a native speaker)
I am not really fluent in php or mysql, but as for the rating system, if you want to damp things in time, have you considered a decaying exponential? Off the top of my head, I would probably do something like
$rating = $downloads * exp(-$decayRate * $elapsedTime);
You can read up on it here: http://en.wikipedia.org/wiki/Exponential_decay. Choose $decayRate to match the units of $elapsedTime (per day, per week, ...). Maybe build in a one-week or one-month delay before you start damping the results, or people are going to get their uploads downrated immediately.
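A slightly fleshed-out Python sketch of that idea, with the grace period built in; the half-life and grace values here are arbitrary placeholders, not recommendations:

```python
import math

def rating(downloads, age_days, grace_days=7, half_life_days=30):
    """Downloads weighted by a decaying exponential.
    No damping during the grace period; after that, the score
    halves every `half_life_days`."""
    if age_days <= grace_days:
        return float(downloads)
    elapsed = age_days - grace_days
    return downloads * math.exp(-math.log(2) * elapsed / half_life_days)

print(rating(100, 3))    # inside the grace period: 100.0
print(rating(100, 37))   # one half-life past the grace period: ~50.0
```

For the sorting question: either recompute a stored rating column on a schedule (a cron job), or let MySQL compute it per query, since EXP() and DATEDIFF() exist there too, e.g. ORDER BY downloads * EXP(-0.0231 * DATEDIFF(NOW(), uploaded_at)) DESC (0.0231 ≈ ln 2 / 30; column names are illustrative).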
First of all, in any case, you will need to add at least one column to your table. The best thing would be to have a separate table with id, upvotes, downvotes, and a datetime.
If you want to take the freshness of posts (or uploads, or comments, ...) into consideration, I think the best current method is the Wilson score with a gravity parameter.
For a good start with Wilson score implementation in PHP, check this.
Then you will need to read this to understand the pros and the cons of other solutions and use SQL directly.
Remark: gravity is not explicitly detailed in the SQL code but thanks to the PHP one you should be able to make it work.
Note that if you would like something simpler but still not lame, you could check with Bayesian Average. IMDB uses Bayesian Estimation to calculate its Top 250.
Implementing your own statistical model will only result in drawbacks you hadn't anticipated (scores too far from the mean, downvotes counting more than upvotes, decay that is too quick, etc.).
Finally, you are talking about rating the uploads directly, not the users who upload the files. If you would like to do the same for users, the simplest approach would be a Bayesian estimate based on the ratings of their uploads.
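To make the suggestion concrete, here is a Python sketch of the Wilson lower bound plus a gravity-style age penalty. The gravity form and its constants are assumptions borrowed from Hacker News-style ranking, not part of Wilson's formula itself:

```python
import math

def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the 'true'
    upvote fraction (z = 1.96 ~ 95% confidence)."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

def ranked_score(upvotes, downvotes, age_hours, gravity=1.8):
    """Wilson score divided by an age penalty, so older items sink.
    The (age + 2) ** gravity form is an assumed HN-style penalty."""
    return wilson_lower_bound(upvotes, downvotes) / ((age_hours + 2) ** gravity)

# Same 60% upvote ratio, but more votes -> tighter interval -> higher bound:
print(wilson_lower_bound(6, 4) < wilson_lower_bound(600, 400))  # True
```

This is why the Wilson bound beats a raw ratio: an item with 6 up / 4 down no longer outranks one with 600 up / 400 down.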
You have a lot to read, on Stack Overflow alone, to exhaust the subject.
Your journey starts here...
Typically a FreeRadius system contains a radacct table which keeps track of data in real time. Every hour or every day or each time the user logs off the current usage is added to the radacct table with the amount of data used and the date.
This makes it easy to offer post-paid data. In fact, you have to, because all you can query is the data the client has used historically. But if you want to do pre-paid, I have thought about this for years, and although I have come up with something that works similarly to a bank, I am still not sure how to achieve pre-paid data. One caveat is that pre-paid data might be valid for a few months, and since radacct works on the current date, I can't see how to achieve this.
I am looking for an easy way to enable an existing Radius system to allow pre-paid data without using any or too many stored procedures. I'm using MySQL and PHP.
Edit:
I revisited this post after a year and three months. We ended up using a product, Radius Manager by DMA Softlab, which has this functionality built in. Doing it on our own would have required too many stored procedures and too much development time. Just explaining our architecture as requested by @maraspin is a mission.
I have submitted this previously, but because someone downvoted it and said it was already answered, no one will answer it.
I know there are similar posts here:
Design question: How would you design a recurring event system?
What's the best way to model recurring events in a calendar application?
but they do not answer my request which is for a practical, simple and real-world example of logic for a recurring calendar setup, without using another framework or tools/scripts other than straight PHP and MySQL.
I do agree that this article http://martinfowler.com/apsupp/recurring.pdf is good, but it is so abstract that I cannot understand it.
I know there are other "Systems that have done this" but this is my own white whale, and I will figure it out at some point - I would just like some help along the way.
So, the question is: how do I build a recurring calendar using PHP and MySQL?
You should strive to understand the Fowler article. It really does cover it.
The fact of the matter is, this is a hard problem. Editing logic "on the fly" is not something users really want to do. Rather, they want you as a programmer to have anticipated what they'll want to do and provided a rule for it--they don't want to have to figure out how to compute the second Wednesday of the month, for instance.
It sounds like your real problem lies in modeling the recurrence in MySQL. You could use Google's spec, which can be stored in a database and has been covered on Stack Overflow before. Fowler's piece also provides some good starting points in terms of well-defined classes that can be represented in an RDBMS.
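As a concrete example of the "provide a rule" point: "the second Wednesday of the month" that users don't want to compute themselves is only a few lines once the rule is modeled. A Python sketch:

```python
from datetime import date, timedelta

def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday in a month (Monday=0 .. Sunday=6)."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7   # days until the first such weekday
    return first + timedelta(days=offset + 7 * (n - 1))

print(nth_weekday(2023, 11, 2, 2))  # second Wednesday of Nov 2023 -> 2023-11-08
```

The hard part, as Fowler argues, is not this arithmetic but deciding which such rules to offer and how to persist them.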
It's a hard problem. And while SO wants you to succeed, we can only lead you to the stream. We can't force you to drink.
For a practical, real-world example of recurring calendar logic, look at your PDA or equivalent.
I got to build a calendar in an intranet application a few years ago and basically copied what my Palm had for recurring options. It made sense to people, so I judged it a success. But it didn't store real clean in the database. My code ended up with lots of careful checks that data was consistent along with various rules to correct things if something looked awry. It helped that we were actively using it as I was developing it. :-)
As far as storage went, the calendar entry involved a flag that indicated if it was part of a recurring series or not. If it wasn't, it was a non-recurring entry. If it was, then editing it had a couple of options, one of which was to break the series at this entry. Recurring entries were put into the database as discrete items; this was a type of denormalization that was done for performance reasons. Amongst other things, it meant that other code that wanted to check the calendar didn't have to worry about recurring items. We solved the "neverending" problem by always requiring an end-date to the series.
Actually setting up the recurring items involved some JavaScript in the UI to control the different settings. The entry in the DB had a combination of values to indicate the scope of the recurrence (e.g. daily, weekly, ...) the recurring step (e.g. 1 week, 2 weeks, ...) and any variation (weekly let you say "Monday, Wednesday, Thursday every week").
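The combination of scope, step, and variation described above can be sketched in a few lines (Python here for brevity; names are illustrative). One semantic choice worth flagging: this counts weeks from the series start date, not calendar weeks:

```python
from datetime import date, timedelta

def expand_weekly(start, end, weekdays, step=1):
    """Expand a weekly recurrence into discrete dates, the way the
    series was denormalized into the database above.
    weekdays: set of Monday=0 .. Sunday=6; step: repeat every `step`
    weeks, with weeks counted from `start`. `end` is required, so
    there is no never-ending series."""
    d = start
    while d <= end:
        week_index = (d - start).days // 7
        if d.weekday() in weekdays and week_index % step == 0:
            yield d
        d += timedelta(days=1)

# "Monday, Wednesday, Thursday every week" for two weeks:
dates = list(expand_weekly(date(2024, 1, 1), date(2024, 1, 14), {0, 2, 3}))
print(len(dates))  # 6 occurrences
```

Passing step=2 would keep only every other week's occurrences, which is the "recurring step" setting from the UI.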
Finally, we had some logic that I never got to fully implement that handled timezones and daylight saving. This is difficult, because you have to allow the shift to apply selectively. That is, some calendar items will stay in time local to the end-user, others are fixed to a location, and either may or may not shift with daylight saving. I left that company before I got a fix on that problem.
Lastly, I'm replying to this because I didn't see all the other questions. :-) But go read and understand that PDF.
I've built one, but I'm convinced it's wrong.
I had a table for customer details, and another table with a record for each date of the stay (i.e. a week's holiday would have seven records).
Is there a better way?
I code in PHP with MySQL
Here you go
I found it at this page:
A list of free database models.
WARNING: Currently (November '11), Google is reporting that site as containing malware: http://safebrowsing.clients.google.com/safebrowsing/diagnostic?client=Firefox&hl=en-US&site=http://www.databaseanswers.org/data_models/hotels/hotel_reservations_popkin.htm
I work in the travel industry and have worked on a number of different PMS's. The last one I designed had the row per guest per night approach and it is the best approach I've come across yet.
Quite often in the industry there are particular pieces of information to each night of the stay. For example you need to know the rate for each night of the stay at the time the booking was made. The guest may also move room over the duration of their stay.
Performance-wise, it's quicker to do an equality lookup than a range scan in MySQL, so the startdate/enddate approach would be slower. To look up a range of dates, use "where date in (dates)".
Roughly the schema I used is:
Bookings (id, main-guest-id, arrivaltime, departime,...)
BookingGuests (id, guest-id)
BookingGuestNights (date, room, rate)
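With a night-level table like BookingGuestNights, the occupancy lookup really is a plain equality. A sketch using SQLite from Python (column names slightly expanded from the schema above for readability; rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE BookingGuestNights (
    booking_guest_id INTEGER, date TEXT, room TEXT, rate REAL)""")
con.executemany(
    "INSERT INTO BookingGuestNights VALUES (?, ?, ?, ?)",
    [(1, "2024-06-01", "101", 120.0),
     (2, "2024-06-01", "102", 95.0),
     (1, "2024-06-02", "101", 120.0)])

# Rooms occupied on a given night: an equality lookup, no range scan.
occupied = con.execute(
    "SELECT COUNT(DISTINCT room) FROM BookingGuestNights WHERE date = ?",
    ("2024-06-01",)).fetchone()[0]
print(occupied)  # 2
```

The same query also answers "what rate applied that night" and "which room was the guest in", which is where the per-night rows earn their keep.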
Some questions you need to ask yourself:
Is there a reason you need a record for each day of the stay?
Could you not just have a table for the stay and have an arrival date and either a number of nights or a departure date?
Are there specific bits of data that differ from day to day relating to one customer's stay?
Some things that may break your model. These may not be a problem, but you should check with your client to see if they may occur.
Less than 1 day stays (short midday stays are common at some business hotels, for example)
Late check-outs/early check-ins. If you are just measuring the nights, and not dates/times, you may find it hard to arrange for these, or see potential clashes. One of our clients wanted a four hour gap, not always 10am-2pm.
Wow, thanks for all the answers.
I had thought long and hard about the schema, and went with a record-per-night approach after trying the other way and having difficulty converting it to HTML.
I used CodeIgniter with the built in Calendar Class to display the booking info. Checking if a date was available was easier this way (at least after trying), so I went with it. But I'm convinced that it's not the best way, which is why I asked the question.
And thanks for the DB answers link, too.
Best,
Mei
What's wrong with that? Logging each date that the customer is staying allows for what I'd imagine are fairly standard reports, such as being able to display the number of booked rooms on any given day.
The answer heavily depends on your requirements... But I would expect only storing a record with the start and stop date for their stay is needed. If you explain your question more, we can give you more details.
A tuple-per-day is a bit overkill, I think. A few columns on a "stay" table should suffice.
stay.check_in_time_scheduled
stay.check_in_time_actual
stay.check_out_time_scheduled
stay.check_out_time_actual
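With only those columns, per-day reports are still straightforward: a night is occupied when check-in <= night < check-out. A small Python sketch of that range test (dates invented):

```python
from datetime import date

# (check_in, check_out) per stay; the check-out day itself is not a night stayed.
stays = [(date(2024, 6, 1), date(2024, 6, 5)),
         (date(2024, 6, 3), date(2024, 6, 4)),
         (date(2024, 6, 10), date(2024, 6, 12))]

def occupied_count(stays, night):
    """Number of stays occupying a given night."""
    return sum(1 for check_in, check_out in stays if check_in <= night < check_out)

print(occupied_count(stays, date(2024, 6, 3)))  # 2
print(occupied_count(stays, date(2024, 6, 5)))  # 0
```

In SQL this becomes WHERE check_in <= :night AND :night < check_out, a range predicate rather than the equality lookup a per-night table gives you.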
Is creating a record for each day a person stays necessary? It should only be necessary if each day is significant; otherwise, have a Customer/Guest table to contain the customer details and a Booking table to contain bookings for guests. The Booking table would contain room, start date, end date, guest (or guests), etc.
If you need to record other things such as activities paid for, or meals, add those in other tables as required.
One possible way to reduce the number of entries for each stay is to store the time frame, e.g. start date and end date. I need to know the operations you run against the data to give more specific advice.
Generally speaking, if you need to check how many customers are staying on a given date, you can do so with a stored procedure.
For some specific operations your design might be good. Even if that's the case I would still hold a "visits" table linking a customer to a unique stay, and a "days-of-visit" table where I would resolve each client's stay to its days.
Asaf.
You're trading off database size against query simplicity (and probably performance).
Your current model gives simple queries, as it's pretty easy to query for the number of guests, vacancies in room X on night n, and so on, but the database size will increase fairly rapidly.
Moving to a start/stop or start/num nights model will make for some ... interesting queries at times :)
So a lot of the choice is to do with your SQL skill level :)
I don't care for the schema in the diagram. It's rather ugly.
Schema Abstract
Table: Visit
The Visit table contains one row for each night stayed in a hotel.
Visit contains:
ixVisit
ixCustomer
dt
sNote
Table: Customer
ixCustomer
sFirstName
sLastName
Table: Stay
The Stay table contains one row that describes the entire visit. It is updated every time Visit is updated.
ixStay
dtArrive
dtLeave
sNote
Notes
A web app is two things: SELECT actions and CRUD actions. Most web apps are 99% SELECT, and 1% CRUD. Normalization tends to help CRUD much more than SELECT. You might look at my schema and panic, but it's fast. You will have to do a small amount of extra work for any CRUD activity, but your SELECTS will be so much faster because all of your SELECTS can hit the Stay table.
I like how Jeff Atwood puts it: "Normalize until it hurts, denormalize until it works"
For a website used by a busy hotel manager, how well it works is just as important as how fast it works.