I am doing some large timestamp-list iterations: putting them in tables with date ranges and grouping them by ranges.
To do that, I found strtotime() a very helpful function, but I am worried about its performance.
For example, take a function that loops over a list of weeks (say, week 49 to 05) and has to determine the timestamps at the beginning and end of each week. A useful way to do that would be:
foreach ($this->weeks($first, $amount) as $starts_at) {
    // Assumption: groups are keyed by ISO year and week number.
    $week_key = date('o-W', $starts_at);
    $ends_at = strtotime('+1 week', $starts_at);
    $groups[$week_key] = $this->slice($timestamps, $starts_at, $ends_at);
}
// $this->weeks() returns a list of timestamps at which each week starts.
// $this->slice() is a simple helper that returns only the timestamps within a range, from a list of timestamps.
Instead of strtotime(), I could potentially work out the number of seconds between the beginning and end of the week. 99% of the time that would be 24 * 60 * 60 * 7. But in the rare weeks with a DST switch, one of the seven days has 23 or 25 hours, so the week is an hour shorter or longer. Code to sort that out will probably be a lot slower than strtotime(), won't it?
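To make the DST concern concrete, here is a minimal sketch comparing the two approaches for a week that contains a DST switch. The time zone and date are assumptions chosen only for illustration.

```php
<?php
// Assumption: Europe/Amsterdam is only an example of a zone with DST.
date_default_timezone_set('Europe/Amsterdam');

// Monday 22 March 2010, 00:00 local time. DST started there on 28 March 2010.
$starts_at = mktime(0, 0, 0, 3, 22, 2010);

// Calendar-aware: lands on the next Monday at 00:00 local time.
$ends_at = strtotime('+1 week', $starts_at);

// Fixed arithmetic: always 7 * 24 hours, which overshoots by an hour here.
$naive_end = $starts_at + 7 * 24 * 60 * 60;

echo date('Y-m-d H:i', $ends_at), "\n";     // 2010-03-29 00:00
echo date('Y-m-d H:i', $naive_end), "\n";   // 2010-03-29 01:00
echo ($ends_at - $starts_at) / 3600, "\n";  // 167
```

The week is 167 hours long, so a slice based on the fixed 604800 seconds would pull in an extra hour of the following week.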
I use the same patterns for ranges of years, months (months, being very inconsistent!), days and hours. Only with hours would I suspect simply adding 3600 to the timestamp is faster.
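For the month ranges, one option that does not need PHP 5.3 is to let mktime() compute the first day of the following month. This is only a sketch: month_range() is a hypothetical helper, and UTC is an assumed time zone.

```php
<?php
date_default_timezone_set('UTC'); // assumption: use the zone your data is in

/**
 * Hypothetical helper: returns array(start, end) timestamps for a month,
 * where end is the first second of the following month (exclusive).
 */
function month_range($year, $month)
{
    $start = mktime(0, 0, 0, $month, 1, $year);
    // mktime() normalises out-of-range values: month 13 becomes January of the next year.
    $end = mktime(0, 0, 0, $month + 1, 1, $year);
    return array($start, $end);
}

list($start, $end) = month_range(2010, 12);
echo date('Y-m-d', $start), ' .. ', date('Y-m-d', $end), "\n"; // 2010-12-01 .. 2011-01-01

// Gotcha: '+1 month' from the 31st overflows past the end of February.
echo date('Y-m-d', strtotime('+1 month', mktime(0, 0, 0, 1, 31, 2010))), "\n"; // 2010-03-03
```

Anchoring every range on the first day of the month avoids the overflow shown in the last line.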
Any other gotchas? Are there ways (that do not depend on PHP 5.3!) that offer better routes to consistent, DST- and leap-year-safe date ranges?
Why are you worried about its performance? Do you have evidence that it's slowing down your system? If not, don't try to over-complicate the solution for unnecessary reasons. Remember that premature optimization is the root of all evil. Write readable code that makes sense, and only optimize if you KNOW it's going to be an issue...
But something else to consider is that it's also compiled C code, so it should be quite efficient for what it does. You MIGHT be able to build a sub-set of the code in PHP land and make it faster, but it's going to be a difficult job (due to all the overhead involved in PHP code).
Like I said before, use it until you prove it's a problem, then fix that problem. Don't forget that rewriting it for your needs isn't free either. It takes time and introduces bugs. Is it worth it if the gain is minimal (meaning it wasn't a performance problem to begin with)? So don't bother trying to micro-optimize unless you KNOW it's a problem...
I know this probably isn't the answer you're looking for, but your best bet is profiling it with a real use case in mind.
My instinct is that, as you think, strtotime will be slower. But even if it's, say, 3 times slower, this is only meaningful in context. Maybe your routine, with real data, takes 60 ms using strtotime, so in most cases, you'd be just saving 40 ms (I totally made up these numbers, but you get the idea). So, you might find out that optimising this wouldn't really pay off (considering you're opening your code to more potential bugs and you'll have to invest more time to get it right).
By the way, if you have good profiling tools, awesome, but even if you don't, comparing timestamps taken before and after the code runs should give you a rough idea.
To respond to the question, finally:
Based on many benchmarks like this one: https://en.code-bude.net/2013/12/19/benchmark-strtotime-vs-datetime-vs-gettimestamp-in-php/
we can see that strtotime() is more efficient than you might think.
So yes, for converting a string to a timestamp, the strtotime function performs pretty well.
Very interesting question. I'd say that the only way you can really figure this out, is to set up your own performance test. Observe the value of microtime() at the beginning and end of the script, to determine performance. Run a ridiculous number of values through a loop with one method, then the other method. Compare times.
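A rough sketch of such a test is below. time_it(), with_strtotime() and with_addition() are hypothetical names, the iteration count is arbitrary, and the numbers will vary by machine and PHP version, so treat the output as indicative only.

```php
<?php
date_default_timezone_set('UTC'); // assumption: any fixed zone will do for timing

/** Hypothetical helper: seconds taken to call $callback $iterations times. */
function time_it($callback, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($callback, $i);
    }
    return microtime(true) - $start;
}

// 1262304000 is 2010-01-01 00:00:00 UTC.
function with_strtotime($i) { return strtotime('+1 week', 1262304000 + $i); }
function with_addition($i)  { return 1262304000 + $i + 604800; }

$iterations = 100000;
$a = time_it('with_strtotime', $iterations);
$b = time_it('with_addition', $iterations);
printf("strtotime: %.4f s, addition: %.4f s\n", $a, $b);
```

Running it several times and comparing the spread gives a better picture than a single run.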
I have an issue with a very slow API call and want to find out what is causing it, using Xhprof: the default GUI and the callgraph. How should this data be analyzed?
What is the approach for finding the places in the code that should be optimized, especially the most expensive bottlenecks?
Of all those columns, focus on the one called "IWall%", column 5.
Notice that send, doRequest, read, and fgets each have 72% inclusive wall-clock time.
What that means is if you took 100 stack samples, each of those routines would find itself on 72 of them, give or take, and I suspect they would appear together.
(Your graph should show that too.)
So since the whole thing takes 23 seconds, that means about 17 seconds are spent simply reading.
The only way you can reduce that 17 seconds is if you can find that some of the reading is unnecessary. Can you?
What about the remaining 28% (6 seconds)?
First, is it worth it?
Even if you could reduce that to zero (17 seconds total, which you can't), the speedup factor would be 1/(1-0.28) = 1.39, or 39%.
If you could reduce it by half (20 seconds total), it would be 1/(1-0.14) = 1.16, or 16%.
20 seconds versus 23, it's up to you to decide if it's worth the trouble.
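The arithmetic above can be checked with a few lines of code. speedup() is just an illustrative name for the formula used in this answer.

```php
<?php
/** Speedup factor if a fraction $removed of the total run time is eliminated. */
function speedup($removed)
{
    return 1 / (1 - $removed);
}

$total = 23; // seconds, the total run time from the profile discussed above

printf("%.2f\n", speedup(0.28));        // 1.39
printf("%.2f\n", speedup(0.14));        // 1.16
printf("%.1f\n", $total * (1 - 0.14));  // 19.8
```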
If you decide it is, I recommend the random pausing method, because it doesn't flood you with noise.
It gets right to the heart of the matter, not only telling you which routines, but which lines of code, and why they are being executed.
(The why is most important, because you can't replace it if it's absolutely necessary.
With profilers, you tend to assume it is necessary, because you have no way to tell otherwise.)
Since you are looking for something taking about 14% of the time, you're going to have to examine 2/0.14 = 14 samples, on average, to see it twice, and that will tell you what it is.
Keep in mind that about 14 * 0.72 = 10 of those samples will land in fgets (and all its callers), so you can either ignore those or use them to make sure all that I/O is really necessary.
(For example, is it just possible that you're reading things twice, for some obscure reason like it was easier to do that way? I've seen that.)
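The sample counts follow from the same kind of arithmetic. As a small sketch, with samples_needed() as an illustrative name:

```php
<?php
/** Average number of stack samples needed to see a cost of $fraction $hits times. */
function samples_needed($fraction, $hits = 2)
{
    return $hits / $fraction;
}

$samples = samples_needed(0.14);
printf("%.0f\n", $samples);         // 14
printf("%.0f\n", $samples * 0.72);  // 10
```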
I'm trying to write a program that relies on date ranges. I am trying to be able to alert users when there are holes in their ranges but I need a reliable way to find those, and to be able to handle them effectively.
My solution was to change any dates so that any day inserted into the app is rewritten so it is that day at noon. Here is the code for that:
public function reformDate($date) {
    // Snap any timestamp to 12:00 noon on the same calendar day.
    return strtotime(date("F j, Y", $date)." 12:00pm");
}
This would allow me to deal with a more regular and consistent dataset. Because I only had to see how many days they were apart, rather than seeing how many seconds they were apart and making a decision whether that time quantity represented an intentional gap or not...
I saw, however, when you put something in for today at noon, then if you put something tomorrow at noon, since the values are the same, and based on my restriction:
Select * from times where :date between start and end
It triggers a response. My solution for this was to just add one to the start variable and subtract one from the end variable, so I can easily check whether there is an overlap by asking if the difference between the start of one and the end of another is more than 2.
Anyway, my question is: is this a good way to do this? I'm particularly worried about the number 2 - do I need to worry about using such small units of time (that is unix time, by the way)? Alternatively, should I add a test so that two ranges whose boundaries coincide exactly are accepted?
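For illustration, here is one way the noon-normalisation idea could be used to find holes by counting whole days instead of seconds. It is only a sketch under stated assumptions: find_gaps() and days_apart() are hypothetical helpers, each range covers whole days inclusively, and the ranges are sorted by start and do not overlap.

```php
<?php
date_default_timezone_set('UTC'); // assumption

function reformDate($date)
{
    return strtotime(date('F j, Y', $date) . ' 12:00pm');
}

/** Whole days between two timestamps, after snapping both to noon. */
function days_apart($a, $b)
{
    return (int) round((reformDate($b) - reformDate($a)) / 86400);
}

/**
 * Hypothetical helper. $ranges is a list of array(start, end) timestamp pairs.
 * Returns the missing stretches as array(first missing day, last missing day).
 */
function find_gaps(array $ranges)
{
    $gaps = array();
    for ($i = 1; $i < count($ranges); $i++) {
        $prev_end = $ranges[$i - 1][1];
        $next_start = $ranges[$i][0];
        // Adjacent days are 1 apart; anything more leaves at least one day uncovered.
        if (days_apart($prev_end, $next_start) > 1) {
            $gaps[] = array(
                date('Y-m-d', strtotime('+1 day', reformDate($prev_end))),
                date('Y-m-d', strtotime('-1 day', reformDate($next_start))),
            );
        }
    }
    return $gaps;
}

$ranges = array(
    array(strtotime('2012-01-01'), strtotime('2012-01-03')),
    array(strtotime('2012-01-04'), strtotime('2012-01-05')),
    array(strtotime('2012-01-08'), strtotime('2012-01-09')),
);
print_r(find_gaps($ranges)); // one gap: 2012-01-06 to 2012-01-07
```

Because the comparison is in days, no one-second adjustments to the start and end values are needed in this variant.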
Disclaimer: I'm fully aware that the best way to represent date/times is either Unix timestamps or PHP's DateTime class and Oracle's DATE data type.
With that out of the way, I'm wondering what the most appropriate data types are (in PHP, as well as Oracle) for storing just time data. I'm not interested in storing a date component; only the time.
For example, say I had an employee entity, for which I wanted to store his/her typical work schedule. This employee might work 8:00am - 5:00pm. There are no date components to these times, so what should be used to store them and represent them?
Options I have considered:
As strings, with a standard format (likely 24-hour HH:MM:SS+Z).
As numbers in the range 0 <= n < 24, with fractional parts representing minutes/seconds (not able to store timezone info?).
As PHP DateTime and Oracle DATE with normalized/unused date component, such as 0001-01-01.
Same as above, only using Unix timestamps instead (PHP integer and Oracle TIMESTAMP).
Currently I'm using #3 above, but it irks me that I seem to be misusing these data types. However, it provides the best usability as far as I can tell. Comparisons and sorts all work as expected in both PHP and Oracle, timezone data can be maintained, and there's not really any special manipulation needed for displaying the data.
Am I overlooking anything, or is there a more appropriate way?
If you don't need the date at all, then what you need is the interval day data type. I haven't had the need to actually use that, but the following should work:
interval day(0) to second(6)
The option you use (3) is the best one.
Oracle has the following types for storing times and dates:
date
timestamp (with (local) time zone)
interval year to month
interval day to second
Interval data types are not an option for you, because you care when to start and when you finish. You could possibly use one date and one interval but this just seems inconsistent to me, as you still have one "incorrect" date.
All the other options you mentioned need more work on your side and probably also lead to decreased performance compared to the native date type.
More information on oracle date types: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm#i1005946
I think that the most correct answer to this question is totally dependent on what you are planning to do with the data. If you are planning to do all your work in PHP and nothing in the database, the best way to store the data will be whatever makes it easiest to get the data in a format that helps with what you are doing in PHP. That might indeed be storing them as strings. That may sound ghastly to a DBA, but you have to remember that the database is there to serve your application. On the other hand, if you are doing a lot of comparisons in the database with some fancy queries, be sure to store everything in the database in a format that makes your queries the most efficient.
If you are doing heavy work such as calculating hours worked, converting into a decimal format may make the calculations easier before a final conversion back to hours:minutes. A simple pair of functions can handle the conversion in both directions: when you get data from the database, convert it to decimal, do all your calculations, then convert the result back into a time format.
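A minimal sketch of such a pair of conversion functions follows. The names are illustrative and times are assumed to be "HH:MM" strings.

```php
<?php
/** "08:30" -> 8.5 (hours as a decimal number). */
function time_to_decimal($time)
{
    list($h, $m) = explode(':', $time);
    return (int) $h + (int) $m / 60;
}

/** 8.5 -> "08:30", rounding to the nearest minute. */
function decimal_to_time($decimal)
{
    $minutes = (int) round($decimal * 60);
    return sprintf('%02d:%02d', floor($minutes / 60), $minutes % 60);
}

$worked = time_to_decimal('17:00') - time_to_decimal('08:30');
echo $worked, "\n";                   // 8.5
echo decimal_to_time($worked), "\n";  // 08:30
```

Rounding to whole minutes before formatting keeps floating-point noise from showing up in the result.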
Using unix timestamps is handy when you are calculating dates, but probably not so much when you are calculating times. While there seem to be some positives, such as very easily adding one timestamp to another, I have found that having to convert everything into timestamps to do calculations is pesky and annoying, so I would steer clear of this scenario.
So, to sum up:
If you want to easily store, but not manipulate data, strings can be an effective method. They are easy to read and verify. For anything else, choose something else.
Calculating as numbers makes for super easy calculations. Convert the time/date to a decimal, do all your heavy lifting, then revert to a real time format and store.
Both PHP's DateTime and Oracle's DATE are handy, and there are some fantastic functions built into Oracle and PHP to manipulate the data, but even the best functions can be more difficult than adding some decimals together. I think that storing the data in the database in a date format is probably a safe idea - especially if you want to do calculations based on the columns within a query. What you plan to do with them inside PHP will determine how you use them.
I would rule option four out right off the bat.
Edit: I just had an interesting chat with a friend about time types. Another thing you should be aware of is that sometimes time based objects can cause more problems than they solve. He was looking into an application where we track delivery dates and times. The data was in fact stored in datetime objects, but here is the catch: truck delivery times are set for a particular day and a delivery window. An acceptable delivery is either on time, or up to an hour after the time. This caused some havoc when a truck was to arrive at 11:30pm and turned up 45 minutes later. While still within the acceptable window, it was showing up as being the next day. Another issue was at a distribution center which actually works on a 4:00AM starting 24 hour day. Setting up times worked for the staff - and consolidating it to payments revolving around a normal date proved quite a headache.
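The midnight problem in that story can be illustrated with a small sketch. is_on_time() is a hypothetical check, not how that application actually worked: it compares elapsed seconds rather than calendar dates, using the one-hour window described above.

```php
<?php
date_default_timezone_set('UTC'); // assumption

/**
 * Hypothetical check: a delivery is acceptable if it arrives at the
 * scheduled time or up to $window seconds later.
 */
function is_on_time($scheduled, $arrived, $window = 3600)
{
    $late_by = $arrived - $scheduled;
    return $late_by >= 0 && $late_by <= $window;
}

$scheduled = strtotime('2012-03-01 23:30');
$arrived   = strtotime('+45 minutes', $scheduled); // 00:15 on the next calendar day

var_dump(date('Y-m-d', $scheduled) === date('Y-m-d', $arrived)); // bool(false)
var_dump(is_on_time($scheduled, $arrived));                        // bool(true)
```

Comparing calendar dates reports a different day, while comparing the elapsed time still accepts the delivery.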
I have some dates/events in a database, and I'd like to pull them out ordered by month (year doesn't matter) - right now all the timestamps are in unix time in a column named eventDate. How can I make that query?
SELECT * FROM calendar ORDER BY eventDate
Obviously that sorts them, but I want to make sure all events across all years are grouped by month - then obviously need to arrange them January, February, March, etc.
Any advice?
Thanks!
You could combine the FROM_UNIXTIME() and MONTH() functions.
SELECT MONTH(FROM_UNIXTIME(0));
-- 12 when the session time zone is behind UTC (1969-12-31 local time); 1 in UTC
But there's no reason to store a unix timestamp over a real timestamp (YYYY-MM-DD HH:MM:SS). An RDBMS has functions to manipulate dates, and if you really need the unix timestamp (I never do, TBH), you can use the UNIX_TIMESTAMP function.
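Applied to the calendar table from the question, the ordering might look like the query string below. The PHP comparison function shows the same ordering for rows that are already fetched. Both are sketches: the secondary sort by eventDate is an assumption, and MySQL uses the session time zone while PHP uses its default time zone, so the two should be configured to match.

```php
<?php
date_default_timezone_set('UTC'); // assumption

// Assumed query: order by month first, then chronologically within each month.
$sql = 'SELECT * FROM calendar ORDER BY MONTH(FROM_UNIXTIME(eventDate)), eventDate';

/** The same ordering for rows already fetched into PHP arrays. */
function compare_by_month($a, $b)
{
    $ma = (int) date('n', $a['eventDate']);
    $mb = (int) date('n', $b['eventDate']);
    if ($ma !== $mb) {
        return $ma < $mb ? -1 : 1;
    }
    if ($a['eventDate'] == $b['eventDate']) {
        return 0;
    }
    return $a['eventDate'] < $b['eventDate'] ? -1 : 1;
}

// Made-up rows for illustration.
$rows = array(
    array('eventDate' => strtotime('2011-03-15')),
    array('eventDate' => strtotime('2012-01-20')),
    array('eventDate' => strtotime('2010-02-01')),
);
usort($rows, 'compare_by_month');
foreach ($rows as $row) {
    echo date('Y-m-d', $row['eventDate']), "\n";
}
// Prints 2012-01-20, 2010-02-01, 2011-03-15 on separate lines.
```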
There are plenty of extremely good reasons for using unix time. Good database design hugely impacts how expensive it is to run databases and websites, especially successful, busy ones.
The best case I know of is a really busy server (or servers) where time data has to be stored, but is accessed rarely compared to the number of reads and writes actually going on in the db. It takes CPU resources to do all the manipulation of that time data, so don't do it unless you absolutely have to.
A real-life example is my own. We needed 4 front-end web servers and were going to be adding more. They were old too and needed updating. Looking at the 6 replacement servers that would be needed, it was going to cost us a bundle, so we decided to look at what we were doing. We now have 2 front-end servers instead of 4 or 6. What did it take? Optimizing the database structure, the queries, and the code that inserted and read data from them.
One example that matches your exact consideration: we changed 1 line of PHP code, changed the time column to unix time instead of yyyy-mm-dd hh:mm:ss, and added an index to the time column, and that one operation went from 0.08 seconds to 0.00031 seconds start to finish.
The multifold impact on CPU resources was huge. The next queued-up operations executed faster, and so on.
That is why people have jobs as database designers... it really is important.
Of course, if your website is slow and not busy, probably no one will notice.
But if you are successful, it WILL matter.
If you've got a busy site and your servers get sluggish... look at things like this. You might not need a new box or more memory; you might just need to clean up code and optimize the db.
Timestamps, their form and how they are used and stored DO MATTER.
I have a MySQL table called today_stats. It has Id, date and clicks columns. I'm trying to create a script that gets the values and tries to predict the clicks for the next 7 days. How can I predict that in PHP?
Different types of curve fitting described here:
http://en.wikipedia.org/wiki/Curve_fitting
Also: http://www.qub.buffalo.edu/wiki/index.php/Curve_Fitting
This has less to do with PHP, and more to do with math. The simplest way to calculate something like this is to take the average traffic for a given day of the week over the past X weeks. You don't want to pull all the data, because fads and page content change.
So, for example, get the average traffic for each day over the last month. You'll be able to tell how accurate your estimates are by comparing them to actual traffic. If they aren't accurate at all, then try playing with the calculation (ex., change the time period you're sampling from). Or maybe it's a good thing that your estimate is off: your site was just featured on the front page of the New York Times!
Cheers.
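A minimal sketch of that averaging approach is below. The helper names are hypothetical, and $rows is assumed to hold the recent date and clicks values already fetched from today_stats.

```php
<?php
date_default_timezone_set('UTC'); // assumption

/**
 * Hypothetical helper. $rows is a list of array('date' => 'YYYY-MM-DD', 'clicks' => int).
 * Returns average clicks per weekday (1 = Monday .. 7 = Sunday).
 */
function weekday_averages(array $rows)
{
    $sum = array();
    $count = array();
    foreach ($rows as $row) {
        $day = (int) date('N', strtotime($row['date']));
        $sum[$day] = (isset($sum[$day]) ? $sum[$day] : 0) + $row['clicks'];
        $count[$day] = (isset($count[$day]) ? $count[$day] : 0) + 1;
    }
    $avg = array();
    foreach ($sum as $day => $total) {
        $avg[$day] = $total / $count[$day];
    }
    return $avg;
}

/** Forecast the 7 days after $last_date; null where no history exists for that weekday. */
function predict_next_week(array $rows, $last_date)
{
    $avg = weekday_averages($rows);
    $forecast = array();
    for ($i = 1; $i <= 7; $i++) {
        $ts = strtotime("+$i day", strtotime($last_date));
        $day = (int) date('N', $ts);
        $forecast[date('Y-m-d', $ts)] = isset($avg[$day]) ? $avg[$day] : null;
    }
    return $forecast;
}

// Made-up data for illustration.
$rows = array(
    array('date' => '2012-01-02', 'clicks' => 100), // Monday
    array('date' => '2012-01-03', 'clicks' => 50),  // Tuesday
    array('date' => '2012-01-09', 'clicks' => 200), // Monday
);
print_r(predict_next_week($rows, '2012-01-09'));
```

Limiting $rows to the last few weeks is what keeps old fads from skewing the averages.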
The algorithm you are looking for is called Least Squares.
What you need to do is minimize the summed distances from each data point to the function you will use to predict the future values. So that each distance is always positive, the square of the difference is used rather than its absolute value, and the sum of those squares has to be minimal. By writing out the function that makes up that sum, differentiating it with respect to each parameter, setting the derivatives to zero, and solving the resulting equations, you will find the parameters for your function that bring it CLOSEST to the statistical values from the past.
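For the simplest case, a straight line y = a + b * x, solving those equations gives a closed-form result. The sketch below implements that case only. linear_fit() is an illustrative name and the click counts are made up.

```php
<?php
/**
 * Ordinary least-squares fit of y = a + b * x.
 * Returns array(a, b). Needs at least two distinct x values.
 */
function linear_fit(array $xs, array $ys)
{
    $n = count($xs);
    $mean_x = array_sum($xs) / $n;
    $mean_y = array_sum($ys) / $n;
    $sxy = 0;
    $sxx = 0;
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($xs[$i] - $mean_x) * ($ys[$i] - $mean_y);
        $sxx += ($xs[$i] - $mean_x) * ($xs[$i] - $mean_x);
    }
    $b = $sxy / $sxx;             // slope
    $a = $mean_y - $b * $mean_x;  // intercept
    return array($a, $b);
}

// Made-up clicks for days 1..5, for illustration only.
$days = array(1, 2, 3, 4, 5);
$clicks = array(10, 12, 14, 16, 18);
list($a, $b) = linear_fit($days, $clicks);

// Extend the fitted line over the next 7 days.
for ($day = 6; $day <= 12; $day++) {
    echo $day, ': ', $a + $b * $day, "\n";
}
```

Polynomial or multi-variable models follow the same principle but need a linear-system solver rather than these two formulas.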
Programs like Excel (maybe OpenOffice Spreadsheet too) have a built-in function that does this for you, using polynomial functions to define the dependence.
Basically you should take time as the independent variable, and all the others as dependent variables.
This is called econometrics, because it's widespread in economics. This way, if you have a lot of statistical data from the past, the prediction for the next day will be quite accurate (you will also be able to determine the confidence interval - the possible error that may occur). The predictions for the following days will be less and less accurate.
If you make different models for each day of week, include holidays and special days as variables, you will get a much higher precision.
This is the only RIGHT way to mathematically forecast future values. But from all this a question arises: Is it really worth it?
Start off by connecting to the database and then retrieving the data for x days previously.
Then you could attempt to make a line of best fit for the previous days and then just use that and extend into the future. But depending on the application, a line of best fit isn't going to be good enough.
A simple approach would be to group by day and average each value. This can all be done in SQL.