Extract dates, times and date ranges from text in PHP

I'm building a local events calendar which takes RSS feeds and website scrapes and extracts event dates from them.
I've previously asked how to extract dates from text in PHP here, and received a good answer at the time from MarcDefiant:
function parse_date_tokens($tokens) {
    # only try to extract a date if we have 2 or more tokens
    if(!is_array($tokens) || count($tokens) < 2) return false;
    return strtotime(implode(" ", $tokens));
}
function extract_dates($text) {
    static $patterns = Array(
        '/^[0-9]+(st|nd|rd|th|)?$/i', # day
        '/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month
        '/^20[0-9]{2}$/', # year
        '/^of$/' #words
    );
    # defines which of the above patterns aren't actually part of a date
    static $drop_patterns = Array(
        false,
        false,
        false,
        true
    );
    $tokens = Array();
    $result = Array();
    $text = str_word_count($text, 1, '0123456789'); # get all words in text
    # iterate words and search for matching patterns
    foreach($text as $word) {
        $found = false;
        foreach($patterns as $key => $pattern) {
            if(preg_match($pattern, $word)) {
                if(!$drop_patterns[$key]) {
                    $tokens[] = $word;
                }
                $found = true;
                break;
            }
        }
        if(!$found) {
            $result[] = parse_date_tokens($tokens);
            $tokens = Array();
        }
    }
    $result[] = parse_date_tokens($tokens);
    return array_filter($result);
}
# test
$texts = Array(
    "The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
    "Valentines Special # The Radisson, Feb 14th",
    "On Friday the 15th of February, a special Hollywood themed [...]",
    "Symposium on Childhood Play on Friday, February 8th",
    "Hosting a craft workshop March 9th - 11th in the old [...]"
);
$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
    echo " " . date('d.m.Y H:i:s', $date) . "\n";
}
However, the solution has some downsides - for one thing, it can't match date ranges.
I'm now looking for a more complex solution that can extract dates, times and date ranges from sample text.
What's the best approach for this? It seems like I'm leaning back toward a series of regex statements run one after the other to catch these cases. I can't see a better way of catching date ranges in particular, but I know there must be a better way of doing this. Are there any libraries out there just for date parsing in PHP?
Date / Date Range samples, as requested
$dates = [
" Saturday 28th December",
"2013/2014",
"Friday 10th of January",
"Thursday 19th December",
" on Sunday the 15th December at 1 p.m",
"On Saturday December 14th ",
"On Saturday December 21st at 7.30pm",
"Saturday, March 21st, 9.30 a.m.",
"Jan-April 2014",
"January 21st - Jan 24th 2014",
"Dec 30th - Jan 3rd, 2014",
"February 14th-16th, 2014",
"Mon 14 - Wed 16 April, 12 - 2pm",
"Sun 13 April, 8pm",
"Mon 21 - Wed 23 April",
"Friday 25 April, 10 – 3pm",
"The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
"Valentines Special # The Radisson, Feb 14th",
"On Friday the 15th of February, a special Hollywood themed [...]",
"Symposium on Childhood Play on Friday, February 8th",
"Hosting a craft workshop March 9th - 11th in the old [...]"
];
The function I'm currently using (not the above) is about 90% accurate. It can catch date ranges, but has difficulty if a time is also specified. It uses a list of regex expressions and is very convoluted.
UPDATE: Jan 6th, 2014
I'm working on code that does this, working on my original method of a series of regex statements run one after the other. I think I'm close to a working solution that can pretty much extract almost any date/time range / format from a piece of text. When I'm done I'll post it here as an answer.

I think you can sum up the regex in your question like the one below.
(?<date_format_1>(?<day>(?i)\b\s*[0-9]+(?:st|nd|rd|th|)?)(?<month>(?i)\b\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|etc))(?<year>\b\s*20[0-9]{2}) ) |
(?<date_format_2>(?&month)(?&day)(?!\s+-)) |
(?<date_format_3>(?&day)\s+of\s+(?&month)) |
(?<range_type_1>(?&month)(?&day)\s+-\s+(?&day))
Flags: x
Demo
http://regex101.com/r/wP5fR4
Discussion
By using recursive subpatterns, you reduce the complexity of the final regex.
I have used a negative lookahead in date_format_2 because it would otherwise partially match range_type_1. You may need to add more range types depending on your data. Don't forget to check the other patterns in case of a partial match.
Another solution would be to build small regexes in separate string variables and then concatenate them in PHP to build a bigger regex.
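As a rough illustration of that last suggestion, the sketch below assembles one range pattern from small pieces. The variable names, the function name find_same_month_range and the month list are assumptions made for this example, and it only handles a plain hyphen (not an en dash) between the two days.

```php
<?php
// Hypothetical building blocks; names and coverage are illustrative only.
$day   = '\d{1,2}(?:st|nd|rd|th)?';
$month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
       . 'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)';
$year  = '20\d{2}';

// One month, two days, optional year: "March 9th - 11th" or "February 14th-16th, 2014".
$sameMonthRange = "/\\b(?<month>$month)\\s+(?<from>$day)\\s*-\\s*(?<to>$day)\\b(?:,?\\s+(?<year>$year))?/i";

function find_same_month_range(string $text, string $pattern): ?array
{
    if (!preg_match($pattern, $text, $m)) {
        return null;
    }
    return [
        'month' => $m['month'],
        'from'  => (int) $m['from'],  // the cast drops the ordinal suffix
        'to'    => (int) $m['to'],
        'year'  => !empty($m['year']) ? (int) $m['year'] : null,
    ];
}

print_r(find_same_month_range('Hosting a craft workshop March 9th - 11th in the old', $sameMonthRange));
```

Because each piece lives in its own variable, further range types (for example month to month) can reuse $day, $month and $year without repeating them.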

Related

Adding and removing from a string containing dates using PHP

I have to edit a whole bunch of date intervals. But they are all mixed up. Most are in the form Month YearMonth Year
eg January 2014March 2015
How would I insert a hyphen in between so I end up with
January 2014 - March 2015
I also have the problem where these dates occur in the same year.
eg April 2012September2012
In such a case I would need to insert the hyphen and remove the year so that I'm left with
April - September
There must be some PHP string operators for stuff like this. Well, that's what I'm hoping.
Would appreciate some guidance. Thanks in advance.
$string = "January 2014March 2015";
preg_match('/([a-z]+) *(\d+) *([a-z]+) *(\d+)/i', $string, $match);
print "$match[1] $match[2] - $match[3] $match[4]";
outputs,
January 2014 - March 2015
You could do it using lookaround:
$string = "January 2014March 2015";
$res = preg_replace('/(?<=\d)(?=[A-Z])/', ' - ', $string);
echo $res,"\n";
Output:
January 2014 - March 2015
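Neither snippet above covers the same-year case from the question, where the year should be dropped. A minimal sketch of one way to handle both cases follows; the function name format_month_range is hypothetical, and the pattern assumes four-digit years and nothing else in the string.

```php
<?php
// Sketch: normalise "Month YearMonth Year" strings; the function name is hypothetical.
function format_month_range(string $s): string
{
    $pattern = '/^\s*([a-z]+)\s*(\d{4})\s*([a-z]+)\s*(\d{4})\s*$/i';
    if (!preg_match($pattern, $s, $m)) {
        return $s; // leave unrecognised input untouched
    }
    [, $fromMonth, $fromYear, $toMonth, $toYear] = $m;

    // Same year on both sides: keep only the months.
    return $fromYear === $toYear
        ? "$fromMonth - $toMonth"
        : "$fromMonth $fromYear - $toMonth $toYear";
}

echo format_month_range('January 2014March 2015'), "\n";
echo format_month_range('April 2012September2012'), "\n";
```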

Detect BST in php

I am building a booking system in php that offers session times for people to book outdoor activities.
In the summer months, there is an extra session available at the end of the day, because of Daylight Savings, there is an extra hour in the evenings.
Year  Clocks go forward  Clocks go back
2014  30 March           26 October
2015  29 March           25 October
2016  27 March           30 October
2017  26 March           29 October
2018  25 March           28 October
I am using this...
$todaysDate = strtotime(date("Y-m-d"));
$bstBegin = strtotime("2015-03-29");
$bstEnd = strtotime("2015-10-25");
// in BST when today is on or after the start date and before the end date
if($todaysDate >= $bstBegin && $todaysDate < $bstEnd)
{
    echo '<option value="evening">Evening Session</option>';
}
I only need to show this extra option in the select list between these dates. Is this something I will need to set manually from year to year, or is there a PHP date variable that knows the days the clocks change?
$today = strtotime(date("Y-m-d"));
if (date('I', $today)) {
echo "We're in BST!";
} else {
echo "We're not in BST!";
}
or use the DateTime object equivalent, which maintains details of all the transition dates globally
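A minimal sketch of the DateTime variant mentioned above, assuming the booking system should follow UK time regardless of the server's default timezone. The function name is_bst is made up for this example.

```php
<?php
// Sketch: the function name is hypothetical; 'Europe/London' is assumed to be the relevant zone.
function is_bst(string $when = 'now'): bool
{
    $dt = new DateTime($when, new DateTimeZone('Europe/London'));
    // The 'I' format character is "1" while daylight saving time is in effect.
    return $dt->format('I') === '1';
}

if (is_bst()) {
    echo '<option value="evening">Evening Session</option>';
}
```

Passing the timezone explicitly avoids depending on the date.timezone setting, which date('I') would otherwise use.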
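I normally work with the DateTime functions and like them very much; they give you many ways to modify a date.
But in your case you can concatenate the current year onto your string. Note that the transition days themselves still shift from year to year, as the table in the question shows, so fixed days are only approximate.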
$todaysDate = strtotime(date("Y-m-d"));
$bstBegin = strtotime(date('Y')."-03-29");
$bstEnd = strtotime(date('Y')."-10-25");
I hope I have understood your problem correctly.
I found a very good reference for this online: https://gist.github.com/aromig/56376f76d4fb653ba83e
function is_BST() {
    $theTime = time();
    $tz = new DateTimeZone('Europe/London');
    // with identical start and end timestamps, the first entry describes the offset in effect right now
    $transition = $tz->getTransitions($theTime, $theTime);
    $abbr = $transition[0]['abbr'];
    return $abbr === 'BST';
}
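The same getTransitions() call can also answer the original question of when the clocks change in a given year. The sketch below is one possible way to do that; the function name bst_transitions and the 'forward'/'back' keys are assumptions for illustration.

```php
<?php
// Sketch: list the clock changes for one year; the function name is hypothetical.
function bst_transitions(int $year): array
{
    $tz    = new DateTimeZone('Europe/London');
    $start = (new DateTime("$year-01-01 00:00:00", $tz))->getTimestamp();
    $end   = (new DateTime("$year-12-31 23:59:59", $tz))->getTimestamp();

    $dates = [];
    foreach ($tz->getTransitions($start, $end) as $t) {
        if ($t['ts'] <= $start) {
            continue; // this entry describes the state at $start, not a clock change
        }
        // isdst is true for the spring change into BST and false for the autumn change back.
        $dates[$t['isdst'] ? 'forward' : 'back'] = gmdate('Y-m-d', $t['ts']);
    }
    return $dates;
}

print_r(bst_transitions(2015));
```

With this, the begin and end dates no longer need to be maintained by hand each year.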

Using regex to group data and its children

I have a simple document that I need to split up into events (by day). Unfortunately, the document contains other useless info (such as event details) which I'll need to crawl through to retrieve the info. An excerpt of this document looks like this:
10th March 2015
Baseball 10:00 Please remember to bring your bats
Soccer 14:00 over 18s only
11th March 2015
Swimming 10:00 Children only
Soccer 14:00 Over 14s team training
My initial plan was to use preg_split to try and split the string at the date, then loop over each one; however, I need to maintain the structure of the document.
Ideally I'd like to return the data into an array like:
arr[
    'days' =>[
        'date' => '10th March 2015',
        'events' => ['Baseball 10:00', 'Soccer 14:00'],
    ]
]
How would I best go about doing this? Regex isn't my strongest suit, but I know enough to capture the days ([0-9]{1,2}[a-z]{2}\s[a-z]+\s[0-9]{4}) and the events ([a-zA-Z]+\s[0-9]{2}:[0-9]{2}).
You can use this regex:
/(?:\b(\d+th\h+.*?\d{4})\b|\G)\s+(\S+\h+\d{2}:\d{2}\b).*?(?=\s+(?>\S+\h+\d{2}:\d{2}|\d+th\h+|\z))/i
And then a bit of PHP code to loop through the result.
RegEx Demo
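For comparison, here is a line-by-line sketch that does not use the single regex above; it is a swapped-in technique. It assumes every date line looks like "10th March 2015" and every event line starts with a one-word name followed by HH:MM. The function name group_events_by_day is hypothetical.

```php
<?php
// Sketch only: a line-by-line alternative, not the single regex shown above.
function group_events_by_day(string $text): array
{
    $days = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (preg_match('/^\d{1,2}(?:st|nd|rd|th)\s+[a-z]+\s+\d{4}$/i', $line)) {
            // A date line starts a new day.
            $days[] = ['date' => $line, 'events' => []];
        } elseif ($days && preg_match('/^([a-z]+\s+\d{2}:\d{2})\b/i', $line, $m)) {
            // An event line belongs to the most recent day; keep only "Name HH:MM".
            $days[count($days) - 1]['events'][] = $m[1];
        }
    }
    return ['days' => $days];
}

$text = "10th March 2015\nBaseball 10:00 Please remember to bring your bats\n"
      . "Soccer 14:00 over 18s only\n11th March 2015\nSwimming 10:00 Children only\n"
      . "Soccer 14:00 Over 14s team training";
print_r(group_events_by_day($text));
```

This keeps the document order intact and works whether or not the days are separated by blank lines.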
This is what I came up with. I used explode() to split out the different sections and then to split up the lines. I didn't use preg_match() until the very end to get the specific sport/time.
<?php
$text = <<<EOD
10th March 2015
Baseball 10:00 Please remember to bring your bats
Soccer 14:00 over 18s only

11th March 2015
Swimming 10:00 Children only
Soccer 14:00 Over 14s team training
EOD;
$days = array();
// sections (days) are separated by a blank line
if( $sections = explode("\n\n",$text) ){
    foreach($sections as $k=>$section){
        $events = array();
        $lines = explode("\n",$section);
        // the first line of each section is the date
        $day = $lines[0];
        unset($lines[0]);
        if($lines){
            foreach($lines as $line){
                // keep only the "Sport HH:MM" part of each event line
                preg_match("/(\w+)\s\d{2}:\d{2}/",$line,$matches);
                if(isset($matches[0])){
                    $events[] = $matches[0];
                }
            }
        }
        $days[$k] = array(
            'day' => $day,
            'events' => $events
        );
    }
}
// pass true so print_r() returns the string instead of printing it and echoing "1"
echo '<pre>',print_r($days, true),'</pre>';

PHP - Get date from specified weekday in a given week

I want to get a PHP date() from a specified weekday in a given week.
For example:
Weekday: Thursday - in week:8 - year:13 (means 2013).
I would like to return a date from these specified values. The phpdate will in this case return: "21 Feb 2013", which is a Thursday in week 8 of 2013.
Please fill in this php-method:
function getDateWithSpecifiedValues($weekDayStr,$week,$year) {
//return date();
}
Where the example:
getDateWithSpecifiedValues("Tuesday",8,13);
will return a phpdate of "19 Feb 2013"
First, you have to define what you mean by "week of the year". There are several different definitions. Does the first week start on Jan 1? On the first Sunday or Monday? Is it number 1 or 0?
There is a standard definition, codified in ISO 8601, which says that weeks run from Monday through Sunday, the first one of the year is the one with at least 4 days of the new year in it, and that week is number 1. Your example expected output is consistent with that definition.
So you can convert the values by putting them into a string and passing that string to strptime, along with a custom format string telling it what the fields in the string are. For example, the week number itself should be indicated in the format string by %V.
For the weekday, the format depends on how you want to provide it as input to your function. If you have the full name (e.g. "Thursday"), that's %A. If you have the abbreviated name (e.g. "Thu"), that's %a. If you have a number (e.g. 4), that's either %w (if Sundays are 0) or %u (if Sundays are 7). (If you're not sure, you can always just use %w and pass the number % 7.)
Now, the year should be %G (full year) or %g (just the last two digits). It's different from the normal calendar year fields (%Y for 2013 and %y for 13) because, for example, week 1 of 2014 actually started on December 30, 2013, which obviously has a '%Y' of 2013 where we want 2014. However, the G fields don't work properly with strptime, so you'll have to use the Y's.
For example:
$date_array = strptime("$weekDayStr $week $year", '%A %V %y');
That's a good start, but the return value of strptime is an array:
array('tm_sec' => seconds, 'tm_min' => minutes, 'tm_hour' => hour,
'tm_mday' => day of month, 'tm_mon' => month number (0..11), 'tm_year' => year - 1900)
And that array is not the input expected by any of the other common date or time functions, as far as I can tell. You have to pull the values out yourself and modify them in some cases and pass the result to something to get what you want. For instance:
$time_t = mktime($date_array['tm_hour'], $date_array['tm_min'],
$date_array['tm_sec'], $date_array['tm_mon']+1,
$date_array['tm_mday'], $date_array['tm_year']+1900);
And then you can return that in whatever form you need. Here I'm returning it as a string:
function getDateWithSpecifiedValues($weekDayStr,$week,$year) {
$date_array = strptime("$weekDayStr $week $year", '%A %V %y');
$time_t = mktime($date_array['tm_hour'], $date_array['tm_min'],
$date_array['tm_sec'], $date_array['tm_mon']+1,
$date_array['tm_mday'], $date_array['tm_year']+1900);
return strftime('%d %b %Y', $time_t);
}
For example,
php > print(getDateWithSpecifiedValues('Thursday',8,13)."\n");
21 Feb 2013
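The ISO year versus calendar year distinction described above can also be seen with DateTime::setISODate(), which takes the ISO year, week and day number directly. This is a sketch under the assumption that English weekday names are passed in; the function name date_from_iso_week and the weekday map are made up for the example.

```php
<?php
// Sketch: the function name and the weekday map are illustrative assumptions.
function date_from_iso_week(string $weekday, int $week, int $isoYear): DateTime
{
    $isoDay = ['monday' => 1, 'tuesday' => 2, 'wednesday' => 3, 'thursday' => 4,
               'friday' => 5, 'saturday' => 6, 'sunday' => 7];
    $key = strtolower($weekday);
    if (!isset($isoDay[$key])) {
        throw new InvalidArgumentException("Unknown weekday: $weekday");
    }
    $dt = new DateTime('today', new DateTimeZone('UTC'));
    return $dt->setISODate($isoYear, $week, $isoDay[$key]);
}

// Monday of ISO week 1 of 2014 falls in calendar year 2013.
$d = date_from_iso_week('Monday', 1, 2014);
echo $d->format('Y-m-d'), ': calendar year ', $d->format('Y'), ', ISO year ', $d->format('o'), "\n";
```

Here format('o') plays the role of %G and format('Y') the role of %Y.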
Try this function:
function getDateWithSpecifiedValues($weekDayStr, $week, $year) {
$dt = DateTime::createFromFormat('y, l', "$year, $weekDayStr");
return $dt->setISODate($dt->format('o'), $week, $dt->format('N'))->format('j M Y');
}
demo
I'm fairly certain neither the 21st of January nor the 19th of January is in week 8.
You can however use strptime to parse custom formats:
var_dump(strptime("Thursday 8 13", "%A %V %y"));
array(9) {
["tm_sec"]=>
int(0)
["tm_min"]=>
int(0)
["tm_hour"]=>
int(0)
["tm_mday"]=>
int(21)
["tm_mon"]=>
int(1)
["tm_year"]=>
int(113)
["tm_wday"]=>
int(4)
["tm_yday"]=>
int(58)
["unparsed"]=>
string(0) ""
}
See the documentation for strptime for the meaning of each value in the returned array.

Parsing a string for dates in PHP

Given an arbitrary string, for example ("I'm going to play croquet next Friday" or "Gadzooks, is it 17th June already?"), how would you go about extracting the dates from there?
If this is looking like a good candidate for the too-hard basket, perhaps you could suggest an alternative. I want to be able to parse Twitter messages for dates. The tweets I'd be looking at would be ones which users are directing at this service, so they could be coached into using an easier format, however I'd like it to be as transparent as possible. Is there a good middle ground you could think of?
If you have the horsepower, you could try the following algorithm. I'm showing an example, and leaving the tedious work up to you :)
//Attempt to perform strtotime() on each contiguous subset of words...
//1st iteration
strtotime("Gadzooks, is it 17th June already")
strtotime("is it 17th June already")
strtotime("it 17th June already")
strtotime("17th June already")
strtotime("June already")
strtotime("already")
//2nd iteration
strtotime("Gadzooks, is it 17th June")
strtotime("is it 17th June")
strtotime("it 17th June")
strtotime("17th June") //date!
strtotime("June") //date!
//3rd iteration
strtotime("Gadzooks, is it 17th")
strtotime("is it 17th")
strtotime("it 17th")
strtotime("17th") //date!
//4th iteration
strtotime("Gadzooks, is it")
//etc
And we can assume that strtotime("17th June") is more accurate than strtotime("17th") simply because it contains more words... i.e. "next Friday" will always be more accurate than "Friday".
I would do it this way:
First check if the entire string is a valid date with strtotime(). If so, you're done.
If not, determine how many words are in your string (split on whitespace for example). Let this number be n.
Loop over every n-1 word combination and use strtotime() to see if the phrase is a valid date. If so you've found the longest valid date string within your original string.
If not, loop over every n-2 word combination and use strtotime() to see if the phrase is a valid date. If so you've found the longest valid date string within your original string.
...and so on until you've found a valid date string or searched every single/individual word. By finding the longest matches, you'll get the most informed dates (if that makes sense). Since you're dealing with tweets, your strings will never be huge.
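The steps above can be sketched roughly as follows, trying the longest runs of consecutive words first. The function name find_longest_date_phrase is hypothetical, and since strtotime() accepts some surprisingly short inputs, the results are worth checking against real data.

```php
<?php
// Sketch of the longest-phrase-first search; the function name is hypothetical.
function find_longest_date_phrase(string $text): ?array
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($words);

    // Try every run of $len consecutive words, from the longest down to single words.
    for ($len = $n; $len >= 1; $len--) {
        for ($start = 0; $start + $len <= $n; $start++) {
            $phrase = implode(' ', array_slice($words, $start, $len));
            $ts = strtotime($phrase);
            if ($ts !== false) {
                return ['phrase' => $phrase, 'timestamp' => $ts];
            }
        }
    }
    return null; // nothing in the text parsed as a date
}

$found = find_longest_date_phrase('croquet next Friday');
echo $found === null ? 'no date found' : $found['phrase'], "\n";
```

The number of strtotime() calls grows with the square of the word count, which is acceptable for tweet-sized input.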
Inspired by Juan Cortes's broken link based off Dolph's algorithm, I went ahead and wrote it up myself. Note that I decided to just return on first successful match.
<?php
function extractDatetime($string) {
    if(strtotime($string)) return $string;
    $string = str_replace(array(" at ", " on ", " the "), " ", $string);
    if(strtotime($string)) return $string;
    $list = explode(" ", $string);
    $first_length = count($list);
    for($j=0; $j < $first_length; $j++) {
        $original_length = count($list);
        for($i=0; $i < $original_length; $i++) {
            $temp_list = $list;
            for($k = 0; $k < $i; $k++) unset($temp_list[$k]);
            //echo "<code>".implode(" ", $temp_list)."</code><br/>"; // for visualizing the tests, if you want to see it
            if(strtotime(implode(" ", $temp_list))) return implode(" ", $temp_list);
        }
        array_pop($list);
    }
    return false;
}
Inputs
$array = array(
    "Gadzooks, is it 17th June already",
    "I’m going to play croquet next Friday",
    "Where was the Rav Four yesterday at 6 PM?",
    "Where was Steve on Monday at 7am?"
);
foreach($array as $a) echo "$a => ".extractDatetime(str_replace("?", "", $a))."<hr/>";
Outputs
Gadzooks, is it 17th June already
is it 17th June already
it 17th June already
17th June already
June already
already
Gadzooks, is it 17th June
is it 17th June
it 17th June
17th June
Gadzooks, is it 17th June already => 17th June
-----
I’m going to play croquet next Friday
going to play croquet next Friday
to play croquet next Friday
play croquet next Friday
croquet next Friday
next Friday
I’m going to play croquet next Friday => next Friday
-----
Where was Rav Four yesterday 6 PM
was Rav Four yesterday 6 PM
Rav Four yesterday 6 PM
Four yesterday 6 PM
yesterday 6 PM
Where was the Rav Four yesterday at 6 PM? => yesterday 6 PM
-----
Where was Steve Monday 7am
was Steve Monday 7am
Steve Monday 7am
Monday 7am
Where was Steve on Monday at 7am? => Monday 7am
-----
Something like the following might do it:
$months = array(
"01" => "January",
"02" => "February",
"03" => "March",
"04" => "April",
"05" => "May",
"06" => "June",
"07" => "July",
"08" => "August",
"09" => "September",
"10" => "October",
"11" => "November",
"12" => "December"
);
$weekDays = array(
"01" => "Monday",
"02" => "Tuesday",
"03" => "Wednesday",
"04" => "Thursday",
"05" => "Friday",
"06" => "Saturday",
"07" => "Sunday"
);
foreach($months as $value){
    // stripos() is case-insensitive; compare with false because a match at position 0 is falsy
    if(stripos($string, $value) !== false){
        // extract and assign as you like...
    }
}
Probably do another loop to check for weekDays or other formats, or just nest.
Use the strtotime php function.
Of course you would need to set up some rules to parse them since you need to get rid of all the extra content on the string, but aside from that, it's a very flexible function that will more than likely help you out here.
For example, it can take strings like "next Friday" and "June 15th" and return the appropriate UNIX timestamp for the date in the string. I guess that if you consider some basic rules like looking for "next X" and week and month names you would be able to do this.
If you could locate the "next Friday" from the "I'm going to play croquet next Friday" you could extract the date. Looks like a fun project to do! But keep in mind that strtotime only takes English phrases and will not work with any other language.
For example, a rule that will locate all the "Next weekday" cases would be as simple as:
$datestring = "I'm going to play croquet next Friday";
$weekdays = array('monday','tuesday','wednesday',
'thursday','friday','saturday','sunday');
foreach($weekdays as $weekday){
if(strpos(strtolower($datestring),"next ".$weekday) !== false){
echo date("F j, Y, g:i a",strtotime("next ".$weekday));
}
}
This will return the date of the next weekday mentioned on the string as long as it follows the rule! In this particular case, the output was June 18, 2010, 12:00 am.
With a few (maybe more than a few!) of those rules you will more than likely extract the correct date in a high percentage of the cases, considering that the users use correct spelling though.
Like it's been pointed out, with regular expressions and a little patience you can do this. The hardest part of coding is deciding what way you are going to approach your problem, not coding it once you know what!
Following Dolph Mathews's idea and basically ignoring my previous answer, I built a pretty nice function that does exactly that. It returns the string it thinks is the one that matches a date, the Unix timestamp of it, and the date itself, either in the user-specified format or the predefined one (F j, Y). I wrote a small post about it on Extracting a date from a string with PHP. As a teaser, here's the output of the two example strings:
Input: “I’m going to play croquet next Friday”
Output: Array (
[string] => "next friday",
[unix] => 1276844400,
[date] => "June 18, 2010"
)
Input: “Gadzooks, is it 17th June already?”
Output: Array (
[string] => "17th june",
[unix] => 1276758000,
[date] => "June 17, 2010"
)
I hope it helps someone.
Based on Dolph's suggestion, I wrote out a function that I think serves the purpose.
public function parse_date($text, $offset, $length){
$parseArray = preg_split( "/[\s,.]/", $text);
$dateTest = implode(" ", array_slice($parseArray, $offset, $length == 0 ? null : $length));
$date = strtotime($dateTest);
if ($date){
return $date;
}
//make the string one word shorter in the front
$offset++;
//have we reached the end of the array?
if($offset > count($parseArray)){
//reset the start of the string
$offset = 0;
//trim the end by one
$length--;
//reached the very bottom with no date found
if(abs($length) >= count($parseArray)){
return false;
}
}
//try to find the date with the new substring
return $this->parse_date($text, $offset, $length);
}
You would call it like this (from within the same class, since it is declared as a method):
$this->parse_date('Setting the due date january 5th 2017 now', 0, 0);
What you're looking for is a temporal expression parser. You might look at the Wikipedia article to get started. Keep in mind that the parsers can get pretty complicated, because this is really a language recognition problem, one commonly tackled in the artificial intelligence/computational linguistics field.
The majority of the suggested algorithms are in fact pretty lame. I suggest using some nice regex for dates and testing the sentence with it. Use this as an example:
(\d{1,2})?
((mon|tue|wed|thu|fri|sat|sun)|(monday|tuesday|wednesday|thursday|friday|saturday|sunday))?
(\d{1,2})? (\d{2,4})?
I skipped months, since I'm not sure I remember them in the right order.
This is the easiest solution, yet it will do the job better than the other compute-power-based solutions. (And yeah, it's hardly a fail-proof regex, but you get the point.) Then apply the strtotime function to the matched string. This is the simplest and fastest solution.
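A hedged sketch of that regex-then-strtotime idea, with months included, might look like this. The pattern and the function name first_date_in are illustrative only; they cover just "17th June" and "June 17th" style dates with an optional four-digit year.

```php
<?php
// Sketch: pattern and function name are illustrative, not a complete date grammar.
function first_date_in(string $sentence): ?int
{
    $month = '(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|'
           . 'aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)';
    $day   = '\d{1,2}(?:st|nd|rd|th)?';
    // Either "17th June [2010]" or "June 17th[,] [2010]".
    $pattern = "/\\b(?:$day\\s+$month|$month\\s+$day)\\b(?:,?\\s+\\d{4})?/i";

    if (!preg_match($pattern, $sentence, $m)) {
        return null;
    }
    // Only the matched fragment is handed to strtotime(), not the whole sentence.
    $ts = strtotime($m[0]);
    return $ts === false ? null : $ts;
}

$ts = first_date_in('Gadzooks, is it 17th June 2010 already?');
echo $ts === null ? 'no date found' : date('Y-m-d', $ts), "\n";
```

Only one strtotime() call is made per sentence, which is the efficiency argument made above.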
