I have a string and want to extract data from it.
$str = "Online (UVD) - 154,842 - Last Updated: Nov 23 2015 02:24 PM";
I want to extract 154,842 and 2015 from it. I've successfully extracted the first part with this method:
trim(str_replace("Online (UVD) - ", "", str_replace(",", "", substr_replace($str, "", strpos($str, " - Last Updated")))))
Now I'm unsure how to extract the other one. The data can vary, for instance:
$str = "Online (UVD) - 1123123 - Last Updated: Nov 23 2015 02:24 PM";
$str = "Online (UVD) - 12 - Last Updated: Nov 23 2015 02:24 PM";
$str = "Online (UVD) - 1546546 - Last Updated: Nov 23 2015 02:24 PM";
$str = "Online (UVD) - 3525252525 - Last Updated: Nov 23 2015 02:24 PM";
Is there a better method to extract them?
If the strings will always have the same number of values, perhaps using explode and then specific array positions would work for you.
$str = "Online (UVD) - 154,842 - Last Updated: Nov 23 2015 02:24 PM";
$pieces = explode(' ',$str);
echo 'Value is ' . $pieces[3] . ' and the year is ' . $pieces[9];
You can do it without regex if the words in the string are always in the same order as in the samples you provided. Let's try with explode() -
<?php
$str = "Online (UVD) - 1123123 - Last Updated: Nov 23 2015 02:24 PM";
$str = "Online (UVD) - 12 - Last Updated: Nov 23 2015 02:24 PM";
$str = "Online (UVD) - 1546546 - Last Updated: Nov 23 2015 02:24 PM";
$str = "Online (UVD) - 3525252525 - Last Updated: Nov 23 2015 02:24 PM";
$digit = explode(' ',$str);
echo trim($digit[3]); // prints the number
echo trim($digit[9]); // prints the year
?>
DEMO: https://3v4l.org/ttBDG
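As a variation on the explode() idea, here is a minimal sketch that splits on the " - " separators instead of single spaces, so extra words in the first part do not shift the positions. It assumes the "Last Updated: " label and the "Nov 23 2015 02:24 PM" layout never change; the variable names are only illustrative.

```php
<?php
$str = "Online (UVD) - 154,842 - Last Updated: Nov 23 2015 02:24 PM";

// Split on the " - " separators; a limit of 3 keeps the date part in one piece.
$parts = explode(' - ', $str, 3);

// Drop the thousands separators before casting to an integer.
$count = (int) str_replace(',', '', $parts[1]);

// Strip the label, then parse the remaining "Nov 23 2015 02:24 PM".
$stamp = substr($parts[2], strlen('Last Updated: '));
$date  = DateTime::createFromFormat('M d Y h:i A', $stamp);
$year  = $date ? (int) $date->format('Y') : null; // null if the date is malformed

echo $count, ' ', $year, PHP_EOL;
```

Parsing the date instead of picking the ninth word also gives you the month, day and time if you need them later.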
I know this is answered, but I think it is worth also providing a regex solution for this:
To extract your 1st group, you can use the regex below:
preg_match('/.-.(\d+).-/', $str, $numExtracted);
if (!empty($numExtracted)) {
echo $numExtracted[1].PHP_EOL;
}
To extract your Year:
preg_match('/(\w\w\w).(\d\d).(\d\d\d\d)/', $str, $year, PREG_OFFSET_CAPTURE);
$year = $year[3][0];
echo $year.PHP_EOL;
This worked on all of the below trials:
Online (UVD) - 1123123 - Last Updated: Nov 23 2015 02:24 PM
Online (UVD) - 12 - Last Updated: Nov 23 2015 02:24 PM
Online (UVD) oi oi - 1546546 - Last Updated: Nov 23 2015 02:24 PM
Online -sdtgstg346fg - (UVD) - 3525252525 - Last Updated: Nov 23 2015 02:24 PM
You can check the working code here
As per the question in your comment, you can enhance your regex to consider such cases:
.-.(\d+)?[\,\#\!\?\$\£\;\:]*(\d+)?.-
It will match all of the above plus these cases:
Online (UVD) - 1123,123 - Last Updated: Nov 23 2015 02:24 PM
Online (UVD) - 1123#!,123 - Last Updated: Nov 23 2015 02:24 PM
But I think at some point you need to decide whether you want to salvage the information you received or just consider it corrupt.
You can even introduce loops to parse every single scenario, but if I am expecting a number and the regex suddenly matches something like 1A2B3C4G5D8D2F, I will discard it, as it is far from what I initially expected. It all depends on where you receive your information from, how likely it is to change, etc. :)
Still, I think regex will make you happier and cover far more possibilities.
PS: For the special cases introduced, because the number is interrupted by special chars (or even words, if you consider them), it is now captured as 2 numbers.
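If you prefer to grab both values in one go and reject anything unexpected, something along these lines might do. It is only a sketch: parse_status() is a made-up helper name, and the pattern assumes the " - Last Updated: Mon DD YYYY" layout from the question, with the number made of digits and commas only.

```php
<?php
// Hypothetical helper: pulls the count and the year out of one status line.
function parse_status(string $str): ?array
{
    // " - <digits with optional commas> - Last Updated: <Mon> <day> <year>"
    $pattern = '/ - ([\d,]+) - Last Updated: \w{3} \d{1,2} (\d{4})/';
    if (!preg_match($pattern, $str, $m)) {
        return null; // treat anything else as malformed input
    }
    return [
        'count' => (int) str_replace(',', '', $m[1]),
        'year'  => (int) $m[2],
    ];
}

$result = parse_status("Online (UVD) - 154,842 - Last Updated: Nov 23 2015 02:24 PM");
echo $result['count'], ' ', $result['year'], PHP_EOL;
```

Returning null for a non-matching line keeps the "consider it corrupt" decision in one place instead of spreading it over several checks.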
I'm making a pagination system that will paginate articles on my website.
GOAL: Paginate an array of files (7 elements/page)
I ran into a problem that I've been troubleshooting for 5+ hours... Here's the logic side of things, correct me if I'm wrong.
Okay. So I've got 26 dummy articles (the alphabet) inside a folder.
Let's find the number of files in there... I will call the result: variable X.
To get the number of pagination pages, I'm doing the following:
X divided by 7. Obviously, this can output floats instead of integers... So I'll be rounding up the result using "ceil", which will always round upwards.
Let's call the number of pages "Z".
So me and my new friend Z want to tell some kind of function to fetch those articles. I've made the following equations to find the start and the end of the articles I want to show, where $page is the current page (from 1 to Z).
$start = $page * 7 - 7
$end = $page * 7
Those equations generate
0 to 7 for page 1. Expected result (not reality):
a, b, c, d, e, f, g.
7 to 14 for page 2. Expected result (not reality).
h, i, j, k, l, m, n.
And so on...
So, using my superior brain (sike) I managed to generate the following output for page 1:
CHOOSE PAGE: 1 2 3 4
Youre at page 1
Theres 26 articles
Showing 0 to 7
a - Thursday, 4th of April 2019 # 20:54:02
b - Thursday, 4th of April 2019 # 20:54:04
c - Thursday, 4th of April 2019 # 20:54:08
d - Thursday, 4th of April 2019 # 20:54:10
e - Thursday, 4th of April 2019 # 20:54:13
f - Thursday, 4th of April 2019 # 20:54:15
g - Thursday, 4th of April 2019 # 20:54:18
But, weirdly enough, when I go to page 2, I get... this mess.
CHOOSE PAGE: 1 2 3 4
Youre at page 2
Theres 26 articles
Showing 7 to 14
h - Thursday, 4th of April 2019 # 20:54:22
i - Thursday, 4th of April 2019 # 20:54:24
j - Thursday, 4th of April 2019 # 20:54:28
k - Thursday, 4th of April 2019 # 20:54:31
l - Thursday, 4th of April 2019 # 20:54:34
m - Thursday, 4th of April 2019 # 20:54:37
n - Thursday, 4th of April 2019 # 20:54:39
o - Thursday, 4th of April 2019 # 20:54:42
p - Thursday, 4th of April 2019 # 20:54:44
q - Thursday, 4th of April 2019 # 20:55:47
r - Thursday, 4th of April 2019 # 20:55:49
s - Thursday, 4th of April 2019 # 20:55:51
t - Thursday, 4th of April 2019 # 20:55:53
u - Thursday, 4th of April 2019 # 20:55:55
...And when I go to page 3, some of the results from page 2 appear!
CHOOSE PAGE: 1 2 3 4
Youre at page 3
Theres 26 articles
Showing 14 to 21
o - Thursday, 4th of April 2019 # 20:54:42
p - Thursday, 4th of April 2019 # 20:54:44
q - Thursday, 4th of April 2019 # 20:55:47
r - Thursday, 4th of April 2019 # 20:55:49
s - Thursday, 4th of April 2019 # 20:55:51
t - Thursday, 4th of April 2019 # 20:55:53
u - Thursday, 4th of April 2019 # 20:55:55
v - Thursday, 4th of April 2019 # 20:55:57
w - Thursday, 4th of April 2019 # 20:56:00
x - Thursday, 4th of April 2019 # 20:56:03
y - Thursday, 4th of April 2019 # 20:56:05
z - Thursday, 4th of April 2019 # 20:56:07
Finally, I get one last page (page 4) with the final last result from page 3.
Heres the code...
<?php
$page = strip_tags($_GET['p']);
if(empty($page)){$page = "1";}
$post_array = glob("post/*");
$post_count = count($post_array);
$page_num = ceil($post_count / 7);
echo "CHOOSE PAGE: ";
for($i = 1; $i<$page_num+1; $i++){
echo "{$i} ";
}
if($page>$page_num){
echo "<br>error";
}
elseif(!is_numeric($page)) {
echo "<br>error";
}
else {echo "<br>Youre at page {$page}<br>";
echo "Theres {$post_count} articles<br><br>";
$start = $page * 7 - 7;
$end = $page * 7;
$post_array_sliced = array_slice($post_array, $start, $end);
echo "Showing {$start} to {$end}<br><br>";
foreach ($post_array_sliced as $post){
$post_name = pathinfo($post)['filename'];
$post_date = filemtime($post);
echo "{$post_name} - ".date('l, jS \of F Y # H:i:s', $post_date)."<br>";
}
}
?>
I think this problem is caused by my awful logic skills.
Could anyone correct me or point me to the docs?
Thanks a lot for y'all's time :)
array_slice doesn't expect the first and the last index, but the first index and the length (number of elements to extract).
So you should put:
array_slice($post_array, $start, 7);
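To see the offset/length behaviour in isolation, here is a small sketch that uses the alphabet in place of the files in post/. The paginate() helper and its parameter names are made up for the illustration.

```php
<?php
// Hypothetical helper: returns the items for a 1-based page number.
function paginate(array $items, int $page, int $perPage = 7): array
{
    $offset = ($page - 1) * $perPage;
    // array_slice takes (array, offset, length), not (array, start, end).
    return array_slice($items, $offset, $perPage);
}

$letters = range('a', 'z');                 // 26 dummy articles
$pages   = (int) ceil(count($letters) / 7); // 4 pages

for ($p = 1; $p <= $pages; $p++) {
    echo "Page {$p}: ", implode(', ', paginate($letters, $p)), PHP_EOL;
}
```

The last page simply comes back shorter, because array_slice stops at the end of the array.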
I have a document with a bunch of dates, always wrapped in tags and always in a specific format.
$text = '...<dt>31 DEC 1793</dt>... ...<dt>14 JAN 1934</dt>...';
I'm trying to replace this text to include the day of the week:
$text = '...<dt>Tuesday, 31 DEC 1793</dt>... ...<dt>Sunday, 14 JAN 1934</dt>...';
Right now I'm trying to use preg_replace to achieve this, but it just gives me the current date.
$text = preg_replace('/<dt>(\d{1,2} [A-Z]{3} \d{4})<\/dt>/i', "<dt>".date('l', strtotime("$1")).", $1</dt>", $text);
It seems like the date function just runs once, instead of each time a replace happens. How could I make this work?
You need to run the date and strtotime functions inside a callback:
$text = '...<dt>31 DEC 1793</dt>... ...<dt>14 JAN 1934</dt>...';
$text = preg_replace_callback(
'/<dt>(\d{1,2} [A-Z]{3} \d{4})<\/dt>/i',
function ($matches) {
$date = $matches[1];
return "<dt>".date('l', strtotime($date)).", ".$date."</dt>";
},
$text
);
// $text = '...<dt>Tuesday, 31 DEC 1793</dt>... ...<dt>Sunday, 14 JAN 1934</dt>...';
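If some of the dates in the document might not be parseable, a slightly more defensive variant of the same callback could leave those tags untouched. This is only a sketch of that idea; the guard on strtotime() returning false is an addition, not something the question asked for.

```php
<?php
$text = '...<dt>31 DEC 1793</dt>... ...<dt>14 JAN 1934</dt>...';

$text = preg_replace_callback(
    '/<dt>(\d{1,2} [A-Z]{3} \d{4})<\/dt>/i',
    function (array $m): string {
        $ts = strtotime($m[1]);
        if ($ts === false) {
            return $m[0]; // unparseable: keep the original tag as it was
        }
        return '<dt>' . date('l', $ts) . ', ' . $m[1] . '</dt>';
    },
    $text
);

echo $text, PHP_EOL;
```

The callback runs once per match, which is why the weekday is computed from each captured date rather than once before the replacement starts.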
I have to edit a whole bunch of date intervals. But they are all mixed up. Most are in the form Month YearMonth Year
eg January 2014March 2015
How would I insert a hyphen in between so I end up with
January 2014 - March 2015
I also have the problem where these dates occur in the same year.
eg April 2012September2012
In such a case I would need to insert the hyphen and remove the year so that I'm left with
April - September
There must be some PHP string operators for stuff like this. Well, that's what I'm hoping.
Would appreciate some guidance. Thanks in advance.
Thanks, sorry for my delayed reply
$string = "January 2014March 2015";
preg_match('/([a-z]+) *(\d+) *([a-z]+) *(\d+)/i', $string, $match);
print "$match[1] $match[2] - $match[3] $match[4]";
outputs,
January 2014 - March 2015
You could do it using lookaround:
$string = "January 2014March 2015";
$res = preg_replace('/(?<=\d)(?=[A-Z])/', ' - ', $string);
echo $res,"\n";
Output:
January 2014 - March 2015
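Neither snippet above drops the year when both dates fall in the same year, which the question also asks for. A possible sketch covering both cases follows; format_interval() is a made-up name, and it assumes every input is exactly "Month Year Month Year" with four-digit years and optional spaces.

```php
<?php
// Hypothetical helper covering both cases from the question.
function format_interval(string $s): string
{
    $pattern = '/^([a-z]+) *(\d{4}) *([a-z]+) *(\d{4})$/i';
    if (!preg_match($pattern, trim($s), $m)) {
        return $s; // leave anything unexpected untouched
    }
    [, $fromMonth, $fromYear, $toMonth, $toYear] = $m;

    // Same year: keep only the months; otherwise keep both years.
    return $fromYear === $toYear
        ? "$fromMonth - $toMonth"
        : "$fromMonth $fromYear - $toMonth $toYear";
}

echo format_interval("January 2014March 2015"), PHP_EOL;  // January 2014 - March 2015
echo format_interval("April 2012September2012"), PHP_EOL; // April - September
```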
I'm building a local events calendar which takes RSS feeds and website scrapes and extracts event dates from them.
I've previously asked how to extract dates from text in PHP here, and received a good answer at the time from MarcDefiant:
function parse_date_tokens($tokens) {
# only try to extract a date if we have 2 or more tokens
if(!is_array($tokens) || count($tokens) < 2) return false;
return strtotime(implode(" ", $tokens));
}
function extract_dates($text) {
static $patterns = Array(
'/^[0-9]+(st|nd|rd|th|)?$/i', # day
'/^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|etc)$/i', # month
'/^20[0-9]{2}$/', # year
'/^of$/' #words
);
# defines which of the above patterns aren't actually part of a date
static $drop_patterns = Array(
false,
false,
false,
true
);
$tokens = Array();
$result = Array();
$text = str_word_count($text, 1, '0123456789'); # get all words in text
# iterate words and search for matching patterns
foreach($text as $word) {
$found = false;
foreach($patterns as $key => $pattern) {
if(preg_match($pattern, $word)) {
if(!$drop_patterns[$key]) {
$tokens[] = $word;
}
$found = true;
break;
}
}
if(!$found) {
$result[] = parse_date_tokens($tokens);
$tokens = Array();
}
}
$result[] = parse_date_tokens($tokens);
return array_filter($result);
}
# test
$texts = Array(
"The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
"Valentines Special # The Radisson, Feb 14th",
"On Friday the 15th of February, a special Hollywood themed [...]",
"Symposium on Childhood Play on Friday, February 8th",
"Hosting a craft workshop March 9th - 11th in the old [...]"
);
$dates = extract_dates(implode(" ", $texts));
echo "Dates: \n";
foreach($dates as $date) {
echo " " . date('d.m.Y H:i:s', $date) . "\n";
}
However, the solution has some downsides - for one thing, it can't match date ranges.
I'm now looking for a more complex solution that can extract dates, times and date ranges from sample text.
What's the best approach for this? It seems like I'm leaning back toward a series of regex statements run one after the other to catch these cases. I can't see a better way of catching date ranges in particular, but I know there must be a better way of doing this. Are there any libraries out there just for date parsing in PHP?
Date / Date Range samples, as requested
$dates = [
" Saturday 28th December",
"2013/2014",
"Friday 10th of January",
"Thursday 19th December",
" on Sunday the 15th December at 1 p.m",
"On Saturday December 14th ",
"On Saturday December 21st at 7.30pm",
"Saturday, March 21st, 9.30 a.m.",
"Jan-April 2014",
"January 21st - Jan 24th 2014",
"Dec 30th - Jan 3rd, 2014",
"February 14th-16th, 2014",
"Mon 14 - Wed 16 April, 12 - 2pm",
"Sun 13 April, 8pm",
"Mon 21 - Wed 23 April",
"Friday 25 April, 10 – 3pm",
"The focus of the seminar, on Saturday 2nd February 2013 will be [...]",
"Valentines Special # The Radisson, Feb 14th",
"On Friday the 15th of February, a special Hollywood themed [...]",
"Symposium on Childhood Play on Friday, February 8th",
"Hosting a craft workshop March 9th - 11th in the old [...]"
];
The function I'm currently using (not the above) is about 90% accurate. It can catch date ranges, but has difficulty if a time is also specified. It uses a list of regex expressions and is very convoluted.
UPDATE: Jan 6th, 2014
I'm working on code that does this, working on my original method of a series of regex statements run one after the other. I think I'm close to a working solution that can pretty much extract almost any date/time range / format from a piece of text. When I'm done I'll post it here as an answer.
I think you can sum up the regex in your question like the one below.
(?<date_format_1>(?<day>(?i)\b\s*[0-9]+(?:st|nd|rd|th|)?)(?<month>(?i)\b\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|etc))(?<year>\b\s*20[0-9]{2}) ) |
(?<date_format_2>(?&month)(?&day)(?!\s+-)) |
(?<date_format_3>(?&day)\s+of\s+(?&month)) |
(?<range_type_1>(?&month)(?&day)\s+-\s+(?&day))
Flags: x
Demo
http://regex101.com/r/wP5fR4
Discussion
By using recursive subpatterns, you reduce the complexity of the final regex.
I have used a negative lookahead in date_format_2 because it would otherwise partially match range_type_1. You may need to add more range types depending on your data. Don't forget to check the other patterns in case of a partial match.
Another solution would consist of building small regexes in separate string variables and then concatenating them in PHP to build a bigger regex.
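To illustrate the concatenation approach, here is a rough sketch. The variable names and the shortened month list are only for the example; a real list would need all twelve months.

```php
<?php
// Small building blocks (names are illustrative, month list is shortened).
$day   = '\d{1,2}(?:st|nd|rd|th)?';
$month = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?)';
$year  = '20\d{2}';

// "March 9th - 11th": month, day, dash, day.
$range = "/\\b($month)\\s+($day)\\s+-\\s+($day)\\b/i";
// "2nd February 2013": day, month, year.
$full  = "/\\b($day)\\s+($month)\\s+($year)\\b/i";

preg_match($range, "Hosting a craft workshop March 9th - 11th in the old [...]", $r);
preg_match($full, "The focus of the seminar, on Saturday 2nd February 2013 will be [...]", $f);

echo "$r[1] $r[2] to $r[3]", PHP_EOL; // March 9th to 11th
echo "$f[1] $f[2] $f[3]", PHP_EOL;    // 2nd February 2013
```

Each new date or range format then becomes one more short pattern built from the same pieces, which is easier to test than one large expression.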
Hi, I have these strings:
$rt="Ability: B,Session: Session #2: Tues June 14th - Fri June 24th (9-2:00PM),Time: 9:30am,karthi";
$rt="Ability: B,Session: Session #2: Tues June 14th - Fri June 24th (9-2:00PM),Time: 9:30pm,karthi";
I used the regex below to remove the text after the last comma (,).
$it_nme = preg_replace('/(?<=pm,)\S*/is', '', $rt);
It works for the second string (because there is 'pm' before the comma). In the first one, the text before the comma is 'am'.
How can I write a single regex for both?
preg_replace('/(?<=[ap]m,)\S*/is', '', $rt)
You can use regex alternation (OR) like so:
$it_nme = preg_replace('/(?<=(pm|am),)\S*/is', '', $rt);
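If the goal is simply to drop whatever follows the last comma, regardless of am/pm, a plain string approach might also be enough. This is a sketch under that assumption; like the regex versions, it keeps the trailing comma.

```php
<?php
$rt = "Ability: B,Session: Session #2: Tues June 14th - Fri June 24th (9-2:00PM),Time: 9:30am,karthi";

// Find the last comma and keep everything up to and including it.
$pos    = strrpos($rt, ',');
$it_nme = $pos === false ? $rt : substr($rt, 0, $pos + 1);

echo $it_nme, PHP_EOL;
```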