I need to get everything before "On Sun, May 27, 2012 at 6:25 AM,"
I am hoping to get everything before "On xxx, xxx xx, xxxx at xx:xx xx,"
The problem here is that May, 27, and 6 are all variable in length. What is the best tool for this job? Due to my lack of experience with regex I am trying to use explode(), but it doesn't appear that it can do the job here. Is regex my best option?
[EDIT]
I ended up using a combination of answers. I went with:
preg_match("/(.*)On\s+(Sun|Sat|Fri|Thu|Wed|Tue|Mon),\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d?\d,\s+\d{4}\s+at\s+\d?\d:\d\d\s+[AP]M,/i", $to, $end);
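For anyone reusing this, here is a minimal sketch of how that call might be wrapped. The function name textBeforeReplyHeader is hypothetical, and the pattern is a slight variation on the one above: it assumes you want an anchored, lazy (.*?) and the s modifier so that bodies spanning several lines are captured in full.

```php
<?php
// Hypothetical helper; the pattern is the one from the edit above with
// ^, a lazy (.*?) and the "s" modifier added (an assumption, so that
// "." also matches newlines in multi-line message bodies).
function textBeforeReplyHeader(string $body): ?string
{
    $pattern = '/^(.*?)On\s+(?:Sun|Sat|Fri|Thu|Wed|Tue|Mon),\s+'
        . '(?:January|February|March|April|May|June|July|August|'
        . 'September|October|November|December)\s+\d?\d,\s+\d{4}\s+'
        . 'at\s+\d?\d:\d\d\s+[AP]M,/is';

    if (preg_match($pattern, $body, $m) === 1) {
        // Group 1 holds everything before the "On <day>, <month> ..." header.
        return rtrim($m[1]);
    }

    return null; // no reply header found
}

$to = "Let me know how this response looks..... On Sun, May 27, 2012 at 6:25 AM, Pr";
echo textBeforeReplyHeader($to), "\n";
```

Returning null when nothing matches lets the caller decide whether to fall back to the whole message.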
Something like this, I guess:
/On\s+(Sun|Sat|Fri|Thu|Wed|Tue|Mon),\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d?\d,\s+\d{4}\s+at\s+\d?\d:\d\d\s+[AP]M,/i
[EDIT]
As per the comment: I have added support for case-insensitive matching (by adding the i modifier to the end of the regex). I have also changed the spaces in the expression to \s to allow any whitespace character, and added + to allow multiple spaces between words.
I haven't changed it to support long day names or short month names, as the question specified that the month name was variable in length but didn't specify the day name as being variable. However, it should be trivial enough to add these variants if required.
[EDIT]
$to = "Let me know how this response looks..... On Sun, May 27, 2012 at 6:25 AM, Pr";
preg_match("/On\s+(Sun|Sat|Fri|Thu|Wed|Tue|Mon),\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d?\d,\s+\d{4}\s+at\s+\d?\d:\d\d\s+[AP]M,/i", $to, $end);
This code works for the example given in your comment.
Hope that helps.
preg_match('/(.*?) On \w+, \w+ \d?\d, \d+ at \d?\d:\d?\d \w\w,/', 'grab this text here On Sun, May 27, 2012 at 6:25 AM,', $matches);
echo $matches[1];
// echoes 'grab this text here'
(.*?) lazily matches everything at the beginning, \w+ matches one or more word characters (letters, digits, or underscore), and \d?\d matches either one or two digits.
A regular expression would work, since that's what it was made for: selecting data based on a pattern. You could, however, explode on ',' (comma) and just implode the first 3 elements together again to form your sentence. I doubt using a regular expression will be faster in this case.
Ultimately it's your preference: whichever is more readable and understandable to you.
The main advantage regular expressions would have in this particular case is that they can extract specific values/patterns, so you could easily have them set aside the month, for instance.
$dateString = "On Sun, May 27, 2012 at 6:25 AM, some other text here";
// using explode/implode
$result = explode(',',$dateString);
print "we got: " . implode(',', array_slice($result,0,3)) . "\n";
// using regular expression
$pattern = "/On [A-Za-z]{3}, [A-Za-z]+ [0-9]+, [0-9]{4} at [0-9:]+ [AP]M/";
preg_match($pattern,$dateString,$match);
print "We got: " . $match[0] . "\n";
Please also read the PHP manual, Regular Expressions subsection together with an initial tutorial
Personally in this case I think reg exp might be overkill both visually and performance wise. Do learn regular expressions though, they can be very helpful at times.
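To illustrate the point about setting aside specific values such as the month, here is a small sketch using named capture groups. The group names (day, month, date, year, time, ampm) are my own choice, not something from the question.

```php
<?php
$dateString = "On Sun, May 27, 2012 at 6:25 AM, some other text here";

// Named groups (?<name>...) make each part of the date addressable by key.
// The group names are illustrative only.
$pattern = '/On (?<day>[A-Za-z]{3}), (?<month>[A-Za-z]+) (?<date>\d{1,2}), '
    . '(?<year>\d{4}) at (?<time>\d{1,2}:\d{2}) (?<ampm>[AP]M),/';

if (preg_match($pattern, $dateString, $m) === 1) {
    echo "month: {$m['month']}, year: {$m['year']}, time: {$m['time']} {$m['ampm']}\n";
}
```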
I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
The pattern matches your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?, and skips it with (*SKIP)(*F); otherwise, it matches a non-final . with \.(?!\s*$) (even if there are trailing whitespace chars).
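As a rough check, the pattern can be run against the three example inputs from the question. This sketch assumes the source file is saved as UTF-8 (needed for the £ sign with the u modifier); the $results array is just for illustration.

```php
<?php
// Pattern from the answer above: a price is consumed and then discarded with
// (*SKIP)(*F), so only a period outside a price (and not at the very end of
// the string) is left to act as the delimiter.
$pattern = '~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';

$inputs = [
    'This string should not split as the only periods that appear are here £19.99 and also at the end.',
    'This string should split right here. As the period is not part of a price or at the end of the string.',
    'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price',
];

$results = [];
foreach ($inputs as $input) {
    $parts = preg_split($pattern, $input);
    // A single part means no qualifying period was found (the "n/a" case).
    $results[] = count($parts) > 1 ? $parts[0] : null;
}

var_dump($results);
```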
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
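The matching approach broken down above could be used roughly like this. It is only a sketch: the variable names $before and $after are mine, and the fallback for a non-matching string is an assumption about what you would want.

```php
<?php
$string = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';

$pattern = '~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su';

if (preg_match($pattern, $string, $match) === 1) {
    $before = $match[1]; // Group 1: text up to the first period outside a price
    $after  = $match[2]; // Group 2: the rest of the string
} else {
    // Assumed fallback: no period outside a price, keep the string whole.
    $before = $string;
    $after  = '';
}

echo $before, "\n";
echo ltrim($after), "\n";
```

Note that, unlike the preg_split version, this pattern has no lookahead for the end of the string, so a string whose only free period is the final one still matches on that period, with an empty Group 2.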
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the indexes of the different sentences in the string.
It takes ?, ! and ... into account too. In addition to handling the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex (a period followed by whitespace):
\.\s
Since only the period that ends the first sentence is followed by a space (the one inside the price is not), this should work just as well, right?
It's a basic preg_replace that detects phone numbers (and just long numbers). My problem is I want to avoid detecting numbers between double "", single '' and forward slashes //
$text = preg_replace("/(\+?[\d-\(\)\s]{8,25}[0-9]?\d)/", "<strong>$1</strong>", $text);
I poked around but nothing is working for me. Your help will be appreciated.
I predict that your pattern is going to let you down more than it is going to satisfy you (or you are very comfortable with "over-matching" within the scope of your project).
While my suggestion really blows out the pattern length, a (*SKIP)(*FAIL) technique will serve you well enough by consuming and discarding the substrings that require disqualification. There may be a way of dictating the pattern logic with lookaround instead, but with an initial pattern with so many potential holes in it and no sample data, there are just too many variables to make a confident suggestion.
Regex101 Demo
Code: (Demo)
$text = <<<TEXT
A number 555555555 then some more text and a quoted number "(123)4567890" and
then 1 2 3 4 6 (54) 3 -2 and forward slashed /+--------0/ versus
+--------0 then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "+012345678901/ which of course is a _necessary_ check?
TEXT;
echo preg_replace(
'~([\'"/])\+?[\d()\s-]{8,25}\d{1,2}\1(*SKIP)(*FAIL)|((?!\s)\+?[\d()\s-]{8,25}\d{1,2})~',
"<strong>$2</strong>",
$text);
Output:
A number <strong>555555555</strong> then some more text and a quoted number "(123)4567890" and
then <strong>1 2 3 4 6 (54) 3 -2</strong> and forward slashed /+--------0/ versus
<strong>+--------0</strong> then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "<strong>+012345678901</strong>/ which of course is a _necessary_ check?
For the technical breakdown, see the Regex101 link.
Otherwise, this is effectively checking for "phone numbers" (by your initial pattern) and if they are wrapped by ', ", or / then the match is ignored and the regex engine continues looking for matches AFTER that substring. I have added (?!\s) at the start of the second usage of your phone pattern so that leading spaces are omitted from the replacement.
It seems that you're not validating, so you might be trying to write an expression with fewer boundaries, such as:
^\+?[0-9()\s-]{8,25}[0-9]$
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see in this link how it would match against some sample inputs.
I need to change the decimal separator in a given string that has numbers in it.
What RegEx code can ONLY select the thousands separator character in the string?
It needs to select the comma only when there are digits around it. For example, only in 123,456 do I need to select and replace the ,
I'm converting English numbers into Persian (e.g. Hello 123 becomes Hello ۱۲۳). Now I need to replace the decimal separator with the Persian version too, but I don't know how to select it with regex. E.g. Hello 121,534 must become Hello ۱۲۱/۵۳۴
The character that needs to be replaced is , with /
Use a regular expression with lookarounds.
$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);
DEMO
(?<=\d) means there has to be a digit before the comma, (?=\d) means there has to be a digit after it. But since these are lookarounds, they're not included in the match, so they don't get replaced.
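A short sketch of the lookaround replacement in context. The digit-mapping step is an assumption about how the rest of the conversion is done (the question doesn't show it), and the sample input with an extra comma after "Hello" is made up to show that such commas are left alone.

```php
<?php
$string = 'Hello, world 121,534';

// Step 1: replace only the commas that sit between two digits.
$withSeparator = preg_replace('/(?<=\d),(?=\d)/', '/', $string);

// Step 2 (assumed approach): map ASCII digits to Persian digits U+06F0..U+06F9.
$latin   = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
$persian = [
    "\u{06F0}", "\u{06F1}", "\u{06F2}", "\u{06F3}", "\u{06F4}",
    "\u{06F5}", "\u{06F6}", "\u{06F7}", "\u{06F8}", "\u{06F9}",
];
$result = str_replace($latin, $persian, $withSeparator);

echo $result, "\n"; // Hello, world ۱۲۱/۵۳۴
```

The separator is replaced before the digits are converted because \d without the u modifier only matches ASCII digits.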
According to your question, the main problem you face is to convert the English number into the Persian.
In PHP there is a library available that can format and parse numbers according to the locale, you can find it in the class NumberFormatter which makes use of the Unicode Common Locale Data Repository (CLDR) to handle - in the end - all languages known to the world.
So converting a number 123,456 from en_UK (or en_US) to fa_IR is shown in this little example:
$string = '123,456';
$float = (new NumberFormatter('en_UK', NumberFormatter::DECIMAL))->parse($string);
var_dump(
(new NumberFormatter('fa_IR', NumberFormatter::DECIMAL))->format($float)
);
Output:
string(14) "۱۲۳٬۴۵۶"
(play with it on 3v4l.org)
Now this shows (somehow) how to convert the number. I'm not so firm with Persian, so please excuse me if I used the wrong locale here. There might also be options to tell it which character to use for grouping, but for the moment the example just shows that the conversion of numbers is taken care of by existing libraries. You don't need to re-invent this; "re-invent" is even something of a misnomer, as this isn't anything a single person could do, or at least it would be rather insane to attempt alone.
So, having clarified how to convert these numbers, the question remains how to do that across the whole text. Well, why not locate all the potential places you are looking for, try to parse each match, and, if successful (and only if successful), convert it to the other locale?
Luckily the NumberFormatter::parse() method returns false if parsing did fail (there is even more error reporting in case you're interested in more details) so this is workable.
For regular expression matching it only needs a pattern which matches a number (largest match wins) and the replacement can be done by callback. In the following example the translation is done verbose so the actual parsing and formatting is more visible:
# some text
$buffer = <<<TEXT
it need to only select , when there is number around it. for example only
when 123,456 i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello 123" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello 121,534" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
TEXT;
# prepare formatters
$inFormat = new NumberFormatter('en_UK', NumberFormatter::DECIMAL);
$outFormat = new NumberFormatter('fa_IR', NumberFormatter::DECIMAL);
$bufferWithFarsiNumbers = preg_replace_callback(
'(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u',
function (array $matches) use ($inFormat, $outFormat) {
[$number] = $matches;
$result = $inFormat->parse($number);
if (false === $result) {
return $number;
}
return sprintf("< %s (%.4f) = %s >", $number, $result, $outFormat->format($result));
},
$buffer
);
echo $bufferWithFarsiNumbers;
Output:
it need to only select , when there is number around it. for example only
when < 123,456 (123456.0000) = ۱۲۳٬۴۵۶ > i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello < 123 (123.0000) = ۱۲۳ >" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello < 121,534 (121534.0000) = ۱۲۱٬۵۳۴ >" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
Here the magic is just to bring the string parts into action with the number conversion by making use of preg_replace_callback with a regular expression pattern which should match the needs in your question but is relatively easy to refine, as you define the whole number part and false positives are filtered thanks to the NumberFormatter class:
 pattern for Unicode UTF-8 strings
                                 |
(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u
 |                  |         |
 |         grouping character |
 |                            |
 word boundary ---------------+
(play with it on regex101.com)
Edit:
To only match the same grouping character over multiple thousand blocks, a named group can be created and back-referenced with \k<grouping_char> for the repetition (a subroutine call such as (?&grouping_char) would re-run the sub-pattern and so accept a different separator each time):
(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:\k<grouping_char>\d{3})*)?\b)u
(now this gets less easy to read; get it deciphered and play with it on regex101.com)
To finalize the answer, only the return clause needs to be condensed to return $outFormat->format($result); and the $outFormat NumberFormatter might need some more configuration but as it is available in the closure, this can be done when it is created.
(play with it on 3v4l.org)
I hope this is helpful and opens up a broader picture: don't look for a solution only at the spot where you hit a wall. Regex alone is often not the answer. I'm pretty sure there are regex freaks who can give you a fairly stable one-liner, but the context in which it is used will not be as stable. I'm not saying there is only one answer, though. Bringing together different levels of work (divide and conquer) lets you rely on a stable number conversion even if you are still unsure how to write a regex pattern for an English number.
You can write a regex to capture numbers with thousand separator, and then aggregate the two numeric parts with the separator you want :
$text = "Hello, world, 121,534" ;
$pattern = "/([0-9]{1,3}),([0-9]{3})/" ;
$new_text = preg_replace($pattern, "$1X$2", $text); // replace the comma with 'X', keep the other groups intact.
echo $new_text ; // Hello, world, 121X534
In PHP you can do that using str_replace
$a="Hello 123,456";
echo str_replace(",", "X", $a);
This will return: Hello 123X456
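One thing to keep in mind: str_replace swaps every comma in the string, not only those between digits. A quick comparison on a made-up input that also has commas in the text (variable names are illustrative):

```php
<?php
$a = "Hello, world, 123,456";

// str_replace touches every comma, including the ones in the prose.
$everyComma = str_replace(",", "X", $a);

// A lookaround-based preg_replace only touches commas between two digits.
$digitsOnly = preg_replace('/(?<=\d),(?=\d)/', 'X', $a);

echo $everyComma, "\n"; // HelloX worldX 123X456
echo $digitsOnly, "\n"; // Hello, world, 123X456
```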
PHP REGEX is a weakness of mine, but still I manage to get some things done with online tools. Consider the following:
A subject string which generally follows this pattern: 1551 UTC 04 June 2012
I want to extract the "04" and assign it to the $day variable using below:
$day = preg_replace("/^([0-9]{4})\s([A-Z]{3})\s([0-9]{2})\s([A-Za-z]{3,})\s([0-9]{4})$/", "$3", $weather['date']);
This works on the following website: http://sqa.fyicenter.com/Online_Test_Tools/Test_Regular_Expression_Search_Replace.php
but I can't get it to work in my script... $day would equal the whole subject string.
The result of your var_dump() is string(38) "1551 UTC 04 June 2012 ". It has 38 chars while it should be only 21. So it looks like there are multiple whitespaces in the string.
Try to trim() your input string and replace \s with \s+ to support multiple whitespaces:
$day = preg_replace("/^([0-9]{4})\s+([A-Z]{3})\s+([0-9]{2})\s+([A-Za-z]{3,})\s+([0-9]{4})$/", "$3", trim($weather['date']));
You say preg_replace, but I think you want to use preg_match(). Is it correct that you don't want to replace the "04" but just want to put it into the variable $day? If so, use preg_match(). In your description you say you want to capture only the "04" part, but your regex has many capture groups (anything within "()" is a capture group and will be returned in the array you give to preg_match).
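A minimal sketch of the preg_match() route, with a single capture group around the day. The padded sample string stands in for $weather['date'] and is an assumption based on the var_dump() output mentioned above.

```php
<?php
// Stand-in for $weather['date']; trailing whitespace as in the var_dump().
$date = "1551 UTC 04 June 2012                 ";

// Only the day is captured; everything else is matched without a group.
$pattern = '/^\d{4}\s+[A-Z]{3}\s+(\d{2})\s+[A-Za-z]{3,}\s+\d{4}$/';

if (preg_match($pattern, trim($date), $m) === 1) {
    $day = $m[1];
} else {
    $day = null; // input did not follow the expected layout
}

echo $day, "\n"; // 04
```

Unlike preg_replace, which hands back the untouched subject when nothing matches, this makes a failed match explicit.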
I have some strings I need to scrape data from. I need a simple way of telling PHP to look in the string and delete data before and after the part I need. An example is:
When: Sat 19 Sep 2009 22:00 to Sun 20 Sep 2009 03:00
I want to delete the "When: " and then remove the & and everything after it. Is this a Regex thing? Not really used them before.
I would not use regular expressions for this.
$data = substr($input, 6, strpos($input, '&') - 6);
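Spelled out a little more defensively, that might look like the sketch below. The "&more" tail on the input is hypothetical (the question's sample lost whatever followed the &), and the handling of a missing & is my assumption.

```php
<?php
// Hypothetical input: the question's sample with an assumed "&..." tail.
$input = 'When: Sat 19 Sep 2009 22:00 to Sun 20 Sep 2009 03:00&more';

$prefixLength = strlen('When: '); // 6, the magic number in the one-liner
$ampPos = strpos($input, '&');

// strpos() returns false when there is no "&"; keep the rest in that case.
$data = $ampPos === false
    ? substr($input, $prefixLength)
    : substr($input, $prefixLength, $ampPos - $prefixLength);

echo $data, "\n"; // Sat 19 Sep 2009 22:00 to Sun 20 Sep 2009 03:00
```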
Yes, regex can do this kind of thing in its sleep.
$result = preg_replace('/When:(.*)&.*/', '$1', $text);
UPDATE
If you want to find the date range only, in the middle of a lot of other text, here is a crude regex that will match the one in the question...
if (preg_match('/[a-z]{3} [0-9]{2} [a-z]{3} [0-9]{4} [0-9]{2}:[0-9]{2} to [a-z]{3} [0-9]{2} [a-z]{3} [0-9]{4} [0-9]{2}:[0-9]{2}/i', $text, $regs)) {
$result = $regs[0];
} else {
$result = "";
}
So you would want to keep "Sat 19 Sep 2009 22:00 to Sun 20 Sep 2009 03:00"
Well, you can go for a regexp all right. I don't know much about regexps in PHP, but in Perl you could do something like
/^When: (.*)\ $/ .
The (.*) group then captures everything you want to keep. In Perl, you would read it from the $1 variable.
Or you could do something like
/^When: (.*)\&.*$/ if the content after the & is variable.
Also, you must watch out: if the string you want to keep contains an &, then it might be a little more tricky.
But regexps are usually the way to go for this type of work.