I am trying to use a regular expression to pick a phone number from a string, where the format of the phone number could be just about anything, or there may not be a phone number at all. For example:
$string = 'My phone number is +34 961 123456.';
$string = 'My phone number is +34 (961) 123456.';
$string = 'My phone number is 961-123456.';
$string = 'My phone number is +34.961.12.34.56.';
$string = 'Product A costs €100.00 and Product B costs €134.15.';
So far, I have got to
$number = preg_replace("/[^0-9\/\+\.\-\s]+/", "", $string);
$number = preg_replace("/[^0-9]+/", "", $number);
if (strlen($number)>8) {
/* It's a phone number, so do something with it */
}
This works for picking out all the different phone number formats that I have tried, but it also puts the prices together and assumes that they are a phone number too.
It seems that my problem is that a human can readily distinguish between a space between words and a space in the middle of a phone number, but how do I make the computer do that? Is there a way that I can replace spaces that are both preceded and followed by a number but leave other spaces intact? Is there some other way of sorting this out?
I'm afraid you aren't gonna like it. The regex I get is this:
(\+?[0-9]?[0-9]?[[:blank:],\.]?[0-9][0-9][0-9][[:blank:],\.]?[0-9][0-9][[:blank:],\.]?[0-9][0-9][[:blank:],\.]?[0-9][0-9])
Explanation:
( <-- opens a capturing group around the whole expression; probably not needed here
\+? <-- optional plus sign
[0-9]?[0-9]? <-- optional prefix code
[[:blank:],\.]? <-- optional space (or comma or dot) between the prefix code and the rest of the number
[0-9][0-9][0-9][[:blank:],\.]? <-- province code, followed by an optional separator
[0-9][0-9][[:blank:],\.]?[0-9][0-9][[:blank:],\.]?[0-9][0-9] <-- the number itself: six digits in three pairs, with optional separators
Because these examples are Spanish telephone numbers, aren't they?
In that case, you haven't given us examples of other formats, like "91 123 45 67", that might complicate the solution even more.
For these cases, I humbly think the best solution is to write a small function. The regular expression is too complex to be maintainable.
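A minimal sketch of what such a function could look like. The assumptions are mine, not the answer's: a candidate is a run of digits joined by spaces, dots, dashes or parentheses, and it counts as a phone number when it holds 9 to 12 digits. The name findPhoneCandidates and both limits are hypothetical.

```php
<?php
// Hypothetical helper: the name and the 9..12 digit limits are assumptions.
function findPhoneCandidates(string $text, int $minDigits = 9, int $maxDigits = 12): array
{
    // A candidate has an optional "+", starts and ends with a digit, and may
    // contain spaces, dots, dashes or parentheses in between.
    preg_match_all('/\+?\d[\d ().-]*\d/', $text, $matches);

    $found = [];
    foreach ($matches[0] as $candidate) {
        $digits = preg_replace('/\D+/', '', $candidate);
        $count  = strlen($digits);
        if ($count >= $minDigits && $count <= $maxDigits) {
            // Keep a leading "+" so the international prefix is not lost.
            $found[] = ($candidate[0] === '+' ? '+' : '') . $digits;
        }
    }
    return $found;
}

$phones = findPhoneCandidates('My phone number is +34.961.12.34.56.');
$prices = findPhoneCandidates('Product A costs €100.00 and Product B costs €134.15.');
```

Here $phones holds the single entry +34961123456 and $prices is empty, because each price has only five digits. The digit limits are the part to tune for other countries.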
Looks like you want sequences of nine to twelve digits, with nothing between them except spaces, parentheses, periods or dashes; and possibly preceded by +. Try this:
preg_match_all("/\+?(?:\d[-. ()]*){9,12}/", $string, $results);
This isn't quite perfect, since trailing punctuation (like the period that follows all your examples) will be included in the matched string. Post-process the list of full matches, which preg_match_all stores in $results[0], to trim it:
$numbers = preg_replace("/[-. ]+$/", "", $results[0]);
Or you could standardize the collected phone numbers by removing all non-digits from the results, keeping just the digits and possibly an initial "+":
$numbers = preg_replace("/[-. ()]/", "", $results[0]);
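For reference, here is a short run of that pattern over the five strings from the question, followed by the normalising step. It is only a sketch: the variable names $samples and $normalized are mine.

```php
<?php
$samples = [
    'My phone number is +34 961 123456.',
    'My phone number is +34 (961) 123456.',
    'My phone number is 961-123456.',
    'My phone number is +34.961.12.34.56.',
    'Product A costs €100.00 and Product B costs €134.15.',
];

$normalized = [];
foreach ($samples as $string) {
    preg_match_all('/\+?(?:\d[-. ()]*){9,12}/', $string, $results);
    // $results[0] is the list of full matches; strip the separators and
    // keep only the digits and a leading "+".
    $normalized[] = preg_replace('/[-. ()]/', '', $results[0]);
}
```

The first, second and fourth strings each give +34961123456, the third gives 961123456, and the price string gives an empty list, since neither price reaches nine digits.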
I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
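The first alternative is your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?. Whenever it matches, (*SKIP)(*F) discards that match and resumes the search after it, so periods inside a price are never considered. The second alternative, \.(?!\s*$), matches any other . that is not the final one in the string (the final period may be followed by trailing whitespace).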
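A quick way to see the effect, as a sketch only. The pattern is the one suggested above, written with a plain £ (the backslash is not needed), and the first test sentence is a shortened version of Example 1.

```php
<?php
// Prices are consumed and discarded by (*SKIP)(*F); only other periods split.
$pattern = '~£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';

$noSplit = preg_split(
    $pattern,
    'The only periods that appear are here £19.99 and also at the end.'
);
$split = preg_split(
    $pattern,
    'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price'
);
```

$noSplit comes back as a single element (the whole string), while $split has two elements, the first ending in "split at this point". The second element keeps its leading space, so trim() it if that matters.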
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or of any single char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
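Wrapped in a small function, the matching approach could look like the sketch below. The name firstSentence is hypothetical, and returning null when nothing follows the period is my own addition to cover Example 1, where the only qualifying period is the last character.

```php
<?php
// Hypothetical wrapper; returns null when there is nothing after the period.
function firstSentence(string $string): ?string
{
    $pattern = '~^((?:£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su';
    if (preg_match($pattern, $string, $match) && trim($match[2]) !== '') {
        return $match[1]; // Group 1: everything before the first non-price period
    }
    return null;
}

$example1 = firstSentence('This string should not split as the only periods that appear are here £19.99 and also at the end.');
$example2 = firstSentence('This string should split right here. As the period is not part of a price or at the end of the string.');
$example3 = firstSentence('There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price');
```

$example1 is null, $example2 is "This string should split right here", and $example3 ends in "split at this point" with the price left intact.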
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator over the sentence boundaries in the string, given as offsets.
It takes ?, ! and ... into account too. Besides the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychiatrist looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?
I'm trying to remove / detect phone numbers from messages between users of my marketplace website (I think eBay does something similar).
This is the code I'm using:
$string = preg_replace('/([0-9]+[\- ]?[0-9]+)/', '', $string);
But it's too aggressive: it strips away any number with 2 or more digits. How can I set a limit of, say, 7 digits instead?
To be more precise, the phone numbers can be in any format, like
3747657654
374-7657654
374-765-7654
(374)765-7654
etc. (I cannot predict what the users will write, as it depends on their habits)
Try this regular expression :
/([0-9]+[\- ]?[0-9]{6,})/
changed to match your samples:
Regex101
That would depend on the exact requirements as now you have 1 or more numbers followed by an optional - or space followed by 1 or more numbers again.
If you wanted for example at least 2 numbers before the space or - followed by at least 5 numbers, you could use something like:
$string = preg_replace('/([0-9]{2,}[\- ]?[0-9]{5,})/', '', $string);
The {2,} and {5,} quantifiers are where you can specify the minimum / maximum number of digits.
You can try something like this:
$string = preg_replace('/(?<![0-9]|[0-9]-)[0-9](?:[- ]?[0-9]){6}(?!-?[0-9])/', '', $string);
The lookarounds are here to avoid numbers with more than 7 digits, but if you want something more specific, you should provide an example string.
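To see what the lookarounds do in practice, here is a small sketch. The test sentences and the [removed] placeholder are mine, used so the effect stays visible.

```php
<?php
$pattern = '/(?<![0-9]|[0-9]-)[0-9](?:[- ]?[0-9]){6}(?!-?[0-9])/';

// Exactly seven digits, with an optional dash or space between them: replaced.
$sevenDigits = preg_replace($pattern, '[removed]', 'call 765-7654 now');

// Eight digits in a row: the lookahead rejects it, so nothing is replaced.
$eightDigits = preg_replace($pattern, '[removed]', 'order 12345678 shipped');

// A dashed ten-digit number is also left alone by this pattern as written.
$tenDigits = preg_replace($pattern, '[removed]', '374-765-7654');
```

$sevenDigits becomes "call [removed] now", while the other two strings come back unchanged. If the ten-digit formats from the question should be caught as well, the {6} quantifier needs to become a range.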
It is impossible to determine whether a number of X digits (where X is a valid phone number length) is a phone number or something else without some sort of context intelligence happening. A simple regex can't determine the difference between "call me at 3453456" and "call me when you've flown 3453456 miles".
Therefore trying to catch phone numbers without any formatting (just straight digits) with a regex is hopeless, pure and simple. Attempting to do so is only holding you back from finding a regex that can find formatted/semi-formatted numbers. What you should be going for here is "get the obvious and as many others as possible with minimal false positives...but recognize I can't get them all."
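For that I'd recommend this (written without a g flag, which PHP's preg functions do not accept; preg_replace and preg_match_all already find every match):
/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/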
It should not get the first three, but get all the rest in this list:
no-match: 3747657654
no-match: 444444444444444
no-match: 7657654
match: 374-765-7654
match: 1-374-765-7654
match: (374)765-7654
match: (374) 765 7654
match: 765-7654
match: 1 (374) 765 7654
match: 1(374)765 7654
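The list above can be checked with a few lines of PHP. This is only a sketch: the pattern is written without the trailing g, which PHP rejects as an unknown modifier, and preg_grep is used to keep the entries that match.

```php
<?php
$pattern = '/1?[ \-]?\(?([0-9]{3})?\)?[ \-]?([0-9]{3})[ \-]([0-9]{4})/';

$candidates = [
    '3747657654', '444444444444444', '7657654',
    '374-765-7654', '1-374-765-7654', '(374)765-7654', '(374) 765 7654',
    '765-7654', '1 (374) 765 7654', '1(374)765 7654',
];

// preg_grep returns the entries that match; array_values renumbers the keys.
$matching = array_values(preg_grep($pattern, $candidates));
```

$matching contains the last seven entries; the three unformatted digit strings are left out because the pattern requires a space or dash before the final four digits.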
I have the following example strings:
The price is $54.00 including delivery
On sale for £12.99 until December
European pricing €54.76 excluding UK
From each of them I want to return only the price and the currency symbol
$54.00
£12.99
€54.76
My thought process is to have an array of currency symbols, search the string for each one, and then capture just the characters up to the next space. However, $ 67.00 would then fail.
So, can I run through an array of preset currency symbols, then explode the string and chop it at the next non-numeric character that is not a . or a , (or maybe do it with regex)?
Is this possible?
In regex, \p{Currency_Symbol} or \p{Sc} represent any currency symbol.
However, PHP supports only the shorthand form \p{Sc}, and the /u modifier is required.
Using regex pattern
/\p{Sc}\s*\d[.,\d]*(?<=\d)/u
you will be able to match for example:
$1,234
£12.3
€ 5,345.01
If you want to use . as a decimal separator and , as a thousands delimiter, then go with
/\p{Sc}\s*\d{1,3}(?:,\d{3})*(?:\.\d+)?/u
Check this demo.
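Applied to the strings from the question, the first pattern could be used as in the sketch below. The fourth input is my own sentence, built around the $ 67.00 case the question mentions.

```php
<?php
$inputs = [
    'The price is $54.00 including delivery',
    'On sale for £12.99 until December',
    'European pricing €54.76 excluding UK',
    'A spaced amount such as $ 67.00 here', // assumed example sentence
];

$prices = [];
foreach ($inputs as $input) {
    // \p{Sc} matches any Unicode currency symbol; the u modifier makes
    // PCRE treat the pattern and the subject as UTF-8.
    if (preg_match('/\p{Sc}\s*\d[.,\d]*(?<=\d)/u', $input, $match)) {
        $prices[] = $match[0];
    }
}
```

$prices ends up as $54.00, £12.99, €54.76 and $ 67.00, in that order. The (?<=\d) at the end stops a match from ending on a stray comma or period.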
You could go for something like this:
preg_match('/(?:\$|€|£)\s*[\d,.-]+/', $input, $match);
And then find your currency and price inside $match.
Of course, you can generate that first part from an array of currency symbols. Just don't forget to escape everything:
$escapedCurrency = array_map("preg_quote", $currencyArray);
$pattern = '/(?:' . implode("|", $escapedCurrency) . ')\s*[\d,.-]+/';
preg_match($pattern, $input, $match);
Some possible improvement to the end of the pattern (the actual number):
(?:\$|€|£)\s*\d+(?:[.,](?:-|\d+))?
That will make sure that there is only one . or , followed by either - or only digits (in case your intention was to allow an international decimal separator).
If you only want to allow the comma to separate thousands, you could go for this:
(?:\$|€|£)\s*\d{1,3}(?:,\d{3})*(?:\.(?:-|\d+))?
This will match the longest "correctly" formatted number (e.g. $ 1,234.4567,123.456 -> $ 1,234.4567 or € 123,456789.12 -> € 123,456). It really depends on how strict you want to be.
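Putting the array-driven pattern and the thousands-separator variant together might look like this. It is a sketch: the symbol list and the "Total:" sentences are assumptions, and preg_quote is given the delimiter so that a / inside a symbol would be escaped as well.

```php
<?php
// The symbol list is an assumption; extend it as needed.
$currencyArray = ['$', '€', '£'];

// Escape each symbol, including the "/" delimiter used by the pattern.
$escaped = array_map(function (string $symbol): string {
    return preg_quote($symbol, '/');
}, $currencyArray);

$pattern = '/(?:' . implode('|', $escaped) . ')\s*\d{1,3}(?:,\d{3})*(?:\.(?:-|\d+))?/';

preg_match($pattern, 'Total: $ 1,234.4567,123.456', $first);
preg_match($pattern, 'Total: € 123,456789.12', $second);
```

$first[0] is "$ 1,234.4567" and $second[0] is "€ 123,456", which are the two truncations described above.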
How do you split a string based on the number of letter characters and/or the number of numbers so that they are separate strings?
Hopefully this makes sense. Thanks for any help (:
For example:
The user inputs:
Henry, Smith ID: 123456
I would like to sort the user input into separate strings with the result of:
$name = 'Henry, Smith';
$ID = '123456';
You can use regex to match only numbers and everything but numbers.
$number = preg_replace("/[^0-9]/", '', $str);
$name = preg_replace("/[0-9]/", '', $str);
Note, for the name, this will return Henry, Smith ID: from your question's example. This just takes the numbers out... it doesn't know "ID:" isn't part of a person's name.
Explanation of the caret (^):
As the first character inside the brackets, it negates the class: match everything NOT listed in the brackets. So [^0-9] matches any character that is not a digit. In this example, it replaces everything that isn't a digit with an empty string (the second parameter). For the $name, we do the opposite: we replace everything that IS a digit with an empty string to keep just the non-digit characters.
See here for more info on regex.
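If the input always has the form "name ID: digits", a single preg_match with two groups avoids the leftover "ID:" in the name. This is a sketch under that assumption, and the function name splitNameAndId is hypothetical.

```php
<?php
// Hypothetical helper; assumes the input looks like "<name> ID: <digits>".
function splitNameAndId(string $input): ?array
{
    // Group 1: the name (lazy, so it stops before "ID:"); group 2: the digits.
    if (preg_match('/^(.*?)\s*ID:\s*(\d+)\s*$/', $input, $m)) {
        return ['name' => $m[1], 'id' => $m[2]];
    }
    return null; // the input did not have the assumed shape
}

$parts = splitNameAndId('Henry, Smith ID: 123456');
```

$parts['name'] is "Henry, Smith" and $parts['id'] is "123456". Input without the "ID:" label returns null, so the caller has to handle that case.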
I need to split a UK postcode into two. I have some code that gets the first half but it doesn't cover everything (such as gir0aa). Does anyone have anything better that validates all UK postcodes then breaks it into the first and second half? Thanks.
function firstHalf($postcode) {
    // Validate the overall shape first, then keep only the first half.
    if (preg_match('/^(([A-PR-UW-Z]{1}[A-IK-Y]?)([0-9]?[A-HJKS-UW]?[ABEHMNPRVWXY]?|[0-9]?[0-9]?))\s?([0-9]{1}[ABD-HJLNP-UW-Z]{2})$/i', $postcode)) {
        return preg_replace('/^([A-Z]([A-Z]?\d(\d|[A-Z])?|\d[A-Z]?))\s*?(\d[A-Z][A-Z])$/i', '$1', $postcode);
    }
    return null; // not a recognised postcode shape
}
This splits ig62ts into ig6, or cm201ln into cm20.
The incode is always a single digit followed by two alpha characters, so the easiest way to split is to chop off the last three characters, allowing it to be validated easily.
Trim any spaces: they're used purely for ease of human readability.
The first part that remains is then the outcode. This can be a single alpha character followed by 1 or 2 digits; two alpha characters followed by 1 or 2 digits; or one or two characters followed by a single digit, followed by an additional alpha character.
There are a couple of notable exceptions: SAN TA1 is a recognised postcode, as is GIR 0AA; but these are the only two that don't follow the standard pattern.
To test if a postcode is valid, a regexp isn't really adequate... you need to do a lookup to retrieve that information.
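The "chop off the last three characters" idea can be sketched as follows. The function name splitPostcode is hypothetical, and it checks shape only; it does not look the postcode up.

```php
<?php
// Hypothetical helper: splits on shape only, without any lookup.
function splitPostcode(string $postcode): ?array
{
    // Remove all whitespace and normalise the case.
    $compact = strtoupper(preg_replace('/\s+/', '', $postcode));

    // The shortest outcode has two characters, so five is the minimum length.
    if (strlen($compact) < 5) {
        return null;
    }

    // The incode is the last three characters: one digit, then two letters.
    $incode  = substr($compact, -3);
    $outcode = substr($compact, 0, -3);

    if (!preg_match('/^\d[A-Z]{2}$/', $incode)) {
        return null;
    }
    return [$outcode, $incode];
}

$first  = splitPostcode('ig62ts');
$second = splitPostcode('cm201ln');
$third  = splitPostcode('GIR 0AA');
```

This gives IG6 / 2TS, CM20 / 1LN and GIR / 0AA. SAN TA1 returns null here because its second half does not start with a digit, so that exception would need a special case.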
If you do not care about validation, then based on the information at http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom (the bottom of the page lists different regexps, including yours), you can use the following for everything except Anguilla:
$str = "BX3 2BB";
// The lazy (.*?) keeps the optional space out of the first part.
preg_match('#^(.*?)\s*(\d\w{2})$#', $str, $matches);
echo "Part #1 = " . $matches[1];
echo "<br>Part #2 = " . $matches[2];