Regex to remove 'groups of characters' with less than 3 characters - php

I am trying to remove any 'groups of characters' with less than 3 characters.
This is the source:
1.29 Cancels part plan C/5879 2030. in i i.r e9g6Pop Iatian Area ProcH 22.4.93 Suburban Lands n f 53dv 3 N014 3.5.98. PLAN or any from 01 53 under M R.5I B.L.1laY98 E35. P0 RT I 0 N S At Maroubrajuncti p /I .z. .0 / .L .I. .I
Setting word boundaries with a repetition between 1 and 3, e.g. /\b\w{1,3}\b/, does not work, as "C/5879" would become "/5879".
The desired output would be as follows:
1.29 Cancels part plan C/5879 2030. e9g6Pop Iatian Area ProcH 22.4.93 Suburban Lands 53dv N014 3.5.98. PLAN from under R.5I B.L.1laY98 E35. Maroubrajuncti
An alternative which could also work would be to create larger 'groups of characters' by joining 'groups of characters' with 2 or less characters delimited by a whitespace.
For example:
1.29 Cancels part plan C/5879 2030. inii.r e9g6Pop Iatian Area ProcH 22.4.93 Suburban Lands nf 53dv 3N014 3.5.98. PLAN orany from 0153 under MR.5I B.L.1laY98 E35. P0RTI0NS AtMaroubrajuncti p/I.z. .0/.L.I..I
I would be open to either solution to rescue me from Regex Hell.

Your definition of "words" is "whitespace-delimited", which differs from regex's definition of a word boundary (a transition between word and non-word characters), so use a lookahead:
\s+\S{1,3}(?=\s)
Note that the expression matches (consumes) the leading whitespace, so removing matches will not leave double spaces in the result.
When tested on regextester, the result is:
1.29 Cancels part plan C/5879 2030. e9g6Pop Iatian Area ProcH 22.4.93 Suburban Lands 53dv N014 3.5.98. PLAN from under R.5I B.L.1laY98 E35. Maroubrajuncti .I
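The trailing ".I" survives in the result above because the lookahead requires whitespace after the group, and nothing follows the last one. A minimal PHP sketch of the same idea, assuming it is acceptable to extend the lookahead with "|$" so a short group at the very end is removed as well (the function name removeShortGroups is made up for illustration):

```php
<?php
// Hypothetical helper; the pattern is the one from the answer, with "|$"
// added to the lookahead so a short group at the end of the string is
// removed too. A short group at the very start is NOT removed, because the
// pattern requires leading whitespace.
function removeShortGroups(string $text): string
{
    return preg_replace('/\s+\S{1,3}(?=\s|$)/', '', $text);
}

echo removeShortGroups('PLAN or any from 01 53 under M R.5I Maroubrajuncti p /I .I'), "\n";
// PLAN from under R.5I Maroubrajuncti
```

If the first group in the input can also be short, it would need separate handling, since there is no whitespace in front of it to match.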

regex to validate phone number

Help me write a regex for the conditions below:
The number can start with +.
The number can contain - or . but not ( ) or /.
The number can start with 0.
The number should contain at least 9 digits, excluding the extension and the leading +.
The maximum length of the phone number should not exceed 14, excluding the +.
If the string contains ex/ext/x, the extension that follows should not have more than 5 digits (normally 4).
The regex should accept the examples below:
0-1234-123456
+91-1234-56789012
+91-1234-56789012 x1234
+91-1234-56789012 ex1234
+91-1234-56789012 ext12345
+91-1234-56789012x1234
+91-1234-56789012ex1234
+91-1234-56789012ext12345
91-1234-56789012
91-1234-56789012 x1234
91-1234-56789012 ex1234
91-1234-56789012 ext12345
91-1234-56789012x1234
91-1234-56789012ex1234
91-1234-56789012ext12345
91123456789012
91123456789012 x1234
91123456789012 ex1234
91123456789012 ext12345
91123456789012x1234
91123456789012ex1234
91123456789012ext12345
91.1234.56789012
91.1234.56789012 x1234
91.1234.56789012 ex1234
91.12345.6789012 ext12345
91.12345.6789012x1234
91.12345.6789012ex1234
91.12345.6789012ext12345
1-234-567-8901
1-234-567-8901 x1234
1-234-567-8901 ext1234
1 234 567-8901
1.234.567.8901
12345678901
I found a few links online. One of them is
http://ericholmes.ca/php-phone-number-validation-revisited/
and on Stack Overflow:
A comprehensive regex for phone number validation
also
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
which does not work for many of the examples above.
^\+?(\d[.\- ]*){9,14}(e?xt?\d{1,5})?$
Explanation:
^ Asserts start of string
\+? Matches an optional plus
(\d[.\- ]*){9,14} Between 9 and 14 single digits, possibly separated by spaces, dots, or dashes.
(e?xt?\d{1,5})? Optionally an x, possibly preceded by an e or followed by a t. The letters are always followed by between 1 and 5 digits.
$ Asserts end of the string
This will do it, but regex behaviour depends on the language you are programming in, so if this doesn't work for you, reply with the language used. I've tested it in PHP 5.
Your condition 5 (max 14 chars in the phone no) appears to be in error, since several of your examples contain 16 characters if they include dots or hyphens. In any case, this does not check the overall length of the whole thing because of the other length checks it does; that would need a second regex or, better, a check of the string length beforehand (e.g. in PHP by calling strlen).
You might want to allow for a space in extension numbers, e.g. ext 1234; if so, add \s* in the appropriate place.
I hope this helps.
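To try the pattern against some of the examples from the question, a small PHP sketch could look like this (the wrapper name isValidPhone is an assumption, not something from the question):

```php
<?php
// Hypothetical helper wrapping the pattern from this answer.
function isValidPhone(string $number): bool
{
    $pattern = '/^\+?(\d[.\- ]*){9,14}(e?xt?\d{1,5})?$/';
    return preg_match($pattern, $number) === 1;
}

var_dump(isValidPhone('+91-1234-56789012 ext12345')); // bool(true)
var_dump(isValidPhone('1 234 567-8901'));             // bool(true)
var_dump(isValidPhone('12345678'));                   // bool(false): only 8 digits
var_dump(isValidPhone('(91)1234-56789012'));          // bool(false): parentheses
```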
^\+?\d[\d\-.\s]{8,15}\s?((ext|ex|x)\d{3,5})?$

PHP regex match multiple pieces

I am new to regex. I know the basics of pulling one substring out of a given string, but I am struggling to extract the multiple parts that I need. Could someone help me with this simple example, so I can work my way on from there? Take this string:
LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL
The parts in italics are the parts of the string that will change and bold will stay the same in every string. The parts I need out are:
1. The first team id, which in this case is LMJ. It always starts the string and is 3 uppercase letters, ^[A-Z]{3}?
2. The Neu part, which could be one of 3 strings (Neu, Off, Def), [Neu|Off|Def]?
3. The second team part, which always comes after the words "Zone -", [A-Z]{3}?
4. The numeric part of the string after the first #. This could be 1 or 2 digits, [0-9]{1,2}?
5. The third team part, same as 3 except that it appears after vs, [A-Z]{3}?
6. Same as 4, but the numeric part after the 2nd #, [0-9]{1,2}?
I would like to put that all together into one regex. Is that possible?
Everything inside square brackets is a so-called character class: it matches only a single character. So [Neu|Off|Def] means: exactly one of the characters N, e, u, |, O, f or D (repetitions are ignored).
What you want is a capture group with alternation: (Neu|Off|Def)
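The difference between the character class and the group can be checked with a few preg_match calls. This is only an illustrative sketch; the anchors ^ and $ are added here so that each test string has to match as a whole:

```php
<?php
// A character class matches exactly one character from the set.
var_dump(preg_match('/^[Neu|Off|Def]$/', 'N'));   // int(1)
var_dump(preg_match('/^[Neu|Off|Def]$/', '|'));   // int(1): "|" is just a member
var_dump(preg_match('/^[Neu|Off|Def]$/', 'Neu')); // int(0): three characters

// A group with alternation matches one of the whole words.
var_dump(preg_match('/^(Neu|Off|Def)$/', 'Neu')); // int(1)
var_dump(preg_match('/^(Neu|Off|Def)$/', 'N'));   // int(0)
```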
Putting it together:
^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$
(This assumes you're not interested in the "LEIGH" and "ONEIL" parts, and these are always in upper case letters)
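As a rough PHP sketch of how the combined pattern could be used with preg_match (the function name parsePlay is made up for this example):

```php
<?php
// Hypothetical helper: returns the six captured parts, or null if the
// line does not have the expected shape.
function parsePlay(string $line): ?array
{
    $pattern = '/^([A-Z]{3}) won (Neu|Off|Def)\. Zone - ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+ vs ([A-Z]{3}) #([0-9]{1,2}) [A-Z]+$/';
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    // $m[0] is the whole match; drop it and keep the captured groups.
    return array_slice($m, 1);
}

print_r(parsePlay('LMJ won Neu. Zone - KEN #55 LEIGH vs LMJ #63 ONEIL'));
// Parts in order: LMJ, Neu, KEN, 55, LMJ, 63
```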
The regex should be something like:
'/([A-Z]{3})\ won\ (Neu|Off|Def)\.\ Zone\ -\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)\ vs\ ([A-Z]{3})\ (\#[0-9]{1,2}\ \w+)/'
() are used for capturing the different parts.
This is not tested properly.

My regex for testing phone issue

I want to validate these phone number formats:
517123123
+48517123123
+48 517 123 123
(48)517123123
(48)517 123 123
517-123-123
48 517-123-123
48/517-123-123
48 517 123 123
I wrote this regex:
(\+?)+(((\(([0-9]+){2,2}\)))|(([0-9]+){2,2})?)+(\/?)+(\s?)+(([0-9]+){9,9}|([0-9]+){3,3}(\s|-){1,1}([0-9]+){3,3}(\s|-){1,1}([0-9]+){3,3})
The problem is that it makes big numbers like 8978978979878978967 valid. Where is my mistake?
Looking at just the end of the regex, I see something that you seem to be doing in multiple places:
([0-9]+){3,3}
The + says at least one repeat of [0-9], which makes 1111111111111 a perfectly valid match. You then limit it to exactly 3 of those matches, which can still be a very long number.
If you want exactly 3 digits, remove the +.
Maybe you are missing anchors. In any case, try my regex: ^(\+?(\(\d{2}\)|(\d{2})|(\d{2}[/ ])))?((\d{3} \d{3} \d{3})|(\d{3}-\d{3}-\d{3})|(\d{9}))$
At the moment I can't see what your regex is doing, there is too much superfluous stuff in it.
You have too many groups
You want to repeat optional characters!?
e.g.:
(\+?)+, you don't need a group around and you don't want to repeat that, so \+? is what you want here.
(\s?)+, do you want to say "0 or more whitespaces"? Then \s* is what you need.
When you write e.g. {9,9}, you can drop one of the numbers: {9} means the same.
You are nesting quantifiers; that's the place where you allow too many characters. You have multiple places where you do ([0-9]+){9,9}, which means "1 or more digits", repeated 9 times.
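The effect of the nested quantifier can be seen in a short PHP sketch. The simplified pattern in isValidNumber is not taken from any of the answers; it is one possible cleanup that follows the advice above (no repeated optional parts, no nested quantifiers, anchors added), and the function name is hypothetical:

```php
<?php
$long = '8978978979878978967';

// Nested quantifier: "[0-9]+" may swallow many digits per repetition.
var_dump(preg_match('/^([0-9]+){3,3}$/', $long)); // int(1)
// Plain counted repetition: exactly three digits.
var_dump(preg_match('/^[0-9]{3}$/', $long));      // int(0)

// Hypothetical simplified, anchored pattern for the formats in the question.
// "~" is used as the delimiter so "/" needs no escaping.
function isValidNumber(string $number): bool
{
    $pattern = '~^\+?(\(\d{2}\)|\d{2})?[/ ]?(\d{9}|\d{3}[ -]\d{3}[ -]\d{3})$~';
    return preg_match($pattern, $number) === 1;
}

var_dump(isValidNumber('48/517-123-123')); // bool(true)
var_dump(isValidNumber($long));            // bool(false)
```

Like the original, this sketch accepts mixed separators such as "517 123-123"; whether that is desirable depends on the requirements.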

Substring with dots

I am using the SUBSTRING function to retrieve an "excerpt" of a message body:
SELECT m.id, m.thread_id, m.user_id, SUBSTRING(m.body, 1, 100) AS body, m.sent_at
FROM message m;
What I would like to do is add 3 dots to the end of the substring, but only if the source string was more than my upper limit (100 characters), i.e. if substring had to cut off the string. If the source string is less than 100 characters then no need to add any dots to the end.
I am using PHP as my scripting language.
That can be done in the query, rather than PHP, using:
SELECT m.id, m.thread_id, m.user_id,
       CASE
         WHEN CHAR_LENGTH(m.body) > 100 THEN CONCAT(SUBSTRING(m.body, 1, 100), '...')
         ELSE m.body
       END AS body,
       m.sent_at
FROM message m;
The term for the three trailing dots is "ellipsis".
Ask for 101 characters. If you receive 101 characters, your source string is definitely longer than 100 characters. In that case, remove the last character in your scripting language of choice and add "...". This will relieve your DB somewhat.
Personally I would advise you to create a bit of a difference though. E.g. cut off at 90 characters if and only if you exceed 110 characters (by requesting 110 + 1 characters of course). Otherwise you will get the effect I notice with Slashdot sometimes: you have a "Read the rest of this comment" link, only to receive the final word of the story.
More or less, the user will be annoyed if the method of retrieving the rest of the story takes more space than the story itself.
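The "ask for 101 characters" approach could look roughly like this in PHP. This is a sketch under assumptions: the query is assumed to select SUBSTRING(m.body, 1, 101), the function name excerpt is made up, and strlen/substr count bytes, so multibyte text would need mb_strlen/mb_substr instead:

```php
<?php
// $fetched is assumed to hold the result of SUBSTRING(m.body, 1, 101),
// i.e. one character more than the display limit.
function excerpt(string $fetched, int $limit = 100): string
{
    if (strlen($fetched) > $limit) {
        // The source was longer than the limit: cut the extra character
        // and append the ellipsis.
        return substr($fetched, 0, $limit) . '...';
    }
    return $fetched;
}

echo excerpt('A short message'), "\n"; // A short message
```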

Regex problem: Can't match a variable length pattern

I have a problem with regex, using preg_match_all(), to match something of a variable length.
What I am trying to match is the traffic condition after the word 'Congestion'. What I came up with is this regex pattern:
Congestion\s*:\s*(?P<congestion>.*)
It would however, extract the first instance all the way to the end of the entire subject, since .* would match everything. But that's not what I want though, I would like it to match separately as 3 instances.
Now since the words behind Congestion could be of variable length, I can't really predict how many words and spaces are in between to come up with a stricter \w*\s*\w* match etc.
Any clues on how I can proceed from here?
Highway : Highway 26
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow from Smith St to Alice Springs St
Highway : Princes Highway
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection
Highway : Eastern Freeway
Datetime : 18-Oct-2010 05:19 PM
Congestion : Traffic is slow from Prince St to Queen St
EDIT FOR CLARITY
These very nicely formatted texts here, are actually received via a very poorly formatted html email. It contains random line breaks here and there eg "Congestion : Traffic\n is slow from Prince\nSt to Queen St".
So while processing the emails, I stripped off all the html codes and the random line breaks, and json_encode() them into one very long single-line string with no line break...
By default, PCRE treats the subject as a single line: ^ and $ match only at the very start and end of the string. The "m" (PCRE_MULTILINE) modifier changes that behaviour so that they also match at the start and end of each line. Then you can tell PHP to match only to the end of the line:
preg_match_all('/^Congestion\s*:\s*(?P<congestion>.*)$/m', $subject, $matches);
There are two things to notice: first, the pattern was modified to include line-begin (^) and line-end ($) markers. Secondly, the pattern now carries the m modifier. Note that this only helps if the subject still contains line breaks between the records.
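A small sketch of the multiline approach, assuming the records are separated by "\n" as in the nicely formatted listing in the question (the function name extractByLine is hypothetical):

```php
<?php
// Hypothetical helper: returns every congestion text, one per matching line.
function extractByLine(string $subject): array
{
    preg_match_all('/^Congestion\s*:\s*(?P<congestion>.*)$/m', $subject, $matches);
    return $matches['congestion'];
}

$subject = "Highway : Highway 26\n"
    . "Datetime : 18-Oct-2010 05:18 PM\n"
    . "Congestion : Traffic is slow from Smith St to Alice Springs St\n"
    . "Highway : Princes Highway\n"
    . "Datetime : 18-Oct-2010 05:18 PM\n"
    . "Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection\n";

print_r(extractByLine($subject));
```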
You can try a minimal match:
Congestion\s*:\s*(?P<congestion>.*?)
This would result in returning zero characters in the named group 'congestion' unless you could match something immediately after the congestion string.
So, this could be fixed if "Highway" always starts the traffic condition records:
Congestion\s*:\s*(?P<congestion>.*?)Highway\s*:
If this works (I have not checked it), then the first records are matched but the last record is not! This could be easily fixed by appending the text 'Highway :' at the end of the input string.
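For the single-line input described in the question's edit, a variation on this idea is sketched below. Instead of appending 'Highway :' to the input, it swaps in a lookahead that stops either before the next "Highway :" or at the end of the string; that substitution and the function name extractFromSingleLine are assumptions of this example, not part of the answer:

```php
<?php
// Hypothetical helper for a subject with no line breaks at all.
function extractFromSingleLine(string $subject): array
{
    // Lazy match, ended by a lookahead for the next record or end of string.
    $pattern = '/Congestion\s*:\s*(?P<congestion>.*?)(?=\s*Highway\s*:|$)/';
    preg_match_all($pattern, $subject, $matches);
    return $matches['congestion'];
}

$subject = 'Highway : Highway 26 Datetime : 18-Oct-2010 05:18 PM '
    . 'Congestion : Traffic is slow from Smith St to Alice Springs St '
    . 'Highway : Eastern Freeway Datetime : 18-Oct-2010 05:19 PM '
    . 'Congestion : Traffic is slow from Prince St to Queen St';

print_r(extractFromSingleLine($subject));
```

This relies on "Highway :" never appearing inside a congestion text, which holds for the sample data but is not guaranteed.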
Congestion\s*:\s*Traffic is\s*(?P<c1>[^\n]*)\s*from\s*(?P<c2>[^\n]*)\s*to\s*(?P<c3>[^\n]*)$
