Regex to detect word abbreviations - php

I'm currently working on a CSV that has information about Portugal's administrative areas and postal codes, but the file doesn't follow any strict format, which means sometimes there are entire strings in uppercase, along with other issues.
The issue I want to solve is as follows: some areas have an abbreviation at the end of the name, related to its parent's administrative level, that I want to remove. As far as I can see, these are the rules:
Abbreviations don't take more than 3 characters in length (always 3 characters so far);
The first character may be any letter, case insensitive;
The last 2 characters are always consonants (e.g. Z, B, M, P, ..);
(edit) the abbreviations always occur as the last word in a string;
(edit 2) - The strings are always UTF-8
The purpose is to remove these abbreviations from the area names.

Sounds simple enough..
/\b[a-z][ZBMP]{2}\b/i
This would match any abbreviation as described. Add the remaining consonants to the second character class ([ZBMP]) to complete the match.
It would only match if it's not part of another word (That's the \b's job).
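As a rough sketch of how the rules from the question could be combined, here is the same idea in Python's re module (the pattern itself should carry over to PHP's preg_replace). The function name strip_abbreviation and the sample names are made up for illustration. The sketch assumes "y" counts as a consonant, that the first letter is unaccented, and it anchors the match to the end of the string because the abbreviation is always the last word.

```python
import re

# One letter followed by two consonants, as the last word of the string.
# Vowels (a, e, i, o, u) are excluded; treating "y" as a consonant is an assumption.
ABBREVIATION = re.compile(r"\s+[a-z][b-df-hj-np-tv-z]{2}$", re.IGNORECASE)


def strip_abbreviation(name: str) -> str:
    """Remove a trailing three-letter abbreviation, if one is present."""
    # rstrip() first, so trailing whitespace does not hide the last word.
    return ABBREVIATION.sub("", name.rstrip())


# Hypothetical sample values, not taken from the real CSV.
print(strip_abbreviation("ALVEGA ABT"))      # abbreviation removed
print(strip_abbreviation("Rio de Moinhos"))  # left unchanged
```

Requiring whitespace before the abbreviation means a name that consists of nothing but three letters is left alone. A genuine three-letter last word that happens to end in two consonants would still be stripped, so the results are worth spot-checking.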

Related

Preg_match is "ignoring" a capture group delimiter

We have thousands of structured filenames stored in our database, and unfortunately many hundreds have been manually altered to names that do not follow our naming convention. Using regex, I'm trying to match the correct file names in order to identify all the misnamed ones.
The files are all relative to a meeting agenda, and use the date, meeting type, Agenda Item#, and description in the name.
Our naming convention is yyyymmdd_aa[_bbb]_ccccc.pdf where:
yyyymmdd is a date (and may optionally use underscores such as yyyy_mm_dd)
aa is a 2-3 character Meeting Type code
bbb is an optional Agenda Item
ccccc is a freeform variable length description of the file (alphanumeric only)
Example filenames:
20200225_RM_agenda.pdf
20200225_RM_2_memo.pdf
20200225_SS1_3c_presenTATION.pdf
20200225_CA_4d_SiGnEd.pdf
20200225_RM_5_Order1234.pdf
2021_02_25_EV_Notice.pdf
The regex I'm using to match these files is below (regex demo):
/^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3})_?(.+)?.pdf/i
The Problem:
In general, it's working fine, BUT if the Agenda Number ("bbb") is NOT in the filename, the regex captures and returns the first 3 characters of the description. It seems to me that the 3rd capture group _([a-z0-9]{1,3})_ is saying 1-3 alphanumeric characters between underscores, but I don't know how to "force the underscore delimiters", or otherwise tell it that the group may not be there, and that it's now looking at the descriptive text. This can be seen in the demo code where the first and last filenames do not use an Agenda Number.
Any assistance is appreciated.
The optional quantifier ? applies only to the last thing before it, either a single character or a group. So the expression ([a-z0-9]{1,3})_? makes the underscore optional, but not the preceding group. The solution is to move the underscore inside the parentheses and make the whole group optional.
^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3}_)?(.+)?.pdf
Additionally, the [_]? can be simplified to just _?, file name periods should be escaped (otherwise they are a wildcard), and I personally like to name my groups using (?<name>) syntax. Putting that all together you get:
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_(?<agenda>[a-z0-9]{1,3}_)?(?<description>.+)?\.pdf$
Demo here: https://regex101.com/r/BUKCih/1
Updated:
I've made some updates based on the comments. I added $ to the end to force "end of filename", as @Chris Maurer said. This stops file.pdf.txt from getting through. I also made a sub-group and moved the name into that group, which keeps the trailing underscore out of the named group. I'm going to leave Chris's other comment about tightening the last matching group alone, although I do agree with it, and the OP might find a couple of non-conforming files if they use [a-z0-9]+ or similar. I don't remember off-hand if PHP supports POSIX classes, but if so [[:alnum:]] could be used too.
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_((?<agenda>[a-z0-9]{1,3})_)?(?<description>.+)?\.pdf$
Updated demo here: https://regex101.com/r/ebmxkF/1
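For anyone who wants to try the final pattern outside regex101, here is a minimal sketch in Python. Two details differ from the PHP version and are my own adaptations: Python spells named groups (?P<name>...), and the wrapper around the agenda group is made non-capturing. The function name parse_filename is hypothetical.

```python
import re

# Same structure as the updated pattern above, written with Python's
# (?P<name>...) syntax for named groups.
FILENAME = re.compile(
    r"^(?P<date>\d{4}_?\d{2}_?\d{2})"
    r"_(?P<meeting_type>\w{2,3})"
    r"_(?:(?P<agenda>[a-z0-9]{1,3})_)?"  # agenda item and its underscore are optional together
    r"(?P<description>.+)?"
    r"\.pdf$",
    re.IGNORECASE,
)


def parse_filename(filename):
    """Return the named parts of a conforming filename, or None if it does not match."""
    match = FILENAME.match(filename)
    return match.groupdict() if match else None


for name in ("20200225_RM_agenda.pdf", "20200225_SS1_3c_presenTATION.pdf"):
    print(name, parse_filename(name))
```

Filenames for which parse_filename returns None are the candidates for the "misnamed" list.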

reading double word for translation

We are designing a translation project from Sindhi to English. In the Sindhi (Pakistani/Indus) language there are many expressions made of two words, with a space between them, that carry a single meaning, like "to eat" in English: it is two words but has one meaning. I want to design a program that reads the first two words and searches for the pair in the database. If a meaning is found, it outputs it and reads the next two words. If no meaning is found, it reads the first word alone and looks that up; if a meaning is found, it then reads the next two words after that single word. For example, I want to do this:
this is simple sentence
I want to eat a mango.
I want to use PHP or Visual Basic .NET to break it up in this style:
I want
I
want to
want
to eat
to
eat mango
eat
mango
with this example all words are read both in single and double style.
I have some hints
use a loop: for (i = 0; i <= length of text; i++)
detect word separation wherever a space or punctuation mark is used
the code may be
str=substr(text, i, 1)
if str = " " or str = a punctuation mark (a space or punctuation mark is the word separator)
but remember that we first have to read the first two words, so keep reading until two spaces have been seen
then echo or print that double word.
reading a word may work like this
the word length (wrdlen) counts up along with the loop variable i and goes back to 0 once a word has been built from the characters
tillword = substr(text, i-wrdlen, wrdlen)
These are some hints; I'm stuck, so please help, anyone. With the help of the hints above, I need results in this form:
I want to eat a mango.
I want
I
want to
want
to eat
to
eat mango
eat
mango
You can think about this two-word idea in terms of any second language you know: a pair of words may carry a single meaning, and sometimes a single word is meaningless on its own, like "to" in English.
I am not sure if I correctly understand but from what I get you want to be able to translate on basis of a multiple word phrase instead of single words. This is kinda similar to what language parsers do while compiling or interpreting.
One simple way to implement this functionality would be to first break down the sentence into words. In python this can be done very simply with something like:
words = sentence.split(' ')
Now you can try parsing these words by looping through them and storing them in a queue. The trick is to remember what was entered and have defined rules.
Let me give you an example. Let's say your sentence is "to eat a mango"
The rules of your language translation are (assumed):
to eat - X
to drink - XZ
mango - Y
So you loop through the words and enter them into a queue. After performing this step your queue will have
mango
a
eat
to
You can then start popping out elements. The first element to pop out is 'to'. Now check whether any phrases start with 'to'; if so, store it in a string and go to the next element, which is 'eat'. Concatenate this with the original string, separated by a space. You get "to eat", which matches the rule "to eat" -> X, so translate it and return X.
Alternatively, if it hadn't matched, you would translate the original string "to", return it, then start a new string with the new element and continue.
Hope this helps.
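A minimal sketch of that lookup in Python, assuming the rules live in a plain dict (a stand-in for the database) and that words are separated by whitespace. It tries the two-word phrase first and falls back to the single word; words with no rule are passed through unchanged, which is my own choice. The function name translate is hypothetical, and only ASCII punctuation is stripped here.

```python
import string


def translate(sentence, rules):
    """Translate greedily: try a two-word phrase first, then a single word."""
    # Split on whitespace and drop surrounding ASCII punctuation from each word.
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]

    output = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in rules:
            output.append(rules[pair])  # the two words share one meaning
            i += 2
        else:
            # Fall back to the single word; keep it as-is if no rule exists.
            output.append(rules.get(words[i], words[i]))
            i += 1
    return output


# The assumed rules from the example above.
rules = {"to eat": "X", "to drink": "XZ", "mango": "Y"}
print(translate("I want to eat a mango.", rules))
```

With a real database, the two dict lookups would become queries, but the control flow stays the same.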

Regex for two latitudes and longitudes not working

I am using some data which gives paths for Google Maps either as a path or as a set of two latitudes and longitudes. I have stored both values as a BLOB in a MySQL database, but I need to detect the values which are not paths when they come out in the result. In an attempt to do this, I have saved them in the BLOB in the following format:
array(lat,lng+lat,lng)
I am using preg_match to find these results, but I haven't managed to get any to work. Here are the regex codes I have tried:
^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}[1-9\.\,\+]{1*}[\)]{1}^
^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}(\-?\d+(\.\d+)?),(\-?\d+(\.\d+)?)\+(\-?\d+(\.\d+)?),(\-?\d+(\.\d+)?)[\)]{1}^
Regex confuses me sometimes (as it is doing now). Can anyone help me out?
Edit:
The lat can be 2 digits followed by a decimal point and 8 more digits, and the lng can be 3 digits followed by a decimal point and 8 more digits. Both can be positive or negative.
Here are some example lat lngs:
51.51160000,-0.12766000
-53.36442000,132.27519000
51.50628000,0.12699000
-51.50628000,-0.12699000
So a full match would look like:
array(51.51160000,-0.12766000+-53.36442000,132.27519000)
Further Edit
I am using the preg_match() php function to match the regex.
Here are some pointers for writing regex:
If you have a single possibility for a character, for example, the a in array, you can indeed write it as [a]; however, you can also write it as just a.
If you are looking to match exactly one of something, you can indeed write it as a{1}, however, you can also write it as just a.
Applying this lots, your example of ^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}[1-9\.\,\+]{1*}[\)]{1}^ reduces to ^array\([1-9\.\,\+]{1*}\)^ - that's certainly an improvement!
Next, numbers may also include 0's, as well as 1-9. In fact, \d - any digit - is usually used instead of 1-9.
You are using ^ as the delimiter - usually that is /; I didn't recognize it at first. I'm not sure what you can use for the delimiter, so, just in case, I'll change it to the usual /. This makes the above regex /array\([\d\.\,\+]{1*}\)/.
To match one or more of a character or character set, use +, rather than {1*}. This makes your query /array\([\d\.\,\+]+\)/
Then, to collect the resulting numbers (assuming you want only the part between the brackets), put it in (non-escaped) brackets, thus: /array\(([\d\.\,\+]+)\)/ - you would then need to split them, first by +, then by ,. Alternatively, if there are exactly two lat,lng pairs, you might want: /array\(([\d\.]+),([\d\.]+)\+([\d\.]+),([\d\.]+)\)/ - this will return 4 values, one for each number; the additional stuff (+, ,) will already be removed, because it is not in (unescaped) brackets ().
Edit: If you want negative lats and longs (and why wouldn't you?) you will need \-? (a "literal -", rather than part of a range) in the appropriate places; the ? makes it optional (i.e. 0 or 1 dashes). For example, /array\((\-?[\d\.]+),(\-?[\d\.]+)\+(\-?[\d\.]+),(\-?[\d\.]+)\)/
You might also want to check out http://regexpal.com - you can put in a regex and a set of strings, and it will highlight what matches/doesn't match. You will need to exclude the delimiter / or ^.
Note that this is a little fast and loose; it would also match array(5,0+0,1...........). You can nail it down a little more, for example, by using (\-?\d*\.\d+)\) instead of (\-?[\d\.]+)\) for the numbers; that will match (0 or 1 literal -) followed by (0 or more digits) followed by (exactly one literal dot) followed by (1 or more digits).
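To see the tightened version in action, here is a small sketch in Python rather than PHP (the pattern itself should behave the same way with preg_match and / delimiters). It assumes the BLOB has already been read into a text string; the function name parse_points is made up for illustration.

```python
import re

# Optional minus sign, optional integer digits, one dot, at least one decimal digit.
NUMBER = r"(-?\d*\.\d+)"
PAIR_OF_POINTS = re.compile(rf"array\({NUMBER},{NUMBER}\+{NUMBER},{NUMBER}\)")


def parse_points(value):
    """Return ((lat1, lng1), (lat2, lng2)), or None if the value is not in that format."""
    match = PAIR_OF_POINTS.fullmatch(value)
    if match is None:
        return None
    lat1, lng1, lat2, lng2 = (float(group) for group in match.groups())
    return (lat1, lng1), (lat2, lng2)


print(parse_points("array(51.51160000,-0.12766000+-53.36442000,132.27519000)"))
print(parse_points("array(5,0+0,1...........)"))  # rejected by the stricter pattern
```

Using fullmatch means the whole value has to be in the array(...) format, so anything that returns None can be treated as a path.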
This is the regex I made:
array\((-*\d+\.\d+),(-*\d+\.\d+)\+(-*\d+\.\d+),(-*\d+\.\d+)\)
This also breaks the four numbers into groups so you can get the individual numbers.
You will note the repeated pattern of
(-*\d+\.\d+)
Explanation:
-* means 0 or more - signs (so the - sign is optional; -? would allow at most one)
\d+ means 1 or more digits
\. means a literal period (the decimal point)
\d+ means 1 or more digits
The whole thing is wrapped in brackets to make it a captured group.

Is there a regex symbol to match one, the other, or both (if possible)?

I want to highlight a group of words, they can appear single or in a row. I'd like them to be highlighted together if they appear one after the other, and if they don't, they should also be highlighted, like the normal behavior. For instance, if I want to highlight the words:
results as
And the subject is:
real time results: shows results as you type
I'd like the result to be:
real time results: shows <span class="highlighted"> results as </span> you type
The whitespaces are also a headache, because I tried using an or expression:
( results )|( as )
with whitespaces to prevent highlighting words like bass, crash, and so on. But since the whitespace after results is the same as the whitespace before as, the regexp ignores it and only highlights results.
It may be used to highlight many words, so combinations like
( (one) (two) )|( (two) (one) )|( one )|( two )
are not an option :(
Then I thought that there may be an operator that worked like | that could be use to match both if possible, else one, or the other.
Using spaces to ensure you match full words is the wrong approach. That's what word boundaries are for: \b matches a position between a word and a non-word character (where word characters usually are letters, digits and underscores). To match combinations of your desired words, you can simply put them all in an alternation (like you already do), and repeat as often as possible. Like so:
(?:\bresults\b\s*|\bas\b\s*)+
This assumes that you want to highlight the first and separate results in your example as well (which would satisfy your description of the problem).
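Here is a sketch of the same idea in Python, with one variation of my own: the whitespace is only matched between consecutive words, so the highlighted span does not swallow the space after the last word. The function name highlight is hypothetical, and the word list is escaped so it can come from user input.

```python
import re


def highlight(text, words):
    """Wrap each run of consecutive listed words in a single highlighted span."""
    alternation = "|".join(re.escape(word) for word in words)
    # One listed word, then any number of further listed words separated by whitespace.
    pattern = re.compile(rf"\b(?:{alternation})\b(?:\s+(?:{alternation})\b)*")
    return pattern.sub(
        lambda match: f'<span class="highlighted">{match.group(0)}</span>', text
    )


print(highlight("real time results: shows results as you type", ["results", "as"]))
```

Because of the \b anchors, words such as "bass" or "crash" are left alone even though they contain "as".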
Perhaps you do not need to match a string of words next to each other. Why not just apply your highlighting like so:
real time results: shows <span class="highlighted">results</span> <span class="highlighted">as</span> you type
The only real difference is that the space between the words is not highlighted, but it's a clean and easy compromise which will save you hours of work and doesn't seem to hurt the UX in the least (in my opinion).
In that case, you could just use alternation:
\b(results|as)\b
(\b being the word boundary anchor)
If you really don't like the space between words not being highlighted, you could write a jQuery function to find "highlighted" spans separated only by white space and then combine them (a "second stage" to achieve your UX design goals).
Update
(OK... so merging spans is actually kind of difficult via jQuery. See Find text between two tags/nodes)

How can I use regex to solve this?

I have two strings that I need to pull data out of but can't seem to get it working. I wish I knew regular expression but unfortunately I don't. I have read some beginner tutorials but I can't seem to find an expression that will do what I need.
Out of this first string delimited by the equal character, I need to skip the first 6 characters and grab the following 9 characters. After the equal character, I need to grab the first 4 characters which is a day and year. Lastly for this string, I need the remaining numbers which is a date in YYYYmmdd.
636014034657089=130719889904
The second string seems a little more difficult because the spaces between the characters differ but always seem to be delimited by at minimum, a single space. Sometimes, there are as many as 15 or 20 spaces separating the blocks of data.
Here are two different samples that show the space difference.
!!92519 C 01 M600200BLNBRN D55420090205M1O
!!95815 A M511195BRNBRN D62520070906 ":%/]Q2#0*&
The data that I need out of these last two strings are:
The zip code following the 2 exclamation marks.
The single letter 'M' following that. It always appears to be in a 13 character block
The 3 numbers after the single letter
The next 3 numbers which are the person's height
The following next 3 are the person's weight
The next 3 are eye color
The next block of 3 which are the person's hair color
The last block that I need data from:
I need to get the single letter which in the example appears to be a 'D'.
Skip the next 3 numbers
The last and remaining 8 numbers which is a date in YYYYmmdd
If someone could help me resolve this, I'd be very grateful.
For the first string you can use this regular expression:
^[0-9]{6}([0-9]{9})=([0-9]{4})([0-9]{4})([0-9]{2})([0-9]{2})$
Explanation:
^ Start of string/line
[0-9]{6} Match the first 6 digits
([0-9]{9}) Capture the next 9 digits
= Match an equals sign
([0-9]{4}) Capture the "day and year" (what format is this in?)
([0-9]{4}) Capture the year
([0-9]{2}) Capture the month
([0-9]{2}) Capture the day
$ End of string/line
For the second:
^!!([0-9]{5}) +.*? +M([0-9]{3})([0-9]{3})([A-Z]{3})([A-Z]{3}) +([A-Z])[0-9]{3}([0-9]{4})([0-9]{2})([0-9]{2})
Rubular
It works in a similar way to the first. You may need to adjust it slightly if your data is not exactly in the format that the regular expression expects. You might want to replace the .*? with something more precise but I'm not sure what because you haven't described the format of the parts you are not interested in.
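If it helps, here is a short sketch that runs both expressions against the sample lines from the question using Python's re module. The helper name extract is made up, and the sample lines are typed with single spaces; the " +" parts of the second pattern should cope with longer runs of spaces.

```python
import re

# First string: skip 6 digits, capture 9, then four groups after the equals sign.
FIRST = re.compile(r"^[0-9]{6}([0-9]{9})=([0-9]{4})([0-9]{4})([0-9]{2})([0-9]{2})$")

# Second string: zip code, the block starting with M, then the block with the date.
SECOND = re.compile(
    r"^!!([0-9]{5}) +.*? +M([0-9]{3})([0-9]{3})([A-Z]{3})([A-Z]{3})"
    r" +([A-Z])[0-9]{3}([0-9]{4})([0-9]{2})([0-9]{2})"
)


def extract(pattern, line):
    """Return the captured groups as a tuple, or None if the line does not match."""
    match = pattern.match(line)
    return match.groups() if match else None


print(extract(FIRST, "636014034657089=130719889904"))
print(extract(SECOND, "!!92519 C 01 M600200BLNBRN D55420090205M1O"))
```

A None result is a quick way to spot lines that are not in the expected format before adjusting the pattern.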
