Regex assistance needed - php

I'm trying to build a regex in PHP to extract the first part of the sample strings below. Angle brackets denote the required part, square brackets denote the optional part, and there are three possible forms of input (the brackets are not included in the input).
<Rua Olavo Bilac>
<Rua Olavo Bilac>[ - de 123...]
<Rua Olavo Bilac>[ - até ...]
(beware that the required part may have dashes)
I've tried:
/(.*?)( - (de|até){1,1}.*)?/i (the first group should capture what I need, lazily)
I've also tried several modifications without luck. I'm probably confusing something here, especially the groups and the quantifiers. From what I understand:
The first group would match any character, lazily
The second group, optional because of the ? quantifier, would match \s-\s followed by one of the two words de or até exactly once, then any characters until the end of the line.
I ended up replacing preg_match_all with strpos and substr, testing for each possibility. It worked, but I need to understand where I went wrong with the regex approach.

You can use this regex (see demo):
^.*(?= *-(?!.*-))|^.*
How does it work?
We match two kinds of string, on either side of the |
On the left side, starting from the head of the string (anchored by the ^ assertion), the dot-star .* consumes characters up to a position where the lookahead (?= *-(?!.*-)) asserts that what follows is optional space characters ( *) and a dash (-) that is not followed (negative lookahead) by more characters and another dash.
On the right side of the |, we match anything.
This assumes that you are checking the strings line by line. If that is not the case, let us know.
Sample Code
$regex = "~^.*(?= *-(?!.*-))|^.*~";
if(preg_match($regex,$string,$m)) echo $m[0];
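To see where the original attempt goes wrong, here is a small sketch that is not part of the answer above. The asker's pattern is unanchored and its first group is lazy, so (.*?) is satisfied with zero characters, the optional group is skipped, and the overall match is the empty string at offset 0. Adding ^ and $ forces the lazy group to grow until either the optional suffix or the end of the line fits. The helper name extractStreet and the sample strings are assumptions made for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// The asker's original pattern: unanchored, so the lazy group may match nothing.
preg_match('/(.*?)( - (de|até){1,1}.*)?/i', 'Rua Olavo Bilac - de 123', $m);
var_dump($m[0]); // string(0) ""

// Hypothetical helper (the name is illustrative, not from the original post).
function extractStreet(string $line): string
{
    // ^ and $ make the lazy group expand until the optional suffix or the end fits.
    // The u modifier treats pattern and subject as UTF-8 (needed for "até" with /i).
    $pattern = '/^(.*?)(?: - (?:de|até).*)?$/iu';
    return preg_match($pattern, $line, $m) === 1 ? $m[1] : $line;
}

$samples = [
    'Rua Olavo Bilac',
    'Rua Olavo Bilac - de 123',
    'Rua Ana-Maria - até 50',
];
foreach ($samples as $line) {
    echo extractStreet($line), "\n";
}
```

Unlike the lookahead pattern above, which cuts at the last dash whatever follows it, this variant cuts only at the first " - de" or " - até", so dashes inside the street name are left alone.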

Related

How to check if string contains specific special characters or starting with a space? [duplicate]

I have the following requirements for validating an input field:
It should contain only letters, with spaces allowed between them.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using the following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem in your expression ^(?!\s*$) is that the lookahead fails only if there is nothing but whitespace up to the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead ==> ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think a lookaround is not needed for this task.
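As a quick check of the \p{L} pattern, here is a minimal PHP sketch (PHP is used to match the rest of this page; the question itself does not name a language). The function name is hypothetical. The u modifier makes \p{L} work on UTF-8 characters rather than single bytes, and \z replaces $ because $ in PCRE also matches just before a trailing newline.

```php
<?php
// Hypothetical helper: true when the input is letters separated by single spaces.
function isLettersAndSingleSpaces(string $input): bool
{
    // \p{L}+        one or more letters
    // (?: \p{L}+)*  then any number of " letters" groups
    // \z            absolute end of the string (no trailing newline allowed)
    return preg_match('/^\p{L}+(?: \p{L}+)*\z/u', $input) === 1;
}

var_dump(isLettersAndSingleSpaces('Olavo Bilac'));        // bool(true)
var_dump(isLettersAndSingleSpaces("Jos\u{00E9} Silva"));  // bool(true)
var_dump(isLettersAndSingleSpaces(' Olavo Bilac'));       // bool(false)
var_dump(isLettersAndSingleSpaces('Olavo Bilac '));       // bool(false)
var_dump(isLettersAndSingleSpaces('Olavo  Bilac'));       // bool(false)
var_dump(isLettersAndSingleSpaces('Olavo-Bilac'));        // bool(false)
```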
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
which includes horizontal tab (09), line feed (10), vertical tab (11), form feed (12), carriage return (13), and space (32).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses a negative lookahead and a negative lookbehind to disallow spaces at the beginning or at the end of the string, and the anchors require the entire string to match.
I think the problem is the lookahead: the ? in (?!\s*$) is part of the lookahead syntax, not an optional quantifier, and (?!\s*$) only rejects input that is whitespace all the way to the end.
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
That is: at least one letter, then an optional run of letters and whitespace that must always end with a letter.
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word besides the first must have a space before it.
Hope I helped.

PHP Pattern Validation

I'm having a bit of trouble getting my pattern to validate the string entry correctly. The PHP portion of this assignment is working correctly, so I won't include it here, to keep this easier to read. Can someone tell me why this pattern isn't doing what I want?
This pattern has these validation requirements:
Should first have 3-6 lowercase letters
This is immediately followed by either a hyphen or a space
Followed by 1-3 digits
$codecheck = '/^([[:lower:]]{3,6}-)|([[:lower:]]{3,6} ?)\d{1,3}$/';
Currently this catches most of the requirements, but it only seems to validate the minimum character requirements - and doesn't return false when more than 6 or 3 characters (respectively) are entered.
Thanks in advance for any assistance!
The problem here lies in how you group the alternatives. Right now, the regex matches a string that
^([[:lower:]]{3,6}-) - starts with 3-6 lowercase letters followed by a hyphen
| - or
([[:lower:]]{3,6} ?)\d{1,3}$ - ends with 3-6 lowercase letters followed by an optional space and then 1-3 digits.
In fact, you can get rid of the alternation altogether:
$codecheck = '/^\p{Ll}{3,6}[- ]\d{1,3}$/';
See the regex demo
Explanation:
^ - start of string
\p{Ll}{3,6} - 3-6 lowercase letters
[- ] - a positive character class matching one character, either a hyphen or a space
\d{1,3} - 1-3 digits
$ - end of string
You need to delimit the scope of the | operator in the middle of your regex.
As it is now:
the right-side argument of that OR runs up until the very end of your regex, even including the $. So neither the digits nor the end-of-string condition apply to the left side of the |.
the left-side argument of the OR starts with ^, and only applies to the left side.
That is why you get a match when you supply 7 lowercase characters. The first character is ignored, and the rest matches with the right-side of the regex pattern.
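The difference between the two groupings can be checked with a short script. This is only a sketch: the sample strings are made up for illustration, and the two patterns are the ones quoted in this thread.

```php
<?php
// Pattern from the question: the | splits the whole regex into two halves.
$original = '/^([[:lower:]]{3,6}-)|([[:lower:]]{3,6} ?)\d{1,3}$/';
// Pattern from the first answer: one character class instead of an alternation.
$fixed = '/^\p{Ll}{3,6}[- ]\d{1,3}$/';

// Illustrative inputs (assumptions, not from the original post).
$samples = ['abc-123', 'abcdef 1', 'abcdefg 12', 'abc-1234', 'abc12'];

foreach ($samples as $s) {
    printf(
        "%-12s original=%d fixed=%d\n",
        $s,
        preg_match($original, $s),
        preg_match($fixed, $s)
    );
}
```

The original pattern accepts all five samples, while the fixed one accepts only the first two: "abcdefg 12" has too many letters, "abc-1234" has too many digits, and "abc12" has no separator.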

Regular expression to match a white space character before the actual pattern - PHP

I have this string:
#Harry #Harry are great twins with #Harry
I want to match #Harry in the above string, but on one condition:
#Harry should be preceded by either nothing or a whitespace character only, and the same goes for what follows it.
So 3 matches must be found here. Currently I am doing this, but in vain:
(\s|^$)\#Harry(\s|^$).
^$ matches an empty string (start of string immediately followed by end of string), not "start or end".
[\s means more than a whitespace (source) so you should just use a whitespace :)] <= Unnecessary
So, without taking the whitespace as part of the match:
"/(?<=^|\s)#Harry(?=\s|$)/"
Edit:
Source for the lookarounds.
Also, if you want to match the whitespaces, remove those lookarounds.
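A short sketch comparing the two variants on the string from the question. The second pattern, which consumes the whitespace instead of only looking at it, is an illustrative assumption about what "removing the lookarounds" would look like.

```php
<?php
$text = '#Harry #Harry are great twins with #Harry';

// Lookarounds inspect the neighbouring characters without consuming them.
$withLookarounds = preg_match_all('/(?<=^|\s)#Harry(?=\s|$)/', $text, $a);

// Consuming the whitespace: the space after the first tag is used up by the
// first match, so the second tag no longer has a separator in front of it.
$consuming = preg_match_all('/(?:\s|^)#Harry(?:\s|$)/', $text, $b);

var_dump($withLookarounds); // int(3)
var_dump($consuming);       // int(2)
```

So on input where two tags share a single space, only the lookaround version finds every occurrence.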

Cut off string at not allowed tags in regex

I have this regex, which works fine with PHP's preg_match_all, to match a string containing 0 to x lines before and 0 to y lines after a specific word in a sentence/string:
'(?:[^\.?!<]*[\.?!]+){0,x}(?:[^\.?!]*)'.$word.'(?:[^\.?!]*)(?:[\.?!]+[^\.?!]*){0,y}'.'(?:[\.?!]+)'
Now, I want the string to be cut off when specific tags occur. So I was thinking about implementing this part in this string above:
(?:(<\/?(?!'.$allowed_tags.')))
in which $allowed_tags is a php variable that could look like this for example: '(frame|head|span|script)'
Despite trying to get this to work with lookahead, lookbehind and other conditions I can't get it working properly and I unfortunately have to admit this is way beyond my programming skills.
Hopefully someone can help me with this? I am sure someone among you geniuses can :)
Thanks a lot in advance!
Example input-output:
For example I would like to grab this part:
<p>Tradition, Expansion, Exile.<br/>Individual paths in Chinese contemporary art </p><p>The contemporary <i>art world</i> craves for novelty: the best reason for Chinese art to be so trendy is also the <strong>worst one</strong>.</p>
from this complete string:
<div readability="120"><p>Tradition, Expansion, Exile.<br/>Individual paths in Chinese contemporary art </p><p>The contemporary <i>art world</i> craves for novelty: the best reason for Chinese art to be so trendy is also the <strong>worst one</strong>.</p><div>
That means in this example <p></p><i></i><strong></strong> <br/> are allowed tags and <div > and </div> aren't.
Assuming you define div and span tags as “illegal” as per your comment, the following regex will match x sentences before and y sentences after the sentence containing $word, as long as those sentences do not contain the “illegal” tags:
'(?:(?<=[.!?]|^)(?:(?<!<div|<\/div|<span|<\/span)>|[^>.!?])+[.!?]+){0,x}[^.!?]*'.$word.'[^.!?]*[.!?]+(?:(?:<(?!\/?div|\/?span)|[^<.!?])*[.!?]+){0,y}'
Split up and explained (quotes and string concatenation operator removed, comments and line breaks added for better reading):
// 0 TO X LEADING SENTENCES
(?: ---------------------------------// do not create a capture group
  (?<=[.!?]|^) ----------------------// match only after sentence end or start of string
  (?: -------------------------------// do not create a capture group
    (?<!<div|<\/div|<span|<\/span)> -// match “>” only if not preceded by span or div tags
    |[^>.!?] ------------------------// or any other non-punctuation character
  )+ --------------------------------// one or more times
  [.!?]+ ----------------------------// followed by one or more punctuation characters
){0,x} ------------------------------// the whole sentence repeated 0 to x times
// MIDDLE SENTENCE WITH KEYWORD
[^.!?]* -----------------------------// match 0 or more non-punctuation characters
$word -------------------------------// match string value of $word
[^.!?]* -----------------------------// match 0 or more non-punctuation characters
[.!?]+ ------------------------------// followed by one or more punctuation characters
// 0 TO Y TRAILING SENTENCES
(?: ---------------------------------// do not create a capture group
  (?: -------------------------------// do not create a capture group
    <(?!\/?div|\/?span) -------------// match “<” not followed by a “div” or “span” tag
    |[^<.!?] ------------------------// or any non-punctuation character that is not “<”
  )* --------------------------------// zero or more times
  [.!?]+ ----------------------------// followed by one or more punctuation characters
){0,y} ------------------------------// the whole sentence repeated 0 to y times
Note the lookbehind assertion used for matching sentences before $word will only match opening and closing tags without attributes, and has to match both the opening and closing tag variants literally, as lookbehind assertions cannot be of variable length. There are other limitations and gotchas:
notably that the regex will return an “illegal” tag if it is located inside the sentence containing $word
and that “inside” a sentence literally means “following the closing punctuation of the preceding sentence”, which, although formally correct, might not be what is expected.
All of this goes to highlight the limitations of a regex-based approach to the problem. In this light, you might think that switching to a more programmatic approach (like parsing all sentences into an array irrespective of tags, then scanning for “illegal” tags and trimming or rejecting the array accordingly, which would allow for a more flexible tag matching regex) would work better, and you would be right, were it not for the underlying difficulty of matching a natural language construct like a sentence with a regex with any degree of accuracy. I’ll leave you to ponder what the “sentence splitting” regex used in this question and answer would do to the following:
“T.J. Hooker was plaid (sic.) by W. Shatner of Starship Enterprise (!) fame”
It’s not pretty. And neither is the result.
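For anyone who wants to experiment with the pattern from this answer, here is a hedged sketch of how it could be assembled in PHP. The helper name, its parameters, the ~ delimiter and the sample text are all assumptions made for illustration; the x and y placeholders become integer arguments, and preg_quote keeps special characters in the search word from being read as regex syntax.

```php
<?php
// Hypothetical helper: builds the answer's pattern for a given word and sentence counts.
function buildSentencePattern(string $word, int $before, int $after): string
{
    // 0 to $before leading sentences that contain no div/span tags.
    $leading = '(?:(?<=[.!?]|^)(?:(?<!<div|<\/div|<span|<\/span)>|[^>.!?])+[.!?]+){0,' . $before . '}';
    // The sentence containing the (escaped) word.
    $middle = '[^.!?]*' . preg_quote($word, '~') . '[^.!?]*[.!?]+';
    // 0 to $after trailing sentences that contain no div/span tags.
    $trailing = '(?:(?:<(?!\/?div|\/?span)|[^<.!?])*[.!?]+){0,' . $after . '}';

    return '~' . $leading . $middle . $trailing . '~';
}

// Illustrative input: three plain sentences without any tags.
$text = 'Foo bar. The word here. Baz qux.';

if (preg_match(buildSentencePattern('word', 1, 1), $text, $m)) {
    echo $m[0], "\n"; // Foo bar. The word here. Baz qux.
}
```

With one sentence allowed on each side, the match spans all three sentences of the sample. The limitations listed above still apply unchanged.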

Matching ugly extra abbreviations and numbers in titles with PHP regex

I have to create regex to match ugly abbreviations and numbers. These can be one of following "formats":
1) [any alphabet char length of 1 char][0-9]
2) [double][whitespace][2-3 length of any alphabet char]
I tried to match double:
preg_match("/^-?(?:\d+|\d*\.\d+)$/", $source, $matches);
But I couldn't get it to select the following example: 1.1 AA My test title. What is wrong with my regex, and how can I add the other formats to it too?
Your regex says: start of string, followed by an optional -, followed by either at least one digit or zero or more digits, a dot and at least one digit, followed by the end of the string.
So your regex could match, for example, 4.5, -.1 etc. This is exactly what you tell it to do.
Your test input string does not match because other characters follow the number 1.1, and even if it somehow magically matched, your "double" matching regex is wrong.
For a double without scientific notation you usually use this regex :
[-+]?\b[0-9]+(\.[0-9]+)?\b
Now that we have this out of our way we need a whitespace \s and
[2-3 length of alphabet]
Now I have no idea what [2-3 length of alphabet] means but by combining the above you get a regex like this :
[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]
You can also place anchors ^$ if you want the string to match entirely :
^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]$
Feel free to ask if you are stuck! :)
I see multiple issues with your regex:
You try to match the whole string (as a number) by the anchors: ^ at the beginning and $ at the end. If you don't want that, remove those.
The number group is non-capturing. It will be checked for matches, but those won't be added to $matches. That's because of the ?: you put at the start of (?:...). Remove ?: to make that group capturing.
You place the shorter digit-pattern before the longer one. If you swap the order, the regex engine will look for it first and on success prefer it over the shorter one.
Maybe this already solves your issue:
preg_match("/-?(\d*\.\d+|\d+)/", $source, $matches);
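Putting both formats from the question into one pattern could look like the sketch below. It rests on two assumptions that the question does not confirm: that "2-3 length of any alphabet char" means two or three ASCII letters, and that the prefix always sits at the start of the title. The function name is hypothetical.

```php
<?php
// Hypothetical helper: returns the leading "ugly" prefix of a title, or null if none.
function leadingCode(string $title): ?string
{
    // Format 1: one letter followed by one digit, e.g. "A1".
    // Format 2: a number (integer or decimal), one whitespace, then 2-3 letters, e.g. "1.1 AA".
    // \b makes sure the prefix is not just the start of a longer word.
    $pattern = '/^(?:[A-Za-z][0-9]|-?(?:\d*\.\d+|\d+)\s[A-Za-z]{2,3})\b/';

    return preg_match($pattern, $title, $m) === 1 ? $m[0] : null;
}

var_dump(leadingCode('1.1 AA My test title')); // string(6) "1.1 AA"
var_dump(leadingCode('A1 Another title'));     // string(2) "A1"
var_dump(leadingCode('My test title'));        // NULL
```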
