Regex parsing multi-language string with languages codes


I have multi-language strings formatted as follows:
[en]this is english [es]esto es español [fr] C'est française [it] Questo è italiano
The order of the languages is not always the same, and not all languages are always available.
I'm trying, with no success, to extract a specific language string. Language strings contain HTML, and any sort of special characters, spaces, newlines, tabs, etc.
Let's say I want to extract the English part; I need a regex able to match everything after the [en] part (new lines, carriage returns, special characters, tabs, etc.) until the start of the next language tag, i.e. two letters inside brackets: \[[a-z]{2}\]
This is not working: the French string is also returned, and if the Spanish string is in the last position nothing is returned.
/\[es\]((.|\n|\t|\r)*)(\[([a-z]{2})\])/u
I'm not able to write a regex for: "anything after [es] that is not two letters inside brackets or end of string"
Any help will be much appreciated!

Your real problem is greedy matching. There are a couple of ways to deal with that. Lazy matching:
/\[es\]((?:.|\n|\t|\r)*?)\[([a-z]{2})\]/u
And negative lookaheads:
/\[es\]((?:(?!\[([a-z]{2})\])(?:.|\n|\t|\r))*)/u
You see, quantifiers in a regex engine are greedy by default: they capture as many characters as possible and then backtrack until the rest of the pattern matches, which is commonly described as returning the largest match possible. You can use a lazy quantifier (any quantifier followed by a ?, so ??, *?, +?, etc.), which inverts the behaviour and captures as little as possible, slowly grabbing more until the pattern matches. You can also use a lookahead to ensure that the wildcard you're matching doesn't include your delimiter string.
You can also use the s modifier to force the . to match everything, including the newline character (it already matches the \t character).
/\[es\](.*?)\[([a-z]{2})\]/su
A word of caution to this tale: if Hercules fights, you will fail! If your string ever contains anything that looks like a language code but isn't, this regex will fail.
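As a rough sketch of how the lazy approach could be wrapped up in PHP: the helper name extractLanguage() is hypothetical, and the lookahead alternative \z is an addition that is not part of the patterns above. It is there so that the last language in the string, which has no following tag, is still returned. Trimming the result is also an assumption; drop it if leading and trailing whitespace matters.

```php
<?php
// Hypothetical helper: returns the text for one language code, or null if the tag is absent.
function extractLanguage(string $text, string $lang): ?string
{
    // Lazy .*? stops at the next two-letter tag or at the very end of the string (\z).
    // The s modifier lets the dot match newlines; u treats the subject as UTF-8.
    $pattern = '/\[' . preg_quote($lang, '/') . '\](.*?)(?=\[[a-z]{2}\]|\z)/su';

    if (preg_match($pattern, $text, $m) === 1) {
        return trim($m[1]);
    }

    return null;
}

$sample = "[en]this is english [es]esto es español [fr] C'est française [it] Questo è italiano";

$spanish = extractLanguage($sample, 'es'); // "esto es español"
$italian = extractLanguage($sample, 'it'); // "Questo è italiano" (last tag, no follower)
$german  = extractLanguage($sample, 'de'); // null, tag not present
```

The caveat above still applies: any two-letter bracketed token inside the content is treated as the start of the next language.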

FrankieTheKneeMan wrote a good explanation of the difference between greedy and lazy behaviour.
To take advantage of the greedy behaviour without backtracking (or with a very limited backtracking), you can use a negated character class:
/\[es]([^[]*)/u
(note that you don't need the s modifier, since you don't use the dot.)
However, the preceding pattern doesn't allow an opening square bracket inside the content you want to match. You can solve this problem by checking that each [ is not the beginning of a language tag:
/\[es]((?>[^[]+|\[(?![a-z]{2}]))*)/u
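A minimal sketch of that last pattern in use, assuming a made-up sample string with bracketed markup that is not a language tag:

```php
<?php
// Pattern from the answer: consume non-bracket runs, or a "[" that does not start a language tag.
$pattern = '/\[es]((?>[^[]+|\[(?![a-z]{2}]))*)/u';

// Hypothetical input: "[b]" and "[/b]" are not two-letter tags, so they stay in the content.
$text = '[en]hello [es]uno [b]dos[/b] [fr]trois';

preg_match($pattern, $text, $m);
$spanish = $m[1]; // "uno [b]dos[/b] "

// The pattern needs no following tag, so a language in last position also works.
preg_match($pattern, '[fr]trois [es]fin', $last);
$lastSpanish = $last[1]; // "fin"
```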

Related

Allow Parenthesis and forward slash to this regex

I have this regex
!preg_match("/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i", stripslashes($post['job_title']))
and I want to allow numbers, parentheses and also slashes in this regex, because some job titles can be "Front-end developer/designer" or "Recruitment Staff (HR)". How can I achieve this?

Okay, I managed to make a proper regex for this which allows slashes within but not at the START/END, and also allows parentheses within and at START/END.
!preg_match("/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i", stripslashes($post['job_title']))
Thanks to @anubhava, whose reply gave me an idea of how to add stuff to the regex.

I don't think your intention is being translated to the pattern.
/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i
/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i
In both the pattern in your question and the pattern in your answer, the multi-character (zero or more) match in the middle already contains every character allowed by the final character class. Note, though, that the final class sits inside the optional group, so it still forces any string longer than one character to end with one of those characters; the ? only lets a single-character string pass. If you do not need to restrict the last character, these simpler patterns are enough:
/^[a-z0-9][a-z0-9'. -]*$/i
~^[a-z0-9()][a-z0-9/()'. -]*$~i
If you mean to demand that the string ends in an alphanumeric or parenthetical character, keep the optional group as it is; removing the ? before the $ would additionally require at least two characters.
That said, if you want to ensure that:
hyphens, spaces, and dots only occur in the middle of the string and
all parentheses are properly opened and closed, contain characters between them, and do not occur at the start of the string
etc.
then the best strategy will be "test driven development". Create a large, diverse sample of realistic strings that should pass, as well as unrealistic strings that you know should fail. Then run your current pattern against all strings. Then analyze which cases do not evaluate as expected and adjust your pattern.
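One way such a test run might look, as a sketch only: the sample titles and their expected results are assumptions chosen for illustration, and the pattern is the one from the self-answer above, written with single quotes.

```php
<?php
// Pattern from the self-answer (same regex, single-quoted PHP string).
$pattern = '/^[a-z0-9()](?:[a-z0-9\/()\'. -]*[a-z0-9()])?$/i';

// Hypothetical sample: input string => whether it should be accepted.
$cases = [
    'Front-end developer/designer' => true,
    'Recruitment Staff (HR)'       => true,
    '/designer'                    => false, // slash at the start
    'designer/'                    => false, // slash at the end
    'Manager -'                    => false, // trailing hyphen
    ''                             => false, // empty string
];

$failures = [];
foreach ($cases as $input => $expected) {
    $actual = preg_match($pattern, $input) === 1;
    if ($actual !== $expected) {
        $failures[] = $input; // collect every case the pattern gets wrong
    }
}

echo $failures === [] ? "all cases pass\n" : "failing: " . implode(', ', $failures) . "\n";
```

Growing the $cases list whenever a surprising job title shows up keeps later pattern changes from silently breaking earlier behaviour.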

Understanding Regular Expressions

I am tired of being frightened of regular expressions. The topic of this post is limited to PHP implementation of regular expressions, however, any generic regular expression advice would obviously be appreciated (i.e. don't confuse me with scope that is not applicable to PHP).
The following (I believe) will remove any whitespace between numbers. Maybe there is a better way to do so, but I still want to understand what is going on.
$pat="/\b(\d+)\s+(?=\d+\b)/";
$sub="123 345";
$string=preg_replace($pat, "$1", $sub);
Going through the pattern, my interpretation is:
\b A word boundary
\d+ A subpattern of 1 or more digits
\s+ One or more whitespaces
(?=\d+\b) Lookahead assertion of one or more digit followed by a word boundary?
Putting it all together, search for any word boundary followed by one or more digits and then some whitespace, and then do some sort of lookahead assertion on it, and save the results in $1 so it can replace the pattern?
Questions:
Is my above interpretation correct?
What is that lookahead assertion all about?
What is the purpose of the leading / and trailing /?

Is my above interpretation correct?
Yes, your interpretation is correct.
What is that lookahead assertion all about?
That lookahead assertion is a way for you to match characters that are followed by a certain pattern, without that pattern becoming part of the match.
So basically, using the regex abcd(?=e) to match the string abcde will give you the match: abcd.
The reason that this matches is that the string abcde does in fact contain:
An a
Followed by a b
Followed by a c
Followed by a d that has an e after it (this is a single character!)
It is important to note that after the 4th item it also contains an actual "e" character, which we didn't match.
On the other hand, trying to match the string against the regex abcd(?=f) will fail, since the sequence:
"a", followed by "b", followed by "c", followed by "d that has an f after it"
is not found.
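The same behaviour can be checked directly in PHP. This is a small sketch that reuses the abcd(?=e) example and the pattern from the question; the variable names are arbitrary.

```php
<?php
// The lookahead requires an "e" after "abcd" but leaves it out of the match.
preg_match('/abcd(?=e)/', 'abcde', $m);
$matched = $m[0]; // "abcd"

// No "f" follows "abcd", so the whole match fails.
$found = preg_match('/abcd(?=f)/', 'abcde'); // 0

// Pattern from the question: the digits after the whitespace are only asserted,
// so just "123 " is replaced by "123" and "345" is left untouched.
$joined = preg_replace('/\b(\d+)\s+(?=\d+\b)/', '$1', '123 345'); // "123345"
```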
What is the purpose of the leading / and trailing /
Those are delimiters, and are used in PHP to distinguish the pattern part of your string from the modifier part of your string. A delimiter can be any character that is not alphanumeric, a backslash, or whitespace, although I prefer # signs myself. Remember that the character you are using as a delimiter needs to be escaped if it is used in your pattern.

It would be a good idea to watch this video, and the 4 that follow this:
http://blog.themeforest.net/screencasts/regular-expressions-for-dummies/
The rest of the series is found here:
http://blog.themeforest.net/?s=regex+for+dummies
A colleague sent me the series and after watching them all I was much more comfortable using Regular Expressions.
Another good idea would be installing RegexBuddy or Regexr. Especially RegexBuddy is very useful for understanding the workings of a regular expression.

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers.
I thought this would work:
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?

Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.

First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three, and the part after the slash matches five or more. A string such as 1234/567890 would therefore still pass.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
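A quick sketch comparing the two patterns in PHP. The variable names are arbitrary, and $loose is the original pattern with only the slash escaped so that it compiles.

```php
<?php
// Exactly three digits, a slash, exactly five digits, nothing else.
$strict = '/^\d{3}\/\d{5}$/';

// The original pattern with the slash escaped: each + still allows extra digits.
$loose = '/^[0-9]+[0-9]+[0-9]+\/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]/';

$ok        = preg_match($strict, '123/45678');   // 1
$tooLong   = preg_match($strict, '1234/567890'); // 0
$looseLong = preg_match($loose, '1234/567890');  // 1, which is the problem
```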

I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.

Why does this regex not validate in the same way in PHP?

When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.

The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
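To see the difference the anchors make, here is a short sketch with an arbitrary eight-character test string:

```php
<?php
$long = 'abcdefgh'; // 8 characters

// Unanchored: succeeds because the first five characters satisfy the pattern.
$unanchored = preg_match('/.{0,5}/', $long, $m); // 1
$piece = $m[0];                                  // "abcde"

// Anchored: the whole string has to fit, so this fails.
$anchored = preg_match('/^.{0,5}$/', $long);     // 0
```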
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
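The options above can be compared side by side. This is only a sketch, and the five-character test string with an embedded newline is made up for the purpose:

```php
<?php
$s = "ab\ncd"; // 5 characters, one of them a newline

// Default dot does not match the newline, so the anchored pattern fails.
$plainDot = preg_match('/^.{0,5}$/', $s);       // 0

// With the s modifier the dot matches the newline as well.
$dotAll = preg_match('/^.{0,5}$/s', $s);        // 1

// [\s\S] matches any character without needing a modifier.
$classBased = preg_match('/^[\s\S]{0,5}$/', $s); // 1

// The non-regex alternative: a plain length check.
$byLength = strlen($s) <= 5;                    // true
```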
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.

You need to anchor it with ^ and $. These symbols match the beginning and end of the string respectively, so there must be 0-5 characters between the beginning and end. Leaving out the anchors lets the pattern match anywhere in the string, so the string could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/

Can you rely on the order that regular expression syntax is interpreted?

(The background for this question is that I thought it would be fun to write something that parses wiki creole markup. Anyway the problem that I think I have a solution to is differentiating between // in a url and as opening/closing syntax for italic text)
My question is slightly compound so I've tried to break it up under the headings
If there is a substring (S1) that can contain any one of a series of substrings separated by |, does the regular expression interpreter simply match the first substring within 'S1' and then move on to the regular expression after 'S1'? Or will it in some instances try to find the best/greediest match?
Here is an example to try and make my question more clear:
String to search within: String
Regex: /(?:(Str|Strin).*)/ (the 'S1' in my question refers to the non-capturing group)
I think that the matches from the above should be:
$0 will be String
$1 will be Str and not Strin
Will this always happen, or are there instances (e.g. maybe 'S1' being matched greedily using *) where another matching substring will be used, i.e. Strin in my example?
If the above is correct, then can I/should I rely on this behaviour?
Real world example
/^\/\/(\b((https?|ftp):\/\/|mailto:)([^\s~]*?(?:~(.|$))?)+?(?=\/\/|\s|$)|~(.|$)|[^/]|\/([^/]|$))*\/\//
Should correctly match:
//Some text including a http//:url//
With $1 == Some text including a http//:url
Note: I've tried to make this relatively language agnostic but I will be using php

PHP uses the PCRE regex engine. The way PHP uses it, PCRE is a backtracking engine: it scans the subject from left to right and, at each position, tries the alternatives in the order they are written, returning the first match it finds. So yes, you can rely on the order in which PHP interprets a regex.
The other mode, provided by the pcre_dfa_exec() function of the PCRE library, evaluates all possible matches and returns the longest possible match. PHP's preg_* functions do not use it.

In PHP, using the preg extension, you can choose between greedy and non-greedy quantifiers (usually by appending ? to them).
By the way, in the example you gave, if you want Strin to match, you must swap your alternatives: /(?:(Strin|Str).*)/. I think you should put the most generic expression at the end of the alternation.
FYI, with preg engine,
alternation operator is neither greedy nor lazy but ordered
Mastering regular expressions, J. Friedl, p175
If you want an engine whose alternation picks the longest match, you must use a POSIX-compliant engine (ereg, but it is deprecated and was removed in PHP 7).
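The ordered behaviour of alternation is easy to confirm with the example from the question. A minimal sketch, with arbitrary variable names:

```php
<?php
// "Str" is listed first and lets the rest of the pattern succeed, so it wins.
preg_match('/(?:(Str|Strin).*)/', 'String', $first);
$whole      = $first[0]; // "String"
$firstGroup = $first[1]; // "Str"

// With the longer alternative listed first, it is tried first and captured.
preg_match('/(?:(Strin|Str).*)/', 'String', $second);
$secondGroup = $second[1]; // "Strin"
```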
