Is there a regex to find all the digit sequences (\d+) in text, but not the ones forming HTML entities? It looks like I should use both "look ahead" and "look behind" together, but I can’t figure out how.
For example, for the string &#10001; #555 foo 777; I want to match only 555 and 777, but not 10001.
I’ve tried
~(?<!(&#)|\d])\d+(?![\d|;])~
But it seems to be too strict, as it returns no matches for cases like 777;
You can probably use this regex with lookarounds:
(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)
Demo: http://www.rubular.com/r/IUGqDf7Nfg
I found the solution the next morning.
(?<![(&#)\d])\d+|\d+(?!\d|;)
It's quite long and hard to read, but it works.
P.S. I think it’s a lot easier just to decode/hide the entities before processing and then put them back.
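An alternative to combining lookbehind and lookahead, sketched here in PHP on the assumption that the pattern runs on PCRE (as the preg_* functions do): consume each complete numeric entity first and throw it away with the backtracking verbs (*SKIP)(*FAIL), so only the remaining digit runs are reported. The variable names are illustrative, not from the original posts.

```php
<?php
// Sample text from the question: one numeric entity plus two plain numbers.
$text = '&#10001; #555 foo 777;';

// First alternative swallows a whole entity and then forces a failure;
// (*SKIP) stops the engine from retrying inside the entity.
// Second alternative matches whatever digit runs are left.
preg_match_all('/&#\d+;(*SKIP)(*FAIL)|\d+/', $text, $matches);

print_r($matches[0]); // the digit runs that are not part of an entity
```

This keeps the "what to ignore" part and the "what to match" part in separate alternatives, which tends to be easier to read than nested lookarounds.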
I'm trying to find some words (or expressions, such as two words together) in a string, but only where they are not inside the anchor text of a link (the string contains HTML and is usually UTF-8 encoded). The plan is to replace those words with links afterwards.
I'm not very good with regex. I've searched the web and Stack Overflow and found two regex patterns that help, but each of them has an issue. I'm hoping someone can help me combine the two into one that works.
First pattern: /('.$tag.')(?![^<]*<\/a>)/is
This pattern finds the words, but if, for example, I'm trying to find "express" in the string:
In computing, a regular expression provides a concise and flexible means...
...I don't expect to find a match, yet one is found inside the word "expression".
Second pattern: \'(?!((<.*?)|(<a.*?)))(\b'.$tag.'\b)(?!(([^<>]*?)>)|([^>]*?</a>))\'is
This pattern doesn't have the previous issue, but if the word or expression I'm trying to find ends with a non-ASCII UTF-8 character, I don't get a match.
Example word: apă
Example string: ...care transformă umiditatea din aer în apă potabilă. Dacă iniţial a fost creată pentru situaţia ţărilor...
Assuming the second regular expression works for you (I haven't tested it, and I really don't think you should use regexes for this kind of task), all you need to do is add the u modifier, as @hakre said:
\'(?!((<.*?)|(<a.*?)))(\b'.$tag.'\b)(?!(([^<>]*?)>)|([^>]*?</a>))\'isu
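A minimal sketch of what the u modifier changes, assuming a PHP build where u also makes \b and \w Unicode-aware (current PHP versions do this). It uses a simplified pattern: word boundaries around the escaped term plus the lookahead from the first pattern in the question, not the full second pattern. The sample text and variable names are made up for illustration.

```php
<?php
// Hypothetical search term and sample HTML; one occurrence sits inside a link.
$tag  = 'apă';
$text = 'Bea <a href="/apa">apă</a> sau apă potabilă.';

// preg_quote() escapes regex metacharacters in the term.
// \b...\b rejects partial words; the lookahead rejects text followed by </a>
// with no other tag in between; u treats pattern and subject as UTF-8.
$pattern = '/\b' . preg_quote($tag, '/') . '\b(?![^<]*<\/a>)/iu';

$count = preg_match_all($pattern, $text, $matches);
```

Only the occurrence outside the anchor is counted here. For arbitrary HTML, a DOM-based approach remains the more robust choice.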
Personally, I'd use DOMDocument for this task.
I'm trying to learn regular expressions, because I can't do without them.
So, this is a list of different dimension patterns (for products for sale):
40x30x75
46x38x23-27
Ø30H30
Ø25-18H27
So, what pattern should I use to match each kind of dimension?
For example, I'm currently using this to match patterns like 40x30x75, but it doesn't work:
if(preg_match("#^[0-9][x][0-9][x][0-9]#", $dimension))
    echo "ok";
Could you help me?
Try the following regex:
(^[0-9]+x[0-9]+x[0-9]+$)|(^[0-9]+x[0-9]+x[0-9]+-[0-9]+$)|(^Ø[0-9]+H[0-9]+$)|(^Ø[0-9]+-[0-9]+H[0-9]+$)
So:
if (preg_match("/(^[0-9]+x[0-9]+x[0-9]+$)|(^[0-9]+x[0-9]+x[0-9]+-[0-9]+$)|(^Ø[0-9]+H[0-9]+$)|(^Ø[0-9]+-[0-9]+H[0-9]+$)/", $dimension))
echo "ok";
It probably can be simplified even more, maybe someone would want to have a go at that?
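One possible simplification, offered as a sketch and not as the definitive answer: fold the optional "-N" range into each family with an optional group, which leaves two alternatives in place of four. The u modifier is an assumption that the input is UTF-8, so that Ø is treated as a single character; the last sample is a made-up negative case.

```php
<?php
// Family 1: AxBxC with an optional "-D" range on the last number.
// Family 2: Ø A with an optional "-B" range, then H C.
$pattern = '/^(?:[0-9]+x[0-9]+x[0-9]+(?:-[0-9]+)?|Ø[0-9]+(?:-[0-9]+)?H[0-9]+)$/u';

$samples = ['40x30x75', '46x38x23-27', 'Ø30H30', 'Ø25-18H27', '40x30'];

$results = [];
foreach ($samples as $dimension) {
    // preg_match() returns 1 on a match, 0 otherwise.
    $results[$dimension] = preg_match($pattern, $dimension) === 1;
}

print_r($results);
```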
By the way, did you know about a website called RegExr? It lets you test your regular expressions, and it has been very useful to me whenever I work with regexes.
Your regex is missing quantifiers. Add a + sign after the character classes in question to signal that you're looking for one or more matches:
if (preg_match("#^[0-9]+x[0-9]+x[0-9]+#", $dimension))
    echo "ok";
By default, a character class matches exactly one character. A single literal character does not need a character class (although using one was not wrong); see the x's in the example above.
Your regex should be:
^[0-9]{2}x[0-9]{2}x[0-9]{2}$
[0-9] means a single character between 0 and 9. So you either need to write it twice or use a quantifier such as {2}. Instead of [0-9] you could also use \d, meaning any digit. So you could, for example, write:
^\d\dx\d\dx\d\d$
Tip: If you can't do without regular expressions, want to learn them and have an easier life, I can recommend RegexBuddy. I bought it when I was just getting started, and it has helped me a lot.
This will validate the first two formats (note that it also accepts a trailing hyphen with no digits after it, such as 40x30x75-):
^[0-9]+x[0-9]+x[0-9]+-?[0-9]*$
I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex just has to stop before any double symbol or punctuation in Unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
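As a sketch of how the empty strings could be dropped without a manual pass: preg_split() accepts the PREG_SPLIT_NO_EMPTY flag. The example below applies the same pattern to both inputs from the question; the u modifier is an assumption that the wikitext is UTF-8. Note that this pattern splits on a repeated identical symbol (such as "((" or "======"), which is what the answer above does.

```php
<?php
// A symbol or punctuation character followed by one or more copies of itself.
$pattern = '/([\pS\pP])\1+/u';

// PREG_SPLIT_NO_EMPTY removes the empty pieces at the start and end.
$first = preg_split(
    $pattern,
    '======hello((my first program)) world======',
    -1,
    PREG_SPLIT_NO_EMPTY
);

$second = preg_split(
    $pattern,
    '======hello(my first program)) world======',
    -1,
    PREG_SPLIT_NO_EMPTY
);

print_r($first);
print_r($second);
```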
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
Take the 2nd capture group whenever it is matched. Example: http://www.ideone.com/ErTVA
But since your lexer is going to use more than one regex anyway, couldn't you just consume ([\pS\pP])\\1+ and discard it, or, if that doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record it?
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
I found a regex pattern for PHP that does the exact OPPOSITE of what I'm needing, and I'm wondering how I can reverse it?
Let's say I have the following text: Item_154 ($12)
This pattern /\((.*?)\)/ gets what's inside the parentheses, but I need to get "Item_154" and cut out what's in parentheses, along with the space before the opening parenthesis.
Anybody know how I can do that?
Regex is above my head apparently...
/^([^( ]*)/
Match everything from the start of the string until the first space or (.
If the item you need to match can have spaces in it, and you only want to get rid of whitespace immediately before the parenthetical, then you can use this instead:
/^([^(]*?)\s*\(/
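A short sketch of both patterns in use. The second sample string, with a space inside the item name, is made up to show where the two patterns differ.

```php
<?php
// Pattern 1: everything up to the first space or "(".
preg_match('/^([^( ]*)/', 'Item_154 ($12)', $simple);

// Pattern 2: lazily capture up to the parenthetical, leaving out only the
// whitespace right before "(" so that inner spaces are kept.
preg_match('/^([^(]*?)\s*\(/', 'Blue Widget ($5)', $spaced);

var_dump($simple[1], $spaced[1]);
```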
The following will match anything that looks like text (...) but returns just the text part in the match.
\w+(?=\s*\([^)]*\))
Explanation:
The \w includes alphanumeric and underscore, with + saying match one or more.
The (?= ) group is positive lookahead, saying "confirm this exists but don't match it".
Then we have \s for whitespace, and * saying zero or more.
The \( and \) match literal ( and ) characters (since each is normally a special character).
The [^)] is anything non-) character, and again * is zero or more.
Hopefully all makes sense?
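To see the lookahead in action, here is a minimal PHP sketch. Because the parenthetical is only checked and never consumed, the whole match ($m[0]) is already the text part, and a string without a parenthetical does not match at all.

```php
<?php
$pattern = '/\w+(?=\s*\([^)]*\))/';

// The lookahead confirms " ($12)" follows but leaves it out of the match.
$withParens = preg_match($pattern, 'Item_154 ($12)', $m);

// No parenthetical, so the lookahead fails at every position.
$withoutParens = preg_match($pattern, 'Item_154', $none);

var_dump($m[0]);
```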
/(.*)\(.*\)/
Whatever is not inside the () will now be in your first capture group (including the space before the opening parenthesis) :)
One site that really helped me was http://gskinner.com/RegExr/
It'll let you build a regex and then paste in some sample targets/text to test it against, highlighting matches. All of the possible regex components are listed on the right with (essentially) a tooltip describing the function.
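<?php
$string = 'Item_154 ($12)';
// Lazy capture, then optional whitespace and the parenthetical, so the
// space before "(" stays out of the captured group.
$pattern = '/^(.*?)\s*\(.*?\)/';
preg_match($pattern, $string, $matches);
var_dump($matches[1]);
?>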
Should get you Item_154
The following regex works on your string when used for a replacement, if that helps:
\s*\(.*?\)
Here's an explanation of what it's doing...
Whitespace, any number of repetitions - \s*
Literal - \(
Any character, any number of repetitions, as few as possible - .*?
Literal - \)
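Used with preg_replace() and an empty replacement string, the pattern described above removes the parenthetical together with the whitespace in front of it. A minimal sketch:

```php
<?php
$text = 'Item_154 ($12)';

// Replace optional whitespace plus the shortest "(...)" run with nothing.
$clean = preg_replace('/\s*\(.*?\)/', '', $text);

var_dump($clean);
```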
I've found Expresso (http://www.ultrapico.com/) is the best way of learning/working out regular expressions.
HTH
Here is a one-shot to do the whole thing
$text = 'Item_154 ($12)';
$text = preg_replace('/([^\s]*)\s(\()[^)]*(\))/', '$1$2$3', $text);
var_dump($text);
//Outputs: Item_154()
Keep in mind that using any PCRE functions involves a fair amount of overhead, so if you are using something like this in a long loop and the text is simple, you could probably do something like this with substr/strpos and then concat the parens on to the end since you know that they should be empty anyway.
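A sketch of the substr/strpos route mentioned above. The function name is hypothetical, and unlike the one-shot above it returns only the item name without re-appending empty parentheses, which is what the question asked for.

```php
<?php
// Hypothetical helper: cut the string at the first "(" and trim the
// whitespace that preceded it. No regex engine involved.
function stripParenthetical(string $text): string
{
    $pos = strpos($text, '(');
    if ($pos === false) {
        // No parenthetical present; return the input unchanged.
        return $text;
    }
    return rtrim(substr($text, 0, $pos));
}

var_dump(stripParenthetical('Item_154 ($12)'));
```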
That said, if you are looking to learn regexes and be productive with them, I would suggest checking out: http://rexv.org
I've found the PCRE tool there to be very useful, though it can be quirky in certain ways. In particular, any examples that you work with there should only use single quotes if possible, as it doesn't work with double quotes correctly.
Also, to really get a grip on how to use regexes, I would check out Mastering Regular Expressions by Jeffrey Friedl, ISBN-13: 978-0596528126.
Since you are using PHP, I would try to get the 3rd Edition, since it has a section specifically on PHP PCRE. Just make sure to read the first 6 chapters first, since they give you the foundation needed to work with the material in that particular chapter. If you see the 2nd Edition on the cheap somewhere, that's pretty much the same core material, so it would be a good buy as well.
How can I match a sentence only if it contains none of {word1, word2, word3}?
Where must I put the ^ symbol?
I think it must look like this:
^([^word1|word2|word3])$
but it doesn't work.
Could you help? Thanks.
Regex isn't the best tool for testing these sorts of conditions, but if you must then you can do it with negative lookaheads:
^(?!.*word1)(?!.*word2)(?!.*word3).*$
What you are trying to do won't work because [^...] is a negated character class: an unordered set of individual characters, not a list of words. What you wrote is equivalent to:
^([^123dorw|])$
Note also that depending on your needs you might also want to include word-boundaries in your regular expression:
^(?!.*\bword1\b)(?!.*\bword2\b)(?!.*\bword3\b).*$
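A small PHP sketch of the word-boundary version. The sample sentences are invented; the third shows that a banned word embedded in a longer token does not cause a rejection once \b is in place.

```php
<?php
// Each negative lookahead scans the whole line from the start and fails
// the match if its word appears anywhere as a standalone token.
$pattern = '/^(?!.*\bword1\b)(?!.*\bword2\b)(?!.*\bword3\b).*$/';

$accepted = preg_match($pattern, 'a sentence without the banned terms') === 1;
$rejected = preg_match($pattern, 'this one mentions word2 in passing') === 1;
$partial  = preg_match($pattern, 'keyword1 is a different token') === 1;

var_dump($accepted, $rejected, $partial);
```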
I'm not familiar with the use of regex in htaccess, so my thoughts may be a bit high level.
What you are trying looks something like:
if sentence contains not word1 | not word2 | not word3
then do something
I would suggest approaching it this way instead:
if sentence contains word1|word2|word3
then do nothing
else do something
In other words, don't put the negation in the "query"; apply it to the result, which makes the regex simpler.
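In PHP, that inversion could be sketched as follows. The function name is hypothetical; the pattern is the plain alternation from the pseudocode, without word boundaries.

```php
<?php
// Hypothetical helper: look for any banned word with a plain alternation
// and negate the outcome in code instead of inside the pattern.
function containsNoBannedWord(string $sentence): bool
{
    return preg_match('/word1|word2|word3/', $sentence) === 0;
}

var_dump(containsNoBannedWord('nothing to see here'));
var_dump(containsNoBannedWord('this has word3 in it'));
```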