Parsing from:x; but not lfrom:x; - php

I am trying to parse a string with something like :
preg_match( "|from:(.*?);|", $string, $match);
But then I found that the string can also contain lfrom: and _from:
A few examples of how the string can be:
var1:34234;from:website1.com;lfrom:website2.com;var2:343423;
lfrom:website1.com;var1:4234234;from:website2.com
from:website1.com;_from:website2.com;lfrom:website2.com;var1:43523;
How can I parse only from:(.*?); and not lfrom, _from, etc.?

I was going to just give you the solution, but it is better to explain the lookbehind assertion.
In a regex, each time you match a character (an h, for example), the engine's position pointer advances by one. Here you don't want to consume anything in front of from; you only want to check whether from is preceded by a ;, by whitespace, or by the start of the string. You also can't just match "nothing", because there is an empty string at every position!
So, an example: (?<=a)b matches a b that has an a right before it. When the engine finds a b, it looks behind it, and if there is an a there, the regex matches (and only the b is part of the match).
So (?<=[;\s]|^)from:(\w+\.\w+) matches a from: that has ; or whitespace right before it, OR that sits at the start of the string (^). (Inside a character class, \b would mean a backspace character, not a word boundary, so it does not belong there.)
DEMO
Pretty easy, huh!?

You could either use an assertion:
|(?<!l)from:(.*?);|
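Or look for the preceding ; or line start (with a different delimiter, because an unescaped | inside the pattern would otherwise end it early):
/(;|^)from:(.*?);/m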
It might also be a good idea to replace the generic .*? match with [^;]*

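Assuming from is preceded by whitespace, a ;, or the start of the string:
/(?:^|[\s;])from:([^;]+);/
This will only match from when it is preceded by whitespace or a ;, or when it starts the string. (A \b inside a character class means backspace, not a word boundary, so it does not help there.) I also prefer to narrow captures, i.e. [^;]+ vs. .*?.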

There is a concept called (negative) lookbehind, which asserts that your current position is (not) preceded by certain things. In this case I would go with a positive lookbehind and assert that from is preceded by the start of the string, a line break, or a ;:
preg_match('/(?<=^|;)from:(.*?);/m', $string, $match);
Make sure to use multi-line mode m, so that ^ will also match at the start of each line and not just at the start of the string. Also note that | cannot be the delimiter here, because the pattern itself uses | for alternation.
If you only wanted to exclude l and _ in front of from but accept any other characters, then a negative lookbehind might be what you are looking for:
preg_match('|(?<![l_])from:(.*?);|m', $string, $match);
The convenient thing about lookbehinds is, that they are not included in the actual match. They just check what's there without actually consuming it. Here is some reading.
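As a rough check of the negative-lookbehind approach, here is a minimal sketch that runs it against the three sample strings from the question. The helper name extractFrom is hypothetical, and the pattern deliberately drops the trailing ; (using [^;]* instead), which is an assumption made because the second sample ends without a semicolon.

```php
<?php
// Hypothetical helper: returns the value of the bare "from:" field, or null.
function extractFrom(string $s): ?string
{
    // (?<![l_]) rejects "lfrom:" and "_from:" without consuming anything.
    // [^;]* stops at the next ";" or at the end of the string.
    if (preg_match('/(?<![l_])from:([^;]*)/', $s, $m)) {
        return $m[1];
    }
    return null;
}

$samples = [
    'var1:34234;from:website1.com;lfrom:website2.com;var2:343423;',
    'lfrom:website1.com;var1:4234234;from:website2.com',
    'from:website1.com;_from:website2.com;lfrom:website2.com;var1:43523;',
];

foreach ($samples as $s) {
    echo extractFrom($s), "\n";
}
// Prints website1.com, website2.com, website1.com (one per line).
```

If other prefixes besides l and _ can occur, the character class in the lookbehind would need to be widened accordingly.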

Related

Regex match section within string

I have a string foo-foo-AB1234-foo-AB12345678. The string can be in any format, is there a way of matching only the following pattern letter,letter,digits 3-5 ?
I have the following implementation:
preg_match_all('/[A-Za-z]{2}[0-9]{3,6}/', $string, $matches);
Unfortunately this finds a match on AB1234 AND AB12345678 which has more than 6 digits. I only wish to find a match on AB1234 in this instance.
I tried:
preg_match_all('/^[A-Za-z]{2}[0-9]{3,6}$/', $string, $matches);
You will notice ^ and $ to mark the beginning and end, but this only applies to the string, not the section, therefore no match is found.
I understand why the code is behaving like it is. It makes logical sense. I can't figure out the solution though.
You must be looking for word boundaries \b:
\b\p{L}{2}\p{N}{3,5}\b
See demo
Note that \p{L} matches a Unicode letter, and \p{N} matches a Unicode number.
You can as well use your modified regex \b[a-zA-Z]{2}[0-9]{3,5}\b. Note that using anchors makes your regex match only at the beginning of a string (with ^) or/and at the end of the string (with $).
In case you have underscored words (like foo-foo_AB1234_foo_AB12345678_string), you will need a slight modification:
(?<=\b|_)\p{L}{2}\p{N}{3,5}(?=\b|_)
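You have to end your regular expression with a pattern for a non-digit. In Java this would be \D, and it is the same in PHP. Note that \D consumes a character, so it cannot match at the very end of the string; a negative lookahead such as (?!\d) avoids that.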
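To see the effect of the word boundaries, here is a small sketch using the string from the question and the ASCII variant of the pattern. The variable names $loose and $strict are only illustrative.

```php
<?php
$string = 'foo-foo-AB1234-foo-AB12345678';

// Without boundaries, the pattern also matches the front of the longer token.
preg_match_all('/[A-Za-z]{2}[0-9]{3,5}/', $string, $loose);

// With \b on both sides, the token must be delimited by non-word characters
// (here the hyphens) or by the start/end of the string.
preg_match_all('/\b[A-Za-z]{2}[0-9]{3,5}\b/', $string, $strict);

echo implode(',', $loose[0]), "\n";  // AB1234,AB12345
echo implode(',', $strict[0]), "\n"; // AB1234
```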

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.
A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should see if you can discover what characters are valid in a Reddit name or in a Reddit username and modify the [a-zA-Z0-9_-] charset accordingly.
You are looking for a regular expression.
A basic pattern starts out as a fixed string: /u/ or /r/ would match those exactly. This can be simplified to match one or the other with /(?:u|r)/. Next you want to match everything from that point up to a space. You would use a negated character class, [^ ], which matches any character that is not a space, and apply a quantifier, *, to match as many such characters as possible: /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space so you're not matching part of a longer path, which results in (?<= )/(?:u|r)/[^ ]*. Keep in mind that this lookbehind also rejects a match at the very start of the string. Wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This captures the matched text so the replacement pattern can reuse it. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In PHP you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.
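Putting those pieces together, a minimal sketch might look like the following. It uses (?<!\S) instead of (?<= ) so that a match at the very start of the string is also accepted, and it restricts names to [A-Za-z0-9_-]; both choices are assumptions for illustration, not Reddit's actual naming rules.

```php
<?php
$string = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';

// (?<!\S) : the token must start the string or follow whitespace.
// Group 1 : the whole "/u/name" or "/r/name" token, reused twice as $1.
$linked = preg_replace(
    '#(?<!\S)(/[ur]/[A-Za-z0-9_-]+)#',
    '[$1](http://reddit.com$1)',
    $string
);

echo $linked, "\n";
```

preg_replace replaces every match in one call, so no manual walk over the string is needed.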
In my opinion regex would be overkill for such a simple operation. If you just want to replace instances of "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you can use str_replace, although preg_replace will need less code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace. You can also use the regular expression generator at http://www.jslab.dk/tools.regex.php if you have something complex to capture in the string.

Php lookahead assertion at the end of the regex

I want to write a regex with assertions to extract the number 55 from string unknownstring/55.1, here is my regex
$str = 'unknownstring/55.1';
preg_match('/(?<=\/)\d+(?=\.1)$/', $str, $match);
So, basically, I am trying to say: give me the number that comes after a slash and is followed by a dot and the number 1, after which there are no more characters. But the regex does not match. When I removed the $ sign from the end, it matched. But that condition is essential, as I need that to be the end of the string, because the unknownstring part can contain similar text, e.g. unknow/545.1nstring/55.1. Perhaps I could use preg_match_all and take the last match, but I want to understand why the first regex does not work and where my mistake is.
Thanks
Use anchor $ inside lookahead:
(?<=\/)\d+(?=\.1$)
RegEx Demo
You cannot use $ outside the positive lookahead because your number is NOT at the end of input and there is a \.1 following it.
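The difference can be seen by running both variants against the trickier input from the question. This is only a sketch; the variable names are illustrative.

```php
<?php
$str = 'unknow/545.1nstring/55.1';

// $ outside the lookahead: after \d+ the engine stands right before ".1".
// The lookahead succeeds, but the position is not the end of the string.
$outside = preg_match('/(?<=\/)\d+(?=\.1)$/', $str, $m1);

// $ inside the lookahead: ".1" itself must be followed by the end.
$inside = preg_match('/(?<=\/)\d+(?=\.1$)/', $str, $m2);

var_dump($outside); // int(0)
echo $m2[0], "\n";  // 55
```

Because a lookahead consumes nothing, anything that must come after the looked-ahead text (including the end-of-string anchor) has to be written inside it.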

What do these certain symbols/parts mean in preg_match?

I know a little about preg_match, however there are some that look rather complex and some that contain symbols that I don't entirely understand. For example:
On the first one - I can only assume this has something to do with an e-mail address and url, but what do things like [^/] and the ? mean?
preg_match('#^(?:http://)?([^/]+)#i', $variable);
.....
In the second one - what do things like the ^, {5} and $ mean?
preg_match("/^[A-Z]{5}[0-9]{4}[A-Z]{1}$/", $variable);
It's just these small things I'm not entirely sure on and a brief explanation would be much appreciated.
Here are the direct answers. I kept them short because they won't make sense without an understanding of regex. That understanding is best gained at http://www.regular-expressions.info/tools.html. I advise you to also try out the regex helper tools listed there, they allow you to experiment - see live capturing/matching as you edit the pattern, very helpful.
Simple parentheses ( ) around something makes it a group. Here you have (?=) which is an assertion, specifically a positive look ahead assertion. All it does is check whether what's inside actually exists forward from the current cursor position in the haystack. Still with me?
Example: foo(?=bar) matches foo only if followed by bar. bar is never matched, only foo is returned.
With this in mind, let's dissect your regex:
/^.*(?=.{4,})(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z]).*$/
Reads as:
^.* From Start, capture 0-many of any character
(?=.{4,}) if there are at least 4 of anything following this
(?=.*[0-9]) if there is: 0-many of any, ending with an integer following
(?=.*[a-z]) if there is: 0-many of any, ending with a lowercase letter following
(?=.*[A-Z]) if there is: 0-many of any, ending with an uppercase letter following
.*$ 0-many of anything preceding the End
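The breakdown above can be tried out with a small sketch like this one. The function name meetsPolicy and the sample inputs are made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern dissected above.
function meetsPolicy(string $candidate): bool
{
    // All four lookaheads are tested from the same position, so each one
    // checks an independent requirement without consuming any input.
    $pattern = '/^.*(?=.{4,})(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z]).*$/';
    return preg_match($pattern, $candidate) === 1;
}

var_dump(meetsPolicy('aB3d')); // bool(true)
var_dump(meetsPolicy('abcd')); // bool(false): no digit, no uppercase letter
var_dump(meetsPolicy('aB3'));  // bool(false): fewer than 4 characters
var_dump(meetsPolicy('ABC1')); // bool(false): no lowercase letter
```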
Although I am not a fan of just posting links, I think a regex tutorial would be too much. So check out this Regular Expression cheat sheet it will probably get you on your way if you already have a little understanding of what it does.
Also check out this for some explanations and more helpful links; http://coding.smashingmagazine.com/2009/06/01/essential-guide-to-regular-expressions-tools-tutorials-and-resources/
First one:
The # doesn't actually have anything to do with the content that is matched. Usually, you use / as the delimiter character in a regex. The downside is that you need to escape it every time you want to use it inside the pattern. So here, # is used as the delimiter.
[^/] is a character class. [/] would match only the / character, and ^ inverts this. [^/] matches any character except /.
Second one:
^ matches the beginning of the string, $ the end of the string. You can use this to enforce that the regex has to apply to the whole string you are matching on.
{5} is a quantifier. It is equivalent to {5,5} which is minimum 5, maximum 5, so it matches exactly 5 characters.
first one:
[^/] = everything but no slash
second one:
^ look from beginning of $variable
{5} exactly 5 occurrences of [A-Z]
$ look until end of $variable reached
combination of ^ and $ means that everything between that has to apply to $variable
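To make the explanations concrete, here is a short sketch that runs both patterns from the question. The input strings (an example.com URL and two made-up codes) are assumptions chosen only to show the behaviour.

```php
<?php
// Pattern 1: optional "http://" prefix, then capture everything up to the
// next "/" (that is, the host name).
preg_match('#^(?:http://)?([^/]+)#i', 'http://www.example.com/index.html', $m);
echo $m[1], "\n"; // www.example.com

// Pattern 2: exactly 5 uppercase letters, 4 digits, 1 uppercase letter,
// and nothing before or after them.
$pattern = '/^[A-Z]{5}[0-9]{4}[A-Z]{1}$/';
$valid   = preg_match($pattern, 'ABCDE1234F');
$tooLong = preg_match($pattern, 'ABCDE1234FG');

var_dump($valid);   // int(1)
var_dump($tooLong); // int(0)
```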

Regular Expression get part of string

How can I get only the text inside "()"
For example from "(en) English" I want only the "en".
I've written this pattern "/\(.[a-z]+\)/i" but it also gets the "()";
Thanks in advance.
<?php
$string = '(en) English';
preg_match('#\((.*?)\)#is', $string, $matches);
echo $matches[1]; # en
?>
$matches[0] will contain the entire matched string; $matches[1] will contain the first group, in this case (.*?) between ( and ).
First, what is the dot in your regex good for? I assume it's there by mistake.
Second, to give you an alternative to the capturing group answer (which is perfectly fine!), here is a solution using lookbehind and lookahead.
(?<=\()[a-z]+(?=\))
See it here on Regexr
The trick here is, those lookarounds do not match the characters inside, they just check if they are there. So those characters are not included in the result.
(?<=\() positive look behind assertion, checking for the character ( before its position
(?=\)) positive look ahead assertion, checking for the character ) ahead of its position
That should do the job.
"/\(([a-z]+)\)/i"
The easiest way is to get "/\(([a-z]+)\)/i" and use the capture group to get what you want.
Otherwise, you have to get into look ahead, look behinds
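For comparison, here is a minimal sketch of both approaches side by side on the string from the question. The variable names are illustrative.

```php
<?php
$string = '(en) English';

// Capture group: the parentheses are part of the overall match,
// but only group 1 is read.
preg_match('/\(([a-z]+)\)/i', $string, $byGroup);

// Lookarounds: the parentheses are checked but never become part of the match.
preg_match('/(?<=\()[a-z]+(?=\))/i', $string, $byLookaround);

echo $byGroup[0], "\n";      // (en)
echo $byGroup[1], "\n";      // en
echo $byLookaround[0], "\n"; // en
```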
You could use a capture group like everyone else proposes
OR
you can make your pattern only check that the match is preceded by "(" and followed by ")". This is called lookbehind and lookahead.
"/(?<=\()[a-z]+(?=\))/i"
