How can I extract https://domain.com/gamer?hid=.115f12756a8641 from the string below, i.e. the value of url?
rrth:'http://www.google.co',cctp:'323',url:'https://domain.com/gamer?hid=.115f12756a8641',rrth:'https://another.com'
P.S.: I am new to regular expressions and still learning. The string above seems to be consistently formatted, so there must be some sort of shortcut.
If your input string is called $str:
preg_match('/url:\'(.*?)\'/', $str, $matches);
$url = $matches[1];
(.*?) captures everything between url:' and ' and can later be retrieved with $matches[1].
The ? is particularly important. It makes the repetition ungreedy, otherwise it would consume everything until the very last '.
If your actual input string contains multiple url:'...' sections, use preg_match_all instead. $matches[1] will then be an array of all required values.
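A minimal sketch of the preg_match_all variant, assuming a made-up input in which url:'...' occurs twice (the second URL is invented purely for illustration):

```php
<?php
// Hypothetical input: the second url:'...' entry is invented for this example.
$str = "rrth:'http://www.google.co',url:'https://domain.com/gamer?hid=.115f12756a8641',"
     . "cctp:'323',url:'https://another.com/page'";

// Same pattern as above; preg_match_all collects every match instead of only the first.
preg_match_all('/url:\'(.*?)\'/', $str, $matches);

// $matches[1] holds the text captured by (.*?) for each match, in order of appearance.
$urls = $matches[1];
print_r($urls);
```

$matches[0] would contain the full matches including the url:' prefix and the closing quote, which is why the captured group at index 1 is used here.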
Simple regex:
preg_match('/url\s*\:\s*\'([^\']+)/i',$theString,$match);
echo $match[1];//should be the url
How it works:
/url\s*\:\s*: matches url + [any number of spaces] + : (colon) + [any number of spaces]. But we don't need this part in the result; that's where the second part comes in.
\'([^\']+)/i: matches ', then the brackets (()) create a group that will be stored separately in the $matches array. What will be matched is [^']+: any character except the apostrophe (the [] create a character class, the ^ means: exclude these chars). So this class will match any character up to the point where it reaches the closing/delimiting apostrophe.
/i: in case the string might contain URL:'http://www.foo.bar', I've added that i, which is the case-insensitive flag.
That's about it. Perhaps you could sniff around here to get a better understanding of regexes.
Note: I've had to escape the single quotes because the PHP string holding the pattern is delimited by single quotes: "/url\s*\:\s*'([^']+)/i" works just as well. If you don't know whether you'll be dealing with single or double quotes, you could replace the quotes with another char class:
preg_match('/url\s*\:\s*[\'"]([^\'"]+)/i',$string,$match);
Obviously, in that scenario, you'll have to escape the delimiters you've used for the pattern string...
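To see the quote-agnostic pattern in action, here is a small sketch. extractUrl() is a hypothetical helper name, and the double-quoted sample input is an assumption based on the URL:'http://www.foo.bar' case mentioned above:

```php
<?php
// extractUrl() is a hypothetical helper name used only for this illustration.
function extractUrl(string $input): ?string
{
    // [\'"] accepts either quote style; ([^\'"]+) captures up to the next quote of either kind.
    if (preg_match('/url\s*\:\s*[\'"]([^\'"]+)/i', $input, $match)) {
        return $match[1];
    }
    return null; // no url:... section found
}

var_dump(extractUrl("cctp:'323',url:'https://domain.com/gamer?hid=.115f12756a8641'"));
var_dump(extractUrl('URL : "http://www.foo.bar"'));
```

One caveat of this approach: because the capture stops at either quote character, a value that legitimately contains the other kind of quote would be cut short.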
Below is the REGEX which I am trying:
/((?<![\\\\])['"])((?:.(?!(?<![\\\\])\\1))*.?)\\1/
Here this is the text which I am giving
val1=""val2>"2022-11-16 10:19:20"
I need blank expressions like for val1 as well,
i.e. I need something like below in matches
""
2022-11-16 10:19:20
If I change the text to something like below, I am getting proper output
val2>"2022-11-16 10:19:20"val1=""
Can anyone please let me know where I am going wrong?
Use alternatives to match the two cases.
One alternative matches the pair of quotes, the other uses lookarounds to match the inside of two quotes.
""|(?<=")[^"]+(?=")
In your pattern, the part (?:.(?!(?<![\\])\1))* first matches any character and then asserts that what is to the right is not the group 1 value without a preceding escape \.
So in the string ""val2>" your pattern first matches " with the character class ["'] and then matches the next " with the dot. From the position after that, it is true that what is to the right is not the group 1 value without a preceding \, so matching continues, and that is why the match is ""val2>" instead of "".
If the second example string does give you the proper output, you could swap the order in the repeating part of the pattern (do the assertion first, then match the dot) and omit the optional character .?
Note that the backslash does not have to be in square brackets.
(?<!\\)(['"])((?:(?!(?<!\\)\1).)*+)\1
See the updated regex101 demo.
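As a rough check of the updated pattern in PHP, assuming the sample text from the question, group 2 can be collected with preg_match_all. Note the doubled escaping needed inside a PHP single-quoted string:

```php
<?php
$text = 'val1=""val2>"2022-11-16 10:19:20"';

// In a PHP single-quoted string, \\\\ becomes the regex \\ (a literal backslash)
// and \' becomes a plain apostrophe; \1 is passed through unchanged.
$pattern = '/(?<!\\\\)([\'"])((?:(?!(?<!\\\\)\1).)*+)\1/';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full quoted strings, $matches[2] only the text between the quotes.
$values = $matches[2];
var_export($values);
```

The empty pair "" shows up as an empty string in group 2, so the caller can tell it apart from "no match at all" by looking at the number of matches.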
I have the right function, just not finding the right regex pattern to remove (ID:999999) from the string. This ID value varies but is all numeric. I like to remove everything including the brackets.
$string = "This is the value I would like removed. (ID:17937)";
$string = preg_replace('#(ID:['0-9']?)#si', "", $string);
Regex is not my forte, and I need help with this one.
Try this:
$string = preg_replace('# \(ID:[0-9]+\)#si', "", $string);
You need to escape the parenthesis using backslashes \.
You shouldn't use quotes around the number range.
You should use + (one or more) instead of ? (zero or one).
You can add a space at the start, to avoid having a space at the end of the resulting string.
PHP accepts # as a pattern delimiter just as well as /, so the delimiter is not the problem. Parentheses, however, create a capture group, so you must escape them to match them literally.
You also don't need a capture group for preg_replace to work: /\(ID:[0-9]+\)/ is enough, although wrapping it in a group as in /(\(ID:[0-9]+\))/si works too.
Here are two options:
Code: (Demo)
$string = "This is the value I would like removed. (ID:17937)";
var_export(preg_replace('/ \(ID:\d+\)/',"",$string));
echo "\n\n";
var_export(strstr($string,' (ID:',true));
Output: (I used var_export() to show that the technique is "clean" and gives no trailing whitespaces)
'This is the value I would like removed.'
'This is the value I would like removed.'
Some points:
Regex is a better / more flexible solution if your ID substring can exist anywhere in the string.
Your regex pattern doesn't need a character class if you use the shorthand range character \d.
Regex generally speaking should only be used when standard string function will not suffice or when it is proven to be more efficient for a specific case.
If your ID substring always occurs at the end of the string, strstr() is an elegant/perfect function.
Both of my methods write a (space) before ID to make the output clean.
You don't need either s or i modifiers on your pattern, because s only matters if you use a . (dot) and your ID is probably always uppercase so you don't need a case-insensitive search.
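To illustrate the point about flexibility, here is a short sketch with a hypothetical input in which the ID sits in the middle of the string rather than at the end:

```php
<?php
// Hypothetical input with the ID in the middle rather than at the end.
$string = "Item (ID:42) has shipped.";

// The regex removes only the " (ID:42)" part, wherever it occurs.
$viaRegex = preg_replace('/ \(ID:\d+\)/', '', $string);

// strstr() with true as third argument returns everything before the first " (ID:",
// so the text after the ID is lost as well.
$viaStrstr = strstr($string, ' (ID:', true);

var_export($viaRegex);
echo "\n";
var_export($viaStrstr);
```

So strstr() is only a drop-in replacement when nothing of interest follows the ID.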
I'm having a difficulty with understanding how \G anchor works in PHP flavor of regular expressions.
I'm inclined to think (even though I may be wrong) that \G is used instead of ^ in situations when multiple matches of the same string are taking place.
Could someone please show an example of how \G should be used, and explain how and why it works?
UPDATE
\G forces the pattern to only return matches that are part of a continuous chain of matches. From the first match each subsequent match must be preceded by a match. If you break the chain the matches end.
<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
//Will output match 5 times: four from the leading run, plus the "match," inside "not-match,"
foreach ( $matches[1] as $match ) {
echo $match . '<br />';
}
echo '<br />';
$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
//Will only output match 4 times because at not-match the chain is broken
foreach ( $matches[1] as $match ) {
echo $match . '<br />';
}
?>
This is straight from the docs
The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are
\G
first matching position in subject
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the offset argument of
preg_match(). It differs from \A when the value of offset is non-zero.
http://www.php.net/manual/en/regexp.reference.escape.php
You will have to scroll down that page a bit but there it is.
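The difference between \A and \G that the docs describe only shows up with a non-zero offset. A minimal sketch, assuming the subject 'abcdef' and an offset of 3:

```php
<?php
$subject = 'abcdef';

// \A is tied to the real start of the subject, so it cannot match at offset 3.
$withA = preg_match('/\Adef/', $subject, $m, 0, 3);

// \G is tied to the offset, so "def" is found right at position 3.
$withG = preg_match('/\Gdef/', $subject, $m, 0, 3);

// Without an anchor the engine may scan forward from the offset ...
$unanchored = preg_match('/ef/', $subject, $m, 0, 3);

// ... whereas \G forbids scanning: "ef" does not start exactly at offset 3.
$anchored = preg_match('/\Gef/', $subject, $m, 0, 3);

var_dump($withA, $withG, $unanchored, $anchored);
```

The last two calls show the "no skipping" behaviour: with \G the match has to begin exactly where the search begins.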
There is a really good example in Ruby, and it works the same way in PHP.
How the Anchor \z and \G works in Ruby?
\G will match the match boundary, which is either the beginning of the string or the point where the last character of the previous match was consumed.
It is particularly useful when you need to do complex tokenization, while also making sure that the tokens are valid.
Example problem
Let us take the example of tokenizing this input:
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
Into these tokens (I use ~ to denote end of string):
input~
some input in quote~
more~
input~
'escaped quote'~
lots#_$of_fun~
' \ ~
crazy~
stuff~
The string consists of a mix of:
Single-quoted strings, which allow \ and ' to be escaped and in which spaces are preserved. An empty string can be specified with a single-quoted string.
OR unquoted strings, which consist of a sequence of non-white-space characters and do not contain \ or '.
A space between 2 unquoted strings delimits them. A space is not necessary as a delimiter in the other cases.
For the sake of simplicity, let us assume the input does not contain new line (in real case, you need to consider it). It will add to the complexity of the regex without demonstrating the point.
The RAW regex for singly quoted string is '(?:[^\\']|\\[\\'])*+'
And the RAW regex for unquoted string is [^\s'\\]++
You don't need to care too much about the two pieces of regex above, though.
The solution below with \G makes sure that when the engine fails to find any match, all characters from the beginning of the string to the position of the last match have been consumed. Since it cannot skip characters, the engine will stop matching when it fails to find a valid match for both token specifications, rather than grabbing random stuff in the rest of the string.
Construction
At the first step of construction, we can put together this regex:
\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
Or simply put (this is not regex - just to make it easier to read):
\G(Singly_quote_regex|Unquoted_regex)
This will match the first token only, since when it attempts matching for the 2nd time, the match stops at the space before 'some input....
We just need to add a bit to allow for 0 or more space, so that in the subsequent match, the space at the position left off by the last match is consumed:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
The regex above will now correctly identify the tokens, as seen here.
The regex can be further modified so that it returns the rest of the string when the engine fails to retrieve any valid token:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))
Since the alternation is tried in order from left to right, the last alternative ((?s).+$) will be matched if and only if the string ahead doesn't make up a valid single-quoted or unquoted token. This can be used to check for errors.
The first capturing group will contain the text inside single quoted string, which needs extra processing to turn into the desired text (it is not really relevant here, so I leave it as an exercise to the readers). The second capturing group will contain the unquoted string. And the third capturing group acts as an indicator that the input string is not valid.
Demo for the final regex
Conclusion
The above example demonstrates one scenario of using \G in tokenization. There can be other usages that I haven't come across.
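For PHP specifically, the final regex above could be wired up roughly like this. It is only a sketch: tokenize() and the shape of its return value are hypothetical choices, and PREG_UNMATCHED_AS_NULL assumes PHP 7.2 or later. It also includes the unescaping step that was left as an exercise, done here with strtr():

```php
<?php
// tokenize() is a hypothetical helper; the regex is the final one from the answer above.
function tokenize(string $input): array
{
    // Nowdoc keeps the pattern free of PHP-level escaping.
    $pattern = <<<'RE'
/\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))/
RE;

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

    $tokens = [];
    foreach ($matches as $m) {
        if (isset($m[3])) {
            // Group 3 matched: the remainder of the input is not a valid token.
            return ['tokens' => $tokens, 'invalid' => $m[3]];
        }
        if (isset($m[2])) {
            $tokens[] = $m[2]; // unquoted token, used as-is
        } else {
            // Quoted token: turn \\ into \ and \' into '.
            $tokens[] = strtr($m[1], ["\\\\" => "\\", "\\'" => "'"]);
        }
    }
    return ['tokens' => $tokens, 'invalid' => null];
}

// The sample input from the example problem, kept verbatim via nowdoc.
$input = <<<'TXT'
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
TXT;

$result = tokenize($input);
var_export($result);
```

Because every match has to start where the previous one ended, the first invalid character ends the chain and lands in group 3 instead of being silently skipped.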
I am trying to get a string in between single quotes from the following document.write('mystring') I have tried the following pattern \'[A-Za-z0-9]+\' but I want the string without the single quote marks. How can I do that?
preg_match("/'([a-z0-9]+)'/i", $str, $matches);
//string contents are now in `$matches[1]`
You may also want to use \w or even something else to do the capturing unless you are absolutely sure that the string you want to acquire is purely alphanumeric. This also assumes that only apostrophes are used and not quotes.
Capture the word in a specific group:
\'([A-Za-z0-9]+)\'
I have this anchor locating regex working pretty well:
$p = '%<a.*\s+name="(.*)"\s*>(?:.*)</a>%im';
It matches <a followed by zero or more of anything followed by a space and name="
It is grabbing the names even if a class or an id precedes the name in the anchor.
What I would like to add is the ability to match on name=' with a single quote (') as well since sooner or later someone will have done this.
Obviously I could just add a second regex written for this but it seems inelegant.
Anyone know how to add the single quote and just use one regex? Any other improvements or recommendations would be very welcome. I can use all the regex help I can get!
Thanks very much for reading,
function findAnchors($html) {
    $names = array();
    $p = '%<a.*\s+name="(.*)"\s*>(?:.*)</a>%im';
    preg_match_all($p, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $names[] = $m[1];
    }
    // Always return an array, even when nothing matched.
    return $names;
}
James' comment is actually a very popular, but wrong regex used for string matching. It's wrong because it doesn't allow for escaping of the string delimiter. Given that the string delimiter is ' or " the following regex works
$regex = '([\'"])(.*?)(.{0,2})(?<![^\\\]\\\)(\1)';
\1 is the starting delimiter, \2 is the contents (minus 2 characters), \3 is the last 2 characters and \4 is the ending delimiter. This regex allows for escaping of delimiters as long as the escape character is \ and the escape character hasn't itself been escaped, i.e.:
'Valid'
'Valid \' String'
'Invalid ' String'
'Invalid \\' String'
Try this:
/<a(?:\s+(?!name)[^"'>]+(?:"[^"]*"|'[^']*')?)*\s+name=("[^"]*"|'[^']*')\s*>/im
Here you just have to strip the surrounding quotes:
substr($match[1], 1, -1)
But using a real parser like DOMDocument would certainly be better than this regular expression approach.
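A minimal sketch of the DOMDocument route, assuming the dom extension is available. findAnchorNames() is a hypothetical name, not something from the question:

```php
<?php
// findAnchorNames() is a hypothetical helper; it assumes the dom extension is loaded.
function findAnchorNames(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // The parser has already dealt with quote style and attribute order.
        if ($anchor->hasAttribute('name')) {
            $names[] = $anchor->getAttribute('name');
        }
    }
    return $names;
}

$html = '<p><a class="x" name="top">Top</a> <a name=\'second\' id="s">2</a> <a href="#top">up</a></p>';
print_r(findAnchorNames($html));
```

With a parser, single quotes, double quotes and attributes before or after name all come for free, which is the main argument for it over a regex.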
Use [] to match character sets:
$p = "%<a.*\s+name=['\"](.*)['\"]\s*>(?:.*)</a>%im";
Your current solution won't match anchors with other attributes following 'name' (e.g. <a name="foo" id="foo">).
Try:
$regex = '%<a\s+\S*\s*name=["\']([^"\']+)["\']%i';
This will extract the contents of the 'name' attribute into capture group 1 (available as $matches[1] after preg_match).
The \s* will also allow for line breaks between attributes.
You don't need to finish off with the rest of the 'a' tag, as the negated character class [^"']+ stops at the first quote character.
Here's another approach:
$rgx='~<a(?:\s+(?>name()|\w+)=(?|"([^"]*)"|\'([^\']*)\'))+?\1~i';
I know this question is old, but when it resurfaced just now I thought up another use for the "empty capturing groups as checkboxes" idiom from the Cookbook. The first, non-capturing group handles the matching of all "name=value" pairs under the control of a reluctant plus (+?). If the attribute name is literally name, the empty group (()) matches nothing, then the backreference (\1) matches nothing again, breaking out of the loop. (The backreference succeeds because the group participated in the match, even though it didn't consume any characters.)
The attribute value is captured each time in group #2, overwriting whatever was captured on the previous iteration. (The branch-reset construct ((?|(...)|(...)) enables us to "re-use" group #2 to capture the value inside the quotes, whichever kind of quotes they were.) Since the loop quits after the name name comes up, the final captured value corresponds to that attribute.
See a demo on Ideone
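For a quick local check of the pattern above, something like the following should do. The sample tags are made up for illustration:

```php
<?php
$rgx = '~<a(?:\s+(?>name()|\w+)=(?|"([^"]*)"|\'([^\']*)\'))+?\1~i';

// Hypothetical sample tags: name after another attribute, and name in single quotes.
$tagA = '<a class="x" name=\'top\' id="t">';
$tagB = '<a id="t" name="bottom">';
$tagC = '<a href="#x">';

$foundA = preg_match($rgx, $tagA, $mA); // loop stops right after name='top'
$foundB = preg_match($rgx, $tagB, $mB);
$foundC = preg_match($rgx, $tagC, $mC); // no name attribute, so \1 never succeeds

// Group 2 holds the value of the last attribute matched, i.e. the name attribute.
var_dump($foundA, $mA[2] ?? null, $foundB, $mB[2] ?? null, $foundC);
```

In $tagA the id attribute after name is never consumed, because the reluctant +? tries the backreference \1 after each iteration and stops as soon as it succeeds.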