Regex Question: Matching this pattern with hard or soft quotes - php

I have this anchor locating regex working pretty well:
$p = '%<a.*\s+name="(.*)"\s*>(?:.*)</a>%im';
It matches <a followed by zero or more of anything followed by a space and name="
It is grabbing the names even if a class or an id precedes the name in the anchor.
What I would like to add is the ability to match on name=' with a single quote (') as well since sooner or later someone will have done this.
Obviously I could just add a second regex written for this but it seems inelegant.
Anyone know how to add the single quote and just use one regex? Any other improvements or recommendations would be very welcome. I can use all the regex help I can get!
Thanks very much for reading,
function findAnchors($html) {
    $names = array();
    $p = '%<a.*\s+name="(.*)"\s*>(?:.*)</a>%im';
    $t = preg_match_all($p, $html, $matches, PREG_SET_ORDER);
    if ($matches) {
        foreach ($matches as $m) {
            $names[] = $m[1];
        }
    }
    return $names; // always return an array, even when nothing matched
}

The regex in James' comment is very popular for string matching, but it is wrong: it doesn't allow for escaping of the string delimiter. Given that the string delimiter is ' or ", the following regex works:
$regex = '/([\'"])(.*?)(.{0,2})(?<![^\\\]\\\)(\1)/';
\1 is the starting delimiter, \2 is the contents (minus up to 2 characters), \3 is the last 2 characters, and \4 is the ending delimiter. This regex allows for escaping of delimiters as long as the escape character is \ and the escape character itself hasn't been escaped. For example:
'Valid'
'Valid \' String'
'Invalid ' String'
'Invalid \\' String'
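A rough way to check the four sample strings is sketched below. It treats a string as valid only when the first match covers the whole input. The / delimiters and the helper name isWholeQuotedString are assumptions made for illustration, not part of the answer.

```php
<?php
// The pattern from the answer, wrapped in / delimiters (assumed).
$regex = '/([\'"])(.*?)(.{0,2})(?<![^\\\]\\\)(\1)/';

// Hypothetical helper: true when the first quoted string found spans the whole input.
function isWholeQuotedString(string $regex, string $s): bool
{
    return preg_match($regex, $s, $m) === 1 && $m[0] === $s;
}

$samples = [
    "'Valid'",
    "'Valid \\' String'",     // the text 'Valid \' String'
    "'Invalid ' String'",
    "'Invalid \\\\' String'", // the text 'Invalid \\' String'
];

foreach ($samples as $s) {
    echo $s, ' => ', isWholeQuotedString($regex, $s) ? 'valid' : 'invalid', "\n";
}
```

For the two invalid samples the regex still finds a match, but it ends at the first unescaped quote, which is why the helper compares the match against the full input.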

Try this:
/<a(?:\s+(?!name)[^"'>]+(?:"[^"]*"|'[^']*')?)*\s+name=("[^"]*"|'[^']*')\s*>/im
Here you just have to strip the surrounding quotes:
substr($match[1], 1, -1)
But using a real parser like DOMDocument would certainly be better than this regular expression approach.
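To show how the pattern and the quote-stripping step fit together, here is a minimal sketch. The helper name anchorNames and the sample HTML are made up for illustration.

```php
<?php
// The pattern from the answer; the quotes are part of capture group 1.
$pattern = '/<a(?:\s+(?!name)[^"\'>]+(?:"[^"]*"|\'[^\']*\')?)*\s+name=("[^"]*"|\'[^\']*\')\s*>/im';

// Hypothetical helper: returns the name of every matching anchor, quotes stripped.
function anchorNames(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $names = [];
    foreach ($matches as $match) {
        $names[] = substr($match[1], 1, -1); // drop the surrounding quotes
    }
    return $names;
}

$html = '<a name="one">x</a> <a class="c" name=\'two\'>y</a> <a href="#">z</a>';
print_r(anchorNames($pattern, $html));
```

The third anchor has no name attribute, so it is simply skipped.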

Use [] to match character sets:
$p = "%<a.*\s+name=['\"](.*)['\"]\s*>(?:.*)</a>%im";

Your current solution won't correctly handle anchors with other attributes following 'name' (e.g. <a name="foo" id="foo">): the greedy (.*) runs past the closing quote of the name.
Try:
$regex = '%<a\s+\S*\s*name=["\']([^"\']+)["\']%i';
This will extract the contents of the 'name' attribute into the back reference $1.
The \s* will also allow for line breaks between attributes.
You don't need to finish off with the rest of the 'a' tag, as the negated character class [^"']+ cannot match a quote and so stops at the closing one.
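A quick sketch of this pattern in use, with the inner single quotes escaped so that it is a valid single-quoted PHP string. The function name namesFromAnchors and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper around the pattern from this answer.
function namesFromAnchors(string $html): array
{
    // Inner single quotes are escaped for the single-quoted PHP string.
    $regex = '%<a\s+\S*\s*name=["\']([^"\']+)["\']%i';
    preg_match_all($regex, $html, $matches);
    return $matches[1]; // contents of capture group 1 for every match
}

$html = '<a name="foo" id="foo">x</a> <a class="c" name=\'bar\'>y</a>';
print_r(namesFromAnchors($html));
```

Note that \S* only covers a single attribute written without spaces in front of name, so anchors with several preceding attributes would need a more permissive pattern.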

Here's another approach:
$rgx='~<a(?:\s+(?>name()|\w+)=(?|"([^"]*)"|\'([^\']*)\'))+?\1~i';
I know this question is old, but when it resurfaced just now I thought up another use for the "empty capturing groups as checkboxes" idiom from the Cookbook. The first, non-capturing group handles the matching of all "name=value" pairs under the control of a reluctant plus (+?). If the attribute name is literally name, the empty group (()) matches nothing, then the backreference (\1) matches nothing again, breaking out of the loop. (The backreference succeeds because the group participated in the match, even though it didn't consume any characters.)
The attribute value is captured each time in group #2, overwriting whatever was captured on the previous iteration. (The branch-reset construct ((?|(...)|(...)) enables us to "re-use" group #2 to capture the value inside the quotes, whichever kind of quotes they were.) Since the loop quits after the name name comes up, the final captured value corresponds to that attribute.
See a demo on Ideone
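A small sketch of how this pattern might be called from PHP. The helper name nameAttribute and the sample tags are made up; group 1 is the empty "checkbox" group and group 2 holds the value.

```php
<?php
// The pattern from this answer: group 1 is the empty "checkbox", group 2 the value.
$rgx = '~<a(?:\s+(?>name()|\w+)=(?|"([^"]*)"|\'([^\']*)\'))+?\1~i';

// Hypothetical helper: returns the value of the name attribute, or null if absent.
function nameAttribute(string $rgx, string $tag): ?string
{
    return preg_match($rgx, $tag, $m) === 1 ? $m[2] : null;
}

var_dump(nameAttribute($rgx, '<a class="c" id="x" name=\'two\' href="#">'));
var_dump(nameAttribute($rgx, '<a name="one">'));
var_dump(nameAttribute($rgx, '<a href="#">'));
```

Attributes before name are consumed one per loop iteration; the loop only exits once the backreference to the empty group can succeed.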

Related

Looking to use preg_replace to remove characters from my strings

I have the right function, but I can't find the right regex pattern to remove (ID:999999) from the string. The ID value varies but is always numeric. I'd like to remove everything, including the brackets.
$string = "This is the value I would like removed. (ID:17937)";
$string = preg_replace('#(ID:['0-9']?)#si', "", $string);
Regex is not my forte, and I need help with this one.
Try this:
$string = preg_replace('# \(ID:[0-9]+\)#si', "", $string);
You need to escape the parentheses using backslashes (\).
You shouldn't use quotes around the number range.
You should use + (one or more) instead of ? (zero or one).
You can add a space at the start, to avoid having a space at the end of the resulting string.
In PHP the pattern must be wrapped in delimiters; / is the most common, but # works just as well. Parentheses denote a capture group, so you must escape them to match them literally.
A capture group is only needed if you want to reuse the matched text in the replacement, which isn't the case here, so /\(ID:[0-9]+\)/ is a suitable regular expression.
Here are two options:
Code: (Demo)
$string = "This is the value I would like removed. (ID:17937)";
var_export(preg_replace('/ \(ID:\d+\)/',"",$string));
echo "\n\n";
var_export(strstr($string,' (ID:',true));
Output: (I used var_export() to show that the technique is "clean" and gives no trailing whitespaces)
'This is the value I would like removed.'
'This is the value I would like removed.'
Some points:
Regex is a better / more flexible solution if your ID substring can exist anywhere in the string.
Your regex pattern doesn't need a character class if you use the shorthand character class \d.
Regex generally speaking should only be used when standard string function will not suffice or when it is proven to be more efficient for a specific case.
If your ID substring always occurs at the end of the string, strstr() is an elegant/perfect function.
Both of my methods include the leading space before (ID: in what they look for, so the output has no trailing whitespace.
You don't need either s or i modifiers on your pattern, because s only matters if you use a . (dot) and your ID is probably always uppercase so you don't need a case-insensitive search.
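To illustrate the first and fourth points, here is a small sketch with the ID in the middle of the string. The sample sentence is made up.

```php
<?php
$string = 'Order (ID:17937) has shipped.';

// Regex removes the ID wherever it sits and keeps the rest of the sentence.
$viaRegex = preg_replace('/ \(ID:\d+\)/', '', $string);

// strstr() with $before_needle = true keeps only the text before the needle.
$viaStrstr = strstr($string, ' (ID:', true);

var_export($viaRegex);   // 'Order has shipped.'
echo "\n";
var_export($viaStrstr);  // 'Order'
```

So strstr() is only a drop-in replacement when nothing worth keeping follows the ID.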

Explode and/or regex text to HTML link in PHP

I have a database of texts that contains this kind of syntax in the middle of English sentences that I need to turn into HTML links using PHP
"text1(text1)":http://www.example.com/mypage
Notes:
text1 is always identical to the text in parenthesis
The whole string always has the quotation marks, parentheses and colon, so the syntax is the same for each.
Sometimes there is a space at the end of the string, but other times there is a question mark or comma or other punctuation mark.
I need to turn these into basic links, like
<a href="http://www.example.com/mypage">text1</a>
How do I do this? Do I need explode or regex or both?
"(.*?)\(\1\)":(.*\/[a-zA-Z0-9]+)(?=\?|\,|\.|$)
You can use this.
See Demo.
http://regex101.com/r/zF6xM2/2
You can use this replacement:
$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
$replacement = '<a href="\2">\1</a>';
$result = preg_replace($pattern, $replacement, $text);
pattern details:
([^("]+) captures text1 in group 1. Using a negated character class (one that excludes the double quote and the opening parenthesis) has several advantages:
it allows a greedy quantifier, which is faster
since the class excludes the opening parenthesis and is immediately followed by a parenthesis in the pattern, if another part of the text has content between double quotes but no parenthesis inside, the regex engine will not go backward to test other possibilities; it will skip that substring without backtracking. (This is because the PCRE regex engine automatically converts [^a]+a into [^a]++a before processing the string.)
\S+ matches one or more non-whitespace characters
(?=[\s\pP]|\z) is a lookahead assertion that checks that the url is followed by a whitespace, a punctuation character (\pP) or the end of the string.
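Here is a hedged sketch of the whole conversion. It is a variation on the pattern above, not the exact regex from this answer: the URL is matched lazily and must be followed by optional punctuation and then whitespace or the end of the string, so trailing punctuation stays outside the link. The function name textileToLinks and the sample text are assumptions, and no HTML escaping is done.

```php
<?php
// Hypothetical helper; the pattern is a variation, not the answer's exact regex.
function textileToLinks(string $text): string
{
    // Group 1: link text (repeated inside the parentheses via \1).
    // Group 2: URL, matched lazily up to any trailing punctuation.
    $pattern = '~"([^("]+)\(\1\)":(https?://\S+?)(?=\pP*(?:\s|\z))~u';
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '<a href="$2">$1</a>', $text) ?? $text;
}

$text = 'See "text1(text1)":http://www.example.com/mypage, or "foo(foo)":http://example.com/bar?';
echo textileToLinks($text), "\n";
```

One trade-off of this variation: a URL that genuinely ends in punctuation, such as a trailing slash, loses that last character.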
You can use this regex:
"(.*?)\(.*?:(.*)
Working demo
An appropriate Regular Expression could be:
$str = '"text1(text1)":http://www.example.com/mypage';
preg_match('#^"([^\(]+)' .
'\(([^\)]+)\)[^"]*":(.+)#', $str, $m);
print '<a href="'.$m[3].'">'.$m[2].'</a>' . PHP_EOL;

What is the use of '\G' anchor in regex?

I'm having a difficulty with understanding how \G anchor works in PHP flavor of regular expressions.
I'm inclined to think (even though I may be wrong) that \G is used instead of ^ in situations when multiple matches of the same string are taking place.
Could someone please show an example of how \G should be used, and explain how and why it works?
UPDATE
\G forces the pattern to only return matches that are part of a continuous chain of matches. From the first match each subsequent match must be preceded by a match. If you break the chain the matches end.
<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
// Outputs "match" 5 times: the engine skips over "not-" and matches the "match," inside "not-match,"
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
echo '<br />';
$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
// Outputs "match" only 4 times: at "not-match" the chain is broken
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
?>
This is straight from the docs
The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are
\G
first matching position in subject
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the offset argument of
preg_match(). It differs from \A when the value of offset is non-zero.
http://www.php.net/manual/en/regexp.reference.escape.php
You will have to scroll down that page a bit but there it is.
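The difference from \A with a non-zero offset can be seen in a short sketch. The subject string and the offset of 3 are arbitrary choices for illustration.

```php
<?php
$subject = 'abcdef';

// With a non-zero offset, \G is true at the offset itself...
$withG = preg_match('/\Gdef/', $subject, $m, 0, 3);

// ...while \A still refers to the real start of the subject.
$withA = preg_match('/\Adef/', $subject, $m, 0, 3);

var_dump($withG); // int(1)
var_dump($withA); // int(0)
```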
There is a really good example in ruby but it is the same in php.
How the Anchor \z and \G works in Ruby?
\G will match the match boundary, which is either the beginning of the string, or the point where the last character of last match is consumed.
It is particularly useful when you need to do complex tokenization, while also making sure that the tokens are valid.
Example problem
Let us take the example of tokenizing this input:
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
Into these tokens (I use ~ to denote end of string):
input~
some input in quote~
more~
input~
'escaped quote'~
lots#_$of_fun~
' \ ~
crazy~
stuff~
The string consists of a mix of:
Singly quoted string, which allows the escape of \ and ', and spaces are conserved. Empty string can be specified using singly quoted string.
OR unquoted string, which consists of a sequence of non-white-space characters, and does not contain \ or '.
Space between 2 unquoted string will delimit them. Space is not necessary to delimit other cases.
For the sake of simplicity, let us assume the input does not contain new line (in real case, you need to consider it). It will add to the complexity of the regex without demonstrating the point.
The RAW regex for singly quoted string is '(?:[^\\']|\\[\\'])*+'
And the RAW regex for unquoted string is [^\s'\\]++
You don't need to care too much about the 2 pieces of regex above, though.
The solution below with \G makes sure that when the engine fails to find any match, all characters from the beginning of the string to the position of the last match have been consumed. Since it cannot skip characters, the engine will stop matching when it fails to find a valid match for both token specifications, rather than grabbing random stuff in the rest of the string.
Construction
At the first step of construction, we can put together this regex:
\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
Or simply put (this is not regex - just to make it easier to read):
\G(Singly_quote_regex|Unquoted_regex)
This will match the first token only, since when it attempts matching for the 2nd time, the match stops at the space before 'some input....
We just need to add a bit to allow for 0 or more space, so that in the subsequent match, the space at the position left off by the last match is consumed:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
The regex above will now correctly identify the tokens, as seen here.
The regex can be further modified so that it returns the rest of the string when the engine fails to retrieve any valid token:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))
Since the alternation is tried in order from left to right, the last alternative ((?s).+$) will be matched if and only if the string ahead doesn't make up a valid single quoted or unquoted token. This can be used to check for errors.
The first capturing group will contain the text inside single quoted string, which needs extra processing to turn into the desired text (it is not really relevant here, so I leave it as an exercise to the readers). The second capturing group will contain the unquoted string. And the third capturing group acts as an indicator that the input string is not valid.
Demo for the final regex
Conclusion
The above example demonstrates one scenario where \G is useful in tokenization. There may be other uses that I haven't come across.
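To see the construction run end to end in PHP, here is a minimal sketch using the second-step regex (without the error-reporting alternative). The ~ delimiters, the helper name tokenize and the use of stripslashes() to undo \' and \\ are assumptions for illustration.

```php
<?php
// Second-step regex from the construction above, kept in a nowdoc so that
// no extra PHP-level escaping is needed. The ~ delimiters are an assumption.
$pattern = <<<'RE'
~\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))~
RE;

$input = <<<'TXT'
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
TXT;

// Hypothetical helper: returns the token list described in the example.
function tokenize(string $pattern, string $input): array
{
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
    $tokens = [];
    foreach ($matches as $m) {
        if (isset($m[2]) && $m[2] !== '') {
            $tokens[] = $m[2];                // unquoted token
        } else {
            $tokens[] = stripslashes($m[1]);  // quoted token: undo \' and \\
        }
    }
    return $tokens;
}

print_r(tokenize($pattern, $input));
```

With PREG_SET_ORDER, group 2 is absent from a match when the quoted alternative was taken, which is what the isset() check relies on.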

Getting regular expression

How can I extract https://domain.com/gamer?hid=.115f12756a8641 from the string below, i.e. the value of url?
rrth:'http://www.google.co',cctp:'323',url:'https://domain.com/gamer?hid=.115f12756a8641',rrth:'https://another.com'
P.S.: I am new to regular expressions and still learning. But the above string seems to be formatted, so there must be some sort of shortcut.
If your input string is called $str:
preg_match('/url:\'(.*?)\'/', $str, $matches);
$url = $matches[1];
(.*?) captures everything between url:' and ' and can later be retrieved with $matches[1].
The ? is particularly important. It makes the repetition ungreedy, otherwise it would consume everything until the very last '.
If your actual input string contains multiple url:'...' section, use preg_match_all instead. $matches[1] will then be an array of all required values.
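The effect of the ? can be checked directly on the string from the question. The variable names $lazy and $greedy are made up for this sketch.

```php
<?php
$str = "rrth:'http://www.google.co',cctp:'323',url:'https://domain.com/gamer?hid=.115f12756a8641',rrth:'https://another.com'";

preg_match('/url:\'(.*?)\'/', $str, $lazy);   // stops at the first closing quote
preg_match('/url:\'(.*)\'/', $str, $greedy);  // runs on to the very last quote

echo $lazy[1], "\n";
echo $greedy[1], "\n";
```

The greedy version returns the wanted URL followed by ',rrth:'https://another.com, which is why the lazy quantifier matters here.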
Simple regex:
preg_match('/url\s*\:\s*\'([^\']+)/i',$theString,$match);
echo $match[1];//should be the url
How it works:
/url\s*\:\s*: matches url + [any number of spaces] + : (colon) + [any number of spaces]. But we don't need this part; that's where the second part comes in.
\'([^\']+)/i: matches ', then the brackets (()) create a group that will be stored separately in the $match array. What will be matched is [^']+: any character except the apostrophe (the [] create a character class, the ^ means: exclude these chars). So this class will match every character up to the closing/delimiting apostrophe.
/i: in case the string might contain URL:'http://www.foo.bar', I've added that i, which is the case-insensitive flag.
That's about it. Perhaps you could sniff around here to get a better understanding of regexes.
Note: I've had to escape the single quotes because the pattern string uses single quotes as delimiters: "/url\s*\:\s*'([^']+)/i" works just as well. If you don't know whether you'll be dealing with single or double quotes, you could replace the quotes with another char class:
preg_match('/url\s*\:\s*[\'"]([^\'"]+)/i',$string,$match);
Obviously, in that scenario, you'll have to escape the delimiters you've used for the pattern string...

Getting a random string within a string

I need to find a random string within a string.
My string looks as follows
{theme}pink{/theme} or {theme}red{/theme}
I need to get the text between the tags, the text may differ after each refresh.
My code looks as follows
$str = '{theme}pink{/theme}';
preg_match('/{theme}*{\/theme}/',$str,$matches);
But no luck with this.
* is only the quantifier, you need to specify what the quantifier is for. You've applied it to }, meaning there can be 0 or more '}' characters. You probably want "any character", represented by a dot.
And maybe you want to capture only the part between the {..} tags with (.*)
$str = '{theme}pink{/theme}';
preg_match('/{theme}(.*){\/theme}/',$str,$matches);
var_dump($matches);
'/{theme}(.*?){\/theme}/' or even more restrictive '/{theme}(\w*){\/theme}/' should do the job
preg_match_all('/{theme}(.*?){\/theme}/', $str, $matches);
You should use ungreedy matching here. $matches[1] will contain the contents of all matched tags as an array.
$matches = array();
$str = '{theme}pink{/theme}';
preg_match('/{([^}]+)}([^{]+){\/([^}]+)}/', $str, $matches);
var_dump($matches);
That will dump out all matches of all "tags" you may be looking for. Try it out and look at $matches and you'll see what I mean. I'm assuming you're trying to build your own rudimentary template language so this code snippet may be useful to you. If you are, I may suggest looking at something like Smarty.
In any case, you need parentheses to capture values in regular expressions. There are three captured values above:
([^}]+)
will capture the value of the opening "tag," which is theme. The [^}]+ means "one or more of any character BUT the } character", so the match cannot run past the closing brace.
([^{]+)
Will capture the value between the tags. In this case we want to match all characters BUT the { character.
([^}]+)
Will capture the value of the closing tag.
preg_match('/{theme}([^{]*){\/theme}/',$str,$matches);
[^{] matches any character except the opening brace, so the match cannot run past the start of the next tag, which is important if you have more than one tag per string/line.
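With two tags in one string the difference is easy to see. This is a small sketch using the sample string from the question; the braces are escaped here purely as a precaution.

```php
<?php
$str = '{theme}pink{/theme} or {theme}red{/theme}';

// Greedy .* runs to the last closing tag and swallows everything in between.
preg_match('~\{theme\}(.*)\{/theme\}~', $str, $greedy);

// A negated class cannot cross a "{", so each tag is matched on its own.
preg_match_all('~\{theme\}([^{]*)\{/theme\}~', $str, $perTag);

echo $greedy[1], "\n";   // pink{/theme} or {theme}red
print_r($perTag[1]);     // pink, red
```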
