Parse for square brackets with regular expressions - php

I've always had a difficult time with regular expressions. I've searched for help with this, but I can't quite find what I'm looking for.
I have blocks of text that follow this pattern:
[php]
... any type of code sample here
[/php]
I need to:
check for the square brackets, which can contain any one of 20-30 programming language names (php, ruby, etc.).
grab all code in between the opening and closing brackets.
I have worked out the following regular expression:
#\[([a-z]+)\]([^\[/]*)\[/([a-z]+)\]#i
This matches everything pretty well. However, it breaks when the code sample contains square brackets. How do I modify it so that any character between the opening and closing tags will be matched for later use?

This is the regex you want. It also requires the opening and closing tags to agree, so a php tag will only be closed by a php tag.
/\[(\w+)\](.*?)\[\/\1\]/s
Or if you wanted to explicitly match the tags you could use...
$langs = array('php', 'python', ...);
$langs = implode('|', array_map('preg_quote', $langs));
preg_match_all('/\[(' . $langs . ')\](.*?)\[\/\1\]/s', $str, $matches);
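A minimal sketch of how the explicit-list version might be used. The sample text, the short language list and the variable names are made up for illustration; PREG_SET_ORDER is used so that each match comes back together with its own captures.

```php
<?php
// Hypothetical input: two tagged blocks whose code contains square brackets.
$str = '[php]$a = $b[0];[/php] text [ruby]puts x[1][/ruby]';

// Illustrative list; the real one would hold all supported languages.
$langs = array('php', 'ruby', 'js');
$alternation = implode('|', array_map(function ($l) {
    return preg_quote($l, '/'); // also escape the delimiter
}, $langs));

// s: let . match newlines; \1 forces the closing tag to repeat the opening one.
$pattern = '/\[(' . $alternation . ')\](.*?)\[\/\1\]/s';
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);

$blocks = array();
foreach ($matches as $m) {
    $blocks[] = array('lang' => $m[1], 'code' => $m[2]);
}
print_r($blocks);
```

Each entry of $blocks then holds the language name and the raw code between the tags, brackets included.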

The following will work:
\[([a-z]+)\].*\[/\1\]
If you want to remove the greediness, so that the match stops at the first closing tag, you can do:
\[([a-z]+)\].*?\[/\1\]
All you have to do is to check that both the closing and opening tags have the same text (in this case, that both are the same programming language), and you do that with \1, telling it to match the previously matched Group number 1: ([a-z]+)
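To see what the greedy and non-greedy variants do differently, here is a small sketch. The sample text is invented, and I added a capture group around .* and the s and i modifiers so the code is easy to inspect; those are my additions, not part of the patterns above.

```php
<?php
// Hypothetical text with two blocks in the same language.
$text = '[php]one[/php] middle [php]two[/php]';

// Greedy: .* runs to the LAST [/php], swallowing both blocks as one match.
preg_match_all('#\[([a-z]+)\](.*)\[/\1\]#is', $text, $greedy);

// Lazy: .*? stops at the FIRST closing tag that matches the opening one.
preg_match_all('#\[([a-z]+)\](.*?)\[/\1\]#is', $text, $lazy);

var_dump($greedy[2]); // one element: 'one[/php] middle [php]two'
var_dump($lazy[2]);   // two elements: 'one' and 'two'
```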

Why don't you use something like below:
\[php\].*?\[/php\]
I don't understand why you want to use [a-z]+ for the tags, there should be php or a limited amount of other tags. Just keep it simple.
Actually you can use:
\[(php)\].*?\[/(\1)\]
so that the closing tag has to match the opening tag. Otherwise you will be pairing up unrelated opening and closing tags. Add other languages to the group as alternatives, e.g. php|js.

Use a backreference to refer to a match already made in the regular expression:
\[(\w+)\].*?\[/\1\]

Related

preg_match link text with less-than sign in it

I'm trying to extract information from HTML files into a DB, and suddenly found that a link's text can look like this:
channel crosstalk: <60dB
Therefore my regular expression doesn't find that link:
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis',$html,$matches);
This is a part of big regular expression, I just simplified it for example.
It's hard to tell what you are trying to pull. Are you looking for the entire link? Or are you looking to grab parts from the link (hence the parenthesis)? Here is a solution for getting the individual contents in the link:
preg_match_all( '#(.*?)#i', $html, $matches);
The first element of matches will be the entire link, while the other elements will be the sub parts.
Or here is one for just the entire link:
preg_match_all( "#(<a.*>.*</a>)#i", $html, $matches );
Or here is a slightly modified version of yours. Yours currently isn't matching because [^<]* only allows link text without angle brackets, and this link's text contains one. Note that under the U modifier a trailing ? makes a quantifier greedy again, so the group is written as (.*):
preg_match_all( '|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis', $html, $matches );
Again, not 100% sure the exact results you are looking for, but maybe this will get your going and you can make modifications as needed.
You can use this regex to extract href and link text.
<a[^>]+?href="(.*?)"[^>]*>(.*?)</a>
Group 1: href
Group 2: link text
This is the fundamental issue with trying to regex HTML. This is not really good HTML, because content that is not meant to be interpreted as HTML should be encoded as HTML entities (i.e. &lt; instead of <). You won't always be able to handle that though.
In your case, something like this works for regex:
|.*?|Uis
The matching group gets shifted. This also allows nested tags (like <a><b><i></i></b></a>).
Keep in mind that the Ungreedy (U) modifier you used means that you can be a little more lax in your regex matching. If you wanted to do this without the U modifier, you'd probably need a negative lookahead:
|(?:(?!).)*</a>|is
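For what it's worth, here is a sketch of the negative-lookahead idea written out in full. The markup and the /blabla/ numbers are invented to resemble the question, and the pattern is my own reading of the approach rather than a quote from the answers: (?:(?!</a>).)* consumes one character at a time as long as </a> does not start at that position.

```php
<?php
// Hypothetical markup: the first link's text contains a raw "<".
$html = '<a href="/blabla/42" class="x">channel crosstalk: <60dB</a> '
      . '<a href="/blabla/7">ok</a>';

// No U modifier needed: the lookahead stops the group before the first </a>.
$pattern = '~<a href="/blabla/([0-9]+)"[^>]*>((?:(?!</a>).)*)</a>~is';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

Group 1 is the numeric id and group 2 the link text, including any stray angle brackets.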

Replace string using regular expression

I always encounter regular expressions but I don't really try to understand and use them. But my current project is forcing me to use a regular expression so I need someone who can give me the correct regex to replace a simple string. Basically I'm replacing a small subset of longtext retrieved from a database. The longtext is just a paragraph(s) with text anchors in a form of:
Example
So the question is: how do I replace the value of the title attribute? Please note that the text may contain two or more anchor tags, so I'd like to be able to target each of them specifically.
EDIT:
I'd like to use pure PHP on this. I think I know how to do this using js/jquery.
$doc = new DOMDocument();
$doc->loadHTML('Example');
$anchors = $doc->getElementsByTagName('a');
foreach ($anchors as $anchor)
{
    $anchor->setAttribute('target', '__blank');
}
$html = $doc->saveHTML();
echo $html;
Description
You could do this with the following regex
(<a\b[^>]*?\btitle=(['"]))(.*?)\2
Summary
( start capture group 1
<a\b consume open angle bracket and an a followed by a word break
[^>]*? consume all non close angle bracket characters up to... this forces the regex to stay inside the anchor tag
\btitle= consume a word break and title=, the break helps do some additional checking
(['"]) capture group 2, ensures that an opening single or double quote is being used
) close capture group 1
(.*?) start capture group 3, and non greedy consume to collect all text inside the quotes
\2 reference back to the string from capture group 2: if you used a single quote to open the value, then a single quote will be required to close the value. The same applies if you had used a double quote.
In the replace command I'm simply replacing the entire found string from <a to the close quote with: group capture 1, followed by the desired text NewValue followed by the close quote from group capture 2.
PHP example
<?php
$sourcestring="Example";
echo preg_replace('/(<a\b[^>]*?\btitle=([\'"]))(.*?)\2/im','\1NewValue\2',$sourcestring);
?>
$sourcestring after replacement:
Example
Disclaimer
Since parsing text via a html parser is not the desired solution, I'll skip the usual soap box disclaimer about parsing html with Regex.
$string=preg_replace(
'#<a (.*)title="(.*)"([^>]*)>(.*)</a>#iU',
'<a $1title="'.$replacement.'"$3>$4</a>',
$string);
Note that the i at the end of the expression makes it case insensitive, and the U makes it ungreedy.
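Since the question mentions targeting each anchor separately, here is a hedged sketch that combines the (<a\b[^>]*?\btitle=(['"]))(.*?)\2 pattern from above with preg_replace_callback. The sample markup and the $newTitles list are made up; the idea is simply to hand out replacement titles in document order.

```php
<?php
// Hypothetical text with two anchors; the attribute values are invented.
$text = '<a href="#one" title="Old one">First</a> and '
      . '<a href="#two" title=\'Old two\'>Second</a>';

// New titles in document order (illustrative values).
$newTitles = array('Title A', 'Title B');
$i = 0;

$result = preg_replace_callback(
    '/(<a\b[^>]*?\btitle=([\'"]))(.*?)\2/i',
    function ($m) use (&$i, $newTitles) {
        // Keep the old value when no replacement exists for this anchor.
        $title = isset($newTitles[$i]) ? $newTitles[$i] : $m[3];
        $i++;
        // $m[1] is everything up to the opening quote, $m[2] the quote itself.
        return $m[1] . htmlspecialchars($title, ENT_QUOTES) . $m[2];
    },
    $text
);
echo $result;
```

The callback sees one anchor at a time, so any per-anchor rule (by index, by href, by old title) can go inside it.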

PHP/Perl Regular expression help!

I have a string:
$string = 'This is my big <span class="big-string">string</span>';
I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears in an HTML tag.
Any help is appreciated!
Edit
Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently if you have typed 'aut' in the search dialog, then the results look like this: <b>aut</b>omotive
The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>' then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.
If you insist upon using a regex, this one will do a pretty decent job:
$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
big # The sub-string to be matched.
(?= # Assert we are not inside an HTML tag.
[^<>]* # Consume all non-<> up to...
(?:<\w+ # either an HTML start tag,
| $ # or the end of string.
) # End group of valid alternatives.
) # End "not-in-html-tag" lookahead assertion.
/ix';
Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
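Another way to sidestep both problems from the question (matching inside tags, and the second sweep hitting the freshly inserted <b> tags) is to split the string into tags and text, then highlight all terms in a single pass over the text parts only. This is just a sketch: highlightTerms() is a hypothetical helper, and like the regex above it assumes there are no angle brackets inside attribute values.

```php
<?php
// Wrap each search term in <b>...</b>, touching only text outside of tags.
function highlightTerms($html, array $terms)
{
    if (!$terms) {
        return $html; // nothing to highlight
    }
    // Longest terms first so that "auto" wins over "a" at the same position.
    usort($terms, function ($a, $b) { return strlen($b) - strlen($a); });
    $alternation = implode('|', array_map(function ($t) {
        return preg_quote($t, '/');
    }, $terms));

    // With one capture group, the result alternates: text, tag, text, tag, ...
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $k => $part) {
        if ($k % 2 === 1) {
            continue; // odd indexes are the captured tags, leave them alone
        }
        // One pass over all terms, so inserted <b> tags are never rescanned.
        $parts[$k] = preg_replace('/(' . $alternation . ')/i', '<b>$1</b>', $part);
    }
    return implode('', $parts);
}

echo highlightTerms('automotive <span class="big-string">big</span>', array('auto', 'b'));
```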

PHP / Regex : match json inside json

Just a quick regex question...hopefully
I have a string that looks something like this:
$string = 'some text [ something {"index":"{"index2":"value2"}"}] [something2 {"here to be":"more specific"}]';
I want to be able to get the value:
{"index":"{"index2":"value2"}"}
But all my attempts at matching (or replacing) keep giving me:
{"index":"{"index2":"value2"}
preg_replace('/\[(.*?)({.*?[^}]})*?\]/is', "", $string);
Here I'm matching the whole square bracket area, but hopefully you can see what I'm trying to do.
The negation of the "do not match }" doesn't seem to be doing anything. Maybe I just need an OR in there or something.
Well, thanks if you have time to answer.
The $string could contain multiple instances of the {} so a greedy regex won't work....that I know of.
A regex in the classic sense can't count opening brackets and their corresponding closing brackets (PCRE's recursive patterns can, but they are harder to read), so a simple for loop is the more straightforward tool. You can get the complete string from the first opening bracket to the last closing one with a greedy expression like ({.*}), although that spans every group when the string contains several. Note that simple string functions are generally faster than regular expressions, so prefer those where they suffice.
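A rough sketch of the loop approach, using the string from the question. extractBraceGroups() is a hypothetical helper; it assumes the braces are balanced and it does not treat braces inside quoted values specially.

```php
<?php
// Return every top-level {...} group in $s by counting brace depth.
function extractBraceGroups($s)
{
    $groups = array();
    $depth = 0;
    $start = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        if ($s[$i] === '{') {
            if ($depth === 0) {
                $start = $i; // remember where the outermost group begins
            }
            $depth++;
        } elseif ($s[$i] === '}' && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                // Back at the top level: the group is complete.
                $groups[] = substr($s, $start, $i - $start + 1);
            }
        }
    }
    return $groups;
}

$string = 'some text [ something {"index":"{"index2":"value2"}"}] [something2 {"here to be":"more specific"}]';
print_r(extractBraceGroups($string));
```

Because nesting is tracked with a counter, the inner {"index2":"value2"} does not end the outer group early, and the second bracketed section is returned as its own entry.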

Regular expression to match a certain HTML element

I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
You need to use a non-greedy selection by adding ? after .* :
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note: If you need to match generic HTML, you should use an XML parser like DOM.
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using defaults to greedy mode.
It's a lot easier to use PHP's DOM parser to extract content from HTML.
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and of variable expressions wrapped in braces ({$arr['key']}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's no point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character is not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.
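If you want to convince yourself that the two patterns behave the same, a quick check along these lines should do. The sample markup is invented, and one sample is of course no proof for every input.

```php
<?php
// Hypothetical sample with two hidden spans and one unrelated span.
$html = '<span class="hidden_text">Some text here.</span>'
      . '<span>visible</span>'
      . '<span class="hidden_text">More.</span>';

// The self-answer's pattern, kept in its double-quoted form.
$original  = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
// The simplified pattern: single quotes, ~ delimiters, possessive quantifier.
$rewritten = '~<span class="hidden_text">[^><]++</span>~';

preg_match_all($original, $html, $a);
preg_match_all($rewritten, $html, $b);

var_dump($a[0] === $b[0]); // both patterns find the same two spans
```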
