PHP Regular expression tag matching - php

Been beating my head against a wall trying to get this to work - help from any regex gurus would be greatly appreciated!
The text that has to be matched
[template option="whatever"]
<p>any amount of html would go here</p>
[/template]
I need to pull the 'option' value (i.e. 'whatever') and the html between the template tags.
So far I have:
/\[template\s*option=["\']([^"\']+)["\']\]((?!\[\/template\]))/
Which gets me everything except the html between the template tags.
Any ideas?
Thanks, Chris

Edit: [\s\S] matches any character, whitespace or not, including newlines.
You may have a problem when there are consecutive blocks in a large string, because the greedy + will run from the first opening tag to the last [/template]. In that case you need a more specific quantifier: either non-greedy (+?), or a bounded range such as {1,200}, or a more specific class than [\s\S].
/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+)\[\/template\]/
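To illustrate the consecutive-blocks caveat, here is a minimal sketch using the non-greedy variant (+?) with preg_match_all. The sample text and variable names are made up for the demonstration.

```php
<?php
// Two consecutive template blocks (hypothetical sample input).
$text = "[template option=\"one\"]\n<p>first</p>\n[/template]\n"
      . "[template option='two']\n<p>second</p>\n[/template]";

// Same pattern as above, but with +? so each match stops at the
// nearest [/template] instead of the last one in the string.
$pattern = '/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+?)\[\/template\]/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the option value, $m[2] is the HTML between the tags.
    echo $m[1], ' => ', trim($m[2]), "\n";
}
```

With the greedy + the same input would produce a single match spanning both blocks.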

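Try this:
/\[template\s*option="(.*?)"\](.*?)\[\/template\]/s
Basically, instead of using a complex regex to match every single thing, just use (.*?), which matches anything. You want everything in between; it's not like you want to verify the data in between. The s modifier lets . match newlines as well, and the lazy ? stops each group at the first closing quote or [/template] rather than the last.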

The (?!...) assertion is unneeded. Just match with .*? to get the minimum giblets. The s modifier is needed so that . also matches the newlines in the sample text.
/\[template\s*option=\pP([\h\w]+)\pP\] (.*?) \[\/template\]/xs

Chris,
I see you've already accepted an answer. Great!
However, I don't think use of regular expressions is the right solution here. I think you can get the same effect by using string manipulations (substrings, etc)
Here is some code that may help you. If not now, maybe later in your coding endeavors.
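<?php
$string = '[template option="whatever"]<p>any amount of html would go here</p>[/template]';
// Everything from 'option=' onward
$extractoptionline = strstr($string, 'option=');
// Skip the 8 characters of 'option="'
$chopoff = substr($extractoptionline, 8);
// The option value ends at the closing '"]'
$option = substr($chopoff, 0, strpos($chopoff, '"]'));
echo "option: $option<br />\n";
// Everything from the closing '"]' of the opening tag onward
$extracthtmlpart = strstr($string, '"]');
$chopoffneedle = substr($extracthtmlpart, 2);
// The HTML ends at the '[/' of the closing tag
$html = substr($chopoffneedle, 0, strpos($chopoffneedle, '[/'));
echo "html: $html<br />\n";
?>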
Hope this helps anyone looking for a similar answer with a different flavor.

Related

Regex everything but strings + use of groups

I'm trying to make a whitelist of html tags, here's my code :
$string = "<x>-<x>";
$result = preg_match('#^<(?!white1|white2)>.*<(\1)>$#i', $string);
But it returns false, and I don't know why. I simplified the regex to avoid confusions, but this is still the same idea.
I want to match every correct tag but the ones I want to keep safe. This regex will go on a preg_replace to erase every matched tag and let the ones I allow.
Thanks for your help in advance !
EDIT: If I find a way to do this with regexes, I'll put the solution here. But for now, I'll do it with strip_tags().
EDIT2: The easiest way I can think of is to parse all the tags and then restore the ones we allow.
If you want to eliminate unwanted tags, you can use strip_tags:
$allowedTags = '<p><a><img>';
$filteredContent = strip_tags($content, $allowedTags);
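If you still want the regex route, here is a rough sketch of the negative-lookahead idea. The original pattern fails partly because the lookahead is followed directly by >, so the tag name is never consumed, and \1 refers to a group that does not exist. The function name strip_unlisted_tags and the sample input are hypothetical, and this assumes a non-empty whitelist and reasonably simple markup (no > inside attribute values).

```php
<?php
// Hypothetical helper: remove every tag whose name is NOT in $allowed.
function strip_unlisted_tags(string $html, array $allowed): string
{
    // Escape each allowed name for use inside a '#'-delimited pattern.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $allowed));

    // </? opens a start or end tag; the lookahead rejects allowed names
    // (\b keeps "p" from also protecting "pre"); the rest consumes the tag.
    $pattern = '#</?(?!(?:' . $names . ')\b)[a-z][a-z0-9]*\b[^>]*>#i';

    return preg_replace($pattern, '', $html);
}

$out = strip_unlisted_tags('<p>keep</p><x>drop</x><a href="#">link</a>', ['p', 'a']);
echo $out, "\n"; // <p>keep</p>drop<a href="#">link</a>
```

For anything beyond trusted, simple input, strip_tags() as shown above remains the safer choice.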

Omitting a specific pattern from a regex statement

I've spent the last couple of days trying to figure out how to resolve this particular issue and posting on SO, but no dice so far. I think this is probably easier than I've been making it to be, but I need some help;
Here is a pretty basic regex statement that linkifies pretty much any link. It's not the only regex pattern I have, so I've included a piece that skips over the link if it includes the specific pattern "img.youtube.com/vi/" It works great;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i", "<a href=$1 target='_blank'><b>$1</b></a>", $message);
I do not want this to linkify any url with .jpeg, jpg, gif, or any popular image format, I have another expression that will embed those kinds of links (and it works fine, too). So, I need to find a way to get this expression to reject those kinds of links.
I've gotten advice on negative lookarounds, matching to specific strings, but none of them seem to work so far. I need to find a way to get this regex to ignore any URL that ends with .jpeg and so forth;
So, the regex statement above already has an example of a string that disqualifies certain URLs - ?!(img.youtube.com/vi/). This seems like that's all I need to do, but where do I put it and how does it look? The + symbol in the statement makes it so that the regex will scrutinize the string all the way to the end of it, using the matching characters of [-a-zA-Z?-??-?()0-9#:%_+.~#?&;//=,]. So, this matching string should probably be put somewhere before the + symbol. Does it go in "?!(img.youtube.com/vi/)" ? In my mind, it should probably look like this;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/|/^\.jpeg$/|/^\.jpg$/|/^\gif$/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i",
"<a href=$1 target='_blank'><b>$1</b></a>", $message);
Any help is appreciated.
Here is an answer, which also cleans up your regexp:
(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~#?&;/=,])(?2))+(?!(?3)))
Now the img and the other patterns you don't want are in the negative lookahead, and you can add anything else you want to exclude.
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good);
echo preg_replace($r,$rep,$bad);
You can try here http://ideone.com/419yfm
Just remove this part of the regex:
img|
<?php
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good); echo "\n";
echo preg_replace($r,$rep,$bad);
?>
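An alternative that avoids packing the exclusions into the pattern: match every URL with a simple regex, then decide in a callback whether to linkify it. This is only a sketch under some assumptions: the function name linkify_non_images is made up, the scheme is required (the original pattern made it optional), and the image check looks only at the extension of the URL path.

```php
<?php
// Hypothetical helper: linkify URLs, leaving image links and
// img.youtube.com thumbnails untouched for the other expressions.
function linkify_non_images(string $message): string
{
    $pattern = '#\b(?:f|ht)tps?://[^\s<>"\']+#i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        // parse_url() may return null or false; cast to an empty string.
        $path = (string) parse_url($url, PHP_URL_PATH);

        $isImage = preg_match('#\.(?:jpe?g|gif|png|bmp)$#i', $path) === 1;
        $isThumb = stripos($url, 'img.youtube.com/vi/') !== false;
        if ($isImage || $isThumb) {
            return $url; // leave it as plain text
        }

        $safe = htmlspecialchars($url, ENT_QUOTES);
        return "<a href=\"$safe\" target=\"_blank\"><b>$safe</b></a>";
    }, $message);
}

echo linkify_non_images('see http://www.google.com/ and http://example.com/a.jpg'), "\n";
```

Keeping the decision in PHP code makes it easy to add more exclusions later without touching the URL pattern.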

PHP preg_replace();

I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be safe, I'd split this match into two phases, because you cannot be certain what order the attributes will appear in:
1. Find the relevant input element
2. Get the value
if (preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match)) {
    $value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
}
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
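For reference, a minimal sketch of the DOM route, assuming the dom extension is available. The sample markup and the value dDwtMTA4 are invented for the example; a real page would be fetched elsewhere.

```php
<?php
// Invented sample markup standing in for the fetched page.
$html = '<html><body><form>'
      . '<input id="__VIEWSTATE" type="hidden" value="dDwtMTA4" name="__VIEWSTATE">'
      . '</form></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//input[@name="__VIEWSTATE"]');

$viewstate = null;
if ($nodes !== false && $nodes->length > 0) {
    // Attribute order in the markup does not matter here.
    $viewstate = $nodes->item(0)->getAttribute('value');
}

echo $viewstate, "\n"; // dDwtMTA4
```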
You should only capture when you're planning on using the data, so most of the () in that pattern are unnecessary. Not a cause of the failure, but I thought I'd mention it.
Instead of using [^"] to exclude that character, you could use the non-greedy modifier, ?. This makes the pattern match as little as it can. Since name="__VIEWSTATE" follows the value, this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. That way your code keeps working even if the attributes change order. Plus it's so much nicer to work with.
The main mistake was the use of the function preg_replace, which returns the subject (with any replacements applied), not the matched text or the replacement alone. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
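A small sketch of that difference, with a made-up subject string: when nothing matches, preg_replace hands back the whole subject unchanged, whereas preg_match tells you whether a match was found.

```php
<?php
// Made-up subject that does not contain the hidden field.
$html = '<p>no hidden field here</p>';
$pattern = '/value="([^"]*)"/';

// No match: preg_replace returns the subject as-is.
$replaced = preg_replace($pattern, '$1', $html);
var_dump($replaced === $html); // bool(true)

// preg_match returns 0 on no match, so the failure is detectable.
$found = preg_match($pattern, $html, $m);
var_dump($found); // int(0)
```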

php preg_match two examples

I need to preg_match for
src="http:// "
where the blank space following // is the rest of the url ending with the ". My adapted pattern doesn't seem to work:
preg_match('#src="(http://[^"]+)#', $data, $match);
And I am also struggling to get text that starts with > and ends with EITHER a full stop . or an exclamation mark ! or a question mark ? I have no idea how to do this one. An example of the text I want to preg_match for is:
blahblahblah>Hello world this is what I want.
I'm hoping a kind preg_match guru can tell me the answer and save me hours of headscratching.
Thanks for reading.
As for the URL:
preg_match('#src="(.*?)"#', $data, $match);
and for the second case, use />(.*?)(\.|!|\?)/
(.*?)" will match any characters lazily, that is, as few as possible, up to the first closing double quote
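A quick sketch of both patterns in use. The sample strings are made up, apart from the sentence taken from the question.

```php
<?php
// Made-up markup for the src case.
$data = '<img src="http://example.com/pic.png" alt="demo">';
preg_match('#src="(.*?)"#', $data, $match);
echo $match[1], "\n"; // http://example.com/pic.png

// Sentence from the question, followed by extra text.
$text = 'blahblahblah>Hello world this is what I want. And some more!';
preg_match('/>(.*?)(\.|!|\?)/', $text, $sentence);
// Group 1 is the text, group 2 the terminating punctuation mark.
echo $sentence[1] . $sentence[2], "\n"; // Hello world this is what I want.
```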
It seems that you want to parse a document or string that follows an HTML, DOM, XML or similar structure.
Use XPath to navigate to the tag and return its src attribute. This will save much trouble and you can forget about regular expressions.
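A short sketch of the XPath suggestion, assuming the dom extension is available; the markup and URLs are placeholders.

```php
<?php
// Placeholder markup with two images.
$data = '<div><img src="http://example.com/a.png"><img src="http://example.com/b.png"></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect HTML quietly
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$sources = [];
// //img/@src selects the src attribute node of every img element.
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```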

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurrence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've come up with (it doesn't work, sadly)
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plants = $this->_model->getPlantById($plantId);
$plantDetails = end($plants); // end() takes its argument by reference, so it needs a variable
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach ($plantDetails as $key => $detail) {
    $detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
    $plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
    $botanyWords = $this->_botanyWords;
    $word = $match[0];
    if (array_key_exists($word, $botanyWords)) {
        return '' . $word . '';
    } else {
        return $word;
    }
}
Hope this will help someone else some day! Thanks again for all your answers.
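The single-pass idea (match every word once and look it up in a list) can also be sketched on its own, outside the class above. The keyword list, URLs, link markup and the function name link_keywords are all hypothetical, and the sketch assumes the input is plain text with no existing HTML.

```php
<?php
// Hypothetical keyword list: word => URL.
$keywords = [
    'awesome' => '/glossary/awesome',
    'stuff'   => '/glossary/stuff',
];

function link_keywords(string $text, array $keywords): string
{
    // Each word is visited exactly once, so text generated by the
    // callback (such as the title attribute) is never re-scanned.
    return preg_replace_callback('/\b[A-Za-z]+\b/', function (array $m) use ($keywords): string {
        $word = $m[0];
        $key = strtolower($word);
        if (!isset($keywords[$key])) {
            return $word; // not a keyword: leave untouched
        }
        return '<a href="' . $keywords[$key] . '" title="' . $word . ' is cool">' . $word . '</a>';
    }, $text);
}

$linked = link_keywords('hello awesome stuff', $keywords);
echo $linked, "\n";
```

Note that the word awesome inside the generated title stays as plain text, which is exactly the problem the question was about.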
This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.
Assuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake--and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.
Sure, using a parsing library is the industrial-strength solution, but we all have times where we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try running your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.
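A minimal sketch of that approach, using the string from the question. It only counts occurrences in the visible text; mapping positions back into the original HTML is a separate problem that this does not solve.

```php
<?php
$html = 'hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff';

// Drop the tags (and with them the title attribute) first.
$text = strip_tags($html);

// Now only the visible occurrence of "awesome" is left to match.
$count = preg_match_all('/\bawesome\b/i', $text, $m);
echo $count, "\n"; // 1
```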
This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's a hack.
