Optional regex pattern produces no value - php

I am having a bit of a problem with a regex I wrote for a project of mine (please keep in mind that I am a beginner at regex, which shows in the following example). I am trying to extract certain parts of a piece of XML code using an associated pattern.
<banner piclink="pic" urlactive="url_active" urltarget="globaltgt" urllink="globallink" timevar="globaldelay" swf="0" smooth="1" name="name" alt="alternate" />
I am using the following regular expression to obtain the piclink, urlactive, urltarget, urllink and timevar using preg_match_all:
/piclink=\"(?<pic>.+)\".+urltarget=\"(?<target>.+)\".+urllink=\"(?<url>.*)\".+timevar=\"(?<delay>.*)\"/iU
So far so good; everything works. However, I am now also trying to capture by name the name and alt attributes, which are optional: they don't always appear. I have tried putting them in parentheses followed by a ? to indicate that they are optional, like so:
(name=\"(?<name>.*)\")?
However, the $matches['name'] array is always empty. I do not know where I am messing up, but I have tried all sorts of combinations and all of them give an empty result, except when I put (?: at the end and wrap everything from swf= onwards. Then it returns about 115 results in the array, which is not acceptable, because the result looks like $matches['name'][X] = result, where X is sometimes 1 and other times 109 for some reason.

I agree that something like SimpleXML would be better but if you want to get dirty, you can use lookaheads to try to match with the remaining characters.
/piclink=\"(?<pic>.+)\".+urltarget=\"(?<target>.+)\".+urllink=\"(?<url>.*)\".+timevar=\"(?<delay>[^"]*)\"(?=(.*name=\"(?<name>[^"]*)\")?)(?=(.*alt=\"(?<alt>[^"]*)\")?).*/iU
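If a parser is not an option, another way around the optional attributes is to stop matching them in a fixed order. The sketch below is not the pattern above but a different technique: it collects every key="value" pair from the tag into an associative array, so missing attributes are simply absent. The function name bannerAttributes is made up for this example, and it assumes attribute values are always double-quoted.

```php
<?php
// Hypothetical helper: collects every key="value" pair of a single tag
// into an associative array, regardless of attribute order.
function bannerAttributes(string $tag): array
{
    preg_match_all('/(\w+)="([^"]*)"/', $tag, $pairs, PREG_SET_ORDER);

    $attrs = [];
    foreach ($pairs as $pair) {
        $attrs[$pair[1]] = $pair[2]; // $pair[1] is the name, $pair[2] the value
    }
    return $attrs;
}

$xml = '<banner piclink="pic" urlactive="url_active" urltarget="globaltgt" '
     . 'urllink="globallink" timevar="globaldelay" swf="0" smooth="1" '
     . 'name="name" alt="alternate" />';

$attrs = bannerAttributes($xml);
$name  = $attrs['name'] ?? null; // null when the optional attribute is missing
$alt   = $attrs['alt'] ?? null;
```

The optional name and alt values are then read with ?? instead of being forced into the pattern itself.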

Related

regex with preg_match anything and all line breaks

I am not too good with regex and can't seem to find the answer.
I am writing a class file to check data types and "partially/best possible sanitise" any submitted data, as well as performing some other functions too. This is working on all data types (i.e. emails, URLs, phone numbers, int/signed/unsigned, words, passwords, various date formats, basic HTML, etc.).
I am having problems trying to match "anything" (this is the one data type I don't really need to check, but for consistency I need it to run through preg_match and always return true).
When I say "anything", I want it to match any text, numbers, symbols AND line breaks. It is the line breaks I am having problems with.
I am using:
define('REG_TEXT', '/^(.*)$/');
preg_match(REG_TEXT, $data)
This works fine on the first paragraph, but won't match past any line breaks, so it returns false.
An example of what I want this to match (return true) would be:
this is a test match on anything 345 +_)(*&^%$£"!<br><html> <?php echo this i PHP; ?>
and match this too on a new line
and match all this line too
and anything else at all
I am not worried about any code inputted into the data at this point, as other areas of my class deal with this (before this stage!).
Basically I am after a regex that will match/return true on absolutely anything.
(I don't want to change to preg_match_all, as this would break other aspects of the class or require me to add additional code that would partially repeat existing code, which I don't think is needed.)
Any advice would be greatly welcomed!
thanks
Jon
Use:
'/^(.*)$/ms'
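The s modifier (PCRE_DOTALL) is the one that matters here: it makes . match line breaks as well. The m modifier (PCRE_MULTILINE) only changes ^ and $ so that they also match at the start and end of each line. http://php.net/manual/en/reference.pcre.pattern.modifiers.php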
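A quick way to see the difference is to run the same pattern with and without the s modifier. This is only a sketch; the sample text in $data is invented for the example.

```php
<?php
// Sample multi-line input (made up for this example).
$data = "this is a test 345 +_)(*&^%\nand match this too on a new line\nand anything else at all";

// Without s, . stops at the first line break and $ cannot reach the end.
$withoutS = preg_match('/^(.*)$/', $data);

// With s, . also matches line breaks, so the whole string is captured.
$withS = preg_match('/^(.*)$/s', $data, $m);

var_dump($withoutS, $withS, $m[1] === $data);
```

With the s modifier alone, preg_match returns 1 and the first capture group holds the complete input.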

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this, that'd be awesome. I have a custom-built forum and text-based MMORPG, and every input is sanitized and parsed for bbcode-like markup. It also picks out URLs and turns them into proper links that go to an exit page with a disclaimer that you're leaving the site. The issue I'm having is that when a user posts multiple URLs in a text box (let's say \n delimited), it only converts every other URL into a link. Here's the parser for URLs:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see, it calls a PHP function, but that's not the issue here. The entire text block is passed into this preg_replace at once, rather than line by line or by any other means.
If there's a simpler way of writing this preg_replace, please let me know.
If you can figure out why this is only parsing every other URL, that's my ultimate goal here.
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
The e flag in preg_replace is deprecated (as of PHP 5.5, and removed in PHP 7.0). You can use preg_replace_callback to get the same functionality.
The i flag is mostly redundant here, since \w already matches both upper case and lower case and there is no backreference in your pattern. It only affects the literal www\. alternative, so keep it if you also need to match WWW.
I set the m flag, which makes ^ and $ match the beginning and the end of each line, rather than the beginning and the end of the entire string. This should fix your problem of matching every other line: the trailing group consumes the line break after a URL, so without the m flag nothing is left for the leading group to match before the next URL.
I also made some of the groups non-capturing with (?:pattern), since the bigger capturing groups have captured the text already.
The code below is not tested. I only tested the regex on a regex tester.
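$markup = preg_replace_callback(
    // m flag: ^ and $ match at the start and end of every line.
    "/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
    function ($m) {
        // $m[1]: leading delimiter, $m[2]: the URL, $m[3]: trailing delimiter.
        return $m[1] . shortURL($m[2]) . $m[3];
    },
    $markup
);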
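To check the effect of the m flag, the same pattern can be run with and without it on a small newline-delimited input. This is a sketch under assumptions: shortURL() here is a hypothetical stand-in that just wraps the URL in brackets, and linkify() and the a.example URLs are invented for the demonstration.

```php
<?php
// Hypothetical stand-in for the asker's shortURL() helper.
function shortURL(string $url): string
{
    return '[' . $url . ']';
}

// Runs the answer's pattern with the given flags; $count receives the number of matches.
function linkify(string $markup, string $flags, ?int &$count = null): string
{
    $pattern = '/(^|[^="\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/' . $flags;

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            return $m[1] . shortURL($m[2]) . $m[3];
        },
        $markup,
        -1,
        $count
    );
}

$input = "http://a.example/1\nhttp://a.example/2\nhttp://a.example/3";

$without = linkify($input, '', $countWithout); // no m flag: every other URL is skipped
$with    = linkify($input, 'm', $countWith);   // m flag: ^ matches at each line start

echo $countWithout, " vs ", $countWith, "\n";
```

Without the flag only the first and third URLs are replaced; with it, all three are.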

Php regex match a string between two html tags with the tags been unknown

Ok, so here's my issue:
I have a link, say: http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV
And the link is between two tags say like this:
<br>http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV<br></p>
Using this regex with preg_replace:
'#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i'
As such:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "***",$strText);
The resulting string is:
<br***p>
Which is wrong!!
It should have been
<br>***<br></p>
How can I get the desired result? I have racked my brain trying to solve this one.
I would like to mention that str_replace replaces even the link within another valid link, so it's not a good method, I need an exact match between two boundaries, even if the boundary is text or another HTML tag.
Assuming you don't want to use a DOM parser for some reason, I believe doing what you intended is as simple as the following:
preg_replace('#(^|[^\/]|[^>])('.addcslashes($link,'.?+').')([^\w\/]|[^<]$)#i', "$1***$3",$strText);
This uses $1 and $3 to put back the delimiting text you matched in your regular expression.
As others have pointed out, using a DOM parser is more reliable.
Does this do what you want?
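A variation on the same idea, offered only as a sketch: instead of capturing the delimiters and putting them back, lookarounds can assert the boundaries without consuming them, and preg_quote can escape the link in place of addcslashes. The helper name maskLink and the exact boundary rule (no word character or slash directly before or after the link) are assumptions for this example.

```php
<?php
// Hypothetical helper: replaces $link only where it is not part of a longer URL.
function maskLink(string $text, string $link): string
{
    // preg_quote escapes every regex metacharacter in the link, plus the # delimiter.
    $pattern = '#(?<![\w/])' . preg_quote($link, '#') . '(?![\w/])#i';

    return preg_replace($pattern, '***', $text);
}

$link = 'http://www.blablabla.com/watch?v=1lyu1KKwC74&feature=list_other'
      . '&playnext=1&list=AL94UKMTqg-9CfMhPFKXPXcvJ_j65v7UuV';

$masked    = maskLink('<br>' . $link . '<br></p>', $link);
$untouched = maskLink($link . '/more', $link); // part of a longer link, left alone

echo $masked, "\n";
```

Because the lookarounds consume nothing, the replacement string needs no $1 or $3.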

Stuck with regexp

I'm stuck with the PHP preg_match_all function. Maybe someone will help me with the regexp. Let's assume we have some code:
[a]a[/a]
[s]a[/s]
[b]1[/b]
[b]2[/b]
...
...
[b]n[/b]
[e]a[/e]
[b]8[/b]
[b]9[/b]
...
...
[b]n[/b]
I need to match everything inside the [b] tags located between the [s] and [e] tags. Any ideas?
If your structure is exactly the same as above, I would personally avoid regex (not a good idea with this sort of language) and just check the second char of each line. Once you see an s, go into consume mode, and for each line until you see an e, find the first ] and read in everything between that and the next [.
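A minimal sketch of that line-by-line scan, assuming one tag per line exactly as in the question. The function name collectBetween and the sample text are invented for the example.

```php
<?php
// Hypothetical helper: collects the contents of [b] lines found after an [s]
// line and before the next [e] line.
function collectBetween(string $text): array
{
    $inside = false; // true while we are between [s] and [e]
    $values = [];

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);

        if (strncmp($line, '[s]', 3) === 0) {
            $inside = true;
        } elseif (strncmp($line, '[e]', 3) === 0) {
            $inside = false;
        } elseif ($inside && strncmp($line, '[b]', 3) === 0) {
            // Read everything between the first ] and the next [.
            $end = strpos($line, '[', 3);
            if ($end !== false) {
                $values[] = substr($line, 3, $end - 3);
            }
        }
    }
    return $values;
}

$sample = "[a]a[/a]\n[s]a[/s]\n[b]1[/b]\n[b]2[/b]\n[e]a[/e]\n[b]8[/b]\n[b]9[/b]";
$found  = collectBetween($sample);
```

The [b] lines after [e] are ignored because the flag has been switched off by then.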
For simplicity, use two matching steps.
First, use preg_match to retrieve the block you want to inspect: /\[s](.+?)\[e]/s.
Then run preg_match_all on that result string to collect the contained items: /\[b](.+?)\[\/b]/s.
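Put together, the two steps could look roughly like this. It is a sketch only; the function name bTagsInsideBlock and the sample text are made up for the example.

```php
<?php
// Hypothetical helper combining the two patterns from the answer.
function bTagsInsideBlock(string $text): array
{
    // Step 1: grab everything from the first [s] up to the next [e].
    if (!preg_match('/\[s](.+?)\[e]/s', $text, $outer)) {
        return [];
    }

    // Step 2: collect the contents of every [b]...[/b] inside that block.
    preg_match_all('/\[b](.+?)\[\/b]/s', $outer[1], $inner);

    return $inner[1];
}

$sample = "[a]a[/a]\n[s]a[/s]\n[b]1[/b]\n[b]2[/b]\n[e]a[/e]\n[b]8[/b]\n[b]9[/b]";
$found  = bTagsInsideBlock($sample);
```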
It looks like you are trying to pattern match something that has a treelike structure, essentially like HTML or XML. Any time you find yourself saying "find X located inside matching Y tags" you are going to have this problem.
Trying to do this sort of work with regular expressions is a Bad Idea.
Here's some info copy/pasted from a different answer of mine for a similar question:
Some references to similar SO posts which will give you an idea of the difficulty you're getting into:
Regex to match all HTML tags except <p> and </p>
Regex to replace all \n in a String, but no those inside [code] [/code] tag
RegEx match open tags except XHTML self-contained tags - bobince says it much more thoroughly than I do (:
The "Right Thing" to do is to parse your input, maintaining state as you go. This can be as simple as scanning your text and keeping a stack of current tags.
Regular expressions alone aren't sufficient to parse XML, and this appears to be a simplified XML language here.

How to match anything except a pattern between two tags

I am attempting to match a string which is composed of HTML. Basically it is an image gallery so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex, which only returns an array with two NULL indices:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see I use negative lookahead to try and see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results. I've also tried using +? instead of just + to no avail. Keep in mind that there's no pattern <dl><dl></dl> or anything, but that my regex is either matching the first <dl> and the last </dl> or nothing at all.
Now I realize . won't match line breaks, but I've tried everything I could imagine there and it still gives me either the NULL indices or nearly the whole string (from the very first occurrence of <dl to </dl></div>, which includes several other occurrences of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest using tidy instead. You can easily extract all the desired tags with their contents, even from broken HTML.
In general I would not recommend writing a parser using regex.
See http://www.php.net/tidy
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks"
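Another way to say "match all characters, even line breaks" is the s modifier, which lets . match newlines. The sketch below is a variation rather than the pattern above: it keeps the negative lookahead from the question, checks it before each character, and adds the s modifier. The sample HTML in $foo is invented for the example and assumes </div> follows the last </dl> directly.

```php
<?php
// Sample gallery markup (made up for this example).
$foo = "<div><dl>one</dl>\n<dl>two\nlines</dl></div>";

// (?:(?!<dl).)+? consumes one character at a time, refusing to step onto a new <dl.
// The s modifier lets . match line breaks inside the block.
preg_match_all('/<dl(?:(?!<dl).)+?<\/dl>(?=<\/div>)/s', $foo, $bar);

print_r($bar[0]);
```

Only the last <dl> block before </div> is returned, line break included.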
