Regular expression to put <P> AND <ul>/<ol> into array - php

I'm searching for a function in PHP to put every paragraph element like <p>, <ul> and <ol> into an array. So that i can manipulate the paragraph, like displayen the first two paragraphs and hiding the others.
This function does the trick for the p-element. How can i adjust the regexp to also match the ul and ol? My tryout gives an error: complaining the < is not an operator...
function aantalP($in){
preg_match_all("|<p>(.*)</p>|U",
$in,
$out, PREG_PATTERN_ORDER);
return $out;
}
//tryout:
function aantalPT($in){
preg_match_all("|(<p> | <ol>)(.*)(</p>|</o>)|U",
$in,
$out, PREG_PATTERN_ORDER);
return $out;
}
Can anyone help me?

You can't do this reliably with regular expressions. Paragraphs are mostly OK because they're not nested generally (although they can be). Lists however are routinely nested and that's one area where regular expressions fall down.
PHP has multiple ways of parsing HTML and retrieving selected elements. Just use one of those. It'll be far more robust.
Start with Parse HTML With PHP And DOM.
If you really want to go down the regex route, start with:
function aantalPT($in){
preg_match_all('!<(p|ol)>(.*)</\1>!Us', $in, $out);
return $out;
}
Note: PREG_PATTERN_ORDER is not required as it is the default value.
Basically, use a backreference to find the matching tag. That will fail for many reasons such as nested lists and paragraphs nested within lists. And no, those problems are not solvable (reliably) with regular expressions.
Edit: as (correctly) pointed out, the regex is also flawed in that it used a pipe delimeter and you were using a pipe character in your regex. I generally use ! as that doesn't normally occur in the pattern (not in my patterns anyway). Some use forward slashes but they appear in this pattern too. Tilde (~) is another reasonably common choice.

First of all, you use | as delimiter to mark the beginning and end of the regular expression. But you also use | as the or sign. I suggest you replace the first and last | with #.
Secondly, you should use backreferences with capture of the start and end tag like such: <(p|ul)>(.*?)</\1>

Related

Regex for PHP seems simple but is killing me

I'm trying to make a replace in a string with a regex, and I really hope the community can help me.
I have this string :
031,02a,009,a,aaa,AZ,AZE,02B,975,135
And my goal is to remove the opposite of this regex
[09][0-9]{2}|[09][0-9][A-Za-z]
i.e.
a,aaa,AZ,AZE,135
(to see it in action : http://regexr.com?3795f )
My final goal is to preg_replace the first string to only get
031,02a,009,02B,975
(to see it in action : http://regexr.com?3795f )
I'm open to all solution, but I admit that I really like to make this work with a preg_replace if it's possible (It became something like a personnal challenge)
Thanks for all help !
As #Taemyr pointed out in comments, my previous solution (using a lookbehind assertion) was incorrect, as it would consume 3 characters at a time even while substrings weren't always 3 characters.
Let's use a lookahead assertion instead to get around this:
'/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*/'
The above matches the beginning of the string or a comma, then checks that what follows does not match one of the two forms you've specified to keep, and given that this condition passes, matches as many non-comma characters as possible.
However, this is identical to #anubhava's solution, meaning it has the same weakness, in that it can leave a leading comma in some cases. See this Ideone demo.
ltriming the comma is the clean way to go there, but then again, if you were looking for the "clean way to go," you wouldn't be trying to use a single preg_replace to begin with, right? Your question is whether it's possible to do this without using any other PHP functions.
The anwer is yes. We can take
'/(^|,)foo/'
and distribute the alternation,
'/^foo|,foo/'
so that we can tack on the extra comma we wish to capture only in the first case, i.e.
'/^foo,|,foo/'
That's going to be one hairy expression when we substitute foo with our actual regex, isn't it. Thankfully, PHP supports recursive patterns, so that we can rewrite the above as
'/^(foo),|,(?1)/'
And there you have it. Substituting foo for what it is, we get
'/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/'
which indeed works, as shown in this second Ideone demo.
Let's take some time here to simplify your expression, though. [0-9] is equivalent to \d, and you can use case-insensitive matching by adding /i, like so:
'/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i'
You might even compact the inner alternation:
'/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i'
Try it in more steps:
$newList = array();
foreach (explode(',', $list) as $element) {
if (!preg_match('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $element) {
$newList[] = $element;
}
}
$list = implode(',', $newList);
You still have your regex, see! Personnal challenge completed.
Try matching what you want to keep and then joining it with commas:
preg_match_all('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $input, $matches);
$result = implode(',', $matches);
The problem you'll be facing with preg_replace is the extra-commas you'll have to strip, cause you don't just want to remove aaa, you actually want to remove aaa, or ,aaa. Now what when you have things to remove both at the beginning and at the end of the string? You can't just say "I'll just strip the comma before", because that might lead to an extra comma at the beginning of the string, and vice-versa. So basically, unless you want to mess with lookaheads and/or lookbehinds, you'd better do this in two steps.
This should work for you:
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
echo ltrim(preg_replace('/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]+/', '', $s), ',');
OUTPUT:
031,02a,009,02B,975
Try this:
preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string);
this will remove all substrings starting with the start of the string or a comma, followed by a non allowed first character, up to but excluding the following comma.
As per #GeoffreyBachelet suggestion, to remove residual commas, you should do:
trim(preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string), ',');

Nested regex... I'm clueless!

I'm pretty clueless when it comes to PHP and regex but I'm trying to fix a broken plugin for my forum.
I'd like to replace the following:
<blockquote rel="blah">foo</blockquote>
With
<blockquote class="a"><div class="b">blah</div><div class="c"><p>foo</p></div></blockquote>
Actually, that part is easy and I've already partially fixed the plugin to do this. The following regex is being used in a call to preg_replace_callback() to do the replacement:
/(<blockquote rel="([\d\w_ ]{3,30})">)(.*)(<\/blockquote>)/u
The callback code is:
return <<<BLOCKQUOTE
<blockquote class="a"><div class="b">{$Matches[2]}</div><div class="c"><p>{$Matches[3]}</p></div></blockquote>
BLOCKQUOTE;
And that works for my above example (non-nested blockquotes). However, if the blockquotes are nested, such as in the following example:
<blockquote rel="blah">foo <blockquote rel="bloop">bar ...maybe another nest...</blockquote></blockquote>
It doesn't work. So my question is, how can I replace all nested blockquotes using a combination of regex/PHP? I know recursive patterns are possible in PHP with (?R); the following regex will extract all nested blockquotes from the string containing them:
/(<blockquote rel="([\d\w_ ]{3,30})">)(.*|(?R))(<\/blockquote>)/s
But from there on I'm not quite sure what to do in the preg_replace_callback() callback to replace each nested blockquote with the above replacement.
Any help would be appreciated.
The simple answer is that you can't do this with regex. The language of nested tags (or parens, or brackets, or anything) of an arbitrary depth is not regular and hence cannot be matched with a regular expression. I would suggest you use a DOM parser or - if absolutely necessary for some weird reason - write your own parsing scheme.
The complicated answer is that you might be able to do this with some really ugly, hacky regex and PHP code, but I wouldn't advise it to be quite honest.
See also: The Chomsky hierarchy.
Also see also:
Matching nested [quote] using regexp
Find <td> with specific class including nested tables
How using regex can I capture the outer HTML element when the same element type is nested within it?
There's no direct support for recursive substitutions, and preg_replace_callback() isn't particularly useful in this case. But there's nothing stopping you doing the substitution in multiple passes. The first pass takes care of the outermost tags, and subsequent passes work their way inward. The optional $count argument tells you how many replacements were performed in each pass; when it comes up zero, you're done.
$regex = '~(<BQ rel="([^"]++)">)((?:(?:(?!</?+BQ\b).)++|(?R))*+)(</BQ>)~s';
$sub = '<BQ class="a"><div class="b">$2</div><div class="c"><p>$3</p></div></BQ>';
do {
$s = preg_replace($regex, $sub, $s, -1, $count);
} while ($count != 0);
See it in action on ideone.com

Regular expression to match a certain HTML element

I'm trying to write a regular expression for matching the following HTML.
<span class="hidden_text">Some text here.</span>
I'm struggling to write out the condition to match it and have tried the following, but in some cases it selects everything after the span as well.
$condition = "/<span class=\"hidden_text\">(.*)<\/span>/";
If anyone could highlight what I'm doing wrong that would be great.
You need to use a non-greedy selection by adding ? after .* :
$condition = "/<span class=\"hidden_text\">(.*?)<\/span>/";
Note : If you need to match generic HTML, you should use a XML parser like DOM.
You shouldn’t try to use regular expressions on a non-regular language like HTML. Better use a proper HTML parser to parse the document.
See the following questions for further information on how to do that with PHP:
How to parse HTML with PHP?
Best methods to parse HTML
$condition = "/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/";
I got it. ;)
Chances are that you have multiple spans, and the regexp you're using will default to greedy mode
It's a lot easier using PHP's DOM Parser to extract content from HTML
I think this is what they call a teachable moment. :P Let us now compare and contrast the regex in your self-answer:
"/<span class=\"hidden_text\">(?<=^|>)[^><]+?(?=<|$)<\/span>/"
...and this one:
'~<span class="hidden_text">[^><]++</span>~'
PHP's double-quoted strings are subject to interpolation of embedded variables ($my_var) and evaluation of source code wrapped in braces ({return "foo"}). If you aren't using those features, it's best to use single-quoted strings to avoid surprises. As a bonus, you don't have to escape those double-quotes any more.
PHP allows you to use almost any ASCII punctuation character for the regex delimiters. By replacing your slashes with ~ I eliminated the need to escape the slash in the closing tag.
The lookbehind - (?<=^|>) - was not doing anything useful. It would only ever be evaluated immediately after the opening tag had been matched, so the previous character was always >.
[^><]+? is good (assuming you don't want to allow other tags in the content), but the quantifier doesn't need to be reluctant. [^><]+ can't possibly overrun the closing </span> tag, so there's point sneaking up on it. In fact, go ahead and kick the door in with a possessive quantifier: [^><]++.
Like the lookbehind before it, (?=<|$) was only taking up space. If [^><]+ consumes everything it can and the next character not <, you don't need a lookahead to tell you the match is going to fail.
Note that I'm just critiquing your regex, not fixing it; your regex and mine would probably yield the same results every time. There are many ways both of them can go wrong, even if the HTML you're working with is perfectly valid. Matching HTML with regexes is like trying to catch a greased pig.

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : i_want_this</h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP uses a pretty close version to PCRE (PERL Regex). If you want to know a lot about regex, visit perlretut.org. Also, look into Regex generators like exspresso.
For your use, know that regex is greedy. That means that when you specify that you want something, follwed by anything (any repetitions) followed by something, it will keep on going until that second something is reached.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (capture anything inside of parenthesis as a group for reference later, also called back references).
I would also look into named subpatterns, too - with those, you can reference your choice with a human readable string rather than an array index. Syntax for those in PHP are (?P<name>pattern) where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : i_want_this</h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags that we see are just snippits of a larger whole as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will probably have a more elegant solution, but I think this will do what you want to done.
Where:
$subject = "<h3><b>File</b> : i_want_this</h3>";
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
$pattern2 = '()(.*)()';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
The thing to remember is that regex's return everything you searched for if it matches. You need to specify that only care about the part you've surrounded in parenthesis (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = 'i_want_this'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs 'i_want_this'
If you specify what you want in parenthesis, you can reference it:
string = 'i_want_this'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = 'i_want_this';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#(.*?)#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#(.*?)#
If it should have just 5 numbers:
#(.*?)#
If it should have between 3 and 6 numbers:
#(.*?)#
If it should have more than 2 numbers:
#(.*?)#
This should work:
<a href="[^"]*">([^<]*)
this says that take EVERYTHING you find until you meet "
[^"]*
same! take everything with you till you meet <
[^<]*
The paratese around [^<]*
([^<]*)
group it! so you can collect that data in PHP! If you look in the PHP manual om preg_match you will se many fine examples there!
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some examples...
.*?
can be extremely slow! Shoudln't use that if you can use
[^<]*
You should use the tool Expresso for creating regular expression... Pretty handy..
http://www.ultrapico.com/Expresso.htm

regular expression to match html tag with specific contents

I am trying to write a regular expression to capture this string:
<td style="white-space:nowrap;">###.##</td>
I can't even match it if include the string as it is in the regex pattern!
I am using preg_match_all(), however, I am not finding the correct pattern. I am thinking that "white-space:nowrap;" is throwing off the matching in some way. Any idea? Thanks ...
Why not try using DOM document instead? Then you do not have to worry about having the HTML formatted properly. Using the Dom Doc collection will also improve readability and ensure fast performance since its part of the PHP Core rather then living in user space
When I'm having problems with regular expressions, I like to test them in real time with one of the following websites:
preg_match Regular Expression Tester
Regular Expression Test Tool
Did you see any warnings? You have to escape some bits of that, namely the / before the td close tag. This seemed to work for me:
$string='cow cow cow <td style="white-space:nowrap;">###.##</td> cat cat cat cat';
php > preg_match_all('/<td style="white-space:nowrap;">###\.##<\/td>/',$string,$result);
php > var_dump($result);
array(1) {
[0]=>
array(1) {
[0]=>
string(43) "<td style="white-space:nowrap;">###.##</td>"
}
}
Are you aware that the regex argument to any of PHP's preg_ functions has to be double-delimited? For example:
preg_match_all(`'/foo/'`, $target, $results)
'...' are the string delimiters, /.../ are the regex delimiters, and the actual regex is foo. The regex delimiters don't have to be slashes, they just have to match; some popular choices are #...#, %...% and ~...~. They can also be balanced pairs of bracketing characters, like {...}, (...), [...], and <...>; those are much less popular, and for good reason.
If you leave out the regex delimiters, the regex-compilation phase will probably fail and the error message will probably make no sense. For example, this code:
preg_match_all('<td style="white-space:nowrap;">###.##</td>', $s, $m)
...would generate this message:
Unknown modifier '#'
It tries to use the first pair of angle brackets as the regex delimiters, and whatever follows the > as the regex modifiers (e.g., i for case-insensitive, m for multiline). To fix that, you would add real regex delimiters, like so:
preg_match_all('%<td style="white-space:nowrap;">###\.##</td>%i', $s, $m)
The choice of delimiter is a matter of personal preference and convenience. If I had used # or /, I would have had to escape those characters in the actual regex. I escaped the . because it's a regex metacharacter. Finally, I added the i modifier to demonstrate the use of modifiers and because HTML isn't case sensitive.

Categories