Regular Expression formatting help required - php

I am trying to remove a part of a document on the fly using preg_replace().
/* target example:
<li id="footer-poweredbyico">
<img src="//bits.wikimedia.org/skins-1.18/common/images/poweredby_mediawiki_88x31.png" alt="Powered by MediaWiki" width="88" height="31" />
</li>
*/
$reg = preg_quote('<li id="footer-poweredbyico">.*?</li>');
preg_replace($reg,"",$str);
Ignore any errors in the PHP; this question is about how to format the regular expression correctly so that it removes anything matching the opening and closing tags of the target example. The contents of the containing HTML tags will be different each time, hence .*? (I think that's wrong).

The preg_quote function actually does the opposite of what you want: its purpose is to disable all regex-features in a string. So in your case, what you currently have is (roughly) looking for an actual .*? in your HTML, instead of looking for zero or more characters. What you want is:
$str = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);
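To see the difference, here is a small sketch. The markup in $str is made up for illustration (shortened image path, an extra list item); only the two patterns come from the discussion above.

```php
<?php
$str = <<<'HTML'
<ul>
<li id="footer-poweredbyico">
<img src="poweredby_mediawiki_88x31.png" alt="Powered by MediaWiki" width="88" height="31" />
</li>
<li id="footer-other">keep me</li>
</ul>
HTML;

// preg_quote() escapes the regex metacharacters, so ".*?" turns into the literal text ".*?".
$quoted = preg_quote('<li id="footer-poweredbyico">.*?</li>', '/');
$unchanged = preg_replace('/' . $quoted . '/s', '', $str);

// Unquoted pattern: "." matches any character, and the s modifier lets it cross newlines.
$cleaned = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);

var_dump($unchanged === $str); // the quoted pattern found nothing to remove
echo $cleaned, "\n";
```

The second argument to preg_quote() only matters when the quoted text contains the delimiter, as the slash in </li> does here.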

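The .*? portion of your regex is being escaped, so it only matches a literal .*? and never the list item's contents. Quote just the fixed parts, add delimiters plus the s modifier so that . also matches newlines, and assign the result. Try this.
$reg = '/' . preg_quote('<li id="footer-poweredbyico">', '/') . '.*?' . preg_quote('</li>', '/') . '/s';
$str = preg_replace($reg, '', $str);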

You don't need to use this hack approach; read the FAQ:
"How can I edit / remove the Powered by MediaWiki image in the footer?"

preg_quote() will disable all the special characters you used, like .*?.
Try something like:
preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $str);
Now, the difficult question is whether to make this regex "greedy". Right now, it's ungreedy, which means it will break your page if there's another <li> inside the one you're trying to remove. But if you make it greedy, it will remove everything from the beginning of the <li> tag until the end of the last <li> element in the page, even if it's a different <li> element. Neither is ideal. This is why a proper HTML parser usually does a better job at manipulating HTML.
But if the page is simple enough, a regex will work.
EDIT: Corrected a gross error, thanks to @Nilpo.
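A minimal sketch of that trade-off, assuming a made-up footer in which the target item contains a nested list (the ids other than footer-poweredbyico are hypothetical). It also shows one way a DOM parser could do the same job; this is an illustration, not the MediaWiki-recommended method.

```php
<?php
$html = '<ul id="footer-icons">'
      . '<li id="footer-poweredbyico"><ul><li>nested</li></ul>'
      . '<img src="x.png" alt="Powered by MediaWiki" /></li>'
      . '<li id="footer-other">keep</li>'
      . '</ul>';

// Ungreedy: stops at the first </li>, which belongs to the nested item.
$ungreedy = preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $html);

// Greedy: runs to the last </li> and takes the sibling item with it.
$greedy = preg_replace('#<li id="footer-poweredbyico">.*</li>#s', '', $html);

// DOM: remove exactly the element with that id, whatever it contains.
libxml_use_internal_errors(true); // silence warnings about the HTML fragment
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//li[@id="footer-poweredbyico"]') as $node) {
    $node->parentNode->removeChild($node);
}
$viaDom = $doc->saveHTML();

echo $ungreedy, "\n", $greedy, "\n", $viaDom, "\n";
```

Note that saveHTML() returns a whole document, so the output is wrapped in html and body tags that loadHTML() added.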

Related

regex for preg_match not working

I need to scrape some data from a website. For that I am using preg_match, but I am not able to write the regex for it. The data on the website is
title="Russia"/></a>
<small>*</small> <a href="/profile/roman
I have written the regex as #title=\"Russia\"\/><\/a>((\n|\r)*)<small>*<\/small> <a href=\"/profile/(.+?)\"#sx
But this is not working and I don't know why. When I echo my regex it says #title="Russia"\/><\/a>(( | )*)*<\/small> . Where have the other parts gone? And why is it not working?
Try this:
#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
I have escaped the * because it's a metacharacter. Without the escape, >* means "zero or more > characters", so the pattern looks for <small followed by any number of >s and never matches the literal asterisk in your markup.
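A quick way to check this, assuming the real page closes the href attribute (the question's excerpt is cut off after "roman", so the closing quote and link text below are assumptions):

```php
<?php
// Sample markup modelled on the question; everything after /profile/roman is assumed.
$html = "<img title=\"Russia\"/></a>\n<small>*</small> <a href=\"/profile/roman\">roman</a>";

// Escaped asterisk, \s for whitespace (a literal space would be ignored under the x modifier).
$pattern = '#title="Russia"/></a>(\s*)<small>\*</small>\s+<a\s+href="/profile/(.+?)"#sx';

$found = preg_match($pattern, $html, $m);
$name  = $found ? $m[2] : null;

// Unescaped asterisk: ">*" quantifies the ">" instead of matching "*".
$unescaped = preg_match('#<small>*</small>#', $html);

var_dump($name, $unescaped);
```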
You really should not use regexes to evaluate markup content, especially when you acquire it by scraping pages.
In your case there are at least three reasons that might be responsible for breaking your regex.
Do not attempt to write your own whitespace evaluators when you can simply use \s, which stands for "any whitespace character".
In regular expressions the asterisk (*) has a special meaning, which is why you can't simply use it to identify asterisks. If you want to collect the content inside the small element you should use <small>(.*)</small> instead. If on the other hand you are actually expecting an asterisk then you have to escape it like this: <small>\*</small>.
Your regex expects a closing quote for your href attribute on that last <a> but in your sample markup you have none. Provided that on the original page you do have a closing quote the following regex should do the trick.
#title=\"Russia\"/></a>(\s*)<small>\*</small>\s+<a\s+href=\"/profile/(.+?)\"#sx
However, once again I have to advise using a DOM parser like DOMDocument for this, not only because it is much more reliable when handling markup content but also because it can interpret bad markup as well (if it's loaded as HTML, of course).
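For comparison, a rough DOMDocument sketch. The surrounding markup (the wrapping paragraph, the flag link, the image file name) is invented so the fragment parses as a complete snippet, and the XPath expression is just one possible way to phrase "the profile link that follows the Russia flag".

```php
<?php
// Hypothetical markup built around the two lines quoted in the question.
$html = '<p><a href="/country/ru"><img src="ru.png" title="Russia"/></a>' . "\n"
      . '<small>*</small> <a href="/profile/roman">roman</a></p>';

libxml_use_internal_errors(true); // tolerate sloppy real-world markup
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Every <a> after the Russia image whose href starts with /profile/.
$query = '//img[@title="Russia"]/following::a[starts-with(@href, "/profile/")]';

$profiles = [];
foreach ($xpath->query($query) as $a) {
    // Strip the fixed prefix to keep only the profile name.
    $profiles[] = substr($a->getAttribute('href'), strlen('/profile/'));
}

print_r($profiles);
```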

Regular Expression to Search+Replace href="URL"

I'm useless with regular expressions and haven't been able to google myself a clear solution to this one.
I want to search+replace some text ($content) for any url inside the anchor's href with a new url (stored as the variable $newurl).
Change this:
<img alt="foobar" src="http://blogurl.com/files/2011/03/foobar_thumb.jpg" />
To this:
<img alt="foobar" src="http://blogurl.com/files/2011/03/foobar_thumb.jpg" />
I imagine using preg_replace would be best for this. Something like:
preg_replace('Look for href="any-url"',
'href="$newurl"',$content);
The idea is to get all images on a WordPress front page to link to their posts instead of to full sized images (which is how they default). Usually there would be only one url to replace, but I don't think it would hurt to replace all potential matches.
Hope all that made sense and thanks in advance!
Here is the gist of what I came up with. Hopefully it helps someone:
$content = get_the_content();
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$newurl = get_permalink();
$content = preg_replace($pattern,$newurl,$content);
echo $content;
Mucho thanko to @WiseGuyEh
This should do the trick:
(?<=href=("|'))[^"']+(?=("|'))
It uses a lookbehind and a lookahead to assert that whatever it matches is preceded by href=" or href=' and followed by a single or double quote.
Note: the regex cannot tell whether the HTML is valid. If an href value opens with one kind of quote and closes with the other, the pattern will still match it without complaint.
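Here is a small sketch of the pattern in use. The anchor markup and the value of $newurl are assumptions (in WordPress the URL would come from get_permalink()); the point is only that the href value changes while the src value is left alone.

```php
<?php
// Hypothetical thumbnail linked to its full-size image.
$content = '<a href="http://blogurl.com/files/2011/03/foobar.jpg">'
         . '<img alt="foobar" src="http://blogurl.com/files/2011/03/foobar_thumb.jpg" /></a>';

// Assumed post URL standing in for get_permalink().
$newurl = 'http://blogurl.com/2011/03/foobar-post/';

// Lookbehind: must follow href=" or href='. Lookahead: must be followed by a quote.
$pattern = '/(?<=href=("|\'))[^"\']+(?=("|\'))/';

$result = preg_replace($pattern, $newurl, $content);
echo $result, "\n";
```

Because the quotes sit inside lookarounds, they are not part of the match and do not need to be put back in the replacement.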

What is the correct regex (for PHP preg_replace) to remove empty paragraph ( <p> ) tags?

I'm working in Wordpress and need to be able to remove images and empty paragraphs. So far, I've found out how to remove images without a problem. But, I then need to remove empty paragraph tags. I'm using PHP preg_replace to handle the regex functions.
So, as an example, I have the string:
<p style="text-align:center;"><img src="http://www.blah.com/image.jpg" alt="Blah Image" /></p><p>Some text</p>
I run this regex on it:
/<img.*?(>)/
And I end up with this string:
<p style="text-align:center;"></p><p>Some text</p>
I then need to be able to remove the empty paragraph. I tried this, but it removes all paragraphs and the contents of the paragraphs:
/<p[^>]*><\/p[^>]*>/
Any help/suggestions is greatly appreciated!
The correct regex is no regex. Use an HTML/DOM Parser instead. They're simple to use. Regex is for regular languages (which HTML is not).
/<p[^>]*><\/p[^>]*>/ (the regex you gave) should work fine. If it's giving you trouble you could try double-escaping the / like this: /<p[^>]*><\\/p[^>]*>/
PHP is funny about quoting and escape characters. For example "\n" is not equal to '\n'. The first is a line break, the second is a literal backslash followed by an 'n'. The PHP manual entry on string literals is probably worth a quick look.
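A short sketch using the string from the question. The \s* between the tags is my addition (an assumption that "empty" should include whitespace-only paragraphs). The last lines illustrate the quoting point: "\n" and '\n' differ, whereas '\\/' and '\/' end up as the same two characters in a PHP string.

```php
<?php
$html = '<p style="text-align:center;"><img src="http://www.blah.com/image.jpg" alt="Blah Image" /></p><p>Some text</p>';

// Step 1: strip the images, as in the question.
$noImages = preg_replace('/<img.*?(>)/', '', $html);

// Step 2: strip paragraphs that contain nothing but optional whitespace.
$noEmpty = preg_replace('/<p[^>]*>\s*<\/p>/', '', $noImages);

echo $noImages, "\n", $noEmpty, "\n";

// Quoting: double quotes interpret \n, single quotes do not.
$lengths = [strlen("\n"), strlen('\n')];

// In a single-quoted string \\ collapses to one backslash, so both spellings are identical.
$samePattern = ('/<\\/p>/' === '/<\/p>/');
```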

WordPress: Problem with the shortcode regex

This is the regular expression used for "shortcodes" in WordPress (one for the whole tag, other for the attributes).
return '(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)';
$pattern = '/(\w+)\s*=\s*"([^"]*)"(?:\s|$)|(\w+)\s*=\s*\'([^\']*)\'(?:\s|$)|(\w+)\s*=\s*([^\s\'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/';
It parses stuff like
[foo bar="baz"]content[/foo]
or
[foo /]
In the WordPress trac they say it's a bit flawed, but my main problem is that it doesn't support shortcodes inside attributes, as in
[foo bar="[baz /]"]content[/foo]
because the regex stops the main shortcode at the first appearance of a closing bracket, so in the example it renders
[foo bar="[baz /]
and
"]content[/foo]
shows as it is.
Is there any way to change the regex so that it skips over any [ ... ] pair, together with its content, when it occurs inside the opening or self-closing tag?
What is your goal? Even if WordPress’ regex were better, the shortcode would not be executed.
return '(.?)\[('.$tagregexp.')\b((?:"[^"]*"|.)*?)(?:/)?\](?:(.+?)\[\/\2\])?(.?)';
is a variation on the first regex where the bit that matches the attributes has been changed to capture strings completely without regard to what's in them:
(?:"[^"]*"|.)*?
instead of
.*?
Note that it doesn't handle strings with escaped quote characters in them (yet - can be done, but is it necessary?). I haven't changed anything else because I don't know the syntax for WordPress shortcodes.
But it looks like it could have been cleaned up a little by removing unnecessary backslashes and parentheses:
return '(.?)\[(foo)\b((?:"[^"]*"|.)*?)/?\](?:(.+?)\[/\2\])?(.?)';
Perhaps further improvements are warranted. I'm a bit worried about the imprecise dot in the above snippet, and I'd rather use (?:"[^"]*"|[^/\]])* instead of (?:"[^"]*"|.)*?, but I don't know whether that would break something else. Also, I don't know what the leading and trailing (.?) are good for. They don't match anything in your example so I don't know their purpose.
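To compare the two variants on the example from the question, here is a sketch. The single-entry $tagregexp and the ~ delimiter are assumptions made so the snippet runs on its own; with / as the delimiter the unescaped slashes in the cleaned-up version would have to stay escaped.

```php
<?php
$tagregexp = 'foo'; // stand-in for the alternation of registered shortcode tags
$text = '[foo bar="[baz /]"]content[/foo]';

// Original: the attribute part is a bare .*?, so it stops at the first ].
$original = '~(.?)\[(' . $tagregexp . ')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)~s';

// Variation: quoted strings are consumed whole before any other character is tried.
$modified = '~(.?)\[(' . $tagregexp . ')\b((?:"[^"]*"|.)*?)/?\](?:(.+?)\[/\2\])?(.?)~s';

preg_match($original, $text, $m1);
preg_match($modified, $text, $m2);

// Group 3 holds the attributes in both; the content is group 5 in the
// original but group 4 in the variation, which dropped one capture group.
var_dump($m1[3], $m1[5]);
var_dump($m2[3], $m2[4]);
```

The shifted group number is worth keeping in mind: code that reads the content from group 5 would need adjusting if the capture around the slash is removed.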
Do you want a drop-in replacement for that regex? This one allows attribute values to contain things that look like tags, as in your example:
'(.?)\[(\w+)\b((?:[^"\'\[\]]++|(?:"[^"]*+")|(?:\'[^\']*+\'))*+)\](?:(?<=(\/)\])|([^\[\]]*+)\[\/\2\])(.?)'
Or, in more readable form:
/(.?) # could be [
\[(\w+)\b # tag name
((?:[^"'\[\]]++ # attributes
|(?:"[^"]*+")
|(?:'[^']*+')
)*+
)\]
(?:(?<=(\/)\]) # slash if self-closing
|([^\[\]]*+) # ...or content
\[\/\2\] # ...and closing tag
)(.?) # could be ]
/x
As I understand it, $tagregexp in the original is an alternation of all the tag names that have been defined; I substituted \w+ for readability. Everything the original regex captures, this one does too, and in the same groups. The only difference is that the / in a self-closing tag is captured in group #3 along with the attributes as well as in its own group (#4).
I don't think the other regex needs to be changed unless you want to add full support for tags embedded in attribute values. That would also mean allowing for escaped quotes in this one, and I don't know how you would want to do that. Doubling them would be my guess; that's how Textpattern does it, and WordPress is supposedly based on that.
This question is a good example of why apps like WordPress shouldn't be implemented with regexes. The only way to add or change functionality is by making the regexes bigger and uglier and even harder to maintain.
I found a way to fix it:
First, change the shortcode regex from:
(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)
To:
(.?)\[('.$tagregexp.')\b((?:[^\[\]]|(?R)|.)*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)
And then change the priority of the do_shortcode filter to avoid a conflict with wptexturize, the function that stylizes the quotes and messes up this fix. It doesn't have problems with wpautop because that is somewhat fixed by another recent function, I think.
Before:
add_filter('the_content', 'do_shortcode', 11); // AFTER wpautop()
After:
add_filter('the_content', 'do_shortcode', 9);
I submitted this to the trac and it is on some kind of permanent hiatus. In the meantime I am trying to figure out whether I can make a plugin that applies my fix without changing the core files. Overriding the filter priority is easy, but I have no idea how to override the regex.
This would be nice to fix! I do not have sufficient rep to comment, so I am leaving the following related wordpress trac link, maybe it is the same as the one you meant:
http://core.trac.wordpress.org/ticket/14481
I would hope that any fix would allow shortcode syntax like
[shortcode att1="val]ue"]content[/shortcode]
since in 3.0.1 the $content is mis-parsed as ue"]content instead of just content
Update: After spending time learning about regices (regexes?) I made it possible to allow ] and Pascal-style escaped quotes (eg arg='that''s [so] great') in these arguments with 2 changes: first change the (.*?) group in the first regex (get_shortcode_regex) to
((?:[^'"\]]|'[^']*'|"[^"]*")*)
(NB: make sure you escape everything properly in your php code) then in shortcode_parse_atts (the function containing the second regex) change the following (again, change ' to \' if you single-quote $pattern like in the original code)
in $pattern change "([^"]*)" to "((?:[^"]|"")*)"
in $pattern change '([^']*)' to '((?:[^']|'')*)'
$atts[strtolower($m[1])] = preg_replace('_""_', '"', stripcslashes($m[2]));
$atts[strtolower($m[3])] = preg_replace("_''_", "'", stripcslashes($m[4]));
NB again: changes to pattern may rely on greedy nature of matching so if that option's ever changed, the changed bits of $pattern might have to be terminated with something like (?!"), etc
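A stand-alone sketch of both changes. The tag pattern below is a cut-down, hypothetical version of the shortcode regex (one hard-coded tag, no self-closing support) that only keeps the new attribute group, and str_replace() is used in place of the preg_replace() call above for brevity.

```php
<?php
// 1) Attribute group that lets ] appear inside quoted values.
$tagPattern = '~\[(shortcode)\b((?:[^\'"\]]|\'[^\']*\'|"[^"]*")*)\](.*?)\[/\1\]~s';
preg_match($tagPattern, '[shortcode att1="val]ue"]content[/shortcode]', $tag);

// 2) Single-quoted attribute branch accepting doubled quotes ('') as an escape.
$attPattern = '/(\w+)\s*=\s*\'((?:[^\']|\'\')*)\'(?:\s|$)/';
preg_match($attPattern, "arg='that''s [so] great'", $att);

// Collapse the doubled quotes back into single ones.
$value = str_replace("''", "'", $att[2]);

var_dump($tag[2], $tag[3], $att[1], $value);
```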

PHP Regex negation

I have a web bot which extracts some data from a website. The problem is that the HTML content is sent without line breaks, so it's a little bit harder to match certain things, and I need to extract everything that is between td tags. Here's a string example:
<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> (<b><font color="#3300cc">Used</font></b>)</td><td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> (<b><font color="#3300cc">Used</font></b>)</td>
And my regex so far:
<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>
But my regex matches the whole line instead of matching all contents. Any ideas?
Don't waste your time on regexes. Use DOM and XPath.
$doc = new DOMDocument(); $doc->loadHTML($html);
$links = $doc->getElementsByTagName('a');
Have you tried changing .+ to .+? ?
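A sketch of what that change does on a string shaped like the example. The second id (19623508) is altered from the question's sample so the two captures can be told apart, and the (?!<td>) lookahead is dropped in the lazy version since it no longer has anything to guard against.

```php
<?php
$cell = '<a class="a" href="javascript:ow(%d)">**-**-**-***.cstel.net</a> '
      . '(<b><font color="#3300cc">Used</font></b>)</td>';
$html = sprintf($cell, 19623507) . '<td>' . sprintf($cell, 19623508);

// Greedy .+ runs to the last </a> and the last </td>: one match for the whole line.
$greedy = '~<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>~';

// Lazy .+? stops at the nearest </a> and </td>: one match per cell.
$lazy = '~<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+?</a>.+?</td>~';

$greedyCount = preg_match_all($greedy, $html, $g);
$lazyCount   = preg_match_all($lazy, $html, $l);

var_dump($greedyCount, $lazyCount, $l[1]);
```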
Can you determine where the proper line breaks SHOULD be? If so, it might be easier to first replace those tokens with a proper line break and then use the pattern you have (assuming that pattern works - I haven't tried it).
Your pattern looks VERY specific, but perhaps it works fine for what you are doing.
