Regular Expression to Search+Replace href="URL" - php

I'm useless with regular expressions and haven't been able to google myself a clear solution to this one.
I want to search+replace some text ($content) for any url inside the anchor's href with a new url (stored as the variable $newurl).
Change this:
<img alt="foobar" src="http://blogurl.com/files/2011/03/foobar_thumb.jpg" />
To this:
<img alt="foobar" src="http://blogurl.com/files/2011/03/foobar_thumb.jpg" />
I imagine using preg_replace would be best for this. Something like:
preg_replace('Look for href="any-url"',
'href="$newurl"',$content);
The idea is to get all images on a WordPress front page to link to their posts instead of to full sized images (which is how they default). Usually there would be only one url to replace, but I don't think it would hurt to replace all potential matches.
Hope all that made sense and thanks in advance!

Here is the gist of what I came up with. Hopefully it helps someone:
$content = get_the_content();
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";
$newurl = get_permalink();
$content = preg_replace($pattern,$newurl,$content);
echo $content;
Mucho thanko to @WiseGuyEh

This should do the trick:
(?<=href=("|'))[^"']+(?=("|'))
It uses lookahead and lookbehind to assert that whatever it matches is preceded by href=" or href=' and followed by a single or double quote.
Note: the regex cannot tell whether the HTML document is valid. If an href value opens with a single quote and closes with a double quote (or vice versa), the regex will not notice the error and will still match.
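A minimal sketch of the pattern in use, outside WordPress. The sample markup and the replacement URL are made up for illustration (the question would use get_the_content() and get_permalink() instead). A callback is used here only so that the new URL is inserted literally, since preg_replace() interprets sequences like $1 in the replacement string:

```php
<?php
// Hypothetical sample markup; in WordPress this would come from get_the_content().
$content = '<a href="http://blogurl.com/files/2011/03/foobar.jpg">'
         . '<img alt="foobar" src="http://blogurl.com/files/2011/03/foobar_thumb.jpg" /></a>';
// Hypothetical replacement URL; the question uses get_permalink() here.
$newurl = 'http://blogurl.com/2011/03/foobar-post/';

// Lookbehind requires a preceding href=" or href='; lookahead requires a closing quote.
$pattern = '/(?<=href=("|\'))[^"\']+(?=("|\'))/';

// The callback returns the URL as-is, so any "$" or "\" in it is not
// treated as a backreference.
$result = preg_replace_callback($pattern, function ($m) use ($newurl) {
    return $newurl;
}, $content);

echo $result, "\n";
```

Only the href value changes; the src attribute is left alone because the lookbehind demands href= in front of the quote.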

Related

PHP regex to conditionally replace first occurrence of string

I need to do some cleanup on strings that look like this:
$author_name = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
Notice the href tag doesn't have closing quotes - I'm using the DOMParser on a large table of these to extract the text, and it borks on this.
I would like to look at the string in $author_name;
IF the first > does NOT have a " before it, replace it with "> to close the tag correctly. If it is okay, just skip and do the next step. Be sure not to replace the second > at all.
Using php regex, I haven't been able to find a working solution - I could chop up the whole thing and check its parts, but that would be slow and I think there must be a regex that can do what I want.
TIA
What you can do is find the first closing angle bracket (>), with or without a double quote (") in front of it, and replace it with (">). The limit argument of 1 makes sure only the first match is replaced:
$author_name = preg_replace('/(.+?)"?>(.+?)/', '$1">$2', $author_name, 1);
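For comparison, a smaller sketch of the same idea, assuming the string always looks like the one in the question. The function name closeFirstTag is hypothetical. It relies on the optional quote ("?) so that already-correct input comes out unchanged, and on the limit argument so that the second > is never touched:

```php
<?php
// Hypothetical helper: closes the href quote on the first ">" only.
function closeFirstTag(string $html): string
{
    // "? optionally consumes an existing quote, so well-formed input is unchanged.
    // The 4th argument (limit = 1) stops after the first match.
    return preg_replace('/"?>/', '">', $html, 1);
}

$broken = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette>Robert Jones Burdette </a>';
$fine   = '<a href="http://en.wikipedia.org/wiki/Robert_Jones_Burdette">Robert Jones Burdette </a>';

echo closeFirstTag($broken), "\n";
echo closeFirstTag($fine), "\n";
```

Both calls should print the well-formed version of the link.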
http://www.barattalo.it/html-fixer/
Download that, then include it in your php.
The rest is quite easy:
$dirty_html = ".....bad html here......";
$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);
It's common for people to want to use regular expressions, but you must remember that HTML is not regular.

Regular Expression formatting help required

I am trying to remove a part of a document on the fly using preg_replace().
/* target example:
<li id="footer-poweredbyico">
<img src="//bits.wikimedia.org/skins-1.18/common/images/poweredby_mediawiki_88x31.png" alt="Powered by MediaWiki" width="88" height="31" />
</li>
*/
$reg = preg_quote('<li id="footer-poweredbyico">.*?</li>');
preg_replace($reg,"",$str);
Ignore any errors in PHP, this question is about how to format the regular expression correctly to remove anything matching the target example opening and closing tags. The contents of the containing HTML tags will be different each time, hence .*? (I think that's wrong).
The preg_quote function actually does the opposite of what you want: its purpose is to disable all regex-features in a string. So in your case, what you currently have is (roughly) looking for an actual .*? in your HTML, instead of looking for zero or more characters. What you want is:
$str = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);
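A short sketch of both points, using a made-up multi-line fragment: what preg_quote() does to .*?, and why the s modifier matters when the element spans several lines.

```php
<?php
// preg_quote() escapes regex metacharacters so they match literally.
$quoted = preg_quote('.*?');
echo $quoted, "\n"; // prints \.\*\?

// Hypothetical sample input spanning several lines, as in the question.
$str = "<ul>\n<li id=\"footer-poweredbyico\">\n<img src=\"x.png\" />\n</li>\n<li>keep</li>\n</ul>";

// Without the s modifier, "." does not match newlines, so nothing is removed.
$withoutS = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/', '', $str);
// With the s modifier, ".*?" can cross line breaks and stops at the first </li>.
$withS = preg_replace('/<li id="footer-poweredbyico">.*?<\/li>/s', '', $str);

var_dump($withoutS === $str); // bool(true)
echo $withS, "\n";
```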
The .*? portion of your regex is being escaped. Therefore, it isn't matching anything. Try this.
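// Quote only the literal parts; pass '/' so the delimiter is escaped too.
$reg = '/' . preg_quote('<li id="footer-poweredbyico">', '/') . '.*?' . preg_quote('</li>', '/') . '/s';
$str = preg_replace($reg, "", $str);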
You don't need to use this hack approach; read the FAQ:
"How can I edit / remove the Powered by MediaWiki image in the footer?"
preg_quote() will disable all the special characters you used, like .*?.
Try something like:
preg_replace('#<li id="footer-poweredbyico">.*?</li>#s', '', $str);
Now, the difficult question is whether to make this regex "greedy". Right now, it's ungreedy, which means it will break your page if there's another <li> inside the one you're trying to remove. But if you make it greedy, it will remove everything from the beginning of the <li> tag until the end of the last <li> element in the page, even if it's a different <li> element. Neither is ideal. This is why a proper HTML parser usually does a better job at manipulating HTML.
But if the page is simple enough, a regex will work.
EDIT Corrected a gross error, thanks to @Nilpo.
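For what the parser route might look like, here is a rough sketch using PHP's DOM extension (assuming it is enabled, which it usually is). The fragment is made up and deliberately has a nested list inside the target, the case where the ungreedy regex breaks:

```php
<?php
// Hypothetical fragment with a nested list inside the target <li>.
$str = '<ul><li id="footer-poweredbyico"><ul><li>nested</li></ul></li><li>keep</li></ul>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($str);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Copy the node list to an array first so removal does not disturb iteration.
foreach (iterator_to_array($xpath->query('//li[@id="footer-poweredbyico"]')) as $node) {
    $node->parentNode->removeChild($node);
}

// saveHTML() returns the whole document, including the <html>/<body>
// wrapper that the parser adds around a fragment.
$result = $doc->saveHTML();
echo $result;
```

The element is removed together with everything nested in it, and the sibling <li> survives, regardless of how the markup is laid out.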

preg_replace image src with full url

I have seen lots of similar queries to this, but am struggling to get them to work in my application because I still don't fully understand regular expressions!
I'm using the old FCKEditor WYSIWYG to upload an image, but need to store the src as the full URL rather than the relative path.
At the time I need to do the replace, I've already replaced quotes with \" so the pattern I'm looking for needs to be:
src=\"/userfiles/
This needs to be replaced with
src=\"http://mydomain.com/userfiles/
Thanks for your suggestions!!
You can actually do this with str_replace, which would be simpler, but here's a preg version:
$html = preg_replace('!src="/userfiles/!', 'src="http://mydomain.com/userfiles/', $html);
Here's the str_replace version:
$html = str_replace('src="/userfiles/', 'src="http://mydomain.com/userfiles/', $html);
If there are spaces here and there, you'll need the preg version, with \s* added wherever spaces may appear.
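A sketch of the whitespace-tolerant variant, assuming the stray spaces sit around the equals sign (the sample input is made up):

```php
<?php
// Hypothetical editor output with stray spaces around "=" in the first tag.
$html = '<img src = "/userfiles/image/a.png" /> <img src="/userfiles/b.png" />';

// \s* allows optional whitespace on either side of "=".
// "!" is the delimiter, so the slashes in the path need no escaping.
$html = preg_replace(
    '!src\s*=\s*"/userfiles/!',
    'src="http://mydomain.com/userfiles/',
    $html
);

echo $html, "\n";
```

Note that the replacement also normalizes the spacing to src="...".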

PHP Regular expression tag matching

Been beating my head against a wall trying to get this to work - help from any regex gurus would be greatly appreciated!
The text that has to be matched
[template option="whatever"]
<p>any amount of html would go here</p>
[/template]
I need to pull the 'option' value (i.e. 'whatever') and the html between the template tags.
So far I have:
/\[template\s*option=["\']([^"\']+)["\']\]((?!\[\/template\]))/
Which gets me everything except the html between the template tags.
Any ideas?
Thanks, Chris
Edit: [\s\S] will match anything that is space or not space, i.e. any character including newlines.
You may have a problem when there are consecutive blocks in a large string. In that case you will need a more specific quantifier: either non-greedy (+?), a bounded range such as {1,200}, or a more specific class than [\s\S].
/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+)\[\/template\]/
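To illustrate the consecutive-blocks case, here is a sketch using the non-greedy form with preg_match_all(). The two-block input is made up:

```php
<?php
// Hypothetical input with two consecutive blocks.
$text = "[template option=\"one\"]\n<p>first</p>\n[/template]\n"
      . "[template option='two']\n<p>second</p>\n[/template]";

// [\s\S]+? is the non-greedy form: it stops at the nearest [/template].
$pattern = '/\[template\s*option=["\']([^"\']+)["\']\]([\s\S]+?)\[\/template\]/';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the option value, $m[2] the content between the tags.
    echo $m[1], ' => ', trim($m[2]), "\n";
}
```

With the greedy [\s\S]+ the same input would come back as a single match running from the first opening tag to the last closing tag.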
Try this:
/\[template\s*option=\"(.*)\"\](.*)\[\/template]/s
Basically, instead of using a complex regex to match every single thing, just use (.*), which matches anything, since you want everything in between and aren't trying to verify the data. The s modifier is needed so that . also matches the line breaks in your text.
The ?! assertion is unneeded. Just match with .*? to get the minimum giblets. The x modifier lets you space the pattern out, and s lets . match newlines.
/\[template\s*option=\pP([\h\w]+)\pP\] (.*?) \[\/template\]/xs
Chris,
I see you've already accepted an answer. Great!
However, I don't think use of regular expressions is the right solution here. I think you can get the same effect by using string manipulations (substrings, etc)
Here is some code that may help you. If not now, maybe later in your coding endeavors.
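<?php
$string = '[template option="whatever"]<p>any amount of html would go here</p>[/template]';
// Everything from 'option=' onward.
$extractoptionline = strstr($string, 'option=');
// Skip the 8 characters of 'option="'.
$chopoff = substr($extractoptionline, 8);
// The option value ends at the closing '"]'.
$option = substr($chopoff, 0, strpos($chopoff, '"]'));
echo "option: $option<br />\n";
// Everything from the '"]' that closes the opening tag onward.
$extracthtmlpart = strstr($string, '"]');
// Skip the 2 characters of '"]'.
$chopoffneedle = substr($extracthtmlpart, 2);
// The HTML ends where '[/template]' begins.
$html = substr($chopoffneedle, 0, strpos($chopoffneedle, '[/'));
echo "html: $html<br />\n";
?>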
Hope this helps anyone looking for a similar answer with a different flavor.

can't make a preg_match right !

I have this link inside an HTML page.
<img id="catImage" width="250" alt="" src="http://dev-server2/image2.png" />
I want to get the value of src and am not getting along with preg_match and all of this regex stuff. Is this one right?
preg_match(
"/<img id=\"catImage\" width=\"[0-9]+\" alt=\"\" src=\"([[a-zA-Z0-9]\/-._]*)\"/",
$artist_page["content"], $matches);
I get an empty array!
First and foremost, the portion of your regex that deals with the src attribute doesn't account for the colon that appears in the URL.
I'd suggest changing the src portion (and any other attribute values) to look instead for the close quote and capture everything between:
... src=\"([^\"]*)\" ....
Does this work?
'/<img id="catImage"[^>]+src="([^"]*)"/'
I'm still really new on regex but I thought I would throw my thoughts out there and get some criticism for it. Should the expression be something like (?<=(src=")).*(?=["])? (not quite PHP formatted, yet). This would grab the contents of the src attribute.
"/<img id=\"catImage\" width=\"[0-9]+\" alt=\"\" src=\"([a-zA-Z0-9/.:_-]*)\"/"
Should do. Note that I edited the range [ ... ] part. The hyphen (-) has a special meaning, so I put it last to add it as a literal in the range. Also, I added the : char (thanks @user333699). This hints, however, that you should not try to enumerate every valid URL character. Instead, match anything until you know that the entire value of the src attribute is matched:
"/<img id=\"catImage\" width=\"[0-9]+\" alt=\"\" src=\"([^\"]*)\"/"
I.e., anything that is not a quote (").
Note that the full match, $matches[0], is going to hold the entire matched tag; to get the value of src, read the first capture group, $matches[1].
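As a small sketch of where the value ends up after preg_match(), using the tag from the question as a plain string (the $content variable stands in for $artist_page["content"]):

```php
<?php
// Stand-in for $artist_page["content"] from the question.
$content = '<img id="catImage" width="250" alt="" src="http://dev-server2/image2.png" />';

$pattern = '/<img id="catImage" width="[0-9]+" alt="" src="([^"]*)"/';

if (preg_match($pattern, $content, $matches)) {
    echo $matches[0], "\n"; // the whole matched text, up to the closing quote of src
    echo $matches[1], "\n"; // http://dev-server2/image2.png
}
```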
It might be worth diving into XPath, depending on what you really want to do with it.
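A rough sketch of the XPath route, assuming PHP's DOM extension is available. The surrounding <p> is made up; the query does not care about attribute order or the other attributes on the tag:

```php
<?php
// Hypothetical page fragment containing the image from the question.
$content = '<p><img id="catImage" width="250" alt="" src="http://dev-server2/image2.png" /></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string(...) makes evaluate() return the attribute value directly,
// or an empty string when no such image exists.
$src = $xpath->evaluate('string(//img[@id="catImage"]/@src)');

echo $src, "\n";
```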
