This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I imported some posts to my site from RSS, but the following "appeared first on" line shows up at the end of each post:
<p>The post <a rel="nofollow" href="link">title</a> appeared first on <a rel="nofollow" href="Website.com"">Website</a>.</p>
However, my removal code doesn't work:
preg_replace('/<p>The post <a\s+.*?href=".*?"\s+.*?>.*?<\/a> appeared first on <a\s+.*?href=".*?"\s+.*?>.*?<\/a>.</p>/i', '', $text);
I hope someone can help me.
I agree with the comments above: don't use regexes to parse HTML or XML strings; they're not the right tool for the job. However, if you must, your original regex has two problems:
You didn't escape the </p> (as User3783243 mentioned). It needs to be <\/p> in the regex.
The regex requires a whitespace after the href="" attribute, which is not present in the example. You should probably remove the \s+ after the second " in the href.
If you make both changes, the regex matches the provided string; see here: https://regex101.com/r/MDwSua/1
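For illustration, here is a minimal sketch of the regex with both fixes applied, run against the sample line from the question. The variable names and the leading <p>Body.</p> paragraph are assumptions made for the example; the literal dot before the closing tag is also escaped here so that it only matches a period.

```php
<?php
// Sample input: a made-up body paragraph followed by the line from the question.
$text = '<p>Body.</p>'
      . '<p>The post <a rel="nofollow" href="link">title</a> appeared first on '
      . '<a rel="nofollow" href="Website.com"">Website</a>.</p>';

// Fix 1: the closing </p> is escaped as <\/p> because / is the delimiter.
// Fix 2: the \s+ after the href attribute is dropped, so href may be the last attribute.
$pattern = '/<p>The post <a\s+.*?href=".*?".*?>.*?<\/a> appeared first on '
         . '<a\s+.*?href=".*?".*?>.*?<\/a>\.<\/p>/i';

// preg_replace returns the new string; it does not modify $text in place.
$clean = preg_replace($pattern, '', $text);

echo $clean, PHP_EOL;
```

Note that the result has to be assigned; calling preg_replace without using its return value leaves $text unchanged.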
This should work:
$regex = '/<p>The post <a[^>]*>title<\/a> appeared first on <a[^>]*>Website<\/a>\.<\/p>/';
$text = preg_replace($regex, '', $text);
The pattern [^>]* matches any attributes inside a tag, up to its closing >.
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 2 years ago.
I'm trying to remove a malware script from my database.
It was injected into many rows of my table.
The script starts with a <script> tag and ends with a </script> tag.
I'm using the following code to find and replace it:
$content = $post->post_content;
$new_content = preg_replace('/(<script>.+?)+(<\/script>)/i', '', $content);
I've tested it on regex101.com and it works fine there, but in my code it doesn't work. Does anyone know what's wrong?
Here is my go-to regex for <script>...</script> tags with their contents:
(\<script\>)([\s\S]*?)(<\/script>)
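The main problem is that your pattern doesn't match everything that can appear between the tags: without the s modifier, . does not match newlines, so a script that spans several lines is never matched. (Escaping < and > as I do above is harmless but not required.)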
Here is an explanation of the content capturing group:
\s matches any whitespace character
\S matches any non-whitespace character
*? matches between zero and unlimited times, as few times as possible, expanding as needed
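A minimal sketch of the difference, assuming a made-up $content string with a script that spans several lines (the sample data is hypothetical, not taken from the question's database). It also assumes the injected tag is a bare <script> with no attributes, as in the question.

```php
<?php
// Hypothetical post content with a multi-line injected script.
$content = "<p>Hello</p><script>\nevil();\n</script><p>World</p>";

// Pattern from the question: . stops at the newline, so nothing is matched.
$original = preg_replace('/(<script>.+?)+(<\/script>)/i', '', $content);

// [\s\S] matches any character, newlines included.
$fixed = preg_replace('/<script>[\s\S]*?<\/script>/i', '', $content);

var_dump($original === $content); // the question's pattern left the text untouched
echo $fixed, PHP_EOL;
```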
As I stated before, you really shouldn't do this. You should use a PHP DOM parser instead.
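A rough sketch of the DOM approach, assuming the post content is an HTML fragment and PHP's DOM extension is enabled. The wrapper <div> and the sample string are illustrative choices, and content with non-ASCII characters would need extra encoding handling that is left out here.

```php
<?php
// Hypothetical post content containing an injected script.
$content = '<p>Hello</p><script>evil();</script><p>World</p>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Wrap the fragment so it has a single root; the flags stop libxml
// from adding <html>/<body> wrappers and a doctype.
$doc->loadHTML('<div>' . $content . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Copy the live node list to an array before removing nodes from it.
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}

// Serialize the wrapper's children back into a fragment.
$new_content = '';
foreach ($doc->documentElement->childNodes as $child) {
    $new_content .= $doc->saveHTML($child);
}
libxml_clear_errors();

echo $new_content, PHP_EOL;
```

Unlike the regex, this also removes script tags that carry attributes or unusual whitespace.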
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 4 years ago.
Let's say I have a long string and I want to remove one <a> tag. This is what I have already tried:
$text= preg_replace('~<a[\s\S]*?'.$aString.'[\s\S]*?/a>~','',$text);
As you can see, this line removes everything from the first "<a" up to the tag that satisfies the condition.
How do I rewrite it so that it only searches inside one opening and closing <a> tag?
In other words, to make myself clear: I have a long text that may or may not contain many <a> tags. I need to remove any of them that contain a specific string, which is created dynamically. The code above tells the program to search for "<a" and remove everything until it finds $aString and then the closing "/a>", which is not what I want. I want it to remove only the tag that contains $aString.
UPDATE: A simple str_replace wouldn't do the trick because it cannot do what "[\s\S]*?" achieves; that's why I put "[\s\S]*?" there. As I said, the text inside the tag contains $aString, so it might be:
<a class='blah' style='blah' $aString title='blah'>blahblahblah</a>
or
<a class='notblah' style='notblah' $aString>blah</a>
or
<a> $aString</a>
'~<a[\s\S]*?'.$aString.'[\s\S]*?/a>~'
How do I rewrite it that it only searches inside one opening and closing a tag?
A negative lookahead assertion can ensure that the match starting at <a contains no other <a. To do that, replace the subpattern <a[\s\S]*? with <a((?!<a)[\s\S])*?.
You can also simplify the expression a little by setting the s modifier (PCRE_DOTALL) and changing [\s\S] to a plain dot (.):
$text = preg_replace('~<a((?!<a).)*?'.$aString.'.*?/a>~s', '', $text);
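As a quick check of the lookahead version, here is a small sketch with made-up sample data. The value of $aString and the sample $text are assumptions for the example. Since $aString is built dynamically, the sketch also runs it through preg_quote so that any regex metacharacters in it are treated literally; that step is an addition, not part of the line above.

```php
<?php
// Hypothetical dynamic string and input text.
$aString = 'data-remove="1"';
$text = '<a href="/keep">keep</a> and <a class="x" data-remove="1">drop</a> end';

// (?!<a) stops the scan from running past the start of another <a,
// so only the tag that actually contains $aString is matched.
$pattern = '~<a((?!<a).)*?' . preg_quote($aString, '~') . '.*?/a>~s';

$result = preg_replace($pattern, '', $text);

echo $result, PHP_EOL;
```

The first link survives because the scan that starts at it hits the second "<a" before finding $aString and gives up.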
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Regex - Greedyness - matching HTML tags, content and attributes
The text I want to parse is something like this:
Dir: Vinton Heuck, Ciro Nieli
With: Eric Loomis, Bumper Robinson, Dawn Olivieri
Usually, there are one or two anchor elements after "Dir" and multiple anchor elements after "With".
What I want to do is get all values of anchor elements after "Dir" and before "With". I tried some regular expression like this:
preg_match_all("/Dir: <a href=\"\/name\/.+\/\">(.+)<\/a>/", $content, $matches);
But this only works when there's only one anchor element after "Dir". Any suggestions? Thanks!
I think you are missing a grouping construct, "()+", so that the pattern matches one or more links rather than exactly one. Take a look at this to test your regex.
You would have to group your regex for finding the anchor tag, and use + for one or more.
Something like:
/Dir: (<a href=\"\/name\/.+\/\">(.+)<\/a>)+/
You'd have to edit it to account for the comma between the links, but it will get you started.
Assuming that the line that contains "Dir:" appears only once:
preg_match_all("/(<([[:graph:]]+)[^>]*>)(.*?)(<\/\\2>)/", preg_replace("/[[:blank:]]*With:.*/","",$content), $matches);
print_r($matches[3]);
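The same two-step idea (cut the text down to the "Dir" part first, then collect the anchors) can be sketched a little more explicitly. The markup below is hypothetical: the question's sample lost its anchor tags, so the /name/.../ hrefs are guessed from the pattern in the question's regex and the IDs are placeholders.

```php
<?php
// Hypothetical markup; the hrefs only imitate the /name/.../ shape from the question.
$content = 'Dir: <a href="/name/nm1/">Vinton Heuck</a>, <a href="/name/nm2/">Ciro Nieli</a>' . "\n"
         . 'With: <a href="/name/nm3/">Eric Loomis</a>, <a href="/name/nm4/">Bumper Robinson</a>';

$directors = [];

// Step 1: isolate everything between "Dir:" and "With:".
if (preg_match('/Dir:(.*?)With:/s', $content, $section)) {
    // Step 2: collect the text of every anchor inside that slice.
    preg_match_all('/<a\b[^>]*>(.*?)<\/a>/s', $section[1], $matches);
    $directors = $matches[1];
}

print_r($directors);
```

Splitting the work this way avoids a repeated capture group, which would only keep its last iteration.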
This question already has answers here:
Grabbing the href attribute of an A element
(10 answers)
Closed 9 years ago.
I am trying to match a difficult link with a regex:
preg_match_all('/<a[^>]*href\s*=\s*(["\'])(.*?)\1[^>]*>\s+TEXTTOFIND+(.*?)\s*<\/a>/', '<a href="http://subdomain.BLABLABLA.net/de/cgi/g.fcgi/BLABLABLA/print?folder=inbox&uid=U3RlcClzESBNZK9SDGsmQ05yIJTj7Eax&CUSTOMERNO=124332225&t=de1142311604.1315866430.20ba8551" style="margin-right: 10px;" title=""BLABLABLA.net Registrierung" <register#gutefrage.net>">"TEXTTOFIND.net R...
</a>', $match);
print_r($match);
BLABLABLA is only a placeholder to hide the real page :)
All I want is to find the URL of the link whose text contains "TEXTTOFIND",
but it doesn't work :(
You should be using a DOM parser to do this, not regular expressions. But if you want to do it the wrong way anyway, it looks like one of the reasons it isn't working is that you're trying to match:
... [^>]*>\s+TEXTTOFIND ...
But your test string is:
... >"TEXTTOFIND
Note the double quote " between the right angle bracket and your TEXTTOFIND string. The \s+ in your regex matches only whitespace, so it will not match this quote.
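If you go the DOM route instead, a sketch might look like this. The HTML below is a simplified, hypothetical stand-in for the link in the question (the example.com URLs are placeholders), and it assumes PHP's DOM extension is available.

```php
<?php
// Simplified, hypothetical markup standing in for the real page.
$html = '<a href="http://example.com/other">Other</a>'
      . '<a href="http://example.com/print?folder=inbox" style="margin-right: 10px;">"TEXTTOFIND.net R...</a>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$urls = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    // textContent is the link text without markup, so the stray quote does not matter.
    if (strpos($a->textContent, 'TEXTTOFIND') !== false) {
        $urls[] = $a->getAttribute('href');
    }
}
libxml_clear_errors();

print_r($urls);
```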
http://ua2.php.net/manual/en/function.preg-match-all.php
First, try reading the docs; you are missing the 2nd parameter.
Second, hello Alex :)
Third, you can change the \s+ to . at some ... happens (sorry for my English)
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
PHP Regular express to remove <h1> tags (and their content)
I have some HTML that looks like this:
<h2>
Fund Management</h2>
<p>
The majority of property investments are now made via our Funds.</p>
I'm trying to use a regular expression to strip the h2 tags, but it doesn't work because of the line break between the opening and closing h2 tags.
preg_replace('/<h2>(.+?)<\/h2>/', '', $content);
Any ideas on how to make this work?
Also, ideally I would like it to replace h1-h6 tags, so maybe it needs [1-6] or something?
The only problem with your regex was the lack of modifiers (s and i), but if you want to extend it to match <h1> through <h6> tags, you can do so with a back-reference to the digit captured in the opening tag:
preg_replace("/<h([1-6]{1})>.*?<\/h\\1>/si", '', $content);
This way you ensure that the closing tag matches the opening one.
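To see the effect of the modifiers, here is a small sketch that runs both patterns against the HTML from the question (the variable names are just for the example):

```php
<?php
// The HTML from the question, with the line break after the opening tags.
$content = "<h2>\nFund Management</h2>\n<p>\nThe majority of property investments are now made via our Funds.</p>";

// Without the s modifier, . cannot cross the newline after <h2>, so nothing is removed.
$withoutS = preg_replace('/<h2>(.+?)<\/h2>/', '', $content);

// With s (dot matches newline) and i (case-insensitive), plus the \1
// back-reference so that <h2> is only closed by </h2>.
$stripped = preg_replace('/<h([1-6])>.*?<\/h\1>/si', '', $content);

var_dump($withoutS === $content); // the original pattern changed nothing
echo $stripped, PHP_EOL;
```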
You can learn more about the modifiers here:
reference.pcre.pattern.modifiers.php