How to write a regular expression for this in PHP? [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Regex - Greedyness - matching HTML tags, content and attributes
The text I want to parse is something like this:
Dir: Vinton Heuck, Ciro Nieli
With: Eric Loomis, Bumper Robinson, Dawn Olivieri
Usually, there are one or two anchor elements after "Dir" and multiple anchor elements after "With".
What I want to do is get the values of all anchor elements after "Dir" and before "With". I tried a regular expression like this:
preg_match_all("/Dir: <a href=\"\/name\/.+\/\">(.+)<\/a>/", $content, $matches);
But this only works when there's only one anchor element after "Dir". Any suggestions? Thanks!

I think you are missing a grouping construct, "()+", so that the pattern matches not just one link but one or two. Take a look at this to test your regex.

You would have to group your regex for finding the anchor tag, and use + for one or more.
Something like:
/Dir: (<a href=\"\/name\/.+\/\">(.+)<\/a>)+/
You'd have to edit to take into account the comma, but it will get you started.

Assuming that the line that contains "Dir:" appears only once:
preg_match_all("/(<([[:graph:]]+)[^>]*>)(.*?)(<\/\\2>)/", preg_replace("/[[:blank:]]*With:.*/","",$content), $matches);
print_r($matches[3]);
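A two-step variant of the same idea can be sketched as follows. It first isolates the text between "Dir:" and "With:" and then collects every anchor inside that segment, which sidesteps the fact that a repeated capturing group such as (...)+ keeps only its last iteration. The sample markup and the /name/nm.../ ids are assumptions made up for illustration; the real page may differ.

```php
<?php
// Hypothetical markup; the /name/nm.../ ids are invented for illustration.
$content = 'Dir: <a href="/name/nm0000001/">Vinton Heuck</a>, '
         . '<a href="/name/nm0000002/">Ciro Nieli</a> '
         . 'With: <a href="/name/nm0000003/">Eric Loomis</a>, '
         . '<a href="/name/nm0000004/">Bumper Robinson</a>';

$directors = [];

// Step 1: isolate everything between "Dir:" and "With:".
if (preg_match('~Dir:(.*?)With:~s', $content, $segment)) {
    // Step 2: collect the text of every anchor inside that segment only.
    preg_match_all('~<a href="/name/[^"]+">([^<]+)</a>~', $segment[1], $m);
    $directors = $m[1];
}

print_r($directors);
```

Using ~ as the delimiter avoids having to escape every / in the pattern.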


regex The post appeared first on [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I imported some posts to my site from RSS, but at the end of each post this line appears: This Post Appeared First On This site.
<p>The post <a rel="nofollow" href="link">title</a> appeared first on <a rel="nofollow" href="Website.com"">Website</a>.</p>
However, my removal code doesn't work:
preg_replace('/<p>The post <a\s+.*?href=".*?"\s+.*?>.*?<\/a> appeared first on <a\s+.*?href=".*?"\s+.*?>.*?<\/a>.</p>/i', '', $text);
I hope someone can help me.
I agree with the comments above, don't use regexes to parse HTML or XML strings, they're not the tools for the job. However, if you must, your original regex has two problems:
You didn't escape the </p> (as User3783243 mentioned). It needs to be <\/p> in the regex.
The regex requires a whitespace after the href="" attribute, which is not present in the example. You should probably remove the \s+ after the second " in the href.
With both fixes applied, the regex matches the provided string; see here: https://regex101.com/r/MDwSua/1
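For reference, the two fixes described above can be sketched like this. The sample $text is an assumption modelled on the question. Escaping the final period is an extra tweak that the answer does not mention.

```php
<?php
// Sample input modelled on the question; the href values are placeholders.
$text = '<p>Body text.</p>'
      . '<p>The post <a rel="nofollow" href="link">title</a> appeared first on '
      . '<a rel="nofollow" href="Website.com">Website</a>.</p>';

// Fix 1: </p> is written as <\/p> because / is the pattern delimiter.
// Fix 2: the \s+ after the closing quote of each href is gone.
$pattern = '/<p>The post <a\s+.*?href=".*?".*?>.*?<\/a> appeared first on '
         . '<a\s+.*?href=".*?".*?>.*?<\/a>\.<\/p>/i';

// preg_replace returns the new string; it does not modify $text in place.
$cleaned = preg_replace($pattern, '', $text);

echo $cleaned, PHP_EOL;
```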
This should work:
$regex = '/<p>The post <a[^>]*>title<\/a> appeared first on <a[^>]*>Website<\/a>\.<\/p>/';
$text = preg_replace($regex, '', $text);
The pattern [^>]* matches the attributes of a tag. Note that preg_replace returns the result, so it has to be assigned back to $text.

preg_replace all links in file_get_contents not containing a word [duplicate]

This question already has answers here:
Regular expression to match a line that doesn't contain a word
(34 answers)
Closed 9 years ago.
I'm reading a page into a variable and I would like to disable all links that do not contain the word "remedy" in the address. The code I have so far grabs all the links including ones with "remedy". What am I doing wrong?
$page = preg_replace('~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i', '<font color="#808080">$1</font>', $page);
-- solution --
$page = preg_replace('~<a href="(.(?!remedy))*?".*?>(.*?)</a>~i', '<font color="#808080">$2</font>', $page);
Try ~<a href="(.(?!remedy))*?".*?>(.*?)</a>~i
As for what you are doing wrong: a regex matches whenever any match is possible, and for every URL (even one containing remedy) it is possible to match '~<a href=".*?(?!remedy).*?".*?>(.*?)</a>~i'. You did not specify that remedy must not appear anywhere in the attribute. You specified only that, after whatever the first .*? consumes (possibly nothing), the next position is not followed by remedy. Such a position exists in every URL, even one that begins with remedy, because the first .*? can consume one character to step past it. Hope that makes sense...
I would probably use this:
<a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a>
The most interesting part is this:
"(?:(?!remedy)[^"])*"
Each time the [^"] is about to consume another character, it yields to the lookahead so it can confirm that it's not the first character of the word remedy. Using [^"] instead of . prevents it from looking at anything beyond the closing quote. I also took the liberty of replacing your .*?s with negated character classes. This serves the same purpose, keeping the match "corralled" in the area where you want it to match. It's also more efficient and more robust.
Of course, I'm assuming the <a> element's content is plain text, with no more elements nested inside it. In fact, that's just one of many simplifying assumptions I've made. You can't match HTML with regexes without them.
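Plugged into preg_replace, that pattern might be used like this. This is a minimal sketch: the sample $page is made up, and it relies on the same simplifying assumptions (href is the first attribute, link text contains no nested tags).

```php
<?php
// Made-up sample page: one link with "remedy" in the address, one without.
$page = '<a href="/remedy/1">Keep</a> <a href="/other" class="x">Drop</a>';

// (?:(?!remedy)[^"])* consumes the href one character at a time and refuses
// to step onto any position where the word "remedy" starts.
$pattern = '~<a href="(?:(?!remedy)[^"])*"[^>]*>([^<]*)</a>~i';

// Only the link text is captured, so it is $1 in the replacement.
$page = preg_replace($pattern, '<font color="#808080">$1</font>', $page);

echo $page, PHP_EOL;
```

Because there is only one capturing group here, the replacement uses $1 rather than the $2 used in the asker's solution.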

Replace content between two words [duplicate]

This question already has answers here:
Get content between two strings PHP
(7 answers)
Closed 4 years ago.
I am trying to replace the content between two words using PHP. The content between the two words varies, so I can't use a traditional str_replace. I want to replace the content between two words, for example:
I would like to replace **some string of text** between two words
change to:
I would like to replace between two words
You can see that I removed all the wording between "some" and "text". Again I cannot use regular str_replace because the text between the two words may differ. For example it may say:
I would like to replace **some words of text** between two words
change to:
I would like to replace between two words
The regex is simple: /some .*? text/
Just replace it with the empty string.
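A minimal sketch of that replacement, using the two sample sentences from the question. The trailing space inside the pattern is an addition of mine, so that no double space is left behind where the words were removed.

```php
<?php
$inputs = [
    'I would like to replace some string of text between two words',
    'I would like to replace some words of text between two words',
];

$outputs = [];
foreach ($inputs as $input) {
    // .*? is lazy, so it stops at the first " text " after "some ".
    $outputs[] = preg_replace('/some .*? text /', '', $input);
}

print_r($outputs);
```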
According to your question, only the inner part of your string changes. If that is the case, it's rather trivial, because you already have the solution: you do not need to replace the middle, you just need to leave it out when building the result:
$result = substr($string, 0, $startlen) . substr($string, -$endlen);
Perhaps this helps you find some more angles of attack for such problems.
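Spelled out, the substr approach could look like this. It is a sketch that assumes the fixed prefix and suffix are known in advance; here $startlen and $endlen are derived from the sample sentence in the question.

```php
<?php
$string = 'I would like to replace some words of text between two words';

// Assumed fixed parts: everything before and after the variable middle.
$startlen = strlen('I would like to replace ');
$endlen   = strlen('between two words');

// Keep the first $startlen characters and the last $endlen characters.
$result = substr($string, 0, $startlen) . substr($string, -$endlen);

echo $result, PHP_EOL;
```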

regex link with php [duplicate]

This question already has answers here:
Grabbing the href attribute of an A element
(10 answers)
Closed 9 years ago.
I am trying to match a difficult link with a regex:
preg_match_all('/<a[^>]*href\s*=\s*(["\'])(.*?)\1[^>]*>\s+TEXTTOFIND+(.*?)\s*<\/a>/', '<a href="http://subdomain.BLABLABLA.net/de/cgi/g.fcgi/BLABLABLA/print?folder=inbox&uid=U3RlcClzESBNZK9SDGsmQ05yIJTj7Eax&CUSTOMERNO=124332225&t=de1142311604.1315866430.20ba8551" style="margin-right: 10px;" title=""BLABLABLA.net Registrierung" <register#gutefrage.net>">"TEXTTOFIND.net R...
</a>', $match);
print_r($match);
BLABLABLA is only a placeholder to hide the real page :)
All I want is to find the URL of the link with "TEXTTOFIND",
but it doesn't work :(
You should be using a DOM parser to do this, not regular expressions. But if you want to do it the wrong way anyway, it looks like one of the reasons it isn't working is that you're trying to match:
... [^>]*>\s+TEXTTOFIND ...
But your test string is:
... >"TEXTTOFIND
Note the double quote " between the right angle bracket and your TEXTTOFIND string. The \s+ in your regex matches only whitespace, so it will not match this quote.
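The DOM route could be sketched roughly as follows. findLinksByText is a hypothetical helper name, and the sample markup and example.test URLs are invented; only DOMDocument and the libxml functions are standard PHP.

```php
<?php
// Hypothetical helper: return the href of every <a> whose text contains $needle.
function findLinksByText(string $html, string $needle): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // textContent is the link text with any nested tags stripped.
        if (strpos($a->textContent, $needle) !== false) {
            $urls[] = $a->getAttribute('href');
        }
    }
    return $urls;
}

// Invented sample markup; the first link text starts with a double quote,
// as in the question.
$html = '<a href="http://example.test/print?folder=inbox">"TEXTTOFIND.net R</a>'
      . '<a href="http://example.test/other">Something else</a>';

$urls = findLinksByText($html, 'TEXTTOFIND');
print_r($urls);
```

Since the match is done on the link text rather than on the raw markup, the stray quote before TEXTTOFIND no longer matters.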
http://ua2.php.net/manual/en/function.preg-match-all.php
First, try reading the docs; you are missing the 2nd parameter.
Second, hello Alex :)
Third, you can change the \s+ to . at some ... (sorry for my English)

Getting div data without using DOM [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Getting DIV content with Regular Expression
Let me first tell you that DOM is not an option on this one.
I simply have the HTML:
className">Name</div>......</div>....</div>
Now, I have created a regular expression like:
$match_count = preg_match_all('/className\">(.*)\<\/div\>/', $page, $matches);
This seems fine to me, but for some reason it gets more data than expected. That is, it ends at a closing div some way further on. How can I restrict it so that it only gets the data up to the first closing div?
$match_count = preg_match_all('/className">(.*?)<\/div>/', $page, $matches);
Use the non-greedy quantifier .*?
Use preg_match instead. It will stop searching after the first matched pattern.
This works:
$match_count = preg_match_all('/className\">(.*)\<\/div\>/U', $page, $matches);
The U pattern modifier inverts the greediness of the quantifiers, so (.*) finds the smallest possible match, not the biggest.
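The difference between the greedy pattern and the two fixes can be checked with a small sketch. The sample $page is made up to resemble the markup in the question.

```php
<?php
// Made-up sample with several closing divs after the wanted content.
$page = '<div class="className">Name</div><div>more</div></div>';

// Greedy: .* runs to the end and backtracks to the LAST </div>.
preg_match('/className">(.*)<\/div>/', $page, $greedy);

// Lazy quantifier: .*? stops at the FIRST </div>.
preg_match('/className">(.*?)<\/div>/', $page, $lazy);

// U modifier: flips the default, so a plain .* behaves lazily.
preg_match('/className">(.*)<\/div>/U', $page, $ungreedy);

echo $greedy[1], PHP_EOL;    // Name</div><div>more</div>
echo $lazy[1], PHP_EOL;      // Name
echo $ungreedy[1], PHP_EOL;  // Name
```

Note that under the U modifier a .*? becomes greedy again, so the two fixes should not be combined.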
