Regexp ignore certain parts in string - php

I need to use regex for a string to find matching results.
I need to find the (.+?) but would like to ignore everything where it says (*) right now:
$regex='#<span class="(*)">(.+?)</span>#';
Instead of ignoring (* ), it echoes out what is in (* ).
How can I ignore these and only get (.+?) ?

Parentheses mean capture: whatever is inside () is captured so you can use it later.
If you do not want something to be captured, because you don't need to use it later, just remove the parentheses.
I should add that using regular expressions to extract data from HTML is generally not a good idea. You might want to use a DOM parser instead, for example with DOMDocument::loadHTML().
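As a minimal sketch of the advice above: the class value is matched with [^"]* and no parentheses, so the only capture group left is (.+?). The sample markup and class names are assumptions made up for illustration, and [^"]* is just one plausible stand-in for the part the question wrote as (*).

```php
<?php
// Hypothetical sample markup; the class names are invented for this sketch.
$html = '<span class="red">apple</span> <span class="blue">pear</span>';

// [^"]* matches the class value without capturing it,
// so (.+?) is the first and only capture group.
$regex = '#<span class="[^"]*">(.+?)</span>#';

preg_match_all($regex, $html, $matches);

// $matches[0] holds the full matches, $matches[1] what (.+?) captured.
print_r($matches[1]);
```

If the parentheses are needed for grouping but not for capturing, (?:...) groups without creating a capture.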

Related

Regex Including the next occurrence of word

The regex works perfectly, but the problem is that it also includes the next occurrence instead of ending with the first occurrence then start again from the
Regex : (?=<appView)\s{0,1}(.*)(?<=<\/appView>)
String: <appView></appView> <appView></appView>
But my problem is that it matches the whole string, like:
(Match 1)<appView></appView> <appView></appView>
I want it to match each group separately, but I can't make it work.
Desired output : (Match 1) <appView></appView> (Match 2)<appView></appView>
\s{0,1} is equivalent to \s?. You need to use (.*?), which is lazy, instead of the greedy (.*).
Use this pattern: ~(?=<appView)\s?(.*?)(?<=</appView>)~
Note: you don't have to escape / in the closing tag if you use something other than a slash as your pattern delimiter. I am using ~ at the beginning and end of my pattern to avoid escaping.
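To see the difference between the greedy and the lazy quantifier, here is a small sketch that runs both versions of the pattern against the string from the question. The variable names are only illustrative.

```php
<?php
$subject = '<appView></appView> <appView></appView>';

// Greedy: (.*) runs on to the last </appView>, so everything is one match.
$greedyCount = preg_match_all('~(?=<appView)\s?(.*)(?<=</appView>)~', $subject, $greedy);

// Lazy: (.*?) stops at the first position that is preceded by </appView>.
$lazyCount = preg_match_all('~(?=<appView)\s?(.*?)(?<=</appView>)~', $subject, $lazy);

print_r($greedy[0]); // full matches of the greedy pattern
print_r($lazy[0]);   // full matches of the lazy pattern
```

The greedy pattern yields a single match covering the whole string, while the lazy one yields one match per <appView></appView> pair.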
I strongly recommend switching from regex to an actual sequential XML parser. Regex is awful for parsing XML-based files, for example because of the problems below.
That said, you can "fix" your regex by using ([^<>]*). This will match all characters without < or >, which will make sure that no other tags are nested inside. If done with all tags, you cannot match something like <appview><unclosedTag></appView>, because it is invalid. If you can be certain that the structure is correct, this is slightly less of an issue.
Another problem your approach has is that if you have nested tags like so: <appView> something <appView> something else </appView> else </appView>, your approach will make you end up with [replaced] else </appView>.

Regular expression to filter links

I am using this regular expression to filter .pdf files from a webpage:
$regex='|<a.*?href="(.*pdf?)"|';
It does the job if the link is like this:
www.xyz.com/trgrrtr/ghtty.pdf
but if the links are something like this, it is unable to filter:
www.xyz.com/trgrrtr/ghtty.pdf?code=KksRHhdVXAoECBFCVFpeXBsBUgYMDQpxd3J2d3F2fDtzfnFuLiErNXNpIG5kYm16aGhpcmxoa05QV1VKUVFFUxQ%3D
What regular expression should I use to filter out this link from a webpage?
First of all, you need to escape the ?, otherwise it just makes the f in front of it optional. Then you could do something like this:
$regex = '|<a.*?href="([^"]*\.pdf\?[^"]*)"|';
The use of the negated character class makes sure that you cannot leave the attribute. (.* could consume the attribute-ending " as well, and go on until " matches another double quote further down the string.)
But I really recommend that you use a DOM parser to find the link-elements first. PHP has a built-in one and there is a very nice and convenient 3rd-party alternative.
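Note that the pattern above requires a ? after .pdf, so it no longer matches plain .pdf links. The following sketch is one possible variant, not the answerer's own pattern: it wraps the query part in an optional non-capturing group so both kinds of link are found. The sample markup is hypothetical, and ~ is used as the delimiter purely by choice.

```php
<?php
// Hypothetical page fragment for illustration.
$html = implode("\n", [
    '<a href="www.xyz.com/trgrrtr/ghtty.pdf">plain</a>',
    '<a class="dl" href="www.xyz.com/trgrrtr/ghtty.pdf?code=abc%3D">with query</a>',
    '<a href="www.xyz.com/trgrrtr/index.html">not a pdf</a>',
]);

// \.pdf is a literal ".pdf"; (?:\?[^"]*)? makes the query string optional.
// [^"]* cannot run past the closing quote of the href attribute.
$regex = '~<a.*?href="([^"]*\.pdf(?:\?[^"]*)?)"~';

preg_match_all($regex, $html, $matches);
print_r($matches[1]);
```

Only the two .pdf URLs end up in $matches[1]; the .html link is skipped.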
The blog post An Improved Liberal, Accurate Regex Pattern for Matching URLs may help.

replace special strings in a html page by php

I am looking for a way to replace all similar-looking strings in an entire page with their defined values.
Please do not recommend other methods of including language constants.
Strings like this:
[_HOME]
[_NEWS]
They all share the same [_*] form.
Now the big issue is how to scan an HTML page and replace them with the defined values.
One way to parse the HTML page is to use DOMDocument and then preg_replace() it,
but my main problem is writing a pattern for the replacement:
$pattern = "/[_i]/";
$replacement= custom_lang("/i/");
$doc = new DOMDocument();
$htmlPage = $doc->loadHTML($html);
preg_replace($pattern, $replacement, $htmlPage);
In regex, [ and ] are metacharacters (they delimit a character class), so to match them literally you need to escape them.
The other problem with your expression is _*, which matches zero or more _ characters. You need to replace it with something meaningful, like _.*, which matches _ followed by any other characters. So your full expression becomes:
/\[_.*?\]/
Why the ?, you might be tempted to ask. The reason is that it makes the match non-greedy.
If the subject is [_foo] [_bar], a greedy match returns a single match covering the whole string, because the expression is valid for all of it, while a non-greedy match gives you two separate matches.
You might be better off being more restrictive, by requiring an _ followed by capital letters, like:
/\[_[A-Z]+\]/
Update: Using the matched strings and replacing them. To do so we use a concept called back-referencing.
Consider modifying the above expression by enclosing the string in parentheses, like /\[_([A-Z]+)\]/
Now in the preg_replace() arguments we can refer to the part in parentheses with the back-reference $1. So what you can use is:
preg_replace("/\[_([A-Z]+)\]/e", "my_wonderful_replacer('$1')", $html);
Note: We needed the e modifier to treat the second parameter as PHP code. That modifier was deprecated in PHP 5.5 and removed in PHP 7.0; on current versions, preg_replace_callback() does the same job.
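Because the e modifier is no longer available from PHP 7.0 on, here is a sketch of the same replacement with preg_replace_callback(). The $translations array is a hypothetical stand-in for whatever custom_lang() or my_wonderful_replacer() would look up.

```php
<?php
// Hypothetical lookup table standing in for the language constants.
$translations = ['HOME' => 'Home', 'NEWS' => 'News'];

$html = '<a href="/">[_HOME]</a> <a href="/news">[_NEWS]</a> [_MISSING]';

$result = preg_replace_callback(
    '/\[_([A-Z]+)\]/',
    function (array $m) use ($translations) {
        // $m[0] is the whole placeholder, $m[1] the captured key.
        // Unknown keys are left untouched.
        return $translations[$m[1]] ?? $m[0];
    },
    $html
);

echo $result, "\n";
```

The callback receives the match array directly, so no code has to be built from strings.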
If you know the full keyword you are trying to replace (e.g. [_HOME]), then you can just use str_replace() to replace all instances.
No need to make things like this more complex by introducing regex.
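A minimal sketch of that str_replace() approach, assuming the full set of placeholders is known up front. The $map array and its values are hypothetical.

```php
<?php
// Hypothetical placeholder-to-text map.
$map = ['[_HOME]' => 'Home', '[_NEWS]' => 'News'];

$html = '<li>[_HOME]</li><li>[_NEWS]</li>';

// str_replace() accepts parallel arrays of search and replacement strings.
$result = str_replace(array_keys($map), array_values($map), $html);

echo $result, "\n";
```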

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP's preg functions use PCRE (Perl-Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to know a lot about regex, visit perlretut.org. Also, look into regex generators like exspresso.
For your use, know that regex is greedy by default. That means that when you specify that you want something, followed by anything (any number of repetitions), followed by something else, it will keep going as far as it can and stop at the last place where that second something is found.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for reference later, also called back-references).
I would also look into named subpatterns. With those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
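<?php
// The anchor text sits between <a href="..."> and </a>.
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
// The named subpattern is available under its name.
print $matches['target'];
// This should output "i_want_this"
?>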
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags that we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will probably have a more elegant solution, but I think this will do what you want.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
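Option 2 (ereg() was deprecated in PHP 5.3 and removed in PHP 7.0, so this only works on old versions):
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);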
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
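The DOM-parser suggestion above could look roughly like this in PHP. It is a sketch that assumes the markup is the heading with a link that the question's regex implies, and it relies on the DOM extension being enabled.

```php
<?php
$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// Collect parse warnings instead of printing them; real pages are often messy.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Gather the text of every <a> element, whatever its href is.
$anchorTexts = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $anchorTexts[] = $a->textContent;
}

print_r($anchorTexts);
```

Because the parser works on elements rather than characters, the varying href value needs no special handling.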
The thing to remember is that a regex match returns everything you searched for. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#<a href="/en/browse/file/.*?">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/\d+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/\d{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/\d{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/\d{3,}">(.*?)</a>#
This should work:
<a href="[^"]*">([^<]*)
This says: take everything you find until you meet a "
[^"]*
Same again: take everything until you meet a <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP. If you look at preg_match in the PHP manual, you will see many fine examples there.
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some cases
.*?
can be extremely slow. You shouldn't use it if you can use
[^<]*
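Put together in PHP, the negated-character-class pattern above might be used as follows. The closing </a> and the [^"]* for the variable part of the URL are additions made for this sketch, and the subject string is assumed from the question's regex.

```php
<?php
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// [^"]* covers the variable part of the URL without leaving the attribute;
// ([^<]*) captures the anchor text up to the next tag.
$pattern = '#<a href="/en/browse/file/[^"]*">([^<]*)</a>#';

if (preg_match($pattern, $subject, $m)) {
    echo $m[1], "\n"; // the anchor text only
}
```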
You should use the tool Expresso for creating regular expressions. It's pretty handy:
http://www.ultrapico.com/Expresso.htm

How can I replace only the captured elements of a regex?

I'm trying to extract only certain elements of a string using regular expressions and I want to end up with only the captured groups.
For example, I'd like to run something like (is|a) on a string like "This is a test" and be able to return only "is is a". The only way I can partially do it now is if I find the entire beginning and end of the string but don't capture it:
.*?(is|a).*? replaced with $1
However, when I do this, only the characters preceding the final found/captured group are eliminated--everything after the last found group remains.
is is a test.
How can I isolate and replace only the captured strings (so that I end up with "is is a"), in both PHP and Perl?
Thanks!
Edit:
I see now that it's better to use m// rather than s///, but how can I apply that to PHP's preg_match? In my real regex I have several captured groups, resulting in $1, $2, $3 etc -- preg_match only deals with one captured group, right?
If all you want are the matches, then there is no need for the s/// operator. You should use m//. You might want to expand on your explanation a little if the example below does not meet your needs:
#!/usr/bin/perl
use strict;
use warnings;
my $text = 'This is a test';
my @matches = ( $text =~ /(is|a)/g );
print "@matches\n";
__END__
C:\Temp> t.pl
is is a
EDIT: For PHP, you should use preg_match_all and specify an array to hold the match results as shown in the documentation.
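A short sketch of the preg_match_all() route in PHP. The second pattern is a hypothetical two-group example added only to show how several groups come back; it is not from the question.

```php
<?php
$text = 'This is a test';

// Default ordering: $m[1] is a flat list of what group 1 captured in each match.
preg_match_all('/(is|a)/', $text, $m);
$joined = implode(' ', $m[1]);
echo $joined, "\n";

// With several groups, PREG_SET_ORDER yields one array per match:
// index 0 is the whole match, 1 and 2 are the groups.
preg_match_all('/(\w)(\w+)/', $text, $sets, PREG_SET_ORDER);
echo $sets[0][1], ' / ', $sets[0][2], "\n";
```

preg_match() also fills in every group ($matches[1], $matches[2], and so on), but only for the first match; preg_match_all() does so for all of them.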
You can't replace only captures. s/// always replaces everything included in the match. You need to either capture the additional items and include them in the replacement or use assertions to require things that aren't included in the match.
That said, I don't think that's what you're really asking. Is Sinan's answer what you're after?
You put everything into captures and then replace only the ones you want.
(.*?)(is|a)(.*?)
