PHP/Perl Regular expression help!

I have a string:
$string = 'This is my big <span class="big-string">string</span>';
I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears in an HTML tag.
Any help is appreciated!
Edit
Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently if you have typed 'aut' in the search dialog, then the results look like this: <b>aut</b>omotive
The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>' then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'

Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
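A minimal sketch of that DOM approach, assuming PHP's bundled DOMDocument and DOMXPath classes. The sample string comes from the question; the wrapper div, its id and the replacement word 'huge' are illustrative assumptions. Because only text nodes are touched, attribute values such as class="big-string" are left alone.

```php
<?php
// Sample input from the question.
$string = 'This is my big <span class="big-string">string</span>';

$doc = new DOMDocument();
// Wrap the fragment in a div (hypothetical id) so it can be read back later.
$doc->loadHTML('<div id="wrap">' . $string . '</div>');

$xpath = new DOMXPath($doc);
// text() selects only text nodes, so tag names and attributes are skipped.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = str_replace('big', 'huge', $textNode->nodeValue);
}

// Serialize the children of the wrapper back into a string.
$result = '';
$wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
foreach ($wrapper->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

Inserting markup such as a highlight element would need createElement and node splitting rather than str_replace, since text assigned to nodeValue is escaped on output.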

Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.

If you insist upon using a regex, this one will do a pretty decent job:
$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
big # The sub-string to be matched.
(?= # Assert we are not inside an HTML tag.
[^<>]* # Consume all non-<> up to...
(?:<\w+ # either an HTML start tag,
| $ # or the end of string.
) # End group of valid alternatives.
) # End "not-in-html-tag" lookahead assertion.
/ix';
Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
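As a rough illustration, here is the same lookahead written compactly and applied to the string from the question. The replacement text 'BIG' is just a placeholder. The second call shows one more limitation that follows from the pattern itself: text sitting directly before a closing tag is skipped, because "</" matches neither <\w+ nor the end of the string.

```php
<?php
// Compact form of the commented pattern above; 'big' is the sample needle.
$re = '/big(?=[^<>]*(?:<\w+|$))/i';

$string = 'This is my big <span class="big-string">string</span>';
$result = preg_replace($re, 'BIG', $string);
echo $result, "\n"; // the class attribute is left untouched

// Text immediately followed by a closing tag is not matched by this pattern.
$missed = preg_replace($re, 'BIG', '<b>big</b>');
echo $missed, "\n";
```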

Related

Regex - Match everything except HTML tags

I've searched for this but couldn't find a solution that worked for me.
I need a regex pattern that will match all text except HTML tags, so I can convert it to Cyrillic (which would obviously ruin the entire HTML if applied to the tags =))
So, for example:
<p>text1</p>
<p>text2 <span class="theClass">text3</span></p>
I need to match text1, text2, and text3, so something like
preg_match_all("/pattern/", $text, $matches)
and then I would just iterate over the matches, or if it can be done with preg_replace, to replace text1/2/3, with textA/B/C, that would be even better.

As you probably know, regex is not a great choice for this (the general advice here will be to use a Dom parser).
However, if you need a quick regex solution, you can use this:
<[^>]*>(*SKIP)(*F)|[^<]+
How this works is that on the left the <[^>]*> matches complete <tags>, then the (*SKIP)(*F) causes the regex to fail and the engine to advance to the position in the string that follows the last character of the matched tag.
This is an application of a general technique to exclude patterns from matches (read the linked question for more details).
If you don't want to allow the matches to span several lines, add \r\n to the negated character class that does your matching, like so:
<[^>]*>(*SKIP)(*F)|[^<\r\n]+
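A short sketch of how this pattern might be used from PHP, with the sample markup from the question. The ~ delimiter and the strtoupper transformation are assumptions chosen for illustration; any callback could stand in for the Cyrillic conversion.

```php
<?php
$text = "<p>text1</p>\n<p>text2 <span class=\"theClass\">text3</span></p>";

// Tags are matched and then discarded by (*SKIP)(*F); only text runs remain.
$pattern = '~<[^>]*>(*SKIP)(*F)|[^<\r\n]+~';

// Collect the text runs.
preg_match_all($pattern, $text, $matches);
$texts = array_map('trim', $matches[0]);
print_r($texts);

// Or rewrite each text run in place, leaving the tags untouched.
$upper = preg_replace_callback($pattern, function ($m) {
    return strtoupper($m[0]);
}, $text);
echo $upper, "\n";
```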

How about this RegEx:
/(?<=>)[\w\s]+(?=<)/g

Maybe this one (in Ruby):
/(?<!<)(?<!<\/)(?<![<\/\w+])([[:alpha:]])+(?!>)/
Enjoy !

Please use PHP's DOMDocument class to parse XML content.

php Regular Expression Issues - Can't remove/strip out and replace a string within a string

I have never worked with regular expressions before and I need them now and I am having some issues getting the expected outcome.
Consider this for example:
[x:3xerpz1z]Some Text[/x:3xerpz1z] Some More Text
Using the php preg_replace() function, I want to replace [x:3xerpz1z] with <start> and [/x:3xerpz1z] with </end> but I can't figure this out. I have read some regular expression tutorials but I am still confused.
I have tried this for the starting tag:
preg_replace('/(.*)\[x:/','<start>', $source_string);
The above would return: <start>3xerpz1z
As you can see, the "3xerpz1z" isn't getting removed and it needs to be stripped out. I can't hard code and search and replace "3xerpz1z" because the "3xerpz1z" chars are randomly generated and the characters are always different but the length of the tag is the same.
This is the desired output I want:
<start>Some Text</end> Some More Text
I haven't even tried processing [/x:3xerpz1z] because I can't even get the first tag going.

You must use capturing groups (....):
$data = '[x:3xerpz1z]Some Text[/x:3xerpz1z] Some More Text';
$result = preg_replace('~\[x:([^]]+)](.*?)\[/x:\1]~s', '<start>$2</end>', $data);
pattern details:
~ # pattern delimiter: better than / here (no need to escape slashes)
\[x:
([^]]+) # capture group 1: all that is not a ]
]
(.*?) # capture group 2: content
\[/x:\1] # \1 is a backreference to the first capturing group
~s # s allows the dot to match newlines

Matching a Specific URL Pattern with PHP

I'm trying to read an HTML file and capture all anchor tags that match a specific URL pattern in order to display those links on another page. The pattern looks like this:
https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web
I'm lousy with RegEx. I've tried a bunch of things and read a bunch of answers here on Stack Overflow, but I'm not hitting on the correct syntax.
Here's what I have now:
preg_match ('/<a href="https:\/\/docs.google.com\/file\/d\/(.*)<\/a>/', $file, $matches)
When I test this on an HTML page with two matching anchor tags, the first result includes the first and second match and everything in between, while the second result includes part of the first match, part of the second match, and everything in between.
While I'd be happy to capture matching anchor tags along with the inner HTML, I'd be even happier if I could generate a multidimensional array with the HREF attribute of each matching anchor tag, along with the matching inner HTML (so I can format the links myself, without having to use even more RegEx to get rid of unwanted attributes). Would I use preg_match_all for that? What would that look like?
Am I even on the right path here, or should I be using DOM and XPath queries to find this stuff?
Thanks.

Oh jeez, I can't believe every answer here uses "/" delimiters. If your pattern has slashes in it, use something else for the sake of readability.
Here's a better answer (you may need to tweak if your anchors may have additional attributes other than href):
$hrefPattern = "(?P<href>https://docs\.google\.com/file/d/[a-z0-9]+/edit\?usp=drive_web)";
$innerPattern = "(?P<inner>.*?)";
$anchorPattern = "<a href=\"$hrefPattern\">$innerPattern</a>";
preg_match_all("#$anchorPattern#i", $file, $matches);
This will give you something like:
[
0 => ['<a href="https://docs.google.com/file/d/foo/edit?usp=drive_web"><span>More foo</span></a>'],
"href" => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
"inner" => ["<span>More foo</span>"]
]
And absolutely, you should use the DOM for this.
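For comparison, a minimal sketch of the DOM route, assuming PHP's bundled DOMDocument and DOMXPath classes. The sample markup, the abc123 file id and the $links array shape are hypothetical stand-ins, not something from the question.

```php
<?php
// Hypothetical sample standing in for the contents of $file.
$file = '<p><a href="https://docs.google.com/file/d/abc123/edit?usp=drive_web"><span>More foo</span></a>
<a href="https://example.com/">Other</a></p>';

libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
$doc = new DOMDocument();
$doc->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$prefix = 'https://docs.google.com/file/d/';

$links = [];
// Select only anchors whose href starts with the wanted prefix.
foreach ($xpath->query("//a[starts-with(@href, '$prefix')]") as $a) {
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child); // rebuild the inner HTML
    }
    $links[] = ['href' => $a->getAttribute('href'), 'inner' => $inner];
}
print_r($links);
```

This yields one array entry per matching anchor, and it keeps working when the anchors carry extra attributes.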

Replace (.*) with (.*?) - use lazy quantification:
preg_match('/<a href="https:\/\/docs.google.com\/file\/d\/(.*?)<\/a>/', $file, $matches);
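To see the difference between the two quantifiers, here is a small sketch with two anchors on one line. The sample markup and the ~ delimiter are assumptions made for the demonstration.

```php
<?php
// Hypothetical input with two matching anchors.
$file = '<a href="https://docs.google.com/file/d/aaa/edit">One</a> text '
      . '<a href="https://docs.google.com/file/d/bbb/edit">Two</a>';

$greedy = '~<a href="https://docs\.google\.com/file/d/(.*)</a>~';
$lazy   = '~<a href="https://docs\.google\.com/file/d/(.*?)</a>~';

preg_match_all($greedy, $file, $g);
preg_match_all($lazy, $file, $l);

echo count($g[0]), "\n"; // greedy: a single match running to the last </a>
echo count($l[0]), "\n"; // lazy: one match per anchor
```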

You could use the following regular expression:
/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/
Which would give you the URL from the href and the innerHTML.
Break down
<a.*?href=" Matches the opening a tag and any characters up until href="
(https:\/\/docs\.google\.com\/file\/d\/.*?)" Matches (and captures) until the end of the href (i.e. until "
.*?> Matches all characters to the end of the a tag >
(.*?)<\/a> Matches (and captures) the innerHTML until the closing a tag (i.e. </a>).

Dave,
The DOM would be better. But here is the Regex that works.
$url = 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"';
preg_match ('/href="https:\/\/docs.google.com\/file\/d\/(.*?)"/', $url, $matches);
Results:
array (size=2)
0 => string 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"' (length=82)
1 => string 'aBunchOfLettersAndNumbers/edit?usp=drive_web' (length=44)
You can add the HTML tags back, but most importantly, the preg_match line in your question didn't contain the ending > of the opening tag, which threw it off, and it needed (.*?) instead of (.*). Both match any number of characters; the added ? makes the quantifier lazy, so it stops at the first closing " instead of running on to the last one.

Replace string using regular expression

I always encounter regular expressions but I don't really try to understand and use them. But my current project is forcing me to use a regular expression so I need someone who can give me the correct regex to replace a simple string. Basically I'm replacing a small subset of longtext retrieved from a database. The longtext is just a paragraph(s) with text anchors in a form of:
Example
So the question is how do I replace the value of the title attribute? Please note that the text may contain two or more anchor tags, so I'd like to be able to target each of them specifically.
EDIT:
I'd like to use pure PHP on this. I think I know how to do this using js/jquery.

$doc = new DOMDocument();
$doc->loadHTML('Example');
$anchors = $doc->getElementsByTagName('a');
foreach ($anchors as $anchor)
{
$anchor->setAttribute('target', '__blank');
}
$html = $doc->saveHTML();
echo $html;
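A variation on the same idea that changes the title attribute itself and targets each anchor individually. This is only a sketch: the sample markup and the href-to-title map are hypothetical, since the original anchor text is not shown.

```php
<?php
// Hypothetical paragraph containing two anchors.
$html = '<p><a href="/one" title="Old one">First</a> and '
      . '<a href="/two" title="Old two">Second</a></p>';

// Assumed lookup table: which anchor (by href) gets which new title.
$titles = ['/one' => 'First title', '/two' => 'Second title'];

$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if (isset($titles[$href])) {
        $anchor->setAttribute('title', $titles[$href]);
    }
}

// saveHTML() returns the whole document, including the html/body wrapper.
$result = $doc->saveHTML();
echo $result;
```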

Description
You could do this with the following regex
(<a\b[^>]*?\btitle=(['"]))(.*?)\2
Summary
( start capture group 1
<a\b consume open angle bracket and an a followed by a word break
[^>]*? consume all non close angle bracket characters up to... this forces the regex to stay inside the anchor tag
\btitle= consume a word break and title=, the break helps do some additional checking
(['"]) capture group 2, ensure that an opening single or double quote is being used
) close capture group 1
(.*?) start capture group 3, and non greedy consume to collect all text inside the quotes
\2 reference back to the string from capture group 2, if you used a single quote to open the value, then a single quote will be required to close the value. Same if you had used a double quote.
In the replace command I'm simply replacing the entire found string from <a to the close quote with: group capture 1, followed by the desired text NewValue followed by the close quote from group capture 2.
PHP example
<?php
$sourcestring="Example";
echo preg_replace('/(<a\b[^>]*?\btitle=([\'"]))(.*?)\2/im','\1NewValue\2',$sourcestring);
?>
$sourcestring after replacement:
Example
Disclaimer
Since parsing text via a html parser is not the desired solution, I'll skip the usual soap box disclaimer about parsing html with Regex.
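A runnable sketch of that replacement, using a hypothetical anchor as input because the original sample markup is not shown. The replacement uses the ${1} form so the group reference cannot run into the text that follows it.

```php
<?php
// Hypothetical source string; one anchor uses double quotes, one single quotes.
$sourcestring = '<a href="/x" title="Old">Example</a> <a title=\'Other\' href="/y">Two</a>';

$pattern = '/(<a\b[^>]*?\btitle=([\'"]))(.*?)\2/i';

// Group 1 is everything up to and including the opening quote,
// group 2 is the quote character, which is reused to close the value.
$result = preg_replace($pattern, '${1}NewValue$2', $sourcestring);
echo $result, "\n";
```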

$string=preg_replace(
'#<a (.*)title="(.*)"([^>]*)>(.*)</a>#iU',
'<a $1title="'.$replacement.'"$3>$4</a>',
$string);
Note that the i at the end of the expression makes it case insensitive, and the U makes it ungreedy.

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.

PHP's preg functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to learn a lot about regex, visit perlretut.org. Also, look into regex tools like Expresso.
For your use, know that regex quantifiers are greedy by default. That means that when you specify that you want something, followed by anything (any repetitions), followed by something, the match will keep going to the last occurrence of that second something rather than stopping at the first.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for reference later, which is what back references point to).
I would also look into named subpatterns. With those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. That only works if the tags we see are snippets of a larger, well-formed whole, as SimpleXML requires well-formed XML (or XHTML).

I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
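// Note: ereg() was removed in PHP 7, so Option 1 is the one to use on current PHP.
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);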

Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);

The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.

I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#<a href="/en/browse/file/[^"]*">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/\d+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/\d{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/\d{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/\d{3,}">(.*?)</a>#

This should work:
<a href="[^"]*">([^<]*)
This says: take everything you find until you meet a "
[^"]*
Same here: take everything until you meet a <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP. If you look up preg_match in the PHP manual, you will see many fine examples there.
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some cases
.*?
can be extremely slow. You shouldn't use that if you can use
[^<]*
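Put together in PHP, the negated-character-class version might look like the sketch below. The input string is the heading from the question with its link assumed to point at /en/browse/file/variable_text, and the ~ delimiter is a choice made here to avoid escaping slashes.

```php
<?php
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// [^"]* skips the variable part of the href; ([^<]*) captures the anchor text.
$pattern = '~<a href="/en/browse/file/[^"]*">([^<]*)~';

$anchorText = null;
if (preg_match($pattern, $subject, $matches)) {
    $anchorText = $matches[1];
}
echo $anchorText, "\n";
```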

You should use the tool Expresso for creating regular expression... Pretty handy..
http://www.ultrapico.com/Expresso.htm
