How can I do a "does not contain" operation in regex?

This is my string:
<br/><span style=\'background:yellow\'>Some data</span>,<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';
I want to produce this output:
Some data,More data
Right now, I do this in PHP to filter out the data:
$rePlaats = "#<br/>([^<]*)<br/>[^<]*<br/>';#";
$aPlaats = array();
preg_match($rePlaats, $lnURL, $aPlaats); // $lnURL is the source string
$evnPlaats = $aPlaats[1];
This would work if it weren't for these <span> tags, as shown here:
<br/>Some data,More data<br/>(more data)<br/>';
I will have to rewrite the regex to tolerate HTML tags (except for <br/>) and strip out the <span> tags with the strip_tags() function. How can I do a "does not contain" operation in regex?

Don't listen to these DOM purists. Parse HTML with the DOM and you end up with an incomprehensible tree. It's perfectly OK to parse HTML with regex if you know what you are after.
Step 1) Replace <br */?> with {break}
Step 2) Replace <[^>]*> with empty string
Step 3) Replace {break} with <br>
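The three steps above can be sketched in PHP roughly as follows. This is only an illustration: the sample string is the one from the question (without the backslash-escaped quotes), and the variable names are made up for the example.

```php
<?php
// Sample input taken from the question; variable names are illustrative.
$html = "<br/><span style='background:yellow'>Some data</span>,"
      . "<span style='background:yellow'>More data</span><br/>(more data)<br/>";

// Step 1: protect every <br>, <br/> or <br /> with a placeholder.
$protected = preg_replace('#<br\s*/?>#i', '{break}', $html);

// Step 2: remove all remaining tags.
$stripped = preg_replace('#<[^>]*>#', '', $protected);

// Step 3: turn the placeholders back into line breaks.
$withBreaks = str_replace('{break}', '<br>', $stripped);

echo $withBreaks, "\n"; // <br>Some data,More data<br>(more data)<br>
```

The placeholder only works if the text itself never contains the literal string {break}; pick something less likely if that is a concern.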

Don't bother with too much regex; PHP's ordinary string functions will do.
$str = "<br/><span style='background:yellow'>Some data</span>,<span style='background:yellow'>More data</span><br/>(more data)<br/>';";
$s = explode("</span>", $str);
// The last piece is whatever follows the final </span>, so skip it.
for ($i = 0; $i < count($s) - 1; $i++) {
    // Greedy .*> removes everything up to the last ">", leaving only the text (minimal regex).
    print preg_replace("/.*>/", "", $s[$i]) . "\n";
}
Explode on "</span>", since the data you want always sits just before a "</span>". Then go through each element of the array and remove everything from the start up to the last ">". That leaves your data. The last element is excluded.
output
$ php test.php
Some data
More data

If you really want to use regular expressions for this, you're better off using regex replaces. This regex SHOULD match tags; I just whipped it up off the top of my head, so it might not be perfect:
</?[a-zA-Z0-9]{1,20}(\s+[a-zA-Z0-9]{1,20}=(("[^"]*")|('[^']*'))){0,20}\s*/?>
Once all the tags are gone, the rest of the string manipulation should be pretty easy.
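For the literal "does not contain" part of the question, one common idiom is a negative lookahead applied before each character: (?:(?!<br/>).)* matches any run of characters that does not contain <br/>. A minimal sketch, assuming the source string from the question (without the backslash-escaped quotes) and reusing the question's variable names:

```php
<?php
// Source string as in the question, including the trailing ';
$lnURL = "<br/><span style='background:yellow'>Some data</span>,"
       . "<span style='background:yellow'>More data</span><br/>(more data)<br/>';";

// (?:(?!<br/>).)* = "any characters, as long as <br/> does not start here".
// The s modifier lets . match newlines as well.
$rePlaats = "#<br/>((?:(?!<br/>).)*)<br/>(?:(?!<br/>).)*<br/>';#s";

$evnPlaats = null;
if (preg_match($rePlaats, $lnURL, $aPlaats)) {
    // Group 1 still contains the <span> tags; strip them afterwards.
    $evnPlaats = strip_tags($aPlaats[1]);
}

echo $evnPlaats, "\n"; // Some data,More data
```

Compared with [^<]*, the lookahead version tolerates other tags between the <br/> markers, which is what the question asks for.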

As has been said many times, don't use regex to parse HTML. Use the DOM instead.
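As a rough illustration of the DOM route, the sketch below uses PHP's bundled DOMDocument class (this assumes the dom extension is enabled). The sample string comes from the question; the variable names are made up for the example.

```php
<?php
// Sample input from the question, without the backslash-escaped quotes.
$html = "<br/><span style='background:yellow'>Some data</span>,"
      . "<span style='background:yellow'>More data</span><br/>(more data)<br/>";

// Collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Collect the text of every <span> element.
$parts = [];
foreach ($doc->getElementsByTagName('span') as $span) {
    $parts[] = $span->textContent;
}

$result = implode(',', $parts);
echo $result, "\n"; // Some data,More data
```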

Related

RegEx replace not working in PHP

I've written a regular expression to get the first two paragraphs from a database CLOB which stores its content as HTML.
I've checked it with two online RegEx builders/checkers and both seem to do what I want (I've altered the RegEx slightly since then to handle the newline formatting, which I only discovered afterwards).
However, when I use it in my PHP, it doesn't return just the group I'm after and instead gives me everything.
Here is my preg_replace line:
$description = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', "$2", $description);
And here is my testing content in the format of the content I am getting
<p>
Paragraph 1</p>
<p>
Paragraph 2</p>
<p>
Paragraph 3</p>
I've had a look at this SO Post which didn't help.
Any Ideas?
EDIT
As pointed out in one of the comments you cannot Regex HTML in PHP (Don't know why, I'm not really bothered by that).
Now I'm opening the option for getting it in PL/SQL as well.
select
DBMS_LOB.substr(description, 32000, 1) /* How do I make this into a regular expression? */
from
blog_posts

Your input contains newlines, therefore you have to add the s modifier:
/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s
Otherwise, . does not match newline characters, so the regex fails to match and preg_replace returns the input unchanged.
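The effect of the s modifier can be checked with a short script. This is a sketch using the test content from the question; the variable names $withoutS and $withS are made up for the example.

```php
<?php
// Test content from the question, with its newlines.
$description = "<p>\nParagraph 1</p>\n<p>\nParagraph 2</p>\n<p>\nParagraph 3</p>";

// Without s: . cannot cross the newline after <p>, so nothing matches
// and preg_replace returns the subject unchanged.
$withoutS = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/', '$2', $description);

// With s: . also matches newlines, so group 2 holds the first two paragraphs.
$withS = preg_replace('/(^.*?)((<p[^>]*>.*?<\/p>\s*){2})(.*)/s', '$2', $description);

var_dump($withoutS === $description); // bool(true)
echo $withS;
```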

You could take a look at the PHP Simple HTML DOM Parser. Going by its manual, you could do something like this:
$html = str_get_html('your html string');
// This should get all the paragraph elements in your string.
foreach ($html->find('p') as $element) {
    echo $element->plaintext . '<br>';
}

[php]how to extract a single simple text from a long html source

I have HTML like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
It is a very long HTML document and I only want the content 'hello world!' from it.
I got this HTML with
$result = file_get_contents($url , false, $context);
Many people use Simple HTML DOM Parser, but I think in this case using regex would be more efficient.
How should I do it? Any suggestions? Any help would be really great.
Thanks in advance!

Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() fills $m with the captures: element 0 is the entire matched string, and each following element holds what one pair of parentheses in the regex captured. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character (.) one or more times (+), non-greedily (?).

No, I really don't think regex or similar functions would be either more effective or easier.
If you use Simple HTML DOM, you can quickly get the data you are looking for like this:
// Get your file
$html = file_get_html('myfile.html');
// Use jQuery-style selectors; the index 0 returns the first match instead of an array
$spanValue = $html->find('span.title', 0)->plaintext;
echo($spanValue);
With preg_match you could do it like this (here [^`] stands in for "any character, including newlines", on the assumption that the HTML contains no backtick):
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
Or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
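An alternative to excluding the backtick is the s modifier, which makes . match newlines too. The sketch below shows what preg_match_all puts into $matches; the sample $data string is invented for the illustration.

```php
<?php
// Hypothetical input with two title spans, one of them spanning two lines.
$data = '<div><span class="title">hello world!</span><p>x</p>'
      . "<span class=\"title\">second\ntitle</span></div>";

// s lets . cross newlines; .*? stops at the first closing </span>.
preg_match_all('/<span class="title">(.*?)<\/span>/s', $data, $matches);

// $matches[0] holds the full matches, $matches[1] the captured contents.
print_r($matches[1]);
```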

php - use of preg_match or preg_match_all

<font size="+1"><font size="+2" color="green"><b>1.</b>
</font><b>If no head injury is too trivial to be neglected, then:</b></font>
In PHP using preg_match or preg_match_all I want to retrieve the text "If no head injury is too trivial to be neglected, then:"
How can I do this?

Code :
<?php
$str = '<font size="+1"><font size="+2" color="green"><b>1.</b></font><b>If no head injury is too trivial to be neglected, then:</b></font>';
$pattern = "/font><b>(.+)<\/b>/";
preg_match($pattern,$str,$matches);
echo $matches[1];
?>
Output :
If no head injury is too trivial to be neglected, then:

I am not sure under what conditions you select the string to capture: why is "1." not captured when your second string is? As long as you don't explain that I can only guess, so here is an expression:
/<\w+(?:\s+\w+=(?:(?:"[^"]*")|(?:'[^']*')))*\s*>([^<]+)<\/\w+>/
It matches every element that contains only a text node; an element with mixed content, such as <p>text<br /></p>, is not matched.
So in <p>text</p><br>text2</br> both elements are matched, and the text ends up in capture group 1.
<\w+(?:\s+\w+=(?:(?:"[^"]*")|(?:'[^']*')))*\s*> matches any opening XHTML tag
([^<]+) matches all characters except < and puts them in the capture group
<\/\w+> finally matches the closing tag
PHP has no g (global) flag; pass the pattern to preg_match_all() to collect multiple results.
Good luck with this; if you need something different, please be a little more precise.
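Applied to the string from the question, a pattern of this shape could be run as sketched below. The nowdoc is only there so the pattern can contain both quote characters without escaping, and the closing tag's slash is escaped because / is the delimiter. Note that it captures "1." as well as the sentence.

```php
<?php
$str = '<font size="+1"><font size="+2" color="green"><b>1.</b></font>'
     . '<b>If no head injury is too trivial to be neglected, then:</b></font>';

// Opening tag with optional quoted attributes, then text only, then a closing tag.
$pattern = <<<'RE'
/<\w+(?:\s+\w+=(?:"[^"]*"|'[^']*'))*\s*>([^<]+)<\/\w+>/
RE;

// preg_match_all collects every match; group 1 holds the text nodes.
preg_match_all($pattern, $str, $matches);

print_r($matches[1]);
```

Both <b> elements match because each contains only text; the <font> elements are skipped because their first child is another tag.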

The pattern will be something like this:
/<\s*b\s*>(.+)<\s*\/b\s*>/

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck on this problem:
I need to wrap certain keywords in a given text in <b> tags. The problem is that if the keyword is within the href attribute, the result is a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
this keyword here
i end up with:
this <b>keyword</b> here
i tried all sorts of combinations but i still couldn't get the right pattern.
thanks!

You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML back out unchanged until you encounter a text node; in that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, use whatever HTML parser is available.
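One way to sketch the "only touch text nodes" idea is with PHP's DOMDocument and DOMXPath (this assumes the dom extension is enabled; it is not the xml_parse approach). The input string and all variable names here are hypothetical, chosen to show a keyword that appears both in an href and in the link text.

```php
<?php
// Hypothetical input: the keyword occurs in the href and in the link text.
$html    = '<p>this <a href="keyword.php">keyword</a> here</p>';
$keyword = 'keyword';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
// The flags keep libxml from adding <html>/<body> wrappers and a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath   = new DOMXPath($doc);
$pattern = '/(\b' . preg_quote($keyword, '/') . '\b)/i';

// Copy the node list to an array first, because the loop modifies the tree.
foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
    // Odd-indexed parts are the captured keyword occurrences.
    $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // keyword not present in this text node
    }
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $bold = $doc->createElement('b');
            $bold->appendChild($doc->createTextNode($part));
            $fragment->appendChild($bold);
        } elseif ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    $textNode->parentNode->replaceChild($fragment, $textNode);
}

$result = $doc->saveHTML();
echo $result;
```

Because attributes are never visited by the //text() query, the href value stays intact while the visible text gets the <b> tags.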

You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);

Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(\.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The parts inside () are stored in the references $1, $2 ... $n, so I don't have to type them again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution, you might want to rethink the way you plan to match keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

php anchor tag regex

I have a bunch of strings, each containing an anchor tag and url.
string ex.
here is a link http://www.google.com. enjoy!
i want to parse out the anchor tags and everything in between.
result ex.
here is a link. enjoy!
the urls in the href= portion don't always match the link text however (sometimes there are shortened urls,sometimes just descriptive text).
i'm having an extremely difficult time figuring out how to do this with either regular expressions or php functions. how can i parse an entire anchor tag/link from a string?
thanks!

Looking at your result example, it seems like you're just removing the tags/content - did you want to keep what you stripped out or no? If not you might be looking for strip_tags().

You shouldn't use regex to parse HTML; use an HTML parser instead.
But if you must use regex, and the inner contents of your anchor tags are guaranteed to be free of HTML like </a>, and each string is guaranteed to contain only one anchor tag as in the example, then (and only then) you can use something like:
Replacing /^(.+)<a.+<\/a>(.+)$/ with $1$2

Since your problem seems to be very specific, I think this should do it:
$str = preg_replace('#\s?<a.*/a>#', '', $str);
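A quick way to see how this pattern behaves, and how a non-greedy variant differs when a string holds more than one link, is sketched below. Both sample strings are assumptions made up for the illustration, since the markup of the question's example is not shown.

```php
<?php
// Hypothetical inputs: one link, and two links in the same string.
$one = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';
$two = 'one <a href="#a">A</a> and two <a href="#b">B</a> done';

// Greedy: .* runs to the LAST /a> in the string.
$greedyOne = preg_replace('#\s?<a.*/a>#', '', $one);
$greedyTwo = preg_replace('#\s?<a.*/a>#', '', $two);

// Non-greedy variant: .*? stops at the first </a> after each opening tag.
$lazyTwo = preg_replace('#\s?<a\b[^>]*>.*?</a>#is', '', $two);

echo $greedyOne, "\n"; // here is a link. enjoy!
echo $greedyTwo, "\n"; // one done
echo $lazyTwo, "\n";   // one and two done
```

With a single link per string, as in the question, the greedy pattern is enough; the non-greedy form only matters once several anchors can appear together.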

Just use your normal PHP string functions.
$str='here is a link http://www.google.com. enjoy!';
$s = explode("</a>", $str);
foreach ($s as $a => $b) {
    if (strpos($b, "href") !== FALSE) {
        $m = strpos("$b", "<a");
        echo substr($b, 0, $m);
    }
}
print end($s);
output
$ php test.php
here is a link . enjoy!

$string = 'here is a link http://www.google.com. enjoy!';
$text = strip_tags($string);
echo $text; //Outputs "here is a link . enjoy!"

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
