php - preg_match string not within the href attribute - php

I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within the href attribute, the result is a broken link.
The code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
this keyword here
i end up with:
this <b>keyword</b> here
I tried all sorts of combinations but I still couldn't get the right pattern.
Thanks!

You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should parse the HTML properly using an existing HTML parser. You just echo the HTML unchanged until you encounter a text node. In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
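As a rough sketch of the parser approach described above, the following uses PHP's built-in DOMDocument and DOMXPath to touch text nodes only. The function name boldKeywordInText and the wrapping <div> are assumptions made for this illustration, not part of the original answer, and the sketch assumes ASCII input (loadHTML() may need an encoding hint for other character sets).

```php
<?php
// Hypothetical helper: wraps $keyword in <b> inside text nodes only,
// so attribute values such as href are never modified.
function boldKeywordInText(string $html, string $keyword): string
{
    $prev = libxml_use_internal_errors(true);   // silence warnings on sloppy HTML
    $doc = new DOMDocument();
    // The wrapper <div> is an assumption: it gives us one known root to read back.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $wrap    = $doc->getElementsByTagName('div')->item(0);
    $xpath   = new DOMXPath($doc);
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    // Copy the node list first, because the tree is modified inside the loop.
    foreach (iterator_to_array($xpath->query('.//text()', $wrap)) as $node) {
        // With one capture group, odd indexes hold the matched keyword.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;   // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $b = $doc->createElement('b');
                $b->appendChild($doc->createTextNode($part));
                $fragment->appendChild($b);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo boldKeywordInText('<a href="keyword.php">this keyword here</a>', 'keyword'), "\n";
```

Because only text nodes are rewritten, the href value stays as it was while the link text gets the <b> tags.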

You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
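One caveat with this cleanup pass: if the keyword appears more than once inside the same href, a single call removes only one pair of b tags. A minimal sketch of a workaround, assuming the same pattern, is to repeat the replacement until nothing changes. The sample $text below is made up for the illustration.

```php
<?php
// Made-up sample: the first replacement has already damaged the href twice.
$text = '<a href="<b>keyword</b>-<b>keyword</b>.php">this <b>keyword</b> here</a>';

// Each pass strips one <b>...</b> pair per href, so loop until no pair is left.
do {
    $text = preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i', '$1$2', $text, -1, $count);
} while ($count > 0);

echo $text, "\n";
// <a href="keyword-keyword.php">this <b>keyword</b> here</a>
```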

Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
'/(<a href=")('.preg_quote($keyword, '/').')(\.php">)(.*)(<\/a>)/',
"$1$2$3<b>$4</b>$5",
$string
);
echo $string."\n";
echo $text."\n";
The content matched inside each () is stored in $1, $2 ... $n, so I don't have to type it over again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution you might want to rethink the way you plan to do matching of keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

Related

How to prevent PHP str_replace replace text in html tags?

I am developing a website where users can upload their texts. For managerial purposes, I want to
change every occurrence of the text "apple" to <a href="https://apple.com">apple</a> dynamically with PHP.
I am using str_replace('apple', '<a href="https://apple.com">apple</a>', $text) now.
However, the word "apple" might already have been linked to an external source by users. In this case, my replacement will mess up the original link.
Say the page has the following :
apple
my code will change it to
<a href="...">apple</a>
Is there any way I can identify if a certain "apple" was in an a tag or other html tags already?
Thank you
Use DOMDocument to turn the HTML into a DOM you can work with. Then, iterate over all text nodes, making the replacements.
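A minimal sketch of that idea follows, assuming the goal is to link plain-text occurrences only. The helper name linkKeyword, the wrapping <div> and the sample input are assumptions for this illustration. The XPath filter not(ancestor::a) is what skips text that is already inside a link.

```php
<?php
// Hypothetical helper: links $keyword wherever it is not already inside an <a>.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $prev = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper div is an assumption
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $wrap  = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);
    // Only text nodes that have no <a> ancestor are candidates.
    $nodes = iterator_to_array($xpath->query('.//text()[not(ancestor::a)]', $wrap));

    foreach ($nodes as $node) {
        $parts = explode($keyword, $node->nodeValue);   // case-sensitive, like str_replace
        if (count($parts) < 2) {
            continue;
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i > 0) {   // a keyword sat between this part and the previous one
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($keyword));
                $fragment->appendChild($a);
            }
            if ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkKeyword(
    'an apple and <a href="https://example.com/fruit">apple</a>',
    'apple',
    'https://apple.com'
), "\n";
```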
Why not use an if statement to look for the <a href="..">, else do your replacement?
Would all occurrences of "Apple" be in regular sentences (i.e. preceded or followed by spaces or newlines)? If so, you could try something like this:
str_replace(' apple', ' <a href="https://apple.com">apple</a>', $string);
If that won't do what you need, do a catch-all str_replace and then use a regex to clean up any nested links. Something along the lines of this, which would preserve the original link (though I don't recommend using regex to parse HTML).
preg_match('/\\\3', $string);

[php]how to extract a single simple text from a long html source

i have a html like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
it is a very long html and i only want the content 'hello world!' from this html
i got this html by
$result = file_get_contents($url , false, $context);
Many people use Simple HTML DOM parser, but I think in this case using regex would be more efficient.
How should I do it? Any suggestions? Any help would be really great.
Thanks in advance!
Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() stores an array of all the elements captured inside brackets in the regex, and a 0th element which is the entire captured string. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character (.) one or more times (+) un-greedily (?).
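A small sketch of how the call above might be guarded, since $m[1] only exists when the pattern matched. The sample $html and the added s modifier (which lets . match newlines inside the span) are assumptions for the illustration.

```php
<?php
// Sample input standing in for the long HTML source.
$html = '<div><span class="title">hello world!</span></div>';

// preg_match() returns 1 on a match, 0 on no match and false on error,
// so check the return value before reading the captured group.
if (preg_match('/<span class="title">(.+?)<\/span>/s', $html, $m) === 1) {
    $whatYouWant = $m[1];   // $m[0] is the whole match, $m[1] the first group
} else {
    $whatYouWant = null;    // no such span in the document
}

echo $whatYouWant, "\n";    // hello world!
```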
No, I really don't think regEx or similar functions would be either more effective or easier.
If you would use SimpleHTML DOM, you could quickly get the data you are looking for like this:
//Get your file
$html = file_get_html('myfile.html');
//Use jQuery style selectors; the index 0 returns the first match instead of an array
$spanValue = $html->find('span.title', 0)->plaintext;
echo($spanValue);
with preg_match you could do like this:
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);

php, highlight search keywords without breaking anchor tags

I already did a search on Google and Stackoverflow, but I couldn't find any solution that works for me.
This is what I have so far:
$string = preg_replace('/'.$keyword.'/i',
'<span class="highlight">$0</span>', $string);
Which works fine, except when the string contains anchor tags. But I still want to be able to highlight the keywords outside and within the anchor tags.
Example:
$keyword = 's';
Output:
I already did a search on Google and Stackoverflow, but I couldn't find any solution that works for me.
I would appreciate it if someone could find a solution for this without having to use PHP Simple HTML DOM Parser.
This should work in most situations:
$string = preg_replace('/(?![^<>]*>)'.preg_quote($keyword,"/").'/i',
'<span class="highlight">$0</span>', $string);
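To see what the negative lookahead does, here is a short run on a made-up string (the sample markup is an assumption for the illustration). The lookahead (?![^<>]*>) rejects any position from which a ">" can be reached without crossing another angle bracket, which is the case inside a tag.

```php
<?php
$keyword = 's';
// Made-up sample: the keyword occurs in the href and in the visible text.
$string  = '<a href="search.php">search</a> results';

$pattern = '/(?![^<>]*>)' . preg_quote($keyword, '/') . '/i';
$result  = preg_replace($pattern, '<span class="highlight">$0</span>', $string);

echo $result, "\n";
// <a href="search.php"><span class="highlight">s</span>earch</a> re<span class="highlight">s</span>ult<span class="highlight">s</span>
```

The href keeps its "s" untouched, while the link text and the text after the link are highlighted. A literal ">" inside ordinary text would make the pattern skip matches before it, which is probably why the answer says "most situations".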
It seems to me like you'll need to use a DOM parser as you are only wanting to deal with the "text" in your string, rather than the entire string. So you need a way to determine what is "text" and what are HTML attributes.
There are lots of examples of why regex doesn't work for trying to parse HTML.

preg_replace on text only and not inside href's

I have this code to do some ugly inline text style color formatting on html content.
But this breaks anything inside tags, such as links and emails.
I've figured out halfway how to prevent it from formatting links, but it still doesn't prevent the replacement when the text is info#mytext.com
$text = preg_replace('/(?<!\.)mytext(?!\/)/', '<span style="color:#DD1E32">my</span><span style="color:#002d6a">text</span>', $text);
What will be a better approach to only replace text, and prevent the replacing on links?
A better approach would be to use XML functions instead.
Your lookbehind assertion only tests one character, so it's insufficient to assert matches outside of HTML tags. This is something where regular expressions aren't the best option. You can however get an approximation like:
$text = preg_replace("/(>[^<]*)(?<![#.])(mytext)/", "$1<span>$2</span>", $text);
This would overlook the first occurrence of mytext if it's not preceded by an HTML tag. So it works best if you wrap the input first, e.g. $text = "<div>$text</div>" or something.
[edited] Oh, I see you solved the href problem.
To solve your email problem, change the address to [email_safeguard] with str_replace before working on the text, and when you're finished, change it back. :)
$text = str_replace('info#mytext.com','[email_safeguard]',$text);
//work on the text with preg_replace()
$text = str_replace('[email_safeguard]','info#mytext.com',$text);
that should do the trick :)
but as people have mentioned before, you better avoid html and regex, or you will suffer the wrath of Cthulhu.
see this instead

php anchor tag regex

I have a bunch of strings, each containing an anchor tag and url.
string ex.
here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!
i want to parse out the anchor tags and everything in between.
result ex.
here is a link. enjoy!
the urls in the href= portion don't always match the link text however (sometimes there are shortened urls,sometimes just descriptive text).
i'm having an extremely difficult time figuring out how to do this with either regular expressions or php functions. how can i parse an entire anchor tag/link from a string?
thanks!
Looking at your result example, it seems like you're just removing the tags/content - did you want to keep what you stripped out or no? If not you might be looking for strip_tags().
You shouldn't use regex to parse html and use an html parser instead.
But if you must use regex, and your anchor tags' inner contents are guaranteed to be free of HTML like </a>, and each string is guaranteed to contain only one anchor tag as in the example case, then - only then - you can use something like:
Replacing /^(.+)<a.+<\/a>(.+)$/ with $1$2
Since your problem seems to be very specific, I think this should do it:
$str = preg_replace('#\s?<a.*/a>#', '', $str);
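The pattern above is greedy, so with two links in one string it would also remove everything between them. A hedged variant for that case, assuming the anchors are not nested, uses a lazy quantifier. The sample strings are assumptions, with the anchor markup reconstructed for the illustration.

```php
<?php
// Assumed sample input with the anchor markup written out.
$str = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';

// \s?      optionally eats the space before the link
// [^>]*    skips the attributes of the opening tag
// .*?      lazily matches the link text up to the nearest </a>
// i and s  make the match case-insensitive and let . span newlines
$clean = preg_replace('#\s?<a\b[^>]*>.*?</a>#is', '', $str);
echo $clean, "\n";   // here is a link. enjoy!

// Two links in one string are removed separately.
$two = preg_replace('#\s?<a\b[^>]*>.*?</a>#is', '', 'one <a href="#">x</a> two <a href="#">y</a> three');
echo $two, "\n";     // one two three
```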
just use your normal PHP string functions.
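$str='here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';
// split on the closing tag; every piece that still contains "href" holds an opening tag
$s = explode("</a>",$str);
foreach($s as $a=>$b){
    if( strpos( $b ,"href")!==FALSE ){
        // print only the part before the opening <a
        $m=strpos("$b","<a");
        echo substr($b,0,$m);
    }
}
// the last piece is the text after the final </a>
print end($s);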
output
$ php test.php
here is a link . enjoy!
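$string = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';
$text = strip_tags($string);
echo $text; //Outputs "here is a link http://www.google.com. enjoy!" because strip_tags() removes the tags but keeps the text between them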
