php anchor tag regex - php

I have a bunch of strings, each containing an anchor tag and a URL.
String example:
here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!
I want to strip out the anchor tags and everything in between.
Result example:
here is a link. enjoy!
The URLs in the href= portion don't always match the link text, however (sometimes there are shortened URLs, sometimes just descriptive text).
I'm having an extremely difficult time figuring out how to do this with either regular expressions or PHP functions. How can I remove an entire anchor tag/link from a string?
Thanks!

Looking at your result example, it seems like you're just removing the tags/content - did you want to keep what you stripped out or no? If not you might be looking for strip_tags().
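To see what strip_tags() does with a string like the one in the question, here is a minimal sketch. The sample string (in particular its href value and link text) is an assumption made for illustration. Note that strip_tags() removes only the markup and keeps the link text, so it fits when the text should stay and does not fit when the whole link should go.

```php
<?php
// Hypothetical sample string: the href value and link text are assumed.
$str = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';

// strip_tags() drops the <a ...> and </a> markup but keeps the text between them.
$stripped = strip_tags($str);

echo $stripped, "\n";
```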

You shouldn't use regex to parse HTML; use an HTML parser instead.
But if you must use regex, and your anchor tags' inner contents are guaranteed to be free of HTML like </a>, and each string is guaranteed to contain only one anchor tag as in the example case, then (and only then) you can use something like:
Replacing /^(.+)<a.+<\/a>(.+)$/ with $1$2
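In PHP, that replacement could be sketched with preg_replace() as follows. The sample string is an assumption (one anchor, some text before and after it), and the same caveats apply: exactly one anchor per string and no newlines inside it.

```php
<?php
// Hypothetical sample string with exactly one anchor tag.
$str = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';

// $1 captures the text before the anchor, $2 the text after it;
// the anchor itself is matched but not captured, so it is dropped.
$result = preg_replace('/^(.+)<a.+<\/a>(.+)$/', '$1$2', $str);

echo $result, "\n";
```

Because the space before the anchor belongs to $1, the result keeps a space before the full stop.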

Since your problem seems to be very specific, I think this should do it:
$str = preg_replace('#\s?<a.*/a>#', '', $str);

Just use your normal PHP string functions.
$str = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';
$s = explode("</a>", $str);
foreach ($s as $a => $b) {
    if (strpos($b, "href") !== FALSE) {
        // print only the part of this chunk that precedes the opening tag
        $m = strpos($b, "<a");
        echo substr($b, 0, $m);
    }
}
print end($s);
Output:
$ php test.php
here is a link . enjoy!

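$string = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';
$text = strip_tags($string);
echo $text; // Outputs "here is a link http://www.google.com. enjoy!" (the tags are removed, but the link text is kept)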

Related

[php]how to extract a single simple text from a long html source

i have a html like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
It is a very long HTML document and I only want the content 'hello world!' from it.
I got this HTML with
$result = file_get_contents($url , false, $context);
Many people use Simple HTML DOM parser, but I think in this case using regex would be more efficient.
How should I do it? Any suggestions? Any help would be really great.
Thanks in advance!
Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() stores an array of all the elements captured inside brackets in the regex, and a 0th element which is the entire captured string. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character (.) one or more times (+) un-greedily (?).
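A small sketch of what ends up in the match array may help here. The $html value is a short hypothetical stand-in for the long page source; the pattern is the one from the answer.

```php
<?php
// Hypothetical stand-in for the long page source.
$html = '<div><span class="title">hello world!</span><span class="other">x</span></div>';

if (preg_match('/<span class="title">(.+?)<\/span>/', $html, $m)) {
    $fullMatch = $m[0]; // the entire matched text, tags included
    $title     = $m[1]; // only what the first (...) group captured
    echo $fullMatch, "\n", $title, "\n";
}
```

The lazy `+?` matters: with a greedy `.+` the capture would run on to the last `</span>` on the line.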
No, I really don't think regex or similar functions would be either more effective or easier.
If you use Simple HTML DOM, you can quickly get the data you are looking for like this:
// Get your file
$html = file_get_html('myfile.html');
// Use jQuery-style selectors; the second argument (0) returns the first match instead of an array
$spanValue = $html->find('span.title', 0)->plaintext;
echo $spanValue;
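If adding a third-party library is not an option, roughly the same lookup can be sketched with PHP's bundled DOM extension. The $html value below is a hypothetical stand-in for the fetched page, and the XPath expression only matches a class attribute that is exactly "title".

```php
<?php
// Hypothetical stand-in for the page returned by file_get_contents().
$html = '<html><body><p>intro</p><span class="title">hello world!</span><p>more</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//span[@class="title"]');

// textContent is the text inside the element, without markup.
$title = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

echo $title, "\n";
```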
with preg_match you could do like this:
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);

php - preg_match string not within the href attribute

i find regex kinda confusing so i got stuck with this problem:
i need to insert <b> tags on certain keywords in a given text. problem is that if the keyword is within the href attribute, it would result to a broken link.
the code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
<a href="keyword.php">this keyword here</a>
i end up with:
<a href="<b>keyword</b>.php">this <b>keyword</b> here</a>
i tried all sorts of combinations but i still couldn't get the right pattern.
thanks!
You can't use regex alone to do that. Regexes are powerful, but they can't parse a recursive grammar like HTML.
Instead you should properly parse the HTML using an existing HTML parser. You just have to echo the HTML unchanged until you encounter a text node. In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
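The parser-based idea can be sketched with PHP's DOM extension: walk only the text nodes, so attribute values such as href are never touched. The function name boldKeyword and the sample link are hypothetical, and the wrapper `<div>` is just a convenience so the fragment has a single root element.

```php
<?php
// Hypothetical helper: wraps $keyword in <b> tags inside text nodes only.
function boldKeyword(string $html, string $keyword): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so it has one root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath   = new DOMXPath($dom);
    $pattern = '/(\b' . preg_quote($keyword, '/') . '\b)/i';

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
        // With DELIM_CAPTURE, odd indexes hold the matched keyword.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $b = $dom->createElement('b');
                $b->appendChild($dom->createTextNode($part));
                $fragment->appendChild($b);
            } else {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize the children of the wrapper <div>, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo boldKeyword('<a href="keyword.php">this keyword here</a>', 'keyword'), "\n";
```

The href keeps its original value because attributes are not text nodes and so are never visited.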
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example:
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(\.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () are stored in $1, $2 ... $n, so I don't have to type stuff over again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution, you might want to rethink the way you plan to do matching of keywords in your code. :)
Output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

How can I do a "does not contain" operation in regex?

This is my string:
<br/><span style=\'background:yellow\'>Some data</span>,<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';
I want to produce this output:
Some data,More data
Right now, I do this in PHP to filter out the data:
$rePlaats = "#<br/>([^<]*)<br/>[^<]*<br/>';#";
$aPlaats = array();
preg_match($rePlaats, $lnURL, $aPlaats); // $lnURL is the source string
$evnPlaats = $aPlaats[1];
This would work if it weren't for these <span> tags, as shown here:
<br/>Some data,More data<br/>(more data)<br/>';
I will have to rewrite the regex to tolerate HTML tags (except for <br/>) and strip out the <span> tags with the strip_tags() function. How can I do a "does not contain" operation in regex?
Don't listen to these DOM purists. Parsing HTML with DOM you'll have an incomprehensible tree. It's perfectly ok to parse HTML with regex, if you know what you are after.
Step 1) Replace <br */?> with {break}
Step 2) Replace <[^>]*> with empty string
Step 3) Replace {break} with <br>
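The three steps above could look like this in PHP. The input is the string from the question with the trailing `';` left out (an assumption: it looks like leftover source code), and the final preg_match() reuses the question's original idea on the cleaned string.

```php
<?php
$input = "<br/><span style='background:yellow'>Some data</span>,"
       . "<span style='background:yellow'>More data</span><br/>(more data)<br/>";

// Step 1: protect the line breaks with a placeholder.
$s = preg_replace('#<br */?>#i', '{break}', $input);
// Step 2: remove every remaining tag.
$s = preg_replace('#<[^>]*>#', '', $s);
// Step 3: restore the line breaks.
$s = str_replace('{break}', '<br>', $s);

echo $s, "\n";

// The simple "text between the first two breaks" pattern now works.
preg_match('#<br>([^<]*)<br>#', $s, $m);
$place = $m[1];

echo $place, "\n";
```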
don't fret yourself with too much regex. use your normal PHP string functions
$str = "<br/><span style=\'background:yellow\'>Some data</span>,<span style=\'background:yellow\'>More data</span><br/>(more data)<br/>';";
$s = explode("</span>",$str);
for($i=0;$i<count($s)-1;$i++){
print preg_replace("/.*>/","",$s[$i]) ."\n"; #minimal regex
}
explode on "</span>" , since the data you want to get is all near "</span>". Then go through every element of array , replace from start till ">". This will get your data. The last element is excluded.
output
$ php test.php
Some data
More data
If you really want to use regular expressions for this, then you're better off using regex replaces. This regex SHOULD match tags; I just whipped it up off the top of my head, so it might not be perfect:
<[/]?[a-zA-Z0-9]{0,20}(\s+[a-zA-Z0-9]{0,20}=(("[^"]*")|('[^']*'))){0,20}\s*[/]{0,1}>
Once all the tags are gone, the rest of the string manipulation should be pretty easy.
As has been said many times, don't use regex to parse HTML. Use the DOM instead.
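For the literal "does not contain" part of the question, one common regex idiom is a negative lookahead applied before every character. The sketch below is only one possible reading of the requirement: it captures everything between the first two `<br/>` tags, whatever other tags appear there, and then strips those tags. The input again omits the trailing `';` from the question, which is an assumption.

```php
<?php
$input = "<br/><span style='background:yellow'>Some data</span>,"
       . "<span style='background:yellow'>More data</span><br/>(more data)<br/>";

// (?:(?!<br/>).)* consumes one character at a time, but only while
// the text at that position does not start with "<br/>".
$pattern = '#<br/>((?:(?!<br/>).)*)<br/>#s';

if (preg_match($pattern, $input, $m)) {
    $place = strip_tags($m[1]); // drop the <span> tags from the captured part
    echo $place, "\n";
}
```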

Find phrases/words between HTML using PHP

I was wondering of a solid way to find phrases/words that are part of an HTML document. For example if I have the following document:
This is a test<b>Another test</b>
My goal is to find "This is a test" and "Another test" and replace it with something else. Note that these are sample phrases and it could contain numbers or the ampersand symbol.
Any help would be great.
Thank you
Consider your HTML as XML and use the DOM (PHP 5) or DOM XML (PHP 4) extension (or any other XML extension included in PHP).
For each node, you can get the inside text using DomNode.GetValue (depending on what library you use).
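With PHP's DOM extension, the per-node idea could be sketched like this. The property used for a node's text here is nodeValue; the wrapper `<div>` and the strtoupper() call are placeholders (assumptions) standing in for whatever root element and replacement logic you actually need.

```php
<?php
$html = 'This is a test<b>Another test</b>';

$dom = new DOMDocument();
// Wrap the fragment in a <div> so it has a single root element.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$found = [];
foreach ($xpath->query('//text()') as $textNode) {
    $found[] = $textNode->nodeValue;                           // the phrase as found
    $textNode->nodeValue = strtoupper($textNode->nodeValue);   // placeholder replacement
}

// Serialize the children of the wrapper, not the wrapper itself.
$result = '';
foreach ($dom->documentElement->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}

print_r($found);
echo $result, "\n";
```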
I would look into something like str_replace()
Once all the HTML (tags, scripts, CSS) has been removed, you can replace whatever you want with str_replace().
If this is an option to do client side I would suggest jQuery replaceWith()
You could use PHP's strip_tags($string, $allowedTags). Note that the optional second parameter lists the tags to keep, not the tags to remove.
$justText = strip_tags('This is a test<b>Another test</b>');
And then you'd have the text, so you could use str_replace('This is a test', 'new text', $justText);
You might have to break it up using the second parameter of strip_tags() to keep the tags separate, though.
$html = 'This is a test<b>Another test</b>';
$withBold = strip_tags($html, '<b>'); // strips every tag except <b>
$html = str_replace('<b>Another test</b>', '<b>new bold text</b>', $withBold);
$html = str_replace('This is a test', 'new plain text', $html);
The key here is to use a regular expression to, in a sense, parse the HTML...
So you'd use:
<?php
$str = "Hello"; // The string to search (set this to your page's HTML)
preg_match_all('/[^<>]+(?=<|$)/', $str, $match); // Find every run of text outside the tags and store the runs in $match[0]
echo $match[0][0]; // Echo the first run of text
?>
This searches the input string (which you'd set as your page's HTML) and returns each run of text between the tags as a value in the array $match[0]. The first run is stored in $match[0][0], the second in $match[0][1], etc.
It does this by matching characters that are not part of a tag ([^<>]+) and requiring, via a lookahead, that the run ends at the start of the next tag or at the end of the string, so the tags themselves are never selected.
Hope this helps!
Braeden

How to remove text between tags in php?

Despite using PHP for years, I've never really learnt how to use expressions to truncate strings properly... which is now biting me in the backside!
Can anyone provide me with some help truncating this? I need to chop out the text portion from the link, turning
<a href="link.html">text</a>
into
<a href="link.html"></a>
$str = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $str);
Using Simple HTML DOM:
<?php
// example of how to modify anchor innertext
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
// set innertext to an empty string for each anchor (the property name is lowercase)
foreach ($html->find('a') as $e) {
    $e->innertext = '';
}
// dump contents
echo $html;
?>
What about something like this, considering you might want to re-use it with other hrefs:
$str = '<a href="link.html">text</a>';
$result = preg_replace('#(<a[^>]*>).*?(</a>)#', '$1$2', $str);
var_dump($result);
Which will get you:
string '<a href="link.html"></a>' (length=24)
(I'm assuming you made a typo in the OP?)
If you don't need to match any other href, you could use something like:
$str = '<a href="link.html">text</a>';
$result = preg_replace('#(<a href="link.html">).*?(</a>)#', '$1$2', $str);
var_dump($result);
Which will also get you:
string '<a href="link.html"></a>' (length=24)
As a side note: for more complex HTML, don't try to use regular expressions. They work fine for this kind of simple situation, but for a real-life HTML portion they generally don't help: HTML is not "regular" enough to be parsed by regexes.
You could use substr() in combination with strpos(), even though this is not a very nice approach.
Check: PHP Manual - String functions
Another way would be to write a regular expression to match your criteria.
But in order to get your problem solved quickly, the string functions will do...
EDIT: I underestimated the audience. ;) Go ahead with the regexes... ^^
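For completeness, the string-function route might look like the sketch below. The function name emptyAnchorText and the sample link are hypothetical, and the code assumes the string holds a single anchor whose opening tag contains the first `>` in the string.

```php
<?php
// Hypothetical helper: removes the text between <a ...> and </a>.
function emptyAnchorText(string $str): string
{
    $openEnd    = strpos($str, '>');      // end of the opening tag
    $closeStart = strrpos($str, '</a>');  // start of the closing tag

    // Leave the string alone if it does not look like an anchor.
    if ($openEnd === false || $closeStart === false || $closeStart < $openEnd) {
        return $str;
    }

    // Keep everything up to and including ">", then resume at "</a>".
    return substr($str, 0, $openEnd + 1) . substr($str, $closeStart);
}

echo emptyAnchorText('<a href="link.html">text</a>'), "\n";
```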
You don't need to capture the tags themselves. Just target the text between the tags and replace it with an empty string. Super simple.
Demo of both techniques
Code:
$string = '<a href="link.html">text</a>';
echo preg_replace('/<a[^>]*>\K[^<]*/', '', $string);
// the opening tag--^^^^^^^^  ^^^^^-match everything before the end tag
//                          ^^-restart fullstring match
Output:
<a href="link.html"></a>
Or in fringe cases when the link text contains a <, use this: ~<a[^>]*>\K.*?(?=</a>)~
This avoids the expense of capture groups by using a lazy quantifier, the fullstring-restarting \K, and a lookahead.
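A quick side-by-side of the two patterns on a fringe case shows the difference. The sample link (with a literal `<` in its text) is hypothetical.

```php
<?php
// Hypothetical link whose text contains a "<".
$string = '<a href="link.html">a < b</a>';

// [^<]* stops at the first "<", so part of the link text survives.
$first = preg_replace('/<a[^>]*>\K[^<]*/', '', $string);

// .*? with a lookahead runs lazily up to the closing tag instead.
$second = preg_replace('~<a[^>]*>\K.*?(?=</a>)~', '', $string);

echo $first, "\n", $second, "\n";
```

In both patterns \K discards the already-matched opening tag from the reported match, so only the link text is replaced.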
Older & wiser:
If you are parsing valid html, you should use a dom parser for stability/accuracy. Regex is DOM-ignorant, so if there is a tag attribute value containing a >, my snippet will fail.
As a narrowly suited domdocument solution to offer some context:
$dom = new DOMDocument;
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // the 2nd parameter keeps the DOCTYPE and implied html/body tags out of the output
$dom->getElementsByTagName('a')[0]->nodeValue = '';
echo $dom->saveHTML();
Just use strip_tags(); it gets rid of the tags and leaves only the text between them.
