Find phrases/words between HTML using PHP - php

I am looking for a solid way to find phrases/words that are part of an HTML document. For example, if I have the following document:
This is a test<b>Another test</b>
My goal is to find "This is a test" and "Another test" and replace them with something else. Note that these are sample phrases; the real ones could contain numbers or the ampersand symbol.
Any help would be great.
Thank you

Consider your HTML as XML and use the DOM (PHP 5) or DOM XML (PHP 4) extension (or any other XML extension included in PHP).
For each node, you can get the inner text from the node's nodeValue (or textContent) property; the exact name depends on what library you use.
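A minimal sketch of that idea with PHP's built-in DOM extension follows. The helper name extractTextNodes() and the wrapping <div> are assumptions made for illustration, and the input is assumed to be ASCII or entity-encoded, since loadHTML() does not assume UTF-8 by default.

```php
<?php
// Hypothetical helper: returns the text of every non-empty text node, in document order.
function extractTextNodes(string $html): array
{
    $dom = new DOMDocument();
    // Wrap the fragment in a <div> so the parser always sees one container element.
    // LIBXML_NOERROR keeps libxml quiet about markup it considers invalid.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_NOERROR);

    $xpath = new DOMXPath($dom);
    $phrases = [];
    foreach ($xpath->query('//text()') as $textNode) {
        $text = trim($textNode->nodeValue);
        if ($text !== '') {
            $phrases[] = $text;
        }
    }
    return $phrases;
}

print_r(extractTextNodes('This is a test<b>Another test</b>'));
```

Because the parser decodes entities, a phrase written as "A &amp; B" in the markup comes back as "A & B".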

I would look into something like str_replace()

One approach is to remove all the HTML content first (tags, scripts, CSS) and then use str_replace to replace whatever you want.

If doing this client side is an option, I would suggest jQuery's replaceWith().

You could use PHP's strip_tags($string, $allowedTags):
$justText = strip_tags('This is a test<b>Another test</b>');
Then you'd have the text, so you could use str_replace($search, 'new text', $html) to swap a phrase in the original markup.
You might have to use the second parameter of strip_tags(), which lists the tags to keep, to hold the pieces apart, though.
$html = 'This is a test<b>Another test</b>';
// The second argument names the tags to keep, not the tags to remove.
$withBold = strip_tags($html, '<b>'); // 'This is a test<b>Another test</b>'
$justText = strip_tags($html);        // 'This is a testAnother test'
// str_replace() takes (search, replace, subject).
$html = str_replace('This is a test', 'new text', $html);
$html = str_replace('Another test', 'new bold text', $html);

One option is to use a regular expression to, in a sense, parse the HTML.
So you'd use:
<?php
$str = 'This is a test<b>Another test</b>'; // The string to search
// Find every run of text outside the tags and store the results in $matches
preg_match_all('/(?:^|>)([^<>]+)(?=<|$)/', $str, $matches);
echo $matches[1][0]; // Echo the first value: "This is a test"
?>
This searches the input string (which you'd set as your page's HTML) and collects each piece of text between the tags. The pieces end up in $matches[1]: the first in $matches[1][0], the second in $matches[1][1], etc.
It does this by matching a run of characters that starts at the beginning of the string or right after a ">" and ends right before a "<" or the end of the string, capturing only the content in between. Like any regex approach, it assumes that no "<" or ">" appears inside attribute values, comments, or scripts.
Hope this helps!
Braeden

Related

How to conditionally remove content from a string?

I have a function that creates a preview of a post like this
<?php $pos=strpos($post->content, ' ', 280);
echo substr($post->content,0,$pos ); ?>
But it's possible that the very first thing in that post is a <style> block. How can I create some conditional logic to make sure my preview writes what is after the style block?
If the only HTML content is a <style> tag, you could simply use preg_replace (the s modifier lets . match newlines, so multi-line style blocks are removed too):
echo preg_replace('#<style\b[^>]*>.*?</style>#is', '', $post->content);
However it is better (and more robust) to use DOMDocument (note that loadHTML will put a <body> tag around your post content and that is what we search for) to output just the text it contains:
$doc = new DOMDocument();
$doc->loadHTML($post->content);
echo $doc->getElementsByTagName('body')->item(0)->nodeValue . "\n";
For this sample input:
$post = (object)['content' => '<style>some random css</style>the text I really want'];
The output of both is
the text I really want
Demo on 3v4l.org
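Combining the DOM approach with the original preview logic could look roughly like the sketch below. The function name makePreview(), the wrapping <div>, and the explicit removal of <style> and <script> nodes are assumptions added for illustration; the length is counted in bytes, and the content is assumed to be ASCII or entity-encoded.

```php
<?php
// Hypothetical helper: plain-text preview cut at the first space at or after $limit bytes.
function makePreview(string $content, int $limit = 280): string
{
    $doc = new DOMDocument();
    // The wrapper guarantees a non-empty document even for plain-text posts.
    $doc->loadHTML('<div>' . $content . '</div>', LIBXML_NOERROR);

    // Drop <style> and <script> elements wherever the parser placed them.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//style | //script') as $node) {
        $node->parentNode->removeChild($node);
    }

    $text = trim($doc->documentElement->textContent);
    if (strlen($text) <= $limit) {
        return $text; // Short posts need no cutting (and strpos() needs a valid offset).
    }
    $pos = strpos($text, ' ', $limit);
    return $pos === false ? $text : substr($text, 0, $pos);
}

echo makePreview('<style>some random css</style>the text I really want'), "\n";
```

Checking the length before calling strpos() also avoids the error that an offset beyond the end of a short post would cause.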
Taking a cue from the excellent comment of @deceze, here's one way to use the DOM with PHP to eliminate the style tags:
<?php
$_POST["content"] =
"<style>
color:blue;
</style>
The rain in Spain lies mainly in the plain ...";
$dom = new DOMDocument;
$dom->loadHTML($_POST["content"]);
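// Copy the live node list into an array so that replacing nodes cannot skip any of them.
$style_tags = iterator_to_array($dom->getElementsByTagName('style'));
foreach ($style_tags as $style_tag) {
    $parent = $style_tag->parentNode;
    $parent->replaceChild($dom->createTextNode(''), $style_tag);
}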
echo strip_tags($dom->saveHTML());
See demo here
I also took guidance from a related discussion specifically looking at the officially accepted answer.
The advantage of manipulating HTML through the DOM in PHP is that you don't even need a conditional to remove the STYLE tags. Also, you are working with HTML elements, so you don't have to bother with the intricacies of a regex. Note that each style tag is replaced by a text node containing an empty string.
Note, tags like HEAD and BODY are automatically inserted when the DOM object executes its saveHTML() method. So, in order to display only text content, the last line uses strip_tags() to remove all HTML tags.
Lastly, while the officially accepted answer is generally a viable alternative, it does not provide a complete solution for non-compliant HTML containing a STYLE tag after a BODY tag.
You have two options:
If there are no tags in your content, use strip_tags().
You could use a regex, e.g. with preg_match(). This is more complex, but there is always a suitable pattern.

php surf along html DOM callbacking node plain text contents

Scenario:
I need to apply a php function to the plain text contained inside HTML tags, and show the result, maintaining the original tags (with their original attributes).
Visualize:
Take this:
<p>Some text here pointing to the <a>moon</a> and that's it</p>
Return this:
<p>
phpFunction('Some text here pointing to the ')
<a>phpFunction('moon')</a>
phpFunction(' and that\'s it')
</p>
What I should do:
Use a PHP html parser (instead of using regexp) and iterate over every tag, applying the callback to the node text content.
Problem:
If I have, for example, an <a> tag inside a <p> tag, the text content of the parent <p> tag would consist of two different plain text parts, which the PHP callback should treat as separate.
Question:
How should I approach this in a clean and smooth way?
Thanks for your time, all the best.
In the end, I decided to use regex instead of including an external library.
For the sake of simplicity:
$expectedOutput = preg_replace_callback(
    '/>(.*)</U',
    function ($matches) {
        // $matches[1] is the text captured between ">" and "<".
        return '>' . doStuff($matches[1]) . '<';
    },
    $fromInput
);
This will look for everything between > and <, which is, indeed, what I was looking for.
Of course any suggestion/comment is still welcome.
Peace.
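For comparison, the parser-based approach described in the question needs no external library; a sketch with the built-in DOM extension follows. The helper name mapTextNodes(), the wrapper id "wrap", the sample href, and the use of strtoupper as the callback are all assumptions for illustration.

```php
<?php
// Hypothetical helper: applies $callback to each text node, leaving tags and attributes alone.
function mapTextNodes(string $html, callable $callback): string
{
    $dom = new DOMDocument();
    // Wrap the fragment so its top-level nodes can be found again after parsing.
    $dom->loadHTML('<div id="wrap">' . $html . '</div>', LIBXML_NOERROR);

    $xpath = new DOMXPath($dom);
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);

    // Each text node is visited on its own, so "a <a>b</a> c" yields three separate calls.
    foreach ($xpath->query('.//text()', $wrapper) as $textNode) {
        $textNode->nodeValue = $callback($textNode->nodeValue);
    }

    // Serialize only the wrapper's children, not the <html>/<body> added by the parser.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo mapTextNodes('<p>Some text here pointing to the <a href="#moon">moon</a> and that\'s it</p>', 'strtoupper'), "\n";
```

Unlike the regex, this never touches attribute values, and text in the parent and in a nested tag arrives at the callback as separate pieces.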

[php]how to extract a single simple text from a long html source

I have HTML like this:
......whatever very long html.....
<span class="title">hello world!</span>
......whatever very long html......
It is a very long HTML document and I only want the content 'hello world!' from it.
I got this HTML with:
$result = file_get_contents($url , false, $context);
Many people use the Simple HTML DOM parser, but I think that in this case a regex would be more efficient.
How should I do it? Any suggestions or help would be great.
Thanks in advance!
Stick with the DOM parser - it is better. Having said that, you could use a REGEX like this...
// where the html is stored in `$html`
preg_match('/<span class="title">(.+?)<\/span>/', $html, $m);
$whatYouWant = $m[1];
preg_match() stores an array of all the elements captured inside brackets in the regex, and a 0th element which is the entire captured string. The regex is very simple in this case, being almost a direct string match for what you want, with the closing span tag's slash escaped. The captured part just means any character (.) one or more times (+) un-greedily (?).
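To make the contents of the match array concrete, here is a small sketch; the sample $html string is an assumption standing in for the long page source. It also checks the return value, because $m[1] is not set when nothing matches.

```php
<?php
// Sample input standing in for the long page source (an assumption).
$html = '<div><span class="title">hello world!</span></div>';

$whatYouWant = null;
if (preg_match('/<span class="title">(.+?)<\/span>/', $html, $m) === 1) {
    // $m[0] is the whole matched string, $m[1] is the first capture group.
    $whatYouWant = $m[1];
}

echo $whatYouWant ?? 'no match', "\n";
```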
No, I really don't think regEx or similar functions would be either more effective or easier.
If you would use SimpleHTML DOM, you could quickly get the data you are looking for like this:
//Get your file
$html = file_get_html('myfile.html');
//Use jQuery style selectors; the index 0 returns the first match instead of an array
$spanValue = $html->find('span.title', 0)->plaintext;
echo($spanValue);
with preg_match you could do like this:
preg_match("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
or this, if there are multiple spans with the class "title":
preg_match_all("/<span class=\"title\">([^`]*?)<\/span>/", $data, $matches);
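If adding an external library is the concern, the DOM extension that ships with PHP can do the same lookup. The sketch below is one way to write it; the helper name findTitleSpans() is hypothetical, and the input is assumed to be a non-empty string.

```php
<?php
// Hypothetical helper: returns the text of every <span> whose class list contains "title".
function findTitleSpans(string $html): array
{
    $dom = new DOMDocument();
    // LIBXML_NOERROR suppresses warnings about sloppy real-world markup.
    $dom->loadHTML($html, LIBXML_NOERROR);

    $xpath = new DOMXPath($dom);
    // The concat/normalize-space idiom matches "title" as a whole class name,
    // so class="big title" matches but class="subtitle" does not.
    $query = '//span[contains(concat(" ", normalize-space(@class), " "), " title ")]';

    $titles = [];
    foreach ($xpath->query($query) as $span) {
        $titles[] = $span->textContent;
    }
    return $titles;
}

print_r(findTitleSpans('<p>x</p><span class="title">hello world!</span>'));
```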

php - preg_match string not within the href attribute

I find regex kind of confusing, so I got stuck with this problem:
I need to insert <b> tags around certain keywords in a given text. The problem is that if the keyword is within the href attribute, the result is a broken link.
the code goes like this:
$text = preg_replace('/(\b'.$keyword.'\b)/i','<b>\1</b>',$text);
so for cases like
this keyword here
i end up with:
this <b>keyword</b> here
i tried all sorts of combinations but i still couldn't get the right pattern.
thanks!
You can't do that with regex alone. Regular expressions are powerful, but they can't parse a recursive grammar like HTML.
Instead, you should properly parse the HTML using an existing HTML parser. You just echo the HTML as-is until you encounter a text node. In that case, you run your preg_replace on the text before echoing it.
If your HTML is valid XHTML, you can use the xml_parse function. If it's not, then use whatever HTML parser is available.
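As a rough sketch of that idea with PHP's DOM extension: only text nodes are searched, so attribute values such as href are never touched. The helper name boldKeyword(), the wrapper id "wrap", and the sample link are assumptions for illustration.

```php
<?php
// Hypothetical helper: wraps $keyword in <b> inside text nodes only, never in attributes.
function boldKeyword(string $html, string $keyword): string
{
    $dom = new DOMDocument();
    $dom->loadHTML('<div id="wrap">' . $html . '</div>', LIBXML_NOERROR);
    $xpath = new DOMXPath($dom);
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);

    // preg_quote() keeps special characters in the keyword from altering the pattern.
    $pattern = '/(\b' . preg_quote($keyword, '/') . '\b)/i';

    foreach ($xpath->query('.//text()', $wrapper) as $textNode) {
        // With DELIM_CAPTURE the odd-numbered parts are the matched keywords.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // No keyword in this text node.
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $bold = $dom->createElement('b');
                $bold->appendChild($dom->createTextNode($part));
                $fragment->appendChild($bold);
            } else {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo boldKeyword('<a href="keyword.php">this keyword here</a>', 'keyword'), "\n";
```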
You can use preg_replace again after the first replacement to remove b tags from href:
$text=preg_replace('#(href="[^"]*)<b>([^"]*)</b>#i',"$1$2",$text);
Yes, you can use regex like that, but the code might become a little convoluted.
Here is a quick example
$string = '<a href="keyword.php">link text with keyword and stuff</a>';
$keyword = 'keyword';
$text = preg_replace(
    '/(<a href=")('.$keyword.')(.php">)(.*)(<\/a>)/',
    "$1$2$3<b>$4</b>$5",
    $string
);
echo $string."\n";
echo $text."\n";
The contents of each () group are stored in the references $1, $2 ... $n, so I don't have to type stuff over again. The match can also be made more generic to match different kinds of URL syntax if needed.
Seeing this solution you might want to rethink the way you plan to do matching of keywords in your code. :)
output:
<a href="keyword.php">link text with keyword and stuff</a>
<a href="keyword.php"><b>link text with keyword and stuff</b></a>

php anchor tag regex

I have a bunch of strings, each containing an anchor tag and url.
string ex.
here is a link http://www.google.com. enjoy!
i want to parse out the anchor tags and everything in between.
result ex.
here is a link. enjoy!
the urls in the href= portion don't always match the link text however (sometimes there are shortened urls,sometimes just descriptive text).
i'm having an extremely difficult time figuring out how to do this with either regular expressions or php functions. how can i parse an entire anchor tag/link from a string?
thanks!
Looking at your result example, it seems like you're just removing the tags/content - did you want to keep what you stripped out or no? If not you might be looking for strip_tags().
You shouldn't use regex to parse HTML; use an HTML parser instead.
But if you must use regex, and your anchor tags' inner contents are guaranteed to be free of HTML like </a>, and each string is guaranteed to contain only one anchor tag as in the example case, then - only then - you can use something like:
Replacing /^(.+)<a.+<\/a>(.+)$/ with $1$2
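In PHP, that replacement could be written as in the sketch below. The sample string, including its href value, is an assumption, since the question's original markup is not shown here.

```php
<?php
// Assumed sample input: one anchor whose link text is the URL itself.
$str = 'here is a link <a href="http://www.google.com">http://www.google.com</a>. enjoy!';

// $1 is everything before the anchor, $2 is everything after it.
$result = preg_replace('/^(.+)<a.+<\/a>(.+)$/', '$1$2', $str);

echo $result, "\n";
```

The space that preceded the anchor is kept, so the result reads "here is a link . enjoy!" unless the whitespace is trimmed separately.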
Since your problem seems to be very specific, I think this should do it:
$str = preg_replace('#\s?<a.*/a>#', '', $str);
just use your normal PHP string functions.
$str='here is a link http://www.google.com. enjoy!';
$s = explode("</a>",$str);
foreach($s as $a=>$b){
    if( strpos( $b ,"href")!==FALSE ){
        $m=strpos("$b","<a");
        echo substr($b,0,$m);
    }
}
print end($s);
output
$ php test.php
here is a link . enjoy!
$string = 'here is a link http://www.google.com. enjoy!';
$text = strip_tags($string);
echo $text; // Note: strip_tags() removes only the tags themselves; the text inside the anchor is kept
