PHP Regex to remove HTML-Tag - php

I am looking for a way to search a string in PHP and remove "<pre ...>", "</pre>", and everything in between.
Example:
$string = 'Hello, I am a little text. <pre class="foo">This should be deleted.</pre> This is fine again.';
// Some magic function
$newString = 'Hello, I am a little text. This is fine again.';
Is there any way to do it? If I use strip_tags(), only the tags are removed, but not the content inside them.
Thank you very much!

I don't recommend it, but if it's just a small string, a regex would be alright here.
$newString = preg_replace('~<pre[^>]*>[^<]*</pre>~', '', $string);
However, I always use DOM when dealing with HTML/XML.
$doc = new DOMDocument;
$doc->loadHTML($html);
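// copy the live node list first; removing nodes while iterating it directly skips elements
foreach (iterator_to_array($doc->getElementsByTagName('pre')) as $tag) {
    $tag->parentNode->removeChild($tag);
}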
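To get from the question's string to the desired result with the DOM approach, the fragment also has to be loaded and serialized again without the html/body wrappers that loadHTML() normally adds. Here is a minimal sketch of that round trip, assuming PHP 7+ and a reasonably recent libxml; the helper name removeElementsByTag and the wrapping div are illustrative assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: removes every <$tagName> element (and its content) from an HTML fragment.
function removeElementsByTag(string $html, string $tagName): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep warnings about imperfect markup quiet
    // Assumption: wrapping the fragment in a <div> gives the parser a single root node.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName($tagName)) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $result = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

$string = 'Hello, I am a little text. <pre class="foo">This should be deleted.</pre> This is fine again.';
$newString = removeElementsByTag($string, 'pre');
echo $newString;
```

The whitespace that surrounded the removed element stays in place, so the result may contain a double space where the pre block used to be.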

I'd use @hwnd's parsing example below (or above); that's a lot safer than using regex.
You could use something like this:
/<(.*?)(\h*).*?>(.*?)<\/\1>/
Demo: https://regex101.com/r/cN9rL4/3
PHP Demo: https://eval.in/415470
echo preg_replace('/<(.*?)(\h*).*?>(.*?)<\/\1>/s', '', 'Hello, I am a little text. <pre class="foo">This should be deleted.</pre> This is fine again.');
Output:
Hello, I am a little text. This is fine again.
Edit: added s modifier in case the content exceeds one line, demo of failure https://regex101.com/r/cN9rL4/2.
Also note that this isn't specific to pre: it will replace any element it encounters that has a matching closing tag.
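If only pre elements should be removed, the pattern can be narrowed to that tag name. The following is a sketch rather than a robust solution: the trailing \s* is an added assumption to avoid a leftover double space, and like any regex approach it will misbehave on nested pre elements or on a ">" inside an attribute value.

```php
<?php
$string = 'Hello, I am a little text. <pre class="foo">This should be deleted.</pre> This is fine again.';

// <pre\b  : the literal tag name plus a word boundary, so <prefix> would not match
// [^>]*>  : any attributes, up to the end of the opening tag
// .*?     : the content, matched lazily so it stops at the first </pre>
// \s*     : (assumption) also swallow whitespace after the block
// i, s    : case-insensitive; let "." match newlines as well
$newString = preg_replace('~<pre\b[^>]*>.*?</pre>\s*~is', '', $string);

echo $newString;
```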

Related

preg_match_all how to remove img tag?

$str=<<<EOT
<img src="./img/upload_20571053.jpg" /><span>some word</span><div>some comtent</div>
EOT;
How to remove the img tag with preg_match_all or other way? Thanks.
I want to echo <span>some word</span><div>some comtent</div> // there may be other HTML tags in $str; just remove the img.
As many people said, you shouldn't do this with a regexp. Most of the examples you've seen to replace the image tags are naive and would not work in every situation. The regular expression to take into account everything (assuming you have well-formed XHTML in the first place), would be very long, very complex and very hard to understand or edit later on. And even if you think that it works correctly then, the chances are it doesn't. You should really use a parser made specifically for parsing (X)HTML.
Here's how to do it properly without a regular expression using the DOM extension of PHP:
// add a root node to the XHTML and load it
$doc = new DOMDocument;
$doc->loadXML('<root>'.$str.'</root>');
// create a xpath query to find and delete all img elements
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
// save the result
$str = $doc->saveXML($doc->documentElement);
// remove the root node
$str = substr($str, strlen('<root>'), -strlen('</root>'));
$str = preg_replace('#<img[^>]*>#i', '', $str);
$str = preg_replace("#\<img src\=\"(.+)\"(.+)\/\>#iU", '', $str);
echo $str;
?
In addition to @Karo96's answer, I would go broader:
/<img[^>]*>/i
And:
$re = '/<img[^>]*>/i';
$str = preg_replace($re,'',$str);
This also assumes the HTML is properly formatted. It also disregards the general rule that we should not parse HTML with regex, but I'm including it for the sake of answering the question.
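To make the "properly formatted" caveat concrete, here is a small sketch comparing the regex with a DOM-based removal on a made-up input whose alt attribute contains a ">". The input string and the wrapping div are assumptions for illustration only.

```php
<?php
// Hypothetical input: the alt attribute value contains a ">" character.
$str = '<img alt="a > b" src="x.png"><span>ok</span>';

// Regex: [^>]* stops at the ">" inside the attribute value,
// so only the first part of the tag is removed.
$viaRegex = preg_replace('/<img[^>]*>/i', '', $str);

// DOM: the parser knows where the tag really ends.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<div>' . $str . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
    $img->parentNode->removeChild($img);
}
$viaDom = '';
foreach ($doc->documentElement->childNodes as $child) {
    $viaDom .= $doc->saveHTML($child);
}

var_dump($viaRegex); // remnants of the img tag are still in the string
var_dump($viaDom);   // the img element is gone, the span is intact
```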
Perhaps you want preg_replace. It would then be: $str = preg_replace('#<img.+?>#is', '', $str), although it should be noted that for any non-trivial HTML processing you should use a proper parser, for example DOMDocument.
$noimg = preg_replace('/<img[^>]*>/','',$str);
Should do the trick.
Don't use regexes for this. Period. This isn't exactly parsing and it might be trivial, but regexes aren't made for the DOM:
RegEx match open tags except XHTML self-contained tags
Just use DOMDocument, for instance.

automatic link creation using php without breaking the html tags

I want to convert plain-text links in my page content into active links using PHP. I have tried every script I could find; they all work, but they also convert the URLs inside img src attributes. They convert links everywhere and break the HTML.
I found a good script that does exactly what I want, but it is in JavaScript. It is called jquery-linkify.
You can find the script here:
http://github.com/maranomynet/linkify/
The trick in that script is that it converts text links without breaking the HTML. I tried to port the script to PHP but failed.
I can't use the script on my website because other scripts there conflict with jQuery.
Could anyone rewrite this script in PHP, or at least guide me on how to do it?
Thanks.
First, parse the text with an HTML parser, with something like DOMDocument::loadHTML. Note that poor HTML can be hard to parse, and depending on the parser, you might get slightly different output in the browser after running such a function.
PHP's DOMDocument isn't very flexible in that regard. You may have better luck by parsing with other tools. But if you are working with valid HTML (and you should try to, if it's within your control), none of that is a concern.
After parsing the text, you need to look at the text nodes for links and replace them. Using a regular expression is the simplest way.
Here's a sample script that does just that:
<?php
function linkify($text)
{
$re = "@\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+\@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:\@&=+$,%#-]+)*/?)@";
preg_match_all($re, $text, $matches, PREG_OFFSET_CAPTURE);
$matches = $matches[0];
$i = count($matches);
while ($i--)
{
$url = $matches[$i][0];
if (!preg_match('#^https?://#', $url))
$url = 'http://'.$url;
$text = substr_replace($text, '<a href="'.$url.'">'.$matches[$i][0].'</a>', $matches[$i][1], strlen($matches[$i][0]));
}
return $text;
}
$dom = new DOMDocument();
$dom->loadHTML('<b>stackoverflow.com</b> test');
$xpath = new DOMXpath($dom);
foreach ($xpath->query('//text()') as $text)
{
$frag = $dom->createDocumentFragment();
$frag->appendXML(linkify($text->nodeValue));
$text->parentNode->replaceChild($frag, $text);
}
echo $dom->saveHTML();
?>
I did not come up with that regular expression, and I cannot vouch for its accuracy. I also did not test the script, except for this above case. However, this should be more than enough to get you going.
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<b><a href="http://stackoverflow.com">stackoverflow.com</a></b>
test
</body>
</html>
Note that saveHTML() adds the surrounding tags. If that's a problem, you can strip them out with substr().
Use a HTML parser and only search for URLs within text nodes.
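A minimal sketch of that idea follows, assuming PHP 7+. It only touches text nodes that are not already inside a link, script, or style element, and it builds the anchors through the DOM API so that attribute values such as img src are never rewritten. The function name linkifyTextNodes and the deliberately simple URL pattern (http/https only) are assumptions for illustration, not a port of jquery-linkify.

```php
<?php
// Hypothetical helper: turns bare http(s) URLs in text nodes into links.
function linkifyTextNodes(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Assumption: a wrapping <div> gives the fragment a single root node.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        // With DELIM_CAPTURE the URLs are kept and always land on odd indexes.
        $parts = preg_split('~(https?://[^\s<>"\']+)~i', $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($dom->createTextNode($part));
                $frag->appendChild($a);
            } else {
                $frag->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($frag, $textNode);
    }

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$linked = linkifyTextNodes('See http://example.com and <img src="http://example.com/image.jpg">');
echo $linked;
```

The simple pattern will also swallow trailing punctuation such as a full stop directly after a URL, so a stricter expression is probably needed for real content.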
I think the trick is in keeping track of the single (') and double (") quotes in your PHP code and nesting them correctly, so that you put ' inside " or vice versa.
For Example,
<?PHP
//old html tags
echo "<h1>Header1</h1>";
echo "<div>some text</div>";
//your added links
echo "<p><a href='link1.php'>Link1</a><br>";
echo "<a href='link1.php'>Link1</a></p>";
//old html tags
echo "<h1>Another Header</h1>";
echo "<div>some text</div>";
?>
I hope this helps you ..
$text = 'Any text ... link http://example123.com and image <img src="http://exaple.com/image.jpg" />';
$text = preg_replace('!([^\"])(http:\/\/(?:[\w\.]+))([^\"])!', '\\1<a href="\\2">\\2</a>\\3', $text);

file_get_contents and div

What's wrong with my code?
I wish to get all dates from
but my array is empty.
<?php
$url = "http://weather.yahoo.com/";
$page_all = file_get_contents($url);
preg_match_all('#<div id="myLocContainer">(.*)</div>#', $page_all, $div_array);
echo "<pre>";
print_r($div_array);
echo "</pre>";
?>
Thanks
You want to match content that spans multiple lines, but by default "." does not match newlines. The s (dot-all) modifier changes that.
Try using this:
preg_match_all('#<div id="myLocContainer">(.*?)</div>#sim', $page_all, $div_array);
Please note that regular expressions are not suitable for parsing HTML content because of the hierarchical nature of HTML documents.
Try adding the "m" and "s" modifiers; there might be newlines inside the div you need. Like this:
preg_match_all('#<div id="myLocContainer">(.*)</div>#ms', $page_all, $div_array);
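The effect of the s modifier can be checked on a small multi-line string without fetching the real page. In this sketch the sample markup is a made-up stand-in for the downloaded HTML. Note that "s" is the modifier that matters here; "m" only changes how ^ and $ behave.

```php
<?php
// Hypothetical stand-in for the downloaded page: the div content spans several lines.
$page_all = "<div id=\"myLocContainer\">\nline one\nline two\n</div>";

// Without "s", "." cannot cross a newline, so the pattern does not match at all.
$withoutS = preg_match('#<div id="myLocContainer">(.*?)</div>#', $page_all, $m1);

// With "s", "." also matches newlines; the lazy ".*?" stops at the first </div>.
$withS = preg_match('#<div id="myLocContainer">(.*?)</div>#s', $page_all, $m2);

var_dump($withoutS, $withS, $m2[1] ?? null);
```

Because the lazy match stops at the first closing tag, a div nested inside the container would still cut the result short, which is one reason the answers here point towards a parser.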
Before messing around with regex, try HTML scraping. "HTML Scraping in PHP" might give you some ideas on how to do it in a more elegant and (possibly) faster way.
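$doc = new DOMDocument;
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings quietly
$doc->loadHTMLFile('http://weather.yahoo.com/'); // load() expects XML; loadHTMLFile() parses HTML
$div = $doc->getElementById('myLocContainer');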
You may need to escape special characters in your regular expression, like the following:
~\<div id\=\"myLocContainer\"\>(.*)\<\/div\>~
Also check whether there is a newline problem, as mentioned by @eyazici and @kgb.
Test your response before running the regex search. Then you'll know which part isn't working.

replace <br> to new line between pre tag

I want to convert
<p>Code is following</p>
<pre>
<html><br></html>
</pre>
to
<p>Code is following</p>
<pre>
<html>
</html>
</pre>
I don't know how to write a regular expression that replaces only between pre tags in PHP.
I tried the code from "Replace newlines with BR tags, but only inside PRE tags",
but it's not working for me.
Which answer are you using code from?
Assuming it was the accepted answer, just reverse the preg_replace() line as follows:
$parts[$idx] = preg_replace('#<br\s*/?>#', "\n", $part);
You shouldn't use regex to match html tags because it is theoretically impossible.
There are some PHP libraries for HTML parsing out there; a quick search on Google turned up this one: http://simplehtmldom.sourceforge.net/
Try to get the code between the "pre" tags and use a simple regex on this.
Try this:
$newtext = preg_replace_callback('#<pre\b[^>]*>.*?</pre>#si', function ($m) { return preg_replace('#<br\s*/?>#i', "\n", $m[0]); }, $text);
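if (preg_match("/<pre>.*(<br(|\s*\/)>).*<\/pre>/s", $str)) {
    $str = preg_replace("/(<br(|\s*\/)>)/", "\n", $str);
}
Works just the same. Replaces <br>, <br/>, and <br /> with a newline once a <br> is found inside <pre>...</pre>. Note that the replacement then applies to every <br> in $str, not only to those inside the <pre> block.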
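For completeness, the same replacement can be restricted to pre elements with the DOM extension instead of a regex. This is a sketch under two assumptions: the helper name brToNewlineInPre is made up, and the code shown inside pre is HTML-escaped in the real source (for example &lt;html&gt;), since an HTML parser would otherwise treat it as markup.

```php
<?php
// Hypothetical helper: replaces <br> elements with newlines, but only inside <pre>.
function brToNewlineInPre(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Assumption: a wrapping <div> gives the fragment a single root node.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // "//pre//br" selects br elements at any depth below a pre element.
    // XPath results are a static list, so replacing nodes while looping is safe.
    foreach ($xpath->query('//pre//br') as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    // Serialize the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$converted = brToNewlineInPre('<p>a<br>b</p><pre>x<br>y<br/>z</pre>');
echo $converted;
```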

How to remove text between tags in php?

Despite using PHP for years, I've never really learnt how to use expressions to truncate strings properly... which is now biting me in the backside!
Can anyone provide me with some help truncating this? I need to chop out the text portion from the link, turning
<a href="...">text</a>
into
<a href="..."></a>
$str = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $str)
Using SimpleHTMLDom:
<?php
// example of how to modify anchor innerText
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
// empty the inner text of each anchor (the property name is lowercase "innertext")
foreach($html->find('a') as $e) {
    $e->innertext = '';
}
// dump contents
echo $html;
?>
What about something like this, considering you might want to re-use it with other hrefs :
$str = 'text';
$result = preg_replace('#(<a[^>]*>).*?(</a>)#', '$1$2', $str);
var_dump($result);
Which will get you :
string '' (length=24)
(I'm considering you made a typo in the OP ? )
If you don't need to match any other href, you could use something like :
$str = 'text';
$result = preg_replace('#().*?()#', '$1$2', $str);
var_dump($result);
Which will also get you :
string '' (length=24)
As a side note: for more complex HTML, don't try to use regular expressions. They work fine for this kind of simple situation, but for real-life HTML they generally don't help much: HTML is not "regular" enough to be parsed by regexes.
You could use substr() in combination with strpos(), even though this is not
a very nice approach.
Check: PHP Manual - String functions
Another way would be to write a regular expression to match your criteria.
But in order to get your problem solved quickly the string functions will do...
EDIT: I underestimated the audience. ;) Go ahead with the regexes... ^^
You don't need to capture the tags themselves. Just target the text between the tags and replace it with an empty string. Super simple.
Demo of both techniques
Code:
$string = 'text';
echo preg_replace('/<a[^>]*>\K[^<]*/', '', $string);
// the opening tag--^^^^^^^^  ^^^^^-match everything before the end tag
//                          ^^-restart fullstring match
Output:
Or in fringe cases when the link text contains a <, use this: ~<a[^>]*>\K.*?(?=</a>)~
This avoids the expense of capture groups using a lazy quantifier, the fullstring restarting \K and a "lookahead".
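Both patterns can be tried on a self-contained input. In this sketch the href value is made up for illustration, and the \b after the tag name is an added assumption so that tags such as <abbr> are not treated as anchors.

```php
<?php
// Hypothetical input; the href value is made up for illustration.
$string = '<a href="page.html">text</a>';

// \K drops the opening tag from the reported match, so only the link text is replaced.
$simple = preg_replace('/<a\b[^>]*>\K[^<]*/', '', $string);

// Variant for link text that contains "<": lazy match up to a lookahead for the closing tag.
$tricky = preg_replace('~<a\b[^>]*>\K.*?(?=</a>)~s', '', '<a href="page.html">1 < 2</a>');

echo $simple, "\n", $tricky, "\n";
```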
Older & wiser:
If you are parsing valid html, you should use a dom parser for stability/accuracy. Regex is DOM-ignorant, so if there is a tag attribute value containing a >, my snippet will fail.
As a narrowly suited domdocument solution to offer some context:
$dom = new DOMDocument;
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // flags: don't add implied <html>/<body> wrappers or a DOCTYPE
$dom->getElementsByTagName('a')[0]->nodeValue = '';
echo $dom->saveHTML();
Just use strip_tags(); that would get rid of the tags and leave only the text between them.
