I have a collection of text that I am processing dynamically with PHP (the data comes from an XML file), and I want to strip each <a> link together with the text that is linked.
PHP's strip_tags removes the <a ...> and </a> tags but not the text in between.
I am currently trying the regex preg_replace('#(<a.*?>).*?(</a>)#', '', $content);
Another thing to note is that the links have styles, classes, hrefs and titles.
Does anyone know the solution?
Try this:
$content=preg_replace('/<a[^>]*>(.*)<\/a>/iU','',$content);
You can use DOMDocument, for example (untested!):
$doc = new DOMDocument();
$doc->loadHTMLFile('foo.php');
$domNodeList = $doc->getElementsByTagName('a');
// The node list is live, so remove from the end to avoid skipping nodes
for ($i = $domNodeList->length - 1; $i >= 0; $i--) {
    $node = $domNodeList->item($i);
    $node->parentNode->removeChild($node);
}
$doc->saveHTMLFile('output.html');
Or using Simple HTML DOM Parser:
$html = file_get_html('http://www.example.com/');
foreach($html->find('a') as $element) {
$element->outertext = '';
}
$html->save('output.html');
Because the a element is not the only one that can break your page, you are better off using a whitelist approach, like strip_tags().
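To illustrate the whitelist idea, here is a minimal sketch. The sample string and the choice of allowed tags are made up for the example. Note that strip_tags() keeps the text inside the tags it removes, so on its own it does not drop the link text.

```php
<?php
// Hypothetical sample input; only <p> and <b> are on the whitelist.
$content = '<p>Hello <a href="#" class="x" title="t">link text</a> <b>world</b></p>';

// strip_tags() drops every tag that is not listed, but keeps the text inside it.
$clean = strip_tags($content, '<p><b>');

echo $clean, "\n"; // <p>Hello link text <b>world</b></p>
```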
I used the solution(s) posted as comments, they seemed to work best and were exactly what I was looking for!
"For reference, you've grouped the anchor tags but not the content, which is where the problem lies. preg_replace replaces the grouped element (those included in parenthesis). You can try the following though: #(<a[^>]*?>.*?</a>)#i (i flag for a case insensitive compare)" – Brad Christie
"briefly tested shorter regex version, just for fun :) preg_replace ('/<(?:a|\/)[^>]*>/', '', $data);" – Cyber-Guard Design yesterday
With regex, but not thoroughly tested
echo preg_replace('#(<a.*?>)(.*?)(<\/a>)#','$2', $str);
Also, the limit argument set to -1 will set it to no limit.
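A small sketch of how the replacement string and the limit argument interact. The sample string and the slightly tightened pattern (with \b so that tags such as <abbr> are not matched) are my own assumptions, not taken from the answers above.

```php
<?php
// Hypothetical input with two links.
$str = '<a href="/one" class="x">one</a> and <a href="/two">two</a>';
$pattern = '#<a\b[^>]*>(.*?)</a>#is';

// Keep the link text, drop the tags (default limit is -1, i.e. no limit).
$textKept = preg_replace($pattern, '$1', $str);

// Same, but only the first match is replaced.
$firstOnly = preg_replace($pattern, '$1', $str, 1);

// Drop the tags and the linked text.
$allGone = preg_replace($pattern, '', $str);

echo $textKept, "\n";  // one and two
echo $firstOnly, "\n"; // one and <a href="/two">two</a>
echo $allGone, "\n";   // " and " (without the quotes)
```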
The regex solution suggested in
PHP RegEx remove empty paragraph tags
#<p>(\s|&nbsp;|</?\s?br\s?/?>)*</?p>#
fails on my example string:
<p><br></p><div align="justify"><b>Some Text</b></div><p></p>
and I can't figure out why.
See Live Regex here
http://www.phpliveregex.com/p/6ID
You really shouldn't set about modifying a DOM using regex. There are DOM parsers to do this kind of thing. It's not even that hard:
$html = '<p><br></p><div align="justify"><b>Some Text</b></div>
<p>foobar</p>
<p></p>';//empty
$dom = new DOMDocument;
$dom->loadHTML($html);
$pars = $dom->getElementsByTagName('p');
// copy the live node list first, so removing nodes doesn't skip any
foreach (iterator_to_array($pars) as $tag)
{
    if (trim($tag->textContent) === '')
    {
        $tag->parentNode->removeChild($tag);
    }
}
That's all. You simply select all of the p tags, then check whether each one's trimmed text content is empty. If it is, remove the node by selecting its parent and invoking the DOMNode::removeChild method.
The snippet above removes 2 of the 3 paragraph nodes; the one containing foobar is left as is. I think that's what you are trying to do...
To get the actual dom fragment, after removing the tags that needed to be removed, you can simply do this:
echo trim(
substr(
$dom->saveHTML($dom->documentElement),//omit doctype
12, -14//12 => <html><body> and -14 for </body></html>
)
);
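As an alternative to slicing the output by character offsets, the children of <body> can be serialised one by one. This is only a sketch under the assumption that the fragment is loaded with a plain loadHTML() call as above; the exact line breaks in the output depend on the libxml version, so they are trimmed rather than relied on.

```php
<?php
// Sample input from the question, plus one non-empty paragraph.
$html = '<p><br></p><div align="justify"><b>Some Text</b></div><p>foobar</p><p></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// Snapshot the live node list before removing anything from it.
foreach (iterator_to_array($dom->getElementsByTagName('p')) as $p) {
    if (trim($p->textContent) === '') {
        $p->parentNode->removeChild($p);
    }
}

// Serialise each child of <body> instead of cutting off the wrappers by offset.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}

echo trim($fragment), "\n";
```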
In your Live Regex example you were using double delimiters; see http://www.phpliveregex.com/p/6II for a working example. Also, since the pre-defined delimiter there is /, you need to escape the slashes in the pattern (as in the example).
EDIT: In general though, it's best to follow Jay's suggestion and not use regex for this kind of task.
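For reference, a quick sketch of the delimiter point, run against the example string from the question. The &nbsp; alternative in the pattern is an assumption about what the original regex contained. With # as the delimiter the slashes need no escaping; with / they all do.

```php
<?php
$html = '<p><br></p><div align="justify"><b>Some Text</b></div><p></p>';

// "#" as delimiter: slashes inside the pattern need no escaping.
$withHash = preg_replace('#<p>(\s|&nbsp;|</?\s?br\s?/?>)*</?p>#', '', $html);

// "/" as delimiter: every literal slash in the pattern must be escaped.
$withSlash = preg_replace('/<p>(\s|&nbsp;|<\/?\s?br\s?\/?>)*<\/?p>/', '', $html);

echo $withHash, "\n";               // <div align="justify"><b>Some Text</b></div>
var_dump($withHash === $withSlash); // bool(true)
```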
I was reading this article. This function that it includes:
<?php
function getFirstPara($string){
$string = substr($string,0, strpos($string, "</p>")+4);
return $string;
}
?>
...seems to return the first <p> found in the string. But how could I get the first HTML element (p, a, div, ...) in the string, whatever it is (much like :first-child in CSS)?
It's generally recommended to avoid string parsing methods to interrogate html.
You'll find that html comes with so many edge cases and parsing quirks that however clever you think you've been with your code, html will come along and whack you over the head with a string that breaks your tests.
I would highly recommend you use a php dom parsing library (free and often included by default with php installs).
For example DomDocument:
$dom = new \DOMDocument;
$dom->loadHTML('<p>One</p><p>Two</p><p>Three</p>');
$elements = $dom->getElementsByTagName('body')->item(0)->childNodes;
print '<pre>';
var_dump($elements->item(0));
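Building on that, here is a rough sketch that returns the markup of the first element, whatever its tag. The function name firstElementHtml is made up for this example. Keep in mind that libxml may normalise the markup while parsing (for example, bare leading text gets wrapped in a <p>), so treat the result as the parser's view of the string.

```php
<?php
// Hypothetical helper: returns the first element of an HTML fragment, or null.
function firstElementHtml(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Skip text and comment nodes; return the first real element.
    foreach ($body->childNodes as $node) {
        if ($node->nodeType === XML_ELEMENT_NODE) {
            return trim($dom->saveHTML($node));
        }
    }

    return null;
}

echo firstElementHtml('<div>One</div><p>Two</p><p>Three</p>'), "\n"; // <div>One</div>
```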
You could use http://php.net/strstr, as in the article.
First search for "<p>"; this gives you the string from the first occurrence to the end:
$first = strstr($html, '<p>');
Then search for "</p>" in that result and skip past it; what is left is all the HTML you don't want to keep:
$second = substr(strstr($first, '</p>'), strlen('</p>'));
Then remove the unwanted HTML:
$final = str_replace($second, "", $first);
The same method could be used to get the first child by looking for "<" and "</$" in the result from before. You will need to check the first word after the < to find the right end tag.
$str=<<<EOT
<img src="./img/upload_20571053.jpg" /><span>some word</span><div>some comtent</div>
EOT;
How can I remove the img tag with preg_match_all or some other way? Thanks.
I want to echo <span>some word</span><div>some comtent</div> // there may be other HTML tags in $str; just remove the img.
As many people said, you shouldn't do this with a regexp. Most of the examples you've seen to replace the image tags are naive and would not work in every situation. The regular expression to take into account everything (assuming you have well-formed XHTML in the first place), would be very long, very complex and very hard to understand or edit later on. And even if you think that it works correctly then, the chances are it doesn't. You should really use a parser made specifically for parsing (X)HTML.
Here's how to do it properly without a regular expression using the DOM extension of PHP:
// add a root node to the XHTML and load it
$doc = new DOMDocument;
$doc->loadXML('<root>'.$str.'</root>');
// create a xpath query to find and delete all img elements
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
// save the result
$str = $doc->saveXML($doc->documentElement);
// remove the root node
$str = substr($str, strlen('<root>'), -strlen('</root>'));
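If the input is not guaranteed to be well-formed XHTML (for example an <img> without the closing slash), loadXML will refuse it. A possible variation, sketched here under the assumption that libxml's HTML parser is acceptable for your input, is to load the fragment with loadHTML inside a temporary wrapper element. The function name removeImages is made up for the example.

```php
<?php
// Hypothetical helper: removes every <img> from an HTML fragment.
function removeImages(string $fragment): string
{
    $doc = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup quietly

    // The wrapper <div> gives the fragment a single root; the flags stop
    // libxml from adding <html>/<body> and a DOCTYPE.
    $doc->loadHTML('<div>' . $fragment . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Snapshot the live list, then remove each image.
    foreach (iterator_to_array($doc->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // Serialise the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

$str = '<img src="./img/upload_20571053.jpg" /><span>some word</span><div>some comtent</div>';
echo removeImages($str), "\n";
```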
$str = preg_replace('#<img[^>]*>#i', '', $str);
$str = preg_replace("#\<img src\=\"(.+)\"(.+)\/\>#iU", '', $str);
echo $str;
?
In addition to @Karo96, I would go more broad:
/<img[^>]*>/i
And:
$re = '/<img[^>]*>/i';
$str = preg_replace($re,'',$str);
This also assumes the html will be properly formatted. Also, this disregards the general rule that we should not parse html with regex, but for the sake of answering you I'm including it.
Perhaps you want preg_replace. It would then be: $str = preg_replace('#<img.+?>#is', '', $str); although it should be noted that for any non-trivial HTML processing you should use a proper parser, for example DOMDocument.
$noimg = preg_replace('/<img[^>]*>/','',$str);
Should do the trick.
Don't use regexes for this. Period. This isn't exactly parsing, and it might be trivial, but regexes aren't made for the DOM:
RegEx match open tags except XHTML self-contained tags
Just use DomDocument for instance.
I'd like to only remove the anchor tags and the actual urls.
For instance, test www.example.com would become test.
Thanks.
I often use:
$string = preg_replace('/<\/?a(\s[^>]*)?>/i', '', $string);
And remember that strip_tags can remove all the tags from a string except the ones specified in a whitelist. That's not what you want, but I mention it for completeness.
EDIT: I found the original source where I got that regex. I want to cite the author, for fairness: http://bavotasan.com/tutorials/using-php-to-remove-an-html-tag-from-a-string/
You should consider using PHP's DOM extension for this job.
Regex is not the right tool for HTML parsing.
Here is an example:
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the html's contents into DOM
$xml->loadHTML($html);
$links = $xml->getElementsByTagName('a');
//Loop through each <a> tags and replace them by their text content
for ($i = $links->length - 1; $i >= 0; $i--) {
$linkNode = $links->item($i);
$lnkText = $linkNode->textContent;
$newTxtNode = $xml->createTextNode($lnkText);
$linkNode->parentNode->replaceChild($newTxtNode, $linkNode);
}
Note:
It's important to loop backwards here: getElementsByTagName returns a live list, so each <a> disappears from it as soon as it is replaced by a text node. A forward loop would skip some of the links and leave them unreplaced.
This code doesn't remove urls from the text inside a node, you can use the preg_replace from nico on $lnkText before the createTextNode line. It's always better to isolate parts from html using DOM, and then use regular expressions on these text only parts.
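Putting the two notes together, a rough sketch could look like the following. The function name stripLinksAndUrls, the wrapper <div> and the combined URL pattern are my own assumptions for illustration; the pattern only covers simple host names that start with www.

```php
<?php
// Hypothetical helper: replaces each <a> by its text, with simple URLs removed.
function stripLinksAndUrls(string $html): string
{
    $doc = new DOMDocument;
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so it has one root; skip <html>/<body> and DOCTYPE.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = $doc->getElementsByTagName('a');
    // Loop backwards because the list shrinks as links are replaced.
    for ($i = $links->length - 1; $i >= 0; $i--) {
        $link = $links->item($i);
        // Assumed pattern: optional scheme, then "www." and a host name.
        $text = preg_replace('~(?:https?://)?www\.[a-z0-9.-]+~i', '', $link->textContent);
        $link->parentNode->replaceChild($doc->createTextNode($text), $link);
    }

    // Serialise the wrapper's children, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo stripLinksAndUrls('<p>test <a href="http://www.example.com" class="c">www.example.com</a></p>'), "\n";
```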
To complement gd1's answer, this will remove the URLs themselves (host names only; the patterns do not cover paths, hyphens or query strings):
// http(s)://
$txt = preg_replace('|https?://www\.[a-z\.0-9]+|i', '', $txt);
// only www.
$txt = preg_replace('|www\.[a-z\.0-9]+|i', '', $txt);
Despite using PHP for years, I've never really learnt how to use expressions to truncate strings properly... which is now biting me in the backside!
Can anyone provide me with some help truncating this? I need to chop out the text portion from the url, turning
text
into
$str = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $str);
Using SimpleHTMLDom:
<?php
// example of how to modify anchor innerText
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
// set innertext to an empty string for each anchor
foreach($html->find('a') as $e) {
    $e->innertext = '';
}
// dump contents
echo $html;
?>
What about something like this, considering you might want to re-use it with other hrefs :
$str = 'text';
$result = preg_replace('#(<a[^>]*>).*?(</a>)#', '$1$2', $str);
var_dump($result);
Which will get you :
string '' (length=24)
(I'm considering you made a typo in the OP ? )
If you don't need to match any other href, you could use something like :
$str = 'text';
$result = preg_replace('#().*?()#', '$1$2', $str);
var_dump($result);
Which will also get you :
string '' (length=24)
As a sidenote : for more complex HTML, don't try to use regular expressions : they work fine for this kind of simple situation, but for a real-life HTML portion, they don't really help, in general : HTML is not quite "regular" "enough" to be parsed by regexes.
You could use substr in combination with strpos, even though this is not
a very nice approach.
Check: PHP Manual - String functions
Another way would be to write a regular expression to match your criteria.
But in order to get your problem solved quickly the string functions will do...
EDIT: I underestimated the audience. ;) Go ahead with the regexes... ^^
You don't need to capture the tags themselves. Just target the text between the tags and replace it with an empty string. Super simple.
Code:
$string = 'text';
echo preg_replace('/<a[^>]*>\K[^<]*/', '', $string);
// the opening tag--^^^^^^^^  ^^^^^-match everything before the end tag
//                          ^^-restart fullstring match
Output:
Or in fringe cases when the link text contains a <, use this: ~<a[^>]*>\K.*?(?=</a>)~
This avoids the expense of capture groups using a lazy quantifier, the fullstring restarting \K and a "lookahead".
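Here is a runnable sketch of both patterns side by side. The sample anchors, including the href and class values, are invented for the example.

```php
<?php
// Hypothetical sample anchors.
$simple = '<a href="/about/" class="nav">text</a>';
$fringe = '<a href="/x">1 < 2</a>';

// \K discards the opening tag from the match, so only the link text is replaced.
$emptied = preg_replace('/<a[^>]*>\K[^<]*/', '', $simple);

// A lazy match plus a lookahead copes with a "<" inside the link text.
$emptiedFringe = preg_replace('~<a[^>]*>\K.*?(?=</a>)~', '', $fringe);

echo $emptied, "\n";       // <a href="/about/" class="nav"></a>
echo $emptiedFringe, "\n"; // <a href="/x"></a>
```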
Older & wiser:
If you are parsing valid html, you should use a dom parser for stability/accuracy. Regex is DOM-ignorant, so if there is a tag attribute value containing a >, my snippet will fail.
As a narrowly suited domdocument solution to offer some context:
$dom = new DOMDocument;
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // 2nd param: no implied <html>/<body> wrappers and no DOCTYPE
$dom->getElementsByTagName('a')[0]->nodeValue = '';
echo $dom->saveHTML();
Just use strip_tags(); it gets rid of the tags and leaves only the text between them.