Find Links and Remove them from HTML - php

How can I look for links in HTML and remove them?
$html = '<p>Test Title 1</p>';
$html .= '<p>Test Title 2</p>';
$html .= '<p>Test Title 3</p>';
$match = '<a href="javascript:doThis(\'Test Title 2\')">';
I want to remove the anchor but display the text. See below:
Test Title 1
Test Title 2
Test Title 3
I've never used regular expressions before, but maybe I can avoid them as well. Let me know if I'm not clear.
Thanks
Mark
EDIT: It's not a client-side thing. I can't use JavaScript for this. I have a custom CMS and want to edit HTML stored in a database.

You may try the simplest thing:
echo strip_tags($html, '<p>');
This strips all tags except <p>, so the link text is kept.
If you really like regexp:
echo preg_replace('=</?a(\s[^>]*)?>=ims', '', $html);
EDIT:
Delete the <a> tag AND its surrounding tags (the code gets messy and doesn't work with broken (X)HTML):
echo preg_replace('=<([a-z]+)[^>]*>\s*<a(\s[^>]*)?>(.*?)</a>\s*</\\1>=ims', '$3', $html);
However, if your problem is that complicated, I recommend that you try XPath.
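For what it's worth, here is a minimal sketch of the XPath route using PHP's DOMDocument and DOMXPath. The function name unwrapLinks and the wrapper id are made up for this example, and it assumes the markup is plain ASCII or already entity-encoded (loadHTML does not assume UTF-8 by default).

```php
<?php
// Hypothetical helper: removes every <a> element but keeps its children (the link text).
function unwrapLinks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup without warnings
    // Wrap the fragment in a container with a made-up id so we can find it again.
    $doc->loadHTML('<div id="unwrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // iterator_to_array() takes a snapshot, because the tree is modified in the loop.
    foreach (iterator_to_array($xpath->query('//a')) as $a) {
        // Move each child of the link in front of the link, then drop the empty link.
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only the children of the container, not the added <html>/<body>.
    $out = '';
    $root = $xpath->query('//div[@id="unwrap-root"]')->item(0);
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapLinks('<p><a href="#">Test Title 2</a></p>');
```

Unlike the regex versions, this does not care how the attributes of the link are written or ordered.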

You could see if Simple HTML DOM does the trick.

You might have some joy with Beautiful Soup - http://www.crummy.com/software/BeautifulSoup/ (Python HTML parsing / manipulation API)

sed -i -e 's/<a.*<\/a>//g' filename.html
Note that using regular expressions for hacking HTML is a... dubious proposition, but it might just work in practice ;-)

You can use
var foo = document.getElementsByTagName('a');
to fetch all the link tags. No need for regular expressions here...
EDIT: I'm just learning to read... ;) Go with PHP's DOM or XML abilities. It should be pretty easy using those.

Open the HTML file in Microsoft Expression.
Press Ctrl+F and then choose replace tag or tag attributes contents.
Easy and quick solution.
Thanks
Shomaail

Related

Replace the content inside a DIV

I have a div called
<div id="form">Content</div>
and I want to replace the content of the div with new content using preg_replace.
What regex should be used?
You shouldn't be using a regex at all. HTML can come in many forms, and you would need to take all of them into account. What if the id/class doesn't come in the place you expect? The regex would have to be really complex to get you reasonable results.
Instead, you should use a DOM parser - or a really cool tool I recently stumbled across, phpQuery. With it, you can access your document in PHP almost exactly as you would with jQuery.
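As a rough illustration of the DOM parser approach (using PHP's built-in DOMDocument rather than phpQuery), something like the following could work. The function name replaceContentById is invented for this sketch, and it assumes the input is a single root element and that the id contains no double quotes.

```php
<?php
// Hypothetical helper: replaces the contents of the element with the given id by plain text.
function replaceContentById(string $html, string $id, string $newText): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // These flags ask libxml not to add <html>/<body> wrappers or a doctype.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes, so it can be embedded directly.
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return $html; // nothing to replace
    }

    // Remove the old children, then add the new text as a text node (it gets escaped on output).
    while ($node->firstChild) {
        $node->removeChild($node->firstChild);
    }
    $node->appendChild($doc->createTextNode($newText));

    return trim($doc->saveHTML());
}

echo replaceContentById('<div id="form">Content</div>', 'form', 'New Content');
```

The lookup is by id, so it does not matter where the id attribute sits among the other attributes.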
This will work in your case:
$html = '<div id="content">Content</div>';
$html = preg_replace('/(<\s*div[^>]*>)[^<]*(<\s*\/div\s*>)/', '$1New Content$2', $html);
echo $html; // <div id="content">New Content</div>
However note that since HTML is not a regular language it is impossible to handle all cases. The simple regex I provided will produce bad output in the following example:
<div class=">">Content</div>

How do I grab part of a page's HTML DOM with PHP?

I'm grabbing data from a published Google spreadsheet, and all I want is the information inside of the content div (<div id="content">...</div>).
I know that the content starts off as <div id="content"> and ends as </div><div id="footer">
What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure if it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess another question I have is basically about if it is just simpler and easier to use regex with start and end points rather than trying to parse through a DOM which might have errors and then extract the piece I need. Seems like regex would be the way to go but would love to hear your opinions.
Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); the s modifier changes the . to include newlines. As it is, your regex would need all the content between the tags to be on the same line to match.
If your HTML page is any more complex than that, then regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM.
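A small sketch of the regex route, with two changes that are my own assumptions rather than part of the answer above: it uses preg_match to pull the piece out (preg_replace would only delete the two markers and leave the rest of the page in place), and it runs the markers through preg_quote. The function name extractBetween and the sample page are made up.

```php
<?php
// Hypothetical helper: returns the text between two literal markers, or null if not found.
function extractBetween(string $page, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in the markers; '#' is the delimiter.
    // The s modifier lets . match newlines, so the content may span several lines.
    $pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';
    return preg_match($pattern, $page, $m) ? $m[1] : null;
}

// Stand-in for the downloaded page; the real page content is not shown in the question.
$page = "<body><div id=\"content\">\n<table><tr><td>42</td></tr></table>\n</div><div id=\"footer\">x</div></body>";

$content = extractBetween($page, '<div id="content">', '</div><div id="footer">');
echo trim((string) $content);
```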
If you have a lot to do, I would recommend you take a look at http://simplehtmldom.sourceforge.net
It is really good for this sort of thing.
Do not use regex; it can fail.
Use PHP's built-in DOM parser:
http://php.net/manual/en/class.domdocument.php
You can easily traverse and parse the relevant content.
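Roughly, grabbing the inside of the content div with DOMDocument could look like the sketch below. The function name innerHtmlById and the sample markup are assumptions for illustration; the real spreadsheet page may be structured differently.

```php
<?php
// Hypothetical helper: returns the inner HTML of the element with the given id, or null.
function innerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages often contain markup errors
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes.
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child node; this gives the inner HTML without the <div> itself.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$sample = '<html><body><div id="content"><p>Hello</p></div><div id="footer">f</div></body></html>';
echo innerHtmlById($sample, 'content');
```

The parser recovers from most markup errors on its own, which is one reason it tends to be more dependable than matching start and end strings.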

How to remove a link from content using php?

$text = file_get_contents('http://www.example.com/file.php?id=name');
echo preg_replace('#<a.*?>.*?</a>#i', '', $text)
The link contains this content:
text text text. <br><a href='http://www.example.com' target='_blank' title='title' style='text-decoration:none;'>name</a>
What is the problem with this script?
You can't parse HTML with regular expressions. Use an XML/HTML parser.
Tempted to flag your question, but there's no option for "Report user for summoning Cthulhu"
I'd recommend reading: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
RegEx is very poor and not at all intended to parse HTML. That's why there are HTML parsing libraries. Find and use one for PHP. :)
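To make that concrete, here is a sketch with PHP's bundled DOMDocument that removes the links together with their text. The helper name removeElements is invented, and wrapping the fragment in a <div> is just one way to get a single root element.

```php
<?php
// Hypothetical helper: deletes every element with the given tag name, including its contents.
function removeElements(string $html, string $tag): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment so there is one root; the flags suppress <html>/<body> and the doctype.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    // getElementsByTagName() returns a live list, so copy it before removing nodes.
    foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $el) {
        $el->parentNode->removeChild($el);
    }

    // Serialize the children of the wrapper <div>, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeElements("text text text. <br><a href='http://www.example.com' target='_blank'>name</a>", 'a');
```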
Use <a[^>]+>[^<]*</a> (works fine as long as there's just text and no tags inside the a element).
Use strip_tags this way:
$t = 'http://yoururl.com/test1.php';
$t1 = file_get_contents($t);
$text = strip_tags($t1);
This should get rid of all the link tags in the page you are reading. Note that strip_tags removes every other tag as well and keeps the text inside the tags. Check the reference anyway, as it may not work for complicated elements: http://php.net/manual/en/function.strip-tags.php
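A quick illustration of what strip_tags does to the snippet from the question (the variable names are only for this example). Note that the link text "name" stays; only the tags go.

```php
<?php
$html = "text text text. <br><a href='http://www.example.com' target='_blank'>name</a>";

// With one argument, every tag is removed but the text inside the tags is kept.
$plain = strip_tags($html);

// The second argument lists tags to keep: here <br> survives and <a> does not.
$keepBr = strip_tags($html, '<br>');

echo $plain, "\n";   // text text text. name
echo $keepBr, "\n";  // text text text. <br>name
```

So if the goal is to drop the link text too, strip_tags alone is not enough.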

question regarding php function preg_replace

I want to dynamically remove specific tags and their content from an html file and thought of using preg_replace but can't get the syntax right. Basically it should, for example, do something like :
Replace everything between (and including) "" by nothing.
Could anybody help me out on this please ?
Easy, dude.
To make a regexp ungreedy, use the U modifier.
And to let . match across line breaks, use the s modifier.
Knowing that, to remove all paragraphs use this pattern:
#<p[^>]*>(.*)?</p>#sU
Explanation:
I use # as the delimiter so that I don't have to escape the / characters (which gives a more readable pattern)
<p[^>]*> : the part detecting an opening paragraph tag (with optional attributes, such as a style)
(.*)? : Everything (in "ungreedy mode")
</p> : Obviously, the closing paragraph tag
Hope that helps!
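To see what the two modifiers change, here is a small comparison on a made-up string (the sample HTML and variable names are assumptions for the demo):

```php
<?php
$html = "<p class=\"intro\">First\nparagraph</p><div>Keep me</div><p>Second</p>";

// U: quantifiers are lazy, so each match stops at the nearest </p>.
// s: . also matches newlines, so the two-line paragraph is matched.
$both = preg_replace('#<p[^>]*>(.*)?</p>#sU', '', $html);

// Without U, the greedy .* runs from the first <p> to the last </p>
// and swallows the <div> in between.
$greedy = preg_replace('#<p[^>]*>(.*)?</p>#s', '', $html);

// Without s, . stops at the newline, so the first paragraph is not matched.
$noDotAll = preg_replace('#<p[^>]*>(.*)?</p>#U', '', $html);

var_dump($both, $greedy, $noDotAll);
```

One caveat: <p[^>]*> also matches other tags whose names start with p, such as <pre>, so the pattern would need a tweak if those can appear.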
If you are trying to sanitize your data, it is often recommended that you use a whitelist as opposed to blacklisting certain terms and tags. This is easier to sanitize and prevent XSS attacks. There's a well known library called HTML Purifier that, although large and somewhat slow, has amazing results regarding purifying your data.
I would suggest not trying to do this with a regular expression. A safer approach would be to use something like
Simple HTML DOM
Here is the link to the API Reference: Simple HTML DOM API Reference
Another option would be to use DOMDocument
The idea here is to use a real HTML parser to parse the data and then you can move/traverse through the tree and remove whichever elements/attributes/text you need to. This is a much cleaner approach than trying to use a regular expression to replace data within the HTML.
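<?php
$doc = new DOMDocument;
$doc->loadHTMLFile('blah.html');
// Find the first <table> anywhere in the document
$table = $doc->getElementsByTagName('table')->item(0);
if ($table !== null) {
    // removeChild() must be called on the node's direct parent,
    // which is usually not the root <html> element
    $table->parentNode->removeChild($table);
}
echo $doc->saveHTML();
?>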
If you don't know what is between the tags, Phill's response won't work.
This will work if there's no other tags in between, and is definitely the easier case. You can replace the div with whatever tag you need, obviously.
preg_replace('#<div>[^<]+</div>#','',$html);
If there could be other tags in the middle, this should work, but could cause problems. You're probably better off going with the DOM solution above, if so
preg_replace('#<div>.+</div>#','',$html);
These aren't tested
PSEUDO CODE
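function replaceMe($html_you_want_to_replace, $html_dom) {
    // preg_quote() escapes regex metacharacters in the literal HTML; '#' is the delimiter
    $pattern = '#^' . preg_quote($html_you_want_to_replace, '#') . '#';
    return preg_replace($pattern, '', $html_dom);
}
HTML Before
<div>I'm Here</div><div>I'm next</div>
<?php
$html_dom = "<div>I'm Here</div><div>I'm next</div>";
$get_rid_of = "<div>I'm Here</div>";
echo replaceMe($get_rid_of, $html_dom);
?>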
HTML After
<div>I'm next</div>
I know it's a hack job

Regex to match html attributes

I am trying to match a pattern so that I can retrieve a string from a website. Here is the string in question:
<a title="Posts by ivek dhwWaVa"
href="http://www.example.com/author/ivek/"
rel="nofollow">ivek</a>
I am trying to match the string "ivek" between the a tags, and I want to do this for each post and relate it to the number of comments.
Firstly, what regex should I use for the above, so I can use it as an example for the rest? I have nothing so far:
$content = file_get_contents('http://www.example.com');
preg_match_all("", $content, $matches);
And how would I relate the comments to the author's name, as there are many other authors on the website, each with their own set of comments? Do I use divs to break this up? Each set of info is wrapped in a div like this:
<div id="post-54" class="excerpt">
Thanks all for any help!
Please let me be the first to introduce you to the most famous answer on Stack Overflow.
Regular expressions are not suited to parsing HTML. You really need an HTML parser, even for what might appear to be a simple task.
I recommend something like PHP Simple HTML DOM Parser.
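If you would rather stay with what ships with PHP, a sketch along these lines with DOMDocument and DOMXPath might do. The function name authorsByPost and the sample markup are assumptions; in particular, it guesses that author links can be recognized by "/author/" in their href, based on the one link shown in the question.

```php
<?php
// Hypothetical helper: maps each post's id attribute to the author name found inside it.
function authorsByPost(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];
    // Each post is wrapped in <div id="post-N" class="excerpt">, as in the question.
    foreach ($xpath->query('//div[@class="excerpt"]') as $post) {
        // Relative query (leading dot): search only inside this post's <div>.
        $link = $xpath->query('.//a[contains(@href, "/author/")]', $post)->item(0);
        if ($link !== null) {
            $result[$post->getAttribute('id')] = trim($link->textContent);
        }
    }
    return $result;
}

$sample = '<div id="post-54" class="excerpt"><a title="Posts by ivek" href="http://www.example.com/author/ivek/" rel="nofollow">ivek</a></div>'
        . '<div id="post-55" class="excerpt"><a href="http://www.example.com/author/mark/">mark</a></div>';
print_r(authorsByPost($sample));
```

The comment count could be picked up with a second relative query inside the same loop, which keeps it tied to the right author; the markup for comments is not shown in the question, so that part is left out here.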
You really shouldn't be looking to Regex to do the job:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
