I'm working on using htmlpurifier to create a text-only version of my site.
I now need to replace all the a hrefs with the text only url i.e. 'www.example.com/aboutus' becomes 'www.example.com/text/aboutus'
Initially I tried a simple str_replace on the domain (I use a global variable for the domain), but the problem is links to files also get replaced i.e.
'www.example.com/document.pdf' becomes 'www.example.com/text/document.pdf' and therefore fails.
Is there a regular expression that replaces domain with domain/text only where the URL does not include a given string?
Thanks for any pointers you might be able to give me :)
Use a negative lookahead:
$output = preg_replace(
'#www.example.com(?!/text/)#',
'www.example.com/text',
$input
);
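A hedged sketch of how the lookahead could be extended so that file links are skipped too. The helper name to_text_url and the list of file extensions are assumptions made for illustration; neither comes from the question.

```php
<?php
// Hypothetical helper: prefix page URLs with /text, leaving URLs that are
// already prefixed, and URLs ending in an assumed file extension, untouched.
function to_text_url(string $url): string
{
    return preg_replace(
        // first lookahead: not already under /text/
        // second lookahead: does not end in one of the assumed extensions
        '#^www\.example\.com(?!/text/)(?!\S*\.(?:pdf|docx?|zip)$)#i',
        'www.example.com/text',
        $url
    );
}

echo to_text_url('www.example.com/aboutus'), "\n";
echo to_text_url('www.example.com/document.pdf'), "\n";
```

Because of the first lookahead, running the helper twice on the same URL should not add a second /text segment.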
Better yet, use DOM with it:
$html = 'foo
<p>hello</p>
bar';
libxml_use_internal_errors(true); // suppresses DOM errors
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('//a/@href');
foreach ($hrefs as $href) {
$href->value = preg_replace(
'#^www.example.com(?!/text/)(.*?)(?<!\.pdf)$#',
'www.example.com/text\\1',
$href->value
);
}
This should give you:
foo
<p>hello</p>
bar
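For reference, here is a self-contained sketch of the DOM approach. The two sample links are assumptions written for this illustration (they are not the original sample markup), and the dots in the pattern are escaped, which is a small change of mine.

```php
<?php
// Assumed sample markup: one page link and one file link.
$html = '<p><a href="www.example.com/aboutus">About</a> '
      . '<a href="www.example.com/document.pdf">PDF</a></p>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rewritten = [];
// '//a/@href' selects the href attribute nodes themselves
foreach ($xpath->query('//a/@href') as $href) {
    $href->value = preg_replace(
        '#^www\.example\.com(?!/text/)(.*?)(?<!\.pdf)$#',
        'www.example.com/text$1',
        $href->value
    );
    $rewritten[] = $href->value;
}

echo implode("\n", $rewritten), "\n";
```

Only the page link should change; the .pdf link is rejected by the lookbehind at the end of the pattern.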
I have a part of HTML string like below which I get from web page scraping.
$scraping_html = "<html><body>
....
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
....</body></html>";
I want to count the & characters inside that particular div using PHP. Is it possible using any of the PHP preg functions? Thanks in advance.
The hard part is getting the text nodes (I assume that's where you're stuck). Depending on how reliable it needs to be you have two alternatives (just sample code, not actually tested):
Good old strip_tags():
$plain_text = strip_tags($scraping_html);
Proper DOM parser:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($scraping_html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$plain_text = '';
foreach ($xpath->query('//text()') as $textNode) {
$plain_text .= $textNode->nodeValue;
}
To count, you have e.g. substr_count().
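A minimal sketch of the strip_tags() route combined with substr_count(). The sample string is an assumption, and note that this counts across the whole document rather than a single div. strip_tags() leaves entities such as &amp; in place, so the sketch decodes them first.

```php
<?php
// Assumed sample input; entities are written the way a real page would send them.
$scraping_html = "<html><body><div id='x'>fish &amp; chips &amp; peas</div></body></html>";

// Drop the tags, then turn &amp; back into a literal ampersand
$plain_text = html_entity_decode(strip_tags($scraping_html));

$count = substr_count($plain_text, '&');
echo $count, "\n";
```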
To get the number of & in the given example, use DOMDocument:
$html = <<<EOD
<html><body>
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
</body></html>
EOD;
$dom = new DOMDocument;
$dom->loadHTML($html);
$ele = $dom->getElementById('ctl00_ContentPlaceHolder1_lblHdr');
echo substr_count($ele->nodeValue, '&');
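The same idea can be sketched with XPath and a guard for the case where the div is missing. The markup below is an assumed sample with a second div added to show that only the targeted one is counted.

```php
<?php
// Assumed sample markup: the target div plus another div that must be ignored.
$html = "<html><body>"
      . "<div id='ctl00_ContentPlaceHolder1_lblHdr'>a &amp; b &amp; c</div>"
      . "<div>x &amp; y</div>"
      . "</body></html>";

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$div = $xpath->query("//div[@id='ctl00_ContentPlaceHolder1_lblHdr']")->item(0);

// textContent holds decoded text, so a plain '&' is what gets counted
$count = $div === null ? 0 : substr_count($div->textContent, '&');
echo $count, "\n";
```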
Does anyone know how I could delete empty tags with PHP ?
self-closing tags should be ignored
it should consider empty content (spaces, line breaks, etc)
I did try two things :
with DOMdocument, but the problem is that it considers self-closing tags as empty (images, etc)
$xpath = new DOMXPath($dom);
$query = '//*[not(node())]'; //all empty tags
$nodes = $xpath->query($query);
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
I also had a try with regexes, but the best one I found on the internet doesn't work for what I need either :
//http://regex101.com/r/rD0sI8/1
$pattern = "/<.[^>]*>(\s+|()|( )*|\s+( )*|( )*\s+|\s+( )*\s+)<\/.[^>]*>/i";
$content = preg_replace($pattern,'',$content);
I guess it has problems with <img...></span>, for example. That's why I would prefer to work with DOMDocument...
Any ideas ?
If you want to remove empty tags you can use this regex:
<(.*?)\s*.*?>\s*<\/\1>
If it is available or you can install it, you could use the php-tidy extension. That should get rid of your empty tags and fix other errors.
To handle nested empty elements, you could run preg_replace until there's nothing left to replace:
<?php
$html = 'foo <i></i> bar <img src> <img> <ul><li></li></ul>';
do {
$input = $html;
$html = preg_replace('/<(\S+)[^>]*><\/\1>/', '', $input);
} while ($html !== $input);
print $html;
Given that parsing HTML with regular expressions is always going to lead to problems, though, it would be better to work with the DOM to remove nodes that a) aren't known "void" elements in HTML and b) have no text content:
<?php
$html = '<div>foo <i></i> bar <img src> <img> <ul><li></li></ul></div>';
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//node()');
$voids = array('area', 'base', 'br', 'col', 'command', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr');
foreach ($nodes as $node) {
if (!in_array($node->nodeName, $voids) && !strlen($node->textContent)) {
$node->parentNode->removeChild($node);
}
}
print $doc->saveHTML(); // <div>foo bar <img src> <img> </div>
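Since the question also asks that whitespace-only content count as empty, here is a hedged variation on the DOM approach. The sample markup, the trim() check and the rule that an element wrapping a void element is kept are my assumptions, not part of the answer above.

```php
<?php
// Assumed sample: whitespace-only <i> and <li>, plus a <p> that only wraps an image.
$html = '<div>foo <i> </i> bar <img src="a.png"> <ul><li>' . "\n"
      . '</li></ul><p><img src="b.png"></p></div>';

$voids = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
          'link', 'meta', 'param', 'source', 'track', 'wbr'];

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);

// XPath expression matching any void element below the context node
$voidQuery = './/' . implode('|.//', $voids);

// Walk elements in reverse document order so children are checked before parents
foreach (array_reverse(iterator_to_array($xpath->query('//*'))) as $el) {
    if (in_array($el->nodeName, $voids, true)) {
        continue; // void elements are never treated as empty
    }
    $onlyWhitespace = trim($el->textContent) === '';
    $wrapsVoid = $xpath->query($voidQuery, $el)->length > 0;
    if ($onlyWhitespace && !$wrapsVoid) {
        $el->parentNode->removeChild($el);
    }
}

$out = $doc->saveHTML();
print $out;
```

Going through the elements in reverse means the inner li is removed before the ul is checked, so nested empty elements disappear in a single pass.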
I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node; you need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
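To try this without fetching the remote page, here is a rough sketch on inline markup. The sample HTML is an assumption, and it uses //a and starts-with() so that nested links are found and only hrefs beginning with the needle are touched, which is a slight variation on the query above.

```php
<?php
// Assumed sample markup: two links inside the target div, one outside it.
$html = '<div id="upcoming_league_dates">'
      . '<a href="http://example.com/test/">A</a>'
      . '<a href="http://example.com/other/">B</a>'
      . '</div>'
      . '<a href="http://example.com/test/">outside</a>';

$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';

libxml_use_internal_errors(true);
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = $xpath->query(
    "//div[@id='upcoming_league_dates']//a[starts-with(@href, '$needle')]"
);
foreach ($links as $anchor) {
    // str_replace on the attribute value, written back with setAttribute()
    $anchor->setAttribute(
        'href',
        str_replace($needle, $replacement, $anchor->getAttribute('href'))
    );
}

// Collect every href in document order to inspect the result
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->value;
}
echo implode("\n", $hrefs), "\n";
```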
Update: After your revision, your code now works anyway. This answer just covers the search-and-replace logic (à la str_replace) from the earlier version of your question.
I need to be able to parse some text and find all the instances where an <a> tag has target="_blank", and for each match add (for example) "This link opens in a new window" before the closing tag.
For example:
Before:
Go here now
After:
Go here now<span>(This link opens in a new window)</span>
This is for a PHP site, so I assume preg_replace() will be the method... I just don't have the skills to write the regex properly.
Thanks in advance for any help anyone can offer.
You should never use a regex to parse HTML, except maybe in extremely well-defined and controlled circumstances.
Instead, try a built-in parser:
$dom = new DOMDocument();
$dom->loadHTML($your_html_source);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[@target='_blank']");
foreach($links as $link) {
$link->appendChild($dom->createTextNode(" (This link opens in a new window)"));
}
$output = $dom->saveHTML();
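A small self-contained sketch of the same approach, wrapping the note in a span as the question asked. The sample markup and URLs are assumptions for illustration.

```php
<?php
// Assumed sample markup: one link that opens a new window and one that does not.
$html = '<p><a href="http://example.com/" target="_blank">Go here now</a> '
      . '<a href="/local">Stay</a></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query("//a[@target='_blank']") as $link) {
    // Build <span>(...)</span> and append it as the last child of the link
    $span = $dom->createElement('span');
    $span->appendChild($dom->createTextNode('(This link opens in a new window)'));
    $link->appendChild($span);
}

// Serialize only the paragraph, not the html/body wrapper added by loadHTML()
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, "\n";
```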
Alternatively, if this is being output to the browser, you can just use CSS:
a[target='_blank']:after {
content: ' (This link opens in a new window)';
}
This will work for anchor tag replacement....
$string = str_replace('<a ','<a target="_blank" ',$string);
Well, @Kolink is right, but here's my RegExp version.
$string = '<p>mess</p>Google<p>mess</p>';
// [^>]* keeps each match inside one tag; the U flag is dropped because it would make .*? greedy
echo preg_replace('/(<a\b[^>]*target="_blank"[^>]*>)(.*?)(<\/a>)/is', '$1$2(This link opens in a new window)$3', $string);
This does the job:
$newText = '<span>(This link opens in a new window)</span>';
$pattern = '~<a\s[^>]*?\btarget\s*=(?:\s*([\'"])_blank\1|_blank\b)[^>]*>[^<]*(?:<(?!/a>)[^<]*)*\K~i';
echo preg_replace($pattern, $newText, $html);
However, this direct string approach may also alter commented-out HTML, strings or comments in CSS or JavaScript code, and even JavaScript regex literals, which is at best unnecessary and at worst harmful. That's why you should use a DOM approach if you want to avoid these pitfalls. All you have to do is append a new node to each link that has the desired attribute:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('//a[@target="_blank"]');
foreach($nodeList as $node) {
$newNode = $dom->createElement('span', '(This link opens in a new window)');
$node->appendChild($newNode);
}
$html = $dom->saveHTML();
Finally, a last alternative is to leave the HTML unchanged and do it with CSS:
a[target="_blank"]::after {
content: " (This link opens in a new window)";
font-style: italic;
color: red;
}
You won't be able to write a regex that will evaluate an infinitely long string. I suggest:
$h = explode('>', $html);
This will give you the chance to traverse it like any other array and then do:
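foreach ($h as $i => $piece) {
    // each piece is the markup up to, but not including, the next '>'
    if (!preg_match('/<a\s[^<]*$/i', $piece)) {
        continue; // does not end in an opening <a ...> tag
    } elseif (!preg_match('/target="_blank"/', $piece)) {
        continue;
    } elseif (isset($h[$i + 1])) {
        // the next piece is the link text followed by '</a'
        $h[$i + 1] = str_replace('</a', '(open in new window)</a', $h[$i + 1]);
    }
}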
$html = implode('>', $h);
This is how I would approach such a problem. Of course, I just wrote this off the top of my head and it is not guaranteed to work as is, but with a few tweaks to fit your exact logic you will have what you need.
I am trying to read the source code of a page. I just want to read some text that is within a certain division element with the id "wrapper_left".
My problem is that if a double quote (") is used in the first argument of the explode function, it does not work. I tried escaping the string, although I figured this wouldn't do anything.
$source_code = htmlspecialchars(file_get_contents('http://mydomain.com'));
$source_code = explode('<div id="wrapper_left">', $source_code);
echo $source_code[1];
Thanks tons in advance.
Don't bother trying to get this done with explode(), string manipulation, or a regular expression, you need an HTML parser, like DOMDocument:
$doc = new DOMDocument;
$doc->loadHTMLFile( 'http://mydomain.com');
$xpath = new DOMXPath( $doc);
$div = $xpath->query( '//div[@id="wrapper_left"]')->item(0);
echo $div->textContent;
You can see it working in this demo, which, when fed this HTML:
<div id="wrapper_left">Some text</div>
It produces:
Some text
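If you want to try it without fetching a remote page, a sketch along these lines should work. The inline markup is an assumed sample, and the null check covers the case where the div is not found.

```php
<?php
// Assumed sample markup with the target div and an unrelated one.
$html = '<html><body><div id="header">Nav</div>'
      . '<div id="wrapper_left">Some <b>text</b></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$div = $xpath->query('//div[@id="wrapper_left"]')->item(0);

// item(0) returns null when nothing matched, so guard before reading textContent
$text = $div !== null ? trim($div->textContent) : null;
echo $text, "\n";
```

Note that textContent also includes the text of nested elements, so the inner b tag in the sample contributes its text but not its markup.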