Str_replace with regex - php

Say I have the following link:
<li class="hook">
I_have_underscores
</li>
How would I remove the underscores only in the text and not in the href? I have used str_replace, but this removes all underscores, which isn't ideal.
So basically I would be left with this output:
<li class="hook">
I have underscores
</li>
Any help, much appreciated

You can use an HTML DOM parser to get the text within the tags, and then run your str_replace() function on the result.
Using the DOM Parser I linked, it is as simple as something like this:
$html = str_get_html('<li class="hook">I_have_underscores</li>');
$links = $html->find('a'); // You can use any CSS-style selectors here
foreach ($links as $l) {
    $l->innertext = str_replace('_', ' ', $l->innertext);
}
echo $html;
// <li class="hook">I have underscores</li>
That's it.

It's safer to parse HTML with DOMDocument instead of regex. Try this code:
<?php
function replaceInAnchors($html)
{
$dom = new DOMDocument();
// loadHtml() needs mb_convert_encoding() to work well with UTF-8 encoding
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[(ancestor::a)]') as $node)
{
// Write the replacement straight back to the text node; building a fragment
// with appendXML() would fail on text containing a bare & or <.
$node->nodeValue = str_replace('_', ' ', $node->nodeValue);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
return mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
}
$html = '<li class="hook">
I_have_underscores
</li>';
echo replaceInAnchors($html);
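A variation on the same idea, as a minimal sketch rather than a definitive implementation: instead of trimming the serialized <body> with fixed offsets, the children of <body> can be serialized one by one. The function name and the href value in the sample are hypothetical, since the original link markup is not shown above.

```php
<?php
// Hypothetical helper: replace underscores with spaces only in text that
// sits inside an <a> element; attributes such as href are left untouched.
function underscoresToSpacesInLinks(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html); // libxml wraps the fragment in <html><body>

    $xpath = new DOMXPath($dom);
    // Every text node that has an <a> ancestor.
    foreach ($xpath->query('//a//text()') as $textNode) {
        $textNode->nodeValue = str_replace('_', ' ', $textNode->nodeValue);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Assumed sample input: the href is made up for illustration.
$html = '<li class="hook"><a href="/some_page">I_have_underscores</a></li>';
echo underscoresToSpacesInLinks($html);
```

The underscore in the href survives because the XPath expression selects text nodes only, never attribute values.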

Related

PHP DOMDocument: Get inner HTML of node

When loading HTML into a <textarea>, I intend to treat different kinds of links differently. Consider the following links:
http://stackoverflow.com
StackOverflow
When the text inside a link matches its href attribute, I want to remove the HTML, otherwise the HTML remains unchanged.
Here's my code:
$body = "Some HTML with a http://stackoverflow.com";
$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('a') as $node) {
$link_text = $node->ownerDocument->saveHTML($node->childNodes[0]);
$link_href = $node->getAttribute("href");
$link_node = $dom->createTextNode($link_href);
$node->parentNode->replaceChild($link_node, $node);
}
$html = $dom->saveHTML();
The problem with the above code is that DOMDocument encapsulates my HTML into a paragraph tag:
<p>Some HTML with a http://stackoverflow.com</p>
How do I get it to return only the inner HTML of that paragraph?
A valid DOM document needs a root node.
I suggest wrapping the input in a root <div> of your own, so that you don't destroy a root node that may already exist.
Finally, read the nodeValue of the root node, or strip the wrapper from the serialized HTML with substr().
$body = "Some HTML with a http://stackoverflow.com";
$body = '<div>'.$body.'</div>';
$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('a') as $node) {
$link_text = $node->ownerDocument->saveHTML($node->childNodes[0]);
$link_href = $node->getAttribute("href");
$link_node = $dom->createTextNode($link_href);
$node->parentNode->replaceChild($link_node, $node);
}
// Serialize, then strip the wrapper: '<div>' is 5 characters, '</div>' plus the trailing newline is 7
$html = $dom->saveHTML();
$html = substr($html, 5, -7);
var_dump($html); // "Some HTML with a http://stackoverflow.com"
This also works if the input string is:
<p>Some HTML with a http://stackoverflow.com</p>
The output is:
<p>Some HTML with a http://stackoverflow.com</p>
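Note that the loop in the question replaces every link and never compares $link_text with $link_href, and that it removes nodes from the live list returned by getElementsByTagName() while iterating over it, which can skip elements. A minimal sketch of the comparison the question describes, under those observations; the function name and the sample links are hypothetical.

```php
<?php
// Hypothetical helper: turn <a> elements into plain text only when the
// link text equals the href; all other links are left as they are.
function unwrapSelfLinks(string $body): string
{
    $dom = new DOMDocument();
    // Temporary <div> root so that bare text is not wrapped in a <p>.
    $dom->loadHTML(
        '<div>' . $body . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    // Copy the live node list into an array before changing the tree.
    $links = iterator_to_array($dom->getElementsByTagName('a'));
    foreach ($links as $a) {
        if (trim($a->textContent) === $a->getAttribute('href')) {
            $text = $dom->createTextNode($a->textContent);
            $a->parentNode->replaceChild($text, $a);
        }
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Assumed sample input with one matching and one non-matching link.
$body = 'Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a>'
      . ' and <a href="http://stackoverflow.com">StackOverflow</a>';
echo unwrapSelfLinks($body);
```

Serializing the wrapper's children avoids counting characters for substr(), at the cost of a short loop.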

Parse a HTML document and get a specific element in PHP and save its HTML

All I want to do is save the first div with attribute role="main" as a string from an external URL using PHP.
So far I have this:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://example.com/");
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');
$str = "";
if ($elements->length > 0) {
$str = $elements->item(0)->textContent;
}
echo htmlentities($str);
But unfortunately the $str does not seem to be displaying the HTML tags. Just the text.
You can get the HTML via the saveHTML() method.
$str = $doc->saveHTML($elements->item(0));
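To make the difference concrete, here is a small sketch that reads the same element both ways. The inline markup is a hypothetical stand-in for the external page, so the example runs without network access.

```php
<?php
$doc = new DOMDocument();
// Hypothetical sample markup standing in for the remote document.
$doc->loadHTML('<html><body><div role="main"><p>Hello <b>world</b></p></div></body></html>');

$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');

$text = '';
$markup = '';
if ($elements->length > 0) {
    $first  = $elements->item(0);
    $text   = $first->textContent;    // text only, tags are dropped
    $markup = $doc->saveHTML($first); // the element serialized with its tags
}

echo htmlentities($text), "\n";
echo htmlentities($markup), "\n";
```

Passing a node to saveHTML() serializes just that node and its descendants instead of the whole document.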

PHP Dom Document Don't fix markup

How do I stop DOMDocument from having a mind of its own?
$dom = new DOMDocument();
$validHtml = '<body>Test</body>';
$dom->loadHTML($validHtml);
After loading, the anchor attribute is encoded. I want it not to do this.
$body = $dom->saveHTML();
var_dump($body);
//<body>Test</body>
I realize this has been covered before, but everywhere I look, it's more useless Ninja code. Any help appreciated.
Here's how I fixed my own problem. Basically, I decided to strip out all the template tags in the markup and put in placeholders that I can later use to put them back:
$validHtml = '<body>Test</body>';
$matches = array();
preg_match_all('/{{[^}]+}}/',$validHtml, $matches);
$matches = $matches[0];
if (count($matches)>0){
foreach ($matches as $i=>$match){
$validHtml = str_replace($match, "<!--INDEX-$i-->", $validHtml);
}
}
$dom = new DOMDocument();
$dom->loadHTML($validHtml);
... //do processing on the loaded dom
Later on after manipulating the dom, I put back all the matches:
$validHtml = $dom->saveHTML();
if (count($matches)>0){
foreach ($matches as $i=>$match){
$validHtml = str_replace(array("<!--INDEX-$i-->", "<!--INDEX-$i-->"), $match, $validHtml);
}
}
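The same protect-and-restore idea can be sketched with markers made only of letters, digits and underscores, so that neither HTML escaping nor URL escaping of attribute values can alter them and a single search form is enough when restoring. This is a variant under stated assumptions, not the code above: the function names, the marker format and the sample markup are all hypothetical.

```php
<?php
// Swap each {{...}} token for a marker that needs no escaping anywhere.
function protectTokens(string $html, ?array &$map): string
{
    $map = [];
    return preg_replace_callback(
        '/{{[^}]+}}/',
        function (array $m) use (&$map): string {
            $marker = 'TOKEN_' . count($map) . '_END';
            $map[$marker] = $m[0]; // remember the original token
            return $marker;
        },
        $html
    );
}

// Put the original tokens back after the DOM work is done.
function restoreTokens(string $html, array $map): string
{
    return strtr($html, $map);
}

// Assumed sample input: a template token inside an href attribute.
$source = '<body><a href="{{url}}">Test</a></body>';
$protected = protectTokens($source, $map);

$dom = new DOMDocument();
$dom->loadHTML($protected);
// ... do processing on the loaded DOM ...
$a = $dom->getElementsByTagName('a')->item(0);

$result = restoreTokens($dom->saveHTML($a), $map);
echo $result, "\n";
```

The `_END` suffix keeps markers such as TOKEN_1_END and TOKEN_10_END from being prefixes of one another.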

PHP nodeValue strips html tags - innerHTML alternative?

I'm using the following script for a lightweight DOM editor. However, nodeValue in my for loop is converting my html tags to plain text. What is a PHP alternative to nodeValue that would maintain my innerHTML?
$page = $_POST['page'];
$json = $_POST['json'];
$doc = new DOMDocument();
$doc->loadHTMLFile($page);
$xpath = new DOMXPath($doc);
$entries = $xpath->query('//*[@class="editable"]');
$edits = json_decode($json, true);
$num_edits = count($edits);
for($i=0; $i<$num_edits; $i++)
{
$entries->item($i)->nodeValue = $edits[$i]; // nodeValue strips html tags
}
$doc->saveHTMLFile($page);
Since $edits[$i] is a string, you need to parse it into a DOM structure and replace the original content with the new structure.
Update
The code fragment below does an incredible job when using non-XML compliant HTML. (e.g. HTML 4/5)
for($i=0; $i<$num_edits; $i++)
{
$f = new DOMDocument();
$edit = mb_convert_encoding($edits[$i], 'HTML-ENTITIES', "UTF-8");
$f->loadHTML($edit);
$node = $f->documentElement->firstChild;
$entries->item($i)->nodeValue = "";
foreach($node->childNodes as $child) {
$entries->item($i)->appendChild($doc->importNode($child, true));
}
}
I haven't worked with that library in PHP before, but from my other XPath experience I think that nodeValue on anything other than a text node does strip tags. If you're unsure about what's underneath that node, then I think you'll need to recursively descend $entries->item($i)->childNodes if you need to get the markup back.
Or... you may want textContent instead of nodeValue:
http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent
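For reading markup back out, one common approach is to serialize each child of the node with saveHTML(). The sketch below assumes a hypothetical helper named innerHTML() and made-up sample markup; it is an illustration, not part of the DOM extension itself.

```php
<?php
// Hypothetical helper: serialize a node's children, i.e. its "inner HTML".
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
// Assumed sample markup with one editable region.
$doc->loadHTML('<div class="editable">Hello <em>there</em></div>');

$xpath = new DOMXPath($doc);
$entry = $xpath->query('//*[@class="editable"]')->item(0);

echo $entry->textContent, "\n"; // text only
echo innerHTML($entry), "\n";   // markup preserved
```

Both nodeValue and textContent return text without tags when read from an element, so a helper like this is needed whenever the markup itself matters.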

PHP Reg ex for parsing a link

I have a PHP script that parses the POST content of a form (message) and transforms any URL into a real HTML link. These are the two regular expressions I use:
$dbQueryList['sb_message'] = preg_replace("#(^|[\n ])([\w]+?://[^ \"\n\r\t<]*)#is", "\\1\\2", $dbQueryList['sb_message']);
$dbQueryList['sb_message'] = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r<]*)#is", "\\1\\2", $dbQueryList['sb_message']);
Ok it works well but now, in another script I would like to do the opposite. So in my $dbQueryList['sb_message'] I could have a link like this "Google" and I would like to just have "http://google.com".
I cannot write the regex that can do that. Could you help me please?
Thanks :)
Something like this, I think:
echo preg_replace('/ Google helloworld');
It's safer to use DOMDocument instead of regex to parse HTML contents.
Try this code:
<?php
function extractAnchors($html)
{
$dom = new DOMDocument();
// loadHtml() needs mb_convert_encoding() to work well with UTF-8 encoding
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a') as $node)
{
if ($node->hasAttribute('href'))
{
// createTextNode() copes with URLs that contain a bare &, which appendXML() would reject
$newNode = $dom->createTextNode($node->getAttribute('href'));
$node->parentNode->replaceChild($newNode, $node);
}
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
return mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
}
$html = 'Some text Google some text <img src="http://dontextract.it" alt="alt"> some text.';
echo extractAnchors($html);
