Replace links from specific domain with text (PHP) - php

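I have:
<a href="http://abc.com">Title</a>
And a second link with the same text, "Title", that points to a different domain.
I want to replace the link with the plain text "Title", but only for links to http://abc.com. I couldn't find out how (I tried Google). Can you explain it to me? I'm not very experienced with PHP.
Thanks in advance.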

I'm not sure I fully understand what you're asking, but if you:
have a string that contains some HTML
and want to replace all links to abc.com with some text
then a good solution (better than regular expressions, I would say) is to use the DOM-related classes. In particular, take a look at the DOMDocument class and its loadHTML method.
For example, suppose the HTML is held in a variable (example.com stands in here for any other domain):
$html = <<<HTML
<p>some text</p>
<a href="http://abc.com">Title</a>
<p>some more text</p>
<a href="http://example.com">Title</a>
<p>and some again</p>
HTML;
You could then use something like this :
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
for ($i = $tags->length - 1 ; $i > -1 ; $i--) {
$tag = $tags->item($i);
if ($tag->getAttribute('href') == 'http://abc.com') {
$replacement = $dom->createTextNode($tag->nodeValue);
$tag->parentNode->replaceChild($replacement, $tag);
}
}
echo $dom->saveHTML();
And this would give you the following HTML as output (the link to the other domain, shown here as example.com, is left untouched):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text</p>
Title
<p>some more text</p>
<a href="http://example.com">Title</a>
<p>and some again</p>
</body></html>
Note that the whole <a> element that linked to http://abc.com has been replaced by the text it contained.
If you want some other text instead, just use it where I used $tag->nodeValue, which is the current content of the node that's being removed.
Unfortunately, yes, this generates a full HTML document, including the doctype declaration, <html> and <body> tags, ...
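A minimal sketch of a variant, under two assumptions that are not part of the answer above: links are matched by host name (via parse_url) instead of comparing the whole href, and the fragment is wrapped in a temporary <div> so that only the fragment is returned, without the doctype, <html> and <body>. The function name replaceLinksFromHost is made up for illustration.

```php
<?php
// Hypothetical helper: replace every <a> whose href points at $host
// with its own text, and return only the fragment that was passed in.
function replaceLinksFromHost(string $html, string $host): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings silently
    // Wrap the fragment so we know which element to read back later.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Copy the live node list to an array before modifying the tree.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $linkHost = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
            $text = $dom->createTextNode($a->textContent);
            $a->parentNode->replaceChild($text, $a);
        }
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<p>some text</p><a href="http://abc.com/page">Title</a>'
      . '<p>more</p><a href="http://example.com">Title</a>';
echo replaceLinksFromHost($html, 'abc.com');
```

The host comparison is exact, so www.abc.com would not match abc.com here; whether subdomains should count depends on your data.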

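To cover another interpretation of the question:
// example.com stands in for any other domain
$string = '<a href="http://abc.com">Title</a> <a href="http://example.com">Title</a>';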
$pattern = '/\<\s?a\shref[\s="\']+([^\'"]+)["\']\>([^\<]+)[^\>]+\>/';
$result = preg_replace_callback($pattern, 'replaceLinkValueSelectively', $string);
function replaceLinkValueSelectively($matches)
{
list($link, $URL, $value) = $matches;
switch ($URL)
{
case 'http://abc.com':
$newValue = 'New Title';
break;
default:
return $link;
}
return str_replace($value, $newValue, $link);
}
echo $result;
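The input (with example.com standing in for any other domain)
<a href="http://abc.com">Title</a> <a href="http://example.com">Title</a>
becomes
<a href="http://abc.com">New Title</a> <a href="http://example.com">Title</a>
$string is your input and $result is the modified input. You can add more URLs as cases.
Please note: I wrote that regular expression hastily, and I'm still a novice with them. Please check that it handles all your intended cases.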

Related

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
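Try this code:
// $html is the object returned by file_get_html() (called $dom in the question)
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
// join the pieces with a visible delimiter
$text = implode('. ', $result);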
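The duplication problem with nested divs mentioned in the question can also be avoided by selecting only the innermost divs. The sketch below swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM Parser; the function name divsToSentences and the nested "outer" div in the sample are assumptions added for illustration.

```php
<?php
// Hypothetical helper: join the text of each innermost <div> with ". ".
function divsToSentences(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings silently
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $parts = [];
    // div[not(.//div)] selects only divs that contain no other div.
    foreach ($xpath->query('//div[not(.//div)]') as $div) {
        // Collapse runs of whitespace and trim both ends.
        $text = trim(preg_replace('/\s+/', ' ', $div->textContent));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode('. ', $parts);
}

$html = '<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="outer"><div class="P">this is another paragraph</div></div>
</body>';
echo divsToSentences($html);
// Header. this is a paragraph. this is another paragraph
```

One limitation of this selection: text that sits directly inside a parent div, next to nested divs, is not picked up.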

Replace image inside HTML if we only know the image source

I have an HTML document with a number of images in it. Suppose I know the URL of one of those images. How can I replace that image in the HTML with some custom text in PHP?
<div>
<p>Some text<img src="a.jpg" class="testclass" alt="image" title="image"/></p>
<p>Some more text<img src="b.jpg" class="testclass2" alt="image2" title="image2"/></p>
</div>
And suppose I have to replace <img src="a.jpg" class="testclass" alt="image" title="image"/> with some custom text but the only information I have is the image URL i.e "a.jpg". How to do it in PHP?
Using regular expressions for this is not ideal. The expressions quickly become very complicated once they have to deal with quotes, white space, attribute order, scripts, and so on in HTML.
The preferred method is to use a DOM parser, which PHP offers out of the box.
Here is some code you could use to get what you want:
// main function: pass it the DOM, image URL and replacement text
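function DOMreplaceImagesByText($dom, $img_src, $text) {
    // copy the live node list first: replacing nodes while iterating over it can skip elements
    foreach(iterator_to_array($dom->getElementsByTagName('img')) as $img) {
        // compare against the URL passed in, not a hard-coded value
        if ($img->getAttribute("src") == $img_src) {
            $span = $dom->createElement("span", $text);
            $img->parentNode->replaceChild($span, $img);
        }
    }
}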
// utility function to get innerHTML of an element
function DOMinnerHTML($element) {
$innerHTML = "";
foreach ($element->childNodes as $child) {
$innerHTML .= $element->ownerDocument->saveHTML($child);
}
return $innerHTML;
}
// test data
$html = '<div>
<p>Some text<img src="a.jpg" class="testclass" alt="image" title="image"/></p>
<p>Some more text<img src="b.jpg" class="testclass2" alt="image2" title="image2"/></p>
</div>';
// create DOM for given HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
// call our function to make the replacement(s)
DOMreplaceImagesByText($dom, "a.jpg", "custom text");
// convert back to HTML
$html = DOMinnerHTML($dom->getElementsByTagName('body')->item(0));
// show result (for demo only, in reality you would not use htmlentities)
echo htmlentities($html);
The above code will output:
<div>
<p>Some text<span>custom text</span></p>
<p>Some more text<img src="b.jpg" class="testclass2" alt="image2" title="image2"></p>
</div>
Regular Expression
As stated above, regular expressions are not well suited for this job, but I will provide one for completeness' sake:
function HTMLreplaceImagesByText($html, $img_src, $text) {
// escape special characters in $img_src so they are matched
// literally inside the main regular expression
$img_src = preg_quote($img_src, "/");
// main regular expression:
return preg_replace("/<img[^>]*?\ssrc\s*=\s*[\'\"]" . $img_src
. "[\'\"].*?>/si", "<span>$text</span>", $html);
}
$html = '<div>
<p>Some text<img src="a.jpg" class="testclass" alt="image" title="image"/></p>
<p>Some more text<img src="b.jpg" class="testclass2" alt="image2" title="image2"/></p>
</div>';
$html = HTMLreplaceImagesByText($html, "a.jpg", "custom text");
echo htmlentities($html);
The output will be the same as with the DOM parsing solution. But it will fail in many specific situations, where the DOM solution will not have any problem. For instance, if a matching image tag appears in a comment or as a string within a script tag, it will make the replacement, while it shouldn't. Worse, when the matching image tag has a greater-than sign in an attribute value, the replacement will produce wrong results.
There are many other instances where it will go wrong.
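The comment case described above can be checked with a small sketch. The sample HTML, with a commented-out copy of the image, is an assumption added for illustration; the regular expression is the same pattern as in the function above, written as a single string.

```php
<?php
$html = '<div><!-- <img src="a.jpg"> --><p>Some text<img src="a.jpg" alt="image"></p></div>';

// Regex approach: also rewrites the <img> that sits inside the comment.
$pattern = '/<img[^>]*?\ssrc\s*=\s*[\'"]' . preg_quote('a.jpg', '/') . '[\'"].*?>/si';
$viaRegex = preg_replace($pattern, '<span>custom text</span>', $html);

// DOM approach: the comment is a comment node, not an element, so it is skipped.
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings silently
$dom->loadHTML($html);
libxml_clear_errors();
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
    if ($img->getAttribute('src') === 'a.jpg') {
        $span = $dom->createElement('span');
        $span->appendChild($dom->createTextNode('custom text'));
        $img->parentNode->replaceChild($span, $img);
    }
}
$viaDom = $dom->saveHTML($dom->getElementsByTagName('div')->item(0));

echo substr_count($viaRegex, '<span>'), "\n"; // 2: the comment was changed too
echo substr_count($viaDom, '<span>'), "\n";   // 1: only the real image
```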

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highligher"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headlines, bolds etc.. So, I am looking for a solution that can construct a regex or DOM xpath query depending on the contents of the string being searched.
Thanks!
Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
foreach ($p_elements as $p_element)
{
// get p element nodes
$nodes = $p_element->childNodes;
// check for "a" nodes in these nodes
foreach ($nodes as $node) {
// found an a node - check must be defined better!
if(strtolower($node->nodeName) === 'a')
{
// create the new span element
$span_element = $doc->createElement('span');
$span_element->setAttribute('class', 'highlighter');
// replace the "p" element with the span
$p_element->parentNode->replaceChild($span_element, $p_element);
// append the "p" element to the span
$span_element->appendChild($p_element);
// stop after the first link so the paragraph is wrapped only once
break;
}
}
}
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
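As a DOM-based alternative to the regular expressions above, an XPath query can select only the paragraphs that contain a matching link. This is a sketch under assumptions: the sample markup and the http://www.google.com URL are illustrative, and matching on a substring of the href is only one possible rule.

```php
<?php
$html = '<p><a href="http://www.google.com">Check this site</a></p>'
      . '<p>This is a new paragraph!</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings silently
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every <p> that directly contains a link whose href mentions google.com.
$paragraphs = $xpath->query('//p[a[contains(@href, "google.com")]]');

foreach (iterator_to_array($paragraphs) as $p) {
    $span = $doc->createElement('span');
    $span->setAttribute('class', 'highlighter');
    $p->parentNode->replaceChild($span, $p); // put the span where the <p> was
    $span->appendChild($p);                  // then move the <p> inside it
}

$out = $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
echo $out;
```

Because the query already filters the paragraphs, no nested loops over child nodes are needed, and paragraphs without a matching link are never touched.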

Getting Text Between HTML Tags & Replacing Them

I would like to get the text between HTML tags and replace it dynamically. Since the HTML might contain anything (nested tags, comments, etc.), I think the DOMDocument class is the way to go. However, I wasn't able to find an example that fits my needs: I can only get the text of one specifically selected tag, and I couldn't find an example of replacing the selected text.
<?php
// HTML OUTPUT
$html= "<p>Subject,</p>
<h1>H1 title</h1>
<h2>H2 title</h2>
<h3>H2 title</h3>";
// DESIRED OUTPUT
$newHTML = "<p>My Fav. Colors;</p>
<h1>Blue</h1>
<h2>Orange</h2>
<h3>Yellow</h3>";
?>
Basically, I would like to extract the text from the HTML dynamically (it might contain nested tags, comments, JavaScript, and so on) and replace it (the replacement values will come from a database) to create the new HTML output.
What is the best and most elegant way to do this? Is the DOMDocument class the tool I need, or is regex the way to go?
I would be really glad if you could show me a small piece of code so I can understand it clearly.
P.S. HTML document in question might be a page on another domain. Such as http://anotherdomain.com/page.html.
Here is an example of DOM.
$html= "<p>Subject,</p>
<h1>H1 title</h1>
<h2>H2 title</h2>
<h3>H2 title</h3>";
$doc = new DOMDocument;
$doc->loadHTML( '<div>' . $html . '</div>');
foreach($doc->getElementsByTagName('div')->item(0)->childNodes as $node) {
switch ($node->nodeName) {
case "p":
$node->nodeValue = "My Fav. Colors";
break;
case "h1":
$node->nodeValue = "Blue";
break;
case "h2":
$node->nodeValue = "Orange";
break;
case "h3":
$node->nodeValue = "Yellow";
break;
}
}
echo $doc->saveXML($doc);
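A variation on the example above, sketched under assumptions: the replacement texts come from an array keyed by tag name (standing in for values loaded from a database), and each child is serialized separately so that the temporary wrapper <div> and the surrounding document are left out of the result. The $replacements array and its contents are hypothetical.

```php
<?php
$html = "<p>Subject,</p>
<h1>H1 title</h1>
<h2>H2 title</h2>
<h3>H3 title</h3>";

// Hypothetical replacements, e.g. loaded from a database, keyed by tag name.
$replacements = ['p' => 'My Fav. Colors;', 'h1' => 'Blue', 'h2' => 'Orange', 'h3' => 'Yellow'];

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings silently
$doc->loadHTML('<div>' . $html . '</div>');
libxml_clear_errors();

$wrapper = $doc->getElementsByTagName('div')->item(0);
$newHtml = '';
foreach ($wrapper->childNodes as $node) {
    if (isset($replacements[$node->nodeName])) {
        // Setting textContent replaces all children of the node with the new text.
        $node->textContent = $replacements[$node->nodeName];
    }
    // Serialize each child so the wrapper <div> itself is left out.
    $newHtml .= $doc->saveHTML($node);
}
echo $newHtml;
```

Keying by tag name only works while each tag occurs once; for real pages you would likely need a more specific key, such as an XPath expression per replacement.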

Fixing unclosed HTML tags

I am working on a blog layout and need to create an abstract of each post (say, the 15 latest) to show on the homepage. The content is already formatted as HTML by the Textile library. If I use substr to take the first 500 characters of a post, the main problem is how to close the tags that are left unclosed.
e.g
<div>.......................</div>
<div>...........
<p>............</p>
<p>...........| 500 chars
</p>
<div>
What I get is two unclosed tags, <p> and <div>. The <p> won't cause much trouble, but the <div> breaks the whole page layout. Any suggestions on how to track the open tags and close them, manually or otherwise?
There are lots of methods that can be used:
Use a proper HTML parser, like DOMDocument
Use PHP Tidy to repair the un-closed tag
Some would suggest HTML Purifier
As ajreal said, DOMDocument is a solution.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
";
$doc = new DOMDocument();
// "@" suppresses the warnings triggered by the invalid markup
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage: the DOM extension is enabled by default in PHP, whereas Tidy is an optional extension.
You can use DOMDocument to do it, but be careful of string encoding issues. Also, you'll have to use a complete HTML document, then extract the components you want. Here's an example:
function make_excerpt ($rawHtml, $length = 500) {
// append an ellipsis and "More" link
$content = substr($rawHtml, 0, $length)
. '… More >';
// Detect the string encoding
$encoding = mb_detect_encoding($content);
// pass it to the DOMDocument constructor
$doc = new DOMDocument('', $encoding);
// Must include the content-type/charset meta tag with $encoding
// Bad HTML will trigger warnings, suppress those
@$doc->loadHTML('<html><head>'
. '<meta http-equiv="content-type" content="text/html; charset='
. $encoding . '"></head><body>' . trim($content) . '</body></html>');
// extract the components we want
$nodes = $doc->getElementsByTagName('body')->item(0)->childNodes;
$html = '';
$len = $nodes->length;
for ($i = 0; $i < $len; $i++) {
$html .= $doc->saveHTML($nodes->item($i));
}
return $html;
}
$html = "<p>.......................</p>
<p>...........
<p>............</p>
<p>...........| 500 chars";
// output fixed html
echo make_excerpt($html, 500);
Outputs:
<p>.......................</p>
<p>...........
</p>
<p>............</p>
<p>...........| 500 chars… More ></p>
If you are using WordPress you should wrap the substr() invocation in a call to wpautop - wpautop(substr(...)). You may also wish to test the length of the $rawHtml passed to the function, and skip appending the "More" link if it isn't long enough.
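The length check suggested above could look like the following sketch. It makes a few assumptions that are not in the answer: the input is UTF-8, the cut is byte-based (substr), a tag that was cut in half is dropped before parsing, and libxml_use_internal_errors is used instead of the @ operator. The function name closeTruncatedHtml is made up for illustration.

```php
<?php
// Hypothetical helper: truncate HTML and let DOMDocument close any open tags.
function closeTruncatedHtml(string $rawHtml, int $length = 500): string
{
    if (strlen($rawHtml) <= $length) {
        return $rawHtml;                 // short enough, nothing to cut
    }
    $cut = substr($rawHtml, 0, $length);
    // Drop a tag that was cut in half, e.g. a trailing "<di".
    $cut = preg_replace('/<[^>]*$/', '', $cut);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);    // collect warnings instead of printing them
    $doc->loadHTML('<html><head><meta http-equiv="content-type" '
        . 'content="text/html; charset=utf-8"></head><body>' . $cut . '</body></html>');
    libxml_clear_errors();

    // Serialize only the children of <body>, now with all tags closed.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $html .= $doc->saveHTML($node);
    }
    return $html;
}

echo closeTruncatedHtml('<div><p>first paragraph</p><p>second paragraph</p></div>', 35);
```

A byte-based cut can split a multi-byte character; if the mbstring extension is available, mb_strlen and mb_substr are a safer choice for non-ASCII content.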
