Fixing unclosed HTML tags - php

I am working on a blog layout and I need to create an abstract of each post (say, the 15 latest) to show on the homepage. The content I use is already formatted in HTML tags by the Textile library. If I use substr to get the first 500 chars of a post, the main problem I face is how to close the unclosed tags.
e.g
<div>.......................</div>
<div>...........
<p>............</p>
<p>...........| 500 chars
</p>
</div>
What I get is two unclosed tags, <p> and <div>. The <p> won't create much trouble, but the <div> messes up the whole page layout. Any suggestions on how to track the opening tags and close them manually, or something similar?

There are lots of methods that can be used:
Use a proper HTML parser, like DOMDocument
Use PHP Tidy to repair the unclosed tags
Some would suggest HTML Purifier

As ajreal said, DOMDocument is a solution.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
";
$doc = new DOMDocument();
// "@" suppresses the warnings raised for the malformed markup
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage: DOMDocument is available in PHP by default, whereas Tidy is a separate extension that has to be enabled.

You can use DOMDocument to do it, but be careful of string encoding issues. Also, you'll have to use a complete HTML document, then extract the components you want. Here's an example:
function make_excerpt ($rawHtml, $length = 500) {
// append an ellipsis and "More" link
$content = substr($rawHtml, 0, $length)
. '… More >';
// Detect the string encoding
$encoding = mb_detect_encoding($content);
// pass it to the DOMDocument constructor
$doc = new DOMDocument('', $encoding);
// Must include the content-type/charset meta tag with $encoding
// Bad HTML will trigger warnings, suppress those with "@"
@$doc->loadHTML('<html><head>'
. '<meta http-equiv="content-type" content="text/html; charset='
. $encoding . '"></head><body>' . trim($content) . '</body></html>');
// extract the components we want
$nodes = $doc->getElementsByTagName('body')->item(0)->childNodes;
$html = '';
$len = $nodes->length;
for ($i = 0; $i < $len; $i++) {
$html .= $doc->saveHTML($nodes->item($i));
}
return $html;
}
$html = "<p>.......................</p>
<p>...........
<p>............</p>
<p>...........| 500 chars";
// output fixed html
echo make_excerpt($html, 500);
Outputs:
<p>.......................</p>
<p>...........
</p>
<p>............</p>
<p>...........| 500 chars… More ></p>
If you are using WordPress you should wrap the substr() invocation in a call to wpautop - wpautop(substr(...)). You may also wish to test the length of the $rawHtml passed to the function, and skip appending the "More" link if it isn't long enough.
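The length check suggested above could look roughly like this. It is only a sketch: make_excerpt_if_long() is a hypothetical name, the input is assumed to be UTF-8, the "More" marker is a plain placeholder string, and libxml_use_internal_errors() is used in place of the "@" operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper, not part of the original answer. Assumes UTF-8 input.
function make_excerpt_if_long(string $rawHtml, int $length = 500, string $more = '...'): string
{
    $isTruncated = strlen($rawHtml) > $length;
    $content = $rawHtml;
    if ($isTruncated) {
        $content = substr($rawHtml, 0, $length);
        // Drop a tag that substr() cut in half, such as a trailing "<di".
        $content = preg_replace('/<[^>]*$/', '', $content);
        // Only truncated posts get the "More" marker.
        $content .= $more;
    }

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body>'
        . trim($content) . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the children of <body>, which now have balanced tags.
    $html = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $html .= $doc->saveHTML($node);
    }
    return $html;
}

$short = make_excerpt_if_long('<p>short post</p>');
$long  = make_excerpt_if_long('<div><p>hello world</p></div>', 12);
echo $short, "\n", $long, "\n";
```

Note that substr() counts bytes, so with multibyte text the cut may land inside a character; mb_substr() would be the safer choice if the mbstring extension is available.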

Related

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
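The nested-div duplication mentioned in the question can also be avoided by selecting only the innermost divs. The following sketch uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM; join_leaf_div_text() is a hypothetical name, and the sample markup is the question's example with an extra wrapper div added to show the nested case.

```php
<?php
// Hypothetical helper: joins the text of every div that contains no other div.
function join_leaf_div_text(string $html, string $separator = '. '): string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $parts = [];
    // "not(.//div)" keeps only divs without a descendant div, so parents are skipped.
    foreach ($xpath->query('//div[not(.//div)]') as $div) {
        $text = trim($div->textContent);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($separator, $parts);
}

$sample = '<body><div class="H2">Header</div>'
    . '<div><div class="P">this is a paragraph</div>'
    . '<div class="P">this is another paragraph</div></div></body>';

$joined = join_leaf_div_text($sample);
echo $joined, "\n"; // Header. this is a paragraph. this is another paragraph
```

One limitation of this approach: text that sits directly inside a parent div, next to nested divs, is skipped.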

Insert " < " into html file using php

Is there a way I can insert < or > into an HTML file using PHP? Here is a part of the code:
<?php
$custom_tag = $searchNode->nodeValue = "<";
$custom_tag = $searchNode->setAttribute("data-mark", $theme_name);
$custom_tag = $dom->saveHTML($searchNode);
$root = $searchNode->parentNode;
$root->removeChild($searchNode);
$test = $dom->createTextNode("<?php echo '$custom_tag'; ?>");
$root->appendChild($test);
$saved_file_in_HTML = $dom->saveHTML();
$saved_file = htmlspecialchars_decode($saved_file_in_HTML);
file_put_contents("test.html", $saved_file);
The problem is that I get &lt; using the above method, and I would like to have <.
EDIT:
Full code:
if($searchNode->hasAttribute('data-put_page_content')) {
$custom_tag = $searchNode->nodeValue = "'; \$up_level = null; \$path_to_file = 'DATABASE/PAGES/' . basename(\$_SERVER['PHP_SELF'], '.php') . '.txt'; for(\$x = 0; \$x < 1; ) { if(is_file(\$up_level.\$path_to_file)) { \$x = 1; include(\$up_level.\$path_to_file); } else if(!is_file(\$up_level.\$path_to_file)) { \$up_level .= '../'; } }; echo '";
$custom_tag = $searchNode->setAttribute("data-mark", $theme_name);
$custom_tag = $dom->saveHTML($searchNode);
$root = $searchNode->parentNode;
$root->removeChild($searchNode);
$test = $dom->createTextNode("<?php echo '$custom_tag'; ?>");
$root->appendChild($test);
$saved_file_in_HTML = $dom->saveHTML();
$saved_file = htmlspecialchars_decode($saved_file_in_HTML);
file_put_contents("../THEMES/ACTIVE/". $theme_name ."/" . $trimed_file, $saved_file);
copy("../THEMES/ACTIVE/" . $theme_name . "/structure/index.php", "../THEMES/ACTIVE/" . $theme_name ."/index.php");
header("Location: editor.php");
}
FINAL EDIT:
If you want to have > or < using PHP's DOM methods, it works with the createTextNode() method.
The original question appears to concern how to use PHP to manufacture an HTML tag, so I'll address that question. While it is a good idea to use htmlentities(), you also need to be aware that its primary purpose is to protect your code so that a malicious user doesn't inject it with JavaScript or other tags that could create security problems. In this case, the only way to generate HTML is to create HTML, and PHP provides at least three ways to accomplish this feat.
You may write code such as the following:
<?php
define("LF_AB","<");
define("RT_AB",">");
echo LF_AB,"div",RT_AB,"\n";
echo "content\n";
echo LF_AB,"/div",RT_AB,"\n";
The code produces start and end div tags with some minimal content. Note, the defines are optional, i.e. one could code in a more straightforward fashion <?php echo "<"; ?> instead of resorting to using a define() to generate the left-angle tag character.
However, if one is wary of generating HTML in this manner, then you may also use html_entity_decode:
<?php
define("LF_AB",html_entity_decode("&lt;"));
define("RT_AB",html_entity_decode("&gt;"));
echo LF_AB,"div",RT_AB,"\n";
echo "content\n";
echo LF_AB,"/div",RT_AB,"\n";
This example treats the arguments to html_entity_decode as harmless HTML entities and then converts them into HTML left and right angle characters respectively.
Alternately, you could also take advantage of the DOM with PHP, even with an existing HTML page, as follows:
<?php
function displayContent()
{
$dom = new DOMDocument();
$element = $dom->createElement('div', 'My great content');
$dom->appendChild($element);
echo $dom->saveHTML();
}
?>
<!DOCTYPE html>
<html>
<head>
<title>Untitled</title>
<style>
div {
background:#ccbbcc;
color:#009;
font-weight:bold;
}
</style>
</head>
<body>
<?php displayContent(); ?>
</body>
</html>
Note: the DOM method createTextNode, true to its name, creates text and not HTML, i.e. the browser will interpret the left and right angle characters only as text, without any additional meaning.
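To illustrate the point about createTextNode, here is a small sketch showing that angle brackets in a text node are escaped when the node is serialized, and that htmlspecialchars_decode() (the function used in the question) turns them back into literal characters. The variable names are only illustrative.

```php
<?php
$dom = new DOMDocument();
$p = $dom->createElement('p');
// The text node holds a literal "<", but it is text, not markup.
$p->appendChild($dom->createTextNode('1 < 2'));
$dom->appendChild($p);

// Serializing escapes the character so the browser shows it as text.
$serialized = $dom->saveHTML($p);
echo $serialized, "\n"; // <p>1 &lt; 2</p>

// Decoding the serialized string restores the raw character.
$decoded = htmlspecialchars_decode($serialized);
echo $decoded, "\n"; // <p>1 < 2</p>
```

Decoding a whole saved document this way also decodes every other entity in it, so it is only safe when the document contains no escaped text that has to stay escaped.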
If I understand you correctly, I think you could use &lt; and &gt; in the first place. They should be rendered as < and > in the HTML. Might save you some coding. Let me know if it works.

import the content of a form in wikipedia

I want to import all the names from this page ( http://nl.wikipedia.org/w/index.php?title=Samenstelling_Tweede_Kamer_2012-heden&action=edit&section=1 ) (from the edit form), compare them with the names on this page ( http://nl.wikipedia.org/wiki/Samenstelling_Tweede_Kamer_2012-heden ), and print out the relevant links with PHP.
You have to write some code to parse the HTML from the Wikipedia site.
The PHP Simple HTML DOM Parser is the way to go to parse the HTML and get the information you need.
Once you have your data from the Wikipedia pages, you can compare them in your code.
Example to get the names (not tested, you probably need some more selectors to get exactly what you want):
ini_set('memory_limit','160M');
require('simple_html_dom.php');
// Create DOM from URL or file
$url = 'http://nl.wikipedia.org/wiki/Samenstelling_Tweede_Kamer_2012-heden';
// Object oriented style
$html = new simple_html_dom();
$html->load_file($url);
// Procedural style
// $html = file_get_html($url);
$items = array();
// Find div with class editmode and loop through it.
foreach($html->find('div.editmode') as $article) {
// Get all anchors inside the list items of an unordered list
foreach($article->find('ul li a') as $a)
$items[] = "<a href='". $a->href . "'>" . $a->plaintext . "</a>";
}
print_r($items);
If you see some weird characters in names (André Bosman for example), you should consider defining your charset (to UTF-8) in your html like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
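The comparison step itself is not shown above. A minimal sketch, assuming the names from the edit form have already been collected into a plain array and the links from the article into an array keyed by name, could look like this. All data below is placeholder data and links_for_names() is a hypothetical helper.

```php
<?php
// Placeholder data: in practice both arrays would be filled by the scraping code.
$namesFromEditForm = ['Example Name A', ' Example Name B '];
$linksByName = [
    'Example Name A' => '/wiki/Example_Name_A',
    'Example Name C' => '/wiki/Example_Name_C',
];

// Hypothetical helper: keeps only the links whose name appears in $names.
function links_for_names(array $names, array $linksByName): array
{
    // Trim stray whitespace, then flip so the names become array keys.
    $wanted = array_flip(array_map('trim', $names));
    return array_intersect_key($linksByName, $wanted);
}

$matches = links_for_names($namesFromEditForm, $linksByName);
print_r($matches);
```

The match is exact and case-sensitive, so names that are spelled or encoded differently on the two pages would need to be normalized first.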

DOMDocument escaping end chars in PHP

I have a problem with class DOMDocument. I use this php class to edit a html template. I have in this template this meta tag:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8"/>
But after editing, although I was not editing this tag, it escapes the end char "/" and it doesn't work.
This is the script:
$textValue = $company.'<br />'.$firstName.' '.$lastName.'<br />'.$adress;
$values = array($company, $firstName.' '.$lastName, $adress);
$document = new DOMDocument;
$document->loadHTMLFile($dir.'temp/OEBPS/signature.html');
$dom = $document->getElementById('body');
for ($i = 0; $i < count($values); $i++) {
$dom->appendChild($document->createElement('p', $values[$i]));
}
$document->saveHTMLFile($dir.'temp/OEBPS/signature.html');
echo 'signature added <br />';
Please see the answer provided by this question: Why doesn't PHP DOM include slash on self closing tags?
In short, DOMDocument->saveHTMLFile() outputs its internal structure as regular HTML rather than XHTML. If you absolutely need XHTML, you can use DOMDocument->save() (or saveXML() to get a string), which serializes the document as XML and uses self-closing tags. The only problem with this method is that some HTML elements, such as <script> and <style>, must not be self-closed, so you have to put a space in their content to keep them from being collapsed.
I would recommend just ignoring the issue unless it is mandatory that you fix it. Self-closing tags are a relic of XHTML and are unused in HTML5.
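The difference between the two serializers can be seen on a single element. This is a small sketch with illustrative variable names; the LIBXML_NOEMPTYTAG option shown at the end is an alternative to the "space in the content" trick for elements that must not be collapsed.

```php
<?php
$doc = new DOMDocument();

$meta = $doc->createElement('meta');
$meta->setAttribute('charset', 'UTF-8');
$doc->appendChild($meta);

// HTML serialization: void elements are written without a trailing slash.
$asHtml = $doc->saveHTML($meta);
// XML serialization: empty elements are written self-closed.
$asXml = $doc->saveXML($meta);
echo $asHtml, "\n", $asXml, "\n";

$script = $doc->createElement('script');
// By default an empty <script> is collapsed in XML output...
$collapsed = $doc->saveXML($script);
// ...while LIBXML_NOEMPTYTAG forces an explicit closing tag.
$expanded = $doc->saveXML($script, LIBXML_NOEMPTYTAG);
echo $collapsed, "\n", $expanded, "\n";
```

Passing LIBXML_NOEMPTYTAG for a whole document would also expand elements like <meta> into an open and close pair, so it is best applied per node.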

Replace links from specific domain with text (PHP)

I have :
Title
And :
Title
I want to replace the link with the text "Title", but only for links to http://abc.com. I don't know how (I tried Google). Can you explain it to me? I'm not good at PHP.
Thanks in advance.
Not sure I really understand what you're asking, but if you :
Have a string that contains some HTML
and want to replace all links to abc.com by some text
Then, a good solution (better than regular expressions, should I say !) would be to use the DOM-related classes -- especially, you can take a look at the DOMDocument class, and its loadHTML method.
For example, considering that the HTML portion is declared in a variable :
$html = <<<HTML
<p>some text</p>
Title
<p>some more text</p>
Title
<p>and some again</p>
HTML;
You could then use something like this :
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
for ($i = $tags->length - 1 ; $i > -1 ; $i--) {
$tag = $tags->item($i);
if ($tag->getAttribute('href') == 'http://abc.com') {
$replacement = $dom->createTextNode($tag->nodeValue);
$tag->parentNode->replaceChild($replacement, $tag);
}
}
echo $dom->saveHTML();
And this would get you the following portion of HTML, as output :
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>some text</p>
Title
<p>some more text</p>
Title
<p>and some again</p>
</body></html>
Note that the whole link to http://abc.com (the <a> element) has been replaced by the text it contained.
If you want some other text instead, just use it where I used $tag->nodeValue, which is the current content of the node that's being removed.
Unfortunately, yes, this generates a full HTML document, including the doctype declaration, <html> and <body> tags, ...
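If the surrounding doctype, <html> and <body> tags are unwanted, one option is to serialize only the children of <body>. The sketch below reuses the replacement loop from above inside a hypothetical helper, unlink_href(); the sample markup and the http://example.org link are assumptions made for illustration.

```php
<?php
// Hypothetical helper: replaces links to $href with their text and returns a fragment.
function unlink_href(string $html, string $href): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Walk backwards because the node list is live and shrinks as links are replaced.
    $tags = $dom->getElementsByTagName('a');
    for ($i = $tags->length - 1; $i >= 0; $i--) {
        $tag = $tags->item($i);
        if ($tag->getAttribute('href') === $href) {
            $tag->parentNode->replaceChild($dom->createTextNode($tag->nodeValue), $tag);
        }
    }

    // Serialize only what is inside <body>, leaving out the doctype and wrapper tags.
    $fragment = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $fragment .= $dom->saveHTML($child);
    }
    return $fragment;
}

$sample = '<p>some text</p><a href="http://abc.com">Title</a>'
    . '<p>some more text</p><a href="http://example.org">Title</a>';
$fragment = unlink_href($sample, 'http://abc.com');
echo $fragment, "\n";
```

The comparison is an exact match on the href attribute, as in the loop above; matching every URL on a domain would need parse_url() or a prefix check instead.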
To cover another interpreted case:
$string = 'Title Title';
$pattern = '/\<\s?a\shref[\s="\']+([^\'"]+)["\']\>([^\<]+)[^\>]+\>/';
$result = preg_replace_callback($pattern, 'replaceLinkValueSelectively', $string);
function replaceLinkValueSelectively($matches)
{
list($link, $URL, $value) = $matches;
switch ($URL)
{
case 'http://abc.com':
$newValue = 'New Title';
break;
default:
return $link;
}
return str_replace($value, $newValue, $link);
}
echo $result;
input
Title Title
becomes
New Title Title
$string is your input, $result is your input modified. You can define more URLs as cases.
Please note: I wrote that regular expression hastily, and I'm quite the novice. Please check that it suits all your intended cases.
