Trying to get the plaintext using php domdocument

I am trying to get the plain text from a given piece of HTML, but I have not managed to do it.
This is what I have done so far.
My HTML is in the $content variable.
Now I am passing $content to PHP's DOMDocument:
$d = new DOMDocument();
@$d->loadHTML($content);
What is my next step to get the plain text from the loaded HTML?
Please help me with this. Thanks in advance!

I don't fully understand your question, but if you want the text content as a string, then
try this...
$d = new DOMDocument();
$d->loadHTML($content);
$plainText = $d->textContent;
echo $plainText;

The DOM itself has no dedicated "HTML to plain text" conversion beyond the textContent property. You may use the strip_tags() function though. Like this:
$d = new DOMDocument();
$d->loadHTML($content);
$plainText = strip_tags($d->textContent);
echo $plainText;
// which is probably equivalent to:
$plainText = strip_tags($content);
Note: using DOMDocument is useful to check that $content parses correctly, or if you want the text of one specific tag ($main = $d->getElementsByTagName('main'); $plainText = $main->item(0)->textContent). Otherwise, calling strip_tags() directly is enough.
There are some problems as the strip_tags() function has no clue about the type of tag being removed. This means a sequence such as:
... word</p><p>more ...
will concatenate those two words:
... wordmore ...
This is a difficult problem, since some tags should be removed without adding a space and others should not. For example, if the user applied some form of emphasis inside a word, removing the tag without inserting a space is correct:
che<u>val</u> -> cheval
che<u>vaux</u> -> chevaux
(Singular and plural of "horse" in French)
A browser has no clue either; the CSS is what tells it whether a tag is a block (<div>) or inline (<u>).
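One way to work around the concatenation problem is to walk the DOM tree and add a space after elements you consider block-level. The following is only a sketch: the function name html_to_text and the list of block tags are assumptions made for illustration, not part of PHP or of the answer above.

```php
<?php
// Hypothetical helper: collect text nodes and add a space after block-level elements.
function html_to_text(string $html): string
{
    // Assumed list of block-level tags; extend it to match your markup.
    $blockTags = ['p', 'div', 'br', 'li', 'ul', 'ol', 'h1', 'h2', 'h3', 'tr', 'table'];

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $text = '';
    $walk = function (DOMNode $node) use (&$walk, &$text, $blockTags): void {
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                $text .= $child->nodeValue;
            } elseif ($child instanceof DOMElement) {
                $walk($child);
                // Inline tags such as <u> add nothing; block tags add a separator.
                if (in_array(strtolower($child->nodeName), $blockTags, true)) {
                    $text .= ' ';
                }
            }
        }
    };
    $walk($doc);

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo html_to_text('<p>word</p><p>more che<u>val</u></p>');
```

With this approach the two paragraphs stay separated while the underlined part of a word stays attached to it. Which tags count as "block" is a judgement call, since the real answer depends on the CSS.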

Related

Avoid percent-encoding href attributes when using PHP's DOMDocument

The best answers I was able to find for this issue are using XSLT, but I'm just not sure how to apply those answers to my problem.
Basically, DOMDocument is doing a fine job of escaping URLs (in href attributes) that are passed in, but I'm actually using it to build a Twig/Django style template, and I'd rather it leave them alone. Here's a specific example, illustrating the "problem":
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br></body></html>');
echo $doc->saveHTML();
Which outputs the following:
<html><body>Test<br></body></html>
Is it possible to NOT percent-encode the href attribute?
If it's not possible directly, can you suggest a concise and reliable workaround? I'm doing other processing, and the DOMDocument usage will have to stay. So perhaps a pre/post processing trick?

I'm not happy with the 'hack'/duct-tape solution, but this is how I'm currently solving the problem:
function fix_template_variable_tokens($template_string)
{
    $pattern = "/%7B%7B(\w+)%7D%7D/";
    $replacement = '{{$1}}';
    return preg_replace($pattern, $replacement, $template_string);
}
$html = $doc->saveHTML();
$html = fix_template_variable_tokens($html);
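For illustration, here is a minimal round trip of the same post-processing idea. The template string and the {{url}} token are hypothetical (they are not taken from the question), and whether saveHTML() percent-encodes the braces may depend on the libxml version, so the replacement is written to be harmless either way.

```php
<?php
// Hypothetical template with a Twig-style token inside an href attribute.
$template = '<html><body><a href="{{url}}">Test</a></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($template);

// saveHTML() may return the href as %7B%7Burl%7D%7D.
$html = $doc->saveHTML();

// Decode only the braces of simple {{name}} tokens; other escapes stay untouched.
// The "i" flag also accepts lower-case hex digits such as %7b.
$restored = preg_replace('/%7B%7B(\w+)%7D%7D/i', '{{$1}}', $html);

echo $restored;
```

Note that the \w+ pattern only covers tokens made of letters, digits, and underscores; tokens with spaces or filters would need a broader pattern.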

substrings result in incomplete html tags

I am expanding/condensing a blog post using substrings, where the second substring sits inside a div that is revealed when a button is pressed (hence concatenating both substrings).
The code looks like this:
<?php echo substr($f2, 0, 50); ?>
<div id="<?php echo $f4; ?>" class="hidden">
<?php echo substr($f2, 50, 5000); ?></div>
My problem, however, is that if the blog post contains HTML tags (e.g. </li>, </p>) and the initial substring ends before the closing tag of an element, it causes major formatting problems.
Is there a way around this using my current method, or am I going to need something like an XML stylesheet (in which case please guide me through it)?
EDIT:
I have semi-completed my request using DOMDocument.
$second = substr($f2, 50, 5000);
$dom= new DOMDocument();
$dom->loadHTML($second);
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
$secondoutput = ($dom->saveXml($body->item(0)));
$first = substr($f2, 0, 50);
$dom= new DOMDocument();
$dom->loadHTML($first);
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
$firstoutput = ($dom->saveXml($body->item(0)));
This works, except that when the second substring is shown it no longer has the previous formatting, as it has been purified.
Is there any way to reattach the previous HTML tag when the second substring is shown?

There are different solutions to this problem, but substr is not particularly suitable (as you mentioned).
You could use regular expressions or an HTML parser.
Go ahead and copy solutions from this question.

You may want to use Tidy to fix the truncated HTML.

You may want to parse whole HTML code with DOMDocument or SimpleHTMLDOM and then remove last elements until the post is short enough.
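A rough sketch of that last suggestion, using DOMDocument: parse the whole post, then drop trailing top-level elements until the remaining text fits. The function name truncate_html and the wrapper div are assumptions for illustration. The cut is coarse (whole elements only), and the length is measured in bytes with strlen().

```php
<?php
// Hypothetical helper: remove trailing top-level nodes until the text fits $limit bytes.
function truncate_html(string $html, int $limit): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    // Wrap the post so there is a single known container to work on.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);

    // Drop the last child while the visible text is still too long.
    while ($root->lastChild !== null && strlen($root->textContent) > $limit) {
        $root->removeChild($root->lastChild);
    }

    // Serialize the remaining children; every tag that is left is properly closed.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncate_html('<p>Hello <b>world</b></p><p>Second paragraph</p>', 11);
```

Because the output is serialized from the tree, it cannot end in the middle of a tag, which is the problem substr() runs into.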

Use XML tag attribute as PHP variable and use it in HTTP request

Is there any way to save an XML attribute as a PHP variable and automatically place it in another http request? Or is there a better way to do that?
Basically, I send a server an http request, the code I get back looks something like this:
<tag one="info" two="string">
I need to save the string in attribute two and insert it in an http request that looks something like this:
http://theserver.com/request?method=...&id=123456
The '123456' ID needs to be the string in attribute 'two'.
Any help would be appreciated!
Thanks,
Jane

If you are 100% entirely absolutely completely TOTALLY sure that the content will always have that exact format, you can probably use regex as the other answers have suggested.
Otherwise, DOM isn't very hard to manage...
$dom = new DOMDocument;
$dom->loadXML($yourcontent); // needs well-formed XML, e.g. <tag one="info" two="123456"/>
$el = $dom->getElementsByTagName('tag')->item(0); // presuming your tag is the only element in the document
$id = '';
if ($el) {
    $id = $el->getAttribute('two'); // the value you need is in attribute "two"
}
$url = 'http://theserver.com/request?method=...&id=' . urlencode($id);
If you have a real example of the XML that you'll receive, please do post it and I'll adapt this answer for it.

If you can... send it in JSON. If you can't, and the only thing that is returned is that wee snippet, then I'd use regex to pull out the value.
Something like /.*two="([^"]+)".*/ should match the whole string; replace the match with '$1'.
Otherwise use SimpleXML.
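A minimal SimpleXML sketch could look like this. It assumes the response is well-formed XML, so the sample tag is written self-closed here, which the snippet in the question is not. The method=... part of the URL is left elided exactly as in the question.

```php
<?php
// Assumed response body; note the added "/>" so that it is well-formed XML.
$response = '<tag one="info" two="123456"/>';

$xml = simplexml_load_string($response);

// Attributes are read with array syntax; cast to string to get the plain value.
$id = ($xml !== false) ? (string) $xml['two'] : '';

// rawurlencode() keeps the value safe inside a query string.
$url = 'http://theserver.com/request?method=...&id=' . rawurlencode($id);

echo $url;
```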

You could use:
<?php
$string = '<tag one="info" two="123456">';
if (preg_match('/<tag one="[^"]*" two=\"([^"]*)\">/',$string,$match)) {
$url = 'http://theserver.com/request?method=...&id=' . $match[1];
echo $url;
}
?>

PHP get external page content

I get the HTML from another site with file_get_contents. My question is: how can I get the value of a specific tag?
Let's say I have:
<div id="global"><p class="paragraph">1800</p></div>
How can I get the paragraph's value? Thanks.

If the example is really that trivial you could just use a regular expression. For generic HTML parsing though, PHP has DOM support:
$dom = new DOMDocument();
$dom->loadHTML("<div id=\"global\"><p class=\"paragraph\">1800</p></div>");
echo $dom->getElementsByTagName('p')->item(0)->nodeValue;
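If the page contains more than one paragraph, an XPath query can target the element by its id and class instead of by position. This is a sketch using the sample markup from the question; in practice $html would come from file_get_contents(), and real pages usually produce parse warnings that you may want to silence.

```php
<?php
// Sample markup from the question; in practice this comes from file_get_contents().
$html = '<div id="global"><p class="paragraph">1800</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Select the <p class="paragraph"> that sits directly inside <div id="global">.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="global"]/p[@class="paragraph"]');

// Guard against a missing node instead of calling a method on null.
$value = ($nodes->length > 0) ? $nodes->item(0)->nodeValue : null;

echo $value;
```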

You need to parse the HTML. There are several ways to do this, including using PHP's XML parsing functions.
However, if it is just a simple value (as you asked above) I would use the following simple code:
// your content
$contents='<div id="global"><p class="paragraph">1800</p></div>';
// define start and end position
$start='<div id="global"><p class="paragraph">';
$end='</p></div>';
// find the stuff
$contents=substr($contents,strpos($contents,$start)+strlen($start));
$contents=substr($contents,0,strpos($contents,$end));
// write output
echo $contents;
Best of luck!
Christian Sciberras
(tested and works)

$input = '<div id="global"><p class="paragraph">1800</p></div>';
$output = strip_tags($input);

preg_match_all('#paragraph">(.*?)<#is', $input, $output);
print_r($output);
Untested.

How do I insert HTML into a PHP DOM object? [duplicate]

This question already has answers here:
How to insert HTML to PHP DOMNode?
(5 answers)
Closed 7 years ago.
I am using PHP's DOM object to create HTML pages for my website. This works great for my head; however, since I will be putting a lot of HTML into the body (not via DOM), I thought I would need to use DOM->createElement($bodyHTML) to add the HTML from my site to the DOM object.
However, DOM->createElement seems to escape all HTML entities, so my end result displayed the HTML source on the page instead of the rendered HTML.
I am currently using a hack to get this to work:
$body = $this->DOM
->createComment('DOM Glitch--><body>'.$bodyHTML."</body><!--Woot");
This puts all my site code in a comment; I break out of the comment and manually add the <body> tags.
Currently this method works, but I believe there should be a more proper way of doing this. Ideally something like DOM->createElement() that will not parse any of the string.
I also tried using DOM->createDocumentFragment(). However, it does not like some of the string, so it errors and does not work (and it takes extra CPU power to re-parse the body's HTML).
So, my question is: is there a better way of doing this other than using DOM->createComment()?

You use the DOMDocumentFragment object to insert arbitrary HTML chunks into another document.
$dom = new DOMDocument();
@$dom->loadHTML($some_html_document); // @ to suppress a bajillion parse errors
$frag = $dom->createDocumentFragment(); // create fragment
$frag->appendXML($some_other_html_snippet); // insert arbitrary HTML into the fragment (it must be well-formed XML)
$node = $dom->getElementsByTagName('body')->item(0); // or whatever node you want to insert the fragment into
$node->appendChild($frag); // stuff the fragment into the original tree
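Since appendXML() expects well-formed XML, HTML with unclosed tags such as <br> or named entities such as &copy; can make it fail, which may be what the asker ran into. A possible alternative, sketched here, is to parse the snippet with loadHTML() in a scratch document and import the resulting nodes. The helper name append_html is hypothetical, and it assumes $parent is an element that already belongs to a document.

```php
<?php
// Hypothetical helper: parse an HTML snippet leniently and append it to $parent.
function append_html(DOMNode $parent, string $html): void
{
    $tmp = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    // The wrapper div gives the snippet a single known container.
    $tmp->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $tmp->getElementsByTagName('div')->item(0);
    foreach ($wrapper->childNodes as $child) {
        // importNode(..., true) deep-copies the node into the target document.
        $parent->appendChild($parent->ownerDocument->importNode($child, true));
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><head><title>Demo</title></head><body></body></html>');
$body = $dom->getElementsByTagName('body')->item(0);

append_html($body, '<p>Hello &copy; <br>world</p><i>more</i>');

echo $dom->saveHTML();
```

This costs one extra parse of the snippet, but it accepts the same sloppy markup that loadHTML() does.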

I FOUND THE SOLUTION but it's not a pure php solution, but works very well. A little hack for everybody who lost countless hours, like me, to fix this
$dom = new DomDocument;
// main object
$object = $dom->createElement('div');
// html attribute
$attr = $dom->createAttribute('value');
// ugly html string
$attr->value = "<div> this is a really html string ©</div><i></i> with all the © that XML hates!";
$object->appendChild($attr);
// jquery fix (or javascript as well)
$('div').html($(this).attr('value')); // and it works!
$('div').removeAttr('value'); // to clean-up

loadHTML works just fine.
<?php
$dom = new DOMDocument();
$dom->loadHTML("<font color='red'>Hey there mrlanrat!</font>");
echo $dom->saveHTML();
?>
which outputs Hey there mrlanrat! in red.
or
<?php
$dom = new DOMDocument();
$bodyHTML = "here is the body, a nice body I might add";
$dom->loadHTML("<body> " . $bodyHTML . " </body>");
// this would even work as well.
// $bodyHTML = "<body>here is the body, a nice body I might add</body>";
// $dom->loadHTML($bodyHTML);
echo $dom->saveHTML();
?>
Which outputs:
here is the body, a nice body I might add
and inside your HTML source code, it's wrapped inside body tags.

I spent a lot of time working on Anthony Forloney's answer, but I cannot seem to get the HTML to append to the body without it erroring.
@Mark B: I have tried doing that, but as I said in the comments, it errored on my HTML.
I forgot to add my solution, which is below:
I decided to make my HTML object much simpler by not using DOM and just using strings.
