I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested: it would add the content of the parent, which already includes the text of all its children, and then add the content of each child again, effectively duplicating the text. This could be fixed by checking whether the element's innertext contains a </div> (plaintext no longer has any tags to check).
Perhaps I should try callbacks.
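One way to sidestep the nested-div duplication is to collect text nodes rather than elements, since each piece of text belongs to exactly one text node. The following is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, not the parser used above). The inline $html string is an assumption standing in for the fetched page.

```php
<?php
// Stand-in for the fetched page (assumption: same markup as the example above)
$html = '<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings on real-world, sloppy HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$parts = [];
// Select every text node under <body> that is not whitespace-only.
// Each text node is visited once, so nested divs cannot duplicate text.
foreach ($xpath->query('//body//text()[normalize-space()]') as $textNode) {
    $parts[] = trim($textNode->nodeValue);
}

$result = ['body' => implode('. ', $parts)];
echo $result['body']; // Header. this is a paragraph. this is another paragraph
```

Note that this splits on every text node, so inline tags such as <b> or <a> inside a sentence would also introduce a delimiter, and text inside <script> or <style> elements in the body would be included unless filtered out.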
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
Related
I want to select the contents of every DIV tag in PHP.
Imagine we have this HTML page:
<html>
<body>
<div class="one">Content1</div>
<span>blah..</span>
<div class="two">Content2</div>
</body>
</html>
Now I want the content of every DIV tag. For example, from that HTML I want Content1 in one variable, Content2 in another, and so on.
I just need to access the parts easily.
Every page has a different number of DIV tags, so I need flexible code that detects the DIV tags and puts the content of each one into an array or some other kind of variable.
How can I do it?
DOMDocument
$divs = array();
$HTML = '<html>
<body>
<div class="one">Content1</div>
<span>blah..</span>
<div class="two">Content2</div>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($HTML);
foreach($doc->getElementsByTagName('div') as $div) {
array_push($divs, $div->textContent);
}
var_dump($divs);
Try the strip_tags() function:
http://php.net/manual/en/function.strip-tags.php
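As a quick illustration of what strip_tags() does with the sample markup, here is a small sketch. It removes the tags but returns one string, so it does not by itself separate the content of each DIV; the compact $html string is an assumption based on the question's page.

```php
<?php
// Compact version of the question's markup (assumption: whitespace removed for clarity)
$html = '<div class="one">Content1</div><span>blah..</span><div class="two">Content2</div>';

// strip_tags() drops every tag and keeps the text in document order
$text = strip_tags($html);

echo $text; // Content1blah..Content2
```

If each DIV's content is needed in its own array entry, a DOM-based approach is a better fit than strip_tags().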
You can download PHP Simple HTML DOM Parser
and access the div tags like this:
$html = file_get_html('urltopage.com');
foreach($html->find('div') as $e)
echo $e->innertext . '<br>';
Good day everyone,
I'm very new to phpQuery, and this is my first post on Stack Overflow because I can't find the correct syntax for phpQuery chaining. I hope someone knows what I've been looking for.
I only want to remove a certain div inside another div.
<div id = "content">
<p>The text that i want to display</p>
<div class="node-links">Stuff i want to remove</div>
</div>
These few lines of code work perfectly:
pq('div.node-links')->remove();
$text = pq('div#content');
print $text; //output: The text that i want to display
But when I tried
$text = pq('div#content')->removeClass('div.node-links'); //or
$text = pq('div#content')->remove('div.node-links');
//output: The text that i want to display (+) Stuff i want to remove
Can someone tell me why the second block of code is not working?
Thanks!
The first line of code would only make sense if you're trying to remove the class from div.node-links; it won't remove the node.
If you are trying to remove the class, you need to change it from:
$text = pq('div#content')->removeClass('div.node-links');
// to
$text = pq('div#content')->find('.node-links')->removeClass('node-links')->end();
which will output:
<div id="content">
<p>The text that i want to display</p>
<div>Stuff i want to remove</div>
</div>
As for the second line of code, I'm not exactly sure why it is not working. It seems like you're not selecting .node-links, but I was able to get the desired result using any of these:
// $markup = file_get_contents('test.html');
// $doc = phpQuery::newDocumentHTML($markup);
$text = $doc->find('div#content')->children()->remove('.node-links')->end();
// or
$text = pq('div#content')->find('.node-links')->remove()->end();
// or
$text = pq('div#content > *')->remove('.node-links')->parent();
Hope that helps
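In jQuery, which phpQuery mirrors, a selector passed to remove() filters the already-matched elements rather than searching their descendants, so div#content itself is tested against div.node-links and nothing is removed. Select the node directly instead: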
$text = pq('div#content div.node-links')->remove();
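For comparison, the same "select the node itself, then remove it" idea can be sketched with PHP's built-in DOM extension. This is not phpQuery; it is a swapped-in technique, and the $markup string is an assumption based on the question's HTML.

```php
<?php
// Assumption: the question's markup, with the closing tag written as </div>
$markup = '<div id="content"><p>The text that i want to display</p>'
        . '<div class="node-links">Stuff i want to remove</div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match descendants of #content whose class list contains "node-links"
$query = '//div[@id="content"]'
       . '//div[contains(concat(" ", normalize-space(@class), " "), " node-links ")]';

// Copy the matches into an array first, then remove each node from its parent
foreach (iterator_to_array($xpath->query($query)) as $node) {
    $node->parentNode->removeChild($node);
}

$content = $xpath->query('//div[@id="content"]')->item(0);
$text = $doc->saveHTML($content);
echo $text; // <div id="content"><p>The text that i want to display</p></div>
```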
I would like to get the text between HTML tags and replace it dynamically. Since the HTML might contain anything (nested tags, comments, etc.), I think the DOMDocument class is the way to go. However, I wasn't able to find an example that fits my needs: I can only get the text of one specifically selected HTML tag, and I couldn't find an example of replacing the selected text.
<?php
// HTML OUTPUT
$html= "<p>Subject,</p>
<h1>H1 title</h1>
<h2>H2 title</h2>
<h3>H2 title</h3>";
// DESIRED OUTPUT
$newHTML = "<p>My Fav. Colors;</p>
<h1>Blue</h1>
<h2>Orange</h2>
<h3>Yellow</h3>";
?>
Basically, I would like to read the text from the HTML output dynamically (it might contain nested HTML tags, comments, JavaScript, and so on) and replace it with values selected from a database to create the new HTML output.
What is the best and most elegant way to do this? Is the DOMDocument class the tool I need, or is regex the way to go?
I will be really glad if you could show me with a small piece of code to understand it clearly.
P.S. HTML document in question might be a page on another domain. Such as http://anotherdomain.com/page.html.
Here is an example of DOM.
$html= "<p>Subject,</p>
<h1>H1 title</h1>
<h2>H2 title</h2>
<h3>H2 title</h3>";
$doc = new DOMDocument;
$doc->loadHTML( '<div>' . $html . '</div>');
foreach($doc->getElementsByTagName('div')->item(0)->childNodes as $node) {
switch ($node->nodeName) {
case "p":
$node->nodeValue = "My Fav. Colors";
break;
case "h1":
$node->nodeValue = "Blue";
break;
case "h2":
$node->nodeValue = "Orange";
break;
case "h3":
$node->nodeValue = "Yellow";
break;
}
}
echo $doc->saveXML($doc);
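Since the question mentions nested tags and values coming from a database, here is a hedged variation of the same DOM idea that walks text nodes instead of direct children. The $replacements array is a hypothetical stand-in for the database lookup, and keying it by tag name is only an assumption made for this sample.

```php
<?php
$html = "<p>Subject,</p>
<h1>H1 title</h1>
<h2>H2 title</h2>
<h3>H2 title</h3>";

// Hypothetical lookup table standing in for values fetched from a database
$replacements = [
    'p'  => 'My Fav. Colors;',
    'h1' => 'Blue',
    'h2' => 'Orange',
    'h3' => 'Yellow',
];

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Wrap the fragment so it has a single known root to read back from
$doc->loadHTML('<div id="wrapper">' . $html . '</div>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Visit every non-blank text node, however deeply it is nested
foreach ($xpath->query('//div[@id="wrapper"]//text()[normalize-space()]') as $textNode) {
    $tag = $textNode->parentNode->nodeName;
    if (isset($replacements[$tag])) {
        $textNode->nodeValue = $replacements[$tag];
    }
}

// Serialize only the wrapper's children, dropping the wrapper itself
$newHTML = '';
foreach ($xpath->query('//div[@id="wrapper"]')->item(0)->childNodes as $child) {
    $newHTML .= $doc->saveHTML($child);
}
echo $newHTML;
```

Because only text nodes are modified, comments and element structure are left untouched.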
UPDATE:
Yes I am Using PHP in my pages.
Hello friends, I was wondering: is there a way to add a <span> tag to the title without using JavaScript?
Maybe using regex, PHP, or some other method; I don't really know.
Let me explain.
My HTML is like this:
<h3 class="title">The Title Goes Here</h3>
What I want is to automatically add a span tag, so that the final HTML looks like this:
<h3 class="title"><span>The </span>Title Goes Here</h3>
I want to wrap only the first word of the title in a <span> tag.
I know this can easily be done using JavaScript, but I am looking for a non-JavaScript solution.
Please Help!
You can do this with DOMDocument in PHP if you don't want to do it with the javascript DOM:
$html = '<h3 class="title">The Title Goes Here</h3>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
foreach($xp->query('//h3[@class="title"]') as $parent) {
$title = $parent->nodeValue;
list($first, $rest) = explode(' ', $title, 2);
$span = new DOMElement('span', $first. ' ');
$parent->nodeValue = $rest;
$parent->insertBefore($span, $parent->firstChild);
}
foreach($doc->getElementsByTagName('body')->item(0)->childNodes as $node)
{
echo $doc->saveHTML($node);
}
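Since the question also mentions regex, here is a minimal preg_replace() sketch for the same task. It assumes the opening tag is written exactly as <h3 class="title"> and that the title starts with plain text, so it is more fragile than a DOM-based approach.

```php
<?php
$html = '<h3 class="title">The Title Goes Here</h3>';

// Capture the opening tag, then the first word plus its trailing whitespace,
// and wrap the second capture in a <span>
$result = preg_replace(
    '/(<h3 class="title">)(\S+\s+)/',
    '$1<span>$2</span>',
    $html
);

echo $result; // <h3 class="title"><span>The </span>Title Goes Here</h3>
```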
My answer is that this cannot be done. You can't manipulate a page in the browser without JavaScript. This can only be achieved by editing the page on the server manually, or by generating it dynamically using PHP logic or an equivalent solution, of which there are many.
If you are doing this for a corporate solution that is only used on a single corporate standard browser, you could look into building a plugin for the browser.
I'm hoping someone can help me. I'm using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm) successfully, but I am now trying to find elements based on a certain name. For example, the fetched HTML might contain tags such as:
<p class="mattFacer">Matt Facer</p>
<p class="mattJones">Matt Jones</p>
<p class="daveSmith">DaveS Smith</p>
What I need to do is to read in this HTML and capture any HTML elements which match anything beginning with the word, "matt"
I've tried
$html = str_get_html("http://www.testsite.com");
foreach($html->find('matt*') as $element) {
echo $element;
}
but this doesn't work. It returns nothing.
Is it possible to do this? I basically want to search for any HTML element which contains the word "matt". It could be a span, div or p.
I'm at a dead end here!
$html = file_get_html("http://www.testsite.com"); // file_get_html() loads a URL; str_get_html() expects an HTML string
foreach($html->find('[class*=matt]') as $element) {
echo $element;
}
Let's try that
Maybe this?
foreach(array_merge($html->find('[class*=matt]'),$html->find('[id*=matt]')) as $element) {
echo $element;
}
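The [class*=matt] selector matches "matt" anywhere in the attribute. If only values beginning with "matt" should match, one alternative is XPath's starts-with() through PHP's built-in DOM extension. This is a sketch outside Simple HTML DOM, and the inline $html string is an assumption standing in for the fetched page.

```php
<?php
// Stand-in for the fetched HTML (assumption: the three tags from the question)
$html = '<p class="mattFacer">Matt Facer</p>
<p class="mattJones">Matt Jones</p>
<p class="daveSmith">DaveS Smith</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$matches = [];

// Any element (p, div, span, ...) whose class or id starts with "matt"
$query = '//*[starts-with(@class, "matt") or starts-with(@id, "matt")]';
foreach ($xpath->query($query) as $element) {
    $matches[] = $doc->saveHTML($element);
}

print_r($matches);
```

With the sample markup this collects the two "matt..." paragraphs and skips the "daveSmith" one. Note that starts-with() is case-sensitive and only tests the start of the whole attribute value, not of each class name in a multi-class attribute.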