Simple HTML DOM Parser - removing element not working - php

I have the following HTML code:
<div id="hero_techSpec">
<div class="hero_techSpecItem">
blubb
</div>
</div>
Now, I'm trying to remove the inner div-element with "simple HTML DOM parser".
$document = HtmlDomParser::str_get_html("...some HTML code as string...");
$techSpec = $document->find("#hero_techSpec", 0);
echo $techSpec;
$techSpec->find(".hero_techSpecItem", 0)->outertext = '';
echo $techSpec;
$document->load($document->save());
echo $document->find("#hero_techSpec", 0); die;
In all three "echo"s, the inner div is still present. I tried to follow the related solution: Simple HTML Dom: How to remove elements?
However, it seems it is not working in my case. Do you have any ideas / hints how to solve that issue? Thank you!

Try something like this:
$document->load($htmlString);
$techSpec = $document->find(".//div[@class='hero_techSpecItem']")[0];
$techSpec->outertext = "";
$document->load($document->save());
echo $document;
Output should be:
<div id="hero_techSpec"> </div>
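If Simple HTML DOM still shows the inner div, one alternative is to do the removal with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a different technique from the answer above, sketched on the assumption that the markup looks like the snippet in the question:

```php
<?php
// Sketch using PHP's bundled DOM extension instead of Simple HTML DOM.
// The markup below is copied from the question; real input may differ.
$html = '<div id="hero_techSpec"><div class="hero_techSpecItem">blubb</div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Copy the matches into an array first, then remove them from the tree.
$items = iterator_to_array(
    $xpath->query('//div[@id="hero_techSpec"]/div[@class="hero_techSpecItem"]')
);
foreach ($items as $item) {
    $item->parentNode->removeChild($item);
}

// Serialize only the outer div, not the html/body wrapper that loadHTML() adds.
$techSpec = $xpath->query('//div[@id="hero_techSpec"]')->item(0);
$result = $doc->saveHTML($techSpec);
echo $result;
```

Here removeChild() changes the tree itself, so there is no cached outer text to refresh and no reload step.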

Related

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
But there is a problem when the divs are nested: the parent's plaintext already includes the text of all its children, so appending the children afterwards duplicates that text. This could be avoided by skipping any div whose innertext still contains a </div>, since such a div wraps another div.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
With index.html containing:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
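The nested-div duplication mentioned in the question can also be avoided by selecting only divs that contain no other div. The following is a rough sketch with PHP's bundled DOM extension rather than Simple HTML DOM; the XPath filter not(.//div) is my own choice for picking the innermost divs, not something taken from the answers above:

```php
<?php
// Sample markup from the question.
$html = '<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$parts = [];

// Only divs without a nested div, so a parent's text is never added twice.
foreach ($xpath->query('//div[not(.//div)]') as $div) {
    $text = trim($div->textContent);
    if ($text !== '') {
        $parts[] = $text;
    }
}

$body = implode('. ', $parts);
echo $body;
```

One limitation of this filter: text that sits directly inside a div which also wraps other divs is skipped, so it suits pages where the text lives in the innermost divs.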

How To get html code inside div with an ID with SimpleHTMLDom PHP?

This is a simple HTML demo where I want to fetch the HTML content inside the div with id first.
<div id="first">
<div id="second">second</div>
<div class="third">Third</div>
<h1>Fourth</h1>
</div>
So the output should look like this:
<div id="second">second</div>
<div class="third">Third</div>
<h1>Fourth</h1>
I tried several things, but none of them worked, for example:
$dom->find('div[id="first"]')->innertext;
$var = (string)$dom->find('div[id="first"]');
So, how do I extract the HTML code within the div?
The find() method returns an array of elements when no index is given. You probably want the innertext of the first element, like this:
$result = $dom->find('div[id="first"]');
print ($result[0]->innertext);
Try this one. It uses simplexml_load_string and queries with XPath. Note that simplexml_load_string expects well-formed XML, so this only works when the HTML is also valid XML.
<?php
$dom=simplexml_load_string($your_html_source_string);
$results=$dom->xpath('//div[@id="first"]/*');
$innerHtml="";
foreach($results as $result)
{
$innerHtml.=$result->saveXML();
}
echo $innerHtml;
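For HTML that is not well-formed XML, a similar approach can be sketched with DOMDocument, which is more forgiving than SimpleXML. This is only an illustration under the assumption that the input looks like the sample in the question; the variable names are made up for the example:

```php
<?php
// Sample markup from the question.
$html = '<div id="first">
<div id="second">second</div>
<div class="third">Third</div>
<h1>Fourth</h1>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$first = $xpath->query('//div[@id="first"]')->item(0);

// Build the inner HTML by serializing every child node of the div.
$innerHtml = '';
if ($first !== null) {
    foreach ($first->childNodes as $child) {
        $innerHtml .= $doc->saveHTML($child);
    }
}
$innerHtml = trim($innerHtml);
echo $innerHtml;
```

Unlike the /* query above, iterating childNodes also keeps text that sits directly inside the div, not just its child elements.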

SimpleHtmlDom extract content inside a div not from its child

I want to extract the content of a div, but without the content of its children. I'm using the Simple HTML DOM parser and the following code:
//html code
<div id="frame">
Needed this content
Not needed
</div>
//php code
$elem = file_get_html($url);
$content = $elem->find('div#frame')->plaintext;
echo $content;
but this code results,
Needed this contentNot needed
I want the result as,
Needed this content
How should I change the code to get that output? Thanks in advance.
The only way I can think of is to delete all of the div's children and then print the remaining content. Here's how:
// includes Simple HTML DOM Parser
include "simple_html_dom.php";
$text = '<div id="frame">
<span><b>Not needed</b></span>
Needed this content
Not needed
</div>';
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($text);
$content = $html->find('div#frame',0);
// Before
echo $content->innertext;
// Delete all unwanted children
foreach( $content->children() as $i => $unwantedTags ) {
echo "<br/>$i => ".$unwantedTags->tag;
$unwantedTags->outertext = '';
}
// After
echo "<br/>".$content->innertext;
// Clear dom object
$html->clear();
unset($html);
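Another option, if switching libraries is acceptable, is to read only the text nodes that are direct children of the div. The sketch below uses PHP's bundled DOM extension; the markup is an assumption (the unwanted text wrapped in a child element), since the question's original HTML is not fully shown:

```php
<?php
// Assumed markup: the unwanted text lives inside a child element.
$html = '<div id="frame">Needed this content<p>Not needed</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// text() selects only text nodes that are direct children of the div,
// so text inside nested elements is left out.
$own = '';
foreach ($xpath->query('//div[@id="frame"]/text()') as $textNode) {
    $own .= $textNode->nodeValue;
}
$own = trim($own);
echo $own;
```

This leaves the document untouched, whereas the loop above empties the children before reading the rest.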

PHP get tag url

Hello, suppose I have this text:
$content = '<div id="hey">
<div id="bla"></div>
</div>
<div id="hey">
hey lol
</div>';
The content inside the divs with id="hey" can change.
Now I want to get the contents of those divs in an array:
$array[0] = < div id="bla"></div >;
$array[1] = < hey lol >;
How can I do that? I thought about using preg_match_all.
Sounds to me, if I understand this correctly, you're looking to parse HTML with PHP. Though regex can work, it's certainly not the best method.
With that said, have a look at the DOMDocument class. It lets you parse HTML and offers methods similar to JavaScript's for looking up elements by tag, id, and so on.
Per your example:
<?php
$html = '<div id="hey">hey lol</div>'; /* or file_get_contents('...'); */
$dom = new DOMDocument();
$dom->loadHTML($html);
// this will get <div id="hey"></div>
$hey_div = $dom->getElementById('hey');
echo $hey_div->textContent; // "hey lol"
$content=str_replace("hey","bla",$content);
OR
$divid="hey";
//$divid="bla";
$content = '<div id="' . $divid . '">
<div id="bla"></div>
</div>
<div id="hey">
hey lol
</div>';
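To get the array the question asks for, the DOMDocument idea can be extended with DOMXPath, because getElementById() returns a single element and the sample uses id="hey" twice. This is a sketch under that assumption, not a tested solution for arbitrary pages:

```php
<?php
// Sample content from the question; note the duplicated id="hey".
$content = '<div id="hey">
<div id="bla"></div>
</div>
<div id="hey">
hey lol
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // duplicate ids produce parser warnings
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$array = [];

// An XPath attribute test matches every div with id="hey", not just one.
foreach ($xpath->query('//div[@id="hey"]') as $div) {
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);   // serialize each child back to HTML
    }
    $array[] = trim($inner);
}

print_r($array);
```

Each array entry then holds the inner HTML of one matching div, in document order.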

php: how can I work with html as xml ? how do i find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each of them.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I could just iterate through all the divs, but it is only meant to illustrate what I need.
Here I need to query for divs that have the attribute class=transform.
Thanks!
Could use SimpleXML - see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...
You can use xpath to address the items. For that particular query, you'd use:
//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
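$document = new DOMDocument();
$document->loadHTML($html); // $html holds the page markup as a string
$xpath = new DOMXPath($document);
// "//" searches the whole document; the concat() padding matches "transform" as a whole class token
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);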
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
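To show why the concat() padding matters, here is a small sketch that compares class tokens instead of the whole attribute value. The helper classPredicate() is hypothetical (it is not part of PHP or of csslib), and the extra divs in the sample markup are added only to exercise the edge cases:

```php
<?php
// Hypothetical helper: builds an XPath predicate that matches one class token.
// Assumes $class contains no quote characters.
function classPredicate(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

$html = '<html><body>
<div class="transform"><span>1</span></div>
<div class="other"><span>2</span></div>
<div class="big transform"><span>3</span></div>
<div class="transformed"><span>4</span></div>
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//div[' . classPredicate('transform') . ']') as $div) {
    $texts[] = trim($div->textContent);
}

print_r($texts);
```

A plain @class = 'transform' test would miss the div with class="big transform", while an unpadded contains(@class, 'transform') would wrongly match class="transformed".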
OK, what I wanted can easily be achieved using PHP's XPath support:
example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
