Find preceding element using PHP Simple HTML Parser - php

I have some HTML that is set up like the following (this can be different though!):
<table></table>
<h4>Content</h4>
<table></table>
I'm using PHP Simple HTML DOM Parser to loop over a section of code set up like this:
How can I say something like: "Find the table and the preceding h4, grab the text from the h4 if it exists, and leave it blank if it doesn't"?
If I just use $html->find('div[class=product-table] h4'); then it ignores the fact there was no title for the first table.
This is my full code for context:
$table_rows = $html->find('div[class=product-table] table');
$tablecounter = 1;
foreach ($table_rows as $table){
$tablevalue[] =
array(
"field_5b3f40cae191b" => "Table",
);
}
update_field( $field_key, $tablevalue, $post_id );
Update:
I've found in the documentation that you can use prev_sibling() so I've tried $table_title = $html->find('div[class=product-table] table')->prev_sibling('h4'); but can't seem to get it to work.

I've simplified the example to hopefully show the situation you're after. It assumes that the <h4> tag is immediately before the <table> tag. Because find() without an index returns an array of elements, prev_sibling() has to be called on each table you find rather than on the result of find() itself.
require_once 'simple_html_dom.php';
$source = "<html>
<body>
<div class='product-table'>
<table>t1</table>
<h4>Content</h4>
<table>t2</table>
</div>
</body>
</html>";
$html = str_get_html($source);
$table_rows = $html->find('div[class=product-table] table');
foreach ($table_rows as $table){
$prev = $table->prev_sibling();
if ( !empty($prev) && $prev->tag == "h4") {
echo "h4=".(string)$prev->innertext().PHP_EOL;
}
echo "content=".(string)$table.PHP_EOL;
}
This outputs:
content=<table>t1</table>
h4=Content
content=<table>t2</table>
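If every table should get a title entry, blank when no h4 directly precedes it, the same idea can be sketched with PHP's built-in DOM extension rather than Simple HTML DOM. This is only a sketch under assumptions: the helper name table_titles() and the sample markup are made up for illustration, and the tables are given rows so the markup parses cleanly.

```php
<?php
// Hypothetical helper: one title per table, '' when no <h4> directly precedes it.
function table_titles(string $source): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($source);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $titles = [];
    foreach ($xp->query("//div[@class='product-table']//table") as $table) {
        // Nearest preceding sibling element, kept only if it is an <h4>.
        $h4 = $xp->query('preceding-sibling::*[1][self::h4]', $table)->item(0);
        $titles[] = $h4 ? trim($h4->textContent) : '';
    }
    return $titles;
}

// Sample markup modelled on the question (assumed, not the real page).
$source = "<div class='product-table'>"
    . "<table><tr><td>t1</td></tr></table>"
    . "<h4>Content</h4>"
    . "<table><tr><td>t2</td></tr></table>"
    . "</div>";

print_r(table_titles($source));
```

For the sample above, the first table gets an empty title and the second gets "Content", so the array can be walked in step with the tables when building $tablevalue.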

Related

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
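The duplication problem with nested divs can be avoided by joining text nodes instead of div contents, since each text node is visited exactly once. Below is a minimal sketch using PHP's built-in DOM extension instead of Simple HTML DOM; the helper name delimited_text() and the sample markup (including the nested div) are assumptions for illustration.

```php
<?php
// Hypothetical helper: joins every non-blank text node under <body> with a separator.
function delimited_text(string $html, string $sep = '. '): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $parts = [];
    // Each text node appears once, so text inside nested divs is not repeated.
    foreach ($xp->query('//body//text()[normalize-space()]') as $node) {
        $parts[] = trim($node->nodeValue);
    }
    return implode($sep, $parts);
}

// Sample markup based on the question, with one nested div added (assumed).
$sample = '<body>'
    . '<div class="H2">Header</div>'
    . '<div class="P">this is a paragraph</div>'
    . '<div><div class="P">this is another paragraph</div></div>'
    . '</body>';

echo delimited_text($sample), PHP_EOL;
```

One caveat of this design: inline tags such as <b> also split a sentence into several text nodes, so real pages may need the separator applied only at block-level boundaries.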

How to find a h3 tag with a certain value

Well, I have a HTML File with the following structure:
<h3>Heading 1</h3>
<table>
<!-- contains a <thead> and <tbody> which also contain several columns/lines-->
</table>
<h3>Heading 2</h3>
<table>
<!-- contains a <thead> and <tbody> which also contain several columns/lines-->
</table>
I want to get JUST the first table with all its content. So I'll load the HTML File
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('http://www.example.com'));
libxml_clear_errors();
?>
All tables have the same classes and no specific IDs. That's why the only way I could think of was to grab the h3 tag with the value "Heading 1". I already found this one, which works well for me. (Since other tables and captions could be added later, that solution is not ideal.)
How could I grab the h3 tag with the value "Heading 1"? And how could I select the table that follows it?
EDIT#1: I don't have access to the HTML File, so I can't edit it.
EDIT#2: My Solution (thanks to Martin Henriksen) for now is:
<?php
$doc = new DOMDocument('1.0');
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents('http://example.com'));
libxml_clear_errors();
foreach($doc->getElementsByTagName('h3') as $element){
// Only handle the heading we are looking for; without this guard $table may be undefined.
if($element->nodeValue != 'exampleString') continue;
// Assumes a whitespace text node sits between the h3 and the table, hence two steps.
$table = $element->nextSibling->nextSibling;
$innerHTML= '';
$children = $table->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
echo $innerHTML;
file_put_contents("test.xml", $innerHTML);
}
?>
You can find any tag in HTML using the simple_html_dom.php class. You can download this file from https://sourceforge.net/projects/simplehtmldom/?source=typ_redirect
Then:
<?php
include_once('simple_html_dom.php');
$htm = "**YOUR HTML CODE**";
$html = str_get_html($htm);
$h3_tag = $html->find("h3",0)->innertext; // find() takes a selector such as "h3", not "<h3>"
echo "HTML code in h3 tag";
print_r($h3_tag);
?>
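You can fetch all the DOMElements with the tag h3 and check the value each one holds by accessing nodeValue. Once you have found the h3 tag, you can select the next node in the DOM tree with nextSibling. Note that if there is whitespace between the tags, nextSibling may be a text node rather than the table, so you may need to step once more.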
foreach($dom->getElementsByTagName('h3') as $element)
{
if($element->nodeValue == 'Heading 1')
$table = $element->nextSibling;
}
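An XPath query can do the heading match and the step to the following table in one expression, which also sidesteps whitespace text nodes between the tags. This is a hedged sketch using DOMXPath from the built-in DOM extension; the helper name table_after_heading() and the sample markup are assumptions, and the heading text is assumed to contain no single quote.

```php
<?php
// Hypothetical helper: HTML of the first table after the matching <h3>, or null.
function table_after_heading(string $html, string $heading)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // following-sibling::table[1] ignores text nodes between the h3 and the table.
    // Assumes $heading contains no single quote.
    $query = "//h3[normalize-space()='$heading']/following-sibling::table[1]";
    $table = $xp->query($query)->item(0);
    return $table ? $dom->saveHTML($table) : null;
}

// Sample markup modelled on the question (assumed).
$html = "<h3>Heading 1</h3>\n<table><tr><td>first</td></tr></table>\n"
      . "<h3>Heading 2</h3>\n<table><tr><td>second</td></tr></table>";

echo table_after_heading($html, 'Heading 1'), PHP_EOL;
```

Because the table is selected relative to the heading, adding more tables or headings elsewhere in the page should not change which table is returned.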

Simple HTML Dom Crawler returns more than contained in attributes

I would like to extract the contents contained within certain parts of a website using selectors. I am using Simple HTML DOM to do this. However, for some reason more data is returned than is present in the selectors that I specify. I have checked the FAQ of Simple HTML DOM, but did not see anything that could help me out. I wasn't able to find anything on Stack Overflow either.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->find('h2[class=hed]',0)->outertext = "";
echo strip_tags($post, '<p><a>');
}
?>
</div>
Output can be seen here. Instead of only a couple of article links, I get information of the author, information on the article, among others.
You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
Note that the statement before the echo does not modify $post:
$post->find('h2[class=hed]',0)->outertext = "";
Change code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
if ($postNum >= 10) break; // limit reached
$heds = $post->find('h2[class=hed]');
foreach($heds as $hed) {
echo strip_tags($hed, '<p><a>');
}
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";
You really only need one loop. Consider this:
foreach($html->find('ul.river > h2.hed') as $postNum => $h2) {
if ($postNum >= 10) break;
echo strip_tags($h2, '<p><a>') . "\n"; // the text
echo $h2->parent->href . "\n"; // the href
}
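For comparison, the same extraction can be sketched with the built-in DOM extension and XPath. The markup of the real page is not shown here, so the sample below is hypothetical; the helper names has_class() and river_headlines() are also made up for illustration. The link is looked up both inside the h2 and among its ancestors, since either layout is plausible.

```php
<?php
// XPath 1.0 idiom for "has this class among possibly several" (hypothetical helper).
function has_class(string $c): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $c ')";
}

// Hypothetical helper: [text, href] pairs for h2.hed headings inside ul.river.
function river_headlines(string $html, int $limit = 10): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $query = '//ul[' . has_class('river') . ']//h2[' . has_class('hed') . ']';
    $out = [];
    foreach ($xp->query($query) as $h2) {
        if (count($out) >= $limit) break;   // limit reached
        // The link may wrap the h2 or sit inside it; take whichever exists.
        $a = $xp->query('(.//a | ancestor::a)[1]', $h2)->item(0);
        $out[] = [trim($h2->textContent), $a ? $a->getAttribute('href') : ''];
    }
    return $out;
}

// Hypothetical markup; the real page structure may differ.
$html = '<ul class="river">'
    . '<li><h2 class="hed"><a href="/a">Title A</a></h2><p class="dek has-dek">Summary A</p></li>'
    . '<li><h2 class="hed"><a href="/b">Title B</a></h2><p class="dek has-dek">Summary B</p></li>'
    . '</ul>';

print_r(river_headlines($html));
```

Only the h2 elements are read, so the summaries in the p tags never reach the output.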

How to find last div if having a text?

I have code like this example:
<div>
<div>
<p>SOS</p>
<div>
<p>searching text</p>
</div>
</div>
</div>
Now I want to use PHP Simple HTML DOM Parser to search for a text like SOS and, if strpos finds it, echo the div that contains it. My final result should look like this:
<div>
<p>SOS</p>
<div>
<p>searching text</p>
</div>
</div>
I wrote this code, but it doesn't work:
<?php
include('simple_html_dom.php');
$html = @file_get_html('example code');
$mytext = 'SOS';
foreach(@$html->find('div') as $div)
{
if(strpos(strtolower($div->innertext),strtolower($mytext)) !== false)
{
echo $div->outertext;
break;
}
}
?>
Thank you in advance.
Maybe not the answer you are looking for, but the selectors in Simple HTML DOM Parser are, well, simple, and this looks more like a job for XPath.
So, if the use of that library is not a requirement, you could as well use libxml, e.g. something along the lines of
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile( 'sample.html' );
$xp = new DOMXPath($dom);
$mytext='SOS';
foreach( $xp->query("//div[./*[text()[contains(.,'$mytext')]]]") as $match ) {
print $dom->saveHTML( $match );
}
?>
This is not tested so the code might need a bit of tweaking.
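To try the XPath approach against the markup from the question without a separate file, the document can be loaded from a string. The sketch below is illustrative only: the helper name divs_with_child_text() is an assumption, and the search text is assumed to contain no single quote, since it is interpolated into the expression.

```php
<?php
// Hypothetical helper: outer HTML of every div with a child element whose own text contains $needle.
function divs_with_child_text(string $html, string $needle): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // Assumes $needle contains no single quote.
    $query = "//div[./*[text()[contains(., '$needle')]]]";
    $found = [];
    foreach ($xp->query($query) as $match) {
        $found[] = $dom->saveHTML($match);
    }
    return $found;
}

// The markup from the question.
$html = "<div>\n<div>\n<p>SOS</p>\n<div>\n<p>searching text</p>\n</div>\n</div>\n</div>";

foreach (divs_with_child_text($html, 'SOS') as $div) {
    echo $div, PHP_EOL;
}
```

Only the div whose direct child holds the text matches, so the outermost wrapper is skipped even though its inner text also contains "SOS".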

Finding and Printing all Links within a DIV

I am trying to find all links in a div and then print those links.
I am using the Simple HTML Dom to parse the HTML file. Here is what I have so far, please read the inline comments and let me know where I am going wrong.
include('simple_html_dom.php');
$html = file_get_html('tester.html');
$articles = array();
//find the div with the id abcde
foreach($html->find('#abcde') as $article) {
//find all a tags that have a href in the div abcde
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
}
What currently happens is that the above takes a long time to load (I never got it to finish). I printed what it was doing in each loop, since it took too long to wait, and I found that it's going through things I don't need it to! This suggests my code is wrong.
The HTML is basically something like this:
<div id="abcde">
<!-- lots of html elements -->
<!-- lots of a tags -->
<a href="singer/tom" />
<img src="image..jpg" />
</a>
</div>
Thanks all for any help
The correct way to select a div (or whatever) by ID using that API is:
$html->find('div[id=abcde]');
Also, since IDs are supposed to be unique, the following should suffice:
//find all a tags that have a href in the div abcde
$article = $html->find('div[id=abcde]', 0);
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
Why don't you use the built-in DOM extension instead?
<?php
$cont = file_get_contents("http://stackoverflow.com/") or die("1");
$doc = new DOMDocument();
@$doc->loadHTML($cont) or die("2");
$nodes = $doc->getElementsByTagName("a");
for ($i = 0; $i < $nodes->length; $i++) {
$el = $nodes->item($i);
if ($el->hasAttribute("href"))
echo "- {$el->getAttribute("href")}\n";
}
gives
... (lots of links before) ...
- http://careers.stackoverflow.com
- http://serverfault.com
- http://superuser.com
- http://meta.stackoverflow.com
- http://www.howtogeek.com
- http://doctype.com
- http://creativecommons.org/licenses/by-sa/2.5/
- http://www.peakinternet.com/business/hosting/colocation-dedicated#
- http://creativecommons.org/licenses/by-sa/2.5/
- http://blog.stackoverflow.com/2009/06/attribution-required/
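The listing above prints every link on the page. To restrict it to one div and to hrefs containing a given word, as the question asks, an XPath query can be added on top of the DOM extension. This is a minimal sketch with assumed names: links_in_div() and the sample markup are hypothetical, and the id and search word are assumed to contain no single quote.

```php
<?php
// Hypothetical helper: hrefs of links inside div#$id whose href contains $needle.
function links_in_div(string $html, string $id, string $needle): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // Assumes $id and $needle contain no single quote.
    $query = "//div[@id='$id']//a[contains(@href, '$needle')]";
    $hrefs = [];
    foreach ($xp->query($query) as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Sample markup loosely based on the question (assumed).
$html = '<div id="abcde">'
    . '<a href="singer/tom"><img src="tom.jpg"></a>'
    . '<a href="band/x">x</a>'
    . '</div>'
    . '<div id="other"><a href="singer/ann">ann</a></div>';

print_r(links_in_div($html, 'abcde', 'singer'));
```

The filtering happens inside the query, so links outside the target div and links without the word in their href are never iterated in PHP.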
