How to find the last div that contains a given text? - php

I have example code like this:
<div>
<div>
<p>SOS</p>
<div>
<p>searching text</p>
</div>
</div>
</div>
Now I want to use PHP Simple HTML DOM Parser to search for a text like SOS and, if strpos finds it, echo that div. My desired result looks like this:
<div>
<p>SOS</p>
<div>
<p>searching text</p>
</div>
</div>
I wrote this code, but it doesn't work:
<?php
include('simple_html_dom.php');
$html = @file_get_html('example code');
$mytext = 'SOS';
foreach(@$html->find('div') as $div)
{
if(strpos(strtolower($div->innertext),strtolower($mytext)) !== false)
{
echo $div->outertext;
break;
}
}
?>
Thank you in advance.

Maybe not the answer you are looking for, but the selectors in Simple HTML DOM Parser are, well, simple, and this looks more like a job for XPath.
So, if the use of that library is not a requirement, you could as well use libxml, e.g. something along the lines of
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile( 'sample.html' );
$xp = new DOMXPath($dom);
$mytext='SOS';
foreach( $xp->query("//div[./*[text()[contains(.,'$mytext')]]]") as $match ) {
print $dom->saveHTML( $match );
}
?>
This is not tested so the code might need a bit of tweaking.
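For reference, here is a self-contained sketch of the XPath approach above, run against the markup from the question. The helper name findDivsWithChildText is made up for illustration, and the sketch assumes the search text contains no single quote, since it is interpolated directly into the XPath expression.

```php
<?php
// Hypothetical helper: returns the outer HTML of every <div> that has a
// direct child element whose own text node contains $needle.
// Assumption: $needle contains no single quote (it is embedded in the XPath).
function findDivsWithChildText(string $html, string $needle): array
{
    $dom = new DOMDocument();
    // Suppress the warnings libxml may raise for imperfect markup.
    @$dom->loadHTML($html);
    $xp = new DOMXPath($dom);

    $found = [];
    foreach ($xp->query("//div[./*[text()[contains(., '$needle')]]]") as $div) {
        $found[] = $dom->saveHTML($div);
    }
    return $found;
}

// Markup from the question, written without whitespace between tags.
$html = '<div><div><p>SOS</p><div><p>searching text</p></div></div></div>';
$matches = findDivsWithChildText($html, 'SOS');
echo $matches[0], PHP_EOL;
```

Only the middle div matches here: the outer div has no child element whose own text contains SOS, and the innermost div does not contain SOS at all.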

Related

Fetch nested tags in php using simplehtmldom

Let's say I have this code. I want to fetch the text of all p tags inside nested div tags. There can be up to 15 levels of nested divs, so I want to write a script that digs through all the divs and returns the p tag data from them.
<div>
<div>
<div>
<p>Hi</p>
</div>
<p>Hello</p>
</div>
<p>Hey</p>
</div>
Required output (any order):
Hi
Hello
Hey
I have attempted the following:
function divDigger($div)
{
$internalP = $div->getElementsByTagName('p');
echo $internalP->innertext;
$internalDiv = $div->getElementsByTagName('div');
if (count($internalDiv) > 0) {
foreach ($internalDiv as $div) {
divDigger($div);
}
}
}
You may use the XPath API for this:
$doc = new \DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query('//div//p') as $pWithinDiv) {
echo $pWithinDiv->textContent, PHP_EOL;
}
This will find any <p> element under a <div> (not necessarily directly under it, otherwise you can change the expression to //div/p), and display its text content.
Demo: https://3v4l.org/43QqX

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
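The nested-div duplication mentioned in the question can also be avoided by collecting only divs that contain no other div. Below is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM (a swapped-in technique, not what the answers above use). The function name leafDivText is hypothetical, and the sample markup nests the last div to show the effect.

```php
<?php
// Hypothetical helper: joins the text of every "leaf" <div> (a div with no
// nested div) using $separator, so a parent's text is never counted twice.
function leafDivText(string $html, string $separator = '. '): string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings for imperfect markup
    $xp = new DOMXPath($dom);

    $parts = [];
    foreach ($xp->query('//div[not(.//div)]') as $div) {
        $text = trim($div->textContent);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($separator, $parts);
}

$html = '<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div><div class="P">this is another paragraph</div></div>
</body>';

echo leafDivText($html), PHP_EOL;
// prints "Header. this is a paragraph. this is another paragraph"
```

One trade-off of this sketch: text that sits directly in a parent div, next to its nested divs, is skipped, so it only suits pages where the text lives in the innermost divs.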

How to scrape html contents of one div by id using php

The page on another of my domains which I'd like to scrape one div from contains:
<div id="thisone">
<p>Stuff</p>
</div>
<div id="notthisone">
<p>More stuff</p>
</div>
Using this php...
<?php
$page = file_get_contents('http://thisite.org/source.html');
$doc = new DOMDocument();
$doc->loadHTML($page);
foreach ($doc->getElementsByTagName('div') as $node) {
echo $doc->saveHtml($node), PHP_EOL;
}
?>
...gives me all divs on http://thisite.org/source.html, with html. However, I only want to pull through the div with an id of "thisone" but using:
foreach ($doc->getElementById('thisone') as $node) {
doesn't bring up anything.
$doc->getElementById('thisone'); // returns a single element with the id thisone
Try $node = $doc->getElementById('thisone'); and then output it with $doc->saveHTML($node).
On a side note, you can use phpQuery for a jQuery-like syntax: pq("#thisone")
$doc->getElementById('thisone') returns a single DOMElement, not an array, so you can't iterate through it.
Just do:
$node = $doc->getElementById('thisone');
echo $doc->saveHtml($node), PHP_EOL;
Look at the PHP manual: http://php.net/manual/en/domdocument.getelementbyid.php
getElementById returns an element or NULL, not an array, and therefore you can't iterate over it.
Instead do this:
<?php
$page = file_get_contents('example.html');
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById('thisone');
echo $doc->saveHtml($node), PHP_EOL;
?>
Running php edit.php gives something like this:
<div id="thisone">
<p>Stuff</p>
</div>
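Since getElementById returns NULL when nothing matches, a small guard keeps a missing id from being passed on to saveHTML. This is a minimal sketch, not part of the answers above: the function name divHtmlById is made up here, and the inline markup stands in for the fetched page.

```php
<?php
// Hypothetical helper: returns the outer HTML of the element with the given
// id, or null when the document has no such element.
function divHtmlById(string $page, string $id): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($page); // suppress warnings for imperfect markup
    $node = $doc->getElementById($id);
    return $node === null ? null : $doc->saveHTML($node);
}

// Stand-in for the contents of the remote page.
$page = '<div id="thisone"><p>Stuff</p></div><div id="notthisone"><p>More stuff</p></div>';

echo divHtmlById($page, 'thisone'), PHP_EOL;
var_dump(divHtmlById($page, 'missing')); // NULL
```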

PHP DOMDocument: insertBefore, how to make it work?

I would like to place a new node element, before a given element. I'm using insertBefore for that, without success!
Here's the code,
<DIV id="maindiv">
<!-- I would like to place the new element here -->
<DIV id="child1">
<IMG />
<SPAN />
</DIV>
<DIV id="child2">
<IMG />
<SPAN />
</DIV>
</DIV>
//$div is a new div node element,
//The code I'm trying, is the following:
$maindiv->item(0)->parentNode->insertBefore( $div, $maindiv->item(0) );
//Obs: This code actually places the new node before maindiv
//$maindiv object(DOMNodeList)[5], from getElementsByTagName( 'div' )
//echo $maindiv->item(0)->nodeName gives 'div'
//echo $maindiv->item(0)->nodeValue gives the correct data on that div 'some random text'
//this code actually places the new $div element before <DIV id="maindiv">
http://pastie.org/1070788
Any kind of help is appreciated, thanks!
If maindiv is from getElementsByTagName(), then $maindiv->item(0) is the div with id=maindiv. So your code is working correctly because you're asking it to place the new div before maindiv.
To make it work like you want, you need to get the children of maindiv:
$dom = new DOMDocument();
$dom->load($yoursrc);
$maindiv = $dom->getElementById('maindiv');
$items = $maindiv->getElementsByTagName('DIV');
$items->item(0)->parentNode->insertBefore($div, $items->item(0));
Note that if you don't have a DTD, PHP doesn't return anything with getElementById. For getElementById to work, you need to have a DTD or specify which attributes are IDs:
foreach ($dom->getElementsByTagName('DIV') as $node) {
$node->setIdAttribute('id', true);
}
From scratch, this seems to work too:
$str = '<DIV id="maindiv">Here is text<DIV id="child1"><IMG /><SPAN /></DIV><DIV id="child2"><IMG /><SPAN /></DIV></DIV>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName("div");
$divs->item(0)->appendChild($doc->createElement("div", "here is some content"));
print_r($divs->item(0)->nodeValue);
Found a solution:
$child = $maindiv->item(0);
$child->insertBefore( $div, $child->firstChild );
I don't know how much sense this makes, but well, it worked.
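The last snippet works because insertBefore is called on the parent that should receive the new node, and its second argument has to be one of that parent's children. Here is a minimal sketch of that, assuming the new div should become the first child of maindiv; the id newdiv and the text content are invented for the example.

```php
<?php
$html = '<div id="maindiv"><div id="child1"></div><div id="child2"></div></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings for imperfect markup

// The parent that should receive the new node.
$maindiv = $doc->getElementById('maindiv');

// The new node (id and text are example values).
$div = $doc->createElement('div', 'new content');
$div->setAttribute('id', 'newdiv');

// Insert as first child: the reference node is a child of $maindiv.
$maindiv->insertBefore($div, $maindiv->firstChild);

// Collect the ids of maindiv's element children, in document order.
$ids = [];
foreach ($maindiv->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $ids[] = $child->getAttribute('id');
    }
}
echo implode(', ', $ids), PHP_EOL; // prints "newdiv, child1, child2"
```

By contrast, calling insertBefore on maindiv's own parentNode with maindiv as the reference node puts the new div next to maindiv instead of inside it, which is what the code in the question did.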

php: how can I work with HTML as XML? How do I find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each of them.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I could just iterate through all the divs, but this is only an example to illustrate what I need.
In this example I need to query for divs that have the attribute class=transform.
Thanks!
Could use SimpleXML - see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...
You can use xpath to address the items. For that particular query, you'd use:
//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DomDocument();
$document->loadHtml($html);
$xpath = new DomXPath($document);
// The leading // matches divs at any depth; a bare "div" would only match a div at the document root.
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
OK, what I wanted can easily be achieved using PHP XPath.
Example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
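Below is a self-contained sketch of the class query discussed in this thread, using DOMXPath with a leading // so that divs are matched at any depth. The sample markup is adapted from the question: one div gets a second class (wide, an invented value) to show why the concat/contains form is used instead of a plain @class comparison.

```php
<?php
$html = '<html><body>
<div class="transform"><span>1</span></div>
<div class="other"><span>2</span></div>
<div class="transform wide"><span>3</span></div>
</body></html>';

$document = new DOMDocument();
@$document->loadHTML($html); // suppress warnings for imperfect markup
$xpath = new DOMXPath($document);

// Pad the class list with spaces so "transform" only matches as a whole word.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " transform ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    $texts[] = trim($div->textContent);
}
echo implode(', ', $texts), PHP_EOL; // prints "1, 3"
```

A plain //div[@class = 'transform'] would miss the third div, because its class attribute is the full string "transform wide".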
