I am using Simple HTML DOM to scrape a website. The problem I have run into is that the text I want sits outside any specific element. The only element that contains it is <div id="content">.
<div id="content">
<div class="image-wrap"></div>
<div class="gallery-container"></div>
<h3 class="name">Here is the Heading</h3>
All the text I want is located here !!!
<p> </p>
<div class="snapshot"></div>
</div>
I guess the webmaster has messed up and the text should actually be inside the <p> tags.
I've tried using this code below, however it just won't retrieve the text:
$t = $scrape->find("div#content text",0);
if ($t != null){
$text = trim($t->plaintext);
}
I'm still a newbie and still learning. Can anyone help?
You're almost there... Use a test loop to display the content of your nodes and locate the index of the wanted text. For example:
// Find all texts
$texts = $html->find('div#content text');
foreach ($texts as $key => $txt) {
// Display text and the parent's tag name
echo "<br/>TEXT $key is ", $txt->plaintext, " -- in TAG ", $txt->parent()->tag ;
}
You'll find that you should use index 4 instead of 0:
$scrape->find("div#content text",4);
And if your text doesn't always have the same index, but you know, for example, that it follows the h3 heading, then you could use something like:
foreach ($texts as $key => $txt) {
// Locate the h3 heading
if ($txt->parent()->tag == 'h3') {
// Grab the next index content from $texts
echo $texts[$key+1]->plaintext;
// Stop
break;
}
}
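If the index shifts from page to page, another option is to ask for the text node directly instead of counting. The following is only a sketch: it swaps Simple HTML DOM for PHP's built-in DOM extension (DOMDocument and DOMXPath), the $markup variable is made up for illustration, and it assumes the wanted text is the first non-blank text node after the h3.

```php
<?php
// Sample markup copied from the question; $markup is an illustrative name.
$markup = <<<'HTML'
<div id="content">
<div class="image-wrap"></div>
<div class="gallery-container"></div>
<h3 class="name">Here is the Heading</h3>
All the text I want is located here !!!
<p> </p>
<div class="snapshot"></div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First non-blank text node that follows the h3 inside div#content
$nodes = $xpath->query(
    '//div[@id="content"]/h3[@class="name"]/following-sibling::text()[normalize-space()][1]'
);

$node = $nodes->item(0);
$text = $node !== null ? trim($node->nodeValue) : null;

echo $text, PHP_EOL;
```

The normalize-space() predicate skips the whitespace-only text nodes between tags, so the result does not depend on how many of them the parser keeps.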
I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
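Collecting the plaintext of every div still repeats text when divs are nested, which is the problem raised in the question's edit. One way around it is to join text nodes rather than elements, since each text node appears exactly once in the tree. This is a rough sketch using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM; the $markup variable is an assumption for illustration.

```php
<?php
// Sample markup from the question; $markup is an illustrative name.
$markup = <<<'HTML'
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Every non-blank text node under <body>, in document order.
$pieces = [];
foreach ($xpath->query('//body//text()[normalize-space()]') as $textNode) {
    $pieces[] = trim($textNode->nodeValue);
}

$result = ['body' => implode('. ', $pieces)];
echo $result['body'], PHP_EOL;
// Prints: Header. this is a paragraph. this is another paragraph
```

Because the loop walks text nodes, a parent div and its children never contribute the same text twice. The trade-off is that inline elements such as <b> or <a> also split the text into separate pieces.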
I have many articles, divided into sections, stored in a database. Each section consists of a section tag, followed by a header (h2) and a primary div. Some also have subheaders (h3). The raw display looks something like this:
<section id="ecology">
<h2 class="Article">Ecology</h2>
<div class="Article">
<h3 class="Article">Animals</h3>
I'm using the following DOM script to add some classes, ID's and glyphicons:
$i = 1; // initialize counter
// initialize DOMDocument
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup; @ suppresses libxml warnings about HTML5 tags such as <section>
$sections = $dom->getElementsByTagName('section'); // get all section tags
if($sections->length > 0) { // if there are indeed section tags inside
// work on each section
foreach($sections as $section) { // for each section tag
$section->setAttribute('data-target', '#b' . $i); // point the section at the id of its div
// get the h2 headings inside each section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'Article') { // if this h2 has class Article
$h2->setAttribute('id', 'a' . $i); // set id for h2 tag
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'Article') { // if this div has class Article
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
}
}
}
$i++; // increment counter
}
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
I'd like to modify the above code so that it places certain examples of "inner text" between tags.
For example, consider these headings:
<h3 class="Article">Animals</h3>
<h3 class="Article">Plants</h3>
I would like the DOM to change them to this:
<h3 class="Article"><span class="label label-default">Animals</span></h3>
<h3 class="Article"><span class="label label-default">Plants</span></h3>
I want to do something similar with the h2 tags. I don't yet know the DOM terminology well enough to search for good tutorials - not to mention confusion with DOM programs and jQuery. ;)
I think these are the basic functions I need to focus on, but I don't know how to plug them in:
$text = $data->textContent;
elementNode.textContent=string
Two Notes: 1) I understand I can do this with jQuery (perhaps a lot easier), but I think PHP might be better, as they say some users can have JavaScript disabled. 2) I'm using the class "Article" largely to distinguish elements I want to be styled by PHP DOM. A header with a different class, or no class at all, should not be affected by the DOM script.
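One possible approach, sketched below, is to create the span with createElement(), move the heading's existing children into it, and then append the span back to the heading. The sample $Content string is shortened and partly made up (the second h3 with class "Other" is only there to show that unmatched headings are left alone), and the $out variable is an illustrative name.

```php
<?php
// Shortened, partly hypothetical sample of one stored article section.
$Content = '<section id="ecology"><h2 class="Article">Ecology</h2>'
    . '<div class="Article"><h3 class="Article">Animals</h3>'
    . '<h3 class="Other">Plants</h3></div></section>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // <section> can trigger libxml warnings
$dom->loadHTML($Content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Only h3 headings whose class attribute is exactly "Article"
foreach ($xpath->query('//h3[@class="Article"]') as $heading) {
    $span = $dom->createElement('span');
    $span->setAttribute('class', 'label label-default');

    // appendChild() moves a node, so this empties the heading into the span
    while ($heading->firstChild) {
        $span->appendChild($heading->firstChild);
    }
    $heading->appendChild($span);
}

// Back to a string: serialize everything inside <body>
$out = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, PHP_EOL;
```

Changing the query to '//h2[@class="Article"]' should handle the h2 headings the same way. Moving the child nodes, rather than copying textContent, keeps any inline markup inside the heading intact.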
I am working on a php script for a custom cms that will replace a custom tag with information from a database.
There would be a tag like below
<!-- NAV id="123" suffix="somethinghere" prefix="somethingelse" -->
I need to pull out the id, suffix, and prefix attributes. The code below works great if there is only one instance of this tag on the page, but if I have more than one, or if "-->" is anywhere else on the page it does not work properly. It matches everything between the first
"<!--"
and the last
"-->"
instead of returning each match separately.
Here is my current code. If it were working properly it would replace the entire tag with the value of "id", eventually that will be data from the database.
<?php
global $lastNav, $html;
//the html content
$html = '<html><body><hr><br>Hi this is my content<br> <!-- NAV id="123" suffix="<br />" prefix="•" --> <br>Some more content here <!-- NAV id="125" suffix="<br />" prefix="•" --> </body></html>';
$regexNavPattern = '<!-- NAV.*?(?:(?:\s+(id)="([^"]+)")|(?:\s+(prefix)="([^"]+)")|(?:\s+(suffix)="([^"]+)")|(?:\s+[^\s]+))+.*-->';
preg_replace_callback($regexNavPattern, "parseNav", $html);
function parseNav($navData) {
global $lastNav, $html;
foreach($navData as $key=>$value) {
if($key == 0) { $lastNav['replace'] = '<'.$value.'>'; }
if($value == 'id') { $lastNav['id'] = $navData[$key+1]; }
if($value == 'prefix') { $lastNav['prefix'] = $navData[$key+1]; }
if($value == 'suffix') { $lastNav['suffix'] = $navData[$key+1]; }
}
$html = str_replace($lastNav['replace'], $lastNav['id'], $html);
}
echo $html;
?>
At this point I am not concerned about case sensitivity. There is a chance that the attributes may contain special characters including single or double quotes.
Hopefully I explained this well enough. Thanks in advance.
Jonathan Kuhn's solutions worked. For the time being I went with the first approach of just correcting the existing regex.
/<!-- NAV.*?(?:(?:\s+(id)="([^"]+)")|(?:\s+(prefix)="([^"]+)")|(?:\s+(suffix)="([^"]+)")|(?:\s+[^\s]?+))+.*?-->/
Later I will modify it to break it down to work as a few functions. I appreciate the help.
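For the later refactoring, here is one way the work could be split. This is a sketch and not Jonathan Kuhn's solution: a lazy pattern isolates each NAV comment, and a second pattern reads the attributes from the captured body. The sample $html, the $result variable, and the attribute pattern are assumptions, and the sketch expects double-quoted values that contain neither a double quote nor "-->".

```php
<?php
// Illustrative input with two NAV tags.
$html = '<p>Hi</p> <!-- NAV id="123" suffix="<br />" prefix="*" --> '
    . '<p>More</p> <!-- NAV id="125" suffix="<br />" prefix="*" -->';

$result = preg_replace_callback(
    // Lazy .*? stops at the first "-->", so each tag is matched separately.
    '/<!--\s*NAV\b(.*?)-->/s',
    function (array $match): string {
        // Collect name="value" pairs from the body of the comment.
        preg_match_all('/(\w+)="([^"]*)"/', $match[1], $pairs, PREG_SET_ORDER);

        $attrs = [];
        foreach ($pairs as $pair) {
            $attrs[$pair[1]] = $pair[2];
        }

        // Placeholder: a real implementation would look the id up in the
        // database and wrap the result in $attrs['prefix'] / $attrs['suffix'].
        return $attrs['id'] ?? '';
    },
    $html
);

echo $result, PHP_EOL;
// Prints: <p>Hi</p> 123 <p>More</p> 125
```

Returning the replacement from the callback removes the need for the global variables and the extra str_replace() call, because preg_replace_callback() substitutes each match with the callback's return value.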
I'm trying to pull some data from my website. It is pretty simple, but I can't find any good examples/docs, so I am having a tough time. I'm trying to make an API for my friends to use my blog, but it's a bit difficult. Let's assume I have a website at http://www.sample.com, and the html source for that website is:
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
I want to get all of .container's children, the first a child's href value, the text value of the class title, author, the img src for the child inside .thumb, and the text value for category.
I started with the a href src, but I didn't even get that far. I thought $title would be echoing the href value of the first anchor tag inside of container, but it doesn't work.
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) {
$class = $div->getAttribute('class');
if(strpos($class, 'container') !== FALSE) {
// title doesnt retrieve the href value of title :(
$title = 'TITLE'.$div->getElementsByTagName('a')->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}
Can anyone explain why please?
The culprit is $div->getElementsByTagName('a')->getAttribute('href'). The first part, $div->getElementsByTagName('a'), retrieves a list of elements (a DOMNodeList), not a single element. A DOMNodeList has no getAttribute() method, so the chained ->getAttribute('href') call fails.
To fix this, iterate just as you do with the div-tags:
foreach($div->getElementsByTagName('a') as $a) {
$href = $a->getAttribute('href');
if ($href) echo "TITLE$href<br>";
}
OK, so first,
$div->getElementsByTagName('a')
returns a DOMNodeList (http://php.net/manual/en/class.domnodelist.php) object. You need to get the first item from it to read the attribute.
Second,
$div->textContent
does what it is meant to do: it shows all the text content in the $div.
You may be better off looking at XPath queries (http://php.net/manual/en/class.domxpath.php) for this type of DOM searching.
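To give an idea of what the XPath route might look like, here is a small sketch that pulls the five values the question asks for. The markup is the sample from the question placed in a string; the $posts array and its keys are names chosen for illustration.

```php
<?php
// Sample markup from the question, inlined so the sketch runs offline.
$text = <<<'HTML'
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($text);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$posts = [];

// Match divs whose class list contains the word "container"
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " container ")]';
foreach ($xpath->query($query) as $div) {
    // The second argument makes each expression relative to this div.
    $posts[] = [
        'href'     => $xpath->evaluate('string((.//a)[1]/@href)', $div),
        'title'    => trim($xpath->evaluate('string(.//*[@class="title"])', $div)),
        'author'   => trim($xpath->evaluate('string(.//*[@class="author"])', $div)),
        'image'    => $xpath->evaluate('string(.//div[@class="thumb"]//img/@src)', $div),
        'category' => trim($xpath->evaluate('string(.//*[@class="category"])', $div)),
    ];
}

print_r($posts);
```

Wrapping each expression in string() makes evaluate() return a plain string, which is empty when nothing matches, so no null checks are needed.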
I made some corrections to the PHP code you posted; maybe they will help you keep going:
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div)
{
$class = $div->getAttribute('class');
// _($class);
if(strpos($class, 'container') !== FALSE)
{
// title doesnt retrieve the href value of title :(
$a = $div->getElementsByTagName('a');
foreach ($a as $key => $value)
{
$A = $value;
break;
}
$title = 'TITLE'. $A->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}
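The foreach-and-break step that picks the first anchor can also be written with DOMNodeList::item(), which returns null when the list is empty. A minimal sketch, using a cut-down version of the markup in which the "/other/" link is made up:

```php
<?php
// Cut-down markup; the second link is hypothetical.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="container"><a href="/mywebsiteblogpost/">first</a>'
    . '<a class="more" href="/other/">second</a></div>'
);

$div = $doc->getElementsByTagName('div')->item(0);

// item(0) is the first <a> in document order, or null if there is none
$first = $div->getElementsByTagName('a')->item(0);
$href = $first !== null ? $first->getAttribute('href') : null;

echo 'TITLE', $href, PHP_EOL;
// Prints: TITLE/mywebsiteblogpost/
```

The null check matters because calling getAttribute() on a missing element would raise an error when a container has no links.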
I'm hoping someone can help me. I'm using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm) successfully, but I am now trying to find elements based on a certain name. For example, in the fetched HTML, there might be tags such as:
<p class="mattFacer">Matt Facer</p>
<p class="mattJones">Matt Jones</p>
<p class="daveSmith">DaveS Smith</p>
What I need to do is to read in this HTML and capture any HTML elements which match anything beginning with the word, "matt"
I've tried
$html = str_get_html("http://www.testsite.com");
foreach($html->find('matt*') as $element) {
echo $element;
}
but this doesn't work. It returns nothing.
Is it possible to do this? I basically want to search for any HTML element which contains the word "matt". It could be a span, div or p.
I'm at a dead end here!
$html = file_get_html("http://www.testsite.com"); // file_get_html() loads a URL or file; str_get_html() expects the HTML itself
foreach($html->find('[class*=matt]') as $element) {
echo $element;
}
Let's try that
Maybe this?
foreach(array_merge($html->find('[class*=matt]'),$html->find('[id*=matt]')) as $element) {
echo $element;
}
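Note that *= matches "matt" anywhere in the attribute. For a strict "begins with" match, the Simple HTML DOM manual also lists a [attribute^=value] selector. The same idea can be sketched with PHP's built-in DOM extension and the XPath starts-with() function; this swaps in a different library, and the $markup and $names variables are names chosen for illustration.

```php
<?php
// Sample markup from the question; $markup is an illustrative name.
$markup = '<p class="mattFacer">Matt Facer</p>'
    . '<p class="mattJones">Matt Jones</p>'
    . '<p class="daveSmith">DaveS Smith</p>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);

// Any element whose class or id attribute starts with "matt"
$query = '//*[starts-with(@class, "matt") or starts-with(@id, "matt")]';

$names = [];
foreach ($xpath->query($query) as $element) {
    $names[] = $element->textContent;
}

print_r($names);
```

One caveat: starts-with() tests the whole attribute string, so an element with class="box mattFacer" would not match this query.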