Finding and Printing all Links within a DIV - php

I am trying to find all links in a div and then print those links.
I am using Simple HTML DOM to parse the HTML file. Here is what I have so far; please read the inline comments and let me know where I am going wrong.
include('simple_html_dom.php');
$html = file_get_html('tester.html');
$articles = array();
//find the div with the id abcde
foreach($html->find('#abcde') as $article) {
    //find all a tags that have a href in the div abcde
    foreach($article->find('a[href]') as $link){
        //if the href contains singer then echo this link
        if(strstr($link, 'singer')){
            echo $link;
        }
    }
}
What currently happens is that the above takes a long time to load (I never got it to finish). Since it took too long to wait, I printed what it was doing in each loop, and I found that it's going through things I don't need it to! This suggests my code is wrong.
The HTML is basically something like this:
<div id="abcde">
<!-- lots of html elements -->
<!-- lots of a tags -->
<a href="singer/tom" />
<img src="image..jpg" />
</a>
</div>
Thanks all for any help

The correct way to select a div (or whatever) by ID using that API is:
$html->find('div[id=abcde]');
Also, since IDs are supposed to be unique, the following should suffice:
//find all a tags that have a href in the div abcde
$article = $html->find('div[id=abcde]', 0);
foreach($article->find('a[href]') as $link){
    //if the href contains singer then echo this link
    if(strstr($link->href, 'singer')){
        echo $link;
    }
}

Why don't you use the built-in DOM extension instead?
<?php
$cont = file_get_contents("http://stackoverflow.com/") or die("1");
$doc = new DOMDocument();
@$doc->loadHTML($cont) or die("2");
$nodes = $doc->getElementsByTagName("a");
for ($i = 0; $i < $nodes->length; $i++) {
    $el = $nodes->item($i);
    if ($el->hasAttribute("href"))
        echo "- {$el->getAttribute("href")}\n";
}
gives
... (lots of links before) ...
- http://careers.stackoverflow.com
- http://serverfault.com
- http://superuser.com
- http://meta.stackoverflow.com
- http://www.howtogeek.com
- http://doctype.com
- http://creativecommons.org/licenses/by-sa/2.5/
- http://www.peakinternet.com/business/hosting/colocation-dedicated#
- http://creativecommons.org/licenses/by-sa/2.5/
- http://blog.stackoverflow.com/2009/06/attribution-required/
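The DOM extension can also narrow the search to the div from the question by way of DOMXPath. A minimal sketch, assuming markup similar to the sample below (the inline HTML and the variable names $markup and $hrefs are made up for illustration; in practice you would load tester.html instead):

```php
<?php
// Sample markup standing in for tester.html (an assumption for illustration).
$markup = <<<HTML
<div id="other"><a href="singer/outside">outside</a></div>
<div id="abcde">
  <a href="singer/tom"><img src="image.jpg" alt=""></a>
  <a href="band/foo">Foo</a>
  <a href="singer/ann">Ann</a>
</div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings instead of printing them; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Anchors anywhere inside the div with id "abcde" whose href contains "singer".
$links = $xpath->query('//div[@id="abcde"]//a[contains(@href, "singer")]');

$hrefs = [];
foreach ($links as $link) {
    $hrefs[] = $link->getAttribute('href');
}

echo implode(PHP_EOL, $hrefs), PHP_EOL;
```

Because the XPath expression does the filtering, only the matching anchors are visited, and links outside the div are never touched.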

Related

Fetch nested tags in php using simplehtmldom

Let's say I have this code. I want to fetch the data of every p tag inside nested div tags. There can be 15 nested div tags, so I want to write a script that digs through all the divs and returns the p tag data from them.
<div>
    <div>
        <div>
            <p>Hi</p>
        </div>
        <p>Hello</p>
    </div>
    <p>Hey</p>
</div>
Required output (any order):
Hi
Hello
Hey
I have attempted the following:
function divDigger($div)
{
    $internalP = $div->getElementsByTagName('p');
    echo $internalP->innertext;
    $internalDiv = $div->getElementsByTagName('div');
    if (count($internalDiv) > 0) {
        foreach ($internalDiv as $div) {
            divDigger($div);
        }
    }
}
You may use the XPath API for this:
$doc = new \DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query('//div//p') as $pWithinDiv) {
    echo $pWithinDiv->textContent, PHP_EOL;
}
This will find any <p> element under a <div> (not necessarily directly under it, otherwise you can change the expression to //div/p), and display its text content.
Demo: https://3v4l.org/43QqX

Simple HTML Dom Crawler returns more than contained in attributes

I would like to extract the contents contained within certain parts of a website using selectors. I am using Simple HTML DOM to do this. However for some reason more data is returned than present in the selectors that I specify. I have checked the FAQ of Simple HTML DOM, but did not see anything that could help me out. I wasn't able to find anything on Stackoverflow either.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
    $post = $posts[$i];
    $post->find('h2[class=hed]',0)->outertext = "";
    echo strip_tags($post, '<p><a>');
}
?>
</div>
Output can be seen here. Instead of only a couple of article links, I get information of the author, information on the article, among others.
You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
Note that the statement before the echo does not modify $post:
$post->find('h2[class=hed]',0)->outertext = "";
Change code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach($heds as $hed) {
        echo strip_tags($hed, '<p><a>');
    }
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";
You really only need one loop. Consider this:
foreach($html->find('ul.river > h2.hed') as $postNum => $h2) {
    if ($postNum >= 10) break;
    echo strip_tags($h2, '<p><a>') . "\n"; // the text
    echo $h2->parent->href . "\n"; // the href
}

Modifying Database Data with PHP DOM (enclosing in tags)

I have many articles, divided into sections, stored in a database. Each section consists of a section tag, followed by a header (h2) and a primary div. Some also have subheaders (h3). The raw display looks something like this:
<section id="ecology">
<h2 class="Article">Ecology</h2>
<div class="Article">
<h3 class="Article">Animals</h3>
I'm using the following DOM script to add some classes, IDs and glyphicons:
$i = 1; // initialize counter
// initialize DOMDocument
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup
$sections = $dom->getElementsByTagName('section'); // get all section tags
if($sections->length > 0) { // if there are indeed section tags inside
    // work on each section
    foreach($sections as $section) { // for each section tag
        $section->setAttribute('data-target', '#b' . $i); // set data-target for section tag
        // get h2 inside each section
        foreach($section->getElementsByTagName('h2') as $h2) {
            if($h2->getAttribute('class') == 'Article') { // if this h2 has class Article
                $h2->setAttribute('id', 'a' . $i); // set id for h2 tag
            }
        }
        // get div inside each section
        foreach($section->getElementsByTagName('div') as $div) {
            if($div->getAttribute('class') == 'Article') { // if this div has class Article
                $div->setAttribute('id', 'b' . $i); // set id for div tag
            }
        }
        $i++; // increment counter
    }
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $Content .= $dom->saveHTML($child); // convert to string and append to the container
}
I'd like to modify the above code so that it places certain examples of "inner text" between tags.
For example, consider these headings:
<h3 class="Article">Animals</h3>
<h3 class="Article">Plants</h3>
I would like the DOM to change them to this:
<h3 class="Article"><span class="label label-default">Animals</span></h3>
<h3 class="Article"><span class="label label-default">Plants</span></h3>
I want to do something similar with the h2 tags. I don't yet know the DOM terminology well enough to search for good tutorials - not to mention confusion with DOM programs and jQuery. ;)
I think these are the basic functions I need to focus on, but I don't know how to plug them in:
$text = $data->textContent;
elementNode.textContent=string
Two Notes: 1) I understand I can do this with jQuery (perhaps a lot easier), but I think PHP might be better, as they say some users can have JavaScript disabled. 2) I'm using the class "Article" largely to distinguish elements I want to be styled by PHP DOM. A header with a different class, or no class at all, should not be affected by the DOM script.
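One way to do this with the DOM extension is to create a span, move the heading's existing children into it, and then append the span back to the heading. A rough sketch, assuming the sample $Content below stands in for an article from the database and that only headings whose class is exactly "Article" should change (both are assumptions made for illustration):

```php
<?php
// Sample content standing in for $Content from the database (an assumption).
$Content = '<section id="ecology"><h2 class="Article">Ecology</h2>'
         . '<div class="Article"><h3 class="Article">Animals</h3>'
         . '<h3 class="Other">Plants</h3></div></section>';

$dom = new DOMDocument;
// Collect parser warnings instead of printing them (e.g. for the <section> tag).
libxml_use_internal_errors(true);
$dom->loadHTML($Content);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Only headings whose class attribute is exactly "Article" are touched.
foreach ($xpath->query('//h2[@class="Article"] | //h3[@class="Article"]') as $heading) {
    $span = $dom->createElement('span');
    $span->setAttribute('class', 'label label-default');
    // Move every existing child of the heading into the new span.
    while ($heading->firstChild !== null) {
        $span->appendChild($heading->firstChild);
    }
    $heading->appendChild($span);
}

// Back to a string: serialize the children of <body>.
$Content = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $Content .= $dom->saveHTML($child);
}
echo $Content, PHP_EOL;
```

Moving the child nodes, rather than copying textContent, keeps any inline markup that a heading might already contain. The heading with class "Other" is left as it was.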

Parsing an HTML page using PHP to find the text a link is assigned to

Say I have HTML code like this:
$html = "This is some stuff right here. OH MY GOSH";
I am trying to get the value of the href and also the anchor text (I mean the "check this out" text). I am able to get the href value with the following code:
foreach($displaybody->find('a ') as $element) {
    echo $element;
}
Well, it works for me, but how do I get the value of "check this out"? Could you guys help me out? I did search, but I was not able to find it. Thanks in advance.
My actual HTML looks like this:
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
My href looks like this. The above code returns "Download MP4", and I want something like "Download MP4 144p (Video Only) 19.1 MB". How do I do that?
If you are using Simple HTML DOM, then ->innertext works fine on the anchor elements that you have found:
include 'simple_html_dom.php';
$html = "This is some stuff right here. OH MY GOSH";
$displaybody = str_get_html($html);
foreach($displaybody->find('a ') as $element) {
    echo $element->innertext . '<br/>';
}
If you were referring to PHP's DOMDocument, then find() is not the function you need. To target each anchor element, use ->getElementsByTagName(), and then read ->nodeValue on each selected element:
$html = "This is some stuff right here. OH MY GOSH";
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $element) {
    echo $element->nodeValue . '<br/>';
}
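If the text you want also includes what follows the anchor (the resolution and size in your actual HTML), one option is to walk the anchor's following siblings until the next br. This is only a sketch: the markup and the /get/video.mp4 href below are assumptions, since the tags in the question did not survive, and $rows is just a name picked for the example.

```php
<?php
// Hypothetical markup: the href and the wrapping <p> are assumptions.
$html = '<p><a href="/get/video.mp4">Download MP4</a> - <b>144p (Video Only)</b>'
      . ' - <span> 19.1</span> MB<br></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $text = $anchor->textContent;
    // Append the text of the siblings that follow the anchor, up to the next <br>.
    for ($node = $anchor->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'br') {
            break;
        }
        $text .= $node->textContent;
    }
    // Collapse runs of whitespace so the pieces read as one line.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    $rows[] = ['href' => $anchor->getAttribute('href'), 'text' => $text];
}

foreach ($rows as $row) {
    echo $row['href'], ' => ', $row['text'], PHP_EOL;
}
```

If you do not want the " - " separators in the result, you can strip them with str_replace() after the loop.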

Get specific element(s) from page

I'm trying to pull some data from my website. It is pretty simple, but I can't find any good examples/docs, so I am having a tough time. I'm trying to make an API for my friends to use my blog, but it's a bit difficult. Let's assume I have a website at http://www.sample.com, and the html source for that website is:
<div class="container">
    <a href="/mywebsiteblogpost/">
        <h2 class="title">im the best</h2>
    </a>
    <span class="author">Josue Espinosa</span>
    <div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
        <span class="category">sports</span>
    </div>
    <p>preview text</p>
    <a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
I want to get all of .container's children, the first a child's href value, the text value of the class title, author, the img src for the child inside .thumb, and the text value for category.
I started with the a href src, but I didn't even get that far. I thought $title would be echoing the href value of the first anchor tag inside of container, but it doesn't work.
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) {
    $class = $div->getAttribute('class');
    if(strpos($class, 'container') !== FALSE) {
        // title doesnt retrieve the href value of title :(
        $title = 'TITLE'.$div->getElementsByTagName('a')->getAttribute('href').'<br>';
        //this echos all the text in all of the children of $div
        echo $div->textContent.'<br>';
    }
}
Can anyone explain why please?
The culprit is $div->getElementsByTagName('a')->getAttribute('href'). The first part, $div->getElementsByTagName('a'), retrieves a list of elements (a DOMNodeList), not a single element. A DOMNodeList has no getAttribute() method, so the ->getAttribute('href') call that follows fails.
To fix this, iterate just as you do with the div-tags:
foreach($div->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href) echo "TITLE$href<br>";
}
OK, so first,
$div->getElementsByTagName('a')
returns a DOMNodeList (http://php.net/manual/en/class.domnodelist.php) object. You need to get the first item from it to read the attribute.
Second,
$div->textContent
does what it is intended to do: it shows all the text content in the $div.
You may be better off looking at XPath queries (http://php.net/manual/en/class.domxpath.php) for this type of DOM searching.
I made some corrections to the PHP code you posted that doesn't work; maybe it can help you keep going:
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div)
{
    $class = $div->getAttribute('class');
    // _($class);
    if(strpos($class, 'container') !== FALSE)
    {
        // take the first anchor in the list, if there is one
        $A = $div->getElementsByTagName('a')->item(0);
        if ($A !== null)
        {
            $title = 'TITLE'. $A->getAttribute('href').'<br>';
        }
        //this echos all the text in all of the children of $div
        echo $div->textContent.'<br>';
    }
}
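To illustrate the XPath suggestion above, here is a sketch that pulls the requested fields out of the markup from the question. The markup is inlined so that it runs without a network call; the array keys and the exact-match class tests are my own choices, not something the question prescribes.

```php
<?php
// The markup from the question, inlined instead of fetched from www.sample.com.
$text = <<<HTML
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings quietly
$doc->loadHTML($text);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Matches "container" as a whole word in the class attribute, not as a substring.
$containers = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " container ")]'
);

$posts = [];
foreach ($containers as $div) {
    // string(...) evaluates to '' when nothing matches, so no null checks are needed.
    $posts[] = [
        'href'     => $xpath->evaluate('string((.//a)[1]/@href)', $div),
        'title'    => trim($xpath->evaluate('string(.//*[@class="title"])', $div)),
        'author'   => trim($xpath->evaluate('string(.//*[@class="author"])', $div)),
        'image'    => $xpath->evaluate('string(.//div[@class="thumb"]//img/@src)', $div),
        'category' => trim($xpath->evaluate('string(.//*[@class="category"])', $div)),
    ];
}

print_r($posts);
```

Passing $div as the second argument makes each expression relative to that container, which is why the paths start with a dot.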
