Finding and Printing all Links within a DIV

Finding and Printing all Links within a DIV - php

I am trying to find all links in a div and then printing those links.
I am using the Simple HTML Dom to parse the HTML file. Here is what I have so far, please read the inline comments and let me know where I am going wrong.
include('simple_html_dom.php');
$html = file_get_html('tester.html');
$articles = array();
//find the div the div with the id abcde
foreach($html->find('#abcde') as $article) {
//find all a tags that have a href in the div abcde
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
}
What currently happens is that the above takes a long time to load (never got it to finish). I printed what it was doing in each loop since it was too long to wait and I find that its going through things I don't need it to! This suggests my code is wrong.
The HTML is basically something like this:
<div id="abcde">
<!-- lots of html elements -->
<!-- lots of a tags -->
<a href="singer/tom" />
<img src="image..jpg" />
</a>
</div>
Thanks all for any help

The correct way to select a div (or whatever) by ID using that API is:
$html->find('div[id=abcde]');
Also, since IDs are supposed to be unique, the following should suffice:
//find all a tags that have a href in the div abcde
$article = $html->find('div[id=abcde]', 0);
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}

Why don't you use the built-in DOM extension instead?
<?php
$cont = file_get_contents("http://stackoverflow.com/") or die("1");
$doc = new DOMDocument();
#$doc->loadHTML($cont) or die("2");
$nodes = $doc->getElementsByTagName("a");
for ($i = 0; $i < $nodes->length; $i++) {
$el = $nodes->item($i);
if ($el->hasAttribute("href"))
echo "- {$el->getAttribute("href")}\n";
}
gives
... (lots of links before) ...
- http://careers.stackoverflow.com
- http://serverfault.com
- http://superuser.com
- http://meta.stackoverflow.com
- http://www.howtogeek.com
- http://doctype.com
- http://creativecommons.org/licenses/by-sa/2.5/
- http://www.peakinternet.com/business/hosting/colocation-dedicated#
- http://creativecommons.org/licenses/by-sa/2.5/
- http://blog.stackoverflow.com/2009/06/attribution-required/

Related

Fetch nested tags in php using simplehtmldom

Lets say I have this code. I want to fetch all p tag data from nested div tag. there can be 15 nested div tag. so want to write a script which can dig all the div and return p tag data from it.
<div>
<div>
<div>
<p>Hi</p>
</div>
<p>Hello</p>
</div>
<p>Hey</p>
</div>
required output(any order):
Hi
Hello
Hey
I have attempted the following:
function divDigger($div)
{
$internalP = $div->getElementsByTagName('p');
echo $internalP->innertext;
$internalDiv = $div->getElementsByTagName('div');
if (count($internalDiv) > 0) {
foreach ($internalDiv as $div) {
divDigger($div);
}
}
}

You may use the XPath API for this:
$doc = new \DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query('//div//p') as $pWithinDiv) {
echo $pWithinDiv->textContent, PHP_EOL;
}
This will find any <p> element under a <div> (not necessarily directly under it, otherwise you can change the expression to //div/p), and display its text content.
Demo: https://3v4l.org/43QqX

Simple HTML Dom Crawler returns more than contained in attributes

I would like to extract the contents contained within certain parts of a website using selectors. I am using Simple HTML DOM to do this. However for some reason more data is returned than present in the selectors that I specify. I have checked the FAQ of Simple HTML DOM, but did not see anything that could help me out. I wasn't able to find anything on Stackoverflow either.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->find('h2[class=hed]',0)->outertext = "";
echo strip_tags($post, '<p><a>');
}
?>
</div>
Output can be seen here. Instead of only a couple of article links, I get information of the author, information on the article, among others.

You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
Note that the statement before the echo does not modify $post:
$post->find('h2[class=hed]',0)->outertext = "";
Change code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
if ($postNum >= 10) break; // limit reached
$heds = $post->find('h2[class=hed]');
foreach($heds as $hed) {
echo strip_tags($hed, '<p><a>');
}
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";

You really only need one loop. Consider this:
foreach($html->find('ul.river > h2.hed') as $postNum => $h2) {
if ($postNum >= 10) break;
echo strip_tags($h2, '<p><a>') . "\n"; // the text
echo $h2->parent->href . "\n"; // the href
}

Modifying Database Data with PHP DOM (enclosing in tags)

I have many articles, divided into sections, stored in a database. Each section consists of a section tag, followed by a header (h2) and a primary div. Some also have subheaders (h3). The raw display looks something like this:
<section id="ecology">
<h2 class="Article">Ecology</h2>
<div class="Article">
<h3 class="Article">Animals</h3>
I'm using the following DOM script to add some classes, ID's and glyphicons:
$i = 1; // initialize counter
// initialize DOMDocument
$dom = new DOMDocument;
#$dom->loadHTML($Content); // load the markup
$sections = $dom->getElementsByTagName('section'); // get all section tags
if($sections->length > 0) { // if there are indeed section tags inside
// work on each section
foreach($sections as $section) { // for each section tag
$section->setAttribute('data-target', '#b' . $i); // set id for section tag
// get div inside each section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'Article') { // if this div has class maindiv
$h2->setAttribute('id', 'a' . $i); // set id for div tag
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'Article') { // if this div has class maindiv
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
}
$i++; // increment counter
}
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
I'd like to modify the above code so that it places certain examples of "inner text" between tags.
For example, consider these headings:
<h3 class="Article">Animals</h3>
<h3 class="Article">Plants</h3>
I would like the DOM to change them to this:
<h3 class="Article"><span class="label label-default">Animals</span></h3>
<h3 class="Article"><span class="label label-default">Plants</span></h3>
I want to do something similar with the h2 tags. I don't yet know the DOM terminology well enough to search for good tutorials - not to mention confusion with DOM programs and jQuery. ;)
I think these are the basic functions I need to focus on, but I don't know how to plug them in:
$text = $data->textContent;
elementNode.textContent=string
Two Notes: 1) I understand I can do this with jQuery (perhaps a lot easier), but I think PHP might be better, as they say some users can have JavaScript disabled. 2) I'm using the class "Article" largely to distinguish elements I want to be styled by PHP DOM. A header with a different class, or no class at all, should not be affected by the DOM script.

parsing html page using php to find out text on which link is assiged

say i have html code like this
$html = "This is some stuff right here. OH MY GOSH";
i am trying to get values of href and also on which anchor work i mean check this out text i am able to get href value by following this code
$displaybody->find('a ') as $element;
echo $element;
well it works for me but how do i get value of check this out could you guys help me out. i did search but i am not able to find it out . thanks in advance
my actual html look like this
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
my href look like this above code return download mp4 and i want it like downloadmp4 114p (video only) 19.1 mb how do i do that

If what you are using now is the SimpleHTMLDOM, then ->innertext works fine on that anchor elements that you have found:
include 'simple_html_dom.php';
$html = "This is some stuff right here. OH MY GOSH";
$displaybody = str_get_html($html);
foreach($displaybody->find('a ') as $element) {
echo $element->innertext . '<br/>';
}
If you were referring to PHP's DOMDocument, then its not find() function you need to use, to target each anchor element, you need to use ->getElementsByTagName(), then each selected elements you need to use ->nodeValue:
$html = "This is some stuff right here. OH MY GOSH";
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $element) {
echo $element->nodeValue . '<br/>';
}

Get specific element(s) from page

I'm trying to pull some data from my website. It is pretty simple, but I can't find any good examples/docs, so I am having a tough time. I'm trying to make an API for my friends to use my blog, but it's a bit difficult. Let's assume I have a website at http://www.sample.com, and the html source for that website is:
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
I want to get all of .container's children, the first a child's href value, the text value of the class title, author, the img src for the child inside .thumb, and the text value for category.
I started with the a href src, but I didn't even get that far. I thought $title would be echoing the href value of the first anchor tag inside of container, but it doesn't work.
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) {
$class = $div->getAttribute('class');
if(strpos($class, 'container') !== FALSE) {
// title doesnt retrieve the href value of title :(
$title = 'TITLE'.$div->getElementsByTagName('a')->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}
Can anyone explain why please?

The culprit is $div->getElementsByTagName('a')->getAttribute('href'). The first part, $div->getElementsByTagName('a') retrieves a list of elements, not a single element. So the following ->getAttribute('href') will not do the right thing.
To fix this, iterate just as you do with the div-tags:
foreach($div->getElementsByTagName('a') as $a) {
$href = $a->getAttribute('href');
if ($href) echo "TITLE$href<br>";
}

ok so first
$div->getElementsByTagName('a')
returns a domnodelist (http://php.net/manual/en/class.domnodelist.php) object, You need to get the first item there to get the attribute.
Second
$div->textContent
Does as intended ? show all text content in the $div ?
You may be better off looking at xpath queries( http://php.net/manual/en/class.domxpath.php) for this type of DOM searching

I made some corrections on the php code you posted that doesn't work, may be it can help you keep going
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div)
{
$class = $div->getAttribute('class');
// _($class);
if(strpos($class, 'container') !== FALSE)
{
// title doesnt retrieve the href value of title :(
$a = $div->getElementsByTagName('a');
foreach ($a as $key => $value)
{
$A = $value;
break;
}
$title = 'TITLE'. $A->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Finding and Printing all Links within a DIV - php

Related

Fetch nested tags in php using simplehtmldom

Simple HTML Dom Crawler returns more than contained in attributes

Modifying Database Data with PHP DOM (enclosing in tags)

parsing html page using php to find out text on which link is assiged

Get specific element(s) from page

Categories

Resources