I am crawling links from a website (this one), but the structure of the website creates unwanted additional output. Basically, the <a> tags contain the name of an article plus additional information (images and the sources of those images). I would like to get rid of the additional information. I found the :not selector for that, but I guess I am implementing it wrong, because every combination I have tried gives me no output at all.
Here is the output.
Here is the code I need to alter:
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
(I have also tried figure:not and a couple of other combinations)
Does anyone know where I went wrong, and what I have to do to exclude the <figure> tag?
Here is my full code, not sure if that helps:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river] a[data-omni-click=inherit] :not[figure]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->href = 'http://www.theatlantic.com'.$post->href;
echo strip_tags($post, '<p><a>'); //echo ($post);
}
?>
</div>
</div>
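For reference, one possible workaround (an untested sketch; instead of a :not selector it relies on the same outertext trick used in the related answers below) would be to select the anchors as before and blank out any <figure> inside each one before printing:
<?php
include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://www.theatlantic.com/most-popular/');
// Select the article anchors, then remove the <figure> blocks (images and
// their sources) from each anchor before echoing it.
foreach ($html->find('ul[class=river] a[data-omni-click=inherit]') as $i => $post) {
    if ($i >= 10) break; // same 10-item limit as above
    foreach ($post->find('figure') as $figure) {
        $figure->outertext = ''; // drop the image block entirely
    }
    $post->href = 'http://www.theatlantic.com' . $post->href;
    echo strip_tags($post, '<p><a>');
}
?>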
I need to produce reports using DOMDocument. Since the number of pages will vary (between 3 and 30), I want to create the pages in a for-loop (the amount being defined by a variable). Producing the pages works fine.
As step 2 I need to populate the pages with content, which will also be produced with DOMDocument. Since I am reusing "$page" in the for-loop, I assume it is natural behaviour that the node value I set afterwards only ends up in the last element created by the loop.
I added a "wanted result" below, knowing that I do not yet have the logic in place for getting that result.
Question:
Is populating the pages afterwards possible with DOMDocument alone, or would I need another tool, e.g. XPath?
<?php
$totalPages = 3;
$xml = new DomDocument('1.0', 'UTF-8');
$xml->formatOutput = true;
$html = $xml->createElement('html');
$xml->appendChild($html);
$wrapper = $xml->createElement('div');
$wrapper->setAttribute('class', 'wrapper');
$html->appendChild($wrapper);
for ($i=1; $i <= $totalPages ; $i++) {
$page = $xml->createElement('div');
$page->setAttribute('class', 'pages');
// $page->nodeValue = 'Content...'; // Kept as reference. Move to below populating.
$page->setAttribute(
'id',
'page-' . $i
);
$wrapper->appendChild($page);
}
// Populate pages with content.
$page->nodeValue = 'Content...';
$wrapper->appendChild($page);
// Save & print.
$xml->save("result.xml");
echo $xml->saveHTML();
Result
<?xml version="1.0" encoding="UTF-8"?>
<html>
<div class="wrapper">
<div class="pages" id="page-1"/>
<div class="pages" id="page-2"/>
<div class="pages" id="page-3">Content...</div>
</div>
</html>
Wanted result
<?xml version="1.0" encoding="UTF-8"?>
<html>
<div class="wrapper">
<div class="pages" id="page-1"/>Content page-1</div>
<div class="pages" id="page-2"/>Content page-2</div>
<div class="pages" id="page-3">Content page-3</div>
</div>
</html>
You could store the references to the elements created by createElement, and afterwards use them to set values.
For example
$pages = [];
for ($i = 1; $i <= $totalPages; $i++) {
    $page = $xml->createElement('div');
    $pages[] = $page;
    // ...set the attributes and append to $wrapper as before...
}
Then afterward you could do:
$pages[0]->nodeValue = 'Content page-1';
Php demo
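For completeness, since the demo itself is not reproduced here, a minimal self-contained version of this idea could look like the following (the foreach at the end is just one way to populate the stored pages):
<?php
$totalPages = 3;
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->formatOutput = true;
$html = $xml->createElement('html');
$xml->appendChild($html);
$wrapper = $xml->createElement('div');
$wrapper->setAttribute('class', 'wrapper');
$html->appendChild($wrapper);
// Create the pages and keep a reference to each element.
$pages = [];
for ($i = 1; $i <= $totalPages; $i++) {
    $page = $xml->createElement('div');
    $page->setAttribute('class', 'pages');
    $page->setAttribute('id', 'page-' . $i);
    $wrapper->appendChild($page);
    $pages[] = $page;
}
// Populate the pages afterwards, using the stored references.
foreach ($pages as $index => $page) {
    $page->nodeValue = 'Content page-' . ($index + 1);
}
echo $xml->saveHTML();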
To fulfill the need of populating the pages later with DOMDocument, you can make use of dynamic variable names: you create all your pages first, and since this happens in a loop you can construct as many pages as you want.
Having a variable name per page (e.g. $page_1, $page_2, etc.) lets you select the page you wish to populate, so the populating can be done outside the for-loop.
The source of the content can be whatever you choose, e.g. DOMDocument, an array, or an import from a file.
<?php
$totalPages = 3;
$xml = new DomDocument('1.0', 'UTF-8');
$xml->formatOutput = true;
$html = $xml->createElement('html');
$xml->appendChild($html);
$wrapper = $xml->createElement('div');
$wrapper->setAttribute('class', 'wrapper');
$html->appendChild($wrapper);
for ($i=1; $i <= $totalPages ; $i++) {
${'page_'.$i} = $xml->createElement('div');
${'page_'.$i}->setAttribute('class', 'pages');
${'page_'.$i}->setAttribute(
'id',
'page-' . $i
);
$wrapper->appendChild(
${'page_'.$i}
);
}
// Populate pages with content.
$page_1->nodeValue = 'Content-page-1';
$wrapper->appendChild($page_1);
$page_2->nodeValue = 'Content-page-2';
$wrapper->appendChild($page_2);
$page_3->nodeValue = 'Content-page-3';
$wrapper->appendChild($page_3);
// Save & print.
$xml->save("result.xml");
echo $xml->saveHTML();
Result
<html>
<div class="wrapper">
<div class="pages" id="page-1">Content-page-1</div>
<div class="pages" id="page-2">Content-page-2</div>
<div class="pages" id="page-3">Content-page-3</div>
</div>
</html>
I would like to extract the contents contained within certain parts of a website using selectors. I am using Simple HTML DOM to do this. However, for some reason, more data is returned than is present in the selectors I specify. I have checked the FAQ of Simple HTML DOM, but did not see anything that could help me out. I wasn't able to find anything on Stack Overflow either.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->find('h2[class=hed]',0)->outertext = "";
echo strip_tags($post, '<p><a>');
}
?>
</div>
The output can be seen here. Instead of only a couple of article links, I get information about the author, information about the article, and more.
You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
Note that the statement before the echo does not modify $post:
$post->find('h2[class=hed]',0)->outertext = "";
Change code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
if ($postNum >= 10) break; // limit reached
$heds = $post->find('h2[class=hed]');
foreach($heds as $hed) {
echo strip_tags($hed, '<p><a>');
}
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";
You really only need one loop. Consider this:
foreach($html->find('ul.river > h2.hed') as $postNum => $h2) {
if ($postNum >= 10) break;
echo strip_tags($h2, '<p><a>') . "\n"; // the text
echo $h2->parent->href . "\n"; // the href
}
I'm new to PHP and I'm trying to code a tool that scrapes Amazon product titles.
Right now I can scrape the first page, but I need the tool to go to the next page until there are no pages left, doing the same scraping task on each page.
Here is the code:
<?php
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for($i = 0; $i < count($links[1]); $i++) {
echo $links[1][$i] . '<br>';
}
?>
Any help is appreciated...
To get all pages' HTML as one variable, this would do the trick:
<?php
$html = '';
$file_string = file_get_contents('http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&page=1&ie=UTF8&qid=1361609819');
preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
for($i = 0; $i < count($links[1]); $i++) {
$html .= file_get_contents($links[1][$i]);
}
echo "all pages combined:\n".$html;
?>
However, more than likely your server will time out, run out of memory, or something else will go wrong. To scrape HTML content you'd be better off creating a URL list first, then scraping the pages one at a time. You could do this via an HTML page that calls the scraper via AJAX.
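As a rough illustration of that "URL list first" idea (a sketch only: it assumes the listing can be paged just by changing the page query parameter, the page count is hard-coded instead of detecting the last page, and the regex is the one from the question):
<?php
// Build the list of page URLs first; the URL is adapted from the question,
// with the page number appended at the end.
$baseUrl = 'http://www.amazon.com/s/ref=lp_3737671_pg_1?rh=n%3A1055398%2Cn%3A%211063498%2Cn%3A3206324011%2Cn%3A3737671&ie=UTF8&page=';
$urls = [];
for ($page = 1; $page <= 5; $page++) {
    $urls[] = $baseUrl . $page;
}
// Then scrape the pages one at a time.
foreach ($urls as $url) {
    $file_string = file_get_contents($url);
    if ($file_string === false) {
        continue; // skip pages that fail to load
    }
    preg_match_all('/<span class="lrg bold">(.*)<\/span>/i', $file_string, $links);
    foreach ($links[1] as $title) {
        echo $title . '<br>';
    }
    sleep(1); // be polite between requests
}
?>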
I have a project I'm working on and I'd like to add a really small list of nearby places, using Facebook Places, in an iframe pulled from touch.facebook.com. I could just use touch.facebook.com/#/places_friends.php, but then it loads the header and the other navigation bars (for messages, events, etc.), and I just want the content.
From looking at the touch.facebook.com/#/places_friends.php source, I'm pretty sure all I need to load is the div "content". Anyway, I'm extremely new to PHP, and I'm pretty sure what I'm trying to do is called web scraping.
For the sake of figuring things out on Stack Overflow, and so I don't need to worry about authentication yet, I want to load the login page to see if I can at least get the scraper to work. Once I have working scraping code I'm pretty sure I can handle the rest. It has to load everything inside the div. I've seen this done before, so I know it is possible, and it will look exactly like what you see when you try to log in at touch.facebook.com, but without the blue Facebook logo up top. That's what I'm trying to accomplish here.
So, for the login page, I'm trying to load the div which contains the login text boxes and the actual login button. If it's done correctly we should see just those, with no blue Facebook header bar above them.
I've tried
<?php
$page = file_get_contents('http://touch.facebook.com/login.php');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('div');
foreach($divs as $div) {
if ($div->getAttribute('id') === 'login_form') {
echo $div->nodeValue;
}
}
?>
All that does is load a blank page.
I've also tried using http://simplehtmldom.sourceforge.net/, and I modified the basic selector example to:
<?php
include('../simple_html_dom.php');
$html = file_get_html('http://touch.facebook.com/login.php');
foreach($html->find('div#login_form') as $e)
echo $e->nodeValue;
?>
I've also tried
<?php
$stream = "http://touch.facebook.com/login.php";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[#id=login_form]");
for($i = 0; $i < $i < count($result); $i++){
echo $result[$i];
}
?>
That did not work either.
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
$result = $nct->xpath("/html/body/div[#id=content]");
for ($i = 0; $i < count($result); $i++){
echo $result[$i];
}
There was a syntax error in that line; I've removed it now. Just copy, paste and run this code.
I'm assuming that you can't use the Facebook API; if you can, then I strongly suggest you use it, because you will save yourself the whole scraping deal.
To scrape text, the best technique is XPath. If the HTML returned by touch.facebook.com is XHTML Transitional, which it should be, then you can use XPath. A sample would look like this:
$stream = "http://touch.facebook.com";
$cnt = simplexml_load_file($stream);
$result = $cnt->xpath("/html/body/div[@id='content']");
for ($i = 0; $i < count($result); $i++){
echo $result[$i];
}
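Note that simplexml_load_file() only accepts well-formed XML, and real-world pages often are not. If that turns out to be the problem, roughly the same XPath query can be run through DOMDocument, which tolerates broken markup (a sketch using only the standard DOM extension):
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents("http://touch.facebook.com")); // suppress warnings about invalid markup
$xpath = new DOMXPath($doc);
// Same idea as above: grab the div with id="content".
foreach ($xpath->query("//div[@id='content']") as $div) {
    echo $doc->saveHTML($div); // PHP 5.3.6+ accepts a node argument here
}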
You need to learn about your comparison operators
=== is for comparing strictly; you should be using ==
if ($div->getAttribute('id') == 'login_form')
{
}
Scraping isn't always the best idea for capturing data elsewhere. I would suggest using Facebook's API to retrieve the values you need. Scraping will break any time Facebook decides to change their markup.
http://developers.facebook.com/docs/api
http://github.com/facebook/php-sdk/
I am trying to find all links in a div and then print those links.
I am using Simple HTML DOM to parse the HTML file. Here is what I have so far; please read the inline comments and let me know where I am going wrong.
include('simple_html_dom.php');
$html = file_get_html('tester.html');
$articles = array();
//find the div with the id abcde
foreach($html->find('#abcde') as $article) {
//find all a tags that have a href in the div abcde
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
}
What currently happens is that the above takes a long time to load (I never got it to finish). Since it was taking too long to wait, I printed what it was doing in each loop, and I found that it's going through things I don't need it to! This suggests my code is wrong.
The HTML is basically something like this:
<div id="abcde">
<!-- lots of html elements -->
<!-- lots of a tags -->
<a href="singer/tom" />
<img src="image..jpg" />
</a>
</div>
Thanks all for any help
The correct way to select a div (or whatever) by ID using that API is:
$html->find('div[id=abcde]');
Also, since IDs are supposed to be unique, the following should suffice:
//find all a tags that have a href in the div abcde
$article = $html->find('div[id=abcde]', 0);
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
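If the copy of Simple HTML DOM being used supports the [attribute*=value] filter listed in its manual (worth double-checking), the "singer" check could even be moved into the selector itself:
// Only anchors whose href contains "singer", inside the div with id "abcde".
foreach ($html->find('div[id=abcde] a[href*=singer]') as $link) {
    echo $link . "\n";
}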
Why don't you use the built-in DOM extension instead?
<?php
$cont = file_get_contents("http://stackoverflow.com/") or die("1");
$doc = new DOMDocument();
@$doc->loadHTML($cont) or die("2");
$nodes = $doc->getElementsByTagName("a");
for ($i = 0; $i < $nodes->length; $i++) {
$el = $nodes->item($i);
if ($el->hasAttribute("href"))
echo "- {$el->getAttribute("href")}\n";
}
gives
... (lots of links before) ...
- http://careers.stackoverflow.com
- http://serverfault.com
- http://superuser.com
- http://meta.stackoverflow.com
- http://www.howtogeek.com
- http://doctype.com
- http://creativecommons.org/licenses/by-sa/2.5/
- http://www.peakinternet.com/business/hosting/colocation-dedicated#
- http://creativecommons.org/licenses/by-sa/2.5/
- http://blog.stackoverflow.com/2009/06/attribution-required/