Open more links and then parse some elements in PHP

I have a simple problem.
I need to open, for example, this page http://www.50states.com/ in PHP.
Then I need to open every state page and, from each state's page (for example: Alabama), parse the name of the state, its capital, and its location. I want to do it with the Simple HTML DOM library http://simplehtmldom.sourceforge.net/
And I want to do it for every state. How is this possible?
//parser.php
include 'simple_html_dom.php';

$document = file_get_html($site); // $site holds the URL to scrape
foreach ($document->find('a') as $e) {
    echo $e->href . '<br>';
}
So what now? Please help. I think I only need the hrefs that point to states, and then I have to open each one... So?

Why not feed it the state name in the URL? Consider this as an example:
<?php
include 'simple_html_dom.php';

$state = 'alabama';
$main_url = 'http://www.50states.com/' . $state . '.htm';
$html = file_get_html($main_url);

$state_info = null;
$capital_city = '';

foreach ($html->find('ul[class=bulletedList]') as $key => $value) {
    $state_info = $value;
    // Get a particular value (traverse the DOM)
    // Sample: search for the capital city
    if (strpos($value->children(0)->children(0)->innertext, 'Capital City:') !== false) {
        $capital_city = $value->children(0)->children(0)->innertext;
    }
}

echo $state_info;
echo $capital_city;
?>
For more in-depth information, you should check out the manual; it's pretty well documented anyway.
http://simplehtmldom.sourceforge.net/manual.htm
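To cover the original question's goal of visiting every state rather than a hard-coded one, here is a minimal sketch that first harvests the state links from the homepage and then parses each page; the "<state>.htm" URL pattern and the bulletedList selector are assumptions carried over from the answer above:
<?php
include 'simple_html_dom.php';

// Collect candidate state links from the homepage.
$home = file_get_html('http://www.50states.com/');
$state_urls = array();

foreach ($home->find('a') as $a) {
    // Assumption: state pages end in "<state>.htm"; links may be relative or absolute
    if (preg_match('#([a-z]+)\.htm$#', $a->href, $m)) {
        $state_urls['http://www.50states.com/' . $m[1] . '.htm'] = true; // keyed to de-duplicate
    }
}

// Visit each state page and pull out the bulleted facts list.
foreach (array_keys($state_urls) as $url) {
    $page = file_get_html($url);
    foreach ($page->find('ul[class=bulletedList] li') as $li) {
        echo $li->plaintext . '<br>';
    }
    $page->clear(); // free memory between pages
}
?>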

Related

PHP Simple HTML DOM Scrape External URL

I'm trying to build a personal project of mine; however, I'm a bit stuck when using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");

// fetch the HTML content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

// Get all data inside the <div class="item-list">
foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // get the outer HTML
        $data = $d->outertext;
    }
}

print_r($data);
echo "END";
?>
All I get with this is a blank page with "END"; nothing else is output at all.
It seems your $data variable is being overwritten on each iteration, so only the last value survives. Try this instead:
$data = "";
foreach ($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // append the outer HTML of each div
        $data .= $d->outertext;
    }
}
print_r($data);
I hope that helps.
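Alternatively, if you would rather keep each item separate instead of building one long string, you could collect the values into an array (same selectors as above; this is just a sketch of the variant):
$data = array();
foreach ($html->find('div[class=item-list] div') as $d) {
    $data[] = $d->outertext; // one entry per matched div
}
print_r($data); // inspect each item separately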
I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will give you something like this (if you add the proper style sheet, it'll be displayed nicely)

Replacing link with plain text with php simple html dom

I have a program that removes certain pages from a website; I want to then traverse the remaining pages and "unlink" any links to those removed pages. I'm using simplehtmldom. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd then like to manipulate the DOM to convert each matching element into its $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->href = ''; // Should convert to simple text element
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return($docHtml);
}
I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->outertext = $link->plaintext; // THIS SHOULD WORK
                // IF THIS DOES NOT WORK TRY:
                // $link->outertext = $link->innertext;
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return($docHtml);
}
Please give me some feedback as to whether this worked or not, also specifying which method worked, if any.
Update: Maybe you would prefer this:
$link->outertext = $link->href;
This way you get the link displayed, but not clickable.
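For illustration, here is a quick self-contained check of the outertext = plaintext approach; the sample HTML and $skipList values are hypothetical:
include 'simple_html_dom.php';

// Hypothetical input: one link to keep, one to unlink
$htmlObj = str_get_html('<p>See <a href="keep.html">this</a> and <a href="gone.html">that</a>.</p>');
$skipList = array('gone.html');

foreach ($htmlObj->find('a') as $link) {
    if (in_array($link->href, $skipList)) {
        $link->outertext = $link->plaintext; // replace the whole <a>...</a> with its text
    }
}

echo $htmlObj->save();
// Expected output: <p>See <a href="keep.html">this</a> and that.</p>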

PHP script that counts the number of outgoing links on a page and ignores the rel="nofollow" ones

I am a PHP newb, but I am pretty sure this will be hard to accomplish and very heavy on the server. Still, I want to ask and get the opinion of much smarter users than myself.
Here is what I am trying to do:
I have a list of URLs; an array of URLs, actually.
For each URL, I want to count the outgoing links - which DO NOT HAVE a REL="nofollow" attribute - on that page.
So I'm afraid I'll have to make PHP load each page and preg_match all the links with regular expressions?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);

// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);

// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
    // Exclude if not a link or has 'nofollow'
    preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
    if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
        continue;
    }
    // Exclude if internal link
    $href = $link->getAttribute('href');
    if (substr($href, 0, 2) === '//') {
        // Deal with protocol-relative URLs as found on Wikipedia
        $href = $pUrl['scheme'] . ':' . $href;
    }
    $pHref = @parse_url($href);
    if (!$pHref || !isset($pHref['host']) ||
        strtolower($pHref['host']) === strtolower($pUrl['host'])
    ) {
        continue;
    }
    // Increment counter otherwise
    echo 'URL: ' . $link->getAttribute('href') . "\n";
    $numLinks++;
}
echo "Count: $numLinks\n";
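Since the question mentions running this over a whole array of URLs, here is a hedged sketch that wraps the logic above in a reusable function; the function name and the sample URL list are made up for illustration:
// Hypothetical helper wrapping the counting logic above
function countExternalFollowedLinks($url) {
    $pUrl = parse_url($url);
    $doc = new DOMDocument;
    if (!@$doc->loadHTMLFile($url)) {
        return 0; // fetch or parse failed
    }
    $numLinks = 0;
    foreach ($doc->getElementsByTagName('a') as $link) {
        preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
        if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
            continue; // skip anchors without href and rel="nofollow" links
        }
        $href = $link->getAttribute('href');
        if (substr($href, 0, 2) === '//') {
            $href = $pUrl['scheme'] . ':' . $href; // protocol-relative URL
        }
        $pHref = @parse_url($href);
        if (!$pHref || !isset($pHref['host']) ||
            strtolower($pHref['host']) === strtolower($pUrl['host'])) {
            continue; // skip internal links
        }
        $numLinks++;
    }
    return $numLinks;
}

// Sample list of URLs (placeholders)
$urls = array('http://www.example.com/', 'http://www.example.org/');
foreach ($urls as $u) {
    echo $u . ': ' . countExternalFollowedLinks($u) . "\n";
}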
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');

// Find all links
foreach ($html->find('a[href][rel!=nofollow]') as $element) {
    echo $element->href . '<br>';
}
I'm not sure that SimpleHTMLDOM supports a :not selector, and [rel!=nofollow] may only match a tags that have a rel attribute present (not ones where it's absent), so you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or do it manually instead of with a CSS attribute selector:
// Find all links
foreach ($html->find('a[href]') as $element) {
    if (strtolower($element->rel) != 'nofollow') {
        echo $element->href . '<br>';
    }
}

PHP SimpleXML Breaking when trying to traverse nodes

I'm trying to read the XML information that Tumblr provides to create a kind of news feed off the Tumblr site, but I'm very stuck.
<?php
$request_url = 'http://candybrie.tumblr.com/api/read?type=post&start=0&num=5&type=text';
$xml = simplexml_load_file($request_url);

if (!$xml)
{
    exit('Failed to retrieve data.');
}
else
{
    foreach ($xml->posts[0] AS $post)
    {
        $title = $post->{'regular-title'};
        $post = $post->{'regular-body'};
        $small_post = substr($post, 0, 320);
        echo .$title.;
        echo '<p>' . $small_post . '</p>';
    }
}
?>
It always breaks as soon as it tries to go through the nodes. So basically "tumblr->posts;....ect" is displayed on my HTML page.
I've tried saving the information as a local XML file. I've tried using different ways to create the SimpleXML object, like loading it as a string (probably a silly idea). I double-checked that my web hosting was running PHP 5. So basically, I'm stuck on why this isn't working.
EDIT: OK, I tried changing where I start from (back to the original way it was; starting from tumblr was just another (actually silly) way to try to fix it). It still breaks right after the first ->, so it displays "posts[0] AS $post....ect" on screen.
This is the first thing I've ever done in PHP, so there might be something obvious that I should have set up beforehand or something. I don't know and couldn't find anything like that, though.
This should work:
<?php
$request_url = 'http://candybrie.tumblr.com/api/read?type=post&start=0&num=5&type=text';
$xml = simplexml_load_file($request_url);

if (!$xml) {
    exit('Failed to retrieve data.');
} else {
    foreach ($xml->posts[0] AS $post) {
        $title = $post->{'regular-title'};
        $post = $post->{'regular-body'};
        $small_post = substr($post, 0, 320);
        echo $title;
        echo '<p>' . $small_post . '</p>';
        echo '<hr>';
    }
}
The first thing wrong in your code is that you used the root element, which should not be used.
<?php
$request_url = 'http://candybrie.tumblr.com/api/read?type=post&start=0&num=5&type=text';
$xml = simplexml_load_file($request_url);

if (!$xml)
{
    exit('Failed to retrieve data.');
}
else
{
    foreach ($xml->posts->post as $post)
    {
        $title = $post->{'regular-title'};
        $post = $post->{'regular-body'};
        $small_post = substr($post, 0, 320);
        echo $title;
        echo '<p>' . $small_post . '</p>';
    }
}
?>
$xml->posts returns the posts nodes, so if you want to iterate over the post nodes you should try $xml->posts->post, which lets you iterate through the post nodes inside the first posts node.
Also, as Needhi pointed out, you shouldn't pass through the root node (tumblr), because $xml itself represents the root node. (So I fixed my answer.)
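To make the traversal point concrete, here is a tiny self-contained sketch; the XML below is a made-up stand-in for the Tumblr response:
// Hypothetical miniature of the Tumblr API structure
$xml = simplexml_load_string(
    '<tumblr><posts><post><regular-title>Hello</regular-title></post>' .
    '<post><regular-title>World</regular-title></post></posts></tumblr>'
);

// $xml is already the root (<tumblr>), so go straight to posts->post
foreach ($xml->posts->post as $post) {
    echo $post->{'regular-title'} . "\n"; // hyphenated names need the {'...'} syntax
}
// Prints: Hello, then World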

URL and title with PHP snippet

I read several similar posts, but I don't see my mistake.
index.php looks like:
<head>
<title>Demo Title</title>
</head>
<body>
<?php
require_once "footer.php";
?>
</body>
footer.php looks like:
<?php
/*
 * _$ Rev. : 08 Sep 2010 14:52:26 $_
 * footer.php
 */
$host = $_SERVER['SERVER_NAME'];
$param = $_SERVER['REQUEST_URI'];
$url = "http://" . $host . $param;
echo $url;

$file = @fopen($_SERVER[$url], "r") or die("Can't open HTTP_REFERER.");
$text = fread($file, 16384);

if (preg_match('/<title>(.*?)<\/title>/is', $text, $found)) {
    $title = $found[1];
} else {
    $title = " -- no title found -- ";
}
?>
A request for the URL http://127.0.0.1/test/index.php results in:
http://127.0.0.1/test/index.phpCan't open HTTP_REFERER.
or for http://127.0.0.1/test/
http://127.0.0.1/test/Can't open HTTP_REFERER.
Any hints appreciated.
$_SERVER is an array which contains a bunch of fields relating to the server config. It does not contain an element named "http://".$host.$param, so trying to open that as a filename will result in the fopen call failing, and thus going to the die() statement.
More likely what you wanted to do was just open the file called "http://".$host.$param. If that's what you want, then just drop the $_SERVER[] bit and it should work better.
Note that because it's a URL, you will need your PHP config to allow opening of remote files using fopen(). PHP isn't always configured this way by default, as it can be a security risk. Your dev machine may also be configured differently from the system you will eventually deploy to. If you find you can't open a remote URL using fopen(), there are alternatives such as using cURL, but they're not quite as straightforward as a simple fopen() call.
Also, if you're reading the whole file, you may want to consider file_get_contents() rather than fopen() and fread(), as it wraps the whole thing in a single function call.
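A minimal sketch of that file_get_contents() suggestion, assuming allow_url_fopen is enabled:
$host = $_SERVER['SERVER_NAME'];
$param = $_SERVER['REQUEST_URI'];
$url = "http://" . $host . $param;

// One call replaces fopen()/fread(); @ suppresses the warning on failure
$text = @file_get_contents($url);
if ($text === false) {
    die("Can't open URL.");
}

if (preg_match('/<title>(.*?)<\/title>/is', $text, $found)) {
    $title = $found[1];
} else {
    $title = " -- no title found -- ";
}
echo $title;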
Try this:
$file = @fopen($url, "r") or die("Can't open HTTP_REFERER.");
Try
<?php
$dom = new DOMDocument();
$host = $_SERVER['SERVER_NAME'];
$param = $_SERVER['REQUEST_URI'];
$url = "http://" . $host . $param;
echo 'getting title for page: "' . $url . '"';

// loadHTMLFile() fetches and parses the URL; loadHTML() expects an HTML string
@$dom->loadHTMLFile($url);
$titles = $dom->getElementsByTagName('title');
if ($titles->length)
{
    echo $titles->item(0)->textContent;
}
else
{
    echo 'title tag not found';
}
?>
I can see you're trying to track the referrer's title.
You need to use $_SERVER['HTTP_REFERER']; to get that.
What you want to do is something like this:
// Only accept referrers that aren't from this server
$referrer = (!empty($_SERVER['HTTP_REFERER']) && strpos($_SERVER['HTTP_REFERER'], $_SERVER['SERVER_NAME']) === false ? $_SERVER['HTTP_REFERER'] : false);
if ($referrer)
{
    if (false !== ($resource = fopen($referrer, "r")))
    {
        // Read the referrer page in chunks until EOF
        $contents = '';
        while (!feof($resource))
        {
            $contents .= fread($resource, 128);
        }
        fclose($resource);
        if (preg_match('/<title>(.*?)<\/title>/is', $contents, $found))
        {
            echo 'Referrer: ' . $found[1];
        }
    }
}
