When parsing HTML, check if an element is present - PHP

I'm parsing the HTML from a page to get a list of the outgoing links, and I want to split them in two: the ones with a rel="nofollow" / rel="nofollow me" / rel="me nofollow" attribute and the ones without those expressions.
At the moment I'm using the code below, parsed with PHP Simple HTML DOM Parser:
$html = file_get_html("$url");
foreach($html->find('a') as $element) {
    echo $element->href; // THE LINK
}
but I'm not quite sure how to implement it. Any ideas?

Try using something like this:
$html = file_get_html("$url");

// Create an array for storing the links
$arrayLinks = array(
    "nofollow" => array(),
    "others"   => array()
);

foreach($html->find('a') as $element) {
    // Search for the "nofollow" expression, case-insensitive (i flag)
    if(preg_match('#nofollow#i', $element->rel)) {
        $arrayLinks["nofollow"][] = $element->href;
    }
    else {
        $arrayLinks["others"][] = $element->href;
    }
}

// Display the array
echo "<pre>";
print_r($arrayLinks);
echo "</pre>";

Do a regexp on $element->rel, I guess.
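A minimal sketch of that idea, splitting the rel value into whitespace-separated tokens so that multi-value attributes like rel="me nofollow" match exactly (assuming $element comes from the same find('a') loop as above):

// Split the rel attribute into tokens and look for "nofollow" among them
$relTokens = preg_split('/\s+/', strtolower(trim((string)$element->rel)));
$isNofollow = in_array('nofollow', $relTokens, true);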

Related

Question about using simple html dom parser to store HTML tags as objects

I am building a web scraper using the Simple HTML DOM parser. However, I ran into some issues figuring out how to store the HTML elements on a web page as objects. I would like to take an input URL and turn all the HTML elements (tags, divs, fields, etc.) into objects that get spit out onto a page. I have written some code that works when I type in a URL, but the output is not what I am trying to achieve. Below I have attached the code I have worked out already; I am seeking a way to achieve what I am trying to do.
I have tried finding all images and links as well as creating a DOM object. I can't seem to figure out how to convert these elements into objects that I can use to learn more about a website, and possibly store that data into a database.
<?php
require('simple_html_dom.php');

// Create DOM from URL or file
$url = $_POST["url"];
$html = file_get_html($url);
echo $html;

// Find all images
$element = new simple_html_dom();
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
$element = new simple_html_dom();
foreach($html->find('a') as $element)
    echo $element->href . '<br>';

// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a URL
$html->load_file($url);
echo $html;
?>
I am expecting an output of objects, but I am instead getting an actual output of images and links on a web page.
<?php
require('simple_html_dom.php');

// Create DOM from URL or file
// $url = $_POST["url"];
$url = 'Your-Url'; // Your url: 'www.example.com'
$html = file_get_html($url);

// Find all images
$images = []; // create an empty images array
foreach($html->find('img') as $element){
    $images[] = $element->src . '<br>'; // store the found elements in the images array
}
echo '<pre>Output $images: '; var_dump($images); echo '</pre>'; // output the images array

// Find all links
$links = []; // create an empty links array
foreach($html->find('a') as $element){
    $links[] = $element->href . '<br>'; // store the found elements in the links array
}
echo '<pre>Output $links: '; var_dump($links); echo '</pre>'; // output the links array
The echoes display the arrays filled with the 'img' and 'a' tag values from your page.
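From there, storing the data is mostly a matter of packaging it; a rough sketch using json_encode() (the scraped_pages table and its columns are hypothetical):

// Package the scraped data as one JSON document, e.g. for a database column
$pageData = json_encode(array('url' => $url, 'images' => $images, 'links' => $links));
// then e.g. INSERT INTO scraped_pages (url, data) VALUES (?, ?) via PDO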

How to find just relative links in dom parser

Just started playing with PHP Simple HTML DOM Parser and I see from PHP Simple HTML DOM Parser > Modify Fetched Links that it should be possible to find just the relative links to process.
My code so far:
<?php
include('phpsimpledom/simple_html_dom.php');
$html = file_get_html('phpimport.html');

// Fetch all hyperlinks from the page
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
echo '<pre>', print_r($links, 1), '</pre>';
This prints out all the links, but I only want the relative links.
As you're going through the links, check that the href doesn't start with http:// or https:// before you add it to the list. This uses preg_match() to check:
foreach($html->find('a') as $a) {
    if (preg_match("/^https?:\/\//i", $a->href) === 0) {
        $links[] = $a->href;
    }
}
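An alternative, if you want to be stricter, is to treat a link as relative only when parse_url() finds no host component; unlike the prefix check, this also filters protocol-relative URLs such as //example.com/page:

foreach($html->find('a') as $a) {
    // parse_url() returns no 'host' key for relative references
    $parts = parse_url($a->href);
    if ($parts !== false && !isset($parts['host'])) {
        $links[] = $a->href;
    }
}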

I want to parse span innertext using simple_html_dom in php

<span class="contact-seller-name">Enda</span>
Now I want to echo 'Enda' inside this span tag using PHP. Here's my PHP code:
$url="http://website.example.com";
$html = file_get_html( $url );
$value = $html->find('span.contact-seller-name');
echo $value->innertext;
From their documentation it looks like find returns an array of found values matching filter parameters:
From:
http://simplehtmldom.sourceforge.net/
Code:
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
They also provide another example for getting a specific element:
$html->find('div[id=hello]', 0)->innertext = 'foo';
So my guess would be that something like this will get you what you desire:
$value = $html->find('span.contact-seller-name', 0);
echo $value->innertext;
By adding the 0 as a second parameter, it returns the first found instance of that filter.
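If the page happens to contain several matching spans, you can loop over every result instead of indexing the first:

// Iterate over all spans with the contact-seller-name class
foreach($html->find('span.contact-seller-name') as $span) {
    echo $span->innertext . '<br>';
}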
Take a look at their API here:
http://simplehtmldom.sourceforge.net/manual_api.htm
It describes what the find method returns (an array of element objects, or a single element object if the second parameter is defined).
Then, using any of the provided methods on the element object, you can get the desired text.
Full working example tested on a live site:
$url = "http://fleeceandthankyou.org/";
$html = file_get_html($url);
$value = $html->find('span.givecamp-header-wide', 0);

// If the element can't be found, catch the resulting error
try {
    echo $value->innertext;
}
catch (Exception $e) {
    echo "Couldn't access magic method: " . $e->getMessage();
}
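As a side note, if you only want the visible text with nested tags stripped, the element's plaintext property (also listed in that API manual) may suit better than innertext, and a simple null check avoids relying on the try/catch:

// plaintext strips nested tags; innertext keeps them
$value = $html->find('span.contact-seller-name', 0);
if ($value !== null) {
    echo $value->plaintext;
}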

PHP Simple HTML DOM Scrape External URL

I'm trying to build a personal project of mine; however, I'm a bit stuck when using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");

// Get the HTML content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

// Get all data inside the <div class="item-list">
foreach($html->find('div[class=item-list]') as $div) {
    // get all divs inside "item-list"
    foreach($div->find('div') as $d) {
        // get the inner HTML
        $data = $d->outertext;
    }
}
print_r($data);
echo "END";
?>
All I get with this is a blank page with "END", nothing else outputted at all.
It seems your $data variable is being overwritten on each iteration. Try this instead:
$data = "";
foreach($html->find('div[class=item-list]') as $div) {
//get all divs inside "item-list"
foreach($div->find('div') as $d) {
//get the inner HTML
$data .= $d->outertext;
}
}
print_r($data)
I hope that helps.
I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will give you each matching item's full HTML (if you add the proper style sheet, it'll be displayed nicely).

PHP script that counts the number of outgoing links on a page and ignores the rel="nofollow" ones

I am a PHP newb, but I am pretty sure this will be hard to accomplish and very server-consuming. Still, I want to ask and get the opinion of much smarter users than myself.
Here is what I am trying to do:
I have a list of URL's, an array of URL's actually.
For each URL, I want to count the outgoing links - which DO NOT HAVE REL="nofollow" attribute - on that page.
So in a way, I'm afraid I'll have to make PHP load the page and preg_match all the links with regular expressions?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);

// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);

// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
    // Exclude if not a link or has 'nofollow'
    preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
    if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
        continue;
    }

    // Exclude if internal link
    $href = $link->getAttribute('href');
    if (substr($href, 0, 2) === '//') {
        // Deal with protocol-relative URLs as found on Wikipedia
        $href = $pUrl['scheme'] . ':' . $href;
    }
    $pHref = @parse_url($href);
    if (!$pHref || !isset($pHref['host']) ||
        strtolower($pHref['host']) === strtolower($pUrl['host'])
    ) {
        continue;
    }

    // Increment the counter otherwise
    echo 'URL: ' . $link->getAttribute('href') . "\n";
    $numLinks++;
}
echo "Count: $numLinks\n";
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');

// Find all links
foreach($html->find('a[href][rel!=nofollow]') as $element) {
    echo $element->href . '<br>';
}
I'm not sure that SimpleHTMLDOM supports a :not selector, and [rel!=nofollow] might only return a tags that have a rel attribute present (rather than ones where it is absent), so you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or do it manually instead of with a CSS attribute selector:
// Find all links
foreach($html->find('a[href]') as $element) {
    if (strtolower($element->rel) != 'nofollow') {
        echo $element->href . '<br>';
    }
}
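One caveat with the manual check: it compares the whole rel value, so a multi-token attribute like rel="me nofollow" would slip through. Splitting the value into tokens handles that (assuming $element->rel yields an empty value when the attribute is absent):

// Find all links whose rel tokens do not include "nofollow"
foreach($html->find('a[href]') as $element) {
    $tokens = preg_split('/\s+/', strtolower(trim((string)$element->rel)));
    if (!in_array('nofollow', $tokens, true)) {
        echo $element->href . '<br>';
    }
}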
