Replacing a link with plain text with PHP Simple HTML DOM - php

I have a program that removes certain pages from a website; I want to then traverse the remaining pages and "unlink" any links to those removed pages. I'm using simplehtmldom. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd like to then manipulate the DOM to convert each matching element into its $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->href = ''; // Should convert to simple text element
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return $docHtml;
}

I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->outertext = $link->plaintext; // THIS SHOULD WORK
                // IF THIS DOES NOT WORK, TRY:
                // $link->outertext = $link->innertext;
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return $docHtml;
}
Please provide some feedback as to whether this worked or not, specifying which method worked, if any.
Update: Maybe you would prefer this:
$link->outertext = $link->href;
This way the link's URL is displayed as text, but is not clickable.
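For what it's worth, a minimal sketch (untested; the sample markup is made up) of what the outertext replacement does to a snippet:
include_once('simple_html_dom.php');

// Hypothetical sample markup for illustration
$htmlObj = str_get_html('<p>See <a href="removed.html">this page</a> for details.</p>');
foreach ($htmlObj->find('a') as $link) {
    // Replace the whole <a> element with its visible text
    $link->outertext = $link->plaintext;
}
echo $htmlObj->save(); // <p>See this page for details.</p>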

Related

PHP Simple HTML DOM Scrape External URL

I'm trying to build a personal project of mine; however, I'm a bit stuck using the Simple HTML DOM class.
What I'd like to do is scrape a website and retrieve all the content, and its inner HTML, that matches a certain class.
My code so far is:
<?php
error_reporting(E_ALL);
include_once("simple_html_dom.php");

// Get the HTML content
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

// Get all data inside the <div class="item-list">
foreach ($html->find('div[class=item-list]') as $div) {
    // Get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // Get the outer HTML of each div
        $data = $d->outertext;
    }
}
print_r($data);
echo "END";
?>
All I get with this is a blank page with "END", nothing else outputted at all.
It seems your $data variable is being overwritten on each iteration, so only the last div's content survives the loop. Try this instead:
$data = "";
foreach ($html->find('div[class=item-list]') as $div) {
    // Get all divs inside "item-list"
    foreach ($div->find('div') as $d) {
        // Append the outer HTML
        $data .= $d->outertext;
    }
}
print_r($data);
I hope that helps.
I think you may want something like this:
$url = 'http://www.peopleperhour.com/freelance-seo-jobs';
$html = file_get_html($url);

foreach ($html->find('div.item-list div.item') as $div) {
    echo $div . '<br />';
}
This will give you each item's markup (if you add the proper style sheet, it'll be displayed nicely).

How to crawl a specific div tag to get all data inside that?

Actually I'm new to PHP and I want to crawl this link to get info about all courier companies providing services in our country. All the info that I require is inside a div tag, i.e. <div class="rescont">. I need all the info inside this tag, with images, paragraphs and links. I've done some research on that and am able to crawl the page.
<?php
function crawl_page($url, $depth = 1)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $xpath = new DOMXPath($dom);
    $divTag = $xpath->query('//div[@class="rescont"]');
    foreach ($divTag as $val) {
        echo $dom->saveXML($val) . "<br />\n";
    }
}

crawl_page("http://www.phonebook.com.pk/Dynamic/Search.aspx?k=courier&l=pakistan&SearchType=kl", 1);
?>
Edit:
Now I'm able to display all the contents on my webpage, but images and some other info are not available because they were linked relative to that server. Can I extract that info too?
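No answer was given here, but one possible approach (a rough, untested sketch; the helper name is made up) is to rewrite relative src and href attributes against the page URL before echoing each node:
// Hypothetical helper: resolve a relative URL against the page URL
function makeAbsolute($relUrl, $pageUrl)
{
    // Leave absolute and protocol-relative URLs alone
    if (preg_match('#^(https?:)?//#i', $relUrl)) {
        return $relUrl;
    }
    $p = parse_url($pageUrl);
    $root = $p['scheme'] . '://' . $p['host'];
    if ($relUrl !== '' && $relUrl[0] === '/') {
        return $root . $relUrl; // root-relative, e.g. /images/logo.gif
    }
    // Path-relative: resolve against the directory of the page
    $dir = isset($p['path']) ? rtrim(dirname($p['path']), '/') : '';
    return $root . $dir . '/' . $relUrl;
}

// Inside the foreach over $divTag, before the echo:
foreach ($val->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', makeAbsolute($img->getAttribute('src'), $url));
}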

PHP script that counts the number of outgoing links on a page and ignores the rel="nofollow" ones

I am a PHP newb, but I am pretty sure this will be hard to accomplish and very server-consuming. But I want to ask and get the opinion of users much smarter than myself.
Here is what I am trying to do:
I have a list of URLs (an array of URLs, actually).
For each URL, I want to count the outgoing links on that page which DO NOT HAVE the rel="nofollow" attribute.
So I'm afraid I'll have to make PHP load each page and match all the links with regular expressions?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);

// Do a preg_match for "http://" and count the number of appearances:
$urls = preg_match();

// Do a preg_match for rel="nofollow" and count the number of appearances:
$nofollow = preg_match();

// Do a preg_match for the number of "domain.com" appearances so we can subtract the site's internal links:
$internal_links = preg_match();

// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right, maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);

// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);

// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
    // Exclude if not a link or has 'nofollow'
    preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
    if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
        continue;
    }

    // Exclude if internal link
    $href = $link->getAttribute('href');
    if (substr($href, 0, 2) === '//') {
        // Deal with protocol-relative URLs as found on Wikipedia
        $href = $pUrl['scheme'] . ':' . $href;
    }
    $pHref = @parse_url($href);
    if (!$pHref || !isset($pHref['host']) ||
        strtolower($pHref['host']) === strtolower($pUrl['host'])
    ) {
        continue;
    }

    // Increment counter otherwise
    echo 'URL: ' . $link->getAttribute('href') . "\n";
    $numLinks++;
}
echo "Count: $numLinks\n";
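Since the question starts from an array of URLs, the same logic can be wrapped in a function and applied to each one. A sketch reusing the code above (the URL list is illustrative):
function countOutgoingLinks($url)
{
    $pUrl = parse_url($url);
    $doc = new DOMDocument;
    @$doc->loadHTMLFile($url);
    $numLinks = 0;
    foreach ($doc->getElementsByTagName('a') as $link) {
        preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
        if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
            continue; // not a link, or marked nofollow
        }
        $href = $link->getAttribute('href');
        if (substr($href, 0, 2) === '//') {
            $href = $pUrl['scheme'] . ':' . $href; // protocol-relative
        }
        $pHref = @parse_url($href);
        if (!$pHref || !isset($pHref['host']) ||
            strtolower($pHref['host']) === strtolower($pUrl['host'])) {
            continue; // internal or unparsable link
        }
        $numLinks++;
    }
    return $numLinks;
}

$urls = array('http://www.site.com/', 'http://stackoverflow.com/');
foreach ($urls as $u) {
    echo $u . ': ' . countOutgoingLinks($u) . " outgoing links\n";
}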
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');

// Find all links that do not have rel="nofollow"
foreach ($html->find('a[href][rel!=nofollow]') as $element) {
    echo $element->href . '<br>';
}
As I'm not sure that SimpleHTMLDOM supports a :not selector, and [rel!=nofollow] might only return <a> tags with a rel attribute present (and not ones where it is absent), you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach ($html->find('a[href]') as $element) {
    if (strtolower($element->rel) != 'nofollow') {
        echo $element->href . '<br>';
    }
}

Regex Replacement Dependent On Class

I have the following code that replaces all <img> tags on a page and adds the nCode image resizer to them. The code is as follows:
function ncode_the_content($content) {
    return preg_replace("/<img([^`|>]*)>/im", "<img onload=\"NcodeImageResizer.createOn(this);\"$1>", $content);
}
What I need to do is make it so that if an image has the class "noresize", the replacement is skipped for that image.
I have only managed to get it so that if the "noresize" class appears anywhere on the page, it stops resizing all images instead of just the ones with that class.
Any suggestions?
UPDATE:
Am I even remotely in the right ballpark with this?
function ncode_the_content($content) {
    // Load the HTML page
    $html = file_get_contents($content);

    // Parse it. Here we use loadHTML as a static method
    // to parse the HTML and create the DOM object in one go.
    $dom = @DOMDocument::loadHTML($html);

    // Init the XPath object
    $xpath = new DOMXPath($dom);

    // Query the DOM
    $linksnoresize = $xpath->query('//img[@class="noresize"]');
    $links = $xpath->query('//img');

    // Display the results as in the previous example
    foreach ($links as $link) {
        echo $link->getAttribute('onload'), 'NcodeImageResizer.createOn(this);';
    }
    foreach ($linksnoresize as $link) {
        echo $link->getAttribute('onload'), '';
    }
}
Here's some untested code:
$dom = DOMDocument::loadHTML($content);
$images = $dom->getElementsByTagName("img");
foreach ($images as $image) {
    if (!strstr($image->getAttribute("class"), "noresize")) {
        $image->setAttribute("onload", "NcodeImageResizer.createOn(this);");
    }
}
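One gap in the snippet above: it never turns the modified DOM back into a string. A possible complete filter (an untested sketch along the same lines):
function ncode_the_content($content)
{
    $dom = new DOMDocument;
    @$dom->loadHTML($content);
    foreach ($dom->getElementsByTagName('img') as $image) {
        if (!strstr($image->getAttribute('class'), 'noresize')) {
            $image->setAttribute('onload', 'NcodeImageResizer.createOn(this);');
        }
    }
    // Note: loadHTML() wraps fragments in <html><body>, so saveHTML()
    // returns a full document rather than just the original fragment.
    return $dom->saveHTML();
}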
But, if it were me, I would eschew any such inline event handler and instead just find the appropriate elements with JavaScript.
I ended up just using pure CSS and adding a <div> around the images I didn't want to be resized. I forced the width and height of that div back to auto and then removed the warning message that was displayed above them. Seems to work fine. Thanks for your help :)

How to validate plain text (link text) in a hyperlink using php?

I am using Simple HTML DOM to fetch data from other websites. While fetching data, it fetches both hyperlinks with plain text and hyperlinks without plain text. I want to remove the hyperlinks without plain text (link text) while fetching the data.
I have tried the codes below:
if($title==""){ echo "No text";}
and
if(ctype_space($title)) { echo "No text";}
where $title is the plain text fetched from the website,
but both methods didn't work. Can anyone help?
Thanks in advance for your help.
Until you give us more information on what the value of $title is, my best guess would be to try something like this:
if (empty($title)) {
    echo "No Text";
}
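One caveat with empty() (standard PHP behaviour): it also treats the string "0" as empty, and it does not catch a $title that contains only whitespace. A trim-based length check covers both cases:
if (strlen(trim($title)) === 0) {
    echo "No Text";
}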
Does it really need to be "plain text validation"?
Reading your question, it seems you just want to remove links with empty values.
If that is the case, you can do something like this:
$html = <<<EOL
<a href="http://example.com">Text</a>
<a href="http://example.com"></a>
More Text
EOL;
// (sample markup: one link with text, one empty link)

$dom = new DOMDocument;
$dom->loadHTML($html);

$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    if (strlen(trim($link->nodeValue)) == 0) {
        $link->parentNode->removeChild($link);
    }
}

var_dump($dom->saveHTML());
// Alternative: an XPath-based approach
$dom = new DOMDocument;
$dom->loadHTML($html);

$xPath = new DOMXPath($dom);
$links_array = $xPath->query('//a');  // select all a tags
$totalLinks = $links_array->length;   // how many links there are

for ($i = 0; $i < $totalLinks; $i++) { // process each link one by one
    $title = $links_array->item($i)->nodeValue; // get link text
    if ($title == '') { // if no link text
        $url = $links_array->item($i)->getAttribute('href');
        // do here what you want
    }
}
You need to use preg_match with a regular expression to extract the link text. For example:
if (preg_match("/<a.*?>(.*?)</", $title, $matches)) {
    echo $matches[1];
}
