crawl site and retrieve only links that start with http://


I am using the following code to retrieve links from <a> tags, but I would like to make some adjustments:
I would like to return only links that begin with "http://".
I would like to include image and script references that begin with "http://".
Ideally, it would return links from all tags, as long as they begin with "http://".
Here is the current code:
<?php
$html = file_get_contents('http://mattressandmore.com/in-the-community/');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url.'<br />';
}
?>

You need to apply the XPath starts-with() function to the href attribute of the a element.
Check an XPath reference and you will get the idea. Here is the relevant line:
...
$hrefs = $xpath->evaluate("/html/body//a[starts-with(@href, \"http:\")]");
...
Full code:
<?php
$html = file_get_contents('http://mattressandmore.com/in-the-community/');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[starts-with(@href, \"http:\")]");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url.'<br />';
}
?>
Similarly, you can match img tags whose src attribute starts with "http://", and script tags by their src attribute too.
...
$hrefs = $xpath->evaluate("/html/body//img[starts-with(@src, \"http:\")]");
...
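For the "all tags" part of the question, one possible approach is to select attributes rather than elements. The sketch below is only an illustration: the sample markup and the example.com URLs are made up, and it assumes that matching any attribute value beginning with "http://" is what is wanted. Note that starts-with() is case-sensitive, so "HTTP://" would not match.

```php
<?php
// Hypothetical sample markup; in practice $html would come from file_get_contents().
$html = '<html><head><script src="http://cdn.example.com/app.js"></script></head>'
      . '<body><a href="http://example.com/page">Page</a>'
      . '<a href="/relative">Relative</a>'
      . '<img src="http://example.com/logo.png" alt="Logo">'
      . '<img src="images/local.png" alt="Local"></body></html>';

$dom = new DOMDocument();
// Suppress warnings from imperfect real-world HTML.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Match any attribute, on any element, whose value starts with "http://".
$attrs = $xpath->query('//@*[starts-with(., "http://")]');

$urls = [];
foreach ($attrs as $attr) {
    $urls[] = $attr->value;
}
// Drop duplicates and reindex.
$urls = array_values(array_unique($urls));

foreach ($urls as $url) {
    echo $url, "\n";
}
```

Because the expression starts with //, it also covers elements in the head, which the /html/body//a query would skip.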

Related

Append <li> innertext to php url scraper results

I have a list of links on one page:
<li><span>site1.com : Description 1</span></li>
<li><span>site2.com : Description 2</span></li>
<li><span>site3.com : Description 3</span></li>
<li><span>site4.com : Description 4</span></li>
I'm using PHP to take the links from one page and display them on another, like this:
<?php
$urlContent = file_get_contents('https://www.example.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    if(!filter_var($url, FILTER_VALIDATE_URL) === false){
        echo ''.$url.'<br />';
    }
}
?>
However, I'm trying to figure out how to include the description next to each link.
Here is one of my many attempts:
<?php
$urlContent = file_get_contents('https://www.example.com');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/li");
$li = document.getElementsByTagName("li");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.' : '.$li.' <br />';
}
}
?>
The first part works great but everything I have tried to add the description has failed.

Here's a simple example based on the current markup:
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$lis = $xpath->evaluate("/html/body/li");
foreach ($lis as $li) {
    $a = $xpath->evaluate("span/a", $li)->item(0);
    $url = $a->getAttribute('href');
    var_dump($url, $a->nextSibling->nodeValue);
}
Here nextSibling is the text node that follows the <a> tag, so nextSibling->nodeValue will be " : Description", and you'll have to strip the spaces and the colon, for example with trim.
Working fiddle.
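As a rough illustration of the trim step, the sketch below passes a character list to trim() so that spaces and the colon are stripped in one call. The anchor tags and href values in the sample markup are assumptions made for this example, since the markup quoted in the question does not show them.

```php
<?php
// Assumed markup: each <li> holds a <span> with a link followed by " : Description".
$html = '<html><body><ul>'
      . '<li><span><a href="http://site1.com">site1.com</a> : Description 1</span></li>'
      . '<li><span><a href="http://site2.com">site2.com</a> : Description 2</span></li>'
      . '</ul></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query('//li/span/a') as $a) {
    // The text node right after the link holds " : Description N".
    $next = $a->nextSibling;
    // Strip spaces, colons and line breaks from both ends.
    $description = $next !== null ? trim($next->nodeValue, " :\t\n\r") : '';
    $results[] = [$a->getAttribute('href'), $description];
}

foreach ($results as [$url, $description]) {
    echo $url . ' : ' . $description . "\n";
}
```

The null check guards against links that have no following text node.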

How can I find out whether an a tag sometimes contains an img as its anchor

I have some XPath code that loops over the a tags in some HTML and retrieves the href, the rel attribute, and the anchor text. But I can't determine whether the anchor text is an img tag, and if it is, can I get the alt attribute?
This is the code for finding links and retrieving information about them:
$dom = new \DOMDocument();
@$dom->loadHTML($html);
$xpath = new \DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    //$img = $href->evaluate("img");
    $url = $href->getAttribute('href');
    $rel = $href->getAttribute('rel');
    $anchortext=$href->nodeValue;
}
The above works fine, but I cannot figure out how to determine whether the anchor text is an image, and if it is, how to retrieve the alt attribute.

You can use xpath as you do to retrieve the links:
$dom = new \DOMDocument();
@$dom->loadHTML('<html><body><img src="img.png">sdqsdsdq');
$xpath = new \DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body/a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    //$img = $href->evaluate("img");
    $url = $href->getAttribute('href');
    $rel = $href->getAttribute('rel');
    $anchortext=$href->nodeValue;
    // get images
    $nodes = $href->childNodes;
    $contentAnImage = 0;
    $images = array();
    foreach ($nodes as $node) {
        if ($node->nodeName == 'img'){
            $contentAnImage = 1;
            // if you want the image src:
            $images[] = $node->getAttribute('src');
        }
    }
}
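Since the question asks for the alt text specifically, a variation on the idea above might look like the following. This is only a sketch: the sample markup is invented, and it uses a relative XPath query (".//img") instead of childNodes so that images nested deeper inside the link are also found.

```php
<?php
// Hypothetical sample: one link wrapping an image, one plain text link.
$html = '<html><body>'
      . '<a href="/home" rel="nofollow"><img src="img.png" alt="Home icon"></a>'
      . '<a href="/about">About us</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('/html/body//a') as $a) {
    $alts = [];
    // ".//img" is evaluated relative to the current <a> element.
    foreach ($xpath->query('.//img', $a) as $img) {
        $alts[] = $img->getAttribute('alt');
    }
    $links[] = [
        'url'      => $a->getAttribute('href'),
        'rel'      => $a->getAttribute('rel'),
        'text'     => trim($a->textContent),
        'hasImage' => count($alts) > 0,
        'alts'     => $alts,
    ];
}

print_r($links);
```

A link that wraps only an image ends up with an empty 'text' value, which is another way to tell the two cases apart.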

PHP code that displays all links on a web page

I found this code here
<?php
$urlContent = file_get_contents('https://www.google.co.il/searchq=cow&rlz=1C1SQJL_iwIL827IL82&source=lnms&tbm=isch&sa=X&ved=0ahUKEwje7-3q8uPiAhUG_qQKHdWAACwQ_AUIECgB&biw=1280&bih=578');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    // validate url
    if(!filter_var($url, FILTER_VALIDATE_URL) === false){
        echo ''.$url.'<br />';
    }
}
?>
I do not understand why, when I run it, it returns only the links on the page and not the links to the images.

For all my crawlers I use this class https://simplehtmldom.sourceforge.io/
Try it.
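If you would rather stay with DOMDocument, one reason visible in the code itself is that the XPath /html/body//a selects only a elements, so img sources are never visited. A minimal sketch of one way to include them, using a made-up sample page and the XPath union operator, might be:

```php
<?php
// Hypothetical sample page with one link and two images.
$html = '<html><body>'
      . '<a href="https://example.com/page">Page</a>'
      . '<img src="https://example.com/cow.jpg" alt="Cow">'
      . '<img src="thumb.png" alt="Relative thumbnail">'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The union operator "|" selects link targets and image sources together.
$nodes = $xpath->query('/html/body//a/@href | /html/body//img/@src');

$urls = [];
foreach ($nodes as $attr) {
    // Keep only absolute URLs; relative paths such as "thumb.png" fail validation.
    if (filter_var($attr->value, FILTER_VALIDATE_URL) !== false) {
        $urls[] = $attr->value;
    }
}

foreach ($urls as $url) {
    echo $url . "\n";
}
```

Note that FILTER_VALIDATE_URL rejects relative paths, so images referenced relatively would need to be resolved against the page URL first.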

Extract href from html page using php

I am trying to extract the news headlines and the link (href) of each headline using the code below, but the link extraction is not working. It only gets the headline. Please help me find out what's wrong with the code.
Link to page from which I want to get the headline and link from:
http://web.tmxmoney.com/news.php?qm_symbol=BCM
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
    $cols = $row->getElementsByTagName('span');
    $newstitle = $cols->item(0)->nodeValue;
    $link = $cols->item(0)->nodeType === HTML_ELEMENT_NODE ? $cols->item(0)->getElementsByTagName('a')->item(0)->getAttribute('href') : '';
    echo $newstitle . '<br>';
    echo $link . '<br><br>';
}
?>
Thanks in advance for your help!

Try to do this:
<?php
$data= file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$hrefs= $xpath->query('/html/body//a');
for($i = 0; $i < $hrefs->length; $i++){
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    if(!filter_var($url, FILTER_VALIDATE_URL) === false){
        echo ''.$url.'<br />';
    }
}
?>

I have found the solution. Here it goes:
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
    $cols1 = $row->getElementsByTagName('a');
    $link = $cols1->item(0)->nodeType === XML_ELEMENT_NODE ? $cols1->item(0)->getAttribute('href') : '';
    $cols2 = $row->getElementsByTagName('span');
    $title = $cols2->item(0)->nodeValue;
    $source = $cols2->item(1)->nodeValue;
    echo $title . '<br>';
    echo $source . '<br>';
    echo $link . '<br><br>';
}
?>

Xpath php fetch links

I'm using this example to fetch links from a website :
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    var_dump($href);
    $url = $href->getAttribute('href');
    echo "<br />Link stored: $url";
}
It works well and gets all the links, but I cannot get the actual 'title' (the link text) of each link. For example, if I have a link whose text is:
Google
I want to be able to fetch the term 'Google' too.
I'm a little lost and quite new to XPath.

You are looking for the "nodeValue" of the Textnode inside the "a" node.
You can get that value with
$title = $href->firstChild->nodeValue;
Full working example:
<?php
// loadHTML() must be called on an instance; calling it statically fails on PHP 8.
$dom = new DOMDocument();
$dom->loadHTML("<html><body><a href='www.test.de'>DONE</a></body></html>");
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->firstChild->nodeValue;
    echo "<br />Link stored: $url $title";
}
Prints:
Link stored: www.test.de DONE

Try this:
$link_title = $href->nodeValue;
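The two suggestions differ when the link contains nested markup. As a small illustration (the sample link below is made up), firstChild->nodeValue returns only the text of the first child node, while nodeValue on the a element returns the text of all its descendants:

```php
<?php
// Hypothetical link whose text is split between a <b> element and a text node.
$html = '<html><body><a href="http://www.example.com/"><b>Bold</b> title</a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$a = $dom->getElementsByTagName('a')->item(0);

// First child is the <b> element, so only its text is returned.
$first = $a->firstChild->nodeValue;
// nodeValue on the <a> element concatenates all descendant text.
$whole = $a->nodeValue;

echo "firstChild: $first\n";
echo "whole link: $whole\n";
```

For a link that contains nothing but plain text, both expressions give the same result.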
