Extract href from html page using php - php

I am trying to extract the news headlines and the link (href) of each headline using the code below, but the link extraction is not working: it only gets the headline. Please help me find out what's wrong with the code.
Page from which I want to get the headlines and links:
http://web.tmxmoney.com/news.php?qm_symbol=BCM
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
$cols = $row->getElementsByTagName('span');
$newstitle = $cols->item(0)->nodeValue;
$link = $cols->item(0)->nodeType === HTML_ELEMENT_NODE ? $cols->item(0)->getElementsByTagName('a')->item(0)->getAttribute('href') : '';
echo $newstitle . '<br>';
echo $link . '<br><br>';
}
?>
Thanks in advance for your help!

Try this:
<?php
$data= file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new DOMDocument();
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$hrefs= $xpath->query('/html/body//a');
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.'<br />';
}
}
?>

I have found the solution. Here it is:
<?php
$data = file_get_contents('http://web.tmxmoney.com/news.php?qm_symbol=BCM');
$dom = new domDocument;
@$dom->loadHTML($data);
$dom->preserveWhiteSpace = true;
$xpath = new DOMXPath($dom);
$rows = $xpath->query('//div');
foreach ($rows as $row) {
$cols1 = $row->getElementsByTagName('a');
$link = $cols1->item(0)->nodeType === XML_ELEMENT_NODE ? $cols1->item(0)->getAttribute('href') : '';
$cols2 = $row->getElementsByTagName('span');
$title = $cols2->item(0)->nodeValue;
$source = $cols2->item(1)->nodeValue;
echo $title . '<br>';
echo $source . '<br>';
echo $link . '<br><br>';
}
?>
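For anyone who wants to test the idea without hitting the live site, here is a minimal sketch on inline sample HTML. The markup (a div containing a link and two spans) is an assumption, not the real structure of the page, and it uses libxml_use_internal_errors() in place of the @ operator. The XPath only visits divs that actually contain a link, so item(0) is never called on an empty list.

```php
<?php
// Hypothetical markup: the real page structure is an assumption.
$html = <<<HTML
<html><body>
<div class="news"><a href="/story/1"><span>Headline one</span></a><span>Source A</span></div>
<div class="news"><a href="/story/2"><span>Headline two</span></a><span>Source B</span></div>
<div class="ad">No link here</div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$items = [];
// Only visit divs that contain at least one link with an href.
foreach ($xpath->query('//div[.//a[@href]]') as $row) {
    $a = $xpath->query('.//a[@href]', $row)->item(0);
    $spans = $row->getElementsByTagName('span');
    $items[] = [
        'title' => $spans->length > 0 ? trim($spans->item(0)->nodeValue) : '',
        'link'  => $a->getAttribute('href'),
    ];
}

foreach ($items as $item) {
    echo $item['title'], ' => ', $item['link'], PHP_EOL;
}
```

The div without a link is skipped by the XPath predicate, so no type check on the node is needed.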

Related

Append <li> innertext to php url scraper results

I have a list of links on one page:
<li><span>site1.com : Description 1</span></li>
<li><span>site2.com : Description 2</span></li>
<li><span>site3.com : Description 3</span></li>
<li><span>site4.com : Description 4</span></li>
I'm using PHP to take the links from one page and display them on another, like this:
<?php
$urlContent = file_get_contents('https://www.example.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.'<br />';
}
}
?>
However, I'm trying to figure out how to include the description next to the link.
Here is one of my many attempts:
<?php
$urlContent = file_get_contents('https://www.example.com');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/li");
$li = document.getElementsByTagName("li");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.' : '.$li.' <br />';
}
}
?>
The first part works great, but everything I have tried in order to add the description has failed.
Here's a simple example based on the current markup:
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$lis = $xpath->evaluate("/html/body/li");
foreach ($lis as $li) {
$a = $xpath->evaluate("span/a", $li)->item(0);
$url = $a->getAttribute('href');
var_dump($url, $a->nextSibling->nodeValue);
}
Here nextSibling is the text node that follows the <a> tag, so nextSibling->nodeValue will be " : Description". You'll have to strip the leading spaces and the colon, for example with trim.
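A self-contained sketch of the same idea, assuming each <li> wraps a <span> that holds a link followed by " : Description". The href values below are made up for illustration. trim() is given a character list so that both the spaces and the colon are removed.

```php
<?php
// Assumed markup: <li><span><a href="...">name</a> : Description</span></li>
$urlContent = '<html><body><ul>'
    . '<li><span><a href="https://site1.example/">site1.com</a> : Description 1</span></li>'
    . '<li><span><a href="https://site2.example/">site2.com</a> : Description 2</span></li>'
    . '</ul></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($urlContent);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query('//li/span/a[@href]') as $a) {
    // The text node right after the link holds " : Description N".
    $next = $a->nextSibling;
    $description = $next !== null ? trim($next->nodeValue, " :\t\n\r") : '';
    $results[] = ['url' => $a->getAttribute('href'), 'description' => $description];
}

foreach ($results as $r) {
    echo $r['url'], ' : ', $r['description'], PHP_EOL;
}
```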

How to parse url with DOMparser using getNamedItem

I am trying to grab a URL with DOMDocument, but I am stuck at getNamedItem.
How can I solve this problem? What am I missing here? Any ideas are welcome!
$url = 'https://www.31sumai.com/search/area/kansai/result/?area=16,17,18';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
$class = $ptag->attributes->getNamedItem("class");
if ($class && $class->nodeValue == 'c-name') {
$main = $ptag->attributes->getNamedItem("href");
if ($main) {
$mainlink = $main->nodeValue;
}
}
}
var_dump($mainlink);
It's returning null, but I have already checked the website and there is a URL in that tag.
$url = 'https://lions-mansion.jp/area/kansai/';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
$class = $ptag->attributes->getNamedItem("class");
if ($class && $class->nodeValue == 'areapageDetailList_item_btn_hp') {
$links = $ptag->getElementsByTagName('a');
foreach ($links as $link) {
$hrefAttr = $link->attributes->getNamedItem("href");
if ($hrefAttr) {
$mainlink = $hrefAttr->nodeValue;
}
}
}
}
echo $mainlink;
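The difference from the question's code is that the href attribute belongs to the <a> inside the <p>, not to the <p> itself, which is why getNamedItem("href") on the <p> gives null. The same lookup can probably be written more compactly with XPath. This is only a sketch on made-up markup; the class name is taken from the code above and the URLs are placeholders.

```php
<?php
// Hypothetical markup; only the class name comes from the answer above.
$html = '<html><body>'
    . '<p class="areapageDetailList_item_btn_hp"><a href="https://example.com/property/1">Site</a></p>'
    . '<p class="other"><a href="https://example.com/ignored">Other</a></p>'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select the href attribute of every <a> directly inside a matching <p>.
foreach ($xpath->query('//p[@class="areapageDetailList_item_btn_hp"]/a/@href') as $attr) {
    $links[] = $attr->nodeValue;
}

print_r($links);
```

Note that @class="..." is an exact match on the whole attribute value, so a <p> with several classes would need a contains() test instead.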

PHP - Weird file_get_contents behavior

When I run the first code it works well. The echo works.
<?php
$html = file_get_contents('https://feedback.aliexpress.com/display/productEvaluation.htm?productId=32795887882&ownerMemberId=230515078&withPictures=true&i18n=true&Page=3');
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
echo "<br>";
}
?>
But when I try the following code and run it with the URL passed as a parameter, nothing is returned:
index.php?url=https://feedback.aliexpress.com/display/productEvaluation.htm?productId=32795887882&ownerMemberId=230515078&withPictures=true&i18n=true&Page=3
<?php
$html = file_get_contents($_GET["url"]);
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
echo "<br>";
}
?>
Anyone got any idea?
Update:
Probably not the best and cleanest solution, but it works :)
<?php
$url = urldecode($_GET['url']);
$ownerMemberId = urldecode($_GET['ownerMemberId']);
$withPictures = urldecode($_GET['withPictures']);
$page = urldecode($_GET['Page']);
$newurl = $url . "&ownerMemberId=" . $ownerMemberId .
"&withPictures=true&i18n=true&Page=" . $page;
$html = file_get_contents($newurl);
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo "<img src='";
echo $image->getAttribute('src');
echo "'>";
echo "<br>";
}
?>
Please decode the url parameter, since its value is itself another URL.
$url = urldecode($_GET['url']);
$html = file_get_contents($url);
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src');
echo "<br>";
}
Hope that works for you.
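One more thing that may be worth checking: if the target URL is put into the query string without encoding, every & inside it is treated as a separator of the outer query string, so $_GET["url"] only holds the part up to the first &. The sketch below uses parse_str() to mimic how PHP fills $_GET; encoding the value with http_build_query() (or urlencode()) when building the link is an assumption about how the script is called, not something shown in the question.

```php
<?php
// The target URL from the question, used here only as sample data.
$target = 'https://feedback.aliexpress.com/display/productEvaluation.htm'
    . '?productId=32795887882&ownerMemberId=230515078&withPictures=true&i18n=true&Page=3';

// Unencoded: the outer query string is split at every "&".
parse_str('url=' . $target, $broken);

// Encoded: the whole target survives as a single "url" parameter.
$query = http_build_query(['url' => $target]);
parse_str($query, $intact);

echo $broken['url'], PHP_EOL;   // truncated at the first "&"
echo $intact['url'], PHP_EOL;   // the full target URL
```

This also explains why the workaround in the update works: the missing pieces end up as separate parameters such as ownerMemberId and Page.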

PHP: how to get HTML content instead of textContent

In $var2 I want the HTML content. The paragraphs contain some <br> tags that textContent drops, and I want them to be included.
<?php
$url = "http://sms.hindijokes.co";
$html = file_get_contents($url);
$doc = new DOMDocument;
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>".$html."</body></html>");
$xpath = new DOMXPath($doc);
$query1 = "//h2[@class='entry-title']/a";
$query2 = "//div[@class='entry-content'][1]/p";
$entries1 = $xpath->query($query1);
$entries2 = $xpath->query($query2);
$var1 = $entries1->item(0)->textContent;
$var2 = $entries2->item(0)->textContent;
echo "$var1";
echo "<br>";
$f = $entries2->length;
for($i = 0; $i < $f; $i++){
echo $entries2->item($i)->textContent."\n";
}
?>
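One approach that may work here is DOMDocument::saveHTML() with a node argument, which serializes that node as markup instead of flattening it to text. The sketch below runs on a made-up snippet; the class name is copied from the question and the paragraph content is an assumption.

```php
<?php
// Sample markup standing in for the real page.
$html = '<html><body><div class="entry-content"><p>Line one<br>Line two</p></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$p = $xpath->query("//div[@class='entry-content'][1]/p")->item(0);

// Serialize each child node so the <p> wrapper itself is left out.
$inner = '';
foreach ($p->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $p->textContent, PHP_EOL;  // text only, the <br> is lost
echo $inner, PHP_EOL;           // markup of the children, <br> included
```

Calling $doc->saveHTML($p) instead would return the paragraph including its own <p> tags.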

PHP/DOMXpath/DOMDocument - Unable to parse specific links

Here is my code. It is complete, so you can copy and paste it to run a test:
<?php
$url = "http://www.sportsdirect.com/ladies/ladies-underwear";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$n = $xpath->query('//div[@class="s-producttext-top-wrapper"]');
$l = $xpath->query('//div[@class="s-producttext-top-wrapper"]/a');
$p = $xpath->query('//div[@class="s-largered"]');
$nl = $xpath->query('//a[@class="swipeNextClick NextLink"]');
$NextLink = $nl->item(0)->getAttribute("data-dcp");
$item = 0;
foreach ($n as $entry) {
$Name = $entry->nodeValue;
$Link = $l->item($item)->getAttribute("href");
$Price = $p->item($item)->nodeValue;
$Find = array('£');
$Replace = array('');
$Price = str_replace($Find, $Replace, $Price);
echo "Name: $Name - Link: $Link - Price: $Price - $NextLink<br>";
$item++;
}
?>
This parses all the products on the FIRST page of http://www.sportsdirect.com/ladies/ladies-underwear.
Here is the link for the second page: http://www.sportsdirect.com/ladies/ladies-underwear#dcp=2&dppp=100&OrderBy=rank
But when I execute this code to get all the products from the SECOND page:
<?php
$url = "http://www.sportsdirect.com/ladies/ladies-underwear#dcp=2&dppp=100&OrderBy=rank";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXpath($doc);
$n = $xpath->query('//div[@class="s-producttext-top-wrapper"]');
$l = $xpath->query('//div[@class="s-producttext-top-wrapper"]/a');
$p = $xpath->query('//div[@class="s-largered"]');
$nl = $xpath->query('//a[@class="swipeNextClick NextLink"]');
$NextLink = $nl->item(0)->getAttribute("data-dcp");
$item = 0;
foreach ($n as $entry) {
$Name = $entry->nodeValue;
$Link = $l->item($item)->getAttribute("href");
$Price = $p->item($item)->nodeValue;
$Find = array('£');
$Replace = array('');
$Price = str_replace($Find, $Replace, $Price);
echo "Name: $Name - Link: $Link - Price: $Price - $NextLink<br>";
$item++;
}
?>
I still get the products from the FIRST page. Why?
How can I parse all the products from page 2? Where is my mistake?
Can you please help me out?
Thanks in advance!
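A likely explanation is the # in the second URL. Everything after # is the fragment, and the fragment is not sent to the server as part of an HTTP request, so both URLs fetch the same document. The page numbers are presumably applied by JavaScript in the browser, which loadHTMLFile() does not run. The sketch below only uses parse_url() to show which part of the URL would reach the server; it does not fetch anything.

```php
<?php
$url = 'http://www.sportsdirect.com/ladies/ladies-underwear#dcp=2&dppp=100&OrderBy=rank';
$parts = parse_url($url);

// What the server is asked for: path plus query string, never the fragment.
$requestTarget = $parts['path'] . (isset($parts['query']) ? '?' . $parts['query'] : '');

// The fragment stays on the client; it is decoded here only to inspect it.
parse_str($parts['fragment'] ?? '', $fragmentParams);

echo $requestTarget, PHP_EOL;
print_r($fragmentParams);
```

To get the second page you would need to find the request the page's script makes for it (for example in the browser's network tab) and fetch that URL instead; what that request looks like is not something this sketch can tell you.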
