I'm working on a simple app that scans an array of websites. What I'm trying to do is save the URLs found on each site in an array, then put that array in another array. My problem is that only the result of the first domain in the array is being displayed (sorry, my earlier observation was wrong).
<?php
$arrDomains = array('http://example1.com/', 'http://example2.com/');
$arrExternals = array();
for($i = 0; $i < count($arrDomains); $i++){
$domain = test_input($arrDomains[$i]);
$domain = filter_var($domain, FILTER_SANITIZE_URL);
// START HERE
$html = file_get_contents($domain);
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$external = array();
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
if (strpos($url, 'mailto') === false) { // exclude emails
if (!in_array($url, $external)) {
array_push($external, $url);
}
}
}
}
array_push($arrExternals, $external);
}
?>
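You need to rename the variable $i in the inner loop, because it overwrites the $i of the outer for loop. Once the inner loop finishes, $i holds the number of links found, so the outer loop typically ends after the first domain. I changed the inner $i to $j: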
$arrDomains = array('http://example1.com/', 'http://example2.com/');
$arrExternals = array();
for($i = 0; $i < count($arrDomains); $i++){
$domain = test_input($arrDomains[$i]);
$domain = filter_var($domain, FILTER_SANITIZE_URL);
// START HERE
$html = file_get_contents($domain);
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$external = array();
for ($j = 0; $j < $hrefs->length; $j++) {
$href = $hrefs->item($j);
$url = $href->getAttribute('href');
if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
if (strpos($url, 'mailto') === false) { // exclude emails
if (!in_array($url, $external)) {
array_push($external, $url);
}
}
}
}
array_push($arrExternals, $external);
}
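As a hedged alternative to renaming the counter, the same logic can be written with foreach, which leaves no index variable to collide. This is only a sketch: the helper name collectExternalLinks and the inline sample pages are assumptions made for illustration, and the sample strings stand in for the file_get_contents() calls.

```php
<?php
// Hypothetical helper: returns the unique absolute links found in an HTML string.
function collectExternalLinks(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings caused by malformed real-world HTML.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $external = array();
    foreach ($xpath->query('/html/body//a') as $a) {
        $url = $a->getAttribute('href');
        if (filter_var($url, FILTER_VALIDATE_URL) !== false
            && strpos($url, 'mailto') === false   // exclude emails
            && !in_array($url, $external, true)) {
            $external[] = $url;
        }
    }
    return $external;
}

// Sample pages (assumed) stand in for file_get_contents($domain).
$pages = array(
    '<html><body><a href="http://a.example/x">x</a><a href="mailto:me@a.example">m</a></body></html>',
    '<html><body><a href="http://b.example/y">y</a><a href="http://b.example/y">again</a></body></html>',
);

$arrExternals = array();
foreach ($pages as $html) {
    $arrExternals[] = collectExternalLinks($html);
}
print_r($arrExternals);
```

Each entry of $arrExternals then holds the links of one page, in the same order as the input list.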
I have a list of links on one page:
<li><span>site1.com : Description 1</span></li>
<li><span>site2.com : Description 2</span></li>
<li><span>site3.com : Description 3</span></li>
<li><span>site4.com : Description 4</span></li>
I'm using php to take the links from one page and display them on another as such:
<?php
$urlContent = file_get_contents('https://www.example.com/');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.'<br />';
}
}
?>
However, what I'm trying to figure out is how to include the description next to the link.
Here is one of my many attempts:
<?php
$urlContent = file_get_contents('https://www.example.com');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/li");
$li = document.getElementsByTagName("li");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.' : '.$li.' <br />';
}
}
?>
The first part works great but everything I have tried to add the description has failed.
Here's a simple example based on the current markup:
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$lis = $xpath->evaluate("/html/body/li");
foreach ($lis as $li) {
$a = $xpath->evaluate("span/a", $li)->item(0);
$url = $a->getAttribute('href');
var_dump($url, $a->nextSibling->nodeValue);
}
Here nextSibling is the text node that follows the <a> tag, so nextSibling->nodeValue will be " : Description". You will have to strip the surrounding spaces and the colon, for example with trim.
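The trimming step could look roughly like this. It is a sketch that assumes each <li> wraps an <a> element inside the <span>, directly followed by the " : Description" text; the sample markup, its href values and the $links array are made up for illustration.

```php
<?php
// Assumed markup: each <li> wraps an <a> followed by " : Description N".
$urlContent = '<html><body><ul>'
    . '<li><span><a href="http://site1.example/">site1</a> : Description 1</span></li>'
    . '<li><span><a href="http://site2.example/">site2</a> : Description 2</span></li>'
    . '</ul></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);

$links = array();
foreach ($xpath->query('//li/span/a') as $a) {
    // The node right after <a> holds " : Description N" (null if nothing follows).
    $text = $a->nextSibling !== null ? $a->nextSibling->nodeValue : '';
    // Strip whitespace and the colon from both ends of the text.
    $links[$a->getAttribute('href')] = trim($text, " :\t\n\r");
}
print_r($links);
```

Passing a character list as the second argument makes trim remove the colon together with the surrounding whitespace in one call.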
I have some XPath code that loops over HTML looking for a-tags and retrieves the href, the rel attribute and the anchor text. But I can't determine whether the anchor text is an img tag, and if it is, can I get the alt attribute?
For finding links and retrieving information about them:
$dom = new \DOMDocument();
@$dom->loadHTML($html);
$xpath = new \DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
//$img = $href->evaluate("img");
$url = $href->getAttribute('href');
$rel = $href->getAttribute('rel');
$anchortext=$href->nodeValue;
}
The above works fine, but I cannot figure out how to determine whether the anchor text is an image or not, and if it is, how to retrieve the alt attribute.
You can keep using XPath as you do to retrieve the links, then check each link's child nodes:
$dom = new \DOMDocument();
@$dom->loadHTML('<html><body><img src="img.png">sdqsdsdq');
$xpath = new \DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body/a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
//$img = $href->evaluate("img");
$url = $href->getAttribute('href');
$rel = $href->getAttribute('rel');
$anchortext=$href->nodeValue;
// get images
$nodes = $href->childNodes;
$contentAnImage = 0;
$images = array();
foreach ($nodes as $node) {
if ($node->nodeName == 'img'){
$contentAnImage = 1;
// if you want the image src:
$images[] = $node->getAttribute('src');
}
}
}
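Since the question asks for the alt text rather than the src, here is a hedged sketch of one way to collect it, using a relative XPath query so that images nested deeper inside the link are found as well. The sample markup and the keys of $result are assumptions for illustration only.

```php
<?php
// Assumed sample markup with one image link and one text link.
$html = '<html><body>'
    . '<a href="/a" rel="nofollow"><img src="img.png" alt="Logo"></a>'
    . '<a href="/b">Plain text</a>'
    . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = array();
foreach ($xpath->query('/html/body//a') as $a) {
    $alts = array();
    // ".//img" is evaluated relative to the current <a> element.
    foreach ($xpath->query('.//img', $a) as $img) {
        $alts[] = $img->getAttribute('alt');
    }
    $result[] = array(
        'url'      => $a->getAttribute('href'),
        'rel'      => $a->getAttribute('rel'),
        'text'     => trim($a->nodeValue),
        'hasImage' => count($alts) > 0,
        'alts'     => $alts,
    );
}
print_r($result);
```

getAttribute returns an empty string when an attribute is missing, so an image without alt shows up as '' in the alts list.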
I found this code here
<?php
$urlContent = file_get_contents('https://www.google.co.il/searchq=cow&rlz=1C1SQJL_iwIL827IL82&source=lnms&tbm=isch&sa=X&ved=0ahUKEwje7-3q8uPiAhUG_qQKHdWAACwQ_AUIECgB&biw=1280&bih=578');
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = filter_var($url, FILTER_SANITIZE_URL);
// validate url
if(!filter_var($url, FILTER_VALIDATE_URL) === false){
echo ''.$url.'<br />';
}
}
?>
I do not understand why, when I run it, it returns only the links of the page and not the links of the images.
For all my crawlers I use this class https://simplehtmldom.sourceforge.io/
Try it.
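For what it's worth, the code in the question only queries <a> elements and reads their href, which would explain why no image URLs are printed. A minimal sketch with plain DOMXPath, assuming the images are present as <img> tags in the fetched HTML (images injected later by JavaScript would not be there), might look like this. The sample page is made up.

```php
<?php
// Assumed sample page; a real page would come from file_get_contents().
$urlContent = '<html><body>'
    . '<a href="https://www.example.com/page">page</a>'
    . '<img src="https://www.example.com/cow.jpg" alt="cow">'
    . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);

$imageUrls = array();
// Select <img> elements instead of <a> elements and read src instead of href.
foreach ($xpath->query('/html/body//img') as $img) {
    $src = filter_var($img->getAttribute('src'), FILTER_SANITIZE_URL);
    if (filter_var($src, FILTER_VALIDATE_URL) !== false) {
        $imageUrls[] = $src;
    }
}
print_r($imageUrls);
```

Relative src values such as "/img/cow.jpg" fail FILTER_VALIDATE_URL, so they would need to be resolved against the page URL first.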
Is it possible to convert just a selected part of an HTML document with multiple tables to JSON?
I have this Table:
<div class="mon_title">2.11.2015 Montag</div>
<table class="info" >
<tr class="info"><th class="info" align="center" colspan="2">Nachrichten zum Tag</th></tr>
<tr class='info'><td class='info' colspan="2"><b><u></u> </b>
...
</table>
<p>
<table class="mon_list" >
...
</table>
And this PHP code to convert it into JSON:
function save_table_to_json ( $in_file, $out_file ) {
$html = file_get_contents( $in_file );
file_put_contents( $out_file, convert_table_to_json( $html ) );
}
function convert_table_to_json ( $html ) {
$document = new DOMDocument();
$document->loadHTML( $html );
$obj = [];
$jsonObj = [];
$th = $document->getElementsByTagName('th');
$td = $document->getElementsByTagName('td');
$thNum = $th->length;
$arrLength = $td->length;
$rowIx = 0;
for ( $i = 0 ; $i < $arrLength ; $i++){
$head = $th->item( $i%$thNum )->textContent;
$content = $td->item( $i )->textContent;
$obj[ $head ] = $content;
if( ($i+1) % $thNum === 0){
$jsonObj[++$rowIx] = $obj;
$obj = [];
}
}
return json_encode( $jsonObj );
}
save_table_to_json( 'heute_S.htm', 'heute_S.json' );
It takes both the table with class=info and the table with class=mon_list and converts them to JSON.
Is there any way to make it take only the table with class=mon_list?
You can use XPath to search for the class, and then create a new DOM document that only contains the results of the XPath query. This is untested, but should get you on the right track.
It's also worth mentioning that you can use foreach to iterate over the node list.
$document = new DOMDocument();
$document->loadHTML( $html );
$xpath = new DomXPath($document);
$tables = $xpath->query("//*[contains(@class, 'mon_list')]");
$tableDom = new DomDocument();
$tableDom->appendChild($tableDom->importNode($tables->item(0), true));
$obj = [];
$jsonObj = [];
$th = $tableDom->getElementsByTagName('th');
$td = $tableDom->getElementsByTagName('td');
$thNum = $th->length;
$arrLength = $td->length;
$rowIx = 0;
for ( $i = 0 ; $i < $arrLength ; $i++){
$head = $th->item( $i%$thNum )->textContent;
$content = $td->item( $i )->textContent;
$obj[ $head ] = $content;
if( ($i+1) % $thNum === 0){
$jsonObj[++$rowIx] = $obj;
$obj = [];
}
}
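A variant of the same idea, sketched under a few assumptions: it skips the second DOM document and runs the th/td queries relative to the matched table node, and it returns a JSON array of rows instead of the 1-based keys used above. The function name convertTableByClassToJson and the sample tables are hypothetical, and $class is assumed to be a trusted, simple class name because it is interpolated into the XPath expression.

```php
<?php
// Hypothetical variant of convert_table_to_json() limited to one table class.
function convertTableByClassToJson(string $html, string $class): string
{
    $document = new DOMDocument();
    @$document->loadHTML($html);
    $xpath = new DOMXPath($document);

    // Match $class as a whole word within the class attribute.
    $table = $xpath->query(
        "//table[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]"
    )->item(0);
    if ($table === null) {
        return json_encode(array());
    }

    // Header cells of the selected table only.
    $heads = array();
    foreach ($xpath->query('.//th', $table) as $th) {
        $heads[] = trim($th->textContent);
    }

    // One associative array per data row (rows that contain <td> cells).
    $rows = array();
    foreach ($xpath->query('.//tr[td]', $table) as $tr) {
        $row = array();
        $i = 0;
        foreach ($xpath->query('td', $tr) as $td) {
            $row[$heads[$i] ?? $i] = trim($td->textContent);
            $i++;
        }
        $rows[] = $row;
    }
    return json_encode($rows);
}

// Assumed sample input with two tables; only mon_list should be converted.
$html = '<table class="info"><tr><th>News</th></tr><tr><td>ignored</td></tr></table>'
    . '<table class="mon_list"><tr><th>Class</th><th>Room</th></tr>'
    . '<tr><td>5a</td><td>101</td></tr><tr><td>6b</td><td>202</td></tr></table>';

echo convertTableByClassToJson($html, 'mon_list');
```

The concat/normalize-space test avoids matching class names that merely contain the string, which plain contains(@class, ...) would accept.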
Another, unrelated approach is to use getAttribute() to check the class name. Someone in a different answer has written a function for doing this:
function getElementsByClass(&$parentNode, $tagName, $className) {
$nodes=array();
$childNodeList = $parentNode->getElementsByTagName($tagName);
for ($i = 0; $i < $childNodeList->length; $i++) {
$temp = $childNodeList->item($i);
if (stripos($temp->getAttribute('class'), $className) !== false) {
$nodes[]=$temp;
}
}
return $nodes;
}
How can I get all the attributes of an element? In my example below I can only get one at a time, but I want to pull out all of the anchor tag's attributes.
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.example.com');
$a = $dom->getElementsByTagName("a");
echo $a->getAttribute('href');
thanks!
$length = $a->attributes->length;
$attrs = array();
for ($i = 0; $i < $length; ++$i) {
$name = $a->attributes->item($i)->name;
$value = $a->getAttribute($name);
$attrs[$name] = $value;
}
print_r($attrs);
"Inspired" by Simon's answer. I think you can cut out the getAttribute call, so here's a solution without it:
$attrs = array();
for ($i = 0; $i < $a->attributes->length; ++$i) {
$node = $a->attributes->item($i);
$attrs[$node->nodeName] = $node->nodeValue;
}
var_dump($attrs);
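The same loop can also be written with foreach, since the attributes collection is traversable. This sketch assumes $a is a single element taken from the node list with item(0); the sample markup is made up for illustration.

```php
<?php
$dom = new DOMDocument();
// Assumed sample markup in place of a fetched page.
@$dom->loadHTML('<html><body><a href="/home" class="nav" title="Home">Home</a></body></html>');

// getElementsByTagName() returns a list, so pick one element from it.
$a = $dom->getElementsByTagName('a')->item(0);

$attrs = array();
// DOMNamedNodeMap is traversable; each entry is a DOMAttr node.
foreach ($a->attributes as $attr) {
    $attrs[$attr->nodeName] = $attr->nodeValue;
}
print_r($attrs);
```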
$a = $dom->getElementsByTagName("a");
foreach($a as $element)
{
echo $element->getAttribute('href');
}
$html = $data['html'];
$class = ''; // initialise so the concatenation below does not raise a warning
if(!empty($html)){
$doc = new DOMDocument();
$doc->loadHTML($html);
// collect the class attribute of every <input> element
$datadom = $doc->getElementsByTagName("input");
foreach($datadom as $element)
{
$class .= " ".$element->getAttribute('class');
}
}