I'm new to programming, so here is my issue. I am trying to build a recursive PHP spider using Simple HTML DOM Parser that crawls a given website and returns a list of pages, including those answering with 2xx, 3xx, 4xx and 5xx status codes. I've been searching for a solution for several days but (maybe due to my limited experience) I haven't found anything that works. My current code finds all the links on the root/index page; however, I would like to find the links inside those previously found pages recursively, for example down to level 5. Assuming the root page is level 0, the recursive function I wrote only shows me the level 1 links, repeated 5 times. Any help appreciated. Thanks.
<?php
echo "<strong><h1>Sitemap</h1></strong><br>";
include_once('simple_html_dom.php');
$url = "http://www.gnet.it/";
$html = new simple_html_dom();
$html->load_file($url);
echo "<strong><h2>Int Links</h2></strong><br>";
foreach($html->find("a") as $a)
{
if((!(preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
{
echo "<strong>" . $a->href . "</strong><br>";
}
}
echo "<strong><h2>Ext Links</h2></strong><br>";
foreach($html->find("a") as $a)
{
if(((preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
{
echo "<strong>" . $a->href . "</strong><br>";
}
}
//recursion
$depth = 1;
$maxDepth = 5;
$recurl = "$a->href";
$rechtml = new simple_html_dom();
$rechtml->load_file($recurl);
while($depth <= $maxDepth){
echo "<strong><h2>Nested links level $depth</h2></strong><br>";
foreach($rechtml->find("a") as $a)
{
if(($a->href != null))
{
echo "<strong>" . $a->href . "</strong><br>";
}
}
$depth++;
}
//csv
echo "<strong><h1>Google Crawl Errors from CSV</h1></strong><br>";
echo "<table>\n\n";
$f = fopen("CrawlErrors.csv", "r");
while (($line = fgetcsv($f)) !== false) {
echo "<tr>";
foreach ($line as $cell) {
echo "<td>" . htmlspecialchars($cell) . "</td>";
}
echo "</tr>\n";
}
fclose($f);
echo "\n</table>";
?>
Try this:
I call this routine in a basic scraper to recursively find all of the links across the site. You'll have to put in some logic to prevent it from crawling external sites that are linked to from pages on your site, else you'll be running forever!
Note, I did get the majority of this code from another SO thread a while back, so the answers are out there.
function crawl_page($url, $depth = 2){
// strip trailing slash from URL
if(substr($url, -1) == '/') {
$url= substr($url, 0, -1);
}
// which URLs have we already crawled?
static $seen = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
// @ suppresses the warnings DOMDocument raises on malformed HTML
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
// build the URLs to the same standard - with http:// etc
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
crawl_page($href, $depth - 1);
}
// pull out the actual page name without any parent dirs
$pos = strrpos($url, '/');
$slug = $pos === false ? "root" : substr($url, $pos + 1);
echo "slug:" . $slug . "<br>";
}
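The "logic to prevent it from crawling external sites" mentioned above could look roughly like the following. This is only a sketch: `is_same_host()` is a hypothetical helper name that is not part of the original answer, and treating `www.example.com` and `example.com` as the same site is an assumption you may not want.

```php
<?php
// Hypothetical helper (not part of the original answer): returns true when
// $href points at the same host as $baseUrl, so other links can be skipped.
function is_same_host($href, $baseUrl)
{
    $hrefHost = parse_url($href, PHP_URL_HOST);
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    if (!is_string($hrefHost) || !is_string($baseHost)) {
        return false; // malformed or relative URL: there is no host to compare
    }
    // Assumption: "www.example.com" and "example.com" count as the same site.
    $strip = function ($host) {
        $host = strtolower($host);
        return strpos($host, 'www.') === 0 ? substr($host, 4) : $host;
    };
    return $strip($hrefHost) === $strip($baseHost);
}
```

Inside crawl_page(), $href is already absolute by the time the recursive call is made, so that call can be wrapped in `if (is_same_host($href, $url)) { ... }`.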
Good day. Today I am asking for help with a "crawler application" that needs to do the following:
save its results in a database
list on the front end all the links of any given website, indexed one below the other
if the crawler's results are too long, move to the next page by incrementing the last number in the URL, then collect the links again as described below
return the links that have an «&id=» in their URL first, then the others
This is a task for a real expert :D
This is my code so far:
<?php
function crawl_page($url, $depth = 5)
{
if (!isset($url) || $depth == 0) {
return;
}
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= dirname($parts['path'], 1).$path;
}
}
crawl_page($href, $depth - 1);
}
echo "URL:".$url."<br />";
}
crawl_page("http://www.pizza.com/", 2);
Okay, I've been searching for a way to list directories and files, which I've figured out and am utilizing code I found here on StackOverflow (Listing all the folders subfolders and files in a directory using php).
So far I've altered code found in one of the answers. I've been able to remove file extensions from both the path and the file name using preg_replace, capitalize the file names using ucwords, and switch out dashes for spaces using str_replace.
What I'm having trouble with now is wrapping the whole thing in a properly nested HTML list. I've managed to set it up so it's wrapped in a list, but it doesn't use nested lists where needed and I can't, for the life of me, figure out how to capitalize the directory names or replace any dashes within the directory name.
So, the questions are, if anyone would be so kind:
How do I wrap the output in properly nested lists?
How do I capitalize directory names while removing the preceding slash and replace dashes or underscores with spaces?
I've left the | within the $ss variable intentionally. I use it as a marker of sorts when I want to throw in characters that will identify where it shows up during trial and error (example $ss = $ss . "<li>workingOrNot").
I'm using:
<?php
$pathLen = 0;
function prePad($level) {
$ss = "";
for ($ii = 0; $ii < $level; $ii++) {
$ss = $ss . "| ";
}
return $ss;
}
function dirScanner($dir, $level, $rootLen) {
global $pathLen;
$filesHidden = array(".", "..", '.htaccess', 'resources', 'browserconfig.xml', 'scripts', 'articles');
if ($handle = opendir($dir)) {
$fileList = array();
while (false !== ($entry = readdir($handle))) {
if ($entry != "." && $entry != ".." && !in_array($entry, $filesHidden)) {
if (is_dir($dir . "/" . $entry)) {
$fileList[] = "F: " . $dir . "/" . $entry;
}
else {
$fileList[] = "D: " . $dir . "/" . $entry;
}
}
}
closedir($handle);
natsort($fileList);
foreach($fileList as $value) {
$displayName = ucwords ( str_replace("-", " ", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $rootLen + 4)));
$filePath = substr($value, 3);
$linkPath = str_replace(" ", "%20", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $pathLen + 3));
if (is_dir($filePath)) {
echo prePad($level) . "<li>" . $linkPath . "</li>\n";
dirScanner($filePath, $level + 1, strlen($filePath));
} else {
echo "<li>" . prePad($level) . "" . $displayName . "</li>\n";
}
}
}
}
I feel like these answers should be simple, so maybe I've been staring at it too much the last two days or maybe it has become Frankenstein code.
I'm about out of trial and error and I need help.
foreach($fileList as $value) {
$displayName = ucwords ( str_replace("-", " ", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $rootLen + 4)));
$filePath = substr($value, 3);
$linkPath = str_replace(" ", "%20", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $pathLen + 3));
if (is_dir($filePath)) {
// Do not close <li> yet; open a nested <ul> instead
echo prePad($level) . "<li>" . $linkPath . "<ul>\n";
dirScanner($filePath, $level + 1, strlen($filePath));
// Close the nested <ul> first, then the <li>
echo "</ul></li>\n";
} else {
echo "<li>" . prePad($level) . "" . $displayName . "</li>\n";
}
}
I assume you're opening the main <ul> before calling the function and closing it at the end.
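For the second question (capitalizing directory names, dropping the preceding slash and replacing dashes or underscores with spaces), here is a minimal sketch that is independent of dirScanner(). Both `prettyName()` and `renderTree()` are hypothetical names, and the input is assumed to be a nested array in which a directory maps to an array of its children and a file maps to null.

```php
<?php
// Hypothetical helper: "/my-folder_name" becomes "My Folder Name".
function prettyName($name)
{
    $name = ltrim($name, '/');
    return ucwords(str_replace(array('-', '_'), ' ', $name));
}

// Renders array('dir' => array(...children...), 'file' => null) as nested lists.
function renderTree(array $tree)
{
    $html = "<ul>";
    foreach ($tree as $name => $children) {
        $html .= "<li>" . htmlspecialchars(prettyName($name));
        if (is_array($children)) {
            $html .= renderTree($children); // the child list sits inside the <li>
        }
        $html .= "</li>";
    }
    return $html . "</ul>";
}
```

The key point is the same as in the answer: the nested <ul> is emitted before the parent <li> is closed.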
I'm trying to implement a function that recursively follows header redirects. The code below works correctly for the first request, but if a Location redirect is found in the headers, get_headers does not return any result for the redirected location. I would like to display the headers for every request in the chain.
This is what I have done.
function redirectURL($domain) {
$newLocation = '';
$domain = str_replace("\r", "", $domain);
$headers=get_headers($domain);
echo "<ul class='list-group' >";
print "<li class='list-group-item'>".$domain. "</li>";
foreach($headers as $k=>$v){
print "<li class='list-group-item'>".$k . ": " . $v . "</li>";
if(strstr($v, 'Location')){
$location = explode(":",$v);
$newLocation = $location[1].":".$location[2];
}
}
echo "</ul>";
if($newLocation != $domainName && $newLocation != ''){
redirectURL($newLocation);
}
unset($headers);
return true;
}
Any ideas? I have an online implementation ... if you need to see working code.
Thank you
OK, it was just bad coding. I've got it working.
This is the working code:
function redirectURL($domainName, $i = 0) {
$newLocation = '';
$isNew = false;
$headers = array();
$domainName = str_replace("\r", "", $domainName);
// the second argument (1) makes get_headers() return an associative array
$headers=get_headers($domainName,1);
echo "<ul class='list-group' >";
print "<li class='list-group-item'><strong>".$domainName. "</strong></li>";
foreach($headers as $k => $v){
print "<li class='list-group-item'>".$k . ": " . $v . "</li>";
if($k == 'Location'){
$newLocation = $v;
$isNew = true;
print "<li class='list-group-item'><strong>".$k . ": " . $v . "</strong></li>";
}
}
echo "</ul>";
unset($headers);
//limit recursion to $i < 4 to avoid overload; the counter is passed as an
//argument because a local $i would be reset to 0 on every call
if($isNew){
$i++;
if($i<4) {redirectURL($newLocation, $i);}
}
return true;
}
You can check the working script at https://www.neting.it/risorse-internet/controlla-redirect-server.html
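One detail of the question's code is worth spelling out: `explode(":", $v)` splits the URL at every colon, so a target that carries a port is cut short, and `strstr($v, 'Location')` misses a lower-case `location:` header. The sketch below reads the status code and the redirect target of the first response from the flat list that `get_headers($url)` returns. `parse_redirect()` is a hypothetical name, and the sketch assumes the list starts with a status line and that every further hop starts with another one.

```php
<?php
// Sketch: extract the status code and redirect target of the FIRST response
// from a flat header list (one "Name: value" string per entry).
function parse_redirect(array $headerLines)
{
    $result = array('status' => null, 'location' => null);
    foreach ($headerLines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            if ($result['status'] !== null) {
                break; // the headers of the next hop start here
            }
            $result['status'] = (int) $m[1];
        } elseif (stripos($line, 'Location:') === 0) {
            // Split on the first colon only, so "https://host:8080/" stays intact.
            $parts = explode(':', $line, 2);
            $result['location'] = trim($parts[1]);
        }
    }
    return $result;
}
```

The returned status also gives the 2xx/3xx/4xx/5xx class directly, for example with `intdiv($result['status'], 100)` on PHP 7 or later.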
I am creating a file manager in CakePHP 2.x and need to show a tree list of the '/uploads/' directory and its sub-directories.
I have the code below in plain PHP:
//show recursive directory tree
function print_tree($dir = '.') {
global $root_path;
echo '<ul class="dirlist">';
$d = opendir($dir);
while($f = readdir($d)) {
if(strpos($f, '.') === 0) continue;
$ff = $dir . '/' . $f;
if(is_dir($ff)) {
echo '<li>' . $f . '';
print_tree($ff);
echo '</li>';
}
}
echo '</ul>';
}
But I need a CakePHP version of the code above.
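Since CakePHP 2.x is plain PHP, one possible starting point is the same traversal rewritten to return the markup instead of echoing it, so that it can be called from a helper or controller and handed to a view. This is only a sketch: `build_tree()` is a hypothetical name, where the function should live in the application is left open, and it uses PHP's own scandir() rather than CakePHP's Folder utility class.

```php
<?php
// Sketch: same traversal as print_tree(), but the markup is returned, not echoed.
function build_tree($dir)
{
    $html = '<ul class="dirlist">';
    $entries = scandir($dir); // sorted alphabetically by default
    if ($entries === false) {
        return $html . '</ul>'; // unreadable directory: render an empty list
    }
    foreach ($entries as $f) {
        if (strpos($f, '.') === 0) {
            continue; // skip ".", ".." and hidden entries
        }
        $ff = $dir . '/' . $f;
        if (is_dir($ff)) {
            $html .= '<li>' . htmlspecialchars($f) . build_tree($ff) . '</li>';
        }
    }
    return $html . '</ul>';
}
```

Like the original, this lists directories only; files are ignored.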
I have a list of directory names and need to take the first letter of each name and display it once, before the start of that letter's group, i.e.:
what I have:
1
2
3
4
5
Aberdeen
Arundel
Aberyswith
Bath
Bristol
Brighton
Cardiff
coventry
what I would like:
#
1
2
3
4
5
A
Aberdeen
Arundel
Aberyswith
B
Bath
Bristol
Brighton
C
Cardiff
coventry
function htmlDirList($subdirs) {
global $z_self, $z_img_play, $z_img_lofi, $z_img_more, $z_admin,
$z_img_down, $z_img_new, $zc;
$now = time();
$diff = $zc['new_time']*60*60*24;
$num = 0;
$dir_list_len = $zc['dir_list_len'];
if ($zc['low']) { $dir_list_len -= 2; }
$html = "";
$checkbox = ($z_admin || ($zc['playlists'] && $zc['session_pls']));
/**/
$row = 0;
$items = sizeof($subdirs);
$cat_cols = "2";
$rows_in_col = ceil($items/$cat_cols);
if ($rows_in_col < $cat_cols) { $cat_cols = ceil($items/$rows_in_col); }
$col_width = round(100 / $cat_cols);
$html = "<table width='600'><tr>";
$i = 0;
/**/
foreach ($subdirs as $subdir => $opts) {
if ($row == 0) {
$class = ($cat_cols != ++$i) ? ' class="z_artistcols"' : '';
$html .= "<td $class valign='top' nowrap='nowrap' width='$col_width%'>";
}
/*$currentleter = substr($opts, 0 , 1);
if($lastletter != $currentleter){
echo $currentleter;
$lastletter = $currentleter;
}*/
if($alphabet != substr($opts,0,1)) {
echo strtoupper(substr($opts,0,1)); // add your html formatting too.
$alphabet = substr($opts,0,1);
}
$dir_len = $dir_list_len;
$dir = false;
$image = $opts['image'];
$new_beg = $new_end = "";
if (substr($subdir, -1) == "/") {
$dir = true;
$subdir = substr($subdir, 0, -1);
}
$path_raw = getURLencodedPath($subdir);
$href = "<a href='$path_raw";
if (!$dir) {
if ($zc['download'] && $zc['cmp_sel']) { $html .= "$href/.lp&l=8&m=9&c=0'>$z_img_down</a> "; }
if ($zc['play']) { $html .= "$href&l=8&m=0'>$z_img_play</a> "; }
if ($zc['low'] && ($zc['resample'] || $opts['lofi'])) { $html .= "$href&l=8&m=0&lf=true'>$z_img_lofi</a> "; }
if ($checkbox) { $html .= "<input type='checkbox' name='mp3s[]' value='$path_raw/.lp'/> "; }
$num++;
if ($zc['new_highlight'] && isset($opts['mtime']) && ($now - $opts['mtime'] < $diff)) {
$dir_len -= 5;
if ($z_img_new) {
$new_end = $z_img_new;
} else {
$new_beg = $zc['new_beg'];
$new_end = $zc['new_end'];
}
}
}
$title = formatTitle(basename($subdir));
if (strlen($title) > $dir_len) {
$ht = " title=\"$title.\"";
$title = substr($title,0,$dir_len).$opts['mtime']."...";
} else {
$ht = "";
}
if ($zc['dir_list_year']) {
$di = getDirInfo($subdir);
if (!empty($di['year'])) $title .= " (".$di['year'].")";
}
$html .= "$href'$ht>$new_beg$title$new_end</a><br />";
$row = ++$row % $rows_in_col;
if ($row == 0) { $html .= "</td>"; }
}
if ($row != 0) $html .= "</td>";
$html .= "</tr></table>";
$arr['num'] = $num;
$arr['list'] = $html;
return $arr;
}
I need help getting this to work.
The following will display the list of directories, beginning each group with a first letter as beginning of the group (see codepad for proof):
(this assumes $dirs is array containing the names)
$cur_let = null;
foreach ($dirs as $dir) {
if ($cur_let !== strtoupper(substr($dir,0,1))){
$cur_let = strtoupper(substr($dir,0,1));
echo $cur_let."\n";
}
echo $dir . "\n";
}
You just need to add some formatting on your own, suited to your needs.
Edit:
A version that groups entries beginning with a number under the # sign can look like this:
$cur_let = null;
foreach ($dirs as $dir) {
$first_let = (is_numeric(strtoupper(substr($dir,0,1))) ? '#' : strtoupper(substr($dir,0,1)));
if ($cur_let !== $first_let){
$cur_let = $first_let;
echo $cur_let."\n";
}
echo $dir . "\n";
}
Please see codepad as a proof.
Is this what you are looking for?
<?php
$places = array(
'Aberdeen',
'Arundel',
'Aberyswith',
'Bath',
'Bristol',
'Brighton',
'Cardiff',
'coventry'
);
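// start with no group so the first heading is always printed,
// even when the first name begins with a lower-case letter
$first_letter = null;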
foreach($places as $p)
{
if(strtolower($p[0])!=$first_letter)
{
echo "<b>" . strtoupper($p[0]) . "</b><br/>";
$first_letter = strtolower($p[0]);
}
echo $p . "<br/>";
}
?>
Prints:
A
Aberdeen
Arundel
Aberyswith
B
Bath
Bristol
Brighton
C
Cardiff
coventry
My approach would be to generate a second array that associates the first letter to the array of names that begin with that letter.
$dirs; // assumed this contains your array of names
$groupedDirs = array();
foreach ($dirs as $dir) {
$firstLetter = strtoupper($dir[0]);
$groupedDirs[$firstLetter][] = $dir;
}
Then, you can iterate on $groupedDirs to print out the list.
<?php foreach ($groupedDirs as $group => $dirs): ?>
<?php echo $group; ?>
<?php foreach ($dirs as $dir): ?>
<?php echo $dir; ?>
<?php endforeach; ?>
<?php endforeach; ?>
This allows for a clean separation between two separate tasks: figuring out what the groups are and, secondly, displaying the grouped list. By keeping these tasks separate, not only is the code for each one clearer, but you can reuse either part for different circumstances.
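The grouping step can also collect names that start with a digit under "#", as the question asks. A minimal sketch, where `groupByFirstLetter()` is a hypothetical name and single-byte (ASCII) first characters are assumed:

```php
<?php
// Sketch of the grouping step; names starting with a digit go under "#".
function groupByFirstLetter(array $names)
{
    $groups = array();
    foreach ($names as $name) {
        $first = strtoupper(substr($name, 0, 1));
        $key = ctype_digit($first) ? '#' : $first;
        $groups[$key][] = $name;
    }
    ksort($groups, SORT_STRING); // "#" sorts before "A".."Z" in ASCII order
    return $groups;
}
```

The result can be passed to the same display loop as $groupedDirs.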
Use something like this, but change it to output the HTML the way you want:
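// $subdirs is keyed by directory name, so sort by key (sort() would discard the keys)
ksort($subdirs, SORT_STRING | SORT_FLAG_CASE);
$count = count($subdirs);
$lastLetter = '';
foreach($subdirs as $subdir => $opts){
// compare upper-cased letters so "coventry" stays in the "C" group
$letter = strtoupper(substr($subdir,0,1));
if($letter !== $lastLetter){
$lastLetter = $letter;
echo '<br /><div style="font-weight: bold;">'.$lastLetter.'</div>';
}
echo '<div>'.$subdir.'</div>';
}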
EDIT
Just realized $subdirs is associative, so I made the change above.