Good day. Today I am asking for help with a "crawler application"
which needs to contain the following parts:
It saves its results in a database.
The front end gets all the links of any given website and lists them, indexed,
one below the other.
If the results of the crawler are too long, it moves to the next page by
counting up the last number in the URL and fetches the links again,
as described below.
It gets the links that have «&id=» in their URL first, then the others.
This task is really for the real G :D
This is my code so far:
<?php
function crawl_page($url, $depth = 5)
{
if (!isset($url) || $depth == 0) {
return;
}
$dom = new DOMDocument('1.0');
// @ suppresses the warnings libxml emits for malformed HTML
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= dirname($parts['path'], 1).$path;
}
}
crawl_page($href, $depth - 1);
}
echo "URL:".$url."<br />";
}
crawl_page("http://www.pizza.com/", 2);
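The two requirements that the code above does not yet cover (links containing «&id=» first, and moving to the next page by counting up the last number in the URL) could be sketched roughly as below. The helper names `prioritizeIdLinks()` and `nextPageUrl()` and the example.com URLs are made up for illustration, and the sketch assumes the page number is the last run of digits in the URL.

```php
<?php
// Sketch only: both helper names are hypothetical, not from the original post.

/** Put links whose URL contains "&id=" first; keep the original order otherwise. */
function prioritizeIdLinks(array $links): array
{
    $withId = [];
    $others = [];
    foreach (array_unique($links) as $link) {
        if (strpos($link, '&id=') !== false) {
            $withId[] = $link;
        } else {
            $others[] = $link;
        }
    }
    return array_merge($withId, $others);
}

/** Count up the last run of digits in a URL; return null if there is none. */
function nextPageUrl(string $url): ?string
{
    // $m[2] is the final digit run, $m[3] is whatever non-digit text follows it.
    if (!preg_match('/^(.*?)(\d+)(\D*)$/s', $url, $m)) {
        return null;
    }
    return $m[1] . ((int) $m[2] + 1) . $m[3];
}

print_r(prioritizeIdLinks(['/about', '/item.php?cat=1&id=7', '/contact']));
echo nextPageUrl('http://example.com/list.php?cat=5&page=2'), "\n";
```

Note that `nextPageUrl()` counts up whatever number comes last, so a URL whose only number is in the host name would be changed there; a real crawler would probably want to restrict this to the query string.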
I have this code, which replaces each <img> tag with its source file name, but the text after the last image tag does not show up in the output.
$url = 'aa<img class="emojioneemoji" src="http://localhost/sng/assets/js/plugins/em/2.1.4/assets/png/1f62c.png">bb<img class="emojioneemoji" src="http://localhost/sng/assets/js/plugins/em/2.1.4/assets/png/1f600.png">cc';
$doc = new DOMDocument();
@$doc->loadHTML($url);
$tags = $doc->getElementsByTagName('img');
$str = "" ;
foreach ($tags as $tag) {
$img_path = $tag->getAttribute('src');
$directory = $img_path;
$ee = pathinfo($directory);
$pic_name= $ee['basename'];
$next = "" ;
$previous = "";
//echo $tag->nextSibling->wholeText;
if ($tag->previousSibling && get_class($tag->previousSibling) == "DOMText") {
$previous = $tag->previousSibling->wholeText . "-" ;
}
elseif($tag->nextSibling && get_class($tag->nextSibling) == "DOMText") {
$next = $tag->nextSibling->wholeText . "-" ;
}
$str .= $previous. $pic_name . "-" . $next ;
}
echo $str ;
The output of the above is:
aa-1f62c.png-bb-1f600.png-
How can I get the text 'cc' after the last <img> tag?
There are logic errors in your if-else statements: because of the elseif, the text after an image is never read when there is also text before it. Try the following code:
<?php
$url = 'aa<img class="emojioneemoji" src="http://localhost/sng/assets/js/plugins/em/2.1.4/assets/png/1f62c.png">bb<img class="emojioneemoji" src="http://localhost/sng/assets/js/plugins/em/2.1.4/assets/png/1f600.png">cc';
$doc = new DOMDocument();
@$doc->loadHTML($url);
$tags = $doc->getElementsByTagName('img');
$str = "" ;
$i=0;
$src_array=array();
foreach ($tags as $tag) {
$img_path = $tag->getAttribute('src');
$src_array[]=$img_path;
$directory = $img_path;
$ee = pathinfo($directory);
$pic_name= $ee['basename'];
$next = "" ;
$previous = "";
if ($tag->previousSibling && get_class($tag->previousSibling) == "DOMText") {
$previous = $tag->previousSibling->wholeText . "-" ;
}
if($tag->nextSibling && get_class($tag->nextSibling) == "DOMText") {
$next = $tag->nextSibling->wholeText . "-" ;
}
if(isset($previous_tag)){
$previous="";
}
$str .= $previous. $pic_name . "-" . $next ;
$previous_tag=$tag;
}
$str=rtrim($str,"-");
echo $str ;
Update:
If you want to recover the original string, you could try the following code, which relies on the array $src_array filled in the code above to store each image's src:
echo "<br/>";
$str_array=explode("-",$str); // note: the separator must be a character that cannot occur in the text itself
$j=0;
$recover_str="";
for($i=0;$i<count($str_array);$i++)
{
if(($i%2)==0){
$recover_str .= $str_array[$i];
}
else{
$recover_str .= '<img class="emojioneemoji" src="'.$src_array[$j].'">';
$j++;
}
}
echo $recover_str ;
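As an alternative to checking the siblings of every image, a minimal sketch could walk the child nodes of the common parent in document order, so each text node is seen exactly once. It assumes that the text and the images share one parent, which the sketch forces by wrapping the input in a `<div>`; the wrapper is my addition, not part of the original code.

```php
<?php
$html = 'aa<img class="emojioneemoji" src="http://localhost/sng/assets/js/plugins/em/2.1.4/assets/png/1f62c.png">'
      . 'bb<img class="emojioneemoji" src="http://localhost/sng/assets/js/plugins/em/2.1.4/assets/png/1f600.png">cc';

$doc = new DOMDocument();
// Assumption: wrap the fragment so all text nodes and images share one known parent.
@$doc->loadHTML('<div>' . $html . '</div>');
$div = $doc->getElementsByTagName('div')->item(0);

$parts = [];
foreach ($div->childNodes as $node) {
    if ($node instanceof DOMText) {
        // Plain text between (or after) the images.
        $parts[] = $node->nodeValue;
    } elseif ($node instanceof DOMElement && $node->tagName === 'img') {
        // Keep only the file name of the image source.
        $parts[] = basename($node->getAttribute('src'));
    }
}

$str = implode('-', $parts);
echo $str, "\n";
```

Because the pieces are joined with `implode()`, there is no trailing separator to trim afterwards.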
Okay, I've been searching for a way to list directories and files, which I've figured out and am utilizing code I found here on StackOverflow (Listing all the folders subfolders and files in a directory using php).
So far I've altered code found in one of the answers. I've been able to remove file extensions from both the path and the file name using preg_replace, capitalize the file names using ucwords, and switch out dashes for spaces using str_replace.
What I'm having trouble with now is wrapping the whole thing in a properly nested HTML list. I've managed to set it up so it's wrapped in a list, but it doesn't use nested lists where needed and I can't, for the life of me, figure out how to capitalize the directory names or replace any dashes within the directory name.
So, the questions are, if anyone would be so kind:
How do I wrap the output in properly nested lists?
How do I capitalize directory names while removing the preceding slash and replace dashes or underscores with spaces?
I've left the | within the $ss variable intentionally. I use it as a marker of sorts when I want to throw in characters that will identify where it shows up during trial and error (example $ss = $ss . "<li>workingOrNot").
I'm using:
<?php
$pathLen = 0;
function prePad($level) {
$ss = "";
for ($ii = 0; $ii < $level; $ii++) {
$ss = $ss . "| ";
}
return $ss;
}
function dirScanner($dir, $level, $rootLen) {
global $pathLen;
$filesHidden = array(".", "..", '.htaccess', 'resources', 'browserconfig.xml', 'scripts', 'articles');
if ($handle = opendir($dir)) {
$fileList = array();
while (false !== ($entry = readdir($handle))) {
if ($entry != "." && $entry != ".." && !in_array($entry, $filesHidden)) {
if (is_dir($dir . "/" . $entry)) {
$fileList[] = "F: " . $dir . "/" . $entry;
}
else {
$fileList[] = "D: " . $dir . "/" . $entry;
}
}
}
closedir($handle);
natsort($fileList);
foreach($fileList as $value) {
$displayName = ucwords ( str_replace("-", " ", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $rootLen + 4)));
$filePath = substr($value, 3);
$linkPath = str_replace(" ", "%20", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $pathLen + 3));
if (is_dir($filePath)) {
echo prePad($level) . "<li>" . $linkPath . "</li>\n";
dirScanner($filePath, $level + 1, strlen($filePath));
} else {
echo "<li>" . prePad($level) . "" . $displayName . "</li>\n";
}
}
}
}
I feel like these answers should be simple, so maybe I've been staring at it too much the last two days or maybe it has become Frankenstein code.
I'm about out of trial and error and I need help.
foreach($fileList as $value) {
$displayName = ucwords ( str_replace("-", " ", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $rootLen + 4)));
$filePath = substr($value, 3);
$linkPath = str_replace(" ", "%20", substr(preg_replace('/\\.[^.\\s]{3,5}$/', '', $value), $pathLen + 3));
if (is_dir($filePath)) {
// Do not close the <li> yet; instead, open a nested <ul> inside it
echo prePad($level) . "<li>" . $linkPath . "<ul>\n";
dirScanner($filePath, $level + 1, strlen($filePath));
// Close the nested <ul> first, then the <li> that contains it
echo "</ul></li>\n";
} else {
echo "<li>" . prePad($level) . "" . $displayName . "</li>\n";
}
}
I assume you're opening the main <ul> before calling the function and closing it at the end.
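For the second question (capitalizing directory names, dropping the leading slash, and replacing dashes or underscores with spaces), one possible approach is to prettify only the last path segment. The sketch below also shows the nesting rule on an in-memory array rather than a real directory, so it can be run anywhere; `prettyName()`, `renderTree()` and the sample file names are hypothetical.

```php
<?php
// Sketch only: prettyName() and renderTree() are hypothetical helpers.

/** "/my-sub_folder" -> "My Sub Folder", "some-file.html" -> "Some File". */
function prettyName(string $path): string
{
    $name = basename($path);                              // drops parent dirs and slashes
    $name = preg_replace('/\.[^.\s]{3,5}$/', '', $name);  // same extension pattern as the question
    return ucwords(str_replace(['-', '_'], ' ', $name));
}

/** Render ['dir' => [children...], 'file.ext'] as a properly nested <ul>. */
function renderTree(array $tree): string
{
    $html = "<ul>\n";
    foreach ($tree as $key => $value) {
        if (is_array($value)) {
            // Directory: the child <ul> sits inside the still-open <li>.
            $html .= '<li>' . prettyName((string) $key) . "\n" . renderTree($value) . "</li>\n";
        } else {
            $html .= '<li>' . prettyName($value) . "</li>\n";
        }
    }
    return $html . "</ul>\n";
}

echo renderTree(['my-docs' => ['getting-started.html', 'api_notes.html'], 'index.html']);
```

Building the whole list as a string and echoing it once also makes it easier to check that every `<ul>` and `<li>` is closed.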
I'm new to programming, so here is my issue. I am trying to build a recursive PHP spider using Simple HTML DOM Parser that crawls a given website and returns a list of pages, including those that respond with 2xx, 3xx, 4xx and 5xx status codes. I've been searching for a solution for several days but (maybe due to my lack of experience) I haven't found anything that works. My current code finds all the links on the root/index page; however, I would like to find the links inside those previously found links recursively, and so on, for example down to level 5. Assuming the root page is level 0, the recursive function I wrote only shows me the level 1 links, repeated 5 times. Any help is appreciated. Thanks.
<?php
echo "<strong><h1>Sitemap</h1></strong><br>";
include_once('simple_html_dom.php');
$url = "http://www.gnet.it/";
$html = new simple_html_dom();
$html->load_file($url);
echo "<strong><h2>Int Links</h2></strong><br>";
foreach($html->find("a") as $a)
{
if((!(preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
{
echo "<strong>" . $a->href . "</strong><br>";
}
}
echo "<strong><h2>Ext Links</h2></strong><br>";
foreach($html->find("a") as $a)
{
if(((preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
{
echo "<strong>" . $a->href . "</strong><br>";
}
}
//recursion
$depth = 1;
$maxDepth = 5;
$recurl = "$a->href";
$rechtml = new simple_html_dom();
$rechtml->load_file($recurl);
while($depth <= $maxDepth){
echo "<strong><h2>Nested links, level $depth</h2></strong><br>";
foreach($rechtml->find("a") as $a)
{
if(($a->href != null))
{
echo "<strong>" . $a->href . "</strong><br>";
}
}
$depth++;
}
//csv
echo "<strong><h1>Google Crawl Errors from CSV</h1></strong><br>";
echo "<table>\n\n";
$f = fopen("CrawlErrors.csv", "r");
while (($line = fgetcsv($f)) !== false) {
echo "<tr>";
foreach ($line as $cell) {
echo "<td>" . htmlspecialchars($cell) . "</td>";
}
echo "</tr>\n";
}
fclose($f);
echo "\n</table>";
?>
Try this:
I call this routine in a basic scraper to recursively find all of the links across the site. You'll have to put in some logic to prevent it from crawling external sites that are linked to from pages on your site, else you'll be running forever!
Note, I did get the majority of this code from another SO thread a while back, so the answers are out there.
function crawl_page($url, $depth = 2){
// strip trailing slash from URL
if(substr($url, -1) == '/') {
$url= substr($url, 0, -1);
}
// which URLs have we already crawled?
static $seen = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
// build the URLs to the same standard - with http:// etc
$path = '/' . ltrim($href, '/');
if (extension_loaded('http')) {
$href = http_build_url($url, array('path' => $path));
} else {
$parts = parse_url($url);
$href = $parts['scheme'] . '://';
if (isset($parts['user']) && isset($parts['pass'])) {
$href .= $parts['user'] . ':' . $parts['pass'] . '@';
}
$href .= $parts['host'];
if (isset($parts['port'])) {
$href .= ':' . $parts['port'];
}
$href .= $path;
}
}
crawl_page($href, $depth - 1);
}
// pull out the actual page name without any parent dirs
$pos = strrpos($url, '/');
$slug = $pos === false ? "root" : substr($url, $pos + 1);
echo "slug:" . $slug . "<br>";
}
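The logic for staying on your own site, mentioned at the top of this answer, might look something like the following sketch. `isSameSite()` is a hypothetical helper, and treating "www.example.com" and "example.com" as the same site is my assumption, not something the routine above does.

```php
<?php
// Sketch only: isSameSite() is a hypothetical helper, not part of the routine above.
function isSameSite(string $href, string $baseUrl): bool
{
    $hrefHost = parse_url($href, PHP_URL_HOST);
    $baseHost = parse_url($baseUrl, PHP_URL_HOST);
    if (!is_string($hrefHost) || !is_string($baseHost)) {
        return false; // malformed or host-less URL (mailto:, javascript:, ...): skip it
    }
    // Assumption: a leading "www." does not make it a different site.
    $normalize = function (string $host): string {
        return preg_replace('/^www\./i', '', strtolower($host));
    };
    return $normalize($hrefHost) === $normalize($baseHost);
}

var_dump(isSameSite('http://www.example.com/page', 'http://example.com/'));
var_dump(isSameSite('http://other.example.org/', 'http://example.com/'));
```

Inside the foreach loop, the recursive call would then be wrapped in a check such as `if (isSameSite($href, $url)) { crawl_page($href, $depth - 1); }`.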
I have tried to create a function that takes text from the database and publishes it with automatic formatting.
If there is a \n, the line should be wrapped in <p>...</p>.
If a line starts with a -, it should be wrapped in list tags as well.
My problem is that I can't figure out how to add the <ul> and </ul> tags.
function nl2p($string){
$string = explode("\n", $string);
$paragraphs = '';
foreach ($string as $line) {
if (trim($line)) {
if (substr($line,0,1) == '-'){
$paragraphs .= '<li>' . substr($line,1) . '</li>'."\r\n";
} else {
$paragraphs .= '<p>' . $line . '</p>'."\r\n";
}
}
}
return $paragraphs;
}
Use a variable ($ul in my example):
function nl2p($string){
$string = explode("\n", $string);
$paragraphs = '';
$ul = 0;
foreach ($string as $line) {
if (trim($line)) {
if (substr($line,0,1) == '-'){
if($ul == 0){
$paragraphs .= "<ul>\r\n";
$ul = 1;
}
$paragraphs .= '<li>' . substr($line,1) . '</li>'."\r\n";
} else {
if($ul == 1){
$paragraphs .= "</ul>\r\n";
$ul = 0;
}
$paragraphs .= '<p>' . $line . '</p>'."\r\n";
}
}
}
// Close a list that runs to the very end of the text
if($ul == 1){
$paragraphs .= "</ul>\r\n";
}
return $paragraphs;
}
Collect all continuous <li> elements with text in a string and then enclose this string in a <ul> and </ul>.
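A minimal sketch of that buffering idea, assuming a list item is any line whose first non-blank character is a dash; the name `nl2pBuffered()` is made up, and, like the question's code, it does not escape the text:

```php
<?php
// Sketch only: nl2pBuffered() is a hypothetical name for the buffering approach.
function nl2pBuffered(string $text): string
{
    $out = '';
    $items = ''; // consecutive <li> lines collected so far, not yet wrapped

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines, as the original function does
        }
        if ($line[0] === '-') {
            $items .= '<li>' . ltrim(substr($line, 1)) . "</li>\n";
            continue;
        }
        // A normal line ends any pending list: wrap the buffer and emit it.
        if ($items !== '') {
            $out .= "<ul>\n" . $items . "</ul>\n";
            $items = '';
        }
        $out .= '<p>' . $line . "</p>\n";
    }

    // Flush a list that runs to the end of the text.
    if ($items !== '') {
        $out .= "<ul>\n" . $items . "</ul>\n";
    }
    return $out;
}

echo nl2pBuffered("Intro line\n- first\n- second\nClosing line");
```

Splitting on `\R` accepts both "\n" and "\r\n" line endings, which matters when the text comes from a web form.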
I would search for the first character to be a dash and, if the previous line did not start with a dash, add the <ul> there. Then wrap the line in li tags and the same check at the end - if the next line does not start with a dash, then add an </ul>.
Then, as a default action, wrap the line in a paragraph.
$string = "this is a string.
New line 1.
New line 2.
- List item
- Another list item
Some more lines of text";
function nl2p($input) {
$lines = explode("\r\n",$input);
$return = '';
foreach($lines as $key => $line) {
if(strpos($line,'-') === 0) {
// Open the list on the first line, or when the previous line is not a list item
if(!array_key_exists($key-1,$lines) OR strpos($lines[$key-1],'-') !== 0) {
$return .= '<ul>' . "\r\n";
}
$return .= '<li>' . $line . '</li>' . "\r\n";
// Close the list on the last line, or when the next line is not a list item
if(!array_key_exists($key+1,$lines) OR strpos($lines[$key+1],'-') !== 0) {
$return .= '</ul>' . "\r\n";
}
continue;
}
$return .= '<p>' . $line . '</p>' . "\r\n";
}
return $return;
}
var_dump(nl2p($string));
/*
<p>this is a string.</p>
<p>New line 1.</p>
<p>New line 2.</p>
<ul>
<li>- List item</li>
<li>- Another list item</li>
</ul>
<p>Some more lines of text</p>
*/
function nl2p($input) {
if(strpos($input, "\n")) {
$slash = explode("\n",$input);
$newPara = '';
foreach($slash as $slashval) {
$slashval = '#'.$slashval;
if(strpos($slashval,"-")) {
$slashval = substr($slashval, 1);
$hypen = explode("-",$slashval);
$newPara .= '<ul>';
foreach($hypen as $hypenval) {
if(!empty($hypenval)) {
$newPara .= '<li>'.$hypenval.'</li>';
}
}
$newPara .= '</ul>';
} else {
$slashval = substr($slashval, 1);
$newPara .= '<p>'.$slashval.'</p>';
}
}
return $newPara;
} else {
$slashval = $input;
$newPara = ''; // initialise before the .= appends below
if(strpos($slashval,"-")) {
$hypen = explode("-",$slashval);
$newPara .= '<ul>';
foreach($hypen as $cnt => $hypenval) {
if($cnt == 0) {
$start = $hypenval;
} else {
if(!empty($hypenval)) {
$newPara .= '<li>'.$hypenval.'</li>';
}
}
}
$newPara .= '</ul>';
$newPara = $start.$newPara;
} else {
$slashval = '#'.$input;
if(strpos($slashval,"-")) {
$slashval = substr($slashval, 1);
$hypen = explode("-",$slashval);
$newPara .= '<ul>';
foreach($hypen as $hypenval) {
if(!empty($hypenval)) {
$newPara .= '<li>'.$hypenval.'</li>';
}
}
$newPara .= '</ul>';
}
}
return $newPara;
}
return $input;
}
I want to put a span element around $term['nodes'].
I have tried putting it after the bracket and in between, but nothing works for me.
if (isset($term['nodes'])) {
$term['name'] = $term['name'] . ' (' . $term['nodes'] . ')';
}
Here is the whole function:
function bootstrap_taxonomy_menu_block($variables) {
$tree = $variables['items'];
$config = $variables['config'];
$num_items = count($tree);
$i = 0;
$output = '<ul class="nav nav-pills nav-stacked">';
foreach ($tree as $tid => $term) {
$i++;
// Add classes.
$attributes = array();
if ($i == 1) {
$attributes['class'][] = '';
}
if ($i == $num_items) {
$attributes['class'][] = '';
}
if ($term['active_trail'] == '1') {
$attributes['class'][] = 'active-trail';
}
if ($term['active_trail'] == '2') {
$attributes['class'][] = 'active';
}
// Alter link text if we have to display the nodes attached.
if (isset($term['nodes']))
{
$term['name'] = $term['name'] . ' (<span>' . $term['nodes'] . '</span>)';
}
// Set alias option to true so we don't have to query for the alias every
// time, as this is cached anyway.
$output .= '<li' . drupal_attributes($attributes) . '>' . l($term['name'], $term['path'], $options = array('alias' => TRUE));
if (!empty($term['children'])) {
$output .= theme('taxonomy_menu_block__' . $config['delta'], (array('items' => $term['children'], 'config' => $config)));
}
$output .= '</li>';
}
$output .= '</ul>';
return $output;
}
I want this for the Bootstrap CDN class. I have moved the function into template.php of the Drupal theme, but the span element shows up as plain text in the browser.
Try this:
if (isset($term['nodes']))
{
$term['name'] = $term['name'] . ' (<span>' . $term['nodes'] . '</span>)';
echo $term['name']; // To see the output
}
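A likely reason the span shows up as plain text: in Drupal 7, l() treats the link text as plain text and escapes it unless the 'html' option is set to TRUE. The sketch below imitates that behaviour with a stand-in function so it can run outside Drupal; `demo_link()` is hypothetical and is not Drupal's implementation, and the sample term name and path are made up.

```php
<?php
// Hypothetical stand-in for Drupal 7's l(); NOT the real implementation.
function demo_link(string $text, string $path, array $options = []): string
{
    $options += ['html' => false];
    // Without 'html' => true the label is escaped, so tags become visible text.
    $label = $options['html'] ? $text : htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return '<a href="' . htmlspecialchars($path, ENT_QUOTES, 'UTF-8') . '">' . $label . '</a>';
}

$name = 'News (<span>3</span>)';
echo demo_link($name, 'taxonomy/term/1'), "\n";                   // span is escaped
echo demo_link($name, 'taxonomy/term/1', ['html' => true]), "\n"; // span is kept as markup
```

In the theme function above, the corresponding change would be to pass array('alias' => TRUE, 'html' => TRUE) as the options argument of l(). Since the text is then no longer escaped, the term name itself should be sanitized before the span is appended.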