PHP and MySQL Web Crawler

I need some help with my web crawler exercise. I need to write a crawler that saves the front page of some websites, along with all of their content, in a MySQL database. I am using the XAMPP MySQL database. Here is my code:
<?php
$main_url = "webpage";
$str = file_get_contents($main_url);

// Gets the webpage title
if (strlen($str) > 0)
{
    $str = trim(preg_replace('/\s+/', ' ', $str)); // collapse whitespace so the regex spans line breaks inside <title>
    preg_match("/\<title\>(.*)\<\/title\>/i", $str, $title); // ignore case
    $title = isset($title[1]) ? $title[1] : '';
}
// Gets the webpage description from its meta tags
$url_parts = parse_url($main_url);
$tags = get_meta_tags($url_parts['scheme'] . '://' . $url_parts['host']);
$description = isset($tags['description']) ? $tags['description'] : '';
// Gets the webpage's internal links
$doc = new DOMDocument;
@$doc->loadHTML($str); // @ suppresses warnings from malformed HTML
$sec_url = array();
foreach ($doc->getElementsByTagName('a') as $value) {
    $href = $value->getAttribute('href');
    if ($href !== '') {
        $sec_url[] = $href;
    }
}
$all_links = implode(",", $sec_url);
// Gets the webpage's images
require_once('C:\xampp\htdocs\simple_html_dom.php');
require_once('C:\xampp\htdocs\url_to_absolute.php');
$html = file_get_html($main_url);
$image_urls = array();
foreach ($html->find('img') as $element) {
    $image_urls[] = url_to_absolute($main_url, $element->src); // resolve relative paths
}
$images = implode(",", $image_urls);
// Store the data in the database
$host = "localhost";
$username = "root";
$password = "";
$databasename = "db";
$connect = mysql_connect($host, $username, $password);
$db = mysql_select_db($databasename);
mysql_query("INSERT INTO webpage_details VALUES ('$main_url','$title','$description','$all_links','$images')");
?>
I need to save everything from a home page, and I am having a problem saving the images. Any ideas?
Are there any other items I should save?
Thanks!
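One note on the database step: the mysql_* functions are deprecated, and unescaped values will break the query as soon as a title contains a quote. Here is a sketch of the same insert using mysqli with a prepared statement, assuming the webpage_details table has exactly the five columns used above:

// Hypothetical replacement for the "Store the data in the database" block above.
$mysqli = new mysqli("localhost", "root", "", "db");
if ($mysqli->connect_error) {
    die("Connection failed: " . $mysqli->connect_error);
}
// A prepared statement quotes the values safely, so titles or descriptions
// containing quotes no longer break the query.
$stmt = $mysqli->prepare("INSERT INTO webpage_details VALUES (?, ?, ?, ?, ?)");
$stmt->bind_param("sssss", $main_url, $title, $description, $all_links, $images);
$stmt->execute();
$stmt->close();
$mysqli->close();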

Related

How to open each URL once I have the strings

I have a problem opening the URLs. After I extract the URLs from the MySQL database and output the list in PHP, it will not open each URL.
Here is the PHP:
<?php
//Connect to the database
require_once('config.php');
$qrytable1 = "SELECT links FROM channels_list";
$result1 = mysql_query($qrytable1) or die('Error:<br />' . $qrytable1 . '<br />' . mysql_error());
while ($row = mysql_fetch_array($result1)) {
    echo $row["links"];
    $baseUrl = file_get_contents($row["links"]);
    $domdoc = new DOMDocument();
    $domdoc->strictErrorChecking = false;
    $domdoc->recover = true;
    $domdoc->loadHTML($baseUrl);
    $links = $domdoc->getElementsByTagName('a');
    foreach ($links as $link) {
        echo "we are now opening for each url";
    }
}
Here is the output for the URLs:
http://example.com.com/some_name/?id=963
http://example.com.com/some_name/?id=102
http://example.com.com/some_name/?id=103
http://example.com.com/some_name/?id=104
http://example.com.com/some_name/?id=171
http://example.com.com/some_name/?id=106
http://example.com.com/some_name/?id=107
http://example.com.com/some_name/?id=108
http://example.com.com/some_name/?id=402
http://example.com.com/some_name/?id=403
http://example.com.com/some_name/?id=404
http://example.com.com/some_name/?id=405
http://example.com.com/some_name/?id=406
http://example.com.com/some_name/?id=408
http://example.com.com/some_name/?id=407
http://example.com.com/some_name/?id=409
http://example.com.com/some_name/?id=435
http://example.com.com/some_name/?id=436
http://example.com.com/some_name/?id=439
http://example.com.com/some_name/?id=440
http://example.com.com/some_name/?id=410
http://example.com.com/some_name/?id=411
http://example.com.com/some_name/?id=413
http://example.com.com/some_name/?id=414
http://example.com.com/some_name/?id=415
http://example.com.com/some_name/?id=417
http://example.com.com/some_name/?id=418
http://example.com.com/some_name/?id=421
I think the problem is with this code:
$links = $domdoc->getElementsByTagName('a');
I don't have an HTML tag in my PHP page; it only shows the strings of the actual URLs, as shown above.
What I expect is to open each URL once I get the list of URLs from MySQL.
Can you please help me with how to open each URL after I get the URLs from the MySQL database?
I'm not exactly sure what you mean by "open each URL".
If you want the output to be a list of links you can click:
while ($row = mysql_fetch_array($result1)) {
    echo "<a href='" . $row["links"] . "'>" . $row["links"] . "</a>";
}
If you want to download the contents of each URL:
while ($row = mysql_fetch_array($result1)) {
    $content_string = file_get_contents($row["links"]);
}
$content_string holds the content of a page as a string; I'm not sure what you want to do with it.
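If "open each URL" means fetching each page and then walking its links, the two snippets can be combined. A sketch, reusing the table and column names from the question:

while ($row = mysql_fetch_array($result1)) {
    $html = file_get_contents($row["links"]); // download the page
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }
    $domdoc = new DOMDocument();
    @$domdoc->loadHTML($html); // @ silences warnings from malformed markup
    foreach ($domdoc->getElementsByTagName('a') as $link) {
        echo $link->getAttribute('href'), "<br />\n";
    }
}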

Fetching images from Google using DOM

I want to fetch images from Google using PHP, so I looked for help on the net and found a script that does what I need, but it shows this fatal error:
Fatal error: Call to a member function find() on a non-object in C:\wamp\www\nq\qimages.php on line 7
Here is my script:
<?php
include "simple_html_dom.php";

$search_query = "car";
$search_query = urlencode($search_query);
$html = file_get_html("https://www.google.com/search?q=$search_query&tbm=isch");
$image_container = $html->find('div#rcnt', 0);
$images = $image_container->find('img');
$image_count = 10; // Enter the number of images to be shown
$i = 0;
foreach ($images as $image) {
    if ($i == $image_count) break;
    $i++;
    // Do whatever you want with the image here (the image element is $image):
    echo $image;
}
?>
I am also using Simple HTML DOM.
Look at my example, which works and gets the first image from the Google results:
<?php
$url = "https://www.google.hr/search?q=aaaa&biw=1517&bih=714&source=lnms&tbm=isch&sa=X&ved=0CAYQ_AUoAWoVChMIyKnjyrjQyAIVylwaCh06nAIE&dpr=0.9";
$content = file_get_contents($url);
libxml_use_internal_errors(true); // collect parse errors instead of printing them
$dom = new DOMDocument;
@$dom->loadHTML($content);
$images_dom = $dom->getElementsByTagName('img');
foreach ($images_dom as $img) {
    if ($img->hasAttribute('src')) {
        $image_url = $img->getAttribute('src');
        break; // stop at the first image that actually has a src
    }
}
// This is the first image on the page
echo $image_url;
This error usually means that $html isn't an object.
It's odd that you say this seems to work. What happens if you output $html? I'd imagine that the URL isn't available and that $html is null.
Edit: it looks like this may be a bug in the parser. Someone has submitted a bug report and added a check to their code as a workaround.
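A defensive version of the asker's script with such a check might look like this (a sketch; file_get_html() returns false when the page cannot be fetched or parsed, and find() returns null when the container is missing; the messages are illustrative):

include "simple_html_dom.php";
$html = file_get_html("https://www.google.com/search?q=" . urlencode("car") . "&tbm=isch");
if (!$html) {
    die("Could not fetch or parse the search page.");
}
$image_container = $html->find('div#rcnt', 0);
if (!$image_container) {
    die("div#rcnt not found; Google may have changed its markup.");
}
// ...continue with $image_container->find('img') as in the original script.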

Display rss xml index in html

I am adding an RSS feed to my website. I created the RSS.xml index file, and next I want to display its contents in a nicely formatted way on a webpage.
Using PHP, I can do this:
$index = file_get_contents($path . 'RSS.xml');
echo $index;
But all that does is dump the contents as a long stream of text with the tags removed.
I know that treating RSS.xml as a link, like this:
<a href="../blogs/RSS.xml">
<img src="../blogs/feed-icon-16.gif">Blog Index
</a>
causes my browser to parse and display it in a reasonable way when the user clicks on the link. However, I want to embed it directly in the webpage and not make the user go through another click.
What is the proper way to do what I want?
Use the following code:
include_once('Simple/autoloader.php');

$feed = new SimplePie();
$feed->set_feed_url($url);
$feed->enable_cache(false);
$feed->set_output_encoding('utf-8');
$feed->init();

$items = $feed->get_items();
foreach ($items as $item) {
    /* The following calls get the title, link, description, and date of each RSS item */
    $title = $item->get_title();
    $url = $item->get_permalink();
    $desc = $item->get_description();
    $date = $item->get_date();
}
Download the Simple folder from: https://github.com/jewelhuq/Online-News-Grabber/tree/master/worldnews/Simple
Hope it works for you. There, $url means your RSS feed URL.
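To actually render each item, you could echo markup inside that loop; a minimal sketch (the heading tags are just one possible layout, not part of the original answer):

foreach ($feed->get_items() as $item) {
    $title = $item->get_title();
    $url = $item->get_permalink();
    $desc = $item->get_description();
    echo "<h3><a href='$url'>$title</a></h3>";
    echo "<p>$desc</p>";
}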
Turns out it's simple using the PHP XML parser function simplexml_load_file():
$xml = simplexml_load_file($path . 'RSS.xml');
$channel = $xml->channel;
$channel_title = $channel->title;
$channel_description = $channel->description;
echo "<h1>$channel_title</h1>";
echo "<h2>$channel_description</h2>";
foreach ($channel->item as $item) {
    $title = $item->title;
    $link = $item->link;
    $descr = $item->description;
    echo "<h3><a href='$link'>$title</a></h3>";
    echo "<p>$descr</p>";
}

How to crawl a specific div tag to get all the data inside it?

Actually, I'm new to PHP, and I want to crawl this link to get info about all the courier companies providing services in our country. All the info I require is inside a div tag, i.e. <div class="rescont">. I need all the info inside this tag, with images, paragraphs, and links. I've done some research and am able to crawl the page.
<?php
function crawl_page($url, $depth = 1)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url); // @ suppresses warnings from malformed HTML
    $xpath = new DOMXPath($dom);
    $divTag = $xpath->query('//div[@class="rescont"]');
    foreach ($divTag as $val) {
        echo $dom->saveXML($val) . "<br />\n";
    }
}
crawl_page("http://www.phonebook.com.pk/Dynamic/Search.aspx?k=courier&l=pakistan&SearchType=kl", 1);
?>
Edit:
Now I'm able to display all the contents on my webpage, but the images and some other info are not available because they are linked relative to that server. Can I extract that info too?
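One approach is to rewrite relative src and href values into absolute URLs before echoing each div. Here is a sketch that could replace the foreach inside crawl_page() above; it handles only root-relative paths (those starting with "/") and is illustrative, not from the original post:

$parts = parse_url($url);
$base = $parts['scheme'] . '://' . $parts['host'];
foreach ($divTag as $val) {
    // Make root-relative image and link URLs absolute so they
    // resolve against the remote server instead of ours.
    foreach ($val->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '' && $src[0] === '/') {
            $img->setAttribute('src', $base . $src);
        }
    }
    foreach ($val->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && $href[0] === '/') {
            $a->setAttribute('href', $base . $href);
        }
    }
    echo $dom->saveXML($val) . "<br />\n";
}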

Replacing a link with plain text using PHP Simple HTML DOM

I have a program that removes certain pages from a website; I then want to traverse the remaining pages and "unlink" any links to those removed pages. I'm using Simple HTML DOM. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd like to manipulate the DOM to convert each matching element into its $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->href = ''; // Should convert to a simple text element
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return($docHtml);
}
I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->outertext = $link->plaintext; // THIS SHOULD WORK
                // IF THIS DOES NOT WORK, TRY:
                // $link->outertext = $link->innertext;
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return($docHtml);
}
Please give me some feedback on whether this worked or not, specifying which method worked, if any.
Update: maybe you would prefer this:
$link->outertext = $link->href;
This way the link URL is displayed, but it is not clickable.
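A hypothetical usage of the function, assuming simple_html_dom.php is already included, the page is stored locally, and the hrefs in $skipList match exactly how they appear in the markup:

$skipList = array('removed1.html', 'removed2.html'); // pages that were deleted
$cleanHtml = RemoveSpecificLinks('page.html', $skipList);
file_put_contents('page.html', $cleanHtml); // write the unlinked version back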
