How to open each URL when I have the strings - PHP

I have a problem opening the URLs. After I extract the URLs from the MySQL database and output the list in PHP, nothing is opened for each URL.
Here is the PHP:
<?php
// Connect to the database
require_once('config.php');
$qrytable1 = "SELECT links FROM channels_list";
$result1 = mysql_query($qrytable1) or die('Error:<br />' . $qrytable1 . '<br />' . mysql_error());
while ($row = mysql_fetch_array($result1))
{
    echo $row["links"];
    $baseUrl = file_get_contents($row["links"]);
    $domdoc = new DOMDocument();
    $domdoc->strictErrorChecking = false;
    $domdoc->recover = true;
    $domdoc->loadHTML($baseUrl);
    $links = $domdoc->getElementsByTagName('a');
    foreach ($links as $link)
    {
        echo "we are now opening for each url";
    }
}
Here is the output of the URLs:
http://example.com.com/some_name/?id=963
http://example.com.com/some_name/?id=102
http://example.com.com/some_name/?id=103
http://example.com.com/some_name/?id=104
http://example.com.com/some_name/?id=171
http://example.com.com/some_name/?id=106
http://example.com.com/some_name/?id=107
http://example.com.com/some_name/?id=108
http://example.com.com/some_name/?id=402
http://example.com.com/some_name/?id=403
http://example.com.com/some_name/?id=404
http://example.com.com/some_name/?id=405
http://example.com.com/some_name/?id=406
http://example.com.com/some_name/?id=408
http://example.com.com/some_name/?id=407
http://example.com.com/some_name/?id=409
http://example.com.com/some_name/?id=435
http://example.com.com/some_name/?id=436
http://example.com.com/some_name/?id=439
http://example.com.com/some_name/?id=440
http://example.com.com/some_name/?id=410
http://example.com.com/some_name/?id=411
http://example.com.com/some_name/?id=413
http://example.com.com/some_name/?id=414
http://example.com.com/some_name/?id=415
http://example.com.com/some_name/?id=417
http://example.com.com/some_name/?id=418
http://example.com.com/some_name/?id=421
I think the problem is with this line:
$links = $domdoc->getElementsByTagName('a');
My PHP page doesn't contain any HTML tags; it only shows the strings of the actual URLs, as shown above.
What I expect is to open each URL once I have the list of URLs from MySQL.
Can you please help me open each URL after I get the URLs from the MySQL database?

I'm not exactly sure what you mean by "open for each url".
If you want to output a list of links you can click on:
while ($row = mysql_fetch_array($result1))
{
    echo "<a href='" . $row["links"] . "'>" . $row["links"] . "</a>";
}
If you want to download the contents of each URL:
while ($row = mysql_fetch_array($result1))
{
    $content_string = file_get_contents($row["links"]);
}
$content_string then holds the content of a page as a string; I'm not sure what you want to do with it.
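If the goal is to fetch each stored URL and then list the links found on each fetched page, a minimal sketch might look like the following. It swaps the deprecated mysql_* functions (removed in PHP 7) for mysqli; the connection variables are assumptions standing in for whatever your config.php actually defines:
<?php
// Sketch only: uses mysqli instead of the deprecated mysql_* functions.
// $host, $user, $pass and $dbname are assumed to come from config.php.
require_once('config.php');
$db = new mysqli($host, $user, $pass, $dbname);
if ($db->connect_error) {
    die('Connect error: ' . $db->connect_error);
}
$result = $db->query('SELECT links FROM channels_list');
while ($row = $result->fetch_assoc()) {
    $url = $row['links'];
    echo $url . "<br>\n";
    // "Open" the URL by downloading it; @ keeps a dead link from emitting warnings.
    $html = @file_get_contents($url);
    if ($html === false) {
        echo "could not fetch $url<br>\n";
        continue;
    }
    // Parse the fetched page and list the anchors it contains.
    $dom = new DOMDocument();
    $dom->strictErrorChecking = false;
    $dom->recover = true;
    @$dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href') . "<br>\n";
    }
}
For anything beyond a quick test, cURL with a timeout would be a safer way to fetch the pages than file_get_contents.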

Related

Is there a function to add text to x if x does not contain y | PHP

I'm working on a URL crawler, but I get a lot of paths without the domain and the http scheme. I want to write a function that adds them when a path doesn't contain them.
Here is my code:
<?php
$source_url = 'http://www.google.com/';
$html = file_get_contents($source_url);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    $input_url = $link->getAttribute('href');
    echo $input_url . "<br>";
}
?>
If there isn't one, how can I extract only the URLs that contain http?
You can use a regular expression to check whether the link is an absolute URL (i.e. contains the domain) or a relative one. What I have done is check whether the link starts with http:// or https://; if it doesn't, the source domain is added to the beginning of the link.
foreach ($links as $link) {
    $input_url = $link->getAttribute('href');
    if (!preg_match('/^https?:\/\//', $input_url)) {
        $input_url = $source_url . preg_replace('/^\//', '', $input_url);
    }
    echo $input_url . "<br>";
}
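One caveat with the regex approach: protocol-relative hrefs (//example.com/page) and fragment-only hrefs (#top) also get the domain prepended. A slightly more defensive sketch using parse_url, under the same $source_url assumption:
foreach ($links as $link) {
    $input_url = $link->getAttribute('href');
    $parts = parse_url($input_url);
    if (isset($parts['scheme'])) {
        // Already absolute (http://..., https://...): leave as-is.
        $abs = $input_url;
    } elseif (strpos($input_url, '//') === 0) {
        // Protocol-relative: borrow the scheme from the source URL.
        $abs = parse_url($source_url, PHP_URL_SCHEME) . ':' . $input_url;
    } else {
        // Relative path: prefix the source domain.
        $abs = rtrim($source_url, '/') . '/' . ltrim($input_url, '/');
    }
    echo $abs . "<br>";
}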

Web scraping information from the site in PHP

I'm writing a PHP script that must copy the list of publications from the homepage and the information inside each publication.
I need to copy the content from my previous site and add it to the new site.
I have had some success: my PHP script copies the list of publications on the home page. Now I need it to pull the information inside each publication (title, photo, full text).
For this, I wrote a function that extracts a link to each post.
Help me write a function that will copy the information at a given link!
<?php
header('Content-type: text/html; charset=utf-8');
require 'phpQuery.php';

function print_arr($arr) {
    echo '<pre>' . print_r($arr, true) . '</pre>';
}

$url = 'http://goruzont.blogspot.com/';
$file = file_get_contents($url);
$doc = phpQuery::newDocument($file);
foreach ($doc->find('.blog-posts .post-outer .post') as $article) {
    $article = pq($article);
    $text = $article->find('.entry-title a')->html();
    print_arr($text);
    $texturl = $article->find('.entry-title a')->attr('href');
    echo $texturl;
    $text = $article->find('.date-header')->html();
    print_arr($text);
    // The thumbnail URL is embedded in the style attribute,
    // e.g. style="background:url(http://...jpg) no-repeat"
    $img = $article->find('.thumb a')->attr('style');
    if (preg_match('!background:url.(.+). no!', $img, $match)) {
        $imgurl = $match[1];
        echo "<img src='$imgurl'>";
    }
}
?>
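For the next step, a function could fetch each post URL and pull the pieces out of the single-post page. A sketch with phpQuery; the selectors (h3.post-title, .post-body, .date-header) are assumptions based on a typical Blogspot template and need to be checked against the real post markup:
// Sketch only: fetch one post and extract title, date, body and first image.
function grab_post($posturl) {
    $html = file_get_contents($posturl);
    $doc = phpQuery::newDocument($html);

    $post = array();
    $post['title'] = trim($doc->find('h3.post-title')->text()); // assumed selector
    $post['date']  = trim($doc->find('.date-header')->text());  // assumed selector
    $post['body']  = $doc->find('.post-body')->html();          // assumed selector

    // First image inside the post body, if any.
    $img = $doc->find('.post-body img')->attr('src');
    $post['image'] = $img ? $img : null;

    return $post;
}
You could then call it from the loop above: $info = grab_post($texturl); print_arr($info);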

PHP and MySql Web Crawler

I need some help with my web crawler exercise. I have to write a crawler that saves the first page of some websites, and all of its content, in a MySQL database. I am using the XAMPP MySQL database. Here is my code:
<?php
$main_url = "webpage";
$str = file_get_contents($main_url);

// Gets webpage title
if (strlen($str) > 0)
{
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    preg_match("/\<title\>(.*)\<\/title\>/i", $str, $title); // ignore case
    $title = $title[1];
}

// Gets webpage description
$b = $main_url;
$url = parse_url($b);
$tags = @get_meta_tags($url['scheme'] . '://' . $url['host']);
$description = $tags['description'];

// Gets webpage internal links
$doc = new DOMDocument;
@$doc->loadHTML($str);
$items = $doc->getElementsByTagName('a');
foreach ($items as $value)
{
    $attrs = $value->attributes;
    $sec_url[] = $attrs->getNamedItem('href')->nodeValue;
}
$all_links = implode(",", $sec_url);

// Gets webpage images
require_once('C:\xampp\htdocs\simple_html_dom.php');
require_once('C:\xampp\htdocs\url_to_absolute.php');
$url = 'webpage';
$html = file_get_html('webpage');
foreach ($html->find('img') as $element) {
    echo url_to_absolute($url, $element->src), "\n";
}
$images = ''; // this is the part I can't get right

// Store data in database
$host = "localhost";
$username = "root";
$password = "";
$databasename = "db";
$connect = mysql_connect($host, $username, $password);
$db = mysql_select_db($databasename);
mysql_query("insert into webpage_details values('$main_url','$title','$description','$all_links','$images')");
?>
I need to save everything from a home page, and I have a problem saving the images. Any ideas?
Are there any other items I need to save?
Thanks!
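For the missing $images part, one approach (a sketch, not tested against your schema) is to collect the absolute image URLs into an array and implode them, the same way $all_links is built, and then insert with a mysqli prepared statement instead of the deprecated mysql_* calls. The column order is assumed to match the webpage_details insert in the question:
// Collect image URLs the same way the links are collected.
$img_list = array();
foreach ($html->find('img') as $element) {
    $img_list[] = url_to_absolute($url, $element->src);
}
$images = implode(",", $img_list);

// Store with mysqli and a prepared statement.
$db = new mysqli("localhost", "root", "", "db");
$stmt = $db->prepare("INSERT INTO webpage_details VALUES (?, ?, ?, ?, ?)");
$stmt->bind_param("sssss", $main_url, $title, $description, $all_links, $images);
$stmt->execute();
A prepared statement also protects the insert from quoting problems when a title or description contains an apostrophe.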

Display RSS XML index in HTML

I am adding an RSS feed to my website. I created the RSS.xml index file and next I want to display its contents in a nicely formatted way in a webpage.
Using PHP, I can do this:
$index = file_get_contents($path . 'RSS.xml');
echo $index;
But all that does is dump the contents as a long stream of text with the tags removed.
I know that treating RSS.xml as a link, like this:
<a href="../blogs/RSS.xml">
<img src="../blogs/feed-icon-16.gif">Blog Index
</a>
causes my browser to parse and display it in a reasonable way when the user clicks on the link. However, I want to embed it directly in the web page and not make the user go through another click.
What is the proper way to do what I want?
Use the following code:
include_once('Simple/autoloader.php');
$feed = new SimplePie();
$feed->set_feed_url($url);
$feed->enable_cache(false);
$feed->set_output_encoding('utf-8');
$feed->init();
$i = 0;
$items = $feed->get_items();
foreach ($items as $item) {
    $i++;
    /* You get the title, description and date of each RSS item with the following code */
    $title = $item->get_title();
    $url = $item->get_permalink();
    $desc = $item->get_description();
    $date = $item->get_date();
}
Download the Simple folder from: https://github.com/jewelhuq/Online-News-Grabber/tree/master/worldnews/Simple
Hope it works for you. Here, $url means your RSS feed URL. If it works for you, please respond.
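Note that the loop above only keeps the last item's values, because each iteration overwrites the variables. To actually display the feed, echo inside the loop, for example:
foreach ($items as $item) {
    echo '<h3><a href="' . $item->get_permalink() . '">' . $item->get_title() . '</a></h3>';
    echo '<p>' . $item->get_description() . '</p>';
}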
Turns out, it's simple using the PHP XML parser functions:
$xml = simplexml_load_file($path . 'RSS.xml');
$channel = $xml->channel;
$channel_title = $channel->title;
$channel_description = $channel->description;
echo "<h1>$channel_title</h1>";
echo "<h2>$channel_description</h2>";
foreach ($channel->item as $item)
{
    $title = $item->title;
    $link = $item->link;
    $descr = $item->description;
    echo "<h3><a href='$link'>$title</a></h3>";
    echo "<p>$descr</p>";
}
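One caveat when echoing feed fields straight into the page: titles and descriptions may contain markup or entities. If you want them treated as plain text, escape them on output, for example:
echo '<h3><a href="' . htmlspecialchars((string) $link) . '">'
   . htmlspecialchars((string) $title) . '</a></h3>';
echo '<p>' . htmlspecialchars((string) $descr) . '</p>';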

Replacing link with plain text with php simple html dom

I have a program that removes certain pages from a website; I then want to traverse the remaining pages and "unlink" any links to those removed pages. I'm using simplehtmldom. My function takes a source page ($source) and an array of pages ($skipList). It finds the links, and I'd like to manipulate the DOM to convert each matching element into its $link->innertext, but I don't know how. Any help?
function RemoveSpecificLinks($source, $skipList) {
    // $source is the html source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->href = ''; // Should convert to simple text element
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return($docHtml);
}
I have never used simplehtmldom, but this is what I think should solve your problem:
function RemoveSpecificLinks($source, $skipList) {
    // $source is the HTML source file;
    // $skipList is an array of link destinations (hrefs) that we want unlinked
    $docHtml = file_get_contents($source);
    $htmlObj = str_get_html($docHtml);
    $links = $htmlObj->find('a');
    if (isset($links)) {
        foreach ($links as $link) {
            if (in_array($link->href, $skipList)) {
                $link->outertext = $link->plaintext; // THIS SHOULD WORK
                // IF THIS DOES NOT WORK TRY:
                // $link->outertext = $link->innertext;
            }
        }
    }
    $docHtml = $htmlObj->save();
    $htmlObj->clear();
    unset($htmlObj);
    return($docHtml);
}
Please give me some feedback on whether this worked, and which method worked, if any.
Update: Maybe you would prefer this:
$link->outertext = $link->href;
This way the URL is displayed, but it is not clickable.
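If you ever need to do the same thing without simplehtmldom, the built-in DOMDocument can replace each matching <a> node with a plain text node. A rough sketch of an equivalent function:
function RemoveSpecificLinksDom($source, $skipList) {
    $dom = new DOMDocument();
    @$dom->loadHTML(file_get_contents($source));

    // Snapshot the node list first: the live NodeList shrinks as nodes are replaced.
    $links = iterator_to_array($dom->getElementsByTagName('a'));
    foreach ($links as $link) {
        if (in_array($link->getAttribute('href'), $skipList)) {
            $text = $dom->createTextNode($link->textContent);
            $link->parentNode->replaceChild($text, $link);
        }
    }
    return $dom->saveHTML();
}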
