Extract links from a list of urls

Extract links from a list of urls - php

I am trying to extract all the links from a set list of or urls in a text file and save the extracted links in another text file. I am trying to use the script below which was originally meant to extract Emails:
I changed the the email extract part
// preg_match_all('/([\w+\.]*\w+#[\w+\.]*\w+[\w+\-\w+]*\.\w+)/is', $sPageContent, $aResults);
to extract links like this:
preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/is", $sPageContent, $aResults);
Here is the full code:
class getEmails
{
const EMAIL_STORAGE_FILE = 'links.txt';
public function __construct($sFilePath)
{
$aUrls = $this->getUrls($sFilePath);
foreach($aUrls as $sUrl) {
$rPage = $this->getContents($sUrl);
$this->getAndSaveEmails($rPage);
}
$this->removeDuplicate();
}
protected function getAndSaveEmails($sPageContent)
{
// preg_match_all('/([\w+\.]*\w+#[\w+\.]*\w+[\w+\-\w+]*\.\w+)/is', $sPageContent, $aResults);
preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/is", $sPageContent, $aResults);
foreach($aResults[1] as $sCurrentEmail) {
file_put_contents(self::EMAIL_STORAGE_FILE, $sCurrentEmail . "\r\n", FILE_APPEND);
}
}
protected function getContents($sUrl)
{
if (function_exists('curl_init')) {
$rCh = curl_init();
curl_setopt($rCh, CURLOPT_URL, $sUrl);
curl_setopt($rCh, CURLOPT_HEADER, 0);
curl_setopt($rCh, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($rCh, CURLOPT_FOLLOWLOCATION, 1);
$mResult = curl_exec($rCh);
curl_close($rCh);
unset($rCh);
return $mResult;
} else {
return file_get_contents($sUrl);
}
}
protected function getUrls($sFilePath)
{
return file($sFilePath);
}
protected function removeDuplicate()
{
$aEmails = file(self::EMAIL_STORAGE_FILE);
$aEmails = array_unique($aEmails);
file_put_contents(self::EMAIL_STORAGE_FILE, implode('', $aEmails));
}
}
new getEmails('sitemap_index.txt');
The problem i have with this is that it is supposed to get all links from a list of urls but it only scanned the first link and ignored the rest. I have 30 links that i want to extract from, how can i make the above code work?

you must using trim() at the url.. try add trim on your code
foreach($aUrls as $sUrl) {
$sUrl=trim($sUrl); //this
$rPage = $this->getContents($sUrl);
$this->getAndSaveEmails($rPage);
}

Related

How to retrieve broken links

I would like to retrieve broken links of a given website.
I have this code but it doesn't work.
Can you help me ?
// function to check url
function check_url($url) {
//echo "Test broken liens";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch , CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
$headers = curl_getinfo($ch);
curl_close($ch);
return $headers['http_code'];
}
if(check_url("https://www.amazon.com/")==200){
echo "<br> The link is validated <br>";
}else{
echo "<br>broken links<br>";
}
// this function check all the code of a website and retrieve the tag of a hyperlink
function getLinks(){
$html = file_get_contents('https://www.amazon.com/');
$dom = new domDocument;
#$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('a');
foreach ($images as $image) {
$file= $image->getAttribute('href')."<br>";
$lien= "https://www.amazon.com/".$file;
echo $lien;
echo existenceLien($lien);
}
}
echo getLinks();
// The target is to search the broken links in a website and worn the existence of those links
//check if link exist and display the result for each
function linkexistence($url){
// get the url
$test = get_headers($url , 1);
$message="";
// use preg_match function
if (preg_match("#HTTP/1.1 200i#", $test[0])) {
$message="Valid";
}elseif (preg_match("#HTTP/1.1 404i#", $test[0])) {
$message="Non-existent page ! (404)";
}elseif (preg_match("#HTTP/1.1 301i#", $test[0])) {
$message="The page has been moved";
}elseif (preg_match("#HTTP/1.1 403i#", $test[0])) {
$message="Access to the page refused! (403)";
}else {
$message="Invalid links";
}
return $message;
}*****

The mask is wrong in preg_match function, currently your mask is
#HTTP/1.1 200i#
but I believe that you have to use the following mask
#HTTP/1.1 200#i
thus you have to move the "i" after "#" in all your preg_match functions.
the "i" means the case sensitivity will be ignored

Search Files Nothing Found

I am trying to search (filter) for files in a Dropbox folder, but no files are being found when there are files that match the filter. I am not using the PHP library provided by Dropbox.
Here is an extract of the code:
class Dropbox {
private $headers = array();
private $authQueryString = "";
public $SubFolders = array();
public $Files = array();
function __construct() {
$this->headers = array('Authorization: OAuth oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="'.DROPBOX_APP_KEY.'", oauth_token="'.DROPBOX_OAUTH_ACCESS_TOKEN.'", oauth_signature="'.DROPBOX_APP_SECRET.'&'.DROPBOX_OAUTH_ACCESS_SECRET.'"');
$this->authQueryString = "oauth_consumer_key=".DROPBOX_APP_KEY."&oauth_token=".DROPBOX_OAUTH_ACCESS_TOKEN."&oauth_signature_method=PLAINTEXT&oauth_signature=".DROPBOX_APP_SECRET."%26".DROPBOX_OAUTH_ACCESS_SECRET."&oauth_version=1.0";
}
public function GetFolder($folder, $fileFilter = "") {
//Add the required folder to the end of the base path for folder call
if ($fileFilter == "")
$subPath = "metadata/sandbox";
else
$subPath = "search/sandbox";
if (strlen($folder) > 1) {
$subPath .= (substr($folder, 0, 1) != "/" ? "/" : "")
.$folder;
}
//Set up the post parameters for the call
$params = null;
if ($fileFilter != "") {
$params = array(
"query" => $fileFilter
);
}
//Clear the sub folders and files logged
$this->SubFolders = array();
$this->Files = array();
//Make the call
$content = $this->doCall($subPath, $params);
//Log the files and folders
for ($i = 0; $i < sizeof($content->contents); $i++) {
$f = $content->contents[$i];
if ($f->is_dir == "1") {
array_push($this->SubFolders, $f->path);
} else {
array_push($this->Files, $f->path);
}
}
//Return the content
return $content;
}
private function doCall($urlSubPath, $params = null, $filePathName = null, $useAPIContentPath = false) {
//Create the full URL for the call
$url = "https://api".($useAPIContentPath ? "-content" : "").".dropbox.com/1/".$urlSubPath;
//Initialise the curl call
$ch = curl_init();
//Set up the curl call
curl_setopt($ch, CURLOPT_HTTPHEADER, $this->headers);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
if ($params != null)
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$fh = null;
if ($filePathName != null) {
$fh = fopen($filePathName, "rb");
curl_setopt($context, CURLOPT_BINARYTRANSFER, true);
curl_setopt($context, CURLOPT_INFILE, $fh);
curl_setopt($context, CURLOPT_INFILESIZE, filesize($filePathName));
}
//Excecute and get the response
$api_response = curl_exec($ch);
if ($fh != null)
fclose($fh);
//Process the response into an array
$json_response = json_decode($api_response);
//Has there been an error
if (isset($json_response->error )) {
throw new Exception($json_response["error"]);
}
//Send the response back
return $json_response;
}
}
I then call the GetFolder method of Dropbox as such:
$dbx = new Dropbox();
$filter = "MyFilter";
$dbx->GetFolder("MyFolder", $filter);
print "Num files: ".sizeof($dbx->Files);
As I am passing $filter into GetFolder, it uses the search/sandbox path and creates a parameter array ($params) with the required query parameter in it.
The process works fine if I don't provide the $fileFilter parameter to GetFolder and all files in the folder are returned (uses the metadata/sandbox path).
Other methods (that are not in the extract for brevity) of the Dropbox class use the $params feature and they to work fine.
I have been using the Dropbpox API reference for guidance (https://www.dropbox.com/developers/core/docs#search)

At first glance, it looks like you're making a GET request to /search but passing parameters via CURLOPT_POSTFIELDS. Try using a POST or encoding the search query as a query string parameter.
EDIT
Below is some code that works for me (usage: php search.php <term>). Note that I'm using OAuth 2 instead of OAuth 1, so my Authorization header looks different from yours.
<?php
$access_token = '<REDACTED>';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://api.dropbox.com/1/search/auto');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Authorization:Bearer ' . $access_token));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('query' => $argv[1]));
$api_response = curl_exec($ch);
echo "Matching files:\n\t" . join("\n\t",
array_map(function ($file) {
return $file['path'];
}, json_decode($api_response, true)))."\n";
?>

Function Inside Function Can't Find Variable

I'm trying to use the curl headerfunction opt from a class. I've tried putting the functions inside the class normally, but curl can't find them that way. So I put them inside the function I need it in. Here's the part of code that's applicable:
$ht = array();
$t = array();
function htWrite($stream,$buffer)
{
$ht[] = $buffer;
return strlen($buffer);
}
function tWrite($stream,$buffer)
{
$t[] = $buffer;
return strlen($buffer);
}
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'htWrite');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'tWrite');
When I put an echo statement in htWrite for the buffer, it echos out just fine. But if I do a print_r statement on $ht later, it says that it's empty. Further investigation says that it's creating its own $ht variable, because if I remove that line, $ht is null according to the function. So what can I do to fix this?

You need to look at how you can specify object methods as callbacks:
class Foo {
public function Bar() {
// do whatever
}
public function Test() {
$ch = curl_init('http://www.google.com');
curl_setopt($ch, CURLOPT_HEADERFUNCTION, array($this, 'Bar'));
}
}

How to speed up class that checks links in HTML?

I have cobbled together a class that checks links. It works but it is slow:
The class basically parses a HTML string and returns all invalid links for href and src attributes. Here is how I use it:
$class = new Validurl(array('html' => file_get_contents('http://google.com')));
$invalid_links = $class->check_links();
print_r($invalid_links);
With HTML that has a lot of links it becomes really slow and I know it has to go through each link and follow it, but maybe someone with more experience can give me a few pointers on how to speed it up.
Here's the code:
class Validurl{
private $html = '';
public function __construct($params){
$this->html = $params['html'];
}
public function check_links(){
$invalid_links = array();
$all_links = $this->get_links();
foreach($all_links as $link){
if(!$this->is_valid_url($link['url'])){
array_push($invalid_links, $link);
}
}
return $invalid_links;
}
private function get_links() {
$xml = new DOMDocument();
#$xml->loadHTML($this->html);
$links = array();
foreach($xml->getElementsByTagName('a') as $link) {
$links[] = array('type' => 'url', 'url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
}
foreach($xml->getElementsByTagName('img') as $link) {
$links[] = array('type' => 'img', 'url' => $link->getAttribute('src'));
}
return $links;
}
private function is_valid_url($url){
if ((strpos($url, "http")) === false) $url = "http://" . $url;
if (is_array(#get_headers($url))){
return true;
}else{
return false;
}
}
}

First of all I would not push the links and images into an array, and then iterate through the array, when you could directly iterate the results of getElementsByTagName(). You'd have to do it twice for <a> and <img> tags, but if you separate the checking logic into a function, you just call that for each round.
Second, get_headers() is slow, based on comments from the PHP manual page. You should rather use cUrl in some way like this (found in a comment on the same page):
function get_headers_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$r = curl_exec($ch);
$r = split("\n", $r);
return $r;
}
UPDATE: and yes, some kind of caching could also help, e.g. an SQLITE database with one table for the link and the result, and you could purge that db like each day.

You could cache the results (in DB, eg: a key-value store), so that your validator assumes that if a link was valid it's going to be valid for 24 hours or a week or something like that.

How use CURLOPT_WRITEFUNCTION when download a file by CURL

My Class for download file direct from a link:
MyClass{
function download($link){
......
$ch = curl_init($link);
curl_setopt($ch, CURLOPT_FILE, $File->handle);
curl_setopt($ch,CURLOPT_WRITEFUNCTION , array($this,'__writeFunction'));
curl_exec($ch);
curl_close($ch);
$File->close();
......
}
function __writeFunction($curl, $data) {
return strlen($data);
}
}
I want know how to use CRULOPT_WRITEFUNCTION when download file.
Above code if i remove line:
curl_setopt($ch,CURLOPT_WRITEFUNCTION , array($this,'__writeFunction'));
Then it will run good, i can download that file.But if i use CURL_WRITEFUNCTION option i can't download file.

I know this is an old question, but maybe my answer will be of some help for you or someone else. Try this:
function get_write_function(){
return function($curl, $data){
return strlen($data);
}
}
I don't know exactly what you want to do, but with PHP 5.3, you can do a lot with the callback. What's really great about generating a function in this way is that the values passed through the 'use' keyword remain with the function afterward, kind of like constants.
function get_write_function($var){
$obj = $this;//access variables or functions within your class with the object variable
return function($curl, $data) use ($var, $obj) {
$len = strlen($data);
//just an example - you can come up with something better than this:
if ($len > $var){
return -1;//abort the download
} else {
$obj->do_something();//call a class function
return $len;
}
}
}
You can retrieve the function as a variable as follows:
function download($link){
......
$var = 5000;
$write_function = $this->get_write_function($var);
$ch = curl_init($link);
curl_setopt($ch, CURLOPT_FILE, $File->handle);
curl_setopt($ch, CURLOPT_WRITEFUNCTION , $write_function);
curl_exec($ch);
curl_close($ch);
$File->close();
......
}
That was just an example. You can see how I used it here: Parallel cURL Request with WRITEFUNCTION Callback. I didn't actually test all of this code, so there may be minor errors. Let me know if you have problems, and I'll fix it.

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_BUFFERSIZE, 8096);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, 'http://blog.ronnyristau.de/wp-content/uploads/2008/12/php.jpg');
$content = curl_exec($ch);
curl_close($ch);
$out = fopen('/tmp/out.png','w');
if($out){
fwrite($out, $content);
fclose($out);
}

Why do you use curl to download a file? Is there a special reason? You can simply use fopen and fread
I have written a small class for it.
<?php
class Utils_FileDownload {
private $source;
private $dest;
private $buffer;
private $overwrite;
public function __construct($source,$dest,$buffer=4096,$overwrite=false){
$this->source = $source;
$this->dest = $dest;
$this->buffer = $buffer;
$this->overwrite = $overwrite;
}
public function download(){
if($this->overwrite||!file_exists($this->dest)){
if(!is_dir(dirname($this->dest))){mkdir(dirname($this->dest),0755,true);}
if($this->source==""){
$resource = false;
Utils_Logging_Logger::getLogger()->log("source must not be empty.",Utils_Logging_Logger::TYPE_ERROR);
}
else{ $resource = fopen($this->source,"rb"); }
if($this->source==""){
$dest = false;
Utils_Logging_Logger::getLogger()->log("destination must not be empty.",Utils_Logging_Logger::TYPE_ERROR);
}
else{ $dest = fopen($this->dest,"wb"); }
if($resource!==false&&$dest!==false){
while(!feof($resource)){
$read = fread($resource,$this->buffer);
fwrite($dest,$read,$this->buffer);
}
chmod($this->dest,0644);
fclose($dest); fclose($resource);
return true;
}else{
return false;
}
}else{
return false;
}
}
}

It seems like cURL uses your function instead of writing to the request once CURLOPT_WRITEFUNCTION is specified.
So the correct solution would be :
MyClass{
function download($link){
......
$ch = curl_init($link);
curl_setopt($ch, CURLOPT_FILE, $File->handle);
curl_setopt($ch,CURLOPT_WRITEFUNCTION , array($this,'__writeFunction'));
curl_exec($ch);
curl_close($ch);
$File->close();
......
}
function __writeFunction($curl, $data) {
echo $data;
return strlen($data);
}
}
This can also handle binary files as well.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Extract links from a list of urls - php

you must using trim() at the url.. try add trim on your code foreach($aUrls as $sUrl) { $sUrl=trim($sUrl); //this $rPage = $this->getContents($sUrl); $this->getAndSaveEmails($rPage); }

Related

How to retrieve broken links

Search Files Nothing Found

Function Inside Function Can't Find Variable

How to speed up class that checks links in HTML?

How use CURLOPT_WRITEFUNCTION when download a file by CURL

Categories

Resources