How to retrieve broken links - PHP

I would like to retrieve the broken links of a given website.
I have this code, but it doesn't work.
Can you help me?
// Function to check a URL and return its HTTP status code.
function check_url($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($ch);
    $headers = curl_getinfo($ch);
    curl_close($ch);
    return $headers['http_code'];
}
if (check_url("https://www.amazon.com/") == 200) {
    echo "<br>The link is valid<br>";
} else {
    echo "<br>Broken link<br>";
}
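As a side note, a lighter variant of check_url() is possible: ask cURL for the headers only, so the whole page body isn't downloaded just to read the status code. This is a sketch, not a drop-in replacement, since some servers answer HEAD requests differently than GET:

// Sketch: HEAD-style status check, so no response body is transferred.
function check_url_head($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // send HEAD instead of GET
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo anything
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code;
}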
// This function fetches a page, parses the HTML, and walks every hyperlink tag.
function getLinks() {
    $html = file_get_contents('https://www.amazon.com/');
    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false;
    @$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        $lien = "https://www.amazon.com/" . $href;
        echo $lien . "<br>";
        echo linkexistence($lien);
    }
}
getLinks();
// The goal is to find the broken links on a website and warn about them.
// Check whether each link exists and display the result for each one.
function linkexistence($url) {
    // Get the response headers for the URL.
    $test = get_headers($url, 1);
    $message = "";
    // Use preg_match to inspect the status line.
    if (preg_match("#HTTP/1.1 200i#", $test[0])) {
        $message = "Valid";
    } elseif (preg_match("#HTTP/1.1 404i#", $test[0])) {
        $message = "Non-existent page! (404)";
    } elseif (preg_match("#HTTP/1.1 301i#", $test[0])) {
        $message = "The page has been moved";
    } elseif (preg_match("#HTTP/1.1 403i#", $test[0])) {
        $message = "Access to the page refused! (403)";
    } else {
        $message = "Invalid link";
    }
    return $message;
}

The pattern in your preg_match calls is wrong. Currently your pattern is
#HTTP/1.1 200i#
but it should be
#HTTP/1.1 200#i
so you have to move the "i" after the closing "#" in all of your preg_match calls.
The "i" is a modifier that makes the match case-insensitive.

Related

How to extract the direct Sibnet video url PHP

I've been searching for a solution to this problem for a long time and haven't found one.
I managed to extract the mp4 URL, but the problem is that this link redirects to another URL, which can be seen in the Location response header. I don't know how I can get this final URL.
(Screenshot of the response headers omitted.)
<?php
// Clean up raw HTML with Tidy (if available) so DOMDocument can parse it.
function tidy_html($input_string) {
    $config = array('output-html' => true, 'indent' => true, 'wrap' => 800);
    // Detect whether the Tidy extension is configured
    if (function_exists('tidy_get_release')) {
        $tidy = new tidy;
        $tidy->parseString($input_string, $config, 'raw');
        $tidy->cleanRepair();
        $cleaned_html = tidy_get_output($tidy);
    } else {
        # Tidy is not configured on this server
        $cleaned_html = $input_string;
    }
    return $cleaned_html;
}
function getFromPage($webAddress, $path) {
    $source = file_get_contents($webAddress); // download the page
    $clean_source = tidy_html($source);
    $doc = new DOMDocument;
    // suppress errors
    libxml_use_internal_errors(true);
    // load the html source from a string
    $doc->loadHTML($clean_source);
    $xpath = new DOMXPath($doc);
    $data = "";
    $nodelist = $xpath->query($path);
    $node_counts = $nodelist->length; // count how many nodes were returned
    if ($node_counts) { // true if the count is more than 0
        foreach ($nodelist as $element) {
            $data = $data . $element->nodeValue . "\n";
        }
    }
    return $data;
}
$vidID = 4145616; // videoid: https://video.sibnet.ru/shell.php?videoid=4145616
$link1 = getFromPage("https://video.sibnet.ru/shell.php?videoid=" . $vidID, "/html/body/script[21]/text()"); // use XPath
$json = urldecode($link1);
$link2 = strstr($json, "player.src");
$url = substr($link2, 0, strpos($link2, ","));
$url = str_replace('"', "", $url);
$url = substr($url, 18);
//header('Location: https://video.sibnet.ru'.$url);
echo ('https://video.sibnet.ru' . $url);
?>
<?php
$url = 'https://video.sibnet.ru' . $url;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // this is what you need: the last effective URL
$realUrl = $url; // here you go
?>
SOURCE: https://stackoverflow.com/a/17473000/14885297
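If you only need the Location target itself rather than the final page, a sketch of an alternative: leave CURLOPT_FOLLOWLOCATION off and read CURLINFO_REDIRECT_URL instead (available in reasonably modern PHP/libcurl builds), which avoids downloading any body at all:

$ch = curl_init('https://video.sibnet.ru' . $url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // don't follow, just report
curl_exec($ch);
$redirectUrl = curl_getinfo($ch, CURLINFO_REDIRECT_URL); // the Location target ('' if none)
curl_close($ch);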

get title tag value using DOMDocument

I want to get the value of the <title> tag for all the pages of my website. I am trying to run the script only on my own domain, to get all the page links on my website and their titles.
This is my code:
$html = file_get_contents('http://xxxxxxxxx.com');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
What I get is: z2. I get z1.html and z2....
My z1.html has a title named z3. I want to get z1.html and z3, not z2. Can anyone help me?
Adding a bit to hitesh's answer: check whether the elements have attributes and whether the desired attribute exists, and also whether getting the 'title' elements actually returns at least one item before trying to grab the first one ($a_html_title->item(0)).
I also added an option for cURL to follow redirects (I needed it for my hardcoded test with google.com).
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    if ($link->hasAttributes()) {
        if ($link->hasAttribute('href')) {
            $href = $link->getAttribute('href');
            $href = 'http://google.com'; // hardcoded just for testing
            echo $link->nodeValue;
            echo "<br/>" . 'MY ANCHOR LINK : - ' . $href . "---TITLE--->";
            $a_html = my_curl_function($href);
            $a_doc = new DOMDocument();
            @$a_doc->loadHTML($a_html); // @ suppresses parse warnings
            $a_html_title = $a_doc->getElementsByTagName('title');
            //get and display what you need:
            if ($a_html_title->length) {
                $a_html_title = $a_html_title->item(0)->nodeValue;
                echo $a_html_title;
                echo '<br/>';
            }
        }
    }
}
function my_curl_function($url) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $url);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, TRUE); // added this
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);
    return $html;
}
You need to make your own custom function and call it in the appropriate places. If you need to get multiple tags from the pages linked in each anchor tag, just create a new custom function.
The code below will help you get started:
$html = my_curl_function('http://www.anchorartspace.org/');
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses parse warnings
$mytag = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $mytag->item(0)->nodeValue;
$links = $doc->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo "<br/>" . 'MY ANCHOR LINK : - ' . $link->getAttribute('href') . "---TITLE--->";
    $a_html = my_curl_function($link->getAttribute('href'));
    $a_doc = new DOMDocument();
    @$a_doc->loadHTML($a_html); // @ suppresses parse warnings
    $a_html_title = $a_doc->getElementsByTagName('title');
    //get and display what you need:
    $a_html_title = $a_html_title->item(0)->nodeValue;
    echo $a_html_title;
    echo '<br/>';
}
echo "Title: $title" . '<br/><br/>';
function my_curl_function($url) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $url);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);
    return $html;
}
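One caveat with both snippets above: DOMDocument returns each href exactly as written in the page, so relative links like /about.html will fail when handed to my_curl_function(). A rough sketch that prefixes them with a base URL first (absolutize() is a hypothetical helper; it ignores edge cases such as protocol-relative // URLs):

// Sketch: turn a relative href into an absolute URL before curling it.
function absolutize($href, $base) {
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href; // already absolute (has a scheme)
    }
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

$a_html = my_curl_function(absolutize($link->getAttribute('href'), 'http://www.anchorartspace.org'));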
Let me know if you need any more help.

Extracting URL from tweets and getting numbers of tweets containing that url

Here I am extracting the URL from tweets and converting it to its long form.
Then I get the count of tweets containing that URL.
if (preg_match($reg_exUrl, $tweet, $url)) {
    preg_match_all($reg_exUrl, $tweet, $urls);
    foreach ($urls[0] as $url) {
        echo "Tiny url : {$url}<br>";
        $full = MyURLDecode($url);
        echo "Full url : $full<br>";
        if (strpos($full, '//t.co') === true)
            continue;
        if (strpos($full, '//twitter.com') === true)
            continue;
        else if (strpos($full, '//bit.ly') === true)
            $full = MyURLDecode($full);
        $url_count = get_twitter_url_count($full);
        echo "Url count: $url_count";
        //echo "Number of tweets containing this link: ", $code['count'];
        echo "<br>";
    }
} else {
    echo "Mismatch<br>";
}
function MyURLDecode($url)
{
    $ch = @curl_init($url);
    @curl_setopt($ch, CURLOPT_HEADER, TRUE);
    @curl_setopt($ch, CURLOPT_NOBODY, TRUE);
    @curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
    @curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $url_resp = @curl_exec($ch);
    preg_match('/Location:\s+(.*)\n/i', $url_resp, $i);
    if (!isset($i[1])) {
        return $url;
    }
    return $i[1];
}
function get_twitter_url_count($url) {
    $encoded_url = urlencode($url);
    $content = @file_get_contents('http://urls.api.twitter.com/1/urls/count.json?url=' . $encoded_url);
    return $content ? json_decode($content)->count : 0;
}
Problems:
If full_url is again a short URL, get the actual long URL.
If the URL is a link to a Twitter photo like http://twitter.com/ADSPLAYINDIA/status/415847973210181632/photo/1, then skip getting the tweet count.
I added continue, but it still does not skip it.
For the first problem, try setting follow location to true in your MyURLDecode function:
@curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
For your second problem, I think strpos will never return true. Check out this link to a comment on php.net: http://www.php.net/manual/en/function.strpos.php#107240
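In other words, compare against false explicitly. A sketch of the skip logic with that fix applied:

// strpos() returns an integer offset or false, never true.
if (strpos($full, '//t.co') !== false || strpos($full, '//twitter.com') !== false) {
    continue; // skip t.co links and twitter.com photo pages
}
if (strpos($full, '//bit.ly') !== false) {
    $full = MyURLDecode($full); // unshorten a second time
}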
Please let me know if it helped
Thanks

php strange looping problem

Sorry for the long code, I'm really losing it.
This code is supposed to get a list of URLs through POST, from a textarea with a line break between each URL. The script should download each URL, go through the HTML and collect some links, then visit those links, extract some data, and echo it out.
For some reason, it visually looks as if I'm running getDetails() only once, as I'm getting only one set of results.
I have checked multiple times that the foreach loop handles each URL separately, and that part is working.
Can anyone spot the problem?
require_once('simple_html_dom.php');

function getDetails($html) {
    $dom = new simple_html_dom;
    $dom->load($html);
    $title = $dom->find('h1', 0)->find('a', 0);
    foreach ($dom->find('span[style="color:#333333"]') as $element) {
        $address = $element->innertext;
    }
    $address = str_replace("<br>", " ", $address);
    $address = str_replace(",", " ", $address);
    $title->innertext = str_replace(",", " ", $title->innertext);
    if ($address == "") {
        $exp = explode("<strong><strong>", $html);
        $exp2 = explode("</strong>", $exp[1]);
        $address = $exp2[0];
    }
    echo $title->innertext . "," . $address . "<br>";
}

function getHtml($Url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

function getdd($u) {
    $html = getHtml($u);
    $dom = new simple_html_dom;
    $dom->load($html);
    $durls = array(); // initialize so array_unique() never sees null
    foreach ($dom->find('a') as $element) {
        if (strstr($element->href, "display_one.asp")) {
            $durls[] = $element->href;
        }
    }
    return $durls;
}

if (isset($_POST['url'])) {
    $urls = explode("\n", $_POST['url']);
    foreach ($urls as $u) {
        $durls2 = getdd($u);
        $durls2 = array_unique($durls2);
        foreach ($durls2 as $durl) {
            $d = getHtml("http://www.example.co.il/" . $durl);
            getDetails($d);
        }
    }
}
You're only assigning the last element in the loop, it looks like. You'll need to concatenate: something like $address .= $element->innertext; inside the loop (note the .= instead of =).
Edit: unless I'm mistaking what it's supposed to be doing. I think I may have been focusing on the wrong part of the code.
When you use DOMDocument on HTML, you load it with $dom->loadHTMLFile() or $dom->loadHTML(). You should also call libxml_use_internal_errors(true) beforehand so that it will not crash because of improperly formatted HTML.
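A minimal sketch of that pattern:

libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);            // malformed markup no longer floods the output
libxml_clear_errors();            // discard the collected errors when done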

PHP: Check if URL redirects?

I have implemented a function that runs on each page that I want to restrict from non-logged-in users. The function automatically redirects the visitor to the login page if he or she is not logged in.
I would like to make a PHP function that runs from an external server and iterates through a number of set URLs (an array with the URL of each protected page) to see whether they are redirected or not. That way I could easily make sure the protection is up and running on every page.
How could this be done?
Thanks.
$urls = array(
    'http://www.apple.com/imac',
    'http://www.google.com/'
);

$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $out = curl_exec($ch);

    // line endings are the wonkiest piece of this whole thing
    $out = str_replace("\r", "", $out);

    // only look at the headers
    $headers_end = strpos($out, "\n\n");
    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }

    $headers = explode("\n", $out);
    foreach ($headers as $header) {
        if (substr($header, 0, 10) == "Location: ") {
            $target = substr($header, 10);
            echo "[$url] redirects to [$target]<br>";
            continue 2;
        }
    }
    echo "[$url] does not redirect<br>";
}
I use cURL and fetch only the headers; afterwards I compare my URL with the final URL reported by cURL:
$url="http://google.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, '60'); // in seconds
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$res = curl_exec($ch);
if(curl_getinfo($ch)['url'] == $url){
echo "not redirect";
}else {
echo "redirect";
}
You could always try adding:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
since a 302 means the resource moved; this lets the cURL call follow the redirect and return whatever the target URL returns.
Getting the headers with get_headers() and checking if Location is set is much simpler.
$urls = [
    "https://example-1.com",
    "https://example-2.com"
];

foreach ($urls as $key => $url) {
    $is_redirect = does_url_redirect($url) ? 'yes' : 'no';
    echo $url . ' is redirected: ' . $is_redirect . PHP_EOL;
}

function does_url_redirect($url) {
    $headers = get_headers($url, 1);
    if (!empty($headers['Location'])) {
        return true;
    } else {
        return false;
    }
}
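One note: get_headers() issues a full GET by default, so the body is downloaded just to be thrown away. A sketch that switches the default request method to HEAD first (assuming it's acceptable that stream_context_set_default changes the default for all later stream calls):

stream_context_set_default(array(
    'http' => array('method' => 'HEAD')
));
$headers = get_headers('https://example-1.com', 1); // now sends HEAD instead of GET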
I'm not sure whether this really makes sense as a security check.
If you are worried about files being called directly, without your "is the user logged in?" check being run, you could do what many big PHP projects do: in the central include file (where the security check is done), define a constant BOOTSTRAP_LOADED or whatever, and in every file check whether that constant is set.
Testing is great and security testing is even better, but I'm not sure what kind of flaw you are looking to uncover with this. To me, this idea feels like a waste of time that will not bring any real additional security.
Just make sure your script die()s after the header("Location: ...") redirect. That is essential to stop additional content from being displayed after the header command (a missing die() wouldn't be caught by your idea, by the way, as the redirect header would still be issued...).
If you really want to do this, you could also use a tool like wget and feed it a list of URLs. Have it fetch the results into a directory, and check (e.g. by looking at the file sizes, which should be identical) whether every page contains the login dialog. Just to add another option...
Do you want to check the HTTP code to see if it's a redirect?
$params = array('http' => array(
    'method' => 'HEAD',
    'ignore_errors' => true
));
$context = stream_context_create($params);

foreach (array('http://google.com', 'http://stackoverflow.com') as $url) {
    $fp = fopen($url, 'rb', false, $context);
    $result = stream_get_contents($fp);
    if ($result === false) {
        throw new Exception("Could not read data from {$url}");
    } else if (!strstr($http_response_header[0], '301')) {
        // Do something here
    }
}
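Note that strstr($http_response_header[0], '301') only catches 301 specifically. A sketch that treats any 3xx status line as a redirect (the regex assumes the usual status-line shape, e.g. "HTTP/1.1 302 Found"):

// Match any 3xx status code in the first response header line.
if (preg_match('#^HTTP/\S+\s+3\d\d#', $http_response_header[0])) {
    // the URL answered with a redirect status
}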
I hope it will help you:
function checkRedirect($url)
{
    // Use the associative form so the Location header can be read by name
    // instead of relying on a hardcoded numeric index.
    $headers = get_headers($url, 1);
    if ($headers && isset($headers[0])) {
        if (strpos($headers[0], '302') !== false && isset($headers['Location'])) {
            // this is the URL where it's redirecting; with several hops
            // Location is an array, so take the last entry
            return is_array($headers['Location']) ? end($headers['Location']) : $headers['Location'];
        }
    }
    return false;
}
$isRedirect = checkRedirect($url);
if (!$isRedirect) {
    echo "URL Not Redirected";
} else {
    echo "URL Redirected to: " . $isRedirect;
}
You can use a session: if the session array is not set, the URL is redirected to a login page.
I modified Adam Backstrom's answer and implemented chiborg's suggestion (download only the HEAD). It does one thing more: it checks whether the redirection stays on the same server or goes elsewhere. Example: terra.com.br redirects to terra.com.br/portal. PHP will consider that a redirect, which is correct, but I only wanted to list URLs that redirect to a different site.
function RedirectURL() {
    $urls = array('http://www.terra.com.br/', 'http://www.areiaebrita.com.br/');
    foreach ($urls as $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // chiborg's suggestion: fetch headers only
        curl_setopt($ch, CURLOPT_NOBODY, true);
        // ================================
        // READ URL
        // ================================
        curl_setopt($ch, CURLOPT_URL, $url);
        $out = curl_exec($ch);
        // line endings are the wonkiest piece of this whole thing
        $out = str_replace("\r", "", $out);
        $headers = explode("\n", $out);
        foreach ($headers as $header) {
            if (substr(strtolower($header), 0, 9) == "location:") {
                // Read the Location URL to check whether it redirects to a page on the
                // same server or to another one.
                // terra.com.br redirects to terra.com.br/portal: that is fine.
                // areiaebrita.com.br redirects to bwnet.com.br: that is what we flag.
                // Some servers redirect to a folder by citing only the folder,
                // e.g. net11.com.br redirects to just /heiden, so only act when the
                // Location value contains "http".
                if (strpos(strtolower($header), "http") !== false) {
                    $address = explode("/", $header);
                    // after the explode on "/":
                    // $address[2] = www.terra.com.br
                    // $address[3] = portal
                    $host = parse_url($url, PHP_URL_HOST); // e.g. www.terra.com.br
                    // Check whether the original host still appears in the Location host.
                    // If it does, the redirect stayed on the same site.
                    if (strpos(strtolower($address[2]), strtolower($host)) !== false) {
                        echo "URL NOT REDIRECTED: " . $url . "<br>";
                    } else {
                        // redirected away from the original site (areiaebrita)
                        echo "SORRY, URL REDIRECT WAS FOUND: " . $url . "<br>";
                    }
                }
            }
        }
    }
}
function unshorten_url($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $out = curl_exec($ch);
    $real_url = $url; // default (if there is no redirect)
    if (preg_match("/location: (.*)/i", $out, $redirect))
        $real_url = trim($redirect[1]); // trim the trailing \r from the header line
    if (strstr($real_url, "bit.ly")) // the redirect is another shortened URL
        $real_url = unshorten_url($real_url);
    return $real_url;
}
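Usage is a one-liner (the short link below is a made-up placeholder):

echo unshorten_url('http://bit.ly/example'); // hypothetical short URL, for illustration only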
I have just made a function that checks whether a URL exists or not:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

function url_exists($url, $ch) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $out = curl_exec($ch);

    // line endings are the wonkiest piece of this whole thing
    $out = str_replace("\r", "", $out);

    // only look at the headers
    $headers_end = strpos($out, "\n\n");
    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }

    $headers = explode("\n", $out);
    foreach ($headers as $header) {
        if (strpos($header, 'HTTP/1.1 200 OK') !== false) {
            return true;
        }
    }
    return false; // no 200 status line found
}
Now I use an array of URLs to check whether each one exists:
$my_url_array = array('http://howtocode.pk/result', 'http://google.com/jobssss', 'https://howtocode.pk/javascript-tutorial/', 'https://www.google.com/');

for ($j = 0; $j < count($my_url_array); $j++) {
    if (url_exists($my_url_array[$j], $ch)) {
        echo 'This URL "' . $my_url_array[$j] . '" exists.<br>';
    }
}
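Matching the literal string 'HTTP/1.1 200 OK' is brittle: an HTTP/1.0 or HTTP/2 server words the status line differently. A sketch of the same check that asks cURL for the numeric status code instead (url_exists_by_code is a hypothetical name, not part of the original code):

function url_exists_by_code($url, $ch) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true); // the headers are enough here
    curl_exec($ch);
    // compare the numeric status code instead of parsing the status line
    return curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
}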
I can't understand your question.
You have an array of URLs and you want to know whether the user came from one of the listed URLs?
If I'm understanding your question correctly:
$urls = array('http://url1.com', 'http://url2.ru', 'http://url3.org');

if (in_array($_SERVER['HTTP_REFERER'], $urls)) {
    echo 'FROM ARRAY';
} else {
    echo 'NOT FROM ARRAY';
}
