Web crawler links/page logic in PHP

I'm writing a basic crawler that simply caches pages with PHP.
All it does is use file_get_contents to fetch a web page and a regex to pull out every link of the form <a href="URL">DESCRIPTION</a>. At the moment it returns:
Array {
[url] => URL
[desc] => DESCRIPTION
}
The problem I'm having is figuring out the logic behind determining whether the page link is local or sussing out whether it may be in a completely different local directory.
It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html" or href="google.com" or href="page.html". The possibilities are endless.
What would be the correct algorithm to approach this? I don't want to lose any data that could be important.

First of all, regex and HTML don't mix. Use:
$doc = new DOMDocument();
@$doc->loadHTML($source); // @ silences warnings about invalid markup
foreach ($doc->getElementsByTagName('a') as $a)
{
    $href = $a->getAttribute('href');
}
Links that may go outside your site start with a protocol or with //, e.g.
http://example.com
//example.com/
href="google.com" is a link to a local file.
But if you want to create static copy of a site, why not just use wget?
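As a rough sketch of the DOM approach above, the following collects each link in the same url/desc shape the question describes. The function name extract_links and the sample HTML are made up for illustration, and the code assumes PHP's DOM extension is available and that the input is non-empty.

```php
<?php
// Sketch: collect every <a href> in $html as ['url' => ..., 'desc' => ...].
// extract_links is a hypothetical helper name, not part of PHP.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // Skip named anchors that have no href at all.
        if ($a->hasAttribute('href')) {
            $links[] = [
                'url'  => trim($a->getAttribute('href')),
                'desc' => trim($a->textContent),
            ];
        }
    }
    return $links;
}

// Made-up sample input.
$links = extract_links('<p><a href="../folder/page.html">A page</a> <a href="//example.com/">Elsewhere</a></p>');
print_r($links);
```

The href values come back exactly as written in the page, so deciding whether each one is local is a separate step.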

Let's first consider the properties of local links.
These will be either:
relative, with no scheme and no host, or
absolute, with a scheme of 'http' or 'https' and a host that matches the machine on which the script is running.
That's all the logic you need to identify whether a link is local.
Use the parse_url function to separate out the different components of a URL to identify the scheme and host.
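A minimal sketch of that check, assuming you know the host name of the site being crawled. The function name is_local_link and the parameter $siteHost are hypothetical. Note that the PHP manual warns that parse_url may not give correct results for relative or invalid URLs, so treat this as a starting point rather than a complete classifier.

```php
<?php
// Hypothetical helper: decide whether $href points at the site being crawled.
// $siteHost is an assumption: the host name of the site you started from.
function is_local_link(string $href, string $siteHost): bool
{
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL: treat as not local
    }
    // No host means a relative link (page.html, ../dir/page.html, /dir/page.html).
    if (!isset($parts['host'])) {
        // A scheme without a host (mailto:, tel:) is not a page link.
        return !isset($parts['scheme']);
    }
    // Scheme-relative (//host/...) or absolute: compare hosts case-insensitively.
    $scheme = strtolower($parts['scheme'] ?? 'http');
    return in_array($scheme, ['http', 'https'], true)
        && strcasecmp($parts['host'], $siteHost) === 0;
}

var_dump(is_local_link('../folder/folder2/blah/page.html', 'example.com'));
var_dump(is_local_link('//other.org/', 'example.com'));
```

A stricter crawler might also treat www.example.com and example.com as the same host; that policy is left out here.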

You would have to look for http:// in the href. Otherwise, you could check whether it starts with ./ or ../ or any combination of them. If you don't find a "/" at all, you would have to assume that it's a file in the current directory. Would you like a script for this?

Related

What is the function of using php for site links?

I am working on a site and the builders have used a mix of php and html for links. For example:
<li>Variable Speed Drives</li>
<li>Corrosion Resistant Baseplates</li>
and
<li>MP Repair</li>
<li>MTA Repair</li>
The php is referenced in another file in this way:
<?php
$pdf_link = "../pdf/";
$external_pdf_link = "../../pdf/";
$video_link = "../video/";
$external_video_link = "../../video/";
?>
My concern is not knowing the function of the php, other than it being a placeholder, and given that the links work both ways, I don't want to break something because I am clueless to its purpose.
In doing my due diligence researching, I ran across this post, which is close, but still no cigar, Add php variable inside echo statement as href link address?. All of the research seems to be about how rather than why. This is the site, and they only used it for the "Downloads" links: http://magnatexpumps.com/
Thank you...
B
There is no right way. They are just different.
Let's forget the PHP for a while. If you have this link in a page:
<a href='about.html'>About</a>
What will happen? The browser resolves the link against the URL of the current document. If you are at the root of the site, "www.example.com", it takes you to "www.example.com/about.html". If you are at a URL like "www.example.com/news/index.html", it takes you to "www.example.com/news/about.html". That's why it is sometimes useful to put a variable in front of the link, to force a full-path URL.
Another case for interpolating a variable into a URL is when different systems run under the same URL. In that case you have to add the system name to the path to get where you want. If you don't know whether your application will run at the document root or in a subfolder, use a variable to hold the base path.
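To illustrate the base-path idea, here is a small sketch. The variable $base_path, its value '/shop', and the helper site_link are all hypothetical and not taken from the site in the question.

```php
<?php
// Hypothetical configuration value: the URL path under which the site is served.
// Use '' when the site lives at the document root, or e.g. '/shop' for a subfolder.
$base_path = '/shop';

// Hypothetical helper: prefix a site-relative path with the base path.
function site_link(string $base_path, string $path): string
{
    return rtrim($base_path, '/') . '/' . ltrim($path, '/');
}

// The link now resolves the same way from any page of the site.
echo '<a href="' . htmlspecialchars(site_link($base_path, 'about.html')) . '">About</a>';
```

Moving the site to another folder then only means changing the one variable.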

How to properly create relative links

In my source code, there is a button link like this:
then the web page's code is like this:
but when I click the button, the URL is:
http://localhost/personal/applications/mywebtest/install/?step=2
Why is "/personal/applications" added?
Edit: Let me start with...
How the href attribute works
Let's say you are on the page http://example.com/foo/bar.html and there is a hyperlink on it. Its href can be one of the following:
path/newpage.html - this is a relative path (no slash in front), which will take you to http://example.com/foo/path/newpage.html
/path/newpage.html - this is an absolute path (it starts with a slash), which will take you to http://example.com/path/newpage.html
example.com/otherpage.html - this is still a relative path (as in the first example), so it will take you to http://example.com/foo/example.com/otherpage.html *
http://example.com/path/newpage.html - this is an absolute URL (it starts with a protocol), so the browser doesn't need to do any guessing and can take you straight to http://example.com/path/newpage.html
* This is because the browser doesn't know whether you mean the domain example.com or a directory named example.com, so the standard is to treat it as a directory.
(This assumes that no base is set; please read the rest of the answer or take a look at https://www.w3schools.com/tags/tag_base.asp)
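The resolution rules above can be sketched in PHP roughly as follows. This is a simplified illustration, not a full implementation of the URL standard: the function name resolve_href is made up, it assumes the base is an absolute http(s) URL, and it ignores ports, the base tag, and hrefs that consist only of a query string or a fragment.

```php
<?php
// Simplified sketch: resolve $href against the page URL $base.
// Assumes $base is an absolute http(s) URL; ignores <base>, ports,
// and query-only or fragment-only hrefs.
function resolve_href(string $base, string $href): string
{
    // Absolute URL: starts with a scheme such as http:// or https://.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }
    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'];
    // Scheme-relative URL (//host/path): reuse only the current scheme.
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href;
    }
    if (strpos($href, '/') === 0) {
        // Absolute path: replaces the whole path of the base.
        $path = $href;
    } else {
        // Relative path: appended to the directory part of the base path.
        $basePath = $b['path'] ?? '/';
        $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $href;
    }
    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $origin . '/' . ltrim(implode('/', $out), '/');
}

echo resolve_href('http://example.com/foo/bar.html', '../folder/page.html');
```

For a crawler, running every extracted href through a function like this gives one canonical absolute URL per page, which makes duplicate detection much simpler.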
Now the original answer
When you use relative links (links without a domain name) in <a> elements, your browser needs a whole URL (with a domain name) to fulfill the request when you click the link. So it takes the current protocol (http) and domain (localhost) and glues them to $_SERVER['PHP_SELF'], which is
The filename of the currently executing script, relative to the
document root. For instance, $_SERVER['PHP_SELF'] in a script at the
address http://example.com/foo/bar.php would be /foo/bar.php.
(https://secure.php.net/manual/en/reserved.variables.server.php)
So you can either create full URLs for the href, as suggested here: https://stackoverflow.com/a/46359685/299774
href="<?php echo 'http://' . $_SERVER['HTTP_HOST'] . '/mywebtest/install/'; ?>?step=2"
But this can cause problems if you save such HTML in the DB (for example, as a post in your blog): moving your app to a different domain would then require rewriting the stored content with the new domain.
So I would stick to relative values in href. You can accomplish that by setting the <base> tag in your HTML: https://stackoverflow.com/a/6848509/299774
or by using mod_rewrite or a similar tool for your server. It has been a long time since I did that, and it is outside the scope of this question, but you can check how popular frameworks (CakePHP, Laravel) do it.
(And they have to do it, because being able to move app between domains is a must: local testing, staging, production)
That's because PHP_SELF returns the path of the current script relative to the document root, so it includes every directory between the document root and the script. PHP itself is unaware of where/how you intended the page to be served. You will have to find another variable or strategy for your link. Also, if you really want to link to the same page, you could just use ?step=2.
You can use $_SERVER['HTTP_HOST']:
<input type="button" name="step" value="Continue to step 2 of 3"
onClick="location.href = '<?php echo (isset($_SERVER['HTTPS'])?"https" : "http")."://". $_SERVER['HTTP_HOST']."/mywebtest/install/"; ?>?step=2';"/>

How to detect the path to the application root?

I'm trying to dynamically detect the root directory of my page in order to direct to a specific script.
echo ($_SERVER['DOCUMENT_ROOT']);
It prints /myName/folder/index.php
I'd like to use in a html-file to enter a certain script like this:
log out
This seems to be in bad syntax, the path is not successfully resolved.
What's the proper approach to detect the path to logout.php?
The same question in different words:
How can I reliably achieve the path to the root directory (which contains my index.php) from ANY subdirectory? No matter if the html file is in /lib/subfolder or in /anotherDirectory, I want it to have a link directing to /lib/logout.php
On my machine it's supposed to be http://localhost/myName/folder (which contains index.php and all subdirectories), on someone else's it might be http://localhost/project
How can I detect the path to application root?
After some clarification from the OP it became possible to answer this question.
If you have some configuration file being included in all php scripts, placed in the app's root folder you can use this file to determine your application root:
$approot = substr(dirname(__FILE__),strlen($_SERVER['DOCUMENT_ROOT']));
The __FILE__ constant gives you the filesystem path of this file. If you subtract DOCUMENT_ROOT from it, the rest is what you're looking for. It can then be used in your templates:
log out
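The subtraction described above can be tried out with sample paths in a sketch like the following. The function app_root and both sample paths are made up for illustration; in a real config file you would pass dirname(__FILE__) and $_SERVER['DOCUMENT_ROOT'] instead.

```php
<?php
// Sketch of the same subtraction as a function, so it can be tried with sample paths.
// $configDir stands in for dirname(__FILE__) of the config file in the app root;
// $docRoot stands in for $_SERVER['DOCUMENT_ROOT']. Both sample values are made up.
function app_root(string $configDir, string $docRoot): string
{
    // Normalize Windows backslashes and a trailing slash on the document root.
    $configDir = str_replace('\\', '/', $configDir);
    $docRoot = rtrim(str_replace('\\', '/', $docRoot), '/');
    if (strpos($configDir, $docRoot) !== 0) {
        throw new InvalidArgumentException('config file is not under the document root');
    }
    return substr($configDir, strlen($docRoot));
}

$approot = app_root('/var/www/myName/folder', '/var/www/');
echo $approot . '/lib/logout.php';
```

The normalization step is an addition to the one-liner above: DOCUMENT_ROOT may or may not end in a slash depending on server configuration, and without trimming it the result could lose its leading slash.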
Probably you are looking for the URL not the Path
log out
and you are not echoing the variable in your example.
Your DOCUMENT_ROOT is local to your machine, so it might end up being c:/www or something similar. That is useful for statements like require or include but not for links.
If you've got a page accessible on the web, a link back to a document on C: is going to try to read that drive on the visitor's own machine.
So for links, you should just be able to use /lib/logout.php, with the initial slash taking you right to the top of your web-accessible structure.
Your page might locally be at c:/www/myprojects/project1/lib/logout.php, while on the site itself it might be at http://www.mydomain.com/lib/logout.php
Frameworks like Symfony offer a sophisticated routing mechanism which allows you to write link urls like this:
log out
It has tons of possibilities, which are described in the tutorial.
Try this,
log out
This jumps to the root directly.
DOCUMENT_ROOT refers to the physical path on the web server. There is no generic way to detect the HTTP path fragment. Quite often, however, you can use PHP_SELF or REQUEST_URI.
Both depend on how the current script was invoked. If the current request was to the index.php in a /whatever/ directory, then try the raw REQUEST_URI string. Otherwise it's quite commonly:
<?= dirname($_SERVER["SCRIPT_NAME"]) . "/lib/logout.php" ?>
It's often best if you use a configurable constant for such purposes however. There are too many ifs going on here.
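One way to combine the two ideas, a configurable constant with the SCRIPT_NAME guess as a fallback, is sketched below. The constant APP_BASE_URL and the function app_base_url are hypothetical names; the function takes the server array as a parameter only so it can be tried without a web server.

```php
<?php
// Hypothetical constant: define it per installation to skip auto-detection.
// define('APP_BASE_URL', '/myName/folder');

// Hypothetical helper: return the URL path of the application root, without a trailing slash.
function app_base_url(array $server): string
{
    if (defined('APP_BASE_URL')) {
        return rtrim(APP_BASE_URL, '/');
    }
    // Fallback guess: the URL directory of the script that handled the request.
    // Only correct when that script sits in the application root.
    $dir = str_replace('\\', '/', dirname($server['SCRIPT_NAME'] ?? '/'));
    return rtrim($dir, '/');
}

// In a real page you would pass $_SERVER; the array here is sample data.
echo app_base_url(['SCRIPT_NAME' => '/myName/folder/index.php']) . '/lib/logout.php';
```

The fallback gives the wrong answer for scripts in subdirectories, which is the reason the constant is the safer choice.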
I'm trying to figure this out for PHP as well. In asp.net, we have Request.ApplicationPath, which makes this pretty easy.
For anyone out there fluent in PHP who is trying to help, this code does what the OP is asking, but in asp.net:
public string AppUrl
{
    get
    {
        string appUrl = Request.Url.GetLeftPart(UriPartial.Authority) + Request.ApplicationPath;
        if (appUrl.Substring(appUrl.Length - 1) != "/")
        {
            appUrl += "/";
        }
        // Workaround for sockets issue when using VS built-in web server
        appUrl = appUrl.Replace("0.0.0.0", "localhost");
        return appUrl;
    }
}
I couldn't figure out how to do this in PHP, so what I did was create a file called globals.php, which I stuck in the root. It has this line:
$appPath = "http://localhost/MyApplication/";
It is part of the project, but excluded from source control. So various devs just set it to whatever they want and we make sure to never deploy it. This is probably the effort the OP is trying to skip (as I skipped with my asp.net code).
I hope this helps lead to an answer, or provides a work-around for PHPers out there.

Image upload - Return URL

Hello, I built a script that does image uploading and resizing, and it all works well, but how can I get the URL of the image afterwards? I don't want my image source in HTML to be like "../img/cat/1.png"; I want it to be like "http://MyIP/img/cat/1.png". I understand that I can just make a variable like $myHost = "http://blabla.com"; and strip the ".." at the beginning, but that's not so good if I want to use the script on another site, because I would need to replace the value every time. Is there any other way?
You will have to use some kind of solution like the one you mentioned yourself. You can also use:
$host = $_SERVER['HTTP_HOST'];
But it is not 100% reliable, because PHP and server configurations differ widely between hosting services.
Put your $myHost variable's content into a configuration file that you load up whenever you start your application. If you need to deploy the application on another server and domain and etc, just change the configuration. This is the most common way to deal with this issue.
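A minimal sketch of the configuration approach, assuming the stored image paths look like the "../img/cat/1.png" from the question. The $config array, its base_url key, and the helper image_url are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical config array; in a real app this would live in its own file,
// e.g. config.php, and be loaded with require.
$config = ['base_url' => 'http://blabla.com'];

// Hypothetical helper: turn a stored relative path such as "../img/cat/1.png" into a full URL.
function image_url(string $baseUrl, string $relativePath): string
{
    // Drop leading "../" and "./" segments, then any leading slash.
    $clean = preg_replace('#^(\.{1,2}/)+#', '', $relativePath);
    return rtrim($baseUrl, '/') . '/' . ltrim($clean, '/');
}

echo image_url($config['base_url'], '../img/cat/1.png');
```

Deploying to another domain then only requires editing the one base_url value.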
I'm not sure if this is what you're looking for, but I think that you should explore the content of $_SERVER array (e.g. $_SERVER['HTTP_HOST']).

To convert an absolute path to a relative path in php

I would like to convert an absolute path into a relative path.
This is what the current absolute code looks like
$sitefolder = "/wmt/";
$adminfolder = "/wmt/admin/";
$site_path = $_SERVER["DOCUMENT_ROOT"]."$sitefolder";
// $site_path ="//winam/refiller/";
$admin_path = $site_path . "$adminfolder";
$site_url = "http://".$_SERVER["HTTP_HOST"]."$sitefolder";
$admin_url = $site_url . "$adminfolder";
$site_images = $site_url."images/";
so for example, the code above would give you a site url of
www.temiremi.com/wmt
and accessing a file in that would give
www.temiremi.com/wmt/folder1.php
What I want to do is mask temiremi.com/wmt and replace it with dolapo.com, so that it would say www.dolapo.com/folder1.php
Is it possible to do that with a relative path?
I'm a beginner coder. I paid someone to do something for me, but I want to get into doing it myself now.
The problem is that your question, although it seems very specific, is missing some crucial details.
If the script you posted is always being executed, and you always want it to go to delapo.com instead of temiremi.com, then all you would have to do is replace
$site_url = "http://".$_SERVER["HTTP_HOST"]."$sitefolder";
with
$site_url = "http://www.delapo.com$sitefolder";
The $_SERVER["HTTP_HOST"] variable will return the domain for whatever site was requested. Therefore, if the user goes to www.temiremi.com/myscript.php (assuming that the script you posted is saved in a file called myscript.php) then $_SERVER["HTTP_HOST"] just returns www.temiremi.com.
On the other hand, you may not always be redirecting to the same domain, or you may want the script to adapt easily to different domains without having to dig through layers of code in the future. If this is the case, then you will need a way of figuring out which domain you wish to link to.
If you have a website hosted on temiremi.com but you want it to look like you are accessing from delapo.com, this is not an issue that can be resolved by PHP. You would have to have delapo.com redirect to temiremi.com or simply host on delapo.com in the first place.
If the situation is the other way around and you want a website hosted on delapo.com but you want users to access temiremi.com, then simply re-writing links isn't a sophisticated enough answer. This strategy would redirect the user to the other domain when they clicked the link. Instead you would need to have a proxy set up to forward the information. Proxy scripts vary in complexity, but the simplest one would be something like:
<?php
$site = file_get_contents("http://www.delapo.com$sitefolder");
echo $site;
?>
So you see, we really need a little more information on why you need this script and its intended purpose in order to assist you.
This would be a lot easier to do in the HTTP server configuration, for example using Apache's VHost.
I'm not really sure what you're going for, because this doesn't look like converting an absolute path to a relative path, but rather one absolute path to another.
Are you always trying to simply change "www.temiremi.com/wmt/" to "delapo.com"? If that's the case, you just want simple string replacement rather than $_SERVER variables or path functions.
$alteredPath = str_replace("www.temiremi.com/wmt/", "delapo.com", $oldPath);
OR
$alteredPath = "www.delapo.com/" . basename($oldPath);
If I misunderstand, please explain. I don't know if you need this to be more robust or generic, and you kind of threw me for a loop with "dolapo.com" (when I first read your title, I thought of comparing the path to a value from $_SERVER and removing the common parts).
And as mentioned, if you are just trying to make the URL displayed to the user (in the address bar or in links) look different, PHP can't do this.
