find base href of a given url - php

I'm modifying a simple PHP crawler script.
One of the modules it uses converts relative URLs into absolute URLs.
For this, I need a way to determine the base href of a given URL. Otherwise I end up with a bunch of wrongly converted links.
I need a simple function to check if a URL has a base href tag, and if yes, return it.
Thanks

parse_url() splits up a URL into its parts. You can get what you need from that.

I need a simple function to check if a URL has a base href tag, and if yes, return it.
A URL cannot have a base href tag, since that is an HTML tag. It might be defined in the HTML that you retrieve from that URL. How to read that can be found at this question.
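To make that concrete, here is a minimal sketch of reading the base tag out of HTML that has already been downloaded. The function name findBaseHref and the sample markup are made up for illustration; it assumes the DOM extension is available and makes no attempt to fetch the page itself.

```php
<?php
// Hypothetical helper: returns the href of the first <base> tag in $html,
// or null when the document does not declare one.
function findBaseHref(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href;
        }
    }
    return null;
}

// Sample markup, invented for this example.
$html = '<html><head><base href="http://example.com/dir/"></head>'
      . '<body><a href="page.html">x</a></body></html>';
echo findBaseHref($html) ?? '(no base tag)', PHP_EOL;
```

When the function returns null, the crawler would fall back to the URL of the page itself as the base for resolving relative links.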

I don't know exactly what you mean, but parse_url() will give you a lot of information, such as the hostname, the query string, etc.
If I understand you correctly, you want to know whether there is an http scheme in your URL. The scheme part of the information parse_url() returns is your friend here. If the scheme is empty or something other than http, you know that there was no http in your URL.
If I understand your question correctly, the crawler starts at a specific page and you parse that page's HTML. Simply construct the base URL (without the path) from the information parse_url() gives you, and I don't see any problems.
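A rough sketch of that construction, assuming the page URL is already known. The name siteRoot is invented for this example; it only rebuilds the scheme, host and optional port.

```php
<?php
// Hypothetical helper: rebuilds "scheme://host[:port]/" from a page URL.
// Returns null when the URL has no scheme or host (i.e. it is relative).
function siteRoot(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root . '/';
}

echo siteRoot('http://example.com:8080/a/b/page.php?x=1'), PHP_EOL;
var_dump(siteRoot('/a/b/page.php')); // relative input, so no root can be built
```

Note that this only covers root-relative links such as /img/logo.png. Links like page2.html are resolved against the directory of the current page (or against the base href, if the page declares one), so the path cannot be dropped for those.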

Related

Using the URL as a pseudo GET variable?

I understand how to parse the URL to get data. What I don't know how to do, or rather, can't seem to search properly for it, is how to prevent redundant file creation.
Here's what I mean. Let's say we have referrer1, referrer2, and referrer3. I want the URL link for each referrer (they are each given their own) to be www.test.com/referrer1, www.test.com/referrer2, and www.test.com/referrer3. Other than pulling the referrer name from the URL, the website functions identically. Is there some way that I can do this that scales to having any arbitrary number of referrers, so that I don't have to make an identical subfolder for every single referrer?
You are looking for URL rewriting. You could have an address like example.com/index.php?ref=referrer1 (and likewise for referrer2, referrer3, etc.) and use URL rewriting to make it look like example.com/referrer1.
http://en.wikipedia.org/wiki/Rewrite_engine
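As a sketch of how this could look on the PHP side: the rewrite rule in the comment assumes Apache with mod_rewrite, and referrerFromPath is a made-up helper showing how the same name could be read straight from the request path when every request is routed to a single script.

```php
<?php
// Assumed rewrite rule (Apache mod_rewrite, in .htaccess), shown for context only:
//   RewriteEngine On
//   RewriteCond %{REQUEST_FILENAME} !-f
//   RewriteCond %{REQUEST_FILENAME} !-d
//   RewriteRule ^([A-Za-z0-9_-]+)/?$ index.php?ref=$1 [L,QSA]

// Hypothetical helper: extracts the referrer name from a request path
// such as "/referrer1". Returns null for "/" or for deeper paths.
function referrerFromPath(string $requestUri): ?string
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $segment = trim($path, '/');
    // Accept a single path segment made of letters, digits, "_" or "-".
    return preg_match('/^[A-Za-z0-9_-]+$/', $segment) === 1 ? $segment : null;
}

// With the rule above the name arrives as $_GET['ref'];
// without it, the same value can be read from the request path.
echo referrerFromPath('/referrer2?utm=1'), PHP_EOL;
```

Either way there is one script and no per-referrer folder, so adding a new referrer needs no new files.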

Joomla - Most Efficient way to force non http:// pre-fixed links to be External

Friends,
I'm looking for the most efficient way to make an anchor tag that contains a user-submitted link point to the external site, instead of having the link erroneously appended to the end of the current site URL.
// Explanation:
As many of you know, when writing links in Joomla such as the following:
Google
or
Google
It appends the href to the current site url.
For example, if my site was http://www.stackoverflow.com/questions/ask
And I clicked on either link above it would take me to http://www.stackoverflow.com/questions/ask/google.com
...as opposed to what would seem natural, just taking me to google.com
// End Explanation
Of course I know prepending http:// to the href solves this issue. However, for user-submitted content this means calling a string-based method to check that each submitted link starts with http:// (or https, etc.) and, if not, to prepend it.
Could someone shed some light on other options for doing this. I'm hoping to find out if there are possibly better, more efficient methods.
Also, if it turns out that I am doing it the best way possible, then I would love to see what others use for this string function.
Thank you Stackfriends.
That's not a "Joomla" behaviour; that's the way URLs are to be resolved as defined by the standard. What you're talking about is how browsers etc. are supposed to process a relative or absolute URL.
Changing this behaviour is, IMO, only likely to result in grief.
A URL is a string that represents an identifier.
A URL is either a relative URL or an absolute URL. Either form can be
followed by a fragment.
A relative URL is a URL without a scheme. A relative URL must be
relative to a base URL.
An absolute URL is a URL with a scheme.
A base URL is an absolute URL with a relative scheme.
You might want to read more of this at http://url.spec.whatwg.org/#urls
This code will enforce http:// for $unknownlink without scheme (protocol):
$link = JUri::getInstance($unknownlink);
if (!$link->getScheme()) $link->setScheme('http');
echo JHtml::link($link, $link, ['target' => '_blank']);
This works in Joomla 3.4; I'm not sure about older versions.

How do I get this URL without considering the Apache settings?

Hello, I have this URL that I need to get with PHP:
http://www.domain.com/forum/#forum/General-discussions-0.htm
The problem is that this is not a real URL, but the mask created by the .htaccess.
I need to get the visible URL and not the real path of the file, because I need to compare it with some PHP variables I have.
In fact the real path will look like this:
http://domain.com/modules/boonex/forum/index.php
And in that way is totally useless for me.
How do I get the first URL as it is?
You can't get that from http://www.domain.com/forum/#forum/General-discussions-0.htm. Everything after the fragment marker (#) is not even sent to the server, so there is no way to retrieve it except through a delayed update with JavaScript. All that gets sent to the server is http://www.domain.com/forum/, and in the onload event of your document you can then load something in with JavaScript.
Look into the source code; the site may not have real URLs at all. The # part is for Ajax-based navigation. It may mean that there are no real URLs on that site, and if there are, they should be extracted from <a href="someurl"> elements, as they might be masked using JavaScript.
With
file_get_contents();
for example. Neither the user nor your server cares about .htaccess.
It is the server processing the request that has to direct you to the correct address.
However, PHP does ignore everything after #, so in this case you have no chance of getting it without the real URL.
As @Wrikken said, there is no way to get the part of the URL after the # fragment marker.
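To illustrate the split, here is a small sketch that takes the address from the question as a plain string and separates what the browser requests from what it keeps to itself. The variable names are only for illustration.

```php
<?php
// The full string as typed in the browser's address bar.
$typed = 'http://www.domain.com/forum/#forum/General-discussions-0.htm';

$parts = parse_url($typed);

// What the browser actually requests from the server.
$sentToServer = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];
// What stays in the browser and is only visible to JavaScript (location.hash).
$keptByBrowser = $parts['fragment'] ?? '';

echo $sentToServer, PHP_EOL;   // http://www.domain.com/forum/
echo $keptByBrowser, PHP_EOL;  // forum/General-discussions-0.htm
```

parse_url() can see the fragment here only because the whole string was handed to it. In a real request the server-side script never receives that part, so the fragment would have to be sent back explicitly by JavaScript.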

How to automatically append "http://" before links before saving them to a database?

I'm developing a PHP-based web-application in which you have a form with textarea inputs that can accept links via anchor tags. But when I tested it after adding a hyperlink as follows, it pointed to a non-existent local subdirectory:
link
I realized that this was because I had not prepended http:// to the link.
There might be cases where a user inputs the link just as I did above. In such cases I don't want the link to point where it did above. Is there any possible solution, such as automatically prepending http:// to the link when it is missing? How do I do that?
----------------------------------------Edit---------------------------------------------
Please consider that the anchor tags are amidst other plaintext and this is making things harder to work with.
I'd go for something like this:
if (!parse_url($url, PHP_URL_SCHEME)) {
    $url = 'http://' . $url;
}
This is an easy and stable way to check for the presence of a protocol in a URL, and allows others (e.g. ftp, https) that may be entered.
What you're talking about involves two steps, URL detection and URL normalization. First you'll have to detect all the URLs in the string being parsed and store them in a data structure for further processing, such as an array. Then you need to iterate over the array and normalize each URL in turn, before attempting to store them.
Unfortunately, both detection and normalization can be problematic, as a URL has a quite complicated structure. http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/ makes some suggestions, but as the page itself says, no regex URL detection is ever perfect.
There are examples of regular expressions that can detect URLs available from various sites, but in my experience none of them are completely reliable.
As for normalization, Wikipedia has an article on the subject which may be a good starting point. http://en.wikipedia.org/wiki/URL_normalization
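For the narrower case in the question (anchor tags mixed into plain text), the two steps can be sketched with a single regex pass over the href attributes. This is only a rough illustration: addSchemeToAnchors is a made-up name, it assumes every scheme-less href is meant to be an external link, and it shares the usual limits of regex-based matching (unquoted attributes are skipped, and something like example.com:8080 would be mistaken for a scheme).

```php
<?php
// Hypothetical helper: prepends "http://" to href values that have no scheme.
// Only double- or single-quoted href attributes are handled.
function addSchemeToAnchors(string $text): string
{
    $pattern = '/(<a\b[^>]*?\bhref\s*=\s*)(["\'])(.*?)\2/is';

    $result = preg_replace_callback($pattern, function (array $m): string {
        $url = trim($m[3]);
        // Leave absolute URLs, fragments, query-only links,
        // root-relative paths and empty values alone.
        $hasScheme = preg_match('/^[a-z][a-z0-9+.-]*:/i', $url) === 1;
        if ($url === '' || $hasScheme || in_array($url[0], ['/', '#', '?'], true)) {
            return $m[0];
        }
        return $m[1] . $m[2] . 'http://' . $url . $m[2];
    }, $text);

    // preg_replace_callback() returns null on a regex error; keep the input then.
    return $result ?? $text;
}

$input = 'See <a href="www.google.com">Google</a> and <a href="https://php.net">PHP</a>.';
echo addSchemeToAnchors($input), PHP_EOL;
```

Running this once before the text is saved means the stored markup already contains absolute links, so nothing has to be fixed at display time.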

using curl to get from one webpage to another involving javascript

I have webpage1.html which has a hyperlink whose href="some/javascript/function/outputLink()"
Now, using curl (or any other method in php) how do I deduce the hyperlink (of http:// format) from the javascript function() so that I can go to next page.
Thanks
You'd have to scrape the JavaScript. Figure out where the function is and see what URL it's using.
Sometimes http:// is omitted for links that are on the same page, so that won't be a good search reference.
At this point the only valuable thing to do is to try and understand the JavaScript code yourself, and once you find the link you could use regex to filter the result programmatically with PHP.
preg_match("/url \+ '\/apples\.html/", "blah blah var javaScriptVar= url + '/apples.html';", $matches);
There is no straightforward way. There are few, if any, libraries that can do exactly what you require. I think http://www.dapper.net/ is something close to what you want, although I am not sure it's the ideal solution. Dapper.net will help you parse text and links, and would probably also handle JavaScript.
