I am using a script to check links on a given page. I am using simple html DOM to parse the information into an array. I have to check the href of all the a tags to find if they contain a file or something like # or JS.
I tried the following without success.
if(preg_match("|^(.*)|iU", $href)){
save_link();
}
I don't know if my pattern is wrong or if there is a better way to do this.
I want to detect whether $href contains an extension such as .com or .php, or some other file extension. That would filter out values like #, "function()", and other non-link items used in the href attribute.
EDIT:
parse_url will not work, so please stop posting it. The value # is returned as a valid URL. As I stated above, I am trying to match any string followed by a dot with no more than 4 characters after the dot.
I believe that the function you're looking for is parse_url().
This function will take a URL string, and return an array of components, which will allow you to work out what kind of URL it is.
However, note that it has issues with incomplete URLs in PHP versions prior to 5.4.7, so use PHP 5.4.7 or later to get the best out of it.
Hope that helps.
See http://php.net/manual/en/function.parse-url.php
I'm assuming you don't want to match fragments (#) because you are not concerned with following internal anchors.
parse_url breaks up the different parts of the url into an array. You can see the path component of the URL in this array and run your check against that.
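The path check described above can be sketched roughly as follows. This is only an illustration under some assumptions: the helper name href_points_to_file() is made up, only the path is inspected (so the .com in a bare host name is not treated as an extension), and "extension" is taken to mean 1 to 4 alphanumeric characters after the last dot, as the question asks.

```php
<?php
// Hypothetical helper: true when the href's path ends in a 1-4 character extension.
function href_points_to_file(string $href): bool
{
    $parts = parse_url($href);
    // "#", "#top" and seriously malformed URLs have no usable path.
    if ($parts === false || !isset($parts['path'])) {
        return false;
    }
    // Skip javascript:, mailto: and other non-web schemes.
    if (isset($parts['scheme'])
        && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }
    // Text after the last dot of the final path segment ("" if none).
    $extension = pathinfo($parts['path'], PATHINFO_EXTENSION);
    return preg_match('/^[a-z0-9]{1,4}$/i', $extension) === 1;
}
```

With this sketch, 'page.php' and 'http://example.com/files/report.pdf?x=1' are accepted, while '#', 'function()' and 'javascript:void(0)' are rejected.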
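You can use parse_url(), like this:
$res = parse_url($href);
// parse_url() returns false for seriously malformed URLs, and 'scheme' is absent for values like "#"
if (is_array($res) && isset($res['scheme']) && in_array($res['scheme'], array('http', 'https'), true)) {
    // valid http(s) URL
    save_link();
}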
UPDATE:
I've added code to filter only http and https urls, thanks to Baba for spotting this.
I'm trying to check for /index.[ php | htm | html | asp ], etc. (basically any wildcard suffix)
or simply "/"
to convert a canonical URL such as http://www.example.com/index.php or http://www.example.com/ to just http://www.example.com so I can use one consistent PHP variable in my canonical meta tag throughout all the pages on my site.
I don't want the script to affect URLs that are NOT homepages, such as http://www.example.com/page.php
REGEX preferred, but not necessary.
I'm working from the variable $current_Webpage_Complete_URL_Address in "domain_info.php" from this script: http://www.perfecterrordocs.com
I want to use just one PHP variable, formed from modifying $current_Webpage_Complete_URL_Address, if that makes any sense.
Please ask for further clarification, if necessary.
Edit: Also, I want the newly formed variable to be named
$current_Webpage_Canonical_URL_Address
Edit(2): I just ran into another problem. Even when I do find a match with preg_match how do I remove that particular ending sub-string?
Final Answer (Works Perfectly!!):
$current_Webpage_Canonical_URL_Address = preg_replace('/\/index\.[a-z]+$/', '', $current_Webpage_Complete_URL_Address);
$current_Webpage_Canonical_URL_Address = preg_replace('/\/$/', '', $current_Webpage_Canonical_URL_Address);
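For reuse, the same two steps could be wrapped in a small function. This is a sketch only: the function name canonical_homepage_url() is hypothetical, and the pattern is a slight variant that uses ~ as the delimiter (to avoid escaping slashes) and is case-insensitive.

```php
<?php
// Hypothetical wrapper around the two replacements.
function canonical_homepage_url(string $url): string
{
    // Drop a trailing "/index.<ext>" (index.php, index.html, index.asp, ...).
    $url = preg_replace('~/index\.[a-z0-9]+$~i', '', $url);
    // Drop a single trailing slash, if one remains.
    return preg_replace('~/$~', '', $url);
}

$current_Webpage_Canonical_URL_Address =
    canonical_homepage_url('http://www.example.com/index.php');
```

A URL such as http://www.example.com/page.php passes through unchanged, because it ends in neither /index.<ext> nor a slash.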
I made a .htaccess file that redirects, for example, link:
website.com/module#controller
to:
website.com/?url=module#controller
Since # starts a comment in PHP, I run into a problem when I need to load:
$bootstrap->init($url) // $url = module#controller;
I tried to use addslashes($url), but when I run:
echo $url;
I still get an output of:
module
How should I clean that string so that the # sign is treated as part of the string?
$url = module#controller; is not valid PHP.
$url = 'module#controller'; will (correctly) not treat the # as a comment initiator.
Additionally, a # in a URL isn't going to work the way you expect. That's the marker for the URL hash/anchor, which is not passed to the web server. This is likely why you get output of module - your problem is at the browser level, not PHP.
The fragment identifier (the part after the #) is a client-side concept only. The browser never sends it to the server.
If you are relying on this functionality, you are going to be disappointed: the server has absolutely no way to redirect based on the fragment, because it never even sees it.
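To illustrate the point, here is a small sketch that takes the URL from the question apart on the PHP side. It assumes you are free to change how the link is generated; percent-encoding the # as %23 is one way to get it to the server as ordinary data. The variable names are made up for the example.

```php
<?php
// The URL as typed in the browser; only the part before "#" is sent in the request.
$typed = 'http://website.com/module#controller';
$path = parse_url($typed, PHP_URL_PATH);         // "/module"
$fragment = parse_url($typed, PHP_URL_FRAGMENT); // "controller" (never reaches the server)

// Percent-encoding the "#" turns it into ordinary data inside a query value.
$link = 'http://website.com/?url=' . rawurlencode('module#controller');

// Simulate what PHP does when it fills $_GET from the query string.
parse_str(parse_url($link, PHP_URL_QUERY), $query);
$url = $query['url']; // "module#controller"
```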
I need to pass a URL as a GET parameter. Here is an example of what I have tried:
http://www.example.com/area/#http://www.example.com/area2/
I've also tried replacing the forward slashes with other characters but that doesn't seem to work. How would you pass a url in a GET?
As I understand it, you should use urlencode() (and urldecode() for strings that PHP has not already decoded for you).
The function urlencode() escapes a string so that it can be used safely inside the query part of a link.
You should use it this way:
$link = 'goto.php?link=' . urlencode($_POST['target_site']);
When you later redirect to the user-defined site (for example), read the parameter back like this. PHP has already decoded the values in $_GET, so calling urldecode() on them again is unnecessary and can corrupt the value:
$decoded_link = $_GET['link'];
// Validate $decoded_link (e.g. against a list of allowed hosts) before redirecting to it
header('Location: ' . $decoded_link);
Hope it helps.
The # character links to an anchor on the page. The browser will automatically scroll to the element whose id follows the pound sign. So that's not what you're looking for.
To pass a GET parameter, the syntax would be like this:
http://example.com/area?http://example.com/area2
Then, if you var_dump($_GET), you'll see your URL as an array key (PHP replaces the dots in key names with underscores). But, if you have other fields you also want to pass in your URL, you can use key=value pairs, like so:
http://example.com/area?url=http://example.com/area2&param1=a&param2=b
In this case, your URL will be available in $_GET['url'].
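If you build the link in PHP, http_build_query() can assemble and encode the key=value pairs for you. The sketch below reuses the example values from above; the parse_str() call only simulates what PHP does when it fills $_GET on the receiving page.

```php
<?php
// Build the query string; each value is percent-encoded automatically.
$query = http_build_query(
    [
        'url' => 'http://example.com/area2',
        'param1' => 'a',
        'param2' => 'b',
    ],
    '',
    '&'
);
$link = 'http://example.com/area?' . $query;

// On the receiving side PHP decodes the values again (simulated here).
parse_str($query, $received);
$passedUrl = $received['url']; // the original URL, slashes and colon intact
```

Encoding the inner URL this way keeps its own : and / characters, and any & or # it might contain, from being confused with the outer URL's syntax.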
Say I have a url like this in a php variable:
$url = "http://mywebsite.extension/names/level/etc/page/x";
How would I automatically remove everything after the .com (or other extension) and before /page/x?
Basically I would like every url that could be in $url to become http://mywebsite.extension/page/x
Is there a way to do this in php? :s
thanks for your help guys!
I think parse_url() is the function you're looking for. You can use it to break a URL down into its component parts, and then put it back together however you want, adding in your own things as needed.
As PeeHaa noted, explode() will be useful for dividing up the path.
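A rough sketch of that combination, assuming the part you want to keep is always the last two path segments (page/x in the question). The function name keep_last_segments() and its $count parameter are made up for this example, and any query string or fragment is simply dropped.

```php
<?php
// Hypothetical helper: keep scheme and host, plus only the last $count path segments.
function keep_last_segments(string $url, int $count = 2): string
{
    $parts = parse_url($url);
    // Leave anything that is not an absolute URL untouched.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url;
    }
    // "/names/level/etc/page/x" becomes ["names", "level", "etc", "page", "x"].
    $segments = explode('/', trim($parts['path'] ?? '', '/'));
    $tail = implode('/', array_slice($segments, -$count));
    return $parts['scheme'] . '://' . $parts['host'] . '/' . $tail;
}

$url = keep_last_segments('http://mywebsite.extension/names/level/etc/page/x');
```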
I am hoping someone can help me. I have created URLs like this:
/this/is/a/test/index.asp
/this/is/a/test/index-1.asp
/this/is/a/test/index-2.asp
/this/is/a/test/index-3.asp
What is the easiest way to strip the number from the URL?
Instead of using variables like this:
/this/is/a/test/index.asp?no=1
/this/is/a/test/index.asp?no=2
/this/is/a/test/index.asp?no=3
I am using the number in the URL itself to dynamically select the content on the page.
If the URL is /this/is/a/test/index-3.asp, it should use the 3 and match the content accordingly, just as if I had called
?no=3
I am using PHP for this...
The URL structure will always define the variable as the number between the last '-' and '.asp'.
Thanks
Gerald Ferreira
You could use mod_rewrite to map URL patterns to actual URLs. You could achieve what you want with something similar to the following:
RewriteRule ^index-([0-9]+)\.asp$ index.asp?no=$1
You should be able to match it out with:
if (preg_match('/\-(\d+)\.asp$/', $_SERVER['REQUEST_URI'], $a)) {
$pageNumber = $a[1];
} else {
// failed to match number from URL
}