How to use Python/PHP to remove redundancy in a URL? - php

Many websites add tags to URLs for tracking purposes, such as
http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost
If we remove the suffix "?wprss=linkset&tid=sm_twitter_washingtonpost", the link still goes to the same page.
Is there a general approach that could remove such redundant elements? Any comment would be helpful.
Thanks!

To remove the query and fragment parts from a URL
In Python 2, using urlparse (on Python 3 the same functions live in urllib.parse):
import urlparse
url = urlparse.urlsplit(URL) # parse url
print urlparse.urlunsplit(url[:3]+('','')) # remove query, fragment parts
Or a more lightweight approach but it might be less universal:
print URL.partition('?')[0]
According to RFC 3986, a URI can be parsed using the regular expression:
/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/
Therefore, if there is no fragment identifier (the last part of the above regex), or if a query component is present (the second-to-last part), URL.partition('?')[0] should work. Otherwise, answers that split a URL on '?' fail, e.g., for
http://example.com/path#here-?-ereh
whereas the urlparse answer still works.
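The snippets above are Python 2. As a minimal sketch of the same idea on Python 3, assuming only the standard urllib.parse module (the function name strip_query_and_fragment is made up for this illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


tricky = "http://example.com/path#here-?-ereh"

# The parser knows that the "?" here belongs to the fragment.
print(strip_query_and_fragment(tricky))  # http://example.com/path

# Splitting on "?" cuts in the middle of the fragment instead.
print(tricky.partition("?")[0])  # http://example.com/path#here-
```

The second print shows why splitting on '?' is the less universal of the two approaches.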
To check whether you can access the page via the URL
In Python 2:
import urllib2
try:
    resp = urllib2.urlopen(URL)
except IOError, e:
    print "error: can't open %s, reason: %s" % (URL, e)
else:
    print "success, status code: %s, info:\n%s" % (resp.code, resp.info()),
resp.read() could be used to read the contents of the page.

To remove the query string from a URL:
<?php
$url = 'http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost';
$url = explode('?',$url);
$url = $url[0];
//check output
echo $url;
?>
To check whether the URL is valid:
You can use the PHP function get_headers($url). Example:
<?php
//$url_o = 'http://www.washingtonpost.com/blogs/answer-sheet/post/report-we-still-dont-know-much-about-charter-schools/2012/01/13/gIQAxMIeyP_blog.html?wprss=linkset&tid=sm_twitter_washingtonpost';
$url_o = 'http://mobile.nytimes.com/article?a=893626&f=21';
$url = explode('?',$url_o);
$url = $url[0];
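$header = get_headers($url);
// fall back to the original URL if the request failed or the stripped URL is "Not Found"
if ($header === false || strpos($header[0], 'Not Found') !== false)
{
    $url = $url_o;
}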
//check output
echo $url;
?>
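The fallback above is needed because some pages depend on their query string (the nytimes URL is such a case). A different sketch, in Python 3, removes only the parameters believed to be tracking tags and keeps the rest. The set of parameter names and the utm_ prefix are assumptions chosen for this illustration, not a complete list, and the function name is hypothetical:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameter names; adjust them for the sites you handle.
TRACKING_PARAMS = {"wprss", "tid"}
TRACKING_PREFIXES = ("utm_",)


def drop_tracking_params(url):
    """Remove assumed tracking parameters and keep every other parameter."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(drop_tracking_params(
    "http://example.com/page.html?wprss=linkset&tid=sm_twitter_washingtonpost"
))  # http://example.com/page.html
print(drop_tracking_params("http://mobile.nytimes.com/article?a=893626&f=21"))
# http://mobile.nytimes.com/article?a=893626&f=21
```

This avoids the extra HTTP request, at the cost of maintaining the list of names by hand.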

You can use a regular expression:
$yourUrl = preg_replace("/[?].*/","",$yourUrl);
Which means: "replace the question mark and everything afterwards with an empty string".

You can write a small URL parser that cuts everything from "?" onward:
<?php
$pos = strpos($yourUrl, '?'); // First, find the index of "?" (false if there is none)
// Then keep only the characters before the "?"; leave the URL unchanged if no "?" was found
$newUrl = ($pos === false) ? $yourUrl : substr($yourUrl, 0, $pos);
echo ($newUrl);
?>

Related

parse non encoded url

There is an external page that passes a URL to my page as a parameter value in the query string.
eg: page.php?URL=http://www.domain2.com?foo=bar
I tried saving the param using
$url = $_GET['url']
The problem is that the referring page does not send it encoded, and therefore anything trailing the "&" is recognized as the beginning of a new param.
I need a way to parse the URL so that anything trailing the second "?" is part of the passed URL and not of the actual query string.
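Get the full query string and then take out the 'URL=' part of it (this assumes URL is the first parameter):
$name = $_SERVER['QUERY_STRING']; // raw query string; http_build_query($_GET) would percent-encode the value
$name = substr($name, strlen('URL='));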
Antonio's answer is probably best. A less elegant way would also work:
$url = $_GET['url'];
// re-attach every other parameter that PHP split off the passed URL
foreach ($_GET as $key => $value) {
    if ($key === 'url') {
        continue;
    }
    $url .= '&' . $key . '=' . $value;
}
echo $url;
Something like this might help:
// The full request
$request_full = $_SERVER["REQUEST_URI"];
// Position of the first "?" inside $request_full
$pos_question_mark = strpos($request_full, '?');
// Position of the query itself
$pos_query = $pos_question_mark + 1;
// Extract the malformed query from $request_full
$request_query = substr($request_full, $pos_query);
// Look for patterns that might corrupt the query
// (the key is captured without its "=" so that it can be used as a $_GET index)
if (preg_match('/([^=]+)=([^&]+)(&.+)?/', $request_query, $matches)) {
    // If a match is found...
    if (isset($_GET[$matches[1]])) {
        // ... get rid of the original match...
        unset($_GET[$matches[1]]);
        // ... and replace it with a URL encoded version.
        $_GET[$matches[1]] = urlencode($matches[2]);
    }
}
As you have hinted in your question, the encoding of the URL you get is not as you want it: a & will mark a new argument for the current URL, not the one in the url parameter. If the URL were encoded correctly, the & would have been escaped as %26.
But, OK, given that you know for sure that everything following url= is not escaped and should be part of that parameter's value, you could do this:
$url = preg_replace("/^.*?([?&]url=(.*?))?$/i", "$2", $_SERVER["REQUEST_URI"]);
So if for example the current URL is:
http://www.myhost.com/page.php?a=1&URL=http://www.domain2.com?foo=bar&test=12
Then the returned value is:
http://www.domain2.com?foo=bar&test=12
See it running on eval.in.
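For comparison, the same "take everything after url=" idea can be sketched in Python 3. This assumes, as the answer does, that nothing after the url parameter belongs to the outer page; the function name and its param argument are hypothetical:

```python
import re


def extract_unencoded_url(request_uri, param="url"):
    """Return everything after the first '<param>=' in the query, or None."""
    pattern = r"[?&]" + re.escape(param) + r"=(.*)$"
    match = re.search(pattern, request_uri, flags=re.IGNORECASE)
    return match.group(1) if match else None


request_uri = "/page.php?a=1&URL=http://www.domain2.com?foo=bar&test=12"
print(extract_unencoded_url(request_uri))
# http://www.domain2.com?foo=bar&test=12
```

Returning None when the parameter is missing lets the caller tell "no URL passed" apart from an empty value.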

file_get_contents() - Fix relative urls

I am trying to display a website to a user, having downloaded it using php.
This is the script I am using:
<?php
$url = 'http://stackoverflow.com/pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
//Fix relative URLs
$site = str_replace('src="','src="' . $url,$site);
$site = str_replace('url(','url(' . $url,$site);
//Display to user
echo $site;
?>
So far this script works a treat, except for a few major problems with the str_replace function. The problem comes with relative URLs. Suppose our made-up pagecalledjohn.php uses an image of a cat. It is a PNG and, as I see it, it can be placed on the page using 6 different URLs:
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
4 is not applicable in this case but added anyway!
5. src="/cat.png"
6. src="cat.png"
Is there a way, using PHP, to search for src=" and replace it with the URL (filename removed) of the page being downloaded, without inserting the URL for options 1, 2 or 3, and with a slightly different procedure for 4, 5 and 6?
Rather than trying to change every path reference in the source code, why don't you simply inject a <base> tag in your header to specifically indicate the base URL upon which all relative URLs should be calculated?
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
This can be achieved using your DOM manipulation tool of choice. The example below would show how to do this using DOMDocument and related classes.
$target_domain = 'http://stackoverflow.com/';
$url = $target_domain . 'pagecalledjohn.php';
//Download page
$site = file_get_contents($url);
// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
if ($dom->loadHTML($site) === false) {
    // something went wrong in loading HTML to DOM Document
    // provide error messaging and exit
}
// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if ($head_tag_list->length !== 1) {
    throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);
// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if ($head_has_children) {
    $head_tag_first_child = $head_tag->firstChild;
}
// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);
// insert new base tag as first child to head tag
if ($head_has_children) {
    $base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
    $base_node = $head_tag->appendChild($base_element);
}
echo $dom->saveHTML();
At the very minimum, if you truly want to modify all path references in the source code, I would HIGHLY recommend doing so with DOM manipulation tools (DOMDocument, DOMXPath, etc.) rather than regex. I think you will find it a much more stable solution.
I don't know if I get your question completely right. If you want to deal with all text sequences enclosed in src=" and ", the following pattern could do it:
~(\ssrc=")([^"]+)(")~
It has three capturing groups of which the second one contains the data you're interested in. The first and last are useful to change the whole match.
Now you can replace all instances with a callback function that is changing the places. I've created a simple string with all the 6 cases you've got:
$site = <<<BUFFER
1. src="//www.stackoverflow.com/cat.png"
2. src="http://www.stackoverflow.com/cat.png"
3. src="https://www.stackoverflow.com/cat.png"
4. src="somedirectory/cat.png"
5. src="/cat.png"
6. src="cat.png"
BUFFER;
Let's ignore for a moment that there are no surrounding HTML tags; I assume you are not parsing HTML anyway, since you asked for a regular expression rather than an HTML parser.
To replace each of the links, let's start lightly by just highlighting them in the string, so that it's clear what matched:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, function ($matches) {
    return $matches[1] . ">>>" . $matches[2] . "<<<" . $matches[3];
}, $site);
The output for the example given then is:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
As the way of replacing the string is to be changed, it can be extracted, so it is easier to change:
$callback = function($method) {
    return function ($matches) use ($method) {
        return $matches[1] . $method($matches[2]) . $matches[3];
    };
};
This function creates the replace callback based on a method of replacing you pass as parameter.
Such a replacement function could be:
$highlight = function($string) {
    return ">>>$string<<<";
};
And it's called like the following:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($highlight), $site);
The output remains the same, this was just to illustrate how the extraction worked:
1. src=">>>//www.stackoverflow.com/cat.png<<<"
2. src=">>>http://www.stackoverflow.com/cat.png<<<"
3. src=">>>https://www.stackoverflow.com/cat.png<<<"
4. src=">>>somedirectory/cat.png<<<"
5. src=">>>/cat.png<<<"
6. src=">>>cat.png<<<"
The benefit of this is that for the replacement function, you only need to deal with the URL match as single string, not with regular expression matches array for the different groups.
Now to your second half of your question: How to replace this with the specific URL handling like removing the filename. This can be done by parsing the URL itself and remove the filename (basename) from the path component. Thanks to the extraction, you can put this into a simple function:
$removeFilename = function ($url) {
    $url = new Net_URL2($url);
    $base = basename($path = $url->getPath());
    $url->setPath(substr($path, 0, -strlen($base)));
    return $url;
};
This code makes use of Pear's Net_URL2 URL component (also available via Packagist and Github, your OS packages might have it, too). It can parse and modify URLs easily, so is nice to have for the job.
So now the replacement done with the new URL filename replacement function:
$pattern = '~(\ssrc=")([^"]+)(")~';
echo preg_replace_callback($pattern, $callback($removeFilename), $site);
And the result then is:
1. src="//www.stackoverflow.com/"
2. src="http://www.stackoverflow.com/"
3. src="https://www.stackoverflow.com/"
4. src="somedirectory/"
5. src="/"
6. src=""
Please note that this is exemplary. It shows how you can do it with regular expressions. You can, however, do it with an HTML parser as well. Let's make this an actual HTML fragment:
1. <img src="//www.stackoverflow.com/cat.png"/>
2. <img src="http://www.stackoverflow.com/cat.png"/>
3. <img src="https://www.stackoverflow.com/cat.png"/>
4. <img src="somedirectory/cat.png"/>
5. <img src="/cat.png"/>
6. <img src="cat.png"/>
And then process all <img> "src" attributes with the created replacement filter function:
$doc = new DOMDocument();
$saved = libxml_use_internal_errors(true);
$doc->loadHTML($site, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_use_internal_errors($saved);
$srcs = (new DOMXPath($doc))->query('//img/@src') ?: [];
foreach ($srcs as $src) {
    $src->nodeValue = $removeFilename($src->nodeValue);
}
echo $doc->saveHTML();
The result then again is:
1. <img src="//www.stackoverflow.com/">
2. <img src="http://www.stackoverflow.com/">
3. <img src="https://www.stackoverflow.com/">
4. <img src="somedirectory/">
5. <img src="/">
6. <img src="">
Only the way of parsing differs; the replacement function is still the same. This offers two different approaches that share part of their code.
I suggest doing it in more steps.
In order to not complicate the solution, let's assume that any src value is always an image (it could as well be something else, e.g. a script).
Also, let's assume that there are no spaces between the equals sign and the quotes (this can be fixed easily if there are). Finally, let's assume that the file name does not contain any escaped quotes (if it did, the regexp would be more complicated).
So you'd use the following regexp to find all image references:
src="([^"]*)". (Also, this does not cover the case where src is enclosed in single quotes, but it is easy to create a similar regexp for that.)
The processing logic can then be done with the preg_replace_callback function instead of str_replace. You can provide a callback to this function, in which each URL is processed based on its contents.
So you could do something like this (not tested!):
$site = preg_replace_callback(
    '~src="([^"]*)"~', // PCRE patterns need delimiters; "~" avoids escaping "/"
    function ($src) {
        $url = $src[1];
        $ret = "";
        if (preg_match('~^//~', $url)) {
            // case 1.
            $ret = 'src="' . $url . '"';
        }
        else if (preg_match('~^https?://~', $url)) {
            // case 2. and 3.
            $ret = 'src="' . $url . '"';
        }
        else {
            // case 4., 5., 6. (strip a leading "/" to avoid a double slash)
            $ret = 'src="http://your.site.com/' . ltrim($url, '/') . '"';
        }
        return $ret;
    },
    $site
);
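All six cases from the question follow the standard rules for resolving a relative reference against a base URL, so a URL library can handle them without per-case branching. A minimal sketch in Python 3 using urllib.parse.urljoin, with the page URL from the question as the base:

```python
from urllib.parse import urljoin

# URL of the downloaded page; relative references are resolved against it.
base = "http://stackoverflow.com/pagecalledjohn.php"

sources = [
    "//www.stackoverflow.com/cat.png",
    "http://www.stackoverflow.com/cat.png",
    "https://www.stackoverflow.com/cat.png",
    "somedirectory/cat.png",
    "/cat.png",
    "cat.png",
]

# Map each src value to its absolute form.
resolved = {src: urljoin(base, src) for src in sources}

for src, absolute in resolved.items():
    print(src, "->", absolute)
```

Cases 2 and 3 come back unchanged, case 1 inherits the scheme of the base URL, and cases 4 to 6 are resolved relative to the directory of pagecalledjohn.php.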

if else on variable link input

I have a method of pulling Youtube video data from API links. I use Wordpress and ran into a snag.
In order to pull the thumbnail, views, uploader and video title I need the user to input the 11 character code at the end of watch?v=_______. This is documented with specific instructions for the user, but what if they ignore it and paste the whole url?
// the url 'code' the user should input.
_gXp4hdd2pk
// the wrong way, when the user pastes the whole url.
https://www.youtube.com/watch?v=_gXp4hdd2pk
If the user accidentally pastes the entire URL and not the 11 character code, is there a way I can use PHP to grab either the code or what's at the end of this URL (the 11 characters after 'watch?v=')?
Here is my PHP code to pull the data:
// $url is the code at the end of 'watch?v=' that the user inputs
$url = get_post_meta ($post->ID, 'youtube_url', $single = true);
// $code is a variable for placing the $url in a youtube link so I can output it to an API link
$code = 'http://www.youtube.com/watch?v=' . $url;
// $code is called at the end of this oembed code, allowing me to decode json data and pull elements from json to echo in my html
// echoed output returns json file. example: http://www.youtube.com/oembed?url=http://www.youtube.com/watch?v=_gXp4hdd2pk
$json = file_get_contents('http://www.youtube.com/oembed?url='.urlencode($code));
I'm looking for something like...
"if user inputs code, use this block of code, else if user inputs whole url use a different block of code, else throw error."
Or... if they use the whole URL can PHP only use a specific section of that url...?
EDIT: Thank you for all the answers! I am new to PHP, so thank you all for your patience. It is difficult for graphic designers to learn PHP; even reading the PHP manual can give us headaches. All of your answers were great and the ones I've tested have worked. Thank you so much :)
Try this:
$code = 'https://www.youtube.com/watch?v=_gXp4hdd2pk';
// filter_var() returns the filtered string on success and false on failure
if (filter_var($code, FILTER_VALIDATE_URL) !== false) {
    // if `$code` is a valid URL
    $code_arr = explode('?v=', $code);
    $query_str = explode('&', $code_arr[1]);
    $new_code = $query_str[0];
} else {
    // if `$code` is not a valid URL, like '_gXp4hdd2pk'
    $new_code = $code;
}
echo $new_code;
Here's a simple option, unless you want to use a regex like Nisse Engström's answer.
Using the function parse_url() you could do something like this:
$url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184';
$split = parse_url($url);
$params = explode('&', $split['query']);
$video_id = str_replace('v=', '', $params[0]);
Now $video_id contains:
_gXp4hdd2pk
from the $url supplied in the above code.
I suggest you read the parse_url() documentation to ensure you understand and grasp it all :-)
Update, in response to your comment:
You'd use something like this to check whether the input is a valid URL:
// $url holds the user input
if (filter_var($url, FILTER_VALIDATE_URL)) {
    // it's a valid URL, as the check returned true,
    // so extract the video id from its query string
    $split = parse_url($url);
    $params = explode('&', $split['query']);
    $video_id = str_replace('v=', '', $params[0]);
} else {
    // they must have posted the video code, as the check returned false
    $video_id = $url;
}
Just try the following:
$url = "https://www.youtube.com/watch?v=_gXp4hdd2pk";
$url = explode('?v=', $url);
$endofurl = end($url);
echo $endofurl;
Replace the $url variable with the user input.
I instruct my users to copy and paste the whole youtube url.
Then, I do this:
$video_url = 'https://www.youtube.com/watch?v=_gXp4hdd2pk'; // this is from user input
$parsed_url = parse_url($video_url);
parse_str($parsed_url['query'], $query);
$vidID = isset($query['v']) ? $query['v'] : NULL;
$url = "http://gdata.youtube.com/feeds/api/videos/". $vidID; // this is used for the Api
$m = array();
if (preg_match ('#^(https?://www.youtube.com/watch\\?v=)?(.+)$#', $url, $m)) {
    $code = $m[2];
} else {
    /* No match */
}
The code uses a Regular Expression to match the user input (the subject) against a pattern. The pattern is enclosed in a pair of delimiters (#) of your choice. The rest of the pattern works like this:
^ matches the beginning of the string.
(...) creates a subpattern.
? matches 0 or 1 of the preceding character or subpattern.
https? matches "http" or "https".
\? matches "?".
(.+) matches 1 or more arbitrary characters. The . matches any character (except newline). + matches 1 or more of the preceding character or subpattern.
$ matches the end of the string.
In other words, optionally match an http or https base URL, followed by the video code.
The matches are then written to $m. $m[0] contains the entire string, $m[1] contains the first subpattern (base URL) and $m[2] contains the second subpattern (code).
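The "code or whole URL" decision from the question can also be sketched in Python 3. The function name is hypothetical, and the 11-character alphabet (letters, digits, "_" and "-") is an assumption based on the example id rather than a documented guarantee:

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumed shape of a video code: 11 characters from a URL-safe alphabet.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def youtube_video_id(user_input):
    """Accept either a bare video code or a full watch URL."""
    text = user_input.strip()
    if ID_PATTERN.fullmatch(text):
        return text  # the user followed the instructions
    # Otherwise treat the input as a URL and read its "v" parameter.
    candidates = parse_qs(urlsplit(text).query).get("v", [])
    if candidates and ID_PATTERN.fullmatch(candidates[0]):
        return candidates[0]
    raise ValueError("no video code found in %r" % user_input)


print(youtube_video_id("_gXp4hdd2pk"))  # _gXp4hdd2pk
print(youtube_video_id(
    "https://www.youtube.com/watch?v=_gXp4hdd2pk&list=RD_gXp4hdd2pk#t=184"
))  # _gXp4hdd2pk
```

Reading the v parameter by name means the result does not depend on where it appears in the query string, and the ValueError covers the "else throw error" branch.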

php preg_match get everything after match in string

I am looking for a way to get the complete string in a URI after the away?to=
My code:
if (isset($_SERVER['REQUEST_URI'])) {
    $goto = $_SERVER['REQUEST_URI'];
}
if (preg_match("/to=(.+)/", $goto, $goto_url)) {
    $link = "<a href='{$goto_url[1]}' target='_blank'>{$goto_url[1]}</a>";
}
The original link is:
https://domain.com/away?to=http://www.zdf.de/ZDFmediathek#/beitrag/video/2162504/Verschw%C3%B6rung-gegen-die-Freiheit-%281%29
.. but my code is cutting the string after the away?to= to only
http://www.zdf.de/ZDFmediathek
Do you know a fix for this preg_match call that allows really every character following the away?to=?
UPDATE:
I found out that $_SERVER['REQUEST_URI'] or $_SERVER['QUERY_STRING'] already contains the cut-off URL. Do you know why, and how to prevent that?
Try using (.*) to get everything after to=:
$str = 'away?to=dfkhgkjdshfgkhldsflkgh';
preg_match("/to=(.*)/", $str, $goto_url);
echo $goto_url[1]; //dfkhgkjdshfgkhldsflkgh
Instead of extracting the URL with regex from the request URI you can just get it from the $_GET array:
$link = "<a href='{$_GET['to']}' target='_blank'>{$_GET['to']}</a>";
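On the question in the UPDATE: everything after an unescaped "#" is the fragment of the outer URL, and a browser does not send the fragment to the server, so no server-side regex can recover it. The page that builds the link has to percent-encode the target first. A small Python 3 sketch of both situations, using a shortened version of the link from the question:

```python
from urllib.parse import parse_qs, quote, urlsplit

target = "http://www.zdf.de/ZDFmediathek#/beitrag/video/2162504/"

# Unencoded: the "#" starts the fragment of the outer URL.
raw_parts = urlsplit("https://domain.com/away?to=" + target)
print(raw_parts.query)     # to=http://www.zdf.de/ZDFmediathek
print(raw_parts.fragment)  # /beitrag/video/2162504/

# Encoded by the linking page: "#" becomes %23 and stays inside the query.
safe_link = "https://domain.com/away?to=" + quote(target, safe="")
recovered = parse_qs(urlsplit(safe_link).query)["to"][0]
print(recovered == target)  # True
```

In PHP the corresponding encoding step on the linking page would be urlencode(), after which $_GET['to'] holds the full target.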

PHP regex find and replace url attributes in DOM

Currently I have the following code:
//loop here
foreach ($doc['a'] as $link) {
$href = pq($link)->attr('href');
if (preg_match($url,$href))
{
//delete matched string and append custom url to href attr
}
else
{
//prepend custom url to href attr
}
}
//end loop
Basically, I've fetched an external page via cURL. I need to append my own custom URL to each href link in the DOM. I need to check via regex whether each href attribute already has a base URL, e.g. www.domain.com/MainPage.html/SubPage.html
If yes, then replace the www.domain.com part with my custom url.
If not, then simply append my custom url to the relative url.
My question is, what regex syntax should I use and which php function? Is preg_replace() the proper function for this?
Cheers
You should use internals as opposed to regex whenever possible, because the authors of those functions have often considered edge cases (or read the REALLY long RFC for URLs that details all of the cases). For your case, I would use parse_url() and then http_build_url() (note that the latter function needs PECL HTTP, which can be installed by following the docs page for the http package):
$href = 'http://www.domain.com/MainPage.html/SubPage.html';
$parts = parse_url($href);
if ($parts['host'] == 'www.domain.com') {
    $parts['host'] = 'www.yoursite.com';
    $href = http_build_url($parts);
}
echo $href; // 'http://www.yoursite.com/MainPage.html/SubPage.html';
Example using your code:
foreach ($doc['a'] as $link) {
    $urlParts = parse_url(pq($link)->attr('href'));
    $urlParts['host'] = 'www.yoursite.com'; // This replaces the domain if there is one, otherwise it prepends your domain
    $newURL = http_build_url($urlParts);
    pq($link)->attr('href', $newURL);
}
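The same parse, replace and rebuild pattern can be sketched in Python 3 with urllib.parse. The function name and the default base are hypothetical, and the sketch assumes that absolute hrefs carry a scheme (or start with //); a bare "www.domain.com/..." would be parsed as a path:

```python
from urllib.parse import urlsplit, urlunsplit


def rebase_href(href, new_base="http://www.yoursite.com"):
    """Point href at new_base, keeping its path, query and fragment."""
    base = urlsplit(new_base)
    parts = urlsplit(href)
    # Relative paths are attached directly under the new host.
    path = parts.path if parts.path.startswith("/") else "/" + parts.path
    return urlunsplit(
        (base.scheme, base.netloc, path, parts.query, parts.fragment)
    )


print(rebase_href("http://www.domain.com/MainPage.html/SubPage.html"))
# http://www.yoursite.com/MainPage.html/SubPage.html
print(rebase_href("SubPage.html"))
# http://www.yoursite.com/SubPage.html
```

Both branches of the question (href with and without a base URL) go through the same code path, which is the main advantage over a regex with two cases.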
