PHP: Convert url in html to fully-fledged url? - php

I am able to scrape a page for URLs, but I want to know what is the easiest way to convert the various formats that these links can be in, into a fully fledged url. For example:
If I scrape: www.mysite.com/some/place/in/space.html
And I get the following urls:
../img.jpg
img.jpg
../../bla.jpg
inc/bla.jpg
/
./
They should resolve to
www.mysite.com/some/place/img.jpg
www.mysite.com/some/place/in/img.jpg
www.mysite.com/some/bla.jpg
www.mysite.com/some/place/in/inc/bla.jpg
www.mysite.com/some/place/in/
www.mysite.com/some/place/in/
Is there a function that does this for all cases or is it something I would have to code?

I use this function for a crawler I wrote a long time ago: http://codepad.org/1VxMECNj
Call the function with the host prepended:
relativeUrl('http://host/dir/dir2/../../file.html');
//> returns http://host/file.html
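The linked function is not reproduced here, so as a stand-in, here is a minimal sketch of the same idea: prepend the base directory to the relative link, then collapse the "." and ".." segments. The name collapseDotSegments() is hypothetical (it is not the codepad function). The sketch assumes PHP 7 or later and a URL that carries a scheme such as http://; empty segments from doubled slashes are dropped and user info is ignored.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments in an absolute URL.
function collapseDotSegments(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL, leave it untouched
    }
    $path = $parts['path'] ?? '/';
    $stack = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($stack);               // step up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $stack[] = $segment;             // ordinary segment
        }
    }
    $resolved = '/' . implode('/', $stack);
    // Keep the trailing slash when the path ended in "/", "/." or "/..".
    if ($stack && preg_match('#/(\.{1,2})?$#', $path)) {
        $resolved .= '/';
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . $resolved . $query . $fragment;
}

echo collapseDotSegments('http://host/dir/dir2/../../file.html'), "\n";
// prints http://host/file.html
```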

You can just prepend www.mysite.com/some/place/in/ to the URLs. www.mysite.com/some/place/in/../img.jpg should resolve, I think.

You could use a regex to replace the relative links with absolute URLs:
$data = preg_replace('#(href|src)="([^:"]*)("|(?:(?:%20|\s|\+)[^"]*"))#', '$1="' . $site_url . '$2$3', $data);
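To see what this pattern does and does not touch, here is a small run on sample markup. The values of $site_url and $data are assumptions made up for the illustration.

```php
<?php
// Sample values, assumed for illustration only.
$site_url = 'http://www.mysite.com/some/place/in/';
$data = '<a href="inc/bla.jpg">one</a> <img src="http://other.example/a.png">';

// Same pattern as in the answer: only attribute values that contain
// no ":" before the closing quote are prefixed with $site_url.
$data = preg_replace(
    '#(href|src)="([^:"]*)("|(?:(?:%20|\s|\+)[^"]*"))#',
    '$1="' . $site_url . '$2$3',
    $data
);

echo $data, "\n";
```

The relative href is prefixed and the absolute src is left alone. Note that a root-relative value such as href="/x" would be prefixed too, and that ".." segments are not collapsed, so this approach only suits simple cases.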

Related

Yii2 - Get an app URL without params

I have no idea how to get a full url to my app web folder in Yii2.
The following calls:
<?=Yii::$app->getUrlManager()->getBaseUrl();?><br>
<?=Yii::$app->homeUrl;?><br>
<?=Yii::$app->getHomeUrl();?><br>
<?=Yii::$app->request->url;?><br>
<?=Yii::$app->request->absoluteUrl;?><br>
<?=Yii::$app->request->baseUrl;?><br>
<?=Yii::$app->request->scriptUrl;?><br>
<?=Url::to();?><br>
<?=Url::to(['site/index']);?><br>
<?=Url::base();?><br>
<?=Url::home();?><br>
<?=Yii::$app->getUrlManager()->getBaseUrl();?><br>
return:
/yiiapp/web
/yiiapp/web/
/yiiapp/web/
/yiiapp/web/en/reset-password-request
http://website.com/yiiapp/web/en/reset-password-request
/yiiapp/web
/yiiapp/web/index.php
/yiiapp/web/en/reset-password-request
/yiiapp/web/site/index
/yiiapp/web
/yiiapp/web/
/yiiapp/web
when what I need is this (absoluteUrl comes closest):
http://website.com/yiiapp/web
I could probably combine one of the results with some $_SERVER var… but is that a proper solution?
I realize this post is quite old but I want to answer it anyway.
To get a full URL to your app web folder in Yii2 you can try these three options:
Url::to('@web/', ''); returns //website.com/yiiapp/web/
Url::to('@web/', true); returns http://website.com/yiiapp/web/
Url::to('@web/', 'https'); returns https://website.com/yiiapp/web/
You can use the Yii::$app->getUrlManager()->createAbsoluteUrl() method or yii\helpers\Url::toRoute() to generate absolute URLs. yii\helpers\Url::to() can also be used; see the documentation. E.g. <?=Url::to(['site/index'], true);?> should output http://website.com/yiiapp/web/site/index. If you need the root URL of your app, try \yii\helpers\Url::to('/', true);
There are multiple ways to achieve this, but probably the cleanest way to get the base URL of your app is to use Url::base():
Url::base(true);
Most methods in the Url helper allow you to specify a $scheme argument; use it if you want to create an absolute URL (with domain).
The URI scheme to use in the returned base URL:
false (default): returning the base URL without host info.
true: returning an absolute base URL whose scheme is the same as that in yii\web\UrlManager::$hostInfo.
string: returning an absolute base URL with the specified scheme (either http, https or empty string for protocol-relative URL).
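To make the three $scheme modes concrete without a Yii installation, here is a plain-PHP imitation of the behaviour described above. baseUrlWithScheme() and its parameters are hypothetical, and this is not Yii's implementation; $hostInfo merely stands in for yii\web\UrlManager::$hostInfo.

```php
<?php
// Hypothetical imitation of the documented $scheme semantics.
// $scheme: false, true, or a string ('http', 'https', or '' for protocol-relative).
function baseUrlWithScheme(string $hostInfo, string $basePath, $scheme = false): string
{
    if ($scheme === false) {
        return $basePath;                 // no host info at all
    }
    if ($scheme === true) {
        return $hostInfo . $basePath;     // keep the scheme of $hostInfo
    }
    // String: strip the existing scheme ("http:") but keep the "//host" part.
    $hostWithoutScheme = preg_replace('#^[a-z][a-z0-9+.-]*:(?=//)#i', '', $hostInfo);
    $prefix = $scheme === '' ? '' : $scheme . ':';
    return $prefix . $hostWithoutScheme . $basePath;
}

// Values taken from the question.
echo baseUrlWithScheme('http://website.com', '/yiiapp/web', true), "\n";
// prints http://website.com/yiiapp/web
```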

untangling directory separator madness using string manipulation?

I'm working on converting a website. It involved standardizing the directory structure of images and media files. I'm parsing path information from various tags, standardizing them, checking to see if the media exists in the new standardized location, and putting it there if it doesn't. I'm using string manipulation to do so.
This is a little open-ended, but is there a class, tool, or concept out there I can use to save myself some headaches? For instance, I'm running into problems where, say, a page in a subdirectory (website.com/subdir/dir/page.php) has relative image paths (../images/image.png), or other things like this. It's not like there's one overarching problem, just a lot of little things that add up.
When I think I've got my script covering most cases, then I get errors like Could not find file at export/standardized_folder/proper_image_folderimage.png where it should be export/standardized_folder/proper_image_folder/image.png. It's kind of driving me mad, doing string parsing and checks to make sure that directory separators are in the proper places.
I feel like I'm putting too much work into making a one-off import script very robust. Perhaps someone's already untangled this mess in a re-useable way, one which I can take advantage of?
Post Script: So here's a more in-depth scoop. I write my script to parse one "type" of page and pull content from pages of the same kind. Then I turn my script to parse another type of page, get all kinds of errors, and learn that all my assumptions about how paths are referenced must be thrown out the window. Wash, rinse, repeat.
So I'm looking at doing some major re-factoring of my script, throwing out all assumptions, and checking, re-checking, and double-checking path information. Since I'm really trying to build a robust path building script, hopefully I can avoid re-inventing the wheel. Is there a wheel out there?
If your problems are rooted in resolving the relative links from a document to absolute ones (which should be half the job of mapping the linked image paths onto the file system), I normally use Net_URL2 from PEAR. It's a simple class that just does the job.
To install, as root just call
# pear install channel://pear.php.net/Net_URL2-0.3.1
Even though it's a beta package, it's really stable.
A little example, let's say there is an array with all the images srcs in question and there is a base-URL for the document:
require_once('Net/URL2.php');
$baseUrl = 'http://www.example.com/test/images.html';
$docSrcs = array(...);
$baseUrl = new Net_URL2($baseUrl);
foreach ($docSrcs as $href)
{
    $url = $baseUrl->resolve($href);
    echo ' * ', $href, ' -> ', $url->getURL(), "\n";
    // or
    echo " $href -> $url\n"; # Net_URL2 supports string context
}
This will convert any relative links into absolute ones based on your base URL. The base URL is, first of all, the document's address. The document can override it by specifying another one with the base element, so you can look that up with the HTML parser you're already using (as well as the src and href values).
Net_URL2 follows the current RFC 3986 when resolving URLs.
Another thing that might be handy for your URL handling is the getNormalizedURL function. It removes some potential error cases, such as needless dot segments, which is useful if you need to compare one URL with another and, naturally, for mapping the URL to a path:
foreach ($docSrcs as $href)
{
    $url = $baseUrl->resolve($href);
    $url = $url->getNormalizedURL();
    echo " $href -> $url\n";
}
Now that you can resolve all URLs to absolute ones and normalize them, you can decide whether they are relevant to your site. As long as the URL is still a Net_URL2 instance, you can use one of its many functions to do that:
$host = strtolower($url->getHost());
if (in_array($host, array('example.com', 'www.example.com')))
{
    # URL is on my server, process it further
}
What remains is the concrete path to the file in the URL:
$path = $url->getPath();
That path, considering you're comparing against a UNIX file-system, should be easy to prefix with a concrete base directory:
$filesystemImagePath = '/var/www/site-new/images';
$newPath = $filesystemImagePath . $path;
if (is_file($newPath))
{
    # new image already exists.
}
If you have problems combining the base path with the image path, note that the image path will always start with a slash.
Hope this helps.
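For the missing-separator errors mentioned in the question (proper_image_folderimage.png), one option is to stop concatenating by hand and route every join through a single helper. What follows is a minimal sketch, assuming forward slashes are acceptable on the target system and PHP 7 or later; joinPath() is a hypothetical name, not part of Net_URL2 or PHP.

```php
<?php
// Hypothetical helper: join path parts with exactly one "/" between them,
// regardless of whether the parts carry leading or trailing separators.
function joinPath(string ...$parts): string
{
    $segments = [];
    foreach ($parts as $part) {
        // Normalise backslashes, then strip separators at both ends.
        $trimmed = trim(str_replace('\\', '/', $part), '/');
        if ($trimmed !== '') {
            $segments[] = $trimmed;
        }
    }
    // Preserve a leading slash if the first part was an absolute path.
    $first = $parts[0] ?? '';
    $isAbsolute = $first !== '' && ($first[0] === '/' || $first[0] === '\\');
    return ($isAbsolute ? '/' : '') . implode('/', $segments);
}

echo joinPath('export/standardized_folder/proper_image_folder', 'image.png'), "\n";
// prints export/standardized_folder/proper_image_folder/image.png
echo joinPath('/var/www/site-new/images/', '/test/logo.png'), "\n";
// prints /var/www/site-new/images/test/logo.png
```

Because every part is trimmed before joining, it no longer matters whether a directory was stored with or without its trailing slash.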
Truepath() to the rescue!
No, you shouldn't use realpath() (see why).

How to encode a URL as a CakePHP parameter

I would like to create a bookmarklet for adding bookmarks, so that you just click on the "Bookmark this Page" JavaScript snippet in your bookmarks and are redirected to the page.
This is my current bookmarklet:
"javascript: location.href='http://…/bookmarks/add/'+encodeURIComponent(document.URL);"
This gives me a URL like this when I click on it on the Bookmarklet page:
http://localhost/~mu/cakemarks/bookmarks/add/http%3A%2F%2Flocalhost%2F~mu%2Fcakemarks%2Fpages%2Fbookmarklet
The server does not like that though:
The requested URL /~mu/cakemarks/bookmarks/add/http://localhost/~mu/cakemarks/pages/bookmarklet was not found on this server.
This gives the desired result, but is pretty useless for my use case:
http://localhost/~mu/cakemarks/bookmarks/add/test-string
The typical CakePHP mod_rewrite setup is in place, and it should turn the last part into a parameter for my BookmarksController::add($url = null) action.
What am I doing wrong?
I had a similar problem, and tried different solutions, only to be confused by the cooperation between CakePHP and my Apache-config.
My solution was to encode the URL in Base64 with JavaScript in browser before sending the request to server.
Your bookmarklet could then look like this:
javascript:(function(){function myb64enc(s){s=window.btoa(s);s=s.replace(/=/g, '');s=s.replace(/\+/g, '-');s=s.replace(/\//g, '_');return s;} window.open('http://…/bookmarks/add/'+myb64enc(window.location));})()
I make two replacements here (+ becomes - and / becomes _) and strip the = padding to make the Base64 encoding URL-safe. On the server side you only need to reverse those replacements and Base64-decode. This way you won't confuse your URL routing with slashes...
Based on poplitea's answer, I translate the troublesome characters (/, : and #) manually so that I do not need any special function.
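The server-side half is not shown in the answer, so here is a sketch of how it might look in PHP 7.1 or later. base64UrlDecode() and base64UrlEncode() are hypothetical helper names; the encoder only mirrors what the bookmarklet does so that the round trip can be checked.

```php
<?php
// Hypothetical helper mirroring the bookmarklet: Base64, then URL-safe alphabet,
// then strip the "=" padding.
function base64UrlEncode(string $raw): string
{
    return rtrim(strtr(base64_encode($raw), '+/', '-_'), '=');
}

// Hypothetical helper: undo the replacements, restore padding, decode.
// Returns null when the input is not valid Base64.
function base64UrlDecode(string $encoded): ?string
{
    $b64 = strtr($encoded, '-_', '+/');
    $remainder = strlen($b64) % 4;
    if ($remainder > 0) {
        $b64 .= str_repeat('=', 4 - $remainder);
    }
    $decoded = base64_decode($b64, true); // strict mode rejects stray characters
    return $decoded === false ? null : $decoded;
}

$url = 'http://localhost/~mu/cakemarks/pages/bookmarklet';
var_dump(base64UrlDecode(base64UrlEncode($url)) === $url); // bool(true)
```

In the controller action, the decoded value would then replace the raw $url parameter before it is stored.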
function esc(s) {
    s=s.replace(/\//g, '__slash__');
    s=s.replace(/:/g, '__colon__');
    s=s.replace(/#/g, '__hash__');
    return s;
}
In PHP I convert it back easily.
$url = str_replace("__slash__", "/", $url);
$url = str_replace("__colon__", ":", $url);
$url = str_replace("__hash__", "#", $url);
I am not sure what happens with characters like ? and so on.
Not sure, but I hope it helps.
You should add this to your routes.php:
Router::connect (
'/crazycontroller/crazyaction/crazyparams/*',
array('controller'=>'somecontroller', 'action'=>'someaction')
);
After that, your site will be able to read URLs like this:
http://site.com/crazycontroller/crazyaction/crazyparams/http://crazy.com

parsing an url for crawler

I am writing a small crawler that extracts links from some 5 to 10 sites. While collecting the links, I get some URLs like this:
../tets/index.html
If it were /test/index.html, we could simply prepend the base URL: http://www.example.com/test/index.html
What can I do with this kind of URL?
URLs like these are relative URLs. ".." means "parent directory", whereas "." simply means "this directory", as in bash.
For instance, if you are looking at the page http://www.someserver/test/foo/bar.html and it contains a URL like "../baz/foobar.html", that URL will in fact point to http://www.someserver/test/baz/foobar.html, I think. Just test it.
Use dirname() to get the base directory, remove the .. using substr(), and append the rest (this handles a single leading .. only). Like this:
<?php
$url = "../tets/index.html";
$currentURL = "http://example.com/somedir/anotherdir";
echo dirname($currentURL).substr($url, 2);
?>
This outputs:
http://example.com/somedir/tets/index.html
Take a look into this URL Normalization Wikipedia page.
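For a crawler that has to cope with any number of ".." segments as well as root-relative links, a small resolver is more dependable than string slicing. The following is a sketch in the spirit of RFC 3986, section 5, not a complete implementation: resolveUrl() is a hypothetical name, PHP 7 or later is assumed, the base URL must be absolute, and user info in the base URL is ignored.

```php
<?php
// Hypothetical helper: resolve $relative against the absolute URL $base.
function resolveUrl(string $base, string $relative): string
{
    // A reference with its own scheme is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        throw new InvalidArgumentException('Base URL must be absolute');
    }
    // Protocol-relative reference: borrow only the scheme.
    if (strpos($relative, '//') === 0) {
        return $b['scheme'] . ':' . $relative;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Split the reference into its path and its ?query / #fragment suffix.
    $cut = strcspn($relative, '?#');
    $relPath = substr($relative, 0, $cut);
    $suffix = substr($relative, $cut);

    if ($relPath === '') {
        // Same document: keep the base path, and the base query unless overridden.
        $keepQuery = ($suffix === '' || $suffix[0] === '#') && isset($b['query']);
        return $root . $basePath . ($keepQuery ? '?' . $b['query'] : '') . $suffix;
    }
    // Root-relative paths replace the base path; others replace its last segment.
    $path = $relPath[0] === '/'
        ? $relPath
        : substr($basePath, 0, strrpos($basePath, '/') + 1) . $relPath;

    $segments = explode('/', ltrim($path, '/'));
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($out);       // go up one directory, never above the root
        } elseif ($segment !== '.') {
            $out[] = $segment;     // "." is dropped, everything else is kept
        }
    }
    // A trailing "." or ".." names a directory, so keep the trailing slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }
    return $root . '/' . implode('/', $out) . $suffix;
}

echo resolveUrl('http://www.someserver/test/foo/bar.html', '../baz/foobar.html'), "\n";
// prints http://www.someserver/test/baz/foobar.html
```

Whether the base URL ends in a slash matters: against http://example.com/somedir/anotherdir/ the link ../tets/index.html resolves to http://example.com/somedir/tets/index.html, but against http://example.com/somedir/anotherdir (no trailing slash) it resolves to http://example.com/tets/index.html.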

How to pass part of url in new link? Using only HTML & PHP

I have been trying to use the Facebook share function on my website, but I can't seem to get the right result.
Say:
I have a page called http://www.example.com/product.php?prod=lpd026n&cat=43
and I am using Facebook's share function to let visitors share the page on their FB wall.
I tried writing the link this way, but it doesn't seem to work:
href="http://www.facebook.com/share.php?u=www.example.com/proddetail.php?<?php print urlencode(@$_SERVER['QUERY_STRING']!=''?'?'.$_SERVER['QUERY_STRING']:'')?>"
As a result, the arguments in the URL come out as %26, %3D, etc.
I.e.: example.com/proddetail.php?prod%3Dlpd026n%26cat%3D43
As some of you may know, the data after '?' is dynamic, and I am planning to use the code above in the frame of the page, so a different query will be passed to the share link for every item.
The end result that I want should look like this:
http://www.facebook.com/sharer.php?u=http://www.example.com/proddetail.php?prod=lpd026n&cat=43
Not
http://www.facebook.com/share.php?u=http://www.example.com/proddetail.php?prod%3Dlpd026n%26cat%3D43
Can anyone help me solve this problem?
Thanks in advance!
PS: if anything is unclear, please ask me to clarify.
This URL:
http://www.facebook.com/share.php?u=http://www.example.com/proddetail.php?prod%3Dlpd026n%26cat%3D43
is only partially-encoded. You actually need to fully URL-encode it before passing to FB, so that it won't interfere with FB's URL structure. I'm sure that their script will know how to parse it properly.
The correct method is:
$url = 'http://www.facebook.com/sharer.php?u='.urlencode('http://www.example.com/proddetail.php?prod=lpd026n&cat=43');
// evaluates to:
// http://www.facebook.com/sharer.php?u=http%3A%2F%2Fwww.example.com%2Fproddetail.php%3Fprod%3Dlpd026n%26cat%3D43
Update: build your dynamic query
// Original URL
$url = 'http://www.example.com/proddetail.php';
if ($_SERVER['QUERY_STRING'])
    $url .= '?'.$_SERVER['QUERY_STRING'];
// Final URL for FB
$fb_url = 'http://www.facebook.com/share.php?u='.urlencode($url);
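Why the full encoding matters can be checked locally. How Facebook parses its query string is not known here, so the sketch below uses PHP's own parse_str() as a stand-in for any standard query-string parser; the variable names are made up for the illustration.

```php
<?php
$target = 'http://www.example.com/proddetail.php?prod=lpd026n&cat=43';

// Unencoded: the "&" ends the u parameter early, and cat becomes
// a separate parameter of the share URL itself.
parse_str('u=' . $target, $raw);

// Fully encoded: the whole target URL survives as a single value.
parse_str('u=' . urlencode($target), $encoded);

var_dump($raw['u'] === $target);      // bool(false)
var_dump(isset($raw['cat']));         // bool(true)
var_dump($encoded['u'] === $target);  // bool(true)
```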
This is what urlencode does. What is the problem with the link as it is?
Edit: I do not use PHP, but I think the following will do the trick (omitted the urlencode):
href="http://www.facebook.com/share.php?u=www.example.com/proddetail.php?<?php print $_SERVER['QUERY_STRING']?>"
I guess K Prime is right.
You need to encode the whole URL, because the slashes and ":" still cause problems in this link ;)
$url = 'http://www.facebook.com/sharer.php?u='.urlencode('http://www.example.com/proddetail.php?prod=lpd026n&cat=43');
should be fine for your purposes.
