Unicode characters causing 404 error in file_get_contents() - php

I have an app that visits URLs automatically by following links. It works well as long as the URL doesn't contain Unicode.
For example, I have a link:
<a href="https://example.com/catalog/kraków/list.html">Kraków</a>
The link contains a raw ó character in the source. When I try to do:
$href = $crawler->filter('a')->attr('href');
$html = file_get_contents($href);
It returns a 404 error. If I visit that URL in the browser, it's fine, because the browser replaces ó with %C3%B3.
What should I do to make it possible to visit that URL via file_get_contents()?

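rawurlencode() can be used to encode the individual parts of a URL. The following snippet extracts the path /catalog/kraków/list.html and encodes each segment (catalog, kraków and list.html) separately rather than the entire URL, so the slashes in the path are preserved. rawurlencode() is used rather than urlencode() because urlencode() turns spaces into +, which is only correct inside query strings.
Check out the following solution:
function encodeUri($uri){
    $urlParts = parse_url($uri);
    // Encode each path segment on its own so the "/" separators survive
    $path = implode('/', array_map(function($pathPart){
        // Leave segments that already look percent-encoded untouched
        return strpos($pathPart, '%') !== false ? $pathPart : rawurlencode($pathPart);
    }, explode('/', $urlParts['path'])));
    $query = array_key_exists('query', $urlParts) ? '?' . $urlParts['query'] : '';
    // Note: a port or fragment in the original URL is not carried over
    return $urlParts['scheme'] . '://' . $urlParts['host'] . $path . $query;
}
$href = $crawler->filter('a')->attr('href');
$html = file_get_contents(encodeUri($href)); // requests https://example.com/catalog/krak%C3%B3w/list.html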
parse_url docs: https://www.php.net/manual/en/function.parse-url.php
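As an alternative to splitting the URL into parts, here is a minimal sketch that percent-encodes only the bytes outside printable ASCII and leaves everything else (scheme, port, query, existing %XX escapes) alone. The function name encodeNonAscii() is hypothetical and not part of the answer above, and the sketch assumes the href is UTF-8 and that the host name itself is plain ASCII.

```php
<?php
// Hypothetical helper (not from the answer above): percent-encode every byte
// outside printable ASCII and leave the rest of the URL untouched.
function encodeNonAscii(string $url): string
{
    // No /u modifier on purpose: the pattern then matches single bytes, so
    // each byte of a multibyte UTF-8 character becomes its own %XX escape.
    return preg_replace_callback(
        '/[^\x21-\x7E]/',
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );
}
```

It could then be used as file_get_contents(encodeNonAscii($href)). Non-ASCII host names would still need separate handling (for example Punycode conversion), which this sketch does not attempt.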

Related

PHP get meta tags function doesn't work with non http

I was hoping to get help with a problem I am having.
I'm using PHP's get_meta_tags() function to see if a tag exists on a list of websites. The problem occurs whenever there is a domain without http://.
Ideally I would want to add http:// if it doesn't exist, and I would also need a workaround if the domain uses HTTPS. Here is the code I'm using.
I get this error if I hit a site without http:// in the domain:
Warning: get_meta_tags(www.drhugopavon.com/): failed to open stream: No such file or directory in C:\xampp\htdocs\webresp\index.php on line 16
$urls = array(
'https://www.smilesbycarroll.com/',
'https://hurstbournedentalcare.com/',
'https://www.dentalhc.com/',
'https://www.springhurstdentistry.com/',
'https://www.smilesbycarroll.com/',
'www.drhugopavon.com/'
);
foreach ($urls as $url) {
    $tags = get_meta_tags($url);
    if (isset($tags['viewport'])) {
        echo "$url tag exists" . "<br>";
    } else {
        echo "$url tag doesn't exist" . "<br>";
    }
}
You could use parse_url() to check if the element scheme exists or not. If not, you could add it:
$urls = array(
'https://www.smilesbycarroll.com/',
'https://hurstbournedentalcare.com/',
'https://www.dentalhc.com/',
'https://www.springhurstdentistry.com/',
'https://www.smilesbycarroll.com/',
'www.drhugopavon.com/'
);
$urls = array_map(function($url) {
$data = parse_url($url);
if (!isset($data['scheme'])) $url = 'http://' . $url ;
return $url;
}, $urls);
print_r($urls);
You can use this to check whether the URL starts with http:
foreach($urls as $url){
    // Prepend http:// if the URL does not already start with "http"
    // (testing for "http" anywhere in the string would wrongly skip e.g. www.httpstatus.com)
    if(strpos($url, "http") !== 0)
        $url = "http://" . $url;
    $tags = get_meta_tags($url);
}
Another, simpler option is to check for :// in the URL:
foreach($urls as $url){
    // Prepend http:// if the URL has no scheme separator
    if(strpos($url, "://") === FALSE)
        $url = "http://" . $url;
    $tags = get_meta_tags($url);
}
Or you can use a regex, as Wild Beard suggested.
you know whats funny, I thought it was because it had http but I put error_reporting(0); in my original code and it worked as I wanted it to haha.
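For reference, the scheme check discussed above can be wrapped in a small helper. This is only a sketch: the name ensureScheme() and its $default parameter are made up for illustration, and it only decides whether a scheme is present, not whether the site actually answers on http or https.

```php
<?php
// Hypothetical helper: return the URL unchanged if it already starts with a
// scheme such as http:// or https://, otherwise prepend a default scheme.
function ensureScheme(string $url, string $default = 'http'): string
{
    // A scheme is a letter followed by letters, digits, "+", "." or "-", then "://"
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) === 1) {
        return $url;
    }
    // Drop leading slashes so protocol-relative input ("//host") does not end up with four
    return $default . '://' . ltrim($url, '/');
}
```

Each entry could be passed through it before the lookup, for example get_meta_tags(ensureScheme($url)). Note that error_reporting(0) only hides the warning; the lookup for the scheme-less URL still fails.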

Converting incorrect url to correct url

I am looking for a function that can convert domain.com into http://domain.com/.
Should I do this with a regex, or is there a built-in PHP function that can handle this?
I have a bunch of website addresses saved in MySQL like this:
domain.com
www.domain.com
http://domain.com
I'd like to convert all of those to http://domain.com, and I am looking for a reliable way to do this so I won't break the website addresses.
I fixed it like this:
$url = 'domain.com';
if (strpos($url, '://') === false)
$url = 'http://' . $url;
echo $url;
based on: Validate url and convert into protocol format
You could do something like this:
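$string = "http://www.domain.com";
echo url_fix($string); // http://domain.com
function url_fix($str)
{
    // Strip the scheme: "http://www.domain.com" becomes "www.domain.com"
    $str = str_replace(array("http://", "https://"), "", $str);
    // Strip a leading "www." if present: "www.domain.com" becomes "domain.com"
    if (strpos($str, 'www.') === 0) {
        $str = substr($str, 4);
    }
    // Prepend the scheme again: "http://domain.com"
    return "http://" . $str;
}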
Instead of checking for both http:// and www. with a fancy regex, you can strip both prefixes (if present) and then just prepend http:// to the remaining example.com.

How to safely get full URL of parent directory of current PHP page

I'm using:
$domain = $_SERVER['HTTP_HOST'];
$path = $_SERVER['SCRIPT_NAME'];
$themeurl = $domain . $path;
But this of course gives the full URL of the current file.
Instead I need the full URL minus the current file, up one directory, and without the trailing slash.
So no matter what the browser URL is (localhost, https://, http://, etc.), I want the full real URL path of the parent directory (bypassing any mod_rewrite rules), without a trailing slash.
How is this done?
It should be safe against XSS. From what I've read, using anything but 'SCRIPT_NAME' carries that risk, though I'm not sure; I've just been reading a ton trying to figure this out.
examples:
if given:
https://stackoverflow.com/questions/somequestions/index.php
need:
https://stackoverflow.com/questions
without the trailing slash.
and should also work for say:
http://localhost/GetSimple/admin/load.php
to get
http://localhost/GetSimple
which is what I'm trying to do.
Thank you.
Edit:
Here's the working solution I used:
$url = isset($_SERVER['HTTPS']) ? 'https://' : 'http://';
$url .= $_SERVER['SERVER_NAME'];
$url .= htmlspecialchars($_SERVER['REQUEST_URI']);
$themeurl = dirname(dirname($url)) . "/theme";
it works perfectly.
That's easy: use the dirname() function twice :)
echo dirname(dirname('https://stackoverflow.com/questions/somequestions/index.php'));
Also note @Sid's comment. When you need the full URI of the current script, with protocol and server, then use something like this:
$url = isset($_SERVER['HTTPS']) ? 'https://' : 'http://';
$url .= $_SERVER['SERVER_NAME'];
$url .= $_SERVER['REQUEST_URI'];
echo dirname(dirname($url));
I have a simpler syntax for getting the parent path (note that $_SERVER['PHP_SELF'] contains only the path, not the host or port):
dirname($_SERVER['PHP_SELF'])
This gives you the direct parent of the current script's path.
If you want to go up two directories, nest the calls:
dirname(dirname($_SERVER['PHP_SELF']))
dirname() returns the parent of a path, and $_SERVER['PHP_SELF'] holds the path of the current script.
Thank you, https://stackoverflow.com/users/171318/hek2mgl
I do not suggest using dirname(), as it is meant for directories and not for URIs. Examples:
dirname("http://example.com/foo/index.php") returns http://example.com/foo
dirname("http://example.com/foo/") returns http://example.com
dirname("http://example.com/") returns http:
dirname("http://example.com") returns http:
So you have to be very careful which $_SERVER variable you use, and of course it works only for this specific problem. A much better general solution would be to use a currentdir() helper (a user-defined function, not shown here, that removes the filename and query from a URL), on top of which you could build this to get the parent directory:
function parentdir($url) {
// note: parent of "/" is "/" and parent of "http://example.com" is "http://example.com/"
// remove filename and query
$url = currentdir($url);
// get parent
$len = strlen($url);
return currentdir(substr($url, 0, $len && $url[ $len - 1 ] == '/' ? -1 : $len));
}
Examples:
parentdir("http://example.com/foo/bar/index.php") returns
http://example.com/foo/
parentdir("http://example.com/foo/index.php") returns http://example.com/
parentdir("http://example.com/foo/") returns http://example.com/
parentdir("http://example.com/") returns http://example.com/
parentdir("http://example.com") returns http://example.com/
So you would have much more stable results. Maybe you could explain why you wanted to remove the trailing slash. My experience is that it produces more problems as you are not able to differentiate between a file named "/foo" and a folder with the same name without using is_dir(). But if this is important for you, you could remove the last char.
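Since currentdir() is not shown in this thread, here is a separate, self-contained sketch of the same idea built on parse_url() instead. The name parentUrl() is hypothetical; unlike parentdir() above, it returns the result without a trailing slash, which is what the question asked for.

```php
<?php
// Hypothetical helper: URL of the directory one level above the directory
// that contains the current file, without a trailing slash.
function parentUrl(string $url): string
{
    $parts = parse_url($url);
    $path  = $parts['path'] ?? '/';

    // Drop the filename: "/questions/somequestions/index.php" becomes "/questions/somequestions"
    $dir = substr($path, 0, (int) strrpos($path, '/'));
    // Go up one level: "/questions/somequestions" becomes "/questions"
    $parent = substr($dir, 0, (int) strrpos($dir, '/'));

    // Keep an explicit port if the URL had one; query and fragment are dropped
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $parts['scheme'] . '://' . $parts['host'] . $port . $parent;
}
```

Working on the parsed path rather than the whole URL avoids the dirname("http://example.com/") pitfall listed above, because the scheme and host are never treated as path components. The sketch assumes an absolute URL with a scheme and host.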
This example works with ports
function full_url($s)
{
    $ssl = !empty($s['HTTPS']) && $s['HTTPS'] == 'on';
    $sp = strtolower($s['SERVER_PROTOCOL']);
    $protocol = substr($sp, 0, strpos($sp, '/')) . ($ssl ? 's' : '');
    $port = $s['SERVER_PORT'];
    // Omit the port when it is the default one for the protocol
    $port = ((!$ssl && $port == '80') || ($ssl && $port == '443')) ? '' : ':' . $port;
    // HTTP_HOST already carries a non-default port, so only append it to SERVER_NAME
    $host = isset($s['HTTP_HOST']) ? $s['HTTP_HOST'] : $s['SERVER_NAME'] . $port;
    return $protocol . '://' . $host . $s['REQUEST_URI'];
}
$themeurl = dirname(dirname(full_url($_SERVER))).'/theme';
echo 'Theme URL';
Source: https://stackoverflow.com/a/8891890/175071
I'm with hek2mgl. However, just in case the script isn't always specifically 2 directories below your target, you could use explode:
$parts = explode("/",ltrim($_SERVER['SCRIPT_NAME'],"/"));
echo $_SERVER['HTTP_HOST'] . "/" . $parts[0];
As hek2mgl mentioned, it's correct, and a more dynamic approach would be dirname(dirname(htmlspecialchars($_SERVER['REQUEST_URI'])));.
EDIT:
$_SERVER['REQUEST_URI'] omits the domain name. Referring to @hek2mgl's post, you can use echo dirname(dirname(htmlspecialchars($url)));
Here are useful commands to get the desired path:
(For example, you are executing http://yoursite.com/folder1/folder2/yourfile.php)
__FILE__ (on L.Hosting) === /home/xfiddlec/http_docs/folder1/folder2/yourfile.php
__FILE__ (on Localhost) === C:\wamp\www\folder1\folder2\yourfile.php
$_SERVER['HTTP_HOST'] === www.yoursite.com (or without WWW)
$_SERVER["PHP_SELF"] === /folder1/folder2/yourfile.php
$_SERVER["REQUEST_URI"] === /folder1/folder2/yourfile.php?var=blabla
$_SERVER["DOCUMENT_ROOT"] === /home/xfiddlec/http_docs
// BASENAME and DIRNAME (let's say __FILE__ is '/folder1/folder2/yourfile.php')
basename(__FILE__) ==== yourfile.php
dirname(__FILE__) ==== /folder1/folder2
Examples:
*HOME url ( yoursite.com )
<?php echo $_SERVER['HTTP_HOST'];?>
*file's BASE url ( yoursite.com/anyfolder/myfile.php )
<?php echo $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF']; ?>
*COMPLETE current url ( yoursite.com/anyfolder/myfile.php?action=blabla )
<?php echo $_SERVER['HTTP_HOST'].$_SERVER["REQUEST_URI"];?>
*CURRENT FOLDER's URL ( yoursite.com/anyfolder/ )
<?php echo $_SERVER['HTTP_HOST'] . dirname($_SERVER['REQUEST_URI']); ?>
*To get RealPath to the file (even if it is included) (change /var/public_html to your desired root)
<?php
$cur_file=str_replace('\\','/',__FILE__); //Then Remove the root path::
$cur_file=preg_replace('/(.*?)\/var\/public_html/','',$cur_file);
?>
P.S. For WordPress, there are already predefined functions to get plugin or theme URLs,
e.g. to get the plugin folder ( http://yoursite.com/wp-content/plugins/pluginName/ ):
<?php echo plugin_dir_url( __FILE__ );?>

Get Request URI PHP

I am stuck getting the URI in my WordPress application, and my lack of PHP knowledge is slowing my progress.
I have this URL
http://abc.com/my-blog/abc/cde
I need to create a URL like
http://abc.com/my-blog/added-value/abc/cde
where http://abc.com/my-blog is the URL of my WordPress blog, which I can easily get using the following function
home_url()
I can use PHP's $_SERVER["REQUEST_URI"] to get the request URI, which comes up as
/my-blog/abc/cde
and then I have no direct way to add the value as per my requirement.
Is there any way to achieve this easily in PHP or WordPress, where I can get the following information
Home URL
Rest part of the URL
so that in the end I can do the following
Home-URL + custom-value + Rest part of the URL
My point of confusion
On my local setup $_SERVER["REQUEST_URI"] gives me /my-blog/abc/cde, where /my-blog is the installation directory of WordPress, and I can easily skip the first level.
On the production server it's not the same, as /my-blog will not be part of the URL.
Very briefly:
<?php
$url = "http://abc.com/my-blog/abc/cde";
$parts = parse_url($url);
$path = explode("/", $parts["path"]);
array_splice($path, 2, 0, array("added-part")); //This line does the magic!
echo $parts["scheme"] . "://" . $parts["host"] . implode("/",$path);
OK, so if $addition is the bit you want in the middle and $uri is what you obtain from $_SERVER["REQUEST_URI"] then this..
$addition = "MIDDLEBIT/";
$uri = "/my-blog/abc/cde";
$parts = explode("/",$uri);
$homeurl = $parts[1]."/";
$resturl = ""; // initialize before appending to avoid an undefined-variable warning
for($i=2;$i<count($parts);$i++){
    $resturl .= $parts[$i]."/";
}
echo $homeurl . $addition . $resturl;
Should print:
my-blog/MIDDLEBIT/abc/cde/
You might want to use explode() or some other string function. Some examples below:
// Trim the outer slashes first so index 0 is the first segment rather than an empty string
$urlBits = explode("/", trim($_SERVER["REQUEST_URI"], "/"));
//blog address
$blogAddress = $urlBits[0];
//abc
$secondPartOfUri = $urlBits[1];
//cde
$thirdPartOfUri = $urlBits[2];
//all of uri except your blog address
$uri = str_replace("/my-blog/", "", $_SERVER["REQUEST_URI"]);
This is one way to get the current URL in PHP (written as a static method, so it belongs inside a class):
public static function getCurrentUrl($withQuery = true)
{
    // SERVER_PROTOCOL is e.g. "HTTP/1.1" even over TLS, so check the HTTPS flag instead
    $protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
    $uri = $protocol . '://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    return $withQuery ? $uri : str_replace('?' . $_SERVER['QUERY_STRING'], '', $uri);
}
You can store the home url in a variable, using wordpress, using get_home_url()
$home_url = get_home_url();
$custom_value = '/SOME_VALUE';
$uri = $_SERVER['REQUEST_URI'];
$new_url = $home_url . $custom_value . $uri;
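The difference between the local and production setups described in the question (the request URI may or may not start with the install directory) can be handled by stripping the home URL's own path from the request URI before joining the pieces. A minimal sketch, assuming the home URL comes from home_url() or get_home_url(); the function name insertAfterHome() is made up for illustration:

```php
<?php
// Hypothetical helper: build "home URL + custom segment + rest of the request",
// whether or not the request URI starts with the install directory.
function insertAfterHome(string $homeUrl, string $requestUri, string $segment): string
{
    $homeUrl  = rtrim($homeUrl, '/');
    // "/my-blog" for http://abc.com/my-blog, empty string for http://abc.com
    $homePath = (string) parse_url($homeUrl, PHP_URL_PATH);

    // Remove the install directory only when the request really starts with it
    if ($homePath !== '' && strpos($requestUri, $homePath . '/') === 0) {
        $rest = substr($requestUri, strlen($homePath));
    } else {
        $rest = $requestUri;
    }

    return $homeUrl . '/' . trim($segment, '/') . '/' . ltrim($rest, '/');
}
```

In WordPress it might be called as insertAfterHome(home_url(), $_SERVER['REQUEST_URI'], 'added-value'). A query string in the request URI is passed through unchanged.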

1 bug to kill... Letting PHP Generate The Canonical

I'm building a clean canonical URL that always returns one base URL, and I'm stuck on the following case:
<?php
# every page
$extensions = $_SERVER['REQUEST_URI']; # path like: /en/home.ast?ln=ja
$qsIndex = strpos($extensions, '?'); # removes the ?ln=de part
$pageclean = $qsIndex !== FALSE ? substr($extensions, 0, $qsIndex) : $extensions;
$canonical = "http://website.com" . $pageclean; # basic canonical url
?>
<html><head><link rel="canonical" href="<?=$canonical?>"></head>
when URL : http://website.com/de/home.ext?ln=de
canonical: http://website.com/de/home.ext
BUT I want to remove the file extension as well, whether it's .php, .ext, .inc or any other two- or three-character extension (.[xx] or .[xxx]), so the base URL becomes: http://website.com/en/home
Aaah, much nicer! But how do I achieve that in the current code?
Any hints are much appreciated!
Think this should do it, just strip off the end if there is an extension, just like you did for the query string:
$pageclean = $qsIndex !== FALSE ? substr($extensions, 0, $qsIndex) : $extensions;
$dotIndex = strrpos($pageclean, '.');
$pagecleanNoExt = $dotIndex !== FALSE ? substr($pageclean, 0, $dotIndex) : $pageclean;
$canonical = "http://website.com" . $pagecleanNoExt; # basic canonical url
Try this:
preg_match("/(.*)\.([^\?]{2,3})(\?(.*)){0,1}$/msiU", $_SERVER['REQUEST_URI'], $res);
$canonical = "http://website.com" . $res[1];
Here $res[1] is the clean URL,
$res[2] is the extension, and
$res[4] is everything after the "?" (if present and if you need it).
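Both steps (dropping the query string and dropping a two- or three-character extension) can also be combined using parse_url() and an anchored regex. This is a sketch rather than either answer's method; the name canonicalPath() is hypothetical, and the extension pattern assumes ASCII letters and digits only.

```php
<?php
// Hypothetical helper: request URI in, extension-less path out.
function canonicalPath(string $requestUri): string
{
    // Keep only the path part: "/de/home.ext?ln=de" becomes "/de/home.ext"
    $path = (string) parse_url($requestUri, PHP_URL_PATH);

    // Remove a trailing ".xx" or ".xxx"; anchoring at the end of the string
    // leaves dots inside directory names alone
    return preg_replace('/\.[A-Za-z0-9]{2,3}$/', '', $path);
}
```

The canonical URL would then be "http://website.com" . canonicalPath($_SERVER['REQUEST_URI']). Paths without an extension are returned unchanged, so no separate check is needed.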
