PHP: get_headers issue with bit.ly and long URLs with spaces

I have a long URL:
$url='http://www.likecool.com/Car/Motorcycle/BMW%20Urban%20Racer%20Concept%20Motorcycle/BMW-Urban-Racer-Concept-Motorcycle.jpg';
I create a short one:
$url='http://goo.gl/oZ04P8';
$url='http://bit.ly/1CzOQbf';
I run $headers = get_headers($url); print_r($headers);
SCENARIO:
get_headers() works correctly for the goo.gl short code but fails for the bit.ly short code (404).
The difference is that bit.ly redirects to the long URL with literal spaces (bad), while goo.gl redirects with %20 (good).
When get_headers() follows the redirect to the long URL containing spaces, it fails.
I see no obvious way to fix this - am I missing something?
TWO OPTIONS:
- change the way bit.ly encodes the URL (i.e. force %20 formatting in the long URL before shortening)
- change the way get_headers() encodes the URLs it follows

You could fix the content of the Location header yourself once you have received it:
$url = 'http://bit.ly/1CzOQbf';
$headers = get_headers($url, 1);
$headers['Location'] = str_replace(' ', '%20', $headers['Location']);
print_r($headers);
Output:
[Location]=>http://www.likecool.com/Car/Motorcycle/BMW%20Urban%20Racer%20Concept%20Motorcycle/BMW-Urban-Racer-Concept-Motorcycle-1.jpg
I added the second parameter to get_headers() so that it returns an associative array; that makes the result clearer to read and modify, but it isn't strictly required.
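If you also want to follow the fixed redirect yourself, a small wrapper like the one below works as a sketch (the name get_headers_fixed and the handling of multiple redirects are only illustrative):
function get_headers_fixed($shortUrl) {
    $headers = get_headers($shortUrl, 1);
    if (!isset($headers['Location'])) {
        return $headers; // no redirect to fix
    }
    // Location is an array when there are several redirects in the chain.
    $location = is_array($headers['Location']) ? end($headers['Location']) : $headers['Location'];
    // Encode the spaces the way goo.gl (and browsers) would.
    $location = str_replace(' ', '%20', $location);
    // The long URL now resolves instead of returning a 404.
    return get_headers($location, 1);
}
print_r(get_headers_fixed('http://bit.ly/1CzOQbf'));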

Related

How to echo a JavaScript-encoded URL from a GET variable in PHP

How do I echo a decoded URL from the GET method in PHP?
I have this link:
localhost/whatsapp?text=Hello%2C%20Who%20are%20U%3F%0AName%3A%20%0AAddress%3A%20%0AAge%3A%20
with this code:
$text = $_GET['text'];
header("Location: https://web.whatsapp.com/send?phone=00000000&text=".$text);
but I get this error:
Warning: Header may not contain more than a single header, new line detected in index.php on line
When you retrieve values from $_GET, they are URL-decoded. If you want to pass a value back, you need to encode it again, so that newlines, spaces, and other special characters get encoded. Use urlencode($text) for this.
$text = $_GET['text'];
header("Location: https://web.whatsapp.com/send?phone=00000000&text=".urlencode($text));
You will have to parse the incoming encoded URL and append only the relevant part to the new URL, for example:
<?php
$url = 'localhost/wa-gen/?text=Hello%2C%20Who%20are%20U%3F%0AName%3A%20%0AAddress%3A%20%0AAge%3A%20';
$newURL = "https://web.whatsapp.com/send?phone=00000000&".parse_url($url, PHP_URL_QUERY);
?>
parse_url() is very good at extracting pieces from a given URL.
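Putting it together with the redirect from the question, a minimal sketch (assuming the whole encoded query string should simply be forwarded):
// The query part of $url is still encoded, so it can be appended as-is.
$url    = 'localhost/wa-gen/?text=Hello%2C%20Who%20are%20U%3F%0AName%3A%20%0AAddress%3A%20%0AAge%3A%20';
$newURL = 'https://web.whatsapp.com/send?phone=00000000&' . parse_url($url, PHP_URL_QUERY);
header('Location: ' . $newURL);
exit;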

Using PHP cURL to download a URL with special characters

I'm trying to download the following URL https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824 with PHP cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$f = curl_exec($ch);
curl_close($ch);
echo $f;
but the server always returns an error page. Navigating to the same URL in a web browser works fine. Manually comparing the HTML source returned by curl_exec() with the HTML source in a web browser, the difference is immediately obvious.
I tried to utf8_decode() the URL without success.
I cannot simply wrap the URL in urlencode() because it would encode even normal characters like : and /.
These URLs are retrieved programmatically (scraping) and won't always have the same structure, so it would be difficult to split them and urlencode() just some parts.
By the way, modern web browsers seem to handle this case very well. Is there a solution for this in PHP?
Your URL is already encoded. Do not call urlencode() on it; that is the reason you get a 404, since the server decodes only once. Just remove the call.
Parse the URL components, then encode them.
The idea is to use urlencode() only on the path and query parts of the URL, leaving the initial segment alone. I believe that is what browsers do behind the scenes.
You can use parse_url() to split the URL into its components, escape the parts you need to (most likely path and query) and reassemble it. Someone even posted a function to reassemble the URL in the comments on the parse_url() documentation page.
Maybe something like this (note that urlencode() would also encode any further slashes inside the path and uses + for spaces; rawurlencode() per path segment is safer):
$urli = parse_url('https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
$url  = $urli['scheme'] . '://' . $urli['host'] . '/' . urlencode(ltrim($urli['path'], '/'));
if (isset($urli['query'])) {
    $url .= '?' . $urli['query'];
}
I finally ended up with:
function urlencode_parts($url) {
    $parts = parse_url($url);
    $parts['path'] = implode('/', array_map('urlencode', explode('/', $parts['path'])));
    $url = new \http\Url($parts);
    return $url->toString();
}
This uses the \http\Url class from the pecl_http extension, which replaces the http_build_url() function in newer PHP versions.
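If the pecl_http extension is not available, a rough equivalent can be put together by hand. This is only a sketch (the function name is made up, and it assumes the URL always has a scheme, a host, and a path):
// Encode each path segment separately so the slashes stay intact.
function urlencode_path_segments($url) {
    $parts = parse_url($url);
    $path  = implode('/', array_map('rawurlencode', explode('/', $parts['path'])));
    $result = $parts['scheme'] . '://' . $parts['host'] . $path;
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    return $result;
}
echo urlencode_path_segments('https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
// https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d%E2%80%99acri-14-1360824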
It seems that file_get_contents() doesn't work with special characters either.
Update 2018-05-09: this appears to be fixed in cURL 7.52.1.

Show another site in my page and route all links through mine (like a proxy)

I want to make something like a proxy page (not an actual proxy). As far as I know, I need to rewrite all URLs, src and link attributes, and so on, so that styles and images are grabbed from the right place, and so that links go through my page via $_GET["url"], which then serves the next page.
I have tried to preg_replace() each element, but I'm not very good with regular expressions; it works on one website, but on another the CSS doesn't load, for example.
My first question: are there any PHP classes or scripts that make this easy? (I spent hours googling.)
If not, please help me with the following code:
<?php
$url = $_GET["url"];
$text = file_get_contents($url);
$data = parse_url($url);
$url=$data['scheme'].'://'.$data['host'];
$text = preg_replace('|<iframe [^>]*|', '', $text); // #1: strip iframes
$text = preg_replace('/<a(.*?)href="([^"]*)"(.*?)>/', '<a $1 href="http://my.site/?url='.$url.'$2" $3>', $text); // #2: route links through my page
$text = preg_replace('/<link(.*?)href="(?!http:\/\/)([^"]+)"(.*?)/', "<link $1 href=\"".$url."/\\2\"$3", $text); // #3: absolutize stylesheet hrefs
$text = preg_replace('/src="(?!http:\/\/)([^"]+)"/', "src=\"".$url."/\\1\"", $text); // #4: absolutize src attributes
$text = preg_replace('/background:url\(([^"]*)\)/', "background:url(".$url."$1)", $text); // #5: absolutize CSS background URLs
echo $text;
?>
Replacing with "src" №4 i need to denied replace when starts from double slash, because it could starts like 'src="//somethingdomain"' and not need to replace them.
Also i need to ignore replace №2 when href is going to the same domain, or it looks like need.site/news.need.site/324244
And is it possible to pass action in form throught my script? For example google search query.
And one small problem one web site is openning corrent some times before, but after iv open it hundreds times by this script in getting unknown symbols (without any divs body etc...) ��S�n�#�� i was trying to encode to UTF-8 ANSI but symbol just changing,
maybe they ban me ? oO
function link_replace($url, $myurl) {
    $content = file_get_contents($url);
    // Absolute links: send them through the proxy page.
    $content = preg_replace('#href="(http)(.*?)"#is', 'href="'.$myurl.'?url=$1$2"', $content);
    // Relative links: prepend the original base URL, then send them through the proxy page.
    $content = preg_replace('#href="(?!http)(.*?)"#is', 'href="'.$myurl.'?url='.$url.'$1"', $content);
    return $content;
}
echo link_replace($url, $myurl);
I'm not absolutely sure, but I would guess the result is just compressed (e.g. with gzip). Try removing the Accept-Encoding header while proxying the request.
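If compression is the cause, one quick way to test it with file_get_contents() is to ask the server for an uncompressed response (a sketch; Accept-Encoding: identity is standard HTTP, the rest is illustrative):
// Request an uncompressed body so the proxied HTML stays readable.
$context = stream_context_create([
    'http' => [
        'header' => "Accept-Encoding: identity\r\n",
    ],
]);
$text = file_get_contents($_GET['url'], false, $context);
echo $text;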

Urlencode and file_get_contents

We have a URL like http://site.s3.amazonaws.com/images/some image #name.jpg inside $string.
What I'm trying to do (yes, there is whitespace around the URL):
$string = urlencode(trim($string));
$string_data = file_get_contents($string);
What I get (# is also replaced):
file_get_contents(http%3A%2F%2Fsite.s3.amazonaws.com%2Fimages%2Fsome+image+#name.jpg)[function.file-get-contents]: failed to open stream: No such file or directory
If you copy and paste http://site.s3.amazonaws.com/images/some image #name.jpg into the browser's address bar, the image opens.
What's wrong, and how do I fix it?
Using urlencode() on the entire URL generates an invalid URL. Leaving the URL as it is isn't correct either, because in contrast to browsers, file_get_contents() doesn't perform URL normalization. In your example, you need to replace the spaces with %20:
$string = str_replace(' ', '%20', $string);
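A minimal sketch of the whole fetch (my assumption: the # probably needs encoding too, as %23, because the http wrapper treats everything after an unencoded # as a fragment and never sends it to the server):
// Trim the surrounding whitespace, then encode the characters that may not
// appear literally in the path.
$string = trim($string);
$string = str_replace(array(' ', '#'), array('%20', '%23'), $string);
$string_data = file_get_contents($string);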
The URL you have specified is invalid. file_get_contents() expects a valid HTTP URI (more precisely, the underlying http wrapper does). As your URI is not valid, file_get_contents() fails.
You can fix this by turning your invalid URI into a valid one. Information on how to write a valid URI is available in RFC 3986. You need to take care that all special characters are represented correctly, e.g. the spaces and the number sign (#) have to be URL-encoded. The superfluous whitespace at the beginning and end also needs to be removed.
When that is done, the webserver will tell you that access is forbidden. You might then need to add additional request headers via the HTTP context options of the http stream wrapper to solve that. You can find the details in the PHP manual page "http:// / https:// — Accessing HTTP(s) URLs".
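For example, sending a User-Agent header through a stream context (a sketch; the header value is arbitrary):
// Pass extra request headers to the http wrapper via a context.
$context = stream_context_create([
    'http' => [
        'header' => "User-Agent: Mozilla/5.0 (compatible; MyFetcher/1.0)\r\n",
    ],
]);
$data = file_get_contents('http://site.s3.amazonaws.com/images/some%20image%20%23name.jpg', false, $context);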

Digg button rejects encoded URL

I wrote a PHP site (it's still a prototype) and placed a Digg button on it. It was easy, but...
The official manual says: "the URL has to be encoded". I did that with urlencode(). After urlencode(), my URL looks like this:
http%3A%2F%2Fwww.mysite.com%2Fen%2Fredirect.php%3Fl%3Dhttp%3A%2F%2Fwww.othersite.rs%2FNews%2FWorld%2F227040%2FRusia-Airplane-crashed%26N%3DRusia%3A+Airplane+crashed
So far so good, but when I submit that URL to Digg, it is recognized as an invalid URL:
http://www.mysite.com/en/redirect.php?l=http://www.othersite.rs/News/World/227040/Rusia-Airplane-crashed&N=Rusia:+Airplane crashed
If I place a "+" between "Airplane" and "crashed" (at the end of the link), then Digg recognizes it without any problems!
Please help, this bizarre problem is killing my brain cells!
P.S. For the purpose of this question, the URLs have been changed (to nonexistent ones) because the originals involve non-English sites.
After you've urlencode()ed it, encode the resulting plus signs as well:
$encoded_url = urlencode($original_url);
$final_url = str_replace('+', '%2B', $encoded_url);
Or alternatively, you could replace spaces in your URL with + first, and then urlencode() the result:
$spaceless_url = str_replace(' ', '+', $original_url);
$final_url = urlencode($spaceless_url);
If your own site required the parameters in the query string to be encoded in the first place, you wouldn't have the issue (since there wouldn't be an unencoded space in the original URL).
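For completeness, a sketch of the whole flow with the placeholder URL from the question:
$original_url = 'http://www.mysite.com/en/redirect.php?l=http://www.othersite.rs/News/World/227040/Rusia-Airplane-crashed&N=Rusia: Airplane crashed';
$encoded_url = urlencode($original_url);              // spaces become +
$final_url   = str_replace('+', '%2B', $encoded_url); // so that after Digg decodes once, a literal + remains where the space was
// $final_url is what goes into the Digg button's url parameter.
echo $final_url;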
