Using PHP cURL to download a URL with special characters - php

I'm trying to download the following URL https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824 with PHP cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$f = curl_exec($ch);
curl_close($ch);
echo $f;
but the server always returns an error page, while navigating to the same URL in a web browser works fine. Manually comparing the HTML source returned by curl_exec() with the HTML source in a web browser, the difference is immediately obvious.
I tried to utf8_decode() the URL without success.
I cannot simply wrap the URL in urlencode(), because that would encode even structural characters like : and /.
These URLs are retrieved programmatically (scraping) and won't always have the same structure, so it would be difficult to split them and urlencode just some parts.
By the way, modern web browsers seem to handle this case very well. Is there a solution for this in PHP?

Your URL is already encoded. Do not call urlencode() on it; that is the reason you get a 404, since the server decodes only once. Just remove the call.

Parse the URL components, then encode them.
The idea is to use urlencode() only on the path and query parts of the URL, leaving the initial segment alone. I believe that is what browsers do behind the scenes.
You can use parse_url() to split the URL into its components, escape the parts you need to (most likely path and query) and reassemble it. Someone even posted a function to reassemble the URL in the comments on the parse_url() documentation page.
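That idea can be sketched as follows (a minimal version assuming the URL has only scheme, host, path and query components; user, port and fragment are left out for brevity):

```php
<?php
// Split the URL, percent-encode each path segment individually so the
// '/' separators survive, then reassemble.
function encode_url_path(string $url): string {
    $p = parse_url($url);
    $path = implode('/', array_map('rawurlencode', explode('/', $p['path'] ?? '')));
    return $p['scheme'] . '://' . $p['host'] . $path
         . (isset($p['query']) ? '?' . $p['query'] : '');
}
```

rawurlencode() is used here rather than urlencode() so spaces become %20 instead of +, which is what path encoding expects.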

Maybe:
$urli = parse_url('https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
$url = $urli['scheme'] . '://' . $urli['host'] . '/' . implode('/', array_map('urlencode', explode('/', ltrim($urli['path'], '/')))) . (isset($urli['query']) ? '?' . $urli['query'] : '');

I finally ended up with:
function urlencode_parts($url) {
    $parts = parse_url($url);
    // Encode each path segment separately so the '/' separators are kept.
    $parts['path'] = implode('/', array_map('urlencode', explode('/', $parts['path'])));
    $url = new \http\Url($parts);
    return $url->toString();
}
using the pecl_http package's \http\Url class, which replaces the http_build_url() function in newer PHP versions.
It seems that file_get_contents() doesn't work with special characters either.
Update 2018-05-09: this seems fixed in cURL 7.52.1.
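For what it's worth, the browser-like behaviour mentioned in the question can also be approximated without parsing at all: percent-encode only the bytes that cannot appear literally in a URL (spaces, control characters, non-ASCII bytes), leaving the structural characters alone. A minimal sketch (my own, using a PHP 7.4+ arrow function):

```php
<?php
// Encode every byte outside the printable-ASCII range; ':', '/', '?'
// and friends are printable ASCII, so the URL's structure is untouched.
function encode_non_ascii(string $url): string {
    return preg_replace_callback(
        '/[^\x21-\x7E]/',
        fn ($m) => rawurlencode($m[0]),
        $url
    );
}
```

Already-encoded %XX sequences pass through untouched, so applying this to a mixed URL is safe.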

Related

Get dynamic replacement with preg_replace()

This is my url:
ftp://dynamic_text:user_password#my-ftp-domain.com/so-on/param
My regex will turn it into this:
ftp://*****:*****#my-ftp-domain.com/so-on/param
Note that the url can start with either ftp or http.
regex:
My regex below will always return ftp, regardless of whether my URL started with http.
preg_replace('#(ftp|http)://(.*:.*)\##', 'ftp://****:****#', $url);
Now my question is: how can I modify my code so that it dynamically returns ftp or http, depending on how my URL starts?
I read about Named Groups, but I wasn't able to solve it.
Just change the ftp part in your replacement to $1 to get the value of the first group, e.g.
preg_replace('#(ftp|http)://(.*:.*)\##', '$1://****:****#', $url);
//^^
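Since the question mentions named groups, the same replacement can also be sketched with one via preg_replace_callback() (the tighter [^:#]+ character classes are my own tweak to avoid greedy matching; the URL shapes are the question's):

```php
<?php
// Capture the scheme in a named group and reuse it in the callback.
$url = 'ftp://dynamic_text:user_password#my-ftp-domain.com/so-on/param';
$masked = preg_replace_callback(
    '~(?<scheme>ftp|http)://[^:#]+:[^#]+#~',
    fn ($m) => $m['scheme'] . '://****:****#',
    $url
);
// $masked is now 'ftp://****:****#my-ftp-domain.com/so-on/param'
```

With preg_replace_callback() the match array exposes named groups as string keys, which sidesteps remembering group numbers.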

PHP: get_headers issue with bitly AND long urls with spaces

I have a long URL:
$url='http://www.likecool.com/Car/Motorcycle/BMW%20Urban%20Racer%20Concept%20Motorcycle/BMW-Urban-Racer-Concept-Motorcycle.jpg';
I create a short one:
$url='http://goo.gl/oZ04P8';
$url='http://bit.ly/1CzOQbf';
I run $headers = get_headers($url); print_r($headers);
SCENARIO:
get_headers() works correctly for the goo.gl short code but incorrectly (404) for the bit.ly short code.
The difference is that bit.ly redirects to the long URL with literal spaces in it (bad), while goo.gl uses %20 (good).
When get_headers() follows the redirect to the long URL with spaces, it FAILS.
I see no obvious way to fix this - am I missing something?
TWO OPTIONS
- change the way bit.ly encodes? (I could force %20 formatting in the long URL)
- change the way get_headers() encodes its URLs
You could replace the content of the header yourself once you have received it:
$url = 'http://bit.ly/1CzOQbf';
$headers = get_headers($url, 1);
$headers['Location'] = str_replace(' ', '%20', $headers['Location']);
print_r($headers);
Output:
[Location]=>http://www.likecool.com/Car/Motorcycle/BMW%20Urban%20Racer%20Concept%20Motorcycle/BMW-Urban-Racer-Concept-Motorcycle-1.jpg
I added the second parameter to get_headers() so that it names the keys of the returned array, which makes it clearer to use and modify. It is not strictly required.
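One caveat worth noting: when there are multiple redirects, get_headers() returns Location as an array rather than a string. A small helper (my own sketch, not part of the original answer) that handles both shapes:

```php
<?php
// Normalize a Location header value: take the last hop if it is an
// array, then percent-encode any literal spaces.
function fix_location($location): string {
    if (is_array($location)) {
        $location = end($location);   // the final redirect target
    }
    return str_replace(' ', '%20', $location);
}
```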

Which characters in urls cause file_get_contents / curl to fail?

EDIT FOR CLARIFICATION:
I would like to know which characters in a url cause file_get_contents / curl to fail.
In the example below, the only character which causes a problem is the space, so the best thing for me to do would simply be to str_replace spaces in the url with %20. Are there any other characters which also cause it to fail? If so, what are they? Is there a function which does this replacement for me?
ORIGINAL PHRASING:
I'd like to be able to download an arbitrary file by its URL, chosen by the user, and have access to it as a string. My initial reaction was:
$str = file_get_contents($url);
However, this fails on URLs like:
http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG
Next, I tried cURL:
function file_get_contents_curl($url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
However, for the same URL, cURL fails with "Invalid URL".
I've read on a number of questions here that when downloading from URLs with arbitrary characters in them, urlencode must be used. However, this results in:
http%3A%2F%2Fi.ebayimg.com%2Ft%2F2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-%2F00%2Fs%2FNjAwWDYwMA%3D%3D%2F%24%28KGrHqRHJDoE-PBe-SSLBPlrnIYb+Q%7E%7E60_35.JPG
which doesn't fetch with either method, I think because it is now treated as a local file path. What do I need to do to be able to fetch an arbitrary URL?
Try this:
$url = "http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(" . urlencode("KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG");
$str = file_get_contents($url);
Edit: as Galen said, the only problem with the URL is the space, and it can be fixed using str_replace() as below.
$url = "http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG";
$url = str_replace(' ', '+', $url);
$str = file_get_contents($url);
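To answer the title question more generally: per RFC 3986, a URL may contain only unreserved characters, reserved characters, and %-escapes literally; anything else (the space being the most common offender, along with quotes and non-ASCII bytes) must be percent-encoded. A sketch of a cleanup function built on that rule (my own, not from the answers above):

```php
<?php
// Percent-encode any character outside RFC 3986's allowed set, leaving
// structural characters (:, /, ?, #, &, =, ...) and existing %XX escapes alone.
function encode_bad_chars(string $url): string {
    return preg_replace_callback(
        "/[^A-Za-z0-9\\-._~:\\/?#\\[\\]@!$&'()*+,;=%]/",
        fn ($m) => rawurlencode($m[0]),
        $url
    );
}
```

Applied to the eBay URL above, only the space is rewritten (to %20), which matches Galen's observation.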

URL with query string validation using PHP

I need a PHP validation function for URLs with a query string (parameters separated with &). Currently I have the following function for validating URLs:
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
echo preg_match($pattern, $url);
This function correctly validates input like
google.com
www.google.com
http://google.com
http://www.google.com ...etc
But this won't validate the URL when it comes with parameters (a query string), for example:
http://google.com/index.html?prod=gmail&act=inbox
I need a function that accepts both types of URL inputs. Please help. Thanks in advance.
A simple filter_var
if (filter_var($yoururl, FILTER_VALIDATE_URL)) {
    echo 'Ok';
}
might do the trick, although there are problems with URLs that aren't preceded by a scheme:
http://codepad.org/1HAdufMG
You can work around the issue by prepending http:// to URLs without a scheme.
As suggested by @DaveRandom, you could do something like:
$parsed = parse_url($url);
if (!isset($parsed['scheme'])) $url = "http://$url";
before feeding the filter_var() function.
Overall it's still a simpler solution than some extra-complicated regex, though.
It also has these flags available for FILTER_VALIDATE_URL:
FILTER_FLAG_PATH_REQUIRED: requires the URL to contain a path part.
FILTER_FLAG_QUERY_REQUIRED: requires the URL to contain a query string.
http://php.net/manual/en/function.parse-url.php
Some might think this is not 100% bullet-proof, but you can give it a try as a start.
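Combining the scheme-defaulting trick with filter_var(), a minimal validator sketch (the function name is mine; FILTER_VALIDATE_URL has documented limitations, so treat this as a starting point rather than a guarantee):

```php
<?php
function is_valid_url(string $url): bool {
    $parsed = parse_url($url);
    if ($parsed === false) {
        return false;               // seriously malformed URL
    }
    if (!isset($parsed['scheme'])) {
        $url = "http://$url";       // default scheme for bare hostnames
    }
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}
```

This accepts both bare hostnames like www.google.com and full URLs with query strings, which is exactly the pair of cases the question needs.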

replace url using preg_replace php

Hi all, I know preg_replace() can be used for formatting strings, but I need help in this area.
My URL will be like this:
www.example.com/en/index.php
or
www.example.com/fr/index.php
What I want is to get the result
www.example.com/index.php
I need it in PHP code so as to set it in a session.
Can anyone please explain how?
preg_replace('/www\.example\.com\/(.+)\/index\.php/i', "www.example.com/index.php?lang=$1", $url); will do the trick.
This is one way to do it:
$newurl = preg_replace('/\/[a-z][a-z]\//', '/', $url);
Note that the search string appears with quotes and forward slashes ('/.../') and that the forward slashes in the URL then have to be escaped (\/). The language code is then matched with '[a-z][a-z]', but there are several other ways to do this and you may want something more liberal in case there are ever 3 letter codes, or caps. Equally you may need to do something tighter depending on what other URL schemes might appear.
I suspect in this instance it would be faster simply to use str_replace as follows:
$cleanedData = str_replace(array('/en/', '/fr/'), '/', $sourceData);
Finally I got a method, my thanks to Purpletoucan:
$newurl = preg_replace('/\/(en|esp|fr)\//', '/', $url);
It's working now, I think!
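For completeness, a quick self-contained check of that final pattern (the alternation covers only the codes listed here; extend it as needed for other languages):

```php
<?php
$urls = [
    'www.example.com/en/index.php',
    'www.example.com/fr/index.php',
];
foreach ($urls as $url) {
    // '/en/', '/esp/' or '/fr/' collapses to a single '/'
    echo preg_replace('/\/(en|esp|fr)\//', '/', $url), "\n";
}
// prints 'www.example.com/index.php' twice
```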
