Which characters in urls cause file_get_contents / curl to fail? - php

EDIT FOR CLARIFICATION:
I would like to know which characters in a url cause file_get_contents / curl to fail.
In the example below, the only character which causes a problem is the space, so the best thing for me to do would simply be to str_replace spaces in the url with %20. Are there any other characters which also cause it to fail? If so, what are they? Is there a function which does this replacement for me?
ORIGINAL PHRASING:
I'd like to be able to download an arbitrary file by its URL, chosen by the user, and have access to it as a string. My initial reaction was:
$str = file_get_contents($url);
However, this fails on URLs like:
http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG
Next, I tried cURL:
function file_get_contents_curl($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
However, for the same URL, cURL fails with "Invalid URL".
I've read on a number of questions here that when downloading from URLs with arbitrary characters in them, urlencode must be used. However, this results in:
http%3A%2F%2Fi.ebayimg.com%2Ft%2F2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-%2F00%2Fs%2FNjAwWDYwMA%3D%3D%2F%24%28KGrHqRHJDoE-PBe-SSLBPlrnIYb+Q%7E%7E60_35.JPG
which doesn't fetch with either method; I think because the encoded string is now treated as a local file name. What do I need to do to be able to fetch an arbitrary URL?

Try this:
$url = "http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(" . urlencode("KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG");
$str = file_get_contents($url);
Edit: As Galen said, the only problem with the URL is the space, and it can be fixed using str_replace as shown below.
$url = "http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG";
$url = str_replace(' ', '+', $url);
$str = file_get_contents($url);
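For completeness, a minimal sketch following the clarification at the top of the question: the space can also be replaced with %20, which is the percent-encoding a browser would apply (rawurlencode(' ') produces the same thing).
$url = 'http://i.ebayimg.com/t/2-WAY-PHOTO-FRAME-KEY-BOX-SHABBY-CHIC-STYLE-/00/s/NjAwWDYwMA==/$(KGrHqRHJDoE-PBe-SSLBPlrnIYb Q~~60_35.JPG';
// Per the clarification above, the space is the only character here
// that makes file_get_contents() fail, so encode just that.
$url = str_replace(' ', '%20', $url);
$str = file_get_contents($url);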

Related

Using PHP cURL to download a URL with special characters

I'm trying to download the following URL https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824 with PHP cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$f = curl_exec($ch);
curl_close($ch);
echo $f;
but the server always returns an error page. Navigating the same URL in a web browser works fine. Manually comparing the HTML source returned by curl_exec with the HTML source in a web browser, the difference is immediately noticeable.
I tried to utf8_decode() the URL without success.
I cannot simply wrap the URL in urlencode() because it will encode even normal characters like : and /.
These URLs are retrieved programmatically (scraping) and won't always have the same structure, so it would be difficult to split them and urlencode just some parts.
By the way, it seems that modern web browsers handle this case very well. Is there a solution for this in PHP?
Your URL is already encoded. Do not call urlencode() on it; that is the reason you get a 404, as the server decodes only once. Just remove the call.
Parse the URL components, then encode them.
The idea is to use urlencode() only on the path and query parts of the URL, leaving the initial segment alone. I believe that is what browsers do behind the scenes.
You can use parse_url() to split the URL into its components, escape the parts you need to (most likely path and query) and reassemble it. Someone even posted a function to reassemble the URL in the comments on the parse_url() documentation page.
Maybe:
$urli = parse_url('https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824');
$url = $urli['scheme'] . '://' . $urli['host'] . '/' . urlencode(ltrim($urli['path'], '/')) . (isset($urli['query']) ? '?' . $urli['query'] : '');
I finally ended up with:
function urlencode_parts($url) {
    $parts = parse_url($url);
    $parts['path'] = implode('/', array_map('urlencode', explode('/', $parts['path'])));
    $url = new \http\Url($parts);
    return $url->toString();
}
using the pecl_http package's \http\Url class, which replaces the http_build_url() function in newer PHP versions.
It seems that file_get_contents() doesn't work with special characters either.
Update 2018-05-09: this seems to be fixed in cURL 7.52.1.
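If the pecl_http extension isn't available, a rough alternative (a sketch, not part of the original answer) is to reassemble the parts with plain PHP, covering only scheme, host, path and query:
$url = 'https://www.astegiudiziarie.it/vendita-asta-appartamento-genova-via-san-giovanni-d’acri-14-1360824';
$parts = parse_url($url);
// Encode each path segment separately so the / separators survive;
// rawurlencode() percent-encodes the ’ as its UTF-8 bytes.
$path = implode('/', array_map('rawurlencode', explode('/', $parts['path'])));
$fixed = $parts['scheme'] . '://' . $parts['host'] . $path
       . (isset($parts['query']) ? '?' . $parts['query'] : '');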

PHP file_get_contents() not working when + (plus) sign is in file name

I have a couple of files that have the plus sign in their name, and I cannot seem to open them, as the function interprets it as a space.
Example:
File name: Report_Tue-Jun-02-2015-14:11:04-GMT+0200-(W.-Europe-Daylight-Time).html
And when I try to open it:
Warning: file_get_contents(/cores/Report_Tue-Jun-02-2015-14:11:04-GMT 0200-(W.-Europe-Daylight-Time).html) [function.file-get-contents]: failed to open stream: No such file or directory in .... on line 150
this is my code:
$file = $_GET['FILE'];
$file = str_replace('+', '%2B', $file);
$content = file_get_contents($file);
Any thoughts/solutions?
The following methods work fine for me.
<?
# File name: Report_Tue-Jun-02-2015-14:11:04-GMT+0200-(W.-Europe-Daylight-Time).html
# Method 1.
$file = 'Report_Tue-Jun-02-2015-14:11:04-GMT+0200-(W.-Europe-Daylight-Time).html';
$data = file_get_contents($file);
print($data);
# Method 2.
$data = file_get_contents('Report_Tue-Jun-02-2015-14:11:04-GMT+0200-(W.-Europe-Daylight-Time).html');
print($data);
?>
Information on how to write a valid URI is available in RFC 3986. First, you need to take care that all special characters are represented correctly, e.g. spaces as plus signs, and the commercial at sign has to be URL-encoded.
Also, superfluous whitespace at the beginning and end needs to be removed. Using urlencode() on the entire URL will generate an invalid URL. Leaving the URL as it is isn't correct either, because in contrast to browsers, file_get_contents() does not perform URL normalization. In your example, you need to replace the plus sign with %2B:
$string = str_replace('+', '%2B', $string);
This is what, e.g., encodeURIComponent() does in JavaScript. Unfortunately it's not what urlencode() does in PHP (rawurlencode() is safer).
I hope this will work for you.
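A brief sketch of that suggestion: percent-encode the file name when generating the link, so that the + survives PHP's $_GET decoding. The script name viewer.php is a made-up placeholder; the FILE parameter matches the question's code.
$name = 'Report_Tue-Jun-02-2015-14:11:04-GMT+0200-(W.-Europe-Daylight-Time).html';
// rawurlencode() turns + into %2B (and : into %3A), so $_GET['FILE']
// will contain the original + instead of a space.
$link = 'viewer.php?FILE=' . rawurlencode('/cores/' . $name);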
First of all, the file name should not contain the : sign.
Then the below code works fine for me.
$content = file_get_contents('Report_Tue-Jun-02-2015-141104-GMT+0200-(W.-Europe-Daylight-Time).html');
echo $content;
If that does not work, please use:
$link = urlencode($url);
$content = file_get_contents($link);
echo $content;
I think it will work each time.

Insert slash at specific place in string with preg_replace

I'm trying to write a function that will insert a slash into a broken URL that, for some reason, is missing the second slash after http, and return only the fixed version, for example
addSlash(http:/example.net) -> http://example.net
addSlash(https:/example.net) -> https://example.net
I thought this could be solved with preg_replace in one line of code, but I can't get it to work. Using $url = 'http:/example.net' and
preg_replace("#^(https?:)(.*?)#", "\1/\2", $url);
I'm getting back / /example.net, as if 'http' is not matched and placed into \1.
Any suggestions? I would like to avoid callbacks and anonymous functions if possible, because this is supposed to run on an older version of PHP.
/^(https:|http:)?[\\/](?![\\/])(.*)/
Something like that should work for you.
$re = "/^(https:|http:)?[\\/](?![\\/])(.*)/mi";
$str = "https:/regex101.com\nhttp:/regex101.com\nhttps://regex101.com";
$result = preg_replace($re, '$1//$2' , $str);
var_dump($result);
This should work:
preg_replace('#^(https*:)(/.*)#', '\1/\2', $url);
Or even:
preg_replace('#^(https*:)#', '\1/', $url);
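Wrapped in the addSlash() function the question asks for, a minimal sketch (assuming only http/https URLs are passed in) could look like this:
function addSlash($url) {
    // Add the missing slash only when a single / follows the scheme.
    return preg_replace('#^(https?:)/(?!/)#', '$1//', $url);
}
echo addSlash('http:/example.net');   // http://example.net
echo addSlash('https://example.net'); // already correct, left unchanged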

PHP - How can I replace dashes with spaces?

I am currently using the following code to convert my strings to seo friendly urls:
function url($url) {
    $url = str_replace(" ", " ", $url);
    $url = str_replace(array("'", "-"), "", $url);
    $url = mb_convert_case($url, MB_CASE_LOWER, "UTF-8");
    $url = preg_replace("#[^a-zA-Z]+#", "-", $url);
    $url = preg_replace("#(-){2,}#", "$1", $url);
    $url = trim($url, "-");
    return $url;
}
When I query my database, I match the URL against the article titles; my problem is that after running the SEO-friendly URL function, the URLs no longer match any of the article titles in my database.
The addition of dashes (not sure about the lowercase) means that they are completely different to the entries in the database.
What is my next step? Should I remove the dashes before querying the database, and if so, how?
Or is it better practice to include the article id in my url somewhere and reference it?
Querying by id seems far faster and simpler to me than converting your titles back: use URL rewriting to ignore the title (which is there just for search engines) and call a page with the id as a GET argument. Looking at the current URL makes me think that StackOverflow works this way.
Using the current page as an example, I suspect that
http://stackoverflow.com/questions/8034788/php-how-can-i-replace-dashes-with-spaces
is rewritten to something like
http://stackoverflow.com/questions.php?id=8034788
where a simple SQL query gets the content of the article.
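A hedged sketch of that id-based approach (the table and column names articles, id, title, body are assumptions for illustration): the rewritten URL passes only the numeric id, which is used to look the article up directly.
// $pdo is assumed to be an already-open PDO connection.
// e.g. /questions/8034788/some-title rewritten to questions.php?id=8034788
$id = (int) $_GET['id'];
$stmt = $pdo->prepare('SELECT id, title, body FROM articles WHERE id = ?');
$stmt->execute([$id]);
$article = $stmt->fetch();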

How do I remove a "&" symbol from a URL using regular expressions?

How do I remove a & symbol from a URL address using a PHP regular expression?
For example: http://www.google.com/search?hl=en&q=php
Remove the & and get: http://www.google.com/search?hl=enq=php
Thanks
$url = 'http://www.google.com/search?hl=en&q=php';
$url = str_replace('&', '', $url);
I'm not sure I really understand the question. It sounds like you just want to remove & characters. That can be easily done:
$url = str_replace('&', '', $url);
You can remove it easily enough with str_replace. Why you would want to do this, however, is another matter entirely.
Are you trying to insert a URL into a page with PHP and getting validation errors due to the & symbol? In that case urlencode() might be what you really need.
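For instance, a small sketch of that idea: encode only the query value rather than stripping the & from the URL.
// The & separating parameters stays; only the value is encoded.
$url = 'http://www.google.com/search?hl=en&q=' . urlencode('php');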
