Japanese (or non-ASCII) URLs and cURL - PHP

I am trying to access a file at this URL: http://www.myurl.com/伊勢/image.jpg.
The URLs are predefined and there is no specific format or consistency.
The basic cURL function I am using is fine for downloading images from myurl.com, but not when the URL contains Japanese characters. I have tried sanitising the URL in various ways (such as urlencode, filter_var, and mb_convert_encoding), with no success.
If I visit the URL directly in the browser it loads fine, so the only problem I can't resolve is how the cURL function handles the non-ASCII (Japanese) characters.
My question is: how can this be resolved? Is there a cURL option that can be included in the function so that it reads the URL the way a browser does?

If I visit the url directly from the browser, it's fine
That means your browser percent-encodes "伊勢" (as %E4%BC%8A%E5%8B%A2) and sends the request in that form behind the scenes, while still showing the readable characters in the address bar.
My suggestion is to use an HTTP debugger, such as Firebug in Firefox or the Developer Tools in Chrome.
Check the "Network" tab and open the detail page for the request to see the REAL parameters; that shows you exactly what your browser sent.
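For illustration (assuming the script itself is saved as UTF-8), you can reproduce that encoding in PHP with rawurlencode():
<?php
// Each UTF-8 byte of 伊勢 gets percent-encoded, which is what the browser sends.
echo rawurlencode('伊勢');   // %E4%BC%8A%E5%8B%A2
// The request that actually reaches the server is therefore:
// http://www.myurl.com/%E4%BC%8A%E5%8B%A2/image.jpg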
Hope this helps.

Nothing special.
I created a PHP file saved as UTF-8 (using Notepad's "Save As" with encoding UTF-8):
<?php
$url = 'http://rp.postcontrol.ru/伊勢.txt';
$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
if ( $result = curl_exec( $ch ) ) {
    echo $result;
} else {
    echo "cURL error: " . curl_error( $ch );
}
curl_close( $ch );
You can grab the PHP file at http://rp.postcontrol.ru/eddz.php.txt.
It works for me and returns (伊勢.txt is in UTF-8 too):
おはようございます eddz さん.

Append the path component as a URL-encoded string and it will work.
For example:
$url = 'http://rp.postcontrol.ru/';
$filename = urlencode("伊勢.txt");
$url .= $filename;
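If the URLs arrive as complete strings with no consistent format (as in the original question), one approach is to split out the path and percent-encode each segment, leaving the scheme and host untouched. A rough sketch, assuming plain http:// URLs in UTF-8 with no query string and no pre-encoded characters (encode_url_path() is a made-up helper, not a built-in):
<?php
// Hypothetical helper: percent-encode each path segment of a URL.
function encode_url_path( $url )
{
    $parts    = parse_url( $url );
    $segments = explode( '/', $parts['path'] );
    $path     = implode( '/', array_map( 'rawurlencode', $segments ) );
    return $parts['scheme'] . '://' . $parts['host'] . $path;
}

$url = encode_url_path( 'http://www.myurl.com/伊勢/image.jpg' );
// http://www.myurl.com/%E4%BC%8A%E5%8B%A2/image.jpg

$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$image = curl_exec( $ch );
curl_close( $ch );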

Related

PHP cURL doesn't work with - in subdomains

I have a PHP script with which I'm trying to get the contents of a page. The code I'm using is below:
$url = "http://test.tumblr.com";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$txt = curl_exec($ch);
curl_close($ch);
echo "$txt";
It works fine for me as it is now. The problem I'm having is that if I change the URL string to
$url = "http://-test.tumblr.com"; or $url = "http://test-.tumblr.com";
It will not work. I understand that -test.example.com and test-.example.com are not valid hostnames, but they do exist on Tumblr. Is there a workaround for this?
I even tried creating a header redirect in another PHP file so that cURL would first request a valid hostname, but it behaves the same way.
Thank you
Domain Names with hyphens
As you can see in a previous question about the allowed characters in a subdomain, - is not a valid character to start or end a subdomain with. So this is actually correct behavior.
The same problem was reported on the curl mailing list some time ago, but since curl follows the standard, there is actually nothing to change on their side.
Most likely tumblr knows about this and therefore offers some alternative address leading to the same site.
Possible workaround
However, you could try using nslookup to manually look up the IP and then send your request directly to that IP (while manually setting the hostname to the correct value). I didn't try this out, but it seems as if nslookup is capable of resolving malformed domain names that start or end with a hyphen.
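Neither the answer above nor I have tested this, but a sketch of that idea in PHP could look something like the following. The IP here is only a placeholder for whatever the external lookup returns; over plain HTTP, the Host header is what tells the server which site you want:
<?php
// Placeholder values: resolve the real IP externally (e.g. with nslookup),
// since curl refuses to parse the hyphen-prefixed hostname itself.
$ip   = '203.0.113.1';          // documentation address, not a real lookup
$host = '-test.tumblr.com';

$ch = curl_init( 'http://' . $ip . '/' );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_HTTPHEADER, array( 'Host: ' . $host ) );
$txt = curl_exec( $ch );
curl_close( $ch );
echo $txt;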
curl
Additionally, you should know that PHP's curl functions are a thin wrapper around libcurl, the same library the curl command-line tool is built on. So if you run into special behaviour, it is most likely due to libcurl itself and not the PHP binding.

How to get Wikipedia page HTML with absolute URLs using the API?

I'm trying to retrieve articles through the Wikipedia API using this code:
$url = 'http://en.wikipedia.org/w/api.php?action=parse&page=example&format=json&prop=text';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
$json = json_decode($c);
$content = $json->{'parse'}->{'text'}->{'*'};
I can view the content on my website and everything is fine, but I have a problem with the links inside the retrieved article. If you open the URL you can see that all the links start with href=\"/,
meaning that if someone clicks on any related link in the article it redirects them to www.mysite.com/wiki/.. (Error 404) instead of en.wikipedia.org/wiki/..
Is there any piece of code that I can add to the existing one to fix this issue?
This seems to be a shortcoming in the MediaWiki action=parse API. In fact, someone already filed a feature request asking for an option to make action=parse return full URLs.
As a workaround, you could either try to mangle the links yourself (like adil suggests), or use index.php?action=render like this:
http://en.wikipedia.org/w/index.php?action=render&title=Example
This will only give you the page HTML with no API wrapper, but if that's all you want anyway then it should be fine. (For example, this is the method used internally by InstantCommons to show remote file description pages.)
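For example, a minimal fetch of the render output (no JSON wrapper to decode, since the response is bare HTML with fully qualified links, which is the point of the workaround):
<?php
$url = 'http://en.wikipedia.org/w/index.php?action=render&title=Example';
$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$content = curl_exec( $ch );   // page HTML only, no API envelope
curl_close( $ch );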
You should be able to fix the links like this:
$content = str_replace('<a href="/w', '<a href="//en.wikipedia.org/w', $content);
In case anyone else needs to replace all instances of the URL: in PHP, str_replace() already replaces every occurrence, so the line above is enough. The regex with a g flag (/<a href="\/w/g) is only needed if you do the replacement in JavaScript; PHP's preg_replace() also replaces all matches by default.

Making PHP cURL skip binary data like images, video, etc

Setting up curl like this:
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$this->domain);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,3);
curl_setopt($ch,CURLOPT_FAILONERROR,TRUE);
curl_setopt($ch,CURLOPT_USERAGENT,"Useragent");
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,TRUE);
curl_setopt($ch,CURLOPT_MAXREDIRS,1);
$str = curl_exec($ch);
return $str;
$str = $this->cURL();
Pass it the URL of an HTML page and all is well, but pass a link directly to a .jpg, for example, and it returns a load of garbled data.
I'd like to ensure that if a page, say, redirects to a .jpg or .gif, it is ignored and only HTML pages are returned.
I can't seem to find a setopt for curl that does this.
Any ideas?
-The Swan.
cURL doesn't care whether the content is text (HTML) or binary data (a JPEG); it just returns whatever you tell it to fetch. You've told curl to follow redirects with the CURLOPT_FOLLOWLOCATION option, so it will follow the redirect chain until it hits the redirect limit or gets something to download.
If you don't know ahead of time what the URL might contain, you have to work around it, for example by issuing a HEAD request first. That returns only the URL's HTTP headers, from which you can extract the MIME type (Content-Type: ...) of the response and decide whether you want to fetch it.
Or just fetch the URL and then keep or toss the data based on the MIME type in the full response's headers.
My idea: use a HEAD request, check whether the Content-Type is interesting (e.g. more HTML), and only then make a GET request for the data.
Set CURLOPT_NOBODY to turn the request into a HEAD.
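A rough sketch of that two-step approach, assuming the server answers HEAD requests sensibly: CURLOPT_NOBODY makes the first request a HEAD, and curl_getinfo() exposes the Content-Type so you can decide whether the follow-up GET is worth it.
<?php
$url = 'http://www.example.com/';   // placeholder; use $this->domain inside the class

// Step 1: HEAD request - headers only, no body is downloaded.
$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_NOBODY, true );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_exec( $ch );
$type = curl_getinfo( $ch, CURLINFO_CONTENT_TYPE );
curl_close( $ch );

// Step 2: only fetch the body if the final target looks like HTML.
$str = false;
if ( $type !== false && stripos( $type, 'text/html' ) === 0 ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    $str = curl_exec( $ch );
    curl_close( $ch );
}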

PHP connecting to MediaWiki API and retrieve data

I noticed there was a question somewhat similar to mine, only with C#.
Let me explain: I'm very new to the whole web-services business, so I'm having some difficulty understanding it (especially given the vague MediaWiki API manual).
I want to retrieve the entire page (an XML file) as a string in PHP and then process it in PHP (I'm pretty sure there are more sophisticated ways to parse XML files, but whatever). The page in question is the Wikipedia Main Page.
I tried doing $fp = fopen($url,'r');. It outputs: HTTP request failed! HTTP/1.0 400 Bad Request. The API does not require a key to connect to it.
Can you describe in detail how to connect to the API and get the page as a string?
EDIT:
The URL is $url='http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main Page';. I simply want to read the entire content of the file into a string to use it.
Connecting to that API is as simple as retrieving the file,
fopen
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$fp = fopen($url, 'r');
$c = '';
while (!feof($fp)) {
    $c .= fread($fp, 8192);
}
fclose($fp);
echo $c;
file_get_contents
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$c = file_get_contents($url);
echo $c;
The above two can only be used if your server has the fopen wrappers enabled.
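A quick way to check that (allow_url_fopen is the relevant php.ini directive):
<?php
if (!ini_get('allow_url_fopen')) {
    echo 'fopen URL wrappers are disabled on this server; use cURL instead.';
}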
Otherwise, if your server has cURL installed, you can use that:
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
echo $c;
You probably need to urlencode the parameters that you are passing in the query string; here, at least "Main Page" requires encoding -- without that encoding, I get a 400 error too.
If you try this, it should work better (note the space is replaced by %20):
$url='http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$str = file_get_contents($url);
var_dump($str);
With this, I'm getting the content of the page.
A solution is to use urlencode(), so you don't have to do the encoding yourself:
$url='http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=' . urlencode('Main Page');
$str = file_get_contents($url);
var_dump($str);
According to the MediaWiki API docs, if you don't specify a User-Agent in your PHP request, Wikimedia will refuse the connection with a 4xx HTTP response code:
https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
You might try updating your code to add that request header, or change the default setting in php.ini if you have edit access to that.
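For example, the User-Agent can be added to either approach used above (the exact string is up to you; the one here is just a placeholder, per the guidelines on that page):
<?php
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';

// With cURL:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyWikiClient/1.0 (contact@example.com)');
$c = curl_exec($ch);
curl_close($ch);

// With file_get_contents(), via a stream context:
$context = stream_context_create(array(
    'http' => array('user_agent' => 'MyWikiClient/1.0 (contact@example.com)'),
));
$c = file_get_contents($url, false, $context);
echo $c;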

file_get_contents() GET request not showing up on my webserver log

I've got a simple PHP script that pings some of my domains using file_get_contents(); however, I have checked my logs and they are not recording any GET requests.
I have
$result = file_get_contents($url);
echo $url. ' pinged ok\n';
where $url for each of the domains is just a simple string of the form http://mydomain.com/ (the echo verifies this). Manual requests that I make myself do show up.
Why would the get requests not be showing in my logs?
Actually, I've got it to register the hit when I send $result to the browser. I guess this means the web server only records browser requests? Is there any way to mimic that in PHP?
OK, I tried PHP's cURL:
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "getcorporate.co.nr");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
Same effect though: no hit registered in the logs. So far it only registers when I feed the HTTP response from my script back to the browser. Obviously that will only work for a single request, not for a batch, which is the purpose of my script.
If something else is going wrong, what debugging output can I look at?
Edit: D'oh! See comments below accepted answer for explanation of my erroneous thinking.
If the request were actually being made, it would be in the logs.
Your example code could be failing silently.
What happens if you do:
<?php
if ($result = file_get_contents($url)) {
    echo "Success";
} else {
    echo "Epic Fail!";
}
If that's failing, you'll want to turn on some error reporting or logging and try to figure out why.
Note: if allow_url_fopen is disabled (or you're running in safe mode), file_get_contents() will not fetch a remote page. This is the most likely reason things would be failing (assuming there's not a typo in the contents of $url).
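A minimal way to surface what is going wrong, for example:
<?php
// Make the failure visible instead of silent.
error_reporting(E_ALL);
ini_set('display_errors', '1');

$url = 'http://mydomain.com/';   // one of the domains from the question
$result = file_get_contents($url);
if ($result === false) {
    // error_get_last() returns the warning file_get_contents() just raised.
    var_dump(error_get_last());
}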
Use curl instead?
That's odd. Maybe there is some caching afoot? Have you tried changing the URL dynamically ($url = $url."?timestamp=".time() for example)?
