I'm reading a URL with fsockopen() and fread(), and I get this kind of data:
<li
10
></li>
<li
9f
>asd</li>
d
<li
92
Which is totally messed up O_O
--
While using the file_get_contents() function I get this kind of data:
<li></li>
<li>asd</li>
Which is correct! So, what the HELL is wrong? I tried it on my Windows server and my Linux server; both behave the same. And they don't even have the same PHP version.
--
My PHP code is:
$fp = @fsockopen($hostname, 80, $errno, $errstr, 30);
if (!$fp) {
    return false;
} else {
    $out  = "GET /$path HTTP/1.1\r\n";
    $out .= "Host: $hostname\r\n";
    $out .= "Accept-Language: en\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $data = "";
    while (!feof($fp)) {
        $data .= fread($fp, 1024);
    }
    fclose($fp);
    return $data;
}
Any help/tips are appreciated; I've been wondering about this all day :/
Oh, and I can't use fopen() or file_get_contents(), because the server where my script runs doesn't have fopen wrappers enabled >__<
I really want to know how to fix this, just out of curiosity. And I don't think I can use any extra libraries on this server anyway.
About your "strange data" problem, this might be because the server you are requesting data from is transferring it in chunked mode.
You can take a look at the HTTP headers when calling the same URL in your browser; one of those headers might look like this:
Transfer-Encoding: chunked
Quoting Wikipedia's article on that matter:
Each non-empty chunk starts with the number of octets of the data it embeds (size written in hexadecimal) followed by a CRLF (carriage return and line feed), and the data itself. The chunk is then closed with a CRLF. In some implementations, white space characters (0x20) are padded between chunk-size and the CRLF.
The last chunk is a single line, simply made of the chunk-size (0), some optional padding white spaces and the terminating CRLF. It is not followed by any data, but optional trailers can be sent using the same syntax as the message headers.
The message is finally closed by a final CRLF combination.
This looks close to what you are getting... So I'm guessing this is the problem.
As far as I remember, curl knows how to deal with that -- so the easy way would be to use curl instead of fsockopen and the like.
And using curl is often a better idea than using sockets: it will deal with many problems you might encounter, like this one ;-)
Another idea, if you don't have curl enabled on your server, would be to use some already existing library based on fsockopen -- hoping it would already take care of those kinds of things for you.
For instance, I've worked with Snoopy a couple of times; maybe it already knows how to deal with that?
(Not sure: you'll have to test it yourself -- or take a look at the documentation to find out if this is OK.)
Still, if you want to deal with the mysteries of the HTTP protocol by yourself... well, I wish you luck!
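If you do go down that road, here is a minimal sketch of what the de-chunking could look like -- assuming the response headers have already been stripped off and $body holds only the chunked payload; decode_chunked is a made-up helper name, and chunk extensions/trailers are ignored:
function decode_chunked($body) {
    $decoded = '';
    $offset = 0;
    while ($offset < strlen($body)) {
        // each chunk starts with its size in hex, terminated by CRLF
        $lineEnd = strpos($body, "\r\n", $offset);
        if ($lineEnd === false) {
            break; // malformed input
        }
        $sizeLine = explode(';', substr($body, $offset, $lineEnd - $offset));
        $size = hexdec(trim($sizeLine[0]));
        if ($size === 0) {
            break; // the last chunk has size 0
        }
        // the chunk data follows the CRLF, and is itself closed by a CRLF
        $decoded .= substr($body, $lineEnd + 2, $size);
        $offset = $lineEnd + 2 + $size + 2;
    }
    return $decoded;
}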
You probably want to use cURL.
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// fetch the URL; with CURLOPT_RETURNTRANSFER set, the response body
// is returned as a string instead of being printed directly
$output = curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
?>
With fsockopen(), you get the raw TCP data, not the HTTP contents. I assume you also see the HTTP headers, right? If it's in chunked encoding, you will get all the chunk headers.
This is a known issue. Someone posted a solution here on how to remove chunk headers.
Related
If I download a file from a website using:
$html = file_get_html($url);
Then how can I know the size, in kilobytes, of the HTML string? I want to know because I want to skip files over 100 KB.
If you do file_get_contents, you've already gotten the whole file.
If you mean "skip processing", rather than "skip retrieval", you can just get the length of the string: strlen($html). For kilobytes, divide that by 1024.
This is imprecise because the string may contain UTF-8 characters over one byte in length, and very small files will actually occupy a FS block instead of their byte length, but it's probably good enough for the arbitrary-threshold cutoff you're looking for.
To skip fetching large files, you want to use the cURL library.
<?php
function get_content_length($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    $hraw = explode("\r\n", curl_exec($ch));
    curl_close($ch);
    $hdrs = array();
    foreach ($hraw as $hdr) {
        $a = explode(": ", trim($hdr), 2);
        if (count($a) == 2) { // skip the status line and blank lines
            $hdrs[$a[0]] = $a[1];
        }
    }
    return isset($hdrs['Content-Length']) ? $hdrs['Content-Length'] : FALSE;
}

$url = "http://www.example.com/";
if (get_content_length($url) < 100000) {
    $html = file_get_contents($url);
    print "Yes.\n";
} else {
    print "No.\n";
}
?>
There may be a more elegant way to pull this information out of curl, but this is what came to mind fastest. YMMV.
Note that setting the CURLOPT options this way makes curl use a "HEAD" rather than "GET" request, so we're not actually fetching this URL twice.
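As for a more elegant way: if I remember correctly, cURL can also parse the headers itself, via curl_getinfo(). A sketch under the same HEAD-request setup (CURLINFO_CONTENT_LENGTH_DOWNLOAD returns -1 when the server sent no Content-Length header):
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1); // HEAD request, no body transferred
curl_exec($ch);
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);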
The definition of what a string is differs between PHP and the intuitive meaning:
"Hällo" (mind the umlaut) looks like a 5-character string, but to PHP it is really a 6-byte array (assuming UTF-8). PHP doesn't have a notion of a string representing text; it just sees it as a sequence of bytes (the PHP euphemism is "binary safe").
So strlen("Hällo") will be 6 (in UTF-8).
That said, if you want to skip anything above 100 KB, you probably won't mind if it is 99.5k characters translating to 100k bytes.
file_get_html returns an object, so the information about how big the string is has been lost at that point. Get the string first, then the object:
$html = file_get_contents($url);
echo strlen($html); // size in bytes
$html = str_get_html($html);
You can use mb_strlen() to force an 8-bit encoding (or whatever you need), and then 1 character = 1 byte.
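To illustrate the byte/character difference (assuming the script file itself is saved as UTF-8):
echo strlen("Hällo");              // 6 -- bytes
echo mb_strlen("Hällo", "UTF-8");  // 5 -- characters
echo mb_strlen("Hällo", "8bit");   // 6 -- 1 character = 1 byte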
I'm using fsockopen to connect to an OpenVAS manager and send XML. The code I am using is:
$connection = fsockopen('ssl://'.$server_data['host'], $server_data['port']);
stream_set_timeout($connection, 5);
fwrite($connection, $xml);
$response = '';
while ($chunk = fread($connection, 2048)) {
    $response .= $chunk;
}
However, after reading the first two chunks of data, PHP hangs on fread() and doesn't time out after 5 seconds. I have tried stream_get_contents(), which gives the same result. BUT if I only do one fread(), it works OK; it's just that I want to read everything, regardless of length.
I am guessing it is an issue with OpenVAS, which doesn't end the stream the way PHP expects it to, but that's a shot in the dark. How do I read the stream?
I believe that fread is hanging because on that last chunk it is expecting 2048 bytes of information and is probably getting less than that, so it waits until it times out.
You could try to refactor your code like this:
$bytes_to_read = 2048;
while ($chunk = fread($connection, $bytes_to_read)) {
    $response .= $chunk;
    // unread_bytes tells us how much is left in the stream buffer
    $status = socket_get_status($connection);
    $bytes_to_read = $status["unread_bytes"];
}
That way, you'll read everything in two chunks.... I haven't tested this code, but I remember having a similar issue a while ago and fixing it with something like this.
Hope it helps!
I have pretty basic knowledge of PHP sockets and the FIX protocol altogether. I have an account that allows me to connect to a server and retrieve currency prices.
I adapted this code to connect and figure out what I receive back from the remote server:
$host = "the-server.com";
$port = "2xxxx";
$fixv = "8=FIX.4.2";
$clid = "client-name";
$tid  = "target-name";

$fp = fsockopen($host, $port, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
} else {
    $out = "$fixv|9=70|35=A|49=$clid|56=$tid|34=1|52=20000426-12:05:06|98=0|108=30|10=185|";
    echo "\n".$out."\n";
    fwrite($fp, $out);
    while (!feof($fp)) {
        echo ".";
        echo fgets($fp, 1024);
    }
    fclose($fp);
}
and I get nothing back. The host is good because I'm getting an error when I use a random one.
Is the message I'm sending not generating a reply?
I might not be very good at finding things on Google, but I could not find any simple tutorial on how to do this with PHP (at least nothing that puts FIX and PHP together).
Any help is greatly appreciated.
The FIX separator character is actually '\001' (SOH), not '|', so you have to replace it when sending.
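For example, if you build the message with '|' for readability, you could swap the separators in just before sending -- a sketch, assuming none of your field values contain a literal '|':
// replace the human-readable '|' separators with the real SOH (0x01)
$out = str_replace('|', "\x01", $out);
fwrite($fp, $out);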
Some links for you:
FIX protocol - formal specs
Onixs FIX dictionary - very useful site for tag lookup
Edit 0:
From that same Wikipedia article you mention:
The message fields are delimited using the ASCII 01 character.
...
Example of a FIX message: Execution Report (the pipe character is used to represent the SOH character) ...
Edit 1:
A couple more points:
Tag 9 holds the message length, excluding tags 8 (BeginString), 9 (BodyLength), and 10 (checksum).
Tag 10, the checksum, has to be a modulo-256 sum of the ASCII values of all message characters, including all SOH separators but not the tag 10 field itself (I know, it's stupid to have checksums on top of TCP, but ...). A sketch of that calculation follows.
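For illustration (fix_checksum is a made-up helper name; $msg is everything up to, but not including, the "10=" field):
function fix_checksum($msg) {
    // modulo-256 sum of the ASCII values of every byte, SOH included
    $sum = 0;
    for ($i = 0; $i < strlen($msg); $i++) {
        $sum += ord($msg[$i]);
    }
    // tag 10 is always formatted as exactly three digits
    return sprintf("%03d", $sum % 256);
}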
The issue is the use of fgets(...): it expects a \n, which does not exist in the FIX protocol.
On top of that, a maximum length of 1024 is specified, which the response is unlikely to reach, so fgets() doesn't return on that account either.
To cap it off, since the server doesn't terminate the connection, fgets(...) hangs there "forever".
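A sketch of a read loop that avoids fgets() entirely, using fread() together with a stream timeout (the 5-second value is an arbitrary assumption):
stream_set_timeout($fp, 5); // give up when no data arrives for 5 seconds
$response = '';
do {
    $chunk = fread($fp, 1024);
    $response .= $chunk;
    $meta = stream_get_meta_data($fp);
} while ($chunk !== false && $chunk !== '' && !$meta['timed_out']);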
Is there any way to get only a particular amount of data through cURL?
Option 1:
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array("Range: bytes=0-1000"));
but it's not supported by all servers.
Option 2:
I tried the approach from "Having trouble limiting download size of PHP's cURL function", but that function gives me the error Failed writing body (0 != 11350), and reading around, I found that many say it's a bug.
So, following the write function above, I tried calling curl_close($handle) instead of returning 0, but this throws the error Attempt to close cURL handle from a callback.
Now the only way I can think of is parsing the headers for the content length, but won't this eventually result in 2 requests? First getting the headers with CURLOPT_NOBODY, then getting the full content?
Option 2: I tried the approach from "Having trouble limiting download size of PHP's cURL function", but that function gives me the error Failed writing body (0 != 11350), and reading around, I found that many say it's a bug.
It's not clear what you are doing there exactly. If you return 0 then cURL will signal an error, sure, but you will have read all the data you need. Just ignore the error.
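For the record, a sketch of that write-callback approach -- abort after roughly 1000 bytes and swallow the resulting error (the URL is a placeholder):
$downloaded = '';
$ch = curl_init('http://example.com/file.php');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use (&$downloaded) {
    $downloaded .= $data;
    if (strlen($downloaded) >= 1000) {
        return 0; // returning a "wrong" length makes cURL abort the transfer
    }
    return strlen($data);
});
curl_exec($ch); // reports "Failed writing body" once we abort -- just ignore it
curl_close($ch);
$first1000Bytes = substr($downloaded, 0, 1000);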
Another option, which you don't mention whether you have tried, is to use fopen() with the http:// wrapper. For example:
$h = fopen('http://example.com/file.php', 'r');
$first1000Bytes = fread($h, 1000);
fclose($h);
It is also possible to use fopen() and fgets() to read a line at a time until you believe you've read enough lines, or to read a character at a time using fgetc().
Not sure if this is exactly what you're looking for, but it should limit the amount of data fetched from the remote source.
This seems to solve your problem:
mb_strlen($string, '8bit');
I'm using fsockopen in a small cronjob to read and parse feeds on different servers. For the most part, this works very well. Yet on some servers, I get very weird lines in the response, like this:
<language>en</language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
11
<item>
<title>
1f
July 8th, 2010</title>
<link>
32
http://darkencomic.com/?p=2406</link>
<comments>
3e
But when I open the feed in e.g. Notepad++, it looks just fine, showing:
<language>en</language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<item>
<title>July 8th, 2010</title>
<link>http://darkencomic.com/?p=2406</link>
<comments>
...just to show an excerpt. So, am I doing anything wrong here, or is this beyond my control? I'm grateful for any ideas on how to fix this.
Here's part of the code I'm using to retrieve the feeds:
$fp = @fsockopen($url["host"], 80, $errno, $errstr, 5);
if (!$fp) {
    throw new UrlException("($errno) $errstr ~~~ on opening ".$url["host"]."");
} else {
    $out = "GET ".$path." HTTP/1.1\r\n"
          ."Host: ".$url["host"]."\r\n"
          ."Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $contents = '';
    while (!feof($fp)) {
        $contents .= stream_get_contents($fp, 128);
    }
    fclose($fp);
}
This looks like HTTP chunked transfer encoding -- which is a way HTTP has of segmenting a response into several small parts; quoting:
Each non-empty chunk starts with the number of octets of the data it embeds (size written in hexadecimal) followed by a CRLF (carriage return and line feed), and the data itself. The chunk is then closed with a CRLF. In some implementations, white space characters (0x20) are padded between chunk-size and the CRLF.
When working with fsockopen and the like, you have to deal with the HTTP protocol yourself... which is not always as easy as one might think ;-)
A solution to avoid having to deal with such stuff would be to use something like curl: it already knows the HTTP protocol -- which means you won't have to re-invent the wheel ;-)
I don't see anything strange that could cause that kind of behaviour. Is there any way you can use cURL to do this for you? It might solve the problem altogether :)