I have a PHP script for HTML parsing that works on simple websites, but now I need to parse the cinema program from this website. I am using the file_get_contents() function, which returns just 4 newline characters (\n), and I just can't figure out why.
The website itself will be harder to parse with DOMDocument and XPath, because the program is shown in a pop-up window and doesn't seem to change the URL address, but I will deal with that problem after retrieving the HTML code of the site.
Here is the shortened version of my script:
<?php
$url = "http://www.cinemacity.cz/";
$content = file_get_contents($url);

$dom = new DOMDocument;
// loadHTML() returns false on failure; testing $dom itself is never false
if ($dom->loadHTML($content) === false) {
    echo "FAAAAIL\n";
}

$xpath = new DOMXPath($dom);
$tags = $xpath->query("/html");

foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}
?>
EDIT:
So, following the advice from WBAR (thank you), I was looking for a way to change the headers in the file_get_contents() call, and this is the answer I found elsewhere. Now I am able to obtain the HTML of the site; hopefully I will manage to parse this mess :D
<?php
libxml_use_internal_errors(true);

// Create a stream context with a non-empty User-Agent
$opts = array(
    'http' => array(
        'user_agent' => 'PHP libxml agent', // Wget 1.13.4
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$content = file_get_contents('http://www.cinemacity.cz/', false, $context);

$dom = new DOMDocument;
// loadHTML() returns false on failure; testing $dom itself is never false
if ($dom->loadHTML($content) === false) {
    echo "FAAAAIL\n";
}

$xpath = new DOMXPath($dom);
$tags = $xpath->query("/html");

foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}
?>
The problem is not in PHP but in the target host: it checks the client's User-Agent header. Look at this:
wget http://www.cinemacity.cz/
2012-10-07 13:54:39 (1,44 MB/s) - saved `index.html.1' [234908]
but when the User-Agent header is removed:
wget --user-agent="" http://www.cinemacity.cz/
2012-10-07 13:55:41 (262 KB/s) - saved `index.html.2' [4/4]
Only 4 bytes were returned by the server.
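For what it's worth, any non-empty User-Agent seems to satisfy the host's check, so setting PHP's default agent also works with a plain file_get_contents() call; a minimal sketch (the agent string is an arbitrary example):
// Set a default User-Agent for all PHP HTTP stream requests;
// the exact string here is an arbitrary example.
ini_set('user_agent', 'Mozilla/5.0 (compatible; MyScript/1.0)');
$content = file_get_contents('http://www.cinemacity.cz/');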
Try to get the contents this way:
function get2url($url, $timeout = 30, $port = 80, $buffer = 128) {
    $arr = parse_url($url);
    if (count($arr) < 3) return "URL ERROR";

    $ssl = "";
    if ($arr['scheme'] == "https") $ssl = "ssl://";

    // Only append the query string if the URL actually has one
    $path = isset($arr['path']) ? $arr['path'] : '/';
    if (isset($arr['query'])) $path .= "?" . $arr['query'];

    $header  = "GET " . $path . " HTTP/1.0\r\n";
    $header .= "Host: " . $arr['host'] . "\r\n";
    $header .= "\r\n";

    $f = @fsockopen($ssl . $arr['host'], $port, $errno, $errstr, $timeout);
    if (!$f) {
        return $errstr . " (" . $errno . ")";
    } else {
        @fputs($f, $header);
        $echo = "";
        while (!feof($f)) { $echo .= @fgets($f, $buffer); }
        @fclose($f);
        return $echo;
    }
}
You will have to remove the headers though.
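As a minimal sketch of that header-stripping step (assuming get2url() returned a complete raw response): the headers end at the first blank line, so split there and keep the rest.
// A raw HTTP response is "status line + headers", a blank line,
// then the body; split on the first blank line to get the body.
$response = get2url("http://www.cinemacity.cz/");
list($rawHeaders, $body) = explode("\r\n\r\n", $response, 2);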
I built the following request:
char* messege = "POST / HTTP / 1.1\r\n"
"Host: musicshare.pe.hu \r\n"
"Content-Length: 18\r\n"
"Content-Type: text/plain\r\n"
"Accept: text/plain\r\n\r\n"
"command=michaeliko";
I built this request by looking at captures in Wireshark, because I didn't find any appropriate guide for it.
I get an OK response, but without the output of the "echo" (print) command on the PHP side.
So, for example, a regular Chrome capture of the response will look like this:
Hello michaeliko\n
\n
\n
\n
<form method="post">\n
\n
Subject: <input type="text" name="command"> <br>\n
\n
</form>\t
And for the POST request I showed above, I get this from the website server:
\n
\n
\n
<form method="post">\n
\n
Subject: <input type="text" name="command"> <br>\n
\n
</form>\t
The PHP side looks like this:
<?php
if (isset($_POST['command']))
{
    $command = $_POST['command'];
    echo 'Hello ' . $command . "\n";
}
?>
<form method="post">
Subject: <input type="text" name="command"> <br>
</form>
I have tried manipulating this code in many ways but found no answer.
What is wrong with my request?
I'm not able to comment yet, so I can't ask questions such as: How are you submitting the data? Are you trying to do this from a PHP program? Below is a function I wrote years ago; I don't know if it is what you are looking for. If not, you might try the cURL library.
/*
* POST data to a URL with optional auth and custom headers
* $URL = URL to POST to
* $DataStream = Associative array of data to POST
* $UP = Optional username:password
* $Headers = Optional associative array of custom headers
*/
function hl_PostIt($URL, $DataStream, $UP = '', $Headers = '') {
    // Strip http:// from the URL if present
    $URL = preg_replace('=^http://=', '', $URL);
    // Separate into Host and URI
    $Host = substr($URL, 0, strpos($URL, '/'));
    $URI = strstr($URL, '/');
    // Form up the request body (each() was removed in PHP 8, so use foreach)
    $ReqBody = '';
    foreach ($DataStream as $key => $val) {
        if ($ReqBody) $ReqBody .= '&';
        $ReqBody .= $key . '=' . urlencode($val);
    }
    $ContentLength = strlen($ReqBody);
    // Form auth header (HTTP header lines must end in CRLF)
    $AuthHeader = '';
    if ($UP) $AuthHeader = 'Authorization: Basic ' . base64_encode($UP) . "\r\n";
    // Form other headers
    $OtherHeaders = '';
    if (is_array($Headers)) {
        foreach ($Headers as $HeaderName => $HeaderVal) {
            $OtherHeaders .= "$HeaderName: $HeaderVal\r\n";
        }
    }
    // Generate the request
    $ReqHeader =
        "POST $URI HTTP/1.0\r\n" .
        "Host: $Host\r\n" .
        "User-Agent: PostIt 2.0\r\n" .
        $AuthHeader .
        $OtherHeaders .
        "Content-Type: application/x-www-form-urlencoded\r\n" .
        "Content-Length: $ContentLength\r\n\r\n" .
        $ReqBody;
    // Open the connection to the host
    $socket = fsockopen($Host, 80, $errno, $errstr);
    if (!$socket) {
        $Result["errno"] = $errno;
        $Result["errstr"] = $errstr;
        return $Result;
    }
    // Send the request
    fputs($socket, $ReqHeader);
    // Receive the response
    $Result = array();
    $line = '';
    while (!feof($socket) && $line != "0\r\n") {
        $line = fgets($socket, 1024);
        $Result[] = $line;
    }
    $Return['Response'] = implode('', $Result);
    // The first line is the status line, e.g. "HTTP/1.0 200 OK"
    $StatusLine = array_shift($Result);
    preg_match('=HTTP/... ([0-9]{3}) (.*)=', $StatusLine, $Matches);
    $Return['Status'] = trim($Matches[0]);
    $Return['StatCode'] = $Matches[1];
    $Return['StatMesg'] = trim($Matches[2]);
    // Collect the response headers until the blank line before the body
    do {
        $line = trim(array_shift($Result));
        if (strlen($line)) {
            list($Header, $Value) = explode(': ', $line, 2);
            if (isset($Return[$Header])) {
                // Repeated headers are collected into an array
                if (!is_array($Return[$Header])) {
                    $temp = $Return[$Header];
                    $Return[$Header] = [$temp];
                }
                $Return[$Header][] = $Value;
            } else {
                $Return[$Header] = $Value;
            }
        }
    } while (strlen($line) && $Result);
    // Whatever remains is the body
    $Return['Body'] = implode('', $Result);
    return $Return;
}
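For the cURL route mentioned above, note that PHP only populates $_POST for application/x-www-form-urlencoded (or multipart/form-data) request bodies, which is why the text/plain request in the question never reaches the echo branch. A minimal cURL sketch of the same POST (URL and field name taken from the question; error handling omitted):
// Send an application/x-www-form-urlencoded body, which PHP
// parses into $_POST on the receiving side.
$ch = curl_init('http://musicshare.pe.hu/');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['command' => 'michaeliko']));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
echo $response; // should contain "Hello michaeliko"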
I want to use this web server in PHP:
set_time_limit(0);

$address = '127.0.0.1';
$port = 80;

$sock = socket_create(AF_INET, SOCK_STREAM, 0);
socket_bind($sock, $address, $port) or die('Could not bind to address');

echo "\n Listening On port $port For Connection... \n\n";

while (1) {
    socket_listen($sock);
    $client = socket_accept($sock);
    $input = socket_read($client, 1024);

    $incoming = explode("\r\n", $input);
    $fetchArray = explode(" ", $incoming[0]);
    $file = $fetchArray[1];

    if ($file == "/") {
        $file = "index.php";
    } else {
        $filearray = explode("/", $file);
        $file = $filearray[1];
    }

    echo $fetchArray[0] . " Request " . $file . "\n";

    $Header = "HTTP/1.1 200 OK \r\n" .
              "Date: Fri, 31 Dec 1999 23:59:59 GMT \r\n" .
              "Content-Type: text/html \r\n\r\n";

    $Content = file_get_contents($file);
    $output = $Header . $Content;

    socket_write($client, $output, strlen($output));
    socket_close($client);
}
In my index.php there is an echo that writes a string, plus other functions, but this web server cannot run that echo or any other PHP function, and on my localhost I see a completely white page. Where is the problem?
There can be several reasons for this (one of which, for example, is that your web server lacks read permissions and thus cannot open the .php files). You could try enabling the PHP logs, setting your preferred log locations, and reading the error that prevents your code from being displayed.
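Note also that the server above sends the raw bytes of index.php to the client without ever invoking the PHP interpreter, so the echo can never run; the browser hides the unrendered <?php ... ?> block, which shows up as a blank page. A minimal sketch of executing the file instead, using output buffering ($file and $Header are the variables from the question's code):
// Run the PHP file and capture whatever it echoes, instead of
// sending its raw source to the client.
ob_start();
include $file;
$Content = ob_get_clean();
$output = $Header . $Content;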
I'm trying to create a fire-and-forget method in PHP so that I can POST data to a web server and not have to wait for a response. I read that this could be achieved by using cURL, as in the following code:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_exec($ch);
curl_close($ch);
However, I don't think it works as I expect. For example, if the URL I send the request to has an error, it causes my script to throw an error as well. If it were fire and forget, I would expect that not to happen.
Can anyone tell me whether I'm doing something wrong or offer an alternative suggestion. I'm using Windows locally and Linux for dev, staging and production environments.
UPDATE
I have found an alternative solution here: http://blog.markturansky.com/archives/205
I've cleaned it up into the code below:
function curl_post_async($url, $params = array())
{
// create POST string
$post_params = array();
foreach ($params as $key => &$val)
{
$post_params[] = $key . '=' . urlencode($val);
}
$post_string = implode('&', $post_params);
// get URL segments
$parts = parse_url($url);
// workout port and open socket
$port = isset($parts['port']) ? $parts['port'] : 80;
$fp = fsockopen($parts['host'], $port, $errno, $errstr, 30);
// create output string
$output = "POST " . $parts['path'] . " HTTP/1.1\r\n";
$output .= "Host: " . $parts['host'] . "\r\n";
$output .= "Content-Type: application/x-www-form-urlencoded\r\n";
$output .= "Content-Length: " . strlen($post_string) . "\r\n";
$output .= "Connection: Close\r\n\r\n";
$output .= isset($post_string) ? $post_string : '';
// send output to $url handle
fwrite($fp, $output);
fclose($fp);
}
This one seems to work better for me.
Is it a valid solution?
Yes, using sockets is the way to go if you don't care about the response from the URL you're calling. This is because a socket connection can be terminated straight after sending the request, without waiting for a response, which is exactly what you're after: fire and forget.
Two notes though:
It's no longer a cURL request, so it's worth renaming the function. :)
It's definitely worth checking whether the socket could be opened, to prevent the script from complaining later when it fails:
$fp = fsockopen($parts['host'], $port, $errno, $errstr, 30);
if ( ! $fp)
{
return FALSE;
}
It's worth linking to the original source of the fsockopen() script you're now using:
http://w-shadow.com/blog/2007/10/16/how-to-run-a-php-script-in-the-background/
Here is a cleaned-up version of diggersworld's code that also handles HTTP methods other than POST and throws meaningful exceptions if the function fails.
/**
 * Send an HTTP request, but do not wait for the response
 *
 * @param string $method The HTTP method
 * @param string $url    The URL (including query string)
 * @param array  $params Added to the URL or request body depending on method
 */
public function sendRequest(string $method, string $url, array $params = []): void
{
    $parts = parse_url($url);
    if ($parts === false)
        throw new Exception('Unable to parse URL');

    $host  = $parts['host'] ?? null;
    $port  = $parts['port'] ?? 80;
    $path  = $parts['path'] ?? '/';
    $query = $parts['query'] ?? '';

    parse_str($query, $queryParts);

    if ($host === null)
        throw new Exception('Unknown host');

    $connection = fsockopen($host, $port, $errno, $errstr, 30);
    if ($connection === false)
        throw new Exception('Unable to connect to ' . $host);

    $method = strtoupper($method);
    if (!in_array($method, ['POST', 'PUT', 'PATCH'], true)) {
        $queryParts = $params + $queryParts;
        $params = [];
    }

    // Build request
    $request = $method . ' ' . $path;
    if ($queryParts) {
        $request .= '?' . http_build_query($queryParts);
    }
    $request .= ' HTTP/1.1' . "\r\n";
    $request .= 'Host: ' . $host . "\r\n";

    $body = http_build_query($params);
    if ($body) {
        $request .= 'Content-Type: application/x-www-form-urlencoded' . "\r\n";
        $request .= 'Content-Length: ' . strlen($body) . "\r\n";
    }

    $request .= 'Connection: Close' . "\r\n\r\n";
    $request .= $body;

    // Send request to server
    fwrite($connection, $request);
    fclose($connection);
}
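A hypothetical call, assuming the method sits in some client class (the URL and parameters are placeholders):
// Hypothetical usage; $client is whatever object holds the method,
// and the URL/parameters are placeholders.
$client->sendRequest('POST', 'http://example.com/notify', ['event' => 'signup']);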
$f = fsockopen("www....", 80, $x, $y);
fwrite($f, "GET request HTTP/1.1\r\nConnection: keep-alive\r\n\r\n");
while ($s = fread($f, 1024)) {
    ...
}
The above stalls because of the Connection: keep-alive, and works with Connection: close.
How do you do it without stalling?
It depends on the response: if the transfer-encoding of the response is chunked, then you read until you encounter the "last chunk" (\r\n0\r\n).
If the content-encoding is gzip, then you look at the content-length response header and read that much data and then inflate it. If the transfer-encoding is also set to chunked, then you must dechunk the decoded response.
The easiest thing is to build a simple state machine to read the response from the socket while there is still data left for the response.
When reading chunked data, you should read the first chunk length (and any chunked extension) and then read as much data as the chunk size, and do so until the last chunk.
Put another way:
Read the HTTP response headers (read small chunks of data until you encounter \r\n\r\n)
Parse the response headers into an array
If the transfer-encoding is chunked, read and dechunk the data piece by piece.
If the content-length header is set, you can read that much data from the socket
If the content-encoding is gzip, decompress the read data
Once you have performed the above steps, you should have read the entire response and you can now send another HTTP request on the same socket and repeat the process.
On the other hand, unless you have the absolute need for a keep-alive connection, just set Connection: close in the request and you can safely read while (!feof($f)).
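As a minimal sketch of that simpler route (host and path are placeholders):
// With "Connection: close" the server signals the end of the
// response by closing the socket, so feof() is reliable.
$f = fsockopen('www.example.com', 80, $errno, $errstr, 30);
fwrite($f, "GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n");
$response = '';
while (!feof($f)) {
    $response .= fread($f, 8192);
}
fclose($f);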
I don't have any PHP code for reading and parsing HTTP responses at the moment (I just use cURL) but if you'd like to see actual code, let me know and I can work something up. I could also refer you to some C# code I've made that does all of the above.
EDIT: Here is working code that uses fsockopen to issue an HTTP request and demonstrate reading keep-alive connections with the possibility of chunked encoding and gzip compression. Tested, but not tortured - use at your own risk!!!
<?php
/**
 * PHP HTTP request demo
 * Makes HTTP requests using PHP and fsockopen
 * Supports chunked transfer encoding, gzip compression, and keep-alive
 *
 * @author drew010 <http://stackoverflow.com/questions/11125463/if-connection-is-keep-alive-how-to-read-until-end-of-stream-php/11812536#11812536>
 * @date 2012-08-05
 * Public domain
 *
 */

error_reporting(E_ALL);
ini_set('display_errors', 1);

$host = 'www.kernel.org';

$sock = fsockopen($host, 80, $errno, $errstr, 30);

if (!$sock) {
    die("Connection failed. $errno: $errstr\n");
}

request($sock, $host, 'GET', '/');

$headers = readResponseHeaders($sock, $resp, $msg);
$body = readResponseBody($sock, $headers);

echo "Response status: $resp - $msg\n\n";
echo '<pre>' . var_export($headers, true) . '</pre>';
echo "\n\n";
echo $body;

// if the connection is keep-alive, you can make another request here
// as demonstrated below

request($sock, $host, 'GET', '/kernel.css');

$headers = readResponseHeaders($sock, $resp, $msg);
$body = readResponseBody($sock, $headers);

echo "Response status: $resp - $msg\n\n";
echo '<pre>' . var_export($headers, true) . '</pre>';
echo "\n\n";
echo $body;
exit;

function request($sock, $host, $method = 'GET', $uri = '/', $params = null)
{
    $method = strtoupper($method);
    if ($method != 'GET' && $method != 'POST') $method = 'GET';

    $request = "$method $uri HTTP/1.1\r\n"
             . "Host: $host\r\n"
             . "Connection: keep-alive\r\n"
             . "Accept-encoding: gzip, deflate\r\n"
             . "\r\n";

    fwrite($sock, $request);
}

function readResponseHeaders($sock, &$response_code, &$response_status)
{
    $headers = '';
    $read = 0;

    while (true) {
        $headers .= fread($sock, 1);
        $read += 1;

        if ($read >= 4 && $headers[$read - 1] == "\n" && substr($headers, -4) == "\r\n\r\n") {
            break;
        }
    }

    $headers = parseHeaders($headers, $resp, $msg);

    $response_code = $resp;
    $response_status = $msg;

    return $headers;
}

function readResponseBody($sock, array $headers)
{
    $responseIsChunked = (isset($headers['transfer-encoding']) && stripos($headers['transfer-encoding'], 'chunked') !== false);
    $contentLength = (isset($headers['content-length'])) ? $headers['content-length'] : -1;
    $isGzip = (isset($headers['content-encoding']) && $headers['content-encoding'] == 'gzip') ? true : false;
    $close = (isset($headers['connection']) && stripos($headers['connection'], 'close') !== false) ? true : false;

    $body = '';

    if ($contentLength >= 0) {
        $read = 0;
        do {
            $buf = fread($sock, $contentLength - $read);
            $read += strlen($buf);
            $body .= $buf;
        } while ($read < $contentLength);
    } else if ($responseIsChunked) {
        $body = readChunked($sock);
    } else if ($close) {
        while (!feof($sock)) {
            $body .= fgets($sock, 1024);
        }
    }

    if ($isGzip) {
        $body = gzinflate(substr($body, 10));
    }

    return $body;
}

function readChunked($sock)
{
    $body = '';
    while (true) {
        $data = '';
        do {
            $data .= fread($sock, 1);
        } while (strpos($data, "\r\n") === false);

        if (strpos($data, ' ') !== false) {
            list($chunksize, $chunkext) = explode(' ', $data, 2);
        } else {
            $chunksize = $data;
            $chunkext = '';
        }

        $chunksize = (int)base_convert($chunksize, 16, 10);

        if ($chunksize === 0) {
            fread($sock, 2); // read trailing "\r\n"
            return $body;
        } else {
            $data = '';
            $datalen = 0;
            while ($datalen < $chunksize + 2) {
                $data .= fread($sock, $chunksize - $datalen + 2);
                $datalen = strlen($data);
            }
            $body .= substr($data, 0, -2); // -2 to remove the "\r\n" before the next chunk
        }
    } // while (true)
}

function parseHeaders($headers, &$response_code = null, &$response_message = null)
{
    $lines = explode("\r\n", $headers);
    $return = array();

    $response = array_shift($lines);

    if (func_num_args() > 1) {
        list($proto, $code, $message) = explode(' ', $response, 3);
        $response_code = $code;
        if (func_num_args() > 2) {
            $response_message = $message;
        }
    }

    foreach ($lines as $header) {
        if (trim($header) == '') continue;
        list($name, $value) = explode(':', $header, 2);
        $return[strtolower(trim($name))] = trim($value);
    }

    return $return;
}
The following code works without any problem for me:
<?php
$f = fsockopen("www.google.de",80);
fwrite($f,"GET / HTTP/1.1\r\n Connection: keep-alive\r\n\r\n");
while($s = fread($f,1024)){
echo "got: $s";
}
echo "finished;";
?>
The funny thing is that without keep-alive this example stalls for me.
Can you add an example that can be just copy&pasted and shows your error?
This code gets the headers and content from $url and prints them to the browser. It is really slow, and it's not because of the server. How can I improve this?
$headers = get_headers($url);
foreach ($headers as $value)
header($value);
$fh = fopen($url, "r");
fpassthru($fh);
Thanks
Why make two requests when one will do?
$fh = fopen($url, 'r');
foreach ($http_response_header as $value) {
header($value);
}
fpassthru($fh);
Or:
$content = file_get_contents($url);
foreach ($http_response_header as $value) {
header($value);
}
echo $content;
I'm not sure why you're opening a connection there on line 6 if you already have the headers and have printed them out. Is this doing more than printing out headers?
If you are really looking to just proxy a page, the cURL functions are much more efficient:
<?php
$curl = curl_init("http://www.google.com");
curl_setopt($curl, CURLOPT_HEADER, true);
curl_exec($curl);
curl_close($curl);
?>
Of course, cURL has to be enabled on your server, but it's not uncommon.
Are you trying to make a proxy? If so, here is a recipe, in proxy.php:
<?php
$host = 'example.com';
$port = 80;
$page = $_SERVER['REQUEST_URI'];

$conn = fsockopen($host, $port, $errno, $errstr, 180);
if (!$conn) throw new Exception("$errstr ($errno)");

$hbase = array();
$hbase[] = 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';
if (!empty($_SERVER['HTTP_REFERER'])) $hbase[] = 'Referer: ' . str_ireplace($_SERVER['HTTP_HOST'], $host, $_SERVER['HTTP_REFERER']);
if (!empty($_SERVER['HTTP_COOKIE'])) $hbase[] = 'Cookie: ' . $_SERVER['HTTP_COOKIE'];
$hbase = implode("\n", $hbase);

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $post = file_get_contents("php://input");
    $length = strlen($post);
    $request = "POST $page HTTP/1.0\nHost: $host\n$hbase\nContent-Type: application/x-www-form-urlencoded\nContent-Length: $length\n\n$post";
} else {
    $request = "GET $page HTTP/1.0\nHost: $host\n$hbase\n\n";
}

do {
    $conn = fsockopen($host, 80, $errno, $errstr, 180);
    if (!$conn) throw new Exception("$errstr ($errno)");
    fputs($conn, $request);

    $header = false;
    $body = false;
    stream_set_blocking($conn, false);
    $info = stream_get_meta_data($conn);

    while (!feof($conn) && !$info['timed_out']) {
        $str = fgets($conn);
        if (!$str) {
            usleep(50000);
            continue;
        }
        if ($body !== false) $body .= $str;
        else $header .= $str;
        if ($body === false && $str == "\r\n") $body = '';
        $info = stream_get_meta_data($conn);
    }

    fclose($conn);
} while ($info['timed_out']);

$header = str_ireplace($host, $_SERVER['HTTP_HOST'], $header);
if (stripos($body, $host) !== false) $body = str_ireplace($host, $_SERVER['HTTP_HOST'], $body);
$header = str_replace('domain=.example.com; ', '', $header);

$header_array = explode("\r\n", $header);
foreach ($header_array as $line) header($line);

if (strpos($header, 'Content-Type: text') !== false) {
    $body = str_replace('something', '', $body);
}
echo $body;
In .htaccess:
Options +FollowSymlinks
RewriteEngine on
RewriteBase /
RewriteRule ^(.*)$ proxy.php [QSA,L]
You may be able to pinpoint the slowness by changing $url to a known fast site, or even a local webserver. The only thing that seems possible is a slow response from the server.
Of course, as suggested by GZipp, if you're going to output the file contents as well, just do it with a single request. That would make the server you're requesting from happier.