How to use multiple async fread with Fibers in PHP?

I would like to get the contents from each URL in a list using fread and Fibers, where each stream does not need to reach feof before another fread runs on another URL.
My current code is the following:
<?php

function getFiberFromStream($stream, $url): Fiber {
    return new Fiber(function ($stream) use ($url): void {
        while (!feof($stream)) {
            echo "reading 100 bytes from $url".PHP_EOL;
            $contents = fread($stream, 100);
            Fiber::suspend($contents);
        }
    });
}
function getContents(array $urls): array {
    $contents = [];
    foreach ($urls as $key => $url) {
        $stream = fopen($url, 'r');
        stream_set_blocking($stream, false);

        $fiber = getFiberFromStream($stream, $url);
        $content = $fiber->start($stream);

        while (!$fiber->isTerminated()) {
            $content .= $fiber->resume();
        }

        fclose($stream);
        $contents[$urls[$key]] = $content;
    }
    return $contents;
}
$urls = [
    'https://www.google.com/',
    'https://www.twitter.com',
    'https://www.facebook.com'
];
var_dump(getContents($urls));
Unfortunately, the echo calls in getFiberFromStream() show that the current code waits to read the entire content from one URL before moving on to the next:
reading 100 bytes from https://www.google.com
reading 100 bytes from https://www.google.com
reading 100 bytes from https://www.google.com //finished
reading 100 bytes from https://www.twitter.com
reading 100 bytes from https://www.twitter.com
reading 100 bytes from https://www.twitter.com //finished
reading 100 bytes from https://www.facebook.com
[...]
I would like something like:
reading 100 bytes from https://www.google.com
reading 100 bytes from https://www.twitter.com
reading 100 bytes from https://www.facebook.com
reading 100 bytes from https://www.google.com
reading 100 bytes from https://www.twitter.com
reading 100 bytes from https://www.facebook.com
[...]

The behaviour you see happens because you poll the current fiber to full completion before moving on to the next fiber.
The solution is to start the fibers for all URLs first, and only then poll them.
Try something like this:
function getContents(array $urls): array {
    $contents = [];
    $fibers = [];

    // start them all up
    foreach ($urls as $key => $url) {
        $stream = fopen($url, 'r');
        stream_set_blocking($stream, false);

        $fiber = getFiberFromStream($stream, $url);
        $content = $fiber->start($stream);

        // save the fiber context so we can process it later
        $fibers[$key] = [$fiber, $content, $stream];
    }

    // now poll
    $have_unterminated_fibers = true;
    while ($have_unterminated_fibers) {
        // first suppose we have no work to do
        $have_unterminated_fibers = false;

        // now loop over the fibers to see if any are still working
        foreach ($fibers as $key => $item) {
            // fetch the context
            $fiber = $item[0];
            $content = $item[1];
            $stream = $item[2];

            // don't loop until the end here, just process the next chunk
            if (!$fiber->isTerminated()) {
                // yep, mark that we still have some work left
                $have_unterminated_fibers = true;

                // update the content in the context
                $content .= $fiber->resume();
                $fibers[$key][1] = $content;
            } else {
                if ($stream) {
                    fclose($stream);

                    // save the result for the return value
                    $contents[$urls[$key]] = $content;

                    // mark the stream as closed in the context
                    // so it doesn't get closed twice
                    $fibers[$key][2] = null;
                }
            }
        }
    }
    return $contents;
}
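One caveat: because the streams are non-blocking, the polling loop above spins and may resume fibers whose streams have no new data yet, appending empty reads. If that matters, the resume step can wait for stream readiness with stream_select(); a minimal sketch, assuming the same $fibers[$key] = [$fiber, $content, $stream] layout as above:
// Wait up to 100 ms for at least one still-open stream to become readable,
// then resume only the fibers whose streams actually have data.
$read = [];
foreach ($fibers as $key => $item) {
    [$fiber, , $stream] = $item;
    if ($stream && !$fiber->isTerminated()) {
        $read[$key] = $stream;
    }
}
$write = $except = null;
if ($read && stream_select($read, $write, $except, 0, 100000) > 0) {
    foreach ($read as $key => $stream) {
        $fibers[$key][1] .= $fibers[$key][0]->resume();
    }
}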

Related

PHP webscraper does not produce errors nor start the loop/create output

I am writing a web scraper in PHP using Gitpod. After a while I managed to solve all the problems. But even though no problems are left, the code neither opens the browser nor produces any output.
Does anybody have an idea why that could be the case?
<?php
if (file_exists('vendor/autoload.php')) {
    require 'vendor/autoload.php';
}

use Goutte\Client;

$client = new Goutte\Client();

// Create a new array to store the scraped data
$data = array();

// Loop through the pages
if ($client->getResponse()->getStatus() != 200) {
    echo 'Failed to access website. Exiting script.';
    exit();
}

for ($i = 0; $i < 3; $i++) {
    // Make a request to the website
    $crawler = $client->request('GET', 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' . $i);

    // Find all the initiatives on the page
    $crawler->filter('.initiative')->each(function ($node) use (&$data) {
        // Extract the information for each initiative
        $title = $node->filter('h3')->text();
        $link = $node->filter('a')->attr('href');
        $description = $node->filter('p')->text();
        $deadline = $node->filter('time')->attr('datetime');

        // Append the data for the initiative to the data array
        $data[] = array($title, $link, $description, $deadline);
    });

    // Sleep for a random amount of time between 5 and 10 seconds
    $sleep = rand(5, 10);
    sleep($sleep);
}

// Open the output file
$fp = fopen('initiatives.csv', 'w');

// Write the header row
fputcsv($fp, array('Title', 'Link', 'Description', 'Deadline'));

// Write the data rows
foreach ($data as $row) {
    fputcsv($fp, $row);
}

// Close the output file
fclose($fp);
?>

CSV import thousands of rows

I have multiple CSV feeds to read with PHP 7.3.
When I have fewer than 500 lines (perhaps 1000) it works fine, but with more I get:
'PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 8192 bytes)'
I have read a lot about this common error.
In php.ini I added
memory_limit=1024M
The result is the same if I change it to -1.
The data comes from an API.
You can see the 4 main functions called below.
The error comes from the last one (the parse function, at $lines[$index] = str_getcsv($line);).
Instead of getting the full file into a variable, I parse it on newlines and then run str_getcsv on each array element.
public function getProducts(Advertiser $advertiser): ProductCollection
{
    $response = $this->getResponse($advertiser->getFeedUrl());
    $content = gzdecode($response);
    $products = $this->csvParser->parseProducts($content);

    return $products;
}

private function getResponse(string $url): string
{
    $cacheKey = sprintf('%s-%s',
        static::CACHE_PREFIX,
        md5($url)
    );

    if ($this->cache->has($cacheKey)) {
        return $this->cache->get($cacheKey);
    }

    $response = $this->client->request(static::REQUEST_METHOD, $url);
    $body = (string) $response->getBody();
    $this->cache->set($cacheKey, $body);

    return $body;
}

public function parseProducts(string $csvContent): ProductCollection
{
    $lines = $this->parse($csvContent);
    $keys = array_shift($lines);

    return new ProductCollection($keys, $lines);
}

private function parse(string $content): array
{
    $lines = str_getcsv($content, PHP_EOL);

    foreach ($lines as $index => $line) {
        $lines[$index] = str_getcsv($line);
    }

    return $lines;
}
So I think it is due to the parse function, but I don't know what to do.
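As an illustration of that line-by-line idea, one way to lower the peak memory in parse() is to spool the content into a php://temp stream and read it back row by row with fgetcsv(), so the large intermediate array of unparsed lines produced by str_getcsv($content, PHP_EOL) never exists. A rough sketch (the 2 MB spill threshold is an arbitrary assumption, and this is meant as a drop-in for the parse() shown above):
private function parse(string $content): array
{
    // Spool the CSV into a temporary stream (kept in memory up to ~2 MB,
    // then spilled to disk) instead of splitting the whole string at once.
    $handle = fopen('php://temp/maxmemory:2097152', 'r+');
    fwrite($handle, $content);
    rewind($handle);

    $lines = [];
    while (($row = fgetcsv($handle)) !== false) {
        $lines[] = $row;
    }
    fclose($handle);

    return $lines;
}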

Upload File in chunks to URL Endpoint using Guzzle PHP

I want to upload files in chunks to a URL endpoint using Guzzle.
I should be able to provide the Content-Range and Content-Length headers.
Using PHP, I know I can split the file using:
define('CHUNK_SIZE', 1024*1024); // Size (in bytes) of chunk

function readfile_chunked($filename, $retbytes = TRUE) {
    $buffer = '';
    $cnt = 0;
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return false;
    }
    while (!feof($handle)) {
        $buffer = fread($handle, CHUNK_SIZE);
        echo $buffer;
        ob_flush();
        flush();
        if ($retbytes) {
            $cnt += strlen($buffer);
        }
    }
    $status = fclose($handle);
    if ($retbytes && $status) {
        return $cnt; // return num. bytes delivered like readfile() does.
    }
    return $status;
}
How do I send the files in chunks using Guzzle, if possible using Guzzle streams?
This method allows you to transfer large files using Guzzle streams:
use GuzzleHttp\Psr7;
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$resource = fopen($pathname, 'r');
$stream = Psr7\stream_for($resource); // Psr7\Utils::streamFor() in newer guzzlehttp/psr7 releases

$client = new Client();

$request = new Request(
    'POST',
    $api,
    [],
    new Psr7\MultipartStream(
        [
            [
                'name' => 'bigfile',
                'contents' => $stream,
            ],
        ]
    )
);

$response = $client->send($request);
Just use the multipart body type as described in the documentation. cURL then handles the file reading internally, so you don't need to implement the chunked read yourself. All required headers will also be set by Guzzle.
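If the endpoint really does require separate chunk requests with Content-Range and Content-Length headers (the multipart approach above sends the whole file in one request), the chunked read from the question can be paired with ordinary Guzzle requests. A rough sketch, where the PUT method, the $endpoint variable, and the server accepting this scheme are all assumptions:
use GuzzleHttp\Client;

// Sketch: send each CHUNK_SIZE slice of the file in its own request, with
// Content-Range/Content-Length describing the slice. The endpoint URL and
// the PUT method are assumptions; adjust to whatever the API expects.
function upload_chunked(string $filename, string $endpoint): bool
{
    $client = new Client();
    $size   = filesize($filename);
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return false;
    }

    $offset = 0;
    while (!feof($handle)) {
        $chunk = fread($handle, CHUNK_SIZE);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $length = strlen($chunk);

        $client->request('PUT', $endpoint, [
            'headers' => [
                'Content-Length' => $length,
                'Content-Range'  => sprintf('bytes %d-%d/%d', $offset, $offset + $length - 1, $size),
            ],
            'body' => $chunk,
        ]);

        $offset += $length;
    }

    fclose($handle);
    return true;
}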

Yii2 setDownloadHeaders() is not working

Code
public function actionExport() {
    ini_set('memory_limit', '32M');
    $whileAgo = date('Y-m-d', time() - 2*24*60*60); // 7-9 seems to be the limit for # days before the 30s timeout
    $agkn = AdGroupKeywordNetwork::find()
        ->select(['field1', 'field2', ...])
        ->where(['>', 'event_date', $whileAgo])->asArray()->each(10000);
    $dateStamp = date('Y-m-d');

    Yii::$app->response->setDownloadHeaders("stats_$dateStamp.csv", 'text/csv');

    echo 'id,ad_group_keyword_id,keyword,network_id,quality,event_date,clicks,cost,impressions,position,ebay_revenue,prosperent_revenue'.PHP_EOL;
    // flush(); // gives us 55s more // doesn't work with gzip

    foreach ($agkn as $row) {
        echo join(',', $row).PHP_EOL;
        // flush();
    }
}
Tested:
$ time (curl -sv -b 'PHPSESSID=ckg8l603vpls8jgj6h49d32tq0' http://localhost:81/web/ad-group-keyword-network/export | head)
...
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
{ [8277 bytes data]
id,ad_group_keyword_id,keyword,network_id,quality,event_date,clicks,cost,impressions,position,ebay_revenue,prosperent_revenue
9690697,527322,ray ban predator,1,6,2015-11-22,0,0.00,1,5.0,,
It's not downloading a CSV file in the browser either. It's not setting the headers. What is wrong?
Reference: http://www.yiiframework.com/doc-2.0/yii-web-response.html#setDownloadHeaders()-detail
It's because PHP sends the headers at the first echo, before Yii does.
There are a few ways to solve the issue.
Collect the output in a buffer, then send it:
Yii::$app->response->setDownloadHeaders("stats_$dateStamp.csv", 'text/csv');

$data = 'id,ad_group_keyword_id,keyword,network_id,quality,event_date,clicks,cost,impressions,position,ebay_revenue,prosperent_revenue'.PHP_EOL;
foreach ($agkn as $row) {
    $data .= join(',', $row).PHP_EOL;
}

return $data;
If the output is too large to fit in memory, the data can be written to a temporary file instead; then send the file and delete it afterwards. There is no need to set the headers manually in this case:
$filePath = tempnam(sys_get_temp_dir(), 'export');
$fp = fopen($filePath, 'w');
if ($fp) {
    fputs($fp, ...); // write the CSV data here
    fclose($fp);
}

return Yii::$app->response->sendFile($filePath, "stats_$dateStamp.csv")
    ->on(\yii\web\Response::EVENT_AFTER_SEND, function ($event) {
        unlink($event->data);
    }, $filePath);

Download rapidshare file using rapidshare api in php

I am trying to download a RapidShare file using its "download" subroutine as a free user. The following is the code that I use to get a response from the subroutine.
function rs_download($params)
{
    $url = "http://api.rapidshare.com/cgi-bin/rsapi.cgi?sub=download&fileid=".$params['fileid']."&filename=".$params['filename'];
    $reply = @file_get_contents($url);
    if (!$reply)
    {
        return false;
    }
    $result_arr = array();
    $result_keys = array(0=>'hostname', 1=>'dlauth', 2=>'countdown_time', 3=>'md5hex');
    if (preg_match("/DL:(.*)/", $reply, $reply_matches))
    {
        $reply_altered = $reply_matches[1];
    }
    else
    {
        return false;
    }
    foreach (explode(',', $reply_altered) as $index => $value)
    {
        $result_arr[$result_keys[$index]] = $value;
    }
    return $result_arr;
}
For instance, trying to download this:
http://rapidshare.com/files/440817141/AutoRun__live-down.com_Champ.rar
I pass the fileid (440817141) and filename (AutoRun__live-down.com_Champ.rar) to rs_download(...) and I get a response just as RapidShare's API doc says.
The RapidShare API doc (see "sub=download") says to call the server hostname with the download authentication string, but I couldn't figure out what form the URL should take.
Any suggestions? I tried
$download_url = "http://$the-hostname/$the-dlauth-string/files/$fileid/$filename"
and a couple of other variations of the above; nothing worked.
I use cURL to download the file, like the following:
$cr = curl_init();
$fp = fopen("d:/downloaded_files/file1.rar", "w");

// set curl options
$curl_options = array(
    CURLOPT_URL => $download_url,
    CURLOPT_FILE => $fp,
    CURLOPT_HEADER => false,
    CURLOPT_CONNECTTIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true
);
curl_setopt_array($cr, $curl_options);

curl_exec($cr);
curl_close($cr);
fclose($fp);
The above cURL code doesn't seem to work; nothing gets downloaded. Probably it's the download URL that is incorrect.
I also tried this format for the download URL:
"http://rs$serverid$shorthost.rapidshare.com/files/$fileid/$filename"
With this, cURL writes a file entry, but that is all it does (a 0/1 KB file).
Here is the code that I use to get the serverid and shorthost, among a few other values, from RapidShare.
function rs_checkfile($params)
{
    $url = "http://api.rapidshare.com/cgi-bin/rsapi.cgi?sub=checkfiles_v1&files=".$params['fileids']."&filenames=".$params['filenames'];
    // the response from rapidshare would be a string something like:
    // 440817141,AutoRun__live-down.com_Champ.rar,47768,20,1,l3,0
    $reply = @file_get_contents($url);
    if (!$reply)
    {
        return false;
    }
    $result_arr = array();
    $result_keys = array(0=>'file_id', 1=>'file_name', 2=>'file_size', 3=>'server_id', 4=>'file_status', 5=>'short_host', 6=>'md5');
    foreach (explode(',', $reply) as $index => $value)
    {
        $result_arr[$result_keys[$index]] = $value;
    }
    return $result_arr;
}
rs_checkfile(...) takes comma-separated fileids and filenames (no commas when calling for a single file).
Thanks in advance for any suggestions.
You start by requesting ?sub=download&fileid=X&filename=Y, and it returns $hostname,$dlauth,$countdown,$md5hex. Since you're a free user, you have to wait for $countdown seconds and then call ?sub=download&fileid=X&filename=Y&dlauth=Z on the returned hostname to perform the download.
There's a working implementation in Python here that would probably answer any of your other questions.
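For completeness, a sketch in PHP of the two-step flow described above, reusing rs_download() from the question. RapidShare's API is long gone, so the exact shape of the second URL (built from the returned hostname and dlauth) is an assumption based on the description:
// Sketch of the free-user flow: ask the API where to download from, wait out
// the countdown, then fetch the file from the returned host with the dlauth
// token attached. The URL layout of the second request is an assumption.
function rs_free_download(array $params, string $saveTo): bool
{
    $info = rs_download($params); // ['hostname', 'dlauth', 'countdown_time', 'md5hex']
    if ($info === false) {
        return false;
    }

    // free accounts must wait before the download slot opens
    sleep((int) $info['countdown_time']);

    $download_url = "http://".$info['hostname']."/cgi-bin/rsapi.cgi"
        ."?sub=download"
        ."&fileid=".$params['fileid']
        ."&filename=".$params['filename']
        ."&dlauth=".$info['dlauth'];

    $fp = fopen($saveTo, 'w');
    $cr = curl_init();
    curl_setopt_array($cr, array(
        CURLOPT_URL => $download_url,
        CURLOPT_FILE => $fp,
        CURLOPT_FOLLOWLOCATION => true,
    ));
    $ok = curl_exec($cr);
    curl_close($cr);
    fclose($fp);

    return $ok !== false;
}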
