I am posting a strange behavior that can be reproduced (at least on apache2+php5). I don't know if I am doing something wrong, but let me explain what I am trying to achieve.
I need to send chunks of binary data (let's say 30 of them) and analyze the average Kbit/s at the end:
I sum each chunk's output time and each chunk's size, and perform my Kbit/s calculation at the end.
<?php
// build my binary chunk
$var= '';
$o=10000;
while($o--)
{
$var.= pack('N', 85985);
}
// get the size, prepare the memory.
$size = strlen($var);
$tt_sent = 0;
$tt_time = 0;
// I send my chunk 30 times
for ($i = 0; $i < 30; $i++)
{
// start time
$t = microtime(true);
echo $var."\n";
ob_flush();
flush();
$e = microtime(true);
// end time
// the difference should represent what it takes the server to
// transmit the chunk to the client, right?
// add this chunk's benchmark to the total
$tt_time += round($e-$t,4);
$tt_sent += $size;
}
// total result
echo "\n total: ".(($tt_sent*8)/($tt_time)/1024)."\n";
?>
In the example above, it works so far (on localhost, it oscillates between 7000 and 10000 Kbit/s across different tests).
Now, let's say I want to shape the transmission, because I know that a single chunk of data is enough for the client to process for a second.
I decide to use usleep(1000000) to mark a pause between chunk transmissions.
<?php
// build my binary chunk
$var= '';
$o=10000;
while($o--)
{
$var.= pack('N', 85985);
}
// get the size, prepare the memory.
$size = strlen($var);
$tt_sent = 0;
$tt_time = 0;
// I send my chunk 30 times
for ($i = 0; $i < 30; $i++)
{
// start time
$t = microtime(true);
echo $var."\n";
ob_flush();
flush();
$e = microtime(true);
// end time
// the difference should represent what it takes the server to
// transmit the chunk to the client, right?
// add this chunk's benchmark to the total
$tt_time += round($e-$t,4);
$tt_sent += $size;
usleep(1000000);
}
// total result
echo "\n total: ".(($tt_sent*8)/($tt_time)/1024)."\n";
?>
In this last example, I don't know why, but the calculated bandwidth can jump from 72,000 Kbit/s to 1,200,000; it is totally inaccurate / irrelevant.
Part of the problem is that the time measured to output my chunk is ridiculously low each time a chunk is sent (after the first usleep).
Am I doing something wrong? Is the output buffer not synchronous?
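One way to sanity-check those per-chunk numbers (a minimal sketch of an alternative measurement, not the original method) is to take a single pair of timestamps around the whole transfer and subtract the deliberate pauses, so the result does not depend on when PHP hands each chunk to its output buffer:
<?php
// Minimal sketch: time the whole transfer once, instead of summing the tiny
// per-chunk echo timings, which only measure how fast PHP hands data to its
// output buffer / Apache rather than how fast the client receives it.
$var = '';
$o = 10000;
while ($o--) {
    $var .= pack('N', 85985);
}
$size  = strlen($var);
$loops = 30;
$pause = 1000000; // microseconds slept between chunks

$start = microtime(true);
for ($i = 0; $i < $loops; $i++) {
    echo $var."\n";
    ob_flush();
    flush();
    usleep($pause);
}
$elapsed = microtime(true) - $start;
// Remove the time deliberately spent sleeping before computing throughput.
$transferTime = $elapsed - ($loops * $pause / 1000000);
echo "\n total: ".(($size * $loops * 8) / $transferTime / 1024)." Kbit/s\n";
?>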
I'm not sure how definitive these tests are, but I found it interesting. On my box I'm averaging around 170,000 kb/s. From a networked box this number goes up to around 280,000 kb/s. I guess we have to assume microtime(true) is fairly accurate, even though I read it is operating-system dependent. Are you on a Linux-based system? The real question is: how do we calculate the kilobits transferred in a one-second period? I try to project how many chunks can be sent in one second and then store the calculated Kb/s to be averaged at the end. I added a sleep(1) before flush() and this results in a negative kb/s, as is to be expected.
Something doesn't feel right, and I would be interested in knowing whether you have improved your testing method. Good luck!
<?php
// build my binary chunk
$var= '';
$o=10000;
//Alternative to get actual bytes
$m1 = memory_get_usage();
while($o--)
{
$var.= pack('N', 85985);
}
$m2 = memory_get_usage();
//Your size estimate
$size = strlen($var);
//Calculate alternative bytes
$bytes = ($m2 - $m1); //40108
//Convert to Kilobytes 1 Kilobyte = 1024 bytes
$kilobytes = $size/1024;
//Convert to Kilobits 1 byte = 8 bits
$kilobits = $kilobytes * 8;
//Display our data for the record
echo "<pre>size: $size</pre>";
echo "<pre>bytes: $bytes</pre>";
echo "<pre>kilobytes: $kilobytes</pre>";
echo "<pre>kilobits: $kilobits</pre>";
echo "<hr />";
//The test count
$count = 100;
//Initialize total kb/s variable
$total = 0;
for ($i = 0; $i < $count; $i++)
{
// Start Time
$start = microtime(true);
// Utilize html comment to prevent browser from parsing
echo "<!-- $var -->";
// End Time
$end = microtime(true);
// Seconds it took to flush binary chunk
$seconds = $end - $start;
// Calculate how many chunks we can send in 1 second
$chunks = (1/$seconds);
// Calculate the kilobits per second
$kbs = $chunks * $kilobits;
// Store the kbs and we'll average all of them out of the loop
$total += $kbs;
}
//Process the average (data generation) kilobits per second
$average = $total/$count;
echo "<h4>Average kbit/s: $average</h4>";
Analysis
Even though I arrive at some arbitrary value in the test, it is still a value that can be measured. Using a networked computer adds insight into what is really going on. I would have thought the localhost machine would have a higher value than the networked box, but the tests prove otherwise in a big way. On localhost we have to both send the raw binary data and receive it. This shows that two threads are sharing CPU cycles, and therefore the supposed kb/s value is in fact lower when testing in a browser on the same machine. We are therefore really measuring CPU cycles, and we obtain higher values when the server is allowed to be a server.
Some interesting things start to show up when you increase the test count to 1000. First, don't make the browser parse the data. It takes a lot of CPU to attempt to render raw data at such high test counts. We can manually watch what is going on with, say, System Monitor and Task Manager. In my case the local machine is a Linux server and the networked box is XP. You can obtain some real kb/s speeds this way, and it makes it obvious that we are dynamically serving up data using mainly the CPU and network interfaces. The server doesn't replicate the data, so no matter how high we set the test count, we only need 40 kilobytes of memory space. So 40 kilobytes can generate 40 megabytes dynamically at 1000 test cases, and 400MB at 10000 cases.
I crashed Firefox on XP with a "virtual memory too low" error after running the 10000 test case several times. System Monitor on the Linux server showed some understandable spikes in CPU and network, but overall it pushed out a large amount of data really quickly and had plenty of room to spare. Running 10000 cases on Linux a few times actually spun up the swap drive and pegged the server's CPU. The most interesting fact, though, is that the values I obtained above only changed when I was both receiving in Firefox and transmitting in Apache while testing locally. I practically locked up the XP box, yet my network value of ~280,000 kb/s did not change in the printout.
Conclusion:
The kb/s value we arrive at above is virtually useless, other than to prove it's useless. The test itself, however, shows some interesting things. At high test counts I believe I could actually see some physical buffering going on in both the server and the client. Our test script dumps the data to Apache and then gets released to continue its work; Apache handles the details of the transfer, of course. This is nice to know, but it proves we can't measure the actual transmission rate from server to browser this way. We can measure our server's data generation rate, I guess, if that's meaningful in some way. Guess what: flushing actually slowed down the speeds. There's a penalty for flushing. In this case there is no reason for it, and removing flush() actually speeds up our data generation rate. Since we are not dealing with networking, our value above is actually more meaningful kept as kilobytes. It's useless anyway, so I'm not changing it.
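To illustrate that last point about the flushing penalty, here is a minimal sketch (my illustration, not part of the original test) that times the echo loop once with flush() on every iteration and once without; the guard around ob_flush() is there because output buffering may or may not be active:
<?php
// Sketch: compare the "data generation rate" with and without flushing.
$var   = str_repeat(pack('N', 85985), 10000); // ~40 KB chunk
$count = 1000;

$t = microtime(true);
for ($i = 0; $i < $count; $i++) {
    echo "<!-- $var -->";
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}
$withFlush = microtime(true) - $t;

$t = microtime(true);
for ($i = 0; $i < $count; $i++) {
    echo "<!-- $var -->"; // buffered; PHP/Apache decide when to send it
}
$noFlush = microtime(true) - $t;

echo "<h4>with flush: {$withFlush}s, without flush: {$noFlush}s</h4>";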
Related
I am running a range of queries in BigQuery and exporting them to CSV via PHP. There are reasons why this is the easiest method for me to do this (multiple queries dependent on variables within an app).
I am struggling with memory issues when the result set is larger than 100mb. It appears that the memory usage of my code seems to grow in line with the result set, which I thought would be avoided by paging. Here is my code:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query,['maxResults'=>5000]);
$FH = fopen($storagepath, 'w');
$rows = $queryResults->rows();
foreach ($rows as $row) {
fputcsv($FH, $row);
}
fclose($FH);
The $queryResults->rows() function returns a Google Iterator which uses paging to scroll through the results, so I do not understand why memory usage grows as the script runs.
Am I missing a way to discard previous pages from memory as I page through the results?
UPDATE
I have noticed that since upgrading to the v1.4.3 BigQuery PHP API, the memory usage actually caps out at 120MB for this process, even when the result set reaches far beyond this (currently processing a 1GB result set). Still, 120MB seems too much. How can I identify and fix where this memory is being used?
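One cheap way to see where it is going (my suggestion, not part of the original code) is to log memory_get_usage() every few thousand rows while streaming to CSV, which shows whether the growth tracks the page size or something else:
$query        = $bq->query($myQuery);
$queryResults = $bq->runQuery($query, ['maxResults' => 5000]);

$FH = fopen($storagepath, 'w');
$n  = 0;
foreach ($queryResults->rows() as $row) {
    fputcsv($FH, $row);
    if (++$n % 5000 === 0) {
        // real usage (true) vs. emalloc-level usage (false) are both worth watching
        error_log(sprintf('%d rows: %.1f MB', $n, memory_get_usage(true) / 1048576));
    }
}
fclose($FH);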
UPDATE 2
This 120MB seems to be tied to about 24KB per maxResults row in the page; e.g., adding 1000 rows to maxResults adds 24MB of memory. So my question now is: why does one row of data use 24KB in the Google Iterator? Is there a way to reduce this? The data itself is < 1KB per row.
Answering my own question
The extra memory is used by a load of PHP type mapping and other data structure info that comes alongside the data from BigQuery. Unfortunately, I couldn't find a way to reduce the memory usage below around 24KB per row multiplied by the page size. If someone finds a way to reduce the bloat that comes along with the data, please post below.
However, thanks to one of the comments, I realized you can extract a query result directly to CSV in a Google Cloud Storage bucket. This is really easy:
$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query);
$qJobInfo = $queryResults->job()->info();
$dataset = $bq->dataset($qJobInfo['configuration']['query']['destinationTable']['datasetId']);
$table = $dataset->table($qJobInfo['configuration']['query']['destinationTable']['tableId']);
$extractJob = $table->extract('gs://mybucket/'.$filename.'.csv');
$table->runJob($extractJob);
However, this still didn't solve my issue, as my result set was over 1GB, so I had to make use of the data sharding feature by adding a wildcard.
$extractJob = $table->extract('gs://mybucket/'.$filename.'*.csv');
This created ~100 shards in the bucket. These need to be recomposed using gsutil compose <shard filenames> <final filename>. However, gsutil only lets you compose 32 files at a time. Given that I will have a variable number of shards, often above 32, I had to write some code to clean them up.
//Save above job as variable
$eJob = $table->runJob($extractJob);
$eJobInfo = $eJob->info();
//This bit of info from the job tells you how many shards were created
$eJobFiles = $eJobInfo['statistics']['extract']['destinationUriFileCounts'][0];
$composedFiles = 0; $composeLength = 0; $subfile = 0; $fileString = "";
while (($composedFiles < $eJobFiles) && ($eJobFiles>1)) {
while (($composeLength < 32) && ($composedFiles < $eJobFiles)) {
// the extract job names shards with a 12-digit number after the filename, so build a string of up to 32 such filenames at a time
$fileString .= "gs://bucket/$filename" . str_pad($composedFiles,12,"0",STR_PAD_LEFT) . ".csv ";
$composedFiles++;
$composeLength++;
}
$composeLength = 0;
// Compose a batch of 32 into a subfile
system("gsutil compose $fileString gs://bucket/".$filename."-".$subfile.".csv");
$subfile++;
$fileString="";
}
if ($eJobFiles > 1) {
//Compose all the subfiles
system('gsutil compose gs://bucket/'.$filename.'-* gs://fm-sparkbeyond/YouTube_1_0/' . $filepath . '.gz');
}
Note: in order to give my Apache user access to gsutil, I had to allow the user to create a .config directory in the web root. Ideally you would use the Google Cloud Storage PHP library instead of shelling out to gsutil, but I didn't want the code bloat.
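For reference, a minimal sketch of what the compose step might look like with the google/cloud-storage client instead of gsutil (assuming the library is installed; the bucket and prefix names are the placeholders used above, and compose() has the same 32-source limit):
require 'vendor/autoload.php';

use Google\Cloud\Storage\StorageClient;

$storage = new StorageClient();
$bucket  = $storage->bucket('mybucket'); // placeholder bucket name

// Collect the shard object names created by the extract job.
$shards = [];
foreach ($bucket->objects(['prefix' => $filename]) as $object) {
    $shards[] = $object->name();
}

// compose() is also limited to 32 sources, so batch the shards like the gsutil loop.
$subfiles = [];
foreach (array_chunk($shards, 32) as $i => $batch) {
    $subfiles[] = $bucket->compose($batch, $filename . '-' . $i . '.csv')->name();
}
if (count($subfiles) > 1) {
    $bucket->compose($subfiles, $filename . '-final.csv');
}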
If anyone has a better answer please post it
Is there a way to get smaller output from the BigQuery library than 24kb per row?
Is there a more efficient way to clean up variable numbers of shards?
I have this very simple PHP call to Alpha Vantage API to fill a table (or list) with NASDAQ stock prices:
<?php
function get_price($commodity = "")
{
$url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=' . $commodity . '&outputsize=full&apikey=myKey';
$obj = json_decode(file_get_contents($url), true);
$date = $obj['Meta Data']['3. Last Refreshed'];
$result = $obj['Time Series (Daily)']['2018-03-23']['4. close'];
$rd_result = round($result, 2);
echo $result;
}
?>
<?php get_price("XOM");
get_price("AAPL");
get_price("MSFT");
get_price("CVX");
get_price("CAT");
get_price("BA");
?>
And it works, but it is just so freakishly slow. It can take over 30 seconds to load, while the JSON from Alpha Vantage itself loads in a fraction of a second.
Does anyone know where I am going wrong?
This is what I did when the API took time to reply; my solution is written in C#, but the logic would be the same.
string[] AlphaVantageApiKey = { "RK*********", "B2***********", "4FD*********QN", "7S3Z*********FRX", "U************I3" };
int ApiKeyValue = 0;
foreach (var stock in listOfStocks)
{
DataTable dtResult = DataRetrival.GetIntradayStockFeedForSelectedStockAs(stock.Symbol.Trim().ToUpper(), ApiKeyValue);
ApiKeyValue = (ApiKeyValue == 4) ? 0 : ApiKeyValue + 1;
}
I use 5 to 6 different API keys when I'm querying data, and I loop through them, one per call, thereby reducing the load on any single key.
I observed that this improved my performance a lot. It takes me less than 1 minute to get intraday data for 50 stocks.
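A rough PHP equivalent of that key rotation (just a sketch; the key strings are placeholders and the price extraction is left as in the original get_price()):
<?php
// Sketch: rotate through several Alpha Vantage API keys, one per request.
$apiKeys = ['KEY_ONE', 'KEY_TWO', 'KEY_THREE', 'KEY_FOUR', 'KEY_FIVE']; // placeholders
$symbols = ['XOM', 'AAPL', 'MSFT', 'CVX', 'CAT', 'BA'];
$keyIndex = 0;

foreach ($symbols as $symbol) {
    $apiKey   = $apiKeys[$keyIndex];
    $keyIndex = ($keyIndex + 1) % count($apiKeys);

    $url = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED'
         . '&symbol=' . $symbol
         . '&apikey=' . $apiKey;
    $obj = json_decode(file_get_contents($url), true);
    // ... pull out the close price exactly as in get_price() above ...
}
?>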
Another way you can improve your performance is to use
outputsize=compact
compact returns only the latest 100 data points in the time series.
UPDATE: Batch Stock Quotes
You might want to consider using this type of query as well. Multiple stock quotes all in one call.
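At the time of writing, Alpha Vantage documented a batch quote endpoint (function=BATCH_STOCK_QUOTES with a comma-separated symbols parameter); assuming that endpoint, a single call for all six symbols might look roughly like this:
<?php
// Sketch only: one request for several symbols, assuming the BATCH_STOCK_QUOTES
// function and its "symbols" parameter as documented at the time.
$symbols = ['XOM', 'AAPL', 'MSFT', 'CVX', 'CAT', 'BA'];
$url = 'https://www.alphavantage.co/query?function=BATCH_STOCK_QUOTES'
     . '&symbols=' . implode(',', $symbols)
     . '&apikey=myKey';
$obj = json_decode(file_get_contents($url), true);
print_r($obj); // inspect the per-symbol quotes returned by the single response
?>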
Also, using the full output size is grabbing data from the past 20 years, if applicable. Take that out of your query and have the API use its condensed output default.
EDIT: According to the above, you should make changes to your query. But it can also be an issue with your server. I tested this for a use case I am working on and it takes me a few seconds to get the data, albeit I am only pulling it for one stock symbol on a page at a time.
Try increasing your memory limit if things are too slow for your liking.
<?php
ini_set('memory_limit','500M'); // or your desired limit
?>
Also, if you have shared hosting, that might be the problem. However, I do not know enough about your server to answer that fully.
With this code:
for($i = 1; $i <= $times; $i++){
$milliseconds = round(microtime(true) * 1000);
$res = file_get_contents($url);
$milliseconds2 = round(microtime(true) * 1000);
$milisecondsCount = $milliseconds2 - $milliseconds;
echo 'miliseconds=' . $milisecondsCount . ' ms' . "\n";
}
I get this output:
miliseconds=1048 ms
miliseconds=169 ms
miliseconds=229 ms
miliseconds=209 ms
....
But with sleep:
for($i = 1; $i <= $times; $i++){
$milliseconds = round(microtime(true) * 1000);
$res = file_get_contents($url);
$milliseconds2 = round(microtime(true) * 1000);
$milisecondsCount = $milliseconds2 - $milliseconds;
echo 'miliseconds=' . $milisecondsCount . ' ms' . "\n";
sleep(2);
}
This:
miliseconds=1172 ms
miliseconds=1157 ms
miliseconds=1638 ms
....
So what is happening here ?
My questions:
1) Why don't you test this yourself by using clearstatcache and checking the time signatures with and without using it?
2) Your method of testing is terrible, as a first fix - have you tried swapping so that the "sleep" reading function plays first rather than second?
3) How many iterations of your test have you done? If it's less than 10,000, then I suggest you repeat your tests to be sure to identify, firstly, the average delay (or lack thereof), and then what makes you think that this is caused specifically by caching? (A sketch of such a test follows this list.)
4) What are the specs. of your machine, your RAM and free and available memory upon each iteration?
5) Is your server live? Are you able to remove outside services as possible causes of time disparity? (such as anti-virus, background processes loading, server traffic, etc. etc.)?
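Here is a minimal sketch of the kind of fairer test I mean (my illustration, with a placeholder URL): interleave the plain and after-sleep requests and average over many iterations, so warm-up effects don't all land on one variant.
<?php
$url        = 'http://example.com/file.html'; // placeholder URL
$iterations = 100; // raise this for more stable averages
$plain = 0.0;
$slept = 0.0;

for ($i = 0; $i < $iterations; $i++) {
    clearstatcache(); // rule out stat-cache effects, as suggested in point 1

    $t = microtime(true);
    file_get_contents($url);
    $plain += microtime(true) - $t;

    sleep(2); // the pause under suspicion

    $t = microtime(true);
    file_get_contents($url);
    $slept += microtime(true) - $t;
}

printf("avg without prior sleep: %.1f ms\n", 1000 * $plain / $iterations);
printf("avg after a 2s sleep:    %.1f ms\n", 1000 * $slept / $iterations);
?>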
My answer:
Short answer: no, file_get_contents does not use a cache.
However, operating system level caches may cache recently used files to
make for quicker access, but I wouldn't count on those being
available.
Calling a file reference multiple times will only read the file
from disk a single time, subsequent calls will read the value from the
static variable within the function. Note that you shouldn't count on
this approach for large files or ones used only once per page, since
the memory used by the static variable will persist until the end of
the request. Keeping the contents of files unnecessarily in static
variables is a good way to make your script a memory hog.
Quoted from This answer.
For remote (non-local-filesystem) files, there are so many possible causes of variable delays that file_get_contents caching is really a long way down the list of options.
As you claim to be connecting using a localhost/ reference structure, I would hazard (but am not certain) that your server will be using various firewall and inspection techniques to check the incoming request, which will add a large variable component to the time taken.
Concatenate a random query string with the URL, e.g.:
For eg: $url = 'http://example.com/file.html?r=' . rand(0, 9999);
I'm making a little benchmark class to display page load time and memory usage.
Load time is already working, but when I display the memory usage, it doesn't change.
Example:
$conns = array();
ob_start();
benchmark::start();
$conns[] = mysql_connect('localhost', 'root', '');
benchmark::stop();
ob_flush();
uses the same memory as
$conns = array();
ob_start();
benchmark::start();
for($i = 0; $i < 1000; $i++)
{
$conns[] = mysql_connect('localhost', 'root', '');
}
benchmark::stop();
ob_flush();
I'm using memory_get_usage(true) to get the memory usage in bytes.
memory_get_usage(true) will show the amount of memory allocated by the php engine, not actually used by the script. It's very possible that your test script hasn't required the engine to ask for more memory.
For a test, grab a large(ish) file and read it into memory. You should see a change then.
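For example, a quick check along those lines (the file path is a placeholder) could look like this:
<?php
// Reading a large(ish) file forces the engine to request more memory,
// which memory_get_usage(true) will then report.
$before = memory_get_usage(true);

$data = file_get_contents('/path/to/some/large/file.bin'); // placeholder path

$after = memory_get_usage(true);
printf("engine memory before: %d bytes, after: %d bytes, delta: %d bytes\n",
    $before, $after, $after - $before);

// For comparison, real_usage = false reports what the script itself is using.
printf("script-level usage: %d bytes\n", memory_get_usage(false));
?>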
I've successfully used memory_get_usage(true) to track the memory usage of web crawling scripts, and it's worked fine (since the goal was to slow things down before hitting the system memory limit). The one thing to remember is that it doesn't change based on actual usage, it changes based on the memory requested by the engine. So what you end up seeing is sudden jumps instead of slowly growing (or shrinking) values.
If you set the real_usage flag to false, you may be able to see very small memory changes - however, this won't help you monitor the true amount of memory php is requesting from the system.
(Update: To be clear the difference I describe is between memory used by the variables of your script, compared to the memory the engine requested to run your script. All the same script, different way of measuring.)
I'm no Guru in PHP's internals, but I could imagine an echo does not affect the amount of memory used by PHP, as it just outputs something to the client.
It could be different if you enable output buffering.
The following should make a difference:
$result = null;
benchmark::start()
for($i = 0; $i < 10000; $i++)
{
$result.='test';
}
benchmark::stop();
Look at:
for($i = 0; $i < 1000; $i++)
{
$conns[] = mysql_connect('localhost', 'root', '');
}
You could have looped to 100,000 and nothing would have changed; it's the same connection. No resources are allocated for it because the linked list remembering them never grew. Why would it grow? There's already (presumably) a valid handle at $conns[0]. It won't make a difference in memory_get_usage(). You did test $conns[15] to see if it worked, yes?
Can root@localhost have multiple passwords? No. Why would PHP bother to open another connection just because you told it to? (tongue in cheek)
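If you want each iteration to really allocate a connection, mysql_connect() takes a fourth parameter, $new_link; a hypothetical tweak to the benchmark (not the original code) would be:
$conns = array();
ob_start();
benchmark::start();
for ($i = 0; $i < 1000; $i++)
{
    // passing true as $new_link stops PHP from reusing the identical connection
    $conns[] = mysql_connect('localhost', 'root', '', true);
}
benchmark::stop();
ob_flush();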
I suggest running the same thing via CLI through Valgrind to see the actual heap usage:
valgrind /usr/bin/php -f foo.php .. or something similar. At the bottom, you'll see what was allocated, what was freed and garbage collection at work.
Disclaimer: I do know my way around PHP internals, but I am no expert in that deliberately obfuscated maze written in C that Zend calls PHP.
echo won't change the allocated number of bytes (unless you use output buffers).
The $i variable is unset after the for loop, so it's not changing the number of allocated bytes either.
Try an output buffering example:
ob_start();
benchmark::start();
for($i = 0; $i < 10000; $i++)
{
echo 'test';
}
benchmark::stop();
ob_flush();
I wonder if anyone can help me out with a little cron issue I am experiencing.
The problem is that the load can spike up to 5 and the CPU usage can jump to 40% on a dual-core Xeon L5410 @ 2.33GHz with 356MB RAM, and I'm not sure where I should be tweaking the code, and in which way, to prevent that. Code sample below:
//Note $productFile can be 40Mb .gz compressed, 700Mb uncompressed (xml text file)
if (file_exists($productFile)) {
$fResponse = gzopen($productFile, "r");
if ($fResponse) {
while (!gzeof($fResponse)) {
$sResponse = "";
$chunkSize = 10000;
while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
$sResponse .= gzgets($fResponse, 4096);
}
$new_page .= $sResponse;
$sResponse = "";
$thisOffset = 0;
unset($matches);
if (strlen($new_page) > 0) {
//Emptying
if (!(strstr($new_page, "<product "))) {
$new_page = "";
}
while (preg_match("/<product [^>]*>.*<\/product>/Uis", $new_page, $matches, PREG_OFFSET_CAPTURE, $thisOffset)) {
$thisOffset = $matches[0][1];
$thisLength = strlen($matches[0][0]);
$thisOffset = $thisOffset + $thisLength;
$new_page = substr($new_page, $thisOffset-1);
$thisOffset = 0;
$new_page_match = $matches[0][0];
//- Save collected data here -//
}
}
}//End while loop
gzclose($fResponse);
}
}
Regarding $chunkSize: should it be as small as possible to keep the memory usage down and ease the regular expression, or should it be larger to avoid the code taking too long to run?
With 40,000 matches the load/CPU spikes. So does anyone have any advice on how to manage large feed uploads via cron?
Thanks in advance for your help.
You have at least two problems. The first is you're trying to decompress the entire 700 MB file into memory. In fact, you're doing this twice.
while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
$sResponse .= gzgets($fResponse, 4096);
}
$new_page .= $sResponse;
Both $sResponse and $new_page will hold a string that will eventually contain the entire 700 MB file. So that's 1.4 GB of memory you're eating up by the time the script finishes running, not to mention the cost of string concatenation (while PHP handles strings better than other languages, there are limits to what mutable vs. non-mutable will get you).
The second problem is you're running a regular expression over the increasingly large string in $new_page. This will put increased load on the server as $new_page gets larger and larger.
The easiest way to solve your problems is to split up the tasks.
Decompress the entire file to disk before doing any processing.
Use a stream-based XML parser, such as XMLReader or the old SAX event-based parser (see the sketch after these suggestions).
Even with a stream/event based parser, storing the results in memory may end up eating up a lot of ram. In that case you'll want to take each match and store it on disk/in-a-database.
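As a concrete illustration of the XMLReader suggestion (a sketch only, with a placeholder path, not the original cron code), you can stream <product> elements straight out of the .gz file without ever holding the whole document in a string:
<?php
$productFile = '/path/to/feed.xml.gz'; // placeholder path

$reader = new XMLReader();
// The compress.zlib:// stream wrapper lets XMLReader read the gzip file directly.
$reader->open('compress.zlib://' . $productFile);

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'product') {
        // readOuterXml() returns just this <product>...</product> fragment.
        $productXml = $reader->readOuterXml();
        //- Save collected data here (disk / database), then move on -//
    }
}
$reader->close();
?>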
Since you said you're using LAMP, may I suggest an answer to one of my own questions: Suggestions/Tricks for Throttling a PHP script.
He suggests using the nice command on the offending script to lower the chances of it bogging down the server.
An alternative would be to profile the script and see where any bottlenecks are. I would recommend xdebug and kcachegrind or webcachegrind. There are countless questions and websites available to help you setup script profiling.
You may also want to look at PHP's SAX event-based XML parser -
http://uk3.php.net/manual/en/book.xml.php
This is good for parsing large XML files (we use it for XML files of a similar size to the ones you are dealing with) and it does a pretty good job. No need for regexes then :)
You'd need to uncompress the file first before processing it.
Re Alan: the script will never hold 700MB in memory, as it looks like $sResponse is cleared immediately after it reaches $chunkSize and has been added to $new_page,
$new_page .= $sResponse;
$sResponse = "";
and $new_page is shortened (via substr) once each match is found, and cleared if there are no possible matches, for each $chunkSize chunk of data.
$new_page = substr($new_page, $thisOffset-1);
if (!(strstr($new_page, "<product "))) {
$new_page = "";
}
Although I can't say I can see where the actual problem lies.