PHP Google Cloud Vision API: annotate instantly floods the memory

I use Cloud Vision to annotate documents with DOCUMENT_TEXT_DETECTION, and I only use the word data.
The annotate command returns a lot of information for each letter/symbol (languages, vertices, breaks, text, confidence, ...), which adds up to a lot of memory usage. Running annotate on a four-page document¹ returns over 100MB of data, which exceeds my PHP memory limit and causes the script to crash. Getting only the word data would probably be about five times smaller.
To be clear: I load the VisionClient, set up the image, and run the annotate() command, which directly returns a 100MB variable, crashing at that point before I get the chance to do any cleaning.
$vision = new VisionClient([/* key & id here */]);
$image = $vision->image(file_get_contents($imagepath), ['DOCUMENT_TEXT_DETECTION']);
$annotation = $vision->annotate($image); // Crash at that point trying to allocate too much memory.
Is there a way to not request the entirety of the data? The documentation on annotate seems to indicate that it's possible to annotate only part of the picture, but not to drop the symbol data.
At a more fundamental level, am I doing something wrong here regarding memory management in general?
Thanks
Edit: Just realized I also need to store the data in a file, which I do using serialize()... which doubles the memory usage when run, even if I do $annotation = serialize($annotation) to avoid having two variables. So I'd actually need 200MB per user.
¹ Though this is related to the amount of text rather than the number of pages.

Dino,
When dealing with large images, I would highly recommend uploading your image to Cloud Storage and then running the annotation request against the image in a bucket. This way you'll be able to take advantage of the resumable or streaming protocols available in the Storage library to upload your object with more reliability and with less memory consumption. Here's a quick snippet of what this could look like using the resumable uploader:
use Google\Cloud\Core\Exception\GoogleException;
use Google\Cloud\Storage\StorageClient;
use Google\Cloud\Vision\VisionClient;
$storage = new StorageClient();
$bucket = $storage->bucket('my-bucket');
$imageName = 'my-image.png';
$uploader = $bucket->getResumableUploader(
    fopen('/path/to/local/image.png', 'r'),
    [
        'name' => $imageName,
        'chunkSize' => 262144 // This will read data in smaller chunks, freeing up memory
    ]
);
try {
    $uploader->upload();
} catch (GoogleException $ex) {
    $resumeUri = $uploader->getResumeUri();
    $uploader->resume($resumeUri);
}
$vision = new VisionClient();
$image = $vision->image($bucket->object($imageName), [
    'FACE_DETECTION'
]);
$vision->annotate($image);
https://googlecloudplatform.github.io/google-cloud-php/#/docs/google-cloud/v0.63.0/storage/bucket?method=getResumableUploader

Related

How can I get the file count in a bucket to avoid a memory leak?

I'm using the AWS PHP SDK, and I need to avoid a memory leak when I fetch many files from S3.
I want to set a limit: if a bucket has more than 50k files, I want to throw an exception. Does S3 have functionality to get the count of files in a bucket/prefix before I fetch all the files from S3?
My current solution looks like this, but it's bad:
$documents = $driver->client->getPaginator('ListObjects', $arguments)
    ->search('Contents[].Key');
if (iterator_count($documents) > $limit) { // but this way all docs end up in memory
    throw new Exception("We exceeded the limit");
}
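One option (not from the original thread) is to count keys page by page and stop as soon as the limit is exceeded, instead of materializing every key. A minimal sketch, assuming the AWS SDK for PHP v3 and a hypothetical bucket name; each ListObjectsV2 page reports its own KeyCount, so at most one page of up to 1,000 keys is processed at a time:
use Aws\S3\S3Client;
// $client could also be the existing $driver->client from above.
$client = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$limit = 50000;
$count = 0;
$pages = $client->getPaginator('ListObjectsV2', ['Bucket' => 'my-bucket']);
foreach ($pages as $page) {
    // KeyCount is the number of keys in this page only (at most 1000).
    $count += $page['KeyCount'];
    if ($count > $limit) {
        throw new Exception("We exceeded the limit");
    }
}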

Run out of memory writing files to zip with flysystem

I'm programming a tool that gathers images uploaded by a user into a zip archive. For this I came across the ZipArchiveAdapter from Flysystem, which seems to do a good job.
I'm encountering an issue with the memory limit when the number of files in the zip archive goes into the thousands.
When the number of images for a user goes beyond 1,000, the process usually fails because the available memory is exhausted. To get to the point where it handles most users with fewer than 1,000 images I've increased the memory limit to 4GB, but increasing it beyond that is not really an option.
Simplified code at this point:
<?php
use League\Flysystem\Filesystem;
use League\Flysystem\ZipArchive\ZipArchiveAdapter;
use League\Flysystem\Memory\MemoryAdapter;
class User {
    // ... Other user code
    public function createZipFile()
    {
        $tmpFile = tempnam('/tmp', "zippedimages_");
        $download = new Filesystem(new ZipArchiveAdapter($tmpFile));
        if ($this->getImageCount()) {
            foreach ($this->getImages() as $image) {
                $path_in_zip = "My Images/{$image->category->title}/{$image->id}_{$image->image->filename}";
                $download->write($path_in_zip, $image->image->getData());
            }
        }
        $download->getAdapter()->getArchive()->close();
        return $tmpFile;
        // Upload zip to s3-storage
    }
}
So my questions:
a) Is there a way to have Flysystem write the zip file to disk "on the go"? Currently it stores the entire zip in memory and only writes to disk when the object is destroyed.
b) Should I use another library that would be better suited for this?
c) Should I take another approach here? For example, having the user download multiple smaller zips instead of one large zip. (Ideally I want them to download just one file regardless.)
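Regarding (a) and (b): one common workaround, not from the original post, is to skip Flysystem for the archive step and use PHP's built-in ZipArchive directly, adding files by path rather than by content. ZipArchive only records the file references and reads them when close() is called, so the image data isn't accumulated in memory. A rough sketch, reusing the getImages()/getData() accessors from the question (for very large archives you may still need to close and re-open in batches):
$tmpFile = tempnam('/tmp', 'zippedimages_');
$zip = new ZipArchive();
$zip->open($tmpFile, ZipArchive::CREATE | ZipArchive::OVERWRITE);
$tmpImages = [];
foreach ($this->getImages() as $image) {
    $pathInZip = "My Images/{$image->category->title}/{$image->id}_{$image->image->filename}";
    // Stage the image on disk and add it by path; ZipArchive reads the
    // file only when close() runs, so the raw bytes of every image are
    // never held in memory at the same time.
    $tmpImage = tempnam('/tmp', 'img_');
    file_put_contents($tmpImage, $image->image->getData());
    $zip->addFile($tmpImage, $pathInZip);
    $tmpImages[] = $tmpImage;
}
$zip->close(); // the archive is actually written here
array_map('unlink', $tmpImages); // staged copies are only needed until close()
return $tmpFile;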

How do I upload big (video) files in streams to AWS S3 with Laravel 5 and filesystem?

I want to upload a big video file to my AWS S3 bucket. After a good many hours, I finally managed to configure my php.ini and nginx.conf files so they allow bigger files.
But then I got a "Fatal Error: Allowed Memory Size of XXXXXXXXXX Bytes Exhausted". After some time I found out that larger files should be uploaded with streams using fopen(), fwrite(), and fclose().
Since I'm using Laravel 5, the filesystem takes care of much of this, except that I can't get it to work.
My current ResourceController#store looks like this:
public function store(ResourceRequest $request)
{
    /* Prepare data */
    $resource = new Resource();
    $key = 'resource-'.$resource->id;
    $bucket = env('AWS_BUCKET');
    $filePath = $request->file('resource')->getRealPath();
    /* Open & write stream */
    $stream = fopen($filePath, 'w');
    Storage::writeStream($key, $stream, ['public']);
    /* Store entry in DB */
    $resource->title = $request->title;
    $resource->save();
    /* Success message */
    session()->flash('message', $request->title . ' uploadet!');
    return redirect()->route('resource-index');
}
But now I get this long error:
CouldNotCreateChecksumException in SignatureV4.php line 148:
A sha256 checksum could not be calculated for the provided upload body, because it was not seekable. To prevent this error you can either 1) include the ContentMD5 or ContentSHA256 parameters with your request, 2) use a seekable stream for the body, or 3) wrap the non-seekable stream in a GuzzleHttp\Stream\CachingStream object. You should be careful though and remember that the CachingStream utilizes PHP temp streams. This means that the stream will be temporarily stored on the local disk.
So I am currently completely lost. I can't figure out if I'm even on the right track. Here are the resources I'm trying to make sense of:
AWS SDK guide for PHP: Stream Wrappers
AWS SDK introduction on stream wrappers
Flysystem original API on stream wrappers
And just to confuse me even more, there seems to be another way to upload large files other than streams: the so-called "multipart" upload. I actually thought that was what the streams were all about...
What is the difference?
I had the same problem and came up with this solution.
Instead of using
Storage::put('file.jpg', $contents);
which of course ran into an "out of memory" error, I used this method:
use Aws\S3\MultipartUploader;
use Aws\Exception\MultipartUploadException;
use Illuminate\Support\Facades\Config;
use Illuminate\Support\Facades\Storage;
// ...
public function uploadToS3($fromPath, $toPath)
{
    $disk = Storage::disk('s3');
    $uploader = new MultipartUploader($disk->getDriver()->getAdapter()->getClient(), $fromPath, [
        'bucket' => Config::get('filesystems.disks.s3.bucket'),
        'key' => $toPath,
    ]);
    try {
        $result = $uploader->upload();
        echo "Upload complete";
    } catch (MultipartUploadException $e) {
        echo $e->getMessage();
    }
}
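For context, a call site for this helper might look like the following; this is adapted from the question's controller rather than taken from the original answer:
// e.g. inside ResourceController@store, after saving the model
$this->uploadToS3(
    $request->file('resource')->getRealPath(), // local temp path of the uploaded file
    'resource-' . $resource->id                // S3 object key
);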
Tested with Laravel 5.1
Here are the official AWS PHP SDK docs:
http://docs.aws.amazon.com/aws-sdk-php/v3/guide/service/s3-multipart-upload.html
The streaming part applies to downloads.
For uploads you need to know the content size; for large files, multipart uploads are the way to go.

Uploading a large object to Cloud Files returns a different md5

So I have this code, and I'm trying to upload large files to Rackspace as per https://github.com/rackspace/php-opencloud/blob/master/docs/userguide/ObjectStore/Storage/Object.md:
$src_path = 'pathtofile.zip'; // about 700MB
$md5_checksum = md5_file($src_path); // result is f210775ccff9b0e4f686ea49ac4932c2
$trans_opts = array(
    'name' => $md5_checksum,
    'concurrency' => 6,
    'partSize' => 25000000
);
$trans_opts['path'] = $src_path;
$transfer = $container->setupObjectTransfer($trans_opts);
$response = $transfer->upload();
This allegedly uploads the file just fine.
However, when I try to download the file as recommended here, https://github.com/rackspace/php-opencloud/blob/master/docs/userguide/ObjectStore/USERGUIDE.md:
$name = 'f210775ccff9b0e4f686ea49ac4932c2';
$object = $container->getObject($name);
$objectContent = $object->getContent();
$pathtofile = 'destinationpathforfile.zip';
$objectContent->rewind();
$stream = $objectContent->getStream();
file_put_contents($pathtofile, $stream);
$md5 = md5_file($pathtofile);
The result of md5_file ends up being different from 'f210775ccff9b0e4f686ea49ac4932c2'... moreover, the downloaded zip ends up being unopenable/corrupted.
What did I do wrong?
It's recommended that you only use multipart uploads for files over 5GB. For files under this threshold, you can use the normal uploadObject method.
When you use the transfer builder, it segments your large file into smaller segments (you provide the part size) and concurrently uploads each one. When this process has finished, a manifest file is created which contains a list of all these segments. When you download the manifest file, it collates them all together, effectively pretending to be the big file itself. But it's really just an organizer.
To get back to answering your question, the ETag header of a manifest file is not calculated the way you might think. What you're currently doing is taking the MD5 checksum of the entire 700MB file and comparing it against the MD5 checksum of the manifest file. But these aren't comparable. To quote the documentation:
the ETag header is calculated by taking the ETag value of each segment, concatenating them together, and then returning the MD5 checksum of the result.
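To illustrate (this helper is not part of the original answer), the expected manifest ETag could be recomputed locally using the same 25,000,000-byte part size passed in the upload options, assuming segments are split strictly at that boundary:
// Hypothetical helper: MD5 each segment, concatenate the hex digests,
// and MD5 the result; this is what the manifest's ETag should match.
function expectedManifestEtag($path, $partSize = 25000000)
{
    $handle = fopen($path, 'rb');
    $concatenated = '';
    while (!feof($handle)) {
        $segment = stream_get_contents($handle, $partSize);
        if ($segment === false || $segment === '') {
            break;
        }
        $concatenated .= md5($segment); // ETag of this segment
    }
    fclose($handle);
    return md5($concatenated);
}
// Compare this value against the ETag returned for the manifest object,
// not against md5_file() of the whole 700MB file.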
There are also downsides to using this DLO operation that you need to be aware of:
End-to-end integrity is not assured. The eventual consistency model means that although you have uploaded a segment object, it might not appear in the container list immediately. If you download the manifest before the object appears in the container, the object will not be part of the content returned in response to a GET request.
If you think there's been an error in transmission, perhaps it's because an HTTP request failed along the way. You can use retry strategies (using the backoff plugin) to retry failed requests.
You can also turn on HTTP logging to check every network transaction, which helps with debugging. Be careful, though: using the above will echo the HTTP request body (>25MB) to STDOUT. You might want to use this instead:
use Guzzle\Plugin\Log\LogPlugin;
use Guzzle\Log\ClosureLogAdapter;
$stream = fopen('php://output', 'w');
$logSubscriber = new LogPlugin(new ClosureLogAdapter(function ($m) use ($stream) {
    fwrite($stream, $m . PHP_EOL);
}), "# Request:\n{url} {method}\n\n# Response:\n{code} {phrase}\n\n# Connect time: {connect_time}\n\n# Total time: {total_time}", false);
$client->addSubscriber($logSubscriber);
As you can see, you're using a template to dictate what's outputted. There's a full list of template variables here.

Estimate required memory for libGD operation

Before attempting to resize an image in PHP using libGD, I'd like to check whether there's enough memory available for the operation, because an "out of memory" error completely kills the PHP process and can't be caught.
My idea was that I'd need 4 bytes of memory for each pixel (RGBA) in the original and in the new image:
// check available memory
if (!is_mem_available(($from_w * $from_h * 4) + ($to_w * $to_h * 4))) {
    return false;
}
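For reference, is_mem_available() isn't defined in the post; a minimal sketch of such a helper, assuming memory_limit is configured in plain bytes or with a K/M/G suffix, might look like this:
// Hypothetical helper: returns true if roughly $bytes of additional
// memory appear to be available under the current memory_limit.
function is_mem_available($bytes)
{
    $limit = trim(ini_get('memory_limit'));
    if ($limit === '' || (int) $limit === -1) {
        return true; // no limit configured
    }
    $value = (int) $limit;
    switch (strtoupper(substr($limit, -1))) {
        case 'G': $value *= 1024; // deliberate fall-through
        case 'M': $value *= 1024;
        case 'K': $value *= 1024;
    }
    return (memory_get_usage(true) + $bytes) < $value;
}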
Tests showed that this is much more memory than the library really seems to use. Can anyone suggest a better method?
You should check this comment out, and also this one.
I imagine it must be possible to find out GD's peak memory usage by analyzing imagecopyresampled's source code, but this may be hard, require extended profiling, vary from version to version, and be generally unreliable.
Depending on your situation, a different approach comes to mind: when resizing an image, call another PHP script on the same server, but over HTTP:
$file = urlencode("/path/to/file");
$result = file_get_contents("http://example.com/dir/canary.php?file=$file&width=1000&height=2000");
(sanitizing the file parameter, obviously)
If that script fails with an "out of memory" error, you'll know the image is too large.
If it successfully manages to resize the image, it could return the path to a temporary file containing the resized result. Things would go ahead normally from there.
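A rough sketch of what that helper script could look like; the canary.php name and the file/width/height parameters come from the snippet above, but the script itself is illustrative rather than part of the original answer:
<?php
// canary.php: attempt the resize in a separate PHP process so that an
// out-of-memory failure doesn't kill the calling script.
$file   = $_GET['file'];   // must be validated/sanitized, as noted above
$width  = (int) $_GET['width'];
$height = (int) $_GET['height'];
$src = imagecreatefromstring(file_get_contents($file));
$dst = imagecreatetruecolor($width, $height);
imagecopyresampled($dst, $src, 0, 0, 0, 0, $width, $height, imagesx($src), imagesy($src));
$tmp = tempnam(sys_get_temp_dir(), 'resized_');
imagepng($dst, $tmp);
echo $tmp; // the caller receives the path to the resized image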
