#aws
2020-01-15
viesti08:01:58

Re: AWS S3 big files. I've seen multi-gigabyte object uploads, so single large files can be done, but the size of the object needs to be known beforehand, which might be a reason that libs tend to keep the data in memory

jsyrjala08:01:57

I think that S3 has a limit of 5 GB for a file transfer in one request. If the file is bigger than that then you must use multipart upload.

viesti17:01:01

yeah, I think the limit used to be lower, but got raised at some point

viesti08:01:23

for multipart uploads, the size of each part probably needs to be known beforehand

viesti08:01:57

the Java libs have a nice TransferManager that can do parallel multipart downloads, if the object was uploaded in multipart fashion
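
For reference, a minimal interop sketch of that TransferManager usage, assuming the AWS Java SDK v1 is on the classpath; the bucket, key, and file path are hypothetical:

```clojure
(import '[com.amazonaws.services.s3.transfer TransferManagerBuilder])

;; TransferManager fetches the object in parallel ranged requests
;; when the object was uploaded in multiple parts.
(let [tm       (TransferManagerBuilder/defaultTransferManager)
      download (.download tm "my-bucket" "big-object" (java.io.File. "/tmp/big-object"))]
  (.waitForCompletion download) ; blocks until all parts are done
  (.shutdownNow tm))
```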

viesti08:01:07

would be neat to add such support to aws-api too 🙂

viesti17:01:22

so we need a name for a support lib :)

jsyrjala08:01:28

If I remember correctly, the current aws-api wants to keep the whole file in memory, at least when downloading. The HTTP client used by aws-api does not support streaming. The Amazon Java SDK does not have that limitation.

kirill.salykin08:01:22

Indeed, with the Java SDK it doesn't keep all the content in memory during upload

steveb8n09:01:51

What would get around all this big-file stuff is support for pre-signed upload requests in the aws-api client. I’m hoping to see this soon

Linus Ericsson14:01:29

Ditto. I think the new SDK for Java (from AWS) will solve this, but maybe it’s just wishful thinking.
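
Until aws-api grows support for this, a minimal sketch of presigning with the AWS Java SDK v1 from Clojure, assuming the SDK is on the classpath; bucket and key are hypothetical:

```clojure
(import '[com.amazonaws.services.s3 AmazonS3ClientBuilder]
        '[com.amazonaws HttpMethod])

;; URL that lets whoever holds it PUT the object for the next 15 minutes
(let [s3         (AmazonS3ClientBuilder/defaultClient)
      expiration (java.util.Date. (+ (System/currentTimeMillis) (* 15 60 1000)))]
  (.generatePresignedUrl s3 "my-bucket" "my-key" expiration HttpMethod/PUT))
```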

viesti17:01:56

the bring-your-own HTTP client will help :)

ghadi17:01:47

@viesti @jsyrjala multipart uploads are already supported. Call the various multipart operations with your input split into chunks. Downloading large files is problematic because of no streaming

ghadi17:01:51

CreateMultipartUpload -> UploadPart (many, you can do it concurrently) -> CompleteMultipartUpload

hiredman17:01:11

we have some code at work to do this for staging build artifacts. It reduces over the chunks of a file, starts a future uploading a part for each one, then waits for all the futures to complete and completes the multipart upload. Works great.
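
A minimal sketch of that flow with aws-api, along those lines; the bucket, key, and pre-chunked input are hypothetical, and every part except the last must be at least 5 MB:

```clojure
(require '[cognitect.aws.client.api :as aws])

(def s3 (aws/client {:api :s3}))

(defn upload-multipart
  "chunks is a sequence of byte arrays; each becomes one part."
  [bucket key chunks]
  (let [{:keys [UploadId]} (aws/invoke s3 {:op :CreateMultipartUpload
                                           :request {:Bucket bucket :Key key}})
        ;; one future per part, started eagerly so they upload concurrently
        uploads (doall
                 (map-indexed
                  (fn [i chunk]
                    (future
                      (let [part-number (inc i)
                            {:keys [ETag]}
                            (aws/invoke s3 {:op :UploadPart
                                            :request {:Bucket     bucket
                                                      :Key        key
                                                      :UploadId   UploadId
                                                      :PartNumber part-number
                                                      :Body       chunk}})]
                        {:ETag ETag :PartNumber part-number})))
                  chunks))
        parts (mapv deref uploads)] ; wait for every part to finish
    (aws/invoke s3 {:op :CompleteMultipartUpload
                    :request {:Bucket          bucket
                              :Key             key
                              :UploadId        UploadId
                              :MultipartUpload {:Parts parts}}})))
```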

viesti17:01:33

I've only used multipart download via the Java libs in the past, since I had a Redshift cluster write data to S3 in parallel :)

ghadi17:01:08

there is no multipart download

ghadi17:01:28

you can do byte range requests on S3 objects in parallel, though. Similar effect 🙂
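
A sketch of that parallel ranged download with aws-api; the object size, bucket, and key are hypothetical, and the resulting byte arrays come back in order:

```clojure
(require '[cognitect.aws.client.api :as aws]
         '[clojure.java.io :as io])

(def s3 (aws/client {:api :s3}))

(defn download-ranges
  "Fetch an object as parallel byte-range requests; returns byte arrays in order."
  [bucket key total-size part-size]
  (let [ranges (for [start (range 0 total-size part-size)]
                 [start (min (dec (+ start part-size)) (dec total-size))])
        parts  (doall
                (for [[from to] ranges]
                  (future
                    (let [{:keys [Body]} (aws/invoke s3 {:op :GetObject
                                                         :request {:Bucket bucket
                                                                   :Key    key
                                                                   :Range  (str "bytes=" from "-" to)}})
                          out (java.io.ByteArrayOutputStream.)]
                      (io/copy Body out) ; :Body is an InputStream
                      (.toByteArray out)))))]
    (map deref parts)))
```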

viesti17:01:34

got carried away remembering an old project :)

viesti17:01:43

nice point

ghadi17:01:11

we've thought about some "userspace helpers" for aws-api, like paginators, etc. but so far we're focusing on the raw operations
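
A sketch of what such a userspace paginator could look like; the helper name is made up, and ListObjectsV2 is just one example of a paginated operation:

```clojure
(require '[cognitect.aws.client.api :as aws])

(def s3 (aws/client {:api :s3}))

(defn paginate
  "Lazy sequence of responses, following S3 continuation tokens."
  [client op request]
  (lazy-seq
   (let [resp (aws/invoke client {:op op :request request})]
     (cons resp
           (when-let [token (:NextContinuationToken resp)]
             (paginate client op (assoc request :ContinuationToken token)))))))

;; all keys in a (hypothetical) bucket, across however many pages
(mapcat :Contents (paginate s3 :ListObjectsV2 {:Bucket "my-bucket"}))
```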

ghadi17:01:56

(presigning URLs is in that ballpark too)

viesti17:01:36

if there were 3rd-party "userspace" libs, would it be ok to use aws-api as a name prefix?

ghadi17:01:04

I can't stop you, but viesti.aws-api would be better 🙂

viesti17:01:41

liking that already :)

viesti17:01:16

would have to do some explaining on that particular name though :)