Glacier is the low-cost cloud storage solution offered by AWS. Files uploaded to AWS Glacier are called archives, and archives are organized in vaults. To upload a file to AWS Glacier, you first create a vault and then upload files into it. Vault names are important because many of the AWS CLI commands take the vault name as an argument. For a vault you can check its total size (the sum of all uploaded archives) and the number of archives in it, and nothing more: there are no details about the individual archives. AWS Glacier maintains an index of the archives for every vault, but you can only access it by initiating an inventory download job, which can take several hours.
In this article I will show how you can download an archive from AWS Glacier, step by step. First, you need to find the name of the vault from which you will download the archive. Then you will download the index of the vault and, hopefully, identify the archive that you want to download. Finally, you will download the archive in multiple chunks and combine them. So, let’s get started.
Step 1 – Listing the vaults
There is no way to search for your archive; you need to know which vault it is in. A good idea is to maintain your own index of the files uploaded to AWS Glacier. The index could contain the name of the file, a description, the archive id (generated during upload) and whatever else you need to identify your file. To list the vaults, you can use the list-vaults command.
# the command
$ aws glacier list-vaults --account-id -

# the output
{
  "VaultList": [
    {
      "SizeInBytes": 9102027675,
      "VaultARN": "arn:aws:glacier:eu-west-1:123456789:vaults/photos",
      "LastInventoryDate": "2017-01-16T08:06:17.699Z",
      "VaultName": "photos",
      "NumberOfArchives": 2,
      "CreationDate": "2017-01-15T10:59:58.303Z"
    }
  ]
}
The command will print all the vaults associated with your AWS account. Hopefully you can identify the vault in which your archive is located. Note down the name of the vault as it will be used in many of the following steps.
Step 2 – Initiating inventory download
Having the vault name, you can initiate an inventory download job. You need the inventory to identify the archive, or more precisely, to get the archive id. If you already have it (in your own index), you can skip this and the next step. Run the command, and wait. Alternatively, you can create an SNS topic that will be notified when the operation has finished.
# the command
$ aws glacier initiate-job \
    --account-id - \
    --vault-name photos \
    --job-parameters '{"Type": "inventory-retrieval"}'

# the output
{
  "location": "/123456789/vaults/photos/jobs/loisaHeJvzYrsM9aewRl",
  "jobId": "loisaHeJvzYrsM9aewRl"
}
Note down the job id; you will need it for the next commands. If you want to check the status of the job, you can use the describe-job command. It took 3h:45m in my case. You can get it faster, but you have to pay more for this.
# the command
$ aws glacier describe-job \
    --account-id - \
    --vault-name photos \
    --job-id loisaHeJvzYrsM9aewRl

# the output when the job hasn't finished yet
{
  "InventoryRetrievalParameters": {
    "Format": "JSON"
  },
  "VaultARN": "arn:aws:glacier:eu-west-1:123456789:vaults/photos",
  "Completed": false,
  "JobId": "loisaHeJvzYrsM9aewRl",
  "Action": "InventoryRetrieval",
  "CreationDate": "2018-12-21T15:32:16.292Z",
  "StatusCode": "InProgress"
}

# the output when done
{
  "CompletionDate": "2018-12-21T19:16:47.637Z",
  "VaultARN": "arn:aws:glacier:eu-west-1:123456789:vaults/photos",
  "InventoryRetrievalParameters": {
    "Format": "JSON"
  },
  "Completed": true,
  "InventorySizeInBytes": 2273,
  "JobId": "loisaHeJvzYrsM9aewRl",
  "Action": "InventoryRetrieval",
  "CreationDate": "2018-12-21T15:32:16.292Z",
  "StatusMessage": "Succeeded",
  "StatusCode": "Succeeded"
}
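Since the job runs for hours, checking describe-job by hand gets tedious. Below is a minimal polling sketch. The function name wait_for_job and the 15-minute default interval are my own choices for illustration; the only thing it relies on is the StatusCode field shown in the output above, read through the AWS CLI's global --query and --output options.

```shell
# Sketch: poll describe-job until the job succeeds or fails.
# wait_for_job is a hypothetical helper, not part of the AWS CLI.
# Arguments: vault name, job id, seconds between checks (default 900).
wait_for_job() {
  local vault=$1 job_id=$2 interval=${3:-900}
  local status
  while true; do
    # Ask only for the StatusCode field, printed as plain text
    status=$(aws glacier describe-job \
      --account-id - \
      --vault-name "$vault" \
      --job-id "$job_id" \
      --query StatusCode \
      --output text) || return 1
    case "$status" in
      Succeeded) return 0 ;;
      Failed) echo "job $job_id failed" >&2; return 1 ;;
    esac
    # Anything else (InProgress) means we keep waiting
    sleep "$interval"
  done
}

# Example call (commented out so the sketch has no side effects):
# wait_for_job photos loisaHeJvzYrsM9aewRl && echo "inventory is ready"
```

The same helper should work for the archive retrieval job in step 4, since describe-job reports StatusCode for both job types.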
Step 3 – Downloading the inventory file
The command for downloading the inventory file is get-job-output. Actually, this command is used for downloading the output of any job. You need the name of the vault, the job id and a file name. The output will be saved in the specified file, in our case inventory.json.
# the command
$ aws glacier get-job-output \
    --account-id - \
    --vault-name photos \
    --job-id loisaHeJvzYrsM9aewRl inventory.json

# the output
{
  "status": 200,
  "acceptRanges": "bytes",
  "contentType": "application/json"
}
Below is what an inventory file looks like. For each archive, the file contains the description that was provided during upload, which can help you identify the archive. Note down the archive id.
{
  "VaultARN": "arn:aws:glacier:eu-west-1:123456789:vaults/photos",
  "InventoryDate": "2017-01-16T08:06:17Z",
  "ArchiveList": [
    {
      "ArchiveId": "VCRQDrz8kryyhV90q1Gq",
      "ArchiveDescription": "2016 photos",
      "CreationDate": "2017-01-15T18:49:17Z",
      "Size": 4534377311,
      "SHA256TreeHash": "0e7ab9f3ad2adb7a40d1..."
    },
    {
      "ArchiveId": "2CTtNlWpf6PdtDIH62Zy",
      "ArchiveDescription": "2017 photos",
      "CreationDate": "2017-01-15T20:53:55Z",
      "Size": 4567584828,
      "SHA256TreeHash": "1fc71814e54d3986bb3d..."
    }
  ]
}
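With only two archives you can read the ids straight from the file, but a bigger vault makes that painful. Here is a small sketch for pulling them out, assuming the jq tool is installed (it is not part of the AWS CLI). The helper names list_archives and find_archive_id are made up for this example; the field names are the ones from the inventory file above.

```shell
# Sketch: query a saved inventory file with jq (assumed to be installed).
# Both helper names are hypothetical.

# Print one line per archive: id, size in bytes, description (tab separated)
list_archives() {
  jq -r '.ArchiveList[] | [.ArchiveId, .Size, .ArchiveDescription] | @tsv' "$1"
}

# Print the id of every archive whose description matches exactly
find_archive_id() {
  jq -r --arg d "$2" \
    '.ArchiveList[] | select(.ArchiveDescription == $d) | .ArchiveId' "$1"
}

# Example calls (commented out):
# list_archives inventory.json
# find_archive_id inventory.json "2017 photos"
```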
Step 4 – Initiating the download archive request
Before initiating the download request, you need to prepare a JSON file (archive_retrierval_request.json) which will be passed as a parameter to the initiate-job command. The file contains the type of the job (archive-retrieval), the archive id and a description. It should look like below.
# the job parameters file content
{
  "Type": "archive-retrieval",
  "ArchiveId": "2CTtNlWpf6PdtDIH62Zy",
  "Description": "Download 2017 photos archive"
}
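If you prefer not to create the file by hand, it can be written from the shell. The sketch below uses a hypothetical helper, write_retrieval_request, that fills in the three fields shown above. It does no JSON escaping, so it assumes the description contains no double quotes or backslashes.

```shell
# Sketch: write the job parameters file for an archive retrieval.
# write_retrieval_request is a hypothetical helper name.
# Arguments: archive id, description, output file.
# Assumption: the description has no double quotes or backslashes,
# because no JSON escaping is done here.
write_retrieval_request() {
  local archive_id=$1 description=$2 outfile=$3
  cat > "$outfile" <<EOF
{
  "Type": "archive-retrieval",
  "ArchiveId": "${archive_id}",
  "Description": "${description}"
}
EOF
}

# Example call (commented out):
# write_retrieval_request 2CTtNlWpf6PdtDIH62Zy \
#   "Download 2017 photos archive" archive_retrierval_request.json
```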
There is one more thing to check: Data Retrieval Settings. There are three retrieval policies there. The Free Tier Only policy didn’t allow me to download my 4GB archive, so I changed the policy to No Retrieval Limit. When the retrieval policy doesn’t allow you to download the archive, the error looks like this: “An error occurred (PolicyEnforcedException) when calling the InitiateJob operation: InitiateJob request denied by current data retrieval policy”. The command for initiating the download archive request is the following:
# the command
$ aws glacier initiate-job \
    --account-id - \
    --vault-name photos \
    --job-parameters file://archive_retrierval_request.json

# the output of the command
{
  "location": "/123456789/vaults/photos/jobs/YBcVmT0ypwMBGjupI6oD",
  "jobId": "YBcVmT0ypwMBGjupI6oD"
}
And you have to wait. It took 3h:47m in my case. If you configured an SNS topic for the vault, you will get a notification when the job has finished. At any time, you can check the status of the job using the describe-job command. When the archive is ready for download, the describe-job output will look like below.
# the command
$ aws glacier describe-job \
    --account-id - \
    --vault-name photos \
    --job-id YBcVmT0ypwMBGjupI6oD

# the output
{
  "CompletionDate": "2018-12-21T23:42:24.157Z",
  "VaultARN": "arn:aws:glacier:eu-west-1:123456789:vaults/photos",
  "RetrievalByteRange": "0-4567584827",
  "SHA256TreeHash": "1fc71814e54d3986bb3d...",
  "Action": "ArchiveRetrieval",
  "JobDescription": "Download 2017 photos archive",
  "ArchiveId": "2CTtNlWpf6PdtDIH62Zy",
  "StatusMessage": "Succeeded",
  "StatusCode": "Succeeded",
  "Completed": true,
  "JobId": "YBcVmT0ypwMBGjupI6oD",
  "Tier": "Standard",
  "ArchiveSHA256TreeHash": "1fc71814e54d3986bb3d...",
  "CreationDate": "2018-12-21T19:56:55.189Z",
  "ArchiveSizeInBytes": 4567584828
}
Step 5 – Downloading the archive
Finally, you are ready to download the archive. If the archive is big (more than 4GB in my case), it is wiser to download it in chunks. You can see the size of the archive either in the inventory file or in the output of the describe-job command. I split the file as follows (1GB chunks):
chunk 1  0-1073741823
chunk 2  1073741824-2147483647
chunk 3  2147483648-3221225471
chunk 4  3221225472-4294967295
chunk 5  4294967296-4567584827
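The ranges above are easy to get wrong by one byte when computed by hand, so here is a small sketch that generates them. The function name chunk_ranges is hypothetical. It takes the archive size in bytes (ArchiveSizeInBytes from describe-job) and a chunk size, and prints inclusive start-end ranges in the form expected after bytes= in the --range option.

```shell
# Sketch: print one inclusive "start-end" byte range per line.
# chunk_ranges is a hypothetical helper name.
# Arguments: total archive size in bytes, chunk size in bytes.
chunk_ranges() {
  local total_size=$1 chunk_size=$2
  local start=0 end
  while [ "$start" -lt "$total_size" ]; do
    end=$(( start + chunk_size - 1 ))
    # The last chunk is usually shorter: clamp it to the final byte
    if [ "$end" -ge "$total_size" ]; then
      end=$(( total_size - 1 ))
    fi
    echo "${start}-${end}"
    start=$(( end + 1 ))
  done
}

# Example call for the archive above with 1GB chunks (commented out):
# chunk_ranges 4567584828 1073741824
```

For the 4567584828-byte archive and 1GB (1073741824-byte) chunks, this produces the same five ranges as the list above.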
To download the archive you have to use the same get-job-output command previously used for downloading the inventory file. You need the vault name, the job id, the range of bytes to retrieve from the output and the file name where the content will be saved.
# the command for downloading the first part
$ aws glacier get-job-output \
    --account-id - \
    --vault-name photos \
    --range bytes=0-1073741823 \
    --job-id YBcVmT0ypwMBGjupI6oD \
    photos.part1

# the output
{
  "status": 206,
  "acceptRanges": "bytes",
  "contentType": "application/octet-stream",
  "checksum": "0b69951bbe9e23faf941...",
  "contentRange": "bytes 0-1073741823/4567584828",
  "archiveDescription": "Download 2017 photos archive"
}
The chunk size that I used turned out to be too big, as the download command failed a couple of times with an error like this: [Errno 10054] An existing connection was forcibly closed by the remote host. Once you have successfully downloaded all the pieces, all you have to do is concatenate them:
$ cat photos.part1 \
      photos.part2 \
      photos.part3 \
      photos.part4 \
      photos.part5 > photos.zip
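Because the downloads can fail with dropped connections, it may be worth scripting them with a retry. The sketch below is one way to do it, not something I ran against the archive in this article: download_parts, join_parts and the RETRY_DELAY variable are hypothetical names, while the get-job-output call uses the same options as the command above. join_parts concatenates the parts by number, which avoids the ordering problem a photos.part* wildcard would have with ten or more parts.

```shell
# Sketch: download byte ranges with retries, then join the parts in order.
# download_parts, join_parts and RETRY_DELAY are hypothetical names.

# Arguments: vault, job id, file prefix, max tries per range, then the ranges
download_parts() {
  local vault=$1 job_id=$2 prefix=$3 max_tries=$4
  shift 4
  local part=1 range try
  for range in "$@"; do
    try=1
    # Repeat the same get-job-output call until it exits successfully
    until aws glacier get-job-output \
        --account-id - \
        --vault-name "$vault" \
        --job-id "$job_id" \
        --range "bytes=${range}" \
        "${prefix}.part${part}" > /dev/null; do
      if [ "$try" -ge "$max_tries" ]; then
        echo "giving up on range ${range}" >&2
        return 1
      fi
      try=$(( try + 1 ))
      sleep "${RETRY_DELAY:-5}"
    done
    part=$(( part + 1 ))
  done
}

# Arguments: file prefix, number of parts, output file
join_parts() {
  local prefix=$1 count=$2 output=$3
  local i=1
  : > "$output"
  while [ "$i" -le "$count" ]; do
    cat "${prefix}.part${i}" >> "$output" || return 1
    i=$(( i + 1 ))
  done
}

# Example calls (commented out):
# download_parts photos YBcVmT0ypwMBGjupI6oD photos 3 \
#   0-1073741823 1073741824-2147483647 2147483648-3221225471 \
#   3221225472-4294967295 4294967296-4567584827
# join_parts photos 5 photos.zip
```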
And you are done! Well, there is one more thing you can do, to verify the integrity of every chunk and, in the end, the integrity of the archive. But I will leave it for now, maybe for a future article.
Hi,
I have a question about this command
--job-parameters file://archive_retrierval_request.json
What value should I put after --job-parameters?
And what is the file named “archive_retrierval_request.json”?
Well, a bit late for you I guess, but here are some clarifications anyway:
The job-parameters option expects either an inline JSON string or a JSON file. When you initiate the download request, the JSON that you need to provide is a bit too complex to pass inline, so it’s easier to create a JSON file. The file should contain the type of the job (archive-retrieval in this case), the archive id (from the inventory file) and a description. In my example I saved the file under the name archive_retrierval_request.json and passed it as a parameter to the initiate-job command. I have also updated the post (step 4), so it should be clearer now.