I am using Python 2.7.
I am trying to cat two log files, pull data for specific dates using sed, then compress the result and upload it to S3, all without creating any temp files on the system.
sed_command = "sed -n '/{}/,/{}/p'".format(last_date, last_date)
Flow:
Cat the two files. Example: cat file1 file2
Run the sed manipulation in memory.
Compress the result in memory with zip or gzip.
Upload the compressed file in memory to S3.
I have successfully done this by creating temp files on the system and removing them once the upload to S3 completes, but I could not find a working solution that does everything on the fly without creating any temp files.
Here's the gist of it:
import sys, gzip, cStringIO
import boto.s3.connection, boto.s3.key

conn = boto.s3.connection.S3Connection(aws_key, secret_key)
bucket = conn.get_bucket(bucket_name, validate=True)

# gzip whatever arrives on stdin into an in-memory buffer
buffer = cStringIO.StringIO()
writer = gzip.GzipFile(None, 'wb', 6, buffer)
writer.write(sys.stdin.read())
writer.close()
buffer.seek(0)

# upload the in-memory buffer straight to S3
boto.s3.key.Key(bucket, key_path).set_contents_from_file(buffer)
buffer.close()
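For completeness, here is a rough sketch of driving the whole flow from Python 2.7 itself rather than piping in via stdin, under the same assumptions as above (file1, file2, last_date, aws_key, secret_key, bucket_name and key_path are placeholders): the cat/sed pipeline runs as a subprocess and its output is gzipped in memory before the upload, so nothing ever touches disk.

import subprocess
import gzip
import cStringIO
import boto.s3.connection
import boto.s3.key

# run "cat file1 file2 | sed -n '/.../,/.../p'" and capture the output in memory
sed_command = "sed -n '/{}/,/{}/p'".format(last_date, last_date)
output = subprocess.check_output("cat file1 file2 | " + sed_command, shell=True)

# gzip the filtered log data into an in-memory buffer
buffer = cStringIO.StringIO()
writer = gzip.GzipFile(None, 'wb', 6, buffer)
writer.write(output)
writer.close()
buffer.seek(0)

# upload the buffer directly to S3 without any temp files
conn = boto.s3.connection.S3Connection(aws_key, secret_key)
bucket = conn.get_bucket(bucket_name, validate=True)
boto.s3.key.Key(bucket, key_path).set_contents_from_file(buffer)
buffer.close()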
Kind of a late answer, but I recently published a package that does just that; it's installable via PyPI:
pip install aws-logging-handlers
You can find usage documentation in the project's Git repository.
I have some PDF files uploaded on a remote server. I have a URL for each file, and the PDFs can be downloaded by visiting those URLs.
My question is: I want to merge all the PDF files into a single file, without storing them in a local directory. How can I do that with the Python module 'PyPDF2'?
Please move to pypdf. It's essentially the same as PyPDF2, but the development will continue there (I'm the maintainer of both projects).
Your question is answered in the docs:
https://pypdf.readthedocs.io/en/latest/user/streaming-data.html
Instead of writing to a file, you write to an io.BytesIO stream:
from io import BytesIO

# e.g. writer = PdfWriter()
# ... do what you want to do with the PDFs

with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    data = bytes_stream.read()  # that is now the "bytes" representation
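To tie this back to the original question of merging PDFs fetched from URLs without touching the local disk, here is a minimal sketch using pypdf together with requests; the URL list is a placeholder and error handling is left out:

from io import BytesIO

import requests
from pypdf import PdfReader, PdfWriter

urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]  # placeholders

writer = PdfWriter()
for url in urls:
    # download each PDF into memory and append its pages to the writer
    reader = PdfReader(BytesIO(requests.get(url).content))
    for page in reader.pages:
        writer.add_page(page)

# write the merged PDF to an in-memory stream instead of a file on disk
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    merged_pdf_bytes = bytes_stream.read()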
I am currently working with the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a list of file paths (not directories) to copy via distcp.
I have set up my program to collect an array of filepaths using a function, inject them all into a distcp command, and then run the command:
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"

logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)
This basically just creates one long distcp command with 15-20 different filepaths. Will this work? Should I be using the -cp or -put commands instead of distcp?
(It doesn't make sense to me to copy all these files to their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)
-cp and -put would require you to download the HDFS files, then upload to S3. That would be a lot slower.
I see no immediate reason why this wouldn't work; however, reading over the documentation, I would recommend using the -f flag instead, which reads the list of source paths from a file.
E.g.
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

# write the source paths to a file that distcp can read via -f
src_file = 'to_copy.txt'
with open(src_file, 'w') as f:
    for file in files:
        f.write(f'hdfs://nameservice1{file}\n')

s3_dest = "path/to/bucket"
result = subprocess.run(['hadoop', 'distcp', '-f', src_file, f's3a://{s3_dest}'], check=True)
If all the files were already in their own directory, then you should just copy that directory, like you said.
I am working with Python and Jupyter Notebook, and would like to get files from an S3 bucket into my current Jupyter directory.
I have tried:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')

for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
but I believe this is just reading them, and I would like to save them into this directory. Thank you!
You can use the AWS Command Line Interface (CLI), specifically the aws s3 cp command, to copy files to your local directory.
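For example, something like this (bucket name and prefix are placeholders) copies everything under a prefix into the current directory; aws s3 sync works the same way but only transfers files that are missing or changed:

aws s3 cp s3://your-bucket/your-prefix/ . --recursive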
Late response, but I was struggling with this earlier today and thought I'd throw in my solution. I needed to work with a bunch of PDFs stored on S3 using Jupyter Notebooks on SageMaker.
I used a workaround by downloading the files to my repo, which works a lot faster than uploading them and makes my code reproducible for anyone with access to S3.
Step 1
Create a list of all the objects to be downloaded, then split each element by '/' so that the file name can be extracted for iteration in Step 2.
import awswrangler as wr

objects = wr.s3.list_objects("s3://your-bucket/your-prefix/")  # your S3 URI (placeholder)
objects_list = [obj.split('/') for obj in objects]
Step 2
Make a local folder called data, then iterate through the object list and download each file into it.
import boto3
import os

os.makedirs("./data", exist_ok=True)
s3_client = boto3.client('s3')

for obj in objects_list:
    s3_client.download_file('your-bucket',             # can also use obj[2]
                            'your-prefix/' + obj[-1],  # the prefix is everything after the bucket in your S3 URI
                            './data/' + obj[-1])
That's it! First time answering anything on here, so I hope it's useful to someone.
I need to unzip large zip files (approx. ~10 GB) stored on S3 and put the contents back on S3. I have a memory limitation of 512 MB.
I tried this code and I am getting a MemoryError at the line that loads the entire file content into memory (the io.BytesIO(...) call below). How can I retrieve the zip file in chunks, unzip it, and upload the contents back to S3?
import json
import boto3
import io
import zipfile

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    zip_obj = s3_resource.Object(bucket_name="bucket.name", key="test/big.zip")
    buffer = io.BytesIO(zip_obj.get()["Body"].read())

    z = zipfile.ZipFile(buffer)
    for filename in z.namelist():
        s3_resource.meta.client.upload_fileobj(
            z.open(filename),
            Bucket="bucket.name",
            Key=f'{"test/" + filename}'
        )
Please let me know.
I would suggest launching an EC2 instance from the Lambda function, with a predefined script in the UserData of the RunInstances API call, so that you can specify the location, file name, etc. in the script. In the script you can download the zip from S3, unzip it using Linux commands, and then recursively upload the entire folder back to S3.
You can choose the instance's RAM and disk size according to your zip file size, and afterwards you can stop the instance via the same script.
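If an extra EC2 instance is not an option, another way to stay within the memory limit is to let zipfile seek inside the S3 object instead of downloading it all up front. A rough sketch using the s3fs library (the bucket and key are taken from the question; packaging and error handling are left out):

import zipfile

import boto3
import s3fs

def lambda_handler(event, context):
    # s3fs exposes the S3 object as a seekable file, so ZipFile only reads
    # the central directory plus the member that is currently being copied
    fs = s3fs.S3FileSystem()
    s3_client = boto3.client("s3")

    with fs.open("bucket.name/test/big.zip", "rb") as remote_zip:
        with zipfile.ZipFile(remote_zip) as z:
            for filename in z.namelist():
                # stream each member straight back to S3
                with z.open(filename) as member:
                    s3_client.upload_fileobj(
                        member,
                        Bucket="bucket.name",
                        Key="test/" + filename,
                    )

s3fs and its dependencies would need to be packaged with the Lambda deployment, and very large members will still take time to stream, but nothing is ever held fully in memory.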
I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python module zipfile, but I didn't find a way to load the archive chunk by chunk and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile handles big files correctly, by unzipping the files one by one without loading the full archive.
My problem is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
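The same one-member-at-a-time behaviour is what you get from Python code, so a plain extractall already keeps memory bounded (archive name and target directory are placeholders):

import zipfile

# only the table of contents is read up front; each member is then
# streamed to disk one at a time (via shutil.copyfileobj internally)
with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("target-dir/")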
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # extract only the members whose index falls in [ix_begin, ix_end)
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        print(infos)
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)