One single distcp command to upload several files to s3 (NO DIRECTORY) - python

I am currently working with the s3a adapter of Hadoop/HDFS to upload a number of files from a Hive database to a particular S3 bucket. I'm getting nervous because I can't find anything online about specifying a bunch of filepaths (not directories) for copying via distcp.
I have set up my program to collect an array of filepaths using a function, inject them all into a distcp command, and then run the command:
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

full_path_files = [f"hdfs://nameservice1{file}" for file in files]
s3_dest = "path/to/bucket"
cmd = f"hadoop distcp -update {' '.join(full_path_files)} s3a://{s3_dest}"
logger.info(f"Preparing to upload Hive data files with cmd: \n{cmd}")
result = subprocess.run(cmd, shell=True, check=True)
This basically just creates one long distcp command with 15-20 different filepaths. Will this work? Should I be using the -cp or -put commands instead of distcp?
(It doesn't make sense to me to copy all these files to their own directory and then distcp that entire directory, when I can just copy them directly and skip those steps...)

-cp and -put would require you to download the HDFS files and then upload them to S3, which would be a lot slower.
I see no immediate reason why this wouldn't work; however, after reading over the documentation, I would recommend using the -f flag instead.
E.g.
files = self.get_files_for_upload()
if not files:
    logger.warning("No recently updated files found. Exiting...")
    return

# write one source URI per line for distcp's -f option
src_file = 'to_copy.txt'
with open(src_file, 'w') as f:
    for file in files:
        f.write(f'hdfs://nameservice1{file}\n')

s3_dest = "path/to/bucket"
# pass the arguments as a list and drop shell=True (a list plus shell=True would only run "hadoop")
result = subprocess.run(['hadoop', 'distcp', '-f', src_file, f's3a://{s3_dest}'], check=True)
If all the files were already in their own directory, then you could just copy that directory, like you said.
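For reference, here is a minimal sketch of that directory variant, under the same assumptions as the question (nameservice1 and the bucket path are placeholders):

import subprocess

# copy an entire HDFS directory to S3 with a single distcp invocation
src_dir = "hdfs://nameservice1/path/to/dir"
s3_dest = "path/to/bucket"
subprocess.run(["hadoop", "distcp", "-update", src_dir, f"s3a://{s3_dest}"], check=True)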

Related

python gzip file in memory and upload to s3

I am using python 2.7...
I am trying to cat two log files, pull out data for specific dates using sed, compress the result, and upload it to S3 without making any temp files on the system:
sed_command = "sed -n '/{}/,/{}/p'".format(last_date, last_date)
Flow:
1. Cat the two files (e.g. cat file1 file2).
2. Run the sed manipulation in memory.
3. Compress the result in memory with zip or gzip.
4. Upload the compressed file in memory to S3.
I have successfully done this by creating temp files on the system and removing them once the upload to S3 completes. I could not find a working solution that does this on the fly without creating any temp files.
Here's the gist of it:
import sys
import gzip
import cStringIO
import boto.s3.connection
import boto.s3.key

conn = boto.s3.connection.S3Connection(aws_key, secret_key)
bucket = conn.get_bucket(bucket_name, validate=True)
# gzip whatever arrives on stdin into an in-memory buffer, then upload it straight to S3
buffer = cStringIO.StringIO()
writer = gzip.GzipFile(None, 'wb', 6, buffer)
writer.write(sys.stdin.read())
writer.close()
buffer.seek(0)
boto.s3.key.Key(bucket, key_path).set_contents_from_file(buffer)
buffer.close()
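For completeness, here is a minimal sketch of wiring the whole flow (cat, sed, in-memory gzip, upload) together from Python; it assumes last_date, aws_key, secret_key, bucket_name and key_path are defined as in the question and answer above, and file1/file2 are placeholders:

import subprocess
import gzip
import cStringIO
import boto.s3.connection
import boto.s3.key

# cat file1 file2 | sed -n '/<date>/,/<date>/p', captured entirely in memory
cat = subprocess.Popen(['cat', 'file1', 'file2'], stdout=subprocess.PIPE)
sed = subprocess.Popen(['sed', '-n', '/{}/,/{}/p'.format(last_date, last_date)],
                       stdin=cat.stdout, stdout=subprocess.PIPE)
cat.stdout.close()
filtered, _ = sed.communicate()

# gzip the filtered text into an in-memory buffer and upload it, no temp files
buffer = cStringIO.StringIO()
writer = gzip.GzipFile(None, 'wb', 6, buffer)
writer.write(filtered)
writer.close()
buffer.seek(0)

conn = boto.s3.connection.S3Connection(aws_key, secret_key)
bucket = conn.get_bucket(bucket_name, validate=True)
boto.s3.key.Key(bucket, key_path).set_contents_from_file(buffer)
buffer.close()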
Kind of a late answer, but I recently published a package that does just that; it's installable via PyPI:
pip install aws-logging-handlers
You can find usage documentation on the project's git repository.

Using gdal in python to produce tiff files from csv files

I have many csv files with this format:
Latitude,Longitude,Concentration
53.833399,-122.825257,0.021957
53.837893,-122.825238,0.022642
....
My goal is to produce GeoTIFF files based on the information within these files (one TIFF file per CSV file), preferably using Python. This was done several years ago on the project I am working on; however, how they did it has since been lost. All I know is that they most likely used GDAL.
I have attempted to do this by researching how to use GDAL, but this has not gotten me anywhere, as resources are limited and I have no experience with the library.
Can someone help me with this?
Here is a little code I adapted for your case. You need to have the GDAL directory with all the *.exe files added to your PATH for it to work (in most cases it's C:\Program Files (x86)\GDAL).
It uses the gdal_grid.exe utility (see the doc here: http://www.gdal.org/gdal_grid.html).
You can modify the gdal_cmd variable to suit your needs.
import subprocess
import os

# your directory with all your csv files in it
dir_with_csvs = r"C:\my_csv_files"

# make it the active directory
os.chdir(dir_with_csvs)

# function to get the csv filenames in the directory
def find_csv_filenames(path_to_dir, suffix=".csv"):
    filenames = os.listdir(path_to_dir)
    return [filename for filename in filenames if filename.endswith(suffix)]

# get the filenames
csvfiles = find_csv_filenames(dir_with_csvs)

# loop through each CSV file
# for each CSV file, make an associated VRT file to be used with the gdal_grid command
# and then run the gdal_grid util in a subprocess instance
for fn in csvfiles:
    vrt_fn = fn.replace(".csv", ".vrt")
    lyr_name = fn.replace('.csv', '')
    out_tif = fn.replace('.csv', '.tiff')
    with open(vrt_fn, 'w') as fn_vrt:
        fn_vrt.write('<OGRVRTDataSource>\n')
        fn_vrt.write('\t<OGRVRTLayer name="%s">\n' % lyr_name)
        fn_vrt.write('\t\t<SrcDataSource>%s</SrcDataSource>\n' % fn)
        fn_vrt.write('\t\t<GeometryType>wkbPoint</GeometryType>\n')
        fn_vrt.write('\t\t<GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude" z="Concentration"/>\n')
        fn_vrt.write('\t</OGRVRTLayer>\n')
        fn_vrt.write('</OGRVRTDataSource>\n')
    gdal_cmd = 'gdal_grid -a invdist:power=2.0:smoothing=1.0 -zfield "Concentration" -of GTiff -ot Float64 -l %s %s %s' % (lyr_name, vrt_fn, out_tif)
    subprocess.call(gdal_cmd, shell=True)
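For clarity, the VRT file the loop writes for a hypothetical points.csv would contain:

<OGRVRTDataSource>
    <OGRVRTLayer name="points">
        <SrcDataSource>points.csv</SrcDataSource>
        <GeometryType>wkbPoint</GeometryType>
        <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude" z="Concentration"/>
    </OGRVRTLayer>
</OGRVRTDataSource>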

Python code, does subprocess work with glob?

The short of it is that I need a program to upload all txt files from a local directory via sftp to a specific remote directory. Running mput *.txt from the sftp command line, while already in the right local directory, does exactly what I'm shooting for.
Here is the code I'm trying. There are no errors when I run it, but no results either: when I sftp to the server and ls the upload directory, it's empty. I may be barking up the wrong tree altogether. I see other solutions like lftp using mget in bash, but I really want this to work with Python. Either way I still have a lot to learn; this is what I've come up with after a few days of reading what some Stack Overflow users suggested and a few libraries that might help. I'm not sure I can do the "for i in allfiles:" with subprocess.
import os
import glob
import subprocess

os.chdir('/home/submitid/Local/Upload')  # change pwd so i can use mget *.txt and glob similarly
pwd = '/Home/submitid/Upload'  # remote directory to upload all txt files to
allfiles = glob.glob('*.txt')  # get a list of txt files in lpwd
target = "user#sftp.com"

sp = subprocess.Popen(['sftp', target], shell=False, stdin=subprocess.PIPE)
sp.stdin.write("chdir %s\n" % pwd)  # change directory to pwd
for i in allfiles:
    sp.stdin.write("put %s\n" % allfiles)  # for each file in allfiles, do a put %filename to pwd
sp.stdin.write("bye\n")
sp.stdin.close()
When you iterate over allfiles, you are not passing the iteration variable to sp.stdin.write, but allfiles itself. It should be
for i in allfiles:
    sp.stdin.write("put %s\n" % i)  # for each file in allfiles, put that file to pwd
You may also need to wait for sftp to authenticate before issuing commands. You could read stdout from the process, or just put some time.sleep delays in your code.
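A minimal sketch of that sleep-based variant, reusing target, pwd and allfiles from the question (it assumes non-interactive, e.g. key-based, authentication):

import time

sp = subprocess.Popen(['sftp', target], shell=False, stdin=subprocess.PIPE)
time.sleep(5)  # crude: give sftp a moment to connect and authenticate before sending commands
sp.stdin.write("chdir %s\n" % pwd)
for i in allfiles:
    sp.stdin.write("put %s\n" % i)
sp.stdin.write("bye\n")
sp.stdin.close()
sp.wait()  # block until the sftp session has finished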
But why not just use scp and build the full command line, then check if it executes successfully? Something like:
result = os.system('scp %s %s:%s' % (' '.join(allfiles), target, pwd))
if result != 0:
    print 'error!'
You don't need to iterate over allfiles;
sp.stdin.write("put *.txt\n")
is enough. You instruct sftp to put all the files at once, instead of one by one.

output file to another directory

I have a Python script that generates an output file in a new directory called test using these two lines:
self.mkdir_p("test") # create directory named "test"
file_out = open("test/"+input,"w")
and the mkdir_p function is as follows:
def mkdir_p(self, path):
    try:
        os.makedirs(path)
    except OSError as exc:
        if exc.errno == errno.EEXIST:
            pass
        else:
            raise
Now, all the files that I would like my script to run on are stored in a directory called storage. My question is: how can I run the script on all the files in storage from my home directory (where my Python script is located), and have all the output saved to the test directory as coded in the script?
I tried a naive approach in bash:
#!/bin/bash
# get the directory name where the files are stored (storage)
in=$1
# for each file in the (storage) directory
for f in $1/*
do
    echo "Processing $f file..."
    ./my_python_script.py $f
done
but it didn't work; it threw IOError: No such file or directory: 'test/storage/inputfile.txt'.
I hope I explained my problem clearly enough.
Thanks in advance
$f is storage/inputfile.txt and Python prepends test/ to that, then complains because test/storage does not exist. Create the directory, or strip the directory part before creating the output file name.
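A minimal sketch of the second option (stripping the directory part), with input_path standing in for the script's input argument:

import os

base = os.path.basename(input_path)   # 'storage/inputfile.txt' -> 'inputfile.txt'
self.mkdir_p("test")                  # create the test directory if needed
file_out = open(os.path.join("test", base), "w")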

Python ftplib - uploading multiple files?

I've googled but I could only find how to upload one file... I'm trying to upload all files from a local directory to a remote FTP directory. Any ideas how to achieve this?
With a loop?
Edit: in the general case, uploading only the files would look like this:
import os

for root, dirs, files in os.walk('path/to/local/dir'):
    for fname in files:
        full_fname = os.path.join(root, fname)
        ftp.storbinary('STOR remote/dir/' + fname, open(full_fname, 'rb'))
Obviously, you need to look out for name collisions if you're just preserving file names like this.
Look at Python-scriptlines required to make upload-files from JSON-Call and next FTPlib-operation: why some uploads, but others not?
Although they start from a different position than your question, the answer at the first URL shows an example construction for uploading a JSON file plus an XML file via ftplib: look at script line 024 and onward.
The second URL covers some other aspects related to uploading multiple files.
It is also applicable to file types other than JSON and XML, obviously with a different 'entry' before the two final sections that define and invoke the FTP_Upload function.
Create an FTP batch file (with the list of commands for the files you need to transfer). Use Python to execute ftp.exe with the "-s" option and pass in the batch file.
This is kludgy, but apparently ftplib does not accept multiple files in its STOR command.
Here is a sample ftp batch file.
OPEN inetxxx
myuser mypasswd
binary
prompt off
cd ~/my_reg/cronjobs/k_load/incoming
mput *.csv
bye
If the above contents were in a file called "abc.ftp", then my ftp command would be
ftp -s:abc.ftp
Hope that helps.
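For completeness, a minimal sketch of driving that batch file from Python with subprocess, reusing the commands from the sample above (host, credentials and paths are placeholders):

import subprocess

commands = [
    'OPEN inetxxx',
    'myuser mypasswd',
    'binary',
    'prompt off',
    'cd ~/my_reg/cronjobs/k_load/incoming',
    'mput *.csv',
    'bye',
]

# write the batch file, then let ftp.exe run it via -s:
with open('abc.ftp', 'w') as f:
    f.write('\n'.join(commands) + '\n')
subprocess.call(['ftp', '-s:abc.ftp'])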
