I'm trying to download a large number of files that all share a common string (DEM) from an FTP server. These files are nested inside multiple directories, for example Adair/DEM* and Adams/DEM*.
The FTP server is located here: ftp://ftp.igsb.uiowa.edu/gis_library/counties/ and requires no username or password.
So, I'd like to go through each county and download the files containing the string DEM.
I've read many questions here on Stack Overflow and the documentation from Python, but I cannot figure out how to use ftplib.FTP() to get into the site without a username and password (which are not required), and I can't figure out how to grep or use glob.glob inside of ftplib or urllib.
Thanks in advance for your help
OK, this seems to work. There may be issues if you try to download a directory or scan something that is not a regular file; exception handling may come in handy to trap wrong file types and skip them (a sketch follows the code below).
glob.glob cannot work since you're on a remote filesystem, but you can use fnmatch to match the names.
Here's the code: it downloads all files matching *DEM* into the TEMP directory, organized by county subdirectory.
import ftplib
import fnmatch
import os
import sys

output_root = os.getenv("TEMP")

fc = ftplib.FTP("ftp.igsb.uiowa.edu")
fc.login()
fc.cwd("/gis_library/counties")

root_dirs = fc.nlst()
for l in root_dirs:
    sys.stderr.write(l + " ...\n")
    #print(fc.size(l))
    dir_files = fc.nlst(l)
    local_dir = os.path.join(output_root, l)
    if not os.path.exists(local_dir):
        os.mkdir(local_dir)
    for f in dir_files:
        if fnmatch.fnmatch(f, "*DEM*"):   # cannot use glob.glob
            sys.stderr.write("downloading " + l + "/" + f + " ...\n")
            local_filename = os.path.join(local_dir, f)
            with open(local_filename, 'wb') as fh:
                fc.retrbinary('RETR ' + l + "/" + f, fh.write)
fc.close()
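A minimal sketch of the exception handling mentioned above (my addition, not part of the tested run): the innermost download in the loop can be wrapped so that entries that are not plain files get skipped instead of aborting the whole transfer. ftplib raises error_perm for permanent errors such as trying to RETR a directory.

            try:
                with open(local_filename, 'wb') as fh:
                    fc.retrbinary('RETR ' + l + "/" + f, fh.write)
            except ftplib.error_perm as e:
                sys.stderr.write("skipping " + l + "/" + f + ": " + str(e) + "\n")
                os.remove(local_filename)   # discard the empty local file left behind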
The answer by @Jean with the local pattern matching is the correct portable solution adhering to FTP standards.
Though, as most FTP servers do support non-standard wildcards in their file listing commands, you can almost always use a simpler and more efficient solution like:
files = ftp.nlst("*DEM*")
for f in files:
    with open(f, 'wb') as fh:
        ftp.retrbinary('RETR ' + f, fh.write)
You can use fsspec's FTPFileSystem for convenient globbing on an FTP server:
import fsspec.implementations.ftp
ftpfs = fsspec.implementations.ftp.FTPFileSystem("ftp.ncdc.noaa.gov")
files = ftpfs.glob("/pub/data/swdi/stormevents/csvfiles/*1985*")
print(files)
contents = ftpfs.cat(files[0])
print(contents[:100])
Result:
['/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1985_c20160223.csv.gz', '/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d1985_c20160223.csv.gz', '/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d1985_c20160223.csv.gz']
b'\x1f\x8b\x08\x08\xcb\xd8\xccV\x00\x03StormEvents_details-ftp_v1.0_d1985_c20160223.csv\x00\xd4\xfd[\x93\x1c;r\xe7\x8b\xbe\x9fOA\xe3\xd39f\xb1h\x81[\\\xf8\x16U\x95\xac\xca\xc5\xacL*3\x8b\xd5\xd4\x8bL\xd2\xb4\x9d'
A nested search also works, for example, nested_files = ftpfs.glob("/pub/data/swdi/stormevents/**1985*"), but it can be quite slow.
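If you want to save the matched files to disk rather than read them into memory, fsspec's generic get() method should work too. A minimal sketch, reusing ftpfs and files from the snippet above; downloading into the current directory is my assumption, not part of the original answer:

import os
for remote_path in files:
    # copy each matched remote file next to the script, keeping its base name
    ftpfs.get(remote_path, os.path.basename(remote_path))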
Related
I would be very grateful indeed for some help for a frustrated and confused Python beginner.
I am trying to create a script that searches a Windows directory containing multiple subdirectories and different file types for a specific single string (a name) in the file contents and, if found, prints the filenames as a list. There are approximately 2000 files in 100 subdirectories, and the files I want to search don't necessarily have the same extension, but they are all, in essence, ASCII files.
I've been trying to do this for many many days but I just cannot figure it out.
So far I have tried using recursive glob coupled with reading the files, but I'm very bewildered. I can successfully print a list of all the files in all subdirectories, but I don't know where to go from here.
import glob
files = []
files = glob.glob('C:\TEMP' + '/**', recursive=True)
print(files)
Can anyone please help me? I am a 72-year-old scientist trying to improve my skills and "automate the boring stuff", but at the moment I'm just losing the will.
Thank you very much in advance to this community.
Great to have you here!
What you have done so far is find all the file paths; now the simplest way is to go through each of the files, read them into memory one by one, and see if the name you are looking for is there.
import glob

files = glob.glob(r'C:\TEMP' + '/**', recursive=True)
target_string = 'John Smit'

# iterate over files
for file in files:
    try:
        # open file for reading
        with open(file, 'r') as f:
            # read the contents
            contents = f.read()
            # check if the contents contain your target string
            if target_string in contents:
                print(file)
    except:
        # skip directories and files that cannot be read as text
        pass
This will print the file path each time it finds the name.
Please also note that I have removed the second line from your code (files = []) because it is redundant; you initialize the list in line 3 anyway.
Hope it helps!
You could do it like this, though I think there must be a better approach.
When you find all files in your directory, you iterate over them and check if they contain that specific string.
import os

for file in files:
    if os.path.isfile(file):
        with open(file, 'r') as f:
            if 'search_string' in f.read():
                print(file)
Here is the code we have developed for a single directory of files:
from os import listdir

with open("/user/results.txt", "w") as f:
    for filename in listdir("/user/stream"):
        with open('/user/stream/' + filename) as currentFile:
            text = currentFile.read()
            if 'checksum' in text:
                f.write('current word in ' + filename[:-4] + '\n')
            else:
                f.write('NOT ' + filename[:-4] + '\n')
I want to loop over all the directories.
Thanks in advance
If you're using UNIX you can use grep:
grep "checksum" -R /user/stream
The -R flag allows for a recursive search inside the directory, following the symbolic links if there are any.
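If you want to kick that off from Python, here is a hedged sketch using the standard library (it assumes a UNIX system with grep on the PATH; -l makes grep print only the names of matching files):

import subprocess

try:
    out = subprocess.check_output(["grep", "-R", "-l", "checksum", "/user/stream"])
    print(out.decode().splitlines())
except subprocess.CalledProcessError:
    print([])   # grep exits with a non-zero status when nothing matches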
My suggestion is to use glob.
The glob module allows you to work with files. In the Unix universe, a directory is (or should be) a file, so glob should be able to help you with your task.
Moreover, you don't have to install anything; glob comes with Python.
Note: for the following code, you will need Python 3.5 or greater.
This should help you out.
import os
import glob

for path in glob.glob('/ai2/data/prod/admin/inf/**', recursive=True):
    # At some point, `path` will be `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`
    if not os.path.isdir(path):
        # Check the `id` of the file
        # Do things with the file
        # If there are files inside `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`,
        # you will be able to access them here
        pass
What glob.glob does is return a possibly-empty list of path names that match pathname. In this case, it will match every file (including directories) in /user/stream/. If these files are not directories, you can do whatever you want with them.
I hope this will help you!
Clarification
Regarding your 3-point comment attempting to clarify the question, especially this part: "we need to put appi dynamically in that path then we need to read all files inside that directory".
No, you do not need to do this. Please read my answer carefully and please read glob documentation.
In this case, it will match every file (including directories) in /user/stream/
If you replace /user/stream/ with /ai2/data/prod/admin/inf/, you will have access to every file in /ai2/data/prod/admin/inf/. Assuming your app ids are 1, 2 and 3, this means you will have access to the following files:
/ai2/data/prod/admin/inf/inf_1_pvt/error
/ai2/data/prod/admin/inf/inf_2_pvt/error
/ai2/data/prod/admin/inf/inf_3_pvt/error
You do not have to specify the id, because you will be iterating over all files. If you do need the id, you can just extract it from the path.
If everything looks like this, /ai2/data/prod/admin/inf/inf_<$APP>_pvt/error, you can get the id by removing /ai2/data/prod/admin/inf/ and taking everything until you encounter _.
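A minimal sketch of that extraction (the inf_<$APP>_pvt naming pattern comes from the question; the split index is an assumption about that pattern):

path = "/ai2/data/prod/admin/inf/inf_2_pvt/error"
tail = path.replace("/ai2/data/prod/admin/inf/", "")   # "inf_2_pvt/error"
app_id = tail.split("_")[1]                            # "2"
print(app_id)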
I am trying to copy and rename some PDFs with absolute paths.
ie: c:\users\andrew\pdf\p.pdf gets copied to c:\users\pdf\ORGp.pdf
Leaving two files in the directory p.pdf and ORGp.pdf
I've been working on this issue for the past hour and I can't seem to nail it.
Is there a more Pythonic way to do it than splitting the string into a list and rejoining it after adding ORG to the last element?
Using Python 2.7 on Windows 8.
Your question is a bit ambiguous, but I will try to answer it anyway.
This is a Python code sample that will copy, under new names, all files under a particular folder specified at the beginning of the script:
import os
import shutil

folder_name = "c:\\users\\andrew\\pdf"

for root_folder, _, file_names in os.walk(folder_name):
    for file_n in file_names:
        new_name = os.path.join(root_folder, "ORG" + file_n)
        old_name = os.path.join(root_folder, file_n)
        print "We will copy at ", new_name, old_name
        shutil.copyfile(old_name, new_name)
This code will copy and rename a list of absolute file paths:
import os
import shutil

files_to_rename = ["c:\\users\\andrew\\pdf\\p.pdf", "c:\\users\\andrew\\pdf2\\p2.pdf"]

for file_full_path in files_to_rename:
    folder_n, file_n = os.path.split(file_full_path)
    new_name = os.path.join(folder_n, "ORG" + file_n)
    print "We will copy at ", new_name, file_full_path
    shutil.copyfile(file_full_path, new_name)
I tested this script on Mac OS with Python 2.7.7, but I think it should also work nicely on Windows.
You can try
import os

# ...some logic...

os.rename(filename, newfilename)
Splitting the string into a list and rejoining (after removing 'andrew' from the list and prefixing 'ORG' to the last element) is quite Pythonic. It's an explicit and obvious way to do it.
You can use the standard str and list methods to do it. However, there are various dedicated file path manipulation functions in the os.path module which you should become familiar with, but the str and list methods are fine when you are sure that all the file names you're processing are sane. os.path also has other useful file-related functions: you can check if a file exists, whether it's a file or a directory, get a file's timestamps, etc.
To actually copy the file once you've generated the new name, use shutil.copyfile(). You may also wish to check first that the file doesn't already exist, using os.path.exists(). Unfortunately, some metadata gets lost in this process, e.g. file owners, as mentioned in the warning in the shutil docs.
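A minimal sketch of that approach (Python 2.7 to match the question; it copies into the same folder with the ORG prefix, matching the "two files in the directory" wording, so adjust the destination folder if you actually want it elsewhere):

import os
import shutil

src = "c:\\users\\andrew\\pdf\\p.pdf"
folder_n, file_n = os.path.split(src)
dst = os.path.join(folder_n, "ORG" + file_n)

if not os.path.exists(dst):         # don't clobber an existing copy
    shutil.copyfile(src, dst)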
This is what I ended up doing to do the rename. I'm not sure how pythonic it is, but it works.
split = fle.split('\\')
print split
pdf = split[len(split) - 1]
pdf = 'ORG%s' % pdf
print pdf
del split[len(split) - 1]
split.append(pdf)
fle1 = '\\'.join(split)
try:
    shutil.copy(fle, fle1)
except:
    print('failed copy')
    return ''
In the Python documentation, it is advised not to extract a tar archive without prior inspection. What is the best way to make sure an archive is safe using the tarfile Python module? Should I just iterate over all the filenames and check whether they contain absolute pathnames?
Would something like the following be sufficient?
import sys
import tarfile

with tarfile.open('sample.tar', 'r') as tarf:
    for n in tarf.getnames():
        if n[0] == '/' or n[0:2] == '..':
            print 'sample.tar contains unsafe filenames'
            sys.exit(1)
    tarf.extractall()
Edit
This script is not compatible with Python versions prior to 2.7 (cf. the with statement and tarfile).
I now iterate over the members:
target_dir = "/target/"
with closing(tarfile.open('sample.tar', mode='r:gz')) as tarf:
for m in tarf:
pathn = os.path.abspath(os.path.join(target_dir, m.name))
if not pathn.startswith(target_dir):
print 'The tar file contains unsafe filenames. Aborting.'
sys.exit(1)
tarf.extract(m, path=tdir)
Almost, although it would still be possible to have a path like foo/../../.
Better would be to use os.path.join and os.path.abspath, which together will correctly handle leading / and ..s anywhere in the path:
target_dir = "/target/" # trailing slash is important
with tarfile.open(…) as tarf:
for n in tarf.names:
if not os.path.abspath(os.path.join(target_dir, n)).startswith(target_dir):
print "unsafe filenames!"
sys.exit(1)
tarf.extractall(path=target_dir)
I've googled but I could only find how to upload one file... and I'm trying to upload all files from a local directory to a remote FTP directory. Any ideas how to achieve this?
With a loop?
Edit: in the universal case, uploading only the files would look like this:
import os

for root, dirs, files in os.walk('path/to/local/dir'):
    for fname in files:
        full_fname = os.path.join(root, fname)
        ftp.storbinary('STOR remote/dir/' + fname, open(full_fname, 'rb'))
Obviously, you need to look out for name collisions if you're just preserving file names like this.
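A minimal sketch of one way around that (my addition, not from the original answer): fold the path relative to the local root into the remote file name, so files with the same name in different subdirectories no longer collide. It reuses the placeholder paths and the already-connected ftp object from the snippet above.

import os

local_root = 'path/to/local/dir'
for root, dirs, files in os.walk(local_root):
    for fname in files:
        full_fname = os.path.join(root, fname)
        # e.g. sub/dir/a.txt -> sub_dir_a.txt
        remote_name = os.path.relpath(full_fname, local_root).replace(os.sep, '_')
        with open(full_fname, 'rb') as fh:
            ftp.storbinary('STOR remote/dir/' + remote_name, fh)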
Look at Python-scriptlines required to make upload-files from JSON-Call and next FTPlib-operation: why some uploads, but others not?
Although it has a different starting position than your question, in the answer to that first URL you can see an example construction for uploading a JSON file plus an XML file via ftplib: look at script line 024 and onwards.
The second URL shows some other aspects related to uploading multiple files.
It is also applicable to file types other than JSON and XML, obviously with a different 'entry' before the two final sections which define and call the FTP_Upload function.
Create an FTP batch file (with a list of files that you need to transfer). Use Python to execute ftp.exe with the "-s" option and pass in the list of files.
This is kludgy, but apparently ftplib does not accept multiple files in its STOR command.
Here is a sample ftp batch file.
OPEN inetxxx
myuser mypasswd
binary
prompt off
cd ~/my_reg/cronjobs/k_load/incoming
mput *.csv
bye
If the above contents were in a file called "abc.ftp", then my ftp command would be:
ftp -s:abc.ftp
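If you want to drive that from Python rather than from a shell, a minimal sketch (assuming Windows with ftp.exe on the PATH and abc.ftp in the current directory):

import subprocess

# run the Windows ftp client against the batch file of transfer commands
subprocess.call(["ftp", "-s:abc.ftp"])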
Hope that helps.